利用 Crossplane 和 OCI 构筑 Prometheus 定制化 TSDB 的声明式交付系统


我们团队对 Prometheus 的一个特定 TSDB 模块进行了深度定制,以优化高基数标签场景下的内存占用。这一改动虽然带来了显著的性能提升,但也引入了一个棘手的工程问题:如何稳定、可靠地将这个定制化的 Prometheus 二进制文件及其配置部署到多个 Kubernetes 集群中?最初的方案是手动的:CI/CD 流水线编译出二进制文件,上传到对象存储,再通过一套复杂的 Ansible Playbook 拉取二进制文件、渲染配置、创建 Kubernetes 资源。整个过程充满了不确定性,版本管理混乱,回滚操作更是如履薄冰。

在真实项目中,这种手动胶水式的运维流程是技术债的温床。每一次部署都需要运维工程师的深度介入,无法实现真正的 GitOps 闭环。我们需要的不是另一个脚本,而是一个平台化的、声明式的解决方案。我们希望能够像声明一个 RDSInstance 或 GKECluster 一样,通过一个简单的 YAML 文件来定义和管理我们的定制版 Prometheus 服务。

# 理想中的目标状态
apiVersion: custom.prometheus.io/v1alpha1
kind: CustomPrometheus
metadata:
  name: prometheus-for-billing-team
spec:
  # OCI artifact tag, pointing to our custom build
  version: "v2.45.0-custom-mem-opt-a1b2c3d"
  replicas: 2
  # User-provided configuration snippet
  scrapeConfig: |
    - job_name: 'node'
      static_configs:
      - targets: ['node-exporter.monitoring.svc.cluster.local:9100']
  # ... other parameters

这个设想将我们引向了 Crossplane。Crossplane 允许我们将任何基础设施或服务(无论是云厂商的 API 还是内部应用)抽象成 Kubernetes 的自定义资源 (CRD)。而另一个关键问题——如何分发这个定制的二进制文件——则通过 OCI (Open Container Initiative) 规范找到了答案。OCI 镜像规范不仅仅是为容器镜像服务的,任何制品(Artifacts),比如 Helm Charts、WASM 模块,甚至是我们的 Go 二进制文件,都可以打包成 OCI Artifact,并存储在任何兼容 OCI 的镜像仓库中。

这套组合拳的威力在于:CI/CD 负责构建和发布 OCI Artifact,Crossplane 负责在 Kubernetes 中解析我们的自定义资源 CustomPrometheus,并动态编排底层资源,包括从镜像仓库拉取指定的 OCI Artifact 版本,并注入到 Pod 中运行。这彻底改变了交付模式。

第一阶段:构建与打包 OCI 制品

核心任务是将编译后的 Prometheus 二进制文件和默认配置文件打包成一个标准的 OCI 制品。我们不把它打包成一个可运行的容器镜像,因为我们希望在运行时能有更大的灵活性,比如动态挂载不同的配置文件。使用 oras CLI 工具可以轻松实现这一点。

首先,是我们的 CI 构建脚本。这里以一个简化的 build.sh 为例,它模拟了编译和准备制品目录的过程。

#!/bin/bash
set -eo pipefail

# --- Configuration ---
# Assume these are passed from the CI environment
# e.g., git describe --tags
APP_VERSION="${1:-v2.45.0-custom-dev}"
# OCI registry URL
REGISTRY_URL="your-registry.io/prometheus-custom"
# Go build flags for optimization
GO_BUILD_FLAGS="-s -w"
# Source code path
PROMETHEUS_SRC_PATH="./prometheus"

# --- Main Logic ---
echo "INFO: Starting build for Prometheus version ${APP_VERSION}..."

# 1. Prepare build environment
BUILD_DIR=$(mktemp -d)
trap 'rm -rf -- "$BUILD_DIR"' EXIT

ARTIFACT_DIR="${BUILD_DIR}/artifact"
mkdir -p "${ARTIFACT_DIR}/bin"
mkdir -p "${ARTIFACT_DIR}/config"

echo "INFO: Build directory: ${BUILD_DIR}"
echo "INFO: Artifact directory: ${ARTIFACT_DIR}"

# 2. Build the custom Prometheus binary
echo "INFO: Building custom Prometheus binary..."
# In a real CI, this would involve cloning a specific git commit
# cd ${PROMETHEUS_SRC_PATH}
# make build
# For demonstration, we just create a mock binary
echo "mock prometheus binary version ${APP_VERSION}" > "${ARTIFACT_DIR}/bin/prometheus"
chmod +x "${ARTIFACT_DIR}/bin/prometheus"
echo "INFO: Binary created at ${ARTIFACT_DIR}/bin/prometheus"

# 3. Prepare default configuration
echo "INFO: Creating default prometheus.yml..."
cat <<EOF > "${ARTIFACT_DIR}/config/prometheus.yml"
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'self'
    static_configs:
      - targets: ['localhost:9090']
EOF

# 4. Package and Push using ORAS (OCI Registry as Storage)
# The target OCI reference combines the registry URL and the version tag
OCI_TARGET="${REGISTRY_URL}:${APP_VERSION}"

echo "INFO: Pushing artifact to ${OCI_TARGET}..."

# oras push <target> <file1>:<mediatype> <file2>:<mediatype> ...
# We define custom media types for our components for clarity.
# Push from inside the artifact dir so that only the relative paths
# (bin/prometheus, config/prometheus.yml) are recorded in the manifest;
# pushing absolute paths would leak the temp dir into the artifact
# and break the expected layout when pulling.
cd "${ARTIFACT_DIR}"
oras push "${OCI_TARGET}" \
  --config /dev/null:application/vnd.custom.config.v1+json \
  "bin/prometheus:application/vnd.custom.binary.v1" \
  "config/prometheus.yml:application/vnd.custom.prometheus.config.v1+yaml"

echo "SUCCESS: Artifact ${OCI_TARGET} pushed successfully."

# Optional: You can inspect the manifest
echo "INFO: Inspecting pushed manifest..."
oras manifest fetch "${OCI_TARGET}"

这个脚本的核心在于 oras push 命令。它将 prometheus 二进制文件和 prometheus.yml 配置文件作为两个不同的“层 (layers)”推送到了 OCI 仓库。这里的坑在于:必须为每个文件指定一个媒体类型 (media type),这对于后续工具识别内容至关重要。因此我们定义了自定义的媒体类型,如 application/vnd.custom.binary.v1。
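推送完成后,可以解析 manifest 来确认每一层的媒体类型和文件名是否符合预期。下面是一个最小示意:真实场景中 JSON 来自 oras manifest fetch 的输出,这里用一段结构相同的样例代替,并假设环境中有 jq 可用。

```shell
# 样例 manifest,结构与 oras manifest fetch 的输出一致(内容为示意)
MANIFEST='{
  "layers": [
    {"mediaType": "application/vnd.custom.binary.v1",
     "annotations": {"org.opencontainers.image.title": "bin/prometheus"}},
    {"mediaType": "application/vnd.custom.prometheus.config.v1+yaml",
     "annotations": {"org.opencontainers.image.title": "config/prometheus.yml"}}
  ]
}'

# 列出每一层的媒体类型与 oras 记录的文件名(title 注解)
echo "$MANIFEST" | jq -r '.layers[] | "\(.mediaType)\t\(.annotations["org.opencontainers.image.title"])"'
```

oras 会把推送时的文件路径记录在每层的 org.opencontainers.image.title 注解中,后续 oras pull 正是据此还原目录结构的。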

第二阶段:定义 Crossplane 声明式 API

有了制品,接下来就是用 Crossplane 创建我们的 CustomPrometheus API。这分为两步:定义 API 的结构(XRD)和实现这个 API 的逻辑(Composition)。

CompositeResourceDefinition (XRD)

XRD 定义了 CustomPrometheus 资源的 spec 字段,相当于为我们的新 API 创建了 schema。

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: customprometheuses.custom.prometheus.io
spec:
  group: custom.prometheus.io
  names:
    kind: CustomPrometheus
    listKind: CustomPrometheusList
    plural: customprometheuses
    singular: customprometheus
  claimNames:
    kind: PrometheusClaim
    listKind: PrometheusClaimList
    plural: prometheusclaims
    singular: prometheusclaim
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              version:
                type: string
                description: "The OCI artifact tag for the custom Prometheus build."
              replicas:
                type: integer
                description: "Number of Prometheus replicas."
                default: 1
              ociRegistry:
                type: string
                description: "The base URL of the OCI registry."
                default: "your-registry.io/prometheus-custom"
              scrapeConfig:
                type: string
                description: "The content of scrape_configs section in prometheus.yml."
            required:
              - version
              - scrapeConfig

这里的 spec 字段清晰地暴露了用户可以配置的参数:version(对应 OCI tag),replicas(副本数),以及 scrapeConfig(允许用户注入自定义的抓取配置)。

Composition

Composition 是真正的实现逻辑,它将 CustomPrometheus 资源映射到一组底层的 Kubernetes 原生资源,如 Deployment、ConfigMap 和 Service。需要注意的是,在 Crossplane v1.x 中,Composition 只能编排 Crossplane 自身管理的资源类型,要创建 Deployment 这类原生资源,实际落地时需要借助 provider-kubernetes 的 Object 资源包装一层;为保持示例可读性,下文省略了这层包装。

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: customprometheus.k8s.custom.prometheus.io
  labels:
    provider: kubernetes
spec:
  writeConnectionSecretsToNamespace: crossplane-system
  compositeTypeRef:
    apiVersion: custom.prometheus.io/v1alpha1
    kind: CustomPrometheus
  resources:
    # 1. ConfigMap for Prometheus configuration
    - name: prometheus-config
      base:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: prometheus-config # This name will be patched
        data:
          prometheus.yml: |
            global:
              scrape_interval: 15s
              evaluation_interval: 15s
            
            scrape_configs:
            # This section will be patched
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-config"
        - fromFieldPath: "spec.scrapeConfig"
          toFieldPath: "data[prometheus.yml]" # Bracket syntax for keys containing a dot
          transforms:
            - type: string
              string:
                # Prepend the base config; %s receives the user's scrapeConfig
                fmt: |
                  global:
                    scrape_interval: 15s
                    evaluation_interval: 15s

                  scrape_configs:
                  - job_name: 'self'
                    static_configs:
                      - targets: ['localhost:9090']
                  %s
          policy:
            fromFieldPath: Required

    # 2. Deployment for running the custom Prometheus
    - name: prometheus-deployment
      base:
        apiVersion: apps/v1
        kind: Deployment
        spec:
          replicas: 1 # This will be patched
          selector:
            matchLabels:
              app: custom-prometheus # This will be patched
          template:
            metadata:
              labels:
                app: custom-prometheus # This will be patched
            spec:
              # This initContainer is the core of the solution
              initContainers:
              - name: oci-artifact-puller
                image: ghcr.io/oras-project/oras:v1.0.0
                command: ["/bin/sh", "-c"]
                args:
                  - |
                    set -ex
                    # The OCI artifact URL is constructed dynamically
                    OCI_URL="${OCI_REGISTRY}:${OCI_TAG}"
                    echo "Pulling from ${OCI_URL}"
                    # Oras pulls files to a target directory
                    oras pull "${OCI_URL}" -o /opt/prometheus-custom
                    echo "Artifacts pulled successfully."
                    ls -l /opt/prometheus-custom
                env:
                - name: OCI_REGISTRY
                  value: "your-registry.io/prometheus-custom" # To be patched
                - name: OCI_TAG
                  value: "latest" # To be patched
                volumeMounts:
                - name: binary-storage
                  mountPath: /opt/prometheus-custom
              
              containers:
              - name: prometheus
                image: alpine:latest # A minimal base image, since we bring our own binary
                command: ["/opt/prometheus-custom/bin/prometheus"]
                args:
                  - "--config.file=/etc/prometheus/prometheus.yml"
                  - "--storage.tsdb.path=/prometheus"
                  - "--web.console.libraries=/usr/share/prometheus/console_libraries"
                  - "--web.console.templates=/usr/share/prometheus/consoles"
                ports:
                - containerPort: 9090
                volumeMounts:
                - name: binary-storage
                  mountPath: /opt/prometheus-custom
                - name: config-volume
                  mountPath: /etc/prometheus
                - name: data-volume
                  mountPath: /prometheus # In production, this should be a PVC
              volumes:
              - name: binary-storage
                emptyDir: {}
              - name: config-volume
                configMap:
                  name: prometheus-config # To be patched
              - name: data-volume
                emptyDir: {} # WARNING: Not for production use. Use a PersistentVolumeClaim.

      patches:
        # Patch deployment metadata and labels
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.selector.matchLabels.app"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.template.metadata.labels.app"
        # Patch replicas
        - fromFieldPath: "spec.replicas"
          toFieldPath: "spec.replicas"
        # Patch initContainer env vars with OCI info from spec
        - fromFieldPath: "spec.ociRegistry"
          toFieldPath: "spec.template.spec.initContainers[0].env[0].value"
        - fromFieldPath: "spec.version"
          toFieldPath: "spec.template.spec.initContainers[0].env[1].value"
        # Patch the configmap volume name
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.template.spec.volumes[1].configMap.name"
          transforms:
            - type: string
              string:
                fmt: "%s-config"

    # 3. Service to expose the Prometheus deployment
    - name: prometheus-service
      base:
        apiVersion: v1
        kind: Service
        spec:
          selector:
            app: custom-prometheus # To be patched
          ports:
            - protocol: TCP
              port: 9090
              targetPort: 9090
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.selector.app"

这个 Composition 有几个关键的设计:

  1. initContainer 模式: 这是整个方案的核心。我们使用一个标准的 oras 镜像作为 initContainer,它的唯一任务就是从 OCI 仓库拉取指定版本的制品,并将其解压到一个共享的 emptyDir 卷 (binary-storage) 中。
  2. 主容器: 主容器使用一个极简的基础镜像(如 alpine),因为它不需要包含 Prometheus 本身。它直接从共享卷 (/opt/prometheus-custom/bin/prometheus) 中执行二进制文件。
  3. 配置注入: ConfigMap 资源通过 patches 动态地将用户提供的 spec.scrapeConfig 内容合并到基础配置中,然后通过 config-volume 挂载到主容器。一个常见的错误是:键名中本身含有点 (.) 的字段路径(如 prometheus.yml)必须使用方括号语法 data[prometheus.yml],直接写 data.prometheus.yml 会被解析成多级嵌套路径。
  4. 动态 patching: Crossplane 的 patches 机制将 CustomPrometheus 资源的 spec 字段值“粘贴”到底层模板的相应位置。例如,spec.version 被 patch 到了 initContainer 的环境变量 OCI_TAG 中。
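上面第 3 点中 string transform 的 fmt 合并效果,可以在本地用 printf 直观地模拟(仅为示意,Crossplane 内部等价于一次 Sprintf 格式化):

```shell
# 模拟 Composition 中 fmt transform:把用户的 scrapeConfig 拼入基础配置
USER_SCRAPE_CONFIG="- job_name: 'node'
  static_configs:
  - targets: ['node-exporter.monitoring.svc.cluster.local:9100']"

# %s 整体替换为用户提供的多行字符串,其余部分为基础配置模板
printf 'global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
%s
' "$USER_SCRAPE_CONFIG"
```

注意 fmt 不会对替换内容做缩进调整,所以用户的 scrapeConfig 必须自带与 scrape_configs 列表项一致的缩进层级。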
整体的交付与运行流程如下(mermaid 流程图):

graph TD
    A[Git Commit: CustomPrometheus YAML] --> B{Crossplane Controller};
    B -- Reconciles --> C[XRD: customprometheuses.custom.prometheus.io];
    C -- Uses --> D[Composition: customprometheus.k8s...];
    D -- Renders Resources --> E[1. ConfigMap];
    D -- Renders Resources --> F[2. Deployment];
    D -- Renders Resources --> G[3. Service];
    
    subgraph "Managed K8s Resources"
        E;F;G;
    end

    subgraph "Inside Pod"
        H[initContainer: oras pull] --> I[Shared emptyDir Volume];
        I --> J[Main Container: executes binary from Volume];
    end

    F -- Creates --> Pod;
    Pod -- Contains --> H;
    Pod -- Contains --> J;
    
    K[OCI Registry] -- oras pull --> H;
    L[CI/CD Pipeline] -- oras push --> K[OCI Registry];
    
    A -- Triggers GitOps --> B;

第三阶段:实践与验证

将上述 XRD 和 Composition 应用到安装了 Crossplane 的 Kubernetes 集群后,就可以创建实例了。注意:复合资源 CustomPrometheus 本身是集群级的,要在业务命名空间中使用,应创建 XRD 中通过 claimNames 声明的 PrometheusClaim。

apiVersion: custom.prometheus.io/v1alpha1
kind: PrometheusClaim
metadata:
  name: prometheus-for-billing-team
  namespace: billing-services
spec:
  version: "v2.45.0-custom-mem-opt-a1b2c3d" # Our custom build tag
  ociRegistry: "our-harbor.corp.net/internal-tools/prometheus-custom"
  replicas: 1
  scrapeConfig: |
    - job_name: 'billing-api'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_label_app]
          action: keep
          regex: billing-api-service

应用这个 YAML 后,Crossplane 会立即开始工作。我们可以通过 kubectl 观察到:

  1. kubectl get prometheusclaims -n billing-services(或查看集群级复合资源 kubectl get customprometheuses)会显示资源的状态列,如 SYNCED 和 READY。
  2. kubectl get managed 会列出 Crossplane 创建的底层资源:一个 ConfigMap、一个 Deployment 和一个 Service。
  3. kubectl get pods -n billing-services 会显示正在创建的 Pod。查看 Pod 的事件,可以看到 initContainer 成功拉取 OCI 制品,然后主容器启动。

升级变得异常简单。当 CI/CD 构建出新版本 v2.45.0-custom-mem-opt-f4e5g6h 并推送到 OCI 仓库后,开发者或 SRE 只需要在 Git 中修改 CustomPrometheus YAML 的 spec.version 字段。GitOps 工具(如 ArgoCD)检测到变更后,会自动应用更新。Crossplane 会执行滚动更新,平滑地将旧版本的 Pod 替换为运行新二进制文件的新 Pod。
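Git 侧的升级只是一次单字段修改。下面用 sed 做一个最小示意(文件路径与新 tag 均为假设值,实际项目中这一步通常由 PR 或 CI 自动完成):

```shell
# 准备一份示意的资源文件(路径为假设值)
cat > /tmp/prometheus-claim.yaml <<'EOF'
apiVersion: custom.prometheus.io/v1alpha1
kind: CustomPrometheus
metadata:
  name: prometheus-for-billing-team
spec:
  version: "v2.45.0-custom-mem-opt-a1b2c3d"
  replicas: 2
EOF

# 把 version 字段替换为新构建的 tag,之后提交并由 GitOps 工具同步
sed -i 's/version: ".*"/version: "v2.45.0-custom-mem-opt-f4e5g6h"/' /tmp/prometheus-claim.yaml

grep 'version:' /tmp/prometheus-claim.yaml
```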

局限性与未来展望

这套方案虽然优雅地解决了定制化二进制文件的声明式交付问题,但在真实生产环境中,它并非银弹。

首先,emptyDir 用于存储二进制文件意味着每次 Pod 重启都需要重新从 OCI 仓库拉取,这会增加启动延迟,并对 OCI 仓库造成压力。对于较大的二进制文件,可以考虑使用缓存策略或将二进制文件构建到一个更持久的 CSI 卷中,但这会增加 Composition 的复杂性。

其次,对于 Prometheus 这样的有状态服务,TSDB 数据的持久化至关重要。当前示例中的 emptyDir for /prometheus 仅用于演示,生产环境必须将其替换为由 PersistentVolumeClaim 模板管理的持久化存储。这需要在 XRD 和 Composition 中添加对存储类、容量等参数的支持。
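作为一个可能的演进方向,可以在 Composition 的 resources 列表中追加一个 PVC,并让 Deployment 的 data-volume 改为引用它。下面是一个示意片段(storageClassName、容量等字段值均为假设,需按实际环境调整):

```yaml
# 追加到 Composition 的 spec.resources 列表中(示意)
- name: prometheus-data-pvc
  base:
    apiVersion: v1
    kind: PersistentVolumeClaim
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd   # 假设的 StorageClass
      resources:
        requests:
          storage: 50Gi
  patches:
    - fromFieldPath: "metadata.name"
      toFieldPath: "metadata.name"
      transforms:
        - type: string
          string:
            fmt: "%s-data"
```

相应地,XRD 中也需要暴露存储相关的参数(如容量),再通过 patch 传递到这里。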

最后,当前的 Composition 逻辑相对简单。一个更成熟的方案可能会演变成一个专用的 Crossplane Provider。Provider 使用 Go 语言编写,能够实现更复杂的调谐逻辑,例如在更新前检查新版本 OCI 制品的健康状况,或者管理数据库 schema 迁移等,而这超出了 Composition 的能力范围。但这套基于 Composition 的方法论证了其核心价值,并为向更复杂的 Provider 演进奠定了坚实的基础。

