Our team deeply customized a specific TSDB module in Prometheus to reduce memory usage under high-cardinality label workloads. The change delivered a significant performance win, but it also introduced a thorny engineering problem: how do we reliably and repeatably deploy this custom Prometheus binary and its configuration to multiple Kubernetes clusters? The initial approach was manual: the CI/CD pipeline compiled the binary and uploaded it to object storage, then a sprawling set of Ansible playbooks pulled the binary, rendered the configuration, and created the Kubernetes resources. The whole process was riddled with uncertainty, version management was chaotic, and rollbacks were downright treacherous.
In real projects, this kind of hand-stitched glue-ops workflow is a breeding ground for technical debt. Every deployment demands deep involvement from an operations engineer, and a genuine GitOps loop is out of reach. What we needed was not another script but a platform-level, declarative solution: the ability to define and manage our custom Prometheus service with a simple YAML file, just as we would declare an `RDSInstance` or a `GKECluster`.
```yaml
# The desired target state
apiVersion: custom.prometheus.io/v1alpha1
kind: CustomPrometheus
metadata:
  name: prometheus-for-billing-team
spec:
  # OCI artifact tag, pointing to our custom build
  version: "v2.45.0-custom-mem-opt-a1b2c3d"
  replicas: 2
  # User-provided configuration snippet
  scrapeConfig: |
    - job_name: 'node'
      static_configs:
        - targets: ['node-exporter.monitoring.svc.cluster.local:9100']
  # ... other parameters
```
This vision led us to Crossplane. Crossplane lets us abstract any infrastructure or service, whether a cloud provider API or an internal application, into a Kubernetes custom resource (CRD). The other key question, how to distribute the custom binary, found its answer in the OCI (Open Container Initiative) specification. The OCI image spec is not limited to container images: any artifact, be it a Helm chart, a WASM module, or our Go binary, can be packaged as an OCI artifact and stored in any OCI-compliant registry.
The power of this combination: CI/CD builds and publishes the OCI artifact, while Crossplane interprets our `CustomPrometheus` custom resource inside Kubernetes and dynamically orchestrates the underlying resources, including pulling the specified OCI artifact version from the registry and injecting it into the running Pod. This fundamentally changes the delivery model.
## Phase 1: Building and Packaging the OCI Artifact
The core task is to package the compiled Prometheus binary and a default configuration file into a standard OCI artifact. We deliberately do not build a runnable container image, because we want more flexibility at runtime, such as mounting different configuration files dynamically. The `oras` CLI tool makes this straightforward.
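If `oras` is not already available in the CI image, it ships as a single static binary. A minimal install sketch for Linux amd64, following the pattern of the project's published install instructions (the version is pinned here for reproducibility; check the releases page for current builds):

```bash
# Download and install the oras CLI (v1.0.0 shown; adjust version/platform as needed)
VERSION="1.0.0"
curl -LO "https://github.com/oras-project/oras/releases/download/v${VERSION}/oras_${VERSION}_linux_amd64.tar.gz"
mkdir -p oras-install/
tar -zxf "oras_${VERSION}_linux_amd64.tar.gz" -C oras-install/
sudo mv oras-install/oras /usr/local/bin/
rm -rf "oras_${VERSION}_linux_amd64.tar.gz" oras-install/
oras version
```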
First, the CI build script. The simplified `build.sh` below simulates the compile step and the preparation of the artifact directory.
```bash
#!/bin/bash
set -eo pipefail

# --- Configuration ---
# Assume these are passed from the CI environment
# e.g., git describe --tags
APP_VERSION="${1:-v2.45.0-custom-dev}"
# OCI registry URL
REGISTRY_URL="your-registry.io/prometheus-custom"
# Go build flags for optimization
GO_BUILD_FLAGS="-s -w"
# Source code path
PROMETHEUS_SRC_PATH="./prometheus"

# --- Main Logic ---
echo "INFO: Starting build for Prometheus version ${APP_VERSION}..."

# 1. Prepare build environment
BUILD_DIR=$(mktemp -d)
trap 'rm -rf -- "$BUILD_DIR"' EXIT
ARTIFACT_DIR="${BUILD_DIR}/artifact"
mkdir -p "${ARTIFACT_DIR}/bin"
mkdir -p "${ARTIFACT_DIR}/config"
echo "INFO: Build directory: ${BUILD_DIR}"
echo "INFO: Artifact directory: ${ARTIFACT_DIR}"

# 2. Build the custom Prometheus binary
echo "INFO: Building custom Prometheus binary..."
# In a real CI, this would involve cloning a specific git commit:
# cd ${PROMETHEUS_SRC_PATH}
# make build
# For demonstration, we just create a mock binary
echo "mock prometheus binary version ${APP_VERSION}" > "${ARTIFACT_DIR}/bin/prometheus"
chmod +x "${ARTIFACT_DIR}/bin/prometheus"
echo "INFO: Binary created at ${ARTIFACT_DIR}/bin/prometheus"

# 3. Prepare default configuration
echo "INFO: Creating default prometheus.yml..."
cat <<EOF > "${ARTIFACT_DIR}/config/prometheus.yml"
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'self'
    static_configs:
      - targets: ['localhost:9090']
EOF

# 4. Package and Push using ORAS (OCI Registry as Storage)
# The target OCI reference combines the registry URL and the version tag
OCI_TARGET="${REGISTRY_URL}:${APP_VERSION}"
echo "INFO: Pushing artifact to ${OCI_TARGET}..."
# oras records the path given on the command line as the file title and
# recreates that path on pull, so push from inside the artifact directory
# with relative paths; consumers then get bin/... and config/... back.
cd "${ARTIFACT_DIR}"
# oras push <target> <file1>:<mediatype> <file2>:<mediatype> ...
# We define custom media types for our components for clarity
oras push "${OCI_TARGET}" \
  --config /dev/null:application/vnd.custom.config.v1+json \
  "bin/prometheus:application/vnd.custom.binary.v1" \
  "config/prometheus.yml:application/vnd.custom.prometheus.config.v1+yaml"
echo "SUCCESS: Artifact ${OCI_TARGET} pushed successfully."

# Optional: You can inspect the manifest
echo "INFO: Inspecting pushed manifest..."
oras manifest fetch "${OCI_TARGET}"
```
The heart of this script is the `oras push` command, which pushes the `prometheus` binary and the `prometheus.yml` configuration file to the OCI registry as two separate layers. One pitfall: each file must be given a media type, which is essential for downstream tooling to identify the content, so we define custom media types such as `application/vnd.custom.binary.v1`. Another is that `oras` records the path supplied at push time and recreates it on pull, which is why the script pushes relative paths from inside the artifact directory.
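To see what actually landed in the registry, inspect the manifest. A sketch, assuming the registry and tag from the script above; the digests and sizes shown in the comments are illustrative:

```bash
# Fetch the OCI manifest of the pushed artifact (jq only used for pretty-printing)
oras manifest fetch "your-registry.io/prometheus-custom:v2.45.0-custom-dev" | jq .
# Illustrative shape of the result -- real digests and sizes will differ:
# {
#   "schemaVersion": 2,
#   "mediaType": "application/vnd.oci.image.manifest.v1+json",
#   "config": {
#     "mediaType": "application/vnd.custom.config.v1+json",
#     "digest": "sha256:...", "size": 0
#   },
#   "layers": [
#     { "mediaType": "application/vnd.custom.binary.v1",
#       "digest": "sha256:...", "size": 43,
#       "annotations": { "org.opencontainers.image.title": "bin/prometheus" } },
#     { "mediaType": "application/vnd.custom.prometheus.config.v1+yaml",
#       "digest": "sha256:...", "size": 138,
#       "annotations": { "org.opencontainers.image.title": "config/prometheus.yml" } }
#   ]
# }
```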
## Phase 2: Defining a Declarative API with Crossplane
With the artifact in place, the next step is to create our `CustomPrometheus` API with Crossplane. This takes two parts: defining the API's structure (the XRD) and implementing the logic behind it (the Composition).
### CompositeResourceDefinition (XRD)
The XRD defines the `spec` fields of the `CustomPrometheus` resource; in effect, it creates the schema for our new API.
```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: customprometheuses.custom.prometheus.io
spec:
  group: custom.prometheus.io
  names:
    kind: CustomPrometheus
    listKind: CustomPrometheusList
    plural: customprometheuses
    singular: customprometheus
  claimNames:
    kind: PrometheusClaim
    listKind: PrometheusClaimList
    plural: prometheusclaims
    singular: prometheusclaim
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                version:
                  type: string
                  description: "The OCI artifact tag for the custom Prometheus build."
                replicas:
                  type: integer
                  description: "Number of Prometheus replicas."
                  default: 1
                ociRegistry:
                  type: string
                  description: "The base URL of the OCI registry."
                  default: "your-registry.io/prometheus-custom"
                scrapeConfig:
                  type: string
                  description: "The content of scrape_configs section in prometheus.yml."
              required:
                - version
                - scrapeConfig
```
The `spec` here cleanly exposes the parameters a user may configure: `version` (the OCI tag), `replicas` (the replica count), and `scrapeConfig` (a hook for injecting custom scrape configuration). Because the XRD also declares `claimNames`, teams can work with a namespaced `PrometheusClaim` instead of the cluster-scoped composite directly.
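Before wiring up the implementation, it is worth a quick sanity check that Crossplane has accepted the definition. A minimal check, assuming the XRD above has been applied (printer columns may vary slightly across Crossplane versions):

```bash
# The XRD should report Established (schema accepted) and Offered (claim kind available)
kubectl get xrd customprometheuses.custom.prometheus.io
# NAME                                      ESTABLISHED   OFFERED   AGE
# customprometheuses.custom.prometheus.io   True          True      12s
```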
### Composition
The Composition is the actual implementation: it maps a `CustomPrometheus` resource onto a set of underlying native Kubernetes resources, namely a `Deployment`, a `ConfigMap`, and a `Service`. One caveat: classic patch-and-transform Compositions compose Crossplane managed resources, so in practice each native resource below would be wrapped in a provider-kubernetes `Object`; the wrapper is omitted here to keep the templates readable.
```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: customprometheus.k8s.custom.prometheus.io
  labels:
    provider: kubernetes
spec:
  writeConnectionSecretsToNamespace: crossplane-system
  compositeTypeRef:
    apiVersion: custom.prometheus.io/v1alpha1
    kind: CustomPrometheus
  resources:
    # 1. ConfigMap for Prometheus configuration
    - name: prometheus-config
      base:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: prometheus-config # This name will be patched
        data:
          prometheus.yml: "" # This content will be patched
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "%s-config"
        # Render the full prometheus.yml by wrapping the user's
        # scrapeConfig snippet in the base configuration.
        - fromFieldPath: "spec.scrapeConfig"
          toFieldPath: "data[prometheus.yml]" # Bracket notation for keys containing dots
          transforms:
            - type: string
              string:
                fmt: |
                  global:
                    scrape_interval: 15s
                    evaluation_interval: 15s
                  scrape_configs:
                  %s
          policy:
            fromFieldPath: Required
    # 2. Deployment for running the custom Prometheus
    - name: prometheus-deployment
      base:
        apiVersion: apps/v1
        kind: Deployment
        spec:
          replicas: 1 # This will be patched
          selector:
            matchLabels:
              app: custom-prometheus # This will be patched
          template:
            metadata:
              labels:
                app: custom-prometheus # This will be patched
            spec:
              # This initContainer is the core of the solution
              initContainers:
                - name: oci-artifact-puller
                  image: ghcr.io/oras-project/oras:v1.0.0
                  command: ["/bin/sh", "-c"]
                  args:
                    - |
                      set -ex
                      # The OCI artifact URL is constructed dynamically
                      OCI_URL="${OCI_REGISTRY}:${OCI_TAG}"
                      echo "Pulling from ${OCI_URL}"
                      # oras pulls the files into the target directory
                      oras pull "${OCI_URL}" -o /opt/prometheus-custom
                      # Make sure the pulled binary is executable
                      chmod +x /opt/prometheus-custom/bin/prometheus
                      echo "Artifacts pulled successfully."
                      ls -lR /opt/prometheus-custom
                  env:
                    - name: OCI_REGISTRY
                      value: "your-registry.io/prometheus-custom" # To be patched
                    - name: OCI_TAG
                      value: "latest" # To be patched
                  volumeMounts:
                    - name: binary-storage
                      mountPath: /opt/prometheus-custom
              containers:
                - name: prometheus
                  image: alpine:latest # A minimal base image, since we bring our own binary
                  command: ["/opt/prometheus-custom/bin/prometheus"]
                  args:
                    - "--config.file=/etc/prometheus/prometheus.yml"
                    - "--storage.tsdb.path=/prometheus"
                    - "--web.console.libraries=/usr/share/prometheus/console_libraries"
                    - "--web.console.templates=/usr/share/prometheus/consoles"
                  ports:
                    - containerPort: 9090
                  volumeMounts:
                    - name: binary-storage
                      mountPath: /opt/prometheus-custom
                    - name: config-volume
                      mountPath: /etc/prometheus
                    - name: data-volume
                      mountPath: /prometheus # In production, this should be a PVC
              volumes:
                - name: binary-storage
                  emptyDir: {}
                - name: config-volume
                  configMap:
                    name: prometheus-config # To be patched
                - name: data-volume
                  emptyDir: {} # WARNING: Not for production use. Use a PersistentVolumeClaim.
      patches:
        # Patch deployment metadata and labels
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.selector.matchLabels.app"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.template.metadata.labels.app"
        # Patch replicas
        - fromFieldPath: "spec.replicas"
          toFieldPath: "spec.replicas"
        # Patch initContainer env vars with OCI info from spec
        - fromFieldPath: "spec.ociRegistry"
          toFieldPath: "spec.template.spec.initContainers[0].env[0].value"
        - fromFieldPath: "spec.version"
          toFieldPath: "spec.template.spec.initContainers[0].env[1].value"
        # Patch the configmap volume name
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.template.spec.volumes[1].configMap.name"
          transforms:
            - type: string
              string:
                fmt: "%s-config"
    # 3. Service to expose the Prometheus deployment
    - name: prometheus-service
      base:
        apiVersion: v1
        kind: Service
        spec:
          selector:
            app: custom-prometheus # To be patched
          ports:
            - protocol: TCP
              port: 9090
              targetPort: 9090
      patches:
        - fromFieldPath: "metadata.name"
          toFieldPath: "metadata.name"
        - fromFieldPath: "metadata.name"
          toFieldPath: "spec.selector.app"
```
This `Composition` has several key design points:

- The `initContainer` pattern: this is the core of the whole approach. A stock `oras` image runs as an `initContainer` whose sole job is to pull the specified artifact version from the OCI registry and unpack it into a shared `emptyDir` volume (`binary-storage`).
- The main container: it uses a minimal base image (such as `alpine`) because it does not need to ship Prometheus itself; it executes the binary straight from the shared volume (`/opt/prometheus-custom/bin/prometheus`).
- Config injection: the `ConfigMap` resource uses `patches` to merge the user-supplied `spec.scrapeConfig` into the base configuration, which is then mounted into the main container via `config-volume`; see the rendered example after this list. A common mistake is forgetting that a key containing dots, such as `prometheus.yml`, needs bracket notation in the field path (`data[prometheus.yml]`) rather than a plain dotted path.
- Dynamic patching: Crossplane's `patches` machinery "pastes" values from the `CustomPrometheus` resource's `spec` into the corresponding slots of the underlying templates; for example, `spec.version` is patched into the `initContainer`'s `OCI_TAG` environment variable.
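To make the config-injection step concrete, this is roughly what the composed ConfigMap looks like once the `fmt` transform has wrapped a user's `scrapeConfig` snippet (rendered by hand from the patch above, using the `node` job from the opening example; a claim-created instance would carry a random suffix in the name):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-for-billing-team-config # follows the %s-config pattern
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    - job_name: 'node'
      static_configs:
        - targets: ['node-exporter.monitoring.svc.cluster.local:9100']
```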
```mermaid
graph TD
    A[Git Commit: CustomPrometheus YAML] --> B{Crossplane Controller}
    B -- Reconciles --> C[XRD: customprometheuses.custom.prometheus.io]
    C -- Uses --> D[Composition: customprometheus.k8s...]
    D -- Renders Resources --> E[1. ConfigMap]
    D -- Renders Resources --> F[2. Deployment]
    D -- Renders Resources --> G[3. Service]

    subgraph "Managed K8s Resources"
        E
        F
        G
    end

    subgraph "Inside Pod"
        H[initContainer: oras pull] --> I[Shared emptyDir Volume]
        I --> J[Main Container: executes binary from Volume]
    end

    F -- Creates --> Pod
    Pod -- Contains --> H
    Pod -- Contains --> J
    K[OCI Registry] -- oras pull --> H
    L[CI/CD Pipeline] -- oras push --> K
    A -- Triggers GitOps --> B
```
## Phase 3: Practice and Verification
After applying the XRD and Composition above to a Kubernetes cluster with Crossplane installed, we can create instances of our new API. Since the composite itself is cluster-scoped, a team-facing, namespaced deployment goes through the claim kind offered by the XRD, `PrometheusClaim`:
```yaml
apiVersion: custom.prometheus.io/v1alpha1
kind: PrometheusClaim # The namespaced claim offered by the XRD
metadata:
  name: prometheus-for-billing-team
  namespace: billing-services
spec:
  version: "v2.45.0-custom-mem-opt-a1b2c3d" # Our custom build tag
  ociRegistry: "our-harbor.corp.net/internal-tools/prometheus-custom"
  replicas: 1
  scrapeConfig: |
    - job_name: 'billing-api'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_label_app]
          action: keep
          regex: billing-api-service
```
Once this YAML is applied, Crossplane gets to work immediately. We can watch the progress with `kubectl`:

- `kubectl get prometheusclaim -n billing-services` shows the claim's status, such as `SYNCED` and `READY` (see the condensed session below).
- `kubectl get managed` lists the underlying resources Crossplane composed: a `ConfigMap`, a `Deployment`, and a `Service`.
- `kubectl get pods -n billing-services` shows the Pod coming up; its events show the `initContainer` pulling the OCI artifact successfully before the main container starts.
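A condensed verification session might look like the following sketch; pod names, ages, and exact column layout are illustrative:

```bash
# Claim status flips to True once all composed resources are healthy
kubectl get prometheusclaim -n billing-services
# NAME                          SYNCED   READY   CONNECTION-SECRET   AGE
# prometheus-for-billing-team   True     True                        2m

# The Pod first runs the oci-artifact-puller initContainer, then starts Prometheus
kubectl get pods -n billing-services
# NAME                                               READY   STATUS     RESTARTS   AGE
# prometheus-for-billing-team-abc12-6d9f7b8c4-xyz9   0/1     Init:0/1   0          15s

# Confirm the artifact pull in the initContainer logs (pod name is illustrative)
kubectl logs -n billing-services prometheus-for-billing-team-abc12-6d9f7b8c4-xyz9 -c oci-artifact-puller
# Pulling from our-harbor.corp.net/internal-tools/prometheus-custom:v2.45.0-custom-mem-opt-a1b2c3d
# Artifacts pulled successfully.
```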
Upgrades become remarkably simple. When CI/CD builds a new version `v2.45.0-custom-mem-opt-f4e5g6h` and pushes it to the OCI registry, a developer or SRE only needs to change the `spec.version` field of the `PrometheusClaim` YAML in Git. Once a GitOps tool (such as ArgoCD) detects and applies the change, Crossplane performs a rolling update, smoothly replacing the old Pods with new ones running the new binary.
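Concretely, the promotion an SRE merges in Git is a one-line change to the claim (both tags come from the examples above):

```yaml
spec:
  # was: "v2.45.0-custom-mem-opt-a1b2c3d"
  version: "v2.45.0-custom-mem-opt-f4e5g6h"
  ociRegistry: "our-harbor.corp.net/internal-tools/prometheus-custom"
  replicas: 1
```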
## Limitations and Future Directions
While this approach elegantly solves the declarative delivery of a customized binary, it is no silver bullet in a real production environment.
First, using `emptyDir` for the binary means every Pod restart re-pulls the artifact from the OCI registry, which adds startup latency and puts load on the registry. For larger binaries, a caching strategy, or baking the binary onto a more durable CSI volume, is worth considering, at the cost of extra complexity in the `Composition`.
Second, for a stateful service like Prometheus, durable TSDB storage is critical. The `emptyDir` backing `/prometheus` in this example is for demonstration only; production deployments must replace it with persistent storage managed through a `PersistentVolumeClaim`. That in turn means extending the XRD and `Composition` with parameters for storage class, capacity, and so on; a sketch of that direction follows.
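A minimal sketch of that extension, assuming a new `spec.storage` block in the XRD (the `spec.storage.storageClassName` and `spec.storage.size` fields and the `%s-data` naming are hypothetical additions, not part of the XRD above). It would slot into the Composition's `resources` list as a fourth entry:

```yaml
# 4. PVC for durable TSDB storage (sketch; pairs with hypothetical XRD fields)
- name: prometheus-data
  base:
    apiVersion: v1
    kind: PersistentVolumeClaim
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi # To be patched from spec.storage.size
  patches:
    - fromFieldPath: "metadata.name"
      toFieldPath: "metadata.name"
      transforms:
        - type: string
          string:
            fmt: "%s-data"
    - fromFieldPath: "spec.storage.storageClassName"
      toFieldPath: "spec.storageClassName"
    - fromFieldPath: "spec.storage.size"
      toFieldPath: "spec.resources.requests.storage"
```

The Deployment's `data-volume` would then reference the claim via `persistentVolumeClaim.claimName` instead of an `emptyDir`, with the name patched using the same `%s-data` convention.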
Finally, the `Composition` logic here is deliberately simple. A more mature solution would likely evolve into a dedicated Crossplane Provider. A Provider is written in Go and can implement far richer reconciliation logic, such as health-checking a new OCI artifact version before rolling it out, or managing database schema migrations, which is beyond what a `Composition` can express. Even so, this Composition-based approach proves the core idea and lays a solid foundation for evolving toward a full-blown Provider.