GitOps at Scale: Managing 100+ Microservices with Argo CD and Helm



GitOps Sounds Simple Until You Have 100 Services

The core GitOps promise is elegant: Git is the source of truth. Every cluster state change goes through a pull request. Rollbacks are git reverts. Audit trails are git history.

In practice, scaling GitOps from 5 services to 100+ exposes a lot of rough edges. Repository structure, sync strategies, secrets management, drift detection, environment promotion — all of these have multiple valid approaches and none of them are obvious until you’ve made the wrong choice and lived with it.

This post is about what actually works in large-scale GitOps deployments, specifically with Argo CD and Helm.



Repository Structure: The Foundational Decision

The biggest source of GitOps pain at scale is getting repository structure wrong early. Two main patterns:

Option A: Monorepo (App Code + K8s Config)

repo/
├── services/
│   ├── payment-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── helm/          ← K8s config lives with the code
│   │       ├── Chart.yaml
│   │       ├── values.yaml
│   │       └── values-prod.yaml
│   └── user-service/
│       ├── src/
│       └── helm/
└── .github/workflows/

Pros: Single source of truth, easy to see what changed with a feature. Good for small teams.

Cons: CI/CD pipeline complexity — pushing code and pushing Kubernetes config use different triggers and review processes. Hard to enforce separation of concerns.

Option B: Separate Config Repo (App Code and K8s Config Split)

company-app-code/          ← developers own this
├── payment-service/
│   ├── src/
│   └── Dockerfile
└── user-service/

company-gitops/            ← platform team owns this
├── apps/
│   ├── payment-service/
│   │   ├── Chart.yaml
│   │   ├── values-dev.yaml
│   │   ├── values-staging.yaml
│   │   └── values-prod.yaml
│   └── user-service/
│       └── ...
├── argocd/
│   ├── apps/              ← Argo CD Application manifests
│   └── projects/          ← Argo CD AppProject manifests
└── clusters/
    ├── dev/
    ├── staging/
    └── prod/

Why this wins at scale:

  • Clear ownership boundaries
  • Config changes go through a separate review process (often more rigorous for prod)
  • Argo CD watches the config repo, not the app repo
  • Secrets and infrastructure config are isolated from application code
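
With 100+ chart directories, the layout is easier to keep consistent when CI checks it. Here is a minimal sketch, assuming the `apps/<service>/` convention from the tree above with one values file per environment. The function name and the `ENVIRONMENTS` tuple are illustrative, not part of Argo CD or Helm.

```python
from pathlib import Path

# Assumed conventions, taken from the tree above: one chart directory per
# service under apps/, with one values file per environment.
ENVIRONMENTS = ("dev", "staging", "prod")


def missing_files(repo_root: Path) -> dict[str, list[str]]:
    """Map each service to the chart files it lacks; complete services are omitted."""
    expected = ["Chart.yaml"] + [f"values-{env}.yaml" for env in ENVIRONMENTS]
    problems: dict[str, list[str]] = {}
    for service_dir in sorted(p for p in (repo_root / "apps").iterdir() if p.is_dir()):
        missing = [name for name in expected if not (service_dir / name).is_file()]
        if missing:
            problems[service_dir.name] = missing
    return problems
```

Running this as a required check on the config repo means a half-added service fails review instead of failing at sync time.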

App of Apps Pattern

For 100+ services, the “App of Apps” pattern in Argo CD is essential. Instead of registering each service individually in Argo CD, you have a root Application that manages all other Applications:

# argocd/apps/root.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: HEAD
    path: argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# argocd/apps/payment-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/company/gitops
    targetRevision: HEAD
    path: apps/payment-service
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: payment
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
  revisionHistoryLimit: 10

Now adding a new service = adding one YAML file to argocd/apps/. Argo CD picks it up automatically.
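
Since every Application file follows the same shape, that one file can be generated rather than hand-copied. The sketch below mirrors the payment-service manifest above. The function and its defaults are hypothetical helpers, and it emits JSON, which YAML parsers also accept.

```python
import json


def application_manifest(service: str, namespace: str, project: str = "production",
                         values_file: str = "values-prod.yaml") -> dict:
    """Build an Argo CD Application shaped like the payment-service example."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": service,
            "namespace": "argocd",
            "finalizers": ["resources-finalizer.argocd.argoproj.io"],
        },
        "spec": {
            "project": project,
            "source": {
                "repoURL": "https://github.com/company/gitops",
                "targetRevision": "HEAD",
                "path": f"apps/{service}",
                "helm": {"valueFiles": [values_file]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {
                "automated": {"prune": True, "selfHeal": True},
                "syncOptions": ["CreateNamespace=true", "ServerSideApply=true"],
            },
            "revisionHistoryLimit": 10,
        },
    }


if __name__ == "__main__":
    # Print a manifest for a second service; redirect it into argocd/apps/.
    print(json.dumps(application_manifest("user-service", namespace="user"), indent=2))
```

Generating the file keeps the finalizer and sync options uniform across services, which is hard to guarantee by copy and paste.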


Image Update Automation

The most common question: “How does a new Docker image get deployed?” You have two main approaches:

Approach 1: CI Pipeline Updates values.yaml

# CI pipeline (GitHub Actions, after building and pushing image)
- name: Check out GitOps repo
  uses: actions/checkout@v4
  with:
    repository: company/gitops
    path: gitops  # the next step expects the clone in ./gitops
    token: ${{ secrets.GITOPS_TOKEN }}  # assumed secret name; needs push access

- name: Update image tag
  env:
    IMAGE_TAG: ${{ github.sha }}  # assumes images are tagged with the commit SHA
  run: |
    cd gitops
    # Update the image tag in the staging values file
    yq -i '.image.tag = strenv(IMAGE_TAG)' apps/payment-service/values-staging.yaml

    git config user.email "ci@company.com"
    git config user.name "CI Bot"
    git add apps/payment-service/values-staging.yaml
    git commit -m "chore: update payment-service to ${IMAGE_TAG}"
    git push

Argo CD detects the change and syncs. Simple, auditable, widely used.

Approach 2: Argo CD Image Updater

# Annotation-based image update automation
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  annotations:
    argocd-image-updater.argoproj.io/image-list: payment=company/payment-service
    argocd-image-updater.argoproj.io/payment.update-strategy: semver
    argocd-image-updater.argoproj.io/payment.allow-tags: "regexp:^v[0-9]+\\.[0-9]+\\.[0-9]+$"
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main

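Image Updater polls the container registry and, when a new matching tag appears, commits the updated tag back to Git. It can write directly to the tracked branch or to a separate branch that you turn into a pull request. Great for automated patch updates.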
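
To make the `semver` strategy concrete, here is a rough illustration of the selection logic: filter tags with the allow-tags pattern, then pick the numerically highest version. This is not Image Updater's implementation, and the function name is made up for the example.

```python
import re
from typing import Optional

# Same pattern as the allow-tags annotation above.
TAG_PATTERN = re.compile(r"^v([0-9]+)\.([0-9]+)\.([0-9]+)$")


def latest_semver_tag(tags: list[str]) -> Optional[str]:
    """Return the highest vMAJOR.MINOR.PATCH tag, ignoring tags that do not match."""
    candidates = []
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match:
            # Compare numerically so that v1.10.0 ranks above v1.9.3.
            version = tuple(int(part) for part in match.groups())
            candidates.append((version, tag))
    return max(candidates)[1] if candidates else None
```

The numeric comparison is the point: a plain string sort would put `v1.9.3` after `v1.10.0` and deploy the older image.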


Environment Promotion

The promotion flow from dev → staging → prod deserves careful design:

Dev: auto-sync on every commit to main
    │
    ▼
Staging: auto-sync on schedule (or PR approval)
    │
    ▼
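Prod: Manual approval required (PR + CODEOWNERS + required reviews)

# Using Argo CD ApplicationSets for multi-environment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payment-service
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - list:
        elements:
          # autoSync is a real boolean so the Go template condition below works
          - env: dev
            cluster: dev-cluster
            autoSync: true
          - env: staging
            cluster: staging-cluster
            autoSync: true
          - env: prod
            cluster: prod-cluster
            autoSync: false  # manual for prod
  template:
    metadata:
      name: "payment-service-{{ .env }}"
    spec:
      project: "{{ .env }}"  # assumes one AppProject named after each environment
      source:
        repoURL: https://github.com/company/gitops
        targetRevision: HEAD
        path: "apps/payment-service"
        helm:
          valueFiles:
            - "values-{{ .env }}.yaml"
      destination:
        name: "{{ .cluster }}"  # cluster name as registered in Argo CD
        namespace: payment
  # An `automated` block enables auto-sync even when prune and selfHeal are
  # false, so it is only added where autoSync is true.
  # templatePatch requires Argo CD 2.10 or later.
  templatePatch: |
    spec:
      syncPolicy:
        {{- if .autoSync }}
        automated:
          prune: true
          selfHeal: true
        {{- else }}
        automated: null
        {{- end }}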
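
The promotion flow itself can be expressed as a small check that CI runs to decide which promotion PR to open next. This is a sketch under the assumption that each environment's values file carries one image tag. The function and the `deployed` mapping are hypothetical.

```python
from typing import Optional

# Promotion order from the flow above.
PROMOTION_ORDER = ["dev", "staging", "prod"]


def next_promotion(deployed: dict[str, str]) -> Optional[tuple[str, str, str]]:
    """Return (source_env, target_env, tag) for the first environment that lags behind.

    `deployed` maps each environment name to the image tag in its values file.
    Returns None when every environment runs the same tag.
    """
    for source, target in zip(PROMOTION_ORDER, PROMOTION_ORDER[1:]):
        if deployed[source] != deployed[target]:
            return source, target, deployed[source]
    return None
```

For example, `next_promotion({"dev": "v1.4.0", "staging": "v1.3.2", "prod": "v1.3.2"})` points at dev to staging first, so prod is never offered a tag that staging has not run.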

Secrets Management: The Hardest Part

Secrets don’t belong in Git. But your Kubernetes workloads need secrets. The two best solutions in 2026:

External Secrets Operator (ESO)

# ExternalSecret syncs from AWS Secrets Manager / Vault / etc.
# Older ESO releases serve this API as external-secrets.io/v1beta1
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payment-service-secrets
  namespace: payment
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: payment-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: production/payment-service
        property: database_password
    - secretKey: STRIPE_SECRET_KEY
      remoteRef:
        key: production/payment-service
        property: stripe_secret_key

This manifest is safe to commit. The actual secret values live in AWS Secrets Manager (or Vault, GCP Secret Manager, etc.) and are synced to Kubernetes Secrets by the ESO controller.
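
The flip side is making sure a plain `Secret` never lands in the repo by accident. A minimal sketch of a pre-commit or CI check follows. It is a line-based heuristic, not a YAML parser, so it would miss a Secret nested inside a `kind: List`. The names are illustrative.

```python
import re

# Matches a top-level `kind: Secret` line only. ExternalSecret and SealedSecret
# do not match because the value must be exactly "Secret".
PLAIN_SECRET = re.compile(r"^kind:\s*[\"']?Secret[\"']?\s*(#.*)?$", re.MULTILINE)


def contains_plain_secret(manifest_text: str) -> bool:
    """Return True if any YAML document in the text declares kind: Secret."""
    return PLAIN_SECRET.search(manifest_text) is not None
```

Run it over every changed `.yaml` file and fail the build on a match. ExternalSecret and SealedSecret manifests pass untouched.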

Sealed Secrets (Simpler but Less Flexible)

# Encrypt a secret with the cluster's public key
echo -n "my-secret-value" | \
  kubeseal --raw \
  --from-file=/dev/stdin \
  --namespace payment \
  --name payment-service-secrets

# The SealedSecret is safe to commit
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: payment-service-secrets
  namespace: payment
spec:
  encryptedData:
    DATABASE_PASSWORD: AgB5...encrypted...

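Sealed Secrets are simpler, but each SealedSecret can only be decrypted by the controller that holds the matching private key, which ties it to a specific cluster. ESO is better for multi-cluster and rotation scenarios.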


Drift Detection and Alerting

In a production system, you need to know when cluster state diverges from Git state:

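# Custom drift alerting script
import json
import os
import subprocess

import requests

# Assumed configuration: these environment variable names are illustrative.
ARGO_URL = os.environ["ARGOCD_SERVER"]
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]
PAGERDUTY_ROUTING_KEY = os.environ["PAGERDUTY_ROUTING_KEY"]

def check_argo_sync_status():
    result = subprocess.run(
        ["argocd", "app", "list", "-o", "json", "--server", ARGO_URL],
        capture_output=True, text=True, check=True  # raise if the CLI call fails
    )
    apps = json.loads(result.stdout)

    drifted = []
    for app in apps:
        app_status = app.get("status", {})
        # Newly created apps may not report sync or health yet
        status = app_status.get("sync", {}).get("status", "Unknown")
        health = app_status.get("health", {}).get("status", "Unknown")

        if status != "Synced" or health != "Healthy":
            drifted.append({
                "name": app["metadata"]["name"],
                "sync_status": status,
                "health": health,
                "message": app_status.get("operationState", {}).get("message", "")
            })

    return drifted

def slack_notify(text: str):
    # Slack incoming webhooks accept a JSON body with a "text" field
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10).raise_for_status()

def pagerduty_alert(app: dict):
    # Trigger an incident through the PagerDuty Events API v2
    event = {
        "routing_key": PAGERDUTY_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"{app['name']} is {app['sync_status']}/{app['health']}",
            "source": "argocd-drift-check",
            "severity": "critical",
        },
    }
    requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10).raise_for_status()

def alert_on_drift(drifted: list):
    if not drifted:
        return

    # Send to Slack/PagerDuty
    for app in drifted:
        if app["health"] == "Degraded":
            # Critical: page on-call
            pagerduty_alert(app)
        else:
            # Non-critical: Slack notification
            slack_notify(f"⚠️ {app['name']} is {app['sync_status']}/{app['health']}")

# Run every 5 minutes via cron
if __name__ == "__main__":
    drifted = check_argo_sync_status()
    alert_on_drift(drifted)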

Performance at Scale

With 100+ apps, Argo CD itself can become a bottleneck. Key tuning parameters:

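# argocd-cmd-params-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Increase concurrent reconciliations
  controller.operation.processors: "25"
  controller.status.processors: "50"

  # Sharding algorithm for spreading clusters across controller replicas
  controller.sharding.algorithm: "round-robin"

  # Cap concurrent manifest generation in the repo server
  reposerver.parallelism.limit: "10"
---
# argocd-cm ConfigMap (the reconciliation timeouts are read from here)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # How often Argo CD re-checks Git; a longer interval means fewer reconciliations
  timeout.reconciliation: "180s"
  timeout.hard.reconciliation: "0s"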

For very large deployments (500+ apps), consider:

  1. Argo CD instance per cluster: each cluster manages itself
  2. Multi-instance hub-spoke: one control plane, multiple app instances
  3. Application controller sharding: distribute clusters across multiple controller replicas
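
To build intuition for what round-robin sharding buys you, here is a simplified illustration of spreading clusters across controller replicas. It is not Argo CD's actual assignment code, which keys on its own cluster ordering. The function name and the sorted-by-name ordering are assumptions for the example.

```python
def assign_shards(clusters: list[str], replicas: int) -> dict[int, list[str]]:
    """Spread clusters across controller replicas in round-robin order."""
    shards: dict[int, list[str]] = {shard: [] for shard in range(replicas)}
    # Sort first so the assignment is stable regardless of input order.
    for index, cluster in enumerate(sorted(clusters)):
        shards[index % replicas].append(cluster)
    return shards
```

Each replica ends up with at most one cluster more than any other. That balances cluster count, though not necessarily load, if a few clusters hold most of the applications.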

Rollback Strategy

# List revision history
argocd app history payment-service

ID  DATE                     REVISION
5   2026-06-27 12:00:00 UTC  HEAD (abc1234)
4   2026-06-26 18:00:00 UTC  def5678
3   2026-06-26 09:00:00 UTC  ghi9012

# Rollback to revision 4
argocd app rollback payment-service 4

# Or do it via Git (preferred for auditability)
git -C gitops revert HEAD --no-edit
git -C gitops push
# Argo CD detects and syncs automatically

The Git revert approach is preferred because it creates an audit trail and keeps Git as the source of truth. Direct Argo CD rollbacks bypass the GitOps workflow, and Argo CD rejects them while automated sync is enabled, so you have to disable auto-sync on the Application first. Use them only for emergencies, then immediately create the corresponding Git revert.


Conclusion

GitOps at scale is absolutely achievable, but it requires deliberate architecture decisions early. The patterns that work:

  1. Separate config repository with clear ownership
  2. App of Apps for managing 100+ Argo CD Applications
  3. ApplicationSets for multi-environment/multi-cluster deployments
  4. External Secrets Operator for secrets management
  5. CI-driven image updates with PR-based promotion
  6. Drift detection alerting so problems surface immediately

The teams that struggle with GitOps at scale are usually the ones who started simple (monorepo, no automation) and tried to scale without restructuring. Getting the foundations right pays dividends for years.


What GitOps patterns is your team using at scale? Curious to hear how others handle the environment promotion and secrets management challenges.
