“Manual deployments are technical debt with compound interest. Every time you run kubectl apply by hand you’re borrowing against future reliability.”

This is a full breakdown of a push-to-deploy GitOps pipeline on Kubernetes — Flask webhook orchestration server, isolated test namespace with resource quotas, RBAC scoped to minimum permissions, network policy isolation between test and production, and blue-green deployment with automated rollback. Built because the manual process was unsustainable, documented because the failure modes are worth knowing.


The Problem with Manual Deployments

# The old process
docker build -t my-registry/my-app:v1.2.3 .
docker push my-registry/my-app:v1.2.3
kubectl set image deployment/my-app my-app=my-registry/my-app:v1.2.3
kubectl rollout status deployment/my-app
# Realize config map wasn't updated
kubectl apply -f configmap.yaml
kubectl rollout restart deployment/my-app
# Watch pods crashloop
kubectl get pods --watch

The failure modes compound. You forget a config map. You push to the wrong environment. You apply a manifest that was edited locally and never committed. Manual processes don’t just create toil — they create inconsistency, and inconsistency is where incidents come from.

GitOps fixes this at the source: Git is the single source of truth. If it’s not committed, it doesn’t exist in the cluster. Every deployment is auditable, every rollback is a revert.


Architecture

Git Push
    ↓
GitLab Webhook (HTTPS + signature verification)
    ↓
Flask Orchestration Server
    ├─ Signature validation
    ├─ Payload parsing
    └─ Pipeline triggering
    ↓
Kubernetes — test namespace
    ├─ Clone repo
    ├─ Run tests
    └─ Build image
    ↓
Kubernetes — production namespace
    ├─ Blue-green rollout
    ├─ Health checks
    └─ Automatic rollback on failure

Flask Orchestration Server

Flask handles incoming webhook events, validates them, and triggers Kubernetes jobs. Lightweight, containerizable, easy to deploy inside the cluster.

from flask import Flask, request, jsonify
import subprocess
import logging
import hmac
import os
import shlex

app = Flask(__name__)
app.logger.setLevel(logging.INFO)

WEBHOOK_SECRET = os.getenv('WEBHOOK_SECRET')

def verify_webhook_signature(payload: bytes, signature: str) -> bool:
    """Verify the webhook came from GitLab.

    The X-Gitlab-Token header carries the configured secret token verbatim,
    not an HMAC of the body, so the header is compared to the secret directly.
    `payload` is kept only so the call site stays unchanged.
    """
    if not WEBHOOK_SECRET:
        return False  # fail closed if the secret is not configured
    return hmac.compare_digest(WEBHOOK_SECRET.encode(), signature.encode())

def trigger_test_pipeline(repo_url: str, branch: str, commit_sha: str) -> bool:
    app.logger.info(f"Pipeline: {repo_url}@{branch} ({commit_sha[:8]})")

    # kubectl create job has no --env flag, so the values are shell-quoted
    # into the command. python:3.9-slim ships without git, so install it first.
    script = (
        'apt-get update && apt-get install -y --no-install-recommends git && '
        f'git clone {shlex.quote(repo_url)} -b {shlex.quote(branch)} /app && cd /app && '
        f'git checkout {shlex.quote(commit_sha)} && '  # test exactly the pushed commit
        'pip install -r requirements.txt && python -m pytest tests/ -v'
    )

    result = subprocess.run([
        'kubectl', 'create', 'job', f'test-{commit_sha[:8]}',
        '--image=python:3.9-slim',
        '--namespace=test',
        '--', 'sh', '-c', script
    ], capture_output=True, text=True, timeout=30)

    if result.returncode != 0:
        app.logger.error(f"Pipeline trigger failed: {result.stderr}")
        return False

    return True

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    signature = request.headers.get('X-Gitlab-Token', '')
    if not verify_webhook_signature(request.get_data(), signature):
        app.logger.warning("Invalid webhook token, rejected")
        return "Invalid signature", 401

    # silent=True returns None instead of raising on a malformed or non-JSON body
    payload = request.get_json(silent=True)
    if not payload or not all(k in payload for k in ('repository', 'ref', 'after')):
        return "Invalid payload", 400

    # Strip the refs/heads/ prefix; split('/')[-1] would truncate feature/login to login
    ref = payload['ref']
    branch = ref[len('refs/heads/'):] if ref.startswith('refs/heads/') else ref
    commit_sha = payload['after']
    repo_url = payload['repository']['git_http_url']

    if not trigger_test_pipeline(repo_url, branch, commit_sha):
        # Report the failure instead of answering 202 Accepted
        return jsonify({"status": "failed"}), 500
    return jsonify({"status": "started"}), 202

On ssl_context='adhoc': The Flask dev server supports ssl_context='adhoc' for quick local TLS testing — it’s not for production. In production, run Flask behind Nginx or a Kubernetes Ingress controller with a proper cert. The orchestration server itself runs on HTTP internally; TLS termination happens at the ingress layer.

Webhook signature verification is not optional. Without it, any HTTP client that discovers your endpoint can trigger a deployment. I skipped this initially and a web crawler hit the endpoint and triggered a pipeline against stale code. Verify signatures before touching the payload.
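
A minimal sketch of that check in isolation, assuming GitLab's secret-token scheme, where the token configured on the webhook arrives verbatim in the X-Gitlab-Token header. The function name is_trusted_webhook and the sample secret are illustrative and not part of the original server.

```python
import hmac
from typing import Mapping, Optional


def is_trusted_webhook(headers: Mapping[str, str], secret: Optional[str]) -> bool:
    """Return True only if the request carries the expected secret token."""
    if not secret:
        # Fail closed: an unset secret must never mean "accept everything".
        return False
    # A plain dict is case-sensitive; Flask's request.headers is not.
    token = headers.get("X-Gitlab-Token", "")
    # compare_digest avoids leaking the token contents through response timing.
    return hmac.compare_digest(token.encode(), secret.encode())


# A crawler that stumbles on the endpoint sends no token and is rejected.
print(is_trusted_webhook({}, "s3cret"))                            # False
print(is_trusted_webhook({"X-Gitlab-Token": "s3cret"}, "s3cret"))  # True
```

The important property is that every failure path returns False before the payload is parsed.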


Kubernetes Test Environment

Namespace and Resource Quotas

kubectl create namespace test

apiVersion: v1
kind: ResourceQuota
metadata:
  name: test-quota
  namespace: test
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "10"

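The limits sit on the namespace, not just on individual pods. A test with an infinite loop and no limits will consume all available cluster memory. Ask me how I know. One side effect to plan for: once a quota covers requests and limits, Kubernetes rejects any pod in the namespace that does not declare them, unless a LimitRange supplies defaults.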

Test Runner Job

apiVersion: batch/v1
kind: Job
metadata:
  name: test-runner
  namespace: test
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: test-runner
        image: python:3.9-slim
        env:
        - name: REPO_URL
          value: "https://gitlab.com/your-repo.git"
        - name: BRANCH
          value: "main"
        - name: COMMIT_SHA
          value: "abc123"
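        command: ["sh", "-c"]
        args:
          - |
            set -e
            # python:3.9-slim ships without git, so install it before cloning
            apt-get update && apt-get install -y --no-install-recommends git
            git clone "$REPO_URL" -b "$BRANCH" /test-code
            cd /test-code
            pip install -r requirements.txt
            python -m pytest tests/ -v --junitxml=/test-results/test-results.xml
        volumeMounts:
        # Mount the emptyDir declared below so the JUnit report has somewhere to go
        - name: test-results
          mountPath: /test-results
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: test-results
        emptyDir: {}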

ttlSecondsAfterFinished: 3600 — finished jobs clean themselves up after an hour. Without this, failed jobs accumulate and eventually exhaust the namespace quota, blocking new jobs from starting.

set -e in the shell script — exits on any non-zero return code. Without it, a failed pip install continues to the test run and you get misleading failures.
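
The namespace quota and the runner's per-pod resources together put a ceiling on how many test jobs can run at once. A rough sketch of that arithmetic, using the values from the two manifests above. The parser is an assumption for illustration: it handles only the suffixes that appear there (m, Mi, Gi), not the full Kubernetes quantity syntax.

```python
from fractions import Fraction
from typing import Dict

# Only the suffixes used in this article's manifests.
_SUFFIXES = {"m": Fraction(1, 1000), "Mi": Fraction(2**20), "Gi": Fraction(2**30)}


def parse_quantity(quantity: str) -> Fraction:
    """Convert strings such as "250m", "512Mi" or "2" to an exact number."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Fraction(quantity[: -len(suffix)]) * factor
    return Fraction(quantity)


QUOTA = {"requests.cpu": "2", "requests.memory": "4Gi",
         "limits.cpu": "4", "limits.memory": "8Gi", "pods": "10"}
RUNNER_POD = {"requests.cpu": "250m", "requests.memory": "256Mi",
              "limits.cpu": "500m", "limits.memory": "512Mi", "pods": "1"}


def max_concurrent_pods(quota: Dict[str, str], pod: Dict[str, str]) -> int:
    """The tightest quota dimension decides how many pods fit."""
    return min(int(parse_quantity(quota[k]) // parse_quantity(pod[k])) for k in quota)


print(max_concurrent_pods(QUOTA, RUNNER_POD))  # 8
```

With these numbers CPU is the bottleneck: 2 cores of requests divided by 250m per pod allows 8 runners, below the pods: "10" cap.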


RBAC — Minimum Permissions

The orchestration server needs specific permissions in specific namespaces. Not cluster-admin. Not wildcard resources.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: orchestration-sa
  namespace: orchestration
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: test
  name: test-job-manager
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "delete", "list"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: production-deployer
rules:
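- apiGroups: ["apps"]
  resources: ["deployments"]
  # create/patch for kubectl apply, watch for kubectl rollout status
  verbs: ["get", "list", "watch", "create", "patch"]
- apiGroups: [""]
  resources: ["services"]
  # patch is needed to switch the Service selector between blue and green
  verbs: ["get", "list", "patch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: orchestration-test-binding
  namespace: test
subjects:
- kind: ServiceAccount
  name: orchestration-sa
  namespace: orchestration
roleRef:
  kind: Role
  name: test-job-manager
  apiGroup: rbac.authorization.k8s.io
---
# The production Role has no effect until it is bound as well
# (binding name chosen here to mirror the test binding)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: orchestration-production-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: orchestration-sa
  namespace: orchestration
roleRef:
  kind: Role
  name: production-deployer
  apiGroup: rbac.authorization.k8s.io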

The orchestration service account can create and delete jobs in test, and update deployments in production. It cannot read secrets, delete namespaces, or touch any other namespace. If the orchestration server is compromised, the blast radius is limited to the verbs listed in the Roles, in those two namespaces. A Role only takes effect once a RoleBinding in the same namespace ties it to the service account.
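
These permission claims can be checked against a live cluster with kubectl auth can-i and impersonation. A sketch under two assumptions: the caller is allowed to impersonate service accounts (a cluster admin usually is), and both Roles are bound. The EXPECTED table and helper names are illustrative.

```python
import subprocess
from typing import List

SERVICE_ACCOUNT = "system:serviceaccount:orchestration:orchestration-sa"

# (verb, resource, namespace, expected) derived from the Roles above.
EXPECTED = [
    ("create", "jobs", "test", True),
    ("patch", "deployments", "production", True),
    ("get", "secrets", "production", False),
    ("delete", "namespaces", "default", False),
]


def can_i_command(verb: str, resource: str, namespace: str) -> List[str]:
    """Build a `kubectl auth can-i` call that impersonates the service account."""
    return ["kubectl", "auth", "can-i", verb, resource,
            f"--namespace={namespace}", f"--as={SERVICE_ACCOUNT}"]


def is_allowed(verb: str, resource: str, namespace: str) -> bool:
    """Run the check; kubectl prints "yes" or "no" on stdout."""
    result = subprocess.run(can_i_command(verb, resource, namespace),
                            capture_output=True, text=True, timeout=30)
    return result.stdout.strip() == "yes"


def audit() -> List[str]:
    """Return one message per check whose live result differs from EXPECTED."""
    return [
        f"{verb} {resource} in {ns}: expected allowed={want}"
        for verb, resource, ns, want in EXPECTED
        if is_allowed(verb, resource, ns) != want
    ]
```

Calling audit() from a machine with cluster access returns an empty list when the live permissions match the table.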


Network Policy — Namespace Isolation

Test jobs should not be able to reach production databases, internal services, or anything outside what they need.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-namespace-isolation
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: orchestration
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: production
    ports:
    - protocol: TCP
      port: 443

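Ingress to the test namespace: only from namespaces labeled name=orchestration. Egress from test: only to namespaces labeled name=production, on TCP 443. A compromised test job cannot reach a production database or internal API on any other port. Three caveats: the policy is only enforced if the cluster's network plugin supports NetworkPolicy; the name labels must be added to the namespaces explicitly, since Kubernetes sets only kubernetes.io/metadata.name automatically; and because every other egress path is denied, DNS lookups and the git clone and pip install steps need their own egress rules before the test job can run.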
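
One way to confirm the isolation is to run a small probe from inside a pod in the test namespace and see which targets still accept connections. A sketch using only the standard library; the hostnames and ports in SHOULD_BE_BLOCKED are hypothetical examples and need to be replaced with real services.

```python
import socket
from typing import List, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed DNS lookups.
        return False


# Hypothetical targets that the policy above is meant to cut off.
SHOULD_BE_BLOCKED: List[Tuple[str, int]] = [
    ("postgres.production.svc.cluster.local", 5432),
    ("internal-api.default.svc.cluster.local", 8080),
]


def leaks() -> List[Tuple[str, int]]:
    """Targets that are reachable even though the policy should block them."""
    return [(host, port) for host, port in SHOULD_BE_BLOCKED if can_connect(host, port)]
```

An empty result from leaks() is necessary but not sufficient: it only covers the targets that were listed.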


Blue-Green Production Deployment

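def deploy_to_production(image_tag: str) -> bool:
    # Every command targets production explicitly instead of the current kubectl context
    namespace = '--namespace=production'

    try:
        # Apply green deployment
        subprocess.run([
            'kubectl', 'apply', namespace, '-f',
            f'deployment-green-{image_tag}.yaml'
        ], check=True)

        # Wait for green to be healthy
        subprocess.run([
            'kubectl', 'rollout', 'status',
            'deployment/green-deployment',
            namespace, '--timeout=300s'
        ], check=True)
    except subprocess.CalledProcessError as exc:
        # Green never became healthy: the Service still selects blue, nothing to undo
        app.logger.error(f"Green rollout failed, traffic stays on blue: {exc}")
        return False

    # Cut traffic to green
    subprocess.run([
        'kubectl', 'patch', 'service', 'my-app', namespace,
        '-p', '{"spec":{"selector":{"version":"green"}}}'
    ], check=True)

    # Blue stays running for immediate rollback
    app.logger.info(f"Deployed {image_tag}, blue retained for rollback")
    return True
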
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  minReadySeconds: 10
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:v1.2.3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

maxUnavailable: 0 — no pods go down before new ones are up. Zero-downtime rollout. revisionHistoryLimit: 3 — keeps the last three ReplicaSets for fast rollback. minReadySeconds: 10 — pod must be ready for 10 seconds before Kubernetes considers it stable. Prevents a pod that crashes after startup from being counted as healthy.

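Rollback is automatic because traffic only moves after green passes its health checks: if kubectl rollout status fails or times out, the Service selector is never patched and blue keeps serving. Kubernetes does not revert a failed Deployment on its own; it stalls the rollout and, with maxUnavailable: 0, leaves the old pods running. Blue stays up until green is confirmed stable, so switching back after a cutover is a single selector patch.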
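
The switch back to blue can be sketched as the mirror image of the cutover. This assumes the blue Deployment's pods carry the label version: blue, which the original manifests do not show; smoke_test and the helper names are hypothetical.

```python
import json
import subprocess
from typing import Callable, List


def selector_patch(version: str) -> str:
    """Patch body that points the Service at one colour."""
    return json.dumps({"spec": {"selector": {"version": version}}})


def switch_command(version: str, service: str = "my-app",
                   namespace: str = "production") -> List[str]:
    """kubectl invocation that moves traffic to the given colour."""
    return ["kubectl", "patch", "service", service,
            f"--namespace={namespace}", "-p", selector_patch(version)]


def switch_traffic(version: str) -> None:
    """Apply the selector patch; raises CalledProcessError if kubectl fails."""
    subprocess.run(switch_command(version), check=True, timeout=30)


def cutover_with_rollback(smoke_test: Callable[[], bool],
                          switch: Callable[[str], None] = switch_traffic) -> str:
    """Move traffic to green, verify it, and move back to blue on failure."""
    switch("green")
    if smoke_test():
        return "green"
    switch("blue")
    return "blue"
```

Passing switch as a parameter keeps the decision logic testable without a cluster.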


Troubleshooting Reference

Webhook timeout — GitLab reports failure:

# Flask dev server can't handle concurrent long-running requests
# Use a production WSGI server
from gevent.pywsgi import WSGIServer
http_server = WSGIServer(('0.0.0.0', 5000), app)
http_server.serve_forever()
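
If the handler ever does more than create a Job (for example, waiting on a kubectl command), another option alongside a production WSGI server is to hand the work to a background pool so the HTTP response returns immediately. A sketch using only the standard library; accept_event and the pool size are illustrative choices, not part of the original server.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Small pool: each task only waits on an external command, it does no heavy work.
executor = ThreadPoolExecutor(max_workers=4)


def accept_event(run_pipeline: Callable[..., Any], *args: Any) -> Future:
    """Queue the pipeline and return at once so the webhook can answer 202."""
    return executor.submit(run_pipeline, *args)


def slow_pipeline(commit_sha: str) -> str:
    """Stand-in for a trigger that takes a while to finish."""
    time.sleep(0.2)
    return commit_sha


if __name__ == "__main__":
    started = time.monotonic()
    future = accept_event(slow_pipeline, "abc12345")
    print(f"accepted after {time.monotonic() - started:.3f}s")
    print("finished:", future.result(timeout=5))
```

The trade-off is that failures surface only in the logs or in the returned Future, not in the webhook response.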

Jobs stuck in Pending — resource quota exhausted:

kubectl describe resourcequota test-quota -n test
# Check which resources are at limit
kubectl delete jobs --field-selector status.successful=1 -n test

ttlSecondsAfterFinished prevents this accumulation if set correctly from the start.

ImagePullBackOff on test jobs:

# Attach pull secret to the test runner service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-runner-sa
  namespace: test
imagePullSecrets:
- name: registry-credentials

Git authentication in Kubernetes jobs:

volumes:
- name: git-credentials
  secret:
    secretName: git-credentials
    defaultMode: 0400

Mount credentials as a read-only secret volume. Never pass credentials as literal environment variable values, because they show up in kubectl describe pod output.
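
Inside the container, each key of a mounted Secret appears as a file, so the job reads the credential from disk instead of the environment. A small sketch; the mount path /etc/git-credentials and the key name token are assumptions, since the original shows only the volume definition.

```python
from pathlib import Path


def read_git_token(mount_dir: str = "/etc/git-credentials", key: str = "token") -> str:
    """Read one key of a mounted Secret; each key is exposed as a file."""
    token = (Path(mount_dir) / key).read_text().strip()
    if not token:
        raise ValueError(f"empty credential in {mount_dir}/{key}")
    return token
```

Reading the file at the point of use also means a rotated Secret is picked up once the kubelet refreshes the mounted files.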


Results

| Metric | Manual | Automated |
| --- | --- | --- |
| Deployment time | 15–30 min | 2–5 min |
| Deployment frequency | Weekly | Multiple daily |
| Rollback time | 10–15 min | 30 seconds |
| Failed production deployments | Regular | 2 in 150+ runs, both auto-rolled back |

150+ pipeline runs over the project lifetime. The two production failures were both caught by health checks and rolled back automatically before they affected users.


What I’d Do Differently

External secrets manager from the start. Git credentials and webhook secrets mounted as Kubernetes secrets work, but Secrets Manager or Vault with rotation is the right answer for anything beyond a personal project.

Pipeline observability from day one. I added metrics and alerting after the fact. Build time, success rate, and failure reasons should be instrumented before the pipeline handles real workloads.

Staging environment between test and production. The current setup goes test → production. A staging namespace that mirrors production configuration would catch environment-specific failures before they reach prod.


Source

Full code and manifests on GitHub.


Tags

#Infrastructure #Kubernetes #GitOps #CI/CD #PlatformEngineering #Python


About the Author

Elijah Udom (elijahu) is an Infrastructure & Cloud Engineer based in Lagos, Nigeria. AWS, Kubernetes, eBPF security, AI/ML infrastructure. Building in the open.
