Runbooks
Operational runbooks for common incidents including cluster issues, deployment failures, webhook problems, and governance violations.
Runbook Usage Model
Start from the observed symptom, not a guessed root cause. Collect evidence before taking any high-risk action. Prefer the smallest safe remediation scope first. Close every incident with a preventive action and clear ownership.
Runbook 1: Cluster Agent Disconnected
Severity: Critical (production), Warning (non-production)
Symptoms:
- Dashboard shows cluster status as Disconnected (red indicator)
- Agent heartbeat age exceeds 5 minutes
- Deployments to the affected cluster are failing or timing out
- Drift detection has stopped reporting for the cluster
Check Agent Pod Status — Connect to the cluster via kubectl and verify the agent pod is running:
kubectl get pods -n gitopshq-system
kubectl describe pod -n gitopshq-system -l app=gitopshq-agent
Look for: CrashLoopBackOff, OOMKilled, ImagePullBackOff, or Pending state.
Review Agent Logs — Pull recent logs from the agent pod:
kubectl logs -n gitopshq-system -l app=gitopshq-agent --tail=200
Look for: connection refused errors, TLS handshake failures, authentication errors, or timeout messages.
Verify Network Egress — The agent requires outbound HTTPS access to the GitOpsHQ control plane (api.gitopshq.io:443). Check network policies, firewall rules, and proxy settings:
kubectl run netcheck --rm -it --image=curlimages/curl -- \
  curl -v https://api.gitopshq.io/healthz
Check Token Validity — If the agent logs show authentication errors, the registration token may have expired or been rotated. Navigate to Clusters → [cluster] → Settings in GitOpsHQ and check the token status.
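If the registration token is a JWT (an assumption: GitOpsHQ tokens may be opaque, in which case only the UI check applies), its expiry can be inspected locally. A POSIX-shell sketch:

```shell
#!/bin/sh
# Print the "exp" claim (unix seconds) of a JWT-style token.
token_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2)
  # restore the base64 padding that JWT encoding strips
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  # translate base64url to base64, decode, and extract the exp claim
  printf '%s' "$payload" | tr '_-' '/+' | base64 -d | sed 's/.*"exp":\([0-9]*\).*/\1/'
}
```

Compare the printed value against `date -u +%s`; a smaller value means the token has expired.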
Restart the Agent — If the pod is running but not connecting, restart it:
kubectl rollout restart deployment/gitopshq-agent -n gitopshq-system
Wait 60 seconds and check the dashboard for reconnection.
Re-register if Necessary — If the token is invalid or the agent cannot reconnect after restart, re-register the cluster in GitOpsHQ. Generate a new registration token and reinstall the agent Helm chart with the new token.
Resolution: Agent pod is running, heartbeat is fresh (< 60s), and cluster status shows Connected.
Prevention:
- Set up alerting on agent heartbeat age > 2 minutes
- Monitor agent pod resource usage — set appropriate resource limits to prevent OOM kills
- Document network egress requirements for infrastructure teams
- Include agent health check in regular cluster maintenance procedures
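The heartbeat alert in the first prevention item can be prototyped in shell before wiring it into a monitoring system. This sketch assumes GNU date and ISO-8601 UTC timestamps; how you obtain the last-heartbeat time (dashboard or API) is left open:

```shell
#!/bin/sh
# Alert when the heartbeat age exceeds the suggested 2-minute threshold.
# $1 = last heartbeat timestamp, $2 = current time (both ISO-8601 UTC).
heartbeat_alert() {
  last=$(date -u -d "$1" +%s)   # GNU date syntax
  now=$(date -u -d "$2" +%s)
  age=$((now - last))
  if [ "$age" -gt 120 ]; then
    echo "ALERT: heartbeat age ${age}s"
  else
    echo "ok: heartbeat age ${age}s"
  fi
}
```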
Runbook 2: Release Deployment Failed
Severity: High (production), Medium (non-production)
Symptoms:
- Release status shows Failed in the release timeline
- Workload sync status shows Error or OutOfSync
- Pods are not starting or are in error state after the release
Check Delivery Generator Output — Navigate to the failed release and open the delivery generator logs. Look for:
- Template rendering errors (missing values, invalid YAML syntax)
- Variable substitution failures (undefined HQ Variables)
- Schema validation errors
Review Manifest Validation — If the delivery generator succeeded but the cluster rejected the manifests, check the K8s validation results. Common failures:
- Resource quota exceeded
- Invalid API version (e.g., deprecated API)
- Missing required fields (e.g., container image not specified)
Check Cluster Sync Status — Navigate to Clusters → [cluster] and check the sync detail for the affected workload. Look for:
- Sync errors with specific resource names and error messages
- Namespace not found (namespace may not exist on the cluster)
- RBAC errors (agent may lack permissions to create certain resources)
Inspect Pod Events — If manifests were applied but pods are failing, use diagnostics to view pod events:
- FailedScheduling — insufficient resources or node affinity issues
- FailedMount — missing secrets or configmaps
- CrashLoopBackOff — application error, check container logs
- ImagePullBackOff — image not found or registry authentication failure
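The event reasons above can be scripted into a quick triage helper. This is illustrative, not part of GitOpsHQ:

```shell
#!/bin/sh
# Map pod event reasons (one per line on stdin) to triage hints.
triage_events() {
  while read -r reason; do
    case "$reason" in
      FailedScheduling) echo "$reason: insufficient resources or node affinity" ;;
      FailedMount)      echo "$reason: missing secret or configmap" ;;
      CrashLoopBackOff) echo "$reason: application error, check container logs" ;;
      ImagePullBackOff) echo "$reason: image not found or registry auth failure" ;;
      *)                echo "$reason: inspect kubectl describe pod output" ;;
    esac
  done
}
```

Feed it with something like `kubectl get events -n <namespace> -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}'`.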
Compare with Previous Successful Release — Use the release diff to identify what changed between the last successful release and the failed one. This often reveals the exact change that caused the failure.
Fix and Retry or Rollback — Based on the root cause:
- Bad values: Fix the values and create a new release
- Missing prerequisite: Create the missing resource (secret, configmap, namespace) and retry
- Application bug: Rollback to the previous release and fix the application code
Resolution: Release reaches Healthy status with all pods running, or a rollback restores the previous healthy state.
Prevention:
- Always use manifest preview before deploying to production
- Configure K8s validation as a pre-deploy gate
- Ensure CI pipelines validate manifests against the target cluster's API schema
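The CI-side check can start very small. Here is a toy grep lint for two failure modes from this runbook (missing container image, floating latest tag); it is illustrative and no substitute for server-side validation or manifest preview:

```shell
#!/bin/sh
# Toy manifest lint: fail on a missing container image or a :latest tag.
lint_manifest() {
  grep -q 'image:' "$1" || { echo "FAIL: no container image specified"; return 1; }
  if grep -Eq 'image:.*:latest' "$1"; then
    echo "FAIL: image uses the :latest tag"
    return 1
  fi
  echo "PASS"
}
```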
- Require staging deployment before production promotion
Runbook 3: Promotion Blocked
Severity: Medium
Symptoms:
- Promotion pipeline shows a stage as Blocked
- The promotion cannot advance to the next environment
- Releases are accumulating in the approval queue
Check Environment Freeze Status — Navigate to the target environment's settings. If a freeze is active, the promotion will be blocked until the freeze is lifted. Check who created the freeze and when it expires.
Verify Approval Prerequisites — Open the promotion detail and check the approval status:
- Are all required approvals received?
- Does the approval policy require approvals from distinct teams?
- Is the requestor trying to self-approve (blocked if policy prohibits it)?
Check for OPA Policy Violations — If the target environment has OPA policies in enforce mode, check for policy violations. The promotion detail page shows any violations with explanations.
Verify Source Environment Health — Some promotion policies require the source environment to be in a healthy state before promotion. Check if the source environment has any drift, failed syncs, or unhealthy pods.
Check for Conflicting Releases — If another release is currently being deployed to the target environment, the promotion may be queued. Wait for the current deployment to complete.
Resolve the Block — Based on the cause:
| Cause | Resolution |
|---|---|
| Environment freeze | Lift the freeze (if appropriate) or wait for expiration |
| Missing approvals | Notify approvers or escalate |
| OPA violation | Fix the violating values and re-evaluate |
| Source unhealthy | Fix the source environment first |
| Conflicting release | Wait for completion |
Resolution: Promotion advances to the next stage and deploys successfully.
Prevention:
- Plan deployment windows around known freeze schedules
- Ensure sufficient approvers are available during deployment windows
- Run OPA evaluation in preview mode before creating promotion requests
- Avoid queuing multiple releases to the same environment simultaneously
Runbook 4: Webhook Delivery Failing
Severity: Medium
Symptoms:
- Team not receiving deployment notifications in Slack/Teams
- CI/CD webhook callbacks not triggering
- Webhook delivery logs showing failures
- Stale notification data (last successful delivery was hours ago)
Check Webhook Configuration — Navigate to Settings → Webhooks and verify the endpoint URL, authentication method, and event subscriptions are correct.
Review Delivery Logs — Open the webhook delivery history. For each failed delivery, check:
- HTTP response code (4xx = client error, 5xx = server error)
- Response body (often contains error details)
- Request timing (timeouts indicate endpoint is slow or unreachable)
Verify Endpoint Health — Test the webhook endpoint independently:
curl -X POST https://your-webhook-endpoint.example.com/gitopshq \
-H "Content-Type: application/json" \
-d '{"test": true}'
If the endpoint returns errors or times out, the issue is on the receiving side.
Check HMAC Secret — If the webhook uses HMAC signature verification, ensure the signing secret in GitOpsHQ matches the verification secret on the receiving end. A mismatch causes 401/403 errors.
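To rule out a secret mismatch, compute the signature yourself over a payload captured from the delivery log and compare it to the signature GitOpsHQ sent. This sketch assumes HMAC-SHA256 over the raw request body; the exact signature header name and encoding (hex vs. base64) are product-specific, so check the webhook documentation:

```shell
#!/bin/sh
# Compute the HMAC-SHA256 hex digest of a payload with a given secret.
# $1 = raw payload body, $2 = signing secret.
sign_payload() {
  printf '%s' "$1" | openssl dgst -sha256 -hmac "$2" | sed 's/^.*= //'
}
```

If the digest differs from the one in the delivery log, the secrets are out of sync on one side.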
Test Delivery — Use the Test Delivery button in the webhook configuration to send a test event. Verify it arrives at the endpoint and is processed correctly.
Check Rate Limits — If you are sending a high volume of webhooks (e.g., during a batch promotion), the receiving endpoint may be rate-limiting. Check for 429 responses in the delivery logs.
Resolution: Webhook deliveries succeed consistently with 2xx responses.
Prevention:
- Monitor webhook delivery success rate with alerting on drops below 99%
- Use webhook endpoints that return quickly (< 5s) and process events asynchronously
- Rotate HMAC secrets on a schedule and update both sides simultaneously
- Configure retry policies for transient failures
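A retry policy for transient failures typically means exponential backoff. A minimal sketch of the idea (GitOpsHQ's own retry behavior is configured in the webhook settings; this is a generic illustration):

```shell
#!/bin/sh
# Retry a delivery command with exponential backoff (1s, 2s, 4s, ...).
# $1 = max attempts; remaining arguments are the command to run.
deliver_with_retry() {
  max=$1; shift
  delay=1
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  return 1
}
```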
Runbook 5: Policy Violations Unresolved
Severity: Medium (blocking deployments)
Symptoms:
- Releases are blocked by OPA policy violations
- Violation messages appear in the release detail
- Teams are unable to deploy to policy-enforced environments
Review the Violation Details — Open the blocked release and read the policy violation messages. Each violation includes:
- The policy name and rule that was violated
- A human-readable message explaining the violation
- The specific input values that triggered the violation
Understand the Policy Intent — Navigate to Policies → OPA and find the policy that generated the violation. Read the Rego rule to understand what it enforces and why.
Check if the Values Are Wrong — In most cases, the violation is correct and the values need to be updated. Common violations:
- Missing resource limits on production containers
- Using the `latest` image tag instead of a versioned tag
- Missing required labels or annotations
- Exceeding allowed replica count
Update the Values — Fix the non-compliant values in the workload editor. Use the OPA Workbench to test your updated values against the policy before creating a new release.
Re-evaluate — Create a new release with the fixed values. The OPA evaluation should pass, and the release can proceed to the approval queue.
If the Policy Is Wrong — If you believe the policy itself is incorrect or overly restrictive:
- Document the specific scenario where the policy is inappropriate
- Escalate to the Platform team with the release context and policy details
- Do not use break-glass to bypass policy violations unless it is a genuine emergency
Resolution: Values updated to comply with policies, or policy adjusted by the Platform team.
Prevention:
- Run OPA evaluation in the values editor before creating releases (catches violations early)
- Publish clear documentation for each OPA policy explaining what it enforces and why
- Start with `warn` mode when introducing new policies, then switch to `enforce` after teams adapt
- Provide example-compliant values in the policy documentation
Runbook 6: Drift Detected in Production
Severity: High
Symptoms:
- Dashboard shows Drift Detected for a production environment
- Resource state in the cluster differs from the desired state in the GitOps repository
- May be accompanied by service degradation (if the drift is harmful)
Identify the Drifted Resources — Navigate to the drift detection panel and list all drifted resources. For each resource, note:
- Resource type and name (e.g., Deployment/checkout-api)
- The specific fields that differ
- When the drift was first detected
Determine if the Drift Is Intentional — Common intentional drift scenarios:
- HPA changed replica count (expected — autoscaler is working)
- Emergency manual fix applied via `kubectl` (should be formalized)
- External operator modified a resource (may be expected behavior)
Determine if the Drift Is Unintentional — Common unintentional drift:
- Someone ran `kubectl edit` or `kubectl apply` manually
- A CI/CD pipeline applied changes directly to the cluster (bypassing GitOps)
- A Helm hook or job modified a resource post-deployment
For Intentional Drift — Adopt the Change — Update the desired state in the GitOps repository to match the cluster. This eliminates the drift by making the new state the desired state. Create a release to formalize the change.
For Unintentional Drift — Reconcile — Trigger a sync to force the cluster back to the desired state. The agent will revert the drifted resources to match the GitOps repository.
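Before triggering a sync, it can help to see exactly which fields drifted. A plain-diff stand-in for the agent's comparison, given the desired manifest from the repo and a live export (for example from `kubectl get deploy checkout-api -o yaml`):

```shell
#!/bin/sh
# Print only the changed lines between desired and live manifests.
# $1 = desired state (from the GitOps repo), $2 = live state export.
drift_fields() {
  # keep only +/- content lines, dropping the +++/--- file headers
  diff -u "$1" "$2" | grep -E '^[+-][^+-]' || true
}
```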
Investigate the Cause — Check the cluster's API server audit logs (audit logging must be enabled on the cluster or via your managed provider; there is no `kubectl audit` command) to identify who or what made the direct change. If it was a human, review cluster-level RBAC to prevent future manual edits.
Resolution: Drift indicator is cleared, and the cluster matches the desired state.
Prevention:
- Restrict direct cluster access via RBAC — only the GitOpsHQ agent should modify workload resources
- Configure drift alerting to catch divergence within minutes
- Add HPA-managed fields to the drift ignore list to reduce noise
- Educate teams that all changes must go through the GitOps pipeline
Runbook 7: OCI Registry Connection Failed
Severity: High (blocks chart pulls and deployments)
Symptoms:
- Chart pulls from the registry are failing
- Workload creation fails with "chart not found" or "registry unreachable" errors
- Registry page shows connection errors
Verify Registry Credentials — Navigate to Settings → Registry and check the configured credentials. Ensure the username, password (or token), and registry URL are correct. Credentials may have expired or been rotated.
Test the Connection — Use the Test Connection button in registry settings. If it fails, note the specific error:
- 401 Unauthorized — credentials are invalid
- 403 Forbidden — credentials lack required permissions
- Connection refused — network issue
- TLS handshake failure — certificate issue
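You can reproduce this classification independently of GitOpsHQ by probing the registry's `/v2/` endpoint (the base path of the OCI Distribution API). A sketch, where the registry URL is yours to supply; note that curl reports status 000 when no HTTP response was received at all:

```shell
#!/bin/sh
# Probe a registry's /v2/ endpoint and print the HTTP status code.
registry_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1/v2/"
}
# Classify the status code into the failure buckets above.
classify_status() {
  case "$1" in
    2??) echo "ok" ;;
    401) echo "credentials are invalid" ;;
    403) echo "credentials lack required permissions" ;;
    000) echo "connection refused or TLS failure" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

Usage: `classify_status "$(registry_status https://registry.example.com)"`.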
Check Network Connectivity — If the registry is self-hosted, verify that the GitOpsHQ control plane can reach it. Check firewall rules, VPN connectivity, and DNS resolution.
Verify Registry Health — Check the registry's own health endpoint. If the registry is down or degraded, the issue is not in GitOpsHQ.
Rotate Credentials — If credentials have expired, generate new credentials in the registry provider and update them in GitOpsHQ settings. Re-test the connection.
Check Certificate Chain — If using a private registry with a custom CA, ensure the CA certificate is properly configured in GitOpsHQ's trust store.
Resolution: Registry connection test succeeds, charts can be pulled and published.
Prevention:
- Set up monitoring on registry credential expiration dates
- Use token-based authentication with automatic rotation where possible
- Document the registry network path and firewall requirements
- Test registry connectivity as part of regular operational health checks
Runbook 8: Break-Glass Session Audit
Severity: Review (post-incident)
Symptoms:
- A break-glass session was opened and closed
- Post-incident review needs to validate the actions taken
- Compliance audit requires documentation of emergency access
Locate the Break-Glass Session — Navigate to Audit Log → Break-Glass Sessions. Find the session by date, actor, or scope. Each session record includes:
- Session ID and duration
- Actor who initiated the session
- Justification provided at opening
- Scope (environment, workload)
- All actions taken during the session
- Closing note
Review the Justification — Verify that the justification describes a genuine emergency:
- Is there a corresponding P1/P2 incident?
- Was the normal deployment path genuinely blocked?
- Was the severity classification appropriate?
Verify Actions Taken — Review every action performed during the session:
- Were the actions consistent with the stated justification?
- Were only the minimum necessary changes made?
- Were any changes made outside the declared scope?
Check for Unauthorized Changes — Compare the state before and after the break-glass session:
- Were any resources modified that were not related to the incident?
- Were any permissions or policies changed during the session?
- Were any new service accounts or tokens created?
Verify Post-Session State — Confirm that the system is in a clean state after the session:
- Is the break-glass session properly closed?
- Are all resources in a healthy state?
- Is there any remaining drift from the emergency changes?
Document Findings — Record the audit findings:
| Finding Category | Status |
|---|---|
| Justification adequate | Pass / Fail |
| Actions within scope | Pass / Fail |
| No unauthorized changes | Pass / Fail |
| Post-session state clean | Pass / Fail |
| Preventive actions identified | Yes / No |
Recommend Preventive Actions — Based on the audit, recommend changes to prevent future break-glass usage:
- If the freeze was too broad, narrow it or add emergency exceptions
- If approvals were blocked, add backup approvers or reduce quorum for rollbacks
- If the root cause was preventable, add validation rules or policy gates
Close the Audit — File the audit report with findings, sign-off from the Platform team, and track any recommended preventive actions to completion.
Resolution: Audit report filed with findings and preventive actions tracked.
Prevention:
- Review every break-glass session within 24 hours of closure
- Track break-glass frequency as a key governance metric
- Reduce break-glass necessity by improving normal operational pathways
- Use break-glass audit findings to continuously improve policies and permissions
Incident Closure Flow
Every incident, regardless of which runbook was used, should be closed the same way: confirm the runbook's resolution criteria are met, record the root cause, and assign at least one preventive action to a clear owner.
Runbook Quick Reference
| Runbook | Trigger Symptom | First Action |
|---|---|---|
| Cluster Agent Disconnected | Agent heartbeat > 5min | Check agent pod status |
| Release Deployment Failed | Release status = Failed | Check delivery generator output |
| Promotion Blocked | Promotion stage = Blocked | Check freeze and approval status |
| Webhook Delivery Failing | Webhook success < 99% | Check endpoint health |
| Policy Violations Unresolved | Release blocked by OPA | Review violation details |
| Drift Detected in Production | Drift indicator non-zero | Identify drifted resources |
| OCI Registry Connection Failed | Chart pull errors | Verify credentials |
| Break-Glass Session Audit | Session closed | Review justification and actions |