Runbooks
Operational runbooks for common incidents including cluster issues, deployment failures, webhook problems, and governance violations.
Runbook Usage Model
Start from the observed symptom, not a guessed root cause. Collect evidence before taking any high-risk action. Prefer the smallest safe remediation scope first. Close every incident with a preventive action and clear ownership.
Runbook 1: Cluster Agent Disconnected
Severity: Critical (production), Warning (non-production)
Symptoms:
- Dashboard shows cluster status as Disconnected (red indicator)
- Agent heartbeat age exceeds 5 minutes
- Deployments to the affected cluster are failing or timing out
- Drift detection has stopped reporting for the cluster
Check Agent Pod Status — Connect to the cluster via kubectl and verify the agent pod is running:
kubectl get pods -n gitopshq-system
kubectl describe pod -n gitopshq-system -l app=gitopshq-agent
Look for: CrashLoopBackOff, OOMKilled, ImagePullBackOff, or Pending state.
Review Agent Logs — Pull recent logs from the agent pod:
kubectl logs -n gitopshq-system -l app=gitopshq-agent --tail=200
Look for: connection refused errors, TLS handshake failures, authentication errors, or timeout messages.
Verify Network Egress — The agent requires outbound HTTPS access to the GitOpsHQ control plane (api.gitopshq.io:443). Check network policies, firewall rules, and proxy settings:
kubectl run netcheck --rm -it --image=curlimages/curl -- \
  curl -v https://api.gitopshq.io/healthz
Check Token Validity — If the agent logs show authentication errors, the registration token may have expired or been rotated. Navigate to Clusters → [cluster] → Settings in GitOpsHQ and check the token status.
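If the registration token is a JWT (an assumption: GitOpsHQ tokens may be opaque, in which case only the UI check applies), its expiry can be inspected locally. A POSIX-shell sketch:

```shell
#!/bin/sh
# Print the "exp" claim (unix seconds) of a JWT-style token.
token_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2)
  # restore the base64 padding that JWT encoding strips
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  # translate base64url to base64, decode, and extract the exp claim
  printf '%s' "$payload" | tr '_-' '/+' | base64 -d | sed 's/.*"exp":\([0-9]*\).*/\1/'
}
```

Compare the printed value against `date -u +%s`; a smaller value means the token has expired.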
Restart the Agent — If the pod is running but not connecting, restart it:
kubectl rollout restart deployment/gitopshq-agent -n gitopshq-system
Wait 60 seconds and check the dashboard for reconnection.
Re-register if Necessary — If the token is invalid or the agent cannot reconnect after restart, re-register the cluster in GitOpsHQ. Generate a new registration token and reinstall the agent Helm chart with the new token.
Resolution: Agent pod is running, heartbeat is fresh (< 60s), and cluster status shows Connected.
Prevention:
- Set up alerting on agent heartbeat age > 2 minutes
- Monitor agent pod resource usage — set appropriate resource limits to prevent OOM kills
- Document network egress requirements for infrastructure teams
- Include agent health check in regular cluster maintenance procedures
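The heartbeat alert in the first prevention item can be prototyped in shell before wiring it into a monitoring system. This sketch assumes GNU date and ISO-8601 UTC timestamps; how you obtain the last-heartbeat time (dashboard or API) is left open:

```shell
#!/bin/sh
# Alert when the heartbeat age exceeds the suggested 2-minute threshold.
# $1 = last heartbeat timestamp, $2 = current time (both ISO-8601 UTC).
heartbeat_alert() {
  last=$(date -u -d "$1" +%s)   # GNU date syntax
  now=$(date -u -d "$2" +%s)
  age=$((now - last))
  if [ "$age" -gt 120 ]; then
    echo "ALERT: heartbeat age ${age}s"
  else
    echo "ok: heartbeat age ${age}s"
  fi
}
```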
Runbook 2: Release Deployment Failed
Severity: High (production), Medium (non-production)
Symptoms:
- Release status shows Failed in the release timeline
- Workload sync status shows Error or OutOfSync
- Pods are not starting or are in error state after the release
Check Delivery Generator Output — Navigate to the failed release and open the delivery generator logs. Look for:
- Template rendering errors (missing values, invalid YAML syntax)
- Variable substitution failures (undefined HQ Variables)
- Schema validation errors
Review Manifest Validation — If the delivery generator succeeded but the cluster rejected the manifests, check the K8s validation results. Common failures:
- Resource quota exceeded
- Invalid API version (e.g., deprecated API)
- Missing required fields (e.g., container image not specified)
Check Cluster Sync Status — Navigate to Clusters → [cluster] and check the sync detail for the affected workload. Look for:
- Sync errors with specific resource names and error messages
- Namespace not found (namespace may not exist on the cluster)
- RBAC errors (agent may lack permissions to create certain resources)
Inspect Pod Events — If manifests were applied but pods are failing, use diagnostics to view pod events:
- FailedScheduling — insufficient resources or node affinity issues
- FailedMount — missing secrets or configmaps
- CrashLoopBackOff — application error, check container logs
- ImagePullBackOff — image not found or registry authentication failure
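The event reasons above can be scripted into a quick triage helper. This is illustrative, not part of GitOpsHQ:

```shell
#!/bin/sh
# Map pod event reasons (one per line on stdin) to triage hints.
triage_events() {
  while read -r reason; do
    case "$reason" in
      FailedScheduling) echo "$reason: insufficient resources or node affinity" ;;
      FailedMount)      echo "$reason: missing secret or configmap" ;;
      CrashLoopBackOff) echo "$reason: application error, check container logs" ;;
      ImagePullBackOff) echo "$reason: image not found or registry auth failure" ;;
      *)                echo "$reason: inspect kubectl describe pod output" ;;
    esac
  done
}
```

Feed it with something like `kubectl get events -n <namespace> -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}'`.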
Compare with Previous Successful Release — Use the release diff to identify what changed between the last successful release and the failed one. This often reveals the exact change that caused the failure.
Fix and Retry or Rollback — Based on the root cause:
- Bad values: Fix the values and create a new release
- Missing prerequisite: Create the missing resource (secret, configmap, namespace) and retry
- Application bug: Rollback to the previous release and fix the application code
Resolution: Release reaches Healthy status with all pods running, or a rollback restores the previous healthy state.
Prevention:
- Always use manifest preview before deploying to production
- Configure K8s validation as a pre-deploy gate
- Ensure CI pipelines validate manifests against the target cluster's API schema
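The CI-side check can start very small. Here is a toy grep lint for two failure modes from this runbook (missing container image, floating latest tag); it is illustrative and no substitute for server-side validation or manifest preview:

```shell
#!/bin/sh
# Toy manifest lint: fail on a missing container image or a :latest tag.
lint_manifest() {
  grep -q 'image:' "$1" || { echo "FAIL: no container image specified"; return 1; }
  if grep -Eq 'image:.*:latest' "$1"; then
    echo "FAIL: image uses the :latest tag"
    return 1
  fi
  echo "PASS"
}
```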
- Require staging deployment before production promotion
Runbook 3: Promotion Blocked
Severity: Medium
Symptoms:
- Promotion pipeline shows a stage as Blocked
- The promotion cannot advance to the next environment
- Releases are accumulating in the approval queue
Check Environment Freeze Status — Navigate to the target environment's settings. If a freeze is active, the promotion will be blocked until the freeze is lifted. Check who created the freeze and when it expires.
Verify Approval Prerequisites — Open the promotion detail and check the approval status:
- Are all required approvals received?
- Does the approval policy require approvals from distinct teams?
- Is the requestor trying to self-approve (blocked if policy prohibits it)?
Check for OPA Policy Violations — If the target environment has OPA policies in enforce mode, check for policy violations. The promotion detail page shows any violations with explanations.
Verify Source Environment Health — Some promotion policies require the source environment to be in a healthy state before promotion. Check if the source environment has any drift, failed syncs, or unhealthy pods.
Check for Conflicting Releases — If another release is currently being deployed to the target environment, the promotion may be queued. Wait for the current deployment to complete.
Resolve the Block — Based on the cause:
| Cause | Resolution |
|---|---|
| Environment freeze | Lift the freeze (if appropriate) or wait for expiration |
| Missing approvals | Notify approvers or escalate |
| OPA violation | Fix the violating values and re-evaluate |
| Source unhealthy | Fix the source environment first |
| Conflicting release | Wait for completion |
Resolution: Promotion advances to the next stage and deploys successfully.
Prevention:
- Plan deployment windows around known freeze schedules
- Ensure sufficient approvers are available during deployment windows
- Run OPA evaluation in preview mode before creating promotion requests
- Avoid queuing multiple releases to the same environment simultaneously
Runbook 4: Webhook Delivery Failing
Severity: Medium
Symptoms:
- Team not receiving deployment notifications in Slack/Teams
- CI/CD webhook callbacks not triggering
- Webhook delivery logs showing failures
- Stale notification data (last successful delivery was hours ago)
Check Webhook Configuration — Navigate to Settings → Webhooks and verify the endpoint URL, authentication method, and event subscriptions are correct.
Review Delivery Logs — Open the webhook delivery history. For each failed delivery, check:
- HTTP response code (4xx = client error, 5xx = server error)
- Response body (often contains error details)
- Request timing (timeouts indicate endpoint is slow or unreachable)
Verify Endpoint Health — Test the webhook endpoint independently:
curl -X POST https://your-webhook-endpoint.example.com/gitopshq \
-H "Content-Type: application/json" \
-d '{"test": true}'
If the endpoint returns errors or times out, the issue is on the receiving side.
Check HMAC Secret — If the webhook uses HMAC signature verification, ensure the signing secret in GitOpsHQ matches the verification secret on the receiving end. A mismatch causes 401/403 errors.
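To rule out a secret mismatch, compute the signature yourself over a payload captured from the delivery log and compare it to the signature GitOpsHQ sent. This sketch assumes HMAC-SHA256 over the raw request body; the exact signature header name and encoding (hex vs. base64) are product-specific, so check the webhook documentation:

```shell
#!/bin/sh
# Compute the HMAC-SHA256 hex digest of a payload with a given secret.
# $1 = raw payload body, $2 = signing secret.
sign_payload() {
  printf '%s' "$1" | openssl dgst -sha256 -hmac "$2" | sed 's/^.*= //'
}
```

If the digest differs from the one in the delivery log, the secrets are out of sync on one side.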
Test Delivery — Use the Test Delivery button in the webhook configuration to send a test event. Verify it arrives at the endpoint and is processed correctly.
Check Rate Limits — If you are sending a high volume of webhooks (e.g., during a batch promotion), the receiving endpoint may be rate-limiting. Check for 429 responses in the delivery logs.
Resolution: Webhook deliveries succeed consistently with 2xx responses.
Prevention:
- Monitor webhook delivery success rate with alerting on drops below 99%
- Use webhook endpoints that return quickly (< 5s) and process events asynchronously
- Rotate HMAC secrets on a schedule and update both sides simultaneously
- Configure retry policies for transient failures
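A retry policy for transient failures typically means exponential backoff. A minimal sketch of the idea (GitOpsHQ's own retry behavior is configured in the webhook settings; this is a generic illustration):

```shell
#!/bin/sh
# Retry a delivery command with exponential backoff (1s, 2s, 4s, ...).
# $1 = max attempts; remaining arguments are the command to run.
deliver_with_retry() {
  max=$1; shift
  delay=1
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  return 1
}
```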
Runbook 5: Policy Violations Unresolved
Severity: Medium (blocking deployments)
Symptoms:
- Releases are blocked by OPA policy violations
- Violation messages appear in the release detail
- Teams are unable to deploy to policy-enforced environments
Review the Violation Details — Open the blocked release and read the policy violation messages. Each violation includes:
- The policy name and rule that was violated
- A human-readable message explaining the violation
- The specific input values that triggered the violation
Understand the Policy Intent — Navigate to Policies → OPA and find the policy that generated the violation. Read the Rego rule to understand what it enforces and why.
Check if the Values Are Wrong — In most cases, the violation is correct and the values need to be updated. Common violations:
- Missing resource limits on production containers
- Using the `latest` image tag instead of a versioned tag
- Missing required labels or annotations
- Exceeding allowed replica count
Update the Values — Fix the non-compliant values in the workload editor. Use the OPA Workbench to test your updated values against the policy before creating a new release.
Re-evaluate — Create a new release with the fixed values. The OPA evaluation should pass, and the release can proceed to the approval queue.
If the Policy Is Wrong — If you believe the policy itself is incorrect or overly restrictive:
- Document the specific scenario where the policy is inappropriate
- Escalate to the Platform team with the release context and policy details
- Do not use break-glass to bypass policy violations unless it is a genuine emergency
Resolution: Values updated to comply with policies, or policy adjusted by the Platform team.
Prevention:
- Run OPA evaluation in the values editor before creating releases (catches violations early)
- Publish clear documentation for each OPA policy explaining what it enforces and why
- Start with `warn` mode when introducing new policies, then switch to `enforce` after teams adapt
- Provide example-compliant values in the policy documentation
Runbook 6: Drift Detected in Production
Severity: High
Symptoms:
- Dashboard shows Drift Detected for a production environment
- Resource state in the cluster differs from the desired state in the GitOps repository
- May be accompanied by service degradation (if the drift is harmful)
Identify the Drifted Resources — Navigate to the drift detection panel and list all drifted resources. For each resource, note:
- Resource type and name (e.g., Deployment/checkout-api)
- The specific fields that differ
- When the drift was first detected
Determine if the Drift Is Intentional — Common intentional drift scenarios:
- HPA changed replica count (expected — autoscaler is working)
- Emergency manual fix applied via `kubectl` (should be formalized)
- External operator modified a resource (may be expected behavior)
Determine if the Drift Is Unintentional — Common unintentional drift:
- Someone ran `kubectl edit` or `kubectl apply` manually
- A CI/CD pipeline applied changes directly to the cluster (bypassing GitOps)
- A Helm hook or job modified a resource post-deployment
For Intentional Drift — Adopt the Change — Update the desired state in the GitOps repository to match the cluster. This eliminates the drift by making the new state the desired state. Create a release to formalize the change.
For Unintentional Drift — Reconcile — Trigger a sync to force the cluster back to the desired state. The agent will revert the drifted resources to match the GitOps repository.
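Before triggering a sync, it can help to see exactly which fields drifted. A plain-diff stand-in for the agent's comparison, given the desired manifest from the repo and a live export (for example from `kubectl get deploy checkout-api -o yaml`):

```shell
#!/bin/sh
# Print only the changed lines between desired and live manifests.
# $1 = desired state (from the GitOps repo), $2 = live state export.
drift_fields() {
  # keep only +/- content lines, dropping the +++/--- file headers
  diff -u "$1" "$2" | grep -E '^[+-][^+-]' || true
}
```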
Investigate the Cause — Check the cluster's API server audit logs (audit logging must be enabled on the cluster or via your managed provider; there is no `kubectl audit` command) to identify who or what made the direct change. If it was a human, review cluster-level RBAC to prevent future manual edits.
Resolution: Drift indicator is cleared, and the cluster matches the desired state.
Prevention:
- Restrict direct cluster access via RBAC — only the GitOpsHQ agent should modify workload resources
- Configure drift alerting to catch divergence within minutes
- Add HPA-managed fields to the drift ignore list to reduce noise
- Educate teams that all changes must go through the GitOps pipeline
Runbook 7: OCI Registry Connection Failed
Severity: High (blocks chart pulls and deployments)
Symptoms:
- Chart pulls from the registry are failing
- Workload creation fails with "chart not found" or "registry unreachable" errors
- Registry page shows connection errors
Verify Registry Credentials — Navigate to Settings → Registry and check the configured credentials. Ensure the username, password (or token), and registry URL are correct. Credentials may have expired or been rotated.
Test the Connection — Use the Test Connection button in registry settings. If it fails, note the specific error:
- 401 Unauthorized — credentials are invalid
- 403 Forbidden — credentials lack required permissions
- Connection refused — network issue
- TLS handshake failure — certificate issue
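You can reproduce this classification independently of GitOpsHQ by probing the registry's `/v2/` endpoint (the base path of the OCI Distribution API). A sketch, where the registry URL is yours to supply; note that curl reports status 000 when no HTTP response was received at all:

```shell
#!/bin/sh
# Probe a registry's /v2/ endpoint and print the HTTP status code.
registry_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1/v2/"
}
# Classify the status code into the failure buckets above.
classify_status() {
  case "$1" in
    2??) echo "ok" ;;
    401) echo "credentials are invalid" ;;
    403) echo "credentials lack required permissions" ;;
    000) echo "connection refused or TLS failure" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

Usage: `classify_status "$(registry_status https://registry.example.com)"`.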
Check Network Connectivity — If the registry is self-hosted, verify that the GitOpsHQ control plane can reach it. Check firewall rules, VPN connectivity, and DNS resolution.
Verify Registry Health — Check the registry's own health endpoint. If the registry is down or degraded, the issue is not in GitOpsHQ.
Rotate Credentials — If credentials have expired, generate new credentials in the registry provider and update them in GitOpsHQ settings. Re-test the connection.
Check Certificate Chain — If using a private registry with a custom CA, ensure the CA certificate is properly configured in GitOpsHQ's trust store.
Resolution: Registry connection test succeeds, charts can be pulled and published.
Prevention:
- Set up monitoring on registry credential expiration dates
- Use token-based authentication with automatic rotation where possible
- Document the registry network path and firewall requirements
- Test registry connectivity as part of regular operational health checks
Runbook 8: Break-Glass Session Audit
Severity: Review (post-incident)
Symptoms:
- A break-glass session was opened and closed
- Post-incident review needs to validate the actions taken
- Compliance audit requires documentation of emergency access
Locate the Break-Glass Session — Navigate to Audit Log → Break-Glass Sessions. Find the session by date, actor, or scope. Each session record includes:
- Session ID and duration
- Actor who initiated the session
- Justification provided at opening
- Scope (environment, workload)
- All actions taken during the session
- Closing note
Review the Justification — Verify that the justification describes a genuine emergency:
- Is there a corresponding P1/P2 incident?
- Was the normal deployment path genuinely blocked?
- Was the severity classification appropriate?
Verify Actions Taken — Review every action performed during the session:
- Were the actions consistent with the stated justification?
- Were only the minimum necessary changes made?
- Were any changes made outside the declared scope?
Check for Unauthorized Changes — Compare the state before and after the break-glass session:
- Were any resources modified that were not related to the incident?
- Were any permissions or policies changed during the session?
- Were any new service accounts or tokens created?
Verify Post-Session State — Confirm that the system is in a clean state after the session:
- Is the break-glass session properly closed?
- Are all resources in a healthy state?
- Is there any remaining drift from the emergency changes?
Document Findings — Record the audit findings:
| Finding Category | Status |
|---|---|
| Justification adequate | Pass / Fail |
| Actions within scope | Pass / Fail |
| No unauthorized changes | Pass / Fail |
| Post-session state clean | Pass / Fail |
| Preventive actions identified | Yes / No |
Recommend Preventive Actions — Based on the audit, recommend changes to prevent future break-glass usage:
- If the freeze was too broad, narrow it or add emergency exceptions
- If approvals were blocked, add backup approvers or reduce quorum for rollbacks
- If the root cause was preventable, add validation rules or policy gates
Close the Audit — File the audit report with findings, sign-off from the Platform team, and track any recommended preventive actions to completion.
Resolution: Audit report filed with findings and preventive actions tracked.
Prevention:
- Review every break-glass session within 24 hours of closure
- Track break-glass frequency as a key governance metric
- Reduce break-glass necessity by improving normal operational pathways
- Use break-glass audit findings to continuously improve policies and permissions
Incident Closure Flow
Every incident, regardless of which runbook was used, should be closed the same way: confirm the runbook's resolution criteria are met, record the root cause, and assign at least one preventive action to a clear owner.
Runbook Quick Reference
| Runbook | Trigger Symptom | First Action |
|---|---|---|
| Cluster Agent Disconnected | Agent heartbeat > 5min | Check agent pod status |
| Release Deployment Failed | Release status = Failed | Check delivery generator output |
| Promotion Blocked | Promotion stage = Blocked | Check freeze and approval status |
| Webhook Delivery Failing | Webhook success < 99% | Check endpoint health |
| Policy Violations Unresolved | Release blocked by OPA | Review violation details |
| Drift Detected in Production | Drift indicator non-zero | Identify drifted resources |
| OCI Registry Connection Failed | Chart pull errors | Verify credentials |
| Break-Glass Session Audit | Session closed | Review justification and actions |