SRE Playbook
Reliability and incident response guide for SREs monitoring cluster health, managing rollbacks, and handling production incidents in GitOpsHQ.
Your Role in GitOpsHQ
As a Site Reliability Engineer, you are the guardian of production stability. You monitor cluster health, detect and respond to incidents, execute rollbacks when deployments go wrong, investigate drift, and use break-glass procedures when normal operations are blocked during emergencies.
Your primary concern is reliability and recovery — ensuring that services stay healthy, incidents are contained quickly, and every recovery action is traceable.
Your Access Scope
You typically hold the Operator role at the organization level, giving you read access to all projects, cluster management capabilities, rollback execution rights, and break-glass authorization. You can view all deployments but may not create releases for non-emergency changes.
Reliability Signals
These are the signals you should monitor continuously. Configure alerts for any that cross their threshold.
| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| Agent heartbeat | < 60s since last ping | 2–5 minutes since last ping | > 5 minutes (agent disconnected) |
| Pod health | All pods Running/Ready | > 10% pods not ready | > 30% pods unhealthy |
| Deployment sync | All synced | 1–2 pending sync | Multiple out of sync |
| Drift count | Zero | 1–2 drifts in non-prod | Any drift in production |
| Rollback rate | < 2% of releases | 2–5% of releases | > 5% (systemic quality issue) |
| Approval queue age | < 30 minutes | 30–60 minutes | > 1 hour (pipeline blocked) |
| Webhook delivery | > 99% success rate | 95–99% success rate | < 95% (notification gaps) |
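Two of these thresholds can be encoded as simple classifiers. The sketch below is illustrative decision logic only (the function names are not a GitOpsHQ API); the numbers come straight from the table above:

```python
# Illustrative sketch: classify signal readings against the thresholds in
# the table above. Not a GitOpsHQ API -- just the decision logic.

def classify_heartbeat(seconds_since_ping: float) -> str:
    """Map seconds since the agent's last ping to a severity level."""
    if seconds_since_ping > 300:      # > 5 minutes: agent disconnected
        return "critical"
    if seconds_since_ping >= 120:     # 2-5 minutes since last ping
        return "warning"
    return "healthy"                  # recent ping

def classify_rollback_rate(rollbacks: int, releases: int) -> str:
    """Rollback rate as a fraction of releases; > 5% is a systemic issue."""
    rate = rollbacks / releases
    if rate > 0.05:
        return "critical"
    if rate >= 0.02:
        return "warning"
    return "healthy"
```

Wiring these into your alerting pipeline lets the same thresholds drive both the dashboard colors and the pager.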
Daily Monitoring Routine
Check Fleet Health Dashboard — Open the GitOpsHQ dashboard and scan the fleet health panel. Every cluster should show Connected with a green indicator. Every production environment should show Healthy with no drift.
Review Overnight Deployments — Check the audit log for any deployments that occurred outside your shift. Filter by the last 12 hours and look for any failures, rollbacks, or break-glass sessions that may need follow-up.
Verify Agent Connectivity — Navigate to Clusters and confirm every registered agent has a recent heartbeat. An agent that missed multiple heartbeats may indicate a network issue or a crashed pod.
Check Drift Status — Review the drift detection panel for any detected drift. Non-production drift may be informational, but production drift requires immediate investigation.
Review Approval Queue — Check for production-bound releases that are waiting for your approval. Aging approvals block the delivery pipeline and force teams to accumulate changes, increasing risk.
Scan Webhook Delivery Logs — Check that notification webhooks are delivering successfully. Failed webhook deliveries mean your team may miss critical alerts.
Incident Response Procedure
When an incident is detected, follow this structured response.
Identify Scope — Determine which tenants, environments, and services are affected. Use the dashboard to quickly assess the blast radius. Is this a single pod, a single service, or multiple services?
Correlate with Recent Changes — Open the audit log and filter by the affected scope within the last 2 hours. Most incidents are caused by recent deployments. Identify the last release that was deployed to the affected environment.
Gather Runtime Evidence — Use the diagnostics panel to collect evidence:
- Pod logs from affected containers
- Pod events showing restarts, OOM kills, or scheduling failures
- Resource utilization (CPU, memory) from the cluster
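The correlation step above can be sketched as a time-window filter over audit events. The event schema here (`type`, `environment`, `timestamp` keys) is an assumption for illustration, not GitOpsHQ's actual audit-log format:

```python
# Illustrative sketch of the correlation step: filter audit events down to
# deployments in the affected environment within the last 2 hours. The
# event schema is assumed, not GitOpsHQ's actual format.
from datetime import datetime, timedelta, timezone

def recent_deployments(events, environment, now=None, window_hours=2):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=window_hours)
    hits = [
        e for e in events
        if e["type"] == "deployment"
        and e["environment"] == environment
        and e["timestamp"] >= cutoff
    ]
    # Most recent first: the last release deployed is the prime suspect.
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)
```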
Determine Containment Strategy — Based on the evidence, choose the appropriate action:
| Scenario | Action |
|---|---|
| Bad configuration in latest release | Roll back to previous release |
| Resource exhaustion | Scale resources, then investigate root cause |
| External dependency failure | No rollback needed — monitor and alert |
| Cluster-level issue | Escalate to Platform team |
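The containment table above is effectively a lookup, which you can encode so that unrecognized scenarios fail safe toward escalation. The scenario labels below are hypothetical, not GitOpsHQ terminology:

```python
# Illustrative lookup of the containment table above; the scenario keys
# are hypothetical labels, not GitOpsHQ terminology.
CONTAINMENT = {
    "bad_config": "rollback_previous_release",
    "resource_exhaustion": "scale_then_investigate",
    "external_dependency": "monitor_and_alert",
    "cluster_issue": "escalate_to_platform",
}

def containment_action(scenario: str) -> str:
    # Unknown scenarios should be escalated rather than guessed at.
    return CONTAINMENT.get(scenario, "escalate_to_platform")
```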
Execute Rollback (if needed) — Navigate to Rollback → New Rollback. Select the scope (full environment, single service, or specific fields) and the target commit. Preview the rollback diff to confirm it will restore the desired state.
Verify Recovery — After the rollback deploys, monitor the affected services until they return to Healthy status. Confirm error rates return to baseline, pod health stabilizes, and the drift indicator clears.
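The verification step above amounts to a bounded polling loop: keep checking health until the service recovers or a deadline passes. In this sketch, `check_health` is a placeholder for however you read service status (dashboard API, `kubectl`, a metrics query):

```python
# Sketch of the recovery-verification loop: poll a health check until the
# service reports healthy or a timeout expires. `check_health` is a
# placeholder callable, not a GitOpsHQ API.
import time

def wait_for_recovery(check_health, timeout_s=600, interval_s=15,
                      clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_s
    while clock() < deadline:
        if check_health():   # e.g. error rate at baseline, pods Ready
            return True
        sleep(interval_s)
    return False             # still unhealthy: escalate, don't assume recovery
```

The injectable `clock` and `sleep` parameters make the loop testable without real waiting; in production you leave the defaults.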
Document the Incident — Record the incident in the audit log with: root cause, blast radius, containment action, recovery time, and preventive recommendations.
Rollback Operations
Rollback Scopes
GitOpsHQ supports three rollback scopes to give you precise control over what gets reverted.
Full Rollback reverts the entire environment to a previous point-in-time state. Use this when multiple services are affected or when you need to undo a multi-service release.
- Scope: All workloads in the environment
- Target: A specific release or commit SHA
- Risk: Higher — affects all services, including those that may be healthy
Service Rollback reverts a single service (workload) to a previous state while leaving other services unchanged. This is the most common rollback type.
- Scope: Single workload
- Target: Previous release for that workload
- Risk: Lower — surgical, affects only the problematic service
Field Rollback reverts specific configuration fields to their previous values. Use this when you know exactly which configuration change caused the issue.
- Scope: Specific value paths (e.g., `image.tag`, `env.FEATURE_FLAG`)
- Target: Previous values for those fields
- Risk: Lowest — most precise, but requires clear root cause identification
Rollback Best Practices
- Always preview the rollback diff before executing — confirm the rollback will restore the expected state
- Prefer service-scoped rollback over full rollback when the issue is isolated to one service
- Do not roll back speculatively — gather evidence first. A rollback without root-cause understanding may mask the real issue
- If production rollback requires approval, ensure your organization has a fast-path approval policy (single approver with short SLA) for emergency scenarios
- If approval is blocked, use break-glass (see below)
Drift Monitoring
Drift occurs when the actual cluster state diverges from the desired state in the GitOps repository.
Common Drift Causes
| Cause | Detection | Response |
|---|---|---|
| Manual `kubectl edit` | Drift alert on affected resource | Reconcile: sync desired state to cluster |
| HPA scaling | Replica count differs | Usually intentional — add to ignore list |
| Failed sync | Agent could not apply manifest | Investigate sync error and retry |
| External operator | CRD controller modified resource | Evaluate if change should be adopted |
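At its core, drift detection is a diff between desired and live state with an ignore list for intentionally divergent fields (such as HPA-managed replica counts from the table above). The flat-dict manifests in this sketch are a simplification of real nested Kubernetes objects:

```python
# Sketch of drift detection: compare desired (repo) state against live
# (cluster) state, skipping fields on an ignore list such as HPA-managed
# replica counts. Flat-dict manifests are a simplification.

def detect_drift(desired: dict, live: dict, ignore: set = frozenset()) -> dict:
    """Return {field: (desired_value, live_value)} for every drifted field."""
    drift = {}
    for key in desired.keys() | live.keys():
        if key in ignore:
            continue
        if desired.get(key) != live.get(key):
            drift[key] = (desired.get(key), live.get(key))
    return drift
```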
Drift Response Workflow
Break-Glass Usage
Break-glass is your emergency override when normal operations are blocked (e.g., by an environment freeze, missing approvals, or an access restriction) during a critical incident.
Break-Glass Is Audited
Every action taken during a break-glass session is recorded with the actor, justification, scope, and timestamp. Misuse of break-glass is flagged during post-incident review.
When to Use Break-Glass
- Production is on fire and the normal rollback path is blocked by a freeze
- A critical security patch needs to deploy immediately and the approval quorum cannot be reached
- An urgent configuration fix is needed and the primary deployers are unavailable
When NOT to Use Break-Glass
- Routine deployment that is blocked by a policy you find inconvenient
- Testing a new feature in production
- Bypassing approval because it is "taking too long" (escalate instead)
Break-Glass Procedure
Observability Integration
As an SRE, you should configure your observability stack to consume GitOpsHQ signals:
- Prometheus scraping — Point your Prometheus instance at GitOpsHQ's `/metrics` endpoint to collect deployment, approval, and agent health metrics.
- Alert routing — Configure alerts for: agent disconnection > 5 min, production drift detected, rollback executed, break-glass session opened.
- Dashboards — Build Grafana dashboards showing deployment frequency, rollback rate, drift trend, and mean time to recovery.
- Correlation — Use the audit log's timestamps and correlation IDs to link GitOpsHQ events with your application monitoring data.
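The alert-routing items above could look roughly like the following Prometheus rules file. The metric names are assumptions (GitOpsHQ's actual `/metrics` names may differ); only the thresholds are taken from this playbook:

```yaml
# Hypothetical alerting rules -- metric names are assumed, not confirmed
# against GitOpsHQ's actual /metrics output; thresholds mirror this playbook.
groups:
  - name: gitopshq-sre
    rules:
      - alert: AgentDisconnected
        expr: time() - gitopshq_agent_last_heartbeat_timestamp_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent on {{ $labels.cluster }} silent for > 5 minutes"
      - alert: ProductionDriftDetected
        expr: gitopshq_drift_detected{environment="production"} > 0
        labels:
          severity: critical
      - alert: BreakGlassSessionOpened
        expr: increase(gitopshq_break_glass_sessions_total[5m]) > 0
        labels:
          severity: warning
```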
Runbook Quick Reference
When facing a specific symptom, jump to the corresponding runbook:
| Symptom | Runbook |
|---|---|
| Cluster agent not reporting | Cluster Agent Disconnected |
| Release stuck in "Deploying" | Release Deployment Failed |
| Promotion will not proceed | Promotion Blocked |
| Team not receiving alerts | Webhook Delivery Failing |
| OPA violations blocking release | Policy Violations Unresolved |
| Live state differs from repo | Drift Detected in Production |