SRE Playbook
Reliability and incident response guide for SREs monitoring cluster health, managing rollbacks, and handling production incidents in GitOpsHQ.
Your Role in GitOpsHQ
As a Site Reliability Engineer, you are the guardian of production stability. You monitor cluster health, detect and respond to incidents, execute rollbacks when deployments go wrong, investigate drift, and use break-glass procedures when normal operations are blocked during emergencies.
Your primary concern is reliability and recovery — ensuring that services stay healthy, incidents are contained quickly, and every recovery action is traceable.
Your Access Scope
You typically hold the Operator role at the organization level, giving you read access to all projects, cluster management capabilities, rollback execution rights, and break-glass authorization. You can view all deployments but may not create releases for non-emergency changes.
Reliability Signals
These are the signals you should monitor continuously. Configure alerts for any that cross their threshold.
| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| Agent heartbeat | < 60s since last ping | 2–5 minutes since last ping | > 5 minutes (agent disconnected) |
| Pod health | All pods Running/Ready | > 10% pods not ready | > 30% pods unhealthy |
| Deployment sync | All synced | 1–2 pending sync | Multiple out of sync |
| Drift count | Zero | 1–2 drifts in non-prod | Any drift in production |
| Rollback rate | < 2% of releases | 2–5% of releases | > 5% (systemic quality issue) |
| Approval queue age | < 30 minutes | 30–60 minutes | > 1 hour (pipeline blocked) |
| Webhook delivery | > 99% success rate | 95–99% success rate | < 95% (notification gaps) |
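Two of these thresholds can be encoded as simple classifiers. The sketch below is illustrative decision logic only (the function names are not a GitOpsHQ API); the numbers come straight from the table above:

```python
# Illustrative sketch: classify signal readings against the thresholds in
# the table above. Not a GitOpsHQ API -- just the decision logic.

def classify_heartbeat(seconds_since_ping: float) -> str:
    """Map seconds since the agent's last ping to a severity level."""
    if seconds_since_ping > 300:      # > 5 minutes: agent disconnected
        return "critical"
    if seconds_since_ping >= 120:     # 2-5 minutes since last ping
        return "warning"
    return "healthy"                  # recent ping

def classify_rollback_rate(rollbacks: int, releases: int) -> str:
    """Rollback rate as a fraction of releases; > 5% is a systemic issue."""
    rate = rollbacks / releases
    if rate > 0.05:
        return "critical"
    if rate >= 0.02:
        return "warning"
    return "healthy"
```

Wiring these into your alerting pipeline lets the same thresholds drive both the dashboard colors and the pager.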
Daily Monitoring Routine
Check Fleet Health Dashboard — Open the GitOpsHQ dashboard and scan the fleet health panel. Every cluster should show Connected with a green indicator. Every production environment should show Healthy with no drift.
Review Overnight Deployments — Check the audit log for any deployments that occurred outside your shift. Filter by the last 12 hours and look for any failures, rollbacks, or break-glass sessions that may need follow-up.
Verify Agent Connectivity — Navigate to Clusters and confirm every registered agent has a recent heartbeat. An agent that missed multiple heartbeats may indicate a network issue or a crashed pod.
Check Drift Status — Review the drift detection panel for any detected drift. Non-production drift may be informational, but production drift requires immediate investigation.
Review Approval Queue — Check for production-bound releases that are waiting for your approval. Aging approvals block the delivery pipeline and force teams to accumulate changes, increasing risk.
Scan Webhook Delivery Logs — Check that notification webhooks are delivering successfully. Failed webhook deliveries mean your team may miss critical alerts.
Incident Response Procedure
When an incident is detected, follow this structured response.
Identify Scope — Determine which tenants, environments, and services are affected. Use the dashboard to quickly assess the blast radius. Is this a single pod, a single service, or multiple services?
Correlate with Recent Changes — Open the audit log and filter by the affected scope within the last 2 hours. Most incidents are caused by recent deployments. Identify the last release that was deployed to the affected environment.
Gather Runtime Evidence — Use the diagnostics panel to collect evidence:
- Pod logs from affected containers
- Pod events showing restarts, OOM kills, or scheduling failures
- Resource utilization (CPU, memory) from the cluster
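The correlation step above can be sketched as a time-window filter over audit events. The event schema here (`type`, `environment`, `timestamp` keys) is an assumption for illustration, not GitOpsHQ's actual audit-log format:

```python
# Illustrative sketch of the correlation step: filter audit events down to
# deployments in the affected environment within the last 2 hours. The
# event schema is assumed, not GitOpsHQ's actual format.
from datetime import datetime, timedelta, timezone

def recent_deployments(events, environment, now=None, window_hours=2):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=window_hours)
    hits = [
        e for e in events
        if e["type"] == "deployment"
        and e["environment"] == environment
        and e["timestamp"] >= cutoff
    ]
    # Most recent first: the last release deployed is the prime suspect.
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)
```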
Determine Containment Strategy — Based on the evidence, choose the appropriate action:
| Scenario | Action |
|---|---|
| Bad configuration in latest release | Roll back to previous release |
| Resource exhaustion | Scale resources, then investigate root cause |
| External dependency failure | No rollback needed — monitor and alert |
| Cluster-level issue | Escalate to Platform team |
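The containment table above is effectively a lookup, which you can encode so that unrecognized scenarios fail safe toward escalation. The scenario labels below are hypothetical, not GitOpsHQ terminology:

```python
# Illustrative lookup of the containment table above; the scenario keys
# are hypothetical labels, not GitOpsHQ terminology.
CONTAINMENT = {
    "bad_config": "rollback_previous_release",
    "resource_exhaustion": "scale_then_investigate",
    "external_dependency": "monitor_and_alert",
    "cluster_issue": "escalate_to_platform",
}

def containment_action(scenario: str) -> str:
    # Unknown scenarios should be escalated rather than guessed at.
    return CONTAINMENT.get(scenario, "escalate_to_platform")
```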
Execute Rollback (if needed) — Navigate to Rollback → New Rollback. Select the scope (full environment, single service, or specific fields) and the target commit. Preview the rollback diff to confirm it will restore the desired state.
Verify Recovery — After the rollback deploys, monitor the affected services until they return to Healthy status. Confirm error rates return to baseline, pod health stabilizes, and the drift indicator clears.
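The verification step above amounts to a bounded polling loop: keep checking health until the service recovers or a deadline passes. In this sketch, `check_health` is a placeholder for however you read service status (dashboard API, `kubectl`, a metrics query):

```python
# Sketch of the recovery-verification loop: poll a health check until the
# service reports healthy or a timeout expires. `check_health` is a
# placeholder callable, not a GitOpsHQ API.
import time

def wait_for_recovery(check_health, timeout_s=600, interval_s=15,
                      clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_s
    while clock() < deadline:
        if check_health():   # e.g. error rate at baseline, pods Ready
            return True
        sleep(interval_s)
    return False             # still unhealthy: escalate, don't assume recovery
```

The injectable `clock` and `sleep` parameters make the loop testable without real waiting; in production you leave the defaults.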
Document the Incident — Record the incident in the audit log with: root cause, blast radius, containment action, recovery time, and preventive recommendations.
Rollback Operations
Rollback Scopes
GitOpsHQ supports three rollback scopes to give you precise control over what gets reverted.
Full Rollback reverts the entire environment to a previous point-in-time state. Use this when multiple services are affected or when you need to undo a multi-service release.
- Scope: All workloads in the environment
- Target: A specific release or commit SHA
- Risk: Higher — affects all services, including those that may be healthy
Service Rollback reverts a single service (workload) to a previous state while leaving other services unchanged. This is the most common rollback type.
- Scope: Single workload
- Target: Previous release for that workload
- Risk: Lower — surgical, affects only the problematic service
Field Rollback reverts specific configuration fields to their previous values. Use this when you know exactly which configuration change caused the issue.
- Scope: Specific value paths (e.g., `image.tag`, `env.FEATURE_FLAG`)
- Target: Previous values for those fields
- Risk: Lowest — most precise, but requires clear root cause identification
Rollback Best Practices
- Always preview the rollback diff before executing — confirm the rollback will restore the expected state
- Prefer service-scoped rollback over full rollback when the issue is isolated to one service
- Do not roll back speculatively — gather evidence first. A rollback without root-cause understanding may mask the real issue
- If production rollback requires approval, ensure your organization has a fast-path approval policy (single approver with short SLA) for emergency scenarios
- If approval is blocked, use break-glass (see below)
Drift Monitoring
Drift occurs when the actual cluster state diverges from the desired state in the GitOps repository.
Common Drift Causes
| Cause | Detection | Response |
|---|---|---|
| Manual `kubectl edit` | Drift alert on affected resource | Reconcile: sync desired state to cluster |
| HPA scaling | Replica count differs | Usually intentional — add to ignore list |
| Failed sync | Agent could not apply manifest | Investigate sync error and retry |
| External operator | CRD controller modified resource | Evaluate if change should be adopted |
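At its core, drift detection is a diff between desired and live state with an ignore list for intentionally divergent fields (such as HPA-managed replica counts from the table above). The flat-dict manifests in this sketch are a simplification of real nested Kubernetes objects:

```python
# Sketch of drift detection: compare desired (repo) state against live
# (cluster) state, skipping fields on an ignore list such as HPA-managed
# replica counts. Flat-dict manifests are a simplification.

def detect_drift(desired: dict, live: dict, ignore: set = frozenset()) -> dict:
    """Return {field: (desired_value, live_value)} for every drifted field."""
    drift = {}
    for key in desired.keys() | live.keys():
        if key in ignore:
            continue
        if desired.get(key) != live.get(key):
            drift[key] = (desired.get(key), live.get(key))
    return drift
```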
Drift Response Workflow
Break-Glass Usage
Break-glass is your emergency override when normal operations are blocked (e.g., by an environment freeze, missing approvals, or an access restriction) during a critical incident.
Break-Glass Is Audited
Every action taken during a break-glass session is recorded with the actor, justification, scope, and timestamp. Misuse of break-glass is flagged during post-incident review.
When to Use Break-Glass
- Production is on fire and the normal rollback path is blocked by a freeze
- A critical security patch needs to deploy immediately and the approval quorum cannot be reached
- An urgent configuration fix is needed and the primary deployers are unavailable
When NOT to Use Break-Glass
- Routine deployment that is blocked by a policy you find inconvenient
- Testing a new feature in production
- Bypassing approval because it is "taking too long" (escalate instead)
Break-Glass Procedure
Observability Integration
As an SRE, you should configure your observability stack to consume GitOpsHQ signals:
- Prometheus scraping — Point your Prometheus instance at GitOpsHQ's `/metrics` endpoint to collect deployment, approval, and agent health metrics.
- Alert routing — Configure alerts for: agent disconnection > 5 min, production drift detected, rollback executed, break-glass session opened.
- Dashboards — Build Grafana dashboards showing deployment frequency, rollback rate, drift trend, and mean time to recovery.
- Correlation — Use the audit log's timestamps and correlation IDs to link GitOpsHQ events with your application monitoring data.
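The alert-routing items above could look roughly like the following Prometheus rules file. The metric names are assumptions (GitOpsHQ's actual `/metrics` names may differ); only the thresholds are taken from this playbook:

```yaml
# Hypothetical alerting rules -- metric names are assumed, not confirmed
# against GitOpsHQ's actual /metrics output; thresholds mirror this playbook.
groups:
  - name: gitopshq-sre
    rules:
      - alert: AgentDisconnected
        expr: time() - gitopshq_agent_last_heartbeat_timestamp_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent on {{ $labels.cluster }} silent for > 5 minutes"
      - alert: ProductionDriftDetected
        expr: gitopshq_drift_detected{environment="production"} > 0
        labels:
          severity: critical
      - alert: BreakGlassSessionOpened
        expr: increase(gitopshq_break_glass_sessions_total[5m]) > 0
        labels:
          severity: warning
```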
Runbook Quick Reference
When facing a specific symptom, jump to the corresponding runbook:
| Symptom | Runbook |
|---|---|
| Cluster agent not reporting | Cluster Agent Disconnected |
| Release stuck in "Deploying" | Release Deployment Failed |
| Promotion will not proceed | Promotion Blocked |
| Team not receiving alerts | Webhook Delivery Failing |
| OPA violations blocking release | Policy Violations Unresolved |
| Live state differs from repo | Drift Detected in Production |