
SRE Playbook

Reliability and incident response guide for SREs monitoring cluster health, managing rollbacks, and handling production incidents in GitOpsHQ.

Your Role in GitOpsHQ

As a Site Reliability Engineer, you are the guardian of production stability. You monitor cluster health, detect and respond to incidents, execute rollbacks when deployments go wrong, investigate drift, and use break-glass procedures when normal operations are blocked during emergencies.

Your primary concern is reliability and recovery — ensuring that services stay healthy, incidents are contained quickly, and every recovery action is traceable.

Your Access Scope

You typically hold the Operator role at the organization level, giving you read access to all projects, cluster management capabilities, rollback execution rights, and break-glass authorization. You can view all deployments but may not create releases for non-emergency changes.

Reliability Signals

These are the signals you should monitor continuously. Configure alerts for any that cross their threshold.

| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| Agent heartbeat | < 60s since last ping | 2–5 minutes since last ping | > 5 minutes (agent disconnected) |
| Pod health | All pods Running/Ready | > 10% pods not ready | > 30% pods unhealthy |
| Deployment sync | All synced | 1–2 pending sync | Multiple out of sync |
| Drift count | Zero | 1–2 drifts in non-prod | Any drift in production |
| Rollback rate | < 2% of releases | 2–5% of releases | > 5% (systemic quality issue) |
| Approval queue age | < 30 minutes | 30–60 minutes | > 1 hour (pipeline blocked) |
| Webhook delivery | > 99% success rate | 95–99% success rate | < 95% (notification gaps) |
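
If you script your own health checks, the thresholds above translate directly into classification logic. A minimal sketch (the function names are illustrative, not part of any GitOpsHQ API; the gap between the 60-second healthy bound and the 2-minute warning bound is treated as warning here):

```python
def heartbeat_status(seconds_since_ping: float) -> str:
    """Classify agent heartbeat age using the thresholds in the table above."""
    if seconds_since_ping < 60:
        return "healthy"
    if seconds_since_ping <= 300:  # up to 5 minutes
        return "warning"
    return "critical"  # agent disconnected


def rollback_rate_status(rollbacks: int, releases: int) -> str:
    """Classify the rollback rate as a share of total releases."""
    rate = rollbacks / releases
    if rate < 0.02:
        return "healthy"
    if rate <= 0.05:
        return "warning"
    return "critical"  # systemic quality issue
```

The same pattern extends to the other signals; wire each classifier to an alert route so that any `critical` result pages the on-call SRE.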

Daily Monitoring Routine

Check Fleet Health Dashboard — Open the GitOpsHQ dashboard and scan the fleet health panel. Every cluster should show Connected with a green indicator. Every production environment should show Healthy with no drift.

Review Overnight Deployments — Check the audit log for any deployments that occurred outside your shift. Filter by the last 12 hours and look for any failures, rollbacks, or break-glass sessions that may need follow-up.

Verify Agent Connectivity — Navigate to Clusters and confirm every registered agent has a recent heartbeat. An agent that missed multiple heartbeats may indicate a network issue or a crashed pod.

Check Drift Status — Review the drift detection panel for any detected drift. Non-production drift may be informational, but production drift requires immediate investigation.

Review Approval Queue — Check for production-bound releases that are waiting for your approval. Aging approvals block the delivery pipeline and force teams to accumulate changes, increasing risk.

Scan Webhook Delivery Logs — Check that notification webhooks are delivering successfully. Failed webhook deliveries mean your team may miss critical alerts.

Incident Response Procedure

When an incident is detected, follow this structured response.

Identify Scope — Determine which tenants, environments, and services are affected. Use the dashboard to quickly assess the blast radius. Is this a single pod, a single service, or multiple services?

Correlate with Recent Changes — Open the audit log and filter by the affected scope within the last 2 hours. Most incidents are caused by recent deployments. Identify the last release that was deployed to the affected environment.

Gather Runtime Evidence — Use the diagnostics panel to collect evidence:

  • Pod logs from affected containers
  • Pod events showing restarts, OOM kills, or scheduling failures
  • Resource utilization (CPU, memory) from the cluster
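
If you also hold direct kubectl access to the affected cluster, the same evidence can be gathered by hand. A sketch that builds the equivalent commands (namespace and pod names are placeholders; `kubectl top` requires metrics-server in the cluster):

```python
def diagnostic_commands(namespace: str, pod: str) -> list[str]:
    """Build the standard kubectl commands for pod-level evidence gathering."""
    return [
        # Recent logs; add --previous when the container has restarted
        f"kubectl logs {pod} -n {namespace} --tail=200",
        # Events section reveals restarts, OOM kills, and scheduling failures
        f"kubectl describe pod {pod} -n {namespace}",
        # Current CPU/memory usage (requires metrics-server)
        f"kubectl top pod {pod} -n {namespace}",
    ]
```

Capture the output of each command into the incident record so the evidence survives pod restarts.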

Determine Containment Strategy — Based on the evidence, choose the appropriate action:

| Scenario | Action |
|---|---|
| Bad configuration in latest release | Rollback to previous release |
| Resource exhaustion | Scale resources, then investigate root cause |
| External dependency failure | No rollback needed — monitor and alert |
| Cluster-level issue | Escalate to Platform team |
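
For runbook automation, the containment table can be encoded as a lookup with a safe default. A sketch (scenario and action identifiers are illustrative):

```python
# Containment actions keyed by scenario, mirroring the table above.
CONTAINMENT = {
    "bad_configuration": "rollback_previous_release",
    "resource_exhaustion": "scale_then_investigate",
    "external_dependency_failure": "monitor_and_alert",  # no rollback needed
    "cluster_level_issue": "escalate_to_platform",
}


def containment_action(scenario: str) -> str:
    """Look up the containment action; unrecognized scenarios escalate."""
    return CONTAINMENT.get(scenario, "escalate_to_platform")
```

Defaulting unknown scenarios to escalation keeps an automated responder from guessing during an unfamiliar incident.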

Execute Rollback (if needed) — Navigate to Rollback → New Rollback. Select the scope (full environment, single service, or specific fields) and the target commit. Preview the rollback diff to confirm it will restore the desired state.

Verify Recovery — After the rollback deploys, monitor the affected services until they return to Healthy status. Confirm error rates return to baseline, pod health stabilizes, and the drift indicator clears.

Document the Incident — Record the incident in the audit log with: root cause, blast radius, containment action, recovery time, and preventive recommendations.

Rollback Operations

Rollback Scopes

GitOpsHQ supports three rollback scopes to give you precise control over what gets reverted.

Full Rollback reverts the entire environment to a previous point-in-time state. Use this when multiple services are affected or when you need to undo a multi-service release.

  • Scope: All workloads in the environment
  • Target: A specific release or commit SHA
  • Risk: Higher — affects all services, including those that may be healthy

Service Rollback reverts a single service (workload) to a previous state while leaving other services unchanged. This is the most common rollback type.

  • Scope: Single workload
  • Target: Previous release for that workload
  • Risk: Lower — surgical, affects only the problematic service

Field Rollback reverts specific configuration fields to their previous values. Use this when you know exactly which configuration change caused the issue.

  • Scope: Specific value paths (e.g., image.tag, env.FEATURE_FLAG)
  • Target: Previous values for those fields
  • Risk: Lowest — most precise, but requires clear root cause identification
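
The scope decision follows a narrowest-first rule: pick the most precise scope the evidence supports. A sketch of that rule (function name and inputs are illustrative):

```python
def choose_rollback_scope(affected_services: list[str],
                          known_bad_fields: list[str]) -> str:
    """Pick the narrowest rollback scope that covers the problem."""
    if known_bad_fields:
        return "field"    # root cause identified: revert only those values
    if len(affected_services) == 1:
        return "service"  # surgical: revert one workload
    return "full"         # broad: revert the whole environment
```

In practice this is why service rollback is the most common type: single-service incidents are frequent, and a clear field-level root cause is comparatively rare.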

Rollback Best Practices

  • Always preview the rollback diff before executing — confirm the rollback will restore the expected state
  • Prefer service-scoped rollback over full rollback when the issue is isolated to one service
  • Do not roll back speculatively — gather evidence first. A rollback without an understanding of the root cause may mask the real issue
  • If production rollback requires approval, ensure your organization has a fast-path approval policy (single approver with short SLA) for emergency scenarios
  • If approval is blocked, use break-glass (see below)

Drift Monitoring

Drift occurs when the actual cluster state diverges from the desired state in the GitOps repository.

Common Drift Causes

| Cause | Detection | Response |
|---|---|---|
| Manual kubectl edit | Drift alert on affected resource | Reconcile: sync desired state to cluster |
| HPA scaling | Replica count differs | Usually intentional — add to ignore list |
| Failed sync | Agent could not apply manifest | Investigate sync error and retry |
| External operator | CRD controller modified resource | Evaluate if change should be adopted |

Drift Response Workflow

1. Assess the drift — Is it in production or non-production? Is it on a critical resource (Deployment, Service) or a non-critical one (ConfigMap annotation)?
2. Determine intent — Was the drift intentional (e.g., manual emergency fix) or unintentional (e.g., someone ran kubectl edit by mistake)?
3. Resolve — For intentional drift, update the desired state in the GitOps repo to match the cluster. For unintentional drift, trigger a sync to force the cluster back to the desired state.
4. Prevent recurrence — If drift was caused by manual cluster access, tighten RBAC on the cluster level. If caused by an operator, add the resource to the drift ignore list or configure the operator to respect GitOps ownership.
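
The resolve step reduces to one branch on intent, which is worth encoding if you automate drift triage. A sketch (identifiers are illustrative):

```python
def drift_resolution(intentional: bool) -> str:
    """Map drift intent to the resolution described in the workflow above."""
    if intentional:
        # e.g. a manual emergency fix: adopt it into the desired state
        return "update_gitops_repo"
    # e.g. an accidental kubectl edit: force the cluster back
    return "sync_desired_state"
```

Note that the intent determination itself usually cannot be automated — it requires asking whoever touched the cluster.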

Break-Glass Usage

Break-glass is your emergency override when normal operations are blocked (e.g., by an environment freeze, missing approvals, or an access restriction) during a critical incident.

Break-Glass Is Audited

Every action taken during a break-glass session is recorded with the actor, justification, scope, and timestamp. Misuse of break-glass is flagged during post-incident review.

When to Use Break-Glass

  • Production is on fire and the normal rollback path is blocked by a freeze
  • A critical security patch needs to deploy immediately and the approval quorum cannot be reached
  • An urgent configuration fix is needed and the primary deployers are unavailable

When NOT to Use Break-Glass

  • Routine deployment that is blocked by a policy you find inconvenient
  • Testing a new feature in production
  • Bypassing approval because it is "taking too long" (escalate instead)

Break-Glass Procedure

1. Navigate to Emergency → Break-Glass → New Session.
2. Select the scope (environment + workload) for the session.
3. Provide a detailed justification including the incident severity and what you intend to do.
4. Execute the emergency action (rollback, configuration change, etc.).
5. Verify the fix resolves the incident.
6. Close the break-glass session with a closing note.
7. File a post-incident report documenting the break-glass usage and preventive actions.

Observability Integration

As an SRE, you should configure your observability stack to consume GitOpsHQ signals:

  • Prometheus scraping — Point your Prometheus instance at GitOpsHQ's /metrics endpoint to collect deployment, approval, and agent health metrics.
  • Alert routing — Configure alerts for: agent disconnection > 5 min, production drift detected, rollback executed, break-glass session opened.
  • Dashboards — Build Grafana dashboards showing deployment frequency, rollback rate, drift trend, and mean time to recovery.
  • Correlation — Use the audit log's timestamps and correlation IDs to link GitOpsHQ events with your application monitoring data.
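
The alert conditions above can be expressed as Prometheus alerting rules. This is a hedged sketch: the `gitopshq_*` metric names are assumptions, not documented metrics — substitute whatever names the actual /metrics endpoint exposes.

```yaml
# Hypothetical alerting rules; metric names are placeholders.
groups:
  - name: gitopshq
    rules:
      - alert: GitOpsAgentDisconnected
        expr: time() - gitopshq_agent_last_heartbeat_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster agent has not reported for over 5 minutes"
      - alert: ProductionDriftDetected
        expr: gitopshq_drift_count{environment="production"} > 0
        labels:
          severity: critical
        annotations:
          summary: "Drift detected in a production environment"
```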

Runbook Quick Reference

When facing a specific symptom, jump to the corresponding runbook:

| Symptom | Runbook |
|---|---|
| Cluster agent not reporting | Cluster Agent Disconnected |
| Release stuck in "Deploying" | Release Deployment Failed |
| Promotion will not proceed | Promotion Blocked |
| Team not receiving alerts | Webhook Delivery Failing |
| OPA violations blocking release | Policy Violations Unresolved |
| Live state differs from repo | Drift Detected in Production |
