
Observability

Complete guide to monitoring GitOpsHQ operations, health signals, metrics, and integration with observability platforms.

Observability Philosophy

GitOpsHQ treats observability as a first-class concern. Every operation — deployment, approval, rollback, drift detection — emits structured signals that can be consumed by your existing monitoring stack. The goal is to give you a single pane of glass that connects delivery actions to runtime outcomes.

Health Endpoints

GitOpsHQ exposes standard health and metrics endpoints for integration with load balancers, orchestrators, and monitoring systems.

| Endpoint | Purpose | Response | Use Case |
| --- | --- | --- | --- |
| /healthz | Liveness probe | 200 OK if the process is alive | Kubernetes liveness probe, load balancer health check |
| /readyz | Readiness probe | 200 OK if the service can accept requests | Kubernetes readiness probe, rolling update safety |
| /metrics | Prometheus metrics | Prometheus text format | Metrics scraping, dashboard building, alerting |

Liveness vs Readiness

  • /healthz (liveness) — Returns 200 as long as the GitOpsHQ process is running and responsive. A failure here means the process should be restarted. This endpoint does not check external dependencies.
  • /readyz (readiness) — Returns 200 when the service is fully initialized and can handle requests. During startup, this returns 503 until database connections, cache, and agent communication channels are established.
# Kubernetes probe configuration example
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

Dashboard Signals

The GitOpsHQ dashboard organizes health signals into four categories. Each provides a different lens on your operational state.

Fleet Health shows the connectivity and sync status of all registered clusters.

| Indicator | Meaning |
| --- | --- |
| All green | Every cluster connected, every workload synced |
| Yellow badge | One or more clusters have stale heartbeats or pending syncs |
| Red badge | At least one cluster is disconnected or has failed syncs |

What to do: Click any non-green cluster to drill into agent status, sync errors, and resource health.

Project Readiness shows deployment status across all projects and environments.

| Indicator | Meaning |
| --- | --- |
| Ready | All environments deployed to their target state |
| Pending | Releases awaiting approval or deployment |
| Drift | Actual state differs from desired state |
| Failed | Last deployment failed |

What to do: Focus on drift and failed states first. Pending states may be normal if approvals are in progress.

Governance Compliance shows the state of policy evaluations and approval workflows.

| Indicator | Meaning |
| --- | --- |
| Compliant | All active policies are satisfied |
| Violations | One or more OPA policy violations exist |
| Blocked | Releases are blocked by unmet approval requirements |
| Break-glass active | An emergency session is currently open |

What to do: Investigate violations — they may indicate configuration errors. Blocked releases need attention from approvers.

Activity Pulse shows recent operational activity across the organization.

| Indicator | Meaning |
| --- | --- |
| Active | Multiple deployments, approvals, and configuration changes in the last hour |
| Quiet | Minimal activity — may be normal outside business hours |
| Anomalous | Unusual patterns (e.g., high rollback rate, sudden spike in break-glass usage) |

What to do: Anomalous activity warrants investigation. Correlate with the audit log.

Key Metrics to Monitor

GitOpsHQ exposes the following metrics via the /metrics endpoint in Prometheus format.

Delivery Metrics

| Metric | Description | Healthy Threshold |
| --- | --- | --- |
| gitopshq_releases_total | Total releases created (by status: success/failed/rolledback) | Trending upward (more deploys = faster delivery) |
| gitopshq_release_duration_seconds | Time from release creation to deployment completion | p95 < 300s (5 minutes) |
| gitopshq_rollbacks_total | Total rollback operations | < 2% of total releases |
| gitopshq_promotion_duration_seconds | Time to complete a full promotion pipeline | Depends on approval SLA |
| gitopshq_delivery_generator_duration_seconds | Time to render manifests | p95 < 30s |
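The thresholds above translate directly into Prometheus queries. The following recording rules are a sketch, assuming gitopshq_releases_total and gitopshq_rollbacks_total are counters and gitopshq_release_duration_seconds is exposed as a histogram (with _bucket series); the rule names are illustrative.

```yaml
# Hypothetical recording rules for the delivery thresholds above.
groups:
  - name: gitopshq-delivery
    rules:
      # Rollback ratio over the last hour; alert if it exceeds 0.02 (2%).
      - record: gitopshq:rollback_ratio:1h
        expr: |
          sum(rate(gitopshq_rollbacks_total[1h]))
            /
          sum(rate(gitopshq_releases_total[1h]))
      # p95 release duration; compare against the 300s target.
      - record: gitopshq:release_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(gitopshq_release_duration_seconds_bucket[10m])) by (le))
```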

Governance Metrics

| Metric | Description | Healthy Threshold |
| --- | --- | --- |
| gitopshq_approval_pending_duration_seconds | Time releases spend in approval queue | p95 < 1800s (30 minutes) |
| gitopshq_approval_decisions_total | Total approvals (by decision: approved/rejected) | Rejection rate < 10% |
| gitopshq_opa_evaluations_total | OPA policy evaluations (by result: pass/warn/deny) | Deny rate trending downward |
| gitopshq_breakglass_sessions_total | Break-glass session count | Close to zero (emergency only) |

Runtime Metrics

| Metric | Description | Healthy Threshold |
| --- | --- | --- |
| gitopshq_agent_heartbeat_age_seconds | Time since last agent heartbeat per cluster | < 60s |
| gitopshq_drift_detected_total | Drift detection events by environment | Zero in production |
| gitopshq_sync_failures_total | Failed cluster sync operations | < 1% of total syncs |
| gitopshq_webhook_deliveries_total | Webhook delivery attempts (by status: success/failure) | Success rate > 99% |
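Because /metrics serves standard Prometheus text format, these thresholds are easy to check in ad-hoc scripts as well. A minimal sketch, assuming the heartbeat metric carries a `cluster` label (the sample payload is illustrative, not real GitOpsHQ output):

```python
# Parse Prometheus text-format output and flag clusters whose agent
# heartbeat age exceeds the documented 60s threshold.
import re

HEARTBEAT_METRIC = "gitopshq_agent_heartbeat_age_seconds"

def stale_clusters(metrics_text: str, threshold: float = 60.0) -> list[str]:
    """Return cluster labels whose heartbeat age exceeds the threshold."""
    # Matches lines like: gitopshq_agent_heartbeat_age_seconds{cluster="prod-eu"} 42
    pattern = re.compile(
        rf'^{HEARTBEAT_METRIC}\{{cluster="([^"]+)"\}}\s+([0-9.eE+-]+)$'
    )
    stale = []
    for line in metrics_text.splitlines():
        m = pattern.match(line.strip())
        if m and float(m.group(2)) > threshold:
            stale.append(m.group(1))
    return stale

# Illustrative scrape output; in practice this comes from GET /metrics.
sample = """\
# HELP gitopshq_agent_heartbeat_age_seconds Time since last agent heartbeat
gitopshq_agent_heartbeat_age_seconds{cluster="prod-eu"} 12
gitopshq_agent_heartbeat_age_seconds{cluster="prod-us"} 184
"""

print(stale_clusters(sample))  # → ['prod-us']
```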

Automation Metrics

| Metric | Description | Healthy Threshold |
| --- | --- | --- |
| gitopshq_api_requests_total | API requests by service account | Within rate limits |
| gitopshq_webhook_inbound_total | Inbound webhook events processed | No errors |

Daily Monitoring Checklist

Run through this checklist at the start of each operational shift.

  • Fleet health: All clusters showing Connected with fresh heartbeats.
  • Production drift: Zero drift detected in production environments.
  • Failed deployments: No unresolved deployment failures in the last 12 hours.
  • Approval queue: No production-bound releases pending for more than 1 hour.
  • Rollback trend: No unusual spike in rollback frequency.
  • Break-glass sessions: No open break-glass sessions from previous shifts.
  • Webhook delivery: All notification webhooks delivering successfully.
  • Agent versions: All agents running the latest supported version.
  • OPA violations: No unresolved policy violations on active releases.
  • Audit anomalies: No unexpected actor patterns (e.g., service account access outside CI windows).

Integration with Observability Platforms

Prometheus + Grafana

GitOpsHQ's /metrics endpoint is compatible with standard Prometheus scraping.

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'gitopshq'
    metrics_path: '/metrics'
    scheme: 'https'
    bearer_token_file: '/etc/prometheus/gitopshq-token'
    static_configs:
      - targets: ['api.gitopshq.io:443']
    scrape_interval: 30s

Recommended Grafana dashboards:

| Dashboard | Panels | Purpose |
| --- | --- | --- |
| Delivery Overview | Release rate, failure rate, rollback rate, duration histograms | Delivery health at a glance |
| Governance | Approval latency, OPA violation trend, break-glass frequency | Policy effectiveness |
| Runtime | Agent connectivity, drift count, sync failures | Cluster health |
| Automation | API request rate, webhook delivery success, service account activity | CI/CD integration health |

Datadog

Export GitOpsHQ metrics to Datadog via a Prometheus-to-Datadog bridge or Datadog's OpenMetrics integration.

# Datadog agent config
instances:
  - prometheus_url: https://api.gitopshq.io/metrics
    namespace: gitopshq
    metrics:
      - gitopshq_releases_total
      - gitopshq_agent_heartbeat_age_seconds
      - gitopshq_drift_detected_total
    headers:
      Authorization: Bearer <GITOPSHQ_TOKEN>

PagerDuty / Opsgenie

Configure webhook notifications in GitOpsHQ to send alerts to your incident management platform:

  • Critical: Agent disconnected > 5 min, production deployment failed, break-glass session opened
  • Warning: Production drift detected, approval queue > 1 hour, rollback executed
  • Info: Release deployed successfully, promotion completed
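If the alerts are routed through Prometheus Alertmanager rather than directly from GitOpsHQ webhooks, the three severity tiers map naturally onto receivers. A sketch, in which the receiver names, the severity label, and the channel names are all assumptions and the PagerDuty routing key is a placeholder:

```yaml
# Illustrative Alertmanager routing for the severity tiers above.
route:
  receiver: slack-info          # default: informational alerts
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-critical
    - matchers: ['severity="warning"']
      receiver: slack-warnings
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <PAGERDUTY_ROUTING_KEY>
  - name: slack-warnings
    slack_configs:
      - channel: '#gitopshq-warnings'
  - name: slack-info
    slack_configs:
      - channel: '#gitopshq-activity'
```

A global slack_api_url (or per-receiver api_url) is also required for the Slack receivers; it is omitted here for brevity.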

Distributed Tracing

GitOpsHQ supports OpenTelemetry Protocol (OTLP) for distributed tracing. Every API request generates a trace with spans for:

  • Request authentication and authorization
  • Policy evaluation (OPA + approval checks)
  • Delivery generation (template rendering, Git commit)
  • Agent communication (sync trigger, health check)

Configure the OTLP exporter endpoint in GitOpsHQ settings to send traces to your Jaeger, Zipkin, or cloud tracing backend. Traces include the X-Correlation-ID header value for cross-system correlation.
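Since the export is plain OTLP, a small OpenTelemetry Collector can sit between GitOpsHQ and the tracing backend. A minimal pipeline sketch; the endpoints are examples, and it assumes a Jaeger backend that accepts OTLP on port 4317:

```yaml
# Minimal OpenTelemetry Collector config: receive OTLP traces from
# GitOpsHQ and forward them to a Jaeger backend's OTLP port.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true   # example only; use TLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```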

Alerting Recommendations

Critical Alerts (Page Immediately)

| Condition | Threshold | Recommended Action |
| --- | --- | --- |
| Agent disconnected | > 5 minutes | Investigate network/agent health |
| Production deployment failed | Any failure | Check delivery generator and sync logs |
| Break-glass session opened | Any session | Monitor and prepare for post-incident review |
| Multiple rollbacks in 1 hour | > 2 rollbacks | Investigate systemic quality issue |
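Two of these conditions can be expressed directly against the metrics from /metrics. A sketch of Prometheus alerting rules, assuming the metric shapes described under Key Metrics to Monitor (the alert names are illustrative):

```yaml
# Hypothetical Prometheus alerting rules for the critical conditions above.
groups:
  - name: gitopshq-critical
    rules:
      # Heartbeat older than 5 minutes maps to "agent disconnected".
      - alert: GitOpsHQAgentDisconnected
        expr: gitopshq_agent_heartbeat_age_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent heartbeat stale for over 5 minutes"
      # Any increase in the break-glass counter opens a page.
      - alert: GitOpsHQBreakGlassOpened
        expr: increase(gitopshq_breakglass_sessions_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "A break-glass session was opened"
```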

Warning Alerts (Investigate Within 1 Hour)

| Condition | Threshold | Recommended Action |
| --- | --- | --- |
| Production drift detected | Any drift | Determine if intentional or accidental |
| Approval queue aging | > 1 hour pending | Notify approvers or escalate |
| Webhook delivery failures | > 5% failure rate | Check endpoint health and credentials |
| OPA violations on active releases | Any violation | Review and resolve before deployment |

Informational Alerts (Review Daily)

| Condition | Threshold | Recommended Action |
| --- | --- | --- |
| Release deployed | Every deploy | Track deployment frequency trend |
| Promotion completed | Every promotion | Track promotion success rate |
| Break-glass session closed | Every closure | Add to post-incident review queue |

Signal Views Deep Dive

Delivery Health

Delivery health tells you whether your deployment pipeline is functioning correctly.

  • Release progression: Track releases as they move through Draft → Approved → Deployed → Healthy. Bottlenecks appear as accumulation at any stage.
  • Promotion lead time: The time from the first environment deployment to the final production deployment. Long lead times indicate governance overhead or testing gaps.
  • Rollback volume: A sudden increase in rollbacks signals a quality problem in the development or testing process. Investigate the root cause rather than just rolling back.

Runtime Health

Runtime health tells you whether your live clusters match the desired state.

  • Connected/disconnected trend: A gradual increase in disconnected clusters may indicate network policy changes, certificate expiration, or agent version drift.
  • Drift correlation: Drift that appears immediately after a deployment may indicate the deployment was partially applied. Drift that appears gradually may indicate manual cluster edits or operator interference.
  • Sync timing: If syncs are consistently slow, check cluster resource availability and agent pod health.

Governance Health

Governance health tells you whether your policies are effective and not creating unnecessary friction.

  • Approval aging: If approvals consistently take too long, consider adjusting quorum requirements or adding more approved reviewers.
  • Break-glass frequency: More than one break-glass session per month suggests a gap in normal operational capabilities. Review policies and permissions.
  • OPA evaluation trend: A decreasing denial rate over time indicates teams are learning and adapting to the policies. A persistent high denial rate may mean the policy is too restrictive or unclear.
