Observability
Complete guide to monitoring GitOpsHQ operations, health signals, metrics, and integration with observability platforms.
Observability Philosophy
GitOpsHQ treats observability as a first-class concern. Every operation — deployment, approval, rollback, drift detection — emits structured signals that can be consumed by your existing monitoring stack. The goal is to give you a single pane of glass that connects delivery actions to runtime outcomes.
Health Endpoints
GitOpsHQ exposes standard health and metrics endpoints for integration with load balancers, orchestrators, and monitoring systems.
| Endpoint | Purpose | Response | Use Case |
|---|---|---|---|
| `/healthz` | Liveness probe | `200 OK` if the process is alive | Kubernetes liveness probe, load balancer health check |
| `/readyz` | Readiness probe | `200 OK` if the service can accept requests | Kubernetes readiness probe, rolling update safety |
| `/metrics` | Prometheus metrics | Prometheus text format | Metrics scraping, dashboard building, alerting |
Liveness vs Readiness
- `/healthz` (liveness) — Returns 200 as long as the GitOpsHQ process is running and responsive. A failure here means the process should be restarted. This endpoint does not check external dependencies.
- `/readyz` (readiness) — Returns 200 when the service is fully initialized and can handle requests. During startup, this returns 503 until database connections, cache, and agent communication channels are established.
```yaml
# Kubernetes probe configuration example
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

Dashboard Signals
The GitOpsHQ dashboard organizes health signals into four categories. Each provides a different lens on your operational state.
Fleet Health shows the connectivity and sync status of all registered clusters.
| Indicator | Meaning |
|---|---|
| All green | Every cluster connected, every workload synced |
| Yellow badge | One or more clusters have stale heartbeats or pending syncs |
| Red badge | At least one cluster is disconnected or has failed syncs |
What to do: Click any non-green cluster to drill into agent status, sync errors, and resource health.
Project Readiness shows deployment status across all projects and environments.
| Indicator | Meaning |
|---|---|
| Ready | All environments deployed to their target state |
| Pending | Releases awaiting approval or deployment |
| Drift | Actual state differs from desired state |
| Failed | Last deployment failed |
What to do: Focus on drift and failed states first. Pending states may be normal if approvals are in progress.
Governance Compliance shows the state of policy evaluations and approval workflows.
| Indicator | Meaning |
|---|---|
| Compliant | All active policies are satisfied |
| Violations | One or more OPA policy violations exist |
| Blocked | Releases are blocked by unmet approval requirements |
| Break-glass active | An emergency session is currently open |
What to do: Investigate violations — they may indicate configuration errors. Blocked releases need attention from approvers.
Activity Pulse shows recent operational activity across the organization.
| Indicator | Meaning |
|---|---|
| Active | Multiple deployments, approvals, and configuration changes in the last hour |
| Quiet | Minimal activity — may be normal outside business hours |
| Anomalous | Unusual patterns (e.g., high rollback rate, sudden spike in break-glass usage) |
What to do: Anomalous activity warrants investigation. Correlate with the audit log.
Key Metrics to Monitor
GitOpsHQ exposes the following metrics via the /metrics endpoint in Prometheus format.
Delivery Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_releases_total` | Total releases created (by status: success/failed/rolledback) | Trending upward (more deploys = faster delivery) |
| `gitopshq_release_duration_seconds` | Time from release creation to deployment completion | p95 < 300s (5 minutes) |
| `gitopshq_rollbacks_total` | Total rollback operations | < 2% of total releases |
| `gitopshq_promotion_duration_seconds` | Time to complete a full promotion pipeline | Depends on approval SLA |
| `gitopshq_delivery_generator_duration_seconds` | Time to render manifests | p95 < 30s |
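One way to turn the delivery thresholds above into queryable series is a set of Prometheus recording rules. This is a sketch, not shipped configuration: it assumes `gitopshq_release_duration_seconds` is exported as a Prometheus histogram (i.e., a `_bucket` series exists), and the rule names are illustrative.

```yaml
# Sketch: precompute the rollback ratio (healthy: < 0.02) and release-duration
# p95 (healthy: < 300s) from the metrics documented above.
groups:
  - name: gitopshq-delivery-recording
    rules:
      - record: gitopshq:rollback_ratio:1h
        expr: |
          sum(rate(gitopshq_rollbacks_total[1h]))
            /
          sum(rate(gitopshq_releases_total[1h]))
      - record: gitopshq:release_duration_seconds:p95
        # Assumes histogram buckets are exported for this metric
        expr: histogram_quantile(0.95, sum(rate(gitopshq_release_duration_seconds_bucket[5m])) by (le))
```

Recording rules keep dashboard and alert queries cheap, since the ratio and quantile are evaluated once per rule interval instead of per panel refresh.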
Governance Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_approval_pending_duration_seconds` | Time releases spend in approval queue | p95 < 1800s (30 minutes) |
| `gitopshq_approval_decisions_total` | Total approvals (by decision: approved/rejected) | Rejection rate < 10% |
| `gitopshq_opa_evaluations_total` | OPA policy evaluations (by result: pass/warn/deny) | Deny rate trending downward |
| `gitopshq_breakglass_sessions_total` | Break-glass session count | Close to zero (emergency only) |
Runtime Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_agent_heartbeat_age_seconds` | Time since last agent heartbeat per cluster | < 60s |
| `gitopshq_drift_detected_total` | Drift detection events by environment | Zero in production |
| `gitopshq_sync_failures_total` | Failed cluster sync operations | < 1% of total syncs |
| `gitopshq_webhook_deliveries_total` | Webhook delivery attempts (by status: success/failure) | Success rate > 99% |
Automation Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_api_requests_total` | API requests by service account | Within rate limits |
| `gitopshq_webhook_inbound_total` | Inbound webhook events processed | No errors |
Daily Monitoring Checklist
Run through this checklist at the start of each operational shift:

- Fleet Health: confirm all clusters are green; drill into any stale heartbeats or failed syncs.
- Project Readiness: resolve Drift and Failed states first; confirm Pending releases are awaiting expected approvals.
- Governance Compliance: clear OPA violations, unblock releases, and verify no unexpected break-glass sessions are open.
- Activity Pulse: investigate anomalous patterns and correlate them with the audit log.
Integration with Observability Platforms
Prometheus + Grafana
GitOpsHQ's /metrics endpoint is compatible with standard Prometheus scraping.
```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'gitopshq'
    metrics_path: '/metrics'
    scheme: 'https'
    bearer_token_file: '/etc/prometheus/gitopshq-token'
    static_configs:
      - targets: ['api.gitopshq.io:443']
    scrape_interval: 30s
```

Recommended Grafana dashboards:
| Dashboard | Panels | Purpose |
|---|---|---|
| Delivery Overview | Release rate, failure rate, rollback rate, duration histograms | Delivery health at a glance |
| Governance | Approval latency, OPA violation trend, break-glass frequency | Policy effectiveness |
| Runtime | Agent connectivity, drift count, sync failures | Cluster health |
| Automation | API request rate, webhook delivery success, service account activity | CI/CD integration health |
Datadog
Export GitOpsHQ metrics to Datadog via Prometheus-to-Datadog bridge or the OpenMetrics integration.
```yaml
# Datadog agent config
instances:
  - prometheus_url: https://api.gitopshq.io/metrics
    namespace: gitopshq
    metrics:
      - gitopshq_releases_total
      - gitopshq_agent_heartbeat_age_seconds
      - gitopshq_drift_detected_total
    headers:
      Authorization: Bearer <GITOPSHQ_TOKEN>
```

PagerDuty / Opsgenie
Configure webhook notifications in GitOpsHQ to send alerts to your incident management platform:
- Critical: Agent disconnected > 5 min, production deployment failed, break-glass session opened
- Warning: Production drift detected, approval queue > 1 hour, rollback executed
- Info: Release deployed successfully, promotion completed
Distributed Tracing
GitOpsHQ supports OpenTelemetry Protocol (OTLP) for distributed tracing. Every API request generates a trace with spans for:
- Request authentication and authorization
- Policy evaluation (OPA + approval checks)
- Delivery generation (template rendering, Git commit)
- Agent communication (sync trigger, health check)
Configure the OTLP exporter endpoint in GitOpsHQ settings to send traces to your Jaeger, Zipkin, or cloud tracing backend. Traces include the X-Correlation-ID header value for cross-system correlation.
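If an OpenTelemetry Collector sits between GitOpsHQ and your tracing backend, a minimal receive-and-forward pipeline could look like the following. This is an illustrative sketch: the listener and Jaeger endpoints are placeholders for your own deployment, and TLS is disabled here only for brevity.

```yaml
# OpenTelemetry Collector: accept OTLP traces from GitOpsHQ and
# forward them to a Jaeger backend over OTLP/gRPC.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # point GitOpsHQ's OTLP exporter here
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # placeholder backend address
    tls:
      insecure: true   # for illustration only; enable TLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

The same pipeline works for Zipkin or a cloud tracing backend by swapping the exporter; the `X-Correlation-ID` value on each trace lets you join spans against audit-log entries.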
Alerting Recommendations
Critical Alerts (Page Immediately)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Agent disconnected | > 5 minutes | Investigate network/agent health |
| Production deployment failed | Any failure | Check delivery generator and sync logs |
| Break-glass session opened | Any session | Monitor and prepare for post-incident review |
| Multiple rollbacks in 1 hour | > 2 rollbacks | Investigate systemic quality issue |
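The critical conditions above can be expressed as Prometheus alerting rules. This is a sketch under assumptions: metric names come from the `/metrics` tables earlier in this page, alert names and `for` durations are illustrative, and routing to PagerDuty/Opsgenie is left to your Alertmanager configuration.

```yaml
# Sketch: critical alerts mirroring the table above.
groups:
  - name: gitopshq-critical
    rules:
      - alert: GitOpsHQAgentDisconnected
        # Agent heartbeat older than 5 minutes (table threshold: > 5 min)
        expr: gitopshq_agent_heartbeat_age_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster agent heartbeat is older than 5 minutes"
      - alert: GitOpsHQRollbackSpike
        # More than 2 rollbacks in the last hour (table threshold: > 2)
        expr: sum(increase(gitopshq_rollbacks_total[1h])) > 2
        labels:
          severity: critical
        annotations:
          summary: "More than 2 rollbacks in the last hour"
```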
Warning Alerts (Investigate Within 1 Hour)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Production drift detected | Any drift | Determine if intentional or accidental |
| Approval queue aging | > 1 hour pending | Notify approvers or escalate |
| Webhook delivery failures | > 5% failure rate | Check endpoint health and credentials |
| OPA violations on active releases | Any violation | Review and resolve before deployment |
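Two of the warning conditions can be sketched the same way. Note the label names here (`environment` on the drift metric, `status` on the webhook metric) are assumptions inferred from the metric descriptions, not confirmed label schemas; verify them against your actual `/metrics` output.

```yaml
# Sketch: warning alerts mirroring the table above.
groups:
  - name: gitopshq-warning
    rules:
      - alert: GitOpsHQProductionDrift
        # Any production drift event in the last 15 minutes
        expr: increase(gitopshq_drift_detected_total{environment="production"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Drift detected in production"
      - alert: GitOpsHQWebhookFailures
        # Webhook failure rate above 5% over 30 minutes
        expr: |
          sum(rate(gitopshq_webhook_deliveries_total{status="failure"}[30m]))
            /
          sum(rate(gitopshq_webhook_deliveries_total[30m])) > 0.05
        labels:
          severity: warning
        annotations:
          summary: "Webhook delivery failure rate above 5%"
```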
Informational Alerts (Review Daily)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Release deployed | Every deploy | Track deployment frequency trend |
| Promotion completed | Every promotion | Track promotion success rate |
| New break-glass session closed | Every closure | Add to post-incident review queue |
Signal Views Deep Dive
Delivery Health
Delivery health tells you whether your deployment pipeline is functioning correctly.
- Release progression: Track releases as they move through Draft → Approved → Deployed → Healthy. Bottlenecks appear as accumulation at any stage.
- Promotion lead time: The time from the first environment deployment to the final production deployment. Long lead times indicate governance overhead or testing gaps.
- Rollback volume: A sudden increase in rollbacks signals a quality problem in the development or testing process. Investigate the root cause rather than just rolling back.
Runtime Health
Runtime health tells you whether your live clusters match the desired state.
- Connected/disconnected trend: A gradual increase in disconnected clusters may indicate network policy changes, certificate expiration, or agent version drift.
- Drift correlation: Drift that appears immediately after a deployment may indicate the deployment was partially applied. Drift that appears gradually may indicate manual cluster edits or operator interference.
- Sync timing: If syncs are consistently slow, check cluster resource availability and agent pod health.
Governance Health
Governance health tells you whether your policies are effective and not creating unnecessary friction.
- Approval aging: If approvals consistently take too long, consider adjusting quorum requirements or adding more approved reviewers.
- Break-glass frequency: More than one break-glass session per month suggests a gap in normal operational capabilities. Review policies and permissions.
- OPA evaluation trend: A decreasing denial rate over time indicates teams are learning and adapting to the policies. A persistent high denial rate may mean the policy is too restrictive or unclear.