Observability
Complete guide to monitoring GitOpsHQ operations, health signals, metrics, and integration with observability platforms.
Observability Philosophy
GitOpsHQ treats observability as a first-class concern. Every operation — deployment, approval, rollback, drift detection — emits structured signals that can be consumed by your existing monitoring stack. The goal is to give you a single pane of glass that connects delivery actions to runtime outcomes.
Health Endpoints
GitOpsHQ exposes standard health and metrics endpoints for integration with load balancers, orchestrators, and monitoring systems.
| Endpoint | Purpose | Response | Use Case |
|---|---|---|---|
| `/healthz` | Liveness probe | `200 OK` if the process is alive | Kubernetes liveness probe, load balancer health check |
| `/readyz` | Readiness probe | `200 OK` if the service can accept requests | Kubernetes readiness probe, rolling update safety |
| `/metrics` | Prometheus metrics | Prometheus text format | Metrics scraping, dashboard building, alerting |
Liveness vs Readiness
- `/healthz` (liveness) — Returns 200 as long as the GitOpsHQ process is running and responsive. A failure here means the process should be restarted. This endpoint does not check external dependencies.
- `/readyz` (readiness) — Returns 200 when the service is fully initialized and can handle requests. During startup, this returns 503 until database connections, cache, and agent communication channels are established.
```yaml
# Kubernetes probe configuration example
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

Dashboard Signals
The GitOpsHQ dashboard organizes health signals into four categories. Each provides a different lens on your operational state.
Fleet Health shows the connectivity and sync status of all registered clusters.
| Indicator | Meaning |
|---|---|
| All green | Every cluster connected, every workload synced |
| Yellow badge | One or more clusters have stale heartbeats or pending syncs |
| Red badge | At least one cluster is disconnected or has failed syncs |
What to do: Click any non-green cluster to drill into agent status, sync errors, and resource health.
Project Readiness shows deployment status across all projects and environments.
| Indicator | Meaning |
|---|---|
| Ready | All environments deployed to their target state |
| Pending | Releases awaiting approval or deployment |
| Drift | Actual state differs from desired state |
| Failed | Last deployment failed |
What to do: Focus on drift and failed states first. Pending states may be normal if approvals are in progress.
Governance Compliance shows the state of policy evaluations and approval workflows.
| Indicator | Meaning |
|---|---|
| Compliant | All active policies are satisfied |
| Violations | One or more OPA policy violations exist |
| Blocked | Releases are blocked by unmet approval requirements |
| Break-glass active | An emergency session is currently open |
What to do: Investigate violations — they may indicate configuration errors. Blocked releases need attention from approvers.
Activity Pulse shows recent operational activity across the organization.
| Indicator | Meaning |
|---|---|
| Active | Multiple deployments, approvals, and configuration changes in the last hour |
| Quiet | Minimal activity — may be normal outside business hours |
| Anomalous | Unusual patterns (e.g., high rollback rate, sudden spike in break-glass usage) |
What to do: Anomalous activity warrants investigation. Correlate with the audit log.
Key Metrics to Monitor
GitOpsHQ exposes the following metrics via the /metrics endpoint in Prometheus format.
Delivery Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_releases_total` | Total releases created (by status: success/failed/rolledback) | Trending upward (more deploys = faster delivery) |
| `gitopshq_release_duration_seconds` | Time from release creation to deployment completion | p95 < 300s (5 minutes) |
| `gitopshq_rollbacks_total` | Total rollback operations | < 2% of total releases |
| `gitopshq_promotion_duration_seconds` | Time to complete a full promotion pipeline | Depends on approval SLA |
| `gitopshq_delivery_generator_duration_seconds` | Time to render manifests | p95 < 30s |
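One way to turn the delivery thresholds above into queryable series is a set of Prometheus recording rules. This is a sketch, not shipped configuration: it assumes `gitopshq_release_duration_seconds` is exported as a Prometheus histogram (i.e., a `_bucket` series exists), and the rule names are illustrative.

```yaml
# Sketch: precompute the rollback ratio (healthy: < 0.02) and release-duration
# p95 (healthy: < 300s) from the metrics documented above.
groups:
  - name: gitopshq-delivery-recording
    rules:
      - record: gitopshq:rollback_ratio:1h
        expr: |
          sum(rate(gitopshq_rollbacks_total[1h]))
            /
          sum(rate(gitopshq_releases_total[1h]))
      - record: gitopshq:release_duration_seconds:p95
        # Assumes histogram buckets are exported for this metric
        expr: histogram_quantile(0.95, sum(rate(gitopshq_release_duration_seconds_bucket[5m])) by (le))
```

Recording rules keep dashboard and alert queries cheap, since the ratio and quantile are evaluated once per rule interval instead of per panel refresh.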
Governance Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_approval_pending_duration_seconds` | Time releases spend in approval queue | p95 < 1800s (30 minutes) |
| `gitopshq_approval_decisions_total` | Total approvals (by decision: approved/rejected) | Rejection rate < 10% |
| `gitopshq_opa_evaluations_total` | OPA policy evaluations (by result: pass/warn/deny) | Deny rate trending downward |
| `gitopshq_breakglass_sessions_total` | Break-glass session count | Close to zero (emergency only) |
Runtime Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_agent_heartbeat_age_seconds` | Time since last agent heartbeat per cluster | < 60s |
| `gitopshq_drift_detected_total` | Drift detection events by environment | Zero in production |
| `gitopshq_sync_failures_total` | Failed cluster sync operations | < 1% of total syncs |
| `gitopshq_webhook_deliveries_total` | Webhook delivery attempts (by status: success/failure) | Success rate > 99% |
Automation Metrics
| Metric | Description | Healthy Threshold |
|---|---|---|
| `gitopshq_api_requests_total` | API requests by service account | Within rate limits |
| `gitopshq_webhook_inbound_total` | Inbound webhook events processed | No errors |
Daily Monitoring Checklist
Run through this checklist at the start of each operational shift:

- Fleet Health: confirm all clusters are green; drill into any stale heartbeats or failed syncs.
- Project Readiness: resolve Drift and Failed states first; confirm Pending releases are awaiting expected approvals.
- Governance Compliance: clear OPA violations, unblock releases, and verify no unexpected break-glass sessions are open.
- Activity Pulse: investigate anomalous patterns and correlate them with the audit log.
Integration with Observability Platforms
Prometheus + Grafana
GitOpsHQ's /metrics endpoint is compatible with standard Prometheus scraping.
```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'gitopshq'
    metrics_path: '/metrics'
    scheme: 'https'
    bearer_token_file: '/etc/prometheus/gitopshq-token'
    static_configs:
      - targets: ['api.gitopshq.io:443']
    scrape_interval: 30s
```

Recommended Grafana dashboards:
| Dashboard | Panels | Purpose |
|---|---|---|
| Delivery Overview | Release rate, failure rate, rollback rate, duration histograms | Delivery health at a glance |
| Governance | Approval latency, OPA violation trend, break-glass frequency | Policy effectiveness |
| Runtime | Agent connectivity, drift count, sync failures | Cluster health |
| Automation | API request rate, webhook delivery success, service account activity | CI/CD integration health |
Datadog
Export GitOpsHQ metrics to Datadog via Prometheus-to-Datadog bridge or the OpenMetrics integration.
```yaml
# Datadog agent config
instances:
  - prometheus_url: https://api.gitopshq.io/metrics
    namespace: gitopshq
    metrics:
      - gitopshq_releases_total
      - gitopshq_agent_heartbeat_age_seconds
      - gitopshq_drift_detected_total
    headers:
      Authorization: Bearer <GITOPSHQ_TOKEN>
```

PagerDuty / Opsgenie
Configure webhook notifications in GitOpsHQ to send alerts to your incident management platform:
- Critical: Agent disconnected > 5 min, production deployment failed, break-glass session opened
- Warning: Production drift detected, approval queue > 1 hour, rollback executed
- Info: Release deployed successfully, promotion completed
Distributed Tracing
GitOpsHQ supports OpenTelemetry Protocol (OTLP) for distributed tracing. Every API request generates a trace with spans for:
- Request authentication and authorization
- Policy evaluation (OPA + approval checks)
- Delivery generation (template rendering, Git commit)
- Agent communication (sync trigger, health check)
Configure the OTLP exporter endpoint in GitOpsHQ settings to send traces to your Jaeger, Zipkin, or cloud tracing backend. Traces include the X-Correlation-ID header value for cross-system correlation.
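If an OpenTelemetry Collector sits between GitOpsHQ and your tracing backend, a minimal receive-and-forward pipeline could look like the following. This is an illustrative sketch: the listener and Jaeger endpoints are placeholders for your own deployment, and TLS is disabled here only for brevity.

```yaml
# OpenTelemetry Collector: accept OTLP traces from GitOpsHQ and
# forward them to a Jaeger backend over OTLP/gRPC.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # point GitOpsHQ's OTLP exporter here
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # placeholder backend address
    tls:
      insecure: true   # for illustration only; enable TLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

The same pipeline works for Zipkin or a cloud tracing backend by swapping the exporter; the `X-Correlation-ID` value on each trace lets you join spans against audit-log entries.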
Alerting Recommendations
Critical Alerts (Page Immediately)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Agent disconnected | > 5 minutes | Investigate network/agent health |
| Production deployment failed | Any failure | Check delivery generator and sync logs |
| Break-glass session opened | Any session | Monitor and prepare for post-incident review |
| Multiple rollbacks in 1 hour | > 2 rollbacks | Investigate systemic quality issue |
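The critical conditions above can be expressed as Prometheus alerting rules. This is a sketch under assumptions: metric names come from the `/metrics` tables earlier in this page, alert names and `for` durations are illustrative, and routing to PagerDuty/Opsgenie is left to your Alertmanager configuration.

```yaml
# Sketch: critical alerts mirroring the table above.
groups:
  - name: gitopshq-critical
    rules:
      - alert: GitOpsHQAgentDisconnected
        # Agent heartbeat older than 5 minutes (table threshold: > 5 min)
        expr: gitopshq_agent_heartbeat_age_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster agent heartbeat is older than 5 minutes"
      - alert: GitOpsHQRollbackSpike
        # More than 2 rollbacks in the last hour (table threshold: > 2)
        expr: sum(increase(gitopshq_rollbacks_total[1h])) > 2
        labels:
          severity: critical
        annotations:
          summary: "More than 2 rollbacks in the last hour"
```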
Warning Alerts (Investigate Within 1 Hour)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Production drift detected | Any drift | Determine if intentional or accidental |
| Approval queue aging | > 1 hour pending | Notify approvers or escalate |
| Webhook delivery failures | > 5% failure rate | Check endpoint health and credentials |
| OPA violations on active releases | Any violation | Review and resolve before deployment |
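Two of the warning conditions can be sketched the same way. Note the label names here (`environment` on the drift metric, `status` on the webhook metric) are assumptions inferred from the metric descriptions, not confirmed label schemas; verify them against your actual `/metrics` output.

```yaml
# Sketch: warning alerts mirroring the table above.
groups:
  - name: gitopshq-warning
    rules:
      - alert: GitOpsHQProductionDrift
        # Any production drift event in the last 15 minutes
        expr: increase(gitopshq_drift_detected_total{environment="production"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Drift detected in production"
      - alert: GitOpsHQWebhookFailures
        # Webhook failure rate above 5% over 30 minutes
        expr: |
          sum(rate(gitopshq_webhook_deliveries_total{status="failure"}[30m]))
            /
          sum(rate(gitopshq_webhook_deliveries_total[30m])) > 0.05
        labels:
          severity: warning
        annotations:
          summary: "Webhook delivery failure rate above 5%"
```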
Informational Alerts (Review Daily)
| Condition | Threshold | Recommended Action |
|---|---|---|
| Release deployed | Every deploy | Track deployment frequency trend |
| Promotion completed | Every promotion | Track promotion success rate |
| New break-glass session closed | Every closure | Add to post-incident review queue |
Signal Views Deep Dive
Delivery Health
Delivery health tells you whether your deployment pipeline is functioning correctly.
- Release progression: Track releases as they move through Draft → Approved → Deployed → Healthy. Bottlenecks appear as accumulation at any stage.
- Promotion lead time: The time from the first environment deployment to the final production deployment. Long lead times indicate governance overhead or testing gaps.
- Rollback volume: A sudden increase in rollbacks signals a quality problem in the development or testing process. Investigate the root cause rather than just rolling back.
Runtime Health
Runtime health tells you whether your live clusters match the desired state.
- Connected/disconnected trend: A gradual increase in disconnected clusters may indicate network policy changes, certificate expiration, or agent version drift.
- Drift correlation: Drift that appears immediately after a deployment may indicate the deployment was partially applied. Drift that appears gradually may indicate manual cluster edits or operator interference.
- Sync timing: If syncs are consistently slow, check cluster resource availability and agent pod health.
Governance Health
Governance health tells you whether your policies are effective and not creating unnecessary friction.
- Approval aging: If approvals consistently take too long, consider adjusting quorum requirements or adding more approved reviewers.
- Break-glass frequency: More than one break-glass session per month suggests a gap in normal operational capabilities. Review policies and permissions.
- OPA evaluation trend: A decreasing denial rate over time indicates teams are learning and adapting to the policies. A persistent high denial rate may mean the policy is too restrictive or unclear.