Platform Playbook

Governance and infrastructure guide for platform engineers managing organizations, RBAC, clusters, registries, and policies in GitOpsHQ.

Your Role in GitOpsHQ

As a Platform Engineer, you are the architect of your organization's GitOps foundation. You design the security model, manage cluster infrastructure, curate reusable artifacts, author governance policies, and establish the conventions that every other role follows. Your decisions determine the safety, consistency, and velocity of the entire delivery pipeline.

You set the guardrails — DevOps, Developers, and SREs operate within the boundaries you define.

Your Access Scope

You typically hold the Admin role at the organization level, giving you full access to all settings, policies, clusters, and registry operations. Use this power carefully — follow least-privilege principles when granting access to others.

Organization Governance

Setting Up a New Organization

Create the Organization — Navigate to Organizations → New. Choose a name that reflects the business unit or product line (e.g., acme-payments). Configure the default timezone, billing contact, and notification preferences.

Define Environments — Go to Settings → Environments and create the delivery stages your organization uses. The most common pattern is development → staging → production, but you can add more (e.g., qa, performance, canary). Set the ordering rank to define the promotion sequence.

Establish Environment Policies — For each environment, decide on the governance controls:

Environment	Approval Required	Freeze Support	OPA Enforcement
Development	No	No	Warn only
Staging	Optional (1 approver)	Yes	Warn only
Production	Required (2 approvers, distinct teams)	Yes	Enforce

Configure Notification Channels — Set up Slack, Microsoft Teams, or webhook integrations for deployment notifications, approval requests, and alerts. Route different event types to appropriate channels.
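
The environment controls table above (approval, freeze support, OPA enforcement) can be written down as data before it is entered in the UI, which makes the intended behavior easy to review. The following is a minimal Python sketch, not a GitOpsHQ API: the `EnvironmentPolicy` fields and the `deployment_blocked` helper are hypothetical names chosen for illustration, and the values simply mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentPolicy:
    name: str
    approvers_required: int   # 0 means no approval gate
    approval_optional: bool   # True when the gate may be skipped
    freeze_supported: bool
    opa_mode: str             # "warn" or "enforce"
    distinct_teams: bool = False


# Values mirror the table above; the field names are illustrative assumptions.
POLICIES = {
    p.name: p
    for p in [
        EnvironmentPolicy("development", 0, False, False, "warn"),
        EnvironmentPolicy("staging", 1, True, True, "warn"),
        EnvironmentPolicy("production", 2, False, True, "enforce", distinct_teams=True),
    ]
}


def deployment_blocked(env: str, opa_violations: int, freeze_active: bool) -> bool:
    """Return True when a deployment to `env` should be stopped."""
    policy = POLICIES[env]
    # A freeze only applies where the environment supports freezes.
    if freeze_active and policy.freeze_supported:
        return True
    # In "warn" mode violations are reported but do not block.
    return opa_violations > 0 and policy.opa_mode == "enforce"
```

With these values, a policy violation stops a production deployment but only produces a warning in development and staging.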

RBAC Setup

A well-designed RBAC model balances security with developer productivity. The goal is least-privilege access that does not create friction for routine operations.

Role Hierarchy

Role	Scope	Typical Holder	Capabilities
Admin	Organization	Platform Engineer	Full access to all settings, policies, clusters, and registry
Operator	Organization	SRE	Cluster management, diagnostics, rollback, break-glass
Member	Organization	Developer, DevOps	Project access governed by resource grants

Team Design Principles

Mirror your organizational structure: Create teams that match real reporting lines or functional groups. This makes approval policies meaningful (e.g., "requires approval from a different team").
Keep teams between 3 and 10 members: Too small and approval routing becomes a bottleneck. Too large and accountability is diluted.
Use service accounts for automation: Never share human credentials with CI/CD systems. Create dedicated service accounts with scoped tokens.

Resource Grant Patterns

Strict Model — Each team has write access only to their own projects. Cross-project access is read-only. Production access requires explicit grants.

backend-team → backend-projects (write)
backend-team → all-other-projects (read)
frontend-team → frontend-projects (write)
frontend-team → all-other-projects (read)
sre-team → all-projects (read + rollback)

Best for: regulated industries, large organizations, compliance-sensitive environments.

Collaborative Model — Teams have write access to related projects and broader read access. Production still requires approval gates.

product-team → product-projects (write, all envs)
product-team → shared-infra (read)
platform-team → all-projects (admin)

Best for: startups, small teams, fast-moving product organizations.
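
To reason about a grant model before rolling it out, it can help to encode the grants and test a few access questions against them. The sketch below models the Strict Model listing in Python. It is illustrative only: the `Grant` type and `is_allowed` function are hypothetical, "all-other-projects (read)" is approximated with a wildcard read grant, and write access is assumed to include read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    team: str
    project_group: str       # "*" matches every project group
    permissions: frozenset


# Strict Model grants from the listing above (write is assumed to imply read).
STRICT_GRANTS = [
    Grant("backend-team", "backend-projects", frozenset({"read", "write"})),
    Grant("backend-team", "*", frozenset({"read"})),
    Grant("frontend-team", "frontend-projects", frozenset({"read", "write"})),
    Grant("frontend-team", "*", frozenset({"read"})),
    Grant("sre-team", "*", frozenset({"read", "rollback"})),
]


def is_allowed(grants, team: str, project_group: str, permission: str) -> bool:
    """Default deny: allow only if some grant matches team, group, and permission."""
    return any(
        g.team == team
        and g.project_group in ("*", project_group)
        and permission in g.permissions
        for g in grants
    )
```

For example, `is_allowed(STRICT_GRANTS, "backend-team", "frontend-projects", "write")` is false, while the same query with `"read"` is true.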

Cluster Management

Registration Workflow

Register the cluster in GitOpsHQ with a name and description. Choose a name that reflects the cluster's purpose and region (e.g., prod-us-east-1, staging-eu-west).

Install the agent using the generated Helm command. The agent runs in a dedicated namespace (gitopshq-system) and establishes a secure outbound connection.

Verify connectivity — The cluster should show as Connected within 60 seconds. If it does not, check network egress rules and agent logs.

Map to environments — Assign the cluster to one or more environments. A cluster can serve multiple environments (via namespace isolation) or be dedicated to a single environment.

Cluster Health Monitoring

Check these signals regularly:

Signal	Healthy	Investigate
Agent heartbeat	Last seen < 60s ago	Last seen > 5 minutes ago
Sync status	All workloads synced	Any workload out of sync
Resource capacity	CPU/memory below 80%	Consistently above 85%
Drift count	Zero	Any non-zero count
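
The thresholds in the table above translate directly into a small classifier, sketched here in Python. The `ClusterSignals` fields are hypothetical and do not correspond to a documented GitOpsHQ data format. Values that fall between the "Healthy" and "Investigate" columns are reported as "watch", which is an assumption of this sketch, since the table leaves that range undefined.

```python
from dataclasses import dataclass


@dataclass
class ClusterSignals:
    heartbeat_age_s: float       # seconds since the agent was last seen
    out_of_sync_workloads: int
    cpu_pct: float
    memory_pct: float
    drift_count: int


def band(value: float, healthy_below: float, investigate_above: float) -> str:
    """Classify a value against a healthy ceiling and an investigate floor."""
    if value < healthy_below:
        return "healthy"
    if value > investigate_above:
        return "investigate"
    return "watch"  # between the two thresholds (assumed label)


def classify(s: ClusterSignals) -> dict:
    return {
        "agent_heartbeat": band(s.heartbeat_age_s, 60, 300),
        "sync_status": "healthy" if s.out_of_sync_workloads == 0 else "investigate",
        # One sample cannot show "consistently above 85%"; callers would pass
        # an average over a time window here.
        "resource_capacity": band(max(s.cpu_pct, s.memory_pct), 80, 85),
        "drift_count": "healthy" if s.drift_count == 0 else "investigate",
    }
```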

Registry Curation

The registry is your organization's library of reusable deployment templates. A well-curated registry eliminates configuration drift between projects and reduces onboarding time for new services.

Chart Governance

Naming convention — Use lowercase with hyphens: microservice-base, cronjob-template, statefulset-postgres.
Versioning — Follow semantic versioning strictly. Breaking template changes require a major version bump.
Deprecation — When retiring a chart version, mark it as deprecated in the registry. The adoption dashboard shows which projects still use it. Give teams a migration window before removal.
Validation — Every chart publish runs lint checks. Consider adding custom OPA policies that enforce chart standards (e.g., all charts must include health checks).
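
The naming and versioning rules above are simple enough to check mechanically, for example in a pre-publish CI step. The following Python sketch is one possible check under stated assumptions: digits are allowed in chart names, versions use plain MAJOR.MINOR.PATCH without pre-release or build suffixes, and the function names are hypothetical rather than part of the registry.

```python
import re

# Lowercase words separated by single hyphens, e.g. "microservice-base".
CHART_NAME = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
# Simplified semantic version: MAJOR.MINOR.PATCH only.
SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def valid_chart_name(name: str) -> bool:
    return CHART_NAME.fullmatch(name) is not None


def parse_version(version: str) -> tuple:
    match = SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def bump_is_valid(old: str, new: str, breaking: bool) -> bool:
    """A new version must increase; a breaking change must raise MAJOR."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        return False
    if breaking:
        return new_v[0] > old_v[0]
    return True
```

Under these rules, publishing a breaking template change as `1.5.0` after `1.4.2` is rejected, while `2.0.0` is accepted.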

Policy Scopes

Policies apply at three scopes, from broadest to narrowest:

Organizational policies: Apply everywhere. Examples: all containers must have resource limits, all images must come from approved registries.
Environment policies: Apply to specific environments. Examples: production replicas must be >= 2, staging must not use the latest tag.
Project policies: Apply to specific projects. Example: this project must use a specific ingress class.
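
The organization, environment, and project policy levels listed above combine additively: a deployment is checked against every rule whose scope matches it. The Python sketch below illustrates that layering using the example rules from the list. It is not how OPA policies are written (those use Rego), and the manifest shape, the `registry.example.com/` prefix, and the `payments-api` project are all hypothetical.

```python
from typing import Callable, NamedTuple

APPROVED_REGISTRIES = ("registry.example.com/",)  # hypothetical allow-list


class Rule(NamedTuple):
    scope: str                      # "organization", "environment:<name>", "project:<name>"
    message: str
    check: Callable[[dict], bool]   # returns True when the manifest complies


def uses_latest(image: str) -> bool:
    # Docker and Kubernetes treat an untagged image as ":latest".
    last = image.rsplit("/", 1)[-1]
    return ":" not in last or last.endswith(":latest")


RULES = [
    Rule("organization", "all containers must have resource limits",
         lambda m: all(c.get("limits") for c in m["containers"])),
    Rule("organization", "images must come from approved registries",
         lambda m: all(c["image"].startswith(APPROVED_REGISTRIES) for c in m["containers"])),
    Rule("environment:production", "production replicas must be >= 2",
         lambda m: m.get("replicas", 1) >= 2),
    Rule("environment:staging", "staging must not use latest tag",
         lambda m: not any(uses_latest(c["image"]) for c in m["containers"])),
    Rule("project:payments-api", "must use the internal ingress class",
         lambda m: m.get("ingress_class") == "internal"),
]


def violations(manifest: dict, environment: str, project: str) -> list:
    """Collect messages from every rule whose scope applies to this deployment."""
    active = {"organization", f"environment:{environment}", f"project:{project}"}
    return [r.message for r in RULES if r.scope in active and not r.check(manifest)]
```

Whether a non-empty result blocks the deployment or only produces a warning would depend on the enforcement mode configured for the target environment.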

Approval Policy Design

Scenario	Recommended Policy
Development deployment	No approval required
Staging deployment	1 approval from any team
Production deployment	2 approvals from distinct teams
Production rollback	1 approval (fast-path)
Break-glass session	Self-authorized with mandatory justification
Policy change	Admin approval required
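
The deployment and rollback rows of the table above reduce to a quorum check: count valid approvals and, where required, count the distinct teams behind them. The Python sketch below shows one way to express this. The types and function are hypothetical, and excluding the requester's own approval is an assumption of the sketch rather than documented GitOpsHQ behavior. Break-glass sessions and policy changes follow different flows and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    approver: str
    team: str


@dataclass(frozen=True)
class ApprovalRule:
    required: int
    distinct_teams: bool = False


# Values taken from the table above; the keys are illustrative names.
APPROVAL_RULES = {
    "development-deploy": ApprovalRule(0),
    "staging-deploy": ApprovalRule(1),
    "production-deploy": ApprovalRule(2, distinct_teams=True),
    "production-rollback": ApprovalRule(1),
}


def quorum_met(action: str, approvals: list, requester: str) -> bool:
    rule = APPROVAL_RULES[action]
    # Each person counts once, and the requester cannot self-approve (assumption).
    valid = {a.approver: a for a in approvals if a.approver != requester}
    if rule.distinct_teams:
        return len({a.team for a in valid.values()}) >= rule.required
    return len(valid) >= rule.required
```

Two approvals from the same team do not satisfy the production rule here, which is the property the "distinct teams" requirement is meant to guarantee.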

Environment Strategy

Multi-Environment Patterns

development → staging → production

Best for: most organizations. Simple, well-understood, sufficient for typical delivery workflows.

development → qa → performance → staging → production

Best for: organizations with dedicated QA or performance testing phases.

development → staging-tenant-A → staging-tenant-B → production

Best for: multi-tenant SaaS products where tenant-specific validation is required before production rollout.
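
Each pattern above is an ordered chain, and the ordering rank set on each environment is what defines it. As a rough Python illustration, the sketch below derives the promotion sequence from ranks and checks whether a promotion targets the next stage. The rank values and function names are hypothetical, and whether GitOpsHQ permits skipping stages is not stated here, so the strict next-stage rule is an assumption.

```python
from typing import Optional


def promotion_sequence(ranks: dict) -> list:
    """Order environment names by ascending rank."""
    return sorted(ranks, key=ranks.get)


def next_environment(ranks: dict, current: str) -> Optional[str]:
    sequence = promotion_sequence(ranks)
    index = sequence.index(current)
    return sequence[index + 1] if index + 1 < len(sequence) else None


def can_promote(ranks: dict, source: str, target: str) -> bool:
    # Assumes promotions may only move to the immediately following stage.
    return next_environment(ranks, source) == target


# Gaps between ranks leave room to insert a stage later without renumbering.
EXTENDED = {
    "development": 10,
    "qa": 20,
    "performance": 30,
    "staging": 40,
    "production": 50,
}
```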

Team Onboarding

When onboarding a new team to GitOpsHQ, follow this checklist:

Create the team in Settings → Teams with an appropriate name.

Invite members via email or SSO provisioning.

Assign the organization role (typically Member for development teams).

Create resource grants scoped to the projects the team will work on.

Set up a service account for the team's CI pipeline.

Walk through the Developer or DevOps playbook depending on the team's role.

Verify access by having a team member perform a test action and checking the audit log.

Document the team's access model in your organization's RBAC documentation.

Governance Anti-Patterns

Avoid these common mistakes:

Anti-Pattern	Why It Is Harmful	Better Approach
Granting Admin to avoid designing RBAC	Eliminates accountability, creates security risk	Invest time in proper resource grants
Using break-glass for routine operations	Erodes trust in the emergency system	Fix the underlying permission or policy gap
Production without quorum requirements	Single point of failure for bad deployments	Require at least 2 approvals from distinct teams
No environment freeze capability	Cannot safely plan maintenance windows	Configure freeze support for staging and production
Stale service accounts	Unused tokens are security risks	Rotate tokens quarterly, remove unused accounts
Overly broad OPA policies	Teams work around policies instead of following them	Start narrow, expand based on real incidents

Weekly Review Checklist

Review the audit log for unusual patterns (unexpected access, repeated failures)
Check service account token ages and rotate any older than 90 days
Review pending resource grant requests
Check registry chart adoption and nudge teams lagging on upgrades
Review approval queue latency — if consistently high, consider adjusting quorum or adding approvers
Verify cluster agent health across all registered clusters
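
The token-age item in the checklist above is easy to automate if token creation times can be exported. The Python sketch below assumes a plain mapping of service account name to creation timestamp; how that data is obtained from GitOpsHQ is not covered here, and the account names are made up.

```python
from datetime import datetime, timedelta, timezone

MAX_TOKEN_AGE = timedelta(days=90)


def tokens_to_rotate(created_at: dict, now: datetime) -> list:
    """Return service account names whose token is older than 90 days.

    `created_at` maps account name to a timezone-aware creation time.
    """
    return sorted(
        name for name, created in created_at.items()
        if now - created > MAX_TOKEN_AGE
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {
        "ci-backend": now - timedelta(days=120),   # hypothetical accounts
        "ci-frontend": now - timedelta(days=30),
    }
    print(tokens_to_rotate(example, now))
```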