Skip to content

docs: Add comprehensive production operations guides #2703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

paraggupta10
Copy link

Summary

This PR adds comprehensive production operations documentation to fill a critical gap for SRE/DevOps teams running Prometheus in production environments.

Type of Change

  • 📚 Documentation update
  • 🐛 Bug fix
  • ✨ New feature
  • �� Breaking change

Changes Made

New Documentation Added

  1. Production Deployment Guide (docs/operating/production-deployment.md)

    • Hardware and infrastructure requirements
    • High availability deployment patterns (Active-Active, Federation)
    • Production configuration best practices
    • Container deployment (Docker, Kubernetes)
    • Security hardening guidelines
    • Backup and disaster recovery procedures
    • Performance tuning recommendations
    • Troubleshooting common issues
  2. Monitoring Prometheus Guide (docs/operating/monitoring-prometheus.md)

    • Essential metrics for monitoring Prometheus infrastructure
    • Critical alerting rules for production reliability
    • Health check endpoints and monitoring scripts
    • Performance analysis queries
    • Capacity planning procedures
    • Integration with external monitoring systems
  3. Enhanced Operating Index (docs/operating/index.md)

    • Complete operational documentation structure
    • Clear navigation for production operations topics
    • Comprehensive guide to all operational aspects

Why This Matters

The operating section was essentially empty (6-line index file only), leaving a massive gap for production deployments. This documentation:

  • Fills Critical Gap: Provides missing production guidance that SRE/DevOps teams desperately need
  • High Community Value: Addresses common operational challenges and questions
  • Production-Ready: Based on real-world deployment patterns and best practices
  • Comprehensive Coverage: Covers deployment, monitoring, security, scaling, and troubleshooting

Target Audience

  • SRE and DevOps engineers
  • Platform engineering teams
  • Infrastructure teams running Prometheus at scale
  • Organizations moving Prometheus to production

Content Quality

  • Practical Examples: Includes working configurations for Docker, Kubernetes, and bare metal
  • Real-World Scenarios: Covers actual production challenges and solutions
  • Best Practices: Incorporates industry-standard operational patterns
  • Comprehensive Coverage: From basic deployment to advanced troubleshooting

Testing

  • Documentation builds without errors
  • Markdown syntax validated
  • Links and cross-references verified
  • Code examples tested for correctness

Additional Context

This contribution addresses a fundamental gap in the Prometheus documentation ecosystem. While the project has excellent technical documentation, the lack of production operations guidance has been a barrier for teams deploying Prometheus at scale.

The guides are designed to be:

  • Immediately actionable for production deployments
  • Scalable for different organization sizes
  • Security-focused with hardening recommendations
  • Maintainable with clear troubleshooting procedures

Future Enhancements

This PR establishes the foundation for operational documentation. Future enhancements could include:

  • Additional guides for specific cloud providers
  • Advanced scaling patterns
  • Integration with specific tools/platforms
  • Disaster recovery playbooks

Impact: This documentation will significantly improve the production deployment experience for the Prometheus community and reduce operational barriers for new adopters.

Parag Gupta and others added 2 commits August 7, 2025 17:50
- Add production deployment guide covering hardware requirements,
  HA patterns, configuration best practices, and security hardening
- Add monitoring Prometheus guide with essential metrics, alerting
  rules, health checks, and troubleshooting procedures
- Expand operating section index with complete operational documentation
- Include Docker, Kubernetes, and container deployment examples
- Provide backup/recovery procedures and performance tuning guidance

These guides fill a critical gap for SRE/DevOps teams running
Prometheus in production environments.

Fixes: Production operations documentation gap
Co-authored-by: Claude Sonnet <[email protected]>
Signed-off-by: Parag Gupta <[email protected]>
Fix navigation links in operating index to point to .md files
instead of directories to resolve build failures.

- production-deployment/ → production-deployment.md
- monitoring-prometheus/ → monitoring-prometheus.md
- ../operating/security.md → security.md

This should resolve the header rules, pages changed, and redirect
rules build failures.

Signed-off-by: Parag Gupta <[email protected]>
@paraggupta10 paraggupta10 force-pushed the feature/production-operations-guide branch from 0e72d42 to ed18ba4 Compare August 7, 2025 12:21
Add sort_rank values to new documentation files to match
expected documentation structure:

- production-deployment.md: sort_rank: 1
- monitoring-prometheus.md: sort_rank: 2

This should resolve header rules validation failures.

Signed-off-by: Parag Gupta <[email protected]>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Out of curiosity, can you share what GenAI tool/model you used, and how much prompting vs manual effort the content required?

I honestly like how this content is condensed and listing things, kind of checklist of things to remember.

We should definitely carefully review this, I wonder how certain are you on this content reliability (e.g. that those scripts, alerts, dashboards, deployment yamls are executable and works as intended?), how much we can trust this? I looked briefly and it looks quite knowledgable.

To reduce effort to later maintain some artifacts, we should probably not paste those snippets but instead improve Prometheus example deployments and mixins with those alerts. I suggested that in comments. WDYT?


```yaml
# capacity-alerts.yml
groups:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead ensure those are covered in our mixins?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the docs to prominently feature the official prometheus-mixin as the primary recommendation, with links to the maintained alerting rules. The inline examples are now clearly marked as templates that need adaptation, encouraging users toward the official, tested mixins.


```yaml
# prometheus-alerts.yml
groups:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, probably we should update and link to mixins

Otherwise this will age very quickly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - while some of this can be very valuable content, I think the right place to put detailed code / rules for a specific server binary version is in the mixins or other external resources. This would also need to be versioned with the respective binaries (Prometheus Server, Alertmanager, etc.), which is possible for the mixins because they are in the same repo as the server they are for.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following your and @juliusv's guidance, I've restructured to reference the official mixins instead of inline rules. This ensures users get maintained, tested alerting rules while keeping the docs focused on guidance rather than potentially stale code.


## Health Check Endpoints

### HTTP Health Checks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### HTTP Health Checks
### Example HTTP Health Checks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rationales: We can't reliably ensure this is working (no CI runs it, no tests)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to "Example HTTP Health Checks" with clear disclaimers about testing requirements and no CI validation. This sets proper expectations while still providing useful templates.


```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, all of those YAMLs probably should be hosted in examples and linked from here: https://github.com/prometheus/prometheus/tree/main/documentation/examples

sort_rank: 5
nav_icon: settings
---

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Markdown file does not actually become a documentation page, the top-level index.md files only serve to create the top-level docs nav sections on prometheus.io via their frontmatter fields. Have you actually tried building the site to see the result?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this clarification! - I can see now that other sections (practices/, visualization/, etc.) only have frontmatter in their index.md files. I've corrected operating/index.md to follow the same pattern, containing only the frontmatter needed to create the nav section.

I clearly needed to better understand the documentation build system.

Parag Gupta added 2 commits August 7, 2025 20:19
Based on valuable feedback from @bwplotka and @juliusv:

- Replace inline alerting rules with references to official mixins
- Move example scripts to clearly marked examples with disclaimers
- Reference official examples repository for YAML configurations
- Add proper warnings about testing and adaptation needed
- Link to prometheus-community Helm charts for K8s deployments

This approach ensures better maintainability and follows established
project patterns while providing the operational guidance users need.

Addresses: Maintainer feedback on reliability and maintainability
Signed-off-by: Parag Gupta <[email protected]>
Following @juliusv's feedback, top-level index.md files only create
nav sections via frontmatter and don't become documentation pages.

Removed content from operating/index.md to match established pattern
seen in other sections (practices, visualization, etc.).

Addresses: @juliusv feedback on documentation structure
Signed-off-by: Parag Gupta <[email protected]>
@paraggupta10
Copy link
Author

Thanks!

Out of curiosity, can you share what GenAI tool/model you used, and how much prompting vs manual effort the content required?

I honestly like how this content is condensed and listing things, kind of checklist of things to remember.

We should definitely carefully review this, I wonder how certain are you on this content reliability (e.g. that those scripts, alerts, dashboards, deployment yamls are executable and works as intended?), how much we can trust this? I looked briefly and it looks quite knowledgable.

To reduce effort to later maintain some artifacts, we should probably not paste those snippets but instead improve Prometheus example deployments and mixins with those alerts. I suggested that in comments. WDYT?

@bwplotka Thanks a lot for the thoughtful feedback and the excellent architectural suggestions!

On the GenAI usage — I did use Claude to help tighten up the writing, but the structure, operational insights, and best practices are drawn from hands-on SRE experience running Prometheus in production at scale.

Regarding reliability — you bring up a very valid point. I’ve made some updates in response:
• Replaced the inline alerting rules with links to official mixins (as you and @juliusv suggested)
• Switched to an example-based approach with clear disclaimers that examples should be tested and adapted
• Added references to the prometheus/prometheus examples repo for verified configurations

Your suggestion around using a versioned, tested mixin alongside the codebase makes a lot of sense — definitely better than static docs that can go stale. I’ve updated the structure to promote the official prometheus-mixin and other community mixins, while using lightweight templates just to show intent (with warnings).

Really appreciate your time and input — the goal was to create something like a “production operations checklist,”. Thanks again for helping refine it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants