docs: Add comprehensive production operations guides #2703

paraggupta10 · 2025-08-07T12:13:49Z

Summary

This PR adds comprehensive production operations documentation to fill a critical gap for SRE/DevOps teams running Prometheus in production environments.

Type of Change

📚 Documentation update
🐛 Bug fix
✨ New feature
�� Breaking change

Changes Made

New Documentation Added

Production Deployment Guide (docs/operating/production-deployment.md)
- Hardware and infrastructure requirements
- High availability deployment patterns (Active-Active, Federation)
- Production configuration best practices
- Container deployment (Docker, Kubernetes)
- Security hardening guidelines
- Backup and disaster recovery procedures
- Performance tuning recommendations
- Troubleshooting common issues
Monitoring Prometheus Guide (docs/operating/monitoring-prometheus.md)
- Essential metrics for monitoring Prometheus infrastructure
- Critical alerting rules for production reliability
- Health check endpoints and monitoring scripts
- Performance analysis queries
- Capacity planning procedures
- Integration with external monitoring systems
Enhanced Operating Index (docs/operating/index.md)
- Complete operational documentation structure
- Clear navigation for production operations topics
- Comprehensive guide to all operational aspects

Why This Matters

The operating section was essentially empty (6-line index file only), leaving a massive gap for production deployments. This documentation:

Fills Critical Gap: Provides missing production guidance that SRE/DevOps teams desperately need
High Community Value: Addresses common operational challenges and questions
Production-Ready: Based on real-world deployment patterns and best practices
Comprehensive Coverage: Covers deployment, monitoring, security, scaling, and troubleshooting

Target Audience

SRE and DevOps engineers
Platform engineering teams
Infrastructure teams running Prometheus at scale
Organizations moving Prometheus to production

Content Quality

Practical Examples: Includes working configurations for Docker, Kubernetes, and bare metal
Real-World Scenarios: Covers actual production challenges and solutions
Best Practices: Incorporates industry-standard operational patterns
Comprehensive Coverage: From basic deployment to advanced troubleshooting

Testing

Documentation builds without errors
Markdown syntax validated
Links and cross-references verified
Code examples tested for correctness

Additional Context

This contribution addresses a fundamental gap in the Prometheus documentation ecosystem. While the project has excellent technical documentation, the lack of production operations guidance has been a barrier for teams deploying Prometheus at scale.

The guides are designed to be:

Immediately actionable for production deployments
Scalable for different organization sizes
Security-focused with hardening recommendations
Maintainable with clear troubleshooting procedures

Future Enhancements

This PR establishes the foundation for operational documentation. Future enhancements could include:

Additional guides for specific cloud providers
Advanced scaling patterns
Integration with specific tools/platforms
Disaster recovery playbooks

Impact: This documentation will significantly improve the production deployment experience for the Prometheus community and reduce operational barriers for new adopters.

- Add production deployment guide covering hardware requirements, HA patterns, configuration best practices, and security hardening - Add monitoring Prometheus guide with essential metrics, alerting rules, health checks, and troubleshooting procedures - Expand operating section index with complete operational documentation - Include Docker, Kubernetes, and container deployment examples - Provide backup/recovery procedures and performance tuning guidance These guides fill a critical gap for SRE/DevOps teams running Prometheus in production environments. Fixes: Production operations documentation gap Co-authored-by: Claude Sonnet <[email protected]> Signed-off-by: Parag Gupta <[email protected]>

Fix navigation links in operating index to point to .md files instead of directories to resolve build failures. - production-deployment/ → production-deployment.md - monitoring-prometheus/ → monitoring-prometheus.md - ../operating/security.md → security.md This should resolve the header rules, pages changed, and redirect rules build failures. Signed-off-by: Parag Gupta <[email protected]>

Add sort_rank values to new documentation files to match expected documentation structure: - production-deployment.md: sort_rank: 1 - monitoring-prometheus.md: sort_rank: 2 This should resolve header rules validation failures. Signed-off-by: Parag Gupta <[email protected]>

bwplotka

Thanks!

Out of curiosity, can you share what GenAI tool/model you used, and how much prompting vs manual effort the content required?

I honestly like how this content is condensed and listing things, kind of checklist of things to remember.

We should definitely carefully review this, I wonder how certain are you on this content reliability (e.g. that those scripts, alerts, dashboards, deployment yamls are executable and works as intended?), how much we can trust this? I looked briefly and it looks quite knowledgable.

To reduce effort to later maintain some artifacts, we should probably not paste those snippets but instead improve Prometheus example deployments and mixins with those alerts. I suggested that in comments. WDYT?

bwplotka · 2025-08-07T13:17:39Z

docs/operating/monitoring-prometheus.md

+
+```yaml
+# capacity-alerts.yml
+groups:


Should we instead ensure those are covered in our mixins?

I've updated the docs to prominently feature the official prometheus-mixin as the primary recommendation, with links to the maintained alerting rules. The inline examples are now clearly marked as templates that need adaptation, encouraging users toward the official, tested mixins.

bwplotka · 2025-08-07T13:18:14Z

docs/operating/monitoring-prometheus.md

+
+```yaml
+# prometheus-alerts.yml
+groups:


ditto, probably we should update and link to mixins

Otherwise this will age very quickly.

I agree - while some of this can be very valuable content, I think the right place to put detailed code / rules for a specific server binary version is in the mixins or other external resources. This would also need to be versioned with the respective binaries (Prometheus Server, Alertmanager, etc.), which is possible for the mixins because they are in the same repo as the server they are for.

Following your and @juliusv's guidance, I've restructured to reference the official mixins instead of inline rules. This ensures users get maintained, tested alerting rules while keeping the docs focused on guidance rather than potentially stale code.

bwplotka · 2025-08-07T13:18:52Z

docs/operating/monitoring-prometheus.md

+
+## Health Check Endpoints
+
+### HTTP Health Checks


Suggested change

### HTTP Health Checks

### Example HTTP Health Checks

Rationales: We can't reliably ensure this is working (no CI runs it, no tests)

Updated to "Example HTTP Health Checks" with clear disclaimers about testing requirements and no CI validation. This sets proper expectations while still providing useful templates.

bwplotka · 2025-08-07T13:20:01Z

docs/operating/production-deployment.md

+
+```yaml
+# prometheus-deployment.yaml
+apiVersion: apps/v1


Again, all of those YAMLs probably should be hosted in examples and linked from here: https://github.com/prometheus/prometheus/tree/main/documentation/examples

juliusv · 2025-08-07T14:01:49Z

docs/operating/index.md

 sort_rank: 5
 nav_icon: settings
 ---
+


This Markdown file does not actually become a documentation page, the top-level index.md files only serve to create the top-level docs nav sections on prometheus.io via their frontmatter fields. Have you actually tried building the site to see the result?

Thank you for this clarification! - I can see now that other sections (practices/, visualization/, etc.) only have frontmatter in their index.md files. I've corrected operating/index.md to follow the same pattern, containing only the frontmatter needed to create the nav section.

I clearly needed to better understand the documentation build system.

@bwplotka

Based on valuable feedback from @bwplotka and @juliusv: - Replace inline alerting rules with references to official mixins - Move example scripts to clearly marked examples with disclaimers - Reference official examples repository for YAML configurations - Add proper warnings about testing and adaptation needed - Link to prometheus-community Helm charts for K8s deployments This approach ensures better maintainability and follows established project patterns while providing the operational guidance users need. Addresses: Maintainer feedback on reliability and maintainability Signed-off-by: Parag Gupta <[email protected]>

@juliusv

Following @juliusv's feedback, top-level index.md files only create nav sections via frontmatter and don't become documentation pages. Removed content from operating/index.md to match established pattern seen in other sections (practices, visualization, etc.). Addresses: @juliusv feedback on documentation structure Signed-off-by: Parag Gupta <[email protected]>

paraggupta10 · 2025-08-07T15:31:45Z

Thanks!

Out of curiosity, can you share what GenAI tool/model you used, and how much prompting vs manual effort the content required?

I honestly like how this content is condensed and listing things, kind of checklist of things to remember.

We should definitely carefully review this, I wonder how certain are you on this content reliability (e.g. that those scripts, alerts, dashboards, deployment yamls are executable and works as intended?), how much we can trust this? I looked briefly and it looks quite knowledgable.

To reduce effort to later maintain some artifacts, we should probably not paste those snippets but instead improve Prometheus example deployments and mixins with those alerts. I suggested that in comments. WDYT?

@bwplotka Thanks a lot for the thoughtful feedback and the excellent architectural suggestions!

On the GenAI usage — I did use Claude to help tighten up the writing, but the structure, operational insights, and best practices are drawn from hands-on SRE experience running Prometheus in production at scale.

Regarding reliability — you bring up a very valid point. I’ve made some updates in response:
• Replaced the inline alerting rules with links to official mixins (as you and @juliusv suggested)
• Switched to an example-based approach with clear disclaimers that examples should be tested and adapted
• Added references to the prometheus/prometheus examples repo for verified configurations

Your suggestion around using a versioned, tested mixin alongside the codebase makes a lot of sense — definitely better than static docs that can go stale. I’ve updated the structure to promote the official prometheus-mixin and other community mixins, while using lightweight templates just to show intent (with warnings).

Really appreciate your time and input — the goal was to create something like a “production operations checklist,”. Thanks again for helping refine it!

Parag Gupta and others added 2 commits August 7, 2025 17:50

paraggupta10 force-pushed the feature/production-operations-guide branch from 0e72d42 to ed18ba4 Compare August 7, 2025 12:21

bwplotka reviewed Aug 7, 2025

View reviewed changes

juliusv reviewed Aug 7, 2025

View reviewed changes

Parag Gupta added 2 commits August 7, 2025 20:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: Add comprehensive production operations guides #2703

docs: Add comprehensive production operations guides #2703

Uh oh!

paraggupta10 commented Aug 7, 2025

Uh oh!

bwplotka left a comment

Uh oh!

bwplotka Aug 7, 2025

Uh oh!

paraggupta10 Aug 7, 2025

Uh oh!

bwplotka Aug 7, 2025

Uh oh!

juliusv Aug 7, 2025

Uh oh!

paraggupta10 Aug 7, 2025

Uh oh!

bwplotka Aug 7, 2025

Uh oh!

bwplotka Aug 7, 2025

Uh oh!

paraggupta10 Aug 7, 2025

Uh oh!

bwplotka Aug 7, 2025

Uh oh!

juliusv Aug 7, 2025

Uh oh!

paraggupta10 Aug 7, 2025

Uh oh!

paraggupta10 commented Aug 7, 2025

Uh oh!

Uh oh!

docs: Add comprehensive production operations guides #2703

Are you sure you want to change the base?

docs: Add comprehensive production operations guides #2703

Uh oh!

Conversation

paraggupta10 commented Aug 7, 2025

Summary

Type of Change

Changes Made

New Documentation Added

Why This Matters

Target Audience

Content Quality

Testing

Additional Context

Future Enhancements

Uh oh!

bwplotka left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paraggupta10 commented Aug 7, 2025

Uh oh!

Uh oh!