docs: Add comprehensive production operations guides #2703
- Add production deployment guide covering hardware requirements, HA patterns, configuration best practices, and security hardening
- Add monitoring Prometheus guide with essential metrics, alerting rules, health checks, and troubleshooting procedures
- Expand operating section index with complete operational documentation
- Include Docker, Kubernetes, and container deployment examples
- Provide backup/recovery procedures and performance tuning guidance

These guides fill a critical gap for SRE/DevOps teams running Prometheus in production environments.

Fixes: Production operations documentation gap
Co-authored-by: Claude Sonnet <[email protected]>
Signed-off-by: Parag Gupta <[email protected]>
Fix navigation links in operating index to point to .md files instead of directories to resolve build failures.

- production-deployment/ → production-deployment.md
- monitoring-prometheus/ → monitoring-prometheus.md
- ../operating/security.md → security.md

This should resolve the header rules, pages changed, and redirect rules build failures.

Signed-off-by: Parag Gupta <[email protected]>
Force-pushed from 0e72d42 to ed18ba4.
Add sort_rank values to new documentation files to match expected documentation structure:

- production-deployment.md: sort_rank: 1
- monitoring-prometheus.md: sort_rank: 2

This should resolve header rules validation failures.

Signed-off-by: Parag Gupta <[email protected]>
Thanks!
Out of curiosity, can you share what GenAI tool/model you used, and how much prompting vs manual effort the content required?
I honestly like how condensed this content is and how it lists things, a kind of checklist of things to remember.
We should definitely review this carefully. I wonder how certain you are about the reliability of this content (e.g. that those scripts, alerts, dashboards, and deployment YAMLs are executable and work as intended). How much can we trust this? I looked briefly and it looks quite knowledgeable.
To reduce the effort of maintaining these artifacts later, we should probably not paste those snippets but instead improve the Prometheus example deployments and mixins with those alerts. I suggested that in comments. WDYT?
```yaml
# capacity-alerts.yml
groups:
```
Should we instead ensure those are covered in our mixins?
I've updated the docs to prominently feature the official prometheus-mixin as the primary recommendation, with links to the maintained alerting rules. The inline examples are now clearly marked as templates that need adaptation, encouraging users toward the official, tested mixins.
```yaml
# prometheus-alerts.yml
groups:
```
Ditto, we should probably update the mixins and link to them. Otherwise this will age very quickly.
I agree - while some of this can be very valuable content, I think the right place to put detailed code / rules for a specific server binary version is in the mixins or other external resources. This would also need to be versioned with the respective binaries (Prometheus Server, Alertmanager, etc.), which is possible for the mixins because they are in the same repo as the server they are for.
Following your and @juliusv's guidance, I've restructured to reference the official mixins instead of inline rules. This ensures users get maintained, tested alerting rules while keeping the docs focused on guidance rather than potentially stale code.
## Health Check Endpoints

### HTTP Health Checks
Suggested change: `### HTTP Health Checks` → `### Example HTTP Health Checks`
Rationale: we can't reliably ensure this works (no CI runs it, no tests).
Updated to "Example HTTP Health Checks" with clear disclaimers about testing requirements and no CI validation. This sets proper expectations while still providing useful templates.
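As an illustration of the kind of template under discussion (a sketch, not taken from the PR), a Kubernetes probe configuration could point at Prometheus's `/-/healthy` and `/-/ready` management endpoints. The port assumes the default listen address of 9090, and all timing values are placeholders to be tuned per deployment.

```yaml
# Fragment of a container spec for a Prometheus server.
# Assumption: Prometheus listens on its default port 9090.
livenessProbe:
  httpGet:
    path: /-/healthy        # returns 200 while the server process is healthy
    port: 9090
  initialDelaySeconds: 30   # placeholder value, adjust for your environment
  periodSeconds: 15         # placeholder value
readinessProbe:
  httpGet:
    path: /-/ready          # returns 200 once the server is ready to serve traffic
    port: 9090
  periodSeconds: 5          # placeholder value
```

Like the examples in the guide itself, this fragment is not validated by any CI and needs testing before use.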
```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
```
Again, all of those YAMLs probably should be hosted in examples and linked from here: https://github.com/prometheus/prometheus/tree/main/documentation/examples
docs/operating/index.md (outdated)
sort_rank: 5
nav_icon: settings
---
This Markdown file does not actually become a documentation page; the top-level index.md files only serve to create the top-level docs nav sections on prometheus.io via their frontmatter fields. Have you actually tried building the site to see the result?
Thank you for this clarification! I can see now that other sections (practices/, visualization/, etc.) only have frontmatter in their index.md files. I've corrected operating/index.md to follow the same pattern, containing only the frontmatter needed to create the nav section.
I clearly needed to better understand the documentation build system.
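For reference, a frontmatter-only index.md along these lines might look like the sketch below. The `sort_rank` and `nav_icon` values come from the diff context quoted above; the `title` field and its value are assumptions for illustration and are not confirmed by this thread.

```yaml
---
title: Operating      # assumed field and value, not shown in the quoted diff
sort_rank: 5          # from the quoted diff
nav_icon: settings    # from the quoted diff
---
```

The file has no body after the closing `---`, since only the frontmatter is used to create the nav section.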
Based on valuable feedback from @bwplotka and @juliusv:

- Replace inline alerting rules with references to official mixins
- Move example scripts to clearly marked examples with disclaimers
- Reference official examples repository for YAML configurations
- Add proper warnings about testing and adaptation needed
- Link to prometheus-community Helm charts for K8s deployments

This approach ensures better maintainability and follows established project patterns while providing the operational guidance users need.

Addresses: Maintainer feedback on reliability and maintainability
Signed-off-by: Parag Gupta <[email protected]>
Following @juliusv's feedback, top-level index.md files only create nav sections via frontmatter and don't become documentation pages. Removed content from operating/index.md to match established pattern seen in other sections (practices, visualization, etc.).

Addresses: @juliusv feedback on documentation structure
Signed-off-by: Parag Gupta <[email protected]>
@bwplotka Thanks a lot for the thoughtful feedback and the excellent architectural suggestions!

On the GenAI usage: I did use Claude to help tighten up the writing, but the structure, operational insights, and best practices are drawn from hands-on SRE experience running Prometheus in production at scale.

Regarding reliability: you bring up a very valid point, and I've made some updates in response. Your suggestion around using a versioned, tested mixin alongside the codebase makes a lot of sense and is definitely better than static docs that can go stale. I've updated the structure to promote the official prometheus-mixin and other community mixins, while using lightweight templates just to show intent (with warnings).

Really appreciate your time and input. The goal was to create something like a "production operations checklist". Thanks again for helping refine it!
Summary
This PR adds comprehensive production operations documentation to fill a critical gap for SRE/DevOps teams running Prometheus in production environments.
Type of Change
Changes Made
New Documentation Added
Production Deployment Guide (docs/operating/production-deployment.md)
Monitoring Prometheus Guide (docs/operating/monitoring-prometheus.md)
Enhanced Operating Index (docs/operating/index.md)
)Why This Matters
The operating section was essentially empty (6-line index file only), leaving a massive gap for production deployments. This documentation:
Target Audience
Content Quality
Testing
Additional Context
This contribution addresses a fundamental gap in the Prometheus documentation ecosystem. While the project has excellent technical documentation, the lack of production operations guidance has been a barrier for teams deploying Prometheus at scale.
The guides are designed to be:
Future Enhancements
This PR establishes the foundation for operational documentation. Future enhancements could include:
Impact: This documentation will significantly improve the production deployment experience for the Prometheus community and reduce operational barriers for new adopters.