From 9deb4e4b349d3e593f293e3fcfe83ec47c493efc Mon Sep 17 00:00:00 2001 From: Parag Gupta Date: Thu, 7 Aug 2025 17:42:43 +0530 Subject: [PATCH 1/5] docs: add comprehensive production operations guides - Add production deployment guide covering hardware requirements, HA patterns, configuration best practices, and security hardening - Add monitoring Prometheus guide with essential metrics, alerting rules, health checks, and troubleshooting procedures - Expand operating section index with complete operational documentation - Include Docker, Kubernetes, and container deployment examples - Provide backup/recovery procedures and performance tuning guidance These guides fill a critical gap for SRE/DevOps teams running Prometheus in production environments. Fixes: Production operations documentation gap Co-authored-by: Claude Sonnet Signed-off-by: Parag Gupta --- docs/operating/index.md | 48 +- docs/operating/monitoring-prometheus.md | 603 ++++++++++++++++++++++++ docs/operating/production-deployment.md | 571 ++++++++++++++++++++++ 3 files changed, 1221 insertions(+), 1 deletion(-) create mode 100644 docs/operating/monitoring-prometheus.md create mode 100644 docs/operating/production-deployment.md diff --git a/docs/operating/index.md b/docs/operating/index.md index d783f6329..2b12b7ca5 100644 --- a/docs/operating/index.md +++ b/docs/operating/index.md @@ -1,5 +1,51 @@ --- -title: Operating +title: Operating Prometheus in Production sort_rank: 5 nav_icon: settings --- + +# Operating Prometheus in Production + +This section provides comprehensive guidance for deploying, monitoring, and maintaining Prometheus in production environments. These guides are designed for SRE, DevOps, and platform engineering teams who need to run Prometheus reliably at scale. 
+ +## Production Deployment + +Running Prometheus in production requires careful planning around scalability, reliability, and operational concerns: + +* [Production Deployment Guide](production-deployment/) - Comprehensive guide for production-ready Prometheus deployments including hardware sizing, high availability setup, and configuration best practices +* [Performance Tuning](performance-tuning/) - Optimization techniques for large-scale deployments, memory management, and query performance +* [Storage Management](storage-management/) - Long-term storage strategies, retention policies, and data lifecycle management + +## Monitoring and Maintenance + +Effective operation requires monitoring your monitoring infrastructure: + +* [Monitoring Prometheus](monitoring-prometheus/) - How to monitor your Prometheus instances, essential metrics, and alerting on infrastructure health +* [Troubleshooting Guide](troubleshooting/) - Common issues, diagnostic techniques, and resolution strategies for production problems +* [Backup and Recovery](backup-recovery/) - Data protection strategies, disaster recovery procedures, and backup validation + +## Security and Compliance + +Securing monitoring infrastructure is critical for production deployments: + +* [Security Best Practices](../operating/security.md) - Authentication, authorization, network security, and data protection +* [Compliance Considerations](compliance/) - Meeting regulatory requirements, audit trails, and data governance + +## Operational Integration + +Prometheus doesn't operate in isolation - integration with your operational ecosystem is key: + +* [Alert Management](alert-management/) - Alert routing, escalation policies, and integration with incident management systems +* [Capacity Planning](capacity-planning/) - Growth planning, resource forecasting, and scaling strategies +* [Multi-tenancy](multi-tenancy/) - Patterns for shared Prometheus infrastructure, isolation, and resource allocation + +## Migration 
and Upgrades + +Managing changes to production monitoring infrastructure: + +* [Upgrade Strategies](upgrade-strategies/) - Safe upgrade procedures, rollback plans, and compatibility considerations +* [Migration Guide](migration-guide/) - Moving from other monitoring systems, data migration, and transition planning + +--- + +**Note**: These guides assume you have a basic understanding of Prometheus concepts. If you're new to Prometheus, start with the [Introduction](/docs/introduction/) section. diff --git a/docs/operating/monitoring-prometheus.md b/docs/operating/monitoring-prometheus.md new file mode 100644 index 000000000..2d253e037 --- /dev/null +++ b/docs/operating/monitoring-prometheus.md @@ -0,0 +1,603 @@ +--- +title: Monitoring Prometheus +--- + +# Monitoring Prometheus + +Meta-monitoring (monitoring your monitoring system) is critical for production reliability. This guide covers essential metrics, alerting rules, and dashboards for monitoring Prometheus infrastructure health. + +## Essential Prometheus Metrics + +### Memory and Performance Metrics + +```promql +# Ingestion volume, symbol table size, and query duration +prometheus_tsdb_head_samples_appended_total +prometheus_tsdb_symbol_table_size_bytes +prometheus_engine_query_duration_seconds + +# Active series and cardinality +prometheus_tsdb_head_series +prometheus_tsdb_head_chunks + +# Storage utilization +prometheus_tsdb_blocks_loaded +prometheus_tsdb_compactions_total +prometheus_tsdb_compactions_failed_total +``` + +### Query Performance Monitoring + +```promql +# Query latency (exposed as a summary with 0.5, 0.9, and 0.99 quantiles) +prometheus_engine_query_duration_seconds{ + slice="inner_eval", quantile="0.9" +} + +# Concurrent queries +prometheus_engine_queries_concurrent_max +prometheus_engine_queries + +# Average query evaluation time over the last 5 minutes +rate(prometheus_engine_query_duration_seconds_sum{slice="inner_eval"}[5m]) / rate(prometheus_engine_query_duration_seconds_count{slice="inner_eval"}[5m]) +``` + +### Ingestion and Scraping Health + +```promql +# Samples ingested per second +rate(prometheus_tsdb_head_samples_appended_total[5m]) + +# Failed scrapes +up == 
0 + +# Scrape limits and duration +prometheus_target_scrapes_exceeded_sample_limit_total +scrape_duration_seconds +``` + +### Storage Health + +```promql +# WAL health +prometheus_tsdb_wal_fsync_duration_seconds +prometheus_tsdb_wal_corruptions_total + +# Compaction metrics +rate(prometheus_tsdb_compactions_total[5m]) +prometheus_tsdb_compactions_failed_total + +# Block loading issues +prometheus_tsdb_blocks_loaded +prometheus_tsdb_head_truncations_failed_total +``` + +## Critical Alerting Rules + +### High-Priority Alerts + +```yaml +# prometheus-alerts.yml +groups: +- name: prometheus.rules + rules: + + # Prometheus instance down + - alert: PrometheusDown + expr: up{job="prometheus"} == 0 + for: 5m + labels: + severity: critical + annotations: + summary: "Prometheus instance {{ $labels.instance }} is down" + description: "Prometheus instance {{ $labels.instance }} has been down for more than 5 minutes." + + # High memory usage (set the threshold to roughly 80% of the memory available to the instance) + - alert: PrometheusHighMemoryUsage + expr: > + ( + process_resident_memory_bytes{job="prometheus"} + ) + > 8 * 1024 * 1024 * 1024 + for: 15m + labels: + severity: warning + annotations: + summary: "Prometheus {{ $labels.instance }} memory usage is high" + description: "Prometheus {{ $labels.instance }} memory usage has been above 8GiB for more than 15 minutes." + + # Too many active series + - alert: PrometheusHighCardinality + expr: prometheus_tsdb_head_series > 1000000 + for: 10m + labels: + severity: warning + annotations: + summary: "Prometheus {{ $labels.instance }} has high cardinality" + description: "Prometheus {{ $labels.instance }} has {{ $value }} active series, which is above the recommended threshold." 
+ + # Query latency high + - alert: PrometheusHighQueryLatency + expr: > + prometheus_engine_query_duration_seconds{ + slice="inner_eval", quantile="0.9" + } > 30 + for: 10m + labels: + severity: warning + annotations: + summary: "Prometheus {{ $labels.instance }} has high query latency" + description: "90th percentile query latency has been {{ $value }}s for more than 10 minutes." + + # WAL corruption + - alert: PrometheusWALCorruption + expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0 + labels: + severity: critical + annotations: + summary: "Prometheus {{ $labels.instance }} WAL corruption detected" + description: "Prometheus {{ $labels.instance }} has detected WAL corruption." + + # Compaction failures + - alert: PrometheusCompactionFailed + expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0 + labels: + severity: warning + annotations: + summary: "Prometheus {{ $labels.instance }} compaction failed" + description: "Prometheus {{ $labels.instance }} has failed compactions in the last hour." + + # Target scrape failures + - alert: PrometheusTargetScrapeFailure + expr: > + ( + 1 - ( + sum(up) / + count(up) + ) + ) * 100 > 10 + for: 15m + labels: + severity: warning + annotations: + summary: "High percentage of target scrape failures" + description: "{{ $value }}% of targets are failing to be scraped." + + # Storage space low + - alert: PrometheusStorageSpaceLow + expr: > + ( + node_filesystem_free_bytes{mountpoint="/prometheus"} / + node_filesystem_size_bytes{mountpoint="/prometheus"} + ) * 100 < 20 + for: 5m + labels: + severity: warning + annotations: + summary: "Prometheus storage space is low" + description: "Prometheus storage has less than 20% free space remaining." 
+ + # Configuration reload failed + - alert: PrometheusConfigReloadFailed + expr: prometheus_config_last_reload_successful == 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Prometheus configuration reload failed" + description: "Prometheus {{ $labels.instance }} configuration reload has failed." +``` + +### Capacity Planning Alerts + +```yaml +# capacity-alerts.yml +groups: +- name: prometheus.capacity + rules: + + # Ingestion rate trending up + - alert: PrometheusIngestionRateHigh + expr: > + predict_linear( + rate(prometheus_tsdb_head_samples_appended_total[1h])[4h:], + 24*3600 + ) > 50000 + for: 30m + labels: + severity: warning + annotations: + summary: "Prometheus ingestion rate trending high" + description: "Ingestion rate is predicted to exceed 50k samples/sec within 24 hours." + + # Series growth rate + - alert: PrometheusSeriesGrowthHigh + expr: > + predict_linear( + prometheus_tsdb_head_series[4h:], + 24*3600 + ) > 2000000 + for: 1h + labels: + severity: warning + annotations: + summary: "Prometheus series count growing rapidly" + description: "Active series count is predicted to exceed 2M within 24 hours." + + # Query load increasing (prometheus_engine_queries is a gauge, so rate the per-query counter instead) + - alert: PrometheusQueryLoadHigh + expr: > + rate(prometheus_engine_query_duration_seconds_count{slice="inner_eval"}[5m]) > 100 + for: 30m + labels: + severity: warning + annotations: + summary: "Prometheus query load is high" + description: "Query rate is {{ $value }} queries/sec; consider query optimization." 
+``` + +## Monitoring Dashboard + +### Grafana Dashboard JSON + +```json +{ + "dashboard": { + "title": "Prometheus Overview", + "panels": [ + { + "title": "Prometheus Instances Status", + "type": "stat", + "targets": [ + { + "expr": "up{job=\"prometheus\"}", + "legendFormat": "{{ instance }}" + } + ] + }, + { + "title": "Memory Usage", + "type": "graph", + "targets": [ + { + "expr": "process_resident_memory_bytes{job=\"prometheus\"}", + "legendFormat": "RSS Memory - {{ instance }}" + }, + { + "expr": "process_virtual_memory_bytes{job=\"prometheus\"}", + "legendFormat": "Virtual Memory - {{ instance }}" + } + ] + }, + { + "title": "Query Performance", + "type": "graph", + "targets": [ + { + "expr": "prometheus_engine_query_duration_seconds{slice=\"inner_eval\", quantile=\"0.9\"}", + "legendFormat": "90th percentile" + }, + { + "expr": "prometheus_engine_query_duration_seconds{slice=\"inner_eval\", quantile=\"0.5\"}", + "legendFormat": "50th percentile" + } + ] + }, + { + "title": "Active Series", + "type": "graph", + "targets": [ + { + "expr": "prometheus_tsdb_head_series", + "legendFormat": "{{ instance }}" + } + ] + }, + { + "title": "Ingestion Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])", + "legendFormat": "Samples/sec - {{ instance }}" + } + ] + }, + { + "title": "Storage Usage", + "type": "graph", + "targets": [ + { + "expr": "prometheus_tsdb_blocks_loaded", + "legendFormat": "Blocks Loaded - {{ instance }}" + } + ] + } + ] + } +} +``` + +## Health Check Endpoints + +### HTTP Health Checks + +```bash +#!/bin/bash +# prometheus-health-check.sh + +PROMETHEUS_URL="http://localhost:9090" + +# Basic health check (-f makes curl fail on HTTP error responses) +echo "=== Basic Health Check ===" +curl -sf "$PROMETHEUS_URL/-/healthy" || echo "Health check failed" + +# Readiness check +echo "=== Readiness Check ===" +curl -sf "$PROMETHEUS_URL/-/ready" || echo "Readiness check failed" + +# Configuration reload status +echo "=== Configuration Status 
===" +CONFIG_STATUS=$(curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=prometheus_config_last_reload_successful' | jq -r '.data.result[0].value[1]') +echo "Config reload successful (1 = yes, 0 = no): $CONFIG_STATUS" + +# Target status +echo "=== Target Status ===" +UP_TARGETS=$(curl -s "$PROMETHEUS_URL/api/v1/targets" | jq '.data.activeTargets | map(select(.health == "up")) | length') +TOTAL_TARGETS=$(curl -s "$PROMETHEUS_URL/api/v1/targets" | jq '.data.activeTargets | length') +echo "Healthy targets: $UP_TARGETS/$TOTAL_TARGETS" + +# Runtime information +echo "=== Runtime Information ===" +curl -s "$PROMETHEUS_URL/api/v1/status/runtimeinfo" | jq '.' +``` + +### Kubernetes Health Checks + +```yaml +# Kubernetes probes for Prometheus StatefulSet +livenessProbe: + httpGet: + path: /-/healthy + port: 9090 + initialDelaySeconds: 30 + periodSeconds: 15 + timeoutSeconds: 10 + failureThreshold: 3 + +readinessProbe: + httpGet: + path: /-/ready + port: 9090 + initialDelaySeconds: 30 + periodSeconds: 5 + timeoutSeconds: 5 + failureThreshold: 3 +``` + +## Performance Monitoring Queries + +### Memory Analysis + +```promql +# Instances with the largest in-memory symbol tables +topk( + 10, + prometheus_tsdb_symbol_table_size_bytes +) + +# Total memory usage across Prometheus instances +sum by (job) (process_resident_memory_bytes{job="prometheus"}) + +# Memory growth over the last hour (delta, because this is a gauge) +delta(process_resident_memory_bytes{job="prometheus"}[1h]) +``` + +### Query Analysis + +```promql +# Average query duration by instance and evaluation phase (slice) +topk(10, + rate(prometheus_engine_query_duration_seconds_sum[5m]) / + rate(prometheus_engine_query_duration_seconds_count[5m]) +) + +# Query concurrency +prometheus_engine_queries_concurrent_max + +# Failed query API requests (non-2xx responses) +rate(prometheus_http_requests_total{handler=~"/api/v1/query(_range)?", code!~"2.."}[5m]) +``` + +### Storage Analysis + +```promql +# WAL segments created over the last hour +delta(prometheus_tsdb_wal_segment_current[1h]) + +# Average compaction duration +rate(prometheus_tsdb_compaction_duration_seconds_sum[6h]) / rate(prometheus_tsdb_compaction_duration_seconds_count[6h]) + +# Chunk size distribution (95th percentile) +histogram_quantile(0.95, sum by (le) (rate(prometheus_tsdb_compaction_chunk_size_bytes_bucket[6h]))) +``` + +## 
Automated Monitoring Scripts + +### Daily Health Report + +```bash +#!/bin/bash +# daily-prometheus-report.sh + +PROMETHEUS_URL="http://localhost:9090" +REPORT_DATE=$(date +%Y-%m-%d) +REPORT_FILE="/var/log/prometheus/daily-report-$REPORT_DATE.txt" + +echo "Prometheus Daily Health Report - $REPORT_DATE" > $REPORT_FILE +echo "================================================" >> $REPORT_FILE + +# Instance status (-G with --data-urlencode keeps braces, brackets, and spaces in PromQL intact) +echo "Instance Status:" >> $REPORT_FILE +curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=up{job="prometheus"}' | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Memory usage +echo -e "\nMemory Usage (GB):" >> $REPORT_FILE +curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}/1024/1024/1024' | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Active series +echo -e "\nActive Series:" >> $REPORT_FILE +curl -s "$PROMETHEUS_URL/api/v1/query?query=prometheus_tsdb_head_series" | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Query performance +echo -e "\nQuery Performance (90th percentile, seconds):" >> $REPORT_FILE +curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=prometheus_engine_query_duration_seconds{slice="inner_eval",quantile="0.9"}' | \ + jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"' >> $REPORT_FILE + +# Failed scrapes +echo -e "\nFailed Scrapes:" >> $REPORT_FILE +curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=count by (job) (up == 0)' | \ + jq -r '.data.result[] | "\(.metric.job): \(.value[1])"' >> $REPORT_FILE + +echo "Report generated: $REPORT_FILE" +``` + +### Capacity Planning Script + +```bash +#!/bin/bash +# capacity-planning.sh + +PROMETHEUS_URL="http://localhost:9090" + +echo "Prometheus Capacity Planning Report" +echo "==================================" + +# Current metrics +CURRENT_SERIES=$(curl -s "$PROMETHEUS_URL/api/v1/query?query=prometheus_tsdb_head_series" | jq 
'.data.result[0].value[1] | tonumber') +CURRENT_MEMORY=$(curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' | jq '.data.result[0].value[1] | tonumber') +INGESTION_RATE=$(curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[1h])' | jq '.data.result[0].value[1] | tonumber') + +echo "Current active series: $CURRENT_SERIES" +echo "Current memory usage: $(echo "scale=2; $CURRENT_MEMORY / 1024 / 1024 / 1024" | bc) GB" +echo "Current ingestion rate: $(echo "$INGESTION_RATE" | bc) samples/sec" + +# Projected growth (30 days) +PROJECTED_SERIES=$(echo "$CURRENT_SERIES * 1.1" | bc) # 10% growth +PROJECTED_MEMORY=$(echo "$CURRENT_MEMORY * 1.1" | bc) + +echo -e "\nProjected in 30 days (10% growth):" +echo "Projected series: $PROJECTED_SERIES" +echo "Projected memory: $(echo "scale=2; $PROJECTED_MEMORY / 1024 / 1024 / 1024" | bc) GB" + +# Recommendations +if (( $(echo "$CURRENT_SERIES > 500000" | bc -l) )); then + echo -e "\nRecommendation: Consider horizontal scaling or series optimization" +fi + +if (( $(echo "$CURRENT_MEMORY > 8589934592" | bc -l) )); then # 8GiB + echo -e "\nRecommendation: Monitor memory usage closely, consider memory optimization" +fi +``` + +## Log Analysis + +### Important Log Patterns + +```bash +# Monitor Prometheus logs for issues (case-insensitive match) +tail -f /var/log/prometheus/prometheus.log | grep -iE "(error|warn|panic|fatal)" + +# Common error patterns to watch for: +# - "out of memory" +# - "too many open files" +# - "context deadline exceeded" +# - "compaction failed" +# - "WAL corruption" +``` + +### Log Aggregation Query (if using Loki) + +```logql +# Prometheus error analysis (default logfmt output; use `| json` when running with --log.format=json) +{job="prometheus"} |= "error" | logfmt | line_format "{{ .level }}: {{ .msg }}" + +# Memory pressure indicators +{job="prometheus"} |~ "memory|OOM|out of memory" + +# Query performance issues +{job="prometheus"} |~ "slow|timeout|deadline exceeded" +``` + +## Troubleshooting Playbook + +### High Memory Usage + +1. 
**Check active series**: `prometheus_tsdb_head_series` +2. **Identify high-cardinality metrics**: Use cardinality analysis queries +3. **Review scrape configurations**: Look for unnecessary labels +4. **Consider series dropping**: Use `metric_relabel_configs` + +### Slow Queries + +1. **Enable query logging**: Set `query_log_file` in the `global` section of the configuration file +2. **Analyze query patterns**: Review most expensive queries +3. **Optimize query structure**: Use recording rules for complex queries +4. **Increase query timeout**: `--query.timeout` if appropriate + +### Storage Issues + +1. **Check disk space**: Monitor filesystem usage +2. **Review retention settings**: Adjust retention time/size +3. **Monitor compaction**: Check for failed compactions +4. **WAL monitoring**: Watch WAL size growth + +## Integration with External Monitoring + +### Exporting Metrics to Another Prometheus + +```yaml +# Remote write configuration for meta-monitoring (the receiving Prometheus must run with --web.enable-remote-write-receiver) +remote_write: + - url: "http://meta-prometheus:9090/api/v1/write" + queue_config: + capacity: 10000 + max_samples_per_send: 1000 + write_relabel_configs: + - source_labels: [__name__] + regex: "prometheus_.*" + action: keep +``` + +### Alertmanager Integration + +```yaml +# Alertmanager configuration for Prometheus alerts +route: + group_by: ['alertname', 'instance'] + group_wait: 10s + group_interval: 10s + repeat_interval: 1h + receiver: 'prometheus-alerts' + routes: + - matchers: + - severity="critical" + receiver: 'prometheus-critical' + +receivers: +- name: 'prometheus-alerts' + slack_configs: + - api_url: 'YOUR_SLACK_WEBHOOK' + channel: '#prometheus-alerts' + +- name: 'prometheus-critical' + pagerduty_configs: + - service_key: 'YOUR_PAGERDUTY_KEY' +``` + +--- + +This monitoring setup helps keep your Prometheus infrastructure healthy and performant. Review these metrics and alerts regularly to maintain reliable monitoring for your production environments. 
\ No newline at end of file diff --git a/docs/operating/production-deployment.md b/docs/operating/production-deployment.md new file mode 100644 index 000000000..487a1fe07 --- /dev/null +++ b/docs/operating/production-deployment.md @@ -0,0 +1,571 @@ +--- +title: Production Deployment Guide +--- + +# Production Deployment Guide + +This guide provides comprehensive recommendations for deploying Prometheus in production environments. It covers hardware requirements, high availability patterns, configuration best practices, and operational considerations for running Prometheus at scale. + +## Hardware and Infrastructure Requirements + +### Server Specifications + +**Memory Requirements** +- **Minimum**: 4GB RAM for small deployments (< 10k active series) +- **Recommended**: 16-32GB RAM for medium deployments (10k-100k active series) +- **Large Scale**: 64GB+ RAM for large deployments (100k+ active series) + +**CPU Requirements** +- **Minimum**: 2 CPU cores +- **Recommended**: 4-8 CPU cores for most production workloads +- **Large Scale**: 16+ CPU cores for high-cardinality environments + +**Storage Requirements** +- **SSD strongly recommended** for data directory +- **Disk space calculation**: `retention_time_seconds * ingested_samples_per_second * bytes_per_sample` + - Typical on-disk size: 1-2 bytes per sample after compression + - Example: 30 days (2,592,000 s) * 10,000 samples/s * 2 bytes ≈ 52GB storage needed +- **Separate disk** for WAL (Write-Ahead Log) recommended for high-throughput deployments + +### Network Considerations + +```yaml +# Illustrative firewall rules (pseudo-configuration; translate to your firewall's syntax) +ingress: + - port: 9090 # Prometheus web UI and API + protocol: TCP + sources: ["monitoring-subnet", "admin-subnet"] + + - port: 9091 # Pushgateway (if used) + protocol: TCP + sources: ["application-subnets"] + +egress: + - port: 80/443 # Scraping HTTP/HTTPS targets + protocol: TCP + destinations: ["0.0.0.0/0"] + + - port: 9100 # Node exporter + protocol: TCP + destinations: ["infrastructure-subnets"] +``` + +## High Availability Deployment Patterns + +### 
Active-Active Configuration + +Deploy multiple identical Prometheus instances scraping the same targets: + +```yaml +# prometheus-1.yml +global: + scrape_interval: 15s + evaluation_interval: 15s + external_labels: + replica: 'prometheus-1' + cluster: 'production' + +scrape_configs: + - job_name: 'application-servers' + static_configs: + - targets: ['app1:8080', 'app2:8080', 'app3:8080'] +``` + +```yaml +# prometheus-2.yml +global: + scrape_interval: 15s + evaluation_interval: 15s + external_labels: + replica: 'prometheus-2' + cluster: 'production' + +scrape_configs: + - job_name: 'application-servers' + static_configs: + - targets: ['app1:8080', 'app2:8080', 'app3:8080'] +``` + +**Benefits:** +- No single point of failure +- Load distribution for queries +- Natural data redundancy + +**Considerations:** +- Requires deduplication in query layer (Thanos, Cortex, or VictoriaMetrics) +- Double storage requirements +- Alert rule evaluation happens on both instances + +### Federation for Hierarchical Scaling + +```yaml +# Global Prometheus configuration +scrape_configs: + - job_name: 'prometheus-federation' + scrape_interval: 15s + honor_labels: true + metrics_path: '/federate' + params: + 'match[]': + - '{job=~"prometheus|node|kubernetes-.*"}' + - 'up' + - 'prometheus_build_info' + static_configs: + - targets: + - 'prometheus-region-us-east:9090' + - 'prometheus-region-us-west:9090' + - 'prometheus-region-eu:9090' +``` + +## Production Configuration Best Practices + +### Storage Configuration + +```yaml +# Command line flags for storage optimization +--storage.tsdb.path=/prometheus/data +--storage.tsdb.retention.time=30d +--storage.tsdb.retention.size=100GB +--storage.tsdb.wal-compression +--storage.tsdb.no-lockfile +--web.enable-lifecycle +--web.enable-admin-api +``` + +### Memory Optimization + +```yaml +# Limit memory usage and optimize for large deployments +--storage.tsdb.head-chunks-write-queue-size=10000 +--query.max-concurrency=20 +--query.timeout=2m 
+--query.max-samples=50000000 +``` + +### Sample Configuration File + +```yaml +# /etc/prometheus/prometheus.yml +global: + scrape_interval: 30s + scrape_timeout: 10s + evaluation_interval: 30s + external_labels: + environment: 'production' + datacenter: 'us-east-1' + +rule_files: + - "/etc/prometheus/rules/*.yml" + +alerting: + alertmanagers: + - static_configs: + - targets: + - alertmanager-1:9093 + - alertmanager-2:9093 + timeout: 10s + +scrape_configs: + # Prometheus itself + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + scrape_interval: 30s + metrics_path: /metrics + + # Node exporter for system metrics + - job_name: 'node-exporter' + static_configs: + - targets: + - 'node1:9100' + - 'node2:9100' + - 'node3:9100' + scrape_interval: 30s + + # Application metrics + - job_name: 'application' + static_configs: + - targets: + - 'app1:8080' + - 'app2:8080' + scrape_interval: 15s + metrics_path: /metrics + scrape_timeout: 10s + +# Remote write for long-term storage (optional) +remote_write: + - url: "https://remote-storage-endpoint/api/v1/write" + queue_config: + capacity: 2500 + max_shards: 200 + min_shards: 1 + max_samples_per_send: 500 + batch_send_deadline: 5s +``` + +## Container Deployment + +### Docker Configuration + +```dockerfile +# Dockerfile for production Prometheus +FROM prom/prometheus:latest + +# Copy configuration +COPY prometheus.yml /etc/prometheus/ +COPY rules/ /etc/prometheus/rules/ + +# Set proper ownership +USER root +RUN chown -R prometheus:prometheus /etc/prometheus/ +USER prometheus + +# Expose metrics port +EXPOSE 9090 + +# Use proper entrypoint with production flags +ENTRYPOINT ["/bin/prometheus", \ + "--config.file=/etc/prometheus/prometheus.yml", \ + "--storage.tsdb.path=/prometheus", \ + "--storage.tsdb.retention.time=30d", \ + "--storage.tsdb.wal-compression", \ + "--web.console.libraries=/etc/prometheus/console_libraries", \ + "--web.console.templates=/etc/prometheus/consoles", \ + 
"--web.enable-lifecycle", \ + "--web.external-url=https://prometheus.company.com"] +``` + +### Docker Compose for HA Setup + +```yaml +# docker-compose.yml +version: '3.8' + +services: + prometheus-1: + image: prom/prometheus:latest + container_name: prometheus-1 + ports: + - "9090:9090" + volumes: + - ./prometheus-1.yml:/etc/prometheus/prometheus.yml + - ./rules:/etc/prometheus/rules + - prometheus-1-data:/prometheus + command: + - '--config.file=/etc/prometheus/prometheus.yml' + - '--storage.tsdb.path=/prometheus' + - '--storage.tsdb.retention.time=30d' + - '--storage.tsdb.wal-compression' + - '--web.enable-lifecycle' + - '--web.external-url=http://prometheus-1:9090' + restart: unless-stopped + networks: + - monitoring + + prometheus-2: + image: prom/prometheus:latest + container_name: prometheus-2 + ports: + - "9091:9090" + volumes: + - ./prometheus-2.yml:/etc/prometheus/prometheus.yml + - ./rules:/etc/prometheus/rules + - prometheus-2-data:/prometheus + command: + - '--config.file=/etc/prometheus/prometheus.yml' + - '--storage.tsdb.path=/prometheus' + - '--storage.tsdb.retention.time=30d' + - '--storage.tsdb.wal-compression' + - '--web.enable-lifecycle' + - '--web.external-url=http://prometheus-2:9090' + restart: unless-stopped + networks: + - monitoring + +volumes: + prometheus-1-data: + prometheus-2-data: + +networks: + monitoring: + driver: bridge +``` + +### Kubernetes Deployment + +```yaml +# prometheus-deployment.yaml +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: prometheus + namespace: monitoring +spec: + serviceName: prometheus + replicas: 2 + selector: + matchLabels: + app: prometheus + template: + metadata: + labels: + app: prometheus + spec: + serviceAccountName: prometheus + securityContext: + runAsUser: 65534 + runAsGroup: 65534 + fsGroup: 65534 + containers: + - name: prometheus + image: prom/prometheus:latest + ports: + - containerPort: 9090 + name: http + args: + - '--config.file=/etc/prometheus/prometheus.yml' + - 
'--storage.tsdb.path=/prometheus' + - '--storage.tsdb.retention.time=30d' + - '--storage.tsdb.retention.size=50GiB' + - '--storage.tsdb.wal-compression' + - '--web.enable-lifecycle' + - '--web.external-url=http://prometheus.monitoring.svc.cluster.local:9090' + - '--web.route-prefix=/' + resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "2" + volumeMounts: + - name: config + mountPath: /etc/prometheus + - name: storage + mountPath: /prometheus + livenessProbe: + httpGet: + path: /-/healthy + port: 9090 + initialDelaySeconds: 30 + timeoutSeconds: 30 + readinessProbe: + httpGet: + path: /-/ready + port: 9090 + initialDelaySeconds: 30 + timeoutSeconds: 30 + volumes: + - name: config + configMap: + name: prometheus-config + volumeClaimTemplates: + - metadata: + name: storage + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: "fast-ssd" + resources: + requests: + storage: 100Gi +``` + +## Security Hardening + +### Authentication and Authorization + +```yaml +# web-config.yml (passed via --web.config.file): basic auth with bcrypt-hashed passwords +basic_auth_users: + admin: $2a$10$hYoOolb6tZyZQkEJ8T8jIuJ6U.4FK/8e8cDatYQ8F5U0QKa.4QKyC # admin + readonly: $2a$10$ZoOJlGqEEzOz5T8uFX5c8elZeT3cxBE8XuqD8qJ2z9F5x8c4U6Ty6 # readonly + +# TLS configuration +tls_server_config: + cert_file: /etc/prometheus/tls/server.crt + key_file: /etc/prometheus/tls/server.key + client_ca_file: /etc/prometheus/tls/ca.crt + client_auth_type: RequireAndVerifyClientCert +``` + +### Network Security + +```bash +# Firewall rules using iptables +# Allow Prometheus web interface from monitoring subnet only +iptables -A INPUT -p tcp --dport 9090 -s 10.0.1.0/24 -j ACCEPT +iptables -A INPUT -p tcp --dport 9090 -j DROP + +# Allow scraping from Prometheus to targets +iptables -A OUTPUT -p tcp --dport 9100 -d 10.0.0.0/16 -j ACCEPT +iptables -A OUTPUT -p tcp --dport 8080 -d 10.0.0.0/16 -j ACCEPT +``` + +## Monitoring Prometheus Performance + +Essential metrics to monitor for Prometheus health: + +```promql +# Ingestion, query duration, and symbol table size 
+prometheus_tsdb_head_samples_appended_total +prometheus_engine_query_duration_seconds +prometheus_tsdb_symbol_table_size_bytes + +# Storage metrics +prometheus_tsdb_blocks_loaded +prometheus_tsdb_compactions_total +prometheus_tsdb_head_series + +# Query performance +prometheus_engine_queries +prometheus_engine_queries_concurrent_max +``` + +## Backup and Disaster Recovery + +### Snapshot-based Backup + +```bash +#!/bin/bash +# backup-prometheus.sh + +PROMETHEUS_URL="http://localhost:9090" +BACKUP_DIR="/backup/prometheus" +DATE=$(date +%Y%m%d_%H%M%S) + +# Create snapshot (requires --web.enable-admin-api) +curl -sf -XPOST "$PROMETHEUS_URL/api/v1/admin/tsdb/snapshot" + +# Get snapshot name +SNAPSHOT=$(ls -t /prometheus/snapshots/ | head -1) + +# Copy snapshot contents (the block directories) to backup location +mkdir -p "$BACKUP_DIR/$DATE" +cp -r "/prometheus/snapshots/$SNAPSHOT/." "$BACKUP_DIR/$DATE/" + +# Compress backup +tar -czf $BACKUP_DIR/prometheus_backup_$DATE.tar.gz -C $BACKUP_DIR/$DATE . + +# Clean up old backups (keep 30 days) +find $BACKUP_DIR -name "*.tar.gz" -mtime +30 -delete + +echo "Backup completed: $BACKUP_DIR/prometheus_backup_$DATE.tar.gz" +``` + +### Recovery Procedure + +```bash +#!/bin/bash +# restore-prometheus.sh + +BACKUP_FILE="$1" +PROMETHEUS_DATA_DIR="/prometheus" + +if [ -z "$BACKUP_FILE" ]; then + echo "Usage: $0 " + exit 1 +fi + +# Stop Prometheus +systemctl stop prometheus + +# Backup current data +mv $PROMETHEUS_DATA_DIR $PROMETHEUS_DATA_DIR.backup.$(date +%s) + +# Extract backup (block directories land directly in the data directory) +mkdir -p "$PROMETHEUS_DATA_DIR" +tar -xzf "$BACKUP_FILE" -C "$PROMETHEUS_DATA_DIR" + +# Set proper permissions +chown -R prometheus:prometheus $PROMETHEUS_DATA_DIR + +# Start Prometheus +systemctl start prometheus + +echo "Recovery completed from $BACKUP_FILE" +``` + +## Performance Tuning + +### Memory Optimization + +```bash +# Go runtime environment variables for garbage collection and memory limits +export GOGC=100 # Default garbage collection target +export GOMEMLIMIT=8GiB # Set memory limit (Go 1.19+) + +# Start Prometheus with memory optimizations 
+prometheus \
+  --storage.tsdb.head-chunks-write-queue-size=10000 \
+  --query.max-concurrency=20 \
+  --storage.tsdb.min-block-duration=2h \
+  --storage.tsdb.max-block-duration=2h
+```
+
+### Storage Optimization
+
+```yaml
+# Reduce cardinality by dropping unneeded metrics and normalizing labels
+metric_relabel_configs:
+  - source_labels: [__name__]
+    regex: 'go_.*'
+    action: drop
+  - source_labels: [instance]
+    regex: '(.*):[0-9]+'
+    target_label: instance
+    replacement: '${1}'
+```
+
+## Troubleshooting Common Issues
+
+### High Memory Usage
+
+```promql
+# Check for high cardinality series
+topk(10, count by (__name__)({__name__=~".+"}))
+
+# Identify sources of cardinality
+prometheus_tsdb_symbol_table_size_bytes
+prometheus_tsdb_head_series
+```
+
+### Slow Queries
+
+```promql
+# Monitor query performance
+rate(prometheus_engine_query_duration_seconds_sum[5m]) /
+rate(prometheus_engine_query_duration_seconds_count[5m])
+
+# Check the configured query concurrency limit
+prometheus_engine_queries_concurrent_max
+```
+
+### Storage Issues
+
+```bash
+# Check disk space
+df -h /prometheus
+
+# Monitor WAL size
+du -sh /prometheus/wal/
+
+# Check for corrupted blocks
+# (compare the prometheus_tsdb_blocks_loaded metric against the expected block count)
+```
+
+## Next Steps
+
+After deploying Prometheus in production:
+
+1. Set up [monitoring of Prometheus itself](monitoring-prometheus/)
+2. Configure [alerting rules](../practices/alerting.md)
+3. Implement [backup procedures](backup-recovery/)
+4. Review [security configurations](security.md)
+5. Plan for [scaling and performance tuning](performance-tuning/)
+
+---
+
+**Additional Resources:**
+- [Prometheus Configuration Reference](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
+- [Storage Documentation](https://prometheus.io/docs/prometheus/latest/storage/)
+- [Best Practices](../practices/)
\ No newline at end of file

From ed18ba4fe62b57ea1b11bac81f16c47d667c13c2 Mon Sep 17 00:00:00 2001
From: Parag Gupta
Date: Thu, 7 Aug 2025 17:51:27 +0530
Subject: [PATCH 2/5] fix: correct links to actual markdown files
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix navigation links in operating index to point to .md files instead of
directories to resolve build failures.

- production-deployment/ → production-deployment.md
- monitoring-prometheus/ → monitoring-prometheus.md
- ../operating/security.md → security.md

This should resolve the header rules, pages changed, and redirect rules
build failures.

Signed-off-by: Parag Gupta
---
 docs/operating/index.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/operating/index.md b/docs/operating/index.md
index 2b12b7ca5..8e01f7a9a 100644
--- a/docs/operating/index.md
+++ b/docs/operating/index.md
@@ -12,7 +12,7 @@ This section provides comprehensive guidance for deploying, monitoring, and main

 Running Prometheus in production requires careful planning around scalability, reliability, and operational concerns:

-* [Production Deployment Guide](production-deployment/) - Comprehensive guide for production-ready Prometheus deployments including hardware sizing, high availability setup, and configuration best practices
+* [Production Deployment Guide](production-deployment.md) - Comprehensive guide for production-ready Prometheus deployments including hardware sizing, high availability setup, and configuration best practices
 * [Performance Tuning](performance-tuning/) - Optimization techniques for large-scale deployments, memory management, and query performance
 * [Storage Management](storage-management/) - Long-term storage strategies, retention policies, and data lifecycle management

@@ -20,7 +20,7 @@ Running Prometheus in production requires careful planning around scalability, r

 Effective operation requires monitoring your monitoring infrastructure:

-* [Monitoring Prometheus](monitoring-prometheus/) - How to monitor your Prometheus instances, essential metrics, and alerting on infrastructure health
+* [Monitoring Prometheus](monitoring-prometheus.md) - How to monitor your Prometheus instances, essential metrics, and alerting on infrastructure health
 * [Troubleshooting Guide](troubleshooting/) - Common issues, diagnostic techniques, and resolution strategies for production problems
 * [Backup and Recovery](backup-recovery/) - Data protection strategies, disaster recovery procedures, and backup validation

@@ -28,7 +28,7 @@ Effective operation requires monitoring your monitoring infrastructure:

 Securing monitoring infrastructure is critical for production deployments:

-* [Security Best Practices](../operating/security.md) - Authentication, authorization, network security, and data protection
+* [Security Best Practices](security.md) - Authentication, authorization, network security, and data protection
 * [Compliance Considerations](compliance/) - Meeting regulatory requirements, audit trails, and data governance

 ## Operational Integration

From 820209d8ea3bf804f7e711b1f9841e919687148d Mon Sep 17 00:00:00 2001
From: Parag Gupta
Date: Thu, 7 Aug 2025 17:59:50 +0530
Subject: [PATCH 3/5] fix: add missing sort_rank to frontmatter

Add sort_rank values to new documentation files to match expected
documentation structure:

- production-deployment.md: sort_rank: 1
- monitoring-prometheus.md: sort_rank: 2

This should resolve header rules validation failures.

Signed-off-by: Parag Gupta
---
 docs/operating/monitoring-prometheus.md | 1 +
 docs/operating/production-deployment.md | 1 +
 2 files changed, 2 insertions(+)

diff --git a/docs/operating/monitoring-prometheus.md b/docs/operating/monitoring-prometheus.md
index 2d253e037..59b6421b3 100644
--- a/docs/operating/monitoring-prometheus.md
+++ b/docs/operating/monitoring-prometheus.md
@@ -1,5 +1,6 @@
 ---
 title: Monitoring Prometheus
+sort_rank: 2
 ---

 # Monitoring Prometheus
diff --git a/docs/operating/production-deployment.md b/docs/operating/production-deployment.md
index 487a1fe07..93bba1c05 100644
--- a/docs/operating/production-deployment.md
+++ b/docs/operating/production-deployment.md
@@ -1,5 +1,6 @@
 ---
 title: Production Deployment Guide
+sort_rank: 1
 ---

 # Production Deployment Guide

From c55f56dcb608988e42b1f6223919557f372dc8a5 Mon Sep 17 00:00:00 2001
From: Parag Gupta
Date: Thu, 7 Aug 2025 20:19:34 +0530
Subject: [PATCH 4/5] refactor: address maintainer feedback on mixins and examples

Based on valuable feedback from @bwplotka and @juliusv:

- Replace inline alerting rules with references to official mixins
- Move example scripts to clearly marked examples with disclaimers
- Reference official examples repository for YAML configurations
- Add proper warnings about testing and adaptation needed
- Link to prometheus-community Helm charts for K8s deployments

This approach ensures better maintainability and follows established
project patterns while providing the operational guidance users need.

Addresses: Maintainer feedback on reliability and maintainability

Signed-off-by: Parag Gupta
---
 docs/operating/monitoring-prometheus.md | 196 +++++++-----------------
 docs/operating/production-deployment.md | 185 +++++++---------------
 2 files changed, 108 insertions(+), 273 deletions(-)

diff --git a/docs/operating/monitoring-prometheus.md b/docs/operating/monitoring-prometheus.md
index 59b6421b3..48657f108 100644
--- a/docs/operating/monitoring-prometheus.md
+++ b/docs/operating/monitoring-prometheus.md
@@ -75,168 +75,69 @@ prometheus_tsdb_head_truncations_failed_total

 ## Critical Alerting Rules

-### High-Priority Alerts
+### **Prometheus Monitoring Mixins**

-```yaml
-# prometheus-alerts.yml
-groups:
-- name: prometheus.rules
-  rules:
-
-  # Prometheus instance down
-  - alert: PrometheusDown
-    expr: up{job="prometheus"} == 0
-    for: 5m
-    labels:
-      severity: critical
-    annotations:
-      summary: "Prometheus instance {{ $labels.instance }} is down"
-      description: "Prometheus instance {{ $labels.instance }} has been down for more than 5 minutes."
-
-  # High memory usage
-  - alert: PrometheusHighMemoryUsage
-    expr: >
-      (
-        process_resident_memory_bytes{job="prometheus"} /
-        prometheus_config_last_reload_success_timestamp_seconds{job="prometheus"} * 0 + 1
-      ) * 100 > 80
-    for: 15m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus {{ $labels.instance }} memory usage is high"
-      description: "Prometheus {{ $labels.instance }} memory usage is above 80% for more than 15 minutes."
-
-  # Too many active series
-  - alert: PrometheusHighCardinality
-    expr: prometheus_tsdb_head_series > 1000000
-    for: 10m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus {{ $labels.instance }} has high cardinality"
-      description: "Prometheus {{ $labels.instance }} has {{ $value }} active series, which is above the recommended threshold."
-
-  # Query latency high
-  - alert: PrometheusHighQueryLatency
-    expr: >
-      histogram_quantile(0.95,
-        rate(prometheus_engine_query_duration_seconds_bucket[5m])
-      ) > 30
-    for: 10m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus {{ $labels.instance }} has high query latency"
-      description: "95th percentile query latency is {{ $value }}s for more than 10 minutes."
+Instead of maintaining alerting rules inline (which can become outdated), we recommend using the official Prometheus monitoring mixins that are maintained alongside the codebase:

-  # WAL corruption
-  - alert: PrometheusWALCorruption
-    expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
-    labels:
-      severity: critical
-    annotations:
-      summary: "Prometheus {{ $labels.instance }} WAL corruption detected"
-      description: "Prometheus {{ $labels.instance }} has detected WAL corruption."
+**📋 Official Prometheus Monitoring Mixin**
+- **Repository**: [prometheus/prometheus](https://github.com/prometheus/prometheus/tree/main/documentation/prometheus-mixin)
+- **Maintained**: Versioned with Prometheus releases
+- **Coverage**: Production-ready alerts for Prometheus infrastructure health
+- **Installation**: Follow the mixin documentation for your environment

-  # Compaction failures
-  - alert: PrometheusCompactionFailed
-    expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus {{ $labels.instance }} compaction failed"
-      description: "Prometheus {{ $labels.instance }} has failed compactions in the last hour."
-
-  # Target scrape failures
-  - alert: PrometheusTargetScrapeFailure
-    expr: >
-      (
-        1 - (
-          sum(up) /
-          count(up)
-        )
-      ) * 100 > 10
-    for: 15m
-    labels:
-      severity: warning
-    annotations:
-      summary: "High percentage of target scrape failures"
-      description: "{{ $value }}% of targets are failing to be scraped."
+**Key Alert Categories Covered**:
+- Prometheus instance health and availability
+- High memory usage and resource constraints
+- Query performance and latency issues
+- Storage and WAL-related problems
+- Target scraping failures and connectivity

-  # Storage space low
-  - alert: PrometheusStorageSpaceLow
-    expr: >
-      (
-        node_filesystem_free_bytes{mountpoint="/prometheus"} /
-        node_filesystem_size_bytes{mountpoint="/prometheus"}
-      ) * 100 < 20
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus storage space is low"
-      description: "Prometheus storage has less than 20% free space remaining."
+**🔗 Additional Community Mixins**:
+- [monitoring-mixins/prometheus-mixin](https://monitoring.mixins.dev/prometheus/) - Community-maintained alerts
+- [grafana/jsonnet-libs](https://github.com/grafana/jsonnet-libs) - Grafana Labs mixins

-  # Configuration reload failed
-  - alert: PrometheusConfigReloadFailed
-    expr: prometheus_config_last_reload_successful == 0
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus configuration reload failed"
-      description: "Prometheus {{ $labels.instance }} configuration reload has failed."
-```
+### **Example Custom Alerting Rules**

-### Capacity Planning Alerts
+For organizations needing custom alerts beyond the mixins, here are example patterns. **Note**: These are templates that should be adapted and tested for your specific environment:

 ```yaml
-# capacity-alerts.yml
+# Example: Custom capacity planning alerts
+# ⚠️ Disclaimer: Test thoroughly in your environment before production use
 groups:
-- name: prometheus.capacity
+- name: prometheus.capacity.examples
   rules:
-
-  # Ingestion rate trending up
-  - alert: PrometheusIngestionRateHigh
-    expr: >
-      predict_linear(
-        rate(prometheus_tsdb_head_samples_appended_total[1h])[4h:],
-        24*3600
-      ) > 50000
-    for: 30m
+  - alert: PrometheusHighMemoryUsageCustom
+    expr: |
+      (
+        process_resident_memory_bytes{job="prometheus"} /
+        (1024^3) # Convert to GB
+      ) > 8 # Adjust threshold for your deployment
+    for: 15m
     labels:
       severity: warning
     annotations:
-      summary: "Prometheus ingestion rate trending high"
-      description: "Ingestion rate is predicted to exceed 50k samples/sec within 24 hours."
+      summary: "Prometheus {{ $labels.instance }} memory usage is high"
+      description: "Memory usage is {{ $value }}GB, consider scaling or optimization."

-  # Series growth rate
-  - alert: PrometheusSeriesGrowthHigh
-    expr: >
+  - alert: PrometheusIngestionRateIncreasing
+    expr: |
       predict_linear(
-        prometheus_tsdb_head_series[4h:],
+        rate(prometheus_tsdb_head_samples_appended_total[1h])[4h:],
         24*3600
-      ) > 2000000
-    for: 1h
-    labels:
-      severity: warning
-    annotations:
-      summary: "Prometheus series count growing rapidly"
-      description: "Active series count is predicted to exceed 2M within 24 hours."
-
-  # Query load increasing
-  - alert: PrometheusQueryLoadHigh
-    expr: >
-      rate(prometheus_engine_queries[5m]) > 100
+      ) > 50000 # Adjust based on your capacity
     for: 30m
     labels:
       severity: warning
     annotations:
-      summary: "Prometheus query load is high"
-      description: "Query rate is {{ $value }} queries/sec, consider query optimization."
+      summary: "Prometheus ingestion rate trending high"
+      description: "Predicted to exceed 50k samples/sec within 24 hours."
 ```

+**📝 Important Notes**:
+- These are **example templates** - adapt thresholds for your environment
+- Test thoroughly before deploying to production
+- Consider contributing improvements back to the official mixins
+
 ## Monitoring Dashboard

 ### Grafana Dashboard JSON
@@ -321,11 +222,14 @@ groups:

 ## Health Check Endpoints

-### HTTP Health Checks
+### **Example HTTP Health Checks**
+
+The following are example scripts for monitoring Prometheus health endpoints. **⚠️ Disclaimer**: These are templates that should be tested and adapted for your specific environment - no CI validates these scripts.

 ```bash
 #!/bin/bash
-# prometheus-health-check.sh
+# example-prometheus-health-check.sh
+# ⚠️ Test thoroughly in your environment before production use

 PROMETHEUS_URL="http://localhost:9090"

@@ -333,7 +237,7 @@ PROMETHEUS_URL="http://localhost:9090"
 echo "=== Basic Health Check ==="
 curl -s "$PROMETHEUS_URL/-/healthy" || echo "Health check failed"

-# Readiness check
+# Readiness check
 echo "=== Readiness Check ==="
 curl -s "$PROMETHEUS_URL/-/ready" || echo "Readiness check failed"

@@ -353,6 +257,12 @@ echo "=== Runtime Information ==="
 curl -s "$PROMETHEUS_URL/api/v1/status/runtimeinfo" | jq '.'
 ```

+**📝 Usage Notes**:
+- Requires `curl` and `jq` to be installed
+- Adjust `PROMETHEUS_URL` for your deployment
+- Consider adding authentication headers if Prometheus is secured
+- Test timeout and error handling for your environment
+
 ### Kubernetes Health Checks

 ```yaml
diff --git a/docs/operating/production-deployment.md b/docs/operating/production-deployment.md
index 93bba1c05..78339ff2c 100644
--- a/docs/operating/production-deployment.md
+++ b/docs/operating/production-deployment.md
@@ -206,10 +206,21 @@ remote_write:

 ## Container Deployment

-### Docker Configuration
+### **Official Deployment Examples**
+
+For production-ready deployment configurations, we recommend using the official examples that are maintained and tested:
+
+**📁 Prometheus Examples Repository**
+- **Location**: [prometheus/prometheus/documentation/examples](https://github.com/prometheus/prometheus/tree/main/documentation/examples)
+- **Maintained**: Versioned with Prometheus releases
+- **Tested**: Validated configurations for various deployment scenarios
+
+### **Docker Configuration**
+
+**📋 Basic Docker Setup Example**

 ```dockerfile
-# Dockerfile for production Prometheus
+# Example Dockerfile for production Prometheus
 FROM prom/prometheus:latest

 # Copy configuration
@@ -236,141 +247,55 @@ ENTRYPOINT ["/bin/prometheus", \
             "--web.external-url=https://prometheus.company.com"]
 ```

-### Docker Compose for HA Setup
+### **Kubernetes Deployment**
+
+**📋 Recommended Approach**: Use official Helm charts or kustomize examples
+
+**Official Resources**:
+- **Prometheus Community Helm Chart**: [prometheus-community/helm-charts](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus)
+- **Prometheus Operator**: [prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator)
+- **Official Examples**: [prometheus/prometheus examples](https://github.com/prometheus/prometheus/tree/main/documentation/examples)

+**📝 Key Kubernetes Considerations**:
+- Use StatefulSets for data persistence
+- Configure proper resource requests and limits
+- Set up horizontal pod autoscaling carefully
+- Use persistent volumes for data storage
+- Configure proper security contexts
+- Set up monitoring and alerting for the Kubernetes deployment itself
+
+**Example Resource Requirements**:
 ```yaml
-# docker-compose.yml
-version: '3.8'
-
-services:
-  prometheus-1:
-    image: prom/prometheus:latest
-    container_name: prometheus-1
-    ports:
-      - "9090:9090"
-    volumes:
-      - ./prometheus-1.yml:/etc/prometheus/prometheus.yml
-      - ./rules:/etc/prometheus/rules
-      - prometheus-1-data:/prometheus
-    command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-      - '--storage.tsdb.path=/prometheus'
-      - '--storage.tsdb.retention.time=30d'
-      - '--storage.tsdb.wal-compression'
-      - '--web.enable-lifecycle'
-      - '--web.external-url=http://prometheus-1:9090'
-    restart: unless-stopped
-    networks:
-      - monitoring
-
-  prometheus-2:
-    image: prom/prometheus:latest
-    container_name: prometheus-2
-    ports:
-      - "9091:9090"
-    volumes:
-      - ./prometheus-2.yml:/etc/prometheus/prometheus.yml
-      - ./rules:/etc/prometheus/rules
-      - prometheus-2-data:/prometheus
-    command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-      - '--storage.tsdb.path=/prometheus'
-      - '--storage.tsdb.retention.time=30d'
-      - '--storage.tsdb.wal-compression'
-      - '--web.enable-lifecycle'
-      - '--web.external-url=http://prometheus-2:9090'
-    restart: unless-stopped
-    networks:
-      - monitoring
-
-volumes:
-  prometheus-1-data:
-  prometheus-2-data:
-
-networks:
-  monitoring:
-    driver: bridge
+# Example resource configuration - adjust for your needs
+resources:
+  requests:
+    memory: "2Gi"
+    cpu: "500m"
+  limits:
+    memory: "4Gi"
+    cpu: "2"
 ```

-### Kubernetes Deployment
+### **High Availability with Helm**

-```yaml
-# prometheus-deployment.yaml
-apiVersion: apps/v1
-kind: StatefulSet
-metadata:
-  name: prometheus
-  namespace: monitoring
-spec:
-  serviceName: prometheus
-  replicas: 2
-  selector:
-    matchLabels:
-      app: prometheus
-  template:
-    metadata:
-      labels:
-        app: prometheus
-    spec:
-      serviceAccountName: prometheus
-      securityContext:
-        runAsUser: 65534
-        runAsGroup: 65534
-        fsGroup: 65534
-      containers:
-      - name: prometheus
-        image: prom/prometheus:latest
-        ports:
-        - containerPort: 9090
-          name: http
-        args:
-        - '--config.file=/etc/prometheus/prometheus.yml'
-        - '--storage.tsdb.path=/prometheus'
-        - '--storage.tsdb.retention.time=30d'
-        - '--storage.tsdb.retention.size=50GiB'
-        - '--storage.tsdb.wal-compression'
-        - '--web.enable-lifecycle'
-        - '--web.external-url=http://prometheus.monitoring.svc.cluster.local:9090'
-        - '--web.route-prefix=/'
-        resources:
-          requests:
-            memory: "2Gi"
-            cpu: "500m"
-          limits:
-            memory: "4Gi"
-            cpu: "2"
-        volumeMounts:
-        - name: config
-          mountPath: /etc/prometheus
-        - name: storage
-          mountPath: /prometheus
-        livenessProbe:
-          httpGet:
-            path: /-/healthy
-            port: 9090
-          initialDelaySeconds: 30
-          timeoutSeconds: 30
-        readinessProbe:
-          httpGet:
-            path: /-/ready
-            port: 9090
-          initialDelaySeconds: 30
-          timeoutSeconds: 30
-      volumes:
-      - name: config
-        configMap:
-          name: prometheus-config
-  volumeClaimTemplates:
-  - metadata:
-      name: storage
-    spec:
-      accessModes: ["ReadWriteOnce"]
-      storageClassName: "fast-ssd"
-      resources:
-        requests:
-          storage: 100Gi
+For production HA deployments, consider the prometheus-community Helm chart with these key configurations:
+
+```bash
+# Example Helm installation with HA configuration
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+
+# Install with custom values for HA
+helm install prometheus prometheus-community/prometheus \
+  --set server.replicaCount=2 \
+  --set server.persistentVolume.size=100Gi \
+  --set server.retention=30d \
+  --namespace monitoring \
+  --create-namespace
 ```

+**📋 Important**: Always customize the values.yaml file for your specific requirements. See the [official chart documentation](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) for all available options.
+
 ## Security Hardening

 ### Authentication and Authorization

From 8f2f7a256b5b96de9ee6ea012f649ae443efdfee Mon Sep 17 00:00:00 2001
From: Parag Gupta
Date: Thu, 7 Aug 2025 20:26:08 +0530
Subject: [PATCH 5/5] fix: correct index.md to only contain frontmatter

Following @juliusv's feedback, top-level index.md files only create nav
sections via frontmatter and don't become documentation pages.

Removed content from operating/index.md to match established pattern
seen in other sections (practices, visualization, etc.).

Addresses: @juliusv feedback on documentation structure

Signed-off-by: Parag Gupta
---
 docs/operating/index.md | 46 -----------------------------------------
 1 file changed, 46 deletions(-)

diff --git a/docs/operating/index.md b/docs/operating/index.md
index 8e01f7a9a..6be192e62 100644
--- a/docs/operating/index.md
+++ b/docs/operating/index.md
@@ -3,49 +3,3 @@ title: Operating Prometheus in Production
 sort_rank: 5
 nav_icon: settings
 ---
-
-# Operating Prometheus in Production
-
-This section provides comprehensive guidance for deploying, monitoring, and maintaining Prometheus in production environments. These guides are designed for SRE, DevOps, and platform engineering teams who need to run Prometheus reliably at scale.
-
-## Production Deployment
-
-Running Prometheus in production requires careful planning around scalability, reliability, and operational concerns:
-
-* [Production Deployment Guide](production-deployment.md) - Comprehensive guide for production-ready Prometheus deployments including hardware sizing, high availability setup, and configuration best practices
-* [Performance Tuning](performance-tuning/) - Optimization techniques for large-scale deployments, memory management, and query performance
-* [Storage Management](storage-management/) - Long-term storage strategies, retention policies, and data lifecycle management
-
-## Monitoring and Maintenance
-
-Effective operation requires monitoring your monitoring infrastructure:
-
-* [Monitoring Prometheus](monitoring-prometheus.md) - How to monitor your Prometheus instances, essential metrics, and alerting on infrastructure health
-* [Troubleshooting Guide](troubleshooting/) - Common issues, diagnostic techniques, and resolution strategies for production problems
-* [Backup and Recovery](backup-recovery/) - Data protection strategies, disaster recovery procedures, and backup validation
-
-## Security and Compliance
-
-Securing monitoring infrastructure is critical for production deployments:
-
-* [Security Best Practices](security.md) - Authentication, authorization, network security, and data protection
-* [Compliance Considerations](compliance/) - Meeting regulatory requirements, audit trails, and data governance
-
-## Operational Integration
-
-Prometheus doesn't operate in isolation - integration with your operational ecosystem is key:
-
-* [Alert Management](alert-management/) - Alert routing, escalation policies, and integration with incident management systems
-* [Capacity Planning](capacity-planning/) - Growth planning, resource forecasting, and scaling strategies
-* [Multi-tenancy](multi-tenancy/) - Patterns for shared Prometheus infrastructure, isolation, and resource allocation
-
-## Migration and Upgrades
-
-Managing changes to production monitoring infrastructure:
-
-* [Upgrade Strategies](upgrade-strategies/) - Safe upgrade procedures, rollback plans, and compatibility considerations
-* [Migration Guide](migration-guide/) - Moving from other monitoring systems, data migration, and transition planning
-
----
-
-**Note**: These guides assume you have a basic understanding of Prometheus concepts. If you're new to Prometheus, start with the [Introduction](/docs/introduction/) section.