Commit f744427

add k8s prod readiness checklist
1 parent 3ceb3e0 commit f744427

File tree

5 files changed

+306
-1
lines changed

Lines changed: 293 additions & 0 deletions
@@ -0,0 +1,293 @@
1+
= Redpanda Kubernetes Production Readiness Checklist
2+
:description: Comprehensive checklist for validating Redpanda deployments in Kubernetes against production readiness standards.
3+
:page-context-links: [{"name": "Linux", "to": "deploy:redpanda/linux/index.adoc" },{"name": "Kubernetes", "to": "deploy:redpanda/kubernetes/index.adoc" } ]
4+
:page-categories: Production, Deployment
5+
6+
Use this checklist to validate a Redpanda deployment in Kubernetes against production readiness standards. The automated checker script verifies most requirements; complete the manual checks to cover the rest.
7+
8+
TIP: The automated production readiness checker (`check-redpanda-readiness-modular.py`) can validate most of these requirements automatically. Run it against your Kubernetes deployment to get a comprehensive assessment.
9+
10+
== Critical Production Requirements
11+
12+
These checks are essential for a stable, reliable production deployment. All critical requirements should pass before going live.
13+
14+
=== Deployment Method Validation
15+
16+
==== Automated Checks
17+
18+
**Deployment method detection**:: Verify that the deployment method (Helm or Operator) is properly detected and configured.
19+
+
20+
[,bash]
21+
----
22+
./check-redpanda-readiness-modular.py -n <namespace> -d <deployment-name>
23+
----
24+
25+
**Operator CRDs validation** (Operator deployments only):: Ensure all required Custom Resource Definitions are installed and available.
26+
+
27+
Required CRDs:
28+
+
29+
* `clusters.cluster.redpanda.com`
30+
* `topics.cluster.redpanda.com`
31+
* `users.cluster.redpanda.com`
32+
* `schemas.cluster.redpanda.com`
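A minimal sketch of this check in shell, assuming `kubectl` is already configured for the target cluster. The helper name `missing_crds` is hypothetical and is not part of the checker script. It reads installed CRD names from standard input, one per line, and prints each of the four required names that is absent.

```shell
# Hypothetical helper: reads installed CRD names on stdin (one per line)
# and prints every required Redpanda CRD that is not in that list.
missing_crds() {
  local installed required
  installed="$(cat)"
  for required in \
    clusters.cluster.redpanda.com \
    topics.cluster.redpanda.com \
    users.cluster.redpanda.com \
    schemas.cluster.redpanda.com
  do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$installed" | grep -qxF "$required" || printf '%s\n' "$required"
  done
}

# Usage against a live cluster (not run here). `kubectl get crd -o name`
# prefixes each name with the resource type, which sed strips:
#   kubectl get crd -o name | sed 's|.*/||' | missing_crds
```

Empty output means all four CRDs are installed.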
33+
34+
=== Cluster Health and Configuration
35+
36+
==== Automated Checks
37+
38+
**Cluster health status**:: Verify the cluster reports as healthy with no broker issues.
39+
+
40+
[,bash]
41+
----
42+
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster health
43+
----
44+
45+
**Minimum broker count (≥3)**:: Ensure at least 3 brokers are running for production fault tolerance.
46+
+
47+
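Use an odd number of brokers (3, 5, 7, and so on). Raft consensus needs a majority of replicas to make progress, so an even-sized cluster tolerates no more broker failures than the next smaller odd-sized one.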
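The arithmetic behind the odd-number recommendation can be sketched as follows. This assumes the general majority-quorum rule, under which a group of `n` voters survives `floor((n - 1) / 2)` failures. The helper name `tolerated_failures` is hypothetical.

```shell
# Hypothetical helper: number of failures a majority quorum of $1 voters
# can survive, that is floor((n - 1) / 2) using integer division.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
  echo "$n brokers tolerate $(tolerated_failures "$n") failure(s)"
done
```

Going from 3 to 4 brokers leaves the result at 1, while 5 brokers raises it to 2.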
48+
49+
**Default topic replication factor (≥3)**:: Verify the default replication factor is set appropriately for production.
50+
+
51+
[,bash]
52+
----
53+
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get default_topic_replications
54+
----
55+
56+
**Existing topics replication factor (≥3)**:: Check that all existing topics have adequate replication.
57+
+
58+
[,bash]
59+
----
60+
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic list
61+
----
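To pick out under-replicated topics from that listing, a small filter can help. This sketch assumes `rpk topic list` prints a header row followed by `NAME`, `PARTITIONS`, and `REPLICAS` columns; confirm the layout for your `rpk` version before relying on it. The helper name `under_replicated_topics` is hypothetical.

```shell
# Hypothetical helper: reads `rpk topic list` output on stdin and prints
# the name and replica count of every topic with fewer than 3 replicas.
# Assumes a header row, then NAME, PARTITIONS, REPLICAS columns.
under_replicated_topics() {
  awk 'NR > 1 && NF >= 3 && $3 < 3 { print $1, $3 }'
}

# Usage against a live cluster (not run here):
#   kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic list \
#     | under_replicated_topics
```

Empty output means every listed topic has at least 3 replicas.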
62+
63+
**No brokers in maintenance mode**:: Ensure no brokers are currently in maintenance mode during normal operations.
64+
65+
**All brokers active membership**:: Verify all brokers are in active state and not being decommissioned.
66+
67+
=== Storage Configuration
68+
69+
==== Automated Checks
70+
71+
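**Persistent storage configuration**:: Verify that the data directory is backed by PersistentVolumes rather than `hostPath` volumes.
72+
+
73+
`hostPath` storage is not suitable for production because it lacks durability guarantees: the volume is tied to one node's filesystem, and Kubernetes does not track that dependency when it schedules the Pod.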
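One way to spot `hostPath` usage by hand is to list each Pod's `hostPath` volumes and flag any Pod that has one. This sketch flags every `hostPath` volume, not only the data directory. The helper name `pods_with_hostpath` and the `<label-selector>` placeholder are assumptions.

```shell
# Hypothetical helper: reads "pod<TAB>hostPath paths" lines on stdin and
# prints the name of every Pod whose second field is not empty.
pods_with_hostpath() {
  awk -F '\t' '$2 != "" { print $1 }'
}

# Usage against a live cluster (not run here). The jsonpath template
# prints one line per Pod: its name, a tab, then any hostPath paths.
#   kubectl get pods -n <namespace> -l <label-selector> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].hostPath.path}{"\n"}{end}' \
#     | pods_with_hostpath
```

Empty output means no selected Pod mounts a `hostPath` volume.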
74+
75+
==== Manual Checks
76+
77+
**Storage class performance**:: Ensure storage classes provide adequate IOPS and throughput for your workload.
78+
+
79+
* For high-throughput workloads: Use SSD-based storage classes
80+
* Consider provisioned IOPS where available
81+
* Test storage performance under load
82+
83+
**Volume sizing**:: Plan storage capacity for data growth and retention requirements.
84+
+
85+
* Account for replication overhead
86+
* Include space for compaction operations
87+
* Monitor disk usage trends
88+
89+
=== Resource Allocation
90+
91+
==== Automated Checks
92+
93+
**CPU and memory resource limits**:: Verify pods have resource requests and limits configured.
94+
+
95+
All Redpanda pods must have:
96+
+
97+
* CPU requests and limits
98+
* Memory requests and limits
99+
100+
**CPU to memory ratio (1:2 minimum)**:: Ensure adequate memory allocation relative to CPU for optimal performance.
101+
+
102+
Production deployments should provision at least 2 GiB of memory per CPU core.
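The 2 GiB per core rule can be checked with integer arithmetic. This sketch handles only whole-core or millicore CPU quantities (`4`, `1500m`) and memory quantities with a `Gi` or `Mi` suffix; other forms are out of scope. All three function names are hypothetical.

```shell
# Convert a Kubernetes CPU quantity ("4" or "1500m") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity with a Gi or Mi suffix to MiB.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported memory quantity: $1" >&2; return 1 ;;
  esac
}

# Succeeds when memory is at least 2 GiB (2048 MiB) per CPU core.
# mib / 2048 >= millicores / 1000, cross-multiplied to stay in integers.
meets_memory_ratio() {
  local millicores mib
  millicores="$(cpu_to_millicores "$1")" || return 1
  mib="$(mem_to_mib "$2")" || return 1
  [ $(( mib * 1000 )) -ge $(( millicores * 2048 )) ]
}

# Usage: pass the container's CPU and memory values, for example
#   meets_memory_ratio 4 8Gi && echo "ratio ok"
```

For example, `meets_memory_ratio 4 8Gi` succeeds, while `meets_memory_ratio 4 4Gi` fails.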
103+
104+
==== Manual Checks
105+
106+
**Resource capacity planning**:: Ensure nodes have adequate resources for the configured limits.
107+
+
108+
* Verify cluster has sufficient total resources
109+
* Account for other workloads on shared nodes
110+
* Plan for resource growth and burst capacity
111+
112+
=== Security Configuration
113+
114+
==== Automated Checks
115+
116+
**Authorization enabled**:: Verify Kafka authorization is enabled for access control.
117+
+
118+
[,bash]
119+
----
120+
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get kafka_enable_authorization
121+
----
122+
123+
**Developer mode disabled**:: Ensure developer mode is disabled in production configuration.
124+
+
125+
Developer mode should never be enabled in production environments.
126+
127+
==== Manual Checks
128+
129+
**Authentication configuration**:: Configure appropriate authentication mechanisms.
130+
+
131+
* Set up SASL authentication for client connections
132+
* Configure TLS certificates for encryption
133+
* Implement proper user management and ACLs
134+
135+
**Network security**:: Secure network access to the cluster.
136+
+
137+
* Configure NetworkPolicies to restrict pod-to-pod communication
138+
* Use TLS for all client connections
139+
* Secure admin API endpoints
140+
141+
== Recommended Production Enhancements
142+
143+
These checks improve operational robustness and performance but are not critical for basic functionality.
144+
145+
=== Cluster Configuration
146+
147+
==== Automated Checks
148+
149+
**Redpanda license verification**:: Validate Enterprise license if using Enterprise features.
150+
151+
**Consistent Redpanda version**:: Ensure all brokers run the same Redpanda version.
152+
+
153+
Version mismatches can cause compatibility issues and should be resolved.
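A rough manual equivalent is to compare the container image of each broker Pod. This sketch uses the image reference as a proxy for the Redpanda version and assumes the container is named `redpanda`, as in the `kubectl exec` commands on this page. The helper name `distinct_image_count` and the `<label-selector>` placeholder are assumptions.

```shell
# Hypothetical helper: reads "pod<TAB>image" lines on stdin and prints
# the number of distinct images in use.
distinct_image_count() {
  awk -F '\t' 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (i in seen) n++; print n }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n <namespace> -l <label-selector> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="redpanda")].image}{"\n"}{end}' \
#     | distinct_image_count
```

A result of `1` means all selected Pods run the same image. A higher count is expected while a rolling upgrade is in progress.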
154+
155+
=== Storage Optimization
156+
157+
==== Automated Checks
158+
159+
**XFS filesystem for data directory**:: Verify that data directories use the XFS filesystem for optimal performance.
160+
+
161+
[,bash]
162+
----
163+
kubectl exec -n <namespace> <pod-name> -c redpanda -- df -khT <data-directory>
164+
----
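To turn that output into a pass or fail result, the filesystem type can be extracted from the `Type` column. This sketch assumes the container's `df` also accepts `-P`, which keeps each filesystem on a single line. The helper name `fs_type` is hypothetical.

```shell
# Hypothetical helper: reads `df -PT <path>` output on stdin and prints
# the filesystem type (second column of the first data row).
fs_type() {
  awk 'NR == 2 { print $2 }'
}

# Usage against a live cluster (not run here):
#   type="$(kubectl exec -n <namespace> <pod-name> -c redpanda -- \
#     df -PT <data-directory> | fs_type)"
#   [ "$type" = "xfs" ] || echo "data directory is on $type, not xfs"
```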
165+
166+
==== Manual Checks
167+
168+
**Storage performance tuning**:: Optimize storage configuration for production workloads.
169+
+
170+
* Configure appropriate `vm.swappiness` settings
171+
* Tune filesystem mount options
172+
* Consider storage class performance characteristics
173+
174+
=== Resource Optimization
175+
176+
==== Automated Checks
177+
178+
**Pod anti-affinity rules**:: Configure pod anti-affinity to spread brokers across nodes.
179+
+
180+
This prevents a single node failure from taking down more than one broker.
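The effect of the rule can be observed by checking where the broker Pods are currently scheduled. This sketch inspects placement only and does not inspect the affinity rule itself. The helper name `shared_nodes` and the `<label-selector>` placeholder are assumptions.

```shell
# Hypothetical helper: reads "pod<TAB>node" lines on stdin and prints
# each node that hosts more than one of the listed Pods, with its count.
shared_nodes() {
  awk -F '\t' 'NF >= 2 && $2 != "" { count[$2]++ }
    END { for (n in count) if (count[n] > 1) print n, count[n] }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n <namespace> -l <label-selector> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' \
#     | shared_nodes
```

Empty output means every selected Pod runs on its own node.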
181+
182+
**Pod Disruption Budget configured**:: Set up PDBs to control voluntary disruptions during maintenance.
183+
184+
**No fractional CPU requests**:: Ensure CPU requests use whole numbers for consistent performance.
185+
+
186+
Fractional CPUs can lead to performance variability in production.
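A minimal sketch of the whole-number test, assuming the CPU request is written either as a plain integer (`4`) or in millicores (`2000m`). Any other form, such as `1.5`, is treated as fractional. The function name `is_whole_cpu` is hypothetical.

```shell
# Succeeds when a CPU quantity is a whole number of cores.
is_whole_cpu() {
  local value="$1"
  case "$value" in
    *m)
      # Millicore form: strip the suffix, require digits only, then
      # accept the value only if it is divisible by 1000.
      value="${value%m}"
      case "$value" in ''|*[!0-9]*) return 1 ;; esac
      [ $(( value % 1000 )) -eq 0 ]
      ;;
    ''|*[!0-9]*)
      # Empty, or contains a non-digit such as the "." in "1.5".
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Usage:
#   is_whole_cpu "$cpu_request" || echo "fractional CPU request: $cpu_request"
```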
187+
188+
**Node isolation configuration**:: Configure taints/tolerations or nodeSelector for workload isolation.
189+
+
190+
Isolating Redpanda workloads improves performance predictability.
191+
192+
==== Manual Checks
193+
194+
**CPU pinning and NUMA awareness**:: Configure CPU affinity for optimal performance on multi-core systems.
195+
196+
**Memory allocation strategy**:: Optimize memory settings for your workload patterns.
197+
198+
=== Security Enhancements
199+
200+
==== Automated Checks
201+
202+
**Overprovisioned disabled**:: Ensure overprovisioned mode is disabled for production stability.
203+
204+
**System requirements validation**:: Run system checks to validate optimal configuration.
205+
+
206+
[,bash]
207+
----
208+
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk redpanda check
209+
----
210+
211+
==== Manual Checks
212+
213+
**Security scanning**:: Regularly scan container images and configurations for vulnerabilities.
214+
215+
**Backup and recovery procedures**:: Implement and test backup and recovery processes.
216+
+
217+
* Configure topic backups
218+
* Test cluster recovery procedures
219+
* Document emergency response procedures
220+
221+
**Audit logging**:: Enable and configure audit logging for compliance requirements.
222+
223+
== Monitoring and Observability
224+
225+
=== Manual Checks
226+
227+
**Monitoring setup**:: Deploy comprehensive monitoring for cluster health and performance.
228+
+
229+
* Set up Prometheus metrics collection
230+
* Configure Grafana dashboards
231+
* Implement alerting rules
232+
233+
**Log aggregation**:: Configure centralized log collection and analysis.
234+
+
235+
* Forward Redpanda logs to central logging system
236+
* Set up log retention policies
237+
* Configure log-based alerting
238+
239+
**Health checks**:: Implement application-level health checks.
240+
+
241+
* Configure Kubernetes liveness and readiness probes
242+
* Set up external health monitoring
243+
* Define SLI/SLO metrics
244+
245+
== Operational Readiness
246+
247+
=== Manual Checks
248+
249+
**Deployment automation**:: Implement Infrastructure as Code for reproducible deployments.
250+
+
251+
* Use Helm charts or Kubernetes manifests in version control
252+
* Implement GitOps workflows
253+
* Automate testing and validation
254+
255+
**Upgrade procedures**:: Document and test cluster upgrade processes.
256+
+
257+
* Plan for rolling upgrades with zero downtime
258+
* Test upgrade procedures in staging environments
259+
* Implement rollback capabilities
260+
261+
**Incident response**:: Prepare for operational incidents and outages.
262+
+
263+
* Document troubleshooting procedures
264+
* Establish on-call processes
265+
* Create incident response playbooks
266+
267+
== Running the Automated Checker
268+
269+
Use the automated checker to validate most requirements:
270+
271+
[,bash]
272+
----
273+
# Basic check (shows only issues)
274+
./check-redpanda-readiness-modular.py -n <namespace> -d <deployment-name>
275+
276+
# Verbose output (shows all results)
277+
./check-redpanda-readiness-modular.py -n <namespace> -d <deployment-name> -v
278+
279+
# Generate JSON report
280+
./check-redpanda-readiness-modular.py -n <namespace> -d <deployment-name> -o report.json
281+
----
282+
283+
The script automatically detects the deployment method (Helm or Operator) and validates the configuration against production standards.
284+
285+
== Next Steps
286+
287+
After completing this checklist:
288+
289+
1. **Performance testing**: Conduct load testing to validate performance under expected traffic.
290+
2. **Disaster recovery testing**: Test backup and recovery procedures.
291+
3. **Security review**: Conduct security assessment and penetration testing.
292+
4. **Operational validation**: Verify monitoring, alerting, and incident response procedures.
293+
5. **Documentation**: Complete operational runbooks and troubleshooting guides.

modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc

Lines changed: 4 additions & 0 deletions
@@ -619,6 +619,10 @@ include::deploy:partial$kubernetes/guides/troubleshoot.adoc[leveloffset=+1]
619619

620620
== Next steps
621621

622+
After deploying Redpanda, validate your production readiness:
623+
624+
- xref:deploy:redpanda/kubernetes/k-production-checklist.adoc[Production readiness checklist] - Comprehensive validation of your deployment against production standards
625+
622626
See the xref:manage:kubernetes/index.adoc[Manage Kubernetes topics] to learn how to customize your deployment to meet your needs.
623627

624628
include::shared:partial$suggested-reading.adoc[]

modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ The production deployment tasks involve Kubernetes administrators (admins) as we
1010
. All: xref:deploy:redpanda/kubernetes/k-requirements.adoc[Review the requirements and recommendations] to align on prerequisites.
1111
. Admin: xref:deploy:redpanda/kubernetes/k-tune-workers.adoc[Tune the worker nodes] for best performance.
1212
. User: xref:deploy:redpanda/kubernetes/k-production-deployment.adoc[Deploy Redpanda] using either the Redpanda Operator or the Redpanda Helm chart.
13+
. All: xref:deploy:redpanda/kubernetes/k-production-checklist.adoc[Validate production readiness] using the comprehensive checklist to ensure your deployment meets production standards.

modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc

Lines changed: 4 additions & 1 deletion
@@ -11,7 +11,10 @@ include::deploy:partial$requirements.adoc[]
1111

1212
== Next steps
1313

14-
xref:deploy:redpanda/kubernetes/k-production-deployment.adoc[].
14+
After meeting these requirements, proceed to:
15+
16+
- xref:deploy:redpanda/kubernetes/k-production-deployment.adoc[Deploy Redpanda for production]
17+
- xref:deploy:redpanda/kubernetes/k-production-checklist.adoc[Validate production readiness] with the comprehensive checklist
1518

1619
include::shared:partial$suggested-reading.adoc[]
1720

modules/deploy/partials/high-availability.adoc

Lines changed: 4 additions & 0 deletions
@@ -531,6 +531,10 @@ cat debug.log | grep -v ApiVersions | egrep 'opening|read'
531531

532532
include::shared:partial$suggested-reading.adoc[]
533533

534+
ifdef::env-kubernetes[]
535+
* xref:deploy:redpanda/kubernetes/k-production-checklist.adoc[Production readiness checklist] - Validate your Kubernetes deployment against production standards
536+
endif::[]
537+
534538
* https://redpanda.com/blog/redpanda-official-jepsen-report-and-analysis?utm_assettype=report&utm_assetname=roi_report&utm_source=gated_content&utm_medium=content&utm_campaign=jepsen_blog[Redpanda's official Jepsen report^]
535539
* https://redpanda.com/blog/simplifying-raft-replication-in-redpanda[Simplifying Redpanda Raft implementation^]
536540
* https://redpanda.com/blog/kafka-redpanda-availability[An availability footprint of the Redpanda and Apache Kafka replication protocols^]
