Commit bc89dd8

CNTRLPLANE-1575: Add support for event-ttl in Kube API Server Operator
Signed-off-by: Thomas Jungblut <[email protected]>
1 parent 7f59958 commit bc89dd8

1 file changed

+246
-0
lines changed

@@ -0,0 +1,246 @@
1+
---
2+
title: event-ttl
3+
authors:
4+
- "@tjungblu"
5+
- "CursorAI"
6+
reviewers:
7+
- benluddy
8+
- p0lyn0mial
9+
approvers:
10+
- sjenning
11+
api-approvers:
12+
- JoelSpeed
13+
creation-date: 2025-10-08
14+
last-updated: 2025-10-08
15+
tracking-link:
16+
- https://issues.redhat.com/browse/OCPSTRAT-2095
17+
- https://issues.redhat.com/browse/CNTRLPLANE-1539
18+
- https://github.com/openshift/api/pull/2520
19+
status: proposed
20+
see-also:
21+
replaces:
22+
superseded-by:
23+
---
24+
25+
# Event TTL Configuration
26+
27+
## Summary
28+
29+
This enhancement describes a configuration option in the `config.openshift.io/v1` API group to configure the event-ttl setting for the kube-apiserver. The event-ttl setting controls how long events are retained in etcd before being automatically deleted.
30+
31+
Currently, OpenShift uses a default event-ttl of 3 hours (180 minutes), while upstream Kubernetes uses 1 hour. This enhancement lets customers configure the value within a range of 30 minutes to 6 hours (360 minutes); the default remains 180 minutes (3 hours).
32+
33+
## Motivation
34+
35+
The event-ttl setting in kube-apiserver controls the retention period for events in etcd. Events are automatically deleted after this duration to prevent etcd from growing indefinitely. Different customers have different requirements for event retention:
36+
37+
- Some customers need longer retention for compliance or debugging purposes
38+
- Others may want shorter retention to reduce etcd storage usage
39+
- The current fixed value of 3 hours may not suit all use cases
40+
41+
### Goals
42+
43+
1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API
44+
2. Provide a reasonable range of values (30 minutes to 6 hours) that covers most customer needs
45+
3. Maintain backward compatibility with the current default of 3 hours (180 minutes)
46+
4. Ensure the configuration is properly validated and applied
47+
48+
### Non-Goals
49+
50+
- Changing the default event-ttl value (will remain 3 hours/180 minutes)
51+
- Supporting event-ttl values outside the recommended range (30-360 minutes)
52+
- Modifying the underlying etcd compaction behavior beyond what the event-ttl setting provides
53+
54+
## Proposal
55+
56+
We propose to add an `eventTTLMinutes` field to the `APIServer` resource in `config.openshift.io/v1` that allows customers to configure the event-ttl setting for kube-apiserver.
57+
58+
### User Stories
59+
60+
#### Story 1: Compliance Requirements
61+
As a cluster administrator in a regulated environment, I want to configure a longer event retention period so that I can meet compliance requirements for audit trails and debugging.
62+
63+
#### Story 2: Storage Optimization
64+
As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting.
65+
66+
#### Story 3: Default Behavior
67+
As a cluster administrator, I want the current default behavior to be preserved so that existing clusters continue to work without changes.
68+
69+
### API Extensions
70+
71+
This enhancement modifies the `APIServer` resource in `config.openshift.io/v1` by adding a new `eventTTLMinutes` field.
72+
73+
### Workflow Description
74+
75+
The workflow for configuring event-ttl is straightforward:
76+
77+
1. **Cluster Administrator** accesses the OpenShift cluster via CLI or web console
78+
2. **Cluster Administrator** edits the `APIServer` resource in the `config.openshift.io/v1` API group
79+
3. **Cluster Administrator** sets the `eventTTLMinutes` field to the desired value in minutes (e.g., 60, 180, 360)
80+
4. **kube-apiserver-operator** detects the configuration change
81+
5. **kube-apiserver-operator** validates the new event-ttl value (must be between 30 and 360 minutes, inclusive)
82+
6. **kube-apiserver-operator** updates the kube-apiserver deployment with the new configuration
83+
7. **kube-apiserver** restarts with the new event-ttl setting
84+
8. **etcd** begins using the new event retention policy for future events
85+
86+
The configuration change takes effect immediately for new events, while existing events continue to use their original TTL until they expire.
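
Steps 5 and 6 of the workflow can be sketched as a single helper. This is a minimal sketch, not the operator's actual code: the function name `eventTTLFlag` and the constant names are hypothetical, and it assumes the operator ultimately renders the value for kube-apiserver's `--event-ttl` duration flag.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minEventTTLMinutes = 30  // lower bound stated in this proposal
	maxEventTTLMinutes = 360 // upper bound stated in this proposal
)

// eventTTLFlag is a hypothetical helper. It range-checks the configured
// minutes and renders them as a value for the --event-ttl flag.
func eventTTLFlag(minutes int32) (string, error) {
	if minutes < minEventTTLMinutes || minutes > maxEventTTLMinutes {
		return "", fmt.Errorf("eventTTLMinutes must be between %d and %d, got %d",
			minEventTTLMinutes, maxEventTTLMinutes, minutes)
	}
	ttl := time.Duration(minutes) * time.Minute
	return "--event-ttl=" + ttl.String(), nil
}

func main() {
	// 10 is deliberately out of range to show the rejection path.
	for _, m := range []int32{60, 180, 360, 10} {
		flag, err := eventTTLFlag(m)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(flag)
	}
}
```

For the default of 180 minutes the helper returns `--event-ttl=3h0m0s`, since Go formats durations in hours, minutes, and seconds.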
87+
88+
### Topology Considerations
89+
90+
#### Hypershift / Hosted Control Planes
91+
92+
This enhancement does not apply to Hypershift.
93+
94+
#### Standalone Clusters
95+
96+
This enhancement is fully applicable to standalone OpenShift clusters. The event-ttl configuration will be applied to the kube-apiserver running in the control plane, affecting event retention in the cluster's etcd.
97+
98+
#### Single-node Deployments or MicroShift
99+
100+
For single-node OpenShift (SNO) deployments, this enhancement will work as expected. The event-ttl configuration will be applied to the kube-apiserver running on the single node.
101+
102+
For MicroShift, this enhancement is not directly applicable as MicroShift uses a different architecture and may not have the same event-ttl configuration options. However, if MicroShift adopts similar event management, the same principles would apply.
103+
104+
### Implementation Details/Notes/Constraints
105+
106+
The proposed API looks like this:
107+
108+
```yaml
kind: APIServer
apiVersion: config.openshift.io/v1
spec:
  eventTTLMinutes: 180  # Integer value in minutes, e.g., 60, 180, 360
```
114+
115+
The `eventTTLMinutes` field will be an integer value representing minutes. The field will be validated to ensure it falls within the required range of 30-360 minutes.
116+
117+
The API design is based on the changes in [openshift/api PR #2520](https://github.com/openshift/api/pull/2520), which includes:
118+
119+
```go
type KubeAPIServerSpec struct {
	StaticPodOperatorSpec `json:",inline"`

	// eventTTLMinutes specifies the amount of time that the events are stored before being deleted.
	// This setting is allowed between 30 minutes minimum up to 6h (360 minutes).
	//
	// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
	// The current default value is 3h (180 minutes).
	//
	// +kubebuilder:default:=180
	// +kubebuilder:validation:Minimum=30
	// +kubebuilder:validation:Maximum=360
	// +optional
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}
```
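
Because the field is an `int32` with `omitempty`, an omitted value decodes to zero. The sketch below shows one way a consumer could resolve that to the documented default when the schema default has not already been applied. The `spec` type is a cut-down stand-in for illustration, and `effectiveEventTTLMinutes` and `parseSpec` are hypothetical names, not part of openshift/api.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultEventTTLMinutes is the current default stated in this proposal.
const defaultEventTTLMinutes int32 = 180

// spec mirrors only the field discussed here; it is not the full API type.
type spec struct {
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}

// effectiveEventTTLMinutes (hypothetical) treats zero as "omitted" and
// falls back to the default. Zero is never a valid value because the
// minimum allowed is 30.
func effectiveEventTTLMinutes(s spec) int32 {
	if s.EventTTLMinutes == 0 {
		return defaultEventTTLMinutes
	}
	return s.EventTTLMinutes
}

// parseSpec decodes a JSON document into the cut-down spec.
func parseSpec(raw string) (spec, error) {
	var s spec
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	for _, raw := range []string{`{}`, `{"eventTTLMinutes": 60}`} {
		s, err := parseSpec(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s -> %d minutes\n", raw, effectiveEventTTLMinutes(s))
	}
}
```

Treating zero as unset is safe here only because the validation range excludes it; a field whose valid range included zero would need a pointer type instead.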
136+
137+
### Impact of Lower TTL Values
138+
139+
Setting the event-ttl to values lower than the upstream default of 1 hour will primarily impact:
140+
141+
1. **etcd Compaction Bandwidth**: With faster expiring events, etcd will need to perform compaction more frequently to remove expired events. This increases the bandwidth usage for etcd compaction operations.
142+
143+
2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.
144+
145+
3. **Event Availability**: Events will be deleted more quickly, potentially reducing the time window available for debugging and troubleshooting.
146+
147+
The main reason for this impact is that with faster expiring events, the system needs to delete events much more frequently, increasing the overhead of the cleanup process.
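
The storage side of this trade-off can be estimated with back-of-the-envelope arithmetic: if events arrive at a steady rate and each one lives for the TTL, the number of events held in etcd at any time is roughly the rate multiplied by the TTL. The sketch below assumes a constant arrival rate, which real clusters do not have, and the rate of 10 events per second is purely illustrative.

```go
package main

import "fmt"

// retainedEvents estimates the steady-state number of stored events,
// assuming a constant arrival rate and a fixed per-event lifetime.
func retainedEvents(ratePerSecond float64, ttlMinutes int32) float64 {
	return ratePerSecond * 60 * float64(ttlMinutes)
}

func main() {
	const ratePerSecond = 10.0 // assumed, illustrative value only
	for _, minutes := range []int32{30, 60, 180, 360} {
		fmt.Printf("ttl=%3d min -> about %.0f events retained\n",
			minutes, retainedEvents(ratePerSecond, minutes))
	}
}
```

Under these assumptions, moving from 180 minutes to 30 minutes cuts the retained event count by a factor of six, while 360 minutes doubles it.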
148+
149+
### Risks and Mitigations
150+
151+
**Risk**: Customers might set extremely low values that could impact etcd performance.
152+
**Mitigation**: The API validation ensures values are within a reasonable range (30-360 minutes).
153+
154+
155+
### Drawbacks
156+
157+
- Adds complexity to the configuration API
158+
- Additional validation and error handling required
159+
160+
## Alternatives (Not Implemented)
161+
162+
1. **Hardcoded Values**: Keep the current fixed value of 3 hours
163+
- **Rejected**: Does not meet customer requirements for configurability
164+
165+
2. **Environment Variable**: Use environment variables instead of API configuration
166+
- **Rejected**: Less user-friendly and harder to manage
167+
168+
3. **Separate CRD**: Create a separate CRD for event configuration
169+
- **Rejected**: Overkill for a single setting; it is better to include it in the existing `APIServer` resource
170+
171+
## Test Plan
172+
173+
**Note:** *Section not required until targeted at a release.*
174+
175+
The test plan will include:
176+
177+
1. **Unit Tests**: Test the API validation and parsing logic
178+
2. **Integration Tests**: Test that the configuration is properly applied to kube-apiserver
179+
3. **E2E Tests**: Test that events are properly deleted after the configured TTL
180+
4. **Performance Tests**: Test the impact of different TTL values on etcd performance
181+
182+
## Graduation Criteria
183+
184+
### Dev Preview -> Tech Preview
185+
186+
- API is implemented and validated
187+
- Basic functionality works end-to-end
188+
- Documentation is available
189+
- Sufficient test coverage
190+
191+
### Tech Preview -> GA
192+
193+
- More comprehensive testing (upgrade, downgrade, scale)
194+
- Performance testing with various TTL values
195+
- User feedback incorporated
196+
- Documentation updated in openshift-docs
197+
198+
### Removing a deprecated feature
199+
200+
This enhancement does not remove any existing features. It only adds new configuration options while maintaining backward compatibility with the existing default behavior.
201+
202+
## Upgrade / Downgrade Strategy
203+
204+
### Upgrade Strategy
205+
206+
- Existing clusters will continue to use the default 3-hour (180-minute) TTL
207+
- No changes required for existing clusters
208+
- New configuration option is available immediately
209+
210+
### Downgrade Strategy
211+
212+
- Configuration will be ignored by older versions
213+
- No impact on cluster functionality
214+
- Events will continue to use the default TTL (180 minutes)
215+
216+
## Version Skew Strategy
217+
218+
- The event-ttl setting is a kube-apiserver configuration
219+
- No coordination required with other components
220+
- Version skew is not a concern for this enhancement
221+
222+
## Operational Aspects of API Extensions
223+
224+
This enhancement modifies the `APIServer` resource but does not add new API extensions. The impact is limited to:
225+
226+
- Configuration validation in the kube-apiserver-operator
227+
- Application of the setting to kube-apiserver deployment
228+
- No impact on API availability or performance
229+
230+
## Support Procedures
231+
232+
### Detection
233+
234+
- Configuration can be verified by checking the `APIServer` resource
235+
- kube-apiserver logs will show the configured event-ttl value
236+
- etcd metrics can be monitored for compaction frequency
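
One way to automate the first two checks is to compare the configured minutes against the value the running kube-apiserver actually received. The sketch below assumes the process arguments are available as a slice of strings in `--flag=value` form. The function name `eventTTLMinutesFromArgs` and the sample arguments are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// eventTTLMinutesFromArgs scans arguments for --event-ttl=<duration> and
// returns the value in whole minutes. found reports whether the flag was
// present at all.
func eventTTLMinutesFromArgs(args []string) (minutes int, found bool, err error) {
	const prefix = "--event-ttl="
	for _, arg := range args {
		if !strings.HasPrefix(arg, prefix) {
			continue
		}
		d, err := time.ParseDuration(strings.TrimPrefix(arg, prefix))
		if err != nil {
			return 0, true, err
		}
		return int(d / time.Minute), true, nil
	}
	return 0, false, nil
}

func main() {
	// Sample arguments for illustration; not captured from a real cluster.
	args := []string{"--v=2", "--event-ttl=3h0m0s"}
	minutes, found, err := eventTTLMinutesFromArgs(args)
	switch {
	case err != nil:
		fmt.Println("invalid event-ttl:", err)
	case !found:
		fmt.Println("event-ttl not set; kube-apiserver falls back to its own default")
	default:
		fmt.Printf("event-ttl is %d minutes\n", minutes)
	}
}
```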
237+
238+
### Troubleshooting
239+
240+
- If events are not being deleted as expected, check the event-ttl configuration
241+
- Monitor etcd compaction metrics for unusual patterns
242+
243+
## Implementation History
244+
245+
- 2025-10-08: Initial enhancement proposal
246+
