feat(flowcontrol): Implement the FlowRegistry #1319
✅ Deploy Preview for gateway-api-inference-extension ready! |
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/assign @kfswain |
Force-pushed from 6b26721 to 3d6ed72.
/ok-to-test |
Still reviewing, I just have some comments that have been hanging since last night
If the code is an orchestrator, why is it called a registry and not simply FlowOrchestrator? |
That is a very precise question. You've hit on a key point about the component's dual role. The name […]
Registry Pattern: The […] Its main job is to be the single source of truth for "what flows exist and what is their configuration?"
Orchestrator Pattern: The […]
Why […] The orchestration is the how, not the what. It's the complex internal machinery that makes the registry robust. We chose the name […]
This is an excellent point of feedback, though. It's clear my documentation could be more precise. I will update the GoDoc comment to clarify this distinction: that it is a Registry which uses an internal Actor-based orchestrator to manage its state. Thank you for the sharp observation! |
Force-pushed from 3d6ed72 to 6cfed87.
/approve I approved in case you would like to address the comments in the followup PR. |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, LukeAVanDrie. The full list of commands accepted by this bot can be found here. The pull request process is described here. |
@ahg-g (cc: @kfswain) -- Thanks again for the LGTM and approval on the initial version. After it was approved, I did another deep pass on the implementation with a focus on hardening the concurrency model and improving the long-term maintainability before merging. I've just pushed up the refined version for your final review. The core logic is the same, but I've made a few significant enhancements that I believe are critical for this foundational layer:
I'm much more confident in the robustness and performance of this version. Let me know what you think. |
Force-pushed from 6cfed87 to 34f429c.
Force-pushed from 34f429c to 264b81a.
How can I tell the difference between the latest commit and what I reviewed before? For the future, I prefer that you not force-push, so that it is easier for the reviewer to diff. |
Force-pushed from 264b81a to e9170bc.
Split from my reflog, so it should be visible as an independent commit now. What are the repo's best practices for merging? Do we squash? |
We have GitHub configured to squash automatically, so just send reviewer responses as separate commits. |
This commit introduces the complete, concrete implementation of the `FlowRegistry`. As the stateful control plane for the flow control system, it provides a scalable, concurrent-safe, and robust foundation for managing the lifecycle of all flows, queues, and shards.

The architecture is designed to prioritize data path performance and strict correctness for control plane state transitions. Key architectural features include:

- **Serialized Control Plane (Actor Model):** All administrative operations and internal state change events are processed serially by a single background event loop. This fundamental design choice eliminates race conditions for complex, multi-step operations like shard scaling and garbage collection, simplifying the logic and guaranteeing correctness.
- **Sharded Architecture:** The registry's state is partitioned across multiple `registryShard` instances. This allows the data path (enqueue/dispatch operations) to scale linearly with the number of workers and CPU cores by minimizing global lock contention.
- **Generational Garbage Collection:** We employ a periodic, generational scanner. This uses a "Trust but Verify" pattern: it identifies candidate flows using an eventually-consistent cache ("Trust"), then performs a "stop-the-world" live check against all shards ("Verify") before deletion. This provides strong consistency precisely when needed.
- **Immutable Flow Identity (`FlowKey`):** The `FlowKey` (ID + Priority) is treated as an immutable identifier. To change the priority of traffic, a caller simply registers a new flow with the new priority. The old flow is gracefully and automatically garbage collected once it becomes idle. This elegant design completely avoids complex and error-prone state migration logic.
- **Hybrid Concurrency Model:** A multi-tiered locking strategy is employed to maximize performance and correctness:
  - `FlowRegistry`: Coarse-grained lock for the serialized control plane.
  - `registryShard`: R/W locks to allow parallel reads from workers.
  - `managedQueue`: A hybrid mutex/atomic model to guarantee strict consistency between queue contents and statistics, which is critical for GC correctness.
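The serialized control plane described above can be illustrated with a minimal Go sketch. This is not the code from this PR: `Registry`, `event`, `do`, `RegisterFlow`, and `FlowCount` are hypothetical names chosen for illustration, and the real `FlowRegistry` carries far more state. The sketch only assumes what the commit message states: a single background goroutine applies every mutation in order, and a `FlowKey` is an immutable (ID, Priority) pair.

```go
package main

import (
	"fmt"
	"sync"
)

// FlowKey is the immutable identity of a flow (ID + Priority).
type FlowKey struct {
	ID       string
	Priority uint
}

// event is a hypothetical control-plane message: a mutation plus a
// completion signal. The PR's real event types are not shown here.
type event struct {
	apply func(flows map[FlowKey]struct{})
	done  chan struct{}
}

// Registry is a simplified stand-in for the FlowRegistry.
type Registry struct {
	events chan event
	wg     sync.WaitGroup
}

func NewRegistry() *Registry {
	r := &Registry{events: make(chan event)}
	r.wg.Add(1)
	go r.run()
	return r
}

// run is the only goroutine that touches the flow map, so no lock is needed.
func (r *Registry) run() {
	defer r.wg.Done()
	flows := make(map[FlowKey]struct{})
	for ev := range r.events {
		ev.apply(flows)
		close(ev.done)
	}
}

// do submits a mutation and blocks until the event loop has applied it.
func (r *Registry) do(apply func(map[FlowKey]struct{})) {
	ev := event{apply: apply, done: make(chan struct{})}
	r.events <- ev
	<-ev.done
}

func (r *Registry) RegisterFlow(k FlowKey) {
	r.do(func(flows map[FlowKey]struct{}) { flows[k] = struct{}{} })
}

func (r *Registry) FlowCount() int {
	var n int
	r.do(func(flows map[FlowKey]struct{}) { n = len(flows) })
	return n
}

// Close stops the event loop; the Registry must not be used afterwards.
func (r *Registry) Close() {
	close(r.events)
	r.wg.Wait()
}

func main() {
	r := NewRegistry()
	defer r.Close()

	// 100 concurrent callers register 10 distinct flows at priority 0.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.RegisterFlow(FlowKey{ID: fmt.Sprintf("flow-%d", i%10), Priority: 0})
		}(i)
	}
	wg.Wait()

	// "Changing" priority means registering a new key; the old one remains
	// until it is collected (garbage collection is not modeled here).
	r.RegisterFlow(FlowKey{ID: "flow-0", Priority: 1})
	fmt.Println(r.FlowCount()) // prints 11
}
```

Because every mutation funnels through one channel, multi-step operations never interleave. The trade-off is that each control-plane call pays a channel round trip, which is acceptable only because the data path does not go through this loop.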
Force-pushed from e9170bc to eefc461.
/lgtm |
/remove-hold |
This PR introduces the complete, concrete implementation of the `FlowRegistry`, the stateful control plane for the flow control system. This is a foundational architectural component that manages the lifecycle of all flows, queues, and policies, providing a sharded, concurrent-safe view of its state to the `FlowController` workers.

The architecture is designed to prioritize data path performance and strict correctness for control plane state transitions, resulting in a robust, scalable, and maintainable foundation for the flow control engine.

This tracks #674

Architectural Highlights

The design introduces a clear separation between the control plane (`FlowRegistry`) and the data plane (`registryShard`), employing several patterns to ensure correctness and performance under high concurrency:

- **Serialized Control Plane (Actor Model):** The `FlowRegistry` uses an actor-like pattern. A single background goroutine processes all state change events from a channel. This serializes all mutations to the registry's core state (like scaling or GC), eliminating a significant class of race conditions.
- **Sharded Data Plane with Fine-Grained Locking:** The registry's state is partitioned across multiple `registryShard` instances. Each shard uses fine-grained, per-priority-band locks, allowing concurrent data path operations across different priorities and dramatically reducing lock contention.
- **Asynchronous, Lock-Free Signaling:** A lock-free atomic state machine is used for signaling between the data path and the control plane (e.g., for queue empty/non-empty transitions). This completely decouples the data path from control plane backpressure, guarantees strictly ordered signals, and prevents lost transitions even under high contention by coalescing signals at the source.
- **"Trust but Verify" Garbage Collection:** A periodic, time-based scanner manages the lifecycle of idle flows. It uses a "Trust but Verify" pattern: it identifies candidate flows using an eventually-consistent cache ("Trust"), then performs a "stop-the-world" live check on the relevant priority band across all shards ("Verify") before deletion. This provides strong consistency precisely when needed while minimizing data path disruption.
- **Immutable Flow Identity:** The `FlowKey` (ID + Priority) is immutable. To change the priority of traffic, a caller simply registers a new flow. The old flow is gracefully and automatically garbage collected once it becomes idle. This elegant design completely avoids complex and error-prone state migration logic.

Suggested Review Path
- `contracts/` directory, to understand the high-level interfaces and public API contracts.
- `registry/doc.go`: This file contains the detailed architectural overview, including the concurrency and garbage collection strategies.
- `shard.go` (the data plane slice), `managedqueue.go` (the stateful decorator with its lock-free signaler), `flowstate.go` (the GC cache), and finally `registry.go` (the orchestrator).
- `config.go`.

Testing Philosophy and Validation
This PR includes a complete test suite intended to give high confidence in the correctness of this complex, concurrent system. The testing strategy is a key feature of this contribution:

- The `FlowRegistry` tests (`registry_test.go`) use a test harness with an "event tap". This allows for gray-box testing of the actor model, enabling fast, deterministic, and race-free validation of asynchronous operations without sleeps or polling.
- Concurrency tests (`Test..._Concurrency_...`) exist to validate thread-safety under stress, specifically targeting the most critical race conditions like the GC/scaling lock interaction, the draining state transition, and the lock-free signaling mechanism.
- Supporting components (`config`, `flowstate`, etc.) are tested in strict isolation to exhaustively validate their specific logic, error paths, and invariants.
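The "event tap" idea can be sketched in Go as follows. This is an assumption about the general shape of such a harness, not the code in `registry_test.go`: the `loop`, `applied`, and `waitFor` names and the event strings are hypothetical. The point it shows is that the actor publishes each event to a test-only channel after applying it, so a test can block on that channel instead of sleeping or polling.

```go
package main

import "fmt"

// applied is what the tap reports: the event and how many events the
// loop has applied so far (a snapshot, so tests never read shared state).
type applied struct {
	Event string
	Count int
}

// loop is a hypothetical minimal actor: one goroutine applies events in order.
type loop struct {
	events chan string
	tap    chan applied // test-only hook; nil in production use
}

func newLoop(tap chan applied) *loop {
	l := &loop{events: make(chan string, 16), tap: tap}
	go l.run()
	return l
}

func (l *loop) run() {
	count := 0
	for ev := range l.events {
		count++ // "apply" the event; real logic would mutate registry state
		if l.tap != nil {
			l.tap <- applied{Event: ev, Count: count}
		}
	}
	if l.tap != nil {
		close(l.tap) // lets waiting tests observe shutdown
	}
}

// waitFor blocks until the named event has been applied, or reports false
// if the loop shut down first. No sleeps, no polling.
func waitFor(tap <-chan applied, name string) (applied, bool) {
	for a := range tap {
		if a.Event == name {
			return a, true
		}
	}
	return applied{}, false
}

func main() {
	tap := make(chan applied, 16) // buffered so the loop is not stalled by the test
	l := newLoop(tap)

	l.events <- "flowRegistered"
	l.events <- "queueBecameEmpty"

	got, ok := waitFor(tap, "queueBecameEmpty")
	fmt.Println(got.Count, ok) // prints: 2 true

	close(l.events)
}
```

Sending a snapshot through the tap, rather than letting the test read the actor's fields, keeps the harness free of data races even when several events are queued.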