Proxy has Service Fabric integration for dynamic discovery of routes #257
Conversation
Some comments.
Could you please also comment on the concurrency requirements for the services being introduced here? Are any of them supposed to be called concurrently?
src/Integrations/ServiceFabric/FabricWrapper/PropertyManagementClientWrapper.cs
src/Integrations/ServiceFabric/ServiceDiscovery/Worker/ServiceFabricExtensionConfigProvider.cs
src/Integrations/ServiceFabric/ServiceDiscovery/Worker/ServiceFabricExtensionConfigProvider.cs
src/Integrations/ServiceFabric/ServiceDiscovery/Worker/ServiceFabricExtensionConfigProvider.cs
src/ReverseProxy/Service/Config/ServiceDiscoveryConfigApplier.cs
.../ServiceFabric/DependencyInjection/ServiceFabricIntegrationIslandGatewayBuilderExtensions.cs
Do we need to use the IslandGateway name in the parameter keys? E.g.
@samsp-msft this is covered by
.../ServiceFabric/DependencyInjection/ServiceFabricIntegrationIslandGatewayBuilderExtensions.cs
src/Integrations/ServiceFabric/Exceptions/ServiceFabricIntegrationException.cs
I'm not a fan of how this plugs into the proxy. It seems like the
What happens when multiple instances have an inconsistent view of the configuration because of network issues between SF and the node?
src/ReverseProxy.ServiceFabric/ServiceDiscovery/Util/FabricServiceEndpointSelector.cs
@davidfowl ditto. It needs additional work and design, which is why I am leaving the last-mile open for you and the team to decide what fits best.
@davidfowl I am suggesting this should be tackled as a separate PR, which would also address how dynamic discovery impacts the configuration model (see next point below). I would rather defer to you and the team to decide how best to take it, hence I am not attempting to address it in this PR.
@davidfowl Not necessarily, but possibly. Examples of what others are doing:
@davidfowl At the risk of preaching to the choir, I'll say there's no free lunch. Eventual consistency models offer simplicity, but that comes at a cost. We believe this is the right choice for a component like YARP in distributed hosting scenarios, and there is precedent for this choice. Suggested reading: Embracing eventual consistency in SoA networking, by Envoy's lead author, regarding Envoy's design choice. Orchestration layers can be built on top of our eventually consistent model to provide strong guarantees when needed, and separately from YARP's core.
In the case of YARP, EC simplifies the code and consequently our work as proxy developers, but it will likely make the user's experience worse because, as the above article points out, eventual consistency is counterintuitive. Moreover, I would challenge the assumption that in the real world changes are always rolled out gradually, with enough time between changes for nodes to reach consensus before the next change is applied, so that users are shielded from observing inconsistent state. In my experience, service changes might also be applied immediately to a large share of nodes (up to 100%) in special, but not so uncommon, cases. Examples include rolling out a security hotfix or deploying a change to a relatively small geo region. Furthermore, even in the case of gradual deployment, the percentage can be increased in incrementally growing steps (e.g. 1%, 2%, 4%, 8%, 16%, 32%, etc.) to speed up the roll-out, so at the later stages a substantial share of the system's nodes is affected by each step, which can expose users to the inconsistencies inherent to the EC approach.

All in all, I realize that the network is inherently unreliable and the CAP theorem imposes a hard limit on what we can achieve, but I strongly believe we should always aim to provide strong consistency first and relax the model only if it looks unreasonably expensive to implement or maintain. That said, implementing a mechanism to synchronize SF configuration among YARP nodes seems to be outside this PR's scope.
# Conflicts:
#   reverse-proxy.sln
#   src/ReverseProxy/Abstractions/Clusters/IClustersRepo.cs
#   src/ReverseProxy/Abstractions/Destinations/Contract/Destination.cs
#   src/ReverseProxy/Abstractions/Routes/Contract/AuthorizationConstants.cs
#   src/ReverseProxy/Abstractions/Routes/Contract/CorsConstants.cs
#   src/ReverseProxy/Abstractions/Routes/IRoutesRepo.cs
#   src/ReverseProxy/Abstractions/Routes/IRuntimeRouteBuilder.cs
#   src/ReverseProxy/Microsoft.ReverseProxy.csproj
#   test/ReverseProxy.Tests.Common/InMemoryConfigProvider.cs
#   test/ReverseProxy.Tests.Common/TestConfigError.cs
#   test/ReverseProxy.Tests.Common/TestConfigErrorReporter.cs
#   test/ReverseProxy.Tests.Common/TestLogger.cs
#   test/ReverseProxy.Tests.Common/TestLoggerFactory.cs
#   test/ReverseProxy.Tests.Common/TestRandom.cs
#   test/ReverseProxy.Tests.Common/TestRandomFactory.cs
Follow-up
- Custom exception types
- IClock
- LoadFromServiceFabric should take options.
- What else?
This looks OK to merge, but I'll let Alexander handle the final sign-off and merge.
src/ReverseProxy.ServiceFabric/ServiceDiscovery/ServiceExtensionLabelsProvider.cs
It's almost ready to be merged; there are only a few items left:
- Confirm making `IClock` public
- Close on the "Backend" vs "Cluster" in labels question
- Submit at least one issue capturing all follow-up work items:
  - Review error handling logic and harden it where necessary
  - Label name casing
  - Custom exception types
  - `LoadFromServiceFabric` should take options
  - Get rid of unnecessary closures (e.g. in `PropertyManagementClientWrapper.GetPropertyAsync`)
  - Parse all supported `*Options` (e.g. `LoadBalancingOptions`, `SessionAffinityOptions`, etc.) in `LabelsParser`
  - Convert static classes to interfaces + implementations where appropriate (e.g. `LabelsParser`)
  - Apply the Logging Extension pattern instead of direct `ILogger` calls
  - Finalize the extension's name discussion
src/ReverseProxy.ServiceFabric/ServiceDiscovery/IServiceExtensionLabelsProvider.cs
@davidni I submitted all the necessary follow-up items listed below; please confirm. Follow-up items:
cc: @Tratcher
Found at least one blocking bug.
Hmm, it seems strange. The last comment on this PR has just disappeared.
@alnikola Sorry, it was me 😁. I added a comment about disposing
Simply producing a new cluster for each service-endpoint pair will not be compatible with the current load balancing implementation, because YARP will handle each such cluster independently.
Why do you think that clustering of endpoints isn't compatible? Health checks might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.
The main concern is load balancing (in case it's enabled, of course). Currently, load balancing considers only destinations belonging to one cluster; that is, it's based on the assumption that clusters are independent. However, in the above approach multiple YARP clusters would point at the same SF service. Thus, what looks like balanced load from YARP's perspective might not actually be balanced from the SF service's perspective.
Note: load balancing config is not yet supported here, but it will be soon.
LGTM
First of all... thanks for your answer! Sorry, but I don't see why endpoints of a service shouldn't be handled independently (including health checks). If a service defines, for example, a control-api endpoint and a data-api endpoint, and parallel instances of the service are hosted on different nodes in the Service Fabric cluster, why can't all control-api endpoints form one "proxy cluster" and all data-api endpoints another? The APIs operate independently in the context of proxy routing, and this may also include optional health check routes per endpoint. If a service instance becomes busy, then the health checks of both endpoints should report that status independently. I think health checks are not directly tied to services. Also, metrics like CPU load are not tied to the service; they are a property of the node (the machine) on which the service instance lives. So my question is: what are the specific blockers for health checks in this context? My SF mapping here is that SF services are a kind of hosting infrastructure (application nodes) for optional state persistence and independent listeners/endpoints, which are the finer-grained "microservices" (yes, I know this is a bit pragmatic and doesn't follow the microservice pattern) ;-)
Another question is: shouldn't the work of collecting health metrics for nodes, services, or endpoints be delegated to another service that is much more specialized?
My concern was only about load balancing in the given case of mapping service-endpoint pairs to clusters. Health checks should work fine.
YARP's health check mechanism is aimed at supporting the main YARP scenarios. However, the health check subsystem is quite extensible, so you can integrate it with an SF-specific monitoring tool if that's necessary in your case.
Introduction
Adds `Microsoft.ReverseProxy.ServiceFabric`, which implements dynamic discovery of Service Fabric services. The general architecture takes inspiration from Traefik's Service Fabric integration, and is designed for easy and friction-free onboarding of multiple Service Fabric services operating behind YARP. Running multiple instances of YARP in the same cluster is possible, and an eventual consistency model helps avoid the need for any external storage except for the runtime Service Fabric queries used to dynamically discover services in the cluster.

A Service Fabric service that wants to leverage YARP simply needs to add an `<Extension>` element in its `ServiceManifest`, for example (see complete example further below):
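The snippet that originally illustrated this did not survive extraction. Below is a minimal sketch of what such an opt-in could look like; the extension name, label keys, and schema namespace are assumptions based on the conventions described in this PR and may not match the exact names it implements:

```xml
<!-- ServiceManifest.xml fragment (sketch; extension name and label keys are illustrative) -->
<StatelessServiceType ServiceTypeName="MyServiceType" UseImplicitHost="true">
  <Extensions>
    <Extension Name="YARP">
      <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
        <!-- Opt this service in to dynamic discovery by the proxy -->
        <Label Key="YARP.Enable">true</Label>
      </Labels>
    </Extension>
  </Extensions>
</StatelessServiceType>
```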
Functional overview

This works by periodically querying Service Fabric for all applications in the Service Fabric cluster. For each application, we enumerate all services. For each service, we extract its labels and check whether the service has opted in to use YARP (`YARP.Enable=true`). If the service has opted in, we then enumerate all partitions, and all instances/replicas within each partition.
Service Fabric concept mappings to YARP

- A Service Fabric service (e.g. `fabric:/App1/Svc2`) maps to a YARP cluster
- The replica / instance endpoints of the service map to YARP destinations
- Routes are defined via `YARP.Routes.<routeName>.*` labels in ServiceManifest (see the sketch below)
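As a hedged illustration of the route mapping above, a named route could be described entirely through labels. The specific property key shown here (`Path`) and its value format are assumptions and may differ from what the PR's `LabelsParser` actually accepts:

```xml
<Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
  <Label Key="YARP.Enable">true</Label>
  <!-- "route1" is the route name; each YARP.Routes.route1.* label sets one property of that route -->
  <Label Key="YARP.Routes.route1.Path">/myservice/{**catchall}</Label>
</Labels>
```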
Dynamic configurability

Labels defined in the Service Manifest support the following scenarios (a sketch combining both follows the list):

- When a label value is of the form `[appParamName]`, its value will be replaced at runtime with the value of the application parameter of the same name defined for the application the service belongs to.
- When `YARP.EnableDynamicOverrides` is specified as `true`, we will also query the Service Fabric Naming Service to discover any named properties. Any properties found for the service being enumerated take precedence over values specified in ServiceManifest. This functionality is opt-in and disabled by default.
Designed for high availability

We keep a cache of the last discovery results obtained from Service Fabric. If at any point, for whatever reason, we are not able to retrieve an updated view of the cluster, we will continue to operate with last-known-good data. The cache is kept in memory, but it could be extended to support permanent storage on disk to also survive reboots.
Designed for use by large teams
This integration is designed to support hundreds of Service Fabric services hosted behind YARP, and it provides additional features to help the owners of those services self-serve their DevOps requirements relating to YARP. When we successfully discover a service, we emit a Service Fabric health report against the discovered service. If any config errors are encountered, a Warning health report is issued, which makes it easy for service owners to observe the errors in Service Fabric Explorer and/or telemetry. This closes the loop and makes it easy for a service in Service Fabric to confirm that its configuration is being honored and to detect any issues.
Complete ServiceManifest example:
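The complete example referenced here was lost in extraction; the following is a hedged reconstruction that pulls the pieces above together into one manifest. Type names, ports, versions, and label keys are placeholders rather than values confirmed by this PR:

```xml
<?xml version="1.0" encoding="utf-8"?>
<ServiceManifest Name="MyServicePkg" Version="1.0.0"
                 xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <ServiceTypes>
    <StatelessServiceType ServiceTypeName="MyServiceType" UseImplicitHost="true">
      <Extensions>
        <Extension Name="YARP">
          <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
            <!-- Opt in to discovery and define a single route; dynamic overrides left disabled -->
            <Label Key="YARP.Enable">true</Label>
            <Label Key="YARP.Routes.route1.Path">/myservice/{**catchall}</Label>
            <Label Key="YARP.EnableDynamicOverrides">false</Label>
          </Labels>
        </Extension>
      </Extensions>
    </StatelessServiceType>
  </ServiceTypes>
  <CodePackage Name="Code" Version="1.0.0">
    <EntryPoint>
      <ExeHost>
        <Program>MyService.exe</Program>
      </ExeHost>
    </EntryPoint>
  </CodePackage>
  <Resources>
    <Endpoints>
      <!-- The endpoint the replicas/instances listen on; discovered addresses become YARP destinations -->
      <Endpoint Name="ServiceEndpoint" Protocol="http" Port="8080" />
    </Endpoints>
  </Resources>
</ServiceManifest>
```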