Conversation

@davidni (Contributor) commented Jun 19, 2020

Introduction

Adds Microsoft.ReverseProxy.ServiceFabric, which implements dynamic discovery of Service Fabric services. The general architecture takes inspiration from Traefik's Service Fabric integration and is designed for easy, friction-free onboarding of multiple Service Fabric services operating behind YARP. Running multiple instances of YARP in the same cluster is possible, and an eventual-consistency model avoids the need for any external storage beyond the runtime Service Fabric queries used to dynamically discover services in the cluster.

A Service Fabric service that wants to leverage YARP simply needs to add an <Extension> element to its ServiceManifest, for example (see the complete example further below):

<Extensions>
  <Extension Name="YARP-preview">
    <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
      <Label Key="YARP.Enable">true</Label>
      <Label Key="YARP.Routes.route1.Hosts">example.com</Label>
    </Labels>
  </Extension>
</Extensions>

Functional overview

This works by periodically querying Service Fabric for all applications in the cluster. For each application, we enumerate all services. For each service, we extract its labels and check whether it has opted in to use YARP (YARP.Enable=true). If it has, we then enumerate all partitions, and all instances/replicas within each partition.
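For illustration, this enumeration maps onto the System.Fabric query APIs roughly as follows. This is a minimal sketch, not the code in this PR: label extraction, error handling, and caching are omitted, and IsYarpEnabled is a hypothetical helper.

using System;
using System.Fabric;
using System.Fabric.Query;
using System.Threading.Tasks;

// Minimal sketch of one discovery pass; the actual implementation wraps
// FabricClient behind testable abstractions.
public static class DiscoverySketch
{
    public static async Task DiscoverAsync(FabricClient fabricClient)
    {
        // 1. Enumerate all applications in the cluster.
        ApplicationList apps = await fabricClient.QueryManager.GetApplicationListAsync();
        foreach (Application app in apps)
        {
            // 2. Enumerate all services of each application.
            ServiceList services = await fabricClient.QueryManager.GetServiceListAsync(app.ApplicationName);
            foreach (Service service in services)
            {
                // 3. Extract the service's labels and skip services that have
                //    not opted in (YARP.Enable=true). IsYarpEnabled is a
                //    hypothetical helper; label extraction is omitted here.
                // if (!IsYarpEnabled(service)) continue;

                // 4. Enumerate partitions, then instances/replicas; each
                //    replica endpoint becomes a YARP destination.
                ServicePartitionList partitions = await fabricClient.QueryManager.GetPartitionListAsync(service.ServiceName);
                foreach (Partition partition in partitions)
                {
                    ServiceReplicaList replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id);
                    foreach (Replica replica in replicas)
                    {
                        Console.WriteLine($"{service.ServiceName} -> {replica.ReplicaAddress}");
                    }
                }
            }
        }
    }
}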

Service Fabric concept mappings to YARP

Service Fabric terminology                      YARP terminology
Cluster                                         --
Application                                     --
Service (e.g. fabric:/App1/Svc2)                Cluster (ClusterId = ServiceName)
Instance / replica                              Destination (Address = the instance's or replica's endpoint)
YARP.Routes.<routeName>.* in ServiceManifest    ProxyRoute (Id = ServiceName + routeName; Match Hosts and Path extracted from the labels)

Dynamic configurability

Labels defined in the Service Manifest support the following scenarios:

  • Arbitrary key/value pairs. Only keys we understand are honored; unknown keys are ignored
  • Application parameter references. If a label value is of the form [appParamName], it is replaced at runtime with the value of the application parameter of that name defined for the application the service belongs to (see the snippet after this list)
  • Naming Service overrides. If the label YARP.EnableDynamicOverrides is set to true, we also query the Service Fabric Naming Service to discover named properties. Any properties found for the service being enumerated take precedence over values specified in the ServiceManifest. This functionality is opt-in and disabled by default
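For example, a route's host can be driven by an application parameter. This is an illustrative snippet: HostName is a hypothetical parameter assumed to be defined in the ApplicationManifest, not a name from this PR.

<Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
  <Label Key="YARP.Enable">true</Label>
  <!-- [HostName] is replaced at runtime with the value of the application
       parameter named "HostName" (hypothetical) for this application. -->
  <Label Key="YARP.Routes.route1.Hosts">[HostName]</Label>
</Labels>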

Designed for high availability

We keep a cache of the last discovery results obtained from Service Fabric. If at any point, for whatever reason, we are not able to retrieve an updated view of the cluster, we will continue to operate with last-known-good data. The cache is kept in memory, but could be extended to support permanent storage on disk to also survive reboots.
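A minimal sketch of that last-known-good pattern (illustrative only; the type and member names are invented here, and the PR's actual cache covers the full discovery result):

using System;
using System.Threading.Tasks;

// Illustrative last-known-good cache: when a refresh fails, keep serving the
// previous snapshot instead of dropping all routes. Names are hypothetical.
public class LastKnownGoodCache<T> where T : class
{
    private T _snapshot;

    public async Task<T> RefreshOrKeepLastAsync(Func<Task<T>> discover)
    {
        try
        {
            // Happy path: replace the snapshot with a fresh view of the cluster.
            _snapshot = await discover();
        }
        catch (Exception)
        {
            // Discovery failed (e.g. Service Fabric unreachable): fall through
            // and serve the last-known-good snapshot. Real code would log this.
        }

        return _snapshot; // May be null if discovery never succeeded.
    }
}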

Designed for use by large teams

This integration is designed to support hundreds of Service Fabric services hosted behind YARP, and it provides additional features to help the owners of those services self-serve their DevOps needs relating to YARP. When we successfully discover a service, we emit a Service Fabric health report against the discovered service. If any config errors are encountered, a Warning health report is issued instead, which makes it easy for service owners to observe the errors in Service Fabric Explorer and/or telemetry. This closes the loop and makes it easy for a service in Service Fabric to confirm that its configuration is being honored and to detect any issues.
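For illustration, emitting such a health report through System.Fabric could look roughly like this. This is a sketch, not the PR's code: the source id, property name, and time-to-live are assumptions.

using System;
using System.Fabric;
using System.Fabric.Health;

// Sketch: report discovery status against the discovered service so its
// owners can see config errors in Service Fabric Explorer.
public static class HealthReporting
{
    public static void ReportDiscoveryResult(FabricClient fabricClient, Uri serviceName, string error)
    {
        var info = new HealthInformation(
            sourceId: "YARP-preview",   // assumed source id
            property: "DynamicConfig",  // assumed property name
            healthState: error == null ? HealthState.Ok : HealthState.Warning)
        {
            Description = error ?? "Configuration applied successfully.",
            TimeToLive = TimeSpan.FromMinutes(5), // assumed; expires if discovery stops
            RemoveWhenExpired = true,
        };

        fabricClient.HealthManager.ReportHealth(new ServiceHealthReport(serviceName, info));
    }
}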

Complete ServiceManifest example:

<ServiceManifest Name="Service1Pkg" Version="1.0.0" xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <ServiceTypes>
    <StatelessServiceType ServiceTypeName="Service1Type">
      <Extensions>
        <Extension Name="YARP-preview">
          <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
            <Label Key="YARP.Enable">true</Label>
            <Label Key="YARP.Routes.route1.Hosts">example.com</Label>
            <!-- Optional: enable active health probes -->
            <Label Key="YARP.Backend.Healthcheck.Active.Enabled">true</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Path">api/health</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Timeout">30</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Interval">10</Label>
          </Labels>
        </Extension>
      </Extensions>
    </StatelessServiceType>
  </ServiceTypes>

  <!-- ... -->
</ServiceManifest>

@alnikola (Contributor) left a comment

Some comments.

Could you please also comment on the concurrency requirements for the services being introduced here? Are any of them supposed to be called concurrently?

@samsp-msft (Contributor) commented Jun 22, 2020

Do we need to use the IslandGateway name in the parameter keys? E.g.

<Label Key='IslandGateway.Backend.Healthcheck.Enabled'>true</Label>

@davidni (Contributor, Author) commented Jun 23, 2020

Do we need to use the IslandGateway name in the parameter keys

@samsp-msft this is covered by "Organize and rename labels" in the PR description under "Remaining work (outside of scope of this PR)". The name IslandGateway doesn't make sense outside my team; I just don't have the bandwidth to make and verify the changes immediately.

@davidfowl (Member)

I'm not a fan of how this plugs into the proxy. It seems like IServiceDiscovery is the interface being plugged in here. Is the intent to replace the configuration-based model completely? This comment seems like it's not resolved #257 (comment).

We keep a cache of the last discovery results obtained from Service Fabric. If at any point, for whatever reason, we are not able to retrieve an updated view of the cluster, we will continue to operate with last-known-good data. The cache is kept in memory, but could be extended to support permanent storage on disk to also survive reboots.

What happens when multiple instances have an inconsistent view of the configuration because of network issues between SF and the node?

@davidni (Contributor, Author) commented Jun 25, 2020

I'm not a fan of how this plugs into the proxy.

@davidfowl ditto. It needs additional work and design, which is why I am leaving the last mile open for you and the team to decide what fits best.


This comment seems like it's not resolved #257 (comment).

@davidfowl I am suggesting this should be tackled as a separate PR, which would also address how dynamic discovery impacts the configuration model (see the next point below). I would rather defer to you and the team to decide how best to take it, hence I am not attempting to address it in this PR.


Is the intent to replace the configuration-based model completely?

@davidfowl Not necessarily, but possibly. Examples of what others are doing:

  • Envoy lets you specify a config source at the top level. We ended up with something similar in our implementation -- a top-level config field ServiceDiscoveryName, which can be servicefabric or static for us. The latter means we read Routes, Clusters, and Destinations from ASP.NET Core configs in a section named "static". But IMHO this should NOT be the north star. It would be nice to allow N >= 1 config sources to work together, so that YARP could have static configs augmented by any number of dynamic config providers such as Service Fabric (see the sketch after this list). That aligns with Traefik (see the next bullet)
  • Traefik lets you specify providers, all of which provide dynamic configurations and which can use configs from one another. See also this PR.
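To make that concrete, composing N >= 1 config sources could look roughly like this at startup. This is purely hypothetical wiring: LoadFromServiceFabric is the extension method this PR introduces, LoadFromConfig stands in for the static config loader, and chaining them as shown is not something YARP supports today.

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical sketch: a static config source augmented by a dynamic
// provider. Only the general shape is intended; no such combining API
// exists in YARP yet.
public class StartupSketch
{
    public void ConfigureServices(IServiceCollection services, IConfiguration configuration)
    {
        services.AddReverseProxy()
            .LoadFromConfig(configuration.GetSection("ReverseProxy")) // static source
            .LoadFromServiceFabric();                                 // dynamic source (this PR)
    }
}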

What happens when multiple instances have an inconsistent view of the configuration because of network issues between SF and the node?

@davidfowl At the risk of preaching to the choir, I'll say there's no free lunch. Eventual consistency models offer simplicity, but that comes at a cost. We believe this is the right choice for a component like YARP in distributed hosting scenarios, and there is precedent for this choice. Suggested reading on Envoy's design choice, by Envoy's lead author: Embracing eventual consistency in SoA networking. Orchestration layers can be built on top of our eventually consistent model to provide strong guarantees when needed, separately from YARP's core.

@alnikola (Contributor)

Eventual consistency models offer simplicity, but that comes at a cost. We believe this is the right choice for a component like YARP in distributed hosting scenarios, and there is precedent for this choice. Suggested reading: Embracing eventual consistency in SoA networking

In the case of YARP, EC simplifies the code and consequently our work as proxy developers, but it will likely make the user's experience worse because, as the above article points out, eventual consistency is counterintuitive.

Moreover, I would challenge the assumption that in the real world changes are always rolled out gradually, with enough time between changes for nodes to reach consensus before the next change is applied, shielding users from observing inconsistent state. In my experience, service changes may also be applied immediately to a large share of nodes (up to 100%) in special, but not so uncommon, cases; examples include rolling out a security hotfix or deploying a change to a relatively small geo region. Furthermore, even with gradual deployment, the percentage can be raised in incrementally increasing steps (e.g. 1%, 2%, 4%, 8%, 16%, 32%, etc.) to speed up the roll-out, so at the later stages a substantial share of the system's nodes is affected by each step, which can expose users to the inconsistencies inherent in the EC approach.

All in all, I realize that the network is inherently unreliable and the CAP theorem imposes a hard limit on what we can achieve, but I strongly believe we should always aim to provide strong consistency first and relax the model only if it proves unreasonably expensive to implement or maintain.

That said, implementing a mechanism to synchronize SF configuration among YARP nodes seems to be outside this PR's scope.

@davidni marked this pull request as draft on Jun 25, 2020.
@Tratcher (Member) left a comment

Follow-up

  • Custom exception types
  • IClock
  • LoadFromServiceFabric should take options.
  • What else?

This looks OK to merge, but I'll let Alexander handle the final sign-off and merge.

@alnikola (Contributor) left a comment

It's almost ready to be merged; there are only a few items left:

  • Confirm making IClock public
  • Close on the "Backend" vs "Cluster" question in label names
  • Submit at least one issue capturing all follow-up work items:
    • Review error handling logic and harden it where necessary
    • Label name casing
    • Custom exception types
    • LoadFromServiceFabric should take options
    • Get rid of unnecessary closure (e.g. in PropertyManagementClientWrapper.GetPropertyAsync)
    • Parse all supported *Options (e.g. LoadBalancingOptions, SessionAffinityOptions, etc.) in LabelsParser
    • Convert static classes to interfaces + implementations where it's appropriate (e.g. LabelsParser)
    • Apply Logging Extension pattern instead of direct ILogger calls.
    • Finalize the extension's name discussion

@alnikola (Contributor) commented Nov 18, 2020

@davidni I submitted all the necessary follow-up items listed below. Please confirm the IClock API and address the two remaining comments; then the PR will be merged.

Follow-up items:

cc: @Tratcher

@alnikola (Contributor) left a comment

Found at least one blocking bug.

@alnikola (Contributor)

Hmm, it seems strange. The last comment on this PR has just disappeared.

@Kahbazi (Member) commented Nov 20, 2020

Hmm, it seems strange. The last comment on this PR has just disappeared.

@alnikola Sorry, it was me 😁. I added a comment about disposing FabricClient, then found out that it had already been discussed, so I deleted my comment.

@alnikola (Contributor)

Hello,
in my opinion your currently implemented Service Fabric concept mappings to YARP doesn't work very well because a Service Fabric service can provide more than one endpoint. So the YARP cluster should be mapped from a combination of service and endpoint name.

Simply producing a new cluster for each service-endpoint pair will not be compatible with the current load balancing implementation, because YARP will handle each of such clusters independently.

@mcpride commented Nov 24, 2020

Hello,
in my opinion your currently implemented Service Fabric concept mappings to YARP doesn't work very well because a Service Fabric service can provide more than one endpoint. So the YARP cluster should be mapped from a combination of service and endpoint name.

Simply producing a new cluster for each service-endpoint pair will not be compatible with the current load balancing implementation, because YARP will handle each of such clusters independently.

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

@alnikola (Contributor)

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

The main concern is load balancing (in case it's enabled, of course). Currently, load balancing considers only destinations belonging to one cluster; that is, it's based on the assumption that clusters are independent. However, in the above approach multiple YARP clusters will point at the same SF service. Thus, what looks like balanced load from YARP's perspective might not actually be balanced from the SF service's perspective.

@alnikola (Contributor)

Note: load balancing config is not yet supported here, but it will be soon.

@alnikola (Contributor) left a comment

LGTM

@alnikola merged commit 319f180 into dotnet:master on Nov 24, 2020
@alnikola mentioned this pull request on Nov 24, 2020
@mcpride commented Nov 25, 2020

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

The main concern is load balancing (in case it's enabled, of course). Currently, load balancing considers only destinations belonging to one cluster; that is, it's based on the assumption that clusters are independent. However, in the above approach multiple YARP clusters will point at the same SF service. Thus, what looks like balanced load from YARP's perspective might not actually be balanced from the SF service's perspective.

First of all, thanks for your answer! Sorry, but I don't see why the endpoints of a service shouldn't be handled independently (including health checks). If a service defines, for example, a control-api endpoint and a data-api endpoint, and the Service Fabric cluster hosts parallel instances of the service on different nodes, why can't all control-api endpoints form one "proxy cluster" and all data-api endpoints another? The APIs operate independently in the context of proxy routing, and this may also include optional health check routes per endpoint. If a service instance becomes busy, the health checks of both endpoints should report this status independently. I think health checks are not directly tied to services. Also, metrics like CPU load are not tied to the service; they are a property of the node (the machine) on which the service instance lives.

So my question is: What are the specific blockers for health checks in this context?

My SF mapping here is that SF services are a kind of hosting infrastructure (application nodes) for optional state persistence and independent listeners/endpoints, which are the finer-grained "microservices" (yes, I know this is a bit pragmatic and doesn't follow the microservice pattern) ;-)

@mcpride commented Nov 25, 2020

Another question: shouldn't the work of collecting health metrics for nodes, services, or endpoints be delegated to another service which is much more specialized?
The FabricObserver project exists and might be a good candidate.

@alnikola (Contributor)

My concern was only about load balancing in the given service-endpoint-pair-to-cluster mapping. Health checks should work fine.

@alnikola (Contributor)

Shouldn't the work of collecting health metrics for nodes, services, or endpoints be delegated to another service which is much more specialized?

YARP's health check mechanism is aimed at supporting the main YARP scenarios. However, the health check subsystem is quite extensible, so you can integrate it with an SF-specific monitoring tool if that's necessary in your case.
