Conversation

@davidni (Contributor) commented Jun 19, 2020

Introduction

Adds Microsoft.ReverseProxy.ServiceFabric, which implements dynamic discovery of Service Fabric services. The general architecture takes inspiration from Traefik's Service Fabric integration and is designed for easy, friction-free onboarding of multiple Service Fabric services operating behind YARP. Running multiple instances of YARP in the same cluster is possible, and an eventual-consistency model avoids the need for any external storage beyond the runtime Service Fabric queries used to dynamically discover services in the cluster.

A Service Fabric service that wants to leverage YARP simply needs to add an <Extension> element to its ServiceManifest, for example (see the complete example further below):

<Extensions>
  <Extension Name="YARP-preview">
    <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
      <Label Key="YARP.Enable">true</Label>
      <Label Key="YARP.Routes.route1.Hosts">example.com</Label>
    </Labels>
  </Extension>
</Extensions>

Functional overview

This works by periodically querying Service Fabric for all applications in the cluster. For each application, we enumerate all services. For each service, we extract its labels and check whether it has opted in to use YARP (YARP.Enable=true). If it has, we then enumerate all partitions, and all instances/replicas within each partition.
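For illustration, this enumeration maps onto the System.Fabric query APIs roughly as follows. This is a minimal sketch, not the code in this PR: label extraction, error handling, and caching are omitted, and IsYarpEnabled is a hypothetical helper.

using System;
using System.Fabric;
using System.Fabric.Query;
using System.Threading.Tasks;

// Minimal sketch of one discovery pass; the actual implementation wraps
// FabricClient behind testable abstractions.
public static class DiscoverySketch
{
    public static async Task DiscoverAsync(FabricClient fabricClient)
    {
        // 1. Enumerate all applications in the cluster.
        ApplicationList apps = await fabricClient.QueryManager.GetApplicationListAsync();
        foreach (Application app in apps)
        {
            // 2. Enumerate all services of each application.
            ServiceList services = await fabricClient.QueryManager.GetServiceListAsync(app.ApplicationName);
            foreach (Service service in services)
            {
                // 3. Extract the service's labels and skip services that have
                //    not opted in (YARP.Enable=true). IsYarpEnabled is a
                //    hypothetical helper; label extraction is omitted here.
                // if (!IsYarpEnabled(service)) continue;

                // 4. Enumerate partitions, then instances/replicas; each
                //    replica endpoint becomes a YARP destination.
                ServicePartitionList partitions = await fabricClient.QueryManager.GetPartitionListAsync(service.ServiceName);
                foreach (Partition partition in partitions)
                {
                    ServiceReplicaList replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id);
                    foreach (Replica replica in replicas)
                    {
                        Console.WriteLine($"{service.ServiceName} -> {replica.ReplicaAddress}");
                    }
                }
            }
        }
    }
}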

Service Fabric concept mappings to YARP

Service Fabric terminology                      YARP terminology
Cluster                                         --
Application                                     --
Service (e.g. fabric:/App1/Svc2)                Cluster (ClusterId = ServiceName)
Instance / replica                              Destination (Address = the instance's or replica's endpoint)
YARP.Routes.<routeName>.* in ServiceManifest    ProxyRoute (Id = ServiceName + routeName; Match Hosts and Path extracted from the labels)

Dynamic configurability

Labels defined in the Service Manifest support the following scenarios:

  • Arbitrary key/value pairs. Only keys we understand are honored; unknown keys are ignored
  • Application parameter references. If a label value is of the form [appParamName], it is replaced at runtime with the value of the application parameter of that name defined for the application the service belongs to (see the snippet after this list)
  • Naming Service overrides. If the label YARP.EnableDynamicOverrides is set to true, we also query the Service Fabric Naming Service to discover named properties. Any properties found for the service being enumerated take precedence over values specified in the ServiceManifest. This functionality is opt-in and disabled by default
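For example, a route's host can be driven by an application parameter. This is an illustrative snippet: HostName is a hypothetical parameter assumed to be defined in the ApplicationManifest, not a name from this PR.

<Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
  <Label Key="YARP.Enable">true</Label>
  <!-- [HostName] is replaced at runtime with the value of the application
       parameter named "HostName" (hypothetical) for this application. -->
  <Label Key="YARP.Routes.route1.Hosts">[HostName]</Label>
</Labels>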

Designed for high availability

We keep a cache of the last discovery results obtained from Service Fabric. If at any point, for whatever reason, we are not able to retrieve an updated view of the cluster, we will continue to operate with last-known-good data. The cache is kept in memory, but could be extended to support permanent storage on disk to also survive reboots.
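A minimal sketch of that last-known-good pattern (illustrative only; the type and member names are invented here, and the PR's actual cache covers the full discovery result):

using System;
using System.Threading.Tasks;

// Illustrative last-known-good cache: when a refresh fails, keep serving the
// previous snapshot instead of dropping all routes. Names are hypothetical.
public class LastKnownGoodCache<T> where T : class
{
    private T _snapshot;

    public async Task<T> RefreshOrKeepLastAsync(Func<Task<T>> discover)
    {
        try
        {
            // Happy path: replace the snapshot with a fresh view of the cluster.
            _snapshot = await discover();
        }
        catch (Exception)
        {
            // Discovery failed (e.g. Service Fabric unreachable): fall through
            // and serve the last-known-good snapshot. Real code would log this.
        }

        return _snapshot; // May be null if discovery never succeeded.
    }
}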

Designed for use by large teams

This integration is designed to support hundreds of Service Fabric services hosted behind YARP, and it provides additional features to help the owners of those services self-serve their DevOps needs relating to YARP. When we successfully discover a service, we emit a Service Fabric health report against the discovered service. If any config errors are encountered, a Warning health report is issued instead, which makes it easy for service owners to observe the errors in Service Fabric Explorer and/or telemetry. This closes the loop and makes it easy for a service in Service Fabric to confirm that its configuration is being honored and to detect any issues.
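For illustration, emitting such a health report through System.Fabric could look roughly like this. This is a sketch, not the PR's code: the source id, property name, and time-to-live are assumptions.

using System;
using System.Fabric;
using System.Fabric.Health;

// Sketch: report discovery status against the discovered service so its
// owners can see config errors in Service Fabric Explorer.
public static class HealthReporting
{
    public static void ReportDiscoveryResult(FabricClient fabricClient, Uri serviceName, string error)
    {
        var info = new HealthInformation(
            sourceId: "YARP-preview",   // assumed source id
            property: "DynamicConfig",  // assumed property name
            healthState: error == null ? HealthState.Ok : HealthState.Warning)
        {
            Description = error ?? "Configuration applied successfully.",
            TimeToLive = TimeSpan.FromMinutes(5), // assumed; expires if discovery stops
            RemoveWhenExpired = true,
        };

        fabricClient.HealthManager.ReportHealth(new ServiceHealthReport(serviceName, info));
    }
}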

Complete ServiceManifest example:

<ServiceManifest Name="Service1Pkg" Version="1.0.0" xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <ServiceTypes>
    <StatelessServiceType ServiceTypeName="Service1Type">
      <Extensions>
        <Extension Name="YARP-preview">
          <Labels xmlns="http://schemas.microsoft.com/2015/03/fabact-no-schema">
            <Label Key="YARP.Enable">true</Label>
            <Label Key="YARP.Routes.route1.Hosts">example.com</Label>
            <!-- Optional: enable active health probes -->
            <Label Key="YARP.Backend.Healthcheck.Active.Enabled">true</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Path">api/health</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Timeout">30</Label>
            <Label Key="YARP.Backend.Healthcheck.Active.Interval">10</Label>
          </Labels>
        </Extension>
      </Extensions>
    </StatelessServiceType>
  </ServiceTypes>

  <!-- ... -->
</ServiceManifest>

@alnikola (Contributor) left a comment

Some comments.

Could you please also comment on the concurrency requirements for the services being introduced here? Are any of them supposed to be called concurrently?

@samsp-msft (Contributor) commented Jun 22, 2020

Do we need to use the IslandGateway name in the parameter keys? E.g.

<Label Key='IslandGateway.Backend.Healthcheck.Enabled'>true</Label>

@davidni (Contributor, Author) commented Jun 23, 2020

Do we need to use the IslandGateway name in the parameter keys

@samsp-msft this is covered by "Organize and rename labels" in the PR description under "Remaining work (outside of scope of this PR)". The name IslandGateway doesn't make sense outside my team; I just don't have the bandwidth to make and verify the changes immediately.

@davidfowl (Member)

I'm not a fan of how this plugs into the proxy. It seems like IServiceDiscovery is the interface being plugged in here. Is the intent to replace the configuration-based model completely? This comment seems like it's not resolved #257 (comment).

We keep a cache of the last discovery results obtained from Service Fabric. If at any point, for whatever reason, we are not able to retrieve an updated view of the cluster, we will continue to operate with last-known-good data. The cache is kept in memory, but could be extended to support permanent storage on disk to also survive reboots.

What happens when multiple instances have an inconsistent view of the configuration because of network issues between SF and the node?

@davidni (Contributor, Author) commented Jun 25, 2020

I'm not a fan of how this plugs into the proxy.

@davidfowl ditto. It needs additional work and design, which is why I am leaving the last mile open for you and the team to decide what fits best.


This comment seems like it's not resolved #257 (comment).

@davidfowl I am suggesting this should be tackled as a separate PR, which would also address how dynamic discovery impacts the configuration model (see the next point below). I would rather defer to you and the team to decide how best to take it, hence I am not attempting to address it in this PR.


Is the intent to replace the configuration-based model completely?

@davidfowl Not necessarily, but possibly. Examples of what others are doing:

  • Envoy lets you specify a config source at the top level. We ended up with something similar in our implementation -- a top-level config field ServiceDiscoveryName, which can be servicefabric or static for us. The latter means we read Routes, Clusters, and Destinations from ASP.NET Core configs in a section named "static". But IMHO this should NOT be the north star. It would be nice to allow N >= 1 config sources to work together, so that YARP could have static configs augmented by any number of dynamic config providers such as Service Fabric (see the sketch after this list). That aligns with Traefik (see the next bullet)
  • Traefik lets you specify providers, all of which provide dynamic configurations and which can use configs from one another. See also this PR.
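To make that concrete, composing N >= 1 config sources could look roughly like this at startup. This is purely hypothetical wiring: LoadFromServiceFabric is the extension method this PR introduces, LoadFromConfig stands in for the static config loader, and chaining them as shown is not something YARP supports today.

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical sketch: a static config source augmented by a dynamic
// provider. Only the general shape is intended; no such combining API
// exists in YARP yet.
public class StartupSketch
{
    public void ConfigureServices(IServiceCollection services, IConfiguration configuration)
    {
        services.AddReverseProxy()
            .LoadFromConfig(configuration.GetSection("ReverseProxy")) // static source
            .LoadFromServiceFabric();                                 // dynamic source (this PR)
    }
}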

What happens when multiple instances have an inconsistent view of the configuration because of network issues between SF and the node?

@davidfowl At the risk of preaching to the choir, I'll say there's no free lunch. Eventual consistency models offer simplicity, but that comes at a cost. We believe this is the right choice for a component like YARP in distributed hosting scenarios, and there is precedent for this choice. Suggested reading on Envoy's design choice, by Envoy's lead author: Embracing eventual consistency in SoA networking. Orchestration layers can be built on top of our eventually consistent model to provide strong guarantees when needed, separately from YARP's core.

@alnikola (Contributor)

Eventual consistency models offer simplicity, but that comes at a cost. We believe this is the right choice for a component like YARP in distributed hosting scenarios, and there is precedent for this choice. Suggested reading: Embracing eventual consistency in SoA networking

In the case of YARP, EC simplifies the code and consequently our work as proxy developers, but it will likely make the user's experience worse because, as the above article points out, eventual consistency is counterintuitive.

Moreover, I would challenge the assumption that in the real world changes are always rolled out gradually, with enough time between changes for nodes to reach consensus before the next change is applied, shielding users from observing inconsistent state. In my experience, service changes may also be applied immediately to a large share of nodes (up to 100%) in special, but not so uncommon, cases; examples include rolling out a security hotfix or deploying a change to a relatively small geo region. Furthermore, even with gradual deployment, the percentage can be raised in incrementally increasing steps (e.g. 1%, 2%, 4%, 8%, 16%, 32%, etc.) to speed up the roll-out, so at the later stages a substantial share of the system's nodes is affected by each step, which can expose users to the inconsistencies inherent in the EC approach.

All in all, I realize that the network is inherently unreliable and the CAP theorem imposes a hard limit on what we can achieve, but I strongly believe we should always aim to provide strong consistency first and relax the model only if it proves unreasonably expensive to implement or maintain.

That said, implementing a mechanism to synchronize SF configuration among YARP nodes seems to be outside this PR's scope.

@davidni marked this pull request as draft on Jun 25, 2020.
@Tratcher (Member) left a comment

Follow-up

  • Custom exception types
  • IClock
  • LoadFromServiceFabric should take options.
  • What else?

This looks OK to merge, but I'll let Alexander handle the final sign-off and merge.

@alnikola (Contributor) left a comment

It's almost ready to be merged; there are only a few items left:

  • Confirm making IClock public
  • Close on the "Backend" vs "Cluster" question in label names
  • Submit at least one issue capturing all follow-up work items:
    • Review error handling logic and harden it where necessary
    • Label name casing
    • Custom exception types
    • LoadFromServiceFabric should take options
    • Get rid of unnecessary closure (e.g. in PropertyManagementClientWrapper.GetPropertyAsync)
    • Parse all supported *Options (e.g. LoadBalancingOptions, SessionAffinityOptions, etc.) in LabelsParser
    • Convert static classes to interfaces + implementations where it's appropriate (e.g. LabelsParser)
    • Apply Logging Extension pattern instead of direct ILogger calls.
    • Finalize the extension's name discussion

@alnikola (Contributor) commented Nov 18, 2020

@davidni I submitted all the necessary follow-up items listed below. Please confirm the IClock API and address the two remaining comments; then the PR will be merged.

Follow-up items:

cc: @Tratcher

@alnikola (Contributor) left a comment

Found at least one blocking bug.

@alnikola (Contributor)

Hmm, it seems strange. The last comment on this PR has just disappeared.

@Kahbazi (Member) commented Nov 20, 2020

Hmm, it seems strange. The last comment on this PR has just disappeared.

@alnikola Sorry, it was me 😁. I added a comment about disposing FabricClient, then found out that it had already been discussed, so I deleted my comment.

@alnikola (Contributor)

Hello,
in my opinion your currently implemented Service Fabric concept mappings to YARP doesn't work very well because a Service Fabric service can provide more than one endpoint. So the YARP cluster should be mapped from a combination of service and endpoint name.

Simply producing a new cluster for each service-endpoint pair will not be compatible with the current load balancing implementation, because YARP will handle each of such clusters independently.

@mcpride commented Nov 24, 2020

Hello,
in my opinion your currently implemented Service Fabric concept mappings to YARP doesn't work very well because a Service Fabric service can provide more than one endpoint. So the YARP cluster should be mapped from a combination of service and endpoint name.

Simply producing a new cluster for each service-endpoint pair will not be compatible with the current load balancing implementation, because YARP will handle each of such clusters independently.

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

@alnikola (Contributor)

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

The main concern is load balancing (in case it's enabled, of course). Currently, load balancing considers only destinations belonging to one cluster; that is, it's based on the assumption that clusters are independent. However, in the above approach multiple YARP clusters will point at the same SF service. Thus, what looks like balanced load from YARP's perspective might not actually be balanced from the SF service's perspective.

@alnikola (Contributor)

Note: load balancing config is not yet supported here, but it will be soon.

@alnikola (Contributor) left a comment

LGTM

@alnikola merged commit 319f180 into dotnet:master on Nov 24, 2020
@alnikola mentioned this pull request on Nov 24, 2020
@mcpride commented Nov 25, 2020

Why do you think that clustering of endpoints isn't compatible? A health check might be a bit redundant at the endpoint level instead of checking the health of the service itself, but I can't see a problem with this.

The main concern is load balancing (in case it's enabled, of course). Currently, load balancing considers only destinations belonging to one cluster; that is, it's based on the assumption that clusters are independent. However, in the above approach multiple YARP clusters will point at the same SF service. Thus, what looks like balanced load from YARP's perspective might not actually be balanced from the SF service's perspective.

First of all, thanks for your answer! Sorry, but I don't see why the endpoints of a service shouldn't be handled independently (including health checks). If a service defines, for example, a control-api endpoint and a data-api endpoint, and the Service Fabric cluster hosts parallel instances of the service on different nodes, why can't all control-api endpoints form one "proxy cluster" and all data-api endpoints another? The APIs operate independently in the context of proxy routing, and this may also include optional health check routes per endpoint. If a service instance becomes busy, the health checks of both endpoints should report this status independently. I think health checks are not directly tied to services. Also, metrics like CPU load are not tied to the service; they are a property of the node (the machine) on which the service instance lives.

So my question is: What are the specific blockers for health checks in this context?

My SF mapping here is that SF services are a kind of hosting infrastructure (application nodes) for optional state persistence and independent listeners/endpoints, which are the finer-grained "microservices" (yes, I know this is a bit pragmatic and doesn't follow the microservice pattern) ;-)

@mcpride commented Nov 25, 2020

Another question: shouldn't the work of collecting health metrics for nodes, services, or endpoints be delegated to another service which is much more specialized?
The FabricObserver project exists and might be a good candidate.

@alnikola (Contributor)

My concern was only about load balancing in the given service-endpoint-pair-to-cluster mapping. Health checks should work fine.

@alnikola (Contributor)

Shouldn't the work of collecting health metrics for nodes, services, or endpoints be delegated to another service which is much more specialized?

YARP's health check mechanism is aimed at supporting the main YARP scenarios. However, the health check subsystem is quite extensible, so you can integrate it with an SF-specific monitoring tool if that's necessary in your case.
