Skip to content
This repository was archived by the owner on Apr 23, 2021. It is now read-only.

Test flexibility of TracingInstrument by creating an XRay tracer #72

Open
slashmo opened this issue Jul 16, 2020 · 18 comments
Open

Test flexibility of TracingInstrument by creating an XRay tracer #72

slashmo opened this issue Jul 16, 2020 · 18 comments
Assignees
Labels
t:tracing use-case Exploring different use cases of the API

Comments

@slashmo
Copy link
Owner

slashmo commented Jul 16, 2020

Because XRay uses different wordings for things like Spans it's a good candidate for testing how well our TracingInstrument performs in the real world. We plan on forking pokryfka/aws-xray-sdk-swift to poke around a bit by having it depend on gsoc-swift-tracing.

@slashmo slashmo added t:tracing use-case Exploring different use cases of the API labels Jul 16, 2020
@slashmo slashmo self-assigned this Jul 16, 2020
@pokryfka
Copy link
Contributor

pokryfka commented Jul 16, 2020

Cool, let me know if the interface I created is convenient to use and if if you have any comments/issues in general.

While you are more than welcome to fork it and poke around, IMO rather than making pokryfka/aws-xray-sdk-swift depend on gsoc-swift-tracing it would be more proper to make another project with dependency on both which would implement XRayTracingInstrument. (or maybe target within the project, like with lambda?)

I should probably be as simple as creating a XRayRecorder instance within and "feeding" it with whatever comes in with the more generic context.


other than different wording there are is also distinction between "annotation" (indexed) and "metadata" - not sure how it compares with OT.

@pokryfka
Copy link
Contributor

BTW the main missing/incomplete functionality as far as the pokryfka/aws-xray-sdk-swift is concerned is tracing downstream HTTP web services and context propagation which, it seems to me, is one of the main features of gsoc-swift-tracing

XRay does support it fairly well but the tracing needs to be "hooked" in somehow.
As far as swift-server is concerned AsyncHTTPClient seems to be popular choice and would be a convenient place to do it.
However AsyncHTTPClient does not have any interface to add "middleware" or tracing, also its not open and cannot be extended.

I made a quick hack and implemented XRayMiddleware for aws-sdk-swift but its broken and I am lurking to see how you address it.

@slashmo
Copy link
Owner Author

slashmo commented Jul 16, 2020

Thanks for chiming in, @pokryfka 👍

I guess we aim for TracingInstrument to be the standard way to vend tracing APIs.
Then the question would be if it should be functionality added on top of the existing public API of tracers, or if existing tracers should limit their public API to be used through TracingInstrument only. I think the benefit here would be that adoption of different cross-cutting tools, or even different tracers, would become easier for the end-user as they'd "simply" need to switch out the Instrument to be used. If a tracing library additionally offers its own public API for set up and instrumentation I think it could become a bit confusing and likely be duplicate effort for the tracing library developers. With that I don't mean that there is no need for tracer-specific functionality that we can't vendor through the more generic TracingInstrument, but rather that TracingInstrument would become the way to hook it up to instrumented systems. In that case, in my opinion, depending directly on gsoc-swift-tracing (which is definitely a temporary location for this) makes it a bit clearer.
What do you think? Also interested in @ktoso's opinion on this.

@pokryfka
Copy link
Contributor

pokryfka commented Jul 16, 2020

I guess we aim for TracingInstrument to be the standard way to vend tracing APIs
Then the question would be if it should be functionality added on top of the existing public API of tracers, or if existing tracers should limit their public API to be used through TracingInstrument only.

as far as I understand, It could work similar to swift-metrics which defines a generic interface for Metrics, which then could be setup (bootstrap) with different clients, for example SwiftPrometheus

For "libraries" generic TracingInstrument is most likely the only reasonable choice which then can be bootstrap with different client, for example OpenTracingInstrument or XRayTracingInstrument.

On the other hand application developer may choose more concrete tracer if it adds value and he's willing to sacrifice some flexibility.

For example, if running a simple Lambda function it may be decided to use, say, DynamoDB rather than creating "database abstraction"; similarly XRay can make a good choice because its provided in the execution runtime.

Note that it does work the same way with SwiftPrometheus which provides more functionality than defined by swift-metrics.


As far as tracing downstream HTTP web services is concerned, the most natural place is (a) HTTP client.
At the moment AsyncHTTPClient, which is a popular choice, does not have dependency on gsoc-swift-tracing (nor on swift-metrics) and it does not expose an interface to add middleware or tracing.
I dont know if and how do you plan to address it.

In case of aws-sdk-swift, AWSClient.swift uses AsyncHTTPClient and adds support for swift-metrics on top of that; I guess in a similar way it would be extended to explicitly support TracingInstruments if AsyncHTTPClient does not.

@pokryfka
Copy link
Contributor

@ktoso is there a chance to "instrument" HTTP client rather than having to create InstrumentedHTTPClient containing HTTP client?

in minimal approach, I think it could be as simple as making AsyncHTTPClient.HTTPClient open (currently public).

it would make adoption of TracingInstrument easier

@ktoso
Copy link
Collaborator

ktoso commented Jul 16, 2020

Huh a lot of questions to get to here :-)

Thanks for chiming in @pokryfka! I hope we'll be able to collaborate on figuring out how these libs and APIs should inter-op, your expertise with XRay is invaluable here 👍

I understand what you mean with a compat "shim" -- that is also an option generally speaking. I'm only a bit worried about excessive copying of metadata between the instruments and your specific backend -- thus the existence of (gsoc-)swift-baggage-context which can be used as generic and typesafe container for values, that our instruments carry around, inject/extract.

The Spans should be easy to adopt in a compat library indeed, because it's a protocol so writes could write "into" XRay Span types -- that is by design and hopefully will make implementing such shim layers not too hard. I am not sure how well this approach would deal with avoiding to copy metadata from a baggage context into and back from where your library stores them...? Technically, even if you decided to not directly depend on swift-tracing, perhaps you could use the baggage context for storage (we need to make it CoW yet), in order to avoid copying things "between" APIs as much. Or, after all, decide to depend on the swift-tracing API module directly -- that's ultimately your call, but the goal being that a tracer should be compatible with those to be useful in the wider ecosystem.

other than different wording there are is also distinction between "annotation" (indexed) and "metadata" - not sure how it compares with OT.

Such differences are of great interest to us in general... could you perhaps have a look at BaggageContext as well as Span and think how those are overlapping? Note that we can carry "anything" inside a baggage context, and it would get propagated, even if it is not explicitly in the Span API. Modifying it could be a bit tricker though, but I hope that'd also be possible with extensions on the span type if necessary.

as far as I understand, It could work similar to swift-metrics which defines a generic interface for Metrics, which then could be setup (bootstrap) with different clients, for example SwiftPrometheus

For "libraries" generic TracingInstrument is most likely the only reasonable choice which then can be bootstrap with different client, for example OpenTracingInstrument or XRayTracingInstrument.

Correct, that is the plan. End-users can of course directly use XRay tracer types and get benefits from it, but libraries like HTTPClients and similar can not special choose a specific tracer to marry themselves with.

As far as swift-server is concerned AsyncHTTPClient seems to be popular choice and would be a convenient place to do it.

However AsyncHTTPClient does not have any interface to add "middleware" or tracing, also its not open and cannot be extended.

Indeed, and not only that specific client but all other ones as well (and even database clients/drivers if they want to) and other binary clients as well.

The result of this gsoc (and continued work after the gsoc -- this will take time) are APIs that libraries, such as the AsyncHTTPClient (and many other ones) can (and will, as we're in close collaboration) adopt and instrument their libraries in terms of those generic TracingInstruments. This means, that at least the minimal shared feature-set of tracers that is possible to express using these APIs (which are very much aligned with OpenTelemetry style APIs) should be automatically available when an end-user picks and configures a specific tracer (e.g. XRay), by configuring the TracingSystem with it.

So to answer the following directly:

@ktoso is there a chance to "instrument" HTTP client rather than having to create InstrumentedHTTPClient containing HTTP client?

Yes, that is absolutely what's we're going for with this API. AsyncHTTPClient will depend on it directly and we'll be able to use any tracer that adapts these APIs with it - without the need of having to instrument the HTTPClient "again" every single time there's some new tracer that'd like to instrument things.

In other words, the goal of this work is to address the "libraries need to instrument their code" pain of distributed tracing. We want to ensure they can do so only once in abstract terms, and other tracers can be plugged into the existing instrumentation points. To achieve this, we need to check this API is "good enough" to serve the common needs of popular tracers, put these APIs through the SSWG process, have it incubate and adopted by as many projects as makes sense 🚀

The other goal is of course to have all existing tracers be able to express as much functionality using those generic APIs, so the libraries instrumented with them provide useful traces. Yes, not everything may be expressible, but it's better to converge on an (open) standard for all libraries, than attempting to instrument each library separately for each slightly different tracer :-).

Hope that answers some of the in-line questions.
Especially I'd like to see how much we'll be able to rely on the generic baggage context to do most of the interop heavy lifting, that's kind of its purpose in life :-)

There's definitely going to be some tricky and unsolved problems still so I hope we can collaborate and figure out how to get the best tracing experience to all of Swift, including generic libraries and end-user apps/systems!

@pokryfka
Copy link
Contributor

pokryfka commented Jul 16, 2020

AsyncHTTPClient will depend on it directly and we'll be able to use any tracer that adapts these APIs

that's awesome!


I think there are two choices:

  • XRayRecorder being (implementing) a TracingInstrument or
  • XRayTracingInstrument with an instance of XRayRecorder acting as a "proxy"

The second one should be easier and less intrusive but as you pointed out would involve more copying data back and forward.

I am happy with both as long as any functionality not defined by TracingInstrument would remain available.

I will try to check how TracingInstrument compares to XRay segments but probably not sooner than next week.

For instance, I am not sure I understand the difference between Span and SpanEvent. Need to read more about it.
(In Xray segments (spans?) can have subsegments which can have their own subsegments.)

there are no events as such in XRay (could be added as metadata)


Question:

As far as XRay is concerned, there is not a lot of context propagation as such.
The tracing header pretty much covers that, example:

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

Note that it does contain trace id, parent id and sampling decision but that's about it.

gsoc-swift-tracing is much more robust in this regard and could pass forward and back much more,
I am not sure if there is a use for that as far as tracing is concerned (?)

Given scenario:

client A makes a call to server B which makes a few calls to servers C, D, E

  • what should be recorded by TracingInstrument on client A?
  • how to know/handle if B/C/D/E are instrumented (and send information themselves)?

in a XRay scenario, tracing header would be passed to B (which on its own could be also instrumented) then (optionally) include info about the HTTP request and response, example:

{
  "trace_id"   : "1-5759e988-bd862e3fe1be46a994272793",
  "id"         : "defdfd9912dc5a56",
  "start_time" : 1461096053.37518,
  "end_time"   : 1461096053.4042,
  "name"       : "www.example.com",
  "http"       : {
    "request"  : {
      "url"        : "https://www.example.com/health",
      "method"     : "GET",
      "user_agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7",
      "client_ip"  : "11.0.3.111"
    },
    "response" : {
      "status"         : 200,
      "content_length" : 86
    }
  },

@pokryfka
Copy link
Contributor

pokryfka commented Jul 16, 2020

/// An `Instrument` with added functionality for distributed tracing. Is uses the span-based tracing model and is
/// based on the OpenTracing/OpenTelemetry spec.
public protocol TracingInstrument: Instrument {
    /// The currently traces `Span`.
    var currentSpan: Span? { get }
  
// ...

}
  1. how should currentSpan be implemented in a multithreaded application?

  2. please correct me if I am wrong (I used Swift NIO for the first time a few weeks ago):
    when working with Swift NIO, even with one event loop, there may be "hops" and the order (and hence context if it contains parent segment) may be not defined

@ktoso
Copy link
Collaborator

ktoso commented Jul 17, 2020

AsyncHTTPClient will depend on it directly and we'll be able to use any tracer that adapts these APIs

that's awesome!

👍


Right yeah, so a library can either directly use the types or offer an adapter/shim. The copying of data has the potential to be a impactful on performance so that's one of the worries and why directly using baggage can be preferable.

I am happy with both as long as any functionality not defined by TracingInstrument would remain available.

Yes that's a goal as well -- users should be able to boostrap with xray and then the httpclient etc will use the generic interface to talk to it, but they totally could use the xray tracer directly -- I suspect we'd want to do this as an extension on InstrumentationSystem where you could do InstrumentationSystem.xray: XRay... which would basically get the underlying instrument and force cast it -- since you control the extension and people should only ever use this library which adds the xray extension if they're indeed using xray this could work out fine.

there are no events as such in XRay (could be added as metadata)

Ok, cool; you could ignore them OR log them if users decided to configure the xray tracer with "log events" or something like that... Events are basically structured logs which are associated with a span -- so basically they carry the trace id in metadata as well. Some systems collect them up and emit those together with the span when it is being recorded, but we can imagine other ways of handling them... Ignoring by default since xray does not do that (it seems?) and allowing some configuration to may be do something with them seems reasonable.


currentSpan

This in reality in other languages this basically means a thread local -- those are pretty unwieldily to control in async indeed unless every single library is aware of the context/span and can carry it around at asynchronous boundaries. That's what we've done with https://developer.lightbend.com/docs/telemetry/current/extensions/opentracing/enabling.html by instrumenting every single async api and make the span be carried through them.

In Swift's reality... I don't think this is going to fly at all though, we can't randomly instrument (with bytecode weaving as one does on the JVM) libraries we don't control, and we'd want these tracing APIs also to work with Dispatch and other APIs we can't change. So... why is that currentSpan even there? To be honest because for now we adopted the Open Telemetry suggested API as-is, but the currentSpan may not make much sense for us and maybe we should drop it and rely on explicit context passing always.

Long story short:

  • in all tracer implementations I know (on the JVM or Rust) of this ends up being a ThreadLocal variable; these compose terribly, so it's up to us to see if we try with it anyway or not.
  • even if the thread local is terrible, it sometimes may be useful if you have some API that is synchronous but for whatever reason you can't pass your context into it but then it calls a function that is yours again and is instrumentation aware... you could attempt to recover the span there by using the currentSpan as last resort... 🤔
    • Say things.register({ () -> myTask }) but then things is executed synchronously inside of some existing span;
    • things messed up and is not tracing aware for some reason;
    • you could still in the closure try to recover the span from the current span...
    • This is not really optimal, explicit passing of baggage context would be preferable (!), and then anyone can instrument.span(context) -> Span? to recover a span from a baggage if possible. (baggage contains the trace headers)

TLs of course break down completely with anything very async like event loop groups, dispatch groups (where one has to use queue local) and actor model things which also hop threads unpredictably.

// Sidenote: There's some ideas how "structured task local" values could look like being experimented with on the JVM's project Loom's Scope Variables which would be MUCH better to solve this and would work well with coroutines... but we don't have those in swift and no idea if it'd ever happen. But in a magical future that'd be a nice way to implement this.

@ktoso
Copy link
Collaborator

ktoso commented Jul 17, 2020

The tracing header pretty much covers that, example:

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

Note that it does contain trace id, parent id and sampling decision but that's about it.

gsoc-swift-tracing is much more robust in this regard and could pass forward and back much more,
I am not sure if there is a use for that as far as tracing is concerned (?)

So the goal of baggage context is that "any tracer can put their values in there" and we just carry it around, without even knowing what's inside there.

In that sense, XRay would put the X-Amzn-Trace-Id into BaggageContext and since you'd have installed the XRay tracer, it would be invoked on any incoming and outgoing request, with the baggage: https://github.com/slashmo/gsoc-swift-tracing/blob/main/Sources/Instrumentation/Instrument.swift#L53-L67

So XRay tracer gets called as:

func extract<Carrier, Extractor>(_ carrier: Carrier, into baggage: inout BaggageContext, using extractor: Extractor) { 
  // carrier is e.g. HTTPHeaders so you extract the header and put it into baggage
}

And when we're about to make a request, the HTTPClient would invoke instrumentation.inject(context, &httpHeaders, using: http headers injector) so you can:

func inject<Carrier, Injector>(_ baggage: BaggageContext, into carrier: inout Carrier, using injector: Injector) {

        guard let traceID = baggage.xRayTraceID else { return }
        injector.inject(traceID, forKey: "X-Amzn-Trace-Id", into: &carrier) // your job to do the formatting right as the impl which knows about how the header should be formatted
  } 
}

(examples https://github.com/slashmo/gsoc-swift-tracing/blob/main/Tests/InstrumentationTests/InstrumentTests.swift )

Note that there can be multiple instruments installed at the same time, so if someone wanted to carry some other information at those points they can as well, and we're invoking all instruments at those inject/extract points.

@ktoso
Copy link
Collaborator

ktoso commented Jul 17, 2020

I'm not sure I understand the questions here:

  • what should be recorded by TracingInstrument on client A?

That's up to your tracer impl, we only make sure you'll get called with:

  • extract(_:into:using:) (when request is received), // instrument API
  • startSpan(...) (when someone starts a span), // tracing instrument API
  • span.finish() (when a span is finished), // tracing instrument API
  • inject(_:into:using:) (when request is made) // instrument API
  • how to know/handle if B/C/D/E are instrumented (and send information themselves)?

I'm not sure if we're using the word instrumented in the same meaning here? To clarify, to me "instrumented" is a static property of a system, and if it's completely not instrumented we lost a trace -- there's nothing that'd extract the headers after all (!).

I suspect you mean "how to know if they should record"? Because if they're instrumented instruments will be called and can do the extract/inject stuff, but that will be done always, and based on the extracted information your tracer has to notice "aha! Sampled=0, so this system should NOT record" right?

Thanks for the discussion, let's keep it going and iron out all questions and confusions :-)

@pokryfka
Copy link
Contributor

pokryfka commented Jul 20, 2020

@ktoso @slashmo

quick comparison of XRay segment against OT Span

References:

Annotations and metadata vs Span Attribute

  • XRay Annotations are indexed (up to 50 annotations per trace) meaning they can be used when filtering traces
  • annotation values could be, like Span Attribute, of string, boolean or numeric type but unlike Span Attribute they could not be lists/arrays
  • XRay metadata values could be of string, boolean or numeric type as well as lists and nested types (dictionaries in JSON document)
  • XRay Annotations provide more value, so IMO Span Attribute should not be always mapped to XRay metadata just because its easier that way

on top of that XRay segment defines:

AWS

AWS object/dictionary containing information about AWS runtime environment like Account id, ECS container, EC2 instance Id.

These are AWS/implementation specific but perhaps some of OT SpanAttributes, see #65, could be mapped to AWS instead of ending up as flat attribute/metadata.

HTTP request metadata

see HTTP request metadata

We should try to map HTTP Span attributes

This is vaguely related with SpanKind

XRay subsegment HTTP request object may have:

traced – (subsegments only) boolean indicating that the downstream call is to another traced service. If this field is set to true, X-Ray considers the trace to be broken until the downstream service uploads a segment with a parent_id that matches the id of the subsegment that contains this block.

I will follow up on this in separate comment.

Errors, faults, and exceptions vs Span status

  • XRay segment does not have status as such, it defines a few boolean flags to indicate errors:
    • Error – Client errors (400 series errors)
    • Fault – Server faults (500 series errors)
    • Throttle – Throttling errors (429 Too Many Requests)
  • XRay segment may contain cause attribute with (a list of) Exception object(s) which may contain message and type (as defined by Span Status)

Record Exception

Per OT:

To facilitate recording an exception languages SHOULD provide a RecordException convenience method. The signature of the method is to be determined by each language and can be overloaded as appropriate. The method MUST record an exception as an Event with the conventions outlined in the exception semantic conventions document.

This is not currently defined by TracingInstrument. We probably want something like:

mutating func setError(_ error: Swift.Error)

Note that XRay Exception type defines (optional) stack trace.
It would be convenient if the swift-tracing provided a tool and defined a type so that it does not have to be "reinvented" in every swift tracer out there.

Span Events

XRay segment does not have direct Span Event equivalent;
Arguably the closest match is a subsegment with startTime == endTime (?)

Links

XRay segment does not have links (to other segments/spans).

It may have precursor_ids which are

array of subsegment IDs that identifies subsegments with the same parent that completed prior to this subsegment.

whereas Span link

can be from the same or a different trace

These should probably be ignored by XRay instrument.

@pokryfka
Copy link
Contributor

I'm not sure if we're using the word instrumented in the same meaning here? To clarify, to me "instrumented" is a static property of a system, and if it's completely not instrumented we lost a trace -- there's nothing that'd extract the headers after all (!).

I suspect you mean "how to know if they should record"? Because if they're instrumented instruments will be called and can do the extract/inject stuff, but that will be done always, and based on the extracted information your tracer has to notice "aha! Sampled=0, so this system should NOT record" right?

Thank you for the defining "instrumenting" and "recording" - that's useful.
My question was more of "thinking aloud", not precise at all, sorry for that and let me try to clarify that:

XRay segment defines how to record:

details about an HTTP request that your application served (in a segment) or that your application made to a downstream HTTP API (in a subsegment)

see HTTP request data

example of JSON document with subsegment for a downstream HTTP call:

{
  "id": "004f72be19cddc2a",
  "start_time": 1484786387.131,
  "end_time": 1484786387.501,
  "name": "names.example.com",
  "namespace": "remote",
  "http": {
    "request": {
      "method": "GET",
      "url": "https://names.example.com/"
    },
    "response": {
      "content_length": -1,
      "status": 200
    }
  }
}

Im an not sure if its covered in any of the use cases, it would be great if instrumented HTTP client injected context but also passed back HTTP response (or failure) back to the tracer.

For reference this how tracing of HTTP requests is implemented in (one of a few) official AWS XRay SDK for Java:

https://github.com/aws/aws-xray-sdk-java/blob/master/aws-xray-recorder-sdk-aws-sdk-v2/src/main/java/com/amazonaws/xray/interceptors/TracingInterceptor.java

and equivalent in Swift XRay tracer (this very much a proof of a concept at the moment but also much less code so maybe easier to get the idea): https://github.com/pokryfka/aws-xray-sdk-swift/blob/master/Sources/AWSXRayRecorderSDK/Middleware.swift


Now, as far as XRay is concerned, the response contains HTTP status and optionally content size:

response – Information about a response.

  • status – number indicating the HTTP status of the response.
  • content_length – number indicating the length of the response body in bytes.

however given the flexibility and robustness of BaggageContext it could "return back" more information, for example using "trace-flags" as defined by WC3 TraceContext

not sure if/how that could be useful (?)

@pokryfka
Copy link
Contributor

@ktoso

TLs of course break down completely with anything very async like event loop groups, dispatch groups (where one has to use queue local) and actor model things which also hop threads unpredictably.

This would be great to have #48

@pokryfka
Copy link
Contributor

@slashmo @ktoso

FYI pokryfka/aws-xray-sdk-swift#16

This is how I think XRayInstrument could be implementing TracingInstrument.
Its "a stub" at the moment, I will play a bit with that.
Please let me know if you have any comments.

@slashmo
Copy link
Owner Author

slashmo commented Jul 20, 2020

@pokryfka Awesome, thanks for taking the time to create this 😎 I think we can use your PR from now on to discuss XRay specific things / current limitations to our APIs. Very excited about getting the first real-world tracer up and running 😊

@ktoso
Copy link
Collaborator

ktoso commented Jul 22, 2020

XRay segment does not have direct Span Event equivalent;
Arguably the closest match is a subsegment with startTime == endTime (?)

It seems Amazon folks are implementing Events as metadata, for reference: awslabs/aws-xray-sdk-with-opentelemetry@89f941a

@pokryfka
Copy link
Contributor

pokryfka commented Jul 22, 2020

XRay segment does not have direct Span Event equivalent;
Arguably the closest match is a subsegment with startTime == endTime (?)

It seems Amazon folks are implementing Events as metadata, for reference: awslabs/aws-xray-sdk-with-opentelemetry@89f941a

Thank you the reference, I will check it out later on!

Anything Encodable can be set as Metadata, also OT Links for that matter.
At the same time Metadata gives least value:

  • Event s attributes will not be indexed and timestamps will not be presented on the timeline;
  • Link s will be debug messages without referencing anything.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
t:tracing use-case Exploring different use cases of the API
Projects
None yet
Development

No branches or pull requests

3 participants