Introduction to Distributed Tracing

Nikolay Novik

https://github.com/jettify

KyivPy 22

I am ...

How many of you heard of distributed tracing?
  1. I read Dapper paper.
  2. I heard about this and know key ideas.
  3. I think distributed tracing is kinda cool.
Problem statement

  • User response are slow where is bottle neck?
  • Standard tools are broken, cprofile is not helping
  • How many services participate in serving this http route?
  • What is going on in this madness in first place?
Tools we have: Metrics

  • Aggregates events per service
  • Insights about trends and alerts
  • No per request overview
Tools we have: Logs

  • Records discrete events
  • Manual correlation
  • Usually expensive
What is Distributed Tracing?
Distributed Tracing - is a tool that helps gather timing data needed to troubleshoot latency problems in service oriented architectures. Provides an end-to-end view of requests as they travel through your application, and shows a map of your application’s underlying components.
Popularity started from Google Dapper paper (2010) and microservices hype

How Dapper works

  • Small context trips across process boundaries: trace_id, parent_span_id, span_id
    
                            {'X-B3-TraceId': '6f9a20b5092fa5e144fd15cc31141cd4'
                             'X-B3-ParentSpanId': None,
                             'X-B3-SpanId': '41baf1be2fb9bfc5',
                             'X-B3-Sampled': '1',}
    					
  • Each service report time information separately
  • Dapper server correlates all spans into one trace
Google Dapper Tracing Tool Goals
  • Low overhead
  • Application transparency
  • Scalability
Google Dapper: Low overhead

  • Employ sampling to for low overhead
  • Sample of just one out of thousands, provides sufficient information for many common use cases
  • Low network overhead, context is tiny
Google Dapper: Scalability

  • Data written in local log files
  • Collectors pulls data from all production hosts
  • Results are stored in regional BigTable
Google Dapper: Application transparency
  • Tracing mostly transparent for developer
  • Instrumented RPC library used by all services
  • Trace context sits in thread local storage, so instrumentation can pick it when required
Zipkin strait forward implementation of Dapper ideas

Firefighter tools

Zipkin is not substitution for metrics and logging. http://thelastpickle.com/blog/2015/12/07/using-zipkin-for-full-stack-tracing-including-cassandra.html
Zipkin in the wild

Other tools for Distributed tracing

Why not OpenTracing

  • Lockin to one instrumentation vendor
  • Wire and data interop is out-of-scope
  • Monkey patching everywhere
  • Things are not settled, big player still negotiate tracing cooperation between vendors
https://gist.github.com/adriancole/3c4b70925b8f87d7c98e369216b916aa
Zipkin Architecture

  • Client and server sends spans separately
  • Span correlation happens on zipkin server
Zipkin Glossary

  • Span represents one specific method (RPC) call
  • Annotation string data associated with a particular timestamp in span
  • Binary Annotation - key and value associated with given span
  • Trace - collection of spans, related to serving particular request
Identifying Services interacting with request

Identifying duplicate calls

Identifying Slow requests

Identifying serial execution

Service dependency analysis

Zipkin python story
Name github stars
py_zipkin github.com/Yelp/py_zipkin 61 ★
pyramid_zipkin github.com/Yelp/pyramid_zipkin 23 ★
flask-zipkin github.com/qiajigou/flask-zipkin 14 ★
django-zipkin github.com/prezi/django-zipkin 19 ★
aiozipkin github.com/aio-libs/aiozipkin 10 ★
Zipkin for asyncio: aiozipkin

import aiozipkin as az

async def run():
    zipkin_address = "http://127.0.0.1:9411"
    endpoint = az.create_endpoint("simple_service", ipv4="127.0.0.1", port=8080)
    tracer = az.create(zipkin_address, endpoint)

    # create and setup new trace
    with tracer.new_trace(sampled=True) as span:
        # give a name for the span
        span.name("Slow SQL")
        # tag with relevant information
        span.tag("span_type", "root")
        # indicate that this is client span
        span.kind(az.CLIENT)
        # make timestamp and name it with START SQL query
        span.annotate("START SQL SELECT * FROM")
        # imitate long SQL query
        await asyncio.sleep(0.1)
        # make other timestamp and name it "END SQL"
        span.annotate("END SQL")
					
It is possible to attach bunch of metadata for each span.
aiozipkin annotations

    with tracer.new_trace(sampled=True) as span:
        span.name("Slow SQL")

        span.annotate("START SQL SELECT * FROM")
        await asyncio.sleep(0.1)

        span.annotate("END SQL")
        await asyncio.sleep(0.1)
					
Span duration can be annotated manually
aiozipkin nested spans

    # create and setup new trace
    with tracer.new_trace(sampled=True) as span:
        span.name('root_span')
        await asyncio.sleep(0.1)

        # create child span
        with tracer.new_child(span.context) as nested_span:
            nested_span.name('nested_span_1')
            await asyncio.sleep(0.01)

        # create other child span
        with tracer.new_child(span.context) as nested_span:
            nested_span.name('nested_span_2')
            await asyncio.sleep(0.01)

					
Spans can be nested, just pass contest from parent
aiozipkin aiohttp
Client Side

    with tracer.new_trace(sampled=True) as span:
        span.kind(az.CLIENT)
        headers = span.context.make_headers()
        resp = await session.get(host, headers=headers)
        await resp.text()
                    
Server side will work automatically if aiozipkin installed as plugin

def make_app(host, port, loop):
    app = web.Application()
    endpoint = az.create_endpoint(
        'aiohttp_server', ipv4=host, port=port)
    tracer = az.create(zipkin_address, endpoint, sample_rate=1.0)
    az.setup(app, tracer)
                    
Middleware attaches server span to client span.
aiozipkin application transparency
  • Developers are lazy, for better experience libraries should be instrumented
  • Other vendors like datadog, newrelic et, will benefit too (right now they monkey patch everything)
  • asyncio right now does not support contest variables like thread locals, PEP 550 addresses this issue
  • aiohttp HTTP client instrumentation in progress

References

  1. Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., and Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google, 2010.
  2. Mace, J. End-to-End Tracing: Adoption and Use Cases. Survey, Brown University, 2017.

Thank you!


aio-libs: https://github.com/aio-libs

slides: https://jettify.github.io/kyivpy22