Key Concepts For Scalable Stateful Services

Nikolay Novik

https://github.com/jettify

PyConUA 2017

I am ...

Poll: Have you read the Dynamo paper?
  1. I have read the paper.
  2. I have heard about the paper and know its key ideas.
  3. I think distributed systems are kinda cool.
Agenda
  1. Motivation: why and when we might want to use stateful services
  2. Industry examples: Uber, Halo 4, Dragon Age, HPC
  3. Problem statement and required components
  4. Overview of consistent hashing, gossip dissemination and SWIM failure detection
  5. Possible improvements
Use Stateless (Duct Tape) When You Can!

Stateless protocols are a proven technique; use them like duct tape.

Issues with Stateless Services

  • Soft real-time requirements are hard to meet
  • State serialization overhead
  • Wasteful data fetching
  • Leaky DB transactions

Stateless Service Example

Notice that user data is fetched several times and cached on multiple servers.

Benefits of Stateful Services

  • Data locality: logic executes where the data is stored, with fast access
  • Lower latency: state is kept in memory, no extra network hops needed
  • Higher performance: no need to deserialize data
Stateful Service Example

Extra round trips to the database are avoided, which reduces latency. Even if the database is down, the request can still be handled.

Industry example: UBER

A geospatial index service to match drivers and users

Industry example: Halo 4

Orleans is used as the backbone for the server side of the Halo game: presence, statistics, cheat detection, etc.

Industry example: HPC

The San Diego Supercomputer Center uses Serf to coordinate compute resources in multiple locations; the cluster size is about 2k nodes

Let's try to solve a close-to-real-world problem: a prediction service

    A service that predicts resale prices of used products, based on product specifications

  • A user enters used product specs and obtains a price estimate
  • Each product category
Functional requirements

  • Dynamic scaling
  • Fault tolerance
  • Exploit data locality
  • Flexible API
Required components
  1. Work distribution and routing: move each job request to the appropriate node
  2. Cluster membership updates: provide a way to determine which nodes participate in the cluster, both in steady state and while the cluster is resizing
  3. Failure detector: periodically check nodes and remove unresponsive/dead ones
Routing. Naive solution with hard-coded cluster nodes
  • Very easy to implement; a viable solution when dynamic resizing is not required
  • Does not support dynamic scaling in or out
  • Requires a cluster restart to change the node configuration
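A minimal sketch of this naive scheme (node names are made up for illustration). It also shows why resizing hurts: adding one node remaps most keys.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hard-coded cluster

def stable_hash(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is salted per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def route(key: str, nodes: list) -> str:
    """Pick a node by hashing the key modulo the cluster size."""
    return nodes[stable_hash(key) % len(nodes)]

keys = [f"user-{i}" for i in range(1000)]
before = {k: route(k, NODES) for k in keys}
# Resize the cluster: almost every key now maps to a different node.
after = {k: route(k, NODES + ["node-d"]) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 3/4
```

With modulo routing, going from 3 to 4 nodes keeps a key in place only when hash % 3 == hash % 4, i.e. for about a quarter of the keys.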
Routing. Consistent Hashing Solution

This simple algorithm helped make Akamai a multi-billion-dollar company

Consistent Hashing. Basic Idea

Consistent hashing minimizes the number of keys that need to be remapped when the set of nodes changes

http://blog.carlosgaldino.com/consistent-hashing.html
Consistent Hashing. Adding node

When capacity is added, only a fraction of the keys is moved

Consistent Hashing. Removing node

When a node fails, the next node on the ring handles its keys

Consistent Hashing. Virtual nodes

Virtual nodes improve key distribution, moving each node's share close to 1/n
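A toy consistent hash ring with virtual nodes (a sketch, not production code; class and node names are illustrative). Adding a node moves only about 1/n of the keys:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Each physical node owns `vnodes` points on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def route(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.route(k) for k in keys}
ring.add("node-d")  # add capacity
moved = sum(before[k] != ring.route(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # about 1/4, not 3/4
```

With 100 virtual nodes per physical node, each node's share of the key space stays close to 1/n, and only the keys claimed by the new node's virtual nodes move.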

Cluster Membership Problem

We have routing and job distribution; let's figure out how to add and remove nodes.

Why not just use ZooKeeper/Consul/etcd (in other words, ZAB, Paxos, Raft)?

Issues

  • Availability
  • Performance
  • Network partitions
  • Operation overhead
Typical system with coordination
  • Zookeeper forces its own view of the cluster on every node
  • Possible links: $ \frac{n (n - 1)}{2} $, but only $ n $ of them are used for failure detection
  • Node availability decisions are best made locally
Cluster Membership Update Problem. Naive solution

Broadcast could be used for cluster membership updates:

  • Use network broadcast (usually disabled)
  • Send the message to each peer one by one (not reliable)
Gossip Protocol

Xerox PARC invented gossip protocols: anti-entropy and rumor mongering.
Gossip Overview

Basic gossip protocol:

  • Send a message to k random peers
  • Peers retransmit the message to the next k random peers
  • In $\log(n)$ steps the information is disseminated
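The steps above can be simulated in a few lines (a sketch: every informed node here retransmits each round until convergence, which slightly simplifies real gossip; the `loss` parameter is an added illustration of dropped messages):

```python
import random

def gossip_rounds(n=1000, k=3, loss=0.0, seed=42):
    """Count rounds until a rumor started at node 0 reaches all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        newly = set()
        for _ in informed:
            for peer in rng.sample(range(n), k):
                if rng.random() >= loss:  # drop the message with prob. `loss`
                    newly.add(peer)
        informed |= newly
        if rounds > 100:  # safety valve for the sketch
            break
    return rounds

print(gossip_rounds())          # grows like log(n)
print(gossip_rounds(loss=0.5))  # packet loss only slows it down
```

Each round multiplies the informed set by roughly k+1 (minus collisions), hence the logarithmic number of rounds; losing half the messages roughly halves the effective fanout rather than stopping dissemination.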

Gossip Protocol vs Packet loss

Heavy packet loss does not stop dissemination; it simply takes longer, about twice as long at 50% loss.

Failure detection protocol

We can route jobs and propagate cluster updates; the last missing component is a failure detector.

Failure detectors for asynchronous systems

    In asynchronous distributed systems, the detection of crash failures is imperfect. There will be false positives and false negatives.

Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.

Failure detectors. Properties
  • Completeness - every crashed process is eventually suspected
  • Accuracy - no correct process is ever suspected
  • Speed - how quickly a faulty node is detected
  • Network message load - the number of messages required per protocol period
Basic failure detector
  • Each process periodically sends an incremented heartbeat counter to all other processes.
  • A process is declared failed when no heartbeat has been received from it for some time.
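A minimal heartbeat detector along these lines (a sketch with an injectable clock so the timeout logic is easy to exercise; node names are made up):

```python
import time

class HeartbeatDetector:
    """Suspect a peer when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # peer -> (heartbeat counter, local timestamp)

    def on_heartbeat(self, peer: str, counter: int) -> None:
        last = self.last_seen.get(peer)
        if last is None or counter > last[0]:  # ignore stale duplicates
            self.last_seen[peer] = (counter, self.clock())

    def failed_peers(self) -> list:
        now = self.clock()
        return [peer for peer, (_, ts) in self.last_seen.items()
                if now - ts > self.timeout]

# Drive the detector with a fake clock.
t = [0.0]
det = HeartbeatDetector(timeout=3.0, clock=lambda: t[0])
det.on_heartbeat("node-a", 1)
det.on_heartbeat("node-b", 1)
t[0] = 2.0
det.on_heartbeat("node-a", 2)  # node-b goes silent
t[0] = 4.5
print(det.failed_peers())  # ['node-b']
```

Every node runs such a detector and hears from every other node, which is where the $O(n^2)$ message load comes from.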
Basic failure detector. Properties
  • Completeness - every crashed process eventually misses a heartbeat
  • Speed - configurable, as little as one protocol interval
  • Accuracy - high, trades off against speed
  • Network message load - $O(n^2)$: each node sends a message to every other node
SWIM failure detector

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

SWIM Failure Detector

  • On each protocol round, a node sends only $k=3$ ping messages
  • SWIM uses direct ping as the primary failure check, and indirect ping for better tolerance to network partitions
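One probe round can be sketched as follows (the network is abstracted into a `send_ping(src, dst) -> bool` callback; function and node names are illustrative):

```python
import random

def swim_probe(self_id, target, members, send_ping, k=3, rng=random):
    """Probe `target`: direct ping first, then ask k random members
    to ping it on our behalf (indirect ping)."""
    if send_ping(self_id, target):
        return True  # direct ack
    helpers = [m for m in members if m not in (self_id, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if send_ping(helper, target):
            return True  # target is alive, only our link to it is bad
    return False  # no acks at all: suspect the target

members = ["a", "b", "c", "d", "e"]

def broken_link(src, dst):
    # Assume only the a<->e link is down; everything else works.
    return {src, dst} != {"a", "e"}

print(swim_probe("a", "e", members, broken_link))  # True: "e" is alive
print(swim_probe("a", "e", members, lambda s, d: False))  # False: suspect "e"
```

The indirect ping is what keeps a single broken link from producing a false positive: the helpers confirm the target is alive even when the prober cannot reach it directly.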
SWIM Failure Detector. Properties
  • Completeness - every process is eventually pinged
  • Speed - configurable, one protocol interval
  • Accuracy - 99.9% with message delivery probability 0.95 and k=3
  • Network message load - $O(n)$, precisely $(4k+2)n$
SWIM vs connection loss. Suspicion subprotocol

Provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.

SWIM vs packet order

Ordering between messages matters, but total order is not required, only happens-before/causal ordering.

  • Logical timestamps for state updates
  • Peer-specific and only incremented by that peer
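A sketch of how such peer-owned logical timestamps (incarnation numbers) resolve conflicting updates, SWIM-style and simplified: at equal incarnation a "worse" status wins, and only the peer itself bumps its incarnation to refute a suspicion.

```python
RANK = {"alive": 0, "suspect": 1, "dead": 2}

def supersedes(new, old):
    """Each update is (status, incarnation); the incarnation counter
    is only ever incremented by the peer it describes."""
    new_status, new_inc = new
    old_status, old_inc = old
    if new_inc != old_inc:
        return new_inc > old_inc
    return RANK[new_status] > RANK[old_status]

state = {"node-b": ("alive", 0)}

def apply_update(peer, update):
    if peer not in state or supersedes(update, state[peer]):
        state[peer] = update

apply_update("node-b", ("suspect", 0))  # wins: same incarnation, worse status
apply_update("node-b", ("alive", 1))    # wins: node-b refuted the suspicion
apply_update("node-b", ("suspect", 0))  # stale gossip, ignored
print(state["node-b"])  # ('alive', 1)
```

Because updates commute under this rule, gossip messages may arrive in any order and every node still converges to the same view.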
SWIM vs Network Partitions

Nodes in each subnet can talk to each other and, as a result, declare peers in the other subnet dead.

    How can we recover the cluster after the network heals?
  • Do not purge nodes when they are declared dead
  • Periodically try to rejoin them
Problem solved! Implementation details

  • How python can help with implementation?
  • What frameworks to use?
Overview of frameworks for building cluster aware systems
Name           Language  Developer   Description
???            Python    ???         ???
RingPop        node.js   Uber        Used to match users and drivers, with follow-up location updates
Serf           golang    HashiCorp   Used in a number of applications, for instance in HPC to manage compute resources
Orleans        .NET      Microsoft   General-purpose framework, used in the Halo online game
Orbit/jGroups  Java      EA Games    Used in BioWare games such as Dragon Age (not sure where exactly); inspired by Orleans
riak_core      Erlang    Basho       Building block for the Riak database and Erlang distributed systems
Akka           Scala     Lightbend   General-purpose distributed systems framework, often used as a microservices platform
Improvement: network coordinates

A famous paper from MIT describes synthetic network coordinates based on ping delays; they are used in Serf/Consul for data center failover
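The core Vivaldi update rule fits in a few lines (a simplified sketch: fixed timestep and no error-confidence weighting, unlike the full algorithm):

```python
import math
import random

def vivaldi_step(xi, xj, rtt, delta=0.25):
    """Nudge our coordinate `xi` so that the predicted distance
    to `xj` moves toward the measured RTT."""
    dist = math.dist(xi, xj)
    if dist == 0:
        # Coincident points: push off in a random direction.
        direction = [random.random() + 1e-6 for _ in xi]
        norm = math.hypot(*direction)
        unit = [d / norm for d in direction]
    else:
        unit = [(a - b) / dist for a, b in zip(xi, xj)]
    error = rtt - dist  # positive: predicted too close, move apart
    return [a + delta * error * u for a, u in zip(xi, unit)]

# Two nodes with a 100 ms RTT settle at distance ~100.
a, b = [0.0, 0.0], [10.0, 0.0]
for _ in range(50):
    a = vivaldi_step(a, b, rtt=100.0)
    b = vivaldi_step(b, a, rtt=100.0)
print(round(math.dist(a, b)))  # 100
```

After a few rounds of measurements, the Euclidean distance between any two coordinates predicts their RTT, so a node can pick the nearest data center without pinging all of them.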

Improvement: network coordinates visualization

Notice the coordinates drifting in space while the distances between clusters stay stable

Improvement: partial view for huge clusters

For huge clusters, full membership is not scalable; the paper proposes a partial membership protocol

Improvement: partial view in case of node failures

Even for failure rates as high as 95%, HyParView still manages to deliver messages to about 90% of the active processes.

Improvement: DHT for more balancing

Orleans uses a one-hop distributed hash table that maps actors to machines, so actors can be moved across the cluster

Stateful Services Challenges
  • Work distribution
  • Code deployment
  • Unbounded data structures
  • Memory management
  • Persistence strategies

Read more papers!

References

  1. Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.
  2. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
  3. Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "Swim: Scalable weakly-consistent infection-style process group membership protocol." Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on. IEEE, 2002.
  4. Dabek, Frank, et al. "Vivaldi: A decentralized network coordinate system." ACM SIGCOMM Computer Communication Review 34.4 (2004): 15-26.
  5. Leitao, Joao, José Pereira, and Luis Rodrigues. "HyParView: A membership protocol for reliable gossip-based broadcast." Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference on. IEEE, 2007.
  6. Stoica, Ion, et al. "Chord: A scalable peer-to-peer lookup service for internet applications." ACM SIGCOMM Computer Communication Review 31.4 (2001): 149-160.
  7. Bailis, Peter, and Kyle Kingsbury. "The network is reliable." Queue 12.7 (2014): 20.
  8. Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.

Thank you!


aio-libs: https://github.com/aio-libs

slides: https://jettify.github.io/pyconua2017