A couple years ago (i.e. 2007), I wrote a short blog entry commenting on Pat Helland’s paper “Life beyond Distributed Transactions: an Apostate’s Opinion” (Worthy of a second and third read). I found it curious that it was re-discovered by highscalabilty.com (see: “7 Design Patterns for Almost Infinite Scalability“). Though Highscalability.com is a treasure trove of implementation ideas on achieving high scalability. It made me wonder if anyone else had created a pattern language for high scalability? I have seen a few attempts and this entry is a quick attempt to extend those and conjure a new one up. Hopefully it serves as a good starting point for further refinement and improvements.
At the most abstract level there is Daniel Abadi’s PACELC classification for distributed systems. IMHO, PACELC, as compared to Brewster’s CAP theorem, is a more pragmatic description of the trade-offs one will make when designing a distributed system. PACELC says that if there is a network (P)artition does the system favor (A)vailability or (C)onsistency; (E)lse in the normal state does it favor (L)atency or (C)onsistency.
Cameron Purdy (founder of Oracle’s Coherence product) has a presentation where he proposes these building blocks for scaling-out:
- Replication (for Availability)
This short list is rumored to comprehensively cover every distributed system that can be encountered in the wild. If I applied the PACELC to this classification, I may be able to select Routing, Replication and Coordination techniques that favor either Consistency or Availability. Also, I may select Routing, Coordination and Messaging that favors Latency or Consistency.
Jonas Boner, who I have a big fan of for a very long time (see: AspectWerkz ), has a great slide deck that comprehensively enumerates in detail existing techniques to achieve scalability, with availability and stability thrown in for good measure. Shown below is how this list may be mapped into Purdy’s classification (I have taken the liberty to refine the original classification), I’ve marked which trade-off that is favored, either Latency or Consistency, where I thought made sense.
- State Routing
- Distributed Caching(Latency)
- HTTP Caching (Latency)
- Behavior Routing
- Fire-forget (Latency)
- Event Stream Processing(Latency)
- Dynamic Load Balancing
- Behavior Partitioning
- Loop Parallelism
- Round Robin Allocation
- Random Allocation
- Weighted Allocation
- State Partitioning (Favors Latency)
- Distributed Caching
- HTTP Caching
- State Replication (Favors Availability in Partition Failure)
- Master Slave-Synchronous (Consistency)
- Master Slave-Asynch (Latency)
- Master Master-Synchronous (Consistency)
- Master Master-Asynch (Latency)
- Buddy Replication-Synchronous (Consistency)
- Buddy Replication-Asynch (Latency)
- State Coordination
- Message Passing Concurrency(Latency)
- Software Transactional Memory(Consistency)
- Shared State Concurrency(Consistency)
- Service of Record(Consistency if Synchronous)
- Behavior Coordination
- Message Passing Concurrency
- Dataflow Architecture
- Tuple Space
- Request Reply
- Queuing (Consistency)
- Request Reply(Latency)
The trade-off between Consistency and Availability arises with the implementation of Replication by selecting an Synchronous versus Asynchronous Messaging (or even Coordination) approach. Employing Partitioning favors Latency and never Consistency (this should be obvious). The remaining patterns of Routing, Coordination and Messaging provides the flexibility where one can choose either Latency or Consistency.
This for now appears to be a workable starting point. Although, there’s a lot of room for improvement. For example in the Replication category, Master-Master or the more general form of Buddy Replication is clearly favors Consistency at the cost of Latency irregardless of the choice of Synchronous or Asynchronous messaging and coordination strategy. I think this article “Concurrency Controls in Data Replication provides a better classification of replication techniques.
There is also some inconsistencies that appear to need further refinement, for example the Fire and Forget Routing strategy appears to favor Latency in the sense that it is non-blocking (see: Scalability Best Practices: Lessons from eBay“), however messaging pattern may be the presence of a queue that clearly favors Consistency over Latency. So it favors Latency from the caller perspective, but Consistency from the receiver side (i.e. everything is serialized). In general one may say that decoupling (or loose coupling) favors latency while the tight coupling favors consistency. As an example, optimistic concurrency is loosely coupled and therefore favors latency.
To summarize, there are a lot of techniques that have been developed over the past few decades. Concepts like Dataflow and Tuple Spaces and many other Parallel Computation techniques have been known since the ’70s. The question an architect should however can ask today (which wasn’t asked back then) is which technique to use given the trade-offs defined by PACELC. The short coming of this pattern language is that is does not provide a prescription of how to achieve high scalability. It only provides the patterns one would find in a high scalability system.
The selection of the architecture, should be clearly driven by the use-cases and requirements. That is, consider vertical (see: “Nuggets of Wisdom from eBay’s Architecture“)as well as horizontal partitioning. Finally, unless a service has a limited set of use cases, one can’t expect to build a one-size fits all architecture in the domain of high-scalability.
P.S. I stumbled upon recently this very impressive paper by James Hamilton from Microsoft’s Live.com. He writes about the important considerations when designing a high scalability system from the operational perspective. This kind of insight is extremely very hard to come by. Not many software developers have the intuition to understand what goes on in the data center. On my next entry, I’ll attempt to incorporate some of Hamilton’s ideas to improve this pattern language.