Recursive Recoverability
|
|
http://www.stanford.edu/~candea/papers/perfeval/perfeval.pdf
"Software has only an abstract embodiment, and is thus governed solely by laws laid down, sometimes unwittingly, by its designers and implementors. The ability to make mistakes is, therefore, unbounded."
"The designers of such operating systems traded data safety and recovery performance for improved steady state performance. Not only do such performance tradeoffs impact robustness, but they also lead to complexity by introducing multiple ways to manipulate state, more code, and more APIs. The code becomes harder to maintain and offers the potential for more bugsĀa ne tradeoff, if the goal is to build fast systems, but a bad idea if the goal is to build highly available systems."
"Major Internet portals routinely kill and restart their web server processes after waiting for them to quiesce, in order to deal with known memory leaks that build up quickly under heavy load. A major search engine periodically performs rolling reboots of all nodes in their search engine cluster [11]." E. Brewer. Inktomi insights. Personal Communication, 2000.
"For a system to be recursively recoverable (RR), it must consist of ne grain components that are independently recoverable, such that part of the system can be repaired without touching the rest. This requires components to be loosely coupled and be prepared to be denied service from other components that may be in the process of microrebooting."
"Diverging from traditional reliability research, we regard the computer system as an entity that cannot be understood in its entirety, and rely on empirical observations to keep it running."
"Modeling systems is particularly unsuccessful in large scale Internet services, that are built from many heterogenous, individually packaged components, and whose workloads can vary drastically. Such systems are often subject to rapid and perpetual evolution, which makes it impossible to build a model that stays consistent over time."

