Abstract:
The ability to manage large amounts of simulation data is pivotal for exploring complex computational models in biology and elsewhere. One of the key debates of the last few months is whether a relational database and support for SQL queries is the “one size fits all” solution for emerging distributed processing paradigms. While expensive distributed SQL databases can excel at filtering data if properly optimised, current processing needs call for extreme scalability at low cost and uniform processing (i.e., a specific type of processing is applied all data, as opposed to selective subsets); this tendency has been referred to as the “NoSQL” movement.
In this project, our aim is to evaluate various options that arise in building extremely scalable cost-effective and easy-to-manage distributed databases of simulation results that support (i) automated replication and fault-tolerance to allow running on cheap (and often unreliable) hardware (ii) retrieval and annotation of individual simulation results, (iii) plotting one parameter against another for large numbers of arbitrarily selected single-run-results, and (iv) computation of summary statistics and other large-scale analytics for the stochastic output from repeated simulations of the same input parameter combination.
Commercial relational databases provide distributed storage support, but are (i) tailored towards a different processing paradigm (i.e., filter-based processing), and are (ii) not cost-effective (i.e., an annual license for a distributed version of a commercial product might far exceed other costs of ownership for the data).
Our goal is to implement a prototype system that supports the features above in an easy-to-use way. The target user of the system will be a research scientist interested in storing and uniformly manipulating the outcome of simulation results – a processing paradigm that manifests in multiple disciplines, including biology. To test the effectiveness of our general solution, we will apply our system to live production data from the evolution@home global computing system (>1 million single simulation runs worth >500 CPU years; a general system for exploring evolutionary models that is currently used to explore potential genetic causes for extinctions of endangered species).
In Detail:
Many complex problems in science today are addressed by computer simulations that map a set of input parameters with the help of a model to a set of output parameters. This pattern appears in multiple disciplines: biology, geology, physics, astronomy, to name but a few. Investigating the model is easy as long as the number of parameter combinations is relatively small. However, as soon as simulation projects grow beyond a certain size, the effectiveness of analyses is hampered by the need to search large amounts of often poorly organised data. A frequently suggested solution is to “simply” store such data in a relational database. While this helps structuring the data, relational databases are not optimised for the fast processing of scientific data and the size of such databases is mostly limited to a single machine. Furthermore, relational databases do not come with code that supports the most common use-cases of scientists; rather, the extra functionality needed for the analysis of the simulation results is implemented externally. The result is that the relational database acts as an elaborate storage layer. In fact, most scientific databases do not use a relational database back-end; rather, they operate over flat files in a file system, since most processing takes place in bulk.
From a structured data management perspective (i.e., assuming that flat files are not used and some semantics are readily supported by the storage layer) there are three fundamentally different approaches for achieving extreme scalability, which we use here to mean the virtually unlimited growth of nodes in a massively distributed shared-nothing database. Each approach has unique strengths and weaknesses:
1) Horizontal partitioning of tables in relational databases. Each node stores different rows of a huge table and handles all requests relating to these records. Advantages include fast access to all the values of a particular record (all on same machine), while access of particular columns across many records can be slower (on many different machines).
2) Vertical partitioning of tables in relational databases. Each node stores a different column (or part of it), resulting in faster access for queries that are frequently used by scientists (e.g., plot parameter x versus y). If access to all values of a particular record is rarely needed, slower speed for such requests will not be critical.
3) Distributed file systems for cluster-like processing. Files are organised in terms of records containing name-value pairs (‘columns’). The file system provides bindings for content-based retrieval, while a higher-level abstraction provides a way to process the files in bulk. MapReduce has emerged as a highly parallelisable data-flow for processing data in such an environment. Such systems allow for much larger flexibility as each file could carry different content; at the same time it becomes much harder to ensure the application-level correctness of data, as no schema is being enforced (as is the case in relational databases).
Stonebraker et al. (2010) have argued that all three approaches can, in principle, deliver the same functionality and extreme scalability. However, a careful comparison of all three approaches indicates different trade-offs that are required to get the corresponding systems to work (Pavlo et al. 2009; Stonebraker et al. 2010). For a scientist who simply wants to manage an extremely large dataset, but has no particular interest in relational databases the following practical details stand out.
Horizontal and vertical partitioning are not supported by mature open source systems, but only by big commercial relational databases. These (i) have a high licensing fee for large clusters, as each node requires a separate licence, (ii) due to their complexity require almost a full-time database administrator with specialist knowledge of both the database system and the data that is being stored, (iii) have a filtering performance that is potentially better than that of MapReduce systems, particularly when it comes to identifying data, but require specialist insights and sometimes even vendor help to optimise performance, as response time can increase dramatically if some tuning parameters have the wrong value. They can (iv) mimic cluster-processing functionality by implementing User-Defined-Functions (UDFs), but some scientists find these together with all required SQL code to be more difficult to implement than a custom-built system.
In comparison, cluster management systems offering a fully replicated storage layer and a MapReduce-like data processing paradigm are much simpler, both at a conceptual level and in their practical administration. They usually provide good out-of-the-box performance (albeit improving on that can be challenging) and many are freely available. They are particularly well suited for processing massive amounts of data through automatic data-flow parallelisation by using two functions: the “map” function is applied to identify and/or re-group the input into disjoints sets, and the “reduce” function to process each disjoint set in isolation and combine all individual results into the overall result. Such functionality is almost tailor-made for simulation-based studies. It is particularly well-suited to compute “multi-run-analytics” like the mean and variance for many stochastic single simulation runs (where a single set of input parameters results in many sets of output parameters combinations that capture the stochastic variations). Here the “map” phase groups results according to input parameter combinations, while the “reduce” phase computes the final statistics for each group. The relative ease of use associated with these systems and their out-of-the-box redundancy earned them widespread use.
Building on recent NoSQL approaches we will use best practices to develop a general scheme for organising simulation data in order to explore potential bottlenecks. Bad performance from poor data organisation is rarely compensated by efficient DBs. Thus the main aim to provide a general easy-to-use scheme for organising simulation data in an extremely scalable way. A core idea is that most queries of simulation data focus on one or a few simulation “projects”. We will use the following principles to optimise our scheme for this use-case:
1. Keep projects separate so that searching in one does not require looking at any data in all others.
2. Distribute each project among as many data storage nodes as reasonable in order to increase parallelism and hence speed.
3. Group results for efficiency (eg. 64MB; reduce access time, fragmentation).
4. Keep stochastic repeats of the same input parameter combination on the same data storage node for fast “multi-run-analytics”.
5. Avoid all indices to save insertion time; exception: important administrational indices like a hash on unique input parameter combinations.
We will evaluate various systems for cluster-based data management without a huge price-tag that a scientist in need of an extremely scalable data management solution might consider choosing. Our evaluation will be based on empirical comparisons of corresponding designs in order to evaluate (i) the potential limits of the scalability, (ii) fault-tolerance, (iii) ease of administration, (iv) ease of implementing special types of processing and (v) expected scalability of performance.
We plan to follow a two-phase process. First, we will evaluate the above factors in the two arguably leading such solutions: Hadoop, and CouchDB. If time permits and a fully functional version becomes available during the evaluation phase of the project, a potential third system is SciDB. Our evaluation will result in a report of the comparative study of these systems, which we plan to make freely available over the web to help other scientists find their system of choice.
We will then focus on an in-depth evaluation of the actual performance of the most promising system. To do this, we will design a general scheme for handling arbitrary simulation results by this system, including a facility for uploading and identifying simulation results by filtering based on input parameters. We will then implement various data processing algorithms for large-scale data analytics on top of the system. We will do so for the general case (e.g., averaging the output parameters of stochastic simulations over corresponding sets of input parameters, supporting queries needed for parameter estimation via Approximate Bayesian Computation as described in Beaumont & Rannala, 2004).
As a particular test-case of our proposed solution, and given the expertise of the participants in the project, we will import live production data from the evolution@home global computing system (>1 million single simulation runs worth >500 CPU years; a general system for exploring evolutionary models that is currently used to explore potential genetic causes for extinctions of endangered species). We will run the same general purpose processing pipelines developed above in order to test their performance in a real-world production setting and write a report about our experiences.
Our goal is to implement a system that supports the features above in an easy-to-use way in order to reduce the pain associated with data management for many scientists with a particular view to biologists who want to minimise the time required for obtaining workable solutions. In the end, we will have produced a prototype distributed data management system for simulation-based studies, which also provides the functionality for large-scale data analytics. In the process, the system will help address the question of whether a “NoSQL” data management system is capable of serving the needs of scientists.
References:
(1) Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM (2010) vol. 53 (1)
(2) Pavlo et al. A comparison of approaches to large-scale data analysis. SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data (2009)
(3) Dean J & Ghemawat, S (2010) MapReduce: A Flexible Data Processing Tool Communications of the ACM (2010) vol. 53 (1)
(4) Cudre-Mauroux et al. A demonstration of SciDB: a science-oriented DBMS. Proceedings of the VLDB Endowment (2009) vol. 2 (2)
(5) Beaumont, M. A. & Rannala, B. 2004 The Bayesian revolution in genetics. Nat. Rev. Genet. 5, 251-261.