BAM
Abstract Machine for Bottom-Up Evaluation with the Push Method
|
This is a prototype of a deductive database system, i.e. an implementation of the programming and query language Datalog. Datalog is a simplified and logically more pure version of the programming language Prolog.
So far, most database applications use (at least) two languages: SQL for database queries and updates, and a standard programming language like Java, C++ or PHP for the programming part. In SQL, the declararive programming paradigm has been applied very successfully: One only writes logical conditions for the data one needs, and the DBMS automatically finds an efficient algorithm to compute these data. I.e. one specifies WHAT is searched, but not HOW it is computed. There are two main advantages of this: (1) Increased productivity: A declarative specification is much shorter than an equivalent standard program. (2) Better adaption to new hardware: Because the algorithm is not prescribed as in a standard program, new hardware developments can be used by updating the DBMS, with the queries in the applications unchanged. The goal of deductive databases is to extend the part of the application which can be specified declaratively in this way. Ideally, the entire application is written in a single Datalog-like language.
In the ydb project, our approach is to translate Datalog to C++, which is then compiled. In this way, we hope to reach a high efficiency. Furthermore, the integration with library code written in C++ is simplified. We also try to generate readable code, so that in the end a comparison with hand-crafted code should be possible. Our system uses the "Push" method developed by Stefan Brass and Heike Stephan. This uses a derived tuple immediately for deriving further tuples (as far as possible). In contrast, the standard textbook implementation of Datalog first derives all tuples which are derivable by a rule, and stores them in a temporary relation, before execution moves to the next rule. One advantage of the "Push" method is that it avoids copying of data values as far as possible. A derived tuple is represented by values in C++ variables and a position in the code. If from this tuple, another tuple can be derived, which contains some of these values (in possibly other argument positions), no copying is usually done. Furthermore, constants contained in the program rules often do not have to be represented explicitly, because we specialize rules in the Datalog-to-C++ compilation (this feature can be selectively switched off in case the generated program becomes too big). Of course, in order to detect duplicates and avoid possible non-termination because of the repeated generation of duplicates, sometimes generated tuples must be stored. Because the only purpose of this is the detection of duplicates, a hash table or other data structure for efficient element test can be used. We also plan to use nested data structures (NF^2 tables) in case one variable changes more often than another one: The "Push" method can be seen as working with "virtual NF^2 tables", and for the duplicate test, these virtual tables must be materialized. Another case where intermediate relations must be stored are rules containing more than one body literal with a derived predicate. In this case, a tuple generated for one such body literal must be stored until all possible join partners for the other such body literals have been generated. However, a Datalog-to-Datalog translation like SLDMagic which is applied before the Datalog-to-C++ transformation, mainly produces rules of this simple type (only one body literal with a derived predicate).