MARTIN-LUTHER-UNIVERSITÄT HALLE-WITTENBERG | RESEARCH PROJECT
Institut für Informatik | Deductive Databases
Prof. Dr. Stefan Brass | Push Method

 

 

Bottom-Up Evaluation with the "Push Method"

Benchmark Results

 

This page is under construction. A list of benchmark results can be found in our recent WLP paper: [PDF]. See also [http://users.informatik.uni-halle.de/~stephan/Push/Tests/].

We use benchmarks from the OpenRuleBench benchmark suite:

We compare our implementation with the following systems. Older versions of these systems have also been used in the original OpenRuleBench performance comparison. However, the OpenRuleBench publications also cover some further systems that were not easily accessible to us.

In addition, we have started to compare with the following systems (which are not contained in the original OpenRuleBench publications):

We executed the benchmarks on a machine with two AMD Opteron 6272 (Interlagos) 16-core CPUs. However, the current version of our program does not use multiple threads (we will consider this in future research). The machine has 64 GB of RAM (DDR3 1600 DIMM) and runs Debian GNU/Linux (x86_64, kernel 3.2.63). The programs and data files are stored on a local hard disk (500 GB, SATA2).

We used the settings in the OpenRuleBench scripts for the systems tested there (unless otherwise noted). These settings were discussed with the system authors, so presumably they were the best possible selection of options, at least at the time when OpenRuleBench was created. However, in some cases we actually found possibilities for improving the benchmark results by using other options. In such cases, we report results for both settings.

An important difference between our implementation and most of the above systems (except Mercury) is that we compile to machine code (via C++), whereas XSB and YAP emulate an abstract machine (variants of the WAM, the standard abstract machine for Prolog). If one looks at the numbers in the paper "Optimizing Bytecode Emulation for Prolog" by V. S. Costa (PPDP'99), the speedup of native code for the benchmarks studied there ranges from 1.5 to 5.6, with a median of 2.6. More numbers for this comparison are contained in "Improving Compilation of Prolog to C Using Program Information" by J. Morales and M. Carro; there, the values range from 0.92 to 3.0, with an average of 1.57. See also "Improving the Compilation of Prolog to C Using Moded Types and Determinism Information" by José F. Morales, Manuel Carro, and Manuel Hermenegildo (2004), where the average speedup is 3.18.

Of course, the speedup also depends on the granularity of the executed operations. For instance, if many operations are index lookups or other large database operations, these run as native code even in a virtual machine emulator, so the overhead of interpretation does not matter much in that case. We plan to check an interpreted version of our approach in order to make the comparison more reliable. In the meantime, one can probably assume that a factor of about 3 can be explained by the difference between native code and interpreted instructions of a virtual machine.

 

DBLP Benchmark:

This benchmark uses some data from the DBLP bibliography for computer science. The data is stored in an EAV table (entity-attribute-value) with facts of the form

att('node12eo5mnsvx56023',title,'On the Power of Magic.').

The data file is 122 MB in size and contains data about 200,000 publications (currently, DBLP has data about 3.3 million publications). The total number of facts (table rows) in the data file is about 2.4 million (2,437,867), i.e. on average about 12 facts per publication. The data file is contained in the zip archive that can be downloaded from the OpenRuleBench Download Page. Benchmark results for a larger data file (3 GB) are reported below; that file uses all DBLP data that were available at the time of writing (October 2016).

The query is a four-fold self join:

answer(Id, T, A, Y, M):-
   att(Id, title, T),
   att(Id, year, Y),
   att(Id, author, A),
   att(Id, month, M).
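
To give an impression of how this query can be answered bottom-up, the following C++ sketch joins the four att literals via one hash index per attribute, keyed by the document ID. It is only an illustration under our own assumptions (the data structures and names are hypothetical); it is not the code generated by the Push method:

#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical in-memory representation of the EAV facts: one multimap per
// attribute of interest, mapping the document ID to the attribute value.
struct EAVIndexes {
    std::unordered_multimap<std::string, std::string> title, year, author, month;
};

// Count the answer(Id, T, A, Y, M) tuples of the four-fold self join.
long long count_answers(const EAVIndexes& db) {
    long long n = 0;
    for (const auto& t : db.title) {                     // att(Id, title, T)
        const std::string& id = t.first;
        auto y = db.year.equal_range(id);                // att(Id, year, Y)
        auto a = db.author.equal_range(id);              // att(Id, author, A)
        auto m = db.month.equal_range(id);               // att(Id, month, M)
        for (auto yi = y.first; yi != y.second; ++yi)
            for (auto ai = a.first; ai != a.second; ++ai)
                for (auto mi = m.first; mi != m.second; ++mi)
                    ++n;                                 // one answer tuple
    }
    return n;
}

int main() {
    EAVIndexes db;                                       // made-up sample facts
    db.title.emplace("node12eo5mnsvx56023", "On the Power of Magic.");
    db.year.emplace("node12eo5mnsvx56023", "1987");
    db.author.emplace("node12eo5mnsvx56023", "C. Beeri");
    db.month.emplace("node12eo5mnsvx56023", "March");
    std::cout << count_answers(db) << "\n";              // prints 1
}

Since every att fact is touched only a small, constant number of times, the execution time is dominated by loading and indexing the 2.4 million facts, which matches the measurements below.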

The benchmark results are:

System Load Time Execution Time Total Time Factor Memory
Push 2.565 s 0.022 s 2.610 s 1.0 385.166 MB
XSB (orig. settings) 89.045 s 2.690 s 92.390 s 35.4 415.500 MB
XSB (with trie index) 90.710 s 0.269 s 91.275 s 35.0 380.259 MB
XSB (with load_dync) 28.505 s 2.669 s 31.937 s 12.2 415.239 MB
XSB (trie+load_dync) 42.125 s 0.245 s 42.933 s 16.4 379.666 MB
YAP 24.878 s 7.370 s 32.438 s 12.4 813.760 MB
DLV 20.110 s - 25.898 s 9.9 926.864 MB
Mercury 31.642 s 0.503 s 32.459 s 12.4 151.081 MB
HSQLDB 32.471 s 3.863 s 40.777 s 15.6 3000.000 MB

In the OpenRuleBench programs, XSB did not use the trie index, but using it dramatically improves the execution time (the influence on the total time is small, because the total time is dominated by the loading time).

Please note: We only recently learned that XSB also offers the loader predicate load_dync, which is written in C, as an alternative to load_dyn, which is written in Prolog. This gives a speedup of about a factor of 3 in loading time. The corresponding measurements for the DBLP benchmark are shown above. Of course, we should use load_dync in the other benchmarks as well; new numbers will be published soon.

In order to check how much memory is used by the data of the benchmark and how much by program code in the system (including possibly large libraries), we determined the memory used by just starting and stopping each system (without loading data or compiling rules). This is the "Base Memory" shown in the following table. The column "Mem. Diff." shows the total memory used for the DBLP benchmark minus this base memory. It turned out that the influence of the base memory is small; most memory was actually used for the data:

System Base Memory (MB) Mem. Diff. (MB)
Push 1.182 383.990
XSB 10.965 404.535
XSB (trie) 10.965 369.294
YAP 4.849 808.911
DLV 0.697 926.167

The SQL query we used for HSQLDB was:

SELECT COUNT(*)
FROM ATT ATT_TITLE, ATT ATT_YEAR, ATT ATT_AUTHOR, ATT ATT_MONTH
WHERE ATT_TITLE.ATTRIBUTE = 'title'
AND ATT_YEAR.ATTRIBUTE = 'year'
AND ATT_AUTHOR.ATTRIBUTE = 'author'
AND ATT_MONTH.ATTRIBUTE = 'month'
AND ATT_TITLE.DOC_ID = ATT_YEAR.DOC_ID
AND ATT_TITLE.DOC_ID = ATT_AUTHOR.DOC_ID
AND ATT_TITLE.DOC_ID = ATT_MONTH.DOC_ID

We used an index over (ATTRIBUTE, DOC_ID). We also tried an index with the two columns in reverse order, and one over all three columns, but neither changed much. HSQLDB used three threads (300% CPU) and 3 GB of main memory (i.e. three times the system resources of the deductive systems above). Although there might be further options for tuning HSQLDB, it is good to see that a well-known open source DBMS (one that claims to beat MySQL) does not give better "off the shelf" performance than the above logic programming systems.

 

DBLP Benchmark (with 3 GB Data File):

We downloaded the XML data file from DBLP on October 3, 2016, and translated it to Datalog facts with a small Java program based on the Xerces SAX parser. The XML file contains data about approximately 5.2 million publications (and their authors); the result contains about 50 million Datalog facts (50,221,773). The file is about 3 GB in size (3,082,908,204 bytes). Here it is: dblp_big.data.

Of course, the query is the same as in the original DBLP benchmark (a four-fold self join):

answer(Id, T, A, Y, M):-
   att(Id, title, T),
   att(Id, year, Y),
   att(Id, author, A),
   att(Id, month, M).

The benchmark results so far are:

System Load Time Execution Time Total Time Factor Memory
Push 76.881 s 0.457 s 77.430 s 1.0 6546.402 MB
XSB (orig. settings) 1645.074 s 125.320 s 1780.890 s 23.0 8312.336 MB
XSB (with trie index) 1640.754 s 5.298 s 1655.023 s 21.4 7078.505 MB
XSB (with load_dync) 427.161 s 126.207 s 563.947 s 7.3 8312.061 MB
XSB (trie+load_dync) 656.634 s 5.407 s 670.997 s 8.7 7078.563 MB
YAP 462.178 s 913.041 s 1392.455 s 18.0 16114.804 MB
Mercury 1736.895 s 10.440 s 1747.725 s 22.6 1921.966 MB

The test with HSQLDB was disappointing: Even with 50 GB of memory for the JVM, there was an OutOfMemoryError after inserting 12.5 million rows, i.e. only about one quarter of the 50 million input rows from the 3 GB file (roughly 0.75 GB of input data). Autocommit was switched on, and one index was declared (over the first two attributes). We tried several times with slightly different settings, also with a cached table (which is stored on disk). With that much memory, we would have expected the "off the shelf" settings to work. It seems that more than 50 bytes of RAM are needed for each byte of the data file!

More results for this benchmark will follow soon. Obviously, we will also test alternative database systems.

 

The Join1 Benchmark:

The Join1 Benchmark from the OpenRuleBench benchmark suite contains the following rules:

a(X, Y)  :- b1(X, Z), b2(Z, Y).
b1(X, Y) :- c1(X, Z), c2(Z, Y).
b2(X, Y) :- c3(X, Z), c4(Z, Y).
c1(X, Y) :- d1(X, Z), d2(Z, Y).

The EDB predicates are c2, c3, c4, d1, d2. There are three data files: one with 10,000 facts per EDB predicate (i.e. 50,000 facts in total), one with 50,000 facts each, and one with 250,000 facts each. The data values are randomly generated integers between 1 and 1000. Different queries are considered in the OpenRuleBench, namely a(X, Y), b1(X, Y), b2(X, Y), and the same with the first or second argument bound, e.g. the query a(1, Y). In our test, we have only tried the query a(X, Y) so far. Here are the results for the small data file (10,000 facts per EDB predicate, 624 KB, File):

System Load Time Execution Time Total Time Factor Memory
Push (switch, Bitmap) 0.014 s 1.772 s 1.787 s 1.0 7.311 MB
Push (Proc., Bitmap) 0.014 s 1.818 s 1.833 s 1.0 15.178 MB
Push (Dyn. Hashtab) 0.011 s 10.372 s 10.383 s 5.8 45.688 MB
XSB (batched scheduling) 0.141 s 22.432 s 22.827 s 12.8 82.667 MB
XSB 0.148 s 24.551 s 25.002 s 14.0 82.526 MB
XSB (with tabling, indexes) 0.414 s 11.302 s 11.927 s 6.7 123.885 MB
YAP 0.421 s 17.557 s 18.067 s 10.1 10.933 MB
YAP (with tabling, indexes) 0.437 s 6.719 s 7.256 s 4.1 133.622 MB
DLV - - 147.172 s 82.4 461.646 MB

The OpenRuleBench scripts use a version of XSB with "batched scheduling" (xsb-btc), but it gives only a relatively small improvement over the standard version (with "local scheduling"). However, it turned out that if one switches on tabling for all predicates and uses more indexes, a significant improvement can be reached for XSB and YAP over the settings used in the original OpenRuleBench. Many duplicates are generated if the rules are executed naively, and the runtime of the benchmark is actually dominated by the time needed for the duplicate check. This can be seen from the improvement that an array of bitmaps gives over a standard dynamic hash table (we are planning to improve the hash table implementation). If one does no duplicate elimination at all, one gets 99.9 million result tuples (instead of 1 million), i.e. on average every result tuple is computed about 100 times. Our implementation of this variant needed 2.501 s execution time.
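
The following minimal C++ sketch illustrates the idea of such a bitmap-based duplicate check for binary facts over the small domain 1..1000 (our own illustration; the names and details are not those of our actual implementation):

#include <bitset>
#include <vector>

constexpr int DOMAIN_SIZE = 1000;                   // data values are 1..1000

class PairSet {
    // seen_[x][y] is true iff the fact p(x, y) has already been derived.
    std::vector<std::bitset<DOMAIN_SIZE + 1>> seen_;
public:
    PairSet() : seen_(DOMAIN_SIZE + 1) {}
    // Returns true if (x, y) is new and marks it as derived.
    bool insert(int x, int y) {
        if (seen_[x][y]) return false;              // duplicate: suppress
        seen_[x][y] = true;
        return true;                                // new fact: push it further
    }
};

Because both arguments come from a domain of only 1000 values, the membership test and the insertion are single bit operations on roughly 125 KB of memory, whereas a dynamic hash table has to compute a hash value and resolve collisions for every derived tuple.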

 

Transitive Closure Benchmark without Query Bindings:

Computing the transitive closure of a binary relation is a typical task involving recursion. The program is:

tc(X, Y) :- par(X, Y).
tc(X, Y) :- par(X, Z), tc(Z, Y).

OpenRuleBench comes with scripts to generate data files that contain a large number of facts of the following form:

par(253,456).

The following measurements were done with a file containing 50,000 rows that describe a cyclic graph with nodes 1..1000 (i.e. the domain of both attributes is the integers in this range): Data File (File2). Measurements with other files are shown below.

System Load Time Execution Time Total Time Factor Memory
Push (Proc.) 0.014 s 2.128 s 2.123 s 1.0 29.508 MB
Push (switch) 0.011 s 2.171 s 2.190 s 1.0 21.832 MB
Seminaive 0.010 s 4.598 s 4.613 s 2.1 29.587 MB
XSB (with trie index) 1.095 s 11.357 s 12.641 s 6.0 134.024 MB
YAP 0.424 s 19.407 s 19.867 s 9.4 145.566 MB
DLV 0.260 s - 109.673 s 51.7 404.737 MB
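
For readers unfamiliar with seminaive evaluation (the "Seminaive" row above), the following C++ sketch shows the textbook formulation for the transitive closure program. The data structures are our own choice for illustration and not necessarily those of the measured implementation:

#include <bitset>
#include <utility>
#include <vector>

constexpr int N = 1000;                              // nodes are 1..N

std::vector<std::bitset<N + 1>>
transitive_closure(const std::vector<std::pair<int, int>>& par_facts) {
    // Index par by its second argument: par_inv[z] contains all x with par(x, z).
    std::vector<std::vector<int>> par_inv(N + 1);
    for (auto [x, z] : par_facts) par_inv[z].push_back(x);

    std::vector<std::bitset<N + 1>> tc(N + 1);       // tc[x][y]: tc(x, y) derived
    std::vector<std::pair<int, int>> delta;          // facts derived in the last round

    for (auto [x, y] : par_facts)                    // tc(X,Y) :- par(X,Y).
        if (!tc[x][y]) { tc[x][y] = true; delta.emplace_back(x, y); }

    while (!delta.empty()) {                         // tc(X,Y) :- par(X,Z), tc(Z,Y).
        std::vector<std::pair<int, int>> next;
        for (auto [z, y] : delta)                    // join par only with the new tc facts
            for (int x : par_inv[z])
                if (!tc[x][y]) { tc[x][y] = true; next.emplace_back(x, y); }
        delta = std::move(next);
    }
    return tc;
}

The crucial point of seminaive evaluation is that the recursive rule is only joined with tc facts that are new since the previous round, so no derivation is repeated; nevertheless, every derived fact still has to pass the duplicate check.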

In order to check scalability, we also did tests with all 10 data files used in the OpenRuleBench. The files are randomly generated with scripts contained in the OpenRuleBench archive, see the OpenRuleBench Download Page. To make everything reproducible, we link to our version of the data files in the table below. We have started to generate more graphs randomly, also using our own programs, but we have not yet finished these benchmark measurements. The graphs generated by the OpenRuleBench scripts contain 1000 or 2000 nodes (encoded as integers from 1 to 1000/2000) and an increasing number of edges (see the table below). There are always two versions, marked as "cyc" and "nocyc", but even the "nocyc" versions sometimes contain cycles. To reduce the effort, we compared only with XSB, which again used the original settings from the OpenRuleBench archive, i.e. without the trie index. Note that the factor here compares only the execution times, whereas in the table above it compares the total times.

Rows Domain Cyc XSB Exec Push Exec Factor File XSB Load Push Load
50,000 1000 N 5.980 s 0.732 s 8.2 File1 1.144 s 0.010 s
50,000 1000 Y 15.513 s 2.160 s 7.2 File2 1.128 s 0.008 s
250,000 1000 N 33.274 s 4.350 s 7.6 File3 13.020 s 0.045 s
250,000 1000 Y 82.497 s 12.440 s 6.6 File4 11.968 s 0.043 s
500,000 1000 N 76.673 s 10.230 s 7.5 File5 51.475 s 0.097 s
500,000 1000 Y 187.200 s 30.620 s 6.1 File6 42.706 s 0.095 s
500,000 2000 N 139.881 s 23.870 s 5.9 File7 30.965 s 0.101 s
500,000 2000 Y 329.873 s 61.460 s 5.4 File8 26.389 s 0.103 s
1,000,000 2000 N 297.870 s 56.623 s 5.3 File9 113.391 s 0.200 s
1,000,000 2000 Y 714.721 s 150.270 s 4.8 File10 90.193 s 0.203 s

 

Transitive Closure Benchmark with the First Argument Bound:

In this benchmark, one is not interested in all connected pairs of the transitive closure, but only in the nodes reachable from node 1. So the binding pattern for accessing the predicate is bf (first argument bound, second argument free). The program is again:

tc(X, Y) :- par(X, Y).
tc(X, Y) :- par(X, Z), tc(Z, Y).

However, the query is now tc(1,_). Different systems use different methods to pass bindings to called predicates. For XSB and YAP, e.g., this is a built-in feature of SLD resolution. For a system based on bottom-up evaluation, the magic set method is the standard solution; DLV probably does this. However, since magic sets are known to have problems with tail recursion, we use the SLDMagic transformation for our "Push" method instead (the "Push" method alone is a pure bottom-up method and would not pass query bindings). The output of the transformation is:

p1(A) :- p1(B), par(B,A).
p1(A) :- par(1,A).
p0(A) :- p1(B), par(B,A).
p0(A) :- par(1,A).
tc(1,A) :- p0(A).
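
To see why the transformed program can be evaluated so quickly, note that p1 simply collects the nodes reachable from node 1, and its arity is 1 instead of 2. A minimal C++ sketch of this computation (our own illustration, not the code generated by our system) is:

#include <bitset>
#include <vector>

constexpr int N = 1000;                              // nodes are 1..N

// par[b] contains all a with par(b, a).
std::bitset<N + 1> reachable_from_1(const std::vector<std::vector<int>>& par) {
    std::bitset<N + 1> p1;                           // p1[a]: p1(a) has been derived
    std::vector<int> todo;                           // newly derived, not yet processed
    for (int a : par[1])                             // p1(A) :- par(1,A).
        if (!p1[a]) { p1[a] = true; todo.push_back(a); }
    while (!todo.empty()) {                          // p1(A) :- p1(B), par(B,A).
        int b = todo.back(); todo.pop_back();
        for (int a : par[b])
            if (!p1[a]) { p1[a] = true; todo.push_back(a); }
    }
    return p1;                                       // the answers: tc(1, A) for every A in p1
}

This touches every edge at most once and stores only a set of nodes, which explains the very small runtime and memory consumption in the table below.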

The predicate p0 is not really needed, but since the SLDMagic prototype produces this program, it would be unfair to improve it manually. Nevertheless, the program can be evaluated extremely fast, since the transformation reduces the arity of the recursive predicate. Here are the benchmark results:

System Load Time Execution Time Total Time Factor Memory
Push (switch) 0.016 s 0.004 s 0.021 s 1.0 1.889 MB
Push (Proc.) 0.017 s 0.005 s 0.021 s 1.0 1.917 MB
XSB 1.098 s 7.142 s 8.418 s 400.9 86.936 MB
YAP 0.440 s 8.743 s 9.246 s 440.3 91.325 MB
DLV 0.260 s - 110.779 s 5275.2 404.736 MB

At least the standard version of tabling has the same problem with tail recursion as magic sets: If one materializes the derived tc facts in this case, one gets not only facts of the form tc(1,Y), but all facts tc(X,Y) for every X reachable from 1. Depending on the graph, this can make the difference between a linear and a quadratic number of facts (e.g. for a simple chain 1 -> 2 -> ... -> n, about n*n/2 facts instead of n-1). XSB and YAP might be affected by this problem. So it could be argued that the good results for the Push method are mainly due to the SLDMagic program transformation. Therefore, in a further test, the SLDMagic-transformed program was also given as input to the other systems. Indeed, they profited from the transformation (in runtime as well as in memory usage):

System Load Time Execution Time Total Time Factor Memory
Push 0.016 s 0.004 s 0.021 s 1.0 1.917 MB
XSB (with SLDMagic) 1.093 s 0.010 s 1.292 s 61.5 14.298 MB
YAP (with SLDMagic) 0.449 s 0.036 s 0.562 s 26.8 16.696 MB
DLV (with SLDMagic) 0.260 s - 0.389 s 18.5 11.997 MB

Here are some more measurements for the TCBF benchmark with larger data files. They show that there is no essential difference between a version that uses local variables and offers no cursor interface (tcbf_1_f*), the standard version with a switch statement and a cursor interface (tcbf_2_f*), and the new version with recursive functions (tcbf_4_f*). Here, f* selects the input file (f1 to f10, as in the TCFF benchmark):

 

Wine Ontology Benchmark:

OpenRuleBench also contains a "Wine Ontology Benchmark". It consists of 961 rules with 225 IDB predicates, of which all but one are recursive, and 113 EDB predicates. The program is basically one big recursive component.

Actually, the wine ontology program does not do what it is supposed to do: The test query asks for Californian wines, but the result contains many other objects, too. However, we checked that XSB produces the same output. Nevertheless, the program is a challenging test for a deductive system, no matter whether it computes something reasonable. The wine ontology was developed for the OWL guide. That guide links to a site which is no longer available, but the ontology is probably the one available here: DAML File. Thanks to Boris Motik, who referred us to the paper "Approximate OWL Instance Retrieval with SCREECH" by Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph, and Tuvshintur Tserendorj (2008), which introduces an approximate translation from OWL DL TBoxes to Datalog. Although it is not completely clear yet whether this translation was used, the approximation would explain why the result may contain wrong answers.

System Load Time Execution Time Total Time Factor Memory
Push 0.001 s 2.255 s 2.260 s 1.0 9.236 MB
XSB 0.106 s 8.548 s 8.851 s 3.9 322.894 MB
YAP 0.052 s 10.793 s 10.899 s 4.8 334.761 MB
DLV 0.060 s - 31.743 s 14.0 42.452 MB

Note that the time for the transformation from Datalog to C++ and the compilation of the C++ program is not included in the times shown for the Push method. The reason is that this has to be done only once; afterwards, the resulting program can be executed many times with different database states. Since this step is in particular independent of the size of the database state, its effect becomes smaller for larger inputs (which need more execution time). While for all the other benchmark problems the transformation and compilation together take less than 0.6 s, for the wine ontology benchmark the resulting program is huge and needs more than a minute of compilation time. We are still working on this problem.

 

More Benchmark Results:

This page is still under construction. Further benchmark results will be posted soon. See also: [http://users.informatik.uni-halle.de/~stephan/Push/Tests/].

 


Stefan Brass (brass@informatik.uni-halle.de), May 19, 2016

Original URL: http://www.informatik.uni-halle.de/~brass/push/bench.html