MARTIN-LUTHER-UNIVERSITÄT HALLE-WITTENBERG | RESEARCH PROJECT |
Institut für Informatik | Deductive Databases |
Prof. Dr. Stefan Brass | Push Method |
This page is under construction. You will find a list of benchmark results in our recent WLP paper: [PDF]. See also [http://users.informatik.uni-halle.de/~stephan/Push/Tests/].
We use benchmarks from the OpenRuleBench benchmark suite; the benchmarks considered below are the DBLP benchmark, the Join1 benchmark, transitive closure (without and with query bindings), and the Wine Ontology benchmark.
We compare our implementation with XSB, YAP, and DLV. Older versions of these systems have also been used in the original OpenRuleBench performance comparison. However, the OpenRuleBench publications contain some further systems which were not easily accessible to us.
In addition, we started to do comparisons with further systems that are not contained in the original OpenRuleBench publications, in particular Mercury and the relational DBMS HSQLDB.
We executed the benchmarks on a machine with two AMD Opteron 6272 (Interlagos) CPUs with 16 cores each. However, the current version of our program does not use multiple threads (we will consider this in future research). The machine has 64 GB of RAM (DDR3-1600 DIMMs). It runs Debian GNU/Linux (x86_64, kernel 3.2.63). The programs and data files are stored on a local hard disk (500 GB, SATA-2).
We used the settings in the OpenRuleBench scripts for the systems tested there (unless otherwise noted). These were discussed with the system authors, so presumably they are the best possible selection of options (at least at the time when OpenRuleBench was created). However, in some cases we actually found possibilities for improving the benchmark results by using other options. Then we report results for both settings.
An important difference between our implementation and most of the above systems (except Mercury) is that we compile to machine code (via C++), whereas XSB and YAP emulate an abstract machine (variants of the WAM, the standard abstract machine for Prolog). If one looks at the numbers in the paper "Optimizing Bytecode Emulation for Prolog" by V. S. Costa (PPDP'99), the speedup of native code for the benchmarks studied there ranges from 1.5 to 5.6, with a median of 2.6. More numbers for this comparison are contained in "Improving Compilation of Prolog to C Using Program Information" by J. Morales and M. Carro; there, the values range from 0.92 to 3.0, with an average of 1.57. See also "Improving the Compilation of Prolog to C Using Moded Types and Determinism Information" by José F. Morales, Manuel Carro, and Manuel Hermenegildo (2004), where the average speedup is 3.18. Of course, the speedup also depends on the granularity of the operations that are executed. For instance, if many operations are index lookups or other large database operations, these run as native code in the virtual machine emulator anyway, so the overhead of interpretation does not matter much in this case. We plan to check an interpreted version of our approach to make the comparison more reliable. In the meantime, one can probably assume that a factor of about 3 can be explained by the difference between native code and interpreted instructions of a virtual machine.
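To illustrate what compilation to C++ means here, consider a single join rule a(X,Y) :- b(X,Z), c(Z,Y). The joins of such a rule become ordinary loops and index lookups in C++, which the C++ compiler then turns into machine code; a WAM-based system instead executes a sequence of interpreted abstract machine instructions for the same rule. The following is only a minimal hand-written sketch of this idea (the data structures and names are simplifying assumptions, not the code actually produced by our translator):

#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified relations: b is a list of (X,Z) pairs,
// c is indexed on its first argument (Z -> all Y with c(Z,Y)).
std::vector<std::pair<int,int>> b;
std::unordered_map<int, std::vector<int>> c_index;

// The rule a(X,Y) :- b(X,Z), c(Z,Y), compiled into nested loops:
void rule_a() {
    for (const auto& b_fact : b) {                  // b(X,Z)
        int X = b_fact.first, Z = b_fact.second;
        auto it = c_index.find(Z);                  // index lookup for c(Z,_)
        if (it == c_index.end()) continue;
        for (int Y : it->second)                    // c(Z,Y)
            std::printf("a(%d,%d)\n", X, Y);        // derived fact (here simply printed)
    }
}

int main() {
    b = {{1,2}, {3,4}};
    c_index[2] = {5, 6};
    c_index[4] = {7};
    rule_a();                                       // prints a(1,5), a(1,6), a(3,7)
    return 0;
}

In an emulator, each of these small steps corresponds to one or more interpreted instructions; only larger operations such as the index lookup itself run as native library code in both settings, which is why the speedup factor depends on the granularity of the operations.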
This benchmark uses some data from the DBLP bibliography for computer science. The data is stored in an EAV-table with facts of the form
att('node12eo5mnsvx56023',title,'On the Power of Magic.').
The data file is 122 MB in size and contains data about 200,000 publications (currently, DBLP has data about 3.3 million publications). The total number of facts (table rows) in the data file is about 2.4 million (2 437 867), i.e. there are on average 12 facts per publication. The data file is contained in the zip archive which can be downloaded from the OpenRuleBench Download Page. Benchmark results for a larger data file (3 GB) are reported below. That file uses all DBLP data which were available at the time of writing (October 2016).
The query is a four-fold self join:
answer(Id, T, A, Y, M):-
att(Id, title, T),
att(Id, year, Y),
att(Id, author, A),
att(Id, month, M).
The benchmark results are:
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push | 2.565 s | 0.022 s | 2.610 s | 1.0 | 385.166 MB |
XSB (orig. settings) | 89.045 s | 2.690 s | 92.390 s | 35.4 | 415.500 MB |
XSB (with trie index) | 90.710 s | 0.269 s | 91.275 s | 35.0 | 380.259 MB |
XSB (with load_dync) | 28.505 s | 2.669 s | 31.937 s | 12.2 | 415.239 MB |
XSB (trie+load_dync) | 42.125 s | 0.245 s | 42.933 s | 16.4 | 379.666 MB |
YAP | 24.878 s | 7.370 s | 32.438 s | 12.4 | 813.760 MB |
DLV | 20.110 s | | 25.898 s | 9.9 | 926.864 MB |
Mercury | 31.642 s | 0.503 s | 32.459 s | 12.4 | 151.081 MB |
HSQLDB | 32.471 s | 3.863 s | 40.777 s | 15.6 | 3000.000 MB |
In the OpenRuleBench programs, XSB did not use the trie index, but adding it dramatically improves the execution time (the influence on the total time is not big, because the total time is dominated by the loading time).
Please note: we only recently learned that XSB also has a loader predicate load_dync, which is written in C, in contrast to load_dyn, which is written in Prolog. This gives a speedup of about a factor of 3 in loading time. The measurements above for the DBLP benchmark already include it. Of course, it should also be used in the other benchmarks; new numbers will be published soon.
In order to check how much memory was used by the benchmark data and how much by the program code of the system itself (including possibly large libraries), we determined the memory used when just starting and stopping each system (without loading data or compiling rules). This is the "Base Memory" shown in the following table. The column "Mem. Diff." shows the total memory used for the DBLP benchmark minus this base memory. It turned out that the influence of the base memory is small; most memory was actually used for the data:
System | Base Memory (MB) | Mem. Diff. (MB) |
---|---|---|
Push | 1.182 | 383.990 |
XSB | 10.965 | 404.535 |
XSB (trie) | 10.965 | 369.294 |
YAP | 4.849 | 808.911 |
DLV | 0.697 | 926.167 |
The SQL query we used for HSQLDB was:
SELECT COUNT(*)
FROM ATT ATT_TITLE,
ATT ATT_YEAR,
ATT ATT_AUTHOR,
ATT ATT_MONTH
WHERE ATT_TITLE.ATTRIBUTE = 'title'
AND ATT_YEAR.ATTRIBUTE = 'year'
AND ATT_AUTHOR.ATTRIBUTE = 'author'
AND ATT_MONTH.ATTRIBUTE = 'month'
AND ATT_TITLE.DOC_ID = ATT_YEAR.DOC_ID
AND ATT_TITLE.DOC_ID = ATT_AUTHOR.DOC_ID
AND ATT_TITLE.DOC_ID = ATT_MONTH.DOC_ID
We used an index over (ATTRIBUTE, DOC_ID). We also tried an index with the two attributes reversed, as well as one over all three attributes, but this did not change much. HSQLDB used three threads (300% CPU) and 3 GB of main memory (i.e. three times the system resources of the deductive systems above). Although there might be further options for tuning HSQLDB, it is good to see that a well-known open source DBMS (which claims to beat MySQL) does not give better "off the shelf" performance than the above logic programming systems.
We downloaded the XML data file from DBLP on October 3, 2016, and translated it to Datalog facts with a small Java program using the Xerces SAX library. The XML file contains data about approximately 5.2 million publications (and authors); the result contains 50 million Datalog facts (50 221 773). The file is about 3 GB in size (3 082 908 204 bytes). Here it is: dblp_big.data.
Of course, the query is the same as in the original DBLP benchmark (a four-fold self join):
answer(Id, T, A, Y, M):-
att(Id, title, T),
att(Id, year, Y),
att(Id, author, A),
att(Id, month, M).
The benchmark results so far are:
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push | 76.881 s | 0.457 s | 77.430 s | 1.0 | 6546.402 MB |
XSB (orig. settings) | 1645.074 s | 125.320 s | 1780.890 s | 23.0 | 8312.336 MB |
XSB (with trie index) | 1640.754 s | 5.298 s | 1655.023 s | 21.4 | 7078.505 MB |
XSB (with load_dync) | 427.161 s | 126.207 s | 563.947 s | 7.3 | 8312.061 MB |
XSB (trie+load_dync) | 656.634 s | 5.407 s | 670.997 s | 8.7 | 7078.563 MB |
YAP | 462.178 s | 913.041 s | 1392.455 s | 18.0 | 16114.804 MB |
Mercury | 1736.895 s | 10.440 s | 1747.725 s | 22.6 | 1921.966 MB |
The test with HSQLDB was disappointing: even with 50 GB of memory for the JVM, there was an OutOfMemoryError after inserting 12.5 million rows, i.e. only one fourth of the input rows from the 3 GB file. Autocommit was switched on, and one index was declared (over the first two attributes). We tried several times with slightly different settings, also with a cached table (which is stored on disk). But with so much memory, we would have expected the "off the shelf" settings to work. It seems that for one byte in the data file, more than 50 bytes of RAM are needed (12.5 million rows correspond to roughly a quarter of the 3 GB file, i.e. about 750 MB of input, for which the 50 GB of heap did not suffice).
More results for this benchmark follow soon. Obviously, we will test alternative database systems.
The Join1 Benchmark from the OpenRuleBench benchmark suite contains the following rules:
a(X, Y) :- b1(X, Z), b2(Z, Y).
b1(X, Y) :- c1(X, Z), c2(Z, Y).
b2(X, Y) :- c3(X, Z), c4(Z, Y).
c1(X, Y) :- d1(X, Z), d2(Z, Y).
The EDB predicates are c2, c3, c4, d1, and d2. There are three data files: one with 10 000 facts per EDB predicate (i.e. 50 000 facts in total), one with 50 000 facts each, and one with 250 000 facts each. The data values are randomly generated integers between 1 and 1000.

Different queries are considered in the OpenRuleBench, namely a(X, Y), b1(X, Y), b2(X, Y), and the same with bound first or second argument, e.g. the query a(1, Y). In our test, we have only tried the query a(X, Y) so far. Here are the results for the small data file (10 000 facts per EDB predicate, 624 KB, File):
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push (switch, Bitmap) | 0.014 s | 1.772 s | 1.787 s | 1.0 | 7.311 MB |
Push (Proc., Bitmap) | 0.014 s | 1.818 s | 1.833 s | 1.0 | 15.178 MB |
Push (Dyn. Hashtab) | 0.011 s | 10.372 s | 10.383 s | 5.8 | 45.688 MB |
XSB (batched scheduling) | 0.141 s | 22.432 s | 22.827 s | 12.8 | 82.667 MB |
XSB | 0.148 s | 24.551 s | 25.002 s | 14.0 | 82.526 MB |
XSB (with tabling, indexes) | 0.414 s | 11.302 s | 11.927 s | 6.7 | 123.885 MB |
YAP | 0.421 s | 17.557 s | 18.067 s | 10.1 | 10.933 MB |
YAP (with tabling, indexes) | 0.437 s | 6.719 s | 7.256 s | 4.1 | 133.622 MB |
DLV | | | 147.172 s | 82.4 | 461.646 MB |
The OpenRuleBench scripts use a version of XSB with "batched scheduling" (xsb-btc). However, it gives only a relatively small improvement over the standard version (with "local scheduling"). It turned out, however, that a significant improvement over the settings used in the original OpenRuleBench can be reached for XSB and YAP if one switches on tabling for all predicates and uses more indexes.

Many duplicates are generated if the rules are executed naively, and the runtime of the benchmark is actually dominated by the time needed for the duplicate check. This can be seen from the improvement that an array of bitmaps gives over a standard dynamic hash table (we are planning to improve the hash table implementation). If one does no duplicate elimination at all, one gets 99.9 million result tuples (instead of 1 million), i.e. on average every tuple is computed 100 times. Our implementation of this needed 2.501 s of execution time.
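To make the duplicate check mentioned above concrete: since the data values are integers between 1 and 1000, the derived tuples a(X,Y) can be checked for duplicates with an array of bitmaps (one bitmap per value of X), which avoids hash computation and collision handling for each of the roughly 100 million derivations. This is only a small sketch of the idea under that fixed-domain assumption, not our actual implementation:

#include <bitset>
#include <vector>

const int DOMAIN = 1001;                     // data values are 1..1000

// One bitmap per possible X value; bit Y is set once a(X,Y) has been derived.
std::vector<std::bitset<DOMAIN>> seen(DOMAIN);

// Returns true if a(X,Y) is new and records it; false if it is a duplicate.
inline bool is_new_a(int x, int y) {
    if (seen[x][y]) return false;
    seen[x][y] = true;
    return true;
}

int main() {
    bool first  = is_new_a(42, 7);           // true: new tuple
    bool second = is_new_a(42, 7);           // false: duplicate
    return (first && !second) ? 0 : 1;
}

A general dynamic hash table does the same job without the fixed-domain assumption, but pays for it with hashing and collision handling on every derivation, which is exactly the cost that dominates this benchmark.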
Computing the transitive closure of a binary relation is a typical task involving recursion. The program is:
tc(X, Y) :- par(X, Y).
tc(X, Y) :- par(X, Z), tc(Z, Y).
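A classical way for a bottom-up system to evaluate this recursion is semi-naive iteration, which in every round joins par only with the tc facts that are new since the previous round (cf. the "Seminaive" row in the first table below). The following is a minimal, system-independent sketch of this evaluation; the set-based data structures are a simplifying assumption, a real implementation would use indexes:

#include <cstdio>
#include <set>
#include <utility>
#include <vector>

using Pair = std::pair<int,int>;

// Semi-naive evaluation of:  tc(X,Y) :- par(X,Y).  tc(X,Y) :- par(X,Z), tc(Z,Y).
std::set<Pair> transitive_closure(const std::vector<Pair>& par) {
    std::set<Pair> tc(par.begin(), par.end());        // non-recursive rule: tc := par
    std::vector<Pair> delta(par.begin(), par.end());  // facts that are new in this round
    while (!delta.empty()) {
        std::vector<Pair> new_delta;
        for (const Pair& p : par)                      // par(X,Z)
            for (const Pair& d : delta)                // only new tc(Z,Y) facts
                if (p.second == d.first) {
                    Pair t(p.first, d.second);         // derived tc(X,Y)
                    if (tc.insert(t).second)           // duplicate check
                        new_delta.push_back(t);
                }
        delta.swap(new_delta);                         // next round uses only the new facts
    }
    return tc;
}

int main() {
    std::set<Pair> tc = transitive_closure({{1,2}, {2,3}, {3,1}});
    std::printf("%zu tc facts\n", tc.size());          // 9 facts for this small cycle
    return 0;
}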
OpenRuleBench comes with scripts to generate data files that contain a large number of facts of the following form:
par(253,456).
The following measurements have been done with a file containing 50,000 rows that describe a cyclic graph with nodes 1..1000 (i.e. the values of both attributes are integers in this range): Data File (File2). Measurements with other files are shown below.
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push (Proc.) | 0.014 s | 2.128 s | 2.123 s | 1.0 | 29.508 MB |
Push (switch) | 0.011 s | 2.171 s | 2.190 s | 1.0 | 21.832 MB |
Seminaive | 0.010 s | 4.598 s | 4.613 s | 2.1 | 29.587 MB |
XSB (with trie index) | 1.095 s | 11.357 s | 12.641 s | 6.0 | 134.024 MB |
YAP | 0.424 s | 19.407 s | 19.867 s | 9.4 | 145.566 MB |
DLV | 0.260 s | | 109.673 s | 51.7 | 404.737 MB |
In order to check the scalability, we also did tests with all 10 data files used in the OpenRuleBench. The files are randomly generated with scripts contained in the OpenRuleBench archive (see the OpenRuleBench Download Page). To make everything reproducible, we link to our version of the data files in the table below. We have started to generate more graphs randomly, also using our own programs, but we have not yet finished these benchmark measurements. The graphs generated by the OpenRuleBench scripts contain 1000 or 2000 nodes (encoded as integers from 1 to 1000/2000) and an increasing number of edges (see the table below). There are always two versions, marked as "cyc" and "nocyc", but the "nocyc" versions sometimes contain cycles, too. To reduce the effort, we compared only with XSB. XSB used the original settings from the OpenRuleBench archive, i.e. without the trie index. Note that the factor here compares only the execution times, whereas in the table above, it compares the overall runtimes.
Rows | Domain | Cyc | XSB | Push | Factor | File | Load XSB | Load Push |
---|---|---|---|---|---|---|---|---|
50.000 | 1000 | N | 5.980 s | 0.732 s | 8.2 | File1 | 1.144 s | 0.010 s |
50.000 | 1000 | Y | 15.513 s | 2.160 s | 7.2 | File2 | 1.128 s | 0.008 s |
250.000 | 1000 | N | 33.274 s | 4.350 s | 7.6 | File3 | 13.020 s | 0.045 s |
250.000 | 1000 | Y | 82.497 s | 12.440 s | 6.6 | File4 | 11.968 s | 0.043 s |
500.000 | 1000 | N | 76.673 s | 10.230 s | 7.5 | File5 | 51.475 s | 0.097 s |
500.000 | 1000 | Y | 187.200 s | 30.620 s | 6.1 | File6 | 42.706 s | 0.095 s |
500.000 | 2000 | N | 139.881 s | 23.870 s | 5.9 | File7 | 30.965 s | 0.101 s |
500.000 | 2000 | Y | 329.873 s | 61.460 s | 5.4 | File8 | 26.389 s | 0.103 s |
1.000.000 | 2000 | N | 297.870 s | 56.623 s | 5.3 | File9 | 113.391 s | 0.200 s |
1.000.000 | 2000 | Y | 714.721 s | 150.270 s | 4.8 | File10 | 90.193 s | 0.203 s |
In this benchmark, one is not interested in all pairs of the transitive closure, but only in the nodes reachable from node 1. So the binding pattern for accessing the predicate is bf. The program is again:
The program is again:
tc(X, Y) :- par(X, Y).
tc(X, Y) :- par(X, Z), tc(Z, Y).
However, the query is now tc(1,_). Different systems use different methods to pass such bindings to called predicates. For XSB and YAP, for example, this is a feature of SLD-resolution. For a system based on bottom-up evaluation, the magic set method is the standard solution (DLV probably does this). However, since magic sets are known to have problems with tail recursions, we use the SLDMagic transformation for our "Push" method instead (the "Push" method alone is a pure bottom-up method and would not pass query bindings).
The output of the transformation is:
p1(A) :- p1(B), par(B,A).
p1(A) :- par(1,A).
p0(A) :- p1(B), par(B,A).
p0(A) :- par(1,A).
tc(1,A) :- p0(A).
The predicate p0 is not really needed, but since the SLDMagic prototype produces this program, it would be unfair to improve it manually. Nevertheless, the program can be evaluated extremely fast, since the transformation reduces the arity of the recursive predicate.
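In effect, the transformed program only computes the set of nodes reachable from node 1, i.e. a unary relation instead of the binary tc relation. The following small reachability sketch shows why this is cheap; it only illustrates the effect of the transformation and is not our generated code:

#include <cstdio>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// p1(A) :- par(1,A).   p1(A) :- p1(B), par(B,A).
// Computes p1: the nodes reachable from `start` via one or more par edges.
std::unordered_set<int> reachable_from(
        const std::unordered_map<int, std::vector<int>>& par,  // B -> all A with par(B,A)
        int start) {
    std::unordered_set<int> p1;
    std::vector<int> todo = {start};
    while (!todo.empty()) {
        int b = todo.back(); todo.pop_back();
        auto it = par.find(b);
        if (it == par.end()) continue;
        for (int a : it->second)
            if (p1.insert(a).second)    // new p1 fact: follow its outgoing edges later
                todo.push_back(a);
    }
    return p1;                          // p0 equals p1 here, and tc(1,A) :- p0(A)
}

int main() {
    std::unordered_map<int, std::vector<int>> par = {{1,{2}}, {2,{3}}, {3,{2}}, {4,{5}}};
    std::unordered_set<int> p1 = reachable_from(par, 1);
    std::printf("%zu nodes reachable from 1\n", p1.size());    // nodes 2 and 3, but not 4 or 5
    return 0;
}

Every node is put on the work list at most once, so the effort is roughly proportional to the number of reachable edges, independently of the size of the full transitive closure.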
Here are the benchmark results:
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push (switch) | 0.016 s | 0.004 s | 0.021 s | 1.0 | 1.889 MB |
Push (Proc.) | 0.017 s | 0.005 s | 0.021 s | 1.0 | 1.917 MB |
XSB | 1.098 s | 7.142 s | 8.418 s | 400.9 | 86.936 MB |
YAP | 0.440 s | 8.743 s | 9.246 s | 440.3 | 91.325 MB |
DLV | 0.260 s | | 110.779 s | 5275.2 | 404.736 MB |
At least the standard version of tabling has the same problem with tail recursions as magic sets: if one materializes the derived tc facts in this case, one gets not only the facts of the form tc(1,Y), but all facts tc(X,Y) for every X reachable from 1. Depending on the graph, this can make the difference between a linear and a quadratic number of facts (for instance, on a simple chain 1 → 2 → … → n, the query needs only the n-1 facts tc(1,Y), whereas the full materialization contains about n²/2 facts). XSB and YAP might be affected by this problem.
So it could be argued that the good results for the Push method are mainly due to the SLDMagic program transformation. Therefore, in a further test, the SLDMagic-transformed program was also given as input to the other systems. Indeed, they did profit from the transformation (in runtime as well as in memory usage):
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push | 0.016 s | 0.004 s | 0.021 s | 1.0 | 1.917 MB |
XSB (with SLDMagic) | 1.093 s | 0.010 s | 1.292 s | 61.5 | 14.298 MB |
YAP (with SLDMagic) | 0.449 s | 0.036 s | 0.562 s | 26.8 | 16.696 MB |
DLV (with SLDMagic) | 0.260 s | | 0.389 s | 18.5 | 11.997 MB |
Here are some more measurements for the TCBF benchmark with larger data files. They show that there is no essential difference between a version that uses local variables and offers no cursor interface (tcbf_1_f*), the standard version with switch and cursor interface (tcbf_2_f*), and the new version with recursive functions (tcbf_4_f*). Here f* selects the input file (f1 to f10, as in the TCFF benchmark):
(Some of the time columns are measured by the benchmark program ydb itself; columns 6 to 9 are measured by /usr/bin/time. The value "Warning" in column 5 means that real time and CPU time differ; the larger of the two is printed.) The measurements were produced with a script run_bench, which calls the benchmark program (ydb) in benchmark mode, adds the output of /usr/bin/time, and appends its output to the file BENCHMARKS. The script run_bench_10 runs run_bench ten times, and run_bench_avg computes the averages with BENCHMARKS as input.

OpenRuleBench also contains a "Wine Ontology Benchmark". It consists of 961 rules with 225 IDB predicates, of which all but one are recursive, and 113 EDB predicates. The program is basically one big recursive component.
Actually, the wine ontology program does not do what it is supposed to do: the test query asks for Californian wines, but the result contains many other objects, too. However, we checked that XSB produces the same output. Nevertheless, the program is a challenging test for a deductive system, no matter whether it computes something reasonable. The wine ontology was developed for the OWL guide. That guide links to a site which is no longer available, but the ontology is probably the one available here: DAML File. Thanks to Boris Motik, who referred us to the paper "Approximate OWL Instance Retrieval with SCREECH" by Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph and Tuvshintur Tserendorj (2008). It introduces an approximate translation from OWL DL TBoxes to Datalog. Although it is not completely clear yet whether this translation was used, the approximation would explain why the result may contain wrong answers.
System | Load Time | Execution Time | Total Time | Factor | Memory |
---|---|---|---|---|---|
Push | 0.001 s | 2.255 s | 2.260 s | 1.0 | 9.236 MB |
XSB | 0.106 s | 8.548 s | 8.851 s | 3.9 | 322.894 MB |
YAP | 0.052 s | 10.793 s | 10.899 s | 4.8 | 334.761 MB |
DLV | 0.060 s | | 31.743 s | 14.0 | 42.452 MB |
Note that the time for the transformation from Datalog to C++ and for the compilation of the C++ program is not included in the times shown for the Push method. The reason is that this has to be done only once; afterwards, the resulting program can be executed many times with different database states. Since this time is in particular independent of the size of the database state, its effect becomes smaller for larger inputs (which need more execution time). While for all other benchmark problems the transformation and compilation together take less than 0.6 s, for the wine ontology benchmark the resulting program is huge and needs more than a minute of compilation time. We are still working on this problem.
This page is still under construction. Further benchmark results will be posted soon. See also: [http://users.informatik.uni-halle.de/~stephan/Push/Tests/].
Stefan Brass (brass@informatik.uni-halle.de), May 19, 2016
Original URL: http://www.informatik.uni-halle.de/~brass/push/bench.html