In the course of my diploma thesis, I evaluated the performance of several RDF stores when small pieces of information are requested from a large dataset (DBpedia infoboxes plus two very small sets). The benchmark queries employ varying levels of joins and constraints.
As of now, only the configuration for OpenLink Virtuoso has been optimized - this must be taken into consideration when comparing performance.
News
- 2008/01/17: Added a third additional index for Virtuoso; results updated accordingly.
- 2008/01/16: Updated results for OpenLink Virtuoso, now using more suitable indexes, which results in significantly shorter query times. Incorporated feedback from Andy Seaborne (HP). Switched back to linear scales.
- 2008/01/14: Added queries for download.
- 2008/01/13: Initial release.
1. Motivation
The use case is a mobile client-server application that allows for the exploration of Linked Data based on geographical coordinates. As the application will be user-facing, short response times are of high importance. In this context, queries are expected to yield small result sets, but involve large datasets (such as DBpedia) and possibly several levels of joins.
2. Tested RDF Stores
RDF stores were required to handle large datasets such as DBpedia, to support SPARQL and Named Graphs, and to provide a means of implementing owl:sameAs inference (i.e. a built-in capability or a suitable programming interface). The following stores were selected:
2.1 OpenLink Virtuoso Open-Source Edition 5.0.2
Virtuoso was compiled from source for x64. Import was performed using the JDBC interface; data was loaded using the TTLP_MT command.
The following parameters were modified from the default configuration:
[Database]
MaxCheckpointRemap = 131072 ; set to 1 GB as the database size is roughly 4 GB (reference)
[Parameters]
NumberOfBuffers = 85197 ; 65% of RAM (reference)
MaxDirtyBuffers = 63898 ; about 3/4 of buffers (reference)
TransactionAfterImageLimit = 1500000000 ; required during import due to the size of the infoboxes set (reference)
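As a rough cross-check of the buffer settings above, the values can be derived from the amount of physical memory. The following is a minimal sketch, assuming that one Virtuoso buffer holds one 8 KB database page; the function names are hypothetical and are not part of Virtuoso.

```python
# Sketch: derive the buffer settings from the available RAM.
# Assumption: one Virtuoso buffer holds one 8 KB database page.
PAGE_SIZE = 8 * 1024  # bytes per buffer/page (assumed)


def number_of_buffers(ram_bytes, fraction=0.65):
    """Number of buffers that fit into the given fraction of RAM."""
    return round(ram_bytes * fraction / PAGE_SIZE)


def max_dirty_buffers(buffers, fraction=0.75):
    """Roughly three quarters of the buffer pool."""
    return round(buffers * fraction)


ram = 1024 ** 3  # 1 GB of physical memory, as on the benchmark machine
buffers = number_of_buffers(ram)
dirty = max_dirty_buffers(buffers)
remap_bytes = 131072 * PAGE_SIZE  # MaxCheckpointRemap expressed in bytes

print(buffers, dirty, remap_bytes)
```

Under these assumptions the sketch reproduces the configured values: 85197 buffers, 63898 dirty buffers, and a checkpoint remap of 1 GB.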
In the initial release of this benchmark, Virtuoso's performance was far from ideal. OpenLink traced this back to indexes that were unsuitable for this usage scenario, in which queries do not specify a graph. Following suggestions by OpenLink, the configuration was adjusted to include POGS, PSOG and SOPG indexes in addition to the default OGPS index, resulting in query times that are 3 to 45 times shorter.
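Why the column order of an index matters when no graph is given can be illustrated with a toy model. The sketch below treats an index as a sorted list of tuples that can only be range-scanned on its leading bound columns. It is a simplification for illustration, not Virtuoso's actual index implementation; all names and the sample quads are hypothetical.

```python
from bisect import bisect_left

POSITIONS = {"S": 0, "P": 1, "O": 2, "G": 3}


def build_index(quads, order):
    """Return the quads re-keyed in `order` (e.g. "POGS") and sorted."""
    return sorted(tuple(q[POSITIONS[c]] for c in order) for q in quads)


def usable_prefix(order, bound):
    """Values of the leading index columns that the query pattern binds."""
    prefix = []
    for column in order:
        if column not in bound:
            break  # an unbound column ends the usable prefix
        prefix.append(bound[column])
    return tuple(prefix)


def lookup(index, prefix):
    """Range scan: all index entries that start with `prefix`."""
    matches = []
    for entry in index[bisect_left(index, prefix):]:
        if entry[:len(prefix)] != prefix:
            break
        matches.append(entry)
    return matches


quads = [  # (S, P, O, G), toy data
    ("film:A", "p:starring", "actor:X", "g:1"),
    ("film:A", "p:starring", "actor:Y", "g:1"),
    ("film:B", "p:starring", "actor:X", "g:2"),
    ("film:C", "p:producer", "actor:X", "g:1"),
]
bound = {"P": "p:starring", "O": "actor:X"}  # the graph is left unbound

for order in ("OGPS", "POGS"):
    index = build_index(quads, order)
    prefix = usable_prefix(order, bound)
    print(order, "prefix columns:", len(prefix),
          "entries scanned:", len(lookup(index, prefix)))
```

In this model, OGPS can only use the object as a prefix because the unbound graph column comes second, so the scan returns entries that still have to be filtered by predicate. POGS covers both bound columns and returns only the matching entries.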
2.2 SDB Beta 1
The index layout was tested on PostgreSQL 8.2.5 and MySQL 5.0.45 (x64 versions, default configurations). The hash layout was tested only on PostgreSQL due to performance issues ("Hash loading is very bad on MySQL." - SDB Wiki).
The results obtained for SDB cannot currently be compared to those of Virtuoso, as the underlying databases have not been optimized. Andy Seaborne suggests the use of PostgreSQL's ANALYZE command.
2.3 Sesame 2.0 beta 6
Sesame's good preliminary results and moderate loading times prompted me to explore the effects of supplementary indexes in addition to the default spoc and posc indexes. The following table shows the build times on the full dataset (see the section Queries for query times):
| Index | Build Time [s] |
|---|---|
| opsc | 12666.415 |
| ospc | 12323.288 |
| psoc | 3299.538 |
| sopc | 349.508 |
3. Dataset
The benchmark dataset consists of DBpedia's infoboxes, geocoordinates and homepages datasets with minor corrections:
- infoboxes-fixed.nt (15,472,624 triples; 2.1 GB)
Based on DBpedia's infoboxes.nt dated 2007-08-30. 166 triples from the original set were excluded because they contained excessively large URIs (> 500 characters) that caused importing problems with Virtuoso (DBpedia bug #1871653).
- geocoordinates-fixed.nt (447,517 triples; 64 MB)
Based on DBpedia's geocoordinates.nt dated 2007-08-30. Decimal datatype URI was corrected (DBpedia bug #1817019; resolved).
- homepages-fixed.nt (200,036 triples; 24 MB)
Based on DBpedia's homepages.nt dated 2007-08-30. 3 URLs that included line breaks were manually corrected (fixed for DBpedia 3.0).
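The exclusion of triples with oversized URIs could be reproduced along the following lines. This is a minimal sketch and not the script that was actually used to prepare the dataset; the function names are hypothetical, and the regular expression is a simplification that would also match angle-bracketed text inside a literal.

```python
import re

# Anything between angle brackets without whitespace is treated as a URI.
URI_PATTERN = re.compile(r"<([^<>\s]*)>")


def has_oversized_uri(line, max_length=500):
    """True if any URI on the N-Triples line exceeds `max_length` characters."""
    return any(len(uri) > max_length for uri in URI_PATTERN.findall(line))


def drop_oversized(lines, max_length=500):
    """Split lines into those kept and the number of lines dropped."""
    kept, dropped = [], 0
    for line in lines:
        if has_oversized_uri(line, max_length):
            dropped += 1
        else:
            kept.append(line)
    return kept, dropped


sample = [
    '<http://example.org/s> <http://example.org/p> "short" .',
    "<http://example.org/s> <http://example.org/p> <http://example.org/"
    + "x" * 600 + "> .",
]
kept, dropped = drop_oversized(sample)
print(len(kept), dropped)
```

For a file of the size of infoboxes.nt, the same check would be applied line by line while streaming from the input file to the output file rather than on an in-memory list.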
4. Benchmark Configuration
- Processor: Intel Pentium Dual Core 2.8 GHz
- Physical Memory: 1 GB
- Hard Disk: 40 GB data partition; 2 GB swap
- Operating System: Ubuntu Linux 7.10 64-bit
The small amount of RAM (1 GB vs. a 4 GB dataset) likely impacts the results. Accordingly, the results are meaningful only for comparable configurations.
5. Loading
The RDF stores feature different indexing behaviors: Sesame automatically indexes after each import, while SDB and Virtuoso allow for selective index activation. In order to make load times comparable, the data import was performed as follows:
- infoboxes-fixed.nt was imported with indexes initially disabled in SDB and Virtuoso. Indexes were then activated, and the time required for index creation was factored into the import time.
- geocoordinates-fixed.nt was imported with indexes enabled.
- homepages-fixed.nt was imported with indexes enabled.
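The measurement scheme described above can be sketched as a small timing helper. The two step functions are hypothetical placeholders for store-specific import and index-creation calls; they are not taken from the benchmark itself.

```python
import time


def timed(step):
    """Run `step` and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def total_load_time(import_step, index_step=None):
    """Import time plus, where indexes are built afterwards, index time."""
    seconds = timed(import_step)
    if index_step is not None:
        seconds += timed(index_step)
    return seconds


calls = []  # records the order in which the placeholder steps run


def import_infoboxes():
    calls.append("import")  # placeholder for the bulk import


def create_indexes():
    calls.append("index")  # placeholder for index activation


# Deferred indexing (SDB, Virtuoso): both phases count towards the load time.
deferred = total_load_time(import_infoboxes, create_indexes)
# Stores that index during import (Sesame): a single phase.
inline = total_load_time(import_infoboxes)
print(calls)
```

Adding the index creation time to the import time is what makes a store with deferred indexing comparable to one that indexes during the import.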
5.1 Loading of infoboxes-fixed.nt

5.2 Loading of geocoordinates-fixed.nt

5.3 Loading of homepages-fixed.nt

6. Queries
As little data has been prepared for actual use in the application, the queries are mostly generic in nature. They run against the DBpedia infoboxes set and assess performance with varying levels of joins and constraints.
In order to minimize query caching effects, queries were always executed in order after server startup. An exception was Virtuoso, where a noticeable warm-up delay occurred with the initial query. Accordingly, results for query 1 were obtained by restarting the server and warming it up using query 5.
- Download: queries.tar.gz
6.1 All available information about a specific subject
SELECT ?p ?o WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o
}
6.2 Two degrees of separation from Kevin Bacon (?)
PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 .
}
6.3 Unconstrained query for artworks, artists, museums and their directors
PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director
}
6.4 Homepages of resources roughly in the area of Berlin
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage WHERE {
<http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
<http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s foaf:homepage ?homepage .
FILTER (
?lat <= ?berlinLat + 0.03190235436 &&
?long >= ?berlinLong - 0.08679199218 &&
?lat >= ?berlinLat - 0.03190235436 &&
?long <= ?berlinLong + 0.08679199218)
}
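The FILTER in the query above describes a rectangular window centred on the reference resource. The same test can be written as a plain function, which may help when checking results on the client side. This is a minimal sketch; the function name is hypothetical, and the coordinates used for Berlin are approximate illustrative values rather than values read from the dataset.

```python
LAT_DELTA = 0.03190235436   # half-height of the window, from the query
LONG_DELTA = 0.08679199218  # half-width of the window, from the query


def in_window(lat, long, center_lat, center_long,
              lat_delta=LAT_DELTA, long_delta=LONG_DELTA):
    """Same condition as the SPARQL FILTER: inclusive bounds on both axes."""
    return (center_lat - lat_delta <= lat <= center_lat + lat_delta
            and center_long - long_delta <= long <= center_long + long_delta)


berlin = (52.52, 13.40)  # approximate, for illustration only

print(in_window(52.53, 13.45, *berlin))  # inside the window
print(in_window(52.60, 13.40, *berlin))  # too far north
```

Note that the window is defined in degrees, so its extent in kilometres depends on the latitude of the reference point.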
6.5 Homepages of architects of resources roughly in the area of New York City
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage WHERE {
<http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
<http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s p:architect ?a .
?a foaf:homepage ?homepage .
FILTER (
?lat <= ?nyLat + 0.3190235436 &&
?long >= ?nyLong - 0.8679199218 &&
?lat >= ?nyLat - 0.3190235436 &&
?long <= ?nyLong + 0.8679199218)
}
7. Feedback
Please send comments to Christian Becker.
Further information about our work in the area of the Semantic Web/Web-of-Data can be found at
List of our other open source projects @ Freie Universität Berlin
