Andreas Schultz
Andrea Matteini
Robert Isele
Chris Bizer
Christian Becker

Introduction

This page presents a performance evaluation for LDIF using life science use cases.

The following data sources have been used to evaluate the performance of LDIF:

Benchmark Machine

We used a machine with the following specification for the benchmark experiments:

Hardware:

Software:

Test Procedure

We applied the following test procedure to each test:

  1. Clear Operating System caches
     echo 2 > /proc/sys/vm/drop_caches 
  2. Run test
     java -server -Xmx20G -jar ldif-0.1-single-machine.jar 
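
The two steps above could be scripted roughly as follows. This is only a sketch: the helper names `time_command` and `run_once` are hypothetical, the cache-clearing step needs root privileges, and the command lines are copied as given above without adding any further arguments.

```python
import subprocess
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


# Step 1: clear operating system caches (must run as root).
CLEAR_CACHES = ["sh", "-c", "echo 2 > /proc/sys/vm/drop_caches"]

# Step 2: run the test with the JVM options from the procedure above.
RUN_TEST = ["java", "-server", "-Xmx20G", "-jar", "ldif-0.1-single-machine.jar"]


def run_once():
    """Execute both steps in order and return the run time of step 2."""
    subprocess.run(CLEAR_CACHES, check=True)
    return time_command(RUN_TEST)
```

Calling `run_once()` for each dataset keeps the cache state comparable between runs, which is the purpose of step 1.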

Use case A

In this use case, we evaluated the performance of LDIF when integrating data from the five life science sources described above and translating it into a common target vocabulary.

Datasets

To measure performance at different scales, we used several samples of the input data that differ in the number of resources (1, 10 and 100 thousand) per data source.
For this use case we extracted only data for the human (Homo sapiens, hsa) and mouse (Mus musculus, mmu) species.

Details about the benchmark datasets are summarized in the following table:


Dataset                  1 K          10 K         100 K
Number of input quads    1,090,078    7,632,762    23,927,642
Overall file size        195 MB       1.4 GB       4.7 GB
Download link            1k.zip       10k.zip      100k.zip

Mappings

We defined R2R mappings for translating vocabularies used by the source datasets into the target vocabulary.

Link Specifications

The following Silk identity resolution heuristics are used to find genes and other expressions that are described in multiple datasets.

Results

The following table summarizes the LDIF run times for the different dataset sizes. The overall run time is split according to the different processing steps of the integration process.


Processing step                    1 K         10 K         100 K
Load and build entities for R2R    5.2 sec     47.4 sec     111.7 sec
R2R data translation               3.4 sec     12.9 sec     92.4 sec
Build entities for Silk            1.2 sec     6.9 sec      45.0 sec
Silk Identity Resolution           7.0 sec     45.5 sec     293.7 sec
Final URI rewriting                0.4 sec     1.9 sec      12.4 sec
Overall execution time             20.5 sec    124.3 sec    9.9 min
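
To put the overall run times in relation to the input size, the figures from the two Use case A tables can be turned into a throughput in input quads per second. The sketch below only restates numbers reported on this page; the function name `throughput` is chosen for illustration.

```python
# Figures copied from the Use case A dataset and results tables.
INPUT_QUADS = {"1 K": 1_090_078, "10 K": 7_632_762, "100 K": 23_927_642}
OVERALL_SECONDS = {"1 K": 20.5, "10 K": 124.3, "100 K": 9.9 * 60}


def throughput(quads, seconds):
    """Return input quads processed per second of overall run time."""
    return quads / seconds


for size, quads in INPUT_QUADS.items():
    rate = throughput(quads, OVERALL_SECONDS[size])
    print(f"{size}: {rate:,.0f} quads/sec")
```

The three samples come out at roughly 53,000, 61,000 and 40,000 input quads per second respectively.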

Use case B

In this use case, we evaluated the performance of LDIF when integrating two larger life science datasets and translating them into a common target vocabulary.

Datasets

For this use case we used only the KEGG GENES and UniProt datasets. The two datasets differ greatly in size: converted to N-Triples, the complete KEGG GENES dump is about 28 GB, whereas the UniProt dataset contains over 400 GB of data. Because of this size mismatch we extracted only the data for five species from UniProt: Bos taurus (cattle), Canis familiaris (dog), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly) and Sus scrofa (pig).

For the test, we generated subsets of both data sources amounting together to 25 million, 50 million and 100 million RDF triples.

Details about the benchmark datasets are summarized in the following table, which also provides statistics about the data integration process for each dataset. The original number of input triples decreases during the process because LDIF discards input triples that are irrelevant to the defined mappings and therefore cannot be translated into the target vocabulary. The number decreases again after the actual translation, because the input data uses more verbose vocabularies, so that multiple triples from the input data are combined into single triples in the target vocabulary. The size of the final dataset is the number of quads after the mapping phase plus any provided provenance data.


Dataset                                            25 M          50 M          100 M
Number of input quads                              25,000,000    50,000,000    100,000,000
Number of quads after irrelevance filter           13,576,394    25,397,310    44,249,757
Number of quads after mapping                      4,419,410     11,398,236    24,972,112
Number of pairs of equivalent entities resolved    24,782        113,245       213,062
Overall file size                                  5.6 GB        12 GB         23 GB
Download link                                      25M.zip       50M.zip       100M.zip
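
The two reductions described above can be expressed as the fraction of input quads that remains after each stage. The following sketch only restates the counts from the table; the function name `retention` is chosen for illustration.

```python
# Quad counts copied from the Use case B dataset table:
# (input quads, quads after irrelevance filter, quads after mapping).
STAGES = {
    "25 M": (25_000_000, 13_576_394, 4_419_410),
    "50 M": (50_000_000, 25_397_310, 11_398_236),
    "100 M": (100_000_000, 44_249_757, 24_972_112),
}


def retention(input_quads, after_filter, after_mapping):
    """Return the fractions of input quads left after filtering and after mapping."""
    return after_filter / input_quads, after_mapping / input_quads


for size, counts in STAGES.items():
    kept, mapped = retention(*counts)
    print(f"{size}: {kept:.1%} after filter, {mapped:.1%} after mapping")
```

The share of quads surviving the irrelevance filter falls from about 54% to about 44% as the dataset grows, while the share remaining after mapping rises from about 18% to about 25%.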

Mappings

We defined R2R mappings for translating genes, diseases and pathways from KEGG GENES and genes from UniProt into a proprietary target vocabulary. Some of the more sophisticated mappings in this use case translate complex structural patterns and perform value transformations (e.g. extracting an integer value from a URI). The most common value transformations are extracting strings with a regular expression and modifying the target data types.
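
To illustrate the kind of value transformation meant here, the following sketch extracts an integer from a URI with a regular expression and attaches a target datatype. It is plain Python, not R2R mapping syntax; the URI, the pattern and the helper names are made up for the example.

```python
import re

# Assumed pattern: the identifier is the run of digits at the end of the URI.
TRAILING_INT = re.compile(r"(\d+)$")

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def extract_int(uri):
    """Return the trailing integer of a URI, or None if it has none."""
    match = TRAILING_INT.search(uri)
    return int(match.group(1)) if match else None


def to_typed_literal(value, datatype):
    """Pair a lexical value with a datatype URI, mimicking a datatype change."""
    return (str(value), datatype)


gene_id = extract_int("http://example.org/gene/12345")  # made-up URI
print(to_typed_literal(gene_id, XSD_INTEGER))
```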

Here are some examples of the mappings we used for:

Link Specifications

We defined the following Link Specification for the identity resolution phase:

<Silk>
  <Prefixes>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
    <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
    <Prefix id="property" namespace="http://mywiki/resource/property/" />
    <Prefix id="category" namespace="http://mywiki/resource/category/" />
  </Prefixes>

  <DataSources>
    <DataSource id="Source" type="sparqlEndpoint" >
      <Param name="endpointURI" value="http://localhost:2020/sparql/read" />
    </DataSource>
    <DataSource id="Target" type="sparqlEndpoint" >
      <Param name="endpointURI" value="http://localhost:2020/sparql/read" />
    </DataSource>
  </DataSources>

  <Interlinks>
    <Interlink id="link">
      <LinkType>owl:sameAs</LinkType>

      <SourceDataset dataSource="Source" var="a">
        <RestrictTo>
          { ?a rdf:type category:Gene }
          UNION { ?a rdf:type category:Disease }
          UNION { ?a rdf:type category:Pathway }
        </RestrictTo>
      </SourceDataset>

      <TargetDataset dataSource="Target" var="b">
        <RestrictTo>
          { ?b rdf:type category:Gene }
          UNION { ?b rdf:type category:Disease }
          UNION { ?b rdf:type category:Pathway }
        </RestrictTo>
      </TargetDataset>

      <LinkageRule>
        <Aggregate type="max">
          <Compare metric="equality">
            <Input path="?a/property:UniprotId" />
            <Input path="?b/property:UniprotId" />
          </Compare>
          <Compare metric="equality">
            <Input path="?a/property:EntrezGeneId" />
            <Input path="?b/property:EntrezGeneId" />
          </Compare>
          <Compare metric="equality">
            <Input path="?a/property:MgiMarkerAccessionId" />
            <Input path="?b/property:MgiMarkerAccessionId" />
          </Compare>
          <Compare metric="equality">
            <Input path="?a/property:KeggDiseaseId" />
            <Input path="?b/property:KeggDiseaseId" />
          </Compare>
          <Compare metric="equality">
            <Input path="?a/property:KeggPathwayId" />
            <Input path="?b/property:KeggPathwayId" />
          </Compare>
        </Aggregate>
      </LinkageRule>

      <Filter threshold="1.0" />

    </Interlink>

  </Interlinks>
</Silk>
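
Read as a decision rule, the specification above links two entities when at least one of the five identifier properties matches exactly. The sketch below restates that logic in plain Python as an aid to reading. It is not Silk's implementation: the type restriction from `RestrictTo` is omitted, entities are modelled as simple dictionaries, and treating multi-valued properties as matching when they share any value is an assumption.

```python
# Property names taken from the Compare elements of the link specification.
ID_PROPERTIES = [
    "UniprotId",
    "EntrezGeneId",
    "MgiMarkerAccessionId",
    "KeggDiseaseId",
    "KeggPathwayId",
]
THRESHOLD = 1.0  # from <Filter threshold="1.0" />


def equality(values_a, values_b):
    """Score 1.0 if the two value collections share a value, else 0.0."""
    return 1.0 if set(values_a) & set(values_b) else 0.0


def link_score(entity_a, entity_b):
    """Aggregate the per-property equality scores with max."""
    return max(
        equality(entity_a.get(p, ()), entity_b.get(p, ()))
        for p in ID_PROPERTIES
    )


def is_same_as(entity_a, entity_b):
    """Emit a link only when the aggregated score reaches the threshold."""
    return link_score(entity_a, entity_b) >= THRESHOLD


# Made-up entities for illustration only.
kegg_gene = {"UniprotId": ["P12345"], "EntrezGeneId": ["1017"]}
uniprot_gene = {"UniprotId": ["P12345"]}
print(is_same_as(kegg_gene, uniprot_gene))
```

Because the aggregation is `max` and the threshold is 1.0, a single exact identifier match is enough to produce an `owl:sameAs` link, and a missing property simply contributes a score of 0.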

Results

The following table summarizes the LDIF run times for the different dataset sizes. The overall run time is split according to the different processing steps of the integration process.


Processing step                    25 M         50 M         100 M
Load and build entities for R2R    128.1 sec    297.2 sec    1059.7 sec
R2R data translation               169.9 sec    515.0 sec    1109.2 sec
Build entities for Silk            15.3 sec     36.8 sec     107.4 sec
Silk Identity Resolution           103.0 sec    568.5 sec    2954.9 sec
Final URI rewriting                8.1 sec      27.0 sec     65.0 sec
Overall execution time             7.0 min      24.0 min     88.3 min
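
Since each dataset is twice the size of the previous one, the ratio between neighbouring columns shows how each step scales when the input doubles. The sketch below derives these ratios from the reported step times and checks that the steps add up to the reported overall time; the function name `growth_factors` is chosen for illustration.

```python
# Step run times in seconds for 25 M, 50 M and 100 M,
# copied from the Use case B results table.
STEP_SECONDS = {
    "Load and build entities for R2R": (128.1, 297.2, 1059.7),
    "R2R data translation": (169.9, 515.0, 1109.2),
    "Build entities for Silk": (15.3, 36.8, 107.4),
    "Silk Identity Resolution": (103.0, 568.5, 2954.9),
    "Final URI rewriting": (8.1, 27.0, 65.0),
}


def growth_factors(times):
    """Return the run-time ratio between each pair of consecutive dataset sizes."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


for step, times in STEP_SECONDS.items():
    factors = ", ".join(f"{f:.1f}x" for f in growth_factors(times))
    print(f"{step}: {factors}")

# Sum the five steps per dataset size and convert to minutes.
totals_min = [sum(column) / 60 for column in zip(*STEP_SECONDS.values())]
print([f"{t:.1f} min" for t in totals_min])
```

Silk Identity Resolution grows by more than a factor of five for each doubling of the input, whereas the other steps grow by factors between roughly 2 and 3.6. The summed step times agree with the reported overall execution times to within about 0.1 minutes.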