This page presents a performance evaluation for LDIF using life science use cases.
The following data sources have been used to evaluate the performance of LDIF:
We used a machine with the following specification for the benchmark experiments:
We applied the following test procedure to each test:
echo 2 > /proc/sys/vm/drop_caches
java -server -Xmx20G -jar ldif-0.1-single-machine.jar
In this use case, we evaluated the performance of LDIF when integrating data from the five life science sources described above and translating it into a common target vocabulary.
To measure performance at different scales, we used several samples of the input data that differ in the number of resources taken from each data source (1, 10 and 100 thousand).
For this use case we extracted only data of human (Homo sapiens, hsa) and mouse (Mus musculus, mmu) species.
Details about the benchmark datasets are summarized in the following table:
| | 1 K | 10 K | 100 K |
|---|---|---|---|
| Number of input quads | 1,090,078 | 7,632,762 | 23,927,642 |
| Overall file size | 195 MB | 1.4 GB | 4.7 GB |
| Download link | 1k.zip | 10k.zip | 100k.zip |
We defined R2R mappings for translating vocabularies used by the source datasets into the target vocabulary.
The following Silk identity resolution heuristics are used to find genes and other expressions that are described in multiple datasets.
The following table summarizes the LDIF run times for the different dataset sizes. The overall run time is split according to the different processing steps of the integration process.
| | 1 K | 10 K | 100 K |
|---|---|---|---|
| Load and build entities for R2R | 5.2 sec | 47.4 sec | 111.7 sec |
| R2R data translation | 3.4 sec | 12.9 sec | 92.4 sec |
| Build entities for Silk | 1.2 sec | 6.9 sec | 45.0 sec |
| Silk Identity Resolution | 7.0 sec | 45.5 sec | 293.7 sec |
| Final URI rewriting | 0.4 sec | 1.9 sec | 12.4 sec |
| Overall execution time | 20.5 sec | 124.3 sec | 9.9 min |
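As a quick consistency check on the table above, the per-step times can be summed and compared with the reported overall time. The following is a minimal sketch using only the figures from the table; the names `STEPS`, `OVERALL`, `step_sum` and `silk_share` are ours and not part of LDIF. The source does not say what accounts for the difference between the summed steps and the overall time, so the sketch only reports it.

```python
# Per-step run times in seconds, copied from the table above, in table order:
# load/build for R2R, R2R translation, build for Silk, Silk, URI rewriting.
STEPS = {
    "1 K": [5.2, 3.4, 1.2, 7.0, 0.4],
    "10 K": [47.4, 12.9, 6.9, 45.5, 1.9],
    "100 K": [111.7, 92.4, 45.0, 293.7, 12.4],
}

# Reported overall execution times in seconds (9.9 min converted to seconds).
OVERALL = {"1 K": 20.5, "10 K": 124.3, "100 K": 9.9 * 60}

SILK_INDEX = 3  # position of "Silk Identity Resolution" in each list


def step_sum(size):
    """Sum of the individually reported step times for one dataset size."""
    return sum(STEPS[size])


def silk_share(size):
    """Fraction of the overall run time spent in Silk identity resolution."""
    return STEPS[size][SILK_INDEX] / OVERALL[size]


if __name__ == "__main__":
    for size in STEPS:
        unaccounted = OVERALL[size] - step_sum(size)
        print(f"{size}: steps {step_sum(size):.1f} s, "
              f"overall {OVERALL[size]:.1f} s, "
              f"not itemized {unaccounted:.1f} s, "
              f"Silk share {silk_share(size):.0%}")
```

On these figures, Silk identity resolution accounts for roughly a third of the overall run time at 1 K and close to half at 100 K.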
In this use case, we evaluated the performance of LDIF when integrating two larger life science datasets and translating them into a common target vocabulary.
For this use case we used only the KEGG GENES and UniProt datasets. The two datasets differ greatly in size: converted to N-Triples, the complete KEGG GENES dump is about 28 GB, whereas the UniProt dataset contains over 400 GB of data. Because of this size mismatch we extracted data for only five species from UniProt: Bos taurus (cattle), Canis familiaris (dog), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly) and Sus scrofa (pig).
For the test, we generated subsets of both data sources that together amount to 25 million, 50 million and 100 million RDF triples.
Details about the benchmark datasets are summarized in the following table, which also provides statistics about the data integration process for each dataset. The original number of input triples decreases during the process because LDIF discards input triples that are irrelevant for the defined mappings and therefore cannot be translated into the target vocabulary. The number decreases again after the actual translation, because the input data uses more verbose vocabularies, so multiple triples from the input data are combined into single triples in the target vocabulary. The size of the final dataset is the number of quads after the mapping phase plus any provided provenance data.
| | 25 M | 50 M | 100 M |
|---|---|---|---|
| Number of input quads | 25,000,000 | 50,000,000 | 100,000,000 |
| Number of quads after irrelevance filter | 13,576,394 | 25,397,310 | 44,249,757 |
| Number of quads after mapping | 4,419,410 | 11,398,236 | 24,972,112 |
| Number of pairs of equivalent entities resolved | 24,782 | 113,245 | 213,062 |
| Overall file size | 5.6 GB | 12 GB | 23 GB |
| Download link | 25M.zip | 50M.zip | 100M.zip |
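To make the two reductions described above concrete, the quad counts from the table can be expressed as fractions of the input. This is a small illustrative sketch using only the table's numbers; `COUNTS` and `retention` are hypothetical names introduced here.

```python
# (input quads, quads after irrelevance filter, quads after mapping),
# copied from the table above.
COUNTS = {
    "25 M": (25_000_000, 13_576_394, 4_419_410),
    "50 M": (50_000_000, 25_397_310, 11_398_236),
    "100 M": (100_000_000, 44_249_757, 24_972_112),
}


def retention(size):
    """Return (fraction kept by the filter, fraction left after mapping)."""
    total, filtered, mapped = COUNTS[size]
    return filtered / total, mapped / total


if __name__ == "__main__":
    for size in COUNTS:
        after_filter, after_mapping = retention(size)
        print(f"{size}: {after_filter:.1%} of input kept by the filter, "
              f"{after_mapping:.1%} of input remains after mapping")
```

For all three sizes, roughly half of the input quads pass the irrelevance filter, and between about 18% and 25% of the input volume remains after mapping.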
We defined R2R mappings for translating genes, diseases and pathways from KEGG GENES and genes from UniProt into a proprietary target vocabulary. Some of the more sophisticated mappings in this use case translate complex structural patterns and perform value transformations (e.g. extracting an integer value from a URI). The most common value transformations are extracting strings with a regular expression and modifying the target data types.
mp:Gene
a r2r:ClassMapping;
r2r:prefixDefinitions
"category: <http://mywiki/resource/category/> .
property: <http://mywiki/resource/property/> .
pathway: <http://wiking.vulcan.com/neurobase/kegg_pathway/resource/vocab/> .
genes: <http://wiking.vulcan.com/neurobase/kegg_genes/resource/vocab/> .
xsd: <http://www.w3.org/2001/XMLSchema#> .";
r2r:sourcePattern "?SUBJ a genes:gene";
r2r:targetPattern "?SUBJ a category:Gene";
.
mp:GeneLinkUniProt
a r2r:PropertyMapping;
r2r:mappingRef mp:Gene;
r2r:sourcePattern "?SUBJ genes:externalLink ?x";
r2r:transformation "?id = regexToList('UniProt:(.+)', ?x)";
r2r:targetPattern "?SUBJ property:UniprotId ?'id'^^xsd:string";
.
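The `mp:GeneLinkUniProt` mapping above extracts an identifier from an external link value with the pattern `UniProt:(.+)`. The following Python sketch imitates that kind of regular-expression extraction outside of R2R. It is not the R2R implementation, the exact matching semantics of `regexToList` are assumed here, and the sample link values are made up for illustration.

```python
import re

# Same pattern as in the r2r:transformation above.
UNIPROT_PATTERN = re.compile(r"UniProt:(.+)")


def extract_uniprot_ids(external_links):
    """Collect the captured group for every link value that matches.

    Assumption: values that do not match the pattern contribute nothing,
    so no target triple would be produced for them.
    """
    ids = []
    for link in external_links:
        match = UNIPROT_PATTERN.search(link)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    # Hypothetical genes:externalLink values for a single gene.
    links = ["UniProt:P12345", "NCBI-GeneID:1234"]
    print(extract_uniprot_ids(links))
```

Each extracted string would then correspond to one `property:UniprotId` value typed as `xsd:string` in the target pattern.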
We defined the following Link Specification for the identity resolution phase:
<Silk>
<Prefixes>
<Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
<Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
<Prefix id="property" namespace="http://mywiki/resource/property/" />
<Prefix id="category" namespace="http://mywiki/resource/category/" />
</Prefixes>
<DataSources>
<DataSource id="Source" type="sparqlEndpoint" >
<Param name="endpointURI" value="http://localhost:2020/sparql/read" />
</DataSource>
<DataSource id="Target" type="sparqlEndpoint" >
<Param name="endpointURI" value="http://localhost:2020/sparql/read" />
</DataSource>
</DataSources>
<Interlinks>
<Interlink id="link">
<LinkType>owl:sameAs</LinkType>
<SourceDataset dataSource="Source" var="a">
<RestrictTo>
{ ?a rdf:type category:Gene }
UNION { ?a rdf:type category:Disease }
UNION { ?a rdf:type category:Pathway }
</RestrictTo>
</SourceDataset>
<TargetDataset dataSource="Target" var="b">
<RestrictTo>
{ ?b rdf:type category:Gene }
UNION { ?b rdf:type category:Disease }
UNION { ?b rdf:type category:Pathway }
</RestrictTo>
</TargetDataset>
<LinkageRule>
<Aggregate type="max">
<Compare metric="equality">
<Input path="?a/property:UniprotId" />
<Input path="?b/property:UniprotId" />
</Compare>
<Compare metric="equality">
<Input path="?a/property:EntrezGeneId" />
<Input path="?b/property:EntrezGeneId" />
</Compare>
<Compare metric="equality">
<Input path="?a/property:MgiMarkerAccessionId" />
<Input path="?b/property:MgiMarkerAccessionId" />
</Compare>
<Compare metric="equality">
<Input path="?a/property:KeggDiseaseId" />
<Input path="?b/property:KeggDiseaseId" />
</Compare>
<Compare metric="equality">
<Input path="?a/property:KeggPathwayId" />
<Input path="?b/property:KeggPathwayId" />
</Compare>
</Aggregate>
</LinkageRule>
<Filter threshold="1.0" />
</Interlink>
</Interlinks>
</Silk>
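The linkage rule above can be read as: compare five identifier properties for equality, take the maximum of the five scores, and keep only pairs whose score reaches the threshold of 1.0. The sketch below restates that logic in Python for two in-memory entities. It is a simplified illustration rather than Silk's implementation; the dictionary-based entity representation, the function names, and the treatment of multi-valued properties (any shared value counts as equal) are assumptions.

```python
# Identifier properties compared in the linkage rule above.
ID_PROPERTIES = [
    "UniprotId",
    "EntrezGeneId",
    "MgiMarkerAccessionId",
    "KeggDiseaseId",
    "KeggPathwayId",
]

# Types admitted by the RestrictTo clauses.
ALLOWED_TYPES = {"Gene", "Disease", "Pathway"}


def equality(values_a, values_b):
    """Return 1.0 if the two value lists share at least one value, else 0.0."""
    return 1.0 if set(values_a) & set(values_b) else 0.0


def link_confidence(a, b):
    """Aggregate type="max": the best score over all compared properties."""
    return max(equality(a.get(p, []), b.get(p, [])) for p in ID_PROPERTIES)


def same_as(a, b, threshold=1.0):
    """True if both entities pass the type restriction and meet the threshold."""
    if a.get("type") not in ALLOWED_TYPES or b.get("type") not in ALLOWED_TYPES:
        return False
    return link_confidence(a, b) >= threshold


if __name__ == "__main__":
    # Hypothetical entities; property values are lists of strings.
    kegg_gene = {"type": "Gene", "UniprotId": ["P12345"], "EntrezGeneId": ["1234"]}
    uniprot_gene = {"type": "Gene", "UniprotId": ["P12345"]}
    print(same_as(kegg_gene, uniprot_gene))
```

Because every comparison is an exact equality and the aggregation is a maximum, a single shared identifier is enough to produce an `owl:sameAs` link under this reading.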
The following table summarizes the LDIF run times for the different dataset sizes. The overall run time is split according to the different processing steps of the integration process.
| | 25 M | 50 M | 100 M |
|---|---|---|---|
| Load and build entities for R2R | 128.1 sec | 297.2 sec | 1059.7 sec |
| R2R data translation | 169.9 sec | 515.0 sec | 1109.2 sec |
| Build entities for Silk | 15.3 sec | 36.8 sec | 107.4 sec |
| Silk Identity Resolution | 103.0 sec | 568.5 sec | 2954.9 sec |
| Final URI rewriting | 8.1 sec | 27.0 sec | 65.0 sec |
| Overall execution time | 7.0 min | 24.0 min | 88.3 min |
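To see how each processing step scales as the input doubles, the run times in the table above can be turned into growth factors and an overall throughput. This is plain arithmetic on the reported figures; the names `TIMES`, `INPUT_QUADS`, `growth_factors` and `throughput` are introduced here for illustration only.

```python
# Run times in seconds per processing step, copied from the table above
# (columns: 25 M, 50 M and 100 M input quads).
TIMES = {
    "Load and build entities for R2R": [128.1, 297.2, 1059.7],
    "R2R data translation": [169.9, 515.0, 1109.2],
    "Build entities for Silk": [15.3, 36.8, 107.4],
    "Silk Identity Resolution": [103.0, 568.5, 2954.9],
    "Final URI rewriting": [8.1, 27.0, 65.0],
}
INPUT_QUADS = [25_000_000, 50_000_000, 100_000_000]


def growth_factors(step):
    """Run-time ratio between consecutive dataset sizes for one step."""
    times = TIMES[step]
    return [times[i + 1] / times[i] for i in range(len(times) - 1)]


def throughput():
    """Input quads processed per second, based on the summed step times."""
    totals = [sum(column) for column in zip(*TIMES.values())]
    return [quads / seconds for quads, seconds in zip(INPUT_QUADS, totals)]


if __name__ == "__main__":
    for step in TIMES:
        factors = ", ".join(f"{f:.2f}x" for f in growth_factors(step))
        print(f"{step}: {factors}")
    print("quads/sec:", [round(t) for t in throughput()])
```

A factor of 2.0 would correspond to linear scaling for each doubling of the input. On these figures, Silk identity resolution grows by more than a factor of five per doubling, and the overall throughput falls from roughly 59,000 to roughly 19,000 input quads per second.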