A framework for building Linked Data applications
Andreas Schultz
Andrea Matteini
Robert Isele
Chris Bizer
Christian Becker
LDIF translates heterogeneous Linked Data from the Web into a clean, local target representation while keeping track of data provenance.

Contents

  1. About LDIF
  2. LDIF components
  3. Configuration options
  4. Quick start
  5. Examples
  6. Performance Evaluation
  7. Source code and development
  8. Version history
  9. Support and Feedback
  10. References
  11. Acknowledgments

1. About LDIF

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Until now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URIs while keeping track of data provenance.

The LDIF integration pipeline consists of the following steps:

  1. Collect Data: Import modules locally replicate data sets via file download, crawling or SPARQL.
  2. Map to Schema: An expressive mapping language allows for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary.
  3. Resolve Identities: An identity resolution component discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics.
  4. Output: LDIF outputs the integrated data in a single file. For provenance tracking, LDIF employs the Named Graphs data model.
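
The two transformation steps of this pipeline (map to schema, resolve identities) can be illustrated on a handful of quads. The following is a minimal Python sketch and not LDIF code: the vocabularies, the URIs and the helper names map_to_schema and resolve_identities are hypothetical, and in LDIF the mapping and alias tables are produced by R2R and Silk rather than written by hand.

```python
# Toy illustration of pipeline steps 2 and 3 on (s, p, o, graph) tuples.
# All URIs and helper names are made up for this sketch.

# Step 2 input: source property -> target property (hand-written here).
PROPERTY_MAPPING = {
    "http://source-a.example/vocab/name": "http://target.example/vocab/label",
    "http://source-b.example/schema/title": "http://target.example/vocab/label",
}

# Step 3 input: URI alias -> chosen target URI (hand-written here).
URI_ALIASES = {
    "http://source-b.example/artist/42": "http://source-a.example/artist/bob",
}


def map_to_schema(quads, mapping):
    """Rename predicates into the target vocabulary; the graph is untouched."""
    return [(s, mapping.get(p, p), o, g) for s, p, o, g in quads]


def resolve_identities(quads, aliases):
    """Replace URI aliases in subject and object position by the target URI."""
    return [(aliases.get(s, s), p, aliases.get(o, o), g) for s, p, o, g in quads]


quads = [
    ("http://source-a.example/artist/bob",
     "http://source-a.example/vocab/name", '"Bob"',
     "http://source-a.example/graph"),
    ("http://source-b.example/artist/42",
     "http://source-b.example/schema/title", '"Bob"',
     "http://source-b.example/graph"),
]

integrated = resolve_identities(map_to_schema(quads, PROPERTY_MAPPING), URI_ALIASES)
for quad in integrated:
    print(quad)
```

Both quads end up with the same subject and predicate, while the fourth element still records which source graph each statement came from.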

The figure below shows the schematic architecture of Linked Data applications that implement the crawling/data warehousing pattern. The figure highlights the steps of the data integration process that are currently supported by LDIF.

Example-architecture of an integration aware Linked Data application

2. LDIF Components

The LDIF Framework consists of a Scheduler, Data Import and an Integration component with a set of pluggable modules. These modules are organized as data input, data transformation and data output.

LDIF components

Currently, we have implemented the following modules:

Scheduler

The Scheduler is used for triggering pending data import jobs or integration jobs. It is configured with an XML document (see Configuration) and offers several ways to express when and how often a certain job should be executed.
This component is useful when you want to load external data or run the integration periodically. Otherwise, you can run the integration component on its own.

Data Import

LDIF provides access modules for replicating data sets locally via file download, crawling or SPARQL. These different types of import jobs generate provenance metadata, which is tracked throughout the integration process. Import jobs are managed by a scheduler that can be configured to refresh (hourly, daily etc.) the local cache for each source.

Triple/Quad Dump Import

The simplest way to obtain a local replica of a data set from the Web of Data is to download a file containing the data set. The triple/quad dump import does exactly this. The only difference between the two is that LDIF generates a provenance graph for a triple dump import, whereas it uses the graphs given in a quad dump import as provenance graphs.
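
The difference between the two dump types can be sketched in a few lines of Python. This is an illustration only: the function names and the generated graph URI http://example.org/import/dBpedia.0 are assumptions of the sketch, and how LDIF actually names the generated graph is not described here.

```python
def triples_to_quads(triples, generated_graph):
    """Triple dump: attach one generated graph URI to every triple."""
    return [(s, p, o, generated_graph) for s, p, o in triples]


def provenance_graphs(quads):
    """Return the distinct graph URIs, which act as provenance graphs."""
    return sorted({g for _, _, _, g in quads})


NAME = "http://xmlns.com/foaf/0.1/name"

triple_dump = [
    ("http://dbpedia.org/resource/Bob_Marley", NAME, '"Bob Marley"'),
    ("http://dbpedia.org/resource/Radiohead", NAME, '"Radiohead"'),
]
quad_dump = [
    ("http://dbpedia.org/resource/Bob_Marley", NAME, '"Bob Marley"',
     "http://dbpedia.org/graphA"),
    ("http://dbpedia.org/resource/Radiohead", NAME, '"Radiohead"',
     "http://dbpedia.org/graphB"),
]

# Hypothetical graph URI generated for the triple dump import.
from_triples = triples_to_quads(triple_dump, "http://example.org/import/dBpedia.0")

print(provenance_graphs(from_triples))  # one generated graph
print(provenance_graphs(quad_dump))     # the graphs given in the dump
```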

Crawler Import

Data sets that can only be accessed via dereferenceable URIs are good candidates for a crawler. In LDIF we thus integrated LDSpider for crawl import jobs. The configuration files for crawl import jobs are specified in the configuration section. Each crawled URI is put into a separate named graph for provenance tracking.

SPARQL Import

Data sources that can be accessed via SPARQL are replicated by LDIF's SPARQL access module. The relevant data to be queried can be further specified in the configuration file for a SPARQL import job. Data from each SPARQL import job gets tracked by its own named graph.

Integration Runtime Environment

The integration component manages the data flow between the various stages/modules, the caching of the intermediate results and the execution of the different modules for each stage.

Data Input

The integration component expects input data to be represented as Named Graphs and be stored in N-Quads format accessible locally - the Web access modules convert any imported data into N-Quads format.

Transformation

LDIF provides transformation modules for vocabulary mapping and identity resolution.

R2R Data Translation

LDIF employs the R2R Framework to translate Web data that is represented using terms from different vocabularies into a single target vocabulary. Vocabulary mappings are expressed using the R2R Mapping Language. The language provides for simple transformations as well as for more complex structural transformations and property value transformations such as normalizing different units of measurement or complex string manipulations. The syntax of the R2R Mapping Language is very similar to the query language SPARQL, which makes it easier to learn. The expressivity of the language enabled us to deal with all requirements that we have encountered so far when translating Linked Data from the Web into a target representation (evaluation in [2]). Simple class/property-renaming mappings, which often form the majority in an integration use case, can also be expressed in OWL/RDFS (e.g. ns1:class rdfs:subClassOf ns2:clazz).
An overview and examples for mappings are given on the R2R website. The specification and user manual is provided as a separate document.

Silk Identity Resolution

LDIF employs the Silk Link Discovery Framework to find different URIs that are used within different data sources to identify the same real-world entity. For each set of duplicates which have been identified by Silk, LDIF replaces all URI aliases with a single target URI within the output data. In addition, it adds owl:sameAs links pointing at the original URIs, which makes it possible for applications to refer back to the data sources on the Web. If the LDIF input data already contains owl:sameAs links, the referenced URIs are normalized accordingly (optional, see configuration). Silk is a flexible identity resolution framework that allows the user to specify identity resolution heuristics which combine different types of matchers using the declarative Silk - Link Specification Language.
An overview and examples can be found on the Silk website.
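
The URI rewriting described above (replace all aliases of a duplicate set with one target URI and keep owl:sameAs links to the originals) can be sketched as follows. This is a hedged Python illustration, not the Silk or LDIF implementation: the function names are hypothetical, the input links are assumed to be given, and picking the lexicographically smallest URI as the target is an assumption of this sketch only.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def cluster_aliases(links):
    """Group URIs connected by links; map every URI to one representative."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption of this sketch: the smallest URI becomes the target.
            winner, loser = sorted((root_a, root_b))
            parent[loser] = winner
    return {uri: find(uri) for uri in parent}


def rewrite(quads, target_of):
    """Replace aliases by their target URI and emit owl:sameAs back-links."""
    rewritten = [(target_of.get(s, s), p, target_of.get(o, o), g)
                 for s, p, o, g in quads]
    same_as = [(target, OWL_SAME_AS, alias)
               for alias, target in sorted(target_of.items())
               if alias != target]
    return rewritten, same_as


links = [
    ("http://b.example/x", "http://a.example/x"),
    ("http://c.example/x", "http://b.example/x"),
]
quads = [
    ("http://c.example/x", "http://example.org/vocab/label", '"X"',
     "http://c.example/graph"),
]

target_of = cluster_aliases(links)
rewritten, same_as = rewrite(quads, target_of)
print(rewritten)
print(same_as)
```

The union-find structure handles transitive duplicates: c is linked only to b, yet it is rewritten to the same target as a and b.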

Data Output

Two output formats are currently supported by LDIF.

N-Quads Writer

The N-Quads writer dumps the final output of the integration workflow into a single N-Quads file. This file contains the translated versions of all graphs from the input graph set as well as the content of the provenance graph and sameAs-links.
Currently, the provenance information is just copied to the final output. In future releases, we will use the provenance information for data quality assessment and data fusion (see Next Steps).

N-Triples Writer

The N-Triples writer dumps the final output of the integration workflow into a single N-Triples file. Because N-Triples has no graph component, the output statements can no longer be connected to the provenance data, so the provenance data is discarded instead of being output.
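
Why the provenance link is lost becomes visible when the same statements are serialized in both formats. The following Python sketch is an illustration with hypothetical function names; it assumes the terms are already formatted (URIs in angle brackets, literals quoted) and is not the LDIF writer code.

```python
def to_nquads(quads):
    """Serialize (s, p, o, g) tuples of formatted terms as N-Quads lines."""
    return [f"{s} {p} {o} {g} ." for s, p, o, g in quads]


def to_ntriples(quads):
    """Serialize the same tuples as N-Triples; the graph term is dropped."""
    return [f"{s} {p} {o} ." for s, p, o, _ in quads]


quads = [
    ("<http://example.org/artist/1>", "<http://xmlns.com/foaf/0.1/name>",
     '"Bob Marley"', "<http://example.org/graphA>"),
]

print(to_nquads(quads)[0])
print(to_ntriples(quads)[0])
```

The N-Quads line still names the graph that the provenance graph describes; the N-Triples line carries no such reference.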

Runtime Environments

The Runtime Environment for the integration component manages the data flow between the various stages/modules and the caching of the intermediate results.

In order to parallelize the data processing, the data is partitioned into entities prior to supplying it to a transformation module. An entity represents a Web resource together with all data that is required by a transformation module to process this resource. Entities consist of one or more graph paths and include a graph URI for each node. Each transformation module specifies which paths should be included in the entities it processes. Splitting the work into fine-grained entities allows LDIF to parallelize the work.
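
The idea of entities as independent units of work can be sketched in Python. This is a simplified illustration under stated assumptions: paths are limited to a single predicate, the names build_entities, process_entity and run_parallel are hypothetical, and the "transformation" merely counts values. The actual LDIF entity model supports longer graph paths.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_entities(quads, wanted_predicates):
    """Group the values of the requested predicates by subject.

    For every predicate an entity keeps a list of (value, graph) pairs, so
    the graph URI of each node stays available for provenance tracking.
    """
    entities = defaultdict(lambda: defaultdict(list))
    for s, p, o, g in quads:
        if p in wanted_predicates:
            entities[s][p].append((o, g))
    return {s: dict(paths) for s, paths in entities.items()}


def process_entity(item):
    """Stand-in transformation: count the values found for one entity."""
    uri, paths = item
    return uri, sum(len(values) for values in paths.values())


def run_parallel(entities, workers=4):
    """Distribute the independent entities over a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_entity, entities.items()))


NAME = "http://xmlns.com/foaf/0.1/name"
OTHER = "http://example.org/vocab/other"

quads = [
    ("http://example.org/a", NAME, '"A"', "http://example.org/g1"),
    ("http://example.org/a", NAME, '"Alpha"', "http://example.org/g2"),
    ("http://example.org/a", OTHER, '"x"', "http://example.org/g1"),
    ("http://example.org/b", NAME, '"B"', "http://example.org/g1"),
]

entities = build_entities(quads, {NAME})
counts = run_parallel(entities)
print(counts)
```

Because each entity already contains everything its transformation needs, the workers never have to look at shared data, which is what makes the distribution over threads (or machines) straightforward.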

LDIF provides three implementations of the Runtime Environment: 1. the in-memory version, 2. the RDF store version and 3. the Hadoop version. Depending on the size of your data set and the available computing resources, you can choose the runtime environment that best fits your use case.

Single machine / In-memory

The in-memory implementation keeps all intermediate results in memory. It is fast but its scalability is limited by the amount of available memory. For instance, integrating 25 million triples required 5 GB memory within one of our experiments. Parallelization is achieved by distributing the work (entities) to multiple threads.

Single machine / RDF Store

This implementation of the runtime environment uses a Jena TDB RDF store to store intermediate results. The communication between the RDF store and the runtime environment is realized in the form of SPARQL queries. This runtime environment allows you to process data sets that no longer fit into memory. The downside is that the RDF store implementation is slower than the in-memory implementation.

Cluster / Hadoop

This implementation of the runtime environment allows you to parallelize the work onto multiple machines using Hadoop. Each phase in the integration flow has been ported to be executable on a Hadoop cluster. Some initial performance figures comparing the run times of the in-memory, quad store and Hadoop version against different data set sizes are provided in the Benchmark Wiki.

Next steps for LDIF

Over the next months, we plan to extend LDIF along the following lines:

  1. Add a Data Quality Evaluation and Data Fusion Module which allows Web data to be filtered according to different data quality assessment policies and provides for fusing Web data according to different conflict resolution methods.
  2. Flexible integration workflow. Currently the integration flow is static and can only be influenced by predefined configuration parameters. We plan to make the workflow and its configuration more flexible in order to make it easier to include additional modules that cover other data integration aspects.

3. Configuration Options

This section describes what LDIF configuration files look like and which parameters you can modify to change the runtime behavior of LDIF.

Schedule Job Configuration

A Schedule Job updates the representation of external sources in the local cache. It is configured with an XML document, whose structure is described by this XML Schema. The scheduler configuration is the top-level configuration file; it references all the other configuration files, such as those for the import jobs that access remote sources and the one for the integration job.

A typical configuration document looks like this:

<scheduler xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<properties>scheduler.properties</properties>
<dataSources>datasources</dataSources>
<importJobs>importJobs</importJobs>
<integrationJob>integration-config.xml</integrationJob>
<dumpLocation>dumps</dumpLocation>
</scheduler>

It has the following elements:

Both relative and absolute paths are supported.

Configuration Properties

In the Schedule Job configuration file you can specify a (Java) properties file to further tweak certain parameters concerning the workflow. Here is a list with all properties that can be set at the moment and the possible values for each property:

Integration Job Configuration

An Integration Job is configured with an XML document, whose structure is described by this XML Schema.
The current structure is very simple because the integration flow is static at the moment - something that will change in future releases. The config file specifies amongst other things how often the whole integration workflow should be executed. It should be noted that when an integration job starts, it only works on fully imported data. Data of import jobs that did not finish before the integration starts is ignored - the only exception is if the oneTimeExecution configuration property is set to true; then the integration waits for all import jobs to finish.

A typical configuration document looks like this:

<integrationJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<properties>test.properties</properties>
<sources>sources</sources>
<linkSpecifications>linkSpecs</linkSpecifications>
<mappings>mappings</mappings>
<output>output.nq</output>
<runSchedule>daily</runSchedule>
</integrationJob>

It has the following elements:

Both relative and absolute paths are supported. In this case there is a root directory with the config file and the test.properties file in it. Furthermore the following directories would be nested in the root directory: linkSpecs, sources and mappings. Data sets have to be in a local directory.
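
How such a configuration document could be read is sketched below in Python. The sketch is not LDIF code: the function name read_integration_job is hypothetical, and resolving relative paths against the directory of the config file is an assumption inferred from the directory layout described above. The output element is left as given, since this section does not say which directory it is resolved against.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

NS = "{http://www4.wiwiss.fu-berlin.de/ldif/}"
# Elements that point to files or directories next to the config file.
PATH_ELEMENTS = ("properties", "sources", "linkSpecifications", "mappings")

EXAMPLE = """<integrationJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<properties>test.properties</properties>
<sources>sources</sources>
<linkSpecifications>linkSpecs</linkSpecifications>
<mappings>mappings</mappings>
<output>output.nq</output>
<runSchedule>daily</runSchedule>
</integrationJob>"""


def read_integration_job(xml_text, config_dir):
    """Parse an integration job document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    config = {}
    for name in PATH_ELEMENTS:
        value = root.findtext(NS + name)
        if value is None:
            continue
        path = PurePosixPath(value.strip())
        # Assumption: relative paths are relative to the config directory.
        if not path.is_absolute():
            path = PurePosixPath(config_dir) / path
        config[name] = path
    config["output"] = root.findtext(NS + "output")
    config["runSchedule"] = root.findtext(NS + "runSchedule")
    return config


config = read_integration_job(EXAMPLE, "/opt/ldif/config")
for key, value in config.items():
    print(key, "->", value)
```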

Configuration Properties for the Integration Job

In the Integration Job configuration file you can specify a (Java) properties file to further tweak certain parameters concerning the integration workflow. Here is a list with all properties that can be set at the moment and the possible values for each property:

Import Job Configuration

An Import Job is configured with an XML document, whose structure is described by this XML Schema.

It has the following elements:

LDIF supports four different mechanisms to import external data: quad import, triple import, SPARQL import and crawler import.

Quad Import

A typical config file for a Quad Import Job looks like this:

<importJob xmlns="http://www4.wiwiss.fu-berlin.de/bizer/ldif">
<internalId>dBpedia.0</internalId>
<dataSource>dBpedia</dataSource>
<refreshSchedule>daily</refreshSchedule>
<quadImportJob>
<dumpLocation>http://dbpedia.org/dump.nq</dumpLocation>
</quadImportJob>
</importJob>

Triple Import

In a triple import you use the tripleImportJob element instead of the quadImportJob element:

<importJob xmlns="http://www4.wiwiss.fu-berlin.de/bizer/ldif">
<internalId>dBpedia.0</internalId>
<dataSource>dBpedia</dataSource>
<refreshSchedule>daily</refreshSchedule>
<tripleImportJob>
<dumpLocation>http://dbpedia.org/dump.nt</dumpLocation>
</tripleImportJob>
</importJob>

SPARQL Import

In a SPARQL import job, the sparqlImportJob element specifies the endpoint that will be queried for data and a restriction pattern. Note that the angle brackets of URIs have to be escaped as &lt; and &gt;. The restriction pattern is joined with the pattern ?s ?p ?o, which is also the only pattern in the CONSTRUCT part of the generated SPARQL CONSTRUCT query. This means that all triples of the entities matching the restriction in the pattern element are collected. It is also possible to specify a graph with the graphName element and to restrict the number of imported triples with the tripleLimit element. All elements except endpointLocation are optional.

<importJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<internalId>musicbrainz.3</internalId>
<dataSource>MusicBrainz_Talis</dataSource>
<refreshSchedule>monthly</refreshSchedule>
<sparqlImportJob>
<endpointLocation>http://api.talis.com/stores/musicbrainz/services/sparql</endpointLocation>
<tripleLimit>100000</tripleLimit>
<sparqlPatterns>
<pattern>?s a &lt;http://purl.org/ontology/mo/MusicArtist&gt;</pattern>
</sparqlPatterns>
</sparqlImportJob>
</importJob>
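
The query generation described for SPARQL import jobs can be approximated in a few lines of Python. The exact query text that LDIF generates is not shown in this document, so this is a sketch under assumptions: the function name build_construct_query is hypothetical, and mapping graphName to a FROM clause and tripleLimit to LIMIT is a guess at a plausible translation.

```python
def build_construct_query(patterns=(), graph_name=None, triple_limit=None):
    """Build a CONSTRUCT query that copies all triples of matching subjects."""
    # The restriction patterns are joined with the generic ?s ?p ?o pattern.
    body = " . ".join(["?s ?p ?o", *patterns])
    # Assumption: an optional graph name becomes a FROM clause.
    dataset = f" FROM <{graph_name}>" if graph_name else ""
    query = f"CONSTRUCT {{ ?s ?p ?o }}{dataset} WHERE {{ {body} }}"
    # Assumption: tripleLimit maps to LIMIT (one triple per solution here).
    if triple_limit is not None:
        query += f" LIMIT {int(triple_limit)}"
    return query


# After XML parsing, the escaped pattern from the config above reads:
pattern = "?s a <http://purl.org/ontology/mo/MusicArtist>"
query = build_construct_query([pattern], triple_limit=100000)
print(query)
```

Since ?s ?p ?o is the only template pattern, every solution contributes exactly one triple, which is why a solution limit can stand in for a triple limit in this sketch.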

Crawler Import

A crawl import job is configured by specifying one or more seed URIs as starting points of the crawl and the predicates that the crawler should follow to discover new resources. Optionally, you can specify the maximum number of levels to crawl, meaning the maximum distance to one of the seed URIs, and the maximum number of resources to crawl. All triples received for each crawled resource are stored.

<importJob xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
<internalId>freebase.0</internalId>
<dataSource>Freebase</dataSource>
<refreshSchedule>onStartup</refreshSchedule>
<crawlImportJob>
<seedURIs>
<uri>http://rdf.freebase.com/ns/en.dance-pop</uri>
<uri>http://rdf.freebase.com/ns/en.radiohead</uri>
<uri>http://rdf.freebase.com/ns/en.art_rock</uri>
</seedURIs>
<predicatesToFollow>
<uri>http://rdf.freebase.com/ns/music.artist.genre</uri>
<uri>http://rdf.freebase.com/ns/music.genre.albums</uri>
<uri>http://rdf.freebase.com/ns/music.genre.artists</uri>
<uri>http://rdf.freebase.com/ns/music.album.genre</uri>
<uri>http://rdf.freebase.com/ns/music.album.artist</uri>
<uri>http://rdf.freebase.com/ns/music.artist.album</uri>
<uri>http://rdf.freebase.com/ns/influence.influence_node.influenced_by</uri>
<uri>http://rdf.freebase.com/ns/music.artist.label</uri>
<uri>http://rdf.freebase.com/ns/music.record_label.artist</uri>
<uri>http://rdf.freebase.com/ns/music.producer.releases_produced</uri>
<uri>http://rdf.freebase.com/ns/music.release.producers</uri>
<uri>http://rdf.freebase.com/ns/music.release.album</uri>
<uri>http://rdf.freebase.com/ns/music.producer.releases_produced</uri>
</predicatesToFollow>
<levels>5</levels>
<resourceLimit>50000</resourceLimit>
</crawlImportJob>
</importJob>
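
The interplay of seed URIs, predicates to follow, levels and resource limit can be sketched as a breadth-first traversal. LDIF delegates crawling to LDSpider; the Python code below is only an illustration over an in-memory stand-in for the Web, with hypothetical names (crawl, fetch, WEB), and it assumes that the objects of followed predicates are URIs.

```python
from collections import deque


def crawl(seeds, predicates_to_follow, fetch, levels=None, resource_limit=None):
    """Breadth-first crawl; fetch(uri) returns the triples served for uri."""
    visited, triples = set(), []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        uri, depth = queue.popleft()
        if uri in visited:
            continue
        if resource_limit is not None and len(visited) >= resource_limit:
            break
        visited.add(uri)
        fetched = fetch(uri)
        triples.extend(fetched)  # all received triples are stored
        if levels is not None and depth >= levels:
            continue  # resources beyond the level limit are not discovered
        for _, p, o in fetched:
            if p in predicates_to_follow:
                queue.append((o, depth + 1))
    return visited, triples


FOLLOW = "http://example.org/vocab/genre"
IGNORE = "http://example.org/vocab/other"
A, B, C, Z = (f"http://example.org/{n}" for n in "abcz")

# In-memory stand-in for dereferencing URIs on the Web.
WEB = {
    A: [(A, FOLLOW, B), (A, IGNORE, Z)],
    B: [(B, FOLLOW, C)],
    C: [(C, FOLLOW, A)],
}


def fetch(uri):
    return WEB.get(uri, [])


visited, triples = crawl([A], {FOLLOW}, fetch, levels=1)
print(sorted(visited), len(triples))
```

With levels=1 only the seed and its direct neighbours via the followed predicate are crawled; the resource reached through the ignored predicate is never visited, although the triple mentioning it is stored.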

Provenance Metadata

The result of each import contains provenance metadata, whose structure is described by this ontology.
For each imported graph, provenance information will contain:

A typical provenance graph for a Crawl Import Job looks like this:

Provenance graph

A typical provenance graph for a Quad Import Job looks like this:

<http://dbpedia.org/graphA> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www4.wiwiss.fu-berlin.de/ldif/ImportedGraph> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://dbpedia.org/graphA> <http://www4.wiwiss.fu-berlin.de/ldif/hasImportJob> _:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://dbpedia.org/graphB> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www4.wiwiss.fu-berlin.de/ldif/ImportedGraph> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
<http://dbpedia.org/graphB> <http://www4.wiwiss.fu-berlin.de/ldif/hasImportJob> _:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www4.wiwiss.fu-berlin.de/ldif/ImportJob> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/importId> "dBpedia.0" <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/lastUpdate> "2011-09-21T19:01:00-05:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/hasDatasource> "dBpedia" <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/hasImportType> "quad" <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
_:dbpedia0 <http://www4.wiwiss.fu-berlin.de/ldif/hasOriginalLocation> "http://mes.smw-lde-eu.s3.amazonaws.com/dBpedia_dump.nt.bz2" <http://www4.wiwiss.fu-berlin.de/ldif/provenance> .
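
The structure of this provenance graph can be reproduced with a small generator. The Python sketch below mirrors the predicates and the statement order of the example above; the function name provenance_quads and its parameters are hypothetical, literals are not escaped, and the code is not taken from LDIF.

```python
LDIF = "http://www4.wiwiss.fu-berlin.de/ldif/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
PROVENANCE_GRAPH = f"<{LDIF}provenance>"


def provenance_quads(graphs, job_node, import_id, last_update,
                     data_source, import_type, original_location):
    """Return N-Quads lines describing one import job and its graphs."""

    def quad(subject, predicate, obj):
        return f"{subject} <{predicate}> {obj} {PROVENANCE_GRAPH} ."

    lines = []
    for graph in graphs:
        lines.append(quad(f"<{graph}>", RDF_TYPE, f"<{LDIF}ImportedGraph>"))
        lines.append(quad(f"<{graph}>", LDIF + "hasImportJob", job_node))
    lines.append(quad(job_node, RDF_TYPE, f"<{LDIF}ImportJob>"))
    lines.append(quad(job_node, LDIF + "importId", f'"{import_id}"'))
    lines.append(quad(job_node, LDIF + "lastUpdate",
                      f'"{last_update}"^^<{XSD_DATETIME}>'))
    lines.append(quad(job_node, LDIF + "hasDatasource", f'"{data_source}"'))
    lines.append(quad(job_node, LDIF + "hasImportType", f'"{import_type}"'))
    lines.append(quad(job_node, LDIF + "hasOriginalLocation",
                      f'"{original_location}"'))
    return lines


lines = provenance_quads(
    graphs=["http://dbpedia.org/graphA", "http://dbpedia.org/graphB"],
    job_node="_:dbpedia0",
    import_id="dBpedia.0",
    last_update="2011-09-21T19:01:00-05:00",
    data_source="dBpedia",
    import_type="quad",
    original_location="http://mes.smw-lde-eu.s3.amazonaws.com/dBpedia_dump.nt.bz2",
)
print("\n".join(lines))
```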

Data Source Configuration

A Data Source is configured with an XML document, whose structure is described by this XML Schema. It contains human readable information about a data source. The label element should be a unique string in each integration use case, because it will be referenced by the import jobs.

<dataSource>
<label>DBpedia</label>
<description>DBpedia is an RDF version of Wikipedia</description>
<homepage>http://dbpedia.org</homepage>
</dataSource>

4. Quick start

This section explains how to run the different versions of LDIF.

Single machine / In-memory

To see LDIF in action, please follow these steps:

The example will run in about 3 minutes. Integration results will be written into integrated_music_light.nq in the working directory, containing both integrated data and provenance metadata.
Learn more about LDIF configuration by looking at the Schedule Job Configuration (examples/music/light/schedulerConfig.xml) and the Integration Job Configuration (examples/music/light/integrationJob.xml)

Single machine / RDF Store

To see LDIF running with a quad store (TDB) as backend, please follow these steps:

The configuration properties that need to be used are quadStoreType and databaseLocation (see the Configuration section for more details).

The example will run in about 3 minutes. Integration results will be written into integrated_music_light.nq in the working directory, containing both integrated data and provenance metadata.
Learn more about LDIF configuration by looking at the Schedule Job Configuration (examples/music/light/schedulerConfig.xml) and the Integration Job Configuration (examples/music/light/integrationJob.xml)

Cluster / Hadoop

To see LDIF running on a Hadoop cluster, please follow these steps:

This will import the data sets as defined by the LDIF import jobs and then copy them to the Hadoop file system. Integration results will be written into the /user/hduser/integrated_music_light.nq directory in the Hadoop distributed file system (HDFS). You can check the content of this directory using the following command: hadoop dfs -ls /user/hduser/integrated_music_light.nq

Please note that the run time for this small use case is dominated by the Hadoop overhead.

Learn more about Hadoop configuration by looking at our Benchmark and Troubleshooting wiki pages.

In order to have a cleaner console output, consider replacing the Hadoop default logging configuration ([HADOOP-HOME]/conf/log4j.properties) with our customized log4j.properties file.

Here is a list of Hadoop configuration parameters that can be useful to tune when running LDIF with big datasets:

Parameter | Description | Recommended value
mapred.job.reuse.jvm.num.tasks | Reuse of a JVM across multiple tasks of the same job | -1
mapred.min.split.size | The minimum size chunk that map input should be split into | 268435456
mapred.map.child.java.opts | Specify the heap size for the child JVMs | -Xmx1G
mapred.output.compress | Enable output compression | true
mapred.output.compression.type | How the compression is applied | BLOCK
mapred.output.compression.codec | The compression codec class that is used for compression/decompression | org.apache.hadoop.io.compress.GzipCodec

5. Examples


This section presents two LDIF usage examples.
  1. The Music example shows how different music-related data sources are accessed using the LDIF Web data access components and integrated afterwards using the LDIF data translation and identity resolution modules.
  2. The Life Science example shows how LDIF is used to integrate several local RDF dumps of life science data sets.

5.1 Using LDIF to integrate Data from the Music Domain

This example shows how LDIF is applied to integrate data describing musical artists, music albums, labels and genres from the following remote sources:

Configurations

Each source is accessed via the appropriate access module. The DBpedia data set is downloaded as a dump. Freebase is crawled because it offers no other means of access. MusicBrainz and BBC Music are both accessed via SPARQL because no download of the data set is available and because crawling is in general inferior: a crawl might not gather all the instances you are interested in.

The following import job configuration files are used for the different sources:

The following mapping file provides for translating the source data sets into our target vocabulary:

The target vocabulary is a mix of existing ontologies like FOAF, Music Ontology, Dublin Core, DBpedia etc.

The following Silk identity resolution heuristics are used to find music artists, labels and albums that are described in multiple data sets:

Music artist and record instances are integrated from all the sources.
Labels and genres are integrated from only a subset of the sources, since not all of them provide this information. For example, MusicBrainz does not provide genre information.

Execution instructions

In order to run the example, please download LDIF and run the following commands:

Please note that the execution of the import jobs can take about 3 hours, mainly due to crawling, which is relatively slow compared to other access methods.

A light version of the use case, which runs in less than 5 minutes, is also available:

Output

The following graph shows a portion of the LDIF output describing Bob Marley:

LDIF output for music use case

5.2 Using LDIF to integrate Life Science Data

This example shows how LDIF is applied to integrate data originating from five Life Science sources.

The example is taken from a joint project with Vulcan Inc. and ontoprise GmbH about extending Semantic Media Wiki+ with a Linked Data Integration Framework.

In this example, the following data sources are translated into a common Wiki ontology:

Configurations

A subset of these datasets can be found in the sub-directory examples/life-science/sources of the LDIF release.

The following mapping file provides for translating the vocabularies used by the source data sets into the Wiki ontology.

The following Silk identity resolution heuristics are used to find genes and other expressions that are described in multiple data sets.

To run the example, please download LDIF and use the following LDIF configuration. The configuration options are explained in the Configuration Options section.

Execution instructions

Examples of data translation

In the following, we explain the data translation that is performed for the example of one entity that is described in two input data sets:

Input data:

@prefix aba-voc: <http://brain-map.org/gene/0.1#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix uniprot: <http://purl.uniprot.org/core/> .

<file:///aba_mouse_20101010_1000.nq> {
  <http://brain-map.org/mouse/brain/Oprk1.xml> aba-voc:entrezgeneid "18387" ;
    aba-voc:gene-aliases _:Ab12290 .
  _:Ab12290 <http://brain-map.org/gene/0.1#aliassymbol> "Oprk1" .
}

<file:///datasets/uniprot-organism-human-reviewed-complete_1000.nq> {
  <http://purl.uniprot.org/uniprot/P61981> rdfs:seeAlso <http://purl.uniprot.org/geneid/18387> .
  <http://purl.uniprot.org/geneid/18387> uniprot:database "GeneID" .
  <http://purl.uniprot.org/uniprot/P61981> uniprot:encodedBy <file:///storage/datasets/uniprot-organism-human-reviewed-complete.rdf#_503237333438003B> .
}

Output data:

@prefix smwprop: <http://mywiki/resource/property/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<file:///aba_mouse_20101010_1000.nq> {
  <http://brain-map.org/mouse/brain/Oprk1.xml> smwprop:EntrezGeneId "18387"^^xsd:int .
  <http://brain-map.org/mouse/brain/Oprk1.xml> smwprop:GeneSymbol "Oprk1"^^xsd:string .
}

<file:///datasets/uniprot-organism-human-reviewed-complete_1000.nq> {
  <http://brain-map.org/mouse/brain/Oprk1.xml> smwprop:EntrezGeneId "18387"^^xsd:int .
  <http://brain-map.org/mouse/brain/Oprk1.xml> owl:sameAs <file:///storage/datasets/uniprot-organism-human-reviewed-complete.rdf#_503237333438003B> .
}

The example input and output need some explanation:

Identity resolution:

Data Translation:

6. Performance Evaluation

We regularly carry out performance evaluations. For more details and the latest results please visit our Benchmark results page.

7. Source Code and Development

The latest source code is available from the LDIF development page on Assembla.com.

The framework can be used under the terms of the Apache Software License.

8. Version history

Version | Release log | Date
0.4 | Added two new implementations of the runtime environment: (1) the triple store backed implementation scales to larger data sets on a single machine; (2) the Hadoop-based implementation allows you to run LDIF on clusters with multiple machines | 01/10/2012
0.3 | Access module support (data set dump, SPARQL, crawling); scheduler for running import and integration tasks automatically; configuration file XML schemas for validation; URI minting; second use case from the music domain | 10/06/2011
0.2 | R2R data translation tasks are now executed in parallel; perform source syntax validation before loading data (optional); support for external sameAs links; RDF/N-Triples data output module; support for bzip2 source compression; improved loading performance; memory usage improvements: caching factum rows and string interning only for relevant data | 8/25/2011
0.1 | Initial release of LDIF | 6/29/2011

9. Support and Feedback

For questions and feedback please use the LDIF Google Group.

10. References

  1. Christian Becker, Andrea Matteini: LDIF - Linked Data Integration Framework ( Slides ). SemTechBiz 2012, Berlin, February 2012.
  2. Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, Christian Becker: LDIF - Linked Data Integration Framework. 2nd International Workshop on Consuming Linked Data, Bonn, Germany, October 2011.
  3. William Smith, Christian Becker and Andreas Schultz: Neurowiki: How we integrated large datasets into SMW with R2R and Silk / LDIF ( Slides part 1, part 2 ). SMWCon Fall 2011, Berlin, September 2011.
  4. Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool Publishers, ISBN 978160845431, 2011 ( Free HTML version ).
  5. Christian Bizer, Andreas Schultz: The R2R Framework: Publishing and Discovering Mappings on the Web ( Slides ). 1st International Workshop on Consuming Linked Data (COLD 2010), Shanghai, November 2010.
  6. Robert Isele, Anja Jentzsch, Christian Bizer: Silk Server - Adding missing Links while consuming Linked Data ( Slides ). 1st International Workshop on Consuming Linked Data (COLD 2010), Shanghai, November 2010.
  7. Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov: Discovering and Maintaining Links on the Web of Data ( Slides ). International Semantic Web Conference (ISWC2009), Westfields, USA, October 2009.

11. Acknowledgments

This work was supported in part by Vulcan Inc. as part of its Project Halo and by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943).

WooFunction icon set licensed under GNU General Public License.