OBDA

Published on 19/01/2023

Ontology-Based Data Access (OBDA) is an approach to data integration through an ontology.

Problems

In this post, we explore the objects involved in query answering in an OBDA system. We have introduced what we call an RDF integration system, denoted \(\langle \onto, \rules, \mappings, \extensions \rangle\), where \(\onto\) is an ontology, \(\rules\) is a set of reasoning rules, \(\mappings\) is a set of mappings and \(\extensions\) is a set of RDF extensions.

Mappings

To define an OBDA system, it is easy to define each mapping as a GLAV mapping of the form:

\[m: q_{1}(\bar x) \leadsto q_{2}(\bar x)\]

We will see below how an intermediate view \(V_{m}\), which defines a new extension set, lets us split the mapping \(m\) into one GAV mapping and one LAV mapping as follows:

a GAV mapping between the source query and the view, \(q_{1}(\bar x) \leadsto V_{m}(\bar x)\);
a LAV mapping between the view and the query on the global schema, \(V_{m}(\bar x) \leadsto q_{2}(\bar x)\).
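The split can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Mapping` class, the `split_glav` helper, the atom encoding and the `:owns` property are hypothetical names introduced here, not part of any OBDA system.

```python
from dataclasses import dataclass
from typing import Tuple

# An atom is a predicate name with its arguments, e.g. ("Collection", ("user", "doc")).
Atom = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class Mapping:
    body: Tuple[Atom, ...]         # q1: query on the source side
    head: Tuple[Atom, ...]         # q2: query on the global schema
    answer_vars: Tuple[str, ...]   # the shared variables (x bar)


def split_glav(name: str, m: Mapping) -> Tuple[Mapping, Mapping]:
    """Split a GLAV mapping m into a GAV and a LAV mapping through a view V_name."""
    view: Atom = (f"V_{name}", m.answer_vars)
    gav = Mapping(body=m.body, head=(view,), answer_vars=m.answer_vars)  # q1 ~> V_m
    lav = Mapping(body=(view,), head=m.head, answer_vars=m.answer_vars)  # V_m ~> q2
    return gav, lav


# Hypothetical GLAV mapping exposing a source relation as triples.
m = Mapping(
    body=(("Collection", ("user", "doc")),),
    head=(("triple", ("user", ":owns", "doc")),),
    answer_vars=("user", "doc"),
)
gav, lav = split_glav("m", m)
```

The head of the GAV mapping and the body of the LAV mapping are the same single view atom, which is what allows the two halves to be composed back into the original mapping.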

In this section, we redefine the previous GLAV mappings as GLAV mappings over heterogeneous sources, using this GAV-LAV splitting. Then, we will see how this new definition lets us minimize a mapping set.

Definitions of View based GLAV Mappings

Traditionally, in a GLAV mapping \(m\), \(q_{1}\) is a query on a single data source. Hence, if we have the two following sources:

\(\mathrm{Collection}(user, doc)\) containing the collection of documents of each user;
\(\mathrm{Topic}(doc, topic)\) containing the topic of each document.

It is not possible to define a GLAV mapping exposing the topics of each user's documents without revealing the users' collections. Such a mapping would be the following: \[\mathrm{Collection}(user, doc) \wedge \mathrm{Topic}(doc, topic) \leadsto \triple{user}{\irin{haveDocumentOn}}{topic}\]

We notice that the body of this mapping joins two different sources, which is not supported by the GLAV mapping definition.

A view-based GLAV mappings collection is a pair of mapping sets \(\mappings = (\mappings_{V}, \mappings_{G})\), where \(\mappings_{V}\) is a set of GAV mappings whose bodies are conjunctive queries on a single source, called the view definitions of \(\mappings\), and \(\mappings_{G}\) is a set of GLAV mappings whose bodies are conjunctive queries using predicates from the heads of the mappings in \(\mappings_{V}\), called the GLAV mappings set of \(\mappings\).

We can define a view-based GLAV mappings collection \(\mappings = (\mappings_{V}, \mappings_{G})\) expressing the previous mapping example. We start by defining the view definitions \(\mappings_V\), which contain the two following mappings:

for a view of the \(\mathrm{Collection}\) relation \[\mathrm{Collection}(user, doc) \leadsto V_{C}(user, doc)\]
for a view of the \(\mathrm{Topic}\) relation \[\mathrm{Topic}(doc, topic) \leadsto V_{T}(doc, topic)\]

Then, we define the GLAV mappings set \(\mappings_{G}\), which contains a single mapping: \[V_{C}(user, doc) \wedge V_{T}(doc, topic) \leadsto \triple{user}{\irin{haveDocumentOn}}{topic}\]
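To see what this collection produces, here is a minimal Python sketch evaluating it on toy data. The tuples in `collection` and `topic` are made-up sample data, and the function names are hypothetical. Sets of tuples stand in for the source extensions.

```python
# Made-up source extensions, one per data source.
collection = {("alice", "d1"), ("bob", "d2")}
topic = {("d1", "databases"), ("d1", "semantic web"), ("d2", "logic")}


def view_definitions(collection, topic):
    """M_V: each view is defined by a query on a single source."""
    v_c = set(collection)  # Collection(user, doc) ~> V_C(user, doc)
    v_t = set(topic)       # Topic(doc, topic)     ~> V_T(doc, topic)
    return v_c, v_t


def glav_mappings(v_c, v_t):
    """M_G: join the two views on doc and expose only user and topic."""
    return {
        (user, ":haveDocumentOn", t)
        for (user, doc) in v_c
        for (doc2, t) in v_t
        if doc == doc2
    }


v_c, v_t = view_definitions(collection, topic)
triples = glav_mappings(v_c, v_t)
# triples contains three triples: two for alice, one for bob.
```

The join across the two sources happens only in the body of the mapping of \(\mappings_{G}\), and the document identifiers never appear in the produced triples.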

Minimization of View based GLAV Mappings

Inspired by dipintoOptimizingQueryRewriting2013, we can take advantage of the view-based GLAV mapping representation to optimize the mapping set by minimizing it.

Mapping Creating Schema

pintoMappingDataHigherOrder
dipintoAcquiringOntologyAxioms2019a: in this article, the authors present mapping-based knowledge bases, which extend the OBDA formalisation by allowing the TBox (DL-Lite_R) to be induced from the data sources through GAV mappings.
giacomoHigherOrderDescriptionLogics describes the higher order DL language.

Mappings Generation

Look at the problem of generating mappings from spreadsheets (a question that came up during Duc's defense).
Look at the thesis of Ugo Comignani, who investigates the problem of repairing and rewriting mappings so that they respect some privacy constraints.

Mapping Analysis

Mappings saturation

Query

Higher Order Query

Non Positive Query

cimaQueriesInequalitiesDLLite

Regular Path Query

OPTIONAL in SPARQL

SPARQL queries built with BGPs and the OPTIONAL operator are not positive queries, in the sense that the negation operator is needed to express them. However, well-designed OPTIONAL queries are weakly monotone: under the OWA, adding positive atoms to the KB never loses an answer (each existing answer is kept or extended with additional bindings, and new answers may appear), and the answers are independent of the negative atoms in the KB.
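A small Python sketch makes this behaviour concrete for the query `{ ?x :name ?n } OPTIONAL { ?x :email ?e }`. The evaluator is hand-written for this single query, and the graph, property names and helper names are assumptions made for the illustration. It is not a SPARQL engine.

```python
def answers(graph):
    """Evaluate { ?x :name ?n } OPTIONAL { ?x :email ?e } on a set of triples."""
    result = []
    for (s, p, o) in sorted(graph):
        if p != ":name":
            continue
        emails = sorted(o2 for (s2, p2, o2) in graph if s2 == s and p2 == ":email")
        if emails:
            # The optional part matches: extend the answer with ?e.
            result.extend({"x": s, "n": o, "e": e} for e in emails)
        else:
            # The optional part does not match: keep the answer, ?e stays unbound.
            result.append({"x": s, "n": o})
    return result


def subsumed(mu, nu):
    """True if nu agrees with mu on every variable bound by mu."""
    return all(nu.get(var) == val for var, val in mu.items())


g1 = {("a", ":name", "Alice")}
g2 = g1 | {("a", ":email", "alice@example.org")}  # one more positive atom

before = answers(g1)  # [{'x': 'a', 'n': 'Alice'}]
after = answers(g2)   # [{'x': 'a', 'n': 'Alice', 'e': 'alice@example.org'}]
```

The answer computed on `g1` is no longer returned as such on `g2`, but it is subsumed by a more informative answer. This is the sense in which adding positive atoms never loses information.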

Update Query

Update queries have been studied in:

degiacomoPracticalUpdateManagement2017

Source to ontology

Determine whether there exists a query at the ontology level capturing a query at the source level. DBLP:conf/aiia/CimaLP19

Keywords Query

A trivial way to implement keyword querying on top of data sources that support keyword search is to create a mapping whose body is a source query taking a keyword as parameter. Such a mapping would of course use a binding pattern, since the keyword is an input of the source query. One may think of a mapping similar to:

SELECT id FROM t WHERE name LIKE '%$keyword%' -->  (<http://example/people/$id>, <hasKeyword>, $keyword)
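As a rough sketch of how such a mapping could be executed, the Python code below treats the keyword as a bound input and instantiates the triple template on each returned row. The table `t`, its sample rows and the function `keyword_mapping` are assumptions for the illustration. Only the standard `sqlite3` module is used.

```python
import sqlite3


def keyword_mapping(conn, keyword):
    """Evaluate the mapping body for a bound keyword and build the head triples."""
    # The keyword is passed as a query parameter, never concatenated into the SQL text.
    rows = conn.execute(
        "SELECT id FROM t WHERE name LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return [
        (f"<http://example/people/{id_}>", "<hasKeyword>", keyword)
        for (id_,) in rows
    ]


# Made-up source content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "Ada Lovelace"), (2, "Alan Turing")],
)

triples = keyword_mapping(conn, "Ada")
# [('<http://example/people/1>', '<hasKeyword>', 'Ada')]
```

The mapping cannot be materialized in advance, because its body has no answer until a keyword is supplied. This is exactly what the binding pattern expresses.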

It requires some investigation, but the following paper uses the schema to greatly reduce the computation needed to perform keyword querying. It is not based on an index, so I wonder how this technique could be translated to an OBDA setting, and whether it would have realistic performance.

DBLP:conf/semweb/Shiokawa21

Reasoning

Equality Dependency

Mediation

JUCQ for GLAV mappings rewriting

The rewriting technique proposed by Damian Bursztyn in bursztynOptimizingReformulationbasedQuery2014, DBLP:phd/hal/Bursztyn16 for DL-Lite may be applicable to GLAV mappings rewriting. The factorization of the rewriting, and so the structure of the rewriting plan, could then be found during the rewriting process. This factorization should be guided by statistics about the data returned by the mapping bodies.
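The factorization idea rests on the fact that join distributes over union: a join of unions of conjunctive queries (JUCQ) returns the same answers as the expanded union of conjunctive queries (UCQ). The Python sketch below only checks this equivalence on made-up relations standing for the answers of mapping bodies. It is not the cited algorithm, and all names are hypothetical.

```python
def join(r, s):
    """Join two binary relations on r's second column and s's first column."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


# Made-up answers of four mapping bodies.
q1 = {(1, 10), (2, 20)}
q2 = {(3, 10)}
q3 = {(10, "x")}
q4 = {(20, "y"), (10, "z")}

# UCQ form: a union of four joins, one per combination of bodies.
ucq = join(q1, q3) | join(q1, q4) | join(q2, q3) | join(q2, q4)

# JUCQ form: a single join between two factorized unions.
jucq = join(q1 | q2, q3 | q4)
```

Both forms return the same five pairs, but the JUCQ evaluates one join instead of four. Which factorization is cheapest depends on the sizes of the intermediate unions, hence the need for statistics.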

Join ordering

Systems

Mastro

Morph-RDB and morph-xr2rml

Morph-RDB is an OBDA system supporting both mediation-based and materialization-based approaches to query answering, using the R2RML mapping language (and therefore relational sources).

Morph-xr2rml

Obi-Wan

Ontario

Ontario uses RML mappings and supports heterogeneous sources.

endrisOntarioFederatedQuery2019

Ontop

Ontop uses R2RML mappings (or its native mapping syntax) and supports relational sources through JDBC. It can also be combined with Teiid in order to support heterogeneous sources.

calvaneseOntopAnsweringSPARQL2017

SPORKL

Reviews for ISWC

This paper presents an OBDA approach integrating heterogeneous data sources using GAV mappings. The authors allow each integrated triple to result from the data of several data sources together. This is an interesting feature for heterogeneous RDF integration. They propose a query answering technique relying on SPARQL federation, which supports a large fragment of SPARQL 1.1 (except property paths), going beyond state-of-the-art systems. They develop SPORKL, a prototype implementing the proposed query unfolding technique, which can rely on any existing SPARQL wrapper. SPORKL applies several types of logical optimization to prune the result of the query unfolding. The article contains detailed and clear explanations of the unfolding technique and reports fair query execution performance in its experiments, which could be reproduced once the code is released (after the double-blind submission).

The idea of the article is to reduce the problem of SPARQL query answering on a heterogeneous data integration in RDF, defined by uniform mappings for all data types (Fig. 1 (c)), to the same problem where the mappings for each data type are separated. Such a problem reduction allows reusing existing SPARQL wrappers for each type of source. It is a good idea, but it would have been useful to present the problem reduction and its motivation before the technical details, giving a global view of the approach.

I recommend including the following article in the related work, which presents results on heterogeneous RDF integration for Global-Local-As-View mappings, strictly extending GAV mappings, using separate mappings (Fig. 1 (b)):

Maxime Buron, François Goasdoué, Ioana Manolescu, Marie-Laure Mugnier: Ontology-Based RDF Integration of Heterogeneous Data. EDBT 2020

The technical explanations of the setting and of the approach (Section 3) are clear and well written. They could be shortened to leave room for a more detailed presentation of the contributions. The core idea is to move the join condition from the source level to the RDF level. This is done by transforming the mappings and the query, introducing fresh properties for cross-source joins. The triple in the head of a mapping joining several sources has to use a constant (IRI) as property (not a variable). However, in Section 3.1, one can read "Likewise, predicates could be generated from data instead of constant P", about mapping heads. Allowing variables at property position could be an interesting extension of the mapping setting, but the proposed approach does not support it. This point should be clarified.

Ontological reasoning is not mentioned, although the article is in the context of OBDA. The ontological reasoning problem is orthogonal to the one addressed in this work, but I wonder how query rewriting techniques, classically used to handle reasoning in mediation-based approaches, can work with the proposed query answering technique.

The experiments are conducted using the BSBM benchmark, which contains diverse SPARQL queries. By construction, SPORKL supports a large fragment of SPARQL. SPORKL shows mixed execution times compared to Ontop. This proof of concept appears promising when we compare its number of wrapper calls with that of Ontario. It would have been interesting to give the following information:

the minimum and maximum number of query answers, at least for the largest BSBM scale;
the number of hot and cold query executions;
the number of mappings joining the two sources;
the queries joining more than two sources.

In the last paragraph of Section 5.2, it is written that Q12 contains OPTIONAL, which is false according to the BSBM description (http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/ExploreUseCase/index.html#queries). Did you mean Q2 instead?

UltraWrap or Capsenta

Mediation-based query answering system integrating relational databases.

SequedaUltrawrapSPARQLexecution2013