OPTIMA: a hybrid OBDA System for Efficiently Querying Large Heterogeneous Data
The article presents OPTIMA, a query answering system over heterogeneous data sources. It is dedicated to big data and uses a data materialization approach, in which the data required to answer a query are transformed and loaded into a system that then evaluates the query. OPTIMA tackles the following question: which kind of system, between TABULAR and GRAPH ones, evaluates a given query on given data sources most efficiently? OPTIMA uses a deep learning method to make this prediction, which enables it to outperform Squerall, a system that materializes data only in the TABULAR model.
Remarks
General
- The article is about Ontology-Based Data Access, but it does not seem to be about ontology reasoning, so it is ultimately more focused on data integration for the Semantic Web.
- The article presents a query answering system for heterogeneous data integration, which requires presenting the mapping language from the data sources to the integration data model, in this case the RDF graph. Is it something like RML? Section 2.3 presents the data wrappers, which transform the data from the sources into the TABULAR or GRAPH data model in a static way. Does this mean that there is no mapping language at all? (See the first sketch after this list.)
- The data model prediction is based on training over a single instance of the data sources. What happens if the data sources are updated? What amount of work needs to be performed to keep the predictions up to date?
- The question raised by the article can be phrased as follows: "what is the best data model to materialize heterogeneous data sources into, in order to then efficiently evaluate a query on them?". This seems to me a rather theoretical question, in which the schema used to materialize the data plays a big role whatever data model is chosen. So the choice of the data wrappers has to be motivated, since they determine the schema of the materialization, and the focus of the question on the data model has to be better justified. In the end, the problem seems to be less about choosing the best data model than about choosing the best system to store the materialized data in, according to the kind of query each system is optimized for. So I think this article should not talk about the best data model, because that is not what it is about, but about the best system to evaluate the query: a system dedicated to TABULAR data or one dedicated to GRAPH data. (See the second sketch after this list.)
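To make the remark about wrappers concrete, here is a minimal Python sketch (not OPTIMA's actual code; the record layout and vocabulary are hypothetical) of what a hard-coded wrapper does: the transformation below is fixed in the code, whereas a mapping language such as RML would express it declaratively.

```python
# Hypothetical source record, e.g. one MongoDB document.
record = {"id": "p1", "label": "Phone", "price": 99.0}

# TABULAR materialization: one row per record, one column per attribute.
tabular_row = (record["id"], record["label"], record["price"])

# GRAPH materialization: one triple per attribute (predicate names are made up).
graph_triples = [
    (record["id"], "rdfs:label", record["label"]),
    (record["id"], "ex:price", record["price"]),
]

print(tabular_row)
print(graph_triples)
```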
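And to illustrate why the materialization schema matters independently of the data model, a small sketch on hypothetical data: the same source can be materialized under two different schemas within the TABULAR model alone, and a query touching two attributes of the same entity is cheap on the first schema but needs a self-join on the second.

```python
records = [{"id": "p1", "label": "Phone", "price": 99.0}]

# Schema A: one wide table, one row per entity.
wide = [(r["id"], r["label"], r["price"]) for r in records]

# Schema B: a narrow (subject, attribute, value) table -- still TABULAR,
# but accessing two attributes of the same entity now requires a self-join.
narrow = [(r["id"], k, v) for r in records for k, v in r.items() if k != "id"]

print(wide)
print(narrow)
```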
Experiments
- Experiment 1 uses BSBM, but there is no information about where the queries come from or about the size of the dataset (or the scale factor). The number of answers for each query should be added to Table 1. Also, Table 1 contains only a strict subset of the tables of the BSBM relational schema: why?
- The article presents a system for big data, but it gives no information about the size of the databases used in the experiments, which is very surprising. It should also discuss the scalability of the system.
- What is the LSTM model? The article should contain a reference for it. (See the first sketch after this list.)
- The dataset should be presented at the beginning of Section 3 rather than in Section 3.1, since it is used for both experiments 1 and 2.
- The times for OPTIMA in Table 2 and the best times in Table 4 are different (see Q5). Why is this the case?
- What does the query execution time cover: does it include the time used by the wrappers and the loading time? (See the second sketch after this list.)
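For the LSTM remark: an LSTM (Long Short-Term Memory) is a recurrent neural network architecture due to Hochreiter and Schmidhuber (1997), which would be the natural reference. Below is a minimal Keras sketch of the kind of binary classifier presumably meant, predicting the better back-end from a tokenized query; all sizes are hypothetical and this is not claimed to be OPTIMA's actual architecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=64),  # query-token vocabulary (made up)
    tf.keras.layers.LSTM(128),                                 # sequence encoder
    tf.keras.layers.Dense(1, activation="sigmoid"),            # P(GRAPH back-end is faster)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```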
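For the execution-time remark, here is the distinction I have in mind as a runnable Python sketch, with stub functions standing in for OPTIMA's components (all hypothetical): a reported "query execution time" may cover only the last phase, or all three.

```python
import time

def wrap_source():   # wrapper: read the source and transform it
    return [("p1", 99.0)]

def load(rows):      # load the materialized data into the chosen engine
    return rows

def evaluate(data):  # pure query evaluation on the loaded data
    return [row for row in data if row[1] < 100]

t0 = time.perf_counter(); rows = wrap_source()
t1 = time.perf_counter(); data = load(rows)
t2 = time.perf_counter(); result = evaluate(data)
t3 = time.perf_counter()

print(f"wrapping {t1 - t0:.3f}s, loading {t2 - t1:.3f}s, evaluation {t3 - t2:.3f}s")
```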
Others
- In the section about the experiments, where possible replace milliseconds with seconds (e.g., 3000 ms -> 3 s). It is easier to read and saves space.
- Page 6: there is "BSBM*"; is the star a missing footnote?
- Table 1: Q21 is missing.
- Page 9: SAPRK -> SPARK.
- You could have a look at the Obi-Wan OBDA system, which is one of the few OBDA systems that support NoSQL data sources. It could also be interesting to mention some work on polystores (e.g., ESTOCADA), where similar problems are studied.