Authors: 
Kranas, Pavlos

Contributors: Patiño Martínez, Marta; Jiménez Peris, Ricardo

Abstract

Modern enterprises today handle a large amount of data coming from various data sources such as the company's website, different social networks, or the company's own information systems. Currently, there is no data management system that is capable of storing and processing these different types of data that those sources can generate. This means that, for instance, a financial institution uses a relational database to store the data of its clients' accounts in order to guarantee transactional semantics and consistency of the data. This organization may also have the need to use a NoSQL database to store and process the information on opinions or other relevant events for the company that occurs on Twitter-like networks. NoSQL data management systems allow this information to be stored with more flexible models than the relational ones and also tend to scale more effectively than relational databases. These properties are obtained at the cost of sacrificing transactional semantics in order to access data.

For several years there has been an explosion of different types of NoSQL data management systems (key-value, document-oriented, graphs etc.), which do not have a standard query language such as relational databases, but each datastore provides its own query language. This fact, added to the large amount of data that is generated every day, makes data processing in modern enterprises increasingly complex.

Polyglot query processing arises as a solution to this problem by providing a query engine capable of interacting with different types of data managers, even allowing the combination of the results of queries between the different types of data managers, for instance, allowing the performance of join operations. The language offered by the polyglot query engine should: (i) preserve the native expressiveness of the query language of each of the data management systems in order to maintain its expressiveness, and (ii) take advantage of the level of parallelism offered by the underlying data management systems in order to perform both parallel data retrieval and query processing and thus be able to work with large amounts of information.

The main objectives of this thesis focus on both points by developing a polyglot data manager that provides the CloudMdsQL query language to express native queries combined with SQL statements.

The polyglot analytical query engine has been implemented by extending LeanXcale's distributed query engine, which is a highly scalable relational database. The polyglot analytical query engine parallelizes the queries taking into account that each data manager can, in turn, be a distributed data manager. In addition, the polyglot analytical query engine implements optimization techniques, such as the binding join, which can improve the performance of selective join operations. The performance of the polyglot query engine has been evaluated using industry benchmarks (TPC-H) in a variety of scenarios. Another objective of this thesis has been the design of a novel architecture that allows analytics both on historical data and on data obtained in real-time without having to wait to carry out ETL-type processes. The architecture presented in this thesis allows federated queries to be processed on a current and historical dataset while maintaining data consistency, even while the data is being moved to the data warehouse responsible for the analytical processing.

Read more