A new scientific report on the “Polyglot Analytical Query Engine”, written by Pavlos Kranas with contributions from Marta Patiño Martínez and Ricardo Jiménez Peris, has been published. The report addresses these challenges by developing a polyglot data manager that provides the CloudMdsQL query language to express native queries combined with SQL statements.
For several years there has been an explosion of different types of NoSQL data management systems (key-value, document-oriented, graph, etc.). Unlike relational databases, they do not share a standard query language; instead, each datastore provides its own. This, combined with the large amount of data generated every day, makes data processing in modern enterprises increasingly complex.
Polyglot query processing arises as a solution to this problem: it provides a query engine capable of interacting with different types of data managers and even combining their query results, for instance through join operations. The language offered by a polyglot query engine should: (i) preserve the native expressiveness of each data management system's query language, and (ii) exploit the parallelism offered by the underlying data management systems, parallelising both data retrieval and query processing so that large amounts of information can be handled.
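To make this concrete, a CloudMdsQL query defines named table expressions against each datastore, either in SQL or in the datastore's native language, and then combines them with an ordinary SQL statement. The sketch below is illustrative only; the datastore aliases (`rdb`, `mongo`) and the table and collection names are hypothetical:

```sql
-- T1: an SQL subquery against a relational store (alias rdb, hypothetical)
T1(id int, name string)@rdb = ( SELECT id, name FROM customers )

-- T2: a native MongoDB query against a document store (alias mongo, hypothetical)
T2(id int, total int)@mongo = {* db.orders.find({}, {id: 1, total: 1}) *}

-- The two result sets are then combined with plain SQL
SELECT T1.name, T2.total
FROM T1, T2
WHERE T1.id = T2.id
```

Each named expression is executed by its own datastore, preserving native expressiveness, while the final SELECT is processed by the polyglot engine.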
The polyglot analytical query engine has been implemented by extending LeanXcale's distributed query engine, a highly scalable relational database. It parallelises queries while taking into account that each underlying data manager may itself be distributed. In addition, it implements optimization techniques such as the binding join, which can improve the performance of selective join operations. The performance of the polyglot query engine has been evaluated with an industry benchmark (TPC-H) in a variety of scenarios.

Another objective of this thesis has been the design of a novel architecture that enables analytics over both historical data and data obtained in real time, without having to wait for ETL-type processes. The architecture presented in this thesis allows federated queries to be processed over the combined current and historical dataset while maintaining data consistency, even while the data is being moved to the data warehouse responsible for the analytical processing.
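The binding join mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not LeanXcale's implementation: the engine first retrieves one side of the join, collects the distinct join-key values (the "bindings"), and pushes them as a filter into the query sent to the other datastore, so only matching rows are transferred. The function and datastore names here are hypothetical:

```python
def binding_join(left_rows, left_key, query_right, right_key):
    """Join left_rows with rows fetched from another datastore.

    query_right(keys) stands in for a remote query that returns only
    the rows whose join key is in `keys` (the pushed-down bindings).
    """
    # 1. Collect distinct join-key values from the already-retrieved side.
    bindings = {row[left_key] for row in left_rows}

    # 2. Push the bindings into the remote query (e.g. an IN-list filter),
    #    so non-matching rows never leave the other datastore.
    right_rows = query_right(bindings)

    # 3. Perform the join locally on the (now small) result.
    index = {}
    for r in right_rows:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r}
            for l in left_rows
            for r in index.get(l[left_key], [])]


# Hypothetical in-memory "datastores" for illustration:
customers = [{"cid": 1, "name": "ann"}, {"cid": 2, "name": "bob"}]
orders = [{"cid": 1, "total": 10}, {"cid": 1, "total": 5},
          {"cid": 3, "total": 7}]

def fetch_orders(keys):
    # Simulates the remote datastore applying the pushed-down filter.
    return [o for o in orders if o["cid"] in keys]

result = binding_join(customers, "cid", fetch_orders, "cid")
# Only orders with cid in {1, 2} are retrieved; cid 3 never crosses the wire.
```

The technique pays off when the join is selective: the bindings prune the bulk of the remote data at its source instead of shipping it to the query engine first.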