With the growing amount of data being generated, constantly changing and evolving, the concept of big data has become incredibly popular in recent years. It provides a set of new approaches and techniques for working with huge volumes of records.
Nowadays, information is one of the most important resources; it can help with decision making and the optimization of business processes. However, to get actual insights and unlock the potential of data, it is necessary to process it and discover the information hidden inside, which is the goal of data analytics. Data analytics platforms make it possible to manipulate raw data in order to find out what exactly it contains. These systems are complex and include multiple components; therefore, designing them requires a comprehensive analysis of the available options.
This book aims to describe how data analytics works for big data and how it is used in business. It gives an overview of existing technologies and approaches to building data analytics infrastructures. It also defines the points that should be taken into consideration when choosing the most suitable software solution for a particular use case.
The research is done by studying the architectural principles of big data systems and investigating the market of data analytics software. The result of this work is a composite report including a comparison of several technologies and a list of the criteria considered. The final report can be used as a guideline for choosing the most suitable technology for implementing an analytical platform in a broad variety of organizations.
The amount of data existing in the world is growing rapidly nowadays, with an increasing number of machines and systems producing it. Data are becoming more diverse, uneven and mixed; they are changing and evolving together with the overall technology landscape.
However, the existence of data by itself does not make any real impact; to take advantage of having it, it is necessary to process it and discover what is hidden inside. The true value of data lies in the information it contains, as it can give insights into business processes and customer behavior, help to predict changes and make prognoses about the future. Understanding the potential of data and leveraging it helps to optimize organizational activities and operations, which in turn brings competitive advantage and makes a business more successful and profitable. It also unlocks the opportunity to enable a data-driven approach, where decisions are made based on facts and verified information derived from them.
Effective leveraging of data analytics techniques requires an infrastructure for consuming, storing and processing data, followed by delivering the obtained results to end users. These systems usually consist of data storages, SQL engines and reporting tools integrated together to create a seamless pipeline that takes raw data as input and returns the results of analytic queries executed on it. Designing such systems brings a lot of challenges, as it is necessary not only to pick suitable components but also to ensure that they can work together.
This work is focused on SQL engines in particular, as they are de facto the key element of analytical systems: the SQL engine is the component that actually processes the data and helps to discover what it contains. SQL engines execute queries sent by end users to get specific insights answering their questions, or to provide them with findings that can help to do so.
Nowadays the market offers dozens of software solutions for building data analytics infrastructure for any workload, and the big variety of available options does not make the choice easier; on the contrary, it may be confusing or even overwhelming. Moreover, big data ideas are becoming more and more popular, and today leveraging them is already considered a necessity rather than a nice-to-have option.
The goal of this paper is to provide an overview of how big data analytics workflows function in general and to summarize the main aspects important for choosing the right engine capable of efficiently solving given tasks. The research is done by exploring the big data technology market together with studying the architectural principles of data analytics systems. The resulting report includes a table comparing the chosen engines, detailed explanations of the criteria considered in the examination and examples of use cases for each of them. This report can be used as a guideline for choosing a particular engine, or as a checklist for reviewing any other technology, in a broad variety of organizations.
1.1 What is big data
1.1.1 How did the big data concept appear
Humanity has been naturally generating all kinds of information throughout its history while getting to know the world, describing it and transferring knowledge to others. With the evolution of science, data became more and more diverse and was stored in different formats using different techniques. From tally sticks through books and magnetic tape, people finally came to the point where information became digital. It was the beginning of a completely new era of storing and processing data.
Nowadays, the amount of information in the world is growing rapidly. For example, 16.1 ZB of data was generated last year alone, and according to IDC the global datasphere will grow to 163 ZB by 2025.
Figure 1.1: Annual size of the global datasphere 
But what does big data exactly mean? Does it stand only for an enormous amount of data, or is there something more? The first appearance of the term big data dates back to 1999, when it was used in the article Visually exploring gigabyte data sets in real time published by the Association for Computing Machinery. Two years later, Doug Laney from Gartner published his paper 3D Data Management: Controlling Data Volume, Velocity and Variety, where he defined the so-called 3V (volume, velocity, variety), which would come to be the commonly accepted characteristics of big data. Those three attributes capture the main aspects determining whether some particular data can be considered big or not:
- the volume characteristic defines the amount of data, which is usually in the range of petabytes
- the velocity characteristic is the measure of how fast new data appear and are processed
- the variety characteristic represents the diversity of data and the intention to capture all data related to the decision-making process, whatever structure they have (structured, unstructured or semi-structured)
Quite often in the context of big data, a fourth V is mentioned, standing for veracity. Veracity refers to data quality and trustworthiness. It is not as common as the classical 3V, but it has been used more and more in recent years.
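The variety characteristic in particular has a direct practical consequence: records arriving in different structures must be mapped onto one common shape before they can be analyzed together. A minimal sketch of this in Python, using invented example records (the field names and values are purely illustrative, not from any real system):

```python
import csv
import io
import json

# Structured input: fixed columns, known types (hypothetical sample data).
structured = "user_id,amount\n1,9.99\n2,4.50\n"

# Semi-structured input: JSON objects whose fields may vary per record.
semi_structured = '[{"user_id": 3, "amount": 2.00, "coupon": "X1"}, {"user_id": 4}]'

def normalize(rows):
    """Map heterogeneous records onto one common shape (the 'variety' problem)."""
    return [{"user_id": int(r["user_id"]), "amount": float(r.get("amount", 0))}
            for r in rows]

csv_rows = list(csv.DictReader(io.StringIO(structured)))
json_rows = json.loads(semi_structured)

# After normalization, both sources can be queried or aggregated uniformly.
unified = normalize(csv_rows) + normalize(json_rows)
```

In a real system this normalization step is what ETL tooling does at scale; the sketch only shows why a schema decision is unavoidable once data of mixed structure must land in one store.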
In essence, big data is a new way of looking at data and processing them, along with a set of technologies necessary for this. Since 1970, when the concept of the relational database was proposed by E. F. Codd, it had been the standard approach to any manipulation of data (mostly storing them in a predefined model and querying them with SQL).
But as data became more diverse and their amount increased significantly, relational databases became insufficient for working with them effectively, as they were pushed to their limits. It turned out that classic approaches and techniques cannot be scaled to handle such loads. There are several reasons for this.
First of all, relational databases were not designed for handling data in the range of petabytes; that would require so-called vertical scaling, which means adding either more CPU cores or more memory. There are several problems with vertical scaling: it is expensive, not all vendors allow it, and it increases the risk of hardware failures, which can cause major system outages. Second, relational databases were created primarily for storing steady data, not for rapid growth; hence they face certain problems with handling data generated at high velocity (as most big data is). And finally, relational databases are able to work only with data tailored to the predefined model (or schema); and since the vast majority of records in big data are either semi-structured or unstructured, they cannot be used with relational databases or must be reformatted accordingly; either of these options would require a lot of resources or is completely impossible.
That was the moment when new technologies started breaking into the market to fill the gap created by users who needed to work with the changed data but lacked tools capable of doing it efficiently. The most important one, which is today de facto a synonym for big data, is NoSQL (the acronym stands for Not Only SQL). The NoSQL database concept was created as the opposite of the relational database model: it is designed to work with data without putting them into a model with tabular relationships. It was developed primarily in order to deal with the limitations of SQL databases, especially scalability and multi-structured data.
1.1.2 NoSQL vs. RDBMS
To analyze these two database approaches, it is important to specify the comparison criteria. To start with, it is definitely worth mentioning the CAP theorem formulated by Eric Brewer. According to this theorem, it is impossible for a networked shared-data system to simultaneously provide more than two out of the following three guarantees:
- consistency (meaning all nodes within the system have the same and up-to-date version of data)
- availability (guarantee that every request to the system will receive either success or failure response)
- partition tolerance (ensuring that the system keeps operating even when network failures split it into isolated parts, which is a precondition for scaling it horizontally)
The proven correctness of the CAP theorem implies a need for making trade-offs when choosing which of its aspects (and to what extent) will be provided in a certain system. And the way this problem is solved in relational and NoSQL databases differs significantly.
In the context of database technologies, data are consistent if they are completely the same across all existing instances at any moment in time. Database consistency refers to the requirement that any given database transaction must change affected data only in allowed ways, to ensure that any data written to the database are valid according to all rules defined in it (e.g. all integrity constraints are met).
A consistency model determines the rules for visibility and apparent order of updates. For decades the traditional approach to databases was based on the ACID model, where A stands for atomicity, C for consistency, I for isolation and D for durability. This model's properties are mainly focused on providing the maximal possible consistency. But as the CAP theorem claims, once consistency is ensured, it is possible to choose either availability or partition tolerance. Most relational databases choose availability, and consequently they cannot be distributed across several nodes. Despite that, some of them can be configured in a way that allows horizontal scaling but restricts their availability.
But as new database technologies came, new consistency models naturally appeared. One of them, called BASE (Basically Available, Soft state, Eventual consistency), was created in the late 1990s to capture the emerging design approaches for high availability. What it suggests is to be optimistic about consistency by not forcing every operation to necessarily put the database in a consistent state. By sacrificing this aspect, the BASE model allows the system to achieve a level of scalability which is unreachable within the ACID model.
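The "all-or-nothing" behavior that ACID guarantees can be illustrated with an embedded relational database. In the following sketch (Python's built-in sqlite3 module; the account table and amounts are invented for illustration), a transfer that would violate an integrity constraint is rolled back as a whole, leaving the database in its previous consistent state:

```python
import sqlite3

# In-memory database; a CHECK constraint stands in for an integrity rule.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
            " balance INTEGER CHECK (balance >= 0))")
con.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
con.commit()

try:
    with con:  # one atomic transaction: both updates commit, or neither does
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # overdraft violates the CHECK rule; the whole transfer rolls back

balances = dict(con.execute("SELECT id, balance FROM accounts"))
# balances is unchanged: {1: 100, 2: 0}
```

A BASE-style system would instead accept each write immediately and reconcile replicas later, trading this strict invariant for availability and horizontal scalability.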
Undoubtedly, it is impossible to say that either of these approaches is better than the other. Another important point to remember is that it is not always possible to replace SQL with NoSQL and vice versa, which follows from the different concepts at their foundations, coupled with the design considerations and adjustments they entail. The decision to go either with SQL or NoSQL should be made based on a detailed analysis of the application architecture and its purposes, along with use cases and possible limitations.
1.1.3 Big data architecture and workflow
As big data appears to be different from conventional data, processing it requires not only new technologies but also a special architectural design tailored to its specifics.
Each of big data's 3V is thus reflected in different aspects of its processing. High volume implies the necessity of powerful batch processing mechanisms for dealing with large amounts of data that have previously been saved to some storage and now need to be processed. For handling high-velocity data, the system should be capable of executing interactive queries that give a result based on all available information, including its most recent pieces. A wide variety of data sets requires an extensible storage system along with advanced techniques for accessing the data and integrating it into the whole processing pipeline.
Big data architecture is a model of how big data and other information assets will be captured, stored, managed and made accessible to various user groups and applications. In other words, it describes the way big data works as it flows through all components, from the raw data extracted from the data sources to the insights derived by end users using various analytical applications.
Figure 1.2: High-level view of big data architecture stages; although this schema includes the Hadoop framework, the workflow would be similar for any other components.
Big data architectural design is based on a simple idea: connecting original data sources and business intelligence applications through data warehouses and analytic platforms, all integrated together into a single system.
Of course, creating an architectural solution for a big data system from scratch is a far from trivial task. Nowadays, most models are built based on the same set of core concepts, with slight differences depending on the technology stack used for solving the particular task.
Any data analytics workflow is based on a simple sequence of actions performed on data, which can be simplistically described as acquire → prepare → analyze → report → act. Therefore, the main components of analytical systems usually include:
- a storage system such as a database or a file system
- an ETL engine for preparing the data for analysis
- an SQL engine for executing queries that return the desired information
- BI tools for reporting and visual representation of the gained insights
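The acquire → prepare → analyze → report sequence and the components listed above can be sketched end to end in a few lines. In this toy version, Python's embedded sqlite3 database stands in for both the storage system and the SQL engine, a list comprehension plays the role of the ETL step, and all event data and names are invented for illustration:

```python
import sqlite3

# Acquire: raw events, e.g. pulled from logs or an upstream source (sample data).
raw = [("2024-01-01", "click", 3),
       ("2024-01-01", "view", None),   # incomplete record
       ("2024-01-02", "click", 5)]

# Prepare (ETL): drop incomplete records before loading them into storage.
clean = [r for r in raw if r[2] is not None]

# Storage: an embedded database plays the role of the warehouse here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (day TEXT, kind TEXT, amount INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", clean)

# Analyze: the SQL engine executes the analytical query.
report = con.execute(
    "SELECT day, SUM(amount) FROM events GROUP BY day ORDER BY day"
).fetchall()

# Report: `report` would now feed a BI/visualization layer.
```

The orchestration challenges discussed in this section arise precisely because, at scale, each of these four commented steps is a separate distributed product that must be integrated with its neighbors.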
Although such a design might seem simple, building and orchestrating the whole system brings certain challenges. First of all, it is essential to define the scope of processing: to determine the desired outcome, understand which particular insights should be retrieved and find out which data should be analyzed to achieve this. The next step is choosing the components that perform best for the defined goals. Nowadays the market offers a wide variety of options when it comes to data analytics software; however, this does not make the choice easier. At the same time, it is important to take into consideration not only the functionality of separate components but also their ability to be integrated with each other to create a single pipeline. Also, the whole infrastructure should be optimized and adaptable for possible future changes. The reason is simple: if the current architecture becomes insufficient to solve new problems within a short period of time, the only solution is to change it, which will necessarily entail multiple changes in the whole system. Needless to say, this is a very difficult (if not impossible) and expensive procedure, which is worth avoiding at almost any cost.
1.2 What is Data Analytics
1.2.1 Data vs. Knowledge
Of course, people have always been interested in discovering what is hidden in their data, since raw data by itself cannot answer any questions or give any useful insights. The idea of deriving meaningful information and its further usage is pictured by the DIKW (data, information, knowledge, wisdom) pyramid, which represents the information hierarchy model.
Figure 1.3: DIKW pyramid 
This model demonstrates the fact that any data gets its meaning after processing with a purpose. Although data is at the bottom of the pyramid, that does not mean it is not important by itself. Data quality and accuracy are important because it is impossible to get correct and useful results based on incomplete, inaccurate or imprecise data. Moreover, results obtained by dealing with invalid data can be considered useless or even harmful.
1.2.2 Data Analytics in general
Data analytics is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. In general, the idea of data analytics is to derive meaningful insights from data and to convert knowledge into action.
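On the smallest possible scale, the inspect–cleanse–transform–model steps from this definition look like the following sketch (the daily sales figures are invented for illustration; real analytics would run such aggregations inside an SQL engine over far larger data):

```python
import statistics

# Inspect: hypothetical daily sales figures; None marks missing measurements.
raw = [120, None, 130, 125, None, 140]

# Cleanse: drop missing values so they cannot skew the result.
clean = [x for x in raw if x is not None]

# Transform/model: summary statistics turn individual records into information.
mean_sales = statistics.mean(clean)   # average over the valid days
spread = statistics.pstdev(clean)     # how much daily sales vary
```

The cleansing step is where the veracity concern from Section 1.1 shows up in practice: skipping it would silently bias every number derived downstream.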
Data analytics includes a wide range of applications: from business intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced analytics. It is important to note that these applications not only serve the purpose of direct data analysis but also solve various tasks such as collecting, integrating and preparing the data. At the same time, they deal with developing, testing and revising analytical models.
It would be a mistake to consider the whole data analytics field applicable only to big data; it is a general term for any type of processing that looks at historical data over time. Today, data analytics is tightly coupled with the term big data, mainly because it has evolved to be capable of working with data that can be considered big (according to the definition given before). This evolution, in turn, was caused by the growing market demand for systems that can tolerate the intense requirements of big data.
1.3 Big data analytics for business
As data analytics is used for solving many various tasks, it is often divided into four categories:
- descriptive (what happened)
- diagnostic (why it happened)
- predictive (what will happen)
- prescriptive (what should be done)
Each of these answers certain questions, and together they deliver complete insight helping to make more accurate business decisions.
Big data analytics plays a very important role in the business where it is implemented. By itself, it is only able to provide certain insights and forecasts derived from the raw data. When it comes directly to the decision-making process, there is a need for additional tools, which are usually covered by the umbrella term business intelligence (often referred to as BI). Since most businesses nowadays want to have both data analytics and business intelligence implemented (in order to get the most precise overview possible), and the market reacts to these needs, the two terms have become interchangeable.
The main reason why BI tools are not merged with data analytics technologies into a single term and framework is that the stages of the process of harnessing data are executed by users whose job positions require different scopes of knowledge. Wayne Eckerson, in his report Big Data Analytics: Profiling the Use of Analytical Platforms in User Organizations, defines two groups of users:
- casual users (executives, managers, front-line workers) who primarily use reports and dashboards that deliver answers to predefined questions
- power users (business analysts, analytical modelers and data scientists) who perform ad hoc queries against a variety of sources
BI tools are therefore designed primarily for casual users, providing interfaces for indirect work with data based on extracted insights. Data analytics tools are used by power users and give them more freedom in interacting with raw data and accessing all platform components from the inside. These tools are not interchangeable, since they interact with data in different manners and are insufficient for solving each other's problems.
The natural evolution of BI tools gave birth to their new generation, called self-service BI. The main feature of self-service BI tools is that they can be used by users with a limited technical skillset. Together with the ability to provide access to information stored anywhere, this brings significant value to the business by accelerating action based on new insights. It also allows new users to start working with the information as soon as possible, without needing to study the tools themselves or get familiar with the whole data analytics infrastructure. Another important point is that self-service BI tools decrease the IT department's workload: casual users don't need to ask it for data access, permissions, technologies, etc.