Any form of application used for data analysis is stringently dependant on its ability to retrieve queries fast. However, when working with larger or more complex datasets, as well as an increasing amount of concurrent users, the performance depends largely on the underlying analytical database – whether this is built into the application as part of a single-stack tool, or implemented via a separate data warehouse layer.
What Makes a Scalable Database?
Database scalability is a concept in analytics database design that emphasizes the capability of a database to handle growth in the amount of data and users. In the modern applications sphere, two types of workloads have emerged – namely analytical and transactional workloads. Planning for workload growth must take into account the operating system, database design and hardware design decisions.
The Traditional Approach: Denormalization
At the hardware level, vertical and horizontal scalability exist. Vertical scaling involves increasing the capacity of a single machine. The capacity of a single machine can be increased by adding more CPU cores, RAM and storage. Horizontal scaling involves adding more machines to cope with growing workloads. This article will only highlight database design decisions required for a scalable application.
At the database level configuration, schema design, indexing, and query design affect the capability of a database to scale. Each relational database system provides a set of parameters that can be tweaked to improve performance. Therefore it is advisable to check parameters that are used in the specific database system. Schema design offers the biggest opportunity for a system developer to make decisions that will affect the scalability and performance of a database. Transaction workloads require a normalized design while analytical workloads require a denormalized design. Each design will be briefly explained.
In a transaction workload the main objective is to process a large number of short UPDATE, INSERT and DELETE queries. Such databases are normalized to the third normal form (3NF). In this form business entities are broken into small tables. The business entities to be used in informing how to break up the tables come from a thorough analysis of the business. Normalization reduces redundant storage of data and enables tables to be joined to others. Indexing in a transaction workload does improve query response but slows update, insert and delete queries.
Therefore indexing can speed up or slow down a transaction database depending on how it is implemented. An iterative indexing strategy of implementing an index and testing its effect is appropriate. Although partitioning was developed for data warehouses it can still be beneficial in a transaction system especially where databases are big. A 3NF assures data consistency and accuracy but performance may be reduced due to the multiple joins involved. Trying to optimize query performance in a transaction system will affect the core functions of recording business data. Therefore it is advisable to migrate data analysis to a specialized system that uses an analytical workload design.
In an analytical workload the objective is to process few complex queries that arise in data analysis. This design is optimized for fast query performance. The data is denormalized meaning the business entities that were broken into different tables in the transaction system are joined together into one table. In this design, the two major table types are dimension and fact tables. Dimension tables contain descriptive attributes about the business. Fact tables contain measurable attributes that show the performance of a business. For example, in a sales process, the dollar amounts are the facts while region, customer type, and product category are the dimensions. In this design indexing and partitioning are very beneficial in improving the response times.
When working with either workload query optimization is very important. One optimization technique is avoiding the use of SELECT * instead, you should list all columns. Loops and correlated sub-queries slow down performance so they need to be used judiciously.
See Sisense in action:
ElastiCube: Improving Performance via In-Chip Optimization
The approach outlined above, while scalable, often means sacrificing a great deal of the granularity of the original data on account of the need to denormalize and index said data in advance. This also limits the ability of the typically less-technical, front-end users of analytical applications to perform ad-hoc analysis – as they can only access certain views or queries according to a predefined schema.
The ElastiCube present an alternative form of analytical repository designed for fast query retrieval and scale. It’s based on the notion that by employing In-Chip™ technology to maximize the capabilities of modern hardware, business intelligence tools can achieve optimal performance when processing analytical queries, even without sacrificing the granularity of the data or flexibility of the front-end analysis – since faster query retrieval eliminates the need for summary tables, indexes or denormalization.