The amount of data being generated and stored every day has exploded. Companies of all kinds are sitting on stockpiles of data that could someday prove valuable. Until then, though, they don’t necessarily want to spend the time and resources necessary to create a schema to house this data in a traditional data warehouse.

Instead, businesses are increasingly turning to data lakes to store massive amounts of unstructured data. In this post, we’ll look at the differences between a data warehouse and a data lake, the current collision of these two paradigms, and how an organization can take advantage of a cloud-based data lake like an Azure data lake or an AWS data lake.

Analytics from your cloud data sources are key to transforming your business, but the reality of how most companies use them lags behind expectations.

Learn how to bridge the gap

The rise of data warehouses and data lakes

In the late ’90s and early 2000s, there was a massive movement for organizations to house their data in a data warehouse as a single source of truth. At the time, there were a number of un-networked legacy applications and software programs creating data that, when connected, could help drive valuable insights and predictions for an organization. You can quickly see how the notion of joining tables came about: The data housed in one environment could be tied to another through shared keys and lookup logic. But it required a thoughtful schema to connect the pieces.

However, organizations quickly learned that collecting and collating massive amounts of data into a single source of truth is incredibly difficult. One notable challenge arose when organizations adopted new systems that stored data outside the bounds of the predefined schema. This lack of flexibility made it hard to derive value from their data in a timely manner.

That rigidity of relational databases was soon challenged by the schema-less NoSQL databases. Rather than having an inflexible schema where every piece of data had to fit “just right,” new technologies began emerging, such as object stores and document databases, that could record information as documents or files. This shift in thinking about the structure of data also resulted in changing perceptions on how it was stored — and the data lake was born.

Traditionally, there were three main differences between data warehouses and data lakes; however, this has been changing over the past few years.

1. Structured versus unstructured data. Data warehouses require very structured, tabular data. Consequently, transactional text data fits best in a data warehouse. The data lake is different because it can accept unstructured data, which opens the door for organizations to store not just textual information, but also image, audio, and video files.

2. Schema on write versus schema on read. A direct impact of the first point is that the schema must be defined for data warehouses prior to information being written to it. All the columns and rows must be determined in advance so the structured data knows exactly where to go. With data lakes, that information can simply be contained in the data document. The data scientist can then create the schema on read, selecting the information that answers their question when the question is asked.

3. Hierarchical versus flat structure. Data warehouses are constructed from a number of tables that can be joined to query for insights, so there is a hierarchy to how the tables are connected. A product table might link to a transaction table, and yet another table might need to be joined in to show which warehouse the product is located in. A data lake is flatter: usually each record is stored as a self-contained document. With a data lake, a purchase order might contain the items ordered, the order ID, and the buyer’s address. Rather than being split across discrete tables, all of that information lives in a single flat structure, with no joins required.
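To make the schema-on-read and flat-structure ideas concrete, here is a minimal Python sketch. All field names and values are illustrative, not from any real system: purchase orders are stored as self-contained JSON documents, lake-style, and a schema is projected onto them only at the moment a question is asked.

```python
import json

# Lake-style storage: each record is one flat document holding the items
# ordered, the order ID, and the buyer's address together -- no joins needed.
raw_documents = [
    json.dumps({
        "order_id": 1001,
        "buyer_address": {"city": "Austin", "state": "TX"},
        "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-9", "qty": 1}],
    }),
    json.dumps({
        "order_id": 1002,
        "buyer_address": {"city": "Reno", "state": "NV"},
        "items": [{"sku": "A-1", "qty": 5}],
    }),
]

def total_units_by_state(documents):
    """Apply a schema at read time: pull out only the state and quantities,
    ignoring everything else in each document."""
    totals = {}
    for doc in documents:
        record = json.loads(doc)  # structure is interpreted on read
        state = record["buyer_address"]["state"]
        units = sum(item["qty"] for item in record["items"])
        totals[state] = totals.get(state, 0) + units
    return totals

print(total_units_by_state(raw_documents))  # {'TX': 3, 'NV': 5}
```

Nothing about the documents had to be declared in advance; a different question tomorrow would simply project a different schema onto the same raw files.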

Data warehouses and data lakes: Converging into the data cloud

You may think the two data storage philosophies are dramatically different, but the past few years have seen the technologies begin to converge. While organizations are storing their massive amounts of data in a data lake, increasingly we are seeing them abstract a data warehouse directly on top of the data lake when it comes time to ask questions of the data or to provide structure to the data.

To help explain this shift, consider the analogy of a fishing tackle box. The bottom of a tackle box is usually a wide-open section that serves as free-for-all storage for tools, lures, bait, string, weights, and more. This is similar to the data lake, where you can put almost anything into the storage system without much structure. Abstracting a data warehouse on top of that data lake is akin to the top insert of the tackle box, which has distinct sections to store specific items together: one section could be reserved for weights, another for hooks, and a third for lures. While all those items could be stored in the section below, the angler can choose to pull out specific items and organize them up top.

Both sections of the tackle box are there for a reason, just as data lakes and data warehouses have specific purposes. Like a data lake, which can accept any piece of data and is incredibly flexible, the bottom of the tackle box is an efficient way to store almost anything without discriminating about what it is or where it goes. And just as the organized top of the tackle box enables the angler to rapidly find a particular lure, the data warehouse provides optimizations around performance.

We’re seeing this paradigm shift play out across two of the major cloud providers. The classic AWS data lake was built with AWS Glue and Amazon S3; it could be queried with Amazon Athena and prepared, cleaned, and enriched with Amazon EMR (Elastic MapReduce), a service that lets users process huge unstructured datasets in parallel across a distributed cluster. But increasingly, AWS has offered the option to run Amazon Redshift directly on top of that data lake infrastructure, for example by querying S3 data in place with Redshift Spectrum.
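As a hedged sketch of what querying the lake looks like in practice, the snippet below assembles the arguments that Athena’s StartQueryExecution API expects via boto3. The database name, table, and S3 output location are placeholder assumptions, not real resources.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble the keyword arguments for Athena's StartQueryExecution API:
    the SQL text, the Glue database used to resolve table names, and the
    S3 prefix where Athena writes its result files."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

request = build_athena_request(
    sql="SELECT order_id, buyer_state FROM orders LIMIT 10",  # hypothetical table
    database="lake_db",                          # hypothetical Glue database
    output_location="s3://my-athena-results/",   # hypothetical results bucket
)

# With AWS credentials configured, this would be submitted as:
#   boto3.client("athena").start_query_execution(**request)
```

Note that the files in S3 are read in place; the table definitions live in the AWS Glue Data Catalog rather than in the files themselves, which is schema on read at cloud scale.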

There is a similar shift at Microsoft: The Redmond powerhouse has been revamping its SQL data warehouse into the Azure Synapse Analytics platform, which converges the data lake and the data warehouse. Microsoft now offers a complete data experience including data ingestion, data pipelines, and serverless SQL in addition to dedicated compute, Spark, and more. The future of data storage and data lake analytics offers many interesting directions to explore. We’ll likely see a rise in offline processing to cleanse, enrich, and model the data, as well as more and more engines on top that tap into it. This is all in an effort to empower companies to store huge amounts of data relatively inexpensively while staying agile about how they use it.

Sisense unlocks insights contained in your data lake

Whether you use an AWS data lake or an Azure data lake, Sisense can help you extract value from your data. Using a serverless query engine like Amazon Athena or serverless Azure Synapse SQL, Sisense can provide your users with actionable intelligence that you can seamlessly infuse into products, services, and experiences.

Or, using dedicated compute clusters like Amazon Redshift or dedicated Azure Synapse SQL instances, you can create a lean, flexible data warehouse layer in order to inspect the data. This gives your organization the ability to take advantage of the information in your data lake to infuse insights everywhere. The way companies are using and storing data is evolving every day — is yours?

Guy Boyangu is a co-founder and CTO of Sisense. For over a decade, he’s helped shape Sisense as a company and a product, weaving together emerging technologies and innovative interfaces, always with an eye toward what’s next.
