Here at Sisense, we provide analytics for a wide variety of use cases and data sources. This gives us a unique vantage point for identifying trends in the database market, based on customer requests and usage patterns. The most notable trend we currently see is the massive growth of cloud-based data lakes like Amazon S3, Snowflake, and Google BigQuery.

Over the past business quarter alone, customer requests for connections to cloud data lakes doubled, with some specific sources even tripling. Compare that to only 20% growth for the dominant cloud analytical database, Amazon Redshift.

A data lake is simply a place for accumulating a ton of data, usually in semi-structured or unstructured form. The important differentiators of modern cloud data lakes include:

  • Infrastructure and cost separation for storage and compute
  • Storage of any data type
  • Infrastructure abstraction
  • Limitless scalability

What’s driving the growth of these technologies? It comes down to Big Data, ease of use, flexible pricing, and a growing supply of quality services.

DATA LAKES: Facts, Figures, and Forecast

Get the free whitepaper

The Big Growth Drivers

1. Big Data is real, and it’s democratized.

The cloud is creating lots of data, and the cloud is needed to analyze it. Looking back at the past decade, only the world’s largest companies had enough data to warrant a lake. These large businesses could also afford the high personnel and infrastructure costs that were required to set them up on technologies like Hadoop. Many service providers have launched to help manage these complex deployments, but the investment and manpower required have always been daunting.

Today, smaller businesses are also generating petabytes of information, often from web traffic or user data generated in the cloud. User data from a cloud product sold by a mid-size business can easily surpass tens of millions of records a day, bounded only by the granularity of the events being tracked. These companies understand the value of that data and need affordable ways of capturing it, creating demand for easier tools with scalable pricing models.

2. Cloud makes management easy.

“Data lake” has always been synonymous with Hadoop. While Hadoop is open source, a full deployment can be a multi-million-dollar project once you account for the required infrastructure, developers, consultants, and the time spent on setup and maintenance.

In the past few years, new cloud options have come online that offload the maintenance effort to the infrastructure provider, making it much more affordable for smaller companies. Amazon S3 has long been used for a wide variety of storage use cases but is increasingly serving as an analytical data lake thanks to its ease of management and new SQL interfaces. You can store anything in S3, and AWS takes care of auto-scaling, encryption, and many other utilities.

3. Pricing that makes sense for agile businesses.

Cloud data lakes have pricing models that let businesses get started cheaply. Many offer per-query pricing that removes the need for a large upfront investment. Amazon Athena and Redshift Spectrum, which are used for querying data in S3, both cost $5 per terabyte of data scanned.
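To make the scan-based model concrete, here is a minimal sketch in Python of what $5-per-TB pricing looks like. The workload sizes are hypothetical, and the math ignores per-query minimum charges:

```python
# Estimate Athena/Spectrum-style query costs at $5 per TB scanned.
# The workload numbers below are hypothetical, for illustration only;
# real billing also applies a small per-query minimum, ignored here.

PRICE_PER_TB = 5.00  # USD per terabyte of data scanned

def query_cost(bytes_scanned: float) -> float:
    """Cost in USD for a single query that scans `bytes_scanned` bytes."""
    tb_scanned = bytes_scanned / 1e12  # decimal terabytes
    return tb_scanned * PRICE_PER_TB

# A query scanning 50 GB of raw logs:
print(f"${query_cost(50e9):.4f}")  # $0.2500

# The same logical query against a compressed, columnar copy of the data
# might scan only 5 GB, cutting the cost tenfold:
print(f"${query_cost(5e9):.4f}")   # $0.0250
```

Because you pay by bytes scanned, storing data in a compressed columnar format such as Parquet directly reduces the per-query bill, which is one of the main levers for the cost control discussed below.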

Also from a pricing perspective, “separating storage and compute” fits more agile methodologies. With storage very cheap and most of the cost coming from querying, it makes more sense for businesses to first build the pipes to collect all potentially useful data, then make smart decisions later about what’s actually required, based on ad-hoc needs.

Compare that style of operating to being forced to decide upfront whether to collect certain data at all. An analyst can query previously collected data in minutes or hours. Adjusting the data collection flow, on the other hand, can be a long process involving multiple systems and engineering resources.

The downside of this model is that users will need to think about cost control when ad-hoc querying transitions into more persistent tasks, as with all serverless computing.

4. More accessibility and choice.

Maybe most importantly, over the past couple of years, new ways of querying data lakes like S3 have emerged, using SQL tools that interface nicely with analytics applications. These include:

  • AWS offerings like Athena and Spectrum
  • Open source tools like Apache Drill, Presto, and Hive
  • Integrated solutions like Snowflake and BigQuery, with native compute layers
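As a concrete example of the SQL-over-S3 pattern, the sketch below assembles the parameters for an Athena query using boto3, AWS’s Python SDK. The database, table, and bucket names are hypothetical placeholders, and the actual submission is shown commented out since it requires AWS credentials:

```python
# Sketch: submitting standard SQL against files in S3 via Amazon Athena.
# Database, table, and bucket names here are hypothetical placeholders.

def build_athena_request(query: str, database: str, output_s3_path: str) -> dict:
    """Assemble the keyword arguments for athena.start_query_execution()."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_path},
    }

request = build_athena_request(
    query="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    database="analytics",
    output_s3_path="s3://example-athena-results/",
)

# With AWS credentials configured, the query would be submitted like this:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**request)
# print(response["QueryExecutionId"])
```

The point of the pattern is that the query itself is plain SQL: the same statement could be pointed at Presto, Hive, or Spectrum with little or no change, which is what lets analytics applications treat the lake like any other SQL source.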

With so many standard-SQL solutions now available, users can optimize for what matters most to them among ease of management, performance, scalability, and price. Businesses will need to weigh those tradeoffs when they choose their solutions.

While the entire cloud data market will grow, cloud data lakes will take a larger portion of market share from relational databases in analytical use cases.

Beyond that, as needs become more diverse and advanced, we should expect more specialization in organizations’ data pipelines, as teams adopt different products to cover each of their critical requirements.

Certain data flows will demand row-oriented (tuple-based) relational databases for high-speed retrieval of complete records, while others will require columnar engines for bulk analytical queries. This underscores the value of starting in the cloud with a trusted infrastructure provider that makes a broad suite of interoperable tools available.
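The row-versus-column tradeoff can be illustrated in a few lines of Python. Using made-up order records, a row layout hands back complete records naturally, while a column layout lets an aggregate read only the one field it needs:

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# The "orders" data below is made up for the example.

rows = [  # row store: each record kept together -- fast full-record lookup
    {"id": 1, "customer": "acme", "amount": 120.0},
    {"id": 2, "customer": "globex", "amount": 75.5},
    {"id": 3, "customer": "acme", "amount": 300.0},
]

columns = {  # column store: each field kept together -- fast bulk scans
    "id": [1, 2, 3],
    "customer": ["acme", "globex", "acme"],
    "amount": [120.0, 75.5, 300.0],
}

# Retrieving one complete record is natural in the row layout:
order = next(r for r in rows if r["id"] == 2)

# An analytical aggregate touches just one column in the columnar layout,
# instead of walking every field of every record:
total = sum(columns["amount"])

print(order["customer"], total)  # globex 495.5
```

Real engines add compression, indexes, and vectorized execution on top, but this access-pattern difference is the core reason a pipeline often ends up feeding both kinds of store.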
