If you work with data, you’ve probably encountered the term “data lake” – either as a general trend in data and analytics or as a solution to a particular Big Data problem you’re trying to solve. Indeed, with the astonishing growth of data, a data lake is often seen as an attractive solution for storing and analyzing large amounts of raw data. But would it be a good fit for your organization? Let’s try to answer that question, starting with a definition.
First things first: what’s a data lake?
Since there’s a lot of confusion in this area (in a 2015 survey, only 1.12% of respondents felt that the concept is well defined and consistent at a detailed level), any discussion of data lakes has to start with a definition.
The first thing to understand is that the term “data lake” does not typically describe a particular product or service, but rather an approach to big data architecture that can be summarized as “store now, analyze later.”
In other words, unlike the traditional data warehouse approach, which entails imposing a structured, tabular format on the data when it is ‘ingested’, we would use a data lake to store unstructured or semi-structured data in its original form, in a single repository that serves multiple analytic use cases or services.
Data lakes are typically used to store data that is generated in a constant stream by high-velocity, high-volume sources, such as IoT devices, product logs or web interactions, when the organization needs a high level of flexibility in how the data will be used.
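To make “store now, analyze later” more concrete, here is a minimal Python sketch of applying structure at read time rather than at ingestion. It assumes newline-delimited JSON as the raw format, and the event fields (`user_id`, `event`, `ts`, `price`) are hypothetical names chosen for illustration, not taken from any particular product.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (newline-delimited JSON).
# The records do not share a fixed set of fields.
RAW_EVENTS = "\n".join([
    '{"user_id": "u1", "event": "ad_view", "ts": 1}',
    '{"user_id": "u2", "event": "purchase", "ts": 2, "price": 9.5}',
    '{"user_id": "u1", "event": "purchase", "ts": 3, "price": 20.0}',
])


def read_events(raw):
    """Parse the raw text lazily; no schema is imposed at this point."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def revenue_per_user(raw):
    """Use case 1: a read-time view that only cares about purchases."""
    totals = {}
    for rec in read_events(raw):
        if rec.get("event") == "purchase":
            user = rec["user_id"]
            totals[user] = totals.get(user, 0.0) + rec.get("price", 0.0)
    return totals


def event_counts(raw):
    """Use case 2: a different view over the same untouched records."""
    counts = {}
    for rec in read_events(raw):
        counts[rec["event"]] = counts.get(rec["event"], 0) + 1
    return counts
```

Both functions read the same raw records. Neither required the data to be reshaped into a table when it was stored, which is the flexibility the definition above describes.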
With this definition in mind, let’s go on to ask the five questions that you need to answer before deciding whether this is the way to go:
1. What type of data are you working with?
As we’ve described in the previous section, data lakes are best used to store streaming data, which has several unique characteristics:
- Unstructured or semi-structured
- Constantly being generated, in small bursts (e.g., every ad a user sees generates a new record with several dozen fields)
- Often accumulates quickly – tens of billions of records ‘weighing’ a total of hundreds of terabytes is a common workload for streaming data
If you’re working with this type of data, you should definitely consider a data lake, since the costs of structuring and storing it in a relational database will quickly become prohibitive.
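A quick back-of-the-envelope calculation shows how streaming data reaches that scale. The sketch below is illustrative only: the event rate (1,000 events per second) and the average record size (10 KB) are assumptions picked for the example, not measurements.

```python
SECONDS_PER_DAY = 86_400


def records_generated(events_per_second, days):
    """Total records produced at a steady event rate over a number of days."""
    return events_per_second * SECONDS_PER_DAY * days


def raw_size_terabytes(records, bytes_per_record):
    """Uncompressed size in decimal terabytes (1 TB = 10**12 bytes)."""
    return records * bytes_per_record / 10**12


# Assumed workload: 1,000 events per second for one year, 10 KB per record.
yearly_records = records_generated(events_per_second=1_000, days=365)
yearly_terabytes = raw_size_terabytes(yearly_records, bytes_per_record=10_000)
```

Under these assumptions, a single year produces about 31.5 billion records and roughly 315 TB of raw data, which is in line with the workload described above.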
However, if you’re mostly working with traditional, tabular information – e.g., data generated by financial, CRM or HR systems – you might want to stick to a data warehouse.
Either way, the two are not mutually exclusive: you can keep some data in your RDBMS and use a data lake for sensor or SaaS data that you would like to analyze separately. However, if you don’t have anything that even remotely resembles big or streaming data, a data lake might be overkill.
2. Do you know exactly what you’ll want to do with the data?
One of the great things about data lakes is the flexibility they provide when it comes to how the data will eventually be used. In a data warehouse, we would store the data in a certain structure that would best be suited for a specific use case, such as operational reporting; however, the need to structure the data in advance has costs, and could also limit your ability to repurpose the same data for new use cases in the future.
This brings us back to the core tenet of data lakes: store now, analyze later. If you’re still unsure whether you’ll be launching a machine learning project, or want to provide a higher level of flexibility for your future BI analyses, a data lake could be a good fit. However, if you’re only looking to generate a few predefined reports, a data warehouse would probably get you there faster.
3. How complex is your data acquisition process?
Adding new sources to your data warehouse can often be a resource-intensive process. If you’re constantly acquiring new data, particularly from unstructured or semi-structured sources, you might quickly find yourself dealing with serious ETL overhead in order to “cram” this data into a format that your data warehouse can work with.
If the costs of ingesting data into your data warehouse are becoming prohibitive, especially if this is leading you to consider giving up on some sources altogether, you should consider a data lake. It will allow you to store all the data with minimal overhead, and then extract and transform it only when you actually want to do something with it.
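As a rough illustration of storing first and transforming later, the following Python sketch appends raw payloads to date-partitioned files and only extracts fields on demand. A local temporary directory stands in for cloud object storage, and the `sensors` source, the `dt=YYYY-MM-DD` layout and the field names are all hypothetical choices made for this example.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(root, source, payload, ts):
    """Append one raw payload, untouched, under a date partition for its source."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    target = Path(root) / source / f"dt={day}" / "events.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as fh:
        fh.write(payload.rstrip("\n") + "\n")
    return target


def extract(root, source, fields):
    """Later, on demand: read a source's raw files and keep only the requested fields."""
    rows = []
    for path in sorted((Path(root) / source).glob("dt=*/events.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                rec = json.loads(line)
                # Fields missing from a record come back as None.
                rows.append({name: rec.get(name) for name in fields})
    return rows


def demo():
    with tempfile.TemporaryDirectory() as lake:
        # A new source is added without any upfront modeling.
        land_raw(lake, "sensors", '{"device": "a", "temp": 21.5, "fw": "1.2"}', 0)
        land_raw(lake, "sensors", '{"device": "b", "temp": 19.0}', 86_400)
        return extract(lake, "sensors", ["device", "temp"])
```

Ingestion here is a single append with no parsing, so adding a source costs almost nothing. The transformation work is deferred to `extract`, which runs only when someone needs the data.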
4. What type of tools and skills exist in your organization?
Building and maintaining a data lake is not the same as working with databases. While a database requires some level of DBA or IT involvement to maintain the infrastructure, with the rest being handled by business users (analysts or executives), a data lake typically requires a more significant investment in engineering, and specifically in big data engineers, who are in high demand and difficult to find.
If you don’t have these skills in your organization, transitioning to a data lake approach might prove difficult. In this case, you should consider sticking to your data warehouse until you manage to hire the requisite engineering talent. Alternatively, you can use a Data Lake Platform such as Upsolver (where, for full disclosure, I am the CEO and co-founder) to streamline the process of building and managing your cloud data lake, and to reduce the need to devote extensive engineering resources to the matter.
5. What is your strategy for data management and governance?
Both data lakes and data warehouses pose challenges when it comes to governance. In a data warehouse, the challenge is the need to constantly maintain and manage all the data that’s coming in, and to make sure it is added according to consistent business logic and a consistent data model. Data lakes, on the other hand, are often criticized as chaotic and impossible to govern effectively. Whichever approach you choose, make sure you have a good way to address these challenges.
Are you ready for the data lake?
It’s cliched but true that there is no “one size fits all” when it comes to data. Each organization and even each project is unique and needs to be approached with an open mind and a good understanding of the ever-evolving tech landscape.
You can use the five questions we posed above as a general guideline for deciding whether your company or organization should be thinking seriously about building a data lake. If you want to read an example of a company that did it successfully, check out this case study.
About the Author
Ori Rafael is the CEO of Upsolver, which provides a leading Data Lake Platform for AWS S3. Ori has a passion for making technology useful for people and organizations, and has previously held roles as the Head of Data Integration Platforms for the IDF’s elite technology unit, as well as senior management positions in the private sector.