Hadoop and Data Warehousing: Rivals, Dream Team or the New B-List?

on April 17th, 2017

Spare a thought for that grande dame of the data management world, the data warehouse. Over the past two decades, while every other system and piece of software was dragged through a gazillion iterations or abandoned completely for a newer model, this old stalwart stood firm. She might have had the odd nip, tuck or quiet facelift on the sly, and she might have inspired some less impressive imitations, but nothing stole her spotlight for long.

Until now. Since Hadoop came sashaying onto the stage, there’ve been mutterings that this bright new star is snaffling up some of the best data management roles. Roles that, a few short years ago, data warehousing would have had in the bag.

But is it really time to send data warehousing into retirement? Does Hadoop even want to step into her shoes? And who else is waiting in the wings?

Let’s take a closer look at the repertoire of these reported rivals.

What’s Behind Data Warehousing’s Lasting Appeal?

Put simply, data warehousing means aggregating data from disparate sources into a central repository for reporting and analysis. The reason it’s been the de facto solution for so long is this: as the data is aggregated, it’s put through the crucial Extract-Transform-Load (ETL) process that harmonizes it all into a “single version of truth”, smoothing out inconsistencies and restructuring the data to fit the warehouse’s predetermined schema.

The result is a complete, reliable, coherent source of data that’s ready for querying by Business Intelligence software.
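To make the Extract-Transform-Load idea concrete, here is a minimal Python sketch. It assumes two made-up source systems (a CRM and a web shop) that record the same kind of sale with different field names, date formats and units; the source names, field names and the `sales` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime

# Extract: records as they might arrive from two hypothetical systems.
CRM_ROWS = [
    {"customer": "Acme Ltd", "sale_date": "17/04/2017", "amount_usd": "1200.50"},
]
SHOP_ROWS = [
    {"cust_name": "ACME LTD ", "ts": "2017-04-18", "cents": 99900},
]


def transform_crm(row):
    """Map a CRM record onto the (customer, sale_date, amount_usd) schema."""
    return (
        row["customer"].strip().title(),
        datetime.strptime(row["sale_date"], "%d/%m/%Y").date().isoformat(),
        round(float(row["amount_usd"]), 2),
    )


def transform_shop(row):
    """Map a web-shop record onto the same schema (cents become dollars)."""
    return (
        row["cust_name"].strip().title(),
        datetime.strptime(row["ts"], "%Y-%m-%d").date().isoformat(),
        round(row["cents"] / 100, 2),
    )


def run_etl(conn):
    """Transform both sources and load them into one predetermined table."""
    conn.execute(
        "CREATE TABLE sales (customer TEXT, sale_date TEXT, amount_usd REAL)"
    )
    rows = [transform_crm(r) for r in CRM_ROWS]
    rows += [transform_shop(r) for r in SHOP_ROWS]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(conn)

# Once harmonized, one query covers both sources.
totals = conn.execute(
    "SELECT customer, SUM(amount_usd) FROM sales GROUP BY customer"
).fetchall()
print(totals)  # [('Acme Ltd', 2199.5)]
```

The two spellings of the customer name and the two date formats collapse into one row format, which is the “single version of truth” in miniature.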

Just What is Hadoop, Anyway?

It’s an open-source software framework for users who need to work with massive data sets. Using distributed storage and processing, it gives users a way to store, clean up and process huge volumes of data.

To handle thousands of terabytes of data at speed, the Hadoop Distributed File System (HDFS) splits files into blocks and spreads them across large numbers of commodity hardware nodes, keeping several copies of each block (three by default). Even if several nodes stop working due to a technical failure, the system keeps functioning. This means there’s a low risk of data loss – a real fear for businesses that perform very involved analysis using swathes of data.
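Why does losing a few nodes not lose data? A toy Python model shows the principle. It is not how HDFS actually chooses nodes (the real system also takes racks into account); the node names, block counts and the random placement are assumptions made for illustration. The only detail taken from HDFS is the default replication factor of three.

```python
import random

REPLICATION = 3  # HDFS's default replication factor


def place_blocks(num_blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes, chosen at random."""
    rng = random.Random(seed)
    return {
        block: set(rng.sample(nodes, replication))
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if holders - failed_nodes
    }


nodes = [f"node-{i}" for i in range(10)]
placement = place_blocks(num_blocks=100, nodes=nodes)

# With three replicas on distinct nodes, losing any two nodes
# can never remove every copy of a block.
survivors = readable_blocks(placement, failed_nodes={"node-0", "node-1"})
print(len(survivors), "of", len(placement), "blocks still readable")
# 100 of 100 blocks still readable
```

A block only becomes unreadable when every node holding one of its copies is down at the same time, which is why spreading replicas widely keeps the risk of loss low.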

No wonder Hadoop is turning heads in an industry seeking reliable ways to run big data processing tasks.

Plus, it’s open source – and that’s an enormous draw. It’s highly scalable and endlessly customizable. The scope for incorporating bespoke applications, queries and approaches is unlimited. The complexity of your data mining can grow with the intricacy and quantity of your data.

Where Does it Outshine Data Warehousing?

Big data is getting humongous, and many large data warehouses have tried to cope with soaring storage needs by embracing custom multiprocessor appliances. But these price out all but the largest organizations.

Hadoop, meanwhile, nimbly handles snowballing data. Users can then combine it with a data warehousing layer or service built on top, whether that’s a SQL query engine like Presto, a SQL-like layer such as Hive, or a NoSQL option like HBase.

But that doesn’t mean Hadoop is poised to replace a relational database system or data warehouse. In fact, as we’ll see in a moment, it’s probably better seen as a stellar supporting act than as a replacement for the lead.

So Are They Rivals?

Not at all. Simply put, they aren’t playing the same role.

Data professionals tend to see Hadoop as a handy addition to their existing data warehouse architecture, and one that can save them a heap of cash. By migrating chunks of data to Hadoop, you reduce the pressure on relational databases, making your data warehouse platform more affordable and letting you expand without ballooning your budget.

Used like this, Hadoop becomes something that reduces total cost of ownership of your data warehouse, not something to replace it.

How Does it Make a Data Warehouse’s Performance Better?

Data warehouses are expensive to build, expensive to run and very expensive to grow. Your storage needs climb with every extra source and every extra year of data you collect, and your costs climb with them.

What’s more, these vast data sets mean users can’t generally tap into the full scope of the warehouse every time they want to run a query – their hardware can’t handle it. This means using analytical data marts to give individual departments in the business access to data in specific areas of the data warehouse.

It’s an imperfect system. Not only does it limit the scope of analyses that users can perform on data, it’s also a ticking time bomb.

As more data pours into the warehouse, each data mart can become so overloaded that it’s too unwieldy to use. You could ease up pressure on hardware by restricting access even more, but that means giving each department a narrower and narrower selection of data to inform their analyses. Not great for rigorous business intelligence!

Hadoop doesn’t suffer from these setbacks. The barrier to entry is low, and it’s open to incremental investment. It can be built up over time; you can keep adding huge volumes of data without racking up the costs to match.

For companies just getting into data – without the legacy investment in mainframe or Unix-based warehouses – this scalable, incremental framework can be pretty appealing. But Hadoop’s a framework, not a polished solution. It’s great at handling giant data sets, but it was never intended to replace warehousing.

Ah, So Hadoop and Data Warehousing are the Ultimate BI Dream Team?

Whoa, hold up a moment. Using Hadoop with your Data Warehouse tackles problems of data storage. But storing your data is only one element of business intelligence.

Broadly speaking, a functional, usable BI system should be made up of five components:

  • Somewhere to centrally store data
  • Tools that partition this data, for example by geography, operations or whatever your business requires
  • Tools that prepare this data for analysis
  • An ETL data engine that helps you to process this data quickly
  • The “front end” where all this data is visualized – typically, some kind of dashboard

Even when Hadoop and Data Warehouses work together at their best, they only address the first of these components. And now, innovations in BI technology that provide all five components at once are swiftly relegating the dream team to the B-List.

So… Who IS Swanning In to Steal the Limelight?

As we’ve seen, data warehouses and Hadoop make a winning double act. But to perform fast, high-functioning data analysis from multiple sources, you don’t really need either of them.

Right now, we’re witnessing the rise of a new star.

Holistic “single stack” solutions cut out the need for relational databases by linking directly with source data, wherever it comes from, and performing ETL functions on the spot. The best of these work by creating a metadata (abstraction) layer, used to query data across any number of tables, drawn from unlimited sources in any combination of formats.

The right approach circumvents issues that usually come with huge datasets by building on smart, resource-saving approaches like columnar databases and in-memory processing. The first streamlines the process by reading only the columns a query actually needs, while the second keeps that working data in the computer’s main memory (RAM) rather than fetching it from disk again and again. That means you get full, unfettered access to all your data, without needing a computer the size of the Hollywood Hills to process it.
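The columnar idea is easiest to see side by side. The Python sketch below is an illustration only, not how any particular BI product stores data: the table, its field names and the “fields read” counter are all invented, and the counter is simply a stand-in for the amount of data a scan would have to pull in under each layout.

```python
# The same three records, stored two ways (hypothetical sample data).
ROW_STORE = [
    {"region": "EMEA", "product": "A", "revenue": 120.0},
    {"region": "APAC", "product": "B", "revenue": 80.0},
    {"region": "EMEA", "product": "B", "revenue": 50.0},
]

COLUMN_STORE = {
    "region": ["EMEA", "APAC", "EMEA"],
    "product": ["A", "B", "B"],
    "revenue": [120.0, 80.0, 50.0],
}


def revenue_by_region_rows(rows):
    """Row layout: each whole record is fetched, so every field is counted."""
    totals, fields_read = {}, 0
    for row in rows:
        fields_read += len(row)
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals, fields_read


def revenue_by_region_columns(columns):
    """Column layout: only the two columns the query names are fetched."""
    regions, revenue = columns["region"], columns["revenue"]
    totals = {}
    for region, amount in zip(regions, revenue):
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(regions) + len(revenue)


row_totals, row_reads = revenue_by_region_rows(ROW_STORE)
col_totals, col_reads = revenue_by_region_columns(COLUMN_STORE)
print(row_totals == col_totals, row_reads, col_reads)  # True 9 6
```

Both layouts give the same answer, but the columnar version never touches the `product` column. On a table with hundreds of columns and a query that names three of them, that gap is what makes keeping the working set in memory realistic.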

An All-Singing, All-Dancing Superstar

Even better, using a holistic BI system cuts out the need for extra layers of software that render data comprehensible to non-technical users.

As we’ve seen, where both data warehousing and Hadoop fall short is that they are strictly “back end” solutions – they only deal with housing data.

To make the data accessible for your front-end user, you still need to introduce and integrate all kinds of applications that allow business teams to extract and visualize the insights they need.

And while Hadoop is open-source, it’s not “free”. Getting it to do what you want, and integrating it with your data warehouse, your tools to process and prepare data for analysis, and your front-end dashboard interface, requires either a huge commitment of resources or a third party to manage it for you. Plus, of course, you still need to invest in the commodity hardware it needs to run.

With a decent single stack alternative, you can query source data, process it fast using an ETL data engine, and have this generate new reports and dashboards in one step. Now that’s the kind of innovation that challenges the future of data warehousing, Hadoop or no Hadoop.

So, yes: it might be time for this (inter)national treasure to take a step back and let the next generation of data tech take over. But not because Hadoop is stealing her crown; because single-stack technology is making storage-only data solutions redundant for BI.

Want to learn more about running an effective BI solution without a data warehouse? Click here to download your free whitepaper!
