You might remember the scene in “Rush Hour” when Chris Tucker attempts to teach Jackie Chan how to sing Edwin Starr’s 1970 hit “War”. The lyrics go something like this: “War! What is it good for?” (If you don’t know the bit, find the video embedded below).
This past week, I came across a Quora question that reminded me of the song. It asked: “What is Hadoop not good for?” In my opinion, the top answers come from Sameer Al-Sakran and Amanda Mork.
They were straightforward: Hadoop is not good for real-time or even ad-hoc analysis. It’s best to avoid it if you have small datasets (in the gigabyte-to-terabyte range). Deployment is easy, but maintenance is costly.
Interestingly enough, some of these answers are about two years old, and one could argue that not much has changed. Well, except for one thing: most people are now realizing that “Not Everything is Hadoop-able”.
In a recent blog post, Ben Lorica, Chief Data Scientist at O’Reilly Media (@BigData), highlights solutions that can tackle large datasets on a single server. He also notes that, when it comes to analytics, most companies are in the multi-terabyte range, not the petabyte range. This, by the way, aligns well with the research from EMA we referred to at the beginning of the year: the sweet spot for Big Data Analytics is in the terabyte range. Ben went on to refer to research from Microsoft and some of the work we have done on the “In-Chip” Analytics side.
Bottom line: before thinking about “scaling out”, consider “scaling in”. By “scaling out”, I mean building parallel capacity across multiple nodes. By “scaling in”, I mean utilizing the full memory hierarchy and capacity of a single node. As we showed at the last Strata Big Data conference, what we performed on one node would have required anywhere between 20 and 40 clustered Hadoop machines. I’m also posting below a quick slide deck that takes you through the details of what “In-Chip” Analytics does.
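To make the “scaling in” idea concrete, here is a toy sketch of my own (not code from the slide deck): it stores data column-by-column in contiguous typed arrays, the kind of cache-friendly layout single-node analytics engines rely on, and aggregates a million rows in one sequential pass on one machine. All names and numbers are illustrative.

```python
# Toy illustration of "scaling in": columnar, single-node aggregation.
# The column names and sizes below are made up for the example.
from array import array

N = 1_000_000

# Columnar layout: each column is one contiguous typed array,
# so a scan walks memory sequentially and stays cache-friendly.
revenue = array("d", (float(i % 100) for i in range(N)))  # doubles
region = array("b", (i % 4 for i in range(N)))            # small ints, 4 regions

# Group-by-sum over both columns in a single sequential pass --
# no cluster, no shuffle, just one node's memory hierarchy.
totals = [0.0] * 4
for r, v in zip(region, revenue):
    totals[r] += v
```

The point of the sketch is the data layout, not the loop: once the working set fits in one node’s RAM, a sequential columnar scan avoids all of the coordination and network cost that a 20-to-40-node Hadoop cluster would spend on the same aggregation.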