Avoid data latency with Hadoop, Sears CTO says
Date: Oct 04, 2012
At Sears Holdings Corp., the chief technology officer is reducing data latency by using the Hadoop data processing system, which allows for faster, more effective analysis of real-time sales data.
In this video, filmed at the Fusion 2012 CEO-CIO Symposium in Madison, Wis., SearchCIO-Midmarket.com Site Editor Wendy Schuchart sits down with Phil Shelley, CTO at the Hoffman Estates, Ill.-based company, to discuss how using Hadoop has reduced data latency in his organization and how data science is leading the way for better decision making.
Shelley explains that Hadoop allows businesses to centralize all their data and analyze it immediately, decreasing data latency. He notes that for increased efficiency, companies could outsource their data to firms specializing in deeper analysis.
Read a partial transcript from this interview below, and watch the Q&A for more about reducing data latency and improving your data science program.
Wendy Schuchart: Can you explain how Hadoop will benefit midmarket sales by reducing data latency?
Phil Shelley: Data latency -- and data in general, how companies handle data -- is very interesting. Most companies, midmarket and larger, clearly have many production systems that capture data, transactional data, from production lines, from customers, point of sale, websites or whatever.
Today, many people will put that data away into an enterprise data warehouse. The process of capturing it [and] putting it away into usable form where you can do analytics on it can take quite a bit of time. Sometimes, it has to be a production batch job, [completed] overnight. Maybe there's a chain of events or there are ETL [extract, transform, load] functions involved. Often, there's ETL involved where you do a copy of the data, transform it, put it away somewhere else. That takes time. It takes days. I've seen examples where customer data, before you can analyze it, is four, five, six days old and even weeks old before it's transformed into a form that you can actually use.
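The batch ETL pattern Shelley describes can be sketched roughly as below. This is a minimal illustration, not Sears' actual pipeline; all function and field names are hypothetical.

```python
# Hypothetical sketch of a classic batch ETL flow: data is copied out
# of the transactional store, transformed into the warehouse schema,
# then loaded into a separate system before anyone can query it.
# Each stage is typically a scheduled batch job, which is where the
# days of latency Shelley mentions accumulate.

def extract(transactional_rows):
    """Copy raw point-of-sale rows out of the source system (first copy)."""
    return list(transactional_rows)

def transform(rows):
    """Reshape rows into the warehouse schema (a second pass over the data)."""
    return [{"sku": r["item"], "revenue": r["qty"] * r["price"]} for r in rows]

def load(warehouse, rows):
    """Write the transformed rows into the warehouse (a second copy)."""
    warehouse.extend(rows)
    return warehouse

sales = [{"item": "A100", "qty": 2, "price": 9.99},
         {"item": "B200", "qty": 1, "price": 24.50}]
warehouse = load([], transform(extract(sales)))
```

The point is structural: the data is copied and reshaped twice before it is queryable, and every stage waits on a batch schedule.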
Years ago when they started to be available, the holy grail of enterprise data warehouses was that they were going to be big enough that you could put everything in them in one place. You'd never have to move the data. You could ask those questions immediately. But of course, they're too expensive for the size of data today. It's just so expensive to have a massive enterprise data warehouse.
So, what happened here with Hadoop, it's an open source tool [that] uses very-low-cost commodity hardware. Now it's not about the tool. It's not about whether you can afford to buy one of these big data warehouses. It's now about your imagination, because you can now keep all of that data. You can grab the data that's produced from those transactional systems immediately, at very high speeds. You can ingest the data immediately and you can analyze it immediately. There's no copying anywhere. There's no ETL.
I often talk about it in our company, about a no-ETL zone, because I don't want to have any latency in the data. I don't want to have to copy it anywhere to use it. You just ingest it in one place. Hadoop lets you do that because it's so huge. You can actually ingest a large amount of data and keep all the history and, at the same time as ingesting it, still analyze it. That just has never existed before.
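The "no-ETL zone" idea can be contrasted with the batch pattern in a short sketch: raw events are appended once to a single store (as with files landing in HDFS) and analytics run directly over the raw records, with any reshaping done at query time. The store and field names here are illustrative assumptions, not a real Hadoop API.

```python
# Hedged sketch of ingest-and-analyze-in-place: one append-only store,
# no transform step, no second copy. `raw_store` stands in for a shared
# cluster store such as HDFS; names are hypothetical.

raw_store = []

def ingest(event):
    """Append the raw event as-is; it is queryable the moment it lands."""
    raw_store.append(event)

def revenue_by_item(store):
    """Analyze in place, computing the needed shape at read time."""
    totals = {}
    for e in store:
        totals[e["item"]] = totals.get(e["item"], 0.0) + e["qty"] * e["price"]
    return totals

for event in [{"item": "A100", "qty": 2, "price": 9.99},
              {"item": "A100", "qty": 1, "price": 9.99},
              {"item": "B200", "qty": 1, "price": 24.50}]:
    ingest(event)

totals = revenue_by_item(raw_store)
```

The design trade-off is that transformation cost moves from write time (batch ETL) to read time (query over raw history), which is affordable when, as Shelley says, the store is cheap enough to hold everything.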
One of the biggest struggles for CIOs is not so much having the platform for the data to reside on, but understanding the context and meaning of the data. Does Hadoop give any assistance on that, or do you still need to have a data science person?
Absolutely. It just allows you to keep it all in one place. So, the whole area of data science is going to be an area of massive growth. There are startup companies, of course, in that space, now: Mu Sigma, Opera and others that are really focusing on data science. They're hiring just pure scientists, pure mathematicians, to help companies dig into their data and find [something] of value from it. I don't expect most companies to do that themselves, because it is going to be a really special skill, and some of them might. But these startup companies, I think, have a great mission in life and that is to help many, many companies that have a lot of data [to] explore it, unearth the value in it, and then use it for business generation.