When “big data” comes up in conversation, Apache Hadoop is often not far behind. There’s good reason for that: Hadoop has a file system that doesn’t flinch at ingesting different data structures, as well as a massively parallel processing (MPP) system that can work through large data sets quickly. Plus, because it runs on commodity hardware and open-source software, it’s cheap and scalable.
These features make the Hadoop framework an attractive technology, especially for CIOs who are pressed to bring in more, different and new kinds of data but still keep IT costs down, said Brian Hopkins, an enterprise architecture analyst at Forrester Research Inc., based in Cambridge, Mass. Doing business as usual can’t effectively meet those demands.
“Homegrown enterprise data warehouses are notoriously expensive to scale. MPP data warehouse appliances … lower the cost of data warehousing by giving you this parallel architecture,” he said. However, that cost benefit comes with a hitch. “The problem is the cost per terabyte is still fairly high.”
Despite its appealing price tag, though, Hadoop isn’t the best technology for every big data problem. It’s still relatively young and immature, which means it has its fair share of kinks and quirks. How, then, do CIOs know when they should deploy the Hadoop framework? Here are three scenarios that played out at Ancestry.com and sent a clear signal to the genealogy website that it was time to embrace Hadoop, with additional comment from two experts familiar with a wide range of Hadoop deployments.
Sign #1: Need for beefed-up data processing performance without paying “first-class” fares. Until three years ago, Ancestry.com’s framework for processing data was built in-house, but as the volume of genealogical records, subscribers and services grew, it began running into scalability limitations. Ancestry.com’s IT department, which juggles 4 petabytes of data, eventually turned to Hadoop for help with processing the data. Even so, the genealogical website continued to use its SQL Server-based data warehouse; much of the data is fed into Hadoop and then moved into the data warehouse for the day-to-day analytics.
“We find that the best data architecture for us is to have a reservoir where we put massive amounts of data in Hadoop, and we store much smaller amounts of data in the data warehouse,” said Scott Sorensen, senior vice president of engineering at Ancestry.com.
Hopkins calls this “right-costing” performance, i.e., enabling the data warehouse to be used more efficiently and cost-effectively. Businesses evaluating the performance of their data warehouses find that a good portion of the data, sometimes as much as 60%, isn’t being accessed for analytics, he said. At a time when, according to a recent Gartner survey, data, analytics and digital technologies are taking on more prominence but IT budgets are remaining relatively flat, inefficiency can weaken a competitive edge.
“They’re getting into something like Hadoop as a way to drain these expensive warehouses of this cold data,” he said. “They keep it for warehousing and historical purposes, expose it for analytics and functions in Hadoop such as Hive, but they’re not paying for first-class airfare.”
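To make the “cold data” idea concrete, here is a minimal Python sketch of how an IT team might estimate what share of a warehouse is a candidate for offloading. It is an illustration only, not a method described by Hopkins or Ancestry.com: the table names, sizes, dates and the 180-day idle threshold are all invented, and a real audit would pull this information from the warehouse’s own query logs.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class TableStats:
    name: str            # hypothetical warehouse table name
    size_tb: float       # storage footprint in terabytes
    last_queried: date   # most recent date an analytics query touched the table


def split_cold(tables, today, max_idle_days=180):
    """Return (cold_tables, cold_share).

    A table counts as cold if it has not been queried within max_idle_days.
    cold_share is the fraction of total terabytes held by cold tables.
    The 180-day default is an assumption chosen for illustration.
    """
    cutoff = today - timedelta(days=max_idle_days)
    cold = [t for t in tables if t.last_queried < cutoff]
    total_tb = sum(t.size_tb for t in tables)
    cold_tb = sum(t.size_tb for t in cold)
    share = cold_tb / total_tb if total_tb else 0.0
    return cold, share


if __name__ == "__main__":
    # Made-up inventory; sizes are chosen so the totals are easy to check.
    today = date(2013, 6, 1)
    tables = [
        TableStats("orders_current", 20.0, date(2013, 5, 30)),
        TableStats("orders_2009", 35.0, date(2011, 2, 14)),
        TableStats("clickstream_archive", 25.0, date(2012, 8, 1)),
        TableStats("customers", 20.0, date(2013, 5, 31)),
    ]
    cold, share = split_cold(tables, today)
    print([t.name for t in cold], f"{share:.0%}")
```

In this invented inventory, two of the four tables have gone unqueried for more than 180 days and together hold 60 of the 100 terabytes, so the script flags them as offload candidates. The threshold is the main judgment call: a shorter idle window moves more data out of the warehouse but raises the odds that analysts will need it back.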
Sign #2: Need to support projects for new revenue, products or services reliant on big data. Today, Ancestry.com is embarking on a new service: autosomal DNA testing. Subscribers who participate will have an opportunity to discover potential extended family relationships through genetic matching. Although DNA testing wasn’t the company’s initial reason for turning to Hadoop, the service’s success relies heavily on it.
“Scratching the itch” of a specific business need, said Hopkins—in particular, one that would have been out of reach before Hadoop—is another driver for turning to this open-source framework.
“It’s new use cases for new revenues or new product innovations or new service innovations,” Hopkins said. “You see a lot of this in marketing and customer intelligence.”
One of the use cases is obtaining what is called a 720-degree view of the customer, or a view that integrates internal data from call centers and emails with external data from social media in a single location to provide a more meaningful customer profile.
Not every new business initiative reliant on data will call for Hadoop. Jeff Kelly, a principal research contributor at Wikibon.org and a contributing editor at SiliconANGLE, points out that the beauty of Hadoop is its ability to store and process large volumes of different kinds of data. The introduction of text, photos, Web logs and other data varieties from external sources into a business’s data management environment serves as a quick litmus test for deploying Hadoop. If the business is not incorporating these types of data, CIOs probably shouldn’t bother with Hadoop.
“If the majority of your data is structured and comes from internal sources, there’s really no reason at this point to put that into a Hadoop cluster,” Kelly said. “Traditional technologies handle that well … and there’s no reason to create another framework you don’t need.”
Sign #3: Need to broaden a business model. Ancestry.com’s foray into autosomal DNA testing isn’t simply about providing a new service; the genealogical research company is building a new arm of the business.
Ancestry.com’s analysis of DNA sequences means stepping into the field of bioinformatics. The company now has a small staff of bioinformatics experts who are tweaking and redeveloping algorithms developed in academia to handle the scale of Ancestry.com’s project. Breaking out in this new direction will potentially take genealogical research to the next level by connecting subscribers with distant relatives they may never have uncovered otherwise.
“We’re able to take this DNA data and not only use it to do the DNA matching,” said Ancestry’s Sorensen. “We’re able to combine that with information we have from 44 million [family] trees. That’s really powerful when we’re able to combine those two data sets.”
Helping the business evolve by leveraging data doesn’t necessarily mean tackling a whole new field, as Ancestry.com has done. It can entail redefining how a business has always done something through a new use of data. That usually requires businesses to dive deeper into more data, or to take on predictive analytics or data mining, all of which may be helped by the deployment of Hadoop.
Kelly agrees. “If your organization is looking to become more data-driven, and you’re not able to pull it together because your infrastructure won’t support the kind of analytics you’re wanting to do, these are signs that it’s time to start looking elsewhere,” he said. That search should probably start with Hadoop.
Nicole Laskowski is senior news writer for SearchCIO.com. Write to her at email@example.com.