This article can also be found in the Premium Editorial Download "CIO Decisions: Multi-hypervisor environments: Worth the work?."
Genealogy website Ancestry.com, an Internet veteran that traces its own roots back to 1997, has accumulated 11 billion records (and counting) to date, including birth and death certificates, marriage licenses, immigration documents, and millions of family trees. That translates into 4 PB of data -- a lot of it unstructured.
If the volume and variety of the Provo, Utah-based company's data don't meet the threshold for big data, there's no doubt its newest venture will: genealogy by autosomal DNA test. The service, which launched in May 2012 and has received mixed reviews online, offers to help subscribers discover their "cultural roots" by comparing their genome sequence against other sequences and genetic information to determine ethnicity and find potential matches.
"We have to have massive storage technology to store 4 PB of content, and then it requires us to deploy massively parallel processing [MPP] solutions to mine the data," said Scott Sorensen, senior vice president of engineering at Ancestry.com.
Meeting that need required reinventing Ancestry.com's traditional systems for processing and storing data. Three years ago, Sorensen and his team deployed the Apache Hadoop framework in place of their in-house MPP components. But the cutting-edge tool was not an overnight success by any means. To integrate the open source Hadoop into the existing environment, Sorensen and his team have had to bridge it with Ancestry.com's traditional data warehouse. They also lean on Hadoop service providers to help them over some of the technical hurdles. "[Hadoop's an] immature technology in many ways, with a lack of sophisticated higher-level tools," Sorensen said.
More from the engineer
Headed down a similar path? Scott Sorensen, senior vice president of engineering at Ancestry.com, offers three best practices.
- Clean your data. The most important initial step is making sure the data is clean, according to Sorensen. That's a challenge for Ancestry.com, which deals with significant data volume and variety. "Getting all of this data cleaned and into an architecture that allows you to do the data mining -- that's the real trick," he said.
- Build data science skills in-house. Data scientists are both hard to find and hard to hire, Sorensen said. That's why he believes businesses will have to build those skills in-house. "Look for engineers who are statistically inclined," he said. "They're going to be the ones who are going to be more comfortable in this world."
- Enable data access. It's important to get the right tools to the right people, Sorensen said. "Not everyone is going to have the same skill set," he said. "You want to provide the broadest set of tools to access the data."
Apache Hadoop framework adds scalability
When Sorensen started working for Ancestry.com more than 10 years ago, Hadoop had yet to enter the scene. One of his first tasks was building a search engine for the site; a few years later, he took on record-linking. Both relied on proprietary MPP technology, which (like Hadoop's MapReduce) processes data over multiple systems in parallel, a method that gives the architecture scalability. "And it worked," he said. "We tried different projects with it, and we succeeded about two-thirds of the time."
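The parallel pattern Sorensen describes can be illustrated with a small sketch. This is not Ancestry.com's code or Hadoop's API; it is a minimal, single-machine imitation of the map and reduce steps, with hypothetical record fields and worker threads standing in for separate systems.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(partition):
    """Map step: count record types within one partition of the data."""
    return Counter(record["type"] for record in partition)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Hypothetical data, already split into partitions. In a real cluster,
# each partition would live on a different machine.
partitions = [
    [{"type": "birth"}, {"type": "marriage"}],
    [{"type": "birth"}, {"type": "immigration"}],
    [{"type": "death"}, {"type": "birth"}],
]

# Each partition is processed independently, so adding workers (or
# machines) lets the same code handle more data.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_partition, partitions))

totals = reduce(merge_counts, partial_counts, Counter())
print(totals)
```

Because no partition depends on another, the work scales by adding partitions and workers, which is the property the article attributes to both MPP and MapReduce.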
As Hadoop started to emerge, however, it became clear that the open source framework -- built on cheap, commodity hardware -- could provide even better odds by adding more scalability to the system, Sorensen said. "We began to move projects to Hadoop, and we even restarted projects that simply failed with our own MPP technology."
Take record linking, for example. The goal is to "stitch" together documents that relate to a specific person so that when a subscriber searches for information on, say, Joe Smith, the results are targeted to the exact Joe Smith the subscriber is researching.
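A toy version of record stitching might look like the sketch below. The article does not say how Ancestry.com matches records, so the matching rule here (same normalized name and same birth year) and every field name are assumptions made purely for illustration.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def link_records(records):
    """Group record IDs that share a normalized name and birth year.

    The matching key is a hypothetical simplification; real record
    linking would weigh many more fields and tolerate errors.
    """
    groups = defaultdict(list)
    for record in records:
        key = (normalize(record["name"]), record["birth_year"])
        groups[key].append(record["id"])
    return dict(groups)


# Hypothetical records: two documents about one Joe Smith, one about another.
records = [
    {"id": "census-1", "name": "Joe Smith", "birth_year": 1880},
    {"id": "marriage-7", "name": "joe  smith", "birth_year": 1880},
    {"id": "census-9", "name": "Joe Smith", "birth_year": 1912},
]

linked = link_records(records)
for person, record_ids in linked.items():
    print(person, record_ids)
```

In this sketch the two 1880 documents are stitched to one person, while the 1912 census record stays separate, which mirrors the "exact Joe Smith" goal described above.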
"We were able to do some limited record stitching on our own MPP technology that allows us to provide discoveries to our customers," Sorensen said, but most of those connections were made by subscribers' poring through the genealogy site's rich database of public records. Once Sorensen and his team began using the Hadoop framework to stitch or link records together, the tables turned. Today, more than 60% of these discoveries are made by Ancestry.com, according to Sorensen.
Hadoop distributors are key to success
Ancestry.com uses Hadoop to provide three main functions, one of which is processing DNA. While DNA testing isn't a completely new concept for the company, autosomal DNA testing is. Less expensive and more accessible than ever before, autosomal DNA tests rely on data from a person's whole genome. Specifically, Ancestry.com works with the 700,000 different points along the genome that are critical for determining ethnicity, and with what Sorensen referred to as "family history science."
The service is a community effort. In addition to Ancestry.com's in-house IT and bioinformatics experts, the genealogy company relies on service providers to ensure consistency with both DNA sequencing and use of Hadoop. The actual genome sequencing is outsourced to a partner, which then sends Ancestry.com a data file that feeds directly into Hadoop. Part of the site's success with the Hadoop framework depends on insight from vendors such as Cloudera and MapR. These and other Hadoop distributors contribute to the open source project, but they also build on the framework and sell enterprise-class offerings to customers.
When asked what technology he needs that doesn't exist yet, Scott Sorensen replied that he needs better handwriting recognition software. "We have historical documents, and handwriting 200 years ago isn't like handwriting today," he said.
Today's technology can parse through an obituary and identify names, dates and places, as well as determine how two people who appear in the obit are related. "That's still a fairly semi-structured kind of digital record there," Sorensen said. "If you were to just take any text out of context and try to extract the semantic information, well, we're not quite there yet. We're rapidly improving our semantic extraction capability, but semantic extraction still has a long ways to go."
"We're processing thousands of samples every day at certain points with 700,000 SNPs [single-nucleotide polymorphisms] per sample, and that takes quite a processing cycle," Sorensen said. "And if that fails at any point in time, it means we have to start over with that batch. We just could not have that kind of risk exposure for our DNA pipeline."
The lineup of external partners is fluid, depending on the site's needs. Sorensen, for example, declined to specify which distributor provides what. "These vendors are kind of just jockeying to see who is going to be the provider of choice. They're all fantastic [distributors], but at any given point in time, one might have a specific feature we need for one of our solutions where another one doesn't," he said.
Relational database still relevant
The service providers bring additional functionality to the Hadoop framework, but Ancestry.com still relies on its data warehouse for its day-to-day analytics. While Hadoop is used as the initial repository for storage and processing, the data eventually is aggregated, rolled up and moved out of Hadoop and into a SQL Server-based data warehouse for analysis and mining. There, technicians using MicroStrategy tools can explore how subscribers are moving through Ancestry.com's website, analyzing where they are experiencing successes, as well as struggles. "All of these success events have gone into the data warehouse, and we're able to mine that and see where customers are having successes, what are the record types customers are looking for," Sorensen said. "It guides us in making our business decisions."
For Howard Dresner, president and founder of the consultancy Dresner Advisory Services and a former Gartner analyst, the combination of Hadoop and a SQL Server-based data warehouse is more proof that big data technologies still can't replace traditional systems -- at least, not yet. "We still need to do the basic blocking and tackling of data warehousing and BI [business intelligence] to some degree," he said. "At least, at this point."
Businesses embracing big data technologies, in other words, are doing so progressively. In fact, even Ancestry.com's warehouse is slated for an upgrade, according to Sorensen. "Our data warehouse is bursting at the seams," he said. "We're currently evaluating the MPP-type data warehouse options, which we'll be moving to soon."
Let us know what you think about the story; email Nicole Laskowski, Senior News Writer.