
Community cloud could fix data crunching dilemma for cancer research

Building a community cloud for cancer research, make-it-yourself data and a new report on emerging tech: The Data Mill reports.

Does it take a village to build a big data analytics infrastructure? That's how the University of Chicago's Robert Grossman, chief research informatics officer and professor of biological sciences, and the Open Science Data Cloud, a provider of petabyte-scale cloud resources for researchers, are looking at it in their quest to build a biomedical community cloud.

Considering the numbers, it doesn't take long to see why. A patient's whole genome sequence, including tumor and matching normal tissue samples, generates roughly one terabyte of data, Grossman said during a recent StrataRx presentation. As the cost of sequencing whole genomes continues to plummet, Grossman and other researchers are challenging themselves to hit the one million mark. One terabyte of data multiplied by one million whole genomes puts the size of genomic cancer data into exabyte territory. And that's just one project; another is The Cancer Genome Atlas (TCGA), which uses techniques such as sequencing to find mutations that cause cancer. The federally funded TCGA is expected to grow to 2.5 petabytes in the next few years, Grossman said.
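The arithmetic behind "exabyte territory" can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units and treats Grossman's "roughly one terabyte" as exactly one terabyte, and the variable names are illustrative rather than anything from the project.

```python
# Decimal (SI) storage units, in bytes. Assumption: the article's figures are SI.
TB = 10**12
PB = 10**15
EB = 10**18

# Figures quoted in the article (the per-patient size is "roughly" 1 TB).
bytes_per_patient = 1 * TB
patients = 1_000_000

# Total size of one million whole-genome data sets.
total_bytes = bytes_per_patient * patients
total_eb = total_bytes / EB

# Compare with TCGA's expected size of 2.5 petabytes.
tcga_bytes = 2.5 * PB
ratio_to_tcga = total_bytes / tcga_bytes

print(f"One million genomes: {total_eb} EB")
print(f"That is {ratio_to_tcga:.0f} times the expected size of TCGA")
```

Under those assumptions, one million genomes come to one exabyte, which is 400 times TCGA's expected 2.5 petabytes.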

For many research institutions, data of that size is becoming a problem. Moving petabytes around, or building an infrastructure to manage and analyze it all, is neither easy nor cheap. "It's restricting access and analysis of data to larger medical research and sequencing centers that have the resources and experience to tackle problems of this size," Grossman said. "That's just not ideal."

That's why Grossman helped launch the Bionimbus Protected Data Cloud (PDC), a collaboration between the Open Science Data Cloud and the University of Chicago's Institute of Genomics and Systems Biology, Center for Research Informatics and Institute of Translational Medicine. The PDC is a cloud-based infrastructure built to manage, analyze and provide researchers easy access to large genomic data sets. It was constructed using OpenStack and open source software, including Hadoop, as well as a few custom-built components. It launched earlier this year and is the only community cloud where researchers authorized by the National Institutes of Health can access TCGA data.

Now Grossman wants to scale out this Infrastructure as a Service model and interoperate with other community science clouds and even commercial cloud service providers, such as Amazon, to create a biomedical cloud community commons. That means figuring out what the right governance structure and sustainability models are, he said. Grossman and a working group from the Open Cloud Consortium are hoping to answer these questions.

Make it yourself data

As if data created by sensors, social media posts and finger swipes wasn't enough, now there's make it yourself (MIY) data. It's a Kaiser Fung term -- he's a statistician, adjunct professor at New York University and author of the book Numbersense.

Think of Panda Express, he said in a recent webinar on big data and marketing. Receipts from the restaurant chain inform customers they're eligible to receive a free entree item if they fill out an online survey. At the end of the survey, a unique code is generated, which is required to redeem the offer. When the code is used, customers effectively "close the loop," giving Panda Express marketers insight into the effectiveness of the campaign.

"What has changed nowadays is the engine powering the survey," Fung said. No longer does a campaign such as this need someone to enter the data or even to analyze it, he said. Instead, cheap, easy-to-use Web tools, such as SurveyMonkey, do the heavy lifting. Think of it as a foot in the analytical door because, as Fung pointed out, richer insights require integrating data from different sources.

Gartner hype cycle

Gartner's Hype Cycle for Emerging Technologies is out. The hype cycle plots how technologies mature -- from the rise and fall in popularity to mainstream adoption. This year, artificial intelligence (smart machines, cognitive computing, etc.) takes center stage. Expect to hear about things such as 3-D bioprinting, biochips and autonomous vehicles.

Jackie Fenn, analyst for the Stamford, Conn.-based consultancy, said early adopters are embracing the technology in three ways: "augmenting humans with technology," such as an employee wearing a computing device; evaluating where machines can replace humans, such as automating the customer service experience with a virtual assistant; and figuring out more sophisticated ways humans and machines can work together, such as a mobile robot helping a warehouse employee move boxes.

"Machines are becoming better at understanding humans and the environment … [and] humans are becoming better at understanding machines," Fenn said.

Welcome to The Data Mill, a weekly column devoted to all things data. Heard something newsy (or gossipy)? Email me or find me on Twitter at @TT_Nicole.
