Before big data became a big buzzword, biomedical research labs were already maddeningly familiar with the requirements imposed by large data sets. The massive data files they needed to share for their research were typically too big to move over commercially available networks efficiently.
"So, they used to ship hard drives," recalled Karen Ketchum, director of Enterprise Science and Computing Inc. (ESAC) in Rockville, Md.
In the case of two labs in Boston and Seattle that ESAC now assists in moving big data, that added up to hard drives traveling some 3,000 miles. The process has limitations aside from the time taken up by travel, Ketchum said. Staff members have to verify the integrity of the data before sending. And once a package is stamped and sealed, researchers and their staff can't control when (and sometimes if) the parcel arrives or what condition it arrives in.
As sophisticated as technology has become, data in transit is still a variable that researchers and enterprises cannot ignore, especially as they get into the thick of big data. Storage and compute capacity both play a role here, but the other pain point -- to round out what experts call "the big three" -- is the network. A network saturated by data in transit produces latency -- not the warp-speed analytics that were the reason for moving the data in the first place.
When building a big data-friendly infrastructure, however, network performance management is easier said than done, according to experts. There's no one-size-fits-all, plug-it-in-and-off-you-go solution. Still, the good news for CIOs is that they won't have to succumb to the shipping method -- or at least not entirely.
Fasp to the rescue
For Ketchum and ESAC, network performance was top-of-mind as they began building out a data coordinating center (DCC) and data portals for a new Clinical Proteomic Tumor Analysis Consortium, or CPTAC. Funded by a contract through the National Cancer Institute's Office of Cancer Clinical Proteomics Research, the project looked to centralize data from research centers where experts study tumor proteins in the hopes of uncovering diagnostic signatures for cancer.
Some of the biggest files the DCC manages are from mass spectrometry data, generated by a technique that measures the mass of an unidentified protein or peptide from a biological sample and compares this to a database of known peptides and proteins. The biological sample may be from normal tissues or from tumor tissues. Using these methods, "They try to get a sense for whether there are unique characteristics in a tumor tissue or cell based upon the protein components," Ketchum said.
Mass spectrometry files for a single sample aren't terribly large -- only about 5 GB to 10 GB in size. But a single sample from one tumor doesn't yield much information about the physiology of the disease. "What researchers typically want to do is look at all of the tumor samples in a set," Ketchum said. And that can create significant data growth. A single data set from colorectal cancer, for example, contains 90 tumor samples -- pushing the size of the data from between 5 GB and 10 GB up to 700 GB.
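As a rough check on those figures, the sketch below multiplies the per-sample range by the 90 samples in the set and estimates how long such a set would take to move at a sustained network rate. The 100 Mbps rate and the function names are illustrative assumptions, not figures from ESAC.

```python
def dataset_size_gb(samples, gb_per_sample):
    """Total size in GB for a set of equally sized samples."""
    return samples * gb_per_sample


def transfer_hours(size_gb, rate_mbps):
    """Hours to move size_gb at a sustained rate in megabits per second.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    megabits = size_gb * 8000
    return megabits / rate_mbps / 3600


low = dataset_size_gb(90, 5)    # 5 GB per sample
high = dataset_size_gb(90, 10)  # 10 GB per sample
print(f"90 samples: {low} to {high} GB")

# Assumed sustained rate of 100 Mbps (hypothetical, for illustration only).
print(f"700 GB at 100 Mbps: {transfer_hours(700, 100):.1f} hours")
```

The 700 GB figure falls inside the resulting 450 GB to 900 GB range. At the assumed rate, moving it would take roughly 15 and a half hours, and only if the link actually sustained that speed for the whole transfer.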
But transmitting large files using TCP, the transport protocol that HTTP runs on, could mean running up against certain limitations. Most notably, TCP relies on a congestion-avoidance algorithm, which works well on a LAN, a network that covers a small geographic area, but can create a bottleneck when moving data over a WAN, a network that spans large geographic distances.
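The effect of distance on a TCP transfer can be illustrated with a standard bound: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput cannot exceed the window size divided by the round-trip time. The sketch below is a minimal illustration, assuming a fixed 64 KiB window (the classic TCP limit without window scaling) and hypothetical round-trip times. Real TCP stacks adjust the window dynamically, so this is an upper bound rather than a model of any particular implementation.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound on throughput: one window of data per round trip."""
    bits_per_round_trip = window_bytes * 8
    rtt_seconds = rtt_ms / 1000
    return bits_per_round_trip / rtt_seconds / 1_000_000


WINDOW = 64 * 1024  # 65,536 bytes; assumed fixed window

# Hypothetical round-trip times for a LAN, a regional link and a
# cross-country link.
for label, rtt_ms in [("LAN", 1), ("regional", 20), ("cross-country", 80)]:
    print(f"{label:>13}: {max_throughput_mbps(WINDOW, rtt_ms):8.1f} Mbps")
```

Under these assumptions the bound drops from about 524 Mbps at 1 ms to about 6.6 Mbps at 80 ms, regardless of how much bandwidth the link itself offers.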
"As the distance between the point you're sending the data and the point you're receiving the data gets bigger, that causes the round-trip time on the network to increase," said Richard Heitmann, vice president of marketing for Aspera, an Emeryville, Calif.-based file transfer company that was acquired by IBM in December. "And as the round-trip time increases, TCP starts to think there's congestion on the network, so it starts to throttle down the rate at which it's sending data."
Researchers and businesses could consider WAN bandwidth optimization or WAN acceleration tools, which use techniques such as data compression or deduplication to reduce latency and bandwidth consumption. But another way to remove the bottleneck is with a high-speed network protocol such as the fasp technology developed by Aspera. The fasp technology leverages the existing WAN for faster and more efficient data transfers, explained Ketchum. For the DCC, investing in a high-speed network protocol made the most sense.
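Deduplication, one of the techniques mentioned above, can be sketched in a few lines: split the data into chunks, hash each chunk, and keep or send each distinct chunk only once. This is a minimal sketch assuming fixed-size chunks and SHA-256 digests. The function name, chunk size and sample data are hypothetical, commercial WAN optimization products differ in detail, and none of this describes how fasp works.

```python
import hashlib


def dedupe_chunks(data, chunk_size=4):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Returns (unique, order): unique maps a SHA-256 digest to chunk bytes,
    and order lists the digests needed to rebuild the original data.
    """
    unique = {}
    order = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        unique.setdefault(digest, chunk)  # store a chunk only the first time
        order.append(digest)
    return unique, order


payload = b"ABCDABCDABCDWXYZ"  # toy data with repeated 4-byte chunks
unique, order = dedupe_chunks(payload)

stored = sum(len(chunk) for chunk in unique.values())
print(f"original: {len(payload)} bytes, unique chunk data: {stored} bytes")

# Rebuilding from the digest list must give back the original bytes.
assert b"".join(unique[d] for d in order) == payload
```

The savings depend entirely on how repetitive the data is; files with little repeated content gain little from this approach.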
"We have a spectrum of users, and so this helps provide more stability," Ketchum said. Not only does the DCC provide proteomic research institutions a private portal to upload and download data files and space to share those data files, but once the data is ready for general consumption, it's made available through a second, public portal for the research community at large. Ketchum and ESAC needed to make sure multiple users could access and download the same files efficiently and simultaneously -- if need be. To date, 300 unique visitors across 15 countries have downloaded about 10 TB of proteomic data from the DCC.
The pains associated with transporting big data aren't confined to covering large distances. To overcome the limitations of commercially available network speed, the Aliso Viejo, Calif.-based Clarient Diagnostic Services Inc. took a different approach when tweaking the network. Clarient, a cancer diagnostics company now owned by GE, started using metro networks -- fiber optics that transmit data at high speeds -- and localized data centers. But, as Nevin Zimmerman, chief technology officer at GE IT Technology Solutions, pointed out, requirements are only growing.
"It will be impossible to have all data local," he said. "Network bandwidth will not keep up, so the need to break up big data in manageable pieces will be key."
The second part of this two-part story explores how businesses are overcoming network bottlenecks when moving data internally.