Large data sets offer insights, require a tiered storage strategy

Enterprises are collecting large data sets because storage hardware is cheap. Understanding and storing data so users can access what's important is a challenge, however.

It's just plain hard to wrap one's mind around the large data sets enterprises have taken to collecting. Some of that spike in data sets' size is due to the low prices for storage hardware, which have eliminated the need for tough decisions about which data to keep, and for how long.

"A dramatic amount of [businesses] have more than a terabyte of storage," said Raymond Paquet, managing vice president at Gartner Research. "How much of that data do you need?" he respectfully asks.

It's almost as though IT departments "get a headache, buy more storage," Paquet said. They're buying because adding storage is less painful than trying to understand what the data means and whether it's worth even keeping around. The problem with this strategy is that users demand instant access to data all the time, "and if they can't get it, they say, 'IT isn't doing their job,'" he said.

Big data is the elephant in the room, and Gartner Research projects the beast will grow 800% over the next five years. What's needed now is for IT to deal with the problem by first understanding which data actually matters, then enacting a tiered storage strategy that makes the most important information more accessible than the rest.

Consolidating data silos

It's important to look at even small signals in the data, according to Paquet. Perhaps no one has taken this advice to heart more than Cindy Richey, technology services manager at the Colorado Department of Public Safety.

Richey helped the state consolidate silos of information about trooper dispatch, mobile data management and records management into a SharePoint application on a virtual private network (VPN). Now the Colorado State Patrol's 1,000 troopers can enter data from whatever canyon they're in, regardless of whether their cell phones work.

Analyzing the data and reacting to patterns led to a 20% drop in the number of fatal car crashes last year, Richey said. "It's all due to data and how the patrol analyzes data to make improvements." For example, "If we've had a number of accidents in an area, we'll work with engineers to redo a roadway."

The effort has led to something Richey calls intelligence-led policing, a model that analyzes data and enables the troopers to become proactive in enforcing laws.

"This week with St. Patrick's Day, it's a heavy drinking time," Richey said. "We can map locations in SharePoint and say, 'Here's Joe's bar, near where we had 10 DUI arrests last year,' and set up enforcement to prevent injury and fatal crashes."

Keep large data sets moving

Users expect access anytime from anywhere, yet 80% of unstructured data is untouched after 90 days, according to Gartner's Paquet.


Tiering is critical to reducing costs, Paquet said. That means tiering not only by storage type -- for example, solid-state drives for the most important data and tape for the 80% of unstructured data that goes untouched after 90 days -- but also across the many options in between, he said. If data isn't being used, keep moving it down to lower-level storage, he suggested.

"Energy consumption goes down with slower-speed drives. It's all about how do we use stuff more efficiently, more managed, in a more logical way," Paquet said.

However, performing an audit of applications and archiving the least-accessed information is increasingly complex, especially when automated tiering assumes that everything is critical.

Moreover, "archiving breaks," said an IT director from a major appliance manufacturer who asked to remain anonymous. "We'll typically have a couple of months archived; then the software will break; then we'll fix it and get another four months archived."

Dedupe large data sets for better control

Although archiving solutions continue to improve, experts say deduplication is already a standard way to reduce the amount of storage that large data sets require. Deduplication removes redundant copies of data and works best on text. Video and audio files generally don't dedupe well; MP3 and MP4 files, for example, are already compressed, so they contain little redundancy to remove.
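The mechanics aren't spelled out here, but the core of deduplication -- store one copy of each unique chunk and reference it wherever it repeats -- fits in a short sketch. The Python below is a generic, content-hash illustration, not how Simpana or any particular product implements it.

    import hashlib

    def dedupe(blocks):
        """Keep one copy of each unique block, plus a recipe to rebuild the stream."""
        store = {}    # content hash -> the single stored copy
        recipe = []   # ordered hashes standing in for the duplicate copies
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # only the first occurrence is stored
            recipe.append(digest)
        return store, recipe

    # Three blocks, two identical: only two unique blocks end up in the store.
    blocks = [b"case report 123", b"standard header", b"standard header"]
    store, recipe = dedupe(blocks)
    print(len(store), len(recipe))  # -> 2 3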

At Colorado's Public Safety department, Richey relies on CommVault Systems Inc.'s Simpana product to store or archive more than 40 TB of data, including DNA records and criminal case records. The state patrol alone has generated 7 TB, she said. "With the deduplication services available in Simpana, it really has compressed close to 60%."
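As a rough sanity check (an inference from Richey's figures, not numbers she gave): a reduction of close to 60% on roughly 40 TB of managed data would leave a stored footprint of about 40 TB x 0.4, or around 16 TB.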

"Our data is growing," Richey said. "Are we looking at too much? Are we too granular? How are we not granular enough? Today, I have three uniformed members dedicated to my unit, who collaborate all day long on data collection points and reports."

Let us know what you think about the story; email Laura Smith, Features Writer.

This was first published in March 2011
