
Like Americanos? Then you'll love distributed storage

Distributed storage doesn't have to be confusing. All you really need to do is think about a coffee shop. The Data Mill reports.

The world is becoming increasingly distributed, and you can blame big data, that messy, complex, hard-to-define buzzword. But to understand how distributed systems work, you don't need complicated, messy explanations. In fact, according to Tim Berglund, all you really need to do is think about a coffee shop.

Take distributed storage, for example. "It's really easy when you've got one processor, one thread, one disk that you own entirely," Berglund, director of training at DataStax Inc., a NoSQL Cassandra database distributor, said during an O'Reilly Media webinar. "But that's not going to work for us. We're talking about a world where we can't get away with that."

He listed four types of distributed storage: read replication, sharding, consistent hashing, and distributed file systems (e.g., HDFS, the Hadoop Distributed File System). And then Berglund, a self-proclaimed lover of the Americano (espresso diluted with hot water), invited listeners "into the coffee shop."

Berglund first described a non-distributed storage system (the single server, single master database) in coffee shop terms. It goes something like this: You walk into the coffee shop and introduce yourself to the barista, who takes your name and order and dutifully writes them down. Now every time you enter the coffee shop, the barista recognizes you, looks down at her paper and knows exactly what you want.

Berglund said this is the preferred storage method "because it's possible to pack all kinds of sophistication into that little drum and there are all kinds of things you can have the database do when it's in control of the whole world." But in this model, the environment remains pretty much the same (same old barista, same old you). And as data proliferates, the single server, single master database becomes inefficient, he said.

That's where distributed storage can be useful. With read replication, the head barista keeps control over the master customer/order list but provides a copy to "helpers" who can pitch in to meet increasing demand. "Replication solves a problem. It solves scale, so we're able to grow our business," Berglund said.

But read replication also introduces problems. "The biggie," said Berglund, is inconsistency. If you change your usual order, the head barista's master list is updated, and she'll then have to make new copies for each helper, introducing a time lag. "This is what we call an 'eventually consistent' system," Berglund said. "It will catch up … but I have some period of time that's going to elapse before I can see that."
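The time lag is easier to see in a few lines of code. The following is a minimal Python sketch of the coffee shop, not anything Berglund showed; the class and method names (`HeadBarista`, `Helper`, `replicate`) are invented for illustration. Writes go only to the head barista, helpers serve reads from their own copies, and a helper's copy stays stale until the next round of copying.

```python
class Helper:
    """Read replica: answers reads from its own copy of the list."""

    def __init__(self):
        self.orders = {}

    def read(self, customer):
        return self.orders.get(customer)


class HeadBarista:
    """Primary: owns the master list and accepts every write."""

    def __init__(self):
        self.orders = {}
        self.helpers = []
        self.pending = []  # changes not yet copied to the helpers

    def write(self, customer, drink):
        self.orders[customer] = drink
        self.pending.append((customer, drink))

    def replicate(self):
        # In a real system this happens asynchronously; calling it by hand
        # here makes the lag visible.
        for customer, drink in self.pending:
            for helper in self.helpers:
                helper.orders[customer] = drink
        self.pending.clear()


head = HeadBarista()
helper = Helper()
head.helpers.append(helper)

head.write("Tim", "Americano")
head.replicate()

head.write("Tim", "Latte")    # Tim changes his usual order
stale = helper.read("Tim")    # the helper has not received the new copy yet
head.replicate()
fresh = helper.read("Tim")    # the helper has now caught up
```

Between the second write and the second `replicate()` call, the helper still serves the old order. That window is the "period of time that's going to elapse" in Berglund's description.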

Still, in Berglund's experience, the benefits of read replication usually outweigh the drawbacks of inconsistency. "You just have to know it's happening," he said.

If demand at the coffee shop continues to rise, a single head barista isn't going to cut it. One solution is to divvy up the master customer/order list among several baristas. This is similar to sharding -- or breaking up -- a database. (For the record, read replication can be applied to shards, so this system might have several baristas each with their own set of helpers.) "This is a time-honored means of scaling a relational database," Berglund said. "This is the built-in scale mechanism that's used by Mongo."

Like read replication, sharding isn't a perfect solution and can be inflexible. Back at the coffee shop, Berglund divided the customer/order master list alphabetically by customer name, giving each barista her own section of names. If a majority of the queries reflect how the dataset is broken up -- what is Tim's preferred order -- sharding is an efficient distributed storage method. But if a majority of the queries don't reflect how the data has been broken up -- who has Americano listed as their preferred order -- sharding becomes inefficient. "You begin to compromise on the scale advantages that sharding gives you because you have to run the query on all of the shards," Berglund said.
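The two kinds of queries can be sketched in Python. This is an illustrative toy under assumed details: the three alphabetical ranges, the sample customers and the function names are all made up, and each lookup also returns how many shards it had to consult.

```python
# Hypothetical split of the alphabet among three baristas (shards).
SHARD_RANGES = [("A", "H"), ("I", "P"), ("Q", "Z")]
shards = [dict() for _ in SHARD_RANGES]


def shard_for(customer):
    """Pick the shard whose letter range covers the customer's first initial."""
    first = customer[0].upper()
    for index, (low, high) in enumerate(SHARD_RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no shard covers {customer!r}")


def put(customer, drink):
    shards[shard_for(customer)][customer] = drink


def order_for(customer):
    """Query that matches the shard key: exactly one shard is consulted."""
    return shards[shard_for(customer)].get(customer), 1


def customers_who_order(drink):
    """Query that ignores the shard key: every shard must be scanned."""
    matches = []
    for shard in shards:
        matches.extend(name for name, d in shard.items() if d == drink)
    return sorted(matches), len(shards)


for name, drink in [("Tim", "Americano"), ("Ann", "Latte"),
                    ("Maria", "Americano"), ("Sara", "Mocha")]:
    put(name, drink)

tims_order, shards_touched_by_name = order_for("Tim")
americano_fans, shards_touched_by_drink = customers_who_order("Americano")
```

Looking up Tim's order touches one shard, while finding every Americano drinker touches all three. The second query's cost grows with the number of shards, which is the compromise Berglund describes.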

Consistent hashing offers an elegant solution. Unlike read replication and sharding, which transform single master databases into distributed systems to scale out data processing, consistent hashing "is an approach we apply to an explicitly distributed database," Berglund said. In this coffee shop, there is no master barista. Instead, the baristas work together to maintain the customer/order master list.

The big difference with consistent hashing is that, rather than store an order by customer name, an algorithm (known as a hashing function) is used to "hash" the customer name into a unique, easy-to-find number, which is then assigned to a corresponding barista or node in the system's cluster. The hashing function ensures data is spread evenly across the cluster.

The system's strength lies in what happens if a barista goes on hiatus. Her list is then divvied up among the rest of the baristas or nodes. And, because this is a consistent hashing system (a la Cassandra), the data doesn't have to be hashed again. "Things may have gotten swapped around, things might have got copied, responsibilities may have shifted, I don't care," Berglund said.
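A small hash ring shows why a departure is cheap. The sketch below is a generic Python illustration of consistent hashing, not Cassandra's implementation. The `Ring` class, the choice of SHA-256 as the hashing function and the 100 ring positions per barista are assumptions made for the example.

```python
import bisect
import hashlib


def hash_position(key):
    """Hash a string to a position on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, points_per_node=100):
        self.points_per_node = points_per_node
        self.positions = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node claims several positions so the load spreads more evenly.
        for i in range(self.points_per_node):
            bisect.insort(self.positions, (hash_position(f"{node}#{i}"), node))

    def remove(self, node):
        self.positions = [(p, n) for p, n in self.positions if n != node]

    def owner(self, key):
        """Walk clockwise from the key's position to the next node position."""
        index = bisect.bisect_right(self.positions, (hash_position(key), ""))
        return self.positions[index % len(self.positions)][1]


ring = Ring(["barista-A", "barista-B", "barista-C", "barista-D"])
customers = [f"customer-{i}" for i in range(1000)]

before = {c: ring.owner(c) for c in customers}
ring.remove("barista-B")  # one barista goes on hiatus
after = {c: ring.owner(c) for c in customers}

moved = [c for c in customers if before[c] != after[c]]
```

Every customer in `moved` previously belonged to `barista-B`; the rest keep the barista they already had. Customer names hash to the same ring positions as before, so only the departed barista's share is reassigned.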

There are drawbacks to consistent hashing, too. Data is replicated in this system to eliminate points of failure. But because the system is democratic, if a customer order on one node is inconsistent with the same customer order on another node, it can be difficult to know which one is right.

"That problem comes up, and a database like this has to have answers," Berglund said. Exactly how a consistent hashing database handles a situation like this depends on the vendor because "there are very few general principles of distributed systems that dictate that behavior," he said.
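One possible answer, shown here only as an illustration, is a "last write wins" rule: each copy carries a timestamp, and the most recent write is treated as correct. The Python sketch below assumes that policy; the `Version` record and its `written_at` field are hypothetical names, and the article does not say which rule any particular vendor uses.

```python
from typing import NamedTuple


class Version(NamedTuple):
    drink: str
    written_at: int  # hypothetical timestamp attached to each write


def resolve(copies):
    """Last write wins: keep the copy with the newest timestamp."""
    return max(copies, key=lambda version: version.written_at)


# Three replicas hold Tim's order; only one has seen his latest change.
copies = [
    Version("Americano", written_at=1),
    Version("Latte", written_at=2),
    Version("Americano", written_at=1),
]
winner = resolve(copies)
```

Here the single newer copy beats the two older ones, so a majority of replicas agreeing is not what decides the outcome. Other policies make different trade-offs, which is why the behavior varies from one database to the next.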

UX design is key to creating a 'sticky' app

Last week, Sara Taylor and Paul Heckel, mobile product strategists with Solstice Mobile, talked about the top mobile trends to watch in 2015. One of their predictions? Human-centered design will become standard.

"This is a UX [user experience] process that focuses on end users' needs and wants during the design stage," Taylor said during the webinar. Doing so can help create what Taylor called a "stickiness factor" for the app: getting users to come back over and over again.

The trend is getting a lot of traction in Silicon Valley. "Major venture capital firms are hiring UX designers for their own staff ... to go in and advise their startups," she said. According to Taylor, Silicon Valley startups are starting to pay UX designers, and designers in general, as much as developers, another signal that good design is becoming a big deal.

Big data business value comes from advanced analytics

Businesses aren't struggling with the size of big data; they're struggling to get business value from it, according to Philip Russom, a research director for data management at The Data Warehousing Institute. Businesses that are experiencing success? "Quite often, they get it from advanced analytics," said Russom last week during a webinar on best practices for data management.
