Big text and the 2014 World Cup

There's big data and then there's big text. Find out why they're different. Plus, the key to a data team's success: The Data Mill reports.

Sony Corp. became the poster child for "big text" last summer during the 2014 World Cup when the entertainment company pulled off a social media triumph on its One Stadium Live website. To give fans a so-called second screen experience, Sony planned to stream relevant social media posts from Twitter, Facebook and Google+ public feeds. In the past, Sony, a World Cup sponsor, relied on moderators to do this work, but for an event that cut across languages and time zones, automation became key.

Working diligently behind the scenes of One Stadium Live was Luminoso Technologies Inc., a text analytics and artificial intelligence company spun out of the MIT Media Lab in 2010.

"We expected a lot of text," Catherine Havasi, CEO and co-founder of Luminoso, said during her presentation at the second annual Boston Data Festival, a weeklong event of 22 data talks spread across the Boston area. And fans did not disappoint: When all was said and done, they generated 672 million tweets over the course of the tournament, making it the largest social media event in history. Big text, indeed.

Big text, Havasi explained, is a different animal from big data, and it is sitting right under your nose. "Enterprises have big text, but it's in esoteric formats, hiding in many places in the enterprise," she said. It's a "grungy mess," a data type that's been poorly managed and indexed and not highly valued. The vast majority of free text in survey responses? "All that stuff is just tossed," Havasi said.

Because enterprise text is so inaccessible, big text almost always refers to social media data, which comes with its own kind of grunge, Havasi said. Figuring out who's real and who isn't or filtering out the mounds of repeated content can be incredibly difficult. "The semantic content on Twitter is often a little sparse, and that's not just because of noise and spam," she said. "It's because everybody gets excited about the same thing when they're at the same event."

And then there's spam. On Twitter, the cliché "one person's treasure is another person's trash" couldn't be more true. Definitively labeling a post as garbage or invaluable is not a trivial task. "The fact is, Twitter is getting really crowded, which is something no one really likes to talk about," Havasi said.

Conversations are getting lost amid the noise, and the potential for personal connections is disappearing. Twitter has become "a lot of little conversations that can't find each other because hashtags are no longer working as a differentiator for people to find each other," she said. "When we think about this, it's almost the death of the hashtag."

It is, anyway, without technology like Luminoso -- the subtext for the data professionals in the audience. Later this month, the company plans to release a version for the enterprise and customer comments, Havasi said. Keep your eyes open for Compass.

Thinking in data

A data team's success hinges on communication, according to Nicholas Arcolano, senior data scientist at FitnessKeeper Inc.

Case in point: When asked to determine if an interview with a Boston publication had increased usage of RunKeeper, the startup's fitness app, Arcolano produced a brightly colored line graph full of peaks and valleys with no indication as to how to interpret the data. Arcolano, one of four members of FitnessKeeper's data team, could clearly see a bump in the Boston market on the day the interview went live, but the colleague who asked him to produce the report was at a loss. "It occurred to me that I needed to learn how they think about these things and how I needed to explain these things," he told audience members during his presentation at Boston Data Festival.

The path to better communication is multifaceted for FitnessKeeper's data team. It includes leveraging each department's knowledge about its data to spot bugs, quirks or system failures faster than the data team could; it also means being clear about how the data team can provide support to the rest of the business and figuring out ways to empower colleagues outside the data team "to think with data," he said. A year into his position at FitnessKeeper, Arcolano admitted he's still trying to figure out how to do that well. "It's a process; it's a lot of conversations; it's showing them examples of how you can make their lives easier and better," he said.

Sometimes it's as simple as repeating a data point over and over again to undo preconceptions. The median pace per mile among RunKeeper users, for example, is 11 and a half minutes, but that's not always the runner the company's designers have in mind when developing new features. His job was to drive home the concept of median, namely that 50% of RunKeeper users are averaging slower than 11-and-a-half-minute miles, Arcolano said.
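
For readers who want to see the median point in numbers, here is a minimal sketch in Python. The paces are invented for illustration and are not RunKeeper data; they were simply chosen so that the median lands on 11.5 minutes per mile.

```python
from statistics import median

# Hypothetical paces in minutes per mile; not RunKeeper data.
paces = [7.5, 9.0, 10.5, 11.0, 12.0, 13.0, 14.5, 16.0]

# The median splits the sorted list in half.
median_pace = median(paces)

# A larger number means a slower runner, so count paces above the median.
share_slower = sum(p > median_pace for p in paces) / len(paces)

print(median_pace)   # 11.5
print(share_slower)  # 0.5
```

In this made-up sample, half of the runners are slower than the median pace, which is the point designers were being asked to keep in mind.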

Hard work, however, does pay off. When a UX designer asked for a list of Android screen sizes organized from most to least common among its users to help prioritize her work, Arcolano felt like the gaps in communication were closing. "It was a great moment," he said.

Unstructured data versus semi-structured text

Although sensor data sometimes gets lumped into the "unstructured data" category, the two are different. "The stuff being created by the Internet of Things is not unstructured data, even if it's your fridge tweeting that the door opened," Luminoso's Havasi said. "It's still data that has a structure in it, even if it looks like text."

Havasi refers to it as semi-structured text, "something that has a predictable semantic and syntactic quality," she said. If the fridge tweets every time the door opens, the exact meaning of that data will never be difficult to decipher. "It's a signal that happens to come out in text," she said.
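
A rough sketch of what that predictability buys you, in Python. The message format, the pattern and the `parse_event` helper are all assumptions made for illustration; Havasi did not describe a specific format. The idea is only that a fixed template can be read with a simple pattern, while free text cannot.

```python
import re
from datetime import datetime

# Hypothetical message format; the article does not specify one.
PATTERN = re.compile(r"^(?P<ts>\S+) fridge door (?P<state>opened|closed)$")

def parse_event(message):
    """Return (timestamp, state) for a matching message, else None."""
    match = PATTERN.match(message)
    if match is None:
        return None
    return datetime.fromisoformat(match["ts"]), match["state"]

messages = [
    "2014-11-05T08:15:00 fridge door opened",
    "2014-11-05T08:15:20 fridge door closed",
    "Can't believe that goal!",  # free text: no fixed structure to match
]
events = [parse_event(m) for m in messages]
```

The two fridge messages come back as a timestamp and a door state, while the fan's post returns `None`, since it follows no template and would need real text analytics to interpret.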

Welcome to The Data Mill, a weekly column devoted to all things data. Heard something newsy (or gossipy)? Email me or find me on Twitter at @TT_Nicole.
