Now that CIOs have come to accept the term "big data," another term is gaining traction: small data. The two might appear to occupy opposite ends of the same spectrum, but that's not entirely accurate. Small data tends to refer to data volume; big data has volume, of course, but can also refer to data variety, data velocity, specific technologies or use cases.
For Kirk Borne, professor of astrophysics and computational science at George Mason University in Fairfax County, Virginia, the distinction is an important one. Borne's courses on big data cover both the attributes of big data and advanced analytics techniques, which are almost always applied to small data sets. Working with small sets is what lets students experiment and hone their data analysis skills; big data can quickly overwhelm.
Small data is making a name for itself in the corporate world for similar reasons, as CIOs wrestle with the question of how much data is too much. SearchCIO sat down with Borne, a former NASA employee who worked with the Hubble Space Telescope group for 10 years, to discuss what small data is and how it fits into the big data landscape.
You'll be moderating a couple of sessions at next week's Useful Business Analytics Summit in Boston. I noticed from the agenda one of your sessions will touch on small versus big data. Let's start with definitions: What is big data and what is small data?
Kirk Borne: Oh boy. Defining small data is easier because it's basically things you can do on a laptop. Big data -- that's more complicated. I have a definition I'm promoting now: Big data is everything quantified and tracked. By that, I mean we're now measuring and quantifying just about everything -- with social media, smart highways, smart cities, mobile health, electronic health records, surveillance cameras everywhere, which prompts the big data privacy question. Everything that's measurable, we're measuring. And not only are we measuring it once, but we're tracking how it changes with time.
Why is big data so hard to define?
Borne: You've seen the cartoon with the blind men feeling out the elephant. Everyone has a different description of what it is because one person feels the leg, one feels the trunk, one feels the tail. There is one thing called 'the elephant,' but everyone has a different perspective of it and a different definition. That's what we're fighting against. People want to claim big data is one concept, and that isn't going to work.
Allen Bonde, a former consultant who is now working at Actuate, has been known to say small data is for people and big data is for machines. Is that a fair differentiation?
Borne: Yes. That puts it in a nutshell. Small data is what you use when you're learning. By learning, I mean two things: One, learning in the education sense. So, when I teach courses, I'm always using small data and I'm absolutely never using big data in the sense of big volume because students would spend all semester just learning how to move the data around and never learn any algorithms or any science. Two, when you're in a business, you're trying to learn what are the right features to track customers, or to make recommendations to customers, or to find out what customer preferences are. Or, say in a cybersecurity analytics problem, what you need to measure to detect a break-in or a hack attack. So you do these experiments to find out what you need to measure -- that's the small data.
Once you learn the model … then you deploy it, and the machine operates on the full flood of data. The machine is, essentially, working on the big data flood, which uses models or techniques you trained on with the small data. So, yes, small data is for people; big data is for machines.
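As a rough illustration of the workflow Borne describes here (learn on small data, then let the machine run on the flood), the sketch below fits a simple anomaly threshold on a laptop-sized sample and applies it lazily to a much larger stream. The function names, the mean-plus-three-standard-deviations rule, and the simulated failed-login counts are all assumptions made for this example; they are not taken from the interview.

```python
import random
from statistics import mean, stdev


def fit_threshold(sample, k=3.0):
    """Learn a simple anomaly threshold (mean + k * stdev) from a small sample."""
    return mean(sample) + k * stdev(sample)


def flag_anomalies(stream, threshold):
    """Lazily apply the learned threshold to an arbitrarily large stream."""
    for index, value in enumerate(stream):
        if value > threshold:
            yield index, value


def event_stream(n, seed=0):
    """Hypothetical stand-in for the full flood: failed logins per minute."""
    rng = random.Random(seed)
    for i in range(n):
        # Inject an obvious spike at the end of every block of 10,000 events.
        yield 500 if i % 10_000 == 9_999 else rng.randint(0, 8)


# Small data: a sample a person can explore by hand to choose the rule.
small_sample = [3, 5, 2, 4, 6, 3, 5, 4, 2, 6]
threshold = fit_threshold(small_sample)

# Big data: the machine applies the learned rule to every event, one at a time,
# without ever holding the whole stream in memory.
alerts = list(flag_anomalies(event_stream(100_000), threshold))
print(f"threshold={threshold:.2f}, alerts={len(alerts)}")
```

The design point is the split between the two steps: `fit_threshold` only ever sees the small sample, while `flag_anomalies` consumes a generator, so the same learned rule scales to a stream of any length.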
How do visualizations fit into this small data versus big data discussion?
Borne: Let me give you an example. When you first go to Google Maps or whatever map service you use, you start off by seeing a map of the world. You're not actually getting any data; you're seeing a picture of the globe. As you drill down to a particular location, it gives you that information only for the spot you're drilling into. So as you drill in, you get higher and higher resolution access to the data. When you drill down to the highest resolution possible, all you're seeing is your own backyard. That's just a subset of big data. Yes, it's 'small data,' but what you've really done is you've built a hierarchical data structure that allows you to layer by layer drill down and zoom in one step at a time. You can pan left or right and start seeing other houses or neighborhoods at the same resolution. That's where visualizations are really powerful. When you're keying in on features of this hierarchical data structure, you're looking at the highest tip of the iceberg, so to speak. But if you want to move over to one side, you can start seeing other features in the data set at the same resolution. You still [have access to] the full data set.
With small data, you just download a piece of the map -- say a high-resolution [map] of my city -- and do data analysis on that.
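The hierarchical structure Borne describes can be sketched as a resolution pyramid: each coarser level summarizes a 2x2 block of the level beneath it, drilling down moves to the finer cells under the current one, and panning moves sideways at the same level. This is a minimal sketch that assumes a square grid whose side is a power of two; the names `build_pyramid`, `children` and `crop` are hypothetical, and real map services use far more elaborate tiling schemes.

```python
def build_pyramid(grid):
    """Return levels from coarsest (1x1) to finest (the original grid).

    Each coarser cell is the average of the 2x2 block beneath it.
    Assumes a square grid whose side length is a power of two.
    """
    levels = [grid]
    while len(levels[0]) > 1:
        fine = levels[0]
        half = len(fine) // 2
        coarse = [
            [
                (fine[2 * r][2 * c] + fine[2 * r][2 * c + 1]
                 + fine[2 * r + 1][2 * c] + fine[2 * r + 1][2 * c + 1]) / 4
                for c in range(half)
            ]
            for r in range(half)
        ]
        levels.insert(0, coarse)
    return levels


def children(row, col):
    """Drill down: the four cells one zoom level deeper that refine (row, col)."""
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]


def crop(levels, zoom, row0, col0, size):
    """Small data: copy out one size x size patch at a chosen zoom level."""
    return [row[col0:col0 + size] for row in levels[zoom][row0:row0 + size]]


# A toy 4x4 "map" holding the values 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
levels = build_pyramid(grid)

world_view = levels[0][0][0]          # zoom 0: one cell summarizing everything
neighbor = levels[1][0][1]            # pan right at zoom 1, same resolution
backyard = crop(levels, 2, 2, 2, 2)   # download one high-resolution patch
```

Only `crop` produces a standalone small data set; the drill-down and pan operations read from the pyramid, so the full data set stays reachable at every step.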