This article can also be found in the Premium Editorial Download "CIO Decisions: How Mobile IT is Revamping Network Strategies."
PayPal Inc.'s data comes in torrents. Embedded in it is everything the business wants to know about the merchants and buyers who transact sales using PayPal's systems. The question is how to use big data analytics to get at that information.
Mok Oh, PayPal's chief scientist, has the job of extracting the psychological underpinnings of this transactional data for the San Jose, Calif., electronic payments provider. His data set is insanely large. His goal is to match vendors and buyers better in order to maximize the likelihood of a transaction -- in other words, to help PayPal make money.
In a wide-ranging SearchCIO.com Trailblazer interview, Oh talks about the state of big data analytics. He is actually trying to fathom the human subconscious by looking at who buys what, when and why. This data is so useful, he believes, that sooner or later all companies will want a piece of it, but it won't come cheap.
What do you do at PayPal?
Mok Oh: My title is chief scientist. For me, that means anything and everything science-y within PayPal. I would love to really get into psychology and sociology, which are actually very relevant to what we want to do with consumers -- especially understanding how people behave and what they think, especially about shopping and consuming. Currently I am focusing on big data and data science.
You've said we are in a limbo state in big data analytics. You called it "Analyst 1.5," where finding something useful in the vast amount of unstructured data that companies and their customers generate requires a cadre of technicians, from data scientists to statisticians.
Oh: Yes, this limbo state is between what you might call Analyst 1.0, where the process for cleansing, sorting and analyzing data was well understood, if limited; and Analyst 2.0, where the tools have evolved to a point that the intuitive business person -- the person closest to the data and with the deepest insight into the questions that need to be asked -- can leverage the big data for competitive advantage.
What do you think Analyst 3.0 will look like?
Oh: Oh boy! I haven't even thought that far. It still feels like Analyst 2.0 is pie in the sky -- not pie in the sky exactly, because I know it is going to happen but I don't know when. It could be five years or it could be 10 years. If we can get to a state where risk analysts, business analysts can seamlessly leverage data and its capabilities -- then I think we have reached the 2.0 state. I see lots and lots of signals of that happening now at big companies.
Now, 3.0? I have no clue. If we can get to 2.0 I will be ecstatic, but I think we will be in this limbo state for at least another five or 10 years.
Let's focus on the limbo state and the reality CIOs face when trying to leverage big data tools. What are the limitations at present, and how can you get around them?
Oh: It's going to be an uphill battle. Here's why: People talk about two different things -- the big data itself and the analysis. A lot of companies are just focusing on how to store big data and have places where people can access it. A lot of CIOs at Fortune 500 companies are going to spend tens of millions of dollars setting up the infrastructure and trying to get big-ass tools to connect to big data.
But that is just half the battle. Access to big data is not enough. So, infrastructure itself is not big data at all. That's just setting up the pipes. So, yeah, you can start storing information -- then what? That is the question! What do you do with that?
Oh: Right now, human intelligence, I would say, especially in terms of targeting and making recommendations, has been surpassed by machines. Machines can scale up faster, they can target better and [can] learn and be able to predict the future better than any human can [when] confronted with all that big data.
Can you give me a real-life example of this problem?
Oh: I will give you a marketing campaign example. Traditionally, someone is going to go in thinking, "I have this Excel spreadsheet and we have our million potential targets. Let's slice them up by how much money they spend with us. We'll target those people who spend the most money and make sure it is segmented that way." That is more of a traditional, human-in-the-loop marketing strategy. Somebody goes in and creates a model [for that data] which may or may not work.
On the other hand, what machines will do is not only look at five or six or 10 attributes, but look at -- oh, I don't know. We have millions of attributes on our tables for PayPal. We throw all of those attributes on 100 million-plus people and rank them all according to a specific campaign that we may run. And we can do that, update that, on a near-real-time basis, and it will be a better campaign, done much better than any human being can [do it].
All that is controlled by computer science, with human beings creating code at a meta level so the machines can learn. So that is 1.5. Hopefully by 2.0 what will happen is that people who are analysts, who really, really understand the business, will be able to do that themselves.
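To make the contrast in Oh's marketing example concrete, here is a minimal Python sketch. It is not PayPal's method: the customer records, the three attributes and the weights are all hypothetical, and a real system would learn its weights from data across far more attributes. The first function applies the traditional rule (target the biggest spenders); the second scores every customer on several attributes at once and ranks them.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    spend: float          # total spend (hypothetical units)
    visits: int           # recent visits (hypothetical attribute)
    days_since_last: int  # days since last purchase (hypothetical attribute)


# Hypothetical weights. In practice a learned model would supply these,
# and there would be far more than three attributes.
WEIGHTS = {"spend": 0.002, "visits": 0.1, "days_since_last": -0.02}


def top_spenders(customers, k):
    """Traditional rule: target the k customers who spend the most."""
    ranked = sorted(customers, key=lambda c: c.spend, reverse=True)
    return [c.name for c in ranked[:k]]


def score(c):
    """Combine several attributes into one campaign score."""
    return (WEIGHTS["spend"] * c.spend
            + WEIGHTS["visits"] * c.visits
            + WEIGHTS["days_since_last"] * c.days_since_last)


def rank_by_score(customers, k):
    """Rank everyone by the combined score and keep the top k."""
    ranked = sorted(customers, key=score, reverse=True)
    return [c.name for c in ranked[:k]]


customers = [
    Customer("a", spend=900.0, visits=2, days_since_last=200),
    Customer("b", spend=400.0, visits=30, days_since_last=3),
    Customer("c", spend=100.0, visits=12, days_since_last=10),
]

print(top_spenders(customers, 2))   # spend-only segmentation
print(rank_by_score(customers, 2))  # multi-attribute ranking
```

With this toy data the two approaches pick different targets: the spend-only rule keeps a lapsed big spender that the multi-attribute score ranks last.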
What is lacking in big data tools now that prevents the people who have deep knowledge of the business from doing this data analysis themselves?
Oh: Right now, the tool of the trade is code -- being able to program on a massively parallel computer. But the problem is there are way too many use cases. So, it is not like there are five different use cases and we're going to make sure there are tools to do these five things. Let's take a marketing campaign again: Before, we might do a direct mail campaign, one where we send someone a unique telephone number, and if they call they redeem [a coupon]. Parameters we tracked might be household income and some demographic information and whether they redeem or not. It was fairly simple.
Now we have mobile devices, so we can do things at point of sale; there is also a lot of activity online, not only whether a customer buys or not, but what kinds of other websites they go to, what products they look at. Offline, you can also look at what kinds of stores they walk into, what is their wish list, what are their buying patterns, what are their transactional patterns?
As more big data accumulates, in the pursuit of trying to understand people, the data gap gets filled more and more. And as a result, additional use cases come to light. The world was always complex, but with the advent of big data, there is a lot more information and many more precise things that we can understand. Eventually we're going to say, "Hey, let's not just target household income; let's look at people who live in San Jose, who make this much money, and have this many people in the house, who recently shopped at this website using PayPal."
There are so many other aspects you can slice and dice. The problem is there are really no tools that do all the things you can think to do with the data. The question becomes, having 100 times more different use cases now, what's that meta-level of truth that will allow us to run this analysis in a more manageable way and that will allow us to make those important calls?
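The kind of multi-attribute targeting Oh describes (city, income, household size, recent shopping activity) can be sketched as a simple filter. This is only an illustration under stated assumptions: the records, field names, site name and thresholds below are invented, and the interview does not describe how PayPal actually stores or queries this data.

```python
from datetime import date

# Hypothetical customer records; every field name and value is illustrative.
people = [
    {"id": 1, "city": "San Jose", "income": 95000, "household_size": 4,
     "purchases": [("example-shop.test", date(2013, 1, 20))]},
    {"id": 2, "city": "San Jose", "income": 60000, "household_size": 4,
     "purchases": [("example-shop.test", date(2013, 1, 25))]},
    {"id": 3, "city": "Oakland", "income": 99000, "household_size": 4,
     "purchases": [("example-shop.test", date(2013, 1, 22))]},
    {"id": 4, "city": "San Jose", "income": 120000, "household_size": 4,
     "purchases": [("example-shop.test", date(2012, 6, 1))]},
]


def in_segment(person, city, min_income, household_size, site, since):
    """True if the person matches every targeting criterion."""
    shopped_recently = any(
        s == site and when >= since for s, when in person["purchases"]
    )
    return (person["city"] == city
            and person["income"] >= min_income
            and person["household_size"] == household_size
            and shopped_recently)


segment = [
    p["id"] for p in people
    if in_segment(p, "San Jose", 80000, 4, "example-shop.test",
                  date(2013, 1, 1))
]
print(segment)
```

Each added criterion is one more condition in the filter, which hints at the problem Oh raises: the number of possible combinations grows quickly, and no single tool covers them all.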
I'd like to understand the motive for the type of data analysis you're doing at PayPal. Is the business aim to get merchants to advertise with you because you are sending buyers to them?
Oh: It is to make them more successful.
We are in a kind of awesome position where we can enable that. We have over 100 million active users in PayPal. We have millions and millions of merchants, from very large to small. With that information, depending on the context -- the time of day, what the merchant wants to do -- we can create demand, very efficiently and in a very targeted way, to match consumers with the merchants and vice versa.
Read part 2 of our Trailblazer interview with Oh, where PayPal's chief scientist talks about using big data tools to plumb the human subconscious and about his pursuit of the holy grail of big data analytics, a state he calls "Analyst 2.0."