Part one of this SearchCIO two-part series provided expert insight into why it's more important than ever for CIOs to focus on business outcomes before investing in big data technology. Here, we take a closer look at the decision to build or buy when it comes to architecting big data projects.
For Stephen Laster, the chief digital officer at McGraw-Hill Education in New York City, using data to improve business outcomes is of paramount importance. Laster heads up a team of data scientists and engineers who are charged with building out the company's e-learning and educational technology strategy. They are, in other words, responsible for all McGraw-Hill Education digital learning products.
A significant aspect of what his team builds is the sophisticated interaction between software and student. In the last few years, Laster's team captured four billion learning interactions, which are anything but generic.
"What we're able to do for a particular learner is understand at a micro-level what concepts they've got, what concepts they need to work on, and dynamically take them through a learning map that allows them to master small micro-concepts into larger learning outcomes," Laster said.
For this reason, Laster isn't a fan of the term big data; instead, he sees significance in small data. To create personalized applications for students, his team has to analyze data in real time, predict behavior and build smart algorithms and experiences that are, themselves, self-refining and self-learning, he said.
When Laster decides between buying technological capability or building it internally, he looks for market differentiation opportunities and stays away from projects that could mean reinventing the wheel. Take, for example, his IT strategy for a relational database management system. "That [has] been solved," he said. "But, conversely, we'll develop the AI [artificial intelligence] and algorithms that really bring this to life in-house."
Laster and his team start with the business outcome and work their way forward. "We decide, number one, 'What are we trying to accomplish to advance teaching and learning?'" he said. Then he and his team plot the technology roadmap likely to get them there.
"Once we know that, we take the implementation apart into small pieces and we go and see for each piece -- is this something that's commodity or that [has] been solved by the market?" he said. "If that piece is [solved], we license it or open source it. And if there are pieces that are unique or haven't been solved well before, we invest in building those."
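The decomposition Laster describes can be read as a simple per-component rule. The sketch below is only an illustration of that rule, not McGraw-Hill Education's actual process: the `Component` class, the `solved_by_market` flag and the sample roadmap entries are hypothetical names introduced here, and a real assessment would weigh far more than a single yes-or-no flag.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    LICENSE_OR_OPEN_SOURCE = "license or open source"
    BUILD = "build in-house"


@dataclass(frozen=True)
class Component:
    name: str
    # Hypothetical flag: True when the piece is a commodity or has
    # already been solved well by the market.
    solved_by_market: bool


def decide(component: Component) -> Decision:
    """Apply the rule from the quote: buy what is solved, build what is unique."""
    if component.solved_by_market:
        return Decision.LICENSE_OR_OPEN_SOURCE
    return Decision.BUILD


def plan(components):
    """Map each component name to its build-or-buy decision."""
    return {c.name: decide(c) for c in components}


# Illustrative roadmap, loosely based on the examples in the article.
roadmap = [
    Component("relational database", solved_by_market=True),
    Component("adaptive learning algorithms", solved_by_market=False),
]

for name, decision in plan(roadmap).items():
    print(f"{name}: {decision.value}")
```

The point of breaking the roadmap into small pieces first is that the decision is made per piece rather than per project, so a single product can mix licensed, open source and in-house parts.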
An application programming interface "based on years of academic research and engineering research" -- such as those found in McGraw-Hill's LearnSmart and Assessment and Learning in Knowledge Spaces -- "that's where we think we really help move the market forward," Laster said.
Building rather than buying at the application layer enables companies to carve out market differentiation, making it a pivotal place for the build-versus-buy discussion, according to Jonathan Reichental, CIO for the City of Palo Alto. "If you're a CTO and you're delivering services to the marketplace, more often than not, you're building it," he said. "If it's inward-facing and it's just a dashboard for your financials, you may be using SAP or third-party products to do the reporting."
Building only for customer-facing applications helps to cut down on the "debris we left over the last decades because we built so much stuff that could not be supported and ultimately failed," Reichental said.
Buying can provide proprietary advantages too
But sometimes buying just makes good business sense. That's how Johann Schleier-Smith, CTO and co-founder of the social media site Tagged.com in San Francisco, weighs technology investments. He and his partner, Greg Tseng, started the Facebook contemporary 10 years ago -- before the big data boom. "We used the same databases that we were using for our online transaction processing systems to do our analytics," Schleier-Smith said.
Today, the technological terrain is more diversified, replete with NoSQL databases, analytics platforms and the open source Apache community, Schleier-Smith noted. And that market expansion has shaped the architecture he built at Tagged, which today collects 100 billion data events each month, adding more than 50 TB to its petabyte-scale cluster. His engineers work with open source technology such as Linux, Apache Kafka and Apache Spark, an in-memory data analytics processing engine.
And they balance that portfolio with commercial technology from the likes of EMC Greenplum and Vertica. Database technologies like these provide "a high performance on certain types of queries -- particularly interactive queries," Schleier-Smith said. "We felt there was a proprietary advantage there, and it made sense to buy."
Build or buy? Why not rent?
Another San Francisco-based startup, ContextLogic, decided on a route that didn't exist 10 years ago. Rather than build or buy, ContextLogic rents services from a cloud service provider to manage its log data.
ContextLogic is the powerhouse behind Wish.com, a social shopping recommendation engine that boasts 1.1 million daily active users, 96% of whom interact with the site from mobile devices. A key part of the business's success is capturing and logging online events -- from user clicks and impressions to products customers skipped over and sequence data, which details, for example, how and when a user found his way to the online shopping cart. All of that data -- between 40 and 55 million events a day -- is logged for analysis.
"The combination of the volume of data, as well as the sequence, makes logging very interesting," said Danny Zhang, co-founder of ContextLogic, who heads up engineering operations. "That's how I look at big data."
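To make the idea of sequence data concrete, here is a minimal sketch of how logged events might be ordered per user and replayed to see how someone reached the shopping cart. It is not ContextLogic's schema or pipeline: the `Event` fields, the event kinds such as `"add_to_cart"` and the sample log are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    timestamp: float  # seconds; hypothetical unit
    kind: str         # assumed kinds: "impression", "click", "skip", "add_to_cart"
    product_id: str


def sequences_by_user(events):
    """Group events per user, ordered by timestamp."""
    by_user = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        by_user[event.user_id].append(event)
    return dict(by_user)


def path_to_cart(user_events):
    """Return the event kinds up to and including the first add_to_cart.

    Returns None if the user never reached the cart.
    """
    path = []
    for event in user_events:
        path.append(event.kind)
        if event.kind == "add_to_cart":
            return path
    return None


# Made-up sample log, deliberately out of order.
log = [
    Event("u1", 3.0, "click", "p2"),
    Event("u1", 1.0, "impression", "p1"),
    Event("u1", 2.0, "skip", "p1"),
    Event("u1", 4.0, "add_to_cart", "p2"),
    Event("u2", 1.5, "impression", "p1"),
]

for user_id, user_events in sequences_by_user(log).items():
    print(user_id, path_to_cart(user_events))
```

Individual events say what happened; only the ordered sequence shows how and when a user arrived at the cart, which is why the order matters as much as the volume.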
As the company grew, so did the volume -- and importance -- of logging data. "Logging to me is essential and, arguably, the most important step for big data analysis," Zhang said. It provides insight into what customers like and don't like, guiding the algorithms underneath the search engine and ContextLogic's business decisions. While Zhang prefers to build everything in-house "because we grow really fast and outpace solutions external services can provide," he selected Treasure Data, a big data service provider that uses Amazon Web Services to deliver Hadoop functionality, to manage log data. And his reason for doing so was simple: "Logging is not going to change," he said. "It doesn't matter how fast we'll grow; we'll log the same way."
Besides, he said, renting cloud-based data management services is a wash in terms of cost for ContextLogic. With the headache of logging data pushed to the side, engineers have more time for data analysis, Zhang said.
"We're not waiting for a golden parachute to come and land and solve all of the problems. The problems remain," he said. "We just happen to choose Treasure Data as one of the methodologies to tackle the problem with."