Big data tutorial: Everything you need to know
Big data is going to be this year's cloud computing. It was an inevitable evolution: The sets of data that businesses generate (customer purchasing trends, website visits and habits, customer review data, and so on) have grown larger and larger over time; how do you wrangle that massive amount of data into comprehensible pieces? The traditional business intelligence (BI) tools -- relational databases and desktop math packages -- are inadequate when a business is dealing with a massive volume of data. The data analysis industry, however, has developed tools and frameworks that allow data scientists and analysts to engage in mining large data sets without succumbing to information overload.
Massive data acrobatics are nothing new for larger companies. Twitter and LinkedIn, for instance, are prominent users of big data. Each has already developed a distinct competitive advantage in being able to discern trends by data mining their massive data warehouses. But where does that leave the midmarket CIO? Happily, there are tools at your fingertips that will allow you -- or more specifically, your business analysts -- to cut your teeth on big data without biting off more than you can chew.
One of these tools, the free, Java-based Apache Hadoop programming framework, has gained a lot of ground in the race for big data in the last 12 to 18 months. Industry experts and users around the globe are proclaiming Hadoop the de facto standard for data mining. This fanfare is surprising, considering the existence of other big data products and the fact that Apache Hadoop version 1.0 was released only in late December 2011. Hadoop is so popular that Eric Baldeschwieler, CEO at Hortonworks, predicts it will process half of the world's data by 2017. Chances are good that Hadoop will somehow touch your organization in the coming years.
Microsoft plays nice with Apache Hadoop
In the fight for big data and Apache Hadoop adoption, a perhaps unlikely hero has arrived. Microsoft is working to make its products integrate more seamlessly with Hadoop's existing features and programs. Here's what it's currently working on:
- Software to connect SQL Server -- and in particular the Parallel Data Warehouse utility -- to Hadoop for easier big data processing.
- A connector that will allow any Windows client that speaks Open Database Connectivity (ODBC) to run standard CRUD (create, read, update, delete) queries against a Hive data store.
- An add-in for Excel so that data in a Hive data warehouse can be moved directly into the spreadsheet program.
- The ability to use PowerPivot (Office's and Excel's Swiss Army knife for combing through enormous amounts of data on a standard desktop PC) to slice and dice Hive data too.
Microsoft's entry into the Hadoop fray could be your best bet for really injecting big data into your organization. Microsoft makes its living from building bridges from one product to another (and it hopes those products are the ones it makes, but it's still good about playing nicely with the rest of the world). The best part is that your users -- particularly your business analysts -- should already be intimately familiar with the tool sets and operations of Excel and Access. -- J.H.
Hadoop is primarily aimed at developers. The main framework -- MapReduce -- lets programmers process massive amounts of data on a distributed set of computers. The downside is that it's very heavy. What's more, Hadoop can segregate the tech folks who directly operate the data warehouse from the consumers and interpreters of the data.
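To make the MapReduce idea a little more concrete, here is a minimal in-memory sketch of the model in Python. It is not Hadoop's API and runs on a single machine; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample log format are assumptions made purely for illustration. It counts website visits per page, the kind of question a real job would answer across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: turn one raw log line ("user page") into a (page, 1) pair."""
    _user, page = record.split()
    return [(page, 1)]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key.

    In a real cluster the framework does this between the map and
    reduce steps; here it is a plain dictionary of lists.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample data standing in for log files spread over a cluster.
    visits = [
        "alice /home",
        "bob /pricing",
        "alice /pricing",
        "carol /home",
        "bob /home",
    ]
    print(run_job(visits))
```

The point of the pattern is that the map and reduce functions never need to see the whole data set at once, which is what lets a framework such as Hadoop spread the work over a distributed set of computers.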
Here's how to overcome the challenge of massive data on a midmarket CIO's budget:
- Don't ignore the trend. Big data isn't going away, and the transformational power of analyzing data in huge chunks and looking for trends is too significant to pass up. Spend some time understanding the capabilities and structure of Hadoop and other big-data players. Think of the ways the data you have can drive improvements for your company.
- Find room in your budget for qualified data scientists. These folks are the percussion section of your BI symphony. There's a shortage of qualified data scientists on the market. Even at the Hadoop World conference last November, training was a huge theme. Hire the best folks and use liberal amounts of your training budget to keep their data-analytics skills sharp.
- Understand the storage implications of large data sets. Big data is all about mining massive data from multiple places and multiple databases at near-real-time speeds -- without letting structure get in the way. This complicates the way storage works in your infrastructure. Could cloud storage be more flexible and agile for these purposes? Get together with your data mining strategy team and make it a priority to understand the types and amounts of storage required to harness Hadoop's power.
- Position your tool set to use Hadoop. Understand Microsoft's entrance into this field and try out the Hadoop-Excel and Hadoop-SQL Server integrations to see what kinds of results you can deliver. Investigate IBM's tools as well to see which are a better fit for your existing investments in desktop and end-user software.
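For the storage question raised in the list above, a back-of-the-envelope estimate is a reasonable starting point for the conversation with your data mining team. The Python sketch below is only that: the function names, the 25% headroom figure and the growth rate are assumptions for illustration, not recommendations. The one fixed input is the replication factor of 3, which is the default in Hadoop's distributed file system (HDFS) and can be changed per cluster.

```python
def projected_logical_tb(current_tb, monthly_growth, months):
    """Logical data size after compounding growth.

    monthly_growth is a fraction, e.g. 0.05 for 5% per month (assumed input).
    """
    return current_tb * (1 + monthly_growth) ** months


def raw_storage_tb(logical_tb, replication=3, headroom=0.25):
    """Raw disk capacity needed to hold logical_tb of data.

    replication: copies kept of each block (3 is the HDFS default).
    headroom: extra fraction reserved for temporary and intermediate
              job output (an assumed planning figure, adjust to taste).
    """
    return logical_tb * replication * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical starting point: 10 TB today, growing 5% a month.
    today_tb = 10
    in_a_year_tb = projected_logical_tb(today_tb, monthly_growth=0.05, months=12)
    print(f"Raw capacity needed today:     {raw_storage_tb(today_tb):.1f} TB")
    print(f"Raw capacity needed in a year: {raw_storage_tb(in_a_year_tb):.1f} TB")
```

Even this rough arithmetic shows why raw capacity runs at several times the size of the data itself, which is worth knowing before you compare on-premises disks with cloud storage.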
The race for big data is on. Chances are you're already behind on the data mining revolution. CIOs who ignore the rush for data analytics could be doing so at their professional peril. For CIOs who get the jump on big data and extract key insights, however, the world will be their oyster.
Jonathan Hassell is president of The Sun Valley Group Inc. and an author, consultant and speaker in Charlotte, N.C. His books include RADIUS, Learning Windows Server 2003, Hardening Windows and, most recently, Windows Vista: Beyond the Manual. Contact him at firstname.lastname@example.org.