Data Management.com

Explore Hadoop distributions to manage big data

By David Loshin

Hadoop is an open source technology that today is the data management platform most commonly associated with big data. Its creators designed the original distributed processing framework in 2006, basing it partly on ideas that Google outlined in a pair of technical papers describing its distributed file system and MapReduce programming model.

Yahoo became the first production user of Hadoop that year. Soon, other internet companies, such as Facebook, LinkedIn and Twitter, adopted the technology and began contributing to its development. Hadoop eventually evolved into a complex ecosystem of infrastructure components and related tools that several vendors package together in commercial Hadoop distributions.

Running on clusters of commodity servers, Hadoop offers users a high-performance, low-cost approach to establishing a big data management architecture to support advanced analytics initiatives.

As awareness of Hadoop's capabilities has increased, its use has spread to other industries for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data. This includes web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other internet of things devices. 

What is Hadoop?

The Hadoop framework encompasses a large number of open source software components: a set of core modules that capture, process, manage and analyze massive volumes of data, surrounded by a variety of supporting technologies. The core components include the Hadoop Distributed File System (HDFS) for data storage, YARN for cluster resource management and job scheduling, MapReduce for parallel processing, and the Hadoop Common set of shared utilities and libraries.

In Hadoop clusters, those core pieces and other software modules layer on top of a collection of computing and data storage hardware nodes. The nodes connect via a high-speed internal network to form a high-performance parallel and distributed processing system.
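The processing flow those clusters implement is Hadoop's MapReduce model: map tasks emit key-value pairs from the input, the framework shuffles the pairs by key, and reduce tasks aggregate each group. Below is a minimal single-process sketch of that pattern in plain Python; it is not a real Hadoop job, which would run the map and reduce functions in parallel across cluster nodes with HDFS supplying the input.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

lines = ["big data needs big clusters", "clusters of commodity servers"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["clusters"])  # prints "2 2"
```

The same three-phase structure scales from this toy word count to petabyte-sized jobs precisely because each map and reduce call is independent and can be assigned to any node.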

As a collection of open source technologies, no single vendor controls Hadoop; rather, the Apache Software Foundation manages its development. Apache offers Hadoop under a license that grants users a no-charge, royalty-free right to use the software.

Developers and other users can download the software directly from the Apache website and build Hadoop environments on their own. However, Hadoop vendors provide prebuilt, community versions with basic functionality that users can also download at no charge and install on a variety of hardware platforms. The vendors also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services.

In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management or data integration with external platforms. These commercial offerings make Hadoop more attainable for companies of all sizes.

This is especially valuable when the commercial vendor's support services team can jump-start a company's design and development of its Hadoop infrastructure. Support teams can also guide the selection of tools and the integration of advanced capabilities to deploy high-performance analytical systems that meet emerging business needs.

The components of a typical Hadoop software stack

What do you actually get when you use a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions bundle -- but aren't limited to -- supporting components such as the Hive SQL query engine, the HBase NoSQL database, the Spark processing engine, and tools for data ingestion, workflow scheduling and cluster management.

Because the software is open source, companies don't have to purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it contributes to the community as part of its Hadoop distribution.
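To illustrate what those add-on layers buy users: SQL-on-Hadoop tools such as Apache Hive, a common piece of commercial distributions, let analysts express jobs as SQL queries rather than hand-written MapReduce code. The following Hadoop-free sketch uses Python's built-in sqlite3 module only to show the style of query involved; the clicks table and its data are invented for illustration.

```python
import sqlite3

# Hadoop-free sketch of the SQL-on-Hadoop idea: tools like Hive accept a
# SQL query and translate it into distributed jobs over cluster data.
# Here an in-memory SQLite table stands in for a hypothetical Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("home", 1), ("home", 2), ("pricing", 1), ("home", 3)],
)

# An aggregate query of the kind an analyst might submit to a SQL engine
# instead of coding map and reduce functions by hand.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY COUNT(*) DESC"
).fetchall()
print(rows)  # prints "[('home', 3), ('pricing', 1)]"
```

On a real cluster, the engine would compile such a query into parallel tasks over data stored in HDFS, which is the practical reason distributions bundle these tools on top of the core modules.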

Who manages the Hadoop big data management environment?

It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance. Those IT teams typically include roles such as big data architects, data engineers, Hadoop system administrators, application developers, and data quality and testing specialists.

The Hadoop software platform market

The evolution of Hadoop as a viable, large-scale data management ecosystem has also created a new software market that's transforming the business intelligence and analytics industry. This has expanded both the kinds of analytics applications that user organizations can run and the types of data that the companies can collect and analyze as part of those applications.

The market now includes two major independent vendors that specialize in Hadoop: Cloudera Inc., formed when Cloudera and Hortonworks merged in October 2018, and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include cloud platform market leaders AWS, Google and Microsoft, which uses Hortonworks technology as part of its managed big data service.

Over the years, the Hadoop market has matured -- and consolidated -- significantly. IBM, Intel and Pivotal Software all dropped out of the market, but the combination of Cloudera and Hortonworks is the biggest change for users to date. The merger of the former rivals gives the new Cloudera a larger share of the market and could enable it to compete more effectively in the cloud.

In fact, Cloudera's new messaging is that it will deliver "the industry's first enterprise data cloud" -- an indication of its desire to compete with the AWS, Microsoft Azure and Google clouds.

Cloudera plans to develop a unified offering called the Cloudera Data Platform, although it hasn't said when it will become available. In the meantime, the company will continue to develop the existing Cloudera and Hortonworks platforms and support them until at least January 2022.

Although the new Cloudera may be more competitive, a potential downside to the merger is that Hadoop users now have fewer options. That's why it's even more critical to evaluate the vendors that provide Hadoop distributions and compare their offerings along two primary aspects.

First is the technology itself: what's included in the different distributions, what platforms they're supported on and, most importantly, what specific components the individual vendors support.

Second is the service and support model: what types of support and SLAs vendors provide within each subscription level, and how much the different subscriptions cost.

Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship.

Linda Rosencrance contributed to this report.

25 Feb 2019

All Rights Reserved, Copyright 2005 - 2024, TechTarget