
Hadoop security: Don't build your data lake without it

Why security for Hadoop is different from traditional data management tools, a list of open source Hadoop security projects in the works, and CIO budget expectations for 2015: The Data Mill reports.

Hadoop is flexible, economical, scalable and powerful. But before it pushes beyond pilot projects and proofs of concept to become an enterprise-grade technology, there's still a big question CIOs need to evaluate: Is it secure?

The answer: Not quite yet, according to Jeff Kelly, analyst for Marlborough, Mass.-based The Wikibon Project. While Hadoop security capabilities are being developed by the open source and commercial vendor community, they are still immature and often purpose-built rather than comprehensive, Kelly said during the recent webinar Securing Sensitive Data in Hadoop: Challenges and New Approaches.

That's due partially to the fact that Hadoop isn't a single technology stack from a single vendor. Instead, it's "a collection of projects, subprojects and vendor-developed extensions," Kelly said, which means security features are often developed in silos, creating "a hodgepodge … that [is] difficult to integrate at a platform level." And, while security typically comes a little late to new technology (so, no surprise there), Hadoop's less-than-robust security makes even more sense, according to Kelly. "It was developed in 2005 to solve a specific problem: to help Yahoo index the World Wide Web," he said. In other words, it was developed as a single application platform for public-facing data.

But since then, Hadoop has practically become the backbone for big data, and it's evolved so that it can "handle more workflows and applications, and a lot of those involve sensitive data," Kelly said. The good news? No one has to reinvent the security wheel for Hadoop. Kelly pointed to the tried-and-true three As of security (authentication, authorization and auditing) as well as data protection techniques from the traditional data management world. The bad news? Traditional data management techniques are just a starting point.

When YARN -- Yet Another Resource Negotiator -- was rolled out in 2013, Hadoop went from being a single application platform to a platform that could enable multiple applications, Kelly said. The new development could help businesses break down data silos. (YARN is an impetus behind the increasingly popular concept of the "data lake" or "data hub," a storage and data staging area, according to Kelly.) But this new iteration of Hadoop also introduces security challenges that aren't typically found in the traditional data management world.

When pooling all of the data in one place, CIOs need to keep in mind that applications and users will require different access to the same data sets. That means incorporating tools such as data-masking capabilities, flexible authorization and "authentication tools that can reconcile and enforce user credentials and permission controls across various user roles and multiple applications at a platform level," Kelly said. It's also important to realize that the byproduct of integrating sensitive data is even more sensitive data, he added, which means CIOs may need to overhaul the enterprise's data governance policies to be more proactive.
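To make the idea of "different access to the same data sets" concrete, here is a minimal Python sketch of role-based masking with an audit trail. It is an illustration only, not how any Hadoop security project implements it: the role names, column names, the `CLEAR_TEXT_COLUMNS` policy table and the `read_record` helper are all hypothetical.

```python
# Hypothetical policy: the columns each role may see in clear text.
# Roles and column names are invented for illustration.
CLEAR_TEXT_COLUMNS = {
    "fraud_analyst": {"customer_id", "card_number", "amount"},
    "marketing": {"customer_id", "amount"},
}


def mask(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def read_record(record: dict, role: str, audit_log: list) -> dict:
    """Return the record as the given role is allowed to see it.

    Authorization: roles missing from the policy table are refused.
    Auditing: every attempt, allowed or not, is appended to audit_log.
    """
    allowed = CLEAR_TEXT_COLUMNS.get(role)
    if allowed is None:
        audit_log.append((role, "denied"))
        raise PermissionError(f"role {role!r} has no access to this data set")
    audit_log.append((role, "read"))
    # Columns outside the role's clear-text set are masked, not dropped.
    return {
        col: (val if col in allowed else mask(str(val)))
        for col, val in record.items()
    }


if __name__ == "__main__":
    log = []
    row = {"customer_id": "C-1001", "card_number": "4111111111111111", "amount": "25.00"}
    print(read_record(row, "marketing", log))
    print(read_record(row, "fraud_analyst", log))
    print(log)
```

The same stored record yields two different views depending on who asks. A policy table like this only encodes decisions someone has already made about who may see what; making those decisions is the governance work Kelly describes.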

"That's not just a technology issue," Kelly said. "It's also a people and process challenge as well."

Open source security projects

During his webinar on securing sensitive data, Kelly highlighted several open source security efforts in the Hadoop community happening right now. He described them as follows:

  1. Apache Knox. A REST API gateway that provides a single access point for all REST interactions with Hadoop clusters.
  2. Apache Sentry. A modular system for providing role-based authorization for both data and metadata stored in the Hadoop Distributed File System, or HDFS. Sentry is a project primarily led by Cloudera, one of the best-known Hadoop distributors.
  3. Apache Ranger. A centralized environment for administering and managing security policies across the Hadoop ecosystem. This project is led by Hortonworks, another well-known Hadoop distributor, and includes technology that it gained when it acquired XA Secure in mid-2014.
  4. Apache Falcon. A data governance engine that allows administrators to define and schedule data management and governance policies across the Hadoop environment.
  5. Project Rhino. Aims to provide encryption, key management capabilities and a common authorization framework across Hadoop projects and subprojects. This project is led by Intel.
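As a rough illustration of the "single access point" idea behind Knox, the Python sketch below builds the kind of URL a client might use to reach HDFS's REST interface (WebHDFS) through a gateway instead of contacting cluster nodes directly. It assumes Knox's commonly documented URL layout and default port of 8443, both of which are configurable. The host name, the topology name `default`, the HDFS path and the `knox_webhdfs_url` helper are hypothetical, and nothing here contacts a real cluster.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host: str, topology: str, hdfs_path: str,
                     op: str, port: int = 8443) -> str:
    """Build a WebHDFS URL routed through a Knox-style gateway.

    Assumed layout (verify against your own deployment):
    https://<host>:<port>/gateway/<topology>/webhdfs/v1/<path>?op=<OP>
    """
    # Drop the leading slash and percent-encode the path, keeping "/" as is.
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return f"https://{gateway_host}:{port}/gateway/{topology}/webhdfs/v1/{path}?{query}"


if __name__ == "__main__":
    # Hypothetical host, topology and path, for illustration only.
    print(knox_webhdfs_url("knox.example.com", "default", "/data/lake/raw", "LISTSTATUS"))
```

Every client request takes this one front-door form, which is what lets authentication and auditing be applied in one place rather than service by service.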

The CIO budget: Chinese exceptionalism

Gartner Inc. released information from a recent global CIO survey that indicates IT budgets in the United States are expected to increase by 0.9% in 2015, which is only "marginally smaller" than the 1.1% increase globally. CIOs in the United Kingdom and Ireland reported expectations of a 1.4% IT budget increase, while CIOs in Latin America reported the lowest expected increase of just 0.4%.

The exception to the barely budging budgets is China, where CIOs are anticipating an 8.5% increase this year. Still, CIOs in China appear to be playing catch-up. Historically, their IT budgets are lower than the global average, so the expected increase in 2015 "is coming from a very low base," according to the Gartner press release. And, while 47% of CIOs around the world see themselves as a digital leader for the enterprise, only 33% of CIOs in China do.

This year's survey included 2,810 public and private sector respondents from 84 countries.

Welcome to The Data Mill, a weekly column devoted to all things data. Heard something newsy (or gossipy)? Email me or find me on Twitter at @TT_Nicole.
