Hadoop's influence on technology development is going gangbusters, no question about it. Just last week, Oracle weighed in with Big Data SQL software, which is designed to run SQL queries across Hadoop, NoSQL and Oracle databases, thus minimizing data movement.
But the popular distributed computing framework still needs more robust security and data governance before it can really be called enterprise-ready, according to Gartner analysts Nick Heudecker and Merv Adrian. "That needs to improve drastically and quickly to meet emerging enterprise needs," said Heudecker, who recently teamed up with Adrian to update the Hadoop 2.0 webinar they presented in January.
"The key thing to understand is that security is a new element in Hadoop thinking," Adrian said. Enterprises experimented with Hadoop, "often in a silo, often on an isolated cluster," which effectively made security a non-issue. "The more it becomes part of the fabric, the more we have to think about it," he said.
The same could be said for governance, where "a data supply chain is only as strong as its weakest link," Adrian said. He cautioned CIOs who have spent years building and refining master data management and data ownership programs to hold off on "dropping Hadoop" into the middle of the enterprise data landscape. Doing so "can be very disruptive," creating ripples throughout the entire organization, he said.
Traditionally, database management systems (DBMS) have been in charge of things like access, authorization and authentication -- the three A's of security, according to Adrian. But Hadoop doesn't have those classic DBMS characteristics. "Therefore, there was a need for what my dad called a 'shim,' something between two pieces that made them fit together properly," he said. "And a lot of security players -- new and existing -- have moved into that space to provide additional security for the stack." The vendor push is happening right alongside the Hadoop community's race "to complete pieces of security for the stack itself," he said.
Solving Hadoop security and governance is neither simple nor straightforward, Adrian said. New Hadoop projects are emerging all the time, and each needs to be integrated into the rest of the stack. And not just integrated, but, for the enterprise at least, wrapped in governance and security features. As Adrian and Heudecker pointed out, just because one element of the Hadoop ecosystem is upgraded with security features doesn't mean the whole stack is secure. All aspects need to be looked at, Adrian said.
On the other hand, the fact that security and governance features have bubbled up to the surface of Hadoop discussions suggests a new level of maturity, Adrian noted. "Governance and security happen quite normally about 10 years into the cycle of information management technologies," he said. Apache Hadoop was first released in 2005, "so, here we are, nine years in," Adrian said. "We're right on schedule."
While that may be true and perhaps even a little reassuring, Adrian made a final, emphatic plea that CIOs not rest on their Hadoop laurels: "Do not ignore security in your thinking."
Apache Giraph, HaaS & Hadoop appliance
The original dynamic duo -- Hadoop Distributed File System (HDFS) and MapReduce -- has inspired open-source (and proprietary) projects aplenty. Several have been integrated into offerings from MapR, Hortonworks and Cloudera (what some call "the big three" in Hadoop distribution).
"Just in the two years we've been following the Hadoop distributors, the number of projects supported by all the major ones has gone from a half dozen to 15," Adrian said during the webinar. The list includes Spark, an in-memory data processing engine; YARN, a resource management layer; and Apache Mahout, which brings machine learning into the mix.
And Hadoop's popularity continues to spread. Adrian and Heudecker mentioned a few trends related to the technology that CIOs might want to familiarize themselves with to stay conversant in Hadoop.
Apache Giraph: A graph processing framework for large-scale data sets. The technology is based on Google's Pregel, which was described in a 2010 paper. "If you squint the right way, you will notice that graphs are everywhere," the paper reads. "Transportation routes create a graph of physical connections among geographic locations. Paths of disease outbreaks form a graph, as do games among soccer teams, computer network topologies and citations among scientific papers." Today, Facebook uses the technology to analyze its social network. Giraph is a top-level project with the Apache Software Foundation, but it is not an official part of the Apache Hadoop project.
Hadoop as a Service: Vendors providing Hadoop as a service spin up clusters for customers and provide support for data integration, query writing and cluster management. In a blog post last fall, Adrian wrote that HaaS (as some refer to it) vendors "mask Hadoop development complexity." Relative newcomers such as Altiscale, Xplenty and Qubole specialize in Hadoop as a Service alongside more established vendors like Amazon, Rackspace and Microsoft. CIOs should consider it "a quick turnkey solution for kicking the tires on Hadoop," Heudecker said during the webinar.
The Hadoop appliance: Hadoop doesn't need to be a DIY project anymore. That's practically the tagline for one of Teradata's Hadoop appliance offerings. And the relational database management system vendor isn't alone in coupling Apache Hadoop software with proprietary software and hardware. In the last couple of years, others like Oracle, IBM and, most recently, Dell (in collaboration with Intel and Cloudera) have rolled out Hadoop appliances. In most cases, vendors have partnered with Hadoop distributors, as is the case with Dell.
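For readers curious about what "graph processing" in the Pregel style looks like, here is a minimal sketch of the vertex-centric idea behind Giraph. It is a single-process Python illustration, not Giraph's actual Java API, and the function name `pregel_max` and the sample graph are hypothetical. Each vertex holds a value, exchanges messages with its neighbors in rounds called supersteps, and goes quiet once it has nothing new to report; the run ends when no messages remain.

```python
from collections import defaultdict


def pregel_max(graph, values):
    """Propagate the largest vertex value through a graph, Pregel-style.

    graph:  dict mapping vertex -> list of out-neighbors
    values: dict mapping vertex -> initial numeric value
    Returns (final values, number of supersteps run).
    """
    values = dict(values)  # work on a copy so the caller's dict is untouched

    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The vertex learned something new, so it tells its neighbors.
                values[vertex] = best
                for neighbor in graph.get(vertex, []):
                    outbox[neighbor].append(best)
            # Otherwise the vertex "votes to halt" by sending nothing.
        inbox = outbox
        supersteps += 1

    return values, supersteps


# Hypothetical four-vertex chain: a <-> b <-> c <-> d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}
final, steps = pregel_max(graph, values)
print(final, steps)
```

In a real deployment the vertices would be partitioned across a cluster and the message exchange would happen over the network between supersteps; the sketch keeps everything in one loop only to show the programming model.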