Big data's sidekick, Hadoop, is spinning a new YARN. That stands for Yet Another Resource Negotiator, and it was a big part of the conversation at Hortonworks' recent Hadoop Summit in San Jose, Calif. A major enhancement for Hadoop 2.0, Apache YARN will schedule and monitor jobs as well as allocate the resources necessary to complete them. That might seem like small potatoes, but one of the biggest complaints about Hadoop is how hard it is to get data out of storage and processed. YARN could change that, according to experts.
Originally, Hadoop had two major components: the Hadoop Distributed File System (HDFS), which can quickly ingest and store data of all shapes and sizes, and MapReduce, a framework for processing data. The beauty of MapReduce is its ability to take queries that involve a lot of data, break them up over cheap commodity servers, and process them in parallel. But the problem with MapReduce is that it can be cumbersome (it isn't the friendliest programming model out there) and slow (it relies on batch processing).
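For readers who have never seen the pattern, here is a minimal sketch of the map, shuffle and reduce steps, using word counting as the job. It is plain Python rather than Hadoop's own API, the function names are made up for illustration, and a thread pool stands in for the commodity servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key before they are reduced."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks, workers=4):
    # Assumption: each chunk stands in for a block of data on a separate
    # server, and each worker thread stands in for one of those servers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data big cluster", "data moves slowly", "big jobs run in batch"]
    print(word_count(chunks))
```

Even in this toy form, nothing comes back until every chunk has been mapped and reduced, which is the batch behavior the complaint above is about.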
YARN (also called NextGen MapReduce or MRv2) could address both problems. As Jeffrey Kelly, SiliconAngle contributor and researcher for The Wikibon Project, put it in a recent article: "YARN is essentially a new operating system for Hadoop that will allow the open source big data framework to break free from the shackles of MapReduce."
As a new layer in the Hadoop architecture, YARN is an intermediary between applications (including but not limited to MapReduce) and the data residing in HDFS, giving Hadoop the flexibility to tackle multiple jobs at once. Or as Brian Proffitt, tech expert, author and ReadWrite editor, stated back in May: "It manages resources similar to the way an operating system handles jobs, which means no more one-at-a-time limitations."
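The operating system comparison can be made concrete with a toy model. The sketch below is not YARN's API; the class, its methods and the memory figures are all hypothetical. It only illustrates the idea of one negotiator handing out slices of a shared cluster to several kinds of applications at once.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide scheduler; not YARN's real API."""
    total_memory_mb: int
    allocations: dict = field(default_factory=dict)  # app name -> MB granted

    @property
    def free_memory_mb(self):
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app, memory_mb):
        """Grant the request if enough memory is free; otherwise refuse it."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.allocations[app] = self.allocations.get(app, 0) + memory_mb
        return True

    def release(self, app):
        """Return everything an application holds to the shared pool."""
        return self.allocations.pop(app, 0)


if __name__ == "__main__":
    rm = ToyResourceManager(total_memory_mb=8192)
    print(rm.request("mapreduce-batch", 4096))    # batch job gets a share
    print(rm.request("stream-processing", 2048))  # a second job runs alongside it
    print(rm.request("machine-learning", 4096))   # refused: not enough memory left
    rm.release("mapreduce-batch")
    print(rm.request("machine-learning", 4096))   # fits once the batch job is done
```

The point of the sketch is the separation of duties: the applications ask, the negotiator decides, and none of them needs to be MapReduce.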
That suggests Hadoop can continue to harness the power of MapReduce, as well as take on streaming or machine learning or any number of applications, according to these experts. If it lives up to expectations, it just might be the push Hadoop needs "to be used as the foundation of an enterprise data management architecture," according to Kelly. Now that would be a CIO-worthy yarn! Even if YARN doesn't have the knitting prowess its fans proclaim, the fast-paced evolution of Hadoop is certainly a sign of the times, if not necessarily a spanking new technology approach, as my colleague over at SearchDataManagement, Jack Vaughan, points out.
Hunks, HBases and NameNodes, oh my!
Apache YARN was a hot topic at the summit, but far from the only big data blockbuster. Here is a roundup of some of the enhanced Hadoop-related products:
- Splunk launches Hunk: That's right, Splunk, which carved out a name for itself when it comes to machine-generated data, launched a beta version of Splunk Analytics for Hadoop -- also known as Hunk. The new software product is billed as being able to "explore, analyze and visualize data in Hadoop."
- MapR Technologies Inc.: The Hadoop provider announced that combining the Fusion ioMemory platform (flash-based storage) with the MapR M7 (a big data platform that integrates functionality such as MapReduce, SQL, search and so on) can make read-intensive Apache HBase NoSQL applications run 25 times faster. HBase "has previously been limited by disk storage bottlenecks," according to MapR.
- WANdisco announced Non-Stop NameNode WAN Edition. The NameNode keeps track of where data is stored within a Hadoop cluster. When a NameNode goes down, according to the Hadoop Wiki, the file system goes offline. Non-Stop NameNode WAN Edition solves that single point of failure by supporting NameNodes that are geographically separated and synchronizing the data to provide continuous availability.
- Pentaho Corp. added an "adaptive big data layer" to its platform, which enables customers to quickly access the latest versions or updates from Hadoop distribution providers like Cloudera, Hortonworks and MapR.
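To see why the NameNode in the WANdisco item above is a single point of failure, consider this toy model. It is an illustrative sketch only: the class and function names are hypothetical, and the naive copy-to-every-replica approach is an assumption for demonstration, not a description of how WANdisco's product actually synchronizes NameNodes.

```python
class NameNodeUnavailable(Exception):
    """Raised when no reachable NameNode can answer a lookup."""


class ToyNameNode:
    """Hypothetical in-memory index of which servers hold each file."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.locations = {}  # file path -> list of data node names

    def record(self, path, data_nodes):
        self.locations[path] = list(data_nodes)

    def locate(self, path):
        # The data itself may be intact, but without the index nobody can find it.
        if not self.online:
            raise NameNodeUnavailable(self.name)
        return self.locations[path]


def record_everywhere(name_nodes, path, data_nodes):
    """Keep every replica's index in step (a naive form of synchronization)."""
    for node in name_nodes:
        node.record(path, data_nodes)


def locate_with_failover(name_nodes, path):
    """Ask each NameNode in turn and return the first answer received."""
    for node in name_nodes:
        try:
            return node.locate(path)
        except NameNodeUnavailable:
            continue
    raise NameNodeUnavailable("no NameNode reachable")


if __name__ == "__main__":
    san_jose, london = ToyNameNode("san-jose"), ToyNameNode("london")
    record_everywhere([san_jose, london], "/logs/day1", ["dn1", "dn4"])
    san_jose.online = False  # simulate an outage at one site
    print(locate_with_failover([san_jose, london], "/logs/day1"))
```

With a single ToyNameNode, the outage would raise an error on every lookup; with a second, synchronized copy in another location, the lookup still succeeds.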
Pig gets a makeover
Netflix, a Hadoop user, contributor to the open source community, and headliner at the summit, also had some news to dish. About a week before the event, the entertainment company made its Hadoop Platform as a Service tool, Genie, open source. Netflix uses Amazon's Simple Storage Service (S3) as a data warehouse as well as the cloud provider's Elastic MapReduce "API to provision and run Hadoop clusters." So what kind of magic can Genie conjure up? It "provides job and resource management for the Hadoop ecosystem in the cloud" for tools such as Hive, Pig and Java, according to Sriram Krishnan.
At the summit, Netflix upped the ante for Pig, a language platform used to simplify large data set queries for MapReduce. In a blog post titled Introducing Lipstick on A(pache) Pig, the new workflow visualization tool is described as giving developers a chance "to visualize and monitor the execution of their data flows at a logical level." Lipstick originally produced a "graphical depiction of the workflow" only after the job was completed. Now, the tool gives the developer information about the job as it runs. Who says you can't put lipstick on a pig?