Let's just get this part out of the way: What you are about to read isn't sexy. Instead, it focuses on something...
much more pragmatic: plumbing. That's the word Andy Palmer used to describe Tamr Inc., the Cambridge, Massachusetts startup he founded with Michael Stonebraker. Tamr, a data curation tool, isn't this team's first foray into cutting-edge database technology. The dynamic duo previously built the Vertica analytic database, now owned by Hewlett-Packard.
At its core, Palmer said, Tamr acts like a Web search engine for enterprise data. "What does a Web search engine do? It goes out, finds all of these different websites that are out there, crawls them, organizes the information that it gathers while it's crawling and then relates it all," said Palmer, who serves as Tamr's CEO.
It was created to solve one of the biggest pain points facing businesses today -- especially those large, traditional enterprises with legacy systems: breaking data out of silos to get a single view across the organization. "I think of it as plumbing -- this basic act of connecting together data sources and making them operate together was like the missing piece in the overall equation," Palmer said.
In the early days of IT, silos made sense. "Most IT organizations, when they build systems out, they're very application-centric," Palmer said. Billing needed a database, so IT built a billing database; customer service needed a database, so IT built a customer service database. But, as businesses grew, as they underwent mergers and acquisitions, as their data landscapes ballooned to include multiple ERP or CRM systems, the application-centric approach became inefficient.
He and Stonebraker saw this firsthand when transitioning customers onto Vertica Systems, a database management and analytics software company they co-founded in 2005 and sold to HP in 2011.
"We experienced so many people doing large-scale analytics projects where they had to aggregate all of their data together to do these analyses," Palmer said. "There was a small group of people that would sit down and … identify sources, where the data was going to go to, how it was going to get from point A to point B. And then they'd move on to another project to do a similar thing." Worse, because analysts, modelers and end users don't have visibility into the entire data ecosystem, "they use whatever sources happen to be closest and most convenient," he said.
That, Palmer said, becomes problematic when asking "really simple but broad analytic questions." Those are the very kinds of questions data scientists and others are asking more and more these days as they pursue a 360-degree view of the customer or, in the case of one Tamr customer, pull back the curtain on procurement. The Fortune 500 company and well-known name brand scrutinized its largest purchases but had trouble breaking down data silos so it could dig into smaller transactions, where, as it turned out, 60% of the cost-savings opportunities lived. "There's so much in the long tail, that when aggregated and exploited, it can lead to cost savings opportunities greater than the ones identified," said Alan Wagner, a field engineer at Tamr.
Tamr, which came out of stealth mode in May, uses a "probabilistic bottom-up" rather than top-down approach to index a company's data, a process that leverages machine learning to find similar data types and depends on domain experts to determine if the connection is valid. It's not as though Palmer and his team think the more deterministic, top-down models are bad (master data management is a worthy endeavor, after all), but they aren't agile and they rely on rules that "become impossible to maintain" as businesses scale to a large variety of data sources, Wagner said.
"Data is changing, new data is coming in and you need to have someone looking at all rules -- how they're going to break and how do you respond to that," he said. "It results in a brittle environment." A probabilistic bottom-up approach surfaces new data sources or recommends new attributes the way a search engine discovers a new website.
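To make the bottom-up idea concrete, here is a minimal sketch -- not Tamr's actual algorithm, with invented schema names -- of probabilistic attribute matching: score pairs of attributes from two silos by string similarity and surface the likely matches for a domain expert to confirm, rather than hand-writing brittle matching rules up front.

```python
# Sketch of probabilistic, bottom-up attribute matching (illustrative only).
# High-scoring pairs are candidates for a domain expert to validate.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] measuring how alike two attribute names are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(schema_a, schema_b, threshold=0.6):
    """Pair up attributes across two schemas that likely describe the
    same thing, sorted by descending confidence, for expert review."""
    pairs = []
    for col_a in schema_a:
        for col_b in schema_b:
            score = similarity(col_a, col_b)
            if score >= threshold:
                pairs.append((col_a, col_b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

# Two hypothetical silos describing customers in different vocabularies.
billing = ["cust_name", "cust_addr", "invoice_total"]
crm = ["customer_name", "customer_address", "last_contact"]

for a, b, score in candidate_matches(billing, crm):
    print(f"{a} <-> {b} (confidence {score})")
```

A real system would score record values as well as names and learn from each expert confirmation, but the shape is the same: probabilistic suggestions first, human validation second.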
Why JSON will be the next SQL
When Palmer gave a presentation at the Big Data Innovation conference in Boston last month, he said this: "JSON is to the next generation of data and analytics as SQL was to the last generation of data and analytics." Popularized by the technologies developers gravitated to -- technologies like MongoDB and CouchDB -- JSON is making a place for itself inside the enterprise. "It doesn't require special skills, which is the opposite of SQL," Palmer said later.
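A hedged illustration of the flexibility Palmer is pointing at: each JSON record carries its own structure, so a new attribute can show up in newer records without a schema migration, whereas a SQL table would need an ALTER TABLE first. The sample data here is invented.

```python
# JSON records can gain fields over time without any up-front schema change.
import json

records = [
    '{"customer": "Acme Corp", "invoice_total": 1200.50}',
    # A newer record adds a field no earlier record had -- no DDL required.
    '{"customer": "Globex", "invoice_total": 310.00, "channel": "web"}',
]

parsed = [json.loads(raw) for raw in records]
for doc in parsed:
    # .get() tolerates the attribute being absent in older records.
    print(doc["customer"], doc.get("channel", "n/a"))
```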
Agile is not ubiquitous
Agile might be a big buzzword, but, according to Emmet Keeffe, co-founder and CEO of the software company iRise, the largest enterprises are clinging to waterfall. "Any company bigger than $5 billion in revenue, the IT shops are basically waterfall shops, still," he said. Those businesses are experimenting with agile, he said, but only with between 1% and 10% of their portfolio and only with projects that are self-contained such as mobile application development.
Keeffe said IT still uses waterfall for a few reasons: First, global projects that involve teams around the world are too spread out to sustain agile methodology. Second, the governance wrapped around the software testing lifecycle recommends writing requirements up front, but agile methodology recommends the opposite. Third, large projects require a budget and a deadline, and both can be hard to pin down when using agile.
Previously on The Data Mill
Bisociation and New Yorker cartoons
Gartner's top 10 technologies and trends for 2015
If data is the new oil, does it need a refinery?