At The Goldman Sachs Group Inc., implementing cutting-edge technology tends to be a DIY affair. Of course, it helps that roughly a third of the global investment bank's employees are part of either the technology team or the strategy team, said Peter Ferns, CTO for compliance technology at the New York City-based firm. "As far as our technology organization goes, we build," he said.
That was the case with "big graph," a project that came out of the firm's big data strategy working group, of which Ferns is a member. The graph analytics platform at Goldman Sachs maps data into objects (or nodes) and the connections (or edges) between these objects. That way, relationships between customers, employees, companies, transactions and so on can be interrogated in ways they never could before.
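To make the node-and-edge idea concrete, here is a minimal sketch in Python. It is not the firm's implementation, which the article does not describe at this level; the `Graph` class and the sample entities (`alice`, `acme`, `t1`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Graph:
    """Tiny undirected graph: nodes are objects, edges are labeled connections."""

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(set)   # node id -> {(neighbor id, relationship)}

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, a, b, relationship):
        # Store the connection in both directions so either end can be queried.
        self.edges[a].add((b, relationship))
        self.edges[b].add((a, relationship))

    def neighbors(self, node_id, relationship=None):
        # Optionally filter by the kind of relationship.
        return sorted(
            n for n, r in self.edges[node_id]
            if relationship is None or r == relationship
        )


g = Graph()
g.add_node("alice", kind="employee")
g.add_node("acme", kind="company")
g.add_node("t1", kind="transaction")
g.add_edge("alice", "t1", "executed")
g.add_edge("acme", "t1", "counterparty")

# Which entities are tied to transaction t1?
linked = g.neighbors("t1")
print(linked)
```

Once customers, employees, companies and transactions live in one structure like this, a question such as "who is connected to this trade?" becomes a neighbor lookup rather than a join across separate systems.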
But graph analytics technology is not new, so why all of the enterprise buzz now? One reason is better, faster and cheaper technology, enabling companies to apply graph analytics techniques to large data sets. "Big data technologies that make it easy to computationally process these graphs and store them are all combining together to make it something that's feasible," said Andrew Fast, chief scientist at Elder Research Inc., a consultancy in Charlottesville, Va.
Compliance and fraud are social issues
At Goldman Sachs, the graph analytics platform is being used for compliance and fraud detection, among other things, Ferns said at Strata + Hadoop World last fall. "If you're familiar with compliance in the financial markets, regulators require us to basically look at every transaction that happens in the bank every day," he said. "So compliance tends to wind up being an aggregation point for data, for better or worse."
Per day, the data collected can include hundreds of millions of market orders and executions, tens of millions of trades, billions of market tick data points, and millions of emails, instant messages and other forms of electronic communication. Because Goldman Sachs is required to retain the information for a period of time, the numbers tend to swell quickly. For electronic communications alone, Ferns said the data measures into the petabytes.
"Historically, aggregating, maintaining and retaining this data has been a major pain. But regulators have done us a favor because now we have this treasure trove of information that we've stored for however many years, and we've gotten good at moving data around and housing it," he said. "Now that the technology has caught up to the business case, we have the opportunity to analyze it on a broad scale."
Fast agreed that applying graph analytics to compliance, risk management or surveillance is a powerful use case -- especially today. "The big assumption in statistics -- and most of analytics -- is that data are independent, they're not connected in any way. And graph analytics breaks that assumption," he said.
Fast speaks from experience. As a doctoral candidate at the University of Massachusetts at Amherst, he partnered with what was then known as the National Association of Securities Dealers (now called the Financial Industry Regulatory Authority), a self-regulatory member organization. The team used graph analytics to explore compliance, risk and fraud issues related to stockbrokers. "The hypothesis there, as well as with the anti-fraud research I do in my current position, is that fraud and compliance are as much social issues as anything else," Fast said. "It's either a culture that's shared, a supervisor that's causing issues down the chain, or an interaction between individuals."
Plus, risky and fraudulent activities tend to be duplicitous. "There aren't a lot of big, bright, red flashing lights that say, 'Look here!'" Fast said. Graph analytics can surface weaker, more implicit signals within a "neighborhood" or a cluster of nodes that might go undetected when investigating individuals.
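The point about weak signals in a neighborhood can be illustrated with a small Python sketch. The complaint counts, broker IDs and both thresholds below are invented for the example; the article does not say how any real system scores risk.

```python
def neighborhood_total(adjacency, complaints, node):
    """Sum the complaint counts of a node and its direct neighbors."""
    members = {node} | adjacency.get(node, set())
    return sum(complaints.get(m, 0) for m in members)


# Hypothetical data: who is connected to whom, and minor complaints per broker.
adjacency = {
    "b1": {"b2", "b3"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
    "b4": {"b5"},
    "b5": {"b4"},
}
complaints = {"b1": 2, "b2": 2, "b3": 2, "b4": 2, "b5": 0}

INDIVIDUAL_LIMIT = 3     # assumed threshold for reviewing one broker
NEIGHBORHOOD_LIMIT = 5   # assumed threshold for reviewing a cluster

# Looking at individuals one at a time, nobody crosses the line.
flagged_alone = [b for b in sorted(complaints) if complaints[b] >= INDIVIDUAL_LIMIT]

# Looking at each broker together with its neighbors, one cluster stands out.
flagged_by_neighborhood = [
    b for b in sorted(adjacency)
    if neighborhood_total(adjacency, complaints, b) >= NEIGHBORHOOD_LIMIT
]

print(flagged_alone)
print(flagged_by_neighborhood)
```

Broker `b4` has the same individual count as `b1`, but only `b1`'s cluster accumulates enough weak signals to warrant a closer look.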
NASD had long suspected some high-risk brokers worked in teams, but lacked a way to unearth these connections. It turned out these "tribes," as they were referred to in Fast's 2007 paper, coordinated their movements from branch to branch or firm to firm, often dispersing for a period of time and reconnecting down the road. "It might be a year, it might be two years, it might be three years, but now, you have all the same people together again, and they can do the same bad stuff at a different place, a new place," Fast said. By mapping and analyzing relationship patterns over time, NASD could begin to flag potential bad actors.
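One simplified way to look for such "tribes" is to count how many different firms a pair of brokers has worked at during overlapping periods. The Python sketch below is an illustration under that assumption, not the method from the 2007 paper; the function name `repeat_coworkers` and the employment records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def repeat_coworkers(records, min_firms=2):
    """Find broker pairs whose employment overlapped at min_firms or more firms.

    Each record is (broker, firm, start_year, end_year), years inclusive.
    """
    by_firm = defaultdict(list)
    for broker, firm, start, end in records:
        by_firm[firm].append((broker, start, end))

    shared = defaultdict(set)  # (broker_a, broker_b) -> firms where they overlapped
    for firm, stints in by_firm.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(stints, 2):
            # Two stints overlap when each starts before the other ends.
            if a != b and a_start <= b_end and b_start <= a_end:
                shared[tuple(sorted((a, b)))].add(firm)

    return {
        pair: sorted(firms)
        for pair, firms in shared.items()
        if len(firms) >= min_firms
    }


records = [
    ("ann", "FirmA", 2000, 2002),
    ("bob", "FirmA", 2001, 2002),
    ("cy", "FirmA", 2001, 2003),
    ("ann", "FirmB", 2003, 2004),  # the group disperses
    ("bob", "FirmC", 2003, 2004),
    ("ann", "FirmD", 2005, 2007),  # two members reconvene
    ("bob", "FirmD", 2005, 2006),
    ("cy", "FirmD", 2008, 2009),   # same firm, but after ann and bob have left
]

tribes = repeat_coworkers(records)
print(tribes)
```

A real analysis would also need to ask how likely such repeat overlaps are by chance, since brokers at large firms will share employers for innocent reasons.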
Should CIOs build or buy?
Still, building a graph analytics platform in-house, as Goldman Sachs did, is not a typical IT project. "It requires a lot of specialized expertise," said Rita Sallam, an analyst at Gartner Inc.
Some aspects of building the platform would be old hat for enterprise IT. Ferns said the data flow through the technology stack, for example, was "not dissimilar to other business data flows." "We're going from the bottom to the top," he said, where the bottom is authoritative data sources and the top is prepared data for business use.
But he also outlined aspects of a graph analytics platform that aren't in the traditional IT wheelhouse. In between authoritative data sources and the user interface, Goldman Sachs developed a raw data store, or what Ferns referred to as a data lake, as well as a raw data registry, which defines what the raw data looks like. "That's a whole collection of different technologies, but a big part of it is Hadoop," Ferns said.
Even more difficult? The Goldman Sachs platform is built on an ontology model, or a description of the data types and the interrelationships between them. "We have a strong culture, historically, of model-driven development at Goldman Sachs," Ferns said. "This is really just an extension of the same." The ontology model relies on World Wide Web Consortium standards: the Resource Description Framework (RDF), a format for Web data; OWL, or Web Ontology Language, which helps define how the data is related; and SPARQL, a query language used against RDF data.
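For readers unfamiliar with these standards, RDF represents data as subject-predicate-object statements, and a SPARQL query is essentially a pattern with variables matched against them. The plain-Python sketch below mimics that idea without an RDF library; the `ex:` identifiers and the `match` helper are hypothetical, and a production system would use an actual RDF store and SPARQL engine.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples that fit the pattern; None plays the role of a variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Hypothetical statements: one ontology-style fact about types,
# followed by instance data that uses those types.
triples = [
    ("ex:Trader", "rdfs:subClassOf", "ex:Employee"),
    ("ex:alice", "rdf:type", "ex:Trader"),
    ("ex:alice", "ex:executed", "ex:trade42"),
    ("ex:trade42", "ex:counterparty", "ex:acme"),
]

# Roughly analogous to the SPARQL pattern: SELECT ?p ?o WHERE { ex:alice ?p ?o }
alice_facts = match(triples, subject="ex:alice")

# Roughly analogous to: SELECT ?s WHERE { ?s ex:counterparty ex:acme }
acme_trades = [s for s, _, _ in match(triples, predicate="ex:counterparty", obj="ex:acme")]

print(alice_facts)
print(acme_trades)
```

The first triple shows where an ontology comes in: it describes how types relate to one another, separately from the data about any particular trader or trade.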
According to Sallam, standards such as RDF and SPARQL are still emerging, and skilled workers who are well versed in them are few and far between. "The problem with a graph database is that there isn't a standard query language like SQL," she said. Sallam suggested that, unless graph analysis is mission-critical, the way it is for Goldman Sachs, CIOs are better off buying the technology rather than building it. "The typical CIO should look for more packaged platforms or applications that allow them to [use] the techniques to support their expanding big data and analytics programs to solve use cases where it makes business sense," she said.
And those use cases are plentiful. Graph analytics can be used for social media network analysis, telecommunications network analysis, location intelligence, market basket analysis, supply chain monitoring and even genetics, according to Sallam. Vendors such as IBM's i2, Centrifuge Systems and Palantir Technologies, which was valued at $15 billion in November, offer visual analytics platforms with graph or network analysis capabilities built in. Sallam referred to them as "a Tableau alternative"; Tableau is a popular visual analytics platform that doesn't currently provide network analysis tools.
Even if CIOs don't invest in graph analytics technologies directly, they should prepare to see the functionality embedded in products they will be investing in, according to Sallam. Smart machines and virtual personal assistants, for example, utilize graph analytics and are already making their way into the enterprise.