Big data analytics presents businesses with incredible opportunities to gain valuable insights about their operations, transactions and customers, and create new revenue-producing lines of business. But organic data growth has created myriad sources of information, and integrating them and accessing the data can be a huge headache.
In this webcast presentation, David Loshin, president of consultancy Knowledge Integrity Inc., explains how organic data growth has created hybrid collections of computing resources, why we want to access the data in those hybrid collections, what the data lake is and what challenges it presents.
Editor's note: The following is a transcript of the first of three excerpts of Loshin's webcast presentation on enterprise accessibility to big data via data virtualization. It has been edited for clarity and length.
We're going to start out by talking about the origins of the complexity in the data environment. We have to think about the way our environments' system architectures have grown organically over numerous years, you might even say over different evolutionary periods. We might have started out with mainframes as the main and sole systems on which data processing applications would be hosted. But then we rapidly evolved through later generations to workplace servers and server-based environments, with connectivity between those as well, and eventually incorporated a distributed collection of desktop computing. ...
So we end up with a very hybrid collection of types of computing resources. Eventually we've connected them all together and created networks of these different kinds of hybrid systems. The organic [data] growth has led to this kind of mishmash of an "architecture," but there really is no overarching strategy or architecture in many of these organically grown environments.
The challenge here is that we've transitioned, I'd say over the last five years or so, from focusing largely on the operational and transactional aspects of systems to the utilization of information as a core asset, where the operational and transactional systems are really just one aspect of using that information. That shift -- along with analytics and the ability to take advantage of predictive modeling and prescriptive analytics that can help guide future business objectives -- has led us to this point where we want to be able to maximize opportunities for reusing and repurposing data.
That means that we've got to get access to the data. In prior years we've been satisfied with being able to pull data off of some of the operational systems and put it into a data warehouse. But, with these big data technologies evolving and providing the capability for storing large or massive amounts of data, people are looking back and saying, "Well, we've got all this dark data sitting around that we've been saving for the last five, 10, 15, 20, maybe even 30 or 40 years; let's see if we can take all that data and analyze that as well to see what we can extract from that." That means we want to have accessibility to information that's sitting not just on our recently installed systems, but on what I call our heritage systems. I don't use the term legacy; I'd rather call them heritage systems, because they're still in practical use and operating well.
[We need to be able to access] the data from these existing systems but also from the emerging big data platforms -- especially in organizations where they're starting to stream information from many different sources [including] the social media channels like Twitter, Facebook, those kinds of environments where there's a lot of data being produced -- and [we need to be able to capture] large amounts of that for analysis and then [blend] that with the data that has been created over time within the organization.
[The] concept of a data lake has begun to appear with greater frequency. According to TechTarget's definition, a data lake is "a large object-based storage repository that holds data in its native format until it's needed." There [are] some key aspects of that definition. The first is that it's an object-based storage repository: It's a repository for objects, not just data tables and flat files, but any type of information. The second piece is native format: We're not applying any transformation to that data as it comes in, but rather when we need it.
A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers in the way that they need them, and allowing the data consumers to consume that data in more than one way. ... For the sales team, the way the data is consumed may be specific to the sales applications, but finance and audit may need it in ways that are different from what the sales team needs.
There are some challenges, though, because we've got our heritage environment that has grown organically and we've got these new opportunities creating this data lake, and we want to be able to blend the data, or integrate the data, from those two different sources. Now the problem here is that in our existing environment, typically what happens is we'll have one system create data. It will store it in some way. It will then offload that data to a data warehouse where somebody else will come around and run a report and pull a large subset of that data out to some data mart. Then somebody might take another extract of that data as well.
So we end up pulling the same data, copying the same data, over and over again. This is this whole concept of data replication, or replication of information sets. But if we don't have any good governance practices layered on top of those data extracts, it will lead to inconsistency and poor performance. Inconsistency because somebody will pull an extract at 10 in the morning; somebody else will come around at 3 in the afternoon and pull that same extract. But because transactions have taken place in the interim, those two data sets are not going to be exactly the same. [There will be] poor performance because we're starting to overload our system by storing the same data -- or what looks like largely the same data -- over and over again. We're filling up our disks with replicas of the same data.
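To make the inconsistency point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `source` table, the order IDs and amounts, and the helper names `pull_extract` and `diff_extracts` are assumptions, not anything from the presentation. It simply shows that two copies of the same source, taken before and after a couple of transactions, no longer agree.

```python
import copy

# Hypothetical operational table: order id -> order amount.
source = {"A1": 100, "A2": 250}

def pull_extract(table):
    """Return an independent copy of the table, as an extract job would."""
    return copy.deepcopy(table)

def diff_extracts(old, new):
    """Return the sorted keys whose values differ or that appear in only one extract."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k) != new.get(k))

# The 10 a.m. extract.
morning_extract = pull_extract(source)

# Transactions that take place in the interim.
source["A2"] = 300   # an existing order is amended
source["A3"] = 75    # a new order arrives

# The 3 p.m. extract of "the same" data.
afternoon_extract = pull_extract(source)

changed = diff_extracts(morning_extract, afternoon_extract)
print(changed)  # ['A2', 'A3']
```

Each extract is also a full copy of the data, which is the storage cost described above: the more extracts are taken, the more near-identical replicas sit on disk.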
Now, the other thing is that there [are] many users who have finally become comfortable with the ways they've been able to get access to the data in the first place, using the heritage platforms that are there already. With some of our clients, for example, there are people who prefer to run an extract of data sitting on a mainframe and load it into one of their analytical systems that has been around for about 30 years, as opposed to looking at some of the newer data visualization and data discovery tools that could provide similar types of access and similar types of analysis. But because they're not aware of how those things work, and they're uncomfortable moving to new technology, there is some resistance, and there are challenges in making the integration actually work.
Then there's this third point -- which we're going to delve into in a little more detail in a later slide: There's a fundamental difference between the way that we've traditionally taken care of data management, where we've created a defined model and shoehorned data into that model, and the data lake approach, which is to put the data into the repository without making any changes and store it in its original format.
The difference there is between what's called schema-on-write, which is the conventional approach of migrating the data into a defined data set, or data model, and schema-on-read, which is [when] we impose a format and a model on top of the data when we access it, not when we store it. So there are fundamental differences between those two things and the way that we manage data.
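The contrast between the two approaches can be sketched in a few lines of Python. This is an illustrative sketch only: the JSON records, the field names and the functions `write_with_schema`, `store_raw` and `read_with_schema` are all hypothetical and do not come from the presentation or from any particular product.

```python
import json

# Hypothetical raw records as they might arrive from a source system.
raw_lines = [
    '{"id": "1", "amount": "19.99", "region": "east", "note": "gift"}',
    '{"id": "2", "amount": "5.00", "region": "west"}',
]

def write_with_schema(lines):
    """Schema-on-write: conform each record to one fixed model before storing it."""
    stored = []
    for line in lines:
        rec = json.loads(line)
        # Fields outside the model (region, note) are discarded at load time.
        stored.append({"id": int(rec["id"]), "amount": float(rec["amount"])})
    return stored

def store_raw(lines):
    """Schema-on-read, storage step: keep every record unchanged."""
    return list(lines)

def read_with_schema(stored, fields):
    """Schema-on-read, access step: impose a model when the data is read.

    `fields` maps a field name to a converter; missing fields become None.
    """
    view = []
    for line in stored:
        rec = json.loads(line)
        view.append({name: conv(rec[name]) if name in rec else None
                     for name, conv in fields.items()})
    return view

warehouse = write_with_schema(raw_lines)   # one model, fixed at write time
lake = store_raw(raw_lines)                # original format preserved

# Different consumers impose different models on the same stored data.
sales_view = read_with_schema(lake, {"id": int, "region": str})
finance_view = read_with_schema(lake, {"id": int, "amount": float})
notes_view = read_with_schema(lake, {"id": int, "note": str})
```

In this sketch the schema-on-write path has already dropped `region` and `note`, so only the raw store can still serve the sales and notes views. The trade-off is that every reader of the raw store has to handle parsing and missing fields itself.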
That introduces confusion, especially in organizations that have just spent many years trying to solidify their extraction, transformation and loading processes, pulling data out of source systems and putting it into a data warehouse. This is a very different thought process.