Big data analytics enables companies to gain insight into their business operations or customers and find new ways to grow revenue. But IT organizations typically have multiple, disparate sources of information and accessing the data from those sources presents big challenges. Data virtualization tools can help solve this problem.
In this webcast presentation, David Loshin, president of consultancy Knowledge Integrity Inc., explains the data virtualization presentation layer and federation processes and how data virtualization enables IT organizations to reduce complexity, provide accessibility to enterprise-wide data no matter its location, and extend the lifetime of legacy systems.
Editor's note: The following is a transcript of the second of three excerpts of Loshin's webcast presentation on enterprise accessibility to big data via data virtualization. It has been edited for clarity and length.
Data virtualization tools typically combine two different parts. One is called a data virtualization presentation layer, and then there's federation, which provides federated access to multiple data sources in a transparent way.
Data virtualization and federation [are] both processes that are enabled through software tools to create a virtual semantic presentation of data. They enable applications to transparently query data that is distributed across multiple storage platforms. If you've got data sitting in an Oracle database, and you've got another data set that's sitting in a DB2 database, a data federation mechanism will allow you to run a query against a virtual layer, or a semantic layer, that looks like a single data model. The tool itself then takes that query and breaks it down into the query part that goes against the Oracle database and the query part that goes against the DB2 database. Then the [tool joins or filters] that data when it comes back through the data federation and virtualization layer.
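As a rough illustration of that split-and-join idea, here is a minimal Python sketch. It assumes two in-memory SQLite databases standing in for the Oracle and DB2 sources; the table names, column names, sample rows and the `customer_totals` function are all hypothetical, and a real virtualization tool would plan and push down queries far more generally.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database that stands in for one source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Stand-in for the Oracle database (hypothetical schema and data).
oracle_like = make_source(
    "CREATE TABLE customers (cust_id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Stand-in for the DB2 database (hypothetical schema and data).
db2_like = make_source(
    "CREATE TABLE orders (cust_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)


def customer_totals(min_total):
    """Answer one 'virtual' query by splitting it into per-source queries."""
    # Part 1: the piece of the request that only the first source can answer.
    names = dict(oracle_like.execute("SELECT cust_id, name FROM customers"))
    # Part 2: the piece that only the second source can answer.
    totals = db2_like.execute(
        "SELECT cust_id, SUM(amount) FROM orders GROUP BY cust_id"
    ).fetchall()
    # Join and filter in the federation layer once both results are back.
    return sorted(
        (names[cust_id], total)
        for cust_id, total in totals
        if cust_id in names and total >= min_total
    )
```

A consumer would call something like `customer_totals(100.0)` without knowing that the names and the order amounts live in different databases.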
By doing this, it masks the details of how the data is formatted in its original sources, or where it's managed. It becomes irrelevant to the consumer because the data virtualization layer provides that virtual semantic layer, that single model, that the tool is then able to translate into requests that go off to the federated data sources.
So, by doing this, it simplifies the ability to blend data from multiple sources via a consumer-oriented set of data materialization rules, which means that it's up to the consumer to define what the data is actually going to look like when it comes back. The data virtualization and federation layer and its tools just facilitate translating queries that go against the virtual semantic layer into the different parts that need to be sent out to the various original sources.
Here's a little map of how it works. No. 1 here is your consuming systems, whether you've got people sitting at their desktop systems running queries interactively, or whether you've got automated systems running on servers, or even just reports. They are actually invoking a set of data services, at No. 2, that query a standardized, canonical, relational view that's at [No. 3 on the illustration].
Now, the tools that do the federation and virtualization -- the virtualization part of it -- transform the request against that canonical, relational view into a series of queries that are going to go to No. 5 [on the illustration] to access the data and run the queries. Sometimes they're queries, and sometimes they're going to turn into applications that scan through data sitting in flat files, and perhaps even systems that walk through unstructured data. That's where some of the real core value comes out.
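To make the flat-file case concrete, the sketch below shows one way a simple equality filter against the canonical view could be turned into a scan of a CSV file rather than a database query. It is only an assumption about how such a translation might look; the file contents, the column names and the `scan_flat_file` helper are invented for the example.

```python
import csv
import io

# A flat file standing in for a non-relational source (contents are made up).
FLAT_FILE = io.StringIO("region,sales\neast,120\nwest,80\neast,30\n")


def scan_flat_file(handle, column, value):
    """Emulate `SELECT * ... WHERE column = value` by scanning a CSV file."""
    handle.seek(0)  # rewind so the file can be scanned more than once
    return [row for row in csv.DictReader(handle) if row[column] == value]


east_rows = scan_flat_file(FLAT_FILE, "region", "east")
```

Note that every value comes back as a string here; turning those strings into typed, canonical values is the kind of normalization the presentation layer is expected to apply afterward.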
It doesn't matter to the consumers what the data looks like, as long as somebody's put the effort into creating that canonical relational view and doing that mapping between the canonical view and how the data sits in the original sources. The data request goes out at No. 5 [in the illustration] and goes to the different federated sources at No. 6, which return the requested data to the federation layer, the virtualization layer, where, at No. 7, it is normalized for the defined presentation. At [No. 8] it goes back through that canonical, relational model, and through the data services [it] is returned to the invoking consumer.
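The mapping and normalization steps might look something like the following sketch. The source names, field names and date formats are hypothetical; the point is only that each source keeps its own layout while the layer reshapes every returned row into one canonical presentation.

```python
from datetime import datetime

# Hypothetical mappings from canonical field names to the field names and
# date formats that each source actually uses.
SOURCE_MAPPINGS = {
    "mainframe": {
        "fields": {"customer_id": "CUSTNO", "signup_date": "SGNDT"},
        "date_format": "%Y%m%d",
    },
    "crm": {
        "fields": {"customer_id": "id", "signup_date": "created"},
        "date_format": "%m/%d/%Y",
    },
}


def normalize(source_name, raw_row):
    """Reshape one row from a source into the canonical presentation."""
    mapping = SOURCE_MAPPINGS[source_name]
    fields = mapping["fields"]
    signup = datetime.strptime(
        raw_row[fields["signup_date"]], mapping["date_format"]
    )
    return {
        "customer_id": int(raw_row[fields["customer_id"]]),
        "signup_date": signup.date().isoformat(),
    }


def fetch_canonical(raw_results):
    """Normalize the rows returned by every source into one canonical list."""
    return [
        normalize(source_name, row)
        for source_name, rows in raw_results.items()
        for row in rows
    ]
```

Given `{"mainframe": [{"CUSTNO": "0042", "SGNDT": "20150301"}]}`, the consumer would see a row with an integer `customer_id` and an ISO-formatted `signup_date`, regardless of how the mainframe stored them.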
The consumer here says, "Here's what I want to request from what I believe is the way the data's represented in that canonical view," and it's all under the hood where the virtualization tool transforms that into the right set of queries to go against the original sources, and then applies whatever normalization and integration is needed when the data comes back from those original sources.
What are some of the value propositions for doing data virtualization? A lot of it goes to reducing complexity and being able to provide accessibility to enterprise-wide data. It doesn't matter where the data is sitting at this point. It could be sitting in your mainframe that's 30 years old, it could be sitting in your servers that you've been installing and keeping up to date and maintaining for the last five years. It could be going into your data lake. It could be going into a Hadoop system in HDFS. It could be going to one of the sharded NoSQL databases as well, which provide distribution of data and parallel access.
By reducing that level of complexity, you end up with a lot of benefits, for example, reducing the development costs. When you have a single canonical representation of data that's sitting in multiple sources, instead of writing your applications and having to target them to each of the original sources, you can rely on the virtualization layer to take care of doing that mapping, and all you're responsible for doing is developing to a single interface, which is that canonical layer.
By doing so, it allows you to develop your applications, your reports, your queries, and provide configurations for self-service much more quickly, which then decreases time to value for analysis. It makes it easier and faster to get access to the data for doing analysis. It expands data accessibility because it provides the capability of accessing data that's sitting on different sources without having to program to each interface, and reduces the complexity.
One of the other aspects, that we talked about earlier, was the reduction or elimination of data replication. By having all the data sitting in its original source, we're no longer forced to make a copy of it to use it, but rather, accessing it through the virtualization layer allows us to leave the data in its landing area, its original resting spot. By doing that, we no longer force users to have to run extracts and copy the data and have multiple inconsistent copies sitting around.
It also reduces the need for ETL processing, because what we're doing is we're implementing some of our transformations and normalizations in the presentation layer as it comes back from the sources. By doing that, we don't have to have separate staging areas for ETL, for extraction, transformation and loading.
We don't need to have the expense of the individuals supporting those ETL processes and the platforms on which they run. It allows you to make use of your end-user interaction tools. If you're using a BI tool, your BI tool is just as easily configured to go against the canonical layer as against the original sources. And it expands the ability of getting access to multiple sources, so you can do a broader set of analyses and reports because you're pulling data from multiple sources.
By doing this, it allows you to have a little bit of leeway in determining when you're going to upgrade your infrastructure because it extends the lifetime of systems that perhaps might have been difficult to get access to. Let's say you've got a diminishing pool of COBOL programmers who can provide access to data that's sitting on a mainframe system. If you've got a data virtualization tool that can automatically transform a SQL query into COBOL code that can access the data, it allows you to retain use of that mainframe until you've decided how you want to upgrade your infrastructure.
A side effect ... is that you also don't have the cost of redesigning your interfaces to get access to the data once you've selected a modified infrastructure. So if you want to migrate all of your data off of a mainframe and move the data into a Hadoop data lake, that is irrelevant to the consumers because they're accessing their data through the data virtualization layer. All you need to do is ... reconfigure the accessibility component through federation on the system side, and it's completely opaque to the consumers. They just see the access.
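A small sketch can show why such a migration need not touch consumer code. Everything here is hypothetical, including the `VirtualLayer` class, the `accounts` view and the two reader functions; it simply assumes the layer keeps a reconfigurable mapping from a canonical view name to whatever backend currently holds the data.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class VirtualLayer:
    """Routes canonical view names to whichever backend holds the data."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[], List[Row]]] = {}

    def configure(self, view: str, fetch: Callable[[], List[Row]]) -> None:
        # System-side reconfiguration: point a view at a different backend.
        self._backends[view] = fetch

    def query(self, view: str) -> List[Row]:
        # Consumers only ever name the canonical view.
        return self._backends[view]()


def read_from_mainframe() -> List[Row]:
    # Placeholder for access to the legacy system (data is made up).
    return [{"account": 1, "balance": 250}]


def read_from_data_lake() -> List[Row]:
    # Placeholder for access to the migrated copy of the same data.
    return [{"account": 1, "balance": 250}]


def consumer_report(layer: VirtualLayer) -> int:
    """Consumer code: unchanged before and after the migration."""
    return sum(row["balance"] for row in layer.query("accounts"))


layer = VirtualLayer()
layer.configure("accounts", read_from_mainframe)
before = consumer_report(layer)

layer.configure("accounts", read_from_data_lake)  # the migration step
after = consumer_report(layer)
```

Only the `configure` call changes when the data moves; `consumer_report` is written once against the canonical view.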
Overall, by allowing you to have a transition mechanism to get from your existing heritage infrastructure to newer infrastructure ... you can retire some of these older systems, thereby reducing the operations costs and the maintenance costs, especially for large-scale, legacy environments.