With big data analytics, a company can find new sources of revenue from its existing data stores or figure out ways to save money by becoming more efficient. The path to those scenarios isn't the easiest, though. IT shops usually have multiple, disparate sources of information, so gaining access to the data from those sources can be difficult. Data virtualization systems address this issue.
In this webcast presentation, David Loshin, president of consultancy Knowledge Integrity Inc., drills down into the three main questions that need to be considered around how to add a data virtualization layer to your IT systems.
Editor's note: The following is a transcript of the third of three excerpts of Loshin's webcast presentation on enterprise accessibility to big data via data virtualization. It has been edited for clarity and length.
When you want to introduce data virtualization and data federation into your organization, there unfortunately are often objections to bringing this kind of technology in-house. The objections typically focus on three aspects of the implementation. The first is the simplicity of the implementation: How easy is it to implement? The second is how easy is it to get to the data? What does the data look like when you're getting access to it? [This question is important] because you're providing the capability of accessing multiple sources now, and you're providing, essentially, virtual data sets that didn't exist in their original form, because you're providing the federation. So, how flexible is it for you to present the data back to one or more consumers? Then, of course, there's the question of performance: When I introduce another layer between my consumers and the actual sources of data, isn't that going to create yet another layer of complexity when it comes to computational performance? Is it going to perform well? Is it going to slow down my access?
Those are the three questions, and we can look at each one of these in turn.
Simplicity of implementation
No. 1 is the question about simplicity and complexity. In our conventional data warehousing and business intelligence model, we have data flowing from the original sources into a staging area, which is then loaded into a data warehouse, which is then extracted and loaded into some kind of data mart or analytical processing engine, which is then exposed to the business analysts.
So, we've got multiple stages and multiple hops trying to get the data from the original source into a format that is reasonably accessible for that business analyst. But virtualization simplifies that data access because ... that uniform virtual model allows you to just have the access go directly from the original sources into that analytical processing component.
So we, in some cases, might even be able to eliminate the need for these interim analysis and data warehousing components. Now, I'm not suggesting that you're going to rip out your data warehouses because of data virtualization. However, what I am suggesting is that there's a future path, or future vision, where, if we focus on the concepts of data accessibility and consumer needs, there are some situations where a data warehouse, which is a single model representing all the data that was pulled, with decisions made about what survived and what didn't, creates a constraint on the analyst's ability to make use of the data in its original form. Data virtualization creates an environment that allows you to get to the data in its original form.
This is the key piece of the battle between schema-on-write and schema-on-read. What we can propose is that data virtualization gives you this capability of doing transformation on read. Hadoop and the NoSQL varieties ... don't impose any kind of structure constraints for how the data is captured and stored. In the schema-on-write approach, there is a data model for your data set, and the data that you're getting is going to be manipulated so that it fits into that data model.
The data lake supports the schema-on-read approach, where the data consumers are free to interpret the format and semantics of the data when they access it. Sometimes one user wants to look at the data in its raw, original format, while another user wants to look at it transformed and normalized. You've got those different alternatives. Well, that's up to the users, not up to somebody in the IT department or data management department who decided that [it needs to be in a particular format].
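The schema-on-read idea can be sketched in a few lines of Python. This is a hypothetical illustration, not any particular tool's API: the same raw records are interpreted two different ways at read time, one consumer seeing them raw and another seeing them cleansed.

```python
import json

# Hypothetical raw records as they might land in a data lake: no
# schema was imposed when they were written (schema-on-write would
# have forced them into one model up front).
raw_records = [
    '{"name": "Pat Smith", "phone": "555-0100"}',
    '{"name": "pat smith", "phone": "(555) 0100"}',
]

# Consumer 1 (schema-on-read, raw view): just parse, no cleansing.
raw_view = [json.loads(r) for r in raw_records]

# Consumer 2 (schema-on-read, normalized view): the same bytes,
# interpreted with a cleansing transformation applied at read time.
def normalize(record):
    rec = json.loads(record)
    return {
        "name": rec["name"].title(),
        "phone": "".join(ch for ch in rec["phone"] if ch.isdigit()),
    }

normalized_view = [normalize(r) for r in raw_records]

print(raw_view[1])         # {'name': 'pat smith', 'phone': '(555) 0100'}
print(normalized_view[1])  # {'name': 'Pat Smith', 'phone': '5550100'}
```

The point is that neither view is privileged: both are produced on demand from the same stored records, so the decision about format stays with the consumer.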
Access to data
If you've got the data lake, and it's supporting schema-on-read, then you want to have a mechanism that allows you to provide that flexibility of access to the consumers, especially when you've got data sitting in original sources in their heritage systems as well. Virtualization supplements the management of these different renderings for schema-on-read, because you can implement your transformations as part of the virtualization layer.
So, if I'm in the sales department, and I want to look at customers and prospects and their contact information, and I want to normalize that, or cleanse it using some kind of cleansing tool, I can make that happen inside the virtualization layer, instead of doing it inside a data warehouse. Meanwhile, the fraud detection group can look at the transactions by individual, and look for scenarios where there are variations in the data being provided in those transactions that would be indicative of some kind of fraud.
Because I've got different consumers with different requirements and different expectations, we can defer the decisions about transformations until we get to the point of where we're accessing the data, not where we're storing it. Virtualization provides that capability. It also allows the same semantic model to be able to blend data from other sources so that if I want to look at data that's coming in from streaming data sets, like the social media channels, and then stream that into a representative semantic model that is aligned with the data that's sitting in my customer profiles in the data warehouse, I can use a virtualization engine to mask out the differences between those different data sources so that it becomes easier for the application developer and the analyst to make use of that data.
How do we do this in a way that doesn't slow everything down? I think one of the biggest challenges, when it comes to data virtualization and data federation, is making sure that you don't impose a significant delay in getting access to your data because you put this data virtualization layer in the middle. If you think about it, there actually are a lot of opportunities for doing this pretty poorly, because when you are creating this standardized, canonical, relational view, and you're transforming it, you have queries that span all of the data. For example, multi-table joins, where the tables are situated in different federated systems. A naive approach would say, "Well, pull all the data from Table A, all the data from Table B and all the data from Table C, and now we're going to run the join inside the software layer of the virtualization tool." Well, there's typically not going to be the volume of storage available to hold all the data from Tables A, B and C at one time, or the processing speed to scan through each one of these tables simultaneously, in a way that's faster than perhaps a data set sitting on a mainframe, or a table sitting on a data warehouse appliance.
The smart designers of these tools provide different mechanisms to improve the performance of running what essentially become distributed queries. Part of that is caching data to reduce or eliminate the data access latency, the time it takes to get the data from the sources into the product. I think some of the more interesting ones are where you're optimizing the access to pipeline the data, so that you're overlapping the partial computation of the queries by the data virtualization tool with the time it takes to stream the data from the sources through the federation. If you do it in what's called a pipelined manner, then you're accessing the data for the next set of computations while you're executing the current set of computations.
By doing that, you are masking out the data latency times. This works really well when you've got multiple layers of cache. If you're accessing data from systems that have in-memory or columnar storage, and you're streaming it through your own system's cache, this can actually speed things up significantly. But, the other optimizations are where you're really taking advantage of knowledge of the systems where the data is sitting in the first place.
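Here's a minimal sketch of that pipelined overlap in Python, with simulated fetch and compute delays standing in for real source latency and query work; the chunk sizes, timings and function names are illustrative only.

```python
import queue
import threading
import time

# A producer thread fetches the next chunk from a (simulated) remote
# source while the consumer is still computing on the current chunk,
# so fetch latency overlaps with computation instead of adding to it.
def fetch_chunk(i):
    time.sleep(0.05)           # simulated network/data-access latency
    return list(range(i * 4, (i + 1) * 4))

def producer(q, num_chunks):
    for i in range(num_chunks):
        q.put(fetch_chunk(i))  # prefetch the next chunk ahead of use
    q.put(None)                # sentinel: no more data

q = queue.Queue(maxsize=2)     # small buffer acts as the pipeline's cache
threading.Thread(target=producer, args=(q, 5), daemon=True).start()

total = 0
while (chunk := q.get()) is not None:
    time.sleep(0.05)           # simulated computation on the chunk
    total += sum(chunk)        # partial computation per chunk

print(total)  # sum of 0..19 = 190
```

With the two delays overlapped, the total elapsed time approaches the larger of the fetch and compute costs rather than their sum, which is exactly the latency-masking effect described above.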
One example is push-down computation to the source platform. If you have a query that you know is scanning a lot of the data, you don't need to pull all the data from the source and then do the scan within the data virtualization layer. What you do is push the scan down to the owner of each piece of that data. Let each system perform the scan and return only its results, and then combine them at the virtualization layer.
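A small sketch of the push-down idea, using two in-memory SQLite databases to stand in for independent federated sources; the table and predicate are hypothetical.

```python
import sqlite3

# Two hypothetical federated sources, each a separate SQLite
# database standing in for an independent system.
east = sqlite3.connect(":memory:")
west = sqlite3.connect(":memory:")
for db, rows in [(east, [("A1", 500), ("A2", 20)]),
                 (west, [("B1", 300), ("B2", 700)])]:
    db.execute("CREATE TABLE orders (id TEXT, amount INT)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Naive approach: pull every row from every source, then filter in
# the virtualization layer.
# Push-down approach: send the scan predicate to each source, so
# each system does its own scan and returns only matching rows.
predicate = "SELECT id, amount FROM orders WHERE amount > 100"
results = []
for db in (east, west):                    # push the scan down
    results.extend(db.execute(predicate))  # only results come back

print(sorted(results))  # [('A1', 500), ('B1', 300), ('B2', 700)]
```

Only three small result rows cross the "network" here, instead of all four source rows; with realistic table sizes the savings dominate.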
If you know anything about things like Hadoop, you might start to recognize some similarity between the way these systems work and the way MapReduce operates on partial computations over pieces of distributed data sets. It's reminiscent of that same type of exploitation of parallelism and distribution.
Another one is distributing semi-joins across the underlying systems. A distributed semi-join is where you take a join query and break it into pieces in a smart manner: instead of pulling all the data from each of the tables that are meant to be joined, you pull only the data elements that are used in the conditions for determining which records are selected into the final result set. That way, you can break a join down into subjoins that are performed at each of the original sources, move some of the interim result sets around those different sites, and significantly reduce the amount of data that needs to be scanned to satisfy that query.
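The semi-join trick can be sketched with plain Python data structures; the "sites" here are just variables standing in for remote systems, and the data is made up.

```python
# Site 1: customers flagged for review (a small table)
site1_customers = {("C1", "Ana"), ("C3", "Raj")}

# Site 2: a large transactions table we do NOT want to ship in full
site2_transactions = [
    ("T1", "C1", 120), ("T2", "C2", 75),
    ("T3", "C3", 40),  ("T4", "C2", 980),
]

# Step 1: ship only the join-key values (much smaller than full rows)
keys = {cid for cid, _ in site1_customers}

# Step 2: site 2 filters locally and returns only matching rows
matching = [t for t in site2_transactions if t[1] in keys]

# Step 3: the virtualization layer joins the reduced result sets
joined = [(tid, cid, amt, name)
          for tid, cid, amt in matching
          for c, name in site1_customers if c == cid]

print(sorted(joined))  # [('T1', 'C1', 120, 'Ana'), ('T3', 'C3', 40, 'Raj')]
```

The full transactions table never leaves site 2; only the key set travels one way and the two matching rows travel back, which is the data reduction the technique is after.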
Then, the other approach is optimizing generated code for target systems. If you've got data in a source that is, say, just a flat file, and is not sitting in a relational database, well, you can't run a SQL query against it. But what you could do is have really fast, purpose-written processes that scan through flat data in a way that produces rapid results. Likewise, this becomes much more relevant in the whole data lake concept: if the data is sitting in something like Hadoop or HDFS, you can automatically generate Spark code that rapidly runs through the data in parallel, or generate MapReduce code from queries that can be pushed right back down into the Hadoop environment to pull the data out relatively quickly. What you're doing is taking advantage of the knowledge of what the underlying systems look like.
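A toy illustration of target-aware code generation in Python (no real virtualization product works exactly this way): one logical filter is rendered as SQL for a relational source and as a generated scan function for a flat file. All names here are illustrative.

```python
import csv
import io
import sqlite3

# One logical filter ("amount > threshold") compiled differently
# per source type.
def compile_filter(source_type, column, threshold):
    if source_type == "sql":
        # Generate a SQL query to push to the database engine
        return f"SELECT * FROM orders WHERE {column} > {threshold}"
    if source_type == "flat_file":
        # Generate a fast in-process scan for non-SQL data
        def scan(text):
            rows = csv.DictReader(io.StringIO(text))
            return [r for r in rows if int(r[column]) > threshold]
        return scan
    raise ValueError(source_type)

# Relational source: the generated SQL runs inside the engine
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT, amount INT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("A1", 500), ("A2", 20)])
sql = compile_filter("sql", "amount", 100)
print(list(db.execute(sql)))  # [('A1', 500)]

# Flat-file source: the generated scanner runs over the raw text
flat = "id,amount\nB1,300\nB2,50\n"
scan = compile_filter("flat_file", "amount", 100)
print(scan(flat))             # [{'id': 'B1', 'amount': '300'}]
```

The same pattern scales up to generating Spark or MapReduce jobs: the virtualization layer knows what each target system looks like and emits whatever executes fastest there.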
By doing this, you get that capability of improving the performance. In some cases, the performance is faster than if you were running queries against the original sources, because of the way that the data access, the computation and the data exchange are dovetailed to reduce the data latency. These give us the answers to those three general questions.
In summary, data virtualization and federation are essentially bridging technologies that support the modernization path from heritage systems to hybrid environments encompassing big data components and platforms, like Hadoop and NoSQL, and the different types of implementations of a data lake. What I mean by a bridging technology is that by masking out and providing a facade on top of access to data sitting in an older technology, it allows you to migrate your consuming applications to make use of that virtual layer, freeing you up from binding those accesses to the original sources.
By doing that, it allows you to eventually migrate the underlying mapping from the virtual layer to a modernized system, or a renovated system, or to a new system, where you've migrated all the data from the old mainframe to a new Hadoop data lake. ... [This] allows you to lower the development and operating costs in general because now you're developing to that single canonical layer, instead of trying to get access to all these different distributed data sets. Also, it elongates the lifetime of those existing heritage platforms.
Overall, it simplifies environmental data accessibility without imposing significant performance degradation. These performance optimizations allow you to use virtualization and federation to get access to all the different data that's sitting in that spaghetti and meatballs configuration of your network that's evolved organically over the last 30 years, and allow you to start figuring out: Where are the hotter, more frequent accesses? Where are the cooler and less frequent accesses? Where do I need to improve the network performance? Where can I start consolidating data that's sitting in different sources? And how can I simplify the accessibility, ultimately, to satisfy my consumers' needs?