

Enterprise data architecture strategy and the big data lake

Today's enterprise data architecture strategy has to address how to align existing data systems with growing information needs, capabilities and data sources.

Modern CIOs face two challenges in unifying the increasingly disparate aspects of the enterprise data architecture. The first is aligning the existing data systems supporting operational applications with the information needs of a growing community of analysts and data scientists. The second is managing the adoption of what appears to be a continuous stream of innovative data management capabilities (such as Hadoop or NoSQL) to be integrated within the enterprise. It is the job of the CIO to orchestrate this data integration and expand data accessibility while reducing overall systemic complexity.

However, most organizations have data architectures that have evolved organically over time, typically with little guidance from a predefined enterprise data architecture strategy. As a result, these organizations struggle with the complexity of enabling consistent access to enterprise data assets. And as the rate of data management innovation accelerates, new technologies such as Hadoop, NoSQL and graph databases are being considered and introduced, adding yet more complexity to data consumers' perception of the data landscape.

The data lake landscape

A particular example is the emergence of the concept of the data lake, which according to TechTarget is "a large object-based storage repository that holds data in its native format until it is needed." A data lake is basically a storage platform that enables the organization to collect a variety of data sets, store them in their original format, and make those data sets available to different data consumers, allowing them to utilize the data in ways that are specific to their business purposes. One benefit of a data lake is to provide a single repository for shared data, thereby reducing data replication, which can lead to inconsistency and increased costs.
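The "store in native format" idea can be made concrete with a minimal sketch. The layout below is purely illustrative (the `land_in_lake` function, the `/tmp/lake` root, and the source/date partitioning are assumptions, not any vendor's API): incoming files are written byte-for-byte into a single shared repository, with no parsing or conversion at ingestion time.

```python
# Minimal sketch of landing data in a lake in its native format.
# The function name, root path and partitioning scheme are illustrative.
from datetime import date
from pathlib import Path

def land_in_lake(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write raw bytes exactly as received, partitioned by source and day."""
    target = lake_root / source / date.today().isoformat() / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema, no conversion
    return target

lake = Path("/tmp/lake")
# Heterogeneous sources land side by side in one shared repository.
p1 = land_in_lake(lake, "crm", "accounts.csv", b"id,name\n1,Acme\n")
p2 = land_in_lake(lake, "clickstream", "events.json", b'{"user": 1}')
```

Because every consumer reads from the same landed copy, there is no need to replicate the data set once per downstream application.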

The data lake takes a fundamentally different approach to data storage than the conventional data acquisition and ingestion method. The traditional method forces the data to conform to a predefined data model, creating a uniform data asset shared by all data consumers. By normalizing the data into a single defined format, this approach, called schema-on-write, can limit the ways the data can be analyzed downstream. The approach typically applied to data stored in a data lake is called schema-on-read: there are no predefined constraints on how the data is stored; instead, it is the consumer's responsibility to apply the rules for rendering the accessed data in a way suited to each user's needs.
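The contrast between the two approaches can be sketched in a few lines. Everything here is illustrative (the fixed column list, the record fields and the consumer functions are assumptions): schema-on-write discards anything outside the predefined model at ingestion time, while schema-on-read stores the raw record and lets each consumer apply its own projection later.

```python
import json

# Schema-on-write: incoming records are forced into a fixed shape at
# ingestion time; fields outside the model are silently dropped.
FIXED_COLUMNS = ("customer_id", "order_total")

def ingest_schema_on_write(record: dict) -> tuple:
    # Anything not in FIXED_COLUMNS is lost here, limiting later analysis.
    return tuple(record.get(col) for col in FIXED_COLUMNS)

# Schema-on-read: the lake keeps the raw record untouched; each consumer
# applies its own rendering rules when it reads the data.
def ingest_schema_on_read(record: dict) -> str:
    return json.dumps(record)  # native format, nothing discarded

def read_for_marketing(raw: str) -> dict:
    rec = json.loads(raw)
    return {"customer": rec["customer_id"], "channel": rec.get("channel")}

def read_for_finance(raw: str) -> dict:
    rec = json.loads(raw)
    return {"customer": rec["customer_id"], "total": rec["order_total"]}

record = {"customer_id": 42, "order_total": 99.5, "channel": "web"}
raw = ingest_schema_on_read(record)
print(ingest_schema_on_write(record))  # (42, 99.5) -- "channel" is gone
print(read_for_marketing(raw))         # "channel" survives for this consumer
```

The marketing and finance readers here stand in for the downstream "renderings" the article describes: the same stored bytes, shaped differently for each business purpose.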

The data lake introduces some challenges, especially for those downstream data consumers who are used to having their own copies of data sets for reporting and analysis. First, there must be a way to easily access data in a data lake, and second, there must be a way to configure the data that is being accessed so that its representation resembles the models to which the users are accustomed.

Using virtualization tools for data architecture strategies

Both of these challenges to a new enterprise data architecture strategy can be addressed using data virtualization tools. Data virtualization and federation tools provide a layer of abstraction between a set of data sources and the different data consumers.

The data-facing aspect of these tools is referred to as data federation. This technology provides methods for accessing a wide variety of data source types, including most relational database systems, prior-generation storage systems (such as flat files, VSAM files and other mainframe approaches) and emerging data technologies such as Hadoop and NoSQL. Data federation enables applications to transparently query data distributed across multiple storage platforms while masking the details of source location and data formatting.

The consumer-facing aspect is the part usually referred to as data virtualization. It allows the data practitioner to define logical semantic data models that are then mapped to the models of each of the federated data sources. This semantic model provides the layer of abstraction that simplifies accessibility for data consumers. User queries against the semantic model are transformed into a collection of queries customized to each of the federated sources. When the result sets of those queries are returned to the data virtualization tool, the interim results are collected, collated and configured into a final result set that is returned to the user. In effect, data virtualization simplifies the ability to blend data from multiple sources via consumer-oriented data materialization rules.
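The fan-out-and-collate pattern described above can be sketched as follows. This is a toy model, not a real virtualization product: the source names, field mappings and query shape are all assumptions made for illustration. A query against the semantic model is rewritten per source using a mapping table, and the partial results are merged on a shared key into one final result set.

```python
# Two "federated sources" with different native layouts (illustrative data).
warehouse_rows = [{"cust_id": 1, "total_usd": 120.0}]
lake_docs = [{"customerId": 1, "clicks": 37}]

# Mapping from semantic-model attributes to each source's native fields.
MAPPINGS = {
    "warehouse": {"customer": "cust_id", "revenue": "total_usd"},
    "lake": {"customer": "customerId", "engagement": "clicks"},
}

def query_source(source_name, rows, requested):
    """Rewrite the semantic query for one source; return partial results."""
    mapping = MAPPINGS[source_name]
    return [
        {attr: row[col] for attr, col in mapping.items()
         if attr in requested and col in row}
        for row in rows
    ]

def virtual_query(requested):
    """Fan out to each federated source, then collate rows on 'customer'."""
    merged = {}
    for name, rows in (("warehouse", warehouse_rows), ("lake", lake_docs)):
        for partial in query_source(name, rows, requested):
            merged.setdefault(partial["customer"], {}).update(partial)
    return list(merged.values())

result = virtual_query({"customer", "revenue", "engagement"})
print(result)  # [{'customer': 1, 'revenue': 120.0, 'engagement': 37}]
```

The consuming application only ever sees the semantic attributes (`customer`, `revenue`, `engagement`); where each field physically lives is hidden behind the mapping layer, which is the essence of the abstraction the article describes.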

Data virtualization's use of defined semantic models to represent a converged view of original sources addresses both of the issues with accessing data in a data lake. Federating access to data in a data lake eliminates the need for users to rewrite their applications to include code to read the data from the data lake, reducing the need for data replication. Existing applications can target the semantic model, making the source of the data transparent to the consuming application. At the same time, data virtualization hides the complexity of schema-on-read by allowing each user to apply specific data normalization and transformation rules to the data to produce the "renderings" that are suited for each application use.

Data virtualization and federation are bridging technologies that support an enterprise data architecture strategy encompassing big data. These tools lower development and operating costs by enabling the use of the (lower-cost) data lake and reducing storage needs for replicated data sets. They also provide seamless access to most platforms, extending the lifespan of legacy platforms as new technologies are incrementally adopted. Data virtualization tools are optimized to take advantage of internal software caching, query optimization, pipelined data streaming and compressed storage, simplifying data accessibility across the environment without imposing significant performance degradation. These tools also pave the way for introducing innovative ways to extract and analyze information from a range of fast-growing emerging data sources.

Next Steps

More on the big data lake:

Will the murky depths of data lakes prove bountiful?

In search of big data use cases for data lakes

Don't dive into a Hadoop data lake without a plan
