A combination of factors has motivated organizations to embrace "big data" as part of the enterprise's comprehensive modernization plan: the availability of low-cost alternatives for high-volume data acquisition and storage, the ability of virtualized environments to support scalable processing, and the growing appetite for advanced analytics. One interesting side effect of the increasing adoption and integration of big data platforms is the need for CIOs to challenge their enterprise architects to rethink how corporate data strategies can accommodate the management and usability of the expanding array of digital assets.
All operational and analytical processes are fueled by data, and business analysts and data scientists alike are increasingly anxious to get direct access to the different data artifacts at the company's disposal. But the hankering for more -- and bigger -- data creates challenges for the Information Management function's ability to effectively manage that data. This leads to the creation of data lakes that act as the repository for digital assets, and it is critical to ensure that these data lakes don't degenerate into data dumping grounds.
As siloed business application systems are augmented, if not outright replaced, by newer technologies such as on-premises Hadoop clusters or cloud-based storage and compute resources, your system architects and application designers need to address questions pertaining to three important characteristics of enterprise data to facilitate the increased use, and reuse, of data assets:
Availability: Do the system designers know what data assets are available?
Awareness: If data assets are available, are the designers able to ascertain the details of their structure, and are they confident that they understand the semantics of the information describing each data asset?
Consistency: Are the designers able to ensure that their anticipated use of an existing data asset won't conflict with other uses of that data asset?
Attempting to create a new data environment for each new application is not desirable, since it inevitably leads to redundant data set creation, duplicative effort and increased costs for operations and management. Alternatively, one can transform or enhance existing data sets to make them suitable for the new system's purposes.
This approach does not constrain the ways that existing data assets are used, since it allows for the data to be adapted to meet the new system's needs. Instead, it recognizes the need for an inventory of data assets that is cataloged in a way that allows system designers to find and easily adapt these assets to their own purposes. This cataloging is part of a more general set of processes called data asset curation.
What is data asset curation?
"Curation" is the process of assembling, organizing and managing a collection of objects. By extension, data asset curation is the process of assembling, organizing and managing a collection of data assets. We should adjust this definition, however, to account for context and purpose within a community of data consumers, since data assets that are not shared (or positioned to be shared) would not be subjected to data curation.
Therefore, a better definition (or description) of data asset curation would be the process of assembling, organizing and managing a collection of data assets for the purpose of expanding data accessibility and sharing among a community of data consumers.
The objectives of data asset curation include:
- Simplify discoverability of existing data assets.
- Provide details of data asset structure.
- Provide details of data asset semantics.
- Capture the provenance of any enhancements or modifications applied to a curated data asset.
- Provide a means of sharing information about the curated data assets.
More generally, data asset curation encompasses the metrics, policies and procedures that ensure digital consumers -- both human and automated systems -- are able to share the curated data and split responsibilities for stewardship and oversight.
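To make these objectives concrete, the following is a minimal sketch of what a catalog record for a curated data asset might hold. It is not taken from any particular catalog product: the class names `CatalogEntry` and `Catalog`, every field name, and the sample asset `orders_2023` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """One curated data asset (all field names are illustrative assumptions)."""
    name: str
    description: str                     # semantics: what the asset means
    schema: dict[str, str]               # structure: column name -> type
    business_terms: set[str] = field(default_factory=set)   # aids discovery
    provenance: list[str] = field(default_factory=list)     # log of changes

    def record_change(self, note: str) -> None:
        """Append a timestamped note describing a modification."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.provenance.append(f"{stamp} {note}")


class Catalog:
    """In-memory registry for sharing information about curated assets."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add an asset, refusing duplicates so names stay unambiguous."""
        if entry.name in self._entries:
            raise ValueError(f"asset already cataloged: {entry.name}")
        self._entries[entry.name] = entry

    def find_by_term(self, term: str) -> list[str]:
        """Return the names of assets tagged with a business term."""
        wanted = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if wanted in {t.lower() for t in e.business_terms}
        )


# Hypothetical usage: catalog one asset, then log an enhancement to it.
catalog = Catalog()
orders = CatalogEntry(
    name="orders_2023",
    description="One row per customer order placed in 2023",
    schema={"order_id": "int", "amount": "decimal"},
    business_terms={"Sales", "Revenue"},
)
catalog.register(orders)
orders.record_change("masked customer_email column")
```

Each field maps to one objective: `business_terms` supports discoverability, `schema` and `description` carry structure and semantics, and `provenance` keeps the history of modifications. A production catalog would persist this information and sit behind access controls; the in-memory dictionary here only keeps the sketch short.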
Data catalog fundamentals: The dimensions of usability
The policies and processes for data curation are managed using a data catalog in which a cache of attributes about each data asset provides critical information about data usability. We may be accustomed to capturing data quality expectations about data sets, and data quality is important for data usability, but the primary concerns for curated data assets are actually associated with the use and oversight of shared data sets. Some examples of data usability dimensions include, but are not limited to:
Discoverability: Are data consumers and system designers able to get information about the data assets within the environment? Are the data assets properly classified in a consistent manner?
Searchability: Is the information about the data asset searchable? Can one find those data assets using references to business terms, phrases and concepts?
Accessibility: What are the different methods for accessing the data asset within the data environment?
Protection: With increased enterprise visibility intended to foster data reuse, protected data will be potentially vulnerable to exposure unless proper controls are in place. What mechanisms are in place for asset protection, including role-based access control, encryption and data masking?
Versioning: The curated data environment will likely maintain versions of data sets. What mechanisms are in place for retrieval of modified (or even deleted) objects, and are there allowances for rolling back to previous versions?
Provenance and lineage: When data consumers apply transformations and enhancements to existing data assets, new data assets will be created. The catalog should document the flow of data sets from their acquisition through the environment, and track the sequences of actions applied to them as the derived data assets are created. The data catalog must provide a means for logging and managing data provenance and for tracking the lineage of downstream data use.
Data quality: What are the ways that quality expectations are captured for each data asset, and what are the methods for ensuring that these expectations are met in different usage scenarios?
Data currency: How up to date is a particular version of a data asset? How are data currency requirements managed and demonstrated?
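Two of these dimensions, lineage and data currency, lend themselves to a short illustration. The sketch below assumes a catalog that records, for each derived asset, the action that produced it and the parent assets it was built from, and that stores a last-refreshed timestamp for each asset. The names `LineageLog`, `upstream` and `is_current`, and the sample asset names, are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


class LineageLog:
    """Records which parent assets each derived asset was built from."""

    def __init__(self) -> None:
        # derived asset name -> (action description, parent asset names)
        self._derivations: dict[str, tuple[str, tuple[str, ...]]] = {}

    def record(self, derived: str, action: str, parents: list[str]) -> None:
        """Log that `derived` was created by applying `action` to `parents`."""
        self._derivations[derived] = (action, tuple(parents))

    def upstream(self, asset: str) -> list[str]:
        """Return every ancestor of `asset`, nearest first (breadth-first)."""
        seen: list[str] = []
        queue = list(self._derivations.get(asset, ("", ()))[1])
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.append(current)
            queue.extend(self._derivations.get(current, ("", ()))[1])
        return seen


def is_current(last_refreshed: datetime, max_age: timedelta,
               now: datetime | None = None) -> bool:
    """True if the asset was refreshed within its allowed maximum age."""
    now = now or datetime.now(timezone.utc)
    return now - last_refreshed <= max_age


# Hypothetical usage: a report derived from cleaned orders and customers.
log = LineageLog()
log.record("clean_orders", "deduplicate", ["raw_orders"])
log.record("sales_report", "join", ["clean_orders", "customers"])
ancestors = log.upstream("sales_report")
```

Walking the log backward from a derived asset answers the question of where its data came from, while a currency check of this kind lets a consumer compare an asset's age against a stated requirement before relying on it.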
Surveying the organizational expectations for data usability is the first step toward establishing the policies and procedures for data curation. In essence, curation embraces data stewardship and governance in a way that operationalizes the ability to expose corporate data assets and promote their reuse, while limiting the risks of reinterpretation due to semantics.
As increased data volumes and broader varieties of digital objects are ingested into the growing data lake, instituting data curation processes will help protect that data lake from "digital pollution" that diminishes the organization's ability to take advantage of its corporate information inventory.