How to define a Data Set, Data Asset and Data Product?

Posted on: 20-01-2023
Posted by: Liberbyte

Data Set:

- collection of structured and/or unstructured data

- structured: e.g. tabular data (with rows and columns), JSON, XML or other similar clearly structured formats

- the set has clear boundaries within a specified domain and/or when classified using an algorithm

- often focused on a single domain, e.g. a specific business, scientific or other domain

- unstructured: e.g. a collection of text documents, articles, books or HTML pages (the latter can be considered "semi-structured")

- various degrees of structure (e.g. HTML vs. a textbook); the set has undefined boundaries, which largely depend on the classification method / algorithm applied

- a body of text data can be referred to as a "corpus" once the domain(s) and boundaries of the set have been (more clearly) specified
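
The distinction above can be illustrated with a minimal Python sketch. All sample records, documents and the `build_corpus` helper are hypothetical, and a simple keyword filter stands in for whatever classification method would actually be used:

```python
import json

# Structured: tabular records with a fixed set of columns (hypothetical sample data).
rows = [
    {"sensor_id": "s1", "temperature_c": 21.5},
    {"sensor_id": "s2", "temperature_c": 19.0},
]

# Structured: the same kind of record serialised as JSON.
record = json.loads('{"sensor_id": "s3", "temperature_c": 22.1}')

# Unstructured: free text with no predefined fields (hypothetical documents).
documents = [
    "Sensor s1 was recalibrated after the storm.",
    "Quarterly report: sales rose in the northern region.",
    "Maintenance note: the sensor housing was replaced.",
]

def build_corpus(docs, keyword):
    """Keep only the documents that mention `keyword` (case-insensitive).

    A keyword filter stands in for the classification method here; a
    different method would draw the boundaries of the set differently.
    """
    return [d for d in docs if keyword.lower() in d.lower()]

# Every structured row exposes the same fields, so its boundaries are explicit.
columns = sorted(rows[0].keys())

# The boundaries of the text set depend on the classifier that was applied.
sensor_corpus = build_corpus(documents, "sensor")
```

Changing the keyword changes which documents belong to the corpus, whereas the columns of the structured rows stay the same regardless of how they are queried.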

Data Asset (often):

- Consists of at least one data set

- The data set(s) which form the asset are managed actively and/or passively; hence the data is recognised as an asset

-> active management aims to continuously create additional value or increase the value of the data set

Active example: a data warehouse or data lake which is continuously maintained and enhanced, with data shared across the organisation for a multitude of purposes, e.g. creating data products

-> passive management maintains the data asset without (actively) recognising its value or aiming to extract (additional) value from the data

Passive example: a database maintained for keeping a record of a physical product or service, without the aim of reusing or monetising the data

- Enhanced mostly using data management techniques (not by ML or analytics), for instance by continuously adding new data to the existing data set

- Often not fully shared / made available, i.e. only parts are used at a time, e.g. for creating products or services (see data product)

- Often subject to standardised / automated ETL processes for maintaining and cleaning it
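
A standardised ETL process for maintaining and cleaning a data asset could look roughly like the sketch below. This is only an illustration under stated assumptions: the sample records, the choice of email as the matching key, and the `extract`, `transform` and `load` helpers are all hypothetical, and a plain Python list stands in for the target store:

```python
# Hypothetical raw records as they might arrive from two source systems.
raw_records = [
    {"source": "crm", "email": "Ann@Example.com ", "name": "Ann Lee"},
    {"source": "billing", "email": "ann@example.com", "name": "A. Lee"},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Roy"},
]

def extract(source):
    """Extract: copy the rows so the source data is never modified in place."""
    return [dict(row) for row in source]

def transform(rows):
    """Transform: normalise the email key and keep the first row seen per email."""
    cleaned = {}
    for row in rows:
        row["email"] = row["email"].strip().lower()
        cleaned.setdefault(row["email"], row)
    return list(cleaned.values())

def load(rows, target):
    """Load: append the cleaned rows to the target store (a plain list here)."""
    target.extend(rows)
    return target

data_asset = load(transform(extract(raw_records)), [])
```

Keeping the first record per key is just one possible rule; a real process might instead prefer the most recent or the highest-quality copy.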

Note: A data asset can also become a (data) liability, e.g.:

Legacy data: old, unused and/or outdated data which creates costs without being actively used, for instance data that is kept only because of regulatory requirements

Duplicate data: repeated records of the same data (yet in different quality) across the organisation, creating ongoing reconciliation issues and wasting storage space

"Waste" data: data which is not used or analysed and/or whose collection is not known about (it keeps accumulating without any use/purpose, e.g. because of bad logging practices or old malfunctioning sensors)

Data Product (often):

- Created from one or more data assets

- Created for a specific need with limited time / space

- Often a (partial) result of a data asset, derived using ML, analytics or similar processing techniques

- Produced and/or served for specific customer(s) or consumer(s)

Example: Google Maps is a data asset; navigation (service) is a product based on that data asset
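
The asset / product relationship in this example can be sketched in a few lines of Python. The sketch is a toy under stated assumptions: the `road_network` data and the `fastest_route` function are hypothetical, and Dijkstra's algorithm is used only as a stand-in for the processing step; it is not a description of how any real navigation service works:

```python
import heapq

# Hypothetical data asset: a tiny road network with travel times in minutes.
road_network = {
    "home": {"junction": 4, "station": 10},
    "junction": {"station": 3, "office": 9},
    "station": {"office": 2},
    "office": {},
}

def fastest_route(graph, start, goal):
    """Return (total_minutes, path) using Dijkstra's algorithm, or None if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, minutes in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + minutes, neighbour, path + [neighbour]))
    return None

# Data product: one answer derived from the asset for one consumer's request.
route = fastest_route(road_network, "home", "office")
```

The road network is maintained independently of any single request, while each route is produced for one specific consumer and need, which mirrors the distinction drawn above.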

The above definitions are indicative / subjective, and exceptions are likely to apply

