How to define a Data Set, Data Asset and Data Product?
Data Set:
- A collection of structured and/or unstructured data
- Structured: e.g. tabular data (with rows and columns), JSON, XML or other similarly clearly structured formats
- A structured set has clear boundaries within a specified domain and/or when classified using an algorithm
- A structured set is often focussed on a single domain, e.g. a specific business, scientific or other domain
- Unstructured: e.g. a collection of text documents, articles, books or HTML pages (the latter can be considered "semi-structured")
- Unstructured data has varying degrees of structure (e.g. HTML vs. a textbook); the set has undefined boundaries, which largely depend on the classification method / algorithm applied
- A body of text data can be referred to as a "corpus" once the set's domain(s) and boundaries have been (more clearly) specified
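As a rough illustration of the distinction above, the following Python sketch contrasts structured data (CSV and JSON) with unstructured text. All sample values are hypothetical, and the simple keyword rule only stands in for a real classification algorithm.

```python
import csv
import io
import json

# Structured: tabular data with rows and columns (hypothetical sample values).
csv_text = "sensor_id,temperature\ns1,21.5\ns2,19.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Structured: JSON with clearly named fields.
record = json.loads('{"sensor_id": "s1", "temperature": 21.5}')

# Unstructured: free text; the boundaries of the set are not given up front.
documents = [
    "The harbour was busy this morning.",
    "Quarterly revenue grew in the logistics segment.",
    "Ships waited outside the harbour for hours.",
]

# A naive keyword rule (an assumption for illustration) defines the boundary
# of a small "harbour" corpus within the unstructured documents.
corpus = [doc for doc in documents if "harbour" in doc.lower()]
```

The structured rows can be addressed by column name straight away, whereas the text only becomes a bounded set (a corpus) after the classification step.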
Data Asset (often):
- Consists of at least one data set
- The data set(s) forming the asset are managed actively and/or passively; hence the data is recognised as an asset
-> Active management aims to continuously create additional value or to increase the value of the data set
Active example: a data warehouse or data lake that is continuously maintained and enhanced, with data shared across the organisation for a multitude of purposes, e.g. creating data products
-> Passive management maintains the data asset without (actively) recognising its value or aiming to extract (additional) value from the data
Passive example: a database maintained to keep a record of a physical product or service, without the aim of reusing or monetizing the data
- Enhanced mostly using data management techniques (not ML or analytics), for instance by continuously adding new data to the existing data set
- Often not fully shared / made available, i.e. only parts are used at a time, e.g. for creating products or services (see data product)
- Often subject to standardised / automated ETL processes for maintaining and cleaning it
Note: A data asset can also become a (data) liability, e.g.:
Legacy data: old, unused and/or outdated data that creates costs without being actively used, for instance when it is kept only because of regulatory requirements
Duplicate data: repeated records of the same data (yet of differing quality) across the organisation, creating ongoing reconciliation issues and wasting storage space
"Waste" data: data which is not used or analysed and/or whose collection is not even known about (it keeps accumulating without any use or purpose, e.g. because of bad logging practices or old malfunctioning sensors)
Data Product (often):
- Created from one or more data assets
- Created for a specific need, with a limited scope in time / space
- Often a (partial) result of a data asset, derived using ML, analytics or similar processing techniques
- Produced and/or served for specific customer(s) or consumer(s)
Example: a digital / data-driven map can be considered a data asset, and a navigation service is a product based on that data asset
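The map example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the road map, its place names and distances are hypothetical, and Dijkstra's shortest-path algorithm is used here only as one possible processing technique for deriving a product from the asset.

```python
import heapq

# Data asset: a (hypothetical) road map stored as a weighted graph, distances in km.
road_map = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}

def shortest_route(graph, start, goal):
    """Data product: a route derived from the map for one specific request."""
    queue = [(0, start, [start])]  # (distance so far, current node, path taken)
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
    return None  # no route exists between start and goal

distance, route = shortest_route(road_map, "A", "D")
```

The graph is maintained as a whole and can serve many purposes, while each computed route uses only part of it and answers one specific consumer request.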
The above definitions are indicative / subjective, and exceptions are likely to apply.
Liberbyte offers domain-focused data solutions to unlock the value of your data. Our solutions enable you to supply and monetize your data and to source, process and analyse the data you need. Please contact us at info@liberbyte.com for more details about our products.