Metadata learning

Issac
4 min read · May 11, 2021


What is Metadata?
Metadata is simply data about data. In other words, it describes and gives context to the data. It helps you organize, find, and understand data.

Here are a few real-world examples of metadata:

Typical metadata
These are some typical metadata elements:
1. Title and description.
2. Tags and categories.
3. Who created it and when.
4. Who last modified it and when.
5. Who can access or update it.
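
To make this concrete, here is how those elements might look for a single document, sketched as a plain Python dictionary (the field names and values are made up for illustration):

```python
# A sketch of typical metadata for one document; all values are illustrative.
document_metadata = {
    "title": "Quarterly sales report",
    "description": "Sales figures for Q1 2021, broken down by region.",
    "tags": ["sales", "q1", "2021"],
    "category": "reports",
    "created_by": "alice",
    "created_at": "2021-04-02T09:15:00Z",
    "last_modified_by": "bob",
    "last_modified_at": "2021-05-01T14:30:00Z",
    "access": ["alice", "bob", "finance-team"],
}
```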

A photo
Every time you take a photo with today’s cameras, a bunch of metadata is gathered and saved with it:
date and time,
filename,
camera settings,
geolocation.
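
If you are curious, you can read this EXIF metadata yourself with the Pillow library in Python. This is only a minimal sketch: photo.jpg is a placeholder path, and not every camera fills in every tag.

```python
from PIL import Image
from PIL.ExifTags import TAGS

# Open the image and read its EXIF block (empty if the file has none).
image = Image.open("photo.jpg")  # placeholder path
exif = image.getexif()

# Map numeric EXIF tag ids to readable names and print them.
for tag_id, value in exif.items():
    name = TAGS.get(tag_id, tag_id)
    print(f"{name}: {value}")
```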

A book
Each book has a number of standard metadata elements on the covers and inside.
This includes:
a title,
author name,
publisher and copyright details,
description on the back cover,
table of contents,
index,
page numbers.

A blog post
Every blog post has standard metadata fields that usually appear before the first paragraph. These include:
title,
author,
published time,
category,
tags.

Email
Every email you send or receive has a number of metadata fields, many of which are hidden in the message header and not visible to you in your mail client. This metadata includes:
subject,
from,
to,
date and time sent,
sending and receiving server names and IPs,
format (plain text or HTML),
anti-spam software details.
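
You can peek at these headers with Python’s standard email module. A minimal sketch, assuming the raw message has been saved to disk as message.eml (a placeholder name):

```python
from email import policy
from email.parser import BytesParser

# Parse a raw email message from disk; "message.eml" is a placeholder path.
with open("message.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# The headers you normally see in your mail client.
print("Subject:", msg["Subject"])
print("From:   ", msg["From"])
print("To:     ", msg["To"])
print("Date:   ", msg["Date"])

# Hidden routing metadata: one Received header per server the mail passed through.
for hop in msg.get_all("Received", []):
    print("Received:", hop)
```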

Relational database
A relational database (the most common type of database) stores and provides access not only to data but also to metadata, in a structure called the data dictionary or system catalog. It holds information about:
tables,
columns,
data types,
constraints,
table relationships,
and many more.
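
As a small, self-contained illustration, SQLite exposes its system catalog through the sqlite_master table and PRAGMA statements. The books table below is created only for the example:

```python
import sqlite3

# In-memory database with one example table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT)"
)

# The system catalog: table names and the SQL used to define them.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type='table'"):
    print(name, "->", sql)

# Column-level metadata: name, declared type, NOT NULL flag, primary-key flag.
for cid, name, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(books)"):
    print(name, col_type, "NOT NULL" if notnull else "", "PK" if pk else "")
```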

Computer files
All the fields you see next to each file in a file explorer are actually metadata.
The actual data is inside those files. Metadata includes:
file name,
type,
size,
creation date and time,
last modification date and time.
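
Python’s standard library can read the same file metadata through the file system. A minimal sketch (report.pdf is a placeholder path, and creation time is reported differently on each platform):

```python
from datetime import datetime
from pathlib import Path

path = Path("report.pdf")  # placeholder path; must exist for stat() to work
info = path.stat()

print("Name:        ", path.name)
print("Type:        ", path.suffix)
print("Size (bytes):", info.st_size)
print("Modified:    ", datetime.fromtimestamp(info.st_mtime))
# On Linux, st_ctime is the metadata-change time rather than the creation time.
print("Changed:     ", datetime.fromtimestamp(info.st_ctime))
```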

Why Data Lakes Need a Data Catalog
What happens when all of that data is sitting in a data lake? Finding anything specific within such a repository can be unwieldy by today’s standards. With the growing volume of data generated by all the world’s devices, the data lake will only grow wider and deeper with each passing day. So while collecting data into a repository is key to using it, the information also needs to be cataloged and accessible in order to actually be usable. The sensible solution, then, is to implement a data catalog.

What is a Data Lake?
Before understanding why a data catalog can be so useful in this situation, it’s important to grasp the concept of a data lake. In layman’s terms, a data lake acts as a repository that stores data exactly the way it comes in. If it’s a structured dataset, the lake maintains that structure without adding any further indexing or metadata. If it’s unstructured data (for example, social media posts, images, MP3 files, etc.), it lands in the data lake as is, in whatever its native format might be. Data lakes can take input from multiple sources, making them a functional single repository for an organization to use as a collection point. To extend the lake metaphor, consider each data source as a stream or a river: they all lead to the data lake, where raw, unfiltered datasets sit next to curated, enterprise-certified datasets.

What is a Data Catalog?
A data catalog is exactly what it sounds like: a catalog of all the big data in a data lake. By applying metadata to everything within the data lake, data discovery and governance become much easier tasks. By applying metadata and a hierarchical logic to incoming data, datasets receive the necessary context and trackable lineage to be used efficiently in workflows.
Let’s use the analogy of notes in a researcher’s library. In this library, a researcher gets structured data in the form of books that feature chapters, indices, and glossaries. The researcher also gets unstructured data in the form of notebooks that feature no real organization or delineation at all. A data catalog would take each of these items without changing their native format and apply a logical catalog to them using metadata such as date received, sender, general topic, and other such items that could accelerate data discovery.
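
To make the analogy concrete, here is a hedged sketch of what a single catalog entry could look like, without assuming any particular catalog product: the raw items stay in their native formats, and the catalog only records metadata about them so they can be discovered later.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """Metadata about one item in the data lake; the item itself is never changed."""
    path: str        # where the raw data sits in the lake, in its native format
    format: str      # e.g. "pdf", "jpeg", "txt"
    received: date   # when the item landed in the lake
    sender: str      # which source or stream it came from
    topic: str       # general subject, to aid discovery
    tags: list[str] = field(default_factory=list)

# The researcher's library: a structured book and an unstructured notebook,
# both cataloged the same way without touching their contents.
catalog = [
    CatalogEntry("lake/raw/books/chemistry.pdf", "pdf", date(2021, 5, 1),
                 "publisher-feed", "chemistry"),
    CatalogEntry("lake/raw/notes/field-notes-07.txt", "txt", date(2021, 5, 3),
                 "field-team", "chemistry", ["unreviewed"]),
]

# Data discovery: find everything on a topic, regardless of format.
chemistry_items = [entry.path for entry in catalog if entry.topic == "chemistry"]
print(chemistry_items)
```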

