Contents
1. Metadata Management Implementation Overview
2. Metadata Classification
2.1 Technical Metadata
2.2 Business Metadata
3. Metadata Tagging System
Basic labels
Data warehouse labels
Business labels
Potential labels
4. Table Metadata
4.1 Extracting Metadata Based on the Pull Mechanism
web UI method
CLI yml method
yml parsing
yml template
4.2 REST API Method
REST API metadata manual construction template
5. Lineage Metadata
5.1 Constructing Lineage Metadata Based on the Push Mechanism
SparkSql scenario
SparkSession scenario
5.2 Constructing Lineage Metadata Based on the REST API Mechanism
REST API lineage demo
REST API lineage build tools
Building lineage for MR HQL programs via the REST API (pub_execute_sql script)
Building lineage for MR HQL backfill programs via the REST API (backfill script)
Building Hive-to-ClickHouse lineage with Waterdrop via the REST API (Waterdrop script mode)
6. Manually Deleting Metadata via the CLI
Delete all datasets in the development environment
Delete all containers for a specific platform
Delete all pipelines and tasks in the development environment
Delete all BigQuery datasets in the PROD environment
Delete all dashboards and charts
Delete all datasets matching a query
7. Querying Lineage and Upstream/Downstream Counts via GraphiQL
Demo
Utility for querying a table's upstream/downstream dependency counts and details
1. Metadata Management Implementation Overview
2. Metadata Classification
By purpose, metadata falls into two categories: technical metadata and business metadata.
2.1 Technical Metadata
Technical metadata stores the technical details of the data warehouse system and is used to develop and manage the data the warehouse uses. Common technical metadata includes:
- Storage metadata of distributed computing systems: information about Hive tables, columns, and partitions, including the table name, partition information, owner, file size, table type, and lifecycle, as well as column-level details such as field name, field type, field comment, and whether the field is a partition column.
- Runtime metadata of distributed computing systems: for example, execution information for all jobs on Spark, similar to job logs, including job type, instance name, inputs and outputs, SQL, and execution time, plus data synchronization, computing task, and task scheduling information from the data development platform.
- Data quality and operations metadata: task monitoring, alerting, data quality, and failure information, including task monitoring logs, alert configurations and logs, and failure records.
2.2 Business Metadata
Business metadata describes the data in the warehouse from a business perspective. It provides a semantic layer between users and the actual system, so that business staff without a technical background can understand the data in the warehouse. Common business metadata includes:
- Dimensional and metric metadata: standardized definitions of dimensions, attributes, business processes, indicators, and so on, for better management and use of the data.
- Data application metadata: for example, configuration and runtime metadata of data reports and data products.
3. Metadata Tagging System
Metadata tags not only save R&D time and cost but also let non-R&D staff inside the company understand and use data more intuitively, improving data development efficiency. To this end, domains and glossary term sets can be pre-built in DataHub for labeling data.
Basic labels
- Data storage
- Access method
- Data security level

Data warehouse labels
- Whether the data is incremental or full
- Whether the data can be regenerated
- Data lifecycle

Business labels
- Subject area the data belongs to
- Product line / BU
- Business type

Potential labels
These labels mainly describe the potential application scenarios of the data:
- Social
- Media
- Advertising
- E-commerce
- Finance
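For example, once these labels exist as tags in DataHub, they can be attached to datasets during ingestion with the simple_add_dataset_tags transformer covered in section 4.1. A minimal sketch; the tag names below are hypothetical:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:security_level_L2"  # hypothetical basic label
        - "urn:li:tag:incremental"        # hypothetical data warehouse label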
4. Table Metadata
4.1 Extracting Metadata Based on the Pull Mechanism
DataHub's pull-based ingestion is plugin-based: a Source plugin reads from the data source, an optional Transformer plugin converts the metadata, and a Sink plugin writes it out. Install a plugin with, for example: pip install 'acryl-datahub[mysql]'
List the currently installed plugins with: python3 -m datahub check plugins
web UI method
CLI yml method
yml parsing
Template
source:
  type: mysql  # The source type can also be hive or others; the corresponding config differs slightly
  config:
    host_port: 172.16.8.69:3308
    database: test
    username: "root"
    password: "root"
    profiling:  # Use profiling on Hive with caution; it can easily exhaust resources and bring the cluster down
      enabled: True
      include_field_min_value: True
      include_field_max_value: True
    stateful_ingestion:  # When enabled, full pulls automatically add new and remove stale metadata
      enabled: True
      remove_stale_metadata: True

# In most cases no transformer needs to be configured
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
      ownership_type: "PRODUCER"

# The default sink is datahub-rest
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

pipeline_name: mysql_pipeline  # stateful ingestion requires a pipeline name
datahub_api:  # Optional. If provided, this config is used by the "datahub" ingestion state provider.
  server: "http://localhost:8080"
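Assuming the recipe above is saved as, say, mysql.yml (a hypothetical filename), it can be run from the CLI with: python3 -m datahub ingest -c mysql.yml. The same recipe content can also be pasted into the ingestion form of the web UI.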
source
For more config details, see the official documentation: Input objects | DataHub (datahubproject.io)
source:
  type: mysql  # The source type can also be hive or others; the corresponding config differs slightly
  config:
    host_port: 172.16.8.69:3308
    database: test
    username: "root"
    password: "root"
    profiling:
      enabled: True
      include_field_min_value: True
      include_field_max_value: True
    stateful_ingestion:  # When enabled, full pulls automatically add new and remove stale metadata
      enabled: True
      remove_stale_metadata: True
transformer (optional)
(1) Add tags
Use the simple_add_dataset_tags module to add tags.
Tags can also be customized via add_dataset_tags with your own module and function.
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
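Tags can also be assigned per dataset by URN pattern. A sketch assuming the pattern_add_dataset_tags transformer, which mirrors the pattern_add_dataset_ownership transformer shown in (3) below; the rules and tags are illustrative:

transformers:
  - type: "pattern_add_dataset_tags"
    config:
      tag_pattern:
        rules:
          ".*example1.*": ["urn:li:tag:NeedsDocumentation"]
          ".*example2.*": ["urn:li:tag:Legacy"]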
(2) Change owners
Use simple_remove_dataset_ownership to clear a dataset's owners.
transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}
Use simple_add_dataset_ownership to add a list of users.
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
      ownership_type: "PRODUCER"
(3) Set owners by dataset URN pattern, assigning different owners to different datasets.
transformers:
  - type: "pattern_add_dataset_ownership"
    config:
      owner_pattern:
        rules:
          ".*example1.*": ["urn:li:corpuser:username1"]
          ".*example2.*": ["urn:li:corpuser:username2"]
      ownership_type: "DEVELOPER"
(4) Mark dataset status
To hide a dataset from the UI, mark it as "removed".
transformers:
  - type: "mark_dataset_status"
    config:
      removed: true
(5) Add dataset browse paths
A browse path can be added to a dataset via a transformer. Three variables are available:
ENV: the environment passed in; defaults to prod.
PLATFORM: a platform supported by DataHub, for example mysql or postgres.
DATASET_PARTS: the dataset name parts separated by slashes, for example database_name/table_name.
The template below generates the browse path /prod/hive/cn_sisyphe_dm_book/biz_batch_operate_record for the table cn_sisyphe_dm_book.biz_batch_operate_record in the Hive database:
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /ENV/PLATFORM/DATASET_PARTS
ENV can also be omitted, fixing a certain part of the path:
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /PLATFORM/marketing_db/DATASET_PARTS
This yields the browse path /mysql/marketing_db/sales/orders for the sales.orders table in the MySQL database.
Multiple browse paths can be set, since different people may refer to the same data assets by different names.
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /PLATFORM/marketing_db/DATASET_PARTS
        - /data_warehouse/DATASET_PARTS
This generates 2 browse paths:
① /mysql/marketing_db/sales/orders
② /data_warehouse/sales/orders
sink
(1) Console
Outputs metadata events to standard output.
Useful for experimentation and debugging.
source:
  # source configs
sink:
  type: "console"
(2) DataHub
① DataHub Rest
Pushes metadata to DataHub using the GMS REST interface.
Errors are reported immediately.
Additional fields can be set: timeout_sec, token, extra_headers, max_threads.
source:
  # source configs
sink:
  type: "datahub-rest"
  config:
    server: "http://datahubip:8080"
② DataHub Kafka
Pushes metadata to DataHub by publishing messages to Kafka.
Asynchronous, so it can handle higher traffic.
There are additional connection-related configuration fields.
source:
  # source configs
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      schema_registry_url: "http://datahubip:8081"
(3) File
Outputs metadata to a file.
A File sink decouples the processing of the source from pushing to DataHub.
Also suitable for debugging.
A File source can later read the metadata back from the file produced by a File sink.
source:
  # source configs
sink:
  type: file
  config:
    filename: ./path/to/mce/file.json
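For the round trip, the file written by a File sink can later be ingested with a File source. A minimal sketch, reusing the example path above:

source:
  type: file
  config:
    filename: ./path/to/mce/file.json  # file previously produced by a File sink
sink:
  type: "datahub-rest"
  config:
    server: "http://datahubip:8080"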
yml template
Whether you use the UI or the CLI, a YAML recipe must be configured. Examples for components commonly used in our company follow. For more templates, see Athena | DataHub (datahubproject.io)
mysql template
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
| Domain | ✅ | Supported via the domain config field |
| Platform instance | ✅ | Enabled by default |
This plugin extracts the following:
- Metadata for the database, schemas, and tables
- Column types and schema associated with each table
- Optionally, table, row, and column statistics via SQL profiling
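A minimal MySQL recipe sketch with placeholder connection details; see the full template in section 4.1 for profiling and stateful ingestion options:

source:
  type: mysql
  config:
    host_port: localhost:3306  # placeholder host and port
    database: test
    username: "root"
    password: "root"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"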
clickhouse template
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
This plugin extracts the following:
- Metadata for tables, views, materialized views, and dictionaries
- The column types associated with each table (except aggregate function and DateTime-with-timezone types)
- Optionally, table, row, and column statistics via SQL profiling
- Lineage for tables, views, materialized views, and dictionaries (with CLICKHOUSE source_type)
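A minimal ClickHouse recipe sketch with placeholder connection details; include_views follows the common SQL-source options:

source:
  type: clickhouse
  config:
    host_port: localhost:8123  # placeholder; ClickHouse HTTP interface
    username: "default"
    password: ""
    include_views: True  # also ingest views
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"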
clickhouse-usage template (ClickHouse usage statistics)
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
This plugin provides the following features:
- For a specific dataset, it ingests the following statistics:
  - Top n queries
  - Top users
  - Usage of each column in the dataset
- These statistics are aggregated into buckets at day or hour granularity
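A minimal clickhouse-usage recipe sketch with placeholder connection details; top_n_queries and bucket_duration are assumed to follow the common usage-source options:

source:
  type: clickhouse-usage
  config:
    host_port: localhost:8123  # placeholder
    username: "default"
    password: ""
    top_n_queries: 10      # assumed usage-source option: top queries kept per bucket
    bucket_duration: DAY   # assumed usage-source option: DAY or HOUR granularity
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"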
hive template
| Capability | Status | Notes |
|---|---|---|
| Domain | ✅ | Supported via the domain config field |
| Platform instance | ✅ | Enabled by default |
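A minimal Hive recipe sketch, assuming a placeholder HiveServer2 endpoint:

source:
  type: hive
  config:
    host_port: localhost:10000  # placeholder HiveServer2 host and port
    database: default
    # profiling is deliberately left off for Hive; see the caution in section 4.1
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"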