Contents
1. Metadata Management Implementation Overview
2. Metadata Classification
2.1 Technical Metadata
2.2 Business Metadata
3. Metadata Tagging System
Basic labels
Data warehouse labels
Business labels
Potential labels
4. Table Metadata
4.1 Extracting Metadata Based on the Pull Mechanism
web UI method
CLI yml method
yml parsing
yml template
4.2 REST API Method
REST API metadata manual construction template
5. Lineage Metadata
5.1 Constructing Lineage Metadata Based on the Push Mechanism
SparkSql scenario
SparkSession scenario
5.2 Constructing Lineage Metadata Based on the REST API Mechanism
REST API lineage demo
REST API lineage build tools
Building lineage for MR HQL programs via the REST API (pub_execute_sql script)
Building lineage for MR HQL backfill programs via the REST API (backfill script)
Building Hive-to-ClickHouse lineage with Waterdrop via the REST API (Waterdrop script mode)
6. Manually Deleting Metadata via the CLI
Delete all datasets in the development environment
Delete all containers for a specific platform
Delete all pipelines and tasks in the development environment
Delete all BigQuery datasets in the PROD environment
Delete all dashboards and charts
Delete all datasets matching a query
7. Querying Lineage and Upstream/Downstream Counts via GraphiQL
Demo
Utility for querying a table's upstream/downstream dependency counts and details
1. Metadata Management Implementation Overview
2. Metadata Classification
By purpose, metadata falls into two categories: technical metadata and business metadata.
2.1 Technical Metadata
Technical metadata stores the technical details of the data warehouse system and is used to develop and manage the data the warehouse uses. Common technical metadata includes:
- Storage metadata of distributed computing systems: information about Hive tables, columns, and partitions, including the table name, partition information, owner, file size, table type, and lifecycle, as well as column-level details such as field name, field type, field comment, and whether the field is a partition column.
- Runtime metadata of distributed computing systems: for example, execution information for all jobs on Spark, similar to job logs, including job type, instance name, inputs and outputs, SQL, and execution time, plus data synchronization, computing task, and task scheduling information from the data development platform.
- Data quality and operations metadata: task monitoring, alerting, data quality, and failure information, including task monitoring logs, alert configurations and logs, and failure records.
2.2 Business Metadata
Business metadata describes the data in the warehouse from a business perspective. It provides a semantic layer between users and the actual system, so that business staff without a technical background can understand the data in the warehouse. Common business metadata includes:
- Dimensional and metric metadata: standardized definitions of dimensions, attributes, business processes, indicators, and so on, for better management and use of the data.
- Data application metadata: for example, configuration and runtime metadata of data reports and data products.
3. Metadata Tagging System
Metadata tags not only save R&D time and cost but also let non-R&D staff inside the company understand and use data more intuitively, improving data development efficiency. To this end, domains and glossary term sets can be pre-built in DataHub for labeling data.
Basic labels
- Data storage
- Access method
- Data security level

Data warehouse labels
- Whether the data is incremental or full
- Whether the data can be regenerated
- Data lifecycle

Business labels
- Subject area the data belongs to
- Product line / BU
- Business type

Potential labels
These labels mainly describe the potential application scenarios of the data:
- Social
- Media
- Advertising
- E-commerce
- Finance
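For example, once these labels exist as tags in DataHub, they can be attached to datasets during ingestion with the simple_add_dataset_tags transformer covered in section 4.1. A minimal sketch; the tag names below are hypothetical:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:security_level_L2"  # hypothetical basic label
        - "urn:li:tag:incremental"        # hypothetical data warehouse label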
4. Table Metadata
4.1 Extracting Metadata Based on the Pull Mechanism
DataHub's pull-based ingestion is plugin-based: a Source plugin reads from the data source, an optional Transformer plugin converts the metadata, and a Sink plugin writes it out. Install a plugin with, for example: pip install 'acryl-datahub[mysql]'
List the currently installed plugins with: python3 -m datahub check plugins
web UI method
CLI yml method
yml parsing
Template
source:
  type: mysql  # The source type can also be hive or others; the corresponding config differs slightly
  config:
    host_port: 172.16.8.69:3308
    database: test
    username: "root"
    password: "root"
    profiling:  # Use profiling on Hive with caution; it can easily exhaust resources and bring the cluster down
      enabled: True
      include_field_min_value: True
      include_field_max_value: True
    stateful_ingestion:  # When enabled, full pulls automatically add new and remove stale metadata
      enabled: True
      remove_stale_metadata: True

# In most cases no transformer needs to be configured
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
      ownership_type: "PRODUCER"

# The default sink is datahub-rest
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

pipeline_name: mysql_pipeline  # stateful ingestion requires a pipeline name
datahub_api:  # Optional. If provided, this config is used by the "datahub" ingestion state provider.
  server: "http://localhost:8080"
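Assuming the recipe above is saved as, say, mysql.yml (a hypothetical filename), it can be run from the CLI with: python3 -m datahub ingest -c mysql.yml. The same recipe content can also be pasted into the ingestion form of the web UI.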
source
For more config details, see the official documentation: Input objects | DataHub (datahubproject.io)
source:
  type: mysql  # The source type can also be hive or others; the corresponding config differs slightly
  config:
    host_port: 172.16.8.69:3308
    database: test
    username: "root"
    password: "root"
    profiling:
      enabled: True
      include_field_min_value: True
      include_field_max_value: True
    stateful_ingestion:  # When enabled, full pulls automatically add new and remove stale metadata
      enabled: True
      remove_stale_metadata: True
transformer (optional)
(1) Add tags
Use the simple_add_dataset_tags module to add tags.
Tags can also be customized via add_dataset_tags with your own module and function.
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
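Tags can also be assigned per dataset by URN pattern. A sketch assuming the pattern_add_dataset_tags transformer, which mirrors the pattern_add_dataset_ownership transformer shown in (3) below; the rules and tags are illustrative:

transformers:
  - type: "pattern_add_dataset_tags"
    config:
      tag_pattern:
        rules:
          ".*example1.*": ["urn:li:tag:NeedsDocumentation"]
          ".*example2.*": ["urn:li:tag:Legacy"]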
(2) Change owners
Use simple_remove_dataset_ownership to clear a dataset's owners.
transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}
Use simple_add_dataset_ownership to add a list of users.
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
      ownership_type: "PRODUCER"
(3) Set owners by dataset URN pattern, assigning different owners to different datasets.
transformers:
  - type: "pattern_add_dataset_ownership"
    config:
      owner_pattern:
        rules:
          ".*example1.*": ["urn:li:corpuser:username1"]
          ".*example2.*": ["urn:li:corpuser:username2"]
      ownership_type: "DEVELOPER"
(4) Mark dataset status
To hide a dataset from the UI, mark it as "removed".
transformers:
  - type: "mark_dataset_status"
    config:
      removed: true
(5) Add dataset browse paths
A browse path can be added to a dataset via a transformer. Three variables are available:
ENV: the environment passed in; defaults to prod.
PLATFORM: a platform supported by DataHub, for example mysql or postgres.
DATASET_PARTS: the dataset name parts separated by slashes, for example database_name/table_name.
The template below generates the browse path /prod/hive/cn_sisyphe_dm_book/biz_batch_operate_record for the table cn_sisyphe_dm_book.biz_batch_operate_record in the Hive database:
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /ENV/PLATFORM/DATASET_PARTS
ENV can also be omitted, fixing a certain part of the path:
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /PLATFORM/marketing_db/DATASET_PARTS
This yields the browse path /mysql/marketing_db/sales/orders for the sales.orders table in the MySQL database.
Multiple browse paths can be set, since different people may refer to the same data assets by different names.
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /PLATFORM/marketing_db/DATASET_PARTS
        - /data_warehouse/DATASET_PARTS
This generates 2 browse paths:
① /mysql/marketing_db/sales/orders
② /data_warehouse/sales/orders
sink
(1) Console
Outputs metadata events to standard output.
Useful for experimentation and debugging.
source:
  # source configs
sink:
  type: "console"
(2) DataHub
① DataHub Rest
Pushes metadata to DataHub using the GMS REST interface.
Errors are reported immediately.
Additional fields can be set: timeout_sec, token, extra_headers, max_threads.
source:
  # source configs
sink:
  type: "datahub-rest"
  config:
    server: "http://datahubip:8080"
② DataHub Kafka
Pushes metadata to DataHub by publishing messages to Kafka.
Asynchronous, so it can handle higher traffic.
There are additional connection-related configuration fields.
source:
  # source configs
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      schema_registry_url: "http://datahubip:8081"
(3) File
Outputs metadata to a file.
A File sink decouples the processing of the source from pushing to DataHub.
Also suitable for debugging.
A File source can later read the metadata back from the file produced by a File sink.
source:
  # source configs
sink:
  type: file
  config:
    filename: ./path/to/mce/file.json
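For the round trip, the file written by a File sink can later be ingested with a File source. A minimal sketch, reusing the example path above:

source:
  type: file
  config:
    filename: ./path/to/mce/file.json  # file previously produced by a File sink
sink:
  type: "datahub-rest"
  config:
    server: "http://datahubip:8080"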
yml template
Whether you use the UI or the CLI, a YAML recipe must be configured. Examples for components commonly used in our company follow. For more templates, see Athena | DataHub (datahubproject.io)
mysql template
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
| Domain | ✅ | Supported via the domain config field |
| Platform instance | ✅ | Enabled by default |
This plugin extracts the following:
- Metadata for the database, schemas, and tables
- Column types and schema associated with each table
- Optionally, table, row, and column statistics via SQL profiling
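A minimal MySQL recipe sketch with placeholder connection details; see the full template in section 4.1 for profiling and stateful ingestion options:

source:
  type: mysql
  config:
    host_port: localhost:3306  # placeholder host and port
    database: test
    username: "root"
    password: "root"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"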
clickhouse template
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
This plugin extracts the following:
- Metadata for tables, views, materialized views, and dictionaries
- The column types associated with each table (except aggregate function and DateTime-with-timezone types)
- Optionally, table, row, and column statistics via SQL profiling
- Lineage for tables, views, materialized views, and dictionaries (with CLICKHOUSE source_type)
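A minimal ClickHouse recipe sketch with placeholder connection details; include_views follows the common SQL-source options:

source:
  type: clickhouse
  config:
    host_port: localhost:8123  # placeholder; ClickHouse HTTP interface
    username: "default"
    password: ""
    include_views: True  # also ingest views
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"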
clickhouse-usage template (ClickHouse usage statistics)
| Capability | Status | Notes |
|---|---|---|
| Data profiling | ✅ | Optional; enabled via configuration |
| Detect deleted entities | ✅ | Enabled via stateful ingestion |
This plugin provides the following features:
- For a specific dataset, it ingests the following statistics:
  - Top n queries
  - Top users
  - Usage of each column in the dataset
- These statistics are aggregated into buckets at day or hour granularity
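A minimal clickhouse-usage recipe sketch with placeholder connection details; top_n_queries and bucket_duration are assumed to follow the common usage-source options:

source:
  type: clickhouse-usage
  config:
    host_port: localhost:8123  # placeholder
    username: "default"
    password: ""
    top_n_queries: 10      # assumed usage-source option: top queries kept per bucket
    bucket_duration: DAY   # assumed usage-source option: DAY or HOUR granularity
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"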
hive template
| Capability | Status | Notes |
|---|---|---|
| Domain | ✅ | Supported via the domain config field |
| Platform instance | ✅ | Enabled by default |
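A minimal Hive recipe sketch, assuming a placeholder HiveServer2 endpoint:

source:
  type: hive
  config:
    host_port: localhost:10000  # placeholder HiveServer2 host and port
    database: default
    # profiling is deliberately left off for Hive; see the caution in section 4.1
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"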