Dataworks API: Call all forms under the MC project

Article directory

Preface
Dataworks API documentation interpretation
- GetMetaDBTableList interface documentation
API debugging
- Online debugging
- local debugging
- - Operating environment
  - Account secret issue
  - Request data
  - further processing
summary

Foreword

Recently, I need to sort out the company’s data assets, which includes forms under various Dataworks projects. These forms, as one of the company’s important data assets, also need to be organized and standardized. Fortunately, the official API interface is provided to help me obtain all forms under each project more efficiently.
Dataworks API provides a lot of project and form information, such as form list, form name, creation time, belonging project, scheduling information, etc.
This article uses a relatively simple example (getting a list of forms) to explain the entire process in detail.

Dataworks API documentation interpretation

DataWorks OpenAPI overview documentation: https://help.aliyun.com/zh/dataworks/developer-reference/api

Before calling the Dataworks API, you need to keep in mind that API calls are limited.

The total number of daily API calls for the basic version is ≤100 times, the total number of daily API calls for the standard version is ≤1,000 times, and the total number of daily API calls for the professional version is ≤10,000 times, and additional paid calls are not supported. The enterprise version can almost ignore the call limit, has a large number of calls and supports additional paid calls.

In order to standardize the calling of DataWorks OpenAPI, OpenAPI is officially divided into Level 1, Level 2 and Level 3. The specific classification method is not introduced, but each API has a corresponding level description. For details, see the document introduction.

GetMetaDBTableList interface document

What we want to get this time is the form name under a certain project, so check the GetMetaDBTableList interface.

Note that the [Engine Instance] mentioned in the document is actually the MaxCompute project, so this is equivalent to obtaining the MaxCompute project table in DataWorks. You can see related configuration information in the workspace.

In addition, the relationship between DataWorks and MaxCompute:

DataWorks is a one-stop big data development and management platform provided by Alibaba Cloud. One-stop data development and management operations such as MaxCompute job development, periodic scheduling, job operation and maintenance, and data governance can be performed on DataWorks. You can create a DataWorks workspace in the DataWorks console, and directly create and bind a MaxCompute project during the process. You can then develop MaxCompute jobs in the DataWorks workspace.
MaxCompute uses DataWorks to visually configure task workflows, perform periodic scheduling and execution, and manage metadata to ensure efficient and stable data production and management.

Back to the document, you can see that there are many required parameters, but not all of them are required. Specifically, you can click the [Debug] button in the debugging section to go to the debugging page for testing, or directly click on this link to go directly: https://next .api.aliyun.com/api/dataworks-public/2020-05-18/GetMetaDBTableList

Debug entry

Next, look at the returned data. The returned data structure is as follows. RequestId and Data are at the same level. The path of the TableName I need is Data -> TableEntityList -> TableName.

Under “Spoiler”, these data are all body values. You can see the corresponding structure in subsequent calls. When processing values, you also need to start from the body level.

Below the document are some examples and error codes, just check them out yourself.
Maybe you are just as confused as me after reading it! Can this be counted? No, the logic of fetching numbers is not here. For the specific code logic, you can obtain the relevant reference code in the debugging interface, and it supports calling interfaces in multiple programming languages.

API debugging

Online debugging

After understanding the basic information, let’s debug it.
In the debugging interface, you can see that there is actually only one required parameter. After testing, filling in this parameter can obtain the relevant form list data.
By default, 10 records are displayed on one page, and the maximum number can be 100. You can adjust the PageSize of the paging parameter to the maximum and use the minimum number of times to pull down all the form data.

After initiating the call, you can see the result information of the call, which contains a return value: TotalCount. The official introduction is the total number of calculation engines, which is actually the number of forms. You can adjust the number of pages and the number of records per page based on this value.

Local debugging

If there is no problem with debugging, then go to the next step and type the code locally to call the API interface to obtain the data.
Still on the debugging page, click [SDK Example], taking Python as an example, click Python. You can see a complete Python source code.

Copy it into the local Python editor. It cannot be run at this time. Two core things need to be configured: Running environment and account and password (ACCESS_KEY_ID and ACCESS_KEY_SECRET).

Operating environment

Click [SDK Information] on the right to see the installation commands:

pip install alibabacloud_dataworks_public20200518==4.7.2

Open the terminal/command prompt, and first install the alibabacloud_dataworks_public20200518==4.7.2 environment (major premise: Python is already installed and will not be expanded here).
There may be a timeout error due to network problems. Try a few more times, or try at different time periods.

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘ReadTimeoutError(“HTTPSConnectionPool(host=’files.pythonhosted.org’, port=443 ): Read timed out.

Account secret issue

Lines 45~47 of the source code introduce how to configure the account secret: add two environment variables, one is named: ALIBABA_CLOUD_ACCESS_KEY_ID, the corresponding value is your ACCESS_KEY_ID, the other is named: ALIBABA_CLOUD_ACCESS_KEY_SECRET, the corresponding value is your ACCESS_KEY_SECRET.

Reference example:

Request data

After the two preparations are completed, can the data be read? Yes! It can be read, but I didn’t retain it after reading it.
Run the source code directly and you will find nothing…
This is because the source code does not print (print()) or return (return) the data returned by the API. The official prompts that you need to print the return value yourself. (See line 53 of the source code)

How to change it? Print it. Note that since the returned content is an object, you need to use to_map() to convert the object. The reference is as follows:

print(client.get_meta_dbtable_list_with_options(get_meta_dbtable_list_request, runtime).to_map())

Run it again and you will see a dictionary structure of data, which is consistent with the data structure of the screenshot I “spoiled” earlier.

At this point, congratulations! The data you need has been successfully obtained.

Further processing

Although we have successfully obtained the data, we have only obtained one page so far, with a maximum of 100 records. However, when the amount of data is relatively large, we need to retrieve it multiple times. In order to improve efficiency, the code needs to be further improved so that the program can automatically obtain all data and at the same time save the data retrieved each time for subsequent analysis and processing.
Here I added a for loop and temporarily stored the data in a list. The specific operations are as follows:
Add a new function in class Sample to process the value of body, take out TableName, and then return it.

@staticmethod
    def gettable(response):
        TableInfos = response.to_map()['body']['Data']['TableEntityList']
        TableList = [table['TableGuid'] for table in TableInfos]
        return TableList

And modify the main() method, call the gettable() method, and return the processing result TableList.

@staticmethod
    def main(
        args: List[str],
    ) -> None:
        # Please ensure that the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET are set in the code running environment.
        # Leakage of engineering code may lead to AccessKey leakage and threaten the security of all resources under the account. The following code example uses environment variables to obtain the AccessKey. It is for reference only. It is recommended to use the more secure STS method. For more authentication access methods, please see: https://help.aliyun.com/document_detail/378659.html
        client = Sample.create_client(os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'], os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET'])
        get_meta_dbtable_list_request = dataworks_public_20200518_models.GetMetaDBTableListRequest(
            app_guid=args[0],
            page_size=args[1],
            page_number=args[2],
        )
        runtime = util_models.RuntimeOptions()
        try:
            # Please print the return value of the API after copying the code and running it.
            # client.get_meta_dbtable_list_with_options(get_meta_dbtable_list_request, runtime)
            response = client.get_meta_dbtable_list_with_options(get_meta_dbtable_list_request, runtime)
            TableList = Sample.gettable(response)
            return TableList
        except Exception as error:
            # If necessary, print error
            # UtilClient.assert_as_string(error.message)
            response = UtilClient.assert_as_string(error.message)
            print(response)

Finally, the program is stored in the library, the passed parameters are modified, and a loop call is made.

if __name__ == '__main__':
    # response = Sample.main(sys.argv[1:])
    TableList = []

    page_cnt = 10 #10*100, supports obtaining 1000 records
    for i in range(page_cnt):
        app_guid='odps.projectname';page_size=100;page_number=1 + i
        response = Sample.main([app_guid,page_size,page_number])
        if response is None: #No return value, indicating an error.
            continue
        TableList.extend(response)
    print(TableList[:10])

At this point, all the data has been obtained and placed in the list TableList.
However, the data has not been saved yet. Finally, the data is written to Excel through pandas (other tool packages can also be used):

import pandas as pd
df = pd.DataFrame(TableList,columns=['tablename']);
df.to_excel('projectname_table.xlsx',index=False)

At this point, the data acquisition is complete! You can use the data to conduct relevant analysis~~

Summary

This article describes how to get all form fields from a Dataworks project. The basic steps are as follows:

Check the official documentation to learn about constraints and interfaces;
Online debugging and obtaining source code;
Configure the local environment and install alibabacloud_dataworks_public20200518;
Configure the environment variables, add two new environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET, and use the ACCESS_KEY_ID and ACCESS_KEY_SECRET of the Alibaba Cloud account as the values of the corresponding variables respectively;
Test the source code, print the data, and check whether it is feasible;
Call and save data in a loop.

The whole process is rather cumbersome, but after going through it, you can “do it once and for all” because other interfaces basically follow this routine and can be reused by changing some parameters.