Ascend Migration丨Four TensorFlow Model Training Cases Explained

This article is shared from the Huawei Cloud Community article “Common Cases of TensorFlow Model Training” by Ascend CANN.

Training scripts developed with TensorFlow’s Python API run on CPUs/GPUs/TPUs by default. To take advantage of the computing power of the Ascend AI processor, these scripts must be migrated to the Ascend platform.

This article walks through several typical cases in which a TensorFlow network fails to execute, or executes with poor performance, after migration to the Ascend platform, and analyzes the cause and solution for each.

01 Resource operators in data preprocessing cause a training failure

Symptom

When the TensorFlow network is executed, the following error is reported:

[2021-03-19 13:50:24.895266: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at lookup_table_op.cc:809 : Failed precondition: Table not initialized.

[2021-03-19 13:50:24.895283: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at lookup_table_op.cc:809 : Failed precondition: Table not initialized.

Cause analysis

The initialization graph contains the resource operator HashTableV2, while data preprocessing contains the resource operator LookupTableFindV2. These two operators must be used as a pair.

The Ascend AI processor adopts full computing mode by default: all computing operators (including the resource operators in the initialization graph) are executed on the Device side, while data preprocessing is still executed on the Host. As a result, the LookupTableFindV2 operator in data preprocessing and the HashTableV2 operator in the initialization graph do not execute on the same device, and the network fails.

Solution

Modify the training script to enable mixed computing, which keeps the initialization graph containing the resource operator on the Host side for execution. Modify the training script as follows:

from npu_bridge.npu_init import *

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config) as sess:
    sess.run(...)

The configuration parameter “mix_compile_mode” is the switch that enables mixed computing. When it is set to “True”, resource operators that must be used in pairs are kept in the front-end framework for online execution.

Supplementary note: whenever a preprocessing script uses the Table-class APIs under tf.contrib.lookup, which must be used in pairs, enable mixed computing in the same way so that the corresponding operators in the initialization graph are executed on the Host side.
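For reference, the following is a minimal sketch (TensorFlow 1.15) of a preprocessing pipeline that produces exactly this HashTableV2 / LookupTableFindV2 pair; the keys, values, and dataset contents are illustrative only:

import tensorflow as tf

# Creating the table emits a HashTableV2 node in the initialization graph.
keys = tf.constant(["a", "b", "c"])
values = tf.constant([0, 1, 2], dtype=tf.int64)
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(keys, values),
    default_value=-1)

# table.lookup() emits a LookupTableFindV2 node inside the data pipeline.
dataset = tf.data.Dataset.from_tensor_slices(["a", "b", "x"])
dataset = dataset.map(table.lookup)

iterator = dataset.make_initializable_iterator()
next_elem = iterator.get_next()
with tf.Session() as sess:
    sess.run(tf.tables_initializer())  # executes the initialization graph
    sess.run(iterator.initializer)
    print(sess.run(next_elem))  # 0

With mix_compile_mode enabled as above, the HashTableV2 initialization stays on the Host, alongside the LookupTableFindV2 lookups in the preprocessing pipeline.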

02 tf.Variable in data preprocessing causes a training failure

Symptom

When the TensorFlow network is executed, the following error is reported:

tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable inference/embed_continuous from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/inference/embed_continuous/N10tensorflow3VarE does not exist. 

Cause analysis

This problem is caused by a tf.Variable in the data preprocessing script. When the training script runs on the Ascend platform, the tf.Variable is executed on the Host side while its initialization is executed on the Device side. Because the variable and its initialization do not execute on the same device, training fails.

The training script code example using tf.Variable is as follows:

batch_size = tf.Variable(
    tf.placeholder(tf.int64, [], 'batch_size'),
    trainable=False, collections=[]
)
train_dataset = train_dataset.batch(batch_size, drop_remainder=True)

Solution

Modify the training script to replace the tf.Variable with a constant. For example:

batch_size = 64
train_dataset = train_dataset.batch(batch_size, drop_remainder=True)
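For context, here is a minimal runnable sketch (TensorFlow 1.15) of the corrected pipeline; the dataset contents and shapes are illustrative only:

import tensorflow as tf

batch_size = 64  # a plain Python constant instead of tf.Variable
train_dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([256, 8]))
train_dataset = train_dataset.batch(batch_size, drop_remainder=True)

next_batch = train_dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(next_batch).shape)  # (64, 8)

Because batch_size is now a Python constant, no variable has to be initialized, and the Host/Device split described above no longer occurs.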

03 Dynamic-shape network execution reports that V1 control flow operators are not supported

Symptom

When the dynamic shape network of TensorFlow version 1.15 is executed, the following error is reported:

node node_name(node_type) is v1 control operator, which is not supported, please convert to v2 control operator

Cause analysis

The network is a dynamic-shape network and contains V1 control flow operators. Dynamic-shape TensorFlow networks executed on the Ascend AI processor currently do not support V1 control flow operators, so execution fails.

Solution

Convert the TensorFlow V1 control flow operators in the network to their V2 versions, using either of the following methods.

Method 1: Convert the TensorFlow V1 control flow operators to the V2 versions by setting the following environment variable:

export ENABLE_FORCE_V2_CONTROL=1

Method 2: Modify the network script by adding the following two calls immediately after import tensorflow as tf:

tf.enable_control_flow_v2()
tf.enable_resource_variables()
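For example, a minimal sketch (TensorFlow 1.15) in which both switches are set before any part of the graph is built, so that the tf.while_loop below is constructed with V2 control flow operators; the loop itself is illustrative only:

import tensorflow as tf

tf.enable_control_flow_v2()     # must be called before any ops are created
tf.enable_resource_variables()

i = tf.constant(0)
# Built as a single functional While op rather than the V1
# Enter/Exit/Switch/Merge node set.
result = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i])

with tf.Session() as sess:
    print(sess.run(result))  # 10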

04 Poor ReduceSum operator performance during network tuning

Symptom

During network tuning, overall network performance is slow. Collecting profiling data with the Profiling tool and analyzing per-operator performance shows that the ReduceSum operator performs particularly poorly.

Viewing the details of the ReduceSum operator in the Profiling performance data shows the following key fields:

| op_type   | block_dim | input_shape | input_data_type | input_formats |
|-----------|-----------|-------------|-----------------|---------------|
| ReduceSum | 1         | 1,256,256,3 | DT_FLOAT16      | NHWC          |

The input data type (input_data_type) of the ReduceSum operator is “DT_FLOAT16”, and the value of the block_dim field is “1”, indicating that multi-core parallel computing is not enabled for this operator.

Cause analysis

For the ReduceSum operator of the Ascend AI processor, if the input data type is float16, multi-core computing may not be enabled in some scenarios due to hardware limitations.

Solution

When the input data of the ReduceSum operator is float16, there are two possible scenarios:

Scenario 1:

Mixed precision is not enabled during network tuning, and the input data of the ReduceSum operator is natively float16. In this case, if ReduceSum performance is poor, try inserting a Cast operator before the ReduceSum operator to convert its input from float16 to float32.

With float32 input, the ReduceSum operator enables multi-core parallel computing, which improves its performance. A sketch of this change follows.
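The following is a minimal sketch (TensorFlow 1.15) of Scenario 1; the tensor x and the reduction axes are illustrative only:

import tensorflow as tf

x = tf.placeholder(tf.float16, [1, 256, 256, 3], name="x")
x32 = tf.cast(x, tf.float32)              # the inserted Cast operator
y = tf.reduce_sum(x32, axis=[1, 2])       # ReduceSum now receives float32 input

The Cast adds a small conversion cost, but it allows the ReduceSum to run multi-core, which normally outweighs that cost.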

Scenario 2:

Mixed precision is enabled during network tuning, and the input data type of the ReduceSum operator is converted from float32 to float16 as a result. In this case, add the ReduceSum operator to the mixed precision blacklist so that it is not converted to float16, avoiding the performance degradation.

To add the ReduceSum operator to the mixed precision blacklist:

1) Modify the network script to point the modify_mixlist parameter at the operator blacklist file. For example:

# Estimator mode
npu_config = NPURunConfig(
    ...
    precision_mode="allow_mix_precision",
    modify_mixlist="/home/test/ops_info.json"
)

# sess.run mode
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
custom_op.parameter_map["modify_mixlist"].s = tf.compat.as_bytes("/home/test/ops_info.json")

2) Configure the operator blacklist in the ops_info.json file. The configuration example is as follows:

{
    "black-list": {
        "to-add": ["ReduceSumD"]
    }
}

Supplementary note: try this method only when the ReduceSum operator performs poorly and matches the symptom described in this case.

05 Further reading

[1] Ascend Documentation Center (Ascend Community official website)

[2] Ascend Community online courses (Ascend Community developer homepage)

[3] Ascend Forum: https://www.hiascend.com/forum
