Character set issues when migrating Oracle data to MogDB/openGauss

1. Problem Overview

When Oracle stores data, a single table sometimes ends up containing both UTF8-encoded and GBK-encoded characters, and it may also contain genuinely garbled data.

NLS_CHARACTERSET is the database character set, NLS_NCHAR_CHARACTERSET is the national character set, and NLS_LANG is an environment variable of the Oracle database client.
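As a quick check on the source side, the two database-level settings can be queried from the Oracle data dictionary (a minimal example; NLS_LANG itself is read from the client environment, not from the database):

-- Oracle source side: database character set and national character set
select parameter, value
  from nls_database_parameters
 where parameter in ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');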

The root of the problem is that once a character set is specified, Oracle does not strictly validate inserted data against it (garbage in, garbage out). PostgreSQL-family databases, by contrast, fix the database character set when the database is created and generally validate table data strictly against it (with the exception of SQL_ASCII).

As a result, characters that actually match the original character set can be imported into a database of the corresponding character set, but characters of other encodings that Oracle stored verbatim, and which do not match the declared character set, may fail to convert. Garbled data likewise may not be directly loadable into a PostgreSQL-family database. (Oracle will happily store garbled data; PostgreSQL-family databases validate strictly.)
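As a quick illustration of this strict validation, trying to interpret GBK bytes as UTF8 in openGauss/MogDB fails immediately. A minimal sketch (the bytes are the GBK encoding of “你”), which raises an error similar to the following:

postgres=# select convert_from('\xc4e3'::bytea, 'UTF8');
ERROR:  invalid byte sequence for encoding "UTF8": 0xc4 0xe3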

Some common scenarios are as follows:

1) If the database character set happens to be UTF8 and matches the client's NLS_LANG, Oracle inserts the data directly without any conversion.

2) If NLS_LANG is set to UTF8 but the input is actually ZHS16GBK-encoded, Oracle performs no conversion and stores the ZHS16GBK-encoded bytes directly in the database.

3) If the database character set is AL32UTF8 and NLS_LANG is set to ZHS16GBK, but the input is actually UTF8, Oracle treats the input UTF8 bytes as ZHS16GBK and converts them to UTF8 before storing them. In this case garbled characters appear.

4) The character set configured in terminal tools (PuTTY, SecureCRT, and other SSH clients) can also lead to garbled display or unwanted encoding conversion. The client character set NLS_LANG should be consistent with the character set the terminal tool displays.

2. Solutions

1. Use a database with the SQL_ASCII character set

If Chinese display and garbled output are not a concern, you can create a SQL_ASCII database on the target side; all data can then be imported into the new PostgreSQL-family database, because no encoding conversion takes place. This setting is rarely a good way to declare the encoding in use, since it effectively ignores encoding. In most cases, if any non-ASCII data is involved, SQL_ASCII is not recommended, because openGauss/MogDB can neither convert nor validate non-ASCII characters under it.
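If you do take this route, the target database can be created roughly as follows (a sketch using the PostgreSQL-style CREATE DATABASE syntax that openGauss/MogDB also accepts; the database name is a placeholder):

-- Bytes are stored as-is; no encoding validation or conversion is performed
CREATE DATABASE mig_ascii
    WITH ENCODING = 'SQL_ASCII'
         LC_COLLATE = 'C'
         LC_CTYPE = 'C'
         TEMPLATE = template0;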

2. Create a database with the same character set encoding as the original database

Oracle requires a character set to be specified when a database is created. For the reasons described above, an Oracle database declared with one character set may still contain characters of other encodings as well as garbled data. This is clearly not compliant: garbled data cannot be counted as valid data, and data in other encodings violates the original design and cannot be treated as normal data either.

Different encodings also store characters with different byte lengths: a Chinese character occupies 3 bytes in UTF8 but 2 bytes in GBK.

For example, the character “你” (“you”) stored in Oracle as UTF-8 occupies 3 bytes, “\xe4bda0”; stored as GBK it occupies 2 bytes, “\xc4e3”.
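These byte lengths can be verified directly on the openGauss/MogDB side (a sketch run in a UTF8 database; convert_to re-encodes the text into the named character set):

postgres=# select octet_length(convert_to('你','UTF8')) as utf8_bytes, octet_length(convert_to('你','GBK')) as gbk_bytes;
 utf8_bytes | gbk_bytes
------------+-----------
          3 |         2
(1 row)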

Because the lengths are inconsistent, when the data is converted into a PostgreSQL-family database and it cannot be determined which encoding a given value uses, some data cannot be converted according to any single fixed character set. When migrating data with MTK (Enmotech's migration tool), you may receive an error similar to the following.

(screenshot: MTK migration error message)

For this kind of error, the problem was introduced when the data was written to the original database. Fixing it in bulk at the database layer would take a great deal of effort, so it is recommended to standardize the process: migrate only the data that is “correct” according to the database character set, locate the remaining problematic data during the migration or further determine which character set that data should originally belong to, and then have the application correct the problematic data itself.

(1) Query the stored encoding

When problem data is encountered, we can locate the corresponding row by its other, normal columns and use Oracle's dump function to view the encoding actually stored in the Oracle database.

select dump(sms_content,1016) code, sms_content from tab_xxx where mo_id=123123123132312;

CODE SMS_CONTENT
-------------------------------------------------- -------------------------------------------------- --------------------------------------------------
Typ=1 Len=64 CharacterSet=ZHS16GBK: 78,20,20,20,20 x :(; ""¥"€""
,20,20,20,20,20,20,20,20,20,3a,28,a3,bb,20,20,20,2
0,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,
22,a1,b0,20,20,20,20,20,20,20,20,20,20,20,a3,a4,20
,20,20,20,a1,b1,80,a1,b0,22

(2) Check the real character encoding using functions or tools

You can then convert the encoding of the problem data according to different byte widths and check whether the converted result is normal, readable data. There is no need to pay extra attention to how the originally queried data is displayed, since that may be affected by the local terminal's character set; the stored bytes themselves may be fine. For example, querying and exporting the problem column gave the following result.

(screenshot: exported contents of the problem column)

Take part of the encoding and convert it: interpreting “\xe59089e69e97” two bytes at a time as GBK yields “钖夋滠”, while interpreting it as UTF-8 yields “吉林” (“Jilin”). The original data therefore should be in the UTF-8 character set.

You can use the relevant conversion functions, or online tools such as: Character Encoding/Decoding – Online Tools

postgres=# select convert_from(decode('e59089e69e97','hex')::bytea,'utf-8');
 convert_from
---------------
 吉林
(1 row)
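For comparison, decoding the same bytes as GBK should reproduce the unreadable string quoted above rather than readable text, which is how the two candidate encodings can be told apart (a sketch; convert_from simply fails if the bytes are not valid in the named encoding):

postgres=# select convert_from(decode('e59089e69e97','hex')::bytea,'GBK');
 convert_from
--------------
 钖夋滠
(1 row)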

When this data is displayed correctly, it should look like this:

(screenshot: the data displayed correctly)

With this method and an understanding of the business data, you can distinguish which character set the original data should belong to and separate out the data that is not garbled.

(3) What needs to be handled during the MTK migration process

Manual processing with COPY

Because MTK lets you specify how many rows are imported per batch, when imported data runs into this kind of character set problem we usually add the igErrorData parameter, which rolls back the entire problematic batch while the remaining data continues to be loaded into the database.
While rolling back the problem batch, this parameter also records that batch's data in an error file, generating a CSV file.

Note that this CSV corresponds to one migration batch as divided by MTK, and every batch contains the same number of rows. Even if the batch contains only a single problematic row, the whole batch is written to the CSV. You can inspect the CSV first to filter out, or at least locate, the problem data.

At this point the focus is on the normal data in this CSV, which can be loaded outside of MTK with the copy command. The problem data can also be loaded temporarily first, if the garbled query results caused by the character set issue are acceptable for the time being.

You can then use the COMPATIBLE_ILLEGAL_CHARS parameter of openGauss/MogDB's COPY, which enables fault-tolerant import of illegal characters and is only valid for COPY FROM. When problematic data is imported this way, a fixed rule applies: '\0' is converted to a space, and other illegal characters are converted to a question mark. This is the data mentioned above that was written incorrectly into the original database, and it should be corrected by the business side later.

(screenshot: COMPATIBLE_ILLEGAL_CHARS conversion rules)
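A minimal sketch of such a fault-tolerant import of the error-batch CSV; the table name and file path are placeholders, and while COMPATIBLE_ILLEGAL_CHARS itself comes from the openGauss/MogDB COPY documentation, the exact placement of the option should be checked against the version you are using:

-- '\0' in the problem rows becomes a space, other illegal characters become '?'
-- run as a user allowed to COPY from a server-side file, or use \copy from gsql
COPY tab_xxx FROM '/data/mtk_err/tab_xxx_batch_001.csv' CSV COMPATIBLE_ILLEGAL_CHARS;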

Automatic processing with MTK

MTK (v2.9.4) adds a new compatibleIllegalChars option, which determines whether COPY is executed with the COMPATIBLE_ILLEGAL_CHARS option when migrating to MogDB/openGauss with MTK.

If it is enabled, MTK directly applies the conversion rules described above and writes the data; the problem data still needs to be corrected by the business side. (It is generally recommended to first manually generate a CSV of the erroneous data, locate the problematic rows, and then run the automatic migration.)