Troubleshooting an online full-text index BUG: word segmentation and retrieval of Arabic numerals

When people talk about word segmentation for full-text search, most of the discussion is about Chinese (and Japanese and Korean), and rarely about Latin languages such as English, because English words are naturally separated by spaces.
Even less is said about Arabic numerals, such as amounts, mobile phone numbers, landline numbers, and so on.

What follows is not the traditional from-zero history of MySQL full-text indexing.
I prefer to start with a small problem and weave relevant knowledge points in along the way, not in chronological order.

Starting from an online BUG

We have a population table whose data is merged from multiple data sources, so each user may have multiple mobile phone numbers.
This is easy to understand: some people simply have several mobile numbers, and some change their number frequently, right?
Now there is a feature that needs to find the user associated with a given mobile phone number.

Because each user can have multiple phone numbers in one field, you could use LIKE for fuzzy matching, or maintain a separate t_user_phone table and join against it. The user table has tens of millions of records, so the efficiency of a LIKE scan is definitely unacceptable.

select * from t_user where phone like '%13112345678%'
select u.id,u.username,u.phone from t_user u LEFT JOIN t_user_phone p on u.id = p.user_id where p.phone = '13112345678'

In the end, a full-text index was used (MySQL 5.7.6+).

First, create a full-text index on the phone number column of the user table,
using the built-in ngram word-segmentation parser.

CREATE FULLTEXT INDEX idx_full_text_phone ON t_user (phone) WITH PARSER ngram;

The following statements can then be used to fuzzily query the associated user by mobile phone number.

  1. Boolean mode fuzzy search
select * from t_user where match(phone) AGAINST('13996459860' in boolean mode)
  2. Natural language mode. MySQL defaults to this mode, so if the mode is not explicitly specified, as in the last SQL below, it is still natural language mode.
select * from t_user where match(phone) AGAINST('13996459860' in NATURAL LANGUAGE mode)
or
select * from t_user where match(phone) AGAINST('13996459860')

For our needs, a mobile phone number must be matched in full to count as a hit, so Boolean mode is the choice.
Natural language mode cannot guarantee that.
The difference between Boolean mode and natural language mode will be introduced later.

The above is a brief background introduction.

But...
The damned "but" arrived late, but it arrived.

One day the product manager came over and told me that a certain mobile phone number was associated with hundreds of people.
"Is this normal?" he asked.

If he had said outright "you have a bug here", I might have pushed back (just kidding).
But he put it so tactfully that I had no idea how bad it was.

Don't tell a programmer "your code has bugs". His first reaction will be: ① there is something wrong with your environment; ② do you even know how to use it?
If you put it tactfully instead: "Your program behaves a bit differently from what I expected, could you check whether I am using it wrong?"
He will instinctively think: whoa, there really is a bug!

My intuition told me this was not normal; otherwise, is this person running telecom fraud, or juggling hundreds of contacts?
I took the mobile phone number and checked it in the database. Using full-text search in Boolean mode, multiple people are indeed associated with it.
And it is indeed a BUG.

Let's reproduce it from scratch.
First create a test user table.
The phone field gets a full-text index using the ngram tokenizer.

CREATE TABLE `t_user` (
  `id` int(11) NOT NULL,
  `username` varchar(10) COLLATE utf8_bin DEFAULT NULL,
  `phone` varchar(50) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `idx_full_text_phone` (`phone`) /*!50100 WITH PARSER `ngram` */
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Insert some test data

-- ----------------------------
-- Records of t_user
-- ----------------------------
INSERT INTO `t_user` VALUES ('1', 'Zhang San', '13996459860,15987569874,0797-12345');
INSERT INTO `t_user` VALUES ('2', 'Li Si', '0797-6789');
INSERT INTO `t_user` VALUES ('3', 'Wang Wu', '0797-94649');

Under normal circumstances

select * from t_user where match(phone) AGAINST('13996459860' in boolean mode)
select * from t_user where match(phone) AGAINST('13996459860' in NATURAL LANGUAGE mode)
select * from t_user where match(phone) AGAINST('13996459860')

all return only Zhang San's record, as expected.

The abnormal situation

select * from t_user where match(phone) AGAINST('0797-12345' in boolean mode)

returns all three records.


You can see that the latter two records are not the expected results.
This is exactly the problem the product manager reported.

You have probably guessed it already: it is the landline numbers.
Well, a user can have a landline too, which is perfectly reasonable.

It is all contact information, after all.

It looks as if this SQL returns every row containing 0797, yet I am using Boolean mode, which should require the whole 0797-12345 to match before returning anything.

My first guess was that the '-' caused a word-segmentation problem, splitting the number into two parts.

The tokenizer

Word segmentation means splitting the indexed text and the search keywords into tokens. When MySQL first supported full-text indexing, it only had the default parser (a Latin-style tokenizer that splits on spaces). English such as "I am programmer" is naturally split by spaces into the three words I, am, programmer, which is why, as I said earlier, English has no word-segmentation problem.

But that does not work for languages like Chinese that do not separate words with spaces. Therefore MySQL 5.7.6 introduced the ngram parser (a character-length tokenizer), which makes full-text indexing much friendlier to Chinese. Using it is simple: add WITH PARSER ngram when creating the index to use the ngram parser; without it, the traditional (Latin, space-based) parser is used by default.

Pay attention to the phrase "character-length tokenizer": as the name suggests, it segments text by character length. I call this out because it is different from segmentation based on NLP natural semantics, such as the Fudan tokenizer.

For example, the short sentence "I am a programmer" (我是程序员), segmented by natural semantics, might be split into "I", "am", "program", "programmer", and so on.
It would never produce a meaningless fragment such as "序员" (the last two characters of "programmer") on their own, unless the tokenizer itself were broken.

But the ngram parser can. MySQL's default segmentation length is 2; it can be configured in my.cnf, where ngram_token_size = 2 specifies the token length.

For different segmentation lengths, the phrase 我是程序员 ("I am a programmer", five characters) is segmented as follows:

ngram_token_size=1: '我', '是', '程', '序', '员'
ngram_token_size=2: '我是', '是程', '程序', '序员'
ngram_token_size=3: '我是程', '是程序', '程序员'
...
ngram_token_size=5: '我是程序员'
...
The maximum ngram_token_size is 10.
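To confirm what a given instance actually uses, you can check the variable; note that it is read-only at runtime and must be set before the FULLTEXT index is built. A minimal sketch:

-- Check the current n-gram token size (default 2, maximum 10)
SHOW VARIABLES LIKE 'ngram_token_size';
-- To change it, set it in my.cnf under [mysqld] (ngram_token_size=2), restart, and rebuild existing FULLTEXT indexes.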

My test library's ngram_token_size is 2. I added a text column for a simple test.

A single character cannot be found, because the minimum segmentation unit is 2 characters.

Searching for '程序' (program) and '程序员' (programmer) both return correct results.
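A minimal sketch of that test, assuming a hypothetical remark column added to t_user with its own ngram full-text index:

-- Hypothetical column and index, used only for this segmentation test
ALTER TABLE t_user ADD COLUMN remark varchar(50) DEFAULT NULL;
ALTER TABLE t_user ADD FULLTEXT INDEX idx_full_text_remark (remark) WITH PARSER ngram;
UPDATE t_user SET remark = '我是程序员' WHERE id = 1;

-- With ngram_token_size=2, a single character is below the minimum token length: no hit
select * from t_user where match(remark) AGAINST('程' in boolean mode);

-- Two or more characters are segmented into 2-grams and can be matched
select * from t_user where match(remark) AGAINST('程序' in boolean mode);
select * from t_user where match(remark) AGAINST('程序员' in boolean mode);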



The above covers word segmentation of Chinese characters. Back to today's topic: what about Arabic numerals? For example an amount of 23.45 yuan, a mobile number 13912345678, a landline number 0797-12345678, a date 2023-01-01, and so on.

In the BUG above, the landline number 0797-12345678 was associated with multiple numbers that start with 0797 followed by '-'.
At first I thought the '-' was the problem, splitting 0797-12345678 into two parts, 0797 and 12345678.

But from the introduction of the ngram parser in this section we know that it is a purely length-based tokenizer, so that cannot be the reason.

The following two SQL statements prove that the index is indeed split into 2-grams.

select * from t_user where match(phone) AGAINST('7-' in NATURAL LANGUAGE mode)
select * from t_user where match(phone) AGAINST('07' in boolean mode)

Both '7-' and '07' match all 3 records.

But in Boolean mode, '7-' finds nothing.


Why?

Here MySQL treats the '-' in '7-' as a logical operator, rather than as part of the search keyword.

stopword

MySQL's built-in full-text parser compares each word against the entries in the stopword list. If a word is in the stopword list, it is excluded from the index.

The ngram parser handles stopwords differently: it excludes not only tokens equal to a stopword, but any token that contains a stopword.

For example, with ngram_token_size=2, a document containing "a,b" is parsed into the tokens "a," and ",b".
If the comma "," is defined as a stopword, both "a," and ",b" are excluded from the index because they contain a comma.

Similarly, if the stopword list contained '-' and ngram_token_size=4, the landline number 0797-1789 would effectively be split into two big parts, 0797 and 1789,
because the tokens in between, '797-', '97-1', '7-17', '-178', would all be excluded.

If this conjecture were true, it could produce the BUG described at the beginning. The premise is that the stopword list contains '-'.

In InnoDB, the default stopword list can be viewed in the
INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.
Word-segmentation rules can be changed by supplying your own stopword table (the default table itself is read-only).

Checking the list shows that '-' is not a stopword, so the conjecture is wrong: the BUG is not caused by this.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a |
| about |
| an |
| are |
| as |
| at |
| be |
| by |
| com |
| de |
| en |
| for |
| from |
| how |
| i |
| in |
| is |
| it |
| la |
| of |
| on |
| or |
| that |
| the |
| this |
| to |
| was |
| what |
| when |
| where |
| who |
| will |
| with |
| und |
| the |
| www |
+-------+
36 rows in set (0.00 sec)
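As an aside, since the default list is read-only, the supported way to change segmentation rules is to create your own stopword table and point InnoDB at it. A minimal sketch, assuming a scratch schema named test:

-- Hypothetical custom stopword table: InnoDB, with a single VARCHAR column named value
CREATE TABLE test.my_stopword (value VARCHAR(30)) ENGINE = InnoDB;
INSERT INTO test.my_stopword VALUES ('the'), ('a');

-- Point InnoDB full-text indexing at the custom list
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopword';

-- Rebuild the FULLTEXT index so the new stopword list takes effect
-- ALTER TABLE t_user DROP INDEX idx_full_text_phone;
-- CREATE FULLTEXT INDEX idx_full_text_phone ON t_user (phone) WITH PARSER ngram;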

The two search modes

MySQL full-text search has two commonly used modes: natural language mode and Boolean mode.

Natural language mode

For natural language mode searches, the search term is converted to a union of ngram values. For example, the string abc (assuming ngram_token_size=2) is converted to ab bc. Given two documents, one containing ab and the other abc, the search term ab bc matches both documents.

It can be simply understood as splitting the search keywords and pattern matching with the document.

For example, with the test data above, a natural-language search for '12' or for '0797' hits every row that contains those grams.
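A small sketch against the test table (the expected hits are reasoned from the gram splitting above, not measured):

-- '12' is a single 2-gram; it hits the row whose phone field contains '12'
select * from t_user where match(phone) AGAINST('12' in NATURAL LANGUAGE mode);

-- '0797-12345' is split into the grams 07, 79, 97, 7-, -1, 12, 23, 34, 45 and OR'ed together,
-- so it should hit all three test rows, since every row shares at least '07', '79', '97'
select * from t_user where match(phone) AGAINST('0797-12345' in NATURAL LANGUAGE mode);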

Boolean mode

For boolean mode searches, search terms are converted to ngram phrase searches. For example, the string abc (assuming ngram_token_size=2) is converted to ab bc. Given two documents, one containing ab and the other containing abc, the search phrase ab bc matches only documents containing abc.

It can be understood that keywords will not be split again, which is equivalent to full matching of search keywords.

Using the same test data and the same search terms, search in Boolean mode.

The result is empty; no rows are hit.

But...

Searching for '0797-12345' in Boolean mode does hit 0797-94649 and 0797-1789,
while searching for '07', '09', '12' and the like hits nothing.

The only explanation I can offer is that in Boolean mode the '-' in the search keyword 0797-12345 is treated as an operator, so the keyword is split into two parts, 0797 and 12345.

However, I found no explicit confirmation of this on the MySQL official website, so treat it with some doubt; readers should think it through themselves and not be misled by me!

The same Boolean-mode operator handling is also why '7-' hit no records in the earlier section.

Now let’s discuss the logical operators in Boolean mode.

Logical operators in Boolean mode

  • + : both terms must appear.
select * from t_user where match(phone) AGAINST('a + b' in boolean mode)
    Here + is recognized as a logical operator, not as part of the literal string 'a + b' (the same applies below). 'a + b' means a row must contain both 'a' and 'b' to match.
  • - : the term that follows it must not appear.
select * from t_user where match(phone) AGAINST('0797 -12345' in boolean mode)
    '0797 -12345' means the row must contain 0797 and must not contain 12345, so this query excludes records containing 0797-12345. Note the spaces around '-': '0797-12345' behaves like '0797 - 12345', not like '0797 -12345'.
  • > and < : raise or lower the weight a matching term contributes to the row's relevance.
select * from t_user where match(phone) AGAINST('0797(>94649 <12345)' in boolean mode)
    means: match 0797, and rank rows containing 94649 towards the front and rows containing 12345 towards the back. (In natural language mode, e.g. AGAINST('a > b' in NATURAL LANGUAGE mode), these characters are not treated as operators.)
  • ( ) : expression grouping; see the previous example.
  • * : wildcard; it can only appear at the end of a search term.
  • " " : exact phrase; the words enclosed in double quotes must be matched as a whole.
select * from t_user where match(phone) AGAINST('"0797-1789"' in boolean mode)
    "0797-1789" cannot be split further, so records that only contain 0797-12345 or similar no longer match.
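Putting a few of these operators together against the test table (a sketch; the exact rows returned depend on the data actually inserted):

-- '+' : rows must contain both 0797 and 12345
select * from t_user where match(phone) AGAINST('+0797 +12345' in boolean mode);

-- '-' : rows containing 0797 but not 12345
select * from t_user where match(phone) AGAINST('+0797 -12345' in boolean mode);

-- '"' : the whole quoted phrase must match; the other 0797-xxxx rows are filtered out
select * from t_user where match(phone) AGAINST('"0797-12345"' in boolean mode);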

Solution

Now, let's get back to the original problem.
We ran into an issue where a full-text search for the landline number 0797-1789 returns records that do not match it exactly.


So, what do we need to do to get an exact match?
After the journey above, we have two options (see the sketch after this list).

  1. Wrap the landline number in double quotes, "0797-1789", meaning the search keyword cannot be split further. It then naturally has to match in full.
  2. Split it ourselves, then use +.
    We know the landline number matches inexact records because the '-' in it is treated as a logical operator, which splits the number into two parts. So we split the number into its two parts ourselves and use the '+' operator to require that both parts be present before a row is returned.
    The first method is recommended.
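A sketch of both options against the test table (assuming a row whose phone field contains the landline 0797-1789):

-- Option 1 (recommended): exact phrase, the quoted number cannot be split further
select * from t_user where match(phone) AGAINST('"0797-1789"' in boolean mode);

-- Option 2: split on '-' ourselves and require both parts with '+'
select * from t_user where match(phone) AGAINST('+0797 +1789' in boolean mode);

Note that option 2 only requires both parts to appear somewhere in the field, not adjacent to each other, which is one more reason to prefer the first method.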

Other phone number formats, such as area code plus number (023 + 12345678) or international long distance (0086-10-1234567, +86-573-82651630, 610-643-4567, etc.),
also involve characters such as '+' and '-' that are Boolean-mode logical operators, so the first method is the safest.

Inverted Index

A full-text index is an inverted index.
The term is probably more familiar from Lucene or Elasticsearch.

At the end of the article, I will briefly explain its principle.

A traditional database index maps a value to the rows that hold it, i.e. [table -> field]. An inverted index first segments the field into words, then associates each word with the documents that contain it, i.e. [word -> document], and can record richer information as well (document ids, term frequency, term positions, and the character offsets where terms start and end).

There are two articles:

1 I am a programmer

2 I love writing programs

Word segmentation first (assuming natural semantic word segmentation here)

1 【I】【am】【program】【programmer】

2 【I】【love】【write】【program】

Previously each article was mapped to its keywords; after inversion, each keyword is mapped to the articles that contain it.

To locate matches quickly and keep storage small, the frequency and positions of each keyword are also recorded.

If I search for "program", I can quickly locate documents 1 and 2, and I immediately know how many times it appears in each document and where.
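In MySQL you can actually peek at this inverted structure for an InnoDB full-text index through INFORMATION_SCHEMA. A sketch, assuming the test table lives in a schema named test:

-- Tell InnoDB which table's full-text auxiliary tables to expose
SET GLOBAL innodb_ft_aux_table = 'test/t_user';

-- Tokens still in the in-memory index cache (recently inserted rows)
SELECT word, doc_id, position FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY word, doc_id, position;

-- Tokens already flushed to the on-disk index (may be empty until OPTIMIZE TABLE is run)
SELECT word, doc_id, position FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE ORDER BY word, doc_id, position;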

Summary

On word segmentation: MySQL has two parsers. One is the default space-based (Latin) parser; for example, 'i love you' is split into the three words i, love, you. Since 5.7.6 a length-based tokenizer, the ngram parser, is built in for CJK characters.
This tokenizer does not distinguish between Chinese characters and Arabic numerals; both are segmented by the same rule.
However, special attention is needed in Boolean mode when the text contains characters that double as logical operators (+ - > < * ( ) ").

At the same time, MySQL's full-text index has many limitations of its own; when they matter, use Elasticsearch without hesitation:

1: Only char, varchar, and text columns are supported.
2: MySQL's full-text index performs well only when it fits entirely in memory; if it does not, performance can be very slow (you can configure a separate key cache for the full-text index so that it is not squeezed out of memory by other index caches).
3: Compared with other index types, insert, update, and delete operations are much more expensive on a full-text index. It also fragments more, so more optimize table operations may be needed.
4: The full-text index has the highest priority among indexes: if a query can use it, MySQL skips the cost comparison and chooses the full-text index even when a more suitable index is available.
5: The full-text index does not store the actual value of the indexed column, so it cannot be used for an index-covering scan.
6: Apart from relevance ordering, the full-text index cannot be used for any other sorting; if the query needs a different ordering, a filesort is required.