Fuzzy string matching in Python

Hi~ Hello everyone, this is the Demon King? ~!


This article shows how to use thefuzz library, which allows us to do fuzzy string matching in Python.

Furthermore, we will learn how to use the process module, which allows us to efficiently match or extract strings with the help of fuzzy string logic.

Use thefuzz module to match fuzzy strings

This library went by a different, rather interesting name in older versions and has since been renamed.

Its current version is called thefuzz, so that’s what you install with the command below.

pip install thefuzz

However, if you look at examples online, you’ll find that some of them use the old name, fuzzywuzzy. The package is no longer maintained under that name, but you may still find examples that import it.
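If you have to run older examples next to newer ones, a small import fallback can help. This is only a sketch: it assumes both packages expose a fuzz submodule with the same functions, and the helper name find_fuzz_module is made up for this illustration.

```python
import importlib


def find_fuzz_module():
    """Return the fuzz module from thefuzz, or from fuzzywuzzy as a fallback."""
    for package in ("thefuzz", "fuzzywuzzy"):
        try:
            return importlib.import_module(package + ".fuzz")
        except ImportError:
            # This package is not installed, try the next name.
            continue
    return None  # Neither package is installed.


fuzz_module = find_fuzz_module()
if fuzz_module is None:
    print("Install thefuzz first: pip install thefuzz")
else:
    print("Loaded", fuzz_module.__name__)
```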

thefuzz library is based on python-Levenshtein, so you have to install it with this command.

pip install python-Levenshtein

And if you encounter some problems during the installation process, you can use the following command. If you encounter the error again, then you can search on Google to find relevant solutions.

pip install python-Levenshtein-wheels

Essentially, fuzzy string matching is like matching with a regex or comparing two strings, except that the answer is not a strict yes or no.

In the case of fuzzy logic, the truth value of your condition can be any real number between 0 and 1.

So basically, instead of saying anything is True or False, you just give it any value between 0 and 1.

Fuzzy matching calculates the dissimilarity between two strings using a distance metric, such as the Levenshtein (edit) distance, and expresses it as a single value called the distance.
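To make the idea of a distance concrete, here is a minimal pure-Python sketch of the classic Levenshtein (edit) distance. It is written only for illustration and is not how thefuzz computes its scores internally; the function name levenshtein is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # Turning a[:i] into an empty string takes i deletions.
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("Just a test", "just a test"))  # 1 (J -> j)
print(levenshtein("kitten", "sitting"))           # 3
```

A distance of 0 means the strings are identical, and larger values mean more edits are needed.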

Given two strings, an algorithm computes the distance between them, and the library turns it into a similarity score. Once you complete the installation process, you must import fuzz and process from thefuzz module.

from thefuzz import fuzz, process

Before using fuzz, we will manually check the dissimilarity between two strings.

ST1='Just a test'
ST2='just a test'
print(ST1==ST2)
print(ST1!=ST2)

The comparison operators only return a boolean: the strings are either equal or not. They do not tell us how similar the two strings are.

False
True

Fuzzy string matching gives us a graded answer instead. Take the two strings above, which differ only in the capital J.

If we now call the ratio() function, which gives us a measure of similarity on a scale from 0 to 100, we get a pretty high score: 91, but not 100.

from thefuzz import fuzz, process
print(fuzz.ratio(ST1, ST2))

Output:

91
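To see where a number like 91 can come from, here is a rough sketch using difflib from the standard library. It assumes the simple ratio is twice the number of matching characters divided by the total length of both strings; thefuzz's actual implementation may differ between versions, and simple_ratio is a name made up for this example.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100: 2 * matching characters / total characters."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


score = simple_ratio('Just a test', 'just a test')
print(score)        # 91, because 10 of 11 characters match: 2 * 10 / 22
print(score / 100)  # 0.91, the same score as a truth value between 0 and 1
```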

Now take a look at what it returns when the strings differ more, for example when we don’t just change one character but compare a substantially different string.

ST1='This is a test string for test'
ST2='There are some test string for testing'
print(fuzz.ratio(ST1,ST2))

There are still some similarities, so the score is 75; this is just a simple ratio, nothing complicated.

75

We can also try things like partial ratios. For example, we have two strings and we want to score how well one of them matches a part of the other.

ST1='There are test'
ST2='There are test string for testing'
print(fuzz.partial_ratio(ST1,ST2))

Using partial_ratio() we get 100, since the shorter string (There are test) appears in full as a substring of the longer one.

ST2 contains some additional words, but that doesn’t matter because partial_ratio() only looks for the best-matching part; the simple ratio() would give a lower score.

100
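The idea behind a partial ratio can be sketched as a sliding window: take the shorter string, compare it with every slice of the same length in the longer string, and keep the best score. This is a simplified illustration built on difflib, not thefuzz's actual algorithm, and partial_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length window."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


ST1 = 'There are test'
ST2 = 'There are test string for testing'
print(partial_ratio_sketch(ST1, ST2))  # 100, ST1 is exactly the first window of ST2
```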

Suppose we have similar strings, but in a different order; then, we use another metric.

CASE_1='This generation rules the nation'
CASE_2='Rules the nation This generation'

Both strings contain exactly the same words and mean the same thing, but ratio() reports a considerable difference, and partial_ratio() still reports a difference too.

If we use token_sort_ratio(), the score will be 100 since it’s basically the exact same text in a different order.

token_sort_ratio() sorts the individual tokens before comparing, so their original order doesn’t matter.

print(fuzz.ratio(CASE_1,CASE_2))
print(fuzz.partial_ratio(CASE_1,CASE_2))
print(fuzz.token_sort_ratio(CASE_1,CASE_2))

Output:

47
64
100
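Here is a small sketch of the sorting step, to show why word order stops mattering. It assumes tokens are lowercased and split on whitespace; the real library may also strip punctuation, and sort_tokens is a name invented for this example.

```python
def sort_tokens(text: str) -> str:
    """Lowercase the text, split it into words, and join them in sorted order."""
    return " ".join(sorted(text.lower().split()))


CASE_1 = 'This generation rules the nation'
CASE_2 = 'Rules the nation This generation'

print(sort_tokens(CASE_1))                         # generation nation rules the this
print(sort_tokens(CASE_1) == sort_tokens(CASE_2))  # True
```

Once both strings are normalized like this they are identical, so a plain ratio on the sorted versions gives 100.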

Now, if we add or change a word, we get a different number, but the idea is the same: the score still doesn’t depend on the order of the individual tokens.

CASE_1='This generation rules the nation'
CASE_2='Rules the nation has This generation'
print(fuzz.ratio(CASE_1,CASE_2))
print(fuzz.partial_ratio(CASE_1,CASE_2))
print(fuzz.token_sort_ratio(CASE_1,CASE_2))

Output:

44
64
94

token_sort_ratio() is lower this time because CASE_2 has an extra word in it. There is also something called token_set_ratio(), which treats each string as a set containing each token only once.

So, it doesn’t matter how often a token appears; let’s look at an example string.

CASE_1='This generation'
CASE_2='This This generation generation generation generation'
print(fuzz.ratio(CASE_1,CASE_2))
print(fuzz.partial_ratio(CASE_1,CASE_2))
print(fuzz.token_sort_ratio(CASE_1,CASE_2))
print(fuzz.token_set_ratio(CASE_1,CASE_2))

Some of the other scores are lower, but we get a score of 100 using the token_set_ratio() function because the same two tokens, This and generation, are present in both strings.
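The set idea can be sketched as follows: split both strings into sets of tokens, then compare the shared tokens against each string’s shared-plus-leftover tokens and keep the best score. This is a simplified approximation built on difflib; the helper names are made up, and the library’s own preprocessing and scoring may differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Plain 0-100 similarity between two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(s1: str, s2: str) -> int:
    """Compare two strings by their sets of tokens; duplicates are ignored."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(simple_ratio(common, combined1),
               simple_ratio(common, combined2),
               simple_ratio(combined1, combined2))


CASE_1 = 'This generation'
CASE_2 = 'This This generation generation generation generation'
print(token_set_sketch(CASE_1, CASE_2))  # 100, both strings reduce to the same set
```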

Use the process module for efficient fuzzy string matching

Besides fuzz there is also process, which is helpful because it uses this fuzzy matching to extract the best matches from a collection.

For example, we have prepared a few list items to demonstrate.

Diff_items=['programing language','Native language','React language',
        'People stuff', 'This generation', 'Coding and stuff']

Some of them are very similar, as you can see (Native language and programing language), and now we can pick the best individual matches.

We could do it manually and just evaluate the scores and pick the best candidates, but we could also use process.

To do this, we must call the extract() function in the process module.

It takes a few parameters: the first is the target string, the second is the collection you want to extract from, and the third is limit, which caps the number of matches returned (two in this example).

For example, if we want to extract something like language, this selects programing language and Native language.

print(process.extract('language',Diff_items,limit=2))

Output:

[('programing language', 90), ('Native language', 90)]
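Conceptually, extract() scores every item in the collection against the target, sorts by score, and keeps the first few. The sketch below imitates that with a plain difflib ratio. As far as I know, thefuzz uses a weighted scorer by default, so the numbers (and even the order) will not match the output above; extract_sketch is a hypothetical helper.

```python
from difflib import SequenceMatcher


def extract_sketch(query, choices, limit=5):
    """Score every choice against the query and return the best (choice, score) pairs."""
    scored = []
    for choice in choices:
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        scored.append((choice, round(100 * matcher.ratio())))
    # Highest score first, then cut the list down to `limit` entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


Diff_items = ['programing language', 'Native language', 'React language',
              'People stuff', 'This generation', 'Coding and stuff']
print(extract_sketch('language', Diff_items, limit=2))
```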

The issue is:

  1. This is not NLP (natural language processing);

  2. There is no intelligence behind this;

  3. It just looks at individual tokens.

So, for example, suppose we use programing (spelled as in the list) as the target string and run this.

The first match will be programing language, but the second match will be Native language rather than Coding and stuff.

Semantically, coding is closer to programming, but that doesn’t matter here because the comparison works purely on the text; we are not using AI.

Diff_items=['programing language','Native language','React language',
        'People stuff', 'Hello World', 'Coding and stuff']
print(process.extract('programing',Diff_items,limit=2))

Output:

[('programing language', 90), ('Native language', 36)]

One final example of how this can be useful:

We have a huge library and would like to find a book, but we don’t know its exact title.

In this case, we can use extract() and pass fuzz.token_sort_ratio to the scorer parameter.

LISt_OF_Books=['The python everyone volume 1 - Beginner',
               'The python everyone volume 2 - Machine Learning',
               'The python everyone volume 3 - Data Science',
               'The python everyone volume 4 - Finance',
               'The python everyone volume 5 - Neural Network',
               'The python everyone volume 6 - Computer Vision',
               'Different Data Science book',
               'Java everyone beginner book',
               'python everyone Algorithms and Data Structure']
print(process.extract('python Data Science',LISt_OF_Books,limit=3,scorer=fuzz.token_sort_ratio))

Note that we passed the function itself; we didn’t call it. The top result is the data science volume, and the other data science book comes second.

Output:

[('The python everyone volume 3 - Data Science', 63), ('Different Data Science book', 61), ('python everyone Algorithms and Data Structure', 47)]
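The difference between passing a function and calling it is easier to see in a small sketch. Here rank() is a hypothetical stand-in for extract(), token_sort_score is a rough difflib-based imitation of a token-sort scorer, and the book list is shortened and made up for the illustration.

```python
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> int:
    """Compare two strings after lowercasing and sorting their words."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def rank(query, choices, scorer, limit=3):
    """Call scorer(query, choice) once per choice and return the top matches."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


books = ['Different Data Science book',
         'Java everyone beginner book',
         'Science Data python']

# scorer=token_sort_score passes the function object; rank() calls it later.
top = rank('python Data Science', books, scorer=token_sort_score, limit=2)
print(top[0])  # ('Science Data python', 100), same words in a different order
```

Writing scorer=token_sort_score() with parentheses would try to call the function immediately, without any arguments, and raise a TypeError.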

This approach is fairly accurate, and if you have a project where you need to find things in a fuzzy way, it can be quite helpful.

You can also use it to automate your own procedures.

For more help, GitHub and Stack Overflow are good additional resources.

Epilogue

Finally, thank you for reading my article~ This flight ends here

I hope this article has been helpful and that you learned something from it~

The hidden stars are also working hard to shine, and you should work hard too (let’s work hard together).
