Python non-violent drifting – graphic verification code

Hey everyone! I am cheese.

First, let’s look at the types of verification codes (captchas) you are likely to encounter:

1) Graphic verification code

The graphic verification code is the simplest kind of verification code. It is the earliest type and still the most common today. It usually consists of 4 letters, 4 digits, or a mixture of both;

2) Slide verification code

3) Click (touch) verification code


The three types above are the most common verification codes on PC at present.

Of course, mobile apps also use gesture verification, grid verification, voice verification, etc.

I won’t introduce those here; the focus is on the three common types above;

1 Graphic verification code

After opening the site, the registration page is shown by default.

Click the login button. If no verification code appears, just refresh the page a few times;

2 Introduction

To recognize the graphic verification code, you need to install the tesserocr library.

First, a brief introduction to tesserocr:

tesserocr is an OCR library for Python. In fact, it is a Python API wrapper around tesseract; the core is still tesseract.

So before installing tesserocr, you need to install tesseract first;

Wait a minute. tesserocr is a library, that much is clear, but what is OCR? And what is tesseract?

OCR

OCR stands for Optical Character Recognition.

It is the process of scanning characters and translating them into electronic text based on their shapes;

Example:

When a crawler meets a graphic captcha, it first uses OCR to convert the image into electronic text, then submits the recognized result to the server. This way the verification code is identified automatically;
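The step between "OCR produces text" and "the crawler submits it" can be sketched roughly as follows. This is only a minimal sketch, not the author's code: normalize_captcha is a hypothetical helper, and the "4 letters or digits" rule is an assumption taken from the description of graphic verification codes earlier in this post. It cleans up raw OCR text and decides whether the result is worth submitting; the recognition itself and the request to the server are left out.

```python
import re
from typing import Optional

# Assumption: the captcha is exactly 4 ASCII letters or digits,
# as described for graphic verification codes earlier in this post.
CAPTCHA_PATTERN = re.compile(r"[A-Za-z0-9]{4}")


def normalize_captcha(raw_text: str) -> Optional[str]:
    """Strip whitespace from raw OCR output and validate its shape.

    Returns the cleaned text, or None if it cannot be a valid captcha,
    in which case the crawler should fetch a new image and retry.
    """
    # OCR output often contains spaces or a trailing newline
    cleaned = "".join(raw_text.split())
    if CAPTCHA_PATTERN.fullmatch(cleaned):
        return cleaned
    return None


print(normalize_captcha("ME8E\n"))  # a well-formed result
print(normalize_captcha(""))        # empty OCR output is rejected
```

A check like this does not make the recognition more accurate; it only avoids submitting results that are obviously malformed.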

tesseract

Tesseract is an open source OCR engine backed by Google;

OK, the concepts seem clear now, but I have another question. When I looked into image recognition before, there was something called opencv. What is the difference between the two?
opencv focuses on machine vision;
tesseract focuses on character recognition.

So in terms of scope, opencv is broader. It can handle graphic captchas too, but why kill a chicken with a sledgehammer~

3 Environment preparation

Installation under Windows

Under Windows, download tesseract first; it provides the support that tesserocr needs;

tesseract download address: Index of /tesseract
Opening it shows a list of exe installers; pick whichever you like;

Files with dev in the name are development versions; those without dev are stable versions. For example, jb downloaded tesseract-ocr-setup-3.05.01.exe;

After downloading, double-click the installer and keep clicking Next until the following page appears:


Here you need to check Additional language data (download), marked by the red box.

This option installs the language packs used for OCR, so that multiple languages can be recognized.

Then keep clicking Next. Because the language packs have to be downloaded, this takes a while, about 10-20 minutes depending on your network speed.

If you don’t need multi-language support, you can leave the option unchecked. It’s your choice.

Note: English language data is included by default.

If downloading so many languages at once takes up too much space, or your network is slow, you can install only the Chinese language data;
Language data download address: https://github.com/tesseract-ocr/tessdata
Open it and search for chi_sim.traineddata, which is Simplified Chinese, and download it;
Then find the tesseract installation directory from before. It contains a directory called tessdata; put the language file you just downloaded into that directory;
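To double-check that the language file landed in the right place, a small script can look for it. This is a minimal sketch: has_language is a hypothetical helper, and the tessdata path below is only an assumed install location, so replace it with your own.

```python
from pathlib import Path


def has_language(tessdata_dir: str, lang: str) -> bool:
    """Return True if <lang>.traineddata exists in the tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


# Hypothetical install location; replace it with your own tesseract directory.
tessdata_dir = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"
for lang in ("eng", "chi_sim"):
    print(lang, "installed:", has_language(tessdata_dir, lang))
```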

How do you verify that tesseract was installed successfully?

Just run tesseract in cmd;

If the installation succeeded, its usage information is printed;

If you are told that ‘tesseract’ is not recognized as an internal or external command, the environment variable has not been configured. Add the tesseract installation directory to the Path environment variable manually; the details are not covered here;
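The same check can be done from Python. The following is a rough sketch using only the standard library; find_tesseract is a hypothetical helper name, and it simply reports whether a tesseract executable is reachable through PATH.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found; add its install directory to PATH")
else:
    # Ask the executable for its version banner
    completed = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the tesseract version, the banner may go to stdout or stderr
    banner = (completed.stdout or completed.stderr).strip()
    print(path)
    print(banner.splitlines()[0] if banner else "(no version output)")
```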

So far, tesseract has been successfully installed~

Next, install tesserocr with pip:

pip3 install tesserocr pillow

But when jb ran it, it reported an error right away:

jb tried many approaches; even conda install tesserocr reported the same error.

After a lot of hard work, I finally found a feasible command:

conda install -c simonflueckiger tesserocr

Finally, tesserocr is installed~

How do you verify that it is really installed?
Very simple: run import tesserocr in Python.
If no error is reported, the installation is complete;
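If you would rather get a yes/no answer than an exception, the check can also be written like this. It is a small sketch with a hypothetical helper name (is_installed); it only looks the modules up and does not import them.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# tesserocr is the OCR wrapper; PIL is the import name of the pillow package
for name in ("tesserocr", "PIL"):
    print(name, "installed:", is_installed(name))
```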

By the way, if you are not familiar with the conda command, visit the link below and search for the scrapy installation section, which includes an introduction to conda:
https://juejin.im/post/5afcb91251882565bd257097

OK, tesserocr and tesseract are now installed on Windows;

While we’re at it, let me also cover Linux and Mac. Note that jb has not verified the following methods; the information comes from the Internet and is for reference only:

Installation under Linux

For Linux, the various distributions already provide packages, which may be called tesseract-ocr or tesseract, and can be installed directly with the corresponding command;

  • Ubuntu, Debian, and Deepin
    Under Ubuntu, Debian and Deepin systems, the installation command is as follows:
 sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
  • CentOS, Red Hat
    Under CentOS and Red Hat systems, the installation command is as follows:
 yum install -y tesseract

Run the command for your distribution to install tesseract;
After the installation is complete, the tesseract command is available;
By default only English is installed here as well. If you need other languages, see the Windows section above; the approach is the same and is not repeated here;

Next, install tesserocr with pip:

pip3 install tesserocr pillow

Installation under Mac

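On Mac, first use Homebrew to install the ImageMagick and tesseract libraries:

brew install imagemagick
brew install tesseract --all-languages

(Newer Homebrew versions no longer accept the --all-languages option; there the extra language data comes from the separate tesseract-lang formula.)

Next, install tesserocr with pip:

pip3 install tesserocr pillow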

4 Recognition Test

To make testing easier, save a verification code image locally;

Open weibo.com and enter any account and password. You will be prompted to enter a verification code. Open the developer tools and find the captcha element. Its src attribute is a link; copy it and open it directly.

You will see a verification code, and it changes every time you refresh. From this, it can be inferred that the link is a verification code interface.

Right-click and save the image, and you have a verification code to test with;

Verification code link:


Everything is ready, let’s get started.
Create a new project and put the verification code image in the project root directory;
Then use the tesserocr library to recognize it:

import tesserocr
from PIL import Image

# Open the captcha image as an Image object
image = Image.open("3.jpg")
# Pass the Image object to tesserocr's image_to_text() to run recognition
result = tesserocr.image_to_text(image)
print(result)

The output was empty, and I got stuck… After debugging and searching through various documents, I finally switched to the verification code shown earlier:


Replace the picture and execute the code again:

This time there is output, but it is MEEE, which still differs from the ME8E in the verification code;

Two problems so far:
1) Recognition of the Weibo verification code failed; the output is empty
2) Some characters of the verification code from Chapter 2 were recognized incorrectly

I thought to myself: this library is widely recommended online and backed by Google, so in theory it should work, and everyone else uses it this way. Why is there a problem here? Is additional processing required?

Let’s keep learning, with questions and dreams in hand;

Digression: tesserocr also has a simpler method, file_to_text(), which converts an image file directly into a string. The code is as follows:

import tesserocr
print(tesserocr.file_to_text("1.jpg"))

The result is the same as above, but this method is generally not recommended online, reportedly because its recognition quality is not as good as that of the previous method;

As for the empty result on the Weibo verification code, running tesseract from the command line shows the reason:

tesseract image_path output

Leptonica does not detect any DPI information while parsing the image;

5 Verification code processing

I looked for information online. Take this verification code, for example:

The extra lines in the verification code may interfere with recognition of the picture;

Another example is Weibo:

The font position, pattern and other factors may interfere with recognition of the picture;

There are solutions, though. They require additional processing of the image, such as converting it to grayscale, binarization, etc.;

Grayscale processing: call the convert() method of the Image object with the argument L to convert the image into a grayscale image:

from PIL import Image

image = Image.open("1.jpg")
# 'L' converts the image to 8-bit grayscale
image = image.convert('L')
image.show()


The picture has successfully turned gray;
Now run the recognition again:
the result is still MEEE, so it still fails;

Passing in 1 instead binarizes the image:
(Binarization means setting the gray value of every pixel to either 0 or 255, so that the whole image shows only black and white)

import tesserocr
from PIL import Image

image = Image.open("1.jpg")
# '1' converts the image to a 1-bit black-and-white image
image = image.convert('1')
image.show()

This looks even blurrier than the grayscale version, and of course the recognition result is even further off:

The binarization threshold can be specified. With convert('1'), Pillow applies Floyd-Steinberg dithering by default (a fixed threshold of 127 is used only when dithering is turned off), which is why the result looks so noisy. In practice the original image is rarely converted directly, for the reason seen above: the error gets even more outrageous;

Generally, the original image is converted into a grayscale image first, and then the binarization threshold is specified. The code is as follows:

import tesserocr
from PIL import Image

# Open the captcha image as an Image object
image = Image.open("1.jpg")
# Convert to grayscale
image = image.convert('L')
# Binarization threshold
threshold = 150
# Lookup table: gray levels below the threshold map to 0 (black), the rest to 1 (white)
table = []

for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# Apply the table and output a binary (mode "1") image
image = image.point(table, "1")
image.show()
result = tesserocr.image_to_text(image)
print(result)

Let me explain here; some readers may wonder what the 256 is about.
First, we converted the image to grayscale. A grayscale image is a monochrome image with 256 gray levels from black to white;
To binarize a grayscale image, we pick a threshold between 0 and 255. Pixels whose gray value is below the threshold are set to 0, and pixels at or above the threshold are set to 1. 0 is black and 1 is white, so the grayscale image is converted completely into a binary image;
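To see the thresholding rule without any image library, here is a pure-Python sketch. The helper names and the tiny grid of gray values are made up for illustration; the grid stands in for a grayscale image.

```python
def build_threshold_table(threshold: int) -> list:
    """Build the 256-entry lookup table: 0 (black) below the threshold, 1 (white) otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_rows, threshold):
    """Apply the table to a grid of gray values (a stand-in for a grayscale image)."""
    table = build_threshold_table(threshold)
    return [[table[value] for value in row] for row in gray_rows]


# A tiny made-up 2x4 "image" of gray levels, for illustration only
gray_rows = [
    [0, 100, 149, 150],
    [151, 200, 255, 30],
]
for row in binarize(gray_rows, threshold=150):
    print(row)
```

With a threshold of 150, the two rows come out as [0, 0, 0, 1] and [1, 1, 1, 0]: the value 149 falls on the black side and 150 on the white side.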
If that is still unclear, here are some pictures:
Original image:


Grayscale image:


Binary image:

On a grayscale image, some colors lie between black and white. Applying a threshold turns every one of these intermediate shades into either black or white;

OK, that was a long detour. The binary image of the verification code above looks like this:

And the recognition result:

Good, something changed; at least it is no longer MEEE.

So we keep tuning the threshold, looking for a suitable value;

After tuning for a long time, jb gave up. The reason is the 8: no matter how the threshold is adjusted, it never comes out right. The result keeps hovering between S, R and B;
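Tuning by hand gets tedious, so the trial-and-error can be put in a loop. The following is a rough sketch rather than jb's actual procedure: sweep_thresholds is a hypothetical helper that takes the recognizer as a parameter, and the threshold range in the usage comment is an arbitrary assumption.

```python
def sweep_thresholds(gray_image, thresholds, recognize):
    """Binarize gray_image at each threshold and collect the OCR text.

    gray_image: a PIL Image in mode "L" (grayscale)
    thresholds: iterable of ints in 0..255
    recognize:  callable taking an image and returning text,
                e.g. tesserocr.image_to_text
    """
    results = {}
    for threshold in thresholds:
        # Same lookup table as before: 0 below the threshold, 1 otherwise
        table = [0 if level < threshold else 1 for level in range(256)]
        binary = gray_image.point(table, "1")
        results[threshold] = recognize(binary).strip()
    return results


# Usage sketch (requires tesserocr and pillow, so it is left commented out):
# import tesserocr
# from PIL import Image
# gray = Image.open("1.jpg").convert("L")
# for threshold, text in sweep_thresholds(gray, range(100, 201, 10), tesserocr.image_to_text).items():
#     print(threshold, repr(text))
```

Printing every result side by side makes it easy to see whether any threshold gives the right answer at all, or whether, as with the 8 here, none of them does.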


Using the same code as above without modification, the binary image is as follows:

Recognition result:

The verification codes that can be recognized use solid characters, while those that cannot be recognized use hollow characters;
The advantage of solid characters is that black and white remain distinct after image processing. Hollow characters have very thin outlines, so they may no longer be recognizable after processing;
