Hey everyone! I am Cheese.
Let's take a look first: what types of verification codes (CAPTCHAs) do we currently run into?
1) Graphic verification code
The graphic verification code is the simplest kind. It is the earliest form and still the most common today. It usually consists of 4 letters, digits, or a mixture of both.
2) Slide verification code
3) Click (touch) verification code
These three are the most common types of verification code on PC at present.
Mobile apps also use gesture verification, grid verification, voice verification, and so on.
I won't introduce those here; the focus is on the three common types above.
1 Graphic verification code
After opening the page, the registration form is shown by default.
Click the login button; if no verification code appears, just refresh the page a few times.
2 Introduction
To recognize graphic verification codes, you need to install the tesserocr library.
tesserocr is an OCR library for Python, but it is really just a Python API wrapper around tesseract; the core is still tesseract.
So before installing tesserocr, you need to install tesseract first.
Wait a minute. tesserocr I can follow, it's a library. But what is OCR? And what is tesseract?
OCR
OCR stands for Optical Character Recognition: the process of scanning characters and converting them into electronic text based on their shapes.
Example:
When a graphic captcha appears, OCR first converts it into electronic text.
The crawler then submits the recognized result to the server, and so the verification code is handled automatically.
tesseract
Tesseract is an open-source OCR engine whose development was sponsored by Google for many years.
OK, I think I get the concept, but I have another question. When I worked on image recognition before, there was this thing called OpenCV. What is the difference between the two?
OpenCV focuses on machine vision, while tesseract focuses on character recognition.
So in terms of scope, OpenCV is broader.
It can handle graphic captchas too, but why use a sledgehammer to crack a nut~
3 Environment preparation
Installation under Windows
Under Windows, download tesseract first; it provides the support that tesserocr needs.
tesseract download address: Index of /tesseract
After opening it, you will see a list of exe installers; pick whichever you like.
Files with dev in the name are development versions; those without dev are stable versions.
For example, jb downloaded tesseract-ocr-setup-3.05.01.exe.
After downloading, double-click the installer and click through until the following page appears.
Here you need to check Additional language data (download) in the red box.
This option installs the language packs for OCR, so that it can recognize multiple languages.
Then keep clicking Next.
Because the language packs have to be downloaded, this takes a while, roughly 10-20 minutes depending on your connection speed.
If you don't need multi-language support, you can leave it unchecked; it's up to you.
Note: English language data is included by default.
If downloading so many languages at once takes up too much space, or your network is slow, you can install just the Chinese language data separately.
Language data download address: https://github.com/tesseract-ocr/tessdata
After opening it, search for chi_sim.traineddata, which is Simplified Chinese, and download it.
Then go to the tesseract installation directory from before. It contains a directory called tessdata; put the language data file you just downloaded into it.
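To double-check which language packs ended up in tessdata, a small sketch like the following may help. It only inspects file names; the helper name list_languages and the example path in the comment are assumptions for illustration, not part of tesseract.

```python
from pathlib import Path


def list_languages(tessdata_dir):
    """Return the language codes found in a tessdata directory.

    Each language pack is a file named <code>.traineddata,
    for example chi_sim.traineddata for Simplified Chinese.
    """
    directory = Path(tessdata_dir)
    return sorted(p.stem for p in directory.glob("*.traineddata"))


# Example (hypothetical install path, adjust it to your machine):
# print(list_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata"))
```

If chi_sim shows up in the returned list, the file is in the right place.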
How do you verify that tesseract installed successfully?
Just type tesseract in cmd.
If it worked, its usage information is printed directly.
If you are told that 'tesseract' is not recognized as an internal or external command, the environment variable has not been configured. Add the tesseract installation directory to the PATH variable manually; this will not be described in detail.
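The same check can also be done from Python. The sketch below uses only the standard library; find_tesseract is a helper name made up for this example.

```python
import shutil
import subprocess


def find_tesseract(command="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_tesseract()
if path is None:
    print("tesseract is not on PATH; add its install directory to PATH")
else:
    # Ask the binary for its version to confirm that it actually runs.
    # Some versions print this to stderr, so fall back to that stream.
    completed = subprocess.run([path, "--version"], capture_output=True, text=True)
    print(completed.stdout or completed.stderr)
```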
So far, tesseract has been successfully installed~
Next, install tesserocr with pip:
pip3 install tesserocr
But when jb ran it, it immediately reported an error:
I tried many things, even conda install tesserocr, and got the same error.
After a lot of effort, I finally found a command that works:
conda install -c simonflueckiger tesserocr
At last, tesserocr is installed~
How do you verify that it is really installed?
Simple: just import tesserocr.
If no error is reported, the installation is complete.
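As a quick scripted version of that check, the sketch below asks Python whether the modules can be found at all. is_installed is a made-up helper name. Note that it only locates the module; an actual import tesserocr can still fail if the tesseract libraries themselves are missing.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("tesserocr", "PIL"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```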
By the way, if you are not familiar with the conda command, visit the link below and search for the scrapy installation section, which includes an introduction to conda:
https://juejin.im/post/5afcb91251882565bd257097
OK, tesserocr and tesseract are now installed under Windows.
Hold on, let me also cover Linux and Mac while I'm at it. Note that jb has not verified the following methods; the information comes from the Internet and is for reference only:
Installation under Linux
On Linux, the various distributions already ship their own packages, which may be called tesseract-ocr or tesseract, and can be installed directly with the corresponding command.
- Ubuntu, Debian, and Deepin
Under Ubuntu, Debian and Deepin systems, the installation command is as follows:
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
- CentOS, Red Hat
Under CentOS and Red Hat systems, the installation command is as follows:
yum install -y tesseract
Run the command for your distribution to install tesseract.
Once the installation is complete, the tesseract command is available.
Here too, only English is installed by default. If you need other languages, see the Windows section above; the approach is the same and will not be repeated here.
The next step is to install tesserocr, directly with pip:
pip3 install tesserocr pillow
Installation under Mac
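On Mac, first use Homebrew to install the ImageMagick and tesseract libraries:
brew install imagemagick
brew install tesseract --all-languages
(Newer Homebrew releases may no longer accept --all-languages; there the extra language data is provided by the separate tesseract-lang formula.)
Next, install tesserocr with pip:
pip3 install tesserocr pillow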
4 Recognition Test
To make testing easier, first save a picture of the verification code locally.
Open weibo.com and enter any account and password; you will be prompted to enter a verification code.
Open the developer tools and find the captcha element. Its src attribute is a link.
Copy the link and open it directly, and you will see a verification code that changes every time you refresh.
From this we can infer that the link is a verification code interface.
Right-click and save the image, and you have a verification code to work with.
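Saving by hand works fine for one image. If you want to grab several, a minimal sketch along these lines could do it with the standard library. save_captcha and CAPTCHA_URL are names made up for this example, and the sketch assumes the interface returns the image without extra cookies or headers, which may not hold for the real site.

```python
import urllib.request


def save_captcha(url, path):
    """Download the image behind `url`, write it to `path`, return its size in bytes."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# CAPTCHA_URL is a placeholder: paste the src link copied from the developer tools.
# save_captcha(CAPTCHA_URL, "3.jpg")
```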
Verification code link:
Everything is ready, so let's get started.
Create a new project and put the verification code image in the project's root directory.
Use the tesserocr library to identify the verification code:
import tesserocr
from PIL import Image

# Create an Image object
image = Image.open("3.jpg")
# Call tesserocr's image_to_text() with the image object to run the recognition
result = tesserocr.image_to_text(image)
print(result)
The output was empty, and I got stuck... After debugging and digging through various documents, I finally switched to the verification code shown earlier:
Replace the picture and run the code again:
This time there is output, but it is MEEE, which still differs from the ME8E in the verification code.
Two problems so far:
1) Weibo verification code recognition failed; the output is empty
2) Some characters in the verification code in Chapter 2 were recognized incorrectly
I thought to myself: this library is recommended all over the Internet, it is backed by Google, in theory it should work, and everyone else uses it the same way. Why is it failing here? Is additional processing required?
Let's keep learning, with questions and dreams in hand.
Digression: tesserocr also has a simpler method that converts an image file directly into a string. The code is as follows:
import tesserocr

print(tesserocr.file_to_text("1.jpg"))
The result is the same as above, but sources online do not recommend it.
Reportedly its recognition quality is not as good as that of image_to_text().
As for the empty result on the Weibo verification code, running tesseract directly on the image reports the reason:
tesseract <image path> <output name>
Leptonica does not detect any DPI while parsing the image.
5 Verification code processing
I looked for information online. Take this verification code, for example:
The extra lines in the verification code may be interfering with recognition.
Another example is Weibo:
The position and style of the characters, among other factors, may be interfering with recognition of the image.
There are ways around this. They involve extra processing of the image, such as converting it to grayscale and binarization.
Grayscale processing: call the convert() method of the Image object with the parameter L to convert the image into a grayscale image:
from PIL import Image

image = Image.open("1.jpg")
image = image.convert('L')
image.show()
The picture has successfully turned gray.
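For the curious: according to the Pillow documentation, convert('L') uses the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below reproduces that formula with plain integer arithmetic so you can see how a color becomes a gray level. luma is a made-up helper, and Pillow's own rounding may differ by one level.

```python
def luma(r, g, b):
    """Approximate the ITU-R 601-2 luma transform with integer arithmetic."""
    return (r * 299 + g * 587 + b * 114) // 1000


print(luma(255, 255, 255))  # pure white stays at the top of the gray scale
print(luma(255, 0, 0))      # pure red becomes a fairly dark gray
print(luma(0, 255, 0))      # pure green becomes a lighter gray than red
```

Green carries the largest weight, so two colors that look equally strong to the eye can land on quite different gray levels.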
Now let's run the recognition again.
The result is still MEEE, so it still fails.
Passing in 1 instead binarizes the image:
(Binarization means setting the gray value of every pixel to either 0 or 255, so that the whole image shows only black and white)
import tesserocr
from PIL import Image

image = Image.open("1.jpg")
image = image.convert('1')
image.show()
This looks even blurrier than the one above, and of course the recognition result is even further off:
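The binarization threshold can be specified. Note that convert('1') does not apply a plain cut-off by default: Pillow dithers the image (Floyd-Steinberg), which is why the result above looks speckled; the fixed threshold of 127 only applies when dithering is turned off. Either way, the original image is rarely converted directly, and the reason can be seen above: the error only gets more outrageous.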
Generally, the original image is converted into a grayscale image first, and then binarized with an explicit threshold. The code is as follows:
import tesserocr
from PIL import Image

# Create an Image object
image = Image.open("1.jpg")
# Convert to grayscale
image = image.convert('L')
# This is the binarization threshold
threshold = 150
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# Convert to a binary image through the table; entries set to 1 become white, entries set to 0 become black
image = image.point(table, "1")
image.show()
result = tesserocr.image_to_text(image)
print(result)
Let me explain here, since some readers may wonder what the 256 is about.
First we converted the image to grayscale. A grayscale image is a monochrome image with 256 gray levels, from black (0) to white (255).
To turn the grayscale image into a binary one, we pick a threshold between 0 and 255. Every pixel whose gray value is below the threshold is set to 0, and every other pixel is set to 1. 0 is black and 1 is white, so the grayscale image is converted completely into a binary image. The table simply holds the answer for each of the 256 possible gray values.
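To see the table at work without any image library, here is a small sketch in plain Python. The row of gray values is made up, and build_table and binarize are names chosen for this example.

```python
def build_table(threshold):
    """Build the 256-entry lookup table: one entry per possible gray level."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_pixels, threshold):
    """Map each gray value (0-255) to 0 (black) or 1 (white) via the table."""
    table = build_table(threshold)
    return [table[value] for value in gray_pixels]


# A made-up row of gray values, from dark to light
row = [0, 90, 149, 150, 200, 255]
print(binarize(row, 150))  # [0, 0, 0, 1, 1, 1]
```

Note that a value exactly equal to the threshold (150 here) ends up white, matching the i < threshold test in the code above.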
If that is still unclear, here are some pictures:
Original image:
Grayscale image:
Binary image:
In a grayscale image, some shades lie between white and black.
Setting a threshold turns all of these intermediate shades into either black or white.
OK, back on topic. The binary image of the verification code above looks like this:
And the recognition result:
Good, something changed; at least it is no longer MEEE.
So let's keep tuning the threshold to find a suitable value.
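Rather than editing the threshold by hand each time, the tuning can be scripted. The sketch below is one way to do it, not the method used in this article: sweep_thresholds and make_table are made-up helper names, and the recognizer is passed in as a parameter so that the loop itself does not depend on any particular OCR library.

```python
def make_table(threshold):
    """Lookup table for point(): 0 (black) below the threshold, 1 (white) otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def sweep_thresholds(gray_image, thresholds, recognize):
    """Binarize `gray_image` at each threshold and collect the recognized text.

    `gray_image` is expected to be a Pillow image in mode "L";
    `recognize` is any callable that takes an image and returns a string.
    """
    results = {}
    for threshold in thresholds:
        binary = gray_image.point(make_table(threshold), "1")
        results[threshold] = recognize(binary).strip()
    return results


# Usage with the libraries from this article (uncomment to try it):
# import tesserocr
# from PIL import Image
# gray = Image.open("1.jpg").convert("L")
# found = sweep_thresholds(gray, range(100, 201, 10), tesserocr.image_to_text)
# for t, text in found.items():
#     print(t, repr(text))
```

Printing every result side by side makes it easy to spot whether any threshold gets all four characters right.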
After tuning for a long time, jb gave up.
The culprit is the 8.
No matter how the threshold is adjusted, no value gets it right; the result keeps hovering between S, R, and B.
With the same code as above, unmodified, the binary image is as follows:
Recognition result:
The verification codes that can be recognized have solid characters, while those that cannot have hollow ones.
The advantage of solid characters is that black and white stay clearly separated after image processing. Hollow characters have very thin strokes, which may be lost during processing, so they end up unrecognized.