Top 5 Python libraries for extracting text from images

The key is to understand the OCR tools used for text detection and recognition.

Optical character recognition (OCR) is an old but still challenging problem. It involves detecting and recognizing text in unstructured data, including images and PDF documents. It has wide applications in areas such as banking, e-commerce, and social media content management.

As with every topic in data science, there are so many resources out there that it is hard to know where to start when learning to solve OCR tasks. That’s why I wrote this tutorial to help you get started.

In this article, I will show you some Python libraries that let you extract text from images without much hassle. The description of each library is followed by a practical example. All the data used comes from Kaggle.

Contents:

pytesseract
EasyOCR
Keras-OCR
TrOCR
docTR

1. pytesseract

pytesseract is one of the most popular Python libraries for optical character recognition. It uses Google’s Tesseract-OCR engine to extract text from images and supports multiple languages.

If you want to know if your language is supported, check out this link: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. You only need a few lines of code to convert an image to text:

# installation
!sudo apt install tesseract-ocr
!pip install pytesseract


import pytesseract
from pytesseract import Output
from PIL import Image
import cv2


img_path1 = '00b5b88720f35a22.jpg'
text = pytesseract.image_to_string(img_path1,lang='eng')
print(text)

Output:

We can also get the bounding box coordinates of each item detected in the image.

# boxes around character
print(pytesseract.image_to_boxes(img_path1))

Result:

~ 532 48 880 50 0
...
A 158 220 171 232 0
F 160 220 187 232 0
I 178 220 192 232 0
L 193 220 203 232 0
M 204 220 220 232 0
B 228 220 239 232 0
Y 240 220 252 232 0
R 259 220 273 232 0
O 274 219 289 233 0
N 291 220 305 232 0
H 314 220 328 232 0
O 329 219 345 233 0
W 346 220 365 232 0
A 364 220 379 232 0
R 380 220 394 232 0
D 395 220 410 232 0
...
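
Each row of this output follows Tesseract's box format: the character, then the left, bottom, right, and top pixel coordinates, then the page number. The coordinates are measured from the bottom-left corner of the image, whereas PIL and OpenCV place the origin at the top-left. Below is a minimal sketch of parsing the rows and flipping the y axis. The helper name `parse_boxes` is an assumption for illustration, and the image height of 678 is taken from the page dimensions reported later for the same image.

```python
def parse_boxes(box_text, image_height):
    """Parse image_to_boxes output into dicts that use a top-left origin."""
    boxes = []
    for row in box_text.strip().splitlines():
        parts = row.split()
        if len(parts) != 6:
            continue  # skip blank or malformed rows
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append({
            "char": parts[0],
            "left": left,
            "top": image_height - top,        # flip the y axis
            "right": right,
            "bottom": image_height - bottom,  # flip the y axis
        })
    return boxes


# Two rows copied from the output above.
sample = "A 158 220 171 232 0\nF 160 220 187 232 0"
for box in parse_boxes(sample, image_height=678):
    print(box)
```

With real data, the first argument would be the string returned by `pytesseract.image_to_boxes(img_path1)`.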

As you may have noticed, it estimates a bounding box for each character, not for each word! To extract a box for each word rather than each character, use the image_to_data method instead of image_to_boxes:

# boxes around words
print(pytesseract.image_to_data(img_path1))

The returned result is not perfect. For example, it reads “AFILM” as a single word. Furthermore, it does not detect and recognize all the words in the input image.
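
By default, image_to_data returns a tab-separated table as a single string, which is awkward to process. The `Output` class imported earlier lets you request a dictionary instead through `output_type=Output.DICT`. The sketch below keeps only non-empty words above a confidence threshold. The helper names `confident_words` and `words_from_image` and the threshold of 60 are assumptions for illustration, not part of pytesseract.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, left, top, width, height) for words above min_conf.

    `data` is the dictionary produced by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as str, int or float; -1 marks rows that are not words
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            words.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return words


def words_from_image(image_path, min_conf=60.0):
    """Run Tesseract on image_path and keep only the confident words."""
    # Imported lazily so confident_words() can be reused without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image_path, lang="eng",
                                     output_type=Output.DICT)
    return confident_words(data, min_conf)
```

Calling `words_from_image(img_path1)` would then give one tuple per retained word. Filtering on confidence does not fix merged words such as “AFILM”, but it can remove some spurious detections.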

2. EasyOCR

Next is another open-source Python library: EasyOCR. Like pytesseract, it supports many languages (more than 80). You can try it quickly via the web demo without writing any code. It uses the CRAFT algorithm to detect text and a CRNN as the recognition model, both implemented in PyTorch.

If you are working on Google Colab, enabling a GPU is recommended, since it speeds up this framework. The code is as follows:

# installation
!pip install easyocr


import easyocr


reader = easyocr.Reader(['en'])
extract_info = reader.readtext(img_path1)


for el in extract_info:
    print(el)

The results are much better compared to pytesseract. For each detected text, we also have bounding boxes and confidence levels.
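
Each element returned by `readtext` is a tuple of bounding box (four corner points), text, and confidence. Here is a small sketch for discarding low-confidence detections. The helper `filter_detections` and the threshold of 0.5 are illustrative assumptions, and the sample list is hand-written in the same shape rather than real output.

```python
def filter_detections(detections, min_conf=0.5):
    """Keep (text, confidence) pairs whose confidence is at least min_conf."""
    kept = []
    for bbox, text, conf in detections:
        if conf >= min_conf:
            kept.append((text, round(float(conf), 2)))
    return kept


# Hand-written sample shaped like reader.readtext() output (not real results).
sample = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "hello", 0.98),
    ([[10, 40], [60, 40], [60, 60], [10, 60]], "w0rld", 0.21),
]
print(filter_detections(sample))
```

If only the text is needed, `reader.readtext(img_path1, detail=0)` returns a plain list of strings without boxes or confidence values.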

3. Keras-OCR

Keras-OCR is another open-source library dedicated to optical character recognition. Like EasyOCR, it uses the CRAFT detection model and a CRNN recognition model to solve the task. The difference from EasyOCR is that it is implemented using Keras instead of PyTorch. The main downside of Keras-OCR is that it does not support non-English languages.

# installation
!pip install keras-ocr -q


import keras_ocr


pipeline = keras_ocr.pipeline.Pipeline()
extract_info = pipeline.recognize([img_path1])
print(extract_info[0][0])

This is the output of the first word extracted:

('from',
 array([[761., 16.],
        [813., 16.],
        [813., 30.],
        [761., 30.]], dtype=float32))

To visualize all the results, we convert the output to a pandas DataFrame:

import pandas as pd

# collect each recognized word and its bounding box
diz_cols = {'word': [], 'box': []}
for el in extract_info[0]:
    diz_cols['word'].append(el[0])
    diz_cols['box'].append(el[1])
kerasocr_res = pd.DataFrame.from_dict(diz_cols)
kerasocr_res

Here the results are noticeably clearer and more precise.
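
The boxes returned by Keras-OCR are four corner points, and the words are not necessarily in reading order. One possible way to sort them top to bottom and then left to right is sketched below. The helper names and the `line_tol` tolerance of 10 pixels are assumptions chosen for illustration; skewed text or multi-column layouts would need something more robust.

```python
def to_rect(box):
    """Convert four (x, y) corner points to (x_min, y_min, x_max, y_max)."""
    xs = [float(p[0]) for p in box]
    ys = [float(p[1]) for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def reading_order(predictions, line_tol=10.0):
    """Sort (word, box) pairs by text line, then by horizontal position."""
    items = sorted(((word, to_rect(box)) for word, box in predictions),
                   key=lambda item: item[1][1])  # sort by top edge
    lines = []  # each entry is a list of (word, rect) on one text line
    for word, rect in items:
        # Same line if the top edge is close to that of the line's first word.
        if lines and abs(rect[1] - lines[-1][0][1][1]) <= line_tol:
            lines[-1].append((word, rect))
        else:
            lines.append([(word, rect)])
    ordered = []
    for line in lines:
        ordered.extend(w for w, _ in sorted(line, key=lambda item: item[1][0]))
    return ordered


# Hand-written sample shaped like one entry of pipeline.recognize() output.
sample = [
    ("world", [[60, 12], [100, 12], [100, 28], [60, 28]]),
    ("hello", [[10, 10], [50, 10], [50, 26], [10, 26]]),
    ("again", [[10, 40], [50, 40], [50, 56], [10, 56]]),
]
print(" ".join(reading_order(sample)))
```

With the real output, the call would be `reading_order(extract_info[0])`; the NumPy arrays returned by Keras-OCR can be indexed the same way as the nested lists in the sample.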

4. TrOCR

TrOCR is a Transformer-based generative model for recognizing text in images. It consists of an encoder and a decoder: TrOCR uses a pretrained image transformer as the encoder and a pretrained text transformer as the decoder. See the paper for more details. There is also good documentation for this model on the Hugging Face platform. First, we load the pretrained model:

# installation
!pip install transformers


from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image


model_version = "microsoft/trocr-base-printed"
processor = TrOCRProcessor.from_pretrained(model_version)
model = VisionEncoderDecoderModel.from_pretrained(model_version)

Before passing the image to the model, it needs to be resized and normalized, which the processor takes care of. Once the image has been converted, we can extract the text using the .generate() method.

image = Image.open(img_path1).convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print('output: ',extract_text)
# output: 2.50

Unlike the previous libraries, TrOCR returns a meaningless number here. Why? TrOCR contains only a recognition model, not a detection model. Solving an OCR task requires first detecting the text regions in the image and then recognizing the text in each region. Because TrOCR handles only the second step, it performs poorly on a full image. To make it work, it’s best to crop a specific part of the image using a bounding box, like this:

crp_image = image.crop((750, 3.4, 970, 33.94))
display(crp_image)

Then we try to apply the model again:

pixel_values = processor(crp_image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(extract_text)

This operation can be applied repeatedly to every word/phrase contained in the image.
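
A sketch of that loop: crop each box, run the recognizer on the crop, and collect the text. The functions `recognize_regions` and `make_trocr_recognizer` are hypothetical helpers, and the boxes would have to come from a separate detector, for example the word boxes produced by one of the libraries above.

```python
def recognize_regions(image, boxes, recognize):
    """Crop each (left, top, right, bottom) box and recognize its text.

    `image` needs a PIL-style crop() method; `recognize` is any callable
    that maps a cropped image to a string.
    """
    return [recognize(image.crop(box)) for box in boxes]


def make_trocr_recognizer(processor, model):
    """Wrap a TrOCR processor/model pair as a crop -> text callable."""
    def recognize(crop):
        pixel_values = processor(crop, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        return processor.batch_decode(generated_ids,
                                      skip_special_tokens=True)[0]
    return recognize
```

With the objects loaded earlier, a call might look like `recognize_regions(image, [(750, 3, 970, 34)], make_trocr_recognizer(processor, model))`. Keeping the recognizer as a plain callable means the cropping loop does not depend on TrOCR itself.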

5. docTR

Finally, we cover the last Python package for detecting and recognizing text in documents: docTR. It can read a document as a PDF or an image and then passes it to a two-stage pipeline. In docTR, a text detection model (DBNet or LinkNet) is followed by a CRNN model for text recognition. These deep learning models need a backend, so the library requires either PyTorch or TensorFlow to be installed.

!pip install python-doctr
# for TensorFlow
!pip install "python-doctr[tf]"
# for PyTorch
!pip install "python-doctr[torch]"

Then we import the relevant docTR modules and load the model, which follows a two-step approach. We need to specify the architecture for text detection (DBNet with a ResNet-50 backbone) and the architecture for text recognition (CRNN with a VGG-16 backbone):

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(det_arch='db_resnet50',
                      reco_arch='crnn_vgg16_bn',
                      pretrained=True)

Finally, we read the file, apply the pretrained model, and export the output as a nested dictionary:

# read file
img = DocumentFile.from_images(img_path1)


# use pre-trained model
result = model(img)


# export the result as a nested dict
extract_info = result.export()

The output is very long:

{'pages': [{'page_idx': 0, 'dimensions': (678, 1024), 'orientation': {'value': None, 'confidence': None},...

For better readability, it’s better to use nested loops to get only the information we’re interested in:

for obj1 in extract_info['pages'][0]["blocks"]:
    for obj2 in obj1["lines"]:
        for obj3 in obj2["words"]:
            print("{}: {}".format(obj3["geometry"],obj3["value"]))
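
The `geometry` values are relative coordinates between 0 and 1, in the form ((x_min, y_min), (x_max, y_max)). To draw or crop the boxes, they can be scaled by the page dimensions, which appear to be stored as (height, width). The sketch below relies on that assumption and uses a hypothetical helper, `words_in_pixels`. The sample dictionary is a hand-made fragment with the same nesting as the export, not real output.

```python
def words_in_pixels(export):
    """Yield (value, (left, top, right, bottom)) in pixels for every word."""
    for page in export["pages"]:
        height, width = page["dimensions"]  # assumed order: (height, width)
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    (x0, y0), (x1, y1) = word["geometry"]
                    box = (round(x0 * width), round(y0 * height),
                           round(x1 * width), round(y1 * height))
                    yield word["value"], box


# Hand-made fragment with the same nesting as result.export() (not real output).
sample = {"pages": [{"dimensions": (678, 1024), "blocks": [{"lines": [{"words": [
    {"value": "from", "geometry": ((0.75, 0.0), (0.875, 0.5))},
]}]}]}]}
print(list(words_in_pixels(sample)))
```

With the real export, `list(words_in_pixels(extract_info))` would give pixel boxes that can be passed to `image.crop` or used for drawing.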

docTR is another good option for extracting valuable information from images or PDFs.

Conclusion

Each of the five tools has advantages and disadvantages. When choosing one of these packages, first consider the language of the data you are analyzing. If the data includes non-English languages, EasyOCR may be the most suitable choice because of its wide language coverage and good performance.

Disclaimer: This dataset is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
