The main goal is to understand and master OCR tools for text localization and recognition.
Optical character recognition (OCR) is an old but still challenging problem. It involves detecting and recognizing text in unstructured data, including images and PDF documents. It has wide applications in areas such as banking, e-commerce, and social media content management.
But as with every topic in data science, there are tons of resources out there when trying to learn how to solve OCR tasks. That’s why I wrote this tutorial to help you get started.
In this article, I will show you some Python libraries that allow you to easily extract text from images without much hassle. The description of these libraries is followed by a practical example. The data sets used are all from Kaggle.
Contents:
- pytesseract
- EasyOCR
- Keras-OCR
- TrOCR
- docTR
1. pytesseract
pytesseract is one of the most popular Python libraries for optical character recognition. It uses Google’s Tesseract-OCR engine to extract text from images and supports multiple languages.
If you want to know if your language is supported, check out this link: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. You only need a few lines of code to convert an image to text:
# installation (notebook shell commands)
!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2

# path to the input image
img_path1 = '00b5b88720f35a22.jpg'
# run OCR on the whole image using the English language model
text = pytesseract.image_to_string(img_path1, lang='eng')
print(text)
We can also get the bounding box coordinates of each item detected in the image.
# boxes around characters
print(pytesseract.image_to_boxes(img_path1))
Result:
~ 532 48 880 50 0
...
A 158 220 171 232 0
F 160 220 187 232 0
I 178 220 192 232 0
L 193 220 203 232 0
M 204 220 220 232 0
B 228 220 239 232 0
Y 240 220 252 232 0
R 259 220 273 232 0
O 274 219 289 233 0
N 291 220 305 232 0
H 314 220 328 232 0
O 329 219 345 233 0
W 346 220 365 232 0
A 364 220 379 232 0
R 380 220 394 232 0
D 395 220 410 232 0
...
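Each row of this output follows the Tesseract box-file layout: the character, then left, bottom, right, and top coordinates, then the page number, with the origin at the bottom-left corner of the image. Most imaging libraries put the origin at the top-left instead, so a small conversion helps before drawing or cropping. The sketch below is one way to do it; the helper name `parse_char_boxes` is hypothetical, and the image height of 678 pixels is an assumption taken from the page dimensions reported later in this article.

```python
def parse_char_boxes(boxes_text, image_height):
    """Parse Tesseract box-file text into top-left-origin character boxes."""
    parsed = []
    for line in boxes_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char = parts[0]
        # box-file order: left, bottom, right, top (origin at bottom-left)
        left, bottom, right, top = (int(v) for v in parts[1:5])
        parsed.append({
            "char": char,
            "left": left,
            "top": image_height - top,        # flip the y axis
            "right": right,
            "bottom": image_height - bottom,  # flip the y axis
        })
    return parsed


# Two rows copied from the result above; image_height=678 is an assumption.
sample = "A 158 220 171 232 0\nF 160 220 187 232 0"
for box in parse_char_boxes(sample, image_height=678):
    print(box)
```

In practice the string returned by `pytesseract.image_to_boxes(img_path1)` would be passed in place of `sample`, together with the real image height.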
As you can see, it estimates a bounding box for each character, not for each word. To extract word-level boxes, use the image_to_data method instead of image_to_boxes:
# boxes around words
print(pytesseract.image_to_data(img_path1))
The returned result is not perfect. For example, it interprets “AFILM” as a single word. Furthermore, it does not detect and recognize all the words in the input image.
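The word-level output is easier to work with as a dictionary than as printed text. A minimal sketch, assuming the `Output.DICT` layout of `image_to_data` (parallel lists under keys such as `text`, `conf`, `left`, `top`, `width`, `height`): the helper names `words_from_data` and `extract_words` and the confidence threshold of 60 are illustrative choices, not part of the library.

```python
def words_from_data(data, min_conf=0.0):
    """Keep non-empty words from an image_to_data dict (Output.DICT layout)."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if not text.strip() or conf < min_conf:
            continue  # structural rows (blocks, lines) have empty text and conf -1
        words.append({
            "text": text,
            "conf": conf,
            # (left, top, width, height) in pixels, origin at the top-left
            "box": (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]),
        })
    return words


def extract_words(image_path, min_conf=60.0):
    """Run Tesseract on an image and return the confident words."""
    # Imported here so words_from_data can be used without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image_path, output_type=Output.DICT)
    return words_from_data(data, min_conf)
```

Calling `extract_words(img_path1)` would then return only the words Tesseract is reasonably sure about, which makes misreadings easier to spot.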
2. EasyOCR
Next up is another open source Python library: EasyOCR. Similar to pytesseract, it supports more than 80 languages. You can try it quickly and easily via the web demo without writing any code. It uses the CRAFT algorithm to detect text and a CRNN as the recognition model. Both models are implemented in PyTorch.
If working on Google Colab, it is recommended that you set up a GPU, which will help speed up this framework. The following is the detailed code:
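# installation
!pip install easyocr

import easyocr

# load the English detection and recognition models
reader = easyocr.Reader(['en'])
# each element is (bounding box, text, confidence)
extract_info = reader.readtext(img_path1)
for el in extract_info:
    print(el)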
The results are much better compared to pytesseract. For each detected text, we also have bounding boxes and confidence levels.
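Each element returned by `readtext` is a tuple of four corner points, the recognized text, and a confidence score. The sketch below shows one way to drop low-confidence detections and reduce the corner points to a simple rectangle. The helper name `summarize_detections`, the 0.5 threshold, and the sample values are assumptions made for illustration, not real output from the image used in this article.

```python
def summarize_detections(detections, min_conf=0.5):
    """Turn EasyOCR-style (polygon, text, confidence) tuples into simple records."""
    records = []
    for polygon, text, conf in detections:
        if conf < min_conf:
            continue  # discard uncertain detections
        xs = [point[0] for point in polygon]
        ys = [point[1] for point in polygon]
        records.append({
            "text": text,
            "conf": round(float(conf), 3),
            # axis-aligned rectangle enclosing the four corner points
            "box": (min(xs), min(ys), max(xs), max(ys)),
        })
    return records


# Made-up sample in the same shape as reader.readtext() output.
sample = [
    ([[761, 16], [813, 16], [813, 30], [761, 30]], "from", 0.98),
    ([[10, 5], [40, 5], [40, 20], [10, 20]], "a1", 0.21),
]
print(summarize_detections(sample))
```

With real data, `summarize_detections(extract_info)` would take the list produced by the reader directly.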
3. Keras-OCR
Keras-OCR is another open source library dedicated to optical character recognition. Like EasyOCR, it uses the CRAFT detection model and a CRNN recognition model to solve the task. The difference from EasyOCR is that it is implemented in Keras instead of PyTorch. The main downside of Keras-OCR is that it does not support non-English languages.
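# installation
!pip install keras-ocr -q

import keras_ocr

# build the detection + recognition pipeline with pretrained weights
pipeline = keras_ocr.pipeline.Pipeline()
# recognize() takes a list of images and returns one list of (word, box) per image
extract_info = pipeline.recognize([img_path1])
print(extract_info[0][0])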
This is the output of the first word extracted:
('from', array([[761., 16.], [813., 16.], [813., 30.], [761., 30.]], dtype=float32))
To visualize all results, we convert the output to a Pandas data frame:
import pandas as pd

# collect every (word, box) pair of the first image into two columns
diz_cols = {'word': [], 'box': []}
for el in extract_info[0]:
    diz_cols['word'].append(el[0])
    diz_cols['box'].append(el[1])
kerasocr_res = pd.DataFrame.from_dict(diz_cols)
kerasocr_res
Remarkably, the results are clearer and more precise.
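The words are not guaranteed to come back in reading order, so a table of raw results can be hard to scan. The sketch below groups (word, box) pairs into lines by their vertical centre and sorts each line left to right. It is a simple heuristic, not something Keras-OCR provides; the function name `reading_order` and the 10-pixel tolerance are assumptions that would need tuning for other images.

```python
def reading_order(predictions, line_tol=10.0):
    """Order (word, box) pairs top-to-bottom, then left-to-right within a line."""
    items = []
    for word, box in predictions:
        xs = [float(p[0]) for p in box]
        ys = [float(p[1]) for p in box]
        # (vertical centre, left edge, word)
        items.append((sum(ys) / len(ys), min(xs), word))
    items.sort()  # sort by vertical centre first

    lines, current = [], []
    for y_center, x_left, word in items:
        # start a new line when the word sits clearly below the line's first word
        if current and y_center - current[0][0] > line_tol:
            lines.append(current)
            current = []
        current.append((y_center, x_left, word))
    if current:
        lines.append(current)

    # within each line, sort by the left edge and keep only the words
    return [[w for _, _, w in sorted(line, key=lambda t: t[1])] for line in lines]
```

Applied as `reading_order(extract_info[0])`, this returns a list of lines, each a list of words, which can be joined with spaces to rebuild the text.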
4. TrOCR
TrOCR is a Transformer-based generative model for recognizing text in images. It consists of an encoder and a decoder: TrOCR uses a pretrained image Transformer as the encoder and a pretrained text Transformer as the decoder. See the paper for more details. The model is also well documented on the Hugging Face platform. First, we load the pretrained model:
# installation
!pip install transformers

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# checkpoint trained on printed text
model_version = "microsoft/trocr-base-printed"
processor = TrOCRProcessor.from_pretrained(model_version)
model = VisionEncoderDecoderModel.from_pretrained(model_version)
Before passing the image to the model, we need to resize and normalize it, which the processor does for us. Once the image has been converted, we can extract the text using the .generate() method.
image = Image.open(img_path1).convert("RGB")
# resize and normalize the image into a tensor
pixel_values = processor(image, return_tensors="pt").pixel_values
# generate token ids and decode them into text
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print('output: ', extract_text)
# output: 2.50
Unlike the previous libraries, it returned a meaningless number. Why? TrOCR contains only a recognition model, not a detection model. To solve the OCR task, you first need to detect the text regions in the image and then recognize the text inside each region. Since TrOCR only handles the last step, it performs poorly on a full image. To make it work, it’s best to crop a specific part of the image using a bounding box, like this:
# crop box is (left, upper, right, lower)
crp_image = image.crop((750, 3.4, 970, 33.94))
display(crp_image)
Then we try to apply the model again:
pixel_values = processor(crp_image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(extract_text)
This operation can be applied repeatedly to every word/phrase contained in the image.
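Repeating the crop-and-recognize step over many boxes can be sketched as a small loop. The two helper names below, `recognize_regions` and `make_trocr_recognizer`, are hypothetical; the boxes are assumed to be (left, upper, right, lower) tuples, for example produced by one of the detection-capable libraries covered earlier.

```python
def recognize_regions(image, boxes, recognize):
    """Crop each (left, upper, right, lower) box and run `recognize` on the crop."""
    results = []
    for box in boxes:
        crop = image.crop(box)  # PIL-style crop
        results.append((box, recognize(crop)))
    return results


def make_trocr_recognizer(processor, model):
    """Wrap an already loaded TrOCR processor/model pair into a crop -> text function."""
    def recognize(crop):
        pixel_values = processor(crop.convert("RGB"), return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return recognize
```

With the objects loaded above, a call such as `recognize_regions(image, [(750, 3.4, 970, 33.94)], make_trocr_recognizer(processor, model))` would return one (box, text) pair per region. Keeping the recognizer as a plain function argument also makes it easy to swap in a different recognition model.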
5. docTR
Finally, we cover the last Python package for detecting and recognizing text in documents: docTR. It can read a document as a PDF or an image and then pass it through a two-stage pipeline. In docTR, a text detection model (DBNet or LinkNet) is followed by a CRNN model for text recognition. Because these models are deep learning models, the library requires either TensorFlow or PyTorch to be installed.
!pip install python-doctr
# for TensorFlow
!pip install "python-doctr[tf]"
# for PyTorch
!pip install "python-doctr[torch]"
Then we import the relevant docTR modules and load the model, which is a two-stage pipeline. We need to specify the architectures for text detection and text recognition: here, DBNet with a ResNet-50 backbone and CRNN with a VGG-16 backbone, respectively:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50',
                      reco_arch='crnn_vgg16_bn',
                      pretrained=True)
We finally read the file, use the pretrained model, and export the output as a nested dictionary:
# read file
img = DocumentFile.from_images(img_path1)
# use pre-trained model
result = model(img)
# export the result as a nested dict
extract_info = result.export()
The output is very long:
{'pages': [{'page_idx': 0, 'dimensions': (678, 1024), 'orientation': {'value': None, 'confidence': None},...
To make the output easier to read, we can use nested loops to keep only the information we’re interested in:
# walk blocks -> lines -> words of the first page
for obj1 in extract_info['pages'][0]["blocks"]:
    for obj2 in obj1["lines"]:
        for obj3 in obj2["words"]:
            print("{}: {}".format(obj3["geometry"], obj3["value"]))
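The geometry values printed by this loop are relative coordinates between 0 and 1 rather than pixels. A minimal sketch for converting them is shown below, assuming that each word's `geometry` is a pair of corner points ((xmin, ymin), (xmax, ymax)) and that the page `dimensions` are given as (height, width); both are assumptions about the export layout and may differ across docTR versions or settings. The function name `words_in_pixels` is hypothetical.

```python
def words_in_pixels(export):
    """Flatten a docTR-style export dict into (text, pixel_box) pairs."""
    words = []
    for page in export["pages"]:
        height, width = page["dimensions"]  # assumed order: (height, width)
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    # assumed layout: two relative corner points
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    box = (round(xmin * width), round(ymin * height),
                           round(xmax * width), round(ymax * height))
                    words.append((word["value"], box))
    return words
```

Calling `words_in_pixels(extract_info)` would give boxes in the same (left, upper, right, lower) pixel form used for cropping earlier in the article.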
docTR is another good option for extracting valuable information from images or PDFs.
Conclusion
Each of the five tools has advantages and disadvantages. When choosing one of these packages, first consider the language of the data you are analyzing. If you work with non-English languages, EasyOCR may be the most suitable choice because of its wider language coverage and better performance.
Disclaimer: This dataset is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).