Pix2Text – extract formulas from images (similar to Mathpix)

This article is reproduced and adapted from: https://github.com/breezedeus/Pix2Text

Article directory

    • About Pix2Text
    • Install
    • Simple call
    • Example of recognition effect
    • Model download
    • Interface Description
      • 1. Class initialization
      • 2. Recognition function
    • Script usage
      • Recognize a single image or images in a single folder
    • HTTP service
      • Command line
      • Python
      • Other languages
    • Script to run
    • Give the author a cup of coffee

About Pix2Text

  • github: https://github.com/breezedeus/Pix2Text
  • Web version: https://p2t.behye.com (suitable for those who are not familiar with Python)

Pix In, LaTeX & Text Out. Recognize Chinese, English Texts, and Math Formulas from Images.

  • Pix2Text expects to be a free and open-source Python alternative to Mathpix, and it can already cover Mathpix's core functionality.
    Mathpix: https://mathpix.com/
  • Starting from V0.2, Pix2Text (P2T) supports recognition of mixed images that contain both text and formulas, with output similar to Mathpix.
    The core principle of P2T is shown in the figure below (text recognition supports Chinese and English):

P2T uses the open-source tool CnSTD to detect the positions of mathematical formulas in the image, then feeds each detected region to LaTeX-OCR to obtain the LaTeX representation of that formula. The remaining parts of the image are passed to CnOCR for text detection and recognition. Finally, P2T merges all recognition results to produce the final output for the image. Thanks to these open-source tools.
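The final merge step of this pipeline can be sketched in a few lines. This is an illustration only, not P2T's actual implementation: `merge_results` and the sample dicts below are hypothetical, but the dict shape matches the output format documented later in this article.

```python
# Minimal sketch of the final merge step: combine formula and text results
# and order them top-to-bottom, left-to-right by their bounding boxes.
# Illustrative only -- not Pix2Text's real code.

def merge_results(formula_results, text_results):
    """Merge recognition results and sort them into reading order."""
    merged = formula_results + text_results
    # Each result carries a `position` polygon [[x, y], ...]; sort by the
    # top-left corner: first by y (row), then by x (column).
    merged.sort(key=lambda r: (r['position'][0][1], r['position'][0][0]))
    return merged

formulas = [{'type': 'isolated', 'text': '$$E=mc^2$$',
             'position': [[629, 124], [1389, 124], [1389, 183], [629, 183]]}]
texts = [{'type': 'text', 'text': 'As shown below,',
          'position': [[22, 29], [1055, 29], [1055, 56], [22, 56]]}]

outs = merge_results(formulas, texts)
print([o['type'] for o in outs])  # the text block comes first (smaller y)
```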

As a Python3 toolkit, P2T is not very friendly to users unfamiliar with Python. We will release the P2T web version in the near future; you will then be able to get P2T's parsing results simply by dropping an image onto the web page.

The web version will provide some free quotas for those in need, giving priority to students (Mathpix costs $5 per month, which is quite expensive for students).

Install

pip install pix2text

# Specify a domestic (China) mirror
pip install pix2text -i https://pypi.doubanio.com/simple
  • If you are using OpenCV for the first time, the installation may not go smoothly.
  • Pix2Text mainly depends on CnSTD>=1.2.1, CnOCR>=2.2.2.1, and LaTeX-OCR. If you encounter problems during installation, you can also refer to their installation instructions.

Simple call

Calling it is simple; here is an example:

from pix2text import Pix2Text

img_fp = './docs/examples/formula.jpg'
p2t = Pix2Text(analyzer_config=dict(model_name='mfd'))
outs = p2t(img_fp, resized_shape=600) # You can also use `p2t.recognize(img_fp)` to get the same result
print(outs)
# If you only need the recognized text and LaTeX representation, you can use the following line to combine all results
only_text = '\n'.join([out['text'] for out in outs])

The returned result outs is a list of dicts; in each dict, the key position gives position information, type gives the category, and text gives the recognition result. For details, see the interface description below.
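Because the result is just a list of dicts, standard Python is enough to post-process it. For example (using a hand-written sample result in the documented format, not a real model output):

```python
# Post-process a recognition result: extract only formula blocks, or join
# everything into one string. `outs` is a hand-written sample here.
outs = [
    {'type': 'text', 'text': 'The loss is',
     'position': [[22, 29], [1055, 29], [1055, 56], [22, 56]]},
    {'type': 'isolated', 'text': '$$L = -\\log p(x)$$',
     'position': [[629, 124], [1389, 124], [1389, 183], [629, 183]]},
]

# All LaTeX blocks (the `isolated` and `text-embed` types contain LaTeX):
formulas = [o['text'] for o in outs if o['type'] in ('isolated', 'text-embed')]

# Everything joined into one string, as in the example above:
only_text = '\n'.join(o['text'] for o in outs)
print(formulas)
print(only_text)
```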

Example of recognition effect

[{"position": array([[  22,   29],
       [1055,   29],
       [1055,   56],
       [  22,   56]], dtype=float32),
  "text": "The training loss of JVAE is similar to that of VQ-VAE, but the KL distance is used to make the distribution as scattered as possible",
  "type": "text"},
 {"position": array([[ 629,  124],
       [1389,  124],
       [1389,  183],
       [ 629,  183]]),
  "text": "$$\n"
          "-{\cal E}_{z\sim q(z|x)}[\log(p(x\mid z))]"
          " + {\cal K}{\cal L}(q(z\mid x)||p(z))\n"
          "$$",
  "type": "isolated"},
 {"position": array([[  20,  248],
       [1297,  248],
       [1297,  275],
       [  20,  275]], dtype=float32),
  "text": "These are sampled from $z\sim q(z|x)$ using Gumbel-Softmax,"
          " $p(z)$ is a multinomial distribution with equal probability.",
  "type": "text-embed"}]

[{"position": array([[ 12,  19],
       [749,  19],
       [749, 150],
       [ 12, 150]]),
  "text": "$$\n"
          "\mathcal{L}_{\mathrm{eyelid}}~\equiv~"
          "\sum_{t=1}^{T}\sum_{v=1}^{V}"
          "\mathcal{N}_{U}^{\mathrm{(eyelid)}}"
          "\left(\left|\left|\hat{h}_{t,v}\,-\,"
          "\mathcal{x}_{t,v}\right|\right|^{2}\right)\n"
          "$$",
  "type": "isolated"}]

[{"position": array([[  0,   0],
       [710,   0],
       [710, 116],
       [  0, 116]]),
  "text": "python scripts/screenshot_daemon_with_server\n"
          "2-get_model:178usemodel:/Users/king/.cr\n"
          "enet_lite_136-fc-epoch=039-complete_match_er",
  "type": "english"}]

[{"position": array([[  0,   0],
       [800,   0],
       [800, 800],
       [  0, 800]]),
  "text": "618\nGood start to buy in advance\nVery expensive\nBuy expensive and return poor"
          "\nFinally the price has been reduced\n100% mulberry silk\nBuy it early\n"
          "Today's order is 188 yuan\nOnly for one day",
  "type": "general"}]

Model download

After installing Pix2Text, the model files are downloaded automatically on first use and saved in the ~/.pix2text directory (the default path on Windows is C:\Users\<username>\AppData\Roaming\pix2text).

note

If the above example has been successfully run, it means that the model has been downloaded automatically, and the rest of this section can be ignored.

For the classification model, the system will automatically download the model file mobilenet_v2.zip and decompress it, then put the decompressed model directory under ~/.pix2text.
If the system cannot download mobilenet_v2.zip automatically, you need to download the zip file manually from cnstd-cnocr-models/pix2text and put it in the ~/.pix2text directory. If the download is too slow, you can also get it from Baidu Cloud Disk; the extraction code is p2t0.

For LaTeX-OCR, the system will also automatically download the model files and store them in the ~/.pix2text/formula directory. If the system cannot download these files automatically, download weights.pth and image_resizer.pth from Baidu Cloud Disk and store them in the ~/.pix2text/formula directory; the extraction code is p2t0.
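Before resorting to a manual download, a small helper like the following can check whether the expected formula-model files are already in place. The file names come from the text above; the helper itself is just a convenience sketch, not part of Pix2Text.

```python
from pathlib import Path

# The formula-model files named above; the directory layout follows the
# default described in this section (~/.pix2text/formula).
EXPECTED = ('weights.pth', 'image_resizer.pth')

def missing_model_files(formula_dir):
    """Return the expected model files that are not yet present."""
    d = Path(formula_dir).expanduser()
    return [name for name in EXPECTED if not (d / name).exists()]

print(missing_model_files('~/.pix2text/formula'))
```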

Interface Description

1. Class initialization

The main class is Pix2Text , and its initialization function is as follows:

class Pix2Text(object):

    def __init__(
        self,
        *,
        analyzer_config: Dict[str, Any] = None,
        clf_config: Dict[str, Any] = None,
        general_config: Dict[str, Any] = None,
        english_config: Dict[str, Any] = None,
        formula_config: Dict[str, Any] = None,
        thresholds: Dict[str, Any] = None,
        device: str = 'cpu', # ['cpu', 'cuda', 'gpu']
        **kwargs,
    ):

The parameters are described as follows:

  • analyzer_config (dict): The configuration for the Analyzer model; the default is None, which means using the default configuration (the MFD Analyzer):
 {
        'model_name': 'mfd'  # can be 'mfd' (MFD), or 'layout' (layout analysis)
    }
  • clf_config (dict): The configuration information corresponding to the classification model; the default is None, indicating that the default configuration is used:
 {
       'base_model_name': 'mobilenet_v2',
       'categories': IMAGE_TYPES,
       'transform_configs': {
           'crop_size': [150, 450],
           'resize_size': 160,
           'resize_max_size': 1000,
       },
       'model_dir': Path(data_dir()) / 'clf',
       'model_fp': None  # If specified, use this model file directly
  }
  • general_config (dict): The configuration information corresponding to the general model; the default is None, indicating that the default configuration is used:

    {}
    
  • english_config (dict): The configuration information corresponding to the English model; the default is None, indicating that the default configuration is used:
 {'det_model_name': 'en_PP-OCRv3_det', 'rec_model_name': 'en_PP-OCRv3'}
  • formula_config (dict): The configuration information corresponding to the formula recognition model; the default is None, indicating that the default configuration is used:
 {
      'config': LATEX_CONFIG_FP,
      'checkpoint': Path(data_dir()) / 'formular' / 'weights.pth',
      'no_resize': False
  }
  • thresholds (dict): The configuration information corresponding to the recognition thresholds; the default is None, indicating that the default configuration is used:
 {
      'formula2general': 0.65,  # If a block is recognized as `formula` type but its score is below this threshold, it is changed to `general` type
      'english2general': 0.75,  # If a block is recognized as `english` type but its score is below this threshold, it is changed to `general` type
  }
  • device (str): which device to use for computation, one of ['cpu', 'cuda', 'gpu']; default is cpu
  • **kwargs (): Other parameters reserved; currently unused
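The thresholds parameter describes a simple fallback rule. Its effect can be illustrated as follows; this is a sketch of the documented behavior, not the library's internal code, and the scores are made up:

```python
# Illustration of the documented threshold fallback: a low-confidence
# `formula` or `english` classification is downgraded to `general`.
THRESHOLDS = {'formula2general': 0.65, 'english2general': 0.75}

def apply_thresholds(image_type, score, thresholds=THRESHOLDS):
    """Downgrade a low-confidence classification to 'general'."""
    key = f'{image_type}2general'
    if key in thresholds and score < thresholds[key]:
        return 'general'
    return image_type

print(apply_thresholds('formula', 0.60))  # below 0.65 -> 'general'
print(apply_thresholds('english', 0.80))  # above 0.75 -> stays 'english'
```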

2. Recognition function

Recognize a specified image by calling the .recognize() method of the Pix2Text class. The method is described as follows:

 def recognize(
        self, img: Union[str, Path, Image.Image], use_analyzer: bool = True, **kwargs
    ) -> List[Dict[str, Any]]:

The input parameters are described as follows:

  • img (str or Image.Image): the path of the image to be recognized, or an image already read in with Image.open().
  • use_analyzer (bool): whether to use the Analyzer (MFD or Layout); False means the whole image is treated as plain text or a pure formula, which is equivalent to the behavior of P2T V0.1.*. Default: True.
  • kwargs: reserved; can contain the following values:
    • resized_shape (int): resize the image width to this size before processing; the default value is 700;
    • save_analysis_res (str): save the analysis result image to this file; the default value is None, which means not to save;
    • embed_sep (tuple): prefix and suffix for embedded LaTeX; only valid when using MFD; the default value is (' $', '$ ');
    • isolated_sep (tuple): prefix and suffix for isolated LaTeX; only valid when using MFD; the default value is ('$$\n', '\n$$').
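The two separator options simply control how recognized LaTeX snippets are wrapped in the returned text. A small demonstration with sample strings (the default values are taken from the list above):

```python
# How embed_sep / isolated_sep wrap recognized LaTeX (defaults from the docs).
embed_sep = (' $', '$ ')
isolated_sep = ('$$\n', '\n$$')

latex = r'\frac{1}{2}'                                 # a sample recognized formula
embedded = embed_sep[0] + latex + embed_sep[1]          # appears inline in a text line
isolated = isolated_sep[0] + latex + isolated_sep[1]    # appears as its own block
print(repr(embedded))
print(repr(isolated))
```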

The returned result is a list; each element is a dict containing the following keys:

  • type: the recognized category of the block;
    • When the Analyzer is enabled (use_analyzer==True), the value is text (plain text), isolated (a mathematical formula on its own line) or text-embed (a text line containing embedded mathematical formulas);
    • When the Analyzer is disabled (use_analyzer==False), the value is formula (a pure mathematical formula), english (pure English text), or general (plain text that may contain Chinese and English);
  • text: the recognized text or LaTeX expression;
  • position: the location of the block, an np.ndarray with shape [4, 2].
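The position polygon can be reduced to an axis-aligned bounding box in a few lines. A pure-Python sketch (real results store the polygon as an np.ndarray, but the same code works on nested lists):

```python
def to_bbox(position):
    """Convert a [4, 2] polygon (list of [x, y] points) to (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in position]
    ys = [p[1] for p in position]
    return min(xs), min(ys), max(xs), max(ys)

print(to_bbox([[22, 29], [1055, 29], [1055, 56], [22, 56]]))  # (22, 29, 1055, 56)
```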

The Pix2Text class also implements the __call__() method, which behaves exactly like .recognize(). So the following also works:

from pix2text import Pix2Text

img_fp = './docs/examples/formula.jpg'
p2t = Pix2Text(analyzer_config=dict(model_name='mfd'))
outs = p2t(img_fp, resized_shape=600) # You can also use `p2t.recognize(img_fp)` to get the same result
print(outs)
# If you only need the recognized text and LaTeX representation, you can use the following line to combine all results
only_text = '\n'.join([out['text'] for out in outs])

Script usage

P2T includes the following command line tools.

Recognize a single image or images in a single folder

Use the command p2t predict to run prediction on a single image file or on all images in a folder; usage:

$ p2t predict -h
Usage: p2t predict [OPTIONS]

model prediction

  • --use-analyzer / --no-use-analyzer
    Whether to use the MFD or layout Analyzer [default: use-analyzer]
  • -a, --analyzer-name [mfd|layout]
    Which Analyzer to use: MFD or layout analysis [default: mfd]
  • -t, --analyzer-type TEXT
    Which model the Analyzer uses: 'yolov7_tiny' or 'yolov7' [default: yolov7_tiny]
  • -d, --device TEXT
    Use cpu or gpu to run the code, or specify a particular GPU, such as cuda:0 [default: cpu]
  • --resized-shape INTEGER
    Resize the image width to this size before processing [default: 600]
  • -i, --img-file-or-dir TEXT
    Path of the input image file or of a folder of images [required]
  • --save-analysis-res TEXT
    Store the analysis results in this file or directory (if --img-file-or-dir is a file/folder, --save-analysis-res should also be a file/folder). A value of None means no storage
  • -l, --log-level TEXT
    Log level, such as INFO, DEBUG [default: INFO]
  • -h, --help
    Show this message and exit.

HTTP service

Pix2Text provides an HTTP service based on FastAPI. Enabling the service requires installing several additional packages, which can be done with the following command:

$ pip install pix2text[serve]

After the installation is complete, you can start the HTTP service with the following command (the number after -p is the port; adjust it as needed):

$ p2t serve -p 8503

p2t serve command instructions:

$ p2t serve -h
Usage: p2t serve [OPTIONS]

  Enable the HTTP service.

Options:
  -H, --host TEXT     server host [default: 0.0.0.0]
  -p, --port INTEGER  server port [default: 8503]
  --reload            whether to reload the server when the code has been changed
  -h, --help          Show this message and exit.

After the service is enabled, the service can be called in the following ways.

Command line

For example, if the file to be recognized is docs/examples/mixed.jpg, use curl to call the service as follows:

$ curl -F image=@docs/examples/mixed.jpg --form 'use_analyzer=true' --form 'resized_shape=600' http://0.0.0.0:8503/pix2text

Python

Use the following method to call the service, refer to the file scripts/try_service.py:

import requests

url = 'http://0.0.0.0:8503/pix2text'

image_fp = 'docs/examples/mixed.jpg'
data = {
    "use_analyzer": True,
    "resized_shape": 600,
    "embed_sep": "$,$",
    "isolated_sep": "$$\n, \n$$"
}
files = {
    "image": (image_fp, open(image_fp, 'rb'))
}

r = requests.post(url, data=data, files=files)

outs = r.json()['results']
only_text = '\n'.join([out['text'] for out in outs])
print(f'{only_text=}')

Other languages

Please refer to the curl example above and implement the call yourself.

Script to run

The script scripts/screenshot_daemon.py automatically calls Pix2Text to perform formula or text recognition on new screenshots. How does it work?

The following is the specific operation process (please install Pix2Text first):

1. Find a screen-capture tool you like; the only requirement is that it can save screenshots to a specified folder. For example, the free Xnip on Mac works well.

2. In addition to installing Pix2Text, you also need to install an additional Python package, pyperclip, which the script uses to copy the recognition results to the system clipboard:

$ pip install pyperclip

3. Download the script file scripts/screenshot_daemon.py, edit the line containing "SCREENSHOT_DIR" (line 17), and change the path to the directory where your screenshots are stored.

4. Run this script:

$ python scripts/screenshot_daemon.py

Now try it out with your screen-capture tool: the recognition result of each screenshot is written to the clipboard and can be pasted directly with Ctrl-V / Cmd-V.
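The core of such a daemon is just a polling loop over the screenshot directory. A stdlib-only sketch of that loop's bookkeeping is shown below; the function name and structure are illustrative, and the actual script additionally calls Pix2Text and pyperclip, which are omitted here.

```python
from pathlib import Path

# Bookkeeping for a screenshot daemon: return image files we haven't seen yet.
# A real daemon would call this in a loop (with a short time.sleep), run
# Pix2Text on each new file, and copy the result to the clipboard via pyperclip.
IMAGE_EXTS = {'.png', '.jpg', '.jpeg'}

def find_new_images(screenshot_dir, seen):
    """Return new image paths in screenshot_dir and record them in `seen`."""
    new = []
    for p in sorted(Path(screenshot_dir).iterdir()):
        if p.suffix.lower() in IMAGE_EXTS and p not in seen:
            seen.add(p)
            new.append(p)
    return new
```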

For a more detailed introduction, please refer to the video: “Pix2Text: A Free Python Open Source Tool to Replace Mathpix”.

Give the author a cup of coffee

Open source is not easy. If this project is helpful to you, consider buying the author a coffee:
https://cnocr.readthedocs.io/zh/latest/buymeacoffee/

2023-03-21