This article is reproduced and adapted from: https://github.com/breezedeus/Pix2Text
Article directory

- About Pix2Text
- Install
- Simple call
- Recognition examples
- Model download
- Interface Description
  - 1. Class initialization
  - 2. Recognition function
- Script usage
  - Recognize a single image or images in a folder
  - HTTP service
    - Command line
    - Python
    - Other languages
  - Screenshot recognition script
- Give the author a cup of coffee
About Pix2Text
- github: https://github.com/breezedeus/Pix2Text
- Web version: https://p2t.behye.com (suitable for those not familiar with Python)

Pix In, LaTeX & Text Out. Recognize Chinese and English text, as well as math formulas, from images.

- Starting from V0.2, Pix2Text (P2T) supports recognition of mixed images that contain both text and formulas, with output similar to Mathpix.
- Pix2Text aims to be a free and open-source Python alternative to Mathpix (https://mathpix.com/), and it already covers Mathpix's core functionality.

The core principle of P2T is shown in the figure below (text recognition supports Chinese and English):

P2T uses the open-source tool CnSTD to detect the positions of mathematical formulas in an image, then passes each detected region to LaTeX-OCR to obtain the LaTeX representation of the formula at that position. The rest of the image is passed to CnOCR for text detection and recognition. Finally, P2T merges all recognition results to produce the final output. Thanks to these open-source tools.
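The data flow just described can be sketched with stand-in functions. Everything below is a stub to show how the pieces fit together; the real detectors come from CnSTD, LaTeX-OCR, and CnOCR, and the stub return values are invented for illustration:

```python
def detect_formula_regions(image):
    """Stand-in for CnSTD's MFD: return bounding boxes of formula regions."""
    return [((629, 124), (1389, 183))]  # illustrative coordinates

def recognize_latex(image, region):
    """Stand-in for LaTeX-OCR: return the LaTeX for one formula region."""
    return r'\frac{a}{b}'  # illustrative output

def recognize_text(image):
    """Stand-in for CnOCR: return recognized text outside formula regions."""
    return ['some recognized text']  # illustrative output

def p2t_pipeline(image):
    """Merge formula and text results, mirroring P2T's described flow."""
    results = []
    for region in detect_formula_regions(image):
        results.append({'type': 'isolated', 'text': recognize_latex(image, region)})
    for text in recognize_text(image):
        results.append({'type': 'text', 'text': text})
    return results
```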
As a Python3 toolkit, P2T is not very friendly to users unfamiliar with Python. We will release the P2T web version in the near future, which outputs P2T's parsing results directly from images dropped onto a web page.

The web version will provide some free quotas for those in need, with priority given to students (Mathpix costs $5 per month, which is quite expensive for students).
Install
```bash
pip install pix2text
# To use a mirror inside China:
pip install pix2text -i https://pypi.doubanio.com/simple
```

- If you are installing OpenCV for the first time, the installation may not go smoothly.
- Pix2Text mainly depends on CnSTD>=1.2.1, CnOCR>=2.2.2.1, and LaTeX-OCR. If you run into problems during installation, you can also refer to their installation instructions.
Simple call
Usage is simple; here is an example:

```python
from pix2text import Pix2Text

img_fp = './docs/examples/formula.jpg'
p2t = Pix2Text(analyzer_config=dict(model_name='mfd'))
outs = p2t(img_fp, resized_shape=600)  # equivalent: p2t.recognize(img_fp, resized_shape=600)
print(outs)
# If you only need the recognized text and LaTeX, join all results:
only_text = '\n'.join([out['text'] for out in outs])
```

The returned result `outs` is a list; each element is a `dict` in which the key `position` holds the position information, `type` the category, and `text` the recognition result. For details, see the interface description below.
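The per-element structure can be post-processed without Pix2Text itself. The following sketch uses hand-written sample results (shaped like Pix2Text's output, with `position` omitted for brevity) to separate standalone formulas from prose:

```python
# Sample results shaped like Pix2Text's output: each dict carries
# 'type' and 'text' keys ('position' omitted here for brevity).
sample_outs = [
    {'type': 'text', 'text': 'The training loss of JVAE is similar to that of VQ-VAE'},
    {'type': 'isolated', 'text': '$$\nE_{z\\sim q(z|x)}[\\log(p(x|z))]\n$$'},
    {'type': 'text-embed', 'text': 'where $p(z)$ is a multinomial distribution.'},
]

def split_by_type(outs):
    """Group recognized blocks: standalone formulas vs. everything else."""
    formulas = [o['text'] for o in outs if o['type'] == 'isolated']
    prose = [o['text'] for o in outs if o['type'] != 'isolated']
    return prose, formulas

prose, formulas = split_by_type(sample_outs)
only_text = '\n'.join(o['text'] for o in sample_outs)  # same merge as above
```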
Recognition examples
```python
[{"position": array([[  22,   29],
                     [1055,   29],
                     [1055,   56],
                     [  22,   56]], dtype=float32),
  "text": "The training loss of JVAE is similar to that of VQ-VAE, but the KL "
          "distance is used to make the distribution as scattered as possible",
  "type": "text"},
 {"position": array([[ 629,  124],
                     [1389,  124],
                     [1389,  183],
                     [ 629,  183]]),
  "text": "$$\n"
          "-{\cal E}_{z\sim q(z|x)}[\log(p(x\mid z))]"
          " + {\cal K}{\cal L}(q(z\mid x)||p(z))\n"
          "$$",
  "type": "isolated"},
 {"position": array([[  20,  248],
                     [1297,  248],
                     [1297,  275],
                     [  20,  275]], dtype=float32),
  "text": "These are sampled from $z\sim q(z|x)$ using Gumbel-Softmax,"
          " $p(z)$ is a multinomial distribution with equal probability.",
  "type": "text-embed"}]
```

```python
[{"position": array([[ 12,  19],
                     [749,  19],
                     [749, 150],
                     [ 12, 150]]),
  "text": "$$\n"
          "\mathcal{L}_{\mathrm{eyelid}}~\equiv~"
          "\sum_{t=1}^{T}\sum_{v=1}^{V}"
          "\mathcal{N}_{U}^{\mathrm{(eyelid)}}"
          "\left(\left|\left|\hat{h}_{t,v}\,-\,"
          "\mathcal{x}_{t,v}\right|\right|^{2}\right)\n"
          "$$",
  "type": "isolated"}]
```

```python
[{"position": array([[  0,   0],
                     [710,   0],
                     [710, 116],
                     [  0, 116]]),
  "text": "python scripts/screenshot_daemon_with_server\n"
          "2-get_model:178usemodel:/Users/king/.cr\n"
          "enet_lite_136-fc-epoch=039-complete_match_er",
  "type": "english"}]
```

```python
[{"position": array([[  0,   0],
                     [800,   0],
                     [800, 800],
                     [  0, 800]]),
  "text": "618\nGood start to buy in advance\nVery expensive\n"
          "Buy expensive and return poor\nFinally the price has been reduced\n"
          "100% mulberry silk\nBuy it early\nToday's order is 188 yuan\n"
          "Only for one day",
  "type": "general"}]
```
Model download
After installing Pix2Text, the model files are downloaded automatically on first use and saved in the `~/.pix2text` directory (under Windows the default location is inside `C:\Users\`).

**Note**: If the example above ran successfully, the models have already been downloaded automatically and the rest of this section can be skipped.

For the classification model, the system automatically downloads the `mobilenet_v2.zip` file, decompresses it, and places the resulting model directory under `~/.pix2text`. If the system cannot download `mobilenet_v2.zip` automatically, you need to download the zip file manually from cnstd-cnocr-models/pix2text and put it in the `~/.pix2text` directory. If the download is too slow, you can also get it from Baidu Cloud Disk (extraction code: `p2t0`).

For LaTeX-OCR, the system likewise downloads the model files automatically and stores them in the `~/.pix2text/formula` directory. If the automatic download fails, download the files `weights.pth` and `image_resizer.pth` from Baidu Cloud Disk (extraction code: `p2t0`) and store them in the `~/.pix2text/formula` directory.
Interface Description
1. Class initialization
The main class is `Pix2Text`, and its initialization function is as follows:

```python
class Pix2Text(object):
    def __init__(
        self,
        *,
        analyzer_config: Dict[str, Any] = None,
        clf_config: Dict[str, Any] = None,
        general_config: Dict[str, Any] = None,
        english_config: Dict[str, Any] = None,
        formula_config: Dict[str, Any] = None,
        thresholds: Dict[str, Any] = None,
        device: str = 'cpu',  # ['cpu', 'cuda', 'gpu']
        **kwargs,
    ):
```
The parameters are described as follows:

- `analyzer_config` (dict): configuration for the analyzer model; the default is `None`, which means the default configuration is used (the MFD Analyzer):

  ```python
  {
      'model_name': 'mfd'  # can be 'mfd' (MFD) or 'layout' (layout analysis)
  }
  ```

- `clf_config` (dict): configuration for the classification model; the default is `None`, meaning the default configuration is used:

  ```python
  {
      'base_model_name': 'mobilenet_v2',
      'categories': IMAGE_TYPES,
      'transform_configs': {
          'crop_size': [150, 450],
          'resize_size': 160,
          'resize_max_size': 1000,
      },
      'model_dir': Path(data_dir()) / 'clf',
      'model_fp': None,  # if specified, this model file is used directly
  }
  ```

- `general_config` (dict): configuration for the general recognition model; the default is `None`, meaning the default configuration is used: `{}`

- `english_config` (dict): configuration for the English recognition model; the default is `None`, meaning the default configuration is used:

  ```python
  {'det_model_name': 'en_PP-OCRv3_det', 'rec_model_name': 'en_PP-OCRv3'}
  ```

- `formula_config` (dict): configuration for the formula recognition model; the default is `None`, meaning the default configuration is used:

  ```python
  {
      'config': LATEX_CONFIG_FP,
      'checkpoint': Path(data_dir()) / 'formular' / 'weights.pth',
      'no_resize': False,
  }
  ```

- `thresholds` (dict): recognition thresholds; the default is `None`, meaning the default configuration is used:

  ```python
  {
      # if recognized as `formula` but the score is below this threshold, reclassify as `general`
      'formula2general': 0.65,
      # if recognized as `english` but the score is below this threshold, reclassify as `general`
      'english2general': 0.75,
  }
  ```
- `device` (str): which resource to use for computation; supports `['cpu', 'cuda', 'gpu']`; default: `'cpu'`
- `**kwargs`: other reserved parameters; currently unused
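The `thresholds` rule above can be expressed as a small stand-alone function. This is a sketch of the documented downgrade behavior, not Pix2Text's actual implementation, using the default threshold values:

```python
# Default thresholds as documented above.
DEFAULT_THRESHOLDS = {'formula2general': 0.65, 'english2general': 0.75}

def adjust_type(image_type, score, thresholds=DEFAULT_THRESHOLDS):
    """Downgrade low-confidence `formula`/`english` classifications to `general`."""
    if image_type == 'formula' and score < thresholds['formula2general']:
        return 'general'
    if image_type == 'english' and score < thresholds['english2general']:
        return 'general'
    return image_type
```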
2. Recognition function

Recognize a given image by calling the class method `.recognize()` of the `Pix2Text` class. The method is declared as follows:

```python
def recognize(
    self,
    img: Union[str, Path, Image.Image],
    use_analyzer: bool = True,
    **kwargs,
) -> List[Dict[str, Any]]:
```
The input parameters are described as follows:

- `img` (`str` or `Image.Image`): the path of the image to be recognized, or an `Image` already loaded with `Image.open()`.
- `use_analyzer` (`bool`): whether to use the Analyzer (MFD or Layout). `False` means the image is treated as plain text or a pure image, which is equivalent to the behavior of P2T V0.1.*. Default: `True`.
- `**kwargs`: reserved field; may contain the following values:
  - `resized_shape` (`int`): resize the image width to this size before processing; default: `700`;
  - `save_analysis_res` (`str`): save the analysis result image to this file; default: `None`, meaning do not save;
  - `embed_sep` (`tuple`): prefix and suffix for embedded LaTeX; only effective when using `MFD`; default: `(' $', '$ ')`;
  - `isolated_sep` (`tuple`): prefix and suffix for isolated LaTeX; only effective when using `MFD`; default: `('$$\n', '\n$$')`.
The returned result is a list (`list`); each element in the list is a `dict` containing the following keys:

- `type`: the recognized category of the block;
  - when the Analyzer is enabled (`use_analyzer==True`), the value is `text` (plain text), `isolated` (a standalone-line mathematical formula), or `text-embed` (a text line containing embedded mathematical formulas);
  - when the Analyzer is disabled (`use_analyzer==False`), the value is `formula` (pure mathematical formula), `english` (pure English text), or `general` (plain text, possibly containing both Chinese and English);
- `text`: the recognized text or LaTeX expression;
- `position`: the location of the block, an `np.ndarray` with shape `[4, 2]`.
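As a quick illustration of the `position` format, the following sketch converts a `[4, 2]` corner array (sample values, not real model output) into an axis-aligned bounding box:

```python
import numpy as np

def bounding_box(position):
    """Convert a [4, 2] corner array of (x, y) points to (x_min, y_min, x_max, y_max)."""
    xs, ys = position[:, 0], position[:, 1]
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

# Corners in the same layout Pix2Text returns: one (x, y) row per corner.
pos = np.array([[22, 29], [1055, 29], [1055, 56], [22, 56]], dtype=np.float32)
box = bounding_box(pos)  # (22.0, 29.0, 1055.0, 56.0)
```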
The `Pix2Text` class also implements the `__call__()` method, which behaves exactly like `.recognize()`. Hence the following call is equivalent:

```python
from pix2text import Pix2Text

img_fp = './docs/examples/formula.jpg'
p2t = Pix2Text(analyzer_config=dict(model_name='mfd'))
outs = p2t(img_fp, resized_shape=600)  # same result as p2t.recognize(img_fp, resized_shape=600)
print(outs)
only_text = '\n'.join([out['text'] for out in outs])
```
Script usage
P2T includes the following command-line tools.

Recognize a single image or images in a folder

Use the command `p2t predict` to recognize a single image file or all images in a folder. Usage:
```bash
$ p2t predict -h
Usage: p2t predict [OPTIONS]

  Model prediction

Options:
  --use-analyzer / --no-use-analyzer
                                  whether to use the MFD or layout analyzer
                                  [default: use-analyzer]
  -a, --analyzer-name [mfd|layout]
                                  which Analyzer to use, MFD or layout
                                  analysis  [default: mfd]
  -t, --analyzer-type TEXT        which model the Analyzer uses, 'yolov7_tiny'
                                  or 'yolov7'  [default: yolov7_tiny]
  -d, --device TEXT               use `cpu` or `gpu` to run the code, or a
                                  specific GPU such as `cuda:0`  [default: cpu]
  --resized-shape INTEGER         resize the image width to this size before
                                  processing  [default: 600]
  -i, --img-file-or-dir TEXT      file path of the input image, or a folder
                                  [required]
  --save-analysis-res TEXT        save the analysis results to this file or
                                  directory (if `--img-file-or-dir` is a
                                  file/folder, `--save-analysis-res` should
                                  also be a file/folder); `None` means no
                                  saving
  -l, --log-level TEXT            log level, such as `INFO`, `DEBUG`
                                  [default: INFO]
  -h, --help                      Show this message and exit.
```
HTTP service
Pix2Text provides an HTTP service based on FastAPI. Enabling the service requires a few additional packages, which can be installed with:

```bash
$ pip install pix2text[serve]
```

After installation, start the HTTP service with the following command (the number after `-p` is the port; adjust it as needed):

```bash
$ p2t serve -p 8503
```

`p2t serve` command instructions:

```bash
$ p2t serve -h
Usage: p2t serve [OPTIONS]

  Start the HTTP service.

Options:
  -H, --host TEXT     server host  [default: 0.0.0.0]
  -p, --port INTEGER  server port  [default: 8503]
  --reload            whether to reload the server when the code changes
  -h, --help          Show this message and exit.
```
After the service is enabled, the service can be called in the following ways.
Command line
For example, if the file to be recognized is `docs/examples/mixed.jpg`, call the service with `curl` as follows:

```bash
$ curl -F image=@docs/examples/mixed.jpg --form 'use_analyzer=true' --form 'resized_shape=600' http://0.0.0.0:8503/pix2text
```
Python
Call the service as follows; see also the file scripts/try_service.py:

```python
import requests

url = 'http://0.0.0.0:8503/pix2text'
image_fp = 'docs/examples/mixed.jpg'
data = {
    "use_analyzer": True,
    "resized_shape": 600,
    "embed_sep": "$,$",
    "isolated_sep": "$$\n, \n$$",
}
files = {
    "image": (image_fp, open(image_fp, 'rb')),
}

r = requests.post(url, data=data, files=files)
outs = r.json()['results']
only_text = '\n'.join([out['text'] for out in outs])
print(f'{only_text=}')
```
Other languages
Please refer to the `curl` call above and implement the request in your language of choice.
Screenshot recognition script

The script scripts/screenshot_daemon.py automatically calls Pix2Text to recognize formulas or text in screenshots. How does this work?

The following is the specific procedure (please install Pix2Text first):

1. Find a screen-capture tool you like; any tool that can save screenshots to a specified folder will do. For example, the free Xnip on Mac works well.
2. Besides Pix2Text, install the extra Python package pyperclip, which the script uses to copy recognition results to the system clipboard:

```bash
$ pip install pyperclip
```

3. Download the script file scripts/screenshot_daemon.py, edit the line containing `"SCREENSHOT_DIR"` (line `17`), and change the path to the directory where your screenshots are stored.
4. Run the script:

```bash
$ python scripts/screenshot_daemon.py
```

That's it. Now try it with your screen-capture tool: the recognition result of each screenshot is written to the clipboard and can be pasted directly with Ctrl-V / Cmd-V.
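The daemon's core loop can be sketched as follows. This is an illustrative rewrite, not the actual scripts/screenshot_daemon.py: it polls a folder for new images and would hand each one to a recognizer; the Pix2Text and pyperclip calls are commented out so the sketch runs standalone, and `SCREENSHOT_DIR` is a hypothetical path:

```python
import time
from pathlib import Path

SCREENSHOT_DIR = Path('~/screenshots').expanduser()  # hypothetical path; edit to your own

def new_images(directory, seen):
    """Return image files in `directory` that have not been processed yet."""
    return sorted(
        p for p in Path(directory).glob('*')
        if p.suffix.lower() in ('.png', '.jpg', '.jpeg') and p not in seen
    )

def run_daemon(poll_seconds=1.0):
    """Poll the screenshot folder and process each new image once."""
    seen = set()
    while True:
        for img in new_images(SCREENSHOT_DIR, seen):
            seen.add(img)
            # outs = p2t(str(img), resized_shape=600)              # recognize with Pix2Text
            # pyperclip.copy('\n'.join(o['text'] for o in outs))   # copy result to clipboard
        time.sleep(poll_seconds)
```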
For a more detailed introduction, please refer to the video: “Pix2Text: A Free Python Open Source Tool to Replace Mathpix”.
Give the author a cup of coffee
Open source is not easy. If this project is helpful to you, please consider buying the author a coffee:
https://cnocr.readthedocs.io/zh/latest/buymeacoffee/
2023-03-21