[Practical tool] How to decode mathematical formulas in PDF into LaTeX text?

Directory

How to decode mathematical formulas in PDF into LaTeX text?

Use C++ to call an OCR library natively to convert mathematical formulas in PDF to LaTeX

Some online PDF to LaTeX services

Using C++ to convert mathematical formulas in PDF to LaTeX


Use C++ to call an OCR library natively to convert mathematical formulas in PDF to LaTeX

To convert mathematical formulas in a PDF to LaTeX by calling an OCR library directly from C++, you can use an open-source OCR library such as Tesseract or OCRopus. The following simple example program uses the Tesseract OCR library to extract formulas from a PDF as LaTeX code.

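#include <cstdlib>
#include <iostream>
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

using namespace std;

int main() {
    // PDF file path to process
    string pdf_file = "example.pdf";

    // Rasterize the first page to example.png. Leptonica reads image formats
    // but cannot decode PDF, so we call the pdftocairo tool (it must be on PATH).
    string image_root = "example";
    string image_file = image_root + ".png";
    string command = "pdftocairo -png -singlefile -r 300 " + pdf_file + " " + image_root;
    if (system(command.c_str()) != 0) {
        cerr << "Error: pdftocairo failed." << endl;
        return 1;
    }

    // Use the Tesseract OCR library to recognize the text in the PNG image.
    // OEM_TESSERACT_ONLY needs traineddata that contains the legacy engine.
    tesseract::TessBaseAPI api;
    if (api.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY) != 0) {
        cerr << "Error: failed to initialize Tesseract." << endl;
        return 1;
    }
    api.SetPageSegMode(tesseract::PSM_AUTO);
    // '$' must be in the whitelist, otherwise the delimiter search below can never match
    api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ()+-*/=,.$");
    api.SetVariable("textord_tabfind_show_reject_blobs", "false");

    PIX* image = pixRead(image_file.c_str());
    if (!image) {
        cerr << "Error: failed to read " << image_file << endl;
        return 1;
    }
    api.SetImage(image);
    char* text = api.GetUTF8Text();
    string result = text ? text : "";
    delete[] text;
    pixDestroy(&image);
    api.End();

    // Extract every substring enclosed in a pair of '$' delimiters as a formula
    string latex = "";
    size_t pos = result.find('$');
    while (pos != string::npos) {
        size_t end_pos = result.find('$', pos + 1);
        if (end_pos != string::npos) {
            string equation = result.substr(pos + 1, end_pos - pos - 1);
            latex += "$" + equation + "$\n";  // one formula per line
            pos = result.find('$', end_pos + 1);
        } else {
            pos = string::npos;
        }
    }

    // output LaTeX code
    cout << latex << endl;

    return 0;
}

In the above code, we first rasterize the first page of the PDF file to a PNG image with the pdftocairo tool, because the Leptonica library reads image formats but cannot decode PDF. We then use the Tesseract OCR library to recognize the text in the PNG image and collect every substring enclosed in a pair of $ delimiters as a formula. Finally, we output the collected formulas to the console. Note that this only finds formulas whose $ delimiters literally appear in the recognized text: Tesseract recognizes characters and does not reconstruct LaTeX structure such as fractions or superscripts.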

Note that the above code is only a simple example and needs to be adapted to specific requirements in actual use. For example, more thorough error handling should be added to make the program stable and reliable. In addition, because of the limitations of OCR technology, the conversion result may be wrong or only partly accurate, so it requires manual proofreading and correction.
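The delimiter-extraction step of the program above can be isolated into a small function, which makes it possible to test that step without running OCR at all. The following is a minimal sketch; the function name extract_inline_math is hypothetical and not part of Tesseract or Leptonica.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collect every substring enclosed in a pair of '$'
// delimiters. An unmatched trailing '$' is ignored.
std::vector<std::string> extract_inline_math(const std::string& text) {
    std::vector<std::string> formulas;
    std::size_t pos = text.find('$');
    while (pos != std::string::npos) {
        std::size_t end_pos = text.find('$', pos + 1);
        if (end_pos == std::string::npos) {
            break;  // no closing delimiter, stop scanning
        }
        formulas.push_back(text.substr(pos + 1, end_pos - pos - 1));
        pos = text.find('$', end_pos + 1);
    }
    return formulas;
}
```

For the input "Energy $E=mc^2$ and area $a*b$ with a stray $ sign", the function returns the two strings E=mc^2 and a*b and skips the unmatched last delimiter.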

Some online PDF to LaTeX services

To decode mathematical formulas in a PDF into LaTeX text, you can use OCR (Optical Character Recognition) software or online services. OCR software converts the text in scanned pages and images into editable text, which can include the characters of mathematical formulas. Commonly used OCR software includes Adobe Acrobat, ABBYY FineReader, and Tesseract.

Some online services can also convert mathematical formulas in PDF to LaTeX text, for example:

  1. Mathpix Snip: Mathpix Snip is an OCR tool that converts mathematical formulas in PDFs to LaTeX code. Users only need to capture a screenshot of a formula in the PDF, and Mathpix Snip converts it to LaTeX code automatically. Mathpix Snip is available as a browser plugin or a desktop application.

  2. InftyReader: InftyReader is an OCR application for scientific documents that recognizes mathematical formulas and exports them as LaTeX code. It is distributed as desktop software rather than as an upload website, and it accepts scanned images as well as PDF files.

  3. Online OCR: Online OCR is a web-based OCR tool that converts the text and images in a PDF to editable text files. Users only need to upload a PDF file, and Online OCR recognizes the mathematical formulas in it and converts them to LaTeX code.

Note that OCR output can contain errors, so the conversion result needs to be checked and corrected.
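Part of that checking can be automated before manual proofreading starts. One cheap test is whether the braces in the returned LaTeX are balanced, since a dropped or extra brace is a common sign of a misrecognized formula. The sketch below is only an illustration: the function name has_balanced_braces is hypothetical, and passing this check does not mean the formula is correct.

```cpp
#include <string>

// Hypothetical sanity check: return true if every unescaped '{' is closed
// by a later '}'. A character preceded by a backslash (for example \{ or \})
// is treated as literal and skipped.
bool has_balanced_braces(const std::string& latex) {
    int depth = 0;
    for (std::size_t i = 0; i < latex.size(); ++i) {
        const char c = latex[i];
        if (c == '\\') {
            ++i;  // skip the character that follows the backslash
            continue;
        }
        if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth < 0) {
                return false;  // closing brace without an opening one
            }
        }
    }
    return depth == 0;
}
```

A formula such as \frac{a}{b} passes, while \frac{a}{b fails and can be flagged for manual review.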

Using C++ to convert mathematical formulas in PDF to LaTeX

Converting math formulas in a PDF to LaTeX code requires OCR software or an online service, because a PDF stores formulas as positioned glyphs or images rather than as LaTeX source. Here we take Mathpix as an example and write a C++ program that calls the Mathpix API to convert mathematical formulas in a PDF to LaTeX code.

First, you need to register with Mathpix and obtain API credentials. Each request is authenticated with an app ID and an app key, which can be generated in the Mathpix Snip Admin.

We can then use the libcurl library to send HTTP requests and receive responses. The following is a simple sample program, which can convert the mathematical formula in the specified PDF file into LaTeX code, and output it to the console.

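#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <curl/curl.h>
#include <nlohmann/json.hpp>

using namespace std;
using json = nlohmann::json;

// Mathpix API credentials, sent as the app_id and app_key request headers
const string APP_ID = "your_app_id_here";
const string APP_KEY = "your_app_key_here";

// Mathpix API request URL
const string API_URL = "https://api.mathpix.com/v3/text";

// CURL callback function, write the response data into the buffer.
// It is defined before send_request so that it is declared where it is used.
size_t curl_write_callback(char* buffer, size_t size, size_t nmemb, void* userdata) {
    size_t len = size * nmemb;
    static_cast<string*>(userdata)->append(buffer, len);
    return len;
}

// Read a whole file and return its contents as base64 (standard alphabet, '=' padding).
// Returns an empty string if the file cannot be read.
string base64_encode_file(const string& path) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    ifstream in(path, ios::binary);
    string bytes((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
    string out;
    for (size_t i = 0; i < bytes.size(); i += 3) {
        unsigned n = static_cast<unsigned char>(bytes[i]) << 16;
        if (i + 1 < bytes.size()) n |= static_cast<unsigned char>(bytes[i + 1]) << 8;
        if (i + 2 < bytes.size()) n |= static_cast<unsigned char>(bytes[i + 2]);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += (i + 1 < bytes.size()) ? table[(n >> 6) & 63] : '=';
        out += (i + 2 < bytes.size()) ? table[n & 63] : '=';
    }
    return out;
}

// Send HTTP POST request and get response
string send_request(const string& image_url) {
    CURL* curl = curl_easy_init();
    if (!curl) {
        cerr << "Error: failed to initialize curl." << endl;
        return "";
    }

    struct curl_slist* headers = NULL;
    headers = curl_slist_append(headers, ("app_id: " + APP_ID).c_str());
    headers = curl_slist_append(headers, ("app_key: " + APP_KEY).c_str());
    headers = curl_slist_append(headers, "Content-Type: application/json");

    // The response field name mirrors the requested format. Check the Mathpix
    // API reference for the formats that this endpoint accepts.
    json data = {
        {"src", image_url},
        {"formats", {"latex_simplified"}}
    };
    string json_str = data.dump();

    string response;
    curl_easy_setopt(curl, CURLOPT_URL, API_URL.c_str());
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, json_str.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, curl_write_callback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        cerr << "Error: curl_easy_perform() failed: " << curl_easy_strerror(res) << endl;
        response = "";
    }

    curl_easy_cleanup(curl);
    curl_slist_free_all(headers);

    return response;
}

int main() {
    // PDF file path to process
    string pdf_file = "example.pdf";

    // Call the pdftocairo tool to convert the first PDF page to example.png.
    // With -singlefile, pdftocairo appends ".png" to the output root itself.
    string image_root = "example";
    string image_file = image_root + ".png";
    string command = "pdftocairo -png -singlefile " + pdf_file + " " + image_root;
    if (system(command.c_str()) != 0) {
        cerr << "Error: pdftocairo failed." << endl;
        return 1;
    }

    // Encode the PNG file contents (not the file name) as a base64 data URI
    string encoded = base64_encode_file(image_file);
    if (encoded.empty()) {
        cerr << "Error: failed to read " << image_file << endl;
        return 1;
    }
    string image_url = "data:image/png;base64," + encoded;

    // Call the Mathpix API to convert the mathematical formula in the PNG image into LaTeX code
    string response = send_request(image_url);

    // Parse the response data and get the LaTeX code (parse without throwing)
    string latex = "";
    if (!response.empty()) {
        json data = json::parse(response, nullptr, false);
        if (!data.is_discarded() && data.contains("latex_simplified") &&
            data["latex_simplified"].is_string()) {
            latex = data["latex_simplified"].get<string>();
        }
    }

    // output LaTeX code
    cout << latex << endl;

    return 0;
}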

In the above code, we first use the system command to call the pdftocairo tool to convert the PDF file to a PNG image. We then convert the PNG image to a base64 encoded string and call the send_request function with it as an argument to send the HTTP request. This function uses the libcurl library for sending HTTP requests and receiving responses, and stores the responses in the string variable response. Finally, we parse the response data, get the LaTeX code, and output it to the console.

Note that the above code is only a simple example and needs to be adapted to specific requirements in actual use. For example, more thorough error handling should be added to make the program stable and reliable.
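One concrete hardening step concerns the pdftocairo command, which the program builds by string concatenation before passing it to system. A PDF path that contains spaces or quote characters would break that command line. The following is a minimal sketch of single-quote escaping, assuming a POSIX shell; the function name shell_quote is hypothetical, and Windows command interpreters follow different quoting rules.

```cpp
#include <string>

// Hypothetical helper: wrap an argument in single quotes for a POSIX shell.
// Each embedded single quote becomes '\'' (close the quote, emit an escaped
// quote, reopen the quote).
std::string shell_quote(const std::string& arg) {
    std::string quoted = "'";
    for (const char c : arg) {
        if (c == '\'') {
            quoted += "'\\''";
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}
```

The command could then be assembled as "pdftocairo -png " + shell_quote(pdf_file) + " " + shell_quote(image_file), so that a file named "my file.pdf" reaches the tool as a single argument.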