Text analysis using Python – multi-process batch conversion of PDF files into TXT files

In text analysis, converting the raw data into TXT files is a critical step, mainly for the following reasons:

1. Simple and unified format

  • A TXT file is a simple text format that contains only plain text information and does not contain any formatting or style information. This simple and unified format helps reduce confusion or misunderstandings that may arise during text analysis.

  • Documents in other formats, such as PDF or Word documents, may contain images, tables, and other non-text elements, and may also contain complex formatting and styles, which may interfere with the text analysis process.

2. Facilitate text preprocessing

  • Text analysis usually requires preprocessing of the text data, including word segmentation, removal of stop words, normalization, and so on (a minimal sketch of such preprocessing appears after this list). The simple structure of TXT files makes these preprocessing tasks easier to perform.

  • Compared to other file formats, TXT files do not contain any complex formatting or metadata, which helps simplify the preprocessing steps and reduce possible errors and problems.

3. Compatibility

  • Most text analysis and natural language processing (NLP) tools are able to process TXT files directly. Converting raw data to TXT files ensures compatibility with these tools, streamlining the analysis process.

  • TXT files are a universal file format that can be easily processed across different operating systems and software environments without the need for specific conversions or adapters.

4. Save resources

  • TXT files are generally smaller than other file formats, which helps save storage space and increase processing speed. Smaller file sizes also mean less computing resources are required to process text data, making analysis more efficient.

  • The simple text format also means low CPU and memory consumption during processing, which is very important for large-scale text analysis tasks.

5. Facilitates text mining and pattern recognition

  • The plain text format makes it easier and more straightforward to use regular expressions and other text mining techniques to identify and extract patterns in text.

  • Plain text data also facilitates the implementation of various text analysis techniques, such as sentiment analysis, topic modeling, and entity recognition.

6. Readability and checkability

  • Humans can directly read and understand TXT files, which is important for inspecting, debugging, and understanding the results of text analysis.

7. Data cleaning

  • The simplicity of TXT files makes it easier to identify and deal with missing values, errors, and other data quality issues, ensuring the accuracy and reliability of text analysis.
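
As a concrete illustration of points 2 and 5 above, here is a minimal preprocessing sketch. It reads a TXT file, lowercases and tokenizes the text, removes a small hand-picked set of stop words, and uses a regular expression to pull out year-like patterns. The file name and the stop-word list are placeholder assumptions for illustration only; they are not part of the conversion scripts that follow.

import re
from collections import Counter

# Placeholder stop-word list; a real project would use a fuller list (e.g. from NLTK)
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'are'}

def preprocess(path):
    with open(path, encoding='utf-8') as f:
        text = f.read().lower()
    tokens = re.findall(r'[a-z]+', text)                  # simple word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    years = re.findall(r'\b(?:19|20)\d{2}\b', text)       # example pattern extraction with a regex
    return Counter(tokens), years

# Example usage (placeholder file name):
# counts, years = preprocess('example.txt')
# print(counts.most_common(10))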

Converting raw data to TXT files is a fundamental step toward effective and accurate text analysis: it simplifies and standardizes the text analysis process, improving both the efficiency and the quality of the analysis. The following two scripts convert PDF files to TXT files: pdf2txt.py converts a single file using pdfminer, and convertPDF.py drives it in parallel over a whole directory of PDFs.

pdf2txt.py

#!/usr/bin/env python
# The line above tells the operating system to use the Python interpreter to execute this file
import sys #Import the sys module, used to handle operations related to the Python interpreter and runtime environment
from pdfminer.pdfdocument import PDFDocument # Import the PDFDocument class from the pdfminer module to represent PDF documents
from pdfminer.pdfparser import PDFParser # Import the PDFParser class from the pdfminer module for parsing PDF documents
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter # Import resource management and page interpretation classes from the pdfminer module
from pdfminer.pdfdevice import PDFDevice, TagExtractor # Import PDF device and tag extractor classes from the pdfminer module
from pdfminer.pdfpage import PDFPage # Import the PDFPage class from the pdfminer module to represent PDF pages
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter # Import converter classes from the pdfminer module for converting PDF to other formats
from pdfminer.cmapdb import CMapDB # Import the character mapping database class from the pdfminer module
from pdfminer.layout import LAParams # Import the layout analysis parameter class from the pdfminer module
from pdfminer.image import ImageWriter # Import the image writing class from the pdfminer module

# Define the main function, argv is a list containing command line parameters
def main(argv):
    import getopt #Import getopt module, used to parse command line parameters
    # Define an internal function that shows usage
    def usage():
        print ('usage: %s [-P password] [-o output] [-t text|html|xml|tag]'
               ' [-O output_dir] [-c encoding] [-s scale] [-R rotation]'
               ' [-Y normal|loose|exact] [-p pagenos] [-m maxpages]'
               ' [-S] [-C] [-n] [-A] [-V] [-M char_margin] [-L line_margin]'
               ' [-W word_margin] [-F boxes_flow] [-d] input.pdf ...' % argv[0])
        return 100 # Return an error code
    try:
        # Use getopt to parse command line parameters
        (opts, args) = getopt.getopt(argv[1:], 'dP:o:t:O:c:s:R:Y:p:m:SCnAVM:W:L:F:')
    except getopt.GetoptError:
        return usage() # If parsing fails, display usage and exit
    if not args: return usage() # If no non-option arguments (such as input files) are provided, display the usage and exit
    #Initialize some variables
    debug = 0 # debug level
    password = b'' # PDF password
    pagenos = set() # The page number set to be processed
    maxpages = 0 # Maximum number of pages
    outfile = None # Output file name
    outtype = None # Output type
    imagewriter = None #Image writing object
    rotation = 0 # rotation angle
    stripcontrol = False # Whether to strip control characters
    layoutmode = 'normal' # Layout mode
    encoding = 'utf-8' # encoding method
    pageno = 1 # Page number
    scale = 1 # scaling factor
    caching = True # Whether to cache
    showpageno = True # Whether to display the page number
    laparams = LAParams() # Layout analysis parameter object
    for (k, v) in opts: # Iterate through options and values
        if k == '-d': debug += 1 # Set debugging level
        elif k == '-P': password = v.encode('ascii') # Set password
        elif k == '-o': outfile = v # Set the output file name
        elif k == '-t': outtype = v # Set the output type
        elif k == '-O': imagewriter = ImageWriter(v) # Create an image writing object
        elif k == '-c': encoding = v # Set encoding method
        elif k == '-s': scale = float(v) # Set the scaling factor
        elif k == '-R': rotation = int(v) # Set the rotation angle
        elif k == '-Y': layoutmode = v # Set layout mode
        elif k == '-p': pagenos.update(int(x)-1 for x in v.split(',')) # Update page number set
        elif k == '-m': maxpages = int(v) # Set the maximum number of pages
        elif k == '-S': stripcontrol = True # Enable stripping control characters
        elif k == '-C': caching = False # Disable caching
        elif k == '-n': laparams = None # Disable layout analysis parameters
        elif k == '-A': laparams.all_texts = True # Enable all text options
        elif k == '-V': laparams.detect_vertical = True # Enable vertical detection option
        elif k == '-M': laparams.char_margin = float(v) # Set character margins
        elif k == '-W': laparams.word_margin = float(v) # Set word margins
        elif k == '-L': laparams.line_margin = float(v) # Set line margins
        elif k == '-F': laparams.boxes_flow = float(v) # Set box flow
    #Set debugging level
    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug
    # Create the PDF resource manager object
    rsrcmgr = PDFResourceManager(caching=caching)
    # Determine the output type from the output file extension if it was not given explicitly
    if not outtype:
        outtype = 'text' # Default is text output
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html' # If the output file name ends with .htm or .html, set it to html output
            elif outfile.endswith('.xml'):
                outtype = 'xml' # If the output file name ends with .xml, set it to xml output
            elif outfile.endswith('.tag'):
                outtype = 'tag' # If the output file name ends with .tag, set it to tag output
    # Open the output stream: the named output file, or standard output if no output file was given
    if outfile:
        outfp = open(outfile, 'w', encoding=encoding)
    else:
        outfp = sys.stdout
    # Create the corresponding PDF device object based on the output type and options
    if outtype == 'text':
        device = TextConverter(rsrcmgr, outfp, laparams=laparams, imagewriter=imagewriter)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, laparams=laparams, imagewriter=imagewriter, stripcontrol=stripcontrol)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, scale=scale, layoutmode=layoutmode, laparams=laparams, imagewriter=imagewriter, debug=debug)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp) # If the output type is 'tag', create a TagExtractor object
    else:
        return usage() # If the output type is not recognized, display the usage and exit

    for fname in args: # Traverse all input file names
        with open(fname, 'rb') as fp: # Open the file in binary read mode
            interpreter = PDFPageInterpreter(rsrcmgr, device) # Create a PDF page interpreter object
            # Traverse PDF pages and obtain page objects
            for page in PDFPage.get_pages(fp, pagenos,
                                          maxpages=maxpages, password=password,
                                          caching=caching, check_extractable=True):
                page.rotate = (page.rotate + rotation) % 360 # Set the page rotation angle
                interpreter.process_page(page) # Process each page

    device.close() # Close the device object and release resources
    outfp.close() # Close the output file and release resources
    return # Return from main function

# Check if this module is running as the main module
if __name__ == '__main__':
    sys.exit(main(sys.argv)) # If yes, call the main function with the command line argument list as argument
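
A quick note on how this script can be invoked, with placeholder file names for illustration. From the command line it follows the usage string printed above; from another Python script you can call main() directly, but remember that main() skips argv[0] (the program name), so a dummy first element must be supplied:

# Command-line usage (example file names):
#   python pdf2txt.py -o report.txt report.pdf
#   python pdf2txt.py -t html -o report.html report.pdf

# Programmatic usage, as convertPDF.py below does:
import pdf2txt
pdf2txt.main(['pdf2txt.py', '-o', 'report.txt', 'report.pdf'])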

convertPDF.py

#!/usr/bin/env python3
"""
Script to convert PDFs to text files.

"""

import unicodedata, os, pdf2txt, datetime
import multiprocessing
def convertPDFToText(i, ID, newDir, fileNamePDF):

    print('Trying to convert: ' + str(i) + ', ' + ID) # Output the file information being tried to convert
    try:
        pdf2txt.main(['pdf2txt.py', '-o', newDir + '/' + ID + '.txt', fileNamePDF]) # Call pdf2txt.main to convert the PDF to text; the first element stands in for argv[0], which main() skips
        print('Successfully converted: ' + ID) # Output when conversion is successful
    except Exception as e:
        print('Failed to convert: ' + ID + f', Error: {e}') # Output when conversion fails

def process_pdfs(pdf_list):
    with multiprocessing.Pool(20) as pool: # Create a process pool containing 20 processes
        pool.starmap(convertPDFToText, pdf_list) # Use starmap to process each element in pdf_list in parallel. Each element is a tuple, which will be unpacked as a parameter of convertPDFToText.
    
if __name__ == '__main__':

    directory = '../../Data/PDF/work'
    os.chdir(directory) #Change the current working directory to the PDF file directory

    #Specify the directory to save the converted files
    newDir = '../TXT/work'
    # os.makedirs(newDir) # Create a new directory (if necessary)
    print('Placing converted files in: ' + newDir) # Output the directory where the converted files will be placed

    pdf_list = [] # Create an empty list to hold the parameter tuple that will be passed to convertPDFToText
    i = 0 #Initialize counter
    for fileNamePDF in os.listdir('./'): # Traverse all files in the current directory
        i += 1 # Counter increment
        if fileNamePDF.find(".pdf") == -1: # If the file is not a PDF, skip
            continue

        ID = fileNamePDF[:-4] # Get the ID from the file name (remove the .pdf suffix)
        if os.path.isfile(newDir + '/' + ID + '.txt'): # If the corresponding text file already exists in the output directory, skip
            continue

        pdf_list.append((i, ID, newDir, fileNamePDF)) # Add parameter tuple to pdf_list
    process_pdfs(pdf_list) # Call the process_pdfs function, passing pdf_list to process PDF files in parallel
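
The pool size of 20 in process_pdfs is a fixed choice. As a minimal variant (an illustrative assumption, not part of the original script), the pool can be sized to the machine instead, keeping the same starmap pattern:

import multiprocessing, os

def process_pdfs_auto(pdf_list):
    workers = max(1, os.cpu_count() or 1)  # one worker per CPU core, at least one
    with multiprocessing.Pool(workers) as pool:
        pool.starmap(convertPDFToText, pdf_list)  # each tuple is unpacked into convertPDFToText's arguments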

About Python’s technical reserves

If you are planning to learn Python or are already learning it, the following materials should all come in handy:

① A roadmap for learning Python in all directions, knowing what to learn in each direction
② More than 100 Python course videos, covering essential basics, crawlers and data analysis
③ More than 100 Python practical cases, learning is no longer just theory
④ Huawei’s exclusive Python comic tutorial, you can also learn it on your mobile phone
⑤ Real Python interview questions from Internet companies over the years, very convenient for review


1. Learning routes in all directions of Python

The all-direction roadmap organizes Python's commonly used technical points into a summary of knowledge points for each field. Its value is that you can find the corresponding learning resources for each knowledge point, ensuring that you learn more comprehensively.

2. Python course video

When learning from videos, we cannot engage only our eyes and brain while leaving our hands idle. A more scientific way to learn is to apply things once we understand them, and hands-on projects are well suited to that.

3. Python practical cases

Theory alone is useless; you must follow along and practice in order to apply what you have learned. This is where practical cases come in.

4. Python comic tutorial

Easy-to-understand comics teach you Python in a way that is easier to remember and never boring.

5. Internet company interview questions

Ultimately, most of us learn Python to find a well-paid job. The interview questions below are the latest materials from first-tier Internet companies such as Alibaba, Tencent, and ByteDance, with authoritative answers provided by Alibaba engineers. After working through this set, I believe everyone can find a satisfying job.


