In text analysis, converting raw data into TXT files is a critical step, mainly for the following reasons:
1. Simple and unified format
- A TXT file is a plain-text format that contains only text, with no formatting or style information. This simple, uniform format helps reduce confusion or misinterpretation during text analysis.
- Documents in other formats, such as PDF or Word, may contain images, tables, and other non-text elements, as well as complex formatting and styles, all of which can interfere with the text analysis process.
2. Facilitates text preprocessing
- Text analysis usually requires preprocessing steps such as tokenization (word segmentation), stop-word removal, and normalization. The simple structure of TXT files makes these tasks easier to perform.
- Unlike other file formats, TXT files carry no complex formatting or metadata, which simplifies preprocessing and reduces potential errors.
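As a minimal illustration of these preprocessing steps, the sketch below lowercases, tokenizes, and filters a line of plain text using only the standard library. The tiny stop-word list is an assumption for the example; real pipelines use much larger lists and more careful tokenization.

```python
import re

# Tiny illustrative stop-word list; real pipelines use far larger ones.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # normalize + segment
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess("The format of a TXT file is simple and uniform."))
# ['format', 'txt', 'file', 'simple', 'uniform']
```

Because the input is plain text, no parsing of styles or embedded objects is needed before these steps can run.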
3. Compatibility
- Most text analysis and natural language processing (NLP) tools can process TXT files directly. Converting raw data to TXT ensures compatibility with these tools and streamlines the analysis pipeline.
- TXT is a universal format that can be processed across different operating systems and software environments without format-specific converters or adapters.
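To show how little machinery this portability requires, the sketch below writes and reads a TXT file with an explicit encoding; the same two calls behave identically on Windows, macOS, and Linux. The file name `sample.txt` is arbitrary.

```python
from pathlib import Path

# Write and read back a TXT file with an explicit encoding.
path = Path("sample.txt")
path.write_text("plain text travels well", encoding="utf-8")
content = path.read_text(encoding="utf-8")
print(content)  # plain text travels well
path.unlink()   # remove the demo file
```

Specifying `encoding="utf-8"` explicitly avoids platform-dependent default encodings, which is the one portability pitfall plain text still has.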
4. Saves resources
- TXT files are generally smaller than other file formats, which saves storage space and speeds up processing. Smaller files also require less computing power to analyze, making the work more efficient.
- The plain-text format keeps CPU and memory consumption low during processing, which matters for large-scale text analysis tasks.
5. Facilitates text mining and pattern recognition
- Plain text makes it straightforward to apply regular expressions and other text mining techniques to identify and extract patterns.
- Plain-text data also eases the implementation of analysis techniques such as sentiment analysis, topic modeling, and entity recognition.
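For example, regular expressions can be run directly against plain text with no intermediate parsing. The patterns below (for e-mail addresses and ISO dates) are simplified illustrations, not production-grade validators:

```python
import re

text = "Contact alice@example.com or bob@example.org before 2024-01-15."

# Simplified illustrative patterns for two common extraction targets.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)  # ['alice@example.com', 'bob@example.org']
print(dates)   # ['2024-01-15']
```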
6. Readability and inspectability
- Humans can read and understand TXT files directly, which is important for inspecting, debugging, and understanding the results of text analysis.
7. Data cleaning
- The simplicity of TXT files makes it easier to spot and handle missing values, errors, and other data quality issues, improving the accuracy and reliability of the analysis.
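A typical cleaning pass over extracted text might look like the sketch below: it normalizes Unicode, strips control characters (a common artifact of PDF extraction), and collapses runs of whitespace. This is a minimal standard-library example, not a complete cleaning pipeline.

```python
import re
import unicodedata

def clean_line(line):
    """Normalize Unicode, strip control characters, collapse whitespace."""
    line = unicodedata.normalize("NFKC", line)
    # Keep tabs and spaces; drop other control/format characters (category C*).
    line = "".join(ch for ch in line
                   if unicodedata.category(ch)[0] != "C" or ch in "\t ")
    return re.sub(r"\s+", " ", line).strip()

print(repr(clean_line("Bad\x00 data\t\twith   gaps\r\n")))
# 'Bad data with gaps'
```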
Converting raw data to TXT files is therefore a fundamental step toward effective and accurate text analysis: it simplifies and standardizes the process, improving both the efficiency and the quality of the results. The following code can be used to convert PDF files to TXT files.
pdf2txt.py
```python
#!/usr/bin/env python
# Command-line tool (based on pdfminer's pdf2txt.py) that converts PDF files
# to text, HTML, XML, or tagged output.
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from pdfminer.image import ImageWriter


def main(argv):
    import getopt  # parse command-line options

    def usage():
        print('usage: %s [-P password] [-o output] [-t text|html|xml|tag]'
              ' [-O output_dir] [-c encoding] [-s scale] [-R rotation]'
              ' [-Y normal|loose|exact] [-p pagenos] [-m maxpages]'
              ' [-S] [-C] [-n] [-A] [-V] [-M char_margin] [-L line_margin]'
              ' [-W word_margin] [-F boxes_flow] [-d] input.pdf ...' % argv[0])
        return 100

    try:
        (opts, args) = getopt.getopt(argv[1:],
                                     'dP:o:t:O:c:s:R:Y:p:m:SCnAVM:W:L:F:')
    except getopt.GetoptError:
        return usage()
    if not args:
        return usage()  # no input files given

    # Default option values.
    debug = 0                # debug level
    password = b''           # PDF password
    pagenos = set()          # set of page numbers to process (0-based)
    maxpages = 0             # maximum number of pages (0 = no limit)
    outfile = None           # output file name
    outtype = None           # output type: text, html, xml, or tag
    imagewriter = None       # image writer object
    rotation = 0             # rotation angle
    stripcontrol = False     # strip control characters from XML output
    layoutmode = 'normal'    # HTML layout mode
    encoding = 'utf-8'       # output encoding
    scale = 1                # scaling factor
    caching = True           # cache shared resources
    laparams = LAParams()    # layout analysis parameters

    for (k, v) in opts:
        if k == '-d': debug += 1
        elif k == '-P': password = v.encode('ascii')
        elif k == '-o': outfile = v
        elif k == '-t': outtype = v
        elif k == '-O': imagewriter = ImageWriter(v)
        elif k == '-c': encoding = v
        elif k == '-s': scale = float(v)
        elif k == '-R': rotation = int(v)
        elif k == '-Y': layoutmode = v
        elif k == '-p': pagenos.update(int(x) - 1 for x in v.split(','))
        elif k == '-m': maxpages = int(v)
        elif k == '-S': stripcontrol = True
        elif k == '-C': caching = False
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-V': laparams.detect_vertical = True
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-F': laparams.boxes_flow = float(v)

    # Propagate the debug level to the pdfminer classes.
    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug

    rsrcmgr = PDFResourceManager(caching=caching)

    # Infer the output type from the output file's extension if not given.
    if not outtype:
        outtype = 'text'
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html'
            elif outfile.endswith('.xml'):
                outtype = 'xml'
            elif outfile.endswith('.tag'):
                outtype = 'tag'

    # Open the output stream and create the matching converter device.
    outfp = open(outfile, 'w', encoding=encoding) if outfile else sys.stdout
    if outtype == 'text':
        device = TextConverter(rsrcmgr, outfp, laparams=laparams,
                               imagewriter=imagewriter)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, laparams=laparams,
                              imagewriter=imagewriter,
                              stripcontrol=stripcontrol)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, scale=scale,
                               layoutmode=layoutmode, laparams=laparams,
                               imagewriter=imagewriter, debug=debug)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp)
    else:
        return usage()  # unrecognized output type

    # Interpret every page of every input file through the chosen device.
    for fname in args:
        with open(fname, 'rb') as fp:
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=caching,
                                          check_extractable=True):
                page.rotate = (page.rotate + rotation) % 360
                interpreter.process_page(page)
    device.close()
    if outfp is not sys.stdout:
        outfp.close()
    return


if __name__ == '__main__':
    sys.exit(main(sys.argv))
```
convertPDF.py
```python
#!/usr/bin/env python3
"""Script to convert PDFs to text files in parallel."""
import os
import multiprocessing

import pdf2txt  # the converter script above, importable from the same directory


def convertPDFToText(i, ID, newDir, fileNamePDF):
    print('Trying to convert: ' + str(i) + ', ' + ID)
    try:
        # pdf2txt.main() treats its first argument as the program name
        # (argv[0]), so a dummy entry is prepended before the real options.
        pdf2txt.main(['pdf2txt.py', '-o',
                      newDir + '/' + ID + '.txt', fileNamePDF])
        print('Successfully converted: ' + ID)
    except Exception as e:
        print('Failed to convert: ' + ID + f', Error: {e}')


def process_pdfs(pdf_list):
    # Pool of 20 worker processes; starmap unpacks each tuple in pdf_list
    # into the arguments of convertPDFToText.
    with multiprocessing.Pool(20) as pool:
        pool.starmap(convertPDFToText, pdf_list)


if __name__ == '__main__':
    directory = '../../Data/PDF/work'
    os.chdir(directory)  # switch to the directory holding the PDF files

    newDir = '../TXT/work'  # directory for the converted text files
    # os.makedirs(newDir)  # create the output directory if necessary
    print('Placing converted files in: ' + newDir)

    pdf_list = []  # argument tuples to pass to convertPDFToText
    i = 0
    for fileNamePDF in os.listdir('./'):
        i += 1
        if fileNamePDF.find(".pdf") == -1:
            continue  # skip non-PDF files
        ID = fileNamePDF[:-4]  # file name without the .pdf suffix
        if os.path.isfile(newDir + '/' + ID + '.txt'):
            continue  # skip files that have already been converted
        pdf_list.append((i, ID, newDir, fileNamePDF))

    process_pdfs(pdf_list)  # convert all pending PDF files in parallel
```