A Python module for processing Word documents: python-docx

Foreword

Good morning, good afternoon, and good evening, everyone! Welcome to this article.

Not much to say, so let’s get straight to it.

1. The docx module

Python can process Word documents with the python-docx module, which takes an object-oriented approach.

In other words, python-docx treats the Word document itself, and the paragraphs, text, fonts, etc. inside it, as objects.

Working with these objects is how you work with the content of the Word document.

2. Related concepts

If you need to read the text in a Word document (generally speaking, a program only needs the text information in the document), you first need to understand several concepts of the python-docx module.

  1. The Document object represents a Word document.

  2. The Paragraph object represents a paragraph in a Word document.

  3. The text property of the Paragraph object holds the text content of that paragraph.

3. Module installation and import

To install the module, enter pip install python-docx on the cmd command line. Note that the package name is python-docx, not docx.

(When the last line of output starts with "Successfully installed", the installation has completed.)

Note that the module is imported with import docx.
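Because the package is installed as python-docx but imported as docx, it can help to check that the import name is available before running the examples. The following is only a minimal sketch using the standard library; the helper name is_importable is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The import name is "docx", even though the pip package is "python-docx".
if is_importable("docx"):
    print("docx is available")
else:
    print("docx not found; try: pip install python-docx")
```

The check only looks the module up; it does not import it, so it is cheap to run at the top of a script.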

4. Read Word text

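Once you understand the above, reading a document is very simple. First create a Word file, for example D:\temp\word.docx, and type a few paragraphs of text into it. The code below then reads a file like this; change the path to wherever you saved your own file.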

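import docx

# Open the document (replace the path with the location of your own file)
file = docx.Document(r"F:\python from getting started to giving up\7\2\wenjian.docx")

# Number of paragraphs in the document
print('Number of paragraphs: ' + str(len(file.paragraphs)))

# Alternative: iterate over the paragraphs directly
# for para in file.paragraphs:
#     print(para.text)

# Print every paragraph together with its index
for i in range(len(file.paragraphs)):
    print("The content of paragraph " + str(i) + " is: " + file.paragraphs[i].text)
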
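The module can also write documents. The following script creates a new document with headings, styled paragraphs, lists, a picture, and a table, then saves it: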

from docx import Document
from docx.shared import Inches

def main():
    # Create the document object
    document = Document()

    # Set the document title (level 0); Python 3 strings are already Unicode
    document.add_heading('My new document', 0)

    # Add paragraphs to the document
    p = document.add_paragraph('This is a paragraph having some ')
    p.add_run('bold ').bold = True
    p.add_run('and some ')
    p.add_run('italic.').italic = True

    # Add a first-level heading
    document.add_heading('Level 1 heading, level = 1', level=1)
    # Styles are looked up by their display name, e.g. 'Intense Quote'
    document.add_paragraph('Intense quote', style='Intense Quote')

    # Add an unordered list
    document.add_paragraph('first item in unordered list', style='List Bullet')

    # Add an ordered list
    document.add_paragraph('first item in ordered list', style='List Number')
    document.add_paragraph('second item in ordered list', style='List Number')
    document.add_paragraph('third item in ordered list', style='List Number')

    # Add an image and specify the width (cat.png must exist in the working directory)
    document.add_picture('cat.png', width=Inches(2.25))

    # Add a table: 1 row and 3 columns
    table = document.add_table(rows=1, cols=3)
    # Get the list of cells in the first row
    hdr_cells = table.rows[0].cells
    # Assign a value to each cell
    # Note: the value must be a string
    hdr_cells[0].text = 'Name'
    hdr_cells[1].text = 'Age'
    hdr_cells[2].text = 'Tel'
    # Add a row to the table
    new_cells = table.add_row().cells
    new_cells[0].text = 'Tom'
    new_cells[1].text = '19'
    new_cells[2].text = '12345678'

    # Add page breaks
    document.add_page_break()

    # Add paragraphs to a new page
    p = document.add_paragraph('This is a paragraph in new page.')

    # Save the document (python-docx writes the .docx format, so use that extension)
    document.save('demo1.docx')

if __name__ == '__main__':
    main()

Read the table:

import docx

doc = docx.Document('wenjian.docx')
for table in doc.tables:  # Traverse all tables
    print('----table------')
    for row in table.rows:  # Traverse all rows of the table
        # Alternative: join one row of data into a single string
        # row_str = '\t'.join(cell.text for cell in row.cells)
        # print(row_str)
        for cell in row.cells:
            print(cell.text, end='\t')  # cells of one row, separated by tabs
        print()  # newline after each row

First, use docx.Document to open the corresponding file.

The structure of a docx file is relatively complex and is divided into three layers.

  • The Document object represents the entire document;

  • Document contains a list of Paragraph objects, which represent the paragraphs in the document;

  • A Paragraph object contains a list of Run objects, each a stretch of text that shares the same formatting.

Therefore p.text returns the full text of a single paragraph p, and looping over all paragraphs prints the text of the whole document.
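To make the Paragraph and Run layers concrete, here is a small sketch. It only relies on the runs, text, bold and italic attributes that python-docx exposes on paragraphs and runs; the helper names describe_runs and paragraph_text are invented for this illustration, and joining the runs is assumed to match p.text only for simple paragraphs.

```python
def describe_runs(paragraph):
    """Return (text, bold, italic) for every run in a paragraph.

    bold and italic may be True, False, or None (None means the
    value is inherited from the style).
    """
    return [(run.text, run.bold, run.italic) for run in paragraph.runs]


def paragraph_text(paragraph):
    """Rebuild the paragraph text by joining the text of its runs."""
    return "".join(run.text for run in paragraph.runs)
```

With python-docx installed, these could be called as describe_runs(p) and paragraph_text(p) for each p in docx.Document('wenjian.docx').paragraphs.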

Use doc.tables to traverse all tables. For each table, the full content is obtained by traversing its rows and, within each row, its cells.
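If the table content is needed as data rather than printed output, the same row and cell traversal can collect it into a list of lists. This is a sketch under the assumption that only the rows, cells and text attributes used above are needed; table_to_rows and rows_to_tsv are hypothetical helper names.

```python
def table_to_rows(table):
    """Return a table as a list of rows, each row a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def rows_to_tsv(rows):
    """Format the rows as tab-separated lines, one line per table row."""
    return "\n".join("\t".join(row) for row in rows)
```

With python-docx installed, print(rows_to_tsv(table_to_rows(table))) could be called for each table in doc.tables.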

However, the file object (the text.txt document) and the pictures that we inserted into the Word file do not appear in the output.

How can this part be read?

First we need to understand how the docx format is put together:

  • docx is the default format of Microsoft Office 2007 and later versions.

    It replaces the earlier proprietary default file format with a new XML-based compressed file format.

    The letter “x” is added to the traditional filename extension (i.e. “.docx” instead of “.doc”, “.xlsx” instead of “.xls”, “.pptx” instead of “.ppt”).

  • A file in docx format is essentially a ZIP file.

    After changing the extension of a docx file to .zip, it can be opened or decompressed with a decompression tool.

    In other words, the ZIP archive acts as a container for the parts that make up the docx file.

  • The main content of a docx file is stored as XML, but the XML files are not saved directly on disk.

    They are packed into a ZIP archive, which is then given the extension docx.

    Change the extension of a .docx file to .zip and decompress it. The decompressed folder contains a folder named word, which holds most of the content of the Word document.

    The document.xml file contains the main text content of the document.
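The claim that a docx file is a ZIP archive can be checked from Python without renaming anything, because the standard zipfile module looks at the file content rather than the extension. The following is a minimal sketch; list_docx_parts is a name made up for this example.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all files stored inside a .docx archive."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path!r} is not a ZIP-based file")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_docx_parts('test.docx') on a real document should return names such as word/document.xml, plus any pictures or embedded objects the document contains.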

From the above, we can see that a docx document is really a package of XML documents.

So if we want to get at all of its parts, we can decompress it as a ZIP file.

Let’s try it first and see if it works.

1. Change the extension of the docx document to .zip

2. Unzip the file

After decompression, you get a set of files and folders. One of them is the word folder; open it to see its contents.

(document.xml is the file that describes the text of the document)
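As an illustration of what document.xml holds, the text can be pulled out of it with the standard library alone. This is a rough sketch, not a replacement for python-docx: it assumes the usual WordprocessingML namespace, ignores tabs and line breaks, and also picks up paragraphs nested inside table cells. The function name read_paragraphs is invented here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by elements such as w:p and w:t
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(path):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

The w:p and w:t elements correspond roughly to the Paragraph objects and run text described earlier.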

Among them, the embeddings folder holds the text object text.txt that we inserted, stored as a .bin file.

The inserted pictures are stored in the media folder.

We found the inserted text and pictures by hand, so the same can also be done through code.

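The code is as follows:

import os
import zipfile

os.chdir(r'E:\py_prj')  # First change to the directory that contains the file

os.rename('test.docx', 'test.zip')  # Rename to a zip file

# Decompress every member of the archive into the current directory
with zipfile.ZipFile('test.zip', 'r') as archive:
    for name in archive.namelist():
        archive.extract(name)

# Enter the file path and read the embedded object as raw bytes
with open(r'E:\py_prj\word\embeddings\oleObject1.bin', 'rb') as bin_file:
    data = bin_file.read()

for byte in data:
    print(byte)  # each byte is printed as an integer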

With the above method, all the files and pictures inserted into the docx file can be extracted.
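The rename step is optional, and it is also possible to extract only the pictures and embedded objects instead of the whole archive. The sketch below assumes the folder layout described above (word/media and word/embeddings); extract_attachments and its parameters are hypothetical names chosen for this example.

```python
import zipfile


def extract_attachments(docx_path, target_dir):
    """Extract pictures and embedded objects from a .docx file.

    Returns the archive names of the extracted members.
    """
    wanted = ("word/media/", "word/embeddings/")
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep files under the wanted folders
            if name.startswith(wanted) and not name.endswith("/"):
                archive.extract(name, target_dir)
                extracted.append(name)
    return extracted
```

For example, extract_attachments('test.docx', 'attachments') would leave the original file untouched and write the matching members under the attachments folder. The extracted .bin file still contains the embedded object in its stored binary form, so the original text.txt is not recovered as plain text by this step alone.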

For the details of writing docx files, refer to the official python-docx documentation.

Conclusion

Well, that is all for today’s sharing!

If there is anything you would like to see in the next article, please leave a message in the comment area! I will update when I see it.

If you like it, please follow the blogger, or like, favorite and comment on my article!
