Python batch extracts table contents in Word and writes them to Excel with one click

Follow the public account: [Xiao Zhang Python], we have prepared 50+ high-quality Python e-books and 50G+ high-quality video learning materials for you. You can get it by replying to the keyword: 1024 in the background; if you are interested in the blog content If you have any questions, add the author [Personal WeChat] in the background to communicate directly with the author.

Hello, I’m Xiao Zhang, long time no see everyone~

Today’s article introduces a practical case, related to office automation; the idea of the case comes from a demo I made for readers two days ago. The need is to extract the contents of hundreds of tables in word (the table styles in all words are the same), and put them into The extracted content is automatically stored in Excel

The form of the table in word is as follows

image-20210326232115393

Currently, there are several word documents in the above format that need to be sorted out. The goal is to use python to automatically generate excel tables in the following format.

image-20210326232712833

Before explaining the formal case, let’s take a look at the conversion effect. The script first converts the doc file in the specified folder into docx, and then automatically generates an excel table. The table contains all the content in word.

Word_excel

Libraries involved

The Python libraries used in this case include the following

python-docx
pandas
os
pywin32
doc converted to docx

In this case, the python-docx library is used to extract the table content in word. For some basic usage of python-docx, please refer to

,

Word documents are sometimes saved in the doc type. python-docx can only process the docx file type. Before extracting the table content, a file type format conversion is required: Convert doc to docx in batches;

The easiest way to convert doc to docxOpen the doc file through the word component in Office, and then manually save it as a docx file. This method is okay for a single document. If the number of documents reaches hundreds, this method will be used. It’s a little annoying,

Here we introduce a python library pywin32 to help us solve this problem. As an extension module, pywin32 encapsulates a large number of Windows API functions, such as calling application components such as Office, deleting specified files, obtaining mouse coordinates, etc.

Use pywin32 to control the Word component in Office to automatically complete the open and save operations and convert all doc file types into docx file types. The steps are divided into the following three steps:

1. Create a word component

from win32com import client as wc
word = wc.Dispatch('Word.Application')

2. Open the word file

doc = word.Documents.Open(path)

3. Save and close

doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False)
doc.Close()

Complete code

 path_list = os.listdir(path)
    doc_list = [os.path.join(path,str(i)) for i in path_list if str(i).endswith('doc')]
    word = wc.Dispatch('Word.Application')
    print(doc_list)
    for path in doc_list:
        print(path)
        save_path = str(path).replace('doc','docx')
        doc = word.Documents.Open(path)
        doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False)
        doc.Close()
        print('{} Save successfully '.format(save_path))
    word.Quit()
docx library extracts single table content

Before batch operations, we first need to get the content in a single table. As long as we get a single word, we can add a recursion to the rest.

Use the docx library to extract table contents in word, mainly using objects such as Table, rows, and cells.

image-20210327095933680

Table represents a table, rows represents a list of rows in the table, and exists in the form of an iterator; cells represents a list of cells, also in the form of an iterator.

image-20210327003517662

Before operating, you need to understand the following basic functions:

  • Read the file path through the Document function and return a Document object

  • Document.tables can return a list of tables in word;

  • table.rows returns a list of rows in the table;

  • row.cells returns the list of cells contained in the row;

  • cell.text returns the text information in the cell

After understanding the above content, the next operation idea will be relatively clear; the text information in the word table can be completed through two for loops: the first for loop obtains all row objects in the table, and the second for loop locates each row of the cell, use cell.text to obtain the cell text content;

Try using code to see if this idea is feasible

 document = docx.Document(doc_path)
    for table in document.tables:
        for row_index,row in enumerate(table.rows):
            for col_index,cell in enumerate(row.cells):
                print(' pos index is ({},{})'.format(row_index,col_index))
                print('cell text is {}'.format(cell.text))

You will find that the finally extracted content is duplicated,,,

image-20210327101337122

The above reasons are due to cell merging problems. For example, the cells in the table below are merged with (1,1)->(1,5), and the docx library is When processing this type of merged cells, they are not treated as one, but are processed in a single form, so when for iterates (1,1)->(1,5) Five cells were returned, and each cell text information returned Yes

image-20210327101720736

Faced with the above text duplication problem, it is necessary to add a deduplication mechanism. Fields such as name, gender, age...education degree are used as column names col_keys, followed by 王五, female, 37,. .. Bachelor etc. as col_values, set an index when extracting, even numbers are col_keys, odd numbers are col_vaues;

The code is refactored as follows:

 document = docx.Document(doc_path)
    col_keys = [] # Get column names
    col_values = [] # Get column values
    index_num = 0
    # Add a deduplication mechanism
    fore_str = ''
    for table in document.tables:
        for row_index,row in enumerate(table.rows):
            for col_index,cell in enumerate(row.cells):
                if fore_str != cell.text:
                    if index_num % 2==0:
                        col_keys.append(cell.text)
                    else:
                        col_values.append(cell.text)
                    fore_str = cell.text
                    index_num + =1

    print(f'col keys is {col_keys}')
    print(f'col values is {col_values}')

The final effect after extraction is as follows

image-20210327121716805

Batch word extraction and save to csv file

After being able to process a single word file, one recursion can extract the contents of all word text tables, and finally use pandas to write the obtained data to a csv file!

def GetData_frompath(doc_path):
    document = docx.Document(doc_path)
    col_keys = [] # Get column names
    col_values = [] # Get column values
    index_num = 0
    # Add a deduplication mechanism
    fore_str = ''
    for table in document.tables:
        for row_index,row in enumerate(table.rows):
            for col_index,cell in enumerate(row.cells):
                if fore_str != cell.text:
                    if index_num % 2==0:
                        col_keys.append(cell.text)
                    else:
                        col_values.append(cell.text)
                    fore_str = cell.text
                    index_num + =1
    return col_keys,col_values

pd_data = []
for index,single_path in enumerate(wordlist_path):
    col_names,col_values = GetData_frompath(single_path)
    if index == 0:
        pd_data.append(col_names)
        pd_data.append(col_values)
    else:
        pd_data.append(col_values)

df = pd.DataFrame(pd_data)
df.to_csv(word_paths + '/result.csv', encoding='utf_8_sig',index=False)
Certificate number and ID number format

When you open the generated csv file, you will find that the number format in the two columns of contact information and ID number is stored as numerical values, which is not the type we want. If you want to display it completely, you need to convert the numerical values into text before storing.

image-20210327122744514

The solution is to find the cell where it is located and add a ‘\t’ tab character in front of the previous element.

col_values[7] = '\t' + col_values[7]
col_values[8] = '\t' + col_values[8]

image-20210327123301512

Source code acquisition

For the source code data acquisition method used in this case, follow the WeChat official account: Xiao Zhang Python, and reply with the keyword: 210328 in the background of the official account!

Summary

In this case, only some methods in the docx library are used, which mainly involve the basic operations of Table in word. For some students who are engaged in clerical work, they may encounter similar problems as above in their daily work, so I specially share them here. I hope Can be helpful to everyone

Okay, that’s all the content of this article. Finally, thank you all for reading. See you in the next issue!