Follow the public account: [Xiao Zhang Python], we have prepared 50+ high-quality Python e-books and 50G+ high-quality video learning materials for you. You can get it by replying to the keyword: 1024 in the background; if you are interested in the blog content If you have any questions, add the author [Personal WeChat] in the background to communicate directly with the author.
Hello, I’m Xiao Zhang, long time no see everyone~
Today’s article introduces a practical case, related to office automation; the idea of the case comes from a demo I made for readers two days ago. The need is to extract the contents of hundreds of tables in word (the table styles in all words are the same), and put them into The extracted content is automatically stored in Excel
The form of the table in word is as follows
Currently, there are several word documents in the above format that need to be sorted out. The goal is to use python to automatically generate excel tables in the following format.
Before explaining the formal case, let’s take a look at the conversion effect. The script first converts the doc file in the specified folder into docx, and then automatically generates an excel table. The table contains all the content in word.
Libraries involved
The Python libraries used in this case include the following
python-docx pandas os pywin32
doc converted to docx
In this case, the python-docx library is used to extract the table content in word. For some basic usage of python-docx, please refer to
,
Word documents are sometimes saved in the doc type. python-docx can only process the docx file type. Before extracting the table content, a file type format conversion is required: Convert doc to docx in batches;
The easiest way to convert doc to docxOpen the doc file through the word component in Office, and then manually save it as a docx file. This method is okay for a single document. If the number of documents reaches hundreds, this method will be used. It’s a little annoying,
Here we introduce a python library pywin32 to help us solve this problem. As an extension module, pywin32 encapsulates a large number of Windows API functions, such as calling application components such as Office, deleting specified files, obtaining mouse coordinates, etc.
Use pywin32 to control the Word component in Office to automatically complete the open and save
operations and convert all doc file types into docx file types. The steps are divided into the following three steps:
1. Create a word component
from win32com import client as wc word = wc.Dispatch('Word.Application')
2. Open the word file
doc = word.Documents.Open(path)
3. Save and close
doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False) doc.Close()
Complete code
path_list = os.listdir(path) doc_list = [os.path.join(path,str(i)) for i in path_list if str(i).endswith('doc')] word = wc.Dispatch('Word.Application') print(doc_list) for path in doc_list: print(path) save_path = str(path).replace('doc','docx') doc = word.Documents.Open(path) doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False) doc.Close() print('{} Save successfully '.format(save_path)) word.Quit()
docx library extracts single table content
Before batch operations, we first need to get the content in a single table. As long as we get a single word, we can add a recursion to the rest.
Use the docx library to extract table contents in word, mainly using objects such as Table, rows, and cells.
Table represents a table, rows represents a list of rows in the table, and exists in the form of an iterator; cells represents a list of cells, also in the form of an iterator.
Before operating, you need to understand the following basic functions:
-
Read the file path through the Document function and return a Document object
-
Document.tables can return a list of tables in word;
-
table.rows returns a list of rows in the table;
-
row.cells returns the list of cells contained in the row;
-
cell.text returns the text information in the cell
After understanding the above content, the next operation idea will be relatively clear; the text information in the word table can be completed through two for loops: the first for loop obtains all row objects in the table, and the second for loop locates each row of the cell, use cell.text to obtain the cell text content;
Try using code to see if this idea is feasible
document = docx.Document(doc_path) for table in document.tables: for row_index,row in enumerate(table.rows): for col_index,cell in enumerate(row.cells): print(' pos index is ({},{})'.format(row_index,col_index)) print('cell text is {}'.format(cell.text))
You will find that the finally extracted content is duplicated,,,
The above reasons are due to cell merging problems. For example, the cells in the table below are merged with (1,1)->(1,5)
, and the docx library is When processing this type of merged cells
, they are not treated as one, but are processed in a single form, so when for iterates (1,1)->(1,5)
Five cells were returned, and each cell text information returned Yes
Faced with the above text duplication problem, it is necessary to add a deduplication mechanism. Fields such as name, gender, age...education degree
are used as column names col_keys, followed by 王五, female, 37,. .. Bachelor
etc. as col_values, set an index when extracting, even numbers are col_keys, odd numbers are col_vaues;
The code is refactored as follows:
document = docx.Document(doc_path) col_keys = [] # Get column names col_values = [] # Get column values index_num = 0 # Add a deduplication mechanism fore_str = '' for table in document.tables: for row_index,row in enumerate(table.rows): for col_index,cell in enumerate(row.cells): if fore_str != cell.text: if index_num % 2==0: col_keys.append(cell.text) else: col_values.append(cell.text) fore_str = cell.text index_num + =1 print(f'col keys is {col_keys}') print(f'col values is {col_values}')
The final effect after extraction is as follows
Batch word extraction and save to csv file
After being able to process a single word file, one recursion can extract the contents of all word text tables, and finally use pandas to write the obtained data to a csv file!
def GetData_frompath(doc_path): document = docx.Document(doc_path) col_keys = [] # Get column names col_values = [] # Get column values index_num = 0 # Add a deduplication mechanism fore_str = '' for table in document.tables: for row_index,row in enumerate(table.rows): for col_index,cell in enumerate(row.cells): if fore_str != cell.text: if index_num % 2==0: col_keys.append(cell.text) else: col_values.append(cell.text) fore_str = cell.text index_num + =1 return col_keys,col_values pd_data = [] for index,single_path in enumerate(wordlist_path): col_names,col_values = GetData_frompath(single_path) if index == 0: pd_data.append(col_names) pd_data.append(col_values) else: pd_data.append(col_values) df = pd.DataFrame(pd_data) df.to_csv(word_paths + '/result.csv', encoding='utf_8_sig',index=False)
Certificate number and ID number format
When you open the generated csv file, you will find that the number format in the two columns of contact information and ID number
is stored as numerical values, which is not the type we want. If you want to display it completely, you need to convert the numerical values into text before storing.
The solution is to find the cell where it is located and add a ‘\t’ tab character in front of the previous element.
col_values[7] = '\t' + col_values[7] col_values[8] = '\t' + col_values[8]
Source code acquisition
For the source code data acquisition method used in this case, follow the WeChat official account: Xiao Zhang Python, and reply with the keyword: 210328 in the background of the official account!
Summary
In this case, only some methods in the docx library are used, which mainly involve the basic operations of Table in word. For some students who are engaged in clerical work, they may encounter similar problems as above in their daily work, so I specially share them here. I hope Can be helpful to everyone
Okay, that’s all the content of this article. Finally, thank you all for reading. See you in the next issue!