The most powerful PDF operation manual for Python office automation

1. Introduction to PyMuPDF

1. Introduction

Before introducing PyMuPDF, let's get to know MuPDF first. As the name suggests, PyMuPDF is the Python interface to MuPDF.

MuPDF

MuPDF is a lightweight PDF, XPS and eBook viewer. MuPDF consists of software libraries, command line tools and viewers for various platforms.

The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with measurements and spacing accurate to within a fraction of a pixel for the highest fidelity in reproducing on-screen the appearance of a printed page.

The viewer is small and fast, yet complete. It supports various document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. You can annotate PDF documents and fill out forms using the mobile viewer (this feature will soon be coming to the desktop viewer as well).

Command-line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF, and CBZ. You can also use JavaScript to write scripts that manipulate documents.

PyMuPDF

PyMuPDF (version 1.18.17 at the time of writing) is a Python binding for MuPDF (version 1.18.*).

With PyMuPDF, you can access files with extensions “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be handled like documents: “.png”, “.jpg”, “.bmp”, “.tiff”, etc.

2. Features

NEW: Layout-preserving text extraction!

The script fitzcli.py provides text extraction in different formats via the subcommand “gettext”. Of particular interest is layout preservation, which produces text that resembles the original physical layout as closely as possible: it leaves room where images are located and reproduces text in tables and multi-column layouts.

For all supported document types you can:

  • Decrypt files

  • Access meta information, links and bookmarks

  • Render pages in raster format (PNG and others) or vector format SVG

  • Search text

  • Extract text and images

  • Convert to other formats: PDF, (X)HTML, XML, JSON, text

A large number of additional functions exist for PDF documents: they can be created, merged or split. Pages can be inserted, deleted, rearranged, or modified in a variety of ways (including annotations and form fields).

  • Images and fonts can be extracted or inserted

  • Embedded files are fully supported

  • PDF files can be reformatted to support double-sided printing, posterization, or the application of logos and watermarks

  • Full support for password protection: decryption, encryption, encryption method selection, permission level and user/owner password settings

  • Support for the PDF Optional Content concept for images, text and drawings

  • Low-level PDF structures can be accessed and modified

  • The command line module "python -m fitz ..." is a multipurpose utility with the following features:

    • Encryption/decryption/optimization

    • Creating subdocuments

    • Joining documents

    • Image/font extraction

    • Full support for embedded files

    • Layout-preserving text extraction (all documents)

2. Install

PyMuPDF can be installed from source or from wheels.

For Windows, Linux and macOS, wheels are available in the download section of PyPI. They cover 64-bit Python versions 3.6 through 3.9; for Windows there are also 32-bit wheels. Recently, wheels for the Linux ARM architecture have been added as well: look for the platform tag manylinux2014_aarch64.

PyMuPDF has no mandatory external dependencies other than the standard library. A few convenience methods are available only if certain packages are installed:

  • Pillow: Required when using Pixmap.pil_save() and Pixmap.pil_tobytes()

  • fontTools: Required when using Document.subset_fonts()

  • pymupdf-fonts: An optional selection of fonts that can be used by the text output methods

Use the pip install command:

pip install PyMuPDF

Import library:

import fitz

About the name fitz

The standard Python import statement for this library is import fitz. There are historical reasons for this: MuPDF's original rendering library was called Libart.

After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new modern graphics library called "Fitz". Fitz started as an R&D project to replace the aging Ghostscript graphics library, but became the rendering engine for MuPDF (quoted from Wikipedia).

3. How to use

1. Import library, view version

import fitz
print(fitz.__doc__)
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.8 on linux (64-bit).

2. Open the document

doc = fitz.open(filename)

This creates the Document object doc. filename must be a Python string naming an existing file.
You can also open a document from in-memory data or create a new, empty PDF. A Document can also be used as a context manager.

3. Methods and properties of Document

Method/Property        Description
Document.page_count    number of pages (int)
Document.metadata      metadata (dict)
Document.get_toc()     table of contents (list)
Document.load_page()   load a page

Example:

>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Foxit Reader PDF Printer Version 10.0.130.3456',
 'creationDate': "D:20210810173328+08'00'",
 'modDate': "D:20210810173328+08'00'",
 'trapped': '',
 'encryption': None}
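
As a small illustration of the properties listed above, the sketch below gathers a few of them into one dictionary. The helper name summarize is hypothetical (it is not part of PyMuPDF), and it assumes that doc is an already opened Document, for example the result of fitz.open(filename).

```python
def summarize(doc):
    """Collect a few basic facts about an open PyMuPDF Document.

    `summarize` is an illustrative helper, not a PyMuPDF function.
    It relies only on page_count, metadata and get_toc().
    """
    # Assumption: fall back to an empty dict if no metadata is available.
    meta = doc.metadata or {}
    return {
        "pages": doc.page_count,
        "format": meta.get("format"),
        # Empty titles are common, so substitute a readable placeholder.
        "title": meta.get("title") or "(untitled)",
        "toc_entries": len(doc.get_toc()),
    }
```

With PyMuPDF installed, a call could look like summarize(fitz.open("example.pdf")), where example.pdf is a placeholder file name.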

4. Get metadata

PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the keys listed below.

It is available for all document types, although not every entry always contains data. Field values are strings, or None where nothing is recorded. Note also that a field may hold no meaningful data even when it is not None.

Key            Value
producer       producer (producing software)
format         format: ‘PDF-1.4’, ‘EPUB’, etc.
encryption     encryption method used, if any
author         author
modDate        date of last modification
keywords       keywords
title          title
creationDate   date of creation
creator        creating application
subject        subject
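
The creationDate and modDate values are plain strings in the PDF date notation (D:YYYYMMDDHHmmSS followed by an optional time zone offset). The following sketch shows one possible way to turn such a string into a Python datetime. The function name parse_pdf_date and the regular expression are my own assumptions, not part of PyMuPDF, and the pattern only covers the common forms of the notation.

```python
import re
from datetime import datetime, timedelta, timezone

# "D:YYYYMMDDHHmmSS" plus an optional offset such as +08'00' or Z.
# Everything after the year is optional; stray spaces are tolerated.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"\s*([+\-Z])?\s*(\d{2})?'?(\d{2})?'?"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it cannot be parsed."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    try:
        tz = None  # stays naive when the string carries no offset
        if sign == "Z":
            tz = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
            tz = timezone(offset if sign == "+" else -offset)
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range components (for example month 13) end up here.
        return None
```

For example, parse_pdf_date(doc.metadata["creationDate"]) yields a timezone-aware datetime when the string contains an offset.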

5. Get the table of contents (outline)

toc = doc.get_toc()
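
To my knowledge, each entry of the returned list starts with the hierarchy level (beginning at 1), the title, and the 1-based page number. Under that assumption, the sketch below prints the outline as an indented list; format_toc is a hypothetical helper name, not a PyMuPDF function.

```python
def format_toc(toc, indent="  "):
    """Render a table of contents as indented text, one entry per line.

    Assumes entries shaped like [level, title, page, ...], which is the
    shape expected from Document.get_toc(); extra items are ignored.
    """
    lines = []
    for level, title, page, *_ in toc:
        # Level 1 is not indented, level 2 is indented once, and so on.
        lines.append(f"{indent * (level - 1)}{title} (page {page})")
    return "\n".join(lines)
```

A typical call would be print(format_toc(doc.get_toc())).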

6. Page (Page)

Page processing is the core of MuPDF.

  • You can render pages as raster or vector (SVG) images, with options to scale, rotate, move, or crop the page.

  • You can extract page text and images in multiple formats and search for text strings.

  • For PDF documents, there are more ways to add text or images to the page.

First, a Page object must be created. This is done with a method of Document:

page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form

Any integer -∞ < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page, just like Python sequences.
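
The rule for negative page numbers can be illustrated without opening a file. The sketch below mirrors my understanding of it: page_count is added to a negative number until the result is no longer negative. The function resolve_page_number is purely illustrative and not part of PyMuPDF.

```python
def resolve_page_number(pno, page_count):
    """Map a possibly negative page number to its 0-based equivalent.

    Illustrative only: it models the documented behaviour that negative
    numbers count backwards from the end of the document.
    """
    if page_count < 1:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise IndexError("page not in document")
    # Python's % operator already returns a non-negative result here.
    return pno % page_count if pno < 0 else pno
```

For a 10-page document, -1 resolves to page 9 (the last page) and -10 to page 0.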

A more advanced approach is to use the document as an iterator over the pages:

for page in doc:
    pass  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'

# ... or even use 'slicing' (start, stop and step are integers you choose)
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'

The following subsections cover the most common Page operations.

a. Check the page for links, comments, or form fields

When the document is displayed in a viewer, links appear as "hot spots". If you click while the cursor shows the hand symbol, you are usually taken to the target encoded in that hot spot. Here is how to get all the links:

# get all links on a page
links = page.get_links()

links is a list of Python dictionaries.

You can also iterate over the links:

for link in page.links():
    pass  # do something with 'link'

If dealing with PDF document pages, there may also be annotations (Annot) or form fields (Widget), each with its own iterator:

for annot in page.annots():
    pass  # do something with 'annot'

for field in page.widgets():
    pass  # do something with 'field'
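
As an example of working with the link dictionaries, the sketch below picks out the web addresses. It assumes that links pointing to a URL carry a "uri" key while other link kinds (such as jumps within the document) do not; external_urls is a hypothetical helper name.

```python
def external_urls(links):
    """Return the target URLs from a list of link dictionaries.

    `links` is expected to be the result of page.get_links().
    Assumption: only links to web addresses have a non-empty "uri" entry.
    """
    return [link["uri"] for link in links if link.get("uri")]
```

Calling external_urls(page.get_links()) then gives a plain list of strings that can be printed or checked further.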

b. Render the page

This example creates a raster image of the page content:

pix = page.get_pixmap()

pix is a Pixmap object that (in this case) contains an RGB image of the page and can be used for a variety of purposes.

The method Page.get_pixmap() offers many options for controlling the image: resolution, color space (for example, to generate a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.

For example: to create an RGBA image (i.e., one that includes an alpha channel), specify pix = page.get_pixmap(alpha=True).

Pixmap offers many methods and attributes. These include the integers width and height (both in pixels) and stride (the number of bytes in one horizontal image line). The attribute samples holds the image data as a rectangular area of bytes (a Python bytes object).

You can also use page.get_svg_image() to create a vector image of the page.
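
To get a feeling for how resolution relates to width, height and stride, here is a rough calculation. It rests on two assumptions that I believe hold for the default settings: a page is rendered at 72 dpi (one pixel per PDF point) unless a zoom is applied, and an RGB pixel takes 3 bytes (4 with an alpha channel). The function names are hypothetical, and the real Pixmap may differ by a pixel because of rounding.

```python
def zoom_for_dpi(dpi):
    """Zoom factor that maps PDF points (1/72 inch) to pixels at `dpi`."""
    return dpi / 72.0


def estimate_pixmap(width_pt, height_pt, dpi=72, alpha=False):
    """Estimate the geometry of an RGB(A) rendering of a page.

    width_pt and height_pt are the page dimensions in points.
    This is an approximation, not a call into PyMuPDF.
    """
    zoom = zoom_for_dpi(dpi)
    width = round(width_pt * zoom)
    height = round(height_pt * zoom)
    bytes_per_pixel = 4 if alpha else 3  # R, G, B and optionally alpha
    stride = width * bytes_per_pixel     # bytes in one image line
    return {
        "width": width,
        "height": height,
        "stride": stride,
        "size_in_bytes": stride * height,
    }
```

With PyMuPDF, the same zoom factor can be passed to the renderer as page.get_pixmap(matrix=fitz.Matrix(zoom, zoom)).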

c. Save the page image to a file

We can simply store the image in a PNG file:

pix.save("page-%i.png" % page.number)
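
Extending this to a whole document is mostly a loop. The sketch below is one way to write it; save_page_images, the output directory and the file name pattern are all assumptions chosen for illustration, and doc stands for an open Document.

```python
from pathlib import Path


def save_page_images(doc, out_dir, pattern="page-{:03d}.png"):
    """Render every page with default settings and save it as a PNG.

    Returns the list of written paths. `doc` only needs to be iterable
    over pages that offer get_pixmap() and a `number` attribute.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page in doc:
        target = out_dir / pattern.format(page.number)
        pix = page.get_pixmap()
        pix.save(str(target))  # the file format follows the extension
        written.append(target)
    return written
```

Zero-padded numbers in the pattern keep the files in page order when a directory listing is sorted by name.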

d. Extract text and images

We can also extract all the text, images and other information of a page in many different forms and levels of detail:

text = page.get_text(opt)

Use one of the following strings for opt to get a different format:

  • "text": (default) plain text with newlines. No formatting, no text position details, no images

  • "blocks": generate a list of text blocks (paragraphs)

  • "words": generate a list of words (strings that contain no spaces)

  • "html": Creates a full visual version of the page, including any images. This can be displayed in a web browser

  • "dict"/"json": same level of information as HTML, but as a Python dictionary or a JSON string, respectively.

  • "rawdict"/"rawjson": a superset of "dict"/"json". It additionally provides character-level detail, as the XML format does.

  • "xhtml": The text information level is the same as the text version, but includes images.

  • "xml": Contains no images, but contains full position and font information for each text character. Use an XML module to interpret it.
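
As an example of post-processing one of these formats, the sketch below rebuilds text lines from the "words" output. It assumes each item is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), which is how I understand the format; words_to_lines is a hypothetical helper name.

```python
from itertools import groupby


def words_to_lines(words):
    """Join word tuples into lines of text.

    Words are ordered by block, line and word number and then grouped
    by (block, line), so every text line becomes one string.
    """
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    return [
        " ".join(w[4] for w in group)
        for _, group in groupby(ordered, key=lambda w: (w[5], w[6]))
    ]
```

A call such as words_to_lines(page.get_text("words")) returns a list of strings, one per line found on the page.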

e. Search text

You can find out exactly where a string of text is located on a page:

areas = page.search_for("mupdf")

This returns a list of rectangles, each of which surrounds one occurrence of the string "mupdf" (the search is case insensitive). You can use this information to highlight these areas (PDF only) or to build a cross-reference for the document.
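
Searching a whole document is a matter of repeating this call for every page. The sketch below collects the hits per page number; find_text is a hypothetical helper name, and doc is assumed to be an open Document.

```python
def find_text(doc, needle):
    """Map page numbers to the rectangles where `needle` was found.

    Pages without a hit are left out of the result.
    """
    hits = {}
    for page in doc:
        rects = page.search_for(needle)
        if rects:
            hits[page.number] = rects
    return hits
```

The keys of the returned dictionary are 0-based page numbers, so add 1 when showing them to a reader.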

7. PDF operation

PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.

However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF functions to the conversion result.

Document.save() always stores the PDF in its current (possibly modified) state on disk.

You can usually choose whether to save to a new file, or just append the modifications to the existing file ("incremental save"), which is usually much faster.

The following describes how to manipulate PDF documents.

a. Modify, create, rearrange and delete pages

There are several ways to manipulate the so-called page tree (the structure describing all pages) of a PDF:

  • Document.delete_page() and Document.delete_pages() delete pages.

  • Document.copy_page(), Document.fullcopy_page(), and Document.move_page() copy or move a page to another location in the same document.

  • Document.select() shrinks the PDF down to the selected pages. The parameter is a sequence of the page numbers to keep. These integers must all be in the range 0 <= i < page_count. When executed, all pages missing from this list are removed. The remaining pages appear in the order, and as many times (!), as you specified them. A document saved afterwards keeps all links, annotations and bookmarks that are still valid (in other words, those pointing to a selected page or to an external resource).

    So you can easily create a new PDF with:

    • first or last 10 pages

    • Odd or even pages only (for duplex printing)

    • pages containing or not containing the given text

    • reverse page order

  • Document.insert_page() and Document.new_page() insert a new page.

In addition, the pages themselves can be modified through a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).
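
The page lists for the Document.select() use cases mentioned above are easy to build with plain Python. The helpers below are hypothetical names of my own; they only compute 0-based page numbers, except for pages_containing, which assumes an open Document whose pages offer search_for().

```python
def first_pages(page_count, n=10):
    """Page numbers of the first n pages."""
    return list(range(min(n, page_count)))


def last_pages(page_count, n=10):
    """Page numbers of the last n pages."""
    return list(range(max(page_count - n, 0), page_count))


def every_second_page(page_count, start=0):
    """start=0 selects the 1st, 3rd, 5th ... page, start=1 the 2nd, 4th ..."""
    return list(range(start, page_count, 2))


def reversed_pages(page_count):
    """All page numbers, last page first."""
    return list(range(page_count - 1, -1, -1))


def pages_containing(doc, text):
    """Numbers of the pages on which `text` occurs at least once."""
    return [page.number for page in doc if page.search_for(text)]
```

A reversed copy could then be produced with doc.select(reversed_pages(doc.page_count)) followed by doc.save("reversed.pdf"), where the output file name is a placeholder.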

b. Join and split PDF documents

The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 are open PDF documents):

# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

Below is a snippet that splits doc1. It creates a new document containing the first and the last 10 pages:

doc2 = fitz.open()  # new empty PDF
doc2.insert_pdf(doc1, to_page=9)  # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1) - 10)  # last 10 pages
doc2.save("first-and-last-10.pdf")
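
The same from_page and to_page parameters can be used to cut a document into parts of equal size. The sketch below only computes the page ranges; chunk_ranges is a hypothetical helper, and both ends of each range are inclusive, matching the way the snippet above uses to_page=9 for the first 10 pages.

```python
def chunk_ranges(page_count, chunk_size):
    """Split 0..page_count-1 into consecutive (from_page, to_page) pairs.

    Both values of a pair are inclusive 0-based page numbers.
    The last pair may cover fewer than chunk_size pages.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]
```

For each pair (first, last), a new empty PDF created with fitz.open() can receive the pages through insert_pdf(doc1, from_page=first, to_page=last) and then be saved under its own name.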

c. Save

Document.save() will always save the document in its current state.

You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast because the changes are appended to the original file without completely rewriting it.

d. Close

While the program continues to run, it is often necessary to "close" the document in order to hand control of the underlying file back to the operating system.

This can be achieved with the Document.close() method. In addition to closing the underlying file, the buffers associated with the document are also freed.
