1. Introduction to PyMuPDF
1. Introduction
Before introducing PyMuPDF, let’s get to know MuPDF first. As the name suggests, PyMuPDF is the Python interface to MuPDF.
MuPDF
MuPDF is a lightweight PDF, XPS and eBook viewer. It consists of software libraries, command line tools and viewers for various platforms.
The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with measurements and spacing accurate to within a fraction of a pixel for the highest fidelity in reproducing on-screen the appearance of a printed page.
The viewer is small and fast, yet complete. It supports various document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. You can annotate PDF documents and fill out forms using the mobile viewer (this feature is coming to the desktop viewer soon).
Command-line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF, and CBZ. You can also write scripts in JavaScript to manipulate documents.
PyMuPDF
PyMuPDF (current version 1.18.17) is a Python binding for MuPDF (current version 1.18.*).
With PyMuPDF, you can access files with the extensions “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can be handled like documents: “.png”, “.jpg”, “.bmp”, “.tiff”, etc.
2. Features
NEW: layout-preserving text extraction!
The script fitzcli.py provides text extraction in different formats via the subcommand “gettext”. Of particular interest is layout preservation, which produces text as close as possible to the original physical layout, working around areas that contain images and reproducing the text of tables and multi-column pages.
For all supported document types you can:
- Decrypt files
- Access meta information, links and bookmarks
- Render pages in raster formats (PNG and others) or in the vector format SVG
- Search text
- Extract text and images
- Convert to other formats: PDF, (X)HTML, XML, JSON, text
A large number of additional functions exist for PDF documents: they can be created, merged or split. Pages can be inserted, deleted, rearranged, or modified in a variety of ways (including annotations and form fields).
- Images and fonts can be extracted or inserted
- Embedded files are fully supported
- PDF files can be reformatted to support double-sided printing or posterization, or to apply logos or watermarks
- Full support for password protection: decryption, encryption, encryption method selection, permission level and user/owner password settings
- Support for the PDF optional content concept for images, text and drawings
- Low-level PDF structures can be accessed and modified
- The command line module "python -m fitz ..." is a multipurpose utility with the following features:
  - Encryption/decryption/optimization
  - Creating subdocuments
  - Joining documents
  - Image/font extraction
  - Full support for embedded files
  - Layout-preserving text extraction (all document types)
2. Install
PyMuPDF can be installed from source or from wheels.
For Windows, Linux and Mac OSX platforms, wheels are available in the download section of PyPI. They cover 64-bit Python versions 3.6 through 3.9. There is also a 32-bit version for Windows. More recently, wheels for the Linux ARM architecture have been added as well; look for the platform tag manylinux2014_aarch64.
It has no mandatory external dependencies other than the standard library. Some nice-to-have methods are only available if certain packages are installed:
- Pillow: required when using Pixmap.pil_save() and Pixmap.pil_tobytes()
- fontTools: required when using Document.subset_fonts()
- pymupdf-fonts: a good selection of fonts that can be used with the text output methods
Use the pip install command:

    pip install PyMuPDF

Import the library:

    import fitz
About the name fitz
The standard Python import statement for this library is import fitz. There are historical reasons for this: MuPDF’s original rendering library was called Libart.
After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new modern graphics library called "Fitz". Fitz started as an R&D project to replace the aging Ghostscript graphics library, but became the rendering engine for MuPDF (quoted from Wikipedia).
3. How to use
1. Import library, view version

    import fitz
    print(fitz.__doc__)

Output:

    PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
    Version date: 2021-08-05 00:00:01.
    Built for Python 3.8 on linux (64-bit).

2. Open the document

    doc = fitz.open(filename)

This creates the Document object doc. filename must be a Python string naming an existing file.
You can also open documents from memory data, or create new empty PDFs. You can also use documents as context managers.
3. Methods and properties of Document
Method/Property | Description |
---|---|
Document.page_count | number of pages (int) |
Document.metadata | metadata (dict) |
Document.get_toc() | table of contents (list) |
Document.load_page() | load a page |
Example:

    >>> doc.page_count
    1
    >>> doc.metadata
    {'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Foxit Reader PDF Printer Version 10.0.130.3456', 'creationDate': "D:20210810173328 + 08'00'", 'modDate': "D:20210810173328 + 08'00'", 'trapped': '', 'encryption': None}

4. Get metadata
PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the following keys.
It is available for all document types, but not every entry always contains data. Metadata fields are strings, or None unless indicated otherwise. Note also that a field does not necessarily contain meaningful data, even when it is not None.
Key | Value |
---|---|
producer | producer (producing software) |
format | format: ‘PDF-1.4’, ‘EPUB’, etc. |
encryption | encryption method used if any |
author | author |
modDate | date of last modification |
keywords | keywords |
title | title |
creationDate | date of creation |
creator | creating application |
subject | subject |
5. Get the table of contents (outline)

    toc = doc.get_toc()

6. Page (Page)
Page processing is the core of MuPDF.
- You can render pages as raster or vector (SVG) images, with options to scale, rotate, move, or crop the page.
- You can extract page text and images in multiple formats and search for text strings.
- For PDF documents, there are additional methods for adding text or images to the page.
First, a Page must be created. This is done with a method of Document:

    page = doc.load_page(pno)  # loads page number 'pno' of the document (0-based)
    page = doc[pno]  # the short form

Any integer -inf < pno < page_count is possible here. Negative numbers count backwards from the end, so doc[-1] is the last page, just like with Python sequences.
A more advanced approach is to use the document as an iterator over the pages:

    for page in doc:
        pass  # do something with 'page'

    # ... or read backwards
    for page in reversed(doc):
        pass  # do something with 'page'

    # ... or even use 'slicing'
    for page in doc.pages(start, stop, step):
        pass  # do something with 'page'

Next, we introduce the most common Page operations.
a. Check the page for links, annotations, or form fields
When a document is displayed in viewer software, links appear as "hot spots". If you click while the cursor shows a hand symbol, you will usually be taken to the target encoded in that hot spot. Here's how to get all the links:

    # get all links on a page
    links = page.get_links()

links is a list of Python dictionaries.
You can also iterate over the links:

    for link in page.links():
        pass  # do something with 'link'

When dealing with PDF pages, there may also be annotations (Annot) or form fields (Widget), each with its own iterator:

    for annot in page.annots():
        pass  # do something with 'annot'

    for field in page.widgets():
        pass  # do something with 'field'

b. Render the page
This example creates a raster image of the page content:
pix = page.get_pixmap()
pix is a Pixmap object that (in this case) contains an RGB image of the page and can be used for a variety of purposes.
The method Page.get_pixmap() offers many options for controlling the image: resolution, color space (for example, to generate a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.
For example, to create an RGBA image (i.e. one that includes an alpha channel), specify pix = page.get_pixmap(alpha=True).
Pixmap has many methods and properties. These include the integers width and height (each in pixels) and stride (the number of bytes in one horizontal image line). The attribute samples is a Python bytes object holding the rectangular area of bytes that represents the image data.
You can also use page.get_svg_image() to create a vector image of the page.
c. Save the page image to a file
We can simply store the image in a PNG file:

    pix.save("page-%i.png" % page.number)

d. Extract text and images
We can also extract all the text, images and other information of a page in many different forms and levels of detail:

    text = page.get_text(opt)

Use one of the following strings for opt to get different formats:
- "text": (default) plain text with line breaks. No formatting, no text position details, no images.
- "blocks": generates a list of text blocks (paragraphs).
- "words": generates a list of words (strings that contain no spaces).
- "html": creates a full visual version of the page, including any images. This can be displayed in a web browser.
- "dict" / "json": same information level as HTML, but provided as a Python dictionary or a JSON string, respectively.
- "rawdict" / "rawjson": a superset of "dict" / "json". It additionally provides character details, as XML does.
- "xhtml": the text information level is the same as in the "text" version, but images are included.
- "xml": contains no images, but full position and font information for every single text character. Use an XML module to interpret it.
e. Search text
You can find out exactly where a string of text is located on a page:

    areas = page.search_for("mupdf")

This returns a list of rectangles, each surrounding one occurrence of the string "mupdf" (case insensitive). You can use this information to highlight those areas (PDF only) or to create a cross-reference of the document.
7. PDF operation
PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.
However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF functions to the conversion result.
Document.save() always stores the PDF in its current (possibly modified) state on disk.
You can usually choose whether to save to a new file, or just append your modifications to the existing file ("incremental save"), which is often much faster.
The following sections describe how to manipulate PDF documents.
a. Modify, create, rearrange and delete pages
There are several ways to manipulate the so-called page tree (the structure describing all pages) of a PDF:
- Document.delete_page() and Document.delete_pages() delete pages.
- Document.copy_page(), Document.fullcopy_page(), and Document.move_page() copy or move a page to another location within the same document.
- Document.select() shrinks the PDF down to the selected pages. Its parameter is a sequence of the page numbers to keep. These integers must be in the range 0 <= i < page_count. When executed, all pages missing from this list are deleted. The remaining pages appear in the order, and as many times (!), as you specify. So you can easily create a new PDF with:
  - the first or last 10 pages
  - only the odd or only the even pages (for duplex printing)
  - pages that do or do not contain a given text
  - the page order reversed

  A document changed this way and then saved will contain links, annotations and bookmarks that are still valid (that is, they point either to a selected page or to some external resource).
- Document.insert_page() and Document.new_page() insert new pages.

In addition, the pages themselves can be modified through a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).
b. Join and split PDF documents
The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 are opened PDFs):

    # append complete doc2 to the end of doc1
    doc1.insert_pdf(doc2)

Below is a snippet that splits doc1. It creates a new document containing the first and the last 10 pages:

    doc2 = fitz.open()  # new empty PDF
    doc2.insert_pdf(doc1, to_page=9)  # first 10 pages
    doc2.insert_pdf(doc1, from_page=len(doc1) - 10)  # last 10 pages
    doc2.save("first-and-last-10.pdf")

c. save
Document.save() always saves the document in its current state.
You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast because the changes are appended to the original file without completely rewriting it.
d. close
While the program keeps running, it is often desirable to "close" a document so that control of the underlying file is returned to the operating system.
This can be achieved with the Document.close() method. Apart from closing the underlying file, it also frees the buffers associated with the document.