Implementation and principles of object serialization in Python

The pickle module implements an algorithm for converting arbitrary Python objects into a series of bytes (that is, serializing them). The resulting byte stream can be transmitted or stored, and later reconstructed into a new object with the same characteristics as the original.

Notice:

  • The documentation for pickle clearly states that it provides no security guarantees. In fact, unpickling data can execute arbitrary code.
  • Be careful when using pickle for inter-process communication or data storage, and do not trust data whose security you cannot verify.
  • See the hmac module for an example of verifying the source of serialized data in a secure manner.
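The verification idea in the last point can be sketched as follows. This is a minimal illustration, not the example from the hmac documentation: the key value and the helper names sign_and_pickle() and verify_and_unpickle() are hypothetical, and SHA-256 is an assumed choice of digest.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; a real program would load it from secure storage.
SECRET_KEY = b'replace-with-a-real-secret'


def sign_and_pickle(obj, key=SECRET_KEY):
    """Return (digest, payload), where digest authenticates the pickled payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return digest, payload


def verify_and_unpickle(digest, payload, key=SECRET_KEY):
    """Unpickle payload only if its digest matches, otherwise raise ValueError."""
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest() compares in constant time to avoid timing leaks.
    if not hmac.compare_digest(digest, expected):
        raise ValueError('signature mismatch: refusing to unpickle')
    return pickle.loads(payload)


digest, payload = sign_and_pickle({'a': 'A', 'b': 2, 'c': 3.0})
print(verify_and_unpickle(digest, payload))
```

The signature only shows that the bytes came from someone who holds the key. It does not make unpickling data from an untrusted sender safe.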

Encoding and decoding of strings

The first example uses dumps() to encode a data structure as a byte string and then prints it to the console. The data structure is made up entirely of built-in types. Instances of any class can also be serialized, as a later example shows.

import pickle
import pprint

data = [{'a': 'A', 'b': 2, 'c': 3.0}]
print('DATA:', end=' ')
pprint.pprint(data)

data_string = pickle.dumps(data)
print('PICKLE: {!r}'.format(data_string))

By default, the data is written in a binary format chosen for compatibility between Python 3 programs.

$ python3 pickle_string.py

DATA: [{'a': 'A', 'b': 2, 'c': 3.0}]
PICKLE: b'\x80\x03]q\x00}q\x01(X\x01\x00\x00\x00cq\x02G@\x08\x00
\x00\x00\x00\x00\x00X\x01\x00\x00\x00bq\x03K\x02X\x01\x00\x00\x0
0aq\x04X\x01\x00\x00\x00Aq\x05ua.'
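The leading \x80\x03 bytes in the output above mark pickle protocol 3. Newer interpreters may default to a different protocol, so the exact bytes can differ. The sketch below, which assumes nothing beyond the standard pickle API, shows how the protocol argument of dumps() selects the format and checks that every available protocol round-trips the same data.

```python
import pickle

data = [{'a': 'A', 'b': 2, 'c': 3.0}]

print('default protocol:', pickle.DEFAULT_PROTOCOL)
print('highest protocol:', pickle.HIGHEST_PROTOCOL)

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    encoded = pickle.dumps(data, protocol=protocol)
    # loads() detects the protocol on its own; no argument is needed.
    assert pickle.loads(encoded) == data
    sizes[protocol] = len(encoded)
    print('protocol {}: {} bytes'.format(protocol, len(encoded)))
```

Choosing an explicit, older protocol is only worthwhile when the data has to be read by an older interpreter.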

Once the data is serialized, you can write it to a file, socket, pipe, etc. You can then read this file and deserialize the data to construct a new object with the same values.

import pickle
import pprint

data1 = [{'a': 'A', 'b': 2, 'c': 3.0}]
print('BEFORE: ', end=' ')
pprint.pprint(data1)

data1_string = pickle.dumps(data1)

data2 = pickle.loads(data1_string)
print('AFTER : ', end=' ')
pprint.pprint(data2)

print('SAME? :', (data1 is data2))
print('EQUAL?:', (data1 == data2))

The newly constructed object is equal to the original, but it is not the same object.

$ python3 pickle_unpickle.py

BEFORE:  [{'a': 'A', 'b': 2, 'c': 3.0}]
AFTER :  [{'a': 'A', 'b': 2, 'c': 3.0}]
SAME? : False
EQUAL?: True

Stream serialization

In addition to dumps() and loads(), pickle provides convenience functions for working with file-like streams. Multiple objects can be written to the same stream, and then read back from it without knowing in advance how many objects there are or how big they are.

pickle_stream.py

import io
import pickle

class SimpleObject:

    def __init__(self, name):
        self.name = name
        self.name_backwards = name[::-1]
        return

data = []
data.append(SimpleObject('pickle'))
data.append(SimpleObject('preserve'))
data.append(SimpleObject('last'))

# Simulate a file
out_s = io.BytesIO()

# Write to the stream
for o in data:
    print('WRITING : {} ({})'.format(o.name, o.name_backwards))
    pickle.dump(o, out_s)
    out_s.flush()

# Set up a readable stream
in_s = io.BytesIO(out_s.getvalue())

# Read the data back
while True:
    try:
        o = pickle.load(in_s)
    except EOFError:
        break
    else:
        print('READ : {} ({})'.format(
            o.name, o.name_backwards))

This example simulates streams using two BytesIO buffers. The first receives the pickled objects, and its value is fed to a second buffer from which load() reads. A simple database format could also use pickles to store objects; the shelve module is one such implementation.

$ python3 pickle_stream.py

WRITING : pickle (elkcip)
WRITING : preserve (evreserp)
WRITING : last (tsal)
READ : pickle (elkcip)
READ : preserve (evreserp)
READ : last (tsal)
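The read loop in the example above can be wrapped in a small generator so that callers simply iterate over the stream. This is a sketch only; the helper name iter_pickles() is hypothetical and not part of the pickle module.

```python
import io
import pickle


def iter_pickles(stream):
    """Yield objects from a binary stream until it is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            # load() raises EOFError when no more pickles remain.
            return


buffer = io.BytesIO()
for item in ['pickle', {'preserve': 1}, ('last',)]:
    pickle.dump(item, buffer)

# Rewind so reading starts at the first pickle.
buffer.seek(0)
items = list(iter_pickles(buffer))
print(items)
```

Each call to load() consumes exactly one pickle, which is why the reader does not need to know the object count or sizes ahead of time.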

In addition to storing data, pickles are handy for inter-process communication. For example, os.fork() and os.pipe() can be used to create worker processes that read job instructions from one pipe and write the results to another. The core code for managing the worker pool, sending jobs, and receiving responses can be reused, because the job and response objects do not have to be based on a particular class. When using pipes or sockets, do not forget to flush after dumping each object, so that the data is pushed through the connection to the other end. See the multiprocessing module for a reusable worker pool manager.
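The flush-after-dump advice can be illustrated without forking. The sketch below is a single-process stand-in for the worker setup described above: both ends of an os.pipe() live in the same program, and the task tuples are hypothetical placeholders.

```python
import os
import pickle

read_fd, write_fd = os.pipe()

tasks = [('square', 3), ('square', 4)]  # hypothetical task descriptions

with os.fdopen(write_fd, 'wb') as writer:
    for task in tasks:
        pickle.dump(task, writer)
        # Push each pickle through the pipe instead of leaving it buffered.
        writer.flush()
# Leaving the with block closes the write end, so the reader sees EOF.

received = []
with os.fdopen(read_fd, 'rb') as reader:
    while True:
        try:
            received.append(pickle.load(reader))
        except EOFError:
            break

print(received)
```

Writing everything before reading works here only because the payload is far smaller than the pipe buffer. With a real worker process, the reader runs concurrently and the flush is what makes each task visible promptly.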

Problems with rebuilding objects

When working with custom classes, the class being serialized must appear in the namespace of the process that reads the data. Only the data for the instance is serialized, not the class definition. During deserialization, the class name is used to find the constructor that creates the new object. The next example writes instances of a class to a file.

pickle_dump_to_file_1.py

import pickle
import sys

class SimpleObject:

    def __init__(self, name):
        self.name = name
        l = list(name)
        l.reverse()
        self.name_backwards = ''.join(l)

if __name__ == '__main__':
    data = []
    data.append(SimpleObject('pickle'))
    data.append(SimpleObject('preserve'))
    data.append(SimpleObject('last'))

    filename = sys.argv[1]

    with open(filename, 'wb') as out_s:
        for o in data:
            print('WRITING: {} ({})'.format(
                o.name, o.name_backwards))
            pickle.dump(o, out_s)

When run, the script creates a file whose name is given by the argument on the command line.

$ python3 pickle_dump_to_file_1.py test.dat

WRITING: pickle (elkcip)
WRITING: preserve (evreserp)
WRITING: last (tsal)

A naive attempt to load the resulting serialized objects fails.

pickle_load_from_file_1.py

import pickle
import sys

filename = sys.argv[1]

with open(filename, 'rb') as in_s:
    while True:
        try:
            o = pickle.load(in_s)
        except EOFError:
            break
        else:
            print('READ: {} ({})'.format(
                o.name, o.name_backwards))

This version fails because there is no SimpleObject class available.

$ python3 pickle_load_from_file_1.py test.dat

Traceback (most recent call last):
  File "pickle_load_from_file_1.py", line 15, in <module>
    o = pickle.load(in_s)
AttributeError: Can't get attribute 'SimpleObject' on <module '__main__' from 'pickle_load_from_file_1.py'>

The corrected version, pickle_load_from_file_2.py, imports the SimpleObject class from the original script. Adding this import statement lets the script find the class and construct the object.

from pickle_dump_to_file_1 import SimpleObject

Running the modified script now yields the expected results.

$ python3 pickle_load_from_file_2.py test.dat

READ: pickle (elkcip)
READ: preserve (evreserp)
READ: last (tsal)
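The same failure can be reproduced in a single script. The sketch below assumes it is run as a top-level program, and it simulates a reader that lacks the class by deleting the class after pickling. It also shows that the serialized bytes carry the class name rather than its definition.

```python
import pickle


class SimpleObject:

    def __init__(self, name):
        self.name = name


payload = pickle.dumps(SimpleObject('pickle'))

# The pickle stores the class's module and name, not its code.
print(b'SimpleObject' in payload)

# Simulate a reader process that never defined or imported the class.
del SimpleObject

error = None
try:
    pickle.loads(payload)
except AttributeError as err:
    error = err
    print('load failed:', err)
```

Importing the class into the reading module, as the corrected script does, makes the name resolvable again.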

Unserializable objects

Not all objects can be serialized. Sockets, file handles, database connections, and other objects whose runtime state depends on the operating system or another process may not be storable in a meaningful way. Classes with attributes that cannot be serialized can define __getstate__() and __setstate__() to control which part of the instance's state is saved and how it is restored.

The __getstate__() method must return an object containing the internal state of the instance. A convenient way to represent that state is a dictionary, whose values can be any serializable objects. The state is stored with the serialized data and passed to __setstate__() when the object is deserialized.

pickle_state.py

import pickle

class State:

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return 'State({!r})'.format(self.__dict__)

class MyClass:

    def __init__(self, name):
        print('MyClass.__init__({})'.format(name))
        self._set_name(name)

    def _set_name(self, name):
        self.name = name
        self.computed = name[::-1]

    def __repr__(self):
        return 'MyClass({!r}) (computed={!r})'.format(
            self.name, self.computed)

    def __getstate__(self):
        state = State(self.name)
        print('__getstate__ -> {!r}'.format(state))
        return state

    def __setstate__(self, state):
        print('__setstate__({!r})'.format(state))
        self._set_name(state.name)

inst = MyClass('name here')
print('Before:', inst)

dumped = pickle.dumps(inst)

reloaded = pickle.loads(dumped)
print('After:', reloaded)

This example uses a separate State object to hold the internal state of MyClass. When an instance of MyClass is deserialized, a State instance is passed to __setstate__() to initialize the new object.

If the return value of __getstate__() is false, __setstate__() will not be called when the object is deserialized.
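For the runtime-state case mentioned at the start of this section, a common pattern is to drop the unserializable attribute in __getstate__() and recreate it in __setstate__(). The sketch below is illustrative only: the Counter class is hypothetical, and a threading.Lock stands in for any resource that cannot be pickled.

```python
import pickle
import threading


class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self, count=0):
        self.count = count
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock.
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


original = Counter(5)
restored = pickle.loads(pickle.dumps(original))
print(restored.count, restored._lock is not original._lock)
```

Note that __init__() is not called during deserialization, so __setstate__() is the place to rebuild anything that was left out.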

Circular references

The serialization protocol automatically handles circular references between objects, so even complex data structures need no special handling. Consider the directed graph in the figure. It contains several cycles, yet the correct structure can be serialized and then reloaded.

Serialize a circularly referenced data structure

pickle_cycle.py

import pickle

class Node:
    """A simple directed graph
    """
    def __init__(self, name):
        self.name = name
        self.connections = []

    def add_edge(self, node):
        """Create an edge from this node to another node
        """
        self.connections.append(node)

    def __iter__(self):
        return iter(self.connections)

def preorder_traversal(root, seen=None, parent=None):
    """Generator function that generates edges for a graph
    """
    if seen is None:
        seen = set()
    yield (parent, root)
    if root in seen:
        return
    seen.add(root)
    for node in root:
        recurse = preorder_traversal(node, seen, root)
        for parent, subnode in recurse:
            yield (parent, subnode)

def show_edges(root):
    """Print all the edges in the graph
    """
    for parent, child in preorder_traversal(root):
        if not parent:
            continue
        print('{:>5} -> {:>2} ({})'.format(
            parent.name, child.name, id(child)))

# Create a directed graph
root = Node('root')
a = Node('a')
b = Node('b')
c = Node('c')

# Add edges between nodes
root.add_edge(a)
root.add_edge(b)
a.add_edge(b)
b.add_edge(a)
b.add_edge(c)
a.add_edge(a)

print('ORIGINAL GRAPH:')
show_edges(root)
# Serialize and deserialize directed graph
# Generate a new set of nodes
dumped = pickle.dumps(root)
reloaded = pickle.loads(dumped)

print('\nRELOADED GRAPH:')
show_edges(reloaded)

The reloaded nodes are new objects, not the ones created at the beginning, but the relationships between them are preserved. This can be verified by comparing the id() values printed for each node.

$ python3 pickle_cycle.py

ORIGINAL GRAPH:
 root ->  a (4315798272)
    a ->  b (4315798384)
    b ->  a (4315798272)
    b ->  c (4315799112)
    a ->  a (4315798272)
 root ->  b (4315798384)

RELOADED GRAPH:
 root ->  a (4315904096)
    a ->  b (4315904152)
    b ->  a (4315904096)
    b ->  c (4315904208)
    a ->  a (4315904096)
 root ->  b (4315904152)
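The same behaviour can be checked with built-in containers alone. This short sketch, separate from the graph example, builds a list that contains itself and a dictionary that refers to that list twice, then confirms that both the cycle and the shared reference survive a round trip.

```python
import pickle

# A list that contains itself, referenced twice from one dictionary.
loop = ['start']
loop.append(loop)
data = {'first': loop, 'second': loop}

reloaded = pickle.loads(pickle.dumps(data))

# The self-reference is preserved in the reloaded structure.
print(reloaded['first'][1] is reloaded['first'])
# Both keys still point at one shared list.
print(reloaded['first'] is reloaded['second'])
# The reloaded list is a new object, not the original.
print(reloaded['first'] is loop)
```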