moreati/pickle-fuzz

Rehabilitating Python's pickle module

Can Python's pickle module be used safely? Even if the commonly cited code execution attacks are mitigated (with find_class()) what attacks remain?

Pickle

Pickle is a serialisation format for Python objects. It's widely regarded as dangerous to unpickle data from any untrusted source. The Python documentation warns

The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

The documentation provides a trivial proof of concept

b"cos\nsystem\n(S'echo hello world'\ntR."

When this string is passed to pickle.loads, Python imports the os module and executes the command echo hello world.

This particular example relies on os being available as a global variable. Other examples use different global variables. As documented in Restricting Globals, these attacks can be mitigated by filtering (or eliminating) the global variables made available to the unpickler. Overriding pickle.Unpickler.find_class(), or setting it to None, can achieve this. For example

import builtins
import pickle

# From "Restricting Globals" in the pickle documentation.
safe_builtins = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

Even with this mitigation though, pickle cannot be considered safe. Quoting PEP 307: Security Issues

nobody has ever done the necessary, extensive, code audit to prove that unpickling untrusted pickles cannot invoke unwanted code

This repository represents a small effort to bridge this gap.

Other attacks

Other attacks are likely to be possible using the Pickle protocol

Denial of service

Malformed values

Unpickling arbitrary pickles can raise a variety of Python exceptions. The following have been found so far

  • pickle.UnpicklingError
  • AttributeError
  • EOFError
  • ImportError
  • IndexError
  • KeyError
  • MemoryError
  • NameError
  • struct.error
  • SyntaxError
  • TypeError
  • UnicodeError
  • ValueError

If any of these exceptions are unhandled then the Python process as a whole will usually terminate.
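
Two of these are easy to demonstrate with hand-crafted payloads (behaviour of CPython's default C unpickler):

```python
import pickle

# A truncated pickle: an INT opcode with no terminating STOP.
try:
    pickle.loads(b"I1\n")
except EOFError as exc:
    print("EOFError:", exc)

# An unknown opcode byte.
try:
    pickle.loads(b"\xff")
except pickle.UnpicklingError as exc:
    print("UnpicklingError:", exc)
```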

Large objects

The pickle protocol allows large objects to be pickled (and hence unpickled) in chunks. From PEP 307: Pickling of large lists and dicts

Protocol 1 pickles large lists and dicts "in one piece", which minimizes pickle size, but requires that unpickling create a temp object as large as the object being unpickled. Part of the protocol 2 changes break large lists and dicts into pieces of no more than 1000 elements each, so that unpickling needn't create a temp object larger than needed to hold 1000 elements.

There does not appear to be anything in current implementations that enforces this. Hence a crafted payload could cause excessive memory allocation during unpickling.
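
A minimal sketch of this: a hand-crafted payload that builds a list "in one piece" (MARK, n items, then the LIST opcode), sidestepping the 1000-element batching that pickle.Pickler applies from protocol 2 onward.

```python
import pickle

def one_piece_list(n):
    # '(' = MARK, 'I1\n' pushes the int 1, 'l' = LIST (pop to mark),
    # '.' = STOP. The whole list is materialised in a single step.
    return b"(" + b"I1\n" * n + b"l."

print(pickle.loads(one_piece_list(5)))   # [1, 1, 1, 1, 1]
```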

Billion laughs

Since Pickle allows references between objects, it is possible to construct billion laughs attack payloads. A pickle of a few hundred bytes can result in data structures containing billions of items. If the payload is successfully deserialised then further processing (e.g. serializing to JSON, writing repr() to a log) will cause excessive memory & CPU consumption. If the payload contains a sufficient number of references then the operating system will usually kill the Python process for exceeding resource limits.

>>> a = ['lol', 'lol', 'lol', 'lol', 'lol', 'lol', 'lol', 'lol', 'lol', 'lol']
>>> b = [a,a,a,a,a,a,a,a,a,a]
>>> c = [b,b,b,b,b,b,b,b,b,b]
>>> d = [c,c,c,c,c,c,c,c,c,c]
>>> e = [d,d,d,d,d,d,d,d,d,d]
>>> f = [e,e,e,e,e,e,e,e,e,e]
>>> g = [f,f,f,f,f,f,f,f,f,f]
>>> h = [g,g,g,g,g,g,g,g,g,g]
>>> i = [h,h,h,h,h,h,h,h,h,h]
>>> j = [i,i,i,i,i,i,i,i,i,i]
>>> len(j)
10
>>> 10**10
10000000000
>>> pickle.dump(j, open('billion-laughs.pkl1', 'wb'), protocol=0)
>>> pickle.dump(j, open('billion-laughs.pkl2', 'wb'), protocol=2)

Protocol downgrades

In protocols 0 and 1 most variable length values are pickled as a new-line terminated, ASCII string. This includes (long) integers. pickletools documentation notes that

LONG takes time quadratic in the number of digits when unpickling (this is simply due to the nature of decimal->binary conversion). Proto 2 added linear-time (in C; still quadratic-time in Python) LONG1 and LONG4 opcodes

The comment (commit bf2674, 28 Jan 2003) about quadratic runtime in LONG1 and LONG4 appears to be out of date. A subsequent comment (commit fdc034, 2 Feb 2003) notes

def decode_long(data):
    ...
    n = long(ashex, 16) # quadratic time before Python 2.3; linear now

However nothing in the Unpickler implementation enforces the use of LONG1 and LONG4. Hence an attacker can simply avoid using them in order to magnify the impact of any DoS.

Benchmarks of float, integer, byte string and text string for payloads of 1 Byte to 1 MByte show protocol 0 op-codes are significantly slower to deserialize. An integer can take 10000 times longer to decode if the LONG op-code is used.

  • Decoding protocol 0 is several orders of magnitude slower than protocol 2
  • Python 3.6 is consistently slower than Python 2.7 to decode bytes using protocol 2
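
A quick check of the LONG slowdown (timings will vary by machine and Python version; the 4,000-digit size is arbitrary, chosen to stay below CPython 3.11+'s default 4300-digit int/str conversion limit):

```python
import pickle
import timeit

# An integer with ~4,000 decimal digits.
n = 10 ** 4000

p0 = pickle.dumps(n, protocol=0)  # LONG: decimal digits, quadratic decode
p2 = pickle.dumps(n, protocol=2)  # LONG4: binary, linear decode in C

t0 = timeit.timeit(lambda: pickle.loads(p0), number=100)
t2 = timeit.timeit(lambda: pickle.loads(p2), number=100)
print(f"protocol 0: {t0:.4f}s  protocol 2: {t2:.4f}s")
```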

Other considerations

Bit rot

In addition to the known security issues, the Pickle protocol is not formally documented or standardised. Pickling of custom classes is tightly coupled to their implementation. This makes Pickle a poor choice for any long-term storage and retrieval. To quote Don’t use pickle — use Camel

Its automatic, magical behavior shackles you to the internals of your classes in non-obvious ways. You can’t even easily tell which classes are baked forever into your pickles. Once a pickle breaks, figuring out why and where and how to fix it is an utter nightmare.

Don’t use pickle.

Stack shenanigans?

The STOP opcode terminates unpickling and returns the topmost stack item. After this occurs the stack should be empty, but that condition is not checked or enforced.
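
A quick demonstration: this payload pushes 1 and 2, then STOP returns the top item, silently discarding the leftover 1. pickle.loads() accepts it, while pickletools.dis() flags it.

```python
import pickle
import pickletools

payload = b"I1\nI2\n."      # push 1, push 2, STOP
print(pickle.loads(payload))  # 2 -- the leftover 1 is ignored

# pickletools' disassembler does check for a leftover stack.
try:
    pickletools.dis(payload)
except ValueError as exc:
    print(exc)               # stack not empty after STOP
```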

Extension Registry

Since protocol 2, Pickle has included 'extension' opcodes. Chosen types may have their constructor added to the extension registry. When pickled, these constructors are identified by an integer instead of the GLOBAL (c) op-code. This mechanism is inherently opt-in, since the extension registry is empty by default. An example

>>> import collections, copy_reg, pickle, pickletools
>>> pickletools.optimize(pickle.dumps(collections.OrderedDict(), protocol=2))
'\x80\x02ccollections\nOrderedDict\n]\x85R.'
>>> copy_reg.add_extension('collections', 'OrderedDict', 240)
>>> pickletools.optimize(pickle.dumps(collections.OrderedDict(), protocol=2))
'\x80\x02\x82\xf0]\x85R.'

Note that this still requires the REDUCE op-code.
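
The example above uses Python 2 (copy_reg). On Python 3 the module is named copyreg, and because OrderedDict's __reduce__ output differs the exact bytes differ from the Python 2 output, but the EXT1 opcode (\x82) replaces GLOBAL the same way. A sketch:

```python
import collections
import copyreg
import pickle
import pickletools

before = pickletools.optimize(
    pickle.dumps(collections.OrderedDict(), protocol=2))
copyreg.add_extension('collections', 'OrderedDict', 240)
after = pickletools.optimize(
    pickle.dumps(collections.OrderedDict(), protocol=2))

print(b'\x82\xf0' in before)  # False: GLOBAL opcode names the class
print(b'\x82\xf0' in after)   # True: EXT1 opcode with code 240
```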

Integer opcodes

Python 2.x has two integer types: int and long. Python 3.x unified these into one type: int. An object that was pickled from an int may be unpickled as a long, and vice versa.

Protocol 0

Python 2.x pickles int objects with an INT opcode, and long objects with a LONG opcode.

Python 3.0 to 3.6 pickles int objects with a LONG opcode. This behaviour was identified as a regression in bpo-32037. Python 3.7 and onward pickles int objects smaller than 32-bits with an INT opcode, and all others with a LONG opcode.
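
For example, on Python 3.7 and later (per bpo-32037):

```python
import pickle

# A small int gets the INT opcode ('I')...
print(pickle.dumps(1, protocol=0))      # b'I1\n.'
# ...while an int needing more than 32 bits gets LONG ('L').
print(pickle.dumps(2**40, protocol=0))  # b'L1099511627776L\n.'
```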

Protocol 2 onward

At protocol 2 and above all Python releases use BININT, BININT1 or BININT2 opcodes for 32-bit integers. On 64-bit builds of Python 2.x integers between 2**32 and 2**63 - 1 are pickled using an INT opcode

>>> pickle.dumps(2**62, protocol=2)
'\x80\x02I4611686018427387904\n.'

On 64 bit builds of Python 3.x such integers are pickled with a LONG1 opcode

>>> pickle.dumps(2**62, protocol=2)
b'\x80\x02\x8a\x08\x00\x00\x00\x00\x00\x00\x00@.'

String opcodes

Python 2.x has two string types: str and unicode. Common practice played fast and loose with whether a str held bytes or text. Python 3.x cleaned up the distinction with its types: bytes and str.

Protocol 0-2

Protocol <= 2 pickles produced by Python 2.x don't specify an encoding

>>> pickle.dumps('abc', protocol=0)
"S'abc'\np0\n."

Pickles produced by Python 3 use the GLOBAL opcode to call _codecs.encode for non-empty byte strings or __builtins__.bytes for empty byte strings. This serves as a backward-compatible shim to explicitly declare an encoding

>>> pickle.dumps(b'abc', protocol=0)
b'c_codecs\nencode\np0\n(Vabc\np1\nVlatin1\np2\ntp3\nRp4\n.'

Protocol 3 onward

Protocol 3 added specific opcodes for byte strings
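
For example, SHORT_BINBYTES ('C') carries a length byte and the raw bytes, with no codec shim:

```python
import pickle
import pickletools

# optimize() strips the memo PUT, leaving just PROTO, SHORT_BINBYTES, STOP.
payload = pickletools.optimize(pickle.dumps(b'abc', protocol=3))
print(payload)   # b'\x80\x03C\x03abc.'
```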

DUP opcode

DUP duplicates the top item on the stack, and places it back on the stack. It is supported by pickle.Unpickler, but it's not used by pickle.Pickler. As a result, any pickle containing a DUP opcode cannot have been produced by the Python stdlib.
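
A hand-crafted payload exercising it (the DUP opcode is the ASCII character '2'):

```python
import pickle

# MARK, INT 1, DUP, LIST: builds [1, 1] from a single pushed int.
print(pickle.loads(b"(I1\n2l."))   # [1, 1]
```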

Weird machines

  • What happens if a SETITEM has repeated keys? Is this implementation defined?
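
In CPython at least, repeated keys follow ordinary dict assignment semantics: the later SETITEM wins. A quick check with a hand-crafted payload:

```python
import pickle

# '}' = EMPTY_DICT, then SETITEM 1 -> 2, then SETITEM 1 -> 3, STOP.
print(pickle.loads(b"}I1\nI2\nsI1\nI3\ns."))   # {1: 3}
```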

The order of keys in a pickled dict is not specified. Prior to CPython 3.6 dict objects didn't have a defined iteration order. Pickling the same dict object twice may produce distinct pickles. They should unpickle the same, but maybe not?

set and frozenset objects still don't have defined iteration order. So pickling such an object twice may produce differing pickles.
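
On modern CPython (3.7+, where dicts preserve insertion order) this is easy to see with two equal dicts built in different orders:

```python
import pickle

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

p1, p2 = pickle.dumps(d1), pickle.dumps(d2)
print(p1 == p2)                              # False: opcode order differs
print(pickle.loads(p1) == pickle.loads(p2))  # True: they unpickle equal
```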

Further reading

Other pickle payloads based on global objects

Other mitigations

A grab bag of links, advice, etc.

About

Attempts at fuzzing Python unpicklers
