Are nested comprehensions not optimized in Python? - python

The question is: do I really have to do this optimization by hand, or is there a better explanation of this puzzling comprehension behaviour?
Thanks! And please don't downvote my question... Even FORTRAN has been able to optimize nested loops since 1990... or earlier.
Look at the example.
dict_groups = [{'name': 'Новые Альбомы', 'gid': 4100014},
{'name': 'Synthpop [Futurepop, Retrowave, Electropop]', 'gid': 8564},
{'name': 'E:\\music\\leftfield', 'gid': 101522128},
{'name': 'Бренд одежды | MEDICINE', 'gid': 134709480},
{'name': 'Другая Музыка', 'gid': 35486626},
{'name': 'E:\\music\\trip-hop', 'gid': 27683540},
{'name': 'Depeche Mode', 'gid': 125927592}]
x = [{'gid': 35486626},{'gid': 134709480},{'gid': 27683540}]
I need to get
rez = [{'name': 'Другая Музыка', 'gid': 35486626},
{'name': 'E:\\music\\trip-hop', 'gid': 27683540},
{'name': 'Бренд одежды | MEDICINE', 'gid': 134709480}]
One of the solutions is:
x_val = tuple(d["gid"] for d in x)
rez = [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
with timing
%timeit x_val = tuple(d["gid"] for d in x)
1.55 µs ± 81.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
2.19 µs ± 93.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A one-line nested comprehension solution gives:
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in tuple(d["gid"] for d in x)]
11.9 µs ± 756 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much slower! It looks like the expression tuple(d["gid"] for d in x) is recalculated on each iteration!
7 * 1.55 + 2.19 = 13.04 µs, which is close to the 11.9 µs....
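That is exactly what happens: Python does not hoist loop-invariant subexpressions, so the condition's tuple(...) generator expression is re-evaluated for every element of dict_groups. The standard fix is to bind it once outside the comprehension, and to prefer a set so each membership test is O(1). A sketch using the data from the question:

```python
# The tuple(...) in the one-liner is loop-invariant, but Python re-evaluates it
# for every element of dict_groups. Bind the lookup once, and use a set so
# each `in` test is O(1) instead of a linear scan of a tuple.
dict_groups = [{'name': 'Новые Альбомы', 'gid': 4100014},
               {'name': 'Synthpop [Futurepop, Retrowave, Electropop]', 'gid': 8564},
               {'name': 'Другая Музыка', 'gid': 35486626},
               {'name': 'E:\\music\\trip-hop', 'gid': 27683540},
               {'name': 'Бренд одежды | MEDICINE', 'gid': 134709480}]
x = [{'gid': 35486626}, {'gid': 134709480}, {'gid': 27683540}]

wanted = {d["gid"] for d in x}          # built exactly once
rez = [d for d in dict_groups if d["gid"] in wanted]

print([d["gid"] for d in rez])          # [35486626, 27683540, 134709480]
```

For three gids the set buys little, but for larger x it turns the O(len(dict_groups) * len(x)) scan into O(len(dict_groups) + len(x)).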

Regex in python dataframe: count occurrences of pattern

I want to count how often a regex expression (prior and ensuing characters are needed to identify the pattern) occurs in multiple dataframe columns. I found a solution which seems a little slow. Is there a more sophisticated way?
| column_A            | column_B     | column_C            |
|---------------------+--------------+---------------------|
| Test • test abc     | winter • sun | snow rain blank     |
| blabla • summer abc | break • Data | test letter • stop. |
So far I created a solution which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
str.count can be applied across the whole dataframe without hard-coding each column this way. Try:
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I have tried it with a 1000 × 1000 dataframe. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use a generator expression with re.search, which brings the 938 µs down to about 26 µs. (Make sure not to create a list; use a generator.)
import re

res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
          for col in ['column_A', 'column_B', 'column_C'])
print(res)
# 5
Benchmark:
%%timeit 
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit 
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
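One caveat worth checking before swapping the approaches: str.count counts every occurrence in a cell, while the re.search version counts at most one match per cell. They agree here only because no cell contains the pattern twice. A minimal sketch without pandas, using the cells from the question:

```python
import re

pattern = "(?<=[A-Za-z]) • (?=[A-Za-z])"
cells = ["Test • test abc", "winter • sun", "snow rain blank",
         "blabla • summer abc", "break • Data", "test letter • stop."]

# str.count semantics: every occurrence in every cell.
per_cell_occurrences = sum(len(re.findall(pattern, c)) for c in cells)

# re.search semantics: at most one hit per cell.
cells_with_match = sum(1 for c in cells if re.search(pattern, c))

print(per_cell_occurrences, cells_with_match)  # 5 5

# A cell with two separators shows the divergence:
double = "a • b • c"
assert len(re.findall(pattern, double)) == 2     # two occurrences
assert bool(re.search(pattern, double)) is True  # but counted once by the search version
```

If per-cell multiplicity matters for your data, use re.findall (or re.finditer) inside the generator instead of re.search.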

Concatenate lists of strings by day in Pandas Dataframe

I have the following:
import pandas as pd
import numpy as np
documents = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user'],
['System', 'and', 'human'],
['Relation', 'of', 'user'],
['The', 'generation'],
['The', 'intersection'],
['Graph', 'minors'],
['Graph', 'minors', 'a']]
df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10', '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20', '2014-05-20'], dtype=np.datetime64), 'text': documents})
There are only 5 unique days. I would like to group by day to end up with the following:
documents2 = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user', 'System', 'and', 'human'],
['Relation', 'of', 'user', 'The', 'generation'],
['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]
df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-15', '2014-05-20'], dtype=np.datetime64), 'text': documents2})
IIUC, you can aggregate by sum
df.groupby('date').text.sum() # or .agg(sum)
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
Or flatten your lists using a list comprehension, which has the same time complexity as chain.from_iterable but needs no extra import:
df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
sum has already been shown in another answer, so let me propose a much faster (and more memory-efficient) solution using chain.from_iterable:
from itertools import chain
df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
The problem with sum is that, for every two lists that are summed, a new intermediate result is created. So the operation is O(N^2). You can cut this down to linear time using chain.
The performance difference is apparent even with a relatively small DataFrame.
df = pd.concat([df] * 1000)
%timeit df.groupby('date').text.sum()
%timeit df.groupby('date').text.agg('sum')
%timeit df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
%timeit df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
71.8 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
68.9 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.25 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The problem will be more pronounced when the groups are larger. Particularly because sum is not vectorised for objects.
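The quadratic behaviour of sum is easy to see without pandas; a minimal sketch:

```python
from itertools import chain

lists = [['Human', 'machine'], ['interface'], ['A', 'survey']]

# sum(..., []) concatenates pairwise, copying the whole growing result at each
# step, so flattening N total elements costs O(N^2).
flat_sum = sum(lists, [])

# chain.from_iterable yields each element exactly once: O(N) overall.
flat_chain = list(chain.from_iterable(lists))

print(flat_chain)  # ['Human', 'machine', 'interface', 'A', 'survey']
assert flat_sum == flat_chain
```

The list-comprehension variant `[item for z in lists for item in z]` is likewise linear, which is why both it and chain pull ahead of sum as the groups grow.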

What is the difference between bson.son.SON and collections.OrderedDict?

bson.son.SON is used mainly in pymongo to get an ordered mapping (dict).
But Python already has an ordered dict in collections.OrderedDict.
I have read the docs of bson.son.SON. They say SON is similar to OrderedDict but do not mention the difference.
So what is the difference? When should we use SON and when should we use OrderedDict?
Currently, the slight difference between the two is that bson.son.SON remains backward compatible with Python 2.7 and older versions.
Also the argument that SON serializes faster than OrderedDict is no longer correct in 2018.
The son module was added in Jan 8, 2009.
collections.OrderedDict(PEP-372) was added in python in Mar 2, 2009.
While the difference in dates doesn't tell us which was released first, it is interesting to see that MongoDB had already implemented an ordered map for their use case. I guess they decided to keep maintaining it for backward compatibility instead of switching all SON references in their codebase to collections.OrderedDict.
In small experiments with both, you can easily see that collections.OrderedDict performs better than bson.son.SON.
In [1]: from bson.son import SON
from collections import OrderedDict
import copy
import json
print(set(dir(SON)) - set(dir(OrderedDict)))
{'__weakref__', 'iteritems', 'iterkeys', 'itervalues', '__module__', 'has_key', '__deepcopy__', 'to_dict'}
In [2]: %timeit s = SON([('a',2)]); z = copy.deepcopy(s)
14.3 µs ± 758 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit s = OrderedDict([('a',2)]); z = copy.deepcopy(s)
7.54 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit s = SON(data=[('a',2)]); z = json.dumps(s)
11.5 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit s = OrderedDict([('a',2)]); z = json.dumps(s)
5.35 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In answer to your question about when to use SON: use SON if your software must run on versions of Python older than 2.7.
If you can help it, use OrderedDict from the collections module.
You can also use a plain dict; dicts preserve insertion order as of Python 3.7.
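A quick check of that last point (guaranteed from Python 3.7 onward, and already true in CPython 3.6 as an implementation detail):

```python
from collections import OrderedDict

pairs = [('b', 1), ('a', 2), ('c', 3)]
d = dict(pairs)
od = OrderedDict(pairs)

# Both preserve insertion order, so iteration order matches.
assert list(d) == list(od) == ['b', 'a', 'c']
print(list(d.items()))  # [('b', 1), ('a', 2), ('c', 3)]
```

The remaining difference is equality: OrderedDict compares order-sensitively against another OrderedDict, while plain dict equality ignores order.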

Dataframe hierarchical indexing speedup

I have a dataframe like this:
+----+------------+------------+------------+
| | | type | payment |
+----+------------+------------+------------+
| id | res_number | | |
+----+------------+------------+------------+
| a | 1 | toys | 20000 |
| | 2 | clothing | 30000 |
| | 3 | food | 40000 |
| b | 4 | food | 40000 |
| | 5 | laptop | 30000 |
+----+------------+------------+------------+
As you can see, id and res_number form a hierarchical row index, while type and payment are normal value columns. What I want to get is below.
array([['toys', 20000],
['clothing', 30000],
['food', 40000]])
It is indexed by id (= 'a') regardless of res_number, and I know that
df.loc[['a']].values
works perfectly for it. But the indexing is too slow... I have to index 150000 values.
So I indexed the dataframe by
df.iloc[1].values
but it only returned
array(['toys', 20000])
Is there any faster method for indexing a hierarchical structure?
Option 1
pd.DataFrame.xs
df.xs('a').values
Option 2
pd.DataFrame.loc
df.loc['a'].values
Option 3
pd.DataFrame.query
df.query('ilevel_0 == \'a\'').values
Option 4
A bit more roundabout, use pd.MultiIndex.get_level_values to create a mask:
df[df.index.get_level_values(0) == 'a'].values
array([['toys', 20000],
['clothing', 30000],
['food', 40000]], dtype=object)
Option 5
Use .loc with axis parameter
df.loc(axis=0)['a',:].values
Output:
array([['toys', 20000],
['clothing', 30000],
['food', 40000]], dtype=object)
Another option: keep an extra dictionary of the beginning and ending indices of each group. (Assume the index is sorted.)
Option 1: use the first and the last index in a group to query with iloc.
d = {k: slice(v[0], v[-1]+1) for k, v in df.groupby("id").indices.items()}
df.iloc[d["b"]]
array([['food', 40000],
['laptop', 30000]], dtype=object)
Option 2: use the first and the last index to query with numpy's index slicing on df.values.
df.values[d["a"]]
Timing
df_testing = pd.DataFrame({"id": [str(v) for v in np.random.randint(0, 100, 150000)],
"res_number": np.arange(150000),
"payment": [v for v in np.random.randint(0, 100000, 150000)]}
).set_index(["id","res_number"]).sort_index()
d = {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
# by COLDSPEED
%timeit df_testing.xs('5').values
303 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# by OP
%timeit df_testing.loc['5'].values
358 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Tai 1
%timeit df_testing.iloc[d["5"]].values
130 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Tai 2
%timeit df_testing.values[d["5"]]
7.26 µs ± 845 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
However, getting d is not costless.
%timeit {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
16.3 ms ± 6.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is creating an extra lookup table d worth it?
The cost of creating the index is amortized over the gain from the queries. In my toy dataset, it takes 16.3 ms / (300 µs - 7 µs) ≈ 56 queries to recover the cost of creating the index.
Again, the index needs to be sorted.
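A self-contained sketch of the slice-dictionary technique on the question's toy data (assuming, as above, a sorted MultiIndex so that each group occupies one contiguous run of rows):

```python
import pandas as pd

df = pd.DataFrame({'id':         ['a', 'a', 'a', 'b', 'b'],
                   'res_number': [1, 2, 3, 4, 5],
                   'type':       ['toys', 'clothing', 'food', 'food', 'laptop'],
                   'payment':    [20000, 30000, 40000, 40000, 30000]}
                  ).set_index(['id', 'res_number']).sort_index()

# groupby(...).indices maps each id to the positional indices of its rows;
# with a sorted index every group is one contiguous slice.
d = {k: slice(v[0], v[-1] + 1) for k, v in df.groupby("id").indices.items()}

rows_a = df.values[d['a']]      # plain ndarray slicing: no index machinery involved
print(rows_a.tolist())          # [['toys', 20000], ['clothing', 30000], ['food', 40000]]
```

Because df.values[d['a']] never touches the pandas index, repeated queries pay only numpy slicing cost, which is where the ~40x speedup in the timings comes from.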

Increased performance for pickle.dumps through monkey patching DEFAULT_PROTOCOL?

I've noticed it can make quite a substantial difference in terms of speed,
if you specify the protocol used in pickle.dumps via argument or if you
monkey patch pickle.DEFAULT_PROTOCOL for the desired protocol version.
On Python 3.6, pickle.DEFAULT_PROTOCOL is 3 and
pickle.HIGHEST_PROTOCOL is 4.
For objects up to a certain length it seems to be faster setting
DEFAULT_PROTOCOL to 4 instead of passing protocol=4 as argument.
In my tests for example, with setting pickle.DEFAULT_PROTOCOL to 4 and pickling
a list with length 1 by calling pickle.dumps(packet_list_1) takes 481 ns, while calling with pickle.dumps(packet_list_1, protocol=4) takes 733 ns, a staggering ~52% speed-penalty for passing protocol explicitly instead of falling back to default (which was set to 4 before).
"""
pickle.DEFAULT_PROTOCOL = 4
pickle.dumps(packet) vs pickle.dumps(packet, protocol=4):
For a list with length 1 it's 481ns vs 733ns (~52% penalty).
For a list with length 10 it's 763ns vs 999ns (~30% penalty).
For a list with length 100 it's 2.99 µs vs 3.21 µs (~7% penalty).
For a list with length 1000 it's 25.8 µs vs 26.2 µs (~1.5% penalty).
For a list with length 1_000_000 it's 32 ms vs 32.4 ms (~1.13% penalty).
"""
I've found this behaviour for instances, lists, dicts and arrays, which is
all I tested so far. The effect diminishes with object size.
For dicts I noticed the effect turning into the opposite at some point, so that
for a dict of length 10**6 (with unique integer values) it's faster to explicitly
pass protocol=4 as an argument (269 ms) than to rely on the default set to 4 (286 ms).
"""
pickle.DEFAULT_PROTOCOL = 4
pickle.dumps(packet) vs pickle.dumps(packet, protocol=4):
For a dict with length 1 it's 589 ns vs 811 ns (~38% penalty).
For a dict with length 10 it's 1.59 µs vs 1.81 µs (~14% penalty).
For a dict with length 100 it's 13.2 µs vs 12.9 µs (~2.3% improvement).
For a dict with length 1000 it's 128 µs vs 129 µs (~0.8% penalty).
For a dict with length 1_000_000 it's 306 ms vs 283 ms (~7.5% improvement).
"""
Glancing over the pickle source, nothing strikes my eye that might cause
such variations.
How can this unexpected behaviour be explained?
Are there any caveats for setting pickle.DEFAULT_PROTOCOL instead of passing
protocol as argument to take advantage of the improved speed?
(Timed with IPython's timeit magic on Python 3.6.3, IPython 6.2.1, Windows 7)
Some example code dump:
# instances -------------------------------------------------------------
class Dummy: pass
dummy = Dummy()
pickle.DEFAULT_PROTOCOL = 3
"""
>>> %timeit pickle.dumps(dummy)
5.8 µs ± 33.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit pickle.dumps(dummy, protocol=4)
6.18 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
%timeit pickle.dumps(dummy)
5.74 µs ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit pickle.dumps(dummy, protocol=4)
6.24 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
# lists -------------------------------------------------------------
packet_list_1 = [*range(1)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1)
476 ns ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_1, protocol=4)
730 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1)
481 ns ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_1, protocol=4)
733 ns ± 2.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
# --------------------------
packet_list_10 = [*range(10)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_10)
714 ns ± 3.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_10, protocol=4)
978 ns ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_10)
763 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_10, protocol=4)
999 ns ± 8.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
# --------------------------
packet_list_100 = [*range(100)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_100)
2.96 µs ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit pickle.dumps(packet_list_100, protocol=4)
3.22 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_100)
2.99 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit pickle.dumps(packet_list_100, protocol=4)
3.21 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
# --------------------------
packet_list_1000 = [*range(1000)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1000)
26 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>%timeit pickle.dumps(packet_list_1000, protocol=4)
26.4 µs ± 93.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1000)
25.8 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>%timeit pickle.dumps(packet_list_1000, protocol=4)
26.2 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
"""
# --------------------------
packet_list_1m = [*range(10**6)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1m)
32 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>%timeit pickle.dumps(packet_list_1m, protocol=4)
32.3 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1m)
32 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>%timeit pickle.dumps(packet_list_1m, protocol=4)
32.4 ms ± 466 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
Let's reorganize your %timeit results by return value:
| DEFAULT_PROTOCOL | call | %timeit | returns |
|------------------+-----------------------------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------|
| 3 | pickle.dumps(dummy) | 5.8 µs ± 33.5 ns | b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.' |
| 4 | pickle.dumps(dummy) | 5.74 µs ± 18.8 ns | b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.' |
| 3 | pickle.dumps(dummy, protocol=4) | 6.18 µs ± 10.4 ns | b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.' |
| 4 | pickle.dumps(dummy, protocol=4) | 6.24 µs ± 26.7 ns | b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.' |
| 3 | pickle.dumps(packet_list_1) | 476 ns ± 1.01 ns | b'\x80\x03]q\x00cbuiltins\nrange\nq\x01K\x00K\x01K\x01\x87q\x02Rq\x03a.' |
| 4 | pickle.dumps(packet_list_1) | 481 ns ± 2.12 ns | b'\x80\x03]q\x00cbuiltins\nrange\nq\x01K\x00K\x01K\x01\x87q\x02Rq\x03a.' |
| 3 | pickle.dumps(packet_list_1, protocol=4) | 730 ns ± 2.22 ns | b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00]\x94\x8c\x08builtins\x94\x8c\x05range\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94a.' |
| 4 | pickle.dumps(packet_list_1, protocol=4) | 733 ns ± 2.94 ns | b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00]\x94\x8c\x08builtins\x94\x8c\x05range\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94a.' |
Notice how the %timeit results correspond well when we pair calls that give the same return value.
As you can see, the value of pickle.DEFAULT_PROTOCOL has no effect on the value returned by pickle.dumps.
If the protocol parameter is not specified, the default protocol is 3 no matter what the value of pickle.DEFAULT_PROTOCOL is.
The reason is here:
# Use the faster _pickle if possible
try:
from _pickle import (
PickleError,
PicklingError,
UnpicklingError,
Pickler,
Unpickler,
dump,
dumps,
load,
loads
)
except ImportError:
Pickler, Unpickler = _Pickler, _Unpickler
dump, dumps, load, loads = _dump, _dumps, _load, _loads
The pickle module sets pickle.dumps to _pickle.dumps if it succeeds in importing _pickle, the compiled version of the pickle module.
The _pickle module uses protocol=3 by default. Only if Python fails to import _pickle is dumps set to the Python version:
def _dumps(obj, protocol=None, *, fix_imports=True):
f = io.BytesIO()
_Pickler(f, protocol, fix_imports=fix_imports).dump(obj)
res = f.getvalue()
assert isinstance(res, bytes_types)
return res
Only the Python version, _dumps, is affected by the value of pickle.DEFAULT_PROTOCOL:
In [68]: pickle.DEFAULT_PROTOCOL = 3
In [70]: pickle._dumps(dummy)
Out[70]: b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.'
In [71]: pickle.DEFAULT_PROTOCOL = 4
In [72]: pickle._dumps(dummy)
Out[72]: b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.'
