I want to count how often a regex pattern (the preceding and following characters are needed to identify it) occurs across multiple dataframe columns. I found a solution, but it seems a little slow. Is there a more efficient way?
| column_A            | column_B     | column_C            |
|---------------------+--------------+---------------------|
| Test • test abc     | winter • sun | snow rain blank     |
| blabla • summer abc | break • Data | test letter • stop. |
So far I created a solution which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
str.count can be applied to the whole dataframe, column by column, without hard-coding each column name. Try:
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I tried this on a 1000 × 1000 dataframe. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
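You can use a generator expression with re.search, which reduces the runtime from 938 µs to 26.7 µs on the sample data (pass the generator straight to sum instead of building a list first). Note that re.search only tests whether a cell contains the pattern, so this counts matching cells rather than total occurrences; the two agree here because no cell contains more than one match.
import re
res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
          for col in ['column_A', 'column_B','column_C'])
print(res)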
# 5
Benchmark:
%%timeit
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
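One caveat when comparing these timings: str.count counts every occurrence, whereas a re.search test counts each cell at most once. Below is a minimal sketch of the difference, assuming a small dataframe rebuilt from the question's sample. The cell "a • b • c" and the variable names are invented for illustration, and no timing claims are made for this version.

```python
import re

import pandas as pd

# Sample data rebuilt from the question (two rows, three columns).
df = pd.DataFrame({
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data"],
    "column_C": ["snow rain blank", "test letter • stop."],
})

pattern = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")

# Flatten all cells into one Series, then count occurrences in a single call.
flat = pd.Series(df.to_numpy().ravel())
total_vectorised = int(flat.str.count(pattern.pattern).sum())

# Pure-Python equivalent that also counts every occurrence per cell.
total_findall = sum(len(pattern.findall(cell)) for cell in df.to_numpy().ravel())

# Hypothetical cell with two matches: findall sees both, search only reports "found".
cell = "a • b • c"
occurrences = len(pattern.findall(cell))
cells_matched = int(bool(pattern.search(cell)))

print(total_vectorised, total_findall, occurrences, cells_matched)
```

If cells can contain more than one match, the findall variant keeps the semantics of the original str.count solution.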
# --------------------------------------------------------------------#
The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
Why doesn't that work when I want a formatted string? The error message says it: '%d' expects a single number, not a pandas.Series.
Provided there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
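Note that rows such as 2022-01-02 pair ISO week 52 with calendar year 2022, although that week belongs to ISO year 2021. If the ISO year is what you want, here is a minimal sketch that takes both values from dt.isocalendar(); the column name isoWeekYear and the three sample dates are chosen only for illustration.

```python
import pandas as pd

# Three dates around the year boundary, taken from the output above.
df = pd.DataFrame({"date": pd.to_datetime(["2021-12-26", "2022-01-02", "2022-01-09"])})

# isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns.
iso = df["date"].dt.isocalendar()

# Zero-pad the week to two digits so the result matches the '%V' style.
df["isoWeekYear"] = iso["week"].astype(str).str.zfill(2) + "/" + iso["year"].astype(str)

print(df)
```

Here 2022-01-02 becomes 52/2021 rather than 52/2022.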
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to keep the format string, you can use map (or apply) to apply the formatting to every row. The .dt accessor is not needed, since you will be working with each date itself, not a Series of dates. Also, isocalendar() returns a tuple whose second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
    return d.date.dt.strftime('%V/%Y')
def test2(d):
    return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
    return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
    return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
    return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
I'm using 12 hours of sensor data at 25 Hz that I query from MongoDB into a dataframe.
I'm trying to extract a list or a dict of 1-minute dataframes from the 12 hours.
I use a window of 1 minute and a stride (step) of 10 seconds.
The goal is to build a dataset by creating a list or dict of 1-minute dataframes (samples) from 12 hours of data, which will be converted to tensors and fed to a deep learning model.
The dataframe has a datetime index and 4 columns of sensor values.
Here is what part of the data looks like:
A B C D
2020-06-17 22:00:00.000 1.052 -0.147 0.836 0.623
2020-06-17 22:00:00.040 1.011 -0.147 0.820 0.574
2020-06-17 22:00:00.080 1.067 -0.131 0.868 0.607
2020-06-17 22:00:00.120 1.033 -0.163 0.820 0.607
2020-06-17 22:00:00.160 1.030 -0.147 0.820 0.607
Below is sample code similar to how I extract the 1-minute windows. For 12 hours of data it takes 5 minutes, which is a long time.
Any ideas on how to reduce the running time in this case?
step = 10 * 25
w = 60 * 25
df  # 12 hours df data
sensor_dfs = []
df_range = range(0, df.shape[0] - step, step)
for a in df_range:
    sample = df.iloc[a:a + w]
    sensor_dfs.append(sample)
I created random data and made the following experiments looking at runtime:
import numpy as np
import pandas as pd
# create random normal samples
w = 60 * 25  # 1 minute window
step = w  # no overlap
num_samples = 50000
data = np.random.normal(size=(num_samples, 3))
date_rng = pd.date_range(start="2020-07-09 00:00:00.000",
                         freq="40ms", periods=num_samples)
data = pd.DataFrame(data, columns=["x", "y", "z"], index=date_rng)
data.head()
x y z
2020-07-09 00:00:00.000 -1.062264 -0.008656 0.399642
2020-07-09 00:00:00.040 0.182398 -1.014290 -1.108719
2020-07-09 00:00:00.080 -0.489814 -0.020697 0.651120
2020-07-09 00:00:00.120 -0.776405 -0.596601 0.611516
2020-07-09 00:00:00.160 0.663900 0.149909 -0.552779
numbers are of type float64
data.dtypes
x float64
y float64
z float64
dtype: object
using for loops
minute_samples = []
for i in range(0, len(data) - w, step):
    minute_samples.append(data.iloc[i:i + w])
result:6.45 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using list comprehension
minute_samples=[data.iloc[i:i+w] for i in range(0,len(data)-w,step)]
result: 6.13 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with list comprehension
minute_samples=[df for i, df in data.groupby(pd.Grouper(freq="1T"))]
result:7.89 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using grouper with dict
minute_samples=dict(tuple(data.groupby(pd.Grouper(freq="1T"))))
result: 7.41 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
resample is also an option here, but since it uses Grouper behind the scenes I don't expect its runtime to differ.
List comprehension seems slightly faster than the rest.
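Since the windows are ultimately converted to a tensor, another option is to skip the per-window dataframes and slice the underlying array directly. Here is a minimal sketch using numpy.lib.stride_tricks.sliding_window_view, which assumes NumPy 1.20 or newer, evenly spaced samples, and random stand-in data; the names values and windows are illustrative only. I have not timed it against the approaches above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

w = 60 * 25      # window: 1 minute at 25 Hz
step = 10 * 25   # stride: 10 seconds at 25 Hz
n = 5000         # small stand-in for the 12 hours of data

idx = pd.date_range("2020-06-17 22:00:00", freq="40ms", periods=n)
df = pd.DataFrame(np.random.normal(size=(n, 4)), columns=list("ABCD"), index=idx)

values = df.to_numpy()

# All full windows along the time axis, as a read-only view (no data copied).
# The window axis is appended last, so the shape is (n - w + 1, n_columns, w).
all_windows = sliding_window_view(values, window_shape=w, axis=0)

# Keep every step-th window and reorder to (n_windows, w, n_columns).
windows = all_windows[::step].transpose(0, 2, 1)

print(windows.shape)
```

Only full-length windows are produced, so there are no shorter trailing samples to filter out. The result is a read-only view; copy it (for example with np.ascontiguousarray) if the downstream code needs to write to it.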
I want to use Tensorflow serving to load multiple Models. If I mount a directory containing the model, loading everything is done in an instant, while loading them from a gs:// path takes around 10 seconds per model.
While researching the issue I discovered this is probably a Tensorflow issue and not a Tensorflow Serving issue, since loading the models directly in Tensorflow shows a huge difference as well:
[ins] In [22]: %timeit tf.saved_model.load('test/1')
3.88 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[ins] In [23]: %timeit tf.saved_model.load('gs://path/to/test/1')
30.6 s ± 2.66 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then it could be that downloading the model (which is very small) is slow, but I tested this as well:
from google.cloud import storage

def test_load():
    bucket_name = 'path'
    folder = 'test'
    delimiter = '/'
    file = 'to/test/1'
    bucket = storage.Client().get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=file, delimiter=delimiter)  # Excluding folder inside bucket
    for blob in blobs:
        print(blob.name)
        destination_uri = '{}/{}'.format(folder, blob.name)
        blob.download_to_filename(destination_uri)
[ins] In [31]: %timeit test_load()
541 ms ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any idea what is happening here?
bson.son.SON is used mainly in pymongo to get an ordered mapping (dict).
But Python already has an ordered dict in collections.OrderedDict.
I have read the docs of bson.son.SON. They say SON is similar to OrderedDict but do not mention the difference.
So what is the difference? When should we use SON and when should we use OrderedDict?
Currently, the only slight difference between the two is that bson.son.SON remains backward compatible with Python 2.7 and older versions.
Also, the argument that SON serializes faster than OrderedDict is no longer correct in 2018.
The son module was added on Jan 8, 2009.
collections.OrderedDict (PEP 372) was added to Python on Mar 2, 2009.
While the difference in dates doesn't tell which was released first, it is interesting to see that MongoDB had already implemented an ordered map for their use case. I guess they may have decided to keep maintaining it for backward compatibility instead of switching all SON references in their codebase to collections.OrderedDict.
In small experiments with both, you can easily see that collections.OrderedDict performs better than bson.son.SON.
In [1]: from bson.son import SON
from collections import OrderedDict
import copy
import json
print(set(dir(SON)) - set(dir(OrderedDict)))
{'__weakref__', 'iteritems', 'iterkeys', 'itervalues', '__module__', 'has_key', '__deepcopy__', 'to_dict'}
In [2]: %timeit s = SON([('a',2)]); z = copy.deepcopy(s)
14.3 µs ± 758 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit s = OrderedDict([('a',2)]); z = copy.deepcopy(s)
7.54 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit s = SON(data=[('a',2)]); z = json.dumps(s)
11.5 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [5]: %timeit s = OrderedDict([('a',2)]); z = json.dumps(s)
5.35 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In answer to your question about when to use SON:
use SON if running your software on versions of Python older than 2.7.
If you can help it, use OrderedDict from the collections module.
You can also use a plain dict, which preserves insertion order as of Python 3.7.
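If you do switch to a plain dict, keep in mind that it is not a drop-in replacement for OrderedDict in every respect. The following sketch uses only the standard library to show the two behaviours that most often matter; the variable names are illustrative.

```python
from collections import OrderedDict

first = {"a": 1, "b": 2}
second = {"b": 2, "a": 1}

# Plain dicts remember insertion order (guaranteed since Python 3.7) ...
keys_in_order = list(first)

# ... but equality between plain dicts ignores that order,
plain_equal = first == second

# while equality between two OrderedDicts is order-sensitive.
ordered_equal = OrderedDict(first) == OrderedDict(second)

# OrderedDict also offers reordering helpers that dict lacks.
od = OrderedDict(first)
od.move_to_end("a")
reordered_keys = list(od)

print(keys_in_order, plain_equal, ordered_equal, reordered_keys)
```

So a plain dict is enough when you only need keys to come back in insertion order, while OrderedDict is still useful when order must take part in comparisons or be rearranged.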
I've noticed it can make quite a substantial difference in terms of speed,
if you specify the protocol used in pickle.dumps via argument or if you
monkey patch pickle.DEFAULT_PROTOCOL for the desired protocol version.
On Python 3.6, pickle.DEFAULT_PROTOCOL is 3 and
pickle.HIGHEST_PROTOCOL is 4.
For objects up to a certain length it seems to be faster setting
DEFAULT_PROTOCOL to 4 instead of passing protocol=4 as argument.
In my tests for example, with setting pickle.DEFAULT_PROTOCOL to 4 and pickling
a list with length 1 by calling pickle.dumps(packet_list_1) takes 481 ns, while calling with pickle.dumps(packet_list_1, protocol=4) takes 733 ns, a staggering ~52% speed-penalty for passing protocol explicitly instead of falling back to default (which was set to 4 before).
"""
(stackoverflow insists this to be formatted as code:)
pickle.DEFAULT_PROTOCOL = 4
pickle.dumps(packet) vs pickle.dumps(packet, protocol=4):
For a list with length 1 it's 481ns vs 733ns (~52% penalty).
For a list with length 10 it's 763ns vs 999ns (~30% penalty).
For a list with length 100 it's 2.99 µs vs 3.21 µs (~7% penalty).
For a list with length 1000 it's 25.8 µs vs 26.2 µs (~1.5% penalty).
For a list with length 1_000_000 it's 32 ms vs 32.4 ms (~1.13% penalty).
"""
I've found this behaviour for instances, lists, dicts and arrays, which is
all I tested so far. The effect diminishes with object size.
For dicts I noticed the effect turning at some point into the opposite, so that
for a length 10**6 dict (with unique integer values) it's faster to explicitly
pass protocol=4 as argument (269ms) than relying on default set to 4 (286ms).
"""
pickle.DEFAULT_PROTOCOL = 4
pickle.dumps(packet) vs pickle.dumps(packet, protocol=4):
For a dict with length 1 it's 589 ns vs 811 ns (~38% penalty).
For a dict with length 10 it's 1.59 µs vs 1.81 µs (~14% penalty).
For a dict with length 100 it's 13.2 µs vs 12.9 µs (~2.3% improvement).
For a dict with length 1000 it's 128 µs vs 129 µs (~0.8% penalty).
For a dict with length 1_000_000 it's 306 ms vs 283 ms (~7.5% improvement).
"""
Glancing over the pickle source, nothing strikes me as a likely cause of
such variations.
How is this unexpected behaviour explainable?
Are there any caveats for setting pickle.DEFAULT_PROTOCOL instead of passing
protocol as argument to take advantage of the improved speed?
(Timed with IPython's timeit magic on Python 3.6.3, IPython 6.2.1, Windows 7)
Some example code dump:
# instances -------------------------------------------------------------
class Dummy: pass
dummy = Dummy()
pickle.DEFAULT_PROTOCOL = 3
"""
>>> %timeit pickle.dumps(dummy)
5.8 µs ± 33.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit pickle.dumps(dummy, protocol=4)
6.18 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
%timeit pickle.dumps(dummy)
5.74 µs ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit pickle.dumps(dummy, protocol=4)
6.24 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
# lists -------------------------------------------------------------
packet_list_1 = [*range(1)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1)
476 ns ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_1, protocol=4)
730 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1)
481 ns ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_1, protocol=4)
733 ns ± 2.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
# --------------------------
packet_list_10 = [*range(10)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_10)
714 ns ± 3.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_10, protocol=4)
978 ns ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_10)
763 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit pickle.dumps(packet_list_10, protocol=4)
999 ns ± 8.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
"""
# --------------------------
packet_list_100 = [*range(100)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_100)
2.96 µs ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit pickle.dumps(packet_list_100, protocol=4)
3.22 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_100)
2.99 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit pickle.dumps(packet_list_100, protocol=4)
3.21 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
"""
# --------------------------
packet_list_1000 = [*range(1000)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1000)
26 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>%timeit pickle.dumps(packet_list_1000, protocol=4)
26.4 µs ± 93.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1000)
25.8 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>%timeit pickle.dumps(packet_list_1000, protocol=4)
26.2 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
"""
# --------------------------
packet_list_1m = [*range(10**6)]
pickle.DEFAULT_PROTOCOL = 3
"""
>>>%timeit pickle.dumps(packet_list_1m)
32 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>%timeit pickle.dumps(packet_list_1m, protocol=4)
32.3 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
pickle.DEFAULT_PROTOCOL = 4
"""
>>>%timeit pickle.dumps(packet_list_1m)
32 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>%timeit pickle.dumps(packet_list_1m, protocol=4)
32.4 ms ± 466 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
Let's reorganize your %timeit results by return value:
| DEFAULT_PROTOCOL | call | %timeit | returns |
|------------------+-----------------------------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------|
| 3 | pickle.dumps(dummy) | 5.8 µs ± 33.5 ns | b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.' |
| 4 | pickle.dumps(dummy) | 5.74 µs ± 18.8 ns | b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.' |
| 3 | pickle.dumps(dummy, protocol=4) | 6.18 µs ± 10.4 ns | b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.' |
| 4 | pickle.dumps(dummy, protocol=4) | 6.24 µs ± 26.7 ns | b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.' |
| 3 | pickle.dumps(packet_list_1) | 476 ns ± 1.01 ns | b'\x80\x03]q\x00cbuiltins\nrange\nq\x01K\x00K\x01K\x01\x87q\x02Rq\x03a.' |
| 4 | pickle.dumps(packet_list_1) | 481 ns ± 2.12 ns | b'\x80\x03]q\x00cbuiltins\nrange\nq\x01K\x00K\x01K\x01\x87q\x02Rq\x03a.' |
| 3 | pickle.dumps(packet_list_1, protocol=4) | 730 ns ± 2.22 ns | b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00]\x94\x8c\x08builtins\x94\x8c\x05range\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94a.' |
| 4 | pickle.dumps(packet_list_1, protocol=4) | 733 ns ± 2.94 ns | b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00]\x94\x8c\x08builtins\x94\x8c\x05range\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94a.' |
Notice how the %timeit results correspond well when we pair calls that give the same return value.
As you can see, the value of pickle.DEFAULT_PROTOCOL has no effect on the value returned by pickle.dumps.
If the protocol parameter is not specified, the default protocol is 3 no matter what the value of pickle.DEFAULT_PROTOCOL.
The reason is here:
# Use the faster _pickle if possible
try:
    from _pickle import (
        PickleError,
        PicklingError,
        UnpicklingError,
        Pickler,
        Unpickler,
        dump,
        dumps,
        load,
        loads
    )
except ImportError:
    Pickler, Unpickler = _Pickler, _Unpickler
    dump, dumps, load, loads = _dump, _dumps, _load, _loads
The pickle module sets pickle.dumps to _pickle.dumps if it succeeds in importing _pickle, the compiled version of the pickle module.
The _pickle module uses protocol=3 by default. Only if Python fails to import _pickle is dumps set to the Python version:
def _dumps(obj, protocol=None, *, fix_imports=True):
    f = io.BytesIO()
    _Pickler(f, protocol, fix_imports=fix_imports).dump(obj)
    res = f.getvalue()
    assert isinstance(res, bytes_types)
    return res
Only the Python version, _dumps, is affected by the value of pickle.DEFAULT_PROTOCOL:
In [68]: pickle.DEFAULT_PROTOCOL = 3
In [70]: pickle._dumps(dummy)
Out[70]: b'\x80\x03c__main__\nDummy\nq\x00)\x81q\x01.'
In [71]: pickle.DEFAULT_PROTOCOL = 4
In [72]: pickle._dumps(dummy)
Out[72]: b'\x80\x04\x95\x1b\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x05Dummy\x94\x93\x94)}\x94\x92\x94.'
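To check this on your own interpreter without reading the byte strings by eye, you can inspect the protocol number stored in the pickle header. This is a rough sketch: protocol_of is a helper made up for this example, it relies on pickles of protocol 2 and higher starting with the PROTO opcode (0x80) followed by the protocol number, and pickle._dumps is a private function that may change between Python versions.

```python
import pickle

def protocol_of(data):
    # For protocol >= 2 the stream starts with b'\x80' followed by the protocol number.
    return data[1]

obj = [0]
original_default = pickle.DEFAULT_PROTOCOL
try:
    # Monkey patch the module-level default, as in the question.
    pickle.DEFAULT_PROTOCOL = 2
    via_dumps = pickle.dumps(obj)      # compiled _pickle version when available
    via_private = pickle._dumps(obj)   # pure-Python version (private API)
finally:
    # Always restore the original default.
    pickle.DEFAULT_PROTOCOL = original_default

# Passing the protocol explicitly works with either implementation.
explicit = pickle.dumps(obj, protocol=2)

print("pickle.dumps after patching: ", protocol_of(via_dumps))
print("pickle._dumps after patching:", protocol_of(via_private))
print("explicit protocol=2:         ", protocol_of(explicit))
```

If the first number printed differs from 2, the patched default was ignored by pickle.dumps on your interpreter, and passing protocol explicitly is the reliable way to select it.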