Normalising data for plotting - python

I am trying to plot the data shown below in a normalised way, in order to have the maximum value on the y-axis equal to 1.
Dataset:
%_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0
I tried as follows:
cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
but I am not completely sure about the output.
In fact, if I plot as follows
df.pivot_table(index='Label').plot.bar()
I get a different result. I think it is because, in the first snippet, I am not taking the Label column into account.

There are multiple techniques to normalize data. This answer shows one that uses native pandas (min-max scaling):
import io
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO(""" %_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0"""), sep=r"\s+")

# min-max scale every column, then plot raw and normalised data side by side
fig, ax = plt.subplots(2, figsize=[10, 6])
df2 = (df - df.min()) / (df.max() - df.min())
df.plot(ax=ax[0], kind="line")
df2.plot(ax=ax[1], kind="line")
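A follow-up sketch (my addition, not part of the original answer) for the asker's concern about Label: normalise only the feature columns so Label is left intact for the pivot_table grouping, then plot the per-label means as bars.
# Sketch: min-max scale only the feature columns, leaving Label untouched,
# then aggregate by Label exactly as in the question.
cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
df.pivot_table(index="Label").plot.bar()
plt.show()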

Related

How to add a total column that sums up values from dynamic columns

I have the following DataFrame
party 2022 - 45 2022 - 46 2022 - 48 2022 - 49 2022 - 50 2022 - 51 2022 - 52 2023 - 01 2023 - 02 2023 - 06 2023 - 10 scheduled_total ledger_balance
0 V00011 0.00 0.0 1917.50 6894.00 5743.50 3826.00 3826.00 0.00 0.00 0.00 0.00 NaN -37145.00
1 V00020 7327.22 0.0 0.00 5652.00 5652.00 5652.00 0.00 0.00 0.00 0.00 0.00 NaN -45863.01
2 V00117 0.00 0.0 2265.50 3776.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -8589.50
3 V00144 23986.55 0.0 0.00 11629.63 11629.63 11629.63 11629.63 11629.63 0.00 0.00 0.00 NaN -276629.91
4 V00153 0.00 0.0 1794.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -8769.00
5 V00198 0.00 2655.0 2655.00 2655.00 2655.00 2655.00 0.00 0.00 0.00 0.00 0.00 NaN -10620.00
6 V00229 11868.53 0.0 7327.75 14837.50 14837.50 14837.50 0.00 0.00 0.00 0.00 0.00 NaN -103789.08
7 V00235 0.00 0.0 9600.00 9600.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -43200.00
8 V00241 0.00 0.0 5575.50 5575.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -11151.00
9 V00261 0.00 0.0 6208.50 17201.75 11502.25 5630.50 763.50 0.00 0.00 0.00 0.00 NaN -34131.22
10 V00319 0.00 0.0 0.00 13865.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -27731.00
11 V00325 0.00 0.0 0.00 0.00 850.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN -3568.00
12 V00345 0.00 0.0 0.00 5000.25 0.00 0.00 0.00 0.00 5000.25 5000.25 5000.25 NaN -20001.00
total NaN 43182.30 2655.0 37343.75 96686.63 52869.88 44230.63 16219.13 11629.63 5000.25 5000.25 5000.25 NaN -631187.72
The columns after party, up to scheduled_total, are dynamic columns created using pivot. They represent "year - week". I want to get the total of each row in "scheduled_total". Any help?
Here is the code used to generate the above output:
pr = frappe.db.get_list("Payment Request",filters={"docstatus":1,"status":"Initiated","party_type":"Supplier"},fields=["name", "transaction_date", "payment_request_type", "party", "supplier_name", "reference_name", "grand_total", "status"])
df = pd.DataFrame.from_records(pr)
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df["week"] = df['transaction_date'].dt.year.astype(str) + " - " + df['transaction_date'].dt.week.map("{:02}".format).astype(str)
df = df.groupby(["party", "week", "supplier_name"],as_index=False)["grand_total"].sum()
df = df.sort_values(by=['week'],ascending=True)
df = df.pivot(index=["party","supplier_name"], columns='week', values='grand_total').reset_index().rename_axis(None, axis=1).fillna(0)
df["scheduled_total"] = df.agg("sum", axis=0)
df["ledger_balance"] = df.apply(lambda x: get_balance_on(party_type="Supplier",party=x['party']), axis=1)
df.loc['total'] = df.sum(numeric_only=True)
print(df.to_string())
df.to_excel("/home/frappe/frappe-bench/apps/zarnik/zarnik/ap_automation/report/payment_request_summary/payment_request_summary.xlsx")
column_names = df.columns.values
print(column_names)
I have tried the following,
df["scheduled_total"] = df.agg("sum", axis=0)
This returns NaN.
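This question has no answer recorded here, but here is a minimal sketch of the likely fix (my assumption, not part of the original thread): df.agg("sum", axis=0) aggregates down each column and returns a Series indexed by the column names, which does not align with the row index on assignment, hence the NaN. Summing row-wise (axis=1) over just the week columns gives the intended per-row total:
# Sketch (untested): at this point the frame holds "party", "supplier_name"
# and the dynamic "year - week" columns, so sum only those, row-wise.
week_cols = [c for c in df.columns if c not in ("party", "supplier_name")]
df["scheduled_total"] = df[week_cols].sum(axis=1)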

Fill all the columns of the dataframe with condition

I am trying to fill the dataframe based on a certain condition but I cannot find the appropriate solution. I have a somewhat larger dataframe, but let's say that my pandas dataframe looks like this:
0     1     2     3     4     5
0.32  0.40  0.60  1.20  3.40  0.00
0.17  0.12  0.00  1.30  2.42  0.00
0.31  0.90  0.80  1.24  4.35  0.00
0.39  0.00  0.90  1.50  1.40  0.00
And I want to update the values so that, once 0.00 appears in a row (rows 2 and 4), all the values from that point to the end of the row are 0.00. Something like this:
0     1     2     3     4     5
0.32  0.40  0.60  1.20  3.40  0.00
0.17  0.12  0.00  0.00  0.00  0.00
0.31  0.90  0.80  1.24  4.35  0.00
0.39  0.00  0.00  0.00  0.00  0.00
I have tried with
for t in range(1, T-1):
    data = np.where(df[t-1] == 0, 0, df[t])
and several other ways, but I couldn't get what I want.
Thanks!
Try as follows:
Select from df with df.eq(0). This keeps all zeros and turns everything else into NaN values.
Now apply df.ffill along axis=1. This propagates each zero through to the end of its row.
Finally, change the dtype to bool by chaining df.astype, turning all zeros into False and all NaN values into True.
We feed the result to df.where: for all True values we pick from df itself, and for all False values we insert 0.
df = df.where(df[df.eq(0)].ffill(axis=1).astype(bool), 0)
print(df)
0 1 2 3 4 5
0 0.32 0.40 0.6 1.20 3.40 0.0
1 0.17 0.12 0.0 0.00 0.00 0.0
2 0.31 0.90 0.8 1.24 4.35 0.0
3 0.39 0.00 0.0 0.00 0.00 0.0
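For readability, the same one-liner can be unpacked into the steps listed above (an equivalent sketch, not a different method):
mask = df[df.eq(0)]        # zeros stay, everything else becomes NaN
mask = mask.ffill(axis=1)  # carry each zero to the end of its row
mask = mask.astype(bool)   # 0.0 -> False, NaN -> True
out = df.where(mask, 0)    # keep df where True, write 0 where False
print(out)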

Sorting by data in another Dataframe

I've been stuck with an engineering problem that's Python/Pandas related. I'd appreciate any help given.
I've simplified the numbers so I can better explain myself.
I have something similar to the following:
    positioning(x-axis)  Calculated difference
1                  0.25                   0.05
2                  0.75                   0.06
3                  1.25                   0.02
4                  0.25                   0.05
5                  0.75                   0.05
6                  1.25                   0.02
7                  0.25                   0.09
8                  0.75                   0.01
9                  1.25                   0.02
10                 0.25                   0.05
What I need to do is re-organise the calculated difference based on the x-axis positioning.
So it looks something like this:
(0.25)  (0.75)  (1.25)
  0.05       0       0
     0    0.06       0
     0       0    0.02
   0.5       0       0
     0     0.5       0
     0       0    0.02
  0.09       0       0
     0    0.01       0
     0       0    0.02
  0.05       0       0
As you can see, I need to organize everything based on the x-positioning.
What is the best approach to this problem? Keep in mind I have 2000+ rows, and the x positioning is dynamic, but currently it goes up to 50 (so a lot of columns).
I hope I've clarified the question.
Use pd.get_dummies to build one indicator column per distinct x position, then multiply row-wise by the Calculated difference:
In [10]: pd.get_dummies(df['positioning(x-axis)']).mul(df['Calculated difference'],axis=0)
Out[10]:
0.25 0.75 1.25
1 0.05 0.00 0.00
2 0.00 0.06 0.00
3 0.00 0.00 0.02
4 0.05 0.00 0.00
5 0.00 0.05 0.00
6 0.00 0.00 0.02
7 0.09 0.00 0.00
8 0.00 0.01 0.00
9 0.00 0.00 0.02
10 0.05 0.00 0.00
Just do a pivot and fill the NaNs with 0:
df.pivot(columns='positioning(x-axis)',values='Calculated difference').fillna(0)
Out[363]:
positioning(x-axis)  0.25  0.75  1.25
0 0.05 0.00 0.00
1 0.00 0.06 0.00
2 0.00 0.00 0.02
3 0.05 0.00 0.00
4 0.00 0.05 0.00
5 0.00 0.00 0.02
6 0.09 0.00 0.00
7 0.00 0.01 0.00
8 0.00 0.00 0.02
9 0.05 0.00 0.00
factorize
Use pd.factorize to build the result array directly with NumPy:
i, p = pd.factorize(df['positioning(x-axis)'])   # integer codes and the unique positions
d = df['Calculated difference'].to_numpy()
a = np.zeros_like(d, shape=(len(df), len(p)))    # zero array of shape (n_rows, n_positions)
a[np.arange(len(df)), i] = d                     # place each difference in its position's column
pd.DataFrame(a, df.index, p)
0.25 0.75 1.25
0 0.05 0.00 0.00
1 0.00 0.06 0.00
2 0.00 0.00 0.02
3 0.05 0.00 0.00
4 0.00 0.05 0.00
5 0.00 0.00 0.02
6 0.09 0.00 0.00
7 0.00 0.01 0.00
8 0.00 0.00 0.02
9 0.05 0.00 0.00
One way to do this would be to use pandas' pivot and then to reset the index.
Given a data frame like this:
positioning(x-axis) Calculated difference
0 0.0 0.61
1 0.0 0.96
2 0.0 0.56
3 0.0 0.91
4 0.0 0.57
5 0.0 0.67
6 0.1 0.71
7 0.1 0.71
8 0.1 0.95
9 0.1 0.89
10 0.1 0.61
df.pivot(columns='positioning(x-axis)', values='Calculated difference').reset_index().drop(columns=['index']).fillna(0)
positioning(x-axis) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.00
3 0.00 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.00 0.00 0.00 0.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.85
8 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 0.00 0.91 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Python: Transpose a dataframe and the result is incomplete

I have a problem when I try to transpose my dataframe: the process works quite well, but I don't understand why it doesn't include the last column in the transposition. This is my original dataframe:
ASH700936D_M-SCIS East 2.07 -0.30 -0.27 0.00 0.00 0.00 0.00 0.19
ASH700936D_M-SCIS North 1.93 0.00 0.00 -0.15 0.09 0.04 -0.27 0.12
ASH700936D_M-SCIS Up 31.59 -40.09 -1.48 15.31 1.03 0.00 0.00 0.65
ASH701945E_M-SCIS East 2.66 0.00 0.00 0.00 0.00 0.00 0.00 0.17
ASH701945E_M-SCIS North -0.91 0.00 0.00 -0.21 0.08 0.13 -0.44 0.12
ASH701945E_M-SCIS Up 5.45 3.31 0.11 0.00 0.00 -0.18 -0.18 0.41
LEIAR20-LEIM East -1.34 0.04 0.06 0.00 0.00 0.03 -0.05 0.05
LEIAR20-LEIM North -0.39 0.04 0.07 0.00 0.00 0.01 -0.06 0.03
LEIAR20-LEIM Up 0.58 0.00 0.00 0.10 0.04 0.20 0.02 0.13
LEIAR25.R3-LEIT East -0.39 0.00 0.00 0.02 0.28 -0.00 -0.31 0.08
LEIAR25.R3-LEIT North -0.65 0.00 0.00 0.09 -0.11 -0.10 0.22 0.05
LEIAR25.R3-LEIT Up 2.02 -8.52 -1.15 2.62 0.63 0.00 0.00 0.27
LEIAR25.R4-LEIT East 0.79 0.00 0.00 0.02 0.22 -0.05 -0.16 0.12
LEIAR25.R4-LEIT North -0.36 0.00 0.00 0.00 0.00 0.04 0.03 0.05
LEIAR25.R4-LEIT Up 15.11 -20.38 0.03 7.53 0.07 0.00 0.00 0.32
My transposed dataframe is:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
ASH700936D_M-SCIS ASH700936D_M-SCIS ASH700936D_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS LEIAR20-LEIM LEIAR20-LEIM LEIAR20-LEIM LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R4-LEIT LEIAR25.R4-LEIT
East North Up East North Up East North Up East North Up East North
2.07 1.93 31.59 2.66 -0.91 5.45 -1.34 -0.39 0.58 -0.39 -0.65 2.02 0.79 -0.36
-0.30 0.0 -40.09 0.0 0.0 3.31 0.04 0.04 0.0 0.0 0.0 -8.52 0.0 0.0
-0.27 0.0 -1.48 0.0 0.0 0.11 0.06 0.07 0.0 0.0 0.0 -1.15 0.0 0.0
0.00 -0.15 15.31 0.0 -0.21 0.0 0.0 0.0 0.1 0.02 0.09 2.62 0.02 0.0
0.00.1 0.09 1.03 0.0 0.08 0.0 0.0 0.0 0.04 0.28 -0.11 0.63 0.22 0.0
0.00.2 0.04 0.0 0.0 0.13 -0.18 0.03 0.01 0.2 -0.0 -0.1 0.0 -0.05 0.04
0.00.3 -0.27 0.0 0.0 -0.44 -0.18 -0.05 -0.06 0.02 -0.31 0.22 0.0 -0.16 0.03
0.19 0.12 0.65 0.17 0.12 0.41 0.05 0.03 0.13 0.08 0.05 0.27 0.12 0.05
As you can see, at the end of the transposed file one of the "LEIAR25.R4-LEIT" columns is missing.
This is my script:
df = pd.read_csv('original_dataframe.txt', sep='\s*', index_col=None, engine='python')
df_transposed = df.transpose()
df_transposed.to_csv('transposta_bozza.txt', sep=' ')
with open('transposta_bozza.txt', 'r') as f:
    with open('transposta.txt', 'w') as r:
        for line in f:
            data = line.split()
            df = '{0[0]:<20}{0[1]:<20}{0[2]:<20}{0[3]:<20}{0[4]:<20}{0[5]:<20}{0[6]:<20}{0[7]:<20}{0[8]:<20}{0[9]:<20}{0[10]:<20}{0[11]:<20}{0[12]:<20}{0[13]:<20}'.format(data)
            r.write("%s\n" % df)
        r.close()
I should add that if I try to put {0[14]:<20} in the format string, I get "IndexError: Index out of range".
I think there is some problem with the index but I don't know what to do!
You don't have to write a temporary file; try out this code:
with open('transposta.txt', 'w') as fp:
    for i, row in df_transposed.iterrows():
        fp.write(''.join('{:<20}'.format(x) for x in row))
        fp.write('\n')
I wasn't able to spot the problem with your code, but I hope this will help.
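A further sketch (my addition, assuming only the fixed-width layout matters): pandas can produce the padded text directly, which avoids hard-coding the number of fields and skips the intermediate CSV file.
# Hypothetical alternative: let pandas pad the columns itself.
# col_space sets a minimum printed width per column, so no '{0[n]:<20}'
# fields have to be spelled out by hand.
with open('transposta.txt', 'w') as fp:
    fp.write(df_transposed.to_string(header=False, col_space=20))
    fp.write('\n')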

Why is there a performance difference between the order of a nested loop?

I have a process that loops through two lists, one relatively large and the other significantly smaller.
Example:
larger_list = list(range(15000))
smaller_list = list(range(2500))
for ll in larger_list:
for sl in smaller_list:
pass
I scaled down the size of the lists to test performance, and I noticed there is a decent difference depending on which list is looped through first.
import timeit
larger_list = list(range(150))
smaller_list = list(range(25))
def large_then_small():
for ll in larger_list:
for sl in smaller_list:
pass
def small_then_large():
for sl in smaller_list:
for ll in larger_list:
pass
print('Larger -> Smaller: {}'.format(timeit.timeit(large_then_small)))
print('Smaller -> Larger: {}'.format(timeit.timeit(small_then_large)))
>>> Larger -> Smaller: 114.884992572
>>> Smaller -> Larger: 98.7751009799
At first glance, they look identical; however, there is a 16-second difference between the two functions.
Why is that?
When you disassemble one of your functions you get:
>>> dis.dis(small_then_large)
2 0 SETUP_LOOP 31 (to 34)
3 LOAD_GLOBAL 0 (smaller_list)
6 GET_ITER
>> 7 FOR_ITER 23 (to 33)
10 STORE_FAST 0 (sl)
3 13 SETUP_LOOP 14 (to 30)
16 LOAD_GLOBAL 1 (larger_list)
19 GET_ITER
>> 20 FOR_ITER 6 (to 29)
23 STORE_FAST 1 (ll)
4 26 JUMP_ABSOLUTE 20
>> 29 POP_BLOCK
>> 30 JUMP_ABSOLUTE 7
>> 33 POP_BLOCK
>> 34 LOAD_CONST 0 (None)
37 RETURN_VALUE
>>>
Looking at addresses 29 & 30, these instructions (POP_BLOCK and JUMP_ABSOLUTE) execute every time the inner loop finishes. The two loops otherwise look basically the same, but these two instructions run once per exit of the inner loop. Having the smaller list on the inside means the inner loop finishes once per element of the larger list, so they are executed more often, hence increasing the time (versus having the larger list on the inside).
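A quick back-of-the-envelope check of that reasoning (my own sketch using the list sizes from the question, not part of the original answer): the innermost work is identical in both orderings, but the inner loop is set up and torn down six times as often when the larger list is on the outside.
larger, smaller = 15000, 2500
# The inner loop's SETUP_LOOP/POP_BLOCK pair runs once per outer iteration,
# while the total number of innermost iterations is the same either way.
teardowns_larger_outer = larger            # 15000 inner-loop exits
teardowns_smaller_outer = smaller          # 2500 inner-loop exits
innermost_iterations = larger * smaller    # 37,500,000 in both orderings
print(teardowns_larger_outer, teardowns_smaller_outer, innermost_iterations)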
This same phenomenon was under discussion in this duplicate and got me interested in what goes on in the C land of CPython. Built python with:
% ./configure --enable-profiling
% make coverage
Tests
% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for sl in smaller_list:
for ll in larger_list:
pass"
% mv gmon.out soflgmon.out
% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for ll in larger_list:
for sl in smaller_list:
pass"
% mv gmon.out lofsgmon.out
Results
Short list of long lists (total time for a single run 1.60):
% gprof python soflgmon.out|head -n40
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
46.25 0.74 0.74 3346 0.00 0.00 PyEval_EvalFrameEx
25.62 1.15 0.41 37518735 0.00 0.00 insertdict
14.38 1.38 0.23 37555121 0.00 0.00 lookdict_unicode_nodummy
7.81 1.50 0.12 37506675 0.00 0.00 listiter_next
4.06 1.57 0.07 37516233 0.00 0.00 PyDict_SetItem
0.62 1.58 0.01 2095 0.00 0.00 _PyEval_EvalCodeWithName
0.62 1.59 0.01 3 0.00 0.00 untrack_dicts
0.31 1.59 0.01 _PyDict_SetItem_KnownHash
0.31 1.60 0.01 listiter_len
0.00 1.60 0.00 87268 0.00 0.00 visit_decref
0.00 1.60 0.00 73592 0.00 0.00 visit_reachable
0.00 1.60 0.00 71261 0.00 0.00 _PyThreadState_UncheckedGet
0.00 1.60 0.00 49742 0.00 0.00 _PyObject_Alloc
0.00 1.60 0.00 48922 0.00 0.00 PyObject_Malloc
0.00 1.60 0.00 48922 0.00 0.00 _PyObject_Malloc
0.00 1.60 0.00 47487 0.00 0.00 PyDict_GetItem
0.00 1.60 0.00 44246 0.00 0.00 _PyObject_Free
0.00 1.60 0.00 43637 0.00 0.00 PyObject_Free
0.00 1.60 0.00 30034 0.00 0.00 slotptr
0.00 1.60 0.00 24892 0.00 0.00 type_is_gc
0.00 1.60 0.00 24170 0.00 0.00 r_byte
0.00 1.60 0.00 23774 0.00 0.00 PyErr_Occurred
0.00 1.60 0.00 20371 0.00 0.00 _PyType_Lookup
0.00 1.60 0.00 19930 0.00 0.00 PyLong_FromLong
0.00 1.60 0.00 19758 0.00 0.00 r_string
0.00 1.60 0.00 19080 0.00 0.00 _PyLong_New
0.00 1.60 0.00 18887 0.00 0.00 lookdict_unicode
0.00 1.60 0.00 18878 0.00 0.00 long_dealloc
0.00 1.60 0.00 17639 0.00 0.00 PyUnicode_InternInPlace
0.00 1.60 0.00 17502 0.00 0.00 rangeiter_next
0.00 1.60 0.00 14776 0.00 0.00 PyObject_GC_UnTrack
0.00 1.60 0.00 14578 0.00 0.00 descr_traverse
0.00 1.60 0.00 13520 0.00 0.00 r_long
0.00 1.60 0.00 13058 0.00 0.00 PyUnicode_New
0.00 1.60 0.00 12298 0.00 0.00 _Py_CheckFunctionResult
...
Long list of short lists (total time for a single run 1.64):
gprof python lofsgmon.out|head -n40
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
48.78 0.80 0.80 3346 0.00 0.00 PyEval_EvalFrameEx
17.99 1.09 0.29 37531168 0.00 0.00 insertdict
11.59 1.28 0.19 37531675 0.00 0.00 listiter_next
11.28 1.47 0.18 37580156 0.00 0.00 lookdict_unicode_nodummy
6.71 1.58 0.11 37528666 0.00 0.00 PyDict_SetItem
1.22 1.60 0.02 _PyDict_SetItem_KnownHash
0.61 1.61 0.01 5525 0.00 0.00 update_one_slot
0.61 1.62 0.01 120 0.00 0.00 PyDict_Merge
0.30 1.62 0.01 18178 0.00 0.00 lookdict_unicode
0.30 1.63 0.01 11988 0.00 0.00 insertdict_clean
0.30 1.64 0.01 listiter_len
0.30 1.64 0.01 listiter_traverse
0.00 1.64 0.00 96089 0.00 0.00 _PyThreadState_UncheckedGet
0.00 1.64 0.00 87245 0.00 0.00 visit_decref
0.00 1.64 0.00 74743 0.00 0.00 visit_reachable
0.00 1.64 0.00 62232 0.00 0.00 _PyObject_Alloc
0.00 1.64 0.00 61412 0.00 0.00 PyObject_Malloc
0.00 1.64 0.00 61412 0.00 0.00 _PyObject_Malloc
0.00 1.64 0.00 59815 0.00 0.00 PyDict_GetItem
0.00 1.64 0.00 55231 0.00 0.00 _PyObject_Free
0.00 1.64 0.00 54622 0.00 0.00 PyObject_Free
0.00 1.64 0.00 36274 0.00 0.00 PyErr_Occurred
0.00 1.64 0.00 30034 0.00 0.00 slotptr
0.00 1.64 0.00 24929 0.00 0.00 type_is_gc
0.00 1.64 0.00 24617 0.00 0.00 _PyObject_GC_Alloc
0.00 1.64 0.00 24617 0.00 0.00 _PyObject_GC_Malloc
0.00 1.64 0.00 24170 0.00 0.00 r_byte
0.00 1.64 0.00 20958 0.00 0.00 PyObject_GC_Del
0.00 1.64 0.00 20371 0.00 0.00 _PyType_Lookup
0.00 1.64 0.00 19918 0.00 0.00 PyLong_FromLong
0.00 1.64 0.00 19758 0.00 0.00 r_string
0.00 1.64 0.00 19068 0.00 0.00 _PyLong_New
0.00 1.64 0.00 18845 0.00 0.00 long_dealloc
0.00 1.64 0.00 18507 0.00 0.00 _PyObject_GC_New
0.00 1.64 0.00 17639 0.00 0.00 PyUnicode_InternInPlace
...
The difference is marginal (2.4%), and profiling adds to run time, so it is difficult to say how much it actually would've been. The total time also includes the creation of the test lists, so that hides the true difference further.
The reason for the 16 s difference in the original test is that timeit.timeit runs the given statement or function number=1000000 times by default, so the roughly 0.04 s per-run difference measured above would add up to a whopping 40,000 s. Don't quote that value though, as it is an artifact of profiling. With your original test code and a non-profiling python3 on this machine I get:
Larger -> Smaller: 40.29234626500056
Smaller -> Larger: 33.09413992699956
which would mean a difference of
In [1]: (40.29234626500056-33.09413992699956)/1000000
Out[1]: 7.198206338001e-06
per single run (7.2µs), 18% in total.
So as stated in the former answer, POP_BLOCK gets executed more, but it's not just that, but the whole inner loop setup:
0.00 1.64 0.00 16521 0.00 0.00 PyFrame_BlockSetup
0.00 1.64 0.00 16154 0.00 0.00 PyFrame_BlockPop
Compared to the short list of long lists:
0.00 1.60 0.00 4021 0.00 0.00 PyFrame_BlockSetup
0.00 1.60 0.00 3748 0.00 0.00 set_next
0.00 1.60 0.00 3654 0.00 0.00 PyFrame_BlockPop
That has negligible impact though.
