Why is Python's built-in sum much slower than manual summation?

Here is my simple code example:
import time

t0 = time.time()
s = 0
for i in range(1000000):
    s += i
t1 = time.time()
print(s, t1 - t0)

t0 = time.time()
s = sum(i for i in range(1000000))
t1 = time.time()
print(s, t1 - t0)
On my computer (with Python 3.8) it prints:
499999500000 0.22901296615600586
499999500000 1.6930372714996338
So, doing += a million times is 7 times faster than calling sum? That is really unexpected. What is it doing?
Edit: I foolishly allowed a debugger to attach to the process and interfere with my measurements, which was in the end the cause of the slowness. With the debugger out, the measurements are no longer so unpredictable. As some of the answers clearly show, what I observed should not happen.

Let's use timeit for proper benchmarking, and, to make it easy to compare different Python versions, let's run it in Docker containers:
so62514160.py
N = 1000000

def m1():
    s = 0
    for i in range(N):
        s += i

def m2():
    s = sum(i for i in range(N))

def m3():
    s = sum(range(N))
so62514160bench.sh
for image in python:2.7 python:3.6 python:3.7 python:3.8; do
    for fun in m1 m2 m3; do
        echo -n "$image" "$fun "
        docker run --rm -it -v $(pwd):/app -w /app -e PYTHONDONTWRITEBYTECODE=1 "$image" python -m timeit -s 'import so62514160 as s' "s.$fun()"
    done
done
results on my machine:
python:2.7 m1 10 loops, best of 3: 43.5 msec per loop
python:2.7 m2 10 loops, best of 3: 39.6 msec per loop
python:2.7 m3 100 loops, best of 3: 17.1 msec per loop
python:3.6 m1 10 loops, best of 3: 41.9 msec per loop
python:3.6 m2 10 loops, best of 3: 46 msec per loop
python:3.6 m3 100 loops, best of 3: 17.7 msec per loop
python:3.7 m1 5 loops, best of 5: 45 msec per loop
python:3.7 m2 5 loops, best of 5: 40.7 msec per loop
python:3.7 m3 20 loops, best of 5: 17.3 msec per loop
python:3.8 m1 5 loops, best of 5: 48.2 msec per loop
python:3.8 m2 5 loops, best of 5: 44.6 msec per loop
python:3.8 m3 10 loops, best of 5: 19.2 msec per loop
(plot of the timings above omitted)

First of all, your observation probably does not generalize well to other systems: your way of measuring is quite unreliable, since it is susceptible to performance fluctuations that are dominated by how your OS reacts to the system load at the time of measurement.
You should use timeit or something similar.
For example, these are the timings I get on Python 3.6 in a virtual environment (Google Colab), and they seem quite reproducible across the other answers:
import numba as nb

def sum_loop(n):
    result = 0
    for x in range(n):
        result += x
    return result

sum_loop_nb = nb.jit(sum_loop)
sum_loop_nb.__name__ = 'sum_loop_nb'

def sum_analytical(n):
    return n * (n - 1) // 2

def sum_list(n):
    return sum([x for x in range(n)])

def sum_gen(n):
    return sum(x for x in range(n))

def sum_range(n):
    return sum(range(n))

sum_loop_nb(10)  # to trigger compilation

funcs = sum_analytical, sum_loop, sum_loop_nb, sum_gen, sum_list, sum_range
n = 1000000
for func in funcs:
    print(func.__name__, func(n))
    %timeit func(n)
# sum_analytical 499999500000
# 10000000 loops, best of 3: 222 ns per loop
# sum_loop 499999500000
# 10 loops, best of 3: 55.6 ms per loop
# sum_loop_nb 499999500000
# 10000000 loops, best of 3: 196 ns per loop
# sum_gen 499999500000
# 10 loops, best of 3: 51.7 ms per loop
# sum_list 499999500000
# 10 loops, best of 3: 68.4 ms per loop
# sum_range 499999500000
# 100 loops, best of 3: 17.8 ms per loop
It is unlikely that you will observe very different timings across different Python versions.
The sum_analytical() and sum_loop_nb() versions have been included just for fun and are not analyzed further.
sum_list() also behaves quite differently from the rest, as it creates a large, largely unnecessary list for the computation, and it is not analyzed further either.
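As a rough illustration of that overhead (a sketch; exact sizes vary by platform and CPython version):

import sys

# The comprehension materializes a million int references before sum()
# ever sees them; the list's own buffer alone is several megabytes.
print(sys.getsizeof([x for x in range(1000000)]))  # ~8-9 MB on 64-bit CPython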
The reason for these different timings is in the bytecode produced by the considered versions of the functions, of course. In particular, from sum_loop() through sum_range() one gets progressively simpler code:
import dis

funcs = sum_loop, sum_gen, sum_range
for func in funcs:
    print(func.__name__)
    print(dis.dis(func))
    print()
# sum_loop
# 2 0 LOAD_CONST 1 (0)
# 2 STORE_FAST 1 (result)
# 3 4 SETUP_LOOP 24 (to 30)
# 6 LOAD_GLOBAL 0 (range)
# 8 LOAD_FAST 0 (n)
# 10 CALL_FUNCTION 1
# 12 GET_ITER
# >> 14 FOR_ITER 12 (to 28)
# 16 STORE_FAST 2 (x)
# 4 18 LOAD_FAST 1 (result)
# 20 LOAD_FAST 2 (x)
# 22 INPLACE_ADD
# 24 STORE_FAST 1 (result)
# 26 JUMP_ABSOLUTE 14
# >> 28 POP_BLOCK
# 5 >> 30 LOAD_FAST 1 (result)
# 32 RETURN_VALUE
# None
# sum_gen
# 9 0 LOAD_GLOBAL 0 (sum)
# 2 LOAD_CONST 1 (<code object <genexpr> at 0x7f86d67c49c0, file "<ipython-input-4-9519b0039c88>", line 9>)
# 4 LOAD_CONST 2 ('sum_gen.<locals>.<genexpr>')
# 6 MAKE_FUNCTION 0
# 8 LOAD_GLOBAL 1 (range)
# 10 LOAD_FAST 0 (n)
# 12 CALL_FUNCTION 1
# 14 GET_ITER
# 16 CALL_FUNCTION 1
# 18 CALL_FUNCTION 1
# 20 RETURN_VALUE
# None
# sum_range
# 13 0 LOAD_GLOBAL 0 (sum)
# 2 LOAD_GLOBAL 1 (range)
# 4 LOAD_FAST 0 (n)
# 6 CALL_FUNCTION 1
# 8 CALL_FUNCTION 1
# 10 RETURN_VALUE
# None
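The practical consequence: with sum(range(n)) the per-element work happens entirely in C (sum() pulls items straight from the C-implemented range iterator), while both the explicit loop and the generator expression execute Python-level bytecode for every element; in the generator case, each item even requires resuming the generator frame. A minimal check of that gap (numbers are illustrative and machine-dependent):

import timeit

# per-element work stays in C:
print(timeit.timeit('sum(range(10**6))', number=10))
# per-element work resumes a Python-level generator frame each time:
print(timeit.timeit('sum(x for x in range(10**6))', number=10))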

Ah, I found an answer myself, but it brings up another question.
So, this is much faster:
t0 = time.time()
s = sum(range(1000000))
t1 = time.time()
print(s, t1 - t0)
The result is:
499999500000 0.05099987983703613
So, sum is faster than +=, as expected, but the generator expression (i for i in range(n)) is much slower than anything else.
I must say that this is also quite surprising.

I get different numbers
python3 main.py
499999500000 0.0832064151763916
499999500000 0.03934478759765625
And from this code
import time

to0 = []
to1 = []
for run in range(1000):  # renamed so it doesn't shadow the inner loop variable
    t0 = time.time()
    s = 0
    for i in range(1000000):
        s += i
    t1 = time.time()
    to0.append(t1 - t0)

    t0 = time.time()
    s = sum(i for i in range(1000000))
    t1 = time.time()
    to1.append(t1 - t0)

print(sum(to0) / len(to0))
print(sum(to1) / len(to1))
I get
0.07862246823310852
0.0318267240524292
Try updating your Python; all of this was run on Python 3.7.3.

Related

Compute column from multiple previous rows in dataframes with conditionals

I'm starting to believe that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy into here, but the principle is the same.
import pandas as pd

counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False, False,
                False, False, False, False, False, False, False, False, False,
                False, True]
I'm trying to create sums of the column counter, but only under the following conditions:
If the "Event" = True I want to sum the "Counter" values for the last 10 previous rows before the event happened.
EXCEPT if there is another Event within those 10 previous rows. In this case I only want to sum up the counter values between those two events (without exceeding 10 rows).
To clarify my goal, this is the result I had in mind (image omitted from the original post).
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter, summing up the counters as I go. However, it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my beliefs and show me that Excel isn't actually superior.
We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = (df1.Counter.groupby(df1.Event.iloc[::-1].cumsum())
         .rolling(n + 1, min_periods=1).sum()
         .reset_index(level=0, drop=True)
         .where(df1.Event))
df1['sum'] = (s - df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
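To see what the grouping key looks like (a quick peek using the df1 from the question): reversing Event before the cumsum labels each stretch of rows that ends at an event with its own integer, and since groupby aligns on the index, the reversed order itself is harmless:

# each block of rows ending at an event gets one label
print(df1.Event.iloc[::-1].cumsum().sort_index().values)
# [3 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1]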
Element-wise approach
You definitely can approach a task in pandas the way you would in Excel. Your approach needs to be tweaked a bit, though, because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
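A tiny illustration of that point (hypothetical values; note that shift returns a whole new Series, padded with NaN):

import pandas as pd

events = pd.Series([True, False, False, True])
print(events.shift(1))
# 0      NaN
# 1     True
# 2    False
# 3    False
# dtype: object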
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    for index in range(df.shape[0]):
        counter = 1
        summ = 0
        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break
                # increment by the counter
                summ += df.loc[index - backup, "Counter"]
            df.loc[index, "Sum"] = summ
    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy()) # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
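As a rough illustration of that C-versus-Python gap (a sketch; timings vary by machine), compare the built-in cumsum with an equivalent pure-Python loop:

import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100000))

def py_cumsum(values):
    # same running total as Series.cumsum, but interpreted per element
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

print(timeit.timeit(lambda: s.cumsum(), number=10))           # C loop
print(timeit.timeit(lambda: py_cumsum(s.values), number=10))  # Python loop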
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10
    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())
        # compute the rolling sum within each group
        .rolling(n + 1, min_periods=1).sum()
        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)
        # drop all non-event observations
        .where(df.Event)
    )
    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)
    return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1.copy()) # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0
        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if event[index - backup]:
                    break
                # increment by the counter
                summ = summ + counter[index - backup]
            sum_col[index] = summ
    return sum_col
def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Why are addition and multiplication faster than comparisons?

I always thought comparisons were the fastest operation a computer could execute. I remember hearing it in a presentation by D. Knuth, where he'd write loops in descending order "because comparison against 0 is fast". I also read here that multiplications should be slower than additions.
I'm surprised to see that, in both Python 2 and 3, testing under both Linux and Mac, comparisons seem to be much slower than arithmetic operations.
Could anyone explain why?
%timeit 2 > 0
10000000 loops, best of 3: 41.5 ns per loop
%timeit 2 * 2
10000000 loops, best of 3: 27 ns per loop
%timeit 2 * 0
10000000 loops, best of 3: 27.7 ns per loop
%timeit True != False
10000000 loops, best of 3: 75 ns per loop
%timeit True and False
10000000 loops, best of 3: 58.8 ns per loop
And under Python 3:
$ ipython3
Python 3.5.2 | packaged by conda-forge | (default, Sep 8 2016, 14:36:38)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: %timeit 2 + 2
10000000 loops, best of 3: 22.9 ns per loop
In [2]: %timeit 2 * 2
10000000 loops, best of 3: 23.7 ns per loop
In [3]: %timeit 2 > 2
10000000 loops, best of 3: 45.5 ns per loop
In [4]: %timeit True and False
10000000 loops, best of 3: 62.8 ns per loop
In [5]: %timeit True != False
10000000 loops, best of 3: 92.9 ns per loop
This is happening due to constant folding in the peephole optimizer within the Python compiler.
Using the dis module, if we break each of the statements apart to look at how they are translated at the bytecode level, you will observe that operators like inequality and equality are first loaded into memory and then evaluated, whereas expressions like multiplication and addition are computed at compile time and loaded as constants.
Overall, this leads to fewer execution steps, making the code faster:
>>> import dis
>>> def m1(): True != False
>>> dis.dis(m1)
1 0 LOAD_GLOBAL 0 (True)
3 LOAD_GLOBAL 1 (False)
6 COMPARE_OP 3 (!=)
9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
>>> def m2(): 2 *2
>>> dis.dis(m2)
1 0 LOAD_CONST 2 (4)
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
>>> def m3(): 2*5
>>> dis.dis(m3)
1 0 LOAD_CONST 3 (10)
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
>>> def m4(): 2 > 0
>>> dis.dis(m4)
1 0 LOAD_CONST 1 (2)
3 LOAD_CONST 2 (0)
6 COMPARE_OP 4 (>)
9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
>>> def m5(): True and False
>>> dis.dis(m5)
1 0 LOAD_GLOBAL 0 (True)
3 JUMP_IF_FALSE_OR_POP 9
6 LOAD_GLOBAL 1 (False)
>> 9 POP_TOP
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
As others have explained, this is because Python's peephole optimiser optimises arithmetic operations but not comparisons.
Having written my own peephole optimiser for a Basic compiler, I can assure you that optimising constant comparisons is just as easy as optimising constant arithmetic operations. So there is no technical reason why Python should do the latter but not the former.
However, each such optimisation has to be separately programmed, and comes with two costs: the time to program it, and the extra optimising code taking up space in the Python executable. So you find yourself having to do some triage: which of these optimisations is common enough to make it worth the costs?
It seems that the Python implementers, reasonably enough, decided to optimise the arithmetic operations first. Perhaps they will get round to comparisons in a future release.
A quick disassembly reveals that the comparison involves more operations. According to this answer, there is some precalculation done by the "peephole optimiser" (wiki) for multiplication, addition, etc., but not for the comparison operators:
>>> import dis
>>> def a():
... return 2*3
...
>>> dis.dis(a)
2 0 LOAD_CONST 3 (6)
3 RETURN_VALUE
>>> def b():
... return 2 < 3
...
>>> dis.dis(b)
2 0 LOAD_CONST 1 (2)
3 LOAD_CONST 2 (3)
6 COMPARE_OP 0 (<)
9 RETURN_VALUE
As others have commented, it is due to the peephole optimizer, which pre-computes the result of 2*3 (6), as the dis output shows:
0 LOAD_CONST 3 (6)
But try this - using variables prevents the optimizer from pre-computing the results:
>>> def a(a, b):
... return a*b
...
>>> dis.dis(a)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 1 (b)
6 BINARY_MULTIPLY
7 RETURN_VALUE
>>> def c(a,b):
... return a<b
...
>>> dis.dis(c)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 1 (b)
6 COMPARE_OP 0 (<)
9 RETURN_VALUE
>>>
If you time these functions, the comparison will be faster.
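A quick sketch to check that claim (absolute numbers will vary by machine and version):

import timeit

# with variables nothing can be folded at compile time, so this times
# the actual BINARY_MULTIPLY vs COMPARE_OP work:
print(timeit.timeit("a * b", setup="a, b = 2, 3"))
print(timeit.timeit("a < b", setup="a, b = 2, 3"))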
For the Python case, the above answers are correct. For machine code, things are a bit more complicated. I assume we are talking about integer operations here; with floats and complex objects, none of the below applies. We also assume that the values you are comparing are already loaded into registers. If they are not, fetching them from wherever they live can take hundreds of times longer than the actual operations.
Modern CPUs have several ways to compare two numbers. Very popular ones are XOR a,b if you just want to see whether two values are equal, and CMP a,b if you want to know the relationship between them (less, greater, equal, etc.). CMP is just a subtraction with the result thrown away, because we are only interested in the post-operation flags.
Both of these operations are of depth 1, so they can be executed in a single CPU cycle. This is as fast as you can go. Multiplication is a form of repeated addition, so the depth of the operation is usually equal to the size of your register. There are some optimizations that can reduce the depth, but generally multiplication is one of the slower operations a CPU can perform.
However, multiplying by 0, 1, or any power of 2 can be reduced to a shift operation, which is also a depth-one operation, so it takes the same time as comparing two numbers. Think of the decimal system: you can multiply any number by 10, 100, or 1000 by appending zeros at the end. Any optimizing compiler will recognize this type of multiplication and use the most efficient operation for it. Modern CPUs are also pretty advanced, so they can perform the same optimization in hardware by counting how many bits are set in either operand; if only one bit is set, the operation is reduced to a shift.
So in your case, multiplying by 2 is as fast as comparing two numbers. As people above pointed out, any optimizing compiler will see that you are multiplying two constants and will just replace the function with one returning a constant.
Wow, the answer by @mu 無 blew my mind! However, it is important not to generalize when deriving your conclusions: you are checking the times for CONSTANTS, not variables. For variables, multiplication seems to be slower than comparison.
Here is the more interesting case, in which the numbers to be compared are stored in actual variables...
import timeit

def go():
    number = 1000000000
    print
    print 'a>b, internal:', timeit.timeit(setup="a=1;b=1", stmt="a>b", number=number)
    print 'a*b, internal:', timeit.timeit(setup="a=1;b=1", stmt="a*b", number=number)
    print 'a>b, shell :',
    %%timeit -n 1000000000 "a=1;b=1" "a>b"
    print 'a*b, shell :',
    %%timeit -n 1000000000 "a=1;b=1" "a*b"

go()
The result gives:
a>b, internal: 51.9467676445
a*b, internal: 63.870462403
a>b, shell :1000000000 loops, best of 3: 19.8 ns per loop
a*b, shell :1000000000 loops, best of 3: 19.9 ns per loop
And order is restored in the universe ;)
For completeness, let's see some more cases... What if we have one variable and one constant?
import timeit

def go():
    print 'a>2, shell :',
    %%timeit -n 10000000 "a=42" "a>2"
    print 'a*2, shell :',
    %%timeit -n 10000000 "a=42" "a*2"

go()
a>2, shell :10000000 loops, best of 3: 18.3 ns per loop
a*2, shell :10000000 loops, best of 3: 19.3 ns per loop
What happens with bools?
import timeit

def go():
    print
    number = 1000000000
    print 'a==b : ', timeit.timeit(setup="a=True;b=False", stmt="a==b", number=number)
    print 'a and b : ', timeit.timeit(setup="a=True;b=False", stmt="a and b", number=number)
    print 'boolean ==, shell :',
    %%timeit -n 1000000000 "a=True;b=False" "a == b"
    print 'boolean and, shell :',
    %%timeit -n 1000000000 "a=False;b=False" "a and b"

go()
a==b : 70.8013108982
a and b : 38.0614485665
boolean ==, shell :1000000000 loops, best of 3: 17.7 ns per loop
boolean and, shell :1000000000 loops, best of 3: 16.4 ns per loop
:D Now this is interesting: it seems boolean and is faster than ==. However, all this should be fine, and Donald Knuth would not lose sleep over it; still, the best way to compare would be to use and...
In practice, we should also check numpy, where the difference may be even more relevant...
import timeit

def go():
    number = 1000000  # change if you are in a hurry / want to be more certain...

    print '==== int ===='
    print 'a>b : ', timeit.timeit(setup="a=1;b=2", stmt="a>b", number=number*100)
    print 'a*b : ', timeit.timeit(setup="a=1;b=2", stmt="a*b", number=number*100)
    setup = "import numpy as np;a=np.arange(0,100);b=np.arange(100,0,-1);"
    print 'np: a>b : ', timeit.timeit(setup=setup, stmt="a>b", number=number)
    print 'np: a*b : ', timeit.timeit(setup=setup, stmt="a*b", number=number)

    print '==== float ===='
    print 'float a>b : ', timeit.timeit(setup="a=1.1;b=2.3", stmt="a>b", number=number*100)
    print 'float a*b : ', timeit.timeit(setup="a=1.1;b=2.3", stmt="a*b", number=number*100)
    setup = "import numpy as np;a=np.arange(0,100,dtype=float);b=np.arange(100,0,-1,dtype=float);"
    print 'np float a>b : ', timeit.timeit(setup=setup, stmt="a>b", number=number)
    print 'np float a*b : ', timeit.timeit(setup=setup, stmt="a*b", number=number)

    print '==== bool ===='
    print 'a==b : ', timeit.timeit(setup="a=True;b=False", stmt="a==b", number=number*1000)
    print 'a and b : ', timeit.timeit(setup="a=True;b=False", stmt="a and b", number=number*1000)
    setup = "import numpy as np;a=np.arange(0,100)>50;b=np.arange(100,0,-1)>50;"
    print 'np a == b : ', timeit.timeit(setup=setup, stmt="a == b", number=number)
    print 'np a and b : ', timeit.timeit(setup=setup, stmt="np.logical_and(a,b)", number=number)
    print 'np a == True : ', timeit.timeit(setup=setup, stmt="a == True", number=number)
    print 'np a and True : ', timeit.timeit(setup=setup, stmt="np.logical_and(a,True)", number=number)

go()
==== int ====
a>b : 4.5121130192
a*b : 5.62955748632
np: a>b : 0.763992986986
np: a*b : 0.723006032235
==== float ====
float a>b : 6.39567713272
float a*b : 5.62149055215
np float a>b : 0.697037433398
np float a*b : 0.847941712765
==== bool ====
a==b : 6.91458288689
a and b : 3.6289697892
np a == b : 0.789666454087
np a and b : 0.724517620007
np a == True : 1.55066706189
np a and True : 1.44293071804
Again, same behavior...
So I guess one can benefit from using and instead of == in general,
at least in Python 2 (Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]), where I tried all these...

difference between if len(x) and if x

I have the following list:
x = [1,2,3]

if x:
    dosomething()

if len(x) > 0:
    dosomething()

In the above example, which if statement will work faster?
There is (judging from the result) no difference if x is a list, but the first one is a bit faster:
%%timeit
if x:
    pass
10000000 loops, best of 3: 95 ns per loop
than the second one:
%%timeit
if len(x) > 0:
    pass
1000000 loops, best of 3: 276 ns per loop
You should use the first one, if x, in almost all cases. Only if you need to distinguish between None, False and an empty list (or something similar) might you need something else.
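For instance, if None and an empty list must be treated differently, test for None explicitly; a minimal sketch:

def describe(x):
    if x is None:   # explicit identity test
        return "no value"
    if not x:       # empty list, empty string, 0, ...
        return "empty"
    return "has items"

print(describe(None), describe([]), describe([1]))  # no value empty has items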
The first statement works faster, as it doesn't need to call a function, whereas in the second statement a function, in this case len(x), must be executed before the check can run.
Internally,
if x:
will be getting the size of the list object and will check if it is a non-zero value.
In this case,
if len(x) > 0:
you are doing it explicitly.
Moreover, PEP-0008 suggests the first form,
For sequences, (strings, lists, tuples), use the fact that empty sequences are false.
Yes: if not seq:
     if seq:
No: if len(seq):
    if not len(seq):
Expanding on the answer by @thefourtheye, here's a demo/proof that __len__ is called when you check the truth-value of a list:
>>> class mylist(list):
... def __len__(self):
... print('__len__ called')
... return super(mylist, self).__len__()
...
>>> a = mylist([1, 2, 3])
>>> if a:
... print('doing something')
...
__len__ called
doing something
>>>
>>> if len(a) > 0:
... print('doing something')
...
__len__ called
doing something
>>>
>>> bool(a)
__len__ called
True
And here's a quick timing:
In [3]: a = [1,2,3]
In [4]: timeit if a: pass
10000000 loops, best of 3: 28.2 ns per loop
In [5]: timeit if len(a) > 0: pass
10000000 loops, best of 3: 62.2 ns per loop
So the implicit check is slightly faster (probably because there's no overhead from the global len function) and, as already mentioned, it is the form suggested by PEP-0008.
If you look at the dis.dis() for each method you'll see that the second has to perform almost twice as many steps as the first one.
In [1]: import dis
In [2]: def f(x):
....: if x: pass
....:
In [3]: def g(x):
....: if len(x) > 0: pass
....:
In [4]: dis.dis(f)
2 0 LOAD_FAST 0 (x)
3 POP_JUMP_IF_FALSE 9
6 JUMP_FORWARD 0 (to 9)
>> 9 LOAD_CONST 0 (None)
12 RETURN_VALUE
In [5]: dis.dis(g)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (x)
6 CALL_FUNCTION 1
9 LOAD_CONST 1 (0)
12 COMPARE_OP 4 (>)
15 POP_JUMP_IF_FALSE 21
18 JUMP_FORWARD 0 (to 21)
>> 21 LOAD_CONST 0 (None)
24 RETURN_VALUE
Both of them need to do LOAD_FAST, POP_JUMP_IF_FALSE, JUMP_FORWARD, LOAD_CONST, and RETURN_VALUE. But the second method needs to additionally do
LOAD_GLOBAL, CALL_FUNCTION, LOAD_CONST, and COMPARE_OP. Therefore, the first method will be faster.
In reality, however, the difference in time between the two methods will be so minuscule that unless these if statements run millions of times in your code it will not noticeably impact the performance of your program. This sounds like an example of premature optimization to me.

Python: multiple assignment vs. individual assignment speed

I've been looking to squeeze a little more performance out of my code; recently, while browsing this Python wiki page, I found this claim:
Multiple assignment is slower than individual assignment. For example "x,y=a,b" is slower than "x=a; y=b".
Curious, I tested it (on Python 2.7):
$ python -m timeit "x, y = 1.2, -1.4"
10000000 loops, best of 3: 0.0365 usec per loop
$ python -m timeit "x = 1.2" "y = -1.4"
10000000 loops, best of 3: 0.0542 usec per loop
I repeated several times, in different orders, etc., but the multiple assignment snippet consistently performed at least 30% better than the individual assignment. Obviously the parts of my code involving variable assignment aren't going to be the source of any significant bottlenecks, but my curiosity is piqued nonetheless. Why is multiple assignment apparently faster than individual assignment, when the documentation suggests otherwise?
EDIT:
I tested assignment to more than two variables and got the following results (plot omitted):
The trend seems more or less consistent; can anyone reproduce it?
(CPU: Intel Core i7 @ 2.20GHz)
Interestingly, it may depend on the CPU to some extent. These are both 64-bit Linux machines (same Python build).
Result for Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
$ python -V
Python 2.7.5+
$ python -m timeit "x, y = 1.2, -1.4"
10000000 loops, best of 3: 0.0554 usec per loop
$ python -m timeit "x = 1.2" "y = -1.4"
10000000 loops, best of 3: 0.0349 usec per loop
Result for Intel(R) Pentium(R) CPU G850 @ 2.90GHz
$ python -V
Python 2.7.5+
$ python -m timeit "x, y = 1.2, -1.4"
10000000 loops, best of 3: 0.0245 usec per loop
$ python -m timeit "x = 1.2" "y = -1.4"
10000000 loops, best of 3: 0.0394 usec per loop
Better to look at Python's dis module, which disassembles bytecode. Here is the test for two-variable assignment:
import dis

def single_assignment():
    x = 1
    y = 2

def multiple_assignment():
    x, y = 1, 2

print dis.dis(single_assignment)
print dis.dis(multiple_assignment)
Bytecode:
4 0 LOAD_CONST 1 (1)
3 STORE_FAST 0 (x)
5 6 LOAD_CONST 2 (2)
9 STORE_FAST 1 (y)
12 LOAD_CONST 0 (None)
15 RETURN_VALUE
None
8 0 LOAD_CONST 3 ((1, 2))
3 UNPACK_SEQUENCE 2
6 STORE_FAST 0 (x)
9 STORE_FAST 1 (y)
12 LOAD_CONST 0 (None)
15 RETURN_VALUE
None
It looks like the number of bytecode instructions required is the same for two variables. With 3 or more variables in an assignment, the number of instructions is smaller.
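A quick way to verify the three-variable case yourself (a sketch, assuming CPython; exact opcodes differ between versions):

import dis

def single3():
    x = 1
    y = 2
    z = 3

def multiple3():
    x, y, z = 1, 2, 3

# single3 needs a LOAD_CONST/STORE_FAST pair per variable, while
# multiple3 loads one tuple, then UNPACK_SEQUENCE plus three STORE_FASTs.
dis.dis(single3)
dis.dis(multiple3)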

Python: why are * and ** faster than / and sqrt()?

While optimising my code I realised the following:
>>> from timeit import Timer as T
>>> T(lambda : 1234567890 / 4.0).repeat()
[0.22256922721862793, 0.20560789108276367, 0.20530295372009277]
>>> from __future__ import division
>>> T(lambda : 1234567890 / 4).repeat()
[0.14969301223754883, 0.14155197143554688, 0.14141488075256348]
>>> T(lambda : 1234567890 * 0.25).repeat()
[0.13619112968444824, 0.1281130313873291, 0.12830305099487305]
and also:
>>> from math import sqrt
>>> T(lambda : sqrt(1234567890)).repeat()
[0.2597470283508301, 0.2498021125793457, 0.24994492530822754]
>>> T(lambda : 1234567890 ** 0.5).repeat()
[0.15409398078918457, 0.14059877395629883, 0.14049601554870605]
I assume it has to do with the way Python is implemented in C, but I wonder if anybody would care to explain why this is so?
The (somewhat unexpected) reason for your results is that Python seems to fold constant expressions involving floating-point multiplication and exponentiation, but not division. math.sqrt() is a different beast altogether since there's no bytecode for it and it involves a function call.
On Python 2.6.5, the following code:
x1 = 1234567890.0 / 4.0
x2 = 1234567890.0 * 0.25
x3 = 1234567890.0 ** 0.5
x4 = math.sqrt(1234567890.0)
compiles to the following bytecodes:
# x1 = 1234567890.0 / 4.0
4 0 LOAD_CONST 1 (1234567890.0)
3 LOAD_CONST 2 (4.0)
6 BINARY_DIVIDE
7 STORE_FAST 0 (x1)
# x2 = 1234567890.0 * 0.25
5 10 LOAD_CONST 5 (308641972.5)
13 STORE_FAST 1 (x2)
# x3 = 1234567890.0 ** 0.5
6 16 LOAD_CONST 6 (35136.418286444619)
19 STORE_FAST 2 (x3)
# x4 = math.sqrt(1234567890.0)
7 22 LOAD_GLOBAL 0 (math)
25 LOAD_ATTR 1 (sqrt)
28 LOAD_CONST 1 (1234567890.0)
31 CALL_FUNCTION 1
34 STORE_FAST 3 (x4)
As you can see, multiplication and exponentiation take no time at all since they're done when the code is compiled. Division takes longer since it happens at runtime. Square root is not only the most computationally expensive operation of the four, it also incurs various overheads that the others do not (attribute lookup, function call etc).
If you eliminate the effect of constant folding, there's little to separate multiplication and division:
In [16]: x = 1234567890.0
In [17]: %timeit x / 4.0
10000000 loops, best of 3: 87.8 ns per loop
In [18]: %timeit x * 0.25
10000000 loops, best of 3: 91.6 ns per loop
math.sqrt(x) is actually a little bit faster than x ** 0.5, presumably because it's a special case of the latter and can therefore be done more efficiently, in spite of the overheads:
In [19]: %timeit x ** 0.5
1000000 loops, best of 3: 211 ns per loop
In [20]: %timeit math.sqrt(x)
10000000 loops, best of 3: 181 ns per loop
edit 2011-11-16: Constant expression folding is done by Python's peephole optimizer. The source code (peephole.c) contains the following comment that explains why constant division isn't folded:
case BINARY_DIVIDE:
    /* Cannot fold this operation statically since
       the result can depend on the run-time presence
       of the -Qnew flag */
    return 0;
The -Qnew flag enables "true division" defined in PEP 238.
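For what it's worth, old-style division (and with it the -Qnew flag) is gone in Python 3, so a modern CPython is free to fold constant division as well; a quick check (behavior assumed from recent CPython versions):

import dis

# on CPython 3.x this typically shows a single LOAD_CONST 2.5,
# i.e. the division was folded at compile time
dis.dis(compile("10 / 4", "<test>", "eval"))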
