I have this simple code that helped me to measure how classes with __slots__ perform (taken from here):
import timeit

def test_slots():
    class Obj(object):
        __slots__ = ('i', 'l')
        def __init__(self, i):
            self.i = i
            self.l = []
    for i in xrange(1000):
        Obj(i)

print timeit.Timer('test_slots()', 'from __main__ import test_slots').timeit(10000)
If I run it via python2.7, I get something around 6 seconds. OK, it's really faster (and also more memory-efficient) than without slots.
But if I run the code under PyPy (2.2.1, 64-bit for Mac OS X), it starts to use 100% CPU and "never" returns (I waited for minutes with no result).
What is going on? Should I use __slots__ under PyPy?
Here's what happens if I pass different numbers to timeit():
timeit(10) - 0.067s
timeit(100) - 0.5s
timeit(1000) - 19.5s
timeit(10000) - ? (probably more than a Game of Thrones episode)
Thanks in advance.
Note that the same behavior is observed if I use namedtuples:
import collections
import timeit

def test_namedtuples():
    Obj = collections.namedtuple('Obj', 'i l')
    for i in xrange(1000):
        Obj(i, [])

print timeit.Timer('test_namedtuples()', 'from __main__ import test_namedtuples').timeit(10000)
In each of the 10,000 or so iterations of the timeit code, the class is recreated from scratch. Creating classes is probably not a well-optimized operation in PyPy; even worse, doing so will probably discard all of the optimizations that the JIT learned about the previous incarnation of the class. PyPy tends to be slow until the JIT has warmed up, so doing things that require it to warm up repeatedly will kill your performance.
The solution here is, of course, to simply move the class definition outside of the code being benchmarked.
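For illustration, here is a minimal sketch of the restructured benchmark (the same code as above, with the class hoisted to module level so it is created only once and the JIT only ever sees one class):

import timeit

class Obj(object):
    __slots__ = ('i', 'l')
    def __init__(self, i):
        self.i = i
        self.l = []

def test_slots():
    # The class is no longer recreated on every call;
    # only instantiation is being timed now.
    for i in xrange(1000):
        Obj(i)

print timeit.Timer('test_slots()', 'from __main__ import test_slots').timeit(10000)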
To directly answer the question in the title: __slots__ is pointless for (but doesn't hurt) performance in PyPy.
I'm slicing a quite big pandas series (~5M) using .loc, and I stumbled upon some weird behavior when checking times in an attempt to optimize my code.
It's weird that the first slicing attempt, like series_object.loc[some_indexes], takes 100x longer than the following ones.
When I try timeit it does not reflect this behaviour, but when checking the partial laps using time, we can see that the first lap takes much longer than the following ones.
Is .loc using some sort of caching? If so, why doesn't garbage collection influence this?
Is timeit doing the caching even with the garbage collector disabled, and thus not behaving as it's supposed to?
Which timing should I trust as what my app will take when running in a live production environment?
I tried this on Windows and Linux machines using different versions of Python (3.6, 3.7 and 2.7), and the behavior is always the same.
Thanks in advance for your help. This thing has been banging my head for a week already, and I miss not doubting %timeit :)
To reproduce, save the following code to a Python file, e.g. test_loc_times.py:
import pandas as pd
import numpy as np
import timeit
import time, gc

def get_data():
    ids = np.arange(size_bigseries)
    big_series = pd.Series(index=ids, data=np.random.rand(len(ids)), name='{} elements series'.format(len(ids)))
    small_slice = np.arange(size_slice)
    return big_series, small_slice

# Method to test: a simple pandas slicing with .loc
def basic_loc_indexing(pd_series, slice_ids):
    return pd_series.loc[slice_ids].dropna()

# Method to time it
def timing_it(func, n, *args):
    gcold = gc.isenabled()
    gc.disable()
    times = []
    for i in range(n):
        s = time.time()
        func(*args)
        times.append((time.time() - s) * 1000)
    if gcold:
        gc.enable()
    return times

if __name__ == '__main__':
    import sys
    n_tries = int(sys.argv[1]) if len(sys.argv) > 1 and sys.argv[1] is not None else 1000
    size_bigseries = int(sys.argv[2]) if len(sys.argv) > 2 and sys.argv[2] is not None else 5000000  # 5M
    size_slice = int(sys.argv[3]) if len(sys.argv) > 3 and sys.argv[3] is not None else 100  # 100

    # 1: timeit()
    big_series, small_slice = get_data()
    time_with_timeit = timeit.timeit('basic_loc_indexing(big_series, small_slice)', "gc.disable(); from __main__ import basic_loc_indexing, big_series, small_slice", number=n_tries)
    print("using timeit: {:.6f}ms".format(time_with_timeit / n_tries * 1000))
    del big_series, small_slice

    # 2: time()
    big_series, small_slice = get_data()
    time_with_time = timing_it(basic_loc_indexing, n_tries, big_series, small_slice)
    print("using time: {:.6f}ms".format(np.mean(time_with_time)))
    print('head detail: {}\n'.format(time_with_time[:5]))
To try it out, run:

python test_loc_times.py 1000 5000000 100
This will run timeit and time 1000 laps on slicing 100 elements from a 5M pandas.Series.
You can try it yourself with other values; the first run always takes longer.
stdout:
>>> using timeit: 0.789754ms
>>> using time: 0.829869ms
>>> head detail: [145.02716064453125, 0.7691383361816406, 0.7028579711914062, 0.5738735198974609, 0.6380081176757812]
Weird right?
Edit: I found this answer, which might be related. What do you think?
This code is likely not idempotent (has side effects that impact its execution).
timeit will run the code once first to measure the time and deduce the number of loops and runs it should use. If your code is not idempotent (has side effects, like caching), then that first run (not recorded) will be longer, and the subsequent (faster) runs will be measured and reported.
You can take a look at the arguments you can pass to timeit (see the doc) to specify the number of loops and forgo that initial run.
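For example, with the plain timeit module, a sketch along these lines (reusing the names from the script above) surfaces the first-run cost instead of averaging it away, since repeat() with number=1 reports each lap separately:

import timeit

laps = timeit.repeat(
    'basic_loc_indexing(big_series, small_slice)',
    setup='from __main__ import basic_loc_indexing, big_series, small_slice',
    repeat=5,
    number=1,
)
# The first lap carries any one-time cost; the rest show steady-state timing.
print(laps)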
Also note that (taken from the doc linked above):
The times reported by %timeit will be slightly higher than those reported by the timeit.py script when variables are accessed. This is due to the fact that %timeit executes the statement in the namespace of the shell, compared with timeit.py, which uses a single setup statement to import function or create variables. Generally, the bias does not matter as long as results from timeit.py are not mixed with those from %timeit.
Edit: Missed the fact that you were passing the number of runs to timeit. In that case, only the latter part of my answer applies, but the numbers you are seeing seem to point to another issue...
I see that from x import * is discouraged all over the place: it pollutes the namespace, and so on.
So I'm inclined to use from . import x, and when I need to use the functions, I'll call x.func() instead of just using func().
The speed difference is probably very small, but I still want to know how much it might impact performance, so that I can keep the good habit without needing to worry about other things.
It has practically no impact:
>>> import timeit
>>> timeit.timeit('math.pow(1, 1)', 'import math')
0.20310196322982677
>>> timeit.timeit('pow(1, 1)', 'from math import pow')
0.19039931574786806
Note I picked a function that would have very little run time so that any difference would be magnified.
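If the lookup ever did matter in a hot path, the usual idiom is to bind the function to a plain name once; a quick sketch (timing output omitted):

import timeit

# Binding math.pow to a bare name in setup means each call in the timed
# statement skips the module attribute lookup.
print(timeit.timeit('f(1, 1)', 'import math; f = math.pow'))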
I am used to code in C/C++ and when I see the following array operation, I feel some CPU wasting:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'
Therefore I wonder:
Will this line be executed (interpreted) as the creation of a temporary array (memory allocation), followed by concatenation of the first three cells (again memory allocation)?
Or is the python interpreter smart enough?
(I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other python interpreters/compilers...)
Is there a trick to write a replacement line more CPU efficient and still understandable/elegant?
(You can provide specific Python2 and/or Python3 tricks and tips)
I have no idea of the CPU usage for this purpose, but isn't that why we use high-level languages in some way?
Another solution would be using regular expressions; using a compiled pattern should allow background optimisations:
import re

version = '1.2.3.4.5-RC4'
pat = re.compile(r'^(\d+\.\d+\.\d+)')
res = pat.match(version)
if res:
    print res.group(1)
Edit: As suggested by @jonrsharpe, I also ran the timeit benchmark. Here are my results:
def extract_vers(str):
    res = pat.match(str)
    if res:
        return res.group(1)
    else:
        return False
>>> timeit.timeit("api1(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.9013631343841553
>>> timeit.timeit("api2(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.3482811450958252
>>> timeit.timeit("extract_vers(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.174590826034546
Edit: In any case, some libraries exist in Python to do this job, such as distutils.version. You should have a look at that answer.
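For instance, a minimal sketch with distutils.version (assuming LooseVersion parses this string into a component list beginning with the numeric parts):

from distutils.version import LooseVersion

# LooseVersion splits the version string into components such as
# [1, 2, 3, 4, 5, ...]; rejoining the first three gives the '1.2.3' prefix.
v = LooseVersion('1.2.3.4.5-RC4')
print '.'.join(str(c) for c in v.version[:3])  # 1.2.3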
To answer your first question: no, this will not be optimised out by the interpreter. Python will create a list from the string, then create a second list for the slice, then put the list items back together into a new string.
To cover the second, you can optimise this slightly by limiting the split with the optional maxsplit argument:
>>> v = '1.2.3.4.5-RC4'
>>> v.split(".", 3)
['1', '2', '3', '4.5-RC4']
Once the third '.' is found, Python stops searching through the string. You can also neaten slightly by removing the default 0 argument to the slice:
api = '.'.join(version.split('.', 3)[:3])
Note, however, that any difference in performance is negligible:
>>> import timeit
>>> def test1(version):
...     return '.'.join(version.split('.')[0:3])
>>> def test2(version):
...     return '.'.join(version.split('.', 3)[:3])
>>> timeit.timeit("test1(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0458565345561743
>>> timeit.timeit("test2(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0842980287537776
The benefit of maxsplit becomes clearer with longer strings containing more irrelevant '.'s:
>>> timeit.timeit("s.split('.')", setup="s='1.'*100")
3.460900054011617
>>> timeit.timeit("s.split('.', 3)", setup="s='1.'*100")
0.5287887450379003
I am used to code in C/C++ and when I see the following array operation, I feel some CPU wasting:
A feeling of wasted CPU is absolutely normal for C/C++ programmers facing Python code. Your code:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join(version.split('.')[0:3]) # extract '1.2.3'
is absolutely fine in Python; there is no simplification possible. Only if you have to do it thousands of times should you consider using a library function or writing your own.
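For the "write your own" route, a hedged sketch (the helper name and edge-case behaviour are my own choices) that avoids the intermediate lists by scanning for the third '.' with str.find:

def api_prefix(version, parts=3):
    # Locate the `parts`-th '.' without building any temporary lists.
    pos = -1
    for _ in range(parts):
        pos = version.find('.', pos + 1)
        if pos == -1:
            return version  # fewer than `parts` components: return as-is
    return version[:pos]

print api_prefix('1.2.3.4.5-RC4')  # 1.2.3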
In the process of hunting down performance bugs I finally identified that the source of the problem is the contextlib wrapper. The overhead is quite staggering; I did not expect that to be the source of the slowdown. The slowdown is in the range of 50X, and I cannot afford to have that in a loop. I sure would have appreciated a warning in the docs if it has the potential of slowing things down so significantly.
It seems this has been known since 2010 https://gist.github.com/bdarnell/736778
It has a set of benchmarks you can try. Please change fn to fn() in simple_catch() before running. Thanks, DSM, for pointing this out.
I am surprised that the situation has not improved since those times. What can I do about it? I can drop down to try/except, but I hope there are other ways to deal with it.
Here are some new timings:
import contextlib
import timeit

def work_pass():
    pass

def work_fail():
    1/0

def simple_catch(fn):
    try:
        fn()
    except Exception:
        pass

@contextlib.contextmanager
def catch_context():
    try:
        yield
    except Exception:
        pass

def with_catch(fn):
    with catch_context():
        fn()

class ManualCatchContext(object):
    def __enter__(self):
        pass
    def __exit__(self, exc_type, exc_val, exc_tb):
        return True

def manual_with_catch(fn):
    with ManualCatchContext():
        fn()

preinstantiated_manual_catch_context = ManualCatchContext()
def manual_with_catch_cache(fn):
    with preinstantiated_manual_catch_context:
        fn()

setup = 'from __main__ import simple_catch, work_pass, work_fail, with_catch, manual_with_catch, manual_with_catch_cache'
commands = [
    'simple_catch(work_pass)',
    'simple_catch(work_fail)',
    'with_catch(work_pass)',
    'with_catch(work_fail)',
    'manual_with_catch(work_pass)',
    'manual_with_catch(work_fail)',
    'manual_with_catch_cache(work_pass)',
    'manual_with_catch_cache(work_fail)',
]
for c in commands:
    print c, ': ', timeit.timeit(c, setup)
I've made simple_catch actually call the function and I've added two new benchmarks.
Here's what I got:
>>> python2 bench.py
simple_catch(work_pass) : 0.413918972015
simple_catch(work_fail) : 3.16218209267
with_catch(work_pass) : 6.88726496696
with_catch(work_fail) : 11.8109841347
manual_with_catch(work_pass) : 1.60508012772
manual_with_catch(work_fail) : 4.03651213646
manual_with_catch_cache(work_pass) : 1.32663416862
manual_with_catch_cache(work_fail) : 3.82525682449
python2 p.py.py 33.06s user 0.00s system 99% cpu 33.099 total
And for PyPy:
>>> pypy bench.py
simple_catch(work_pass) : 0.0104489326477
simple_catch(work_fail) : 0.0212869644165
with_catch(work_pass) : 0.362847089767
with_catch(work_fail) : 0.400238037109
manual_with_catch(work_pass) : 0.0223228931427
manual_with_catch(work_fail) : 0.0208241939545
manual_with_catch_cache(work_pass) : 0.0138869285583
manual_with_catch_cache(work_fail) : 0.0213649272919
The overhead is much smaller than you claimed. Further, the only overhead PyPy doesn't seem to be able to remove for the manual variant, relative to the plain try...except, is object creation, which is trivially removed in this case.
Unfortunately with is way too involved for good optimization by CPython, especially with regards to contextlib which even PyPy finds hard to optimize. This is normally OK because although object creation + a function call + creating a generator is expensive, it's cheap compared to what is normally done.
If you are sure that with is causing most of your overhead, convert the context managers into cached instances like I have. If that's still too much overhead, you've likely got a bigger problem with how your system is designed. Consider making the scope of the withs bigger (not normally a good idea, but acceptable if need be).
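To illustrate making the scope of the withs bigger, here is a minimal sketch with stand-in names (not from the question); note that the batched form no longer enters and exits per element, which changes the exception-handling granularity:

class Ctx(object):
    # Stand-in context manager with trivial enter/exit.
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        return False

ctx = Ctx()

def per_item(items):
    for item in items:
        with ctx:          # __enter__/__exit__ run once per element
            pass

def per_batch(items):
    with ctx:              # __enter__/__exit__ run once in total
        for item in items:
            pass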
Also, PyPy. Dat JIT be fast.
See Is there something wrong with this python code, why does it run so slow compared to ruby? for my previous attempt at understanding the differences between python and ruby.
As igouy pointed out, the reasoning I came up with for Python being slower could be something other than recursive function calls (the stack being involved).
I made this:
#!/usr/bin/python2.7

i = 0
a = 0
while i < 6553500:
    i += 1
    if i != 6553500:
        a = i
    else:
        print "o"
print a
In Ruby it is:
#!/usr/bin/ruby

i = 0
a = 0
while i < 6553500
  i += 1
  if i != 6553500
    a = i
  else
    print "o"
  end
end
print a
Python 3.1.2 (r312:79147, Oct 4 2010, 12:45:09)
[GCC 4.5.1] on linux2
time python pytest.py
o
6553499
real 0m3.637s
user 0m3.586s
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
time ruby rutest.rb
o6553499
real 0m0.618s
user 0m0.610s
Letting it loop higher gives bigger differences: adding an extra 0, Ruby finishes in 7s, while Python runs for 40s.
This is run on an Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz with 4GB mem.
Why is this so?
First off, note that the Python version you show is incorrect: you're running this code in Python 2.7, not 3.1 (it's not even valid Python3 code). (FYI, Python 3 is usually slower than 2.)
That said, there's a critical problem in the Python test: you're writing it as global code. You need to write it as a function. It runs about twice as fast when written correctly, in both Python 2 and 3:
def main():
    i = 0
    a = 0
    while i < 6553500:
        i += 1
        if i != 6553500:
            a = i
        else:
            print("o")
    print(a)

if __name__ == "__main__":
    main()
When you write code globally, you have no locals; all of your variables are global variables. Locals are much faster than globals in Python, because globals are stored in a dict. Locals can be referenced directly by the VM by index, so no hash table lookups are needed.
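A minimal sketch of that gap, with a stand-in loop (numbers are machine-dependent). Because timeit runs the statement inside a generated function, the global declaration below is what forces the dict-based path:

import timeit

# Globals: every read/write of i goes through a dict lookup.
global_code = """
global i
i = 0
while i < 100000:
    i += 1
"""

# Locals: i lives in an indexed slot of the generated function.
local_code = """
i = 0
while i < 100000:
    i += 1
"""

print(timeit.timeit(global_code, number=100))
print(timeit.timeit(local_code, number=100))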
Also, note that this is such a simple test that what you're really doing is benchmarking a few arbitrary bytecode operations.
Why is this so?
Python's loops (for, while) aren't fast at handling dynamic types; in such cases it loses the advantage. But Cython becomes the salvation.
The pure Python version below is borrowed from Glenn Maynard's answer (without the print).
The Cython version based on it is very easy; it's easy enough for a new Python programmer to read:
def main():
    cdef int i = 0
    cdef int a = 0
    while i < 6553500:
        i += 1
        if i != 6553500:
            a = i
        else:
            pass  # print "0"
    return a

if __name__ == "__main__":
    print main()
On my PC, the Python version needs 2.5s and the Cython version 5.5ms:
In [1]: import pyximport
In [2]: pyximport.install()
In [3]: import spam # pure python version
In [4]: timeit spam.main()
1 loops, best of 3: 2.41 s per loop
In [5]: import eggs # cython version
In [6]: timeit eggs.main()
100 loops, best of 3: 5.51 ms per loop
Update: as Glenn Maynard points out in the comment, while i < N: i += 1 is not Pythonic. I tested an xrange implementation.
spam.py is the same as Glenn Maynard's version. foo.py's code is:
def main():
    for i in xrange(6553500):
        pass
    a = i
    return a

if __name__ == "__main__":
    print main()
~/code/note$ time python2.7 spam.py # Glenn Maynard's while version
6553499
real 0m2.128s
user 0m2.080s
sys 0m0.044s
~/code/note$ time python2.7 foo.py # xrange version, as Glenn Maynard pointed out in a comment
6553499
real 0m0.618s
user 0m0.604s
sys 0m0.016s
On my friend's laptop (Windows 7 64-bit, Python 2.6, 3GB RAM), Python takes only around 1 sec for the 6553500 input and 10 secs for 65535000. I wonder why your computer takes so much time. It also shaves off some time on the larger input when I use xrange and local variables instead.
I cannot comment on Ruby since it's not installed on this computer.
This is too simple a test. Maybe Ruby is faster at things like this. But Python has the advantage when you need to work with more complex datatypes and their methods. Python has many more implemented "ways to do this", and you must choose the one that is the simplest yet sufficient. Ruby works with more abstract entities and requires less knowledge. But for the same task in Python, if you go deeper into its manuals and find more and more combinations of types and methods/functions, it is possible over time to make a much faster program than in Ruby.
Ruby is simpler and easier to write if you just want to get a result, not performance. But the more complex your task is, the more of a performance advantage Python will have after hours of optimizing.
UPD: While Ruby and Python have so much in common, performance and high-levelness tend to be inversely proportional.