Timeit module - Passing objects to setup? - python

I am trying to use the timeit module to time the speed of an algorithm that analyzes data.
The problem is that I have to run some setup code in order to run this algorithm. Specifically, I have to load some documents from a database and turn them into a matrix representation.
The timeit module does not seem to allow me to pass in the matrix object, and instead forces me to set this up all over again in the setup parameter. Unfortunately this means that the running time of my algorithm gets skewed by the running time of the pre-processing.
Is there some way to pass objects that were already created into timeit's setup parameter? Otherwise, how can I deal with situations where the setup code takes a nontrivial amount of time and I don't want it to skew the timing of the code block I'm actually trying to test?
Am I approaching this the wrong way?

The running time of your algorithm is not skewed by the running time of the pre-processing. This can be shown as follows: suppose I declare a list in the __main__ module and use timeit to find the index of some item in that list. I need to pass the list to timeit too, and that passing is a kind of pre-processing. The time returned by timeit is 0.26 s (see the code below). If timeit had also counted the pre-processing time (importing the list from __main__), the result would have been almost 1.1 s, because importing the list from __main__ takes 0.84 s for 1,000,000 iterations (see the code below). What timeit actually does is import the list from __main__ only once and then measure the time required by the algorithm for the given number of iterations.
>>> import timeit
>>> lst = range(10)
>>> timeit.timeit('lst.index(9)', 'from __main__ import lst', number = 1000000)
0.2645089626312256
>>> timeit.timeit('from __main__ import lst', number = 1000000)
0.8406829833984375

The time it takes to run the setup code doesn't affect the timeit module's timing calculations.
You should be able to pass your matrix into the setup parameter using an import, e.g.
"from __main__ import mymatrix"

Related

python timeit: function with decorator

I am trying to figure out whether to compile regex expressions upfront, or alternatively, for startup speed, compile them only once when they are first required.
I can use timeit to test:
re.compile(r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)')
I don't know how to figure out how long the parser takes to see the following code (not execute it, just "see" it), or how to use timeit to time that.
import re
from functools import cache

@cache
def get_color_pattern():
    return re.compile(r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)')
Python code is interpreted "top to bottom". Therefore, if you declare variables on either side of your function that are initialised with a timestamp, the difference between those two values is the time taken to parse the code between them. So, for example,
from functools import cache
import time
import re

start = time.perf_counter()

@cache
def get_color_pattern():
    return re.compile(r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)')

print(time.perf_counter() - start)

start = time.perf_counter()
re.compile(r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)')
print(time.perf_counter() - start)
Output:
5.995000265102135e-06
0.00024428899996564724
Thus we see the difference in the time taken to parse the function versus compiling the expression.
If you want to check how long a call takes once the cache has been generated, just run the function once before timing it:
from functools import cache
import re
import timeit

@cache
def get_color_pattern():
    return re.compile(r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)')

# Run once to generate the cache
get_color_pattern()

number = 1000
t = timeit.timeit(get_color_pattern, number=number)
print(f"Total {t} seconds for {number} iterations, {t / number} at average")
Out: Total 4.5900000000001495e-05 seconds for 1000 iterations, 4.5900000000001496e-08 at average
Running with @cache commented out, it seems to be roughly 10x slower:
Total 0.0004936999999999997 seconds for 1000 iterations, 4.936999999999997e-07 at average
As pointed out by Kelly Bundy, re.compile already includes a built-in caching feature.
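As a rough illustration of that built-in cache (a sketch, not a rigorous benchmark): repeated re.compile calls with the same pattern string are served from the re module's internal cache, so only the first compilation pays the full cost.
import re
import timeit

PATTERN = r'#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})|rgb\(\d{1,3},\d{1,3},\d{1,3}\)'

# First compilation of the pattern (cold cache) vs. repeated re.compile calls
# that hit re's internal cache.
print(timeit.timeit(lambda: re.compile(PATTERN), number=1))
print(timeit.timeit(lambda: re.compile(PATTERN), number=1000) / 1000)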

Why does Numba skew the timings of a JIT-compiled function?

I'm trying to benchmark a Python function that does list operations using Numba against the CPython interpreter. To compare end-to-end time I used the Linux time utility.
time python3.10 list.py
As I understand it, the first invocation will be expensive due to JIT compilation, but that does not explain why the maximum recorded time is longer than the total time taken to run the entire script.
# list.py
import numpy as np
from time import time, perf_counter
from numba import njit

@njit
def listOperations():
    list = []
    for i in range(1000):
        list.append(i)
    list.sort(reverse=True)
    list.remove(420)
    list.reverse()

if __name__ == "__main__":
    repetitions = 1000
    timings = np.zeros(repetitions)
    for rep in range(repetitions):
        start = time()  # Similar results with perf_counter too.
        listOperations()
        timings[rep] = time() - start
    # Convert to milliseconds
    timings *= 10e3
    print("Mean {}ms, Median {}ms, Std. Dev {}ms, Min {}ms, Max {}ms".format(
        float('%.4f' % np.mean(timings)),
        float('%.4f' % np.median(timings)),
        float('%.4f' % np.std(timings)),
        float('%.4f' % np.min(timings)),
        float('%.4f' % np.max(timings)))
    )
For Numba it shows a maximum of ~66.3s, while the time utility reports ~8s. The complete results are below.
'''
Numba --->
Mean 66.8154ms, Median 0.391ms, Std. Dev 2097.7752ms, Min 0.3219ms, Max 66371.1143ms
real 0m7.982s
user 0m8.248s
sys 0m0.100s
CPython3.10 --->
Mean 1.6395ms, Median 1.6284ms, Std. Dev 0.0708ms, Min 1.5759ms, Max 2.3198ms
real 0m1.115s
user 0m1.468s
sys 0m0.080s
'''
The main issue is that the compilation time is included in the timings. Indeed, Numba compiles functions lazily. To prevent this, you must either specify the prototype or execute the first function call outside of the timed section (which is generally good practice in benchmarks anyway).
You can use @njit('()') instead of @njit. With this fix, the Numba code is about twice as fast on my machine.
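A minimal sketch of the warm-up approach, calling the compiled function once before the timed loop so that compilation happens outside the measurements (the list is renamed lst, following the naming note below):
import numpy as np
from time import time
from numba import njit

@njit
def listOperations():
    lst = []
    for i in range(1000):
        lst.append(i)
    lst.sort(reverse=True)
    lst.remove(420)
    lst.reverse()

listOperations()  # warm-up call: triggers the JIT compilation before timing

repetitions = 1000
timings = np.zeros(repetitions)
for rep in range(repetitions):
    start = time()
    listOperations()
    timings[rep] = time() - start

print("median: {:.6f}s, max: {:.6f}s".format(np.median(timings), np.max(timings)))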
Note that your function does not return anything and does not read any parameter, so the JIT is free to optimize the whole function to a no-op. To avoid such biases, you should add a parameter, use it, and return the list. This apparently does not happen on my machine, but different versions of Numba may do it.
Note also that lists are generally not where Numba shines. Lists are generally slow (both with and without Numba). It is better to use arrays when the size is known.
By the way, list is a built-in name. Overwriting it can cause sneaky bugs in code that uses it (which is frequent), so it is not a good idea. I advise you to use another name.
Furthermore, note that the standard deviation was pretty big in the results, the median time was good, and the maximum time was very big, indicating that the timings were not stable and that the instability comes from one slow call. Such results generally indicate that the benchmark is flawed or that the function itself has unstable behaviour (typically due to a bug or an initialization done only once).

Pandas: Why is Series indexing using .loc taking 100x longer on the first run when timing it?

I'm slicing a pretty big pandas Series (~5M elements) using .loc, and I stumbled upon some weird behavior when checking times in an attempt to optimize my code.
It's weird that the first slicing attempt, like series_object.loc[some_indexes], takes 100x longer than the following ones.
When I try timeit it does not reflect this behaviour, but when checking the partial laps using time, we can see that the first lap takes much longer than the following ones.
Is .loc using some sort of caching? If so, why doesn't garbage collection influence this?
Is timeit doing the caching even with the garbage collector disabled, and not behaving as it's supposed to?
Which timing should I trust as representative of what my app will take when running in a live environment?
I tried this on Windows and Linux machines using different versions of Python (3.6, 3.7 and 2.7) and the behavior is always the same.
Thanks in advance for your help. This has been banging my head for a week already and I miss not doubting %timeit :)
To reproduce, save the following code to a Python file, e.g. test_loc_times.py:
import pandas as pd
import numpy as np
import timeit
import time, gc

def get_data():
    ids = np.arange(size_bigseries)
    big_series = pd.Series(index=ids, data=np.random.rand(len(ids)), name='{} elements series'.format(len(ids)))
    small_slice = np.arange(size_slice)
    return big_series, small_slice

# Method to test: a simple pandas slicing with .loc
def basic_loc_indexing(pd_series, slice_ids):
    return pd_series.loc[slice_ids].dropna()

# Method to time it
def timing_it(func, n, *args):
    gcold = gc.isenabled()
    gc.disable()
    times = []
    for i in range(n):
        s = time.time()
        func(*args)
        times.append((time.time()-s)*1000)
    if gcold:
        gc.enable()
    return times

if __name__ == '__main__':
    import sys
    n_tries = int(sys.argv[1]) if len(sys.argv)>1 and sys.argv[1] is not None else 1000
    size_bigseries = int(sys.argv[2]) if len(sys.argv)>2 and sys.argv[2] is not None else 5000000  # 5M
    size_slice = int(sys.argv[3]) if len(sys.argv)>3 and sys.argv[3] is not None else 100  # 100

    # 1: timeit()
    big_series, small_slice = get_data()
    time_with_timeit = timeit.timeit('basic_loc_indexing(big_series, small_slice)', "gc.disable(); from __main__ import basic_loc_indexing, big_series, small_slice", number=n_tries)
    print("using timeit: {:.6f}ms".format(time_with_timeit/n_tries*1000))
    del big_series, small_slice

    # 2: time()
    big_series, small_slice = get_data()
    time_with_time = timing_it(basic_loc_indexing, n_tries, big_series, small_slice)
    print("using time: {:.6f}ms".format(np.mean(time_with_time)))
    print('head detail: {}\n'.format(time_with_time[:5]))
To try it out, run:
python test_loc_times.py 1000 5000000 100
This will run both the timeit and the time-based timing for 1000 laps, slicing 100 elements from a 5M-element pandas.Series.
You can try it yourself with other values; the first run always takes longer.
stdout:
>>> using timeit: 0.789754ms
>>> using time: 0.829869ms
>>> head detail: [145.02716064453125, 0.7691383361816406, 0.7028579711914062, 0.5738735198974609, 0.6380081176757812]
Weird, right?
Edit:
I found this answer, which might be related. What do you think?
This code is likely not idempotent (it has side effects that impact its execution).
timeit will run the code once first to measure the time and deduce the number of loops and runs it should use. If your code is not idempotent (has side effects, like caching), then that first run (which is not recorded) will be longer, and the subsequent (faster) runs will be measured and reported.
You can take a look at the arguments you can pass to timeit (see the doc) to specify the number of loops and forgo that initial run.
Also note (taken from the doc linked above):
The times reported by %timeit will be slightly higher than those reported by the timeit.py script when variables are accessed. This is due to the fact that %timeit executes the statement in the namespace of the shell, compared with timeit.py, which uses a single setup statement to import function or create variables. Generally, the bias does not matter as long as results from timeit.py are not mixed with those from %timeit.
Edit: I missed the fact that you were passing the number of runs to timeit. In that case, only the latter part of my answer applies, but the numbers you are seeing seem to point to another issue...
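One way to make the slow first call visible, instead of letting timeit's estimation pass or the earlier laps absorb it, is to time the first call separately from the rest. A sketch, assuming it is appended to the __main__ block of the script above:
    # 3: isolate the first call
    big_series, small_slice = get_data()  # fresh objects, so nothing is cached yet
    timer = timeit.Timer('basic_loc_indexing(big_series, small_slice)',
                         'from __main__ import basic_loc_indexing, big_series, small_slice')
    first = timer.timeit(number=1)           # includes any one-off lazy work/caching
    rest = timer.timeit(number=999) / 999    # per-call time once everything is warm
    print('first call: {:.6f}ms, later calls: {:.6f}ms'.format(first * 1000, rest * 1000))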

Any benefits to importing sub modules directly (seems to be slower)?

I wanted to see which is faster:
import numpy as np
np.sqrt(4)
-or-
from numpy import sqrt
sqrt(4)
Here is the code I used to find the average time to run each.
def main():
    import gen_funs as gf
    from time import perf_counter_ns

    t = 0
    N = 40
    for j in range(N):
        tic = perf_counter_ns()
        for i in range(100000):
            imp2()  # I ran the code with this, then with imp1()
        toc = perf_counter_ns()
        t += (toc - tic)
    t /= N

    time = gf.ns2hms(t)  # Converts ns to a readable object
    print("Ave. time to run: {:d}h {:d}m {:d}s {:d}ms".
          format(time.hours, time.minutes, time.seconds, time.milliseconds))

def imp1():
    import numpy as np
    np.sqrt(4)
    return

def imp2():
    from numpy import sqrt
    sqrt(4)
    return

if __name__ == "__main__":
    main()
When I import numpy as np and then call np.sqrt(4), I get an average time of about 229ms (the time to run the 100000-iteration loop).
When I run from numpy import sqrt and then call sqrt(4), I get an average time of about 332ms.
Since there is such a difference in run time, what is the benefit of from numpy import sqrt? Is there a memory benefit or some other reason why I would do this?
I tried timing with the time bash command. I got 215ms for importing numpy and running sqrt(4), and 193ms for importing sqrt from numpy with the same command. The difference is negligible, honestly.
However, importing parts of a module you don't actually need is not encouraged.
In this particular case there is no discernible performance benefit, and there are few situations in which you would want just numpy.sqrt: math.sqrt is ~4x faster, and the extra features numpy.sqrt offers are only useful when you have numpy data, which would require importing the entire module anyway.
There might be a rare scenario in which you don't need all of numpy but still need numpy.sqrt, e.g. using pandas.DataFrame.to_numpy() and manipulating the data in some way, but honestly I don't feel that 20ms of speed is worth anything in the real world, especially since you saw worse performance when importing just numpy.sqrt.
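One small check worth adding (not from the original answer, just a quick illustration): from numpy import sqrt still imports and initializes the whole numpy package, so neither form saves memory; they only differ in which names are bound in your namespace.
import sys

from numpy import sqrt  # binds one name, but still loads the entire package

print('numpy' in sys.modules)              # True: the whole package was imported
print(sqrt is sys.modules['numpy'].sqrt)   # True: it is the exact same object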

Simplify statement '.'.join( string.split('.')[0:3] )

I am used to coding in C/C++, and when I see the following array operation, I feel some CPU is being wasted:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'
Therefore I wonder:
Will this line be executed (interpreted) as the creation of a temporary array (a memory allocation), followed by concatenating the first three cells (another memory allocation)?
Or is the Python interpreter smart enough?
(I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other Python interpreters/compilers...)
Is there a trick to write a more CPU-efficient replacement line that is still understandable/elegant?
(You can provide specific Python 2 and/or Python 3 tricks and tips.)
I have no idea about the CPU usage here, but isn't that why we use high-level languages in the first place?
Another solution would be to use regular expressions; using a compiled pattern should allow background optimisations:
import re

version = '1.2.3.4.5-RC4'
pat = re.compile(r'^(\d+\.\d+\.\d+)')
res = pat.match(version)
if res:
    print(res.group(1))
Edit: As suggested by @jonrsharpe, I also ran the timeit benchmark. Here are my results:
def extract_vers(str):
    res = pat.match(str)
    if res:
        return res.group(1)
    else:
        return False
>>> timeit.timeit("api1(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.9013631343841553
>>> timeit.timeit("api2(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.3482811450958252
>>> timeit.timeit("extract_vers(s)", setup="from __main__ import extract_vers,api1,api2; s='1.2.3.4.5-RC4'")
1.174590826034546
Edit: In any case, some libraries exist in Python to do the job, such as distutils.version.
You should have a look at that answer.
To answer your first question: no, this will not be optimised out by the interpreter. Python will create a list from the string, then create a second list for the slice, then put the list items back together into a new string.
To cover the second, you can optimise this slightly by limiting the split with the optional maxsplit argument:
>>> v = '1.2.3.4.5-RC4'
>>> v.split(".", 3)
['1', '2', '3', '4.5-RC4']
Once the third '.' is found, Python stops searching through the string. You can also neaten it slightly by removing the default 0 start argument from the slice:
api = '.'.join(version.split('.', 3)[:3])
Note, however, that any difference in performance is negligible:
>>> import timeit
>>> def test1(version):
...     return '.'.join(version.split('.')[0:3])
...
>>> def test2(version):
...     return '.'.join(version.split('.', 3)[:3])
...
>>> timeit.timeit("test1(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0458565345561743
>>> timeit.timeit("test2(s)", setup="from __main__ import test1, test2; s = '1.2.3.4.5-RC4'")
1.0842980287537776
The benefit of maxsplit becomes clearer with longer strings containing more irrelevant '.'s:
>>> timeit.timeit("s.split('.')", setup="s='1.'*100")
3.460900054011617
>>> timeit.timeit("s.split('.', 3)", setup="s='1.'*100")
0.5287887450379003
I am used to code in C/C++ and when I see the following array operation, I feel some CPU wasting:
Feeling that CPU is being wasted is absolutely normal for C/C++ programmers facing Python code. Your code:
version = '1.2.3.4.5-RC4' # the end can vary a lot
api = '.'.join(version.split('.')[0:3]) # extract '1.2.3'
is absolutely fine in Python; there is no simplification needed. Only if you have to do it thousands of times should you consider using a library function or writing your own.
