How to use timeit where each test requires random setup - python

I have a function f(x) that takes as input a list x of 100 random floats between 0 and 1. Different lists will result in different running times of f.
I want to find out how long f takes to run on average, over a large number of different random lists. What's the best way to do this? Should I use timeit and if so is there a way I can do this without including the time it takes to generate each random list in each trial?
This is how I would do it without timeit (sketched as runnable Python):
import random
import time

results = []
for i in range(10000):
    x = [random.random() for _ in range(100)]  # fresh random list, generation not timed
    start = time.time()
    f(x)
    end = time.time()
    results.append(end - start)
print(sum(results) / len(results))  # mean running time

You can make a timer decorator:
Here is some example code:
from time import time

class Timer(object):
    def __init__(self, func):
        """
        Decorator that times a function

        :param func: Function being decorated
        :type func: callable
        """
        self.func = func

    def __call__(self, *args, **kwargs):
        start = time()
        self.func(*args, **kwargs)
        end = time()
        return end - start

@Timer
def cheese():
    for var in xrange(9999999):
        continue

for var in xrange(100):
    print cheese()
Working example, with fewer loops.
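One caveat about the Timer decorator above: it discards the wrapped function's return value and returns only the elapsed time. If you need both, a small variant (my sketch, not part of the original answer) can return them as a pair:

from time import time

class Timer(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        start = time()
        result = self.func(*args, **kwargs)  # keep the real return value
        elapsed = time() - start
        return result, elapsed               # (return value, seconds taken)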

import timeit, random

def summer(myList):
    result = 0
    for num in myList:
        result += num
    return result

for i in range(10):
    x = [random.randint(0, 100) for i in range(100000)]
    print timeit.timeit("summer(x)", setup="from __main__ import x, summer", number=100)
You can import the variable using from __main__ import x
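On Python 3.5 and later you can skip the import string entirely by passing globals to timeit (a variant sketch of the same example):

import timeit, random

def summer(myList):
    result = 0
    for num in myList:
        result += num
    return result

for i in range(10):
    x = [random.randint(0, 100) for i in range(100000)]
    # globals=globals() makes x and summer visible to the timed statement
    print(timeit.timeit("summer(x)", globals=globals(), number=100))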

I think this does the trick. It executes the setup once per repeat and then executes stmt exactly once (number=1) per repeat. However, I don't think this is much better than the simple loop you posted.
import timeit
stmt = '[x*x*x for x in xrange(n)]' # just an example
setup = 'import random; n = random.randint(10, 100)'
r = 10000
times = timeit.repeat(stmt, setup, repeat=r, number=1)
print min(times), max(times), sum(times)/r
There is also a "cell mode" that you can use with timeit in the IPython shell, but it only returns the fastest time, and there is no easy way to change that (?).
import random
%%timeit -r 10000 -n 1 n = random.randint(10,100)
var = [x*x*x for x in xrange(n)]
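Note that timeit also accepts a callable directly, which makes the original pattern (fresh random setup per trial, with setup excluded from the timing) easy to express. A sketch, with a stand-in f:

import random
import timeit

def f(x):
    return sum(v * v for v in x)  # stand-in for the real f

times = []
for _ in range(10000):
    x = [random.random() for _ in range(100)]  # generated outside the timed call
    times.append(timeit.timeit(lambda: f(x), number=1))
print(sum(times) / len(times))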

Related

Optimizing a parallel implementation of a list comprehension

I have a dataframe, where each row contains a list of integers. I also have a reference-list that I use to check what integers in the dataframe appear in this list.
I have made two implementations of this, one single-threaded and one multi-threaded. The single-threaded implementation is quite fast (roughly 0.1s on my machine), whereas the multithreaded one takes roughly 5s.
My question is: Is this due to my implementation being poor, or is this merely a case where the overhead due to multithreading is so large that it doesn't make sense to use multiple threads?
The example is below:
import time
from random import randint
import pandas as pd
import multiprocessing
from functools import partial

class A:
    def __init__(self, N):
        self.ls = [[randint(0, 99) for i in range(20)] for j in range(N)]
        self.ls = pd.DataFrame({'col': self.ls})
        self.lst_nums = [randint(0, 99) for i in range(999)]

    @classmethod
    def helper(cls, lst_nums, col):
        return any([s in lst_nums for s in col])

    def get_idx_method1(self):
        method1 = self.ls['col'].apply(lambda nums: any(x in self.lst_nums for x in nums))
        return method1

    def get_idx_method2(self):
        pool = multiprocessing.Pool(processes=1)
        method2 = pool.map(partial(A.helper, self.lst_nums), self.ls['col'])
        pool.close()
        return method2

if __name__ == "__main__":
    a = A(50000)

    start = time.time()
    m1 = a.get_idx_method1()
    end = time.time()
    print(end - start)

    start = time.time()
    m2 = a.get_idx_method2()
    end = time.time()
    print(end - start)
First of all, multiprocessing is only useful when the cost of communicating data between the main process and the workers is small compared to the time cost of the function itself.
Another thing is that you made an error in your code:
def helper(cls, lst_nums, col):
    return any([s in lst_nums for s in col])
VS
any(x in self.lst_nums for x in nums)
You have that list [] in the helper method, which forces any() to wait for the entire list to be built before it starts checking, while the second any() is fed a generator and stops at the first True value.
In conclusion, if you remove the list brackets from the helper method and perhaps increase the randint range for the lst_nums initializer, you will notice a speed increase when using multiple processes:
self.lst_nums = [randint(0, 10000) for i in range(999)]
and
def helper(cls, lst_nums, col):
    return any(s in lst_nums for s in col)
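A further, independent speedup (my suggestion, not part of the original answer): s in lst_nums scans a list in O(n) per test, while set membership is O(1) on average, so storing lst_nums as a set makes each helper call much cheaper:

from random import randint

lst_nums = {randint(0, 10000) for _ in range(999)}  # a set instead of a list

def helper(lst_nums, col):
    # set lookup is O(1) on average, and the generator lets
    # any() short-circuit at the first match
    return any(s in lst_nums for s in col)

print(helper(lst_nums, [randint(0, 10000) for _ in range(20)]))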

in ReactiveX, how do I pass other parameters to Observer.create?

Using RxPY for illustration purposes.
I want to create an observable from a function, but that function must take parameters. This particular example must return, at random intervals, one of many pre-defined tickers which I want to send to it. My solution thus far is to use a closure:
from __future__ import print_function
from rx import Observable
import random
import string
import time

def make_tickers(n=300, s=123):
    """ generates up to n unique 3-letter strings, each made up of uppercase letters """
    random.seed(s)
    tickers = [''.join(random.choice(string.ascii_uppercase) for _ in range(3)) for y in range(n)]
    tickers = list(set(tickers))  # unique
    print(len(tickers))
    return(tickers)

def spawn_prices_fn(tickers):
    """ returns a function that will return a random element
    out of tickers every 20-100 ms, and takes an observer parameter """
    def spawner(observer):
        while True:
            next_tick = random.choice(tickers)
            observer.on_next(next_tick)
            time.sleep(random.randint(20, 100)/1000.0)
    return(spawner)

if __name__ == "__main__":
    spawned = spawn_prices_fn(make_tickers())
    xx = Observable.create(spawned)
    xx.subscribe(lambda s: print(s))
Is there a simpler way? Can further parameters be sent to Observable.create's first parameter function without requiring a closure? What is the canonical advice?
It can be done in numerous ways; here's one solution that doesn't change your code too much.
Note that ticker generation could also be broken up into a function generating a single string, combined with some rx magic to be more rx-like.
I also slightly adjusted the code to make flake8 happy.
from __future__ import print_function
import random
import string
import time
from rx import Observable

def make_tickers(n=300, s=123):
    """
    Generates up to n unique 3-letter strings each made up of uppercase letters
    """
    random.seed(s)
    tickers = [''.join(random.choice(string.ascii_uppercase) for _ in range(3))
               for y in range(n)]
    tickers = list(set(tickers))  # unique
    print(len(tickers))
    return(tickers)

def random_picker(tickers):
    ticker = random.choice(tickers)
    time.sleep(random.randint(20, 100) / 1000.0)
    return ticker

if __name__ == "__main__":
    xx = Observable\
        .repeat(make_tickers())\
        .map(random_picker)\
        .subscribe(lambda s: print(s))
or a solution without make_tickers:
from __future__ import print_function
import random
import string
import time
from rx import Observable

def random_picker(tickers):
    ticker = random.choice(tickers)
    time.sleep(random.randint(20, 100) / 1000.0)
    return ticker

if __name__ == "__main__":
    random.seed(123)
    Observable.range(1, 300)\
        .map(lambda _: ''.join(random.choice(string.ascii_uppercase)
                               for _ in range(3)))\
        .reduce(lambda x, y: x + [y], [])\
        .do_while(lambda _: True)\
        .map(random_picker)\
        .subscribe(lambda s: print(s))
time.sleep could be moved out of random_picker, but the code would become a bit trickier.
You can also use "partials" to wrap your subscription method. This lets you define extra arguments while calling rx.create on a callable that expects only the observer and scheduler:
import functools
import rx

def my_subscription_with_arguments(observer, scheduler, arg1):
    observer.on_next(arg1)

my_subscription_wrapper = functools.partial(my_subscription_with_arguments, arg1='hello')
source = rx.create(my_subscription_wrapper)
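Assuming RxPY 3.x (where rx.create passes the observer and scheduler to the subscription function), subscribing then works as usual, for example:

# Prints 'hello' when the subscription invokes the wrapped function
source.subscribe(on_next=print)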

python - measure function time

I am having a problem with measuring the time of a function.
My function is a "linear search":
def linear_search(obj, item):
    for i in range(0, len(obj)):
        if obj[i] == item:
            return i
    return -1
And I made another function that measures the time 100 times and adds all the results to a list:
def measureTime(a):
    nl = []
    import random
    import time
    for x in range(0, 100):  # calculating time
        start = time.time()
        a
        end = time.time()
        times = end - start
        nl.append(times)
    return nl
When I'm using measureTime(linear_search(list,random.choice(range(0,50)))), the function always returns [0.0].
What can cause this problem? Thanks.
You are actually passing the result of linear_search into measureTime; you need to pass in the function and its arguments instead, so the call is executed inside the measureTime function, as in @martijnn2008's answer.
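For example, wrap the call in a zero-argument lambda so it is executed, and therefore timed, inside the loop (a sketch assuming measureTime is adjusted to call its argument, with the linear_search from the question in scope):

import random
import time

def measureTime(fn):
    nl = []
    for x in range(0, 100):
        start = time.time()
        fn()  # the search actually runs here, inside the timed region
        end = time.time()
        nl.append(end - start)
    return nl

lst = list(range(50))
print(measureTime(lambda: linear_search(lst, random.choice(range(0, 50)))))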
Better still, you can use the timeit module to do the job for you:
from functools import partial
import timeit

def measureTime(n, f, *args):
    # returns the total runtime for n executions (divide by n for the average)
    # use a loop with number=1 to collect the n individual runtimes
    return timeit.timeit(partial(f, *args), number=n)

# running within the module
measureTime(100, linear_search, list, random.choice(range(0, 50)))

# if running interactively outside the module, use the following instead,
# assuming your module is named mymodule
mymodule.measureTime(100, mymodule.linear_search, mymodule.list, mymodule.random.choice(range(0, 50)))
Take a look at the following example; I don't know exactly what you are trying to achieve, so I guessed ;)
import random
import time

def measureTime(method, n, *args):
    start = time.time()
    for _ in xrange(n):
        method(*args)
    end = time.time()
    return (end - start) / n

def linear_search(lst, item):
    for i, o in enumerate(lst):
        if o == item:
            return i
    return -1

lst = [random.randint(0, 10**6) for _ in xrange(10**6)]
repetitions = 100
for _ in xrange(10):
    item = random.randint(0, 10**6)
    print 'average runtime =',
    print measureTime(linear_search, repetitions, lst, item) * 1000, 'ms'

Python not in dict condition sentence performance

Does anybody know which is better to use in terms of speed and resources? Links to some trusted sources would be much appreciated.
if key not in dictionary.keys():
or
if not dictionary.get(key):
Firstly, you'd do
if key not in dictionary:
since membership tests on a dict check its keys directly.
Secondly, the two statements are not equivalent - the second condition would be true if the corresponding value is falsy (0, "", [] etc.), not only if the key doesn't exist.
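A quick illustration of that difference (my example):

d = {'key': 0}
print('key' not in d)    # False -- the key exists
print(not d.get('key'))  # True  -- because the value 0 is falsy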
Lastly, the first method is definitely faster and more pythonic. Function/method calls are expensive. If you're unsure, timeit.
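For instance, a minimal timeit comparison along those lines (a sketch; absolute numbers vary by machine and Python version):

import timeit

setup = "d = {i: i for i in range(1000)}"
print(timeit.timeit("500 in d", setup=setup))                # plain membership test
print(timeit.timeit("d.get(500) is not None", setup=setup))  # get() plus a comparison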
In my experience, using in is faster than using get, although the speed of get can be improved by caching the get method so it doesn't have to be looked up each time. Here are some timeit tests:
''' in vs get speed test

Comparing the speed of cache retrieval / update using `get` vs using `in`
http://stackoverflow.com/a/35451912/4014959

Written by PM 2Ring 2015.12.01
Updated for Python 3 2017.08.08
'''

from __future__ import print_function
from timeit import Timer
from random import randint
import dis

cache = {}

def get_cache(x):
    ''' retrieve / update cache using `get` '''
    res = cache.get(x)
    if res is None:
        res = cache[x] = x
    return res

def get_cache_defarg(x, get=cache.get):
    ''' retrieve / update cache using defarg `get` '''
    res = get(x)
    if res is None:
        res = cache[x] = x
    return res

def in_cache(x):
    ''' retrieve / update cache using `in` '''
    if x in cache:
        return cache[x]
    else:
        res = cache[x] = x
        return res

# slow to fast.
funcs = (
    get_cache,
    get_cache_defarg,
    in_cache,
)

def show_bytecode():
    for func in funcs:
        fname = func.__name__
        print('\n%s' % fname)
        dis.dis(func)

def time_test(reps, loops):
    ''' Print timing stats for all the functions '''
    for func in funcs:
        fname = func.__name__
        print('\n%s: %s' % (fname, func.__doc__))
        setup = 'from __main__ import data, ' + fname
        cmd = 'for v in data: %s(v)' % (fname,)
        times = []
        t = Timer(cmd, setup)
        for i in range(reps):
            r = 0
            for j in range(loops):
                r += t.timeit(1)
                cache.clear()
            times.append(r)
        times.sort()
        print(times)

datasize = 1024
maxdata = 32
data = [randint(1, maxdata) for i in range(datasize)]

#show_bytecode()
time_test(3, 500)
typical output on my 2GHz machine running Python 2.6.6:
get_cache: retrieve / update cache using `get`
[0.65624237060546875, 0.68499755859375, 0.76354193687438965]
get_cache_defarg: retrieve / update cache using defarg `get`
[0.54204297065734863, 0.55032730102539062, 0.56702113151550293]
in_cache: retrieve / update cache using `in`
[0.48754477500915527, 0.49125504493713379, 0.50087881088256836]
TLDR: Use if key not in dictionary. This is idiomatic, robust and fast.
There are four versions of relevance to this question: the 2 posed in the question, and the optimal variant of them:
key not in dictionary.keys() # inA
key not in dictionary # inB
not dictionary.get(key) # getA
sentinel = object()
dictionary.get(key, sentinel) is not sentinel # getB
Both A variants have shortcomings that mean you should not use them. inA needlessly creates a dict view on the keys - this adds an indirection step. getA looks at the truth of the value - this leads to incorrect results for values such as '' or 0.
As for using inB over getB: both do the same thing, namely looking at whether there is a value for key. However, getB also returns that value or default and has to compare it against the sentinel. Consequently, using get is considerably slower:
$ PREPARE="
> import random
> data = {a: True for a in range(0, 512, 2)}
> sentinel=object()"
$ python3 -m perf timeit -s "$PREPARE" '27 in data'
.....................
Mean +- std dev: 33.9 ns +- 0.8 ns
$ python3 -m perf timeit -s "$PREPARE" 'data.get(27, sentinel) is not sentinel'
.....................
Mean +- std dev: 105 ns +- 5 ns
Note that pypy3 has practically the same performance for both variants once the JIT has warmed up.
OK, I've tested it on Python 3.4.3, and all three ways give roughly the same result, around 0.00001 seconds.
import random
a = {}
for i in range(0, 1000000):
    a[str(random.random())] = random.random()

import time
t1 = time.time(); 1 in a.keys(); t2 = time.time(); print("Time=%s" % (t2 - t1))
t1 = time.time(); 1 in a; t2 = time.time(); print("Time=%s" % (t2 - t1))
t1 = time.time(); not a.get(1); t2 = time.time(); print("Time=%s" % (t2 - t1))

List Comparison Algorithm: How can it be made better?

Running on Python 3.3
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold: first, I cannot seem to find any algorithms online; second, there should be a more efficient way.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach by:
Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).
Currently, the run-time seems to average O(n log n): the sorting dominates, and the merge-style scan itself is roughly linear. This can be seen by running the test cases listed after this block of code.
Right now, my code looks as such:
def compare(small, large, largeStart, largeEnd):
    for i in range(largeStart, largeEnd):
        if small == large[i]:
            return [1, i]
        if small < large[i]:
            if i != 0:
                return [0, i-1]
            else:
                return [0, i]
    return [0, largeStart]

def determineLongerList(aList, bList):
    if len(aList) > len(bList):
        return (aList, bList)
    elif len(aList) < len(bList):
        return (bList, aList)
    else:
        return (aList, bList)

def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex < smallLength):
        boolean = compare(smallList[smallIndex], largeList, largeIndex, largeLength)
        if boolean[0] == 1:
            # `compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            # else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex += 1
        iterations = largeIndex - oldIndex + iterations + 1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t' + str(iterations / (smallLength * largeLength)) + '\n')
    return sameItems
Here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t' + str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
    print('\n******************************************')

#testLargest()
'''
One rendition of testLargest:

******************************************
CREATING LISTS TOOK:    21.009342908859253

******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO:      9.99998e-07

COMPARING LISTS TOOK:   13.99990701675415
NUMBER OF SAME ITEMS:   632328

******************************************
'''

def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t' + str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
    print('\n******************************************')

testLarge()
If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care for doubled elements, then sets would be fine.
a = set(listA)
print a.intersection(listB)
This will print all elements which are in listA and in listB. (Without doubled output for doubled input elements.)
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print a & b
This will print each common element along with how many times it appears in both lists (the minimum of the two counts).
I didn't do any measuring, but I'm pretty sure these solutions are way faster than your hand-rolled attempts.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
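For example (my illustration of the Counter approach):

import collections

listA = [1, 2, 2, 3]
listB = [2, 2, 4]

both = collections.Counter(listA) & collections.Counter(listB)
print(both)                   # Counter({2: 2})
print(list(both.elements()))  # [2, 2]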
Timing with the IPython timeit magic, the hand-rolled approach doesn't compare favourably with just a standard set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}
