I am trying to compute the time for every loop iteration. However, I have noticed that the time required to process (anything) increases with each iteration. I am computing the time with the following code:
start_time = time.time()
loop:
    (any process)
    print(time.time() - start_time)
When you call time.time() it returns the current time in seconds since the Unix epoch, i.e. the system clock's notion of "now".
You assign that time to start_time once, then run your 10 iterations and print the current time minus start_time on each one, so each print shows the total time elapsed since before the loop started - which is why the value grows with every iteration.
Now I believe that what you're trying to do is calculate how long each individual iteration takes. To do that you need to rearrange the lines in the sample code you supplied:
import time

for i in range(10):
    start_time = time.time()
    (any process)
    print(time.time() - start_time)
By moving the assignment of start_time into the loop you record the time at which each iteration starts, so you time each iteration individually rather than timing how long the entire loop has taken so far.
This would output how long each iteration takes.
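As a side note, time.perf_counter() is usually a better choice than time.time() for measuring short durations, because it is a high-resolution monotonic clock. A minimal sketch of the same per-iteration timing (the sum(range(...)) call is just a stand-in for your "(any process)"):

import time

for i in range(10):
    start = time.perf_counter()        # high-resolution monotonic clock
    sum(range(100_000))                # stand-in for "(any process)"
    elapsed = time.perf_counter() - start
    print(f"iteration {i}: {elapsed:.6f} s")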
Please feel free to ask any questions!
Here's an example of how you could perform your timings with timeit.
import timeit
setup = "i = {}"
stmt = """
for x in range(i):
3 + 3
"""
[timeit.timeit(stmt=stmt, setup=setup.format(i), number=100) for i in range(10)]
Which gives you a list of the times of each loop:
[8.027901640161872e-05,
0.00011072197230532765,
0.00011189299402758479,
0.00012168602552264929,
0.00012224999954923987,
0.0001258430420421064,
0.00013012002455070615,
0.00013478699838742614,
0.000138589006382972,
0.0001438520266674459]
I'm running the following program in order to compare the run times of multiprocessing versus single-core processing.
Here is the script :
from multiprocessing import Pool, cpu_count
from time import *

#Amount to calculate
N=5000

#Function that works alone
def two_loops(x):
    t=0
    for i in range(1,x+1):
        for j in range(i):
            t+=1
    return t

#Function that needs to be called in a loop
def single_loop(x):
    tt=0
    for j in range(x):
        tt+=1
    return tt

print 'Starting loop function'
starttime=time()
tot=0
for i in range(1,N+1):
    tot+=single_loop(i)
print 'Single loop function. Result ',tot,' in ', time()-starttime,' seconds'

print 'Starting multiprocessing function'
if __name__=='__main__':
    starttime=time()
    pool = Pool(cpu_count())
    res= pool.map(single_loop,range(1,N+1))
    pool.close()
    print 'MP function. Result ',res,' in ', time()-starttime,' seconds'

print 'Starting two loops function'
starttime=time()
print 'Two loops function. Result ',two_loops(N),' in ', time()-starttime,' seconds'
So basically the functions give me the sum of all integers between 1 and N (i.e. N(N+1)/2).
The two_loops function is the basic one, using two for loops. The single_loop function is just created to simulate the inner loop (the j loop).
When I run this script, it runs fine, but I don't get the right result. I get:
Starting loop function
Single loop function. Result 12502500 in 0.380275964737 seconds
Starting multiprocessing function
MP function. Result [1, 2, 3, ... a lot of values here ..., 4999, 5000] in 0.683819055557 seconds
Starting two loops function
Two loops function. Result 12502500 in 0.4114818573 seconds
It looks like the script runs, but I can't manage to get the right result from the multiprocessing version. I saw on the web that the close() function was supposed to take care of that, but apparently not.
Do you know how I can fix this?
Thanks a lot!
I don't understand your question but here's how it can be done:
from concurrent.futures.process import ProcessPoolExecutor
from timeit import Timer

def two_loops_multiprocessing():
    with ProcessPoolExecutor() as executor:
        executor.map(single_loop, range(N))

if __name__ == "__main__":
    iterations, elapsed_time = Timer("two_loops(N)", globals=globals()).autorange()
    print(elapsed_time / iterations)
    iterations, elapsed_time = Timer("two_loops_multiprocessing()", globals=globals()).autorange()
    print(elapsed_time / iterations)
What's happening is that pool.map chops up the range you provide and runs single_loop once for each of those numbers - see the documentation here: https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.map. Since single_loop just adds 1 to tt once for every element of range(x), it simply returns x, so the list you get back is effectively your range(1, N+1) again - which is exactly the result you are seeing.
In your other "single loop" you later add all the values you get together to get to a single value here:
for i in range(1,N+1):
    tot+=single_loop(i)
But you forget to do this with your multiprocessing version. What you should do is add a loop (or a sum) after you have called map to add all the results together, as in the sketch below, and you will get your expected answer.
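For instance, a minimal sketch of that fix, using the res list that pool.map returned in your script:

# res is the list returned by pool.map(single_loop, range(1, N+1))
total = 0
for value in res:
    total += value
print(total)  # 12502500 for N = 5000, i.e. N*(N+1)/2

(sum(res) does the same thing in one line, as shown further below.)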
Besides this your single loop function is basically a two loop function where you moved one loop to a function call. I'm not sure what you are trying to accomplish but there is not a big difference between the two.
Just sum the result list:
res = sum(pool.map(single_loop,range(1,N+1)))
You could avoid calculating the sum in the main thread by using some shared memory, but keep in mind that you will lose more time on synchronization. And again, there's no gain from multiprocessing in this case. It all depends on the specific case. If you needed to call single_loop fewer times and each call would take more time, then multiprocessing would speed up your code.
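If you did want to experiment with that shared-memory idea, one possible sketch (the helper names init_worker and add_single_loop are just for illustration) uses a multiprocessing.Value shared between the workers; note that the lock around every update is exactly the synchronization cost mentioned above:

from multiprocessing import Pool, Value, cpu_count

N = 5000

def single_loop(x):
    tt = 0
    for j in range(x):
        tt += 1
    return tt

def init_worker(shared_total):
    # make the shared counter visible inside each worker process
    global total
    total = shared_total

def add_single_loop(x):
    result = single_loop(x)
    with total.get_lock():   # the synchronization cost lives here
        total.value += result

if __name__ == "__main__":
    total = Value("q", 0)    # shared 64-bit integer, starts at 0
    with Pool(cpu_count(), initializer=init_worker, initargs=(total,)) as pool:
        pool.map(add_single_loop, range(1, N + 1))
    print(total.value)       # 12502500, i.e. N*(N+1)/2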
I'm following along with the videos for Interactive Python's course on data structures and algorithms. In one segment the following piece of code appears. It's meant to demonstrate an example of O(n**2) complexity.
It's supposed to loop through the range starting from 1000, and ending at 10000. But I have no idea why 100000 is given to the randrange function in the list comprehension on line 2.
Thanks in advance!
Note: i'm following along with this course - http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/BigONotation.html
for listSize in range(1000,10001,1000):
    alist = [randrange(100000) for x in range(listSize)]
    start = time.time()
    print(findmin(alist))
    end = time.time()
    print("size: %d time: %f" % (listSize, end-start))
This is a time trial, testing how fast findmin() is. That's best done with randomised data, to avoid pathological cases. The list comprehension produces the test data. The 100000 is just an upper bound for the random values in that list, high enough to ensure that even for a list with 10k integers there is a nice spread of values.
Note that it is better to use the timeit module to execute time trials.
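For example, a rough sketch of what that could look like; since findmin itself isn't shown in the snippet, the version below is a quadratic stand-in in the spirit of the course example:

import timeit
from random import randrange

def findmin(alist):
    # quadratic stand-in: compare every element against every other one
    overallmin = alist[0]
    for i in alist:
        issmallest = True
        for j in alist:
            if i > j:
                issmallest = False
        if issmallest:
            overallmin = i
    return overallmin

for listSize in range(1000, 10001, 1000):
    alist = [randrange(100000) for x in range(listSize)]
    t = timeit.timeit(lambda: findmin(alist), number=1)
    print("size: %d time: %f" % (listSize, t))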
I would like to time a code block using the timeit magic command in a Jupyter notebook. According to the documentation, timeit takes several arguments. Two in particular control number of loops and number of repetitions. What isn't clear to me is the distinction between these two arguments. For example
import numpy
N = 1000000
v = numpy.arange(N)
%timeit -n 10 -r 500 pass; w = v + v
will run 10 loops and 500 repetitions. My question is: can this be interpreted as the following (with obvious differences in the actual timing results)?
import time

n = 10
r = 500
T = numpy.empty(r)
for j in range(r):
    t0 = time.time()
    for i in range(n):
        w = v + v
    T[j] = (time.time() - t0)/n

print('Best time is {:.4f} ms'.format(min(T)*1000))
An assumption I am making, and may well be incorrect, is that the time for the inner loop is averaged over the n iterations through this loop. Then the best of 500 repetitions of this loop is taken.
I have searched the documentation and haven't found anything that specifies exactly what this is doing. For example, the documentation here says:
Options: -n&lt;N&gt;: execute the given statement &lt;N&gt; times in a loop. If this value is not given, a fitting value is chosen.
-r&lt;R&gt;: repeat the loop iteration &lt;R&gt; times and take the best result. Default: 3
Nothing is really said about how the inner loop is timed. The final result is the "best" of what?
The code I want to time does not involve any randomness, so I am wondering if I should set this inner loop to n=1. Then, the r repetitions will take care of any system variability.
number and repeat are separate arguments because they serve different purposes. number controls how many executions are done per timing, and it's used to get representative timings. repeat controls how many timings are done, and it's used to get accurate statistics. IPython uses the mean (average) of all repetitions to calculate the run-time of the statement and then divides that by number. So it measures the average of the averages. In earlier versions it used the minimum time (min()) of all repeats, divided it by number, and reported it as "best of".
To understand why there are two arguments to control the number and the repeats you have to understand what you're timing and how you can measure the time.
The granularity of the clock and the number of executions
A computer has different "clocks" to measure times. These clocks have different "ticks" (depending on the OS). For example it could measure seconds, milliseconds or nanoseconds - these ticks are called the granularity of the clock.
If the duration of the execution is smaller or roughly equal to the granularity of the clock one cannot get representative timings. Suppose your operation would take 100ns (=0.0000001 seconds) but the clock only measures milliseconds (=0.001 seconds) then most measurements would measure 0 milliseconds and a few would measure 1 millisecond - which one depends on where in the clock cycle the execution started and finished. That's not really representative of the duration of what you want to time.
This is on Windows where time.time has a granularity of 1 millisecond:
import time

def fast_function():
    return None

r = []
for _ in range(10000):
    start = time.time()
    fast_function()
    r.append(time.time() - start)

import matplotlib.pyplot as plt
plt.title('measuring time of no-op-function with time.time')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This shows the histogram of the measured times from this example. Almost all measurements were 0 milliseconds, and three measurements were 1 millisecond.
There are clocks with a much finer granularity on Windows; this was just to illustrate the effect of granularity, and every clock has some granularity, even if it is finer than one millisecond.
To overcome the restriction of the granularity, one can increase the number of executions so that the expected duration is significantly higher than the granularity of the clock. So instead of running the execution once, it is run number times. Taking the numbers from above and using a number of 100,000, the expected run-time would be 100,000 × 100 ns = 0.01 seconds. So, neglecting everything else, the clock would now measure roughly 10 milliseconds in almost all cases, which accurately resembles the expected execution time.
In short, specifying a number measures the sum of number executions. You need to divide the time measured this way by number again to get the "time per execution".
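A small sketch of that idea with the timeit module (the statement and the number here are just placeholders):

import timeit

number = 100_000
total = timeit.timeit("sum(range(100))", number=number)  # total time for `number` executions
per_execution = total / number                            # time for a single execution
print(per_execution)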
Other processes and the repetitions of the execution
Your OS typically has a lot of active processes. Some of them can run in parallel (on different cores or via hyper-threading), but most of them run sequentially, with the OS scheduling time slices for each process on the CPU. Most clocks don't care which process is currently running, so the measured time will differ depending on the scheduling. There are also clocks that measure process time instead of system time, but they measure the complete time of the Python process, which sometimes includes garbage collection or other Python threads. Besides that, the Python process isn't stateless, not every operation is always exactly the same, and there are memory allocations/re-allocations/frees happening (sometimes behind the scenes) whose timings can vary for a lot of reasons.
Again I use a histogram measuring the time it takes to sum ten thousand ones on my computer (only using repeat and setting number to 1):
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This histogram shows a sharp cutoff just below ~5 milliseconds, which indicates the "optimal" time in which the operation can be executed. The higher timings are measurements where the conditions were not optimal or where other processes/threads took some of the time.
The typical approach to avoid these fluctuations is to repeat the timing very often and then use statistics to get the most accurate numbers. Which statistic is best depends on what you want to measure. I'll go into this in more detail below.
Using both number and repeat
Essentially, %timeit is a wrapper around timeit.repeat, which is roughly equivalent to:
import timeit

timer = timeit.default_timer()

results = []
for _ in range(repeat):
    start = timer()
    for _ in range(number):
        function_or_statement_to_time
    results.append(timer() - start)
But %timeit has some convenience features compared to timeit.repeat. For example it calculates the best and average times of one execution based on the timings it got by repeat and number.
These are calculated roughly like this:
import statistics
best = min(results) / number
average = statistics.mean(results) / number
You could also use TimeitResult (returned if you use the -o option) to inspect all results:
>>> r = %timeit -o ...
7.46 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
>>> r.loops # the "number" is called "loops" on the result
100000000
>>> r.repeat
7
>>> r.all_runs
[0.7445439999999905,
0.7611092000000212,
0.7249667000000102,
0.7238135999999997,
0.7385598000000186,
0.7338551999999936,
0.7277425999999991]
>>> r.best
7.238135999999997e-09
>>> r.average
7.363701571428618e-09
>>> min(r.all_runs) / r.loops # calculated best by hand
7.238135999999997e-09
>>> from statistics import mean
>>> mean(r.all_runs) / r.loops # calculated average by hand
7.363701571428619e-09
General advice regarding the values of number and repeat
If you want to modify either number or repeat then you should set number to the minimum value possible without running into the granularity of the timer. In my experience number should be set so that number executions of the function take at least 10 microseconds (0.00001 seconds) otherwise you might only "time" the minimum resolution of the "timer".
The repeat should be set as high as possible. Having more repeats will make it more likely that you really find the real best or average. However more repeats will take longer so there's a trade-off as well.
IPython adjusts number but keeps repeat constant. I often do the opposite: I adjust number so that number executions of the statement take ~10 µs, and then I adjust repeat so that I get a good representation of the statistics (often it's in the range 100-10000). But your mileage may vary.
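If you prefer to let the library pick number for you, timeit.Timer.autorange does roughly that: it keeps increasing number until one timing takes at least 0.2 seconds. A possible sketch combining it with a manual repeat (the statement is just a placeholder):

import timeit

timer = timeit.Timer("sum(range(100))")

# let autorange find a `number` large enough for ~0.2 s per timing
number, _ = timer.autorange()

# then take many timings of that size and inspect the statistics yourself
results = [t / number for t in timer.repeat(repeat=100, number=number)]
print(min(results), sum(results) / len(results))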
Which statistic is best?
The documentation of timeit.repeat mentions this:
Note
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
For example one typically wants to find out how fast the algorithm can be, then one could use the minimum of these repetitions. If one is more interested in the average or median of the timings one can use those measurements. In most cases the number one is most interested in is the minimum, because the minimum resembles how fast the execution can be - the minimum is probably the one execution where the process was least interrupted (by other processes, by GC, or had the most optimal memory operations).
To illustrate the differences I repeated the above timing again but this time I included the minimum, mean, and median:
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import numpy as np
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto', color='black', label='measurements')
plt.tight_layout()
plt.axvline(np.min(r), c='lime', label='min')
plt.axvline(np.mean(r), c='red', label='mean')
plt.axvline(np.median(r), c='blue', label='median')
plt.legend()
Contrary to this "advise" (see quoted documentation above) IPythons %timeit reports the average instead of the min(). However they also only use a repeat of 7 by default - which I think is too less to accurately determine the minimum - so using the average in this case is actually sensible.It's a great tool to do a "quick-and-dirty" timing.
If you need something that allows to customize it based on your needs, one could use timeit.repeat directly or even a 3rd party module. For example:
pyperf
perfplot
simple_benchmark (my own library)
It looks like the latest version of %timeit is taking the average of the r n-loop averages, not the best of the averages.
Evidently, this has changed from earlier versions. The best time of the r averages can still be obtained via the TimeitResult return argument, but it is no longer the value that is displayed.
Comment: I recently ran the code from above and found that the following syntax no longer works:
n = 1
r = 50
tr = %timeit -n $n -r $r -q -o pass; compute_mean(x,np)
That is, it seems it is no longer possible to use $var to pass a variable to the timeit magic command. Does this mean that this magic command should be retired and replaced with the timeit module?
I am using Python 3.7.4.
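One workaround, if the $var substitution really doesn't work in your setup, is to drop down to the timeit module itself, which accepts a globals mapping so the statement can reference your variables directly (compute_mean, x and np are the names from the comment above and are assumed to be defined in the notebook):

import timeit

n = 1
r = 50

# times compute_mean(x, np) much like the magic would, using the notebook's names
times = timeit.repeat("compute_mean(x, np)", repeat=r, number=n, globals=globals())
print(min(times) / n)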
I decided to translate some of Python code from Peter Harrington's Machine Learning in Action into Julia, starting with kNN algorithm.
After normalizing a dataset he provided, I wrote a few functions: find_kNN(), mass_kNN (a function that finds kNN for multiple inputs), and a function that splits a given dataset into randomly picked train and test datasets, calls mass_kNN(), and plots the resulting accuracy multiple times.
Then I compared the runtimes between the Julia code and the equivalent Python code. (I am using Distances in Julia to find the Euclidean distance and Gadfly to plot, but turning the plotting off doesn't do much to affect the time.)
Results:
Julia:
elapsed time: 1.175523034 seconds (455531636 bytes allocated, 47.54% gc time)
Python:
time elapsed: 0.9517326354980469 sec
I am wondering if there is a way to speed up my Julia code, or if it's running as fast as possible at this point (I mean, whether there are any glaring mistakes that keep the code from running as fast as it could).
Julia notebook
Python notebook
repo with both notebooks and the dataset
Thanks!..
Edit: Removing the convert() statements and passing everything around as Real slowed the time down to 2.29 seconds.
So first of all, I removed the last two lines of plot_probs - plotting isn't really a great thing to benchmark, I think, and it's largely beyond my (or your) control - you could try PyPlot if it's a real factor. I also timed plot_probs a few times to see how much time is spent compiling it the first time:
**********elapsed time: 1.071184218 seconds (473218720 bytes allocated, 26.36% gc time)
**********elapsed time: 0.658809962 seconds (452017744 bytes allocated, 40.29% gc time)
**********elapsed time: 0.660609145 seconds (452017680 bytes allocated, 40.45% gc time)
So there is a 0.3s penalty paid once. Moving on to the actual algorithm, I used the built-in profiler (e.g. @profile plot_probs(norm_array, 0.25, [1:3], 10, 3)), which revealed that essentially all of the time is spent here:
~80%: [ push!(dist, euclidean(set_array[i,:][:], input_array)) for i in 1:size(set_array, 1) ]
~20%: [ d[i] = get(d, i, 0) + 1 for i in labels[sortperm(dist)][1:k] ]
Using array comprehensions like that is not idiomatic Julia (or Python for that matter). The first is also slow because all that slicing makes many copies of the data. I'm not an expert with Distances.jl, but I think you can replace it with
dist = Distances.colwise(Euclidean(), set_array', input_array)
d = Dict{Int,Int}()
for i in labels[sortperm(dist)][1:k]
    d[i] = get(d, i, 0) + 1
end
which gave me
**********elapsed time: 0.731732444 seconds (234734112 bytes allocated, 20.90% gc time)
**********elapsed time: 0.30319397 seconds (214057552 bytes allocated, 37.84% gc time)
More performance could be extracted by doing the transpose once in mass_kNN, but that required touching a few too many places and this post is long enough. Trying to micro-optimize it led me to using
dist = zeros(size(set_array, 1))
@inbounds for i in 1:size(set_array, 1)
    d = 0.0
    for j in 1:length(input_array)
        z = set_array[i,j] - input_array[j]
        d += z*z
    end
    dist[i] = sqrt(d)
end
which gets it to
**********elapsed time: 0.646256408 seconds (158869776 bytes allocated, 15.21% gc time)
**********elapsed time: 0.245293449 seconds (138817648 bytes allocated, 35.40% gc time)
which takes about half the time off - but it's not really worth it, and it's less flexible (e.g. what if I wanted L1?). Other code review points (unsolicited, I know):
I find Vector{Float64} and Matrix{Float64} much easier on the eye than Array{Float64,1} and Array{Float64,2} and less likely to be confused.
Float64[] is more usual than Array(Float64, 0)
Int64 can just be written as Int, since it doesn't need to be a 64-bit integer.
This is actually an interesting topic from Programming Pearls: sorting 10-digit telephone numbers in limited memory with an efficient algorithm. You can find the whole story here.
What I am interested in is just how fast the implementation could be in Python. I have done a naive implementation with the BitVector module. The code is as follows:
from BitVector import BitVector
import timeit
import random
import time
import sys

def sort(input_li):
    return sorted(input_li)

def vec_sort(input_li):
    bv = BitVector( size = len(input_li) )
    for i in input_li:
        bv[i] = 1
    res_li = []
    for i in range(len(bv)):
        if bv[i]:
            res_li.append(i)
    return res_li

if __name__ == "__main__":
    test_data = range(int(sys.argv[1]))
    print 'test_data size is:', sys.argv[1]
    random.shuffle(test_data)

    start = time.time()
    sort(test_data)
    elapsed = (time.time() - start)
    print "sort function takes " + str(elapsed)

    start = time.time()
    vec_sort(test_data)
    elapsed = (time.time() - start)
    print "vec_sort function takes " + str(elapsed)
I have tested array sizes from 100 to 10,000,000 on my MacBook (2 GHz Intel Core 2 Duo, 2 GB SDRAM); the results are as follows:
test_data size is: 1000
sort function takes 0.000274896621704
vec_sort function takes 0.00383687019348
test_data size is: 10000
sort function takes 0.00380706787109
vec_sort function takes 0.0371489524841
test_data size is: 100000
sort function takes 0.0520560741425
vec_sort function takes 0.374383926392
test_data size is: 1000000
sort function takes 0.867373943329
vec_sort function takes 3.80475401878
test_data size is: 10000000
sort function takes 12.9204008579
vec_sort function takes 38.8053860664
What disappoints me is that even when the test_data size is 10,000,000, the sort function is still faster than vec_sort. Is there any way to accelerate the vec_sort function?
As Niki pointed out, you are comparing a very fast C routine with a Python one. Using psyco speeds it up a little bit for me, but you can really speed it up by using a bit vector module written in C. I used bitarray and then the bit sorting method surpasses the built-in sort for an array size of about 250,000 using psyco.
Here's the function that I used:
from bitarray import bitarray

def vec_sort2(input_li):
    bv = bitarray(len(input_li))
    bv.setall(0)
    for i in input_li:
        bv[i] = 1
    return [i for i in xrange(len(bv)) if bv[i]]
Notice also that I have used a list comprehension to construct the sorted list which helps a bit. Using psyco and the above function with your functions I get the following results:
test_data size is: 1000000
sort function takes 1.29699993134
vec_sort function takes 3.5150001049
vec_sort2 function takes 0.953999996185
As a side note, BitVector isn't especially optimized even for Python. Before I found bitarray, I made various tweaks to the BitVector module, and with my tweaked version the time for vec_sort is reduced by over a second for this size of array. I haven't submitted my changes, though, because bitarray is just so much faster.
My Python isn't the best but it looks like you have a bug in your code:
bv = BitVector( size = len(input_li) )
The size of your bitvector is the same as the size of your input array. You want the bitvector to be the size of your domain - 10^10. I'm not sure how Python's bitvectors deal with overflows, but if it automatically resizes the bitvector then you are getting quadratic behavior.
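In other words, a sketch of the sizing this answer describes - the bit vector covers the whole domain of possible values, not just the input length (max_value is a hypothetical parameter: an exclusive upper bound on the values, e.g. 10**10 for 10-digit phone numbers):

from BitVector import BitVector

def vec_sort_domain(input_li, max_value):
    # one bit per possible value in the domain, not one per input element
    bv = BitVector(size=max_value)
    for i in input_li:
        bv[i] = 1
    return [i for i in range(len(bv)) if bv[i]]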
Additionally I imagine that Python's sort function is implemented in C and is not going to have the overhead of a sort implemented purely in Python. However that probably wouldn't cause an O(nlogn) algorithm to run substantially faster than an O(n) algorithm.
Edit: also, this sort will only pay off on large data sets. Your algorithm runs in O(n + 10^10) time (based on your tests I assume you know this), which will be worse than O(n log n) for small inputs.