Hi, I am trying to estimate the run time of my FFT code from numpy with different input lengths N. The following is my code.
import cmath
import math
from random import uniform
from numpy.fft import fft
import time
for i in range(3,10):
    N = 2**i
    x = [uniform(-32768,32767) for i in range(N)]
    t0 = time.clock()
    X = fft(x)
    t1 = time.clock()
    print t1-t0
This is the result I got. The first line, with the smallest input length (i=3, so N=8), should be the fastest one, but no matter how many times I run it, the first one is always the largest. I guess this is a problem with the timer, but I don't know the exact reason for it. Can anyone explain this to me?
Output:
4.8e-05
3e-05
1.7e-05
6e-05
3.1e-05
5.4e-05
9.6e-05
The time interval is too small to be accurately measured by time.clock(), as there is latency jitter in the OS call. Instead, do enough work (loop each fft a few thousand or million times) until the work to be measured takes a few seconds. Also repeat each measurement several times and take an average, as there may be other system overheads (cache flushes, process switches, etc.) that can vary the performance.
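A minimal sketch of that advice, assuming Python 3 and the standard timeit module: each FFT is repeated many times per measurement, and each measurement is repeated several times.
import timeit
from random import uniform
from numpy.fft import fft

for i in range(3, 10):
    N = 2 ** i
    x = [uniform(-32768, 32767) for _ in range(N)]
    # run this FFT 10000 times per measurement, repeat the measurement 5 times
    best = min(timeit.repeat(lambda: fft(x), number=10000, repeat=5))
    print(N, best / 10000)  # best average time per call for this N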
Related
Problem: How long does it take to generate a Python list of prime numbers from 1 to N? Plot a graph of time taken against N.
I used SymPy to generate the list of primes.
I expected the time to increase monotonically.
But why is there a dip?
import numpy as np
import matplotlib.pyplot as plt
from time import perf_counter as timer
from sympy import sieve
T = []
tic=timer()
N= np.logspace(1,8,30)
for Nup in N:
    tic = timer()
    A=list(sieve.primerange(1,Nup))
    toc = timer()
    T.append(toc-tic)
plt.loglog(N,T,'x-')
plt.grid()
plt.show()
Time taken to generate primes up to N
The sieve's runtime grows steadily with N (roughly N log log N for a classic sieve), so plotting the pure runtime of a sieve against N on a log-log scale should come out as roughly a straight line for large N.
In your copy of the plot it looks like it's actually getting a bit worse over time; when I run your script it's not perfectly straight, but it is close to a straight line on the log scale towards the end. However, there is a bit of a bend at the start, as in your result.
This makes sense because the sieve caches previous results: initially it gets little benefit from that, and there is a small overhead of setting up the cache and increasing its size, which goes down over time; more importantly, there is the overhead of the actual call to the sieve routine. Also, this type of performance measurement is very sensitive to anything else going on on your system, including whatever Python and your IDE are doing.
Here's your code with some added code to loop over various initial runs, warming the cache of the sieve before every run - it shows pretty clearly what the effect is:
import numpy as np
import matplotlib.pyplot as plt
from time import perf_counter as timer, sleep
from sympy import sieve
for warmup_step in range(0, 5):
    warmup = 100 ** warmup_step
    sieve._reset()  # this resets the internal cache of the sieve
    _ = list(sieve.primerange(1, warmup))  # warming the sieve's cache
    _ = timer()  # avoid initial delays from other elements of the code
    sleep(3)
    print('Start')
    times = []
    tic = timer()
    numbers = np.logspace(1, 8, 30)
    for n in numbers:
        tic = timer()
        _ = list(sieve.primerange(1, n))
        toc = timer()
        times.append(toc - tic)
        print(toc, n)  # provide some visual feedback of speed
    plt.loglog(numbers, times, 'x-')
    plt.title(f'Warmup: {warmup}')
    plt.ylim(1e-6, 1e+1)  # fix the y-axis, so the charts are easily comparable
    plt.grid()
    plt.show()
The lesson to be learned here is that you need to consider overhead, not just from your own code and the libraries you use, but from the entire system that sits around them: the Python VM, your IDE, whatever else is running on your workstation, the OS, and the hardware.
The test above is better, but if you want really nice results, run the whole thing a dozen times and average out the results over runs.
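A rough sketch of that averaging, reusing the names from the snippet above (the number of runs is arbitrary): each run starts from a cold sieve cache and the per-N times are averaged across runs.
import numpy as np
from time import perf_counter as timer
from sympy import sieve

n_runs = 10
numbers = np.logspace(1, 8, 30)
all_times = np.zeros((n_runs, len(numbers)))
for run in range(n_runs):
    sieve._reset()  # start each run from a cold cache
    for j, n in enumerate(numbers):
        tic = timer()
        _ = list(sieve.primerange(1, n))
        toc = timer()
        all_times[run, j] = toc - tic
mean_times = all_times.mean(axis=0)  # average per N across the runs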
What I've tried
I have an embarrassingly parallel for loop in which I iterate over 90x360 values in two nested for loops and do some computation. I tried dask.delayed to parallelize the for loops as per this tutorial although it is demonstrated for a very small set of iterations.
Problem description
I'm surprised to find that the parallel code took 2 h 39 min, compared to the non-parallel timing of 1 h 54 min, which suggests that either I'm doing something fundamentally wrong or the task graphs are too big to handle?
Set-up info
This test was done on a subset of my iterations, 10 x 360, but the optimized code should be able to handle 90 x 360 nested iterations. My mini-cluster has 66 cores and 256 GB of RAM, and the 2 data files are 4 GB and < 1 GB. I'm also confused between a multi-processing and a multi-threading approach for this task. I thought running the parallel loops in multiple processes, similar to the joblib default implementation, would be the way to go, as each loop works on independent grid points. But this suggests that multi-threading is faster and should be preferred if one doesn't have a GIL issue (which I don't). So, for the timing above, I used dask.delayed's default scheduler, which uses multi-threading within a single process.
Simplified code
import numpy as np
import pandas as pd
import xarray as xr
from datetime import datetime
from dask import compute, delayed
def add_data_from_small_file(lat):
    """For each grid point, get time steps from the big file as per the masks, and
    compute data from the small file for those time steps.
    Returns: array per latitude, which is to be stacked.
    """
    # mask1, mask2, temp_var and temp_per_lon are defined in the full
    # (non-simplified) code.
    for lon in range(0, 360):
        # get time steps from the big file
        start_time = big_file.time.values[mask1[:, lat, lon]]
        end_time = big_file.time.values[mask2[:, lat, lon]]
        i = 0
        for t1, t2 in zip(start_time, end_time):
            # calculate a value from the small file for each time pair
            temp_var[i] = small_file.sel(t=slice(t1, t2)).median()
            i = i + 1
        temp_per_lon[:, lon] = temp_var
    return temp_per_lon
if __name__ == '__main__':
    t1 = datetime.now()
    small_file = xr.open_dataarray('small_file.nc')  # size < 1 GB, 10000x91
    big_file = xr.open_dataset('big_file.nc')        # size = 4 GB, 10000x91x360
    delayed_values = [delayed(add_data_from_small_file)(lat) for lat in range(0, 10)]  # 10 loops for testing, to scale to 90 loops
    # have to delay stacking to avoid a memory error
    stack_arr = delayed(np.stack)(delayed_values, axis=1)
    stack_arr = stack_arr.compute()
    print('Total run time: {}'.format(datetime.now() - t1))
Every delayed task adds about 1ms of overhead. So if your function is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
If you're curious about whether or not threads or processes are better for you, the easiest way to find out is just to try both. It is easy to do.
dask.compute(*values, scheduler="processes")
dask.compute(*values, scheduler="threads")
It could be that even though you're using numpy arrays, most of your time is actually spent in Python for loops. If so, multithreading won't help you here, and the real solution is to stop using Python for loops, either by being clever with numpy/xarray, or by using a project like Numba.
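As a hypothetical illustration of that last point (the function name, array sizes and index arrays below are made up, not taken from the question), Numba can compile a per-point loop over plain NumPy arrays so the median over each time slice no longer pays Python interpreter overhead:
import numpy as np
from numba import njit

@njit
def medians_between(series, starts, ends):
    """Median of series[starts[i]:ends[i]] for each pair of indices."""
    out = np.empty(len(starts))
    for i in range(len(starts)):  # this loop is compiled, not interpreted
        out[i] = np.median(series[starts[i]:ends[i]])
    return out

series = np.random.rand(10000)
starts = np.random.randint(0, 5000, size=360)
ends = starts + np.random.randint(1, 5000, size=360)
result = medians_between(series, starts, ends)  # first call compiles, later calls are fast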
I'm using SageMath to perform some mathematical calculations, and at one point I have a for loop that looks like this:
uni = {}
end = (l[idx]^(e[idx] - 1)) * (l[idx] + 1) # where end in my case is about 2013265922,
# but can also be much much larger too.
for count in range(0, end):
    i = randint(1, 303325737249669131)  # this executes very fast in Sage
    if i in uni:
        uni[i] += 1
    else:
        uni[i] = 1
So basically, I want to create a very large number of random integers in the given range, check whether each number is already in the dictionary, increment its count if it is, and initialize it to 1 if not. But the loop takes so long that it doesn't finish in a reasonable amount of time, not because the operations inside the loop are complicated, but because there is a huge number of iterations to perform. Therefore, I want to ask whether there is any way to avoid (or speed up) this kind of loop in Python?
I profiled your code (use cProfile for this) and the vast majority of the time is spent within the randint function, which is called for each iteration of the loop.
I recommend you vectorize the loop using numpy's random number generation, and then use a single call to the Counter class to extract the frequency counts.
import numpy
import numpy.random
from collections import Counter

assert 303325737249669131 < 18446744073709551615  # limit for uint64

# "end" is the number of iterations, as computed in the question
numbers = numpy.random.randint(low=0, high=303325737249669131, size=end,
                               dtype=numpy.uint64)
frequency = Counter(numbers)
For a loop of 1000000 iterations (smaller than the one you suggest) I observed a reduction from 6 seconds to about 1 second. So even with this you cannot expect more than an order of magnitude reduction in terms of computation time.
You may think that keeping an array of all the values in memory is inefficient and may lead to memory exhaustion before the computation ends. However, due to the small value of "end" compared with the range of the random integers, the rate at which you will be recording collisions is low, so the memory cost of a full array is not significantly larger than storing the dictionary. However, if this becomes an issue you may wish to perform the computation in batches, as sketched below. In the same spirit you may also want to use the multiprocessing facilities to distribute the computation across many CPUs or even many machines (but look out for network costs if you choose that).
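A rough sketch of that batching idea (the batch size is an arbitrary illustrative value, and "end" is the total count from the question): draw the random numbers in fixed-size chunks and merge the counts, so only one chunk is in memory at a time.
import numpy
from collections import Counter

end = 2013265922          # total number of draws, from the question
batch_size = 10000000     # arbitrary chunk size (10 million)
frequency = Counter()
remaining = end
while remaining > 0:
    n = min(batch_size, remaining)
    batch = numpy.random.randint(low=1, high=303325737249669131, size=n,
                                 dtype=numpy.uint64)
    frequency.update(batch)  # merge this chunk's counts into the running total
    remaining -= n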
The biggest speedup you can get without low-level magic is using defaultdict, i.e.:
from collections import defaultdict

uni = defaultdict(int)
for count in range(0, end):
    i = randint(1, 303325737249669131)  # this executes very fast in Sage
    uni[i] += 1
If you're using python2, change range to xrange.
Apart from that, I'm pretty sure this is somewhere near the limit for Python. The loop is:
generating a random integer (optimized as much as possible without static typing)
calculating its hash
updating the dict (with defaultdict, the if-else branching is factored into more optimized code)
from time to time, making malloc calls to resize the dict; this is fast (considering the inability to preallocate memory for a dict)
I need to write a functional test that asserts that one run is significantly faster than another.
Here is the code I have written so far:
def test_run5(self):
    cmd_line = ["python", self.__right_def_file_only_files]
    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime1 = end - start
    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime2 = end - start
    self.assertTrue(runtime2 < runtime1 * 1.4)
It works, but I don't like this approach because the 1.4 factor was chosen experimentally from my specific example of execution.
How would you test that the second execution is always faster than the first?
EDIT
I didn't think it would be necessary to explain, but in the context of my program it is not up to me to say what factor is significant for an unknown execution.
The whole program is a kind of Make, and it is the pipeline definition file that defines what a "significant difference of speed" is, not me:
If the definition file contains a lot of rules that are very fast, the difference in execution time between two consecutive executions will be very small, let's say 5% faster, but still significant.
Otherwise, if the definition file contains few rules but very long ones, the difference will be big, let's say 90% faster, so a difference of 5% would not be significant at all.
I found an equation, the Michaelis-Menten kinetics equation, that fits my needs. Here is the function that should do the trick:
def get_best_factor(full_exec_time, rule_count, maximum_ratio=1):
    average_rule_time = full_exec_time / rule_count
    return 1 + (maximum_ratio * average_rule_time / (1.5 + average_rule_time))
The full_exec_time parameter is runtime1, which is the maximum execution time for a given pipeline definition file.
rule_count is the number of rules in the given pipeline definition file.
maximum_ratio means that the second execution can be, at most, 100% faster than the first (impossible in practice).
The variable parameter of the Michaelis-Menten equation is the average rule execution time. I have arbitrarily chosen 1.5 seconds as the average rule execution time at which the second execution should be maximum_ratio / 2 faster; this is the parameter that depends on your use of this equation.
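As a rough usage sketch (rule_count is assumed to be read from the pipeline definition file), the computed factor would simply replace the hard-coded 1.4 in the original assertion:
factor = get_best_factor(runtime1, rule_count)
self.assertTrue(runtime2 < runtime1 * factor)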
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time
def tester(num):
    return np.cos(num)

if __name__ == '__main__':
    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print('The time taken with multiple processes:', timetaken)
    print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this tutorial:
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Your job seems to be the computation of a single cos value. This is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1,000,000 cos values each and you should see them running in parallel.
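A minimal sketch of that suggestion: give each worker one large chunk of numbers instead of a single value, so the per-task communication cost is amortized over a big computation (the chunk count of 5 mirrors the suggestion above).
import numpy as np
from multiprocessing import Pool

def cos_chunk(chunk):
    return np.cos(chunk)  # one vectorized call covers the whole chunk

if __name__ == '__main__':
    chunks = np.array_split(np.arange(5000000), 5)  # 5 chunks of 1,000,000 values
    with Pool() as pool:
        results = pool.map(cos_chunk, chunks)  # one task per chunk
    result = np.concatenate(results)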
First, you wrote:
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the two displayed times are the same. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I got:
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed version is slower than the plain for loop. Why?
The multiprocessing module uses multiple processes, each with its own memory space (and its own interpreter with its own GIL), which means memory is not shared between your processes. So when you launch a Pool, you need to copy the useful variables to the workers, run the calculation there, and retrieve the results. This costs a little time for every task and makes you less efficient.
But this happens because your computation per task is very small: multiprocessing is only useful for larger calculations, where copying the memory and retrieving the results is cheaper (in time) than the calculation itself.
I tried with the following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
    A = np.random.rand(10*num)    # creation of a random 1D array
    for k in range(0, len(A)-1):  # some useless but costly operation
        A[k+1] = A[k]*A[k+1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that for an expensive calculation, multiprocessing is more efficient, even if you don't always get what you might expect (I could have had a 4x speedup, but I only got 2x).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be expensive in memory.
If you really want to improve a small calculation like your example, make it big by grouping: send a list of values to the pool instead of one value per task.
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed those up.
If the problem is CPU bound then you should see the expected speed-up (if the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes), it is easier to end up with a memory-bound problem.