Python for-loop slower each iteration

I am trying to optimize some Python code (to speed up some matrix operations); my code is similar to the following (my real dataset is also similar to 'gps'):
import numpy as np

gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros((len(gps), len(gps)), dtype='float32')
for i in xrange(len(gps)):
    for j in xrange(0, i+1):
        ips[i,j] = f.innerProd(gps[i], gps[j])
        ips[j,i] = ips[i,j]
    print "Inner product matrix: %3.0f %% done (%d of %d)" % \
          (((i+1)**2.)/(len(gps)**2.)*100, i, len(gps))

def innerProd(mat1, mat2):
    return float(np.sum(np.dot(np.dot(mat1, mat2), mat1)))
What I would like to understand is: why does the program start out fast during the first iterations and then slow down as it iterates further? I know the question might be a bit naive, but I really want to have a clearer idea of what is happening before I attempt anything else. I already implemented my function in Fortran (keeping any for loops within the Fortran realm) and used f2py to create a dynamic lib to call the function from Python; this would be the new code in Python:
import numpy as np
import myfortranInnProd as fip
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')
ips = fip.innerProd(gps)
Unfortunately I only found out (surprisingly) that my Fortran-Python version runs 1.5~2 times slower than the first version (it is important to mention that I used MATMUL() in the Fortran implementation). I have been googling around for a while and I believe this "slow down" has something to do with memory bandwidth, memory allocation, or caching, given the large datasets, but I am not very sure what is really happening behind the scenes and how I could improve performance. I have run the code on both a small Intel Atom with 2GB RAM and a 4-core Intel Xeon with 8GB (of course with a correspondingly scaled dataset), and the "slow down" behavior is the same.
I just need to understand why this 'slow down' happens. Would it do any good if I implemented the function in C, or tried to make it run on a GPU? Any other ideas on how to improve it? Thanks in advance.

At the risk of stating the obvious, the number of executions of the inner loop will grow each time you complete an execution of the outer loop. When i is 0, the inner loop will only be executed once, but when i is 100, it will be executed 101 times. Could this explain your observations, or do you mean that each execution of the inner loop itself is getting slower over time?

The number of executions of the inner for loop depends on the value of i, the index of the outer for loop. Since you're displaying your debug each time the inner loop finishes, it gets displayed less and less often as i grows. (Note that the percentage increases regularly, however.)
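A tiny illustrative sketch (not from the original post) of the triangular growth both answers describe: the inner loop body runs i+1 times on outer iteration i, so each progress line takes longer and longer to appear even though the per-call cost stays constant.

n = 1000
per_outer = [i + 1 for i in range(n)]   # inner-loop iterations for each outer step i
total = sum(per_outer)                  # n*(n+1)/2 inner-product calls in all
print("iterations at i=0, i=499, i=999: %d, %d, %d" % (per_outer[0], per_outer[499], per_outer[999]))
print("total inner-product calls: %d" % total)   # 500500, not 1000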

Related

Efficient way to calculate on all row-pairs of a large matrix?

I need to do some time-consuming calculations on all row-pairs of a large matrix M, like:
for i in range(n):
    for j in range(i+1, n):
        time_comsuming_calculation(M[i,:], M[j,:])
Since I am new to parallel computing, after studying the example in Writing parallel computation results in shared memory, I tried to do the parallel computing with joblib as below:
dump(M, M_name)
M = load(M_name, mmap_mode='r')
...
Parallel(n_jobs=num_cores)(delayed(paracalc)(u, v, M)
                           for u, v in itertools.combinations(range(M.shape[0]), 2))
However, it turned out to be unbearably slower than the non-parallel version. Computing each row-pair consumed even more seconds than with num_cores=1.
I am wondering what's wrong with my parallel implementation. Is mpi4py a better choice? Any suggestions will be appreciated.
Okay, still no answers but I've managed to work it out.
The first interesting fact I found is when I commented out these two lines,
# dump(M, M_name)
# M=load(M_name,mmap_mode='r')
so that the memmapped array was no longer used in place of the in-memory array, it went much faster. I don't know why, as of now. Is there a memmap lock or something?
Then, I read this article Parallel and HPC with Python (or numpy) and decided to turn to mpi4py. After hours of struggling with debugging, I got satisfying results.
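The answer doesn't show the mpi4py code; for readers curious what such a pair distribution might look like, here is a minimal, hypothetical sketch (not the poster's code; pair_calc is a placeholder for the real row-pair calculation):

# Run with: mpiexec -n 4 python pairs_mpi.py
from itertools import combinations
import numpy as np
from mpi4py import MPI

def pair_calc(u, v):                     # placeholder for the real, expensive calculation
    return float(np.dot(u, v))

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    M = np.random.rand(200, 50)          # in practice, load the real matrix here
else:
    M = None
M = comm.bcast(M, root=0)                # every rank works on the same data

pairs = list(combinations(range(M.shape[0]), 2))
local = {(i, j): pair_calc(M[i, :], M[j, :])         # each rank takes every size-th pair
         for k, (i, j) in enumerate(pairs) if k % size == rank}

gathered = comm.gather(local, root=0)    # collect partial results on rank 0
if rank == 0:
    results = {}
    for d in gathered:
        results.update(d)
    print("computed %d pairs" % len(results))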

What could be wrong with this matrix multiplication benchmark (Matlab vs Numpy)?

Here's the code I wrote for comparing performance of numpy vs Matlab. It just measures the average time taken for matrix multiplication (1701x576 matrix M1 * 576x576 matrix M2).
Matlab version (M1 is a 1701x576 matrix while M2 is 576x576):
function r = benchmark(M1,M2)
    total_time = 0;
    for i = 1:4
        for j = 1:1500
            tic;
            a = M1*M2;
            tim = toc;
            total_time = total_time + tim;
        end
    end
    avg_time = total_time/4
    r = avg_time
end
Python version :
def benchmark():
    t_time = 0                      # accumulated multiplication time
    iters = range(1500)
    for i in range(4):
        for j in iters:
            tic = time.time()
            a = M1.dot(M2)          # M1 and M2 are looked up as globals here
            toc = time.time() - tic
            t_time = t_time + toc
    return t_time/4
The Matlab version takes almost ~18.2s, while the Python version takes ~19.3s. I've repeated this test multiple times, and Matlab always performed better than Python (even if by a small difference) in all cases. My understanding is that Numpy uses efficient, compiled code for vector operations and is supposed to be faster than Matlab.
Then why could Matlab perform faster than Numpy? The test was done on a 32-core machine.
Where did I go wrong? Or is it expected for Numpy to be slower than Matlab?
Are there ways to improve the performance for Python?
Edit: Updated the Matlab code to fix the loop index/return value error. The error was the result of me trying to edit the names in the snippet to make it presentable just before posting (a bad idea every time :) ).
[edited to remove the mention of loops; that was my mistake]
Couple things--
First, the multicore nature of the machine doesn't really matter unless you're explicitly using those extra cores (or linking NumPy against a BLAS library that uses multiple cores -- thanks @ali_m). If you're not, it'll run about as fast on a 32-core machine as it will on a 4-core machine (assuming the clock speeds of the cores themselves are roughly equal).
Second, using purely off-the-shelf Matlab vs off-the-shelf NumPy, Matlab generally beats out NumPy. This is a very general statement, though; YMMV. Also, speaking of Matlab, there does indeed appear to be a bug in the looping indices.
Third, this may not be the best benchmark for performance; there may be some unseen caching issues taking place under the hood that aren't obvious. A better one would be to randomly generate the matrices on-the-fly in each iteration and multiply them, but even this could be problematic depending on the random number generator.
There are two major issues I can see with the test.
The first is that you are using global variable lookup in Python while you are using local variable lookup in MATLAB. Global variable lookup in Python is relatively slow. Making sure the variables are local like they are in MATLAB will affect the performance.
The second is that you are re-doing the same calculation over and over. MATLAB has a JIT for loops and numpy has a cache for calculations, both of which can reduce the time for repeated calculations.
So to make the comparisons more equal and reliable, you should create new, random matrices each time through the loop. This will prevent caching and JIT from messing up your results, and will make sure the variables are all local.
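A rough sketch of what such a benchmark might look like (matrix sizes taken from the question; default_timer is just a portable wall-clock timer, and the iteration count is an arbitrary choice):

import numpy as np
from timeit import default_timer as timer

def benchmark(n_iters=100):
    total = 0.0
    for _ in range(n_iters):
        M1 = np.random.rand(1701, 576)   # fresh random data each pass defeats caching
        M2 = np.random.rand(576, 576)
        tic = timer()
        result = M1.dot(M2)              # only the multiplication is timed
        total += timer() - tic
    return total / n_iters

print(benchmark())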
There is a bug in your Matlab code. It appears that you are using the same loop control variable in nested loops.
The outer loop actually only runs once.
Edit: The outer loop actually runs the correct number of times. The two loop control variables seem to be independent.

Pycuda: Best way of calling Kernel multiple times

I'm using pycuda to make a relativistic raytracer. Basically, for each "pixel" in a big 2D array we must solve a system of 6 ODEs using Runge Kutta. As each integration is independent of the rest, it should be very easy. Other people have achieved this using C/C++ CUDA with excellent results (see this project).
The problem is that I do not know the best way of doing this. I'm writing a Kernel that does some Runge Kutta steps and then returns the results to the CPU. This Kernel is called many times in order to get the whole ray integrated. The problem is that for some reason it is very slow. Of course, I know that memory transfers are really a bottleneck in CUDA, but as this is really slow I'm starting to think that I'm doing something wrong.
It would be great if you could recommend the best programming practices for this case (using pycuda). Some things that I'm wondering:
Do I need to create a new context on each Kernel call?
Is there a way to not have to transfer memory from GPU to CPU, that is, starting a Kernel, pausing it to get some information, restarting it, and repeating?
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link that does a similar operation). And I think this is due to something wrong with the way I'm using pycuda, so if you can explain the best way to do such an operation, that would be great!
To clarify: the reason I have to pause/restart the Kernel is the watchdog. Kernels that run for more than 10 seconds get killed by the watchdog.
Thank you in advance!
Your main question seems too general, and it's hard to give concrete advice without seeing the code. I'll try to answer your subquestions (not an actual answer, but it's a bit long for a comment).
Do I need to create a new context on each Kernel call?
No.
Is there a way to not have to transfer memory from GPU to CPU, that is, starting a Kernel, pausing it to get some information, restarting it, and repeating?
Depends on what you mean by "get some information". If it means doing stuff with it on the CPU, then, of course, you have to transfer it. If you want to use it in another kernel invocation, then you don't need to transfer it.
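A hedged sketch of that "no transfer between launches" pattern using pycuda.gpuarray (assumes a CUDA-capable GPU and pycuda installed; the kernel body is just a stand-in for real work): the array stays in GPU memory across many launches, and only the final result is copied back to the host.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void step(float *y, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        y[idx] += 0.1f * y[idx];   // stand-in for one integration substep
}
""")
step = mod.get_function("step")

n = 1024
y_gpu = gpuarray.to_gpu(np.ones(n, dtype=np.float32))   # one host-to-device copy

for k in range(1000):                                    # many launches, no transfers
    step(y_gpu, np.int32(n), block=(256, 1, 1), grid=((n + 255) // 256, 1))

y = y_gpu.get()                                          # single device-to-host copy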
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link that does a similar operation).
It really depends on the equation, the number of threads and the video card you are using. I can imagine a situation when one RK step would take that long.
And I think this is due to something wrong with the way I'm using pycuda, so if you can explain the best way to do such an operation, that would be great!
Impossible to say for sure without the code. Try to create some minimal demonstrating example, or, at the very least, post a link to a runnable (even if it's rather long) piece of code that illustrates your problem. As for PyCUDA, it's a very thin wrapper over CUDA, and all the programming practices that apply to the latter apply to the former as well.
I might be able to help you with the memory handling, i.e. not having to copy from CPU to GPU during your iterations. I am evolving a system through time using Euler timestepping, and the way I keep all my data on the GPU is given below.
However, the problem with this is that once the first kernel has been launched, the CPU keeps executing the lines after it, i.e. the boundary kernel gets launched before the time evolution step.
What I need is a way to synchronize things. I have tried doing it using strm.synchronize() (see my code) but it does not always work. If you have any ideas on this, I would really appreciate your input! Thanks!
import sys
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
# Kern_diffIteration and the Kern_boundary* functions are kernels compiled
# elsewhere with SourceModule.

def curveShorten(dist, timestep, maxit):
    """
    iterates the function image on a 2d grid through an euler anisotropic
    diffusion operator with timestep=timestep maxit number of times
    """
    image = 1*dist
    forme = image.shape
    if(np.size(forme) > 2):
        sys.exit('Only works on gray images')

    aSize = forme[0]*forme[1]
    xdim = np.int32(forme[0])
    ydim = np.int32(forme[1])

    image[0,:] = image[1,:]
    image[xdim-1,:] = image[xdim-2,:]
    image[:,ydim-1] = image[:,ydim-2]
    image[:,0] = image[:,1]

    #np arrays i need to store things on the CPU, image is the initial
    #condition and final is the final state
    image = image.reshape(aSize, order='C').astype(np.float32)
    final = np.zeros(aSize).astype(np.float32)

    #allocating memory on the GPU
    image_gpu = drv.mem_alloc(image.nbytes)
    final_gpu = drv.mem_alloc(final.nbytes)

    #sending data to each memory location
    drv.memcpy_htod(image_gpu, image)  #host to device copying
    drv.memcpy_htod(final_gpu, final)

    #block size: B := dim1*dim2*dim3 = 1024
    #grid size : dim1*dim2*dim3 = ceiling(aSize/B)
    blockX = int(1024)
    multiplier = aSize/float(1024)
    if(aSize/float(1024) > int(aSize/float(1024))):
        gridX = int(multiplier + 1)
    else:
        gridX = int(multiplier)

    strm1 = drv.Stream(1)
    ev1 = drv.Event()
    strm2 = drv.Stream()

    for k in range(0, maxit):
        Kern_diffIteration(image_gpu, final_gpu, ydim, xdim, np.float32(timestep),
                           block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)
        strm1.synchronize()
        if(strm1.is_done() == 1):
            Kern_boundaryX0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
            Kern_boundaryX1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm1)
            Kern_boundaryY0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm2)
            Kern_boundaryY1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm1)
        if(strm1.is_done() == 1):
            drv.memcpy_dtod(image_gpu, final_gpu, final.nbytes)
            #Kern_copy(image_gpu, final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)

    drv.memcpy_dtoh(final, final_gpu)  #device to host copying
    #final_gpu.free()
    #image_gpu.free()
    return final.reshape(forme, order='C')

How can I rewrite this Python operation so it doesn't hang my system?

Beginner here, looked for an answer, but can't find one.
I know (or rather suspect) that part of the problem with the following code is how big the list of combinations gets.
(Maybe, too, the last line seems like an error, in that if I just run 'print ...' rather than 'comb += ...' it runs quickly and quits. Would 'append' be more graceful?)
I'm not 100% sure if the system hang is due to disk I/O (swapping?), CPU use, or memory... running it under Windows seems to result in a rather large disk I/O by 'System', while under Linux, top was showing high CPU and memory use before it was killed. In both cases though, the rest of the system was unusable while this operation was going (tried it in the Python interpreter directly, as well as in PyCharm).
So two part question: 1) is there some 'safe' way to test code like this that won't affect the rest of the system negatively, and 2) for this specific example, how should I rewrite it?
Trying this code (which I do not recommend!):
from itertools import combinations_with_replacement as cwr

comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb += cwr(iterable, x)
Thanks!
EDIT: I should have specified, but it is Python 2.7 code here as well (I guess the xrange makes it obvious it's not 3 anyway). The Windows machine that's hanging has 4 GB of RAM, but it looks like the hang is on disk I/O. The original problem I was (and still am) working on is a question at codewars.com about how many ways there are to make change given a list of possible coins and an amount to make. The solution I'd come up with worked for small amounts, but not big ones. Obviously, I need to come up with a better algorithm to solve that problem... so this is non-essential code, certainly. However, I would like to know if there's something I can do to set up the programming environment so that bugs in my code don't propagate and choke my system this way.
FURTHER EDIT:
I was working on the problem again tonight, and realized that I didn't need to append to a master list (as some of you hinted in the comments), but just work on the subset that was collected. I hadn't really given enough of the code to make that obvious, but my key problem here was the line:
comb += cwr(iterable, x)
which should have been
comb = cwr(iterable, x)
Since you are trying to compute combinations with replacement, the number of orderings that must be considered will be 4^n (4 because your iterable has 4 items).
More generally speaking, the number of orderings to be computed is the number of elements that can be at any spot in the list, raised to the power of how long the list is.
You are trying to compute 4^n for n between 4 and 99. 4^99 is 4.01734511064748 × 10^59.
I'm afraid not even a quantum computer would be much help computing that.
This isn't a very powerful laptop (3.7 GiB, Intel® Celeron® CPU N2820 @ 2.13GHz × 2, 64-bit Ubuntu), but it did it in 15s or so (though it did slow down noticeably; top showed 100% CPU (dual core) and 35% memory). It took about 15s to release the memory when it finished.
len(comb) was 4,421,240
I had to change your code to
from itertools import combinations_with_replacement as cwr

comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb.extend(list(cwr(iterable, x)))
ED - I just re-tried it as per your original and it does run OK. My mistake. It looks as though it is the memory requirement. If you really need to do this, you could write it to a file.
re-ED: being curious about the back-of-an-envelope complexity calculation above not squaring with my experience, I tried plotting n (X axis) against the length of the list returned by combinations_with_replacement() (Y axis) for iterable lengths i = 2,3,4,5. The result seems to be below n**(i-1), which ties in with the figure I got for (4, 99) above. It's actually (i+n-1)! / n! / (i-1)!, which approximates to n**(i-1)/(i-1)! for n much bigger than i.
Also, running the plot I didn't keep the full comb list in memory, and this improved computer performance quite a bit, so maybe that's the relevant point: rather than producing a giant list and then working on it afterwards, do the calculations in the loop.
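A sketch of that "do the calculations in the loop" idea: consume each combination as it is generated instead of accumulating millions of tuples in a list (the per-combination work here is just a stand-in).

from itertools import combinations_with_replacement as cwr

iterable = [1, 2, 3, 4]
count = 0
total = 0
for x in range(4, 20):                 # keep the range modest for a quick test
    for combo in cwr(iterable, x):     # combinations are produced lazily, one at a time
        count += 1
        total += sum(combo)            # stand-in for the real per-combination work
print("processed %d combinations" % count)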

Python "range" resource consumption

I wrote the following script
Basically, I'm just learning Python for Machine Learning and wanted to check how really computationally intensive tasks would perform. I observe that for 10**8 iterations, Python takes up a lot of RAM (around 3.8 GB) and also a lot of CPU time (just froze my system)
I want to know if there is any way to limit the time/memory consumption either through code or some global settings
Script -
initial_start = time.clock()
for i in range(9):
    start = time.clock()
    for j in range(10**i):
        pass
    stop = time.clock()
    print 'Looping exp(', i, ') times takes', stop - start, 'seconds'
final_stop = time.clock()
print 'Overall program time is', final_stop - initial_start, 'seconds'
In Python 2, range creates a list. Use xrange instead. For a more detailed explanation see Should you always favor xrange() over range()?
Note that a no-op for loop is a very poor benchmark that tells you pretty much nothing about Python.
Also note, as per gnibbler's comment, that Python 3's range works like Python 2's xrange.
Look at this question: How to limit the heap size?
To address your script, the timeit module measures the time it takes to perform an action more accurately
>>> import timeit
>>> for i in range(9):
... print timeit.timeit(stmt='pass', number=10**i)
...
0.0
0.0
0.0
0.0
0.0
0.015625
0.0625
0.468752861023
2.98439407349
Your example is taking most of its time dealing with the gigantic lists of numbers you're putting into memory. Using xrange instead of range will help fix that issue, but you're still using a terrible benchmark: the loop is going to execute over and over and not actually do anything, so the CPU is busy checking the condition and entering the loop.
As you can see, creating these lists is taking the majority of the time here
>>> timeit.timeit(stmt='range(10**7)', number=1)
0.71875405311584473
>>> timeit.timeit(stmt='for i in range(10**7): pass', number=1)
1.093757152557373
Python takes that much RAM because you're creating a very large list of length 10**8 with the range function. That's where iterators become useful.
Use xrange instead of range.
It will work the same way as range does, but instead of creating that large list in memory, xrange will just compute the next index (incrementing its value by 1 each iteration).
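A rough pure-Python sketch of that lazy behaviour (the real xrange is implemented in C, but the idea is the same): the generator hands out one value at a time and never builds the full list.

def lazy_range(n):
    i = 0
    while i < n:
        yield i        # produce one value at a time, never a full list
        i += 1

total = 0
for j in lazy_range(10**6):
    total += j
print(total)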
If you're considering Python for machine learning, take a look at numpy. Its philosophy is to implement all "inner loops" (matrix operations, linear algebra) in optimized C, and to use Python to manipulate input and output and to manage high-level algorithms - sort of like Matlab that uses Python. That gives you the best of both worlds: ease and readability of Python, and speed of C.
To get back to your question, benchmarking numpy operations will give you a more realistic assessment of Python's performances for machine learning.
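A rough sketch of that kind of benchmark (the matrix size is an arbitrary choice): time a vectorized numpy operation rather than an empty interpreter loop.

import numpy as np
from timeit import default_timer as timer

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

tic = timer()
c = a.dot(b)                  # the heavy lifting runs in compiled BLAS code
print("matrix product took %.3f s" % (timer() - tic))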
As regards the CPU, you have a for loop running for billions of iterations without any sort of sleep or pause in between, so it's no wonder the process hogs the CPU completely (at least on a single-core computer).
