PyCUDA: Best way of calling a kernel multiple times - python

I'm using PyCUDA to write a relativistic raytracer. Basically, for each "pixel" in a big 2D array we must solve a system of 6 ODEs using Runge-Kutta. As each integration is independent of the rest, this should parallelize very easily. Other people have achieved it using C/C++ CUDA with excellent results (see this project).
The problem is that I do not know what the best way of doing this is. I'm writing a kernel that performs some Runge-Kutta steps and then returns the results to the CPU. This kernel is called many times in order to integrate the whole ray. The problem is that, for some reason, it is very slow. Of course, I know that memory transfers are a real bottleneck in CUDA, but since this is really slow I'm starting to think that I'm doing something wrong.
It would be great if you could recommend best programming practices for this case (using PyCUDA). Some things I'm wondering:
Do I need to create a new context on each kernel call?
Is there a way to avoid transferring memory from GPU to CPU, that is, to start a kernel, pause it to get some information, restart it, and repeat?
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link, which does a similar operation). I think this is due to something wrong with the way I'm using PyCUDA, so if you can explain the best way to do such an operation, that would be great!
To clarify: the reason I have to pause/restart the kernel is the watchdog. Kernels that run for more than 10 seconds get killed by the watchdog.
Thank you in advance!

Your main question is too general, and it's hard to give concrete advice without seeing the code, but I'll try to answer your subquestions (not an actual answer, but it's a bit long for a comment).
Do I need to create a new context on each kernel call?
No.
Is there a way to avoid transferring memory from GPU to CPU, that is, to start a kernel, pause it to get some information, restart it, and repeat?
Depends on what you mean by "get some information". If it means doing stuff with it on the CPU, then, of course, you have to transfer it. If you want to use it in another kernel invocation, then you don't need to transfer it.
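To illustrate the second case, here is a minimal sketch in which the ray state lives in a gpuarray, a hypothetical step kernel is launched many times, and nothing is copied back to the host until the very end; the kernel body, sizes and launch configuration are made up for the example.

import numpy as np
import pycuda.autoinit                  # one context for the whole run
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# Hypothetical kernel: advances every ray state by n_steps RK4 steps in place.
mod = SourceModule("""
__global__ void step_kernel(float *state, int n_rays, int n_steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rays) return;
    for (int s = 0; s < n_steps; ++s) {
        // ... one RK4 step on state[6*i] .. state[6*i+5] ...
    }
}
""")
step_kernel = mod.get_function("step_kernel")

n_rays = 1024 * 1024
state = np.zeros((n_rays, 6), dtype=np.float32)
state_gpu = gpuarray.to_gpu(state)      # single host-to-device copy

# Many short launches keep each kernel well under the watchdog limit,
# but the state stays on the GPU between launches.
for chunk in range(100):
    step_kernel(state_gpu, np.int32(n_rays), np.int32(50),
                block=(256, 1, 1), grid=((n_rays + 255) // 256, 1))

result = state_gpu.get()                # single device-to-host copy at the end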
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link, which does a similar operation).
It really depends on the equations, the number of threads, and the video card you are using. I can imagine a situation in which one RK step would take that long.
I think this is due to something wrong with the way I'm using PyCUDA, so if you can explain the best way to do such an operation, that would be great!
Impossible to say for sure without the code. Try to create a minimal demonstrating example, or, at the very least, post a link to a runnable (even if it's rather long) piece of code that illustrates your problem. As for PyCUDA, it's a very thin wrapper over CUDA, and all the programming practices that apply to the latter apply to the former as well.

I might be able to help you with the memory handling, i.e. not having to copy from CPU to GPU during your iterations. I am evolving a system through time using Euler time-stepping, and the way I keep all my data on the GPU is given below.
However, the problem with this is that once the first kernel has been launched, the CPU keeps executing the lines after it, i.e. the boundary kernels get launched before the time-evolution step has finished.
What I need is a way to synchronize things. I have tried doing it using strm1.synchronize() (see my code), but it does not always work. If you have any ideas on this, I would really appreciate your input. Thanks!
import sys
import numpy as np
import pycuda.driver as drv
# Kern_diffIteration, Kern_boundaryX0/X1/Y0/Y1 (and the commented-out Kern_copy)
# are kernels compiled elsewhere via SourceModule(...).get_function(...)

def curveShorten(dist, timestep, maxit):
    """
    iterates the function image on a 2d grid through an euler anisotropic
    diffusion operator with timestep=timestep maxit number of times
    """
    image = 1*dist
    forme = image.shape
    if np.size(forme) > 2:
        sys.exit('Only works on gray images')

    aSize = forme[0]*forme[1]
    xdim = np.int32(forme[0])
    ydim = np.int32(forme[1])

    image[0,:] = image[1,:]
    image[xdim-1,:] = image[xdim-2,:]
    image[:,ydim-1] = image[:,ydim-2]
    image[:,0] = image[:,1]

    # np arrays I need to store things on the CPU; image is the initial
    # condition and final is the final state
    image = image.reshape(aSize, order='C').astype(np.float32)
    final = np.zeros(aSize).astype(np.float32)

    # allocating memory on the GPU
    image_gpu = drv.mem_alloc(image.nbytes)
    final_gpu = drv.mem_alloc(final.nbytes)

    # sending data to each memory location
    drv.memcpy_htod(image_gpu, image)   # host-to-device copy
    drv.memcpy_htod(final_gpu, final)

    # block size: B := dim1*dim2*dim3 = 1024
    # grid size : dim1*dim2*dim3 = ceiling(aSize/B)
    blockX = int(1024)
    multiplier = aSize/float(1024)
    if aSize/float(1024) > int(aSize/float(1024)):
        gridX = int(multiplier + 1)
    else:
        gridX = int(multiplier)

    strm1 = drv.Stream(1)
    ev1 = drv.Event()
    strm2 = drv.Stream()

    for k in range(0, maxit):
        Kern_diffIteration(image_gpu, final_gpu, ydim, xdim, np.float32(timestep),
                           block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)
        strm1.synchronize()
        if strm1.is_done() == 1:
            Kern_boundaryX0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
            Kern_boundaryX1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm1)
            Kern_boundaryY0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm2)
            Kern_boundaryY1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))  #,stream=strm1)
        if strm1.is_done() == 1:
            drv.memcpy_dtod(image_gpu, final_gpu, final.nbytes)
            #Kern_copy(image_gpu, final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)

    drv.memcpy_dtoh(final, final_gpu)   # device-to-host copy
    #final_gpu.free()
    #image_gpu.free()
    return final.reshape(forme, order='C')
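For the synchronization question left open above, one useful fact: operations issued to the same CUDA stream execute in issue order. So a minimal change, sketched below with the same kernel handles and buffers as in the code above, is to issue every kernel launch and the device-to-device copy on one stream (here simply the default stream) and synchronize once before reading back. This is a sketch under those assumptions, not a tested drop-in replacement.

for k in range(maxit):
    Kern_diffIteration(image_gpu, final_gpu, ydim, xdim, np.float32(timestep),
                       block=(blockX,1,1), grid=(gridX,1,1))
    # These are queued behind the diffusion kernel because they share its stream.
    Kern_boundaryX0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
    Kern_boundaryX1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
    Kern_boundaryY0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
    Kern_boundaryY1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
    drv.memcpy_dtod(image_gpu, final_gpu, final.nbytes)   # also queued in order

drv.Context.synchronize()           # wait for all queued work
drv.memcpy_dtoh(final, final_gpu)   # blocking device-to-host copy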

Related

sympy compiling functions with large matrices

I have been using sympy to work with systems of differential equations. I write the equations symbolically, use autowrap to compile them through Cython, and then pass the resulting function to the scipy ODE solver. One of the major benefits of doing this is that I can solve for the jacobian symbolically using the sympy jacobian function, compile it, and pass it to the ODE solver as well.
This has been working great for systems of about 30 variables. Recently I tried doing it with 150 variables, and what happened was that I ran out of memory when compiling the jacobian function. This is on Windows with Anaconda and the Microsoft Visual C++ 14 tools for Python. Basically, during compilation of the jacobian, which is now a 22000-element vector, memory usage during the linking step went up to about 7 GB (on my 8 GB laptop) before finally crashing out.
Does someone have some suggestions before I go and try on a machine with more memory? Are other operating systems or other C compilers likely to improve the situation?
I know lots of people do this type of work, so if there's an answer, it will be beneficial to a good chunk of the community.
Edit: response to some of Jonathan's comments:
Yes, I'm fully aware that this is an N^2 problem. The jacobian is a matrix of all partial derivatives, so it will have size N^2. There is no real way around this scaling. However, a 22000-element array is not nearly at the level that would create a memory problem during runtime -- I only have the problem during compilation.
Basically there are three levels that we can address this at.
1) solve the ODE problem without the jacobian, or somehow split up the jacobian so as not to have a 150x150 matrix. That would address the very root of the problem, but it certainly limits what I can do, and I'm not yet convinced that it's impossible to compile the jacobian function.
2) change something about the way sympy automatically generates C code, to split it up into multiple chunks, use more functions for intermediate expressions, to somehow make the .c file smaller. People with more sympy experience might have some ideas on this.
3) change something about the way the C is compiled, so that less memory is needed.
I thought that by posting a separate question more oriented around #3 (literal referencing of large array -- compiler out of memory), I would get a different audience answering. That is in fact exactly what happened. Perhaps the answer to #3 is "you can't", but that's also useful information.
Following a lot of the examples posted at http://www.sympy.org/scipy-2017-codegen-tutorial/ I was able to get this to compile.
The key things were
1) instead of using autowrap, write the C code directly with more control over it. Among other things, this allows passing the argument list as a vector instead of expanding it. This took some effort to get working (setting up the compiler flags through distutils, etc, etc) but in the end it worked well. Having the repo from the course linked above as an example helped a lot.
2) using common subexpression elimination (sympy.cse) to dramatically reduce the size of the expressions for the jacobian elements.
(1) by itself didn't do that much to help in this case (although I was able to use it to vastly improve performance of smaller models). The code was still 200 MB instead of the original 300 MB. But combining it with (2) (cse) I was able to get it down to a meager 1.7 MB (despite 14000 temporary variables).
The cse takes about 20-30 minutes on my laptop. After that, it compiles quickly.
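As a toy illustration of point (2), not taken from the actual model, this is roughly how cse splits a jacobian into shared temporaries before code generation:

import sympy as sp

x, y = sp.symbols('x y')
f = sp.Matrix([sp.sin(x*y) + x*y**2, sp.cos(x*y) - x*y**2])
J = f.jacobian([x, y])

# cse returns (temporaries, reduced_exprs): each temporary is computed once
# and reused, which is what shrinks the generated C so dramatically.
temps, reduced = sp.cse(J)
for name, expr in temps:
    print(name, '=', expr)
print(reduced[0])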

Tensorflow - Profiling using timeline - Understand what is limiting the system

I am trying to understand why each train iteration takes approximately 1.5 seconds.
I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: it seems that every operation is relatively fast and the system is idle most of the time between operations. How can I understand from this what is limiting the system?
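(For reference, the TF 1.x timeline-tracing pattern referred to above looks roughly like this; sess and train_op come from the model code, which is not shown.)

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# One traced training step.
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace file that chrome://tracing can open.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())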
It does seem however that when I drastically reduce the batch size the gaps close, as could be seen here.
Unfortunately the code is very complicated and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is taking up the time in the gaps between operations?
Thanks!
EDIT:
Running on CPU only, I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem but I mention it for completeness.)

memory exhaust on big matrix operation using dask

Currently I'm implementing this paper for my undergraduate thesis in Python, but I only use the Mahalanobis metric learning part (in case you're curious).
In short, I face a problem when I need to learn a matrix of size 67K*67K consisting of integers, computed simply as numpy.dot(A.T,A) where A is a random vector of shape (1,67K). When I do that it simply throws a MemoryError, since my PC only has 8 GB of RAM and a rough calculation puts the memory needed at about 16 GB just to initialize it. Then I searched for an alternative and found dask.
So I moved on to dask with dask.array.dot(A.T,A), and that works. But then I need to apply a whitening transformation to that matrix, which in dask I can do via the SVD. But every time I compute that SVD, the IPython kernel dies (I assume due to lack of memory).
This is what I do so far, from initialization until the kernel dies:
import dask.array as da

fv_length = 512*2*66
W = da.random.randint(10, 20, (fv_length), (1000, 1000))
W = da.reshape(W, (1, fv_length))
W_T = W.T
Wt = da.dot(W_T, W); del W, W_T            # the 67K x 67K product
Wt = da.reshape(Wt, (fv_length*fv_length/2, 2))
U, S, Vt = da.linalg.svd(Wt); del Wt       # the IPython kernel dies here
I never get U, S, and Vt out.
Is my memory simply not enough for this sort of thing, even when I'm using dask?
Or is this actually not a spec problem, but bad memory management on my part?
Or something else?
At this point I'm desperate enough to try a bigger-spec PC, so I am planning to rent a bare-metal server with 32 GB of RAM. Even if I do so, will that be enough?
Generally speaking, dask.array does not guarantee out-of-core operation for all computations. A square matrix-matrix multiply (or any L3 BLAS operation) is more or less impossible to do efficiently in small memory.
You can ask Dask to use an on-disk cache for intermediate values. See the FAQ under the question My computation fills memory, how do I spill to disk?. However this will be limited by disk-writing speeds, which are generally fairly slow.
A large memory machine and NumPy is probably the simplest way to resolve this problem. Alternatively you could try to find a different formulation of your problem.
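If streaming the big intermediate straight to disk is acceptable, something along these lines may work. This is only a sketch: it assumes h5py is installed, the chunk sizes are illustrative, and the result is on the order of 36 GB on disk, so the disk-speed caveat above still applies (the SVD of that matrix remains a separate, harder problem).

import dask.array as da

fv_length = 512 * 2 * 66                      # 67584
W = da.random.randint(10, 20, size=(1, fv_length), chunks=(1, 4096))

# The 67K x 67K outer product never has to exist in RAM all at once:
# compute it chunk by chunk and stream the chunks into an HDF5 file.
Wt = da.dot(W.T, W)                           # shape (fv_length, fv_length)
da.to_hdf5('Wt.h5', '/Wt', Wt)                # requires h5py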

Loop through collision detection with multiple processes

I'm using Python to write an ideal gas simulator, and right now the collision detection is the most intensive part of the program. At the moment though, I'm only using one of my 8 cores. (I'm using an i7 3770 @ 3.4 GHz.)
After minimal googling I found the multiprocessing module for Python (2.7.4), and I've tried it. With a bit of thought I realised the only thing I can really run in parallel is this loop, where I iterate over all the particles to detect collisions:
for ball in self.Objects:
    if not foo == ball:
        foo.CollideBall(ball, self.InternalTimestep)
Here foo is the particle that I'm testing against all the others.
So I tried doing this:
for ball in self.Objects:
    if not foo == ball:
        p = multiprocessing.Process(target=foo.CollideBall, args=(ball, self.InternalTimestep))
        p.start()
Although the program does run a little faster, it's still only using about 1.5 cores to their fullest extent; the rest are just idle, and it's not detecting any collisions either! I've read that if you create too many processes at once (more than the number of cores) you get a backlog (this is a loop over 196 particles), so that might explain why it's slower than I expected, but it doesn't explain why I'm still not using all my cores!
Either way it's too slow! So is there a way I can create 8 processes, and only create a new one when there are fewer than 8 processes running already? Will that even solve my problem? And how do I use all of my cores, and why isn't this code doing so already?
I only found out about multiprocessing in python yesterday, so I'm afraid any answers will have to be spelt out to me.
Thanks for any help though!
---EDIT---
In response to Carson, I tried adding p.join() directly after p.start(), and this slowed the programme right down. Instead of taking 0.2 seconds per cycle, it's taking 24 seconds per cycle!
As far as I understand, you test one particle against all others and then perform that operation on each particle in turn. Based on this, I'd say your problem is that you try to optimize your code to work on all cores without trying to optimize your code itself.
Instead you could partition your particles so that you only check those that are close to each other. One possible means of doing so is a quadtree: see http://en.wikipedia.org/wiki/Quadtree.
In a second step you can parallelize everything. For quadtrees you resolve the topmost level by hand and create a new process for each subtree; that way the processes are independent of each other and don't block. I'd expect a quadratic speed-up from the quadtree (think square root of your current run time), plus a further linear speed-up (divide by the number of processes) from parallelization.
Sorry, I can't spell it out in Python.
With a working quadtree, you could set up a thread pool (as a class) and define jobs (another class) that are allocated to individual threads (yet another class, if possible from a threading framework). In your case a job contains a list of quadtree nodes that have to be inspected. Initially each top-level quadtree node (4 in 2D / 8 in 3D) resides in its own job.
So you can have up to 4 (respectively 8) threads, each inspecting an independent subtree of the quadtree. If you need more threads to fully use your machine's processing power, you can have threads put part of their jobs back into the thread pool if they encounter many deep subtrees.
For this, I'd use a BFS (breadth-first search) over the list of quadtree nodes from the job. If the list gets longer than expected, I'd put part of it back into the thread pool. Some maths/statistics/stochastics knowledge helps in finding a good parameterization for what length to expect.
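A rough Python sketch of that pool/job pattern follows, using processes rather than threads (CPython threads won't help with CPU-bound work). Node, the toy quadtree and collide_within() are placeholders standing in for the real quadtree and for foo.CollideBall().

import multiprocessing as mp
from collections import namedtuple

Node = namedtuple('Node', ['particles', 'children'])

def collide_within(node):
    # Placeholder: test all particle pairs inside one quadtree node.
    hits = []
    for i, a in enumerate(node.particles):
        for b in node.particles[i + 1:]:
            hits.append((a, b))
    return hits

def worker(job_queue, result_queue):
    # Each job is a list of quadtree nodes; a worker drains jobs until it
    # receives the None stop signal.
    for nodes in iter(job_queue.get, None):
        hits = []
        for node in nodes:
            hits.extend(collide_within(node))
        result_queue.put(hits)

if __name__ == '__main__':
    leaves = [Node(particles=list(range(4*i, 4*i + 4)), children=[]) for i in range(4)]
    root = Node(particles=[], children=leaves)       # 4 subtrees in 2D, 8 in 3D

    job_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(job_queue, result_queue))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for top in root.children:
        job_queue.put([top])                         # one job per top-level subtree
    for _ in workers:
        job_queue.put(None)
    results = [result_queue.get() for _ in root.children]
    for w in workers:
        w.join()
    print(sum(len(r) for r in results), 'candidate pairs')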
I've also written a quadtree implementation that parameterizes itself according to the expected number of objects, given the "world" size, and that calculates the average object size.
Search for the open-source project d-collide. Although it's in C++, there should be some useful sample code. But please mind its licensing, which doesn't ask for much as it's BSD-style.
I added this as a second answer, because the first one was about optimizing your code to achieve your implied goal: better run time (although it's via better efficiency)
This second answer is about achieving your written goal: stronger parallelization.
However, the quadtree is what enables this second step; just don't expect the second speed-up to be as large as the first. Especially when it comes to many objects, nothing beats an optimized algorithm. But don't lose yourself in micro-optimizations: see the runtime discussion in "Cancelling a Task is throwing an exception".

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive task for a computational linguistics project, but the problem I have is quite general, and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is step 3 (and, by extension, step 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of the vectors in terms of Python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing their values at one or more indices by some amount (which may differ from index to index); see the sketch after this list.
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
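Concretely, the update pattern described above looks roughly like this; process_line(), the toy corpus and the reduced dimensionality are placeholders, not the actual code:

from collections import defaultdict
import numpy as np

DIM = 1000                                 # > 4,000,000 in the real problem
vectors = defaultdict(lambda: np.zeros(DIM, dtype=np.float32))

def process_line(line):
    # Placeholder: yield (vector_key, index, increment) triples for one line.
    for token in line.split():
        yield token, hash(token) % DIM, 1.0

corpus = ['a toy corpus line', 'another line']   # stands in for the 5-30 GB corpus
for line in corpus:
    for key, idx, delta in process_line(line):
        vectors[key][idx] += delta               # create if needed, then increment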
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelve or pickle or some such library.
I could build the vectors in memory one at a time, writing each to disk and passing through the corpus once per vector.
All these options are fairly intractable. Option 1 just uses up all the system memory, and the system panics and slows to a crawl. Option 2 is way too slow, as I/O operations aren't fast. Option 3 is possibly even slower than 2, for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done in memory (within reason) while minimising how many times data must be read from, or written to, disk?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if they were in memory.
Edit: because I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O operations per second and virtually no seek times. The size of your project is perfect for today's affordable SSD drives.
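A small sketch of the PyTables idea (file and node names invented): an extendable on-disk array that finished vectors are appended to, and that can later be sliced like a NumPy array without loading the whole thing.

import numpy as np
import tables

DIM = 4000000
h5 = tables.open_file('vectors.h5', mode='w')
vecs = h5.create_earray(h5.root, 'vectors', tables.Float32Atom(), shape=(0, DIM))

vecs.append(np.zeros((1, DIM), dtype=np.float32))   # append one finished vector
row_slice = vecs[0, :1000]                          # read back a slice, not the whole row
h5.close()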
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use NumPy arrays to represent vectors. They are much more memory-efficient, at the cost of forcing all elements of a vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable), then do another pass over all the data with the second 1M vectors, and so on (see the sketch below).
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370 GB. If the multiple-passes method is viable and you've got a machine with 16 GB of RAM, then that is about ~25 passes; this should be easy to parallelize if you've got a cluster.
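A sketch of the multiple-passes idea, again with placeholder helpers and toy sizes (in practice BATCH would be the 1M vectors mentioned above and read_corpus() the real line generator):

import numpy as np

DIM, BATCH = 1000, 2

def read_corpus():
    return ['a b a', 'b c']                       # placeholder corpus

def process_line(line):
    return [(tok, hash(tok) % DIM, 1.0) for tok in line.split()]

all_keys = sorted({k for line in read_corpus() for k, _, _ in process_line(line)})

for start in range(0, len(all_keys), BATCH):
    active = set(all_keys[start:start + BATCH])
    vectors = {k: np.zeros(DIM, dtype=np.float32) for k in active}
    for line in read_corpus():                    # one full pass over the corpus per batch
        for key, idx, delta in process_line(line):
            if key in active:
                vectors[key][idx] += delta
    # write this batch to disk here; its memory is freed before the next pass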
Think about using an existing in-memory DB solution like Redis. The problem of spilling to disk once RAM is gone, and the tricks to tweak that process, should already be solved there. There is a Python client as well.
Moreover this solution could scale vertically without much effort.
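A minimal sketch of what that might look like with redis-py (key names invented, a local Redis server assumed); storing each vector as a Redis hash of its non-zero indices also keeps the 4,000,000-dimensional vectors sparse:

import redis

r = redis.Redis()                 # assumes a Redis server on localhost

def bump(vector_id, index, delta):
    # One hash per vector; fields are the non-zero indices.
    # Use r.hincrbyfloat(...) instead if the increments are not integers.
    r.hincrby('vec:%s' % vector_id, index, delta)

bump('dog', 1047, 3)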
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of the code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB. (Redis and MongoDB are good candidates.)
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach or a distributed task queue like Celery).
Plus all the tips mentioned before (NumPy, etc.).
Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps on disk, in memcached, etc.), and then look there first before calculating it again. The Django docs have a good introduction.
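In Python that pattern can be as simple as a dictionary check in front of the costly call; do_expensive_work() below is a placeholder:

_cache = {}

def do_expensive_work(key):
    return len(key) ** 2              # placeholder for the real costly computation

def expensive(key):
    # Compute once, then answer from the dictionary on every later call.
    if key not in _cache:
        _cache[key] = do_expensive_work(key)
    return _cache[key]

print(expensive('token'), expensive('token'))   # second call is a dict lookup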
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of RAM disk with a file system, or a database. No idea which one is best for you.
Have a smallish shell script monitor RAM usage and spawn, every second, another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try to capture these lines for processing.
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
while someCondition:
    if psutil.phymem_usage().percent > 80.0:
        # in newer psutil versions this is psutil.virtual_memory().percent
        dumpQueue(myQueue, somefile)
    else:
        addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.
