pyCUDA reduction doesn't work - python

I am using reduction code essentially identical to the examples in the docs. The code below should return 2^3 + 2^3 = 16, but instead it returns 9. What did I do wrong?
import numpy
import pycuda.reduction as reduct
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule as module
newzeros = [{1,2,3},{4,5,6}]
gpuSum = reduct.ReductionKernel(numpy.uint64, neutral="0", reduce_expr="a+b", map_expr="1 << x[i]", arguments="int* x")
mylengths = pycuda.gpuarray.to_gpu(numpy.array(map(len,newzeros),dtype = "uint64",))
sumfalse = gpuSum(mylengths).get()
print sumfalse

I just figured it out. The argument list used when defining the kernel should be unsigned long *x, not int *x. I was using 64-bit integers everywhere else and it messed it up.
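For reference, here is a minimal sketch of the corrected version, reusing the same toy input as above; it should print 16:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.reduction as reduct

# Same reduction as above, but the kernel argument type now matches the
# uint64 input array (the fix described above).
gpuSum = reduct.ReductionKernel(numpy.uint64, neutral="0", reduce_expr="a+b", map_expr="1 << x[i]", arguments="unsigned long *x")
newzeros = [{1,2,3},{4,5,6}]
mylengths = gpuarray.to_gpu(numpy.array(map(len, newzeros), dtype="uint64"))
print gpuSum(mylengths).get()   # 2^3 + 2^3 = 16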

Related

How to use C# UInt16[,] in python

I am using clr to import a C# DLL in Python.
One of the functions returns ushort[,],
which shows up as System.UInt16[,] in Python.
How can I convert System.UInt16[,] to a numpy uint16 matrix?
I can do the conversion only by looping over the matrix, reading each element and assigning its value to the corresponding position in another numpy matrix, but this solution is very slow.
Is there a faster conversion method that can use numpy vectorization?
Here's a sample of my loop:
import clr
import os
import numpy as np

dll_name = os.path.join(os.path.abspath(os.path.dirname(__file__)), ("mydll") + ".dll")
clr.AddReference(dll_name)
from mynamespace import myclass

myobject = myclass()
numpy_matrix = np.empty([80,260], dtype=np.uint16)
SystemInt16_matrix = myobject.Getdata()
for i in range(20):
    for j in range(32):
        numpy_matrix[i,j] = SystemInt16_matrix[i,j]
I found the solution: instead of the loop, I can use np.fromiter and reshape.
import clr
import os
import numpy as np
dll_name = os.path.join(os.path.abspath(os.path.dirname(__file__)), ("mydll") + ".dll")
clr.AddReference(dll_name)
from mynamespace import myclass
myobject = myclass()
SystemInt16_matrix = myobject.Getdata()
numpy_matrix = np.fromiter(SystemInt16_matrix, np.uint16).reshape((20, 32))
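If the array is large, it may also help to pass an explicit count so np.fromiter can preallocate the output. A small sketch, assuming SystemInt16_matrix and the 20 x 32 shape from the snippet above:

# count lets np.fromiter preallocate; rows and cols stand for the actual dimensions.
rows, cols = 20, 32
numpy_matrix = np.fromiter(SystemInt16_matrix, dtype=np.uint16, count=rows * cols).reshape(rows, cols)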

Generating single random number in pyCuda kernel

I have seen many ways to generate an array of random numbers, but I want to generate a single random number. Is there any function like rand() in C++? I don't want a series of random numbers, I just need to generate one random number inside the kernel. Is there any built-in function for that? I have tried the code below, but it is not working.
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

code = """
#include <curand_kernel.h>

__device__ float getRand()
{
    curandState_t s;
    curand_init(clock64(), 123456, 0, &s);
    return curand_uniform(&s);
}

__global__ void myRand(float *values)
{
    values[0] = getRand();
}
"""

mod = SourceModule(code)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
Errors are like:
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_poisson.h(548): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(69): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(78): error: this declaration may not have extern "C" linkage
/usr/local/cuda/bin/../targets/x86_64-linux/include/curand_discrete2.h(86): error: this declaration may not have extern "C" linkage
30 errors detected in the compilation of "kernel.cu".
The basic problem is that, by default, PyCUDA silently applies C linkage to all code compiled in a SourceModule. As the errors show, cuRAND requires C++ linkage, so getRand can't have C linkage.
You can fix this either by changing these two lines:
mod = SourceModule(code)
myRand = mod.get_function("myRand")
to
mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("_Z6myRandPf")
This disables C linkage, but does mean you need to supply the C++ mangled name to the get_function call. You will need to look at the verbose compiler output or compile the code outside of PyCUDA to get that name (for example Godbolt).
Alternatively you can modify the code like this:
import numpy as np
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

code = """
#include <curand_kernel.h>

__device__ float getRand()
{
    curandState_t s;
    curand_init(clock64(), 123456, 0, &s);
    return curand_uniform(&s);
}

extern "C" {
__global__ void myRand(float *values)
{
    values[0] = getRand();
}
}
"""

mod = SourceModule(code, no_extern_c=True)
myRand = mod.get_function("myRand")
gdata = gpuarray.zeros(2, dtype=np.float32)
myRand(gdata, block=(1,1,1), grid=(1,1,1))
print(gdata)
This leaves the kernel with C linkage, but doesn't touch the device function which uses cuRAND.
You can import random in Python and use random.randint() to generate a random number in a specified range by passing the range to the function, e.g. random.randint(0, 50).
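If generating the number on the host is acceptable, a minimal sketch of that idea (the kernel and its name use_value are illustrative, not from the original post) is to draw the value with random.randint and pass it to the kernel as an ordinary argument:

import random
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void use_value(float *out, float r)
{
    out[0] = r;   // the host-generated random value arrives as a plain kernel argument
}
""")
use_value = mod.get_function("use_value")

r = float(random.randint(0, 50))        # host-side random integer in [0, 50]
out = np.zeros(1, dtype=np.float32)
use_value(drv.Out(out), np.float32(r), block=(1,1,1), grid=(1,1))
print(out)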

Numba error: No implementation of function Function(<function all at #>) found for signature

I'm trying to write background subtraction with Python and NumPy, and I'm using numba to make it run faster, but I get this error:
No implementation of function Function(<function all at 0x000002782257B940>) found for signature:
all(array(bool, 3d, C), axis=Literal[int](-1))
Here's the code:
import numpy as np
from PIL import Image
from numba import njit
import time

start_time = time.time()
im1 = Image.open('u.jpg')
w, h = im1.size
for x in range(100):
    for y in range(100):
        im1.putpixel((x,y), (255,255,255))
im2 = Image.open('u.jpg')

@njit
def Check():
    data1 = np.asarray(im1)
    data2 = np.asarray(im2)
    out_im = np.zeros((h,w,3), dtype='uint8')
    value = np.all(data1 != data2, axis=-1)
    rs, cs = value.nonzero()
    out_im[rs,cs,:] = [255,0,0]
    return out_im

im_out = Image.fromarray((Check()))
im_out.save('save.png')
How can I fix this?

Unrolling a trivially parallelizable for loop in python with CUDA

I have a for loop in Python that I want to unroll onto a GPU. I imagine there has to be a simple solution, but I haven't found one yet.
Our function loops over elements in a numpy array and does some math, storing the result in another numpy array. Each iteration adds a little to this result array. A greatly simplified version of our code might look something like this:
import numpy as np

a = np.arange(100)
out = np.array([0, 0])
for x in xrange(a.shape[0]):
    out[0] += a[x]
    out[1] += a[x]/2.0
How can I unroll a loop like this in Python to run on a GPU?
The place to start is http://documen.tician.de/pycuda/. The example there is:
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1), grid=(1,1))
print dest-a*b
You place the part of the code you want to parallelize in the C code segment and call it from Python.
For your example, the size of your data will need to be much bigger than 100 to make it worthwhile. You'll need some way to divide your data into blocks. If you wanted to add 1,000,000 numbers, you could divide them into 1000 blocks, add each block in the parallelized code, and then add the partial results in Python; a rough sketch of that approach follows.
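A minimal sketch of that block-wise approach (the kernel name, block size and array size here are illustrative, not from the question): each CUDA block reduces a slice of the input to one partial sum in shared memory, and the partial sums are then added on the host with numpy.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void block_sums(float *partial, const float *a, int n)
{
    __shared__ float cache[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? a[i] : 0.0f;
    __syncthreads();
    // tree reduction within one block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}
""")
block_sums = mod.get_function("block_sums")

a = numpy.arange(1000000, dtype=numpy.float32)
threads = 256
blocks = (a.size + threads - 1) // threads
partial = numpy.zeros(blocks, dtype=numpy.float32)
block_sums(drv.Out(partial), drv.In(a), numpy.int32(a.size), block=(threads,1,1), grid=(blocks,1))
print partial.sum()   # final addition of the partial sums on the host

The same pattern extends to the two accumulators in the question by keeping two partial arrays (or one partial array of pairs).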
Adding things up is not really a natural task for this type of parallelisation, though: GPUs tend to do the same task for each pixel, whereas here you have a task which needs to operate across multiple pixels.
It might be better to learn CUDA itself first. A related thread is:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

How to use well scipy.optimize.fmin

Hello, I'm trying to use scipy.optimize.fmin to minimize a function, but things aren't going well: my computation seems to be diverging instead of converging, and I get an error. I tried setting a tolerance but it is not working.
Here is my code (Main program):
import sys,os
import numpy as np
from math import exp
import scipy
from scipy.optimize import fmin
from carlo import *
A=real()
x_r=0.11245
x_i=0.14587
#C=A.minim
part_real=0.532
part_imag=1.2
R_0 = fmin(A.minim,[part_real,part_imag],xtol=0.0001)
And the class:
import sys,os
import numpy as np
import random, math
import matplotlib.pyplot as plt
import cmath
#import pdb
#pdb.set_trace()

class real:
    def __init__(self):
        self.nmodes = 4
        self.L_ch = 1
        self.w = 2

    def minim(self,p):
        x_r = p[0]
        x_i = p[1]
        x = complex(x_r,x_i)
        self.a = complex(3,4)*(3*np.exp(1j*self.L_ch))
        self.T = np.array([[0.0, 2.0*self.a],[(0.00645+(x)**2), 4.3*x**2]])
        self.Id = np.array([[1,0],[0,1]])
        self.disp = np.linalg.det(self.T-self.Id)
        print self.disp
        return self.disp
The error is:
(-2.16124712985-8.13819476595j)
/usr/local/lib/python2.7/site-packages/scipy/optimize/optimize.py:438: ComplexWarning: Casting complex values to real discards the imaginary part
fsim[0] = func(x0)
(-1.85751684826-8.95377303768j)
/usr/local/lib/python2.7/site-packages/scipy/optimize/optimize.py:450: ComplexWarning: Casting complex values to real discards the imaginary part
fsim[k + 1] = f
(-2.79592712985-8.13819476595j)
(-3.08484130014-7.36240080015j)
(-3.68788935914-6.62639114029j)
/usr/local/lib/python2.7/site-packages/scipy/optimize/optimize.py:475: ComplexWarning: Casting complex values to real discards the imaginary part
fsim[-1] = fxe
(-2.62046851255e+87-1.45013007728e+88j)
(-4.037931857e+87-2.2345341712e+88j)
(-7.45017628087e+87-4.12282179854e+88j)
(-1.14801242605e+88-6.35293780534e+88j)
(-2.11813751435e+88-1.17214723347e+89j)
Warning: Maximum number of function evaluations has been exceeded.
Actually I don't understand why the computation is diverging. Maybe I have to use something else instead of fmin for minimizing?
Does anyone have an idea?
Thank you very much.
Try to optimize the absolute value instead of the complex value. That gave a decent result for me.
f = lambda x: abs(A.minim(x))
R_0 = fmin(f,[part_real,part_imag],xtol=0.0001)
I guess fmin doesn't work well with complex values: it expects a real-valued objective, so it silently drops the imaginary part of the return value (hence the ComplexWarning), and the real part alone isn't a sensible thing to minimize here.
