I am using QL in Python and have translated parts of the example file
http://quantlib.org/reference/_fitted_bond_curve_8cpp-example.html#_a25;
of how to fit a yield curve with bonds in order to fit a Nelson-Siegel
yield curve to a set of given calibration bonds.
As usual when performing such a non-linear fit, the results depend strongly
on the initial conditions and many (economically meaningless) minima of the
objective function exist. This is why putting constraints on the parameters
is essential for success. To give an example, at times I get negative
tau/lambda parameters and my yield curve diverges.
I did not find how these parameter constraints can be specified in
the NelsonSiegelFitting or the FittedBondDiscountCurve classes. I could
imagine that anyone performing NS fitting in QL will encounter the same
issue.
Thanks to Andres Hernandez for the answer:
Currently it is not possible. However, it is very easy to extend QL to allow it, but I think it needs to be done on the c++. So even though you are using QL in python, can you modify the c++ code and export a new binding? If yes, then you can use the following code, if not then I could just check it into the code, but it will take some time for the pull request to be accepted. In case you can touch the code, you can add something like this:
in nonlinearfittingmethods.hpp:
class NelsonSiegelConstrainedFitting
: public FittedBondDiscountCurve::FittingMethod {
public:
NelsonSiegelConstrainedFitting(const Array& lower, const Array& upper,
const Array& weights = Array(),
boost::shared_ptr<OptimizationMethod> optimizationMethod
= boost::shared_ptr<OptimizationMethod>());
std::auto_ptr<FittedBondDiscountCurve::FittingMethod> clone() const;
private:
Size size() const;
DiscountFactor discountFunction(const Array& x, Time t) const;
Array lower_, upper_;
};
in nonlinearfittingmethods.cpp:
NelsonSiegelConstrainedFitting::NelsonSiegelConstrainedFitting(
const Array& lower, const Array& upper, const Array& weights,
boost::shared_ptr<OptimizationMethod> optimizationMethod)
: FittedBondDiscountCurve::FittingMethod(true, weights, optimizationMethod),
lower_(lower), upper_(upper){
QL_REQUIRE(lower_.size() == 4, "Lower constraint must have 4 elements");
QL_REQUIRE(upper_.size() == 4, "Lower constraint must have 4 elements");
}
std::auto_ptr<FittedBondDiscountCurve::FittingMethod>
NelsonSiegelConstrainedFitting::clone() const {
return std::auto_ptr<FittedBondDiscountCurve::FittingMethod>(
new NelsonSiegelFitting(*this));
}
Size NelsonSiegelConstrainedFitting::size() const {
return 4;
}
DiscountFactor NelsonSiegelConstrainedFitting::discountFunction(const Array& x,
Time t) const {
///extreme values of kappa result in colinear behaviour of x[1] and x[2], so it should be constrained not only
///to be positive, but also not very extreme
Real kappa = lower_[3] + upper_[3]/(1.0+exp(-x[3]));
Real x0 = lower_[0] + upper_[0]/(1.0+exp(-x[0])),
x1 = lower_[1] + upper_[1]/(1.0+exp(-x[1])),
x2 = lower_[2] + upper_[2]/(1.0+exp(-x[2])),;
Real zeroRate = x0 + (x1 + x2)*
(1.0 - std::exp(-kappa*t))/
((kappa+QL_EPSILON)*(t+QL_EPSILON)) -
x2*std::exp(-kappa*t);
DiscountFactor d = std::exp(-zeroRate * t) ;
return d;
}
You then need to add it to the swig interface, but it should be trivial to do so.
Related
I'm detecting faces with haarcascade and tracking them with a webcam using OpenCV. I need to save each face that is tracked. But the problem is when people are moving. In which case the face becomes blurry.
I've tried to mitigate this problem with opencv's dnn face detector and Laplacian with the following code:
blob = cv2.dnn.blobFromImage(cropped_face, 1.0, (300, 300), (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()
confidence = detections[0, 0, 0, 2]
blur = cv2.Laplacian(cropped_face, cv2.CV_64F).var()
if confidence >= confidence_threshold and blur >= blur_threshold:
cv2.imwrite('less_blurry_image', cropped_face)
Here I tried to limit saving a face if it is not blurry due to motion by setting blur_threshold to 500 and confidence_threshold to 0.98 (i.e. 98%).
But the problem is if I change the camera I have to change the thresholds again manually. And in most of the cases setting a threshold omits most of the faces.
Plus, it is difficult to detect since the background is always clear compared to the blurred face.
So my question is how can I detect this motion blur on a face. I know I can train an ML model for motion blur detection of a face. But that would require heavy processing resources for a small task.
Moreover, I will be needing a huge amount of annotated data for training if I go that route. Which is not easy for a student like me.
Hence, I am trying to detect this with OpenCV which will be a lot less resource intensive compared to using an ML model for detection.
Is there any less resource intensive solution for this?
You can probably use a Fourier Transform (FFT) or a Discrete Cosine Transform (DCT) to figure out how blurred your faces are. Blur in images leads to high frequencies disappearing, and only low frequencies remaining.
So you'd take an image of your face, zero-pad it to a size that'll work well for FFT or DCT, and look how much spectral power you have at higher frequencies.
You probably don't need FFT - DCT will be enough. The advantage of DCT is that it produces a real-valued result (no imaginary part). Performance-wise, FFT and DCT are really fast for sizes that are powers of 2, as well as for sizes that have only factors 2, 3 and 5 in them (although if you also have 3's and 5's it'll be a bit slower).
As mentioned by #PlinyTheElder, DCT information can give you motion blur. I am attaching the code snippet from repo below:
The code is in C and i am not sure if there is python binding for libjpeg. Else you need to create one.
/* Fast blur detection using JPEG DCT coefficients
*
* Based on "Blur Determination in the Compressed Domain Using DCT
* Information" by Xavier Marichal, Wei-Ying Ma, and Hong-Jiang Zhang.
*
* Tweak MIN_DCT_VALUE and MAX_HISTOGRAM_VALUE to adjust
* effectiveness. I reduced these values from those given in the
* paper because I find the original to be less effective on large
* JPEGs.
*
* Copyright 2010 Julian Squires <julian#cipht.net>
*/
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <jpeglib.h>
static int min_dct_value = 1; /* -d= */
static float max_histogram_value = 0.005; /* -h= */
static float weights[] = { /* diagonal weighting */
8,7,6,5,4,3,2,1,
1,8,7,6,5,4,3,2,
2,1,8,7,6,5,4,3,
3,2,1,8,7,6,5,4,
4,3,2,1,8,7,6,5,
5,4,3,2,1,8,7,6,
6,5,4,3,2,1,8,7,
7,6,5,4,3,2,1,8
};
static float total_weight = 344;
static inline void update_histogram(JCOEF *block, int *histogram)
{
for(int k = 0; k < DCTSIZE2; k++, block++)
if(abs(*block) > min_dct_value) histogram[k]++;
}
static float compute_blur(int *histogram)
{
float blur = 0.0;
for(int k = 0; k < DCTSIZE2; k++)
if(histogram[k] < max_histogram_value*histogram[0])
blur += weights[k];
blur /= total_weight;
return blur;
}
static int operate_on_image(char *path)
{
struct jpeg_error_mgr jerr;
struct jpeg_decompress_struct cinfo;
jvirt_barray_ptr *coeffp;
JBLOCKARRAY cs;
FILE *in;
int histogram[DCTSIZE2] = {0};
cinfo.err = jpeg_std_error(&jerr);
jpeg_create_decompress(&cinfo);
if((in = fopen(path, "rb")) == NULL) {
fprintf(stderr, "%s: Couldn't open.\n", path);
jpeg_destroy_decompress(&cinfo);
return 0;
}
jpeg_stdio_src(&cinfo, in);
jpeg_read_header(&cinfo, TRUE);
// XXX might be a little faster if we ask for grayscale
coeffp = jpeg_read_coefficients(&cinfo);
/* Note: only looking at the luma; assuming it's the first component. */
for(int i = 0; i < cinfo.comp_info[0].height_in_blocks; i++) {
cs = cinfo.mem->access_virt_barray((j_common_ptr)&cinfo, coeffp[0], i, 1, FALSE);
for(int j = 0; j < cinfo.comp_info[0].width_in_blocks; j++)
update_histogram(cs[0][j], histogram);
}
printf("%f\n", compute_blur(histogram));
// output metadata XXX should be in IPTC etc
// XXX also need to destroy coeffp?
jpeg_destroy_decompress(&cinfo);
return 0;
}
int main(int argc, char **argv)
{
int status, i;
for(status = 0, i = 1; i < argc; i++) {
if(argv[i][0] == '-') {
if(argv[i][1] == 'd')
sscanf(argv[i], "-d=%d", &min_dct_value);
else if(argv[i][1] == 'h')
sscanf(argv[i], "-h=%f", &max_histogram_value);
continue;
}
status |= operate_on_image(argv[i]);
}
return status;
}
Compile the code:
gcc -std=c99 blur_detection.c -l jpeg -o blur-detection
Run the code:
./blur-detection <image path>
I'm stuck on this exercise and am not good enough to resolve it. Basically I am writing a Monte-Carlo Maximum Likelihood algorithm for the Bernoulli distribution. The problem is that I have to pass the data as the parameter to the GSL minimization (one-dim) algorithm, and need to also pass the size of the data (since the outer loop are the different sample sizes of the "observed" data). So I'm attempting to pass these parameters as a struct. However, I'm running into seg faults and I'm SURE it is coming from the portion of the code that concerns the struct and treating it as a pointer.
[EDIT: I have corrected for allocation of the struct and its components]
%%cython
#!python
#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True
from libc.stdlib cimport rand, RAND_MAX, calloc, malloc, realloc, free, abort
from libc.math cimport log
#Use the CythonGSL package to get the low-level routines
from cython_gsl cimport *
######################### Define the Data Structure ############################
cdef struct Parameters:
#Pointer for Y data array
double* Y
#size of the array
int* Size
################ Support Functions for Monte-Carlo Function ##################
#Create a function that allocates the memory and verifies integrity
cdef void alloc_struct(Parameters* data, int N, unsigned int flag) nogil:
#allocate the data array initially
if flag==1:
data.Y = <double*> malloc(N * sizeof(double))
#reallocate the data array
else:
data.Y = <double*> realloc(data.Y, N * sizeof(double))
#If the elements of the struct are not properly allocated, destory it and return null
if N!=0 and data.Y==NULL:
destroy_struct(data)
data = NULL
#Create the destructor of the struct to return memory to system
cdef void destroy_struct(Parameters* data) nogil:
free(data.Y)
free(data)
#This function fills in the Y observed variable with discreet 0/1
cdef void Y_fill(Parameters* data, double p_true, int* N) nogil:
cdef:
Py_ssize_t i
double y
for i in range(N[0]):
y = rand()/<double>RAND_MAX
if y <= p_true:
data.Y[i] = 1
else:
data.Y[i] = 0
#Definition of the function to be maximized: LLF of Bernoulli
cdef double LLF(double p, void* data) nogil:
cdef:
#the sample structure (considered the parameter here)
Parameters* sample
#the total of the LLF
double Sum = 0
#the loop iterator
Py_ssize_t i, n
sample = <Parameters*> data
n = sample.Size[0]
for i in range(n):
Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
return (-(Sum/n))
########################## Monte-Carlo Function ##############################
def Monte_Carlo(int[::1] Samples, double[:,::1] p_hat,
Py_ssize_t Sims, double p_true):
#Define variables and pointers
cdef:
#Data Structure
Parameters* Data
#iterators
Py_ssize_t i, j
int status, GSL_CONTINUE, Iter = 0, max_Iter = 100
#Variables
int N = Samples.shape[0]
double start_val, a, b, tol = 1e-6
#GSL objects and pointer
const gsl_min_fminimizer_type* T
gsl_min_fminimizer* s
gsl_function F
#Set the GSL function
F.function = &LLF
#Allocate the minimization routine
T = gsl_min_fminimizer_brent
s = gsl_min_fminimizer_alloc(T)
#allocate the struct
Data = <Parameters*> malloc(sizeof(Parameters))
#verify memory integrity
if Data==NULL: abort()
#set the starting value
start_val = rand()/<double>RAND_MAX
try:
for i in range(N):
if i==0:
#allocate memory to the data array
alloc_struct(Data, Samples[i], 1)
else:
#reallocate the data array in the struct if
#we are past the first run of outer loop
alloc_struct(Data, Samples[i], 2)
#verify memory integrity
if Data==NULL: abort()
#pass the data size into the struct
Data.Size = &Samples[i]
for j in range(Sims):
#fill in the struct
Y_fill(Data, p_true, Data.Size)
#set the parameters for the GSL function (the samples)
F.params = <void*> Data
a = tol
b = 1
#set the minimizer
gsl_min_fminimizer_set(s, &F, start_val, a, b)
#initialize conditions
GSL_CONTINUE = -2
status = -2
while (status == GSL_CONTINUE and Iter < max_Iter):
Iter += 1
status = gsl_min_fminimizer_iterate(s)
start_val = gsl_min_fminimizer_x_minimum(s)
a = gsl_min_fminimizer_x_lower(s)
b = gsl_min_fminimizer_x_upper(s)
status = gsl_min_test_interval(a, b, tol, 0.0)
if (status == GSL_SUCCESS):
print ("Converged:\n")
p_hat[i,j] = start_val
finally:
destroy_struct(Data)
gsl_min_fminimizer_free(s)
with the following python code to run the above function:
import numpy as np
#Sample Sizes
N = np.array([5,50,500,5000], dtype='i')
#Parameters for MC
T = 1000
p_true = 0.2
#Array of the outputs from the MC
p_hat = np.empty((N.size,T), dtype='d')
p_hat.fill(np.nan)
Monte_Carlo(N, p_hat, T, p_true)
I have separately tested the struct allocation and it works, doing what it should do. However, while funning the Monte Carlo the kernel is killed with an abort call (per the output on my Mac) and the Jupyter output on my console is the following:
gsl: fsolver.c:39: ERROR: computed function value is infinite or NaN
Default GSL error handler invoked.
It seems now that the solver is not working. I'm not familiar with the GSL package, having used it only once to generate random numbers from the gumbel distribution (bypassing the scipy commands).
I would appreciate any help on this! Thanks
[EDIT: Change lower bound of a]
Redoing the exercise with the exponential distribution, whose log likelihood function contains just one log I've honed down the problem having been with gsl_min_fminimizer_set initially evaluating at the lower bound of a at 0 yielding the -INF result (since it evaluates the problem prior to solving to generate f(lower), f(upper) where f is my function to optimise). When I set the lower bound to something other than 0 but really small (say the tol variable of my defined tolerance) the solution algorithm works and yields the correct results.
Many thanks #DavidW for the hints to get me to where I needed to go.
This is a somewhat speculative answer since I don't have GSL installed so struggle to test it (so apologies if it's wrong!)
I think the issue is the line
Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
It looks like Y[i] can be either 0 or 1. When p is at either end of the range 0-1 it gives 0*-inf = nan. In the case where only all the 'Y's are the same this point is the minimum (so the solver will reliably end up at the invalid point). Fortunately you should be able to rewrite the line to avoid getting a nan:
if sample.Y[i]:
Sum += log(p)
else:
Sum += log(1-p)
(the case which will generate the nan is the one not executed).
There's a second minor issue I've spotted: in alloc_struct you do data = NULL in case of an error. This only affects the local pointer, so your test for NULL in Monte_Carlo is meaningless. You'd be better returning a true or false flag from alloc_struct and checking that. I doubt if you're hitting this error though.
Edit: Another better option would be to find the minimum analytically: the derivative of A log(p) + (1-A) log (1-p) is A/p - (1-A)/(1-p). Average all the sample.Ys to find A. Finding the place where the derivative is 0 gives p=A. (You'll want to double-check my working!). With this you can avoid having to use the GSL minimization routines.
I've been going around this issue for days, but haven't been able to find an explanation to what I am doing wrong. I hope you can lend me a hand.
I have a set of UTM coordinates (epsg:23030) that I want to convert to LongLat Coordinates (epsg:4326) by using the proj4 library for C++ (libproj-dev). My code is as follows:
#include "proj_api.h
#include <geos/geom/Coordinate.h>
geos::geom::Coordinate utm2longlat(double x, double y){
// Initialize LONGLAT projection with epsg:4326
if ( !( pj_longlat = pj_init_plus("+init=epsg:4326" ) ) ){
qDebug() << "pj_init_plus error: longlat";
}
// Initialize UTM projection with epsg:23030
if ( ! (pj_utm = pj_init_plus("+init=epsg:23030" ) ) ){
qDebug() << "pj_init_plus error: utm";
}
// Transform UTM projection into LONGLAT projection
int p = pj_transform( pj_utm, pj_longlat, 1, 1, &x, &y, NULL );
// Check for errors
qDebug() << "Error message" << pj_strerrno( p ) ;
// Return values as coordinate
return geos::geom::Coordinate(x, y)
}
My call to the function utm2longlat:
...
// UTM coordinates
double x = 585363.1;
double y = 4796767.1;
geos::geom::Coordinate coord = utm2longlat( x, y );
qDebug() << coord.x << coord.y;
/* Result is -0.0340087 0.756025 <-- WRONG */
In my example:
I know that UTM coordinates (585363.1 4796767.1) refer to LongLat coordinates (-1.94725 43.3189).
However, when called, the function returns a set of wrong coordinates: (-0.0340087 0.756025 ).
I was wondering if I had any misconfiguration when initializing the projections, so I decided to test the Proj4 Python bindings (pyproj), just to test whether I got the same wrong coordinates... and curiously, I got the good ones.
from pyproj import Proj, transform
// Initialize UTM projection
proj_utm = Proj(init='epsg:23030')
// Initialize LongLat projection
proj_lonlat = Proj(init='epsg:4326')
x_utm, y_utm = 585363.1, 4796767.1
x_longlat, y_longlat = transform(proj_utm, proj_lonlat, x_utm, y_utm)
// Print results
print "original", x_utm, y_utm
print "utm2lonlat", x_longlat, y_longlat
/* Result is -1.94725 43.3189 <-- CORRECT */
From what I understand pyproj is a set of Cython bindings over the Proj4 library, so I am using the same core in both programming languages.
Do you have any clue as to what could be wrong? Am I missing some type of conversion in the C++ function?
Thanks in advance.
The result seems to be correct to me, but it's returned in radians instead of degrees. Convert the result to degrees and check again.
is there any easy way how to pass float4 or any other vector argument to OpenCL kernel?
For scalar argument (int, float) you can pass it directly while calling kernel. For array argument you have to first copy it to GPU using cl.Buffer() and than pass pointer. Sure it is probably possible to pass float4 the same way as array. But I ask if there is any easier and more clear way. ( especially using Python, numpy, pyOpenCL)
I tried pass numpy array of size 4*float32 as float4 but it does not work. Is it possible to do it somehow else?
For example :
kernnel:
__kernel void myKernel( __global float * myArray, float myFloat, float4 myFloat4 )
Python:
myFloat4 = numpy.array ( [1.0 ,2.0 ,3.0], dtype=np.float32 )
myArray = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=myArray_host)
kernelargs = ( myArray , numpy.float32(myFloat) , myFloat4)
prg.myKernel(queue, cl_myArray.shape() , None, *(kernelargs) )
I got error :
pyopencl.LogicError: when processing argument #2 (1-based): clSetKernelArg failed: invalid arg size
the other possibiliy is passing it as set of scalar int or float - like:
__kernel void myKernel( __global float * myArray, float myFloat, float myFloat4_x, float myFloat4_y, float myFloat4_z )
kernelargs = ( myArray , numpy.float32(myFloat) ,numpy.float32(myFloat4_x),numpy.float32(myFloat4_y),numpy.float32(myFloat4_z))
but this is also not very convenient - you can be easily lost in many variable names if you want for example pass 4x float4 and 5x int3 to the kernell.
I think passing vectors (2,3,4) of int and float must be quite common in OpenCL - for example the size of 3D data grids. So I wonder if it is really necessary to pass it using cl.Buffer() as pointers.
I guess that constant argument float4 is also faster than *float (because it can be shared as a constant by all workitems)
I find this a nice way to create a float4 in python:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
data= np.zeros(N, dtype=cl_array.vec.float4)
Edit: To also give a MWE:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
deviceID = 0
platformID = 0
workGroup=(1,1)
N = 10
testData = np.zeros(N, dtype=cl_array.vec.float4)
dev = cl.get_platforms()[platformID].get_devices()[deviceID]
ctx = cl.Context([dev])
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
Data_In = cl.Buffer(ctx, mf.READ_WRITE, testData.nbytes)
prg = cl.Program(ctx, """
__kernel void Pack_Cmplx( __global float4* Data_In, int N)
{
int gid = get_global_id(0);
Data_In[gid] = 1;
}
""").build()
prg.Pack_Cmplx(queue, (N,1), workGroup, Data_In, np.int32(N))
cl.enqueue_copy(queue, testData, Data_In)
print testData
Problem is here:
myFloat4 = numpy.array ( [1.0 ,2.0 ,3.0], dtype=numpy.float32 )
but myFloat4.size is equal to 3
Just type this :
myFloat4 = numpy.array ( [1.0 ,2.0 ,3.0, 4.0], dtype=numpy.float32 )
The rest of code is be fine
I noticed three things:
Looking at the error message, there seems to be an issue with the 2nd kernel argument, i.e. myFloat. What happens if you declare it a const argument in the kernel signature? What happens if you do
myFloat = myFloat.astype(np.float32)
kernelArgs = (..., myFloat, ...)
prg.myKernel(...)
You want to define a four-element vector myFloat4 but you give three values [1.0, 2.0, 3.0] only. Also try setting const float4 myFloat4 in the kernel signature.
You don't need additional parentheses for the kernelargs tuple in the actual kernel call:
prg.myKernel(queue, cl_myArray.shape() , None, *kernelargs)
For me, creating a numpy array of shape (SIZE,4) and dtype float32 worked fine when I ran opencl kernel. Be sure second dimension matches what kind of floatN you want, it won't throw any errors if they don't match but in my case it crashed graphics card driver.
The way I inited my arrays: np.zeros((SIZE,4), dtype=np.float32)
Hope this helps anybody who is wondering the same.
I don't know about OpenCl in Python, but I do pass double, int, double8, or whatever OpenCl type to kernels.
Suppose that N is an integer, alpha a double, and vect a double8.
What I do is
clSetKernelArg(kernel, 0, sizeof(int), &N);
clSetKernelArg(kernel, 18, sizeof(double), &alpha);
clSetKernelArg(kernel, 11, sizeof(cl_double8), &vect);
Hope it helps.
Éric.
I need help to know the size of my blocks and grids.
I'm building a python app to perform metric calculations based on scipy as: Euclidean distance, Manhattan, Pearson, Cosine, joined other.
The project is PycudaDistances.
It seems to work very well with small arrays. When I perform a more exhaustive test, unfortunately it did not work. I downloaded movielens set (http://www.grouplens.org/node/73).
Using Movielens 100k, I declared an array with shape (943, 1682). That is, users are 943 and 1682 films evaluated. The films not by a classifier user I configured the value to 0.
With a much larger array algorithm no longer works. I face the following error:
pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid value.
Researching this error, I found an explanation of telling Andrew that supports 512 threads to join and to work with larger blocks it is necessary to work with blocks and grids.
I wanted a help to adapt the algorithm Euclidean distance arrays to work from small to giant arrays.
def euclidean_distances(X, Y=None, inverse=True):
X, Y = check_pairwise_arrays(X,Y)
rows = X.shape[0]
cols = Y.shape[0]
solution = numpy.zeros((rows, cols))
solution = solution.astype(numpy.float32)
kernel_code_template = """
#include <math.h>
__global__ void euclidean(float *x, float *y, float *solution) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
float result = 0.0;
for(int iter = 0; iter < %(NDIM)s; iter++) {
float x_e = x[%(NDIM)s * idy + iter];
float y_e = y[%(NDIM)s * idx + iter];
result += pow((x_e - y_e), 2);
}
int pos = idx + %(NCOLS)s * idy;
solution[pos] = sqrt(result);
}
"""
kernel_code = kernel_code_template % {
'NCOLS': cols,
'NDIM': X.shape[1]
}
mod = SourceModule(kernel_code)
func = mod.get_function("euclidean")
func(drv.In(X), drv.In(Y), drv.Out(solution), block=(cols, rows, 1))
return numpy.divide(1.0, (1.0 + solution)) if inverse else solution
For more details see: https://github.com/vinigracindo/pycudaDistances/blob/master/distances.py
To size the execution paramters for your kernel you need to do two things (in this order):
1. Determine the block size
Your block size will mostly be determined by hardware limitations and performance. I recommend reading this answer for more detailed information, but the very short summary is that your GPU has a limit on the total number of threads per block it can run, and it has a finite register file, shared and local memory size. The block dimensions you select must fall inside these limits, otherwise the kernel will not run. The block size can also effect the performance of kernel, and you will find a block size which gives optimal performance. Block size should always be a round multiple of the warp size, which is 32 on all CUDA compatible hardware released to date.
2. Determine the grid size
For the sort of kernel you have shown, the number of blocks you need is directly related to the amount of input data and the dimensions of each block.
If, for example, your input array size was 943x1682, and you had a 16x16 block size, you would need a 59 x 106 grid, which would yield 944x1696 threads in the kernel launch. In this case the input data size is not a round multiple of the block size, you will need to modify your kernel to ensure that it doesn't read out-of-bounds. One approach could be something like:
__global__ void euclidean(float *x, float *y, float *solution) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
if ( ( idx < %(NCOLS)s ) && ( idy < %(NDIM)s ) ) {
.....
}
}
The python code to launch the kernel could look like something similar to:
bdim = (16, 16, 1)
dx, mx = divmod(cols, bdim[0])
dy, my = divmod(rows, bdim[1])
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) )
func(drv.In(X), drv.In(Y), drv.Out(solution), block=bdim, grid=gdim)
This question and answer may also help understand how this process works.
Please note that all of the above code was written in the browser and has never been tested. Use it at your own risk.
Also note it was based on a very brief reading of your code and might not be correct because you have not really described anything about how the code is called in your question.
The accepted answer is correct in principle, however the code that talonmies has listed is not quite correct. The line:
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) )
Should Be:
gdim = ( (dx + (mx>0)), (dy + (my>0)) )
Besides an obvious extra parenthesis, gdim would produce way too many threads than what you want. talonmies had explained it right in his text that threads is the blocksize * gridSize. However the gdim he has listed would give you the total threads and not the correct grid size which is what is desired.