I'm stuck on this exercise and am not good enough to resolve it. Basically I am writing a Monte-Carlo Maximum Likelihood algorithm for the Bernoulli distribution. The problem is that I have to pass the data as the parameter to the GSL minimization (one-dim) algorithm, and need to also pass the size of the data (since the outer loop are the different sample sizes of the "observed" data). So I'm attempting to pass these parameters as a struct. However, I'm running into seg faults and I'm SURE it is coming from the portion of the code that concerns the struct and treating it as a pointer.
[EDIT: I have corrected for allocation of the struct and its components]
%%cython
#!python
#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True
from libc.stdlib cimport rand, RAND_MAX, calloc, malloc, realloc, free, abort
from libc.math cimport log
#Use the CythonGSL package to get the low-level routines
from cython_gsl cimport *
######################### Define the Data Structure ############################
cdef struct Parameters:
#Pointer for Y data array
double* Y
#size of the array
int* Size
################ Support Functions for Monte-Carlo Function ##################
#Create a function that allocates the memory and verifies integrity
cdef void alloc_struct(Parameters* data, int N, unsigned int flag) nogil:
#allocate the data array initially
if flag==1:
data.Y = <double*> malloc(N * sizeof(double))
#reallocate the data array
else:
data.Y = <double*> realloc(data.Y, N * sizeof(double))
#If the elements of the struct are not properly allocated, destory it and return null
if N!=0 and data.Y==NULL:
destroy_struct(data)
data = NULL
#Create the destructor of the struct to return memory to system
cdef void destroy_struct(Parameters* data) nogil:
free(data.Y)
free(data)
#This function fills in the Y observed variable with discreet 0/1
cdef void Y_fill(Parameters* data, double p_true, int* N) nogil:
cdef:
Py_ssize_t i
double y
for i in range(N[0]):
y = rand()/<double>RAND_MAX
if y <= p_true:
data.Y[i] = 1
else:
data.Y[i] = 0
#Definition of the function to be maximized: LLF of Bernoulli
cdef double LLF(double p, void* data) nogil:
cdef:
#the sample structure (considered the parameter here)
Parameters* sample
#the total of the LLF
double Sum = 0
#the loop iterator
Py_ssize_t i, n
sample = <Parameters*> data
n = sample.Size[0]
for i in range(n):
Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
return (-(Sum/n))
########################## Monte-Carlo Function ##############################
def Monte_Carlo(int[::1] Samples, double[:,::1] p_hat,
Py_ssize_t Sims, double p_true):
#Define variables and pointers
cdef:
#Data Structure
Parameters* Data
#iterators
Py_ssize_t i, j
int status, GSL_CONTINUE, Iter = 0, max_Iter = 100
#Variables
int N = Samples.shape[0]
double start_val, a, b, tol = 1e-6
#GSL objects and pointer
const gsl_min_fminimizer_type* T
gsl_min_fminimizer* s
gsl_function F
#Set the GSL function
F.function = &LLF
#Allocate the minimization routine
T = gsl_min_fminimizer_brent
s = gsl_min_fminimizer_alloc(T)
#allocate the struct
Data = <Parameters*> malloc(sizeof(Parameters))
#verify memory integrity
if Data==NULL: abort()
#set the starting value
start_val = rand()/<double>RAND_MAX
try:
for i in range(N):
if i==0:
#allocate memory to the data array
alloc_struct(Data, Samples[i], 1)
else:
#reallocate the data array in the struct if
#we are past the first run of outer loop
alloc_struct(Data, Samples[i], 2)
#verify memory integrity
if Data==NULL: abort()
#pass the data size into the struct
Data.Size = &Samples[i]
for j in range(Sims):
#fill in the struct
Y_fill(Data, p_true, Data.Size)
#set the parameters for the GSL function (the samples)
F.params = <void*> Data
a = tol
b = 1
#set the minimizer
gsl_min_fminimizer_set(s, &F, start_val, a, b)
#initialize conditions
GSL_CONTINUE = -2
status = -2
while (status == GSL_CONTINUE and Iter < max_Iter):
Iter += 1
status = gsl_min_fminimizer_iterate(s)
start_val = gsl_min_fminimizer_x_minimum(s)
a = gsl_min_fminimizer_x_lower(s)
b = gsl_min_fminimizer_x_upper(s)
status = gsl_min_test_interval(a, b, tol, 0.0)
if (status == GSL_SUCCESS):
print ("Converged:\n")
p_hat[i,j] = start_val
finally:
destroy_struct(Data)
gsl_min_fminimizer_free(s)
with the following python code to run the above function:
import numpy as np
#Sample Sizes
N = np.array([5,50,500,5000], dtype='i')
#Parameters for MC
T = 1000
p_true = 0.2
#Array of the outputs from the MC
p_hat = np.empty((N.size,T), dtype='d')
p_hat.fill(np.nan)
Monte_Carlo(N, p_hat, T, p_true)
I have separately tested the struct allocation and it works, doing what it should do. However, while funning the Monte Carlo the kernel is killed with an abort call (per the output on my Mac) and the Jupyter output on my console is the following:
gsl: fsolver.c:39: ERROR: computed function value is infinite or NaN
Default GSL error handler invoked.
It seems now that the solver is not working. I'm not familiar with the GSL package, having used it only once to generate random numbers from the gumbel distribution (bypassing the scipy commands).
I would appreciate any help on this! Thanks
[EDIT: Change lower bound of a]
Redoing the exercise with the exponential distribution, whose log likelihood function contains just one log I've honed down the problem having been with gsl_min_fminimizer_set initially evaluating at the lower bound of a at 0 yielding the -INF result (since it evaluates the problem prior to solving to generate f(lower), f(upper) where f is my function to optimise). When I set the lower bound to something other than 0 but really small (say the tol variable of my defined tolerance) the solution algorithm works and yields the correct results.
Many thanks #DavidW for the hints to get me to where I needed to go.
This is a somewhat speculative answer since I don't have GSL installed so struggle to test it (so apologies if it's wrong!)
I think the issue is the line
Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
It looks like Y[i] can be either 0 or 1. When p is at either end of the range 0-1 it gives 0*-inf = nan. In the case where only all the 'Y's are the same this point is the minimum (so the solver will reliably end up at the invalid point). Fortunately you should be able to rewrite the line to avoid getting a nan:
if sample.Y[i]:
Sum += log(p)
else:
Sum += log(1-p)
(the case which will generate the nan is the one not executed).
There's a second minor issue I've spotted: in alloc_struct you do data = NULL in case of an error. This only affects the local pointer, so your test for NULL in Monte_Carlo is meaningless. You'd be better returning a true or false flag from alloc_struct and checking that. I doubt if you're hitting this error though.
Edit: Another better option would be to find the minimum analytically: the derivative of A log(p) + (1-A) log (1-p) is A/p - (1-A)/(1-p). Average all the sample.Ys to find A. Finding the place where the derivative is 0 gives p=A. (You'll want to double-check my working!). With this you can avoid having to use the GSL minimization routines.
Related
I am trying to write a fast algorithm in Cython. The algorithm is fairly straightforward and is described here (https://arxiv.org/pdf/1411.4357.pdf - the paragraph before Theorem 8 on page 17). The idea is relatively straightforward: assuming that the data is sparse it can be given as input in the form (i,j,A_ij) for a matrix A which has n rows and d columns. Now to take advantage of the sparsity we need two functions h which maps the n rows into s buckets uniformly at random (which is a parameter to the algorithm) and a sign function s which is just ±1 with equal probability. Then for every triple (i,j,A_ij) the algorithm must output (h(i),j, s(h(i))*A_ij) and return these as arrays to be given as input to another sparse.
The problem is that I can't get a speedup. This should run extremely fast and outperform matrix multiplication of realising S and multiplying by A as illustrated in the above reference.
Python approach (roughly 11ms):
import numpy as np
import numpy.random as npr
from scipy.sparse import coo_matrix
def countSketch_faster(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
observed_rows = set([])
sign_hashes = []
hash_table = {}
sketched_data = []
hashed_rows = []
for row_id, col_id, Aij in zip(input_matrix.row, input_matrix.col, input_matrix.data):
if row_id in observed_rows:
sketched_data.append(hash_table[row_id][1]*Aij)
hashed_rows.append(hash_row_val)
else:
hash_row_val, sign_val = np.random.choice(sketch_size), np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data.append(sign_val*Aij)
hashed_rows.append(hash_row_val)
observed_rows.add(row_id)
hashed_rows = np.asarray(hashed_rows)
sketched_data = np.asarray(sketched_data)
row_hashes = np.asarray(row_hashes)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col))).tocsr()
return S
Converting to Cython:
Analysing the annotated output highlights that the most python-heavy lines are those which call numpy. I have tried to cdef all variables and specify dtype for any call to numpy. Also, I have removed the zip command and tried to keep the loop simpler and more C-like. All of this has actually had an adverse effect though and it runs very slowly, and I'm not really sure why. It seems like quite a simple algorithm to implement so if someone could help me get the runtime down to something very small I would be extremely grateful.
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from scipy.sparse import coo_matrix
def countSketch_cython(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
hash_table = {}
sketched_data = np.zeros_like(input_matrix.data,dtype=np.float64)
hashed_rows = np.zeros_like(sketched_data,dtype=np.float64)
observed_rows = set([])
for idx in range(input_matrix.row.shape[0]):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = np.random.randint(low=0,high=sketch_size+1)
sign_val = np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)))
return S`
UPDATE: I have managed to speed the code up by removing some of the slower lines. These were calls to np.random for which the C random number generator was faster, giving the length in the input so this did not need to be calculated, and not converting to a sparse matrix before returning (for the purpose of this experiment I am interested in how quickly the transform can be done as opposed to details surrounding the conversion for downstream use).
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from libc.stdlib cimport rand
##cython.boundscheck(False) # these don't contribute much in this
example
##cython.wraparound(False)
def countSketch(input_matrix, input_matrix_length, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
input_matrix_length - number of rows in input matrix
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
cdef int arr_lengths = input_matrix_length
# These two lines are still annotated most boldly
cdef double[:,:] sketched_data =
np.zeros((arr_lengths,1),dtype=np.float64)
cdef double[:,:] hashed_rows = np.zeros((arr_lengths,1))
hash_table = {}
observed_rows = set([])
for idx in range(arr_lengths):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = rand()%(sketch_size)
#np.random.randint(low=0,high=sketch_size+1)
sign_val = 2*rand()%(2) - 1
#np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
#S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)), dtype=np.float64)
return hashed_rows, sketched_data
On a random sparse matrix A = scipy.sparse.random(1000, 50, density=0.1) this now achieves 508 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) using %timeit countSketch(A, A.shape[0],100). I should imagine there are still gains to be made as I don't know if I have set the arrays in the best way.
I am running a simulation in Python 3.4 - that involves a lot of dot products between a sparse array (in csr format) and a dense vector. I am using Scipy for the sparse matrix, numpy for everything else.
Using Cython gave me a massive boost (~x6 speed increase), after making sure that I cdef everything properly and after minimizing Python interaction (bt going through the html file that Cython gives me and modifying my code).
Now, I profile the code and 50% of the simulation time is spent on the line with the dot product. I am wondering if it is possible to somehow accelerate this line, say by complining this one dot function in Cython?
I know I could write my own implementation for (csr sprase 2d matrix) dot (dense vector), but I am trying to avoid that.
Edit: I have included a minimal example of the code. I am sorry, I can't see how to make it smaller. It is a textbook exercise in statistical mechanics. Place marbles in pots until one of the pots exceeds capacity. Then, start a cascade which propagates according to a (here sparse) matrix. I am using batch sampling.
Please focus on the line towards the end.
from __future__ import division
import numpy as np
import cython
cimport numpy as np
cimport cpython.array
import scipy.sparse as sps
#cython.cdivision(True)
#cython.nonecheck(False)
#cython.boundscheck(False)
#cython.wraparound(False)
def simulate(long[:] capacity_vec,
int random_array_size,
long n,
int seed,
int[:] A_col,
int[:] A_row,
long[:] A_data):
#### Initialise ####################################################################################################
# Initialise states
cdef int[:] damage = np.random.randint(0, int(np.min(capacity_vec)/2), n).astype(np.int32)
cdef int[:] dr_list = np.random.choice(n, 1000).astype(np.int32)
cdef int[:] states = np.zeros(n).astype(np.int32)
cdef int[:] states_ = np.zeros(n).astype(np.int32)
cdef int[:] change = np.zeros(n).astype(np.int32)
# Initialise counters
cdef int k, violations, violations_, counter= 0, dr_id=0, increment_index = 0
# Build Sparse Adjecency Matrix
cA_sps = sps.csr_matrix( (A_data, (A_row, A_col) ), shape=(n,n) ).astype(np.int32)
while counter < 1000:
#### Place damage until a cascade starts #######################################################################
while damage[increment_index] <= capacity_vec[increment_index]:# Check for violations
increment_index = dr_list[dr_id] # Where do we place the marble?
damage[increment_index] = damage[increment_index] + 1 # place the marble
dr_id = dr_id + 1 # another random number used
if dr_id == random_array_size - 1: # Check if we run out of random numbers
dr_list = np.random.choice(n, random_array_size).astype(np.int32) # if so, pick new increment_index
dr_id = 0 # and reset the counter
#### Initialise cascade ########################################################################################
violations, violations_ = 1, 0
states[increment_index] = 1
#### Propagate cascade #########################################################################################
while violations > violations_: # check for fixed point, propagate cascade
for k in range(n): change[k] = states[k] - states_[k]
### THIS LINE IS THE PROBLEM. It takes up half of all simulation time.
damage = damage + cA_sps.dot(change).astype(np.int32) # spread violations
states_ = states.copy() # store previous states
# Determine previous and current violations
violations, violations_ = 0 , violations
for k in range(n):
states_[k] = 0
if damage[k] > capacity_vec[k]:
violations = violations + 1
states[k] = 1 # deactivate any node that has a violation
for k in range(n): damage[k] = 0
counter = counter + 1 # progress cascade id after storing
I'd discourage from writing your own matrix multiplication. SciPy is done by smart people who know what they are doing, and unless your confident in numerical computing, just don't. Most of SciPy code is already compiled.
However, what you might look at is code for sparse.csr_matrix.dot. Getting into definition directly here and then here, you'll see that there are few checks done in Scipy. If you know what exact form you want, you could write your own method (modify your SciPy copy) and compute your product directly. Not sure how much that would help, though.
If you want to build Scipy yourself it is as easy as checking out whole project from GitHug and then running
python setup.py build
python setup.py install
For more direct instructions check build documentation.
I am trying to figure out if Python/Numpy is a viable alternative to develop my numerical software which is already available in C++. In order to get performance in Python/Numpy, one need to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to get efficiently in Python/Numpy.
Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0
For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
Sum all the elements of the result array
Here is the algorithm in Python I want to speed up
import numpy as np
n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)
def newton(u):
x = 2.0
while True:
f = x**2 - u
df_dx = 2 * x
dx = f / df_dx
if (abs(dx) <= 1.0e-7):
break
x -= dx
return x
result = map(newton, data)
print result[n - 1]
Here is a version of the algorithm in C++11
#include <iostream>
#include <vector>
#include <cmath>
int main (int argc, char const *argv[]) {
auto n = std::size_t{100000000};
auto v = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
v[k] = 1.0 + static_cast<double>(k) / n;
}
auto result = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
auto x = double{2.0};
while(true) {
auto f = double{x * x - v[k]};
auto df_dx = double{2 * x};
auto dx = double{f / df_dx};
if (std::abs(dx) <= 1.0e-7) {
break;
}
x -= dx;
}
result[k] = x;
}
auto somme = double{0.0};
for(size_t k = 0; k < result.size(); ++k) {
somme += result[k];
}
std::cout << somme << std::endl;
return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).
Thanks.
You can do step 1. with numpy efficiently:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this I followed up on #ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
while np.abs(dx[i:i+m]).max() > 1.0e-7:
for j in range(block):
dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
x[i:i+m] -= dx[i:i+m]
Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.
I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
def summing():
n = 1000000
ar = np.arange(0, n)
ar = ar/float(n)
ar = ar + np.ones(n)
sqrt = np.sqrt(ar)
return np.sum(ar)
In short, to get the starting array it's best to use a "workaround".
initialize an array ar with values `[1,2,3,....n]
divide ar with n. This gets us the 1/n, 2/n ... members
add to that an array of same dimensions that contain just the number 1.0
This gets us the full array [ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999]) we're after. If I understood you right.
find square roots, sum it
Average of 10 sequential execution times is 0.018786 seconds.
Obviously I'm 6 years late to this party, but this question is a common stumbling block for people in effectively using numpy for real scientific work. The basic idea is covered in #ev-br's answer. The OP points out that the solution offered there (even modified to stop iterating when a convergence criterion is met rather than after a fixed number of iterations) takes the same number of passes for each element of u. I want to show how you can avoid that objection using pure numpy code, making explicit the mask suggestion in #ev-br's comment.
However, I also want to point out that in many real world situations, the number of passes for Newton-like iteration to converge varies so little that this general technique I illustrate here will actually slow numpy code down significantly. If the average number of iterations will be within a factor of two or three of the maximum number of iterations, you should stick with something closer to #ev-br's answer (including his first comment).
The numpy performance numbers you need to understand are these: Loops over array indices will run 200 to 500 times slower in pure numpy code than in compiled code. On the other hand, if you manage to use numpy's array syntax to avoid all index loops, you can get within about a factor of 5 of compiled speed. (The factor of 5 is partly because of memory management as #ev-br mentions, but also because optimized compiled code overlaps many different arithmetical operations inside each index loop, while numpy just performs a single arithmetic operation, storing everything back to memory after each operation.) The point is that factor of 100 difference means that it often pays to do substantial amounts of "extra" work in numpy code: Even if you do 10 times the number of floating point operations in vectorized numpy code, it will still run 10 times faster than the index-loop code that avoids the "extra" work. (Incidentally, the python map function is implemented as an interpreted index loop - it has nothing to do with numpy array operations.)
from numpy import asfarray, broadcast_arrays, arange
# Begin by defining the function to be inverted by Newton's method.
def f_dfdx(x):
x = asfarray(x) # always avoid repeated type conversions
return x**2, 2.*x
# First, the simplest algorithm to find x such that f(x)=y.
# We must supply a starting guess x0 for x.
def f_inverse0(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
x = x.copy() # without this may clobber input x0
for npass in range(20):
f, dfdx = f_dfdx(x)
dx = (f - y) / dfdx
if (abs(dx) <= tol).all():
break # iterate all x until all have converged
x -= dx
else:
raise RuntimeError("failed to converge")
return x
# A frequently slower algorithm that avoids extra iterations.
def f_inverse1(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
shape = x.shape
y, x = y.ravel(), x.flatten() # avoid clobbering x0
unconverged = arange(y.size)
for npass in range(20):
f, dfdx = f_dfdx(x[unconverged])
dx = (f - y[unconverged]) / dfdx
unc = abs(dx) > tol
unconverged = unconverged[unc]
if not unconverged.size:
break # iterate all x until all have converged
x[unconverged] -= dx[unc]
else:
raise RuntimeError("failed to converge")
return x.reshape(shape)
On my machine, the OP's C++ program runs in 2.03 s (1.64+0.38 user+sys). For n=100 million as for the C++ program, f_inverse0 runs in 20.4 s (4.7+15.6 user+sys). As expected, f_inverse1 is slower, 51.3 s (11.5+39.8 user+sys). Again, don't automatically try to minimize total operation count when you are writing numpy code. The high system overhead is probably due to heavy memory management - every vector temporary is 0.8 GB and the memory manager is struggling.
Cutting the array size to n = 1 million elements (8 MB), then multiplying the runtime by 100 brings the system time down by a large factor, f_inverse0 now takes 16.1 s (12.5+3.6), while f_inverse1 takes 22.3 s (16.2+5.1). This factor of 8 to 10 slower than compiled code is not unreasonable to expect for numpy performance.
I cythonized a function that I call a bunch of times in my code. The cython version and the original python code give me the same answers (within 1e-7 which I understand has something to do with cython vs. python types...not the question here but might be important).
I attempt to find the root of the function using scipy.optimize.fsolve(). The python version works fine, but the cython version diverges.
The code is pretty involved and has a big external file to prepare some of the arguments, so I can't post everything. I post the cython code. Full code is here.
def euler_outside(float b_prime, int index_b,
np.ndarray[np.double_t, ndim=1] b_grid, int index_y,
np.ndarray[np.double_t, ndim=1] y_grid,
np.ndarray[np.double_t, ndim=1] y_vec,
np.ndarray[np.double_t, ndim=2] pol_mat_b, float q,
np.ndarray[np.double_t, ndim=2] pol_mat_q,
np.ndarray[np.double_t, ndim=2] P, float beta,
int n_ygrid, int check=0):
'''
b_prime - the variable of interest. want to find b_prime that solves this
function
'''
cdef double b, y, c, uc, e_ucp, eul_val
cdef int i
cdef np.ndarray[np.float64_t, ndim=1] uct, c_prime = np.zeros((n_ygrid,))
b = b_grid[index_b]
y = y_grid[index_y]
# Get value of consumption today
c = b + y - b_prime/q
# Get possible values of consumption tomorrow
if check:
c_prime = b_prime + y_vec - b_grid[0]/q
else:
for i in range(n_ygrid):
c_prime[i] = (b_prime + y_vec[i] -
(np.interp(b_prime, b_grid, pol_mat_b[:,i]) /
np.interp(b_prime, b_grid, pol_mat_q[:,i])))
if c<0:
return 1e10
uc = utility_prime(c)
uct = utility_prime(c_prime)
e_ucp = np.inner( uct, P[index_y,:] )
eul_val = uc - beta*q * e_ucp
return eul_val
The python code is the same but w/out the cdef statements and type info on the arguments. I've checked to make sure the output is the same for the same input values, and it is. My question is why scipy's fsolve goes off the deep-end for one and not the other. I assume it's a problem with my cython?
Running python 2.7 from Anaconda. Compiling the extension module via pyximport.
As mentioned in the comments above, the reason for the discrepancy between the results from the Python and Cython versions is that in the Cython function, several of the inputs are declared as float, whereas the actual Python variables are double precision.
The resulting increase in round-off error for the Cython function seems to be the reason why fsolve fails to converge - when these inputs are declared as double instead, the Python and Cython versions yield the exact same result, and fsolve converges correctly for both.
As an aside, cases where round-off error in the objective function prevents convergence are indicative of ill-conditioned problems. You might want to think about whether it's possible to re-formulate your model in order to improve its numerical stability.
I need help to know the size of my blocks and grids.
I'm building a python app to perform metric calculations based on scipy as: Euclidean distance, Manhattan, Pearson, Cosine, joined other.
The project is PycudaDistances.
It seems to work very well with small arrays. When I perform a more exhaustive test, unfortunately it did not work. I downloaded movielens set (http://www.grouplens.org/node/73).
Using Movielens 100k, I declared an array with shape (943, 1682). That is, users are 943 and 1682 films evaluated. The films not by a classifier user I configured the value to 0.
With a much larger array algorithm no longer works. I face the following error:
pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid value.
Researching this error, I found an explanation of telling Andrew that supports 512 threads to join and to work with larger blocks it is necessary to work with blocks and grids.
I wanted a help to adapt the algorithm Euclidean distance arrays to work from small to giant arrays.
def euclidean_distances(X, Y=None, inverse=True):
X, Y = check_pairwise_arrays(X,Y)
rows = X.shape[0]
cols = Y.shape[0]
solution = numpy.zeros((rows, cols))
solution = solution.astype(numpy.float32)
kernel_code_template = """
#include <math.h>
__global__ void euclidean(float *x, float *y, float *solution) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
float result = 0.0;
for(int iter = 0; iter < %(NDIM)s; iter++) {
float x_e = x[%(NDIM)s * idy + iter];
float y_e = y[%(NDIM)s * idx + iter];
result += pow((x_e - y_e), 2);
}
int pos = idx + %(NCOLS)s * idy;
solution[pos] = sqrt(result);
}
"""
kernel_code = kernel_code_template % {
'NCOLS': cols,
'NDIM': X.shape[1]
}
mod = SourceModule(kernel_code)
func = mod.get_function("euclidean")
func(drv.In(X), drv.In(Y), drv.Out(solution), block=(cols, rows, 1))
return numpy.divide(1.0, (1.0 + solution)) if inverse else solution
For more details see: https://github.com/vinigracindo/pycudaDistances/blob/master/distances.py
To size the execution paramters for your kernel you need to do two things (in this order):
1. Determine the block size
Your block size will mostly be determined by hardware limitations and performance. I recommend reading this answer for more detailed information, but the very short summary is that your GPU has a limit on the total number of threads per block it can run, and it has a finite register file, shared and local memory size. The block dimensions you select must fall inside these limits, otherwise the kernel will not run. The block size can also effect the performance of kernel, and you will find a block size which gives optimal performance. Block size should always be a round multiple of the warp size, which is 32 on all CUDA compatible hardware released to date.
2. Determine the grid size
For the sort of kernel you have shown, the number of blocks you need is directly related to the amount of input data and the dimensions of each block.
If, for example, your input array size was 943x1682, and you had a 16x16 block size, you would need a 59 x 106 grid, which would yield 944x1696 threads in the kernel launch. In this case the input data size is not a round multiple of the block size, you will need to modify your kernel to ensure that it doesn't read out-of-bounds. One approach could be something like:
__global__ void euclidean(float *x, float *y, float *solution) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
if ( ( idx < %(NCOLS)s ) && ( idy < %(NDIM)s ) ) {
.....
}
}
The python code to launch the kernel could look like something similar to:
bdim = (16, 16, 1)
dx, mx = divmod(cols, bdim[0])
dy, my = divmod(rows, bdim[1])
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) )
func(drv.In(X), drv.In(Y), drv.Out(solution), block=bdim, grid=gdim)
This question and answer may also help understand how this process works.
Please note that all of the above code was written in the browser and has never been tested. Use it at your own risk.
Also note it was based on a very brief reading of your code and might not be correct because you have not really described anything about how the code is called in your question.
The accepted answer is correct in principle, however the code that talonmies has listed is not quite correct. The line:
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) )
Should Be:
gdim = ( (dx + (mx>0)), (dy + (my>0)) )
Besides an obvious extra parenthesis, gdim would produce way too many threads than what you want. talonmies had explained it right in his text that threads is the blocksize * gridSize. However the gdim he has listed would give you the total threads and not the correct grid size which is what is desired.