Unrolling a trivially parallelizable for loop in python with CUDA

Unrolling a trivially parallelizable for loop in python with CUDA - python

I have a for loop in python that I want to unroll onto a GPU. I imagine there has to be a simple solution but I haven't found one yet.
Our function loops over elements in a numpy array and does some math storing the result in another numpy array. Each iteration adds some to this result array. A possible large simplification of our code might look something like this:
import numpy as np
a = np.arange(100)
out = np.array([0, 0])
for x in xrange(a.shape[0]):
out[0] += a[x]
out[1] += a[x]/2.0
How can I unroll a loop like this in Python to run on a GPU?

The place to start is http://documen.tician.de/pycuda/ the example there is
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
print dest-a*b
You place the part of the code you want to parallelize in C code segment and call it from python.
For you example the size of your data will need to be much bigger than 100 to make it worth while. You'll need some way to divide your data into block. If you wanted to add 1,000,000 numbers you could divide it into 1000 blocks. Add each block in the parallezed code. Then add the results in python.
Adding things is not really a natural task for this type of parallelisation. GPUs tend to do the same task for each pixel. You have a task which need to operate on multiple pixels.
It might be better to work with cuda first. A related thread is.
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.

Post my attempts, I would want to conclude that:
Using cufft library from skcuda is a bit tricky and to get to the correct FFT output might take a long time, in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda)
Using CuPy and arranging your data in a format so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. For my case, the order was a little better than 10!
Using CuPy for FFTs is a great option if one wants to stick to Py-based development only. Also the to and fro from C to Python when writing C GPU kernels is an added overhead which is very conveniently resolved with CuPy. Though CuPy itself calls laying out the plan and calling the FFT exec engine internally.

Parallelizing generation of random vectors with arbitrary length using numba

I would like to generate random vectors of the form
[ i if random.uniform(0,1) <= probs[i] for i prange(K) ]
for an K length array of probabilities probs. Each resulting vector has somewhere between 0 and K elements. Conceptually, this is like flipping K specific coins (with particular probabilities of being heads) and recording which of the coins displayed heads.
The arbitrary return length makes it difficult to use any of the automatic parallelization options in numba. E.g.,
from numba import prange, njit, int64, float64
import numpy as np
#njit([int64[:](float64[:], int64)])
def rand_coin(freqs,r):
return np.arange(r)[np.random.uniform(0,1,size=r)<=freqs]
#njit(parallel=True)
def rand_coins(freqs,n):
r = freqs.shape[0]
return [rand_coin(freqs,r) for i in range(n)] # **
r = 10; n =100
freqs = np.random.uniform(0,1,r)
rand_coins(freqs, n)
works great serially but produces a double free or corruption error if the range in the starred line is replaced with prange.
Is is possible parallelize functions returning arrays with random lengths in numba?

prange is a numba function. The typing error is just a generic error from numba, saying that it ran into an issue while compiling the function. The real issue is that you are trying the make a call to a function that isn't declared. You need to utilize the prange function like so:
from numba import njit, int64, float64
import numba
import numpy as np
#njit([int64[:](float64[:], int64)])
def rand_coin(freqs,r):
return np.arange(r)[np.random.uniform(0,1,size=r)<=freqs]
#njit(parallel=True)
def rand_coins(freqs,n):
r = freqs.shape[0]
return [rand_coin(freqs,r) for i in numba.prange(n)] # **

Compile Scipy function with Cython

I am running a simulation in Python 3.4 - that involves a lot of dot products between a sparse array (in csr format) and a dense vector. I am using Scipy for the sparse matrix, numpy for everything else.
Using Cython gave me a massive boost (~x6 speed increase), after making sure that I cdef everything properly and after minimizing Python interaction (bt going through the html file that Cython gives me and modifying my code).
Now, I profile the code and 50% of the simulation time is spent on the line with the dot product. I am wondering if it is possible to somehow accelerate this line, say by complining this one dot function in Cython?
I know I could write my own implementation for (csr sprase 2d matrix) dot (dense vector), but I am trying to avoid that.
Edit: I have included a minimal example of the code. I am sorry, I can't see how to make it smaller. It is a textbook exercise in statistical mechanics. Place marbles in pots until one of the pots exceeds capacity. Then, start a cascade which propagates according to a (here sparse) matrix. I am using batch sampling.
Please focus on the line towards the end.
from __future__ import division
import numpy as np
import cython
cimport numpy as np
cimport cpython.array
import scipy.sparse as sps
#cython.cdivision(True)
#cython.nonecheck(False)
#cython.boundscheck(False)
#cython.wraparound(False)
def simulate(long[:] capacity_vec,
int random_array_size,
long n,
int seed,
int[:] A_col,
int[:] A_row,
long[:] A_data):
#### Initialise ####################################################################################################
# Initialise states
cdef int[:] damage = np.random.randint(0, int(np.min(capacity_vec)/2), n).astype(np.int32)
cdef int[:] dr_list = np.random.choice(n, 1000).astype(np.int32)
cdef int[:] states = np.zeros(n).astype(np.int32)
cdef int[:] states_ = np.zeros(n).astype(np.int32)
cdef int[:] change = np.zeros(n).astype(np.int32)
# Initialise counters
cdef int k, violations, violations_, counter= 0, dr_id=0, increment_index = 0
# Build Sparse Adjecency Matrix
cA_sps = sps.csr_matrix( (A_data, (A_row, A_col) ), shape=(n,n) ).astype(np.int32)
while counter < 1000:
#### Place damage until a cascade starts #######################################################################
while damage[increment_index] <= capacity_vec[increment_index]:# Check for violations
increment_index = dr_list[dr_id] # Where do we place the marble?
damage[increment_index] = damage[increment_index] + 1 # place the marble
dr_id = dr_id + 1 # another random number used
if dr_id == random_array_size - 1: # Check if we run out of random numbers
dr_list = np.random.choice(n, random_array_size).astype(np.int32) # if so, pick new increment_index
dr_id = 0 # and reset the counter
#### Initialise cascade ########################################################################################
violations, violations_ = 1, 0
states[increment_index] = 1
#### Propagate cascade #########################################################################################
while violations > violations_: # check for fixed point, propagate cascade
for k in range(n): change[k] = states[k] - states_[k]
### THIS LINE IS THE PROBLEM. It takes up half of all simulation time.
damage = damage + cA_sps.dot(change).astype(np.int32) # spread violations
states_ = states.copy() # store previous states
# Determine previous and current violations
violations, violations_ = 0 , violations
for k in range(n):
states_[k] = 0
if damage[k] > capacity_vec[k]:
violations = violations + 1
states[k] = 1 # deactivate any node that has a violation
for k in range(n): damage[k] = 0
counter = counter + 1 # progress cascade id after storing

I'd discourage from writing your own matrix multiplication. SciPy is done by smart people who know what they are doing, and unless your confident in numerical computing, just don't. Most of SciPy code is already compiled.
However, what you might look at is code for sparse.csr_matrix.dot. Getting into definition directly here and then here, you'll see that there are few checks done in Scipy. If you know what exact form you want, you could write your own method (modify your SciPy copy) and compute your product directly. Not sure how much that would help, though.
If you want to build Scipy yourself it is as easy as checking out whole project from GitHug and then running
python setup.py build
python setup.py install
For more direct instructions check build documentation.

Fast Numpy Loops

How do you optimize this code (without vectorizing, as this leads up to using the semantics of the calculation, which is quite often far from being non-trivial):
slow_lib.py:
import numpy as np
def foo():
size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)
moo = np.zeros((size,size), dtype = np.float)
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += np.outer(val, val)
The point is that such kind loops correspond quite often to operations where you have double sums over some vector operation.
This is quite slow:
>>t = timeit.timeit('foo()', 'from slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 41.165681839
Ok, so then let's cynothize it and add type annotations likes there is no tomorrow:
c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython
#cython.boundscheck(False)
#cython.wraparound(False)
def foo():
cdef int size = 200
cdef int i,j
np.random.seed(1000031212)
cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size,size), dtype = np.float)
cdef np.ndarray[np.double_t, ndim=1] val
for i in xrange(0,size):
for j in xrange(0,size):
val = bar[j]
moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from c_slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 42.3104710579
... ehr... what? Numba to the rescue!
numba_slow_lib.py:
import numpy as np
from numba import jit
size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)
#jit
def foo():
bar = np.random.rand(size, size)
moo = np.zeros((size,size), dtype = np.float)
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number = 10)
>>print("took: "+str(t))
took: 40.7327859402
So is there really no way to speed this up? The point is:
if I convert the inner loop into a vectorized version (building a larger matrix representing the inner loop and then calling np.outer on the larger matrix) I get much faster code.
if I implement something similar in Matlab (R2016a) this performs quite well due to JIT.

Here's the code for outer:
def outer(a, b, out=None):
a = asarray(a)
b = asarray(b)
return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis,:], out)
So each call to outer involves a number of python calls. Those eventually call compiled code to perform the multiplication. But each incurs an overhead that has nothing to do with the size of your arrays.
So 200 (200**2?) calls to outer will have all that overhead, whereas one call to outer with all 200 rows has one overhead set, followed by one fast compiled operation.
cython and numba don't compile or otherwise bypass the Python code in outer. All they can do is streamline the iteration code that you wrote - and that isn't consuming much time.
Without getting into details, the MATLAB jit must be able to replace the 'outer' with faster code - it rewrites the iteration. But my experience with MATLAB dates from a time before its jit.
For real speed improvements with cython and numba you need to use primitive numpy/python code all the way down. Or better yet focus your effort on slow inner pieces.
Replacing your outer with a streamlined version cuts run time about in half:
def foo1(N):
size = N
np.random.seed(1000031212)
bar = np.random.rand(size, size)
moo = np.zeros((size,size), dtype = np.float)
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += val[:,None]*val
return moo
With the full N=200 your function took 17s per loop. If I replace the inner two lines with pass (no calculation), time drops to 3ms per loop. In other words, the outer loop mechanism is not a big time consumer, at least not compared to many calls to outer().

Memory permitting, you can use np.einsum to perform those heavy calculations in a vectorized manner, like so -
moo = size*np.einsum('ij,ik->jk',bar,bar)
One can also use np.tensordot -
moo = size*np.tensordot(bar,bar,axes=(0,0))
Or simply np.dot -
moo = size*bar.T.dot(bar)

Many tutorials and demonstrations of Cython, Numba, etc. make it seem as if these tools can speed up your code automagically, but in practice, this is often not the case: You'll need to modify your code a little to extract the best performance. If you had already implemented some degree of vectorization, it usually means writing out ALL the loops. Reasons Numpy array operations are non-optimal include:
Lots of temporary arrays are created and looped over;
Significant per-call overhead if the arrays are small;
Short-circuiting logic can't be implemented, because arrays are processed as a whole;
Sometimes the optimal algorithm can't be expressed using array expressions and you settle for an algorithm with a worse time complexity.
Using Numba or Cython wont optimize these problems away! Instead, these tools allow you to write loopy code that is much faster than plain Python.
Also, for Numba specifically, you should be aware of the difference between "object mode" and "nopython mode". The tight loops from your example have to run in nopython mode to provide any significant speedup. However, numpy.outer is not yet supported by Numba, resulting in the function to be compiled in object mode. Decorate with jit(nopython=True) to let such cases throw an exception.
Example to demonstrate a speedup is indeed possible:
import numpy as np
from numba import jit
#jit
def foo_nb(bar):
size = bar.shape[0]
moo = np.zeros((size, size))
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += np.outer(val, val)
return moo
#jit
def foo_nb2(bar):
size = bar.shape[0]
moo = np.zeros((size, size))
for i in range(size):
for j in range(size):
for k in range(0,size):
for l in range(0,size):
moo[k,l] += bar[j,k] * bar[j,l]
return moo
size = 100
bar = np.random.rand(size, size)
np.allclose(foo_nb(bar), foo_nb2(bar))
# True
%timeit foo_nb(bar)
# 1 loop, best of 3: 816 ms per loop
%timeit foo_nb2(bar)
# 10 loops, best of 3: 176 ms per loop

The example you show us is kind of inefficient algorithm, since you calculate the same outer product multiple times. The resulting time complexity is O(n^4). It can be reduced to n^3.
for i in range(0,size):
val = bar[i]
moo += size * np.outer(val, val)

Vectorize a Newton method in Python/Numpy

I am trying to figure out if Python/Numpy is a viable alternative to develop my numerical software which is already available in C++. In order to get performance in Python/Numpy, one need to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to get efficiently in Python/Numpy.
Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0
For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
Sum all the elements of the result array
Here is the algorithm in Python I want to speed up
import numpy as np
n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)
def newton(u):
x = 2.0
while True:
f = x**2 - u
df_dx = 2 * x
dx = f / df_dx
if (abs(dx) <= 1.0e-7):
break
x -= dx
return x
result = map(newton, data)
print result[n - 1]
Here is a version of the algorithm in C++11
#include <iostream>
#include <vector>
#include <cmath>
int main (int argc, char const *argv[]) {
auto n = std::size_t{100000000};
auto v = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
v[k] = 1.0 + static_cast<double>(k) / n;
}
auto result = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
auto x = double{2.0};
while(true) {
auto f = double{x * x - v[k]};
auto df_dx = double{2 * x};
auto dx = double{f / df_dx};
if (std::abs(dx) <= 1.0e-7) {
break;
}
x -= dx;
}
result[k] = x;
}
auto somme = double{0.0};
for(size_t k = 0; k < result.size(); ++k) {
somme += result[k];
}
std::cout << somme << std::endl;
return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).
Thanks.

You can do step 1. with numpy efficiently:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this I followed up on #ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
while np.abs(dx[i:i+m]).max() > 1.0e-7:
for j in range(block):
dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
x[i:i+m] -= dx[i:i+m]

Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.

I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
def summing():
n = 1000000
ar = np.arange(0, n)
ar = ar/float(n)
ar = ar + np.ones(n)
sqrt = np.sqrt(ar)
return np.sum(ar)
In short, to get the starting array it's best to use a "workaround".
initialize an array ar with values `[1,2,3,....n]
divide ar with n. This gets us the 1/n, 2/n ... members
add to that an array of same dimensions that contain just the number 1.0
This gets us the full array [ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999]) we're after. If I understood you right.
find square roots, sum it
Average of 10 sequential execution times is 0.018786 seconds.

Obviously I'm 6 years late to this party, but this question is a common stumbling block for people in effectively using numpy for real scientific work. The basic idea is covered in #ev-br's answer. The OP points out that the solution offered there (even modified to stop iterating when a convergence criterion is met rather than after a fixed number of iterations) takes the same number of passes for each element of u. I want to show how you can avoid that objection using pure numpy code, making explicit the mask suggestion in #ev-br's comment.
However, I also want to point out that in many real world situations, the number of passes for Newton-like iteration to converge varies so little that this general technique I illustrate here will actually slow numpy code down significantly. If the average number of iterations will be within a factor of two or three of the maximum number of iterations, you should stick with something closer to #ev-br's answer (including his first comment).
The numpy performance numbers you need to understand are these: Loops over array indices will run 200 to 500 times slower in pure numpy code than in compiled code. On the other hand, if you manage to use numpy's array syntax to avoid all index loops, you can get within about a factor of 5 of compiled speed. (The factor of 5 is partly because of memory management as #ev-br mentions, but also because optimized compiled code overlaps many different arithmetical operations inside each index loop, while numpy just performs a single arithmetic operation, storing everything back to memory after each operation.) The point is that factor of 100 difference means that it often pays to do substantial amounts of "extra" work in numpy code: Even if you do 10 times the number of floating point operations in vectorized numpy code, it will still run 10 times faster than the index-loop code that avoids the "extra" work. (Incidentally, the python map function is implemented as an interpreted index loop - it has nothing to do with numpy array operations.)
from numpy import asfarray, broadcast_arrays, arange
# Begin by defining the function to be inverted by Newton's method.
def f_dfdx(x):
x = asfarray(x) # always avoid repeated type conversions
return x**2, 2.*x
# First, the simplest algorithm to find x such that f(x)=y.
# We must supply a starting guess x0 for x.
def f_inverse0(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
x = x.copy() # without this may clobber input x0
for npass in range(20):
f, dfdx = f_dfdx(x)
dx = (f - y) / dfdx
if (abs(dx) <= tol).all():
break # iterate all x until all have converged
x -= dx
else:
raise RuntimeError("failed to converge")
return x
# A frequently slower algorithm that avoids extra iterations.
def f_inverse1(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
shape = x.shape
y, x = y.ravel(), x.flatten() # avoid clobbering x0
unconverged = arange(y.size)
for npass in range(20):
f, dfdx = f_dfdx(x[unconverged])
dx = (f - y[unconverged]) / dfdx
unc = abs(dx) > tol
unconverged = unconverged[unc]
if not unconverged.size:
break # iterate all x until all have converged
x[unconverged] -= dx[unc]
else:
raise RuntimeError("failed to converge")
return x.reshape(shape)
On my machine, the OP's C++ program runs in 2.03 s (1.64+0.38 user+sys). For n=100 million as for the C++ program, f_inverse0 runs in 20.4 s (4.7+15.6 user+sys). As expected, f_inverse1 is slower, 51.3 s (11.5+39.8 user+sys). Again, don't automatically try to minimize total operation count when you are writing numpy code. The high system overhead is probably due to heavy memory management - every vector temporary is 0.8 GB and the memory manager is struggling.
Cutting the array size to n = 1 million elements (8 MB), then multiplying the runtime by 100 brings the system time down by a large factor, f_inverse0 now takes 16.1 s (12.5+3.6), while f_inverse1 takes 22.3 s (16.2+5.1). This factor of 8 to 10 slower than compiled code is not unreasonable to expect for numpy performance.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unrolling a trivially parallelizable for loop in python with CUDA - python

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

Parallelizing generation of random vectors with arbitrary length using numba

Compile Scipy function with Cython

Fast Numpy Loops

Vectorize a Newton method in Python/Numpy

Categories

Resources