Reading hdf5 file quickly with cython and h5py

Reading hdf5 file quickly with cython and h5py - python

I'm trying to speed up a python3 function that takes some data, which is an array of indexes and saves them if they meet a certain criterion. I have tried to speed it up by using "cython -a script.py", but the bottle neck seems to be the h5py I/O slicing datasets.
I'm relatively new to cython, so I was wondering whether there is anyway to speed this up or am I just limited by the h5py I/O here?
Here is the function I'm trying to improve:
import numpy as np
import h5py
cimport numpy as np
cimport cython
from libc.math cimport sqrt
DTYPE64 = np.int64
ctypedef np.int64_t DTYPE64_t
DTYPE32 = np.int32
ctypedef np.int32_t DTYPE32_t
#cython.boundscheck(False)
#cython.wraparound(False)
def tag_subhalo_branch(np.ndarray[DTYPE64_t] halos_z0_treeindxs,
np.ndarray[DTYPE64_t] tree_pindx,
np.ndarray[DTYPE32_t] tree_psnapnum,
np.ndarray[DTYPE64_t] tree_psnapid,
np.ndarray[DTYPE64_t] tree_hsnapid, hf,
int size):
cdef int i
cdef double radial, progen_x, progen_y, progen_z
cdef double host_x, host_y, host_z, host_rvir
cdef DTYPE64_t progen_indx, progen_haloid, host_id
cdef DTYPE32_t progen_snap
cdef int j = 0
cdef int size_array = size
cdef np.ndarray[DTYPE64_t] backsplash_ids = np.zeros(size_array,
dtype=DTYPE64)
for i in range(0, size_array):
progen_indx = tree_pindx[halos_z0_treeindxs[i]]
if progen_indx != -1:
progen_snap = tree_psnapnum[progen_indx]
progen_haloid = tree_psnapid[progen_indx]
while progen_indx != -1 and progen_snap != -1:
# ** This is slow **
grp = hf['Snapshots/snap_' + str('%03d' % progen_snap) + '/']
host_id = grp['HaloCatalog'][(progen_haloid - 1), 2]
# **
if host_id != -1:
# ** This is slow **
progen_x = grp['HaloCatalog'][(progen_haloid - 1), 6]
host_x = grp['HaloCatalog'][(host_id - 1), 6]
progen_y = grp['HaloCatalog'][(progen_haloid - 1), 7]
host_y = grp['HaloCatalog'][(host_id - 1), 7]
progen_z = grp['HaloCatalog'][(progen_haloid - 1), 8]
host_z = grp['HaloCatalog'][(host_id - 1), 8]
# **
radial = 0
radial += (progen_x - host_x)**2
radial += (progen_y - host_y)**2
radial += (progen_z - host_z)**2
radial = sqrt(radial)
host_rvir = grp['HaloCatalog'][(host_id - 1), 24]
if radial <= host_rvir:
backsplash_ids[j] = tree_hsnapid[
halos_z0_treeindxs[i]]
j += 1
break
# Find next progenitor information
progen_indx = tree_pindx[progen_indx]
progen_snap = tree_psnapnum[progen_indx]
progen_haloid = tree_psnapid[progen_indx]
return backsplash_ids

As described here: http://api.h5py.org/, h5py uses cython code to interface with the HDF5 c code. So your own cython code might be able to access that directly. But I suspect that will require a lot more study.
Your code is using the Python interface to h5py, and cythonizing isn't going to touch that.
cython code is best used for low level actions, especially iterative things that can't be expressed as array operations. Study and experiment with the numpy examples first. You are diving into cython at the deep end of the pool.
Have you tried to improve that code just with Python and numpy? Just at glance I'm seeing a lot of redundant h5py calls.
====================
Your radial calculation accesses the h5py indexing 6 times when it could get by with 2. Maybe you wrote it that way in hopes that cython would preform the following calculation faster than numpy?
data = grp['HaloCatalog']
progen = data[progen_haloid-1, 6:9]
host = data[host_id-1, 6:9]
radial = np.sqrt((progren-host)**2).sum(axis=1))
Why not load all data[progen_haloid-1,:] and data[host_id-1,:]? Even all of data? I'd have to review when h5py switches from working directly with the arrays on the file and when they become numpy arrays. In any case, math on arrays in memory will be a lot faster than file reads.

Related

Fastest way to do loop over 2D arrays in Cython

I am trying to loop over 2 2d arrays in Cython. The arrays have the following shape:
ranges_1 is a 6000x3 array of int64, while ranges_2 is a 2000x2 of int64. This iteration needs to be perfomed around 10000 times. This would mean that the total number of calculations inside the nested for loop would be around 2000x6000x10000 = 120 billions times.
This is the code I am using to generate the "dummy" data:
import numpy as np
ranges_1 = np.stack([np.random.randint(0, 10_000, 6_000), np.random.randint(0, 10_000, 6_000), np.arange(0, 6_000)], axis=1)
ranges_2 = np.stack([np.random.randint(0, 10_000, 2_000), np.random.randint(0, 10_000, 2_000)], axis=1)
Which gives 2 arrays like these:
array([[6131, 1478, 0],
[9317, 7263, 1],
[7938, 6249, 2],
...,
[5153, 426, 5997],
[9164, 9211, 5998],
[1695, 1792, 5999]])
and:
array([[ 433, 558],
[3420, 2494],
[6367, 7916],
...,
[8693, 1692],
[1256, 9013],
[4096, 1860]])
The first implementation I tried is a "naive" version and it's the following (the function inside is just a test function which uses all the data in the array):
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t
def test_func(np.ndarray[DTYPE_t, ndim = 2] ranges_1, np.ndarray[DTYPE_t, ndim=2] ranges_2, int n ):
k = 0
for i in range(n):
for j in range(len(ranges_1)):
r1 = ranges_1[j]
a = r1[0]
b = r1[1]
c = r1[2]
for f in range(len(ranges_2)):
r2 = ranges_2[f]
d = r2[0]
e = r2[1]
k = (a + b + c + d + e)/(d+e)
return k
This takes about 5 seconds per each of the 10_000 outside loops.
So I then tried flatteing out the arrays and, since I know the dimension on the other axis, accessing the items like this:
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t
def test_func_flattened(np.ndarray[DTYPE_t, ndim = 1] ranges_1_, np.ndarray[DTYPE_t, ndim=1] ranges_2_, int n ):
k = 0
for i in range(n):
for j in range(0, len(ranges_1_), 3):
a = ranges_1_[j]
b = ranges_1_[j+1]
c = ranges_1_[j+2]
for f in range(0, len(ranges_2_), 2):
d = ranges_2_[f]
e = ranges_2_[f+1]
k = (a + b + c + d + e)/(d+e)
return k
But that provided no speed up at all. The time to perform one single iteration of the 10_000 seems too high, considered that for one single iteration it's just 12_000_000 operations inside the loop. I also tried implementing a much simpler example both in Cython and in python which was then compiled with numba:
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.int_t DTYPE_t
def test_1(int n ):
cdef k = 0
cdef a = 0
for i in range(n):
a = i +1
return a
This took 15s to run with n = 1_000_000_000.
While with numba:
def test_1_python(n ):
k = 0
a = 0
for i in range(n):
if i % 2 == 0:
a = a + 1
else:
a = a - 1
return a
test_1_numba= numba.jit(test_1_python)
%%time
test_1_numba(120_000_000_000)
The full run with n = 120bln took about 6s, (albeit the function inside is simpler) this means that it would be 500 times faster than Cython, could this be possible?
I am new to Cython so I probably am missing something obvious, but since the numba version (without the array accessing) is tha much faster I think that the difference in speed might come from the overhead associated with accessing the items in the array.
Is this a wrong assumption?
If not, what would be the best of going about looping over a 2D list of integers in Cython?

What you measure in your benchmarks is mainly compilation artefacts and overheads.
First of all, Cython use the default compiler installed on your machine preferred by the Python stack. On Linux, it should be GCC. On Windows, it is certainly MSVC if installed, otherwise MinGW (if any). Meanwhile, Numba is based on LLVM-Lite which is based on the LLVM stack like Clang. Thus, in your case, it is very likely that different compilers are used resulting in different binary with different performance. If you want to make a fair benchmark, you need to use Clang to build you Cython program.
Additionally, the default optimization for Cython is -O2 while it is -O3 for Numba. The former should not enable auto-vectorization while the later does (this is dependent of the target compilers -- newer version of GCC changes this behaviour). Furthermore, Cython does not enable machine-specific non-portable optimizations by default (since binaries may be packaged for other machines like with pip). This means Cython can only use the old SSE2 SIMD instruction set by default on x86-64 processors. Meanwhile, LLVM-JIT can use of the much faster AVX2/AVX-512 SIMD instruction set. You need to enable such optimization manually with Cython so for the benchmark to be fair (ie. -march=native on GCC/Clang).
In fact, on my x86-64 mainstream Intel machine, Numba does use the AVX2 instruction set and Cython does not on your last benchmark. Here is for example the main loop generated by the Numba JIT:
.LBB0_7:
vptestnmq %ymm5, %ymm4, %k1
vpblendmq %ymm5, %ymm6, %ymm18 {%k1}
vpaddq %ymm0, %ymm18, %ymm0
vpaddq %ymm1, %ymm18, %ymm1
vpaddq %ymm2, %ymm18, %ymm2
vpaddq %ymm3, %ymm18, %ymm3
vpaddq %ymm16, %ymm4, %ymm18
vptestnmq %ymm5, %ymm18, %k1
vpblendmq %ymm5, %ymm6, %ymm18 {%k1}
vpaddq %ymm0, %ymm18, %ymm0
vpaddq %ymm1, %ymm18, %ymm1
vpaddq %ymm2, %ymm18, %ymm2
vpaddq %ymm3, %ymm18, %ymm3
vpaddq %ymm17, %ymm4, %ymm4
addq $-2, %rdx
jne .LBB0_7
For the benchmark doing a = i + 1 in a loop, it is flawed as a good compiler can just optimize out the whole loop (ie. remove it) and replace it with only one assignment since only the last iteration matter. In fact, the same thing applies with k = (a + b + c + d + e)/(d+e): only the last iteration matters. The i variable of for i in range(n) is not even used. Clang and GCC often do such kind of optimizations.
Finally, the speed of your initial first code will be memory-bound if it is modified to compute something meaningful in a real-world use-case and multiple threads are used.
Note that division are very expensive and you can precompute the reciprocal so to perform multiplications instead in your main loop.

Issue converting a python program to cython for speedup

I am trying to write a fast algorithm in Cython. The algorithm is fairly straightforward and is described here (https://arxiv.org/pdf/1411.4357.pdf - the paragraph before Theorem 8 on page 17). The idea is relatively straightforward: assuming that the data is sparse it can be given as input in the form (i,j,A_ij) for a matrix A which has n rows and d columns. Now to take advantage of the sparsity we need two functions h which maps the n rows into s buckets uniformly at random (which is a parameter to the algorithm) and a sign function s which is just ±1 with equal probability. Then for every triple (i,j,A_ij) the algorithm must output (h(i),j, s(h(i))*A_ij) and return these as arrays to be given as input to another sparse.
The problem is that I can't get a speedup. This should run extremely fast and outperform matrix multiplication of realising S and multiplying by A as illustrated in the above reference.
Python approach (roughly 11ms):
import numpy as np
import numpy.random as npr
from scipy.sparse import coo_matrix
def countSketch_faster(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
observed_rows = set([])
sign_hashes = []
hash_table = {}
sketched_data = []
hashed_rows = []
for row_id, col_id, Aij in zip(input_matrix.row, input_matrix.col, input_matrix.data):
if row_id in observed_rows:
sketched_data.append(hash_table[row_id][1]*Aij)
hashed_rows.append(hash_row_val)
else:
hash_row_val, sign_val = np.random.choice(sketch_size), np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data.append(sign_val*Aij)
hashed_rows.append(hash_row_val)
observed_rows.add(row_id)
hashed_rows = np.asarray(hashed_rows)
sketched_data = np.asarray(sketched_data)
row_hashes = np.asarray(row_hashes)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col))).tocsr()
return S
Converting to Cython:
Analysing the annotated output highlights that the most python-heavy lines are those which call numpy. I have tried to cdef all variables and specify dtype for any call to numpy. Also, I have removed the zip command and tried to keep the loop simpler and more C-like. All of this has actually had an adverse effect though and it runs very slowly, and I'm not really sure why. It seems like quite a simple algorithm to implement so if someone could help me get the runtime down to something very small I would be extremely grateful.
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from scipy.sparse import coo_matrix
def countSketch_cython(input_matrix, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
hash_table = {}
sketched_data = np.zeros_like(input_matrix.data,dtype=np.float64)
hashed_rows = np.zeros_like(sketched_data,dtype=np.float64)
observed_rows = set([])
for idx in range(input_matrix.row.shape[0]):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = np.random.randint(low=0,high=sketch_size+1)
sign_val = np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)))
return S`
UPDATE: I have managed to speed the code up by removing some of the slower lines. These were calls to np.random for which the C random number generator was faster, giving the length in the input so this did not need to be calculated, and not converting to a sparse matrix before returning (for the purpose of this experiment I am interested in how quickly the transform can be done as opposed to details surrounding the conversion for downstream use).
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from libc.stdlib cimport rand
##cython.boundscheck(False) # these don't contribute much in this
example
##cython.wraparound(False)
def countSketch(input_matrix, input_matrix_length, sketch_size, seed=None):
'''
input_matrix: sparse - coo_matrix type
input_matrix_length - number of rows in input matrix
sketch_size: int
seed=None : random seed
'''
cdef Py_ssize_t idx
cdef int row_id
cdef float data_id
cdef float sketch_data_val
cdef int hash_row_value
cdef float sign_value
cdef int arr_lengths = input_matrix_length
# These two lines are still annotated most boldly
cdef double[:,:] sketched_data =
np.zeros((arr_lengths,1),dtype=np.float64)
cdef double[:,:] hashed_rows = np.zeros((arr_lengths,1))
hash_table = {}
observed_rows = set([])
for idx in range(arr_lengths):
row_id = input_matrix.row[idx]
data_id = input_matrix.data[idx]
if row_id in observed_rows:
hash_row_value = hash_table[row_id][0]
sign_value = hash_table[row_id][1]
sketched_data[row_id] = sign_value*data_id
hashed_rows[idx] = hash_row_value
else:
hash_row_val = rand()%(sketch_size)
#np.random.randint(low=0,high=sketch_size+1)
sign_val = 2*rand()%(2) - 1
#np.random.choice([-1.0,1.0])
hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
sketched_data[idx] = sign_val*data_id
hashed_rows[idx] = hash_row_value
observed_rows.add(row_id)
#S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)), dtype=np.float64)
return hashed_rows, sketched_data
On a random sparse matrix A = scipy.sparse.random(1000, 50, density=0.1) this now achieves 508 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) using %timeit countSketch(A, A.shape[0],100). I should imagine there are still gains to be made as I don't know if I have set the arrays in the best way.

Compile Scipy function with Cython

I am running a simulation in Python 3.4 - that involves a lot of dot products between a sparse array (in csr format) and a dense vector. I am using Scipy for the sparse matrix, numpy for everything else.
Using Cython gave me a massive boost (~x6 speed increase), after making sure that I cdef everything properly and after minimizing Python interaction (bt going through the html file that Cython gives me and modifying my code).
Now, I profile the code and 50% of the simulation time is spent on the line with the dot product. I am wondering if it is possible to somehow accelerate this line, say by complining this one dot function in Cython?
I know I could write my own implementation for (csr sprase 2d matrix) dot (dense vector), but I am trying to avoid that.
Edit: I have included a minimal example of the code. I am sorry, I can't see how to make it smaller. It is a textbook exercise in statistical mechanics. Place marbles in pots until one of the pots exceeds capacity. Then, start a cascade which propagates according to a (here sparse) matrix. I am using batch sampling.
Please focus on the line towards the end.
from __future__ import division
import numpy as np
import cython
cimport numpy as np
cimport cpython.array
import scipy.sparse as sps
#cython.cdivision(True)
#cython.nonecheck(False)
#cython.boundscheck(False)
#cython.wraparound(False)
def simulate(long[:] capacity_vec,
int random_array_size,
long n,
int seed,
int[:] A_col,
int[:] A_row,
long[:] A_data):
#### Initialise ####################################################################################################
# Initialise states
cdef int[:] damage = np.random.randint(0, int(np.min(capacity_vec)/2), n).astype(np.int32)
cdef int[:] dr_list = np.random.choice(n, 1000).astype(np.int32)
cdef int[:] states = np.zeros(n).astype(np.int32)
cdef int[:] states_ = np.zeros(n).astype(np.int32)
cdef int[:] change = np.zeros(n).astype(np.int32)
# Initialise counters
cdef int k, violations, violations_, counter= 0, dr_id=0, increment_index = 0
# Build Sparse Adjecency Matrix
cA_sps = sps.csr_matrix( (A_data, (A_row, A_col) ), shape=(n,n) ).astype(np.int32)
while counter < 1000:
#### Place damage until a cascade starts #######################################################################
while damage[increment_index] <= capacity_vec[increment_index]:# Check for violations
increment_index = dr_list[dr_id] # Where do we place the marble?
damage[increment_index] = damage[increment_index] + 1 # place the marble
dr_id = dr_id + 1 # another random number used
if dr_id == random_array_size - 1: # Check if we run out of random numbers
dr_list = np.random.choice(n, random_array_size).astype(np.int32) # if so, pick new increment_index
dr_id = 0 # and reset the counter
#### Initialise cascade ########################################################################################
violations, violations_ = 1, 0
states[increment_index] = 1
#### Propagate cascade #########################################################################################
while violations > violations_: # check for fixed point, propagate cascade
for k in range(n): change[k] = states[k] - states_[k]
### THIS LINE IS THE PROBLEM. It takes up half of all simulation time.
damage = damage + cA_sps.dot(change).astype(np.int32) # spread violations
states_ = states.copy() # store previous states
# Determine previous and current violations
violations, violations_ = 0 , violations
for k in range(n):
states_[k] = 0
if damage[k] > capacity_vec[k]:
violations = violations + 1
states[k] = 1 # deactivate any node that has a violation
for k in range(n): damage[k] = 0
counter = counter + 1 # progress cascade id after storing

I'd discourage from writing your own matrix multiplication. SciPy is done by smart people who know what they are doing, and unless your confident in numerical computing, just don't. Most of SciPy code is already compiled.
However, what you might look at is code for sparse.csr_matrix.dot. Getting into definition directly here and then here, you'll see that there are few checks done in Scipy. If you know what exact form you want, you could write your own method (modify your SciPy copy) and compute your product directly. Not sure how much that would help, though.
If you want to build Scipy yourself it is as easy as checking out whole project from GitHug and then running
python setup.py build
python setup.py install
For more direct instructions check build documentation.

Cython/numpy vs pure numpy for least square fitting

A T.A at school showed me this code as an example of a least square fitting algorithm.
import numpy as np
#return the coefficients (a0,..aN) of the fit y=a0+a1*x+..an*x^n
#with associated sigma dy
#x,y,dy are all np.arrays with dtype= np.float64
def fit_poly(x,y,dy,n):
V = np.asmatrix(np.diag(dy**2))
M = []
for k in range(n+1):
M.append(x**k)
M = np.asmatrix(M).T
theta = (M.T*V.I*M).I*M.T*V.I*np.asmatrix(y).T
cov_t = (M.T*V.I*M).I
return np.asarray(theta.T)[0], np.asarray(cov_t)
Im trying to optimize his codes using cython. i got this code
cimport numpy as np
import numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef poly_c(np.ndarray[np.float64_t, ndim=1] x ,
np.ndarray[np.float64_t, ndim=1] y np.ndarray[np.float64_t,ndim=1]dy , np.int n):
cdef np.ndarray[np.float64_t, ndim=2] V, M
V=np.asmatrix(np.diag(dy**2),dtype=np.float64)
M=np.asmatrix([x**k for k in range(n+1)],dtype=np.float64).T
return ((M.T*V.I*M).I*M.T*V.I*(np.asmatrix(y).T))[0],(M.T*V.I*M).I
But the runtime seems to be the same for both programs,i did used an 'assert' to make sure the outputs where the same. What am i missing/doing wrong?
Thank you for your time and hopefully you can help me.
ps: this is the code im profiling with(not sure if i can call this profiling but w/e)
import numpy as np
from polyC import poly_c
from time import time
from pancho_fit import fit_poly
#pancho's the T.A,sup pancho
x=np.arange(1,1000)
x=np.asarray(x,dtype=np.float64)
y=3*x+np.random.random(999)
y=np.asarray(y,dtype=np.float64)
dy=np.array([y.std() for i in range(1,1000)],dtype=np.float64)
t0=time()
a,b=poly_c(x,y,dy,4)
#a,b=fit_poly(x,y,dy,4)
print("time={}s".format(time()-t0))

Except for [x**k for k in range(n+1)] I don't see any iterations for cython to improve. Most of the action is in matrix products. Those are already done with compiled code (with np.dot for ndarrays).
And n is only 4, not many iterations.
But why iterate this?
In [24]: x=np.arange(1,1000.)
In [25]: M1=x[:,None]**np.arange(5)
# np.matrix(M1)
does the same thing.
So no, this does not look like a good cython candidate - not unless you are prepared to write out all those matrix products in compilable detail.
I'd skip also the asmatrix stuff and use regular dot, # and einsum, but that's more a matter of style than speed.

Vectorize a Newton method in Python/Numpy

I am trying to figure out if Python/Numpy is a viable alternative to develop my numerical software which is already available in C++. In order to get performance in Python/Numpy, one need to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize the code (I am not talking about SIMD instructions but "efficient Numpy code" without loops). Here is an algorithm that I want to get efficiently in Python/Numpy.
Create an numpy array containing: 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0
For every u in the array, compute the root of x^2 - u, using a Newton method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
Sum all the elements of the result array
Here is the algorithm in Python I want to speed up
import numpy as np
n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)
def newton(u):
x = 2.0
while True:
f = x**2 - u
df_dx = 2 * x
dx = f / df_dx
if (abs(dx) <= 1.0e-7):
break
x -= dx
return x
result = map(newton, data)
print result[n - 1]
Here is a version of the algorithm in C++11
#include <iostream>
#include <vector>
#include <cmath>
int main (int argc, char const *argv[]) {
auto n = std::size_t{100000000};
auto v = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
v[k] = 1.0 + static_cast<double>(k) / n;
}
auto result = std::vector<double>(n + 1);
for(size_t k = 0; k < v.size(); ++k) {
auto x = double{2.0};
while(true) {
auto f = double{x * x - v[k]};
auto df_dx = double{2 * x};
auto dx = double{f / df_dx};
if (std::abs(dx) <= 1.0e-7) {
break;
}
x -= dx;
}
result[k] = x;
}
auto somme = double{0.0};
for(size_t k = 0; k < result.size(); ++k) {
somme += result[k];
}
std::cout << somme << std::endl;
return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to make a fast Python/Numpy algorithm that does the same thing (I am willing to get something that is less than 5 times slower).
Thanks.

You can do step 1. with numpy efficiently:
1.0 + np.arange(n + 1) / n
however I think you would need the np.vectorize() method to feed back x into your calculated values and it's not an efficient function (basically a wrapper for a python loop). If you can use scipy then there are built in methods that might do what you want http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
EDIT: Having thought a bit more about this I followed up on #ev-br's point and tried some alternatives. The masking uses too much processing but the abs().max() is pretty fast so a compromise might be to "divide the problem into blocks" both in the 1st dimension of the array and in iteration direction. The following doesn't do too badly (< 20s) on my pretty low power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
n = 100000000
m = 5000000
block = 3
u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)
for i in range(0, n, m):
while np.abs(dx[i:i+m]).max() > 1.0e-7:
for j in range(block):
dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
x[i:i+m] -= dx[i:i+m]

Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, this is unlikely to be competetive with hand-written C++ code: First of all, you're manipulating python objects with all the overhead; then numpy is likely doing a bunch of array allocations under the hood.
An often viable strategy is to start by writing things in python/numpy, and then move bottlenecks into a compiled code --- eg Cython or C++ wrapped by Cython. In this particular case since you already have the C++ code, just wrapping it with Cython is likely easiest but YMMV.

I'm not looking to wave small snippets of code as a solution, but here's something to get you started. I have a strong suspicion that you're having troubles just declaring such an array in python without spending too much time on it, so I'll mostly help you out there.
As far as the square roots come in, please add your example python code and I'll see what I can help optimize from that point on. In my example roots and sums are found with the default numpy functions/methods.
def summing():
n = 1000000
ar = np.arange(0, n)
ar = ar/float(n)
ar = ar + np.ones(n)
sqrt = np.sqrt(ar)
return np.sum(ar)
In short, to get the starting array it's best to use a "workaround".
initialize an array ar with values `[1,2,3,....n]
divide ar with n. This gets us the 1/n, 2/n ... members
add to that an array of same dimensions that contain just the number 1.0
This gets us the full array [ 1., 1.000001, 1.000002, ..., 1.999998, 1.999999]) we're after. If I understood you right.
find square roots, sum it
Average of 10 sequential execution times is 0.018786 seconds.

Obviously I'm 6 years late to this party, but this question is a common stumbling block for people in effectively using numpy for real scientific work. The basic idea is covered in #ev-br's answer. The OP points out that the solution offered there (even modified to stop iterating when a convergence criterion is met rather than after a fixed number of iterations) takes the same number of passes for each element of u. I want to show how you can avoid that objection using pure numpy code, making explicit the mask suggestion in #ev-br's comment.
However, I also want to point out that in many real world situations, the number of passes for Newton-like iteration to converge varies so little that this general technique I illustrate here will actually slow numpy code down significantly. If the average number of iterations will be within a factor of two or three of the maximum number of iterations, you should stick with something closer to #ev-br's answer (including his first comment).
The numpy performance numbers you need to understand are these: Loops over array indices will run 200 to 500 times slower in pure numpy code than in compiled code. On the other hand, if you manage to use numpy's array syntax to avoid all index loops, you can get within about a factor of 5 of compiled speed. (The factor of 5 is partly because of memory management as #ev-br mentions, but also because optimized compiled code overlaps many different arithmetical operations inside each index loop, while numpy just performs a single arithmetic operation, storing everything back to memory after each operation.) The point is that factor of 100 difference means that it often pays to do substantial amounts of "extra" work in numpy code: Even if you do 10 times the number of floating point operations in vectorized numpy code, it will still run 10 times faster than the index-loop code that avoids the "extra" work. (Incidentally, the python map function is implemented as an interpreted index loop - it has nothing to do with numpy array operations.)
from numpy import asfarray, broadcast_arrays, arange
# Begin by defining the function to be inverted by Newton's method.
def f_dfdx(x):
x = asfarray(x) # always avoid repeated type conversions
return x**2, 2.*x
# First, the simplest algorithm to find x such that f(x)=y.
# We must supply a starting guess x0 for x.
def f_inverse0(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
x = x.copy() # without this may clobber input x0
for npass in range(20):
f, dfdx = f_dfdx(x)
dx = (f - y) / dfdx
if (abs(dx) <= tol).all():
break # iterate all x until all have converged
x -= dx
else:
raise RuntimeError("failed to converge")
return x
# A frequently slower algorithm that avoids extra iterations.
def f_inverse1(f_dfdx, y, x0, tol=1.e-7):
y, x = broadcast_arrays(asfarray(y), asfarray(x0))
shape = x.shape
y, x = y.ravel(), x.flatten() # avoid clobbering x0
unconverged = arange(y.size)
for npass in range(20):
f, dfdx = f_dfdx(x[unconverged])
dx = (f - y[unconverged]) / dfdx
unc = abs(dx) > tol
unconverged = unconverged[unc]
if not unconverged.size:
break # iterate all x until all have converged
x[unconverged] -= dx[unc]
else:
raise RuntimeError("failed to converge")
return x.reshape(shape)
On my machine, the OP's C++ program runs in 2.03 s (1.64+0.38 user+sys). For n=100 million as for the C++ program, f_inverse0 runs in 20.4 s (4.7+15.6 user+sys). As expected, f_inverse1 is slower, 51.3 s (11.5+39.8 user+sys). Again, don't automatically try to minimize total operation count when you are writing numpy code. The high system overhead is probably due to heavy memory management - every vector temporary is 0.8 GB and the memory manager is struggling.
Cutting the array size to n = 1 million elements (8 MB), then multiplying the runtime by 100 brings the system time down by a large factor, f_inverse0 now takes 16.1 s (12.5+3.6), while f_inverse1 takes 22.3 s (16.2+5.1). This factor of 8 to 10 slower than compiled code is not unreasonable to expect for numpy performance.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.