I am trying to convert part of a native python function to cython to improve the compute time. I would like to write a cython function just for the loop component that is taking up the time (as ipython lprun kindly told me). However this function takes in variably sized matrices .. and I can't see how to bring that across easily to statically typed cython.
for index1 in range(0,num_products):
    for index2 in range(0,num_products):
        cond_prob = (data[index1] * data[index2]).sum() / max(col_sums[index1], col_sums[index2])
        prox[index1][index2] = cond_prob
The issue is that num_products changes year to year, so the size of the matrix (data) is variable.
What is the best strategy here?
Should I write two C functions: one to create a matrix of a certain dimension using memalloc, and one to do the loops over the created matrix?
Is there some fancy cython/numpy wizardry to help in this scenario? Can I write a C function that takes in a variably sized Numpy Array in memory and pass the size?
Cython code is (strategically) statically typed, but that doesn't mean that arrays must have a fixed size. In straight C passing a multidimensional array to a function can be a little awkward maybe, but in Cython you should be able to do something like the following:
Note I took the function and variable names from your follow-up question.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.cdivision(True)
def cooccurance_probability_cy(double[:,:] X):
    cdef int P, K, i, j, k
    P = X.shape[0]  # number of rows (products)
    K = X.shape[1]  # number of columns, the axis that is summed over
    cdef double item
    cdef double [:] CS = np.sum(X, axis=1)
    cdef double [:,:] D = np.empty((P, P), dtype=np.float64)

    for i in range(P):
        for j in range(P):
            item = 0
            for k in range(K):
                item += X[i,k] * X[j,k]
            D[i,j] = item / max(CS[i], CS[j])
    return D
On the other hand, using just Numpy should also be quite fast for this problem, if you use the right functions and some broadcasting. In fact, as the calculation complexity is dominated by the matrix multiplication, I found the following is much faster than the Cython code above (np.inner uses a highly optimized BLAS routine):
def new(X):
    CS = np.sum(X, axis=1, keepdims=True)
    D = np.inner(X,X) / np.maximum(CS, CS.T)
    return D
Have you tried getting rid of the for loops in numpy?
For the first part of your equation you could, for example, try:
(data[ np.newaxis,:] * data[:,np.newaxis]).sum(2)
If memory is an issue you can also use the np.einsum() function.
For the second part one could probably also cook up a numpy expression (bit more difficult) if you've not already tried that.
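To make the loop-free suggestion concrete, here is a minimal sketch, not taken from the posts above. The toy `data` array and the helper names `prox_loops` and `prox_vectorized` are made up for illustration, and `col_sums` is assumed to hold one sum per product row. The numerator comes from `np.einsum`, which avoids the 3-D temporary that the broadcasting expression builds, and the denominator from `np.maximum.outer`:

```python
import numpy as np

def prox_loops(data, col_sums):
    """Reference implementation: the double loop from the question."""
    n = data.shape[0]
    prox = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            prox[i, j] = (data[i] * data[j]).sum() / max(col_sums[i], col_sums[j])
    return prox

def prox_vectorized(data, col_sums):
    """Same result without Python-level loops."""
    # numerator[i, j] = sum_k data[i, k] * data[j, k]
    numerator = np.einsum('ik,jk->ij', data, data)
    # denominator[i, j] = max(col_sums[i], col_sums[j])
    denominator = np.maximum.outer(col_sums, col_sums)
    return numerator / denominator

rng = np.random.default_rng(0)
data = rng.random((5, 7)) + 0.1   # toy input: 5 products, 7 observations
col_sums = data.sum(axis=1)       # assumption: one sum per product row

same = np.allclose(prox_loops(data, col_sums), prox_vectorized(data, col_sums))
```

Whether this beats the np.inner version will depend on the array sizes, so it is worth timing both on real data.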
I want to understand how the einsum function in NumPy is implemented. I found the source code in the numpy/core/src/multiarray/einsum.c.src file but couldn't completely understand it. In particular, I want to understand how it creates the required loops automatically.
For example:
import numpy as np
a = np.random.rand(2,3,4,5)
b = np.random.rand(5,3,2,4)
ll = np.einsum('ijkl, ljik ->', a,b) # This should loop over all
# four indices i,j,k,l. How does it create loops for these indices automatically?
# I assume that under the hood it does the following
sum1 = 0
for i in range(2):
    for j in range(3):
        for k in range(4):
            for l in range(5):
                sum1 = sum1 + a[i,j,k,l]*b[l,j,i,k]
Thank you in advance
ps: This question is not about how to use numpy.einsum
I want to understand how does it creates the required loops automatically?
Well, it does not create the loops the way you think it does. In this case, it creates an iterator operating over multiple arrays and then uses it in a generic main loop. In the more general case, there are two main loops: one to iterate over the output array items and one to perform a reduction.
The main function is PyArray_EinsteinSum. In your case, it takes an unoptimized path and ends up creating a basic iteration function based on the iterator created previously (i.e. iter). This function is get_sum_of_products_function. It basically analyzes the einsum operation to find the best sum-of-products function to call, based on a lookup table (like _outstride0_specialized_table). In your specific case, double_sum_of_products_outstride0_two is called. Numpy uses a template system to generate this function automatically at build time (*.c.src files are template files converted to *.c files based on predefined basic comments). In this case, the function is generated from #name#_sum_of_products_outstride0_#noplabel#, and once the template has been expanded it gives something like the following function:
static void double_sum_of_products_outstride0_two(int nop,
                                                  char **dataptr,
                                                  npy_intp const *strides,
                                                  npy_intp count)
{
    npy_double accum = 0;
    char *data0 = dataptr[0];
    npy_intp stride0 = strides[0];
    char *data1 = dataptr[1];
    npy_intp stride1 = strides[1];
    while (count--)
    {
        accum += (*(npy_double *)data0) * (*(npy_double *)data1);
        data0 += stride0;
        data1 += stride1;
    }
    *((npy_double *)dataptr[2]) = (accum + (*((npy_double *)dataptr[2])));
}
As you can see, there is only one main loop iterating over the previously generated iterator. In your case, stride0 and stride1 are both equal to 8, data0 and data1 point to the raw input arrays, dataptr[2] points to the raw output value and count is set to 120 initially. Note that the fact that both strides are equal to 8 is surprising at first glance, since the einsum does not iterate over the two arrays contiguously. This is because the second array is copied and reordered, as Numpy cannot create a uniform view based on the einsum parameters.
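A rough NumPy-level picture of what that single loop computes, under the assumption that the reordering described above amounts to a transposed contiguous copy of b (the names `b_reordered`, `flat_a` and `flat_b` are only illustrative; this mimics the arithmetic, not the actual C iterator):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3, 4, 5))
b = rng.random((5, 3, 2, 4))

# Reorder b so that b_reordered[i, j, k, l] == b[l, j, i, k], then copy it
# so both operands are contiguous and can be walked with the same stride.
b_reordered = np.ascontiguousarray(b.transpose(2, 1, 3, 0))

flat_a = a.ravel()            # 1-D view, 8-byte stride for float64
flat_b = b_reordered.ravel()  # 1-D view, 8-byte stride for float64

# One flat loop over count = 2*3*4*5 = 120 element pairs,
# mirroring the while (count--) loop in the C function.
accum = 0.0
for n in range(flat_a.size):
    accum += flat_a[n] * flat_b[n]

expected = np.einsum('ijkl,ljik->', a, b)
```

Both flat arrays have a stride of 8 bytes, which matches the stride0 and stride1 values mentioned above.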
Note that the fallback case used for the example code is not particularly optimized and it only produces one value. For example, the much more optimized double_sum_of_products_contig_contig_outstride0_two function can be called from unbuffered_loop_nop2_ndim2 for the following code:
import numpy as np
a = np.random.rand(3, 10)
b = np.random.rand(3, 10)
for i in range(1):
    ll = np.einsum('ij, ij -> i', a, b)
In this case, double_sum_of_products_contig_contig_outstride0_two performs the reduction for a given output item and unbuffered_loop_nop2_ndim2 iterates over the output array.
If the expression ij, ij -> j is used instead in the above code, then the function double_sum_of_products_contig_two is called, which operates the same way as double_sum_of_products_contig_contig_outstride0_two except that it reads/writes the whole output line during the reduction.
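The difference between the two access patterns can be illustrated at the NumPy level. This is only a sketch of the arithmetic, not of the C kernels named above, and the names `per_row` and `per_col` are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 10))
b = rng.random((3, 10))

# 'ij,ij->i': the summed index j is the contiguous (last) axis, so each
# output item is an independent dot product of two contiguous rows.
per_row = np.array([np.dot(a[i], b[i]) for i in range(a.shape[0])])

# 'ij,ij->j': the summed index i is the outer axis, so every input row is
# accumulated into the whole output line.
per_col = np.zeros(a.shape[1])
for i in range(a.shape[0]):
    per_col += a[i] * b[i]

row_ok = np.allclose(per_row, np.einsum('ij,ij->i', a, b))
col_ok = np.allclose(per_col, np.einsum('ij,ij->j', a, b))
```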
I've been experimenting with Numba lately, and here's something that I still cannot understand:
In a normal Python function with NumPy arrays you can do something like this:
# Subtracts two NumPy arrays and returns an array as the result
def sub(a, b):
    res = a - b
    return res
But, when you use Numba's @guvectorize decorator like so:
from numba import guvectorize
# Subtracts two NumPy arrays and returns an array as the result
@guvectorize(['void(float32[:], float32[:], float32[:])'],'(n),(n)->(n)')
def subT(a, b, res):
    res = a - b
The result is not even correct. Worse still, there are instances where it complains about "Invalid usage of [math operator] with [parameters]"
I am baffled. Even if I try this:
# Subtracts two NumPy arrays and returns an array as the result
@guvectorize(['void(float32[:], float32[:], float32[:])'],'(n),(n)->(n)')
def subTt(a, b, res):
    res = np.subtract(a,b)
The result is still incorrect. Considering that this is supposed to be a supported Math operation, I don't see why it doesn't work.
I know the standard way is like this:
# Subtracts two NumPy arrays and returns an array as the result
@guvectorize(['void(float32[:], float32[:], float32[:])'],'(n),(n)->(n)')
def subTtt(a, b, res):
    for i in range(a.shape[0]):
        res[i] = a[i] - b[i]
and this does work as expected.
But what is wrong with my way?
P/S This is just a trivial example to explain my problem, I don't actually plan to use @guvectorize just to subtract arrays :P
P/P/S I suspect it has something to do with how the arrays are copied to gpu memory, but I am not sure...
P/P/P/S This looked relevant but the function here operates only on a single thread right...
The correct way to write this is:
@guvectorize(['void(float32[:], float32[:], float32[:])'],'(n),(n)->(n)')
def subT(a, b, res):
    res[:] = a - b
What you tried doesn't work because of how assignment works in Python; this is not particular to numba.
name = expr rebinds name to the value of expr. It can never mutate the object that name originally referred to, as you could with, e.g., C++ references.
name[] = expr calls (in essence) name.__setitem__, which can be used to modify the object in place, as numpy arrays do. The slice [:] refers to the whole array.
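The same behaviour shows up in plain NumPy without numba. A small sketch (the helper names `rebind` and `fill` are made up for the example):

```python
import numpy as np

def rebind(a, b, res):
    # Rebinds the local name `res`; the caller's array is untouched.
    res = a - b

def fill(a, b, res):
    # Slice assignment writes into the caller's array in place.
    res[:] = a - b

a = np.array([5.0, 7.0, 9.0], dtype=np.float32)
b = np.array([1.0, 2.0, 3.0], dtype=np.float32)

out_rebind = np.zeros(3, dtype=np.float32)
rebind(a, b, out_rebind)    # out_rebind is still all zeros

out_fill = np.zeros(3, dtype=np.float32)
fill(a, b, out_fill)        # out_fill now holds a - b
```

In the guvectorize case the output buffer is allocated for you and passed in as the last argument, so only the in-place form can put results into it.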
I'm actually looking to speed up #2 of this code by as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to implement sparse matrix in Cython. Can somebody show how to / if it's possible to wrap it in Cython or perhaps Julia to make it faster?
#1) This part computes the u_dict dictionary, filled with unique strings, and then enumerates them.
import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix
full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() + train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i
shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)
def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
#2) I need to speed up this part
# train_full is a pandas dataframe with two columns w1 and w2 filled with strings
H = load_sparse_csr('matrix.npz')
correlation_train = []
for idx, row in train_full.iterrows():
    if idx%1000 == 0: print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray() # these vectors have a length of < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
While I contributed to "How to properly pass a scipy.sparse CSR matrix to a cython function?" quite some time ago, I doubt that cython is the way to go, especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.
And important parts of sparse code are already written with cython.
Without touching the cython issue I see a couple of problems.
H is defined twice:
H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')
That's either an oversight, or a failure to understand how Python variables are created and assigned. The 2nd assignment replaces the first; thus the first does nothing. In addition the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast it is the intended use of the lil format.
The 2nd expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.
You do have an iteration here - but over a Pandas dataframe:
for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()
Looks like you are picking a particular row of H based on a dictionary/array look up. Sparse matrix indexing is slow compared to dense matrix indexing. That is, if Ha = H.toarray() fits your memory then,
a_vec = Ha[id_1,:]
will be a lot faster.
Faster selection of rows (or columns) from a sparse matrix has been asked before. If you could work directly with the sparse data of a row I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.
How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
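For reference, the toarray step for a single CSR row only needs the three attribute arrays (in scipy these are H.data, H.indices and H.indptr). The following is a sketch with hand-built toy arrays; the helper name `csr_row_to_dense` is hypothetical, and whether it is actually faster than H[id_1].toarray() has not been measured here:

```python
import numpy as np

def csr_row_to_dense(data, indices, indptr, n_cols, row):
    """Densify one row using only the three CSR attribute arrays."""
    start, stop = indptr[row], indptr[row + 1]
    dense = np.zeros(n_cols, dtype=data.dtype)
    dense[indices[start:stop]] = data[start:stop]
    return dense

# Hand-built CSR arrays for the 3x4 matrix
# [[1, 0, 2, 0],
#  [0, 0, 0, 0],
#  [0, 3, 0, 4]]
data = np.array([1, 2, 3, 4], dtype=np.int8)
indices = np.array([0, 2, 1, 3])
indptr = np.array([0, 2, 2, 4])

a_vec = csr_row_to_dense(data, indices, indptr, 4, 0)
b_vec = csr_row_to_dense(data, indices, indptr, 4, 2)
corr = np.corrcoef(a_vec, b_vec)[0, 1]
```

Since the vectors here are 1-D, np.corrcoef gets plain vectors rather than the (1, n) arrays that toarray() returns.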
I have code that is working in python and want to use cython to speed up the calculation. The function that I've copied is in a .pyx file and gets called from my python code. V, C, train, I_k are 2-d numpy arrays and lambda_u, user, hidden are ints.
I don't have any experience in using C or cython. What is an efficient way to make this code faster?
Using cython -a for compiling shows me that the code is flawed, but how can I improve it? Using for i in prange (user_size, nogil=True):
results in Constructing Python slice object not allowed without gil.
How does the code have to be modified to harness the power of cython?
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def u_update(V, C, train, I_k, lambda_u, user, hidden):
    cdef int user_size = user
    cdef int hidden_dim = hidden
    cdef np.ndarray U = np.empty((hidden_dim,user_size), float)
    cdef int m = C.shape[1]
    for i in range(user_size):
        C_i = np.zeros((m, m), dtype=float)
        for j in range(m):
            C_i[j,j]=C[i,j]
        U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))
    return U
You are trying to use cython by diving into the deep end of the pool. You should start with something small, such as some of the numpy examples. Or even try to improve on np.diag.
i = 0
C_i = np.zeros((m, m), dtype=float)
for j in range(m):
    C_i[j,j]=C[i,j]
versus
C_i = np.diag(C[i,:])
Can you improve the speed of this simple expression? diag is not compiled, but it does perform an efficient indexed assignment.
res[:n-k].flat[i::n+1] = v
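To see that the three forms agree, here is a small check with a toy array standing in for C (the names `C_loop`, `C_diag` and `C_flat` are made up for the example, and the strided version is a simplified main-diagonal case of the expression above):

```python
import numpy as np

m = 4
C = np.arange(12, dtype=float).reshape(3, m)   # toy stand-in for C
i = 0

# Explicit loop from the question
C_loop = np.zeros((m, m), dtype=float)
for j in range(m):
    C_loop[j, j] = C[i, j]

# Library call
C_diag = np.diag(C[i, :])

# Strided assignment: in a flattened (m, m) array, the diagonal entries
# are m + 1 elements apart.
C_flat = np.zeros((m, m), dtype=float)
C_flat.flat[::m + 1] = C[i, :]
```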
But the real problem for cython is this expression:
U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))
np.dot is compiled. cython won't turn that into C code, nor will it consolidate all 5 dots into one expression. It also won't touch the inv. So at best cython will speed up the iteration wrapper, but it will still call this Python expression once per iteration of the outer loop.
My guess is that this expression can be cleaned up. Replacing the inner dots with einsum can probably eliminate the need for C_i. The inv might make 'vectorizing' the whole thing difficult. But I'd have to study it more.
But if you want to stick with the cython route, you need to transform that U expression into simple iterative code, without calls to numpy functions like dot and inv.
===================
I believe the following are equivalent:
np.dot(C_i,V.T)
C[i,:,None]*V.T
In:
np.dot(C_i,train[i,:].T)
if train is 2d, then train[i,:] is 1d, and the .T does nothing.
In [289]: np.dot(np.diag([1,2,3]),np.arange(3))
Out[289]: array([0, 2, 6])
In [290]: np.array([1,2,3])*np.arange(3)
Out[290]: array([0, 2, 6])
If I got that right, you don't need C_i.
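A quick numerical check of both equivalences, using random toy arrays with hypothetical sizes (V is taken to be hidden x m, and C and train to be users x m, as the shapes in the question imply):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, users = 3, 5, 4          # hypothetical sizes
V = rng.random((k, m))
C = rng.random((users, m))
train = rng.random((users, m))
i = 2

C_i = np.diag(C[i, :])

# diag(c) @ M scales row j of M by c[j]
left_dot = np.dot(C_i, V.T)
left_bcast = C[i, :, None] * V.T

# diag(c) @ v is an elementwise product
right_dot = np.dot(C_i, train[i, :].T)
right_bcast = C[i, :] * train[i, :]
```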
======================
Furthermore, these calculations can be moved outside the loop, with expressions like (not tested)
CV1 = C[:,:,None]*V.T # a 3d array
CV2 = C * train # 2d; row i is C[i,:]*train[i,:]
for i in range(user_size):
    U[:,i] = np.dot(np.linalg.inv(np.dot(V, CV1[i,...]) + lambda_u*I_k), np.dot(V, CV2[i,...]))
A further step is to move both np.dot(V,CV...) out of the loop. That may require np.matmul (@) or np.einsum. Then we will have
for i in range(user_size):
    I = np.linalg.inv(VCV1[i,...] + lambda_u*I_k)
    U[:,i] = np.dot(I, VCV2[i,...])
or even
for i in range(user_size):
    I[i,...] = np.linalg.inv(...) # if inv can't be vectorized
U = np.einsum(..., I, VCV2)
This is a rough sketch, and details will need to be worked out.
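One way the details could work out is sketched below, with random toy inputs. The einsum subscripts and the names `u_update_loop` and `u_update_vec` are my own choices rather than anything from the question, and the sketch relies on np.linalg.inv accepting a stack of matrices:

```python
import numpy as np

def u_update_loop(V, C, train, I_k, lambda_u):
    """Reference: the per-user loop from the question, without cython."""
    user_size = C.shape[0]
    U = np.empty((V.shape[0], user_size))
    for i in range(user_size):
        C_i = np.diag(C[i, :])
        A = np.dot(V, np.dot(C_i, V.T)) + lambda_u * I_k
        rhs = np.dot(V, np.dot(C_i, train[i, :]))
        U[:, i] = np.dot(np.linalg.inv(A), rhs)
    return U

def u_update_vec(V, C, train, I_k, lambda_u):
    """Same computation with the dots moved out of the loop."""
    # VCV1[i] = V @ diag(C[i]) @ V.T, shape (users, k, k)
    VCV1 = np.einsum('kj,ij,lj->ikl', V, C, V)
    # VCV2[i] = V @ (C[i] * train[i]), shape (users, k)
    VCV2 = np.einsum('kj,ij->ik', V, C * train)
    # inv works on the stack of (k, k) matrices in one call
    inv = np.linalg.inv(VCV1 + lambda_u * I_k)
    # U[:, i] = inv[i] @ VCV2[i]
    return np.einsum('ikl,il->ki', inv, VCV2)

rng = np.random.default_rng(0)
k, m, users = 3, 6, 5                 # hypothetical sizes
V = rng.random((k, m))
C = rng.random((users, m)) + 0.1      # positive weights keep A invertible
train = rng.random((users, m))
I_k = np.eye(k)
lambda_u = 0.5

U_loop = u_update_loop(V, C, train, I_k, lambda_u)
U_vec = u_update_vec(V, C, train, I_k, lambda_u)
```

For large problems, np.linalg.solve on the stacked systems may be preferable to forming the inverses, but that changes the expression from the question, so it is left out here.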
The first thing that comes to mind is that you haven't typed the function arguments and specified the data type and number of dimensions, like so:
def u_update(np.ndarray[np.float64_t, ndim=2] V,
             np.ndarray[np.float64_t, ndim=2] C,
             np.ndarray[np.float64_t, ndim=2] train,
             np.ndarray[np.float64_t, ndim=2] I_k,
             int lambda_u, int user, int hidden):
This will greatly speed up indexing with 2 indices like you do in the inner loop.
It's best to do this to the array U as well, although you are using slicing:
cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim,user_size), np.float64)
Next, you are redefining C_i, a large 2-D array every iteration of the outer loop. Also, you have not supplied any type information for it, which is a must if Cython is to offer any speedup. To fix this :
cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
for i in range(user_size):
    C_i.fill(0)
Here, we have defined it once (with type information), and reused the memory by filling with zeros instead of calling np.zeros() to make a new array every time.
Also, you might want to turn off bounds checking only after you have finished debugging.
If you need speedups in the U[:,i]=... step, you could consider writing another function with Cython to perform those operations using loops.
Do read this tutorial which should give you an idea of what to do when working with Numpy arrays in Cython and what not to do as well, and also to appreciate how much of a speedup you can get with these simple changes.
I have a scipy sparse matrix A and a (long) list of coordinates myrows=[i1,i2,...], mycols=[j1,j2,...]. I need a list of their values [A[i1,j1],A[i2,j2],...]. How can I do this quickly? A loop is too slow.
I've thought about cython.inline() (which I use in other places in my code) or weave, but I don't see how to use the sparse type efficiently in cython or C++. Am I missing something simple?
Currently I'm using a hack that seems inefficient and possibly wrong sometimes -- which I flag with an error message. Here is my badly written code. Note that it relies on the ordering of elements to be preserved under addition and assumes that the elements in myrows,mycols are in A.
import numpy as np
import scipy.sparse as sps
def getmatvals(A,myrows,mycols): #A is a coo_matrix
    B = sps.coo_matrix((range(1,1+A.nnz),(A.row,A.col)),shape=A.shape)
    T = sps.coo_matrix(([A.nnz+1]*len(myrows),(myrows,mycols)),shape=A.shape)
    G = B-T #signify myelements in G by negatives and others by 0's
    H = np.minimum([0]*A.nnz,G.data) #remove extra elements
    H = H[np.nonzero(H)]
    H = H + A.nnz
    return A.data[H]