How is numpy.einsum implemented?

I want to understand how the einsum function in Python is implemented. I found the source code in the numpy/core/src/multiarray/einsum.c.src file but couldn't completely understand it. In particular, I want to understand how it creates the required loops automatically.
For example:
import numpy as np
a = np.random.rand(2,3,4,5)
b = np.random.rand(5,3,2,4)
# This should loop over all four indices i, j, k, l.
# How does it create loops for these indices automatically?
ll = np.einsum('ijkl, ljik ->', a, b)
# I assume that under the hood it does the following:
sum1 = 0
for i in range(2):
    for j in range(3):
        for k in range(4):
            for l in range(5):
                sum1 = sum1 + a[i,j,k,l]*b[l,j,i,k]
Thank you in advance.
PS: This question is not about how to use numpy.einsum.

I want to understand how it creates the required loops automatically.
Well, it does not create the loops the way you think it does. In this case, it creates an iterator operating over multiple arrays and then uses it in a generic main loop. In the more general case, there are two main loops: one to iterate over the output array items and one to perform a reduction.
The main function is PyArray_EinsteinSum. In your case, it takes an unoptimized path and ends up creating a basic iteration function based on the iterator created previously (i.e. iter). This function is get_sum_of_products_function. It basically analyzes the einsum operation in order to find the best (sum of products) function to call, based on a lookup table (like _outstride0_specialized_table). In your specific case, double_sum_of_products_outstride0_two is called. NumPy uses a template system to generate this function automatically at build time (*.c.src files are template files converted to *.c files based on special template comments). In this case, the function is generated from @name@_sum_of_products_outstride0_@noplabel@ and, once the template is expanded, it gives something like the following function:
static void double_sum_of_products_outstride0_two(int nop, char **dataptr,
                                                  npy_intp const *strides, npy_intp count)
{
    npy_double accum = 0;
    char *data0 = dataptr[0];
    npy_intp stride0 = strides[0];
    char *data1 = dataptr[1];
    npy_intp stride1 = strides[1];
    while (count--) {
        accum += (*(npy_double *)data0) * (*(npy_double *)data1);
        data0 += stride0;
        data1 += stride1;
    }
    *((npy_double *)dataptr[2]) = (accum + (*((npy_double *)dataptr[2])));
}
As you can see, there is only one main loop, iterating with the previously generated iterator. In your case, stride0 and stride1 are both equal to 8, data0 and data1 point into the raw input arrays, dataptr[2] points into the raw output, and count is initially set to 120. The fact that both strides are equal to 8 is surprising at first glance, since einsum does not iterate over the two arrays contiguously. This is because the second array is copied and reordered: NumPy cannot create a uniform view based on the einsum parameters.
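To make that single loop concrete, here is a rough Python sketch (not NumPy's actual code) of what it computes for the 'ijkl, ljik ->' example, assuming the second operand has been reordered so that both operands can be walked in the same element order:
import numpy as np
a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 3, 2, 4)
# Reorder b so its axes line up with a's (l,j,i,k -> i,j,k,l); NumPy does
# something similar internally by copying/reordering the second operand.
b_aligned = np.ascontiguousarray(b.transpose(2, 1, 3, 0))
accum = 0.0
for x, y in zip(a.ravel(), b_aligned.ravel()):  # count == 2*3*4*5 == 120
    accum += x * y
assert np.isclose(accum, np.einsum('ijkl, ljik ->', a, b))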
Note that the fallback case used for the example code is not particularly optimized, and it only produces one value. For example, the much more optimized double_sum_of_products_contig_contig_outstride0_two function can be called from unbuffered_loop_nop2_ndim2 for the following code:
import numpy as np
a = np.random.rand(3, 10)
b = np.random.rand(3, 10)
for i in range(1):
    ll = np.einsum('ij, ij -> i', a, b)
In this case, double_sum_of_products_contig_contig_outstride0_two performs the reduction for a given output item, and unbuffered_loop_nop2_ndim2 iterates over the output array.
If the expression ij, ij -> j is used in the above code instead, then the function double_sum_of_products_contig_two is called, which operates the same way as double_sum_of_products_contig_contig_outstride0_two except that it reads/writes the whole output line during the reduction.
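As a rough illustration in plain Python (again, not NumPy's code) of the difference between those two reduction shapes:
import numpy as np
a = np.random.rand(3, 10)
b = np.random.rand(3, 10)
# 'ij, ij -> i': one scalar reduction per output item (the outstride0 case)
out_i = np.zeros(3)
for i in range(3):
    out_i[i] = np.dot(a[i], b[i])
# 'ij, ij -> j': the reduction accumulates into the whole output line
out_j = np.zeros(10)
for i in range(3):
    out_j += a[i] * b[i]
assert np.allclose(out_i, np.einsum('ij, ij -> i', a, b))
assert np.allclose(out_j, np.einsum('ij, ij -> j', a, b))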

Related

Array operations using multiple indices of same array

I am very new to Python, and I am trying to get used to performing Python's array operations rather than looping through arrays. Below is an example of the kind of looping operation I am doing, but am unable to work out a suitable pure array operation that does not rely on loops:
import numpy as np

def f(arg1, arg2):
    # an arbitrary function
    ...

def myFunction(a1DNumpyArray):
    A = a1DNumpyArray
    # Create a square array with each dimension the size of the argument array.
    B = np.zeros((A.size, A.size))
    # Function f is a function of two elements of the 1D array. For each
    # element, i, I want to perform the function on it and every element
    # before it, and store the result in the square array, multiplied by
    # the difference between the ith and (i-1)th element.
    for i in range(A.size):
        B[i,:i] = f(A[i], A[:i])*(A[i]-A[i-1])
    # Sum through j and return full sums as 1D array.
    return np.sum(B, axis=0)
In short, I am integrating a function which takes two elements of the same array as arguments, returning an array of results of the integral.
Is there a more compact way to do this, without using loops?
The use of an arbitrary f function, and this [i, :i] business, complicates bypassing a loop.
Most of the fast compiled NumPy operations work on the whole array, or on whole rows and/or columns, and effectively do so in parallel. Loops that are inherently sequential (where the value from one iteration depends on the previous one) don't fit well. Different-sized lists or arrays in each iteration are also a good indicator that 'vectorizing' will be difficult.
for i in range(A.size):
    B[i,:i] = f(A[i], A[:i])*(A[i]-A[i-1])
With a sample A and a known f (as simple as arg1*arg2), I'd generate a B array and look for patterns that treat B as a whole. At first glance it looks like your B is lower triangular. There are functions to help index those. But that final sum might change the picture.
Sometimes I tackle these problems with a bottom-up approach, trying to remove inner loops first. But in this case, I think some sort of big-picture approach is needed.
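For example, here is a hedged sketch of such a whole-array version, assuming f broadcasts elementwise (arg1*arg2 is used as the stand-in; the mask reproduces the B[i, :i] lower-triangle pattern and np.roll reproduces A[i]-A[i-1] with the same wrap-around as the loop):
import numpy as np

def f(arg1, arg2):
    return arg1 * arg2  # stand-in for the arbitrary function

def myFunction_vectorized(A):
    # All pairs f(A[i], A[j]) at once via broadcasting.
    full = f(A[:, None], A[None, :])
    # Keep only columns j < i, matching B[i, :i] in the loop version.
    mask = np.tril(np.ones((A.size, A.size), dtype=bool), k=-1)
    B = np.where(mask, full, 0.0) * (A - np.roll(A, 1))[:, None]
    # Sum through j (axis 0), as in the original.
    return np.sum(B, axis=0)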

Is there an equivalent of Python's list and append feature in Matlab?

This is more of a Matlab programming question than it is a math question.
I'd like to run gradient descent multiple times with different learning rates. I have a set of learning rates
alpha = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001];
and each time I run gradient descent, I get a vector J_vals as output. However, I don't know Matlab well enough to know how to implement this besides doing something like:
[theta, J_vals] = gradientDescent(...., alpha(1),...);
J1 = J_vals;
[theta, J_vals] = gradientDescent(...., alpha(2),...);
J2 = J_vals;
and so on.
I thought about using a for loop, but then I don't know how I would deal with the J_vals's (not sure how to apply the for loop to J1, J2, and so on). Perhaps it would look something like this:
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i),...);
    J(i) = J_vals;
end
Then I would have a vector of vectors.
In Python, I would just run a for loop and append each new result to the end of a list. How do I implement something like this in Matlab? Or is there a more efficient way?
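For reference, the Python idiom being described looks roughly like this (gradientDescent is stood in by a placeholder):
import numpy as np

alphas = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001]
results = []
for lr in alphas:
    J_vals = np.random.rand(50)   # stand-in for gradientDescent(..., lr, ...)
    results.append(J_vals)
J = np.column_stack(results)      # one column of J values per learning rate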
If you know how many runs you are going to have and the size of J_vals (or at least a reasonable upper bound), I would suggest pre-allocating the container array,
J = zeros(n, 1);
then on each loop inserting the new values into the next free slots:
J(start:start+length(J_vals)-1) = J_vals;
That way you don't reallocate memory. If you don't know the sizes in advance, you can append the values to the array instead. For example,
J = []; % initialize
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i),...);
    J = [J; J_vals]; % append J_vals below the existing values
end
but this re-allocates the array on every iteration. If it's not too many loops, it should be OK.
Matlab's "cell arrays" are kind of like lists in Python. They are similar in that you can put variable datatypes into them. Nobody seems to be too sure, but most likely the cell array is implemented as an array of object pointers. That means that it is still somewhat expensive to append to it (cell_array{length(cell_array) + 1} = new_data), but at least you are only appending a pointer instead of the entire column. You would still have to convert the cell array to a normal matrix afterward using cell2mat.
The most idiomatic Matlab solution is to pre-allocate (as #dpmcmlxxvi suggested).
I think what you are describing is a really common use case, and it's unfortunate that Matlab requires such a verbose idiom for this. Also it's frustrating that the documentation is opaque on how cell arrays are implemented and whether it is expensive to append to a cell array.
Your solution works just fine as long as you add a : for the row subscript (assuming J_vals is a column vector):
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i),...);
    J(:, i) = J_vals;
    %//  ^... all rows, column 'i'
end
You could even put that as the return value:
for i = 1:length(alpha)
    [theta, J(:, i)] = gradientDescent(..., alpha(i),...);
    %//       ^... add returned value directly to our list
end
Both of these methods allow you to preallocate your matrix for a potential speed gain.
If you want to build your list as you go, you can use the method in #dpmcmlxxvi's answer, or you can use the special subscript end. Neither of these methods are compatible with preallocation, though.
for i = 1:length(alpha)
    [theta, J(:, end+1)] = gradientDescent(..., alpha(i),...);
    %//       ^... add new vector after the current end of the list
end
I would also suggest you not use i as a variable name in Matlab. I know it's natural in other languages, but in Matlab it shadows the built-in imaginary constant i.
See: https://stackoverflow.com/a/14790765/1377097

Cython function with variable sized matrix input

I am trying to convert part of a native Python function to Cython to improve the compute time. I would like to write a Cython function just for the loop component that is taking up the time (as IPython's lprun kindly told me). However, this function takes in variably sized matrices, and I can't see how to carry that over easily to statically typed Cython.
for index1 in range(0, num_products):
    for index2 in range(0, num_products):
        cond_prob = (data[index1] * data[index2]).sum() / max(col_sums[index1], col_sums[index2])
        prox[index1][index2] = cond_prob
The issue is that num_products changes from year to year, so the size of the matrix (data) is variable.
What is the best strategy here?
Should I write two C functions: one to create a matrix of a given dimension using malloc, and one to do the loops over the created matrix?
Or is there some fancy Cython/NumPy wizardry to help in this scenario? Can I write a C function that takes in a variably sized NumPy array and pass in its size?
Cython code is (strategically) statically typed, but that doesn't mean that arrays must have a fixed size. In straight C, passing a multidimensional array to a function can be a little awkward, but in Cython you should be able to do something like the following:
Note I took the function and variable names from your follow-up question.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.cdivision(True)
def cooccurance_probability_cy(double[:,:] X):
    cdef int P, M, i, j, k
    P = X.shape[0]
    M = X.shape[1]
    cdef double item
    cdef double [:] CS = np.sum(X, axis=1)
    cdef double [:,:] D = np.empty((P, P), dtype=np.float64)
    for i in range(P):
        for j in range(P):
            item = 0
            for k in range(M):
                item += X[i,k] * X[j,k]
            D[i,j] = item / max(CS[i], CS[j])
    return D
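As a usage sketch (not part of the original answer), one quick way to compile and call the function above is pyximport, assuming the code is saved in a file named cooccur.pyx (the module name is illustrative):
import numpy as np
import pyximport
pyximport.install(setup_args={'include_dirs': np.get_include()})

import cooccur  # hypothetical module holding the .pyx code above

X = np.random.rand(200, 50)
D = np.asarray(cooccur.cooccurance_probability_cy(X))  # memoryview -> ndarray
print(D.shape)  # (200, 200)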
On the other hand, using just NumPy should also be quite fast for this problem, if you use the right functions and some broadcasting. In fact, since the computation is dominated by the matrix multiplication, I found the following to be much faster than the Cython code above (np.inner uses a highly optimized BLAS routine):
def new(X):
    CS = np.sum(X, axis=1, keepdims=True)
    D = np.inner(X, X) / np.maximum(CS, CS.T)
    return D
Have you tried getting rid of the for loops in numpy?
For the first part of your expression you could, for example, try:
(data[np.newaxis, :] * data[:, np.newaxis]).sum(2)
If memory is an issue, you can also use the np.einsum() function.
For the second part, one could probably also cook up a NumPy expression (a bit more difficult) if you've not already tried that.
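Putting those hints together, a hedged sketch of a full NumPy version (variable names follow the question; the function name is illustrative) might look like:
import numpy as np

def cooccurrence_probability_np(data):
    col_sums = data.sum(axis=1)                  # one sum per product row
    numer = np.einsum('ik,jk->ij', data, data)   # pairwise dot products, memory-friendly
    denom = np.maximum(col_sums[:, None], col_sums[None, :])
    return numer / denom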

Why is accessing a sparse matrix costly?

I have a 1034-by-1034 sparse matrix (scipy.sparse.csr.csr_matrix), which basically represents the adjacency matrix of a graph. I want to check whether some elements are ones or not, but I found this to be a very slow operation. Without the if statement the code runs in 11 seconds, but when I enable the if check, it takes 40 seconds!
Here's my code snippet:
target = list()
for edge_id in edges_ids:
    v1_label, v2_label = from_edgeID_to_vertix_labels(edge_id)  # fast
    v1_index = g.get_v_index(v1_label)  # fast
    v2_index = g.get_v_index(v2_label)  # fast
    # if the following chunk is enabled, it becomes slow!
    if A[v1_index, v2_index] == 1:
        target.append(1)
    else:
        target.append(0)
g.target = target
The reason is quite likely to be the fact that fetching a single value from a sparse matrix in CSR (or CSC form), given indices (i, j), is very expensive. Algorithms for these sparse matrix representations aren't usually designed to do that: they're designed to use the indices they find as they go through the arrays sequentially.
In CSR, when you look up a row, you effectively get an array of column indices and the corresponding values. If you're fetching a single value, you have to do a linear search through the little array of column indices (unsorted in general) to see if it's there (otherwise the value is zero); if found, you then pick the value out of the value array and return it. It might look a bit like this ad-hoc C (this is intended to be illustrative):
/* Obviously silly CSR matrix typedef */
typedef struct sparse_s {
    int row[nrows+1];      /* row pointers: one entry per row, plus one */
    int col[nnz];          /* column index of each stored value */
    double value[nnz];     /* the stored values */
} sparse_s;

double spGetValue(sparse_s const* s, int i, int j)
{
    int k;
    for (k = s->row[i]; k < s->row[i+1]; k++) {
        if (j == s->col[k]) {
            return s->value[k];
        }
    }
    return 0.0;
}
So, if you average 10 elements per row, you have to search through a ten-element array for every single access. This is much less of a problem for algorithms like SpMV that use the column indices as they find them. If you implemented SpMV like dense matrix multiplication, fetching every value, it would be horribly, horribly slow even if you had some oracular magic way of skipping the zeros. If you think that's bad, inserting an element into a CSR/CSC matrix is so viciously expensive that it's (almost) never done.
In short, you might get better results by either reorganizing your code so that you're iterating over the three vectors of the CSR matrix directly or using a different sparse matrix representation for this particular problem.
The slowdown might well also be something more "Pythoney", but I wouldn't expect your code to perform well even in a best-case C scenario if the matrix representation and access method were retained.
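As a hedged sketch of the reorganization suggested above (not from the original answer), and assuming the stored nonzeros of the adjacency matrix are all 1: do one pass over the sparse structure to build a set of (row, col) pairs, then each per-edge check becomes an O(1) membership test.
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix((np.random.rand(1034, 1034) > 0.99).astype(np.int8))

# One-off conversion: COO exposes the (row, col) coordinates of the nonzeros.
coo = A.tocoo()
ones = set(zip(coo.row.tolist(), coo.col.tolist()))

def has_edge(v1_index, v2_index):
    return 1 if (v1_index, v2_index) in ones else 0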
In this case, you may be better off using a nested defaultdict:
from collections import defaultdict
A = defaultdict(lambda : defaultdict(int))
# Example of how to set an element in the adjacency matrix:
A[1][2] = 1
That does not support any of the matrix manipulations offered by NumPy or SciPy, but it should be fast for this particular use case.

Initialize Multiple Numpy Arrays (Multiple Assignment) - Like MATLAB deal()

I was unable to find anything describing how to do this, which leads me to believe I'm not doing it in the proper idiomatic Python way. Advice on the 'proper' Python way to do this would also be appreciated.
I have a bunch of variables for a datalogger I'm writing (arbitrary logging length, with a known maximum length). In MATLAB, I would initialize them all as 1-D arrays of zeros of length n, n bigger than the number of entries I would ever see, assign each individual element variable(measurement_no) = data_point in the logging loop, and trim off the extraneous zeros when the measurement was over. The initialization would look like this:
[dData gData cTotalEnergy cResFinal etc] = deal(zeros(n,1));
Is there a way to do this in Python/NumPy so I don't either have to put each variable on its own line:
dData = np.zeros(n)
gData = np.zeros(n)
etc.
I would also prefer not to just make one big matrix, because keeping track of which column is which variable is unpleasant. Perhaps the solution is to make the (length x numvars) matrix and assign the column slices out to individual variables?
EDIT: Assume I'm going to have a lot of vectors of the same length by the time this is over; e.g., my post-processing takes each log file, calculates a bunch of separate metrics (>50), stores them, and repeats until the logs are all processed. Then I generate histograms, means/maxes/sigmas/etc. for all the various metrics I computed. Since initializing 50+ vectors is clearly not easy in Python, what's the best (cleanest code and decent performance) way of doing this?
If you're really motivated to do this in a one-liner you could create an (n_vars, ...) array of zeros, then unpack it along the first dimension:
a, b, c = np.zeros((3, 5))
print(a is b)
# False
Another option is to use a list comprehension or a generator expression:
a, b, c = [np.zeros(5) for _ in range(3)] # list comprehension
d, e, f = (np.zeros(5) for _ in range(3)) # generator expression
print(a is b, d is e)
# False False
Be careful, though! You might think that using the * operator on a list or tuple containing your call to np.zeros() would achieve the same thing, but it doesn't:
h, i, j = (np.zeros(5),) * 3
print(h is i)
# True
This is because the expression inside the tuple gets evaluated first. np.zeros(5) therefore only gets called once, and each element in the repeated tuple ends up being a reference to the same array. This is the same reason why you can't just use a = b = c = np.zeros(5).
Unless you really need to assign a large number of empty array variables and you really care deeply about making your code compact (!), I would recommend initialising them on separate lines for readability.
Nothing wrong or un-Pythonic with
dData = np.zeros(n)
gData = np.zeros(n)
etc.
You could put them on one line, but there's no particular reason to do so.
dData, gData = np.zeros(n), np.zeros(n)
Don't try dData = gData = np.zeros(n), because a change to dData changes gData (they point to the same object). For the same reason you usually don't want to use x = y = [].
The deal function in MATLAB is a convenience, but it isn't magical. Here's how Octave implements it:
function [varargout] = deal (varargin)
  if (nargin == 0)
    print_usage ();
  elseif (nargin == 1 || nargin == nargout)
    varargout(1:nargout) = varargin;
  else
    error ("deal: nargin > 1 and nargin != nargout");
  endif
endfunction
In contrast to Python, in Octave (and presumably MATLAB)
one=two=three=zeros(1,3)
assigns different objects to the 3 variables.
Notice also how MATLAB talks about deal as a way of assigning contents of cells and structure arrays. http://www.mathworks.com/company/newsletters/articles/whats-the-big-deal.html
If you put your data in a collections.defaultdict you won't need to do any explicit initialization. Everything will be initialized the first time it is used.
import numpy as np
import collections
n = 100
data = collections.defaultdict(lambda: np.zeros(n))
for i in range(1, n):
    data['g'][i] = data['d'][i - 1]
    # ...
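Along the same lines, for the ">50 metrics" case in the EDIT, a hedged alternative (the names here are illustrative) is a plain dict comprehension over the metric names, which preallocates every array up front while keeping each one addressable by name:
import numpy as np

n = 1000
metric_names = ['dData', 'gData', 'cTotalEnergy', 'cResFinal']  # ... and so on
metrics = {name: np.zeros(n) for name in metric_names}

metrics['dData'][0] = 3.14   # indexed just like a standalone array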
How about using map:
import numpy as np
n = 10 # Number of data points per array
m = 3 # Number of arrays being initialised
gData, pData, qData = map(np.zeros, [n] * m)
