I have several thousand 3x3 matrices, stored as an Nx9 NumPy array where N is the number of matrices, one per row. I want to calculate the eigenvalues and eigenvectors of each of them, i.e. e, v = np.linalg.eig(matrices[row_idx, :].reshape((3, 3))), applied to each row_idx in range(matrices.shape[0]).
So the function would look like:
def compute_eigen(matrices):
    e_list, v_list = [], []
    for i in range(matrices.shape[0]):
        e, v = np.linalg.eig(matrices[i, :].reshape((3, 3)))
        v = -v
        e_list.append(e)
        v_list.append(v)
    return e_list, v_list
Now, how do I make this run really fast? np.apply_along_axis is apparently just syntactic sugar for a Python for-loop, so it did not help much.
I tried looking into Numba and ran the same function with the @jit(nopython=True) decorator, which indeed dropped the execution time of this function from ~20s to ~2s for ~6000 matrices. However, is it possible to get it under 1s? Is there some smart way to do it (assuming no very powerful hardware, GPUs, etc.)? Thanks.
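For reference, here is a minimal sketch of a fully vectorized variant, assuming NumPy 1.8 or later (where np.linalg.eig broadcasts over a stack of matrices, much like the broadcasting of solve described in the related question below); compute_eigen_stacked is just a name made up for this sketch:

import numpy as np

def compute_eigen_stacked(matrices):
    # View the (N, 9) array as a stack of N 3x3 matrices and let
    # np.linalg.eig broadcast over the first axis in a single call.
    stacked = matrices.reshape(-1, 3, 3)
    e, v = np.linalg.eig(stacked)   # e: (N, 3), v: (N, 3, 3)
    return e, -v                    # mirrors the v = -v step above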
Related
I need to solve a large number of 3x3 symmetric, positive-definite systems with Python. So far, I did
res = numpy.zeros(n)
for k, obj in enumerate(data_array):
    # construct A, rhs, idx from obj
    res[idx] += numpy.linalg.solve(A, rhs)
This produces the correct result; however, it is also quite slow if n is large. (Well... Yeah.) Perhaps 3x3 isn't a problem size where calling solve() makes much sense.
Any hints?
In NumPy 1.8 and later, numpy.linalg.solve actually broadcasts. For numpy.linalg.solve(a, b), if b.ndim == a.ndim - 1, it will perform a broadcasted matrix-vector solve; otherwise, it'll do a broadcasted matrix-matrix solve. (This decision criterion isn't documented; I had to look at the source.)
If you can efficiently construct a stack of As and rhss, you can call solve once and avoid a Python loop.
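As a minimal sketch, assuming the As and rhss can be collected into lists first (A_list and rhs_list are hypothetical names, not from the question):

import numpy as np

A_stack = np.stack(A_list)      # shape (n, 3, 3)
rhs_stack = np.stack(rhs_list)  # shape (n, 3)

# b.ndim == a.ndim - 1 here, so NumPy does n broadcasted matrix-vector solves.
x = np.linalg.solve(A_stack, rhs_stack)   # shape (n, 3)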
I have a big 2D NumPy array, let's say 5M rows and 10 columns. I want to build a few more columns according to some stateful logic implemented using Numba @jitclass. Let's say there are 50 such new columns to create. The idea is to iterate over all the rows of 10 columns in a Numba @jit function, and for each row, apply each of my 50 "filters" to generate one new cell each. So:
Source1..Source10 Derived1..Derived50
[array of 10 inputs] [array of 50 outputs]
... 5 million rows like this ...
The problem is, I can't pass a list or tuple of my "filters" to a @jit(nopython=True) function, because they are not homogeneous:
@numba.jit(nopython=True)
def calc_derived(source, derived, filters):
    for srcidx, src in enumerate(source):
        for filtidx, filt in enumerate(filters):  # doesn't work
            derived[srcidx, filtidx] = filt.transform(src)
The above doesn't work because filters are a bunch of different classes. As far as I can tell, even making them derive from a common base class is not good enough.
I am left with the possibility of swapping the order of the loops and having the loop over the 50 filters outside of the @jit function, but this would mean the entire source dataset would be loaded 50 times instead of once, which is very wasteful.
Do you have a technique to work around the "homogeneous lists only" requirement of Numba?
You originally asked about doing this with a single function that loops over rows, and applies a list of filters to each row. A challenge with this approach is that numba needs to know or be able to infer the input/output types of each function. I'm not aware of a way to satisfy numba's requirement in this situation (which is not to say that none exists). If there were a way to do this, it could be a better solution (and I'd like to know what it is).
An alternative is to move the code that loops over rows into the filters themselves. Because the filters are numba functions, this should maintain speed. The function that applies the filters would no longer use numba; it would simply loop over the list of filters. But, because the number of filters is small relative to the size of the data matrix, hopefully this won't impact speed too severely. Because this function no longer uses numba, the 'heterogeneous list' issue would no longer be a problem.
This approach worked when I tested it (nopython mode is fine). In test cases, filters implemented as numba functions were 10-18x faster than filters implemented as class methods (even though classes were implemented as numba jitclasses; not sure what's going on there). To gain a bit of modularity, filters can be constructed as closures, so that similar filters can be defined using different parameters.
For example, here are filters that compute sums of powers. Given a matrix x, the filter operates over the columns of x, giving an output for each row. It returns a vector v, where v[i] = sum(x[i, :] ** power)
# filter constructor
def sumpow(power):
    @numba.jit(nopython=True)
    def run_filter(x):
        (nrows, ncols) = x.shape
        result = np.zeros(nrows)
        for i in range(nrows):
            for j in range(ncols):
                result[i] += x[i, j] ** power
        return result
    return run_filter
# define filters
sum1 = sumpow(1) # sum of elements
sum2 = sumpow(2) # sum of elements squared
# apply a single filter
v = sum2(x)
The function to apply multiple filters looks like this. The output of each filter is stacked into a column of the output.
def apply_filters(x, filters):
    result = np.empty((x.shape[0], len(filters)))
    for (i, f) in enumerate(filters):
        result[:, i] = f(x)
    return result
y = apply_filters(x, [sum1, sum2])
Timing results
Data matrix: random entries drawn from standard normal distribution, float64, 5 million rows x 10 columns. All methods tested using the same matrix.
Filters: sum2 filter above, repeated 20x in a list: [sum2, sum2, ...]
Timed using IPython's %timeit function, best of 3 runs
Numerical outputs of all methods agree
Numba function filters (as shown above): 2.25s
Numba jitclass filters: 28.3s
Pure NumPy (using vectorized ops, no loops): 8.64s
I imagine Numba might gain relative to NumPy for more complex filters.
To get a homogeneous list, you could construct a list of the transform functions of all filters. In this case, all list elements would have type method.
# filters = list of filters
transforms = [x.transform for x in filters]
Then pass transforms to calc_derived() instead of filters.
Edit:
On my system, it looks like numba will accept this, but only with nopython=False.
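For concreteness, a sketch of how the loop from the question might consume that transforms list (it reuses source, derived, and filters from the question, and per the note above appears to need nopython=False):

import numba

@numba.jit(nopython=False)
def calc_derived(source, derived, transforms):
    for srcidx, src in enumerate(source):
        for filtidx, transform in enumerate(transforms):
            # each transform is a bound method, so it is called directly
            derived[srcidx, filtidx] = transform(src)

transforms = [f.transform for f in filters]
calc_derived(source, derived, transforms)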
I'm lost when iterating over an ndarray with nditer.
Background
I am trying to compute the eigenvalues of 3x3 symmetric matrices for each point in a 3D array.
My data is a 4D array of shape [6,x,y,z], with the 6 values being the entries of the symmetric matrix at point (x, y, z), over a ~500x500x500 cube of float32.
I first used numpy's eigvalsh, but it's optimized for large matrices, while I can use analytical simplification for 3x3 symmetric matrices.
I then implemented Wikipedia's simplification, both as a function that takes a single matrix and computes its eigenvalues (then iterating naively with nested for loops), and as a vectorized NumPy version.
The problem is that inside my vectorized version, each operation creates an intermediate array the size of my data, which adds up to too much RAM used and a frozen PC.
I tried using numexpr etc.; it's still around 10 GB of usage.
What I'm trying to do
I want to iterate (using numpy's nditer) through my array so that for each matrix, I compute its eigenvalues. This would remove the need to allocate huge intermediate arrays, because we only calculate ~10 floats at a time.
Basically, I'm trying to replace the nested for loops with a single iterator.
I'm looking for something like this :
for a, b, c, d, e, f in np.nditer([symMatrix, eigenOut]):  # for each matrix in x,y,z
    # computing my output for this matrix
    eigenOut[...] = myLovelyEigenvalue(a, b, c, d, e, f)
The best I have so far is this :
for i in np.nditer([derived],[],[['readonly']],op_axes=[[1,2,3]]):
But this means that i takes all the values of the 4D array, instead of being a tuple of length 6.
I can't seem to get the hang of the nditer documentation.
What am I doing wrong ? Do you have any tips and tricks as to iterating over "all but one" axis ?
The point is to have an nditer that would outperform regular nested loops on iteration (once this works I'll change function calls, buffer iteration, etc., but so far I just want it to work ^^).
You don't really need np.nditer for this. A simpler way of iterating over all but the first axis is just to reshape into a [6, 500 ** 3] array, transpose it to [500 ** 3, 6], then iterate over the rows:
for (a, b, c, d, e, f) in symMatrix.reshape(6, -1).T:
    # do something involving a, b, c, d, e, f...
If you really want to use np.nditer then you would do something like this:
for (a, b, c, d, e, f) in np.nditer(symMatrix, flags=['external_loop'], order='F'):
    # do something involving a, b, c, d, e, f...
A potentially important thing to consider is that if symMatrix is C-order (row-major) rather than Fortran-order (column-major), then iterating over the first dimension may be significantly faster than iterating over the last 3 dimensions, since then you will be accessing adjacent blocks of memory. You might therefore want to consider switching to Fortran order.
I wouldn't expect a massive performance gain from either of these, since at the end of the day you're still doing all of your looping in Python and operating only on scalars rather than taking advantage of vectorization.
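If the real constraint is the RAM used by the fully vectorized analytic formula rather than iteration speed, another option (not part of the answer above) is to keep the vectorized routine but feed it fixed-size chunks of the flattened data. A rough sketch, where symMatrix is the question's array and vectorized_eigenvalues stands in for the OP's vectorized analytic routine (a hypothetical name):

import numpy as np

flat = symMatrix.reshape(6, -1)                  # (6, x*y*z)
out = np.empty((3, flat.shape[1]), dtype=np.float32)

chunk = 1_000_000                                # tune to the available RAM
for start in range(0, flat.shape[1], chunk):
    a, b, c, d, e, f = flat[:, start:start + chunk]
    # vectorized_eigenvalues returns a (3, chunk_length) block of eigenvalues
    out[:, start:start + chunk] = vectorized_eigenvalues(a, b, c, d, e, f)

eigenOut = out.reshape((3,) + symMatrix.shape[1:])   # back to (3, x, y, z)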
I have two boolean sparse square matrices of c. 80,000 x 80,000, generated from 12 MB of data (and am likely to have orders of magnitude larger matrices when I use GBs of data).
I want to multiply them (which would produce a triangular matrix; however, I don't get one, since I don't restrict the dot product to yield a triangular matrix).
I am wondering what the best way of multiplying them is (memory-wise and speed-wise) - I am going to do the computation on an m2.4xlarge AWS instance, which has >60GB of RAM. I would prefer to keep the calculation in RAM for speed reasons.
I appreciate that SciPy has sparse matrices and so does h5py, but have no experience in either.
What's the best option to go for?
Thanks in advance
UPDATE: sparsity of the boolean matrices is <0.6%
If your matrices are relatively empty it might be worthwhile encoding them as a data structure of the non-False values. Say a list of tuples describing the location of the non-False values. Or a dictionary with the tuples as the keys.
If you use e.g. a list of tuples you could use a list comprehension to find the items in the second list that can be multiplied with an element from the first list.
a = [(0, 0), (3, 7), (5, 2)]  # et cetera
b = ...  # idem
res = []
for r, c in a:
    # entries of b whose row index j matches c contribute a product at (r, k)
    res.extend((r, k) for j, k in b if j == c)
-- EDITED TO SATISFY BELOW COMMENT / DOWNVOTER --
You're asking how to multiply matrices fast and easy.
SOLUTION 1: This is a solved problem: use numpy. All these operations are easy in numpy, and since they are implemented in C, are rather blazingly fast.
http://www.numpy.org/
http://www.scipy.org
also see:
Very large matrices using Python and NumPy
http://docs.scipy.org/doc/scipy/reference/sparse.html
SciPy and NumPy have sparse matrices and matrix multiplication. It doesn't use much memory since (at least if I were writing it in C) it probably uses linked lists, and thus will only use the memory required for the sum of the datapoints, plus some overhead. And, it will almost certainly be blazingly fast compared to a pure Python solution.
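A minimal sketch of the SciPy route, using synthetic coordinates at a reduced size (the real matrices would be built from your own data, and n would be 80,000); integer ones are used so that the product's nonzero pattern equals the boolean matrix product:

import numpy as np
from scipy import sparse

n = 5_000                                   # 80,000 in the question
nnz = int(n * n * 0.006)                    # ~0.6% density, as in the update
rng = np.random.default_rng(0)

def random_bool_csr():
    # Ones placed at random (row, col) positions; duplicate positions are summed.
    rows = rng.integers(0, n, nnz)
    cols = rng.integers(0, n, nnz)
    return sparse.csr_matrix((np.ones(nnz, dtype=np.int32), (rows, cols)),
                             shape=(n, n))

a = random_bool_csr()
b = random_bool_csr()
c = (a @ b).astype(bool)   # sparse product; True wherever the boolean product is True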
SOLUTION 2
Another answer here suggests storing values as tuples of (x, y), presuming a value is False unless it exists, in which case it's True. An alternative to this is a numeric matrix with (x, y, value) tuples.
REGARDLESS: Multiplying these would be Nasty time-wise: find element one, decide which other array element to multiply by, then search the entire dataset for that specific tuple, and if it exists, multiply and insert the result into the result matrix.
SOLUTION 3 ( PREFERRED vs. Solution 2, IMHO )
I would prefer this because it's simpler / faster.
Represent your sparse matrix with a set of dictionaries. Matrix one is a dict with the element at (x, y) and value v being (with x1,y1, x2,y2, etc.):
matrixDictOne = { 'x1:y1' : v1, 'x2:y2': v2, ... }
matrixDictTwo = { 'x1:y1' : v1, 'x2:y2': v2, ... }
Since a Python dict lookup is O(1) (okay, not really, probably closer to log(n)), it's fast. This does not require searching the entire second matrix's data for element presence before multiplication. So, it's fast. It's easy to write the multiply and easy to understand the representations.
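A sketch of that multiplication; the only addition here is grouping the second matrix by row index first, so that the dict lookups actually avoid scanning all of matrixDictTwo for every element:

from collections import defaultdict

def multiply_sparse_dicts(m1, m2):
    # Index m2 by its row index, so for an element at (r, c) of m1 we can
    # jump straight to the elements of m2 whose row equals c.
    m2_by_row = defaultdict(dict)
    for key, v in m2.items():
        r, c = key.split(':')
        m2_by_row[r][c] = v

    result = {}
    for key, v1 in m1.items():
        r, c = key.split(':')
        for c2, v2 in m2_by_row.get(c, {}).items():
            out = r + ':' + c2
            result[out] = result.get(out, 0) + v1 * v2
    return result

product = multiply_sparse_dicts(matrixDictOne, matrixDictTwo)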
SOLUTION 4 (if you are a glutton for punishment)
Code this solution by using a memory-mapped file of the required size. Initialize a file with null values of the required size. Compute the offsets yourself and write to the appropriate locations in the file as you do the multiplication. Linux has a VMM which will page in and out for you with little overhead or work on your part. This is a solution for very, very large matrices that are NOT SPARSE and thus won't fit in memory.
Note this solves the complaint of the below complainer that it won't fit in memory. However, the OP did say sparse, which implies very few actual datapoints spread out in giant arrays, and Numpy / SciPy handle this natively and thus nicely (lots of people at Fermilab use Numpy / SciPy regularly, I'm confident the sparse matrix code is well tested).
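For Solution 4, np.memmap is one convenient way to get the memory-mapped file the answer describes, without computing byte offsets by hand; a reduced-size sketch (a float32 80,000 x 80,000 file would be roughly 25 GB):

import numpy as np

n = 5_000                                     # 80,000 in the question
# File-backed result matrix of the required size, initialized to zeros;
# the OS virtual-memory system pages blocks in and out as needed.
result = np.memmap('result.dat', dtype=np.float32, mode='w+', shape=(n, n))

# Write the result block by block instead of holding the whole n x n in RAM.
block_rows = 1_000
for start in range(0, n, block_rows):
    stop = min(start + block_rows, n)
    block = np.zeros((stop - start, n), dtype=np.float32)  # placeholder for real values
    result[start:stop, :] = block
result.flush()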
I had a pretty compact way of computing the partition function of an Ising-like model using itertools, lambda functions, and large NumPy arrays. Given a network consisting of N nodes and Q "states"/node, I have two arrays, h-fields and J-couplings, of sizes (N,Q) and (N,N,Q,Q) respectively. J is upper-triangular, however. Using these arrays, I have been computing the partition function Z using the following method:
# Set up lambda functions and iteration tuples of the form (A_1, A_2, ..., A_n)
iters = itertools.product(range(Q),repeat=N)
hf = lambda s: h[range(N),s]
jf = lambda s: np.array([J[fi, fj, s[fi], s[fj]]
                         for fi, fj in itertools.combinations(range(N), 2)]).flatten()
# Initialize and populate partition function array
pf = np.zeros(tuple([Q for i in range(N)]))
for it in iters:
    hterms = np.exp(hf(it)).prod()
    jterms = np.exp(-jf(it)).prod()
    pf[it] = jterms * hterms
# Calculates partition function
Z = pf.sum()
This method works quickly for small N and Q, say (N,Q) = (5,2). However, for larger systems (N,Q) = (18,3), this method cannot even create the pf array due to memory issues because it has Q^N nontrivial elements. Any ideas on how to either overcome this memory issue or how to alter the code to work on subarrays?
Edit: Made a small mistake in the definition of jf. It has been corrected.
You can avoid the large array just by initializing Z to 0 and incrementing it by jterms * hterms in each iteration. This still won't get you out of calculating and summing Q^N numbers, however. To do that, you probably need to figure out a way to simplify the partition function algebraically.
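Concretely, a sketch of that change, reusing N, Q, h, J, hf, and jf from the question:

import itertools
import numpy as np

# Accumulate Z directly instead of filling a Q**N-element array.
Z = 0.0
for it in itertools.product(range(Q), repeat=N):
    hterms = np.exp(hf(it)).prod()
    jterms = np.exp(-jf(it)).prod()
    Z += jterms * hterms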
Not sure what you are trying to compute, but I tested your code with ChrisB's suggestion, and jf will not work for Q=3.
Perhaps you shouldn't use a dense numpy array to encode your function? You could try sparse arrays or just straight Python with Numba compilation. This blogpost shows using Numba on the simple Ising model with good performance.