Efficient tensor contraction with Python - python

I have a piece of code with a bottleneck calculation involving tensor contractions. Let's say I want to calculate a tensor A_{i,j,k,l}(X) that has N ~ 10^5 non-zero entries for a single x \in X, where X represents a grid with M ~ 1000 points in total. For a single element of the tensor A, the rhs of the equation looks something like:
A_{ijkl}(M) = Sum_{m,n,p,q} S_{i,j, m,n }(M) B_{m,n,p,q}(M) T_{ p,q, k,l }(M)
In addition, the middle tensor B_{m,n,p,q}(M) is obtained by numerical convolution of arrays so that:
B_{m,n,p,q}(M) = ( L_{m,n} * F_{p,q} )(M)
where "*" is the convolution operator, and all tensors have approximately the same number of elements as A. My problem has to do with the efficiency of the sums: computing a single rhs of A takes a very long time given the complexity of the problem. I have a "keys" system, where each tensor element is accessed by its unique key combination ((p,q,k,l) for T, for example) taken from a dictionary. The dictionary entry for that key gives the Numpy array associated with it, and all operations (convolutions, multiplications...) are done using Numpy. I have seen that the most time-consuming part is actually the nested loop (I loop over all keys (i,j,k,l) of the A tensor, and for each key, a rhs like the one above needs to be computed). Is there any efficient way to do this? Consider that:
1) Using simple numpy arrays of 4 +1 D results in high memory usage, since all tensors are of type complex
2) I have tried several approaches: Numba is quite limited when working with dictionaries, and some important Numpy features that I need are not currently supported. For instance, under Numba numpy.convolve() only takes the first 2 arguments and does not accept the "mode" argument, which would considerably reduce the needed convolution interval; in this case I don't need the "full" output of the convolution
3) My most recent approach is trying to implement everything using Cython for this part... But this is quite time consuming as well as more error prone given the logic of the code.
Any ideas on how to deal with such complexity using Python?
Thanks!

You have to make your question a bit more precise, which includes a working code example of what you have already tried. It is, for example, unclear why you use dictionaries in these tensor contractions. Dictionary lookups look like a weird thing for this calculation, but maybe I didn't get the point of what you really want to do.
Tensor contraction actually is very easy to implement in Python (Numpy), there are methods to find the best way to contract the tensors and they are really easy to use (np.einsum).
Creating some data (this should be part of the question)
import numpy as np
import time
i=20
j=20
k=20
l=20
m=20
n=20
p=20
q=20
#I don't know what complex 2 means, I assume it is complex128 (real and imaginary part are in float64)
#size of all arrays is 1.6e5
Sum_=np.random.rand(m,n,p,q).astype(np.complex128)
S_=np.random.rand(i,j,m,n).astype(np.complex128)
B_=np.random.rand(m,n,p,q).astype(np.complex128)
T_=np.random.rand(p,q,k,l).astype(np.complex128)
The naive way
This code is basically the same as writing it in loops using Cython or Numba without calling BLAS routines (ZGEMM) or optimizing the contraction order -> 8 nested loops to do the job.
t1=time.time()
A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_)
print(time.time()-t1)
This results in a very slow runtime of about 330 seconds.
How to increase the speed by a factor of 7700
%timeit A=np.einsum("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
#42.9 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is this so much faster?
Let's have a look at the contraction path and the internals.
path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal")
print(path[1])
# Complete contraction: mnpq,ijmn,mnpq,pqkl->ijkl
# Naive scaling: 8
# Optimized scaling: 6
# Naive FLOP count: 1.024e+11
# Optimized FLOP count: 2.562e+08
# Theoretical speedup: 399.750
# Largest intermediate: 1.600e+05 elements
#--------------------------------------------------------------------------
#scaling current remaining
#--------------------------------------------------------------------------
# 4 mnpq,mnpq->mnpq ijmn,pqkl,mnpq->ijkl
# 6 mnpq,ijmn->ijpq pqkl,ijpq->ijkl
# 6 ijpq,pqkl->ijkl ijkl->ijkl
and
path=np.einsum_path("mnpq,ijmn,mnpq,pqkl",Sum_,S_,B_,T_,optimize="optimal",einsum_call=True)
print(path[1])
#[((2, 0), set(), 'mnpq,mnpq->mnpq', ['ijmn', 'pqkl', 'mnpq'], False), ((2, 0), {'n', 'm'}, 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True), ((1, 0), {'p', 'q'}, 'ijpq,pqkl->ijkl', ['ijkl'], True)]
Doing the contraction in multiple well-chosen steps reduces the required flops by a factor of 400. But that's not the only thing einsum does here. Just have a look at 'mnpq,ijmn->ijpq', ['pqkl', 'ijpq'], True), ((1, 0): the True stands for a BLAS contraction -> tensordot call -> (matrix-matrix multiplication).
Internally this looks basically as follows:
#consider X as a 4th order tensor {mnpq}
#consider Y as a 4th order tensor {ijmn}
X_=X.reshape(m*n,p*q) #-> just another view on the data (2D), costs almost nothing (no copy, just a view)
Y_=Y.reshape(i*j,m*n) #-> just another view on the data (2D), costs almost nothing (no copy, just a view)
res=np.dot(Y_,X_) #-> dot is just a wrapper for highly optimized BLAS functions, in case of complex128 ZGEMM
output=res.reshape(i,j,p,q) #-> just another view on the data (4D), costs almost nothing (no copy, just a view)
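The equivalence between the einsum step and the reshaped matrix product can be checked directly. This is a minimal sketch with small, arbitrarily chosen dimensions (not the 20-per-axis problem above); the names X and Y follow the comments in the snippet above, and the random data is only a stand-in.

```python
import numpy as np

# Small, arbitrary dimensions chosen only for illustration
i, j, m, n, p, q = 3, 4, 5, 6, 7, 8
rng = np.random.default_rng(0)
X = rng.random((m, n, p, q)) + 1j * rng.random((m, n, p, q))  # {mnpq}
Y = rng.random((i, j, m, n)) + 1j * rng.random((i, j, m, n))  # {ijmn}

# Reference: let einsum contract over m and n
ref = np.einsum("mnpq,ijmn->ijpq", X, Y)

# Same contraction as one matrix-matrix product on 2D views
X_ = X.reshape(m * n, p * q)
Y_ = Y.reshape(i * j, m * n)
out = np.dot(Y_, X_).reshape(i, j, p, q)

same = np.allclose(ref, out)
print(same)
```

The reshape works without copying here because the contracted indices (m, n) are adjacent and in the same order in both operands; otherwise a transpose would be needed first.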

Related

Optimzation of nested for loop in python

x_tsvd is a matrix with 4.6 million rows.
svd_tfidf is a matrix with 1862 rows.
Both matrices have the same number of columns (260).
I want to calculate the cosine similarity between each of the 4.6M rows of x_tsvd and each of the 1862 rows of svd_tfidf.
Is there any way I can optimize this so that it takes less time?
from numpy.linalg import norm
best_match=[]
keys=np.array(df_5M['file'])
values=np.array(df['file'])
for i in range(len(x_tsvd)):
    array_=[]
    for j in range (len(svd_tfidf)):
        cosine_similarity_=np.dot(x_tsvd[i],svd_tfidf[j])/(norm(x_tsvd[i])*norm(svd_tfidf[j]))
        array_.append(cosine_similarity_)
    index=np.array(array_).argsort()
    best_match.append({keys[i]:values[index][::-1][0:5]})
Update:
from numpy.linalg import norm
best_match=[]
#b=copy.copy(svd_tfidf)
keys=np.array(df_5M['file'])
values=np.array(df['file'])
#b=copy.copy(svd_tfidf)
for i in range(len(x_tsvd)):
    a=x_tsvd[i]
    b=svd_tfidf
    a_dot_b=np.sum(np.multiply(a,b),axis=1)
    norm_a=norm(a)
    norm_b=norm(b,axis=1)
    cosine_similarity_=a_dot_b/(norm_a*norm_b)
    index=np.argsort(cosine_similarity_)
    best_match.append({keys[i]:values[index][::-1][0:6]})
There are several issues in your code. First of all, norm(x_tsvd[i]) is recomputed len(svd_tfidf)=1862 times, while the expression can be moved to the parent loop. Furthermore, norm(svd_tfidf[j]) is recomputed len(x_tsvd)=4.6e6 times, while it can be precomputed once for all j values. Moreover, calling np.dot(x_tsvd[i],svd_tfidf[j]) in two nested loops is not efficient. You can use a big matrix multiplication instead: x_tsvd @ svd_tfidf.T. However, since the resulting matrix is huge (~64 GiB), it is reasonable to split x_tsvd into chunks of 512 to 4096 rows. Additionally, you can precompute the inverse of the norms, because multiplying by an inverse is generally significantly faster than dividing. np.argsort(tmp_matrix[i])[::-1][0:5] is not efficient, and argpartition can be used instead to only compute the 5 best items (as I pointed out in a comment on the previous answer, which advised you to use argsort). Note that a partition does not behave the same way as a sort if you care about equal items (i.e. a stable sort). There is no stable partitioning implementation available yet in Numpy.
In the end the optimized implementation should look like:
inv_norm_j = 1.0 / norm_by_line(svd_tfidf) # Horizontal vector
for chunk_start in range(0, len(x_tsvd), chunk_size):
    chunk_end = min(chunk_start + chunk_size, len(x_tsvd))
    x_tsvd_block = x_tsvd[chunk_start:chunk_end]
    inv_norm_i = 1.0 / norm_by_line(x_tsvd_block)[:,None] # Vertical vector
    tmp_matrix = (x_tsvd_block @ svd_tfidf.T) * inv_norm_i * inv_norm_j
    # np.sort orders the 5 selected indices by index value, not by similarity
    best_match_values = values[np.sort(np.argpartition(tmp_matrix, len(svd_tfidf)-5)[:,-5:])[:,::-1]]
    # Pure-Python part that can hardly be optimized
    for i in range(chunk_start, chunk_end):
        # best_match_values is indexed relative to the start of the chunk
        best_match.append({keys[i]: best_match_values[i - chunk_start]})
Where norm_by_line can be computed in a vectorized way (certainly with Scipy for example). Note that this is an untested draft and not code that you should trust completely and copy-paste blindly ;) .
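For readers who want something runnable, here is a small self-contained sketch of the chunked approach. It is an illustration under assumptions rather than the answer's exact code: norm_by_line is written here as a hypothetical NumPy helper, the function name top_k_cosine is made up, random arrays stand in for x_tsvd and svd_tfidf, and the k selected indices are ordered by similarity (most similar first).

```python
import numpy as np

def norm_by_line(a):
    # Hypothetical helper: Euclidean norm of each row
    return np.sqrt(np.einsum("ij,ij->i", a, a))

def top_k_cosine(x, y, k=5, chunk_size=1024):
    """For each row of x, return the indices of the k most similar rows
    of y, ordered from most to least similar."""
    inv_norm_y = 1.0 / norm_by_line(y)                 # shape (len(y),)
    out = np.empty((len(x), k), dtype=np.int64)
    for start in range(0, len(x), chunk_size):
        block = x[start:start + chunk_size]
        inv_norm_x = 1.0 / norm_by_line(block)[:, None]
        sim = (block @ y.T) * inv_norm_x * inv_norm_y  # cosine similarities
        # Unordered top-k per row, then order those k by similarity
        top = np.argpartition(sim, -k, axis=1)[:, -k:]
        order = np.argsort(np.take_along_axis(sim, top, axis=1), axis=1)[:, ::-1]
        out[start:start + len(block)] = np.take_along_axis(top, order, axis=1)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))   # stand-in for x_tsvd
y = rng.standard_normal((20, 8))   # stand-in for svd_tfidf
idx = top_k_cosine(x, y, k=5, chunk_size=16)
```

Only the k selected columns are sorted, so the extra ordering step costs very little compared with a full argsort of every row.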
Regarding the recent update (which is a code computing a different result), most optimizations are identical, but there is a big improvement you can make on np.sum(np.multiply(a,b),axis=1). Indeed, you can use np.einsum instead so as not to compute the large expensive temporary matrix: np.einsum('ij,ij->i', a, b) when a and b are both 2D, or np.einsum('j,ij->i', a, b) when a is a single row as in the update. It is 3 times faster on my machine.
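A quick way to convince yourself that the einsum form computes the same values is sketched below. Random arrays stand in for the real data. Because a is a single row in the updated code, the subscripts used for that case are 'j,ij->i'; 'ij,ij->i' is the row-wise dot product of two 2D arrays of equal shape.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(260)            # one row, like x_tsvd[i]
b = rng.standard_normal((1862, 260))    # like svd_tfidf

via_sum = np.sum(np.multiply(a, b), axis=1)   # builds a (1862, 260) temporary
via_einsum = np.einsum("j,ij->i", a, b)       # no large temporary
via_matvec = b @ a                            # matrix-vector product, same values

# Row-wise dot product of two 2D arrays of equal shape
a2 = rng.standard_normal(b.shape)
rowwise = np.einsum("ij,ij->i", a2, b)
```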

How to Efficiently Find the Indices of Max Values in a Multidimensional Array of Matrices using Pytorch and/or Numpy

Background
It is common in machine learning to deal with data of a high dimensionality. For example, in a Convolutional Neural Network (CNN) the dimensions of each input image may be 256x256, and each image may have 3 color channels (Red, Green, and Blue). If we assume that the model takes in a batch of 16 images at a time, the dimensionality of the input going into our CNN is [16,3,256,256]. Each individual convolutional layer expects data in the form [batch_size, in_channels, in_y, in_x], and all of these quantities often change layer-to-layer (except batch_size). The term we use for the matrix made up of the [in_y, in_x] values is feature map, and this question is concerned with finding the maximum value, and its index, in every feature map at a given layer.
Why do I want to do this? I want to apply a mask to every feature map, and I want to apply that mask centered at the max value in each feature map, and to do that I need to know where each max value is located. This mask application is done during both training and testing of the model, so efficiency is vitally important to keep computational times down. There are many Pytorch and Numpy solutions for finding singleton max values and indices, and for finding the maximum values or indices along a single dimension, but no (that I could find) dedicated and efficient built-in functions for finding the indices of maximum values along 2 or more dimensions at a time. Yes, we can nest functions that operate on a single dimension, but these are some of the least efficient approaches.
What I've Tried
I've looked at this Stackoverflow question, but the author is dealing with a special-case 4D array which is trivially squeezed to a 3D array. The accepted answer is specialized for this case, and the answer pointing to TopK is misguided because it not only operates on a single dimension, but would necessitate that k=1 given the question asked, thus devolving to a regular torch.max call.
I've looked at this Stackoverflow question, but this question, and its answer, focus on looking through a single dimension.
I have looked at this Stackoverflow question, but I already know of the answer's approach as I independently formulated it in my own answer here (where I amended that the approach is very inefficient).
I have looked at this Stackoverflow question, but it does not satisfy the key part of this question, which is concerned with efficiency.
I have read many other Stackoverflow questions and answers, as well as the Numpy documentation, Pytorch documentation, and posts on the Pytorch forums.
I've tried implementing a LOT of varying approaches to this problem, enough that I have created this question so that I can answer it and give back to the community, and anyone who goes looking for a solution to this problem in the future.
Standard of Performance
If I am asking a question about efficiency I need to detail expectations clearly. I am trying to find a time-efficient solution (space is secondary) for the problem above without writing C code/extensions, and which is reasonably flexible (hyper specialized approaches aren't what I'm after). The approach must accept an [a,b,c,d] Torch tensor of datatype float32 or float64 as input, and output an array or tensor of the form [a,b,2] of datatype int32 or int64 (because we are using the output as indices).
Solutions should be benchmarked against the following typical solution:
max_indices = torch.stack([torch.stack([(x[k][j]==torch.max(x[k][j])).nonzero()[0] for j in range(x.size()[1])]) for k in range(x.size()[0])])
The Approach
We are going to take advantage of the Numpy community and libraries, as well as the fact that Pytorch tensors and Numpy arrays can be converted to/from one another without copying or moving the underlying arrays in memory (so conversions are low cost). From the Pytorch documentation:
Converting a torch Tensor to a Numpy array and vice versa is a breeze. The torch Tensor and Numpy array will share their underlying memory locations, and changing one will change the other.
Solution One
We are first going to use the Numba library to write a function that will be just-in-time (JIT) compiled upon its first usage, meaning we can get C speeds without having to write C code ourselves. Of course, there are caveats to what can get JIT-ed, and one of those caveats is that we work with Numpy functions. But this isn't too bad because, remember, converting from our torch tensor to Numpy is low cost. The function we create is:
import numpy as np
from numba import njit
@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
This function is from another Stackoverflow answer located here (this was the answer which introduced me to Numba). The function takes an N-Dimensional Numpy array and looks for the first occurrence of a given item. It immediately returns the index of the found item on a successful match. The @njit decorator is short for @jit(nopython=True), and tells the compiler that we want it to compile the function using no Python objects, and to throw an error if it is not able to do so (Numba is the fastest when no Python objects are used, and speed is what we are after).
With this speedy function backing us, we can get the indices of the max values in a tensor as follows:
import numpy as np
x = x.numpy()
maxVals = np.amax(x, axis=(2,3))
max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
for index in np.ndindex(x.shape[0],x.shape[1]):
    max_indices[index] = np.asarray(indexFunc(x[index], maxVals[index]),dtype=np.int64)
max_indices = torch.from_numpy(max_indices)
We use np.amax because it can accept a tuple for its axis argument, allowing it to return the max values of each 2D feature map in the 4D input. We initialize max_indices with np.zeros ahead of time because appending to numpy arrays is expensive, so we allocate the space we need ahead of time. This approach is much faster than the Typical Solution in the question (by an order of magnitude), but it also uses a for loop outside the JIT-ed function, so we can improve...
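To make the tuple-axis behaviour of np.amax concrete, here is a tiny sketch on a random array; the shape is arbitrary and only meant to mimic [batch, channels, in_y, in_x].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 3, 4, 5)).astype(np.float32)  # [batch, channels, in_y, in_x]

# Reduce over the last two axes at once: one maximum per feature map
max_vals = np.amax(x, axis=(2, 3))
print(max_vals.shape)
```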
Solution Two
We will use the following solution:
from numba import njit, prange
@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    raise RuntimeError
@njit(cache=True, parallel=True)
def indexFunc2(x,maxVals):
    max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            max_indices[i,j] = np.asarray(indexFunc(x[i,j], maxVals[i,j]),dtype=np.int64)
    return max_indices
x = x.numpy()
maxVals = np.amax(x, axis=(2,3))
max_indices = torch.from_numpy(indexFunc2(x,maxVals))
Instead of iterating through our feature maps one-at-a-time with a for loop, we can take advantage of parallelization using Numba's prange function (which behaves exactly like range but tells the compiler we want the loop to be parallelized) and the parallel=True decorator argument. Numba also parallelizes the np.zeros function. Because our function is compiled Just-In-Time and uses no Python objects, Numba can take advantage of all the threads available in our system! It is worth noting that there is now a raise RuntimeError in indexFunc. We need to include this, otherwise the Numba compiler will try to infer the return type of the function and infer that it will either be an array or None. This doesn't jibe with our usage in indexFunc2, so the compiler would throw an error. Of course, from our setup we know that indexFunc will always return an array, so we can simply raise an error in the other logical branch.
This approach is functionally identical to Solution One, but changes the iteration using np.ndindex into two for loops using prange. This approach is about 4x faster than Solution One.
Solution Three
Solution Two is fast, but it is still finding the max values using regular Python. Can we speed this up using a more comprehensive JIT-ed function?
@njit(cache=True)
def indexFunc(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    raise RuntimeError
@njit(cache=True, parallel=True)
def indexFunc3(x):
    maxVals = np.zeros((x.shape[0],x.shape[1]),dtype=np.float32)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxVals[i][j] = np.max(x[i][j])
    max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            max_indices[i,j] = np.asarray(indexFunc(x[i,j], maxVals[i,j]),dtype=np.int64)
    return max_indices
max_indices = torch.from_numpy(indexFunc3(x))
It might look like there is a lot more going on in this solution, but the only change is that instead of calculating the maximum values of each feature map using np.amax, we have now parallelized the operation. This approach is marginally faster than Solution Two.
Solution Four
This solution is the best I've been able to come up with:
@njit(cache=True, parallel=True)
def indexFunc4(x):
    max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            # Divide by the row length x.shape[3]; equal to x.shape[2] for square feature maps
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices
max_indices = torch.from_numpy(indexFunc4(x))
This approach is more condensed and also the fastest at 33% faster than Solution Three and 50x faster than the Typical Solution. We use np.argmax to get the index of the max value of each feature map, but np.argmax only returns the index as if each feature map were flattened. That is, we get a single integer telling us which number the element is in our feature map, not the indices we need to be able to access that element. The math [maxTemp // x.shape[3], maxTemp % x.shape[3]] turns that single int into the [row,column] pair that we need. The divisor is the row length (the number of columns, x.shape[3]); for square feature maps such as the 64x64 inputs used in the benchmark it is the same as x.shape[2].
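The flat-index arithmetic can be checked without Numba. The sketch below is a NumPy-only illustration, not the JIT-ed solution benchmarked here: it flattens each feature map, takes one argmax per map, and converts the flat index back to [row, column]. Non-square maps are used on purpose to show that the divisor must be the row length.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 3, 4, 6)).astype(np.float32)   # non-square feature maps
a, b, h, w = x.shape

# Flatten each feature map and take one argmax per map
flat = x.reshape(a, b, h * w).argmax(axis=2)

# A flat index f in a row-major (h, w) map corresponds to (f // w, f % w)
rows, cols = np.divmod(flat, w)
max_indices = np.stack([rows, cols], axis=-1)     # shape (a, b, 2)

# np.unravel_index performs the same conversion
r2, c2 = np.unravel_index(flat, (h, w))
```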
Benchmarking
All approaches were benchmarked together against a random input of shape [32,d,64,64], where d was incremented from 5 to 245. For each d, 15 samples were gathered and the times were averaged. An equality test ensured that all solutions provided identical values. An example of the benchmark output is:
A plot of the benchmarking times as d increased is (leaving out the Typical Solution so the graph isn't squashed):
Woah! What is going on at the start with those spikes?
Solution Five
Numba allows us to produce Just-In-Time compiled functions, but it doesn't compile them until the first time we use them; It then caches the result for when we call the function again. This means the very first time we call our JIT-ed functions we get a spike in compute time as the function is compiled. Luckily, there is a way around this- if we specify ahead of time what our function's return type and argument types will be, the function will be eagerly compiled instead of compiled just-in-time. Applying this knowledge to Solution Four we get:
@njit('i8[:,:,:](f4[:,:,:,:])',cache=True, parallel=True)
def indexFunc4(x):
    max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            # Divide by the row length x.shape[3]; equal to x.shape[2] for square feature maps
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices
max_indices6 = torch.from_numpy(indexFunc4(x))
And if we restart our kernel and rerun our benchmark, we can look at the first result where d==5 and the second result where d==10 and note that all of the JIT-ed solutions were slower when d==5 because they had to be compiled, except for Solution Four, because we explicitly provided the function signature ahead of time:
There we go! That's the best solution I have so far for this problem.
EDIT #1
Solution Six
An improved solution has been developed which is 33% faster than the previously posted best solution. This solution only works if the input array is C-contiguous, but this isn't a big restriction since numpy arrays and torch tensors are usually contiguous unless they are views produced by operations such as transposing or strided slicing, and both have functions to make the array/tensor contiguous if needed.
This solution is the same as the previous best, but the function decorator, which specifies the input and return types, is changed from
@njit('i8[:,:,:](f4[:,:,:,:])',cache=True, parallel=True)
to
@njit('i8[:,:,::1](f4[:,:,:,::1])',cache=True, parallel=True)
The only difference is that the last : in each array typing becomes ::1, which signals to the numba njit compiler that the input arrays are C-contiguous, allowing it to better optimize.
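Whether an array satisfies the C-contiguity requirement can be checked before calling the function. The following is a small NumPy sketch with an arbitrary shape; on the torch side, Tensor.is_contiguous() and Tensor.contiguous() play the same roles.

```python
import numpy as np

x = np.zeros((4, 3, 8, 8), dtype=np.float32)
fresh = x.flags["C_CONTIGUOUS"]          # freshly allocated arrays are C-contiguous

xt = x.transpose(0, 1, 3, 2)             # a transposed view is not
transposed = xt.flags["C_CONTIGUOUS"]

xc = np.ascontiguousarray(xt)            # copies into C order when needed
fixed = xc.flags["C_CONTIGUOUS"]

print(fresh, transposed, fixed)
```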
The full solution six is then:
@njit('i8[:,:,::1](f4[:,:,:,::1])',cache=True, parallel=True)
def indexFunc5(x):
    max_indices = np.zeros((x.shape[0],x.shape[1],2),dtype=np.int64)
    for i in prange(x.shape[0]):
        for j in prange(x.shape[1]):
            maxTemp = np.argmax(x[i][j])
            # Divide by the row length x.shape[3]; equal to x.shape[2] for square feature maps
            max_indices[i][j] = [maxTemp // x.shape[3], maxTemp % x.shape[3]]
    return max_indices
max_indices7 = torch.from_numpy(indexFunc5(x))
The benchmark including this new solution confirms the speedup:

Instantiate large sparse matrices for assignment operation

If I want to instantiate a large boolean sparse matrix to assign values at certain indices later, what's the best way to initialize it?
For example, if I want to initialize a 20000000 X 7000 logical sparse matrix on MATLAB with a 10000 filled elements (without mentioning the location of non-zero elements), I would use the following syntax:
Matrix=logical(sparse([],[],[],20000000,7000,10000))
I have no speed constraints on assigning the non-zero values later.
On Python, if I initialize it as a CSR matrix, the creation of the matrix is very fast.
Matrix=csr_matrix((20000000, 7000), dtype=bool)
CPU times: user 860 µs, sys: 2.43 ms, total: 3.29 ms
Wall time: 9.72 ms
However, I cannot efficiently assign values to a csr_matrix: the operation is very slow and triggers the built-in efficiency warning.
If I try initializing it as a LIL matrix:
Matrix=lil_matrix((20000000, 7000), dtype=bool)
CPU times: user 12.4 s, sys: 624 ms, total: 13 s
Wall time: 13 s
or converting the csr_matrix to a lil_matrix:
Matrix=csr_matrix((20000000, 7000), dtype=bool)
Matrix=Matrix.tolil()
CPU times: user 26.8 s, sys: 734 ms, total: 27.5 s
Wall time: 27.5 s
The initialization takes significantly more time.
Is there any way to speed up the initialization of the LIL matrix? If not, what sparse matrix format can I use to speed up the assignment of non-zero elements to such matrices?
If you need general incremental indexed-access, dok_matrix is probably your best bet.
It's common to use this one for construction (where it can shine in some cases) before converting to something else like csc, csr (which are usually needed for algebraic operations).
Edit: Most of the below stuff focuses on the accumulated time needed for init + filling + whatever is done after that.
In terms of your case: dok_matrix init should be quite instant.
...
Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.
That being said, it also depends on your workflow and code which was omitted. Given some structure (python-)loopless task-dependent workflows surely can beat generic (python-)looping adding one element at a time. Often, this involves coo_matrix.
In your case for some workflows: you don't have any init-time at all as you don't create a matrix a-priori, but only collect whatever is needed before creating the matrix in one batch. Not sure how that fits into your computational model (which is a bit strange: init time-restricted; further usage is free)
I used MATLAB sparse quite a bit years ago. Back then you created a sparse matrix with
S = sparse(i,j,v,m,n)
where i,j,v were matrices identifying all of the nonzero values. That extra nz parameter that preallocates 'space' for more nonzeros did not exist.
In scipy, the equivalent is
S = sparse.csc_matrix((v, (i,j)), shape=(m, n))
Again, v,i,j are fully defined arrays. There isn't any nz preallocation option. In fact given how attributes are stored, I don't see how preallocation would work or be beneficial.
As you found out, trying to define nonzero values with the csc/csr format is slow and produces the warning. lil/dok are designed to make iterative addition faster.
csr creation time depends on the number of initial nonzero values, and only marginally on the shape (the indptr array size depends on the number of rows). Normally we don't worry about initialization time for a lil, but with 20000000 rows, I can see why it would take time. It has to make two object dtype arrays, with empty list elements.
Anyway, try to avoid incremental definition. Create the i,j,v arrays from your source, and then build the matrix.
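As a rough sketch of that batch construction, using the shape from the question: the coordinate arrays below are made-up placeholders for whatever your source produces, and the conversion to CSR at the end is just one possible choice.

```python
import numpy as np
from scipy import sparse

m, n = 20_000_000, 7000                      # shape from the question
# Hypothetical coordinates of the nonzero entries, collected beforehand
i = np.array([0, 5, 19_999_999, 42])
j = np.array([0, 6999, 3, 42])
v = np.ones(len(i), dtype=bool)

# Build once from the (value, (row, col)) triplets, then convert
S = sparse.coo_matrix((v, (i, j)), shape=(m, n), dtype=bool).tocsr()
print(S.shape, S.nnz)
```

No time is spent initializing an empty matrix this way; the cost depends mainly on the number of nonzeros plus the row-pointer array.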

Applying multiple functions on each row using Numba

I have a big 2D NumPy array, let's say 5M rows and 10 columns. I want to build a few more columns according to some stateful logic implemented using Numba @jitclass. Let's say there are 50 such new columns to create. The idea is to iterate over all the rows of 10 columns in a Numba @jit function, and for each row, apply each of my 50 "filters" to generate one new cell each. So:
Source1..Source10 Derived1..Derived50
[array of 10 inputs] [array of 50 outputs]
... 5 million rows like this ...
The problem is, I can't pass a list or tuple of my "filters" to an @jit(nopython=True) function, because they are not homogeneous:
@numba.jit(nopython=True)
def calc_derived(source, derived, filters):
    for srcidx, src in enumerate(source):
        for filtidx, filt in enumerate(filters): # doesn't work
            derived[srcidx,filtidx] = filt.transform(src)
The above doesn't work because filters are a bunch of different classes. As far as I can tell, even making them derive from a common base class is not good enough.
I am left with the possibility of swapping the order of the loops, and having the loop over the 50 filters outside of the @jit function, but this would mean the entire source dataset would be loaded 50 times instead of once, which is very wasteful.
Do you have a technique to work around the "homogenous lists only" requirement of Numba?
You originally asked about doing this with a single function that loops over rows, and applies a list of filters to each row. A challenge with this approach is that numba needs to know or be able to infer the input/output types of each function. I'm not aware of a way to satisfy numba's requirement in this situation (which is not to say that none exists). If there were a way to do this, it could be a better solution (and I'd like to know what it is).
An alternative is to move the code that loops over rows into the filters themselves. Because the filters are numba functions, this should maintain speed. The function that applies the filters would no longer use numba; it would simply loop over the list of filters. But, because the number of filters is small relative to the size of the data matrix, hopefully this won't impact speed too severely. Because this function no longer uses numba, the 'heterogeneous list' issue would no longer be a problem.
This approach worked when I tested it (nopython mode is fine). In test cases, filters implemented as numba functions were 10-18x faster than filters implemented as class methods (even though classes were implemented as numba jitclasses; not sure what's going on there). To gain a bit of modularity, filters can be constructed as closures, so that similar filters can be defined using different parameters.
For example, here are filters that compute sums of powers. Given a matrix x, the filter operates over the columns of x, giving an output for each row. It returns a vector v, where v[i] = sum(x[i, :] ** power)
# filter constructor
def sumpow(power):
    @numba.jit(nopython=True)
    def run_filter(x):
        (nrows, ncols) = x.shape
        result = np.zeros(nrows)
        for i in range(nrows):
            for j in range(ncols):
                result[i] += x[i,j] ** power
        return result
    return run_filter
# define filters
sum1 = sumpow(1) # sum of elements
sum2 = sumpow(2) # sum of elements squared
# apply a single filter
v = sum2(x)
The function to apply multiple filters looks like this. The output of each filter is stacked into a column of the output.
def apply_filters(x, filters):
    result = np.empty((x.shape[0], len(filters)))
    for (i, f) in enumerate(filters):
        result[:, i] = f(x)
    return result
y = apply_filters(x, [sum1, sum2])
Timing results
Data matrix: random entries drawn from standard normal distribution, float64, 5 million rows x 10 columns. All methods tested using the same matrix.
Filters: sum2 filter above, repeated 20x in a list: [sum2, sum2, ...]
Timed using IPython's %timeit function, best of 3 runs
Numerical outputs of all methods agree
Numba function filters (as shown above): 2.25s
Numba jitclass filters: 28.3s
Pure NumPy (using vectorized ops, no loops): 8.64s
I imagine Numba might gain relative to NumPy for more complex filters.
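The pure NumPy variant that was timed is not shown above. One plausible form, given here only as an assumption about what "vectorized ops, no loops" could look like, replaces the inner loops of the closure with a single array expression; sumpow_numpy is an illustrative name.

```python
import numpy as np

def sumpow_numpy(power):
    # Vectorized counterpart of the sumpow closure (illustrative name)
    def run_filter(x):
        return (x ** power).sum(axis=1)
    return run_filter

def apply_filters(x, filters):
    # Each filter fills one output column
    result = np.empty((x.shape[0], len(filters)))
    for i, f in enumerate(filters):
        result[:, i] = f(x)
    return result

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 10))
y = apply_filters(x, [sumpow_numpy(1), sumpow_numpy(2)])
```

Note that x ** power allocates a temporary the size of x for every filter, which is one likely reason a compiled loop can come out ahead.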
To get a homogeneous list, you could construct a list of the transform functions of all filters. In this case, all list elements would have type method.
# filters = list of filters
transforms = [x.transform for x in filters]
Then pass transforms to calc_derived() instead of filters.
Edit:
On my system, looks like numba will accept this, but only if nopython=False

Memory efficient sort of massive numpy array in Python

I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3) which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13" MacBook Pro with 16GB of RAM, 500GB solid state HD and a 3.1 GHz Intel i7 processor. Just loading the array overflows to virtual memory but not to the point where my machine suffers or I have to stop everything else I am doing.
I build this VERY large array step by step from 22 smaller (N, 2) subarrays.
Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays which I call sub_arr.
The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on array b = array([X, F(X)]) and the second output is generated by placing sub_arr[:, 0] into bins using array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:
import numpy as np
def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""
    b = np.load(bfile)
    r = np.load(rfile)
    b_arr = np.interp(sub_arr[:,0], b[:,0], b[:,1])
    rate_arr = np.searchsorted(r[:,0], sub_arr[:,0]) # HUGE efficiency gain over np.digitize...
    return r[rate_arr, 1], b_arr, sub_arr[:,1]
I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = numpy.zeros([868940742, 3]) with the values:
full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1
In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
Here is the sorting routine (there are two consecutive sorts)
for idx in range(2):
    sort_idx = np.argsort(full_arr[:, idx])
    full_arr = full_arr[sort_idx]
    # ...
    # <additional processing, return small (1000, 3) array of stats>
Now this sort had been working, albeit slowly (takes about 10 minutes). However, I recently started using a larger, more fine resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr and now the SORT really slows down, although everything else remains the same.
Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?
bigger, slower file:
17399307 99.4
17493652 98.8
17570460 98.2
17575180 97.6
17577127 97
17578255 96.4
17580576 95.8
17583028 95.2
17583699 94.6
17584172 94
smaller, more uniform regular file:
1 24
1001 24
2001 24
3001 24
4001 24
5001 24
6001 24
7001 24
I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!
At the moment each call to np.argsort is generating a (868940742,) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to reorder full_arr you are generating a second (868940742, 3) array of floats, since fancy indexing always returns a copy rather than a view.
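The sizes involved can be checked with simple arithmetic, without allocating anything. This is only a back-of-the-envelope sketch that assumes int64 indices and float64 data, as in the question; the "peak" line assumes the original array, the index array and the fancy-indexed copy all exist at the same moment.

```python
import numpy as np

n_rows, n_cols = 868940742, 3

# Output of np.argsort on one column: one int64 per row.
index_bytes = n_rows * np.dtype(np.int64).itemsize

# full_arr itself, and equally the copy made by full_arr[sort_idx].
array_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize

peak_bytes = 2 * array_bytes + index_bytes

print(f"index array: {index_bytes / 1e9:.2f} GB")
print(f"full array : {array_bytes / 1e9:.2f} GB")
print(f"rough peak : {peak_bytes / 1e9:.2f} GB")
```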
One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an inplace sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.c_contiguous == True.
Credit for this method should go to Joe Kington for his answer here.
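As a quick sanity check of the structured-view trick, the sketch below applies it to a small random array and compares the result with the ordinary argsort approach. The array `demo` is a made-up stand-in for full_arr.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.random((8, 3))                  # small float64 stand-in for full_arr
expected = demo[np.argsort(demo[:, 0])]    # reference result, built from a copy

assert demo.flags.c_contiguous             # the view trick needs row-major layout

rows = demo.view('f8,f8,f8')               # shape (8, 1): one 24-byte record per row
rows.sort(order=['f0'], axis=0)            # reorders demo's own memory, no index array

print(np.array_equal(demo, expected))
```

Because `rows` is a view, sorting it rearranges whole rows of `demo` directly, so no index array and no second copy of the data are created.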
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
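For anyone not using IPython, the equivalence of the two approaches (and the fact that both allocate new memory) can be confirmed with a short script. This is just an illustrative check on random data, not a benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10000, 2))
idx = x[:, 0].argsort()

by_index = x[idx]                      # fancy indexing
by_take = np.take(x, idx, axis=0)      # same rows, selected with np.take

same_values = np.array_equal(by_index, by_take)
take_is_copy = not np.shares_memory(by_take, x)
index_is_copy = not np.shares_memory(by_index, x)

print(same_values, take_is_copy, index_is_copy)
```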
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
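The counting argument generalises: the number of orderings that count as "sorted" is the product of the factorials of each value's multiplicity. A small sketch, with `sorted_arrangements` being a helper name made up for this illustration:

```python
from collections import Counter
from math import factorial, prod

def sorted_arrangements(values):
    """Number of permutations of `values` that read as ascending order."""
    return prod(factorial(count) for count in Counter(values).values())

all_unique = [5, 1, 4, 8, 10, 2, 6, 9, 7, 3]
five_pairs = [4, 4, 3, 2, 3, 1, 2, 5, 1, 5]

print(sorted_arrangements(all_unique))   # one valid ordering
print(sorted_arrangements(five_pairs))   # 2! for each of the 5 repeated digits
```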
By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
For example:
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.
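The same comparison can be run outside IPython with the standard timeit module. The sketch below uses np.random.default_rng instead of the older random_integers call; `time_sort` is a hypothetical helper, and the timings you get will depend on your NumPy version and hardware, so treat any difference as indicative only.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

def time_sort(n_unique, size=100_000, repeats=20):
    """Best-of-`repeats` time to sort `size` integers drawn from `n_unique` values."""
    best = float("inf")
    for _ in range(repeats):
        x = rng.integers(0, n_unique, size=size)   # fresh unsorted data each run
        start = timeit.default_timer()
        x.sort()
        best = min(best, timeit.default_timer() - start)
    return best

few = time_sort(11)      # values 0..10
many = time_sort(1001)   # values 0..1000

print(f"11 unique values:   {few * 1e3:.2f} ms")
print(f"1001 unique values: {many * 1e3:.2f} ms")
```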
EDIT: In case anyone new to programming and numpy comes across this post, I want to point out the importance of considering the np.dtype that you are using. In my case, I was actually able to get away with using half-precision floating point, i.e. np.float16, which reduced a 20GB object in memory to 5GB and made sorting much more manageable. The default used by numpy is np.float64, which is a lot of precision that you may not need. Check out the doc here, which describes the capacity of the different data types. Thanks to @ali_m for pointing this out in the comments.
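Before switching dtype it is worth checking both the memory saved and the range and precision given up. A minimal sketch using np.finfo, with the array shape taken from the question:

```python
import numpy as np

n_values = 868940742 * 3
size_bytes = {}

for dtype in (np.float64, np.float32, np.float16):
    name = np.dtype(dtype).name
    info = np.finfo(dtype)
    size_bytes[name] = n_values * np.dtype(dtype).itemsize
    print(f"{name}: {size_bytes[name] / 1e9:5.1f} GB, "
          f"max = {float(info.max):.3g}, "
          f"about {info.precision} decimal digits")
```

Half precision only holds values up to 65504 with roughly three significant decimal digits, so it suits columns like small bounded scores but not, for example, genomic positions in the millions.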
I did a bad job explaining this question but I have discovered some helpful workarounds that I think would be useful to share for anyone who needs to sort a truly massive numpy array.
I am building a very large numpy array from 22 "sub-arrays" of human genome data containing the elements [position, value]. Ultimately, the final array must be numerically sorted "in place" based on the values in a particular column and without shuffling the values within rows.
The sub-array dimensions follow the form:
arr1.shape = (N1, 2)
...
arr22.shape = (N22, 2)
sum([N1..N22]) = 868940742, i.e. there are close to 1 billion positions to sort.
First I process the 22 sub-arrays with the function process_sub_arrs, which returns a 3-tuple of 1D arrays the same length as the input. I stack the 1D arrays into a new (N, 3) array and insert them into an np.zeros array initialized for the full dataset:
full_arr = np.zeros([868940742, 3])
i, j = 0, 0
for arr in sub_arrs:  # sub_arrs stands for the list of sub-arrays arr1 ... arr22
    # indices (i, j) incremented at each loop based on sub-array size
    j += len(arr)
    full_arr[i:j, :] = np.column_stack(process_sub_arrs(arr))
    i = j
EDIT: Since I realized my dataset could be represented with half-precision floats, I now initialize full_arr as follows: full_arr = np.zeros([868940742, 3], dtype=np.float16), which is only 1/4 the size and much easier to sort.
Result is a massive 20GB array:
full_arr.nbytes = 20854577808
As @ali_m pointed out in his detailed post, my earlier routine was inefficient:
sort_idx = np.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]
The array sort_idx, which is 33% the size of full_arr, hangs around and wastes memory after sorting full_arr. This sort supposedly generates a copy of full_arr due to "fancy" indexing, potentially pushing memory use to 233% of what is already used to hold the massive array! This is the slow step, lasting about ten minutes and relying heavily on virtual memory.
I'm not sure the "fancy" sort makes a persistent copy however. Watching the memory usage on my machine, it seems that full_arr = full_arr[sort_idx] deletes the reference to the unsorted original, because after about 1 second all that is left is the memory used by the sorted array and the index, even if there is a transient copy.
A more compact usage of argsort() to save memory is this one:
full_arr = full_arr[full_arr[:,idx].argsort()]
This still causes a spike at the time of the assignment, where both a transient index array and a transient copy are made, but the memory is almost instantly freed again.
@ali_m pointed out a nice trick (credited to Joe Kington) for generating a de facto structured array with a view on full_arr. The benefit is that these may be sorted "in place", maintaining stable row order:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
Views work great for performing mathematical array operations, but for sorting it is far too inefficient for even a single sub-array from my dataset. In general, structured arrays just don't seem to scale very well even though they have really useful properties. If anyone has any idea why this is I would be interested to know.
One good option to minimize memory consumption and improve performance with very large arrays is to build a pipeline of small, simple functions. Functions clear local variables once they have completed so if intermediate data structures are building up and sapping memory this can be a good solution.
This is a sketch of the pipeline I've used to speed up the massive array sort:
import numpy as np

def process_sub_arrs(arr):
    """process a sub-array and return a 3-tuple of 1D values arrays"""
    # <compute values1, values2, values3 from arr>
    return values1, values2, values3

def build_arr():
    """build the initial array by joining processed sub-arrays"""
    full_arr = np.zeros([868940742, 3])
    i, j = 0, 0
    for arr in sub_arrs:  # sub_arrs stands for the list of sub-arrays arr1 ... arr22
        # indices (i, j) incremented at each loop based on sub-array size
        j += len(arr)
        full_arr[i:j, :] = np.column_stack(process_sub_arrs(arr))
        i = j
    return full_arr

def sort_arr():
    """build full_arr and return a copy sorted on column `index`"""
    full_arr = build_arr()
    sort_idx = np.argsort(full_arr[:, index])  # index: the column to sort on
    return full_arr[sort_idx]

def get_sorted_arr():
    """call through nested functions to return the sorted array"""
    sorted_arr = sort_arr()
    # <process sorted_arr>
    return statistics
call stack: get_sorted_arr --> sort_arr --> build_arr --> process_sub_arrs
Once each inner function is completed get_sorted_arr() finally just holds the sorted array and then returns a small array of statistics.
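A scaled-down, runnable version of the same pipeline idea is sketched here on tiny random sub-arrays. The function bodies are hypothetical stand-ins (in particular `process_sub_arr` just reshuffles columns); the point is only that the index array and the unsorted array go out of scope as each function returns.

```python
import numpy as np

def process_sub_arr(arr):
    """Hypothetical stand-in: return three 1D arrays the same length as arr."""
    return arr[:, 1], arr[:, 0] * 0.5, arr[:, 0]

def build_arr(sub_arrs):
    """Join the processed sub-arrays into one pre-allocated (N, 3) array."""
    n_rows = sum(len(a) for a in sub_arrs)
    full_arr = np.zeros((n_rows, 3))
    i = 0
    for arr in sub_arrs:
        j = i + len(arr)
        full_arr[i:j, :] = np.column_stack(process_sub_arr(arr))
        i = j
    return full_arr

def sort_arr(sub_arrs, column):
    """Return the rows sorted on `column`; unsorted array and index die here."""
    full_arr = build_arr(sub_arrs)
    return full_arr[np.argsort(full_arr[:, column])]

def get_stats(sub_arrs, column=0):
    """Only a small summary leaves this function."""
    sorted_arr = sort_arr(sub_arrs, column)
    return sorted_arr.mean(axis=0)

rng = np.random.default_rng(0)
sub_arrs = [rng.random((n, 2)) for n in (5, 7, 3)]
stats = get_stats(sub_arrs)
print(stats.shape)
```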
EDIT: It is also worth pointing out here that even if you are able to use a more compact dtype to represent your huge array, you will want to use higher precision for summary calculations. For example, when full_arr.dtype == np.float16, the command np.mean(full_arr[:,idx]) works in reduced precision and returns a half-precision result, which can overflow or lose accuracy when summing over a massive array. Using np.mean(full_arr[:,idx], dtype=np.float64) avoids this.
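The effect is easy to reproduce on a small array. In this sketch the values are chosen so that the true sum (100000) exceeds the largest finite float16 value (65504); the array `x` is made up for the demonstration.

```python
import numpy as np

x = np.full(1000, 100, dtype=np.float16)   # true sum is 100000

with np.errstate(over="ignore"):           # silence the overflow warning for the demo
    total16 = x.sum()                      # result dtype is float16, so it becomes inf

total64 = x.sum(dtype=np.float64)          # accumulate and return in double precision
mean64 = x.mean(dtype=np.float64)

print(total16, total64, mean64)
```

Passing dtype=np.float64 to the reduction upcasts on the fly, so the stored array can stay in half precision.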
I posted this question initially because I was puzzled by the fact that a dataset of identical size suddenly began choking up my system memory, although there was a big difference in the proportion of unique values in the new "slow" set. @ali_m pointed out that, indeed, more uniform data with fewer unique values is easier to sort:
The qsort variant of Quicksort works by recursively selecting a
'pivot' element in the array, then reordering the array such that all
the elements less than the pivot value are placed before it, and all
of the elements greater than the pivot value are placed after it.
Values that are equal to the pivot are already sorted, so intuitively,
the fewer unique values there are in the array, the smaller the number
of swaps there are that need to be made.
On that note, the final change I ended up making to attempt to resolve this issue was to round the newer dataset in advance, since there was an unnecessarily high level of decimal precision leftover from an interpolation step. This ultimately had an even bigger effect than the other memory saving steps, showing that the sort algorithm itself was the limiting factor in this case.
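To illustrate how much rounding can shrink the number of distinct values, here is a small sketch on synthetic data. The uniform random values and the choice of one decimal place are assumptions for the example, not the actual dataset or precision used.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(0, 100, size=100_000)   # stand-in for interpolated output
rounded = np.round(values, 1)                # keep one decimal place

n_before = np.unique(values).size
n_after = np.unique(rounded).size

print(f"unique values before rounding: {n_before}")
print(f"unique values after rounding:  {n_after}")
```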
Look forward to other comments or suggestions anyone might have on this topic, and I almost certainly misspoke about some technical issues so I would be glad to hear back :-)
