I couldn't find the solution to a performance enhancement problem.
I have a 1D array and I would like to compute sums over sliding windows of indices, here is an example code:
import numpy as np
input = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]] #just an example
output = np.array([input[idx[0]: idx[1]].sum() for idx in list_of_indices])
The computation of the output array is extremely slow compared to numpy vectorised built-in functions.
In real life my list_of_indices contains tens of thousands [lower bound, upper bound] pairs, and this loop is definitely the bottle-neck of a high performance python script.
How to deal with this, using numpy internal functions: like masks, clever np.einsum, or other stuff like these ?
Since I work in HPC field, I am also concerned by memory consumption.
Does anyone have an answer for this problem while respecting the performance requirements?
If:
input is about the same length as output or shorter
The output values have similar magnitude
...you could create a cumsum of your input values. Then the summations turn into subtractions.
cs = np.cumsum(input, dtype=float32) # or float64 if you need it
loi = np.array(list_of_indices, dtype=np.uint16)
output = cs[loi[:,1]] - cs[loi[:,0]]
The numerical hazard here is loss of precision if input has runs of large and tiny values. Then cumsum may not be accurate enough for you.
Here's a simple approach to try: Keep the same solution structure as you already have, which presumably works. Just make the storage creation and indexing more efficient. If you are summing many elements from input for most indexes, the summation ought to take more time than the for looping. For example:
# Put all the indices in a nice efficient structure:
idxx = np.hstack((np.array(list_of_indices, dtype=np.uint16),
np.arange(len(list_of_indices), dtype=np.uint16)[:,None]))
# Allocate appropriate data type to the precision and range you need,
# Do it in one go to be time-efficient
output = np.zeros(len(list_of_indices), dtype=np.float32)
for idx0, idx1, idxo in idxx:
output[idxo] = input[idx0:idx1].sum()
If len(list_if_indices) > 2**16, use uint32 rather than uint16.
Related
So I got a matrix looking something like
0.8, 0.7,nan,...
0.1,nan,0.1,...
0.9,nan,0.3,...
with dimensions N X M, where M >> N
Now I want to go through all columns and change the values of the matrix as follows:
If the column has exactly K values >= 0, do nothing
If the column has L < K values >= 0, set K-L values of this column to X
If the column has S > K values >= 0, set S-K values of this column to nan
I am looking for a fast way to do this with python for a large optimization and my for loop is taking too long still.
The outcome in the above example, with K = 2, should look like:
0.8, 0.7,nan,...
0.1,X,0.1,...
nan, nan,0.3,...
What I am considering at the moment is first sorting each column of the entire matrix (I have to restore the previous order later though) and to somehow apply np.where(num_cols != K,applyChanges(myMat),myMat), but I am not sure how to make np.where return me the columns of the matrix and not the invidivual scalars.
Thanks for any suggestion :)
edit:
So I was able to speed up my code by a factor of slightly more than 5 by first using argsort on the columns. This was I could directly set the appropriate values without requiring anyfurther "argwhere" calls which made my previous code slow.
All I can do now seems to be to move from numpy arrays to python lists as I now do a for loop and access the elements directly which is faster done using lists.
edit2:
Using numba's #jit decorator I was able to increase the speed again by a factor of 5. Wow, I am impressed, pretty cool library (is that even the proper term for it?). Now my code is fast enough for my current requirements.
edit3:
As requested, here is my for loop, it is straight forward
idx_sorted = np.argsort(pred_new, axis=0)
num_bands_selected = np.sum(pred_new >= 0, axis=0)
for idx_col in range(0, num_cols):
if num_bands_selected[idx_col] < N:
pred_new[idx_sorted[num_bands_selected[idx_col]-N:, idx_col], idx_col] = dummy_value
elif num_bands_selected[idx_col] > N:
pred_new[idx_sorted[-num_bands_selected[idx_col]:-N, idx_col], idx_col] = np.nan
pred_new is something a neural network returns, but the data has to be in a specific format for some postprocessing algorithm to work.
This could be sped up to the best of my knowledge in two ways (aside from the #jit which was not included):
Using python lists instead of numpy arrays (they allow faster read access)
First selecting columns that require modification. Here I go through everything, but often only some subset (~50 %) might require modification
I find myself consistently facing this problem in a couple of different scenarios. So I thought about sharing it here and see if there is an optimal way to solve it.
Suppose that I have a big array of whatever X and an another array of the same size of X called y that has on it the label to whose x belongs. So like the following.
X = np.array(['obect1', 'object2', 'object3', 'object4', 'object5'])
y = np.array([0, 1, 1, 0, 2])
What I desire is a to build a dictionary / hash that uses the set of labels as keys and the indexes of all the objects with those labels in X as items. So in this case the desired output will be:
{0: (array([0, 3]),), 1: (array([1, 2]),), 2: (array([4]),)}
Note that actually what is on X does not matter but I included it for the sake of completeness.
Now, my naive solution for the problem is iterate throughout all the labels and use np.where==label to build the dictionary. In more detail, I use this function:
def get_key_to_indexes_dic(labels):
"""
Builds a dictionary whose keys are the labels and whose
items are all the indexes that have that particular key
"""
# Get the unique labels and initialize the dictionary
label_set = set(labels)
key_to_indexes = {}
for label in label_set:
key_to_indexes[label] = np.where(labels==label)
return key_to_indexes
So now the core of my question:
Is there a way to do better? is there a natural way to solve this using numpy functions? is my approach misguided somehow?
As a lateral matter of less importance: what is the complexity of the solution in the definition above? I believe that the complexity of the solution is the following:
Or in words the number of labels times the complexity of using np.where in a set of the size of y plus the complexity of making a set out of an array. Is this correct?
P.D. I could not find related post with this specific question, if you have suggestions to change the title or anything I would be grateful.
You only need to traverse once if you use a dictionary to store the indexes as you go through:
from collections import defaultdict
def get_key_to_indexes_ddict(labels):
indexes = defaultdict(list)
for index, label in enumerate(labels):
indexes[label].append(index)
The scaling seems much like you have analysed for your option, for the function above it's O(N) where N is the size of y since checking if a value is in a dictionary is O(1).
So the interesting thing is that since np.where is going so much faster in its traversal, as long as there are only a small number of labels, your function is faster. Mine seems faster when there are many distinct labels.
Here is how the functions scale:
The blue lines are your function, the red lines are mine. The line styles indicate the number of distinct labels. {10: ':', 100: '--', 1000: '-.', 10000: '-'}. You can see that my function is relatively independent of number of labels, while yours quickly becomes slow when there are many labels. If you have few labels, you're better off with yours.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a fully vectorized manner, and having O(nlogn) worst case time-complexity:
import numpy_indexed as npi
indices = np.arange(len(labels))
unique_labels, indices_per_label = npi.group_by(labels, indices)
Note that for many common applications of such functionality, such as computing a sum or mean over group labels, it is more efficient not to compute the split list of indices, but to make use of the functions for that in npi; ie, npi.group_by(labels).mean(some_corresponding_array), rather than looping through indices_per_label and taking the mean over those indices.
Assuming that the labels are consecutive integers [0, m] and taking n = len(labels), the complexity for set(labels) is O(n) and the complexity for np.where in the loop is O(m*n). However, the overall complexity is written as O(m*n) and not O(m*n + n), see "Big O notation" on wikipedia.
There are two things you can do to improve the performance: 1) use a more efficient algorithm (lower complexity) and 2) replace Python loops with fast array operations.
The other answers currently posted do exactly this, and with very sensible code. However an optimal solution would be both fully vectorized and have O(n) complexity. This can be accomplished using a certain lower level function from Scipy:
def sparse_hack(labels):
from scipy.sparse._sparsetools import coo_tocsr
labels = labels.ravel()
n = len(labels)
nlabels = np.max(labels) + 1
indices = np.arange(n)
sorted_indices = np.empty(n, int)
offsets = np.zeros(nlabels+1, int)
dummy = np.zeros(n, int)
coo_tocsr(nlabels, 1, n, labels, dummy, indices,
offsets, dummy, sorted_indices)
return sorted_indices, offsets
The source for coo_tocsr can be found here. The way I used it, it essentially performs an indirect counting sort. To be honest, this is a rather obscure method and I advise you to use one of the approaches in the other answers.
I've also struggled to find a "numpythonic" way to solve this type of problem. This is the best approach I've come up with, although requiring a bit more memory:
def get_key_to_indexes_dict(labels):
indices = numpy.argsort(labels)
bins = numpy.bincount(labels)
indices = numpy.split(indices, numpy.cumsum(bins[bins > 0][:-1]))
return dict(zip(numpy.unique(labels), indices))
I know that this question party has been answered, but I am looking specifically at numpy and scipy. Say I have a grid
lGrid = linspace(0.1, 8, 50)
and I want to find the index that corresponds best to 2, I do
index = abs(lGrid-2).argmin()
lGrid[index]
2.034
However, what if I have a whole matrix of values instead of 2 here. I guess iteration is pretty slow. abs(lGrid-[2,4]) however will fail due to shape issues. I will need a solution that is easily extendable to N-dim matrices. What is the best course of action in this environment?
You can use broadcasting:
from numpy import arange,linspace,argmin
vals = arange(30).reshape(2,5,3) #your N-dimensional input, like array([2,4])
lGrid = linspace(0.1, 8, 50)
result = argmin(abs(lGrid-vals[...,newaxis]),axis=-1)
for example, with input vals = array([2,4]), you obtain result = array([12, 24]) and lGrid[result]=array([ 2.03469388, 3.96938776])
You "guess that Iteration is pretty slow", but I guess it isn't. So I would just just iterate over the "whole Matrix of values instead of 2". Perhaps:
for val in BigArray.flatten():
index = abs(lGrid-val).argmin()
yield lGrid[index]
If lGrid is failry large, then the overhead of iterating in a Python for loop is probably not big in comparison to the vecotirsed operation Happening inside it.
There might be a way you can use broadcasting and reshaping to do the whole thing in one giant operation, but would be complicated, and you might accidentally allocate such a huge array that your machine slows down to a crawl.
Can anyone recommend a way to do a reverse cumulative sum on a numpy array?
Where 'reverse cumulative sum' is defined as below (I welcome any corrections on the name for this procedure):
if
x = np.array([0,1,2,3,4])
then
np.cumsum(x)
gives
array([0,1,3,6,10])
However, I would like to get
array([10,10,9,7,4]
Can anyone suggest a way to do this?
This does it:
np.cumsum(x[::-1])[::-1]
You can use .flipud() for this as well, which is equivalent to [::-1]
https://docs.scipy.org/doc/numpy/reference/generated/numpy.flipud.html
In [0]: x = np.array([0,1,2,3,4])
In [1]: np.flipud(np.flipud(x).cumsum())
Out[1]: array([10, 10, 9, 7, 4]
.flip() is new as of NumPy 1.12, and combines the .flipud() and .fliplr() into one API.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html
This is equivalent, and has fewer function calls:
np.flip(np.flip(x, 0).cumsum(), 0)
The answers given so far seem to be all inefficient if you want the result stored in the original array. As well, if you want a copy, keep in mind this will return a view not a contiguous array and np.ascontiguousarray() is still needed.
How about
view=np.flip(x, 0)
np.cumsum(view, 0, out=view)
#x contains the reverse cumsum result and remains contiguous and unflipped
This modifies the flipped view of x which writes the data properly in reverse order back into the original x variable. It requires no non-contiguous views at the end of execution and is about as speed efficient as possible. I am guessing numpy will never add a reversecumsum method namely because the technique I describe is so trivially and efficiently possible. Albeit, it might be ever so slightly more efficient to have the explicit method.
Otherwise if a copy is desired, then the extra flip is required AND conversion back to a contiguous array, mainly if it will be used in many vector operations thereafter. A tricky part of numpy, but views and contiguity are something to be careful with if you are seriously interested in performance.
I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' are different. If I understand correctly numpy wants uniform size, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard variation.
The downside of this method is that the main loop will be in python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable size arrays.
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
if you find that this is still too slow, try to use psycho to optimize the python loop.
if this is still too slow, try using Cython together with the numpy module. This Tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with sum function )
An alternative to Cython would be to use SWIG with numpy.i.
if you want to use only numpy and have everything computed at C level, try grouping all the records of same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
example with O(N) complexity:
import numpy
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
if len(row) == 1:
list_size_1.append(row)
elif len(row) == 2:
list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis = 1)
std_2 = numpy.std(list_size_2, axis = 1)
While there are already some pretty reasonable ideas present here, I believe following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc). Evidently that's why Mapad proposed the nice trick with grouping same sized records.
The problem with it (assuming there isn't any a priori data on records lengths is at hand) is that it involves even more computations than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second way through the list to obtain groups(their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification - not to generate the whole list, but iterate through the dictionary converting each row into numpy.array and performing required computations. Like this:
for row in data.itervalues():
np_row = numpy.array(row)
this_row_std = numpy.std(np_row)
# compute any other statistic descriptors needed and then save to some list
In any case a few million loops in python won't take as long as one might expect. Besides this doesn't look like a routine computation, so who cares if it takes extra second/minute if it is run once in a while or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std
def get_statistical_descriptors(a):
if ax = len(shape(a))-1
functions = [mean, std]
return f(a, axis = ax) for f in functions
def process_long_list_stats(data):
import numpy
groups = {}
for key, row in data.iteritems():
size = len(row)
try:
groups[size].append(key)
except KeyError:
groups[size] = ([key])
results = []
for gr_keys in groups.itervalues():
gr_rows = numpy.array([data[k] for k in gr_keys])
stats = get_statistical_descriptors(gr_rows)
results.extend( zip(gr_keys, zip(*stats)) )
return dict(results)
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = eval('[' + ','.join(["('%s', float)" % key for key in dd.keys()]) + ']')
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])