Getting rows corresponding to label, for many labels - python

I have a 2D array, where each row has a label that is stored in a separate array (not necessarily unique). For each label, I want to extract the rows from my 2D array that have this label. A basic working example of what I want would be this:
import numpy as np
data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
label=np.array([1,1,1,0,1])
#very simple approach
label_values=np.unique(label)
res=[]
for la in label_values:
    data_of_this_label_val=data[label==la]
    res+=[data_of_this_label_val]
print(res)
The result (res) can have any format, as long as it is easily accessible. In the above example, it would be
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]
Note that I can easily associate each element in my list to one of the unique labels in label_values (that is, by index).
While this works, using a for loop can take quite a lot of time, especially if my label vector is large. Can this be sped up or coded more elegantly?

You can argsort the labels (which is, I believe, what unique does under the hood).
If your labels are small nonnegative integers as in the example, you can get it a bit cheaper; see https://stackoverflow.com/a/53002966/7207392.
>>> import numpy as np
>>>
>>> data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
>>> label=np.array([1,1,1,0,1])
>>>
>>> # use kind='mergesort' if you require a stable sort, i.e. one that
>>> # preserves the order of equal labels
>>> idx = label.argsort()
>>> ls = label[idx]
>>> split = 1 + np.where(ls[1:] != ls[:-1])[0]
>>> np.split(data[idx], split)
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]
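If you also want each block explicitly paired with its label (the "by index" association mentioned in the question), a small extension of the same idea is sketched below, reusing the arrays from above:
import numpy as np
data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])
idx = label.argsort(kind='mergesort')        # stable sort of the labels
ls = label[idx]
split = 1 + np.where(ls[1:] != ls[:-1])[0]   # boundaries between groups
# the label of each block is the first sorted label inside that block
keys = ls[np.concatenate(([0], split))]
groups = dict(zip(keys.tolist(), np.split(data[idx], split)))
print(groups[1])   # the four rows labelled 1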

Unfortunately, there isn't a built-in groupby function in numpy, though you could write alternatives. However, your problem could be solved more succinctly using pandas, if that's available to you:
import pandas as pd
res = pd.DataFrame(data).groupby(label).apply(lambda x: x.values).tolist()
# or, if performance is important, the following will be faster on large arrays,
# but less readable IMO:
res = [data[i] for i in pd.DataFrame(data).groupby(label).groups.values()]
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]
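If you would rather keep each block keyed by its label, a small variation on the groupby approach (assuming the same data and label arrays as above) is:
import numpy as np
import pandas as pd
data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])
# groups maps each label to the (positional) index of its rows
groups = pd.DataFrame(data).groupby(label).groups
res = {k: data[v] for k, v in groups.items()}
print(res[0])   # array([[20, 32]])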

Related

Stable conversion of a multi-column (2D) numpy array to an indicator vector

I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none of them is ideal. It would be nice to hear an expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because the unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items so that they start from 0, but I feel that there is a simple numpy trick for this, which would let me avoid adding the sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on Stack Overflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) so that it represents the values in a stable order starting from 0.
3. Pandas's get_dummies function: it returns a one-hot encoding (a matrix of indicator values), whereas I would like an indicator vector. It is possible to convert the one-hot encoding to the indicator vector with a few lines of code and data manipulation, but again that approach is not going to be very efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that argsorting an argsort inverts the permutation to fix the order. Note that unq[idx.argsort()] places the unique rows in order of their first appearance in arr. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course, nothing about these operations is specific to the 2D case.
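Putting the pieces together on the example array from the question, a quick sanity check (just re-running the steps above):
import numpy as np
arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
indicator = idx.argsort().argsort()[inv]
print(indicator)           # expected: [0 1 2 0 3 2]
print(unq[idx.argsort()])  # the unique rows, in order of first appearance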
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
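A quick check of that identity with the example x above:
import numpy as np
x = np.array([7, 3, 0, 1, 4])
i = x.argsort()
print(np.array_equal(np.sort(x)[i.argsort()], x))   # True
print(np.array_equal(x[i][i.argsort()], x))         # True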

Appending a new row to a numpy array

I am trying to append a new row to an existing numpy array in a loop. I have tried the methods involving append, concatenate and also vstack; none of them ends up giving me the result I want.
I have tried the following:
for _ in col_change:
    if (item + 2 < len(col_change)):
        arr=[col_change[item], col_change[item + 1], col_change[item + 2]]
        array=np.concatenate((array,arr),axis=0)
    item+=1
I have also tried it in the most basic format and it still gives me an empty array.
array=np.array([])
newrow = [1, 2, 3]
newrow1 = [4, 5, 6]
np.concatenate((array,newrow), axis=0)
np.concatenate((array,newrow1), axis=0)
print(array)
I want the output to be [[1,2,3],[4,5,6],...]
The correct way to build an array incrementally is to not start with an array:
alist = []
alist.append([1, 2, 3])
alist.append([4, 5, 6])
arr = np.array(alist)
This is essentially the same as
arr = np.array([ [1,2,3], [4,5,6] ])
the most common way of making a small (or large) sample array.
Even if you have good reason to use some version of concatenate (hstack, vstack, etc.), it is better to collect the components in a list and perform the concatenation once.
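Applied to the loop from the question, the same pattern might look roughly like the sketch below (col_change and the window-of-three logic are guesses, since the original variables are not fully shown):
import numpy as np
col_change = [10, 20, 30, 40, 50]   # placeholder data; the real values are not shown
rows = []
for item in range(len(col_change) - 2):
    # collect each window of three consecutive values as a plain list
    rows.append(col_change[item:item + 3])
array = np.array(rows)   # one conversion at the end, instead of repeated concatenate
print(array)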
If you want [[1,2,3],[4,5,6]], here is an alternative without append: np.arange and then a reshape:
>>> import numpy as np
>>> np.arange(1,7).reshape(2, 3)
array([[1, 2, 3],
       [4, 5, 6]])
Or create a big array and fill it manually (or in a loop):
>>> array = np.empty((2, 3), int)
>>> array[0] = [1,2,3]
>>> array[1] = [4,5,6]
>>> array
array([[1, 2, 3],
       [4, 5, 6]])
A note on your examples:
In the second one you forgot to save the result; make it array = np.concatenate((array,newrow1), axis=0) and it works (not exactly like you want it, but at least the array is not empty anymore). The first example seems badly indented, and without knowing the variables and/or the problem it's hard to debug.

Fastest way to extract dictionary of sums in numpy in 1 I/O pass

Let's say I have an array like:
arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])
and I would like to form a dictionary of the sum of the third column for each row that matches each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.
Approach 1: Naively, I can invoke np.unique(), then iterate through each item of the unique array, combining np.where() and np.sum() in a one-line dictionary comprehension. This would be reasonably fast if I had a small number of unique elements, but at 100k unique elements I will incur a lot of wasted page fetches, making 100k I/O passes over the entire array.
Approach 2: I could make a single I/O pass of the last column (because having to hash column 1 at each row will probably be cheaper than the excessive page fetches) too, but I lose the advantage of numpy's C inner loop vectorization here.
Is there a fast way to implement Approach 2 without resorting to a pure Python loop?
numpy approach:
u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)
dict(np.stack([u, s]).T)
{1: 13, 2: 6, 3: 9}
pandas approach:
import pandas as pd
import numpy as np
pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()
{1: 13, 2: 6, 3: 9}
Here's a NumPy based approach using np.add.reduceat -
sidx = arr[:,0].argsort()
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)
keys = arr[sidx[idx],0]
vals = np.add.reduceat(arr[sidx,2],idx,axis=0)
If you would like to get the keys and values in a 2-column array -
out = np.column_stack((keys,vals))
Sample run -
In [351]: arr
Out[351]:
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]:
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])
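If the end goal is the dictionary from the question, keys and vals can be zipped straight into one; a self-contained recap of the above (with .tolist() to get native Python ints):
import numpy as np
arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])
sidx = arr[:, 0].argsort()
idx = np.append(0, np.where(np.diff(arr[sidx, 0]) != 0)[0] + 1)
keys = arr[sidx[idx], 0]
vals = np.add.reduceat(arr[sidx, 2], idx)
print(dict(zip(keys.tolist(), vals.tolist())))   # {1: 13, 2: 6, 3: 9}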
This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author):
import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])
It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this type of elementary operation. Performance should be identical to the solution posted by Divakar, since it follows the same steps, just with a nice and tested interface on top.
The proper way of doing this in NumPy is with np.bincount. If your unique first-column labels are already small contiguous integers, you could simply do:
cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}
where the cum_sum != 0 is an attempt to skip missing first column labels, which may be grossly wrong if your third column includes negative numbers.
Alternatively you can do things properly and call np.unique first and do:
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}
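Run on the example array, this gives the dictionary from the question (values come back as floats, because bincount works in floating point when weights are given):
import numpy as np
arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
print({int(k): float(v) for k, v in zip(uniques, cum_sums)})   # {1: 13.0, 2: 6.0, 3: 9.0}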

does numpy asarray() refer to original list

I have a very long list of lists and I am converting it to a numpy array using numpy.asarray(). Is it safe to delete the original list after getting this matrix, or will the newly created numpy array also be affected by this action?
I am pretty sure that the data is not shared and that you can safely remove the lists. Your original matrix is a nested structure of Python objects, with the numbers themselves also Python objects, which can be located anywhere in memory. A NumPy array is also an object, but it is more or less a header that contains the dimensions and type of the data, plus a pointer to a contiguous block of memory where all the numbers are packed as tightly as possible as 'raw numbers'. There is no way these two different layouts could share data, so the data is copied when you create the NumPy array. Example:
In [1]: m = [[1,2,3],[4,5,6],[7,8,9]]
In [2]: import numpy as np
In [3]: M = np.array(m)
In [4]: M[1,1] = 55
In [5]: M
Out[5]:
array([[ 1,  2,  3],
       [ 4, 55,  6],
       [ 7,  8,  9]])
In [6]: m
Out[6]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # original is not modified!
Note that Numpy arrays can share data between each other, e.g. when you make a slice into an array. This is called a 'view', so if you modify data in the subset, it will also change in the original array:
In [18]: P = M[1:, 1:]
In [19]: P[1,1] = 666
In [20]: P
Out[20]:
array([[ 55,   6],
       [  8, 666]])
In [21]: M
Out[21]:
array([[  1,   2,   3],
       [  4,  55,   6],
       [  7,   8, 666]]) # original is also modified!
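If you prefer to verify rather than take this on faith, an array's base attribute (and np.shares_memory) makes the copy-vs-view distinction explicit; a small sketch:
import numpy as np
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
M = np.asarray(m)     # copies the list data into a fresh buffer
P = M[1:, 1:]         # a basic slice is a view into M's buffer
print(M.base)                   # None  -> M owns its own data
print(P.base is M)              # True  -> P is a view on M
print(np.shares_memory(M, P))   # True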
The data are copied over because the numpy array stores its own copy of the data as described by Bas Swinckels. You can test this for your self too. Although the trivially small list might make the point too, the ginormous data set below might bring the point home a little better ;)
import numpy as np
list_data = list(range(1000000000)) # note, building this list will take a long time (and a lot of memory)
# This will also take a long time
# because it is copying the data in memory
array_data = np.asarray(list_data)
# even this will probably take a while
del list_data
# But you still have the data even after deleting the list
print(array_data[1000])
Yes, it is safe to delete it if your input data consists of a list. From the documentation: no copy is performed (only) if the input is already an ndarray.
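A tiny illustration of that documented behaviour:
import numpy as np
lst = [1, 2, 3]
a = np.asarray(lst)
print(a is lst)   # False: a new array had to be created from the list
b = np.asarray(a)
print(b is a)     # True: no copy when the input is already an ndarray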

How to select inverse of indexes of a numpy array?

I have a large set of data in which I need to compare the distances of a set of samples from this array with all the other elements of the array. Below is a very simple example of my data set.
import numpy as np
import scipy.spatial.distance as sd
data = np.array(
    [[ 0.93825827,  0.26701143],
     [ 0.99121108,  0.35582816],
     [ 0.90154837,  0.86254049],
     [ 0.83149103,  0.42222948],
     [ 0.27309625,  0.38925281],
     [ 0.06510739,  0.58445673],
     [ 0.61469637,  0.05420098],
     [ 0.92685408,  0.62715114],
     [ 0.22587817,  0.56819403],
     [ 0.28400409,  0.21112043]]
)
sample_indexes = [1,2,3]
# I'd rather not make this
other_indexes = list(set(range(len(data))) - set(sample_indexes))
sample_data = data[sample_indexes]
other_data = data[other_indexes]
# compare them
dists = sd.cdist(sample_data, other_data)
Is there a way to index a numpy array for indexes that are NOT the sample indexes? In my above example I make a list called other_indexes. I'd rather not have to do this for various reasons (large data set, threading, a very VERY low amount of memory on the system this is running on etc. etc. etc.). Is there a way to do something like..
other_data = data[ indexes not in sample_indexes]
I read that numpy masks can do this but I tried...
other_data = data[~sample_indexes]
And this gives me an error. Do I have to create a mask?
mask = np.ones(len(data), dtype=bool)   # np.bool is deprecated; use the builtin bool
mask[sample_indexes] = False
other_data = data[mask]
Not the most elegant for what perhaps should be a single-line statement, but it's fairly efficient, and the memory overhead is minimal too.
If memory is your prime concern, np.delete would avoid the creation of the mask, and fancy indexing creates a copy anyway.
On second thought: np.delete does not modify the existing array, so it's pretty much exactly the single-line statement you are looking for.
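For reference, the np.delete one-liner alluded to above might look like this (np.delete returns a new array with the given rows dropped; random numbers stand in for the question's data):
import numpy as np
data = np.random.rand(10, 2)
sample_indexes = [1, 2, 3]
other_data = np.delete(data, sample_indexes, axis=0)   # rows not in sample_indexes
print(other_data.shape)   # (7, 2)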
You may want to try in1d
In [5]: select = np.in1d(range(data.shape[0]), sample_indexes)

In [6]: print(data[select])
[[ 0.99121108  0.35582816]
 [ 0.90154837  0.86254049]
 [ 0.83149103  0.42222948]]

In [7]: print(data[~select])
[[ 0.93825827  0.26701143]
 [ 0.27309625  0.38925281]
 [ 0.06510739  0.58445673]
 [ 0.61469637  0.05420098]
 [ 0.92685408  0.62715114]
 [ 0.22587817  0.56819403]
 [ 0.28400409  0.21112043]]
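On newer NumPy versions, np.isin is the documented successor to np.in1d and reads much the same way; a minimal sketch with stand-in data:
import numpy as np
data = np.random.rand(10, 2)      # stand-in for the data array above
sample_indexes = [1, 2, 3]
select = np.isin(np.arange(data.shape[0]), sample_indexes)
print(data[select].shape)    # (3, 2)
print(data[~select].shape)   # (7, 2)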
You may also use setdiff1d:
In [11]: data[np.setdiff1d(np.arange(data.shape[0]), sample_indexes)]
Out[11]:
array([[ 0.93825827,  0.26701143],
       [ 0.27309625,  0.38925281],
       [ 0.06510739,  0.58445673],
       [ 0.61469637,  0.05420098],
       [ 0.92685408,  0.62715114],
       [ 0.22587817,  0.56819403],
       [ 0.28400409,  0.21112043]])
I'm not familiar with the specifics of numpy, but here's a general Python solution. Suppose you have the following list: a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
You create another list of indices you don't want: inds = [1, 3, 6].
Now filter by position: good_data = [x for i, x in enumerate(a) if i not in inds], resulting in good_data = [0, 2, 4, 5, 7, 8, 9].
