I have a large data array, and I need to compare the distances from a set of sample rows of this array to all of the other rows. Below is a very simple example of my data set.
import numpy as np
import scipy.spatial.distance as sd
data = np.array(
[[ 0.93825827, 0.26701143],
[ 0.99121108, 0.35582816],
[ 0.90154837, 0.86254049],
[ 0.83149103, 0.42222948],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]]
)
sample_indexes = [1,2,3]
# I'd rather not make this
other_indexes = list(set(range(len(data))) - set(sample_indexes))
sample_data = data[sample_indexes]
other_data = data[other_indexes]
# compare them
dists = sd.cdist(sample_data, other_data)
Is there a way to index a numpy array for indexes that are NOT the sample indexes? In my example above I make a list called other_indexes. I'd rather not have to do this for various reasons (large data set, threading, very little memory on the system this is running on, etc.). Is there a way to do something like...
other_data = data[ indexes not in sample_indexes]
I read that numpy masks can do this but I tried...
other_data = data[~sample_indexes]
And this gives me an error. Do I have to create a mask?
mask = np.ones(len(data), dtype=bool)
mask[sample_indexes] = 0
other_data = data[mask]
Not the most elegant for what perhaps should be a single-line statement, but it's fairly efficient, and the memory overhead is minimal too.
If memory is your prime concern, np.delete would avoid the creation of the mask, and fancy indexing creates a copy anyway.
On second thought, np.delete does not modify the existing array, so it's pretty much exactly the single-line statement you are looking for.
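For illustration, a minimal sketch of the np.delete approach (reusing data and sample_indexes as defined in the question):

import numpy as np
import scipy.spatial.distance as sd

# np.delete returns a new array with the listed rows removed;
# the original data array is left untouched
other_data = np.delete(data, sample_indexes, axis=0)
dists = sd.cdist(data[sample_indexes], other_data)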
You may want to try in1d
In [5]: select = np.in1d(range(data.shape[0]), sample_indexes)
In [6]: print data[select]
[[ 0.99121108 0.35582816]
[ 0.90154837 0.86254049]
[ 0.83149103 0.42222948]]
In [7]: print data[~select]
[[ 0.93825827 0.26701143]
[ 0.27309625 0.38925281]
[ 0.06510739 0.58445673]
[ 0.61469637 0.05420098]
[ 0.92685408 0.62715114]
[ 0.22587817 0.56819403]
[ 0.28400409 0.21112043]]
You may also use setdiff1d:
In [11]: data[np.setdiff1d(np.arange(data.shape[0]), sample_indexes)]
Out[11]:
array([[ 0.93825827, 0.26701143],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]])
I'm not familiar with the specifics of numpy, but here's a general pure-Python solution. Suppose you have the following list: a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
You create another list of the indices you don't want: inds = [1, 3, 6].
Now filter by index: good_data = [x for i, x in enumerate(a) if i not in inds], resulting in good_data = [0, 2, 4, 5, 7, 8, 9].
I am trying to find the best way to maintain an index list of points using a point cache (ps0). To do this I want to efficiently find all existing points from one 2D array (ps1) in another 2D cache array (ps0). The cache array (ps0) doesn't have to be a numpy ndarray, but it's convenient if it is. Ultimately I would also maintain an index list (ps_indexes) for the points. The goal is to have the functionality of ps_get_existing() you see in the code example. Thank you for your help. Yes, I'd like to use numpy's methods as much as possible since it's there. The function ps_get_existing() returns, for each point in the new array (ps1), its index in the cache (ps0), or None if that point is not in the cache.
>>> import numpy as np
>>> ps0 = np.empty((1000, 2), dtype=np.float64) # the point cache
>>> ps0_length = 10
>>> ps0[:ps0_length] = [ # point cache
[-0.48505498, 0.12574257],
[ 0.00761904, -0.5 ],
[-0.09404923, -0.79025632],
[-0.40395326, 0.48217822],
[-0.32285154, 0.83861386],
[-0.08257821, 1.33960396],
[ 0.00761904, 1.41683168],
[ 0.00761904, 1.41683168],
[ 0.00761904, 1.41683168],
[ 0.09781628, 1.33960396]]
>>> ps1 = np.array([ # new point array
[-0.26255921, -0.82617311],
[ 0.00761904, -0.5 ],
[-0.40395326, 0.48217822],
[-0.37366057, -0.56056108],
[-0.17125953, 1.14356436],
[-0.48476193, -0.29494906],
[-0.09404923, -0.79025632],
[-0.25994085, 0.94752475],
[-0.32285154, 0.83861386],
[-0.48505498, 0.12574257]], dtype=np.float64)
>>> indexes = ps_get_existing(ps0[:ps0_length], ps1) ### this is the functionality I am looking for!
>>> indexes
[None, 1, 3, None, None, None, 2, None, 4, 0]
>>> for i in range(len(ps1)):
... if indexes[i] is None:
... indexes[i] = ps0_length
... ps0[ps0_length] = ps1[i]
... ps0_length += 1
...
>>> indexes
[10, 1, 3, 11, 12, 13, 2, 14, 4, 0]
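No implementation of ps_get_existing() is shown above, but here is one possible sketch of my own (the tol tolerance parameter is an assumption for floating-point comparison, not part of the question): it broadcasts a pairwise comparison and picks the first matching cache index for each new point.

import numpy as np

def ps_get_existing(cache, points, tol=1e-9):
    # boolean matrix of shape (len(points), len(cache)):
    # True where every coordinate of a point matches a cached point within tol
    matches = np.all(np.abs(points[:, None, :] - cache[None, :, :]) <= tol, axis=2)
    indexes = []
    for row in matches:
        hits = np.flatnonzero(row)
        indexes.append(int(hits[0]) if hits.size else None)
    return indexes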
I have a 2D array, where each row has a label that is stored in a separate array (not necessarily unique). For each label, I want to extract the rows from my 2D array that have this label. A basic working example of what I want would be this:
import numpy as np
data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
label=np.array([1,1,1,0,1])
#very simple approach
label_values=np.unique(label)
res=[]
for la in label_values:
    data_of_this_label_val = data[label == la]
    res += [data_of_this_label_val]
print(res)
The result (res) can have any format, as long as it is easily accessible. In the above example, it would be
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
Note that I can easily associate each element in my list to one of the unique labels in label_values (that is, by index).
While this works, using a for loop can take quite a lot of time, especially if my label vector is large. Can this be sped up or coded more elegantly?
You can argsort the labels (which is what unique does under the hood I believe).
If your labels are small nonnegative integers as in the example, you can get it a bit cheaper, see https://stackoverflow.com/a/53002966/7207392.
>>> import numpy as np
>>>
>>> data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
>>> label=np.array([1,1,1,0,1])
>>>
>>> idx = label.argsort()
# use kind='mergesort' if you require a stable sort, i.e. one that
# preserves the order of equal labels
>>> ls = label[idx]
>>> split = 1 + np.where(ls[1:] != ls[:-1])[0]
>>> np.split(data[idx], split)
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
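A closely related variant (my own sketch, not part of the answer above) lets np.unique compute the split points instead of the manual np.where on adjacent labels:

import numpy as np

data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])

idx = label.argsort(kind='mergesort')            # stable sort keeps the order of equal labels
_, first = np.unique(label[idx], return_index=True)
groups = np.split(data[idx], first[1:])          # one array per unique label, in sorted label order
print(groups)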
Unfortunately, there isn't a built-in groupby function in numpy, though you could write alternatives. However, your problem could be solved more succinctly using pandas, if that's available to you:
import pandas as pd
res = pd.DataFrame(data).groupby(label).apply(lambda x: x.values).tolist()
# or, if performance is important, the following will be faster on large arrays,
# but less readable IMO:
res = [data[i] for i in pd.DataFrame(data).groupby(label).groups.values()]
[array([[20, 32]]), array([[ 1, 2],
[ 3, 5],
[ 7, 10],
[ 0, 0]])]
Can someone please help me to understand why sometimes the advanced selection doesn't work and what I can do to get it to work (2nd case)?
>>> import numpy as np
>>> b = np.random.rand(5, 14, 3, 2)
# advanced selection works as expected
>>> b[[0,1],[0,1]]
array([[[ 0.7575555 , 0.18989068],
[ 0.06816789, 0.95760398],
[ 0.88358107, 0.19558106]],
[[ 0.62122898, 0.95066355],
[ 0.62947885, 0.00297711],
[ 0.70292323, 0.2109297 ]]])
# doesn't work - why?
>>> b[[0,1],[0,1,2]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: shape mismatch: objects cannot be broadcast to a single shape
# but this seems to work
>>> b[:,[0,1,2]]
array([[[[ 7.57555496e-01, 1.89890676e-01],
[ 6.81678915e-02, 9.57603975e-01],
[ 8.83581071e-01, 1.95581063e-01]],
[[ 2.24896112e-01, 4.77818599e-01],
[ 4.29313861e-02, 8.61578045e-02],
[ 4.80092364e-01, 3.66821618e-01]],
...
Update
Breaking up the selection seems to resolve the problem, but I am unsure why this is necessary (or if there's a better way to achieve this).
>>> b.shape
(5, 14, 3, 2)
>>> b[[0,1]].shape
(2, 14, 3, 2)
# trying to separate indexing by dimension.
>>> b[[0,1]][:,[0,1,2]]
array([[[[ 0.7575555 , 0.18989068],
[ 0.06816789, 0.95760398],
[ 0.88358107, 0.19558106]],
[[ 0.22489611, 0.4778186 ],
[ 0.04293139, 0.0861578 ],
You want
b[np.ix_([0, 1], [0, 1, 2])]
You also need to do the same thing for b[[0, 1], [0, 1]], because that's not actually doing what you think it is:
b[np.ix_([0, 1], [0, 1])]
The problem here is that advanced indexing does something completely different from what you think it does. You've made the mistake of thinking that b[[0, 1], [0, 1, 2]] means "take all parts b[i, j] of b where i is 0 or 1 and j is 0, 1, or 2". This is a reasonable mistake to make, considering that it seems to work that way when you have one list in the indexing expression, like
b[:, [1, 3, 5], 2]
In fact, for an array A and one-dimensional integer arrays I and J, A[I, J] is an array where
A[I, J][n] == A[I[n], J[n]]
This generalizes in the natural way to more index arrays, so for example
A[I, J, K][n] == A[I[n], J[n], K[n]]
and to higher-dimensional index arrays, so if I and J are two-dimensional, then
A[I, J][m, n] == A[I[m, n], J[m, n]]
It also applies the broadcasting rules to the index arrays, and converts lists in the indexes to arrays. This is much more powerful than what you expected to happen, but it means that to do what you were trying to do, you need something like
b[[[0],
[1]], [[0, 1, 2]]]
np.ix_ is a helper that will do that for you so you don't have to write a dozen brackets.
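As a quick sanity check (a sketch of my own, not from the answer), np.ix_ produces exactly the broadcast-ready index arrays written out by hand above:

import numpy as np

b = np.random.rand(5, 14, 3, 2)

via_ix = b[np.ix_([0, 1], [0, 1, 2])]       # rows {0, 1} crossed with columns {0, 1, 2}
via_hand = b[[[0], [1]], [[0, 1, 2]]]       # the hand-written broadcasting version

print(via_ix.shape)                          # (2, 3, 3, 2)
print(np.array_equal(via_ix, via_hand))      # True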
I think you misunderstood the advanced selection syntax for this case. I used your example, just made it smaller to be easier to see.
import numpy as np
b = np.random.rand(5, 4, 3, 2)
# advanced selection works as expected
print b[[0,1],[0,1]] # http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
# this picks b[0,0] (a 3x2 matrix) and b[1,1] (another 3x2 matrix)
# doesn't work - why?
#print b[[0,1],[0,1,2]] # this doesn't work because [0,1] and [0,1,2] have different lengths and cannot be broadcast together
print b[[0,1,2],[0,1,2]] # works
Output:
[[[ 0.27334558 0.90065184]
[ 0.8624593 0.34324983]
[ 0.19574819 0.2825373 ]]
[[ 0.38660087 0.63941692]
[ 0.81522421 0.16661912]
[ 0.81518479 0.78655536]]]
[[[ 0.27334558 0.90065184]
[ 0.8624593 0.34324983]
[ 0.19574819 0.2825373 ]]
[[ 0.38660087 0.63941692]
[ 0.81522421 0.16661912]
[ 0.81518479 0.78655536]]
[[ 0.65336551 0.1435357 ]
[ 0.91380873 0.45225145]
[ 0.57255923 0.7645396 ]]]
I have a very long list of lists and I am converting it to a numpy array using numpy.asarray(). Is it safe to delete the original list after getting this matrix, or will the newly created numpy array also be affected by this action?
I am pretty sure that the data is not shared and that you can safely remove the lists. Your original matrix is a nested structure of Python objects, with the numbers themselves also Python objects, which can be located anywhere in memory. A Numpy array is also an object, but it is more or less a header that contains the dimensions and type of the data, with a pointer to a contiguous block of memory where all the numbers are packed as closely as possible as 'raw numbers'. There is no way these two different representations could share data, so the data is copied when you create the Numpy array. Example:
In [1]: m = [[1,2,3],[4,5,6],[7,8,9]]
In [2]: import numpy as np
In [3]: M = np.array(m)
In [4]: M[1,1] = 55
In [5]: M
Out[5]:
array([[ 1, 2, 3],
[ 4, 55, 6],
[ 7, 8, 9]])
In [6]: m
Out[6]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # original is not modified!
Note that Numpy arrays can share data between each other, e.g. when you make a slice into an array. This is called a 'view', so if you modify data in the subset, it will also change in the original array:
In [18]: P = M[1:, 1:]
In [19]: P[1,1] = 666
In [20]: P
Out[20]:
array([[ 55, 6],
[ 8, 666]])
In [21]: M
Out[21]:
array([[ 1, 2, 3],
[ 4, 55, 6],
[ 7, 8, 666]]) # original is also modified!
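If you want to verify the copy-vs-view behaviour yourself, the owndata flag and np.shares_memory are handy (a small check of my own, not part of the answer):

import numpy as np

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
M = np.array(m)      # the list data is copied into a fresh buffer
P = M[1:, 1:]        # slicing creates a view into M's buffer

print(M.flags.owndata)           # True  - M owns its data
print(P.flags.owndata)           # False - P borrows M's data
print(np.shares_memory(M, P))    # True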
The data are copied over because the numpy array stores its own copy of the data, as described by Bas Swinckels. You can test this for yourself too. Although the trivially small list might make the point too, the ginormous data set below might bring the point home a little better ;)
import numpy as np
list_data = list(range(1000000000)) # note, this will probably take a long time
# This will also take a long time
# because it is copying the data in memory
array_data = np.asarray(list_data)
# even this will probably take a while
del list_data
# But you still have the data even after deleting the list
print(array_data[1000])
Yes, it is safe to delete it if your input data consists of a list. From the documentation: no copy is performed (only) if the input is already an ndarray.
# 2x3 dimensional list
multidim_list = [
[1,2,3],
[4,5,6],
]
# 2x3x2 dimensional list
multidim_list2 = [
[
[1,2,3],
[4,5,6],
],
[
[7,8,9],
[10,11,12],
]
]
def multiply_list(list):
...
I would like to implement a function that multiplies all elements in the list by two. However, my problem is that the lists can have different numbers of dimensions.
Is there a general way to loop over/iterate a multidimensional list and, for example, multiply each value by two?
EDIT1:
Thanks for the fast answers.
For this case, I don't want to use numpy.
The recursion seems good, and it doesn't even need to make a copy of the list, which could be quite large actually.
Recursion is your friend:
from collections.abc import MutableSequence  # collections.MutableSequence was removed in Python 3.10

def multiply(list_):
    for index, item in enumerate(list_):
        if isinstance(item, MutableSequence):
            multiply(item)
        else:
            list_[index] *= 2
You could just do isinstance(item, list) instead of isinstance(item, MutableSequence), but the latter way is more futureproof and generic. See the glossary for a short explanation.
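For example, a quick in-place usage check (my own example data):

nested = [[1, 2, 3], [[4, 5], [6]]]
multiply(nested)     # modifies the nested list in place, no copy is made
print(nested)        # [[2, 4, 6], [[8, 10], [12]]]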
You can make use of numpy:
import numpy as np
arr_1 = np.array(multidim_list)
arr_2 = np.array(multidim_list2)
Result:
>>> arr_1*2
array([[ 2, 4, 6],
[ 8, 10, 12]])
>>> arr_2*2
array([[[ 2, 4, 6],
[ 8, 10, 12]],
[[14, 16, 18],
[20, 22, 24]]])
numpy arrays do that out of the box.