I have been learning fancy indexing, but when I observed the behavior of the following code I got a couple of questions...
According to my understanding,
Fancy Indexing is:
ndArray[ [0,1,2] ] i.e. passing a list of rows / columns
and
Slicing is:
ndArray[ 0:3 ] i.e. giving a range of rows / columns
Now, the problem
A numpy array,
arr = [ [1,2,3],
[4,5,6],
[7,8,9] ]
When I try fancy indexing:
arr[ [0,1], [1,2] ]
>>> [2, 6]
And when I slice it,
arr[:2, 1:]
>>> [ [2, 3],
[5, 6] ]
Essentially, both of them should return the same two-dimensional array, since the two notations mean the same thing and are used interchangeably!
:2 should be equivalent to [0,1] #For rows
1: should be equivalent to [1,2] #For cols
The question:
Why does fancy indexing not return the same result as the slice notation? And how can I achieve that?
Please enlighten me.
Thanks
Fancy indexing and slicing behave differently by definition / by numpy specification.
So, instead of questioning why that is so, it is better to:
Be able to recognize / distinguish / tell them apart (i.e., have a clear understanding of when indexing becomes fancy indexing and when it is slicing).
Be aware of the differences in their semantics (outcomes).
In your example:
In the case of fancy indexing, the indices generated for the two axes are combined "in tandem", similar to how the zip function combines two input sequences (in the words of the official numpy documentation, the two index arrays are "iterated together"). We pass the list [0, 1] to index the array on axis 0, and the list [1, 2] to index it on axis 1. The index 0 from [0, 1] is combined only with the corresponding index 1 from [1, 2]; similarly, the index 1 from [0, 1] is combined only with the corresponding index 2 from [1, 2]. In other words, the index arrays do not combine with each other in a many-to-many fashion. All this was about fancy indexing.
In the case of slicing, the slice :2 specified for axis 0 conceptually generates indices 0 and 1 for axis 0, and the slice 1: specified for axis 1 conceptually generates indices 1 and 2 for axis 1. But these generated indices combine in a many-to-many fashion, unlike in the case of fancy indexing. So they produce four combinations rather than just two.
So, the crucial difference in the defined semantics of fancy indexing and slicing is that in the case of fancy indexing, the fancy index arrays are iterated together.
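To actually get the many-to-many (slice-like) result from integer index arrays, make the two index arrays broadcast against each other instead of pairing up elementwise. A minimal sketch using the arr from the question (np.ix_ builds exactly such broadcastable index arrays):
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Paired ("iterated together") fancy indexing -> 1-D result
arr[[0, 1], [1, 2]]                     # array([2, 6])
# Many-to-many combination via np.ix_ -> 2-D result, same as arr[:2, 1:]
arr[np.ix_([0, 1], [1, 2])]             # array([[2, 3], [5, 6]])
# Equivalently, give the row indices an extra axis so they broadcast:
arr[np.array([0, 1])[:, None], [1, 2]]  # array([[2, 3], [5, 6]])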
I have two 1D arrays, x & y, one smaller than the other. I'm trying to find the index of every element of y in x.
I've found two naive ways to do this, the first is slow, and the second memory-intensive.
The slow way
indices = []
for iy in y:
    indices.append(np.where(x == iy)[0][0])
The memory hog
xe = np.outer([1,] * len(x), y)  # each row is a copy of y
ye = np.outer(x, [1,] * len(y))  # each column is a copy of x
junk, indices = np.where(np.equal(xe, ye))  # rows index into x, columns into y
Is there a faster way or less memory intensive approach? Ideally the search would take advantage of the fact that we are searching for not one thing in a list, but many things, and thus is slightly more amenable to parallelization.
Bonus points if you don't assume that every element of y is actually in x.
I want to suggest a one-line solution:
indices = np.where(np.in1d(x, y))[0]
The result is an array of indices into x, corresponding to the elements of y that were found in x.
You can use it without numpy.where if needed.
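A quick sketch of a caveat with this approach (x and y borrowed from the searchsorted answer below): np.in1d builds a boolean mask over x, so the indices come back in x's order, not in y's, and duplicates in x all show up:
import numpy as np
x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])
indices = np.where(np.in1d(x, y))[0]
print(indices)     # [1 3 6 7] -- positions in x, in x's order
print(x[indices])  # [5 1 6 6]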
As Joe Kington said, searchsorted() can search for elements very quickly. To deal with elements that are not in x, you can check the searched result against the original y and create a masked array:
import numpy as np
x = np.array([3,5,7,1,9,8,6,6])
y = np.array([2,1,5,10,100,6])
index = np.argsort(x)
sorted_x = x[index]
sorted_index = np.searchsorted(sorted_x, y)
yindex = np.take(index, sorted_index, mode="clip")
mask = x[yindex] != y
result = np.ma.array(yindex, mask=mask)
print(result)
The result is:
[-- 3 1 -- -- 6]
How about this?
It does assume that every element of y is in x (and will return results even for elements that aren't!), but it is much faster.
import numpy as np
# Generate some example data...
x = np.arange(1000)
np.random.shuffle(x)
y = np.arange(100)
# Actually perform the operation...
xsorted = np.argsort(x)
ypos = np.searchsorted(x[xsorted], y)
indices = xsorted[ypos]
I think this is a clearer version:
np.where(y.reshape(y.size, 1) == x)[1]
than indices = np.where(y[:, None] == x[None, :])[1]. You don't need to broadcast x into 2D.
I found this type of solution to be the best because, unlike the searchsorted()- or in1d()-based solutions that I have seen posted here or elsewhere, the above works with duplicates and it doesn't care whether anything is sorted. This was important to me because I wanted x to be in a particular custom order.
I would just do this:
indices = np.where(y[:, None] == x[None, :])[1]
Unlike your memory-hog way, this makes use of broadcasting to directly generate a 2D boolean array without creating 2D arrays for both x and y.
The numpy_indexed package (disclaimer: I am its author) contains a function that does exactly this:
import numpy_indexed as npi
indices = npi.indices(x, y, missing='mask')
It will currently raise a KeyError if not all elements in y are present in x; but perhaps I should add a kwarg so that one can elect to mark such items with a -1 or something.
It should have the same efficiency as the currently accepted answer, since the implementation is along similar lines. numpy_indexed is, however, more flexible, and also allows searching for the indices of rows of multidimensional arrays, for instance.
EDIT: I've changed the handling of missing values; the 'missing' kwarg can now be set to 'raise', 'ignore' or 'mask'. In the latter case you get a masked array of the same length as y, on which you can call .compressed() to get the valid indices. Note that there is also npi.contains(x, y) if that is all you need to know.
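Based on the API described above, usage might look like this (a sketch only; names and behavior are as the author describes, not independently verified):
import numpy as np
import numpy_indexed as npi
x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])
idx = npi.indices(x, y, missing='mask')  # masked array, masked where a y element is absent from x
valid = idx.compressed()                 # just the valid indices into x
present = npi.contains(x, y)             # boolean mask over y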
Another solution would be:
import numpy as np

a = np.array(['Bob', 'Alice', 'John', 'Jack', 'Brian', 'Dylan'])
z = ['Bob', 'Brian', 'John']
for i in z:
    print(np.argwhere(i == a))
My solution can additionally handle a multidimensional x. By default, it will return a standard numpy array of corresponding y indices in the shape of x.
If you can't assume that every element of x is also present in y, then set masked=True to return a masked array (this has a performance penalty). Otherwise, you will still get indices for x elements not contained in y, but they probably won't be useful to you.
The answers by HYRY and Joe Kington were helpful in making this.
import numpy as np

# For each element of ndarray x, return the index of the corresponding element in the 1d array y
# If y contains duplicates, the index of the last duplicate is returned
# Optionally, mask indices where the x element does not exist in y
def matched_indices(x, y, masked=False):
    # Flattened x
    x_flat = x.ravel()
    # Indices to sort y
    y_argsort = y.argsort()
    # Positions in sorted y of the corresponding x elements, flat
    x_in_y_sort_flat = y.searchsorted(x_flat, sorter=y_argsort)
    # Indices in y of the corresponding x elements, flat
    # (mode="clip" guards against the out-of-range position that searchsorted
    # returns for x elements greater than everything in y)
    x_in_y_flat = np.take(y_argsort, x_in_y_sort_flat, mode="clip")
    if not masked:
        # Reshape to the shape of x
        return x_in_y_flat.reshape(x.shape)
    else:
        # Check for inequality at each y index to mask invalid indices
        mask = x_flat != y[x_in_y_flat]
        # Reshape to the shape of x
        return np.ma.array(x_in_y_flat.reshape(x.shape), mask=mask.reshape(x.shape))
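A small usage sketch (the sample arrays are just illustrative):
x = np.array([[3, 1], [9, 6]])
y = np.array([1, 3, 5, 6])
print(matched_indices(x, y))
# [[1 0]
#  [3 3]]  <- the index reported for 9 is meaningless; 9 is not in y
print(matched_indices(x, y, masked=True))
# [[1 0]
#  [-- 3]]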
A more direct solution that doesn't require the array to be sorted:
import numpy as np
import pandas as pd
A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'others'])
# Find index position of B's items in A
B.map(lambda x: np.where(A==x)[0][0]).tolist()
Result is:
[2, 3, 3, 1, 4]
A more compact solution:
indices, = np.in1d(a, b).nonzero()
I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it has become clearer to me that iteration is usually outright wrong due to how slow it is, and that writing a vectorized operation is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in one case, from swapping out a pattern like row iteration with .apply(lambda).
@MSeifert's answer provides this much better and will be significantly more performant on a dataset of any real size.
A more general answer by @cs95 covers and compares alternatives to iteration in pandas.
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for index, values in np.ndenumerate(A):
    print(index, values)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
In this case you will obtain an array of indices for each value, which is useful when a value appears in the array more than once.
What happens when numpy.apply_along_axis takes a 1d array as input? When I use it on 1d array, I see something strange:
import numpy as np
y = np.array([1, 2, 3, 4])
First try:
np.apply_along_axis(lambda x: x > 2, 0, y)
np.apply_along_axis(lambda x: x - 2, 0, y)
returns:
array([False, False, True, True], dtype=bool)
array([-1, 0, 1, 2])
However when I try:
np.apply_along_axis(lambda x: x - 2 if x > 2 else x, 0, y)
I get an error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I could of course use list comprehension then convert back to array instead, but that seems convoluted and I feel like I'm missing something about apply_along_axis when applied to a 1d array.
UPDATE: as per Jeff G's answer, my confusion stems from the fact that for a 1d array with only one axis, what is being passed to the function is in fact the 1d array itself rather than the individual elements.
"numpy.where" is clearly better for my chosen example (and no need for apply_along_axis), but my question is really about the proper idiom for applying a general function (that takes one scalar and returns one scalar) to each element of an array (other than list comprehension), something akin to pandas.Series.apply (or map). I know of 'vectorize' but it seems no less unwieldy than list comprehension.
I'm unclear whether you're asking if y must be 1-D (answer is no, it can be multidimensional) or if you're asking about the function passed into apply_along_axis. To that, the answer is yes: the function you pass must take a 1-D array. (This is stated clearly in the function's documentation).
In your three examples, the type of x is always a 1-D array. The reason your first two examples work is because Python is implicitly broadcasting the > and - operators along that array.
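A quick sketch to see this for yourself: the argument received by the function is the whole 1-D array, not a scalar.
import numpy as np
y = np.array([1, 2, 3, 4])
def show(x):
    print(type(x), x.shape)  # <class 'numpy.ndarray'> (4,)
    return x
np.apply_along_axis(show, 0, y)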
Your third example fails because there is no such broadcasting along an array for if / else. For this to work with apply_along_axis you need to pass a function that takes a 1-D array. numpy.where would work for this:
>>> numpy.apply_along_axis(lambda x: numpy.where(x > 2, x - 2, x), 0, y)
array([1, 2, 1, 2])
P.S. In all these examples, apply_along_axis is unnecessary, thanks to broadcasting. You could achieve the same results with these:
>>> y > 2
>>> y - 2
>>> numpy.where(y > 2, y - 2, y)
This answer addresses the updated addendum to your original question:
numpy.vectorize will take an elementwise function and return a new function. The new function can be applied to an entire array. It's like map, but it uses the broadcasting rules of numpy. (Note that np.vectorize is essentially a Python loop under the hood, so it is a convenience, not a performance optimization.)
import numpy as np

f = lambda x: x - 2 if x > 2 else x  # your elementwise fn
fv = np.vectorize(f)
fv(np.array([1,2,3,4]))
# Out[5]: array([1, 2, 1, 2])
What is the easiest and cleanest way to get the first AND the last elements of a sequence? E.g., I have a sequence [1, 2, 3, 4, 5], and I'd like to get [1, 5] via some kind of slicing magic. What I have come up with so far is:
l = len(s)
result = s[0:l:l-1]
I actually need this for a bit more complex task. I have a 3D numpy array, which is cubic (i.e. is of size NxNxN, where N may vary). I'd like an easy and fast way to get a 2x2x2 array containing the values from the vertices of the source array. The example above is an oversimplified, 1D version of my task.
Use this:
result = [s[0], s[-1]]
Since you're using a numpy array, you may want to use fancy indexing:
a = np.arange(27)
indices = [0, -1]
b = a[indices] # array([0, 26])
For the 3d case:
vertices = [(0,0,0),(0,0,-1),(0,-1,0),(0,-1,-1),(-1,-1,-1),(-1,-1,0),(-1,0,0),(-1,0,-1)]
indices = tuple(zip(*vertices))  # can store this for later use (numpy wants a tuple for multi-axis indexing)
a = np.arange(27).reshape((3,3,3)) #dummy array for testing. Can be any shape size :)
vertex_values = a[indices].reshape((2,2,2))
I first write down all the vertices (although I am willing to bet there is a clever way to do it using itertools which would let you scale this up to N dimensions ...). The order you specify the vertices is the order they will be in the output array. Then I "transpose" the list of vertices (using zip) so that all the x indices are together and all the y indices are together, etc. (that's how numpy likes it). At this point, you can save that index array and use it to index your array whenever you want the corners of your box. You can easily reshape the result into a 2x2x2 array (although the order I have it is probably not the order you want).
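The itertools trick hinted at above might look like this (a sketch that scales to N dimensions):
import itertools
import numpy as np
a = np.arange(27).reshape((3, 3, 3))
# Every corner is a combination of first (0) and last (-1) index on each axis
vertices = list(itertools.product([0, -1], repeat=a.ndim))
indices = tuple(zip(*vertices))  # transpose into one index sequence per axis
corner_values = a[indices].reshape((2,) * a.ndim)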
This would give you a list of the first and last element in your sequence:
result = [s[0], s[-1]]
Alternatively, this would give you a tuple
result = s[0], s[-1]
With the particular case of a (N,N,N) ndarray X that you mention, would the following work for you?
s = slice(0,N,N-1)
X[s,s,s]
Example
>>> N = 3
>>> X = np.arange(N*N*N).reshape(N,N,N)
>>> s = slice(0,N,N-1)
>>> print(X[s, s, s])
[[[ 0 2]
[ 6 8]]
[[18 20]
[24 26]]]
>>> from operator import itemgetter
>>> first_and_last = itemgetter(0, -1)
>>> first_and_last([1, 2, 3, 4, 5])
(1, 5)
Why do you want to use a slice? Getting each element with
result = [s[0], s[-1]]
is better and more readable.
If you really need to use the slice, then your solution is the simplest working one that I can think of.
This also works for the 3D case you've mentioned.