I don't understand what it does, and only saw it used by a sklearn-object.
On trying some testing like this
x = [(1, 2), (2, 3), (3, 4)]
y = [[1, 2], [2, 3], [3, 4]]
print(y[:, 0])
I got this error (with both x and y):
TypeError: list indices must be integers or slices, not tuple
My assumption was that the : before the comma tells Python to take all entries, while the 0 specifies which 'subset'.
How does one use this expression properly and what does it do?
As explained here this is a numpy-specific notation to index arrays, not plain python. This is why it does not work on your code. In your (initial) case, the sklearn object probably wraps a numpy array that supports the numpy slicing syntax.
In your specific case, it would work as follows:
import numpy as np
y = np.array([[1, 2], [2, 3], [3, 4]])
print(y[:, 0])
# prints: [1 2 3]
This would yield all indexes along the first axis (i.e. use full column vectors), but use only index 0 in the second axis (i.e. use only the first column vector).
What you express would work with more complex objects, like numpy (and its famous slicing). In the case of vanilla Python, this is not possible. To access a specific number (in your case) you would have to do x[2][1] (yielding 4 in your example).
To achieve what you want (take the first item of each tuple), you would do
[item[0] for item in y]. This is list comprehension: iterate through y, take the object at the first index of each item, and make a list out of that.
Related
I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np
arr = np.array([
[2, 20, 1],
[1, 10, 3],
[2, 20, 2],
[2, 20, 1],
[1, 20, 3],
[2, 20, 2],
])
The output I like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear expert's opinion.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10 3]
# [ 1 20 3]
# [ 2 20 1]
# [ 2 20 2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
if item not in dct:
dct[item] = len(dct)
indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted in stackoverflow and correctly returns an stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in an stable order starting from 0.
3. Pandas's get_dummies: function. But it returns a "hot encoding" (matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the "hot encoding" to the indicator vector by few lines of code and data manipulation. But again that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is its own inverse to fix the order. Note that idx.argsort() places unq into sorted order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
Of course there's nothing special about these operations to 2D.
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index that tells you what elements of x are placed at each of the locations in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
I am trying to find first_row = [1, 2, 3, 4, ...]
Currently, I have list = [[1, a], [2, b], [3, c], [4, d], ...]
and tried running: list[:, 0]
Output error: TypeError: List indices must be integer must be integer, not tuple
The subscript lst[:,0] does not do what you think it does. You might be used to pandas or other libraries which support indexing in multidimensional matrices, but the builtin list does not support this.
Instead, you should use a list-comprehension.
first_row = [column[0] for column in lst]
This is a shorthand for the following more verbose expression.
first_row = []
for column in lst:
first_row.append(lst[0])
In other words, you cannot vectorize multi-indexing with the list type. You have to recover each column one by one and the first element of each of those independently.
As a sidenote, try avoiding calling you variables list as it shadows the builtin of the same name.
list doesn't support multiple (i.e. "comma-separated") indices, so you need something like
[sublist[0] for sublist in list]
The most popular alternative supporting multiple indexing like you tried is numpy's multidimensional array
The reason why list cannot do this is because it is a general container, so it cannot imply what you store inside. You might have
values = [[1, 2], 'three', MyClass(4, 5, 6), None]
Applying [0, :] to this would make no sense, the same way as it doesn't in most cases.
first_row = [i[0] for i in list]
Edit:
Since there are already two answers as this one, let me show a different one if you want to use slicing. First of all you SHOULDN'T use list as the name of your variable, since list is a reserved word in python. Suppose your list is called list1, you can do:
first_row = list(np.array(list1)[0,:].astype(int))
I am trying to use numpy in Python in solving my project.
I have a random binary array rndm = [1, 0, 1, 1] and a resource_arr = [[2, 3], 4, 2, [1, 2]]. What I am trying to do is to multiply the array element wise, then get their sum. As an expected output for the sample above,
output = 5 0 2 3. I find hard to solve such problem because of the nested array/list.
So far my code looks like this:
def fitness_score():
output = numpy.add(rndm * resource_arr)
return output
fitness_score()
I keep getting
ValueError: invalid number of arguments.
For which I think is because of the addition that I am trying to do. Any help would be appreciated. Thank you!
Numpy treats its arrays as matrices, and resource_arr is not a (valid) matrix. In your case a python list is more suitable:
def sum_nested(l):
tmp = []
for element in l:
if isinstance(element, list):
tmp.append(numpy.sum(element))
else:
tmp.append(element)
return tmp
In this function we check for each element inside l if it is a list. If so, we sum its elements. On the other hand, if the encountered element is just a number, we leave it untouched. Please note that this only works for one level of nesting.
Now, if we run sum_nested([[2, 3], 4, 2, [1, 2]]) we will get [5 4 2 3]. All that's left is multiplying this result by the elements of rndm, which can be achieved easily using numpy:
def fitness_score(a, b):
return numpy.multiply(a, sum_nested(b))
Numpy is all about the non-jagged arrays. You can do things with jagged arrays, but doing so efficiently and elegantly isnt trivial.
Almost always, trying to find a way to map your datastructure to a non-nested one, for instance, encoding the information as below, will be more flexible, and more performant.
resource_arr = (
[0, 0, 1, 2, 3, 3]
[2, 3, 4, 2, 1, 2]
)
That is, an integer denoting the 'row' each value belongs to, paired with an array of equal size of the values themselves.
This may 'feel' wasteful when coming from a C-style way of doing arrays (omg more memory consumption), but staying away from nested datastructures is almost certainly your best bet in terms of performance, and the amount of numpy/scipy ecosystem that will actually be compatible with your data representation. If it really uses more memory is actually rather questionable; every new python object uses a ton of bytes, so if you have only few elements per nesting, it is the more memory efficient solution too.
In this case, that would give you the following efficient solution to your problem:
output = np.bincount(*resource_arr) * rndm
I have not worked much with pandas/numpy so I'm not sure if this is most efficient way, but it works (atleast for the example you have shown):
import numpy as np
rndm = [1, 0, 1, 1]
resource_arr = [[2, 3], 4, 2, [1, 2]]
multiplied_output = np.multiply(rndm, resource_arr)
print(multiplied_output)
output = []
for elem in multiplied_output:
output.append(sum(elem)) if isinstance(elem, list) else output.append(elem)
final_output = np.array(output)
print(final_output)
I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is in comparison, and writing to use a vectorized operation is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in a case of swapping out a form like some row iteration .apply(lambda)
#MSeifert's answer much better provides this and will be significantly more performant on a dataset of any real size
More general Answer by #cs95 covering and comparing alternatives to iteration in Pandas
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for index, values in np.ndenumerate(A):
print(index, values) # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
for _, val in np.ndenumerate(A):
ind = np.argwhere(A==val)
print val, ind
In this case you will obtain the array of indexes if value appears in array not once.
What is the easiest and cleanest way to get the first AND the last elements of a sequence? E.g., I have a sequence [1, 2, 3, 4, 5], and I'd like to get [1, 5] via some kind of slicing magic. What I have come up with so far is:
l = len(s)
result = s[0:l:l-1]
I actually need this for a bit more complex task. I have a 3D numpy array, which is cubic (i.e. is of size NxNxN, where N may vary). I'd like an easy and fast way to get a 2x2x2 array containing the values from the vertices of the source array. The example above is an oversimplified, 1D version of my task.
Use this:
result = [s[0], s[-1]]
Since you're using a numpy array, you may want to use fancy indexing:
a = np.arange(27)
indices = [0, -1]
b = a[indices] # array([0, 26])
For the 3d case:
vertices = [(0,0,0),(0,0,-1),(0,-1,0),(0,-1,-1),(-1,-1,-1),(-1,-1,0),(-1,0,0),(-1,0,-1)]
indices = list(zip(*vertices)) #Can store this for later use.
a = np.arange(27).reshape((3,3,3)) #dummy array for testing. Can be any shape size :)
vertex_values = a[indices].reshape((2,2,2))
I first write down all the vertices (although I am willing to bet there is a clever way to do it using itertools which would let you scale this up to N dimensions ...). The order you specify the vertices is the order they will be in the output array. Then I "transpose" the list of vertices (using zip) so that all the x indices are together and all the y indices are together, etc. (that's how numpy likes it). At this point, you can save that index array and use it to index your array whenever you want the corners of your box. You can easily reshape the result into a 2x2x2 array (although the order I have it is probably not the order you want).
This would give you a list of the first and last element in your sequence:
result = [s[0], s[-1]]
Alternatively, this would give you a tuple
result = s[0], s[-1]
With the particular case of a (N,N,N) ndarray X that you mention, would the following work for you?
s = slice(0,N,N-1)
X[s,s,s]
Example
>>> N = 3
>>> X = np.arange(N*N*N).reshape(N,N,N)
>>> s = slice(0,N,N-1)
>>> print X[s,s,s]
[[[ 0 2]
[ 6 8]]
[[18 20]
[24 26]]]
>>> from operator import itemgetter
>>> first_and_last = itemgetter(0, -1)
>>> first_and_last([1, 2, 3, 4, 5])
(1, 5)
Why do you want to use a slice? Getting each element with
result = [s[0], s[-1]]
is better and more readable.
If you really need to use the slice, then your solution is the simplest working one that I can think of.
This also works for the 3D case you've mentioned.