Python/Numpy subarray selection - python

I have some NumPy code which I'm trying to decipher. There's a line v1 = v1[:, a1.tolist()] which takes a NumPy array a1 and converts it into a list. I'm confused as to what v1[:, a1.tolist()] actually does. I know that v1 is now being set to the array produced from v1 by the selection [:, a1.tolist()], but what's getting selected? More precisely, what is [:, a1.tolist()] doing?

The syntax you observed is easier to understand if you split it into two parts:
1. Using a list as an index
With NumPy, the meaning of
a[[1, 2, 3]]
is
[a[1], a[2], a[3]]
In other words, indexing with a list builds a new array by using each element of the list as an index in turn.
2. Selecting a column with [:, x]
The meaning of
a2[:, x]
is
[a2[0][x],
 a2[1][x],
 a2[2][x],
 ...
 a2[n-1][x]]
i.e. it selects one column from a matrix.
Summing up
The meaning of
a[:, [1, 3, 5]]
is therefore
[[a[ 0 ][1], a[ 0 ][3], a[ 0 ][5]],
 [a[ 1 ][1], a[ 1 ][3], a[ 1 ][5]],
 ...
 [a[n-1][1], a[n-1][3], a[n-1][5]]]
In other words, a copy of a with a selection of columns (or a duplication and reordering of them; the elements in the list of indices don't need to be distinct or sorted).

Assuming a simple case like a 2D array, v1[:, a1.tolist()] selects all rows of v1, but only the columns whose indices appear in a1.
Simple example:
>>> x
array([['a', 'b', 'c'],
       ['d', 'f', 'g']],
      dtype='|S1')
>>> x[:, [0]]
array([['a'],
       ['d']],
      dtype='|S1')
>>> x[:, [0, 1]]
array([['a', 'b'],
       ['d', 'f']],
      dtype='|S1')
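Since the index list may also repeat and reorder columns, here is a minimal sketch of that behaviour (the array and indices below are made up for illustration):

```python
import numpy as np

v1 = np.arange(12).reshape(3, 4)   # small example matrix
a1 = np.array([2, 0, 2])           # column indices: repeated and out of order

# All rows, columns 2, 0, 2 -- duplicates and reordering are allowed
selected = v1[:, a1.tolist()]
print(selected)
# [[ 2  0  2]
#  [ 6  4  6]
#  [10  8 10]]
```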

How to filter with numpy on 2D array using np.where

I read the NumPy docs, and np.where takes one argument to return the row indices where the condition matches:
numpy.where(condition, [x, y, ]/)
In the context of a multi-dimensional array I want to find and replace where the condition matches; this is doable with the other params from the doc: [x, y, ] are the replacement values.
Here is my data structure :
my_2d_array = np.array([[1,2],[3,4]])
Here is how I filter a column in Python: my_2d_array[:,1]
Here is how I filter find/replace with numpy :
indices = np.where( my_2d_array[:,1] == 4, my_2d_array[:,1] , my_2d_array[:,1] )
(when the second column's value matches 4, swap the value in column two with column one)
So it's hard for me to understand why the same syntax my_2d_array[:,1] is used to filter a whole column in Python, yet designates a single row of my 2D array for NumPy where the condition matches.
Your array:
In [9]: arr = np.array([[1,2],[3,4]])
In [10]: arr
Out[10]:
array([[1, 2],
       [3, 4]])
Testing for some value:
In [11]: arr==4
Out[11]:
array([[False, False],
       [False,  True]])
Testing one column:
In [12]: arr[:,1]
Out[12]: array([2, 4])
In [13]: arr[:,1]==4
Out[13]: array([False, True])
As documented, np.where with just one argument is just a call to nonzero, which finds the index(es) of the True values.
So for the 2d array in [11] we get two arrays:
In [15]: np.nonzero(arr==4)
Out[15]: (array([1], dtype=int64), array([1], dtype=int64))
and for the 1d boolean in [13], one array:
In [16]: np.nonzero(arr[:,1]==4)
Out[16]: (array([1], dtype=int64),)
That array can be used to select a row from arr:
In [17]: arr[_,1]
Out[17]: array([[4]])
If used in the three argument where, it selects elements between the 2nd and 3rd arguments. For example, using arguments that have nothing to do with arr:
In [18]: np.where(arr[:,1]==4, ['a','b'],['c','d'])
Out[18]: array(['c', 'b'], dtype='<U1')
The selection gets more complicated if the arguments differ in shape; then the rules of broadcasting apply.
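For instance, broadcasting lets one of the replacement arguments be a scalar; a small sketch replacing the matched element:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# The scalar 0 broadcasts against arr's (2, 2) shape
print(np.where(arr == 4, 0, arr))
# [[1 2]
#  [3 0]]
```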
So the basic point with np.where is that all 3 arguments are first evaluated, and passed (in true python function fashion) to the where function. It then selects elements based on the cond, returning a new array.
That where is functionally the same as this list comprehension (or an equivalent for loop):
In [19]: [i if cond else j for cond,i,j in zip(arr[:,1]==4, ['a','b'],['c','d'])]
Out[19]: ['c', 'b']
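To get the swap described in the question (where the second column equals 4, take the first column's value instead), a sketch with the three-argument where:

```python
import numpy as np

my_2d_array = np.array([[1, 2], [3, 4]])

# Where column two equals 4, take column one's value;
# elsewhere keep column two unchanged.
new_col = np.where(my_2d_array[:, 1] == 4,
                   my_2d_array[:, 0],
                   my_2d_array[:, 1])
print(new_col)   # [2 3]
```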

Finding NumPy column index distance of non-minus-one elements in an n-d array

Suppose I have the following NumPy array:
import numpy as np
arr = np.array([['a', -1, -1, -1],
                [-1, 'b', 'c', -1],
                ['e', -1, 'd', 'f']], dtype=object)
Now I would like to find the column index distance of neighboring elements in each row whenever there is more than one non-minus-one element in that row.
For example, for the pair ('b', 'c'): 'c' is in the third column and 'b' is in the second column, so the column difference is 2 (the column index of 'c') - 1 = 1.
For ('e', 'd'), the distance would be 2 - 0 = 2. For ('d', 'f'), the distance would be 1. For ('e', 'f'), there is 'd' between them, so we do not consider that pair.
I think this does what you want:
[np.diff(np.where(row!=-1)).flatten() for row in arr]
The result:
[array([], dtype=int64), array([1]), array([2, 1])]
I can't think of a way to vectorize it (i.e. to avoid the loop); it's kind of a weird data structure (NumPy arrays contain elements of a single type, so depending on what you want to do, you might find object or Unicode more amenable).
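Put together, a runnable check of that one-liner against the example data:

```python
import numpy as np

arr = np.array([['a', -1, -1, -1],
                [-1, 'b', 'c', -1],
                ['e', -1, 'd', 'f']], dtype=object)

# For each row: indices of the non-(-1) entries, then the gaps between neighbours
distances = [np.diff(np.where(row != -1)).flatten() for row in arr]
print([d.tolist() for d in distances])   # [[], [1], [2, 1]]
```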

iterating a filtered Numpy array whilst maintaining index information

I am attempting to pass filtered values from a NumPy array into a function.
I need to pass only the values above a certain threshold, along with their index positions within the NumPy array.
I am trying to avoid iterating over the entire array in Python by using NumPy's own filtering; the arrays I am dealing with have 20k values in them, with potentially only very few being relevant.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = np.nonzero(somearray > 4)
for i in arrayindex:
    somefunction(arrayindex[0], somearray[arrayindex[0]])
This threw up errors about the logic not being able to handle multiple values, which led me to test it with print statements to see what was going on.
for cell in arrayindex:
    print(f"index {cell}")
    print(f"data {somearray[cell]}")
I expected an output of
index 4
data 5
index 5
data 6
But instead I get
index [4 5]
data [5 6]
I have looked through different methods to iterate through NumPy arrays, such as nditer, but none seem to let me do the filtering of values outside of the for loop.
Is there a solution to my quandary?
Oh, I am aware that it is generally frowned upon to loop through a NumPy array; however, the function I am passing these values to is complex, triggering certain events and uploading data to a database depending on the data's location within the array.
Thanks.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = [idx for idx, val in enumerate(somearray) if val > 4]
for i in range(0, len(arrayindex)):
    somefunction(arrayindex[i], somearray[arrayindex[i]])
for i in range(0, len(arrayindex)):
    print("index", arrayindex[i])
    print("data", somearray[arrayindex[i]])
You need to have a clear idea of what nonzero produces, and pay attention to the difference between indexing with a list and with a tuple.
===
In [110]: somearray = np.array([1,2,3,4,5,6])
...: arrayindex = np.nonzero(somearray > 4)
nonzero produces a tuple of arrays, one per dimension (this becomes more obvious with 2d arrays):
In [111]: arrayindex
Out[111]: (array([4, 5]),)
It can be used directly as an index:
In [113]: somearray[arrayindex]
Out[113]: array([5, 6])
In this 1d case you could take the array out of the tuple, and iterate on it:
In [114]: for i in arrayindex[0]:print(i, somearray[i])
4 5
5 6
argwhere does a 'transpose' of that result, which can also be used for iteration:
In [115]: idxs = np.argwhere(somearray>4)
In [116]: idxs
Out[116]:
array([[4],
       [5]])
In [117]: for i in idxs: print(i,somearray[i])
[4] [5]
[5] [6]
idxs is (2,1) shape, so i is a (1,) shape array, resulting in the brackets in the display. Occasionally that's useful, but nonzero is used more often (frequently by its other name, np.where).
2d
argwhere has a 2d example:
In [119]: x=np.arange(6).reshape(2,3)
In [120]: np.argwhere(x>1)
Out[120]:
array([[0, 2],
       [1, 0],
       [1, 1],
       [1, 2]])
In [121]: np.nonzero(x>1)
Out[121]: (array([0, 1, 1, 1]), array([2, 0, 1, 2]))
In [122]: x[np.nonzero(x>1)]
Out[122]: array([2, 3, 4, 5])
While nonzero can be used to index the array, argwhere elements can't.
In [123]: for ij in np.argwhere(x>1):
     ...:     print(ij, x[ij])
     ...:
...
IndexError: index 2 is out of bounds for axis 0 with size 2
The problem is that ij acts as a list, which indexes just one dimension. numpy distinguishes between lists and tuples when indexing. (Earlier versions fudged the difference, but current versions take a more rigorous approach.)
So we need to change the list into a tuple. One way is to unpack it:
In [124]: for i,j in np.argwhere(x>1):
     ...:     print(i, j, x[i, j])
     ...:
0 2 2
1 0 3
1 1 4
1 2 5
I could have used: print(ij,x[tuple(ij)]) in [123].
I should have used unpacking in the [117] iteration too:
In [125]: for i, in idxs: print(i,somearray[i])
4 5
5 6
or somearray[tuple(i)]
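For the 1d case in the question, a compact sketch using np.flatnonzero (which returns a plain index array rather than a tuple, so each i is a scalar):

```python
import numpy as np

somearray = np.array([1, 2, 3, 4, 5, 6])

# flatnonzero returns a 1-d array of indices into the flattened array
for i in np.flatnonzero(somearray > 4):
    print(i, somearray[i])
# 4 5
# 5 6
```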

Vectorized selection along level variable

When building a DataArray, I can conveniently select along some coordinate:
import xarray as xr
d = xr.DataArray([1, 2, 3],
                 coords={'c': ['a', 'b', 'c']},
                 dims=['c'])
d.sel(c='a')
and even along multiple values on that coordinate:
d.sel(c=['a', 'b'])
However, this fails to work once the coordinate is part of a multi-index dimension:
d = xr.DataArray([1, 2, 3],
                 coords={'c': ('multi_index', ['a', 'b', 'c']),
                         'd': ('multi_index', ['x', 'y', 'z'])},
                 dims=['multi_index'])
d.sel(c='a') # error
d.sel(c=['a', 'b']) # error
with the error ValueError: dimensions or multi-index levels ['c'] do not exist.
Another error message I see in trying to do this is ValueError: Vectorized selection is not available along level variable.
It seems like one can only select along dimensions.
This becomes difficult when a single dimension contains a lot of metadata and one would only like to select based on the values of a single metadata-coordinate.
Is there a suggested workaround other than positionally indexing things by hand?
swap_dims offers one workaround:
In [8]: d.swap_dims({'multi_index': 'c'}).sel(c=['a', 'b'])
Out[8]:
<xarray.DataArray (c: 2)>
array([1, 2])
Coordinates:
  * c        (c) <U1 'a' 'b'
    d        (c) <U1 'x' 'y'
where 'c' becomes the dimension instead of 'multi_index'.
If you want to select based on both 'c' and 'd' in an arbitrary manner, the use of a MultiIndex may be appropriate. set_index does this:
In [12]: d.set_index(cd=['c', 'd'])
Out[12]:
<xarray.DataArray (multi_index: 3)>
array([1, 2, 3])
Coordinates:
  * cd       (cd) MultiIndex
  - c        (cd) object 'a' 'b' 'c'
  - d        (cd) object 'x' 'y' 'z'
Dimensions without coordinates: multi_index
In [13]: d.set_index(cd=['c', 'd']).sel(c='b')
Out[13]:
<xarray.DataArray (multi_index: 1)>
array([2])
Coordinates:
  * d        (d) object 'y'
Dimensions without coordinates: multi_index
However, vectorized selection is not yet supported for MultiIndex (it will complain ValueError: Vectorized selection is not available along level variable), so the first option may be better for your use case.

python pandas: list of sublist: total items number

I've a list like this one:
categories_list = [
    ['a', array([  12994, 1262824,  145854,   92469]),
     'b', array([273300]),
     'c', array([341395, 32857711])],
    ['a', array([ 356424311,  165573412, 2032850784]),
     'b', array([2848105, 228835]),
     'c', array([])],
    ['a', array([1431689, 30655043, 1739919]),
     'b', array([597, 251911, 246600]),
     'c', array([35590])]
]
where each array belongs to the letter before.
Example: 'a' -> array([ 12994, 1262824, 145854, 92469]), 'b' -> array([273300]), 'a' -> array([1431689, 30655043, 1739919]), and so on...
So, is it possible to retrieve the total number of items for each letter?
Desiderata:
----------
a 10
b 6
c 3
All suggestions are welcome
pd.DataFrame(
    [dict(zip(x[::2], [len(y) for y in x[1::2]])) for x in categories_list]
).sum()
a 10
b 6
c 3
dtype: int64
I'm aiming at creating a list of dictionaries. So I have to fill in the ...... with something that parses each sub-list with a dictionary
[ ...... for x in categories_list]
If I use dict on a list or generator of tuples, it will magically turn that into a dictionary with keys as the first value in the tuple and values as the second value in the tuple.
dict(...list of tuples...)
zip will give me that generator of tuples
zip(list one, list two)
I know that in each sub-list, my keys are at the even indices [0, 2, 4...] and values are at the odd indices [1, 3, 5, ...]
# even odd
zip(x[::2], x[1::2])
but x[1::2] will be arrays, and I don't want the arrays. I want the length of the arrays.
# even odd
zip(x[::2], [len(y) for y in x[1::2]])
pandas.DataFrame will take a list of dictionaries and create a dataframe.
Finally, use sum to count the lengths.
I use groupby to group the keys in columns 0, 2, 4 (which hold the keys a, b, c respectively) and then count the number of distinct items in the following column. The count for each group here is len(set(group)) (or len(group) if you want just the total length of the group). See the code below:
from itertools import groupby, chain
count_distincts = []
cols = [0, 2, 4]
for c in cols:
    for gid, group in groupby(categories_list, key=lambda x: x[c]):
        group = list(chain(*[list(g[c + 1]) for g in group]))
        count_distincts.append([gid, len(set(group))])
Output: [['a', 10], ['b', 6], ['c', 3]]
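The same tally can also be computed without pandas or itertools; a minimal sketch using a plain Counter over the key/array pairs:

```python
from collections import Counter
from numpy import array   # the question's sublists use bare array(...)

categories_list = [
    ['a', array([12994, 1262824, 145854, 92469]),
     'b', array([273300]),
     'c', array([341395, 32857711])],
    ['a', array([356424311, 165573412, 2032850784]),
     'b', array([2848105, 228835]),
     'c', array([])],
    ['a', array([1431689, 30655043, 1739919]),
     'b', array([597, 251911, 246600]),
     'c', array([35590])],
]

totals = Counter()
for sublist in categories_list:
    # keys sit at even indices, their arrays at the following odd index
    for key, values in zip(sublist[::2], sublist[1::2]):
        totals[key] += len(values)
print(dict(totals))   # {'a': 10, 'b': 6, 'c': 3}
```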
