Problem involving 'alphabetization' of sets of row elements - python

Consider a variable setSize (it can take value 2 or 3), and a numpy array v.
The number of columns in v is divisible by setSize. Here's a small sample:
import numpy as np
setSize = 2
# the array spaces are shown to emphasize that the rows
# are made up of sets having, in this case, 2 elements each.
v = np.array([[2,5, 3,5, 1,8],
[4,6, 2,7, 5,9],
[1,8, 2,3, 1,4],
[2,8, 1,4, 3,5],
[5,7, 2,3, 7,8],
[1,2, 4,6, 3,5],
[3,5, 2,8, 1,4]])
PROBLEM: For the rows that have all elements unique, I need to ALPHABETIZE the sets.
For example: set 1,14 would precede set 3,5, which would precede set 5,1.
As a final step, I need to eliminate any duplicated rows that may result.
In this example above, the array rows having indices 1,3,5,and 6 have unique elements,
so these rows must be alphabetized. The other rows are not changed.
Further, the rows v[3] and v[6], after alphabetization, are now identical. One of them may be dropped.
The final output looks like:
v = [[2,5, 3,5, 1,8],
[2,7, 4,6, 5,9],
[1,8, 2,3, 1,4],
[1,4, 2,8, 3,5],
[5,7, 2,3, 7,8],
[1,2, 3,5, 4,6]]
I can identify the rows having unique elements with code like below, but I stuck with the alphabetization code.
s = np.sort(v,axis=1)
v[(s[:,:-1] != s[:,1:]).all(1)]

Assuming you have unsuitable rows dropped with:
s = np.sort(v, axis=1)
idx = (s[:,:-1] != s[:,1:]).all(1)
w = v[idx]
Then you can get orders of each row with np.lexsort on a reshaped array:
w = w.reshape(-1,3,2)
s = np.lexsort((w[:,:,1], w[:,:,0]))
Then you can apply fancy indexing and reshape it back:
rows, orders = np.repeat(np.arange(len(s)), 3), s.flatten()
v[idx] = w[rows, orders].reshape((-1,6))
If you need to drop duplicated rows, you can do it like so:
u, idx = np.unique(v, return_index=True, axis=0)
output = v[np.sort(idx)]
Visualization of process:
Sample run:
>>> s
array([[1, 0, 2],
[1, 0, 2],
[0, 2, 1],
[2, 1, 0]], dtype=int64)
>>> rows
array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
>>> orders
array([1, 0, 2, 1, 0, 2, 0, 2, 1, 2, 1, 0], dtype=int64)
>>> v[idx]
array([[2, 7, 4, 6, 5, 9],
[1, 4, 2, 8, 3, 5],
[1, 2, 3, 5, 4, 6],
[1, 4, 2, 8, 3, 5]])
>>> v
array([[2, 5, 3, 5, 1, 8],
[2, 7, 4, 6, 5, 9],
[1, 8, 2, 3, 1, 4],
[1, 4, 2, 8, 3, 5],
[5, 7, 2, 3, 7, 8],
[1, 2, 3, 5, 4, 6],
[1, 4, 2, 8, 3, 5]])
>>> output
array([[2, 5, 3, 5, 1, 8],
[2, 7, 4, 6, 5, 9],
[1, 8, 2, 3, 1, 4],
[1, 4, 2, 8, 3, 5],
[5, 7, 2, 3, 7, 8],
[1, 2, 3, 5, 4, 6]])

Related

numpy.roll horizontally on a 2D ndarray with different values

Doing np.roll(a, 1, axis = 1) on:
a = np.array([
[6, 3, 9, 2, 3],
[1, 7, 8, 1, 2],
[5, 4, 2, 2, 4],
[3, 9, 7, 6, 5],
])
results in the correct:
array([
[3, 6, 3, 9, 2],
[2, 1, 7, 8, 1],
[4, 5, 4, 2, 2],
[5, 3, 9, 7, 6]
])
The documentation says:
If a tuple, then axis must be a tuple of the same size, and each of the given axes is shifted by the corresponding number.
Now I like to roll rows of a by different values, like [1,2,1,3] meaning, first row will be rolled by 1, second by 2, third by 1 and forth by 3. But np.roll(a, [1,2,1,3], axis=(1,1,1,1)) doesn't seem to do it. What would be the correct interpretation of the sentence in the docs?
By specifying a tuple in np.roll you can roll an array along various axes. For example, np.roll(a, (3,2), axis=(0,1)) will shift each element of a by 3 places along axis 0, and it will also shift each element by 2 places along axis 1. np.roll does not have an option to roll each row by a different amount. You can do it though for example as follows:
import numpy as np
a = np.array([
[6, 3, 9, 2, 3],
[1, 7, 8, 1, 2],
[5, 4, 2, 2, 4],
[3, 9, 7, 6, 5],
])
shifts = np.c_[[1,2,1,3]]
a[np.c_[:a.shape[0]], (np.r_[:a.shape[1]] - shifts) % a.shape[1]]
It gives:
array([[3, 6, 3, 9, 2],
[1, 2, 1, 7, 8],
[4, 5, 4, 2, 2],
[7, 6, 5, 3, 9]])

Trying to convert pandas df to np array, dtaidistance computes list instead

I am attempting to compute the distance matrix for an ndarray that I have converted from pandas. I tried to convert the pandas df currently in this format:
move_df =
movement
0 [4, 3, 6, 2]
1 [5, 2, 3, 6, 2]
2 [4, 7, 2, 3, 6, 1]
3 [4, 4, 4, 3]
... ...
33410 [2, 6, 3, 1, 8]
[33410 x 1 columns]
to a numpy ndarray by using the following:
1) m = move_df.to_numpy()
2) m = pd.DataFrame(move_df.tolist()).values
3) m = [move_df.tolist() for i in move_df.columns]
Each of these conversions resulted in a numpy array in this format:
[[list([4, 3, 6, 2])]
[list([5, 2, 3, 6, 2])]
[list([4, 7, 2, 3, 6, 1])]
[list([4, 4, 4, 3])]
...
[list([2, 6, 3, 1, 8])]]
So when I try to run dtaidistance matrix, I get the following error:
d_m = dtw.distance_matrix(m)
TypeError: unsupported operand type(s) for -: 'list' and 'list'
But when I create a list of lists by copying and pasting several of the numpy arrays created with any of the methods mentioned above, the code works. But this is not feasible in the long run since the arrays are over 30k rows. Is there something I am doing wrong in the conversion from pandas df to numpy array? I used
print(type(m))
and it outputs that it is a numpy array and I already know that I cannot subtract a list from a list, hence the error.
EDIT:
For move_df.head(10).to_dict()
{'movement': {0: [4, 3, 6, 2],
1: [5, 2, 3, 6, 2],
2: [4, 7, 2, 3, 6, 1],
3: [4, 4, 4, 3],
4: [3, 6, 2, 3, 3],
5: [6, 2, 1],
6: [1, 1, 1, 1],
7: [7, 2, 3, 1, 1],
8: [7, 2, 3, 2, 1],
9: [6, 2, 3, 1]}}
(one of the dtaidistance authors here)
The dtaidistance package expects one of three formats:
A 2D numpy array (where all sequences have the same length by definition)
A Python list of 1D numpy.array or array.array.
A Python list of Python lists
In your case you could do:
series = move_df['movement'].to_list()
dtw.distance_matrix(series)
which works then on a list of lists.
To use the fast C implementation an array is required (either Numpy or std lib array). If you want to keep different lengths you can do
series = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double)).to_list()
dtw.distance_matrix_fast(series)
Note that it might make sense to do the apply operation inplace on your move_df datastructure such that you only have to do it once and not keep track of two nearly identical datastructures. After you do this, the to_list call is sufficient. Thus:
move_df['movement'] = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double))
series = move_df['movement'].to_list()
dtw.distance_matrix_fast(series)
If you want to use a 2D numpy matrix, you would need to truncate or pad all series to be the same length as is explained in other answers (for dtw padding is more common to not lose information).
ps. This assumes you want to do univariate DTW, the ndim subpackage for multivariate time series expects a different datastructure.
Assuming you want to form an array with the lists of length 4:
m = df['movement'].str.len().eq(4)
a = np.array(df.loc[m, 'movement'].to_list())
output:
array([[4, 3, 6, 2],
[4, 4, 4, 3],
[1, 1, 1, 1],
[6, 2, 3, 1]])
used input:
df = pd.DataFrame({'movement': [[4, 3, 6, 2],
[5, 2, 3, 6, 2],
[4, 7, 2, 3, 6, 1],
[4, 4, 4, 3],
[3, 6, 2, 3, 3],
[6, 2, 1],
[1, 1, 1, 1],
[7, 2, 3, 1, 1],
[7, 2, 3, 2, 1],
[6, 2, 3, 1]]})
A dataframe created with:
In [112]: df = pd.DataFrame({'movement': {0: [4, 3, 6, 2],
...: 1: [5, 2, 3, 6, 2],
...: 2: [4, 7, 2, 3, 6, 1],
...: 3: [4, 4, 4, 3],
...: 4: [3, 6, 2, 3, 3],
...: 5: [6, 2, 1],
...: 6: [1, 1, 1, 1],
...: 7: [7, 2, 3, 1, 1],
...: 8: [7, 2, 3, 2, 1],
...: 9: [6, 2, 3, 1]}})
has an object dtype column that contains lists. The array derived from that column is object dtype:
In [121]: arr = df['movement'].to_numpy()
In [122]: arr
Out[122]:
array([list([4, 3, 6, 2]), list([5, 2, 3, 6, 2]),
list([4, 7, 2, 3, 6, 1]), list([4, 4, 4, 3]),
list([3, 6, 2, 3, 3]), list([6, 2, 1]), list([1, 1, 1, 1]),
list([7, 2, 3, 1, 1]), list([7, 2, 3, 2, 1]), list([6, 2, 3, 1])],
dtype=object)
By selecting the column I get a 1d array, not the 2d you get. Otherwise it's the same
This cannot be converted into a 2d numeric dtype array. For most purposes we can think of this as a list of lists.
In [123]: arr.tolist()
Out[123]:
[[4, 3, 6, 2],
[5, 2, 3, 6, 2],
[4, 7, 2, 3, 6, 1],
[4, 4, 4, 3],
[3, 6, 2, 3, 3],
[6, 2, 1],
[1, 1, 1, 1],
[7, 2, 3, 1, 1],
[7, 2, 3, 2, 1],
[6, 2, 3, 1]]
If the lists were all the same length, or if we pick a subset, it is possible to construct a 2d array:
In [125]: arr[[0,3,6,9]]
Out[125]:
array([list([4, 3, 6, 2]), list([4, 4, 4, 3]), list([1, 1, 1, 1]),
list([6, 2, 3, 1])], dtype=object)
In [126]:
In [126]: np.stack(arr[[0,3,6,9]])
Out[126]:
array([[4, 3, 6, 2],
[4, 4, 4, 3],
[1, 1, 1, 1],
[6, 2, 3, 1]])
Padding and slicing could also be used to force the lists to matching lengths - but that could mean losing information.
But without knowing what dtw.distance_matrix expects (looks like it wants a 2d numeric array), or what these lists represent, I can't go further.
The fundamental point is that your dataframe contains lists that vary in length.

Eliminating array rows that do not meet a matching criterion

Consider an array, M, made up of pairs of elements. (I've used spaces to emphasize that we will be dealing with element PAIRS). The actual arrays will have a large number of rows, and 4,6,8 or 10 columns.
import numpy as np
M = np.array([[1,3, 2,1, 4,2, 3,3],
[3,5, 6,9, 5,1, 3,4],
[1,3, 2,4, 3,4, 7,2],
[4,5, 1,2, 2,1, 2,3],
[6,4, 4,1, 6,1, 4,7],
[6,7, 7,6, 9,7, 6,2],
[5,3, 1,5, 3,3, 3,3]])
PROBLEM: I want to eliminate rows from M having an element pair that has no common elements with any of the other pairs in that row.
In array M, the 2nd row and the 4th row should be eliminated. Here's why:
2nd row: the pair (6,9) has no common element with (3,5), (5,1), or (3,4)
4th row: the pair (4,5) has no common element with (1,2), (2,1), or (2,3)
I'm sure there's a nice broadcasting solution, but I can't see it.
This is a broadcasting solution. Hope it's self-explained:
a = M.reshape(M.shape[0],-1,2)
mask = ~np.eye(a.shape[1], dtype=bool)[...,None]
is_valid = (((a[...,None,:]==a[:,None,...])&mask).any(axis=(-1,-2))
|((a[...,None,:]==a[:,None,:,::-1])&mask).any(axis=(-1,-2))
).all(-1)
M[is_valid]
Output:
array([[1, 3, 2, 1, 4, 2, 3, 3],
[1, 3, 2, 4, 3, 4, 7, 2],
[6, 4, 4, 1, 6, 1, 4, 7],
[6, 7, 7, 6, 9, 7, 6, 2],
[5, 3, 1, 5, 3, 3, 3, 3]])
Another way of solving this would be the following -
M = np.array([[1,3, 2,1, 4,2, 3,3],
[3,5, 6,9, 5,1, 3,4],
[1,3, 2,4, 3,4, 7,2],
[4,5, 1,2, 2,1, 2,3],
[6,4, 4,1, 6,1, 4,7],
[6,7, 7,6, 9,7, 6,2],
[5,3, 1,5, 3,3, 3,3]])
MM = M.reshape(M.shape[0],-1,2)
matches_M = np.any(MM[:,:,None,:,None] == MM[:,None,:,None,:], axis=(-1,-2))
mask = ~np.eye(MM.shape[1], dtype=bool)[None,:]
is_valid = np.all(np.any(matches_M&mask, axis=-1), axis=-1)
M[is_valid]
array([[1, 3, 2, 1, 4, 2, 3, 3],
[1, 3, 2, 4, 3, 4, 7, 2],
[6, 4, 4, 1, 6, 1, 4, 7],
[6, 7, 7, 6, 9, 7, 6, 2],
[5, 3, 1, 5, 3, 3, 3, 3]])

How to return indices from sorting a 2d numpy array row-by-row?

Input: A 2D numpy array
Output: An array of indices that will sort the array row by row (or column by column)
E.g.: Say the function is get_sorted_indices(array, axis=0)
a = np.array([[1,2,3,4,5]
,[2,3,4,5,6]
,[1,2,3,4,5]
,[2,3,4,6,6]
,[2,3,4,5,6]])
ind = get_sorted_indices(a, axis=0)
Then we will get
>>> ind
[0, 2, 1, 4, 3]
>>> a[ind] # should be equals to a.sort(axis = 0)
array([[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[2, 3, 4, 5, 6],
[2, 3, 4, 6, 6]])
>>> a.sort(axis=0)
array([[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[2, 3, 4, 5, 6],
[2, 3, 4, 6, 6]])
I've looked at argsort but I don't understand its output and reading the documentation doesn't help:
>>> a.argsort()
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
>>> a.argsort(axis=0)
array([[0, 0, 0, 0, 0],
[2, 2, 2, 2, 2],
[1, 1, 1, 1, 1],
[3, 3, 3, 4, 3],
[4, 4, 4, 3, 4]])
I can do this manually but I think I'm misunderstanding argsort or I'm missing something from numpy.
Is there a standard way to do this or I have no choice but to do this manually?

Get index of largest element for each submatrix in a Numpy 2D array

I have a 2D Numpy ndarray, x, that I need to split in square subregions of size s. For each subregion, I want to get the greatest element (which I do), and its position within that subregion (which I can't figure out).
Here is a minimal example:
>>> x = np.random.randint(0, 10, (6,8))
>>> x
array([[9, 4, 8, 9, 5, 7, 3, 3],
[3, 1, 8, 0, 7, 7, 5, 1],
[7, 7, 3, 6, 0, 2, 1, 0],
[7, 3, 9, 8, 1, 6, 7, 7],
[1, 6, 0, 7, 5, 1, 2, 0],
[8, 7, 9, 5, 8, 3, 6, 0]])
>>> h, w = x.shape
>>> s = 2
>>> f = x.reshape(h//s, s, w//s, s)
>>> mx = np.max(f, axis=(1, 3))
>>> mx
array([[9, 9, 7, 5],
[7, 9, 6, 7],
[8, 9, 8, 6]])
For example, the 8 in the lower left corner of mx is the greatest element from subregion [[1,6], [8, 7]] in the lower left corner of x.
What I want is to get an array similar to mx, that keeps the indices of the largest elements, like this:
[[0, 1, 1, 2],
[0, 2, 3, 2],
[2, 2, 2, 2]]
where, for example, the 2 in the lower left corner is the index of 8 in the linear representation of [[1, 6], [8, 7]].
I could do it like this: np.argmax(f[i, :, j, :]) and iterate over i and j, but the speed difference is enormous for large amounts of computation. To give you an idea, I'm trying to use (only) Numpy for max pooling. Basically, I'm asking if there is a faster alternative than what I'm using.
Here's one approach -
# Get shape of output array
m,n = np.array(x.shape)//s
# Reshape and permute axes to bring the block as rows
x1 = x.reshape(h//s, s, w//s, s).swapaxes(1,2).reshape(-1,s**2)
# Use argmax along each row and reshape to output shape
out = x1.argmax(1).reshape(m,n)
Sample input, output -
In [362]: x
Out[362]:
array([[9, 4, 8, 9, 5, 7, 3, 3],
[3, 1, 8, 0, 7, 7, 5, 1],
[7, 7, 3, 6, 0, 2, 1, 0],
[7, 3, 9, 8, 1, 6, 7, 7],
[1, 6, 0, 7, 5, 1, 2, 0],
[8, 7, 9, 5, 8, 3, 6, 0]])
In [363]: out
Out[363]:
array([[0, 1, 1, 2],
[0, 2, 3, 2],
[2, 2, 2, 2]])
Alternatively, to simplify things, we could use scikit-image that does the heavy work of reshaping and permuting axes for us -
In [372]: from skimage.util import view_as_blocks as viewB
In [373]: viewB(x, (s,s)).reshape(-1,s**2).argmax(1).reshape(m,n)
Out[373]:
array([[0, 1, 1, 2],
[0, 2, 3, 2],
[2, 2, 2, 2]])

Categories

Resources