I have an array of repeated elements, where each repeated element represents a class. What I would like to do is obtain the indices of the repeated elements and partition them into 3 slices by order of occurrence: the first slice takes the first occurrence of each class, the second the second occurrence, and the last slice the remaining occurrences. For example:
np.array([0, 2, 2, 1, 0, 1, 2, 1, 0, 0])
split the first occurrences in 3:
[0, 2, 1], [2, 0, 1], [2, 1, 0, 0]
I would like to find the indices of the repeated elements and split the array into 3 partitions, where each sliced array contains the indices of the corresponding occurrences.
So for the array and its splits, I'd like to obtain the following:
array:   [0, 2, 2, 1, 0, 1, 2, 1, 0, 0]
indices: [0, 1, 3], [2, 4, 5], [6, 7, 8, 9]
I've tried the following:
import numpy as np

a = np.array([0, 2, 2, 1, 0, 1, 2, 1, 0, 0])
length = np.arange(len(a))
array_set = [length[a == unique] for unique in np.unique(a)]
But I can't figure out how to split the partitions in order of the first occurrences as in the above examples.
This is a way to split the array in proportions of 3, that is, the last 0 will be left out:
# unique values
uniques = np.unique(a)
# counting occurrence of each unique value
occ = np.cumsum(a == uniques[:,None], axis=1)
# maximum common occurrence
max_occ = occ.max(axis=1).min()
# masking the first occurrences
u = (occ[None,...] == (np.arange(max_occ)+1)[:,None, None])
# the indexes
idx = np.sort(np.argmax(u, axis=-1), axis=-1)
# the partitions
partitions = a[idx]
Output:
# idx
array([[0, 1, 3],
       [2, 4, 5],
       [6, 7, 8]])
# partitions
array([[0, 2, 1],
       [2, 0, 1],
       [2, 1, 0]])
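If you also need the leftover indices that this truncation drops (here the trailing 0 at position 9), one minimal sketch is to compare the collected indices against the full range:

import numpy as np

a = np.array([0, 2, 2, 1, 0, 1, 2, 1, 0, 0])
idx = np.array([[0, 1, 3], [2, 4, 5], [6, 7, 8]])  # idx computed above

# indices not covered by any partition
leftover = np.setdiff1d(np.arange(len(a)), idx.ravel())
print(leftover)  # [9]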
This is a problem where np.concatenate(...) + some algorithm + np.split(...) does the trick, though these are relatively slow methods.
Let's start with the concatenation and the indices where you split:
classes = [[0, 2, 1], [2, 0, 1], [2, 1, 0, 0]]
split_idx = np.cumsum(list(map(len, classes[:-1])))
flat_classes = np.concatenate(classes)
Then we need the indices that sort the initial array, as well as the indices where each group starts. In this case the sorted array is [0, 0, 0, 0, 1, 1, 1, 2, 2, 2] and the distinct groups start at 0, 4 and 7.
c = np.array([0, 2, 2, 1, 0, 1, 2, 1, 0, 0])
idx = np.argsort(c)
u, cnt = np.unique(c, return_counts=True)
marker_idx = np.r_[0, np.cumsum(cnt[:-1])]
Now comes the trickiest part. It is known that exactly one of the indices 0, 4 or 7 changes at each step (as you iterate over flat_classes), so you can accumulate these changes in a special array called counter, with one column per index, and afterwards access only the entries where changes occurred:
take = np.zeros((len(flat_classes), len(u)), dtype=int)
take[np.arange(len(flat_classes)), flat_classes] = 1
counter = np.cumsum(take, axis=0)
counter = counter + marker_idx - np.ones(len(u), dtype=int)
active_idx = counter[np.arange(len(flat_classes)), flat_classes]
splittable = idx[active_idx]  # remember that we are working on indices that sort the array
output = np.split(splittable, split_idx)
Output:
[array([0, 1, 3], dtype=int64),
 array([2, 4, 5], dtype=int64),
 array([6, 7, 8, 9], dtype=int64)]
Remark: the main idea of the solution is to track how the group-start indices advance inside idx, the array of indices that sorts the original array. These are the changes of counter for this problem:
>>> counter
array([[0, 3, 6],
       [0, 3, 7],
       [0, 4, 7],
       [0, 4, 8],
       [1, 4, 8],
       [1, 5, 8],
       [1, 5, 9],
       [1, 6, 9],
       [2, 6, 9],
       [3, 6, 9]])
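For reference, here is the whole approach gathered into one runnable script (a sketch on the same example array; the printed result matches the output above):

import numpy as np

c = np.array([0, 2, 2, 1, 0, 1, 2, 1, 0, 0])
classes = [[0, 2, 1], [2, 0, 1], [2, 1, 0, 0]]

split_idx = np.cumsum(list(map(len, classes[:-1])))
flat_classes = np.concatenate(classes)

idx = np.argsort(c)
u, cnt = np.unique(c, return_counts=True)
marker_idx = np.r_[0, np.cumsum(cnt[:-1])]

take = np.zeros((len(flat_classes), len(u)), dtype=int)
take[np.arange(len(flat_classes)), flat_classes] = 1
counter = np.cumsum(take, axis=0) + marker_idx - 1  # broadcasts like the version above

active_idx = counter[np.arange(len(flat_classes)), flat_classes]
output = np.split(idx[active_idx], split_idx)
print(output)  # [array([0, 1, 3]), array([2, 4, 5]), array([6, 7, 8, 9])]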
I don't know if this is simple or not, or whether it has been asked before. (I searched but did not find the right way to do it. I have found numpy.argmax and numpy.amax, but I am not able to use them correctly.)
I have a numpy array (it is a CxKxN matrix) as follows (C=K=N=3):
array([[[1, 2, 3],
        [2, 1, 4],
        [4, 3, 3]],

       [[2, 1, 1],
        [1, 3, 1],
        [3, 4, 2]],

       [[5, 2, 1],
        [3, 3, 3],
        [4, 1, 2]]])
I would like to find the indices of the maximum elements across each line. A line is the concatenation of the three (C) rows of each matrix. In other words, the i-th line is the concatenation of the i-th row in the first matrix, the i-th row in the second matrix, ..., until the i-th row in the C-th matrix.
For example, the first line is
[1, 2, 3, 2, 1, 1, 5, 2, 1]
So I would like to return
[2, 0, 0] # the index of the maximum in the first line
and
[0, 1, 2] # the index of the maximum in the second line
and
[0, 2, 0] # the index of the maximum in the third line
or
[1, 2, 1] # the index of the maximum in the third line
or
[2, 2, 0] # the index of the maximum in the third line
Now, I am trying this
np.argmax(a[:,0,:], axis=None) # for the first line
It returns 6 and
np.argmax(a[:,1,:], axis=None)
and it returns 2 and
np.argmax(a[:,2,:], axis=None)
and it returns 0
but I am not able to convert these numbers to indices like 6 = (2, 0, 0), etc.
With a transpose and reshape I get your 'rows':
In [367]: arr.transpose(1,0,2).reshape(3,9)
Out[367]:
array([[1, 2, 3, 2, 1, 1, 5, 2, 1],
       [2, 1, 4, 1, 3, 1, 3, 3, 3],
       [4, 3, 3, 3, 4, 2, 4, 1, 2]])
In [368]: np.argmax(_, axis=1)
Out[368]: array([6, 2, 0])
These maxima are the same as yours. Here are the same indices, but in a (3,3) array:
In [372]: np.unravel_index([6,2,0],(3,3))
Out[372]: (array([2, 0, 0]), array([0, 2, 0]))
Join them with the middle-dimension range:
In [373]: tup = (_[0],np.arange(3),_[1])
In [374]: np.transpose(tup)
Out[374]:
array([[2, 0, 0],
       [0, 1, 2],
       [0, 2, 0]])
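The steps above can be wrapped into a small helper that works for any CxKxN array (a sketch; argmax_per_line is a name I made up):

import numpy as np

def argmax_per_line(arr):
    # arr has shape (C, K, N); line k is the concatenation of
    # row k across all C matrices
    C, K, N = arr.shape
    lines = arr.transpose(1, 0, 2).reshape(K, C * N)
    flat = lines.argmax(axis=1)
    c, n = np.unravel_index(flat, (C, N))
    return np.transpose((c, np.arange(K), n))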
I have a numpy array of 1650 rows and 1275 columns containing 0s and 255s.
I want to get the index of the first zero in every row and store it in an array.
I used a for loop to achieve that. Here is the example code:
# new_arr is a numpy array and k is an empty list
for i in range(new_arr.shape[0]):
    if not np.all(new_arr[i, :] == 255):
        x = np.where(new_arr[i, :] == 0)[0][0]
        k.append(x)
    else:
        k.append(-1)
It takes around 1.3 seconds for 1650 rows. Is there any other way or function to get the indices array in a much faster way?
One approach would be to get a mask of matches with ==0 and then take the argmax along each row, i.e. argmax(axis=1), which gives us the first matching index for each row -
(arr==0).argmax(axis=1)
Sample run -
In [443]: arr
Out[443]:
array([[0, 1, 0, 2, 2, 1, 2, 2],
       [1, 1, 2, 2, 2, 1, 0, 1],
       [2, 1, 0, 1, 0, 0, 2, 0],
       [2, 2, 1, 0, 1, 2, 1, 0]])
In [444]: (arr==0).argmax(axis=1)
Out[444]: array([0, 6, 2, 3])
Catching non-zero rows (if we can!)
To account for rows that won't have any zero, we need one more step of work, with some masking -
In [445]: arr[2] = 9
In [446]: arr
Out[446]:
array([[0, 1, 0, 2, 2, 1, 2, 2],
       [1, 1, 2, 2, 2, 1, 0, 1],
       [9, 9, 9, 9, 9, 9, 9, 9],
       [2, 2, 1, 0, 1, 2, 1, 0]])
In [447]: mask = arr==0
In [448]: np.where(mask.any(1), mask.argmax(1), -1)
Out[448]: array([ 0, 6, -1, 3])
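Wrapped up as a reusable function, the masking step might look like this (a sketch; first_zero_per_row is a name I made up):

import numpy as np

def first_zero_per_row(arr):
    # column index of the first 0 in each row, or -1 if the row has none
    mask = (arr == 0)
    return np.where(mask.any(axis=1), mask.argmax(axis=1), -1)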
I want to find the minimum number in each row of a 9x9 array A and store the results in a list m. But I want to exclude zero. The program that I made is returning zero as the minimum.
m = []
def find(L):
    for i in range(len(L)):
        m.append(A[L[i]].min())

c = [2, 3, 4, 5, 6]
find(c)
print m
Here's a NumPy solution -
np.where(a>0,a,a.max()).min(1)
Sample run -
In [45]: a
Out[45]:
array([[0, 4, 6, 6, 1],
       [3, 1, 5, 0, 0],
       [6, 3, 6, 0, 0],
       [0, 6, 3, 5, 2]])
In [46]: np.where(a>0,a,a.max()).min(1)
Out[46]: array([1, 1, 3, 2])
If you want to perform this operation along selected rows only specified by row indices in L -
def find(a, L):
    return np.where(a[L] > 0, a[L], a.max()).min(1)
Sample run -
In [62]: a
Out[62]:
array([[0, 4, 6, 6, 1],
       [3, 1, 5, 0, 0],
       [6, 3, 6, 0, 0],
       [0, 6, 3, 5, 2]])
In [63]: L = [2,3]
In [64]: find(a,L)
Out[64]: array([3, 2])
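A different sketch of the same row-wise minimum, using a masked array instead of np.where (np.ma is a separate technique from the answer above; rows that are all zero would come back fully masked rather than getting a fill value):

import numpy as np

# mask out the zeros, then take the row-wise minimum of what remains
masked = np.ma.masked_equal(a, 0)
m = masked.min(axis=1)  # masked entries are ignored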
I want to create a new list according to the cumulative sums of the numbers in a list. The input is ideal: it can be split into subsets whose sums are all equal. The lengths of the subsets are not equal. The number of subsets is an input.
Each subset of the output is represented by an incrementing integer [0, 1, 2, 3, ...] that replaces the original input values; the quantity of distinct integers is the number of subsets.
Example:
number of subsets = 2
input = [1, 4, 5]
#cumsum = [1, 5, 10]
subsets = [1,5], [10]
output-subsets = [0,0], [1]
output = [0, 0, 1]
More examples:
number of subsets = 4
input = [1, 2, 3, 4, 2, 5, 1, 6]
#cumsum = [1, 3, 6, 10, 12, 17, 18, 24]
subsets = [1,3,6], [10, 12],[17, 18], [24]
output-subsets = [0, 0, 0], [1, 1], [2, 2], [3]
output = [0, 0, 0, 1, 1, 2, 2, 3]
number of subsets = 2
input = [1, 2, 3, 4, 2, 5, 1, 6]
#cumsum = [1, 3, 6, 10, 12, 17, 18, 24]
subsets = [1, 3, 6, 10, 12],[17, 18, 24]
output-subsets = [0, 0, 0, 0, 0], [1, 1, 1]
output = [0, 0, 0, 0, 0, 1, 1, 1]
I tried to modify code from an SO question:
def changelist(lis, t):
    total = 0
    s = sum(lis)
    subset = s / t
    for x in lis:
        total += x
        i = 1
        if total <= subset:
            i = 0
        yield i

# changelist([input list], number of subsets)
print list(changelist([1, 2, 3, 4, 2, 5, 1, 6], 4))
but only the first subset is correct:
output = [0, 0, 0, 1, 1, 1, 1, 1]
I think numpy.array_split is problematic; see strange behaviour of numpy array_split.
I would really love any kind of explanation or help.
This should solve your problem:
def changelist(l, t):
    subset = sum(l) / t
    current, total = 0, 0
    for x in l:
        total += x
        if total > subset:
            current, total = current + 1, x
        yield current
Examples:
>>> list(changelist([1, 4, 5], 2))
[0, 0, 1]
>>> list(changelist([1, 2, 3, 4, 2, 5, 1, 6], 4))
[0, 0, 0, 1, 1, 2, 2, 3]
>>> list(changelist([1, 2, 3, 4, 2, 5, 1, 6], 2))
[0, 0, 0, 0, 0, 1, 1, 1]
How does it work?
current stores the "id" of the current subset, and total the running sum of the current subset.
For each element x in your initial list l, you add its value to the current total; if this total is greater than the expected sum of each subset (subset in my code), then you know that you are in the next subset (current = current + 1) and you "reset" the total of the current subset to the current element (total = x).
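To make the mechanics concrete, here is a trace of the first example, where subset = (1 + 4 + 5) / 2 = 5:

# changelist([1, 4, 5], 2), subset = 5:
#   x = 1 -> total = 1,  1 > 5 is False -> yield 0
#   x = 4 -> total = 5,  5 > 5 is False -> yield 0
#   x = 5 -> total = 10, 10 > 5 is True -> current = 1, total = 5 -> yield 1
# result: [0, 0, 1]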
You can use NumPy here, after converting the input to an array, for a vectorized solution, with N as the number of subsets, as listed here -
def modified_cumsum(input, N):
    A = np.asarray(input).cumsum()
    return np.append(False, np.in1d(A, (1 + np.arange(N)) * A[-1] / N))[:-1].cumsum()
Sample runs -
In [31]: N = 2 #number of subsets
...: input = [1, 4, 5]
...:
In [32]: modified_cumsum(input,N)
Out[32]: array([0, 0, 1])
In [33]: N = 4 #number of subsets
...: input = [1, 2, 3, 4, 2, 5, 1, 6]
...:
In [34]: modified_cumsum(input,N)
Out[34]: array([0, 0, 0, 1, 1, 2, 2, 3])
In [35]: N = 2 #number of subsets
...: input = [1, 2, 3, 4, 2, 5, 1, 6]
...:
In [36]: modified_cumsum(input,N)
Out[36]: array([0, 0, 0, 0, 0, 1, 1, 1])
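To see what the one-liner inside modified_cumsum is doing, here is a step-by-step expansion for N = 4 and input = [1, 2, 3, 4, 2, 5, 1, 6] (the intermediate names are mine):

import numpy as np

A = np.asarray([1, 2, 3, 4, 2, 5, 1, 6]).cumsum()
# A -> [ 1  3  6 10 12 17 18 24]

# target prefix sums at which each subset should end
boundaries = (1 + np.arange(4)) * A[-1] / 4
# boundaries -> [ 6. 12. 18. 24.]

hits = np.in1d(A, boundaries)
# hits -> [False False  True False  True False  True  True]

# shift by one so the increment happens on the element *after* a boundary
out = np.append(False, hits)[:-1].cumsum()
# out -> [0 0 0 1 1 2 2 3]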
An extension to this question. In addition to having the unique elements row-wise, I want to have a similarly shaped array that gives me the count of unique values. For example, if the initial array looks like this:
a = np.array([[1, 2, 2, 3, 4, 5],
              [1, 2, 3, 3, 4, 5],
              [1, 2, 3, 4, 4, 5],
              [1, 2, 3, 4, 5, 5],
              [1, 2, 3, 4, 5, 6]])
I would like to get this as the output from the function:
np.array([[1, 2, 0, 1, 1, 1],
          [1, 1, 2, 0, 1, 1],
          [1, 1, 1, 2, 0, 1],
          [1, 1, 1, 1, 2, 0],
          [1, 1, 1, 1, 1, 1]])
In numpy v1.9 there seems to be an additional argument, return_counts, that can return the counts in a flattened array. Is there some way this can be reconstructed into the original array dimensions, with zeros where values were duplicated?
The idea behind this answer is very similar to the one used here. I'm adding a unique imaginary number to each row. Therefore, no two numbers from different rows can be equal. Thus, you can find all the unique values in a 2D array per row with just one call to np.unique.
The index, ind, returned when return_index=True gives you the location of the first occurrence of each unique value.
The count, cnt, returned when return_counts=True gives you the count.
np.put(b, ind, cnt) places the count in the location of the first occurrence of each unique value.
One obvious limitation of the trick used here is that the original array must have int or float dtype. It cannot have a complex dtype to start with, since adding a unique imaginary number to each row may then produce duplicate values across different rows.
import numpy as np

a = np.array([[1, 2, 2, 3, 4, 5],
              [1, 2, 3, 3, 4, 5],
              [1, 2, 3, 4, 4, 5],
              [1, 2, 3, 4, 5, 5],
              [1, 2, 3, 4, 5, 6]])

def count_unique_by_row(a):
    weight = 1j * np.linspace(0, a.shape[1], a.shape[0], endpoint=False)
    b = a + weight[:, np.newaxis]
    u, ind, cnt = np.unique(b, return_index=True, return_counts=True)
    b = np.zeros_like(a)
    np.put(b, ind, cnt)
    return b
yields
In [79]: count_unique_by_row(a)
Out[79]:
array([[1, 2, 0, 1, 1, 1],
       [1, 1, 2, 0, 1, 1],
       [1, 1, 1, 2, 0, 1],
       [1, 1, 1, 1, 2, 0],
       [1, 1, 1, 1, 1, 1]])
This method does the same as np.unique for each row, by sorting each row and getting the lengths of runs of consecutive equal values. It has complexity O(NM log M), which is better than running unique on the whole array, since that has complexity O(NM log(NM)).
def row_unique_count(a):
    # indices that sort each row
    args = np.argsort(a)
    # each row in sorted order
    unique = a[np.indices(a.shape)[0], args]
    # mark where a new run of equal values starts in each sorted row
    changes = np.pad(unique[:, 1:] != unique[:, :-1], ((0, 0), (1, 0)),
                     mode="constant", constant_values=1)
    idxs = np.nonzero(changes)
    tmp = np.hstack((idxs[-1], 0))
    # run lengths: distance to the next run start, or to the row end
    counts = np.where(tmp[1:], np.diff(tmp), a.shape[-1] - tmp[:-1])
    # scatter the counts back to the first occurrence of each value
    count_array = np.zeros(a.shape, dtype="int")
    count_array[(idxs[0], args[idxs])] = counts
    return count_array
Running times:
In [162]: b = np.random.random(size=100000).reshape((100, 1000))
In [163]: %timeit row_unique_count(b)
100 loops, best of 3: 10.4 ms per loop
In [164]: %timeit count_unique_by_row(b)
100 loops, best of 3: 19.4 ms per loop
In [165]: assert np.all(row_unique_count(b) == count_unique_by_row(b))