Eliminating redundant numpy rows - python

If I have an array
arr = [[0,1],
       [1,2],
       [2,3],
       [4,3],
       [5,6],
       [3,4],
       [2,1],
       [6,7]]
how could I eliminate redundant rows where the column values may be swapped? In the example above, the code would reduce the array to
arr = [[0,1],
       [1,2],
       [2,3],
       [4,3],
       [5,6],
       [6,7]]
I have thought about using a combination of slicing (arr[:,::-1]), np.all, and np.any, but what I have come up with so far simply gives me True or False per row when comparing rows, and this does not discriminate between similar rows.
j = np.any([np.all(y==arr, axis=1) for y in arr[:,::-1]], axis=0)
which yields [False, True, False, True, False, True, True, False].
Thanks in advance.

Basically you want to Find Unique Rows, and these answers borrow heavily from the top two answers there, but you need to sort each row first so that different orderings compare equal.
If you don't care about the order of rows at the end, this is the short way (but slower than the one below). Note that it returns the sorted rows, so [4,3] comes back as [3,4]:
np.vstack(list({tuple(row) for row in np.sort(arr,-1)}))
If you do want to maintain order, you can turn each sorted row into a void object and use np.unique with return_index:
b = np.ascontiguousarray(np.sort(arr,-1)).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_arr = arr[np.sort(idx)]
It might be tempting to use set row-wise instead of using np.sort(arr,-1) and np.void to make an object array, but this only works if there are no repeated values in rows. If there are, a row of [1,2,2] will be considered equivalent to a row of [1,1,2], because both become {1, 2}.
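On NumPy 1.13 or later, np.unique accepts an axis argument, so the void view can be skipped. The following is a minimal sketch of that variant, assuming arr is a 2-D integer array; the names keys and first_idx are illustrative only.

```python
import numpy as np

arr = np.array([[0, 1], [1, 2], [2, 3], [4, 3],
                [5, 6], [3, 4], [2, 1], [6, 7]])

# Sort within each row so that [4, 3] and [3, 4] share the same key.
keys = np.sort(arr, axis=1)

# first_idx holds, for each unique key, the position of its first occurrence.
_, first_idx = np.unique(keys, axis=0, return_index=True)

# Sorting the indices restores the original row order.
unique_arr = arr[np.sort(first_idx)]
print(unique_arr)
```

Because the result is taken from arr rather than from keys, the surviving rows keep their original column order.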

A solution without using numpy,
In [27]: result_ = set(([tuple(sorted(row)) for row in arr]))
In [28]: result = [list(i) for i in result_]
In [29]: result
Out[29]: [[0, 1], [1, 2], [6, 7], [5, 6], [2, 3], [3, 4]]
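A set does not keep insertion order, and the rows above come back sorted internally. If the original row order and the original column order matter, a seen-set loop is one option. This is only a sketch; rows, seen and result are names chosen for the example.

```python
rows = [[0, 1], [1, 2], [2, 3], [4, 3], [5, 6], [3, 4], [2, 1], [6, 7]]

seen = set()    # order-independent keys already encountered
result = []     # first occurrence of each key, in original order

for row in rows:
    key = tuple(sorted(row))
    if key not in seen:
        seen.add(key)
        result.append(row)

print(result)
```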

The solution using numpy.lexsort routine:
import numpy as np
arr = np.array([
[0,1], [1,2], [2,3], [4,3], [5,6], [3,4], [2,1], [6,7]
])
order = np.lexsort(arr.T)
a = arr[order] # sorted rows
arr= a[[i for i,r in enumerate(a) if i == len(a)-1 or set(a[i]) != set(a[i+1])]]
print(arr)
The output:
[[0 1]
[1 2]
[2 3]
[3 4]
[5 6]
[6 7]]
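One caveat: comparing neighbours only works when swapped rows end up adjacent after sorting. That holds for the sample input, but a pair such as [1,5] and [5,1] with [3,4] between them would be missed when sorting the raw rows. A possible variant, sketched here under the assumption that sorting on row-sorted keys is acceptable (keys, order and is_first are illustrative names), is:

```python
import numpy as np

arr = np.array([[1, 5], [3, 4], [5, 1], [4, 3]])

# Order-independent key for every row.
keys = np.sort(arr, axis=1)

# lexsort is stable, so equal keys stay in their original relative order.
order = np.lexsort(keys.T)
sorted_keys = keys[order]

# Mark the first row of every run of identical keys.
is_first = np.ones(len(arr), dtype=bool)
is_first[1:] = np.any(sorted_keys[1:] != sorted_keys[:-1], axis=1)

# Map back to original positions and restore the original row order.
result = arr[np.sort(order[is_first])]
print(result)
```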

After getting the boolean list, you can use the following technique to obtain the list without the rows where x and y are swapped.
To remove identical rows as well, use the last line of the block.
# This block removes rows where x and y are swapped, given the list j
j=[True,False..] #Your Boolean List
finalArray=[]
for (flag,value) in zip(j,arr):
    if not flag:
        finalArray.append(value)
# This line removes identical rows
finalArray= [list(x) for x in set(tuple(x) for x in finalArray)]

Related

I need to remove every point that has the same Y coordinate in an array

Basically I have a list of [x,y] points that goes: [0,1][1,2][2,4][3,1][4,3] and so on. I want to remove the points that have the same y coordinate, keeping only the first one in order. I would like to have as output: [0,1][1,2][2,4][4,3]. How can I do this? I have tried using np.unique, but I can't manage to keep the first appearance or to remove based on the y coordinate.
Thanks
You can use HYRY's solution from numpy.unique with order preserved, you just need to select the Y column.
import numpy as np
a = np.array([[0,1], [1,2], [2,4], [3,1], [4,3]])
_, idx = np.unique(a[:, 1], return_index=True)
a[np.sort(idx)]
result:
[[0 1]
[1 2]
[2 4]
[4 3]]
array = [[0,1],[1,2],[2,4],[3,1],[4,3]]
occurred = set()
result = []
for element in array:
    if element[1] not in occurred:
        result.append(element)
        occurred.add(element[1])
array.clear()
array.extend(result)
print(array)
>> [[0, 1], [1, 2], [2, 4], [4, 3]]

concatenate arrays of different lengths into one multidimensional array

I know that you can't stack or concatenate arrays of different lengths in NumPy, as all matrices need to be rectangular, but is there any other way to achieve this?
For example:
a = [1, 2 ,3]
b = [9, 8]
Stacking them would give:
c = [[1, 2, 3]
[9, 8]]
Alternatively, if there is no way to create the above, how could I write a function to get this (0 in place of the missing element to fill the matrix)?
c = [[1, 2, 3]
[9, 8, 0]]
This code worked for me:
import numpy as np
a = [1, 2 ,3]
b = [9,8]
# Pad the shorter list with zeros until both lengths match
while len(b) != len(a):
    if len(b) > len(a):
        a.append(0)
    else:
        b.append(0)
final = np.array([a,b])
print(final)
The code is self-explanatory, but I will try my best to give a valid explanation:
We take two lists (say a and b) and compare their lengths. If they are unequal, we append an element (in this case 0) to the shorter one; this loops until their lengths are equal, and then the two lists are converted into a 2D NumPy array.
You can also replace 0 with np.nan if you want NaN values.
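The same idea extends to any number of lists. Below is a rough sketch of such a function; pad_rows and its fill parameter are hypothetical names, not part of NumPy.

```python
import numpy as np

def pad_rows(rows, fill=0):
    """Stack lists of unequal length, padding short ones on the right with fill."""
    width = max(len(r) for r in rows)
    # np.full picks the dtype from fill: integer for 0, float for np.nan.
    out = np.full((len(rows), width), fill)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

c = pad_rows([[1, 2, 3], [9, 8]])
print(c)
```

Passing fill=np.nan gives a float array with NaN in the padded positions. Unlike the loop above, the input lists are left unmodified.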
I think what you are looking for is:
In:
import numpy as np
from itertools import zip_longest
a = [1, 2 ,3]
b = [9, 8]
c = np.array(list(zip_longest(*[a,b])),dtype=float).transpose()
print(c)
Out:
[[ 1. 2. 3.]
[ 9. 8. nan]]

Numpy array indexing with a List: difference between arr[:][li] and arr[:,li]

What is the explanation of the following behavior:
import numpy as np
arr = np.zeros((3, 3))
li = [1,2]
print('output1:', arr[:, li].shape)
print('output2:', arr[:][li].shape)
>>output1: (3, 2)
>>output2: (2, 3)
I would expect output2 to be equal to output1.
Let's use a different array where it's easier to see the difference:
>>> arr = np.arange(9).reshape(3, 3)
>>> arr
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
The first case arr[:, li] will select all elements from the first dimension (in this case all the rows), then index the second dimension with [1, 2], which keeps columns 1 and 2 and leaves out the first column:
array([[1, 2],
[4, 5],
[7, 8]])
Hence, the shape of this is (3, 2).
In the other case, arr[:] returns a view of the whole array, so it doesn't change the shape. arr[:][li] is therefore equivalent to arr[li], which selects rows 1 and 2, hence the output shape is (2, 3). In general you should avoid chained indexing of an array, because each step creates an intermediate view or copy, which is inefficient.
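These claims can be checked directly. The snippet below is a small sketch; whole, rows and cols are names introduced only for illustration.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)
li = [1, 2]

whole = arr[:]     # basic slicing: same data, same shape
rows = arr[:][li]  # list index applied to axis 0 of that result
cols = arr[:, li]  # list index applied to axis 1

print(np.shares_memory(arr, whole))   # True: arr[:] does not copy the data
print(np.array_equal(rows, arr[li]))  # True: arr[:][li] is just arr[li]
print(rows.shape, cols.shape)         # (2, 3) (3, 2)
```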
You are getting the correct output.
In the first line
print('output1:', arr[:, li].shape)
you are selecting the 2nd and 3rd elements of each subarray within arr, thus getting 3 elements each containing 2 values.
In the second line
print('output2:', arr[:][li].shape)
you are first selecting the whole array, and then from the whole array you select the 2nd and 3rd elements (each containing 3 elements themselves), thus getting 2 elements each containing 3 values.
The difference can be seen if you examine this code -
import numpy as np
arr = np.arange(9).reshape(3, 3)
li = [1,2]
print('output1:', arr[:, li])
print('output2:', arr[:][li])
This gives -
[[1 2]
[4 5]
[7 8]]
and
[[3 4 5]
[6 7 8]]
When you do arr[:, [1, 2]], you are saying that you want to take all the rows of the array (: specifies this) and, from those, take columns 1 and 2.
On the other hand, when you do arr[:], you are referring to the full array first, from which you then take rows 1 and 2 (the second and third rows).
Essentially, in the second case, [1, 2] refers to the row axis of the original array, while in the first case it refers to the column axis.

Sum Rows of 2D np array with List of Indices

I have a 2d numpy array and a list of numbers. If the list is [1, 3, 1, 8] where the list sums to the number of rows, I want to output an array with the first row unchanged, the next three rows summed, the fifth row unchanged, and the remaining eight rows summed.
As an example:
A = [[0,0], [1,2], [3,4]] and l = [1, 2] would output [[0,0], [4,6]].
I looked through np.sum and other functions but could not find this functionality. Thank you.
You can just iterate over the indices of l and based on the position either take that row or sum over a range of rows.
import numpy as np
A = [[0,0], [1,2], [3,4]]
l = [1, 2]
ans = []
for i in range(len(l)):
    if i%2 == 0:
        ans.append(A[ l[i] ])
    else:
        ans.append( np.sum( A[ l[i-1]:l[i-1] + l[i] ], axis=0 ) )
ans = np.array(ans)
print(ans)
[[1 2]
[4 6]]
N.B:
If the list is [1, 3, 1, 8] where the list sums to the number of rows,
I want to output an array with the first row unchanged, the next three
rows summed, the fifth row unchanged, and the remaining eight rows
summed.
I think you meant [1, 3, 5, 8]
If the number of elements in l is relatively large, you might get better performance by using groupby from pandas, e.g.
import numpy as np
import pandas as pd
labels = np.repeat(np.arange(1, len(l) + 1), l)
# [1, 2, 2]
df = pd.DataFrame(A)
df['label'] = labels
result = df.groupby('label').sum().values
I ended up coming up with my own solution when I realized I could sort my list without affecting my desired output. I used np.unique to determine the first indices of each element in the sorted list and then summed the rows between those indices. See below.
elements, indices = np.unique(data, return_counts=True)
row_summing = np.append([0], np.cumsum(indices))[:-1] #[0, index1, index2,...]
output = np.add.reduceat(matrix, row_summing, axis=0)
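When the list already holds group lengths, as in the question, np.add.reduceat can be applied to it directly. This is a sketch that assumes every entry of l is positive and that the entries sum to the number of rows; starts and out are illustrative names.

```python
import numpy as np

A = np.array([[0, 0], [1, 2], [3, 4]])
l = [1, 2]  # group lengths: one row, then two rows

# Start index of each group: 0, then the running total of earlier lengths.
starts = np.cumsum([0] + l[:-1])

# Sum the rows from each start index up to the next one (or to the end).
out = np.add.reduceat(A, starts, axis=0)
print(out)
```

For the example from the question this gives [[0, 0], [4, 6]].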

Filter columns where one of the values is 0

I want to filter columns where one of the values in the column is 0. So
>>> test = numpy.array([[3,2,3], [0,4,2],[2,3,2]])
>>> test
[[3 2 3]
 [0 4 2]
 [2 3 2]]
would become
>>> test[somefilter]
[[2 3]
 [4 2]
 [3 2]]
I thought this could be done by
>>> test[:, ~numpy.any(0, axis=0)]
but this just gets the last column.
In your code, numpy.any(0, axis=0) tests whether any value in "0" is nonzero, so it will always evaluate False. Therefore, ~numpy.any(0, axis=0) will always evaluate True, which gets cast to the index 1, so you always get column 1 back.
Instead you want to look for columns in test where there are not any zeros in the row values:
test[:, ~numpy.any(test == 0, axis=0)]
Or equivalently, where all row values are nonzero using numpy.all():
test[:, numpy.all(test, axis=0)]
#[[2, 3]
# [4, 2]
# [3, 2]]
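To see what the mask looks like before it is applied, the steps can be split up. This is a small sketch using the array from the question; has_zero and filtered are names chosen for the example.

```python
import numpy as np

test = np.array([[3, 2, 3], [0, 4, 2], [2, 3, 2]])

# One boolean per column: True where that column contains a zero.
has_zero = np.any(test == 0, axis=0)
print(has_zero)  # True only for the first column

# Keep the columns without a zero.
filtered = test[:, ~has_zero]
print(filtered)

# For this integer array the np.all form produces the same mask.
same_mask = np.array_equal(~has_zero, np.all(test, axis=0))
```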
In your code, numpy.any(0, axis=0) always evaluates to 0. You need to pass in test==0 to check for values of 0 in test.
How about this?
In [37]: x = numpy.any(test==0, axis=0)
In [38]: test[:,numpy.where(x== False)[0]]
Out[38]:
array([[2, 3],
[4, 2],
[3, 2]])
Edit
I'm gonna leave this as a more roundabout way of doing the same thing, but I think ali_m's answer is more elegant and stylistically closer to the asker's code.
If you wanted to filter columns where one value is 0, you could've used all:
test[:, test.all(axis=0)]
or
test[:, numpy.all(test, axis=0)]
How about not using numpy?
arr=[[3,2,3], [0,4,2],[2,3,2]]
for lis in arr:
    for i,num in enumerate(lis):
        if num==0:
            # drop column i from every row
            for chk in arr:
                del chk[i]
print(arr)
results:
[[2, 3], [4, 2], [3, 2]]
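Deleting from a list while enumerating it can skip elements when several columns contain a zero. One way around that, sketched here with illustrative names (rows, keep, filtered), is to decide which columns to keep first and then rebuild the rows:

```python
rows = [[3, 2, 3], [0, 4, 2], [2, 3, 2]]

# zip(*rows) iterates over columns; keep the indices of columns without a zero.
keep = [i for i, col in enumerate(zip(*rows)) if 0 not in col]

# Rebuild every row from the kept column indices.
filtered = [[row[i] for i in keep] for row in rows]
print(filtered)
```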
