Efficient way of finding rows matching a list of values - python

This question is similar to this one, but I couldn't find a way to adapt it in my case.
I have a pretty big Nx3 matrix of integers. I need to find the list of rows matching a list of integers. The final goal is to filter the matrix to remove the rows containing one of these values.
Right now, the best I could come up with involves a for loop over my list of integers and numpy.logical_and.reduce to find the rows. I believe there must be a more efficient way, without having to drop down to a lower-level language.
import numpy as np
matrix = np.random.randint(0,100000,(50000, 3))
values_to_find = np.random.randint(0,100000,10000)
matches = np.ones(len(matrix), bool)
for value in values_to_find:
    matches = matches & np.logical_and.reduce(matrix != value, axis=1)
new_matrix = matrix[matches]
What's the more efficient and elegant way?

Another approach is with np.isin, i.e.
matrix[~np.isin(matrix, values_to_find).any(1)]

One approach would be to get the mask of matches across all rows with np.in1d and then look for rows with any one match and then get the rest of the rows -
matrix[~np.in1d(matrix, values_to_find).reshape(matrix.shape).any(1)]
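As a quick sanity check, a minimal sketch (reusing the setup from the question) confirming that both one-liners above produce the same filtered matrix as the original loop:
import numpy as np

matrix = np.random.randint(0, 100000, (50000, 3))
values_to_find = np.random.randint(0, 100000, 10000)

# reference result: the original loop-based filter
matches = np.ones(len(matrix), bool)
for value in values_to_find:
    matches &= np.logical_and.reduce(matrix != value, axis=1)
new_matrix = matrix[matches]

# the vectorized masks from the answers above
isin_result = matrix[~np.isin(matrix, values_to_find).any(1)]
in1d_result = matrix[~np.in1d(matrix, values_to_find).reshape(matrix.shape).any(1)]

assert np.array_equal(new_matrix, isin_result)
assert np.array_equal(new_matrix, in1d_result)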

Related

Merge one tensor into other tensor on specific indexes in PyTorch

Is there any efficient way to merge one tensor into another in PyTorch, but only on specific indexes?
Here is my full problem.
I have a list of indexes of a tensor; in the code below, xy is the original tensor.
I need to preserve the rows of xy that are in the indexes list, and apply some function to the rows other than those indexes (for simplicity, let's say the function is 'multiply them by two'),
xy = torch.rand(100,4)
indexes=[1,2,55,44,66,99,3,65,47,88,99,0]
Then merge them back into the original tensor.
This is what I have done so far:
I create a mask tensor
indexes=[1,2,55,44,66,99,3,65,47,88,99,0]
xy = torch.rand(100,4)
mask = []
for i in range(0, xy.shape[0]):
    if i in indexes:
        mask.append(False)
    else:
        mask.append(True)
print(mask)
import numpy as np
target_mask = torch.from_numpy(np.array(mask, dtype=bool))
print(target_mask.sum())  # output is 89, as these are the elements other than the preserved ones
Apply the function on masked rows
zy = xy[target_mask]
print(zy)
zy=zy*2
print(zy)
The code above works fine and is posted here to clarify the problem.
Now I want to merge tensor zy into xy on specified index saved in the list indexes.
Here is the pseudocode I made; as one can see, it is too complex and needs 3 for loops to complete the task, which would waste too many resources.
# pseudocode
for masked_row in indexes:
    for xy_rows_index in xy:
        if xy_rows_index == masked_row:
            pass
        else:
            take zy tensor row and replace here  # another loop to read zy
But I am not sure what an efficient way to merge them is, as I don't want to use NumPy or for loops, etc.; that would make the process slow, as the original tensor is too big and I am going to use a GPU.
Is there any efficient way in PyTorch to do this?
Once you have your mask you can assign updated values in place.
zy = 2 * xy[target_mask]
xy[target_mask] = zy
As for acquiring the mask, I don't necessarily see a problem with your approach, though using the built-in set operations would probably be more efficient. This also gives an index list instead of a mask, which, depending on the number of indices being updated, may be more efficient.
i = list(set(range(len(xy)))-set(indexes))
zy = 2 * xy[i]
xy[i] = zy
Edit:
To address the comment, specifically to find the complement of indices of i we can do
i_complement = list(set(range(len(xy)))-set(i))
However, assuming indexes contains only values between 0 and len(xy)-1, we could equivalently use i_complement = list(set(indexes)), which just removes the repeated values in indexes.
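If you want to avoid the Python loop (and NumPy) for building the mask entirely, a minimal sketch using only tensor operations, assuming every value in indexes is a valid row index of xy:
import torch

xy = torch.rand(100, 4)
indexes = [1, 2, 55, 44, 66, 99, 3, 65, 47, 88, 99, 0]

# rows listed in `indexes` are preserved; all other rows get the function applied
preserve = torch.zeros(len(xy), dtype=torch.bool)
preserve[torch.tensor(indexes)] = True  # repeated indexes are harmless here
target_mask = ~preserve

xy[target_mask] = xy[target_mask] * 2  # in-place update, also works on GPU tensors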

Finding and removing palindrome rows in 2D numpy array

What would be a pythonic and effective way to find/remove palindrome rows from a matrix? Though the title suggests the matrix to be a numpy ndarray, it can be a pandas DataFrame if that leads to a more elegant solution.
The obvious way would be to implement this using a for-loop, but I'm interested in whether there is a more effective and succinct way.
My first idea was to concatenate rows and rows-inverse, and then extract duplicates from the concatenated matrix. But this list of duplicates would contain both the initial row and its inverse, so to remove the second instance of a palindrome I'd still have to do some for-looping.
My second idea was to somehow use broadcasting to get the cartesian product of rows and apply my own ufunc (perhaps created using numba) to get a 2D bool matrix. But I don't know how to create a ufunc that would take a matrix axis instead of a scalar.
EDIT:
I guess I should apologize for the poorly formulated question (English is not my native language). I don't need to find out whether any row itself is a palindrome, but whether there are pairs of rows within the matrix that are palindromes of each other.
I simply check if each row is equal to its reflection (along axis 1) in all elements; if true, it is a palindrome (correct me if I am wrong). Then I index out the rows that aren't palindromes.
import numpy as np
a = np.array([
    [1, 0, 0, 1],  # Palindrome
    [0, 2, 2, 0],  # Palindrome
    [1, 2, 3, 4],
    [0, 1, 4, 0],
])
wherepalindrome = (a == a[:,::-1]).all(1)
print(a[~wherepalindrome])
#[[1 2 3 4]
# [0 1 4 0]]
Naphat's answer is the pythonic (numpythonic) way to go. That should be the accepted answer.
But if your array is really large, you don't want to create a temporary copy, and you wish to explore Numba's intricacies, you can use something like this:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def palindromic_rows(a):
    rows, cols = a.shape
    palindromes = np.full(rows, True, dtype=nb.boolean)
    mid = cols // 2
    for r in nb.prange(rows):  # <-- parallel loop
        for c in range(mid):
            if a[r, c] != a[r, -c-1]:
                palindromes[r] = False
                break
    return palindromes
This contraption just replaces the elegant (a == a[:,::-1]).all(axis=1), but it's almost an order of magnitude faster for very large arrays and it doesn't duplicate them.
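For completeness, a minimal usage sketch (assuming the array a defined in the answer above), showing that the Numba version drops in where the vectorized expression was used:
palindromes = palindromic_rows(a)  # same mask as (a == a[:, ::-1]).all(axis=1)
print(a[~palindromes])             # keep only the non-palindrome rows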

Extracting the maximum element of an array by index based on conditional of other arrays

I believe that my problem is really straightforward and there must be a really easy way to solve this issue; however, as I am quite new to Python, I could not sort it out on my own.
I made up the following example, which naturally represents a much simpler scenario than what I have been working on; hence I am looking for a general solution applicable to other cases. So, please, consider:
import numpy as np
x = np.array([300,300,450,500,510,750,600,300])
x_validate1 = np.array([0,27,3,4,6,4,13,5])
x_validate2 = np.array([0,27,3,4,6,4,3,5])
x_validate3 = np.array([0,7,3,14,6,16,6,5])
x_validate4 = np.array([0,3,3,5,7,4,9,5])
What I need is to extract the maximum value in x whose index in the other arrays (x_validate1, 2, 3, 4) corresponds to elements between 5 and 10 (the condition). That means that if I just picked the maximum of the x array, it would logically be 750; however, with this condition applied, what I want the script to return is 510, since at that index the condition is met in all the other arrays.
Hope that I managed to be succinct and precise. I would really appreciate your help on this one!
Here's one approach:
# combine all the above arrays into one
a = np.array([x_validate1, x_validate2, x_validate3, x_validate4])
# check in which columns all rows satisfy the condition
m = ((a > 5) & (a < 10)).all(0)
# index x, and find the maximum value
x[m].max()
# 510
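One hedged caveat, not part of the original answer: if no column satisfies the condition, x[m] is empty and .max() raises a ValueError. A minimal guard, assuming x and a from the snippet above:
m = ((a > 5) & (a < 10)).all(0)
result = x[m].max() if m.any() else None  # fall back to None when no index qualifies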

In Python, split array rows into groups according to the value in a specific column of that array [duplicate]

This question already has an answer here:
Python: Split NumPy array based on values in the array
(1 answer)
Closed 3 years ago.
I have an array where each row of data follows a sequential order, identified by a label column at the end. As a small example, its format is similar to this:
arr = [[1, 2, 3, 1],
       [2, 3, 4, 1],
       [3, 4, 5, 1],
       [4, 5, 6, 2],
       [5, 6, 7, 2],
       [7, 8, 9, 2],
       [9, 10, 11, 3]]
I would like to split the array into groups using the label column as the group-by marker. So the above array would produce 3 arrays:
arrA = [[1, 2, 3, 1],
        [2, 3, 4, 1],
        [3, 4, 5, 1]]
arrB = [[4, 5, 6, 2],
        [5, 6, 7, 2],
        [7, 8, 9, 2]]
arrC = [9, 10, 11, 3]
I currently have this FOR loop, storing each group array in a wins list:
wins = []
for w in range(1, arr[-1, 3] + 1):
    wins.append(arr[arr[:, 3] == w, :])
This does the job okay but I have several large datasets to process so is there a vectorized way of doing this, maybe by using diff() or where() from the numpy library?
Okay, I did some more digging using the "numpy group by" search criteria, thanks to the guy who commented but has now removed their comment, and found this very similar question: Is there any numpy group by function?.
I adapted the answer from Vincent J (https://stackoverflow.com/users/1488055/vincent-j) to this and it produced the correct result:
wins = np.split(arr[:, :], np.cumsum(np.unique(arr[:, 3], return_counts=True)[1])[:-1])
I will go with this code but by all means chip in if anyone thinks there's a better way.
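Since the question mentions diff() and where(): assuming the label column is sorted so that rows with the same label are contiguous (as in the example), a minimal sketch of the same split using np.diff to locate the group boundaries:
import numpy as np

arr = np.array([[1, 2, 3, 1],
                [2, 3, 4, 1],
                [3, 4, 5, 1],
                [4, 5, 6, 2],
                [5, 6, 7, 2],
                [7, 8, 9, 2],
                [9, 10, 11, 3]])

# positions where the label column changes mark the start of a new group
boundaries = np.flatnonzero(np.diff(arr[:, 3])) + 1
wins = np.split(arr, boundaries)  # three arrays, one per label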
I know you seem to want arrays, but I think that for what you're asking a dict is possibly an easier way to approach this.
from collections import defaultdict
wins = defaultdict(list)
for item in arr:
    wins[item[-1]].append(item)
Then the separate arrays you want are the values in wins (e.g., wins[1] is a list of rows where the label is 1).
Just seems a little more Pythonic and readable to me!
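If you then need actual NumPy arrays per group (a small hedged addition, not in the original answer), you can convert the collected lists afterwards:
import numpy as np

# one 2D array per label, e.g. grouped[1], grouped[2], grouped[3]
grouped = {label: np.array(rows) for label, rows in wins.items()}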
I think this piece of code would be more than fast enough with any dataset that's not absolutely massive:
wins = []  # list of groups, indexed by label
for a in arr:
    while True:
        try:
            wins[a[-1]].append(a)
            break
        except IndexError:
            wins.append([])
You definitely won't get anything better than O(n). If your data is stored somewhere else, like a SQL database or something, you'd probably be better off running this logic in the SQL query itself.

Find intersecting values in multiple numpy arrays

I have 100 large arrays, > 250,000 elements each. I want to find common values that are found in these arrays. I know that there are not going to be values that are found in all 100 arrays, but a small number of values will be found in multiple arrays (I suspect 10-30%). I want to find which values are found with the highest frequency across these arrays. (Side point: the arrays have no duplicates.)
I know that I can loop through the arrays and eventually find them, but that will take a while. I also know about the np.intersect1d function, but that only gives values that are found within all of the arrays, whereas I'm looking for values that are only going to be in around 20 of the 100 arrays.
My best bet is to use the np.intersect1d function and loop through all possible combinations of the arrays, which would definitely take a while, but not as long as simply looping through all 250,000 x 100 values.
Example:
array_1 = array([1.98, 2.33, 3.44, ..., 11.1])
array_2 = array([1.26, 1.49, 4.14, ..., 9.0])
array_3 = array([1.58, 2.33, 3.44, ..., 19.1])
array_4 = array([4.18, 2.03, 3.74, ..., 12.1])
.
.
.
array_100 = array([1.11, 2.13, 1.74, ..., 1.1])
There are no values in all 100 arrays, but is there a value that can be found in 30 different arrays?
You can either use np.unique with the return_counts keyword, or a vanilla Python Counter.
The first option works if you can concatenate your arrays into a single 250k x 100 monolith, or even string them out one after the other:
unq, counts = np.unique(monolith, return_counts=True)
ind = np.argsort(counts)[::-1]
unq = unq[ind]
counts = counts[ind]
This will leave you with an array containing all the unique values, and the frequency with which they occur.
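As a hedged illustration of that idea (the list name arrays and the threshold of 30 arrays are assumptions taken from the question; since each array contains no duplicates, a value's count equals the number of arrays it appears in):
import numpy as np

# `arrays` is assumed to be a Python list holding the 100 arrays
monolith = np.concatenate(arrays)
unq, counts = np.unique(monolith, return_counts=True)

# values present in at least 30 of the arrays
common = unq[counts >= 30]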
If the arrays have to remain separate, use collections.Counter to accomplish the same task. In the following, I assume that you have a list containing your arrays. It would be very pointless to have a hundred individually named variables:
from collections import Counter

c = Counter()
for arr in arrays:
    c.update(arr)
Now c.most_common() will give you the most common elements and their counts.
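A quick usage sketch of the Counter route (again assuming arrays is the list holding your 100 arrays, as above):
# the ten most frequent values and the number of arrays each appears in
print(c.most_common(10))

# values appearing in at least 30 of the arrays
common = [value for value, count in c.items() if count >= 30]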
