Finding and removing palindrome rows in 2D numpy array - python

What would be a pythonic and efficient way to find/remove palindrome rows from a matrix? Though the title says the matrix is a numpy ndarray, it can be a pandas DataFrame if that leads to a more elegant solution.
The obvious way would be to implement this with a for-loop, but I'm interested in whether there is a more efficient and succinct way.
My first idea was to concatenate the rows and the reversed rows, and then extract duplicates from the concatenated matrix. But this list of duplicates would contain both the initial row and its reverse, so to remove the second instance of each palindrome I'd still have to do some for-looping.
My second idea was to somehow use broadcasting to get the cartesian product of rows and apply my own ufunc (perhaps created using numba) to get a 2D bool matrix. But I don't know how to create a ufunc that would operate on a matrix axis instead of scalars.
EDIT:
I guess I should apologize for the poorly formulated question (English is not my native language). I don't need to find out whether any single row is itself a palindrome, but whether there are pairs of rows within the matrix that are palindromes of each other (i.e. one row is the reverse of another).
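For this pairwise interpretation, a possible broadcasting sketch looks like the following (the example data is my own, and note that it builds an n x n x m comparison array, so memory grows quickly with the number of rows):
import numpy as np

a = np.array([
    [1, 2, 3, 4],
    [4, 3, 2, 1],   # reverse of row 0
    [0, 1, 4, 0],
])

# pairs[i, j] is True when row i equals row j read backwards.
pairs = (a[:, None, :] == a[None, :, ::-1]).all(axis=-1)

# Keep each unordered pair once and drop i == j (rows that are
# palindromes on their own).
i, j = np.nonzero(np.triu(pairs, k=1))
print(list(zip(i.tolist(), j.tolist())))   # [(0, 1)]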

I simply check if each row is equal to its reflection (reversal along axis 1) in all elements; if it is, that row is a palindrome (correct me if I am wrong). Then I index out the rows that aren't palindromes.
import numpy as np

a = np.array([
    [1, 0, 0, 1],  # palindrome
    [0, 2, 2, 0],  # palindrome
    [1, 2, 3, 4],
    [0, 1, 4, 0],
])

wherepalindrome = (a == a[:, ::-1]).all(1)
print(a[~wherepalindrome])
# [[1 2 3 4]
#  [0 1 4 0]]

Naphat's answer is the pythonic (numpythonic) way to go. That should be the accepted answer.
But if your array is really large, you don't want to create a temporary copy, and you wish to explore Numba's intricacies, you can use something like this:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def palindromic_rows(a):
    rows, cols = a.shape
    palindromes = np.full(rows, True, dtype=nb.boolean)
    mid = cols // 2
    for r in nb.prange(rows):  # <-- parallel loop over rows
        for c in range(mid):
            if a[r, c] != a[r, -c-1]:  # compare c-th element with its mirror
                palindromes[r] = False
                break
    return palindromes
This contraption just replaces the elegant (a == a[:,::-1]).all(axis=1), but it's almost an order of magnitude faster for very large arrays and it doesn't duplicate them.
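For reference, a possible way to use it, reusing the example array from the first answer (the first call includes the JIT compilation overhead):
a = np.array([
    [1, 0, 0, 1],
    [0, 2, 2, 0],
    [1, 2, 3, 4],
    [0, 1, 4, 0],
])

mask = palindromic_rows(a)   # first call pays the JIT compilation cost
print(a[~mask])
# [[1 2 3 4]
#  [0 1 4 0]]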

Related

Go from Permutation Indices to Permutation Matrix in Python

I have two lists of indices, and I would like to generate the corresponding permutation matrix. The two lists have equal size n and contain all the integers from 0 up to n-1.
Simple Example:
Given initial and final indices (as per the two-line convention https://en.wikipedia.org/wiki/Permutation_matrix):
initial_index = [3,0,2,1] and final_index = [0,1,3,2]
In other words, the last entry (3) has got to go to the first (0), the first (0) has got to go to the second (1) etc. You could also imagine zipping these two lists in order to obtain the permutation rules: [(3,0),(0,1),(2,3),(1,2)], read this as (3 -> 0),(0 -> 1) and so forth. This is a right-shift for a list, or a down-shift for a column vector. The resulting permutation matrix should be the following:
M = [[0,0,0,1],
     [1,0,0,0],
     [0,1,0,0],
     [0,0,1,0]]
Multiplying this matrix by a column vector indeed shifts the entries down by 1, as required.
Are there any relevant operations that could achieve this efficiently?
You want an n-by-n matrix where, for every i from 0 to n-1, the cell at row final_index[i] and column initial_index[i] is set to 1, and every other cell is set to 0.
NumPy advanced indexing can be used to set those cells easily:
permutation_matrix = numpy.zeros((n, n), dtype=int)
permutation_matrix[final_index, initial_index] = 1
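As a quick sanity check with the indices from the question (the test vector is my own):
import numpy as np

initial_index = [3, 0, 2, 1]
final_index = [0, 1, 3, 2]
n = len(initial_index)

permutation_matrix = np.zeros((n, n), dtype=int)
permutation_matrix[final_index, initial_index] = 1
print(permutation_matrix)
# [[0 0 0 1]
#  [1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]]

v = np.array([10, 20, 30, 40])
print(permutation_matrix @ v)   # [40 10 20 30], the down-shift from the question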
Alternatively to the good answer of @user2357112, you can use sparse matrices to be efficient in memory:
import numpy as np
from scipy.sparse import csr_matrix

permutation_matrix = csr_matrix((np.ones(n, dtype=int), (final_index, initial_index)), shape=(n, n))
# Use permutation_matrix.todense() to convert the matrix to a dense one if needed
The complexity of building this sparse matrix is O(n) in both time and space, while for dense matrices it is O(n^2). So sparse matrices are much better for large vectors (n > 1000).
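If you only need the permutation's action on a vector, the sparse matrix can be applied directly, without ever densifying it; a small usage sketch with the question's indices:
import numpy as np
from scipy.sparse import csr_matrix

initial_index = np.array([3, 0, 2, 1])
final_index = np.array([0, 1, 3, 2])
n = len(initial_index)

permutation_matrix = csr_matrix(
    (np.ones(n, dtype=int), (final_index, initial_index)), shape=(n, n)
)

v = np.array([10, 20, 30, 40])
print(permutation_matrix @ v)   # [40 10 20 30], computed without a dense matrix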

How to find indices of matching elements in arrays?

I have two vectors. Vector A is (1298, 1); vector B varies inside a for loop but is always a column vector. I am trying to use numpy.where to find the A-indices of the elements in B. Currently I have a for loop combing through vector B element-wise and using numpy.isclose, but I was wondering if anyone knows a quicker function and/or how to do this without a nested for loop? It works, but very slowly.
The for loop looks like this:
sphere_indices = []
for k in range(len(A)):
    for j in range(len(B)):
        if np.isclose(B[j,0], A[k,0]):
            sphere_indices.append(k)
There was never any reason to iterate through all 1298 elements of vector A. In order to use numpy.where and numpy.isclose I just needed to use the elements of B one at a time so numpy can broadcast properly. The following code runs much faster. Any further improvements are always welcome.
sphere_indices = []
for j in range(len(index)):
    sphere_indices1 = np.where(np.isclose(sphere_index[:,0], index[j,0]))
    sphere_indices.append(sphere_indices1[0])
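If both vectors fit in memory, the remaining Python loop can also be dropped with broadcasting. A sketch (not from the answer above), using the A/B names from the question and synthetic data of my own:
import numpy as np

# Synthetic stand-ins for the question's arrays (only the shapes are assumed).
A = np.random.rand(1298, 1)
B = A[np.random.choice(1298, 50), :]   # build B from A so that matches exist

# close[j, k] is True when B[j, 0] is close to A[k, 0]; this compares
# every element of B against every element of A in one broadcast step.
close = np.isclose(B[:, 0][:, None], A[:, 0][None, :])

# For each element of B, collect the matching indices into A.
b_idx, a_idx = np.nonzero(close)
sphere_indices = [a_idx[b_idx == j] for j in range(len(B))]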

Update numpy array with sparse indices and values

I have a 1-dimensional numpy array and want to store sparse updates of it.
Say I have an array of length 500000 and want to do 100 updates of 100 elements each. The updates are either additions or just changes of the values (I don't think it matters).
What is the best way to do it using numpy?
I wanted to just store two arrays, indices and values_to_add, and therefore have two objects: one stores the dense matrix, the other just keeps the indices and values to add, and I can do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concatenate them.
But this numpy syntax doesn't handle repeated indices well: the repeated occurrences are applied only once, the rest are ignored.
Merging an update that repeats an index into the stored pair of arrays is O(n). I thought of using a dict instead of an array to store the updates, which looks fine from the complexity point of view, but it doesn't look like good numpy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices you could use np.add.at (ufunc.at). From the documentation:
Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
Code
import numpy as np

a = np.arange(10)
indices = [0, 2, 2]
np.add.at(a, indices, [-44, -55, -55])
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]
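To tie this back to the question's workflow of concatenating updates: batches of (indices, values) can be concatenated and applied in a single call, and repeated indices across batches are accumulated correctly. A small sketch with made-up sizes:
import numpy as np

dense_matrix = np.zeros(500000)

# Two batches of sparse updates with an overlapping index (42).
idx1, val1 = np.array([10, 42, 42]), np.array([1.0, 2.0, 3.0])
idx2, val2 = np.array([42, 99]), np.array([4.0, 5.0])

indices = np.concatenate([idx1, idx2])
values_to_add = np.concatenate([val1, val2])

# Unlike dense_matrix[indices] += values_to_add, repeated indices accumulate.
np.add.at(dense_matrix, indices, values_to_add)
print(dense_matrix[[10, 42, 99]])   # [1. 9. 5.]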

Efficient way of finding rows matching a list of values

This question is similar to this one, but I couldn't find a way to adapt it in my case.
I have a pretty big Nx3 matrix of integers. I need to find the list of rows that match a list of integers. The final goal is to filter the matrix to remove the rows containing one of these values.
Right now, the best I could come up with involves a for loop over my list of integers and numpy.logical_and.reduce to find the rows. I believe there must be a more efficient way, without having to go down to a lower-level language.
import numpy as np

matrix = np.random.randint(0, 100000, (50000, 3))
values_to_find = np.random.randint(0, 100000, 10000)

matches = np.ones(len(matrix), bool)
for value in values_to_find:
    matches = matches & np.logical_and.reduce(matrix != value, axis=1)
new_matrix = matrix[matches]
What's the more efficient and elegant way?
Another approach is with np.isin, i.e.
matrix[~np.isin(matrix, values_to_find).any(1)]
One approach would be to get the mask of matches across all rows with np.in1d, then look for rows with any match, and keep the remaining rows -
matrix[~np.in1d(matrix, values_to_find).reshape(matrix.shape).any(1)]
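A quick sanity check (my own, with smaller sizes than in the question so the loop finishes quickly) that the one-liners agree with the loop version:
import numpy as np

matrix = np.random.randint(0, 1000, (5000, 3))
values_to_find = np.random.randint(0, 1000, 100)

# Loop version from the question
matches = np.ones(len(matrix), bool)
for value in values_to_find:
    matches = matches & np.logical_and.reduce(matrix != value, axis=1)
loop_result = matrix[matches]

# np.isin one-liner
isin_result = matrix[~np.isin(matrix, values_to_find).any(1)]

print(np.array_equal(loop_result, isin_result))   # True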

Numpy Arrays comparison and indexing

I have 2 arrays of unequal size:
>>> np.size(array1)
4004001
>>> np.size(array2)
1000
Now, each element of array2 needs to be compared with all the elements of array1, to find the element of array1 whose value is nearest to that element of array2.
Upon finding this value, I need to store it in a different array of size 1000, i.e. one of a size corresponding to array2.
The tedious and crude way of doing it would be a for loop: take each element from array2, compute the absolute difference with the elements of array1 and then take the minimum. This is going to make my code really slow.
I'd like to use numpy vectorized operations to do this, but I've kind of hit a wall.
To make full use of the numpy parallelism we need vectorized functions. Furthermore, all values are looked up in the same array (array1) using the same criterion (nearest). Therefore, it is possible to make a special function for searching in array1 specifically.
However, to make the solution more reusable it is better to write a more general solution and then turn it into a more specific one. Thus, as a general approach to finding the closest value, we start with this find-nearest solution. Then we turn it into a more specific one and vectorize it, to allow it to work on multiple elements at once:
import math
import numpy as np
from functools import partial

def find_nearest_sorted(array, value):
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

array1_sorted = np.sort(array1)

# Partially apply array1 to the find function, to turn the general function
# into a specific one, working with array1 only.
find_nearest_in_array1 = partial(find_nearest_sorted, array1_sorted)

# Vectorize the specific function to allow us to apply it to all elements of
# array2, the numpy way.
vectorized_find = np.vectorize(find_nearest_in_array1)

output = vectorized_find(array2)
Hopefully this is what you wanted, a new vector, mapping the data in array2 to the nearest values in array1.
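Note that np.vectorize still loops in Python under the hood. As an alternative sketch (not part of the answer above), the same nearest-value lookup can be expressed with a single np.searchsorted call over all of array2:
import numpy as np

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

array1_sorted = np.sort(array1)

# Insertion point of every array2 value in the sorted array1, clipped so that
# both neighbours exist.
idx = np.searchsorted(array1_sorted, array2, side="left")
idx = np.clip(idx, 1, len(array1_sorted) - 1)

# Pick whichever neighbour (left or right) is closer to each value.
left = array1_sorted[idx - 1]
right = array1_sorted[idx]
output = np.where(np.abs(array2 - left) < np.abs(array2 - right), left, right)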
The most "numpythonic" way is is to use broadcasting. This is a quick and easy way to calculate a distance matrix, for which you can then take the argmin of the absolute value.
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
# Calculate distance matrix (on truncated array1 for memory reasons)
dmat = array1[:400400] - array2[:,None]
# Take the abs of the distance matrix and work out the argmin along the last axis
ix = np.abs(dmat).argmin(axis=1)
shape of dmat:
(1000, 400400)
shape of ix and contents:
(1000,)
array([237473, 166831, 72369, 11663, 22998, 85179, 231702, 322752, ...])
However, it's memory hungry if you do this operation in one go, and it actually doesn't work on my 8GB machine for the array sizes that you specify, which is why I reduced the size of array1.
To make it work within memory constraints, simply slice one of the arrays into chunks and apply broadcasting on each chunk in turn (or parallelise). In this case, I've sliced array2 into 10 chunks:
# Define number of chunks and calculate chunk size
n_chunks = 10
chunk_len = array2.size // n_chunks

# Preallocate output array
out = np.zeros(1000)

for i in range(n_chunks):
    s = slice(i * chunk_len, (i + 1) * chunk_len)
    out[s] = np.abs(array1 - array2[s, None]).argmin(axis=1)
A more compact version of the same broadcasting idea, using float16 to reduce memory:
import numpy as np

a = np.random.random(size=4004001).astype(np.float16)
b = np.random.random(size=1000).astype(np.float16)

# Use numpy broadcasting to compute pairwise differences, then find the argmin
# in a for each element of b. Finally extract elements from a using the argmin
# array as indexes.
output = a[np.argmin(np.abs(b[:, None] - a), axis=1)]
This solution, while simple, can be very memory intensive. It may need a bit of further optimisation if used on large arrays.
