I have two lists of indices and I would like to generate the corresponding permutation matrix. The two lists have equal size n and each contains all integers from 0 to n-1.
Simple Example:
Given initial and final indices (as per the two-line convention https://en.wikipedia.org/wiki/Permutation_matrix):
initial_index = [3,0,2,1] and final_index = [0,1,3,2]
In other words, the last entry (3) has got to go to the first (0), the first (0) has got to go to the second (1) etc. You could also imagine zipping these two lists in order to obtain the permutation rules: [(3,0),(0,1),(2,3),(1,2)], read this as (3 -> 0),(0 -> 1) and so forth. This is a right-shift for a list, or a down-shift for a column vector. The resulting permutation matrix should be the following:
M = [[0,0,0,1],
[1,0,0,0],
[0,1,0,0],
[0,0,1,0]]
Multiplying this matrix by a column vector indeed shifts the entries down by 1, as required.
Are there any relevant operations that could achieve this efficiently?
You want an n-by-n matrix where, for every i from 0 to n-1, the cell at row final_index[i] and column initial_index[i] is set to 1, and every other cell is set to 0.
NumPy advanced indexing can be used to set those cells easily:
permutation_matrix = numpy.zeros((n, n), dtype=int)
permutation_matrix[final_index, initial_index] = 1
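A quick check of this approach, using the names and values from the question:

```python
import numpy as np

initial_index = [3, 0, 2, 1]
final_index = [0, 1, 3, 2]
n = len(initial_index)

# Place a 1 at (final_index[i], initial_index[i]) for every i in one shot
permutation_matrix = np.zeros((n, n), dtype=int)
permutation_matrix[final_index, initial_index] = 1

# Applying it to a column vector shifts the entries down by one, as required
v = np.array([10, 20, 30, 40])
print(permutation_matrix @ v)  # -> [40 10 20 30]
```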
As an alternative to user2357112's good answer, you can use sparse matrices to save memory:
from scipy.sparse import csr_matrix
permutation_matrix = csr_matrix((np.ones(n, dtype=int), (final_index, initial_index)), shape=(n,n))
# Use permutation_matrix.todense() to convert the matrix if needed
Building this sparse matrix is O(n) in both time and space, whereas a dense matrix costs O(n^2). So sparse matrices are much better for large vectors (n > 1000 or so).
For a two-dimensional array, according to Guide to NumPy:
The two styles of memory layout for arrays are connected through the transpose operation. Thus, if A is a (contiguous) C-style array, then the same block of memory can be used to represent A^T as a (contiguous) Fortran-style array. This kind of understanding can be useful when trying to grasp the memory mapping of Fortran-style arrays.
For a three-dimensional array
A = np.arange(6).reshape([1,2,3])
we can view A as a 1 by 1 block matrix, which means A has one entry, and that entry is a matrix with two rows and three columns.
So we can iteratively take the transpose according to the rules above.
My question is:
A= np.arange(6).reshape([1,2,3])
B = A.transpose(1,2,0)
In this case, how does it work? Is there a rule that tells me how the elements of B are arranged?
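One way to state the rule, sketched on the question's own example: in A.transpose(axes), axes[k] names the old axis that becomes new axis k, so transpose(1, 2, 0) gives B[i, j, k] == A[k, i, j].

```python
import numpy as np

A = np.arange(6).reshape([1, 2, 3])
B = A.transpose(1, 2, 0)

# transpose(1, 2, 0): old axis 1 becomes new axis 0, old axis 2 becomes
# new axis 1, old axis 0 becomes new axis 2, so B[i, j, k] == A[k, i, j]
print(B.shape)                  # -> (2, 3, 1)
print(B[1, 2, 0], A[0, 1, 2])   # -> 5 5
```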
What would be a pythonic and effective way to find/remove palindrome rows from a matrix? Though the title suggests the matrix is a numpy ndarray, it can be a pandas DataFrame if that leads to a more elegant solution.
The obvious way would be to implement this using a for-loop, but I'm interested in whether there is a more efficient and succinct way.
My first idea was to concatenate the rows and the reversed rows, and then extract duplicates from the concatenated matrix. But this list of duplicates would contain both the initial row and its reverse, so to remove the second instance of a palindrome I'd still have to do some for-looping.
My second idea was to somehow use broadcasting to get the cartesian product of the rows and apply my own ufunc (perhaps created using numba) to get a 2D bool matrix. But I don't know how to create a ufunc that operates on a matrix axis instead of a scalar.
EDIT:
I guess I should apologize for the poorly formulated question (English is not my native language). I don't need to find out whether any row is itself a palindrome, but whether there are pairs of rows within the matrix that are palindromes of each other.
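For the pairs-of-rows version described in this edit, one possible sketch (not from the original post, just an assumption about a workable approach) is to hash every row's bytes and flag rows whose reversed byte pattern also occurs in the matrix:

```python
import numpy as np

a = np.array([
    [1, 2, 3],   # mirror of row 1
    [3, 2, 1],   # mirror of row 0
    [4, 5, 6],   # no mirror partner
])

# Hash each row, then flag rows whose reversed pattern occurs somewhere.
# Note: a row that is itself a palindrome matches its own reverse.
row_set = {row.tobytes() for row in a}
mirrored = np.array([row[::-1].tobytes() in row_set for row in a])
print(mirrored)  # -> [ True  True False]
```

This is O(rows) set lookups instead of an O(rows^2) pairwise comparison.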
I simply check whether the array equals its reflection (around axis 1) in all elements; if so, the row is a palindrome (correct me if I am wrong). Then I index out the rows that aren't palindromes.
import numpy as np
a = np.array([
[1,0,0,1], # Palindrome
[0,2,2,0], # Palindrome
[1,2,3,4],
[0,1,4,0],
])
wherepalindrome = (a == a[:,::-1]).all(1)
print(a[~wherepalindrome])
#[[1 2 3 4]
# [0 1 4 0]]
Naphat's answer is the pythonic (numpythonic) way to go. That should be the accepted answer.
But if your array is really large, you don't want to create a temporary copy, and you wish to explore Numba's intricacies, you can use something like this:
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def palindromic_rows(a):
    rows, cols = a.shape
    palindromes = np.full(rows, True, dtype=nb.boolean)
    mid = cols // 2
    for r in nb.prange(rows):  # <-- parallel loop
        for c in range(mid):
            if a[r, c] != a[r, -c-1]:
                palindromes[r] = False
                break
    return palindromes
This contraption just replaces the elegant (a == a[:,::-1]).all(axis=1), but it's almost an order of magnitude faster for very large arrays and it doesn't duplicate them.
I have an NxM matrix.
N is big: N >> 10000.
I wonder if there is an algorithm to shuffle all the rows of the matrix to obtain, for example, 100 matrices. My resulting matrices C must not be identical.
Thoughts?
So, do you want to keep the shape of the matrix and just shuffle the rows or do you want to get subsets of the matrix?
For the first case I think the permutation algorithm from numpy could be your choice. Just create a permutation of an index list, as Souin proposes.
For the second case, just use the numpy choice function (also from the random module) without replacement, if I understood your needs correctly.
Suppose I have an array a = array([a0, a1,..., an]). For every element in the array, I need to find all cumulative sums in the direction from left to right. Thus, for n=2 I have to find
a0, a0+a1, a0+a1+a2, a1, a1+a2, a2.
The obvious approach is the following:
list_of_sums = [np.cumsum(a[i:]) for i in range(n+1)]
I wonder if there exists a more efficient way to do this. The issue is that n may be quite large, and in addition I want to broadcast this over many different arrays a of the same length. Unfortunately, I did not succeed in creating a vectorized function for this.
Another problem is how to store all these sums. Instead of using a list, we can represent all the data as a 2D array A, where the first column A[:,0] holds all the sums with a0 as the first term, the second column A[:,1] holds all the sums with a1 as the first term, and so on. This gives us an (n+1)x(n+1) matrix, half of whose elements are zeros (the lower-right triangle). On the other hand, this requires twice as much memory as the list, or forces us to use sparse matrices, which may be overkill.
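All these sums can be derived from a single prefix-sum array, since a_i + ... + a_j = S[j+1] - S[i]; broadcasting then fills the whole upper-triangular table in one vectorized step. A sketch of that idea (the row/column layout is transposed relative to the column convention described above):

```python
import numpy as np

a = np.array([1, 2, 3])

S = np.concatenate(([0], np.cumsum(a)))   # S[k] = a[0] + ... + a[k-1]

# sums[i, j] = S[j+1] - S[i] = a[i] + ... + a[j], valid for j >= i;
# np.triu zeroes out the meaningless lower triangle
sums = np.triu(S[1:][None, :] - S[:-1][:, None])
print(sums)
# -> [[1 3 6]
#     [0 2 5]
#     [0 0 3]]
```

Memory is still O(n^2), matching the 2D-array storage discussed in the question, but the per-element work drops to a single subtraction.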
I have two scipy sparse csr matrices with the exact same shape but potentially different data values and nnz value. I now want to get the top 10 elements of one matrix and increase the value on the same indices on the other matrix. My current approach is as follows:
idx = a.data.argpartition(-10)[-10:]
i, j = a.nonzero()
i_idx = i[idx]
j_idx = j[idx]
b[i_idx, j_idx] += 1
The reason I have to go this way is that a.data and b.data do not necessarily have the same number of elements and hence the indices would differ.
My question now is whether I can improve this in some way. As far as I know, the nonzero procedure is not elegant, as I have to allocate two new arrays, and I am already very tight on memory. I can get the j_indices via csr_matrix.indices, but what about the i_indices? Can I use the indptr in a nice way for that?
Happy for any hints.
I'm not sure what the "top 10 elements" means. I assume that if you have matrices A and B you want to set B[i, j] += 1 if A[i, j] is within the first 10 nonzero entries of A (in CSR format).
I also assume that B[i, j] could be zero, which is the worst case performance wise, since you need to modify the sparsity structure of your matrix.
CSR is not an ideal format to use for changing sparsity structure. This is because every new insertion/deletion is O(nnzs) complexity (assuming the CSR storage is backed by an array - and it usually is).
You could use the DOK format for your second matrix (the one you are modifying), which provides O(1) access to elements.
I haven't benchmarked this but I suppose your option is 10 * O(nnzs) at worst, when you are adding 10 new nonzero values, whereas the DOK version should need O(nnzs) to build the matrix, then O(1) for each insertion and, finally, O(nnzs) to convert it back to CSR (assuming this is needed).
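A sketch combining both threads of this discussion: the row indices can be recovered from indptr with searchsorted (avoiding the two arrays that nonzero() allocates), and the insertions can go through a format that tolerates sparsity-structure changes (LIL here rather than DOK, the same cheap-insertion idea). Toy data and k=2 stand in for the real matrices and top 10:

```python
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[0, 5, 0], [3, 0, 9], [0, 1, 0]]))
b = csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 0, 2]]))
k = 2  # top-k; the question uses k = 10

idx = np.argpartition(a.data, -k)[-k:]    # positions of the k largest stored values
j_idx = a.indices[idx]                    # column indices live directly in CSR
# row of data position p is the row whose indptr range contains p
i_idx = np.searchsorted(a.indptr, idx, side='right') - 1

b = b.tolil()                             # cheap single-element updates
for i, j in zip(i_idx, j_idx):
    b[i, j] += 1
b = b.tocsr()
```

The searchsorted call costs O(k log rows) and no O(nnz) temporaries, which addresses the memory concern in the question.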