find nonzero indices as an array - python

I know numpy.where gives a tuple of the array coordinates where the condition applies. But what if I want an array?
Assume the following 2D array:
a=np.array([[1, 1, 1, 1, 0],
[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 1, 1, 1],
[1, 0, 0, 1, 0]])
Now what I want is only the first occurrence of a zero, but for every row, even if it doesn't exist. Something like indexOf() in Java. So the output should look like:
array([-1,2,2,1,0])
I need to cut pieces of an ndarray and it would be much easier to reduce a dimension rather than having a tuple and try to regenerate the missing rows.

Is this what you are looking for?
import numpy as np
a=np.array([[1, 1, 1, 1, 0],
[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 1, 1, 1],
[1, 0, 0, 1, 0]])
np.argmax(a==0, axis=0) - ~np.any(a==0, axis=0)
Output:
array([-1, 2, 2, 1, 0], dtype=int64)
The idea here is that np.argmax finds the index of the first matching element in each column (axis=0 for columns, which appears to be what you want in the output, but if you actually want rows, use axis=1). Because np.argmax returns 0 for columns that do not match at all, I subtract 1 from the result for each column that doesn't contain any 0.
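A minimal sketch of the same idea wrapped in a helper. The name first_zero_index is hypothetical (not from the answer above), and np.where is swapped in for the subtraction so that the -1 fallback is explicit:

```python
import numpy as np

def first_zero_index(arr, axis=0):
    """Index of the first zero along `axis`, or -1 where no zero exists."""
    is_zero = (arr == 0)
    first = is_zero.argmax(axis=axis)      # 0 when there is no match at all
    has_zero = is_zero.any(axis=axis)      # distinguishes "no match" from "match at 0"
    return np.where(has_zero, first, -1)

a = np.array([[1, 1, 1, 1, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 1, 1, 1],
              [1, 0, 0, 1, 0]])

print(first_zero_index(a, axis=0))  # per column: [-1  2  2  1  0]
print(first_zero_index(a, axis=1))  # per row:    [4 3 1 1 1]
```

With axis=1 every row of this sample contains a zero, so no -1 appears in the per-row result.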

Here is a less crafty solution, but one that is arguably easier to understand.
It first finds all matches in each column and then builds an array from the first match of each column, or -1 if there are no matches.
a=np.array([[1,1,1,1,0],
[1,1,1,0,0],
[1,0,0,0,0],
[1,0,1,1,1],
[1,0,0,1,0]])
matches = [np.where(np.array(i)==0)[0] for i in a.T]
np.array([i[0] if len(i) else -1 for i in matches]) # first occurrence, else -1
array([-1, 2, 2, 1, 0])

Related

Python How to replace values in specific columns (defined by an array) with zero

I'm trying to replace values in specific columns with zero with python, and the column numbers are specified in another array.
Given the following 2 numpy arrays
a = np.array([[ 1, 2, 3, 4],
[ 1, 2, 1, 2],
[ 0, 3, 2, 2]])
and
b = np.array([1,3])
b indicates column numbers in array "a" where values need to be replaced with zero.
So the expected output is
([[ 1, 0, 3, 0],
[ 1, 0, 1, 0],
[ 0, 0, 2, 0]])
Any ideas on how I can accomplish this? Thanks.
Your question is:
I'm trying to replace values in specific columns with zero with python, and the column numbers are specified in another array.
This can be done like this:
a[:,b] = 0
Output:
[[1 0 3 0]
[1 0 1 0]
[0 0 2 0]]
The Integer array indexing section of Indexing on ndarrays in the numpy docs has some similar examples.
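As a small illustration of integer array indexing (a sketch only; the names result and selected are made up for this example), the same index array can be used both to read and to write whole columns, and working on a copy leaves a untouched:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [1, 2, 1, 2],
              [0, 3, 2, 2]])
b = np.array([1, 3])

# Reading: pick columns 1 and 3 of every row
selected = a[:, b]
print(selected)
# [[2 4]
#  [2 2]
#  [3 2]]

# Writing: zero those columns on a copy, so the original array is preserved
result = a.copy()
result[:, b] = 0
print(result)
# [[1 0 3 0]
#  [1 0 1 0]
#  [0 0 2 0]]
```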
A simple for loop will accomplish this.
for column in b:
    for row in range(len(a)):
        a[row][column] = 0
print(a)
[[1 0 3 0]
[1 0 1 0]
[0 0 2 0]]

Mapping a list of elements to a range of an element from another list to create unique matrices

I am trying to "map a list of elements to a range of an element from another list to create unique matrices." Let me explain with a drawing.
Kickstart-inspired question
I hope that it makes sense.
This is inspired by the Google Kickstart competition, which means that it is not exactly a question required by the contest.
But I thought of this question and I think that it is worth exploring.
However, I am stuck and unable to make much progress.
Here is the code I have, which obviously is not a correct solution.
values = input("please enter your input: ")
values = values.split()
values = [int(i) for i in values]
>>> please enter your input: 2 4 3 1 0 0 1 0 1 1 0 0 1 1 0 6 4 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0
rows_columns = []
matrix = []
for i in values:
    if i > 1:
        rows_columns[:1].append(i) # The "2" at the very beginning indicates how many matrices should be formed
    elif i <= 1:
        matrix.append(i)
rows_columns[:1]
>>> [4, 3, 6, 4]
matrix_all = []
for i in range(1, len(rows_columns)):
    matrix_sub = []
    for j in range(rows_columns[i]):
        matrix_sub.append(matrix[j])
    if matrix_sub not in matrix_all:
        matrix_all.append(matrix_sub)
>>> [[1, 0, 0, 1], [1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 0, 1]]
I really wonder if the nested loop is a good idea to solve this question. This is the best way I could think of for the last couple of hours. What I want to get as a final result looks like below.
Final expected output
Given one list that says how many rows and columns each matrix should have, and another list with just enough elements to fill those matrices, how can I build the two matrices from the second list based on the dimensions given in the first?
I hope that it is clear; let me know if it is not.
Thanks!
Without using numpy, here is one working solution, based on the input found in your code snippet, and the expected result listed in your final expected result link:
values = [2, 4, 3, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 6, 4, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
0, 1, 0, 1, 0, 1, 1, 1, 0]
v_idx = 1
"""
As per example, the number of matrices desired is found in the first input list element.
In the above values list, we want 2 matrices. The for loop below therefore executes exactly 2 times
"""
for matrix_nr in range(values[0]):
    # The nr of rows and nr of columns are the next two elements in the values list
    nr_rows = values[v_idx]
    nr_cols = values[v_idx + 1]
    # Calculate the start index for the next matrix specifications
    new_idx = v_idx+2+(nr_rows*nr_cols)
    # Slice the values list to extract the values for the current matrix to format
    sub_elements = values[v_idx+2: new_idx]
    matrix = []
    # Append elements to the matrix by slicing values according to nr_rows and nr_cols
    for r in range(nr_rows):
        start_idx = r*nr_cols
        end_idx = (r+1)*nr_cols
        matrix.append(sub_elements[start_idx:end_idx])
    print(matrix)
    v_idx = new_idx
This gives the expected result:
[[1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 0]]
[[1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
That said, numpy could very likely do this a lot more efficiently.
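One possible numpy version, as a sketch under the same input layout (count of matrices first, then rows, columns and the cell values for each matrix). The function name parse_matrices is hypothetical; reshape replaces the inner row-slicing loop:

```python
import numpy as np

values = [2, 4, 3, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 6, 4, 1, 0, 0, 0, 1, 0, 0,
          1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0]

def parse_matrices(values):
    """Split a flat list into matrices; each is preceded by its (rows, cols)."""
    flat = np.asarray(values)
    matrices = []
    pos = 1                                   # index 0 holds the number of matrices
    for _ in range(int(flat[0])):
        n_rows, n_cols = int(flat[pos]), int(flat[pos + 1])
        start = pos + 2
        stop = start + n_rows * n_cols
        # reshape turns the flat run of cells into an n_rows x n_cols matrix
        matrices.append(flat[start:stop].reshape(n_rows, n_cols))
        pos = stop                            # the next matrix header starts here
    return matrices

matrices = parse_matrices(values)
for m in matrices:
    print(m.tolist())
# [[1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 0]]
# [[1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
```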

Python/NumPy: find the first index of zero, then replace all elements with zero after that for each row

I have a numpy array like this:
a = np.array([[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0],
[1, 0, 0, 1, 1],
[1, 0, 1, 0, 1]])
Question 1:
As shown in the title, I want to replace all elements with zero after the first zero appears in each row. The result should look like this:
a = np.array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
Question 2: how to slice different columns for each row like this example?
I am dealing with a large array, so I would appreciate an efficient way to solve this. Thank you very much.
One way to accomplish question 1 is to use numpy.cumprod (this relies on the array containing only 0s and 1s):
>>> np.cumprod(a, axis=1)
array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
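If the array holds values other than 0 and 1, cumprod on the data itself would return running products rather than the original entries. A possible variation, sketched here with made-up sample data, takes the cumulative product of the boolean mask a != 0 and multiplies it back onto the array:

```python
import numpy as np

# Hypothetical sample with arbitrary non-zero values
a = np.array([[3, 0, 5, 2, 7],
              [4, 1, 9, 6, 0],
              [2, 0, 0, 8, 1],
              [6, 7, 8, 9, 5]])

# keep is 1 up to (but not including) the first zero of each row, 0 afterwards
keep = np.cumprod(a != 0, axis=1)
result = a * keep
print(result)
# [[3 0 0 0 0]
#  [4 1 9 6 0]
#  [2 0 0 0 0]
#  [6 7 8 9 5]]
```

Rows without any zero, like the last one, are left unchanged.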
Question 1:
You could iterate over the array like so:
for i in range(a.shape[0]):
    j = 0
    row = a[i]
    # Advance to the first zero, stopping at the end of the row if there is none
    while j < len(row) and row[j] > 0:
        j += 1
    row[j+1:] = 0
This will change the array in-place. If you are interested in very high performance, the answers to this question could be of use to find the first zero faster. np.where scans the entire array for this and therefore is not optimal for the task.
Actually, the fastest solution depends a bit on the distribution of your array entries. If the array holds many floats and zeros are rare, the while loop in the code above will stop late on average, so only a few zeros have to be written. If, however, there are only two possible entries, as in your sample array, and they occur with similar probability (i.e. ~50%), many zeros would have to be written to a, and the following will be faster:
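b = np.zeros(a.shape, dtype=a.dtype)  # same dtype as a, so integers stay integers
for i in range(a.shape[0]):
    j = 0
    a_row = a[i]
    b_row = b[i]
    # Copy entries until the first zero (or the end of the row)
    while j < len(a_row) and a_row[j] > 0:
        b_row[j] = a_row[j]
        j += 1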
Question 2:
If you mean to slice each row individually on a similar criterion based on a first occurrence of some kind, you could simply adapt this iteration pattern. If the criterion is more global (like finding the maximum of the row, for example), built-in methods like np.where exist that will be more efficient, but which choice is best probably depends a bit on the criterion itself.
Question 1: An efficient way to do this would be the following.
import numpy as np
a = np.array([[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0],
[1, 0, 0, 1, 1],
[1, 0, 1, 0, 1]])
for row in a:
    zeros = np.where(row == 0)[0]
    if len(zeros):  # Check if a zero exists
        row[zeros[0]:] = 0
print(a)
Output:
[[1 0 0 0 0]
[1 1 1 1 0]
[1 0 0 0 0]
[1 0 0 0 0]]
Question 2: Using the same array, for each row rowIdx, you can have an array of column indices colIdxs that you want to extract.
rowIdx = 2
colIdxs = [1, 3, 4]
print(a[rowIdx, colIdxs])
Output:
[0 1 1]
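If exactly one column is wanted from each row, paired integer index arrays can pick them all at once. This is a sketch that assumes the original, unmodified array from the question; the names row_idx, col_idx and picked are made up for the example:

```python
import numpy as np

a = np.array([[1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1]])

row_idx = np.arange(a.shape[0])     # 0, 1, 2, 3: one entry per row
col_idx = np.array([0, 3, 2, 4])    # hypothetical column choice for each row

# Element k of the result is a[row_idx[k], col_idx[k]]
picked = a[row_idx, col_idx]
print(picked)  # [1 1 0 1]
```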
I prefer Ayrat's creative answer for the first question, but if you need to slice different columns for different rows in large size, this could help you:
indexer = tuple(np.s_[i:a.shape[1]] for i in (a==0).argmax(axis=1))
for i,j in enumerate(indexer):
    a[i,j]=0
indexer:
(slice(1, 5, None), slice(4, 5, None), slice(1, 5, None), slice(1, 5, None))
or:
indexer = (a==0).argmax(axis=1)
for i in range(a.shape[0]):
    a[i,indexer[i]:]=0
indexer:
[1 4 1 1]
output:
[[1 0 0 0 0]
[1 1 1 1 0]
[1 0 0 0 0]
[1 0 0 0 0]]
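Note that argmax returns 0 for a row that contains no zero at all, so the loops above would clear such a row entirely. A possible loop-free variant that guards against this, sketched with a sample array whose last row is changed to have no zero (an assumption made for illustration):

```python
import numpy as np

a = np.array([[1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 1, 1, 1, 1]])   # last row has no zero

is_zero = (a == 0)
n_cols = a.shape[1]
# First zero per row; rows without a zero get n_cols, i.e. "nothing to clear"
first = np.where(is_zero.any(axis=1), is_zero.argmax(axis=1), n_cols)

# Broadcast column numbers against each row's cut-off to build a boolean mask
mask = np.arange(n_cols) >= first[:, None]
result = np.where(mask, 0, a)
print(result)
# [[1 0 0 0 0]
#  [1 1 1 1 0]
#  [1 0 0 0 0]
#  [1 1 1 1 1]]
```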

How to maintain the sequence of occurrence of numbers from an ndarray when converting to a set using Python?

The Scenario
I'm trying to get the number of clusters a dataframe belongs to.
Its data type is <type 'numpy.ndarray'> and the data is as below:
records_Array = array([0, 0, 0, 0, 2, 2, 1, 1, 1], dtype=int32)
Obviously, when printing, I see it in the format [0 0 0 ..., 1 1 1].
Now, I need each number only once, so I convert it into a set and then into a list:
cluster_set = list(set(records_Array))
The Output
On printing cluster_set, I get [0, 1, 2],
whereas the clusters appear in the sequence 0, 2, 1.
Required
I need some function / method that preserves the sequence of records_Array when building cluster_set.
You want Pandas' pd.unique, as it does not sort the unique values it finds. Numpy's unique function does sort them.
import numpy as np
import pandas as pd
a = np.array([0, 0, 0, 0, 2, 2, 1, 1, 1])
pd.unique(a)
array([0, 2, 1])
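If pandas is not available, a numpy-only alternative is possible. This sketch (not part of the answer above) uses the return_index option of np.unique to recover the position of each value's first appearance and then reorders the sorted unique values by that position:

```python
import numpy as np

records_Array = np.array([0, 0, 0, 0, 2, 2, 1, 1, 1], dtype=np.int32)

# np.unique sorts the values, but return_index reports where each first occurs
values, first_idx = np.unique(records_Array, return_index=True)

# Reorder the unique values by their first position in the original array
cluster_set = values[np.argsort(first_idx)].tolist()
print(cluster_set)  # [0, 2, 1]
```

For plain Python lists, list(dict.fromkeys(records_Array.tolist())) gives the same order, since dicts preserve insertion order in Python 3.7 and later.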

Sum over rows in scipy.sparse.csr_matrix

I have a big csr_matrix and I want to add over rows and obtain a new csr_matrix with the same number of columns but reduced number of rows. (Context: The matrix is a document-term matrix obtained from sklearn CountVectorizer and I want to be able to quickly combine documents according to codes associated with these documents)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices are different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
>>>
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
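To make that concrete, here is a sketch that builds S from an array of group labels, one label per row of A. The names groups and n_groups are assumptions introduced for this example, and the @ operator is used for the sparse matrix product:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Hypothetical labels: rows 0 and 3 go to group 0, rows 1, 2 and 4 to group 1
groups = np.array([0, 1, 1, 0, 1])
n_groups = int(groups.max()) + 1

# S[g, r] = 1 when row r of A belongs to group g
ones = np.ones(len(groups), dtype=A.dtype)
S = csr_matrix((ones, (groups, np.arange(len(groups)))),
               shape=(n_groups, A.shape[0]))

B = S @ A   # sparse product: each output row is the sum of one group's rows
print(B.toarray())
# [[5 0 0 0 0]
#  [0 5 5 0 0]]
```

Adding a third group only requires using the label 2 in groups; the shape of S follows from n_groups.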
The indexing should be:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations convert the result from sparse to a dense matrix, so @Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
