How to index numpy array with an array - python

Given an array defined below as:
a = np.arange(30).reshape((3, 10))
col_index = [[1,2,3,5], [3,4,5,7]]
row_index = [2,1]
Is it possible to index a[row_index, col_index], so I can do something like
a[row_index, col_index] =1, so then a becomes
[[0,1,2,3,4,5,6,7,8,9], [10,11,12,1,1,1,16,1,18,19], [20,1,1,1,24,1,26,27,28,29]]
To clarify: in row 2, columns 1, 2, 3 and 5 are set to 1, and in row 1, columns 3, 4, 5 and 7 are also set to 1.

Or (if you don't like typing)
a[np.c_[row_index], col_index] = 1
or even shorter but Python 2 only
a[zip(row_index), col_index] = 1
What all these solutions do is to make row and col indices broadcastable to each other. np.c_ is the column concatenation convenience object. It makes columns out of 1D objects.
zip used to do essentially the same. Only, since Python 3 it returns an iterator instead of a list and numpy can't handle those. (One could do list(zip(row_index)) but that's not short.)
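As a minimal sketch of the broadcasting idea described above, the row indices can also be turned into a column explicitly, which behaves the same on Python 2 and 3. The names rows and cols are illustrative and not part of the original answer.

```python
import numpy as np

a = np.arange(30).reshape((3, 10))
col_index = [[1, 2, 3, 5], [3, 4, 5, 7]]
row_index = [2, 1]

# Shape (2, 1): one row number per list of columns.
rows = np.asarray(row_index)[:, None]
# Shape (2, 4): the columns to set in each of those rows.
cols = np.asarray(col_index)

# (2, 1) broadcasts against (2, 4), pairing each row with its own columns.
a[rows, cols] = 1
print(a)
```

Here rows[:, None] plays the same role as np.c_[row_index]: both produce a column that numpy can broadcast against the 2-D column index.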

Related

how to extract overlapping sub-arrays with a window size and flatten them

I am trying to get better at using numpy functions and methods to run my programs in python faster
I want to do the following:
I create an array 'a' as:
a=np.random.randint(-10,11,10000).reshape(-1,10)
a.shape: (1000,10)
I create another array which takes only the first two columns in array a
b=a[:,0:2]
b.shape: (1000,2)
now I want to create an array c which has 990 rows containing flattened slices of 10 rows of array 'b'.
So the first row of array 'c' will have 20 columns which is a flattened slice of 0 to 10 rows
of array 'b'. The next row of array 'c' will have 20 columns of flattened rows 1 to 11 of array
'b' etc.
I can do this with a for loop, but I want to know if there is a much faster way to do this using numpy functions and methods like strides or something else.
Thanks for your time and your help.
This loops over shifts rather than rows (loop of size 10):
N = 10
c = np.hstack([b[i:i-N] for i in range(N)])
Explanation: b[i:i-N] is b's rows from i to m-(N-i) (excluding row m-(N-i) itself), where m is the number of rows in b. np.hstack then stacks those sub-arrays horizontally (b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), as the question describes.
c.shape: (990, 20)
Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
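To see what the shifted slices do, here is a small sketch on a 6-row array with a window of 3 instead of 10. The variant c_all, which uses an explicit stop index to keep the final window, is an illustration and not part of the original answer.

```python
import numpy as np

b = np.arange(12).reshape(6, 2)  # m = 6 rows, small enough to inspect
m, N = b.shape[0], 3

# Same expression as above: each shift i contributes one 2-column block,
# and the negative stop index drops the last window.
c = np.hstack([b[i:i - N] for i in range(N)])

# Variant with an explicit stop index, keeping all m - N + 1 windows.
c_all = np.hstack([b[i:m - N + 1 + i] for i in range(N)])

print(c.shape, c_all.shape)
```

Each row of c is one flattened window of N consecutive rows of b; c_all has one extra row for the window that ends at the last row of b.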
You can also use strides, but if you want to do operations on the result I would advise against it, because overlapping windows in a strided view share the same memory, which is easy to get wrong. Here is a strides solution if you insist:
from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)
c.shape: (991, 20)
If you don't want the last row, simply remove it by calling c[:-1].
A similar solution applies with numpy's np.lib.stride_tricks.as_strided function (the two operate similarly; I am not sure about their internals).
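For reference, a hedged sketch of the same windowing with numpy.lib.stride_tricks.as_strided, assuming the array b and window size from the question. The shape and strides arguments are my own reconstruction, not taken from the original answer, and the view is created read-only to avoid accidental writes into shared memory.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]
N = 10

n_windows = b.shape[0] - N + 1  # 991 windows of N rows each
row_stride, col_stride = b.strides

# Window k starts at row k, so stepping one window or one row inside a
# window both advance by one row of b.
windows = as_strided(
    b,
    shape=(n_windows, N, b.shape[1]),
    strides=(row_stride, row_stride, col_stride),
    writeable=False,
)

# Flattening each window copies the data into a regular (991, 20) array.
c = windows.reshape(n_windows, -1)

# Cross-check against the hstack solution, which omits the last window.
c_hstack = np.hstack([b[i:i - N] for i in range(N)])
print(c.shape, np.array_equal(c[:-1], c_hstack))
```

Recent NumPy versions also provide numpy.lib.stride_tricks.sliding_window_view, which computes the shape and strides for you and is generally the safer choice.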
UPDATE: if you want to find unique values and their frequencies in each row of c you can do:
unique_values = []
unique_counts = []
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)
Note that numpy arrays have to be rectangular, meaning every row must have the same number of elements. Since different rows of c can have different numbers of unique values, you cannot store the unique values of each row in a regular numpy array (an alternative would be to make a structured numpy array). Therefore, a solution is to build a list of arrays, one per row of c. unique_values is a list of arrays holding the unique values of each row, and unique_counts holds their frequencies in the same order.

Cleanest way to replace np.array value with np.nan by user defined index

A question about masking 2-D np.array data.
For example:
A 2-D np.array value with shape 20 x 20.
An index t = [(1,2),(3,4),(5,7),(12,13)]
How can I mask the 2-D array value at the (y,x) positions listed in the index?
Usually, replacing with np.nan is based on a specific value, like y[y==7] = np.nan
In my example, I want to replace the values at specific locations with np.nan.
For now, I can do it by:
Creating a new array value_mask with shape 20 x 20
Looping over value and testing the location with (i,j) == t[k]
If True, value_mask[i,j] = value[i,j]; otherwise, value_mask[i,j] = np.nan
My method is too bulky, especially for huge data (3 levels of loops).
Is there a more efficient method to achieve that? Any advice would be appreciated.
You are nearly there.
You can index an array with arrays of indices. You probably know this from 1D arrays.
With a 2D array you need to index with a tuple of sequences: one sequence per axis, all of equal length, with one element for each array element you want to choose. You have a list of (row, column) tuples, so you just have to "transpose" it.
t1 = tuple(zip(*t))
gives you the right shape for your index (in Python 3 the tuple() call is needed because zip returns an iterator). You can now use it as the index for any assignment, for example: value[t1] = np.nan (value must have a float dtype to hold NaN).
(There are lots of nice explanation of this trick (with zip and *) in python tutorials, if you don't know it yet.)
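A minimal sketch of the transposition trick, assuming value is a 20 x 20 float array filled with placeholder data (the contents are made up for illustration):

```python
import numpy as np

# Placeholder data; a float dtype is required so the array can hold NaN.
value = np.arange(400, dtype=float).reshape(20, 20)
t = [(1, 2), (3, 4), (5, 7), (12, 13)]

# zip(*t) regroups the (y, x) pairs into one sequence of rows
# and one sequence of columns.
rows, cols = zip(*t)

# Index both axes at once: sets value[1, 2], value[3, 4], ... to NaN.
value[rows, cols] = np.nan

print(np.argwhere(np.isnan(value)))
```

No Python-level loop over the array is needed; the cost grows with the number of index pairs rather than the array size.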
You can use np.logical_and
arr = np.zeros((20,20))
You can select by location, this is just an example location.
arr[4:8,4:8] = 1
You can create a mask the same shape as arr
mask = np.ones((20,20)).astype(bool)
Then you can use the np.logical_and.
mask = np.logical_and(mask, arr == 1)
And finally, you can replace the 1s with the np.nan
arr[mask] = np.nan

Isolating a column out of a numpy array using a variable?

I am trying to isolate the last column of a numpy array. However, the function needs to work for arrays of different sizes. When I put it like this:
array[:,array_length]
#array_length is a variable set to the length of one row of the array
which seems like it would work, it returns an error telling me that I can't slice with a variable, but only with an integer.
Is there a way to do this with numpy that I'm not seeing?
To access the last column of a numpy array, you can use the index -1:
last_col = array[:, -1]
Or you can also do
array_length = len(array[0]) - 1
last_col = array[:, array_length]
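A short sketch of both options on a made-up 3 x 4 array (arr and the variable names are illustrative). Using arr.shape[1] avoids relying on len() of the first row, and the index must be an integer no larger than the column count minus one:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Negative indices count from the end, so -1 is always the last column.
last_col = arr[:, -1]

# Equivalent with an explicit integer index computed from the shape.
last_index = arr.shape[1] - 1
same_col = arr[:, last_index]

# A slice instead of an integer keeps the result two-dimensional.
last_col_2d = arr[:, -1:]

print(last_col, last_col_2d.shape)
```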

Counting non-zero elements within each row and within each column of a 2D NumPy array

I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()
IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) is a boolean array of the same shape as the original a, and it contains True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
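As a hedged alternative, NumPy's np.count_nonzero accepts an axis argument and gives the same counts without building the boolean array by hand. This is a sketch on the same example array, not part of the original answer:

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

# axis=0 collapses the rows, leaving one count per column.
columns = np.count_nonzero(a, axis=0)
# axis=1 collapses the columns, leaving one count per row.
rows = np.count_nonzero(a, axis=1)

print(columns, rows)
```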
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean of the non-zero values in each column/row, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
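One possible way to handle a row or column with no non-zero elements is to report NaN for it instead of dividing by zero. The helper nonzero_mean below is hypothetical, and the choice of NaN is an assumption about what the asker would want:

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7],
              [0, 0, 0]], dtype=float)  # the last row has no non-zero values


def nonzero_mean(arr, axis):
    """Mean of the non-zero entries along axis; NaN where there are none."""
    sums = arr.sum(axis)
    counts = (arr != 0).sum(axis)
    out = np.full(sums.shape, np.nan)
    # Divide only where the count is non-zero; other slots keep their NaN.
    np.divide(sums, counts, out=out, where=counts != 0)
    return out


col_mean = nonzero_mean(a, 0)
row_mean = nonzero_mean(a, 1)
print(col_mean, row_mean)
```

If zero is a more useful default than NaN, pre-filling out with np.zeros instead of np.full(..., np.nan) would give that behavior.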
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])  # minlength keeps trailing all-zero columns
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
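A small sketch of the CSR approach, checked against the dense counts. The matrix is made up for illustration. Note that these expressions count stored entries, so they assume the matrix holds no explicitly stored zeros (calling m.eliminate_zeros() first removes any):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])
m = sparse.csr_matrix(dense)

# indptr[i]:indptr[i + 1] delimits row i's entries, so the differences
# are the per-row counts.
row_nonzeros = np.diff(m.indptr)

# indices holds the column of every stored entry; minlength pads the
# result so trailing empty columns still get a count of 0.
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

print(row_nonzeros, col_nonzeros)
```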
A faster way is to copy your matrix with ones in place of the real values, then sum over rows or columns:
X_clone = X.tocsc(copy=True)  # copy=True so X is not modified if it is already CSC
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit:
Perhaps you will need to translate NumNonZeroElementsByColumn into 1-dimensional array by
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i,j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.

Slicing arrays in Numpy / Scipy

I have an array like:
a = array([[1,2,3],[3,4,5],[4,5,6]])
What's the most efficient way to slice a 3x2 array out of this that has only the last two columns of "a"?
i.e.
array([[2,3],[4,5],[5,6]]) in this case.
Two-dimensional numpy arrays are indexed using a[i,j] (not a[i][j]), and you can use the same slicing notation on each axis as you would with ordinary Python lists (just put the slices for all axes in a single []):
>>> from numpy import array
>>> a = array([[1,2,3],[3,4,5],[4,5,6]])
>>> a[:,1:]
array([[2, 3],
       [4, 5],
       [5, 6]])
Is this what you're looking for?
a[:,1:]
To quote documentation, the basic slice syntax is i:j:k where i is the starting index, j is the stopping index, and k is the step (when k > 0).
Now if i is not given, it defaults to 0 if k > 0. Otherwise i defaults to n - 1 for k < 0 (where n is the length of the array).
If j is not given, it defaults to n (the length of the array) for k > 0.
That's for a one dimensional array.
Now a two dimensional array is a different beast. The slicing syntax for that is a[rowrange, columnrange].
So if you want all the rows, but just the last two columns, like in your case, you do:
a[0:3, 1:3]
Here, "0:3" means the rows from 0 up to, but not including, 3, and "1:3" means the columns from 1 up to, but not including, 3.
Now you may be wondering: with only 3 columns, shouldn't "1:3" refer to column 1, column 2 and column 3?
That is the tricky part of this syntax. Numbering starts at 0, so the first column is actually column 0, and the stop index is excluded. So when you say "1:3", you are actually saying give me column 1 and column 2, which are the last two columns you want. (There actually is no column 3.)
Now if you don't know how long your matrix is or if you want all the rows, you can just leave that part empty.
i.e.
a[:, 1:3]
Same goes for columns also. i.e if you wanted say, all the columns but just the first row, you would write
a[0:1,:]
Now, how the above answer a[:,1:] works is because when you say "[1:]" for columns, it means give me everything except for column 0, and till the end of all the columns. i.e empty means 'till the end'.
By now you must realize that anything on either side of the comma is all a subset of the one dimensional case I first mentioned above. i.e if you want to specify your rows using step sizes you can write
a[::2,1]
Which in your case would return
array([2, 5])
i.e. a[::2,1] reads as: give me every other row, starting with the topmost, and give me only the 2nd column. Because the column is a single integer rather than a slice, the result is one-dimensional.
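The slices discussed above can be checked side by side with a short sketch on the same array (the variable names are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

last_two = a[:, 1:]       # all rows, columns 1 to the end
explicit = a[0:3, 1:3]    # same selection with explicit bounds
first_row = a[0:1, :]     # first row, kept two-dimensional
second_col = a[::2, 1]    # every other row, column 1 only (1-D result)
stepped = a[::2, 1:]      # every other row, last two columns

print(last_two)
print(second_col)
print(stepped)
```

An integer index removes that axis from the result, while a slice such as 1: or 0:1 keeps it, which is why second_col is one-dimensional and first_row is not.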
This took me some time to figure out. So pasting it here, just in case it helps someone.
