I'd like to sum one particular row of a large NumPy array. I know the function array.max() will give the maximum across the whole array, and array.max(1) will give me the maximum across each of the rows as an array. However, I'd like to get the maximum in a certain row (for example, row 7, or row 29). I have a large array, so getting the maximum for all rows will give me a significant time penalty.
You can easily access a row of a two-dimensional array using the indexing operator. The row itself is an array, a view of a part of the original array, and exposes all array methods, including sum() and max(). Therefore you can easily get the maximum per row like this:
x = arr[7].max() # Maximum in row 7
y = arr[29].sum() # Sum of the values in row 29
Just for completeness, you can do the same for columns:
z = arr[:, 5].sum() # Sum up all values in column 5.
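As a small illustration of the answer above, here is a minimal sketch on a made-up 3x4 array (the array contents and variable names are assumptions for the example). It shows that indexing a single row yields a view, so only that row is reduced:

```python
import numpy as np

# Hypothetical small array standing in for the large one in the question.
arr = np.arange(12).reshape(3, 4)    # rows: [0..3], [4..7], [8..11]

row = arr[1]                         # basic indexing returns a view, not a copy
row_max = row.max()                  # maximum of row 1 only
row_sum = row.sum()                  # sum of row 1 only
col_sum = arr[:, 2].sum()            # sum of column 2

# True when `row` shares its memory with `arr`, i.e. it is a view.
shares = np.shares_memory(row, arr)
```

Because no other rows are touched, the cost depends on the row length rather than on the size of the whole array.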
Related
I am trying to get better at using numpy functions and methods to run my programs in python faster
I want to do the following:
I create an array 'a' as:
a=np.random.randint(-10,11,10000).reshape(-1,10)
a.shape: (1000,10)
I create another array which takes only the first two columns in array a
b=a[:,0:2]
b.shape: (1000,2)
Now I want to create an array c which has 990 rows, each containing a flattened slice of 10 rows of array 'b'.
So the first row of array 'c' will have 20 columns: the flattened rows 0 to 10 of array 'b'. The next row of array 'c' will have 20 columns holding the flattened rows 1 to 11 of array 'b', and so on.
I can do this with a for loop, but I want to know if there is a much faster way to do this using NumPy functions and methods, like strides or something else.
Thanks for your time and your help.
This loops over shifts rather than rows (loop of size 10):
N = 10
c = np.hstack([b[i:i-N] for i in range(N)])
Explanation: b[i:i-N] selects b's rows from i up to (but not including) m-(N-i), where m is the number of rows in b. np.hstack then stacks those sub-arrays horizontally (b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), as the question describes.
c.shape: (990, 20)
Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
You can also use strides, but if you want to do operations on the result, I would advise against it, since the memory layout of strided views is tricky. Here is a strides solution if you insist:
from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)
c.shape: (991, 20)
If you don't want the last row, simply remove it by calling c[:-1].
A similar solution applies with NumPy's np.lib.stride_tricks.as_strided function (the two operate similarly; I am not sure of their internals).
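For readers without scikit-image, the same windows can be sketched with NumPy alone. This is not part of the original answer: it assumes NumPy 1.20 or newer, where np.lib.stride_tricks.sliding_window_view is available, and it uses a seeded random generator purely so the example is reproducible.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)                # seeded only for reproducibility
a = rng.integers(-10, 11, size=(1000, 10))
b = a[:, 0:2]                                 # first two columns, shape (1000, 2)
N = 10

# Read-only view of every 10-row window; shape (991, 1, 10, 2).
windows = sliding_window_view(b, (N, 2))

# Flatten each window into one row of 20 values (this step makes a copy).
c_all = windows.reshape(-1, 2 * N)            # shape (991, 20)

# The hstack approach from the answer, for comparison; shape (990, 20).
c_hstack = np.hstack([b[i:i - N] for i in range(N)])
```

Dropping the last row of c_all should reproduce the hstack result, which is a quick way to check that both approaches agree.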
UPDATE: if you want to find unique values and their frequencies in each row of c you can do:
unique_values = []
unique_counts = []
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)
Note that NumPy arrays have to be rectangular, meaning every row must have the same number of elements. Since different rows in c can have different numbers of unique values, you cannot create a regular NumPy array holding the unique values of each row (an alternative would be a structured NumPy array). Therefore, a solution is to build a list of arrays, one per row of c. unique_values is a list of arrays of unique values, and unique_counts holds their frequencies in the same order.
I understand that
np.argmax(np.max(x, axis=1))
returns the index of the row that contains the maximum value and
np.argmax(np.max(x, axis=0))
returns the index of the column that contains the maximum value.
But what if the matrix contained strings? How can I change the code so that it still finds the index of the largest value?
Also (if there's no way to do what I previously asked for), can I change the code so that the operation is only carried out on a sub-section of the matrix, for instance, on the bottom right '2x2' sub-matrix in this example:
array = [['D','F','J'],
         ['K',3,4],
         ['B',3,1]]
[[3,4],
[3,1]]
Can you try first converting the column to a string dtype? If you take the min/max of a string column, it should compare the values as strings.
Although not efficient, this could be one way to find index of the maximum number in the original matrix by using slices:
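# Bottom-right 2x2 block: rows 1-2, columns 1-2 of the original matrix
sub_rows = [array[i][1:] for i in range(1, 3)]
newmax = 0
newmaxrow = 0
newmaxcolumn = 0
for r, row in enumerate(sub_rows):
    for col, num in enumerate(row):
        if num > newmax:
            newmax = num
            # add 1 to map back to indices in the original matrix
            newmaxrow = r + 1
            newmaxcolumn = col + 1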
Note: this method would not work if the largest number lies within row 0 or column 0.
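The same search over the numeric sub-matrix can also be sketched with NumPy. This is an alternative to the loop, not the answerer's method; it assumes the bottom-right block contains only numbers, and the names sub, row_idx and col_idx are made up for the example.

```python
import numpy as np

array = [['D', 'F', 'J'],
         ['K', 3, 4],
         ['B', 3, 1]]

# Keep only the numeric bottom-right block (rows 1-2, columns 1-2).
sub = np.array([row[1:] for row in array[1:]], dtype=float)

# argmax works on the flattened block; unravel_index turns it into (row, col).
r, c = np.unravel_index(sub.argmax(), sub.shape)

# Shift by 1 to get the position in the original 3x3 matrix.
row_idx, col_idx = int(r) + 1, int(c) + 1
```

Here the largest value, 4, sits at row 1, column 2 of the original matrix.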
I have a NumPy array 'A' of size 5000x10. I also have another number 'Num'. I want to apply the following to each row of A:
import numpy as np
np.max(np.where(Num > A[0,:]))
Is there a more Pythonic way than writing a for loop for the above?
You could use argmax -
A.shape[1] - 1 - (Num > A)[:,::-1].argmax(1)
Alternatively with cumsum and argmax -
(Num > A).cumsum(1).argmax(1)
Explanation: With np.max(np.where(...)), we are basically looking for the last occurrence of a match along each row of the comparison.
For the same, we can use argmax. But argmax on a boolean array gives us the first occurrence and not the last one. So, one trick is to perform the comparison, flip the columns with [:,::-1], and then use argmax. The resulting column indices are then subtracted from the number of columns minus one to trace them back to the original order.
On the second approach, it's very similar to a related post and therefore quoting from it :
One of the uses of argmax is to get the ID of the first occurrence of the max element along an axis in an array. So, we get the cumsum along the rows and get the first max ID, which represents the last non-zero element. This is because cumsum on the leftover elements won't increase the sum value after that last non-zero element.
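To make the two approaches concrete, here is a small sketch on a made-up 3x4 array (the values of A and Num are assumptions chosen so that every row has at least one match). Rows with no match behave differently: the original np.max(np.where(...)) raises an error on an empty result, while the flipped-argmax version returns the last column index and the cumsum version returns 0, so such rows would need separate handling.

```python
import numpy as np

A = np.array([[1, 5, 2, 9],
              [7, 3, 8, 4],
              [0, 6, 6, 1]])
Num = 5

mask = Num > A                                   # True where the element is below Num

# Reference: the per-row loop from the question.
loop_result = np.array([np.max(np.where(row)) for row in mask])

# Approach 1: flip the columns, take the first True, map back to original order.
flipped = A.shape[1] - 1 - mask[:, ::-1].argmax(1)

# Approach 2: the cumulative sum first reaches its maximum at the last True.
cumsum_based = mask.cumsum(1).argmax(1)
```

All three give the index of the last column in each row whose value is smaller than Num.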
I am having issues figuring out how to do this operation.
I have a variable index, a 1xN sparse binary array, and a 2-D array samples (NxM). I want to use index to select specific rows of samples and get a 2-D array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape) # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
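A minimal sketch of the short answer on toy data (the 1x4 index vector and the 4x3 samples array are made up for illustration, and SciPy is assumed to be installed):

```python
import numpy as np
from scipy import sparse

# Toy stand-ins: a 1x4 sparse binary row vector and a 4x3 dense array.
index = sparse.csr_matrix(np.array([[0, 1, 0, 1]]))
samples = np.arange(12).reshape(4, 3)

# nonzero() returns (row indices, column indices); the column indices
# of the 1xN vector are exactly the rows of `samples` to keep.
rows = index.nonzero()[1]

selected_samples = samples[rows]     # shape (2, 3): rows 1 and 3 of samples
```

No dense conversion of index is needed, which matters when the vector is long and mostly zeros.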
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()
IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
[2, 3, 4],
[0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) produces an array of the same shape as the original a, containing True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestID', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
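One possible way to handle a column (or row) with no non-zero elements is to leave its mean as NaN instead of dividing by zero. The sketch below is an assumption about the desired behaviour, not part of the original answer; the example array has an all-zero middle column on purpose.

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 0, 4],
              [0, 0, 7]])

counts = np.count_nonzero(a, axis=0)      # non-zero entries per column
sums = a.sum(axis=0)                      # column sums

# Divide only where counts > 0; other positions keep the NaN from `out`.
col_mean = np.divide(sums, counts,
                     out=np.full(counts.shape, np.nan),
                     where=counts > 0)
```

The same pattern with axis=1 gives per-row means. Replacing np.nan in the out array with 0.0 would be another reasonable convention.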
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])  # minlength keeps trailing empty columns
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
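A small sketch comparing the indptr-based counts with getnnz on a made-up 3x3 CSR matrix (SciPy is assumed to be installed; the matrix has an empty last row and an empty first column to exercise the edge cases):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[0, 1, 1],
                                [0, 1, 0],
                                [0, 0, 0]]))

# indptr marks where each row starts in the data array,
# so consecutive differences are the per-row counts of stored entries.
row_nonzeros = np.diff(m.indptr)

# indices holds the column of every stored entry; minlength makes sure
# columns with no entries still get a zero count.
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

# Cross-check against the built-in per-axis counts.
rows_check = m.getnnz(axis=1)
cols_check = m.getnnz(axis=0)
```

Both methods count stored entries, so they agree as long as the matrix holds no explicitly stored zeros.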
A faster way is to clone your matrix with ones in place of the real values, then sum by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit:
You may need to convert NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i,j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
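One possibly more compact option, sketched here on a toy lil_matrix (the matrix contents are made up, and SciPy is assumed to be installed), is to count the column indices returned by nonzero() with np.bincount, or to convert to CSR and use getnnz:

```python
import numpy as np
from scipy import sparse

# Toy 3x4 LIL matrix with three stored values.
X = sparse.lil_matrix((3, 4))
X[0, 1] = 5
X[2, 1] = 7
X[1, 3] = 2

# Column indices of all non-zero entries, counted per column.
_, j = X.nonzero()
column_counts = np.bincount(j, minlength=X.shape[1])

# Same counts via a CSR conversion.
column_counts_csr = X.tocsr().getnnz(axis=0)
```

Both replace the explicit Python loop with a single vectorized call.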