I apologize that my question is going to be wordy; I'm just at a loss as to how to even start coding this. Pseudo-code answers are highly appreciated if only to allow me to understand how to solve this (then I can write some actual code and come back for help if necessary).
My problem isn't so much the code as it is understanding the logic I need (which is arguably the harder part of programming).
An informal explanation of my problem is that I want to change a matrix A (which happens to be sparse) such that the row sums are equal to the column sums. I can do this by adding to A a matrix AS where S is a matrix of scales.
Formally, I want to find a matrix S such that (A + AS)ONESn = T and (t(A) + t(A)S)ONESn = T, where ONESn is a vector of ones and T is the target vector of sums.
The vector T is set in stone as it were, it is the current column sums and is the target for the row sums.
I think the way I want to solve this is for each row i and column j where i = j I want to find the row sum and compute how far it is from the target. Then I want to change each element of that row such that the row sum equals the target (or is at least "close enough" where I can set the "close enough").
However, this is subject to the condition that the sum of column j must equal the target as well.
How can I design the logic so that I can start with say column 1 and row 1, figure out the values in row 1 and then figure out the values of column 1 subject to the first entry of column 1 being "fixed" by the earlier procedure.
Following that, row 2 should have its first value "fixed" by the above, and similarly the programme needs to figure out column 2 with fixed values for the first two entries now.
And so on until you get to the final column and row.
I have tried programming a gradient descent but got stuck on how to make the gradient descent for the columns depend on the gradient descent for the rows iteratively.
I've also worked this out by hand (for a 2x2 matrix). I can figure out the answer, but I'm not sure how I managed to do so, which is why I'm struggling to code it.
Suppose A is a 2x2 matrix of [1, 2, 3, 4]. Row sums are [4, 6]. Column sums are [3, 7].
1 3 | 4
2 4 | 6
___
3 7
If I add the matrix S = [1, 0, -1, 0]
1 -1
0 0
I get A + S = [2, 2, 2, 4], which has row sums [4, 6] and column sums [4, 6].
2 2 | 4
2 4 | 6
___
4 6
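The arithmetic of this 2x2 example can be checked quickly. The following is only a sketch in NumPy (my choice of tool here, not part of the original post), with the matrices written row by row as laid out above:

```python
import numpy as np

# The 2x2 example, written row by row as in the layout above.
A = np.array([[1, 3],
              [2, 4]])
S = np.array([[1, -1],
              [0,  0]])
M = A + S

row_sums = M.sum(axis=1)  # sum across each row
col_sums = M.sum(axis=0)  # sum down each column
print(M)
print(row_sums, col_sums)
```

Both sums come out the same for M, while A on its own has different row and column sums.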
Expected results are a matrix (A + AS) such that the row sums equal the column sums.
Or an error message saying "does not converge"
You have some matrix A and you need to add another matrix S so that the resulting matrix M has same row sums as column sums. This means:
A + S = M # For M row sums = column sums
So what you need to do is find S. You can simply rearrange the equation to
S = M - A
Now you can choose any matrix M whose row sums equal its column sums, and that gives you S. Once you have S you can compute
A + S = M.
This means that for every matrix A you can find a matrix S such that the resulting matrix M has row sums equal to column sums. Hence, you will never get the message "does not converge".
Here is some R code:
A <- matrix(rnorm(4), ncol= 2)
M <- matrix(c(2,2,2,4), ncol= 2)
S <- M - A
rowSums(A+S) == colSums(A+S)
TRUE TRUE
Or, more general:
row_col_num <- 5 # number of columns and rows
A <- matrix(rnorm(row_col_num *row_col_num ), ncol= row_col_num )
M <- matrix(rep(1, row_col_num *row_col_num ), ncol= row_col_num )
S <- M - A
rowSums(A+S) == colSums(A+S)
TRUE TRUE TRUE TRUE TRUE
The resulting matrix A+S is always whatever you set M to be, so I am not sure what this is for. But if you need to find an S such that A+S gives a matrix M with row sums equal to column sums, this is how you can do it.
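For readers working in Python, the same construction might look like the sketch below. The seed, the size n and the choice of M as an all-ones matrix are illustrative assumptions. One deliberate difference from the R snippet: the sums are compared with np.allclose rather than ==, because A + S is computed in floating point and need not reproduce M bit for bit.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only to make the sketch repeatable
n = 5                            # number of rows and columns (illustrative)
A = rng.normal(size=(n, n))
M = np.ones((n, n))              # any matrix with equal row and column sums works
S = M - A

balanced = A + S
# Compare with a tolerance instead of exact equality on floats.
ok = np.allclose(balanced.sum(axis=1), balanced.sum(axis=0))
print(ok)
```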
Related
I am trying to come up with a function which does the following:
Take an array (preferably a sparse csr.matrix) N x N
Find which rows and/or columns all have 0 entries
Remove both the i-th row and the i-th column if either of the two (or both) contains only 0 entries
Return the new, smaller square array (or sparse matrix), with no all-0 rows or columns, and the indices of the removed rows/columns.
I manage to return the correct array, but the returned index of the removed rows and columns is not correct (it is too short): since I remove rows and columns, it can happen that new rows/columns become all 0 that were not before.
Take for example the array [0,1,0,0],[0,0,0,0],[0,1,0,0],[1,0,0,0]. I remove the 2nd row and the 2nd column. The new array [0,0,0],[0,0,0],[1,0,0] then needs its 1st and 2nd rows and columns removed, which is okay, but it's easy to see that the returned indices are in a sense "shifted", i.e. they no longer refer to the initial array.
This is the function I created for the moment:
import numpy as np

def remove_zero_rows(X):
    # X is a scipy sparse matrix. We want to remove all zero rows/columns from it
    creat_list = list(range(0, X.shape[1]))
    nonzero_row_indice, nonzero_col_indice = X.nonzero()
    unique_nonzero_indice = np.unique(nonzero_row_indice)
    row_ind = np.array(list(set(creat_list).difference(unique_nonzero_indice)))  # indices of all-zero rows
    nonzero_col_indice = np.unique(nonzero_col_indice)
    col_ind = np.array(list(set(creat_list).difference(nonzero_col_indice)))  # indices of all-zero columns
    merge_two = list(set(row_ind) | set(col_ind))  # indices of zero rows/columns (first pass only)
    # Create new matrix
    for i in range(X.shape[1]):
        if (X.shape[1] - np.unique(X.nonzero()[0]).size > 0 or X.shape[1] - np.unique(X.nonzero()[1]).size > 0):
            X = X[np.unique(X.nonzero()[0])][:, np.unique(X.nonzero()[0])]
            X = X[np.unique(X.nonzero()[1])][:, np.unique(X.nonzero()[1])]
            # print(i)
        else:
            break
    return X, row_ind, col_ind, merge_two
Thank you!
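One way to keep the reported indices tied to the original matrix is to carry an array of surviving original indices through every pruning pass. The sketch below is an assumption about how that could look, not a fix of the function above. The name prune_zero_rows_cols is hypothetical, and it works on a dense NumPy array for simplicity; for a CSR input, per-row and per-column counts from getnnz(axis=...) could presumably take the place of np.count_nonzero.

```python
import numpy as np

def prune_zero_rows_cols(X):
    """Repeatedly drop every index i whose row i or column i is all zeros.

    Returns the pruned matrix, the original indices that survive,
    and the original indices that were removed.
    """
    X = np.asarray(X)
    n = X.shape[0]
    keep = np.arange(n)  # original index of each row/column still present
    while keep.size > 0:
        row_nnz = np.count_nonzero(X, axis=1)
        col_nnz = np.count_nonzero(X, axis=0)
        alive = (row_nnz > 0) & (col_nnz > 0)
        if alive.all():
            break  # nothing left to remove
        idx = np.flatnonzero(alive)
        X = X[np.ix_(idx, idx)]  # drop the same positions from rows and columns
        keep = keep[idx]         # map surviving positions back to original indices
    removed = np.setdiff1d(np.arange(n), keep)
    return X, keep, removed

# Index 3 goes in the first pass; index 2 only becomes empty afterwards.
example = np.array([[1, 1, 1, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]])
pruned, kept, removed = prune_zero_rows_cols(example)
print(kept, removed)
```

Because keep is re-indexed in every pass, a row that only becomes empty after an earlier removal is still reported under its original index.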
I want to retrieve the original index of the column with the largest sum at each iteration after the previous column with the largest sum is removed. Meanwhile, the row of the same index of the deleted column is also deleted from the matrix at each iteration.
For example, in a 10 by 10 matrix, the 5th column has the largest sum, hence the 5th column and row are removed. Now the matrix is 9 by 9 and the sum of columns is recalculated. Suppose the 6th column has the largest sum, hence the 6th column and row of the current matrix are removed, which is the 7th in the original matrix. Do this iteratively until the desired number of columns index is preserved.
My code in Julia, which does not work, is pasted below. Step two in the for loop is not correct: because a row is removed at each iteration, the column sums no longer match those of the original matrix.
Thanks!
using InvertedIndices  # provides Not()

# a matrix of random numbers
mat = rand(10, 10);
# column sum of the original matrix
matColSum = sum(mat, dims=1);
# iteratively remove columns with the largest sum
idxColRemoveList = [];
matTemp = mat;
for i in 1:4 # Suppose 4 columns need to be removed
    # 1. find the index of the column with the largest column sum at current iteration
    sumTemp = sum(matTemp, dims=1);
    maxSumTemp = maximum(sumTemp);
    idxColRemoveTemp = argmax(sumTemp)[2];
    # 2. record the original index of the removed scenario
    idxColRemoveOrig = findall(x->x==maxSumTemp, matColSum)[1][2];
    push!(idxColRemoveList, idxColRemoveOrig);
    # 3. update the matrix. Note that the corresponding row is also removed.
    matTemp = matTemp[Not(idxColRemoveTemp), Not(idxColRemoveTemp)];
end
Python solution:
import numpy as np
mat = np.random.rand(5, 5)
n_remove = 3
original = np.arange(len(mat)).tolist()
removed = []
for i in range(n_remove):
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argsort(col_sum)[-1]
    removed.append(original.pop(col_rm))
    mat = np.delete(np.delete(mat, col_rm, 0), col_rm, 1)
print(removed)
print(original)
print(mat)
I'm guessing the problem you had was keeping track of which original index each current column/row corresponds to. I've just used a list [0, 1, 2, ...] and popped one value from it in each iteration.
A simpler way to code this would be to replace the elements of the selected column and row with a very small number instead of deleting them. The indices then never shift, so "sort" and "pop" are not needed. Note that this relies on the entries being non-negative, as they are with np.random.rand.
import numpy as np
n = 1000
mat = np.random.rand(n, n)
n_remove = 500
removed = []
for i in range(n_remove):
    # get sum of each column
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argmax(col_sum)
    # record the column ID
    removed.append(col_rm)
    # overwrite the col_rm-th column and row with a tiny value so they are never picked again
    mat[:, col_rm] = 1e-10
    mat[col_rm, :] = 1e-10
print(removed)
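If the matrix may contain negative entries, overwriting with a tiny value no longer guarantees that a removed column is never selected again. A possible variant, sketched here under that assumption, keeps a boolean mask of active indices and excludes inactive columns from the argmax. The function name remove_largest_columns is hypothetical.

```python
import numpy as np

def remove_largest_columns(mat, n_remove):
    """Return the original indices removed (in order) and those remaining."""
    work = np.array(mat, dtype=float)          # work on a copy
    active = np.ones(work.shape[0], dtype=bool)
    removed = []
    for _ in range(n_remove):
        col_sum = work.sum(axis=0)
        col_sum[~active] = -np.inf             # removed columns can never win
        j = int(np.argmax(col_sum))
        removed.append(j)
        active[j] = False
        work[j, :] = 0.0                       # the removed row no longer contributes
        work[:, j] = 0.0
    return removed, np.flatnonzero(active)

removed, remaining = remove_largest_columns([[1, 2, 3],
                                             [4, 5, 6],
                                             [7, 8, 9]], 2)
print(removed, remaining)
```

Zeroing the removed row means the remaining column sums match what they would be after actually deleting that row.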
I have two numpy matrices of the same shape.
In one of them each column contains all 0's except for a 1.
In the other matrix each column contains random numbers.
My goal is to count the number of columns for which the position of the 1 in the column of the first matrix corresponds with the position of the highest element in the column of the second matrix.
For example:
a = [[1,0],
[0,1]]
b = [[2,3],
[3,5]]
myFunc(a,b)
would yield 1 since the argmax of the first column in b is not the same as in a but it is the same in the second column.
My solution was to iterate over the columns and check if the argmax was the same, store that in a list and then sum that at the end, but this doesn't take advantage of numpy's speed. Is there a faster way to do this? Thanks!
This checks the indices of max in each column of b against indices of 1s in corresponding column of a and counts the matches:
(a.T.nonzero()[1]==b.argmax(axis=0)).sum()
output in your example:
1
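Given that there will only be a single 1 in each row or column of the first array, you should just be able to compare the argmax positions of the two arrays. Note that axis=1, as used here and in the outputs below, compares row by row; for the column-wise layout in the question, use axis=0 instead.
import numpy as np
def myfunc(binary_array, value_array):
    return np.sum(binary_array.argmax(axis=1) == value_array.argmax(axis=1))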
a = np.array([[1,0],
[0,1]])
b = np.array([[2,3],
[3,5]])
myfunc(a,b)
1
c=np.array([[0,1,0],[1,0,0],[0,0,1]])
d=np.array([[1,2,3],[2,2,3],[1,3,4]])
myfunc(c,d)
1
e=np.array([[0,1,0],[0,0,1],[0,0,1]])
f=np.array([[1,2,3],[2,2,3],[1,3,4]])
myfunc(e,f)
2
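For the column-wise case described in the question, a minimal sketch could compare the per-column argmax of both arrays with axis=0. The function name count_matching_columns is hypothetical, and the sketch assumes each column of the first array holds exactly one 1.

```python
import numpy as np

def count_matching_columns(one_hot, values):
    """Count columns where the 1 in `one_hot` sits at the row of the column maximum in `values`."""
    one_hot = np.asarray(one_hot)
    values = np.asarray(values)
    # argmax(axis=0) gives, for each column, the row index of its largest entry.
    return int(np.sum(one_hot.argmax(axis=0) == values.argmax(axis=0)))

a = [[1, 0],
     [0, 1]]
b = [[2, 3],
     [3, 5]]
print(count_matching_columns(a, b))  # 1
```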
I'm new to numpy. I have an Nx4 matrix and I want to find the average of each column when the last column equals 1 for instance. In matlab I would do something like mean1 = mean(data[column 4] == 1). This would return a matrix (or vector) with the mean of the columns, with the mean of column 4 being equal to 1. I can't find any specific documentation that specifies how to handle this. This shows how to filter the matrix, but I shouldn't have to reassign the matrix to a new variable, doubling storage size. Thanks in advance.
import numpy as np

#make artificial data to match problem
data = np.random.random((100,4))
print( id(data) )
data[:,3] = data[:,3] < 0.5
print( id(data) ) #same object (memory location)
#get the filter
dfilter = data[:,3].astype(np.bool_)
#find the means
means = data[dfilter].mean(axis=0)
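Boolean indexing such as data[dfilter] builds a temporary array of the selected rows. If that copy is the concern, one alternative is the where argument of mean, which I believe is available from NumPy 1.20 onwards. This is only a sketch on a small hand-made array, and I have not measured its memory use.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0, 1.0],
                 [4.0, 5.0, 6.0, 0.0],
                 [7.0, 8.0, 9.0, 1.0]])
mask = data[:, 3] == 1                    # rows whose last column equals 1

# Variant 1: boolean indexing (creates a temporary array of the selected rows).
means_copy = data[mask].mean(axis=0)

# Variant 2: tell mean() which elements to include; the mask is broadcast across columns.
means_where = data.mean(axis=0, where=mask[:, None])

print(means_copy)
print(means_where)
```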
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
import csv
from numpy import zeros, indices, hsplit

ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()
IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
[2, 3, 4],
[0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) is an array of the same shape as the original a, and it contains True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
from numpy import zeros, loadtxt

ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestID', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
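One possible answer to that question is to report NaN for any row or column without non-zero elements. The sketch below does this with np.divide and its where argument; treating such rows and columns as NaN is my assumption, not something the asker specified, and the helper name nonzero_mean is hypothetical.

```python
import numpy as np

def nonzero_mean(a, axis):
    """Mean over the non-zero entries along `axis`; NaN where there are none."""
    a = np.asarray(a, dtype=float)
    sums = a.sum(axis=axis)
    counts = (a != 0).sum(axis=axis)
    out = np.full(sums.shape, np.nan)
    # Divide only where there is at least one non-zero entry; elsewhere keep NaN.
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
col_mean = nonzero_mean(a, axis=0)
row_mean = nonzero_mean(a, axis=1)
print(col_mean)
print(row_mean)
```

This also avoids the division-by-zero warning that the plain a.sum(0) / (a != 0).sum(0) would raise for an empty column.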
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details. The minlength argument makes sure that trailing all-zero columns still appear in the result.
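To see why these two expressions count non-zeros, it can help to write out the CSR arrays by hand. The arrays below are what I would expect a CSR representation of the small matrix in the comment to contain; they are written manually for illustration and are not taken from scipy. The sketch also shows what np.bincount returns without minlength when the last column is empty.

```python
import numpy as np

# Hand-written CSR layout (assumed) for the 2x3 matrix
#   [[0, 1, 0],
#    [1, 1, 0]]
n_cols = 3
indptr = np.array([0, 1, 3])    # row i owns entries indptr[i]:indptr[i+1]
indices = np.array([1, 0, 1])   # column index of each stored entry

row_nonzeros = np.diff(indptr)                          # entries per row
col_nonzeros = np.bincount(indices, minlength=n_cols)   # entries per column
col_short = np.bincount(indices)                        # misses the empty last column

print(row_nonzeros)
print(col_nonzeros)
print(col_short)
```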
A faster way is to clone your matrix with ones instead of the real values. Then just sum up by rows or columns:
X_clone = X.tocsc().copy()  # copy so that X itself is untouched even if it is already CSC
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit:
You may need to convert NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() method supported by CSR/CSC matrices.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i,j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.