This needs to yield results as they are computed: it must not store all of the data at any point, so it can support streams of data larger than memory.
For each row, add an int at the beginning that is the total of the row.
Once the entire input has been processed, add a final row with the totals of every column in the input. This final row should include the initial total column, and columns that are missing on a given row should be treated as zeros.
The row totals are the first column instead of the last (as is more common) because that makes rows of different lengths easier to handle.
For example:
func([(1, 2, 3), (4, 5)]) -> [(6, 1, 2, 3), (9, 4, 5), (15, 5, 7, 3)]
Hopefully you will learn something from this:
from itertools import zip_longest

def func(rows):
    totals = []
    for row in rows:
        # prepend the row total
        row = (sum(row),) + row
        # update the running column totals; missing columns count as 0
        totals = [sum(col) for col in zip_longest(totals, row, fillvalue=0)]
        yield row
    # after all rows have been consumed, emit the column totals
    yield tuple(totals)
>>> list(func([(1, 2, 3), (4, 5)]))
[(6, 1, 2, 3), (9, 4, 5), (15, 5, 7, 3)]
This code iterates over the rows, yielding a tuple comprising the row total followed by the original columns.
zip_longest() (izip_longest() in Python 2) pairs each item in the current row with the corresponding item in totals to maintain a running total of each column. zip_longest() was chosen because it handles rows of different lengths and lets you supply a fill value (0 in this case) for missing items.
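To make the pairing concrete, here is a minimal sketch (the values are illustrative) of how zip_longest() combines the running totals with a shorter row:

from itertools import zip_longest

totals = [6, 1, 2, 3]   # running totals after the first row
row = (9, 4, 5)         # the second row, prefixed with its own total

# the shorter sequence is padded with 0, so the pairs are
# (6, 9), (1, 4), (2, 5), (3, 0)
totals = [sum(col) for col in zip_longest(totals, row, fillvalue=0)]
print(totals)  # [15, 5, 7, 3]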
I want to retrieve the original index of the column with the largest sum at each iteration, after the previous column with the largest sum has been removed. At each iteration, the row with the same index as the deleted column is also deleted from the matrix.
For example, in a 10 by 10 matrix, the 5th column has the largest sum, so the 5th column and row are removed. The matrix is now 9 by 9 and the column sums are recalculated. Suppose the 6th column of the current matrix now has the largest sum; then the 6th column and row of the current matrix are removed, which is the 7th in the original matrix. This continues iteratively until the desired number of column indices has been collected.
My Julia code, which does not work, is pasted below. Step two in the for loop is not correct, because a row is removed at each iteration, so the column sums differ from those of the original matrix.
Thanks!
using InvertedIndices  # provides Not()

# a matrix of random numbers
mat = rand(10, 10);

# column sums of the original matrix
matColSum = sum(mat, dims=1);

# iteratively remove columns with the largest sum
idxColRemoveList = [];
matTemp = mat;
for i in 1:4  # suppose 4 columns need to be removed
    # 1. find the index of the column with the largest column sum at the current iteration
    sumTemp = sum(matTemp, dims=1);
    maxSumTemp = maximum(sumTemp);
    idxColRemoveTemp = argmax(sumTemp)[2];
    # 2. record the original index of the removed column
    idxColRemoveOrig = findall(x -> x == maxSumTemp, matColSum)[1][2];
    push!(idxColRemoveList, idxColRemoveOrig);
    # 3. update the matrix; note that the corresponding row is also removed
    matTemp = matTemp[Not(idxColRemoveTemp), Not(idxColRemoveTemp)];
end
Python solution:
import numpy as np

mat = np.random.rand(5, 5)
n_remove = 3

original = np.arange(len(mat)).tolist()
removed = []
for i in range(n_remove):
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argsort(col_sum)[-1]
    removed.append(original.pop(col_rm))
    mat = np.delete(np.delete(mat, col_rm, 0), col_rm, 1)

print(removed)
print(original)
print(mat)
I'm guessing the problem you had was keeping track of which index in the original array each current column/row corresponds to. I just used a list [0, 1, 2, ...] and popped one value from it in each iteration.
A simpler way to code the problem is to replace the elements in the selected column with a very small number instead of deleting the column. This approach avoids the use of sort and pop, which improves efficiency.
import numpy as np

n = 1000
mat = np.random.rand(n, n)
n_remove = 500

removed = []
for i in range(n_remove):
    # get the sum of each column
    col_sum = np.sum(mat, axis=0)
    col_rm = np.argmax(col_sum)
    # record the column ID
    removed.append(col_rm)
    # replace the elements in the col_rm-th column and row with a very small number
    mat[:, col_rm] = 1e-10
    mat[col_rm, :] = 1e-10

print(removed)
I am working on a Python project which iterates through all the possible combinations of entries in a row of Excel data to find which combination produces the correct output.
To achieve this, I am iterating through different combinations of 0 and 1 to choose whether each entry is required for the combination: 1 means the data point is included in the calculation and 0 means it is not.
The number of combinations is thus equal to 2 ^ (number of Excel columns).
Example Excel Data:
1, 22, 7, 11, 2, 4
Example Iteration:
(1, 0, 0, 0, 1, 0)
For instance, I could be looking for which combination of the Excel data results in an output of 3, the only correct combination being the iteration above.
However, I know that any value greater than 3 cannot be part of a combination that equals 3. As such, I would like to set the values of those columns to 0 and iterate over the other columns only. This in turn reduces the number of combinations:
Combinations = 2 ^ (number of Excel columns - fixed entry columns)
At the moment I am using itertools.product to get all the combinations I need:
import itertools
import numpy as np
import pandas as pd

Numbers = ["0", "1"]
for item in itertools.product(Numbers, repeat=len(df.columns)):
    Iteration = pd.DataFrame(item)    # iteration, e.g. (0, 1, 1, 1, 0, 0, 1)
    Data = df.iloc[0]                 # Excel data row
    Data = Data.to_numpy()
    Iteration = Iteration.astype(float)
    Answer = np.dot(Data, Iteration)  # result of (Iteration * Data), to check if the answer is correct
This results in iterating through combinations which I know will not work.
Is there a way to only iterate 0's and 1's in certain positions of the combination while keeping the known entries a fixed value (either 0 or 1) to reduce the possible combinations?
Some Excel files have over 25 columns, which would mean 33,554,432 combinations. As such, I am trying to reduce the number of columns I need to iterate over by fixing values for the columns I do know.
If you need further clarification, please let me know. I am a novice programmer, so I may be overlooking or overcomplicating a simple solution.
Find which columns meet your criteria for exclusion. Then just get the product combinations for the other columns.
One possible method:
from itertools import product

LIMIT = 10
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i, x in enumerate(column_data) if x <= LIMIT]

for item in product([0, 1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    print(row_iteration)
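To tie this back to finding which combination produces the correct output, the dot-product check from the question can run inside the same loop. A minimal sketch (TARGET = 3 is an assumed example value matching the question):

from itertools import product

LIMIT = 10
TARGET = 3  # assumed example target output
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i, x in enumerate(column_data) if x <= LIMIT]

for item in product([0, 1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    # sum of the data points included by this combination
    answer = sum(d * m for d, m in zip(column_data, row_iteration))
    if answer == TARGET:
        print("match:", row_iteration)  # prints match: [1, 0, 0, 0, 1, 0]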
I have the following function
def sum_NE(data, i, col='VALUES'):
    return data.iloc[get_NE(i, len(data))][col].sum()
This works great. But I'd like to do one more thing. Column VALUES includes zeros and values bigger than zero. How do I count all the values bigger than zero that are used when evaluating sum()?
Function get_NE returns a list. I tried the code below, but it doesn't work.
def sum_NE(data, i, col='VALUES'):
    return data.iloc[get_NE(i, len(data))][col].count()
Function get_NE returns a list, e.g. [5, 6, 8, 12]. These values are row positions in the data dataframe, and with [col] I am selecting values in the VALUES column. Those values are first aggregated; now I want to find out how many of them were aggregated.
I found a solution:
def sum_NE(data, i, col='VALUES'):
    # use a loop variable other than i so it does not shadow the argument
    return sum(1 for v in data.iloc[get_NE(i, len(data))][col] if float(v) > 0)
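An alternative that avoids the explicit generator is to sum a boolean mask. A minimal sketch, with a stand-in for get_NE since its real definition isn't shown:

import pandas as pd

def get_NE(i, n):
    return [0, 2, 3]  # stand-in for the question's helper; example positional indices

data = pd.DataFrame({'VALUES': [4.0, 0.0, 7.0, 0.0, 2.0]})

def count_NE(data, i, col='VALUES'):
    # select the rows, compare against zero, and count the True values
    selected = data.iloc[get_NE(i, len(data))][col]
    return int((selected > 0).sum())

print(count_NE(data, 0))  # rows 0, 2, 3 hold 4.0, 7.0, 0.0 -> 2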
I'd like to sum one particular row of a large NumPy array. I know the function array.max() will give the maximum across the whole array, and array.max(1) will give me the maximum across each of the rows as an array. However, I'd like to get the maximum in a certain row (for example, row 7, or row 29). I have a large array, so getting the maximum for all rows will give me a significant time penalty.
You can access a row of a two-dimensional array using the indexing operator. The row itself is an array, a view of a part of the original array, and it exposes all array methods, including sum() and max(). Therefore you can easily get the maximum or sum of a particular row like this:
x = arr[7].max() # Maximum in row 7
y = arr[29].sum() # Sum of the values in row 29
Just for completeness, you can do the same for columns:
z = arr[:, 5].sum() # Sum up all values in column 5.
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
j = 0
for j in range(0, len(TestIDs)):
    TestID = str(TestIDs[j])
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    inputfile = open(directory, 'r')
    reader = csv.reader(inputfile)
    m = 0
    for row in reader:
        if m < 9:
            if row[0] != 'TestID':
                ANOVAInputMatrixValuesArray[(j-1), m] = row[2]
                m += 1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:, ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) is an array of the same shape as the original a, containing True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j, :] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
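For instance, a minimal sketch of one possible adaptation (treating the mean of an all-zero column as 0 is an assumption about the desired behavior):

import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 0]])

counts = (a != 0).sum(0)
# clamp the divisor to at least 1 so all-zero columns yield 0 instead of dividing by zero
colMean = a.sum(0) / np.maximum(counts, 1)
print(colMean)  # [1.5 3.  2.5]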
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already in CSR format, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
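As a minimal sketch tying these together (the matrix values are illustrative):

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 1, 1],
                         [0, 1, 0],
                         [2, 0, 0]]))

# indptr marks the row boundaries in the stored data ([0, 2, 3, 4] here),
# so consecutive differences are the per-row nonzero counts
row_nonzeros = np.diff(m.indptr)                             # [2, 1, 1]
# indices holds the column index of every stored value,
# so counting occurrences gives the per-column nonzero counts
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])  # [1, 2, 1]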
A faster way is to clone your matrix with ones in place of the real values. Then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones(X_clone.data.shape)
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53).
Edit:
Perhaps you will need to translate NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrices, e.g.:
>>> import scipy.sparse
>>> a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
>>> a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i, j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
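One more concise option, sketched here with np.bincount on the column indices (the same trick mentioned for CSR above; the example matrix is illustrative):

import numpy as np
from scipy.sparse import lil_matrix

X = lil_matrix([[0, 1, 1], [0, 1, 0]])

# nonzero() returns the row and column indices of all stored nonzero values;
# counting the column indices gives the per-column nonzero counts
i, j = X.nonzero()
column_sums = np.bincount(j, minlength=X.shape[1])
print(column_sums)  # [0 2 1]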