I have the following function
def sum_NE(data, i, col='VALUES'):
    return data.iloc[get_NE(i, len(data))][col].sum()
This works great. But I'd like to do one more thing. Column VALUES includes zeros and values bigger than zero. How do I count all the values bigger than zero, that are used when evaluating sum()?
Function get_NE returns a list. I tried the code below, but it doesn't work.
def sum_NE(data, i, col='VALUES'):
    return data.iloc[get_NE(i, len(data))][col].count()
get_NE returns a list, e.g. [5, 6, 8, 12]. These values are row positions in the data DataFrame, and the [col] reference picks out the corresponding values in the VALUES column. Those values are first summed. Now I want to find out how many of them (counting only those bigger than zero) went into the sum.
I found a solution:
def sum_NE(data, i, col='VALUES'):
    # v is the loop variable; the original reused the name i, which shadowed the argument
    return sum(1 for v in data.iloc[get_NE(i, len(data))][col] if float(v) > 0)
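The count() attempt above returns the number of non-missing values, so zeros are still included. A vectorized alternative to the generator solution is to compare the selected values with zero and sum the resulting booleans. This is only a sketch: the real get_NE is not shown in the question, so the version below is a hypothetical stand-in that just returns a few row positions, and count_positive_NE and the sample data are made up for illustration.

```python
import pandas as pd

def get_NE(i, n):
    # Hypothetical stand-in for the asker's get_NE: returns a list of row positions.
    return [p for p in (i - 1, i + 1, i + 2) if 0 <= p < n]

def count_positive_NE(data, i, col='VALUES'):
    # Select the rows, compare with 0, and sum the booleans (True counts as 1).
    selected = data.iloc[get_NE(i, len(data))][col]
    return int((selected > 0).sum())

# Made-up sample data containing zeros and positive values.
data = pd.DataFrame({'VALUES': [0, 3, 0, 7, 2, 0]})
n_positive = count_positive_NE(data, 2)  # rows 1, 3, 4 hold 3, 7, 2
```

The comparison produces a boolean Series, so no Python-level loop or float() conversion is needed.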
Related
I have an arbitrary number of columns to check the condition as to whether any of them is equal to 1, then I want to create a new column based on the results. I want to do something along the lines of how-to-test-multiple-columns-of-pandas-for-a-condition-at-once-and-update-them:
cols=['col_1', ..., 'col_n']
test['col_n+1']=np.where(test[cols] > 0, 1, 0)
However, when I run this, I get an error of:
ValueError: Wrong number of items passed 5, placement implies 1
I understand why this is being thrown, but cannot find a pythonic way of doing this (I'm able to iterate through the dataframe and individually evaluate each column, etc., but the code is ugly)
import pandas as pd
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
col=['col2','col3']
# Check whether any selected column in a row has a value greater than 19
test['test'] = test[col].gt(19).any(axis=1).astype(int)
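To see why the np.where call in the question fails, it may help to compare shapes: applied to several columns at once, np.where returns one result per cell, which cannot be assigned to a single column. Reducing the boolean frame across columns first gives one value per row. The sketch below reuses the sample frame from the answer; the column name flag is made up for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40],
                     'col2': [5, 10, 15, 20],
                     'col3': [6, 12, 18, 24]})
cols = ['col2', 'col3']

# One result per cell: a 2-D array with as many columns as len(cols).
mask_2d = np.where(test[cols] > 19, 1, 0)

# One result per row: collapse the columns with any(axis=1) before np.where.
flag = np.where((test[cols] > 19).any(axis=1), 1, 0)
test['flag'] = flag
```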
I'm trying to divide each value of a specific column in the df_rent dataframe by 1000 by accessing each value in a loop. But it returns an error and I cannot understand the reason.
Type of the column is float64.
for i in df_rent['distance_new']:
    df_rent[i] = df_rent[i] / 1000
    print(df_rent[i])
The error occurs because when you loop over df_rent['distance_new'], i is assigned the value of the first cell in 'distance_new', then the second, then the third; it is not an index or a column label. What you should do is rather simple:
df_rent['distance_new']/=1000
In case someone doesn't understand: the /= operator takes the LHS value, divides it by the RHS, then replaces the LHS value with the result. The LHS can be an int, a float, or in this case a whole column. This solution also works on multiple columns if you slice them correctly, something like:
df_rent.loc[:,['distance_new','other_col_1','other_col2']]/=1000
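The point about iteration can be checked directly. This is a small sketch with made-up distances (the asker's data is not shown); distance_km is a hypothetical column name used so the original column stays untouched.

```python
import pandas as pd

# Made-up sample data standing in for the asker's df_rent.
df_rent = pd.DataFrame({'distance_new': [1500.0, 250.0, 4000.0]})

# Iterating over a column yields the cell values, not index or column labels,
# which is why df_rent[i] in the question looks up a column that does not exist.
seen = [v for v in df_rent['distance_new']]

# Vectorized division handles the whole column at once.
df_rent['distance_km'] = df_rent['distance_new'] / 1000
```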
i holds the values you iterate over in df_rent['distance_new'].
E.g. if df_rent['distance_new'] = [23, 45, 67],
i takes the value 23, then 45, then 67.
i is not the index.
Also, with a list you have to append an element before you can assign to that position.
Try using a dictionary if you need to create new keys;
lists can only modify existing positions.
Also, post the values of a sample array so you can communicate your problem better.
I am wondering why pandas assign function cannot handle returned lists.
For example
df = pd.DataFrame({
    "id" : [1,2,3,4,5],
    "val" : [10,20,30,30,40]
})
def squareMe(x):
    return x**2
df = df.assign(val2 = lambda x: squareMe(x.val))
# Out > Works fine : Returns a DataFrame with squared values
But if we return a list,
def squareMe(x):
    return [x**2]
df = df.assign(val2 = lambda x: squareMe(x.val))
#Out > ValueError: Length of values (1) does not match length of index (5)
However pandas apply function works fine when returning a list
def squareMe(x):
    return [x**2]
df["val2"] = df.val.apply(lambda x: squareMe(x))
Any particular reason why this is or am I doing something wrong?
Since you reference x.val in the call to squareMe, that function is passed a list (you can easily verify this by adding a debug statement to print type(x) inside the function).
Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.
But when you return [x ** 2] you're returning the Series inside a list, which doesn't make sense to assign, since all assign sees is an iterable of size 1 (the Series inside it), and it deems this to be the incorrect length for performing a column assignment to a DataFrame of size 5 (which is exactly what ValueError: Length of values (1) does not match length of index (5) means).
The difference with apply is that the function receives a number, not a Series. So you still return a single item (a list), which apply accepts, but it is still technically wrong since you shouldn't need to wrap the result in a list.
More information: df.assign, df.apply
P.S.: you probably already understand this, but you can simplify this to df['val2'] = df['val'] ** 2
assign isn't particularly meant for this; it is for assigning new columns from sequences (or callables that return them) passed as keyword arguments.
Docs:
**Parameters : kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it). If the values
are not callable, (e.g. a Series, scalar, or array), they are simply
assigned.
Doing [x ** 2] returns a list containing a single Series, so the value handed to assign has length 1, and therefore, as the error mentions:
ValueError: Length of values (1) does not match length of index (5)
The length of the values would not match the length of the index.
This has been driving me nuts all day in my own work, but I've got it now.
cs95 is almost correct, but not quite. If you follow their advice and put a print(f"{type(x)}") in your squareMe function you'll see that it's a Series, not a list.
That's the catch, x.val is always a Series (the entire column of values), and squareMe returns a Series.
In contrast, apply, if you specify axis=1, will iterate over each row of the DataFrame, passing each row's value of x.val to squareMe and building a new Series for your new column in the process.
The reason it confused you (and me!) is that, when it works in your first example, it looks like squareMe is operating on integers and returning an integer for each row. But in fact, it's taking advantage of operator overloading to square the Series, not individual values: It's using the pow function, which is aliased as **, which like the other overloaded operators on Series, works element-wise.
Now, when you change squareMe to return the list of the result: [x**2], it's again squaring the entire Series to get a new Series of squares, but then making a list of that Series. That is, a list of a single element, the element being a Series.
Now assign was expecting a Series back from squareMe of the same length as the index of the dataframe, which is 5, and you returned it a list with a single element - hence the error: expected length 5, got 1.
Your apply, in the meantime, is working on the Series val because that's what you called it on, and it's iterating over the values in that series. Another way to do the apply, which is closer to your assign is this:
df["val2"] = df.apply(lambda x: squareMe(x.val), axis=1)
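The difference described above can be observed by recording what each call hands to the function. This is a sketch only: record_assign, record_apply and the column name val3 are hypothetical names introduced for the demonstration, and the frame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

seen_by_assign = []
seen_by_apply = []

def record_assign(x):
    # Called once by assign, with the whole column.
    seen_by_assign.append(type(x).__name__)
    return x ** 2

def record_apply(x):
    # Called once per element by Series.apply, with a scalar.
    seen_by_apply.append(type(x).__name__)
    return x ** 2

out = df.assign(val2=lambda d: record_assign(d.val))
out["val3"] = df.val.apply(record_apply)
```

After running this, seen_by_assign holds a single entry naming Series, while seen_by_apply holds five scalar type names, one per row.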
How can I sum a column given by index? I tried to use 'for row in list', but it results in a TypeError.
function
index
sample
You should do as you say. However, the sum of the names is not possible if you initialize the column_sum to 0 (which is an integer).
The sum of the names does not seem relevant. But if you want the function get_total to work on any column, you should first check if the variables are integers. If not, then return 0 for instance.
def get_total(index, stations):
    if not stations:
        # The list is empty
        return 0
    if type(stations[0][index]) is int:
        # The sum is possible
        return sum([station[index] for station in stations])
    return 0
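A slightly more general variant of the function above checks every value in the column rather than only the first row, and accepts floats as well as integers. It is a sketch under the assumption that stations is a list of rows (tuples or lists); the question's actual data is not shown, so the station names and numbers below are made up.

```python
from numbers import Number

def get_total(index, stations):
    """Sum column `index` over all rows; return 0 for empty or non-numeric columns."""
    if not stations:
        # The list is empty
        return 0
    values = [station[index] for station in stations]
    if all(isinstance(v, Number) for v in values):
        # Every entry is numeric, so the sum is possible
        return sum(values)
    return 0

# Made-up rows: (name, count, rate)
stations = [("Central", 12, 3.5), ("North", 7, 1.25), ("South", 0, 2.0)]
total_count = get_total(1, stations)
total_rate = get_total(2, stations)
total_names = get_total(0, stations)  # names are strings, so this falls back to 0
```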
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The variable (a != 0) is an array of the same shape as original a and it contains True for all non-zero elements.
The .sum(x) function sums the elements over the axis x. Sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
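The same counts can also be obtained with np.count_nonzero, which accepts an axis argument. The sketch below uses the example array from this answer.

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

# axis=0 counts down each column, axis=1 counts across each row.
columns = np.count_nonzero(a, axis=0)
rows = np.count_nonzero(a, axis=1)
```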
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]

nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
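One possible way to handle a row or column with no non-zero elements is to divide only where the count is positive and leave NaN elsewhere. This is a sketch of one convention, not the only reasonable one (returning 0 would be another); the array below is made up so that its last row is all zeros.

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 0]], dtype=float)

row_sums = a.sum(axis=1)
row_counts = np.count_nonzero(a, axis=1)

# Divide only where the count is positive; other entries keep the NaN from `out`.
row_means = np.divide(row_sums, row_counts,
                      out=np.full(row_sums.shape, np.nan),
                      where=row_counts > 0)
```

The same pattern works for columns by switching to axis=0.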
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows. So calculating the difference between each entry will provide the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
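A small worked sketch of the CSR approach follows, on a made-up matrix with an empty last row and an empty last column. Two caveats are worth stating as assumptions: np.bincount only returns entries up to the largest column index it sees, so minlength is passed here to cover trailing empty columns, and indptr counts stored entries, so explicitly stored zeros (if any) would be counted unless removed first, for example with eliminate_zeros().

```python
import numpy as np
from scipy import sparse

# Made-up example: the last row and the last column contain no non-zero values.
m = sparse.csr_matrix(np.array([[0, 1, 1, 0],
                                [0, 1, 0, 0],
                                [0, 0, 0, 0]]))

# Differences of the row pointers give the number of stored entries per row.
row_nonzeros = np.diff(m.indptr)

# Counting occurrences of each column index gives the entries per column.
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])
```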
The faster way is to clone your matrix with ones instead of real values. Then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit:
You may need to convert NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i,j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
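One possibly more compact option is to hand the column indices from nonzero() to np.bincount instead of looping. This is a sketch on a made-up lil_matrix; minlength is passed so that columns without any non-zero entry still appear in the result.

```python
import numpy as np
from scipy import sparse

# Made-up sparse matrix with empty first and third columns.
X = sparse.lil_matrix((3, 4))
X[0, 1] = 5.0
X[1, 1] = 2.0
X[2, 3] = 1.0

# nonzero() returns the row and column indices of the non-zero entries.
_, j = X.nonzero()

# Count how often each column index occurs.
column_counts = np.bincount(j, minlength=X.shape[1])
```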