pandas: conditional count across row - python

I have a dataframe that has months for columns, and various departments for rows.
      2013April  2013May  2013June
Dep1          0       10        15
Dep2         10       15        20
I'm looking to add a column that counts the number of months that have a value greater than 0. Ex:
      2013April  2013May  2013June  Count>0
Dep1          0       10        15        2
Dep2         10       15        20        3
The number of columns this function needs to span is variable. I think defining a function and then using .apply is the solution, but I can't seem to figure it out.

First, pick your columns, cols:
df[cols].apply(lambda s: (s > 0).sum(), axis=1)
This takes advantage of the fact that True and False are 1 and 0, respectively, in Python.
Actually, there's a better way:
(df[cols] > 0).sum(1)
because it takes advantage of NumPy vectorization:
%timeit df.apply(lambda s: (s > 0).sum(), axis=1)
10 loops, best of 3: 141 ms per loop
%timeit (df > 0).sum(1)
1000 loops, best of 3: 319 µs per loop
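As a self-contained sketch of the vectorized version, assuming the column names from the example above:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'2013April': [0, 10],
                   '2013May':   [10, 15],
                   '2013June':  [15, 20]},
                  index=['Dep1', 'Dep2'])

cols = ['2013April', '2013May', '2013June']   # the (variable) set of month columns
df['Count>0'] = (df[cols] > 0).sum(axis=1)    # vectorized count of positive months per row
print(df)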

Related

Adding dataframe columns together, separated by commas, considering NaNs

How could NaN values be completely omitted from the new column in order to avoid consecutive commas?
df['newcolumn'] = df.apply(''.join, axis=1)
One approach would probably be to use a conditional lambda:
df.apply(lambda x: ','.join(x.astype(str)) if(np.isnan(x.astype(str))) else '', axis = 1)
But this returns an error message:
TypeError: ("ufunc 'isnan' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''", 'occurred at index 0')
Edit:
Both your answers work. What criteria should I use to decide which one to code? Performance considerations?
You can use stack, since it removes NaN values by default:
df.stack().groupby(level=0).apply(','.join)
Out[552]:
0 a,t,y
1 a,t
2 a,u,y
3 a,u,n
4 a,u
5 b,t,y
dtype: object
Data input
df
Out[553]:
  Mary John David
0    a    t     y
1    a    t   NaN
2    a    u     y
3    a    u     n
4    a    u   NaN
5    b    t     y
You can use dropna inside your apply, such as:
df.apply(lambda x: ','.join(x.dropna()), axis = 1)
Using @Wen's input for df, this one is slightly faster for a small df:
%timeit df.apply(lambda x: ','.join(x.dropna()),1)
1000 loops, best of 3: 1.04 ms per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1000 loops, best of 3: 1.6 ms per loop
but for a bigger dataframe, @Wen's answer is much faster:
df_long = pd.concat([df]*1000)
%timeit df_long.apply(lambda x: ','.join(x.dropna()),1)
1 loop, best of 3: 850 ms per loop
%timeit df_long.stack().groupby(level=0).apply(','.join)
100 loops, best of 3: 13.1 ms per loop
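To make the comparison reproducible, here is a minimal sketch of both approaches on a trimmed version of the sample data above (newcolumn follows the question's naming):
import numpy as np
import pandas as pd

# Trimmed version of the sample data shown above
df = pd.DataFrame({'Mary': ['a', 'a', 'b'],
                   'John': ['t', 'u', 't'],
                   'David': ['y', np.nan, 'y']})

cols = ['Mary', 'John', 'David']

# Row-wise join of non-NaN values, dropping NaNs explicitly
joined_dropna = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1)

# Stack-based version; stack drops NaNs here, so the groups only contain real values
joined_stack = df[cols].stack().groupby(level=0).apply(','.join)

df['newcolumn'] = joined_dropna
print(df)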

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I want. For each row, I want to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from the Pandas 0.20.0 "what's new" notes:
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
On the small sample DataFrame, using apply is a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop
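Putting it together, a minimal runnable sketch that reproduces the desired frame with cummax on the question's data:
import pandas as pd

# The question's sample data
df = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])

# Running (cumulative) maximum of 'value' within each 'id' group
df['value'] = df.groupby('id')['value'].cummax()
print(df)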

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table, or whether this is an efficient way of performing the job at all.
Use pd.isnull; to select the cell, use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
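For reference, a minimal self-contained sketch of that cell check (the frame below is a simplified stand-in for the question's data):
import numpy as np
import pandas as pd

# Simplified stand-in for the question's frame
df = pd.DataFrame({'A': [2], 'B': [np.nan], 'C': [8]})

print(pd.isnull(df.loc[0, 'B']))   # True
print(pd.notnull(df.loc[0, 'B']))  # False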
jezrael's response is spot on. But if you only care whether any NaN value exists at all, I was curious whether there is a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN values in a specific column, you can use:
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you can do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one-liner, you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
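As an alternative sketch (not from the original answers), np.argwhere on the null mask gives the same row/column positions in one step; the frame below is a made-up example:
import numpy as np
import pandas as pd

# Made-up frame with a couple of NaNs
df = pd.DataFrame({'A': [2, 1], 'B': [np.nan, 3], 'C': [8, np.nan]})

# Integer (row, column) positions of every NaN in the frame
positions = np.argwhere(pd.isnull(df).values)
print(positions.tolist())  # [[0, 1], [1, 2]]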

python pandas - input values into new column

I have a small dataframe below of the spending of 4 people.
There is an empty column called 'Grade'.
I would like to rate those who spent more than $100 as grade A, and those who spent less than $100 as grade B.
What is the most efficient method of filling in the 'Grade' column, assuming it is a big dataframe?
import pandas as pd
df=pd.DataFrame({'Customer':['Bob','Ken','Steve','Joe'],
'Spending':[130,22,313,46]})
df['Grade']=''
You can use numpy.where:
df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
print (df)
Customer Spending Grade
0 Bob 130 A
1 Ken 22 B
2 Steve 313 A
3 Joe 46 B
Timings:
df=pd.DataFrame({'Customer':['Bob','Ken','Steve','Joe'],
'Spending':[130,22,313,46]})
#[400000 rows x 2 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
Another way is to use a lambda function with apply (although, as the timings above show, this is much slower than np.where on a large DataFrame):
df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
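For completeness, a self-contained, runnable sketch of the np.where approach from the first answer (note that numpy must be imported; the question itself only imports pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# Vectorized conditional: 'A' where Spending > 100, otherwise 'B'
df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
print(df)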

Finding the common columns when comparing two rows in a dataframe in python

I have a dataframe of the structure below. I want to get the column numbers that have the same specified value when I compare two rows.
1 1 0 1 1
0 1 0 1 0
0 1 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 1
So, for example, when I use the above sample df to compare two rows for the columns that contain 1 in both, I should get col(1) and col(3) when I compare row(0) and row(1). Similarly, when I compare row(1) and row(2), I should get col(1). I want to know if there is a more efficient solution in Python.
NB: I want only the matching column numbers and also I will specify the rows to compare.
Consider the following dataframe:
import numpy as np
df = pd.DataFrame(np.random.binomial(1, 0.2, (2, 10000)))
It will be a binary matrix of size 2x10000.
np.where((df.iloc[0] * df.iloc[1]))
Or,
np.where((df.iloc[0]) & (df.iloc[1]))
returns the columns that have 1s in both rows. Multiplication seems to be faster:
%timeit np.where((df.iloc[0]) & (df.iloc[1]))
1000 loops, best of 3: 400 µs per loop
%timeit np.where((df.iloc[0] * df.iloc[1]))
1000 loops, best of 3: 269 µs per loop
Here's a simple function. You can modify it as needed, depending on how you represent your data. I'm assuming a list of lists:
df = [[1,1,0,1,1],
      [0,1,0,1,0],
      [0,1,0,0,1],
      [1,0,0,0,1],
      [0,0,0,0,0],
      [1,0,0,0,1]]

def compare_rows(df, row1, row2):
    """Returns the column numbers in which both rows contain 1's"""
    column_numbers = []
    for i, _ in enumerate(df[0]):
        if (df[row1][i] == 1) and (df[row2][i] == 1):
            column_numbers.append(i)
    return column_numbers
compare_rows(df,0,1) produces the output:
[1,3]
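For comparison, here is a sketch that applies the np.where idea from the first answer directly to the question's sample matrix; common_columns is a helper name introduced here purely for illustration:
import numpy as np
import pandas as pd

# The question's sample matrix as a DataFrame
df = pd.DataFrame([[1, 1, 0, 1, 1],
                   [0, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1],
                   [1, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0],
                   [1, 0, 0, 0, 1]])

def common_columns(df, row1, row2, value=1):
    # Column positions where both specified rows equal the given value
    return np.where((df.iloc[row1] == value) & (df.iloc[row2] == value))[0]

print(common_columns(df, 0, 1))  # [1 3]
print(common_columns(df, 1, 2))  # [1]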
