Conditional counting across a row in pandas when matching a string - python

I have a pandas DataFrame of the form, df =
index  result1  result2  result3
0      s        u        s
1      u        s        u
2      s
3      s        s        u
I would like to add another column that contains the number of times 's' occurs in that row, for example:
index  result1  result2  result3  count
0      s        u        s        2
1      u        s        u        1
2      s                          1
3      s        s        u        2
I have tried the following code:
cols = ['result1','result2','result3']
df[cols].count(axis=1)
but this returns
0    3
1    3
2    1
3    3
So this counts the number of elements. I then tried
df[df[cols]=='s'].count(axis=1)
but this returned the following error: "Could not compare ['s'] with block values"
Any help would be greatly appreciated

For me, casting to string with astype works; the numeric and NaN columns are what cause your error:
print (df)
   index result1 result2  result3  result4
0      0       s       u        7      NaN
1      1       u       s        7      NaN
2      2       s     NaN        8      NaN
3      3       s       s        7      NaN
4      4     NaN     NaN        2      NaN
print (df.dtypes)
index        int64
result1     object
result2     object
result3      int64
result4    float64
dtype: object
cols = ['result1','result2','result3','result4']
df['count'] = df[df[cols].astype(str) == 's'].count(axis=1)
print (df)
   index result1 result2  result3  result4  count
0      0       s       u        7      NaN      1
1      1       u       s        7      NaN      1
2      2       s     NaN        8      NaN      1
3      3       s       s        7      NaN      2
4      4     NaN     NaN        2      NaN      0
Or sum only the True values of the boolean mask:
print (df[cols].astype(str) == 's')
  result1 result2 result3 result4
0    True   False   False   False
1   False    True   False   False
2    True   False   False   False
3    True    True   False   False
4   False   False   False   False
cols = ['result1','result2','result3','result4']
df['count'] = (df[cols].astype(str) =='s').sum(axis=1)
print (df)
   index result1 result2  result3  result4  count
0      0       s       u        7      NaN      1
1      1       u       s        7      NaN      1
2      2       s     NaN        8      NaN      1
3      3       s       s        7      NaN      2
4      4     NaN     NaN        2      NaN      0
Another nice solution, from Nickil Maveli, is to use NumPy:
df['count'] = (df[cols].values=='s').sum(axis=1)
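For reference, here is a minimal, self-contained version of the astype(str) approach (the DataFrame below is reconstructed from the question, so treat it as an illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'result1': ['s', 'u', 's', 's'],
                   'result2': ['u', 's', np.nan, 's'],
                   'result3': ['s', 'u', np.nan, 'u']})
cols = ['result1', 'result2', 'result3']
# NaN becomes the string 'nan' after astype(str), so it never matches 's'
df['count'] = (df[cols].astype(str) == 's').sum(axis=1)
print(df['count'].tolist())   # [2, 1, 1, 2], matching the desired output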

Related

Setting the last n non-NaN values per group to NaN

I have a DataFrame with (several) grouping variables and (several) value variables. My goal is to set the last n non-NaN values to NaN. So let's take a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'value': [1, 2, np.nan, 9, 8]})
df
Out[1]:
   id  value
0   1    1.0
1   1    2.0
2   1    NaN
3   2    9.0
4   2    8.0
The desired result for n=1 would look like the following:
Out[53]:
   id  value
0   1    1.0
1   1    NaN
2   1    NaN
3   2    9.0
4   2    NaN
Use groupby() with cumcount():
N=1
groups = df.loc[df['value'].notna()].groupby('id')
enum = groups.cumcount()
sizes = groups['value'].transform('size')
df['value'] = df['value'].where(enum < sizes - N)
Output:
   id  value
0   1    1.0
1   1    NaN
2   1    NaN
3   2    9.0
4   2    NaN
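To see why this keeps everything except the last N non-NaN values per group, these are the intermediates for the example above (a sketch of the expected values):
print(enum)               # index 0 -> 0, 1 -> 1, 3 -> 0, 4 -> 1: position among the non-NaN rows of each id group
print(sizes)              # 2 for every non-NaN row, since each id group has two non-NaN values
print(enum < sizes - N)   # True only for rows 0 and 3, so where() keeps those and sets the rest to NaN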
You can check a reversed cumsum of notna after groupby, which gives, for each row, how many non-NaN values remain from that row to the end of its group:
df['value'].where(df['value'].notna().iloc[::-1].groupby(df['id']).cumsum()>1,inplace=True)
df
Out[86]:
   id  value
0   1    1.0
1   1    NaN
2   1    NaN
3   2    9.0
4   2    NaN
One option: create a reversed cumcount on the non-NA values:
N = 1
m = (df
     .loc[df['value'].notna()]
     .groupby('id')
     .cumcount(ascending=False)
     .lt(N)
     )
df.loc[m[m].index, 'value'] = np.nan
Similar approach with boolean masking:
m = df['value'].notna()
df['value'] = df['value'].mask(m[::-1].groupby(df['id']).cumsum().le(N))
output:
   id  value
0   1    1.0
1   1    NaN
2   1    NaN
3   2    9.0
4   2    NaN
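A small helper wrapping the boolean-mask approach above, in case it needs to be reused (a sketch; the function name and signature are illustrative, not from the original answers):
import numpy as np
import pandas as pd

def nan_last_n(df, group_col, value_col, n=1):
    # Count non-NaN values from the bottom of each group upwards; the last n of them are masked to NaN.
    m = df[value_col].notna()
    return df[value_col].mask(m[::-1].groupby(df[group_col]).cumsum().le(n))

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'value': [1, 2, np.nan, 9, 8]})
df['value'] = nan_last_n(df, 'id', 'value', n=1)
print(df['value'].tolist())   # [1.0, nan, nan, 9.0, nan]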

How to make np.where write True instead of 1.0?

I want to fill a column with True and NaN values:
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = np.where(df['col1'].isin(my_list), True, np.NaN)
print (df)
It prints:
    col1  col2
0      0   NaN
1      1   1.0
2      2   1.0
3      3   1.0
4      4   1.0
5      5   1.0
6      6   NaN
7      7   NaN
8      8   NaN
9      9   NaN
10    10   NaN
But it is very important for me to print bool value True, not float number 1.0. This column interacts with other columns. They are bool, so it must be bool too.
I know I can change it with replace function. But my DataFrame is very large. I cannot waste time. Is there a simple option to do it?
The code below will solve your problem. np.where gives you 1.0 instead of True because NumPy has to build one array holding both True and np.NaN; it upcasts everything to float, so True becomes 1.0.
Code
import numpy as np
import pandas as pd
my_list = [1,2,3,4,5]
df = pd.DataFrame({'col1' : [0,1,2,3,4,5,6,7,8,9,10]})
df['col2'] = df['col1'].apply(lambda x: True if x in my_list else np.NaN)
print (df)
Results
    col1  col2
0      0   NaN
1      1  True
2      2  True
3      3  True
4      4  True
5      5  True
6      6   NaN
7      7   NaN
8      8   NaN
9      9   NaN
10    10   NaN
Use Nullable Boolean data type:
df['col2'] = pd.Series(np.where(df['col1'].isin(my_list), True, np.NaN), dtype='boolean')
print (df)
    col1  col2
0      0  <NA>
1      1  True
2      2  True
3      3  True
4      4  True
5      5  True
6      6  <NA>
7      7  <NA>
8      8  <NA>
9      9  <NA>
10    10  <NA>
You can also convert the values afterwards:
df.col2 = df.col2.apply(lambda x: True if x==1.0 else x)
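Note that the apply-based answers leave col2 as an object column holding a mix of True and NaN. If the goal is to compare it with other bool columns, a follow-up conversion to the nullable dtype may help (a sketch, assuming pandas >= 1.0):
df['col2'] = df['col2'].astype('boolean')   # NaN becomes <NA>, True stays True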

How to sum all values in a column only if they are numerical?

I have the following code:
def check(df, columns):
    for col in columns:
        if df[col].sum(axis=0) == 0:
            return True
    return False
This code goes through the columns of df and checks if the sum of all values in a column is equal to 0 (i.e. all values are 0, while ignoring empty fields).
However, it fails if one of the columns in columns is non-numeric. How can I add a condition so that df[col].sum(axis=0) == 0 is only evaluated on numeric columns, and still ignore empty rows if any?
Use:
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[0,0,np.nan,0,-0,0],
    'C':[7,8,9,4,2,3],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
print (df)
   A    B  C  E  F
0  a  0.0  7  5  a
1  b  0.0  8  3  a
2  c  NaN  9  6  a
3  d  0.0  4  9  b
4  e  0.0  2  2  b
5  f  0.0  3  4  b
def check(df, columns):
    return df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any()
print (check(df, df.columns))
True
Another alternative: test for missing values and chain the boolean DataFrames with | (bitwise OR):
def check(df, columns):
    df1 = df[columns].select_dtypes(np.number)
    return (df1.eq(0) | df1.isna()).all().any()
Explanation:
First select the columns specified in the list (in the sample, all columns) and keep only the numeric ones with DataFrame.select_dtypes:
print (df[columns].select_dtypes(np.number))
     B  C  E
0  0.0  7  5
1  0.0  8  3
2  NaN  9  6
3  0.0  4  9
4  0.0  2  2
5  0.0  3  4
Then replace missing values by 0 with DataFrame.fillna:
print (df[columns].select_dtypes(np.number).fillna(0))
     B  C  E
0  0.0  7  5
1  0.0  8  3
2  0.0  9  6
3  0.0  4  9
4  0.0  2  2
5  0.0  3  4
Compare with 0 by DataFrame.eq (the == operator):
print (df[columns].select_dtypes(np.number).fillna(0).eq(0))
      B      C      E
0  True  False  False
1  True  False  False
2  True  False  False
3  True  False  False
4  True  False  False
5  True  False  False
Test whether each column contains only True values with DataFrame.all:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all())
B     True
C    False
E    False
dtype: bool
And last, test whether at least one value in the Series is True with Series.any:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any())
True
You can try this condition as well:
if df[col].dtype == int or df[col].dtype == float:
    # your code
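The dtype check above only matches the default int and float dtypes. A broader check (a sketch using pandas' public type-inspection helper) would be:
from pandas.api.types import is_numeric_dtype

if is_numeric_dtype(df[col]):
    # also covers int8/int32, float32, and the nullable numeric dtypes
    ...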

Replacing values in a 2nd level column on MultiIndex df in Pandas

I was looking into this post which almost solved my problem. However, in my case, I want to work based on the 2nd level of the df, but trying not to specify my 1st level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
                   ('A','b'): [0,1,2,3,-1],
                   ('B','a'): [-20,-10,0,10,20],
                   ('B','b'): [-200,-100,0,100,200]})
##df
    A       B
    a   b   a     b
0  -1   0  -20  -200
1  -1   1  -10  -100
2   0   2    0     0
3  10   3   10   100
4  12  -1   20   200
I want to assign NA to all columns a and b where b < 0. I was selecting them based on df.xs('b', axis=1, level=1) < 0, but then I cannot actually perform the replacement. However, I have varying 1st-level names, so the indexing there cannot be based on A and B explicitly, but possibly through df.columns.values?
The desired output would be
##df
    A       B
    a   b   a    b
0  -1   0   NA   NA
1  -1   1   NA   NA
2   0   2    0    0
3  10   3   10  100
4  NA  NA   20  200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with a mask that is reindexed (DataFrame.reindex with level=0) to the same index and column names as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
       A      B
0  False   True
1  False   True
2  False  False
3  False  False
4   True  False
print (mask.reindex(columns = df.columns, level=0))
       A             B
       a      b      a      b
0  False  False   True   True
1  False  False   True   True
2  False  False  False  False
3  False  False  False  False
4   True   True  False  False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
      A          B
      a    b     a      b
0  -1.0  0.0   NaN    NaN
1  -1.0  1.0   NaN    NaN
2   0.0  2.0   0.0    0.0
3  10.0  3.0  10.0  100.0
4   NaN  NaN  20.0  200.0
Edit by OP: I had asked in the comments how to combine multiple conditions (e.g. df.xs('b',axis=1,level=1) < 0 OR df.xs('b',axis=1,level=1) being NA). @jezrael kindly indicated that if I wanted to do this, I should use
mask = (df.xs('b', axis=1, level=1) < 0) | df.xs('b', axis=1, level=1).isnull()
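Applying the combined mask works the same way as above (a sketch; note the parentheses around the comparison, since | binds more tightly than < in Python):
df = df.mask(mask.reindex(columns=df.columns, level=0))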

How can I non-iteratively place a NaN in a DataFrame column if there is a corresponding NaN in any other columns?

Given a 3-column DataFrame, df:
     a    b      c
0  NaN    a   True
1    1    b   True
2    2    c  False
3    3  NaN  False
4    4    e   True
[5 rows x 3 columns]
I would like to place a NaN in column c for each row where a NaN exists in any other column. My current approach is as follows:
for col in df:
    df['c'][pd.np.isnan(df[col])] = pd.np.nan
I strongly suspect that there is a way to do this via logical indexing instead of iterating through columns as I am currently doing.
How could this be done?
Thank you!
If you don't care about the bool/float issue, I propose:
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
     a    b    c
0  NaN    a  NaN
1    1    b    1
2    2    c    0
3    3  NaN  NaN
4    4    e    1
[5 rows x 3 columns]
If you really do, then starting again from your frame df you could:
>>> df["c"] = df["c"].astype(object)
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
     a    b      c
0  NaN    a    NaN
1    1    b   True
2    2    c  False
3    3  NaN    NaN
4    4    e   True
[5 rows x 3 columns]
df.c[df.ix[:, :'c'].apply(lambda r: any(r.isnull()), axis=1)] = np.nan
Note that you may need to change the type of column c to float or you'll get an error about being unable to assign nan to integer column.
Filter and select the rows where there is NaN for either 'a' or 'b' and assign NaN to 'c':
In [18]:
df.ix[pd.isnull(df.a) | pd.isnull(df.b), 'c'] = np.nan
In [19]:
df
Out[19]:
     a    b    c
0  NaN    a  NaN
1    1    b    1
2    2    c    0
3    3    d    0
4    4  NaN  NaN
[5 rows x 3 columns]
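The .ix answers above predate modern pandas; .ix was removed in pandas 1.0. The same selection written with .loc would look roughly like this (a sketch):
import numpy as np

df.loc[df['a'].isna() | df['b'].isna(), 'c'] = np.nan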
