I was looking into this post, which almost solved my problem. However, in my case I want to work based on the 2nd level of the df's columns, while trying not to specify the 1st level column names explicitly.
Borrowing the original dataframe:
import pandas as pd

df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
                   ('A','b'): [0,1,2,3,-1],
                   ('B','a'): [-20,-10,0,10,20],
                   ('B','b'): [-200,-100,0,100,200]})
##df
    A        B
    a   b    a     b
0  -1   0  -20  -200
1  -1   1  -10  -100
2   0   2    0     0
3  10   3   10   100
4  12  -1   20   200
I want to assign NA to both columns a and b wherever b < 0. I was selecting the rows with df.xs('b',axis=1,level=1) < 0, but then I cannot actually perform the replacement. Since my 1st level names vary, the indexing there cannot be based on A and B explicitly, but possibly it could go through df.columns.values?
The desired output would be:
##df
    A       B
    a   b   a    b
0  -1   0  NA   NA
1  -1   1  NA   NA
2   0   2   0    0
3  10   3  10  100
4  NA  NA  20  200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with a mask that has been reindexed, via reindex with level=0, to the same index and column names as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
       A      B
0  False   True
1  False   True
2  False  False
3  False  False
4   True  False
print (mask.reindex(columns = df.columns, level=0))
       A             B
       a      b      a      b
0  False  False   True   True
1  False  False   True   True
2  False  False  False  False
3  False  False  False  False
4   True   True  False  False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
      A          B
      a    b     a      b
0  -1.0  0.0   NaN    NaN
1  -1.0  1.0   NaN    NaN
2   0.0  2.0   0.0    0.0
3  10.0  3.0  10.0  100.0
4   NaN  NaN  20.0  200.0
Edit by OP: I had asked in the comments how to handle multiple conditions (e.g. df.xs('b',axis=1,level=1) < 0 OR df.xs('b',axis=1,level=1) being NA). @Jezrael kindly indicated that in that case I should use
mask = (df.xs('b',axis=1,level=1) < 0) | df.xs('b',axis=1,level=1).isnull()
(note the parentheses around the comparison: | binds more tightly than <, so the unparenthesized form would not evaluate as intended).
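Putting it together with the reindex step above, a minimal sketch of the full multi-condition replacement, assuming the same df as in this question:
mask = (df.xs('b', axis=1, level=1) < 0) | df.xs('b', axis=1, level=1).isnull()
# broadcast the two-column mask across both levels, then blank out the matches
df = df.mask(mask.reindex(columns=df.columns, level=0))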
Related
I have a DataFrame with a duplicate column named Weather, as seen in this picture of the dataframe. One of the two contains NaN values; that is the one I want to remove from the DataFrame.
I tried this method
data_cleaned4.drop('Weather', axis=1)
It dropped both columns, as it should. I then tried to pass a condition to the drop method, but I couldn't; it shows me an error.
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how I can remove this column? Remember that the second-to-last column contains the NaN values, not the last one.
A general solution: df.isnull().any(axis=0).values flags which columns contain any NaN values, and df.columns.duplicated(keep=False) marks all duplicates as True; negating their combination gives the columns you want to retain.
General Solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input
   A  B  C    C    A
0  1  1  1  3.0  NaN
1  1  1  1  2.0  1.0
2  2  3  4  NaN  2.0
3  1  1  1  4.0  1.0
Output
   A  B  C
0  1  1  1
1  1  1  1
2  2  3  4
3  1  1  1
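For reference, a self-contained sketch reproducing this input; the literal frame below is my reconstruction from the table above:
import numpy as np
import pandas as pd

# duplicate labels 'C' and 'A'; the later copies contain the NaNs
df = pd.DataFrame([[1, 1, 1, 3.0, np.nan],
                   [1, 1, 1, 2.0, 1.0],
                   [2, 3, 4, np.nan, 2.0],
                   [1, 1, 1, 4.0, 1.0]],
                  columns=['A', 'B', 'C', 'C', 'A'])

# drop columns that are both duplicated and contain NaNs
print (df.loc[:, ~(df.isnull().any(axis=0).values & df.columns.duplicated(keep=False))])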
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
& (df.columns == 'C'))]
Input
   A  B  C    C    A
0  1  1  1  3.0  NaN
1  1  1  1  2.0  1.0
2  2  3  4  NaN  2.0
3  1  1  1  4.0  1.0
Output
   A  B  C    A
0  1  1  1  NaN
1  1  1  1  1.0
2  2  3  4  2.0
3  1  1  1  1.0
Due to the duplicate names, a plain label-based drop removes both columns, so find the offending position first and select around it with iloc (the last line of the code below). Also note that drop returns a copy, so calling it without assigning the result back has no effect:
checkone = data_cleaned4.iloc[:, -1].isna().any()
if checkone:
    i = data_cleaned4.shape[1] - 1   # the last column holds the NaNs
else:
    i = data_cleaned4.shape[1] - 2   # the second-to-last column holds the NaNs
data_cleaned4 = data_cleaned4.iloc[:, [j for j in range(data_cleaned4.shape[1]) if j != i]]
Without a testable sample, and assuming you don't have NaNs anywhere else in your dataframe,
df = df.dropna(axis=1)
should work
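A minimal illustration of that assumption (a toy frame with a duplicate Weather column and hypothetical values):
import numpy as np
import pandas as pd

df = pd.DataFrame([[20.1, np.nan], [23.4, np.nan]],
                  columns=['Weather', 'Weather'])
# dropna(axis=1) removes every column containing at least one NaN
print (df.dropna(axis=1))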
I have the following code:
def check(df, columns):
    for col in columns:
        if df[col].sum(axis=0) == 0:
            return True
    return False
This code goes through the columns of df and checks whether the sum of all values in a column equals 0 (i.e. all values are 0, while ignoring empty fields).
However, it fails if one of the columns in columns is non-numeric. How can I make the df[col].sum(axis=0) == 0 check apply only to numeric columns, while still ignoring empty rows if any?
Use:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[0,0,np.nan,0,-0,0],
    'C':[7,8,9,4,2,3],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
print (df)
   A    B  C  E  F
0  a  0.0  7  5  a
1  b  0.0  8  3  a
2  c  NaN  9  6  a
3  d  0.0  4  9  b
4  e  0.0  2  2  b
5  f  0.0  3  4  b
def check(df, columns):
    return df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any()
print (check(df, df.columns))
True
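And restricting the check to columns with no all-zero candidate returns False:
print (check(df, ['C', 'E']))
False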
Another alternative that tests missing values directly, chaining the boolean DataFrames with | (bitwise OR):
def check(df, columns):
    df1 = df[columns].select_dtypes(np.number)
    return (df1.eq(0) | df1.isna()).all().any()
Explanation:
First select the columns specified in the list (in the sample, all of them) and keep only the numeric ones with DataFrame.select_dtypes:
print (df[columns].select_dtypes(np.number))
     B  C  E
0  0.0  7  5
1  0.0  8  3
2  NaN  9  6
3  0.0  4  9
4  0.0  2  2
5  0.0  3  4
Then replace missing values with 0 using DataFrame.fillna:
print (df[columns].select_dtypes(np.number).fillna(0))
     B  C  E
0  0.0  7  5
1  0.0  8  3
2  0.0  9  6
3  0.0  4  9
4  0.0  2  2
5  0.0  3  4
Compare against 0 with DataFrame.eq (the method form of ==):
print (df[columns].select_dtypes(np.number).fillna(0).eq(0))
      B      C      E
0  True  False  False
1  True  False  False
2  True  False  False
3  True  False  False
4  True  False  False
5  True  False  False
Test whether each column contains only True values with DataFrame.all:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all())
B     True
C    False
E    False
dtype: bool
And finally, test whether at least one value in the Series is True with Series.any:
print (df[columns].select_dtypes(np.number).fillna(0).eq(0).all().any())
True
You can try this condition as well:
if df[col].dtype == int or df[col].dtype == float:
    # your code
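Folded back into the original loop, a hedged sketch: I use pd.api.types.is_numeric_dtype instead of the int/float comparisons so nullable and unsigned dtypes are also caught, and I test the actual values rather than the sum (a column like [-5, 5] also sums to 0):
import pandas as pd

def check(df, columns):
    for col in columns:
        # skip non-numeric columns entirely
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue
        # all non-missing values are 0, and there is at least one such value
        if df[col].notna().any() and df[col].dropna().eq(0).all():
            return True
    return False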
I would like to filter and replace: for the values that are lower or higher than zero and not NaN, I would like to set them to one, and set all the others to zero.
mask = (ts[x] > 0) | (ts[x] < 0)
ts[mask] = 1
ts[ts[x] == 1]
I did this and it works, but I still have to deal with the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
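Since the question also asked about where, here is an equivalent hedged sketch with Series.where: keep 0 and NaN as they are, write 1 everywhere else, then fill the NaNs with 0:
s = df['asset.relativeSetpoint.350']
# values where the condition holds (0 or NaN) are kept, the rest become 1
print (s.where(s.eq(0) | s.isna(), 1).fillna(0).astype(int))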
How about using apply? Note that NaN != 0 evaluates to True, so NaNs need an explicit check to end up as 0:
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if pd.notna(x) and x != 0 else 0)
I have a pandas dataframe df of the form
index  result1  result2  result3
0      s        u        s
1      u        s        u
2      s
3      s        s        u
I would like to add another column that contains the number of times s occurs in that row, for example:
index  result1  result2  result3  count
0      s        u        s        2
1      u        s        u        1
2      s                          1
3      s        s        u        2
I have tried the following code:
cols = ['result1','result2','result3']
df[cols].count(axis=1)
but this returns
0    3
1    3
2    1
3    3
so this counts the number of elements. I then tried
df[df[cols]=='s'].count(axis=1)
but this returned the following error: "Could not compare ['s'] with block values"
Any help would be greatly appreciated
For me, casting to string with astype works; the numeric and NaN columns are what trigger your error:
print (df)
   index  result1  result2  result3  result4
0      0        s        u        7      NaN
1      1        u        s        7      NaN
2      2        s      NaN        8      NaN
3      3        s        s        7      NaN
4      4      NaN      NaN        2      NaN
print (df.dtypes)
index        int64
result1     object
result2     object
result3      int64
result4    float64
dtype: object
cols = ['result1','result2','result3','result4']
df['count'] = df[df[cols].astype(str) == 's'].count(axis=1)
print (df)
   index  result1  result2  result3  result4  count
0      0        s        u        7      NaN      1
1      1        u        s        7      NaN      1
2      2        s      NaN        8      NaN      1
3      3        s        s        7      NaN      2
4      4      NaN      NaN        2      NaN      0
Or sum only the True values from the boolean mask:
print (df[cols].astype(str) == 's')
   result1  result2  result3  result4
0     True    False    False    False
1    False     True    False    False
2     True    False    False    False
3     True     True    False    False
4    False    False    False    False
cols = ['result1','result2','result3','result4']
df['count'] = (df[cols].astype(str) =='s').sum(axis=1)
print (df)
   index  result1  result2  result3  result4  count
0      0        s        u        7      NaN      1
1      1        u        s        7      NaN      1
2      2        s      NaN        8      NaN      1
3      3        s        s        7      NaN      2
4      4      NaN      NaN        2      NaN      0
Another nice solution is from Nickil Maveli - use numpy:
df['count'] = (df[cols].values=='s').sum(axis=1)
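For completeness, a self-contained sketch; the frame below is reconstructed from the printed sample, so treat the literals as an assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3, 4],
                   'result1': ['s', 'u', 's', 's', np.nan],
                   'result2': ['u', 's', np.nan, 's', np.nan],
                   'result3': [7, 7, 8, 7, 2],
                   'result4': [np.nan] * 5})

cols = ['result1', 'result2', 'result3', 'result4']
# object ndarray comparison: NaN == 's' is simply False, so no casting is needed
df['count'] = (df[cols].values == 's').sum(axis=1)
print (df)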
I have a MWE that can be reproduced with the following code:
import pandas as pd
a = pd.DataFrame([[1,2],[3,4]], columns=['A', 'B'])
b = pd.DataFrame([[True,False],[False,True]], columns=['A', 'B'])
Which creates the following dataframes:
In [8]: a
Out[8]:
   A  B
0  1  2
1  3  4
In [9]: b
Out[9]:
       A      B
0   True  False
1  False   True
My question is, how can I change the values of dataframe a based on the boolean values in dataframe b?
Say, for example, I wanted to put NaN values in dataframe a wherever there's an instance of False in dataframe b?
If you need to replace False with NaN:
print (a[b])
     A    B
0  1.0  NaN
1  NaN  4.0
or:
print (a.where(b))
     A    B
0  1.0  NaN
1  NaN  4.0
and if you need to replace True with NaN:
print (a[~b])
     A    B
0  NaN  2.0
1  3.0  NaN
or:
print (a.mask(b))
     A    B
0  NaN  2.0
1  3.0  NaN
You can also use where or mask with a scalar replacement value:
print (a.where(b, 7))
   A  B
0  1  7
1  7  4
print (a.mask(b, 7))
   A  B
0  7  2
1  3  7
print (a.where(b, 'TEST'))
      A     B
0     1  TEST
1  TEST     4
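The same scalar replacement can also be phrased with numpy.where if you prefer working at the ndarray level (a sketch, assuming a and b from above):
import numpy as np

# keep a's value where b is True, otherwise write the scalar 7 (same as a.where(b, 7))
print (pd.DataFrame(np.where(b, a, 7), index=a.index, columns=a.columns))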