I need to change a column to either True or False based on whether the value is NaN.
Here is the df.
missing
0 NaN
1 b
2 NaN
4 y
5 NaN
would become
missing
0 False
1 True
2 False
4 True
5 False
Yes, I can do a loop, but there has to be a simpler way to do it in a single line of code.
Thank you.
You can do
df['missing'].notna() # or notnull()
You need to overwrite the column values with the boolean result of notna() applied to the same column:
df['missing'] = df['missing'].notna()
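A minimal end-to-end sketch (the values 'b' and 'y' and the index gap at 3 are taken from the sample above):
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'missing': [np.nan, 'b', np.nan, 'y', np.nan]},
                  index=[0, 1, 2, 4, 5])

# True where a value is present, False where it is NaN
df['missing'] = df['missing'].notna()
print(df)
#    missing
# 0    False
# 1     True
# 2    False
# 4     True
# 5    False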
Related
I have a dataframe that looks like this:
import pandas as pd, numpy as np
df = pd.DataFrame({'Fill' : [0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 1]})
df['flag'] = (df['Fill'] > 0)
df = df.replace(0,np.nan)
df
Fill flag
0 NaN False
1 NaN False
2 NaN False
3 3.0 True
4 NaN False
5 NaN False
6 NaN False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
My goal is to backward fill with bfill() and pass a dynamic limit based on the value of the cells in the Fill column. I have also created a flag column, which is True for any cell > 0. I did this to guard against values in the Fill column becoming floats as they are filled, because I didn't want to apply the logic to those cells, which started as NaN. This is what I have tried:
df['Fill'] = np.where((df['Fill'].notnull()) & (df.flag==True),
df['Fill'].apply(lambda x: x.bfill(limit=int(x-1))),
df['Fill'])
I am receiving an error: AttributeError: 'float' object has no attribute 'bfill'. I thought that by filtering for the relevant rows with np.where I could get around the NaN values, and that with int(x-1) I could avoid the float issue. I also tried something similar with the np.where inside the .apply. Any help is much appreciated. See expected output below:
expected output:
Fill flag
0 NaN False
1 3.0 False
2 3.0 False
3 3.0 True
4 NaN False
5 NaN False
6 2.0 False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
You can create groups covering each run of missing values together with the following non-missing value, then backfill within each group in a custom function. The if-else is necessary to avoid the error ValueError: Limit must be greater than 0:
m = df['Fill'].notnull() & df.flag
g = m.iloc[::-1].cumsum().iloc[::-1]
f = lambda x: x.bfill(limit=int(x.iat[-1]-1)) if x.iat[-1] > 1 else x
df['Fill'] = df.groupby(g)['Fill'].apply(f)
print (df)
Fill flag
0 NaN False
1 3.0 False
2 3.0 False
3 3.0 True
4 NaN False
5 NaN False
6 2.0 False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
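To see how the reversed cumulative sum forms the groups, here is a small sketch (same sample data and variable names as above) that prints the intermediate grouping series g:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Fill': [0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 1]})
df['flag'] = df['Fill'] > 0
df = df.replace(0, np.nan)

# True only on the rows holding the limits (3, 2 and 1)
m = df['Fill'].notnull() & df.flag

# Reverse, cumulatively count the True values, reverse back:
# each row is labelled with the group of the next non-missing value below it
g = m.iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())   # [3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1]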
I'm trying to find the difference between two Pandas MultiIndex objects of different shapes. I've used:
df1.index.difference(df2)
and receive
TypeError: '<' not supported between instances of 'float' and 'str'
My indices are str and datetime, but I suspect there are NaNs hidden there (the floats). Hence my question:
What's the best way to find the NaNs somewhere in the MultiIndex? How does one iterate through the levels and names? Can I use something like isna()?
Many functions are not implemented for MultiIndex, as the NotImplementedError shown below demonstrates.
You need to convert the MultiIndex to a DataFrame with MultiIndex.to_frame first:
#W-B sample
idx=pd.MultiIndex.from_tuples([(np.nan,1),(1,1),(1,2)])
print (idx.to_frame())
         0  1
NaN 1  NaN  1
1   1  1.0  1
    2  1.0  2
print (idx.to_frame().isnull())
           0      1
NaN 1   True  False
1   1  False  False
    2  False  False
Or use DataFrame constructor:
print (pd.DataFrame(idx.tolist()))
0 1
0 NaN 1
1 1.0 1
2 1.0 2
Because:
print (pd.isnull(idx))
NotImplementedError: isna is not defined for MultiIndex
EDIT:
To check for at least one True per row, use any with boolean indexing:
df = idx.to_frame()
print (df[df.isna().any(axis=1)])
         0  1
NaN 1  NaN  1
It is also possible to filter the MultiIndex itself, but then it is necessary to add MultiIndex.remove_unused_levels:
print (idx[idx.to_frame().isna().any(axis=1)].remove_unused_levels())
MultiIndex(levels=[[], [1]],
labels=[[-1], [0]])
We can use reset_index, then isna:
idx=pd.MultiIndex.from_tuples([(np.nan,1),(1,1),(1,2)])
df=pd.DataFrame([1,2,3],index=idx)
df.reset_index().filter(like='level_').isna()
Out[304]:
level_0 level_1
0 True False
1 False False
2 False False
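Building on either approach, a small sketch (same toy index as above) that drops the rows whose MultiIndex contains a NaN, which is usually the end goal:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(np.nan, 1), (1, 1), (1, 2)])
df = pd.DataFrame([1, 2, 3], index=idx)

# True for rows whose index tuple contains at least one NaN
bad = df.index.to_frame().isna().any(axis=1).values
print(df[~bad])   # keeps only the (1, 1) and (1, 2) rows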
Given a DataFrame with possible NaN values, I'd like to determine which rows have NaN as a value but only for certain columns.
I believe the following should work...
my_df.query('colA.isnull() | colZ.isnull() | colN.isnull()')
However, I am coming across the following exception
TypeError: unhashable type: 'numpy.ndarray'
Now, I've determined that I can pass the param engine='python' to get the query to work. But I'd like to use the optimized engine numexpr.
Is such a query possible? Or do I have to iterate over each column I wish to filter on, one at a time?
Thanks.
One approach is to build a boolean mask that picks out the row(s) on which any of your conditions is satisfied.
# Method 1: build the boolean mask using bitwise operations
mask = ((df['colA'].isnull()) |
(df['colZ'].isnull()) |
(df['colN'].isnull()))
null_rows = df[mask]
# Method 2: pick desired columns from an element-wise boolean mask of null flags
mask = df.isnull()[['colA', 'colZ', 'colN']].any(axis=1)
null_rows = df[mask]
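A minimal, self-contained example of the second method (the column names come from the question; the data is invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [1.0, np.nan, 3.0],
                   'colZ': [np.nan, 2.0, 3.0],
                   'colN': [1.0, 2.0, 3.0]})

mask = df.isnull()[['colA', 'colZ', 'colN']].any(axis=1)
print(df[mask])   # rows 0 and 1, each of which has a NaN in one of the columns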
You can slice the columns and use df.isna().
df (generated using code I copied from somewhere else on SO earlier today, sorry I forget where, but thank you!):
0 1 2 3 4
0 0.763847 1.343149 0.096778 NaN 0.532322
1 -0.364227 -0.560027 NaN NaN NaN
2 -0.556234 0.384970 0.476016 NaN -0.385282
3 0.604560 -0.390024 -1.697762 1.207321 0.829520
4 NaN NaN 0.754011 2.137359 -0.594698
5 0.513925 0.651509 -1.500094 NaN -0.556604
6 NaN NaN -1.388030 NaN NaN
7 NaN -0.634743 0.024213 -0.439684 0.765820
8 0.815948 0.545350 -0.823986 NaN 1.655538
9 0.687386 1.477326 NaN 0.207531 0.571499
output of df.isna():
0 1 2 3 4
0 False False False True False
1 False False True True True
2 False False False True False
3 False False False False False
4 True True False False False
5 False False False True False
6 True True False True True
7 True False False False False
8 False False False True False
9 False False True False False
Row-wise operations:
df.isna().sum(axis=1)
0 1
1 3
2 1
3 0
4 2
5 1
6 4
7 1
8 1
9 1
Column-wise:
df.isna().sum()
0 3
1 2
2 2
3 6
4 2
To slice the df, use something like df.loc[:, 0:2].isna(). You can read up on slicing, .loc, and .iloc here: https://pandas.pydata.org/pandas-docs/stable/indexing.html
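Putting slicing and isna together, a short sketch (hypothetical data with the same integer column labels 0-4 as the sample above) that keeps only the rows with a NaN somewhere in the first three columns:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.5, np.nan, 1.2, 0.3, np.nan],
                   [1.1, 0.4, np.nan, 0.9, 0.2],
                   [0.7, 0.8, 0.1, 0.6, 0.5]])

mask = df.loc[:, 0:2].isna().any(axis=1)  # NaN anywhere in columns 0-2?
print(df[mask])                           # rows 0 and 1 qualify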
I would like to delete rows in which all values are between 10 and 25. My sample dataframe will look like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in the dataframe; otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and return the results to a new column called 'keep'. You can then use this column to drop rows that you don't need.
import pandas as pd
l = [[1,2,3],[4,5,16],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis = 1)
The any() function returns True if any element of the iterable satisfies the condition; wrapping it in int() converts that boolean to 1 or 0.
See the Python documentation on how any() works.
The apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 16 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
df<10 will return a boolean df
| is the OR operator
.any(axis=1) returns True for every row that has at least one True element; see the documentation
df.loc[] then filters the dataframe based on that boolean Series
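A short, self-contained sketch of this one-liner on the question's sample data:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
                  columns=['a', 'b', 'c'])

# Keep rows that contain at least one value below 10 or above 25
kept = df.loc[((df < 10) | (df > 25)).any(axis=1)]
print(kept)
#     a   b   c
# 0   1   2   3
# 1   4   5  16
# 3  26  50  65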
I really like using masking for stuff like this; it's clean, so you can go back and read your code. It's faster than using .apply, which is effectively a for loop. It also avoids setting-with-copy warnings.
This uses boolean indexing like Prageeth's answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don't have to modify the original dataframe or create a new one and just use df[mask] wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat thing about using masking is you can invert it too with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False True
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain "NaN" values.
Let say I have the following df:
A B C
0 1 1 1
1 1 NaN 1
2 1 NaN NaN
3 NaN 1 1
I tried to use this syntax, but it does not work. Do you know what I am doing wrong?
df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]
I get this Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What should I do to get the following output?
A B C
0 1 1 1
Any idea?
You can first select the subset of columns with df[['A','B','C']], then apply notnull and check whether all values in each row of the mask are True:
print (df[['A','B','C']].notnull())
A B C
0 True True True
1 True False True
2 True False False
3 False True True
print (df[['A','B','C']].notnull().all(1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[df[['A','B','C']].notnull().all(1)])
A B C
0 1.0 1.0 1.0
Another solution, from Ayhan's comment, uses dropna:
print (df.dropna(subset=['A', 'B', 'C']))
A B C
0 1.0 1.0 1.0
which is the same as:
print (df.dropna(subset=['A', 'B', 'C'], how='any'))
and means: drop all rows where at least one of these columns is NaN.
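For contrast, a small sketch (same toy data as the question) showing how='all', which only drops rows where every listed column is NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, np.nan],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [1, 1, np.nan, 1]})

# how='any' (the default) keeps only row 0;
# how='all' keeps every row here, because no row is NaN in all of A, B and C
print(df.dropna(subset=['A', 'B', 'C'], how='all'))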
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function).
df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
A B C
0 1.0 1.0 1.0
Alternatively, you can just drop all rows which contain NaN. The original DataFrame is not modified; instead a copy is returned.
df.dropna()
You can simply do:
df.dropna()