I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain "NaN" values.
Let's say I have the following df:
A B C
0 1 1 1
1 1 NaN 1
2 1 NaN NaN
3 NaN 1 1
I tried to use this syntax, but it does not work. Do you know what I am doing wrong?
df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]
I get this Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What should I do to get the following output?
A B C
0 1 1 1
Any idea?
You can first select a subset of columns with df[['A','B','C']], then apply notnull and check whether all values in each row of the mask are True:
print (df[['A','B','C']].notnull())
A B C
0 True True True
1 True False True
2 True False False
3 False True True
print (df[['A','B','C']].notnull().all(axis=1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[df[['A','B','C']].notnull().all(axis=1)])
A B C
0 1.0 1.0 1.0
Another solution, from Ayhan's comment, uses dropna:
print (df.dropna(subset=['A', 'B', 'C']))
A B C
0 1.0 1.0 1.0
which is the same as:
print (df.dropna(subset=['A', 'B', 'C'], how='any'))
and means: drop every row that contains at least one NaN value in the listed columns.
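As a minimal, self-contained sketch of the two approaches above, the snippet below rebuilds the sample frame from the question and checks that the boolean mask and dropna select the same rows. The names cols, by_mask and by_dropna are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, np.nan],
    "B": [1, np.nan, np.nan, 1],
    "C": [1, 1, np.nan, 1],
})

cols = ["A", "B", "C"]  # illustrative name for the columns to check

# Keep rows where every selected column is non-null.
by_mask = df[df[cols].notnull().all(axis=1)]

# Drop rows with at least one NaN in the selected columns.
by_dropna = df.dropna(subset=cols, how="any")

print(by_mask)
print(by_mask.equals(by_dropna))
```

The mask form is handy when the condition has to be combined with other filters; dropna is shorter when missing values are the only criterion.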
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function).
df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
A B C
0 1.0 1.0 1.0
Alternatively, you can just drop all rows which contain NaN in any column. The original DataFrame is not modified; instead, a copy is returned.
df.dropna()
You can simply do:
df.dropna()
I'd like to get the index of the rows that contain only null values, directly in pandas (Python 3).
Thanks.
Use:
i = df.index[df.isna().all(axis=1)]
A slower alternative, noticeable on a large DataFrame:
i = df[df.isna().all(axis=1)].index
Sample:
df=pd.DataFrame({"a":[np.nan,0,1],
"b":[np.nan,1,np.nan]})
print (df)
a b
0 NaN NaN
1 0.0 1.0
2 1.0 NaN
i = df.index[df.isna().all(axis=1)]
print (i)
Int64Index([0], dtype='int64')
Explanation:
First, test for missing values with DataFrame.isna:
print (df.isna())
a b
0 True True
1 False False
2 False True
Then check whether all values in each row are True with DataFrame.all:
print (df.isna().all(axis=1))
0 True
1 False
2 False
dtype: bool
Finally, filter the index values by boolean indexing.
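To make the difference between "all values missing" and "at least one value missing" concrete, here is a small sketch on the same sample data. The variable names all_null and any_null are assumptions introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 0, 1],
                   "b": [np.nan, 1, np.nan]})

# One boolean per row: True where every column is NaN.
all_null = df.isna().all(axis=1)

# One boolean per row: True where at least one column is NaN.
any_null = df.isna().any(axis=1)

print(df.index[all_null].tolist())
print(df.index[any_null].tolist())
```

Only the first row is entirely null, while the first and last rows each contain at least one NaN.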
I'm trying to find the difference between two Pandas MultiIndex objects of different shapes. I've used:
df1.index.difference(df2)
and receive
TypeError: '<' not supported between instances of 'float' and 'str'
My indices are str and datetime, but I suspect there are NaNs hidden there (the floats). Hence my question:
What's the best way to find the NaNs somewhere in the MultiIndex? How does one iterate through the levels and names? Can I use something like isna()?
Many functions are not implemented for MultiIndex; you can check this.
You need to convert the MultiIndex to a DataFrame with MultiIndex.to_frame first:
#W-B sample
idx=pd.MultiIndex.from_tuples([(np.nan,1),(1,1),(1,2)])
print (idx.to_frame())
         0  1
NaN 1  NaN  1
1   1  1.0  1
    2  1.0  2
print (idx.to_frame().isnull())
           0      1
NaN 1   True  False
1   1  False  False
    2  False  False
Or use the DataFrame constructor:
print (pd.DataFrame(list(idx.tolist())))
0 1
0 NaN 1
1 1.0 1
2 1.0 2
Because:
print (pd.isnull(idx))
NotImplementedError: isna is not defined for MultiIndex
EDIT:
To select the rows with at least one True, use any with boolean indexing:
df = idx.to_frame()
print (df[df.isna().any(axis=1)])
0 1
NaN 1 NaN 1
It is also possible to filter the MultiIndex itself, but you then need to add MultiIndex.remove_unused_levels:
print (idx[idx.to_frame().isna().any(axis=1)].remove_unused_levels())
MultiIndex(levels=[[], [1]],
labels=[[-1], [0]])
We can use reset_index, then isna:
idx=pd.MultiIndex.from_tuples([(np.nan,1),(1,1),(1,2)])
df=pd.DataFrame([1,2,3],index=idx)
df.reset_index().filter(like='level_').isna()
Out[304]:
level_0 level_1
0 True False
1 False False
2 False False
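The question also asks how to iterate through the levels and names. One possible sketch, assuming the index has level names (the names "first" and "second" below are hypothetical and added only for this example), is to call get_level_values per level; it returns a flat Index, which does support isna:

```python
import numpy as np
import pandas as pd

# Same sample tuples as above; the level names are hypothetical.
idx = pd.MultiIndex.from_tuples(
    [(np.nan, 1), (1, 1), (1, 2)],
    names=["first", "second"],
)

# For each level, a boolean array marking the positions that hold NaN.
nan_per_level = {
    name: idx.get_level_values(name).isna()
    for name in idx.names
}

for name, flags in nan_per_level.items():
    print(name, flags.tolist())
```

If the levels are unnamed, the integer positions from range(idx.nlevels) can be passed to get_level_values instead.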
I would like to delete rows in which all values are between 10 and 25, that is, rows that contain no value less than 10 or greater than 25. My sample dataframe will look like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and store the results in a new column called 'keep'. You can then use this column to drop the rows that you don't need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis=1)
The any() function returns True if at least one element of the iterable is true, and False otherwise. Wrapping it in int() converts that result to 1 or 0.
Check this on how any() works.
Apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 6 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
df < 10 returns a boolean df
| is the element-wise OR operator
.any(axis=1) returns True for each row that has at least one True element (see documentation)
df.loc[] then filters the dataframe based on the resulting boolean Series
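The steps above can be put together as a runnable sketch on the sample data from the question. The intermediate names outside and kept are assumptions used only here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
    columns=["a", "b", "c"],
)

# True for each cell that is less than 10 or greater than 25.
outside = (df < 10) | (df > 25)

# Keep a row if at least one of its cells satisfies the condition.
kept = df.loc[outside.any(axis=1)]

print(kept)
```

Only the row 11, 24, 22 is removed, because none of its values is below 10 or above 25.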
I really like using masking for stuff like this; it's clean, so you can go back and read your code. It's also faster than using .apply, which is effectively a for loop, and it avoids the SettingWithCopyWarning.
This uses boolean indexing like Prageeth's answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don't have to modify the original dataframe or create a new one and just use df[mask] wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat thing about masks is that you can invert them with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print(df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False True
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
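Note that df[~mask].dropna() keeps only rows in which every value is outside the range, so it drops the row 4, 5, 16 that the question wants to keep, and it turns the integers into floats. A possible sketch that reuses the same mask but reduces it to one boolean per row (the names row_all_inside and result are assumptions) is:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
    columns=["a", "b", "c"],
)

# Cell-level mask, as in the answer above.
mask = (df > 10) & (df < 25)

# Row-level mask: True only where every cell of the row is inside the range.
row_all_inside = mask.all(axis=1)

# Drop exactly those rows; no NaN is introduced, so the dtypes stay integer.
result = df[~row_all_inside]

print(result)
print(result.dtypes)
```

Indexing with a one-dimensional boolean Series selects whole rows, whereas indexing with a boolean DataFrame blanks out individual cells.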
I just started to work with Python (pandas), and now I have my first question.
I have a dataframe with the following columns:
ID A Class
1 True [0,5]
2 False [0,5]
3 True [5,10]
4 False [10,20]
5 True [0,5]
6 False [10,20]
Now I'm looking for a cool solution where I can do something like this:
Class True False
[0,5] 2 1
[5,10] 1 0
[10,20] 0 2
I want to count how many True and False values I have for each Class.
Is there a fast solution? My dataframe could have more than 2 million entries.
You can use pivot_table to do the aggregation. After that, it's just a matter of formatting the column names and index to match your desired output.
# Perform the pivot and aggregation.
df = pd.pivot_table(df, index='Class', columns='A', aggfunc='count', fill_value=0)
# Format column names and index to match desired output.
df.columns = [c[1] for c in df.columns]
df.reset_index(inplace=True)
The resulting output:
Class False True
0 [0,5] 1 2
1 [10,20] 2 0
2 [5,10] 0 1
Edit:
The above solution assumes that the elements of the 'Class' column are strings. If they are lists, you could do the following:
df['Class'] = df['Class'].map(tuple)
**original solution code here**
df['Class'] = df['Class'].map(list)
Let df be your dataframe. I would first use:
g = df.groupby('Class')['A'].value_counts().reset_index()
that returns:
Class A 0
0 [0,5] True 2
1 [0,5] False 1
2 [10,20] False 2
3 [5,10] True 1
then I would pivot the above table to get your desired shape:
a = pd.pivot_table(g, index='Class', columns='A', values=0).fillna(0)
This returns:
A False True
Class
[0,5] 1.0 2.0
[10,20] 2.0 0.0
[5,10] 0.0 1.0
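As a different technique from the two answers above, pandas also provides pd.crosstab, which counts co-occurrences of two columns directly. The sketch below assumes that the 'Class' values are strings, as in the first answer, and rebuilds the sample frame from the question; the name counts is only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question, with 'Class' stored as strings.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "A": [True, False, True, False, True, False],
    "Class": ["[0,5]", "[0,5]", "[5,10]", "[10,20]", "[0,5]", "[10,20]"],
})

# Rows are the distinct 'Class' values, columns are False/True,
# and each cell is the number of matching rows.
counts = pd.crosstab(df["Class"], df["A"])

print(counts)
```

Combinations that never occur show up as 0, so no fillna step is needed, and the counts stay integers.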
How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values but not sure how to do it with "0" values.
I think the easiest way is to select the rows where all values are not equal to 0:
df[(df != 0).all(axis=1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
Although it is late, someone else might find it helpful.
I had a similar issue, but the following worked best for me. Note that it only checks a single column:
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
reference:
Drop rows with all zeros in pandas data frame
see #ikbel benabdessamad
Assume a simple DataFrame as below:
df=pd.DataFrame([1,2,0,3,4,0,9])
Select the non-zero values, which turns all zero values into NaN, and then remove the NaN values:
df=df[df!=0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
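The output above shows floats because the zeros pass through NaN on the way. A short sketch comparing this with the row-mask approach from the earlier answers (the names via_nan and via_mask are assumptions for this example) illustrates the difference in dtype:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])

# Zeros are replaced by NaN first, so the column is upcast to float.
via_nan = df[df != 0].dropna()

# Rows are filtered directly; no NaN is created and the dtype is unchanged.
via_mask = df[(df != 0).all(axis=1)]

print(via_nan.dtypes.tolist())
print(via_mask.dtypes.tolist())
```

Both variants keep the same rows; the row mask is the safer choice when the integer dtype matters or when the data already contains NaN values that should not be dropped.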