I would like to delete rows that contain only values between 10 and 25, that is, rows in which no value is less than 10 or greater than 25. My sample dataframe looks like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and store the results in a new column called 'keep'. You can then use this column to drop the rows that you don't need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis=1)
The any() function returns a single boolean: True if at least one element of the iterable is true. Wrapping it in int() stores the result as 1 (keep) or 0 (drop).
See the Python documentation for details on how any() works.
Apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 6 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
df < 10 returns a boolean dataframe
| is the element-wise OR operator
.any(axis=1) returns True for each row that contains at least one True value (see the documentation)
df.loc[] then filters the dataframe with the resulting boolean Series
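As a minimal sketch, here is the same one-liner split into steps and run against the sample data from the question (where the second row holds 16). The names outside, keep_rows and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
    columns=["a", "b", "c"],
)

# True wherever a single value is less than 10 or greater than 25.
outside = (df < 10) | (df > 25)

# Reduce across the columns: True for every row with at least one such value.
keep_rows = outside.any(axis=1)

result = df.loc[keep_rows]
print(result)
```

Only the row 11, 24, 22 is dropped, because it is the only row in which no value falls outside the range.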
I really like using masking for stuff like this; it's clean so you can go back and read your code. It's faster than using .apply too which is effectively for looping. Also, it avoids setting by copy warnings.
This uses boolean indexing like Prageeth's answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don't have to modify the original dataframe or create a new one and just use df[mask] wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat thing about using masks is that you can invert them with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False True
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
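To get the output the question actually asks for, the cell-level mask has to be reduced to one boolean per row. The sketch below is one way to do that, assuming the same sample frame. It uses inclusive bounds (>= and <=) so that 10 and 25 themselves count as inside the range, which is an assumption about the intended edge cases; the names in_range, rows_to_drop and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
    columns=["a", "b", "c"],
)

# Inclusive bounds, so that 10 and 25 themselves count as inside the range.
in_range = (df >= 10) & (df <= 25)

# A row is dropped only when every one of its values is inside the range.
rows_to_drop = in_range.all(axis=1)

filtered = df[~rows_to_drop]
print(filtered)
```

Because whole rows are selected with a boolean Series, no NaN values are introduced and the columns keep their integer dtype.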
Related
I have a dataframe that looks like this:
import pandas as pd, numpy as np
df = pd.DataFrame({'Fill' : [0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 1]})
df['flag'] = (df['Fill'] > 0)
df = df.replace(0,np.nan)
df
Fill flag
0 NaN False
1 NaN False
2 NaN False
3 3.0 True
4 NaN False
5 NaN False
6 NaN False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
My goal is to backward fill with bfill() and pass a dynamic limit based on the value of the cells in the Fill column. I have also created a flag column, which is True for any cell > 0. I did this to protect against the fact that values in the Fill column might become floats as they are filled, so I didn't want to apply the logic to those cells, which started as NaN. This is what I have tried:
df['Fill'] = np.where((df['Fill'].notnull()) & (df.flag==True),
df['Fill'].apply(lambda x: x.bfill(limit=int(x-1))),
df['Fill'])
I am receiving an error: AttributeError: 'float' object has no attribute 'bfill' , but I thought that since I was filtering for the relevant rows with np.where that I could get around the nan values and that with int(x-1), I could avoid the float issue. I also tried something similar with the np.where on the inside of the .apply. Any help is much appreciated. See expected output below:
expected output:
Fill flag
0 NaN False
1 3.0 False
2 3.0 False
3 3.0 True
4 NaN False
5 NaN False
6 2.0 False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
You can create a group for each run of missing values together with the non-missing value that ends it, then back fill each group in a custom function using that last value as the limit. The if-else is necessary to avoid the error ValueError: Limit must be greater than 0:
m = df['Fill'].notnull() & df.flag
g = m.iloc[::-1].cumsum().iloc[::-1]
f = lambda x: x.bfill(limit=int(x.iat[-1]-1)) if x.iat[-1] > 1 else x
df['Fill'] = df.groupby(g, group_keys=False)['Fill'].apply(f)
print (df)
Fill flag
0 NaN False
1 3.0 False
2 3.0 False
3 3.0 True
4 NaN False
5 NaN False
6 2.0 False
7 2.0 True
8 NaN False
9 NaN False
10 1.0 True
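The grouping trick can be easier to follow as a self-contained script. The following is a sketch under a few assumptions: the frame is built directly with NaN instead of replace(0, np.nan), the helper name bfill_with_own_limit is made up for illustration, and group_keys=False is passed so that the result keeps the original row index on recent pandas versions.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({"Fill": [nan, nan, nan, 3, nan, nan, nan, 2, nan, nan, 1]})
df["flag"] = df["Fill"].notna()

# Label each run of rows that ends at a flagged value: reverse, cumsum, reverse.
group_id = df["flag"].iloc[::-1].cumsum().iloc[::-1]


def bfill_with_own_limit(block):
    """Back fill a block, using its last value minus one as the fill limit."""
    last = block.iat[-1]
    # bfill rejects a limit of 0, so blocks ending in 1 (or in NaN) are left alone.
    if pd.notna(last) and last > 1:
        return block.bfill(limit=int(last - 1))
    return block


# group_keys=False keeps the original row index, so the assignment aligns.
df["Fill"] = df.groupby(group_id, group_keys=False)["Fill"].apply(bfill_with_own_limit)
print(df)
```

The reversed cumulative sum is what ties every NaN to the next flagged value below it, so each group ends with the number that sets its own fill limit.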
I need to look through my data set and find all values that meet certain conditions. I have tried pandas.where(cond), which accepts just one condition.
For example consider the following data set:
a b c d
1 2 3 899
4 5 -344 21
7 8 9 10
I need this result for the condition 0 < data.values and data.values < 30:
a b c d
1 2 3 NaN
4 5 NaN 21
7 8 9 10
Most of the scripts I found return only the rows or columns that meet the conditions.
However, I need to keep the rest of the values in each row and column. For example, I do not want to lose 2 and 3 in the first row, or 4 and 5 in the second row.
Create a boolean DataFrame from the conditions, then apply boolean indexing or use where. The mask below uses inclusive bounds (>= and <=); switch to > and < for the strict comparison in the question:
m = (df >= 0) & (df <= 30)
print (m)
a b c d
0 True True True False
1 True True False True
2 True True True True
df = df[m]
#alternatively
#df = df.where(m)
print (df)
a b c d
0 1 2 3.0 NaN
1 4 5 NaN 21.0
2 7 8 9.0 10.0
Numpy solution:
df = pd.DataFrame(np.where(m, df, np.nan), index=df.index, columns=df.columns)
print (df)
a b c d
0 1.0 2.0 3.0 NaN
1 4.0 5.0 NaN 21.0
2 7.0 8.0 9.0 10.0
Or use mask with the inverted conditions (>= becomes < and <= becomes >):
m = (df < 0) | (df > 30)
df = df.mask(m)
print (df)
a b c d
0 1 2 3.0 NaN
1 4 5 NaN 21.0
2 7 8 9.0 10.0
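As a small sketch of how where and mask relate, the snippet below applies the strict bounds from the question (0 < value < 30) to its sample data and checks that both methods give the same frame. The names valid, kept and masked are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 899], [4, 5, -344, 21], [7, 8, 9, 10]],
    columns=["a", "b", "c", "d"],
)

# Strict bounds, matching "0 < value < 30" from the question.
valid = (df > 0) & (df < 30)

kept = df.where(valid)    # keep cells where the condition is True
masked = df.mask(~valid)  # replace cells where the inverted condition is True

print(kept)
print(kept.equals(masked))
```

where keeps the cells for which the condition holds, while mask replaces the cells for which its condition holds, so one is the other with the condition negated.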
This can be accomplished with a boolean expression (which can be compound) as the selection criterion. Pandas overloads the dunder (double underscore) method for subscripting, __getitem__, to accept a boolean mask. A common pitfall is that the mask is an element-wise expression rather than a single logical value, so you need to use the bitwise operators & and | instead of and and or when it is compound. These operators bind tighter than equality and comparison operators (e.g. ==, >, >=), so you need to put your comparisons inside parentheses.
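The precedence point is easy to trip over, so here is a short sketch on a made-up two-column frame. It shows the parenthesized form working and the unparenthesized form failing.

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate operator precedence.
df = pd.DataFrame({"a": [1, 40, 7], "b": [-5, 8, 12]})

# Correct: each comparison is parenthesized before being combined with &.
in_range = (df > 0) & (df < 30)
print(df[in_range])

# Without parentheses, & binds first, so Python evaluates
# df > (0 & df) < 30, a chained comparison that needs a single
# truth value for a whole DataFrame and therefore fails.
try:
    df > 0 & df < 30
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The error comes from Python trying to turn an entire DataFrame into one True or False, which pandas refuses to do.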
I believe the answer given by @jezrael will work. This is just an explanation of what s/he has provided.
Is the condition None == None true or false?
I have 2 pandas-dataframes:
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3,4,5], 'value':[None,20,None,40,50]})
df2 = pd.DataFrame({'index':[1,2,3], 'value':[None,20,None]})
In [42]: df1
Out[42]: id value
0 1 NaN
1 2 20.0
2 3 NaN
3 4 40.0
4 5 50.0
In [43]: df2
Out[43]: index value
0 1 NaN
1 2 20.0
2 3 NaN
When I execute a merge, it looks like None == None is True:
In [37]: df3 = df1.merge(df2, on='value', how='inner')
In [38]: df3
Out[38]: id value index
0 1 NaN 1
1 1 NaN 3
2 3 NaN 1
3 3 NaN 3
4 2 20.0 2
but when I do this:
In [39]: df4 = df3[df3['value']==df3['value']]
In [40]: df4
Out[40]: id value index
4 2 20.0 2
In [41]: df3['value']==df3['value']
Out[41]: 0 False
1 False
2 False
3 False
4 True
It shows that None == None is false.
Pandas uses the floating point Not a Number value, NaN, to indicate that something is missing in a series of numbers, because that is easier to handle in the internal representation of data. You don't have any None objects in your series. With dtype=object data, however, None is used to encode a missing value. See Working with missing data.
Not that it matters here, but NaN is always, by definition, not equal to NaN:
>>> float('NaN') == float('NaN')
False
When merging, Pandas knows what 'missing' means: no ordinary equality test is done on the NaN or None values in a series. Missing values in the key columns are matched against each other explicitly, which is why the merged result above contains the NaN rows.
If you want to test if a value is null or not, use the series.isnull() and series.notnull() methods instead.
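A short sketch that reproduces both observations from the question side by side, using the question's own frames. The names equal_to_self and missing are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [None, 20, None, 40, 50]})
df2 = pd.DataFrame({"index": [1, 2, 3], "value": [None, 20, None]})

# Missing keys are matched against each other during a merge.
df3 = df1.merge(df2, on="value", how="inner")

# An element-wise == comparison follows IEEE 754: NaN is never equal to NaN.
equal_to_self = df3["value"] == df3["value"]

# isnull() and notnull() are the reliable way to test for missing values.
missing = df3["value"].isnull()

print(len(df3), equal_to_self.sum(), missing.sum())
```

The merge and the == comparison disagree because they follow different rules, not because None == None changes its value.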
I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain "NaN" values.
Let's say I have the following df:
A B C
0 1 1 1
1 1 NaN 1
2 1 NaN NaN
3 NaN 1 1
I tried to use this syntax, but it does not work. Do you know what I am doing wrong?
df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]
I get this Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What should I do to get the following output?
A B C
0 1 1 1
Any idea?
You can first select a subset of columns with df[['A','B','C']], then apply notnull and check that all values in each row of the mask are True:
print (df[['A','B','C']].notnull())
A B C
0 True True True
1 True False True
2 True False False
3 False True True
print (df[['A','B','C']].notnull().all(axis=1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[df[['A','B','C']].notnull().all(axis=1)])
A B C
0 1.0 1.0 1.0
Another solution, from Ayhan's comment, uses dropna:
print (df.dropna(subset=['A', 'B', 'C']))
A B C
0 1.0 1.0 1.0
which is the same as:
print (df.dropna(subset=['A', 'B', 'C'], how='any'))
and means: drop all rows that contain at least one NaN value in the listed columns.
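To make the subset and how parameters concrete, here is a minimal sketch on the frame from the question. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 1, np.nan],
        "B": [1, np.nan, np.nan, 1],
        "C": [1, 1, np.nan, 1],
    }
)

# how="any" (the default): drop a row if at least one listed column is NaN.
complete_rows = df.dropna(subset=["A", "B", "C"], how="any")

# how="all": drop a row only if every listed column is NaN.
not_all_missing = df.dropna(subset=["A", "B", "C"], how="all")

# A narrower subset only inspects the listed columns.
a_and_c_present = df.dropna(subset=["A", "C"])

print(complete_rows)
```

No row in this frame is entirely NaN, so how="all" leaves it unchanged, while restricting subset to A and C also keeps the second row.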
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function).
df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
A B C
0 1.0 1.0 1.0
Alternatively, you can just drop all rows which contain NaN. The original DataFrame is not modified; instead, a copy is returned.
df.dropna()
You can simply do:
df.dropna()
How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values but not sure how to do it with "0" values.
I think the easiest way is to select the rows where all values are not equal to 0:
df[(df != 0).all(axis=1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
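The two formulations seen so far, keeping rows where all values are non-zero and dropping rows where any value is zero, are logically the same. A quick sketch on the frame above checks this; the names all_nonzero and no_zero are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [1, 2, 3], [0, 1, 2], [4, 5, 6]])

# Keep rows where every value is non-zero.
all_nonzero = df[(df != 0).all(axis=1)]

# Drop rows where any value is zero.
no_zero = df[~(df == 0).any(axis=1)]

print(all_nonzero.equals(no_zero))
```

Which one to use is mostly a matter of which reads more naturally for the condition at hand.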
Although it is late, someone else might find this helpful.
I had a similar issue, and the following worked best for me.
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
Reference:
Drop rows with all zeros in pandas data frame
(see the answer by @ikbel benabdessamad)
Assume a simple DataFrame as below:
df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])
Keep only the non-zero values, which turns every zero into NaN, and then remove the NaN values:
df = df[df != 0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
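Note that the output above shows floats: routing the zeros through NaN converts the integer column to float. If that matters, a row-level boolean filter is one possible alternative, sketched here on the same frame with illustrative names.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])

# Zeros become NaN first, so the column is converted to float.
via_nan = df[df != 0].dropna()

# Rows are selected directly, so the integer dtype is preserved.
via_rows = df[(df != 0).all(axis=1)]

print(via_nan.dtypes.iloc[0], via_rows.dtypes.iloc[0])
```

Both approaches keep the same rows; they differ only in the dtype of the result.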