Replace outliers with column quantile in Pandas dataframe - python

I have a dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
A B
0 92 65
1 61 97
2 17 39
3 70 47
4 56 6
Here are 5% quantiles:
down_quantiles = df.quantile(0.05)
A 24.8
B 12.6
And here is the mask for values that are lower than quantiles:
outliers_low = (df < down_quantiles)
A B
0 False False
1 False False
2 True False
3 False False
4 False True
I want to set values in df that are lower than the quantile to their column's quantile. I can do it like this:
df[outliers_low] = np.nan
df.fillna(down_quantiles, inplace=True)
A B
0 92.0 65.0
1 61.0 97.0
2 24.8 39.0
3 70.0 47.0
4 56.0 12.6
But certainly there should be a more elegant way. How can I do this without fillna?
Thanks.

You can use the DF.mask() method. Wherever the mask is True, the corresponding value is replaced by the value from the other series, aligned by matching column names when you provide axis=1.
df.mask(outliers_low, down_quantiles, axis=1)
Another variant is to use the DF.where() method after inverting your boolean mask with the tilde (~) operator.
df.where(~outliers_low, down_quantiles, axis=1)
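For reference, a minimal, self-contained sketch of the mask-based approach end to end (the setup mirrors the question; the random seed is an added assumption for reproducibility):
import numpy as np
import pandas as pd

# reproducible version of the question's setup (seed is an assumption)
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 2)), columns=list('AB'))

down_quantiles = df.quantile(0.05)      # per-column 5% quantiles
outliers_low = df < down_quantiles      # mask of values below the quantile

# replace masked values with the matching column's quantile
result = df.mask(outliers_low, down_quantiles, axis=1)
print(result)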

Related

Convert column of lists to columns of True/False if it belongs to the lists

I have the following DataFrame
Codice Sem CFU Rating Gruppo
0 51132 1 10 0.0 [STAT]
1 51197 1 5 0.0 [ING]
2 52354 1 5 0.0 [ING]
3 52496 1 10 0.0 [MST]
4 52498 2 10 0.0 [MST]
... ... ... ... ... ...
57 97667 1 8 3.0 [MTM]
58 97673 2 8 0.0 [MTM]
59 97683 2 5 5.0 [STAT, ING]
60 97690 2 12 0.0 [MST]
61 97725 2 10 0.0 [CSCL, MTM]
As you can see, the Gruppo column is made of lists drawn from a finite set of unique elements. I'm trying to generate from this a DataFrame to be used in pulp for conditions like "if Codice belongs to Gruppo", so I need a DataFrame (or a matrix, but I want to index by Codice rather than an ordinary integer) like this:
Codice STAT ING ... MST
0 51132 True False ... False
1 51197 False True ... False
Basically True/False whether the corresponding list contains ING,STAT,MST,...
A general solution that works whether or not Codice values are unique: use DataFrame.explode with crosstab and test for counts not equal to 0:
df1 = df.explode('Gruppo')
df2 = pd.crosstab(df1['Codice'], df1['Gruppo']).ne(0).reset_index()
Or use MultiLabelBinarizer, aggregate with max, and compare for equality with 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df2 = (pd.DataFrame(mlb.fit_transform(df['Gruppo']),
                    columns=mlb.classes_,
                    index=df['Codice']).groupby(level=0).max().eq(1))
If values in Codice are unique, it is possible to remove the aggregation:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df2 = (pd.DataFrame(mlb.fit_transform(df['Gruppo']),
                    columns=mlb.classes_,
                    index=df['Codice']).eq(1))
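A minimal runnable sketch of the explode + crosstab approach on a made-up toy frame (the column names follow the question; the values themselves are assumptions):
import pandas as pd

# hypothetical toy data shaped like the question's DataFrame
df = pd.DataFrame({
    'Codice': [51132, 51197, 97683],
    'Gruppo': [['STAT'], ['ING'], ['STAT', 'ING']],
})

df1 = df.explode('Gruppo')                             # one row per (Codice, group)
df2 = pd.crosstab(df1['Codice'], df1['Gruppo']).ne(0)  # True where the group occurs
print(df2.reset_index())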

Set a pandas column Boolean value based on other columns in the row

Assume a DataFrame
C1 C2 C3
1 NaN NaN NaN
2 20.1 15 200
3 NaN 12 100
4 22.5 8 80
I want to create a new column based on a summarizing boolean of the rest of the row. For example, are any of the values NaN? In that case, my new column value would be "False" for that row.
Or, perhaps, are ALL of the values NaN? In that case, I might want the new column to be False, and otherwise True (we do have at least some values).
I considered using df.notna() to create a Boolean DataFrame,
C1 C2 C3
1 False False False
2 True True True
3 False True True
4 True True True
I'm sure I'm just missing something simple, but I could not come up with a way to create the fourth column based on OR-ing the existing items in each row.
Also, a generic solution would be nice, one that doesn't require building an interim DF of Booleans.
Background: I have a dataset. Nutrient values are only sampled occasionally, so many of the rows do not contain those values. I would like to have a "Nutrients Sampled" column where the value is True or False based on whether I can expect to see any nutrient sample data in this record. There are 6 possible nutrients and I don't want to check all 6 columns.
I can write the code that checks all 6 columns; I just can't seem to create a new column with a truth value.
You can do that using the any and all methods, which are available on the DataFrame; just pass the argument axis=1 to operate along rows.
example:
df['C4'] = pd.notnull(df).any(axis=1)
C1 C2 C3 C4
0 NaN NaN NaN False
1 20.1 15.0 200.0 True
2 NaN 12.0 100.0 True
3 22.5 8.0 80.0 True
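Since the question mentions wanting a "Nutrients Sampled" flag driven by a handful of nutrient columns rather than the whole row, here is a hedged sketch of the same idea restricted to a subset of columns (the column names and values are hypothetical placeholders):
import numpy as np
import pandas as pd

# hypothetical frame where nutrients are only sampled occasionally
df = pd.DataFrame({
    'station': ['S1', 'S2', 'S3'],
    'nitrate': [np.nan, 0.4, np.nan],
    'phosphate': [np.nan, np.nan, 1.2],
})

nutrient_cols = ['nitrate', 'phosphate']   # list the nutrient columns explicitly
df['Nutrients Sampled'] = df[nutrient_cols].notna().any(axis=1)
print(df)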
I feel like we should be using all:
df['New'] = ~df.isna().all(axis=1)
df
C1 C2 C3 New
1 NaN NaN NaN False
2 20.1 15.0 200.0 True
3 NaN 12.0 100.0 True
4 22.5 8.0 80.0 True
You can use the apply method and define a function that maps each row to a boolean.
Here is a function you can customize based on your needs (e.g. you can use all instead of any):
# if at least one of the values is NaN
def my_function(row):
    return any(row[['C1', 'C2', 'C3']].isna())
And here is how to apply it to your dataframe and add a new column:
df['new_column'] = df.apply(my_function, axis=1)
C1 C2 C3 new_column
0 NaN NaN NaN True
1 20.1 15.0 200.0 False
2 NaN 12.0 100.0 True
3 22.5 8.0 80.0 False
How about:
# interim df
df = {"C1": [False, True, False, True], ...
df ["C4"] = df.apply(lambda x: x.C1 or x.C2 or X.C3, axis=1)
Or ... directly as
original_df["C4"] = original_df.apply(lambda x: np.any(np.isnan(x)), axis = 1)

Merge Pandas DataFrame where Values are not exactly alike

I have two DataFrames:
First one (sp_df)
X Y density keep mass size
10 20 33 False 23 23
3 2 52 True 5 5
1.2 3 35 False 25 52
Second one (ep_df)
X Y density keep mass size
2.1 1.1 55 True 4.0 4.4
1.1 2.9 60 False 24.8 54.8
9.0 25.0 33 False 22.0 10.0
Now I need to merge them on their X/Y position into something like this:
X-SP Y-SP density-SP ........ X-EP Y-EP density-EP......
1.5 2.0 30 1.0 2.4 28.7
So with the Data shown above you would get something like this:
X-SP Y-SP density-SP keep-SP mass-SP size-SP X-EP Y-EP density-EP keep-EP mass-EP size-EP
3 2 52 True 5 5 2.1 1.1 55 True 4.0 4.4
1.2 3 35 False 25 52 1.1 2.9 60 False 24.8 54.8
10 20 33 False 23 23 9.0 25.0 33 False 22.0 10.0
My problem is that those values are not exactly alike, so I need some kind of comparison to decide which entries in the two dataframes are most likely to be the same. Unfortunately, I have no idea how to get this done.
Any tips or advice? Thanks in advance.
You can merge the two dataframes as a Cartesian product. This makes a dataframe in which each row of the first dataframe is joined with every row of the second dataframe. Then remove the rows where the X values of the two dataframes differ by too much. Hope the following code helps,
import pandas as pd
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## taking difference and removing rows
## with difference more than 1
df['diff'] = df['X_sp'] - df['X_ep']
drop=df.index[df["diff"] >= 1].tolist()
df=df.drop(df.index[drop])
df
Edited code:
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## taking difference and removing rows
## with difference more than 1
df['diff'] = df['X_sp'] - df['X_ep']
drop=df.index[df["diff"] >= 1.01].tolist()
drop_negative=df.index[df["diff"] <= 0 ].tolist()
droped_values=drop+drop_negative
df=df.drop(df.index[droped_values])
df
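A related sketch (not part of the answer above): after the same Cartesian product, pick for each sp_df row the ep_df row that is closest in X/Y by squared Euclidean distance instead of filtering on a fixed threshold. The dataframe names and the X/Y/density values come from the question; everything else is an assumption:
import pandas as pd

sp_df = pd.DataFrame({'X': [10, 3, 1.2], 'Y': [20, 2, 3], 'density': [33, 52, 35]})
ep_df = pd.DataFrame({'X': [2.1, 1.1, 9.0], 'Y': [1.1, 2.9, 25.0], 'density': [55, 60, 33]})

# Cartesian product via a constant key, as above
df = pd.merge(sp_df.assign(key=1), ep_df.assign(key=1),
              on='key', suffixes=['_sp', '_ep']).drop(columns='key')

# squared Euclidean distance between the two X/Y positions
df['dist'] = (df['X_sp'] - df['X_ep'])**2 + (df['Y_sp'] - df['Y_ep'])**2

# for each sp row, keep the ep row with the smallest distance
closest = df.loc[df.groupby(['X_sp', 'Y_sp'])['dist'].idxmin()]
print(closest)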

How to print a value from a different column if certain conditions are met in an other column?

I need to print a value from a different column if certain conditions are met in an other column.
The df consists of 4 columns: year, GDP, the difference, and the signal.
gdp = pd.read_excel("file.xls", names=["year", "GDP"])
gdp["diff"] = gdp["GDP"].diff(1)
signal = gdp["diff"].clip(lower = -1.0, upper=1.0)
gdp["signal"] = signal
The signal is -1 for neg values and +1 for pos values.
The condition is I have to print the year for which there are 2 consecutive negative periods.
rec_start=(gdp["signal"]==-1) & (gdp["signal"].shift(-1)==-1)
gdp["start"]=rec_start # which gives a boolean mask
rec_start and the start column are the same:
year GDP diff signal start
0 1999q4 12323.3 NaN NaN False
1 2000q1 12359.1 35.8 1.0 False
2 2000q2 12592.5 233.4 1.0 False
3 2000q3 12607.7 15.2 1.0 False
4 2000q4 12679.3 71.6 1.0 False
5 2001q1 12643.3 -36.0 -1.0 False
6 2001q2 12710.3 67.0 1.0 False
7 2001q3 12670.1 -40.2 -1.0 False
8 2001q4 12705.3 35.2 1.0 False
9 2002q1 12822.3 117.0 1.0 False
Now I just have to figure out the right syntax to print the year row for 2 consecutive Trues.
Trying with
foo = gdp.loc[(gdp["signal"] == -1) & (gdp["signal"].shift(-1) == -1), "year"].iloc[0]
print(foo)
did the trick.
Any help is much appreciated!
IIUC this should work:
df.year[np.where((df.signal == -1), (df.signal == df.signal.shift()), 0).astype('bool')]
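A self-contained sketch of the shift-based mask from the question, on made-up quarterly data (the column names follow the question; the GDP values are assumptions chosen so that the diff is negative for consecutive quarters):
import pandas as pd

# hypothetical GDP series with consecutive negative quarters
gdp = pd.DataFrame({
    'year': ['2000q4', '2001q1', '2001q2', '2001q3'],
    'GDP': [12679.3, 12643.3, 12610.3, 12600.1],
})
gdp['diff'] = gdp['GDP'].diff(1)
gdp['signal'] = gdp['diff'].clip(lower=-1.0, upper=1.0)

# True where this quarter and the next one are both negative
rec_start = (gdp['signal'] == -1) & (gdp['signal'].shift(-1) == -1)
print(gdp.loc[rec_start, 'year'].iloc[0])   # first quarter of the decline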

Pandas drop rows with value less than a given value

I would like to delete rows in which every value lies between 10 and 25, i.e. rows that contain no value less than 10 or greater than 25. My sample dataframe looks like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and return the results to a new column called 'Keep'. You can then use this column to drop rows that you don't need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis=1)
The any() function returns True if any element of the iterable is truthy; wrapping the result in int() turns it into 1 or 0 for the keep column.
See the Python documentation for how any() works.
The apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 6 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
df<10 will return a boolean df
| is the OR operator
.any(axis=1) returns True for rows that contain any True element (axis=1 operates along rows); see the documentation
df.loc[] then filters the dataframe based on the boolean df
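A quick, self-contained check of this one-liner against the question's sample data (constructing df this way is an assumption; the values come from the question):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
                  columns=['a', 'b', 'c'])

# keep rows that have any value below 10 or above 25
dropped_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
print(dropped_df)   # the row [11, 24, 22] is dropped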
I really like using masking for stuff like this; it's clean, so you can go back and read your code. It's faster than using .apply too, which is effectively a for loop. Also, it avoids setting-with-copy warnings.
This uses boolean indexing like Prageeth's answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don't have to modify the original dataframe or create a new one and just use df[mask] wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat thing about using masks is you can invert them too with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False False
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
