Dropping outliers using standard deviation and mean formula [duplicate] - python

This question already has answers here:
Detect and exclude outliers in a pandas DataFrame
(19 answers)
Closed 4 years ago.
Hello everyone,
I am trying to remove the outliers from my dataset. I defined the outlier boundaries as mean-3*std and mean+3*std. Now I want to delete the values smaller than mean-3*std and the values bigger than mean+3*std. Could you help me write a formula for this? I am a beginner in Python. I already looked at similar questions, but they have not helped so far.
Until now I have the following:
import pandas as pd
print(df_OmanAirTO.mean()-3*df_OmanAirTO.std(), df_OmanAirTO.mean()+3*df_OmanAirTO.std())
resulting in:
FuelFlow 2490.145718
ThrustDerateSmoothed 8.522145
CoreSpeed 93.945180
EGTHotDayMargin 9.950557
EGT 684.168701
TotalAirTemperature 11.980698
ThrustDerate -3.780215
dtype: float64
FuelFlow 4761.600157
ThrustDerateSmoothed 29.439075
CoreSpeed 101.360974
EGTHotDayMargin 90.414781
EGT 915.952163
TotalAirTemperature 43.266653
ThrustDerate 44.672861
dtype: float64
Now I want to delete the values smaller than mean-3*std and delete the values bigger than mean+3*std. How can I do this?
Thank you in advance for helping me!

I assume you want to apply the outlier conditions to each column separately (i.e. in column FuelFlow, remove cells smaller than 2490.145718 or larger than 4761.600157; in column ThrustDerateSmoothed, remove cells smaller than 8.522145 or larger than 29.439075; and so on).
I would try this, which keeps only the values inside the bounds (removed cells show up as NaN once the columns are realigned):
filt_outliers_df_oman = df_OmanAirTO.apply(lambda x: x[(x >= x.mean() - 3*x.std()) &
                                                       (x <= x.mean() + 3*x.std())], axis=0)
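The same idea can be sketched without apply. This is a minimal example on made-up data (the DataFrame `df` and its values are hypothetical, not the asker's dataset): it masks every cell outside mean ± 3*std of its own column, then optionally drops the rows that contain a masked cell.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 ordinary values plus one obvious outlier per column.
df = pd.DataFrame({
    "FuelFlow": np.append(np.linspace(3500, 3700, 50), 10000.0),
    "EGT": np.append(np.linspace(780, 820, 50), 100.0),
})

# Per-column bounds; both are Series indexed by column name.
lower = df.mean() - 3 * df.std()
upper = df.mean() + 3 * df.std()

# Keep in-range cells; out-of-range cells become NaN.
# The comparison aligns the bounds with the DataFrame's column labels.
masked = df.where((df >= lower) & (df <= upper))

# Optionally drop every row that contains at least one outlier.
cleaned = masked.dropna()
```

Whether to keep `masked` (outliers as NaN) or `cleaned` (whole rows removed) depends on whether the other columns of an outlier row are still useful. Note that the mean and standard deviation are themselves affected by the outliers, so a single pass may not catch everything.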

Related

Setting numbers outside of range as null [duplicate]

This question already has answers here:
Python Pandas replace values if not in value range
(4 answers)
Closed last year.
I am working with a pandas DataFrame and trying to set all the numbers that are outside of a range to null, but I am having trouble:
df['Numbers'] = df['Numbers'].mask((df['Numbers']< -10) & (df['Numbers']> 10), inplace=True)
So I want to keep the numbers between -10 and 10, if the numbers are outside of those two numbers, it should be set as null.
What am I doing wrong here?
One thing that immediately jumps out at me is that you're using & with your two conditions, so you're trying to select all numbers that are both less than -10 and greater than 10, which can never be true.
I'd rewrite your code like this:
df.loc[df['Numbers'].lt(-10) | df['Numbers'].gt(10), 'Numbers'] = np.nan
I would do it like this:
df['Numbers'] = df['Numbers'].where((df['Numbers']>-10) & (df['Numbers']<10))
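A small self-contained sketch comparing the two approaches on made-up values (the Series `s` is hypothetical). It also avoids a second problem in the question's code: with inplace=True, mask returns None, so assigning its result back to the column overwrites the column with None.

```python
import pandas as pd

s = pd.Series([-25, -10, 0, 7, 10, 42], name="Numbers")

# where() keeps values where the condition is True and writes NaN elsewhere.
# between() is inclusive on both ends by default, so -10 and 10 are kept.
kept = s.where(s.between(-10, 10))

# mask() is the mirror image: it writes NaN where the condition is True.
# The two out-of-range tests are combined with | (or), not & (and).
masked = s.mask((s < -10) | (s > 10))
```

Both results contain NaN in the first and last positions and the original values elsewhere.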

create a new column which is a value_counts of another column in python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas DataFrame df that contains a column, say x, and I would like to create another column from x that holds the value count of each item in x.
Here is my approach
x_counts= []
for item in df['x']:
    item_count = len(df[df['x']==item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is very inefficient. I am looking for a more efficient way to handle this. Your approaches and recommendations are highly appreciated.
It sounds like you are looking for the groupby function to get the count of each item in x.
There are many other function-driven methods, but they may differ between versions.
I suppose that you want to group identical elements and sum a counter for each group:
df.loc[:,'x_count']=1 # This will make a new column of x_count to each row with value 1 in it
aggregate_functions={"x_count":"sum"}
df=df.groupby(["x"],as_index=False,sort=False).aggregate(aggregate_functions) # as_index=False keeps x as a regular column instead of the index; sort=False keeps the groups in order of first appearance
Hope it helps.
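Note that the groupby/aggregate approach above returns one row per distinct value of x. If the goal is a count on every original row, as in the question's loop, a minimal sketch (with a made-up DataFrame) could use transform or map instead:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a", "c", "a", "b"]})

# transform("count") returns a result aligned with the original rows,
# so each row receives the size of its own group.
df["x_count"] = df.groupby("x")["x"].transform("count")

# Alternative: look up each value in the value_counts() table.
alt = df["x"].map(df["x"].value_counts())
```

Both versions make a single pass over the data rather than filtering the whole DataFrame once per row.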

Pandas - How to get sum of column with positive and negative values? [duplicate]

This question already has answers here:
converting currency with $ to numbers in Python pandas
(5 answers)
Closed 3 years ago.
I am summing a column of data using pandas that includes positive and negative values.
I first clean the data by removing the $ sign and parentheses. Then I format the values as floats.
How can I sum the whole column and subtract by the negative numbers?
Example:
$1000
($200)
$300
$1250
($100)
I want the answer to be 2250 not 2550.
Thanks in advance!
You want to identify the values and the signs:
# positive and negative
signs = np.where(s.str.startswith('('), -1, 1)
# extract the values
vals = s.str.extract(r'\$([\d\.]*)')[0].astype(int)
# calculate the sum
vals.mul(signs).sum()
# 2250
A Pandas DataFrame object has the .sum method that takes axis as a parameter
my_dataframe['name_of_column_you_want'].sum(axis = 0) # axis=0 means down (the rows)
I don't understand your example, but assuming the values in parentheses are negative:
import re
def clean(value):
    # Pull out the digits; a value wrapped in parentheses is negative
    amount = float(re.search(r'\d+', value).group(0))
    if '(' in value:
        return -amount
    else:
        return amount
my_dataframe['column_name'].apply(clean).sum()
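For reference, a vectorised sketch using the question's five sample values (the Series name `s` is an assumption, as is the reading that parentheses mark negative amounts):

```python
import pandas as pd

s = pd.Series(["$1000", "($200)", "$300", "$1250", "($100)"])

# Parentheses mark negative amounts (accounting notation).
is_negative = s.str.startswith("(")

# Strip "$", "(", ")" and "," and convert what is left to float.
amounts = s.str.replace(r"[$(),]", "", regex=True).astype(float)

# Flip the sign where the original string was parenthesised.
signed = amounts.where(~is_negative, -amounts)
total = signed.sum()
```

Here `total` is 2250.0, the result the question asks for.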

if-else for multiple conditions dataframe [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I don't know how to properly write the following idea:
I have a dataframe that has two columns, and many many rows.
I want to create a new column based on the data in these two columns, such that if there's 1 in one of them the value will be 1, otherwise 0.
Something like that:
if (df['col1']==1 | df['col2']==1):
    df['newCol']=1
else:
    df['newCol']=0
I tried to use the .loc function in different ways, but I get different errors, so either I'm not using it correctly or this is not the right solution...
Would appreciate your help. Thanks!
Simply use np.where or np.select. Each comparison needs its own parentheses, because | binds more tightly than ==:
df['newCol'] = np.where((df['col1']==1) | (df['col2']==1), 1, 0)
OR
df['newCol'] = np.select([cond1, cond2, cond3], [choice1, choice2, choice3], default=def_value)
With np.select, the first condition that is true determines which choice is used; rows matching no condition get the default value.
Another way to solve this is with .loc:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = 1
df['newCol'] = df['newCol'].fillna(0)
In case you want newCol as a string, use:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = '1'
df['newCol'] = df['newCol'].fillna('0')
or
df['newCol']=df['newCol'].astype(str)
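A runnable sketch of the np.where approach on a made-up four-row DataFrame (the data is hypothetical; the column names follow the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 0, 0, 1], "col2": [0, 0, 1, 1]})

# Each comparison is parenthesised so that | combines two boolean Series.
cond = (df["col1"] == 1) | (df["col2"] == 1)
df["newCol"] = np.where(cond, 1, 0)

# Alternative without NumPy: cast the boolean mask to integers.
df["newCol_alt"] = cond.astype(int)
```

A plain Python if/else cannot be used here because the condition is a whole Series of True/False values, not a single boolean.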

How to scale a numpy array from 0 to 1 with overshoot? [duplicate]

This question already has answers here:
How to normalize a NumPy array to within a certain range?
(8 answers)
Closed 3 years ago.
I am trying to scale a pandas or numpy array from 0 to an unknown max value, with a defined number mapped to 1.
One solution I tried is just dividing the array by the defined number.
test = df['Temp'] / 33
This method does not scale all the way from 0 and I'm stuck trying to figure out a better mathematical way of solving this.
First, transform the DataFrame to a numpy array
import numpy as np
T = np.array(df['Temp'])
Then scale it to a [0, 1] interval:
def scale(A):
    return (A-np.min(A))/(np.max(A) - np.min(A))
T_scaled = scale(T)
Then transform it to anywhere you want, e.g. to [55..100]
T2 = 55 + 45*T_scaled
I'm sure that this can be done within Pandas too (but I'm not familiar with it). Perhaps you might study Pandas df.apply()
scaled = (df['Temp']-df['Temp'].min()) / (33 - df['Temp'].min())
Just replace 33 with the number you want mapped to 1! The minimum maps to 0, and values above 33 come out greater than 1.
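A short worked sketch of this formula on made-up temperatures (the values and the name `reference` are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Temp": [10.0, 21.5, 33.0, 44.5]})
reference = 33.0  # hypothetical value that should map to 1

# Shift so the minimum lands on 0, then divide so the reference lands on 1.
t_min = df["Temp"].min()
df["scaled"] = (df["Temp"] - t_min) / (reference - t_min)
```

With these values the minimum (10.0) maps to 0, the reference (33.0) maps to 1, and 44.5 overshoots to 1.5. The formula is undefined when the reference equals the minimum.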
