I have a dataset named data. I would like to calculate the Lower and Upper values for multiple numeric variables: Loan, Amount, Value, LTV, UR. Instead of computing them one by one, how can I automate the following Python code?
#Loan
Q1= data['LOAN'].quantile(q=0.25)
Q3= data['LOAN'].quantile(q=0.75)
IQR= Q3 - Q1
Lower = (Q1 - 1.5*IQR)
Upper = (Q3 + 1.5*IQR)
print('Loan')
print(Lower)
print(Upper)
I'd define a list with the column names, then loop over the columns, calling a function that calculates the values:
def calculateLowerUpper(column):
    # Tukey fences: 1.5 * IQR below Q1 and above Q3 for the given column
    q1 = data[column].quantile(q=0.25)
    q3 = data[column].quantile(q=0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper

# Define the function before the loop that calls it
names = ['LOAN', 'AMOUNT', 'VALUE', 'LTV', 'UR']
for column in names:
    x, y = calculateLowerUpper(column)
    print(column)
    print(x)
    print(y)
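The loop above can also be written without a helper function, because DataFrame.quantile accepts a list of quantiles and works on several columns at once. A minimal sketch, assuming a small made-up data frame with the five columns from the question (the values are invented for illustration):

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
data = pd.DataFrame({
    "LOAN":   [100, 200, 300, 400, 500],
    "AMOUNT": [10, 20, 30, 40, 1000],
    "VALUE":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "LTV":    [0.5, 0.6, 0.7, 0.8, 0.9],
    "UR":     [3, 4, 5, 6, 7],
})
names = ["LOAN", "AMOUNT", "VALUE", "LTV", "UR"]

# One row per quantile, one column per variable.
q = data[names].quantile([0.25, 0.75])
iqr = q.loc[0.75] - q.loc[0.25]

# Lower and upper fences for every column, collected in one table.
bounds = pd.DataFrame({
    "Lower": q.loc[0.25] - 1.5 * iqr,
    "Upper": q.loc[0.75] + 1.5 * iqr,
})
print(bounds)
```

The result has one row per variable, so a single value can be read with, for example, bounds.loc['LOAN', 'Lower'].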
In my data, I have this column "price_range".
Dummy dataset:
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
I am using pandas. What is the most efficient way to get the upper and lower bounds of the price range in separate columns?
Alternatively, you can parse the string directly (if you want the limits for each row rather than the total range):
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146']})
def get_lower_limit(some_string):
    a = some_string.split(' - ')
    return int(a[0].split('€')[-1])
def get_upper_limit(some_string):
    a = some_string.split(' - ')
    return int(a[1].split('€')[-1])
df['lower_limit'] = df.price_range.apply(get_lower_limit)
df['upper_limit'] = df.price_range.apply(get_upper_limit)
Output:
Out[153]:
   price_range  lower_limit  upper_limit
0     €4 - €25            4           25
1     €3 - €14            3           14
2   €25 - €114           25          114
3  €112 - €146          112          146
You can do the following. First create two extra columns lower and upper which contain the lower bound and the upper bound from each row. Then find the minimum from the lower column and maximum from the upper column.
df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114', '€112 - €146', 'No pricing available']})
df.loc[df.price_range != 'No pricing available', 'lower'] = df['price_range'].str.split('-').str[0]
df.loc[df.price_range != 'No pricing available', 'upper'] = df['price_range'].str.split('-').str[1]
df['lower'] = df.lower.str.replace('€', '').astype(float)
df['upper'] = df.upper.str.replace('€', '').astype(float)
price_range = [df.lower.min(), df.upper.max()]
Output:
>>> price_range
[3.0, 146.0]
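Both bounds can also be pulled out in a single pass with a regular expression. A sketch, assuming every priced row follows the '€number - €number' pattern shown in the dummy data; rows that do not match, such as 'No pricing available', become NaN:

```python
import pandas as pd

df = pd.DataFrame({'price_range': ['€4 - €25', '€3 - €14', '€25 - €114',
                                   '€112 - €146', 'No pricing available']})

# Two capture groups: the digits after the first and the second euro sign.
bounds = df['price_range'].str.extract(r'€(\d+)\s*-\s*€(\d+)', expand=True)
bounds.columns = ['lower', 'upper']

# Convert the extracted strings to floats and attach them to the frame.
df = df.join(bounds.astype(float))

# Overall range across all priced rows (min/max skip NaN).
price_range = [df['lower'].min(), df['upper'].max()]
print(price_range)
```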
The issue:
When computing a two-way nested ANOVA with statsmodels, the results do not match the corresponding results from R (the formulas and data are the same).
Sample:
We use "atherosclerosis" dataset from here: https://stepik.org/media/attachments/lesson/9250/atherosclerosis.csv.
To get nested data we replace dose values for age == 2:
df['dose'] = np.where((df['age']==2) & (df['dose']=='D1'),'D3', df.dose)
df['dose'] = np.where((df['age']==2) & (df['dose']=='D2'),'D4', df.dose)
So the dose factor is nested in age: values D1 and D2 occur only in the first age group, and values D3 and D4 occur only in the second.
After getting ANOVA table we have the results below:
mod = ols('expr~age/C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(mod, typ=1); anova_table
[Screenshot: ANOVA table from statsmodels]
The total of the 'sum_sq' column = 1590.257424 + 47.039636 + 197.452754 = 1834.7498139999998, which is NOT equal to the correct total sum of squares (computed below) = 1805.5494956433238
grand_mean = df['expr'].mean()
ssq_t = sum((df.expr - grand_mean)**2)
Expected Output:
Let's try to get ANOVA table in R:
df <- read.csv(file = "/mnt/storage/users/kseniya/platform-adpkd-mrwda-aim-imaging/mrwda_training/data_samples/athero_new.csv")
nest <- aov(df$expr ~ df$age / factor(df$dose))
print(summary(nest))
The results:
[Screenshot: ANOVA summary from R]
Why are they not equal? The formulas are the same. Are there any mistakes in how statsmodels computes the ANOVA?
The results from R seem to be right, because the total sum 197.5 + 17.8 + 1590.3 = 1805.6 is equal to the total sum computed manually.
The degrees of freedom aren't equal. I suspect that the model definition is not really the same between OLS and R. Since lm(y ~ x/z, data) is just a shortcut for lm(y ~ x + x:z, data), I would use the expanded formulation and double-check that your data are the same. Also use lm instead of aov. The behaviour of the Python and R implementations should then be more similar.
Also, the behavior of C() in Python does not seem to be the same as that of the factor() cast in R.
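One way to check which table is internally consistent is to compute the nested sums of squares by hand and compare them with the total. A minimal sketch using only pandas, on a small made-up data frame (not the atherosclerosis data) that reuses the column names from the question; for a nested design the age, dose-within-age, and residual sums of squares should add up to the total:

```python
import pandas as pd

# Made-up data with dose nested in age
# (D1/D2 occur only in age 1, D3/D4 only in age 2).
df = pd.DataFrame({
    "age":  [1, 1, 1, 1, 2, 2, 2, 2],
    "dose": ["D1", "D1", "D2", "D2", "D3", "D3", "D4", "D4"],
    "expr": [100.0, 104.0, 110.0, 106.0, 98.0, 102.0, 95.0, 101.0],
})

grand_mean = df["expr"].mean()
# Per-row group means, aligned with the original rows.
age_mean = df.groupby("age")["expr"].transform("mean")
cell_mean = df.groupby(["age", "dose"])["expr"].transform("mean")

ss_total = ((df["expr"] - grand_mean) ** 2).sum()
ss_age = ((age_mean - grand_mean) ** 2).sum()          # between ages
ss_dose_in_age = ((cell_mean - age_mean) ** 2).sum()   # doses within each age
ss_resid = ((df["expr"] - cell_mean) ** 2).sum()       # within age/dose cells

print(ss_age, ss_dose_in_age, ss_resid, ss_total)
```

With two ages and four doses treated as categorical, the matching degrees of freedom would be 1 for age, 2 for dose within age, and n - 4 for the residual, so a table with a different split is probably fitting a different model.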
When I try to remove outliers from my dataset, I get this warning:
Code
def remout(df):
    Q1 = df.quantile(0.02)
    Q3 = df.quantile(0.98)
    IQR = Q3 - Q1
    df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
    return df
df=remout(df)
df
Warning Message
FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right`
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
First, you're not specifying which column you want to remove outliers from. If your DataFrame has only one column, that's fine, but in general the function should take a DataFrame and a single column as input.
Second, the interquartile range (IQR) is defined as the difference between the 75th percentile and the 25th percentile, so the values you used for Q1 and Q3 (the 2nd and 98th percentiles) are not quartiles.
Third, there is a problem with the logic in your function. The following version solves the problem without any FutureWarning:
def remout(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    print(Q1, Q3, IQR)
    df = df[(df[col] > (Q1 - 1.5 * IQR)) & (df[col] < (Q3 + 1.5 * IQR))]
    return df
df=remout(df, 'col_name')
Replace col_name with the name of the column you want to remove outliers from.
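If the filter should still apply to several columns at once, as in the question, the warning can likely be avoided by making sure the quantile Series and the DataFrame being compared have exactly the same columns. A sketch, assuming the mismatch comes from non-numeric columns in df; the sample frame and the function name remout_all are made up for illustration:

```python
import pandas as pd

def remout_all(df):
    # Restrict the comparison to numeric columns so that the bounds
    # (Series indexed by column name) line up with the frame's columns.
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # lt/gt with axis=1 compare each column against its own bound.
    outlier = num.lt(lower, axis=1) | num.gt(upper, axis=1)
    # Keep rows where no numeric column is flagged.
    return df[~outlier.any(axis=1)]

# Made-up frame with one text column and one obvious outlier in x.
sample = pd.DataFrame({
    "name": list("abcdef"),
    "x": [1, 2, 3, 4, 5, 100],
    "y": [10, 11, 12, 13, 14, 15],
})
cleaned = remout_all(sample)
print(cleaned)
```

The text column is left untouched and is still present in the returned frame; only the numeric columns take part in the outlier test.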
I am removing outliers from my dataset and wanted to get some thoughts on efficient methods.
I am currently using IQR to filter out any outliers in my data as below:
Q1 = df.grades.quantile(0.25)
Q3 = df.grades.quantile(0.75)
IQR = Q3 - Q1
The grades column in my df contains the values from which I want to remove the outliers.
I have previously used the code below to do this, but how can I edit it so that it only considers the grades column (df.grades) and not the whole df?
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
Thanks!
I think you need to remove any and test the grades column, like:
df = df[~((df.grades < (Q1 - 1.5 * IQR)) | (df.grades > (Q3 + 1.5 * IQR)))]
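The same single-column filter can be written with Series.between, which is inclusive on both ends by default and so keeps the values that are neither below the lower fence nor above the upper one. A small sketch with made-up grades; note that, unlike the negated condition above, between also drops rows where grades is NaN:

```python
import pandas as pd

# Made-up grades with one obvious outlier.
df = pd.DataFrame({"grades": [55, 60, 62, 65, 68, 70, 72, 150]})

Q1 = df.grades.quantile(0.25)
Q3 = df.grades.quantile(0.75)
IQR = Q3 - Q1

# Keep rows whose grade lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
filtered = df[df.grades.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
print(filtered)
```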
I'm trying to extract the outliers from a pandas DataFrame using the interquartile range, but it raised "TypeError: Invalid comparison between dtype=float64 and str". I don't see where a string is involved: when I checked the things I was comparing, upp_iqr and low_iqr were pandas.core.series.Series and prices_list was a pandas.core.frame.DataFrame. The DataFrame has only numbers in its cells. This is my code:
prices_list = df.filter(regex='Price')
q1 = prices_list.quantile(0.25)
q3 = prices_list.quantile(0.75)
iqr = q3 - q1
low_iqr = q1 - (1.5 * iqr)
upp_iqr = q3 + (1.5 * iqr)
if any(x > low_iqr for x in prices_list):
    print("check_outlier: False")
    logging.debug(print(price_outliers))
else:
    print("check_outlier: False")
    logging.debug(print(price_outliers))
I expected it to return a DataFrame with the outlier values and their positions. Any help would be appreciated. Thank you!
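A likely cause of the TypeError: iterating over a DataFrame yields its column labels, which are strings, so x > low_iqr compares a string with a Series of floats. A sketch of a column-wise comparison that returns the outlying values together with their positions; the sample frame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame; only the 'Price' naming pattern comes from the question.
df = pd.DataFrame({
    "Price_A": [10.0, 11.0, 12.0, 13.0, 500.0],
    "Price_B": [20.0, 21.0, 22.0, 23.0, 24.0],
    "Other":   [1, 2, 3, 4, 5],
})

prices_list = df.filter(regex="Price")
q1 = prices_list.quantile(0.25)
q3 = prices_list.quantile(0.75)
iqr = q3 - q1
low_iqr = q1 - 1.5 * iqr
upp_iqr = q3 + 1.5 * iqr

# Boolean frame: True where a cell lies outside its own column's fences.
mask = prices_list.lt(low_iqr, axis=1) | prices_list.gt(upp_iqr, axis=1)

# Row/column positions of the flagged cells, plus their values.
rows, cols = np.nonzero(mask.to_numpy())
price_outliers = pd.DataFrame({
    "row": prices_list.index[rows],
    "column": prices_list.columns[cols],
    "value": prices_list.to_numpy()[rows, cols],
})
print("check_outlier:", bool(mask.to_numpy().any()))
print(price_outliers)
```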