Hi, I'm trying to remove outliers from columns with numerical features, but when I execute my code the whole dataset is removed. Can anyone tell me what I'm doing wrong, please?
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
print('Number of rows before discarding outlier = %d' % (data.shape[0]))
for i in numerical_columns:
    q1 = data[i].quantile(0.25)
    q3 = data[i].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    data = data.loc[(data[i] > fence_low) & (data[i] < fence_high)]
print('Number of rows after discarding outlier = %d' % (data.shape[0]))
The code below has worked for me. Here col is the numerical column of the dataframe from which you need to remove outliers:
import numpy as np
# Remove outliers: keep only the values that are within +3 to -3
# standard deviations of the column mean
df = df[np.abs(df[col] - df[col].mean()) <= (3 * df[col].std())]
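As for why your version empties the dataset: you filter sequentially, and any numeric column with zero IQR (an ID-like or mostly constant column, say) collapses both fences onto the same value, so the strict inequalities drop every remaining row. A minimal sketch that builds one combined mask instead (assuming data is your original dataframe):
import pandas as pd
# Build one boolean mask across all numerical columns, then filter once
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
mask = pd.Series(True, index=data.index)
for col in numerical_columns:
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # constant-ish column: the fences collapse to a point, so skip it
        continue
    mask &= data[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
data = data[mask]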
I have a dataframe with the columns:
[id, range_start, range_end, score]
If two rows' ranges overlap by x percent, I retain the row with the higher score. However, I am confused about how to pull out rows with no overlap to other ranges. I am using a nested loop and recursion to condense overlapping ranges into a new dataframe, but this structure causes all rows to be retained when I am looking for the non-overlapping rows.
## This is my function to recursively select the highest scoring overlapping regions
import pandas as pd

def overlap_retention(df_overlap, threshold, df_nonoverlap=None):
    if df_nonoverlap is None:
        df_nonoverlap = pd.DataFrame()
    x = df_overlap  # rows to compare on this pass
    df_overlap = pd.DataFrame()
    for index, row in x.iterrows():
        rs = row['range_start']
        re = row['range_end']
        ## Silly nested loop to compare ranges between all rows
        for index2, row2 in x.drop(index).iterrows():
            rs2 = row2['range_start']
            re2 = row2['range_end']
            readRegion = [*range(rs, re, 1)]
            refRegion = [*range(rs2, re2, 1)]
            regionUnion = set(readRegion).intersection(set(refRegion))
            overlap_length = len(regionUnion)
            overlap_min = min(rs, rs2)
            overlap_max = max(re, re2)
            overlap_full_range = overlap_max - overlap_min
            overlap_percentage = (overlap_length / overlap_full_range) * 100
            ## Check if they overlap by the threshold percentage and retain the higher score
            if overlap_percentage > threshold:
                evalue = row['score']
                evalue_2 = row2['score']
                if evalue_2 > evalue:
                    df_overlap = df_overlap.append(row2)
                else:
                    df_overlap = df_overlap.append(row)
            # ----------------------------------------------------------
            ## How to find non-overlapping rows without pulling everything?
            else:
                df_nonoverlap = df_nonoverlap.append(row)
    # ---------------------------------------------
    ### Recursion here to condense the overlapped list further
    if len(df_overlap) > 1:
        return overlap_retention(df_overlap, threshold, df_nonoverlap)
    else:
        return df_nonoverlap
An example input is below:
data = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'range_start': [1, 12, 11, 1, 20, 10],
        'range_end': [4, 15, 15, 6, 23, 16],
        'score': [3, 1, 8, 2, 5, 1]}
input = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
The desired output changes with the overlap threshold. In the example above, id1 and id4 may both be retained, or only id1, depending on the threshold:
data = {'id': ['id1', 'id3', 'id5'],
        'range_start': [1, 11, 20],
        'range_end': [4, 15, 23],
        'score': [3, 8, 5]}
output = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
You can make a cartesian join between all the ranges, then find the length and percentage of the overlap for each pair, and filter the pairs based on the x_overlap threshold.
After that, for each range we can find the overlapping range with the highest score (which could be the range itself, with an overlap of 100%):
# set min overlap parameter
x_overlap = 0.5
# cartesian join all ranges
z = df.assign(k=1).merge(
    df.assign(k=1), on='k', suffixes=['_1', '_2'])
# find lengths of overlaps
z['len_overlap'] = (
    z[['range_end_1', 'range_end_2']].min(axis=1) -
    z[['range_start_1', 'range_start_2']].max(axis=1)).clip(0)
# we're only interested in cases where ranges overlap, so the total
# range is the range between min(start1, start2) and max(end1, end2)
z['len_total'] = (
    z[['range_end_1', 'range_end_2']].max(axis=1) -
    z[['range_start_1', 'range_start_2']].min(axis=1)).clip(0)
# find % overlap and keep only the pairs above the threshold
# (these include 'pairs' where a range is paired with itself)
z['pct_overlap'] = z['len_overlap'] / z['len_total']
z = z[z['pct_overlap'] > x_overlap]
# for each range find an overlapping range with the highest score
# (could be the range itself)
z = z.sort_values('score_2').groupby('id_1')['id_2'].last()
# filter the inputs
df_out = df[df['id'].isin(z)]
df_out
Output:
id range_start range_end score
0 id1 1 4 3
2 id3 11 15 8
4 id5 20 23 5
P.S. Please note that it is not very clear what should happen with id4 in your example. Since you don't have it in your output, I assumed (hopefully correctly) that you're not interested in zero-length ranges in the output.
P.P.S. There is a new syntax for a cartesian join in pandas 1.2.0+, using the how='cross' parameter of the merge method. In my answer I've used the version with a dummy variable k=1, which is more verbose but compatible with older versions.
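For reference, a sketch of the cross-join version (assuming pandas >= 1.2); the rest of the code stays the same:
# pandas 1.2.0+: cartesian join without the dummy k column
z = df.merge(df, how='cross', suffixes=['_1', '_2'])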
I think you need a very clear definition of overlap. If you have [2;7], [6;10] and [7;8], which one overlaps with which one?
Avoid using input as a variable name; it shadows the built-in input() function (used to read input from the user).
If you want to select clear overlaps (only the start or the end differs), and each row has at most ONE overlap, here you go:
sorted_df = df.sort_values(by=["range_start"])
starts_earlier = sorted_df[sorted_df.range_end.shift(-1) == sorted_df.range_end]
sorted_df = df.sort_values(by=["range_end"])
ends_earlier = sorted_df[sorted_df.range_start.shift(-1) == sorted_df.range_start]
Then you can do df.drop(starts_earlier.index) and df.drop(ends_earlier.index) to remove the shorter ones.
df.shift(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
This code won't work for multiple overlapping segments. If you are interested in that, let me know.
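Putting the two masks together, a minimal sketch under the same at-most-one-overlap assumption:
# drop both kinds of shorter duplicates in one go
df_clean = df.drop(starts_earlier.index).drop(ends_earlier.index)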
I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
This is how it looks:
  Country  Month  Sales  Recovery
0     USA      6    100       0.0
1     USA      7    200       1.0
2     USA      8      0       1.5
3     USA      9      0       2.5
4     USA     10      0       3.0
Now, I want to add a "Predicted" column, resulting in this dataframe:
  Country  Month  Sales  Recovery  Predicted
0     USA      6    100       0.0        100
1     USA      7    200       1.0        200
2     USA      8      0       1.5        300
3     USA      9      0       2.5        500
4     USA     10      0       3.0        600
The first value, 300 in row 3, is basically 200 * 1.5/1. This becomes the base value going forward, so the next value, 500, is 300 * 2.5/1.5, and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd

test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})

test['Prediction'] = test['Sales']
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i-1, 'Recovery'] != 0:
        test.loc[i, 'Prediction'] = test.loc[i-1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i-1, 'Recovery']
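On the example data this fills the column as described in the question:
print(test['Prediction'].tolist())
# expected values: 100, 200, 300, 500, 600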
The sequence you have is simply Recovery * the base level (Sales = 200).
You can compute that sequence like this:
valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")
And then combine by index, insert column or concat:
pd.concat([test, prediction], axis=1)
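For completeness, a quick check of what this yields on the example data (note that the first row becomes 0 here, since Recovery is 0 there, whereas the loop above keeps the original Sales):
result = pd.concat([test, prediction], axis=1)
print(result['Predicted'].tolist())
# expected values: 0.0, 200.0, 300.0, 500.0, 600.0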
I am styling pandas columns to highlight values green or red when they are more than 2*std above or below the corresponding column's mean. But when I loop over to the next column, the previous work is essentially deleted and only the last column shows any changes.
Function:
def color_outliers(value):
    if value <= (mean - (2*std)):
        color = 'red'
    elif value >= (mean + (2*std)):
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color
Code:
comp_holder = []
titles = []
for name in names:
    titles.append(name)

# Number of articles and days of search
num_days = len(page_list[0]['items']) - 2
num_arts = len(titles)

# Sets index of dataframe to be timestamps of articles
for days in range(num_days):
    comp_dict = {'timestamp(YYYYMMDD)': int(int(page_list[0]['items'][days]['timestamp'])/100)}
    # Adds each article from the current day in the loop to the dictionary for row append
    for arts in range(num_arts):
        comp_dict[titles[arts]] = page_list[arts]['items'][days]['views']
    comp_holder.append(comp_dict)
comp_df = pd.DataFrame(comp_holder)

outliers = comp_df
for arts in range(num_arts):
    mean = comp_df[titles[arts]].mean()
    std = comp_df[titles[arts]].std()
    outliers = comp_df.style.applymap(color_outliers, subset=[titles[arts]])
Each time I go through this for loop, the 'outliers' styling dataframe resets itself and only works on the current subset; if I remove the subset, it uses one mean and std for the entire dataframe. I have tried style.apply with axis=0, but I can't get it to work.
My dataframe consists of 21 columns: the first is the timestamp and the next twenty are columns of ints based upon input files. I also have two lists, indexed from 0 to 19, holding the mean and std of each column.
I would apply on the whole column instead of using applymap. I'm not sure I can follow your code since I don't know what your data looks like, but this is what I would do:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 100, [10, 3]))

# compute the statistics
stats = df.agg(['mean', 'std'])

# format function on columns
def color_outlier(col, thresh=2):
    # extract mean and std of the column
    mean, std = stats[col.name]
    return np.select((col <= mean - std*thresh, col >= mean + std*thresh),
                     ('color: red', 'color: green'),
                     'color: black')

# thresh changed for demonstration; remove when used
df.style.apply(color_outlier, thresh=0.5)
Output: a styled dataframe with low outliers colored red and high outliers colored green.
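Applied to your own frame, the same idea would look something like this (a sketch reusing the comp_df and titles names from your code; the timestamp column is left unstyled via subset):
# statistics over the twenty article columns only; color_outlier reads this global
stats = comp_df[titles].agg(['mean', 'std'])
outliers = comp_df.style.apply(color_outlier, subset=titles)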
Using Python pandas, I am trying to write a condition that matches two columns from two different Excel files having the same column name but different numerical values. For each column there are 2000 rows to match.
The condition for the final value:
if File1(column1value) - File2(column1value) == 0, then update the value with 1;
if File1(column1value) - File2(column1value) <= 0.2, then keep File1(column1value);
if File1(column1value) - File2(column1value) > 0.2, then update the value with 0.
https://i.stack.imgur.com/Nx3WA.jpg
import pandas as pd

df1 = pd.read_excel('file_name1')  # get input from excel files
df2 = pd.read_excel('file_name2')

p1 = df1['p1'].values
p11 = df2['p11'].values

new_col = []  # we will store the desired values here
for i in range(len(p1)):
    if p1[i] - p11[i] == 0:
        new_col.append(1)
    elif abs(p1[i] - p11[i]) > 0.2:
        new_col.append(0)
    else:
        new_col.append(p1[i])
df1['new_column'] = new_col  # we add a new column with our values
You can also remove the old column with df.drop('column', axis=1).
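With 2000 rows per column the loop is fine, but the same logic can also be vectorized; a sketch with numpy.select, using the same column names as above:
import numpy as np
diff = df1['p1'] - df2['p11']
df1['new_column'] = np.select(
    [diff == 0, diff.abs() > 0.2],  # conditions, checked in order
    [1, 0],                         # replacement value for each condition
    default=df1['p1'])              # otherwise keep File1's value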
I am having performance issues with iterrows on my dataframe as I start to scale up my data analysis.
Here is the current loop that I am using.
dl = []
for ii, i in a.iterrows():
    for ij, j in a.iterrows():
        if ii != ij:
            if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
            elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
c = a.drop(a.index[dl])
The point of the loop is to find 'DOCNO' values that differ in the dataframe but are known to be equivalent, denoted by five characters that match but are positioned differently in the string. When a match is found, I want to drop the row with the smaller number in the associated 'RSLTN1' column. Additionally, my dataset may have multiple entries for a unique 'DOCNO', and there too I want to drop the rows with the lower 'RSLTN1' result.
I was successful running this with small quantities of data (~1000 rows), but as I scale up 10x I am running into performance issues. Any suggestions?
Sample from dataset
In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]:
DOCNO RSLTN1
6815 MP00064958 72386.0
218 MP0059189A 65492.0
8262 MP00066187 96497.0
2999 MP00061663 43677.0
4913 MP00063387 42465.0
How does this fit your needs?
import pandas as pd
from io import StringIO

s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''

# Recreate dataframe
df = pd.read_csv(StringIO(s), sep=r'\s+')
# Create mask
# We sort to make sure we keep only highest value
# Remove all non-digit according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1', ascending=False)['DOCNO']
       .str.extract(r'(\d+)', expand=False)
       .astype(int).duplicated())
# Apply inverted `~` mask
df = df.loc[~m]
Resulting df:
DOCNO RSLTN1
0 MP00059189 72386.0
2 MP00066187 96497.0
3 MP00061663 43677.0
4 MP00063387 42465.0
In this example the following row was removed:
MP0059189A 65492.0
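An equivalent groupby-based sketch (same assumption: the digits-only version of DOCNO is the grouping key), keeping the highest RSLTN1 per normalized DOCNO:
# normalize DOCNO to its digits, then keep the max-RSLTN1 row per key
key = df['DOCNO'].str.extract(r'(\d+)', expand=False).astype(int)
df_out = df.loc[df.groupby(key)['RSLTN1'].idxmax()]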