How to apply floor and cap values to remove outliers - Python

How do I calculate the 99th and 1st percentiles as a cap and floor for each column? If a value >= the 99th percentile, redefine it as the 99th-percentile value; similarly, if a value <= the 1st percentile, redefine it as the 1st-percentile value.
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')
If we have hundreds of columns, can we use an apply function instead of a loop?

Based on Abdou's answer, the following might save you some time:
for col in df.columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    df.loc[df[col] <= percentiles[0], col] = percentiles[0]
    df.loc[df[col] >= percentiles[1], col] = percentiles[1]
or use numpy.clip:
import numpy as np
for col in df.columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    df[col] = np.clip(df[col], percentiles[0], percentiles[1])
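If you really have hundreds of numeric columns, you can also avoid the Python-level loop entirely by computing the percentiles for all columns at once and letting DataFrame.clip align them column-wise. This is only a sketch, assuming every column you want to cap is numeric (the string column lrnval is excluded via select_dtypes):
import numpy as np
import pandas as pd

num_cols = df.select_dtypes(include='number').columns
low = df[num_cols].quantile(0.01)    # 1st percentile of each column (a Series)
high = df[num_cols].quantile(0.99)   # 99th percentile of each column (a Series)
# clip aligns the two Series with the columns, so no explicit loop is needed
df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)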

You can first define a helper function that takes in as arguments a series and a value and changes that value according to the conditions mentioned above:
def scale_val(s, val):
    percentiles = s.quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Then you can use pd.DataFrame.apply and pd.Series.apply:
df.apply(lambda s: s.apply(lambda v: scale_val(s,v)))
Please note that this may be a somewhat slow solution if you are dealing with a large amount of data, but I would suggest you give it a shot and see if it solves your problem within a reasonable time.
Edit:
If you only want to get the percentiles from the rows of df where the column lrnval is equal to "learning", you can modify the function to calculate the percentiles for only the rows where that condition is true:
def scale_val2(s, val):
    percentiles = s[df.lrnval.eq('learning')].quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Since there is a column that contains strings, I assume that you won't be doing any calculations on it. So, I would change the code as follows:
df.drop(columns='lrnval').apply(lambda s: s.apply(lambda v: scale_val2(s, v)))
I hope this proves useful.
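Since scale_val2 recomputes the percentiles for every single value, a faster variant (just a sketch under the same setup, not part of the original answer) computes the learning-row percentiles once per column and then clips:
learning = df['lrnval'].eq('learning')
num_cols = df.columns.drop('lrnval')
low = df.loc[learning, num_cols].quantile(0.01)
high = df.loc[learning, num_cols].quantile(0.99)
# cap every row (learning and validation) with the percentiles estimated on the learning rows
df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)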


Apply custom function to entire dataframe

I have a function that calls another one. The objective is to use the function get_substr to extract a substring based on the position of the nth occurrence of a character:
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    if n == 1:
        return string[0:find_nth(string, char, n)]
    else:
        return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it as df_g['EQ'] exists.
Can you help me?
Thanks
You forgot about axis=1; without it the function is applied to each column rather than each row. Consider a simple example:
import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
print(df)
output:
   A  B    Z
0  1  3  100
1  2  4  200
As a side note, if you are working with values from a single column, you might use pandas.Series.apply rather than pandas.DataFrame.apply; in the above example it would mean
df['Z'] = df['A'].apply(lambda x:x*100)
in place of
df['Z'] = df.apply(lambda x:x['A']*100,axis=1)
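Applied to the question, that means adding axis=1 (or, since only the EQ column is used, switching to Series.apply); a sketch assuming df_g is as described:
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'], '-', 1), axis=1)
# or, equivalently, working on the single column:
df_g['F'] = df_g['EQ'].apply(lambda s: get_substr(s, '-', 1))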

Fast way to iterate through a list, find duplicates and perform calculations

I have two lists, one of areas and one of prices which are the same size.
For example:
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
For each entry, I need to return the list of prices of the other entries with the same area, excluding the entry itself.
So for 1500, the lists that I need returned are: [750] and [200]
For the 2000, the lists that I need returned are [600,1000], [800,1000] and [800,600]
For the 1800 and 500, the lists I need returned are both empty lists [].
The goal is then to flag a price as an outlier if the absolute difference between that price and the mean (calculated excluding the price itself) exceeds 5 times the population standard deviation (also calculated excluding the price itself).
import statistics
area = [1500,2000,2000,1800,2000,1500,500]
price = [200,800,600,800,1000,750,200]
outlier_idx = []
for idx, val in enumerate(area):
    comp_idx = [i for i, x in enumerate(area) if x == val]
    comp_idx.remove(idx)
    comp_price = [price[i] for i in comp_idx]
    if len(comp_price) > 2:
        sigma = statistics.stdev(comp_price)
        p_m = statistics.mean(comp_price)
        if abs(price[idx] - p_m) > 5 * sigma:
            outlier_idx.append(idx)
area = [i for j, i in enumerate(area) if j not in outlier_idx]
price = [i for j, i in enumerate(price) if j not in outlier_idx]
The problem is that this calculation takes up a lot of time and I am dealing with arrays that can be quite large.
I am stuck as to how I can increase the computational efficiency.
I am open to using numpy, pandas or any other common packages.
Additionally, I have tried the problem in pandas:
df['p-p_m'] = ''
df['sigma'] = ''
df['outlier'] = False
for name, group in df.groupby('area'):
    if len(group) > 1:
        idx = list(group.index)
        for i in range(len(idx)):
            tmp_idx = idx.copy()
            tmp_idx.pop(i)
            df['p-p_m'][idx[i]] = abs(group.price[idx[i]] - group.price[tmp_idx].mean())
            df['sigma'][idx[i]] = group.price[tmp_idx].std(ddof=0)
            if df['p-p_m'][idx[i]] > 3 * df['sigma'][idx[i]]:
                df['outlier'][idx[i]] = True
Thanks.
This code shows how to create the list for each area:
df = pd.DataFrame({'area': area, 'price': price})
price_to_delete = [item for idx_array in df.groupby('price').groups.values() for item in idx_array[1:]]
df.loc[price_to_delete, 'price'] = None
df = df.groupby('area').agg(lambda x: [] if all(x.isnull()) else x.tolist())
df
I don't understand exactly what you want, but this part calculates the outliers for each price within each area:
df['outlier'] = False
df['outlier'] = df['price'].map(lambda x: abs(np.array(x) - np.mean(x)) > 3*np.std(x) if len(x) > 0 else [])
df
I hope this helps you in some way!
Here is a solution that combines Numpy and Numba. Although correct, I did not test it against alternative approaches regarding efficiency, but Numba usually results in significant speedups for tasks that require looping through data. I have added one extra point which is an outlier, according to your definition.
import numpy as np
from numba import jit

# data input
price = np.array([200, 800, 600, 800, 1000, 750, 200, 2000])
area = np.array([1500, 2000, 2000, 1800, 2000, 1500, 500, 1500])

@jit(nopython=True)
def outliers(price, area):
    is_outlier = np.full(len(price), False)
    for this_area in set(area):
        indexes = area == this_area
        these_prices = price[indexes]
        for this_price in set(these_prices):
            arr2 = these_prices[these_prices != this_price]
            if arr2.size > 1:
                std = arr2.std()
                mean = arr2.mean()
                indices = (this_price == price) & (this_area == area)
                is_outlier[indices] = np.abs(mean - this_price) > 5 * std
    return is_outlier
>>> outliers(price, area)
array([False, False, False, False, False, False, False,  True])
The code should be fast in case you have several identical price levels for each area, as they will be updated all at once.
I hope this helps.
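If the pure-Python or Numba loops are still too slow, a loop-free pandas variant (a sketch of my own, not from the original answers, using the area and price lists from the question) computes the leave-one-out mean and population standard deviation per area from group sums; note the original loop uses statistics.stdev (the sample version), so results can differ slightly:
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': area, 'price': price})
grp = df.groupby('area')['price']
n = grp.transform('size')
s = grp.transform('sum')
q = (df['price'] ** 2).groupby(df['area']).transform('sum')

# leave-one-out mean and population std of the other prices in the same area
# (groups of size 1 produce inf/NaN here, but the mask below excludes them)
loo_mean = (s - df['price']) / (n - 1)
loo_var = (q - df['price'] ** 2) / (n - 1) - loo_mean ** 2
loo_std = np.sqrt(loo_var.clip(lower=0))

# mirror the original condition: only test entries with more than 2 comparison prices
df['outlier'] = ((n - 1) > 2) & ((df['price'] - loo_mean).abs() > 5 * loo_std)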

Find int values in a numpy array that are "close in value" and combine them

I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to get values that are "close" (in this instance, 11899 and 11879) and average them, then replace them with a single instance of the new number resulting in this:
[10620.5, 11889, 13017, 11610.5]
the term "close" would be configurable. let's say a difference of 50
the purpose of this is to create Spans on a Bokah graph, and some lines are just too close
I am super new to python in general (a couple weeks of intense dev)
I would think that I could arrange the values in order, and somehow grab the one to the left, and right, and do some math on them, replacing a match with the average value. but at the moment, I just dont have any idea yet.
Try something like this; I added a few extra steps just to show the flow.
The idea is to group the data into adjacent groups and decide whether to merge them based on how spread out they are.
So, as you describe, you can combine your data in sets of 3 numbers, and if the difference between the max and min is less than 50 you average them; otherwise you leave them as is.
import pandas as pd
import numpy as np
arr = np.ravel([1,24,5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()
def reshape_arr(a, n):  # n is the number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit in the array will be excluded from averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):  # close = how close adjacent numbers need to be in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):  # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a

arr, container = reshape_arr(arr, 3)
arr = get_mean(arr, 5)
final_list(arr, container)
You could use fuzzywuzzy here to gauge the ratio of closeness between 2 data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag is not False:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
This will do it if I call it like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame that I am using and supres_df is another DataFrame that I am massaging before I apply it to the main one.
It works, but it is extremely slow, so I am trying to optimize it now.
I added a while loop because after averaging, the averages can become close enough to average out again, so I will loop again, until it doesn't need to average anymore. This is total newbie work, so if you see something silly, please comment.
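For a faster, loop-free take on the same idea (a sketch of my own, not part of the answers above), you can sort the values once, start a new group wherever the gap to the previous value reaches the threshold, and average each group. Note this merges transitively along chains of small gaps and returns the merged values in sorted order, which differs slightly from the pairwise replacement loop above:
import numpy as np

def merge_close(values, close=50):
    v = np.sort(np.asarray(values, dtype=float))
    # start a new group whenever the gap to the previous sorted value is >= close
    group_ids = np.concatenate(([0], np.cumsum(np.diff(v) >= close)))
    return np.array([v[group_ids == g].mean() for g in np.unique(group_ids)])

merge_close([10620.5, 11899, 11879.5, 13017, 11610.5], close=50)
# the two close values 11899 and 11879.5 are merged into their average 11889.25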

Replace NaN with a random value every row

I have a dataset with a column 'Self_Employed'. This column contains the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value that is calculated in calc(). I've tried some methods I found on here, but I couldn't find one that was applicable to me.
Here is my code; I put the things I've tried in comments:
# Handling missing data - Self_employed
from random import randint

SEyes = (df['Self_Employed']=='Yes').sum()
SEno = (df['Self_Employed']=='No').sum()

def calc():
    rand_SE = randint(0, (SEno + SEyes))
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'

# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())

# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i])

df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df
Now the line where I tried it with df_nan seems to work, but then I have a separate set with only the former missing values, whereas I want to fill the missing values in the whole dataset. For the last line I'm getting an error; I linked to a screenshot of it.
Do you understand my problem and if so, can you help?
(Screenshots linked in the original question: the subset with only the rows where Self_Employed is NaN, the original dataset, and the error message.)
Make sure that SEno + SEyes != 0.
Use the .loc method to set the value of Self_Employed where it is empty:
import numpy as np

SEyes = (df['Self_Employed']=='Yes').sum() + 1
SEno = (df['Self_Employed']=='No').sum()

def calc():
    rand_SE = np.random.randint(0, (SEno + SEyes))
    if rand_SE >= 81:
        return 'No'
    else:
        return 'Yes'

df.loc[df['Self_Employed'].isna(), 'Self_Employed'] = df.loc[df['Self_Employed'].isna(), 'Self_Employed'].apply(lambda x: calc())
What about df['Self_Employed'] = df['Self_Employed'].fillna(calc())?
You could first identify the locations of your NaNs like
na_loc = df.index[df['Self_Employed'].isnull()]
Count the amount of NaNs in your column like
num_nas = len(na_loc)
Then generate an according amount of random numbers, readily indexed and set up
fill_values = pd.DataFrame({'Self_Employed': [random.randint(0,100) for i in range(num_nas)]}, index = na_loc)
And finally substitute those values in your dataframe
df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
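If the goal is to fill each missing row with an independent 'Yes'/'No' draw that follows the observed proportions, here is a sketch combining the ideas above, assuming df, SEyes and SEno are defined as in the question:
import numpy as np

na_loc = df.index[df['Self_Employed'].isnull()]
p_yes = SEyes / (SEyes + SEno)  # observed share of 'Yes' answers
# one independent draw per missing row, weighted by the observed proportions
df.loc[na_loc, 'Self_Employed'] = np.random.choice(
    ['Yes', 'No'], size=len(na_loc), p=[p_yes, 1 - p_yes])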

Pandas - how to change cell according to next 10 cell's average

I have a dataset that I am trying to clean up. The data is all numeric. Basically, if there is a cell that is below 0 or above 100, I want to set it to NaN. I solved this with this code:
for col in df:
    df.loc[df[col] < 0, col] = numpy.NaN
    df.loc[df[col] > 100, col] = numpy.NaN
For values above 0 but below 20, I need to check the 10 cells above and below. If the value differs by more than 20 from the average of either the 10 cells above or the 10 cells below in the same column, it should also be set to numpy.NaN.
I am not sure how to go about this one quite yet. After reading the documentation, I know that I can pass a function into df.loc[] that returns a boolean list. However, I am not sure how to access the passed-in value's index to check the 10 values above and below. I think it could look something like this, but I am not even sure this would produce a boolean list the way pd.DataFrame.loc[] wants it.
def myFunc(value):
    # access index and create avgs for both tenBefore and tenAfter
    if abs(tenBeforeAvg - value) > 20 or abs(tenAfterAvg - value) > 20:
        return False
    else:
        return True

for col in df:
    df.loc[df[col] < 0, col] = numpy.NaN
    df.loc[df[col] > 100, col] = numpy.NaN
    df.loc[myFunc(df[col]), col] = numpy.NaN
Thanks ahead.
Perhaps this can help you on the way.
You can compare your DataFrame with a rolling_mean DataFrame and a reversed one for the averages above and below.
However, due to the NaNs in your dataframe, the average will not always be calculated, so you can ensure it is calculated regardless by using min_periods=1.
Do check if it's accurate, as I haven't.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-10, 110, (100, 3)))
#remove those higher than 100, lower than 0.
df[(df < 0) | (df > 100)] = np.nan
mean_desc = df.rolling(10, min_periods=1).mean()
mean_asc = df[::-1].rolling(10, min_periods=1).mean()[::-1]  # reversed rolling avg. for the cells below; reversing back keeps the index aligned
df[(df < 20) & (df > 0) & (df > mean_desc - 20) & (df < mean_desc + 20) & (df > mean_asc - 20) & (df < mean_asc + 20)] = "np.nan" # <-- replace with np.nan
print(df)
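Reading the question literally (flag values strictly between 0 and 20 that differ by more than 20 from the average of either the 10 cells above or the 10 cells below), the mask would be inverted relative to the one above and the cell itself excluded from each window. Here is a sketch of that alternative reading, not a correction the answerer verified:
mean_above = df.rolling(10, min_periods=1).mean().shift(1)               # 10 cells above, excluding the cell itself
mean_below = df[::-1].rolling(10, min_periods=1).mean()[::-1].shift(-1)  # 10 cells below, excluding the cell itself
in_range = (df > 0) & (df < 20)
too_far = ((df - mean_above).abs() > 20) | ((df - mean_below).abs() > 20)
df[in_range & too_far] = np.nan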
