Detecting outliers from a list - python

I want to detect and store the outliers in a list. This is what I am doing:
import numpy as np

def outliers(y, thresh=3.5):
    y = np.asarray(y)  # accept plain lists as well as arrays
    m = np.median(y)
    abs_dev = np.abs(y - m)
    # separate median absolute deviations for each side of the median
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh

bids = [5000,5500,4500,1000,15000,5200,4900]
z = outliers(bids)
bidd = np.array(bids)
out_liers = bidd[z]
This gives results as:
out_liers = array([ 1000, 15000])
Is there a better way to do this, so that I get the result as a list rather than an array?
Also, could someone please explain why we use
thresh=3.5
modified_z_score = 0.6745 * abs_dev / y_mad

This works:
import numpy as np

def outliers_modified_z_score(ys, threshold=3.5):
    ys_arr = np.array(ys)
    median_y = np.median(ys_arr)
    median_absolute_deviation_y = np.median(np.abs(ys_arr - median_y))
    modified_z_scores = 0.6745 * (ys_arr - median_y) / median_absolute_deviation_y
    # tolist() converts the selected values back to a plain Python list
    return (ys_arr[np.abs(modified_z_scores) > threshold]).tolist()

That's because you are using NumPy functions. They work with and return numpy.ndarray, which speeds up the computations. If you just need a list as output, call the array's tolist() method:
z = outliers(bids)
bidd = np.array(bids)
out_liers = bidd[z].tolist()
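On the two constants asked about in the question: 0.6745 is approximately the 0.75 quantile of the standard normal distribution, so dividing the median absolute deviation by it gives a robust estimate of the standard deviation for roughly normal data, and 3.5 is the cutoff commonly attributed to Iglewicz and Hoaglin for the resulting modified z-score. The sketch below illustrates this with the standard library only. It uses a single symmetric MAD (as in the first answer, not the two-sided version in the question), and the names scale, modified_z_scores and flagged are chosen here for illustration.

```python
from statistics import NormalDist, median

# 0.75 quantile of the standard normal, roughly 0.6745
scale = NormalDist().inv_cdf(0.75)

def modified_z_scores(values):
    """Modified z-scores based on the median and a single symmetric MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [scale * (v - med) / mad for v in values]

bids = [5000, 5500, 4500, 1000, 15000, 5200, 4900]
scores = modified_z_scores(bids)

# Keep the values whose modified z-score exceeds the 3.5 cutoff
flagged = [b for b, s in zip(bids, scores) if abs(s) > 3.5]
print(flagged)
```

Note that this sketch divides by the MAD without checking for zero, so it assumes the data are not mostly identical values.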


python: returns Magnus and Neudecker's duplication matrix of size n

Hi, I am trying to port the following MATLAB code to Python.
This is the MATLAB code:
`
function d = dupmat(n)
% Returns Magnus and Neudecker's duplication matrix of size n
a = tril(ones(n));
i = find(a); % find non-zero elements
a(i) = 1:length(i);
a = a + tril(a,-1)';
j = vec(a);
m = n*(n+1)/2;
d = zeros(n*n,m);
for r = 1:nrows(d)
    d(r, j(r)) = 1;
end
`
I have tried the following translation, but it does not work:
import numpy as np
def dupmat(n):
    a = np.tril(np.ones(n))
    i = np.nonzero(a) # find the non zero elements of the function
    a = a[1:len(i)]
    a = a + np.tril(a,-1)
    j = np.vectorize(a)
    m = n*(n+1)/2
    d = np.zeros([n*n,int(m)])
    for r in range(0,d.shape[0]):
        if (d[r] == j[r] == 1):
            d[r] = 1
    return d
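One possible NumPy translation is sketched below. It is not a line-by-line port: it assumes that vec in the MATLAB code stacks the columns of a (column-major order), uses 0-based numbering instead of MATLAB's 1-based numbering, and replaces the loop with a single fancy-indexing assignment. Note also that np.ones(n) creates a 1-D vector, whereas MATLAB's ones(n) is an n-by-n matrix.

```python
import numpy as np

def dupmat(n):
    """Duplication matrix D of size (n*n, n*(n+1)/2), so that
    D @ vech(A) equals vec(A) for a symmetric n-by-n matrix A."""
    m = n * (n + 1) // 2
    a = np.zeros((n, n), dtype=int)
    # Swapping the upper-triangle indices walks the lower triangle
    # column by column, matching MATLAB's find() ordering.
    cols, rows = np.triu_indices(n)
    a[rows, cols] = np.arange(m)       # number the vech positions 0..m-1
    a = a + np.tril(a, -1).T           # mirror them into the upper triangle
    j = a.ravel(order="F")             # vec(a): stack the columns
    d = np.zeros((n * n, m), dtype=int)
    d[np.arange(n * n), j] = 1         # one 1 per row, in column j[r]
    return d

print(dupmat(2))
```

A quick sanity check is to build a small symmetric matrix, take its lower triangle column by column, and confirm that multiplying by dupmat(n) reproduces the column-stacked full matrix.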

Substituting values in SymPy summation

When substituting values into a SymPy sum, it doesn't seem to recognise that the variables are indexed, and simply factors out all the indexed variables, like so:
import sympy
# Define variables.
z_tilde_i = sympy.IndexedBase('\\tilde{z}')
rho_i = sympy.IndexedBase('\\rho')
M = sympy.symbols('M')
n = sympy.symbols('n', integer = True)
i = sympy.Idx('i', n)
# Define equation M = sum(rho * deltaZ).
eq_total_mass = sympy.Eq(M, sympy.Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, n - 1)))
# Try to substitute values.
print(eq_total_mass.rhs.subs(n, 3).doit())
Output:
3*(\tilde{z}[i + 1] - \tilde{z}[i])*\rho[i]
How can I make the SymPy sum recognise the indexed variables?
As a workaround (the snippets below assume from sympy import *), there is no need to define i as Idx:
>>> i = var('i')
>>> Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, 1)).doit()
(-\tilde{z}[0] + \tilde{z}[1])*\rho[0] + (-\tilde{z}[1] + \tilde{z}[2])*\rho[1]
Or, if you do, don't pass integer=True when defining n:
>>> n = var('n')
>>> i = sympy.Idx('i', n)
>>> Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, 1)).doit()
(-\tilde{z}[0] + \tilde{z}[1])*\rho[0] + (-\tilde{z}[1] + \tilde{z}[2])*\rho[1]
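A minimal sketch of the first workaround applied to the original problem, assuming a reasonably recent SymPy: the summation index is a plain integer symbol rather than an Idx, and n is substituted before doit() is called. The shortened names rho, z and total_mass are used here for illustration only.

```python
import sympy

rho = sympy.IndexedBase('rho')
z = sympy.IndexedBase('z')
n = sympy.symbols('n', integer=True)
i = sympy.symbols('i', integer=True)  # plain symbol instead of sympy.Idx

# Symbolic sum with a symbolic upper limit n - 1
total_mass = sympy.Sum(rho[i] * (z[i + 1] - z[i]), (i, 0, n - 1))

# Substitute the upper limit first, then expand the finite sum
expanded = total_mass.subs(n, 3).doit()
print(expanded)
```

Because the limits are concrete integers after the substitution, doit() should expand the sum term by term, with each indexed variable evaluated at its own index.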

Want to get maximum values from output and apply them to equation

Input code is:
# Input data:
S = pd.S = 2000 # Saturation flow
L = pd.L = 5 # Lost time
eb = pd.eb = 1000
wb = pd.wb = 600
sb = pd.sb = 400
nb = pd.nb = 500
# a) C_min = Minimum cycle length calculation
Y_eb = pd.Y_eb = eb / S
Y_wb = pd.Y_wb = wb / S
Y_sb = pd.Y_sb = sb / S
Y_nb = pd.Y_nb = nb / S
Y_eb_wb_sb_nb = [Y_eb,Y_wb,Y_sb,Y_nb]
Y_eb_wb_sb_nb
Output:
[0.5, 0.3, 0.2, 0.25]
Then
if Y_eb > Y_wb:
    print(C_min = L / 1 - (Y_eb + Y_wb))
I want to:
Get maximum values from (Y_eb;Y_wb) and (Y_sb;Y_nb) and apply these values to formula:
C_min = L / (1- [max of (Y_eb;Y_wb)] + [max of (Y_sb;Y_nb)])
Use the built-in max function:
C_min = L / (1- max(Y_eb,Y_wb) + max(Y_sb,Y_nb))
Python has a built-in max function that returns the largest item of a list (or of several arguments):
max(iterable, *[, key, default])
max(arg1, arg2, *args[, key])
"Return the largest item in an iterable or the largest of two or more arguments"
https://docs.python.org/3/library/functions.html#max
Answer:
C_min = L / (1- max([Y_eb, Y_wb]) + max([Y_sb, Y_nb]))
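A small worked sketch with the numbers from the question. It evaluates the formula exactly as it is written above, and also a second variant in which the two maxima are added before being subtracted from 1. The second variant is an assumption about the intended formula, not something the question confirms. The names flows, y_ew, y_ns, c_min_as_written and c_min_grouped are chosen for this illustration.

```python
S = 2000  # saturation flow
L = 5     # lost time
flows = {"eb": 1000, "wb": 600, "sb": 400, "nb": 500}

# Flow ratio for each approach
Y = {name: flow / S for name, flow in flows.items()}

# Largest ratio within each pair of opposing approaches
y_ew = max(Y["eb"], Y["wb"])
y_ns = max(Y["sb"], Y["nb"])

# Formula exactly as written in the question and answers
c_min_as_written = L / (1 - y_ew + y_ns)

# Assumed alternative: subtract the sum of both maxima
c_min_grouped = L / (1 - (y_ew + y_ns))

print(c_min_as_written, c_min_grouped)
```

The two variants differ only in parenthesisation, but the results are quite different, so it is worth checking which one the source formula intends.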

Fastest way to perform calculation between two dataframe columns?

I have a pandas dataframe with 6 million rows. The columns are:
['x', 'y']
I need to apply a simple calculation between x and y, and append the result to the dataframe.
This is what I've tried:
import math
import numpy as np

def pressure_to_elevation(P, T = None):
    '''
    Calculates the height of a pressure level in feet
    '''
    sea_level_pressure = 1013.25
    if T is not None:
        # https://www.omnicalculator.com/physics/air-pressure-at-altitude
        P0 = sea_level_pressure
        g = 9.80665
        M = 0.0289644
        R0 = 8.31447
        m = (np.log(P/P0)*T) / -(g*M/R0)
        f = 3.28084 * m
        return f
    b = 0.190284
    c = 145366.45
    return (1-math.pow((P/sea_level_pressure), b)) * c

test_df['result'] = test_df.apply(lambda row: pressure_to_elevation(row['x'], row['y']),axis=1)
Unfortunately, this takes a ridiculous amount of time... in fact, I've yet to see it complete.
Is there a faster way to do this?
Try this:
import numpy as np

def pressure_to_elevation(P, T):
    sea_level_pressure = 1013.25
    P0 = sea_level_pressure
    g = 9.80665
    M = 0.0289644
    R0 = 8.31447
    b = 0.190284
    c = 145366.45
    # np.power is used because np.pow exists only in newer NumPy releases
    return np.where(T.notnull(),
                    3.28084 * ((np.log(P/P0)*T) / -(g*M/R0)),
                    (1-np.power((P/sea_level_pressure), b)) * c)
Usage:
test_df['result'] = pressure_to_elevation(test_df['x'], test_df['y'])
I believe if you break this out into separate steps and avoid iterating through the entire dataframe, the speed will increase dramatically. Give the following a shot.
sea_level_pressure = 1013.25
test_df['result_1'] = (test_df['x']/sea_level_pressure)
test_df['result_1'] = test_df['result_1']**0.190284
test_df['result_1'] = (1 - test_df['result_1'])*145366.45
test_df['result_2'] = 3.28084*((np.log(test_df['x']/sea_level_pressure)*test_df['y'])/(-1*(9.80665*0.0289644/8.31447)))
test_df['final_result'] = np.where(pd.isnull(test_df['y']), test_df['result_1'], test_df['result_2'])
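Before running a vectorized version on all 6 million rows, it may be worth checking it against the row-by-row logic on a few rows. The sketch below does that on a three-row sample. It assumes that a missing temperature is stored as NaN in column y, and the names elevation_scalar, elevation_vectorized and sample are hypothetical, introduced only for this check.

```python
import math
import numpy as np
import pandas as pd

SEA_LEVEL_PRESSURE = 1013.25
K = 9.80665 * 0.0289644 / 8.31447  # g * M / R0 from the question

def elevation_scalar(P, T):
    """Row-by-row reference: use the temperature formula when T is present."""
    if T is not None and not math.isnan(T):
        return 3.28084 * (math.log(P / SEA_LEVEL_PRESSURE) * T) / -K
    return (1 - math.pow(P / SEA_LEVEL_PRESSURE, 0.190284)) * 145366.45

def elevation_vectorized(P, T):
    """Same calculation on whole pandas Series at once."""
    with_temp = 3.28084 * (np.log(P / SEA_LEVEL_PRESSURE) * T) / -K
    without_temp = (1 - np.power(P / SEA_LEVEL_PRESSURE, 0.190284)) * 145366.45
    return np.where(T.notnull(), with_temp, without_temp)

sample = pd.DataFrame({"x": [1000.0, 850.0, 500.0],
                       "y": [288.15, np.nan, 250.0]})
expected = [elevation_scalar(p, t) for p, t in zip(sample["x"], sample["y"])]
sample["result"] = elevation_vectorized(sample["x"], sample["y"])

print(np.allclose(sample["result"], expected))
```

Both branches are computed for every row and np.where then picks one per row, which is why the NaN temperatures do not leak into the final column.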

Matrix kind-of program

I get TypeError: 'int' object is not iterable on the following code, why?
def temp_media(c, l):
    c_ini = c
    l_ini = l
    res_vert = 0
    res_horiz = 0
    dim = dimensoes()
    c_max = dim[0] // 2
    l_max = dim[1]
    for l in l_max:
        for c in c_max:
            res_vert = res_vert + calcula_temp(c, l)
            res_horiz = res_horiz + calcula_temp(c, l)
    return (((res_horiz / (c_max - c_ini)) + (res_vert / (l_max - l_ini))) / 2)
How can I fix this?
l_max and c_max are integers, and a for loop cannot iterate over an integer. You need to use range (or xrange in Python 2) in your for loops:
c_max = dim[0] // 2
l_max = dim[1]
for l in range(l_max):
    for c in range(c_max):
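The difference can be reproduced in isolation. In this small sketch, l_max is a hypothetical stand-in for dim[1]; looping over the integer raises the error from the question, while looping over range(l_max) visits 0 up to l_max - 1.

```python
l_max = 3  # hypothetical value standing in for dim[1]

# Iterating directly over an int raises TypeError
try:
    for l in l_max:
        pass
except TypeError as exc:
    message = str(exc)

# range(l_max) yields 0, 1, ..., l_max - 1
visited = [l for l in range(l_max)]

print(message)
print(visited)
```

Note that range starts at 0 by default. If the loops are meant to start from the c and l passed into the function, range(l_ini, l_max) and range(c_ini, c_max) would be needed instead, which is an assumption about the intent of the original code.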
