I have this dataframe, DF1, with the following columns:
obs_1 obs_2
31 173
16 20
38 49
12 16
45 49
14 174
83 88
43 46
43 46
27 45
32 40
625 669
4 4
61 99
20 26
103 -356
8 110
146 246
38 50
11 92
10 97
9 90
217 234
9 177
28 28
22 22
12 123
35 147
59 63
31 143
18 130
45 55
46 50
21 21
17 152
63 70
52 73
24 24
15 -1172
43 54
88 96
22 34
42 56
14 56
19 20
40 42
23 120
68 73
80 -1263
14 124
35 41
40 176
13 52
21 26
22 102
43 -1325
18 18
36 162
68 69
17 34
20 30
26 27
45 55
78 82
I am trying to find the outliers and flag each row as an outlier (or not) in a new column, using this function:
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
    points : An numobservations by numdimensions array of observations
    thresh : The modified z-score to use as a threshold. Observations with
        a modified z-score (based on the median absolute deviation) greater
        than this value will be classified as outliers.

    Returns:
    --------
    mask : A numobservations-length boolean array.

    References:
    ----------
    Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
    Handle Outliers", The ASQC Basic References in Quality Control:
    Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = (points - median) ** 2
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
Discussed here: Link to discussion
I have tried this code:
DF1['obs_1_outlier'] = is_outlier(DF1.obs_1.to_numpy())
I don't receive any errors, but all results are False, and I have a suspicion that something isn't calculating correctly in the function.
I have a feeling it is the way I am passing the column to the function, but I can't put my finger on it.
Edit 1/2023 - removed np.sum from:
diff = np.sum((points - median)**2, axis=-1)
Thanks to Guilherme.
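For what it's worth, here is a minimal sketch for exercising the function standalone (it assumes the is_outlier definition above; the sample values are hand-picked from obs_2). Checking the shape of the returned mask before assigning it to a column can rule out a broadcasting surprise:
import numpy as np

# Hand-picked values from obs_2 above, including the extreme points
vals = np.array([173, 20, 49, 16, 174, 88, -1325, 46, 625])
mask = is_outlier(vals)
print(mask.shape)   # check whether this is (n,) or (n, 1) before assigning to a column
print(mask.ravel())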
I have a process whose end product is a pandas DataFrame. The output, which varies in data and length, is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above into the following grid format, based on the numbers in the first column:
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be the values above laid out in that grid, each placed according to its number in the first column.
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague so the data is easy for their team to read (it matches the layout of a physical test), but I have no idea how to produce it.
A pandas pivot table can do what you want in your question, but first you have to create two auxiliary columns: one determining which column the value has to go in, and another for which row. You can get those as shown in the following example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows
df['col'] = np.ceil(df['num'] / max_rows).astype(int)
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
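To reproduce the lettered A-H rows from the question, one extra step (the index relabelling is my addition, a sketch rather than part of the original answer) maps the numeric row index onto letters after pivoting:
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows
df['col'] = np.ceil(df['num'] / max_rows).astype(int)
grid = df.pivot_table(values='val', columns='col', index='row')
grid.index = [chr(ord('A') + i) for i in grid.index]  # 0 -> A, 1 -> B, ..., 7 -> H
print(grid)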
To select data for training and validation in my machine learning projects, I usually use NumPy's masking functionality. So a typical recurring block of code to select the indices for validation and test data looks like this:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)
Now the following should always be true:
len(all_idx) == len(idxValid)+len(idxTrain)
Unfortunately, I found out that somehow this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers do not add up properly. Here is another standalone example, which breaks as soon as I increase the number of randomly chosen validation indices to 1000:
import numpy as np
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(len(all_idx), len(idxValid), len(idxTrain))
This results in -> 100000, 1000, 99005
I am confused! Please try it yourself. I would be glad to understand this.
idxValid = np.random.choice(all_idx, 10, replace=False)
Careful: you need to indicate that you don't want duplicates in idxValid. To do so, you just have to add replace=False in np.random.choice. From the documentation:
replace : boolean, optional
    Whether the sample is with or without replacement
Consider the following example:
all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
Now if you print out your validation dataset:
validation_split = 0.2
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]
You can actually observe that there are duplicates in the resulting set, and thus
len(all_idx) == len(idxValid)+len(idxTrain)
wouldn't evaluate to True.
What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
Now the results should be as expected:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0, 100)
print(all_idx)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)
print(len(all_idx) == len(idxValid)+len(idxTrain))
and the output is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]
[ 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
90 91 93 94 95 97 98 99]
True
Consider using train_test_split from scikit-learn, which is straightforward:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
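Alternatively, if you'd rather stay in plain NumPy, a permutation-based split is a common pattern that makes duplicates impossible by construction. A minimal sketch:
import numpy as np

validation_split = 0.2
all_idx = np.arange(0, 100000)

# Shuffle positions once, then slice: the two index sets are disjoint by construction
perm = np.random.permutation(len(all_idx))
n_valid = int(validation_split * len(all_idx))
idxValid = all_idx[perm[:n_valid]]
idxTrain = all_idx[perm[n_valid:]]
print(len(all_idx) == len(idxValid) + len(idxTrain))  # True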
I want to perform a manual short-time Fourier transform. I have a simple time series in the form of a cosine wave, and I want to split the time series into a number of evenly spaced segments that include overlap. How do I do that?
This is my time series:
import numpy as np

fs = 10e3  # Sampling frequency
N = 1e5  # Number of samples
time = np.arange(N) / fs
x = np.cos(5*time)  # Some random audio wave
# x.shape gives (100000,)
How do I split it into, say, 10 evenly spaced segments?
Here's one way to do this.
import numpy as np

def get_windows(n, Mt, olap):
    # Split a signal of length n into olap% overlapping windows each containing Mt terms
    ists = []
    ieds = []
    ist = 0
    while True:
        ied = ist + Mt
        if ied > n:
            break
        ists.append(ist)
        ieds.append(ied)
        ist += int(Mt * (1 - olap/100))
    return ists, ieds

n = 100
x = np.arange(n)
ists, ieds = get_windows(n, Mt=20, olap=50)  # windows of length 20 and 50% overlap
for ist, ied in zip(ists, ieds):
    print(x[ist:ied])
result:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
[30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59]
[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69]
[60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79]
[70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89]
[80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
If your data is relatively small and you are comfortable with storing all the windows in RAM, then you can continue as follows:
X = np.array([x[ist:ied] for ist, ied in zip(ists, ieds)])
# X.shape is (nwindows, Mt)
By doing this, you can generate W, a windowing function (e.g. a Hanning window), as a 1D array of shape (Mt,), so that W*X broadcasts in a way that applies W to each window in X.
I just noticed that the term "window" is used with two meanings in this context (a segment of the signal versus the tapering function). Sorry for the confusion.
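Putting the pieces together, here is a sketch of the remaining STFT steps, rebuilding the segments from the example above and applying NumPy's standard Hann window and real FFT (the segment-building line is just a compact version of get_windows):
import numpy as np

# Rebuild the overlapping segments from the example above
n, Mt, olap = 100, 20, 50
x = np.arange(n, dtype=float)
starts = range(0, n - Mt + 1, int(Mt * (1 - olap/100)))
X = np.array([x[s:s + Mt] for s in starts])  # shape (nwindows, Mt)

W = np.hanning(Mt)              # tapering function, shape (Mt,)
S = np.fft.rfft(W * X, axis=1)  # W broadcasts over rows; one spectrum per segment
# S.shape is (nwindows, Mt//2 + 1)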
I have a long, 121-element array where the data is stored in ascending order, and I want to reshape it to an 11x11 matrix, so I use the NumPy reshape command:
Z = data.attributevalue[2,time,axial,:]
Z = np.reshape(Z, (int(math.sqrt(datacount)), int(math.sqrt(datacount))))
The data should be oriented in a Cartesian plane, and I create the mesh grid with the following:
x = np.arange(1.75, 12.5, 1)
y = np.arange(1.75, 12.5, 1)
X,Y = np.meshgrid(x, y)
The issue is that the rows of Z are in the wrong order, so the data in the last row of the matrix should be in the first, and vice versa. I want to rearrange it so the rows are filled in the proper manner. The starting array Z is assembled in the arrangement [datapoint #1, datapoint #2, ..., datapoint #N]. Datapoint #1 should be in the top left and the last point in the bottom right. Is there a simple way of accomplishing this, or do I have to write a function to change the order of the rows?
My plot statement is the following:
surf = self.ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.jet,
linewidth=1, antialiased=True)
***UPDATE***
I tried populating the initial array backwards and still no luck. I changed the orientation of the axis to the following:
y = np.arange(12.5, 1, -1)
This flipped the data, but my axis labels are wrong, so it is not a real solution to my issue. Any ideas?
It is possible that your original array does not look like a 1x121 array. The following code block shows how to reshape an array from 1x121 to 11x11.
import numpy as np

A = np.arange(1, 122)
print(A)
print(A.reshape((11, 11)))
Gives:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[[ 1 2 3 4 5 6 7 8 9 10 11]
[ 12 13 14 15 16 17 18 19 20 21 22]
[ 23 24 25 26 27 28 29 30 31 32 33]
[ 34 35 36 37 38 39 40 41 42 43 44]
[ 45 46 47 48 49 50 51 52 53 54 55]
[ 56 57 58 59 60 61 62 63 64 65 66]
[ 67 68 69 70 71 72 73 74 75 76 77]
[ 78 79 80 81 82 83 84 85 86 87 88]
[ 89 90 91 92 93 94 95 96 97 98 99]
[100 101 102 103 104 105 106 107 108 109 110]
[111 112 113 114 115 116 117 118 119 120 121]]
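As for the row-ordering issue itself, one option (my suggestion, not part of the answer above) is to flip the reshaped array rather than reversing the axis, so the meshgrid labels stay correct:
import numpy as np

Z = np.arange(1, 122).reshape(11, 11)
Z_flipped = Z[::-1]  # same as np.flipud(Z): reverses the row order only
# Plot Z_flipped against the original X, Y from meshgrid; the axis labels stay unchanged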
So I am trying to merge the following columns of data, which are currently indexed as daily entries (but only have points once per week). I have separated the columns into year variables but am having trouble getting them into a combined dataframe that disregards the date index, so that I can build out min/max columns by week across the years. I am not sure how to get the merge/join functions to do this.
I have the following:

def minmaxdata():
    # Create year variables, append to new dataframe with new index
    Totrigs = dataforgraphs()
    tr = Totrigs
    yrs = [tr['2007'], tr['2008'], tr['2009'], tr['2010'], tr['2011'], tr['2012'], tr['2013'], tr['2014']]
    yrlist = ['tr07', 'tr08', 'tr09', 'tr10', 'tr11', 'tr12', 'tr13', 'tr14']
    dic = dict(zip(yrlist, yrs))
    yr07, yr08, yr09, yr10, yr11, yr12, yr13, yr14 = dic['tr07'], dic['tr08'], dic['tr09'], dic['tr10'], dic['tr11'], dic['tr12'], dic['tr13'], dic['tr14']
    minmax = yr07.append([yr08, yr09, yr10, yr11, yr12, yr13, yr14], ignore_index=True)
I would like a DataFrame like the following:
2007 2008 2009 2010 2011 2012 2013 2014 min max
1 10 13 10 12 34 23 22 14 10 34
2 25 ...
3 22
4 ...
5
.
.
. ...
52
I'm not sure what your original data look like, but I don't think it's a good idea to hard-code all the years; you lose reusability. I'll set up a sequence of random integers indexed by date, with one date per week.
In [65]: idx = pd.date_range('2007-1-1', '2014-12-31', freq='W')
In [66]: df = pd.DataFrame(np.random.randint(100, size=len(idx)), index=idx, columns=['value'])
In [67]: df.head()
Out[67]:
value
2007-01-07 7
2007-01-14 2
2007-01-21 85
2007-01-28 55
2007-02-04 36
In [68]: df.tail()
Out[68]:
value
2014-11-30 76
2014-12-07 34
2014-12-14 43
2014-12-21 26
2014-12-28 17
Then get the year and the week number within each year:
In [69]: df['year'] = df.index.year
In [70]: df['week'] = df.groupby('year').cumcount()+1
(You may try df.index.week for the week number, but I've seen weird behavior, like weeks starting from #53 in January.)
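As an aside, newer pandas exposes the ISO calendar directly, which makes that week-53 quirk explicit. A sketch, assuming pandas >= 1.1 and the df from above:
iso = df.index.isocalendar()  # DataFrame with 'year', 'week', 'day' columns
# Early-January dates can land in ISO week 52/53 of the previous ISO year,
# which is exactly the behavior mentioned above.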
Finally, do a pivot table to transform and get row-wise max/min:
In [71]: df2 = df.pivot_table(index='week', columns='year', values='value')
In [72]: df2['max'] = df2.max(axis=1)
In [73]: df2['min'] = df2.min(axis=1)
And now our dataframe df2 looks like this and should be what you need:
In [74]: df2
Out[74]:
year 2007 2008 2009 2010 2011 2012 2013 2014 max min
week
1 7 82 13 32 24 58 18 10 82 7
2 2 5 29 0 2 97 59 83 97 0
3 85 89 8 83 63 73 47 49 89 8
4 55 5 1 44 78 10 13 87 87 1
5 36 41 48 98 98 24 24 69 98 24
6 51 43 62 60 44 57 34 33 62 33
7 37 66 72 46 28 11 73 36 73 11
8 30 13 86 93 46 67 95 15 95 13
9 78 84 16 21 70 39 43 90 90 16
10 9 2 88 15 39 81 44 96 96 2
11 34 76 16 44 44 26 30 77 77 16
12 2 24 23 13 25 69 25 74 74 2
13 66 91 67 77 18 47 95 66 95 18
14 59 52 22 42 40 99 88 21 99 21
15 76 17 31 57 43 31 91 67 91 17
16 76 38 53 43 84 45 78 9 84 9
17 88 53 34 22 99 93 61 42 99 22
18 78 19 82 19 5 80 55 69 82 5
19 54 92 56 6 2 85 7 67 92 2
20 8 56 86 41 60 76 31 81 86 8
21 64 76 11 38 41 98 39 72 98 11
22 21 86 34 1 15 27 26 95 95 1
23 82 90 3 17 62 18 93 20 93 3
24 47 42 32 27 83 8 22 14 83 8
25 15 66 70 16 4 22 26 14 70 4
26 12 68 21 7 86 2 27 10 86 2
27 85 85 9 39 17 94 67 42 94 9
28 73 80 96 49 46 23 69 84 96 23
29 57 74 6 71 79 31 79 7 79 6
30 18 84 85 34 71 69 0 62 85 0
31 24 40 93 53 72 46 44 71 93 24
32 95 4 58 57 68 27 95 71 95 4
33 65 84 87 41 38 45 71 33 87 33
34 62 14 41 83 79 63 44 13 83 13
35 49 96 50 62 25 45 69 63 96 25
36 6 38 86 34 98 60 67 80 98 6
37 99 44 26 19 19 20 57 17 99 17
38 2 40 7 65 68 58 68 13 68 2
39 72 31 83 65 69 39 10 76 83 10
40 90 31 42 20 7 8 62 79 90 7
41 10 46 82 96 30 43 12 84 96 10
42 79 38 28 78 25 9 80 2 80 2
43 64 83 63 40 29 86 10 15 86 10
44 89 91 62 48 53 69 16 0 91 0
45 99 26 85 45 26 53 79 86 99 26
46 35 14 46 25 74 6 68 44 74 6
47 17 9 84 88 29 83 85 1 88 1
48 18 69 55 16 77 35 16 76 77 16
49 60 4 36 50 81 28 50 34 81 4
50 36 29 38 28 81 86 71 43 86 28
51 41 82 95 27 95 77 74 26 95 26
52 2 81 89 82 28 2 11 17 89 2
53 NaN NaN NaN NaN NaN 0 NaN NaN 0 0
EDIT:
If you need the max/min over only certain columns, just list them. In this case (2007-2013) they are consecutive, so you can do the following:
df2['max_2007to2013'] = df2[list(range(2007, 2014))].max(axis=1)
If not, simply list them explicitly: df2[[2007,2010,2012,2013]].max(axis=1)