How to efficiently calculate a running standard deviation - python

I have an array of lists of numbers, e.g.:
[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)
I would like to efficiently calculate the mean and standard deviation at each index of a list, across all array elements.
To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).
To do the standard deviation, I loop through again, now that I have the mean calculated.
I would like to avoid going through the array twice, once for the mean and then once for the standard deviation (after I have a mean).
Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g., Perl or Python) or pseudocode is fine.

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:
Wikipedia: Algorithms for calculating variance
It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.
You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
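To make the N versus N-1 distinction concrete, here is a minimal sketch using Python's standard `statistics` module; the four sample values are made up for illustration.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]

# Population variance: divide the sum of squared deviations by N.
population_var = statistics.pvariance(values)

# Sample variance: divide by N - 1 (Bessel's correction).
sample_var = statistics.variance(values)

print(population_var, sample_var)
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.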
I wrote two blog entries on the topic which go into more details, including how to delete previous values online:
Computing Sample Mean and Variance Online in One Pass
Deleting Values in Welford’s Algorithm for Online Mean and Variance
You can also take a look at my Java implementation; the Javadoc, source, and unit tests are all online:
Javadoc: stats.OnlineNormalEstimator
Source: stats.OnlineNormalEstimator.java
JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
LingPipe Home Page
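As a rough sketch of how Welford's update could be applied to the question's data (one running mean and one running sum of squared deviations per list index), something like the following should work. The function name `column_stats` is hypothetical, and the result is the population standard deviation, as the question requests.

```python
import math

def column_stats(rows):
    """Return (means, population std devs) per index, in one pass over rows."""
    n = 0
    means = None
    m2 = None  # running sums of squared deviations from the current mean
    for row in rows:
        if means is None:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for i, x in enumerate(row):
            delta = x - means[i]
            means[i] += delta / n
            m2[i] += delta * (x - means[i])
    if n == 0:
        return [], []
    return means, [math.sqrt(s / n) for s in m2]

data = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
means, stds = column_stats(data)
print(means)
print(stds)
```

Dividing `m2` by `n - 1` instead of `n` would give the sample standard deviation.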

The basic answer is to accumulate the sum of both x (call it 'sum_x') and x^2 (call it 'sum_x2') as you go. The value of the standard deviation is then:
stdev = sqrt((sum_x2 / n) - (mean * mean))
where
mean = sum_x / n
This is the population standard deviation; to get the sample standard deviation, use 'n - 1' instead of 'n' as the divisor, i.e. sqrt((sum_x2 - n * mean * mean) / (n - 1)).
You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
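The stability concern can be seen in a small experiment. The sketch below compares the sum-of-squares formula with Welford's update on four values that are large but close together; the data and function names are made up for illustration, and the true population variance of these values is 22.5.

```python
def naive_variance(values):
    """Population variance from running sums of x and x^2."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    mean = sum_x / n
    return sum_x2 / n - mean * mean

def welford_variance(values):
    """Population variance from Welford's one-pass update."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive = naive_variance(values)
stable = welford_variance(values)
print("sum of squares:", naive)
print("Welford:       ", stable)
```

The squares are around 1e18, where double-precision values are spaced far more than one unit apart, so the subtraction in the sum-of-squares version loses essentially all of the significant digits.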

Here is a literal pure Python translation of the Welford's algorithm implementation from John D. Cook’s excellent Accurately computing running variance article:
File running_stats.py
import math
class RunningStats:
    def __init__(self):
        self.n = 0
        self.old_m = 0
        self.new_m = 0
        self.old_s = 0
        self.new_s = 0
    def clear(self):
        self.n = 0
    def push(self, x):
        self.n += 1
        if self.n == 1:
            self.old_m = self.new_m = x
            self.old_s = 0
        else:
            self.new_m = self.old_m + (x - self.old_m) / self.n
            self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)
            self.old_m = self.new_m
            self.old_s = self.new_s
    def mean(self):
        return self.new_m if self.n else 0.0
    def variance(self):
        return self.new_s / (self.n - 1) if self.n > 1 else 0.0
    def standard_deviation(self):
        return math.sqrt(self.variance())
Usage:
rs = RunningStats()
rs.push(17.0)
rs.push(19.0)
rs.push(24.0)
mean = rs.mean()
variance = rs.variance()
stdev = rs.standard_deviation()
print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')

Perhaps not what you were asking, but ... If you use a NumPy array, it will do the work for you, efficiently:
from numpy import array
nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
(0.00, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.00, 0.01, 0.05, 0.03)))
print(nums.std(axis=1))
# [ 0.0116619 0.00979796 0.00632456 0.01788854]
print(nums.mean(axis=1))
# [ 0.022 0.018 0.02 0.02 ]
By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:
Computing sample mean and variance online in one pass

The Python runstats Module is for just this sort of thing. Install runstats from PyPI:
pip install runstats
Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.
from runstats import Statistics
# data is the array of lists from the question
stats = [Statistics() for num in range(len(data[0]))]
for row in data:
    for index, val in enumerate(row):
        stats[index].push(val)
for index, stat in enumerate(stats):
    print('Index', index, 'mean:', stat.mean())
    print('Index', index, 'standard deviation:', stat.stddev())
Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in the Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.
Disclaimer: I am the author of the Python runstats module.

Statistics::Descriptive is a very decent Perl module for these types of calculations:
#!/usr/bin/perl
use strict; use warnings;
use Statistics::Descriptive qw( :all );
my $data = [
[ 0.01, 0.01, 0.02, 0.04, 0.03 ],
[ 0.00, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];
my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures
for my $ref ( @$data ) {
    $stat->add_data( @$ref );
    printf "Running mean: %f\n", $stat->mean;
    printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__
Output:
Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566

Have a look at PDL (pronounced "piddle!").
This is the Perl Data Language which is designed for high precision mathematics and scientific computing.
Here is an example using your figures....
use strict;
use warnings;
use PDL;
my $figs = pdl [
[0.01, 0.01, 0.02, 0.04, 0.03],
[0.00, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.00, 0.01, 0.05, 0.03],
];
my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );
say "Mean scores: ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms): ", $rms;
Which produces:
Mean scores: [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms): [0.011661904 0.009797959 0.0063245553 0.017888544]
Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".
However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)
For more PDL information, have a look at:
pdl.perl.org (official PDL page).
PDL quick reference guide on PerlMonks
Dr. Dobb's article on PDL
PDL Wiki
Wikipedia entry for PDL
SourceForge project page for PDL

Unless your array is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.
My preference would be to use the NumPy array maths extension to convert your array of arrays into a NumPy 2D array and get the standard deviation directly:
>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0)
array([ 1. , 1. , 0.5, 1.5, 1.5, 1.5])
>>> a.mean(axis=0)
array([ 2. , 3. , 4.5, 4.5, 5.5, 6.5])
If that's not an option and you need a pure Python solution, keep reading...
If your array is
x = [
[ 1, 2, 4, 3, 4, 5 ],
[ 3, 4, 5, 6, 7, 8 ],
....
]
Then the standard deviation is:
from math import sqrt
d = len(x[0])
n = len(x)
sum_x = [ sum(v[i] for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
# population standard deviation: sqrt(E[x^2] - E[x]^2)
std_dev = [ sqrt(sx2 / n - (sx / n)**2) for sx, sx2 in zip(sum_x, sum_x2) ]
If you are determined to loop through your array only once, the running sums can be combined.
sum_x = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
    for i, t in enumerate(v):
        sum_x[i] += t
        sum_x2[i] += t**2
This isn't nearly as elegant as the list comprehension solution above.

I like to express the update this way:
def running_update(x, N, mu, var):
    '''
    @arg x: the current data sample
    @arg N : the number of previous samples
    @arg mu: the mean of the previous samples
    @arg var : the variance over the previous samples
    @retval (N+1, mu', var') -- updated mean, variance and count
    '''
    N = N + 1
    rho = 1.0/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
so that a one-pass function would look like this:
def one_pass(data):
    N = 0
    mu = 0.0
    var = 0.0
    for x in data:
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        # could yield here if you want partial results
    return (N, mu, var)
Note that this calculates the variance of the samples themselves (normalized by 1/N), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalization factor). Unlike the other answers, the variable, var, that is tracking the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" in getting the variance).
In a class it would look like this:
class RunningMeanVar(object):
    def __init__(self):
        self.N = 0
        self.mu = 0.0
        self.var = 0.0
    def push(self, x):
        self.N = self.N + 1
        rho = 1.0/self.N
        d = x-self.mu
        self.mu += rho*d
        self.var += rho*((1-rho)*d**2-self.var)
    # reset, accessors etc. can be setup as you see fit
This also works for weighted samples:
def running_update(w, x, N, mu, var):
    '''
    @arg w: the weight of the current sample
    @arg x: the current data sample
    @arg mu: the mean of the previous N sample
    @arg var : the variance over the previous N samples
    @arg N : the number of previous samples
    @retval (N+w, mu', var') -- updated mean, variance and count
    '''
    N = N + w
    rho = w/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
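As a quick sanity check of the weighted update, the sketch below runs it over a few (weight, value) pairs and compares the result with the weighted mean and weighted population variance computed directly. The sample pairs are made up for illustration.

```python
def running_update(w, x, N, mu, var):
    """One weighted update step; N is the total weight seen so far."""
    N = N + w
    rho = w / N
    d = x - mu
    mu += rho * d
    var += rho * ((1 - rho) * d ** 2 - var)
    return N, mu, var

samples = [(2.0, 1.0), (1.0, 4.0), (3.0, 2.5)]  # (weight, value) pairs

N, mu, var = 0.0, 0.0, 0.0
for w, x in samples:
    N, mu, var = running_update(w, x, N, mu, var)

# Direct two-pass computation for comparison.
total_w = sum(w for w, _ in samples)
direct_mu = sum(w * x for w, x in samples) / total_w
direct_var = sum(w * (x - direct_mu) ** 2 for w, x in samples) / total_w

print(mu, direct_mu)
print(var, direct_var)
```

With all weights equal to 1 this reduces to the unweighted update, and `N` is then simply the sample count.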

Here's a "one-liner", spread over multiple lines, in functional programming style:
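from functools import reduce
def variance(data, opt=0):
    # Python 3 version: lambdas can no longer unpack tuple arguments.
    # acc is (m2, i, avg); opt=0 divides by n - 1, opt=1 divides by n.
    m2, i, _ = reduce(
        lambda acc, x:
            (
                acc[0] + (x - acc[2]) ** 2 * acc[1] / (acc[1] + 1),
                acc[1] + 1,
                acc[2] + (x - acc[2]) / (acc[1] + 1)
            ),
        data,
        (0.0, 0, 0.0))
    return m2 / (opt + i - 1)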

As the following answer describes:
Does Pandas, SciPy, or NumPy provide a cumulative standard deviation function?
The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that, you'll have to convert your data into a Pandas dataframe (or a series if it is one-dimensional), but there are functions for that.
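A minimal sketch of the pandas approach, assuming the question's data is loaded into a DataFrame with one row per list: `expanding()` gives cumulative windows, and `ddof=0` selects the population standard deviation (the pandas default is `ddof=1`).

```python
import numpy as np
import pandas as pd

data = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
df = pd.DataFrame(data)

# Row k holds the statistic over rows 0..k, computed per column.
running_mean = df.expanding().mean()
running_std = df.expanding().std(ddof=0)

print(running_mean.iloc[-1].to_numpy())
print(running_std.iloc[-1].to_numpy())
```

The last row of each result should match what `np.mean` and `np.std` with `axis=0` return for the full array.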

Here is a practical example of how you could implement a running standard deviation with Python and NumPy:
import numpy as np
a = np.arange(1, 10)
s = 0
s2 = 0
for i in range(0, len(a)):
    s += a[i]
    s2 += a[i] ** 2
    n = (i + 1)
    m = s / n
    std = np.sqrt((s2 / n) - (m * m))
    print(std, np.std(a[:i + 1]))
This will print out the calculated standard deviation and a check standard deviation calculated with NumPy:
0.0 0.0
0.5 0.5
0.8164965809277263 0.816496580927726
1.118033988749895 1.118033988749895
1.4142135623730951 1.4142135623730951
1.707825127659933 1.707825127659933
2.0 2.0
2.29128784747792 2.29128784747792
2.5819888974716116 2.581988897471611
I am just using the formula described in this thread:
stdev = sqrt((sum_x2 / n) - (mean * mean))

Responding to Charlie Parker's 2021 question:
I'd like an answer that I can just copy paste to my code in numpy. My input is a matrix of size [N, 1] where N is the number of data points and I already have computed the running mean and I assuming we have computed the running std/variance, how to update we the new batch of data.
Here we have two implementations of a function that takes the original mean, original variance and original size and the new sample and returns the total mean and total variance of the combined original and new sample (to get the standard deviation, just take variance's square root by using **(1/2)). The first uses NumPy, and the second one uses Welford. You may choose the one that best applies to your case.
def mean_and_variance_update_numpy(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    sample_to_append_mean = np.mean(sample_to_append)
    sample_to_append_size = len(sample_to_append)
    total_size = previous_size+sample_to_append_size
    total_mean = (previous_mean*previous_size+sample_to_append_mean*sample_to_append_size)/total_size
    total_var = (((previous_var+(total_mean-previous_mean)**2)*previous_size)+((np.var(sample_to_append)+(sample_to_append_mean-total_mean)**2)*sample_to_append_size))/total_size
    return (total_mean, total_var)
def mean_and_variance_update_welford(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    pos = previous_size
    mean = previous_mean
    v = previous_var*previous_size
    for value in sample_to_append:
        pos += 1
        mean_next = mean + (value - mean) / pos
        v = v + (value - mean)*(value - mean_next)
        mean = mean_next
    return (mean, v/pos)
Let's check if it works:
import numpy as np
# mean_and_variance_update_numpy and mean_and_variance_update_welford
# are defined exactly as shown above
# Making the samples and results deterministic
np.random.seed(0)
# Our initial sample has 100 samples, we want to append 10
n0, n1 = 100, 10
# Using np.matrix only, because it was in the question. 'np.array' is more common
s0 = np.matrix(1e3+np.random.random_sample(n0)*1e-3).T
s1 = np.matrix(1e3+np.random.random_sample(n1)*1e-3).T
# Precalculating our mean and var for initial sample:
s0mean, s0var = np.mean(s0), np.var(s0)
# Calculating mean and variance for s0+s1 using our NumPy updater
mean_and_variance_update_numpy(s0mean, s0var, len(s0), s1)
# (1000.0004826329636, 8.24577589696613e-08)
# Calculating mean and variance for s0+s1 using our Welford updater
mean_and_variance_update_welford(s0mean, s0var, len(s0), s1)
# (1000.0004826329634, 8.245775896913623e-08)
# Similar results, now checking with NumPy's calculation over the concatenation of s0 and s1
s0s1 = np.concatenate([s0,s1])
(np.mean(s0s1), np.var(s0s1))
# (1000.0004826329638, 8.245775896917313e-08)
Here are the three results side by side:
# np(s0s1) (1000.0004826329638, 8.245775896917313e-08)
# np(s0)updnp(s1) (1000.0004826329636, 8.245775896966130e-08)
# np(s0)updwf(s1) (1000.0004826329634, 8.245775896913623e-08)
It is possible to see that the results are very similar.

n = int(input("Enter no. of terms:"))
L = []
for i in range(1, n + 1):
    x = float(input("Enter term:"))
    L.append(x)
total = 0
for i in range(n):
    total = total + L[i]
avg = total / n
sumdev = 0
for j in range(n):
    sumdev = sumdev + (L[j] - avg)**2
dev = (sumdev / n)**0.5
print("Standard deviation is", dev)

Figured I could jump on the old bandwagon. This should work with RGB values.
Adapted from
https://math.stackexchange.com/a/2148949
import numpy as np
class IterativeNormStats():
    def __init__(self):
        """uint64 max is 18446744073709551615
        256**2 = 65536
        so we can store 18446744073709551615 / 65536 = 281,474,976,710,656
        images before running into overflow issues. I think we'll be ok
        """
        self.n = 0
        self.rgb_sum = np.zeros(3, dtype=np.uint64)
        self.rgb_sq_sum = np.zeros(3, dtype=np.uint64)
    def update(self, img_arr):
        rgbs = np.reshape(img_arr, (-1, 3)).astype(np.uint64)
        self.n += rgbs.shape[0]
        self.rgb_sum += np.sum(rgbs, axis=0)
        self.rgb_sq_sum += np.sum(np.square(rgbs), axis=0)
    def mean(self):
        return self.rgb_sum / self.n
    def std(self):
        return np.sqrt((self.rgb_sq_sum / self.n) - np.square(self.rgb_sum / self.n))
def test_IterativeNormStats():
    img_a = np.ones((10, 10, 3), dtype=np.uint8) * (1, 2, 3)
    img_b = np.ones((10, 10, 3), dtype=np.uint8) * (2, 4, 6)
    img_c = np.ones((10, 10, 3), dtype=np.uint8) * (3, 6, 9)
    ins = IterativeNormStats()
    for i in range(1000):
        for img in [img_a, img_b, img_c]:
            ins.update(img)
    x = np.vstack([
        np.reshape(img_a, (-1, 3)),
        np.reshape(img_b, (-1, 3)),
        np.reshape(img_c, (-1, 3)),
    ]*1000)
    expected_mean = np.mean(x, axis=0)
    expected_std = np.std(x, axis=0)
    print(expected_mean)
    print(ins.mean())
    print(expected_std)
    print(ins.std())
    assert np.allclose(ins.mean(), expected_mean)
if __name__ == "__main__":
    test_IterativeNormStats()

I came across the welford package, which is pretty simple to use:
pip install welford
Then
import numpy as np
from welford import Welford
# Initialize Welford object
w = Welford()
# Input data samples sequentially
w.add(np.array([0, 100]))
w.add(np.array([1, 110]))
w.add(np.array([2, 120]))
# output
print(w.mean) # mean --> [ 1. 110.]
print(w.var_s) # sample variance --> [1, 100]
print(w.var_p) # population variance --> [ 0.6666 66.66]
# You can add other samples after calculating variances.
w.add(np.array([3, 130]))
w.add(np.array([4, 140]))
# output with added samples
print(w.mean) # mean --> [ 2. 120.]
print(w.var_s) # sample variance --> [ 2.5 250. ]
print(w.var_p) # population variance --> [ 2. 200.]
Notes:
Unlike most other answers, you can feed a Welford object a NumPy array directly
You can even add multiple samples at once with Welford.add_all(...)
You can merge independent computations with w1.merge(w2)
You should choose var_p or var_s depending on which one you want to use (population or sample variance)
As said, those are variances, so you should use np.sqrt to get the associated standard deviation
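To illustrate what merging two independent computations involves, here is a hand-rolled sketch that combines two (count, mean, M2) states with the standard pairwise update. It is not the welford package's code, and the function names `welford_state` and `merge_states` are hypothetical. The values are the first column of the example above.

```python
def welford_state(values):
    """Return (count, mean, M2) for a sequence, using Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge_states(a, b):
    """Combine two (count, mean, M2) states into the state of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

left = [0.0, 1.0, 2.0]
right = [3.0, 4.0]
n, mean, m2 = merge_states(welford_state(left), welford_state(right))
population_var = m2 / n
sample_var = m2 / (n - 1)
print(mean, sample_var, population_var)
```

Because only three numbers per state are needed, partial results computed on separate chunks of data can be combined without revisiting the data.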

Here is a simple implementation in python:
class RunningStats:
    def __init__(self):
        self.mean_x_square = 0
        self.mean_x = 0
        self.n = 0
    def update(self, x):
        self.mean_x_square = (self.mean_x_square * self.n + x ** 2) / (self.n + 1)
        self.mean_x = (self.mean_x * self.n + x) / (self.n + 1)
        self.n += 1
    def mean(self):
        return self.mean_x
    def std(self):
        return self.variance() ** 0.5
    def variance(self):
        return self.mean_x_square - self.mean_x ** 2
Test:
import numpy as np
running_stats = RunningStats()
v = [1.1, 3.5, 5, -8.1, 91]
[running_stats.update(x) for x in v]
print(running_stats.mean() - np.mean(v))
print(running_stats.std() - np.std(v))
print(running_stats.variance() - np.var(v))

Related

Indexing dynamic vector of class probabilities

For my code, I have a large (up to 40,000) vector of class probabilities. This set of class probabilities also needs to be reweighted regularly, so assume it will change on every call of the code. The vector sums to 1. I need to efficiently search through this for the index corresponding to that probability.
As an example - say the vector was [0.25, 0.25, 0.25, 0.25], uniform prob across 4 objects. My probability result is a 0.67. This corresponds to index 3, since 0.67 > sum(probvec[0:1]) but 0.67 <= sum(probvec[0:2]).
I'm open to changing the probability vector to make it the running sum, i.e. [0.25, 0.5, 0.75, 1], though then I'd also need a suggestion as to how to perform updates.
Any help would be appreciated.
Step 1: pre-compute all the partial sums up to the i-th index.
Step 2: scan your sum_probvec with binary search to obtain the result in logarithmic time.
import numpy as np
probvec = np.full(4, 0.25)
prob = 0.67
# pre-compute all the partial sums up to the i-th index
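sum_probvec = [probvec[0]]
for i in range(1, len(probvec)):
    sum_probvec.append(sum_probvec[i-1] + probvec[i])
# use binary search for logtime results
# invariant: sum_probvec[i] < prob <= sum_probvec[j], with i = -1 standing for a sum of 0
# (starting at i = -1 lets a prob that falls in the first bin resolve to index 1)
i = -1
j = len(sum_probvec) - 1
while i != j-1:
    mid = (i + j) // 2
    if prob > sum_probvec[mid]:
        i = mid
    else:
        j = mid
index = i+2 # 1-based index, as in the question
print(index) # 3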
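The same two steps can also be sketched with NumPy built-ins: `np.cumsum` for the partial sums and `np.searchsorted` for the binary search. This is one possible shortcut rather than the method above; the variable names `cumulative` and `index0` are illustrative, and the result is 0-based.

```python
import numpy as np

probvec = np.full(4, 0.25)
prob = 0.67

# Step 1: partial sums up to each index.
cumulative = np.cumsum(probvec)

# Step 2: side="left" returns the first position whose cumulative sum is >= prob.
index0 = int(np.searchsorted(cumulative, prob, side="left"))

# Guard against the last cumulative sum falling slightly below 1.0 due to rounding.
index0 = min(index0, len(probvec) - 1)

print(index0 + 1)  # 1-based index, as in the question
```

Since the weights change on every call, the cumulative sum has to be recomputed each time, which is linear in the vector length.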

Recursive function (Chapman-Kolmogorov eq.) for Transition Probabilities

I'm stuck with building a recursive function which is best illustrated through a quick example.
Take a Markov process with 2 states, State 1 and State 2. The notation p_ij represents the probability of transitioning to state j given that the current state is i. In this example,
p_11 = 0.8 (probability of staying in State 1 given current state is State 1)
p_12 = 0.2
p_21 = 0.6
p_22 = 0.4
And the transition probability matrix is:
import numpy as np
pij = np.array([[.8, .2], [.6, .4]])
print(pij)
# [[ 0.8 0.2]
# [ 0.6 0.4]]
The n-step transition probability, denoted r_ij(n), represents the probability that the state after n time periods will be j, given that the current state is i. r_ij(n) can be found using the Chapman-Kolmogorov equation,
$$r_{ij}(n) = \sum_{k=1}^{m} r_{ik}(n-1)\, p_{kj}, \qquad n > 1,$$
with the initial condition
$$r_{ij}(1) = p_{ij},$$
where m is the total number of states. [From Bertsekas/Tsitsiklis, 2008.]
I'm trying to build r_ij(n). The first 5 steps should look like this:
My start:
def r(p, n):
    m = np.sqrt(p.size) # or p.shape[0]
    if n == 1:
        return p
    elif n > 1:
        res = []
        for k in range(m):
            for i in p:
                for j in i:
                    # This line is patently wrong...
                    # Not sure how to reference i
                    return r(n - 1) * p[k, j]
        return np.sum(res)
p0 = np.array([[.8, .2], [.6, .4]])
print(r(p0, n=5))
# [[.7501, .2499],
# [.7498, .2502]]
But I am a bit lost with the notation.
You don't necessarily need a recursive function for this. matrix_power is essentially rij you are looking for:
import numpy as np
from numpy.linalg import matrix_power
def rij(pij, n):
    return matrix_power(pij, n)
pij = np.array([[.8, .2], [.6, .4]])
matrix_power(pij, 2)
#array([[ 0.76, 0.24],
# [ 0.72, 0.28]])
matrix_power(pij, 3)
#array([[ 0.752, 0.248],
# [ 0.744, 0.256]])
matrix_power(pij, 4)
#array([[ 0.7504, 0.2496],
# [ 0.7488, 0.2512]])
matrix_power(pij, 5)
#array([[ 0.75008, 0.24992],
# [ 0.74976, 0.25024]])
To define a recursive function, np.dot will make the task easier:
def rij(pij, n):
    if n == 1:
        return pij
    else:
        return np.dot(rij(pij, n-1), pij)
rij(pij, 5)
#array([[ 0.75008, 0.24992],
# [ 0.74976, 0.25024]])
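For readers who find the index notation hard to map onto code, here is a sketch that writes the Chapman-Kolmogorov sum out with explicit loops over i, j and k. It is deliberately slower than `np.dot` and is meant only to show which index goes where; the name `r_explicit` is hypothetical.

```python
import numpy as np

def r_explicit(p, n):
    """n-step transition matrix via the Chapman-Kolmogorov sum, written out."""
    m = p.shape[0]
    r_prev = p.copy()  # initial condition: r_ij(1) = p_ij
    for _ in range(n - 1):
        r_next = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                # r_ij(n) = sum over k of r_ik(n-1) * p_kj
                r_next[i, j] = sum(r_prev[i, k] * p[k, j] for k in range(m))
        r_prev = r_next
    return r_prev

pij = np.array([[.8, .2], [.6, .4]])
r5 = r_explicit(pij, 5)
print(r5)
```

The inner sum over k is exactly one entry of a matrix product, which is why the recursion collapses to repeated `np.dot` calls or a single `matrix_power`.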

Exponential Moving Average by time interval [duplicate]

I have a range of dates and a measurement on each of those dates. I'd like to calculate an exponential moving average for each of the dates. Does anybody know how to do this?
I'm new to python. It doesn't appear that averages are built into the standard python library, which strikes me as a little odd. Maybe I'm not looking in the right place.
So, given the following code, how could I calculate the moving weighted average of IQ points for calendar dates?
from datetime import date
days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
IQ = [110, 105, 90]
(there's probably a better way to structure the data, any advice would be appreciated)
EDIT:
It seems that mov_average_expw() function from scikits.timeseries.lib.moving_funcs submodule from SciKits (add-on toolkits that complement SciPy) better suits the wording of your question.
To calculate an exponential smoothing of your data with a smoothing factor alpha (it is (1 - alpha) in Wikipedia's terms):
>>> alpha = 0.5
>>> assert 0 < alpha <= 1.0
>>> av = sum(alpha**n.days * iq
...          for n, iq in map(lambda p, today=max(days): (today - p[0], p[1]),
... sorted(zip(days, IQ), key=lambda p: p[0], reverse=True)))
95.0
The above is not pretty, so let's refactor it a bit:
from collections import namedtuple
from operator import itemgetter
def smooth(iq_data, alpha=1, today=None):
    """Perform exponential smoothing with factor `alpha`.
    Time period is a day.
    Each time period the value of `iq` drops `alpha` times.
    The most recent data is the most valuable one.
    """
    assert 0 < alpha <= 1
    if alpha == 1: # no smoothing
        return sum(map(itemgetter(1), iq_data))
    if today is None:
        today = max(map(itemgetter(0), iq_data))
    return sum(alpha**((today - date).days) * iq for date, iq in iq_data)
IQData = namedtuple("IQData", "date iq")
if __name__ == "__main__":
    from datetime import date
    days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
    IQ = [110, 105, 90]
    iqdata = list(map(IQData, days, IQ))
    print("\n".join(map(str, iqdata)))
    print(smooth(iqdata, alpha=0.5))
Example:
$ python26 smooth.py
IQData(date=datetime.date(2008, 1, 1), iq=110)
IQData(date=datetime.date(2008, 1, 2), iq=105)
IQData(date=datetime.date(2008, 1, 7), iq=90)
95.0
I always calculate EMAs with pandas.
Here is an example of how to do it:
import pandas as pd
def ema(values, period):
    # pd.ewma() was removed in later pandas releases; Series.ewm() replaces it
    return pd.Series(values).ewm(span=period).mean().iloc[-1]
values = [9, 5, 10, 16, 5]
period = 5
print(ema(values, period))
More info about pandas EWMA:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.ewma.html
I did a bit of googling and I found the following sample code (http://osdir.com/ml/python.matplotlib.general/2005-04/msg00044.html):
from numpy import array
def ema(s, n):
    """
    returns an n period exponential moving average for
    the time series s
    s is a list ordered from oldest (index 0) to most
    recent (index -1)
    n is an integer
    returns a numeric array of the exponential
    moving average
    """
    s = array(s)
    ema = []
    j = 1
    #get n sma first and calculate the next n period ema
    sma = sum(s[:n]) / n
    multiplier = 2 / float(1 + n)
    ema.append(sma)
    #EMA(current) = ( (Price(current) - EMA(prev) ) x Multiplier) + EMA(prev)
    ema.append(( (s[n] - sma) * multiplier) + sma)
    #now calculate the rest of the values
    for i in s[n+1:]:
        tmp = ( (i - ema[j]) * multiplier) + ema[j]
        j = j + 1
        ema.append(tmp)
    return ema
You can also use the SciPy filter method because the EMA is an IIR filter. This will have the benefit of being approximately 64 times faster as measured on my system using timeit on large data sets when compared to the enumerate() approach.
import numpy as np
from scipy.signal import lfilter
x = np.random.normal(size=1234)
alpha = .1 # smoothing coefficient
zi = [x[0]] # seed the filter state with first value
# filter can process blocks of continuous data if <zi> is maintained
y, zi = lfilter([1.-alpha], [1., -alpha], x, zi=zi)
I don't know Python, but for the averaging part, do you mean an exponentially decaying low-pass filter of the form
y_new = y_old + (input - y_old)*alpha
where alpha = dt/tau, dt = the timestep of the filter, tau = the time constant of the filter? (the variable-timestep form of this is as follows, just clip dt/tau to not be more than 1.0)
y_new = y_old + (input - y_old)*dt/tau
If you want to filter something like a date, make sure you convert to a floating-point quantity like # of seconds since Jan 1 1970.
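Since this answer gives only the update formula, here is a minimal Python sketch of the variable-timestep form applied to the question's data. The function name `low_pass` and the choice of a 4-day time constant are assumptions for illustration, and the filter is seeded with the first measurement.

```python
from datetime import date

def low_pass(times, values, tau_days):
    """Exponentially decaying low-pass filter over irregularly spaced dates."""
    smoothed = [float(values[0])]  # seed with the first measurement
    for prev_t, t, x in zip(times, times[1:], values[1:]):
        dt = (t - prev_t).days
        gain = min(dt / tau_days, 1.0)  # clip dt/tau so it never exceeds 1.0
        y_old = smoothed[-1]
        smoothed.append(y_old + (x - y_old) * gain)
    return smoothed

days = [date(2008, 1, 1), date(2008, 1, 2), date(2008, 1, 7)]
IQ = [110, 105, 90]
result = low_pass(days, IQ, tau_days=4)
print(result)
```

Here the gap before the last measurement is 5 days, longer than the time constant, so the gain is clipped to 1.0 and the filter output jumps straight to the new value.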
My python is a little bit rusty (anyone can feel free to edit this code to make corrections, if I've messed up the syntax somehow), but here goes....
def movingAverageExponential(values, alpha, epsilon = 0):
    if not 0 < alpha < 1:
        raise ValueError("out of range, alpha='%s'" % alpha)
    if not 0 <= epsilon < alpha:
        raise ValueError("out of range, epsilon='%s'" % epsilon)
    result = [None] * len(values)
    for i in range(len(result)):
        currentWeight = 1.0
        numerator = 0
        denominator = 0
        for value in values[i::-1]:
            numerator += value * currentWeight
            denominator += currentWeight
            currentWeight *= alpha
            if currentWeight < epsilon:
                break
        result[i] = numerator / denominator
    return result
This function moves backward, from the end of the list to the beginning, calculating the exponential moving average for each value by working backward until the weight coefficient for an element is less than the given epsilon.
At the end of the function, it reverses the values before returning the list (so that they're in the correct order for the caller).
(SIDE NOTE: if I was using a language other than python, I'd create a full-size empty array first and then fill it backwards-order, so that I wouldn't have to reverse it at the end. But I don't think you can declare a big empty array in python. And in python lists, appending is much less expensive than prepending, which is why I built the list in reverse order. Please correct me if I'm wrong.)
The 'alpha' argument is the decay factor on each iteration. For example, if you used an alpha of 0.5, then today's moving average value would be composed of the following weighted values:
today: 1.0
yesterday: 0.5
2 days ago: 0.25
3 days ago: 0.125
...etc...
Of course, if you've got a huge array of values, the values from ten or fifteen days ago won't contribute very much to today's weighted average. The 'epsilon' argument lets you set a cutoff point, below which you will cease to care about old values (since their contribution to today's value will be insignificant).
You'd invoke the function something like this:
result = movingAverageExponential(values, 0.75, 0.0001)
In matplotlib.org examples (http://matplotlib.org/examples/pylab_examples/finance_work2.html) is provided one good example of Exponential Moving Average (EMA) function using numpy:
def moving_average(x, n, type):
    x = np.asarray(x)
    if type=='simple':
        weights = np.ones(n)
    else:
        weights = np.exp(np.linspace(-1., 0., n))
    weights /= weights.sum()
    a = np.convolve(x, weights, mode='full')[:len(x)]
    a[:n] = a[n]
    return a
I found the above code snippet by @earino pretty useful, but I needed something that could continuously smooth a stream of values, so I refactored it to this:
def exponential_moving_average(period=1000):
    """ Exponential moving average. Smooths the values sent in over the period. At first it returns a simple average, but as soon as it has gathered 'period' values, it starts to use the exponential moving average to smooth the values.
    period: int - how many values to smooth over (default=1000). """
    multiplier = 2 / float(1 + period)
    cum_temp = yield None # We are being primed
    # Start by just returning the simple average until we have enough data.
    for i in range(1, period + 1):
        cum_temp += yield cum_temp / float(i)
    # Grab the simple average
    ema = cum_temp / period
    # and start calculating the exponentially smoothed average
    while True:
        ema = (((yield ema) - ema) * multiplier) + ema
and I use it like this:
def temp_monitor(pin):
    """ Read from the temperature monitor - and smooth the value out. The sensor is noisy, so we use exponential smoothing. """
    ema = exponential_moving_average()
    next(ema) # Prime the generator
    while True:
        yield ema.send(val_to_temp(pin.read()))
(where pin.read() produces the next value I'd like to consume).
Maybe the shortest:
#Specify decay in terms of span
#data_series should be a DataFrame
ema=data_series.ewm(span=5, adjust=False).mean()
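To show what that one-liner computes, the sketch below compares `ewm(span=..., adjust=False).mean()` with the plain recursion y[t] = alpha * x[t] + (1 - alpha) * y[t-1], where alpha = 2 / (span + 1) and y[0] = x[0]. The sample values are made up for illustration.

```python
import pandas as pd

values = [9.0, 5.0, 10.0, 16.0, 5.0]
span = 5
alpha = 2 / (span + 1)

# Plain-Python recursion, seeded with the first value.
manual = [values[0]]
for x in values[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

# pandas equivalent; adjust=False selects the recursive form.
ema = pd.Series(values).ewm(span=span, adjust=False).mean()

print(manual)
print(ema.tolist())
```

With the default `adjust=True`, pandas instead normalizes by the sum of the weights seen so far, so the early values differ from this recursion.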
import pandas_ta as ta
data["EMA3"] = ta.ema(data["close"], length=3)
pandas_ta is a Technical Analysis Library: https://github.com/twopirllc/pandas-ta. The code above calculates the Exponential Moving Average (EMA) for a series. You can specify the lag value using 'length'. Specifically, the code above calculates a 3-day EMA.
Here is a simple sample I worked up based on http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
Note that unlike in their spreadsheet, I don't calculate the SMA, and I don't wait to generate the EMA after 10 samples. This means my values differ slightly, but if you chart it, it follows exactly after 10 samples. During the first 10 samples, the EMA I calculate is appropriately smoothed.
def emaWeight(numSamples):
    return 2 / float(numSamples + 1)
def ema(close, prevEma, numSamples):
    return ((close-prevEma) * emaWeight(numSamples) ) + prevEma
samples = [
    22.27, 22.19, 22.08, 22.17, 22.18, 22.13, 22.23, 22.43, 22.24, 22.29,
    22.15, 22.39, 22.38, 22.61, 23.36, 24.05, 23.75, 23.83, 23.95, 23.63,
    23.82, 23.87, 23.65, 23.19, 23.10, 23.33, 22.68, 23.10, 22.40, 22.17,
]
emaCap = 10
e=samples[0]
for s in range(len(samples)):
    numSamples = emaCap if s > emaCap else s
    e = ema(samples[s], e, numSamples)
    print(e)
I'm a little late to the party here, but none of the solutions given were what I was looking for. Nice little challenge using recursion and the exact formula given in investopedia.
No numpy or pandas required.
prices = [{'i': 1, 'close': 24.5}, {'i': 2, 'close': 24.6}, {'i': 3, 'close': 24.8}, {'i': 4, 'close': 24.9},
{'i': 5, 'close': 25.6}, {'i': 6, 'close': 25.0}, {'i': 7, 'close': 24.7}]
def rec_calculate_ema(n):
k = 2 / (n + 1)
price = prices[n]['close']
if n == 1:
return price
res = (price * k) + (rec_calculate_ema(n - 1) * (1 - k))
return res
print(rec_calculate_ema(3))
A fast way (copy-pasted from here) is the following:
def ExpMovingAverage(values, window):
""" Numpy implementation of EMA
"""
weights = np.exp(np.linspace(-1., 0., window))
weights /= weights.sum()
a = np.convolve(values, weights, mode='full')[:len(values)]
a[:window] = a[window]
return a
I am using a list and a rate of decay as inputs. I hope this little two-line function helps, considering that deep recursion is not stable in Python.
def expma(aseries, ratio):
return sum([ratio*aseries[-x-1]*((1-ratio)**x) for x in range(len(aseries))])
More simply, using pandas:
def EMA(tw):
for x in tw:
data["EMA{}".format(x)] = data['close'].ewm(span=x, adjust=False).mean()
EMA([10,50,100])
Papahaba's answer was almost what I was looking for (thanks!) but I needed to match initial conditions. Using an IIR filter with scipy.signal.lfilter is certainly the most efficient. Here's my redux:
Given a NumPy vector, x
import numpy as np
from scipy import signal
period = 12
b = np.array((1,), 'd')
a = np.array((period, 1-period), 'd')
zi = signal.lfilter_zi(b, a)
y, zi = signal.lfilter(b, a, x, zi=zi*x[0:1])
Get the N-point EMA (here, 12) returned in the vector y

Efficient way to implement simple filter with varying coeffients in Python/Numpy

I am looking for an efficient way to implement a simple filter with one coefficient that is time-varying and specified by a vector with the same length as the input signal.
The following is a simple implementation of the desired behavior:
def myfilter(signal, weights):
output = np.empty_like(weights)
val = signal[0]
for i in range(len(signal)):
val += weights[i]*(signal[i] - val)
output[i] = val
return output
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
output = myfilter(signal, weights)
Is there a way to do this more efficiently with numpy or scipy?
You can trade in the overhead of the loop for a couple of additional ops:
import numpy as np
def myfilter(signal, weights):
output = np.empty_like(weights)
val = signal[0]
for i in range(len(signal)):
val += weights[i]*(signal[i] - val)
output[i] = val
return output
def vectorised(signal, weights):
wp = np.r_[1, np.multiply.accumulate(1 - weights[1:])]
sw = weights * signal
sw[0] = signal[0]
sws = np.add.accumulate(sw / wp)
return wp * sws
weights = np.random.uniform(0, 0.1, (100,))
signal = np.linspace(1, 3, 100)
print(np.allclose(myfilter(signal, weights), vectorised(signal, weights)))
On my machine the vectorised version is several times faster. It uses a "closed form" solution of your recurrence equation.
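One way to see where the closed form comes from is sketched below. The symbols $v_i$ (output), $s_i$ (signal), $w_i$ (weights) and $P_i$ (the running product stored in wp) are introduced here for illustration only.

```latex
\begin{aligned}
v_0 &= s_0, \qquad v_i = (1 - w_i)\, v_{i-1} + w_i\, s_i \quad (i \ge 1) \\
P_0 &= 1, \qquad P_i = \prod_{k=1}^{i} (1 - w_k) \\
\frac{v_i}{P_i} &= \frac{v_{i-1}}{P_{i-1}} + \frac{w_i\, s_i}{P_i}
  \quad\Longrightarrow\quad
v_i = P_i \left( s_0 + \sum_{k=1}^{i} \frac{w_k\, s_k}{P_k} \right)
\end{aligned}
```

Here $P_i$ corresponds to wp and the bracketed sum to sws. Because $P_i$ shrinks geometrically, the terms divided by it grow without bound as the input gets longer.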
Edit: For very long signal / weight (100,000 samples, say) this method doesn't work because of overflow. In that regime you can still save a bit (more than 50% on my machine) using the following trick, which has the added bonus that you needn't solve the recurrence formula, only invert it.
from scipy import linalg
def solver(signal, weights):
rw = 1 / weights[1:]
v = np.r_[1, rw, 1-rw, 0]
v.shape = 2, -1
return linalg.solve_banded((1, 0), v, signal)
This trick uses the fact that your recurrence is formally similar to a Gauss elimination on a matrix with only one nonvanishing subdiagonal. It piggybacks on a library function that specialises in doing precisely that.
Actually, quite proud of this one.

Is there a numpy builtin to reject outliers from a list

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.
import numpy as np
def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    # keep points within m standard deviations of the mean
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered
>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2, 4, 5, 1, 6, 5]
I say 'something like' because the function might allow for varying distributions (poisson, gaussian, etc.) and varying outlier thresholds within those distributions (like the m I've used here).
Something important when dealing with outliers is that one should try to use estimators as robust as possible. The mean of a distribution will be biased by outliers but e.g. the median will be much less.
Building on eumiro's answer:
def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    # if the median deviation is 0, treat every point as an inlier
    s = d/mdev if mdev else np.zeros(len(d))
    return data[s<m]
Here I have replaced the mean with the more robust median and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.
Note that for the data[s<m] syntax to work, data must be a numpy array.
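A small worked example may help with choosing m for the median-based version. It is only a sketch: mad_scores is a made-up helper that returns the scaled distances, applied to the list from the question. Note that m is in units of the median absolute deviation, not standard deviations (for Gaussian data the MAD is about 0.6745 sigma), so the same m is stricter than in the mean/std version.

```python
import numpy as np

def mad_scores(data):
    """Distance from the median, in units of the median absolute deviation."""
    data = np.asarray(data, dtype=float)
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    return d / mdev if mdev else np.zeros(len(d))

d = np.array([2, 4, 5, 1, 6, 5, 40])
scores = mad_scores(d)
print(scores)            # distances 3, 1, 0, 4, 1, 0, 35 (the MAD is 1 here)
print(d[scores < 2.0])   # m = 2 also drops 2 and 1
print(d[scores < 5.0])   # m = 5 drops only 40
```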
This method is almost identical to yours, just more idiomatic NumPy (and it works on NumPy arrays only):
def reject_outliers(data, m=2):
return data[abs(data - np.mean(data)) < m * np.std(data)]
Benjamin Bannier's answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.
def reject_outliers_2(data, m=2.):
d = np.abs(data - np.median(data))
mdev = np.median(d)
s = d / (mdev if mdev else 1.)
return data[s < m]
Example:
data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))
Gives:
[[10, 10, 10, 17, 10, 10]] # 17 is not filtered
[10, 10, 10, 10, 10] # 17 is filtered (its distance, 7, is greater than m)
Building on Benjamin's, using pandas.Series, and replacing MAD with IQR:
def reject_outliers(sr, iq_range=0.5):
pcnt = (1 - iq_range) / 2
qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
iqr = qhigh - qlow
return sr[ (sr - median).abs() <= iqr]
For instance, if you set iq_range=0.6, the percentiles of the interquartile-range would become: 0.20 <--> 0.80, so more outliers will be included.
An alternative is to make a robust estimation of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90% percentile corresponds to 1.2815σ and the 95% is 1.645σ (http://vassarstats.net/tabs.html?#z)
As a simple example:
import numpy as np
# Create some random numbers
x = np.random.normal(5, 2, 1000)
# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500
# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))
# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)
rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)
The output I get is:
Mean= 4.99760520022
Median= 4.95395274981
Max/Min= 11.1226494654 -2.15388472011
Sigma= 1.976629928
90th Percentile 7.52065379649
Mean= 9.64760520022
Median= 4.95667658782
Max/Min= 2205.43861943 -2.15388472011
Sigma= 88.6263902244
90th Percentile 7.60646688694
Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462
Which is close to the expected value of 2.
If we want to remove points above/below 5 standard deviations (with 1000 points we would expect 1 value > 3 standard deviations):
y = x[abs(x - p50) < rSig*5]
# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))
Which gives:
Mean= 4.99755359935
Median= 4.95213030447
Max/Min= 11.1226494654 -2.15388472011
StdDev= 1.97692712883
I have no idea which approach is the more efficient/robust.
I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since if you remove it you change the length which can mess up plotting (i.e. if you're only removing outliers from one column in a table, but you need it to remain the same as the other columns so you can plot them against each other).
To do so I used numpy's masking functions:
def reject_outliers(data, m=2):
stdev = np.std(data)
mean = np.mean(data)
maskMin = mean - stdev * m
maskMax = mean + stdev * m
mask = np.ma.masked_outside(data, maskMin, maskMax)
print('Masking values outside of {} and {}'.format(maskMin, maskMax))
return mask
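If a plain array with NaN in place of the outliers is preferred over a masked array, one possible variant is sketched below. It assumes the data can be cast to float (NaN has no integer representation); outliers_to_nan is a hypothetical name.

```python
import numpy as np

def outliers_to_nan(data, m=2):
    """Return a float copy of data with values outside mean +/- m*std set to NaN."""
    data = np.asarray(data, dtype=float)  # NaN needs a float dtype
    mean, std = np.mean(data), np.std(data)
    masked = np.ma.masked_outside(data, mean - m * std, mean + m * std)
    return masked.filled(np.nan)

cleaned = outliers_to_nan([2, 4, 5, 1, 6, 5, 40])
print(cleaned)               # same length as the input; 40 becomes nan
print(np.nanmean(cleaned))   # NaN-aware reductions skip the replaced entries
```

Since the length is unchanged, the result can still be plotted against other columns of the same table.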
I would like to provide two methods in this answer: a solution based on the "z score" and a solution based on the "IQR".
The code provided in this answer works on both one-dimensional and multi-dimensional numpy arrays.
First, let's import some modules.
import collections
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr
z score based method
This method will test if the number falls outside the three standard deviations. Based on this rule, if the value is outlier, the method will return true, if not, return false.
def sd_outlier(x, axis = None, bar = 3, side = 'both'):
assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
d_z = stat.zscore(x, axis = axis)
if side == 'gt':
return d_z > bar
elif side == 'lt':
return d_z < -bar
elif side == 'both':
return np.abs(d_z) > bar
IQR based method
This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.
def q1(x, axis = None):
return np.percentile(x, 25, axis = axis)
def q3(x, axis = None):
return np.percentile(x, 75, axis = axis)
def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'
d_iqr = iqr(x, axis = axis)
d_q1 = q1(x, axis = axis)
d_q3 = q3(x, axis = axis)
iqr_distance = np.multiply(d_iqr, bar)
stat_shape = list(x.shape)
if np.iterable(axis):
for single_axis in axis:
stat_shape[single_axis] = 1
else:
stat_shape[axis] = 1
if side in ['gt', 'both']:
upper_range = d_q3 + iqr_distance
upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
if side in ['lt', 'both']:
lower_range = d_q1 - iqr_distance
lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)
if side == 'gt':
return upper_outlier
if side == 'lt':
return lower_outlier
if side == 'both':
return np.logical_or(upper_outlier, lower_outlier)
Finally, if you want to filter out the outliers, use a numpy selector.
Have a nice day.
Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.
(Similarly, the average calculation fails and the median should be calculated instead, although the average is "more prone to such an error than the stdDv".)
You could try applying your algorithm iteratively, or you could filter using the interquartile range:
(here "factor" relates to an n*sigma range, but only when your data follows a Gaussian distribution)
import numpy as np
def sortoutOutliers(dataIn,factor):
quant3, quant1 = np.percentile(dataIn, [75 ,25])
iqr = quant3 - quant1
iqrSigma = iqr/1.34896
medData = np.median(dataIn)
dataOut = [ x for x in dataIn if ( (x > medData - factor* iqrSigma) and (x < medData + factor* iqrSigma) ) ]
return(dataOut)
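The constant 1.34896 in the function above is the width of the interquartile range of a Gaussian, measured in standard deviations. As a quick check (a sketch using only the standard library; iqr_per_sigma is a made-up name), the value can be recomputed from the normal quantile function:

```python
from statistics import NormalDist

# For a standard normal, the quartiles sit at +/- inv_cdf(0.75),
# so IQR = 2 * inv_cdf(0.75) * sigma.
iqr_per_sigma = 2 * NormalDist().inv_cdf(0.75)
print(iqr_per_sigma)  # roughly 1.349
```

Dividing a measured IQR by this factor therefore gives a sigma estimate that is insensitive to a few extreme values, provided the bulk of the data is roughly Gaussian.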
So many answers, but I'm adding a new one that can be useful for the author or even for other users.
You could use the Hampel filter. But you need to work with Series.
Hampel filter returns the Outliers indices, then you can delete them from the Series, and then convert it back to a List.
To use Hampel filter, you can easily install the package with pip:
pip install hampel
Usage:
# Imports
from hampel import hampel
import pandas as pd
list_d = [2, 4, 5, 1, 6, 5, 40]
# List to Series
time_series = pd.Series(list_d)
# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)
# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)
filtered_d.values.tolist()
print(f'filtered_d: {filtered_d.values.tolist()}')
And the output will be:
filtered_d: [2, 4, 5, 1, 6, 5]
Here, ts is a pandas Series object and window_size is the half-window size; the total window size is computed as 2 * window_size + 1.
For this Series I set window_size with the value 3.
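For readers who want to see the idea without installing the package, here is a minimal NumPy sketch of a Hampel-style test. It is not the hampel package's implementation: the function name hampel_indices, the n_sigmas parameter, the truncated windows at the edges and the 1.4826 scale factor (which converts a MAD to a Gaussian standard deviation) are all assumptions made for illustration.

```python
import numpy as np

def hampel_indices(values, window_size=3, n_sigmas=3.0):
    """Indices whose distance from the local median exceeds n_sigmas * scaled MAD."""
    x = np.asarray(values, dtype=float)
    k = 1.4826  # scales the MAD to a standard deviation for Gaussian data
    outliers = []
    for i in range(len(x)):
        # Assumption: windows are simply truncated at the array edges.
        lo, hi = max(0, i - window_size), min(len(x), i + window_size + 1)
        window = x[lo:hi]
        med = np.median(window)
        mad = k * np.median(np.abs(window - med))
        if np.abs(x[i] - med) > n_sigmas * mad:
            outliers.append(i)
    return outliers

print(hampel_indices([2, 4, 5, 1, 6, 5, 40]))  # [6]
```

Because each point is compared with its own neighbourhood, this kind of filter suits time series whose level drifts, where a single global median would be misleading.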
The cool thing about working with Series is being able to generate graphics:
# Imports
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()
And the output will be two plots, titled 'Original Series' and 'Cleaned Series (Without detected Outliers)'.
To learn more about Hampel filter, I recommend the following readings:
Python implementation of the Hampel Filter
Outlier Detection with Hampel Filter
Clean-up your time series data with a Hampel Filter
If you want to get the index positions of the outliers, idx_list will return them.
def reject_outliers(data, m = 2.):
d = np.abs(data - np.median(data))
mdev = np.median(d)
s = d/mdev if mdev else np.zeros(len(d))
data_range = np.arange(len(data))
idx_list = data_range[s>=m]
return data[s<m], idx_list
data_points = np.array([8, 10, 35, 17, 73, 77])
print(reject_outliers(data_points))
after rejection: [ 8 10 35 17], index positions of outliers: [4 5]
For a set of images (each image has 3 dimensions), where I wanted to reject outliers for each pixel I used:
mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)
Then it is possible to compute the mean:
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))
(I use it for Background Subtraction)
Here I find the outliers in x and substitute them with the median of a window of points (win) around them (taking from Benjamin Bannier answer the median deviation)
def outlier_smoother(x, m=3, win=3, plots=False):
''' finds outliers in x, points > m*mdev(x) [mdev:median deviation]
and replaces them with the median of win points around them '''
x_corr = np.copy(x)
d = np.abs(x - np.median(x))
mdev = np.median(d)
idxs_outliers = np.nonzero(d > m*mdev)[0]
for i in idxs_outliers:
if i-win < 0:
x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
elif i+win+1 > len(x):
x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
else:
x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
if plots:
plt.figure('outlier_smoother', clear=True)
plt.plot(x, label='orig.', lw=5)
plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')
plt.plot(x_corr, '-o', label='corrected')
plt.legend()
return x_corr
Trim outliers in a numpy array along axis and replace them with min or max values along this axis, whichever is closer. The threshold is z-score:
def np_z_trim(x, threshold=10, axis=0):
""" Replace outliers in numpy ndarray along axis with min or max values
within the threshold along this axis, whichever is closer."""
mean = np.mean(x, axis=axis, keepdims=True)
std = np.std(x, axis=axis, keepdims=True)
masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
min = np.nanmin(masked, axis=axis, keepdims=True)
max = np.nanmax(masked, axis=axis, keepdims=True)
repl = np.where(np.abs(x - max) < np.abs(x - min), max, min)
return np.where(np.isnan(masked), repl, masked)
My solution drops the top and bottom percentiles, keeping values that are equal to the boundary:
def remove_percentile_outliers(data, percent_to_drop=0.001):
low, high = data.quantile([percent_to_drop / 2, 1-percent_to_drop / 2])
return data[(data >= low )&(data <= high)]
My solution sets each outlier equal to the previous value.
test_data = [2,4,5,1,6,5,40, 3]
def reject_outliers(data, m=2):
mean = np.mean(data)
std = np.std(data)
for i in range(len(data)) :
if np.abs(data[i] -mean) > m*std :
data[i] = data[i-1]
return data
reject_outliers(test_data)
Output:
[2, 4, 5, 1, 6, 5, 5, 3]
