Normalizing a list of numbers in Python

Normalizing a list of numbers in Python - python

I need to normalize a list of values to fit in a probability distribution, i.e. between 0.0 and 1.0.
I understand how to normalize, but was curious if Python had a function to automate this.
I'd like to go from:
raw = [0.07, 0.14, 0.07]
to
normed = [0.25, 0.50, 0.25]

Use :
norm = [float(i)/sum(raw) for i in raw]
to normalize against the sum to ensure that the sum is always 1.0 (or as close to as possible).
use
norm = [float(i)/max(raw) for i in raw]
to normalize against the maximum

if your list has negative numbers, this is how you would normalize it
a = range(-30,31,5)
norm = [(float(i)-min(a))/(max(a)-min(a)) for i in a]

For ones who wanna use scikit-learn, you can use
from sklearn.preprocessing import normalize
x = [1,2,3,4]
normalize([x]) # array([[0.18257419, 0.36514837, 0.54772256, 0.73029674]])
normalize([x], norm="l1") # array([[0.1, 0.2, 0.3, 0.4]])
normalize([x], norm="max") # array([[0.25, 0.5 , 0.75, 1.]])

How long is the list you're going to normalize?
def psum(it):
"This function makes explicit how many calls to sum() are done."
print "Another call!"
return sum(it)
raw = [0.07,0.14,0.07]
print "How many calls to sum()?"
print [ r/psum(raw) for r in raw]
print "\nAnd now?"
s = psum(raw)
print [ r/s for r in raw]
# if one doesn't want auxiliary variables, it can be done inside
# a list comprehension, but in my opinion it's quite Baroque
print "\nAnd now?"
print [ r/s for s in [psum(raw)] for r in raw]
Output
# How many calls to sum()?
# Another call!
# Another call!
# Another call!
# [0.25, 0.5, 0.25]
#
# And now?
# Another call!
# [0.25, 0.5, 0.25]
#
# And now?
# Another call!
# [0.25, 0.5, 0.25]

try:
normed = [i/sum(raw) for i in raw]
normed
[0.25, 0.5, 0.25]

There isn't any function in the standard library (to my knowledge) that will do it, but there are absolutely modules out there which have such functions. However, its easy enough that you can just write your own function:
def normalize(lst):
s = sum(lst)
return map(lambda x: float(x)/s, lst)
Sample output:
>>> normed = normalize(raw)
>>> normed
[0.25, 0.5, 0.25]

If you consider using numpy, you can get a faster solution.
import random, time
import numpy as np
a = random.sample(range(1, 20000), 10000)
since = time.time(); b = [i/sum(a) for i in a]; print(time.time()-since)
# 0.7956490516662598
since = time.time(); c=np.array(a);d=c/sum(a); print(time.time()-since)
# 0.001413106918334961

Try this :
from __future__ import division
raw = [0.07, 0.14, 0.07]
def norm(input_list):
norm_list = list()
if isinstance(input_list, list):
sum_list = sum(input_list)
for value in input_list:
tmp = value /sum_list
norm_list.append(tmp)
return norm_list
print norm(raw)
This will do what you asked.
But I will suggest to try Min-Max normalization.
min-max normalization :
def min_max_norm(dataset):
if isinstance(dataset, list):
norm_list = list()
min_value = min(dataset)
max_value = max(dataset)
for value in dataset:
tmp = (value - min_value) / (max_value - min_value)
norm_list.append(tmp)
return norm_list

If working with data, many times pandas is the simple key
This particular code will put the raw into one column, then normalize by column per row. (But we can put it into a row and do it by row per column, too! Just have to change the axis values where 0 is for row and 1 is for column.)
import pandas as pd
raw = [0.07, 0.14, 0.07]
raw_df = pd.DataFrame(raw)
normed_df = raw_df.div(raw_df.sum(axis=0), axis=1)
normed_df
where normed_df will display like:
0
0 0.25
1 0.50
2 0.25
and then can keep playing with the data, too!

Here is a not-terribly-inefficient one liner similar to the top answer (only performs summation once)
norm = (lambda the_sum:[float(i)/the_sum for i in raw])(sum(raw))
A similar method can be done for a list with negative numbers
norm = (lambda the_max, the_min: [(float(i)-the_min)/(the_max-the_min) for i in raw])(max(raw),min(raw))

Use scikit-learn:
from sklearn.preprocessing import MinMaxScaler
data = np.array([1,2,3]).reshape(-1, 1)
scaler = MinMaxScaler()
scaler.fit(data)
print(scaler.transform(data))

Related

Indexing dynamic vector of class probabilities

For my code, I have a large (up to 40,000) vector of class probabilities. This set of class probabilities also needs to be reweighted regularly, so assume it will change on every call of the code. The vector sums to 1. I need to efficiently search through this for the index corresponding to that probability.
As an example - say the vector was [0.25, 0.25, 0.25, 0.25], uniform prob across 4 objects. My probability result is a 0.67. This corresponds to index 3, since 0.67 > sum(probvec[0:1]) but 0.67 <= sum(probvec[0:2]).
I'm open to changing the probability vector to make it the running sum, i.e. [0.25, 0.5, 0.75, 1], though then I'd also need a suggestion as to how to perform updates.
Any help would be appreciated.

Step 1: pre-compute all the partial sums up to the i-th index.
Step 2: scan your sums_probvec with binary search for obtaining the result in logtime.
import numpy as np
probvec = np.full(4, 0.25)
prob = 0.67
# pre-compute all the partial sums up to the i-th index
sum_probvec = [probvec[0]]
for i in range(1, len(probvec)) :
sum_probvec.append(sum_probvec[i-1] + probvec[i])
# use binary search for logtime results
i = 0
j = len(sum_probvec)
while i != j-1:
mid = (i + j) // 2
if prob > sum_probvec[mid]:
i = mid
else:
j = mid
index = i+2
print (index) # 3

Curve fitting for each column in Pandas + extrapolate values

I have a data set with some 300 columns, each of them depth-dependent. The simplified version of the Pandas DataFrame would look something like this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy_optimize import curve_fit
df1 = pd.DataFrame({'depth': [1.65, 2.15, 2.65, 3.15, 3.65, 4.15, 4.65, 5.15, 5.65, 6.15, 6.65, 7.15, 7.65, 8.15, 8.65],
'400.0': [13.909261, 7.758734, 3.513627, 2.095409, 1.628918, 0.782643, 0.278548, 0.160153, -0.155895, -0.152373, -0.147820, -0.023997, 0.010729, 0.006050, 0.002356],
'401.0': [14.581624, 8.173803, 3.757856, 2.223524, 1.695623, 0.818065, 0.300235, 0.173674, -0.145402, -0.144456, -0.142969, -0.022471, 0.010802, 0.006181, 0.002641],
'402.0': [15.253988, 8.588872, 4.002085, 2.351638, 1.762327, 0.853486, 0.321922, 0.187195, -0.134910, -0.136539, -0.138118, -0.020945, 0.010875, 0.006313, 0.002927],
'403.0': [15.633908, 8.833914, 4.146499, 2.431543, 1.798185, 0.874350, 0.333470, 0.192128, -0.130119, -0.134795, -0.136049, -0.019307, 0.012037, 0.006674, 0.003002],
'404.0': [15.991816, 9.066159, 4.283401, 2.507818, 1.831721, 0.894119, 0.344256, 0.196415, -0.125758, -0.133516 , -0.134189, -0.017659, -0.013281,0.007053, 0.003061],
'405.0': [16.349725, 9.298403, 4.420303, 2.584094, 1.865257, 0.913887, 0.355041, 0.200702, -0.121396, -0.132237, -0.132330, -0.016012, 0.014525, 0.007433, 0.003120]
})
What I need to do is to estimate the K in the equation below. Basically each column corresponds to a I(z) profile. The I(0) has to be calculated, for which I used the curve_fit, as a reference I'm using this helpful post: https://stackoverflow.com/a/15369787/7541421
x = df1.depth # Column values as a function of depth
y = df1['400.0']
plt.plot(x, y, 'ro',label="Original Data")
def func(def func(x, I0, k): # a = I0, b = k
return I0 * np.exp(-k*x)
popt, pcov = curve_fit(func, x, y)
print ("E0 = %s , k = %s" % (popt[0], popt[1]))
plt.plot(x, func(x, *popt), label="Fitted Curve")
Could this be done for each column separately and somehow saved as a new DataFrame?
Also, the new DataFrame needs to be propagated to the values towards z=0 for certain dz quotas. In this case I'm missing [0.15, 0.65, 1.15] in my depth column.
So for every z I need to get per each column the I(z) from the function.
How can I automatize it since every data set has a different depth range in my case?
P.S. Alternatively, as it has been originally discussed in this post, a log-transfored linear regression fit can be applied, for which the solution is written in an answer below.

Some changes have been made after the conversation with the principal author of this answer and with his approval.
First of all, since we are dealing with log-transform quantities, it is necessary to find the range of values which correspond to non-negative values per column.
negative_idx_aux = df_drop_depth.apply(lambda x:(x<0).nonzero()[0][:1].tolist())
negative_idx = [item for sublist in negative_idx_aux for item in sublist]
if len(negative_idx) > 0:
max_idx = max_idx = np.min(negative_idx)
else:
max_idx = None
Compared to the original, I only merge the loops to obtain both the slope and intercept.
iz_cols = df1.columns.difference(['depth'])
slp_int = {}
for c in iz_cols:
slope, intercept, r_value, p_value, std_err = stats.linregress(df1['depth'][0:max_idx],np.log(df1[c][0:max_idx]))
slp_int[c] = [intercept, slope]
slp_int = pd.DataFrame(, index = ['intercept', 'slope'])
Exponentiating intercept gives us the value of I at the surface:
slp_int.loc['intercept'] = np.exp(slp_int.loc['intercept'])
The last part of the post has been corrected due to a misunderstanding of the final concept.
The dataframe is now recreated, with new values for the surface depths (above the depth range of df1, keeping the df1 for values below.
First a whole range between z = 0 and the maximum value of the depth column is recreated, with an assigned step plus keeping the value at z = 0:
depth = np.asarray(df1.depth)
depth_min = np.min(depth) ;
depth_min_arr = np.array([depth_min])
step = 0.5
missing_vals_aux = np.arange(depth_min - step, 0, -step)[::-1]
missing_vals = np.concatenate(([0.], missing_vals_aux), axis=0)
depth_tot = np.concatenate((missing_vals, depth), axis=0)
df_boundary = pd.DataFrame(columns = iz_cols)
df_up = pd.DataFrame(columns = iz_cols)
Create a dataframe with the range of the upward-propagated depth quotas:
for c in iz_cols:
df_up[c] = missing_vals
Fill the data with the regression-obtained parameters:
upper_df = slp_int.loc['intercept']*np.exp(slp_int.loc['slope']*df_up)
upper_df['depth'] = missing_vals
Merge the df1 and the upper_df to obtain a whole profile:
lower_df = df1
lower_df['depth'] = depth
df_profile_tot = upper_df.append(lower_df, ignore_index=True)

How to real-time filter with scipy and lfilter?

Disclaimer: I am probably not as good at DSP as I should be and therefore have more issues than I should have getting this code to work.
I need to filter incoming signals as they happen. I tried to make this code to work, but I have not been able to so far.
Referencing scipy.signal.lfilter doc
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from lib import fnlib
samples = 100
x = np.linspace(0, 7, samples)
y = [] # Unfiltered output
y_filt1 = [] # Real-time filtered
nyq = 0.5 * samples
f1_norm = 0.1 / nyq
f2_norm = 2 / nyq
b, a = scipy.signal.butter(2, [f1_norm, f2_norm], 'band', analog=False)
zi = scipy.signal.lfilter_zi(b,a)
zi = zi*(np.sin(0) + 0.1*np.sin(15*0))
This sets zi as zi*y[0 ] initially, which in this case is 0. I have got it from the example code in the lfilter documentation, but I am not sure if this is correct at all.
Then it comes to the point where I am not sure what to do with the few initial samples.
The coefficients a and b are len(a) = 5 here.
As lfilter takes input values from now to n-4, do I pad it with zeroes, or do I need to wait until 5 samples have gone by and take them as a single bloc, then continuously sample each next step in the same way?
for i in range(0, len(a)-1): # Append 0 as initial values, wrong?
y.append(0)
step = 0
for i in xrange(0, samples): #x:
tmp = np.sin(x[i]) + 0.1*np.sin(15*x[i])
y.append(tmp)
# What to do with the inital filterings until len(y) == len(a) ?
if (step> len(a)):
y_filt, zi = scipy.signal.lfilter(b, a, y[-len(a):], axis=-1, zi=zi)
y_filt1.append(y_filt[4])
print(len(y))
y = y[4:]
print(len(y))
y_filt2 = scipy.signal.lfilter(b, a, y) # Offline filtered
plt.plot(x, y, x, y_filt1, x, y_filt2)
plt.show()

I think I had the same problem, and found a solution on https://github.com/scipy/scipy/issues/5116:
from scipy import zeros, signal, random
def filter_sbs():
data = random.random(2000)
b = signal.firwin(150, 0.004)
z = signal.lfilter_zi(b, 1) * data[0]
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
if __name__ == '__main__':
result = filter_sbs()
The idea is to pass the filter state z in each subsequent call to lfilter. For the first few samples the filter may give strange results, but later (depending on the filter length) it starts to behave correctly.

The problem is not how you are buffering the input. The problem is that in the 'offline' version, the state of the filter is initialized using lfilter_zi which computes the internal state of an LTI so that the output will already be in steady-state when new samples arrive at the input. In the 'real-time' version, you skip this so that the filter's initial state is 0. You can either initialize both versions to using lfilter_zi or else initialize both to 0. Then, it doesn't matter how many samples you filter at a time.
Note, if you initialize to 0, the filter will 'ring' for a certain amount of time before reaching a steady state. In the case of FIR filters, there is an analytic solution for determining this time. For many IIR filters, there is not.
This following is correct. For simplicity's sake I initialize to 0 and feed the input on sample at a time. However, any non-zero block size will produce equivalent output.
from scipy import signal, random
from numpy import zeros
def filter_sbs(data, b):
z = zeros(b.size-1)
result = zeros(data.size)
for i, x in enumerate(data):
result[i], z = signal.lfilter(b, 1, [x], zi=z)
return result
def filter(data, b):
result = signal.lfilter(b,1,data)
return result
if __name__ == '__main__':
data = random.random(20000)
b = signal.firwin(150, 0.004)
result1 = filter_sbs(data, b)
result2 = filter(data, b)
print(result1 - result2)
Output:
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... -5.55111512e-17
0.00000000e+00 1.66533454e-16]

PyMC3 Multinomial Model doesn't work with non-integer observe data

I'm trying to use PyMC3 to solve a fairly simple multinomial distribution. It works perfectly if I have the 'noise' value set to 0.0. However when I change it to anything else, for example 0.01, I get an error in the find_MAP() function and it hangs if I don't use find_MAP().
Is there some reason that the multinomial has to be sparse?
import numpy as np
from pymc3 import *
import pymc3 as mc
import pandas as pd
print 'pymc3 version: ' + mc.__version__
sample_size = 10
number_of_experiments = 1
true_probs = [0.2, 0.1, 0.3, 0.4]
k = len(true_probs)
noise = 0.0
y = np.random.multinomial(n=number_of_experiments, pvals=true_probs, size=sample_size)+noise
y_denominator = np.sum(y,axis=1)
y = y/y_denominator[:,None]
with Model() as multinom_test:
probs = Dirichlet('probs', a = np.ones(k), shape = k)
for i in range(sample_size):
data = Multinomial('data_%d' % (i),
n = y[i].sum(),
p = probs,
observed = y[i])
with multinom_test:
start = find_MAP()
trace = sample(5000, Slice())
trace[probs].mean(0)
Error:
ValueError: Optimization error: max, logp or dlogp at max have non-
finite values. Some values may be outside of distribution support.
max: {'probs_stickbreaking_': array([ 0.00000000e+00, -4.47034834e-
08, 0.00000000e+00])} logp: array(-inf) dlogp: array([
0.00000000e+00, 2.98023221e-08, 0.00000000e+00])Check that 1) you
don't have hierarchical parameters, these will lead to points with
infinite density. 2) your distribution logp's are properly specified.
Specific issues:

This works for me
sample_size = 10
number_of_experiments = 100
true_probs = [0.2, 0.1, 0.3, 0.4]
k = len(true_probs)
noise = 0.01
y = np.random.multinomial(n=number_of_experiments, pvals=true_probs, size=sample_size)+noise
with pm.Model() as multinom_test:
a = pm.Dirichlet('a', a=np.ones(k))
for i in range(sample_size):
data_pred = pm.Multinomial('data_pred_%s'% i, n=number_of_experiments, p=a, observed=y[i])
trace = pm.sample(50000, pm.Metropolis())
#trace = pm.sample(1000) # also works with NUTS
pm.traceplot(trace[500:]);

How to efficiently calculate a running standard deviation

I have an array of lists of numbers, e.g.:
[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)
I would like to efficiently calculate the mean and standard deviation at each index of a list, across all array elements.
To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).
To do the standard deviation, I loop through again, now that I have the mean calculated.
I would like to avoid going through the array twice, once for the mean and then once for the standard deviation (after I have a mean).
Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g., Perl or Python) or pseudocode is fine.

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:
Wikipedia: Algorithms for calculating variance
It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.
You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
I wrote two blog entries on the topic which go into more details, including how to delete previous values online:
Computing Sample Mean and Variance Online in One Pass
Deleting Values in Welford’s Algorithm for Online Mean and Variance
You can also take a look at my Java implement; the javadoc, source, and unit tests are all online:
Javadoc: stats.OnlineNormalEstimator
Source: stats.OnlineNormalEstimator.java
JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
LingPipe Home Page

The basic answer is to accumulate the sum of both x (call it 'sum_x1') and x2 (call it 'sum_x2') as you go. The value of the standard deviation is then:
stdev = sqrt((sum_x2 / n) - (mean * mean))
where
mean = sum_x / n
This is the sample standard deviation; you get the population standard deviation using 'n' instead of 'n - 1' as the divisor.
You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.

Here is a literal pure Python translation of the Welford's algorithm implementation from John D. Cook’s excellent Accurately computing running variance article:
File running_stats.py
import math
class RunningStats:
def __init__(self):
self.n = 0
self.old_m = 0
self.new_m = 0
self.old_s = 0
self.new_s = 0
def clear(self):
self.n = 0
def push(self, x):
self.n += 1
if self.n == 1:
self.old_m = self.new_m = x
self.old_s = 0
else:
self.new_m = self.old_m + (x - self.old_m) / self.n
self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)
self.old_m = self.new_m
self.old_s = self.new_s
def mean(self):
return self.new_m if self.n else 0.0
def variance(self):
return self.new_s / (self.n - 1) if self.n > 1 else 0.0
def standard_deviation(self):
return math.sqrt(self.variance())
Usage:
rs = RunningStats()
rs.push(17.0)
rs.push(19.0)
rs.push(24.0)
mean = rs.mean()
variance = rs.variance()
stdev = rs.standard_deviation()
print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')

Perhaps not what you were asking, but ... If you use a NumPy array, it will do the work for you, efficiently:
from numpy import array
nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
(0.00, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.00, 0.01, 0.05, 0.03)))
print nums.std(axis=1)
# [ 0.0116619 0.00979796 0.00632456 0.01788854]
print nums.mean(axis=1)
# [ 0.022 0.018 0.02 0.02 ]
By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:
Computing sample mean and variance online in one pass

The Python runstats Module is for just this sort of thing. Install runstats from PyPI:
pip install runstats
Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.
from runstats import Statistics
stats = [Statistics() for num in range(len(data[0]))]
for row in data:
for index, val in enumerate(row):
stats[index].push(val)
for index, stat in enumerate(stats):
print 'Index', index, 'mean:', stat.mean()
print 'Index', index, 'standard deviation:', stat.stddev()
Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in the Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.
Disclaimer: I am the author the Python runstats module.

Statistics::Descriptive is a very decent Perl module for these types of calculations:
#!/usr/bin/perl
use strict; use warnings;
use Statistics::Descriptive qw( :all );
my $data = [
[ 0.01, 0.01, 0.02, 0.04, 0.03 ],
[ 0.00, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];
my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures
for my $ref ( #$data ) {
$stat->add_data( #$ref );
printf "Running mean: %f\n", $stat->mean;
printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__
Output:
Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566

Have a look at PDL (pronounced "piddle!").
This is the Perl Data Language which is designed for high precision mathematics and scientific computing.
Here is an example using your figures....
use strict;
use warnings;
use PDL;
my $figs = pdl [
[0.01, 0.01, 0.02, 0.04, 0.03],
[0.00, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.00, 0.01, 0.05, 0.03],
];
my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );
say "Mean scores: ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms): ", $rms;
Which produces:
Mean scores: [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms): [0.011661904 0.009797959 0.0063245553 0.017888544]
Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".
However, it maybe PRMS (which Sinan's Statistics::Descriptive example show) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)
For more PDL information, have a look at:
pdl.perl.org (official PDL page).
PDL quick reference guide on PerlMonks
Dr. Dobb's article on PDL
PDL Wiki
Wikipedia entry for PDL
SourceForge project page for PDL

Unless your array is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.
My preference would be to use the NumPy array maths extension to convert your array of arrays into a NumPy 2D array and get the standard deviation directly:
>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0)
array([ 1. , 1. , 0.5, 1.5, 1.5, 1.5])
>>> a.mean(axis=0)
array([ 2. , 3. , 4.5, 4.5, 5.5, 6.5])
If that's not an option and you need a pure Python solution, keep reading...
If your array is
x = [
[ 1, 2, 4, 3, 4, 5 ],
[ 3, 4, 5, 6, 7, 8 ],
....
]
Then the standard deviation is:
d = len(x[0])
n = len(x)
sum_x = [ sum(v[i] for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
std_dev = [ sqrt((sx2 - sx**2)/N) for sx, sx2 in zip(sum_x, sum_x2) ]
If you are determined to loop through your array only once, the running sums can be combined.
sum_x = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
for i, t in enumerate(v):
sum_x[i] += t
sum_x2[i] += t**2
This isn't nearly as elegant as the list comprehension solution above.

I like to express the update this way:
def running_update(x, N, mu, var):
'''
#arg x: the current data sample
#arg N : the number of previous samples
#arg mu: the mean of the previous samples
#arg var : the variance over the previous samples
#retval (N+1, mu', var') -- updated mean, variance and count
'''
N = N + 1
rho = 1.0/N
d = x - mu
mu += rho*d
var += rho*((1-rho)*d**2 - var)
return (N, mu, var)
so that a one-pass function would look like this:
def one_pass(data):
N = 0
mu = 0.0
var = 0.0
for x in data:
N = N + 1
rho = 1.0/N
d = x - mu
mu += rho*d
var += rho*((1-rho)*d**2 - var)
# could yield here if you want partial results
return (N, mu, var)
note that this is calculating the sample variance (1/N), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalzation factor). Unlike the other answers, the variable, var, that is tracking the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" in getting the variance).
In a class it would look like this:
class RunningMeanVar(object):
def __init__(self):
self.N = 0
self.mu = 0.0
self.var = 0.0
def push(self, x):
self.N = self.N + 1
rho = 1.0/N
d = x-self.mu
self.mu += rho*d
self.var += + rho*((1-rho)*d**2-self.var)
# reset, accessors etc. can be setup as you see fit
This also works for weighted samples:
def running_update(w, x, N, mu, var):
'''
#arg w: the weight of the current sample
#arg x: the current data sample
#arg mu: the mean of the previous N sample
#arg var : the variance over the previous N samples
#arg N : the number of previous samples
#retval (N+w, mu', var') -- updated mean, variance and count
'''
N = N + w
rho = w/N
d = x - mu
mu += rho*d
var += rho*((1-rho)*d**2 - var)
return (N, mu, var)

Here's a "one-liner", spread over multiple lines, in functional programming style:
def variance(data, opt=0):
return (lambda (m2, i, _): m2 / (opt + i - 1))(
reduce(
lambda (m2, i, avg), x:
(
m2 + (x - avg) ** 2 * i / (i + 1),
i + 1,
avg + (x - avg) / (i + 1)
),
data,
(0, 0, 0)))

As the following answer describes:
Does Pandas, SciPy, or NumPy provide a cumulative standard deviation function?
The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that, you'll have to convert your data into a Pandas dataframe (or a series if it is one-dimensional), but there are functions for that.

Here is a practical example of how you could implement a running standard deviation with Python and NumPy:
a = np.arange(1, 10)
s = 0
s2 = 0
for i in range(0, len(a)):
s += a[i]
s2 += a[i] ** 2
n = (i + 1)
m = s / n
std = np.sqrt((s2 / n) - (m * m))
print(std, np.std(a[:i + 1]))
This will print out the calculated standard deviation and a check standard deviation calculated with NumPy:
0.0 0.0
0.5 0.5
0.8164965809277263 0.816496580927726
1.118033988749895 1.118033988749895
1.4142135623730951 1.4142135623730951
1.707825127659933 1.707825127659933
2.0 2.0
2.29128784747792 2.29128784747792
2.5819888974716116 2.581988897471611
I am just using the formula described in this thread:
stdev = sqrt((sum_x2 / n) - (mean * mean))

Responding to Charlie Parker's 2021 question:
I'd like an answer that I can just copy paste to my code in numpy. My input is a matrix of size [N, 1] where N is the number of data points and I already have computed the running mean and I assuming we have computed the running std/variance, how to update we the new batch of data.
Here we have two implementations of a function that takes the original mean, original variance and original size and the new sample and returns the total mean and total variance of the combined original and new sample (to get the standard deviation, just take variance's square root by using **(1/2)). The first uses NumPy, and the second one uses Welford. You may choose the one that best applies to your case.
def mean_and_variance_update_numpy(previous_mean, previous_var, previous_size, sample_to_append):
if type(sample_to_append) is np.matrix:
sample_to_append = sample_to_append.A1
else:
sample_to_append = sample_to_append.flatten()
sample_to_append_mean = np.mean(sample_to_append)
sample_to_append_size = len(sample_to_append)
total_size = previous_size+sample_to_append_size
total_mean = (previous_mean*previous_size+sample_to_append_mean*sample_to_append_size)/total_size
total_var = (((previous_var+(total_mean-previous_mean)**2)*previous_size)+((np.var(sample_to_append)+(sample_to_append_mean-tm)**2)*sample_to_append_size))/total_size
return (total_mean, total_var)
def mean_and_variance_update_welford(previous_mean, previous_var, previous_size, sample_to_append):
if type(sample_to_append) is np.matrix:
sample_to_append = sample_to_append.A1
else:
sample_to_append = sample_to_append.flatten()
pos = previous_size
mean = previous_mean
v = previous_var*previous_size
for value in sample_to_append:
pos += 1
mean_next = mean + (value - mean) / pos
v = v + (value - mean)*(value - mean_next)
mean = mean_next
return (mean, v/pos)
Let's check if it works:
import numpy as np
def mean_and_variance_udpate_numpy:
...
def mean_and_variance_udpate_welford:
...
# Making the samples and results deterministic
np.random.seed(0)
# Our initial sample has 100 samples, we want to append 10
n0, n1 = 100, 10
# Using np.matrix only, because it was in the question. 'np.array' is more common
s0 = np.matrix(1e3+np.random.random_sample(n0)*1e-3).T
s1 = np.matrix(1e3+np.random.random_sample(n1)*1e-3).T
# Precalculating our mean and var for initial sample:
s0mean, s0var = np.mean(s0), np.var(s0)
# Calculating mean and variance for s0+s1 using our NumPy updater
mean_and_variance_update_numpy(s0mean, s0var, len(s0), s1)
# (1000.0004826329636, 8.24577589696613e-08)
# Calculating mean and variance for s0+s1 using our Welford updater
mean_and_variance_update_welford(s0mean, s0var, len(s0), s1)
# (1000.0004826329634, 8.245775896913623e-08)
# Similar results, now checking with NumPy's calculation over the concatenation of s0 and s1
s0s1 = np.concatenate([s0,s1])
(np.mean(s0s1), np.var(s0s1))
# (1000.0004826329638, 8.245775896917313e-08)
Here the three results are closer:
# np(s0s1) (1000.0004826329638, 8.245775896917313e-08)
# np(s0)updnp(s1) (1000.0004826329636, 8.245775896966130e-08)
# np(s0)updwf(s1) (1000.0004826329634, 8.245775896913623e-08)
It is possible to see that the results are very similar.

n=int(raw_input("Enter no. of terms:"))
L=[]
for i in range (1,n+1):
x=float(raw_input("Enter term:"))
L.append(x)
sum=0
for i in range(n):
sum=sum+L[i]
avg=sum/n
sumdev=0
for j in range(n):
sumdev=sumdev+(L[j]-avg)**2
dev=(sumdev/n)**0.5
print "Standard deviation is", dev

Figure I could jump on the old bandwagon. This should work with rbg values
Adapted from
https://math.stackexchange.com/a/2148949
import numpy as np
class IterativeNormStats():
def __init__(self):
"""uint64 max is 18446744073709551615
256**2 = 65536
so we can store 18446744073709551615 / 65536 = 281,474,976,710,656
images before running into overflow issues. I think we'll be ok
"""
self.n = 0
self.rgb_sum = np.zeros(3, dtype=np.uint64)
self.rgb_sq_sum = np.zeros(3, dtype=np.uint64)
def update(self, img_arr):
rgbs = np.reshape(img_arr, (-1, 3)).astype(np.uint64)
self.n += rgbs.shape[0]
self.rgb_sum += np.sum(rgbs, axis=0)
self.rgb_sq_sum += np.sum(np.square(rgbs), axis=0)
def mean(self):
return self.rgb_sum / self.n
def std(self):
return np.sqrt((self.rgb_sq_sum / self.n) - np.square(self.rgb_sum / self.n))
def test_IterativeNormStats():
img_a = np.ones((10, 10, 3), dtype=np.uint8) * (1, 2, 3)
img_b = np.ones((10, 10, 3), dtype=np.uint8) * (2, 4, 6)
img_c = np.ones((10, 10, 3), dtype=np.uint8) * (3, 6, 9)
ins = IterativeNormStats()
for i in range(1000):
for img in [img_a, img_b, img_c]:
ins.update(img)
x = np.vstack([
np.reshape(img_a, (-1, 3)),
np.reshape(img_b, (-1, 3)),
np.reshape(img_c, (-1, 3)),
]*1000)
expected_mean = np.mean(x, axis=0)
expected_std = np.std(x, axis=0)
print(expected_mean)
print(ins.mean())
print(expected_std)
print(ins.std())
assert np.allclose(ins.mean(), expected_mean)
if __name__ == "__main__":
test_IterativeNormStats()

I came across thee welford package that's pretty simple to use:
pip install welford
Then
import numpy as np
from welford import Welford
# Initialize Welford object
w = Welford()
# Input data samples sequentialy
w.add(np.array([0, 100]))
w.add(np.array([1, 110]))
w.add(np.array([2, 120]))
# output
print(w.mean) # mean --> [ 1. 110.]
print(w.var_s) # sample variance --> [1, 100]
print(w.var_p) # population variance --> [ 0.6666 66.66]
# You can add other samples after calculating variances.
w.add(np.array([3, 130]))
w.add(np.array([4, 140]))
# output with added samples
print(w.mean) # mean --> [ 2. 120.]
print(w.var_s) # sample variance --> [ 2.5 250. ]
print(w.var_p) # population variance --> [ 2. 200.]
Notes:
Unlike most othere answers you can feed a Welford object a Numpy array directly
You can even add multiple with Welford.add_all(...)
You can merge independent computations with w1.merge(w2)
You should choose var_p or var_s depending on which one you want to use (Population and Sample variance)
As said, those are variances so you should use np.sqrt to get the associated standard deviation

Here is a simple implementation in python:
class RunningStats:
def __init__(self):
self.mean_x_square = 0
self.mean_x = 0
self.n = 0
def update(self, x):
self.mean_x_square = (self.mean_x_square * self.n + x ** 2) / (self.n + 1)
self.mean_x = (self.mean_x * self.n + x) / (self.n + 1)
self.n += 1
def mean(self):
return self.mean_x
def std(self):
return self.variance() ** 0.5
def variance(self):
return self.mean_x_square - self.mean_x ** 2
Test:
import numpy as np
running_stats = RunningStats()
v = [1.1, 3.5, 5, -8.1, 91]
[running_stats.update(x) for x in v]
print(running_stats.mean() - np.mean(v))
print(running_stats.std() - np.std(v))
print(running_stats.variance() - np.var(v))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Normalizing a list of numbers in Python - python

I need to normalize a list of values to fit in a probability distribution, i.e. between 0.0 and 1.0. I understand how to normalize, but was curious if Python had a function to automate this. I'd like to go from: raw = [0.07, 0.14, 0.07] to normed = [0.25, 0.50, 0.25]

Use : norm = [float(i)/sum(raw) for i in raw] to normalize against the sum to ensure that the sum is always 1.0 (or as close to as possible). use norm = [float(i)/max(raw) for i in raw] to normalize against the maximum

if your list has negative numbers, this is how you would normalize it a = range(-30,31,5) norm = [(float(i)-min(a))/(max(a)-min(a)) for i in a]

For ones who wanna use scikit-learn, you can use from sklearn.preprocessing import normalize x = [1,2,3,4] normalize([x]) # array([[0.18257419, 0.36514837, 0.54772256, 0.73029674]]) normalize([x], norm="l1") # array([[0.1, 0.2, 0.3, 0.4]]) normalize([x], norm="max") # array([[0.25, 0.5 , 0.75, 1.]])

try: normed = [i/sum(raw) for i in raw] normed [0.25, 0.5, 0.25]

Use scikit-learn: from sklearn.preprocessing import MinMaxScaler data = np.array([1,2,3]).reshape(-1, 1) scaler = MinMaxScaler() scaler.fit(data) print(scaler.transform(data))

Related

Indexing dynamic vector of class probabilities

Curve fitting for each column in Pandas + extrapolate values

How to real-time filter with scipy and lfilter?

PyMC3 Multinomial Model doesn't work with non-integer observe data

How to efficiently calculate a running standard deviation

Categories

Resources