I have a huge text file that contains the positions (x, y, z) and velocity components (vx, vy, vz) of a million stars. After some rotations and projections, I obtain new positions and velocity components (x', y', z', vx', vy', vz') for the stars.
My final step is to compute the velocity along the line of sight. In effect I have to "average" the vz component, and to do this I am trying to create a FITS file in which every pixel contains the mean value of the vz component.
Here is part of my code:
import numpy as np
from astropy.io import fits  # assumed source of the fits module used below

mod = np.genfromtxt('data_bar_region.txt')
x = list(mod[:,0])
y = list(mod[:,1])
vz = mod[:,5]
x_rang_1 = np.arange(-40, 41, 1)
y_rang_1 = np.arange(-40, 41, 1)
# np.zeros allocates the 2D image; np.array((n, n)) would only build a 2-element array
fake_data_1 = np.zeros((len(x_rang_1),len(x_rang_1)))
for i in range(len(x_rang_1)-1):
    for j in range(len(y_rang_1)-1):
        vel_tmp = []
        for index in range(len(x)):
            if x_rang_1[i] <= x[index] <= x_rang_1[i+1]:
                if y_rang_1[j] <= y[index] <= y_rang_1[j+1]:
                    vel_tmp.append(vz[index])
        fake_data_1[j,i] = np.mean(vel_tmp)
hdu1 = fits.PrimaryHDU(fake_data_1)
hdu1.writeto('TEST.fits')
This code is far too slow (it took about 8 hours on my laptop) and I don't know how to speed it up.
Do you have any suggestions, or other ways to compute v_LOS better and faster?
EDIT: Before performing the "averaging", I have to divide the image into portions of various shapes and sizes (these portions are called "bins").
Here is an image of the bins (the right panel shows the same bins, zoomed in to make them easier to see).
So, I have another FITS file (called bins.fits) with the same dimensions as fake_data_1, and I just want to find the correspondence between these two files, because I want to calculate the mean and the standard deviation of the distribution of stars in each bin.
Alternatively, I have a text file that records which bin each pixel belongs to, for example:
x y bin
1 1 34
1 2 34
1 3 34
. . .
34 56 37
34 57 37
34 58 37
and so on. The bins file has shape (564, 585), and so does fake_data_1 once the start and stop of the x and y ranges are changed appropriately. I have attached the whole script:
import math
import numpy as np
from astropy.io import fits  # assumed source of the fits module used below

mod = np.genfromtxt('data_new_bar_scaled.txt')
# to match the correct position and size of the observation,
# I have to multiply by a factor equal to the semi-size
x = mod[:, 0]*(585-1)/200
y = mod[:, 1]*(564-1)/200
vz = mod[:,5]
A = fits.open('NGC4277_TESIkinematic.fits')
bins = A[7].data.T
# note: step is not defined in the posted excerpt
start_x = -(585-1)/2
stop_x = (585-1)/2
step_x = step # step in x_rang_1
x_rang = np.arange(start_x, stop_x + step_x, step_x)
start_y = -(564-1)/2
stop_y = (564-1)/2
step_y = step # step in y_rang_1
y_rang = np.arange(start_y, stop_y + step_y, step_y)
fake_data_1 = np.empty((len(x_rang), len(y_rang)))
fake_data_1[:] = np.nan # initialize with NaN
print(fake_data_1.shape)
print(bins.shape)
d = {}
for i in range(len(x)):
    index_for_x = math.floor((x[i] - start_x) / step_x)
    index_for_y = math.floor((y[i] - start_y) / step_y)
    if 0 <= index_for_x < len(x_rang) and 0 <= index_for_y < len(y_rang):
        key = (x_rang[index_for_x], y_rang[index_for_y])
        if key in d:
            d[key].append(vz[i])
        else:
            d[key] = [vz[i]]
bb = np.unique(bins)
print(len(bb))
for i, x in enumerate(x_rang):
    for j, y in enumerate(y_rang):
        key = (x, y)
        for z in range(len(bb)):
            j,k = np.where(bb[z]==bins)
            print('index :', z)
            if key in d:
                fake_data_1[j,k] = np.mean(d[key])
Your code is slow because the nested loops iterate over all one million stars 6,400 (80 × 80) times. You can improve the performance by using a dictionary and iterating over the stars just once.
You can try the following code, which avoids the repeated passes and should be faster by roughly that factor:
import numpy as np
import math
from astropy.io import fits  # assumed source of the fits module used below

mod = np.genfromtxt('data_bar_region.txt')
x = list(mod[:, 0])
y = list(mod[:, 1])
vz = mod[:, 5]
x_rang_1 = np.arange(-40, 41, 1)
y_rang_1 = np.arange(-40, 41, 1)
fake_data_1 = np.empty((len(x_rang_1), len(y_rang_1)))
fake_data_1[:] = np.nan # initialize with NaN
d = {}
# single pass over the stars: group vz by the integer pixel each star falls in
for i in range(len(x)):
    key = (math.floor(x[i]), math.floor(y[i]))
    if key in d:
        d[key].append(vz[i])
    else:
        d[key] = [vz[i]]
# fill each pixel that received at least one star with the mean of its vz values
for i, x in enumerate(x_rang_1):
    for j, y in enumerate(y_rang_1):
        key = (x, y)
        if key in d:
            fake_data_1[i, j] = np.mean(d[key])
hdu1 = fits.PrimaryHDU(fake_data_1)
hdu1.writeto('TEST.fits')
UPDATE
For a generalized version that supports an arbitrary step in x_rang_1 (or y_rang_1), you can try the following code:
import numpy as np
import math
from astropy.io import fits  # assumed source of the fits module used below

mod = np.genfromtxt('data_bar_region.txt')
x = list(mod[:, 0])
y = list(mod[:, 1])
vz = mod[:, 5]
start_x_rang_1 = -40
stop_x_rang_1 = 40
step_x_rang_1 = 0.5 # step in x_rang_1
x_rang_1 = np.arange(start_x_rang_1, stop_x_rang_1 + step_x_rang_1, step_x_rang_1)
start_y_rang_1 = -40
stop_y_rang_1 = 40
step_y_rang_1 = 1 # step in y_rang_1
y_rang_1 = np.arange(start_y_rang_1, stop_y_rang_1 + step_y_rang_1, step_y_rang_1)
fake_data_1 = np.empty((len(x_rang_1), len(y_rang_1)))
fake_data_1[:] = np.nan # initialize with NaN
d = {}
for i in range(len(x)):
    # convert each coordinate to the index of the pixel that contains it
    index_for_x_rang_1 = math.floor((x[i] - start_x_rang_1) / step_x_rang_1)
    index_for_y_rang_1 = math.floor((y[i] - start_y_rang_1) / step_y_rang_1)
    if 0 <= index_for_x_rang_1 < len(x_rang_1) and 0 <= index_for_y_rang_1 < len(y_rang_1):
        key = (x_rang_1[index_for_x_rang_1], y_rang_1[index_for_y_rang_1])
        if key in d:
            d[key].append(vz[i])
        else:
            d[key] = [vz[i]]
for i, x in enumerate(x_rang_1):
    for j, y in enumerate(y_rang_1):
        key = (x, y)
        if key in d:
            fake_data_1[i, j] = np.mean(d[key])
hdu1 = fits.PrimaryHDU(fake_data_1)
hdu1.writeto('TEST.fits')
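The same per-pixel mean can also be computed without an explicit Python loop over the stars. The sketch below is not part of the answer above: it assumes NumPy's histogram2d is acceptable, uses small made-up arrays in place of the star catalogue, and divides a vz-weighted histogram by a count histogram. The helper name mean_vz_map is hypothetical.

```python
import numpy as np

def mean_vz_map(x, y, vz, x_edges, y_edges):
    """Mean of vz in each (x, y) pixel; pixels with no stars become NaN."""
    # Sum of vz per pixel: a 2D histogram weighted by vz.
    vz_sum, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=vz)
    # Number of stars per pixel: the same histogram without weights.
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    mean = np.full(counts.shape, np.nan)
    np.divide(vz_sum, counts, out=mean, where=counts > 0)
    return mean

# Made-up data standing in for the real catalogue.
x = np.array([0.5, 0.6, 1.5])
y = np.array([0.5, 0.4, 1.5])
vz = np.array([1.0, 3.0, 5.0])
edges = np.array([0.0, 1.0, 2.0])
vz_map = mean_vz_map(x, y, vz, edges, edges)
```

The first axis of the result follows x, matching fake_data_1[i, j] above. Unlike the dictionary version, histogram2d takes bin edges rather than pixel origins, and it counts values equal to the last edge in the last pixel.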
UPDATE 2
Maybe something like the following?
Suppose the inputs are
x y vz
0 0.1 10
1.8 0 4
1.2 1.9 5.2
bins = np.array(
    [[34, 35, 34, 34, 36],
     [37, 36, 34, 35, 36],
     [34, 35, 37, 36, 34]]) # shape: (3, 5)
Is the following code what you want?
import numpy as np
import math

x = np.array([0, 1.8, 1.2, ])
y = np.array([0.1, 0, 1.9, ])
vz = np.array([10, 4, 5.2])
start_x_rang_1 = 0
stop_x_rang_1 = 2
step_x_rang_1 = 1 # step in x_rang_1
x_rang_1 = np.arange(start_x_rang_1, stop_x_rang_1 + step_x_rang_1, step_x_rang_1)
start_y_rang_1 = 0
stop_y_rang_1 = 0.5
step_y_rang_1 = 2 # step in y_rang_1
y_rang_1 = np.arange(start_y_rang_1, stop_y_rang_1 + step_y_rang_1, step_y_rang_1)
fake_data_1 = np.empty((len(x_rang_1), len(y_rang_1))) # shape: (3, 2) for these ranges
fake_data_1[:] = np.nan # initialize with NaN
bins = np.array(
    [[34, 35, 34, 34, 36],
     [37, 36, 34, 35, 36],
     [34, 35, 37, 36, 34]]) # shape: (3, 5)
d_bins = {}
for i in range(len(x)):
    index_for_x_rang_1 = math.floor((x[i] - start_x_rang_1) / step_x_rang_1)
    index_for_y_rang_1 = math.floor((y[i] - start_y_rang_1) / step_y_rang_1)
    if 0 <= index_for_x_rang_1 < len(x_rang_1) and 0 <= index_for_y_rang_1 < len(y_rang_1):
        # group vz by the bin label of the pixel the star falls in
        key = bins[index_for_x_rang_1, index_for_y_rang_1]
        if key in d_bins:
            d_bins[key].append(vz[i])
        else:
            d_bins[key] = [vz[i]]
d_bins_mean = {}
for bin_id in d_bins:
    d_bins_mean[bin_id] = np.mean(d_bins[bin_id])
# map every pixel's bin label to that bin's mean (NaN for bins with no stars)
get_corresponding_mean = np.vectorize(lambda x: d_bins_mean.get(x, np.nan))
result = get_corresponding_mean(bins)
print(result)
which prints
[[10. nan 10. 10. nan]
[ 4.6 nan 10. nan nan]
[10. nan 4.6 nan 10. ]]
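If both the mean and the standard deviation per bin are needed, the dictionary of lists could be replaced by a few np.bincount calls. This is only a sketch on the same toy inputs: it assumes the bin labels are non-negative integers and that each star has already been mapped to a pixel index. The function name bin_stats and its n_labels parameter are made up for illustration.

```python
import numpy as np

def bin_stats(bin_ids, vz, n_labels):
    """Per-bin count, mean and population std of vz, indexed by bin label."""
    n = np.bincount(bin_ids, minlength=n_labels)
    s1 = np.bincount(bin_ids, weights=vz, minlength=n_labels)       # sum of vz
    s2 = np.bincount(bin_ids, weights=vz * vz, minlength=n_labels)  # sum of vz**2
    mean = np.full(n.shape, np.nan)
    np.divide(s1, n, out=mean, where=n > 0)
    mean_sq = np.full(n.shape, np.nan)
    np.divide(s2, n, out=mean_sq, where=n > 0)
    # Clip tiny negative values caused by rounding; empty bins stay NaN.
    var = np.maximum(mean_sq - mean**2, 0.0)
    return n, mean, np.sqrt(var)

bins = np.array([[34, 35, 34, 34, 36],
                 [37, 36, 34, 35, 36],
                 [34, 35, 37, 36, 34]])
# Pixel indices of the three toy stars, as computed by the loop above.
ix = np.array([0, 1, 1])
iy = np.array([0, 0, 0])
vz = np.array([10.0, 4.0, 5.2])

n, mean, std = bin_stats(bins[ix, iy], vz, bins.max() + 1)
mean_map = mean[bins]  # same layout as the printed result above
std_map = std[bins]
```

The sum-of-squares formula used here is simple but can lose precision when the values are large compared with their spread, so it is worth checking against np.std on a few bins.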
My program:
# -*- coding: utf-8 -*-
import numpy as np
import itertools
from scipy.optimize import minimize

global width
width = 0.3

def time_builder(f, t0=0, tf=300):
    return list(np.round(np.arange(t0, tf, 1/f*1000),3))

def duo_stim_overlap(t1, t2):
    """
    Function taking 2 timelines build by time_builder function in input
    and returning the ids of overlapping pulses between the 2.
    len(t1) < len(t2)
    """
    pulse_id_t1 = [x for x in range(len(t1)) for y in range(len(t2)) if abs(t1[x] - t2[y]) < width]
    pulse_id_t2 = [x for x in range(len(t2)) for y in range(len(t1)) if abs(t2[x] - t1[y]) < width]
    return pulse_id_t1, pulse_id_t2

def optimal_delay(s):
    frequences = [20, 60, 80, 250, 500]
    t0 = 0
    tf = 150
    delay = 0 # delay between signals,
    timelines = list()
    overlap = dict()
    for i in range(len(frequences)):
        timelines.append(time_builder(frequences[i], t0+delay, tf))
        overlap[i] = list()
        delay += s
    for subset in itertools.combinations(timelines, 2):
        p1_stim, p2_stim = duo_stim_overlap(subset[0], subset[1])
        overlap[timelines.index(subset[0])] += p1_stim
        overlap[timelines.index(subset[1])] += p2_stim
    optim_param = 0
    for key, items in overlap.items():
        optim_param += (len(list(set(items)))/len(timelines[key]))
    return optim_param

res = minimize(optimal_delay, 1.5, method='Nelder-Mead', tol = 0.01, bounds = [(0, 5)], options={'disp': True})
So my goal is to minimize the value optim_param computed by the function optimal_delay.
First of all, gradient-based methods don't do anything. They stop at the first iteration.
Second, I need to set bounds on s, the argument of optimal_delay (between 0 and 5, for instance). I know that's not possible with the Nelder-Mead simplex method, but the other methods didn't work at all.
Third, I don't really know how to set the tol parameter for termination. Both tol = 0.01 and tol = 0.0000001 failed to give me a good result (and the two results were very close to each other).
And finally, if I start at 1.8, for instance, the minimize function gives me a value that is far from being a minimum...
What am I doing wrong?
If you plot your optimal_delay function, you'll see that it is far from convex. The search will just find whatever local minimum is close to your starting point.
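Because optimal_delay counts overlapping pulses, it is probably piecewise constant in s, which would also explain why gradient-based methods stop immediately. For a single bounded parameter, one simple alternative is to evaluate the function on a dense grid and keep the best point. The sketch below assumes that approach; grid_minimize and the stand-in objective bumpy are hypothetical names, and bumpy only imitates a stepped, non-convex function rather than the real optimal_delay.

```python
import numpy as np

def grid_minimize(func, lower, upper, num=501):
    """Evaluate func on an evenly spaced grid and return the best (s, value)."""
    candidates = np.linspace(lower, upper, num)
    values = np.array([func(s) for s in candidates])
    best = int(np.argmin(values))
    return float(candidates[best]), float(values[best])

def bumpy(s):
    # Hypothetical stand-in: stepped and non-convex, like an overlap count.
    return float(np.round(np.sin(3.0 * s), 1) + 0.1 * (s - 4.0) ** 2)

best_s, best_val = grid_minimize(bumpy, 0.0, 5.0)
```

The grid respects the bounds by construction and does not depend on a starting point. scipy.optimize also offers global strategies such as brute and differential_evolution that could be tried on optimal_delay in the same spirit.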
I have continuous floating-point data ranging from -257.2 to 154.98.
I have no idea how it is distributed, but I want to put it into bins, say -270 to -201, -200 to -141, -140 to -71, -70 to -1, 0 to 69, 70 to 139, 140 to 209.
Is there a way to do this? Specifically, I am looking for something like:
data = np.random.rand(10)
data
array([ 0.58791019, 0.2385624 , 0.70927668, 0.22916244, 0.87479326,
0.49609703, 0.3758358 , 0.35743165, 0.30816457, 0.2018548 ])
def GenRangedData(data, min, max, step):
    #some code
    no_of_bins = (max - min)/ step
    bins = []
    #some code
    return bins
rd = GenRangedData(data, 0, 1, 0.1)
# should generate:
rd
[[], [0.2385624, 0.22916244, 0.2018548], [0.3758358, 0.35743165, 0.30816457], [0.49609703], [0.58791019], [], [0.70927668], [0.87479326]]
I can obviously do this by manually iterating over all the numbers, but I want to automate it so that I can experiment freely with min, max and step. Is there a way to do this efficiently?
This is what I came up with. I do not know if it is the best way;
if you think it can be done faster, please update/edit.
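def GenRangedData(data, i_min, i_max, step):
    # data is a sequence of rows; each value is replaced by the index n of
    # the first bin edge i_min + n*step that lies above it
    cat_data = []
    bins = ((i_max - i_min) / step) + 2
    for x in range(0, len(data)):
        temp_data = []
        for y in range(0, len(data[x])):
            for n in range(0, int(bins)):
                if data[x][y] < (i_min + (n*step)):
                    temp_data.append(n)
                    break
        cat_data.append(temp_data)
    return cat_data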
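For comparison, here is a sketch of the grouping the question asks for, assuming 1-D data and equal-width bins. It is not taken from the answer above; group_into_bins is a hypothetical name, and values outside [lo, hi) are simply dropped.

```python
import numpy as np

def group_into_bins(data, lo, hi, step):
    """Return a list of lists: the values of data that fall in each bin."""
    data = np.asarray(data, dtype=float)
    n_bins = int(round((hi - lo) / step))
    # Bin index of every value, computed in one vectorized step.
    idx = np.floor((data - lo) / step).astype(int)
    inside = (idx >= 0) & (idx < n_bins)
    groups = [[] for _ in range(n_bins)]
    for k, value in zip(idx[inside], data[inside]):
        groups[k].append(float(value))
    return groups

sample = [0.05, 0.25, 0.22, 0.95, 1.7]
rd = group_into_bins(sample, 0, 1, 0.1)
```

Only the final distribution into lists is a Python loop; the index computation is a single array operation, so min, max and step can be varied cheaply. Values that sit exactly on a bin edge may land on either side because of floating-point rounding.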
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)
    if n_classes <= 1:
        return 0
    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings on one answer, so I'll post a comparison of scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit
def entropy1(labels, base=None):
    value,counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    value,counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)
    if n_classes <= 1:
        return 0
    ent = 0.
    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)
    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc)/np.log(base)).sum()

def entropy4(labels, base=None):
    value,counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts)/np.log(base)).sum()
Timeit operations:
repeat_number = 1000000
a = timeit.repeat(stmt='''entropy1(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
repeat=3, number=repeat_number)
b = timeit.repeat(stmt='''entropy2(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
repeat=3, number=repeat_number)
c = timeit.repeat(stmt='''entropy3(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
repeat=3, number=repeat_number)
d = timeit.repeat(stmt='''entropy4(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle numeric AND text data. ex: entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use-case for binning ints but it won't work for text.
With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:
import pandas as pd
import scipy.stats
def ent(data):
    """Calculates entropy of the passed `pd.Series`
    """
    p_data = data.value_counts() # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data) # get entropy from counts
    return entropy
Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
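The normalization that the note describes can be written out by hand. The sketch below uses only NumPy and illustrates the idea; it is not scipy's actual implementation, and the function name entropy_from_counts is hypothetical.

```python
import numpy as np

def entropy_from_counts(counts, base=None):
    """Shannon entropy of a distribution given as raw, unnormalized counts."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()   # normalize so the values sum to 1
    probs = probs[probs > 0]        # empty categories contribute nothing
    h = -np.sum(probs * np.log(probs))
    if base is not None:
        h /= np.log(base)           # change of base, e.g. base=2 for bits
    return h

h_small = entropy_from_counts([1, 1])
h_scaled = entropy_from_counts([50, 50])
h_bits = entropy_from_counts([1, 1, 1, 1], base=2)
```

Scaling all counts by the same factor leaves the result unchanged, which is why passing value_counts() output directly is fine.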
An answer that doesn't rely on numpy, either:
import math
from collections import Counter
def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }
    if len(data) <= 1:
        return 0
    counts = Counter()
    for d in data:
        counts[d] += 1
    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])
    return ent
This will accept any datatype you could throw at it:
>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005
>>> eta([c for c in "mary had a little lamb"])
2.311097886212714
The answer provided by @Jarad suggested timings as well. To that end:
repeat_number = 1000000
e = timeit.repeat(
stmt='''eta(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''',
repeat=3,
number=repeat_number)
Timeit results: (I believe this is ~4x faster than the best numpy approach)
print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))
Method: eta, Avg.: 10.461799
Following the suggestion from unutbu, I created a pure Python implementation.
from math import log
import numpy as np

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)
    if n_classes <= 1:
        return 0
    ent = 0.
    # Compute standard entropy.
    for i in probs:
        if i > 0:  # skip empty classes; log(0) is undefined
            ent -= i * log(i, n_classes)  # math.log takes the base positionally
    return ent
The point I was missing was that labels is a large array, however probs is 3 or 4 elements long. Using pure python my application now is twice as fast.
Here is my approach:
labels = [0, 0, 1, 1]
from collections import Counter
from scipy import stats
stats.entropy(list(Counter(labels).values()), base=2)
My favorite function for entropy is the following:
def entropy(labels):
    # assumes labels is a list and numpy has been imported as np
    prob_dict = {x:labels.count(x)/len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))
    return - probs.dot(np.log2(probs))
I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I will comment again if I find one.
Uniformly distributed data (high entropy):
s=range(0,256)
Shannon entropy calculation step by step:
import collections
import math
# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]
# calculate per-character entropy fractions
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]
# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy
8.0
One-liner (assuming import collections and import math):
def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])
A proper function:
import collections
import math
def H(s):
    probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
    e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
    return sum(e_x)
Test cases - English text taken from CyberChef entropy estimator:
>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
This method extends the other solutions by allowing for binning. For example, bins=None (the default) won't bin x and will compute an empirical probability for each element of x, while bins=256 groups x into 256 bins before computing the empirical probabilities.
import numpy as np

def entropy(x, bins=None):
    N = x.shape[0]
    if bins is None:
        counts = np.bincount(x)
    else:
        counts = np.histogram(x, bins=bins)[0] # 0th idx is counts
    p = counts[np.nonzero(counts)]/N # avoids log(0)
    H = -np.dot( p, np.log2(p) )
    return H
This is the fastest Python implementation I've found so far:
import numpy as np
def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])
from collections import Counter
from scipy import stats
labels = [0.9, 0.09, 0.1]
stats.entropy(list(Counter(labels).keys()), base=2)
BiEntropy won't be the fastest way of computing entropy, but it is rigorous and builds upon Shannon entropy in a well-defined way. It has been tested in various fields, including image-related applications.
It is implemented in Python on Github.
A bit late to the party, but I stumbled on this and all the answers seem to rely on Kullback–Leibler divergence, which has no upper bound and hence doesn't fit my needs.
Here is an approximation (marked TODO, it could be improved) of an entropy-like function whose values lie in [0, 1].
It calculates the bias of a single column.
import numpy as np

class Pandas_Dataframe_helper:
    #some other methods here...
    @staticmethod
    def column_biass(df_column):
        df_column_as_list = list(df_column)
        N = len(df_column_as_list)
        values,counts = np.unique(df_column_as_list, return_counts=True)
        #generate synth list (TODO! what if not even number? Least common multiple of(num_different_labels,[x for x in counts]))
        num_different_labels = len(values)
        num_items_per_label = N // num_different_labels
        synthetic_list = []
        for current_value in values:
            synthetic_list.extend([current_value] * num_items_per_label)
        #TODO! approximation
        if(len(synthetic_list) != len(df_column_as_list)):
            synthetic_list.extend([current_value] * (len(df_column_as_list) - len(synthetic_list)))
        #now, extrapolate differences between sorted-input-list and synthetic_list
        df_column_as_list_sorted = sorted(df_column_as_list)
        counter_unmatches = 0
        for i in range(0,N):
            if(df_column_as_list_sorted[i] != synthetic_list[i]):
                counter_unmatches += 1
        #upper_bound = g(N,num_different_labels)
        #((K-1)M)-1 K==num_different_labels , M==num theoretically perfect distribution's items per label
        upper_bound = ((num_different_labels-1)*num_items_per_label)-1
        return counter_unmatches/upper_bound
#---------------------------------------------------------------------------------------------------------------------
Complete code at https://github.com/glezo1/pcommonlibs/blob/master/com/glezo/pandas_dataframe_helper/Pandas_Dataframe_Helper.py
The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.
import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0) # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1) # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.
    A = np.asarray(A)
    if axis is None:
        A = A.flatten()
        counts = np.bincount(A) # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0. # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        # a list, not map(): in Python 3 map() returns a lazy iterator
        entropies = [entropy(col) for col in A.T]
        return np.array(entropies)
    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))
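import math

def entropy(base, prob_a, prob_b ):
    # entropy of a two-outcome distribution with probabilities prob_a and prob_b
    base=2  # overrides the argument, so the result is always in bits
    x=prob_a
    y=prob_b
    expression =-((x*math.log(x,base)+(y*math.log(y,base))))
    return [expression]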
I have an array of lists of numbers, e.g.:
[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)
I would like to efficiently calculate the mean and standard deviation at each index of a list, across all array elements.
To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).
To do the standard deviation, I loop through again, now that I have the mean calculated.
I would like to avoid going through the array twice, once for the mean and then once for the standard deviation (after I have a mean).
Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g., Perl or Python) or pseudocode is fine.
The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:
Wikipedia: Algorithms for calculating variance
It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.
You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
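To make the divisor difference concrete, Python's standard library exposes both conventions. This is a small sketch on three arbitrary values; nothing beyond the statistics module is assumed.

```python
import statistics

values = [17.0, 19.0, 24.0]

# Divide by N: variance of the data treated as the whole population.
population_variance = statistics.pvariance(values)
# Divide by N - 1: unbiased estimate of the variance from a sample.
sample_variance = statistics.variance(values)

# The two differ by the factor N / (N - 1).
ratio = sample_variance / population_variance
```

For large N the factor approaches 1, so the choice matters mostly for small samples.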
I wrote two blog entries on the topic which go into more details, including how to delete previous values online:
Computing Sample Mean and Variance Online in One Pass
Deleting Values in Welford’s Algorithm for Online Mean and Variance
You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:
Javadoc: stats.OnlineNormalEstimator
Source: stats.OnlineNormalEstimator.java
JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
LingPipe Home Page
The basic answer is to accumulate the sum of both x (call it 'sum_x') and x² (call it 'sum_x2') as you go. The value of the standard deviation is then:
stdev = sqrt((sum_x2 / n) - (mean * mean))
where
mean = sum_x / n
This is the population standard deviation, because the divisor is 'n'; to get the sample standard deviation, which uses 'n - 1' as the divisor, multiply the variance by n / (n - 1) before taking the square root.
You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
Here is a literal pure Python translation of the Welford's algorithm implementation from John D. Cook’s excellent Accurately computing running variance article:
File running_stats.py
import math
class RunningStats:

    def __init__(self):
        self.n = 0
        self.old_m = 0
        self.new_m = 0
        self.old_s = 0
        self.new_s = 0

    def clear(self):
        self.n = 0

    def push(self, x):
        self.n += 1
        if self.n == 1:
            self.old_m = self.new_m = x
            self.old_s = 0
        else:
            self.new_m = self.old_m + (x - self.old_m) / self.n
            self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)
            self.old_m = self.new_m
            self.old_s = self.new_s

    def mean(self):
        return self.new_m if self.n else 0.0

    def variance(self):
        return self.new_s / (self.n - 1) if self.n > 1 else 0.0

    def standard_deviation(self):
        return math.sqrt(self.variance())
Usage:
rs = RunningStats()
rs.push(17.0)
rs.push(19.0)
rs.push(24.0)
mean = rs.mean()
variance = rs.variance()
stdev = rs.standard_deviation()
print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')
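Applied to the question's layout (a sequence of equal-length rows), the same update can be run per index in a single pass. The sketch below is an adaptation for illustration, not part of Cook's article; column_stats is a hypothetical helper, and it returns the population standard deviation (divisor n), as the question requests.

```python
import math

def column_stats(rows):
    """One pass over rows; returns (means, population std devs) per index."""
    n = 0
    means = []
    m2 = []  # running sums of squared deviations from the current mean
    for row in rows:
        if n == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for i, x in enumerate(row):
            delta = x - means[i]
            means[i] += delta / n
            m2[i] += delta * (x - means[i])  # uses the updated mean
    stds = [math.sqrt(s / n) for s in m2] if n else []
    return means, stds

rows = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
means, stds = column_stats(rows)
```

Dividing m2 by n - 1 instead of n would give the sample standard deviation, matching RunningStats.variance above.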
Perhaps not what you were asking, but ... If you use a NumPy array, it will do the work for you, efficiently:
from numpy import array
nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
(0.00, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.00, 0.01, 0.05, 0.03)))
print(nums.std(axis=1))
# [ 0.0116619 0.00979796 0.00632456 0.01788854]
print(nums.mean(axis=1))
# [ 0.022 0.018 0.02 0.02 ]
By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:
Computing sample mean and variance online in one pass
The Python runstats Module is for just this sort of thing. Install runstats from PyPI:
pip install runstats
Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.
from runstats import Statistics
stats = [Statistics() for num in range(len(data[0]))]
for row in data:
    for index, val in enumerate(row):
        stats[index].push(val)
for index, stat in enumerate(stats):
    print('Index', index, 'mean:', stat.mean())
    print('Index', index, 'standard deviation:', stat.stddev())
Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in the Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.
Disclaimer: I am the author of the Python runstats module.
Statistics::Descriptive is a very decent Perl module for these types of calculations:
#!/usr/bin/perl
use strict; use warnings;
use Statistics::Descriptive qw( :all );
my $data = [
[ 0.01, 0.01, 0.02, 0.04, 0.03 ],
[ 0.00, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];
my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures
for my $ref ( @$data ) {
    $stat->add_data( @$ref );
    printf "Running mean: %f\n", $stat->mean;
    printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__
Output:
Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566
Have a look at PDL (pronounced "piddle!").
This is the Perl Data Language which is designed for high precision mathematics and scientific computing.
Here is an example using your figures....
use strict;
use warnings;
use PDL;
my $figs = pdl [
[0.01, 0.01, 0.02, 0.04, 0.03],
[0.00, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.00, 0.01, 0.05, 0.03],
];
my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );
say "Mean scores: ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms): ", $rms;
Which produces:
Mean scores: [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms): [0.011661904 0.009797959 0.0063245553 0.017888544]
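Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".
However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). Comparing the numbers above, RMS matches NumPy's std (divisor N), PRMS matches the N - 1 version, and ADEV works out to the average absolute deviation, so RMS or PRMS is the one you want, depending on the divisor ;-)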
For more PDL information, have a look at:
pdl.perl.org (official PDL page).
PDL quick reference guide on PerlMonks
Dr. Dobb's article on PDL
PDL Wiki
Wikipedia entry for PDL
SourceForge project page for PDL
Unless your array is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.
My preference would be to use the NumPy array maths extension to convert your array of arrays into a NumPy 2D array and get the standard deviation directly:
>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0)
array([ 1. , 1. , 0.5, 1.5, 1.5, 1.5])
>>> a.mean(axis=0)
array([ 2. , 3. , 4.5, 4.5, 5.5, 6.5])
If that's not an option and you need a pure Python solution, keep reading...
If your array is
x = [
[ 1, 2, 4, 3, 4, 5 ],
[ 3, 4, 5, 6, 7, 8 ],
....
]
Then the standard deviation is:
from math import sqrt
d = len(x[0])
n = len(x)
sum_x = [ sum(v[i] for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
# population variance per column: mean of squares minus square of the mean
std_dev = [ sqrt(sx2/n - (sx/n)**2) for sx, sx2 in zip(sum_x, sum_x2) ]
If you are determined to loop through your array only once, the running sums can be combined.
sum_x = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
    for i, t in enumerate(v):
        sum_x[i] += t
        sum_x2[i] += t**2
This isn't nearly as elegant as the list comprehension solution above.
I like to express the update this way:
def running_update(x, N, mu, var):
    '''
    @arg x: the current data sample
    @arg N : the number of previous samples
    @arg mu: the mean of the previous samples
    @arg var : the variance over the previous samples
    @retval (N+1, mu', var') -- updated mean, variance and count
    '''
    N = N + 1
    rho = 1.0/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
so that a one-pass function would look like this:
def one_pass(data):
    N = 0
    mu = 0.0
    var = 0.0
    for x in data:
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        # could yield here if you want partial results
    return (N, mu, var)
Note that this calculates the variance of the sample itself (normalized by 1/N), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalization factor). Unlike the other answers, the variable var that tracks the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" to get the variance).
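The update for var can be checked against the usual Welford recurrence. The short derivation below is a sketch that introduces symbols not used in the answer: M_N is the sum of squared deviations after N samples, d = x - \mu_{N-1}, and \rho = 1/N.

```latex
\begin{aligned}
\mu_N &= \mu_{N-1} + \rho\, d, \qquad x - \mu_N = (1-\rho)\, d,\\
M_N &= M_{N-1} + d\,(x - \mu_N) = M_{N-1} + (1-\rho)\, d^2,\\
\mathrm{var}_N &= \frac{M_N}{N}
  = \frac{N-1}{N}\,\mathrm{var}_{N-1} + \rho\,(1-\rho)\, d^2
  = \mathrm{var}_{N-1} + \rho\left((1-\rho)\, d^2 - \mathrm{var}_{N-1}\right).
\end{aligned}
```

The last expression is the line var += rho*((1-rho)*d**2 - var) from the code, using M_{N-1} = (N-1) var_{N-1} and (N-1)/N = 1 - \rho.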
In a class it would look like this:
class RunningMeanVar(object):
    def __init__(self):
        self.N = 0
        self.mu = 0.0
        self.var = 0.0
    def push(self, x):
        self.N = self.N + 1
        rho = 1.0/self.N  # must use the instance counter, not a bare N
        d = x-self.mu
        self.mu += rho*d
        self.var += rho*((1-rho)*d**2-self.var)
    # reset, accessors etc. can be setup as you see fit
This also works for weighted samples:
def running_update(w, x, N, mu, var):
    '''
    @arg w: the weight of the current sample
    @arg x: the current data sample
    @arg mu: the mean of the previous N sample
    @arg var : the variance over the previous N samples
    @arg N : the number of previous samples
    @retval (N+w, mu', var') -- updated mean, variance and count
    '''
    N = N + w
    rho = w/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
Here's a "one-liner", spread over multiple lines, in functional programming style:
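from functools import reduce
def variance(data, opt=0):
    # Python 3 version: acc is the tuple (m2, i, avg)
    # opt=0 gives the sample variance (n - 1), opt=1 the population variance (n)
    return (lambda acc: acc[0] / (opt + acc[1] - 1))(
        reduce(
            lambda acc, x:
            (
                acc[0] + (x - acc[2]) ** 2 * acc[1] / (acc[1] + 1),
                acc[1] + 1,
                acc[2] + (x - acc[2]) / (acc[1] + 1)
            ),
            data,
            (0, 0, 0)))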
As the following answer describes:
Does Pandas, SciPy, or NumPy provide a cumulative standard deviation function?
The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that, you'll have to convert your data into a Pandas dataframe (or a series if it is one-dimensional), but there are functions for that.
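A minimal sketch of that approach, assuming pandas is installed: Series.expanding() builds a growing window and .std() evaluates the standard deviation over it. Passing ddof=0 is a choice made here so the result matches np.std; the pandas default is ddof=1.

```python
import numpy as np
import pandas as pd

# Illustrative data: the integers 1..9 as a one-dimensional series
a = pd.Series(np.arange(1, 10), dtype=float)

# Cumulative (expanding-window) population standard deviation
cumulative_std = a.expanding().std(ddof=0)

print(cumulative_std.to_numpy())
```

With the default ddof=1, the first entry would be NaN, since a single sample has no unbiased variance estimate.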
Here is a practical example of how you could implement a running standard deviation with Python and NumPy:
import numpy as np
a = np.arange(1, 10)
s = 0    # running sum of values
s2 = 0   # running sum of squared values
for i in range(0, len(a)):
    s += a[i]
    s2 += a[i] ** 2
    n = (i + 1)
    m = s / n
    std = np.sqrt((s2 / n) - (m * m))
    print(std, np.std(a[:i + 1]))
This will print out the calculated standard deviation and a check standard deviation calculated with NumPy:
0.0 0.0
0.5 0.5
0.8164965809277263 0.816496580927726
1.118033988749895 1.118033988749895
1.4142135623730951 1.4142135623730951
1.707825127659933 1.707825127659933
2.0 2.0
2.29128784747792 2.29128784747792
2.5819888974716116 2.581988897471611
I am just using the formula described in this thread:
stdev = sqrt((sum_x2 / n) - (mean * mean))
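One caveat with this formula: it subtracts two nearly equal numbers when the mean is large compared to the spread, so it can lose precision in floating point. The sketch below is only an illustration with made-up data; naive_std and welford_std are hypothetical helper names, and the second one uses a Welford-style update for comparison.

```python
import math

def naive_std(data):
    """sqrt(sum_x2 / n - mean**2), as in the formula above."""
    n = len(data)
    s = sum(data)
    s2 = sum(x * x for x in data)
    mean = s / n
    # max(..., 0.0) guards against a slightly negative value caused by rounding
    return math.sqrt(max(s2 / n - mean * mean, 0.0))

def welford_std(data):
    """Population standard deviation via a Welford-style running update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / n)

small = [4.0, 7.0, 13.0, 16.0]
shifted = [x + 1e9 for x in small]  # same spread, very large mean

print(naive_std(small), welford_std(small))
print(naive_std(shifted), welford_std(shifted))
```

Shifting the data does not change the true standard deviation, so any difference between the two printed lines comes from rounding in the naive formula.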
Responding to Charlie Parker's 2021 question:
I'd like an answer that I can just copy paste to my code in numpy. My input is a matrix of size [N, 1] where N is the number of data points and I already have computed the running mean and I assuming we have computed the running std/variance, how to update we the new batch of data.
Here we have two implementations of a function that takes the original mean, original variance and original size and the new sample and returns the total mean and total variance of the combined original and new sample (to get the standard deviation, just take variance's square root by using **(1/2)). The first uses NumPy, and the second one uses Welford. You may choose the one that best applies to your case.
def mean_and_variance_update_numpy(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    sample_to_append_mean = np.mean(sample_to_append)
    sample_to_append_size = len(sample_to_append)
    total_size = previous_size+sample_to_append_size
    total_mean = (previous_mean*previous_size+sample_to_append_mean*sample_to_append_size)/total_size
    total_var = (((previous_var+(total_mean-previous_mean)**2)*previous_size)+((np.var(sample_to_append)+(sample_to_append_mean-total_mean)**2)*sample_to_append_size))/total_size
    return (total_mean, total_var)
def mean_and_variance_update_welford(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    pos = previous_size
    mean = previous_mean
    v = previous_var*previous_size
    for value in sample_to_append:
        pos += 1
        mean_next = mean + (value - mean) / pos
        v = v + (value - mean)*(value - mean_next)
        mean = mean_next
    return (mean, v/pos)
Let's check if it works:
import numpy as np
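def mean_and_variance_update_numpy(previous_mean, previous_var, previous_size, sample_to_append):
    ...  # body as defined above
def mean_and_variance_update_welford(previous_mean, previous_var, previous_size, sample_to_append):
    ...  # body as defined above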
# Making the samples and results deterministic
np.random.seed(0)
# Our initial sample has 100 samples, we want to append 10
n0, n1 = 100, 10
# Using np.matrix only, because it was in the question. 'np.array' is more common
s0 = np.matrix(1e3+np.random.random_sample(n0)*1e-3).T
s1 = np.matrix(1e3+np.random.random_sample(n1)*1e-3).T
# Precalculating our mean and var for initial sample:
s0mean, s0var = np.mean(s0), np.var(s0)
# Calculating mean and variance for s0+s1 using our NumPy updater
mean_and_variance_update_numpy(s0mean, s0var, len(s0), s1)
# (1000.0004826329636, 8.24577589696613e-08)
# Calculating mean and variance for s0+s1 using our Welford updater
mean_and_variance_update_welford(s0mean, s0var, len(s0), s1)
# (1000.0004826329634, 8.245775896913623e-08)
# Similar results, now checking with NumPy's calculation over the concatenation of s0 and s1
s0s1 = np.concatenate([s0,s1])
(np.mean(s0s1), np.var(s0s1))
# (1000.0004826329638, 8.245775896917313e-08)
Here are the three results side by side:
# np(s0s1) (1000.0004826329638, 8.245775896917313e-08)
# np(s0)updnp(s1) (1000.0004826329636, 8.245775896966130e-08)
# np(s0)updwf(s1) (1000.0004826329634, 8.245775896913623e-08)
It is possible to see that the results are very similar.
n = int(input("Enter no. of terms:"))
L = []
for i in range(1, n + 1):
    x = float(input("Enter term:"))
    L.append(x)
# Mean of the entered terms
total = 0
for i in range(n):
    total = total + L[i]
avg = total / n
# Sum of squared deviations from the mean
sumdev = 0
for j in range(n):
    sumdev = sumdev + (L[j] - avg) ** 2
dev = (sumdev / n) ** 0.5
print("Standard deviation is", dev)
Figured I could jump on the old bandwagon. This should work with RGB values.
Adapted from
https://math.stackexchange.com/a/2148949
import numpy as np
class IterativeNormStats():
    def __init__(self):
        """uint64 max is 18446744073709551615
        256**2 = 65536
        so we can accumulate 18446744073709551615 / 65536 = 281,474,976,710,656
        pixels before running into overflow issues. I think we'll be ok
        """
        self.n = 0
        self.rgb_sum = np.zeros(3, dtype=np.uint64)
        self.rgb_sq_sum = np.zeros(3, dtype=np.uint64)
    def update(self, img_arr):
        # Flatten to one row per pixel, widened so sums and squares do not overflow
        rgbs = np.reshape(img_arr, (-1, 3)).astype(np.uint64)
        self.n += rgbs.shape[0]
        self.rgb_sum += np.sum(rgbs, axis=0)
        self.rgb_sq_sum += np.sum(np.square(rgbs), axis=0)
    def mean(self):
        return self.rgb_sum / self.n
    def std(self):
        return np.sqrt((self.rgb_sq_sum / self.n) - np.square(self.rgb_sum / self.n))
def test_IterativeNormStats():
    img_a = np.ones((10, 10, 3), dtype=np.uint8) * (1, 2, 3)
    img_b = np.ones((10, 10, 3), dtype=np.uint8) * (2, 4, 6)
    img_c = np.ones((10, 10, 3), dtype=np.uint8) * (3, 6, 9)
    ins = IterativeNormStats()
    for i in range(1000):
        for img in [img_a, img_b, img_c]:
            ins.update(img)
    x = np.vstack([
        np.reshape(img_a, (-1, 3)),
        np.reshape(img_b, (-1, 3)),
        np.reshape(img_c, (-1, 3)),
    ]*1000)
    expected_mean = np.mean(x, axis=0)
    expected_std = np.std(x, axis=0)
    print(expected_mean)
    print(ins.mean())
    print(expected_std)
    print(ins.std())
    assert np.allclose(ins.mean(), expected_mean)
    assert np.allclose(ins.std(), expected_std)
if __name__ == "__main__":
    test_IterativeNormStats()
I came across the welford package, which is pretty simple to use:
pip install welford
Then
import numpy as np
from welford import Welford
# Initialize Welford object
w = Welford()
# Input data samples sequentially
w.add(np.array([0, 100]))
w.add(np.array([1, 110]))
w.add(np.array([2, 120]))
# output
print(w.mean) # mean --> [ 1. 110.]
print(w.var_s) # sample variance --> [1, 100]
print(w.var_p) # population variance --> [ 0.6666 66.66]
# You can add other samples after calculating variances.
w.add(np.array([3, 130]))
w.add(np.array([4, 140]))
# output with added samples
print(w.mean) # mean --> [ 2. 120.]
print(w.var_s) # sample variance --> [ 2.5 250. ]
print(w.var_p) # population variance --> [ 2. 200.]
Notes:
Unlike most other answers, you can feed a Welford object a NumPy array directly
You can even add multiple samples at once with Welford.add_all(...)
You can merge independent computations with w1.merge(w2)
You should choose var_p or var_s depending on which one you want to use (Population and Sample variance)
As said, those are variances so you should use np.sqrt to get the associated standard deviation
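The merge step mentioned in the notes can be understood through the standard formula for combining the statistics of two disjoint batches. Below is a minimal sketch in plain NumPy. The functions merge_stats and batch_stats and their argument layout are hypothetical, written for illustration; they are not the welford package's API or implementation. The data reuses the samples from the example above.

```python
import numpy as np

def batch_stats(batch):
    """Return (count, mean, sum of squared deviations) per column."""
    batch = np.asarray(batch, dtype=float)
    mean = batch.mean(axis=0)
    return len(batch), mean, ((batch - mean) ** 2).sum(axis=0)

def merge_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine the statistics of two disjoint batches without revisiting the data."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2

a = np.array([[0, 100], [1, 110], [2, 120]])
b = np.array([[3, 130], [4, 140]])

n, mean, m2 = merge_stats(*batch_stats(a), *batch_stats(b))
var_p = m2 / n        # population variance
var_s = m2 / (n - 1)  # sample variance

print(mean, var_s, var_p)
```

Keeping the sum of squared deviations (rather than the variance itself) as the running state is what makes both var_p and var_s available from the same merged result.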
Here is a simple implementation in Python:
class RunningStats:
    def __init__(self):
        self.mean_x_square = 0
        self.mean_x = 0
        self.n = 0
    def update(self, x):
        self.mean_x_square = (self.mean_x_square * self.n + x ** 2) / (self.n + 1)
        self.mean_x = (self.mean_x * self.n + x) / (self.n + 1)
        self.n += 1
    def mean(self):
        return self.mean_x
    def std(self):
        return self.variance() ** 0.5
    def variance(self):
        return self.mean_x_square - self.mean_x ** 2
Test:
import numpy as np
running_stats = RunningStats()
v = [1.1, 3.5, 5, -8.1, 91]
for x in v:
    running_stats.update(x)
print(running_stats.mean() - np.mean(v))
print(running_stats.std() - np.std(v))
print(running_stats.variance() - np.var(v))