Numpy not producing desired sample variance value

Numpy not producing desired sample variance value - python

I have a list for which I like to calculate sample variance. When I use numpy.var I get a different result from the function I defined.
Can someone please help me understand what I am missing?
my_ls = [227, 222, 218, 217, 225, 218, 216, 229, 228, 221]
def calc_mean(ls):
sum_tmp = 0
for i in ls:
sum_tmp = sum_tmp + i
return round(sum_tmp/len(ls), 2)
def calc_var(ls):
tmp_mean = calc_mean(ls)
tmp_sum = 0
for i in ls:
tmp_sum = tmp_sum + ((i - tmp_mean)**2)
return round(tmp_sum/(len(ls)-1), 2)
calc_var(my_ls)
>>> 23.66
np.var(my_ls)
>>> 21.29
23.66 is my desired value.

The denominator in the case of np.var(my_ls) by default is the total sample size (N).
You can use the Delta Degrees of Freedom (ddof) parameter in numpy to show you're calculating the sample variance by setting ddof = 1 for the mean degree of freedom. I.e. your denominator now becomes (N-1)
np.var(my_ls,ddof=1)
>>> 23.66

You can use the ddof parameter of np.var(), which stands for "degrees of freedom":
np.var(my_ls, ddof=1)
# 23.65555555555555
to get you to the desired result.
Essentially, you divide your sum of squares by n - 1, while np.var() divides by n - ddof, which defaults to 0.
A discussion on these topics can be found on Wikipedia.

You are using the unbiased formula for the variance, i.e. you are dividing the summatory by N-1, while np.var seems to compute the variance normalizing by the total number of elements in the vector.
Check for example here, section "Sample variance".

your function calc_var(ls) doesn't implement variance formula:
The variance is the average of the squared deviations from the mean,
i.e., var = mean(abs(x - x.mean())**2).
you can use:
def calc_var(ls):
tmp_mean = calc_mean(ls)
means = []
for i in ls:
means.append((i - tmp_mean)**2)
var = calc_mean(means)
return round(var, 2)
print(calc_var(my_ls))
print(np.var(my_ls))
output:
21.29
21.29

Related

fft normalization vs the norm-option

i have to often normalize the result of circular-convolution, so I 'borrowed'-and-modfied the following routine :
def fft_normalize(x):# Normalize a vector x in complex domain.
c = np.fft.rfft(x)
# Look at real and image as if they were real
ri = np.vstack([c.real, c.imag])
# Normalize magnitude of each complex/real pair
norm = np.linalg.norm(ri, axis=0)
if np.any(norm==0): norm[norm == 0] = np.float64(1e-308) #!fixme
ri= np.divide(ri,norm)
c_proj = ri[0,:] + 1j * ri[1,:]
rv = np.fft.irfft(c_proj, n=x.shape[-1])
return rv
def fft_convolution(a, b):
return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b))
so that I do this :
fft_normalize(fft_convolution(a,b))
i see in the numpy docs there is a 'norm' option. Is this equivalent to what i'm doing ?
And if yes, which option should I use ?
def fft_convolution2(a, b):
return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), norm='ortho')
When I test it it behaves better when I do fft_normalize()
Second, i had to add this line, but it does not seem, right. any ideas ?
if np.any(norm==0): norm[norm == 0] = np.float64(1e-308) #!fixme
As a side note, if you know !! numpy docs mentions that they promote float32 to float64 and that scipy.fftpack does not !!
Would fftpack be faster ! because scipy says fftpack is obsolete and there is no info on the new scipy ?
#cris are you sugesting i do it this way :
def fft_normalize(x):# Normalize a vector x in complex domain.
c = np.fft.rfft(x)
ri = np.vstack([c.real, c.imag])
norm = np.abs(c)
if np.any(norm==0): norm[norm == 0] = MIN #!fixme
ri= np.divide(ri,norm)
c_proj = ri[0,:] + 1j * ri[1,:]
rv = np.fft.irfft(c_proj, n=x.shape[-1])
return rv

The norm argument to the FFT functions in NumPy determine whether the transform result is multiplied by 1, 1/N or 1/sqrt(N), with N the number of samples in the array. Normally, the inverse transform is normalized by dividing by N, and the forward transform is not. Specifying “ortho” here causes both transforms to be normalized by 1/sqrt(2). Specifying “forward” causes only the forward transform to be normalized by 1/N.
These normalizations are very different from the one you apply, where you normalize each element in the frequency domain.

Why does my sigmoid function return values not in the interval ]0,1[?

I am implementing logistic regression in Python with numpy. I have generated the following data set:
# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 1000
# class 1
# covariance matrix
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([1.,1])
# number of data points
m1 = 1000
# generate m gaussian distributed data points with
# mean and cov.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
X = np.concatenate((r0,r1))
Now I have implemented the sigmoid function with the aid of the following methods:
def logistic_function(x):
""" Applies the logistic function to x, element-wise. """
return 1.0 / (1 + np.exp(-x))
def logistic_hypothesis(theta):
return lambda x : logistic_function(np.dot(generateNewX(x), theta.T))
def generateNewX(x):
x = np.insert(x, 0, 1, axis=1)
return x
After applying logistic regression, I found out that the best thetas are:
best_thetas = [-0.9673200946417307, -1.955812236119612, -5.060885703369424]
However, when I apply the logistic function with these thetas, then the output is numbers that are not inside the interval [0,1]
Example:
data = logistic_hypothesis(np.asarray(best_thetas))(X)
print(data
This gives the following result:
[2.67871968e-11 3.19858822e-09 3.77845881e-09 ... 5.61325410e-03
2.19767618e-01 6.23288747e-01]
Can someone help me understand what has gone wrong with my implementation? I cannot understand why I am getting such big values. Isnt the sigmoid function supposed to only give results in the [0,1] interval?

It does, it's just in scientific notation.
'e' Exponent notation. Prints the number in scientific notation using
the letter ‘e’ to indicate the exponent.
>>> a = [2.67871968e-11, 3.19858822e-09, 3.77845881e-09, 5.61325410e-03]
>>> [0 <= i <= 1 for i in a]
[True, True, True, True]

Exponential Moving Average by time interval [duplicate]

I have a range of dates and a measurement on each of those dates. I'd like to calculate an exponential moving average for each of the dates. Does anybody know how to do this?
I'm new to python. It doesn't appear that averages are built into the standard python library, which strikes me as a little odd. Maybe I'm not looking in the right place.
So, given the following code, how could I calculate the moving weighted average of IQ points for calendar dates?
from datetime import date
days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
IQ = [110, 105, 90]
(there's probably a better way to structure the data, any advice would be appreciated)

EDIT:
It seems that mov_average_expw() function from scikits.timeseries.lib.moving_funcs submodule from SciKits (add-on toolkits that complement SciPy) better suits the wording of your question.
To calculate an exponential smoothing of your data with a smoothing factor alpha (it is (1 - alpha) in Wikipedia's terms):
>>> alpha = 0.5
>>> assert 0 < alpha <= 1.0
>>> av = sum(alpha**n.days * iq
... for n, iq in map(lambda (day, iq), today=max(days): (today-day, iq),
... sorted(zip(days, IQ), key=lambda p: p[0], reverse=True)))
95.0
The above is not pretty, so let's refactor it a bit:
from collections import namedtuple
from operator import itemgetter
def smooth(iq_data, alpha=1, today=None):
"""Perform exponential smoothing with factor `alpha`.
Time period is a day.
Each time period the value of `iq` drops `alpha` times.
The most recent data is the most valuable one.
"""
assert 0 < alpha <= 1
if alpha == 1: # no smoothing
return sum(map(itemgetter(1), iq_data))
if today is None:
today = max(map(itemgetter(0), iq_data))
return sum(alpha**((today - date).days) * iq for date, iq in iq_data)
IQData = namedtuple("IQData", "date iq")
if __name__ == "__main__":
from datetime import date
days = [date(2008,1,1), date(2008,1,2), date(2008,1,7)]
IQ = [110, 105, 90]
iqdata = list(map(IQData, days, IQ))
print("\n".join(map(str, iqdata)))
print(smooth(iqdata, alpha=0.5))
Example:
$ python26 smooth.py
IQData(date=datetime.date(2008, 1, 1), iq=110)
IQData(date=datetime.date(2008, 1, 2), iq=105)
IQData(date=datetime.date(2008, 1, 7), iq=90)
95.0

I'm always calculating EMAs with Pandas:
Here is an example how to do it:
import pandas as pd
import numpy as np
def ema(values, period):
values = np.array(values)
return pd.ewma(values, span=period)[-1]
values = [9, 5, 10, 16, 5]
period = 5
print ema(values, period)
More infos about Pandas EWMA:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.ewma.html

I did a bit of googling and I found the following sample code (http://osdir.com/ml/python.matplotlib.general/2005-04/msg00044.html):
def ema(s, n):
"""
returns an n period exponential moving average for
the time series s
s is a list ordered from oldest (index 0) to most
recent (index -1)
n is an integer
returns a numeric array of the exponential
moving average
"""
s = array(s)
ema = []
j = 1
#get n sma first and calculate the next n period ema
sma = sum(s[:n]) / n
multiplier = 2 / float(1 + n)
ema.append(sma)
#EMA(current) = ( (Price(current) - EMA(prev) ) x Multiplier) + EMA(prev)
ema.append(( (s[n] - sma) * multiplier) + sma)
#now calculate the rest of the values
for i in s[n+1:]:
tmp = ( (i - ema[j]) * multiplier) + ema[j]
j = j + 1
ema.append(tmp)
return ema

You can also use the SciPy filter method because the EMA is an IIR filter. This will have the benefit of being approximately 64 times faster as measured on my system using timeit on large data sets when compared to the enumerate() approach.
import numpy as np
from scipy.signal import lfilter
x = np.random.normal(size=1234)
alpha = .1 # smoothing coefficient
zi = [x[0]] # seed the filter state with first value
# filter can process blocks of continuous data if <zi> is maintained
y, zi = lfilter([1.-alpha], [1., -alpha], x, zi=zi)

I don't know Python, but for the averaging part, do you mean an exponentially decaying low-pass filter of the form
y_new = y_old + (input - y_old)*alpha
where alpha = dt/tau, dt = the timestep of the filter, tau = the time constant of the filter? (the variable-timestep form of this is as follows, just clip dt/tau to not be more than 1.0)
y_new = y_old + (input - y_old)*dt/tau
If you want to filter something like a date, make sure you convert to a floating-point quantity like # of seconds since Jan 1 1970.

My python is a little bit rusty (anyone can feel free to edit this code to make corrections, if I've messed up the syntax somehow), but here goes....
def movingAverageExponential(values, alpha, epsilon = 0):
if not 0 < alpha < 1:
raise ValueError("out of range, alpha='%s'" % alpha)
if not 0 <= epsilon < alpha:
raise ValueError("out of range, epsilon='%s'" % epsilon)
result = [None] * len(values)
for i in range(len(result)):
currentWeight = 1.0
numerator = 0
denominator = 0
for value in values[i::-1]:
numerator += value * currentWeight
denominator += currentWeight
currentWeight *= alpha
if currentWeight < epsilon:
break
result[i] = numerator / denominator
return result
This function moves backward, from the end of the list to the beginning, calculating the exponential moving average for each value by working backward until the weight coefficient for an element is less than the given epsilon.
At the end of the function, it reverses the values before returning the list (so that they're in the correct order for the caller).
(SIDE NOTE: if I was using a language other than python, I'd create a full-size empty array first and then fill it backwards-order, so that I wouldn't have to reverse it at the end. But I don't think you can declare a big empty array in python. And in python lists, appending is much less expensive than prepending, which is why I built the list in reverse order. Please correct me if I'm wrong.)
The 'alpha' argument is the decay factor on each iteration. For example, if you used an alpha of 0.5, then today's moving average value would be composed of the following weighted values:
today: 1.0
yesterday: 0.5
2 days ago: 0.25
3 days ago: 0.125
...etc...
Of course, if you've got a huge array of values, the values from ten or fifteen days ago won't contribute very much to today's weighted average. The 'epsilon' argument lets you set a cutoff point, below which you will cease to care about old values (since their contribution to today's value will be insignificant).
You'd invoke the function something like this:
result = movingAverageExponential(values, 0.75, 0.0001)

In matplotlib.org examples (http://matplotlib.org/examples/pylab_examples/finance_work2.html) is provided one good example of Exponential Moving Average (EMA) function using numpy:
def moving_average(x, n, type):
x = np.asarray(x)
if type=='simple':
weights = np.ones(n)
else:
weights = np.exp(np.linspace(-1., 0., n))
weights /= weights.sum()
a = np.convolve(x, weights, mode='full')[:len(x)]
a[:n] = a[n]
return a

I found the above code snippet by #earino pretty useful - but I needed something that could continuously smooth a stream of values - so I refactored it to this:
def exponential_moving_average(period=1000):
""" Exponential moving average. Smooths the values in v over ther period. Send in values - at first it'll return a simple average, but as soon as it's gahtered 'period' values, it'll start to use the Exponential Moving Averge to smooth the values.
period: int - how many values to smooth over (default=100). """
multiplier = 2 / float(1 + period)
cum_temp = yield None # We are being primed
# Start by just returning the simple average until we have enough data.
for i in xrange(1, period + 1):
cum_temp += yield cum_temp / float(i)
# Grab the timple avergae
ema = cum_temp / period
# and start calculating the exponentially smoothed average
while True:
ema = (((yield ema) - ema) * multiplier) + ema
and I use it like this:
def temp_monitor(pin):
""" Read from the temperature monitor - and smooth the value out. The sensor is noisy, so we use exponential smoothing. """
ema = exponential_moving_average()
next(ema) # Prime the generator
while True:
yield ema.send(val_to_temp(pin.read()))
(where pin.read() produces the next value I'd like to consume).

May be shortest:
#Specify decay in terms of span
#data_series should be a DataFrame
ema=data_series.ewm(span=5, adjust=False).mean()

import pandas_ta as ta
data["EMA3"] = ta.ema(data["close"], length=3)
pandas_ta is a Technical Analysis Library: https://github.com/twopirllc/pandas-ta. Above code calculates the Exponential Moving Average (EMA) for a series. You can specify the lag value using 'length'. Spesifically, above code calculates '3-day EMA'.

Here is a simple sample I worked up based on http://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_averages
Note that unlike in their spreadsheet, I don't calculate the SMA, and I don't wait to generate the EMA after 10 samples. This means my values differ slightly, but if you chart it, it follows exactly after 10 samples. During the first 10 samples, the EMA I calculate is appropriately smoothed.
def emaWeight(numSamples):
return 2 / float(numSamples + 1)
def ema(close, prevEma, numSamples):
return ((close-prevEma) * emaWeight(numSamples) ) + prevEma
samples = [
22.27, 22.19, 22.08, 22.17, 22.18, 22.13, 22.23, 22.43, 22.24, 22.29,
22.15, 22.39, 22.38, 22.61, 23.36, 24.05, 23.75, 23.83, 23.95, 23.63,
23.82, 23.87, 23.65, 23.19, 23.10, 23.33, 22.68, 23.10, 22.40, 22.17,
]
emaCap = 10
e=samples[0]
for s in range(len(samples)):
numSamples = emaCap if s > emaCap else s
e = ema(samples[s], e, numSamples)
print e

I'm a little late to the party here, but none of the solutions given were what I was looking for. Nice little challenge using recursion and the exact formula given in investopedia.
No numpy or pandas required.
prices = [{'i': 1, 'close': 24.5}, {'i': 2, 'close': 24.6}, {'i': 3, 'close': 24.8}, {'i': 4, 'close': 24.9},
{'i': 5, 'close': 25.6}, {'i': 6, 'close': 25.0}, {'i': 7, 'close': 24.7}]
def rec_calculate_ema(n):
k = 2 / (n + 1)
price = prices[n]['close']
if n == 1:
return price
res = (price * k) + (rec_calculate_ema(n - 1) * (1 - k))
return res
print(rec_calculate_ema(3))

A fast way (copy-pasted from here) is the following:
def ExpMovingAverage(values, window):
""" Numpy implementation of EMA
"""
weights = np.exp(np.linspace(-1., 0., window))
weights /= weights.sum()
a = np.convolve(values, weights, mode='full')[:len(values)]
a[:window] = a[window]
return a

I am using a list and a rate of decay as inputs. I hope this little function with just two lines may help you here, considering deep recursion is not stable in python.
def expma(aseries, ratio):
return sum([ratio*aseries[-x-1]*((1-ratio)**x) for x in range(len(aseries))])

more simply, using pandas
def EMA(tw):
for x in tw:
data["EMA{}".format(x)] = data['close'].ewm(span=x, adjust=False).mean()
EMA([10,50,100])

Papahaba's answer was almost what I was looking for (thanks!) but I needed to match initial conditions. Using an IIR filter with scipy.signal.lfilter is certainly the most efficient. Here's my redux:
Given a NumPy vector, x
import numpy as np
from scipy import signal
period = 12
b = np.array((1,), 'd')
a = np.array((period, 1-period), 'd')
zi = signal.lfilter_zi(b, a)
y, zi = signal.lfilter(b, a, x, zi=zi*x[0:1])
Get the N-point EMA (here, 12) returned in the vector y

How numpy.cov() function is implemented?

I have my own implementation of the covariance function based on the equation:
'''
Calculate the covariance coefficient between two variables.
'''
import numpy as np
X = np.array([171, 184, 210, 198, 166, 167])
Y = np.array([78, 77, 98, 110, 80, 69])
# Expected value function.
def E(X, P):
expectedValue = 0
for i in np.arange(0, np.size(X)):
expectedValue += X[i] * (P[i] / np.size(X))
return expectedValue
# Covariance coefficient function.
def covariance(X, Y):
'''
Calculate the product of the multiplication for each pair of variables
values.
'''
XY = X * Y
# Calculate the expected values for each variable and for the XY.
EX = E(X, np.ones(np.size(X)))
EY = E(Y, np.ones(np.size(Y)))
EXY = E(XY, np.ones(np.size(XY)))
# Calculate the covariance coefficient.
return EXY - (EX * EY)
# Display matrix of the covariance coefficient values.
covMatrix = np.array([[covariance(X, X), covariance(X, Y)],
[covariance(Y, X), covariance(Y, Y)]])
print("My function:", covMatrix)
# Display standard numpy.cov() covariance coefficient matrix.
print("Numpy.cov() function:", np.cov([X, Y]))
But the problem is, that I'm getting different values from my function and from numpy.cov(), ie:
My function: [[ 273.88888889 190.61111111]
[ 190.61111111 197.88888889]]
Numpy.cov() function: [[ 328.66666667 228.73333333]
[ 228.73333333 237.46666667]]
Why is that? How is numpy.cov() function implemented? If the function numpy.cov() is well-implemented, what am I doing wrong? I'll just say, that results from my function covariance() are consistent with the results from paper examples in the internet for calculating the covariance coefficient, eg http://www.naukowiec.org/wzory/statystyka/kowariancja_11.html.

The numpy function has a different normalization to yours as a default setting. Try instead
>>> np.cov([X, Y], ddof=0)
array([[ 273.88888889, 190.61111111],
[ 190.61111111, 197.88888889]])
References:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance

Computing a binomial probability for huge numbers

I want to compute binomial probabilities on python. I tried to apply the formula:
probability = scipy.misc.comb(n,k)*(p**k)*((1-p)**(n-k))
Some of the probabilities I get are infinite. I checked some values for which p=inf. For one of them, n=450,000 and k=17. This value must be greater than 1e302 which is the maximum value handled by floats.
I then tried to use sum(np.random.binomial(n,p,numberOfTrials)==valueOfInterest)/numberOfTrials
This draws numberOfTrials samples and computes the average number of times the value valueOfInterest is drawn.
This doesn't raise any infinite value. However, is this a valid way to proceed? And why this way wouldn't raise any infinite value whereas computing the probabilities does?

Because you're using scipy I thought I would mention that scipy already has statistical distributions implemented. Also note that when n is this large the binomial distribution is well approximated by the normal distribution (or Poisson if p is very small).
n = 450000
p = .5
k = np.array([17., 225000, 226000])
b = scipy.stats.binom(n, p)
print b.pmf(k)
# array([ 0.00000000e+00, 1.18941527e-03, 1.39679862e-05])
n = scipy.stats.norm(n*p, np.sqrt(n*p*(1-p)))
print n.pdf(k)
# array([ 0.00000000e+00, 1.18941608e-03, 1.39680605e-05])
print b.pmf(k) - n.pdf(k)
# array([ 0.00000000e+00, -8.10313274e-10, -7.43085142e-11])

Work in the log domain to compute combination and exponentiation functions and then raise them to exponent.
Something like this:
combination_num = range(k+1, n+1)
combination_den = range(1, n-k+1)
combination_log = np.log(combination_num).sum() - np.log(combination_den).sum()
p_k_log = k * np.log(p)
neg_p_K_log = (n - k) * np.log(1 - p)
p_log = combination_log + p_k_log + neg_p_K_log
probability = np.exp(p_log)
Gets rid of numeric underflow/overflow because of large numbers. On your example with n=450000 and p = 0.5, k = 17, it returns p_log = -311728.4, i. e., the log of final probability is pretty small and hence underflow occurs while taking np.exp. However, you can still work with log probability.

I thing you should do all you computation using logarithms:
from scipy import special, exp, log
lgam = special.gammaln
def binomial(n, k, p):
return exp(lgam(n+1) - lgam(n-k+1) - lgam(k+1) + k*log(p) + (n-k)*log(1.-p))

To avoid multiplicity like zero by like infinity use step by step multiplication as this.
def Pbinom(N,p,k):
q=1-p
lt1=[q]*(N-k)
gt1=list(map(lambda x: p*(N-k+x)/x, range(1,k+1)))
Pb=1.0
while (len(lt1) + len(gt1)) > 0:
if Pb>1:
if len(lt1)>0:
Pb*=lt1.pop()
else:
if len(gt1)>0:
Pb*=gt1.pop()
else:
if len(gt1)>0:
Pb*=gt1.pop()
else:
if len(lt1)>0:
Pb*=lt1.pop()
return Pb

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Numpy not producing desired sample variance value - python

You are using the unbiased formula for the variance, i.e. you are dividing the summatory by N-1, while np.var seems to compute the variance normalizing by the total number of elements in the vector. Check for example here, section "Sample variance".

Related

fft normalization vs the norm-option

Why does my sigmoid function return values not in the interval ]0,1[?

Exponential Moving Average by time interval [duplicate]

How numpy.cov() function is implemented?

Computing a binomial probability for huge numbers

Categories

Resources