How do I compute a weighted moving average using pandas? - python

Using pandas I can compute a
simple moving average (SMA) using pandas.stats.moments.rolling_mean
exponential moving average (EMA) using pandas.stats.moments.ewma
But how do I compute a weighted moving average (WMA), as described on Wikipedia (http://en.wikipedia.org/wiki/Exponential_smoothing), using pandas?
Is there a pandas function to compute a WMA?

Using pandas you can calculate a weighted moving average (WMA) using .rolling() combined with .apply().
Here's an example with 3 weights and window=3:
import numpy as np
import pandas as pd

# toy data: 10 random integers between 1 and 5
data = {'colA': np.random.randint(1, 6, 10)}
df = pd.DataFrame(data)

weights = np.array([0.5, 0.25, 0.25])
sum_weights = np.sum(weights)

df['weighted_ma'] = (df['colA']
    .rolling(window=3, center=True)
    .apply(lambda x: np.sum(weights * x) / sum_weights, raw=False)
)
Please note that in .rolling() I have used the argument center=True.
You should check whether this fits your use case or whether you need center=False.
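As a quick illustration (my own sketch with made-up data, not part of the answer above): with center=True the window is centred on the current row, while with center=False (the default) the window ends at the current row.
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# centred window: row i is averaged with rows i-1 and i+1
centered = s.rolling(window=3, center=True).mean()

# trailing window (the default): row i is averaged with rows i-1 and i-2
trailing = s.rolling(window=3, center=False).mean()

print(pd.DataFrame({'centered': centered, 'trailing': trailing}))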

No, there is no implementation of that exact algorithm. I created a GitHub issue about it here:
https://github.com/pydata/pandas/issues/886
I'd be happy to take a pull request for this -- the implementation should be straightforward Cython coding and can be integrated into pandas.stats.moments.

If data is a Pandas DataFrame or Series and you want to compute the WMA over the rows, you can do it using
wma = data[::-1].cumsum().sum() * 2 / data.shape[0] / (data.shape[0] + 1)
If you want a rolling WMA of window length n, use
data.rolling(n).apply(lambda x: x[::-1].cumsum().sum() * 2 / n / (n + 1))
where n = x.shape[0] inside each window. Note that this solution might be a bit slower than the one by Sander van den Oord, but you don't have to worry about the weights.
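To see why the reversed cumsum works (a small hand check of my own, not part of the original answer): reversing the window and taking the cumulative sum counts the newest value n times, the next one n-1 times, and so on, which is exactly the linearly weighted sum with weights 1..n; dividing by n(n+1)/2 then normalizes the weights.
import numpy as np
import pandas as pd

x = pd.Series([3.0, 5.0, 8.0, 2.0])   # oldest ... newest
n = len(x)

# explicit linear weights 1..n, newest value weighted most
weights = np.arange(1, n + 1)
explicit = np.sum(weights * x.values) / weights.sum()

# the cumsum trick from the answer above
trick = x[::-1].cumsum().sum() * 2 / n / (n + 1)

print(explicit, trick)   # both print 4.5 for this toy series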

Construct a kernel with the weights, and apply it to your series using numpy.convolve.
import pandas as pd
import numpy as np

def wma(arr, period):
    # weights period, period-1, ..., 1, normalized to sum to 1 and
    # padded with zeros so that the window is one-sided
    kernel = np.arange(period, 0, -1)
    kernel = np.concatenate([np.zeros(period - 1), kernel / kernel.sum()])
    return np.convolve(arr, kernel, 'same')

df = pd.DataFrame({'value': np.arange(11)})
df['wma'] = wma(df['value'], 4)
Here I am interpreting WMA according to this page: https://en.wikipedia.org/wiki/Moving_average
For this type of WMA, the weights should be a linear range of n values, adding up to 1.0.
Note that I pad the front of the kernel with zeros. This is because we want a 'one-sided' window function, so that 'future' values in the time series do not affect the moving average.
numpy.convolve is fast, unlike apply()!
You can also use numpy.correlate if you reverse the kernel.
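For completeness, a small sketch (my addition) of the numpy.correlate variant mentioned above: correlation slides the kernel without flipping it, so reversing the kernel should reproduce the convolution result.
import numpy as np

arr = np.arange(11, dtype=float)
period = 4

# same zero-padded, normalized kernel as in wma() above
kernel = np.arange(period, 0, -1, dtype=float)
kernel = np.concatenate([np.zeros(period - 1), kernel / kernel.sum()])

via_convolve = np.convolve(arr, kernel, 'same')
via_correlate = np.correlate(arr, kernel[::-1], 'same')

print(np.allclose(via_convolve, via_correlate))   # expected: True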

Related

Apply custom weighting in Pandas rolling function [duplicate]


Efficient expanding OLS in pandas

I would like to explore the solutions of performing expanding OLS in pandas (or other libraries that accept DataFrame/Series friendly) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b

df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args=('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this approach does not seem to work, because expanding().apply() appears too inflexible: I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from statsmodels:
import numpy as np
import statsmodels.api as sm

# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
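If the goal is an expanding beta column as in the question, the filtered coefficients can be wrapped back into a DataFrame; a minimal sketch continuing the snippet above (the column names here are my own labels):
import pandas as pd

# row i holds the coefficients estimated from observations 0..i
coefs = pd.DataFrame(res.recursive_coefficients.filtered.T,
                     columns=['const', 'beta'])

# the final recursive estimate should agree with a full-sample OLS fit
print(coefs['beta'].iloc[-1])
print(sm.OLS(endog, exog).fit().params[1])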

Are these functions equivalent?

I am building a neural network that makes use of t-distribution noise, using np.random.standard_t from NumPy and tf.distributions.StudentT from TensorFlow. I am calling them like this:
import numpy as np
import tensorflow as tf

a = np.random.standard_t(df=3, size=10000)  # NumPy's function

t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
    # The sampling method comes from the fact that if:
    #   X ~ Normal(0, 1)
    #   Z ~ Chi2(df)
    #   Y = X / sqrt(Z / df)
    # then:
    #   Y ~ StudentT(df).
    seed = seed_stream.SeedStream(seed, "student_t")
    shape = tf.concat([[n], self.batch_shape_tensor()], 0)
    normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
    df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
    gamma_sample = tf.random.gamma([n],
                                   0.5 * df,
                                   beta=0.5,
                                   dtype=self.dtype,
                                   seed=seed())
    samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
    return samples * self.scale + self.loc  # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-square sample with parameter df divided by df. The chi-square sample is taken as a gamma sample with parameter 0.5 * df and rate 0.5, which is equivalent (chi-square is a special case of gamma). The scale value, like the loc, only comes into play in the last line, as a way to "relocate" the distribution sample at some point and scale. When scale is one and loc is zero, they do nothing.
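A quick numerical sanity check of that gamma/chi-square equivalence (my own sketch, not part of the answer): a gamma sample with shape df/2 and rate 1/2 should follow a chi-square distribution with df degrees of freedom.
import numpy as np
from scipy import stats

df = 3.0
rng = np.random.default_rng(0)

# Gamma(shape=df/2, rate=1/2), i.e. scale=2 in NumPy's parameterization
gamma_sample = rng.gamma(shape=0.5 * df, scale=2.0, size=100_000)

# one-sample KS test against the chi-square CDF; a large p-value is expected
print(stats.kstest(gamma_sample, 'chi2', args=(df,)))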
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
  double num, denom;

  num = legacy_gauss(aug_state);
  denom = legacy_standard_gamma(aug_state, df / 2);
  return sqrt(df / 2) * num / sqrt(denom);
}
So it is essentially the same thing, slightly rephrased: here we also have a gamma sample with shape df / 2, but it is a standard gamma (rate one), and the missing factor of 0.5 reappears in the numerator as the df / 2 inside the sqrt. It is just moving the numbers around. There is no scale or loc here, though.
In truth, the only difference is that TensorFlow adds the loc and scale parameters, which shift and rescale the samples; with loc=0.0 and scale=1.0 that transformation is a no-op, so the two should describe the same distribution. A simple empirical check that they are the same for loc=0.0 and scale=1.0 is to plot histograms for both samples and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)

with tf.Graph().as_default(), tf.Session() as sess:
    tf.random.set_random_seed(0)
    t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
    t_tf = sess.run(t_dist.sample(10000))

plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output: (a histogram in which the NumPy and TensorFlow samples overlap closely)
That looks pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you were not still convinced, there are some statistical tools for testing whether a sample comes from a certain distribution or two samples come from the same distribution.
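For example, a two-sample Kolmogorov-Smirnov test from scipy gives a quantitative version of the visual comparison (a sketch reusing t_np and t_tf from the snippet above):
from scipy import stats

# a large p-value means there is no evidence the samples come from different distributions
stat, p_value = stats.ks_2samp(t_np, t_tf)
print(stat, p_value)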

statistics for histogram of periodic data

For a series of angle values in the (-pi, pi) range, I make a histogram. Is there an effective way to calculate the mean and the modal (most probable) value? Consider the following example:
import numpy as N, cmath

deg = N.pi / 180.
data = N.array([-175., 170, 175, 179, -179]) * deg

i = N.sum(N.exp(1j * data))
ave = cmath.phase(i)

i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print(ave / deg, stdev / deg)
Now, let's have a histogram:
counts, bins = N.histogram(data, N.linspace(-N.pi, N.pi, 360))
Is it possible to calculate the mean and mode having only counts and bins? For non-periodic data, calculating the mean is straightforward:
ave = sum(counts * bins[:-1])
Calculating the modal value requires more effort, and I'm actually not sure my code below is correct: first I identify the bins which occur most frequently, and then I take their arithmetic mean:
cmax = bins[N.argmax(counts)]
mode = N.mean(N.take(bins, N.nonzero(counts == cmax)[0]))
I have no idea how to calculate the standard deviation from such data, though. One obvious solution to all my problems (at least those described above) is to convert the histogram data back into a data series and then use that in the calculations. This is not elegant, however, and it is inefficient.
Any hints will be very appreciated.
This is the partial solution I wrote.
import numpy as N, cmath
import scipy.stats as ST

d = [-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2]
deg = N.pi / 180.
data = N.array(d) * deg

i = N.sum(N.exp(1j * data))
ave = cmath.phase(i)            # correct and exact mean for periodic data
wrong_ave = N.mean(d)

i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
wrong_stdev = N.std(d)

bins = N.linspace(-N.pi, N.pi, 360)
counts, bins = N.histogram(data, bins, density=False)

# consider it weighted vector addition
nz = N.nonzero(counts)[0]
weight = counts[nz]
i = N.sum(weight * N.exp(1j * bins[nz]) / len(nz))
pave = cmath.phase(i)           # correct, approximated mean from the histogram

i /= sum(weight) / float(len(nz))
pstdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))

print('mean:  %12.3f %12.3f %12.3f' % (ave / deg, wrong_ave, pave / deg))
print('stdev: %12.3f %12.3f %12.3f' % (stdev / deg, wrong_stdev, pstdev / deg))
print('scipy: %12.3f (mean) %12.3f (stdev)' % (ST.circmean(data) / deg,
                                               ST.circstd(data) / deg))
When run, it gives the following results:
mean: 175.840 85.843 175.360
stdev: 0.472 151.785 0.430
scipy: 175.840 (mean) 3.673 (stdev)
A few comments now: the first column gives the mean/stdev calculated from the raw periodic data. As can be seen, the mean agrees well with scipy.stats.circmean (thanks @JoeKington for pointing it out). Unfortunately the stdev differs; I will look at it later. The second column gives completely wrong results (the non-periodic mean/std from numpy obviously does not work here). The third column gives what I wanted to obtain from the histogram data (@JoeKington: my raw data won't fit in my computer's memory; @dmytro: thanks for your input - of course the bin size will influence the result, but in my application I don't have much choice, i.e. I have to reduce the data somehow). As can be seen, the mean (third column) is calculated properly; the stdev needs further attention :)
Have a look at scipy.stats.circmean and scipy.stats.circstd.
Or do you only have the histogram counts, and not the "raw" data? If so, you could fit a Von Mises distribution to your histogram counts and approximate the mean and stddev in that way.
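For the raw-data case, a short sketch of those scipy functions applied to the angles from the question (note the high/low arguments declaring the (-pi, pi) range):
import numpy as np
from scipy import stats

deg = np.pi / 180.0
angles = np.array([-175.0, 170.0, 175.0, 179.0, -179.0]) * deg

# circular mean and standard deviation on the (-pi, pi) range
mean = stats.circmean(angles, high=np.pi, low=-np.pi)
std = stats.circstd(angles, high=np.pi, low=-np.pi)

print(mean / deg, std / deg)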
Here's how to get an approximation.
Since Var(x) = <x^2> - <x>^2, we have:
meanX = N.sum(counts * bins[:-1]) / N.sum(counts)
meanX2 = N.sum(counts * bins[:-1]**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)

Computing cross-correlation function?

In R, I am using ccf or acf to compute the pair-wise cross-correlation function so that I can find out which shift gives me the maximum value. From the looks of it, R gives me a normalized sequence of values. Is there something similar in Python's scipy or am I supposed to do it using the fft module? Currently, I am doing it as follows:
import numpy
from numpy.fft import rfft, irfft

xcorr = lambda x, y: irfft(rfft(x) * rfft(y[::-1]))
x = numpy.array([0, 0, 1, 1])
y = numpy.array([1, 1, 0, 0])
print(xcorr(x, y))
To cross-correlate 1d arrays use numpy.correlate.
For 2d arrays, use scipy.signal.correlate2d.
There is also scipy.stsci.convolve.correlate2d.
There is also matplotlib.pyplot.xcorr which is based on numpy.correlate.
See this post on the SciPy mailing list for some links to different implementations.
Edit: @user333700 added a link to the SciPy ticket for this issue in a comment.
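For the 1-D case, a small sketch (my addition) of using numpy.correlate in 'full' mode and reading off the lag at which the correlation peaks, which is what ccf in R is typically used for:
import numpy as np

x = np.array([0, 0, 1, 1], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

corr = np.correlate(x, y, mode='full')       # length len(x) + len(y) - 1
lags = np.arange(-(len(y) - 1), len(x))      # lag associated with each output element

best_lag = lags[np.argmax(corr)]
print(corr, best_lag)                        # peak at lag 2 for these toy series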
If you are looking for a rapid, normalized cross-correlation in either one or two dimensions, I would recommend the OpenCV library (see http://opencv.willowgarage.com/wiki/ or http://opencv.org/). The cross-correlation code maintained by this group is the fastest you will find, and it will be normalized (results between -1 and 1).
While this is a C++ library the code is maintained with CMake and has python bindings so that access to the cross correlation functions is convenient. OpenCV also plays nicely with numpy. If I wanted to compute a 2-D cross-correlation starting from numpy arrays I could do it as follows.
import numpy
import cv

# Create a random template and place it in a larger image
templateNp = numpy.random.random((100, 100))
image = numpy.random.random((400, 400))
image[:100, :100] = templateNp

# Create a numpy array for storing the result
resultNp = numpy.zeros((301, 301))

# Convert from numpy format to openCV format
templateCv = cv.fromarray(numpy.float32(templateNp))
imageCv = cv.fromarray(numpy.float32(image))
resultCv = cv.fromarray(numpy.float32(resultNp))

# Perform the normalized cross-correlation (image first, then template)
cv.MatchTemplate(imageCv, templateCv, resultCv, cv.CV_TM_CCORR_NORMED)

# Convert the result back to a numpy array
resultNp = numpy.asarray(resultCv)
For just a 1-D cross-correlation, create a 2-D array with shape equal to (N, 1). Though there is some extra code involved to convert to an openCV format, the speed-up over scipy is quite impressive.
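The old cv module used above has long been replaced by the cv2 bindings; a rough modern equivalent of the same normalized cross-correlation (a sketch, assuming the opencv-python package is installed) would be:
import numpy as np
import cv2

# place a random template inside a larger random image
template = np.random.random((100, 100)).astype(np.float32)
image = np.random.random((400, 400)).astype(np.float32)
image[:100, :100] = template

# normalized cross-correlation; the result has shape (400-100+1, 400-100+1)
result = cv2.matchTemplate(image, template, cv2.TM_CCORR_NORMED)

# (minVal, maxVal, minLoc, maxLoc); maxLoc should be near (0, 0) here
print(cv2.minMaxLoc(result))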
I just finished writing my own optimised implementation of normalized cross-correlation for N-dimensional arrays. You can get it from here.
It will calculate the cross-correlation either directly, using scipy.ndimage.correlate, or in the frequency domain, using scipy.fftpack.fftn/ifftn, depending on which will be quickest.
For 1-D arrays, numpy.correlate is faster than scipy.signal.correlate; across different sizes I see a consistent 5x performance gain with numpy.correlate. When the two arrays are of similar size (the bright line along the diagonal in the benchmark plot below), the performance difference is even more outstanding (50x+).
# a simple benchmark
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from datetime import datetime

res = []
for x in range(1, 1000):
    list_x = []
    for y in range(1, 1000):
        # generate series of different sizes to compare
        l1 = np.random.choice(range(1, 100), size=x)
        l2 = np.random.choice(range(1, 100), size=y)

        time_start = datetime.now()
        np.correlate(a=l1, v=l2)
        t_np = datetime.now() - time_start

        time_start = datetime.now()
        scipy.signal.correlate(in1=l1, in2=l2)
        t_scipy = datetime.now() - time_start

        list_x.append(t_scipy / t_np)
    res.append(list_x)

plt.imshow(np.matrix(res))
By default, scipy.signal.correlate calculates a few extra numbers by padding, and that might explain the performance difference.
>>> l1 = [1, 2, 3, 2, 1, 2, 3]
>>> l2 = [1, 2, 3]
>>> print(numpy.correlate(a=l1, v=l2))
[14 14 10 10 14]
>>> print(scipy.signal.correlate(in1=l1, in2=l2))
[ 3  8 14 14 10 10 14  8  3]   # the first 3 is [0,0,1] dot [1,2,3]
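Passing mode='valid' to scipy.signal.correlate drops that padding and should reproduce NumPy's default output (a quick sketch):
import numpy as np
from scipy import signal

l1 = [1, 2, 3, 2, 1, 2, 3]
l2 = [1, 2, 3]

print(np.correlate(a=l1, v=l2))                        # [14 14 10 10 14]
print(signal.correlate(in1=l1, in2=l2, mode='valid'))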
