I have the following pandas object. It is an OHLC time-series data-frame.
I would like to calculate the EMA30 of the close column. For that, I used 2 different approaches, just as a test.
# Approach A, as explained by sentdex in this video:
# https://youtu.be/t_JXXT7VgeQ?list=PLbLcS9xv6IuGi8uyxMP3-BN-lTRQNpqEG&t=245
def ExpMovingAverage(values, window):
    weights = np.exp(np.linspace(-1., 0., window))
    weights /= weights.sum()
    a = np.convolve(values, weights, mode='full')[:len(values)]
    a[:window] = a[window]
    return a
# Approach B
pd.Series.ewm(local_df['close'].copy(), span=30).mean()
Once calculated, I add them to their respective new columns.
# EMA30 (using approach B, pandas ewm)
local_df['ema30_a'] = pd.Series.ewm(local_df['close'].copy(), span=30).mean()
# EMA30 (using approach A, ExpMovingAverage)
x = local_df['close'].values
calculate_ema30_b = ExpMovingAverage(x, 30)
local_df['ema30_b'] = calculate_ema30_b
The resulting data frame is below:
However, once plotted, the pandas result (blue) deviates from the NumPy-based result (red). Which of the two calculation methods is the correct one?
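For reference, here is a minimal sketch of what the pandas call computes, using synthetic prices (the `close` values below are made up, not the OHLC data from the question). With `span=30`, pandas uses the smoothing factor `alpha = 2 / (span + 1)`, and with `adjust=False` the result is the plain recursion `y[t] = (1 - alpha) * y[t-1] + alpha * x[t]`. The convolution approach instead applies a finite 30-point window with weights `exp(linspace(-1, 0, 30))`, which is a different weighting scheme, so the two curves need not coincide.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (illustrative only, not the data from the question)
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(size=200).cumsum())

span = 30
alpha = 2.0 / (span + 1)  # smoothing factor pandas derives from `span`

# Recursive EMA: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = np.empty(len(close))
manual[0] = close.iloc[0]
for t in range(1, len(close)):
    manual[t] = (1 - alpha) * manual[t - 1] + alpha * close.iloc[t]

# pandas equivalent of the recursion above
ema_recursive = close.ewm(span=span, adjust=False).mean()

# pandas default (adjust=True) reweights the early samples, so it differs
# from the recursion at the start and approaches it for later samples
ema_default = close.ewm(span=span).mean()
```

Comparing `manual` with `ema_recursive` is one way to check which definition a given implementation follows.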
I have a 3D data matrix of sea level data (time, y, x) and I found the power spectrum by taking the square of the FFT but there are low frequencies that are really dominant. I want to get rid of those low frequencies by applying a high pass filter... how would I go about doing that?
Example of data set and structure/code is below:
This is the data set and creating the arrays:
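import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as s  # assumption: `s.Dataset` below matches the netCDF4 API
from scipy import signal
Yearmin = 2018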
Yearmax = 2019
year_len = Yearmax - Yearmin + 1.0 # number of years
direcInput = "filepath"
a = s.Dataset(direcInput+"test.nc", mode='r')
#creating arrays
lat = a.variables["latitude"][:]
lon = a.variables["longitude"][:]
time1 = a.variables["time"][:] #DAYS SINCE JAN 1ST 1950
sla = a.variables["sla"][:,:,:] #t, y, x
time = Yearmin + (year_len * (time1 - np.min(time1)) / ( np.max(time1) - np.min(time1)))
#detrending and normalizing data
def standardize(y, detrend=True, normalize=True):
    if detrend:
        y = signal.detrend(y, axis=0)
    y = y - np.mean(y, axis=0)
    if normalize:
        y = y / np.std(y, axis=0)
    return y
sla_standard = standardize(sla)
print(sla_standard.shape)  # (710, 81, 320)
#fft
fft = np.fft.rfft(sla_standard, axis=0)
spec = np.square(abs(fft))
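frequencies = np.fft.rfftfreq(sla_standard.shape[0])  # 0 to Nyquist in cycles per sample; pass d=<sample spacing> for physical units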
#PLOTTING THE FREQUENCIES VS SPECTRUM FOR A FEW DIFFERENT SPATIAL LOCATIONS
plt.plot(frequencies, spec[:, 68,85])
plt.plot(frequencies, spec[:, 23,235])
plt.plot(frequencies, spec[:, 39,178])
plt.plot(frequencies, spec[:, 30,149])
plt.xlim(0,.05)
plt.show()
My goal is to make a high pass filter of the ORIGINAL time series (sla_standard) to remove the two really big peaks. Which type of filter should I use? Thank you!
Use matplotlib.axes.Axes.set_ylim to set the y-axis limits.
Axes.set_ylim(self, bottom=None, top=None, emit=True, auto=False, *, ymin=None, ymax=None)
So in your case, leave ymin=None and set ymax, for example ymax=60000, before you start plotting.
Thus: plt.ylim(ymin=None, ymax=60000).
Taking out data should not be done here because it is "falsifying results". What you actually want is to zoom in on the chart. Someone who reads the chart without you would misinterpret the data if not told in advance that values were removed. Peaks that go off the chart are okay because everybody understands what that means.
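A minimal sketch of the zooming approach, using a synthetic spectrum (the values and the 60000 limit are illustrative assumptions). It passes `top`, the positional-name equivalent of `ymax`, and leaves the data array untouched:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic spectrum with one dominant peak (illustrative values only)
frequencies = np.linspace(0.0, 0.05, 50)
spec = np.full(50, 1000.0)
spec[3] = 5.0e6  # a peak far above the rest

fig, ax = plt.subplots()
ax.plot(frequencies, spec)
ax.set_ylim(top=60000)  # zoom in; the peak simply runs off the chart
```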
Or:
Direct replacement of certain values in an array (arr):
arr[arr > ori] = dest
For example, in your case ori=60000 and dest=1.
All values larger (">") than 60k are replaced by 1.
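As a small illustration of the boolean-mask assignment described above (the array values are made up):

```python
import numpy as np

arr = np.array([120.0, 75000.0, 3400.0, 61000.0])  # illustrative values
ori, dest = 60000, 1

# The comparison builds a boolean mask; assignment overwrites only the True positions
arr[arr > ori] = dest
```

Note that this modifies `arr` in place, so keep a copy if the original values are still needed.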
The different filters: As you state, a filter acts on the frequencies of your signal. Different filter shapes exist, and some of them have complex expressions because they need to be implemented in real-time processing (causal). However, in your case you seem to be post-processing the data, so you can use the Fourier transform, which requires all the data (non-causal).
The filter to choose: Consequently, you can perform your filtering operation directly in the Fourier domain by applying a mask to your frequencies. If you want to remove frequencies, I recommend a binary mask made of 0s and 1s. Why? Because it is the simplest filter you can think of. It is scientifically sound to state that you completely removed some frequencies (say so and justify it). It is harder to defend keeping some frequencies while slightly attenuating others with an arbitrarily chosen attenuation factor.
Python implementation
signal_fft = np.fft.rfft(sla_standard, axis=0)
mask = np.ones(signal_fft.shape)  # must match the rfft output, not the time series
mask[freq_to_filter, ...] = 0.0  # define here the frequencies (bin indices) to filter
filtered_signal = np.fft.irfft(mask * signal_fft, n=sla_standard.shape[0], axis=0)
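To make the masking step concrete, here is a self-contained sketch on synthetic data with the same (time, y, x) layout. The array sizes, the unit sample spacing, and the `cutoff` value are assumptions chosen for illustration, not values from the question:

```python
import numpy as np

# Synthetic stand-in for sla_standard with shape (time, y, x)
n_time, dt = 512, 1.0                                # hypothetical length and spacing
t = np.arange(n_time) * dt
slow = np.sin(2 * np.pi * (4 / n_time) * t)          # low-frequency part to remove
fast = 0.5 * np.sin(2 * np.pi * (64 / n_time) * t)   # part to keep
data = (slow + fast)[:, None, None] * np.ones((1, 3, 4))

signal_fft = np.fft.rfft(data, axis=0)
frequencies = np.fft.rfftfreq(n_time, d=dt)          # one value per rfft bin

cutoff = 0.05                                        # hypothetical cutoff, cycles per sample
mask = (frequencies >= cutoff).astype(float)         # 1 keeps a frequency, 0 removes it

# Broadcast the 1D mask over the two spatial axes and transform back
filtered = np.fft.irfft(signal_fft * mask[:, None, None], n=n_time, axis=0)
```

Building the mask from `frequencies` rather than from bin indices keeps the cutoff expressed in physical units once the real sample spacing is supplied.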
I would like to explore the solutions of performing expanding OLS in pandas (or other libraries that accept DataFrame/Series friendly) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this does not seem to work: expanding().apply() seems very inflexible, and I am not sure how one could pass the two Series to the function as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
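For the single-regressor example in the question, the slope can also be sketched without a loop using pandas' expanding covariance and variance, since the OLS slope with an intercept equals cov(X, y) / var(X). This is only a sketch for one regressor; it does not generalize to several regressors the way RecursiveLS does:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})

# Slope of y on X with an intercept: beta = cov(X, y) / var(X),
# evaluated on an expanding window (rows 0..t for each t)
expanding_x = df['X'].expanding()
df['beta'] = expanding_x.cov(df['y']) / expanding_x.var()
```

The first row is NaN because a slope is undefined for a single observation; the remaining rows match the expected values 0.66666667 and 1.038462 from the question.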
I'm trying to cross-correlate two sets of data by taking the Fourier transform of both and multiplying the conjugate of the first FFT with the second FFT, before transforming back to time space. To test my code, I am comparing the output with the output of numpy.correlate. However, when I plot the results (restricted to a certain window), the two signals seem to go in opposite directions, i.e. they are mirrored about zero.
This is what my output looks like
My code:
import numpy as np
import matplotlib.pyplot as plt
phl_data = np.sin(np.arange(0, 10, 0.1))
mlac_data = np.cos(np.arange(0, 10, 0.1))
N = phl_data.size
zeroes = np.zeros(N-1)
phl_data = np.append(phl_data, zeroes)
mlac_data = np.append(mlac_data, zeroes)
# cross-correlate x = phl_data, y = mlac_data:
# take FFTs:
phl_fft = np.fft.fft(phl_data)
mlac_fft = np.fft.fft(mlac_data)
# fft of cross-correlation
Cw = np.conj(phl_fft)*mlac_fft
#Cw = np.fft.fftshift(Cw)
# transform back to time space:
Cxy = np.fft.fftshift(np.fft.ifft(Cw))
dt = 1  # lag step in samples, so that times has 2N-1 entries like Cxy
times = np.append(np.arange(-N+1, 0, dt),np.arange(0, N, dt))
plt.plot(times, Cxy)
plt.xlim(-250, 250)
# test against convolving:
c = np.correlate(phl_data, mlac_data, mode='same')
plt.plot(times, c)
plt.show()
(both data sets have been padded with N-1 zeroes)
The documentation to numpy.correlate explains this:
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
and:
Notes
The definition of correlation above is not unique and sometimes correlation may be defined differently. Another common definition is:
c'_{av}[k] = sum_n a[n] conj(v[n+k])
which is related to c_{av}[k] by c'_{av}[k] = c_{av}[-k].
Thus, there is not a unique definition, and the two common definitions lead to a reversed output.
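A short sketch of the reversal, using the same sine and cosine inputs as the question. The FFT product `conj(FFT(a)) * FFT(b)` corresponds to `sum_n conj(a[n]) * b[n+k]`, which is what `np.correlate(b, a)` computes, so it is the mirror image of `np.correlate(a, b)`. The variable names here are illustrative:

```python
import numpy as np

a = np.sin(np.arange(0, 10, 0.1))
b = np.cos(np.arange(0, 10, 0.1))
N = a.size
L = 2 * N - 1  # zero-padded length, avoids circular wrap-around

# conj(FFT(a)) * FFT(b) is the transform of sum_n conj(a[n]) * b[n+k]
Cw = np.conj(np.fft.fft(a, L)) * np.fft.fft(b, L)
fft_corr = np.fft.fftshift(np.fft.ifft(Cw)).real  # lags -(N-1) .. N-1

same_order = np.correlate(b, a, mode='full')      # sum_n b[n+k] * conj(a[n])
reversed_order = np.correlate(a, b, mode='full')  # arguments as in the question

print(np.allclose(fft_corr, same_order))            # True
print(np.allclose(fft_corr, reversed_order[::-1]))  # True
```

Swapping the arguments to `np.correlate`, or conjugating the second FFT instead of the first, brings the two results into agreement.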
I'm trying to pass two arrays for a fitting function, that takes both values.
Data file:
Column 1: Time
Column 2: Temperature
Column 3: Volume
Column 4: Pressure
0.000,0.946,4.668,0.981
0.050,0.946,4.668,0.981
0.100,0.946,4.669,0.981
0.150,0.952,4.588,0.996
0.200,1.025,4.008,1.117
0.250,1.210,3.093,1.361
0.300,1.445,2.299,1.652
0.350,1.650,1.803,1.887
0.400,1.785,1.524,2.038
0.450,1.867,1.340,2.145
0.500,1.943,1.138,2.280
0.550,2.019,0.958,2.411
0.600,2.105,0.750,2.587
0.650,2.217,0.542,2.791
0.700,2.332,0.366,2.978
0.750,2.420,0.242,3.116
0.800,2.444,0.219,3.114
0.850,2.414,0.219,3.080
Here is the code:
import csv
import numpy as np
from scipy.optimize import curve_fit
# Importing the Data
Time_Air1 = []
Vol_Air1 = []
Temp_Air1 = []
Pres_Air1 = []
with open('Good_Air_Run1.csv', 'r') as Air1:
    reader = csv.reader(Air1, delimiter=',')
    for row in reader:
        Time_Air1.append(row[0])
        Temp_Air1.append(row[1])
        Vol_Air1.append(row[2])
        Pres_Air1.append(row[3])
# Arrays are now passable floats
Time_Air1 = np.float32(np.array(Time_Air1))
Vol_Air1 = np.float32(np.array(Vol_Air1))
Temp_Air1 = np.float32(np.array(Temp_Air1))
Pres_Air1 = np.float32(np.array(Pres_Air1))
# fitting Model
def model_Gamma(V, gam, C):
    return -gam*np.log(V) + C
# Air Data Fitting Data
x1 = Vol_Air1
y1 = Pres_Air1
p0_R1 = (1.0 ,1.0)
optR1, pcovR1 = curve_fit(model_Gamma, x1, y1, p0_R1)
gam_R1, C_R1 = optR1
gam_R1p, C_R1p = pcovR1
y1Mair = model_Gamma(x1, gam_R1, C_R1)
I want to compute the gamma coefficient, but the fit is not giving me the value I'm expecting, ~1.2. It gives me ~0.72.
I believe ~1.2 is the correct value because my friend fitted the data in gnuplot and got that value.
If any further information is needed to try this, I'm happy to supply it.
Caveat: the result obtained here for gamma (about 1.7) still deviates from the postulated 1.2. This answer merely highlights the source of possible errors and illustrates what a good fit might look like.
You are trying to fit data where the dependent variable is related to the independent variable by a model that resembles that of adiabatic processes for ideal gases. Here, the pressure and the volume of a gas are related through
pressure * volume**gamma = constant
When you rearrange the left hand side and right hand side, you get:
pressure = constant * volume**-gamma
or in logarithmic form:
log(pressure) = log(constant) - gamma * log(volume)
You could fit the pressure data to the volume data using either of these 2 forms, but the fit might not be optimal because of measurement errors. One such error could be a fixed offset (e.g. some solid object is present in a beaker, so the volume scale on the beaker will not accurately represent the volume of any liquid you pour into it).
When you account for such errors, the fit often becomes markedly better.
Below, I've shown the fitting of your data using 3 models: the first is your model, the second takes into account a volume offset, and the third is a non-logarithmic variant of the 2nd model (it is basically the 2nd equation, with an optional volume offset). Note that in your code when you fit to what I call model1, you do not pass log(pressure) to the model, which would only make sense in case your pressure data is already tabulated on a logarithmic scale.
>>> import numpy as np
>>> from scipy.optimize import curve_fit
>>> data = np.genfromtxt('/tmp/datafile.txt',
... names=('time', 'temp', 'vol', 'press'), delimiter=',', usecols=range(4))
>>> def model1(volume, gamma, c):
...     return np.log(c) - gamma*np.log(volume)
...
>>> def model2(volume, gamma, c, volume_offset):
...     return np.log(c) - gamma*np.log(volume + volume_offset)
...
>>> def model3(volume, gamma, c, volume_offset):
...     return c * (volume + volume_offset)**(-gamma)
...
>>> vol, press = data['vol'], data['press']
>>> guess1, _ = curve_fit(model1, vol, np.log(press))
>>> guess2, _ = curve_fit(model2, vol, np.log(press))
>>> guess3, _ = curve_fit(model3, vol, press)
>>> guess1, guess2, guess3
(array([ 0.38488521, 2.04536926]),
array([ 1.7269364 , 44.57369479, 4.44625865]),
array([ 1.73186133, 45.20087949, 4.46364872]))
>>> rms = lambda x: np.sqrt(np.mean(x**2))
>>> rms( press - np.exp(model1(vol, *guess1)))
0.29464410744456304
>>> rms(press - model3(vol, *guess3))
0.012672077620951249
Notice how guess2 and guess3 are nearly identical.
The last two results are the rms errors. You'll notice that the error is smaller for the model that takes the offset into account (if you plot them, you'll see the fit is much better than when you use model1*).
As a final remark, have a look at numpy's excellent functions for importing data, like the one I've shown here (np.genfromtxt), as they can save you a lot of tedious typing, like I demonstrated in this code.
Footnote: * when you plot using model1, don't forget to put everything back to linear scale, like this:
plt.plot(vol, np.exp(model1(vol, *guess1)))
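As a sanity check on the logarithmic form, here is a small sketch on synthetic, noise-free data (the values of `gamma_true` and `constant` are hypothetical, not taken from the measurements above). Because `log(pressure)` is linear in `log(volume)`, a straight-line fit recovers gamma as the negated slope:

```python
import numpy as np

# Synthetic, noise-free data obeying pressure * volume**gamma = constant
gamma_true, constant = 1.4, 3.0  # hypothetical values for illustration
volume = np.linspace(0.5, 5.0, 20)
pressure = constant * volume**(-gamma_true)

# log(pressure) = log(constant) - gamma * log(volume) is a straight line,
# so a degree-1 polynomial fit recovers both parameters
slope, intercept = np.polyfit(np.log(volume), np.log(pressure), 1)
gamma_fit, constant_fit = -slope, np.exp(intercept)
```

If the same fit on real data gives an unexpected gamma, that points at the data or the model (for example an unaccounted volume offset) rather than at the fitting routine.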
I am trying to implement a very simple example of the law of large numbers using PyMC. The goal is to generate many sample averages of samples of different sizes. For example, in the code below, I'm repeatedly taking groups of 5 samples (samples_to_average = 5), calculating their mean, and then finding the 95% CI of the resulting trace.
The code below runs, but what I'd like to do is modify samples_to_average to be a list, so that I can calculate confidence intervals for a range of different sample sizes in a single pass.
import scipy.misc
import numpy as np
import pymc as mc
samples_to_average = 5
list_of_samples = mc.DiscreteUniform("response", lower=1, upper=10, size=1000)
@mc.deterministic
def sample_average(x=list_of_samples, n=samples_to_average):
    samples = int(n)
    selected = x[0:samples]
    total = np.sum(selected)
    sample_average = float(total) / samples
    return sample_average
def getConfidenceInterval():
    responseModel = mc.Model([samples_to_average, list_of_samples, sample_average])
    mapRes = mc.MAP(responseModel)
    mapRes.fit()
    mcmc = mc.MCMC(responseModel)
    mcmc.sample( 10000, 5000)
    upper = np.percentile(mcmc.trace('sample_average')[:],95)
    lower = np.percentile(mcmc.trace('sample_average')[:],5)
    return (lower, upper)
print(getConfidenceInterval())
Most examples I've seen using the deterministic decorator use global stochastic variables. However, to achieve my aim, I think what I need to do is create a stochastic variable (of the correct length) in getConfidenceInterval(), and pass this to sample_average (rather than supplying sample_average using globals / default parameter).
How can a variable created in getConfidenceInterval() be passed into sample_average(), or alternatively, what is another way that I can evaluate multiple models using different values of samples_to_average? I'd like to avoid globals if possible.
Before addressing your question, I would like to simplify the way sample_average is written so that it is more compact and easier to understand.
sample_average = mc.Lambda('sample_average', lambda x=list_of_samples, n=samples_to_average: np.mean(x[:n]))
Now you can generalize this to the case where samples_to_average is an array of parameters:
samples_to_average = np.arange(5, 25, 5)
sample_average = mc.Lambda('sample_average', lambda x=list_of_samples, n=samples_to_average: [np.mean(x[:t]) for t in n])
The getConfidenceInterval function would also have to be changed as shown below:
def getConfidenceInterval():
    responseModel = mc.Model([samples_to_average, list_of_samples, sample_average])
    mapRes = mc.MAP(responseModel)
    mapRes.fit()
    mcmc = mc.MCMC(responseModel)
    mcmc.sample( 10000, 5000)
    # pass a list, since recent NumPy versions do not accept a generator in vstack
    average = np.vstack([t for t in mcmc.trace('sample_average')])
    upper = np.percentile(average, 95, axis = 0)
    lower = np.percentile(average, 5, axis = 0)
    return (lower, upper)
I used vstack to aggregate the sample averages into a 2D array and then used the axis option in Numpy's percentile function to compute percentiles along each column.
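The column-wise percentile step can be sketched with NumPy alone, without PyMC. The example below is an illustration under assumptions (a seeded generator, 10000 draws, responses uniform on the integers 1 to 10), not a replacement for the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10000
sizes = np.arange(5, 25, 5)  # sample sizes 5, 10, 15, 20

# Each row is one draw of 20 responses, uniform on the integers 1..10
responses = rng.integers(1, 11, size=(n_draws, sizes.max()))

# One column of sample averages per sample size
averages = np.column_stack([responses[:, :n].mean(axis=1) for n in sizes])

# axis=0 computes the percentile down each column, i.e. per sample size
lower = np.percentile(averages, 5, axis=0)
upper = np.percentile(averages, 95, axis=0)
```

The interval `upper - lower` narrows as the sample size grows, which is the law-of-large-numbers behaviour the question sets out to show.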