I am trying to transition from R to Python for my time series analysis, but I am finding it quite hard. The code below is what I would use in R to regress a sine curve onto some data with a known period.
year <- c(0:100)
lm(data~sin(2*pi*year/15)+cos(2*pi*year/15))
Now I want to do the same in Python, but I keep coming across very long methods that involve making initial guesses and then optimising, etc. What is the simplest way to achieve a comparable result?
I'm not sure I got exactly what you are looking for, but since lm means linear model, you could try linear regression in scikit-learn as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
year = np.arange(0, 100, 1).reshape(-1, 1)          # samples as rows, one feature column
year_predict = np.arange(100, 200, 1).reshape(-1, 1)
y = np.sin(2*np.pi*year/15) + np.cos(2*np.pi*year/15)
lm = LinearRegression()
lm.fit(year, y)
y_pred = lm.predict(year_predict)
plt.plot(year.ravel(), y.ravel())
plt.plot(year_predict.ravel(), y_pred.ravel())
plt.ylabel('np.sin(2*pi*year/15)+np.cos(2*pi*year/15)')
plt.show()
You can find more info here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
If anything is unclear, write here and we can help you.
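For the record, the closest analogue of the R call itself is to put the sin and cos terms into the design matrix and fit an ordinary linear regression on them. A minimal sketch (the data series here is a random placeholder for your observations):

import numpy as np
from sklearn.linear_model import LinearRegression

year = np.arange(101)                        # 0..100, as in the R code
data = np.random.randn(101)                  # placeholder for your series
X = np.column_stack([np.sin(2*np.pi*year/15),
                     np.cos(2*np.pi*year/15)])
fit = LinearRegression().fit(X, data)        # ~ lm(data ~ sin(...) + cos(...))
print(fit.intercept_, fit.coef_)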
I want to detect the outliers in time series data which contains trend and seasonality components. I want to leave out the peaks which are seasonal and consider only the other peaks, labelling them as outliers. As I am new to time series analysis, please assist me in approaching this problem.
The coding platform I am using is Python.
Attempt 1: Using an ARIMA model
I have trained my model and forecasted the test data. I then compute the difference between the forecasted results and the actual values of the test data, and find the outliers based on the variance observed.
Implementation of Auto Arima
!pip install pyramid-arima
from pyramid.arima import auto_arima
stepwise_model = auto_arima(train_log, start_p=1, start_q=1, max_p=3, max_q=3,
                            m=7, start_P=0, seasonal=True, d=1, D=1, trace=True,
                            error_action='ignore', suppress_warnings=True,
                            stepwise=True)
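Note: the pyramid-arima package has since been renamed pmdarima; if the import above fails, an equivalent call with the same parameters (a sketch, not from the original post) is:

import pmdarima
stepwise_model = pmdarima.auto_arima(train_log, start_p=1, start_q=1,
                                     max_p=3, max_q=3, m=7, start_P=0,
                                     seasonal=True, d=1, D=1, trace=True,
                                     error_action='ignore',
                                     suppress_warnings=True, stepwise=True)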
import math
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from sklearn.metrics import mean_squared_error
Split data into train and test-sets
train, test = actual_vals[0:-70], actual_vals[-70:]
Log Transformation
train_log, test_log = np.log10(train), np.log10(test)
Converting to list
history = [x for x in train_log]
predictions = list()
predict_log = list()
Fitting Stepwise ARIMA model
for t in range(len(test_log)):
    stepwise_model.fit(history)
    output = stepwise_model.predict(n_periods=1)
    predict_log.append(output[0])
    yhat = 10**output[0]
    predictions.append(yhat)
    obs = test_log[t]
    history.append(obs)
Plotting
plt.figure(figsize=(12, 7))
plt.plot(test, label='Actuals')
plt.plot(predictions, color='red', label='Predicted')
plt.legend(loc='upper right')
plt.show()
But this way I can detect the outliers only in the test data. I actually need to detect the outliers for the whole time series, including the training data I have.
Attempt 2: Using Seasonal Decomposition
I have used the code below to split the original data into seasonal, trend, and residual components, as can be seen in the image below.
from statsmodels.tsa.seasonal import seasonal_decompose
decomposed = seasonal_decompose(actual_vals, period=7)  # the series must be passed in; period=7 assumed from the weekly seasonality (m=7) above
Then I am using the residual data to find the outliers with a boxplot, since the seasonal and trend components have been removed; a minimal sketch of this follows. Does this make sense?
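Sketch of the idea, using the usual 1.5*IQR boxplot rule on the residual component (the multiplier is just the standard convention, to be tuned):

resid = decomposed.resid.dropna()
q1, q3 = resid.quantile(0.25), resid.quantile(0.75)
iqr = q3 - q1
outliers = resid[(resid < q1 - 1.5*iqr) | (resid > q3 + 1.5*iqr)]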
Or is there any other simple or better approach to go with?
You can:
In the 4th graph (the residual plot) from "Attempt 2: Using Seasonal Decomposition", check for extreme points; that may lead you to some anomalies in the seasonal series.
Supervised (if you have some labeled data): do some classification.
Unsupervised: try to predict the next value and create a confidence interval to check whether the prediction lies inside it or not. A minimal sketch of this last idea follows.
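This sketch assumes a rolling window and a 3-sigma band; both the window width and the multiplier are tuning choices, not fixed rules:

import pandas as pd

def band_outliers(series, window=30, n_sigmas=3):
    s = pd.Series(series)
    mu = s.rolling(window, center=True, min_periods=window // 2).mean()
    sd = s.rolling(window, center=True, min_periods=window // 2).std()
    return s[(s - mu).abs() > n_sigmas * sd]  # points outside the band

# e.g. outlier_points = band_outliers(actual_vals)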
You can try to calculate the relative extrema of the data using argrelextrema, as shown here for example:
from scipy.signal import argrelextrema
import numpy as np
x = np.array([2, 1, 2, 3, 2, 0, 1, 0])
argrelextrema(x, np.greater)
output:
(array([3, 6]),)
[Plot: my implementation of the above argrelextrema on some random data]
I tried to run a ridge regression on the Boston housing data with Python, but I have the following questions that I could not find answers to anywhere, so I decided to post them here:
Is scaling recommended before fitting the model? I ask because I get the same score whether I scale or not. Also, what is the interpretation of the alpha/coefficient graph in terms of choosing the best alpha?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import linear_model
df = pd.read_csv('../housing.data',delim_whitespace=True,header=None)
col_names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df.columns = col_names
X = df.loc[:,df.columns!='MEDV']
col_X = X.columns
y = df['MEDV'].values
# Feature Scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
clf = Ridge()
coefs = []
alphas = np.logspace(-6, 6, 200)
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X_std, y)
    coefs.append(clf.coef_)
plt.figure(figsize=(20, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
[Figure: alpha/coefficient graph for scaled X]
[Figure: alpha/coefficient graph for unscaled X]
On the scaled data, when I compute the score and choose the alpha thanks to CV, I get:
from sklearn.linear_model import RidgeCV
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 7]).fit(X_std, y)
> clf.score(X_std, y)
> 0.74038
> clf.alpha_
> 5.0
On the non-scaled data, I even get a slightly better score with a completely different alpha:
clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 6]).fit(X, y)
> clf.score(X, y)
> 0.74064
> clf.alpha_
> 0.01
Thanks for your insights on the matter, looking forward to reading your answers!
I think you should scale, because ridge regularization penalizes large coefficients, and you don't want to lose meaningful features due to scaling issues. Perhaps you don't see a difference because the housing data is a toy dataset that is already reasonably well scaled.
A larger alpha is a stronger penalty on large coefficients. The graph shows (though it has no labeling) that with a stronger alpha the coefficients are shrunk toward zero more strongly. The more gradual lines are the smaller weights; they are affected less, or almost not at all, until alpha becomes sufficiently large. The sharper ones are the larger weights, which drop toward zero more quickly; once they do, the feature effectively disappears from your regression.
For the scaled data, the entries of the design matrix are smaller in magnitude, so the fitted coefficients tend to be larger, and the L2 penalty term grows accordingly. To keep the penalized objective small, those coefficients must be shrunk harder, which requires a larger alpha. That is why, when you scale the data, the optimal alpha is a much larger number.
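If it helps, one way to keep the scaling and the alpha search tied together is a pipeline, so the scaler is fitted with the model and travels with it. A minimal sketch (not from the answers above; the alpha grid is arbitrary):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X, y)  # X, y as defined in the question
print(model.named_steps['ridgecv'].alpha_, model.score(X, y))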
How to fit a locally weighted regression in python so that it can be used to predict on new data?
There is statsmodels.nonparametric.smoothers_lowess.lowess, but it returns the estimates only for the original data set; so it seems to only do fit and predict together, rather than separately as I expected.
scikit-learn always has a fit method that allows the object to be used later on new data with predict; but it doesn't implement lowess.
Lowess works great for predicting (when combined with interpolation)! I think the code is pretty straightforward; let me know if you have any questions!
[Matplotlib figure: the plot produced by the code below]
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.interpolate import interp1d
import statsmodels.api as sm
# introduce some floats in our x-values
x = list(range(3, 33)) + [3.2, 6.2]
y = [1,2,1,2,1,1,3,4,5,4,5,6,5,6,7,8,9,10,11,11,12,11,11,10,12,11,11,10,9,8,2,13]
# lowess will return our "smoothed" data with a y value for every x value
lowess = sm.nonparametric.lowess(y, x, frac=.3)
# unpack the lowess smoothed points to their values
lowess_x = list(zip(*lowess))[0]
lowess_y = list(zip(*lowess))[1]
# run scipy's interpolation. There is also extrapolation I believe
f = interp1d(lowess_x, lowess_y, bounds_error=False)
xnew = [i/10. for i in range(400)]
# this generates y values for our x values using the interpolator;
# it will MISS values outside of the x window (less than 3, greater than 32).
# There might be a better approach, but you can run a for loop
# and, if a value is out of range, use f(min(lowess_x)) or f(max(lowess_x))
ynew = f(xnew)
plt.plot(x, y, 'o')
plt.plot(lowess_x, lowess_y, '*')
plt.plot(xnew, ynew, '-')
plt.show()
I've created a module called moepy that provides an sklearn-like API for a LOWESS model (incl. fit/predict). This enables predictions to be made using the underlying local regression models, rather than the interpolation method described in the other answers. A minimalist example is shown below.
# Imports
import numpy as np
import matplotlib.pyplot as plt
from moepy import lowess
# Data generation
x = np.linspace(0, 5, num=150)
y = np.sin(x) + (np.random.normal(size=len(x)))/10
# Model fitting
lowess_model = lowess.Lowess()
lowess_model.fit(x, y)
# Model prediction
x_pred = np.linspace(0, 5, 26)
y_pred = lowess_model.predict(x_pred)
# Plotting
plt.plot(x_pred, y_pred, '--', label='LOWESS', color='k', zorder=3)
plt.scatter(x, y, label='Noisy Sin Wave', color='C1', s=5, zorder=1)
plt.legend(frameon=False)
A more detailed guide on how to use the model (as well as its confidence and prediction interval variants) can be found here.
Consider using kernel regression instead.
statsmodels has an implementation.
If you have too many data points, why not use scikit-learn's RadiusNeighborsRegressor and specify a tricube weighting function? A minimal sketch follows.
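This sketch assumes every query point has at least one neighbor within the radius; the tricube helper is my own illustration, not part of scikit-learn:

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

def tricube(dist):
    # scikit-learn passes one distance array per query point; normalize each
    # neighborhood's distances to [0, 1] and apply the tricube kernel
    return np.array([(1 - (d / (d.max() + 1e-12))**3)**3 for d in dist],
                    dtype=object)

reg = RadiusNeighborsRegressor(radius=1.0, weights=tricube)
# reg.fit(x.reshape(-1, 1), y); y0 = reg.predict(x0.reshape(-1, 1))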
It's not clear whether it's a good idea to have a dedicated LOESS object with separate fit/predict methods like what is commonly found in Scikit-Learn. By contrast, for neural networks, you could have an object which stores only a relatively small set of weights. The fit method would then optimize the "few" weights by using a very large training dataset. The predict method only needs the weights to make new predictions, and not the entire training set.
Predictions based on LOESS and nearest neighbors, on the other hand, need the entire training set to make new predictions. The only thing a fit method could do is store the training set in the object for later use. If x and y are the training data, and x0 are the points at which to make new predictions, this object-oriented fit/predict solution would look something like the following:
model = Loess()
model.fit(x, y) # No calculations. Just store x and y in model.
y0 = model.predict(x0) # Uses x and y just stored.
By comparison, in my localreg library, I opted for simplicity:
y0 = localreg(x, y, x0)
It really comes down to design choices, as the performance would be the same.
One advantage of the fit/predict approach is that you could have a unified interface like the one in Scikit-Learn, where one model can easily be swapped for another. The fit/predict approach also encourages a machine-learning way of thinking about it, but in that sense LOESS is not very efficient, since it requires storing and using all the data for every new prediction. The latter approach leans more towards the origins of LOESS as a scatterplot smoothing algorithm, which is how I prefer to think about it. This might also shed some light on why statsmodels does it the way it does.
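For concreteness, a minimal sketch of such a fit/predict wrapper, built on statsmodels lowess plus interpolation as in the earlier answer (the Loess class here is made up for illustration, not a real library class):

import numpy as np
import statsmodels.api as sm

class Loess:
    def __init__(self, frac=0.3):
        self.frac = frac
    def fit(self, x, y):
        # no calculations: just store the training data
        self.x_, self.y_ = np.asarray(x), np.asarray(y)
        return self
    def predict(self, x0):
        # smooth the stored data, then interpolate at the new points
        smoothed = sm.nonparametric.lowess(self.y_, self.x_, frac=self.frac)
        return np.interp(x0, smoothed[:, 0], smoothed[:, 1])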
Check out the loess class in scikit-misc. The fitted object has a predict method:
from skmisc.loess import loess
loess_fit = loess(x, y, span=.01)
loess_fit.fit()
preds = loess_fit.predict(x_new).values
https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess.html
I'm trying to use pyMC to provide a Bayesian estimate of a covariance matrix given some data. I'm roughly following the stock covariance example provided in this online guide (link here), but I have a more simplistic example model that I made up. I've got two values that I draw from a multivariate normal distribution, and I've constructed it in such a way that I know the covariance/correlation between the two variables.
I've posted my short code below. Essentially what I'm doing is constructing an artificial data set where the correlation matrix should be [[1, -0.5], [-0.5, 1]]. At the end of the mcmc sampling, I get a predicted value for the off-diagonal term that is quite a bit different. I've looked at the convergence criteria, and it looks like the autocorrelation is low and the distribution is stationary. However, I will admit I'm still wrapping my head around all the nuances here and there could be aspects of this that are still beyond my grasp.
This question is related to and very much based on these other two SO questions (One and Two). I felt the need to ask my own question despite the similarity because I'm not getting the answer I expect to get. If any of you computational statisticians out there can help provide insight into this problem it would be greatly appreciated!
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns
p=2
prior_mu=np.ones(p)
prior_sdev=np.ones(p)
prior_corr_inv=np.eye(p)
def cov2corr(A):
    """
    Covariance matrix to correlation matrix.
    """
    d = np.sqrt(A.diagonal())
    A = ((A.T / d).T) / d
    #A[ np.diag_indices(A.shape[0]) ] = np.ones( A.shape[0] )
    return A
# construct artificial data set
muVector=[10,5]
sdevVector=[3.,5.]
corrMatrix=np.matrix([[1,-0.5],[-0.5, 1]])
cov_matrix=np.diag(sdevVector)*corrMatrix*np.diag(sdevVector)
n_obs = 500
x = np.random.multivariate_normal(muVector,cov_matrix,n_obs)
prior_mu = np.array(muVector)
prior_std = np.array(sdevVector)
inv_cov_matrix = pm.Wishart( "inv_cov_matrix", n_obs, np.diag(prior_std**2) )
mu = pm.Normal( "returns", prior_mu, 1, size = 2)
# create the model and sample
obs = pm.MvNormal( "observed returns", mu, inv_cov_matrix, observed = True, value = x )
model = pm.Model( [obs, mu, inv_cov_matrix] )
mcmc = pm.MCMC(model)
mcmc.use_step_method(pm.AdaptiveMetropolis,inv_cov_matrix)
mcmc.sample(int(1e5), int(2e4), 10)
# Determine prediction - Does not equal corrMatrix!
inv_cov_samples = mcmc.trace("inv_cov_matrix")[:]
mean_covariance_matrix = np.linalg.inv( inv_cov_samples.mean(axis=0) )
prediction = cov2corr(mean_covariance_matrix*n_obs)
I have a function of 5 variables, Fx(s, m, p, h, l):
import numpy as np
s= np.arange(0,135,15)/10
m= np.array([150,180,195,210,240,255,270,285,300])
p=np.array([-1.5,-1,-0.5,0,0.5,1,1.5])
h=np.array([0,3,6,9,12])
l=np.array([0,0.5,1,1.5,2,2.5,3,4])
and 180 values of the function in a CSV file.
I would like to calculate the missing values by interpolation at all points, and using a radial basis function with the thin_plate kernel would be great. Is that possible?
For information, I found this here:
Python 4D linear interpolation on a rectangular grid
InterpolatingFunction
But if I replace some values in the data array with None, f(point) gives 'nan' at those points, and I don't want to use linear interpolation, because for one set of 4 variables I only have 2 points with a value.
Thanks very much for helping. LL
Try SVR from scikit-learn to solve your problem:
from sklearn.svm import SVR # it uses RBF as default kernel
import numpy as np
n_samples, n_features = 180, 5
y = np.random.randn(n_samples)              # function values
X = np.random.randn(n_samples, n_features)  # points
clf = SVR(C=1.0, epsilon=0.2)
clf.fit(X, y)
# Get the value at a new point (predict expects a 2D array):
clf.predict([[0, 150, -1.5, 0, 0]])
Because len(s)*len(m)*len(p)*len(h)*len(l) is 22680 and the function values are known at only 180 points, you have very little information about your function...
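That said, the thin-plate RBF the question asks about is available directly in SciPy. A minimal sketch (the arrays are random stand-ins for the 180 points read from the CSV, and since the variables have very different ranges you would probably want to normalize each column first):

import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
X = rng.random((180, 5))   # stand-in for the 180 known (s, m, p, h, l) points
y = rng.random(180)        # stand-in for the 180 known function values
rbf = RBFInterpolator(X, y, kernel='thin_plate_spline')
value = rbf(np.array([[1.5, 180, -0.5, 6, 1.0]]))  # evaluate at a new point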