I have a signal and want to predict y, which represents the number of requests, using regression models. Currently I am using an OLS regression model to predict y, but the prediction error is very high because my signal has a lot of variation (ups and downs), as shown below.
I noticed that my model overestimates y (number of requests) most of the time, especially when the points to be predicted are preceded by large values of y, as indicated below by the yellow and red circles.
So I am not sure whether there is a robust regression model that can accommodate this problem of having a lot of variation in my dataset. Also, is there any way to segment out these large values by adapting the window size so that it does not include them?
Could you please advise?
From the visualization of the error I would say a linear model is not appropriate, and you should consider something that handles periodic data as well as a moving average. Your data appears to have periodic elements and a moving-average element that goes beyond anything "linear". Consider something like ARIMA. Here is a link to a tutorial on ARIMA: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/ Please post the results :)
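A minimal sketch with statsmodels, assuming your request counts are in a pandas Series; the (p, d, q) order below is only a placeholder that you would normally choose from ACF/PACF plots or a grid search, as in the tutorial:

    # Minimal ARIMA sketch; the series here is synthetic and only stands in
    # for your "Number of requests" signal.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = pd.Series(100 + rng.normal(size=200).cumsum())   # stand-in for your data

    model = ARIMA(y, order=(2, 1, 2))    # placeholder order, not tuned
    fit = model.fit()
    forecast = fit.forecast(steps=10)    # predict the next 10 points
    print(forecast)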
Vishaal
Related
I am working on a dataset where my dependent variable is continuous but all my independent variables are categorical (non-binary). I have tried one-hot encoding / creating dummy variables. I am getting a low R2 of about 0.4 but a high adjusted R2 of around 0.9. However, I am getting vertical lines in my regression plot and residual plot, even though my QQ line seems to fit into a straight line with some heavy tails at the end. So may I know if a regression model is the right method to be used in this kind of scenario? If yes, how should the plots be analyzed, and if no, what other methods and libraries can be employed to yield a better result?
I'll try to address some of your questions below:
"However I am getting vertical lines in my regression plot and residual plot"
This is expected if all your independent variables (IVs) are categorical. Each category is encoded as a binary variable, so the prediction for each observation is one of a finite set of category combinations. As a simple illustration, imagine a prediction based on 2 binary variables: there can only be 4 outcomes (0/0, 0/1, 1/0, 1/1). If you extend this to many binary variables, you get exactly that kind of discrete prediction.
In other words, there is no continuous slope to speak of, so you should not see a continuous prediction. You can read more about regression with categorical variables here.
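As a quick illustration (with hypothetical data, not yours), here is a minimal sketch showing that an OLS model with only one-hot encoded categorical predictors can produce at most as many distinct predictions as there are category combinations:

    # Hypothetical data: one 2-level and one 3-level categorical predictor.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "a": rng.choice(["x", "y"], size=200),
        "b": rng.choice(["u", "v", "w"], size=200),
    })
    y = rng.normal(size=200)

    X = pd.get_dummies(df, drop_first=True)        # one-hot / dummy encoding
    pred = LinearRegression().fit(X, y).predict(X)
    print(np.unique(pred.round(6)).size)           # at most 2 * 3 = 6 distinct values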
"even though my QQ line seems to fit into a straight line with some heavy tails at the end. So may I know if regression model is the right method to be used in this kind of scenario?"
Yes, you can still use a linear model.
"If its a yes how should the plots be analyzed and if its a no, what are the other methods and libraries that can be employed to yield a better result?"
What you have is basically similar to an ANOVA analysis, except that you are not doing inference. You can check for homogeneity of variance using a Levene test or another similar test; note that these tests can be extremely sensitive when you have a large number of observations. Looking at your QQ plot, which compares quantiles, I think it is fine.
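If you do want to check homogeneity of variance, here is a minimal sketch of Levene's test with scipy, using hypothetical residuals and group labels as stand-ins for your data:

    # Levene's test for equal variances across groups (hypothetical data).
    import numpy as np
    from scipy.stats import levene

    rng = np.random.default_rng(0)
    residuals = rng.normal(size=90)               # stand-in for your model residuals
    groups = np.repeat(["A", "B", "C"], 30)       # hypothetical category labels

    samples = [residuals[groups == g] for g in np.unique(groups)]
    stat, p_value = levene(*samples)
    print(stat, p_value)   # a small p-value suggests unequal variances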
I'm trying to solve a regression problem using a Python Keras CNN (with TensorFlow as the backend), where I try to predict a single y-value from an 8-channel satellite image (23x45 pixels) that I have fetched from Google Earth Engine using their Python API. I currently have 280 images, which I augment to 2500 images using flipping and random noise. The data is normalized and standardized, and I have removed outliers and images containing only zeros.
I've tested numerous CNN architectures, for example this one:
Convolution2D(4, 4, 3), MaxPooling2D(2, 2), Dense(50), Dropout(0.4), Dense(30), Dropout(0.4), Dense(1)
This results in strange behaviour where the predicted values fall mainly into two distinct groups or clusters (each with very little variance), while the true values have much higher variance. See the image below.
I have chosen not to publish any code snippets as my question is more of a general nature. What might lead to such clustered predictions? Are there any good common tricks to improve the results?
I've tried to solve the problem using a normal neural network and regression tools from SciKit-Learn, by flattening the images to one long array (length 23x45x8 = 8280). That doesn't result in clustering, although the accuracy is still quite low. I assume that is due to insufficient or inappropriate data.
Plotted Truth (x) vs Prediction (y) which shows that the prediction is heavily clustered
Your model is quite simple; it probably cannot extract features properly, so I would guess it is underfitting. Also, your dropout is 40% in two layers, which is quite high for such a small network, and it looks like you are using a linear output activation.
And yes, the number of samples can also lead to grouped predictions; the model tends to gravitate towards the values that the majority of samples take.
I have also noticed that some of your truth values are greater than 1 or less than 0. You need to normalize the targets properly and use a suitable output activation function.
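As a rough illustration only (not your exact data or architecture), here is a minimal Keras sketch of a slightly deeper CNN with lower dropout, assuming an input shape of (23, 45, 8) and targets scaled to [0, 1]; all layer sizes are placeholders, not tuned values:

    # Minimal sketch of a deeper CNN for regression on (23, 45, 8) inputs.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(23, 45, 8)),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),                      # lower dropout than 0.4 for a small network
        layers.Dense(1, activation="sigmoid"),    # sigmoid assumes targets scaled to [0, 1]
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])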
This could be more of a theoretical question than a code-related one. In my current job I find myself estimating/predicting (the latter being more opportunistic) the water level of a given river in Africa.
The point is that I am developing a simplistic multiple regression model that takes more than 15 years of historical water levels and precipitation (from different locations) to generate water level estimates.
I am not that used to working with machine learning, or whatever the correct name is. I am more used to modelling data and generating fits (the current data can be described very well by asymmetric Gaussians and sigmoid functions combined with low-order polynomials).
So the point is: once I have a multiple regression model, my colleagues advised me not to use fitted data for the estimation but all the raw data instead. Since they could not explain the reason to me, I attempted to use the fitted data as raw inputs (in my defense, the median of all the fitting models has a very low deviation error, i.e. nice fits). But what I don't understand is why I should use just the raw data, which could be noisy, inaccurate, and affected by factors that are not directly related (biasing the regression?). What is the advantage of that?
My lack of theoretical knowledge in the field is what makes me wonder about this. Should I always use all the raw data to determine the variables of my multiple regression, or can I use the fitted values (i.e. the median of the different fitting models for each historical year)?
Thanks a lot!
Here are my 2 cents:
I think your colleagues are saying that because it would be better for the model to learn the correlations between the raw data and the actual water level.
In the field you will start with the raw data, so being able to predict directly from it is very useful. Any processing you do on top of the raw data is work you will have to repeat every time you want to make a prediction.
However, if a simpler model (asymmetric Gaussians and sigmoid functions combined with low-order polynomials, as you describe) works well, then I would recommend doing that, as long as your (y_pred - y_true) ** 2 stays very small.
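If you want to test this empirically, one option is to compare the two approaches with cross-validation. A minimal sketch, using synthetic stand-ins for your precipitation and water-level records and a rolling median as a crude stand-in for your fitted curves:

    # Compare a regression fitted on raw inputs vs. pre-smoothed inputs
    # using cross-validated mean squared error (hypothetical data).
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    precip = pd.DataFrame(rng.gamma(2.0, 5.0, size=(365, 3)),
                          columns=["station_1", "station_2", "station_3"])
    water_level = precip.sum(axis=1) + rng.normal(scale=5.0, size=365)

    X_raw = precip
    X_smooth = precip.rolling(window=7, min_periods=1).median()  # crude stand-in for fitted curves

    for name, X in [("raw", X_raw), ("smoothed", X_smooth)]:
        mse = -cross_val_score(LinearRegression(), X, water_level,
                               scoring="neg_mean_squared_error", cv=5).mean()
        print(name, round(mse, 2))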
What happens after I have run a dimensionality reduction algorithm (PCA) that has produced a matrix W?
How do I now use it to predict on real-time data?
Do I need to create a user interface or something?
If that's the case, how and where?
I'm completely lost.
It depends on what you wanted to achieve by doing PCA. One use case is clustering. If you look at the summary of your PCA model, you will see the list of principal components and the proportion of variance explained by each one. You can choose the components that explain most of the variance (normally the cumulative proportion of variance explained by the chosen PCs should be >80%). You can also plot a scree plot and look for a break (elbow) in the graph to determine the number of components to use. Feed those PCs into your clustering model and you may see some good clustering results.
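A minimal sketch with scikit-learn, using random data as a stand-in for yours, that inspects the explained variance, keeps enough components for roughly 80% of it, and feeds the scores into k-means:

    # PCA explained variance + choosing components + clustering (hypothetical data).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                # stand-in for your data matrix

    pca = PCA().fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumvar, 0.80) + 1)   # smallest k reaching ~80%
    print(pca.explained_variance_ratio_, n_components)

    scores = PCA(n_components=n_components).fit_transform(X)  # the projected data
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores)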
For your second query, follow this link.
And you don't need to create any user interface right now. Just follow the above link and you should get some understanding of PCA and how to use it for predictive modeling.
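For predictive modeling on new (e.g. real-time) data, a minimal sketch is to put PCA inside a pipeline, so the same projection learned from the training data is applied automatically to every new observation; the data below is synthetic and only a placeholder for yours:

    # PCA inside a predictive pipeline; new data is projected with the same
    # components that were learned during fitting.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(200, 10)), rng.normal(size=200)  # stand-ins
    X_new = rng.normal(size=(5, 10))                                     # "real-time" observations

    model = make_pipeline(StandardScaler(), PCA(n_components=0.80), LinearRegression())
    model.fit(X_train, y_train)          # the PCA projection (your matrix W) is learned here
    print(model.predict(X_new))          # the same projection is applied to the new data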
I currently have a set of data points (hit counts), which are structured as a time series. The data is something like:
time hits
20 200
32 439
57 512
How can I fit a curve to this data or find a formula so that I can predict points in the future? Ideally, I can answer a question like "How many views will there be when the time is 100?"
Thanks for your help!
EDIT: What I've tried so far:
I've tried a variety of methods, including:
Creating a Logistic Regression using sklearn (however, there are no features for the data)
Creating a curve fit using optimize.curve_fit from scipy (however, I don't have a function for the data)
Creating a function from a UnivariateSpline to pass into curve_fit (something went wrong, I can't pin it down)
I'm trying to model when content goes viral, so I assume that a polynomial or exponential curve is ideal.
I tried the links from @Bill previously, but I have no function for the data. Do you know how I can find one?
EDIT 2:
Here's a sample of about two days of data:
Here is what is expected over time.
As other people have said, it is difficult to give an answer with so little information.
I suggest you define some new variables such as time, time*time and time*time*time, and fit a LinearRegression model using these as input variables.
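A minimal sketch using the three example points from the question; the polynomial degree is just a placeholder, and with so few points any extrapolation to time = 100 will be very uncertain:

    # Polynomial features of "time" + LinearRegression, then extrapolate.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    time = np.array([[20], [32], [57]])
    hits = np.array([200, 439, 512])

    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(time, hits)
    print(model.predict(np.array([[100]])))   # predicted hits at time = 100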
I would start with that, and then, if needed, move to something more complex such as a neural network (not in sklearn) or SVR.
Hope this can help.