I would like to build a linear regression model to determine the influence of various parameters on quote prices. The quote data were collected over 10 years.
y = Price
X = [system_size (int), ZIP, year, module_manufacturer, module_name, inverter_manufacturer, inverter_name, battery_storage (binary), number of installers/offerers in the region (int), installer_density, new_construction (binary), self_installation (binary), household_density]
Questions:
What type of regression model is suitable for this dataset?
Due to technological progress, quote prices decrease over the years. How can I account for the different years in the model? I found some examples where years were encoded as binary (dummy) variables. Another option: a separate regression model for each year. Is there a way to combine these multiple models?
Is the dataset a type of panel data?
Unfortunately, I have not yet found any information that could explicitly help me with my data. But maybe I didn't use the right search terms. I would be very happy about any suggestions that nudge me in the right direction.
Suppose you have a data.frame called data with columns price, system_size, zip, year, battery_storage etc. Then you can start with a simple linear regression:
# zip is a categorical code; wrap it in factor() so lm() doesn't treat it as a number
lm(price ~ system_size + factor(zip) + year + battery_storage, data = data)
year is included in the model so that changes over time are taken into account. As a numeric column, year fits a single linear trend across years; if you expect the price decline to be non-linear, use factor(year) instead, which gives you exactly the per-year dummy variables you read about without having to fit a separate model per year.
If you want to remove batch effects (e.g. from different regions/ZIP codes) and you only care about modelling the price after removing the effect of location, you can fit a linear mixed model:
lmerTest::lmer(price ~ system_size + year + battery_storage + (1|zip), data = data)
If you suspect that the effect of one variable changes with another, e.g. that the price per unit of system_size shifts over the years, you can add an interaction term like year:system_size to your formula.
As a rule of thumb, you need about 10 samples per variable to get a reasonable fit. If you have more variables than your sample size supports, you can do variable selection first.
I am pretty new to ML and completely new to creating my own models. I have gone through TensorFlow's time-series forecasting tutorial and other LSTM time-series examples on how to predict with multivariate inputs. After trying multiple examples, I think I realized that this is not what I want to achieve.
My problem involves a dataset in hourly intervals with 4 different variables, with the possibility of more in the future; one column is the datetime. I want to train a model on data that can range from one month to many years. I will then create an input set that contains some of the variables used during training, with at least one of them missing, which is the one I want it to predict.
For example, if I had 2 years' worth of solar panel data and weather conditions from Jan-12-2016 to Jan-24-2018, I would like to train the model on that and then have it predict the solar panel data from May-05-2021 to May-09-2021 given the weather conditions during that date range. So in my case I am not necessarily forecasting, but using existing data to point at a certain day of any year given the conditions of that day at each hour. This means I should be able to go back in time as well.
Is this possible to achieve using TensorFlow, and if so, are there any resources or tutorials I should be looking at?
See Best practice for encoding datetime in machine learning.
Relevant variables for solar power seem to be hour-of-day and day-of-year. (I don't think separating into month and day-of-month is useful, as they are part of the same natural cycle; if you had data on the spending habits of people who get paid monthly, it would make sense to model the monthly cycle, but the Sun doesn't do anything in particular monthly.) Divide them by hours-in-day and days-in-year respectively to map them to [0,1), multiply by 2*PI to get to a circle, and create a sine column and a cosine column for each of them. This gives you four features that capture the cyclic nature of day and year.
CyclicalFeatures, discussed in the link, is a nice convenience, but you should still pre-process your day-of-year into a [0,1) interval manually; otherwise you can't get it to handle leap years correctly, since days-in-year is 365 in some years and 366 in others, and CyclicalFeatures only accepts a single max value per column. A sketch of the manual encoding follows.
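A minimal sketch of this encoding in pandas, assuming a DataFrame df whose datetime column is named datetime (a placeholder name):

import numpy as np

ts = df["datetime"].dt  # datetime64 column; name is an assumption

# position within each cycle, scaled to [0, 1)
hour_frac = ts.hour / 24
days_in_year = np.where(ts.is_leap_year, 366, 365)  # 365 or 366 per row, handles leap years
day_frac = (ts.dayofyear - 1) / days_in_year

# a sine and a cosine column per cycle capture its circular geometry
df["hour_sin"] = np.sin(2 * np.pi * hour_frac)
df["hour_cos"] = np.cos(2 * np.pi * hour_frac)
df["doy_sin"] = np.sin(2 * np.pi * day_frac)
df["doy_cos"] = np.cos(2 * np.pi * day_frac)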
I'm working on a big data project for my school project. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv
I'm trying to predict the next values of "LandAverageTemperature".
I asked another question about this topic earlier (it's here: How to predict correctly in sklearn RandomForestRegressor?) but couldn't get any answer for it. After not getting anything on my first question and then failing for another day, I've decided to start from scratch.
Right now, I want to know which value in my dataset is "x" so I can make the prediction correctly. I read that y is the dependent variable that I want to predict, and x is the independent variable that I should use as the "predictor" to help the prediction process. In that case my y variable is "LandAverageTemperature". I don't know what the x value is. I was using date values for x at first, but I'm not sure that is right at the moment.
And if I shouldn't use RandomForestRegressor or sklearn for this dataset (I started this project with Spark), please let me know. Thanks in advance.
You only have one variable (LandAverageTemperature), so obviously that's what you're going to use. What you're looking for is the df.shift() function, which shifts your values. With this function, you'll be able to add columns of past values to your dataframe. You will then be able to use the temperature from 1 month/day ago, 2 months/days ago, etc., as predictors of another day's/month's temperature.
You can use it like this:
for i in range(1, 15):
    # create lag columns T-1 .. T-14 from past temperature values
    df.loc[:, 'T-%s' % i] = df.loc[:, 'LandAverageTemperature'].shift(i)
Your columns will then be the temperature, plus the temperature at T-1, T-2, and so on, for up to 14 time periods.
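To connect this back to your RandomForestRegressor attempt, a hedged sketch of fitting on the lagged columns (column names as created above; the hyperparameters are placeholders, not tuned values):

from sklearn.ensemble import RandomForestRegressor

lagged = df.dropna()  # shift() leaves NaNs in the first 14 rows
X = lagged[['T-%s' % i for i in range(1, 15)]]  # the 14 lag columns
y = lagged['LandAverageTemperature']
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)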
As for your question about the proper model for time-series forecasting, that would be off-topic for this site, but many resources exist on https://stats.stackexchange.com.
In the general case, you can use all data columns except your target column as the X feature matrix. But in your case there are several complications:
You have missing (empty) data in most of the columns for many years. You can exclude such rows/years from the training data, or exclude the columns with missing data (but that would be almost all of your columns, which is not good).
A regression model can't use date fields directly; you should translate the date field into one or more numerical fields, e.g. "months past the first observation": something like (year - 1750) * 12 + month. And/or you can put year and month in separate columns (better if there is some seasonality in your data); see the sketch after this list.
You have sequential time data here, so maybe you should not use a simple regression at all. Consider ARIMA/SARIMA/SARIMAX and similar so-called time-series models, which predict the target sequentially, one value after another (month after month, in your case). It's a hard topic to learn, but you should definitely take a look at time series, because you will need it at some point, if not today.
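For the date-conversion point above, a minimal pandas sketch; the date column name dt is an assumption about the CSV:

import pandas as pd

df = pd.read_csv('GlobalTemperatures.csv', parse_dates=['dt'])
# "months past first observation", i.e. (year - 1750) * 12 + month
df['months_elapsed'] = (df['dt'].dt.year - 1750) * 12 + df['dt'].dt.month
# and/or keep year and month as separate columns for seasonality
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month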
In accounting, the dataset representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two or more line items. E.g. transaction (journal number) 1 has two lines: the receipt of cash and the income. Companies can also have transactions (journals) that consist of 3 line items or even more.
Will I first need to cleanse the data so that there is only one line item per journal, i.e. collapse the 8 rows above into 4?
Are there any Python machine-learning algorithms that will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in the transaction data. I do not know what the anomalies look like, so this would need to be unsupervised learning.
Fit a Gaussian to each dimension of the data to determine what is an anomaly. A mean and a variance are estimated per dimension, and if the Gaussian density of a new data point's value on that dimension falls below a threshold, it is considered an outlier. This gives one Gaussian per dimension. You can apply some feature engineering here rather than just fitting Gaussians to the raw data.
If the features don't look Gaussian (plot their histograms), use data transformations like log(x) or sqrt(x) until they look better.
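A minimal numpy sketch of the per-dimension approach; X_train (known-normal transactions as rows), X_new, and the threshold eps are all assumptions you would supply yourself:

import numpy as np

mu = X_train.mean(axis=0)   # one mean per dimension
var = X_train.var(axis=0)   # one variance per dimension

def density(X):
    # p(x): product of independent per-dimension gaussian densities
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

is_anomaly = density(X_new) < eps  # low p(x) => flag as anomaly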
Use anomaly detection when supervised learning is not available, or when you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously), rather than classifying into known categories (e.g. whether someone is male or female).
Error analysis: however, what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to expose the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary these parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
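The multivariate version is a small change with scipy, under the same assumptions about X_train, X_new and eps:

import numpy as np
from scipy.stats import multivariate_normal

mu = X_train.mean(axis=0)
cov = np.cov(X_train, rowvar=False)  # full covariance captures feature correlations
p = multivariate_normal(mean=mu, cov=cov).pdf(X_new)
is_anomaly = p < eps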
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
I have a data set which contains site usage behavior of users over a period of six months. It contains data about:
Number of pages viewed
Number of unique cookies associated with each user
Number of different OSes and browsers used
Number of different cities visited
Everything here is collected over a six-month time frame. I have used this data to train a model to predict a target variable 'y'. Everything is in numeric format.
Now, since the model is built on these six months of data, I know I can use it to predict the target variable y on the next six months of data.
My question is: if, instead of predicting on a six-month time frame, I use the model to predict on a monthly time frame, will it give me incorrect results?
My logic tells me yes. For example, I used tree methods such as decision trees and random forests, and these algorithms essentially learn thresholds to give the output "0/1". Now the variables I mentioned above, such as the number of cookies, OSes, browsers etc., would have different values from a one-month standpoint than from a six-month standpoint. For example, the number of unique cookies associated with a user would be smaller over one month than over six months.
But I am confused about whether the model will automatically adjust these values when running on monthly data. Please help me understand whether I am thinking about this right or wrong, and provide a logical explanation if possible.
Thanks.
Is your minimum unit of measurement 6 months? I hope not, but if it is, then I would suggest that you don't try to predict the next 1 month.
Seasonality within a year aside, you would need daily volume measurements. I would be very worried about building anything on monthly or even weekly numbers.
In terms of modelling techniques, please stick to simple regression methods, as kungphu suggests.
I am trying to fit an ARIMA model. I have 3 months of data; it shows a count (float) for every minute. Which order should I pass to arima.fit()?
I need to predict for every minute.
A basic ARIMA(p,d,q) model isn't going to work with your data.
Your data violates the assumptions of ARIMA, one of which is that the parameters have to be consistent over time.
I see clusters of 5 spikes, so I'm presuming that you have 5 busy workdays and quiet weekends. Basic ARIMA isn't going to know the difference between a weekday and weekend, so it probably won't give you useful results.
There is such a thing as SARIMA (Seasonal Autoregressive Integrated Moving Average). That would be useful if you were dealing with daily data points, but it isn't really suitable for minute data either.
I would suggest that you try to filter your data so that it excludes evenings and weekends; then you might be dealing with a dataset that has consistent parameters (a filtering sketch follows the code below). If you're using Python, you could try using pyramid's auto_arima() function on your filtered time-series data (s) to have it try to auto-find the best parameters p, d, q.
It also does a lot of the statistical tests you might want to look into for this type of analysis. I don't always agree with auto_arima's parameter choices, but it's a start.
from pyramid.arima import auto_arima  # the pyramid package has since been renamed pmdarima

model = auto_arima(s)
print(model.summary())
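For the filtering step mentioned above, a minimal sketch assuming s is a pandas Series with a DatetimeIndex; the business hours are placeholders:

s = s[s.index.dayofweek < 5]          # drop weekends (Mon=0 .. Fri=4)
s = s.between_time('09:00', '17:00')  # drop evenings; adjust to your actual hours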
http://pyramid-arima.readthedocs.io/en/latest/_submodules/arima.html
1) Is your data even suitable for a Box-Jenkins model (ARIMA)?
2) I see a higher mean toward the end, and there is clear seasonality in the data, so ARIMA will fail. Please try seasonal models: SARIMA, as rightly suggested in the other answer (see the sketch after this list), or Prophet, another beautiful algorithm for seasonality, by Facebook (R implementation):
https://machinelearningstories.blogspot.com/2017/05/facebooks-phophet-model-for-forecasting.html
3) Don't rely on ARIMA alone. Try other time-series algorithms such as STL, BSTS, TBATS (structural time series), hybrid models, etc. Let me know if you want information on some R packages.
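As a starting point for the seasonal route, a minimal statsmodels SARIMAX sketch, assuming a daily-aggregated series y; the orders and the weekly period of 7 are placeholders, not tuned choices:

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
result = model.fit(disp=False)
print(result.summary())
forecast = result.forecast(steps=30)  # next 30 days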