I am trying to use statsmodels for panel data and have an issue with the shape of my data. My model is a TVP-VAR for a panel, cast as a normal linear state space model composed of a state equation and a measurement equation, which I have managed to write as in eq. 33 of Canova and Ciccarelli (2013).
The key model equation, where 𝒳_t = X_t Ξ and u_t = X_t′ ξ_t + e_t, so that u_t ~ N(0, σ²(I + σ₂² X_t′ X_t)), is attached.
Key Model Equation
I use exactly this class of models from your site: TVP-VAR, MCMC, and sparse simulation smoothing.
https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html
When I run the model locally, I get the attached graphs for 'Simulations based on KFS approach, MLE parameters' and 'Simulations based on CFA approach, MLE parameters', where some countries and years appear in an unexpected format.
KFS and CFA: unexpected outcome format
I suspect it has to do with the shape of the data I am using. You can see my actual data shape in the attached screenshot.
When I run the 'Simulations with alternative parameterization yielding a smoother trend' section, among the errors I get is:
"'value' must be an instance of str or bytes, not a tuple."
in addition to an earlier warning:
"An unsupported index was provided and will be ignored when e.g. forecasting. self._init_dates(dates, freq)"
I suspect that has to do with my data shape and index. My dataset is in long format. A screenshot is here:
Data shape
My question is a bit naive. How do I reshape my data to make it compatible with statsmodels? How do I rewrite my code to bring my data into a shape that is acceptable for running the TVP-VAR, MCMC, and sparse simulation smoothing example?
I hope it is clear what I am looking for. The code I am currently using to import the data is:
%matplotlib inline
from importlib import reload
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import invwishart, invgamma
# Read the Stata panel dataset
import pyreadstat
dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()
# Variable labels and column names from the Stata metadata
labels = list(meta.column_labels)
column = list(meta.column_names)
# Panel data settings: keep year both in a (country, year) MultiIndex
# and as a separate categorical column
year = pd.Categorical(dta.year)
dta = dta.set_index(["country", "year"])
dta["year"] = year
dta.head()
I would appreciate it if you could help me set the right shape format acceptable to statsmodels.
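For reference, here is a minimal sketch of one way to get from a long panel like this to the wide, time-indexed shape the state space models in that notebook expect (one row per period, one column per series, with a PeriodIndex). The column names used here, such as gdp, are hypothetical stand-ins for the actual variables:
import pandas as pd
# Long panel: one row per (country, year) observation; "gdp" is a hypothetical value column
long_df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR"],
    "year": [2019, 2020, 2019, 2020],
    "gdp": [1.1, -4.6, 1.8, -7.9],
})
# Pivot to wide format: one row per year, one column per country
wide_df = long_df.pivot(index="year", columns="country", values="gdp")
# Give the rows a proper time index so statsmodels does not warn about an unsupported index
wide_df.index = pd.PeriodIndex(wide_df.index.astype(str), freq="Y")
print(wide_df)
With the data in this wide shape (time on the index, one observed series per column), it can be passed to the TVP-VAR class from the notebook in the same way as the example dataset used there.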
I am working on a time series analysis with SARIMAX and have been really struggling with it.
I think I have successfully fit a model and used it to make predictions; however, I don't know how to make out-of-sample forecasts with exogenous data.
I may be doing the whole thing wrong, so I have included my steps below with some sample data:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
# Defining Sample data
df = pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting Data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
test = df.loc['2019-01-10':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
model = sm.tsa.statespace.SARIMAX(train['price'],
                                  order=(0, 0, 0),
                                  seasonal_order=(0, 0, 0, 12),
                                  exog=train.iloc[:, 1:],
                                  time_varying_regression=True,
                                  mle_regression=False)
model_1 = model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forecasting out of sample data with exogenous data
forecast = model_1.forecast(3, exog=exog_test)
So my problem is really with the last line: what do I do if I want more than 3 steps?
I will attempt to answer this question, as it mainly relates to the shape of the data and the statsmodels documentation.
As per the documentation, 'steps' is an integer: the number of steps to forecast from the end of the sample. That also means that if you want more than three forecast steps, you need to provide an exogenous array with that many rows for the forecast period, i.e. a larger test set (the number of exog rows passed to forecast() must equal steps).
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)
(https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAXResults.forecast.html)
Here are two errors I get when I increase the step size by one without expanding the exogenous test data:
ValueError: cannot reshape array of size 3 into shape (4,1)
Provided exogenous values are not of the appropriate shape. Required (4, 1), got (3, 1).
ValueError: the number of rows in the exogenous variable does not match the number of time periods you're asking it to predict
With that said, simply expanding the testing set works and gets you the additional forecasts. Here is the code that works, and a link to a working notebook:
https://colab.research.google.com/drive/1o9KXAe61EKH6bDI-FJO3qXzlWjz9IHHw?usp=sharing
import pandas as pd
import numpy as np
# from sklearn.model_selection import train_test_split
# why import this if you want to do the train/test split manually?
# Defining Sample data
df=pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
'2019-01-04','2019-01-05','2019-01-06',
'2019-01-07','2019-01-08','2019-01-09',
'2019-01-10','2019-01-11','2019-01-12'],
'price':[78,60,62,64,66,68,70,72,74,76,78,80],
'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
})
# Changing index to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
df.sort_index(inplace=True)
df.dropna(inplace=True)
# Splitting data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
# CHANGED: the test window now starts at 2019-01-09 instead of 2019-01-10,
# so one more day is included and exog_test has 4 rows (shape (4, 1);
# together with the price column the test frame is (4, 2)).
# forecast() can then be called with steps=4 to match.
test = df.loc['2019-01-09':'2019-01-12']
# setting index to datetime for test and train datasets
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')
# Defining and fitting the model with training data for endogenous and exogenous data
import statsmodels.api as sm
model = sm.tsa.statespace.SARIMAX(train['price'],
                                  order=(0, 0, 0),
                                  seasonal_order=(0, 0, 0, 12),
                                  exog=train.iloc[:, 1:],
                                  time_varying_regression=True,
                                  mle_regression=False)
model_1 = model.fit(disp=False)
# Defining exogenous data for testing
exog_test=test.iloc[:,1:]
# Forecasting out-of-sample data with exogenous data
forecast = model_1.forecast(4, exog=exog_test)
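More generally, the number of rows in the exog array passed to forecast() has to equal steps. As a sketch (the future factor1 values below are made up for illustration), forecasting an arbitrary horizon h with the fitted results object model_1 from above looks like this:
# Hypothetical future exogenous values for the forecast horizon;
# in practice these must be known or forecast separately
h = 6
future_index = pd.period_range(start='2019-01-10', periods=h, freq='D')
exog_future = pd.DataFrame({'factor1': [200, 210, 205, 215, 220, 225]},
                           index=future_index)
# steps must equal the number of rows in exog
forecast_h = model_1.forecast(steps=h, exog=exog_future)
print(forecast_h)
The key point is simply that forecast(steps=h, exog=...) needs exactly h rows of exogenous data, however large h is.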
I'm trying to use the QuantileTransformer from dask-ml.
For that, I have the following DF:
When I try:
from dask_ml.preprocessing import StandardScaler,QuantileTransformer,MinMaxScaler
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']])
I get this error:
ValueError: Tried to concatenate arrays with unknown shape (1, nan).
To force concatenation pass allow_unknown_chunksizes=True.
And I can't find where to set the parameter allow_unknown_chunksizes=True, since passing it to the transformer raises an error.
The first error disappears if I compute the df beforehand:
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']].compute())
But I don't know why this is necessary, or even if it is the right thing to do.
Also, in contrast to the StandardScaler, this returns an array instead of a dataframe.
This was a limitation of the previous Dask-ML implementation. It's fixed in https://github.com/dask/dask-ml/pull/533.
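Until that fix is in the release you are using, one possible workaround (a sketch, not taken from that PR, and using a small made-up frame in place of the real one) is to give the underlying dask array known chunk sizes with to_dask_array(lengths=True) instead of computing the whole frame into memory:
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import QuantileTransformer

# Small made-up stand-in for the real dataframe from the question
pdf = pd.DataFrame({'LotFrontage': [65.0, 80.0, 68.0, 60.0, 84.0, 85.0],
                    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115]})
df = dd.from_pandas(pdf, npartitions=2)

# Converting with known chunk lengths avoids the "unknown shape (1, nan)"
# concatenation error without pulling everything into memory
X = df[['LotFrontage', 'LotArea']].to_dask_array(lengths=True)

scaler = QuantileTransformer(n_quantiles=5)  # n_quantiles kept <= number of rows here
X_t = scaler.fit_transform(X)                # returns a dask array, not a dataframe
print(X_t.compute())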
When I run this code:
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)
I get this output:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error points to the line where it says melbourne_model.fit(X, y).
I want the code to fit the model with X and y so I can make future predictions of houses in Melbourne based on a few variables I input such as year built, land area, rooms/bedrooms, location, etc. Right now I can't do that because of this error.
I think it is because X and y aren't NumPy arrays, and when I call np.asarray() on what I want to turn into a NumPy array, it doesn't work. I know this because when I write type(X) or type(y), I get pandas.core.series.Series.
The whole code in my file:
import pandas as pd
import numpy as np
melbourne_file_path = 'melb_data.csv\\melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
np.asarray(melbourne_data.Price)  # note: this call returns a new array and does not modify melbourne_data
y = melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']
np.asarray(melbourne_data[melbourne_predictors])  # likewise, this result is not assigned to anything
X = melbourne_data[melbourne_predictors]
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)
I am using Jupyter Notebook as part of Anaconda.
The CSV file I am using can be downloaded here.
Once you download it, you need to extract the files; the CSV is inside the extracted folder. You can set your own melbourne_file_path based on where the file is for you.
The error you're getting is fairly clear: Input contains NaN, infinity or a value too large. The problem is not that your inputs are pandas Series, but that your data has missing values! A quick glance at your CSV on Kaggle shows that rows 15 and 16 are missing quite a few fields, for example.
It's up to you to decide how to handle these missing values. One way is simply to drop any row that is missing one or more values: df.dropna(inplace=True). This should get the DecisionTreeRegressor to fit without errors, but it might bias your results if too many rows are dropped. A possibly better approach is to fill missing values with the column mean: df.fillna(df.mean()).
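As a concrete sketch of the first option, dropping incomplete rows before splitting into X and y so that the two stay row-aligned (using the same column names as above):
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

melbourne_data = pd.read_csv('melb_data.csv\\melb_data.csv')
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']

# Drop rows with missing values in the predictors or the target
clean = melbourne_data.dropna(subset=melbourne_predictors + ['Price'])

X = clean[melbourne_predictors]
y = clean.Price

melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)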
I am trying to compute a PDF estimate from a KDE computed using the scikit-learn module. I have seen two variants of scoring and I am trying both: statements A and B below.
Statement A results in following error:
AttributeError: 'KernelDensity' object has no attribute 'tree_'
Statement B results in following error:
ValueError: query data dimension must match training data dimension
It seems like a silly error, but I cannot figure it out. Please help. The code is below:
from sklearn.neighbors import KernelDensity
import numpy
# d is my 1-D array data
xgrid = numpy.linspace(d.min(), d.max(), 1000)
density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d)
# statement A
density_score = KernelDensity(kernel='gaussian', bandwidth=0.08804).score_samples(xgrid)
# statement B
density_score = density.score_samples(xgrid)
density_score = numpy.exp(density_score)
If it helps, I am using version 0.15.2 of scikit-learn. I've tried this successfully with scipy.stats.gaussian_kde, so there is no problem with the data.
With statement B, I had the same issue with this error:
ValueError: query data dimension must match training data dimension
The issue here is that you have 1-D array data, but when you feed it to the fit() function, it assumes that you have only one data point with many dimensions! So for example, if your training data has 100000 points, your d should be 100000x1, but fit() takes it as 1x100000.
So you should reshape d before fitting, with d.reshape(-1, 1), and do the same for xgrid with xgrid.reshape(-1, 1):
density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d.reshape(-1,1))
density_score = density.score_samples(xgrid.reshape(-1,1))
Note: the issue with statement A is that you are calling score_samples on an object which has not been fit yet!
You need to call the fit() function before you can evaluate the density.
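Putting both fixes together, a minimal self-contained sketch (with synthetic data standing in for d):
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic 1-D data standing in for d
d = np.random.normal(loc=0.0, scale=1.0, size=1000)
xgrid = np.linspace(d.min(), d.max(), 1000)

# fit() and score_samples() both expect 2-D arrays of shape (n_samples, n_features)
density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d.reshape(-1, 1))
log_density = density.score_samples(xgrid.reshape(-1, 1))
pdf_estimate = np.exp(log_density)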