Data shape to match statsmodels - python

I am trying to use statsmodels for a panel and have an issue with the shape of my data. My model is a TVP-VAR for a panel, written as a normal linear state space model composed of a state equation and a measurement equation; I have managed to write it as in eq. 33 of Canova and Ciccarelli (2013).
The key model equation, where χ_t = X_t' Ξ and u_t = X_t' v_t + e_t with u_t ~ N(0, σ²(I + σ² X_t' X_t)), is attached:
[image: Key Model Equation]
I am using exactly this class of models from your site: "TVP-VAR, MCMC, and sparse simulation smoothing".
https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html
When I run the model locally, I get the attached graphs for "Simulations based on KFS approach, MLE parameters" and "Simulations based on CFA approach, MLE parameters", where some countries and years appear in an unexpected format.
[image: KFS and CFA unexpected output format]
I suspect it has to do with the shape of the data I am using; you can see my actual data shape in the attached screenshot.
When I run "Simulations with alternative parameterization yielding a smoother trend", among the errors I get is:
"'value' must be an instance of str or bytes, not a tuple"
in addition to an earlier warning:
"An unsupported index was provided and will be ignored when, e.g., forecasting. self._init_dates(dates, freq)"
I suspect that has to do with my data shape and index. My dataset is in long format; a screenshot is here:
[image: Data shape]
My question is a bit naive: how do I reshape my data to be compatible with statsmodels? How do I rewrite my code to bring my data into an acceptable shape to run the TVP-VAR, MCMC, and sparse simulation smoothing example?
I hope it is clear what I am looking for. The code I am currently using to import the data is:
%matplotlib inline
from importlib import reload
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import invwishart, invgamma

# Read the Stata file
import pyreadstat
dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()

labels = list(meta.column_labels)
columns = list(meta.column_names)

# Panel data settings: (country, year) MultiIndex, keeping year as a column too
year = pd.Categorical(dta.year)
dta = dta.set_index(["country", "year"])
dta["year"] = year
dta.head()
I would appreciate it if you could help me set the right data shape acceptable to statsmodels.
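For reference, here is a minimal sketch of the reshaping I believe that notebook expects: one row per time period, one column per series, with an explicit date index. The value column 'gdp' below is a placeholder for whatever variable is actually being modeled:

import pandas as pd
import pyreadstat

# Load the raw long-format file: one row per (country, year) observation.
dta, meta = pyreadstat.read_dta('panel.dta')

# Pivot to wide: rows indexed by year, one column per country.
# 'gdp' is a hypothetical value column; substitute your own variable.
wide = dta.pivot(index='year', columns='country', values='gdp')

# Give the index an explicit annual frequency so statsmodels can handle dates.
wide.index = pd.PeriodIndex(wide.index.astype(int).astype(str), freq='Y')

With a PeriodIndex (or a DatetimeIndex carrying an explicit frequency), the "unsupported index" warning should also disappear.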

Related

Fixed effect regression model in Python

I have a book dataset and I want to build a fixed effects regression model.
I want fixed effects for year, month, day, and book_genre in my model, so that I take out the effects of repetition of the same books across multiple observations. I want to use Python code for my fixed effects model. My variables are:
Variables I want to fix: year, month, day, and book_genre.
Other variables in the model: Read_or_not (categorical); ne_factor, x1, x2, x3, x4, x5 (numerical).
Response variable: Y.
I used this code but I get the error "DataFrame input must have a MultiIndex with 2 levels".
I would highly appreciate help with fixing my code to build the fixed effects regression model.
I also attach a png of the dataset to show the variables.
import pandas as pd
from linearmodels import PanelOLS
import numpy as np

df = pd.read_csv('all_a.csv')

# Drop rows with missing values, then set the index for fixed effects
df = df.dropna(subset=['book_id', 'year', 'month', 'day', 'Read_or_not',
                       'ne_factor', 'Y', 'book_genre', 'X1', 'X2', 'X3', 'X4', 'X5'])
data = df.set_index(['year', 'month', 'day', 'book_genre'])

# Regression
FE = PanelOLS(data.attention_data_score, data['Y'],
              entity_effects=True,
              time_effects=True)

# Result
result = FE.fit(cov_type='clustered',
                cluster_entity=True,
                cluster_time=True)
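For reference, here is a minimal sketch of the two-level index shape PanelOLS actually accepts, assuming book_id as the entity dimension and a single date built from year/month/day as the time dimension; the remaining fixed effects can be absorbed as categorical dummies instead of index levels:

import pandas as pd
from linearmodels import PanelOLS

df = pd.read_csv('all_a.csv')

# PanelOLS wants exactly two index levels: entity and time.  Here book_id
# is assumed to be the entity; year/month/day are combined into a single
# datetime column to serve as the time level.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
data = df.set_index(['book_id', 'date'])

# EntityEffects absorbs the per-book fixed effect; other fixed effects
# can enter as dummies via the formula interface (C() marks a categorical).
model = PanelOLS.from_formula(
    'Y ~ 1 + X1 + X2 + X3 + X4 + X5 + C(book_genre) + EntityEffects',
    data=data,
)
result = model.fit(cov_type='clustered', cluster_entity=True)
print(result)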

StandardScaler.inverse_transform() returns the same array as input :/ Is sklearn broken or am I?

Good evening,
I'm currently pursuing a PhD in chemistry, and in this framework I'm trying to apply my limited knowledge of Python and statistics to discriminate samples based on their IR spectra.
After a few weeks of data acquisition I'm finally able to build my dataset, and I was about to see what PCA can offer (this was the easy part).
I was able to build my script and get the loadings, scores, and everything else I could possibly need or want. However, I used StandardScaler from sklearn.preprocessing to scale down my data, so (correct me if I'm wrong) I should get back loadings in this "standard scaled" space.
As my data are actual IR spectra, those loadings have a chemical meaning (even though they are not real spectra), e.g. if my PC1 loadings have a peak at XX cm-1, I know that samples with a high PC1 score are likely to contain compounds that absorb at this wavenumber.
So I want to reverse the StandardScaler transformation. I've tried StandardScaler.inverse_transform(), but it appears to return the same array that I gave it... which is very frustrating...
I'm trying to do the same thing with my sample spectra, but it gives me the same result again. Here is the portion of my script where I tried this:
Wavenumbers = DFF.columns
# in fact this is a little more complicated, but that's the spirit
Spectre = DFF.values.tolist()
# btw DFF is my pandas.DataFrame containing the spectra, with features = wavenumbers
SS = StandardScaler(copy=True)
DFF = SS.fit_transform(DFF)  # at this point I use SS for preprocessing before PCA

# I'm then trying to inverse SS and get back the first spectrum of the dataset
D = SS.inverse_transform(DFF[0])

# However at this point DFF[0] and D are almost exactly the same; I'm sure because:
plt.plot(Wavenumbers, D)
plt.plot(Wavenumbers, DFF[0])  # the curves are the same, and:
for i, j in enumerate(D):
    if j == DFF[0][i]:
        pass
    else:
        print("{}".format(j - DFF[0][i]))  # prints nothing bigger than 10e-16
The problem is more than likely my syntax or how I used StandardScaler, but I have no one around me to ask for help with it. Can anyone tell me what I did wrong? Or give me a hint on how I could get back my loadings in the "actual real IR spectra" space?
PS: sorry for the wacky English, and I hope this is understandable.
Good evening,
After putting the problem aside for a few days, I finally re-coded the function I needed (as suggested by Robert Dodier).
As a reminder, I wanted a function that could take my data from a pandas DataFrame and mean-center it in order to do PCA, but that could also reverse the preprocessing for later use.
Here is the code I ended up with:
import pandas as pd
import numpy as np

class Scaler:
    std = []
    mean = []

    def fit(self, DF):
        self.std = []
        self.mean = []
        for c in DF.columns:
            self.std.append(DF[c].std())
            self.mean.append(DF[c].mean())

    def transform(self, DF):
        X = np.zeros(shape=DF.shape)
        for i, c in enumerate(DF.columns):
            for j in range(len(DF.index)):
                X[j][i] = (DF[c][j] - self.mean[i]) / self.std[i]
        return X

    def reverse(self, X):
        Y = np.zeros(shape=X.shape)
        for i in range(len(X[0])):
            for j in range(len(X)):
                Y[j][i] = X[j][i] * self.std[i] + self.mean[i]
        return Y

    def fit_transform(self, DF):
        self.fit(DF)
        return self.transform(DF)
It's pretty slow and surely very low-tech, but it seems to do the job just fine. I hope it will save some time for other Python beginners.
I designed it to be as close as I could to what sklearn.preprocessing.StandardScaler does.
example:
S = Scaler()         # create the scaler object
S.fit(DF)            # fit the scaler to the dataframe (calculates mean and std for every column; DF must be a pd.DataFrame)
X = S.transform(DF)  # returns a np.array with mean-centered data
Y = S.reverse(X)     # reverses the transformation to get back the original data
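For comparison, here is a vectorized sketch of the same round trip using sklearn's own fitted attributes (StandardScaler exposes mean_ and scale_ after fitting), which is essentially what inverse_transform computes:

import numpy as np
from sklearn.preprocessing import StandardScaler

SS = StandardScaler()
X = SS.fit_transform(DFF)  # DFF: the DataFrame of spectra from above

# Undo the scaling by hand with the fitted attributes; this mirrors
# what SS.inverse_transform(X) computes.
original = X * SS.scale_ + SS.mean_
print(np.allclose(original, DFF.values))  # expect True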
Again, sorry for the hastily typed English. And thanks to Robert for taking the time to answer.

How to find whether my time series is additive or multiplicative using only a statistical test in Python?

I am trying to find whether my time series is additive or multiplicative. I tried a two-part approach:
If the variance is high and varies with time, i.e. high variability, then the series is multiplicative, otherwise it is additive; but I am confused whether this should be judged on the detrended series or the original.
I have converted the blog code here. The converted Python code is:
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv")
df['Month'] = pd.to_datetime(df['Month'])
# Set the column 'Month' as index (skip if already done)
df = df.set_index('Month')

## Trend
df['Trend'] = [0] * (7 - 1) + list(df['Passengers'].rolling(window=7).mean().iloc[7 - 1:].values)

# De-trend the data
df['detrended_a'] = df['Passengers'] - df['Trend']
df['detrended_m'] = df['Passengers'] / df['Trend']

### Make seasonals
df['seasonal_a'] = [0] * (7 - 1) + list(df['detrended_a'].rolling(window=7).mean().iloc[7 - 1:].values)
df['seasonal_m'] = [0] * (7 - 1) + list(df['detrended_m'].rolling(window=7).mean().iloc[7 - 1:].values)

### Residuals
df['residual_a'] = df['detrended_a'] - df['seasonal_a']
df['residual_m'] = df['detrended_m'] / df['seasonal_m']

import statsmodels.api as sm
acf_a = sum(pd.Series(sm.tsa.acf(df['residual_a'])).fillna(0))
acf_m = sum(pd.Series(sm.tsa.acf(df['residual_m'])).fillna(0))

if acf_a > acf_m:
    print("Additive")
else:
    print("Multiplicative")
But I am still not sure about its usability across different kinds of series. I would appreciate it if anyone could help me improve this method or suggest a better one.
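One possible refinement (a sketch, not a definitive test): let statsmodels do both decompositions with seasonal_decompose and pick the model whose residuals carry the least absolute autocorrelation, i.e. look most like white noise. period=12 is an assumption for monthly data with yearly seasonality:

import pandas as pd
import statsmodels.api as sm

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')

# Decompose both ways; period=12 assumes monthly data, yearly seasonality.
add = sm.tsa.seasonal_decompose(df['Passengers'], model='additive', period=12)
mul = sm.tsa.seasonal_decompose(df['Passengers'], model='multiplicative', period=12)

# The decomposition whose residuals show less autocorrelation fits better.
acf_add = sum(abs(sm.tsa.acf(add.resid.dropna(), nlags=24)))
acf_mul = sum(abs(sm.tsa.acf(mul.resid.dropna(), nlags=24)))
print('Additive' if acf_add < acf_mul else 'Multiplicative')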

sktime ARIMA invalid frequency

I am trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series, then fit the model on the training sample; when I try to predict, an error occurs.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date']).set_index('date').T.iloc[0]

p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False)

# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset; however, all the values in the series become NaN.
What is interesting is that this code works with the default load_airline dataset from sktime.datasets, but not with my dataset from GitHub.
I get a different error: ValueError: ``unit`` missing, possibly due to version difference. Anyhow, I'd say it is better to have your dataframe's index as a pd.PeriodIndex instead of a pd.DatetimeIndex. The former is, I think, more explicit (e.g. a monthly series has its time steps as periods, not exact dates) and works more smoothly. So, after reading the csv,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
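Putting it together, a minimal version of the loading step with that fix (same URL and split as in the question):

import pandas as pd
from sktime.forecasting.model_selection import temporal_train_test_split

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date']).set_index('date').T.iloc[0]
df.index = pd.PeriodIndex(df.index, freq='M')  # monthly PeriodIndex

y_train, y_test = temporal_train_test_split(df, test_size=24)
# fitting and predicting then proceed exactly as in the question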

Issues with Pymc3 Summary

I am currently struggling to obtain a summary of the statistics of a model I ran a Bayesian regression on. I first used Lasso and model selection to filter the best variables, then used pm.Model to obtain the regression proper.
Of course, having 'filtered' out the explanatory variables that weren't relevant, the shape of the X matrix had changed. The data I worked on is the load_boston dataset from sklearn.datasets. I used the data as the independent variables and the target as the dependent variable.
Having performed model selection with SelectFromModel, I used the get_support method to obtain the indices of the retained variables. I then looped over both the indices of all variables and the numbers contained in the support, with the purpose of storing the names of the retained variables in an empty list I had created ad hoc. The code looks something like this:
import pandas as pd
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

np.random.seed(9)

# Load the boston dataset.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston['data'], boston['target']

# Here is the code for the estimator LassoCV
# Here is the code for Model Selection
supp = sfm.get_support(indices=True)  # the list of indices of retained variables
X_transform = sfm.transform(X)        # remove the unnecessary variables

# Here is the line for linear modeling
# I initialize some useful variables
m = y.shape[0]
n = X.shape[1]
c = supp.shape[0]
L = boston['feature_names']
varnames = []
for i in range(0, n):
    for j in range(0, c):
        if i == supp[j]:
            varnames.append(L[i])
pm.summary(trace, varnames=varnames)
The console then displays 'KeyError: RM', which is one of the names of the variables used. One issue I noticed is that every element of varnames is a numpy str_ object, meaning that I can't read the names of the retained variables in the list unless I double-click on them.
How could I fix this? I have no clue what I am doing wrong.
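As a hedged guess at the cause: pm.summary looks up the names of random variables declared inside the pm.Model context, not the dataset's feature names, so 'RM' raises a KeyError unless a variable was literally given that name. A minimal sketch of the usual pattern, with hypothetical variable names ('betas', 'intercept', 'sigma', 'obs'):

import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # One coefficient per retained column; 'betas' is the name pm.summary
    # keys on, not the feature names stored in varnames.
    betas = pm.Normal('betas', mu=0, sd=10, shape=X_transform.shape[1])
    intercept = pm.Normal('intercept', mu=0, sd=10)
    sigma = pm.HalfNormal('sigma', sd=1)
    mu = intercept + pm.math.dot(X_transform, betas)
    pm.Normal('obs', mu=mu, sd=sigma, observed=y)
    trace = pm.sample(1000)

# Summarize by the model's variable names; 'RM' is a feature name,
# not a variable in the trace, hence the KeyError.
pm.summary(trace, varnames=['betas', 'intercept', 'sigma'])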
