statsmodels AR model error when calling params - python

I'm new to statsmodels and am trying to use statsmodels.tsa.ar_model to fit a pandas time series.
#pull one series from dataframe
y=data.sentiment
armodel=sm.tsa.ar_model.AR(y, freq='D').fit()
armodel.params()
This gets the following error:
C:\Python27\lib\site-packages\pandas\lib.pyd in pandas.lib.SeriesIndex.__set__ (pandas\lib.c:27817)()
AssertionError: Index length did not match values
Any ideas?

You should upgrade to current master, if you can. This was fixed here.
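As a side note, params on a fitted statsmodels results object is an attribute rather than a method, so once the fix is in place it is read without parentheses; the traceback itself comes from the library bug the answer links to. A minimal sketch, assuming data is a DataFrame with a daily sentiment column as in the question:
import statsmodels.api as sm
# pull one series from the dataframe, as in the question
y = data.sentiment
# fit the AR model on the daily series
armodel = sm.tsa.ar_model.AR(y, freq='D').fit()
# params is an attribute of the results object, so no parentheses
print(armodel.params)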

Related

pandas dataframe to_csv() with get_handle() error [duplicate]

I had a big table which I sliced into many smaller tables based on their dates:
dfs = {}
for fecha in fechas:
    dfs[fecha] = df[df['date'] == fecha].set_index('Hour')
# now I can access the tables like this:
dfs['2019-06-23'].head()
I have made some modifications to the specific table dfs['2019-06-23'] and now I would like to save it on my computer. I have tried to do this in two ways:
#first try:
dfs['2019-06-23'].to_csv('specific/path/file.csv')
#second try:
test=dfs['2019-06-23']
test.to_csv('test.csv')
Both of them raised this error:
TypeError: get_handle() got an unexpected keyword argument 'errors'
I don't know why I get this error and haven't found any reason for it. I have saved many files this way but have never had this problem before.
My goal: to be able to save this dataframe, after my modifications, as a CSV.
If you are getting this error, there are two things to check:
Whether the object is actually a Series rather than a DataFrame - see (Pandas : to_csv() got an unexpected keyword argument)
Your numpy version. For me, updating to numpy==1.20.1 with pandas==1.2.2 fixed the problem. If you are using Jupyter notebooks, remember to restart the kernel afterwards.
In the end, what worked was to wrap the table in pd.DataFrame and then export it as follows:
to_export=pd.DataFrame(dfs['2019-06-23'])
to_export.to_csv('my_table.csv')
That surprised me, because when I checked the type of the table at the time of the error it was already a DataFrame. However, this way it works.
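For completeness, a minimal sketch of the two checks from the answer above, assuming dfs is the dict of tables from the question (the version check is only a diagnostic, not a fix in itself):
import numpy as np
import pandas as pd
# check the installed versions; mismatched pandas/numpy builds can trigger this error
print(pd.__version__, np.__version__)
# check that the object really is a DataFrame and not a Series
table = dfs['2019-06-23']
print(type(table))
# if it is a Series, convert it before exporting
if isinstance(table, pd.Series):
    table = table.to_frame()
table.to_csv('my_table.csv')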

Unable to implement MICE in Python

I'm trying to use statsmodels package of MICE to impute values for my columns. I'm unable to figure out how exactly to use it. Whatever I run, it throws the error: ValueError: variable to be imputed has no observed values
Code:
df=pd.read_csv('contacts.csv', engine='c',low_memory=False)
from statsmodels.imputation.mice import MICEData as md
md(df)
What am I doing wrong?
At least one of the columns in the data frame (and hence in the CSV) is empty.
Check the dataframe; you may have to clean it up or normalize it.
Also, don't be afraid to look into the code base.
What you are looking for is the _split_indices method of MICEData.
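A minimal sketch of that check, assuming df is the frame loaded from contacts.csv; dropping the empty columns is one possible clean-up, not the only one:
import pandas as pd
from statsmodels.imputation.mice import MICEData
df = pd.read_csv('contacts.csv', engine='c', low_memory=False)
# columns with no observed values at all cannot be imputed
empty_cols = df.columns[df.isna().all()]
print(list(empty_cols))
# one option: drop them before handing the frame to MICEData
imp = MICEData(df.drop(columns=empty_cols))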

feature_names must be unique - Xgboost

I am running the xgboost model for a very sparse matrix.
I am getting this error. ValueError: feature_names must be unique
How can I deal with this?
This is my code.
yprob = bst.predict(xgb.DMatrix(test_df))[:,1]
According to the xgboost source code, this error occurs in only one place - in a DMatrix internal function. Here's the relevant excerpt:
if len(feature_names) != len(set(feature_names)):
    raise ValueError('feature_names must be unique')
So, the error text is pretty literal here; your test_df has at least one duplicate feature/column name.
You've tagged pandas on this post; that suggests test_df is a Pandas DataFrame. In this case, DMatrix literally runs df.columns to extract feature_names. Check your test_df for repeat column names, remove or rename them, and then try DMatrix() again.
Assuming the problem is indeed that columns are duplicated, the following line should solve your problem:
test_df = test_df.loc[:,~test_df.columns.duplicated()]
Source: python pandas remove duplicate columns
This line should identify which columns are duplicated:
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
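Putting those pieces together, a minimal sketch, assuming test_df and the trained booster bst from the question:
import xgboost as xgb
# list any repeated column names
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
print(list(duplicate_columns))
# keep only the first occurrence of each column, then predict as before
test_df = test_df.loc[:, ~test_df.columns.duplicated()]
yprob = bst.predict(xgb.DMatrix(test_df))[:, 1]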
Another way around this is to make sure the column names are unique while preparing the data; then it should work out.
I converted the DataFrame to np.array(df), and my problem was solved.

Pandas apply function issue

I have a dataframe (data) of numeric variables and I want to analyse the distribution of each column by using the Shapiro test from scipy.
from scipy import stats
data.apply(stats.shapiro, axis=0)
But I keep getting the following error message:
ValueError: ('could not convert string to float: M', u'occurred at index 0')
I've checked the documentation and it says the first argument of the apply function should be a function, which stats.shapiro is (as far as I'm aware).
What am I doing wrong, and how can I fix it?
Found the problem. There was a column of type object, which produced the error message above. Applying the function only to numeric columns solved the issue.
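A minimal sketch of that fix, assuming data is the DataFrame from the question:
from scipy import stats
# keep only the numeric columns before applying the test
numeric_data = data.select_dtypes(include='number')
results = numeric_data.apply(stats.shapiro, axis=0)
print(results)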

Using statsmodels.seasonal_decompose() without DatetimeIndex but with Known Frequency

I have a time-series signal I would like to decompose in Python, so I turned to statsmodels.seasonal_decompose(). My data has a frequency of 48 (half-hourly). I was getting the same error as this questioner, where the solution was to change from an integer index to a DatetimeIndex. But I don't know the actual dates/times my data is from.
In this github thread, one of the statsmodels contributors says that
"In 0.8, you should be able to specify freq as keyword argument to
override the index."
But this seems not to be the case for me. Here is a minimal code example illustrating my issue:
import pandas as pd
import statsmodels.api as sm
dta = pd.Series([x % 3 for x in range(100)])
decomposed = sm.tsa.seasonal_decompose(dta, freq=3)
AttributeError: 'RangeIndex' object has no attribute 'inferred_freq'
Version info:
import statsmodels
print(statsmodels.__version__)
0.8.0
Is there a way to decompose a time-series in statsmodels with a specified frequency but without a DatetimeIndex?
If not, is there a preferred alternative for doing this in Python? I checked out the Seasonal package, but its github lists 0 downloads/month, one contributor, and last commit 9 months ago, so I'm not sure I want to rely on that for my project.
Thanks to josef-pkt for answering this on github. There is a bug in statsmodels 0.8.0 where it always attempts to calculate an inferred frequency based on a DatetimeIndex, if passed a Pandas object.
The workaround when using Pandas series is to pass their values in a numpy array to seasonal_decompose(). For example:
import pandas as pd
import statsmodels.api as sm
my_pandas_series = pd.Series([x % 3 for x in range(100)])
decomposed = sm.tsa.seasonal_decompose(my_pandas_series.values, freq=3)
(no errors)
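When a numpy array is passed in, the components of the result come back as plain arrays, so the original index has to be re-attached by hand if it is still wanted. A minimal sketch, continuing from the example above:
import pandas as pd
# trend, seasonal and resid are attributes of the DecomposeResult
trend = pd.Series(decomposed.trend, index=my_pandas_series.index)
seasonal = pd.Series(decomposed.seasonal, index=my_pandas_series.index)
resid = pd.Series(decomposed.resid, index=my_pandas_series.index)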
