I am building a time series model with PyBats, using a Poisson distribution for the observations. My model instantiation looks like this:
model = define_dglm(
Y=data.actual.values,
X=None,
family="poisson",
k=1,
prior_length=8,
dates=data["month"],
ntrend=2,
seasPeriods=[],
seasHarmComponents=[],
nsamps=10000,
)
Where data.actual.values is a numpy array of integers. After instantiating the model, in order to forecast into the future with pybats I run
forecast_samples = model.forecast_path(k=steps_to_forecast, X=X_future, nsamps=10000)
and get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 289, in forecast_path
return forecast_path_copula(self, k, X, nsamps, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 211, in forecast_path_copula
return forecast_path_copula_sim(mod, k, lambda_mu, lambda_cov, nsamps, t_dist, nu)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in forecast_path_copula_sim
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in <lambda>
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 477, in simulate_from_sampling_model
return np.random.poisson(rate, [nsamps])
File "mtrand.pyx", line 3573, in numpy.random.mtrand.RandomState.poisson
File "_common.pyx", line 824, in numpy.random._common.disc
File "_common.pyx", line 621, in numpy.random._common.discrete_broadcast_d
File "_common.pyx", line 355, in numpy.random._common.check_array_constraint
ValueError: lam value too large
I have tried converting my Y array to floats, and have tried replacing all 0 values with 1 and get the same error. What is causing this error?
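The error comes from exceeding the maximum rate that numpy.random.poisson accepts. On a typical 64-bit build the limit is roughly 9.22E18, so a call such as np.random.poisson(1E19) raises this error. In other words, the forecast rate your model produces has blown up to an enormous value.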
A couple things you can try:
Use a longer prior length than 8 when defining the model. This will help produce more stable estimates of the coefficients. After defining your model, check what the coefficient mean vector (model.a) and covariance matrix (model.R) are, to make sure they're reasonable. If they're not, you can change them manually.
If some of your 'Y' values are truly that large, a Poisson model is probably not appropriate. I would suggest modeling log(Y) using the normal dlm model in Pybats.
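To see the limit in isolation, here is a minimal NumPy-only sketch. It does not use PyBats; the helper rate_is_safe and its 1e18 threshold are hypothetical choices made for illustration, picked to sit comfortably below the limit on a 64-bit build.

```python
import numpy as np


def rate_is_safe(rate, limit=1e18):
    """Return True if every rate is finite and below a conservative limit.

    `limit` is an assumed, deliberately cautious threshold, not a NumPy constant.
    """
    rate = np.asarray(rate, dtype=float)
    return bool(np.all(np.isfinite(rate)) and np.all(rate < limit))


# Reproduce the error from the traceback with a rate above the limit.
try:
    np.random.poisson(1e19, 5)
    raised = False
except ValueError as err:
    raised = True
    message = str(err)

# Ordinary rates pass the check; an exploded rate does not.
ok_small = rate_is_safe([3.0, 250.0])
ok_huge = rate_is_safe([10.0, 1e19])
```

A check like this on the forecast rate (or on exp of the forecast mean) can tell you whether the priors have drifted before you sample from them.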
I hope this helps!
Thanks,
Isaac
For a project that I am working on, I created a linear regression model. After fitting that line, I wanted to simulate the data over and over again using np.random.choice on my data to see how much the regression line would vary if the data were collected again. However, I keep getting a KeyError in my function and I am not sure how to fix it.
Here is a head of what the data looks like:
I ran a linear regression model on the columns 'nsb' and 'r'.
Here are my functions that repeatedly create linear regression models for 'bootstrapped' data:
When I call this:
slope, int = draw_bs_pairs_linreg(big_df['nsb'], big_df['r'], size = 1000)
I get this error. The length and values of the list of numbers change each time I run it.
KeyError: '[2, 567, 459, 458, 355, 230, 353, 565, 231, 566, 117] not in index'
Any help would be appreciated.
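The bootstrap indices are positions from 0 to n-1, but they are being looked up as index labels, and presumably your DataFrame's index no longer runs from 0 to n-1 (for example after filtering rows). You need to call DataFrame.reset_index before calling your function: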
big_df = big_df.reset_index(drop=True)
Or index by position with .iloc:
bs_x, bs_y = x.iloc[bs_inds], y.iloc[bs_inds]
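Since the original function body is not shown, the following is only an assumed reconstruction of what a pairs bootstrap like draw_bs_pairs_linreg might look like, written with .iloc so that it works whatever the index labels are. The data is made up for illustration.

```python
import numpy as np
import pandas as pd


def draw_bs_pairs_linreg(x, y, size=1, seed=None):
    """Pairs bootstrap for a straight-line fit (hypothetical reconstruction)."""
    rng = np.random.default_rng(seed)
    inds = np.arange(len(x))
    slopes = np.empty(size)
    intercepts = np.empty(size)
    for i in range(size):
        # Resample row positions with replacement.
        bs_inds = rng.choice(inds, size=len(inds))
        # .iloc selects by position, so the index labels do not matter.
        bs_x = x.iloc[bs_inds].to_numpy()
        bs_y = y.iloc[bs_inds].to_numpy()
        slopes[i], intercepts[i] = np.polyfit(bs_x, bs_y, 1)
    return slopes, intercepts


# Made-up data with a non-default index, similar to a filtered DataFrame.
rng = np.random.default_rng(0)
nsb = np.arange(20, dtype=float)
r = 2.0 * nsb + 1.0 + rng.normal(0.0, 0.5, size=20)
big_df = pd.DataFrame({"nsb": nsb, "r": r}, index=np.arange(100, 120))

slopes, intercepts = draw_bs_pairs_linreg(big_df["nsb"], big_df["r"], size=200, seed=1)
```

With plain x[bs_inds] the same call would fail on this DataFrame, because labels such as 0 or 5 do not exist in an index that starts at 100.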
Data source can be found here.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method continuously fails. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error, shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway: the fitted oh object now holds useful information for inspecting the results, and you can later call oh.transform on your test data.
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply calls oh.fit_transform on each column separately, passing it as a 1D Series, but OneHotEncoder expects a 2D input.
Using .values or np.array() casts your dataframe to a numpy array, but numpy has no apply method.
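As a small illustration of the fix, here is a sketch on made-up data; the column names and values are assumptions, not the Playstore dataset. It fits the encoder on the whole 2D selection once and then reuses it on new rows.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test frames (hypothetical columns).
train = pd.DataFrame({
    "rating": ["Everyone", "Teen", "Everyone"],
    "type": ["Free", "Paid", "Free"],
})
test = pd.DataFrame({"rating": ["Teen"], "type": ["Free"]})
oh_cols = ["rating", "type"]

# handle_unknown="ignore" encodes unseen categories as all zeros.
oh = OneHotEncoder(handle_unknown="ignore")

# Pass the 2D DataFrame selection directly, not column by column.
train_enc = oh.fit_transform(train[oh_cols])

# The fitted encoder remembers the categories and can transform new data.
test_enc = oh.transform(test[oh_cols])
categories = [list(c) for c in oh.categories_]
```

The result is a sparse matrix by default; call .toarray() on it if you need a dense NumPy array.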
I am an absolute newbie in Python programming and am currently learning basic statistics with it.
I am getting a
"PatsyError: Error evaluating factor: NameError:"
on the line pred = model.predict(pd.DataFrame(calo['wt']))
Below is my code:
# For reading data set
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading a csv file using pandas library
calo=pd.read_csv("/Users/Sanjeev/Desktop/Excel R Assignments/Simple Linear Regression/calories_consumed.csv")
calo.columns = ['wt','cal']
np.corrcoef(calo.wt,calo.cal)
plt.plot(calo.wt,calo.cal,"bo");plt.xlabel("WEIGHT");plt.ylabel("CALORIES")
# For preparing linear regression model we need to import the statsmodels.formula.api
import statsmodels.formula.api as smf
model = smf.ols("wt~cal",data=calo).fit()
# For getting coefficients of the variables used in equation
model.params
# P-values for the variables and R-squared value for prepared model
model.summary()
model.conf_int(0.05) # 95% confidence interval
pred = model.predict(pd.DataFrame(calo['wt']))
This throws up an error:
Traceback (most recent call last):
File "<ipython-input-43-4fcbf1ee1921>", line 1, in <module>
pred = model.predict(pd.DataFrame(calo['wt']))
File "/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py", line 837, in predict
exog = dmatrix(design_info, exog, return_type="dataframe")
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 291, in dmatrix
NA_action, return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 169, in _do_highlevel_design
return_type=return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 888, in build_design_matrices
value, is_NA = _eval_factor(factor_info, data, NA_action)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 63, in _eval_factor
result = factor.eval(factor_info.state, data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'cal' is not defined
wt~cal
^^^
Need your help to resolve this.
Thanks in advance. :)
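The formula "wt~cal" makes wt the response and cal the predictor, so the fitted model's predict method needs a DataFrame that contains a cal column. You passed only wt, so patsy cannot find cal, hence the NameError.
So what you probably want is
pred = model.predict(pd.DataFrame(calo['cal']))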
You need to pass the independent variable (x), the one the dependent variable (y) is predicted from:
model = statsmodels.formula.api.ols('y ~ x', data=df).fit()
model.predict(pd.DataFrame(df['x']))
I was having this problem. I was doing something like this:
for _, i in frame.iterrows():
    model.predict(i)
This doesn't provide it with the necessary column headers. You have to do this:
for _, i in frame.iterrows():
    model.predict(pd.DataFrame([i]))
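To tie the answers in this thread together, here is a minimal sketch on made-up data (the numbers are invented, only the column names wt and cal come from the question). It assumes the formula "wt ~ cal", so the predictor passed to predict must be a DataFrame with a cal column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: weight roughly proportional to calories.
rng = np.random.default_rng(0)
cal = np.linspace(1500.0, 3500.0, 20)
wt = 0.03 * cal + rng.normal(0.0, 1.0, size=20)
calo = pd.DataFrame({"wt": wt, "cal": cal})

# wt is the response, cal is the predictor.
model = smf.ols("wt ~ cal", data=calo).fit()

# Predict on the training predictor: the frame must contain 'cal'.
pred = model.predict(pd.DataFrame(calo["cal"]))

# Predict for new values, again under the column name used in the formula.
new_pred = model.predict(pd.DataFrame({"cal": [2000.0, 2500.0]}))
```

If you actually wanted to predict calories from weight, the formula would need to be "cal ~ wt" instead, and then a DataFrame with a wt column is the right input.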
I have a similarity score between 0 and 1 from each entry to every other entry in a 100 by 100 matrix. So e.g. [0,0] would be 1, [0,1] might be .54 etc. The similarity score was generated using Jensen-Shannon divergence.
I want to use Ward clustering (but am open to other suggestions) to cluster these together and I tried the following code:
x = np.array(mylist)
print x.shape#(100,100)
clustering = scipy.cluster.hierarchy.ward(x)
scipy.cluster.hierarchy.dendrogram(clustering)
but I am getting the error:
Traceback (most recent call last):
File "C:/Python27/ward.py", line 38, in <module>
clustering = scipy.cluster.hierarchy.ward(y)
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 465, in ward
return linkage(y, method='ward', metric='euclidean')
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 620, in linkage
y = _convert_to_double(np.asarray(y, order='c'))
File "C:\Python27\lib\site-packages\scipy\cluster\hierarchy.py", line 928, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Do I need to do some transformation on my array first or use some other method?
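One possible direction, offered only as a sketch under assumptions: the ValueError usually means the list did not convert to a rectangular numeric array (for example rows of different lengths), which is a guess here since mylist is not shown. Separately, scipy's linkage functions treat a square 2D input as observation vectors, not as pairwise scores, so a similarity matrix would first be turned into a condensed distance matrix. The example below uses made-up similarities, the conversion 1 - similarity, and average linkage instead of Ward, because Ward assumes Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric similarity matrix with ones on the diagonal.
rng = np.random.default_rng(0)
a = rng.random((6, 6))
sim = (a + a.T) / 2.0
np.fill_diagonal(sim, 1.0)

# Turn similarity into a dissimilarity (assumed conversion: 1 - similarity).
dist = 1.0 - sim

# linkage expects the condensed (upper-triangle) form of a distance matrix.
condensed = squareform(dist, checks=True)

# Average linkage works with arbitrary dissimilarities.
Z = linkage(condensed, method="average")

# Cut the tree into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The resulting Z can be passed to scipy.cluster.hierarchy.dendrogram just like the output of ward.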
I tried to fit a logistic regression. My data is in a CSV file
that looks like this:
node_id,second_major,gender,major_index,year,dorm,high_school,student_fac
0,0,2,257,2007,111,2849,1
1,0,2,271,2005,0,51195,2
2,0,2,269,2007,0,21462,1
3,269,1,245,2008,111,2597,1
..........................
This is my code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("Reed98.csv")
print df.describe()
dummy_ranks = pd.get_dummies(df['second_major'], prefix='second_major')
cols_to_keep = ['second_major', 'dorm', 'high_school']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'year':])
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
logit = sm.Logit(data['second_major'], data[train_cols])
result = logit.fit()
print result.summary()
When I run the code in Python I get an error:
Traceback (most recent call last):
File "D:\project\logisticregression.py", line 24, in <module>
result = logit.fit()
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\discrete\discrete_model.py", line 282, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\discrete\discrete_model.py", line 233, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 291, in fit
hess=hess)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 341, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 328, in solve
raise LinAlgError('Singular matrix')
LinAlgError: Singular matrix
How should I rewrite the code?
There's nothing wrong with your code. My guess is that you have missing values in your data. Try calling dropna on the data, or pass missing='drop' to Logit. You might also check that the right-hand side is full rank, i.e. that np.linalg.matrix_rank(data[train_cols].values) equals the number of columns.
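The two checks can be sketched on a small made-up design matrix (the column names and values below are invented for illustration). It has one missing value and a full set of dummy columns next to a constant, which is a common way to end up with a singular matrix.

```python
import numpy as np
import pandas as pd

# Made-up design matrix: const equals d0 + d1, so the columns are collinear.
X = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "d0": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "d1": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "x": [0.5, 1.2, np.nan, 3.1, 2.2, 0.7],
})

# Check 1: count missing values, then drop the affected rows.
n_missing = int(X.isna().sum().sum())
X_clean = X.dropna()

# Check 2: compare the rank with the number of columns.
rank_all = np.linalg.matrix_rank(X_clean.to_numpy())
is_full_rank = rank_all == X_clean.shape[1]

# Dropping one dummy column removes the collinearity.
X_fixed = X_clean.drop(columns=["d0"])
rank_fixed = np.linalg.matrix_rank(X_fixed.to_numpy())
```

If the rank is smaller than the number of columns, the Hessian in the Newton step cannot be inverted, which is what the LinAlgError is reporting.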