PatsyError: Error evaluating factor: NameError: - python

I am an absolute newbie in Python programming and currently learning basic statistics on it.
I am facing a
"PatsyError: Error evaluating factor: NameError:"
on a code with pred = model.predict(pd.DataFrame(calo['wt'])
Below is my code:
# For reading data set
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading a csv file using pandas library
calo=pd.read_csv("/Users/Sanjeev/Desktop/Excel R Assignments/Simple Linear Regression/calories_consumed.csv")
calo.columns = ['wt','cal']
np.corrcoef(calo.wt,calo.cal)
plt.plot(calo.wt,calo.cal,"bo");plt.xlabel("WEIGHT");plt.ylabel("CALORIES")
# For preparing linear regression model we need to import the statsmodels.formula.api
import statsmodels.formula.api as smf
model = smf.ols("wt~cal",data=calo).fit()
# For getting coefficients of the varibles used in equation
model.params
# P-values for the variables and R-squared value for prepared model
model.summary()
model.conf_int(0.05) # 95% confidence interval
pred = model.predict(pd.DataFrame(calo['wt']))
This throws up an error:
Traceback (most recent call last):
File "<ipython-input-43-4fcbf1ee1921>", line 1, in <module>
pred = model.predict(pd.DataFrame(calo['wt']))
File "/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py", line 837, in predict
exog = dmatrix(design_info, exog, return_type="dataframe")
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 291, in dmatrix
NA_action, return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/highlevel.py", line 169, in _do_highlevel_design
return_type=return_type)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 888, in build_design_matrices
value, is_NA = _eval_factor(factor_info, data, NA_action)
File "/anaconda3/lib/python3.7/site-packages/patsy/build.py", line 63, in _eval_factor
result = factor.eval(factor_info.state, data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
data)
File "/anaconda3/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "/anaconda3/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'cal' is not defined
wt~cal
^^^
Need your help to resolve this.
Thanks in advance. :)

Looking at the statsmodels API here, it looks like they expect the parameters as input, rather than the covariates.
So what you probably want is
pred = model.predict(model.params)

you need to put a variable based on which you are going to decide dependent variable(y)
model = statsmodels.formula.api.ols('y ~x ',data=df)
model.predict(pd.DataFrame(df['x']))

I was having this problem. I was doing something like this:
for _, i in frame.iterrows()
model.predict(i)
This doesn't provide it with the necessary headers. You have to do this:
for _, i in frame.iterrows()
model.predict(pd.DataFrame([i]))

Related

Pgmpy: expectation maximization for bayesian networks parameter learning with missing data

I'm trying to use the PGMPY package for python to learn the parameters of a bayesian network. If I understand expectation maximization correctly, it should be able to deal with missing values. I am currently experimenting with a 3 variable BN, where the first 500 datapoints have a missing value. There are no latent variables. Although the description in pgmpy suggests that it should work with missing values, I get an error. This error only occurs when calling the function with datapoints that have missing values. Am I doing something wrong? Or should I look into another package for EM with missing values?
#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]
# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)
# create some missing values
new_data["Smoker"][:500] = np.NaN
# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
The error I get is an indexing error:
File "main.py", line 100, in <module>
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
weighted_data = self._compute_weights(latent_card)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
return self.apply_standard()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
likelihood *= cpd.get_value(
File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Thanks!
Since there is still no answer to your specific question, let me propose a solution with another module:
#import
import pandas as pd
import numpy as np
import pyAgrum as gum
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
test_data = data[:2000]
new_data = data[2000:].copy()
# Learn structure of initial model from data
learner=gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model=learner.learnBN()
# create some missing values
new_data["smoking"][:500] = "?" # instead of NaN
# learn parameterization of BN
bn = gum.BayesNet(model)
learner2=gum.BNLearner(new_data,model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)
In a notebook :

Pybats forecast error when using Poisson family distribution

I am building a timeseries PyBats model using a Poisson distribution to signify the distribution of observations. My model instantiation looks like this
model = define_dglm(
Y=data.actual.values,
X=None,
family="poisson",
k=1,
prior_length=8,
dates=data["month"],
ntrend=2,
seasPeriods=[],
seasHarmComponents=[],
nsamps=10000,
)
Where data.actual.values is a numpy array of integers. After instantiating the model, in order to forecast into the future with pybats I run
forecast_samples = model.forecast_path(k=steps_to_forecast, X=X_future, nsamps=10000)
and get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 289, in forecast_path
return forecast_path_copula(self, k, X, nsamps, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 211, in forecast_path_copula
return forecast_path_copula_sim(mod, k, lambda_mu, lambda_cov, nsamps, t_dist, nu)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in forecast_path_copula_sim
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in <lambda>
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 477, in simulate_from_sampling_model
return np.random.poisson(rate, [nsamps])
File "mtrand.pyx", line 3573, in numpy.random.mtrand.RandomState.poisson
File "_common.pyx", line 824, in numpy.random._common.disc
File "_common.pyx", line 621, in numpy.random._common.discrete_broadcast_d
File "_common.pyx", line 355, in numpy.random._common.check_array_constraint
ValueError: lam value too large
I have tried converting my Y array to floats, and have tried replacing all 0 values with 1 and get the same error. What is causing this error?
The issue is in exceeding the maximum value allowed in numpy.random.poisson. It looks like any value larger than np.random.poisson(1E19) will cause this error.
A couple things you can try:
Use a longer prior length than 8 when defining the model. This will help produce more stable estimates of the coefficients. After defining your model, check what the coefficient mean vector (model.a) and covariance matrix (model.R) are, to make sure they're reasonable. If they're not, you can change them manually.
If some of your 'Y' values are truly that large, a Poisson model is probably not appropriate. I would suggest modeling log(Y) using the normal dlm model in Pybats.
I hope that this help!
Thanks,
Isaac

Reasons for memory error. Numpy division by standard deviation

I'm trying to normalize my training and test set for MNIST dataset. Here's my code
import numpy as np
import pandas as pd
prediction = pd.read_csv("sample_submission.csv")
test_csv = pd.read_csv("test.csv")
train_csv = pd.read_csv("train.csv")
train = train_csv.values.T # turn train set data frame to numpy array
test = test_csv.values.T
y_values = train[[0], :] # bring y values [3,1,4,6,2,0,...]
train = train[1:, :]
y = np.zeros((10, y_values.shape[1]))
for i in range(y_values.shape[1]):
y[y_values[0][i]][i] = 1 # one-hot encoding
# scaling data set values to range (0,1)
train = np.divide(train, np.std(train))
test = np.divide(test, np.std(test))
everything seems to be working except that it gives me memory error on the last part where I try to divide the test set with its standard deviation.
Traceback (most recent call last):
File "C:/Users/falco/PycharmProjects/Digit-Recognizer/main.py", line 26, in <module>
test = np.divide(test, np.std(test))
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 3242, in std
**kwargs)
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 140, in _std
keepdims=keepdims)
File "C:\Users\falco\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 117, in _var
x = asanyarray(arr - arrmean)
MemoryError
Any help/ideas on why this is happening would be greatly appreciated!

Numpy Error "Could not convert string to float: 'Illinois'"

I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried to highlight and run the code line by line and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually the full error you are getting is this (which would help tremendously if you pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
which, if you look carefully, points out where it is failing:
imputer = imputer.fit(X[:, 1:5 ])
which is due to your effort in taking mean of a categorical variable, which, doesn't make sense, and
which is already asked and answered in this StackOverflow thread.
Change the line:
dataset = pd.read_csv('Data2.csv')
by:
dataset = pd.read_csv('Data2.csv', delimiter=";")

Calculate logistic regression in python

I tried to calculate logical regression. I have the data as csv file.
it looks like
node_id,second_major,gender,major_index,year,dorm,high_school,student_fac
0,0,2,257,2007,111,2849,1
1,0,2,271,2005,0,51195,2
2,0,2,269,2007,0,21462,1
3,269,1,245,2008,111,2597,1
..........................
This is my coding.
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("Reed98.csv")
print df.describe()
dummy_ranks = pd.get_dummies(df['second_major'], prefix='second_major')
cols_to_keep = ['second_major', 'dorm', 'high_school']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'year':])
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
logit = sm.Logit(data['second_major'], data[train_cols])
result = logit.fit()
print result.summary()
When I run the coding in python I got an error:
Traceback (most recent call last):
File "D:\project\logisticregression.py", line 24, in <module>
result = logit.fit()
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 282, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\discrete\discrete_model.py", line 233, in fit
disp=disp, callback=callback, **kwargs)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6- win32.egg\statsmodels\base\model.py", line 291, in fit
hess=hess)
File "c:\python26\lib\site-packages\statsmodels-0.5.0-py2.6-win32.egg\statsmodels\base\model.py", line 341, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "C:\Python26\Lib\site-packages\numpy\linalg\linalg.py", line 328, in solve
raise LinAlgError('Singular matrix')
LinAlgError: Singular matrix
How to rewrite the code?
There's nothing wrong with your code. My guess is that you have missing values in your data. Try a dropna or use missing='drop' to Logit. You might also check that the right hand side is full rank np.linalg.matrix_rank(data[train_cols].values)

Categories

Resources