KeyError: [....] not found in index - python

For a project I am working on, I created a linear regression model. After fitting that line, I wanted to simulate the data over and over again using np.random.choice on my data, to see how much the regression line would vary if the data were recollected. However, I keep getting a KeyError in my function and I am not sure how to fix it.
Here is the head of what the data looks like:
I ran a linear regression model on the columns 'nsb' and 'r'.
Here are my functions that repeatedly create linear regression models for 'bootstrapped' data:
When I call this:
slope, int = draw_bs_pairs_linreg(big_df['nsb'], big_df['r'], size = 1000)
I get the following error; the length and the values in the list of row numbers change each time I run it.
KeyError: '[2, 567, 459, 458, 355, 230, 353, 565, 231, 566, 117] not in index'
Any help would be appreciated.

You need DataFrame.reset_index before calling your function:
big_df = big_df.reset_index(drop=True)
Or index with .iloc inside the function:
bs_x, bs_y = x.iloc[bs_inds], y.iloc[bs_inds]
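For reference, since the draw_bs_pairs_linreg function itself isn't shown above, here is a minimal sketch of what such a bootstrap function typically looks like (assuming x and y are pandas Series); the .iloc line is the part that avoids the KeyError:
import numpy as np

def draw_bs_pairs_linreg(x, y, size=1):
    # resample (x, y) pairs by position and fit a line to each resample
    inds = np.arange(len(x))
    bs_slopes = np.empty(size)
    bs_intercepts = np.empty(size)
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x.iloc[bs_inds], y.iloc[bs_inds]   # .iloc is positional, so no KeyError
        bs_slopes[i], bs_intercepts[i] = np.polyfit(bs_x, bs_y, 1)
    return bs_slopes, bs_intercepts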


Pgmpy: expectation maximization for Bayesian network parameter learning with missing data

I'm trying to use the pgmpy package for Python to learn the parameters of a Bayesian network. If I understand expectation maximization correctly, it should be able to deal with missing values. I am currently experimenting with a 3-variable BN, where the first 500 data points have a missing value. There are no latent variables. Although the description in pgmpy suggests that it should work with missing values, I get an error. The error only occurs when calling the function with data points that have missing values. Am I doing something wrong? Or should I look into another package for EM with missing values?
#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]
# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)
# create some missing values
new_data["Smoker"][:500] = np.NaN
# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
The error I get is an indexing error:
File "main.py", line 100, in <module>
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
weighted_data = self._compute_weights(latent_card)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
return self.apply_standard()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
likelihood *= cpd.get_value(
File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Thanks!
Since there is still no answer to your specific question, let me propose a solution with another module:
#import
import pandas as pd
import numpy as np
import pyAgrum as gum
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
test_data = data[:2000]
new_data = data[2000:].copy()
# Learn structure of initial model from data
learner=gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model=learner.learnBN()
# create some missing values
new_data["smoking"][:500] = "?" # instead of NaN
# learn parameterization of BN
bn = gum.BayesNet(model)
learner2=gum.BNLearner(new_data,model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)
In a notebook, the result can then be displayed (screenshot omitted).
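As a quick check in a script or notebook (a sketch; bn.cpt accepts a variable name), you can print the learned CPT of the variable that had missing values:
print(bn.cpt("smoking"))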

Pybats forecast error when using Poisson family distribution

I am building a time-series PyBats model, using a Poisson distribution for the observations. My model instantiation looks like this:
model = define_dglm(
Y=data.actual.values,
X=None,
family="poisson",
k=1,
prior_length=8,
dates=data["month"],
ntrend=2,
seasPeriods=[],
seasHarmComponents=[],
nsamps=10000,
)
where data.actual.values is a numpy array of integers. After instantiating the model, in order to forecast into the future with PyBats, I run:
forecast_samples = model.forecast_path(k=steps_to_forecast, X=X_future, nsamps=10000)
and get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 289, in forecast_path
return forecast_path_copula(self, k, X, nsamps, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 211, in forecast_path_copula
return forecast_path_copula_sim(mod, k, lambda_mu, lambda_cov, nsamps, t_dist, nu)
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in forecast_path_copula_sim
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/forecast.py", line 326, in <lambda>
return np.array(list(map(lambda prior: mod.simulate_from_sampling_model(prior, nsamps),
File "/opt/conda/lib/python3.8/site-packages/pybats/dglm.py", line 477, in simulate_from_sampling_model
return np.random.poisson(rate, [nsamps])
File "mtrand.pyx", line 3573, in numpy.random.mtrand.RandomState.poisson
File "_common.pyx", line 824, in numpy.random._common.disc
File "_common.pyx", line 621, in numpy.random._common.discrete_broadcast_d
File "_common.pyx", line 355, in numpy.random._common.check_array_constraint
ValueError: lam value too large
I have tried converting my Y array to floats, and I have tried replacing all 0 values with 1, but I get the same error. What is causing this error?
The issue is that the rate (lam) passed to numpy.random.poisson exceeds the maximum value it allows. It looks like any rate larger than about 1e19 will cause this error.
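You can reproduce the limit directly with numpy (a quick sketch; the exact cutoff is numpy's internal poisson_lam_max, roughly 9.2e18 on 64-bit Linux builds):
import numpy as np

np.random.poisson(1e18)    # works
np.random.poisson(1e19)    # ValueError: lam value too large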
A couple things you can try:
Use a longer prior length than 8 when defining the model. This will help produce more stable estimates of the coefficients. After defining your model, check what the coefficient mean vector (model.a) and covariance matrix (model.R) are, to make sure they're reasonable. If they're not, you can change them manually.
If some of your 'Y' values are truly that large, a Poisson model is probably not appropriate. I would suggest modeling log(Y) using the normal dlm model in Pybats.
I hope this helps!
Thanks,
Isaac

ML Code throws value error when transforming data

Data source can be found here.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method keeps failing. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error, shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway: the fitted oh object now holds useful information when you want to inspect the results, you can later call oh.transform on your test data, and so on.
Explaining the errors
Your data is in a pandas DataFrame. DataFrame.apply passes each column to oh.fit_transform as a 1D Series, but OneHotEncoder expects 2D input, hence "Expected 2D array, got 1D array instead".
Using .values or np.array() converts your DataFrame to a numpy array, and numpy arrays have no apply method, hence the AttributeError.
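A minimal end-to-end sketch of the fix, assuming the same data and oh_cols as above (get_feature_names_out requires scikit-learn 1.0 or later; older versions have get_feature_names):
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(handle_unknown="ignore")
data_enc = oh.fit_transform(data[oh_cols])       # sparse matrix, one row per sample
print(data_enc.shape)
print(oh.get_feature_names_out(oh_cols)[:5])     # names of the encoded columns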

ValueError: could not convert integer scalar in scipy indexing

During an NLP task, I transform a corpus of texts using TF-IDF, which yields a scipy.sparse.csr.csr_matrix.
I then split this data into train and test corpora, and resample my training corpus to tackle a class imbalance problem.
The issue I'm facing is that when I use the resampled index (taken from the labels, which are a pandas.Series) to index into the sparse matrix like this:
tfs[Ytr_resample.index]
it takes a long time and raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]
/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
348 csr_sample_values(self.shape[0], self.shape[1],
349 self.indptr, self.indices, self.data,
--> 350 num_samples, row.ravel(), col.ravel(), val)
351 if row.ndim == 1:
352 # row and col are 1d
ValueError: could not convert integer scalar
Following this thread, I checked that the largest element in the index is not bigger than the number of rows in my sparse matrix.
The problem seems to come from the index being stored as np.int64 rather than np.int32. Indeed, the following works:
Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]
Therefore I have two questions:
Is the error actually caused by the index being int64 rather than int32?
Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
EDIT:
Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:
Int64Index([1484, 753, 1587, 1494, 357, 1484, 84, 823, 424, 424,
...
1558, 1619, 1317, 1635, 537, 1206, 1152, 1635, 1206, 131],
dtype='int64', length=4840)
I created Ytr_resample by resampling Ytr (which is a pandas.Series) so that every category present in Ytr has the same number of elements (by oversampling):
n_samples = Ytr.value_counts(dropna = False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(
    lambda x: x.sample(n_samples, replace=True, random_state=42)).cat
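Regarding the second question, a shorter conversion (a sketch, assuming the Ytr_resample built above and the tfs matrix from the question) is to pull the index out as an int32 numpy array, which avoids the list comprehension:
import numpy as np

row_idx = np.asarray(Ytr_resample.index, dtype=np.int32)   # plain int32 ndarray
Xtr_resample = tfs[row_idx]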

scipy.optimize.curve_fit() - array must not contain infs or NaNs

I am trying to fit some data to a curve in Python using scipy.optimize.curve_fit. I am running into the error ValueError: array must not contain infs or NaNs.
I don't believe either my x or y data contain infs or NaNs:
>>> x_array = np.asarray_chkfinite(x_array)
>>> y_array = np.asarray_chkfinite(y_array)
>>>
To give some idea of what my x_array and y_array look like at either end (x_array is counts and y_array is quantiles):
>>> type(x_array)
<type 'numpy.ndarray'>
>>> type(y_array)
<type 'numpy.ndarray'>
>>> x_array[:5]
array([0, 0, 0, 0, 0])
>>> x_array[-5:]
array([2919, 2965, 3154, 3218, 3461])
>>> y_array[:5]
array([ 0.9999582, 0.9999163, 0.9998745, 0.9998326, 0.9997908])
>>> y_array[-5:]
array([ 1.67399000e-04, 1.25549300e-04, 8.36995200e-05,
4.18497600e-05, -2.22044600e-16])
And my function:
>>> def func(x,alpha,beta,b):
... return ((x/1)**(-alpha) * ((x+1*b)/(1+1*b))**(alpha-beta))
...
Which I am executing with:
>>> popt, pcov = curve_fit(func, x_array, y_array)
resulting in the error stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 426, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py", line 338, in leastsq
cov_x = inv(dot(transpose(R),R))
File "/usr/lib/python2.7/dist-packages/scipy/linalg/basic.py", line 285, in inv
a1 = asarray_chkfinite(a)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 590, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
I'm guessing the error might not be with respect to my arrays, but rather to an array created by scipy in an intermediate step? I've had a bit of a dig through the relevant scipy source files, but things get hairy pretty quickly when debugging the problem that way. Is there something obvious I'm doing wrong here? I've seen it casually mentioned in other questions that certain initial parameter guesses (of which I currently don't supply any explicitly) can sometimes result in these kinds of errors, but even if that is the case, it would be good to know a) why that is and b) how to avoid it.
Why it is failing
It is not your input arrays that contain NaNs or infs; rather, evaluating your objective function at some x points, for some values of the parameters, produces NaNs or infs. In other words, the array of values func(x, alpha, beta, b) contains NaNs or infs for some x, alpha, beta and b encountered during the optimization routine.
scipy.optimize's curve fitting uses the Levenberg-Marquardt algorithm, also called damped least squares. It is an iterative procedure: a new estimate of the optimal parameters is computed at each iteration. At some point during the optimization, the algorithm explores a region of the parameter space where your function is not defined.
How to fix
1/Initial guess
The initial guess for the parameters is decisive for convergence. If the initial guess is far from the optimal solution, you are more likely to explore regions where the objective function is undefined. So, if you have a better idea of what your optimal parameters are, feeding the algorithm that initial guess may avoid the error.
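For example (a sketch; the starting values below are placeholders, not recommended values):
p0 = [1.0, 1.0, 1.0]   # hypothetical initial guesses for alpha, beta, b
popt, pcov = curve_fit(func, x_array, y_array, p0=p0)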
2/Model
You could also modify your model so that it does not return NaNs. For parameter values params where the original function func is not defined, you want the objective function to take huge values, in other words for func(params) to be far from the Y values being fitted.
At the points where your objective function is not defined, you may return a big float, for instance AVG(Y) * 10e5 with AVG the average of Y (so that you are sure to be much bigger than the average of the Y values being fitted).
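A sketch of this idea applied to the function from the question (func_guarded and BIG are illustrative names; BIG follows the AVG(Y) * 10e5 suggestion above):
import numpy as np
from scipy.optimize import curve_fit

BIG = np.mean(y_array) * 10e5   # much larger than any Y value being fitted

def func_guarded(x, alpha, beta, b):
    with np.errstate(all="ignore"):
        y = (x / 1) ** (-alpha) * ((x + 1 * b) / (1 + 1 * b)) ** (alpha - beta)
    # replace undefined (NaN) or infinite evaluations with the large penalty
    return np.where(np.isfinite(y), y, BIG)

popt, pcov = curve_fit(func_guarded, x_array, y_array)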
Link
You could have a look at this post: Fitting data to an equation in python vs gnuplot
Your function has a negative power, x**(-alpha), which is the same as (1/x)**alpha. If x is ever 0, your function returns inf and the curve fit breaks; I'm surprised a divide-by-zero warning or error isn't raised earlier.
By the way, why are you multiplying and dividing by 1?
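One way to sidestep the division by zero (a sketch, keeping the rest of the setup unchanged) is to drop the x == 0 points before fitting:
mask = x_array > 0                 # keep only strictly positive counts
popt, pcov = curve_fit(func, x_array[mask], y_array[mask])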
I was able to reproduce this error in python2.7 like so:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this produces the error:
This always produces the error:
/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py:58: RuntimeWarning: invalid value encountered in sqrt
return np.dot(np.dot(u * (1. / np.sqrt(s)), u.T), W)
Traceback (most recent call last):
File "main.py", line 43, in <module>
ica()
File "main.py", line 18, in ica
ica.fit(X)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 523, in fit
self._fit(X, compute_sources=False)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 479, in _fit
compute_sources=compute_sources, return_n_iter=True)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 335, in fastica
W, n_iter = _ica_par(X1, **kwargs)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 108, in _ica_par
- g_wtx[:, np.newaxis] * W)
File "/usr/lib64/python2.7/site-packages/sklearn/decomposition/fastica_.py", line 55, in _sym_decorrelation
s, u = linalg.eigh(np.dot(W, W.T))
File "/usr/lib64/python2.7/site-packages/scipy/linalg/decomp.py", line 297, in eigh
a1 = asarray_chkfinite(a)
File "/usr/lib64/python2.7/site-packages/numpy/lib/function_base.py", line 613, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
Solution:
from sklearn.decomposition import FastICA
X = load_data.load("stuff") #this sets X to a 2d numpy array containing
#large positive and negative numbers.
ica = FastICA(whiten=False)
#this is a column wise normalization function which flattens the
#two dimensional array from very large and very small numbers to
#reasonably sized numbers between roughly -1 and 1
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print(np.isnan(X).any()) #this prints False
print(np.isinf(X).any()) #this prints False
ica.fit(X) #this works correctly.
Why does that normalization step fix the error?
I found the eureka moment here: sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"
What I think is happening is that numpy is being fed gigantic numbers and very tiny numbers, and inside its tiny brain it's creating NaNs and infs. So it's a bug in sklearn. The workaround is to scale your input data so that there are no very large or very small numbers.
Bad sklearn! NO biscuit!
