'KeyError:' when iterating over pandas data frame? - python

I have two lists Y_train and Y_test. At the moment they hold categorical data. Each element is either Blue or Green. They are going to be the targets for a Random Forest classifier. I need them encoded as 1.0s and 0.0s.
Here is a print(Y_train) to show you what the data frame looks like. The random numbers down the side are because the data has been shuffled. (Y_test is the same, just smaller):
183 Blue
126 Blue
1 Blue
409 Blue
575 Green
...
396 Blue
192 Blue
578 Green
838 Green
222 Blue
Name: Colour, Length: 896, dtype: object
To encode this I was going to simply loop over the values and change each element to its encoded value:
for i in range(len(Y_train)):
    if Y_train[i] == 'Blue':
        Y_train[i] = 0.0
    else:
        Y_train[i] = 1.0
However, when I do this, I get the following:
Traceback (most recent call last):
File "G:\Work\Colours.py", line 90, in <module>
Main()
File "G:\Work\Colours.py", line 34, in Main
RandForest(X_train, Y_train, X_test, Y_test)
File "G:\Work\Colours.py.py", line 77, in RandForest
if Y_train[i] == 'Blue':
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 6
The weird thing is that it produces this error at different times. I've used flags and prints to see how far it gets. Sometimes it will get quite a few iterations into the loop, and then other times it will only do one or two iterations before breaking.
I'm assuming I just don't quite understand how you're supposed to iterate over data frames properly. If someone with more experience with this stuff could help me out that would be great.

Try:
Y_train[Y_train == 'Blue'] = 0.0
Y_train[Y_train == 'Green'] = 1.0
That should solve your issues.
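As an aside, the same encoding can be done in one step with Series.map; the snippet below is a small sketch (not part of the original answer) and assumes both targets are pandas Series containing only 'Blue' and 'Green':
# Sketch: vectorised encoding via Series.map (assumes only 'Blue'/'Green' values)
colour_map = {'Blue': 0.0, 'Green': 1.0}
Y_train = Y_train.map(colour_map)
Y_test = Y_test.map(colour_map)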

If you are using your own method for label encoding, it is better to create a separate encoded column rather than modifying the original column. After that, you can assign the encoded column to your DataFrame. As an example for your scenario:
import numpy as np

encoded = np.ones((Y_train.shape[0], 1))
for i in range(Y_train.shape[0]):
    # positional access, so the shuffled index does not raise a KeyError
    if Y_train.iloc[i] == 'Blue':
        encoded[i] = 0
Note that this will only work if you have two categories.
For multiple categories, you can use sklearn or pandas methods.
For multiple categories
Another approach is using pandas cat.codes. You can convert a pandas Series to a category dtype and get the category codes.
Y_train = pd.Series(Y_train)
encoded = Y_train.astype("category").cat.codes
You can use sklearn's LabelEncoder to encode categorical data as well.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(Y_train)

In cases where you have even more labels than in your current example (Blue and Green in your case), sklearn provides a label encoder that allows you to do this very easily:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Transforms the 'column' in your dataframe df
df['column'] = label_encoder.fit_transform(df['column'])

Related

Pgmpy: expectation maximization for Bayesian network parameter learning with missing data

I'm trying to use the PGMPY package for Python to learn the parameters of a Bayesian network. If I understand expectation maximization correctly, it should be able to deal with missing values. I am currently experimenting with a 3-variable BN, where the first 500 datapoints have a missing value. There are no latent variables. Although the description in pgmpy suggests that it should work with missing values, I get an error. This error only occurs when calling the function with datapoints that have missing values. Am I doing something wrong? Or should I look into another package for EM with missing values?
#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]
# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)
# create some missing values
new_data["Smoker"][:500] = np.NaN
# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
The error I get is an indexing error:
File "main.py", line 100, in <module>
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
weighted_data = self._compute_weights(latent_card)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
return self.apply_standard()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
likelihood *= cpd.get_value(
File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Thanks!
Since there is still no answer to your specific question, let me propose a solution with another module:
#import
import pandas as pd
import numpy as np
import pyAgrum as gum
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
test_data = data[:2000]
new_data = data[2000:].copy()
# Learn structure of initial model from data
learner=gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model=learner.learnBN()
# create some missing values
new_data["smoking"][:500] = "?" # instead of NaN
# learn parameterization of BN
bn = gum.BayesNet(model)
learner2=gum.BNLearner(new_data,model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)
In a notebook, the fitted network can then be displayed:
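For example (a minimal sketch, not from the original answer, assuming pyAgrum's notebook helper module pyAgrum.lib.notebook is available):
# Sketch: display the fitted network in a Jupyter notebook with pyAgrum's helpers
import pyAgrum.lib.notebook as gnb
gnb.showBN(bn)         # draw the learned structure
gnb.showInference(bn)  # show the marginal distributions after EM fitting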

KeyError: [....] not found in index

For a project that I am working on, I created a linear regression model. After fitting that line, I wanted to simulate the data over and over again, using np.random.choice on my data, to see the variability in the regression line if the data were recollected. However, I keep getting a KeyError in my function and I am not sure how to fix it.
Here is a head of what the data looks like:
I ran a linear regression model on the columns 'nsb' and 'r'.
Here are my functions that repeatedly create linear regression models for 'bootstrapped' data:
When I call this:
slope, int = draw_bs_pairs_linreg(big_df['nsb'], big_df['r'], size = 1000)
I get this error; the length and the values in the list of numbers change each time I run it.
KeyError: '[2, 567, 459, 458, 355, 230, 353, 565, 231, 566, 117] not in index'
Any help would be appreciated.
You need DataFrame.reset_index before calling your function:
big_df = big_df.reset_index(drop=True)
Or use indexing with .iloc:
bs_x, bs_y = x.iloc[bs_inds], y.iloc[bs_inds]
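The original draw_bs_pairs_linreg is not reproduced in the post; a hypothetical sketch of such a bootstrap-pairs function with the .iloc fix in place could look like this:
# Hypothetical sketch of a bootstrap-pairs linear regression (names assumed,
# not taken from the post); .iloc makes the sampling positional, so a shuffled
# or non-default index no longer raises a KeyError.
import numpy as np

def draw_bs_pairs_linreg(x, y, size=1):
    inds = np.arange(len(x))
    slopes = np.empty(size)
    intercepts = np.empty(size)
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x.iloc[bs_inds], y.iloc[bs_inds]
        slopes[i], intercepts[i] = np.polyfit(bs_x, bs_y, 1)
    return slopes, intercepts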

ML Code throws value error when transforming data

Data source can be found here.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method continuously fails. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error, which is shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway, because now the object oh holds lots of useful information: you can inspect the results, you can later call oh.transform on your test data, and so on.
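For instance (a small sketch, not from the original answer; test_data is a hypothetical hold-out DataFrame):
# Sketch: reuse the fitted encoder (test_data is hypothetical, not from the post)
print(oh.categories_)                        # categories learned per column
test_enc = oh.transform(test_data[oh_cols])  # apply the same encoding to new data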
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply is trying to apply oh.fit_transform to each column, but OneHotEncoder expects a 2D input.
Using .values or np.array() casts your dataframe to a numpy array, but numpy has no apply method.
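A toy illustration (not from the original post) of the shape difference:
# Toy example: a single column is 1D, which is what apply passes to fit_transform,
# while OneHotEncoder expects 2D, column-shaped input.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"rating": ["Everyone", "Teen", "Everyone"]})
print(toy["rating"].shape)    # (3,)   -> 1D
print(toy[["rating"]].shape)  # (3, 1) -> 2D
encoded = OneHotEncoder().fit_transform(toy[["rating"]])  # works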

ValueError: could not convert integer scalar in scipy indexing

During a NLP process, I transform a corpus of texts using TF-IDF which yields a scipy.sparse.csr.csr_matrix.
I then split this data into train and test corpus and resample my train corpus in order to tackle a class imbalance problem.
The issue I'm facing is that when I use the resampled index (from the labels, which are a pandas.Series) to resample the sparse matrix like this:
tfs[Ytr_resample.index]
It takes a lot of time, and outputs the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]
/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
348 csr_sample_values(self.shape[0], self.shape[1],
349 self.indptr, self.indices, self.data,
--> 350 num_samples, row.ravel(), col.ravel(), val)
351 if row.ndim == 1:
352 # row and col are 1d
ValueError: could not convert integer scalar
Following this thread I checked that the biggest element in the index wouldn't be bigger than the number of rows in my sparse matrix.
The problem seems to come from the fact that the index is coded in np.int64 and not in np.int32. Indeed the following works:
Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]
Therefore I have two questions:
Is the error actually coming from this conversion int32 to int64?
Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
EDIT:
Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:
Int64Index([1484, 753, 1587, 1494, 357, 1484, 84, 823, 424, 424,
...
1558, 1619, 1317, 1635, 537, 1206, 1152, 1635, 1206, 131],
dtype='int64', length=4840)
I created Ytr_resample by resampling Ytr (which is a pandas.Series) such that every category present in Ytr has the same number of elements (by oversampling):
n_samples = Ytr.value_counts(dropna = False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(\
lambda x: x.sample(n_samples,replace = True,random_state=42)).cat
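Regarding the second question, one vectorised option (a sketch, not from the original post) is to convert the index values rather than the Index object itself:
# Sketch: build an int32 array from the index values before indexing the matrix
import numpy as np
row_idx = np.asarray(Ytr_resample.index, dtype=np.int32)
Xtr_resample = tfs[row_idx]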

sci-kit learn crashing on certain amounts of data

I'm trying to process a numpy array with 71,000 rows and 200 columns of floats, and the two scikit-learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your line lengths are different, e.g. from line 5853 on, but possibly not all the time.
Numpy data arrays are only useful if all lines have the same length (they continue to work if not, but don't do what you expect). You should check what is causing this, correct it, and then return to KNN.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray
X = np.array(X)
# check the dtype
print X.dtype # returns dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
