I am encoding some data to pass into an ML model using the LabelEncoder and OneHotEncoder from sklearn, but I am getting back an error that relates to a column I don't think should be being encoded.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as py
Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
I can only see that I am trying to encode the first column of the data, yet the error I am getting is the following:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-132-360fc0133165>", line 2, in <module>
X = onehotencoder.fit_transform(X).toarray()
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site- packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'
What is it about my code that makes it think it needs to convert a value in column 16 or 18 into a float, and anyway, why would that be an issue?!
Thanks in advance for your advice!
I'm sorry, this is actually a comment but due to my reputation I can't post comments yet :(
Probably that string appears in column 17 of your data, and I think it's because for some reason the last columns of the data are checked first. You can try passing fewer columns (e.g. the first 17, via X[:, 0:17]) to see what happens; it'll complain about the last column again.
Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
But I think that since you specified the indices of the categorical features to the OneHotEncoder class, that shouldn't matter anyway (at least I'd expect the non-categorical features to be ignored).
Reading the code in sklearn/preprocessing/data.py, I've seen that when they do X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES), they are also considering the non-categorical features, even though their indices are passed as an argument to the function that calls check_array. Maybe it should be raised with the sklearn community on GitHub?
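If you just want to get past the error with this old API, a workaround consistent with that reading is to label-encode every string column (not only the one you want one-hot encoded), so the internal check_array call can cast X to float. A minimal sketch; the extra column indices are purely illustrative placeholders:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode every column that holds strings, so the internal
# check_array call can cast X to float. Indices 16 and 17 are
# placeholders; use whichever of your columns actually contain strings.
for col in [0, 16, 17]:
    X[:, col] = LabelEncoder().fit_transform(X[:, col])

onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()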
@Taylrl,
I encountered the same behavior and found it frustrating. As @Vivek pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
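For what it's worth, newer scikit-learn releases (0.20 and later) resolve this: categorical_features was deprecated and then removed, OneHotEncoder accepts strings directly, and column selection moved to ColumnTransformer. A minimal sketch of that approach for this case:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Only column 0 is encoded; the other columns are passed through
# untouched, strings included. Dense output avoids the sparse hstack
# complaining about non-numeric passthrough columns.
# (The parameter was called sparse=False before scikit-learn 1.2.)
ct = ColumnTransformer(
    [('encoder', OneHotEncoder(sparse_output=False), [0])],
    remainder='passthrough'
)
X = ct.fit_transform(X)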
Related
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
data structure (screenshot omitted)
New to this. Also, is there anywhere I can learn more about data processing?
It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I found a dataset that might be similar to yours and tried it; it worked.
As for learning how to process data: I personally refer to the documentation of the method I want to apply; in your case it's here. However, a clue to where the problem lies can be found in the error message:
def _get_column_indices(X, key):
"""Get feature column indices for input data X and key.
For accepted values of `key`, see the docstring of
:func:`_safe_indexing_column`.
"""
--> n_columns = X.shape[1] # this is where the problem is
key_dtype = _determine_key_type(key)
if isinstance(key, (list, tuple)) and not key:
# we get an empty list
IndexError: tuple index out of range
That made me suspect that you got an ndarray of shape (n,) when you sliced x, which doesn't have the columns that were required.
It also seems like you intended x to be the target rather than the only feature. With the other columns assigned to y, you may want to swap x and y. You can still encode your target as you planned.
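If that is the case, the setup might look like this (a minimal sketch, assuming column 0 holds the textual target and columns 1-5 the features):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataset = pd.read_csv('netflixprice.csv')

# columns 1-5 as the feature matrix, column 0 as the target
X = dataset.iloc[:, 1:6].values
y = dataset.iloc[:, 0].values

# encode the textual target as integer class labels
y = LabelEncoder().fit_transform(y)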
I am getting an error on inverse_transform after fit_transform. I am trying to inverse-transform float64 values back to their original datatype, which is string.
getting the data:
df = pd.read_csv("pris.csv", usecols=['judge', 'plea_orcs', 'prior_cases', 'race', 'pris_yrs'])
transforming string columns in csv:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
X and y for sklearn:
X = df[['plea_orcs', 'judge', 'race', 'prior_cases', 'pris_yrs']]
y = df[['to_prison']]
this is raising the error:
print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
error:
IndexError Traceback (most recent call last)
<ipython-input-291-11e4763a5a03> in <module>
----> 1 print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\preprocessing\_encoders.py in inverse_transform(self, X)
733 for i in range(n_features):
734 labels = X[:, i].astype('int64', copy=False)
--> 735 X_tr[:, i] = self.categories_[i][labels]
736
737 return X_tr
IndexError: index 68 is out of bounds for axis 0 with size 5
Should I not be using OrdinalEncoder? I have tried several different ways, but this one seems to be an error in the right direction.
The problem
oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
In the second line, you fit your ordinal encoder on the column 'plea_orcs'. You can then transform that data (as you do, with the convenience method fit_transform) and inverse_transform the result.
But then in the third line, you refit the ordinal encoder on the column 'judge'. This loses all information about plea_orcs, and you will no longer be able to transform test data, or inverse-transform.
Some solutions
In increasing order of (IMO) elegance:
Instantiate separate ordinal encoders for each feature.
Use just one ordinal encoder, and fit and transform all three columns at once.
Use just one ordinal encoder together with a ColumnTransformer for selecting the appropriate columns. Use passthrough for the other columns if you don't need any preprocessing on them. (The last two options are sketched below.)
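For illustration, here is a minimal sketch of the last two options, reusing the column names and df from the question (pick one approach, not both in sequence):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['plea_orcs', 'judge', 'race']

# Option 2: one encoder fit on all three columns at once, so a later
# inverse_transform can recover every column.
oe = OrdinalEncoder()
df[cat_cols] = oe.fit_transform(df[cat_cols])
df[cat_cols] = oe.inverse_transform(df[cat_cols])  # round-trips cleanly

# Option 3: a ColumnTransformer that selects the categorical columns
# and passes all the others through unchanged.
ct = ColumnTransformer(
    [('ordinal', OrdinalEncoder(), cat_cols)],
    remainder='passthrough'
)
X = ct.fit_transform(df)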
Off-topic...
...but consider whether ordinal encoding is appropriate: if your data isn't naturally ordered, then you're adding false relationships to your data. See e.g. this DS.SE post.
Python version 3.7, Spyder 3.3.6. It always shows an error; I have also tried with different Python versions:
import pandas as pa
import numpy as np
X=0
y=0
dataset = 0
# import the dataset and separate the features from the target
dataset = pa.read_csv("50_Startups.csv")
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4].values
#categorical variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
[('one_hot_encoder',OneHotEncoder(),[0])],
remainder = 'passthrough'
)
X = np.array(ct.fit_transform(X), dtype=np.float64)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
The error is:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
Traceback (most recent call last):
File "<ipython-input-5-139c661c06f7>", line 25, in <module>
X = np.array(ct.fit_transform(X), dtype=np.float64)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 490, in fit_transform
return self._hstack(list(Xs))
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 541, in _hstack
raise ValueError("For a sparse output, all columns should"
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Matrix of features as X and dependent variable as Y (converting the dataframe to a NumPy array):
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values
Encoding the categorical variable:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
en = LabelEncoder()
X[:,3] = en.fit_transform(X[:,3])
oh = OneHotEncoder(categorical_features=[3])
X = oh.fit_transform(X)
#converting from matrix to array
X = X.toarray()
#Dummy variable trap ---- Removing one dummy variable
X = X[:,1:]
Here you select all the columns, including those with numeric data. You fit the encoder only on the categorical column (index 3), then transform it, and finally remove one dummy variable to avoid the dummy variable trap.
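The same fix can also be written in the question's own ColumnTransformer style: point the encoder at index 3 (the State column in the usual 50_Startups.csv) instead of index 0, so no string column is left in the passthrough remainder:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(), [3])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X), dtype=np.float64)

# the dummy columns come first in the output, so dropping the first
# column removes one dummy variable
X = X[:, 1:]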
Currently I'm working on a deep learning model containing an LSTM to train on joints for human movements, but during the one-hot encoding process I keep getting an error.
I've checked several websites for instructions, but I'm unable to work out where my code/data differ:
import pandas as pd
import numpy as np
keypoints = pd.read_csv('keypoints.csv')
X = keypoints.iloc[:,1:76]
y = keypoints.iloc[:,76]
Which results in the following shapes:
Keypoints = (63564, 77)
X = (63564, 75)
y = (63564,)
All the keypoints of the joints are in X, and y contains the labels I want to train on, which are three different (textual) labels. The first column of the dataset can be ignored, because it contains just frame numbers.
Therefore I was advised to use one-hot encoding so I can use categorical_crossentropy later on:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(categorical_features = [0])
y = ohe.fit_transform(y).toarray()
But when applying this, I get the error on the last line:
Traceback (most recent call last):
File "LSTMPose.py", line 28, in <module>
y = ohe.fit_transform(y).toarray()
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in fit_transform
self._handle_deprecations(X)
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in _handle_deprecations
n_features = X.shape[1]
IndexError: tuple index out of range
I assumed it has something to do with my y index, but it is just one column... so what am I missing?
You need to reshape your y-data to be 2D as well, similar to the x-data. The second dimension should have length 1, i.e. you can do:
y = ohe.fit_transform(y[:, None]).toarray()
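For clarity, y[:, None] and y.reshape(-1, 1) are equivalent: both add a trailing axis of length 1, turning the (63564,) label vector into the (63564, 1) column the encoder expects. A tiny illustration:

import numpy as np

y = np.array([0, 2, 1, 0])
print(y.shape)           # (4,)
print(y[:, None].shape)  # (4, 1), same as y.reshape(-1, 1).shape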
I have some simple code to convert categorical data into a one-hot encoding in Python, with data like this:
a,1,p
b,3,r
a,5,t
I tried to convert it with the scikit-learn OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This piece of code does not work and throws an error:
ValueError: could not convert string to float: 't'
Can you please help me?
Try this:
import pandas as pd
from sklearn import preprocessing

for c in df.columns:
    # cast everything to string before label-encoding
    df[c] = df[c].apply(str)
    le = preprocessing.LabelEncoder().fit(df[c])
    df[c] = le.transform(df[c])
    # assign the result back (the original dropped it), and use the
    # builtin float: np.float has been removed from recent NumPy
    df[c] = pd.to_numeric(df[c]).astype(float)
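Note that this loop turns every column, including the numeric one, into integer codes rather than one-hot vectors, and it assumes df is the dataframe read from the CSV (called data in the question).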
@user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
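As with the earlier question, the modern API sidesteps the whole issue: since scikit-learn 0.20, OneHotEncoder accepts strings directly and categorical_features is gone. Alternatively, pandas can do the encoding on its own. A minimal sketch of the pandas route for the three-column file above:

import pandas as pd

data = pd.read_csv("C:\\test.txt", sep=",", header=None)

# one-hot encode the two string columns (0 and 2), keep the numeric one
data = pd.get_dummies(data, columns=[0, 2])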