Why am I getting an IndexError on this one-hot encoding? - python

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
I'm new to this. Also, is there anywhere I can learn more about data processing?

It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I found a dataset that looks similar to yours and tried this on it; it worked.
As for learning how to process the data: I personally try to refer to the documentation of a method I want to apply. In your case it's here. However, a clue to where the problem was I could find in the error message:
def _get_column_indices(X, key):
    """Get feature column indices for input data X and key.
    For accepted values of `key`, see the docstring of
    :func:`_safe_indexing_column`.
    """
--> n_columns = X.shape[1] # this is where the problem is
    key_dtype = _determine_key_type(key)
    if isinstance(key, (list, tuple)) and not key:
        # we get an empty list
IndexError: tuple index out of range
That made me suspect that slicing x gave you an ndarray of shape (n,), which has no column axis for the transformer to index.
It also seems like you intended x to be the target rather than the only feature. With five other columns assigned to y (iloc[:, 1:6]), you may want to swap x and y. You can still encode your target as you planned.
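To make the shape issue concrete, here is a minimal sketch. It uses a small made-up frame in place of netflixprice.csv, so the column names and values are assumptions rather than the asker's data. It shows that selecting a single column yields a 1-D array, and that reshaping it to a single column gives ColumnTransformer a column 0 to index. The sparse_threshold=0 argument is only there to force a dense result that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for netflixprice.csv; names and values are invented.
dataset = pd.DataFrame({
    "Country": ["Turkey", "Argentina", "Turkey"],
    "Cost": [1.97, 3.74, 1.97],
})

x_1d = dataset.iloc[:, 0].to_numpy()  # shape (3,): no column axis
x_2d = x_1d.reshape(-1, 1)            # shape (3, 1): one explicit column

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x_encoded = ct.fit_transform(x_2d)

print(x_1d.shape, x_2d.shape, x_encoded.shape)
```

The encoded array has one column per distinct category, ordered alphabetically by category name.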

Related

unable to transform the categorical variable, showing categories=auto error

Python version 3.7, Spyder 3.3.6. I always get an error, and I have also tried different Python versions:
import pandas as pa
import numpy as np
X=0
y=0
dataset = 0
#import the data set and separate X and y
dataset = pa.read_csv("50_Startups.csv")
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4].values
#categorical variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
[('one_hot_encoder',OneHotEncoder(),[0])],
remainder = 'passthrough'
)
X = np.array(ct.fit_transform(X), dtype=np.float64)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
The error is:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
Traceback (most recent call last):
File "<ipython-input-5-139c661c06f7>", line 25, in <module>
X = np.array(ct.fit_transform(X), dtype=np.float64)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 490, in fit_transform
return self._hstack(list(Xs))
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 541, in _hstack
raise ValueError("For a sparse output, all columns should"
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Matrix of features as X and dep variable as Y (convert dataframe to numpy array)
`X = dataset.iloc[:,:-1].values`
`Y = dataset.iloc[:,-1].values`
Encoding Categorical variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
en = LabelEncoder()
X[:,3] = en.fit_transform(X[:,3])
oh = OneHotEncoder(categorical_features=[3])
X = oh.fit_transform(X)
#converting from matrix to array
X = X.toarray()
#Dummy variable trap ---- Removing one dummy variable
X = X[:,1:]
Here you select all the feature columns, label-encode the categorical column (index 3) so that every value is numeric, one-hot encode only that column, and then remove one dummy variable.
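The categorical_features argument used above was deprecated in scikit-learn 0.20 and removed in 0.22, so on newer versions the same idea is usually written with ColumnTransformer. The sketch below is one way to do that, not the asker's exact fix. It assumes the text column sits at index 3, as in the answer above, and uses invented rows in place of 50_Startups.csv. Pointing the encoder at the text column, rather than at column 0, leaves only numeric columns in the remainder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for 50_Startups.csv; all values are invented.
dataset = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0, 400.0],
    "Administration": [10.0, 20.0, 30.0, 40.0],
    "Marketing Spend": [1.0, 2.0, 3.0, 4.0],
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [50.0, 60.0, 70.0, 80.0],
})
X = dataset.iloc[:, :-1].to_numpy()
y = dataset.iloc[:, -1].to_numpy()

ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(), [3])],  # encode the text column
    remainder="passthrough",                      # keep the numeric columns
    sparse_threshold=0,                           # always return a dense array
)
X_encoded = np.array(ct.fit_transform(X), dtype=np.float64)

print(X_encoded.shape)
```

The one-hot columns come first in the output, followed by the passed-through numeric columns in their original order.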

How to fix One-hot encoding error - IndexError?

I'm currently working on a deep learning model with an LSTM, trained on joint positions for human movement, but I keep getting an error during the one-hot encoding step.
I've checked several websites for instructions, but I can't work out what is different about my code or data:
import pandas as pd
import numpy as np
keypoints = pd.read_csv('keypoints.csv')
X = keypoints.iloc[:,1:76]
y = keypoints.iloc[:,76]
Which results in the following shapes:
Keypoints = (63564, 77)
x = (63564, 75)
y = (63564,)
All the keypoints of the joints are in x, and y contains the labels I want to train on, which are three different (textual) labels. The first column of the dataset can be ignored because it contains only frame numbers.
Therefore I was advised to use one-hot encoding so that I can use categorical_crossentropy later on:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(categorical_features = [0])
y = ohe.fit_transform(y).toarray()
But when applying this, I get the error on the last line:
Traceback (most recent call last):
File "LSTMPose.py", line 28, in <module>
y = ohe.fit_transform(y).toarray()
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in fit_transform
self._handle_deprecations(X)
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in _handle_deprecations
n_features = X.shape[1]
IndexError: tuple index out of range
I assumed it has something to do with my y index, but it is just 1 column... so what am I missing?
You need to reshape your y-data to be 2D as well, similar to the x-data. The second dimension should have length 1, i.e. you can do:
y = ohe.fit_transform(y[:, None]).toarray()
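A minimal sketch of the reshape fix on made-up labels (the label names are assumptions). It skips LabelEncoder and the categorical_features argument, on the assumption that scikit-learn 0.20 or later is installed, where OneHotEncoder accepts string categories directly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical textual labels standing in for the last column of keypoints.csv.
y = pd.Series(["walk", "run", "jump", "walk"])

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
y_2d = y.to_numpy().reshape(-1, 1)

ohe = OneHotEncoder()
y_onehot = ohe.fit_transform(y_2d).toarray()  # sparse result converted to dense

print(ohe.categories_[0])
print(y_onehot)
```

Each row of y_onehot contains a single 1, in the column of the matching category from ohe.categories_.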

Imbalanced-Learn's FunctionSampler throws ValueError

I want to use the class FunctionSampler from imblearn to create my own custom class for resampling my dataset. I have a one-dimensional feature Series containing paths for each subject and a label Series containing the labels for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first since it is one-dimensional. When I use the class RandomUnderSampler directly, everything works fine. However, if I pass the features and labels to the fit_resample method of a FunctionSampler whose function creates a RandomUnderSampler instance and calls fit_resample on it, I get the following error:
ValueError: could not convert string to float: 'path_1'
Here's a minimal example producing the error:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from imblearn import FunctionSampler
# create one dimensional feature and label arrays X and y
# X has to be converted to numpy array and then reshaped.
X = pd.Series(['path_1','path_2','path_3'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
FIRST METHOD (works)
rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X,y)
SECOND METHOD (doesn't work)
def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)
sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)
Does anyone know what goes wrong here? It seems as if the fit_resample method of FunctionSampler is not equivalent to the fit_resample method of RandomUnderSampler...
Your implementation of FunctionSampler is correct. The problem is with your dataset.
RandomUnderSampler seems to work for text data as well, because it does not validate its input with check_X_y.
FunctionSampler() does run this check (see here):
from sklearn.utils import check_X_y
X = pd.Series(['path_1','path_2','path_2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
check_X_y(X, y)
This will throw an error
ValueError: could not convert string to float: 'path_1'
The following example would work!
X = pd.Series(['1','2','2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)
sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)
X_res, y_res
# (array([[2.],
# [1.]]), array([0, 1], dtype=int64))
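The two cases above can be compared side by side using only scikit-learn. This is a sketch: passes_validation is a hypothetical helper, and the object-dtype arrays mimic what Series.values returns for a Series of Python strings.

```python
import numpy as np
from sklearn.utils import check_X_y


def passes_validation(X, y):
    """Return True if check_X_y accepts the data, False if it raises ValueError."""
    try:
        check_X_y(X, y)
        return True
    except ValueError:
        return False


y = [1, 0, 0]
X_text = np.array(["path_1", "path_2", "path_3"], dtype=object).reshape(-1, 1)
X_numeric_text = np.array(["1", "2", "2"], dtype=object).reshape(-1, 1)

text_ok = passes_validation(X_text, y)             # strings that are not numbers
numeric_ok = passes_validation(X_numeric_text, y)  # strings that parse as floats

print(text_ok, numeric_ok)
```

The imbalanced-learn documentation also describes a validate parameter on FunctionSampler (added in version 0.6) that bypasses this validation; whether it is available depends on the installed version.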

How do I fix this "TypeError: float() argument must be a string or a number, not 'method'" Error?

I've tried to use the imputer to replace all of the NaN entries in my dataset with the average of their respective column. For example, I want a blank entry under the salary column to be filled with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
Code:
#Data Processing
#Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error, and I'm not sure what to do.
The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
with parentheses (and the data to transform), so that the method is actually called rather than the method object itself being assigned.
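The suspicion that the tutorial is outdated is reasonable: sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22 in favor of sklearn.impute.SimpleImputer, which has no axis argument and always imputes column by column. A minimal sketch with the newer class follows, using an invented three-row array in place of Data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the feature matrix X read from Data.csv.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", np.nan, 48000.0],
        ["Germany", 30.0, np.nan],
    ],
    dtype=object,
)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Cast the numeric slice to float, then call fit_transform with the data.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3].astype(np.float64))

print(X)
```

Each missing entry is replaced by the mean of the remaining values in its column.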

Unexpected issue when encoding data using LabelEncoder and OneHotEncoder from sklearn

I am encoding some data to pass into an ML model using the LabelEncoder and OneHotEncoder from sklearn. However, I am getting an error that relates to a column that I don't think should be encoded at all.
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as py
Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
As far as I can see, I am only trying to encode the first column of data; however, I get the following error:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-132-360fc0133165>", line 2, in <module>
X = onehotencoder.fit_transform(X).toarray()
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'
What is it about my code that is making it think it needs to try and convert a value in column 16 or 18 into a float and anyway, what should be the issue with doing that!!?
Thanks in advance for your advice!
That string probably appears in column 17 of your data; I think the last columns are checked first for some reason. You can test this by passing fewer columns (e.g. 17 of them, with X[:, 0:17]) to see what happens: it will complain about the last column again.
Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
But since you passed the index of the categorical feature to the OneHotEncoder class, I would expect that not to matter (at least I'd expect the non-categorical features to be "ignored").
Reading the code in 'sklearn/preprocessing/data.py', I saw that the call "X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)" also validates the non-categorical features, even though the categorical indices are passed as an argument to the function that calls check_array. Maybe this should be raised with the sklearn community on GitHub?
@Taylrl,
I encountered the same behavior and found it frustrating. As @Vivek pointed out, scikit-learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
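On scikit-learn versions that provide sklearn.compose.ColumnTransformer (0.20 and later), one way around this whole-array float check is to let ColumnTransformer route only the categorical column to the encoder. The sketch below uses an invented three-column array, not the SID Alerts data. The sparse_threshold=0 setting forces a dense result, so the text column in the remainder is passed through without any attempt to convert it to a number.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-type features: a category, a number, and free text.
X = np.array(
    [
        ["high", 3, "free text a"],
        ["low", 5, "free text b"],
        ["high", 7, "free text c"],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],  # only column 0 reaches the encoder
    remainder="passthrough",             # other columns are copied unchanged
    sparse_threshold=0,                  # always return a dense array
)
X_encoded = ct.fit_transform(X)

print(X_encoded)
```

The result is an object array: the one-hot columns for column 0 come first, followed by the untouched numeric and text columns.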
