ValueError on inverse transform using OrdinalEncoder with sklearn - python

I am getting an error on inverse_transform after fit_transform. I am trying to inverse-transform float64 values back to their original datatype, which is string.
Getting the data:
df = pd.read_csv("pris.csv", usecols=['judge', 'plea_orcs', 'prior_cases', 'race', 'pris_yrs'])
Transforming the string columns in the CSV:
oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
X and y for sklearn:
X = df[['plea_orcs', 'judge', 'race', 'prior_cases', 'pris_yrs']]
y = df[['to_prison']]
This is raising the error:
print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
Error:
IndexError Traceback (most recent call last)
<ipython-input-291-11e4763a5a03> in <module>
----> 1 print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\preprocessing\_encoders.py in inverse_transform(self, X)
733 for i in range(n_features):
734 labels = X[:, i].astype('int64', copy=False)
--> 735 X_tr[:, i] = self.categories_[i][labels]
736
737 return X_tr
IndexError: index 68 is out of bounds for axis 0 with size 5
Should I not be using OrdinalEncoder? I have tried several different approaches, but this one seems to be an error in the right direction.

The problem
oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
In the second line, you fit your ordinal encoder on the column 'plea_orcs'. You can then transform that data (as you do, via the convenience method fit_transform) and inverse_transform the result.
But then in the third line, you refit the ordinal encoder on the column 'judge'. This discards all information about plea_orcs, and you will no longer be able to transform test data or inverse-transform.
Some solutions
In increasing order of (IMO) elegance:
1. Instantiate a separate ordinal encoder for each feature.
2. Use just one ordinal encoder, and fit and transform all three columns at once.
3. Use just one ordinal encoder together with a ColumnTransformer for selecting the appropriate columns. Use passthrough for the other columns if you don't need any preprocessing on them. (The last two options are sketched below.)
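A minimal, hedged sketch of options 2 and 3, assuming the same df as in the question:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
df = pd.read_csv("pris.csv", usecols=['judge', 'plea_orcs', 'prior_cases', 'race', 'pris_yrs'])
cat_cols = ['plea_orcs', 'judge', 'race']
# Option 2: one encoder fit on all three columns at once,
# so inverse_transform can recover all of them later.
oe = OrdinalEncoder()
df[cat_cols] = oe.fit_transform(df[cat_cols])
df[cat_cols] = oe.inverse_transform(df[cat_cols])  # round-trips cleanly
# Option 3: encode only the string columns, pass the rest through.
ct = ColumnTransformer([('ordinal', OrdinalEncoder(), cat_cols)], remainder='passthrough')
X_enc = ct.fit_transform(df[['plea_orcs', 'judge', 'race', 'prior_cases', 'pris_yrs']])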
Off-topic...
...but consider whether ordinal encoding is appropriate: if your data isn't naturally ordered, then you're adding false relationships to your data. See e.g. this DS.SE post.

Related

Why am I getting an index error on this one-hot encoding?

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
[screenshot of the data structure omitted]
I'm new to this. Also, is there anywhere I can learn more about data processing?
It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I found a dataset that might be similar to yours, tried it, and it worked.
As for learning how to process data: I personally try to refer to the documentation of the method I want to apply; in your case it's here. However, I found a clue to the problem in the error message:
def _get_column_indices(X, key):
"""Get feature column indices for input data X and key.
For accepted values of `key`, see the docstring of
:func:`_safe_indexing_column`.
"""
--> n_columns = X.shape[1] # this is where the problem is
key_dtype = _determine_key_type(key)
if isinstance(key, (list, tuple)) and not key:
# we get an empty list
IndexError: tuple index out of range
That made me suspect that you got an ndarray shaped (n,) when you sliced x, which doesn't have the column axis that was required.
It also seems like you intended x to be the target rather than the only feature. With the five other columns assigned to y, you may want to swap x and y. You can still encode your target as you planned.
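A hedged sketch of that swap, assuming column 0 of netflixprice.csv holds the categorical target and columns 1-5 hold the features:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
dataset = pd.read_csv('netflixprice.csv')
X = dataset.iloc[:, 1:6].values               # features
y = dataset.iloc[:, 0].values.reshape(-1, 1)  # target, made 2-D
# For a single column, OneHotEncoder alone is enough; no ColumnTransformer needed.
enc = OneHotEncoder()
y_encoded = enc.fit_transform(y).toarray()    # fit_transform returns a sparse matrix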

How to fix One-hot encoding error - IndexError?

Currently I'm working on a Deep learning model containing LSTM to train on joints for human movement(s), but during the one-hot encoding process I keep getting an error.
I've checked several websites for instructions, but I'm unable to work out where my code/data differs:
import pandas as pd
import numpy as np
keypoints = pd.read_csv('keypoints.csv')
X = keypoints.iloc[:,1:76]
y = keypoints.iloc[:,76]
This results in the following shapes:
Keypoints = (63564, 77)
x = (63564, 75)
y = (63564,)
All the joint keypoints are in X, and y contains all the labels I want to train on, which are three different (textual) labels. The first column of the dataset can be ignored, since it contains just frame numbers.
Therefore I was advised to use one-hot encoding so I can use categorical_entropy later on:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(categorical_features = [0])
y = ohe.fit_transform(y).toarray()
But when applying this, I get the error on the last line:
Traceback (most recent call last):
File "LSTMPose.py", line 28, in <module>
y = ohe.fit_transform(y).toarray()
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in fit_transform
self._handle_deprecations(X)
File "C:\Users\jebo\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in _handle_deprecations
n_features = X.shape[1]
IndexError: tuple index out of range
I assumed it has something to do with my y index, but it is just 1 column... so what am I missing?
You need to reshape your y-data to be 2D as well, similar to the x-data. The second dimension should have length 1, i.e. you can do:
y = ohe.fit_transform(y[:, None]).toarray()
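For context, a minimal sketch of the fixed flow with stand-in labels (the real y comes from keypoints.csv; categorical_features is omitted because it was deprecated and later removed from scikit-learn):
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
y = np.array(['walk', 'run', 'jump', 'walk'])  # stand-in for the real labels
le = LabelEncoder()
y = le.fit_transform(y)                        # shape (n,), integer codes
ohe = OneHotEncoder()
y = ohe.fit_transform(y[:, None]).toarray()    # (n, 1) in, (n, 3) out
print(y.shape)                                 # (4, 3)
Note that in scikit-learn 0.20 and later, OneHotEncoder accepts string labels directly, so the LabelEncoder step can be dropped entirely.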

What is the meaning of the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26 and how to solve it

This is the code I wrote; I am trying to convert the non-numerical data to numeric. However, it returns the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26.
The data is from http://archive.ics.uci.edu/ml/datasets/Automobile
automobile = pd.read_csv('imports-85.csv', names=["symboling",
    "normalized-losses", "make", "fuel", "aspiration", "num-of-doors",
    "body-style", "drive-wheels", "engine-location", "wheel-base",
    "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
    "highway-mpg", "price"])
X = automobile.drop('symboling',axis=1)
y = automobile['symboling']
le = preprocessing.LabelEncoder()
le.fit([automobile])
print (le)
The fit method takes an array of shape [n_samples]; see the docs. You're passing the entire data frame inside a list. I'm pretty sure that if you print the shape of your dataframe (automobile.shape) it will show a shape of (205, 26).
If you want to encode your data, you need to do it one column at a time, e.g.
le.fit(automobile['make'])
Note that this is not the correct way to encode categorical data: as the name suggests, LabelEncoder is designed for labels, not input features. In scikit-learn's current state you should use OneHotEncoder for input features (see the sketch below). There are plans for a categorical encoder in the next release.
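A hedged sketch of both points, assuming the automobile DataFrame from the question and a recent scikit-learn (0.20 or later, where OneHotEncoder accepts string columns directly):
from sklearn import preprocessing
# LabelEncoder is for the target only: one 1-D column at a time.
le = preprocessing.LabelEncoder()
y = le.fit_transform(automobile['symboling'])
# String input features should be one-hot encoded instead (sparse output).
ohe = preprocessing.OneHotEncoder()
make_encoded = ohe.fit_transform(automobile[['make']])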

Unexpected issue when encoding data using LabelEncoder and OneHotEncoder from sklearn

I am encoding some data to pass into an ML model using LabelEncoder and OneHotEncoder from sklearn; however, I am getting back an error that relates to a column that I don't think should be being encoded.
Here is my code;
import numpy as np
import pandas as pd
import matplotlib.pyplot as py
Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
As far as I can see, I am only trying to encode the first column of data; however, the error I am getting is the following:
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):
File "<ipython-input-132-360fc0133165>", line 2, in <module>
X = onehotencoder.fit_transform(X).toarray()
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site- packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'
What is it about my code that makes it try to convert a value in column 16 or 18 into a float? And anyway, why should that be an issue?
Thanks in advance for your advice!
I'm sorry, this is actually a comment but due to my reputation I can't post comments yet :(
Probably that string appears in column 17 of your data, and I think it's because for some reason the last columns of the data are checked first (you can try passing fewer columns, e.g. 17 by passing X[:,0:17], to see what happens; it'll complain about the last column again).
Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
But I think since you specified the indexes of the categorical features to the OneHotEncoder class, that shouldn't matter anyway (at least I'd expect the non-categorical features to be "ignored").
Reading the code in sklearn/preprocessing/data.py, I've seen that when they do X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES), they are considering the non-categorical features, even though their indexes are passed as an argument to the function that calls check_array. I don't know; maybe it should be checked with the sklearn community on GitHub?
@Taylrl,
I encountered the same behavior and found it frustrating. As @Vivek pointed out, scikit-learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
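Given that constraint, a common workaround at the time was to label-encode every string column first, so that the global float conversion succeeds. Here is a hedged sketch: the column indices are assumptions, and categorical_features is the old API used in the question, since removed from scikit-learn.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
string_columns = [0, 16, 17]  # hypothetical: every column of X that holds strings
for col in string_columns:
    X[:, col] = LabelEncoder().fit_transform(X[:, col])
# Now check_array() can cast everything to float, and only column 0
# is actually expanded into one-hot columns.
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()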

Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.
Unfortunately, I am getting the error in the title when I try and run hstack to add this single column to my other numpy array.
Here is my code :
#reading in test/train data for TF-IDF
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])
#reading in labels for training
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]
#reading in single integer column to join
AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
C=1, fit_intercept=True, intercept_scaling=1.0,
class_weight=None, random_state=None) #Classifier
X_all = traindata + testdata #adding test and train data to put into tf-idf
lentrain = len(traindata) #find length of train data
tfv.fit(X_all) #fit tf-idf on all our text
X_all = tfv.transform(X_all) #transform it
X = X_all[:lentrain] #reduce to size of training set
AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
X_test = X_all[lentrain:] #reduce to size of test set
#printing debug info, output below :
print "X.shape => " + str(X.shape)
print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
print "X_all.shape => " + str(X_all.shape)
#line we get error on
X = np.hstack((X, AllAlexaAndGoogleInfo))
Below is the output and error message :
X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
31 print "X_all.shape => " + str(X_all.shape)
32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
34 sc = preprocessing.StandardScaler().fit(X)
35 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
271 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
272 if arrs[0].ndim == 1:
--> 273 return _nx.concatenate(arrs, 0)
274 else:
275 return _nx.concatenate(arrs, 1)
ValueError: all the input arrays must have same number of dimensions
What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?
Thank you.
Edit:
Using the method in the answer below gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
37 sc = preprocessing.StandardScaler().fit(X)
38 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
294 arr = array(arr,copy=False,subok=True,ndmin=2).T
295 arrays.append(arr)
--> 296 return _nx.concatenate(arrays,1)
297
298 def dstack(tup):
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Interestingly, I tried to print the dtype of X and this worked fine:
X.dtype => float64
However, trying to print the dtype of AllAlexaAndGoogleInfo like so:
print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)
produces:
'DataFrame' object has no attribute 'dtype'
As X is a sparse matrix, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is rather misleading here.
This minimal example illustrates the situation:
import numpy as np
from scipy import sparse
X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works
Based on the following output
X shape: (10, 10000)
xt shape: (10, 1)
one may expect that the hstack in the following line will work, but the fact is that it throws this error:
ValueError: all the input arrays must have same number of dimensions
So, use scipy.sparse.hstack when you have a sparse array to stack.
In fact I answered this in a comment on another of your questions, and you mentioned that another error message pops up:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
First of all, AllAlexaAndGoogleInfo does not have a dtype, as it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values. Check its dtype. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements like strings.
This is a minimal example that reproduces this situation:
X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape
The error message:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
So, check whether there are any non-numerical values in AllAlexaAndGoogleInfo and repair them before stacking.
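A hedged sketch of that check, assuming AllAlexaAndGoogleInfo is the single-column DataFrame read in the question:
import pandas as pd
# Coerce to numeric; anything that can't be parsed becomes NaN.
numeric = pd.to_numeric(AllAlexaAndGoogleInfo['alexarank'], errors='coerce')
print(AllAlexaAndGoogleInfo[numeric.isna()])  # the offending rows
# Once repaired (or dropped), keep the float version for stacking:
AllAlexaAndGoogleInfo = numeric.values.reshape(-1, 1)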
Use np.column_stack, like so:
X = np.column_stack((X, AllAlexaAndGoogleInfo))
From the docs:
Take a sequence of 1-D arrays and stack them as columns to make a
single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
Try:
X = np.hstack((X, AllAlexaAndGoogleInfo.values))
I don't have a running Pandas module, so I can't test it. But the DataFrame documentation describes .values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of the DataFrame.
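Combining the two suggestions above (.values for the dense column, scipy.sparse.hstack for the sparse TF-IDF matrix) might look like this; a hedged sketch, assuming the alexarank column is already numeric:
from scipy import sparse
X = sparse.hstack((X, AllAlexaAndGoogleInfo.values.astype('float64')))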
