I have just started with machine learning and am already stuck on a task.
I want to preprocess the table columns because the data in them is categorical. I have successfully converted the values in all columns except the last one. I load the data with dataset = pd.read_csv("car.data"); car.data is a file I downloaded from the internet. I am using pandas and the usual ML libraries (tensorflow, sklearn, numpy...).
When I run my code, all columns are filled with numbers and only the last one is filled with "None". When I run dataset.unacc.unique(), I get this output:
array([None], dtype=object), and it should not be of type object. The correct output is array([0, 1, 3, 2], dtype=int64).
Can anyone help me with this problem? I couldn't find an answer on the internet. Thank you in advance.
Here is my code:
dataset.unacc.unique()
This is the function that changes the values in a column:
def label_fixTarget(something):
    if something == 'unacc':
        return 0
    elif something == 'acc':
        return 1
    elif something == 'vgood':
        return 3
    elif something == 'good':
        return 2
dataset['unacc'] = dataset['unacc'].apply(label_fixTarget)
dataset.unacc.unique()
dataset.head()
And this is my new table:
If you know the values to be changed, you can create a dictionary and map it onto the column:
qualityToPoint ={
'TA' : 3,
'Fa' : 2,
'Gd' : 4,
'None': 0,
'Ex': 5,
'Po' : 1}
df['column'] = df['column'].map(qualityToPoint).astype('int')
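With this approach it is worth checking that every value in the column has an entry in the dictionary, because map leaves unmatched values as NaN. Here is a minimal sketch using the target labels from the question. The stray leading space in one value is a hypothetical cause, and the names target_map, raw and unmapped are made up for illustration:

```python
import pandas as pd

# Mapping for the car.data target column (label values taken from the question).
target_map = {"unacc": 0, "acc": 1, "good": 2, "vgood": 3}

# Hypothetical raw column: one value carries a stray leading space.
raw = pd.Series(["unacc", "acc", " vgood", "good"])

# map() returns NaN for every value that is not a key of the dictionary.
mapped = raw.map(target_map)
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)

# Once the values are cleaned, every row maps and the column can become int.
cleaned = raw.str.strip().map(target_map).astype(int)
print(cleaned.tolist())
```

Listing the unmapped values before calling astype(int) shows exactly which labels the dictionary is missing.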
This is not the correct way to encode categorical data.
To achieve what you want, you need to use sklearn.preprocessing.LabelEncoder.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'a': ['unacc', 'acc', 'good', 'vgood']})
label_encoder = LabelEncoder()
label_encoder.fit(df['a'])
# Putting in encoded categories into another column `encoded`
df['encoded'] = label_encoder.transform(df['a'])
print(df)
# This prints the following `df`
# a encoded
# 0 unacc 2
# 1 acc 0
# 2 good 1
# 3 vgood 3
After the call to fit, label_encoder holds all the information needed to transform categories to integers. Be careful: it cannot transform values it has not seen. For example, if I execute:
label_encoder.transform(['a', 'b'])
where neither 'a' nor 'b' was encountered during the call to fit, it raises an exception.
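A small sketch of that failure, assuming the same four labels as above. The try/except and the variable raised are only there to make the behaviour visible:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(["unacc", "acc", "good", "vgood"])

# 'a' and 'b' were never passed to fit, so transform refuses to encode them.
try:
    label_encoder.transform(["a", "b"])
    raised = False
except ValueError as exc:
    raised = True
    print("transform failed:", exc)
```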
How to decode from integers back to labels:
# Just like `transform`, we also have `inverse_transform`.
df['decoded'] = label_encoder.inverse_transform(df['encoded'])
print(df)
# This will print something like:
# a encoded decoded
# 0 unacc 2 unacc
# 1 acc 0 acc
# 2 good 1 good
# 3 vgood 3 vgood
So first I encoded column 'a' and put the encoded values into the 'encoded' column. Then, to test inverse_transform, I called it on the encoded values (the values in the 'encoded' column) and put the result in the 'decoded' column.
Columns 'a' and 'decoded' should be the same, and they are.
You can also print the classes that the label_encoder recognizes after the call to fit.
print(label_encoder.classes_)
# This will print
# array(['acc', 'good', 'unacc', 'vgood'], dtype=object)
Note: I put the results from transform() (which returns a numpy.array) into 'encoded' column in the same df and the results from inverse_transform() into 'decoded' column just to demonstrate that the decoded values must be same as that of initial values.
LabelEncoder scikit-learn documentation
I am trying to use sklearn to train a decision tree based on my dataset.
When I tried to split the data into the outcome (Y) and the predictor variables (X), it turned out that the outcome (my label) is a True/False value:
#data slicing
X = df.values[:,3:27] #X is the set of predictor variables, dropping unique_id and student name here
Y = df['OffTask'].values #Y is our predicted value (outcome), it is in the 3rd column
This is what I do, but I do not know whether it is the right approach:
#convert the label "OffTask" to dummy
df1 = pd.get_dummies(df,columns=["OffTask"])
df1
My trouble is that in the resulting dataset df1 my label OffTask is split into two columns, OffTask_N and OffTask_Y.
Does anyone know how to fix this?
get_dummies converts nominal string values into integer indicator columns. It returns as many columns as there are unique string values in the column, e.g.:
df={'color':['red','green','blue'],'price':[1200,3000,2500]}
my_df=pd.DataFrame(df)
pd.get_dummies(my_df)
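To make the result concrete, the same example can be run with the resulting column names printed. This is only a sketch; the variable dummies is introduced here for illustration:

```python
import pandas as pd

my_df = pd.DataFrame({"color": ["red", "green", "blue"],
                      "price": [1200, 3000, 2500]})

# The numeric column is kept as-is; the string column is replaced by
# one indicator column per unique value.
dummies = pd.get_dummies(my_df)
print(list(dummies.columns))
# ['price', 'color_blue', 'color_green', 'color_red']
```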
In your case you can drop the first column: a row in which all remaining indicator columns are 0 belongs to the dropped first value.
You could make pd.get_dummies return only one column by setting drop_first=True:
y = pd.get_dummies(df,columns=["OffTask"], drop_first=True)
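A small sketch of what drop_first=True produces for a Y/N label. The toy frame and the score column are made up for illustration; only the OffTask name comes from the question:

```python
import pandas as pd

# Toy data: 'OffTask' is the Y/N label, 'score' is a hypothetical feature.
df = pd.DataFrame({"OffTask": ["Y", "N", "N", "Y"],
                   "score": [0.2, 0.9, 0.7, 0.1]})

# drop_first=True drops OffTask_N and keeps only OffTask_Y.
encoded = pd.get_dummies(df, columns=["OffTask"], drop_first=True)
print(list(encoded.columns))

# The remaining indicator column can be used directly as a 0/1 label.
y = encoded["OffTask_Y"].astype(int)
print(y.tolist())
```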
But this is not the recommended way to convert a label to binary values. I would suggest using LabelBinarizer for this purpose.
Example:
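import pandas as pd
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
# pass the label as a 1D Series; classes are sorted, so 'no' -> 0 and 'yes' -> 1
lb.fit_transform(pd.Series(['yes', 'no', 'no', 'yes'], name='OffTask'))
# output:
# array([[1],
#        [0],
#        [0],
#        [1]])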
Can anyone suggest the best way to encode string features when I have more than 500 unique values? Does this count as categorical data?
I basically need to normalize data whose string features have a huge number of unique values and whose adjacent features are correlated (e.g. col1 and col2 take a particular combination for one class in a classification problem; similarly, col3 and col4 follow a fixed pattern for each class).
How do I encode my data in this scenario before feeding it to an ML algorithm?
There are several ways to encode categorical features. The best way really depends on your dataset and which ML algorithm you are going to use, so you could try different encoding schemes and pick the one that has the best results.
I've worked with categorical features with hundreds of unique values (e.g. product brands) together with tree-based algorithms, and a label encoder worked well there.
For example you could use the scikit-learn label encoder:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
You can do that in pandas as well, for example, if you have a column with the string categories you want to encode you could try this:
df["categorical_feature"] = df["categorical_feature"].astype('category')
df["categorical_feature_enc"] = df["categorical_feature"].cat.codes
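A runnable sketch of the pandas variant, reusing the city names from the LabelEncoder example. The second half is an assumption about how you might reuse the same categories on new data; new and new_codes are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"categorical_feature": ["paris", "paris", "tokyo", "amsterdam"]})
df["categorical_feature"] = df["categorical_feature"].astype("category")

# Categories are sorted, so amsterdam -> 0, paris -> 1, tokyo -> 2.
df["categorical_feature_enc"] = df["categorical_feature"].cat.codes
print(df["categorical_feature_enc"].tolist())

# Reusing the fitted categories on new data: values that are not among
# the known categories get the code -1 instead of raising an error.
new = pd.Series(["tokyo", "berlin"])
new_codes = pd.Categorical(new, categories=df["categorical_feature"].cat.categories).codes
print(list(new_codes))
```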
Another useful encoding you could try is one-hot encoding. However, since you have a lot of categories to encode, it would add n columns to your dataset per categorical feature (n = number of categories). See pandas get_dummies for an example.
I have a series like:
df['ID'] = ['ABC123', 'IDF345', ...]
I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.
During the training, I'm doing as follows:
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
But, now for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id i.e., if same values are present then transform it according to the above label encoder, otherwise assign a new numerical value.
In the test file, I was doing as follows:
new_df['ID'] = le_id.transform(new_df.ID)
But, I'm getting the following error: ValueError: y contains new labels
How do I fix this?? Thanks!
UPDATE:
So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn the characteristics where a 'High' is given, where a 'Low' is given from the training dataset. For example, below a 'High' is given when there are multiple entries with same BankNum and different IDs.
df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
And then predict it on something like:
BankNum | ID |
00982222 | AB999 |
00982222 | AB999 |
00981111 | AB890 |
I'm doing something like this:
df['BankNum'] = df.BankNum.astype(np.float128)
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)
I think the error message is very clear: your test dataset contains ID labels which were not included in your training dataset. For these items, the LabelEncoder cannot find a suitable numeric value. There are a few ways to solve this problem. You can try to balance your dataset so that each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.
One possible solution is to search through your dataset at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code as it is.
Another possible solution is to check that the test data contains only labels that were seen during training. If there is a new label, set it to some fallback value like unknown_id (or something similar). Doing this puts all new, unknown IDs into one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
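The fallback idea could be sketched roughly like this. The IDs are made up, and adding the placeholder to the data passed to fit is one possible way to do it, not the only one:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test IDs; 'AB999' never occurs in the training data.
train_ids = pd.Series(["AB123", "ED245", "ED343"])
test_ids = pd.Series(["ED245", "AB999"])

UNKNOWN = "unknown_id"  # fallback label for IDs never seen during training

# Fit on the training IDs plus the fallback label so it gets its own code.
le_id = LabelEncoder()
le_id.fit(list(train_ids) + [UNKNOWN])

# Replace every unseen test ID with the fallback label before transforming.
known = list(le_id.classes_)
test_clean = test_ids.where(test_ids.isin(known), UNKNOWN)
test_encoded = le_id.transform(test_clean)
print(list(test_encoded))
```

All unknown IDs end up sharing one code, so the model cannot tell them apart, but transform no longer raises.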
You can try the solution from "sklearn.LabelEncoder with never seen before values": https://stackoverflow.com/a/48169252/9043549
The idea is to create a dictionary from the classes, then map the column and fill new classes with some known value.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
suf="_le"
col="a"
df[col+suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col='b'
df[col+suf]=df[col].map(dic).fillna(dic["c"]).astype(int)
If your data is a pd.DataFrame, I suggest this simple solution...
I built a custom transformer that maps each categorical feature to integers. Once it is fitted, you can transform all the data you want. It can compute either a simple label encoding or a one-hot encoding.
If new unseen categories or NaNs are present in new data:
1] For label encoding, 0 is a special token reserved for mapping these cases.
2] For onehot encoding, all the onehot columns are zeros in these cases.
import numpy as np
import pandas as pd

class FeatureTransformer:

    def __init__(self, categorical_features):
        self.categorical_features = categorical_features

    def fit(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")
        if not isinstance(self.categorical_features, list):
            raise ValueError(
                "Pass categorical_features as a list of column names")
        # map every category seen in training to an integer >= 1 (0 is reserved)
        self.encoding = {}
        for c in self.categorical_features:
            _, int_id = X[c].factorize()
            self.encoding[c] = dict(zip(list(int_id), range(1, len(int_id) + 1)))
        return self

    def transform(self, X, onehot=True):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")
        if not hasattr(self, 'encoding'):
            raise AttributeError("FeatureTransformer must be fitted")
        df = X.drop(self.categorical_features, axis=1)
        if onehot:  # one-hot encoding
            for c in sorted(self.categorical_features):
                categories = X[c].map(self.encoding[c]).values
                for val in self.encoding[c].values():
                    df["{}_{}".format(c, val)] = (categories == val).astype('int16')
        else:  # label encoding
            for c in sorted(self.categorical_features):
                df[c] = X[c].map(self.encoding[c]).fillna(0)
        return df
Usage:
X_train = pd.DataFrame(np.random.randint(10,20, (100,10)))
X_test = pd.DataFrame(np.random.randint(20,30, (100,10)))
ft = FeatureTransformer(categorical_features=[0,1,3])
ft.fit(X_train)
ft.transform(X_test, onehot=False).shape
This is in fact a known bug in LabelEncoder: BUG for fit_transform ... Basically, you have to fit it first and then transform, and then it works fine. A suggestion is to keep a dictionary of your encoders, one per column, so that you can use inverse_transform to retrieve the original categorical values.
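Keeping one encoder per column could look something like the sketch below. The frame and its column names are made up for illustration; encoders, encoded and decoded are illustrative names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with two categorical columns.
df = pd.DataFrame({"buying": ["low", "high", "low"],
                   "safety": ["med", "med", "high"]})

# One fitted encoder per column, kept in a dictionary keyed by column name.
encoders = {}
encoded = df.copy()
for col in df.columns:
    encoders[col] = LabelEncoder().fit(df[col])
    encoded[col] = encoders[col].transform(df[col])

# Later, the same dictionary is used to recover the original values.
decoded = encoded.copy()
for col, enc in encoders.items():
    decoded[col] = enc.inverse_transform(encoded[col])

print(encoded["buying"].tolist())
print(decoded["buying"].tolist())
```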
I'm able to mentally process operations better when dealing in DataFrames. The approach below fits and transforms the LabelEncoder() using the training data, then uses a series of pd.merge joins to map the trained fit/transform encoder values to the test data. When there is a test data value not seen in the training data, the code defaults to the max trained encoder value + 1.
# encode class values as integers
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y_train = encoder.transform(y_train)
# make a dataframe of the unique train values and their corresponding encoded integers
y_map = pd.DataFrame({'y_train': y_train, 'encoded_y_train': encoded_y_train})
y_map = y_map.drop_duplicates()
# map the unique test values to the trained encoded integers
y_test_df = pd.DataFrame({'y_test': y_test})
y_test_unique = y_test_df.drop_duplicates()
y_join = pd.merge(y_test_unique, y_map,
                  left_on='y_test', right_on='y_train',
                  how='left')
# if the test category is not found in the training category group, then make the
# value the maximum value of the training group + 1
y_join['encoded_y_test'] = np.where(y_join['encoded_y_train'].isnull(),
                                    y_map['encoded_y_train'].max() + 1,
                                    y_join['encoded_y_train']).astype('int')
encoded_y_test = pd.merge(y_test_df, y_join, on='y_test', how='left') \
    .encoded_y_test.values
I found an easy hack around this issue.
Assuming X is the dataframe of features,
First, we create a list of (index, column name) tuples, where the index starts from 0 and the name is that of a categorical column. We can do this easily with enumerate.
cat_cols_enum = list(enumerate(X.select_dtypes(include = ['O']).columns))
Then the idea is to create a list of label encoders whose dimension is equal to the number of qualitative(categorical) columns present in the dataframe X.
le = [LabelEncoder() for i in range(len(cat_cols_enum))]
The last part is fitting each of the label encoders in the list with the unique values of the corresponding categorical column.
for i in cat_cols_enum:
    le[i[0]].fit(X[i[1]].value_counts().index)
Now, we can transform the labels to their respective encodings using
for i in cat_cols_enum:
    X[i[1]] = le[i[0]].transform(X[i[1]])
This error occurs when transform receives a value that the LabelEncoder has not seen, because that value was not present in the data passed to fit_transform during training. There are two workarounds: either fit on all the unique values (train and test) if you are sure that no new values will appear later, or use a different encoding method that suits the problem, such as HashingEncoder.
Here is an example for the case where no further new values will appear in testing:
le_id.fit_transform(list(set(df['ID'].unique()).union(set(new_df['ID'].unique()))))
new_df['ID'] = le_id.transform(new_df.ID)
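If new values can still appear later, another option (not the approach above, just an alternative) is scikit-learn's OrdinalEncoder, which since version 0.24 can map unseen categories to a fixed value. A minimal sketch with made-up IDs:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: 'AB999' does not occur in the training frame.
train = pd.DataFrame({"ID": ["AB123", "ED245", "ED343"]})
new = pd.DataFrame({"ID": ["ED245", "AB999"]})

# Unseen categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = enc.fit_transform(train[["ID"]])
new_enc = enc.transform(new[["ID"]])
print(new_enc.ravel().tolist())
```

OrdinalEncoder is meant for feature columns (2D input), which matches the ID column in the question better than LabelEncoder, which is designed for target labels.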
I also encountered the exact same error and was able to fix it.
My array: [[63, 1, 'True' , 1850000000.0, 61666.67]]
When I used the following code, the error occurred:
pa_ = preprocessing.LabelEncoder()
pa_.fit([True,False])
data[:,2] = pa_.transform(data[:,2])
The reason for the error was that the value in the array was the string 'True',
while I had fitted the encoder on booleans. I changed the fitted values from bool to string, and the error was fixed:
pa_.fit(['True','False'])
I hope my answer is helpful.
I hope this helps someone, as it's more recent.
sklearn's fit_transform performs the fit step and the transform step in a single call on the labels.
To solve the problem of the Y label throwing an error for unseen values, use:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(Col)
This solves it!
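I used
le.fit_transform(Col)
and I was able to resolve the issue. It does both the fit and the transform, so no value in the column passed to it is unknown to the encoder. Note that this refits the encoder, so the codes only match the training codes if the column contains the same set of values.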
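A quick check of what happens when fit_transform is called a second time on different data, using made-up IDs. It is only meant to show that the mapping is rebuilt on each call:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# First call: fitted on the training IDs, 'ED245' gets code 1.
train_codes = le.fit_transform(["AB123", "ED245", "ED343"])

# Second call: the encoder is refitted on the new IDs only,
# so the same 'ED245' now gets code 0.
test_codes = le.fit_transform(["ED245", "ZZ999"])

print(train_codes[1], test_codes[0])
```

No error is raised, but a model trained on the first set of codes would see a different number for the same ID.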
I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand of course I need to encode it.
What I don't understand is how to pass the encoded feature to the Logistic regression so it's processed as a categorical feature, and not interpreting the int value it got when encoding as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject but not very deeply.
Especially with the first one!
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0
1 0 1 0
1 0 0 1]
Notice how including a constant term together with indicators for mouse, cat and dog leads to a matrix with less than full column rank: the first column is the sum of the last three.
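The rank argument can be checked numerically. The sketch below uses Python with NumPy rather than the MATLAB-style notation above, and adds a fourth (made-up) observation so that the number of rows is not the limiting factor:

```python
import numpy as np

# Columns: constant, mouse, cat, dog. Rows: mouse, cat, dog, cat.
X_all = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 1, 0]])

# The constant column equals mouse + cat + dog, so the rank is 3, not 4.
rank_all = np.linalg.matrix_rank(X_all)

# Dropping the mouse indicator leaves 3 columns with rank 3 (full column rank).
X_drop = X_all[:, [0, 2, 3]]
rank_drop = np.linalg.matrix_rank(X_drop)

print(rank_all, X_all.shape[1])
print(rank_drop, X_drop.shape[1])
```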
The standard approach to converting categorical features into numerical ones is one-hot encoding (OneHotEncoder).
These are completely different classes:
DictVectorizer.vocabulary_
A dictionary mapping feature names to feature indices.
I.e., after fit(), DictVectorizer knows all possible feature names and in which column it will place the value of each feature. So DictVectorizer.vocabulary_ contains the indices of features, not their values.
LabelEncoder, by contrast, maps each possible label (a label can be a string or an integer) to an integer value and returns a 1D vector of these integer values.
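One way to feed a one-hot encoded feature into LogisticRegression is to put the encoder and the model into a single pipeline. This is only a sketch: the toy data, the column names and the choice of handle_unknown="ignore" are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'animal' is categorical, 'weight' is numeric, 'label' is the target.
df = pd.DataFrame({
    "animal": ["mouse", "cat", "dog", "cat", "dog", "mouse"],
    "weight": [0.02, 4.0, 20.0, 5.0, 25.0, 0.03],
    "label": [0, 1, 1, 1, 1, 0],
})

# One-hot encode only the categorical column; pass the numeric one through.
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["animal"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df[["animal", "weight"]], df["label"])

# An animal not seen in training gets all-zero indicator columns.
pred = model.predict(pd.DataFrame({"animal": ["horse"], "weight": [3.0]}))
print(pred)
```

Because each category becomes its own 0/1 column, the regression fits a separate coefficient per category instead of treating an integer code as a quantity.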
Suppose the dtype of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:
import pandas as pd
catColumns = df.select_dtypes(['object']).columns
Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if (n > 2):
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        le.fit(df[col])
        df[col] = le.transform(df[col])
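For comparison, a similar result can be obtained in one call with pd.get_dummies and drop_first=True, which drops one category per variable for binary and multi-category columns alike. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame: one binary and one three-valued categorical column.
df = pd.DataFrame({"sex": ["m", "f", "m"],
                   "color": ["red", "green", "blue"],
                   "price": [1, 2, 3]})

# drop_first=True removes the first category of every encoded column.
encoded = pd.get_dummies(df, columns=["sex", "color"], drop_first=True)
print(list(encoded.columns))
# ['price', 'sex_m', 'color_green', 'color_red']
```

The new columns are prefixed with the original column name, which avoids name clashes when two variables share a category value.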