Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if you have a color column (a categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', 'color=yellow', and 'color=unknown'. I begin with data in a pandas data frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me.
Pandas and get_dummies on the categorical columns of the data frame. This method works well as long as the original data frame contains all the available data, that is, as long as you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer values for a given variable. For example, it can happen that, whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up having fewer columns than the training set. (I'm also not sure how the new columns are sorted; even if both sets ended up with the same columns, they could be in a different order in each set.)
Sklearn and DictVectorizer. This solves the previous issue, as we can make sure that we apply the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data frame. If we want to recover the output as a pandas data frame, we need to (or at least this is how I do it): 1) build pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas data frame, columns=DictVectorizer's get_feature_names()) and 2) join the resulting data frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
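For illustration, a rough sketch of that roundtrip (column names are made up; on recent scikit-learn the method is get_feature_names_out(), on older versions get_feature_names()):
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df_train = pd.DataFrame({"color": ["red", "blue", "yellow"], "price": [10, 20, 30]})

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(df_train[["color"]].to_dict(orient="records"))

# 1) rebuild a data frame with the original index and the vectorizer's column names
encoded_df = pd.DataFrame(encoded, index=df_train.index, columns=dv.get_feature_names_out())

# 2) join it back to the numerical columns along the index
result = df_train[["price"]].join(encoded_df)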
Is there a better way to do a binary one-hot encoding within a pandas data frame if we have our data split into training and test sets?
If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try.
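That said, if you do have to encode the sets separately, one common workaround (a sketch, assuming train and test are the same data frames as above) is to align the test columns to the training columns:
# Encode each set on its own, then force the test set to have exactly the
# training set's columns: missing dummies are filled with 0, unseen ones are dropped.
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)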
You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car": pd.Series(["seat", "bmw"]).astype(pd.CategoricalDtype(categories=["seat", "bmw", "mercedes"])), "color": ["red", "green"]})
In [6]: df_train
Out[6]:
car color
0 seat red
1 bmw green
In [7]: pd.get_dummies(df_train )
Out[7]:
car_seat car_bmw car_mercedes color_green color_red
0 1 0 0 0 1
1 0 1 0 1 0
See this issue of Pandas.
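For the train/test situation in the question, one option (a sketch with made-up data, using the modern pd.CategoricalDtype spelling) is to give both frames the same categorical dtype, so that get_dummies always produces the same columns in the same order:
import pandas as pd

car_dtype = pd.CategoricalDtype(categories=["seat", "bmw", "mercedes"])

df_train = pd.DataFrame({"car": ["seat", "bmw"]}).astype({"car": car_dtype})
df_test = pd.DataFrame({"car": ["seat"]}).astype({"car": car_dtype})

# Both frames get the full set of dummy columns, in the same order,
# even though the test set never contains 'mercedes'.
print(pd.get_dummies(df_train).columns.tolist())
print(pd.get_dummies(df_test).columns.tolist())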
I am trying to use sklearn to train a decision tree based on my dataset.
When I was trying to slice the data into (outcome: Y, and predicting variables: X), it turned out that the outcome (my label) is in True/False:
#data slicing
X = df.values[:, 3:27]  # X holds the predictor variables, dropping unique_id and student name here
Y = df['OffTask'].values  # Y is our predicted value (outcome); it is in the 3rd column
This is what I do, but I do not know whether it is the right approach:
#convert the label "OffTask" to dummy
df1 = pd.get_dummies(df,columns=["OffTask"])
df1
My trouble is that the dataset df1 turns my label OffTask into OffTask_N and OffTask_Y.
Does someone know how to fix this?
get_dummies is used for converting nominal (string) values into indicator columns. It returns as many columns as there are unique string values in the column, e.g.:
df={'color':['red','green','blue'],'price':[1200,3000,2500]}
my_df=pd.DataFrame(df)
pd.get_dummies(my_df)
In your case you can drop the first dummy column: wherever the remaining dummy is 0, the row can be taken to belong to the dropped (first) category.
You can make pd.get_dummies return only one column by setting drop_first=True:
y = pd.get_dummies(df,columns=["OffTask"], drop_first=True)
But this is not the recommended way to convert the label to binary values. I would suggest using LabelBinarizer for this purpose.
Example:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit_transform(pd.DataFrame({'OffTask':['yes', 'no', 'no', 'yes']}))
# Output:
array([[1],
       [0],
       [0],
       [1]])
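If you later need the string labels back, the fitted binarizer can reverse the mapping (a small follow-up sketch):
import numpy as np

# Map 0/1 predictions back to the original labels.
lb.inverse_transform(np.array([[1], [0], [0], [1]]))
# -> array(['yes', 'no', 'no', 'yes'], ...)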
I'm using TensorFlow and its tf.learn API to create and train a DNNRegressor model. I have an integer feature column that is multivalent (I can have more than one integer value in that column for each row), and I use tf.contrib.layers.sparse_column_with_integerized_feature for this feature column.
Now my question is: what is the right delimiter for a multivalent feature column in a CSV file?
For example, suppose I have a CSV in which col2 is a multivalent feature (and not a one-hot feature):
1, 2, 1:2:3:4, 5
2, 1, 4:5, 6
As you can see, I use ':' to separate the integer feature values in col2, but it seems that's not right: I got this error when running the DNNRegressor with this feature column declared as tf.contrib.layers.sparse_column_with_integerized_feature:
'Value passed to parameter 'x' has DataType string not in list of allowed
values: int32, int64, float32, float64'.
I really appreciate your help
tf.contrib.layers.sparse_column_with_integerized_feature is for int32 or int64 values only, so it won't work exactly as you want.
But TensorFlow supports multi-dimensional numeric columns, so you can use tf.feature_column.numeric_column and specify the shape you have. Note that TensorFlow will expect all of those shapes to be the same, so you'll need to pad all of your values to a common shape.
The colon ':' delimiter is fine for multivalent columns; here's an example of how to read multiple values into a DataFrame with pandas (the question is about XML, but the same works for CSV). You can then pass this data frame to the model.train() function through an input_fn.
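A minimal sketch of that idea (file and column names are made up, and the input_fn wiring is left out): parse the colon-delimited column with pandas, pad to a common length, and declare a fixed-shape numeric column:
import pandas as pd
import tensorflow as tf

# Hypothetical file with four columns; col2 holds colon-delimited integer lists.
df = pd.read_csv("data.csv", header=None, names=["col0", "col1", "col2", "col3"])

# Split on ':' and pad every row with zeros up to the longest list.
values = df["col2"].astype(str).str.split(":").apply(lambda xs: [int(x) for x in xs])
max_len = int(values.map(len).max())
padded = pd.DataFrame(values.apply(lambda xs: xs + [0] * (max_len - len(xs))).tolist(), index=df.index)

# A fixed-shape numeric feature column matching the padded width.
col2_feature = tf.feature_column.numeric_column("col2", shape=(max_len,))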
I am using sklearn.ensemble.RandomForestClassifier to analyze data and I was puzzled to see NaN values in the predictions without any NaNs in the training set or in the test set.
print preds_y[preds_y.isnull().any(axis=1)].shape
print train_y[train_y.isnull().any(axis=1)].shape
print train_features[train_features.isnull().any(axis=1)].shape
print test_features[test_features.isnull().any(axis=1)].shape
> (4830, 1)
> (0, 1)
> (0, 22)
> (0, 22)
These NaN values are causing the call to sklearn.metrics.classification_report to fail with the following error:
> ValueError: Mix of label input types (string and number)
Right now I'm mostly interested in understanding why the random forest is spitting out NaNs. As soon as I figure that out, I can filter the results accordingly and see how well the method is performing.
Thanks in advance for your input.
(I'm sorry if this has been asked before. I searched for it but all the results I found concerned NaNs in the training data, which is not my issue at all.)
EDIT 1: Just to be clear, there are many valid predictions in the output:
print preds_y[~preds_y.isnull().any(axis=1)].shape
print train_y[~train_y.isnull().any(axis=1)].shape
> (11760, 1)
> (39749, 1)
EDIT 2:
As I wrote in a comment below, the original data has numeric and categorical columns. All the categorical columns are converted to numeric using pandas.get_dummies() before calling fit(). I convert the results back to a pandas.DataFrame and reconstruct the original categorical columns for readability. The two pandas.Series -- predicted and actual values -- that I feed to classification_report() contain only one dtype (category).
It seems that the NaNs in the predictions arise if the random forest predicts 0 for every dummy binary column corresponding to the original categorical column. I was not expecting this to happen so often -- it seems that 30% of my entries go unclassified -- but I'm not sure there is anything further to add on this issue.
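To illustrate why that yields NaN, here is a toy sketch (made-up dummy columns; the exact reverse-mapping code depends on how the categorical column is reconstructed):
import numpy as np
import pandas as pd

# Toy predictions for three dummy columns derived from one categorical label.
preds = pd.DataFrame({"label_a": [1, 0, 0],
                      "label_b": [0, 1, 0],
                      "label_c": [0, 0, 0]})   # last row: no dummy fires

# A typical way to map dummies back to the original category:
# rows where every dummy is 0 end up as NaN.
recovered = preds.apply(lambda row: row.idxmax() if row.any() else np.nan, axis=1)
print(recovered)   # 0 -> label_a, 1 -> label_b, 2 -> NaN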
You can first remove all NaNs by replacing them with zeros.
See this link.
Maybe use df.fillna(0), then you should be fine I suppose.
I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand of course I need to encode it.
What I don't understand is how to pass the encoded feature to the Logistic regression so it's processed as a categorical feature, and not interpreting the int value it got when encoding as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject but not very deeply.
Especially with the first one!
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0
1 0 1 0
1 0 0 1]
Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a matrix with less than full column rank: the first column is the sum of the last three.
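The snippet above is MATLAB-style; a rough pandas equivalent of the same idea (with a made-up 'animal' column) uses get_dummies with drop_first=True, which leaves one category out as the baseline and so avoids the rank problem described above:
import pandas as pd

df = pd.DataFrame({"animal": ["mouse", "cat", "dog"]})

# One category is dropped automatically and acts as the baseline,
# so the indicator columns plus a constant term stay full column rank.
indicators = pd.get_dummies(df["animal"], prefix="animal", drop_first=True)
print(indicators)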
The standard approach to converting categorical features into numerical ones is one-hot encoding (sklearn's OneHotEncoder).
These are completely different classes:
DictVectorizer.vocabulary_
A dictionary mapping feature names to feature indices.
I.e. after fit(), DictVectorizer knows all possible feature names, and it knows in which particular column it will place each particular value of a feature. So DictVectorizer.vocabulary_ contains indices of features, not values.
LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
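A small sketch contrasting the two (the exact indices and integer codes depend on the fitted data):
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

records = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]

dv = DictVectorizer(sparse=False)
dv.fit(records)
print(dv.vocabulary_)   # e.g. {'color=blue': 0, 'color=red': 1} -- feature name -> column index

le = LabelEncoder()
print(le.fit_transform(["red", "blue", "red"]))   # e.g. [1 0 1] -- one integer code per label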
Suppose the type of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:
import pandas as pd
catColumns = df.select_dtypes(['object']).columns
Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        le.fit(df[col])
        df[col] = le.transform(df[col])
I run an sk-learn classifier on a pandas dataframe (X). Since some data is missing, I use sk-learn's imputer like this:
imp=Imputer(strategy='mean',axis=0)
X=imp.fit_transform(X)
After doing that, however, my number of features is decreased, presumably because the imputer just gets rid of the empty columns.
That's fine, except that the imputer transforms my dataframe into a numpy ndarray, and thus I lose the column/feature names. I need them later on to identify the important features (with clf.feature_importances_).
How can I know the names of the features in clf.feature_importances_, if some of the columns of my initial dataframe have been dropped by the imputer?
You can do this:
invalid_mask = np.isnan(imp.statistics_)
valid_mask = np.logical_not(invalid_mask)
valid_idx, = np.where(valid_mask)
Now you have the old indexes (the indexes these columns had in the original matrix X) for the valid columns. You can use these indexes to look up the feature names in the column list of the original X.
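Putting it together, a rough sketch assuming the original data was a pandas DataFrame (the old Imputer has since been replaced by SimpleImputer, which behaves the same way here):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan], "c": [3.0, 4.0]})

imp = SimpleImputer(strategy="mean")
X = imp.fit_transform(X_df)               # the all-NaN column 'b' is dropped

valid_mask = np.logical_not(np.isnan(imp.statistics_))
kept_columns = X_df.columns[valid_mask]   # Index(['a', 'c'], dtype='object')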
It's more difficult than it should be. The answer is that SimpleImputer should get the argument add_indicator=True. Then, after fitting, simple_imputer.indicator_ holds another transformer of type sklearn.impute.MissingIndicator. This in turn has an attribute features_, which contains the indices of the features with missing values.
So it's roughly like this:
simple_imputer = SimpleImputer(add_indicator=True)
simple_imputer.fit(X)
print(simple_imputer.indicator_.features_)
I've implemented a thin wrapper around SimpleImputer, called SimpleImputerWithFeatureNames, which gives you feature names. It's available on github.
>> import openml_speed_dating_pipeline_steps as pipeline_steps
>> imputer = pipeline_steps.SimpleImputerWithFeatureNames()
>> imputer.fit(X_train[numeric_features])
>> imputer.get_feature_names()
[...]