sklearn DecisionTreeClassifier using strings that should be considered categorical - python

I am training an sklearn.tree.DecisionTreeClassifier. I start out with a pandas.core.frame.DataFrame. Some of the columns of this data frame are strings that really should be categorical. For example, 'Color' is one such column and has values such as 'black', 'white', 'red', and so on. So I convert this column to be of type category like this:
data['Color'] = data['Color'].astype('category')
This works just fine. Now I split my data frame using sklearn.cross_validation.train_test_split, like this:
X = data.drop(['OutcomeType'], axis=1)
y = data['OutcomeType']
X_train, X_test, y_train, y_test = train_test_split(X, y)
Now X_train has type numpy.ndarray. However, the 'Color' values are no longer categorical; they are back to being strings.
So when I make the following calls:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
I get the following error:
ValueError: could not convert string to float: Black
What do I need to do to get this working correctly?

If you want to convert your categorical column to an integer, you can use data.Color.cat.codes; this uses data.Color.cat.categories to perform the mapping (the i'th array element gets mapped to the integer i)
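For example, a minimal sketch with made-up data (the frame and values here are illustrative, not from the original post):
import pandas as pd

data = pd.DataFrame({'Color': ['black', 'white', 'red', 'black']})
data['Color'] = data['Color'].astype('category')
print(data.Color.cat.categories.tolist())  # ['black', 'red', 'white']
print(data.Color.cat.codes.tolist())       # [0, 2, 1, 0]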

As ayhan said, a workaround is to create dummy features from your 'Color' variable (a pretty common approach with decision trees / random forests).
You could use something like this:
import pandas as pd

def feature_to_dummy(df, column, drop=False):
    ''' Take a Series from a dataframe,
    convert it to dummies named like column_value.
    - df is a dataframe
    - column is the name of the column to be transformed
    - if drop is True, the original column is removed from the dataframe'''
    tmp = pd.get_dummies(df[column], prefix=column, prefix_sep='_')
    # join_axes was removed in pandas 1.0; plain concat aligns on df.index here
    df = pd.concat([df, tmp], axis=1)
    if drop:
        del df[column]
    return df
See documentation for pandas.get_dummies
Example
df
Out[1]:
color
0 red
1 black
2 green
df_dummy = feature_to_dummy(df, 'color', drop=True)
df_dummy
Out[2]:
color_black color_green color_red
0 0 0 1
1 1 0 0
2 0 1 0
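Putting this back into the original question, a sketch (assuming the frame also contains the other feature columns and the 'OutcomeType' target, and using the current sklearn.model_selection import rather than the old cross_validation module):
from sklearn import tree
from sklearn.model_selection import train_test_split

X = feature_to_dummy(data.drop(['OutcomeType'], axis=1), 'Color', drop=True)
y = data['OutcomeType']
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)  # no string columns left, so no ValueError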

Related

Concat created Nan values even after index_reset

I want to create a csv file that combines the train and test data and labels to use for a project. The problem is that in the concat function, even after resetting the index, the labels are still NaN and I don't understand what is wrong. The datasets are at this link: https://wetransfer.com/downloads/9f0562b7ec341ebb663262af78971b8020211228154538/84d58d
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
data.to_csv('new1.csv', index=False)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
data2.to_csv('new2.csv', index=False)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
train = pd.concat([data_labels, data], axis=1, join='inner')
print(train.shape)
test = pd.concat([data2_labels, data2], axis=1, join='inner')
print(test.shape)
test.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
frame = pd.concat([train, test], axis=0)
print(frame)
I suspect what's happening is that you have duplicate index values before the concat(). (They're possibly only duplicated between the train and test sets, not necessarily within each set separately.) That can throw off concat(), since index values are assumed to be unique, and it may compensate by setting some to NaN. The calls to reset_index() give each frame, separately, an index starting from 0.
To fix this: Set ignore_index=True in pd.concat(). From the docs:
ignore_index: bool, default False If True, do not use the index values
along the concatenation axis. The resulting axis will be labeled 0, …,
n - 1. This is useful if you are concatenating objects where the
concatenation axis does not have meaningful indexing information. Note
the index values on the other axes are still respected in the join.
If that doesn't work, check: Do test & train have NaNs in the index before concatenation and after reset_index()? They shouldn't, but check. If they do, those will carry over into the concat.
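A minimal sketch of what ignore_index changes, with made-up frames:
import pandas as pd

train = pd.DataFrame({'label': ['a', 'b'], 'x': [1, 2]})  # index 0, 1
test = pd.DataFrame({'label': ['c', 'd'], 'x': [3, 4]})   # index 0, 1 again
frame = pd.concat([train, test], axis=0, ignore_index=True)
print(frame.index.tolist())  # [0, 1, 2, 3] instead of the duplicated [0, 1, 0, 1]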
I just did the concats in a different order and it worked.
The NaNs were the result of not merging the labels correctly. Instead of creating one single column with labels, I had created two half-empty columns: one with the train labels and one with the test labels.
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
print(data.shape)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
print(data2.shape)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)
#concat data without labels
frames = [data, data2]
d = pd.concat(frames)
#concat labels
l = data_labels.append(data2_labels)
#create the original dataset
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
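Note: in recent pandas versions DataFrame.append has been removed, so the append call above would become l = pd.concat([data_labels, data2_labels]).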

How to run model on new data that requires pd.get_dummies

I have a model that runs the following:
import pandas as pd
import numpy as np
# initialize list of lists
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
, ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
, ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category'])
print(df.head(2))
Name Attempts Score Category
0 tom 10 1 a
1 tom 15 5 a
Then I have created a dummy df to use in the model using the following code:
from sklearn.linear_model import LogisticRegression
df_dum = pd.get_dummies(df)
print(df_dum.head(2))
Attempts Score Name_matt Name_tom Category_a Category_b Category_c
0 10 1 0 1 1 0 0
1 15 5 0 1 1 0 0
Then I have created the following model:
#Model
X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values
#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]
#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)
df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])
print(dfpredictions)
Name Attempts Score Category predictions
14 matt 18 1 a 1.0
15 matt 15 6 a 1.0
16 matt 17 3 a 1.0
17 matt 14 7 c 1.0
18 matt 16 6 b 1.0
19 matt 10 2 b 1.0
Now I have new data which i would like to predict:
newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]
newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category'])
print(newdf)
Name Attempts Category
0 tom 10 a
1 tom 15 a
2 tom 14 a
Then create dummies and run prediction
newpredict = pd.get_dummies(newdf)
predict = model.predict(newpredict)
Output:
ValueError: X has 3 features per sample; expecting 6
Which makes sense because there are no categories b and c and no name called matt.
My question is: what is the best way to set this model up, given that my new data won't always have the full set of columns used in the original data? Each day I have new data, so I'm not quite sure of the most efficient and error-free way.
This is example data - my real dataset has 2000 columns after running pd.get_dummies. Thanks very much!
Let me explain Nicolas's and BlueSkyz's recommendation in a bit more detail.
pd.get_dummies is useful when you are sure that there will not be any new categories for a specific categorical variable in production/new data set, e.g. Gender, Products, etc. based on your Company or Database's internal data classification/consistency rules.
However, for the majority of machine learning tasks where you can expect to have new categories in the future which were not used in model training, sklearn.OneHotEncoder should be the standard choice. handle_unknown parameter of sklearn.OneHotEncoder can be set to 'ignore' to do just that: ignore new categories when applying the encoder in future. From the documentation:
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
The full flow based on OneHotEncoding for your example is as below:
# Create a categorical boolean mask
categorical_feature_mask = df.dtypes == object
# Filter out the categorical columns into a list for easy reference later on in case you have more than a couple categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()
# Instantiate the OneHotEncoder Object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse = False)
# Apply ohe on data
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns = categorical_cols, axis=1)
# The following code is for your newdf after training and testing on original df
# Apply ohe on newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df_new = pd.DataFrame(cat_ohe_new, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns = categorical_cols, axis=1)
# predict on df_ohe_new
predict = model.predict(df_ohe_new)
Output (that you can assign back to newdf):
array([1, 1, 1])
However, if you really want to use pd.get_dummies only, then the following can work as well:
newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)
The above code snippet makes sure that your new dummies frame (newpredict) has the same columns as the original df_dum (filled with 0 where a column is missing) and drops the 'Score' column. The output here is the same as above. It also ensures that any categorical values present in the new data set but not in the original training data are removed, while keeping the column order the same as in the original df.
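For the toy data above, the reindexed newpredict would look roughly like this (column order taken from the df_dum shown earlier, with dummy columns that do not occur in the new data filled with 0):
   Attempts  Name_matt  Name_tom  Category_a  Category_b  Category_c
0        10          0         1           1           0           0
1        15          0         1           1           0           0
2        14          0         1           1           0           0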
EDIT:
One thing I forgot to add is that pd.get_dummies is usually much faster to execute than sklearn.OneHotEncoder

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my data frame (df) features columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to rebuild the dataframe (adding the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result from scale contains the scaled columns plus the one column that is not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you create a new feature. Use column names instead, e.g.
using scikit-learn:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it back to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis = 1, inplace = True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis= 1)
assert old_shape == df.shape, "something went wrong!"
or you can use a function like this if you prefer not to split and join the data back:
import numpy as np
def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)
# apply to whole columns (Series) rather than element-wise with .map, so np.std is meaningful
for col in features:
    df[col] = normalize(df[col])
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
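Another option, if scikit-learn is available, is ColumnTransformer with remainder='passthrough'; a sketch, assuming the target really is the last column:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

feature_cols = list(df.columns[:-1])
ct = ColumnTransformer([('scale', StandardScaler(), feature_cols)], remainder='passthrough')
# transformed columns come first, then the passthrough target, so the names below line up
df_scaled = pd.DataFrame(ct.fit_transform(df), columns=feature_cols + [df.columns[-1]], index=df.index)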

Linear Regression prediction by Date in python

I have converted the date into numerical values, but I am stuck on the next step: how do I prepare the data for prediction, and how do I use the date for prediction in Python? How do I count the eventhappen attribute? Please guide me and improve my code where it does not make sense. Below is my code:
#Here is Dataset
date Eventhappen
2016-01-14 A
2016-01-15 C
2016-01-16 B
2016-01-17 A
2016-01-18 C
2016-02-18 B
#Converting Date into Numerical Value
df['Dispatch_Date_Time'] = pd.to_datetime(df['Dispatch_Date_Time'])
df.set_index('Dispatch_Date_Time', inplace=True)
df.sort_index(inplace=True)
df['month'] = df.index.month
df['year'] = df.index.year
df['day'] = df.index.day
df['eventhappen'] = 1
#Preparing the data
X = df[['year']]
y = df['eventhappen']
#Trainng the Algorithm
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Making the Predictions
y_pred = regressor.predict(X_test)
#Plotting the Least Square Line
sns.pairplot(df, x_vars=['year'], y_vars='eventhappen', size=7, aspect=0.7, kind='reg')
There is a lot of confusion in your code, for me at least: the column names used in the processing are not the same as in the dataset. You have two scenarios to consider:
SN-A: If you want to predict which event happens on some future date, the target column 'Eventhappen' is categorical, so you have a multi-class classification task, not a regression one. You should encode the target column, split your dataset with a train/test split, and finally fit a classifier to predict the event on a future date.
SN-B: If you want to predict the number of events happening on some future date, then you are on the right track: you need a numerical column to predict, namely the count. That means this line of code should not be a constant:
df['eventhappen'] = 1
Once you have it, you should consider some time-series techniques (power transformations, lags, ...), then split into train/test datasets and finally implement and evaluate your regressor model. A sketch of the counting step is below.
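For scenario SN-B, a minimal sketch of turning the raw event log into a daily count target (using the column names of the small dataset above, which differ from the Dispatch_Date_Time code):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date').size().reset_index(name='eventhappen')
daily['year'] = daily['date'].dt.year
daily['month'] = daily['date'].dt.month
X = daily[['year', 'month']]
y = daily['eventhappen']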
Use this function to extract the features needed from the date column and then use them directly in your machine learning model. You can also encode the cyclic features, which gives the model the ability to pick up cyclic patterns in the data (see the sin/cos sketch after the example below).
import pandas as pd

def transform_col_date(data, date_col):
    '''
    data : DataFrame (your dataset)
    date_col : String (name of the date column)
    '''
    data_ = data.copy()
    data_.reset_index(inplace=True)
    data_[date_col] = pd.to_datetime(data_[date_col], infer_datetime_format=True)
    data_['day'] = data_[date_col].dt.day
    data_['month'] = data_[date_col].dt.month
    data_['dayofweek'] = data_[date_col].dt.dayofweek
    data_['dayofyear'] = data_[date_col].dt.dayofyear
    data_['quarter'] = data_[date_col].dt.quarter
    # .dt.weekofyear was removed in newer pandas; isocalendar().week is the replacement
    data_['weekofyear'] = data_[date_col].dt.isocalendar().week
    data_['year'] = data_[date_col].dt.year
    return data_
#in your case
data = transform_col_date(df, 'date')
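For the cyclic encoding mentioned above, a common sin/cos sketch on the month column produced by the function (an assumption, not part of the original answer):
import numpy as np

data['month_sin'] = np.sin(2 * np.pi * data['month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['month'] / 12)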

Very Large Values Predicted for Linear Regression

I'm trying to run a linear regression in python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to do one hot encoding for the non-numeric columns and attach the new, numeric, columns to the old dataframe and drop the non-numeric columns. This is done on both the training data and test data.
I then took the intersection of the two sets of columns (since some encodings were only present in the testing data). Afterwards, the data goes into a linear regression. The code is the following:
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace = True)
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace = True)
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']
lm = LinearRegression(normalize = False)
lm.fit(X, y)
import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))
The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.
Also, if there is a better way to do this I'd be eager to hear about it.
You seem to be encountering collinearity due to the introduction of categorical variables into the feature columns, since the sum of the one-hot encoded columns for a variable is always 1.
If you have one categorical variable, you need to set fit_intercept=False in your LinearRegression (or drop one of the feature columns of the one-hot coded variable).
If you have more than one categorical variable, you need to drop one feature column for each category to break the collinearity, as shown in the sketch below.
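With pd.get_dummies that is a one-liner; a sketch reusing the question's variable names (the small demo that follows then shows the same intercept/collinearity effect numerically):
dummies = pandas.get_dummies(train[non_numeric], drop_first=True)  # drops one level per categorical column
train = pandas.concat([train.drop(non_numeric, axis=1), dummies], axis=1)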
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
In [72]:
df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
Out[72]:
C1 C2 C3 y
0 1 0 0 12.4
1 1 0 0 11.9
2 0 1 0 8.3
3 0 1 0 3.1
4 0 0 1 5.4
5 0 0 1 6.2
In [73]:
X = df.iloc[:,0:3]
y = df.iloc[:,-1]
In [74]:
reg = LinearRegression()
reg.fit(X,y)
Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [75]:
reg.coef_,reg.intercept_
Out[75]:
(array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)
We find that the coefficients for C1, C2, C3 do not make sense for the given X.
In [76]:
reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X,y)
Out[76]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
In [77]:
reg1.coef_
Out[77]:
array([ 12.15, 5.7 , 5.8 ])
We find that the coefficients make much more sense when fit_intercept is set to False.
A detailed explanation of a similar question is given below.
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.
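A sketch of that fix, reusing the variable names above: run get_dummies on the concatenated frames, then split back so train and test get identical dummy columns (note that SalePrice will be NaN in the test part and should be dropped there):
import pandas

combined = pandas.concat([train, test], keys=['train', 'test'])
combined = pandas.get_dummies(combined)  # identical dummy columns for both parts
train = combined.xs('train')
test = combined.xs('test')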
