All independent variables are categorical and the dependent (target) variable is continuous - Python

This is the data I need to build the model on: a dataframe with 834 rows and 4 columns ('Size', 'Sector', 'Road Connectivity', 'Price').
The aim is to train a model to predict the price.
'Size', 'Sector' and 'Road Connectivity' are the 3 features assigned to the X variable.
'Price', i.e. our target feature, is assigned to the y variable.
I have made a pipeline which consists of a one-hot encoder and a linear regressor; below is the code:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

# One-hot encode the categorical features, then fit a linear regression
ohc = OneHotEncoder(categories="auto")
lr = LinearRegression(fit_intercept=True, normalize=True)
pipe = make_pipeline(ohc, lr)

# Cross-validated score
kfolds = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(pipe, X, y, cv=kfolds).mean()
# output = 0.8970496076598085

# Fit on the full data and predict a new sample
xinp = [['04M', 'Sec 10', 'C road']]
pipe.fit(X, y)
pipe.predict(xinp)
Now, when I pass the values to the pipeline to predict, it raises an error:
"""Found unknown categories ['Sec 10'] in column 1 during transform"""
Any suggestions that help build the model are appreciated.

It looks like you provided a category (the 'Sec 10' value in xinp) that was not present in the training data, so it cannot be one-hot encoded because there is no dummy variable (no corresponding binary column) for it. One possible solution is the following:
ohc = OneHotEncoder(categories="auto", handle_unknown="ignore")
From the scikit-learn OneHotEncoder documentation:
handle_unknown : {'error', 'ignore'}, default='error'
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
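
Below is a minimal end-to-end sketch with made-up toy data (the category values are hypothetical) showing that, with handle_unknown="ignore", a previously unseen category such as 'Sec 10' no longer raises during predict; it is simply encoded as all zeros. Note that the normalize argument has been removed from LinearRegression in recent scikit-learn versions, so it is omitted here.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same column layout as the question
X = pd.DataFrame({
    'Size': ['04M', '02M', '04M'],
    'Sector': ['Sec 1', 'Sec 2', 'Sec 1'],
    'Road Connectivity': ['A road', 'C road', 'C road'],
})
y = [100, 80, 95]

pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
pipe.fit(X, y)

# 'Sec 10' was never seen during fit, but prediction still works:
# its one-hot columns are all zeros instead of raising an error
xinp = pd.DataFrame([['04M', 'Sec 10', 'C road']], columns=X.columns)
print(pipe.predict(xinp))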

Related

Passing `sample_weight` parameter to classifier in imblearn pipeline when using over/under sampling transformer

Context: I am using an imblearn Pipeline as follows:
# Synthetic Minority Over-sampling Technique for Nominal and Continuous features
features_cat_mask = np.in1d(self.X_features, self.X_features_cat)
self.imbalance_transformer = SMOTENC(categorical_features=features_cat_mask)

# Add binary column indicators for categorical features
self.column_transformer = compose.make_column_transformer(
    (preprocessing.OneHotEncoder(handle_unknown='ignore', sparse=False),
     self.X_features_cat),
    remainder='passthrough')

# Impute NaN values
simple_imputer = SimpleImputer(strategy='median')

model = RandomForestClassifier(n_jobs=-1,
                               criterion='entropy',
                               class_weight='balanced_subsample')

self.clf = Pipeline(steps=[("imbalance_transformer", self.imbalance_transformer),
                           ("column_transformer", self.column_transformer),
                           ("simple_imputer", simple_imputer),
                           ("classifier", model)])
Previously, before using imblearn's SMOTENC, I passed sample_weight using the following technique:
self.clf.fit(self.X_train,
             self.y_train,
             classifier__sample_weight=self.sample_weight)
where self.sample_weight was defined based on a column in the original dataframe that produces X_train and y_train (column = 'sample_weight').
However, since using imblearn, the number of rows output by the resampler is NOT equal to the number of rows in the original dataframe that sample_weight comes from, and I get the following error: ValueError: sample_weight.shape == (1208,), expected (1830,)!
Question: What are some recommended techniques for passing sample_weight to the model when using an imblearn transformer that changes the number of rows passed to the RF model?
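
One possible workaround (not an imblearn feature, just a minimal sketch with hypothetical toy data) is to run the resampler outside the pipeline, carrying the per-row weight through it as an extra column, and then split the weight back off before fitting the remaining steps with classifier__sample_weight:

import numpy as np
from imblearn.over_sampling import SMOTENC
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 3, n),        # categorical feature (integer-coded)
    rng.normal(size=n),           # continuous feature
])
y = (rng.random(n) < 0.2).astype(int)       # imbalanced target
sample_weight = rng.uniform(0.5, 2.0, n)    # per-row weights from the dataframe

# Append the weight as a trailing continuous column so SMOTENC resamples it
# together with the features (synthetic rows get interpolated weights).
X_aug = np.column_stack([X, sample_weight])
smote = SMOTENC(categorical_features=[0], random_state=0)
X_res_aug, y_res = smote.fit_resample(X_aug, y)
X_res, w_res = X_res_aug[:, :-1], X_res_aug[:, -1]

# The rest of the pipeline (no resampler inside) can now receive the weights.
clf = Pipeline(steps=[
    ("column_transformer", make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), [0]),
        remainder='passthrough')),
    ("classifier", RandomForestClassifier(n_jobs=-1)),
])
clf.fit(X_res, y_res, classifier__sample_weight=w_res)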

How can I map predicted values (after using RandomForestClassifier) back to their original values in Python?

For context, I am taking Ad listing data for Machines and using it to predict the type of Machine.
I have used RandomForestClassifier for class prediction. In the model I used LabelEncoder to convert all categorical variables, including the target label (for example, 'Excavator' becomes 5). After running the model successfully, I am left with my array of predicted values, which are the encoded numerical values. What I would like to do now is convert these predictions back into their original strings, e.g. map the number 5 back to its original value of 'Excavator', ideally collecting all of the predicted values in one DataFrame.
I have left out a lot of code below as I don't want to drown people in the full script, so I have kept only what I deem most relevant to my question; if you need to see more in order to help, please let me know!
### ENCODE TO CATEGORICAL ###
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Choose columns to encode
cols = ['make', 'model_of_Ad', 'year_manufactured', 'business', "tag_name_deep"]
# Encode columns
df[cols] = df[cols].apply(LabelEncoder().fit_transform)
# Reset df index
df.reset_index(drop=True, inplace=True)
....
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# define the model
rf = RandomForestClassifier()
# fit the model on the whole dataset
rf.fit(X_train, y_train)
#Predict on the test set in order to assess accuracy
y_pred = rf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# See predicted values
print(y_pred)
Any help is appreciated!
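
LabelEncoder fits one mapping at a time, so keeping a fitted encoder per column lets you call inverse_transform on the predictions afterwards. Below is a minimal sketch with made-up toy data; it assumes 'tag_name_deep' is the target column, which may not match the actual script:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Ad listing dataframe
df = pd.DataFrame({
    'make': ['CAT', 'JCB', 'CAT', 'Volvo'],
    'tag_name_deep': ['Excavator', 'Loader', 'Excavator', 'Dumper'],
})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}
encoded = df.apply(lambda s: encoders[s.name].transform(s))

X, y = encoded[['make']], encoded['tag_name_deep']
rf = RandomForestClassifier().fit(X, y)
y_pred = rf.predict(X)

# Map the numeric predictions back to the original class strings
y_pred_labels = encoders['tag_name_deep'].inverse_transform(y_pred)
print(pd.DataFrame({'predicted_machine_type': y_pred_labels}))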

Handling category, float and int type features while using LIME for model interpretation

I am using LIME (Local Interpretable Model-agnostic Explanations) with mixed feature types in order to evaluate my model predictions for a classification task. Does anyone know how to specify binary features in the lime.lime_tabular.LimeTabularExplainer() method? How does LIME actually handle these types of features (features containing only 1s and 0s)?
I think you should declare your binary features as categorical features in order to allow your LIME explainer to use its sampling mechanism efficiently when performing local perturbation around the studied sample.
You can do this using the categorical_features keyword parameter in the LimeTabularExplainer constructor (category labels go in categorical_names):
my_binary_feature_column_index = 0  # put your column index here
explainer = LimeTabularExplainer(my_data,
                                 categorical_features=[my_binary_feature_column_index],
                                 categorical_names={my_binary_feature_column_index: ["foo", "bar"]})
categorical_features is a list of categorical column indexes, and categorical_names is a dictionary mapping a column index to its list of category names.
As mentioned in the LIME code:
Explains predictions on tabular (i.e. matrix) data. For numerical features, perturb them by sampling from a Normal(0,1) and doing the inverse operation of mean-centering and scaling, according to the means and stds in the training data. For categorical features, perturb by sampling according to the training distribution, and making a binary feature that is 1 when the value is the same as the instance being explained.
So, categorical features are one-hot encoded under the hood, and the value 0 or 1 is used according to the feature distribution in your training dataset (unless you choose to use a LabelEncoder, in which case LIME will process the feature as a continuous variable).
A good tutorial is available in the LIME project: https://github.com/marcotcr/lime/blob/master/doc/notebooks/Tutorial%20-%20continuous%20and%20categorical%20features.ipynb
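
Here is a minimal runnable sketch (toy data, made-up feature names) showing a binary column declared as categorical and a single instance being explained:

import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 300),   # binary feature, declared categorical below
    rng.normal(size=300),      # continuous feature, perturbed via Normal(0,1)
])
y = rng.integers(0, 2, 300)
clf = RandomForestClassifier().fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["is_foo", "score"],
    categorical_features=[0],              # index of the binary column
    categorical_names={0: ["no", "yes"]},  # labels for its two values
    mode="classification",
)
exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=2)
print(exp.as_list())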

Using encoded target value

I have a pandas dataframe and one of its columns is my target value, which is categorical.
I used get_dummies to encode my target value. Now I have my encoded target value in 5 columns, because my target value has 5 categories.
My question is: how can I use all 5 of these columns in a linear regression?
I have x_dummies as my feature dataframe and y_dummies as my target dataframe with 5 columns of encoded values.
I have never had a target value in more than one column! Is this correct?
Link to the assignment:
https://www.cs.waikato.ac.nz/~eibe/pubs/ordinal_tech_report.pdf
regr = linear_model.LinearRegression()
regr.fit(x_dummies_training, y_dummies_training)
If your target is categorical you may want to use a classifier, not a regressor.
You may read this article to understand the difference if you want.
So in your case you would want to use a classifier and keep your y target as one variable instead of one-hot encoding it.
If you want a mathematical model that's easy to interpret (I guessed that from your use of linear regression), you may want a multinomial logistic regression:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, solver='lbfgs',
                         multi_class='multinomial').fit(X, y)
You may want to check the sklearn documentation.
You could also try the wildly popular boosting-tree methods, which should give you better results: check CatBoost as an example.
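
Here is a minimal sketch with toy data (made-up column names) of the suggested approach: the target is kept as a single categorical column rather than one-hot encoded, and a multinomial logistic regression is fitted on it directly:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'x2': [0, 1, 0, 1, 0, 1],
    'target': ['low', 'mid', 'high', 'low', 'mid', 'high'],
})
X = df[['x1', 'x2']]
y = df['target']   # left as one column, no get_dummies needed

clf = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X, y)
print(clf.predict(X.head(3)))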

How to one hot encode with pandas on a new dataset?

I have a training data set that has categorical features on which I use pd.get_dummies to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.
For example, suppose your training dataframe tradf has columns ['A_1', 'A_2'] after get_dummies, while your new df has column ['A'] containing only category 1. You can do:
pd.get_dummies(df).reindex(columns=tradf.columns, fill_value=0)
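
A minimal runnable sketch of that reindex trick with made-up data: the new data's dummy columns are aligned to the training-time columns, and any category missing from the new data is filled with 0 so the shape matches what the classifier expects:

import pandas as pd

train = pd.DataFrame({'A': ['1', '2', '1', '2']})
new = pd.DataFrame({'A': ['1', '1']})   # only one category present

tradf = pd.get_dummies(train)           # columns: A_1, A_2
new_dummies = pd.get_dummies(new).reindex(columns=tradf.columns, fill_value=0)
print(new_dummies)                      # both A_1 and A_2 exist, missing one is all zeros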
