sklearn: MultiClass Classifier with Negative Samples - python

I am new to machine learning, but a veteran programmer.
I have a lot of data about Customer/Agent interactions, with ratings for these interactions as positive/negative from the customer's perspective. I also have lots of features about the customer (Age, Gender, previous spend, products purchased, etc.).
I want to train a model that learns from the customer features which Agent is best suited to deal with them and would potentially produce the highest rating, on the assumption that similar customers (similar features) can be served by an Agent in the same way.
Assume the following pandas DataFrame, dataset:
AgentID Score Cust_F1 Cust_F2 Cust_F3 ..... Cust_Fn
0 1 10 1 0 1 2
1 1 0 0 1 2 0
2 1 9 1 2 1 2
3 2 10 0 1 1 1
4 2 9 0 1 2 1
5 2 0 1 0 2 2
from sklearn.ensemble import RandomForestClassifier

X = dataset.drop(['AgentID', 'Score'], axis=1).values
y = dataset['AgentID'].values
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X, y)
I want a way to train the model to reject (negatively train on) all samples with Score = 0. I cannot find a way to do this with sklearn. Of course, I could remove the samples with Score = 0 from the training data, however, I believe they are very valuable information that would help the algorithm classify properly.
I also looked at the sample_weight parameter and thought that putting negative values there might help, however, the documentation doesn't mention this.
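For reference, here is a minimal sketch of the nonnegative alternative, reusing X, y and dataset from above; passing Score itself as sample_weight only down-weights rows, so Score = 0 interactions contribute nothing to training rather than being penalized (this is not true negative training):
# Illustrative only: weight each row by its Score; rows with Score = 0
# effectively drop out of training instead of being trained against.
weights = dataset['Score'].values.astype(float)
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X, y, sample_weight=weights)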
Can someone please help me...

Related

Should the background dataset for shap be standardized?

So I am trying to explain a basic SVM model using SHAP. The inputs to the SVM model, however, are standardized (I used StandardScaler().fit() and then transformed the data points with the fitted scaler so that they can be used with the SVM model).
My question is now when using SHAP I need to give it a background distribution. Usually the input to this background distribution looks like this:
background_distribution = KMeans(n_clusters=10,random_state=0).fit(xtrain).cluster_centers_
However I wanted to use my own custom background distribution, which contains select data points. Does this mean the data points need to be standardized as well? i.e instead of looking like
[ 1 0 1 31 24 4817 2 3 1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1]
they look like this
[ 0.67028006 -0.18887347 0.90860212 -0.41342579 0.26204266 0.55080012
-0.85479154 0.13743146 -0.70749448 -0.42919754 1.21628074 -0.71418983
-0.26726124 -0.52247913 -0.34755864 0.31234752 -0.23208655 -0.63565412
-0.40904178 0. 4.89897949 -0.23473314 0.64082627 -0.46852129
-0.26726124 -0.44542354 1.15657353 0.53795751]
For clarity: I am asking whether after retrieving my points, I need to standardize the background data set, since my original data points are scaled for use in the model, however my background distribution contains non scaled data points.
The model training looks like this:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ss = StandardScaler().fit(X)
xtrain = ss.transform(xtrain)  # changes values to make them ML compatible - not needed for trees
xtest = ss.transform(xtest)
support_vector_classifier = SVC(kernel='rbf')
support_vector_classifier.fit(xtrain, ytrain)
y_pred_svc = support_vector_classifier.predict(xtest)
Option A:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0,
                                     0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B:
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0,
                                     0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
shap.KernelExplainer(support_vector_classifier.predict, background_distribution)
Option B is close: your background should be preprocessed in the same way as your training data.
This is the case in any situation in ML where you preprocess data -- whether you split your data into train, test, and validation sets, or feed data to a trained model for prediction -- you always apply the same transformations to every part of the data, sometimes manually, sometimes through a pipeline. SHAP is not an exception to this principle.
However, you may think about the following as well: your scaler should be fitted on the training data before being applied to test or background data. You can't fit it on the test, validation, or background data, because that would amount to asking to see the future before predicting it ("data leakage", as it is called in ML).
This means you can't do:
ss = StandardScaler().fit(background_distribution)
background_distribution = ss.transform(background_distribution)
Rather:
ss = StandardScaler().fit(X_train)
background_distribution = ss.transform(background_distribution)
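Putting it together, a minimal sketch (reusing ss, xtest and support_vector_classifier from above; the shap_values call at the end is just for illustration) might look like this:
import numpy as np
import shap

# Raw, unscaled background points, scaled with the scaler fitted on the training data
background_distribution = np.array([[1, 0, 1, 31, 24, 4817, 2, 3, 1, 1, 1, 0, 0, 0,
                                     0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1]])
background_scaled = ss.transform(background_distribution)

explainer = shap.KernelExplainer(support_vector_classifier.predict, background_scaled)
shap_values = explainer.shap_values(xtest[:5])  # explain a few test points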

AttributeError: 'Series' object has no attribute 'to_coo'

I am trying to use a Naive Bayes classifier from the sklearn module to classify whether movie reviews are positive. I am using a bag of words as the features for each review and a large dataset with sentiment scores attached to reviews.
df_bows = pd.DataFrame.from_records(bag_of_words)
df_bows = df_bows.fillna(0).astype(int)
This code creates a pandas dataframe which looks like this:
The Rock is destined to ... Staggeringly ’ ve muttering dissing
0 1 1 1 1 2 ... 0 0 0 0 0
1 2 0 1 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 0 0 0 0 0
3 0 0 1 0 4 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
I then try to fit this data frame with the sentiment of each review using this code
nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)
However I get an error which says
AttributeError: 'Series' object has no attribute 'to_coo'
This is what the df movies looks like.
sentiment text
id
1 2.266667 The Rock is destined to be the 21st Century's ...
2 3.533333 The gorgeously elaborate continuation of ''The...
3 -0.600000 Effective but too tepid biopic
4 1.466667 If you sometimes like to go to the movies to h...
5 1.733333 Emerges as something rare, an issue movie that...
Can you help with this?
When you're trying to fit your MultinomialNB model, sklearn's routine checks whether the input df_bows is sparse or not. If it is, just like in our case, the DataFrame's dtype needs to be changed to 'Sparse'. Here is the way I fixed it:
df_bows = pd.DataFrame.from_records(bag_of_words)
# Keep NaN values and convert to Sparse type
sparse_bows = df_bows.astype('Sparse')
nb = nb.fit(sparse_bows, movies['sentiment'] > 0)
Link to Pandas doc : pandas.Series.sparse.to_coo
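If the 'Sparse' dtype route is inconvenient, another workaround (just a sketch, not tested against the original bag_of_words data) is to hand scikit-learn a SciPy sparse matrix directly, which MultinomialNB accepts:
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

# Build a CSR matrix from the dense bag-of-words counts and fit on that
X_sparse = csr_matrix(df_bows.fillna(0).astype(int).values)
nb = MultinomialNB()
nb = nb.fit(X_sparse, movies['sentiment'] > 0)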

Keras model multiple inputs and multiple outputs: how to determine multiple or non multiple scenario for input and output

I am trying Keras in Python. I want to know how to decide whether my model is a multi-input, multi-output model.
My input and target look like the following:
Input Table
(Time, ID, FA_1 to FA_5, and Tag are the inputs; F_1 to F_5 are the targets)
Time ID FA_1 FA_2 FA_3 FA_4 FA_5 Tag F_1 F_2 F_3 F_4 F_5
1 2 4 0 3 7 0 0 1 0 1 0 0
3 2 4 0 3 7 0 1 0 0 1 1 0
2 7 0 5 6 0 2 1 0 1 1 0 0
Here, the FA columns are integer count features, and the F columns are binary values indicating whether an error occurred for the corresponding feature.
Now, if I want to predict the target values from the given input values, will this be a multiple-inputs, multiple-outputs model?
For clarification: for each row of inputs I want 5 values in the output. The values can be either predicted labels (0 or 1) or probabilities of being 1 (0-100%).
I checked the Keras documentation on models with multiple inputs and outputs, but its examples have quite different kinds of outputs, whereas in my case it is simply several outputs of the same kind rather than genuinely distinct outputs. Any relevant suggestions on building the model are appreciated.
Maybe something like this:
model = keras.Model(inputs=[input_table, tags_input],
                    outputs=[F_1_pred, F_2_pred, F_3_pred, F_4_pred, F_5_pred])
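For concreteness, here is a minimal functional-API sketch of that shape; the layer sizes, input widths, and names are assumptions for illustration, not taken from the question:
from tensorflow import keras
from tensorflow.keras import layers

# Assumed input widths: Time, ID and FA_1..FA_5 as one block, Tag on its own.
input_table = keras.Input(shape=(7,), name="input_table")
tags_input = keras.Input(shape=(1,), name="tags_input")

x = layers.Concatenate()([input_table, tags_input])
x = layers.Dense(32, activation="relu")(x)

# One sigmoid head per target F_1..F_5, each predicting the probability of an error.
outputs = [layers.Dense(1, activation="sigmoid", name=f"F_{i}_pred")(x)
           for i in range(1, 6)]

model = keras.Model(inputs=[input_table, tags_input], outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
Whether you keep five separate heads like this or a single Dense(5, activation="sigmoid") output layer is largely a matter of convenience; both give five per-row predictions.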

Getting "Perfect separation detected, results not available" while building the Logistic Regression model

As part of my assignment I am building a logistic regression model, but I am getting the error "Perfect separation detected, results not available" while fitting it.
**X_train :-**
year amt_spnt rank
1 -1.723034 -0.418500 0.272727
2 0.716660 2.088507 -0.636364
3 1.174102 -0.558333 -1.545455
4 -0.503187 -1.297451 1.181818
5 1.326583 -0.628250 -1.545455
**y_train :-**
1 0
2 1
3 1
4 0
5 1
Name: result, dtype: int64
**Logistic Model code:-**
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()
**Dataset before and after scaling (image for evidence):** https://i.stack.imgur.com/cTncA.png
This is a model specification issue: because of the perfect separation, your model cannot converge. Perfect separation means there is one (or more) variable among your independent variables that can perfectly distinguish dependent variable = 0 from dependent variable = 1. See the following example:
Y 0 0 0 0 0 0 1 1 1 1
X 1 2 3 4 4 4 5 6 7 8
If X <= 4, Y = 0
If X > 4, Y = 1
A short answer to your question is to find such a variable among your independent variables and remove it from your model.
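One rough way to hunt for such a variable (a sketch only, reusing the X_train / y_train names from the question; it only catches single-variable threshold separation) is to check whether the value ranges of the two classes overlap:
# A column whose value ranges for y == 0 and y == 1 do not overlap can
# split the classes perfectly on the training data.
for col in X_train.columns:
    vals_0 = X_train.loc[y_train == 0, col]
    vals_1 = X_train.loc[y_train == 1, col]
    if vals_0.max() < vals_1.min() or vals_1.max() < vals_0.min():
        print(f"{col} perfectly separates y_train")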

Get most informative features from very simple scikit-learn SVM classifier

I tried to build a very simple SVM predictor that I would understand with my basic Python knowledge. As my code looks so different from this question and also this question, I don't know how I can find the most important features for the SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A B C D E F status
1 5 2 5 1 3 1
1 2 3 2 2 1 0
3 4 2 3 5 1 1
1 2 2 1 1 4 0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1 5 2 5 1 3
1 2 3 2 2 1
3 4 2 3 5 1
1 2 2 1 1 4
And the status 'y':
1
0
1
0
Then I build X and y arrays out of the sample, train on half of the sample, test on the other half, and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import preprocessing, svm

X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
y = sample['status'].values.tolist()
y = np.array(y)

test_size = int(X.shape[0]/2)

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X[:-test_size], y[:-test_size])

correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count)/test_size) * 100.00
My problem now is that I have no idea how I could implement the code from the questions above so that I can also see which features are the most important ones.
I would be grateful if you could tell me whether that is even possible for my simple version, and if yes, any tips on how to do it would be great.
For a linear SVM, the learned weight vector indicates feature importance: features whose squared weights are largest contribute most to the decision function, while those with the smallest squared weights are the least informative (this is the ranking criterion used by SVM-based recursive feature elimination).
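Since the question already uses kernel="linear", a minimal sketch of that ranking (reusing clf and features from the question; illustrative only):
import numpy as np

# For a binary linear SVM, clf.coef_ holds one weight per feature; larger
# squared weights mean a larger contribution to the decision function.
weights = clf.coef_[0]
order = np.argsort(weights ** 2)[::-1]  # most informative first
for idx in order:
    print(features[idx], weights[idx])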
