Over/Under Sampling with RandomForest - python

Newbie to ML.
I'm having trouble understanding a classification report for a RandomForest model I'm running.
I've cleaned my data and realised there's a class imbalance between my target labels (0: 182588, 1: 1137) - essentially 99/1.
As I understand it, I need to either oversample or undersample my data to improve the predictive performance of the RandomForest model.
I've run both. I've also used TargetEncoding to convert my categorical data into numerical data (there are over 40 columns after pre-processing, so I felt this was the best approach - at least better than one-hot encoding, which would introduce too much noise).
I then ran my model with both over- and undersampling. With oversampling it returned an accuracy of 99%; with undersampling, an accuracy of 84%.
A 99% accuracy seems unrealistic and suggests overfitting; 84% makes more sense. Nonetheless, what matters more to me is precision/recall.
How do I interpret the Classification Report - what is 'macro avg' and should I be looking at that to see how accurate my model is?
Here's a snippet of my code, both with over/under sampling. Maybe I'm doing something wrong and just can't see it.
# Assumed imports for this snippet (TargetEncoder is taken to be from the category_encoders package):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import RandomOverSampler
from category_encoders import TargetEncoder

label = mod_df.iloc[:, -1]
predictors = mod_df.iloc[:, :-1]
# stratified split so both sets keep the 99/1 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    predictors, label, test_size=0.4, random_state=1, stratify=label)
# oversample the minority class in the training set only
oversample = RandomOverSampler(sampling_strategy='minority', random_state=1)
X_train, y_train = oversample.fit_resample(X_train, y_train)  # 109553 samples of each class
# target-encode the categorical columns, fitting on the training data only
encoder = TargetEncoder()
X_train = encoder.fit_transform(X_train, y_train)
X_test = encoder.transform(X_test)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
accuracy_score(y_test, pred)  # 0.9954551639678868
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     73035
           1       0.88      0.31      0.46       455

    accuracy                           1.00     73490
   macro avg       0.94      0.65      0.73     73490
weighted avg       0.99      1.00      0.99     73490
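For reference on how the report's averages relate to the per-class rows: macro avg is the unweighted mean of the per-class scores, while weighted avg weights them by support. A quick arithmetic check against the precision column above:

# macro avg precision: unweighted mean of the per-class precisions
print((1.00 + 0.88) / 2)                    # 0.94
# weighted avg precision: support-weighted mean of the per-class precisions
print((1.00 * 73035 + 0.88 * 455) / 73490)  # ~0.999, shown rounded as 0.99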
With undersampling the process is the same, except the training set ends up with only 682 samples of each class; a sketch of that step is below, followed by its classification report.
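For concreteness, the undersampling step would presumably mirror the oversampler above; a minimal sketch using imblearn's RandomUnderSampler (the question doesn't show this code, so the exact call is an assumption):

from imblearn.under_sampling import RandomUnderSampler

# downsample the majority class to match the minority class (682 samples each)
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=1)
X_train, y_train = undersample.fit_resample(X_train, y_train)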
              precision    recall  f1-score   support

           0       1.00      0.85      0.92     73035
           1       0.04      0.85      0.07       455

    accuracy                           0.85     73490
   macro avg       0.52      0.85      0.49     73490
weighted avg       0.99      0.85      0.92     73490
Snippet of the data I'm working with...
population county_location primary_road secondary_road distance direction weather_1 party_count pcf_violation_category hit_and_run road_surface road_condition_1 lighting control_device bicycle_collision motorcycle_collision truck_collision collision_time party_sex party_age party_sobriety party_safety_equipment_1 party_safety_equipment_2 cellphone_in_use other_associate_factor_1 movement_preceding_collision vehicle_year party_race month day collision_severity
3 100000 to 250000 ventura W KENTWOOD DR H ST 100.0 east clear 3 dui not hit and run wet normal dark with street lights functioning 0 0 0 22:36:00 male 27 had been drinking, under influence air bag not deployed lap/shoulder harness used 0 violation proceeding straight 1989 hispanic 10 05 0
6 >250000 los angeles IMPERIAL HWY MAIN ST 33.0 east clear 2 speeding not hit and run dry normal dark with street lights functioning 0 0 1 21:25:00 male 61 had not been drinking air bag not deployed lap/shoulder harness used 0 none apparent passing other vehicle 2006 black 08 06 0
7 >250000 los angeles IMPERIAL HWY MAIN ST 33.0 east clear 2 speeding not hit and run dry normal dark with street lights functioning 0 0 1 21:25:00 male 33 had not been drinking air bag not deployed lap/shoulder harness used 0 none apparent proceeding straight 2013 hispanic 08 06 0


Scikit Learn Dataset

In Scikit-learn, when doing X, Y = make_moons(500, noise=0.2) and then printing X and Y, I see that they look like arrays with a bunch of entries but with no commas?
I have data that I want to use instead of the Scikit-learn moons dataset, but I don't understand what data type these Scikit-learn datasets are and how I can make my data follow this data type.
The first one, X, is a 2D array:
array([[-6.72300890e-01, 7.40277997e-01],
[ 9.60230259e-02, 9.95379113e-01],
[ 3.20515776e-02, 9.99486216e-01],
[ 8.71318704e-01, 4.90717552e-01],
....
[ 1.61911895e-01, -4.55349012e-02]])
which contains the x- and y-coordinates of the points.
The second element of the tuple, y, is an array containing the labels (0 or 1 for binary classification):
array([0, 0, 0, 0, 1, ... ])
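Both elements are plain NumPy arrays: X is a float array of shape (n_samples, 2) and y an integer array of shape (n_samples,). To make your own data follow this data type, build the same pair of arrays; a minimal sketch (the values here are made up):

import numpy as np

# your own points and labels, in the same (X, y) format that make_moons returns
X = np.array([[0.5, 1.2],
              [1.1, 0.3],
              [0.2, 0.8]])   # shape (n_samples, 2): one row per point
y = np.array([0, 1, 0])      # shape (n_samples,): one label per point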
To use this data in a simple classification task, you could do the following:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Create dataset
X, y = make_moons(500, noise=0.2)
# Split dataset in a train part and a test part
train_X, test_X, train_y, test_y = train_test_split(X, y)
# Create the Logistic Regression classifier
log_reg = LogisticRegression()
# Fit the logistic regression classifier
log_reg.fit(train_X, train_y)
# Use the trained model to predict on the train and test samples
train_y_pred = log_reg.predict(train_X)
test_y_pred = log_reg.predict(test_X)
# Print classification report on the training data
print(classification_report(train_y, train_y_pred))
# Print classification report on the test data
print(classification_report(test_y, test_y_pred))
The results are:
On training data:

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       193
           1       0.86      0.88      0.87       182

    accuracy                           0.87       375
   macro avg       0.87      0.87      0.87       375
weighted avg       0.87      0.87      0.87       375

On test data:

              precision    recall  f1-score   support

           0       0.81      0.89      0.85        57
           1       0.90      0.82      0.86        68

    accuracy                           0.86       125
   macro avg       0.86      0.86      0.86       125
weighted avg       0.86      0.86      0.86       125
As we can see, the F1 score is not very different between the train and the test set, so the model is not overfitting.

how to adjust accuracy only for 1's?

Suppose I have data like this:

  x1    x2    x3   y
0.85  0.95  0.22   1
0.35  0.26  0.42   0
0.89  0.82  0.82   1
0.36  0.14  0.32   0
0.44  0.53  0.82   1
0.75  0.78  0.52   1
I'm doing binary classification, but the only thing that matters is the correct prediction of the 1s; if the prediction is 0, it should not affect my accuracy.
I simply used the following code:
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
But this code also counts the zeros towards its accuracy.
How can I tell the network that only the prediction of 1s is important?
In other words, during model fitting, a prediction of zero should not count towards the model's accuracy.
It sounds like you care about the precision of the model. Precision means: of all instances that you predict as 1, what portion is correct.
If so, use tf.keras.metrics.Precision() as the metric:
import tensorflow as tf
from tensorflow import keras

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.Precision()])
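As a quick sanity check of what Precision() measures, you can evaluate the metric on toy values (these labels are made up for illustration):

m = tf.keras.metrics.Precision()
m.update_state([1, 0, 1, 1], [1, 1, 1, 0])  # 3 samples predicted 1, 2 of them truly 1
print(float(m.result()))                    # 0.666... = 2/3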

SVC sigmoid kernel is not working properly

I am testing an SVM with a sigmoid kernel on the iris data using sklearn's SVC. Its performance is extremely poor, with an accuracy of 25%. I'm using exactly the same code and normalizing the features as https://towardsdatascience.com/a-guide-to-svm-parameter-tuning-8bfe6b8a452c (sigmoid section), which should increase performance substantially. However, I am not able to reproduce the author's results, and the accuracy only increases to 33%.
Using other kernels (e.g. a linear kernel) produces good results (accuracy of 82%).
Could there be an issue within the SVC(kernel='sigmoid') function?
Python code to reproduce the problem:
## sigmoid iris example
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report

iris = datasets.load_iris()
sepal_length = iris.data[:, 0]
sepal_width = iris.data[:, 1]

# assessing performance of sigmoid SVM
clf = SVC(kernel='sigmoid')
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
pr = clf.predict(np.c_[sepal_length, sepal_width])
pd.DataFrame(classification_report(iris.target, pr, output_dict=True))

from sklearn.metrics.pairwise import sigmoid_kernel
sigmoid_kernel(np.c_[sepal_length, sepal_width])

# normalizing features
from sklearn.preprocessing import normalize
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
sigmoid_kernel(np.c_[sepal_length_norm, sepal_width_norm])

# assessing performance of sigmoid SVM with normalized features
pr_norm = clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True))
I see what's happening. In sklearn releases before 0.22, the default gamma parameter passed to the SVC was "auto"; in subsequent releases this was changed to "scale". The author of the article seems to have been using an earlier version and therefore implicitly passing gamma="auto" (he mentions that the "current default setting for gamma is 'auto'"). So if you're on the latest release of sklearn (0.23.2), you'll want to explicitly pass gamma='auto' when instantiating the SVC:
clf = SVC(kernel='sigmoid', gamma='auto')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
So now when you print the classification report:
pr_norm = clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
#                   0         1         2  accuracy   macro avg  weighted avg
# precision  0.907407  0.650000  0.750000  0.766667    0.769136      0.769136
# recall     0.980000  0.780000  0.540000  0.766667    0.766667      0.766667
# f1-score   0.942308  0.709091  0.627907  0.766667    0.759769      0.759769
# support   50.000000 50.000000 50.000000  0.766667  150.000000    150.000000
What explains the 33% accuracy you were seeing is that the default gamma is "scale", which places all predictions in a single region of the decision plane; since the targets are split into thirds, you get a maximum accuracy of 33.3%:
clf = SVC(kernel='sigmoid')
#normalizing features
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]
clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
X = np.c_[sepal_length_norm, sepal_width_norm]
pr_norm = clf.predict(np.c_[sepal_length_norm, sepal_width_norm])
print(pd.DataFrame(classification_report(iris.target, pr_norm, output_dict=True)))
#                 0     1          2  accuracy   macro avg  weighted avg
# precision     0.0   0.0   0.333333  0.333333    0.111111      0.111111
# recall        0.0   0.0   1.000000  0.333333    0.333333      0.333333
# f1-score      0.0   0.0   0.500000  0.333333    0.166667      0.166667
# support      50.0  50.0  50.000000  0.333333  150.000000    150.000000
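For reference, the two defaults differ only in how gamma is derived from the data. A small sketch of sklearn's documented formulas, reusing the X array built above:

gamma_auto = 1 / X.shape[1]               # gamma='auto':  1 / n_features
gamma_scale = 1 / (X.shape[1] * X.var())  # gamma='scale': 1 / (n_features * X.var())
print(gamma_auto, gamma_scale)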

Metrics F1 warning zero division

I want to calculate the F1 score of my models, but I receive a warning and get a 0.0 F1 score, and I don't know what to do.
Here is the source code:
import numpy as np
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

def model_evaluation(models):
    for key, value in models.items():
        classifier = Pipeline([('tfidf', TfidfVectorizer()),
                               ('clf', value)])
        classifier.fit(X_train, y_train)
        predictions = classifier.predict(X_test)
        print("Accuracy Score of", key, ": ", metrics.accuracy_score(y_test, predictions))
        print(metrics.classification_report(y_test, predictions))
        print(metrics.f1_score(y_test, predictions, average="weighted",
                               labels=np.unique(predictions), zero_division=0))
        print("---------------", "\n")

dlist = {"KNeighborsClassifier": KNeighborsClassifier(3),
         "LinearSVC": LinearSVC(),
         "MultinomialNB": MultinomialNB(),
         "RandomForest": RandomForestClassifier(max_depth=5, n_estimators=100)}
model_evaluation(dlist)
And here is the result:

Accuracy Score of KNeighborsClassifier :  0.75
              precision    recall  f1-score   support

not positive       0.71      0.77      0.74        13
    positive       0.79      0.73      0.76        15

    accuracy                           0.75        28
   macro avg       0.75      0.75      0.75        28
weighted avg       0.75      0.75      0.75        28

0.7503192848020434
---------------

Accuracy Score of LinearSVC :  0.8928571428571429
              precision    recall  f1-score   support

not positive       1.00      0.77      0.87        13
    positive       0.83      1.00      0.91        15

    accuracy                           0.89        28
   macro avg       0.92      0.88      0.89        28
weighted avg       0.91      0.89      0.89        28

0.8907396950875212
---------------

Accuracy Score of MultinomialNB :  0.5357142857142857
              precision    recall  f1-score   support

not positive       0.00      0.00      0.00        13
    positive       0.54      1.00      0.70        15

    accuracy                           0.54        28
   macro avg       0.27      0.50      0.35        28
weighted avg       0.29      0.54      0.37        28

0.6976744186046512
---------------

C:\Users\Cey\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Accuracy Score of RandomForest :  0.5714285714285714
              precision    recall  f1-score   support

not positive       1.00      0.08      0.14        13
    positive       0.56      1.00      0.71        15

    accuracy                           0.57        28
   macro avg       0.78      0.54      0.43        28
weighted avg       0.76      0.57      0.45        28

0.44897959183673475
---------------
Can someone tell me what to do? I only receive this warning when using the MultinomialNB() classifier.
Second: when extending the dictionary with the Gaussian classifier (GaussianNB()), I receive this error message:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
What should I do here?
Together with the question UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples (main credit goes there) and yatu's answer, I could at least find a workaround for the warning:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Quote from sklearn.metrics.f1_score in the Notes at the bottom:
When true positive + false positive == 0, precision is undefined. When
true positive + false negative == 0, recall is undefined. In such
cases, by default the metric will be set to 0, as will f-score, and
UndefinedMetricWarning will be raised. This behavior can be modified
with zero_division.
Thus, you cannot avoid the warning if your model never predicts a given label, since both true positives and false positives are then zero for it.
That being said, you can at least suppress the warning by adding zero_division=0 (or 1) to the functions mentioned in the quote; whichever value you pass is what gets reported for the ill-defined metric.
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, zero_division=0)
print('Precision score: {0:0.2f}'.format(precision))
recall = recall_score(y_test, y_pred, zero_division=0)
print('Recall score: {0:0.2f}'.format(recall))
f1 = f1_score(y_test, y_pred, zero_division=0)
print('f1 score: {0:0.2f}'.format(f1))  # note: the original printed `recall` here by mistake
Can someone tell me what to do? I only receive this message when using the "MultinomialNB()" classifier
The first warning indicates that a specific label is never predicted when using MultinomialNB, which makes the F-score for that label ill-defined; the missing values are then set to 0. This is explained here.
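A minimal repro of that situation with made-up labels (class 0 is never predicted, so its precision is ill-defined):

from sklearn.metrics import precision_score

y_true = [0, 1, 1]
y_pred = [1, 1, 1]  # the model never predicts class 0
print(precision_score(y_true, y_pred, pos_label=0))  # warns and returns 0.0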
When extending the dictionary by using the Gausian classifier (GaussianNB()) I receive this error message:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
As per this question, the error is quite explicit: TfidfVectorizer returns a sparse matrix, which cannot be used as input for the GaussianNB. So the way I see it, you either avoid using GaussianNB, or you add an intermediate transformer to turn the sparse array into a dense one, which I wouldn't advise, since the output of a tf-idf vectorization is typically large and very sparse.
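If you do want to try the dense route anyway, here is a minimal sketch of such an intermediate transformer (assuming the dataset is small enough to hold densely in memory):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# converts the sparse tf-idf matrix into the dense array GaussianNB requires
to_dense = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)

classifier = Pipeline([('tfidf', TfidfVectorizer()),
                       ('to_dense', to_dense),
                       ('clf', GaussianNB())])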

How can I improve massively classification report of one class using ensemble model?

I have a dataset with the class distribution
{0: 6624, 1: 75}, where 0 marks non-observational sentences and 1 marks observational sentences. (Basically, I annotate my sentences using Named Entity Recognition; if a sentence contains a specific entity like DATA, TIME, or LONG (coordinate), I give it label 1.)
Now I want to build a model to classify them. The best model I've made (CV = 3 for all) is an ensemble of the two models below. The first one is:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

clf = SGDClassifier()
trial_05 = Pipeline([("vect", vec), ("clf", clf)])  # vec: the text vectorizer defined elsewhere in my code
which has:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6624
           1       0.73      0.57      0.64        75

   micro avg       0.99      0.99      0.99      6699
   macro avg       0.86      0.79      0.82      6699
weighted avg       0.99      0.99      0.99      6699

[[6611   37]
 [  13   38]]
The second model uses a resampled SGD for classification:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96      6624
           1       0.13      1.00      0.22        75

   micro avg       0.92      0.92      0.92      6699
   macro avg       0.56      0.96      0.59      6699
weighted avg       0.99      0.92      0.95      6699

[[6104    0]
 [ 520   75]]
As you can see, the problem in both cases is class 1, but the first model has fairly good precision and F1 score, whereas the second has very good recall.
So I decided to combine both in an ensemble model, in this way:
from sklearn.ensemble import VotingClassifier

# create a dictionary of our models
estimators = [("trial_05", trial_05), ("resampled", SGD_RESAMPLED_Model)]
# create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')
Now I have this result:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6624
           1       0.75      0.48      0.59        75

   micro avg       0.99      0.99      0.99      6699
   macro avg       0.87      0.74      0.79      6699
weighted avg       0.99      0.99      0.99      6699

[[6612   39]
 [  12   36]]
As you can see, the ensemble model has better precision on class 1, but worse recall and F1 score, which leads to a worse confusion matrix for class 1 (36 TP vs. 38 TP).
My aim is to improve the TP count for class 1 (that is, its recall and F1 score).
What do you recommend to improve the TP count for class 1?
Generally, do you have any ideas about my workflow?
I have tried parameter tuning; it does not improve the SGD model.
