I am trying to use the Random Forest classifier from scikit-learn in Python to predict stock movements. My dataset has 8 features and 1201 records. After fitting the model and using it to predict, I get 100% accuracy and a 100% OOB error. I reduced n_estimators from 100 to a small value, but the OOB error only dropped by a few percent. Here is my code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
#File reading
df = pd.read_csv('700.csv')
df.drop(df.columns[0], axis=1, inplace=True)
target = df.iloc[:,8]
print(target)
#train test split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.3)
#model fit
clf = RandomForestClassifier(n_estimators=100, criterion='gini',oob_score= True)
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
accuaracy = accuracy_score(y_test,pred)
print(clf.oob_score_)
print(accuaracy)
How can I modify the code in order to make the OOB error drop? Thanks.
If you want to check the error, compute it from the OOB score like this:
oob_error = 1 - clf.oob_score_
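If you want to see how the OOB error behaves as more trees are added, here is a minimal sketch; it assumes the X_train and y_train from your code, and the range of tree counts is arbitrary:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# Track the OOB error for increasing numbers of trees
errors = []
tree_counts = range(10, 201, 10)
for n in tree_counts:
    clf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=0)
    clf.fit(X_train, y_train)
    errors.append(1 - clf.oob_score_)
plt.plot(list(tree_counts), errors)
plt.xlabel('n_estimators')
plt.ylabel('OOB error')
plt.show()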
Hi everyone, I have a problem printing my results with sklearn's MLPClassifier. I want to plot graphs of MSE vs. epoch for training and testing. This is my code:
# Splitting data into features and target
x=Dt[['new_tests','people_vaccinated','people_fully_vaccinated','total_boosters','new_vaccinations']]
y=Dt[['new_cases','new_deaths']]
# Import train_test_split function
from sklearn.model_selection import train_test_split
#Split dataset into training set and test set to be 70% and 30%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
# Import MLPClassifier and mean_squared_error
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error
# Create model object
flc = MLPClassifier(hidden_layer_sizes=(15,10),
                    random_state=5,
                    verbose=True,
                    max_iter=50,
                    learning_rate_init=0.1)
mse = mean_squared_error(y_test, ypred, squared=False)
# Fit data onto the model
flc.fit(x_train, y_train)
# Make prediction on test dataset
ypred = flc.predict(x_test)
# Import accuracy score
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy_score(y_test,ypred)
I was expecting the graph to display MSE vs. epoch for training and testing. Please, is anyone able to run it? Any ideas or suggestions? What could be causing this issue? Thank you.
Mean squared error is a regression metric. To visualize the loss per iteration, you can plot clf.loss_curve_:
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
X, y = make_classification()
clf = MLPClassifier(max_iter=50).fit(X, y)
plt.plot(clf.loss_curve_)
plt.show()
If you want to plot something more complicated (e.g. error vs. epoch for training and testing), you'll need to write a training loop yourself, like this: Training and validation loss history for MLPRegressor
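For completeness, here is a minimal sketch of such a loop (not the code from the linked answer), using made-up data from make_classification and log loss as the per-epoch metric; adapt it to your own split and metric:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = MLPClassifier(hidden_layer_sizes=(15, 10), learning_rate_init=0.1, random_state=5)
classes = np.unique(y_train)
train_losses, test_losses = [], []
for epoch in range(50):
    # One pass over the training data per call to partial_fit
    clf.partial_fit(X_train, y_train, classes=classes)
    train_losses.append(log_loss(y_train, clf.predict_proba(X_train)))
    test_losses.append(log_loss(y_test, clf.predict_proba(X_test)))
plt.plot(train_losses, label='train')
plt.plot(test_losses, label='test')
plt.xlabel('epoch')
plt.ylabel('log loss')
plt.legend()
plt.show()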
I have a simple question concerning the voting classifier. As I understood it, the voting classifier should have higher accuracy than the individual predictors it is built from (the wisdom of the crowd). Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# import dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# split the dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
log_clf = LogisticRegression(solver='liblinear', random_state=42)
svm_clf = SVC(gamma='auto', random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf = voting_clf.fit(X_train, y_train)
predictors_list= [log_clf, rnd_clf, svm_clf, voting_clf]
for clf in predictors_list:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_pred, y_test)
    print(clf.__class__.__name__, accuracy)
What I get as accuracy is as follows:
LogisticRegression 0.776
RandomForestClassifier 0.88
SVC 0.864
VotingClassifier 0.864
As you can see, for this run the Random Forest predictor has slightly better accuracy than the VotingClassifier!
Any explanation for this?
Many thanks in Advance
Fethi
Let's take a look at the voting parameter you passed, 'hard'. The documentation says:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
So maybe the predictions of LogisticRegression and your SVC (SVM) are the same and wrong for some cases; that makes the majority vote wrong for those cases.
You can use voting='soft' or assign a weight to each model's prediction. This way you make the ensemble a little more immune to the wrong predictions of the weaker models and rely more on your best models.
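As a sketch, reusing the names from your code above: soft voting needs predict_proba, so the SVC has to be created with probability=True, and the weights below are just an illustration, not tuned values.
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# Soft voting averages predicted probabilities, so every estimator must support predict_proba
svm_clf_soft = SVC(gamma='auto', probability=True, random_state=42)
voting_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf_soft)],
    voting='soft',
    weights=[1, 2, 1])  # illustrative weights: trust the stronger model a bit more
voting_soft.fit(X_train, y_train)
print(voting_soft.score(X_test, y_test))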
I am getting 100% accuracy in multiple linear regression. I am following a tutorial from last year. He is not getting 100% accuracy on the same model, but I am now. It seems weird to me. Here's my code. Am I doing it right, or is there something wrong in my code?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('M_Regression.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, :1].values
from sklearn.model_selection import train_test_split
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
#regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train,Y_train)
#Prediction
y_pred = reg.predict(x_test)
print(str(y_test) + " - " + str(y_pred))
With linear regression on simple numbers like these, it is common to get 100% accuracy on a large dataset. Try with other datasets once. I tried your code and got an accuracy of 1.0 as well.
To check the accuracy of your model, you could try printing the r2 score of your test sample. Something along the lines of:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))
If you still have issues with the score, you could try removing random_state=0 to check whether you still get 100% accuracy with several different train/test splits.
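For example, something like this (a sketch that reuses the X and Y from your code and just varies the split seed):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Refit on a few different random splits and compare the scores
for seed in range(5):
    X_tr, x_te, Y_tr, y_te = train_test_split(X, Y, test_size=0.3, random_state=seed)
    reg = LinearRegression().fit(X_tr, Y_tr)
    print(seed, r2_score(y_te, reg.predict(x_te)))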
I have this dataset, and I'm using SKlearn to generate a random forest model as follows:
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.model_selection import cross_val_score, cross_val_predict
import pandas as pd
import numpy as np
df = pd.read_csv('trainingSetExample.csv')
X_train = df.iloc[:, df.columns != 'label']
y_train = df['label']
clf = RandomForest()
print(np.mean(cross_val_score(clf, X_train, y_train, cv=10)))
print('precision', np.mean(cross_val_score(clf, X_train, y_train, cv=10, scoring='precision_macro')))
Accuracy and precision are both 0.99, but when I use WEKA's random forest, accuracy and precision are both 0.95. It looks like the default parameter values are the same for both; in addition, I tried WEKA with 10000 iterations instead of 100 and it didn't improve.
Why are the results that different?
I found the error: the label was included in the features by mistake, so SKlearn always reported a high accuracy (close to 1), but WEKA was smart enough to remove that feature and report the actual accuracy. After removing that column, they both match.
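A quick sanity check along those lines (a sketch, assuming the label column is called 'label' as in the code above):
# Build the feature matrix by dropping the label explicitly, then verify it is really gone
X_train = df.drop(columns=['label'])
y_train = df['label']
assert 'label' not in X_train.columns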
I'm a beginner in machine learning and I want to build a model to predict the price of houses. I prepared a dataset by crawling a local housing website; it consists of 1000 samples and only 4 features (latitude, longitude, area and number of rooms).
I tried the RandomForestRegressor and LinearSVR models in sklearn, but I can't train the model properly and the MSE is super high.
MSE is almost 90,000,000 (the true prices range between 5,000,000 and 900,000,000).
Here is my code:
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import csv
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('dataset.csv', index_col=False)
X = df.drop('price', axis=1)
X_data = X.values
Y_data = df.price.values
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=5)
rgr = RandomForestRegressor(n_estimators=100)
svr = LinearSVR()
rgr.fit(X_train, Y_train)
svr.fit(X_train, Y_train)
MSEs = cross_val_score(estimator=rgr,
                       X=X_train,
                       y=Y_train,
                       scoring='neg_mean_squared_error',
                       cv=5)
MSEsSVR = cross_val_score(estimator=svr,
                          X=X_train,
                          y=Y_train,
                          scoring='neg_mean_squared_error',
                          cv=5)
MSEs *= -1
RMSEs = np.sqrt(MSEs)
print("Root mean squared error with 95% confidence interval:")
print("{:.3f} (+/- {:.3f})".format(RMSEs.mean(), RMSEs.std()*2))
print("")
Is the problem with my dataset and the number of features? How can I build a prediction model with this type of dataset?