Large mean squared error in sklearn regressors - python

I'm a beginner in machine learning and I want to build a model to predict house prices. I prepared a dataset by crawling a local housing website; it consists of 1000 samples and only 4 features (latitude, longitude, area and number of rooms).
I tried RandomForestRegressor and LinearSVR models in sklearn, but I can't train the model properly and the MSE is super high.
The MSE is almost 90,000,000 (the true prices range from 5,000,000 to 900,000,000).
Here is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVR
df = pd.read_csv('dataset.csv', index_col=False)
X = df.drop('price', axis=1)
X_data = X.values
Y_data = df.price.values
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=5)
rgr = RandomForestRegressor(n_estimators=100)
svr = LinearSVR()
rgr.fit(X_train, Y_train)
svr.fit(X_train, Y_train)
# scikit-learn scorers follow "higher is better", so the MSE scorer is
# called 'neg_mean_squared_error' and returns negated values
MSEs = cross_val_score(estimator=rgr,
                       X=X_train,
                       y=Y_train,
                       scoring='neg_mean_squared_error',
                       cv=5)
MSEsSVR = cross_val_score(estimator=svr,
                          X=X_train,
                          y=Y_train,
                          scoring='neg_mean_squared_error',
                          cv=5)
MSEs *= -1  # flip the sign back to get positive MSE values
RMSEs = np.sqrt(MSEs)
print("Root mean squared error with 95% confidence interval:")
print("{:.3f} (+/- {:.3f})".format(RMSEs.mean(), RMSEs.std()*2))
print("")
Is the problem with my dataset and count of features? How can I build a prediction model with this type of dataset?
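One thing worth checking before blaming the dataset is the unit of the metric. MSE is measured in squared price units, so it is easier to judge the RMSE against the price scale and against a trivial baseline. The sketch below is only an illustration under stated assumptions: the data is synthetic (the coordinate ranges, coefficients and noise level are made up, not taken from the question), and `DummyRegressor` is swapped in as a baseline that always predicts the mean price.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# If the reported 90,000,000 really is an MSE, the RMSE is only about 9,487.
reported_rmse = np.sqrt(90_000_000)

# Synthetic stand-in for the housing data (all numbers are made up).
rng = np.random.default_rng(5)
n = 1000
X = np.column_stack([
    rng.uniform(35.0, 36.0, n),    # latitude
    rng.uniform(51.0, 52.0, n),    # longitude
    rng.uniform(40.0, 300.0, n),   # area
    rng.integers(1, 6, n),         # number of rooms
])
y = 2_000_000 * X[:, 2] + 10_000_000 * X[:, 3] + rng.normal(0, 20_000_000, n)

models = {
    "baseline_mean": DummyRegressor(strategy="mean"),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=5),
}
rmse = {}
for name, model in models.items():
    # The scorer returns negated RMSE values, so flip the sign.
    scores = cross_val_score(model, X, y,
                             scoring="neg_root_mean_squared_error", cv=5)
    rmse[name] = -scores.mean()
    print(f"{name}: RMSE = {rmse[name]:,.0f} "
          f"({rmse[name] / y.mean():.1%} of the mean price)")
```

If the model's RMSE is not clearly below the baseline's, the features probably carry too little signal; if it is, the large absolute number mostly reflects the scale of the prices.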

Related

Precision, recall, F1 score all have zero value for the minority class in the classification report

I get a warning when using the SVM and MLP classifiers from the sklearn package. The warning is C:\Users\cse_s\anaconda3\lib\site-packages\sklearn\metrics_classification.py:1327: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Code for splitting dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
Code for SVM classifier
from sklearn import svm
from sklearn.metrics import classification_report
SVM_classifier = svm.SVC(kernel="rbf", probability = True, random_state=1)
SVM_classifier.fit(X_train, y_train)
SVM_y_pred = SVM_classifier.predict(X_test)
print(classification_report(y_test, SVM_y_pred))
Code for MLP classifier
from sklearn.neural_network import MLPClassifier
MLP = MLPClassifier(random_state=1, learning_rate = "constant", learning_rate_init=0.3, momentum = 0.2 )
MLP.fit(X_train, y_train)
R_y_pred = MLP.predict(X_test)
target_names = ['No class', 'Yes Class']
print(classification_report(y_test, R_y_pred, target_names=target_names))
The warning is the same for both classifiers.
I hope this helps.
The warning means that at least one label in y_test is never predicted, so the precision for that label is 0/0. classification_report accepts a zero_division parameter that sets the value to report when such a zero division occurs; you can pass 0 or 1:
classification_report(y_test, R_y_pred, target_names=target_names, zero_division=0)
I don't know what your data looks like, so here's an example using the breast cancer dataset.
Features of the dataset:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
cancer = load_breast_cancer()
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df_feat.head()
Target of dataset:
df_target = pd.DataFrame(cancer['target'],columns=['Cancer'])
np.ravel(df_target) # convert it into a 1-d array
Generate classification report:
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.3, random_state=101)
SVM_classifier = svm.SVC(kernel="rbf", probability = True, random_state=1)
SVM_classifier.fit(X_train, y_train)
SVM_y_pred = SVM_classifier.predict(X_test)
print(classification_report(y_test, SVM_y_pred))
Generate classification report for MLP Classifier:
MLP = MLPClassifier(random_state=1, learning_rate = "constant", learning_rate_init=0.3, momentum = 0.2 )
MLP.fit(X_train, y_train)
R_y_pred = MLP.predict(X_test)
target_names = ['No class', 'Yes Class']
print(classification_report(y_test, R_y_pred, target_names=target_names, zero_division=0))
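The effect of zero_division is easier to see on a tiny hand-made case than on a real dataset. The following is a minimal sketch; the labels are invented for illustration and stand for a model that never predicts the minority class.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_score

# Hypothetical labels: class 1 is the minority and is never predicted.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0])

# Precision for class 1 is TP / (TP + FP) = 0 / 0, so the reported value
# is whatever zero_division says.
p_zero = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
p_one = precision_score(y_true, y_pred, pos_label=1, zero_division=1)
print(p_zero, p_one)

# Labels that actually occur in the predictions; class 1 is missing.
predicted_labels = np.unique(y_pred)
print(predicted_labels)

report = classification_report(y_true, y_pred, zero_division=0,
                               output_dict=True)
print(report["1"])
```

Setting zero_division only silences the warning. If the minority class is never predicted on the real data, the model itself still needs attention, for example through class weights, resampling, or a less aggressive learning rate.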

cross validation for split test and train datasets

Unlike the standard setup, my dataset comes already split into separate train, test1 and test2 files. I implemented ML algorithms and got performance metrics, but when I try to apply cross validation it gets complicated. Maybe someone can help me. Thank you.
Here is my code:
train = pd.read_csv('train-alldata.csv',sep=";")
test = pd.read_csv('test1-alldata.csv',sep=";")
test2 = pd.read_csv('test2-alldata.csv',sep=";")
X_train = train_pca_son.drop('churn_yn',axis=1)
y_train = train_pca_son['churn_yn']
X_test = test_pca_son.drop('churn_yn',axis=1)
y_test = test_pca_son['churn_yn']
X_test_2 = test2_pca_son.drop('churn_yn',axis=1)
y_test_2 = test2_pca_son['churn_yn']
For example, KNN Classifier.
knn_classifier = KNeighborsClassifier(n_neighbors =7,metric='euclidean')
knn_classifier.fit(X_train, y_train)
For K-Fold.
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
dtc = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits = 5)
scores = cross_val_score(dtc, X, y, cv = k_folds)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
This is a variation on the "holdout test data" pattern (see also: Wikipedia: Training, Validation, Test / Confusion in terminology). For churn prediction: this may arise if you have two types of customers, or are evaluating on two time frames.
X_train, y_train ← perform training and hyperparameter tuning with this
X_test1, y_test1 ← test on this
X_test2, y_test2 ← test on this as well
Cross validation estimates holdout error using the training data—it may come up if you estimate hyperparameters with GridSearchCV. Final evaluation involves estimating performance on two test sets, separately or averaged over the two:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
X_test1, X_test2, y_test1, y_test2 = train_test_split(X_test, y_test, test_size=.5)
print(y_train.shape, y_test1.shape, y_test2.shape)
# (600,) (200,) (200,)
clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
print(f1_score(y_test1, clf.predict(X_test1)))
print(f1_score(y_test2, clf.predict(X_test2)))
# 0.819
# 0.805
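To show where cross validation fits in this pattern, here is a hedged sketch that tunes n_neighbors with GridSearchCV on the training set only and then scores the refitted model once on each test set. The parameter grid, the fixed random_state values and the synthetic data are assumptions made for the example, not part of the original setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train / test1 / test2 files.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_test1, X_test2, y_test1, y_test2 = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Cross validation happens inside the training set only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},  # illustrative grid
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# The best model is refit on all training data and scored once per test set.
f1_test1 = f1_score(y_test1, search.predict(X_test1))
f1_test2 = f1_score(y_test2, search.predict(X_test2))
print(f1_test1, f1_test2)
```

The two test sets never take part in the folds, so their scores remain an untouched estimate of how the tuned model generalizes.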

I am trying to build a simple linear regression model with the salary data CSV file, which has 35 data points. How do I split it 80-20?

# split train test data
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# import required modules and train the ML algorithm
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
I am getting the error:
Found input variables with inconsistent numbers of samples: [28, 7]
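train_test_split returns its outputs in the order X_train, X_test, y_train, y_test. The code in the question unpacks them as X_train, y_train, X_test, y_test, so fit receives the 28 training inputs together with the 7 test inputs, which is what the error reports. For the salary CSV data, I will use this one as an example:
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression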
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#Import the dataset
salary_data=pd.read_csv("/mnt/c/Users/XXXXXXX/Downloads/Salary_Data.csv")
# Here you can split your dataset between train and test using 80% for train
X_train, X_test, y_train, y_test = train_test_split(salary_data["YearsExperience"], salary_data["Salary"], test_size=0.2, random_state=1)
#Then you can fit your linear model on the training set
#The goal here is to model salary as a function of years of experience
regressor = LinearRegression()
model = regressor.fit(X_train.values.reshape(-1, 1),y_train.values.reshape(-1, 1))
#Let's plot our model prediction on whole data and compare to real data
plt.title("Salary/Years of XP")
plt.ylabel("Salary $")
plt.xlabel("Years")
plt.plot(salary_data["YearsExperience"],salary_data["Salary"],color="blue",label="real data")
plt.plot(salary_data["YearsExperience"],model.predict(salary_data["YearsExperience"].values.reshape(-1,1)),color="red",label="linear model")
plt.legend()
plt.show()
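The 80-20 split for 35 points can also be checked without the CSV file. This is a small sketch on made-up data: the linear salary formula is an assumption for illustration, and only the shapes matter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for the 35 salary rows (exactly linear, no noise).
x = np.arange(35, dtype=float).reshape(-1, 1)   # years of experience
y = 1000.0 * x.ravel() + 30000.0                # salary

# Correct unpacking order: both X parts first, then both y parts.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

regressor = LinearRegression().fit(X_train, y_train)
r2 = regressor.score(X_test, y_test)
print(regressor.coef_[0], regressor.intercept_, r2)
```

With test_size=0.2, 7 of the 35 rows go to the test set and 28 to the training set, matching the two numbers in the error message.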

High OOB error for Random forest with Python

I am trying to use the Random Forest Classifier from scikit-learn in Python to predict stock movements. My dataset has 8 features and 1201 records. After fitting the model and using it to predict, it shows 100% accuracy and a 100% OOB error. I reduced n_estimators from 100 to a small value, but the OOB error dropped by only a few percent. Here is my code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
#File reading
df = pd.read_csv('700.csv')
df.drop(df.columns[0], axis=1, inplace=True)
target = df.iloc[:,8]
print(target)
#train test split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.3)
#model fit
clf = RandomForestClassifier(n_estimators=100, criterion='gini',oob_score= True)
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test,pred)
print(clf.oob_score_)
print(accuracy)
How can I modify the code in order to make the OOB error drop? Thanks.
clf.oob_score_ is the out-of-bag accuracy estimate, not the error, so a value of 1.0 corresponds to an OOB error of 0. If you want the error, modify your code like this:
oob_error = 1 - clf.oob_score_
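A short sketch of the relationship between the OOB score and the OOB error, run on synthetic data with the same shape as in the question (1201 rows, 8 features). Note that the question's code appears to pass the whole DataFrame, including the target column at df.iloc[:, 8], as the features; if so, that alone would explain a perfect score. The sketch keeps features and target separate. The dataset and random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: X holds only the 8 features, y only the target.
X, y = make_classification(n_samples=1201, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

# oob_score_ is an accuracy; the OOB error is its complement.
oob_error = 1 - clf.oob_score_
test_error = 1 - accuracy_score(y_test, clf.predict(X_test))
print(f"OOB error:  {oob_error:.3f}")
print(f"Test error: {test_error:.3f}")
```

When the features do not leak the target, the OOB error and the test error usually land close to each other, which makes the OOB estimate a cheap sanity check.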

How to compute "y_train_true, y_train_prob, y_test_true, y_test_prob"?

I have computed X_train, X_test, y_train, y_test. But I can not compute y_train_true, y_train_prob, y_test_true, y_test_prob.
How can I compute y_train_true, y_train_prob, y_test_true, y_test_prob from the following code ?
N.B.:
y_train_true: True binary labels (0 or 1) in the training dataset
y_train_prob: Probability in the range [0, 1] predicted by the model for the training dataset
y_test_true: True binary labels (0 or 1) in the testing dataset
y_test_prob: Probability in the range [0, 1] predicted by the model for the testing dataset
Code :
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])  # .ix has been removed from pandas; .iloc selects by position
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Define the classifier and fit it
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
# knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_train)
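In your case y_train and y_test are already y_train_true and y_test_true. To get y_train_prob and y_test_prob, you need a fitted model that exposes predict_proba. I don't know which dataset you're using, but it looks like a binary classification problem, so logistic regression would work as well; with the KNN classifier you already have, it looks like this. Note that predict_proba returns one column per class, so select the column of the positive class if you need a single probability per sample.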
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
y_train_prob = knn.predict_proba(X_train)
y_test_prob = knn.predict_proba(X_test)
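As a runnable illustration of all four arrays, here is a sketch that uses scikit-learn's built-in breast cancer dataset as a stand-in, since the original dataset is not shown. Selecting column 1 with `[:, 1]` and scoring with `roc_auc_score` are assumptions about how the probabilities will be used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in binary dataset (labels are 0 or 1).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)

# The true labels are simply the split outputs.
y_train_true, y_test_true = y_train, y_test

# predict_proba has shape (n_samples, n_classes); column 1 is the
# probability of the class labelled 1 (see knn.classes_).
y_train_prob = knn.predict_proba(X_train)[:, 1]
y_test_prob = knn.predict_proba(X_test)[:, 1]

test_auc = roc_auc_score(y_test_true, y_test_prob)
print(y_test_prob[:5], test_auc)
```

With KNN the probabilities are just the fraction of the 5 nearest neighbours that belong to class 1, so they only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1.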
