I get the following error when trying to compute the score or the mean squared error of a single sample: "Found input variables with inconsistent numbers of samples". I'm using a decision tree regressor model from sklearn to find the chance of having a heart attack based on 13 other parameters. The model seems
to work, but the metrics always give me this kind of error
regardless of how I transform the data. It's because the test sample is 1 row while the training data is 303 rows, but I don't know how to make them match.
import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
heart = pd.read_csv('heart.csv')
test = pd.read_csv('Heart_test.csv')
X = heart.iloc[:,0:13]
Y = heart.iloc[:,13:14]
test = test.iloc[:,0:13]
#print(X.head(), '\n', test.head(),'\n')
model = tree.DecisionTreeRegressor()
model = model.fit(X,Y)
y_prediction = model.predict(test)
print(mean_squared_error(Y, y_prediction))  # raises the error: Y has 303 rows, y_prediction has 1
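For reference, a minimal sketch of one way to get matching lengths, reusing X, Y, and the imports from the snippet above (the split parameters are illustrative): mean_squared_error needs y_true and y_pred with the same number of samples, so hold out part of the labeled heart.csv and score against those held-out labels rather than against the unlabeled test file.

from sklearn.model_selection import train_test_split

# Hold out 20% of the labeled data for evaluation
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=0)
model = tree.DecisionTreeRegressor().fit(X_train, y_train)
y_val_pred = model.predict(X_val)
print(mean_squared_error(y_val, y_val_pred))  # y_val and y_val_pred have equal length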
# import sklearn and necessary libraries
import numpy as np
from sklearn.linear_model import LogisticRegression

# Apply sklearn logistic regression on the given data X and labels Y
X_skl = np.vstack((df1, df2))  # 10000 x 2 array
Y_skl = Y                      # 10000 x 1 array
LogR = LogisticRegression()
LogR.fit(X_skl, Y_skl)
Y_skl_hat = LogR.predict(X_skl)

# Calculate the accuracy:
# count the points where Y_skl is not equal to Y_skl_hat
N = len(Y_skl)
error_count_skl = 0  # number of misclassified points
for i in range(N):
    if Y_skl[i] != Y_skl_hat[i]:
        error_count_skl += 1

Accuracy = 100 * (N - error_count_skl) / N
print("Accuracy(%):")
print(Accuracy)
Output:
Accuracy(%):
99.48
Hello,
I'm trying to apply a logistic regression model to an array X (of size 10000 x 2) and labels Y (10000 x 1) using the sklearn library in Python. I'm completely lost because I've never used this library before. Can anyone help me with the coding?
Edited:
Sorry for the vague question; the goal is to find the training accuracy using the entire dataset X. Above is what I came up with. Can anyone take a look and see if it makes sense?
To calculate accuracy you can simply use this sklearn function:
sklearn.metrics.accuracy_score(y_true, y_pred)
In your case:
sklearn.metrics.accuracy_score(Y_skl, Y_skl_hat)
For details, take a look at the sklearn documentation for accuracy_score.
You should also train your model on some data and test it on other data, to check whether the model generalizes and to avoid overfitting. To split your data into train and test datasets you can use:
sklearn.model_selection.train_test_split
For details, take a look at the sklearn documentation for train_test_split.
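Putting the two suggestions together, a minimal self-contained sketch; make_classification stands in for the question's df1/df2 data, so the numbers are illustrative only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 10000 x 2 array and its labels
X_skl, Y_skl = make_classification(n_samples=10000, n_features=2,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_skl, Y_skl,
                                                    test_size=0.3, random_state=0)
LogR = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, LogR.predict(X_test)))  # held-out accuracy in [0, 1]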
I am using Random Forest for my regression problem in Python. I have rather large data (5 features, 1 target, 9387 rows).
At first, to measure performance, I used a simple RF script with train_test_split and metrics.r2_score, and the result gave me a 0.9999 score on both the train set and the test set. Later, I tried to perform cross-validation using cross_val_score with 5 folds. This gives me 5 numbers (see below), some of which seem strange for cross-validation scores:
[-1.44202691 0.25338018 0.70433516 0.98278159 -3.34943088]
Is it really possible to have a negative score, or is there something wrong with my code?
I am still new to coding and Python, so please bear with me. You can see my code below.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import metrics
from statistics import mean
data = pd.read_csv("Size1.csv", sep=",")
data = data[["X", "Y", "Z", "Tilt_C", "Tilt_A", "Radiation_C"]]
predict = "Radiation_C"
A = np.array(data.drop(columns=[predict]))
B = np.array(data[predict])
# Split data for Train and Test
a_train, a_test, b_train, b_test = train_test_split(A, B, test_size=0.25)
# Fitting Random Forest Regression to the dataset
# create regressor object
rf = RandomForestRegressor(random_state=42)
# fit the regressor with A and B data
rf.fit(a_train, b_train)
# Calculate accuracy
b_pred = rf.predict(a_test)
print('R^2:', metrics.r2_score(b_test, b_pred))
# Perform Cross Validation & scores
scores = cross_val_score(rf,A, B, cv=5)
print(scores)
print("Mean: ", mean(scores))
I was training a model with 8 features that allows us to predict the probability of a room being sold.
Region: the region the room belongs to (an integer between 1 and 10)
Date: the date of stay (an integer between 1 and 365; here we consider only one-day requests)
Weekday: day of week (an integer between 1 and 7)
Apartment: whether the room is a whole apartment (1) or just a room (0)
#beds: the number of beds in the room (an integer between 1 and 4)
Review: average review of the seller (a continuous variable between 1 and 5)
Pic Quality: quality of the picture of the room (a continuous variable between 0 and 1)
Price: the historic posted price of the room (a continuous variable)
Accept: whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y"; hence, this is a binary classification problem.
We plotted the data, and since some features were skewed we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, gradient boosting, and random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure what went wrong, but my thinking is that the cause may be the mix of discrete and continuous data, especially since some features are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
    df = X_train[i]
    ax = df.plot.hist()
    ax.set_title(i)
    plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),
    nn.GELU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.GELU(),
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.GELU(),
    nn.Linear(16, 1),
    # nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
    model,
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.Adam,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
                    learning_rate=0.01,
                    min_child_weight=1,
                    max_depth=6,
                    objective='binary:logistic',
                    n_estimators=500,
                    seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC on the held-out test split
test_auc = roc_auc_score(y_test, yhat[:, -1])
print("AUC", test_auc)
Here is a screenshot of a sample of the data:
[attachment: sample data]
This is the fundamental first step of data analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation - what should you do to update these data fields before passing them to your model? Which inputs do you think will be useful for your model, and which will provide little benefit? Are there outliers you need to consider or handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset there's a couple of things you could try to see how it influences your prediction results:
Unless region order is actually ranked in importance/value, I would change this to a one-hot encoded feature (see the sketch after this list); you can do this in sklearn. Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain categories if they are on a much larger scale than some of your other data fields (see Why Data Normalization is Necessary for Machine Learning Models).
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
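As promised above, a hedged sketch of the one-hot idea (pandas is used here for brevity, sklearn's OneHotEncoder works as well; the Region values are invented):

import pandas as pd

df = pd.DataFrame({"Region": [1, 3, 10], "Price": [50.0, 80.0, 65.0]})
# Replace the ordinal-looking Region column with one indicator column per region
df = pd.get_dummies(df, columns=["Region"], prefix="Region")
print(df.head())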
Without deeply exploring all the data you are using, it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters given the amount of data you have, which is more often a problem with neural networks than with the other methods you mentioned. Or the problem could be with the way the data was split into training/testing: if the two distributions differ significantly (perhaps in a way that's not obvious), you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split (assuming you have a reasonably large set of data). You could also try repeating your experiments with a number of random training/testing splits, as in the sketch below (search for k-fold cross-validation if you're not familiar with it).
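A minimal sketch of that last suggestion, with make_classification standing in for the question's 8 features; if the AUC varies a lot across folds, the original split (or leakage between train and test) is a likely suspect:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the room-booking data
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=123, n_estimators=50)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc"))  # one AUC per fold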
Your model is overfit. Try to build a simpler model first and use lower parameter values. For tree-based classifiers, scaling has no impact on the model.
I wrote a function that takes a dataset (Excel / pandas) and some values, and then predicts the outcome with a decision tree classifier. I have done that with sklearn.
Can you help me with this? I have looked over the web and this website, but I couldn't find an answer that works.
I have tried the following, but it does not work:
from sklearn.metrics import accuracy_score
score = accuracy_score(variable_list, result_list)
This is the error that I get:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and multiclass targets
This is the code (I removed the accuracy code):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree

def predict_concrete_class(input_data, cement, blast_fur_slug, fly_ash,
                           water, superpl, coarse_aggr, fine_aggr, days):
    data_for_tree = concrete_strenght_class(input_data)
    variable_list = []
    result_list = []
    for index, row in data_for_tree.iterrows():
        variable = row.tolist()
        variable = variable[0:8]
        variable_list.append(variable)
        result_list.append(row[-1])
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree = decision_tree.fit(variable_list, result_list)
    input_values = [cement, blast_fur_slug, fly_ash, water, superpl, coarse_aggr, fine_aggr, days]
    prediction = decision_tree.predict([input_values])
    info = "Prediction of future concrete class after " + str(days) + " days: " + str(prediction[0])
    return info

print(predict_concrete_class(data, 500, 0, 0, 200, 0, 1125, 613, 3))
Split your data into train and test:
var_train, var_test, res_train, res_test = train_test_split(variable_list, result_list, test_size = 0.3)
Train your decision tree on train set:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(var_train, res_train)
Test model performance by calculating accuracy on test set:
res_pred = decision_tree.predict(var_test)
score = accuracy_score(res_test, res_pred)
Or you could directly use decision_tree.score:
score = decision_tree.score(var_test, res_test)
The error you are getting is because you are trying to pass variable_list (your list of input features) as a parameter to accuracy_score, when you are supposed to pass the list of true labels and the list of predicted labels.
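Put together as one runnable sketch (make_classification stands in for the concrete dataset's 8 features and class labels, so the numbers are illustrative only):

from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, 3 concrete classes
variable_list, result_list = make_classification(n_samples=500, n_features=8,
                                                 n_informative=5, n_classes=3,
                                                 random_state=0)
var_train, var_test, res_train, res_test = train_test_split(
    variable_list, result_list, test_size=0.3)
decision_tree = tree.DecisionTreeClassifier().fit(var_train, res_train)
res_pred = decision_tree.predict(var_test)
print(accuracy_score(res_test, res_pred))
print(decision_tree.score(var_test, res_test))  # same value, computed directly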
You should perform cross-validation if you want to check the accuracy of your system.
You have to split your data set into two parts: the first one is used to train your system, and then you run the prediction process on the second part and compare the predicted results with the true ones. With this method, you check your system on data it has not learned from.
To split your set, use train_test_split from sklearn.model_selection; it splits your set randomly.
Here is a good lecture on the topic: https://machinelearningmastery.com/k-fold-cross-validation/
I'm trying to make a prediction with logistic regression and to test accuracy with Python and the sklearn library. I'm using data that I downloaded from here:
http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength
It's an Excel file. I wrote the code below, but I always get the same error:
ValueError: Unknown label type: 'continuous'
I used the same logic when I made a linear regression, and it works for linear regression.
This is the code:
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Reading data from Excel
data = pd.read_excel("DataSet.xls").round(2)
data_size = data.shape[0]
#print("Number of data:", data_size, "\n", data.head())

my_data = data[(data["Superpl"] == 0) & (data["FlyAsh"] == 0) & (data["BlastFurSlag"] == 0)].drop(columns=["Superpl","FlyAsh","BlastFurSlag"])
my_data = my_data[my_data["Days"] <= 28]
my_data_size = my_data.shape[0]
#print("Size of dataset for 28 days or less:", my_data_size, "\n", my_data.head())

def logistic_regression(data_input, cement, water,
                        coarse_aggr, fine_aggr, days):
    # column names: all but the last are features, the last is the target
    variable_list = []
    result_list = []
    for column in data_input:
        variable_list.append(column)
        result_list.append(column)
    variable_list = variable_list[:-1]
    result_list = result_list[-1]
    variables = data_input[variable_list]
    results = data_input[result_list]

    # accuracy of prediction (splitting the dataframe into train and test)
    var_train, var_test, res_train, res_test = train_test_split(variables, results, test_size=0.3, random_state=42)

    # making the logistic model and fitting the data to it
    log_regression = linear_model.LogisticRegression()
    model = log_regression.fit(var_train, res_train)

    input_values = [cement, water, coarse_aggr, fine_aggr, days]
    # predicting the outcome based on the input_values
    predicted_strength = log_regression.predict([input_values])
    predicted_strength = round(predicted_strength[0], 2)

    # calculating the accuracy score
    score = log_regression.score(var_test, res_test)
    score = round(score * 100, 2)

    prediction_info = "\nPrediction of future strength: " + str(predicted_strength) + " MPa\n"
    accuracy_info = "Accuracy of prediction: " + str(score) + "%\n"
    full_info = prediction_info + accuracy_info
    return full_info

print(logistic_regression(my_data, 376.0, 214.6, 1003.5, 762.4, 3))  # true value after 3 days: 16.28 MPa
Although you don't provide details of your data, judging from the error and the comment in the last line of your code:
#true value affter 3 days: 16.28 MPa
I conclude that you are in a regression (i.e numeric prediction) setting. A linear regression is an appropriate model for this task, but a logistic regression is not: logistic regression is for classification problems, and thus it expects binary (or categorical) data as target variables, not continuous values, hence the error.
In short, you are trying to apply a model that is inappropriate for your problem.
UPDATE (after link to the data): Indeed, reading closely the dataset description, you'll see (emphasis added):
The concrete compressive strength is the regression problem
while from scikit-learn User's Guide for logistic regression (again, emphasis added):
Logistic regression, despite its name, is a linear model for classification rather than regression.
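So a hedged sketch of the fix would be to swap linear_model.LogisticRegression() for linear_model.LinearRegression() inside the question's function; the stripped-down version below uses random stand-in data, and note that .score for a regressor returns R², not classification accuracy:

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Random stand-ins for cement, water, coarse_aggr, fine_aggr, days
rng = np.random.default_rng(42)
variables = rng.random((100, 5))
results = 20 * variables[:, 0] + 5 * variables[:, 4] + rng.normal(0, 1, 100)  # continuous target

var_train, var_test, res_train, res_test = train_test_split(
    variables, results, test_size=0.3, random_state=42)
lin_regression = linear_model.LinearRegression().fit(var_train, res_train)
print(lin_regression.predict(var_test[:1]))      # numeric prediction, no label-type error
print(lin_regression.score(var_test, res_test))  # R^2, not accuracy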