I'm trying to make predictions with logistic regression and to test their accuracy using Python and the sklearn library. I'm using data that I downloaded from here:
http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength
It's an Excel file. I wrote some code, but I always get the same error:
ValueError: Unknown label type: 'continuous'
I used the same logic for linear regression, and it works there.
This is the code:
import numpy as np
import pandas as pd
import xlrd
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Reading data from excel
data = pd.read_excel("DataSet.xls").round(2)
data_size = data.shape[0]
#print("Number of data:",data_size,"\n",data.head())
my_data = data[(data["Superpl"] == 0) & (data["FlyAsh"] == 0) & (data["BlastFurSlag"] == 0)].drop(columns=["Superpl","FlyAsh","BlastFurSlag"])
my_data = my_data[my_data["Days"]<=28]
my_data_size = my_data.shape[0]
#print("Size of dataset for 28 days or less:", my_data_size, "\n", my_data.head())
def logistic_regression(data_input, cement, water,
                        coarse_aggr, fine_aggr, days):
    variable_list = []
    result_list = []
    for column in data_input:
        variable_list.append(column)
        result_list.append(column)
    variable_list = variable_list[:-1]
    result_list = result_list[-1]
    variables = data_input[variable_list]
    results = data_input[result_list]
    #accuracy of prediction (splitting dataframe in train and test)
    var_train, var_test, res_train, res_test = train_test_split(variables, results, test_size = 0.3, random_state = 42)
    #making logistic model and fitting the data into logistic model
    log_regression = linear_model.LogisticRegression()
    model = log_regression.fit(var_train, res_train)
    input_values = [cement, water, coarse_aggr, fine_aggr, days]
    #predicting the outcome based on the input_values
    predicted_strength = log_regression.predict([input_values]) #adding values for prediction
    predicted_strength = round(predicted_strength[0], 2)
    #calculating accuracy score
    score = log_regression.score(var_test, res_test)
    score = round(score*100, 2)
    prediction_info = "\nPrediction of future strength: " + str(predicted_strength) + " MPa\n"
    accuracy_info = "Accuracy of prediction: " + str(score) + "%\n"
    full_info = prediction_info + accuracy_info
    return full_info
print(logistic_regression(my_data, 376.0, 214.6, 1003.5, 762.4, 3)) #true value affter 3 days: 16.28 MPa
Although you don't provide details of your data, judging from the error and the comment in the last line of your code:
#true value affter 3 days: 16.28 MPa
I conclude that you are in a regression (i.e. numeric prediction) setting. Linear regression is an appropriate model for this task, but logistic regression is not: it is a classification model, so it expects binary or categorical values as the target variable, not continuous ones. Hence the error.
In short, you are trying to apply a model that is inappropriate for your problem.
UPDATE (after link to the data): Indeed, reading closely the dataset description, you'll see (emphasis added):
The concrete compressive strength is the regression problem
while the scikit-learn User's Guide says of logistic regression (again, emphasis added):
Logistic regression, despite its name, is a linear model for classification rather than regression.
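As a minimal sketch of the difference, the snippet below uses synthetic data rather than the concrete dataset (the arrays X and y and their generated values are assumptions made up for illustration). A continuous target fits without complaint in LinearRegression, while LogisticRegression rejects it with a ValueError of the kind shown in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in for the concrete data: 5 features, continuous target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 5))
y = X @ np.array([10.0, -4.0, 2.0, 1.0, 6.0]) + rng.normal(0.0, 0.1, size=100)

# A regressor accepts continuous targets.
lin_reg = LinearRegression().fit(X, y)
r2 = lin_reg.score(X, y)  # for regressors, score() is R^2, not accuracy

# A classifier refuses them and raises ValueError.
try:
    LogisticRegression().fit(X, y)
    logistic_error = None
except ValueError as exc:
    logistic_error = str(exc)

print("Linear regression R^2:", round(r2, 3))
print("Logistic regression error:", logistic_error)
```

Note that score() on a regressor returns the coefficient of determination (R^2), so it should not be reported as a percentage "accuracy".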
I get the following error when trying to compute the score or the mean squared error for a single sample: "Found input variables with inconsistent number of samples". The model is a decision tree regressor from sklearn that estimates the chance of having a heart attack based on 13 other parameters. The model seems to work, but the metric functions always give me this kind of error regardless of how I transform the data. It's because the test sample is 1 row whilst the training data is 303 rows, but I don't know how to make them fit.
import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
heart = pd.read_csv('heart.csv')
test = pd.read_csv('Heart_test.csv')
X = heart.iloc[:,0:13]
Y = heart.iloc[:,13:14]
test = test.iloc[:,0:13]
#print(X.head(), '\n', test.head(),'\n')
model = tree.DecisionTreeRegressor()
model = model.fit(X,Y)
y_prediction = model.predict(test)
print(mean_squared_error(Y, y_prediction))
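One way to read the error: mean_squared_error compares true targets and predictions row by row, so both arguments must cover the same samples. In the snippet above, Y holds the 303 training targets while y_prediction holds predictions for the rows of the test file. The sketch below is only an illustration on synthetic data (the arrays and the 80/20 split are assumptions, not the heart dataset): it holds out part of the labelled data so that true targets exist for every predicted row.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 303 rows, 13 features, a 0/1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Hold out labelled rows so predictions can be compared with known targets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Same number of rows on both sides, so the metric is defined.
mse = mean_squared_error(y_test, y_pred)
print("Held-out MSE:", mse)

# Comparing all 303 targets with fewer predictions reproduces the error.
try:
    mean_squared_error(y, y_pred)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print("Mismatch error:", mismatch_error)
```

If the separate test file contains the true target for its single row, that one value can be passed as the first argument instead.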
I am working on a logistic regression case with an imbalance in the target variable. To fix this I am using SMOTE (Synthetic Minority Oversampling Technique), but each time I run my regression model, I get different numbers in my confusion matrix. I have set the random_state parameter when invoking both SMOTE and LogisticRegression, but to no avail. The features are also the same in each iteration. I once got a best recall of 0.81 and an AUC of 0.916, but I can't reproduce those values. On some occasions the numbers of false positives and false negatives shoot up, indicating that the classifier is very bad.
Please point out what I am doing wrong; the code snippet is below.
# Feature Selection
features = [ 'FEMALE','MALE','SINGLE','UNDER_WEIGHT','OBESE','PROFESSION_ANYS',
'PROFESSION_PROF_UNKNOWN']
# Set X and Y Variables
X5 = dataframe[features]
# Target variable
Y5 = dataframe['PURCHASE']
# Splitting using SMOTE
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state = 4)
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5,Y5, test_size=0.20)
os_data_X5,os_data_Y5 = os.fit_sample(X5_train, Y5_train)
columns = X5_train.columns
os_data_X5 = pd.DataFrame(data = os_data_X5, columns = columns )
os_data_Y5 = pd.DataFrame(data = os_data_Y5, columns = ['PURCHASE'])
# Instantiate Logistic Regression model (using the default parameters)
logreg_5 = LogisticRegression(random_state = 4, penalty='l1', class_weight = 'balanced')
# Fit the model with train data
logreg_5.fit(os_data_X5,os_data_Y5)
# Make predictions on test data set
Y5_pred = logreg_5.predict(X5_test)
# Make Confusion Matrix to compare results against actual values
cnf_matrix = metrics.confusion_matrix(Y5_test, Y5_pred)
cnf_matrix
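One thing visible in the snippet above is that train_test_split is called without random_state, so every run trains and tests on different rows. That alone could change the confusion matrix between runs, although the snippet does not show whether it is the only cause. The sketch below uses synthetic arrays (an assumption for illustration) and shows only that a fixed seed makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 2 features, imbalanced 0/1 target.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# With a fixed random_state, two calls return exactly the same rows.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.20, random_state=4)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.20, random_state=4)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Identical test sets with a fixed seed:", same_split)
```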
I wrote a function that takes a dataset (Excel / pandas) and some values, and then predicts the outcome with a decision tree classifier from sklearn.
How can I measure the accuracy of these predictions? I have searched the web and this website, but I couldn't find an answer that works.
I have tried this, but it does not work:
from sklearn.metrics import accuracy_score
score = accuracy_score(variable_list, result_list)
This is the error that I get:
ValueError: Classification metrics can't handle a mix of continuous-multioutput and multiclass targets
This is the code (I removed the code for accuracy):
import pandas as pd
import math
import xlrd
from sklearn.model_selection import train_test_split
from sklearn import tree
def predict_concrete_class(input_data, cement, blast_fur_slug, fly_ash,
                           water, superpl, coarse_aggr, fine_aggr, days):
    data_for_tree = concrete_strenght_class(input_data)
    variable_list = []
    result_list = []
    for index, row in data_for_tree.iterrows():
        variable = row.tolist()
        variable = variable[0:8]
        variable_list.append(variable)
        result_list.append(row[-1])
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree = decision_tree.fit(variable_list,result_list)
    input_values = [cement, blast_fur_slug, fly_ash, water, superpl, coarse_aggr, fine_aggr, days]
    prediction = decision_tree.predict([input_values])
    info = "Prediction of future concrete class after "+ str(days)+" days: "+ str(prediction[0])
    return info
print(predict_concrete_class(data, 500, 0, 0, 200, 0, 1125, 613, 3))
Split your data into train and test:
var_train, var_test, res_train, res_test = train_test_split(variable_list, result_list, test_size = 0.3)
Train your decision tree on train set:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(var_train, res_train)
Test model performance by calculating accuracy on test set:
res_pred = decision_tree.predict(var_test)
score = accuracy_score(res_test, res_pred)
Or you could directly use decision_tree.score:
score = decision_tree.score(var_test, res_test)
The error you are getting is because you are trying to pass variable_list (which is your list of input features) as a parameter in accuracy_score. You are supposed to pass your list of true labels and predicted labels.
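The steps above can be put together as a self-contained sketch. The data here come from make_classification, which is only a stand-in for the concrete dataset (8 features and 3 classes are assumptions chosen for illustration); the variable names follow the snippets above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 input features, 3 class labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Hold out 30% of the rows for testing.
var_train, var_test, res_train, res_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree = decision_tree.fit(var_train, res_train)

# accuracy_score takes true labels first, predicted labels second.
res_pred = decision_tree.predict(var_test)
score_from_metric = accuracy_score(res_test, res_pred)

# The classifier's own score() computes the same quantity.
score_from_model = decision_tree.score(var_test, res_test)

print(score_from_metric, score_from_model)
```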
You should perform cross-validation, or at least hold out a test set, if you want to check the accuracy of your system.
Split your data set into two parts. The first part is used to train the model. Then you run predictions on the second part and compare the predicted results with the true ones. This way you test the model on data it has not seen during training.
To split your set, use train_test_split from sklearn.model_selection.
It splits the set randomly.
Here is a good article on k-fold cross-validation, which repeats this over several different splits: https://machinelearningmastery.com/k-fold-cross-validation/
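As a rough sketch of k-fold cross-validation in scikit-learn, cross_val_score trains and tests the model once per fold and returns one score per fold. The data below are synthetic and the choice of 5 folds is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# cv=5: five train/test rounds, each holding out a different fifth of the data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy: %0.2f (std %0.2f)" % (scores.mean(), scores.std()))
```

Averaging over folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting the model several times.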
I have a supervised learning classification problem. I have 4 numeric class labels (0, 1, 2, 3) and I have about 100 trials of 38 separate features as the input.
After inputting this data into an SVC classifier in Python and Matlab (specifically the Classification Learner App), and matching the hyperparameters (C = 1, type = quadratic SVM, multi-class_method = onevsone, standardised data, no PCA), the accuracies given vary drastically:
Matlab = 86.7 %
Python = 45.0 %
Has anyone come across this or have any other ideas what I could do to know which one is correct?
Matlab input:
Python input:
import numpy as np
from sklearn import datasets, linear_model, metrics, svm, preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
symptom = input("What symptom would you like to analyse? \n")
cross_validation = input("With cross validation? \n")
if cross_validation == "Yes":
    no_cvfolds = int(input("Number of folds? \n"))
x = symptomDF[feature]
y = symptomDF.loc[:, 'updrs_class'].values
x_new = StandardScaler().fit_transform(x)
scores = cross_val_score(SVC(kernel='poly', degree = 2, C=1.0, decision_function_shape = 'ovo'), x_new, y, cv = no_cvfolds)
print(name + " Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
After a few days, I found one factor that helped bring the accuracy closer to Matlab's: the cross-validator used and its shuffle parameter.
Instead of passing the number of folds directly to the cross_val_score function, I defined a stratified k-fold cross-validator. This worked better for me because it keeps the class proportions balanced across folds when the classes have different sizes. I then explicitly set shuffle = True so that the data are shuffled before the cross-validator splits them into folds.
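A minimal sketch of that setup, assuming synthetic data with the same shape as in the question (100 trials, 38 features, 4 classes) and 5 folds. The pipeline that scales inside each fold is my own choice here, not something the original code did, and random_state is fixed only to make the shuffle repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 100 trials, 38 features, 4 class labels.
X, y = make_classification(
    n_samples=100, n_features=38, n_informative=10, n_classes=4, random_state=0
)

# Stratified folds, with the rows shuffled before they are split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Quadratic SVM with the hyperparameters from the question.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=2, C=1.0, decision_function_shape="ovo"),
)

scores = cross_val_score(model, X, y, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

For a classifier, passing an integer as cv already makes cross_val_score use stratified folds, but without shuffling, so the shuffle is likely the part that matters when the rows are ordered by class or by subject.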
I am using a dataset to examine the relationship between salary and college GPA with sklearn's linear regression model. I expected the coefficients to include both the intercept and the coefficient of the corresponding feature, but the model gives a single value.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Use only one feature : CollegeGPA
labour_data_gpa = labour_data[['collegeGPA']]
# salary as a dependent variable
labour_data_salary = labour_data[['Salary']]
# Split the data into training/testing sets
gpa_train, gpa_test, salary_train, salary_test = train_test_split(labour_data_gpa, labour_data_salary)
# Create linear regression object
regression = LinearRegression()
# Train the model using the training sets (first parameter is x )
regression.fit(gpa_train, salary_train)
#coefficients
regression.coef_
The output is : Out[12]: array([[ 3235.66359637]])
Try:
regression = LinearRegression(fit_intercept =True)
regression.fit(gpa_train, salary_train)
and the results will be in
regression.coef_
regression.intercept_
To get a better understanding of your linear regression, you might also consider the statsmodels package; the following tutorial helps: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
salary_pred = regression.predict(gpa_test)
print(salary_pred)
print(salary_test)
I think salary_pred = regression.coef_ * gpa_test + regression.intercept_.
Try plotting salary_pred and salary_test with pyplot; the figure should make the relationship clear.
Here you are training your model on a single feature gpa and a target salary:
regression.fit(gpa_train, salary_train)
If you train your model on multiple features e.g. python_gpa and java_gpa (with the target as salary), then you would get two outputs signifying coefficients of the equation (for the linear regression model) and a single intercept.
It is equivalent to: ax + by + c = salary (where c is the intercept, a and b are the coefficients).
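To illustrate the equation above, here is a small sketch with two made-up features. The column meanings (python_gpa, java_gpa) and the values of a, b and c are assumptions chosen for the example, and the target is generated without noise so that the fitted values can be compared with the known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: column 0 = python_gpa, column 1 = java_gpa.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))

# salary = a*x + b*y + c with known, made-up constants.
a, b, c = 3000.0, 1500.0, 20000.0
salary = a * X[:, 0] + b * X[:, 1] + c

regression = LinearRegression()
regression.fit(X, salary)

# One coefficient per feature, plus a single intercept.
print("coef_:", regression.coef_)
print("intercept_:", regression.intercept_)

# predict() is the same as applying the equation by hand.
manual_pred = X @ regression.coef_ + regression.intercept_
print("Matches predict():", np.allclose(manual_pred, regression.predict(X)))
```

With a single feature, as in the question, coef_ holds one value and the intercept is stored separately in intercept_, which is why only one number appeared.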