Logistic Regression test input format help in python - python

I do have the below dataset.
I've created Logistic Regression out of it and checked Accuracy and is working fine. So now requirement is I've a new data with Age 30 and EstimatedSalary 50000 and I would like to predict whether Purchased will be 0 or 1. How to pass the new values 30 and 50000 in my python code.
Below is the python code which I've used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline
dataset = pd.read_csv(r"suv_data.csv")
X=dataset.iloc[:,[0,1]].values
y=dataset.iloc[:,2].values
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=1)
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
accuracy_score(y_test,y_pred)*100
Regards,
Bharath Vikas

In general, to evaluate (i.e. call .predict in sklearn) a trained model, you need to input samples that have the same shape as the samples the model was trained on.
In your case I suppose (see my comment on your question) you wanted to have samples with Age and EstimatedSalary in the training set using Purchased as label.
Then, to test on a single sample just try this:
single_test_sample = pd.DataFrame({'Age':[30], 'EstimatedSalary':[50000]}).iloc[:,[0,1]].values
single_test_sample = sc.transform(single_test_sample)
single_test_prediction = classifier.predict(single_test_sample)
Note that you can also add more values in the test dataframe Age and EstimatedSalary columns, now I only added the sample you were interested in. If you add more, the model will output a prediction for each row in the test dataframe.
Also note that your code and mine, will also work without this .values at the end of the train/test set as sklearn already provides functionality with pandas dataframes.

Your question is not clear but I understand that you need to use the fitted model to predict a new sample.
After having fitted your model just use this:
new_sample = np.array([[30,50000]]) # 2D numpy array
new_sample_sc = sc.transform(new_sample)
y_pred_new = classifier.predict(new_sample_sc)
print(y_pred_new)

Related

Create 3 classification models to predict the class based on the other available columns

I have three type of classes (stetosa, versicolor, virginica) and also 4 other columns as sepal_length, sepal_width, petal_length, petal_width with around 150 rows and each it's filled with it's own information (so nothing is empty there). I need to predict the type of the class based on other columns.
This is what I have tried:
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
X=df[["sepal_length","sepal_width","petal_length","petal_width"]]
y=df["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LinearRegression
clf=LinearRegression()
clf.fit(y_train, X_train)
clf.predict(y_test)
The text marked reponse with this problem:
ValueError: could not convert string to float: 'virginica'
I need to do this with train and test.
You need to encode your data. in other words, transform each category in a number (int or float).
Map the following categories like this:
mapping={'setosa':0,'versicolor':1,'virginica':2}
y.map(mapping)
After you train your model, you will get 0,1 or 2 as a result. Convert it back and you'll have your predictions.
And by the way, if you are predicting a class, you must change your model. LinearRegression() is a numerical predictor it can only predict numerical values.
Try to use SVC, LogisticRegression or any other classification model instead.

Can encode categorical data in train set but not in the test set

I need to encode the categorical values on my test set, somehow it throws TypeError: argument must be a string or number. I do not know why this happens because i could do it to my train set. I mean they're train/test feature set so they're exactly the same, what differentiates them is just the number of the rows of course. I do not know how to fix this, i have tried to use different LabelEncoder for each, but it still does not fix the error. Please someone help me.
For your information the categorical data is on the column 8th in both train and test features set
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
'4225':'large PLU sold',
'4770':'xlarge PLU sold'},
inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.fit_transform(X_test[:,8])
On the test set you should never use fit_transform, but only transform. And it seems that you're not applying the preprocessing you did on the training data to your test data, that is also a mistake.
EDIT
When you use fit_transform, for example SimpleImputer(strategy='most_frequent') on your training data, you're basically calculating the most frequent value, to input it in the rows containing nan. This is fine. If you do fit_transform on your test set what you're doing is cheating, because you're assuming to have lot of instances from which calculate the most frequent value (whereas instead you might be predicting only one instance). The right thing to do is to input the missing data using the most frequent value you found on the training set. This is done by using only transform. The same logic apply to every other fit_transform / transform you can find in sklearn, for example when applying PCA or a CountVectorizer.

How to train SVM model in sklearn python by input CSV file?

I have used sklearn scikit python for prediction. While importing following package
from sklearn import datasets and storing the result in iris = datasets.load_iris() , it works fine to train model
iris = pandas.read_csv("E:\scikit\sampleTestingCSVInput.csv")
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model Algorithm :
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But while importing CSV file to train model , creating new array for target_names also , I am facing some error like
ValueError: Found input variables with inconsistent numbers of
samples: [150, 4]
My CSV file has 5 Columns in which 4 columns are input and 1 column is output. Need to fit model for that output column.
How to provide argument for fit model?
Could anyone share the code sample to import CSV file to fit SVM model in sklearn python?
Since the question was not very clear to begin with and attempts to explain it were going in vain, I decided to download the dataset and do it for myself. So just to make sure we are working with the same dataset iris.head() will give you or something similar, a few names might be changed and a few values, but overall strucure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays, to do that use
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris[['Target']].values
Now since Y is categorical Data, You will need to one hot encode it using sklearn's LabelEncoder and scale the input X to do that use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
Edit Just in case you don't know where the functions are here are the import statements
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Low K-fold accuracy for First Fold

I created a text classifier, and I'm trying to utilize K-fold cross-validation. I can't figure out why my first fold has an accuracy of 55% while my other folds are overfitting at 99-100% accuracy. My data set is a 5109x2 dataframe with columns df["Features"] as the features and df["Labels"] as labels. df["Features"] has descriptors based off some product mapping keywords and are separated by commas as seen here: Features. I'm creating indicator variables based off the sub-features through countvectorizer(). This is the result of a 5-fold cv. Result
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
def train(classifier, X, y):
count_vect=CountVectorizer(min_df = 1,lowercase = False)
y=pd.Series(y)
X=count_vect.fit_transform(X)
y=count_vect.fit_transform(y)
kf=KFold(n_splits=5,shuffle=True)
k_fold=pd.Series(np.zeros(5))
for i,(train_index,test_index) in enumerate(kf.split(X)):
print("Train",train_index, "Test",test_index)
X_train,X_test=X[train_index],X[test_index]
y_train,y_test=y[train_index],y[test_index]
k_fold[i]=(print("For K=",i+1," Classifier accuracy= ",classifier.fit(X_train, y_train).score(X_test, y_test), "n = ",X_train.shape[0]))
train(MLPClassifier(hidden_layer_sizes= (100,),activation='relu',random_state=2, max_iter=100, warm_start=True),df["Features"], df["Labels"])
It is entirely possible that this is just a result of the data. There is no reason to implement this by hand, scikit-learn has the functionality built in. If you want to test your implementation, try running the experiment using the shuffle parameter off to see if you get the same results.
It is best practice to shuffle your data anyway prior to running cross validation.

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column are 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:
using sklearn.cross_validation.cross_val_score
using sklearn.cross_validation.train_test_split
I am getting different results when I do what I think is pretty much the same exact thing. To exemplify, I run a two-fold cross validation using the two methods above, as in the code below.
import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
#read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:,0:18]
y = data.iloc[:,19]
depth = 5
maxFeat = 3
result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False), X, y, scoring='roc_auc', cv=2)
result
# result is now something like array([ 0.66773295, 0.58824739])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)
RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth, max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain,ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:,1:2])
print auc #something like 0.83
RFModel.fit(xtest,ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:,1:2])
print auc #also something like 0.83
My question is:
why am I getting different results, ie, why is the AUC (the metric I am using) higher when I use train_test_split?
Note:
When I using more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.
In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.
Thanks for your help!
When using cross_val_score, you'll frequently want to use a KFolds or StratifiedKFolds iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score will not randomize your data, which can produce odd results like this if you're data isn't random to begin with.
The KFolds iterator has a random state parameter:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like what you described are usually a result of a lack of randomnesss in the train/test set.
The answer is what #KCzar pointed. Just want to note the easiest way I found to randomize data(X and y with the same index shuffling) is as following:
p = np.random.permutation(len(X))
X, y = X[p], y[p]
source: Better way to shuffle two numpy arrays in unison

Categories

Resources