I have a mall dataset and I ran k-means with k = 5. After that I ran linear regression, and I wanted to print my predicted value of Y to compare it with the actual value of Y. Printing the actual value was very easy, but I keep getting an error when I try to print the predicted Y. To print the predicted value I used df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}), but I get the error ValueError: array length 35 does not match index length 18.
code:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('D:\Mall_Customers.csv', usecols=['Age', 'Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=5, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
df0 = df[df.index.isin(mydict[0].tolist())]
Y = df0['Spending Score (1-100)']
X = df0[[ 'Annual Income (k$)','Age']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, Y, test_size = 0.5, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
r_sq = regressor.score(X, Y)
print('coefficient of determination:', r_sq)
print('intercept:', regressor.intercept_)
print('slope:', regressor.coef_)
y_pred = regressor.predict(X)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)
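A minimal sketch of one way to line the two columns up, assuming the variables defined above: y_test comes from the 50% split, so the predictions passed to the DataFrame have to be made on X_test rather than on all of X, otherwise the lengths (35 vs. 18) cannot match.
# Sketch: predict only on the held-out rows so 'Actual' and 'Predicted' have the same length
y_pred_test = regressor.predict(X_test)
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_test})
print(comparison)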
I have two arrays that differ in places when predicting handwritten digits. How would I calculate the percentage accuracy for each digit?
Imports:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
Loading data:
digits = load_digits()
X = digits.data
y = digits.target
Splitting the data into train and test sets:
#Perform test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Logistic regression:
#Create a logistic regression object
clf = LogisticRegression(random_state=0,penalty='none')
#Fit model to data
clf.fit(X_train,y_train)
#Print the coefficients
b0 = clf.intercept_[0]
b1 = clf.coef_[0,0]
print('beta_0 =', b0)
print('beta_1 =', b1)
#Calculate the test error rate
yp = clf.predict(X_test)
err = (yp!=y_test).mean()
print('Error rate = {}'.format(err))
I want to calculate the percentage error for each digit, since yp doesn't match y_test everywhere.
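One rough sketch of how a per-digit error rate could be computed from yp and y_test above (plain NumPy masking, nothing specific to logistic regression):
import numpy as np

# Sketch: error rate per digit, assuming yp and y_test from the code above
for digit in np.unique(y_test):
    mask = (y_test == digit)
    digit_err = (yp[mask] != y_test[mask]).mean()
    print('digit {}: error rate {:.3f}'.format(digit, digit_err))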
I have a mall customers dataset from Kaggle: 200 customers with 5 features (CustomerID, gender, age, annual income and spending score). Before running regression I first ran k-means to cluster my data, with K = 6, using spending score (dependent variable), annual income (independent variable) and age (independent variable). I then ran multiple linear regression for each cluster separately and printed my predicted values and my actual values, and there are far more predicted values than actual values. There were 34 predicted y values (exactly the number of points in my cluster) but only 9 actual y values. Why don't all of my actual values print?
code:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('D:\Mall_Customers.csv', usecols=['Age', 'Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=6, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)
df0 = df[df.index.isin(mydict[0].tolist())]
Y = df0['Spending Score (1-100)']
X = df0[[ 'Annual Income (k$)','Age']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, Y, test_size = None, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X)
print('predicted:', y_pred, sep='\n')
print('actual', y_test, sep='\n')
You've called train_test_split with a test_size of None, which means your test set defaults to 25% of the data. If there are 34 items in your cluster, that's 8.5, so the 9 actual values you're seeing make sense. y_pred is larger because you computed it as regressor.predict(X), which gives you predictions for all of the data, not just the test set.
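As a small, hedged sketch of the fix described above (using the variables from the posted code), predicting on X_test makes the two printouts the same length:
# Sketch: predictions for the held-out rows only, so they match the 9 actual test values
y_pred = regressor.predict(X_test)
print('predicted:', y_pred, sep='\n')
print('actual:', y_test, sep='\n')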
I've got a dataset containing a lot of missing values (NaN). I want to use linear or multiple linear regression in Python to fill in all the missing values. You can find the dataset here: Dataset
I have used f_regression(X_train, Y_train) to select which features I should use.
First of all I converted df['country'] to dummy variables, then used the important features in the regression, but the results are not good.
I have defined the following function to select features:
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.model_selection import train_test_split

def select_features(target, df):
    '''Get dataset and target and print which features are important.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)[::1]
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])), columns=X_train.columns[inds],
                           index=['f_values', 'p_values']).iloc[:, :15]
    print(results)
And I have defined the following function to predict the missing values in the target column:
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

def train(target, features, df, deg=1):
    '''Get dataset, target and features and predict nan in target column'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X = X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)
    X_test_n = pol.fit_transform(X_test)
    Y_predtrain = reg.predict(X_train_n)
    print('train', r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test', r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.fit_transform(X_val)
    X_val_n.shape, X_train_n.shape, X_test_n.shape
    Y_valpred = reg.predict(X_val_n)
    print('val', r2_score(Y_val, Y_valpred))
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.fit_transform(X_new)
    Y_new = df_dummies.loc[X_new.index, target]
    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    Y_new.head()
    return Y_new, X_names, X_new.index
Then I am using these functions to fill the NaN values for features with p_values < 0.05.
But I am not sure whether this is a good approach.
With this approach, many missing values remain unpredicted.
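For context, here is a rough usage sketch of how the values returned by train could be written back into the dataframe; the column names are purely hypothetical placeholders, not columns from the actual dataset:
# Hypothetical usage sketch: 'life_expectancy', 'gdp' and 'country' are placeholder names.
Y_new, X_names, idx = train('life_expectancy', ['gdp', 'country'], df, deg=1)
df.loc[idx, 'life_expectancy'] = Y_new  # fill only the rows whose target was NaN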
I'm not able to see my resulting accuracy score in my final graph, and I get warnings that precision/recall are ill-defined even though I don't see any 0's in the results.
I'm using this yeast data: https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data
I've tried making the whole set my training set by making train_frac=1.
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv("<my_dir>",names = ['sample','mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc','site'])
df=df.drop(columns=['sample'])
model_type = GaussianNB()
target = 'site'
train_frac = 0.5
Y = df[target]
df2 = df.drop(columns=[target])
# df2.columns now contains every feature column except 'site'
X = df[df2.columns[:]]
def naive_split(X, Y, n):
    # Take first n lines of X and Y for training and the rest for testing
    X_train = X[:n]
    X_test = X[n:]
    Y_train = Y[:n]
    Y_test = Y[n:]
    return (X_train, X_test, Y_train, Y_test)

def train_model(n=int(train_frac*df.shape[0])):
    X_train, X_test, Y_train, Y_test = naive_split(X, Y, n)
    clf = model_type
    clf = clf.fit(X_train, Y_train)
    return (X_test, Y_test, clf)
X_test, Y_test, clf = train_model()
import sklearn.metrics as metrics
from sklearn import model_selection
sizes = np.arange(0.98,0.01, -0.02)
result = {}
for size in sizes:
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
        X, Y, test_size=size, random_state=200)
    clf = model_type
    clf = clf.fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)
    precision = metrics.precision_score(Y_test, clf.predict(X_test), average='weighted')
    recall = metrics.recall_score(Y_test, clf.predict(X_test), average='weighted')
    result[len(Y_train)] = (score, precision, recall)
result = pd.DataFrame(result).transpose()
result.columns = ['Accuracy','Precision', 'Recall']
result.plot(marker='*', figsize=(15,5))
plt.title('Metrics measures using random train/test splitting')
plt.xlabel('Size of training set')
plt.ylabel('Value');
Instead, I get the following warnings when I expect it to run without them:
C:\Users\<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.'precision', 'predicted', average, warn_for)
C:\Users\<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1137: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. 'recall', 'true', average, warn_for)
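As a quick diagnostic sketch (assuming the loop variables above), it can help to check which classes never appear in the predictions for a given split, since that is exactly what the warning is about:
# Sketch: classes present in the test labels but never predicted for this split
predicted = clf.predict(X_test)
print('classes with no predicted samples:', set(Y_test) - set(predicted))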
I'm trying to make a heart disease prediction program using Naive Bayes. When I finished the classifier, cross validation showed a mean accuracy of 80%. However, when I try to make a prediction on a given sample, the prediction is often wrong. The dataset is the heart disease dataset from the UCI repository; it contains 303 samples and there are two classes, 0: healthy and 1: ill. When I make a prediction on a sample from the dataset, it doesn't predict the true value, except for very few samples. Here is the code:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler
class Predict:
    def Read_Clean(self, dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        df = pd.read_csv(dataset, names=header_row)
        df = df.replace('[?]', np.nan, regex=True)
        df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Train_Test_Split_data(self, dataset):
        Y = dataset['Num'].apply(lambda x: 1 if x > 0 else 0)
        X = dataset.drop('Num', axis=1)
        validation_size = 0.20
        seed = 42
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
        return X_train, X_test, Y_train, Y_test

    def Scaler(self, X_train, X_test):
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        return X_train, X_test

    def Cross_Validate(self, clf, X_train, Y_train, cv=5):
        scores = cross_val_score(clf, X_train, Y_train, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))
        return score, scores

    def Fit_Score(self, clf, X_train, Y_train, X_test, Y_test, label='x'):
        clf.fit(X_train, Y_train)
        fit_score = clf.score(X_train, Y_train)
        pred_score = clf.score(X_test, Y_test)
        print("%s: fit score %.5f, predict score %.5f" % (label, fit_score, pred_score))
        return pred_score

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        data = self.Read_Clean(dataset_path)
        X_train, X_test, Y_train, Y_test = self.Train_Test_Split_data(data)
        X_train, X_test = self.Scaler(X_train, X_test)
        self.NB = GaussianNB()
        self.Fit_Score(self.NB, X_train, Y_train, X_test, Y_test, label='NB')
        self.Cross_Validate(self.NB, X_train, Y_train, 10)
        return self.ReturnPredictionValue(self.NB, sample)
When I run:
if __name__ == '__main__':
    sample = [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0]
    p = Predict()
    print "Prediction value: {}".format(p.PredictionMain(sample))
The result is:
NB: fit score 0.84711, predict score 0.83607
CV scores mean: 0.8000
Prediction value: 1
I get 1 instead of 0 (this sample is already one of the dataset samples).
I did this for more than one sample from the dataset and I get the wrong result most of the time; it's as if the accuracy were not 80%!
Any help would be appreciated.
Thanks in advance.
Edit:
Problem solved using Pipeline. The final code is:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
class Predict:
    def __init__(self):
        self.X = []
        self.Y = []

    def Read_Clean(self, dataset):
        header_row = ['Age', 'Gender', 'Chest_Pain', 'Resting_Blood_Pressure', 'Serum_Cholestrol',
                      'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
                      'Exercise_Induced_Angina', 'OldPeak',
                      'Slope', 'CA', 'Thal', 'Num']
        df = pd.read_csv(dataset, names=header_row)
        df = df.replace('[?]', np.nan, regex=True)
        df = pd.DataFrame(Imputer(missing_values='NaN', strategy='mean', axis=0)
                          .fit_transform(df), columns=header_row)
        df = df.astype(float)
        return df

    def Split_Dataset(self, df):
        self.Y = df['Num'].apply(lambda x: 1 if x > 0 else 0)
        self.X = df.drop('Num', axis=1)

    def Create_Pipeline(self):
        estimators = []
        estimators.append(('standardize', StandardScaler()))
        estimators.append(('bayes', GaussianNB()))
        model = Pipeline(estimators)
        return model

    def Cross_Validate(self, clf, cv=5):
        scores = cross_val_score(clf, self.X, self.Y, cv=cv, scoring='f1')
        score = scores.mean()
        print("CV scores mean: %.4f " % (score))

    def Fit_Score(self, clf, label='x'):
        clf.fit(self.X, self.Y)
        fit_score = clf.score(self.X, self.Y)
        print("%s: fit score %.5f" % (label, fit_score))

    def ReturnPredictionValue(self, clf, sample):
        y = clf.predict([sample])
        return y[0]

    def PredictionMain(self, sample, dataset_path='dataset/processed.cleveland.data'):
        print "dataset: " + dataset_path
        data = self.Read_Clean(dataset_path)
        self.Split_Dataset(data)
        self.model = self.Create_Pipeline()
        self.Fit_Score(self.model, label='NB')
        self.Cross_Validate(self.model, 10)
        return self.ReturnPredictionValue(self.model, sample)
Now making a prediction on the same sample as in the question returns [0], which is the true value. Actually, by running the following method:
def CheckTrue(self):
    clf = self.Create_Pipeline()
    out = cross_val_predict(clf, self.X, self.Y)
    p = [out == self.Y]
    c = 0
    for i in range(303):
        if p[0][i] == True:
            c += 1
    print "Samples with true values: {}".format(c)
I get 249 true samples using the pipeline code, whereas I got only 150 before.
You're not applying the StandardScaler to the sample. The classifier expects scaled data because it was trained on the output of StandardScaler.transform, but the sample is not scaled the same way as the training data.
It is easy to make mistakes like this when combining multiple steps (scaling, preprocessing, classification) manually. To avoid such issues it is a good idea to use a scikit-learn Pipeline.
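For completeness, a minimal sketch of the manual alternative, assuming X_train, X_test, Y_train and sample as in the original (non-pipeline) code: keep a reference to the fitted scaler and apply it to the sample before predicting.
# Sketch of the manual fix; variable names assume the original question's code
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
nb = GaussianNB().fit(X_train_scaled, Y_train)
sample_scaled = scaler.transform([sample])  # scale the sample exactly like the training data
print(nb.predict(sample_scaled))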