I am using the original version of the following dataset, obtained from: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
I want to apply logistic regression to classify the samples in that dataset. My code is the following:
import numpy as np
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
data = np.genfromtxt("breast-cancer-wisconsin.data",delimiter=",")
X = data[:,1:-1]
X[X == '?'] = '-999999'
X = X.astype(int)
y = data[:, -1].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lg=linear_model.LogisticRegression(n_jobs = 10)
lg.fit(X_train,y_train)
predictions = lg.predict(X_test)
cm=confusion_matrix(y_test,predictions)
print(cm)
score = lg.score(X_test, y_test)
print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
I have deleted the first column because it is only the ID, and replaced the ? characters with a large-magnitude number so that they would be treated as outliers. The problem appears when I compare my results with the ones obtained on this page:
https://anujdutt9.github.io/ML_LogRSklearn.html
Because I am obtaining an accuracy of:
Accuracy: 0.34
and on the link mentioned before the accuracy was approximately 95%.
The results of my confusion matrix are also poor, for example, I obtain:
[[ 1 92]
[ 0 47]]
What is wrong with my model?
Thanks
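Try this:
X[X == '?'] = np.nan  # convert ? to NaN
Then impute the mean value:
from sklearn.impute import SimpleImputer  # replaces the old Imputer class, which was removed in scikit-learn 0.22
imputer = SimpleImputer()
transformed_X = imputer.fit_transform(X)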
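A minimal sketch of the imputation step, assuming a recent scikit-learn where SimpleImputer is available. The three rows are an illustrative excerpt in the dataset's layout, not the full file. Note that np.genfromtxt already reads a non-numeric field such as ? as NaN, so no string comparison is needed before imputing.

```python
import io

import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative rows in the breast-cancer-wisconsin.data layout:
# id, nine features, class label. One feature is missing ("?").
raw = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
)

data = np.genfromtxt(raw, delimiter=",")  # "?" cannot be parsed, so it becomes NaN
X = data[:, 1:-1]                         # drop the id column and the label
y = data[:, -1].astype(int)

n_missing = int(np.isnan(X).sum())
print("missing values:", n_missing)

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

On the full file, the imputed matrix would then go through train_test_split and LogisticRegression as in the question.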
In Python, I have built a small multiple linear regression model to explain house prices in areas based on other variables (all of which are percentages multiplied by 100), such as the percentage of people with bachelor's degrees in an area and the percentage of people who work from home. I have done this in R and it works fine, but I am new to Python ML. I have shown the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data, where avgincome, PctSingleDetached and PctDrivetoWork are X, and AvgHousingPrice is the Y.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
sample data:
   avgincome  PctSingleDetached  PctDrivetoWork  AvgHousingPrice
0    44388.0          61.528497       81.151832           448954
1    40650.0          54.372197       77.882798           349758
2    43350.0          68.393782       79.553265           428740
X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # This is an object of the imputer class. It replaces missing values with the column mean.
# Instructs to find missing and replace it with mean
# Fit method in SimpleImputer will connect imputer to our matrix of features
imputer.fit(X[:,:]) # All feature columns are numeric here, so every column is included
X[:, :] = imputer.transform(X[:,:])
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
# X = np.array(ct.fit_transform(X))
print(X)
print(Y)
## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)
### Feature Scaling ###
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # this does STANDARDIZATION for you. See data standardization formula
X_train[:, 0:] = sc.fit_transform(X_train[:,0:])
# fit learns each column's mean and standard deviation, transform applies them. fit_transform does both on the training set
X_test[:, 0:] = sc.transform(X_test[:, 0:]) # reuse the training statistics on the test set
print(X_train)
print(X_test)
## Training ##
from sklearn.linear_model import LinearRegression
regressor = LinearRegression() # Ordinary least squares: fits one coefficient per feature, with no variable selection
regressor.fit(X_train, Y_train)
### Predicting Test Set results ###
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2) # Display any numerical value with only 2 digits after the decimal point
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1 )), axis=1)) # predictions in the first column, actual values in the second
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, y_pred)
print(mse)
OUTPUT:
[[489066.76 300334. ]
[227458.2 200352. ]
[928249.59 946729. ]
[339032.27 350116. ]
[689668.21 600322. ]
[489179.58 577936. ]]
...
...
MSE = 2375985640.8102403
You can calculate the MSE yourself to check whether something is wrong. In my opinion the result you obtained is consistent. I built a simple my_mse function to check the value returned by sklearn, using your example data:
from sklearn.metrics import mean_squared_error
list_ = [[489066.76, 300334.],
[227458.2, 200352. ],
[928249.59, 946729. ],
[339032.27, 350116. ],
[689668.21, 600322. ],
[489179.58, 577936. ]]
y_pred = [row[0] for row in list_]  # first column of the printed output: predictions
y_true = [row[1] for row in list_]  # second column: actual values
mse = mean_squared_error(y_true, y_pred)
print(mse)
# 8779930962.14985
def my_mse(y_true, y_pred):
    diff = 0
    for couple in zip(y_true, y_pred):
        diff += pow(couple[0] - couple[1], 2)
    return diff / len(y_true)
print(my_mse(y_true, y_pred))
# 8779930962.14985
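Remember that the MSE is the mean squared error: each error is squared before averaging, so the value is in squared units of the target and looks very large when the target is a house price.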
Whether your model is good or bad depends on your main objective. That said, I think your model may be performing poorly because it is a linear model. A more flexible model might handle the problem better and give better results.
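One way to read the MSE on the scale of the target is to take its square root (the RMSE). The sketch below only reuses the MSE value reported in the question. Comparing it with the printed house prices is an informal sanity check, not a formal test of model quality.

```python
import math

# MSE value reported in the question (units: price squared)
mse = 2375985640.8102403

# The square root brings the error back to price units
rmse = math.sqrt(mse)
print("RMSE: %.2f" % rmse)

# Rough sense of scale against one of the printed actual prices
example_price = 350116.0
print("RMSE as a share of that price: %.1f%%" % (100 * rmse / example_price))
```

So the typical error is in the tens of thousands, against prices in the hundreds of thousands.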
I have a mall customers dataset from Kaggle: 200 customers with 5 features (CustomerID, gender, age, annual income and spending score). Before running regression I first ran k-means to cluster my data, with K = 6, on spending score (dependent variable), annual income (independent variable) and age (independent variable). I then ran multiple linear regression for each cluster separately and printed my predicted and actual values, but there are far more predicted values than actual values. There were 34 predicted y values (exactly the number of values in my cluster) but only 9 actual y values. Why don't all of my actual values print?
code:
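import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# raw string so the backslash in the Windows path is not treated as an escape
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols = ['Age','Spending Score (1-100)', 'Annual Income (k$)'])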
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=6, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
dictlist = []
for key, value in mydict.items():
    temp = [key,value]
    dictlist.append(temp)
df0 = df[df.index.isin(mydict[0].tolist())]
Y = df0['Spending Score (1-100)']
X = df0[[ 'Annual Income (k$)','Age']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, Y, test_size = None, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X)
print('predicted:', y_pred, sep='\n')
print('actual', y_test, sep='\n')
The code above doesn't show how you computed y_pred. Also, you've called train_test_split with a test_size of None, which means your test set defaults to 25% of the data. If there are 34 items in your cluster, that would be 8.5, which is rounded up, so the 9 actual values you're seeing make sense. To understand why y_pred has more values than that we'd have to see how you computed it, but I'm guessing you did something like regressor.predict(X), which would give you predictions on all of the data, not just the test set.
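The counts described above can be reproduced with a small sketch. The 34 rows here are random placeholder values standing in for one cluster, not the mall data. Only the array lengths matter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder cluster with 34 rows and 2 features (random values, not the real data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(34, 2))
Y = rng.uniform(1, 100, size=34)

# test_size=None (with train_size unset) falls back to a 25% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=None, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)

y_pred_all = regressor.predict(X)        # one prediction per row in the cluster
y_pred_test = regressor.predict(X_test)  # one prediction per test row

print(len(y_test), len(y_pred_all), len(y_pred_test))
```

Predicting on X_test rather than X keeps the predicted and actual arrays the same length, so they can be compared row by row.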
I am working through Google's Machine Learning videos and completed a program that uses a database storing info about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance

def euc(a, b):
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)
my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
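A small sketch of the point about odd row counts, using placeholder arrays of zeros rather than the iris data. With 150 rows a 50/50 split gives two equal halves. With 151 rows the two halves differ by one row, while the number of columns is unchanged.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrices: 4 columns, even and odd row counts
x_even, y_even = np.zeros((150, 4)), np.zeros(150)
x_odd, y_odd = np.zeros((151, 4)), np.zeros(151)

xe_train, xe_test, _, _ = train_test_split(x_even, y_even, test_size=.5)
xo_train, xo_test, _, _ = train_test_split(x_odd, y_odd, test_size=.5)

print(xe_train.shape, xe_test.shape)
print(xo_train.shape, xo_test.shape)
```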
It appears that you are coding k-nearest neighbours from scratch using the Euclidean distance.
In your line x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5), you are splitting the data into 50% train and 50% test. sklearn's train_test_split splits the data by rows, so the number of features (columns) is the same in both parts. Hence (75, 4) is the number of rows followed by the number of features, first for the train set and then for the test set.
Now, the accuracy score of 0.96 basically means that, of the 75 rows in your test set, 96% (72 rows) are predicted correctly.
This compares the true labels of your test set with the predicted labels (calculated from prediction = my_classifier.predict(x_test)):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP and TN are the numbers of correct predictions, while TP + TN + FP + FN sums to 75 (the total number of rows you are testing).
Note: When performing a train-test split it's usually a good idea to split the data 80/20 instead of 50/50, so that the model has more data to learn from.
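The accuracy described above can be checked by counting matches by hand. This sketch swaps in sklearn's KNeighborsClassifier with n_neighbors=1 in place of ScrappyKNN, and fixes random_state=0 only so the split is repeatable. Both are assumptions, not part of the original program.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

# 1-nearest-neighbour classifier, standing in for ScrappyKNN
clf = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)
prediction = clf.predict(x_test)

# Accuracy = correct predictions / total predictions
n_correct = int(np.sum(prediction == y_test))
manual_accuracy = n_correct / len(y_test)

print(x_train.shape, x_test.shape)
print(n_correct, "of", len(y_test), "correct")
print(manual_accuracy, accuracy_score(y_test, prediction))
```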
# Scale/normalize independent variables
X = StandardScaler().fit_transform(X)
# Split data into train and test sets at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=42)
gpc= GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train,y_train)
y_proba=gpc.predict_proba(X_test)
# classify as 1 if prediction probability is greater than or equal to 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
The above code runs as expected. However, in order to explain the model with a statement like 'a 1 unit change in Beta1 will result in a .7% improvement in the probability of success', I need to be able to see the theta. How do I do this?
Thanks for the assist. BTW, this is for a homework assignment
Very nice question. You can indeed access the thetas; however, the documentation does not make it clear how to do this.
Use the following. Here I use the iris dataset.
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the iris data (3 classes)
X, y = load_iris(return_X_y=True)
# Scale/normalize independent variables
X = StandardScaler().fit_transform(X)
# Split data into train and test sets at 50% each
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .5, random_state=42)
gpc= GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs=-1)
gpc.fit(X_train,y_train)
y_proba=gpc.predict_proba(X_test)
# classify as 1 if prediction probability is greater than or equal to 15.8%
y_pred = [1 if x >= .158 else 0 for x in y_proba[:, 1]]
# thetas
gpc.kernel_.theta
Results:
array([7.1292252 , 1.35355145, 5.54106817, 0.61431805, 7.00063873,
1.3175175 ])
An example from the documentation that accesses the thetas can be found here.
Hope this helps.
It looks as if the theta value you are looking for is a property of the kernel object that you pass into the classifier. You can read more in this section of the sklearn documentation. You can access the log-transformed values of theta for the classifier kernel by using classifier.kernel_.theta where classifier is the name of your classifier object.
Note that the kernel object also has a method clone_with_theta(theta) that might come in handy if you're making modifications to theta.
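A minimal sketch of reading the fitted kernel hyperparameters, assuming a binary problem (the first two iris classes) and the same 1.0 * RBF(1.0) kernel. Because kernel_.theta holds log-transformed values, np.exp recovers the constant and the length scale. These are kernel hyperparameters rather than per-feature coefficients, so they may not support a "one unit change in a feature" statement the way logistic regression betas would.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import StandardScaler

# Keep only classes 0 and 1 so there is a single binary kernel
X, y = load_iris(return_X_y=True)
mask = y < 2
X = StandardScaler().fit_transform(X[mask])
y = y[mask]

gpc = GaussianProcessClassifier(1.0 * RBF(1.0), random_state=0)
gpc.fit(X, y)

log_theta = gpc.kernel_.theta    # log-transformed hyperparameters
hyperparams = np.exp(log_theta)  # back on the original scale

print(gpc.kernel_)               # fitted kernel, e.g. constant**2 * RBF(length_scale=...)
print(log_theta)
print(hyperparams)
```

For the product kernel used here, the first entry corresponds to the constant factor and the second to the RBF length scale.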
Problem: I need to train a classifier (in Matlab) to classify multiple levels of signal noise.
So I trained a multiclass SVM in Matlab using fitcecoc and obtained an accuracy of 92%.
Then I trained a multiclass SVM using sklearn.svm.SVC in Python, but however I fiddle with the parameters, I cannot achieve more than 69% accuracy.
30% of the data was held back and used to verify the training. The confusion matrices can be seen below.
Matlab confusion matrix
Python confusion matrix
If anyone has experience with svm.SVC multiclass training and can see a problem in my code, or has a suggestion, it would be greatly appreciated.
Python code:
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
#from sklearn import preprocessing
#### SET fitting parameters here
C = 100
gamma = 1e-8
#### SET WEIGHTS HERE
C0_Weight = 1*C
C1_weight = 1*C
C2_weight = 1*C
C3_weight = 1*C
C4_weight = 1*C
#####
X = np.genfromtxt('data/features.csv', delimiter=',')
Y = np.genfromtxt('data/targets.csv', delimiter=',')
print('feature data is of size: ' + str(X.shape))
print('target data is of size: ' + str(Y.shape))
# SPLIT X AND Y INTO TRAINING AND TEST SET
test_size = 0.3
X_train, x_test, Y_train, y_test = train_test_split(X, Y,
    test_size=test_size, random_state=0)
svc = svm.SVC(C=C, kernel='rbf', gamma=gamma, class_weight={0: C0_Weight,
    1: C1_weight, 2: C2_weight, 3: C3_weight, 4: C4_weight}, cache_size=1000)
svc.fit(X_train, Y_train)
scores = cross_val_score(svc, X_train, Y_train, cv=10)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Out = svc.predict(x_test)
np.savetxt("data/testPredictions.csv", Out, delimiter=",")
np.savetxt("data/testTargets.csv", y_test, delimiter=",")
# calculate accuracy in test data
Hits = 0
HitsOverlap = 0
for idx, val in enumerate(Out):
    Hits += int(y_test[idx] == Out[idx])
    # also count predictions that are off by one class
    HitsOverlap += (int(y_test[idx] == Out[idx])
        + int(y_test[idx] == (Out[idx] - 1))
        + int(y_test[idx] == (Out[idx] + 1)))
print("Accuracy in testset: ", Hits*100/(11595*test_size))
print("Accuracy in testset w. overlap: ", HitsOverlap*100/(11595*test_size))
For those curious how I got the parameters: they were found with GridSearchCV (which increased the accuracy from 40% to 69%).
Any help or suggestions would be greatly appreciated.
After much hair-pulling, I found the answer here: http://neerajkumar.org/writings/svm/
Once the inputs were scaled with StandardScaler(), svm.SVC produced better results than Matlab!
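A minimal sketch of the scaling fix, assuming sklearn's bundled wine dataset as a stand-in for the signal-noise features (which are not available here). Putting StandardScaler and SVC in one pipeline means the scaler is fitted only on the training folds during cross-validation. The C and gamma values below are defaults, not the tuned values from the question. After scaling, parameters such as gamma would likely need to be searched again.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in multiclass dataset whose features have very different ranges
X, y = load_wine(return_X_y=True)

# Scaling happens inside the pipeline, so each CV fold is scaled
# with statistics from its own training portion only
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

scores = cross_val_score(model, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

The same pipeline object can be passed to GridSearchCV, with parameter names prefixed by the step name (for example svc__C and svc__gamma).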