Python sklearn: GMM model produces bad results

I am trying to use a GMM model to classify the Iris data set. My model produces inconsistent results: some runs reach an accuracy of 90%, others 33% (and some even 0%). I am not sure whether the mistake is in the classification stage, in my preprocessing, or in the print statements for accuracy.
I have tried changing the number of iterations and the init_params when defining the classifier. I am using scikit-learn version 0.21.3 and Python 3.7.0.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.mixture import GaussianMixture as GMM
def data_processor(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            line = line.strip().split(',')
            if line[-1] == 'Iris-setosa':
                line[-1] = 0
            elif line[-1] == 'Iris-versicolor':
                line[-1] = 1
            elif line[-1] == 'Iris-virginica':
                line[-1] = 2
            data.append(line)
    return np.array(data, dtype=float)
x, y = data_processor('iris.txt')[:, :-1], data_processor('iris.txt')[:, -1]
# split data into 5 chunks, 80% used for training and 20% for testing
skf = StratifiedKFold(n_splits=5)
train_index, test_index = next(iter(skf.split(x, y)))
x_train, y_train = x[train_index], y[train_index]
x_test, y_test = x[test_index], y[test_index]
# calculate number of components in data set using number of classes
num_classes = len(np.unique(y_train))
# build classifier and fit model
classifier = GMM(n_components=num_classes, covariance_type='full',
                 max_iter=200)
classifier.fit(x_train)
# Make predictions and print accuracy
y_train_pred = classifier.predict(x_train)
accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
print('Accuracy on training data =', accuracy_training)
y_test_pred = classifier.predict(x_test)
accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
print('Accuracy on testing data =', accuracy_testing)
The data set has 150 rows in total, 50 per class. With different models on the same data set, I was previously getting an accuracy of around 90-93%.
The input file contains data in this format: 5.9,3.0,5.1,1.8,Iris-virginica
I have also tried to scale the data using:
from sklearn import preprocessing
x = preprocessing.scale(x)
The data now looks as follows:
x[0] = [-0.90068117 1.03205722 -1.3412724 -1.31297673]
y[0] = 0.0
However, this did not affect the accuracy.
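One thing worth checking: classifier.fit(x_train) above never sees y_train, so the component indices the GMM assigns are arbitrary and need not line up with the class labels 0, 1, 2. Below is a minimal diagnostic sketch (not part of the original post; it reuses the classifier, x_train, y_train, x_test, y_test and num_classes variables above) that remaps each component to the majority true class of the training points assigned to it before scoring:
import numpy as np

# Map each GMM component to the majority true class among the training points
# assigned to it, then score the test set with the remapped labels. If the
# remapped accuracy is stable and high, the swings above are a label-permutation
# artefact of the unsupervised fit rather than a preprocessing or printing bug.
train_assignments = classifier.predict(x_train)
mapping = {}
for component in range(num_classes):
    members = y_train[train_assignments == component].astype(int)
    mapping[component] = int(np.bincount(members).argmax()) if members.size else component

remapped = np.array([mapping[c] for c in classifier.predict(x_test)])
print('Accuracy after remapping =', np.mean(remapped == y_test) * 100)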

Related

Decision tree Regressor model: get the max_depth value of the model with the highest accuracy

Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
Predict the housing price for the first two samples of the X_test set and print them. (Hint: use the predict() function.)
Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
Evaluate each model's accuracy on the testing data set.
Hint: make use of a for loop.
Print the max_depth value of the model with the highest accuracy.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import numpy as np
np.random.seed(100)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
y_pred=dt_reg.predict(X_test[:2])
print(y_pred)
I want to print the max_depth value of the model with the highest accuracy, but Fresco Play does not accept my submission. Let me know what the error is.
max_reg = None
max_score = 0
t = ()
for m in range(2, 6):
    rf_reg = DecisionTreeRegressor(max_depth=m)
    rf_reg = rf_reg.fit(X_train, Y_train)
    rf_reg_score = rf_reg.score(X_test, Y_test)
    print(m, rf_reg_score, max_score)
    if rf_reg_score > max_score:
        max_score = rf_reg_score
        max_reg = rf_reg
        t = (m, max_score)
print(t)
If you wish to continue using the loop as you have, you can create another variable called 'best_max_depth' and update it with rf_reg.max_depth whenever your if-statement condition is met (i.e. that model is the best so far).
I suggest, however, that you look into GridSearchCV to extract the parameters of your best model and to loop through different parameter values (a rough sketch follows the code below).
max_reg = None
max_score = 0
best_max_depth = None
t = ()
for m in range(2, 6):
    rf_reg = DecisionTreeRegressor(max_depth=m)
    rf_reg = rf_reg.fit(X_train, Y_train)
    rf_reg_score = rf_reg.score(X_test, Y_test)
    print(m, rf_reg_score, max_score)
    if rf_reg_score > max_score:
        max_score = rf_reg_score
        max_reg = rf_reg
        best_max_depth = rf_reg.max_depth
        t = (m, max_score)
print(t)
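As a rough illustration of the GridSearchCV suggestion above (not the original answer's code; it assumes the same X_train/Y_train/X_test/Y_test split from the question):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Cross-validated search over max_depth values 2 to 5
param_grid = {'max_depth': [2, 3, 4, 5]}
grid = GridSearchCV(DecisionTreeRegressor(), param_grid=param_grid)
grid.fit(X_train, Y_train)

print(grid.best_params_['max_depth'])               # best depth by cross-validated score
print(grid.best_estimator_.score(X_test, Y_test))   # its score on the held-out test set
Note that this selects max_depth by cross-validation on the training data rather than by the score on the test set, which is usually the safer way to compare parameter values.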
Try this code -
myList = list(range(2, 6))
scores = []
for i in myList:
    dt_reg = DecisionTreeRegressor(max_depth=i)
    dt_reg.fit(X_train, Y_train)
    scores.append(dt_reg.score(X_test, Y_test))
print(myList[scores.index(max(scores))])

Build an ML model for text classification using logistic regression, BoW and CountVectorizer; getting a ValueError for a new input point

I am trying to classify a new text input point as positive, negative or neutral using a pre-built model and getting a ValueError. I have a new review that needs to be vectorized using CountVectorizer, but even after vectorizing, the shape doesn't change. It works fine on the train and test data sets.
used logistic regression
bag of words
count vectorization
# Create an object of class CountVectorizer
bow = CountVectorizer()
## X_train.values.shape
# Call the fit_transform method on training data
X_train = bow.fit_transform(X_train_raw.values)
# Call the transform method on the test dataset
X_test = bow.transform(X_test_raw.values)
# perform column standardization
std = StandardScaler(with_mean=False)
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
start = time.time()
# creating list of C
C_values = np.linspace(0.1,1,10)
cv_scores = [] # empty list that will hold cv scores
# Try each value of C in the loop below
for c in C_values:
    # Create an object of the class LogisticRegression with balanced class weights
    clf = LogisticRegression(C=c, class_weight='balanced')
    # perform 5-fold cross validation
    # It returns the cv accuracy for each fold in a list
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    # Store the mean of the accuracies from all the 5 folds
    cv_scores.append(scores.mean())
# calculate misclassification error from accuracy (error = 1 - accuracy)
cv_error = [1 - x for x in cv_scores]
# optimal (best) C is the one for which error is minimum (or accuracy is maximum)
optimal_C = C_values[cv_error.index(min(cv_error))]
print('\nThe optimal alpha is', optimal_C)
end = time.time()
print("Total time in minutes = ", (end-start)/60)
clf = LogisticRegression(C = optimal_C)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("Accuracy =", acc)
confusion_matrix(y_test, y_pred)
def predict_new(X):
    # X1 = [[X]]
    X2 = [X]
    X4 = pd.Series(X2)
    X3, los = data_cleaning(X4)
    # print(X3.values)
    X_t = bow.transform(X3)
    print(X_t.shape)
    # X_t = std.transform(X_t)
    # pred = clf.predict(X_t)
    pred = 0
    if pred == 1:
        print("Positive")
    elif pred == -1:
        print("Negative")
    else:
        print("Neutral")
predict_new('the app is good')
I expected the output to be positive, negative or neutral.
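As a point of comparison, here is a small self-contained sketch (not from the original post) of how CountVectorizer.transform behaves on a single new document; the column count of the output always matches the vocabulary learned by fit_transform:
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['the app is great', 'the app keeps crashing', 'it is okay']
bow = CountVectorizer()
X_train = bow.fit_transform(train_docs)   # shape: (3, vocabulary size)

# transform() expects an iterable of raw strings, not a single bare string
new_doc = bow.transform(['the app is good'])
print(X_train.shape, new_doc.shape)       # same number of columns in both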

sklearn Naive Bayes in python

I have trained a classifier on 'Rocks and Mines' dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data)
When calculating the accuracy score it always seems to be perfectly accurate (the output is 1.0), which I find hard to believe. Am I making a mistake, or is Naive Bayes really this powerful?
import urllib.request
import pandas as pd
from sklearn.naive_bayes import GaussianNB

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
data = urllib.request.urlopen(url)
df = pd.read_csv(data)
# replace R and M with 1 and 0
m = len(df.iloc[:, -1])
Y = df.iloc[:, -1].values
y_val = []
for i in range(m):
    if Y[i] == 'M':
        y_val.append(1)
    else:
        y_val.append(0)
df = df.drop(df.columns[-1], axis = 1) # dropping column containing 'R', 'M'
X = df.values
from sklearn.model_selection import train_test_split
# initializing the classifier
clf = GaussianNB()
# splitting the data
train_x, test_x, train_y, test_y = train_test_split(X, y_val, test_size = 0.33, random_state = 42)
# training the classifier
clf.fit(train_x, train_y)
pred = clf.predict(test_x) # making a prediction
from sklearn.metrics import accuracy_score
score = accuracy_score(pred, test_y)
# printing the accuracy score
print(score)
The X is the input and y_val is the output (I have converted 'R' and 'M' into 0's and 1's)
This is because of the random_state argument inside the train_test_split() function.
When you set random_state to an integer, sklearn ensures that your data sampling is constant.
That means that every time you run it with the same random_state you will get the same result; this is expected behavior.
Refer to the docs for further details.
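As a rough illustration of that point (not part of the original answer; it reuses the X and y_val variables from the question), varying the split or averaging over cross-validation folds shows whether the 1.0 is specific to one particular split:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Try several different splits instead of a single fixed one
for seed in (0, 1, 42):
    tr_x, te_x, tr_y, te_y = train_test_split(X, y_val, test_size=0.33, random_state=seed)
    print(seed, GaussianNB().fit(tr_x, tr_y).score(te_x, te_y))

# Or average the accuracy over 5 cross-validation folds
print(cross_val_score(GaussianNB(), X, y_val, cv=5).mean())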

How to loop over non-scaled and scaled data in a Python for loop

I have the following code that fits a model on the iris data using different modeling techniques. How can I add a second step to this process so I can demonstrate the improvement from using scaled rather than non-scaled data?
I don't need to run the scale transform outside of the loop; I was just having a lot of issues converting between a pandas DataFrame and a NumPy array and back again.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
sc = StandardScaler()
X_train_scale = sc.fit_transform(X_train)
X_test_scale = sc.transform(X_test)
numFolds = 10
kf = KFold(len(y_train), numFolds, shuffle=True)
# These are "Class objects". For each Class, find the AUC through
# 10 fold cross validation.
Models = [LogisticRegression, svm.SVC]
params = [{},{}]
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
So ideally my output would be:
CV standard accuracy score of LogisticRegression: 0.683333
CV scaled accuracy score of LogisticRegression: 0.766667
CV standard accuracy score of SVC: 0.766667
CV scaled accuracy score of SVC: 0.783333
It seems like this is unclear: I am trying to loop through scaled and unscaled data, similar to how I am looping through the different ML algorithms.
I wanted to follow up on this. I was able to do it by creating a pipeline and using GridSearchCV:
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
param_grid = [{
    'scale': [None, StandardScaler()],
    'clf': [SVC(), LogisticRegression()]}]
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, verbose=1)
In the end this got me the results I wanted, and I was able to easily compare scaling against not scaling.
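For reference, a rough sketch of how the comparison could be read out of that grid search (not part of the original answer; it assumes the pipe, param_grid and grid_search objects above together with the X_train/y_train split from the question):
import pandas as pd

# Fit every (scaler, classifier) combination and tabulate the CV scores
grid_search.fit(X_train, y_train)
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_scale', 'param_clf', 'mean_test_score']])
print(grid_search.best_params_, grid_search.best_score_)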
try this:
from __future__ import division

previous_accuracy = None
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X_train[train_indices]; train_Y = y_train[train_indices]
        test_X = X_train[test_indices]; test_Y = y_train[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("CV accuracy score of {0}: {1}".format(Model.__name__, round(accuracy, 6)))
    # added to your code
    if previous_accuracy is not None:
        improvement = (accuracy / previous_accuracy) - 1
        print("CV accuracy score improved by", improvement)
    previous_accuracy = accuracy
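Since the question is specifically about looping over the two representations, here is a minimal sketch of an alternative (not from the original answers; it assumes the X_train, X_train_scale, y_train, kf, Models, params and numFolds variables defined above) that adds the data sets as an inner loop:
datasets_to_try = [('standard', X_train), ('scaled', X_train_scale)]

for param, Model in zip(params, Models):
    for label, data in datasets_to_try:
        total = 0
        for train_indices, test_indices in kf:
            reg = Model(**param)
            reg.fit(data[train_indices], y_train[train_indices])
            predictions = reg.predict(data[test_indices])
            total += accuracy_score(y_train[test_indices], predictions)
        accuracy = total / numFolds
        print("CV {0} accuracy score of {1}: {2}".format(label, Model.__name__, round(accuracy, 6)))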

Sklearn SVM vs Matlab SVM

Problem: I need to train a classifier (in Matlab) to classify multiple levels of signal noise.
So I trained a multi-class SVM in Matlab using fitcecoc and obtained an accuracy of 92%.
Then I trained a multi-class SVM using sklearn.svm.SVC in Python, but it seems that however I fiddle with the parameters, I cannot achieve more than 69% accuracy.
30% of the data was held back and used to verify the training. The confusion matrices can be seen below.
Matlab confusion matrix
Python confusion matrix
If anyone has experience with svm.SVC multi-class training and can see a problem in my code, or has a suggestion, it would be greatly appreciated.
Python code:
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
#from sklearn import preprocessing
#### SET fitting parameters here
C = 100
gamma = 1e-8
#### SET WEIGHTS HERE
C0_Weight = 1*C
C1_weight = 1*C
C2_weight = 1*C
C3_weight = 1*C
C4_weight = 1*C
#####
X = np.genfromtxt('data/features.csv', delimiter=',')
Y = np.genfromtxt('data/targets.csv', delimiter=',')
print 'feature data is of size: ' + str(X.shape)
print 'target data is of size: ' + str(Y.shape)
# SPLIT X AND Y INTO TRAINING AND TEST SET
test_size = 0.3
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=0)
svc = svm.SVC(C=C, kernel='rbf', gamma=gamma,
              class_weight={0: C0_Weight, 1: C1_weight, 2: C2_weight, 3: C3_weight, 4: C4_weight},
              cache_size=1000)
svc.fit(X_train, Y_train)
scores = cross_val_score(svc, X_train, Y_train, cv=10)
print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Out = svc.predict(x_test)
np.savetxt("data/testPredictions.csv", Out, delimiter=",")
np.savetxt("data/testTargets.csv", y_test, delimiter=",")
# calculate accuracy in test data
Hits = 0
HitsOverlap = 0
for idx, val in enumerate(Out):
    Hits += int(y_test[idx] == Out[idx])
    HitsOverlap += int(y_test[idx] == Out[idx]) + int(y_test[idx] == (Out[idx] - 1)) + int(y_test[idx] == (Out[idx] + 1))
print "Accuracy in testset: ", Hits*100/(11595*test_size)
print "Accuracy in testset w. overlap: ", HitsOverlap*100/(11595*test_size)
To those curious how I got the parameters: they were found with GridSearchCV (which increased the accuracy from 40% to 69%).
Any help or suggestions are greatly appreciated.
After much pulling of hair, the answer was found here: http://neerajkumar.org/writings/svm/
When the inputs were scaled with StandardScaler(), svm.SVC now produces results superior to Matlab's!
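A rough sketch of that scaling fix (not the poster's exact code; it assumes the X and Y arrays loaded from features.csv and targets.csv above, and re-tunes C and gamma with GridSearchCV since values found on unscaled data may no longer be appropriate):
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# Scaling inside a pipeline so the scaler is fit on the training folds only
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='rbf'))])
param_grid = {'svc__C': [1, 10, 100], 'svc__gamma': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_train, Y_train)

print(search.best_params_)
print(search.score(x_test, y_test))  # accuracy on the held-out 30%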
