I am running into problems using GridSearchCV from the sklearn package with parallelization.
To see whether the problem came from my data, I tried the same kind of processing on the iris dataset bundled with sklearn.
I am trying to optimize an SVM classifier over different parameter values. With n_jobs=1 there is no problem and it works well (the following code runs in about 0.5s). But if I change n_jobs to any other value, it runs indefinitely (I did not let it run for several hours, but after more than 10 minutes I still had no result).
Here is the code with n_jobs=3:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import cross_validation
from sklearn import datasets
from sklearn import grid_search
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = StandardScaler().fit_transform(X)
parameters = {'C': [0.1,0.3,0.5,0.7,1.0],\
'coef0': [0.1,0.3,0.5,0.7,0.9],\
'degree': [2,3,4],\
'gamma': [0.0,0.1,0.2]}
clf = SVC(random_state=0, max_iter=-1, probability=False,shrinking=True, verbose=False,\
class_weight='auto',tol=0.0001, kernel='poly')
gridsearch_RF = grid_search.GridSearchCV(clf, parameters, n_jobs=3, cv=2, verbose=0)
gridsearch_RF = gridsearch_RF.fit(X,y)
print gridsearch_RF.best_score_
print gridsearch_RF.best_params_
I also tried another classifier (RandomForest) and got the same problem.
Has anyone had the same problem?
(I use WinPython-64bits-2.7.9.5 and sklearn version 0.16.1)
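One thing that may be worth checking on Windows: worker processes there are started by re-importing the main script, so code that launches parallel jobs usually has to sit under an `if __name__ == "__main__":` guard. The following is a minimal sketch, not a confirmed fix for the setup above. It assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection` and `class_weight='balanced'` replaces `'auto'`; the helper name `run_grid_search` and the reduced parameter grid are illustrative only.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def run_grid_search(n_jobs=3):
    """Grid-search a polynomial-kernel SVC on iris (illustrative helper)."""
    iris = datasets.load_iris()
    X = StandardScaler().fit_transform(iris.data)
    y = iris.target

    # Reduced grid (assumption) so the sketch finishes quickly.
    parameters = {
        "C": [0.1, 0.5, 1.0],
        "coef0": [0.1, 0.5, 0.9],
        "degree": [2, 3],
        "gamma": [0.1, 0.2],
    }
    clf = SVC(kernel="poly", class_weight="balanced", tol=1e-4, random_state=0)
    search = GridSearchCV(clf, parameters, n_jobs=n_jobs, cv=2, verbose=0)
    search.fit(X, y)
    return search


if __name__ == "__main__":
    # The guard keeps spawned worker processes from re-running the search on import.
    search = run_grid_search(n_jobs=3)
    print(search.best_score_)
    print(search.best_params_)
```

If the guarded version finishes with n_jobs=3 while the unguarded one hangs, the process start-up on Windows is the likely culprit rather than the data or the classifier.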
I am trying to practice using Sci-Kit Learn to do a K-Nearest Neighbor prediction model using the Iris data set. This is what I have written:
import sklearn
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
iris = datasets.load_iris()
X = iris.data
y = iris.target
knn =KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)
This is my output: KNeighborsClassifier(n_neighbors=6)
However, I think I should be getting: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=6, p=2, weights='uniform')
Also, I tried to predict the target value based on a new array of X values (X_new) as below:
X_new = np.array([[5.6,2.8,3.9,1.1],[5.7,2.6,3.8,1.3],[4.7,3.2,1.3,0.2]])
Pred = knn.predict(X_new)
print(Pred)
However, it didn't provide an output of anything at all. Any assistance/advice would be appreciated!
I think your code works fine; I ran it on Google Colab (link to the notebook - https://colab.research.google.com/drive/1FROuNe4NMD6D2HCCEtz6TePlCccbGFZm?usp=sharing).
Do check it out; maybe you can try reproducing the error there.
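For what it's worth, the short repr is expected in recent scikit-learn releases, which by default print only the parameters that differ from their defaults. Here is a small sketch of how to see the full parameter set and the prediction; the variable names follow the question, and the explicit `print` calls matter when the code runs as a script rather than in an interactive cell.

```python
import numpy as np
from sklearn import datasets, set_config
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# get_params() returns every parameter, including the defaults hidden by the short repr.
params = knn.get_params()
print(params)

# Optionally ask scikit-learn to show all parameters in the repr as well.
set_config(print_changed_only=False)
print(knn)

X_new = np.array([[5.6, 2.8, 3.9, 1.1],
                  [5.7, 2.6, 3.8, 1.3],
                  [4.7, 3.2, 1.3, 0.2]])
pred = knn.predict(X_new)
print(pred)  # one class label (0, 1 or 2) per row of X_new
```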
I am a beginner in machine learning, and as part of learning I chose the student performance dataset from UCI. I want to predict a student's final result based on the given features.
I first tried using the two main, highly correlated features G1 and G2, which are the grades of two exams. I used the LinearRegression algorithm and got an accuracy of 0.4 or less.
Then I tried feature engineering on all the features that are objects in the dataframe, and the accuracy is still the same.
How can I improve the accuracy score?
My code as a Python notebook
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error,accuracy_score
df = pd.read_csv('student-mat.csv',sep=';')
df2 = pd.read_csv('student-por.csv',sep=';')
df = [df,df2]
df = pd.concat(df)
df = pd.get_dummies(df)
X = df.drop('G3',axis=1)
y = df['G3']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=42)
model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
y_pred = [int(round(i)) for i in y_pred]
accuracy_score(y_test,y_pred)
Accuracy computed on a continuous target is not very useful. Use the mean squared error instead, which is meant for continuous outputs.
As for improving your model, you can use the different tools at your disposal to identify the most relevant features. I recommend the statsmodels API (https://www.statsmodels.org/stable/regression.html) for a more in-depth analysis.
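To make the metric suggestion concrete, here is a minimal sketch that scores a LinearRegression with regression metrics instead of `accuracy_score`. It uses synthetic data from `make_regression` as a stand-in for the student CSV files, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (assumption: the real CSVs are not used here).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit
print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

With the real data, the same four lines of metric code can be applied to `y_test` and the unrounded `y_pred` from the question.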
I am exploring the hyperparameters of a KNN model fitted to the MNIST dataset, using RandomizedSearchCV in sklearn.
For this exploration I kept only 10000 samples; otherwise it would take forever.
My code is similar to this:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
mnist = fetch_openml("mnist_784")
X, y = mnist['data'], mnist['target']
X_train_pre, X_test, y_train_pre, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train_pre[shuffle_index], y_train_pre[shuffle_index]
del y_train_pre, X_train_pre
X_short, y_short = X_train[:10000], y_train[:10000]
param_grid2 = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5, 6, 7, 8]}  # 5,10,15, 20, 25, 30, 40,50]
knn_clf_rs2 = KNeighborsClassifier(n_jobs=-1)
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
                                    scoring="neg_mean_squared_error",
                                    verbose=5, n_jobs=-1)
random_search2.fit(X_short, y_short)
To speed things up, I set n_jobs=-1 both in the classifier constructor and in the random_search2 constructor, although for the moment I only need the one for the search; the one set in the classifier constructor is unused. However, looking around, I saw that according to many people n_jobs=-1 is ineffective and only one processor is actually used. Even worse, it slows things down because of the increased resource usage of multiprocessing. So it would actually be better (in my case too) to do:
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
                                    scoring="neg_mean_squared_error",
                                    verbose=5)
First of all, is n_jobs=-1 not working because of the classifier type (KNN) or because of RandomizedSearchCV or because of the combination of the two?
Furthermore is there any solution to this such as to force multiprocessing?
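Whether parallelism helps depends on the machine and the data size, so it may be easiest to measure it directly. The following rough sketch times the same search with n_jobs=1 and n_jobs=-1. It assumes the small `load_digits` dataset as a stand-in for MNIST and `accuracy` as the scoring metric (both are substitutions, not the setup above); `timed_search` is a hypothetical helper.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier


def timed_search(n_jobs):
    """Run the same randomized search with a given n_jobs; return (search, seconds)."""
    X, y = load_digits(return_X_y=True)
    param_distributions = {
        "weights": ["uniform", "distance"],
        "n_neighbors": [3, 4, 5, 6, 7, 8],
    }
    search = RandomizedSearchCV(
        KNeighborsClassifier(),
        param_distributions,
        n_iter=6,
        cv=3,
        scoring="accuracy",
        n_jobs=n_jobs,
        random_state=0,
    )
    start = time.perf_counter()
    search.fit(X, y)
    return search, time.perf_counter() - start


if __name__ == "__main__":
    # Compare wall-clock time for a single worker and for all available cores.
    for n_jobs in (1, -1):
        _, seconds = timed_search(n_jobs)
        print(f"n_jobs={n_jobs}: {seconds:.2f} s")
```

On a dataset this small the process start-up cost can outweigh the gain, so the comparison is only meaningful when repeated at the real problem size.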
When I run this leave-one-out cross-validation, it does nothing and gives no error message. I can't figure out what I am missing. I'm using the csv from kaggle - https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression/downloads/heart-disease-prediction-using-logistic-regression.zip/1
import csv
from sklearn.model_selection import LeaveOneOut
from sklearn import svm
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
#replace missing values with mean
dataset = read_csv("//Users/crystalfortress/Desktop/CompGenetics/Final_Project_Comp/framingham.csv")
dataset.fillna(dataset.mean(), inplace=True)
print(dataset.isnull().sum())
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 15].values
model = svm.SVC(kernel='linear', C=10, gamma = 0.1)
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print('Accuracy after cross validation:', scores.mean())
predictions = cross_val_predict(model, X, y, cv=loo)
accuracy = metrics.r2_score(y, predictions)
print('Prediction accuracy:', accuracy)
x = metrics.classification_report(y, predictions)
print(x)
cf = metrics.confusion_matrix(y, predictions)
print(cf)
I tried running this on my machine, and it looks like your code works fine (though I did comment out a lot of unnecessary imports). Leave-one-out takes a really long time because it builds n training sets out of n data points and fits the model once for each, so you will have to wait for all of those fits before you get your results. Setting cv= to a number (the default was 3 in older scikit-learn releases and has been 5 since version 0.22) will run much more quickly, and most likely with less variance. Also, adding n_jobs=-1 to your cross_val_score call lets it use all of your processors. E.g.: scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1)
You can also set the verbose parameter in cross_val_score to watch the progress (though, heads up, it isn't fast). I believe the highest value is 3 (their online docs don't mention it), but using a higher value is fine. So the final cross_val_score call would look like:
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1, verbose=10)
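To illustrate the cost difference described above, here is a small sketch comparing leave-one-out with 5-fold cross-validation. It uses a synthetic 60-sample dataset from `make_classification` as a stand-in for the Framingham CSV (an assumption, chosen so that leave-one-out finishes quickly).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Small synthetic stand-in (assumption) so leave-one-out finishes quickly.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = SVC(kernel="linear", C=1.0)

# Leave-one-out: one fit per sample, so 60 fits here.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")

# 5-fold: only 5 fits, each tested on 12 held-out samples.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print(len(loo_scores), loo_scores.mean())
print(len(kfold_scores), kfold_scores.mean())
```

The number of fits grows with the number of rows under leave-one-out but stays fixed under k-fold, which is why the original script appears to do nothing for a long time on a dataset with thousands of rows.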
I'm used to using sklearn, whose documentation I find very easy to follow. However, I now need to learn to use OpenCV - in particular, I need to be able to use an MLP classifier, and to update its weights as new training data comes in.
In sklearn, this can be done using the partial_fit method. According to the OpenCV documentation, there is an UPDATE_WEIGHTS flag that can be set, but I can't figure out how to include it in my code.
Here's an MCVE of what I have so far:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np
import cv2
from sklearn.neural_network import MLPClassifier
def softmax(x):
    softmaxes = np.zeros(x.shape)
    for i in range(x.shape[1]):
        softmaxes[:, i] = np.exp(x)[:, i]/np.sum(np.exp(x), axis=1)
    return softmaxes
data = load_breast_cancer()
X = data.data
y = data.target.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)
y2 = np.zeros((y_train.shape[0], 2))
y2[:,0] = np.where(y_train==0, 1, 0)
y2[:,1] = np.where(y_train==1, 1, 0)
ann = cv2.ml.ANN_MLP_create()
ann.setLayerSizes(np.array([X.shape[1], y2.shape[1]]))
ann.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM)
ann.train(np.float32(X_train), cv2.ml.ROW_SAMPLE, np.float32(y2))
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
preds_proba = softmax(ann.predict(np.float32(X_test))[1])
print(roc_auc_score(y_test, preds_proba[:,1]))
print(roc_auc_score(y_test, mlp.predict_proba(X_test)[:,1]))
As the scores of the OpenCV classifier and the sklearn one are comparable, I'm pretty confident it's implemented correctly.
How can I modify this code, so that when a new training sample comes in, I can update the weights based on that sample alone, rather than retraining on the entire train set?
The equivalent in sklearn would be:
mlp.partial_fit(X_new_sample, y_new_sample).
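For reference, the sklearn side of that comparison could look like the following sketch. Holding back the last training row to act as the "new" sample is an assumption made for illustration; `X_initial` and `y_initial` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)

# Hold back the last training row to play the role of a newly arriving sample.
X_initial, y_initial = X_train[:-1], y_train[:-1]
X_new_sample, y_new_sample = X_train[-1:], y_train[-1:]

mlp = MLPClassifier(random_state=0)
# The first partial_fit call must be told every class label that can occur.
mlp.partial_fit(X_initial, y_initial, classes=np.unique(y))

# Later calls update the existing weights using only the new sample.
mlp.partial_fit(X_new_sample, y_new_sample)
score = mlp.score(X_test, y_test)
print(score)
```

Each `partial_fit` call performs a single pass over the data it is given, so the initial call here trains far less than a full `fit` would.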
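Figured out the answer. Wrap the new data in a TrainData object and pass the update flag to train:
ann.train(cv2.ml.TrainData_create(np.float32(X), 0, np.float32(y)), flags=1)
Here the 0 is the sample layout (cv2.ml.ROW_SAMPLE) and flags=1 corresponds to cv2.ml.ANN_MLP_UPDATE_WEIGHTS, so the named constants can be used in place of the literals.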