I am trying to teach myself how to grid-search the number of neurons in a basic multi-layer neural network. I am using GridSearchCV and KerasClassifier in Python with Keras. The code below works very well for other data sets, but I could not make it work for the Iris dataset and I cannot figure out why; I must be missing something. The result I get is:
Best: 0.000000 using {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 3}
0.000000 (0.000000) with: {'n_neurons': 5}
from pandas import read_csv
import numpy
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.model_selection import GridSearchCV
dataframe=read_csv("iris.csv", header=None)
dataset=dataframe.values
X=dataset[:,0:4].astype(float)
Y=dataset[:,4]
seed=7
numpy.random.seed(seed)
#encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
#one-hot encoding
dummy_y = np_utils.to_categorical(encoded_Y)
#scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
def create_model(n_neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(n_neurons, input_dim=X.shape[1], activation='relu'))  # hidden layer
    model.add(Dense(3, activation='softmax'))  # output layer
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=10, initial_epoch=0, verbose=0)
# define the grid search parameters
neurons=[3, 5]
#this does 3-fold classification. One can change k.
param_grid = dict(n_neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, dummy_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
For the purpose of illustration and computational efficiency I search over only two values. I sincerely apologize for asking such a simple question. I am new to Python, by the way; I switched from R because I realized that the deep learning community is using Python.
Haha, this is probably the funniest thing I ever experienced on Stack Overflow :) Check:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=5)
and you should see different behavior. The score reported here is accuracy, so 0 means that every prediction on the test folds was wrong. The reason is that you haven't shuffled your data: because Iris consists of three balanced classes stored in order, with the default 3-fold split each of your test folds had a single class as its target, and that class was absent from the corresponding training folds:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 (first fold ends here) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (second fold ends here)2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
A model cannot predict a class it never saw during training, so that's why every held-out sample was misclassified.
Try shuffling your data beforehand; this should result in the expected behavior.
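To see the fold composition directly, here is a minimal sketch that uses only scikit-learn (no Keras). It assumes the copy of Iris bundled with scikit-learn, which is sorted by class like the data in the question, and a 3-fold KFold to mirror an unshuffled split; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

# Assumption: scikit-learn's bundled Iris, which is ordered by class (0, 1, 2).
y = load_iris().target

# Without shuffling, each test fold is a contiguous block of 50 rows.
plain = KFold(n_splits=3, shuffle=False)
unshuffled_classes = [np.unique(y[test]).tolist() for _, test in plain.split(y)]
print(unshuffled_classes)

# With shuffling, the test folds contain a mix of classes.
mixed = KFold(n_splits=3, shuffle=True, random_state=7)
shuffled_classes = [np.unique(y[test]).tolist() for _, test in mixed.split(y)]
print(shuffled_classes)
```

Passing a shuffled splitter such as cv=mixed to GridSearchCV, or shuffling X and dummy_y together beforehand, are two ways to apply the advice above.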
Related
I applied Random Forest RFECV among other ML models to a churn dataset.
While Logistic, SVC, Gradient Boosting, Decision Trees worked well on the data (all using RFECV),
Random Forest RFECV decided that only one feature was important and eliminated all the other features.
Code:
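import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
#Create Feature variable X and Target variable y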
y = churn_dataset['Churn']
X = churn_dataset.drop(['Churn'], axis = 1)
#RFECV
rfecv = RFECV(RandomForestClassifier(), cv=10, scoring='f1')
rfecv = rfecv.fit(X, y)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X.columns[rfecv.support_])
print(np.where(rfecv.support_ == False)[0])
#drop columns
X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1, inplace=True)
rfecv.estimator_.feature_importances_
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=8)
#fit model
random_forest = rfecv.fit(X_train, y_train)
The following error is returned:
ValueError: Found array with 1 feature(s) (shape=(1622, 1)) while a minimum of 2 is required.
Output of churn_dataset.head()
name gender churn last_purchase_in_days order_count purchase_quantity ...
2 ACKLE 0 1 0.317604 -0.453647 2 -0.368683 1.173058 0.291104 0 ... 0 0 0 0 0 0 1 0 0 1.00
4 ADNAN 1 1 0.250814 -0.453647 2 -0.368683 -0.431351 -0.418023 0 ... 0 0 0 0 0 0 1 0 0 1.00
5 ADY 0 1 -1.143415 -0.453647 2 -0.368683 0.190767 -0.117630 0 ... 0 0 0 0 0 0 1 0 0 1.00
6 ANDY 0 1 0.768432 -0.453647 2 -0.368683 -0.752232 -0.559952 0 ... 0 0 0 0 0 0 1 0 0 1.00
7 AGIE 0 0 -1.669381 3.048875 8 -0.368683 0.520653 4.251851 0 ... 0 0 0 0 0 0 1 0 0 0.16
churn_dataset.columns
Index(['name', 'gender', 'Churn', 'last_purchase_in_days',
'order_count', 'quantity', 'disc_code',
'AOV', 'sales',
'channel_Paid Advertising','channel_Recurring Payment',
'channel_Search Engine',
'channel_Social Media', 'country_Denmark', 'country_France',
'country_Germany', 'country_Italy',
'country_Luxembourg', 'country_Others', 'country_Switzerland',
'country_United Kingdom', 'city_Düsseldorf', 'city_Frankfurt',
'city_Hamburg', 'city_Hannover', 'city_Köln', 'city_Leipzig',
'city_Munich', 'city_Others', 'city_Stuttgart', 'city_Wien',
'Probability_of_Churn'],
dtype='object')
I have a list of words (a lexicon) that I want to use for bag-of-words (BOW) classification. In sklearn it is possible to use CountVectorizer and TfidfVectorizer, but both approaches build the vocabulary they use from the training data. In my case I have already built a list of words (a dictionary) that can be used to discriminate between the classes for text classification.
Is there any library or package I can use in python?
Check out the vocabulary parameter of CountVectorizer in the docs.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
twenty_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'], shuffle=True, random_state=42)
my_vocabulary = ['aristotle',
'arithmetic',
'arizona',
'arkansas'
]
count_vect = CountVectorizer(vocabulary=my_vocabulary)
X_train_counts = count_vect.fit_transform(twenty_train.data)
df = pd.DataFrame.sparse.from_spmatrix(X_train_counts)
And you will see in the output that only the words from your vocabulary are used:
Out[16]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
... .. .. .. ..
2252 0 0 0 0
2253 0 0 0 0
2254 0 0 0 0
2255 0 0 0 0
2256 0 0 0 0
[2257 rows x 4 columns]
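Since the question also mentions TfidfVectorizer, here is a minimal sketch showing that it accepts the same vocabulary parameter. The lexicon and the three documents are made up for illustration and are not taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lexicon and documents, made up for illustration.
lexicon = ["goal", "match", "election", "vote"]
docs = [
    "The match ended with a late goal",
    "Vote early in the election",
    "No lexicon words appear here",
]

# Only the supplied words become columns; nothing is learned from docs
# except the idf weights.
vectorizer = TfidfVectorizer(vocabulary=lexicon)
X = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)  # word -> column index, following lexicon order
print(X.toarray().round(2))
```

A document that contains none of the lexicon words ends up as an all-zero row, so it may be worth checking how many such rows there are before training a classifier.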
Of late, there are so many questions around multi-class multi label text classification. Please check if this article helps.
I'm working on the basics of machine learning with the iris dataset. I think I understand the idea of splitting data and making predictions on new data; however, I'm having trouble understanding the results I get for the code below:
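from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X = iris.data
y = iris.target
len(X)  # result: 150
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred)
print(metrics.accuracy_score(y_test, y_pred))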
Result: [1 2 2 0 2 1 0 2 0 1 1 2 2 2 0 0 2 2 0 0 1 2 0 2 1 2 1 1 1 2 0 1 1 0 1 0 0
2]
0.95 accuracy
I only get back 38 results. From what I understand, the data is split into 50 50 chunks, meaning I should get back 50 results for the data not part of the train and test data. Why do I get only 38?
I feel like my biggest question regarding Machine Learning is actually using the model.
By default train_test_split sets test_size to 0.25. With 150 samples that is 37.5, which is rounded up to 38, so 38 values are correct.
sklearn.model_selection.train_test_split
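A minimal sketch to check the arithmetic: with 150 samples and the default test_size of 0.25, the test count is 150 × 0.25 = 37.5, which scikit-learn rounds up to 38. The second call shows one way to get 50 test samples, assuming that is what was intended; the variable names are illustrative.

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Default split: test_size=0.25, and the test count is rounded up.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
default_sizes = (len(X_train), len(X_test))
print(default_sizes, math.ceil(0.25 * len(X)))

# An integer test_size is read as an absolute number of test samples.
X_train50, X_test50, _, _ = train_test_split(X, y, test_size=50, random_state=5)
explicit_sizes = (len(X_train50), len(X_test50))
print(explicit_sizes)
```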
I have more than 2500 samples on which static analysis has been performed, with more than 300 features extracted per sample.
Among these samples, I have distinguished more than 10 APT classes, and my aim is to build a one-class classifier for each class.
I'm using the Python scikit-learn library for machine learning, and in particular I'm working with One-class SVM.
First question: are there other good one-class classifiers for this approach?
Second question: I have to come up with some metric that can define a sort of "accuracy" for the classifier. I know that for one-class SVM the concept of accuracy is not so well defined. Here is my code and my idea:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.read_csv('features_labeled_apt17.csv')
# .ix has been removed from pandas; .iloc selects columns 1..340 by position
X = df.iloc[:, 1:341].values
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)
# note: gamma is ignored by the linear kernel
clf = svm.OneClassSVM(nu=0.1, kernel="linear", gamma=0.1)
clf.fit(X_train)  # fit() returns the estimator itself, not a score
pred = clf.predict(X_test)
print(pred)
This is the output of the code:
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1
1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1]
The 1 of course represents a well-labeled sample, while the -1 represents a wrongly labeled one.
First: do you think this can be a good approach?
Second: for the metric, what if I divide the total number of elements in the testing set by the number of wrongly labeled ones?
As far as I understand machine learning algorithms, your use case is not a good fit for a one-class SVM classifier.
Normally, a one-class SVM is used for unsupervised outlier detection problems. Refer to this page to see how a one-class SVM is used to detect outliers.
If you post your data frame, I will try to find another approach to solve your problem.
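For the metric raised in the second question, one quantity that is often reported is the fraction of held-out samples of the class that the model rejects (predicts as -1). A minimal sketch follows; it uses synthetic Gaussian data as a stand-in for the real feature file, and the RBF kernel and nu value are assumptions, not recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

# Assumption: synthetic stand-in for one class's feature matrix.
rng = np.random.RandomState(42)
X = rng.normal(size=(300, 10))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale")
clf.fit(X_train)
pred = clf.predict(X_test)  # +1 = accepted as the class, -1 = rejected

# Share of same-class test samples that were rejected / accepted.
rejected_fraction = float(np.mean(pred == -1))
accepted_fraction = 1.0 - rejected_fraction
print(rejected_fraction, accepted_fraction)
```

Note that this only measures how many samples of the same class are accepted. Measuring false acceptances would additionally require predicting on samples from the other classes.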
I tried to build a very simple SVM predictor that I could understand with my basic Python knowledge. As my code looks so different from this question and also this question, I don't know how to find the most important features for SVM prediction in my example.
I have the following 'sample' containing features and class (status):
A B C D E F status
1 5 2 5 1 3 1
1 2 3 2 2 1 0
3 4 2 3 5 1 1
1 2 2 1 1 4 0
I saved the feature names as 'features':
A B C D E F
The features 'X':
1 5 2 5 1 3
1 2 3 2 2 1
3 4 2 3 5 1
1 2 2 1 1 4
And the status 'y':
1
0
1
0
Then I build X and y arrays out of the sample, train & test on half of the sample and count the correct predictions.
import pandas as pd
import numpy as np
from sklearn import preprocessing, svm
# 'sample' is the DataFrame shown above, 'features' the list of feature names
X = np.array(sample[features].values)
X = preprocessing.scale(X)
y = np.array(sample['status'].values.tolist())
test_size = int(X.shape[0]/2)
clf = svm.SVC(kernel="linear", C=1)
# train on the first half of the sample
clf.fit(X[:-test_size], y[:-test_size])
# count correct predictions on the second half
correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x].reshape(-1, len(features)))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count)/test_size) * 100.00
My problem now is that I have no idea how to adapt the code from the questions above so that I can also see which features are the most important.
I would be grateful if you could tell me whether that's even possible for my simple version. If yes, any tips on how to do it would be great.
From all feature set, the set of variables which produces the lowest values for square of norm of vector must be chosen as variables of high importance in order
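Separately from the norm-based criterion mentioned above, a common shortcut for a linear-kernel SVC is to inspect the fitted weights in coef_: after scaling, a larger absolute weight suggests a more influential feature. A minimal sketch using the four sample rows from the question follows. Four rows are far too few for a meaningful ranking, and fitting on all rows instead of half is an assumption made only to keep the example short, so treat the output as illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm

features = ["A", "B", "C", "D", "E", "F"]
sample = pd.DataFrame(
    [[1, 5, 2, 5, 1, 3, 1],
     [1, 2, 3, 2, 2, 1, 0],
     [3, 4, 2, 3, 5, 1, 1],
     [1, 2, 2, 1, 1, 4, 0]],
    columns=features + ["status"],
)

X = preprocessing.scale(sample[features].values.astype(float))
y = sample["status"].values

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X, y)

# coef_ is only available for the linear kernel; for a two-class
# problem its shape is (1, n_features).
weights = np.abs(clf.coef_[0])
ranking = sorted(zip(features, weights), key=lambda fw: fw[1], reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

With a non-linear kernel there is no coef_ attribute, so this shortcut does not carry over directly.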