I started my journey in machine learning a few months ago. Today I was practicing my skills and tried a few different algorithms: Linear Regression, Decision Tree Classifier, and Support Vector Machine. My code is very simple and it's working just fine (I guess), but since I'm new, pardon me if this is a silly question. Linear Regression and the Decision Tree Classifier give me an accuracy from 1.04 to 1.22, but if I use SVM it gives me 0.72. I'm confused, since I read that SVM is better than Linear Regression in speed and performance. Can you please help me clarify this? :)
Thanks in Advance :)
THIS IS MY CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
dataset = pd.read_csv("/home/jairo/Downloads/diabetes.csv")
dataset.shape
x = dataset.drop(['Outcome'], axis=1)
y = dataset['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
classifier = DecisionTreeClassifier()
classifier.fit(x_train, y_train)
predic = classifier.predict(x_test)
# normalize=False returns the raw count of correct predictions, not a fraction
score = accuracy_score(y_test, predic.round(), normalize=False)
# dividing by 100 only gives a true accuracy if the test set has exactly 100 rows
print("Accuracy : {}".format(score/100))
THIS IS THE LAST OUTPUT THAT I GOT:
Accuracy : 1.15
Classification performance is highly dependent on your type of input and what you want to classify; one algorithm isn't objectively "better" than the other. To perhaps add some insight into your results: SVM works by trying to find a hyperplane that divides your data into classes. If you had positive and negative as your two potential outcomes, for example, it would try to find the hyperplane in n-dimensional space such that the points on each side of it belong to the same class, where n is the number of features.
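For reference, here's a minimal sketch of how the SVM could be evaluated on the same data (assuming the diabetes.csv layout from your code). With accuracy_score's default normalize=True, the result is a proper fraction between 0 and 1, so the models become directly comparable:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

dataset = pd.read_csv("diabetes.csv")  # same file as in the question
x = dataset.drop(['Outcome'], axis=1)
y = dataset['Outcome']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = SVC()  # RBF kernel by default
clf.fit(x_train, y_train)
# normalize=True (the default) returns the fraction of correct predictions
print("Accuracy : {:.3f}".format(accuracy_score(y_test, clf.predict(x_test))))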
I have a csv of size 12500 x 3. The first two columns (A and B) are inputs and the final column (C) is the sum of the two.
I wanted to build a prediction model to get the value of C for a given A and B. This is just a basic model to improve my understanding of machine learning.
The accuracy score is almost zero (0.00032), and the model is way too simple to get the predictions wrong. The code is below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('Dataset.csv') #importing dataset
X = data.drop(columns=['C'])
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score
I did not even include outliers in the data, and I created the csv using Excel formulae. I used a Jupyter notebook to build this prediction model. Can someone please point out if/what I'm doing wrong?
Before you build your model, you should understand its behavior and its main function. A Decision Tree classifier is used to assign data to discrete classes based on criteria extracted from the data; here your target C is continuous, so the classifier treats every distinct sum as its own class, which is why the accuracy is near zero. For this purpose you should choose a simple Linear Regression model, not a Decision Tree classifier.
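A minimal sketch of that, assuming the same Dataset.csv with columns A, B and C, and using a regression metric (R²) instead of classification accuracy:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

data = pd.read_csv('Dataset.csv')
X = data[['A', 'B']]
y = data['C']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Since C is an exact linear function of A and B, R² should be ~1.0
print(r2_score(y_test, predictions))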
For my University project, I was asked to optimise the structure and parameters of an ANN using one or more of the following methods:
Random Search
Meta Learning
Adaptive Boosting
Cascade Correlation
Here is the original code to improve:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
nof_prin_components = 200
pca = PCA(n_components=nof_prin_components, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
nohn = 200 # nof hidden neurons
clf = MLPClassifier(hidden_layer_sizes=(nohn,), solver='sgd', activation='tanh',
                    batch_size=256, early_stopping=True).fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print(classification_report(y_test, y_pred))
I haven't had any problems with implementing Random Search and Grid Search; it definitely makes sense to me, it's well documented, and there are lots of examples of how to use it. Here's how I've implemented it:
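Roughly, it looks like this (the parameter distribution below is illustrative, not the exact one from my report):

from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'hidden_layer_sizes': [(50,), (100,), (200,), (100, 50)],
    'activation': ['tanh', 'relu'],
    'alpha': [1e-4, 1e-3, 1e-2],
    'batch_size': [128, 256],
}
search = RandomizedSearchCV(MLPClassifier(solver='sgd', early_stopping=True),
                            param_distributions=param_dist, n_iter=10, cv=3)
search.fit(X_train_pca, y_train)  # the PCA-transformed data from above
print(search.best_params_)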
When it comes to the rest of the methods, I have no idea how to use them. I can't find any useful examples that I could adapt for my solution.
The question is: what's the easiest of the listed methods (except Random Search) to implement and describe in the report? How can I implement it along with my MLPClassifier?
I'm reading about decision trees and bagging classifiers, and I'm trying to show the first decision tree that is used in the bagging classifier. I'm confused about the output.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1)
bag_clf.fit(X_train, y_train)
Source(tree.export_graphviz(bag_clf.estimators_[0], out_file=None))
Here's a snippet out of the output
It's been my understanding that the value is supposed to show how many of the samples are classified as each category. In that case, shouldn't the numbers in the value field add up to the samples field? Why is that not the case here?
Nice catch.
It would seem that the extra bootstrap samples are included in the value, but not in the total samples; repeating your code verbatim but changing to bootstrap=False eliminates the discrepancy:
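That is, with everything else kept identical:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=False,  # the only change
    n_jobs=-1)
bag_clf.fit(X_train, y_train)
Source(tree.export_graphviz(bag_clf.estimators_[0], out_file=None))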
The behavior is similar in Random Forest, both classifier and regressor - see respectively:
Why the sum "value" isn't equal to the number of "samples" in scikit-learn RandomForestClassifier?
sklearn RandomForestRegressor discrepancy in the displayed tree values
Interesting find.
I did some digging around and found that bootstrapping effectively switches on the proportion=True behaviour when exporting the graphviz object. Since the same sample can pass through the decision tree more than once, the value field is expressed in proportional terms. If bootstrap=False, each sample goes through only once, and hence value can be expressed as a count of samples in each class.
I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured in a way that it has a variable weight for each data point that corresponds to the amount of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset to a non-weighted version that has duplicate data points each represented individually?
You can definitely specify the weights while training these classifiers in scikit-learn; specifically, this happens during the fit step. Here is an example using RandomForestClassifier, but the same also applies to GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Here I define some arbitrary weights just for the sake of the example:
weights = np.random.choice([1, 2], len(y_train))
And then you can fit your model with these weights:
rfc = RandomForestClassifier(n_estimators=20, random_state=42)
rfc.fit(X_train, y_train, sample_weight=weights)
You can then evaluate your model on your test data.
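For example, the classifier's built-in score method reports the mean accuracy on the held-out set:

print(rfc.score(X_test, y_test))  # mean accuracy on the test set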
Now, to your last point: you could, in this example, resample your training set according to the weights by duplication (see the sketch after this list). But in most real-world examples this could end up being very tedious, because
you would need to make sure all your weights are integers to perform duplication
you would have to uselessly multiply the size of your data, which is memory-consuming and is most likely going to slow down the training procedure
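For completeness, here is a minimal sketch of that duplication approach, assuming integer weights like the ones defined above:

# np.repeat duplicates each row according to its (integer) weight
X_dup = np.repeat(X_train, weights, axis=0)
y_dup = np.repeat(y_train, weights)
rfc_dup = RandomForestClassifier(n_estimators=20, random_state=42)
rfc_dup.fit(X_dup, y_dup)  # roughly equivalent to fitting with sample_weight=weights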
As a beginner in scikit-learn trying to classify the iris dataset, I'm having problems adjusting the scoring metric from scoring='accuracy' to others like precision, recall, f1 etc. in the cross-validation step. Below is the full code sample (it's enough to start at # Test options and evaluation metric).
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection # for command model_selection.cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
#Below, we build and evaluate 6 different models
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn: we calculate the cv-scores, their mean and std for each model
results = []
names = []
for name, model in models:
    # below, we do k-fold cross-validation
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Now, apart from scoring='accuracy', I'd like to evaluate other performance metrics for this multiclass classification problem. But when I use scoring='precision', it raises:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
My questions are:
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification; is that correct? If yes, then which command(s) should replace scoring='accuracy' in the code above?
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
Why is this happening, when the model evaluation documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) clearly says balanced_accuracy is a scoring method? I'm quite confused here, so actual code showing how to evaluate these other performance metrics would be appreciated! Thanks in advance!!
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification; is that correct?
No. Precision & recall are certainly valid for multi-class problems, too - see the docs for precision & recall.
If yes, then which command(s) should replace scoring='accuracy' in the code above?
The problem arises because, as you can see from the documentation links I have provided above, the default setting for these metrics is for binary classification (average='binary'). In your case of multi-class classification, you need to specify which exact "version" of the particular metric you are interested in (there is more than one); have a look at the relevant page of the scikit-learn documentation, but some valid options for your scoring parameter could be:
'precision_macro'
'precision_micro'
'precision_weighted'
'recall_macro'
'recall_micro'
'recall_weighted'
The documentation linked above even contains an example of using 'recall_macro' with the iris data - be sure to check it.
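In your code, the change is then a single line, e.g. for macro-averaged precision:

cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring='precision_macro')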
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
This is not exactly trivial, but you can see a way in my answer for Cross-validation metrics in scikit-learn for each data split
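In short, the idea is to iterate over the folds yourself instead of calling cross_val_score; a rough sketch using your variables:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

kfold = model_selection.KFold(n_splits=10, random_state=seed)
for train_idx, test_idx in kfold.split(X_train):
    model.fit(X_train[train_idx], Y_train[train_idx])  # model: any classifier from your list
    pred = model.predict(X_train[test_idx])
    print(confusion_matrix(Y_train[test_idx], pred))
    print(precision_score(Y_train[test_idx], pred, average='macro'))
    print(recall_score(Y_train[test_idx], pred, average='macro'))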
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
This is because you are probably using an older version of scikit-learn. balanced_accuracy became available only in v0.20 - you can verify that it is not available in v0.18. Upgrade your scikit-learn to v0.20 and you should be fine.
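You can check which version you have with:

import sklearn
print(sklearn.__version__)  # if this is below 0.20, upgrade, e.g. pip install -U scikit-learn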