I am trying to use NLTK's KMeans Clustering Algorithm.
It is generally going fine.
I want to use NLTK's metrics package to determine precision, recall, and F-measure.
I searched for examples on the web and in other references but could not find anything useful.
Could anyone kindly cite an example or reference?
Thanks in advance.
It is hard to evaluate the performance of unsupervised learning, i.e. clustering. It depends entirely on why you are trying to cluster in the first place.
Still, I think there are a few performance metrics available, like,
Precision, Recall, and thus the F-measure are inappropriate for cluster analysis. Clustering is not classification, and clusters are not classes!
Common measures for clustering (if you are trying to compare with existing labels, which does not make a whole lot of sense - if you already know the classes, then use classification and not clustering) are the Adjusted Rand Index and its variants.
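As a minimal sketch of the Adjusted Rand Index mentioned above, here is how it can be computed with scikit-learn's adjusted_rand_score (swapped in for NLTK here; the label lists are made-up toy data). The score compares two partitions and ignores the actual cluster ids, so swapping the ids does not change it.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical reference labels and two hypothetical clusterings of six items.
true_labels = [0, 0, 0, 1, 1, 1]
same_partition = [1, 1, 1, 0, 0, 0]       # same grouping, cluster ids swapped
different_partition = [0, 0, 1, 1, 2, 2]  # groups the items differently

# ARI depends only on which items are grouped together, not on the ids used.
ari_same = adjusted_rand_score(true_labels, same_partition)
ari_diff = adjusted_rand_score(true_labels, different_partition)

print("identical grouping:", ari_same)
print("different grouping:", ari_diff)
```

The cluster assignments from any clusterer, including NLTK's, can be passed in as the second argument as long as they are a flat list of cluster ids.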
I want to use the DBSCAN clustering algorithm to detect outliers in my dataset. As this is an unsupervised learning approach, do I need to split my dataset into training and test data, or is testing the DBSCAN algorithm simply not possible? For outlier detection, should I feed the DBSCAN model my entire dataset?
If testing DBSCAN is possible, can you suggest ways of doing that with Python?
You don't need to split your data into test and train. However, you should have a sample of labelled data from your original data if you wish to evaluate your model. There are other unsupervised ways as well, but they compare which clustering method is performing better relative to other methods that you try (algorithms or different hyperparameters).
I would suggest reading https://scikit-learn.org/stable/modules/clustering.html
Section 2.3.10 shows the various methods for evaluating your clustering models, and the sklearn API needed to implement them.
You can choose which one suits your requirement best based on your problem statement.
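To make the two kinds of evaluation concrete, here is a rough sketch using synthetic blobs in place of your data. The eps and min_samples values and the blob centers are illustrative assumptions, not recommendations. The Adjusted Rand Index needs a labelled sample, while the silhouette score does not.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the real dataset; y_true plays the role of the
# labelled sample. The centers are chosen to be well separated.
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.5,
    random_state=0,
)

# eps and min_samples are illustrative guesses, not tuned values.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# External evaluation: compares the clustering with known labels.
ari = adjusted_rand_score(y_true, labels)

# Internal evaluation: needs no labels. DBSCAN marks noise as -1,
# and those points are left out of the silhouette computation here.
mask = labels != -1
n_clusters = len(set(labels[mask]))
sil = silhouette_score(X[mask], labels[mask]) if n_clusters > 1 else float("nan")

print("clusters found:", n_clusters)
print("adjusted Rand index:", ari)
print("silhouette (noise excluded):", sil)
```

Internal scores like the silhouette are mainly useful for comparing several runs (different algorithms or hyperparameters) against each other.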
For outlier detection, use an actual outlier detection algorithm instead of DBSCAN.
Noise as detected by DBSCAN is not the same as outliers. For example, if your data is all uniform random data, it ought to be considered "noise", but none of the points will be outliers. All the data is normal noise.
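The answer above does not name a specific algorithm. As one possible choice, here is a small sketch with scikit-learn's LocalOutlierFactor; the data, the planted point and n_neighbors=20 are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 ordinary points plus one planted point far away from them.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[10.0, 10.0]])
X = np.vstack([inliers, planted])

# fit_predict returns 1 for inliers and -1 for points flagged as outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

# More negative values mean "more outlying".
scores = lof.negative_outlier_factor_

print("flagged as outliers:", int((labels == -1).sum()))
print("planted point label:", labels[-1])
```

Note that the whole dataset is passed in at once, with no train/test split.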
Let me add another important point here:
You cannot test unsupervised learning methods. The main idea of unsupervised learning methods is to discover a target that is not predefined.
Supervised learning methods in machine learning --> train/test or train/dev/test split
unsupervised learning --> no split
Depending on your dataset for outliers there are also other statistical methods to identify outliers:
Often-times stakeholders don't want a black-box model that's good at predicting; they want insights about features to have a better understanding about their business, and so they can explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap but running even some of the boilerplate examples throws errors so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed)?
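from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
# Synthetic dataset with a fixed seed so the data is reproducible
X, y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
# Show the importance plot referred to above
plot_importance(xgb)
plt.show()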
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see if a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the
purpose of interpreting complex Machine Learning algorithms.
Interpreting a linear regression model is not as complicated as
interpreting Support Vector Machine, Random Forest or Gradient
Boosting Machine models; this is where Partial Dependence Plots can come
into use. For some statistical explanation you can refer here and More
Advance. Some of the algorithms have methods for finding variable
importance, but they do not express whether a variable is positively or
negatively affecting the model.
tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
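A minimal sketch of computing partial dependence without plotting, using sklearn.inspection.partial_dependence. Two assumptions: sklearn's GradientBoostingClassifier is swapped in for XGBClassifier to keep the example to one library, and the feature is picked by argmax of feature_importances_ rather than hard-coded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Same synthetic data as in the question; the model is a stand-in.
X, y = make_classification(random_state=68)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pick the feature the model itself ranks as most important.
top = int(np.argmax(model.feature_importances_))

# Average model response over the data while the chosen feature is
# swept across a grid of values.
pd_result = partial_dependence(
    model, X, features=[top], kind="average", grid_resolution=20
)
curve = pd_result["average"][0]

# Crude summary of direction: does the response end higher or lower
# than it starts as the feature increases?
trend = np.sign(curve[-1] - curve[0])

print("most important feature index:", top)
print("overall direction (+1 up, -1 down, 0 flat):", trend)
```

The sign of the overall trend is only a rough summary. The curve itself can be non-monotonic, which is why it is usually plotted.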
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively correlated, i.e., does a change in the feature X cause the prediction y to increase/decrease?
1. Predictive power
Your feature importance shows you which features retain the most information and are therefore the most significant. Power could also imply what causes the biggest change; to check that, you would have to plug in dummy values and see their overall impact, much like you would with linear regression coefficients.
2. Correlation/Dependence
As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (decision trees with a low number of splits, usually only one).
In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step it calculates a vector for the "direction" of the weak learner, so you in principle know the direction of the influence from it (but keep in mind it may appear many times in one tree, in multiple steps of the additive model).
But, to cut to the chase: you could just fix all your features apart from f19, make predictions for a range of f19 values, and see how they relate to the response value.
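A rough sketch of that manual sweep. The assumptions here are mine: the other features are fixed at their median, the grid has 25 points, and sklearn's GradientBoostingClassifier stands in for the xgboost model. Unlike a partial dependence plot, this looks at a single baseline row instead of averaging over the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(random_state=68)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

feature = 19  # column index of f19

# Range of values to try for f19, spanning what was seen in the data.
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), num=25)

# Fix every other feature at its median (an arbitrary baseline choice).
baseline = np.median(X, axis=0)
probe = np.tile(baseline, (len(grid), 1))
probe[:, feature] = grid

# Predicted probability of class 1 for each f19 value.
proba = model.predict_proba(probe)[:, 1]

for value, p in zip(grid, proba):
    print(f"f19={value:+.2f} -> P(y=1)={p:.3f}")
```

If the probabilities rise as f19 rises, the feature contributes positively around that baseline; a different baseline can give a different picture when features interact.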
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.
The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, an entropy-based criterion (the information gain ratio) is used to choose splits. This means that the selected features are the ones that allow classification with the fewest decision steps.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes, we do. Feature importance is not some magical object; it is a well-defined mathematical criterion. Its exact definition depends on the particular model (and/or some additional choices), but it is always an object which tells you "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for this feature to be used on a decision path when a randomly selected training data point is pushed through the tree. So it gives the "why" in a proper, mathematical sense.
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking of a logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN and so on, evaluate each of them with cross-validation, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project,
which tests a range of parameters with a genetic algorithm.
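Something along these lines, as a sketch only: the synthetic data, the three candidate models and the parameter grid are placeholders I made up, not tuned suggestions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the real 500,000-record dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Step 1: compare a few algorithms with 5-fold cross-validation.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(),
}
cv_means = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in candidates.items()
}
print(cv_means)

# Step 2: tune one of them over a small, hypothetical parameter grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Since each subject contributes several records, a grouped splitter such as GroupKFold may be worth considering, so that records from one subject do not end up in both the training and validation folds.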
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
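For illustration, a tiny sketch of OneHotEncoder on a single made-up categorical column (the color values are hypothetical):

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three distinct values (toy data).
colors = [["red"], ["green"], ["blue"], ["red"]]

# handle_unknown="ignore" encodes unseen categories as all zeros
# at prediction time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

# The result is a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # one column per category
```

Each distinct value becomes its own column, which is why this gets unwieldy when a feature has many possible values.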
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
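One way to do this, as a sketch, is PCA; the choice of PCA, the scaling step and the synthetic data are my assumptions, and other reduction methods would work too.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 20 features, like the dataset described.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance kept by the 2-D projection.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, explained)
```

X_2d can then be passed to a scatter plot colored by y. If only a small fraction of the variance is kept, the picture may hide structure that exists in the full feature space.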
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.
I am looking for a Python online learning/incremental learning algorithm of 'reasonable' complexity.
In Scikit-learn I have found a few algorithms with the partial_fit method, namely ['BernoulliNB', 'GaussianNB', 'MiniBatchKMeans', 'MultinomialNB', 'PassiveAggressiveClassifier', 'PassiveAggressiveRegressor', 'Perceptron', 'SGDClassifier', 'SGDRegressor']
All these algorithms form simple decision boundaries as far as I can see. Do we have out-of-the-box online algorithms somewhere in Python which can model more complex decision boundaries?
Correction: As noted below K-means does of course not have a simple decision boundary. What I was looking for was supervised algorithms capable of, e.g., XOR.
One general approach is to combine a Linear-Classifier with some Kernel-Approximation techniques, e.g.:
SGD-based SVM/Logistic Regression with:
RBFSampler / Random Kitchen Sinks
Just chain the two steps (transform each batch with the sampler, then call partial_fit on the linear classifier) and you are still able to learn incrementally.
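A rough sketch of this on an XOR-like problem. As far as I know, sklearn's Pipeline object does not expose partial_fit itself, so the two steps are chained by hand here. The gamma, n_components, batch size and number of passes are arbitrary choices of mine.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_xor(n):
    """Points in [-1, 1]^2, labelled 1 when both coordinates share a sign."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    return X, y

X_train, y_train = make_xor(2000)
X_test, y_test = make_xor(500)

# The sampler's random projection depends only on the number of input
# features, so it can be fitted once on a single row.
sampler = RBFSampler(gamma=1.0, n_components=200, random_state=0)
sampler.fit(X_train[:1])

# Linear SVM trained by SGD (hinge loss is the default).
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Feed the data in mini-batches, as it would arrive in an online setting.
for epoch in range(5):
    for start in range(0, len(X_train), 100):
        batch = slice(start, start + 100)
        clf.partial_fit(sampler.transform(X_train[batch]), y_train[batch], classes=classes)

accuracy = clf.score(sampler.transform(X_test), y_test)
print("held-out accuracy:", accuracy)
```

A plain linear classifier on the raw two features cannot separate this data, so anything well above 0.5 comes from the kernel approximation.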
One more remark (regarding your list of algorithms): KMeans and KNearestNeighbor do not form a linear decision boundary!
Is there any implementation of an incremental SVM that can also return the probability of a given feature vector belonging to each class? Preferably usable from Python code.
I have heard about LaSVM. Does LaSVM have a feature for returning probability estimates? Also, does it have features for handling imbalanced training datasets?
You can have a look at scikit-learn, a very flexible and efficient library written in Python.
Every fitted model stores its internally calculated values. If clf is your SVM classifier, you can call clf.decision_function to see the signed scores behind the predictions.
It also provides a good set of tools for preprocessing data, among other things you may find interesting.
To get probability estimates you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: "How to know what classes are represented in return array from predict_proba in Scikit-learn".
The other gives signed values for ranking (not probabilities, but it generally gives better results): see the answer to "Scikit-learn predict_proba gives wrong answers".
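A small sketch of both alternatives with scikit-learn's SVC. Note that SVC is a batch SVM, not an incremental one, and the synthetic three-class data is made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy three-class problem.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# probability=True enables predict_proba (at extra training cost).
clf = SVC(probability=True, random_state=0).fit(X, y)

# Alternative 1: class probabilities, one column per class.
proba = clf.predict_proba(X[:5])

# Alternative 2: signed decision values, useful for ranking.
scores = clf.decision_function(X[:5])

# Column i of proba corresponds to clf.classes_[i].
print(clf.classes_)
print(proba)
print(scores)
```

The column order of predict_proba follows clf.classes_, which is what the first linked question is about.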