Is Testing the DBSCAN clustering algorithm possible? And if yes, how? - python

I want to use the DBSCAN clustering algorithm in order to detect outliers in my dataset. As this is an unsupervised learning approach, do I need to split my dataset in training and test data or is testing the DBSCAN algorithm just not possible? For outlier detection reasons, should I feed the DBSCAN model with my entire dataset?
In case testing DBSCAN is possible, can you suggest ways in doing that with Python?

You don't need to split your data into test and train. However, you should have a sample of labelled data from your original data if you wish to evaluate your model. There are other unsupervised ways as well, but they compare which clustering method is performing better relative to other methods that you try (algorithms or different hyperparameters).
I would suggest reading - https://scikit-learn.org/stable/modules/clustering.html
The section 2.3.10 shows the various methods for evaluation of your clustering models, and the sklearn API needed to implement them.
You can choose which one suits your requirement best based on your problem statement.

For outlier detection, use an actual outlier detection algorithm instead of DBSCAN.
Noise as detected by DBSCAN is not the same as outliers. For examples if you data is all uniform random data, this ought to be considered "noise", but none of them will be outliers. All the data is normal noise.

Let me add another important point here:
You cannot test unsupervised learning methods. The main idea from unsupervised learning methods is to define a non-predefiend target.
Supervised learning methods in machine learning --> train/test or train/dev/test split
unsupervised learning --> no split
Depending on your dataset for outliers there are also other statistical methods to identify outliers:
quantilies
z-score

Related

How to use KMeans clustering to improve the accuracy of a logistic regression model?

I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv = 5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
, the score is in the 60s. My questions are:
What does the significance (or meaning) of the score that is produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score when evaluated via cross_val_predict and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevent ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small amount of the story. knowing what data your classifying and it's general form is pretty vital, and accuracy doesn't tell us a lot about how innaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and another class is 100% accurate? are the classes both 75% accurate?
what is the class balance? (is there more of one class than the other)?
how much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
these plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged Y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high dimensional space. In the image above, this is a very easy classification problem as each class exists in it's own island. The more these islands mix together, the harder classification will be.
did a grid search to find the best parameters
hot take, but don't use grid search, random search is better. (source Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
ok, answering more directly
these accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. you can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.

Which is the best clustering algorithm for clustering multidimensional data with low density difference?

I am working on a project currently and I wish to cluster multi-dimensional data. I tried K-Means clustering and DBSCAN clustering, both being completely different algorithms.
The K-Means model returned a fairly good output, it returned 5 clusters but I have read that when the dimensionality is large, the Euclidean distance fails so I don't know if I can trust this model.
On trying the DBSCAN model, the model generated a lot of noise points and clustered a lot of points in one cluster. I tried the KNN dist plot method to find the optimal eps for the model but I can't seem to make the model work. This led to my conclusion that maybe the density of the points plotted is very high and maybe that is the reason I am getting a lot of points in one cluster.
For clustering, I am using 10 different columns of data. Should I change the algorithm I am using? What would be a better algorithm for multi-dimensional data and with less-varying density?
You can first make a dimension reduction on your dataset with PCA/LDA/t-sne or autoencoders. Then run standart some clustering algorithms.
Another way is you can use fancy deep clustering methods. This blog post is really nice explanation of how they apply deep clustering on the high dimensional dataset.
Maybe this provides you with some inspiration: Scikit-learn clustering algorithms
I suggest you try a few out. Hope that helps!

Proper way to handle highly imbalanced data - binary classification

I have a really large dataset with 60 million rows and 11 features.
It is highly imbalanced dataset, 20:1 (signal:background).
As I saw, there are two ways to tackle this problem:
First: Under-sampling/Oversampling.
I have two problems/questions in this way.
If I make under-sampling before train test split, I am losing a lot of data.
But more important, If I train a model on a balanced dataset, I am losing information about the frequency of my signal data(let's say the frequency of benign tumor over malignant), and because model is trained on and evaluated, model will perform well. But if sometime in the future I am going to try my model on new data, it will bad perform because real data is imbalanced.
If I made undersampling after train test split, my model will underfit because it will be trained on balanced data but validated/tested on imbalanced.
Second - class weight penalty
Can I use class weight penalty for XBG, Random Forest, Logistic Regression?
So, everybody, I am looking for an explanation and idea for a way of work on this kind of problem.
Thank you in advance, I will appreciate any of your help.
I suggest this quick paper by Breiman (author of Random Forest):
Using Random Forest to Learn Imbalanced Data
The suggested methods are weighted RF, where you compute the splits using weighted Gini (or Entropy, which in my opinion is better when weighted), and Balanced Random Forest, where you try to balance the classes during the bootstrap.
Both methods can be implemented also for boosted trees!
One of the suggested methodologies could be using Synthetic Minority oversampling technique (SMOTE) which attempts to balance the data set by creating synthetic instances. And train the balanced data set using any of the classification algorithm.
For comparing multiple models, Area Under the ROC Curve (AUC score) can be used to determine which model is superior.
This guide will be able to give you some ideas on different methodologies you can use and compare to resolve imbalance problem.
The above issue is pretty common when dealing with medical datasets and other types of fault detection where one of the classes (ill-effect) is always under-represented.
The best way to tackle this is to generate folds and apply cross validation. The folds should be generated in a way to balance the classes in each fold. In your case this creates 20 folds, each has the same under-represented class and a different fraction of the over-represented class.
Generating balanced folds and using cross validation also results in a better generalised and robust model. In your case, 20 folds might seem to harsh, so you can possibly create 10 folds each with a 2:1 class ratio.

How to run predict() on "precomputed" data for clustering in python

I have my own precomputed data for running AP or Kmeans in python. However when I go to run predict() as I would like to run a train() and test() on the data to see if the clusterings have a good accuracy on the class or clusters, Python tells me that predict() is not available for "precomputed" data.
Is there another way to run a train / test on clustered data in python?
Most clustering algorithms, including AP, have no well-defined way to "predict" on new data. K-means is one of the few cases simple enough to allow a "prediction" consistent with the initial clusters.
Now sklearn has this oddity of trying to squeeze everything into a supervised API. Clustering algorithms have a fit(X, y) method, but ignore y, and are supposed to have a predict method even though the algorithms don't have such a capability.
For affinity propagation, someone at some point decided to add a predict based on k-means: It always predicts the nearest center. Computing the mean only is possible with coordinate data, and hence the method fails with metric=precomputed.
If you want to replicate this behavior, computer the distances to all cluster centers, and choose the argmin, that's all. You can't fit this into the sklearn API easily with "precomputed" metrics. You could require the user to pass a distance vector to all "training" examples for the precomputed metric, but only few of them are needed...
In my opinion, I'd rather remove this method altogether:
It is not in published research on affinity propagation that I know
Affinity propagation is based on concepts of similarity ("affinity") not on distance or means
This predict will not return the same results as the points were labeled by AP, because AP is labeling points using a "propagated responsibility", rather than the nearest "center". (The current sklearn implementation may be losing this information...)
Clustering methods don't have a consistent predict anyway - it's not a requirement to have this.
If you want to do this kind of prediction, just pass the cluster centers to a nearest neighbor classifier. That is what is re-implemented here, a hidden NN classifier. So you get more flexibility if you make prediction a second (classification) step.
Note that it clustering it is not common to do any test-train split, because you don't use the labels anyway, and use only unsupervised evaluation methods (if any at all, because these have their own array of issues) if any at all - you cannot reliably do "hyperparameter optimization" here, but have to choose parameters based on experience and humans looking at the data.

how to predict binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing the parameter is a difficult task in any data mining project. Because it must customized for your data and problem. Try different algorithm like SVM,Random Forest, Logistic Regression, KNN and... and test Cross Validation for each of them and then compare them.
You can use GridSearch in sickit learn to try different parameters and optimize the parameters for each algorithm. also try this project
witch test a range of parameters with genetic algorithm
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet what to use is on in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.

Categories

Resources