I am working on a school project about outlier detection. I think I will create my own small dataset and use DBSCAN to work with it. My idea is a dataset about whether a click on an ad on a website is a cheat (fraudulent) click or not. Below is detailed information about the dataset I am going to create.
Dataset name: Cheat Ads Click Detection.
Column: value
source: (categorical) url: 0, redirect: 1, search: 2
visited_before: (categorical) no: 0, few_times: 1, fan: 2
time_on_site (seconds): (numerical) time the user spent on the site before leaving, in seconds.
active_type: (categorical) fake_active: 0 (e.g. they just open the website and do nothing but click ads), normal_active: 1, real_active: 2 (maybe I will turn this into an activity score: a float value from 0 to 10).
Cheat (label): (categorical) no: 0, yes: 1
Maybe I will have some more columns, like the number of times the user clicked on ads, ...
My question is: do you think DBSCAN can work well on this dataset? If yes, can you please give me some tips to make a great dataset, or to create the dataset faster? And if not, please suggest some other datasets that DBSCAN can work well with.
Thank you so much.
DBSCAN has the inherent ability to detect outliers, since points that are outliers will fail to belong to any cluster.
Wiki states:
it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away)
This can be easily demonstrated using synthetic datasets from sklearn such as make_moons and make_blobs. Sklearn has a pretty decent demo on this.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

x, label = make_moons(n_samples=200, noise=0.1, random_state=19)
plt.plot(x[:, 0], x[:, 1], 'ro')
plt.show()
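Continuing that snippet, here is a minimal sketch (not the sklearn demo itself) of how DBSCAN marks low-density points as outliers on the moons data; the eps and min_samples values are assumptions you would tune.

from sklearn.cluster import DBSCAN

# Points that do not fall inside any dense region get the label -1 (noise/outliers)
db = DBSCAN(eps=0.2, min_samples=5).fit(x)
outliers = x[db.labels_ == -1]
print(len(outliers), 'points flagged as noise')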
I implemented the DBSCAN algorithm a while ago to learn (the repo has since been moved). However, as Anony-Mousse has stated:
noise (low density) is not the same as outlier
And the intuition learned from synthetic datasets doesn't necessarily carry over to actual real-life data, so the above-suggested dataset and implementation are only meant for learning purposes.
You are describing a classification problem, not a clustering problem.
Also, that data does not have a notion of density, does it?
Last but not least: (A) click fraud is heavily clustered, not outliers; (B) noise (low density) is not the same as outlier (rare); and (C) first get the data, then speculate about possible algorithms, because what if you can't get the data?
Couldn't find the answer elsewhere.
Background:
I have data from the Business Dynamics Survey, a dataset that aggregates information on firms by firm characteristics. I am trying to approximate the firm size distribution.
Now, the data features 10 firm size categories, the corresponding number of firms in each category, and the level of employment. Sample:
For the life of me I can't figure out how to transform that into a histogram to perform a kernel estimation. A quick look at the docs didn't yield any useful info, because honestly I don't really know what I am looking for. Maybe someone can point me in the right direction?
Hello future internet person who's landed here. Two similar ways to approach it. Either
sns.displot(x=list(df.fsize), weights=list(df.firms), discrete=True)
Or
sns.catplot(x='fsize', y='firms', kind='point', data=df)
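For anyone who wants to run the first option end to end, here is a self-contained sketch with made-up numbers; the fsize/firms column names mirror the snippets above and are assumptions about your DataFrame.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the BDS extract: one row per firm size category
df = pd.DataFrame({
    'fsize': [1, 2, 3, 4, 5],           # size category
    'firms': [500, 300, 120, 50, 10],   # number of firms in that category
})

# 'weights' makes each category count as many times as it has firms
sns.displot(x=list(df.fsize), weights=list(df.firms), discrete=True)
plt.show()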
To better characterize the underlying density I'm following this paper:
https://journals.sagepub.com/doi/full/10.1177/0081175018782579#_i11
Good luck on your journey!
I am using TF-IDF and DBSCAN to cluster similar human names in a database. The goal of the project is to be able to cluster together names that belong to the same person but may not necessarily be formatted or spelt the same. For example, John Smith can also be labeled in the database as J. Smith or Smith, John. Ideally the model would be able to cluster these instances together.
The dataset I'm working with has over 250K records. I understand that DBSCAN will label records that are noise as -1. However, the model is also producing an additional cluster that almost always has around 200K records in it, and the vast majority of the records within it seem like they should be in their own individual clusters. Is there a reason why this may be happening? I'm considering running another model on this large cluster to see what happens.
Any advice would be greatly appreciated. Thanks!
First off, DBSCAN is a reasonable method for unsupervised clustering when the number of clusters you have is unknown.
You need to pass a matrix of distances for every string you are clustering on. Which string similarity metric you use is your choice. Here is an example with Levenshtein distance, where names is a list or array of your strings for clustering:
import Levenshtein as Lev
import numpy as np
from sklearn.cluster import DBSCAN

# Pairwise Levenshtein (edit) distances between all names
lev_distance = np.array([[Lev.distance(v1, v2) for v1 in names] for v2 in names])

# metric='precomputed' tells DBSCAN the matrix already holds distances
dbscan = DBSCAN(eps=5, min_samples=1, metric='precomputed')
dbscan.fit(lev_distance)
Because we are using Levenshtein distance, eps is effectively the number of edits needed to turn one string into the other. Tune it for your use case. The biggest concern is longer names being shortened ('malala yousafzai' vs 'malala y.' requires more edits than 'jane doe' to 'jane d.').
My assumption as to why your current code has most of your dataset clustered: your eps value is tuned too high.
You called it 'DBSCAN', and I know what you're talking about because I'm doing this at work right now, but your description sounds much more like fuzzy matching. Check out the link below and see if that helps you get to your end game.
https://medium.com/analytics-vidhya/matching-messy-pandas-columns-with-fuzzywuzzy-4adda6c7994f
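To give a flavour of what the article covers, here is a minimal fuzzy-matching sketch using thefuzz (the maintained fork of fuzzywuzzy); the names list and the query are made up.

from thefuzz import fuzz, process

names = ['John Smith', 'J. Smith', 'Smith, John', 'Jane Doe']  # hypothetical data

# token_sort_ratio ignores word order, so 'Smith, John' still scores high against 'John Smith'
matches = process.extract('john smith', names, scorer=fuzz.token_sort_ratio, limit=3)
print(matches)  # list of (candidate, score) pairs, scores run from 0 to 100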
Also, below is a link to a canonical example of DBSCAN, but again, I don't think that's what you actually want to do.
https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea
I am working on the classification of a 3D point cloud using several Python libraries (whitebox, PCL, PDAL). My goal is to classify the soil. The dataset has been classified by a company, so I am using their classification as ground truth.
For the moment I am able to classify the soil; to do that I declassified the dataset and redid the classification with PDAL. Now I'm at the stage of comparing the two datasets to see the quality of my classification.
I made a script which takes the XYZ coordinates of the two sets, puts them in lists, and compares them one by one. However, the dataset contains around 5 million points, and at the beginning it took about 1 minute per 5 points; after a few minutes everything crashes. Can anyone give me tips? Here is a picture of my clouds: the set on the left is the ground truth and the one on the right was classified by me.
Your problem is that you are not using any spatial data structure to ease your point proximity queries. There are several ways to mitigate this issue, such as a KD-tree or an octree.
By using such spatial structures you will be able to discard a large portion of unnecessary distance computations, thus improving the performance.
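As a minimal sketch of the KD-tree idea (using SciPy, with made-up array names and sizes), the point-by-point comparison collapses into one bulk nearest-neighbour query:

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-ins for the two clouds: (N, 3) arrays of XYZ coordinates
rng = np.random.default_rng(0)
ground_truth_xyz = rng.random((100_000, 3))   # the company's classified cloud
my_xyz = rng.random((100_000, 3))             # the cloud classified with PDAL

# Build the tree once on the reference cloud, then query all points in one call
tree = cKDTree(ground_truth_xyz)
distances, indices = tree.query(my_xyz, k=1)

# Points whose nearest reference neighbour is farther than some tolerance
# can be flagged for closer inspection (the 0.05 tolerance is an assumption)
mismatched = distances > 0.05
print(mismatched.sum(), 'points have no close match in the ground truth')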
I have created a 4-cluster k-means customer segmentation in scikit learn (Python). The idea is that every month, the business gets an overview of the shifts in size of our customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to which cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
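A minimal sketch of that idea with scikit-learn, assuming made-up data and file names; the point is that the stored centres, not a refitted model, decide the assignment:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins: last month's feature matrix and this month's new data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
new_data = rng.normal(size=(100, 4))

kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X_train)

# Persist the fitted centres so the cluster definitions never move
np.save('cluster_centres.npy', kmeans.cluster_centers_)

# Later runs: load the stored centres and assign each new case to the closest one
centres = np.load('cluster_centres.npy')
labels = np.argmin(np.linalg.norm(new_data[:, None, :] - centres[None, :, :], axis=2), axis=1)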
I have 1000 text files which contain discharge summaries for patients.
SAMPLE_1
The patient was admitted on 21/02/99. he appeared to have pneumonia at the time
of admission so we empirically covered him for community-acquired pneumonia with
ceftriaxone and azithromycin until day 2 when his blood cultures grew
out strep pneumoniae that was pan sensitive so we stopped the
ceftriaxone and completed a 5 day course of azithromycin. But on day 4
he developed diarrhea so we added flagyl to cover for c.diff, which
did come back positive on day 6 so he needs 3 more days of that…” this
can be summarized more concisely as follows: “Completed 5 day course
of azithromycin for pan sensitive strep pneumoniae pneumonia
complicated by c.diff colitis. Currently on day 7/10 of flagyl and
c.diff negative on 9/21.
SAMPLE_2
The patient is an 56-year-old female with history of previous stroke; hypertension;
COPD, stable; renal carcinoma; presenting after
a fall and possible syncope. While walking, she accidentally fell to
her knees and did hit her head on the ground, near her left eye. Her
fall was not observed, but the patient does not profess any loss of
consciousness, recalling the entire event. The patient does have a
history of previous falls, one of which resulted in a hip fracture.
She has had physical therapy and recovered completely from that.
Initial examination showed bruising around the left eye, normal lung
examination, normal heart examination, normal neurologic function with
a baseline decreased mobility of her left arm. The patient was
admitted for evaluation of her fall and to rule out syncope and
possible stroke with her positive histories.
I also have a CSV file which is 1000 rows × 5 columns. Each row has information entered manually for each of the text files.
So for example for the above two files, someone has manually entered these records in the csv file:
Sex, Primary Disease,Age, Date of admission,Other complications
M,Pneumonia, NA, 21/02/99, Diarhhea
F,(Hypertension,stroke), 56, NA, NA
My question is:
How do I represent/use this text-to-labels information for a machine learning algorithm?
Do I need to do some manual labelling around the areas of interest in all the 1000 text files?
If yes, then how and which method should I use? (e.g. <ADMISSION>was admitted on 21/02/99</ADMISSION>, <AGE>56-year-old</AGE>)
So basically, how do I use this text:labels data to automate the filling of the labels?
As far as I can tell, the point is not to mark up the texts, but to extract the information represented by the annotations. This is an information extraction problem, and you should read up on techniques for it. The CSV file contains the information you want to extract (your "gold standard"), so you should start by splitting it into training (90%) and testing (10%) subsets.
There is a named entity recognition task in there: recognize diseases, numbers, dates and gender. You could use an off-the-shelf chunker, or find an annotated medical corpus and use it to train one. You can also use a mix of approaches; spotting words that reveal gender is something you could hand-code pretty easily, for example. Once you have all these words, you need some more work, for example to distinguish the primary disease from the symptoms, the age from other numbers, and the date of admission from any other dates. This is probably best done as a separate classification task.
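To make the hand-coded spotting idea concrete, here is a minimal sketch (my own illustration, not a recommended production approach) that pulls gender cues, dates and ages out of one summary with regular expressions; the patterns are assumptions and would need refinement:

import re

text = "The patient is an 56-year-old female with history of previous stroke; presenting after a fall."

# Very rough gender cue: look for gendered words (female checked first so 'male' inside 'female' can't mislead)
gender = None
if re.search(r'\b(she|her|female)\b', text, re.IGNORECASE):
    gender = 'F'
elif re.search(r'\b(he|his|male)\b', text, re.IGNORECASE):
    gender = 'M'

# Dates in dd/mm/yy(yy) form, and ages written like '56-year-old'
dates = re.findall(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', text)
ages = re.findall(r'\b(\d{1,3})-year-old\b', text)

print(gender, dates, ages)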
I recommend you now read through the nltk book, chapter by chapter, so that you have some idea of what the available techniques and tools are. It's the approach that matters, so don't get bogged down in comparisons of specific machine learning engines.
I'm afraid the algorithm that fills the gaps has not yet been invented. If the gaps were strongly correlated or had some sort of causality you might be able to model that with some sort of Bayesian model. Still with the amount of data you have this is pretty much impossible.
Now on the more practical side of things. You can take two approaches:
Treat the problem as a document-level task, in which case you can just take all rows with a label, train on them, and infer the labels/values of the rest. You should look at Naïve Bayes, multi-class SVM, MaxEnt, etc. for the categorical columns and linear regression for predicting the numerical values (see the sketch after this list).
Treat the problem as an information extraction task, in which case you have to add the annotation you mentioned inside the text and train a sequence model. You should look at CRFs, structured SVMs, HMMs, etc. Actually, you could look at some systems that adapt multi-class classifiers to sequence labeling tasks, e.g. SVMTool for POS tagging (it can be adapted to most sequence labeling tasks).
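As a rough sketch of option 1 (my own illustration; the example texts come from your samples, and predicting the 'Primary Disease' column is an assumption about which column you start with), a bag-of-words classifier could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical inputs: raw summaries and the manually entered 'Primary Disease' values
texts = ["The patient was admitted on 21/02/99. he appeared to have pneumonia at the time of admission ...",
         "The patient is an 56-year-old female with history of previous stroke; hypertension; COPD ..."]
labels = ["Pneumonia", "(Hypertension,stroke)"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)                                   # train on the labelled rows
print(model.predict(["patient admitted with cough and fever"]))  # infer for the rest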
Now, about the problems you will face. With approach 1 it is very unlikely that you will predict the date of the record with any algorithm. It might be possible to roughly predict the patient's age, as this is something that usually correlates with diseases, etc. And it's very, very unlikely that you will even be able to set up the disease column as an entity extraction task.
If I had to solve your problem I would probably pick approach 2, which is imho the correct approach, but it is also quite a bit of work. In that case, you will need to create the markup annotations yourself. A good starting point is an annotation tool called brat. Once you have your annotations, you could develop a classifier in the style of CoNLL-2003.
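For a feel of what that training data looks like, here is a tiny CoNLL-2003-style, token-per-line example with BIO tags; the DATE/DISEASE label set is my assumption for this domain, not part of the original CoNLL task:

The        O
patient    O
was        O
admitted   O
on         O
21/02/99   B-DATE
.          O
he         O
appeared   O
to         O
have       O
pneumonia  B-DISEASE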
What you are trying to achieve is quite ambitious, especially with only 1000 records. I think (depending on your data) you may be better off using ready-made products instead of building them yourself. There are open-source and commercial products that you might be able to use -- lexigram.io has an API, and MetaMap and Apache cTAKES are state-of-the-art open-source tools for clinical entity extraction.