I am using TF-IDF and DBSCAN to cluster similar human names in a database. The goal of the project is to cluster together names that belong to the same person but may not be formatted or spelt the same way. For example, John Smith may also appear in the database as J. Smith or Smith, John. Ideally the model would cluster these instances together.
The dataset I'm working with has over 250K records. I understand that DBSCAN will label noise records as -1. However, the model is also producing an additional cluster that almost always contains around 200K records, and the vast majority of those records look like they should be in their own individual clusters. Is there a reason why this might be happening? I'm considering running another model on this large cluster to see what happens.
Any advice would be greatly appreciated. Thanks!
First off, DBSCAN is a reasonable method for unsupervised clustering when the number of clusters is unknown.
You need to pass a matrix of distances for every string you are clustering on. Which string similarity metric you use is your choice. Here is an example with Levenshtein distance, where names is a list or array of the strings you are clustering:
import Levenshtein as Lev
import numpy as np
from sklearn.cluster import DBSCAN

# pairwise edit-distance matrix between all names
lev_distance = np.array([[Lev.distance(v1, v2) for v1 in names] for v2 in names])

# metric='precomputed' tells DBSCAN the input is already a distance matrix
dbscan = DBSCAN(eps=5, min_samples=1, metric='precomputed')
dbscan.fit(lev_distance)
Because we are using Levenshtein distance, eps is effectively the number of edit operations (insertions, deletions, substitutions) needed to turn one string into the other. Tune it for your use case. The biggest concern is longer names being shortened ('malala yousafzai' vs 'malala y.' takes more edits than 'jane doe' vs 'jane d.').
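If that length sensitivity turns out to be a problem, one option (not part of the original suggestion, just a sketch with placeholder names and an illustrative eps) is to normalize the edit distance by the length of the longer string, so eps becomes a fraction between 0 and 1:

import Levenshtein as Lev
import numpy as np
from sklearn.cluster import DBSCAN

def normalized_lev(a, b):
    # edit distance divided by the longer length: 0.0 = identical, 1.0 = completely different
    return Lev.distance(a, b) / max(len(a), len(b), 1)

names = ['john smith', 'jon smith', 'j. smith', 'jane doe']  # placeholder data

dist = np.array([[normalized_lev(a, b) for a in names] for b in names])

# eps=0.4 is purely illustrative; tune it on your own data
dbscan = DBSCAN(eps=0.4, min_samples=1, metric='precomputed').fit(dist)
print(dbscan.labels_)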
My assumption as to why your current code puts most of your dataset into one cluster: your eps value is tuned too high.
You called it 'DBSCAN', and I know what you're talking about because I'm doing this at work right now, but your description sounds much more like fuzzy matching. Check out the link below and see if it helps you get to your end goal.
https://medium.com/analytics-vidhya/matching-messy-pandas-columns-with-fuzzywuzzy-4adda6c7994f
Also, below is a link to a canonical example of DBSCAN, but again, I don't think that's what you actually want to do.
https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea
Related
I collected some product reviews from different users of a website, and I'm trying to find similarities between products using embeddings of the words the users wrote.
I grouped the reviews by product, so that different reviews (i.e. from different authors) for one product follow one another in my dataframe. I have also already tokenized the reviews (and applied all other pre-processing steps). Below is a mock-up of the dataframe I have (the list of tokens per product is actually very long, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which would be more effective, Doc2Vec or Word2Vec. I would initially go for Doc2Vec, since it can find similarities while taking the whole paragraph/sentence into account and infer its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews coming from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a fairly good silhouette score (~0.7).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]

model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)  # SEED is defined earlier in my script
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)

# infer one vector per product from its token string
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most effective approach to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: [plot of the clusters I'm obtaining]
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
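As a rough illustration of that kind of pipeline (a sketch only, with made-up toy documents and placeholder parameters, assuming gensim and scikit-learn): build each candidate embedding, cluster it the same way, and score it with the same metric.

import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# tiny placeholder corpus; substitute your tokenized reviews per product
docs = [['amazing', 'simulator'], ['production', 'value'],
        ['fantastic', 'simulator'], ['emergency', 'operator']]

def doc2vec_vectors(docs):
    tagged = [TaggedDocument(d, [i]) for i, d in enumerate(docs)]
    m = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, seed=1)
    return np.array([m.dv[i] for i in range(len(docs))])

def mean_word2vec_vectors(docs):
    m = Word2Vec(docs, vector_size=50, min_count=1, seed=1)
    return np.array([np.mean([m.wv[w] for w in d], axis=0) for d in docs])

# same clustering and same score for every candidate embedding
for name, vecs in [('doc2vec', doc2vec_vectors(docs)),
                   ('mean word2vec', mean_word2vec_vectors(docs))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(vecs)
    print(name, silhouette_score(vecs, labels))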
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, that WMD gets quite costly to calculate in bulk on larger texts.
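For reference, here is a minimal sketch of a WMD comparison with gensim (the toy corpus and snippets are made up, and wmdistance additionally requires the POT package):

from gensim.models import Word2Vec

corpus = [
    ['amazing', 'simulator', 'feel'],
    ['production', 'value', 'effect'],
    ['fantastic', 'simulator', 'experience'],
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)

doc1 = ['amazing', 'simulator']
doc2 = ['fantastic', 'simulator', 'experience']
# lower distance = more similar wording/concepts
print(model.wv.wmdistance(doc1, doc2))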
I have a dataset containing the position of a person walking in an indoor environment at a given time.
I don't have any information about the environment, just the dataset.
The table is structured like this:
(ID, X, Y, time)
where ID is the primary key, X and Y are the coordinates, and time is the timestamp.
Data is gathered at a rate of one record every 0.2 seconds (5 Hz).
Before I start any analysis on the path, speeds etc I'd like to remove the noise from the dataset but I'm not sure what approach I should use.
I've read about using clustering algorithms like DBSCAN, and for certain parameters it seems to do something, but since it clusters based on density I don't feel it's the best solution. On the other hand, ST-DBSCAN takes time into account, so it seems more appropriate, but it's still density-based.
Is there a better way to filter noise in a context like this or is DBSCAN the right approach?
If you think of your data as a 2-dimensional time series, then it makes sense to apply one of the algorithms listed here: https://github.com/rob-med/awesome-TS-anomaly-detection
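If you want a simple baseline before reaching for one of those libraries, here is a rough sketch under that same time-series framing (not from the linked list; column names follow the table in the question, and the window and threshold are placeholders): flag points whose X or Y deviates strongly from a rolling median.

import pandas as pd

def flag_outliers(df, window=25, thresh=3.0):
    flagged = pd.Series(False, index=df.index)
    for col in ['X', 'Y']:
        med = df[col].rolling(window, center=True, min_periods=1).median()
        dev = (df[col] - med).abs()
        mad = dev.rolling(window, center=True, min_periods=1).median() + 1e-9
        # flag points whose deviation is large relative to the local spread
        flagged |= dev / mad > thresh
    return flagged

# usage: df = pd.read_csv('positions.csv')  # columns ID, X, Y, time
# df_clean = df[~flag_outliers(df)]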
I am working on a school project about outlier detection. I think I will create my own small dataset and use DBSCAN on it. The dataset will be about whether a click on an ad on a website is fraudulent or not. Below is detailed information about the dataset I am going to create.
Dataset Name: Cheat Ads Click detection.
Columns (name: type and values):
source: (categorical) url: 0, redirect: 1, search: 2
visited_before: (categorical) no: 0, few_time: 1, fan: 2
time_on_site (seconds): (numerical) time the user spent on the site before leaving, in seconds
active_type: (categorical) fake_active: 0 (they just open the website and do nothing except click ads), normal_active: 1, real_active: 2 (maybe I will turn this into an activity score: a float from 0 to 10)
Cheat (label): (categorical) no: 0, yes: 1
Maybe I will add some more columns, such as the number of times the user clicked on ads, ...
My question is: do you think DBSCAN can work well on this dataset? If yes, can you please give me some tips for making a good dataset, or for creating it faster? If not, please suggest some other datasets that DBSCAN works well with.
Thank you so much.
DBSCAN has an inherent ability to detect outliers, since points that are outliers will fail to belong to any cluster.
Wiki states:
it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away)
This can be easily demonstrated using synthetic datasets from sklearn such as make_moons and make_blobs. Sklearn has a pretty decent demo on this.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# two interleaving half-moon clusters with a little Gaussian noise
x, label = make_moons(n_samples=200, noise=0.1, random_state=19)
plt.plot(x[:, 0], x[:, 1], 'ro')
plt.show()
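And a short follow-up sketch of the outlier labelling itself: points that DBSCAN cannot assign to any cluster get the label -1 (the eps value here is illustrative, not tuned).

from collections import Counter

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

x, label = make_moons(n_samples=200, noise=0.1, random_state=19)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(x)
print(Counter(labels))  # the -1 entries are the points flagged as noise/outliers
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.show()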
I implemented the DBSCAN algorithm a while ago to learn (the repo has since been moved). However, as Anony-Mousse has stated:
noise (low density) is not the same as outlier
And the intuition learned from synthetic datasets doesn't necessarily carry over to real-life data. So the dataset and implementation suggested above are only meant for learning purposes.
You are describing a classification problem, not a clustering problem.
Also, that data does not have a meaningful notion of density, does it?
Last but not least: (A) click fraud is heavily clustered, not made of outliers, (B) noise (low density) is not the same as outlier (rare), and (C) first get the data, then speculate about possible algorithms, because what if you can't get the data?
I have a set of product data with specifications etc.
I have applied k-modes clustering to the dataset to form clusters of the most similar products.
When I get a new record, I want to know which cluster it belongs to and which other products are almost the same as this new product. How do I go about this?
Use the nearest neighbors.
There is no need to rely on clustering, which tends to be unstable and to produce unbalanced clusters. It's fairly common to have 90% of your data rightfully in the same cluster (e.g. a "normal users" cluster, or a "single visit" cluster). So ask yourself: what do you gain by doing this, and what is the cost-benefit ratio?
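For example (a sketch only, with made-up column names and data), you could encode the categorical product specifications and query the most similar existing products directly, without any clustering step:

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OrdinalEncoder

# placeholder product specifications
products = pd.DataFrame({
    'colour': ['red', 'blue', 'red', 'green'],
    'size':   ['S',   'M',    'S',   'L'],
})
enc = OrdinalEncoder()
X = enc.fit_transform(products)

# Hamming distance = fraction of specification fields that differ
nn = NearestNeighbors(n_neighbors=2, metric='hamming').fit(X)

new_product = pd.DataFrame({'colour': ['red'], 'size': ['S']})
dist, idx = nn.kneighbors(enc.transform(new_product))
print(products.iloc[idx[0]])  # the existing products most similar to the new one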
I have a pandas dataframe with the following 2 columns:
Database Name    Name
db1_user         Login
db1_client       Login
db_care          Login
db_control       LoginEdit
db_technology    View
db_advanced      LoginEdit
I have to cluster the Database Name values based on the field "Name". When I convert the dataframe to a numpy array using
dataset = df2.values
and print dataset.dtype, the type is object. I have just started with clustering; from what I have read, I understand that object is not a dtype suitable for k-means clustering.
Any help would be appreciated!
What is the mean of
Login
LoginEdit
View
supposed to be?
There is a reason why k-means only works on continuous numerical data: the mean is only well defined for such data.
I don't think clustering is applicable to your problem at all (look into data cleaning instead). But you clearly need a method that works with arbitrary distances, and k-means does not.
I don't understand whether you want to develop clusters for each group of "Name" attributes, or alternatively create n clusters regardless of the value of "Name"; and I don't understand what clustering would achieve here.
In any case, just a few days ago there was a similar question on the Data Science SE site (from an R user, though), asking about the similarity of the local parts of email addresses (the part before the "@") rather than database names. The problem is similar to yours.
Check this out:
https://datascience.stackexchange.com/questions/14146/text-similarities/14148#14148
The answer was comprehensive with respect to the different distance measures for strings.
Maybe this is what you should investigate. Then decide on a proper distance measure that is available in Python (or one you can program yourself) and that fits your needs.
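For instance, here is a minimal sketch using only the standard library (difflib's SequenceMatcher, applied to the database names from the question) to build a pairwise distance matrix you could then feed to a distance-based method:

from difflib import SequenceMatcher

names = ['db1_user', 'db1_client', 'db_care', 'db_control', 'db_technology', 'db_advanced']

def string_distance(a, b):
    # 0.0 = identical strings, 1.0 = completely different
    return 1.0 - SequenceMatcher(None, a, b).ratio()

distances = [[string_distance(a, b) for b in names] for a in names]
print(distances[0])  # distances from 'db1_user' to every other name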