I have a pandas dataframe with the following 2 columns:
Database Name    Name
db1_user         Login
db1_client       Login
db_care          Login
db_control       LoginEdit
db_technology    View
db_advanced      LoginEdit
I have to cluster the Database Name based on the field "Name". When I convert the dataframe to a NumPy array using
dataset = df2.values
and print dataset.dtype, the type is object. I have just started with clustering, and from what I have read, I understand that object is not a suitable type for k-means clustering.
Any help would be appreciated!
What is the mean of
Login
LoginEdit
View
supposed to be?
There is a reason why k-means only works on continuous numerical data: the mean is only well defined for such data.
I don't think clustering is applicable to your problem at all (look into data cleaning instead). But clearly you need a method that works with arbitrary distances, and k-means does not.
I don't understand whether you want to build clusters within each group of "Name" values, or create n clusters regardless of the value of "Name"; and I don't understand what clustering could achieve here.
In any case, just a few days ago there was a similar question on the Data Science SE site (from an R user, though), asking about the similarity of the local parts of email addresses (the part before the "@"), rather than of database names. The problem is similar to yours.
Check this out:
https://datascience.stackexchange.com/questions/14146/text-similarities/14148#14148
The answer was comprehensive with respect to the different distance measures for strings.
Maybe this is what you should investigate. Then decide on a proper distance measure that is available in Python (or one that you can program yourself) and that fits your needs.
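If you go down that route, here is a minimal sketch (my own illustration, not from the linked answer) that builds a pairwise string-distance matrix over the database names with the standard library's difflib and clusters it hierarchically with SciPy; the 0.5 cut-off threshold is an arbitrary placeholder you would need to tune:
from difflib import SequenceMatcher
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = ["db1_user", "db1_client", "db_care", "db_control", "db_technology", "db_advanced"]

# Pairwise string distance: 1 - similarity ratio (0 = identical, 1 = completely different).
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - SequenceMatcher(None, names[i], names[j]).ratio()
        dist[i, j] = dist[j, i] = d

# Hierarchical clustering on the precomputed distance matrix.
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")  # 0.5 is an arbitrary cut; tune it
for name, label in zip(names, labels):
    print(label, name)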
Related
I have spent the best part of the last few days searching forums and reading papers trying to solve the following question. I have thousands of time series arrays, each of varying length, containing a single column vector. This column vector contains the time between clicks for dolphins using echolocation.
I have managed to cluster these into similar groups using DTW and want to check which trains have a high degree of self-similarity, i.e. repeated patterns. I only want to know their similarity with themselves and don't care to compare them with other trains, as I have already applied DTW for that. I'm hoping some of these clusters will contain trains with a high proportion of repeated patterns.
I have already applied the Ljung–Box test to each series to check for autocorrelation, but I think I should maybe be using something with the FFT and the power spectrum. I don't have much experience in this but have tried to do so using the Python package waipy. Ultimately, I just want to know if there is some kind of repeated pattern in the data, ideally tested with a p-value. The image I have attached shows an example train across the top. The maximum length of my trains is 550.
[Image: example output from Waipy]
I know this is quite a complex question but any help would be greatly appreciated even if it is a link to a helpful Python library.
Thanks,
Dex
For anyone in a similar position: I decided to go with motifs, as they are able to find repeated patterns in a time series using Euclidean distance. There is a really good Python package called Stumpy which makes this very easy!
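For anyone wanting a concrete starting point, here is a rough sketch of motif discovery with Stumpy's matrix profile; the placeholder series and the window length m are assumptions you would replace with one of your own click trains:
import numpy as np
import stumpy

train = np.random.rand(550)  # placeholder; use one of your click trains here
m = 50                       # subsequence (window) length to compare; tune for your data

mp = stumpy.stump(train, m)           # matrix profile: column 0 = distance to nearest neighbor
profile = mp[:, 0].astype(float)      # distance of each subsequence to its best match
motif_idx = int(np.argmin(profile))   # start of the most-repeated subsequence
neighbor_idx = int(mp[motif_idx, 1])  # start of its closest match elsewhere in the series

print(f"Motif at {motif_idx}, repeats near {neighbor_idx}, distance {profile[motif_idx]:.3f}")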
Thanks,
Dex
I am using TF-IDF and DBSCAN to cluster similar human names in a database. The goal of the project is to be able to cluster together names that belong to the same person but may not necessarily be formatted or spelt the same. For example, John Smith can also be labeled in the database as J. Smith or Smith, John. Ideally the model would be able to cluster these instances together.
The dataset I'm working with has over 250K records. I understand that DBSCAN will label records that are noise as -1. However, the model is also producing an additional cluster that almost always has around 200K records in it and the vast majority of the records within seem like they should be in their own individual clusters. Is there a reason why this may be happening? I'm considering running another model on this large cluster to see what happens.
Any advice would be greatly appreciated. Thanks!
First off, DBSCAN is a reasonable clustering method when the number of clusters is unknown.
You need to pass a matrix of distances between every pair of strings you are clustering on. Which string-similarity metric you use is your choice. Here is an example with Levenshtein distance, where names is a list or array of your strings for clustering:
import Levenshtein as Lev  # pip install python-Levenshtein
import numpy as np
from sklearn.cluster import DBSCAN

# Pairwise Levenshtein (edit) distances between every pair of names.
lev_distance = np.array([[Lev.distance(v1, v2) for v1 in names] for v2 in names])

# metric='precomputed' tells DBSCAN the input is already a distance matrix.
dbscan = DBSCAN(eps=5, min_samples=1, metric='precomputed')
dbscan.fit(lev_distance)
Because we are using Levenshtein distance, eps is the number of edits needed to turn one string into the other. Tune it for your use case. The biggest concern is longer names being shortened ('malala yousafzai' vs 'malala y.' is more edits than 'jane doe' to 'jane d.').
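To sanity-check a given eps, it can help to print the names grouped by their assigned label; a small follow-up sketch continuing from the snippet above (label -1 means noise):
from collections import defaultdict

clusters = defaultdict(list)
for name, label in zip(names, dbscan.labels_):
    clusters[label].append(name)

for label, members in sorted(clusters.items()):
    print(label, members[:5], "..." if len(members) > 5 else "")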
My assumption as to why your current code puts most of your dataset in one cluster: your eps value is tuned too high.
You called it 'DBSCAN', and I know what you're talking about because I'm doing this at work right now, but your description sounds much more like fuzzy matching. Check out the link below and see if that helps you get to your end game.
https://medium.com/analytics-vidhya/matching-messy-pandas-columns-with-fuzzywuzzy-4adda6c7994f
Also, below is a link to a canonical example of DBSCAN, but again, I don't think that's what you actually want to do.
https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea
I am facing a dilemma with a project of mine. A few of the variables don't have enough data; for some, almost 99% of the observations are missing.
I am thinking of a couple of options:
Impute missing value with mean/knn imputation
Impute missing value with 0.
I couldn't think of anything else in this direction. If someone can help, that would be great.
P.S. I am not comfortable using mean imputation when 99% of the data is missing. Does someone have a reasoning for or against that? Kindly let me know.
The data has 397,576 observations; the per-variable missing-value counts were attached as an image.
99% of the data is missing!!!???
Well, if your dataset has fewer than 100,000 examples, then you may want to remove those columns instead of imputing them with any method.
If you have a larger dataset, then mean imputation or kNN imputation would be ...OK. These methods don't capture the statistics of your data and can eat up memory. Instead, use Bayesian machine-learning methods such as fitting a Gaussian process to your data, or a variational auto-encoder for those sparse columns.
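If you do go with the simpler mean or kNN baselines mentioned above, a minimal scikit-learn sketch looks like this (the toy columns are hypothetical stand-ins for your sparse variables):
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data with missing entries.
df = pd.DataFrame({"age": [25, None, 40, 35], "income": [50000, 60000, None, 52000]})

mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
print(mean_imputed, knn_imputed, sep="\n")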
1.) Here are a few links to learn about and use Gaussian processes to sample missing values from the dataset:
What is a Random Process?
How to handle missing values with GP?
2.) You can also use a VAE to impute the missing values!!!
Try reading this paper
I hope this helps!
My first question to give a good answer would be:
What are you actually trying to achieve with the completed data?
People impute data for different reasons, and the use case makes a big difference. For example, you could use imputation as:
Preprocessing step for training a machine learning model
Solution to have a nice Graphic/Plot that does not have gaps
Statistical inference tool to evaluate scientific or medical studies
99% missing data is a lot; in most cases you can expect that nothing meaningful will come out of it.
For some variables it might still make sense and produce at least something meaningful, but you have to handle this with care and think hard about your solution.
In general, imputation does not create information out of thin air. A pattern must be present in the existing data, which is then applied to the missing data.
You will probably have to decide on a per-variable basis what makes sense.
Take your variable email as an example:
Depending on how your data is structured, it might be that each row represents a different customer with a specific email address, so that every row is supposed to contain a unique email address. In this case imputation won't have any benefit: how should the algorithm guess the email? But if the data is structured differently and customers appear in multiple rows, then an algorithm can still fill in some meaningful data: seeing that customer number 4 always has the same email address, it can fill that address in for rows where only customer number 4 is given and the email is missing.
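A small pandas sketch of that per-customer fill (the column names and toy data are assumptions, not from the original question):
import pandas as pd

df = pd.DataFrame({
    "customer_id": [4, 4, 4, 7],
    "email": ["a@x.com", None, None, None],
})

# Within each customer, fill missing emails with that customer's known address.
# Customer 7 has no known address, so their email stays missing.
df["email"] = df.groupby("customer_id")["email"].transform(
    lambda s: s.ffill().bfill()
)
print(df)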
I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month the business gets an overview of how the number of customers in each cluster has shifted.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to its respective cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then, when new data comes in, compare it to each mean and assign it to the cluster with the closest mean.
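A minimal sketch of that idea with scikit-learn, using placeholder data: fit k-means once, store the centers, and later assign new rows to the nearest stored center instead of refitting:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

X_initial = np.random.rand(500, 3)        # placeholder for the original customer features
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X_initial)
np.save("cluster_centers.npy", kmeans.cluster_centers_)   # record the means once

# Later, on next month's data:
centers = np.load("cluster_centers.npy")
X_new = np.random.rand(200, 3)            # placeholder for the updated data
labels_new = pairwise_distances_argmin(X_new, centers)    # index of the closest stored mean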
I have a set of product data with specifications etc.
I have applied k-modes clustering to the dataset to form clusters of the most similar products.
When a new product comes in, I want to know which cluster it belongs to and which other products are almost the same as this new product. How do I go about this?
Use the nearest neighbors.
No need to rely on clustering, which tends to be unstable and produce unbalanced clusters. It's fairly common to have 90% of your data rightfully in the same cluster (e.g. a "normal users" cluster, or a "single visit" cluster). So you should ask yourself: what do you gain by doing this, what is the cost-benefit ratio?
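If you do go the nearest-neighbour route, here is a minimal sketch assuming the product specifications are categorical columns (the toy data and column names are made up); Hamming distance on the encoded specs roughly mirrors the matching dissimilarity that k-modes uses:
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OrdinalEncoder

products = pd.DataFrame({
    "color":    ["red", "blue", "red", "green"],
    "material": ["steel", "steel", "plastic", "steel"],
})

# Encode categories as integers and compare products with Hamming distance
# (the fraction of specification fields that differ).
enc = OrdinalEncoder()
X = enc.fit_transform(products)
nn = NearestNeighbors(n_neighbors=3, metric="hamming").fit(X)

new_product = pd.DataFrame({"color": ["red"], "material": ["steel"]})
distances, indices = nn.kneighbors(enc.transform(new_product))
print(products.iloc[indices[0]])  # the most similar existing products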