I recently worked on image clustering, which found similar images and grouped them together. I used Python's skimage module to calculate SSIM and then clustered all the images based on a threshold I had decided on.
I want to do something similar for text: create automatic clusters of similar text. For example, cluster 1 could contain all text about working mothers, cluster 2 all text about people talking about food, and so on. I understand this has to be unsupervised learning. Are there similar Python modules that could help with this task? I also checked out Google's TensorFlow to see if I could get something from it, but found nothing related to text clustering in its documentation.
There are numerous ways to approach the task. In most cases the clustering algorithms are very similar to those used for image clustering, but what you need to define is the distance metric, in this case a semantic similarity metric of some kind.
For this purpose you can use the approaches I list in another question on the topic of semantic similarity (even if it goes into a bit more detail).
One additional approach worth mentioning is the 'automatic clustering' provided by topic-modelling tools such as LSA, which you can run fairly easily using the gensim package.
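As a minimal illustration of the "define a similarity metric, then cluster by threshold" recipe, mirroring the SSIM-plus-threshold image approach from the question, here is a dependency-free sketch. It uses bag-of-words cosine similarity, which is lexical overlap rather than true semantic similarity, but it shows exactly where a better metric would plug in; the function names and threshold are mine:

```python
import math
import re
from collections import Counter

def bow(text):
    # crude bag-of-words vector for a short text
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def threshold_cluster(texts, threshold=0.3):
    # greedy: put each text into the first cluster whose seed text is
    # similar enough, otherwise start a new cluster
    clusters = []
    for t in texts:
        for c in clusters:
            if cosine(bow(c[0]), bow(t)) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

texts = [
    "working mothers juggle jobs",
    "mothers working two jobs",
    "i love italian food",
    "italian food i love most",
]
print(threshold_cluster(texts, threshold=0.3))
```

Swapping `cosine(bow(...), bow(...))` for a real semantic similarity (e.g. from a topic model or word embeddings) keeps the same clustering skeleton.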
Related
I'm quite new to the whole clustering business, so I'm a bit lost in the final bit of programming. I'm working on a project that clusters students based on the semantic similarity of their topics, using a hierarchical algorithm.
What I understand is that I have to collect all the topics in a list and then apply a clustering technique such as hierarchical clustering.
How can I write code in Python to cluster the students based on the semantic similarity between their topics?
There are a lot of clustering algorithms. I suggest NMF (non-negative matrix factorisation), which is efficient and widely used. There are many others.
BASIC: If you just want to use Python to achieve the clustering, have a look at the Python library nimfa: https://nimfa.biolab.si/
You'll have to massage your input data so it fits the expected input format.
ADVANCED: If you want to understand, learn and maybe code an existing algorithm, look at these slides: https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf
RESEARCH TOPIC: If you want to do your own algorithm, I can't help you in this SO answer ;)
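To make the NMF idea concrete without any dependency, here is a small pure-Python sketch of the standard multiplicative-update rules (in practice you would use nimfa or a similar library; the helper names and toy data are mine). Each row of V is an item described by non-negative features; after factorising V ≈ W·H, the argmax over each row of W gives a cluster assignment:

```python
import random

def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, k, iters=300, seed=0):
    """Factorise non-negative V (m x n) as W (m x k) times H (k x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        num = matmul(transpose(W), V)
        den = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        num = matmul(V, transpose(H))
        den = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

# Toy item-by-feature matrix with two obvious groups of rows.
V = [[2, 4, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
W, H = nmf(V, k=2)
labels = [max(range(2), key=row.__getitem__) for row in W]  # cluster per row
```

The same argmax-over-W trick is how NMF output is usually turned into hard cluster labels.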
In my case I have a dataset of letters and symbols detected in an image. The detected items are represented by their coordinates, type (letter, number, etc.), value and orientation, not by the actual bounding box in the image. My goal is to use this dataset to group the items into different "words", or contextual groups in general.
So far I have achieved OK-ish results by applying classic unsupervised clustering with the DBSCAN algorithm, but this is limited to the geometric distance between samples, so the resulting groups do not resemble the "words" I am aiming for. I am therefore searching for a way to influence the results of the clustering algorithm using the knowledge I have about the "word-like" nature of the clusters I need.
The approach I thought of was to create a dataset of true and false clusters and train an SVM (or any classifier) to detect whether a proposed cluster is correct or not. Still, I have no solid proof that I can train a model well enough to discriminate between good and bad clusters, and I find it difficult to represent clusters efficiently and consistently based on the features of their members. Moreover, since my "test data" would be a huge number of all possible combinations of the letters and symbols I have, the whole approach seems too complicated to attempt without any indication that it will work in the end.
To conclude, my question is whether anyone has prior experience with this kind of task (in my mind it sounds rather simple, but apparently it is not). Do you know of any supervised clustering algorithms and, if so, what is the proper way to represent clusters of data so that you can efficiently train a model on them?
Any idea, suggestion or even a hint about where I can research this will be much appreciated.
There are papers on supervised clustering. A nice, clear one is Eick et al., which is available for free. Unfortunately, I do not think any off-the-shelf libraries in python support this. There is also this in the specific realm of text, but it is a much more domain-specific approach compared to Eick.
But there is a very simple solution that is effectively a type of supervised clustering. Decision Trees essentially chop feature space into regions of high-purity, or at least attempt to. So you can do this as a quick type of supervised clustering:
Create a Decision Tree using the label data.
Think of each leaf as a "cluster."
In sklearn, you can retrieve the leaves of a Decision Tree by using the apply() method.
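A minimal sketch of that recipe (the toy data here is invented): fit a tree, then treat the leaf id that apply() returns for each sample as its cluster label. Note that one class can span several leaves, so the "clusters" can be finer than the labels:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D data: class 0 occupies two separate regions, class 1 sits between them.
X = [[0], [1], [10], [11], [20], [21]]
y = [0, 0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
clusters = tree.apply(X)  # leaf index of each sample = its "cluster"
# Samples 0/1 share a leaf, 2/3 share another, 4/5 a third:
# class 0 ends up split into two clusters because its regions are disjoint.
```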
A standard approach would be to use the dendrogram.
Then merge branches only if they agree with your positive examples and don't violate any of your negative examples.
Premise: I am not an expert in machine learning, maths or statistics. I am a linguist entering the world of ML. When answering, please try to be as explicit as you can.
My problem: I have 3000 expressions containing aspects (or characteristics, or features) that users usually comment on in online reviews. These expressions were identified and approved by humans and experts.
Example: “they play a difficult role”
The labels are: Acting (referring to the act of acting and also to actors), Direction, Script, Sound, Image.
The goal: I am trying to classify these expressions according to their aspects.
My system: I am using scikit-learn and Python in a Jupyter environment.
Technique used so far:
I built a bag-of-words matrix (tracking the presence/absence of stemmed words for each expression) and applied a multiclass SVM classifier with an RBF kernel and C = 1 (or tuned according to the final accuracy). The code I used is the one from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
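The setup above amounts to something like the following sketch (the example expressions here are invented stand-ins for my data; the real code is at the link):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented stand-ins for the 3000 expert-approved expressions.
expressions = [
    "they play a difficult role", "the actors were brilliant",
    "the director framed every shot", "the direction felt assured",
    "the script was full of holes", "the plot made no sense",
]
labels = ["Acting", "Acting", "Direction", "Direction", "Script", "Script"]

model = make_pipeline(
    CountVectorizer(binary=True),  # presence/absence bag of words
    SVC(kernel="rbf", C=1),        # RBF-kernel multiclass SVM
)
model.fit(expressions, labels)
print(model.predict(["the actors play their parts well"]))
```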
The first attempt gave an accuracy of 0.63. When I tried to create more labels out of the class Script, accuracy went down to 0.50. I was interested in doing that because I have some expressions that definitely describe the plot or the characters.
I think that the problem is due to the presence of some words that are shared among these aspects.
I searched for a way to improve the model and found something called a "learning curve". I used the official example code from the sklearn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
The result looks like the second picture (the right one), and I can't tell whether it is good or not.
In addition to this, I would like to:
import the expressions from a text file. For the moment I have just created an array and put the expressions inside it, and I don't feel comfortable with that.
find a way, if possible, to tell the system that some words are very specific/important to an aspect, and so help it improve the classification.
How can I do this? I read that in some works researchers have combined several systems. How should I handle this? How can I retrieve the resulting numbers from the first system to use them in the second one?
I would like to stress that some expressions, verbs, nouns, etc. are used a lot in some contexts and not in others. Some names are definitely names of actors and not directors, for example. In the future I would like to add more linguistic information to the system and try to improve it.
I hope I have expressed myself clearly enough and used appropriate, understandable language.
How can I cluster only the words in a given set of data? I have been going through a few algorithms online, like the k-means algorithm, but they seem to deal with document clustering rather than word clustering. Can anyone suggest a way to cluster only the words in a given set of data?
Please bear with me, I am new to Python.
Given that my last answer was indeed wrong, since it dealt with document clustering and not word clustering, here is the real answer.
What you are looking for is word2vec.
Indeed, word2vec is a Google tool based on deep learning that works really well. It transforms words into vector representations, which lets you do many things with them.
For example, one of the things it does well is capture algebraic relations between words:
vector('puppy') - vector('dog') + vector('cat') is close to vector('kitten')
vector('king') - vector('man') + vector('woman') is close to vector('queen')
What this means is that it can, in a sense, capture the context of a word, so it works really well for numerous applications.
When you have vectors instead of words, you can pretty much do anything you want. You can for example do a k-means clustering with a cosine distance as the measure of dissimilarity...
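A dependency-free sketch of that last step, assuming you already have word vectors (the toy 2-D "embeddings" below are invented; real word2vec vectors have hundreds of dimensions), using cosine similarity for the assignments:

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def kmeans_cosine(vectors, k, iters=20):
    # naive deterministic init: first k vectors as centroids
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assign each vector to the most cosine-similar centroid
        assign = [max(range(k), key=lambda c: cosine_sim(v, centroids[c]))
                  for v in vectors]
        # recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Toy 2-D "word vectors": animal words point one way, city words the other.
words = ["cat", "dog", "kitten", "paris", "london"]
vecs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
assign = kmeans_cosine(vecs, k=2)
print(list(zip(words, assign)))  # animals share one cluster, cities the other
```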
Hope this answers well to your question. You can read more about word2vec in different papers or websites if you'd like. I won't link them here since it is not the subject of the question.
Word clustering will be really disappointing because the computer does not understand language.
You could use Levenshtein distance and then do hierarchical clustering.
But:
dog and fog have a distance of 1, i.e. are highly similar.
dog and cat have 3 out of 3 letters different.
So unless you can define a good measure of similarity, don't cluster words.
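For reference, the distances above can be checked with a short dynamic-programming Levenshtein implementation (stdlib only):

```python
def levenshtein(a, b):
    # classic edit-distance DP, keeping only the previous row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("dog", "fog"))  # 1
print(levenshtein("dog", "cat"))  # 3
```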
I am new to clustering and am doing a small project on clustering tweets. I used TF-IDF and then hierarchical clustering, and I am confused about setting the threshold value for the hierarchical clustering. What should its value be, and how do I decide it?
I used the Python scikit module for the implementation.
While there are several methods that exist to help terminate hierarchical clustering (or clustering in general) there is no best general way to do this. This stems from the fact that there is no "correct" clustering of arbitrary data. Rather, "correctness" is very domain and application specific.
So while you can try out different methods (e.g., elbow or others) they will in turn have their own parameters that you will have to "tune" to obtain a clustering that you deem "correct". This video might help you out a bit (though it mainly deals with k-means, the concepts extend to other clustering approaches) - https://www.youtube.com/watch?v=3JPGv0XC6AE
I assume you are talking about choosing the amount of clusters to extract from your hierarchical clustering algorithm. There are several ways of doing this, and there is a nice Wikipedia article about it for some theory: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
For practical examples take a look at this question: Tutorial for scipy.cluster.hierarchy
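To see what a distance threshold actually does, here is a dependency-free sketch of single-linkage agglomerative clustering that stops merging once the closest pair of clusters is farther apart than the threshold (the same idea as cutting a dendrogram at a height; function names and toy points are mine):

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, threshold):
    # every point starts as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage: min pairwise distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # this is where the threshold "cuts the dendrogram"
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(len(single_linkage(pts, threshold=2)))   # 2: the two tight pairs survive
print(len(single_linkage(pts, threshold=20)))  # 1: everything merges
```

A small threshold yields many tight clusters, a large one merges everything, which is why the "right" value depends entirely on your data and goal.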