Visualizing KMeans Clustering with Many Inputs - python

I'm totally new to machine learning (and full disclosure: this is for school) and am trying to wrap my head around KMeans Clustering and its implementation. I understand the gist of the algorithm and have implemented it in Java but I'm a little confused as to how to use it on a complex dataset.
For example, I have 3 folders, A, B and C, each one containing 8 text files (so 24 text files altogether). I want to verify that I've implemented KMeans correctly by having the algorithm cluster these 24 documents into 3 clusters based on their word usage.
To this effect, I've created a word frequency matrix and applied tf-idf to it to create a sparse 24 x 2367 matrix (24 documents and 2367 words/n-grams in total). Then I run my KMeans clustering algorithm on the tf-idf matrix, but I am not getting good results.
In order to debug this I want to visualize my tf-idf matrix and the centroids I get as output, but I don't quite understand how one would visualize a 24 x 2367 matrix. I've also saved the matrix to a .csv file and want to run a python library on it, but everything I've seen expects an n x 2 matrix. How would one go about doing this?
Thanks in advance,

There are a few things that I would suggest (although I am not sure if SO is the right place for this question):
a. Since you mention that you are clustering unstructured text documents and not getting good results, you may need to apply typical text-mining pre-processing steps such as stop-word removal, punctuation removal, case-lowering and stemming before generating the TF-IDF matrix. There are other pre-processing steps, like removing numbers and patterns, which need to be evaluated on a case-by-case basis.
b. As far as visualization in 2D is concerned, you need to reduce the dimension of the feature vector to 2. The dimension may shrink from 2367 after the pre-processing, but not by much. You can then apply SVD to the TF-IDF matrix and check how much variance it explains. Reducing to 2 components may cause a great deal of information loss and the visualization will not be that meaningful, but you can give it a try and see if the results make sense (see the sketch after this list).
c. If the text content of the documents is small, you can try crafting handcrafted tags that describe each document. These tags should not number more than about 20 per document. With these new tags you can create a TF-IDF matrix and perform the SVD, which may give more interpretable results in 2D visualizations.
d. In order to evaluate the generated clusters, the Silhouette measure can also be considered.
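A minimal sketch of (b) and (d), assuming scikit-learn and matplotlib are available; the toy documents and the k=3 KMeans call below are placeholders for your own 24-document pipeline:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Replace with your 24 documents read from folders A, B and C
docs = ["cats purr and sleep", "dogs bark loudly", "kittens purr softly",
        "puppies bark and play", "stocks fell sharply", "markets rose today"]

tfidf = TfidfVectorizer().fit_transform(docs)              # n_docs x n_terms sparse matrix
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(tfidf)

# (b) reduce to 2 components with truncated SVD (works directly on sparse input)
svd = TruncatedSVD(n_components=2, random_state=0)
points_2d = svd.fit_transform(tfidf)
print("variance explained by 2 components:", svd.explained_variance_ratio_.sum())
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels)
plt.show()

# (d) silhouette measure of the clustering (closer to 1 is better)
print("silhouette:", silhouette_score(tfidf, labels))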

Because this is for school, there will be no code here, just ideas.
The CSV writing and reading will also be left to the reader (just a note: consider alternatives - saving/loading numpy arrays, h5py library, and json or msgpack for a start).
The problem for humans looking at a 24 x 2367 matrix is that it is too wide, and the numbers in it look like gibberish. But people, unlike computers, like images much more (computers don't care).
You need to map the tf-idf values to 0-255, and make an image.
24 x 2367 is well below a megapixel, but a 24 x 2367 image is a little too elongated. Pad each row to something that makes a nice rectangle or an approximate square (2400 or 2401 values should be fine) and generate an image per row. You can then look at individual rows, or tile them into a 6 x 4 image of all your documents (remember some padding in between; if your pixels are gray, choose a colorful padding).
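If you later want to turn this idea into code (outside the no-code constraint above), a rough sketch with numpy and Pillow could look like this; the random matrix stands in for your dense TF-IDF values:

import numpy as np
from PIL import Image

tfidf = np.random.rand(24, 2367)  # stand-in; use .toarray() on your sparse matrix

# Scale to 0-255 and pad each row from 2367 to 2400 values so it reshapes to 48 x 50
scaled = (255 * tfidf / tfidf.max()).astype(np.uint8)
padded = np.pad(scaled, ((0, 0), (0, 2400 - scaled.shape[1])))
tiles = padded.reshape(24, 48, 50)

# One image per document...
Image.fromarray(tiles[0], mode="L").save("doc_00.png")

# ...or a 6 x 4 grid of all documents with a 2-pixel white border in between
border = 2
grid = np.full((4 * (48 + border) + border, 6 * (50 + border) + border), 255, dtype=np.uint8)
for i, tile in enumerate(tiles):
    row, col = divmod(i, 6)
    top, left = border + row * (48 + border), border + col * (50 + border)
    grid[top:top + 48, left:left + 50] = tile
Image.fromarray(grid, mode="L").save("all_docs.png")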
Further ideas:
colormaps
PCA
t-SNE

Related

How to create word embedding using Word2Vec on Python?

I have seen many tutorials online on how to use Word2Vec (gensim).
Most tutorials show how to find the .most_similar word or the similarity between two words.
But what if I have text data X and I want to produce the word embedding vector X_vector?
So that this X_vector can be used for classification algorithms?
If X is a word (string token), you can look up its vector with word_model[X].
If X is a text, say a list of words, then note that a Word2Vec model only has vectors for words, not texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy popular baseline (a fair starting point especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming numpy is imported as np):
np.mean([word_model[word] for word in word_list], axis=0)
But, recent versions of Gensim also have a convenience .get_mean_vector() method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_mean_vector
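A hedged sketch tying the two options together; the tiny corpus and parameters are placeholders, and the resulting text vector can then feed any classifier:

import numpy as np
from gensim.models import Word2Vec

# Stand-in corpus; replace with your own tokenised texts
sentences = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "uses", "neural", "networks"],
             ["cats", "and", "dogs", "are", "animals"]]
word_model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50).wv

text = ["machine", "learning", "and", "cats"]  # one text X as a list of words

# Manual average of the per-word vectors (skipping out-of-vocabulary words)
x_vector = np.mean([word_model[w] for w in text if w in word_model], axis=0)

# The built-in helper does the same kind of averaging (gensim >= 4.2)
x_vector_builtin = word_model.get_mean_vector(text)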

Compare 2 set of 3D cloud points

I am working on the classification of a 3D point cloud using several python libraries (whitebox, PCL, PDAL). My goal is to classify the soil. The data set has already been classified by a company, so I am using their classification as ground truth.
For the moment I am able to classify the soil: I stripped the classification from the data set and redid it with PDAL. Now I'm at the stage of comparing the two datasets to see the quality of my classification.
I made a script which takes the XYZ coordinates of the two sets, puts them in lists, and compares them one by one. However, the dataset contains around 5 million points and it takes about 1 minute per 5 points at the beginning; after a few minutes everything crashes. Can anyone give me tips? Here is a picture of my clouds: the set on the left is the ground truth and the one on the right is the one classified by me.
Your problem is that you are not using any spatial data structure to speed up your point proximity queries. There are several structures that can mitigate this issue, such as a KD-tree or an octree.
By using such spatial structures you will be able to discard a large portion of unnecessary distance computations, thus improving the performance.
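A minimal sketch of the KD-tree idea with SciPy; the random arrays stand in for the XYZ coordinates of your two clouds (each of which really has around 5 million points):

import numpy as np
from scipy.spatial import cKDTree

ground_truth = np.random.rand(100_000, 3)  # stand-in for the company's ground points (XYZ)
mine = np.random.rand(100_000, 3)          # stand-in for the points you classified as ground

# Build the tree once, then query all points in one vectorised call
# instead of comparing the two lists point by point.
tree = cKDTree(ground_truth)
distances, indices = tree.query(mine, k=1)

# Points whose nearest ground-truth neighbour is within 1 cm (assuming metre units)
matched = distances < 0.01
print("matched:", matched.sum(), "of", len(mine))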

Is there a way to cluster transactions (journals) data using python if a transaction is represented by two or more rows?

In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items here. E.g. transaction (Journal Number) 1 has two lines: the receipt of cash and the income. Companies could also have transactions (journals) which consist of 3 line items or even more.
Will I first need to cleanse the data to only have one line item for each journal? I.e. cleanse the above 8 rows into 4.
Are there any python machine learning algorithms which will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transactions data. I do not know what anomalies look like so this would need to be unsupervised learning.
Use gaussians on each dimension of the data to determine what is an anomaly. Mean and variance are backed out per dimension, and if the probability of a new datapoint's value under that dimension's gaussian is below a threshold, it is considered an outlier. This creates one gaussian per dimension. You can use some feature engineering here, rather than just fitting gaussians on the raw data, as in the sketch below.
If features don't look gaussian (plot their histogram), use data transformations like log(x) or sqrt(x) to change them until they look better.
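A small sketch of that per-dimension approach with scipy; the random matrix and the threshold are placeholders for your engineered transaction features:

import numpy as np
from scipy.stats import norm

X = np.random.randn(1000, 4)           # stand-in for your (transformed) journal features
new_points = np.random.randn(10, 4)

# Back out mean and standard deviation per dimension
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Density of each new point under each dimension's gaussian, multiplied
# across dimensions (this assumes the dimensions are independent)
p = norm.pdf(new_points, loc=mu, scale=sigma).prod(axis=1)

epsilon = 1e-3                          # threshold; tune it if you ever get labelled anomalies
print("anomalies:", np.where(p < epsilon)[0])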
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously, rather than deciding whether someone is male/female).
Error analysis: however, what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to show the anomaly. You could create this dimension by combining some of the others.
To fit the gaussian a bit more to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary these parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
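And a short sketch of the multivariate version, again with placeholder data:

import numpy as np
from scipy.stats import multivariate_normal

X = np.random.randn(1000, 4)            # stand-in features

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)            # full covariance matrix captures feature correlations
model = multivariate_normal(mean=mu, cov=cov)

p = model.pdf(X)                         # density of every point
epsilon = np.percentile(p, 1)            # flag the lowest-density 1% as candidate anomalies
print("candidate anomalies:", np.where(p < epsilon)[0])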

Data Structure for Text Classification Task

I'm doing a text classification / tagging task and I would like to ask what kind of data structure would serve me best. The training data set I have is about 4 gigs (after some cleaning, but should be even smaller if I discard the rare words) with 6 million documents. Each document has 4 fields:
Document ID
Title
Body
Tags (as a string, e.g. "apple sql-server linux". This represents three tags, separated by a space. Documents can have 1-5 tags)
I've just finished the cleaning phase (stemming, stop words etc.) and I'm about to convert the documents into TF-IDF word vectors with scikit, so the output is a scipy sparse matrix. I would like to keep the Title and Body as two vectors and combine them at a later stage, once I decide what weighting to give the Title. The Title and Body are sparse vectors, but they are built with the same dictionary so they have the same number of columns.
What is the best way to represent this information? I come from R so I'm just used to storing things in data.tables / data frames but that doesn't seem very applicable for text classification and sparse matrices. One thing I thought about doing is creating my own "Document" class and just have a list of these objects to represent the corpus. I don't think this is very efficient, since I would probably want to do something like return all docs with the Tag apple.
ML algorithms I plan to run are k-means clustering, kNN, Naive Bayes and possibly SVM. There will probably be others that I haven't thought of yet.
I'm new to Python and text classification - any help is greatly appreciated, and I am especially interested in hearing from people who have done this before.
Thank you!
Your best bet is a list of dictionary objects: a list to keep all the documents, and a dictionary to keep all the information regarding each document.
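A minimal sketch of that layout (the field names are guesses based on the question); keeping the row index alongside each document makes it easy to slice the shared sparse Title/Body matrices later:

corpus = [
    {"id": 101, "title": "Intro to SQL Server", "body": "...", "tags": {"sql-server", "database"}, "row": 0},
    {"id": 102, "title": "Linux on a MacBook",  "body": "...", "tags": {"apple", "linux"},         "row": 1},
    # ... one dictionary per document, 6 million in total
]

# Return all docs with the tag "apple"
apple_docs = [doc for doc in corpus if "apple" in doc["tags"]]

# Their rows in the Title/Body sparse matrices built with the same vocabulary
apple_rows = [doc["row"] for doc in apple_docs]
# title_tfidf[apple_rows], body_tfidf[apple_rows]   # hypothetical matrix names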

Topic-based text and user similarity

I am looking to compute similarities between users and text documents using their topic representations. I.e. each document and user is represented by a vector of topics (e.g. Neuroscience, Technology, etc) and how relevant that topic is to the user/document.
My goal is then to compute the similarity between these vectors, so that I can find similar users, articles and recommended articles.
I have tried to use Pearson Correlation but it ends up taking too much memory and time once it reaches ~40k articles and the vectors' length is around 10k.
I am using numpy.
Can you think of a better way to do this, or is it inevitable (on a single machine)?
Thank you
I would recommend just using gensim for this instead of rolling your own.
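For instance, gensim's similarity index works directly on topic vectors in its sparse (topic_id, weight) format; a hedged sketch with toy distributions standing in for whatever your topic model produced:

from gensim import similarities

# Toy topic distributions, one list of (topic_id, weight) pairs per item
topic_corpus = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.1), (1, 0.9)],
    [(0, 0.5), (1, 0.5)],
]

index = similarities.MatrixSimilarity(topic_corpus, num_features=2)
sims = index[topic_corpus[0]]   # cosine similarity of item 0 against every indexed item
print(sims)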
I don't quite understand why you end up taking too much memory just to compute the correlation for O(n^2) pairs of items. To calculate the Pearson correlation, as the Wikipedia article points out, corr(X, Y) = cov(X, Y) / (σ_X σ_Y).
That is, to get corr(X, Y) you only need two vectors at a time. If you process your data one pair at a time, memory should not be a problem at all.
If you are going to load all vectors and do some matrix factorization, that is another story.
As for computation time, I totally understand, because you need to do this for O(n^2) pairs of items.
Gensim is known to run with modest memory requirements (< 1 GB) on a single CPU/desktop computer within a reasonable time frame. Check this experiment they did on an 8.2 GB dataset using a MacBook Pro, Intel Core i7 2.3 GHz, 16 GB DDR3 RAM. I think that is a larger dataset than yours.
If you have an even larger dataset, you might want to try the distributed version of gensim or even map/reduce.
Another approach is to try locality sensitive hashing.
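A rough sketch of processing the pairs one block at a time with plain numpy, so the full n x n matrix never has to sit in memory; centring and L2-normalising each row first makes a dot product equal to the Pearson correlation (the array sizes are scaled-down stand-ins for your ~40k x 10k set):

import numpy as np

vectors = np.random.rand(4_000, 10_000).astype(np.float32)   # stand-in topic vectors

# Centre and normalise each row so a dot product equals the Pearson correlation
centred = vectors - vectors.mean(axis=1, keepdims=True)
centred /= np.linalg.norm(centred, axis=1, keepdims=True)

# Top-10 most correlated items for every item, one block of rows at a time
top_k, block = 10, 500
neighbours = np.empty((len(centred), top_k), dtype=np.int64)
for start in range(0, len(centred), block):
    sims = centred[start:start + block] @ centred.T              # block x n correlations
    np.fill_diagonal(sims[:, start:start + block], -np.inf)      # exclude self-similarity
    neighbours[start:start + block] = np.argsort(sims, axis=1)[:, ::-1][:, :top_k]

print(neighbours[0])   # indices of the 10 items most similar to item 0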
My trick is to use a search engine such as ElasticSearch; it works very well, and in this way we unified the API of all our recommender systems. Details are listed below:
Train the topic model on your corpus. Each topic is an array of words, each word with a probability, and we take the 6 most probable words as the representation of a topic.
For each document in your corpus, infer a topic distribution: an array of probabilities, one per topic.
For each document, generate a fake document from its topic distribution and the topic representations; for example, the fake document can be about 1024 words long.
For each document, generate a query in the same way from the topic distribution and the topic representations; for example, the query can be about 128 words long.
All preparation is finished as above. When you want a list of similar articles or the like, just perform a search:
Get the query for your document, then search with that query over the fake documents.
We found this approach very convenient.
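A hedged sketch of the fake-document and query steps; the topic words, distribution and sizes here are illustrative, not taken from the actual system described above:

# Illustrative inputs: 6 top words per topic and one document's topic distribution
topic_words = {
    0: ["neuron", "brain", "cortex", "synapse", "memory", "cognition"],
    1: ["gpu", "model", "training", "network", "dataset", "inference"],
}
doc_topic_dist = {0: 0.7, 1: 0.3}

def pseudo_text(topic_dist, words_per_topic, size):
    """Repeat each topic's top words in proportion to that topic's probability."""
    tokens = []
    for topic, prob in topic_dist.items():
        count = round(prob * size)
        words = words_per_topic[topic]
        tokens.extend(words[i % len(words)] for i in range(count))
    return " ".join(tokens)

fake_document = pseudo_text(doc_topic_dist, topic_words, size=1024)  # indexed into ElasticSearch
query = pseudo_text(doc_topic_dist, topic_words, size=128)           # used to search the fake documents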
