Topic-based text and user similarity - python

I am looking to compute similarities between users and text documents using their topic representations. That is, each document and user is represented by a vector of topics (e.g. Neuroscience, Technology, etc.) and by how relevant each topic is to that user/document.
My goal is then to compute the similarity between these vectors, so that I can find similar users, articles and recommended articles.
I have tried to use Pearson Correlation but it ends up taking too much memory and time once it reaches ~40k articles and the vectors' length is around 10k.
I am using numpy.
Can you think of a better way to do this? Or is it inevitable (on a single machine)?
Thank you

I would recommend just using gensim for this instead of rolling your own.
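For example (a minimal sketch, assuming your topic vectors already sit in a dense numpy array; the sizes are placeholders), gensim's MatrixSimilarity builds a cosine-similarity index so one query scores against every document at once, and the sharded similarities.Similarity class does the same from disk if the index gets too big for RAM:

import numpy as np
from gensim import matutils, similarities

topic_vectors = np.random.rand(1000, 50).astype(np.float32)   # docs/users x topics (placeholder)

# Wrap the dense matrix as a gensim corpus and build a cosine-similarity index.
corpus = matutils.Dense2Corpus(topic_vectors, documents_columns=False)
index = similarities.MatrixSimilarity(corpus, num_features=topic_vectors.shape[1])

# Score one vector against every document in a single call.
query = matutils.full2sparse(topic_vectors[0])
sims = index[query]                        # 1D array of cosine similarities
top10 = sims.argsort()[::-1][:10]          # indices of the 10 most similar items

Note this gives cosine similarity; if you mean-center each vector first, it is equivalent to Pearson correlation.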

I don't quite understand why you run out of memory just computing the correlation for O(n^2) pairs of items. To calculate the Pearson correlation, as the Wikipedia article points out, corr(X, Y) = cov(X, Y) / (σ_X σ_Y).
That is, to get corr(X, Y) you only need two vectors at a time. If you process your data one pair at a time, memory should not be a problem at all.
If you are going to load all vectors and do some matrix factorization, that is another story.
As for computation time, I totally understand, because you need to do this comparison for O(n^2) pairs of items.
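If the whole matrix does fit in RAM, you can also sidestep the explicit pair loop: the Pearson correlation of every pair is just a matrix product of mean-centered, L2-normalized rows, and you can produce the result block by block instead of materializing the full ~40k x 40k matrix. A rough numpy sketch with placeholder sizes:

import numpy as np

X = np.random.rand(2000, 100).astype(np.float32)   # items x topics (placeholder sizes)

# Mean-center and L2-normalize each row; the dot product of two such rows
# is exactly their Pearson correlation.
Xc = X - X.mean(axis=1, keepdims=True)
Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)

block = 500
for start in range(0, Xc.shape[0], block):
    corr = Xc[start:start + block] @ Xc.T           # this block against everything
    # Keep only what you need (e.g. the 10 most correlated items per row; the item
    # itself shows up at correlation 1.0), then discard the block.
    top = np.argsort(-corr, axis=1)[:, :10]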
Gensim is known to run with modest memory requirements (< 1 GB) on a single CPU/desktop computer within a reasonable time frame. Check the experiment they have done on an 8.2 GB dataset using a MacBook Pro (Intel Core i7 2.3 GHz, 16 GB DDR3 RAM). I think that is a larger dataset than yours.
If you have an even larger dataset, you might want to try the distributed version of gensim, or even map/reduce.
Another approach is to try locality-sensitive hashing (LSH).
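The random-hyperplane flavour of LSH is easy to try in plain numpy: each item gets a short bit signature, items sharing a signature land in the same bucket, and you only compute exact similarities inside a bucket. A toy sketch (the number of hyperplanes and the sizes are arbitrary):

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.random((10000, 100)).astype(np.float32)     # items x topics (placeholder)

n_planes = 16
planes = rng.standard_normal((X.shape[1], n_planes))

# 16-bit signature per item: which side of each random hyperplane it falls on.
signatures = (X @ planes > 0).astype(np.uint8)
keys = np.packbits(signatures, axis=1)

buckets = defaultdict(list)
for i, key in enumerate(map(bytes, keys)):
    buckets[key].append(i)

# Candidate neighbours of item 0 are the items in its bucket; only these few
# need an exact cosine/correlation check.
candidates = buckets[bytes(keys[0])]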

My trick is to use a search engine such as Elasticsearch; it works very well, and in this way we unified the API of all our recommender systems. The details are as follows:
1. Train the topic model on your corpus. Each topic is an array of words, each word with a probability, and we take the 6 most probable words as the representation of a topic.
2. For each document in your corpus, infer a topic distribution: an array with one probability per topic.
3. For each document, generate a fake document from its topic distribution and the topic representations; for example, the fake document is about 1024 words long.
4. For each document, generate a query from its topic distribution and the topic representations; for example, the query is about 128 words long.
That is all the preparation. When you want a list of similar articles (or similar users, etc.), just perform a search:
Take the query for your document and search the index of fake documents with it.
We found this approach very convenient.
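Steps 3 and 4 are the only unusual part, so here is a minimal sketch of one way to do them: sample each topic's top words in proportion to the document's topic probabilities. The topic words, the distribution, and the function name are made-up placeholders:

import random

# Top-6 words per topic (step 1) and one document's topic distribution (step 2).
topic_words = {
    0: ["neuron", "brain", "cortex", "synapse", "memory", "cognition"],
    1: ["startup", "software", "cloud", "device", "platform", "hardware"],
}
doc_topics = {0: 0.7, 1: 0.3}

def fake_document(doc_topics, topic_words, size=1024):
    # Each topic contributes words in proportion to its probability, so the
    # search engine's term statistics end up reflecting the topic distribution.
    words = []
    for topic, prob in doc_topics.items():
        words += random.choices(topic_words[topic], k=int(round(size * prob)))
    random.shuffle(words)
    return " ".join(words)

indexed_text = fake_document(doc_topics, topic_words, size=1024)  # stored in Elasticsearch (step 3)
query_text = fake_document(doc_topics, topic_words, size=128)     # used as the search query (step 4)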

Related

What should be used between Doc2Vec and Word2Vec when analyzing product reviews?

I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped the reviews per product, such that different reviews for the same product succeed one another in my dataframe (i.e., different authors for one product). Furthermore, I have also already tokenized the reviews (and applied all the other pre-processing steps). Below is a mock-up of the dataframe I have (the number of tokens per product is actually very high, as is the number of products):
Product     | reviews_tokenized
XGame3000   | absolutely amazing simulator feel inaccessible ...
Poliamo     | production value effect tend cover rather ...
Artemis     | absolutely fantastic possibly good oil ...
Ratoiin     | ability simulate emergency operator town ...
However, I'm not sure which would be the more efficient choice: Doc2Vec or Word2Vec. I would initially go for Doc2Vec, since it finds similarities at the paragraph/sentence level and captures the topic of a text (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

SEED = 42  # any fixed value; used only for reproducibility
# df is the dataframe shown above; each row holds the joined tokens of one product
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# Infer one vector per product and stack them into an array for clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining: [cluster plot omitted]
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size and format of your texts, you may also want to look at "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demonstrated interesting results in finding "similar concerns" (even with different wording) in the review domain, e.g.: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
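If you want to experiment with WMD, gensim exposes it on trained word vectors via wmdistance. A minimal sketch on a made-up toy corpus (note that wmdistance needs the optional POT package, or pyemd in older gensim versions):

from gensim.models import Word2Vec

# Toy reviews; in practice you would use your own tokenized review sentences.
sentences = [
    "absolutely amazing simulator".split(),
    "fantastic realistic simulation".split(),
    "poor production value".split(),
]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, epochs=50)

# Word Mover's Distance between two token lists (lower = more similar).
d_close = w2v.wv.wmdistance(sentences[0], sentences[1])
d_far = w2v.wv.wmdistance(sentences[0], sentences[2])
print(d_close, d_far)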

How to perform LSA on a huge dataset that does not fit into memory with Python?

I have seen similar questions before, but I haven't found a solution that works for me specifically. I have a million documents, and let's say each document has around 20-30 words in it. I want to lemmatize, remove stopwords, and use 100,000 words to build a tf-idf matrix, and then do SVD on it.
How can I do this using Python, within a reasonable time and without running into memory errors?
If someone has any idea that would be great.
There is an algorithm called SPIMI (single-pass in-memory indexing). It basically involves going through your data and writing to disk every time you run out of memory; you then merge all of the on-disk partial results into one large matrix.
I've implemented this for a project here
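This is not the SPIMI implementation linked above, but for comparison, gensim's streamed LSA also never holds the corpus in memory: the dictionary, TF-IDF weights, and SVD are all built from a corpus read from disk one document at a time. A rough sketch, assuming one preprocessed (lemmatized, stopword-free) document per line in a file called docs.txt; the path and parameters are placeholders:

from gensim import corpora, models

class BowStream:
    # Stream documents from disk one at a time so the corpus never sits in memory.
    def __init__(self, path, dictionary=None):
        self.path, self.dictionary = path, dictionary
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                yield tokens if self.dictionary is None else self.dictionary.doc2bow(tokens)

dictionary = corpora.Dictionary(BowStream("docs.txt"))      # pass 1: build the vocabulary
dictionary.filter_extremes(keep_n=100_000)                  # keep the 100,000 best terms
bow = BowStream("docs.txt", dictionary)                     # re-iterable bag-of-words stream
tfidf = models.TfidfModel(dictionary=dictionary)            # idf weights from the dictionary alone
lsa = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=300)
doc_vectors = lsa[tfidf[bow]]                               # streamed 300-dimensional LSA vectors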

Sample size for Named Entity Recognition gold standard corpus

I have a corpus of 170 Dutch literary novels on which I will apply Named Entity Recognition. For an evaluation of existing NER taggers for Dutch I want to manually annotate Named Entities in a random sample of this corpus – I use brat for this purpose. The manually annotated random sample will function as the 'gold standard' in my evaluation of the NER taggers. I wrote a Python script that outputs a random sample of my corpus on the sentence level.
My question is: what is the ideal size of the random sample in terms of the number of sentences per novel? For now, I used 100 random sentences per novel, but this leads to a pretty big random sample of about 21,626 lines (which is a lot to annotate manually, and which makes for a slow working environment in brat).
NB, before the actual answer: the biggest issue I see is that you can only evaluate the tools with respect to those 170 books. So at best, this will tell you how well the NER tools you evaluate work on those books or on similar texts. But I guess that is obvious...
As to sample size, I would guesstimate that you need no more than a dozen random sentences per book. Here's a simple way to check whether your sample size is already big enough: randomly choose only half of the sentences you annotated (stratified per book!) and evaluate all the tools on that subset. Do that a few times and see whether the results for the same tool vary widely between runs (say, by more than +/- 0.1 if you use F-score, for example; mostly depending on how "precise" you have to be to detect significant differences between the tools). If the variance is very large, continue to annotate more random sentences. If the numbers start to stabilize, you're good and can stop annotating.
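A rough sketch of that check; annotated, score_tool, and the data layout are made-up placeholders (annotated[book] is the list of gold sentences for one book, and score_tool returns a tool's F-score on a set of sentences):

import random
import statistics

def stability_check(annotated, score_tool, runs=10):
    scores = []
    for _ in range(runs):
        half = []
        for sentences in annotated.values():           # stratified: half of each book
            half += random.sample(sentences, len(sentences) // 2)
        scores.append(score_tool(half))
    return statistics.mean(scores), statistics.stdev(scores)

# A large spread (e.g. F-scores swinging by more than ~0.1 across runs) suggests
# annotating more sentences; a small spread suggests the sample is big enough.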
Indeed, the "ideal" size would be... the whole corpus :)
Results will be correlated with the degree of detail of the typology: just PERS, LOC, ORG would require a minimal size, but what about a fine-grained typology or even full disambiguation (linking)? I suspect good performance wouldn't need much data (just enough to validate), whereas low performance would require more data to get a more detailed view of the errors.
As an indicator, cross-validation is considered a standard methodology; it often uses 10% of the corpus for evaluation (but the evaluation is repeated 10 times).
Besides, if you are working with older novels, you will probably face a lexical coverage problem: many old proper names are not included in the lexical resources of available software, and this is a severe drawback for NER accuracy. It could therefore be a good idea to split the corpus by decade / century and run multiple evaluations, so as to measure the impact of this issue on performance.

Data Structure for Text Classification Task

I'm doing a text classification / tagging task and I would like to ask what kind of data structure would serve me best. The training data set I have is about 4 gigs (after some cleaning, but should be even smaller if I discard the rare words) with 6 million documents. Each document has 4 fields:
Document ID
Title
Body
Tags (as a string, e.g. "apple sql-server linux". This represents three tags, separated by a space. Documents can have 1-5 tags)
I've just finished the cleaning phase (stemming, stop words, etc.) and I'm about to convert the documents into TF-IDF word vectors with scikit-learn, so the output is a scipy sparse matrix. I would like to keep the Title and Body as two separate vectors and combine them at a later stage, once I decide what weighting to give the Title. The Title and Body are sparse vectors, but they are built with the same dictionary, so they have the same number of columns.
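One way to keep the two fields on the same columns is to fit a single vocabulary and reuse it for both vectorizers; the weighting can then be applied when the matrices are combined. A small scikit-learn sketch with toy data (the weight value is arbitrary):

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["install sql server", "apple stock split"]                     # toy data
bodies = ["how do i install sql server on linux", "apple announced a stock split today"]

# Fit one vocabulary on everything, then reuse it so both matrices share columns.
vocab = TfidfVectorizer().fit(titles + bodies).vocabulary_
X_title = TfidfVectorizer(vocabulary=vocab).fit_transform(titles)
X_body = TfidfVectorizer(vocabulary=vocab).fit_transform(bodies)

# Combine later with whatever weight the Title should get, e.g.:
title_weight = 2.0
X = hstack([title_weight * X_title, X_body]).tocsr()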
What is the best way to represent this information? I come from R so I'm just used to storing things in data.tables / data frames but that doesn't seem very applicable for text classification and sparse matrices. One thing I thought about doing is creating my own "Document" class and just have a list of these objects to represent the corpus. I don't think this is very efficient, since I would probably want to do something like return all docs with the Tag apple.
The ML algorithms I plan to run are k-means clustering, kNN, Naive Bayes and possibly SVM. There will probably be others that I haven't thought of yet.
I'm new to Python and text classification - any help is greatly appreciated, and I am especially interested in hearing from people who have done this before.
Thank you!
Your best bet is a list of dictionary objects: a list to hold all the documents, and one dictionary per document to hold all the information about that document.
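A tiny sketch of that layout: keep the heavy TF-IDF data in the shared sparse matrix and let each dict carry only metadata plus its row index, so a query like "all docs tagged apple" stays cheap (the field names are placeholders):

from collections import defaultdict
from scipy.sparse import csr_matrix

X_body = csr_matrix([[0.1, 0.0, 0.3], [0.0, 0.2, 0.0]])   # stand-in for the body TF-IDF matrix

# One dict per document; `row` points at its row in the sparse matrix.
corpus = [
    {"id": 101, "row": 0, "tags": {"apple", "sql-server"}},
    {"id": 102, "row": 1, "tags": {"linux"}},
]

# "Return all docs with the tag apple" is then a one-liner...
apple_rows = [doc["row"] for doc in corpus if "apple" in doc["tags"]]
X_apple = X_body[apple_rows]

# ...or, for repeated lookups, build an inverted tag index once:
by_tag = defaultdict(list)
for doc in corpus:
    for tag in doc["tags"]:
        by_tag[tag].append(doc["row"])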

How can I evaluate my technique?

I am dealing with a problem of text summarization i.e. given a large chunk(s) of text, I want to find the most representative "topics" or the subject of the text. For this, I used various information theoretic measures such as TF-IDF, Residual IDF and Pointwise Mutual Information to create a "dictionary" for my corpus. This dictionary contains important words mentioned in the text.
I manually sifted through the entire list of 50,000 phrases sorted by their TF-IDF score and hand-picked 2,000 phrases (I know! It took me 15 hours to do this...) that are the ground truth, i.e. these are important for sure. Now when I use this as a dictionary, run a simple frequency analysis on my text, and extract the top-k phrases, I basically see what the subject is, and I agree with what I am seeing.
Now how can I evaluate this approach? There is no machine learning or classification involved here. Basically, I used some NLP techniques to create a dictionary and using the dictionary alone to do simple frequency analysis is giving me the topics I am looking for. However, is there a formal analysis I can do for my system to measure its accuracy or something else?
I'm not an expert in machine learning, but I would use cross-validation. If you used e.g. 1,000 pages of text to "train" the algorithm (there is a "human in the loop", but no problem), then you could take another few hundred test pages and use your "top-k phrases algorithm" to find the "topic" or "subject" of these. The ratio of test pages where you agree with the outcome of the algorithm gives you a (somewhat subjective) measure of how well your method performs.
