I'm doing a text classification / tagging task and I would like to ask what kind of data structure would serve me best. The training data set I have is about 4 gigs (after some cleaning, but should be even smaller if I discard the rare words) with 6 million documents. Each document has 4 fields:
Document ID
Title
Body
Tags (as a string, e.g. "apple sql-server linux". This represents three tags, separated by a space. Documents can have 1-5 tags)
I've just finished the cleaning phase (stemming, stop words etc.) and I'm about to convert the documents into TF-IDF word vectors with scikit-learn, so the output is a scipy sparse matrix. I would like to keep the Title and Body as two vectors and combine them at a later stage when I decide what weighting to give the Title. The Title and Body are sparse vectors, but they are built with the same dictionary, so they have the same number of columns.
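Roughly, what I have in mind is something like this (the toy titles/bodies and the weight of 2.0 below are just placeholders to illustrate the shared vocabulary, not my real data):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["stemmed title one", "stemmed title two"]
bodies = ["stemmed body text one", "stemmed body text two"]

# Fit the vocabulary on titles and bodies together so both matrices share columns.
vectorizer = TfidfVectorizer()
vectorizer.fit(titles + bodies)

X_title = vectorizer.transform(titles)  # scipy sparse matrix, shape (n_docs, n_terms)
X_body = vectorizer.transform(bodies)   # same number of columns as X_title

# Because the columns line up, a weighted combination is a simple sparse sum.
title_weight = 2.0
X_combined = title_weight * X_title + X_body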
What is the best way to represent this information? I come from R so I'm just used to storing things in data.tables / data frames but that doesn't seem very applicable for text classification and sparse matrices. One thing I thought about doing is creating my own "Document" class and just have a list of these objects to represent the corpus. I don't think this is very efficient, since I would probably want to do something like return all docs with the Tag apple.
ML algorithms I plan to run are k-means clustering, kNN, Naive Bayes and possibly SVM. There will probably be others that I haven't thought of yet.
I'm new to Python and text classification - any help is greatly appreciated, and I'm especially interested in hearing from people who have done this before.
Thank you!
Your best bet is a list of dictionary objects: a list to hold all the documents, and a dictionary per document to hold all the information about that document.
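A rough sketch of what I mean (the sample values are made up, and the field names just mirror the ones you listed):

# One dictionary per document, all collected in a list.
corpus = [
    {
        "id": 1,
        "title": "stemmed title tokens",
        "body": "stemmed body tokens",
        "tags": ["apple", "sql-server", "linux"],
    },
    # ... one dictionary per document ...
]

# Pulling out all documents with a given tag is then a simple comprehension:
apple_docs = [doc for doc in corpus if "apple" in doc["tags"]]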
There are some questions about this which I've studied, but I'm still not sure about a couple of things. I want to split a page stream (concatenated PDFs) into singular documents. So the trick is to find where one document ends and where the next one starts. A PDF can have 1000 pages and can consist of 20 documents, each with a different length.
That being said, one feature I want to introduce is 'page similarity' where page (p) has a similarity score for the page before it (p-1).
So studying this problem leads me to a lot of examples using LDA and LSI models, but is this the way to go?
I have made a corpus with all the tokens, bigrams and trigrams from all 1000 pages. What is the best way to compare two pages with each other? I have looked at this example where an LSI model is used to compare a query with a whole corpus, but I can't figure out how to compare it with just one previous page/document.
Any other ideas will be greatly appreciated!
from gensim import corpora, models, similarities

texts = data_lemmatized  # --> all pages, tokenized + filtered + bigrams + trigrams using gensim
dictionary = corpora.Dictionary(data_lemmatized)
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
vec_bow = dictionary.doc2bow(data_lemmatized[1])  # --> this is page 2, which I want to compare with data_lemmatized[0]
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]  # --> this performs a similarity query against the whole corpus, but I want only 1 page
There are many different possible text-similarity calculations; which one is the "way to go" will depend on your data, project goals, & resources. Both LDA & LSI are reasonable things to try.
The Gensim models work on whatever 'documents' you give them - so if you've preprocessed your 1000-page PDF into 20 of your true documents, or 1000 separate pages, before training a topic model, those are also the units-of-text it will analyze.
Are you trying to compare the pages to infer document-boundaries? (Are you sure there's no other better hint of boundaries in the PDF? Are you sure all the desired document boundaries align with page-boundaries?)
Doing page-level comparisons might work for that, or might not; it'd depend on how vividly the word/phrase usage changes from the end of one document to the start of another.
You can generally think of many of the Gensim models as providing a summary vector for a text. The queries against all documents are useful for listing ranked matches, but if you want to do a simple pairwise calculation, you'd get the vectors for each text from the model individually, then use a direct calculation on the vectors (such as cosine-similarity). For example:
import gensim

page1_bow = dictionary.doc2bow(data_lemmatized[0])  # bag-of-words for page 1
page1_lsi = lsi[page1_bow]                          # its LSI topic vector
page2_bow = dictionary.doc2bow(data_lemmatized[1])  # bag-of-words for page 2
page2_lsi = lsi[page2_bow]                          # its LSI topic vector
cossim = gensim.matutils.cossim(page1_lsi, page2_lsi)  # pairwise cosine-similarity
(The gensim.matutils.cossim() function is a convenience helper for calculating cosine-similarity between the sparse vectors produced by many of Gensim's bag-of-words-fed models, including LSI. With other, dense vectors, you might use the raw calculation for cosine-similarity, the cosine-distance function offered by scipy.spatial.distance.cosine, or other methods.)
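For example, a minimal sketch of the dense-vector case (the two small vectors here are made up):

import numpy as np
from scipy.spatial.distance import cosine

vec_a = np.array([0.1, 0.3, 0.6])
vec_b = np.array([0.2, 0.2, 0.6])

# scipy returns cosine *distance*; similarity is 1 minus that.
cos_sim = 1.0 - cosine(vec_a, vec_b)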
So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages: Ukrainian and Russian. Also, if there is a foreign brand (like, Coca Cola), it will be written in English.
My initial idea for solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how effective this will be, and if it is a reasonable approach, what to start with.
Because each text I have is just a set of product characteristics, word embeddings based on context may not work (I'm not sure about this statement, it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus, which adopted Faiss to search for similar vectors. It is easy to install with Docker on Windows.
Word embeddings are meaningful within a language but are not transferable to other languages. One observation behind this statement: if two words co-occur a lot inside sentences, their embeddings end up near each other. Hence, as there is no one-to-one mapping between two general languages, you cannot compare word embeddings across them.
However, if two languages are similar enough to allow a one-to-one mapping of words, your idea may work.
In sum, without translation, your idea is not applicable to two arbitrary languages.
Does the data contain lots of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem but as pattern matching, since these texts have presumably been produced with semi-automatic methods using translation memories. Therefore similar texts across languages may have a similar form, and if so, this should be used for comparison.
Multilingual text comparison is not a trivial task and I don't think there are any reasonably good out-of-box solutions for that. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is about fine-grained entity recognition. I think you have well-defined entities: brand, size, etc.
So each of these features that define a product could be a vector, which means your products could be represented as a matrix.
You can potentially represent each feature with an embedding,
or with a mixture of embeddings and one-hot vectors.
Here is how.
Define a list of product features:
product name, brand name, size, weight.
For each product feature, you need a text recognition model:
E.g. with brand recognition you find what part of the text is its brand name.
Use machine translation, if possible, to make a unified language representation for all sub-texts, e.g. ru "Кока-Кола" to en "Coca Cola".
Use contextual embeddings (e.g. Hugging Face multilingual BERT or something better) to convert the prepared text into one vector.
To compare two products, compare their feature vectors: take the average similarity across the two feature arrays. You can also decide what weight to give each feature (a rough sketch follows at the end of this answer).
Try other vectorization methods. Perhaps you don't want to mix up brand knockoffs: "Coca Cola" is similar to "Cool Cola". So maybe embeddings aren't good for brand names, size and weight, but are good enough for product names. If you want an exact match, you need a hash function over the text - the multilingual, prompt-engineered text.
You can also extend each feature vectors, with concatenations of several embeddings or one hot vector of their source language and things like that.
There is no definitive answer here; you need to experiment and test to see what the best solution is. You can create a test set and make benchmarks for your solutions.
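As a rough sketch of the weighted feature comparison described above (the embed() function here is only a stand-in for whatever multilingual encoder you pick, and the feature names, weights and sample products are made up):

import numpy as np

def embed(text):
    # Placeholder encoder just so the sketch runs: swap in a real multilingual
    # model (e.g. multilingual BERT) that maps text to a dense vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(16)

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def product_similarity(product_a, product_b, weights):
    # Weighted average of per-feature cosine similarities.
    total = sum(w * cosine_sim(embed(product_a[f]), embed(product_b[f]))
                for f, w in weights.items())
    return total / sum(weights.values())

weights = {"name": 2.0, "brand": 1.0, "size": 0.5}  # illustrative weights
a = {"name": "milk 2.5%", "brand": "Brand A", "size": "1 l"}
b = {"name": "молоко 2.5%", "brand": "Brand A", "size": "1 л"}
print(product_similarity(a, b, weights))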
I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped the reviews per product, such that I can have different reviews succeeding one another in my dataframe (i.e. different authors for one product). Furthermore, I have also already tokenized the reviews (and applied all other pre-processing methods). Below is a mock-up dataframe of what I have (the number of tokens per product is actually very high, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which would be the most efficient between Doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it has the ability to find similarities by taking into account the paragraph/sentence and find its topic (which I'd like to have, since I'm trying to cluster products by topics), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per product (SEED is defined earlier in my script)
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
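For reference, Gensim exposes WMD on word-vector models as wmdistance(); a minimal sketch, with a toy corpus standing in for real word-vectors (WMD also needs an optional dependency, pyemd or POT depending on the Gensim version):

from gensim.models import Word2Vec

# Toy corpus just so the sketch runs; in practice you'd use pre-trained or
# domain-trained word vectors.
sentences = [["great", "racing", "simulator"], ["poor", "production", "value"]]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

tokens_a = ["great", "racing", "simulator"]
tokens_b = ["poor", "production", "value"]
distance = w2v.wv.wmdistance(tokens_a, tokens_b)  # lower distance = more similar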
I'm trying to classify a list of documents. I'm using CountVectorizer and TfidfVectorizer to vectorize the documents before classification. The results are good, but I think they could be better if we consider not only the presence of specific words in a document but also the order of these words. I know that it is possible to also consider pairs and triples of words, but I'm looking for something more inclusive.
Believe it or not, bag-of-words approaches work quite well on a wide range of text datasets. You've already thought of bigrams and trigrams. Let's say you used 10-grams: you would have information about the order of your words, but it turns out there is rarely more than one instance of each 10-gram, so there would be few examples for your classification model to learn from. You could try some other custom feature engineering based on the text, but it would be a good amount of work that rarely helps much. There are other successful approaches in Natural Language Processing, especially in the last few years, but they usually focus on more than word ordering.
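For what it's worth, if you do want to capture short-range word order, it's just a parameter on the vectorizers you're already using; a small sketch with made-up documents and an illustrative ngram_range:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox", "the fox jumped over the lazy dog"]  # toy documents
# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)  # sparse document-term matrix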
I am dealing with a problem of text summarization i.e. given a large chunk(s) of text, I want to find the most representative "topics" or the subject of the text. For this, I used various information theoretic measures such as TF-IDF, Residual IDF and Pointwise Mutual Information to create a "dictionary" for my corpus. This dictionary contains important words mentioned in the text.
I manually sifted through the entire 50,000 list of phrases sorted on their TFIDF measure and hand-picked 2,000 phrases (I know! It took me 15 hours to do this...) that are the ground truth i.e. these are important for sure. Now when I use this as a dictionary and run a simple frequency analysis on my text and extract the top-k phrases, I am basically seeing what the subject is and I agree with what I am seeing.
Now how can I evaluate this approach? There is no machine learning or classification involved here. Basically, I used some NLP techniques to create a dictionary and using the dictionary alone to do simple frequency analysis is giving me the topics I am looking for. However, is there a formal analysis I can do for my system to measure its accuracy or something else?
I'm not an expert in machine learning, but I would use cross-validation. If you used e.g. 1000 pages of text to "train" the algorithm (there is a "human in the loop", but no problem), then you could take another few hundred test pages and use your "top-k phrases algorithm" to find the "topic" or "subject" of each. The ratio of test pages where you agree with the outcome of the algorithm gives you a (somewhat subjective) measure of how well your method performs.
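A tiny sketch of that agreement ratio, assuming you record a yes/no judgement per test page (the values below are made up):

# Each entry records whether you agree with the algorithm's topic for one test page.
judgements = [True, True, False, True, True, False, True, True]

agreement_ratio = sum(judgements) / len(judgements)
print(f"agree on {agreement_ratio:.0%} of test pages")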