Classifying sentences with overlapping words - python

I have a CSV file containing comments (tweets, Facebook comments). I want to classify them into 4 categories, viz.
Pre Sales
Post Sales
Purchased
Service query
Now the problems that I'm facing are these:
There is a huge number of overlapping words between the categories, hence Naive Bayes is failing.
Tweets being only 160 characters, what is the best way to prevent words from one category from falling into another?
How should I select features that can handle both 160-character tweets and somewhat longer Facebook comments?
Please let me know of any reference/tutorial links I can follow up, as I'm a newbie in this field.
Thanks

I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.
An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
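A minimal sketch of that idea with NLTK's Naive Bayes classifier, combining bag-of-words features with regex-based time-expression flags (the word lists and the tiny training set are purely illustrative, not tuned to your corpus):
import re
import nltk

# illustrative, untuned word lists for the time-expression features
PAST = re.compile(r'\b(bought|purchased|received|arrived|was|had|ago|yesterday)\b')
FUTURE = re.compile(r'\b(will|gonna|planning|considering|should i|tomorrow)\b')

def features(text):
    text = text.lower()
    feats = {'past_expression': bool(PAST.search(text)),
             'future_expression': bool(FUTURE.search(text))}
    for word in text.split():
        feats['has(%s)' % word] = True   # plain bag-of-words features on top
    return feats

# hand-labelled (text, category) pairs; replace with your own data
train_data = [("does this phone support 4G", "Pre Sales"),
              ("I bought it a week ago and it broke", "Post Sales")]
classifier = nltk.NaiveBayesClassifier.train([(features(t), c) for t, c in train_data])
print(classifier.classify(features("will this work with my router")))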

This is going to be a complex problem.
How do you define the categories? Get as many tweets and FB posts as you can and tag them all with the correct categories to get some ground-truth data.
Then you can identify which words/phrases are best for identifying a particular category, using e.g. PCA.
Look into scikit-learn; they have tutorials for text processing and classification.
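For the "which words identify which category" step, a common scikit-learn alternative to PCA is chi-squared feature scoring. A minimal sketch on made-up data (on older scikit-learn versions, use get_feature_names() instead of get_feature_names_out()):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

texts = ["when will it ship", "is it in stock", "it arrived broken", "refund please"]
labels = ["pre_sales", "pre_sales", "post_sales", "post_sales"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
scores, _ = chi2(X, labels)                  # higher score = more category-indicative
terms = np.array(vec.get_feature_names_out())
print(terms[np.argsort(scores)[::-1][:5]])   # top 5 discriminative terms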


What is the best approach to measure a similarity between texts in multiple languages in python?

So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages, Ukrainian and Russian. Also, if there is a foreign brand (like Coca Cola), it will be written in English.
My initial idea for solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between the texts. However, I am not sure how effective this will be, nor, if it is workable, what to start with.
Because each text I have is just a set of product characteristics, context-based word embeddings may not work (I'm not sure about this statement; it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus, which adopted Faiss to search for similar vectors. It's easy to install with Docker on Windows.
Word embeddings are meaningful within a language but aren't transferable to other languages. An observation behind this statement: if two words frequently co-occur within sentences, their embeddings end up near each other. Hence, as there is no one-to-one mapping between two arbitrary languages, you cannot compare their word embeddings directly.
However, if two languages are similar enough that their words map roughly one-to-one, you may count on your idea.
In sum, without translation, your idea is not applicable to two arbitrary languages.
Does the data contain a lot of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem but as pattern matching, since these texts have presumably been produced by semi-automatic methods using translation memories. Therefore similar texts across languages may have similar form, and if so, this should be used for comparison.
Multilingual text comparison is not a trivial task, and I don't think there are any reasonably good out-of-the-box solutions for it. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is about fine-grained entity recognition. You have well-defined entities: brand, size, etc.
So, each of these features that define a product could be a vector, which means each product could be represented as a matrix.
You can potentially represent each feature with an embedding.
Or with a mixture of embeddings and one-hot vectors.
Here is how.
Define a list of product features: product name, brand name, size, weight.
For each product feature, you need a text recognition model: e.g. with brand recognition, you find which part of the text is the brand name.
Use machine translation, where possible, to build a unified language representation for all sub-texts, e.g. mapping ru «Кока-Кола» and en "Coca Cola" to a single form.
Use contextual embeddings (e.g. Hugging Face multilingual BERT or something better) to convert each extracted sub-text into one vector.
To compare two products, compare their feature vectors: take the average similarity between the two feature arrays. You can also decide what weight each feature gets.
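A minimal sketch of that per-feature comparison, assuming the sentence-transformers package with a multilingual model (the extracted feature strings and the weights here are hypothetical):
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

product_a = ["Coca Cola", "0.5 l"]    # hypothetical [name, size] extracted in English
product_b = ["Кока-Кола", "0.5 л"]    # the same product described in Russian

emb_a = model.encode(product_a)       # one vector per feature
emb_b = model.encode(product_b)

# per-feature cosine similarity, then a weighted average
sims = [cosine_similarity([a], [b])[0][0] for a, b in zip(emb_a, emb_b)]
weights = [0.7, 0.3]                  # arbitrary example weights per feature
print(sims, sum(w * s for w, s in zip(weights, sims)))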
Try other vectorization methods. Perhaps you don't want to mix up brand knockoffs: "Coca Cola" embeds close to "Cool Cola". So maybe embeddings aren't good for brand names, size, and weight, but are good enough for product names. If you want an exact match, you need a hash function over the text, applied to the unified multilingual representation.
You can also extend each feature vector with concatenations of several embeddings, or a one-hot vector of the source language, and things like that.
There is no definitive answer here; you need to experiment and test to see what the best solution is. You can create a test set and benchmark your solutions.

Should Doc2Vec or Word2Vec be used when analyzing product reviews?

I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped the reviews by product, so that different reviews can succeed one another in my dataframe (i.e. different authors for one product). Furthermore, I have also already tokenized the reviews (and applied all the other pre-processing steps). Below is a mock-up dataframe of what I have (the list of tokens per product is actually very long, as is the number of products):
Product    | reviews_tokenized
XGame3000  | absolutely amazing simulator feel inaccessible ...
Poliamo    | production value effect tend cover rather ...
Artemis    | absolutely fantastic possibly good oil ...
Ratoiin    | ability simulate emergency operator town ...
However, I'm not sure which of Doc2Vec and Word2Vec would be more effective. I would initially go for Doc2Vec, since it can find similarities while taking the paragraph/sentence into account and capture its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# one TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# infer one vector per product from its token list
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: below are the clusters I'm obtaining (cluster plot omitted).
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
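A minimal WMD sketch with gensim (note that wmdistance needs the POT package installed, or pyemd on older gensim versions; the toy corpus is illustrative only):
from gensim.models import Word2Vec

# toy corpus; in practice, train on all reviews or load pretrained vectors
sentences = [["great", "simulator", "amazing", "graphics"],
             ["terrible", "controls", "buggy", "simulator"]]
wv = Word2Vec(sentences, vector_size=50, min_count=1, seed=1).wv

# lower distance = more similar wording/concepts
print(wv.wmdistance(["great", "simulator"], ["amazing", "graphics"]))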

Decide whether a text is about "Topic A" or not - NLP with Python

I'm trying to write a Python program that will decide whether a given post is about the topic of volunteering. My dataset is small (only the posts, examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word "volunteer":
# the 120 words most similar to "volunteer" (keep the words, drop the similarity scores)
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English, using Google translate.
Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the posts)
Making a BOW out of the translated, clean post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
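One simple candidate for this stage is gensim's n_similarity, the cosine similarity between the mean vectors of two word sets. A minimal sketch, assuming MODEL is a gensim KeyedVectors object; clean_post_tokens and the 0.5 threshold are hypothetical:
# keep only tokens the model knows, then compare mean vectors
post_in_vocab = [w for w in clean_post_tokens if w in MODEL]
score = MODEL.n_similarity(post_in_vocab, CLUSTER)
is_about_volunteering = score > 0.5   # arbitrary threshold; tune on labelled examples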
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples of "text classification" that are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature vectors for the texts. A simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to worry about choosing ways to calculate a "distance" / "similarity" / something to guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers and check how well they do. Many will use various kinds of similarity/distance calculations internally, but that's automatic and follows from choosing & configuring the algorithm.
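A minimal version of that workflow in scikit-learn; the tiny hand-labelled dataset here is purely illustrative:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

posts = ["we need volunteers for the shelter", "concert tickets on sale now",
         "join our beach cleanup this weekend", "new phone released today"]
labels = [True, False, True, False]   # True = about volunteering

# bag-of-words features fed straight into an off-the-shelf classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(posts, labels)
print(clf.predict(["looking for people to help at the food bank"]))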
Finally, when you have something working start-to-finish, no matter how modest the results, then try alternate ways of preprocessing text (stop-word removal, lemmatization, etc), featurizing text, and alternate classifiers/algorithm parameterizations, to compare results and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.

classification of documents considering the order of words

I'm trying to classify a list of documents. I'm using CountVectorizer and TfidfVectorizer to vectorize the documents before classification. The results are good, but I think they could be better if we considered not only the presence of specific words in a document but also the order of those words. I know that it is possible to also consider pairs and triples of words, but I'm looking for something more inclusive.
Believe it or not, bag-of-words approaches work quite well on a wide range of text datasets. You've already thought of bigrams and trigrams. Suppose you had 10-grams: you would have information about the order of your words, but it turns out there is rarely more than one instance of each 10-gram, so there would be few examples for your classification model to learn from. You could try other custom feature engineering based on the text, but it would be a good amount of work that rarely helps much. There are other successful approaches in Natural Language Processing, especially in the last few years, but they usually focus on more than word ordering.
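For completeness, the standard way to fold some local word order into the vectorizers you're already using is the ngram_range parameter; a quick sketch, where documents stands in for your own list of texts:
from sklearn.feature_extraction.text import TfidfVectorizer

# unigrams + bigrams + trigrams capture some local word order
vec = TfidfVectorizer(ngram_range=(1, 3), min_df=2)   # min_df prunes the rarest n-grams
X = vec.fit_transform(documents)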

How to automatically label a cluster of words using semantics?

The context is: I already have clusters of words (phrases, actually) resulting from k-means applied to internet search queries, using common URLs in the search engine results as a distance (co-occurrence of URLs rather than words, if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - suppose I have the following bunch of queries: ['my husband attacked me', 'he was arrested by the police', 'the trial is still going on', 'my husband can go to jail for harassing me?', 'free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem, so the label could be "legal", for example.
I am new to NLP, but I should specify that I don't want to extract words using POS tagging (or at least this is not the expected final outcome, though it may be a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word given the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets, or categorization with the XML structure of WordNet) and then summarize the context in one or a few words.
Any ideas? I can use R or Python. I read a little about NLTK, but I can't find a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans to solve, because you might need a domain expert. Anyone claiming they could do it automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits. 1) You may discover you had the wrong number of clusters (the k parameter), or that there was too much junk in the input to begin with. 2) You will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need a quantitative result too, you could then train a classifier on the manually labelled topics, either 1) to predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, etc.
When we talk about semantics in this area, we mean statistical semantics. Statistical or distributional semantics is very different from other definitions of semantics, which have logic and reasoning behind them. Statistical semantics is based on the Distributional Hypothesis, which treats context as the meaning-bearing aspect of words and phrases. Meaning, in this very abstract and general sense, is in various strands of the literature called a topic. There are several unsupervised methods for modelling topics, such as LDA, or even word2vec, which basically provides a word-similarity metric or suggests a list of similar words for a document as another form of context. Usually, when you have these unsupervised clusters, you need a domain expert to tell the meaning of each cluster.
However, for several reasons you might accept a low-accuracy assignment of a word as the general topic (or, in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at Word Sense Disambiguation tasks that look for coarse-grained word senses. For WordNet, this might be called the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter.
And regarding your question about choosing words from the given phrases, there is also an active question about converting phrases to vectors; my word2vec-style answer to it might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if they come to mind.
The paper Automatic Labelling of Topic Models explains the authors' approach to this problem. As an overview: they generate label candidates using information retrieved from Wikipedia and Google, and once the list of candidates is in place, they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in Python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is to identify the set of keywords in the cluster, get all their synonyms, and then find the hypernyms of each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernyms.
Example: a word cluster containing the words dog and wolf should be labelled not with either word but as canids. The papers achieve this using synonymy and hypernymy.
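The dog/wolf example can be reproduced directly with NLTK's WordNet interface; a sketch that naively takes the first synset of each word (a real pipeline would need keyword extraction and sense disambiguation first):
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

dog = wn.synsets('dog')[0]
wolf = wn.synsets('wolf')[0]
print(dog.lowest_common_hypernyms(wolf))   # [Synset('canine.n.02')]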
Cluster Labeling by Word Embeddings and WordNet's Hypernymy
Automated Text Clustering and Labeling using Hypernyms
