I'm new to data mining and experimenting a bit.
Let's say I have N twitter users and what I want to find
is the overall theme they're writing about (based on tweets).
Then I want to give a higher weight to each theme if its user has more followers.
Then I want to merge themes if they're similar enough, but still
retain the weighting by follower count.
So basically, a list of "important" themes ranked by authority (the users' follower counts).
For instance, like news.google.com, but the ranking would be based on the Twitter followers of the users responsible for each theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with different data):
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically analyzing various data and their correlations with each other: work categories and each person's age, or word categories and friend count, as in this example.
Where would I begin to solve this and generate such graphs?
Generally speaking: R has packages specifically aimed at text mining and data mining, offering a wide range of techniques. I don't know of equivalent packages in Python, but that doesn't mean they don't exist. I just wouldn't implement it all myself; it's a bit more complicated than it looks at first sight.
Some things you have to consider:
define "theme": Is that the tags they use? Do you group tags? Do you have a small, limited set, or is the set unlimited?
define "general theme": Is that the most-used theme? How do you deal with ties? If a user writes about 10 themes roughly equally often, what then?
define "weight": Is that equivalent to the number of followers? Its square root? Some category?
If you have a general idea about this, you can start using the tm package for extracting the information in a workable format. The package is based on matrices and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit Cross Validated (stats.stackexchange.com) for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than it is about programming.
I have no specific code, but I believe the methodology you want to employ is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quite often to classify text.
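For illustration, a minimal TF-IDF sketch with scikit-learn (the documents below are made up; in practice each document could be all of one user's tweets joined together):

from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up "document" per user, e.g. all of that user's tweets concatenated.
docs = [
    "new python release today python typing improvements",
    "watching the football game tonight great match",
    "data mining experiments in python with tweets",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)            # sparse matrix: documents x terms
terms = vectorizer.get_feature_names_out()        # get_feature_names() on older scikit-learn

# The highest-weighted terms per document are a rough proxy for its "theme".
for row in tfidf.toarray():
    top = sorted(zip(terms, row), key=lambda t: -t[1])[:3]
    print(top)

Follower-based weighting and theme merging would then sit on top of these per-user term weights.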
So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages, Ukrainian and Russian. Also, if there is a foreign brand (like Coca Cola), it will be written in English.
My initial idea on solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how efficient this will be, and if it is ok, what to start with.
Because each text I have is just a set of product characteristics, context-based word embeddings may not work well (I'm not sure about this; it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.
Hence, my questions are:
Is my idea with word embeddings worth trying?
Is there maybe a better approach?
If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
You could try Milvus, which uses Faiss to search for similar vectors. It's easy to install with Docker on Windows.
Word embeddings are meaningful within a language but don't transfer to other languages. The intuition behind this: if two words co-occur a lot within sentences, their embeddings end up near each other. Since there is generally no one-to-one mapping between two languages, you cannot directly compare their word embeddings.
However, if two languages are similar enough that words map roughly one-to-one, your idea may work.
In sum, without translation, your idea is not applicable to two arbitrary languages.
Does the data contain a lot of numerical information (e.g. nutritional facts)? If yes, this could be used to compare the products to some extent. My advice is to think of it not as a linguistic problem but as pattern matching, since these texts have presumably been produced semi-automatically using translation memories. Therefore similar texts across languages may have a similar form, and if so, this should be used for comparison.
Multilingual text comparison is not a trivial task and I don't think there are any reasonably good out-of-the-box solutions for it. Yes, multilingual embeddings exist, but they have to be fine-tuned to work on specific downstream tasks.
Let's say that your task is fine-grained entity recognition. I think you have well-defined entities: brand, size, etc.
So, each of the features that define a product could be a vector, which means a product could be represented as a matrix.
You can potentially represent each feature with an embedding,
or with a mixture of embeddings and one-hot vectors.
Here is how.
Define a list of product features:
product name, brand name, size, weight.
For each product feature, you need a text recognition model:
E.g. with brand recognition you find which part of the text is the brand name.
Use machine translation, if possible, to produce a unified language representation for all sub-texts, e.g. mapping
ru "Кока-Кола" and en "Coca Cola" to one canonical form.
Use contextual embeddings (e.g. Hugging Face multilingual BERT or something better) to convert the prepared text into one vector.
To compare two products, compare their feature vectors: for example, the average similarity between the two feature arrays. You can also decide how much weight each feature gets.
Try other vectorization methods. Perhaps you don't want to mix up brand knockoffs: "Coca Cola" is similar to "Cool Cola". So maybe embeddings aren't good for brand names, size, and weight, but are good enough for product names. If you want an exact match, use a hash function over the text instead, applied to the unified multi-lingual representation.
You can also extend each feature vector by concatenating several embeddings, a one-hot vector of the source language, and things like that.
There is no definitive answer here; you need to experiment and test to see what the best solution is. You can create a test set and benchmark your solutions.
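To make the feature-vector idea concrete, here is a rough sketch. The embed function below is only a stand-in for a real embedding model, and the feature names and weights are illustrative assumptions, not a prescription:

import numpy as np

FEATURES = ["product_name", "brand", "size", "weight"]
WEIGHTS = {"product_name": 0.5, "brand": 0.3, "size": 0.1, "weight": 0.1}

def embed(text):
    # Stand-in for a real embedding model (e.g. multilingual BERT);
    # returns a deterministic pseudo-vector so the sketch runs on its own.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=300)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def product_similarity(p1, p2):
    # p1, p2: dicts mapping feature name -> extracted sub-text for that feature
    return sum(WEIGHTS[f] * cosine(embed(p1[f]), embed(p2[f])) for f in FEATURES)

p1 = {"product_name": "cola", "brand": "Coca Cola", "size": "0.5 l", "weight": "500 g"}
p2 = {"product_name": "кола", "brand": "Кока-Кола", "size": "0.5 л", "weight": "500 г"}
print(product_similarity(p1, p2))  # weights sum to 1, so the score stays in [-1, 1]

With a real multilingual model (and translation or canonicalization of the sub-texts), the two example products above should score highly.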
I'm trying to write a Python program that will decide whether a given post is about the topic of volunteering. My data sets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word: "volunteer".
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English using Google Translate.
Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the post).
Making a BOW out of the translated, clean post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
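One rough idea I could imagine (a sketch only: MODEL is assumed to be the same gensim vectors used above, and the threshold would need tuning on a few hand-labelled posts) is to average, over the post's words, the best similarity to any cluster word:

def is_about_volunteering(bow, model, cluster, threshold=0.5):
    # bow: the cleaned, translated tokens of one post
    scores = []
    for word in bow:
        if word not in model:          # skip out-of-vocabulary tokens
            continue
        # best similarity between this word and any word of the "volunteer" cluster
        scores.append(max(model.similarity(word, c) for c in cluster))
    if not scores:
        return False
    return sum(scores) / len(scores) >= threshold

# e.g. is_about_volunteering(bow, MODEL, CLUSTER)

But I'm not sure whether this is a sound way to get the True/False answer.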
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples of "text classification" that are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature vectors for the texts. A simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known-labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to be worrying about choosing ways to calculate a "distance" / "similarity" / something that will guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers, and check how well they're doing. Many will be using various kinds of similarity/distance calculations internally - but that's automatic and explicit from choosing & configuring the algorithm.
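For example, a minimal end-to-end sketch with scikit-learn (the texts and labels below are made-up stand-ins for your hand-labelled data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; in practice this is your hand-labelled data.
texts = [
    "we are looking for volunteers at the animal shelter",
    "join us to volunteer for the beach cleanup this weekend",
    "selling my old bike, almost new, good price",
    "great pizza place downtown, highly recommended",
]
labels = ["volunteering", "volunteering", "not-volunteering", "not-volunteering"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["anyone want to help volunteer on saturday?"]))

Any other featurization or classifier can later be swapped into the same pipeline shape.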
Finally, when you have something working start-to-finish, no matter how modest the results, try alternate ways of preprocessing the text (stop-word removal, lemmatization, etc.), featurizing it, and alternate classifiers/algorithm parameterizations, to compare results and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.
I've this CSV file which has comments (tweets, comments). I want to classify them into 4 categories, viz.
Pre Sales
Post Sales
Purchased
Service query
Now the problems that I'm facing are these :
There is a huge number of overlapping words between the categories, hence Naive Bayes is failing.
Tweets are only 160 characters long, so what is the best way to prevent words from one category falling into another?
How should I select features that can handle both the 160-character tweets and the somewhat longer Facebook comments?
Please let me know of any reference/tutorial links to follow up on this, as I'm a newbie in this field.
Thanks
I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.
An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
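A rough sketch of what that could look like in scikit-learn, combining bag-of-words counts with a couple of hand-crafted time features (the patterns, texts, and labels are illustrative only, not derived from your corpus):

import re
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative patterns; build real ones from your own data.
PAST = re.compile(r"\b(bought|purchased|received|arrived|broke)\b", re.I)
FUTURE = re.compile(r"\b(will|going to|planning to|want to|looking to)\b", re.I)

def time_features(texts):
    # Two extra binary features per text: "past expression", "future expression".
    return np.array([[bool(PAST.search(t)), bool(FUTURE.search(t))] for t in texts], dtype=float)

texts = ["I bought this phone last week and it already broke",
         "I am planning to buy a new laptop, any advice?"]
labels = ["post_sales", "pre_sales"]

bow = CountVectorizer()
X = hstack([bow.fit_transform(texts), time_features(texts)])
clf = MultinomialNB().fit(X, labels)

new = ["I will probably buy the pro version next month"]
print(clf.predict(hstack([bow.transform(new), time_features(new)])))

The point is not these particular regexes, but that extra engineered features can sit alongside the usual bag-of-words columns.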
This is going to be a complex problem.
How do you define the categories? Get as many tweets and FB posts as you can and tag them all with the correct categories to get some ground-truth data.
Then you can identify which words/phrases are best at identifying a particular category, using e.g. PCA.
Look into scikit-learn; they have tutorials for text processing and classification.
First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if i got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here seems to be based only on keywords, though.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a "general"-enough problem. Which means you will not find ready-made systems that will solve the problem (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you will find it has a category label "ball games", which in turn is under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system which exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can even control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports").
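As a rough illustration of the lookup step, the MediaWiki API can return a page's category labels; the traversal of the category graph itself is left out of this sketch:

import requests

def wikipedia_categories(title):
    # Fetch the (non-hidden) category labels of a Wikipedia page.
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",   # skip maintenance categories
        "cllimit": "max",
        "redirects": 1,        # follow redirects, e.g. "litigation" -> "Lawsuit"
        "format": "json",
    }
    r = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
    pages = r.json()["query"]["pages"]
    return [c["title"].replace("Category:", "", 1)
            for page in pages.values()
            for c in page.get("categories", [])]

# Walking up from these categories (a few levels at most) lets you test
# whether a term falls under a target category such as "Sports".
print(wikipedia_categories("Football"))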
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: You may need to tweak a simple category-based approach to suit your needs. E.g. there is no Wikipedia page for "litigation"; instead, it redirects you to a page titled "lawsuit". Such cases need to be handled separately.
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... however, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for some relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance the "entities" that will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent; a toy sketch of this last step follows.
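As a toy illustration, under the strong simplifying assumption that you already know the practice-area head nouns of interest (the set below is made up):

# If the last conjunct carries a known head noun (e.g. "litigation"),
# copy it onto the earlier conjuncts that lack one.
HEAD_NOUNS = {"litigation", "law", "matters", "protection"}  # illustrative

def distribute_head(conjuncts):
    head = conjuncts[-1].split()[-1]
    if head not in HEAD_NOUNS:
        return conjuncts
    return [c if c.split()[-1] in HEAD_NOUNS else f"{c} {head}" for c in conjuncts]

print(distribute_head(["employment", "tax", "copyright litigation"]))
# -> ['employment litigation', 'tax litigation', 'copyright litigation']

Real coordinated structures are far messier than this, which is why the literature on ellipsis and coreference is worth reading first.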
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner
Is there any metric that measures the wealth of information in a text?
I am thinking of anything that can reliably show unique information segments within a text. Simple metrics using frequency distributions or unique words are okay, but they don't quite capture unique information in sentences.
Using coding methods, I would have to manually code each sentence/word, or anything that would count as a unique piece of information in the text, but that could take a while. So I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I could count 4 unique information points: what Navtilos is, where it is, how it was created, and when.
Obviously a human interprets text differently than a computer does. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.
A frequency distribution may work effectively but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that uniqueness of a keyword to a document reflects how well it describes the contents of the document). For this, you can use a measure like tf-idf.
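For example, a small hand-rolled sketch of the tf-idf weight of an extracted keyword (one simple variant of the formula; the documents below are made up):

import math

docs = [
    "navtilos is a small volcanic islet of the santorini volcano",
    "santorini is a popular destination in the aegean sea",
    "the islet appeared during an eruption in 1928",
]

def tf_idf(keyword, doc, docs):
    words = doc.split()
    tf = words.count(keyword) / len(words)              # term frequency in this document
    df = sum(1 for d in docs if keyword in d.split())    # documents containing the keyword
    idf = math.log(len(docs) / (1 + df))                 # one simple idf variant
    return tf * idf

print(tf_idf("navtilos", docs[0], docs))   # higher: unique to the first document
print(tf_idf("santorini", docs[0], docs))  # lower: appears in several documents

Keywords with high tf-idf in a document are the ones most specific to it, which is one proxy for the "unique information" it carries.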