I am performing text analysis for a document using mostly Pandas, NLTK, and TextBlob. I want to obtain the frequencies of only predefined terms. The rows in the document are reviews, and there is a predefined list of word associations that characterize a good or a bad review. Good reviews are likely to have (easy-->use), (service-->good), while bad reviews are likely to have (fee-->lower), (easier-->make).
What I would like to do is use these associations to classify reviews as good or bad based on whether, for example, the text in a particular row has "easy" and "use" closely associated, or "service" and "good". What I am stuck on is the structure: how do I come up with a "good" and a "bad" dictionary of tuples and use this, maybe with n-grams or something similar, to derive an association?
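One possible structure, as a minimal sketch only: keep the associations as lists of word tuples and count a pair as present when both words occur within a few tokens of each other. The column name review, the window size, and the simple tokenizer below are my own placeholder choices, not anything prescribed.

import re
import pandas as pd

# Association lists from the question; each tuple means "these two words near each other".
good_pairs = [("easy", "use"), ("service", "good")]
bad_pairs = [("fee", "lower"), ("easier", "make")]

def has_pair(tokens, pair, window=3):
    """True if both words of the pair occur within `window` tokens of each other."""
    a_pos = [i for i, t in enumerate(tokens) if t == pair[0]]
    b_pos = [i for i, t in enumerate(tokens) if t == pair[1]]
    return any(abs(a - b) <= window for a in a_pos for b in b_pos)

def classify(text, window=3):
    tokens = re.findall(r"[a-z']+", text.lower())   # could swap in nltk.word_tokenize instead
    good = sum(has_pair(tokens, p, window) for p in good_pairs)
    bad = sum(has_pair(tokens, p, window) for p in bad_pairs)
    return "good" if good > bad else "bad" if bad > good else "unknown"

df = pd.DataFrame({"review": ["So easy to use, and the service was good."]})
df["label"] = df["review"].apply(classify)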
I have a database containing about 3 million texts (tweets). I put the cleaned texts (stop words, tags, etc. removed) into a list of lists of tokens called sentences (so it contains a list of tokens for each text).
After these steps, if I write
model = Word2Vec(sentences, min_count=1)
I obtain a vocabulary of about 400,000 words.
I also have a list of words (belonging to the same topic, in this case: economics) called terms. I found that 7% of the texts contain at least one of these words (so we can say that 7% of total tweets talk about economics).
My goal is to expand the list terms in order to retrieve more texts belonging to the economic topic.
Then I use
results = model.most_similar(terms, topn=5000)
to find, within the list of lists of tokens sentences, the words most similar to those contained in terms.
Finally if I create the data frame
df = pd.DataFrame(results, columns=['key', 'similarity'])
I get something like this:
key similarity
word1 0.795432
word2 0.787954
word3 0.778942
... ...
Now I think I have two possibilities to define the expanded glossary:
I take the first N words (what should be the value of N?);
I look at the suggested words one by one and decide which one to include in the expanded glossary based on my knowledge (does this word really belong to the economic glossary?)
How should I proceed in a case like this?
There's no general answer for what the cutoff should be, or how much you should use your own manual judgement versus cruder (but fast/automatic) processes. Those are inherently decisions which will be heavily influenced by your data, model quality, & goals – so you have to try different approaches & see what works for your case.
If you had a goal for what percentage of the original corpus you want to take – say, 14% instead of 7% – you could go as deeply into the ranked candidate list of 'similar words' as necessary to hit that 14% target.
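As a rough sketch of that coverage-driven cutoff (reusing sentences, terms, and results from the question; the 14% figure is only the example target):

target = 0.14                      # example coverage target
expanded = set(terms)
total = len(sentences)
covered = sum(1 for toks in sentences if expanded.intersection(toks))

for word, _sim in results:         # results is already ranked by similarity
    if covered / total >= target:
        break
    if word in expanded:
        continue
    # count only the texts this candidate word newly brings in
    newly = sum(1 for toks in sentences
                if word in toks and not expanded.intersection(toks))
    expanded.add(word)
    covered += newly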
Note that when you retrieve model.most_similar(terms), you are asking the model to 1st average all words in terms together, then return words close to that one average point. To the extent your seed set of terms is tightly around the idea of economics, that might find words close to that generic average idea – but might not find other interesting words, such as close synonyms of your seed words that you just hadn't thought of. For that, you might want to get not 5000 neighbors for one generic average point, but (say) 3 neighbors for every individual term. To the extent the 'shape' of the topic isn't a perfect sphere around someplace in the word-vector-space, but rather some lumpy complex volume, that might better reflect your intent.
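For example, a sketch only (using the model.wv accessor of recent gensim releases; older versions also accept model.most_similar directly):

candidates = {}
for term in terms:
    if term in model.wv:                               # skip seeds missing from the vocabulary
        for word, sim in model.wv.most_similar(term, topn=3):
            # keep the best similarity when several seeds suggest the same word
            candidates[word] = max(sim, candidates.get(word, 0.0))

ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)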
Instead of using your judgement of the candidate words standing alone to decide whether a word is economics-related, you could instead look at the texts that a word uniquely brings in. That is, for new word X, look at the N texts that contain that word. How many, when applying your full judgement to their full text, deserve to be in your 'economics' subset? Only if it's above some threshold T would you want to move X into your glossary.
But such an exercise may just highlight: using a simple glossary – "for any of these hand-picked N words, every text mentioning at least 1 word is in" – is a fairly crude way of assessing a text's topic. There are other ways to approach the goal of "pick a relevant subset" in an automated way.
For example, you could view your task as that of training a binary text classifier to classify texts as 'economics' or 'not-economics'.
In such a case, you'd start with some training data - a set of example documents that are already labeled 'economics' or 'not-economics', perhaps via individual manual review, or perhaps via some crude bootstrapping (like labeling all texts with some set of glossary words as 'economics', & all others 'not-economics'). Then you'd draw from the full range of potential text-preprocessing, text-feature-extraction, & classification options to train & evaluate classifiers that make that judgement for you. Then you'd evaluate/tune those – a process which might also improve your training data, as you add new definitively 'economics' or 'not-economics' texts – & eventually settle on one that works well.
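A minimal sketch of that route, assuming you bootstrap labels from the current glossary and use scikit-learn's TF-IDF features with logistic regression (one reasonable choice among many):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [" ".join(toks) for toks in sentences]                       # back to plain strings
labels = [int(bool(set(terms) & set(toks))) for toks in sentences]   # crude bootstrap labels

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

clf = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))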
Alternatively, you could use some other richer topic-modeling methods (LDA, word2vec-derived Doc2Vec, deeper neural models etc) for modeling the whole dataset, then from some seed-set of definite-'economics' texts, expand outward from them – finding nearest-examples to known-good documents, either auto-including them or hand-reviewing them.
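For the Doc2Vec variant of that idea, a sketch (gensim 4.x naming; the parameters are illustrative, not tuned):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(toks, [i]) for i, toks in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=5, epochs=20)

# seed texts you already trust as 'economics', e.g. those matching the glossary
seed_ids = [i for i, toks in enumerate(sentences) if set(terms) & set(toks)]

# candidate texts: nearest neighbours of one seed document, for hand review or auto-inclusion
similar_docs = d2v.dv.most_similar(positive=[seed_ids[0]], topn=20)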
Separately: min_count=1 is almost always a mistake in word2vec & related algorithms, which do better if you discard words so rare they lack the variety of multiple usage examples the algorithm needs to generate good word-vectors.
My question isn't about a specific code issue, but rather about the best direction to take on a Natural Language Processing challenge.
I have a collection of several hundred Word and PDF files (from which I can then export the raw text) on the one hand, and a list of terms on the other. Terms can consist of one or more words. What I need to do is identify in which file(s) each term is used, applying stemming and lemmatization.
How would I best approach this? I know how to extract text, apply tokenization, lemmatization, etc., but I'm not sure how I could search for occurrences of terms in lemmatized form inside a corpus of documents.
Any hint would be most welcome.
Thanks!
I would suggest you create an inverted index of the documents, where you record the locations of each word in a list, with the word form as the index.
Then you create a secondary index, where you have as key the lemmatised forms, and as values a list of forms that belong to the lemma.
When you do a lookup of a lemmatised word, e.g. go, you go to the secondary index to retrieve the inflected forms go, goes, going, went; next you go to the inverted index and get all the locations for each of the inflected forms.
For ways of implementing this in an efficient manner, look at Witten/Moffat/Bell, Managing Gigabytes, which is a fantastic book on this topic.
UPDATE: for multi-word units, you can work those out from the index. To look for "software developer", look up "software" and "developer", and then merge the locations: every time their locations differ by 1, they are adjacent and you have found an occurrence. Discard all the ones where they are further apart.
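A rough sketch of those pieces in Python (the structure and function names are mine, and NLTK's WordNet lemmatizer is just one possible choice of lemmatiser):

from collections import defaultdict
from nltk.stem import WordNetLemmatizer   # nltk.download('wordnet') may be needed once

lemmatizer = WordNetLemmatizer()
inverted = defaultdict(list)     # word form -> list of (doc_id, position)
lemma_index = defaultdict(set)   # lemma     -> set of word forms seen in the corpus

def index_document(doc_id, tokens):
    for pos, tok in enumerate(tokens):
        tok = tok.lower()
        inverted[tok].append((doc_id, pos))
        lemma_index[lemmatizer.lemmatize(tok)].add(tok)

def lookup(lemma):
    """All (doc_id, position) pairs for any recorded form of the lemma."""
    return [loc for form in lemma_index.get(lemma, ()) for loc in inverted[form]]

def lookup_phrase(lemma_a, lemma_b):
    """Multi-word units: keep locations where the two terms are adjacent."""
    b_locations = set(lookup(lemma_b))
    return [(doc, pos) for doc, pos in lookup(lemma_a) if (doc, pos + 1) in b_locations]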
I'm trying to classify a list of documents. I'm using CountVectorizer and TfidfVectorizer to vectorize the documents before classification. The results are good, but I think they could be better if we consider not only the existence of specific words in the document but also the order of these words. I know that it is possible to consider pairs and triples of words, but I'm looking for something more inclusive.
Believe it or not, bag-of-words approaches work quite well on a wide range of text datasets. You've already thought of bi-grams or tri-grams. Let's say you had 10-grams. You would have information about the order of your words, but it turns out there is rarely more than one instance of each 10-gram, so there would be few examples for your classification model to learn from. You could try some other custom feature engineering based on the text, but it would be a good amount of work that rarely helps much. There are other successful approaches in Natural Language Processing, especially in the last few years, but they usually focus on more than word ordering.
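If you do want to experiment with n-grams, a small sketch with scikit-learn and toy documents: ngram_range=(1, 2) keeps single words and adds adjacent pairs, and min_df shows how quickly longer n-grams become too rare to be useful features.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the service was easy to use",
    "easy to use and good service",
    "fees should be lower",
]

# min_df=2 drops n-grams seen in only one document; very long n-grams would rarely survive it
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())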
I just started learning how NLP works. What I can do right now is get the frequency of a specific word per document. But what I'm trying to do is compare the four documents that I have, to measure their similarities and differences, as well as display the words they share and the words unique to each document.
My documents are in .csv format, imported using pandas; each row has its own sentiment.
To be honest, the question you're asking is very high level and difficult (maybe impossible) to answer on a forum like this. So here are some ideas that might be helpful:
You could try to use [term frequency–inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to compare the vocabularies for similarities and differences. This is not a large step from your current word-frequency analysis.
For a more detailed analysis, it might be a good idea to substitute the words of your documents with something like wordnet's synsets. This makes it possible to compare the sentence meanings at a higher level of abstraction than the actual words themselves. For example, if each of your documents mentions "planes", "trains", and "automobiles", there is an underlying similarity (vehicle references) that a simple word comparison will not be able to detect.
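As a rough sketch of the first suggestion above (TF-IDF vectors plus cosine similarity for the four documents, and simple set operations for shared versus unique words; the documents here are toy examples only):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "planes trains and automobiles are vehicles",
    "trains and automobiles need fuel",
    "planes and ships need fuel too",
    "cooking recipes and spices",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(pd.DataFrame(cosine_similarity(X)).round(2))      # 4x4 pairwise similarity matrix

vocabs = [set(d.split()) for d in docs]
shared = set.intersection(*vocabs)                                # words in every document
unique = [v.difference(*(vocabs[:i] + vocabs[i + 1:]))            # words only in document i
          for i, v in enumerate(vocabs)]
print(shared, unique)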
Is there any metric that measures wealth of information on a text?
I am thinking in terms of anything that can reliably show unique information segments within a text. Simple metrics using frequency distributions or unique words are okay but they don't quite show unique information in sentences.
Using coding methods I would have to manually code each sentence/word or anything that would count as a unique piece of information in a text, but that could take a while. So, I wonder if I could use NLP as an alternative.
UPDATE
As an example:
Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.
If I were to use coding analysis, I can count 4 unique information points: what Navtilos is, where it is, how it was created, and when.
Obviously a human interprets text differently than a computer does. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.
A frequency distribution may work effectively but I wonder if there are other metrics for this.
What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor will extract phrases consisting of one or more words that capture some notions mentioned in the text, but without classifying them into classes (as named entity recognisers would do).
See, for example, this demo. From the sentence in your example, it extracts:
small volcanic islet
Navtilos
Santorini
If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that uniqueness of a keyword to a document reflects how well it describes the contents of the document). For this, you can use a measure like tf-idf.
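A small sketch of that last step, assuming docs holds the raw texts and keywords the phrases an extractor returned (toy data; TfidfVectorizer with a fixed vocabulary is one convenient way to score each keyword per document):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.",
    "Santorini is a popular holiday destination.",
    "A completely unrelated text about cooking.",
]
keywords = ["small volcanic islet", "navtilos", "santorini"]   # e.g. from the extractor above

# ngram_range must be wide enough to cover the longest keyword phrase
vec = TfidfVectorizer(vocabulary=keywords, ngram_range=(1, 3))
scores = vec.fit_transform(docs)
print(pd.DataFrame(scores.toarray(), columns=vec.get_feature_names_out()).round(2))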