NLP process for combining common collocations - python

I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine words that commonly co-occur so that they are treated as a single token; i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see this pair represented as "New York" as if it were one word, alongside the other unigrams.
What is this process called, of putting meaningful, common n-grams on the same footing as unigrams? Is it not a thing? Finally, what would the tm_map call look like for this?
mydata.corpus <- tm_map(mydata.corpus, fancyfunction)
And/or in python?

I recently had a similar question and played around with collocations.
This was the solution I chose to identify pairs of collocated words:
from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = "<a long text read in as a string>"
tokenized_text = word_tokenize(text)

# score bigrams by raw frequency
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)
scored = finder.score_ngrams(bigram_measures.raw_freq)
scored = sorted(scored, key=lambda s: s[1], reverse=True)
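To go from scored pairs to the merged tokens asked about in the question, one option (not part of the answer above, just a sketch of my own) is NLTK's MWETokenizer, which retokenizes a token list so that chosen multi-word expressions become single tokens:
from nltk.tokenize import MWETokenizer

# hypothetical example: suppose ('New', 'York') scored highly above
merger = MWETokenizer([('New', 'York')], separator=' ')
print(merger.tokenize("I moved to New York last year".split()))
# ['I', 'moved', 'to', 'New York', 'last', 'year']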

Related

Meaningless Spacy Nouns

I am using Spacy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")

sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)   # replace non-word characters with spaces
cleanString = cleanString.replace("_", " ")

doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", I get the spaCy noun
foward2
This shows that the extracted nouns include meaningless tokens like sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway etc.
I have tried using WordNet to keep only the nouns that appear in both WordNet and the spaCy output, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution that keeps most of the sensible nouns from the spaCy noun list I have obtained.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are mis-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)

for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Splitting the sentence/phrase and feeding in the words one at a time also gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to combine more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
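As an illustration of that last idea (my own sketch, not code from the answer; it assumes en_core_web_lg is installed):
import spacy

nlp = spacy.load("en_core_web_lg")  # larger model with word vectors
doc = nlp("fast foward2 on the highway with a brailledisplayfastmovement")
# keep only the nouns the model recognises as in-vocabulary
known_nouns = [t.text for t in doc if t.pos_ == "NOUN" and not t.is_oov]
print(known_nouns)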

tf-idf with TfidfVectorizer on Japanese text

I am working with a huge collection of documents written in several languages. I want to compute cosine distance between documents from their tf-idf scores. So far I have:
from sklearn.feature_extraction.text import TfidfVectorizer
# The documents are located in the same folder as the script
text_files = [r'doc1', r'doc2', r'doc3']
files = [open(f) for f in text_files]
documents = [f.read() for f in files]
vectorizer = TfidfVectorizer(ngram_range=(1,1))
tfidf = vectorizer.fit_transform(documents)
vocabulary = vectorizer.vocabulary_
When the three documents doc1, doc2 and doc3 contain English text, the algorithm works like a charm, and vocabulary does indeed contain unigrams from the different bodies of text. I tried with Russian too, and it also worked great. However, when I try with some Japanese text, the algorithm no longer works as intended.
The problem arises from the fact that Japanese does not use spaces between words, so TfidfVectorizer cannot tell what is a word and what isn't. For example, I get something like this in my unigram vocabulary:
診多索いほ権込真べふり告車クノ般宮えぼぜゆ注携ゆクく供9時ク転組けが意見だっあ税新ト復生ひり教台話辞ゃに
This is clearly a sentence and not a word. How can I solve this problem?
You should provide a tokenizer for the Japanese text:
vectorizer = TfidfVectorizer(ngram_range=(1,1), tokenizer=jap_tokenizer)
where jap_tokenizer is either a function you create or one like this.
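For illustration only (my own sketch, not the tokenizer linked in the answer), jap_tokenizer could be built on a Japanese morphological analyzer such as Janome:
from sklearn.feature_extraction.text import TfidfVectorizer
from janome.tokenizer import Tokenizer  # pip install janome

janome = Tokenizer()

def jap_tokenizer(text):
    # split Japanese text into word-level surface forms
    return [token.surface for token in janome.tokenize(text)]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), tokenizer=jap_tokenizer)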
For English, your documents presumably look something like this:
documents = ['one word after another', 'two million more words', 'finally almost there']
For your Japanese documents, call them j_doc1, j_doc2, and j_doc3, documents probably looks like this (just an example; bear with me as I didn't bother creating random Japanese sentences):
documents = ['診多索いほ', '診多索いほ', '台話辞ゃに']
The current tokenizer looks for spaces, which your string doesn't have. You could try this:
documents = [" ".join(char for char in d) for d in documents]
Now documents looks like this, which may be more feasible (although that's up to you, as I don't know whether it's appropriate to always add a space between each Japanese character):
documents
Out[40]: ['診 多 索 い ほ', '診 多 索 い ほ', '台 話 辞 ゃ に']
Or define your own tokenizer, as referred to in another answer.

NLTK Most common synonym (Wordnet) for each word

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word on it.
If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.
Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.
Input: "Hello my valued friend"
Return: "Hi my dear friend"
Synonyms are tricky, but if you are starting out with a synset from Wordnet and you simply want to choose the most common member in the set, it's pretty straightforward: Just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.
NLTK will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:
import nltk
from nltk.corpus import brown

freqs = nltk.FreqDist(w.lower() for w in brown.words())
You can then look up the frequency of a word like this:
>>> print(freqs["valued"])
14
Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (wordnet provides n, v, a, and r, resp. noun, verb, adjective and adverb), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.
>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...                                  brown.tagged_words(tagset="universal"))
>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
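Putting the pieces together, a rough sketch of the lookup described above (my own illustration; it assumes you have already settled on a synset, and it skips the POS-specific counting and word-sense disambiguation):
import nltk
from nltk.corpus import brown, wordnet as wn

freqs = nltk.FreqDist(w.lower() for w in brown.words())

def most_common_synonym(synset):
    # lemma_names() lists the synset's members; multiword lemmas use underscores
    candidates = [name.replace('_', ' ') for name in synset.lemma_names()]
    return max(candidates, key=lambda w: freqs[w.lower()])

# e.g. the first synset for "hello" also contains the lemma "hi"
print(most_common_synonym(wn.synsets('hello')[0]))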
Synonyms are a huge and open area of work in natural language processing.
In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.
Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
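As a quick illustration of that point (my own sketch, using the Brown corpus only because it ships with NLTK):
import nltk
from nltk.corpus import brown

# count bigrams across the corpus and compare the contexts mentioned above
bigram_freqs = nltk.FreqDist(nltk.bigrams(w.lower() for w in brown.words()))
for pair in [('dear', 'friend'), ('valued', 'friend'), ('dear', 'customer'), ('valued', 'customer')]:
    print(pair, bigram_freqs[pair])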
Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).
I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!
Edit: I should have included this link to an older question as well:
How to get synonyms from nltk WordNet Python
And the NLTK documentation on its interface with WordNet:
http://www.nltk.org/howto/wordnet.html
I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.
The other answer shows you how to look up a word's synsets (its groups of synonyms):
from nltk.corpus import wordnet as wn
wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining what's the most common synonym. This question is highly domain dependent. Most common synonym where? In literature? In common vernacular? In technical speak?
Like you, I wanted to get an idea of how the English language is actually used. I downloaded 15,000 entire books from Project Gutenberg and computed word and letter-pair frequencies across all of them. After ingesting such a large corpus, I could see which words are used most commonly. As I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from the kind of text you will eventually have to process.

How can I take a few paragraphs of text, see if any sentence has a pronoun and select all those sentences to make a new paragraph?

Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she)? I want to select any sentence that has a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project with similar needs. I suggest you use NLTK, since it makes things really easy and gives us a lot of flexibility. Since you need to collect all sentences containing pronouns, you can split the text into sentences and hold them in a list. Then you can iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of each such sentence (in the list), or form a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    # tag each word individually and stop at the first personal pronoun (PRP)
    for word in words:
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check out the pylucene and whoosh libraries, which are pretty useful Python packages for search and text indexing.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically, we split the string into a list of sentences, split each sentence into a list of words, and convert each sentence's words into a set of part-of-speech tags (this is important since, if we didn't, a sentence with multiple pronouns could show up in the results more than once).

Python script to find word frequencies of a given document

I am looking for a simple script that can find word frequencies for a given document (probably using the Porter stemmer).
Is there any library or simple script that does this process?
Use NLTK:
import nltk

YOUR_STRING = "Your words"
words = YOUR_STRING.split()
freq_dist = nltk.FreqDist(words)

# 50 most frequent words, as (word, count) pairs
most_frequent = freq_dist.most_common(50)
# 50 least frequent words
least_frequent = freq_dist.most_common()[-50:]
You should be able to count words. Use a collections.Counter or a dict, depending on what you need. That part is easy, but if it isn't, you can find the answer by searching on SO itself.
I think you also want the Porter Stemmer, which has a Python version at http://tartarus.org/~martin/PorterStemmer/python.txt
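Putting the two suggestions together, a small sketch (my own; it uses NLTK's bundled PorterStemmer rather than the standalone script linked above):
from collections import Counter
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
text = "Cats chase mice; a cat chased a mouse."
# stem each alphabetic token, then count stem frequencies
stems = [stemmer.stem(w) for w in word_tokenize(text.lower()) if w.isalpha()]
print(Counter(stems).most_common(5))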
