How to generate bigram/trigram corpus only

Is there a way for Gensim to generate strictly the bigrams, trigrams in a list of words?
I can successfully generate the unigrams, bigrams, trigrams but I would like to extract only the bigrams, trigrams.
For example, in the list below:
words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],["i","love","new","york"],["new","york","is","great"]]
I use
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]
This creates a list of unigrams and bigrams as follows:
[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
['i', 'love', 'new_york'],
['new_york', 'is', 'great']]
My question is, is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?

It's not a built-in option of the gensim Phrases functionality.
If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:
words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]
(You could change the joining character if for some reason there were underscores in your unigrams, and you didn't want those confused with Phrases-combined bigrams.)
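As a rough sketch of that filtering step, the snippet below works on the already-transformed token lists from the question, so it does not call gensim at all. The helper name keep_phrases and its n parameter are made up for illustration; it assumes that no original unigram contains the joining character and that each joined part contributes exactly one delimiter.

```python
def keep_phrases(docs, n=None, delimiter="_"):
    """Keep only tokens that contain the delimiter (hypothetical helper).

    If n is given, keep only phrases made of exactly n parts,
    e.g. n=2 for bigrams, n=3 for trigrams.
    """
    kept = []
    for doc in docs:
        phrases = [tok for tok in doc if delimiter in tok]
        if n is not None:
            # a phrase of n parts contains n - 1 delimiters
            phrases = [tok for tok in phrases if tok.count(delimiter) == n - 1]
        kept.append(phrases)
    return kept


# Output of the Phraser step, copied from the question
transformed = [
    ['the', 'mayor', 'of', 'new_york', 'was', 'there'],
    ['i', 'love', 'new_york'],
    ['new_york', 'is', 'great'],
]

only_bigrams = keep_phrases(transformed, n=2)
only_trigrams = keep_phrases(transformed, n=3)
print(only_bigrams)
```

Counting delimiters is one possible way to tell bigrams from trigrams after a second Phrases pass; it breaks down if the unigrams themselves contain underscores.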
If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300
...and either extend that module with your extra needed option, or mimic its behavior outside the class in your own code.
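If you go the route of mimicking the behaviour in your own code, a minimal sketch might look like the following. It is not gensim's implementation: the function name score_bigrams is hypothetical, and the scoring rule is a simplified version of the count-based formula, using only the number of distinct unigrams as the vocabulary size. It returns the bigrams directly, so no unigrams need to be filtered out afterwards.

```python
from collections import Counter


def score_bigrams(docs, min_count=1, threshold=1.0):
    """Return {(word_a, word_b): score} for bigrams scoring above threshold.

    Simplified scoring (an assumption, not gensim's exact code):
        (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)

    scored = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scored[(a, b)] = score
    return scored


words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
         ['i', 'love', 'new', 'york'],
         ['new', 'york', 'is', 'great']]

scores = score_bigrams(words, min_count=1, threshold=1)
print(["_".join(pair) for pair in scores])
```

On the example from the question this keeps only the pair ('new', 'york'), since every other pair occurs once and scores 0 with min_count=1.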

Related

How can I get the number of words that have been influenced by the lemmatization approach in a text?

For example, in the below sentence where the lemmatizer has affected 5 words, the number 5 should be displayed in the output.
lemmatizer = WordNetLemmatizer()
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']

Probably not the most elegant way, but as a workaround you could try to compare every single element of the tokenized sentence and the lemmatized sentence (as far as I know, lemmatization does not remove elements so this should work).
Something like this:
count = 0
for i, el in enumerate(tokenized):
    if el != lemmatized[i]:
        count += 1
The value of count will be the number of positions at which the two lists differ, and thus the number of elements affected by lemmatization.
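The same comparison can be sketched with zip, which also makes it easy to see which words changed. The two lists below are typed in by hand from the example in the question (they stand in for the output of nltk.word_tokenize and the lemmatizer), so the snippet runs without NLTK; the variable names are only illustrative.

```python
# Tokens of the example sentence and their lemmas, copied from the question
tokenized = ['The', 'striped', 'bats', 'are', 'hanging',
             'on', 'their', 'feet', 'for', 'best']
lemmatized = ['The', 'strip', 'bat', 'be', 'hang',
              'on', 'their', 'foot', 'for', 'best']

# Pair each token with its lemma and keep the pairs that differ
changed = [(tok, lem) for tok, lem in zip(tokenized, lemmatized) if tok != lem]
count = len(changed)

print(count)
print(changed)
```

This relies on the same assumption as above: the lemmatizer returns exactly one lemma per input token, in the same order.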

Python: CountVectorizer ignores one letter word "I"

I have a list called dictionary1. I use the following code to get sparse count matrices of texts:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)
I notice however that
list(set(dictionary1)-set(cv1.get_feature_names()))
results in ['i']. So "i" is in my dictionary but CountVectorizer ignores it (presumably some default setting discards one-char words). In the documentation I could not find such an option. Can someone point me to the problem? Indeed I would like to keep "i" in my analysis, as it could refer to more personal language.

A working work-around is passing the dictionary as the vocabulary directly (actually I don't know why I did not do that in the first place). I.e.
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=[], vocabulary=dictionary1)
cv1._validate_vocabulary()
list(set(dictionary1)-set(cv1.get_feature_names())) then returns [].
In my original post, I should have mentioned that dictionary1 already is a list of unique tokens.

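By default, CountVectorizer tokenizes with the regular expression token_pattern=r"(?u)\b\w\w+\b", which only extracts words of at least 2 characters, so single-letter tokens such as "i" are dropped.
See the scikit-learn documentation on text feature extraction for more details about the vectorizers.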
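To see the effect of the two-character minimum without installing anything, the sketch below applies the default token pattern with the standard re module and compares it with a relaxed pattern that also accepts single characters. The constant names are made up for this example, and the lowercasing is done by hand to mirror the vectorizer's default lowercase=True.

```python
import re

# CountVectorizer's documented default token_pattern
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed variant that also keeps one-character tokens (illustrative name)
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "I received a cat".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)   # 'i' and 'a' are missing
print(relaxed_tokens)   # 'i' and 'a' are kept
```

If the relaxed pattern behaves as you want on your data, it can be passed to CountVectorizer through its token_pattern parameter as an alternative to supplying a custom tokenizer.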
In your case, you should use a different tokenizer, not analyzer. For example, you can use TweetTokenizer from nltk library:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
corpus = [...some_texts...]
tk = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tk.tokenize)
x = vectorizer.fit_transform(corpus)
For example, if corpus is defined as below, you would get:
corpus = ['I love ragdolls',
'I received a cat',
'I take it as my best friend']
vectorizer.get_feature_names()
> ['a', 'as', 'best', 'cat', 'friend', 'i', 'it', 'love', 'my', 'ragdolls', 'received', 'take']

Getting words out of a numpy array of sentence strings

I have a numpy array of sentences (strings)
arr = np.array(["It's the most wonderful time of the year.",
                'With the kids jingle belling.',
                'And everyone telling you be of good cheer.',
                "It's the hap-happiest season of all."])
(that I read from a csv file). I need to make a numpy array with all the unique words in these sentences.
So what I need is
array(["It's", "the", "most", "wonderful", "time", "of", "year", "With", "the", "kids", "jingle", "belling", "and", "everyone", "telling", "you", "be", "good", "cheer", "It's", "hap-happiest", "season", "all"])
I could do this like
o = []
for x in arr:
    o += x.split()
words = np.array(o)
unique_words = np.array(list(set(words.tolist())))
but as this involves first making lists and then converting that to numpy array, it's obviously gonna be slow and inefficient for large data.
I also tried nltk as in
words = np.array([])
for x in arr:
    words = np.append(words, nltk.word_tokenize(x))
but this too seems inefficient, as a new array is created on each iteration instead of the old one being modified.
I suppose there's some elegant way of achieving what I want using more of numpy.
Can you point me in the right direction?

I think you can try something like this:
vocab = set()
for x in arr:
    vocab.update(nltk.word_tokenize(x))
set.update() takes an iterable to add elements to existing set.
Update:
Also, you can look at the working of CountVectorizer in scikit-learn which:
converts a collection of text documents to a matrix of token counts.
And it uses a dictionary to keep track of the unique words:
# raw_documents is an iterable of sentences.
for doc in raw_documents:
    feature_counter = {}
    # analyze will split the sentences into tokens
    # and apply some preprocessing on them (such as lowercasing and n-gram extraction)
    for feature in analyze(doc):
        try:
            # vocabulary is a dictionary mapping each word to its feature index
            feature_idx = vocabulary[feature]
            ...
    ...
And I think it works pretty efficiently. So I think you can also use a dict() instead of set. I am not familiar with working of NLTK, but I think that must also contain something equivalent to CountVectorizer.

I'm not sure numpy is the best way to go here. You can achieve what you want with nested lists and sets or dictionaries.
One useful thing to know is that the tokenizer methods from nltk can process a list of sentences, and will return a list of tokenized sentences. For example:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
tokenized = wpt.tokenize_sents(arr)
This will return a list of lists of the tokenized sentences in arr, i.e.:
[['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
['With', 'the', 'kids', 'jingle', 'belling', '.'],
['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.']]
nltk comes with lots of different tokenizers, and so will give you options for how best to split the sentences into word tokens. You can then use something like the following to get the unique set of words / tokens:
unique_words = set()
for toks in tokenized:
    unique_words.update(toks)
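If the order in which words first appear matters, a dict can replace the set, since Python dicts preserve insertion order. The sketch below is standard-library only: it swaps the NLTK tokenizer for a simple regular expression (an assumption made for this example) that keeps apostrophes and hyphens inside words, and the function name unique_words_in_order is hypothetical.

```python
import re

sentences = [
    "It's the most wonderful time of the year.",
    "With the kids jingle belling.",
    "And everyone telling you be of good cheer.",
    "It's the hap-happiest season of all.",
]

# Assumed token pattern: runs of letters, optionally joined by ' or -
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def unique_words_in_order(docs):
    """Collect each distinct token once, in order of first appearance."""
    seen = {}
    for doc in docs:
        for token in TOKEN_RE.findall(doc):
            seen.setdefault(token, None)
    return list(seen)


unique_words = unique_words_in_order(sentences)
print(unique_words)
```

The resulting list can be wrapped in np.array(...) once at the end, which avoids building intermediate arrays on every iteration.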

Clustering synonym words using NLTK and Wordnet

Given a set of words V, I would like to group the synonym words in V together. I am wondering if there is any built-in function in NLTK and Wordnet that takes V as the input and automatically cluster them based on synonymity.
I already know how to extract the synonym of each word, but this is not what I am looking for. If I do so, the problem becomes complicated when the synonym sets are intersecting each other, or being subset/superset of each other, which needs writing a function removing the conflicts.
As an example, let's consider
V = ["good","constipate","bad","nice","defective","right","respectable","powerful"]
What I want to get as output is:
[('constipate'), ('nice'), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]
Now based on the size/number of the clusters, some sets might split into several sets, or combine together. Here, I am just caring for the words in V and their synonyms in V.

Yes, there is a way to do this using nltk and wordnet. Following is an example. I am using the built-in synsets and looking for synonyms of 'book':
import nltk
from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
The resulting synonyms for 'book' are:
print(synonyms)
>>['book', 'book', 'volume', 'record', 'record_book', 'book', 'script', 'book', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'book', 'book', 'book', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', ..]
length of synonyms,
len(synonyms)
>>38
Note: Some synonyms are verb forms, and many synonyms are just different usages of 'book'. If, instead, we take the set of synonyms, there are fewer unique words, as shown in the following code:
len(set(synonyms))
>>25
After applying set(), the remaining unique synonyms are:
{'record', 'Quran', 'Holy_Scripture', 'Koran', 'Good_Book', 'playscript', 'book', 'Word_of_God', 'hold', 'Holy_Writ', 'script', 'leger', 'book_of_account', 'Scripture', 'ledger', 'reserve', 'volume', 'record_book', "al-Qur'an", 'Christian_Bible', 'Word', 'rule_book', 'Bible', 'Book', 'account_book'}
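Once a synonym lookup like the one above is available, the grouping asked for in the question can be sketched as finding connected components: two words of V end up in the same cluster if one is listed as a synonym of the other, directly or through a chain. The sketch below uses a small union-find for that. The TOY_SYNONYMS table is invented for illustration and is not WordNet output; with NLTK, the synonyms_of callable could instead return the lemma names collected from wordnet.synsets(word). The function name cluster_by_synonymy is hypothetical.

```python
def cluster_by_synonymy(words, synonyms_of):
    """Group words that are linked, directly or transitively, by synonymy."""
    parent = {w: w for w in words}

    def find(w):
        # follow parent links to the cluster representative (with path halving)
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for w in words:
        for s in synonyms_of(w):
            if s in parent:                 # only care about synonyms inside V
                parent[find(s)] = find(w)   # merge the two clusters

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return sorted((tuple(sorted(c)) for c in clusters.values()),
                  key=lambda c: (len(c), c))


# Toy synonym table, invented for this example (NOT taken from WordNet)
TOY_SYNONYMS = {
    "good": {"right", "respectable", "fine"},
    "powerful": {"good", "strong"},
    "bad": {"defective", "poor"},
}

V = ["good", "constipate", "bad", "nice", "defective",
     "right", "respectable", "powerful"]

clusters = cluster_by_synonymy(V, lambda w: TOY_SYNONYMS.get(w, set()))
print(clusters)
```

Because overlapping synonym sets are simply merged, this sidesteps the intersecting and subset/superset conflicts mentioned in the question, at the cost of sometimes producing large clusters when synonymy chains are long.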

Splitting words using nltk module in Python

I am trying to find a way for splitting words in Python using the nltk module. I am unsure how to reach my goal given the raw data I have which is a list of tokenized words e.g.
['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
As you can see many words are stuck together (i.e. 'to' and 'produce' are stuck in one string 'toproduce'). This is an artifact of scraping data from a PDF file and I would like to find a way using the nltk module in python to split the stuck-together words (i.e. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').
I appreciate any help!

I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation features in the NLTK that will deal with English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It would work for a modest number of relatively short strings--such as the strings in your example list--but would be highly inefficient for longer strings or more numerous strings.) It would need modification, and it will not successfully segment every string in any case.
import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")
def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.
    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of string to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []
    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]
    if not exclude:
        exclude = set()
    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words
As you can see, this approach (presumably) mis-segments only one of your strings:
>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
You can then use chain to flatten this list of lists:
>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
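Since the greedy longest-match approach above can run into dead ends, one alternative is to search over all split points with memoization and keep the segmentation with the fewest words. The sketch below illustrates that idea with the standard library only. It does not use pyenchant: the TOY_VOCAB set is a tiny hand-picked word list assumed for this example, and the function name segment_with_vocab is hypothetical. A real dictionary lookup could be substituted for the membership test.

```python
from functools import lru_cache

# Tiny hand-picked vocabulary, assumed for this example only
TOY_VOCAB = frozenset({
    "to", "produce", "standard", "operating", "procedures",
    "and", "rapid", "of", "case", "work", "casework",
})


def segment_with_vocab(chars, vocab=TOY_VOCAB):
    """Split chars into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no full segmentation exists.
    """
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(chars):
            return ()                      # nothing left to segment
        candidates = []
        for end in range(start + 1, len(chars) + 1):
            word = chars[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:       # the remainder can be segmented too
                    candidates.append((word,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment_with_vocab("toproduce"))
print(segment_with_vocab("standardoperatingprocedures"))
print(segment_with_vocab("ofcasework"))
```

Preferring the fewest words is only one possible tie-breaking rule; it would not by itself resolve ambiguous cases such as 'Updatesample...' in the example above, where both readings use the same number of dictionary words.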

You can easily install the following library and use it for your purpose:
pip install wordsegment
import wordsegment
help(wordsegment)
from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')
The output will be like this:
Out[4]: ['using', 'various', 'molecular', 'biology']
Please refer to http://www.grantjenks.com/docs/wordsegment/ for more details.
