I am working on my bachelor thesis and have to prepare a corpus to train word embeddings.
What I'm wondering is whether it is possible to check a tokenized sentence or text for n-grams and then replace those single tokens with the n-gram.
To make it a bit clearer what I mean:
Input
var = ['Hello', 'Sherlock', 'Holmes', 'my', 'name', 'is', 'Mr', '.', 'Watson','.']
Desired Output
var = ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr_Watson','.']
I know Mr. Watson is not the perfect example right now, but I am wondering whether this is possible, because training my word2vec model without looking for n-grams does not do the job well enough.
import os
import nltk
from nltk import bigrams, trigrams

class MySentence():
    def __init__(self, dirname):
        self.dirname = dirname
        print('Hello init')

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                tokens = nltk.regexp_tokenize(line, pattern=r'\w+|\$[\d\.]+|\S+')
                tokens = [token for token in tokens if len(token) > 1]  # same as unigrams
                bi_tokens = bigrams(tokens)
                yield trigrams(tokens)

sentences = MySentence(path)
N-grams are just sequences of adjacent words, but they don't have to make sense language-wise; for example, "Hello Sherlock" and "Holmes my" could be 2-grams. Rather, it sounds like you are looking for more sophisticated tokenization with language-specific context, or named entity recognition ("Sherlock Holmes"), which itself requires a trained model. Check out NLTK's documentation regarding nltk.ne_chunk() or rule-based chunking, or, for out-of-the-box solutions, spaCy's named entity recognition and tokenization capabilities, to get started.
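As a minimal sketch (my own addition, not part of the original answer, and assuming the en_core_web_sm model is installed), spaCy's NER could be used to merge recognized entity spans back into single underscore-joined tokens:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model

def merge_entities(text):
    doc = nlp(text)
    ent_starts = {ent.start: ent for ent in doc.ents}  # token index -> entity span
    merged, i = [], 0
    while i < len(doc):
        if i in ent_starts:
            ent = ent_starts[i]
            merged.append("_".join(tok.text for tok in ent))  # join entity tokens
            i = ent.end
        else:
            merged.append(doc[i].text)
            i += 1
    return merged

print(merge_entities("Hello Sherlock Holmes my name is Mr. Watson."))
# may print something like ['Hello', 'Sherlock_Holmes', 'my', 'name', 'is', 'Mr._Watson', '.'],
# depending on which entities the model actually recognizes

The exact merged tokens depend on the model, so treat the printed example as illustrative only.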
I use the PyThaiNLP package to tokenize my Thai-language data for sentiment analysis.
First, I build a function that adds a new word set and tokenizes the text:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize

def text_tokenize(Mention):
    new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
    words = new_words.union(thai_words())
    custom_dictionary_trie = dict_trie(words)
    dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
    return dataa
After that I apply it within my text_process function, which also removes punctuation and stop words.
from pythainlp.corpus import thai_stopwords

punctuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''
stopwords = thai_stopwords()  # frozenset of Thai stop words

def text_process(Mention):
    final = "".join(u for u in Mention if u not in punctuations)
    final = text_tokenize(final)
    final = " ".join(word for word in final)
    final = " ".join(word for word in final.split() if word.lower() not in stopwords)
    return final

dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
The problem is that this function takes too long to run: it has been going for 17 minutes and is still not finished. I tried to replace
final = text_tokenize(final) with final = word_tokenize(final)
and it took just 2 minutes, but I can no longer use that because I need to add my custom dictionary. I know there is something wrong, but I really don't know how to fix it.
I am new to Python and NLP, so please help.
PS: Sorry for my broken English.
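One possible culprit, judging from the code above (this sketch is my own addition, not part of the original thread): dict_trie rebuilds the custom dictionary trie on every call to text_tokenize. Building the trie once at module level keeps the custom dictionary while avoiding that repeated cost:

# Sketch (assumption, not from the post): build the trie once and reuse it.
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize

new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
custom_dictionary_trie = dict_trie(new_words.union(thai_words()))  # built only once

def text_tokenize(Mention):
    return word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)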
I am not familiar with the Thai language, but I assume that for tokenization you can also use language-agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
Is there a way for Gensim to generate strictly the bigrams and trigrams in a list of words?
I can successfully generate the unigrams, bigrams, and trigrams, but I would like to extract only the bigrams and trigrams.
For example, in the list below:
words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],["i","love","new","york"],["new","york","is","great"]]
I use
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]
This creates a list of unigrams and bigrams as follows:
[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
['i', 'love', 'new_york'],
['new_york', 'is', 'great']]
My question is, is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?
It's not a built-in option of the gensim Phrases functionality.
If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:
words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]
(You could change the joining character if for some reason there were underscores in your unigrams, and you didn't want those confused with Phrases-combined bigrams.)
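For instance (my illustration, assuming a gensim 4.x-style Phrases that accepts a string delimiter argument), you could pick a joining character that cannot appear in your unigrams and filter on that instead:

# Hypothetical variant: use '|' as the phrase delimiter and filter on it.
bigram = gensim.models.Phrases(words, min_count=1, threshold=1, delimiter='|')
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [[token for token in bigram_mod[doc] if '|' in token] for doc in words]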
If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300
...and either extend that module with your extra needed option, or mimic its behavior outside the class in your own code.
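As a rough illustration of that last option (my own sketch, under the assumption that the "original" scorer in the linked phrases.py is score = (bigram_count - min_count) / (count_a * count_b) * vocab_size), you could count and score bigrams yourself and keep only those above the threshold:

from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1):
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        unigram_counts.update(sent)
        bigram_counts.update(zip(sent, sent[1:]))
    vocab_size = len(unigram_counts)
    scored = {}
    for (a, b), ab in bigram_counts.items():
        # assumed to mirror gensim's original scoring rule
        score = (ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        if score > threshold:
            scored["_".join((a, b))] = score
    return scored

print(score_bigrams(words))  # with the example corpus above, only 'new_york' passes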
I'm trying to group similar short descriptions together and currently using ngrams to extract text features. Here's the ngrams function that I'm using:
import re

def generate_ngrams(text, n):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    tokens = [token for token in text.split(" ") if token != ""]
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
However, I'm experiencing some undesired results after clustering. Suppose I have the following two texts:
00011122abc
00111224abc
Using n-grams with n=3, my clustering model grouped these together, which is not what I want. So I think I need to pass a new function into the TF-IDF vectorizer instead of ngrams. I think I need to anchor the first character and create substrings as my features, so for the first text it would be something like this:
[000, 0001, 00011, 000111, 0001112 ...]
Has anyone else experienced similar problems or is there a better way to approach this? Thanks!
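As one way to express that idea (my own sketch, not a tested solution; the analyzer name and parameters are made up), scikit-learn's TfidfVectorizer accepts a callable analyzer, so you can emit anchored prefixes instead of n-grams:

from sklearn.feature_extraction.text import TfidfVectorizer

def prefix_analyzer(text, min_len=3):
    # Emit prefixes anchored at the first character: '000', '0001', '00011', ...
    text = text.lower()
    return [text[:i] for i in range(min_len, len(text) + 1)]

vectorizer = TfidfVectorizer(analyzer=prefix_analyzer)
X = vectorizer.fit_transform(["00011122abc", "00111224abc"])
print(sorted(vectorizer.vocabulary_))  # the anchored-prefix features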
I would like to use spaCy for tokenizing Wikipedia scrapes. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that 'hypotheses.[2][3]' is glued together into one token.
How can I prevent spaCy from connecting '[2][3]' to the previous token?
As long as it is split from the word hypotheses and the period at the end of the sentence, I don't care how it is handled. But individual words and grammar should stay separate from syntactical noise.
So for example, any of the following would be a desirable output:
'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'
I think you could try playing around with infix:
import re
import spacy
from spacy.tokenizer import Tokenizer

infix_re = re.compile(r'''[.]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More on this https://spacy.io/usage/linguistic-features#native-tokenizers
Input:"My favorite game is call of duty."
And I set "call of duty" as a key-words, this phrase will be one word in tokenize process.
Finally want to get the result:['my','favorite','game','is','call of duty']
So, how to set the key-words in python NLP ?
I think what you want is keyphrase extraction, and you can do it, for instance, by first tagging each word with its PoS tag and then applying some sort of regular expression over the PoS tags to join interesting words into keyphrases.
import nltk
from nltk import pos_tag
from nltk import tokenize

def extract_phrases(my_tree, phrase):
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))
    for child in my_tree:
        if type(child) is nltk.Tree:
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)
    return my_phrases

def main():
    sentences = ["My favorite game is call of duty"]
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)
    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([x[0] for x in phrase.leaves()]))

if __name__ == "__main__":
    main()
This will output the following:
Noun phrases:
(NP favorite/JJ game/NN) favorite_game
(NP call/NN) call
(NP duty/NN) duty
But you can play around with
grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
trying other types of expressions, so that you can get exactly what you want, depending on the words/tags you want to join together.
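For example (my own tweak, not from the original answer), adding a noun-preposition-noun pattern might let the parser pick up "call of duty" as a single phrase:

grammar = """
NP: {<NN><IN><NN>}          # e.g. "call of duty" (noun-preposition-noun)
    {<DT>?<JJ>*<NN>|<NNP>*}
"""
cp = nltk.RegexpParser(grammar)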
Also if you are interested, check this very good introduction to keyphrase/word extraction:
https://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/
This is, of course, way too late to be useful to the OP, but I thought I'd put this answer here for others:
It sounds like what you might be really asking is: How do I make sure that compound phrases like 'call of duty' get grouped together as one token?
You can use nltk's multiword expression tokenizer, like so:
import nltk

string = 'My favorite game is call of duty'
tokenized_string = nltk.word_tokenize(string)
mwe = [('call', 'of', 'duty')]
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')  # join the phrase with spaces
tokenized_string = mwe_tokenizer.tokenize(tokenized_string)
Where mwe stands for multi-word expression. With separator=' ' (the default is '_'), the value of tokenized_string will be ['My', 'favorite', 'game', 'is', 'call of duty'].