nltk wordpunct_tokenize vs word_tokenize - python

Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk 3.2.4 and there's nothing in the docstring of wordpunct_tokenize that explains the difference. I couldn't find this info in the NLTK documentation either (perhaps I didn't search in the right place!). I would have expected the first one to get rid of punctuation tokens or the like, but it doesn't.

wordpunct_tokenize is based on a simple regexp tokenization. It is defined as
wordpunct_tokenize = WordPunctTokenizer().tokenize
which you can find here. Basically it uses the regular expression \w+|[^\w\s]+ to split the input.
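For illustration, you can reproduce those splits by applying that pattern directly with re.findall (a quick sketch, not NLTK's actual source):
import re
from nltk.tokenize import wordpunct_tokenize

pattern = re.compile(r"\w+|[^\w\s]+")  # runs of word characters, or runs of non-word, non-space characters
text = "Don't tell her!"
print(pattern.findall(text))      # ['Don', "'", 't', 'tell', 'her', '!']
print(wordpunct_tokenize(text))   # same tokens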
word_tokenize, on the other hand, is based on TreebankWordTokenizer; see the docs here. It basically tokenizes text as in the Penn Treebank. Here is a silly example that should show how the two differ.
sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'",
'Hey', "',", 'she', "'", 'll', 'say', '!']
As we can see, wordpunct_tokenize splits at pretty much every special symbol and treats the pieces as separate units. word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be all that smart, though, since it fails to separate the initial single quote from 'Hey'.
Interestingly, if we write the sentence like this instead (single quotes as string delimiter and double quotes around "Hey"):
sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'
we get
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't",
'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''",
',', 'she', "'ll", 'say', '!']
so word_tokenize does split off double quotes; however, it also converts them to `` and '' (the Penn Treebank convention for opening and closing quotes). wordpunct_tokenize doesn't do this:
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"',
'Hey', '",', 'she', "'", 'll', 'say', '!']

Related

Why does the nltk stopwords output not match the nltk word_tokenize output

I am currently using nltk's stopwords and word_tokenize to process some text and encountered some weird behavior.
sentence = "this is a test sentence which makes perfectly.sense. doesn't it? it won't. i'm annoyed"
tok_sentence = word_tokenize(sentence)
print(tok_sentence)
print(stopwords.words('english'))
printing the following:
['this', 'is', 'a', 'test', 'sentence', 'which', 'makes', 'perfectly.sense', '.', 'does', "n't", 'it', '?', 'it', 'wo', "n't", '.', 'i', "'m", 'annoyed']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Focusing on the words containing the ' character: we can see that the stopword list clearly contains words built with it. At the same time, all the contracted words in my sample sentence are included in the stopword list, as are their parts ("doesn't" is included, and so are "doesn" and "t").
The word_tokenize function however splits the word "doesn't" into "does" and "n't".
Filtering out the stopwords after using word_tokenize will therefore remove "does" but leave "n't" behind...
I was wondering if this behavior was intentional. If so, could someone please explain why?
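Here is a minimal sketch that reproduces the behaviour described above (it assumes the punkt and stopwords data are already downloaded; the workaround at the end is just one common approach, not an official NLTK recipe):

from nltk import word_tokenize
from nltk.corpus import stopwords

sentence = "this is a test sentence which makes perfectly.sense. doesn't it? it won't. i'm annoyed"
stops = set(stopwords.words('english'))

filtered = [t for t in word_tokenize(sentence) if t.lower() not in stops]
print(filtered)
# 'does' is removed (it is in the list), but 'wo', "n't" and "'m" survive:
# the list stores whole contractions such as "won't" and "doesn't", while
# word_tokenize produces the Treebank pieces 'wo'/"n't" and 'i'/"'m".

# One workaround is to run the stopwords themselves through the same tokenizer
# before building the set, so the Treebank pieces end up in it too:
stops_tok = {t for w in stopwords.words('english') for t in word_tokenize(w)}
print([t for t in word_tokenize(sentence) if t.lower() not in stops_tok])
# "n't" and 'wo' are now filtered out; "'m" still remains because "i'm"
# itself is not in the stopword list.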

how to make these words into sentences

I lemmatised several sentences, and the results turned out like this (this is for the first two sentences):
['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
where all the words are generated together in one flat list. But I need them sentence by sentence; the output format would be better like this:
['She be start on Levofloxacin but the patient become hypotensive at that point with blood pressure of 70/45 and receive a normal saline bolus to boost her blood pressure to 99/60 ; however the patient be admit to the Medical Intensive Care Unit for overnight observation because of her somnolence and hypotension .','11 . History of hemoptysis , on Coumadin .','There be ST scoop in the lateral lead consistent with Dig vs. a question of chronic ischemia change .']
Can anyone help me please? Thanks a lot.
Try this code:
final = []
sentence = []
for word in words:
    if word in ['.']:  # and whatever other punctuation marks you want to use.
        sentence.append(word)
        final.append(' '.join(sentence))
        sentence = []
    else:
        sentence.append(word)

print(final)
Hope this helps! :)
A good starting point might be str.join():
>>> wordsList = ['She', 'be', 'start', 'on', 'Levofloxacin']
>>> ' '.join(wordsList)
'She be start on Levofloxacin'
words=['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
def Wordify(words, sen_lim):
    Array = []
    word = ""
    sen_len = 0
    for w in words:
        word += w + " "
        if w.isalnum():
            sen_len += 1
        if w == "." and sen_len > sen_lim:
            Array.append(word)
            word = ""
            sen_len = 0
    return Array

print(Wordify(words, 5))
Basically you append each word to a running string and split off a sentence when you hit a period, but only if the current sentence already has a minimum number of words. This ensures that short fragments like "11 ." are not emitted as separate sentences. sen_lim is a parameter you can tune to your convenience.
You can try string concatenation by looping through the list
list1 = ['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the',
'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood',
'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline',
'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';',
'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical',
'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because',
'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History',
'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST',
'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.',
'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
list2 = []
string = ""
for element in list1:
    if string == "" or element == ".":
        string = string + element
    else:
        string = string + " " + element

list2.append(string)
print(list2)
You could try this:
# list of words.
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    sentence = ""
    # iterate the list and append to the string.
    for word in words:
        sentence += word + " "
    result = [sentence]
    # print the result.
    print(result)

sentence_from_list(words)
You may need to delete the last space, just before the '.'
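For example, a small sketch (not part of the original answer) that trims the trailing space and closes up the gap before the final period:

sentence = "This is a sentence . "
cleaned = sentence.rstrip().replace(" .", ".")  # drop the trailing space, reattach the period
print(cleaned)  # 'This is a sentence.'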

TypeError: expected string or bytes-like object while filtering the nested list of strings with RegEx

I have this nested list of strings which is in its final stage of cleaning. I want to replace the non-letters in the nested list with spaces, or create a new list without the non-letters. Here is my list:
list = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
And this is the pattern that I want to use in order to clean it all
pattern = re.compile(r'\W+')
newlist = list(filter(pattern.search, list))
print(newlist)
The code doesn't work, and this is the error that I get:
Traceback (most recent call last):
File "/Users/art/Desktop/TxtProcessing/regexp", line 28, in <module>
newlist = [list(filter(pattern.search, list))]
TypeError: expected string or bytes-like object
I understand that list is not a string but a list of lists of strings; how do I fix it?
Any help will be very much appreciated!
You need to step deeper into your list:
import re
list_ = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
pattern = re.compile(r'\W+')
newlist_ = [item
            for sublist_ in list_
            for item in sublist_
            if pattern.search(item)]
print(newlist_)
# ['mr.', ',', '?', ',', '.', 'pinkish-blue', '.', "n't", '.']
Additionally, you must not name your variables list.
You are attempting to pass a list to re.search; however, only strings are allowed, since pattern matching is supposed to occur on text. Try looping over the list instead:
import re
l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
new_l = [[b for b in i if re.findall(r'^\w+$', b)] for i in l]
Also, note that your original variable name, list, shadows the built-in list type: the name now refers to your data instead of the built-in, so a later call to list(...) will fail.
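As a quick illustration of that pitfall (a toy example, not taken from the question):

list = [1, 2, 3]   # rebinding the name shadows the built-in type
list(range(3))     # TypeError: 'list' object is not callable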
First of all, shadowing a built-in name like list may lead to all sorts of troubles - choose your variable names carefully.
You don't actually need a regular expression here - there is a built-in isalpha() string method:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.
In [1]: l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
In [2]: [[item for item in sublist if item.isalpha()] for sublist in l]
Out[2]:
[['hello', 'smith', 'how', 'are', 'you', 'doing', 'today'],
['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome'],
['the', 'sky', 'is'],
['you', 'should', 'eat', 'cardboard']]
Here is how you can apply the same filtering logic but using map and filter (you would need the help of functools.partial() as well):
In [4]: from functools import partial
In [5]: for item in map(partial(filter, str.isalpha), l):
   ...:     print(list(item))
['hello', 'smith', 'how', 'are', 'you', 'doing', 'today']
['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome']
['the', 'sky', 'is']
['you', 'should', 'eat', 'cardboard']

How do I select the first elements of each list in a list of lists?

I am trying to isolate the first words in a series of sentences using Python/NLTK.
I created an unimportant series of sentences (the_text), and while I am able to divide that into tokenized sentences, I cannot successfully separate just the first words of each sentence into a list (first_words).
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print (number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!
There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words=[]
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
Here's an example of how to access items in lists and list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterates through the sentences.
... for sent in tokenized_text:
...     print(sent)
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First words in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print(word0)
...
...
Here
There
I
There
Here
I
I
Look
How
>>> print(first_words)
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
Or as one-liners with list comprehensions:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text= [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])

Simple tokenization issue in NLTK

I want to tokenize the following text:
In Düsseldorf I took my hat off. But I can't put it back on.
so that the result is:
'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.'
But to my surprise, none of the NLTK tokenizers quite work. How can I accomplish this? Is it possible to use a combination of these tokenizers somehow to achieve the above?
You can take one of the tokenizers as a starting point and then fix the contractions (assuming that is the problem):
from nltk.tokenize.treebank import TreebankWordTokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)

contractions = ["n't", "'ll", "'m"]
fix = []
for i in range(len(tokens)):
    for c in contractions:
        if tokens[i] == c:
            fix.append(i)

fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset
    tokens[idx] = tokens[idx] + tokens[idx + 1]
    del tokens[idx + 1]
    fix_offset += 1

print(tokens)
>>>['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
You should tokenize the sentence before tokenizing the words:
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
If you want to get them back into a single list:
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
>>> list(chain(*text))
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
If you must turn ["ca", "n't"] back into ["can't"]:
>>> from itertools import izip_longest, chain
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
>>> contractions = ["n't", "'ll", "'re", "'s"]
# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove all contraction tokens since you've joint them to their root stem.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
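Note that itertools.izip_longest only exists in Python 2; on Python 3 the same idea works with itertools.zip_longest. A sketch of the equivalent code (not part of the original answer):

from itertools import chain, zip_longest
from nltk import sent_tokenize, word_tokenize

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
contractions = ["n't", "'ll", "'re", "'s"]

# Pair each token with its successor, glue contractions onto the preceding
# token, then drop the leftover contraction pieces.
joined = [w1 + w2 if w2 in contractions else w1
          for w1, w2 in zip_longest(tok_text, tok_text[1:], fillvalue='')]
print([w for w in joined if w not in contractions])
# ['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']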
