NLTK tokenize questions - python

Just a quick beginners question here with NLTK.
I am trying to tokenize and generate bigrams, trigrams and quadgrams from a corpus.
I need to add <s> to the beginning of each list and </s> to the end, replacing the final period if there is one.
The list is taken from the brown corpus in nltk. (and a specific section at that)
So I have:
from nltk.corpus import brown
news = brown.sents(categories = 'editorial')
Am I making this too difficult?
Thanks a lot.

import nltk.corpus as corpus

def mark_sentence(row):
    # Note: this mutates the row in place before prepending the start marker.
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

news = corpus.brown.sents(categories='editorial')
for row in news[:5]:
    print(mark_sentence(row))
yields
['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
['<s>', 'The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '</s>']
['<s>', 'It', 'was', 'faced', 'immediately', 'with', 'a', 'showdown', 'on', 'the', 'schools', ',', 'an', 'issue', 'which', 'was', 'met', 'squarely', 'in', 'conjunction', 'with', 'the', 'governor', 'with', 'a', 'decision', 'not', 'to', 'risk', 'abandoning', 'public', 'education', '</s>']
['<s>', 'There', 'followed', 'the', 'historic', 'appropriations', 'and', 'budget', 'fight', ',', 'in', 'which', 'the', 'General', 'Assembly', 'decided', 'to', 'tackle', 'executive', 'powers', '</s>']
['<s>', 'The', 'final', 'decision', 'went', 'to', 'the', 'executive', 'but', 'a', 'way', 'has', 'been', 'opened', 'for', 'strengthening', 'budgeting', 'procedures', 'and', 'to', 'provide', 'legislators', 'information', 'they', 'need', '</s>']
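With the markers in place, the bigrams, trigrams and quadgrams themselves can be generated with nltk.util.ngrams. A minimal sketch, using one hard-coded marked sentence from the output above rather than the Brown corpus so it runs without any corpus download:

```python
from nltk.util import ngrams

# A marked sentence as produced by mark_sentence above.
sentence = ['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']

# ngrams yields tuples of n consecutive tokens; n = 2, 3, 4 give
# bigrams, trigrams and quadgrams respectively.
bigrams = list(ngrams(sentence, 2))
trigrams = list(ngrams(sentence, 3))
quadgrams = list(ngrams(sentence, 4))

print(bigrams[0])  # ('<s>', 'Assembly')
```

For the full corpus you would apply this per row of news, after mark_sentence.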

How to remove empty spaces in list?

I have a text:
text = '''
Wales greatest moment. Lille is so close to the Belgian
border,
this was essentially a home game for one of the tournament favourites. Their
confident supporters mingled with their new Welsh fans on the streets,
buying into the carnival spirit - perhaps more relaxed than some might have
been before a quarter-final because they thought this was their time.
In the driving rain, Wales produced the best performance in their history to
carry the nation into uncharted territory. Nobody could quite believe it.'''
I have this code:
words = text.replace('.',' ').replace(',',' ').replace('\n',' ').split(' ')
print(words)
And Output:
['Wales', 'greatest', 'moment', '', 'Lille', 'is', 'so', 'close', 'to', 'the', 'Belgian', 'border', '', '', 'this', 'was', 'essentially', 'a', 'home', 'game', 'for', 'one', 'of', 'the', 'tournament', 'favourites', '', 'Their', '', 'confident', 'supporters', 'mingled', 'with', 'their', 'new', 'Welsh', 'fans', 'on', 'the', 'streets', '', '', 'buying', 'into', 'the', 'carnival', 'spirit', '-', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might', 'have', '', 'been', 'before', 'a', 'quarter-final', 'because', 'they', 'thought', 'this', 'was', 'their', 'time', '', 'In', 'the', 'driving', 'rain', '', 'Wales', 'produced', 'the', 'best', 'performance', 'in', 'their', 'history', 'to', '', 'carry', 'the', 'nation', 'into', 'uncharted', 'territory', '', 'Nobody', 'could', 'quite', 'believe', 'it', '']
As you can see, the list contains empty strings after I replace '\n', ',' and '.'.
Now I have no idea how to remove these empty strings.
You can filter them out if you don't like them:
no_empties = list(filter(None, words))
Quoting the filter docs: if function is None, the identity function is assumed; that is, all elements of iterable that are false are removed.
This works because empty strings are falsey.
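A self-contained sketch of that filtering step, on a small stand-in list:

```python
words = ['Wales', 'greatest', 'moment', '', 'Lille', '', '']

# filter(None, ...) drops every falsey element, including empty strings.
no_empties = list(filter(None, words))
print(no_empties)  # ['Wales', 'greatest', 'moment', 'Lille']
```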
EDIT:
The original answer does not produce the same output as the replace-based code (as mentioned in the comments) because of the dash symbol. To keep hyphenated words intact:
import re
words = re.findall(r'[\w-]+', text)
Original Answer
You can directly get what you want with the re module
import re
words = re.findall(r'\w+', text)
['Wales', 'greatest', 'moment', 'Lille', 'is', 'so', 'close', 'to', 'the', 'Belgian',
 'border', 'this', 'was', 'essentially', 'a', 'home', 'game', 'for', 'one', 'of',
 'the', 'tournament', 'favourites', 'Their', 'confident', 'supporters', 'mingled',
 'with', 'their', 'new', 'Welsh', 'fans', 'on', 'the', 'streets', 'buying', 'into',
 'the', 'carnival', 'spirit', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might',
 'have', 'been', 'before', 'a', 'quarter', 'final', 'because', 'they', 'thought',
 'this', 'was', 'their', 'time', 'In', 'the', 'driving', 'rain', 'Wales', 'produced',
 'the', 'best', 'performance', 'in', 'their', 'history', 'to', 'carry', 'the',
 'nation', 'into', 'uncharted', 'territory', 'Nobody', 'could', 'quite', 'believe', 'it']
The reason you are getting this issue is that every line of your text value is indented with four spaces, not that your code is flawed. You could strip that indentation from the text before splitting if it is intentional, or refer to Thomas Weller's solution, which works no matter how many consecutive spaces there are.
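A simpler route, if the goal is just whitespace-tolerant splitting: str.split() with no argument splits on any run of whitespace (spaces, newlines, indentation) and never produces empty strings. A short sketch on a trimmed stand-in text:

```python
text = '''
Wales greatest moment. Lille is so close to the Belgian
    border,
this was essentially a home game.'''

# split() without arguments treats any run of whitespace as one separator,
# so the four-space indentation and newlines cause no empty entries.
words = text.replace('.', ' ').replace(',', ' ').split()
print(words)
```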

Why does the nltk stopwords output does not match nltk word_tokenize output

I am currently using NLTK's stopwords and word_tokenize to process some text and encountered some odd behavior.
from nltk import word_tokenize
from nltk.corpus import stopwords

sentence = "this is a test sentence which makes perfectly.sense. doesn't it? it won't. i'm annoyed"
tok_sentence = word_tokenize(sentence)
print(tok_sentence)
print(stopwords.words('english'))
This prints the following:
['this', 'is', 'a', 'test', 'sentence', 'which', 'makes', 'perfectly.sense', '.', 'does', "n't", 'it', '?', 'it', 'wo', "n't", '.', 'i', "'m", 'annoyed']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Focusing on the words containing the apostrophe: the stopword list clearly contains contractions, and all the contractions in my sample sentence are included, as are their parts ("doesn't" is included, and so are "doesn" and "t").
The word_tokenize function, however, splits "doesn't" into "does" and "n't".
Filtering against the stopword list after word_tokenize therefore removes "does" but leaves "n't" behind...
I was wondering if this behavior was intentional. If so, could someone please explain why?
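The mismatch is easy to reproduce without any corpus download; this sketch hard-codes the relevant tokens and the matching subset of the stopword list from the output above:

```python
# Tokens as word_tokenize produces them for "doesn't it? it won't."
tokens = ['does', "n't", 'it', '?', 'it', 'wo', "n't", '.']

# Relevant subset of stopwords.words('english') from the listing above.
stop = {'does', 'doesn', "doesn't", 'it', 'won', "won't", 't'}

# Naive stopword filtering leaves the "n't" and "wo" fragments behind,
# because those exact strings are not in the stopword list.
filtered = [t for t in tokens if t not in stop]
print(filtered)  # ["n't", '?', 'wo', "n't", '.']
```

A common workaround is to extend the filtering set with the fragments the tokenizer actually emits ("n't", "wo", "'m", ...), rather than relying on the contraction spellings in the stopword list.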

Is there a way in which I can get the equivalent list of vectors per paragraph in doc2vec?

Is there a way to see the vectors I got per paragraph, rather than per word in the vocabulary, with doc2vec? Using model.wv.vectors I get all the per-word vectors. I need the paragraph vectors in order to apply a clustering algorithm to the embedded paragraphs, which I hope to obtain, though I am not sure whether this approach is good. This is how the paragraphs look:
[TaggedDocument(words=['this', 'is', 'the', 'effect', 'of', 'those', 'states', 'that', 'went', 'into', 'lockdown', 'much', 'later', 'they', 'are', 'just', 'starting', 'to', 'see', 'the', 'large', 'increase', 'now', 'they', 'have', 'to', 'ride', 'it', 'out', 'and', 'hope', 'for', 'the', 'best'], tags=[0])
TaggedDocument(words=['so', 'see', 'the', 'headline', 'is', 'died', 'not', 'revised', 'predictions', 'show', 'more', 'hopeful', 'situation', 'or', 'new', 'york', 'reaching', 'apex', 'long', 'before', 'experts', 'predicted', 'or', 'any', 'such', 'thing', 'got', 'to', 'keep', 'the', 'panic', 'train', 'rolling', 'see'], tags=[1])]
model.docvecs.vectors will contain all the trained-up document vectors.
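Assuming model.docvecs.vectors is a 2-D numpy array with one row per tagged document (a small hand-made array stands in for it here), a typical pre-clustering step is to L2-normalise the rows so that dot products give cosine similarity:

```python
import numpy as np

# Stand-in for model.docvecs.vectors: 2 documents, 4-dimensional vectors.
doc_vectors = np.array([[3.0, 4.0, 0.0, 0.0],
                        [0.0, 0.0, 5.0, 12.0]])

# L2-normalise each row; on unit vectors, Euclidean distance is a
# monotone function of cosine distance, which suits most clusterers.
norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
unit = doc_vectors / norms

# Pairwise cosine-similarity matrix; `unit` can be fed to any
# clustering algorithm (e.g. k-means) directly.
cos_sim = unit @ unit.T
print(cos_sim.shape)  # (2, 2)
```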

how to make these words into sentence

I lemmatised several sentences, and the results came out like this (shown here for the first two sentences):
['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
All the words end up together in one flat list, but I need them grouped sentence by sentence; the output format would preferably look like this:
['She be start on Levofloxacin but the patient become hypotensive at that point with blood pressure of 70/45 and receive a normal saline bolus to boost her blood pressure to 99/60 ; however the patient be admit to the Medical Intensive Care Unit for overnight observation because of her somnolence and hypotension .','11 . History of hemoptysis , on Coumadin .','There be ST scoop in the lateral lead consistent with Dig vs. a question of chronic ischemia change .']
Can anyone help me please? Thanks a lot.
Try this code:
final = []
sentence = []
for word in words:
    if word in ['.']:  # and whatever other punctuation marks you want to use.
        sentence.append(word)
        final.append(' '.join(sentence))
        sentence = []
    else:
        sentence.append(word)
print(final)
Hope this helps! :)
A good starting point might be str.join():
>>> wordsList = ['She', 'be', 'start', 'on', 'Levofloxacin']
>>> ' '.join(wordsList)
'She be start on Levofloxacin'
words=['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
def Wordify(words, sen_lim):
    Array = []
    word = ""
    sen_len = 0
    for w in words:
        word += w + " "
        if w.isalnum():
            sen_len += 1
        if w == "." and sen_len > sen_lim:
            Array.append(word)
            word = ""
            sen_len = 0
    return Array

print(Wordify(words, 5))
Basically you append each word to a growing string and split off a sentence whenever there is a period, while also requiring the current sentence to contain a minimum number of words. This avoids spurious sentences like "11 .".
sen_lim
is a parameter you can tune to your convenience.
You can try string concatenation by looping through the list
list1 = ['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the',
'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood',
'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline',
'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';',
'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical',
'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because',
'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History',
'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST',
'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.',
'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
list2 = []
string = ""
for element in list1:
    if string == "" or element == ".":
        string = string + element
    else:
        string = string + " " + element
list2.append(string)
print(list2)
You could try this:
# list of words.
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    sentence = ""
    # iterate the list and append to the string.
    for word in words:
        sentence += word + " "
    result = [sentence]
    # print the result.
    print(result)

sentence_from_list(words)
You may need to delete the last space, just before the '.'

gensim.similarities.SparseMatrixSimilarity get segmentation-fault

I want to compute the similarity of one document to other documents, using gensim. The program runs correctly at first, but after some steps it exits with a segmentation fault.
Below is my code:
from gensim import corpora, models, similarities
docs = [['Looking', 'for', 'the', 'meanings', 'of', 'words'],
['phrases'],
['and', 'expressions'],
['We', 'provide', 'hundreds', 'of', 'thousands', 'of', 'definitions'],
['synonyms'],
['antonyms'],
['and', 'pronunciations', 'for', 'English', 'and', 'other', 'languages'],
['derived', 'from', 'our', 'language', 'research', 'and', 'expert', 'analysis'],
['We', 'also', 'offer', 'a', 'unique', 'set', 'of', 'examples', 'of', 'real', 'usage'],
['as', 'well', 'as', 'guides', 'to:']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]
nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)
phrases = [['This', 'section', 'gives', 'guidelines', 'on', 'writing', 'in',
            'everyday', 'situations'],
           ['from', 'applying', 'for', 'a', 'job', 'to', 'composing', 'letters',
            'of', 'complaint', 'or', 'making', 'an', 'insurance', 'claim'],
           ['There', 'are', 'plenty', 'of', 'sample', 'documents', 'to', 'help',
            'you', 'get', 'it', 'right', 'every', 'time'],
           ['create', 'a', 'good', 'impression'],
           ['and', 'increase', 'the', 'likelihood', 'of', 'achieving', 'your',
            'desired', 'outcome']]
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
sims=index[phrase2word]
It runs normally until computing sims; at that point it crashes, and gdb shows the following:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffd881d809 in csr_tocsc (n_row=5, n_col=39,
Ap=0x4a4eb10, Aj=0x9fc6ec0, Ax=0x1be4a00, Bp=0xa15f6a0, Bi=0x9f3ee80,
Bx=0x9f85f60) at scipy/sparse/sparsetools/csr.h:411 411
scipy/sparse/sparsetools/csr.h: No such file or directory.
I got the answer from GitHub.
The main reason is that num_features must match the number of features in the dictionary (len(dictionary.dfs)). Because the doc2bow call for the phrases uses allow_update=True, it adds new tokens to the dictionary after the index was built with the old feature count, so the query vectors contain feature ids the index does not know about.
