How to make these words into sentences - Python

I lemmatised several sentences, and the results turned out like this (shown here for the first few sentences):
['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
All the words are generated together in one flat list, but I need them grouped sentence by sentence. The output format would be better like this:
['She be start on Levofloxacin but the patient become hypotensive at that point with blood pressure of 70/45 and receive a normal saline bolus to boost her blood pressure to 99/60 ; however the patient be admit to the Medical Intensive Care Unit for overnight observation because of her somnolence and hypotension .','11 . History of hemoptysis , on Coumadin .','There be ST scoop in the lateral lead consistent with Dig vs. a question of chronic ischemia change .']
Can anyone help me, please? Thanks a lot.

Try this code:
final = []
sentence = []
for word in words:
    if word in ['.']:  # and whatever other punctuation marks you want to use.
        sentence.append(word)
        final.append(' '.join(sentence))
        sentence = []
    else:
        sentence.append(word)
print(final)
Hope this helps! :)

A good starting point might be str.join():
>>> wordsList = ['She', 'be', 'start', 'on', 'Levofloxacin']
>>> ' '.join(wordsList)
'She be start on Levofloxacin'
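Building on that, here is a minimal sketch (assuming '.' is the only sentence terminator in your list) that joins the words while cutting at each period:
words = ['She', 'be', 'start', 'on', 'Levofloxacin', '.', '11', '.']

sentences = []
start = 0
for i, w in enumerate(words):
    if w == '.':
        # join everything since the previous period, including this one
        sentences.append(' '.join(words[start:i + 1]))
        start = i + 1

print(sentences)  # ['She be start on Levofloxacin .', '11 .']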

words=['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
def Wordify(words, sen_lim):
    Array = []
    word = ""
    sen_len = 0
    for w in words:
        word += w + " "
        if w.isalnum():
            sen_len += 1
        if w == "." and sen_len > sen_lim:
            Array.append(word)
            word = ""
            sen_len = 0
    return Array

print(Wordify(words, 5))
Basically you append each token to a growing string and separate out a sentence when there is a period, but only if the current sentence already has a minimum number of words. This ensures that fragments like "11 ." do not end up as sentences of their own. sen_lim is a parameter you can tune to your convenience.

You can try string concatenation by looping through the list:
list1 = ['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the',
'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood',
'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline',
'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';',
'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical',
'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because',
'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History',
'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST',
'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.',
'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
list2 = []
string = ""
for element in list1:
    if string == "" or element == ".":
        string = string + element  # no space before the first word or before a period
    else:
        string = string + " " + element
list2.append(string)
print(list2)

You could try this:
# list of words.
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    sentence = ""
    # iterate the list and append each word to the string.
    for word in words:
        sentence += word + " "
    result = [sentence]
    # print the result.
    print(result)

sentence_from_list(words)
You may need to delete the trailing space at the end of the string.
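One way to avoid the trailing space altogether (a sketch using ' '.join rather than the concatenation above):
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    # ' '.join puts exactly one space between words, so there is nothing to strip
    return [' '.join(words)]

print(sentence_from_list(words))  # ['This is a sentence .']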

How to remove empty strings from a list?

I have a text:
text = '''
Wales greatest moment. Lille is so close to the Belgian
border,
this was essentially a home game for one of the tournament favourites. Their
confident supporters mingled with their new Welsh fans on the streets,
buying into the carnival spirit - perhaps more relaxed than some might have
been before a quarter-final because they thought this was their time.
In the driving rain, Wales produced the best performance in their history to
carry the nation into uncharted territory. Nobody could quite believe it.'''
And I have this code:
words = text.replace('.',' ').replace(',',' ').replace('\n',' ').split(' ')
print(words)
And Output:
['Wales', 'greatest', 'moment', '', 'Lille', 'is', 'so', 'close', 'to', 'the', 'Belgian', 'border', '', '', 'this', 'was', 'essentially', 'a', 'home', 'game', 'for', 'one', 'of', 'the', 'tournament', 'favourites', '', 'Their', '', 'confident', 'supporters', 'mingled', 'with', 'their', 'new', 'Welsh', 'fans', 'on', 'the', 'streets', '', '', 'buying', 'into', 'the', 'carnival', 'spirit', '-', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might', 'have', '', 'been', 'before', 'a', 'quarter-final', 'because', 'they', 'thought', 'this', 'was', 'their', 'time', '', 'In', 'the', 'driving', 'rain', '', 'Wales', 'produced', 'the', 'best', 'performance', 'in', 'their', 'history', 'to', '', 'carry', 'the', 'nation', 'into', 'uncharted', 'territory', '', 'Nobody', 'could', 'quite', 'believe', 'it', '']
As you can see, the list has empty strings left over after I removed '\n', ',' and '.'.
But now I have no idea how to remove these empty strings.
You can filter them out if you don't like them:
no_empties = list(filter(None, words))
Per the filter() documentation: if function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.
This works because empty strings are falsy.
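If you prefer a list comprehension over filter, an equivalent one-liner is:
no_empties = [w for w in words if w]  # keeps only truthy (non-empty) strings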
EDIT:
As mentioned in the comments, the original answer does not produce exactly the same output because of the dash symbol; to avoid that:
import re
words = re.findall(r'[\w-]+', text)
Original Answer
You can directly get what you want with the re module
import re
words = re.findall(r'\w+', text)
['Wales', 'greatest', 'moment', 'Lille', 'is', 'so', 'close', 'to', 'the',
 'Belgian', 'border', 'this', 'was', 'essentially', 'a', 'home', 'game',
 'for', 'one', 'of', 'the', 'tournament', 'favourites', 'Their',
 'confident', 'supporters', 'mingled', 'with', 'their', 'new', 'Welsh',
 'fans', 'on', 'the', 'streets', 'buying', 'into', 'the', 'carnival',
 'spirit', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might', 'have',
 'been', 'before', 'a', 'quarter', 'final', 'because', 'they', 'thought',
 'this', 'was', 'their', 'time', 'In', 'the', 'driving', 'rain', 'Wales',
 'produced', 'the', 'best', 'performance', 'in', 'their', 'history', 'to',
 'carry', 'the', 'nation', 'into', 'uncharted', 'territory', 'Nobody',
 'could', 'quite', 'believe', 'it']
The reason you are getting this issue is that every line of your text value is indented with four spaces, not that your code is flawed. You could call .split() with no argument instead of .split(' '), which splits on any run of whitespace and never yields empty strings, or you could use Thomas Weller's solution above, which works no matter how many consecutive spaces each line contains.
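For example (a sketch that also drops the '\n' replace, since split() handles newlines by itself):
words = text.replace('.', ' ').replace(',', ' ').split()
print(words)  # no empty strings, however many spaces separate the words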

Why does the NLTK stopwords output not match the NLTK word_tokenize output?

I am currently using NLTK's stopwords and word_tokenize to process some text, and I've encountered some weird behavior.
sentence = "this is a test sentence which makes perfectly.sense. doesn't it? it won't. i'm annoyed"
tok_sentence = word_tokenize(sentence)
print(tok_sentence)
print(stopwords.words('english'))
This prints the following:
['this', 'is', 'a', 'test', 'sentence', 'which', 'makes', 'perfectly.sense', '.', 'does', "n't", 'it', '?', 'it', 'wo', "n't", '.', 'i', "'m", 'annoyed']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Focusing on the words containing the apostrophe: the stopword list clearly contains contracted forms, and the contractions in my sample sentence are included both whole and in parts ("doesn't" is included, and so are "doesn" and "t").
The word_tokenize function, however, splits "doesn't" into "does" and "n't".
Filtering out stopwords after word_tokenize therefore removes "does" but leaves "n't" behind...
I was wondering if this behavior was intentional. If so, could someone please explain why?
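A minimal reproduction of the mismatch (assuming the usual stopwords and punkt data are installed):
from nltk import word_tokenize
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))
tokens = word_tokenize("it won't. i'm annoyed")
print([t for t in tokens if t.lower() not in sw])
# word_tokenize splits "won't" into 'wo' and "n't"; neither fragment matches
# "won't" in the stopword list, so both survive the filter.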

Split list of strings on multiple conditions

I have a number of sentences which I would like to split on specific words (e.g. and). However, a single sentence sometimes contains two or more occurrences of the words I'd like to split on.
Example sentences:
['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk']
so I have written some code to split a sentence:
split_on_word = []
no_splitting = []
for e in example:
    kth = e.split()  # split the string into a list so it looks like the example sentence
    indexPosList = [i for i in range(len(kth)) if kth[i] == 'and']  # positions of the split word in the sentence
    for n in indexPosList:
        if n > 4:  # only split when the word's position is 4 or more
            h = e.split("and")
            for i in h:
                split_on_word.append(i)  # append split sentences
        else:
            no_splitting.append(kth)  # append sentences that don't need to be split
However, when I use this code more than once (e.g. replacing the word to split on with another one), I create duplicates or partial duplicates of the sentences appended to the new list.
Is there any way to check for multiple conditions, so that if a sentence contains both words, or other combinations of them, the sentence is split in one go?
The output from the examples should then look like this:
['i', 'am', 'just', 'hoping', 'for', 'strength']
['guidance', 'because']
['i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home']
[ 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was']
['he', 'is', 'been', 'told', 'not', 'to', 'talk']
You can use itertools.groupby with a function that checks whether a word is a split-word:
In [10]: import itertools as it

In [11]: split_words = {'and', 'because'}

In [12]: [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
Out[12]:
[['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
 ['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
 ['he', 'is', 'been', 'told', 'not', 'to', 'talk']]
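The same expression applied to the first example sentence (a quick sketch outside the IPython session):
import itertools as it

example = ['i', 'am', 'just', 'hoping', 'for', 'strength', 'and',
           'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
split_words = {'and', 'because'}

groups = [list(g) for k, g in
          it.groupby(example, key=lambda x: x not in split_words) if k]
print(groups)
# [['i', 'am', 'just', 'hoping', 'for', 'strength'],
#  ['guidance'],
#  ['i', 'have', 'no', 'idea', 'why']]
Note that 'because' itself is dropped, since it is one of the split-words.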

Simple tokenization issue in NLTK

I want to tokenize the following text:
In Düsseldorf I took my hat off. But I can't put it back on.
into this list of tokens:
'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I',
"can't", 'put', 'it', 'back', 'on', '.'
But to my surprise none of the NLTK tokenizers work. How can I accomplish this? Is it possible to use a combination of these tokenizers somehow to achieve the above?
You can take one of the tokenizers as a starting point and then fix the contractions (assuming that is the problem):
from nltk.tokenize.treebank import TreebankWordTokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)

contractions = ["n't", "'ll", "'m"]
fix = []
for i in range(len(tokens)):
    for c in contractions:
        if tokens[i] == c:
            fix.append(i)

fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset
    tokens[idx] = tokens[idx] + tokens[idx + 1]
    del tokens[idx + 1]
    fix_offset += 1

print(tokens)
>>>['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
You should tokenize the text into sentences before tokenizing the words:
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
If you want to get them back into a single list:
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
>>> list(chain(*text))
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
If you must merge ["ca", "n't"] -> ["can't"]:
>>> from itertools import izip_longest, chain
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
>>> contractions = ["n't", "'ll", "'re", "'s"]
# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove all contraction tokens since you've joined them to their root stem.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
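In Python 3, izip_longest lives in itertools as zip_longest; a minimal port of the same idea:
from itertools import chain, zip_longest
from nltk import sent_tokenize, word_tokenize

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
contractions = ["n't", "'ll", "'re", "'s"]

# glue each contraction onto the preceding token, then drop the leftovers
joined = [w1 + w2 if w2 in contractions else w1
          for w1, w2 in zip_longest(tok_text, tok_text[1:], fillvalue='')]
print([w for w in joined if w not in contractions])
# ['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I',
#  "can't", 'put', 'it', 'back', 'on', '.']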

NLTK tokenize questions

Just a quick beginner's question here with NLTK.
I am trying to tokenize and generate bigrams, trigrams and quadgrams from a corpus.
I need to add <s> to the beginning of my lists and </s> to the end in place of a period if there is one.
The list is taken from the brown corpus in nltk. (and a specific section at that)
so.. I have
from nltk.corpus import brown
news = brown.sents(categories = 'editorial')
Am I making this too difficult?
Thanks a lot.
import nltk.corpus as corpus
def mark_sentence(row):
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

news = corpus.brown.sents(categories='editorial')
for row in news[:5]:
    print(mark_sentence(row))
yields
['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
['<s>', 'The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '</s>']
['<s>', 'It', 'was', 'faced', 'immediately', 'with', 'a', 'showdown', 'on', 'the', 'schools', ',', 'an', 'issue', 'which', 'was', 'met', 'squarely', 'in', 'conjunction', 'with', 'the', 'governor', 'with', 'a', 'decision', 'not', 'to', 'risk', 'abandoning', 'public', 'education', '</s>']
['<s>', 'There', 'followed', 'the', 'historic', 'appropriations', 'and', 'budget', 'fight', ',', 'in', 'which', 'the', 'General', 'Assembly', 'decided', 'to', 'tackle', 'executive', 'powers', '</s>']
['<s>', 'The', 'final', 'decision', 'went', 'to', 'the', 'executive', 'but', 'a', 'way', 'has', 'been', 'opened', 'for', 'strengthening', 'budgeting', 'procedures', 'and', 'to', 'provide', 'legislators', 'information', 'they', 'need', '</s>']
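From there, generating the n-grams the question asks about could look something like this sketch with nltk.util.ngrams:
from nltk.util import ngrams

marked = [mark_sentence(row) for row in news]
bigrams = [g for sent in marked for g in ngrams(sent, 2)]  # use 3 or 4 for tri-/quadgrams
print(bigrams[:3])
# [('<s>', 'Assembly'), ('Assembly', 'session'), ('session', 'brought')]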
