Is there a way to see the vectors I got per paragraph, rather than per word in the vocabulary, with doc2vec? Using model.wv.vectors I get all the per-word vectors. I need the paragraph vectors in order to apply a clustering algorithm to the embedded paragraphs, which I hopefully can obtain. I am not sure, though, whether this approach is good. This is how the paragraphs look:
[TaggedDocument(words=['this', 'is', 'the', 'effect', 'of', 'those', 'states', 'that', 'went', 'into', 'lockdown', 'much', 'later', 'they', 'are', 'just', 'starting', 'to', 'see', 'the', 'large', 'increase', 'now', 'they', 'have', 'to', 'ride', 'it', 'out', 'and', 'hope', 'for', 'the', 'best'], tags=[0]),
TaggedDocument(words=['so', 'see', 'the', 'headline', 'is', 'died', 'not', 'revised', 'predictions', 'show', 'more', 'hopeful', 'situation', 'or', 'new', 'york', 'reaching', 'apex', 'long', 'before', 'experts', 'predicted', 'or', 'any', 'such', 'thing', 'got', 'to', 'keep', 'the', 'panic', 'train', 'rolling', 'see'], tags=[1])]
model.docvecs.vectors will contain all the trained-up document vectors.
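To go from there to clustering, here is a minimal sketch using scikit-learn's KMeans. Attribute names vary by gensim version (model.dv in gensim 4.x, model.docvecs in older releases), and the toy corpus and n_clusters value are illustrative assumptions, not values from the question.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Toy stand-ins for the question's TaggedDocuments.
docs = [
    TaggedDocument(words=['states', 'went', 'into', 'lockdown', 'later'], tags=[0]),
    TaggedDocument(words=['keep', 'the', 'panic', 'train', 'rolling'], tags=[1]),
]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# One row per tagged document (model.docvecs.vectors on older gensim).
paragraph_vectors = model.dv.vectors

# Cluster the paragraph embeddings.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(paragraph_vectors)
print(labels)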
I lemmatised several sentences, and the results turn out like this (this is for the first two sentences):
['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
All the words are generated together in one flat list, but I need them grouped sentence by sentence. The output format would be better like this:
['She be start on Levofloxacin but the patient become hypotensive at that point with blood pressure of 70/45 and receive a normal saline bolus to boost her blood pressure to 99/60 ; however the patient be admit to the Medical Intensive Care Unit for overnight observation because of her somnolence and hypotension .','11 . History of hemoptysis , on Coumadin .','There be ST scoop in the lateral lead consistent with Dig vs. a question of chronic ischemia change .']
Can anyone help me, please? Thanks a lot.
Try this code:
final = []
sentence = []
for word in words:
    if word in ['.']:  # and whatever other punctuation marks you want to use.
        sentence.append(word)
        final.append(' '.join(sentence))
        sentence = []
    else:
        sentence.append(word)
print(final)
Hope this helps! :)
A good starting point might be str.join():
>>> wordsList = ['She', 'be', 'start', 'on', 'Levofloxacin']
>>> ' '.join(wordsList)
'She be start on Levofloxacin'
words=['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
def Wordify(words, sen_lim):
    Array = []
    word = ""
    sen_len = 0
    for w in words:
        word += w + " "
        if w.isalnum():
            sen_len += 1
        if w == "." and sen_len > sen_lim:
            Array.append(word)
            word = ""
            sen_len = 0
    return Array

print(Wordify(words, 5))
Basically you append each word to a running string and split off a sentence whenever there is a period, but you also ensure that the current sentence has a minimum number of words. This avoids emitting fragments like "11 ." as separate sentences. sen_lim is a parameter you can tune to your convenience.
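With the sample words above and sen_lim=5, the stray "11 ." is folded into the sentence that follows it, so the function returns the three sentences shown in the desired output.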
You can try string concatenation by looping through the list:
list1 = ['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the',
'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood',
'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline',
'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';',
'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical',
'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because',
'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History',
'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST',
'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.',
'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
list2 = []
string = ""
for element in list1:
    if string == "":
        string = element
    elif element == ".":
        # a period ends the current sentence
        string = string + element
        list2.append(string)
        string = ""
    else:
        string = string + " " + element
print(list2)
You could try this:
# list of words.
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    sentence = ""
    # iterate the list and append to the string.
    for word in words:
        sentence += word + " "
    result = [sentence]
    # print the result.
    print(result)

sentence_from_list(words)
You may need to delete the trailing space at the end of the sentence.
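For example, with str.rstrip() (a hypothetical one-line change to the sketch above):

result = [sentence.rstrip()]  # 'This is a sentence .' without the trailing space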
I have this nested list of strings which is in its final stage of cleaning. I want to replace the non-letters in the nested list with spaces, or create a new list without the non-letters. Here is my list:
list = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
And this is the pattern that I want to use in order to clean it all
pattern = re.compile(r'\W+')
newlist = list(filter(pattern.search, list))
print(newlist)
The code doesn't work, and this is the error that I get:
Traceback (most recent call last):
File "/Users/art/Desktop/TxtProcessing/regexp", line 28, in <module>
newlist = [list(filter(pattern.search, list))]
TypeError: expected string or bytes-like object
I understand that list is not a string but a list of lists of strings. How do I fix it?
Any help will be very much appreciated!
You need to step deeper into your list:
import re
list_ = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
pattern = re.compile(r'\W+')
newlist_ = [item
            for sublist_ in list_
            for item in sublist_
            if pattern.search(item)]
print(newlist_)
# ['mr.', ',', '?', ',', '.', 'pinkish-blue', '.', "n't", '.']
Additionally, you must not name your variables list.
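A quick illustration of why (illustrative snippet, not from the original answer):

list = ['a', 'b']  # shadows the built-in list type
list('abc')        # TypeError: 'list' object is not callable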
You are attempting to pass a list to re.search; however, only strings (or bytes-like objects) are allowed, since pattern matching operates on text. Try looping over the list instead:
import re
l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
new_l = [[b for b in i if re.findall(r'^\w+$', b)] for i in l]
Also, note that your original variable name, list, shadows the built-in list type, so after that assignment the name list refers to your data rather than the type.
First of all, shadowing a built-in name like list may lead to all sorts of trouble - choose your variable names carefully.
You don't actually need a regular expression here - there is a built-in isalpha() string method:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.
In [1]: l = [['hello', 'mr.', 'smith', ',', 'how', 'are', 'you', 'doing', 'today', '?'], ['the', 'weather', 'is', 'great', ',', 'and', 'python', 'is', 'awesome', '.'], ['the', 'sky', 'is', 'pinkish-blue', '.'], ['you', 'should', "n't", 'eat', 'cardboard', '.']]
In [2]: [[item for item in sublist if item.isalpha()] for sublist in l]
Out[2]:
[['hello', 'smith', 'how', 'are', 'you', 'doing', 'today'],
['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome'],
['the', 'sky', 'is'],
['you', 'should', 'eat', 'cardboard']]
Here is how you can apply the same filtering logic but using map and filter (you would need the help of functools.partial() as well):
In [4]: from functools import partial
In [5]: for item in map(partial(filter, str.isalpha), l):
   ...:     print(list(item))
['hello', 'smith', 'how', 'are', 'you', 'doing', 'today']
['the', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome']
['the', 'sky', 'is']
['you', 'should', 'eat', 'cardboard']
I am trying to isolate the first word of each sentence in a series of sentences using Python/NLTK.
I created an unimportant series of sentences (the_text), and while I am able to divide it into tokenized sentences, I cannot successfully separate just the first words of each sentence into a list (first_words).
[['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
the_text="Here is some text. There is a a person on the lawn. I am confused. "
the_text= (the_text + "There is more. Here is some more. I don't know anything. ")
the_text= (the_text + "I should add more. Look, here is more text. How great is that?")
sents_tok=nltk.sent_tokenize(the_text)
sents_words=[nltk.word_tokenize(sent) for sent in sents_tok]
number_sents=len(sents_words)
print(number_sents)
print(sents_words)
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
print(first_words)
Thanks for the help!
There are three problems with your code, and you have to fix all three to make it work:
for i in sents_words:
    first_words=[]
    first_words.append(sents_words (i,0))
First, you're erasing first_words each time through the loop: move the first_words=[] outside the loop.
Second, you're mixing up function calling syntax (parentheses) with indexing syntax (brackets): you want sents_words[i][0].
Third, for i in sents_words: iterates over the elements of sents_words, not the indices. So you just want i[0]. (Or, alternatively, for i in range(len(sents_words)), but there's no reason to do that.)
So, putting it together:
first_words = []
for i in sents_words:
    first_words.append(i[0])
If you know anything about comprehensions, you may recognize that this pattern (start with an empty list, iterate over something, appending some expression to the list) is exactly what a list comprehension does:
first_words = [i[0] for i in sents_words]
If you don't, then either now is a good time to learn about comprehensions, or don't worry about this part. :)
>>> sents_words = [['Here', 'is', 'some', 'text', '.'], ['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.'], ['I', 'am', 'confused', '.'], ['There', 'is', 'more', '.'], ['Here', 'is', 'some', 'more', '.'], ['I', 'do', "n't", 'know', 'anything', '.'], ['I', 'should', 'add', 'more', '.'], ['Look', ',', 'here', 'is', 'more', 'text', '.'], ['How', 'great', 'is', 'that', '?']]
You can use a loop to append to a list you've initialized previously:
>>> first_words = []
>>> for i in sents_words:
...     first_words.append(i[0])
...
>>> print(*first_words)
Here There I There Here I I Look How
or a comprehension (replace those square brackets with parentheses to create a generator instead):
>>> first_words = [i[0] for i in sents_words]
>>> print(*first_words)
Here There I There Here I I Look How
or if you don't need to save it for later use, you can directly print the items:
>>> print(*(i[0] for i in sents_words))
Here There I There Here I I Look How
Here's an example of how to access items in lists and list of lists:
>>> fruits = ['apple','orange', 'banana']
>>> fruits[0]
'apple'
>>> fruits[1]
'orange'
>>> cars = ['audi', 'ford', 'toyota']
>>> cars[0]
'audi'
>>> cars[1]
'ford'
>>> things = [fruits, cars]
>>> things[0]
['apple', 'orange', 'banana']
>>> things[1]
['audi', 'ford', 'toyota']
>>> things[0][0]
'apple'
>>> things[0][1]
'orange'
For your problem:
>>> from nltk import sent_tokenize, word_tokenize
>>>
>>> the_text="Here is some text. There is a a person on the lawn. I am confused. There is more. Here is some more. I don't know anything. I should add more. Look, here is more text. How great is that?"
>>>
>>> tokenized_text = [word_tokenize(s) for s in sent_tokenize(the_text)]
>>>
>>> first_words = []
>>> # Iterate through the sentences.
... for sent in tokenized_text:
...     print(sent)
...
['Here', 'is', 'some', 'text', '.']
['There', 'is', 'a', 'a', 'person', 'on', 'the', 'lawn', '.']
['I', 'am', 'confused', '.']
['There', 'is', 'more', '.']
['Here', 'is', 'some', 'more', '.']
['I', 'do', "n't", 'know', 'anything', '.']
['I', 'should', 'add', 'more', '.']
['Look', ',', 'here', 'is', 'more', 'text', '.']
['How', 'great', 'is', 'that', '?']
>>> # First word in each sentence.
... for sent in tokenized_text:
...     word0 = sent[0]
...     first_words.append(word0)
...     print(word0)
...
...
Here
There
I
There
Here
I
I
Look
How
>>> print(first_words)
['Here', 'There', 'I', 'There', 'Here', 'I', 'I', 'Look', 'How']
As one-liners with list comprehensions:
# From the_text, you extract the first word directly
first_words = [word_tokenize(s)[0] for s in sent_tokenize(the_text)]
# From tokenized_text
tokenized_text= [word_tokenize(s) for s in sent_tokenize(the_text)]
first_words = [s[0] for s in tokenized_text]
Another alternative, although it's pretty much similar to abarnert's suggestion:
first_words = []
for i in range(number_sents):
    first_words.append(sents_words[i][0])
Just a quick beginner's question here with NLTK.
I am trying to tokenize and generate bigrams, trigrams and quadgrams from a corpus.
I need to add <s> to the beginning of my lists and </s> to the end in place of a period if there is one.
The list is taken from the brown corpus in nltk (a specific section of it, at that). So I have:
from nltk.corpus import brown
news = brown.sents(categories = 'editorial')
Am I making this too difficult?
Thanks a lot.
import nltk.corpus as corpus
def mark_sentence(row):
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

news = corpus.brown.sents(categories='editorial')
for row in news[:5]:
    print(mark_sentence(row))
yields
['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
['<s>', 'The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '</s>']
['<s>', 'It', 'was', 'faced', 'immediately', 'with', 'a', 'showdown', 'on', 'the', 'schools', ',', 'an', 'issue', 'which', 'was', 'met', 'squarely', 'in', 'conjunction', 'with', 'the', 'governor', 'with', 'a', 'decision', 'not', 'to', 'risk', 'abandoning', 'public', 'education', '</s>']
['<s>', 'There', 'followed', 'the', 'historic', 'appropriations', 'and', 'budget', 'fight', ',', 'in', 'which', 'the', 'General', 'Assembly', 'decided', 'to', 'tackle', 'executive', 'powers', '</s>']
['<s>', 'The', 'final', 'decision', 'went', 'to', 'the', 'executive', 'but', 'a', 'way', 'has', 'been', 'opened', 'for', 'strengthening', 'budgeting', 'procedures', 'and', 'to', 'provide', 'legislators', 'information', 'they', 'need', '</s>']
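The question also asks for the bigrams, trigrams, and quadgrams themselves; here is a minimal sketch of that step using nltk.util.ngrams on one of the marked sentences above (the function choice is an assumption, not part of the original answer):

from nltk.util import ngrams

sentence = ['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
bigrams = list(ngrams(sentence, 2))
trigrams = list(ngrams(sentence, 3))
quadgrams = list(ngrams(sentence, 4))
print(bigrams[:2])  # [('<s>', 'Assembly'), ('Assembly', 'session')]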