How to treat certain words as delimiters in nltk Python? - python

I'm trying to tokenize the text below, using the stopwords ('is', 'the', 'was') as delimiters.
The expected output is this:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I wrote to try to produce the above output:
import nltk
stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
My code output is this:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I make the expected output?

The problem involves both stopwords and sentence delimiters. Assuming that a sentence ends with a period (.), you can add the period to the alternatives passed to re.split().
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
Because the pattern splits on both a bare . and a . followed by whitespace, the final period produces a trailing ''. Assuming this sentence structure is consistent, you can slice the empty string off. Note that 'the' remains in the last item: the match for ' is ' consumes the space before 'the', so ' the ' can no longer match there.
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
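If the leftover 'the' is unwanted, one option is to let a single delimiter match cover a run of consecutive stopwords. The following is a minimal sketch, not the answer's original method: the names alternation, delimiter and chunks are made up for illustration, and it assumes stopwords appear in lowercase and sentences end with a period.

```python
import re

stopwords = ['is', 'the', 'was']
text = ("Walter was feeling anxious. He was diagnosed today. "
        "He probably is the best person I know.")

# Build "is|the|was" from the list, escaping any regex metacharacters.
alternation = "|".join(re.escape(w) for w in stopwords)

# A delimiter is either a run of one or more whole-word stopwords
# (with surrounding whitespace) or a period plus any trailing whitespace.
delimiter = re.compile(rf"(?:\s*\b(?:{alternation})\b)+\s*|\.\s*")

# Drop the empty string that the final period leaves behind.
chunks = [c for c in delimiter.split(text) if c]
print(chunks)
```

Because "is the" is consumed as one delimiter, no empty string appears between the two stopwords. Passing re.IGNORECASE to re.compile would also catch a capitalized "The" at the start of a sentence.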

Related

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)

I'm looking to get all sentences in a text file that contain at least one of the conjunctions in the list "conjunctions". However, when applying this function for the text in the variable "text_to_look" like this:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    sentences = text.sents
    for sentence in sentences:
        coord_sents = []
        if any(conjunction in sentence for conjunction in conjunctions):
            coord_sents.append(sentence)
    return coord_sents
wanted_sents = get_coordinate_sents(text_to_look)
I get this error message :
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
There seems to be something about spaCy that I'm not aware of and prevents me from doing this...
The error occurs because conjunction is a string and sentence is a Span object; to check whether the sentence text contains a conjunction, you need to access the Span's text property. In addition, you re-initialize coord_sents inside the loop, so only the result for the last sentence is kept. A list comprehension is preferable in such cases.
So, a quick fix for your case is:
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents
            if any(conjunction in sentence.text for conjunction in conjunctions)]
Here is my test:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
file_to_examine = text_to_look
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
text = lang_model(file_to_examine)
sentences = text.sents
coord_sents = [sentence for sentence in sentences if any(conjunction in sentence.text for conjunction in conjunctions)]
Output:
>>> coord_sents
[She's looking to buy one, but she hasn't got any money., She really wanted to book, so she asks another customer to lend her money., They get along really well, so they both exchange phone numbers and go their separate ways.]
However, the in operator does substring matching, so it will find "nor" in "north", "so" in "crimson", etc.
To match whole words only, you need a regex with word boundaries:
import re
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')
def get_coordinate_sents(file_to_examine):
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if rx.search(sentence.text)]
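The difference between substring matching and word-boundary matching can be checked without loading a spaCy model. The sketch below uses plain strings in place of Span.text; the sample sentences and the names substring_hits and word_hits are invented for illustration.

```python
import re

conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(r'\b(?:' + '|'.join(conjunctions) + r')\b')

# Plain strings standing in for sentence.text (hypothetical sample data).
sentences = [
    "The north wind is cold.",   # contains "nor" and "or" only inside "north"
    "A crimson sky.",            # contains "so" only inside "crimson"
    "She asked, but he left.",   # contains the conjunction "but"
]

# Substring test: every sentence is a (false) hit.
substring_hits = [s for s in sentences if any(c in s for c in conjunctions)]

# Word-boundary test: only the sentence with a real conjunction matches.
word_hits = [s for s in sentences if rx.search(s)]

print(substring_hits)
print(word_hits)
```

The regex is case-sensitive, so a sentence-initial "But" or "So" would be missed unless re.IGNORECASE is passed to re.compile.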

Regex - question about finding every word in a string that begins with a letter

import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. However, I got side-tracked a bit and wanted to see if I could get a list of all the words in the string passed to random_regex.findall() that begin with a letter, so I wrote ^\w for the regex pattern. For some reason my output only prints "R" and not the rest of the words in the string. Would anyone be able to explain why and tell me how to fix this problem?
import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
A regex findall should work here:
import re
inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words) # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
^ anchors the pattern to the start of the string, so ^\w matches only the first character, R. Use \w+ to match every whole word. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
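Note that \w also matches digits and the underscore, so \w+ returns every run of word characters, not only words that begin with a letter. If the stricter reading of the title is wanted, a word boundary followed by a letter class is one way to express it. This is a sketch; the extra tokens "2nd" and "_tmp" are added to the sample string purely to show the difference.

```python
import re

text = 'RoboCop eats baby food. BABY FOOD. 2nd _tmp'

# ^\w : a single word character at the very start of the string.
first_char_only = re.findall(r'^\w', text)

# \w+ : every run of letters, digits and underscores.
all_word_runs = re.findall(r'\w+', text)

# \b[A-Za-z]\w* : runs that start at a word boundary with an ASCII letter.
starts_with_letter = re.findall(r'\b[A-Za-z]\w*', text)

print(first_char_only)
print(all_word_runs)
print(starts_with_letter)
```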

Regex: Only want space character before and after match

I am using Regex tokenizer for a text passage, and I would like to extract all words that only have white space before and after that. Here is my code:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[0-9a-z][^\s']*[a-z]")
For instance, the sentence "we don't have 500 dollars" will end up becoming "we don have dollars". I would like to have "don" eliminated, since it is not followed by whitespace. How do I do so?
You can use a positive lookbehind and lookahead to require whitespace (or the start or end of the string) on both sides of each word.
Code:
import re
pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"
print(re.findall(pattern, "we don't have 500 dollars"))
print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))
Output:
['we', 'have', '500', 'dollars']
['Your', 'no', 'good', 'Torrance']
You can play around with this here
https://regex101.com/r/IeLC88/3
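The same condition can be written more compactly with negative lookarounds: "not preceded by a non-whitespace character" covers both the start of the string and a preceding space. This is an alternative sketch rather than the pattern from the answer above; the name standalone is invented for illustration.

```python
import re

# (?<!\S) : no non-whitespace character immediately before the word
# (?!\S)  : no non-whitespace character immediately after the word
standalone = re.compile(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")

first = standalone.findall("we don't have 500 dollars")
second = standalone.findall("Your money's no good here, Mr. Torrance")

print(first)
print(second)
```

Words touching punctuation, such as "don't", "here," and "Mr.", are skipped entirely instead of being cut at the punctuation mark.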

Match shortest substring in list using for loop

I am trying to match items (single words) from one list with items (full sentences) from a second list. This is my code:
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        if word in line:
            print(word,line)
The problem now is that my code outputs substrings, so when looking for a sentence in which 'Python' occurs, I am also getting 'Pythons'; similarly, I am getting 'Funny' when I only want the sentence containing the word 'Fun'.
I have tried adding spaces surrounding the words in the list, but this is not an ideal solution, because the sentences may contain punctuation, and the code does not return a match.
Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice
Since you want exact matches, it'd be better to use == instead of in.
import string
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        for wrd in line.split():
            # strip() removes any punctuation from both ends of wrd
            if wrd.strip(string.punctuation) == word:
                print(word,line)
Retrieving "Fun!" for Fun while rejecting "Pythons" for Python takes a few more lines of code. It can be done, of course, but your rules are not very clear to me at this point. Have a look at this, though:
tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
Below you get exactly the same thing, only this time with good old for loops instead of a list comprehension. I thought it might help you understand the code above more easily.
a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
The code returns each token it found together with the phrase in which it found it. It takes the phrases in the sentences list and splits them on whitespace. So "Pythons" and "Python" are not treated as the same, as you wanted, but neither are "Fun!" and "Fun". The comparison is also case-sensitive.
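You might want to use dynamically generated regular expressions; e.g., for "Python" the regexp will look like r'\bPython\b', where \b is a word boundary. Write the pattern as a raw string, because in a normal string '\b' is a backspace character, and use search rather than match, because match only tests the start of the line.
import re
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    # re.escape guards against tokens that contain regex metacharacters
    regexp = re.compile(r'\b' + re.escape(word) + r'\b')
    for line in sentences:
        if regexp.search(line):
            print(word,line)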
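As a quick check of the word-boundary approach, the sketch below collects the matches into a list instead of printing them, and also shows why the pattern needs a raw string. The function name find_matches and the variable names are invented for illustration.

```python
import re

def find_matches(tokens, sentences):
    """Return (token, sentence) pairs where token occurs as a whole word."""
    pairs = []
    for word in tokens:
        # Raw strings keep \b as a regex word boundary.
        rx = re.compile(r'\b' + re.escape(word) + r'\b')
        for line in sentences:
            if rx.search(line):  # search scans the whole line, unlike match
                pairs.append((word, line))
    return pairs

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
pairs = find_matches(tokens, sentences)
print(pairs)

# In a normal string, '\b' is the single backspace character, not a boundary.
backspace_is_one_char = len('\b') == 1 and len(r'\b') == 2
print(backspace_is_one_char)
```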
Tokenizing the sentence is better than splitting it on spaces, since a tokenizer separates punctuation from words.
For example:
sentence = 'this is a test.'
>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test','.']
Code:
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
import nltk
for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)

find best approach to recognize list of sequence word in sentence

I have two lists of words that I would like to find in a sentence in a given order. Is it possible to do this with a regular expression, or should I check the sentence with if conditions?
n_ali = set(['ali','aliasghar'])
n_leyla = set(['leyla','lili','leila'])
positive_adj = set(['good','nice','handsome'])
negative_adj = set(['bad','hate','lousy'])
sentence = "aliasghar is nice man. ali is handsome man of my life. lili has so many bad attitude who is next to my friend. "
I would like to find any pattern as below:
n_ali + positive_adj
n_ali + negative_adj
n_leyla + positive_adj
n_leyla + negative_adj
I am using Python 3.5 in VS2015 and I am new to NLTK. I know how to create a regular expression to check for a single word, but I am not sure what the best approach is for a list of similar names. Kindly suggest the best way to implement this.
You should consider removing stopwords.
>>> import nltk
>>> from nltk.corpus import stopwords
>>> words = [word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')]
>>> words
['aliasghar', 'nice', 'man', '.', 'ali', 'handsome', 'man', 'life', '.', 'lili', 'many', 'bad', 'attitude', 'next', 'friend', '.']
Alright, now you have the data like you want it (mostly). Let's use simple looping to store the results in pairs for ali and leila separately.
>>> ali_adj = []
>>> leila_adj = []
>>> for i, word in enumerate(words[:-1]):
...     if word in n_ali and (words[i+1] in positive_adj.union(negative_adj)):
...         ali_adj.append((word, words[i+1]))
...     if word in n_leyla and (words[i+1] in positive_adj.union(negative_adj)):
...         leila_adj.append((word, words[i+1]))
...
>>>
>>> ali_adj
[('aliasghar', 'nice'), ('ali', 'handsome')]
>>> leila_adj
[]
Note that we could not find any adjectives to describe leila because "many" isn't a stopword. You may have to do this type of cleaning of the sentence manually.
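The manual cleaning mentioned above can be sketched without NLTK by using a small hand-written filler list. Everything here is an assumption for illustration: the fillers set (which deliberately includes "many"), the helper name name_adj_pairs, and the choice to tokenize with \w+, which drops punctuation and therefore lets a pair span a sentence boundary.

```python
import re

n_ali = {'ali', 'aliasghar'}
n_leyla = {'leyla', 'lili', 'leila'}
adjectives = {'good', 'nice', 'handsome', 'bad', 'hate', 'lousy'}

# Hypothetical hand-picked filler words to skip between a name and an adjective.
fillers = {'is', 'has', 'so', 'many'}

def name_adj_pairs(text, names):
    """Return (name, adjective) pairs that are adjacent once fillers are removed."""
    words = [w for w in re.findall(r'\w+', text.lower()) if w not in fillers]
    return [(w, nxt) for w, nxt in zip(words, words[1:])
            if w in names and nxt in adjectives]

sentence = ("aliasghar is nice man. ali is handsome man of my life. "
            "lili has so many bad attitude who is next to my friend. ")

ali_pairs = name_adj_pairs(sentence, n_ali)
leyla_pairs = name_adj_pairs(sentence, n_leyla)
print(ali_pairs)
print(leyla_pairs)
```

With "many" treated as a filler, the pair for lili is found; how far the filler list should go depends on the text being analyzed.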
