How do I parse a sentence using regex - python

I need to parse a sentence like:
"Alice is a boy." into ['Alice', 'boy'] and
and "An elephant is a mammal." into ['elephant', 'mammal']. Meaning I need to split the string by 'is' while also remove 'a/an'.
Is there an elegant way to do it?

If you insist on using a regex, you can do it like this with re.search; the optional article is matched with a non-capturing group, (?:an? )?:
import re

print(re.search(r'(\w+) is (?:an? )?(\w+)', "Alice is a boy.").groups())
# output: ('Alice', 'boy')
print(re.search(r'(\w+) is (?:an? )?(\w+)', "An elephant is a mammal.").groups())
# output: ('elephant', 'mammal')
# apply list() if you want it as a list
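If you need every such pair from a longer text, re.findall with the same pattern is a natural extension (a small sketch, not part of the original answer):
import re

pairs = re.findall(r'(\w+) is (?:an? )?(\w+)', "Alice is a boy. An elephant is a mammal.")
print(pairs)
# output: [('Alice', 'boy'), ('elephant', 'mammal')]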

This answer does not make use of regex, but it is one way of doing things:
s = 'Alice is a boy'
s = s.split()  # each word becomes an entry in a list
s = [word for word in s if word not in ('is', 'a', 'an')]
The main downside to this is that you would need to list out every word you want to exclude in the list comprehension.
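One way to soften that downside is to keep the exclusions in a set, so the comprehension stays short however many words you exclude (a minimal sketch; the name skip_words is mine):
skip_words = {'is', 'a', 'an'}
s = [word for word in 'Alice is a boy'.split() if word not in skip_words]
print(s)
# output: ['Alice', 'boy']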

Related

Python How to skip the part in a string marked by certain symbols?

I'm trying to reconstruct a sentence by one-to-one matching the words in a word list to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i)
        text=final
print(final)
the expected output will be like:
a cat is an animal
If I run my code, the 'a' and 'an' in 'animal' will be unavoidably separated too.
So I want to sort the word list by the length, and search for the long words first.
words.sort(key=len)
words=words[::-1]
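As a side note, those two lines can be combined into a single call:
words.sort(key=len, reverse=True)  # longest words first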
Then I would like to mark the long words with special symbols, and expect the program could skip the part I marked. For example:
acatisan%animal&
And finally I will erase the symbols. But I'm stuck here: I don't know how to make the program skip the parts marked between '%' and '&'. Can anyone help me? Or are there better ways to solve the spacing problem? Thanks a lot!
For another case, what if the text includes words that are not in the word list? How could I handle this?
text='wowwwwacatisananimal'
A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:
def compose(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            return ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result + [word]))
>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'
If there are potentially multiple possible sentence compositions, it would be trivial to turn this into a generator: replace return with yield to yield all matching sentence compositions.
Contrived example (just replace return with yield):
>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
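For reference, a sketch of that generator variant (compose_all is a hypothetical name; the body is compose with yield substituted for return):
def compose_all(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            yield ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result + [word]))

>>> list(compose_all('adultsexchange', ['adult', 'sex', 'adults', 'exchange', 'change']))
['adults exchange', 'adult sex change']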
Maybe you can replace each word with its index, so the final string looks like 3 0 1 2 4, and then convert it back to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in sorted(words,key=len,reverse=True):
    if i in text:
        final=text.replace(i,' %s'%words.index(i))
        text=final
print(" ".join(words[int(i)] for i in final.split()))
Output:
a cat is an animal
You need a small modification in your code: update the line
final=text.replace(i,' '+i)
to
final=text.replace(i,' '+i, 1)
This will replace only the first occurrence.
So the updated code would be
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        final=text.replace(i,' '+i, 1)
        text=final
print(final)
Output is:
a cat is an animal
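For reference, the optional third argument to str.replace caps the number of substitutions:
print('banana'.replace('a', '_', 1))
# output: b_nana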
If you are stuck on the part about removing only the symbols, then regex is what you are looking for. Import the re module and do this:
import re
# ... code here ...
print(re.sub(r'\W+', ' ', final))
I wouldn't recommend using different delimiters either side of your matched words ('%' and '&' in your example).
It's easier to use the same delimiter either side of your marked word and use Python's list slicing.
The solution below uses the [::n] syntax for getting every nth element of a list.
a[::2] gets even-numbered elements, a[1::2] gets the odd ones.
>>> fox = "the|quick|brown|fox|jumpsoverthelazydog"
Because they have | characters on either side, 'quick' and 'fox' are odd-numbered elements when you split the string on |:
>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']
and the rest are even:
>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']
So, by enclosing known words in | characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.
from itertools import chain

def flatten(list_of_lists):
    return chain.from_iterable(list_of_lists)

def parse(source_text, words):
    words.sort(key=len, reverse=True)
    texts = [source_text, '']  # even number of elements helps the zip function
    for word in words:
        new_matches_and_text = []
        for text in texts[::2]:
            new_matches_and_text.append(text.replace(word, f"|{word}|"))
        previously_matched = texts[1::2]
        # merge new matches back in
        merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
        texts = merged.split('|')
    # remove blank words (matches at start or end of a string)
    texts = [text for text in texts if text]
    return ' '.join(texts)
>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an', 'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'
The merge code uses the zip and flatten functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
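As a toy illustration of that splice (the values here are made up, not taken from the parser):
from itertools import chain

evens = ['x', 'y']  # unmatched text fragments (even positions)
odds = ['A', 'B']   # previously matched words (odd positions)
print('|'.join(chain.from_iterable(zip(evens, odds))))
# output: x|A|y|B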
This approach leaves the unrecognised words in the text.
'beautiful' and 'a' are handled well because they're on their own (i.e. next to recognised words).
'enormous' and 'scary' are not known and, as they're next to each other, they're left stuck together.
Here's how to list the unknown words:
>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']
I'm curious: is this a bioinformatics project?
A list comprehension combined with zip and sorted is another way to do it:
result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])
So, I used zip to combine words and their position in text, sorted the words by their position in original text and finally joined the result with ' '.
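The same expression unrolled into steps, which may be easier to read (the intermediate names are mine):
text = 'acatisananimal'
words = ['cat', 'is', 'an', 'a', 'animal']
positions = [(word, text.find(word)) for word in words]  # pair each word with its first index in text
ordered = sorted(positions, key=lambda pair: pair[1])    # sort by position in the original text
result = ' '.join(word for word, _ in ordered)
print(result)
# output: a cat is an animal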

Trying to make sure certain symbols aren't in a word

I currently have the following to filter out words containing square or round brackets, and I can't help but think there must be a tidier way to do this:
words = [word for word in random.choice(headlines).split(" ")[1:-1] if "[" not in word and "]" not in word and "(" not in word and ")" not in word]
I tried creating a list or tuple of symbols and doing
if symbol not in word
But it dies because I'm comparing a list with a string. I appreciate I could explode this out and do a compare like:
for word in random.choice(headlines).split(" ")[1:-1]:
    popIn = 1
    for symbol in symbols:
        if symbol in word:
            popIn = 0
    if popIn == 1:
        words.append(word)
But it seems like overkill in my head. I appreciate I'm a novice programmer so anything I can do to tidy either method up would be very helpful.
Use set intersection.
brackets = set("[]()")
words = [word for word in random.choice(headlines).split(" ")[1:-1] if not brackets.intersection(word)]
The intersection is empty if word does not contain any of the characters in brackets.
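A quick interactive check of that behaviour:
>>> brackets = set("[]()")
>>> brackets.intersection("f(o)o")
{'(', ')'}          # element order may vary
>>> brackets.intersection("foo")
set()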
You might also consider using itertools instead of a list comprehension (ifilterfalse is the Python 2 name; in Python 3 it is itertools.filterfalse).
words = list(itertools.ifilterfalse(brackets.intersection,
                                    random.choice(headlines).split(" ")[1:-1]))
I'm not sure exactly what you want to filter, but I'd advise using Python's regular expression module:
import re
r = re.compile(r"\w*[\[\]\(\)]+\w*")
test = ['foo', '[bar]', 'f(o)o']
result = [word for word in test if not r.match(word)]
print(result)
output is ['foo']

Removing list of words from a string

I have a list of stopwords. And I have a search string. I want to remove the words from the string.
As an example:
stopwords=['what','who','is','a','at','is','he']
query='What is hello'
Now the code should strip 'What' and 'is'. However in my case it strips 'a', as well as 'at'. I have given my code below. What could I be doing wrong?
for word in stopwords:
    if word in query:
        print word
        query=query.replace(word,"")
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?
This is one way to do it:
query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
I noticed that you want to also remove a word if its lower-case variant is in the list, so I've added a call to lower() in the condition check.
The accepted answer works when provided a list of words separated by spaces, but that's not the case in real life, where punctuation can separate words. In that case, re.split is required.
Also, testing against stopwords as a set makes lookup faster (even if there's a tradeoff between string hashing & lookup when there's a small number of words)
My proposal:
import re
query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}
resultwords = [word for word in re.split(r"\W+", query) if word.lower() not in stopwords]
print(resultwords)
output (as list of words):
['hello','Says','']
There's a blank string at the end, because re.split annoyingly emits blank fields; those need filtering out. Two solutions here:
resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]  # filter out empty words
or add empty string to the list of stopwords :)
stopwords = {'what','who','is','a','at','is','he',''}
now the code prints:
['hello','Says']
Building on what karthikr said, try
' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))
explanation:
query.split()   # splits the variable query on ' ', i.e. "What is hello" -> ["What", "is", "hello"]
filter(func, iterable)   # takes a function and an iterable (list/string/etc.) and
                         # filters it based on the function, which takes one item at
                         # a time and returns True/False
lambda x: x.lower() not in stopwords   # anonymous function that takes a variable,
                                       # converts it to lower case, and returns True if
                                       # the word is not in the iterable stopwords
' '.join(iterable)   # joins all items of the iterable (items must be strings/chars)
                     # using the string/char in front of the dot, i.e. ' ', as the joiner,
                     # i.e. ["What", "is", "hello"] -> "What is hello"
Looking at the other answers to your question I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.
If the input query is "What is Hello", I get the output as:
wht s llo
Why does this happen?
This happens because .replace() replaces the substring you give it exactly.
for example:
"My, my! Hello my friendly mystery".replace("my", "")
gives:
>>> "My, ! Hello friendly stery"
.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.
"hello".replace("he", "je")
is logically similar to:
"je".join("hello".split("he"))
If you were still wanting to use .replace to remove whole words you might think adding a space before and after would be enough, but this leaves out words at the beginning and end of the string as well as punctuated versions of the substring.
"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"
"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"
"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"
Additionally, adding spaces before and after will not catch duplicates as it has already processed the first sub-string and will ignore it in favor of continuing on:
"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
For these reasons your accepted answer by Robby Cornelissen is the recommended way to do what you are wanting.
" ".join([x for x in query.split() if x not in stopwords])
stopwords=['for','or','to']
p='Asking for help, clarification, or responding to other answers.'
for i in stopwords:
    n=p.replace(i,'')
    p=n
print(p)

Determine if a list of words is in a sentence?

Is there a way (Pattern or Python or NLTK, etc.) to detect if a sentence has a list of words in it?
i.e.
The cat ran into the hat, box, and house. | The list would be hat, box, and house
This could be string processed but we may have more generic lists:
i.e.
The cat likes to run outside, run inside, or jump up the stairs. |
List=run outside, run inside, or jump up the stairs.
This could be in the middle of a paragraph or at the end of a sentence, which further complicates things.
I've been working with Pattern for Python for a while, and I'm not seeing a way to go about this; I was curious if there is a way with Pattern or NLTK (Natural Language Toolkit).
From what I got from your question, I think you want to check whether all the words in your list are present in a sentence.
In general, to search for a list's elements in a sentence, you can use the all() function. It returns True if all of the conditions in it are true.
listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"
if all(word in sentence for word in listOfWords):
    print("All words in sentence")
else:
    print("Missing")
OUTPUT:
"All words in sentence"
(Note that this is a substring test: 'word3' is found inside 'fword3'.)
I think this might serve your purpose. If not, then you can clarify.
Using a trie, you can achieve this in O(n), where n is the number of words in the sentence, after building a trie from the list of words, which takes O(m), where m is the number of words in the list.
Algorithm
1. Split the sentence into a list of words separated by spaces.
2. For each word, check whether it is a key in the trie, i.e. whether that word exists in the list.
3. If it exists, add it to the result, to keep track of how many words from the list appear in the sentence.
4. Also keep track of words that have a subtrie, i.e. words that are a prefix of a longer entry in the list of words.
5. For each such prefix, check whether extending it with the current word yields a key or a subtrie in the list of words.
6. If it yields a subtrie, add it to the extended_words set and check whether concatenating the following words eventually produces an exact match.
Code
import pygtrie

listOfWords = ['word1', 'word2', 'word3', 'two words']
trie = pygtrie.StringTrie(separator=' ')
for word in listOfWords:
    trie[word] = True

sentence = "word1 as word2 a fword3 af two words"
sentence_words = sentence.split()

words_found = {}
extended_words = set()  # prefixes of multi-word entries seen so far
for possible_word in sentence_words:
    has_possible_word = trie.has_node(possible_word)
    if has_possible_word & trie.HAS_VALUE:
        words_found[possible_word] = True

    # try to extend every pending prefix with the current word
    for extended_word in set(extended_words):
        extended_words.remove(extended_word)
        possible_extended_word = extended_word + ' ' + possible_word
        has_possible_extended_word = trie.has_node(possible_extended_word)
        if has_possible_extended_word & trie.HAS_VALUE:
            words_found[possible_extended_word] = True
        if has_possible_extended_word & trie.HAS_SUBTRIE:
            extended_words.add(possible_extended_word)

    if has_possible_word & trie.HAS_SUBTRIE:
        extended_words.add(possible_word)

print(words_found)
print(len(words_found) == len(listOfWords))
This is useful if your list of words is huge and you do not wish to iterate over it every time, or if you have a large number of queries over the same list of words.
What about using from nltk.tokenize import sent_tokenize?
sent_tokenize("Hello SF Python. This is NLTK.")
["Hello SF Python.", "This is NLTK."]
Then you can use that list of sentences in this way:
for sentence in my_list:
    # test whether this sentence contains the words you want, using all():
    if all(word in sentence for word in listOfWords):
        ...  # handle a matching sentence
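Putting sent_tokenize and all() together (a sketch; assumes nltk and its 'punkt' tokenizer data are installed):
from nltk.tokenize import sent_tokenize

listOfWords = ['hat', 'box', 'house']
text = "The cat ran into the hat, box, and house. The dog slept."
for sentence in sent_tokenize(text):
    if all(word in sentence for word in listOfWords):
        print(sentence)
# output: The cat ran into the hat, box, and house.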

Remove all articles, connector words, etc., from a string in Python

I have a list that contains many sentences. I want to iterate through the list, removing from all sentences words like "and", "the", "a", "are", etc.
I tried this:
def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    for i, j in articles.iteritems():  # Python 2; use .items() in Python 3
        text = text.replace(i, j)
    return text
As you can probably tell, however, this will remove "a" and "an" when it appears in the middle of the word. I need to remove only the instances of the words when they are delimited by blank space, and not when they are within a word. What is the most efficient way of going about this?
I would go for regex, something like:
import re

def removearticles(text):
    return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)
or, if you want to remove the leading whitespace as well:
def removearticles(text):
    return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
Note the raw strings: without them, '\1' in the replacement would be interpreted as an escape character rather than a backreference.
This looks more like an NLP job than something you would do with straight regex. I would check out NLTK (http://www.nltk.org/); IIRC it comes with a corpus full of filler words like the ones you're trying to get rid of.
def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    rest = []
    for word in text.split():
        if word not in articles:
            rest.append(word)
    return ' '.join(rest)
The in operator on a dict runs faster than on a list.
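A set expresses the same intent even more directly (a minor variant of the answer above, not from the original):
def removearticles(text):
    articles = {'a', 'an', 'and', 'the'}  # set membership is O(1) on average
    return ' '.join(word for word in text.split() if word not in articles)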
Try something along the lines of
def removearticles(text):  # wrapped in a function so the return below is valid
    articles = ['and', 'a']
    newText = ''
    for word in text.split(' '):
        if word not in articles:
            newText += word + ' '
    return newText[:-1]
It can be done using regex. Iterate through your strings, or ' '.join the list and pass it as a single string, to the following regex:
>>> import re
>>> rx = re.compile(r'\ban\b|\bthe\b|\band\b|\ba\b')
>>> rx.sub(' ','a line with lots of an the and a baad')
' line with lots of baad'
