Python regular expression search - python

I am trying to find every phrase that ends with the keyword 'car'.
For example, given text = 'alice: speed car, my red car, new car', I would like to find 'speed car', 'my red car', 'new car'.
import re
text = 'alice: speed car, my red car, new car'
regex = r'([a-zA-Z]+\s)+car'
match = re.findall(regex, text)
if match:
    print(match)
but the above code yields:
["speed ", "red ", "new "]
instead of
["speed car", "my red car", "new car"]
which is what I expected.

The problem is that 'car' sits outside the capturing group, and re.findall returns only the group's contents when the pattern contains a group. Put the whole regex inside () and use ?: on the inner group to make it non-capturing.
>>> regex = r'((?:[a-zA-Z]+\s)+car)'
>>> text = 'alice: speed car, my red car, new car'
>>> re.findall(regex, text)
['speed car', 'my red car', 'new car']
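To see why the original pattern returned only fragments, it can help to run the variants side by side. This is a minimal sketch rather than part of the answer above, and the variable names are illustrative. It relies on two documented behaviours of re.findall: it returns group contents when the pattern has a capturing group, and a repeated group keeps only its last repetition.

```python
import re

text = 'alice: speed car, my red car, new car'

# Capturing group repeated with +: findall returns the group, and the group
# holds only the last word it matched before 'car'.
last_repeats = re.findall(r'([a-zA-Z]+\s)+car', text)

# Non-capturing group: there is no group to return, so findall yields the
# whole match.
whole_matches = re.findall(r'(?:[a-zA-Z]+\s)+car', text)

# The original pattern also works if the whole match is read explicitly.
via_finditer = [m.group(0) for m in re.finditer(r'([a-zA-Z]+\s)+car', text)]

print(last_repeats)   # ['speed ', 'red ', 'new ']
print(whole_matches)  # ['speed car', 'my red car', 'new car']
print(via_finditer)   # ['speed car', 'my red car', 'new car']
```

The outer parentheses in the answer's pattern are therefore optional; making the inner group non-capturing is the part that changes the result.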

Related

Python remove sentence if it is at start of string and starts with specific words?

I have strings that look like:
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
I want to remove the first sentence of a string if it starts with 'Hi' or 'Hello'.
Desired Output:
docs = ['Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.']
The regex I have is:
re.match('.*?[a-z0-9][.?!](?= )', x)
But this only gives me the first sentence, in a weird format like:
<re.Match object; span=(0, 41), match='Hi, my name is Eric.'>
What can I do to get my desired output?
You can use
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
Details:
^ - start of string
H(?:ello|i)\b - Hello or Hi word (\b is a word boundary)
.*? - any zero or more chars other than line break chars as few as possible
[.?!] - a ., ? or !
\s+ - one or more whitespaces.
Python demo:
import re
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
print(docs)
Output:
[
'Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.'
]
You would first have to split each string into sentences:
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))
Then check each sentence against your regex and add the ones that do not match to the final array:
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if not re.match('.*?[a-z0-9][.?!](?= )', sentence):
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
Actually, your regex does not work here, so I changed the code to use plain substring checks instead. It goes as follows:
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
Finally, filter your array to remove all the empty strings that may have been created in the process of joining:
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
Output:
[' Are you blue?', 'This is a great idea. I would love to go.', ' What is your name?', 'I am ready to go. Mom says hello.']
I'll leave the full code here. Any suggestion is welcome; I am sure this can be solved with a more functional approach that may be easier to understand, but I am not familiar enough with that style.
import re
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
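The more functional style mentioned above could look roughly like this. It is only a sketch of the same split-and-filter logic; the helper name drop_greeting_sentences is made up for illustration and does not come from the answer.

```python
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

def drop_greeting_sentences(doc):
    # Hypothetical helper: split on '.', drop every piece that contains
    # 'Hello' or 'Hi', and glue the remaining pieces back together.
    kept = [s for s in doc.split('.') if 'Hello' not in s and 'Hi' not in s]
    return '.'.join(kept)

# Empty strings (documents whose pieces were all dropped) are filtered out.
final_docs = [d for d in map(drop_greeting_sentences, docs) if d]
print(final_docs)
```

Like the loop version, this splits only on '.', so sentences ending in '!' or '?' are not separated, and the substring test also drops any sentence that merely contains 'Hi' or 'Hello' somewhere in the middle.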

Comparing strings to a text to set punctuation marks in the right places

There is a text, and there are phrases taken from it that have lost their punctuation. I need to restore the punctuation marks in each phrase:
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
The output I need is:
output = [['apples, and donuts'], ['a donut, i would']]
I'm a beginner, so I was thinking about using .replace() but I don't know how to slice a string and access the exact part I need from the text. Could you help me with that? (I'm not allowed to use any libraries)
You can try regex for that
import re
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]
print([re.findall(i[0].replace(" ", r"\W*"), text) for i in phrases])
Output
[['apples, and donuts'], ['a donut, i would']]
By iterating over the phrases list and replacing each space with \W*, re.findall can match the phrase in the text even when punctuation sits between the words.
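A slightly more defensive variant of the same idea is sketched below. It assumes the phrases may contain characters that are special in a regex, so each word is passed through re.escape, and it uses \W+ with \b anchors so that words cannot run together or match inside longer words. The function name restore_punctuation is hypothetical.

```python
import re

def restore_punctuation(phrase, text):
    # Hypothetical helper: allow any run of non-word characters between the
    # words of the phrase, and require word boundaries at both ends.
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\W+".join(words) + r"\b"
    match = re.search(pattern, text)
    return match.group(0) if match else None

text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = [['apples and donuts'], ['a donut i would']]

restored = [[restore_punctuation(p, text) for p in group] for group in phrases]
print(restored)  # [['apples, and donuts'], ['a donut, i would']]
```

The \b anchors assume every phrase starts and ends with a word character; a phrase that is not found comes back as None instead of being silently dropped.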
You can remove all the punctuation from the text and then use a plain substring search. The only remaining problem is how to map the text you have found back to the original.
You can do that by remembering the original position of each character that you keep in the search text. Here's an example. I removed the nested list around each phrase because it looks unnecessary; you can easily account for it if you need to.
from pprint import pprint
text = 'i like plums, apples, and donuts. if i had a donut, i would eat it'
phrases = ['apples and donuts', 'a donut i would']
def find_phrase(text, phrases):
    # search in the punctuation-free text, then map back to the original
    clean_text, indices = prepare_text(text)
    res = []
    for phr in phrases:
        i = clean_text.find(phr)
        if i != -1:
            # slice from the first to the last matched character, inclusive
            res.append(text[indices[i] : indices[i+len(phr)-1]+1])
    return res
def prepare_text(text, punctuation='.,;!?'):
    s = ''
    ind = []  # ind[k] is the position in text of the k-th kept character
    for i in range(len(text)):
        if text[i] not in punctuation:
            s += text[i]
            ind.append(i)
    return s, ind
if __name__ == "__main__":
    pprint(find_phrase(text, phrases))
Output:
['apples, and donuts', 'a donut, i would']

Merging consecutive elements recursively in a list of strings in Python

I have a list of strings which needs to be transformed into a smaller list of strings, depending on whether two consecutive elements belong to the same phrase.
At the moment, two elements belong together if the last character of the i-th string is lowercase and the first character of the (i+1)-th string is also lowercase, but more complex conditions should be checked in the future.
For example this very profound text:
['I am a boy',
'and like to play'
'My friends also'
'like to play'
'Cats and dogs are '
'nice pets, and'
'we like to play with them'
]
should become:
['I am a boy and like to play',
'My friends also like to play',
'Cats and dogs are nice pets, and we like to play with them'
]
My python solution
I think the data you posted is meant to be comma-separated. If so, here is a simple loop solution:
data=['I am a boy',
'and like to play',
'My friends also',
'like to play',
'Cats and dogs are ',
'nice pets, and',
'we like to play with them'
]
required_list = []
for j, i in enumerate(data):
    print(i, j)  # debug output
    if j == 0:
        req = i
    else:
        if i[0].isupper():
            # an uppercase first letter starts a new phrase
            required_list.append(req)
            req = i
        else:
            req = req + " " + i
required_list.append(req)
print(required_list)
Here is the code; check it:
import re
data = ['I am a boy',
'and like to play',
'My friends also',
'like to play',
'Cats and dogs are ',
'nice pets, and',
'we like to play with them'
]
joined_string = ",".join(data).replace(',',' ')
values = re.findall('[A-Z][^A-Z]*', joined_string)
print(values)
Since you want to do it recursively, you can try something like this:
def join_text(text, new_text):
    if not text:
        return
    if not new_text:
        new_text.append(text.pop(0))
        return join_text(text, new_text)
    phrase = text.pop(0)
    if phrase[0].islower(): # you can add more complicated logic here
        new_text[-1] += ' ' + phrase
    else:
        new_text.append(phrase)
    return join_text(text, new_text)
phrases = [
'I am a boy',
'and like to play',
'My friends also',
'like to play',
'Cats and dogs are ',
'nice pets, and',
'we like to play with them'
]
joined_phrases = []
join_text(phrases, joined_phrases)
print(joined_phrases)
My solution has some problems with whitespace, but I hope you get the idea.
Hope it helps!
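Since the question expects more complex conditions later, one option is to keep the merge loop separate from the rule that decides whether two neighbours belong together. The sketch below is an assumption about how that could be organised, not code from any of the answers; merge_phrases and same_phrase are illustrative names. The rule shown is the one stated in the question (lowercase last character, lowercase first character), applied after stripping surrounding whitespace.

```python
def same_phrase(prev, nxt):
    # Rule from the question: previous element ends in a lowercase letter
    # and the next element starts with a lowercase letter.
    return bool(prev) and bool(nxt) and prev[-1].islower() and nxt[0].islower()

def merge_phrases(lines, belongs_together=same_phrase):
    merged = []
    for line in lines:
        line = line.strip()  # avoids doubled spaces when joining
        if merged and belongs_together(merged[-1], line):
            merged[-1] = merged[-1] + ' ' + line
        else:
            merged.append(line)
    return merged

data = ['I am a boy',
        'and like to play',
        'My friends also',
        'like to play',
        'Cats and dogs are ',
        'nice pets, and',
        'we like to play with them']

merged = merge_phrases(data)
print(merged)
```

A different rule can be swapped in by passing another function as belongs_together, without touching the loop.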

Replace words in a string, ignoring quotes and matching the whole word

I know this question is very similar to other questions on here, however I have not been able to adapt a working answer out of any of the solutions. Apologies!
I am looking to replace words in a string, ignoring anything in quotes, and matching the whole word.
i.e. good evening my name is Tommy, I like football and "my" favourite sports is myfootball.
I would like to replace 'my', for instance, but I do not want to replace the "my" or 'my' in "myfootball".
The words I want to replace will be read from a list.
Thanks,
You can use the re module:
>>> import re
>>> s = 'good evening my name is Tommy, I like football and "my" favourite sports is myfootball'
>>> re.sub(r"\ my "," your ", s)
'good evening your name is Tommy, I like football and "my" favourite sports is myfootball'
or the str.replace function:
>>> s.replace(' my ',' your ')
'good evening your name is Tommy, I like football and "my" favourite sports is myfootball'
In [84]: re.sub(r'\s+"|"\s+', " ", s) # sub a " preceded by a space or a " followed by a space
Out[84]: 'good evening my name is Tommy"s, I like football and my favourite sports is myfootball.'
In [88]: re.sub(r'"(my)"', r'\1', s)
Out[88]: 'good evening my name is Tommy, I like football and my favourite sports is my football.'
In [89]: re.sub(r'"(\w+)"', r'\1', s)
Out[89]: 'good evening my name is Tommy, I like football and my favourite sports is my football.'
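A sketch that tries to cover both requirements from the question at once (skip quoted text, match whole words only) is shown below. It assumes that quoted text is delimited by double quotes only and that the replacement words are plain word characters; the function name replace_outside_quotes and the dict of replacements are illustrative, since the question only says the words come from a list.

```python
import re

def replace_outside_quotes(text, replacements):
    # Alternation trick: the first branch consumes a whole "quoted" chunk,
    # the second branch captures a whole word that should be replaced.
    words = '|'.join(re.escape(w) for w in replacements)
    pattern = re.compile(r'"[^"]*"|\b(' + words + r')\b')

    def swap(match):
        word = match.group(1)
        if word is None:
            return match.group(0)   # quoted chunk: leave untouched
        return replacements[word]   # bare whole word: replace

    return pattern.sub(swap, text)

s = 'good evening my name is Tommy, I like football and "my" favourite sports is myfootball'
result = replace_outside_quotes(s, {'my': 'your'})
print(result)
# good evening your name is Tommy, I like football and "my" favourite sports is myfootball
```

Because the quoted chunk is matched first and returned unchanged, the word inside it is never seen by the second branch, and the \b anchors keep 'Tommy' and 'myfootball' intact.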

Python - Don't Understand The Returned Results of This Concatenated Regex Pattern

I am a Python newb trying to get more understanding of regex. Just when I think I got a good grasp of the basics, something throws me - such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe joined list of nouns' + 'whitespace' - how come:
' eggs' was found with a space before it and not after it?
'bacon' was found with no spaces either side of it?
'donkey' was found with no spaces either side of it and the fact there is no whitespace after it?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joined list of nouns, so the first \s is part of the eggs option, the bacon and donkey options have no spaces, and the dog option includes the final \s metacharacter.
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
Demo:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.
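The effect of grouping can be checked by running the variants next to each other. This is a small sketch using the same text and noun list; the re.escape call and the \b variant are additions for illustration and are not part of the answer above.

```python
import re

text = "Some nouns like eggs egg bacon what a lovely donkey"
noun_list = ['eggs', 'bacon', 'donkey', 'dog']
alternation = '|'.join(re.escape(noun) for noun in noun_list)

# No group: | splits the whole pattern into '\seggs', 'bacon', 'donkey', 'dog\s'.
ungrouped = re.findall(r'\s' + alternation + r'\s', text)

# Non-capturing group: \s applies on both sides of every noun.
grouped = re.findall(r'\s(?:' + alternation + r')\s', text)

# Capturing group: findall returns only the group, so the spaces are dropped.
capturing = re.findall(r'\s(' + alternation + r')\s', text)

# Word boundaries instead of \s also match a noun at the end of the text.
word_bounded = re.findall(r'\b(?:' + alternation + r')\b', text)

print(ungrouped)     # [' eggs', 'bacon', 'donkey']
print(grouped)       # [' eggs ', ' bacon ']
print(capturing)     # ['eggs', 'bacon']
print(word_bounded)  # ['eggs', 'bacon', 'donkey']
```

Whether \s or \b is the better delimiter depends on whether the surrounding spaces are wanted in the result; \b matches a position and consumes no characters.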
