re.findall() isn't as greedy as expected - Python 2.7

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence should be in the list, but everything in the list does need to be a complete sentence. Below is the code that will illustrate the issue:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
Per this regex tester, I should, in theory, be getting a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I am actually getting is like this:
>>> [' World', ' speaking']
The documentation indicates that findall searches from left to right and that the * and + operators are greedy. Appreciate the help.

The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
It is easy to see what is going on using re.finditer() and exploring the match objects:
>>> import re
>>> text = "Hello World! This is your captain speaking."
>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)
>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
The solution to your problem is to make the group non-capturing with (?:...). Then you get the expected results:
>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
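For a side-by-side comparison, here is a minimal sketch of the three behaviours described above. The variable names (captured, whole, via_finditer) are only illustrative, and the patterns are written as raw strings, which is an assumption about style rather than something the question requires.

```python
import re

text = "Hello World! This is your captain speaking."

# Capturing group: findall returns only what group 1 captured
# (the last repetition of the group), not the whole match.
captured = re.findall(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)

# Keeping the original pattern also works if you read group(0)
# from each match object yourself.
via_finditer = [
    m.group(0)
    for m in re.finditer(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
]

print(captured)
print(whole)
print(via_finditer)
```

The finditer variant is handy when you want both the full sentence and the captured subgroup from the same pass.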

You can change your regex somewhat:
>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']

Related

Regex only returning first part of match

I'd like to extract the cat and another mat from this sentence:
>>> text = "the cat sat on another mat"
>>>
>>> re.findall('(the|another)\s+\w+', text)
['the', 'another']
But it won't return the cat and mat following. If I change it to re.findall('another\s+\w+', text) then it finds that part, but why doesn't the (first thing | second thing) work?
(Using Python's re module)
I would do
import re
text = "the cat sat on another mat"
re.findall('the\s+\w+|another\s+\w+', text)
The result should be
>>> ['the cat', 'another mat']
re.findall returns only the substrings in the capture group if a capture group exists in the given regex pattern, so in this case you should use a non-capturing group instead, so that re.findall would return the entire matches:
re.findall('(?:the|another)\s+\w+', text)
This returns:
['the cat', 'another mat']
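To make the rule concrete, here is a small sketch showing what findall returns with one capturing group, with two, and with none. The two-group pattern is an extra illustration that is not in the question, and the variable names are made up for the example.

```python
import re

text = "the cat sat on another mat"

# One capturing group: only the group's text comes back.
one_group = re.findall(r"(the|another)\s+\w+", text)

# Two capturing groups: each result is a tuple, one item per group.
two_groups = re.findall(r"(the|another)\s+(\w+)", text)

# No capturing groups: the entire match comes back.
no_groups = re.findall(r"(?:the|another)\s+\w+", text)

print(one_group)
print(two_groups)
print(no_groups)
```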

Insert space between characters regex

I'm very new to regex, and I'm trying to find instances in a string where there is a word consisting of either the letter w or e followed by 2 digits, such as e77, w10, etc.
Here's the regex that I currently have, which I think finds that (correct me if I'm wrong):
([e|w])\d{0,2}(\.\d{1,2})?
How can I add a space right after the letter e or w? If there are no instances where the criteria is met, I would like to keep the string as is. Do I need to use re.sub? I've read a bit about that.
Input: hello e77 world
Desired output: hello e 77 world
Thank You.
Your regex needs to just look like this:
([ew])(\d{2})
if you want to only match specifically 2 digits, or
([ew])(\d{1,2})
if you also want to match single digits like e4
The parentheses are called capturing groups and can be back-referenced in a search and replace or, in Python, with re.sub.
Your replacement string should look like
\1 \2
So it should be as simple as a line like:
re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
EDIT: working code
>>> import re
>>> your_string = 'hello e77 world'
>>>
>>> re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
'hello e 77 world'
This is what you're after:
import re
print(re.sub(r'([ew])(\d{1,2})', r'\g<1> \g<2>', 'hello e77 world'))
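On the "keep the string as is" part of the question: re.sub returns the input unchanged when nothing matches, so no extra check is needed. The sketch below shows that, plus a stricter variant using \b word boundaries. The helper names add_space and add_space_strict are hypothetical, and the strict variant assumes you only want standalone tokens such as e77, not letters inside longer words.

```python
import re

# Same pattern as in the answers above.
loose = re.compile(r"([ew])(\d{1,2})")

# Assumed stricter variant: exactly two digits, and the token
# must stand alone as a word.
strict = re.compile(r"\b([ew])(\d{2})\b")

def add_space(s):
    # \1 is the letter, \2 the digits; put a space between them.
    return loose.sub(r"\1 \2", s)

def add_space_strict(s):
    return strict.sub(r"\1 \2", s)

print(add_space("hello e77 world"))
print(add_space("hello world"))        # no match, returned unchanged
print(add_space_strict("the77 e77"))   # only the standalone e77 changes
```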

Python regex capturing groups not working for simple expression

I would like to get 2 captured groups for a pair of consecutive words. I use this regular expression:
r'\b(hello)\b(world)\b'
However, searching "hello world" with this regular expression yields no results:
regex = re.compile(r'\b(hello)\b(world)\b')
m = regex.match('hello world') # m evaluates to None.
You need to allow for space between the words:
>>> import re
>>> regex = re.compile(r'\b(hello)\s*\b(world)\b')
>>> regex.match('hello world')
<_sre.SRE_Match object at 0x7f6fcc249140>
>>>
Discussion
The regex \b(hello)\b(world)\b requires that the word hello end exactly where the word world begins but with a word break \b between them. That cannot happen. Adding space, \s, between them fixes this.
If you meant to allow punctuation or other separators between hello and world, then that possibility should be added to the regex.
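As one possible way to allow other separators, the sketch below swaps \s* for \W+, which accepts any run of non-word characters (spaces, commas and so on). That choice is an assumption about what "other separators" should mean; adjust the class to whatever you actually expect between the words.

```python
import re

# The original pattern: \b(hello)\b(world) can never match, because
# "world" would have to start right where "hello" ends.
original = re.compile(r"\b(hello)\b(world)\b")

# Assumed variant: one or more non-word characters between the words.
relaxed = re.compile(r"\b(hello)\W+(world)\b")

print(original.match("hello world"))          # None
print(relaxed.match("hello world").groups())
print(relaxed.match("hello, world").groups())
print(relaxed.match("helloworld"))            # None, a separator is required
```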

Parse a string from a string with regex

I need a regex that will parse a string from a string.
To show you what I mean, imagine that the following is the content of the string to parse:
"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\"
where "..." denotes some arbitrary data.
The regex would need to return the list:
["a string", "another \"string\"\\", "yet another \"string"]
Edit: Note that the literal backslashes don't stop the second match
I've tried finditer but it won't find overlapping matches, and I tried the lookahead (?=) but I couldn't get that to work either.
Help?
You could try the regex below. It matches a string that starts with a " not preceded by a \ and runs up to the next " that is also not preceded by a \:
(?<!\\)".*?(?<!\\)"
>>> s = r'"a string" ... "another \"string\"" ... "yet another \"string" ... "failed string\"'
>>> m = re.findall(r'".*?[^\\]"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
UPDATE:
>>> s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"|(?<=\\\\)".*?\\\\"', s)
>>> m
['"a string"', '"another \\"string\\"\\\\"', '"yet another \\"string"']
>>> for i in m:
... print i
...
"a string"
"another \"string\"\\"
"yet another \"string"
A way that emulates an atomic group (useful for reducing backtracking when the pattern is bound to fail):
re.findall(r'"(?=((?:[^"\\]+|\\.)*))\1"', s)
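One thing to watch with this pattern: it contains a capturing group, so findall returns the contents of group 1 (the text between the quotes) rather than the quoted strings themselves. Below is a sketch that uses finditer to get the full matches, run against the updated input from the question. The variable names are illustrative only.

```python
import re

# The updated sample input from the question, as a raw string.
s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '

# The lookahead captures the longest run of ordinary characters and
# backslash-escaped pairs; \1 then consumes exactly that run, so the
# engine cannot backtrack into it.
pattern = re.compile(r'"(?=((?:[^"\\]+|\\.)*))\1"')

# findall gives group 1: the contents without the surrounding quotes.
contents = pattern.findall(s)

# finditer gives access to the full match, quotes included.
quoted = [m.group(0) for m in pattern.finditer(s)]

for q in quoted:
    print(q)
```

The unterminated "failed string\" at the end is not matched, because its only closing quote is escaped.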
You can use this regex:
"[\w\s\\"]+(?<!\\)"
Edit: I noticed you updated your input sample. For the updated input, you can use this regex:
(?:\\\\"|")[\w\s\\"]+(?:\\\\"|(?<!\\)")
("[^...]*?")(?=\s*\.\.\.|$)
You can try this.
See the demo; it gives the required answer.
http://regex101.com/r/bJ6rZ5/4

Get the part of the string matched by regex

In the case of re.search(), is there a way I can get hold of just the part of input string that matches the regex? i.e. I just want the "heeehe" part and not the stuff that comes before it:
>>> s = "i got away with it, heeehe"
>>> import re
>>> match = re.search("he*he", s)
>>> match.string
'i got away with it, heeehe'
>>> match.?
'heeehe'
match.group(0) is the matched string.
Demo:
>>> import re
>>> s = "i got away with it, heeehe"
>>> match = re.search("he*he", s)
>>> match.group(0)
'heeehe'
You can also omit the argument, since 0 is the default.
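If you also need to know where the match sits in the input, the match object exposes that too. A minimal sketch, using the same input as above; the guard against None is there because re.search returns None when nothing matches.

```python
import re

s = "i got away with it, heeehe"
match = re.search(r"he*he", s)

if match is not None:
    matched = match.group()       # same as match.group(0)
    start, end = match.span()     # start and end indices within s
    print(matched)
    print(s[start:end])           # slicing with the span gives the same text
```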
