regular expression findall errors [duplicate] - python

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I ran the following script:
import re
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*]')
paaa.findall(a)
and obtained
['[abc] [abc] [y78]']
Why is '[abc]' missing? '[abc]' clearly matches the pattern as well. Is there a bug in Python 3's re.findall function?
Clarification:
Sorry, paaa should be paaa = re.compile(r'\[ab.*\]')
What I am looking for is something that will return
['[abc]', '[abc]', '[abc] [abc]', '[abc] [abc] [y78]']
Basically, every substring that matches the pattern.

The repeated . in \[ab.*] is greedy: it matches as many characters as it can, as long as those characters are followed by a ]. So everything between the first [ and the last ] is matched.
Use lazy repetition instead, with .*?:
import re
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*?]')
print(paaa.findall(a))
['[abc]', '[abc]']

Escape the right square bracket as well (this is optional outside a character class, but it makes the intent clearer), and use the non-greedy quantifier *? in your regex:
import re
a = r'[abc] [abc] [y78]'
paaa = re.compile(r'\[ab.*?\]')
print(paaa.findall(a))
This outputs:
['[abc]', '[abc]']
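Neither answer returns the overlapping substrings listed in the clarification, because re.findall only reports non-overlapping matches found from left to right. Here is a minimal brute-force sketch, assuming the goal really is every substring that the pattern matches in full. The helper name all_matching_substrings is made up for illustration and is not part of the re module.

```python
import re

text = '[abc] [abc] [y78]'
pattern = re.compile(r'\[ab.*\]')


def all_matching_substrings(pattern, text):
    """Return every substring of text that the pattern matches in full."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch(string, pos, endpos) anchors the match to text[start:end]
            if pattern.fullmatch(text, start, end):
                found.append(text[start:end])
    return found


print(all_matching_substrings(pattern, text))
```

This also yields '[abc] [y78]', which the list in the question omits. Checking every (start, end) pair is quadratic in the length of the text, so this only suits short inputs.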

Related

Regular expression search all substrings between two keywords with linebreaks [duplicate]

This question already has answers here:
Regular expression to extract text between square brackets
(15 answers)
Closed 2 years ago.
Let's say I have a Python string:
s = "\\[this\\] is \\[some\n text\\]."
I would like a regular expression that returns the substrings "this" and "some\n text".
I've tried
re.search(r'\\[(.*)\\]',s)
but it does not work (it returns None).

You are missing one backslash in the regex: in r'\\[(.*)\\]' the \\ matches a literal backslash, but the unescaped [ then opens a character class instead of matching a bracket. Also use re.DOTALL so that the dot . matches the newline character.
import re
s = "\\[this\\] is \\[some\n text\\]."
r = re.findall(r'\\\[(.*?)\\\]', s, flags=re.DOTALL)
print(r)  # ['this', 'some\n text']

I will take the string you posted literally (two backslashes before each bracket), but you can easily edit the regex to match another pattern.
I think that this can do the work:
r'\\\\\[(.*?)\\\\\]'
Explained:
\\ matches one literal backslash (the first \ escapes the second). Since you have to find 2 backslashes, you need 4 in total.
For the same reason, you need one more \ to escape the [ character.
( opens your capturing group.
. matches any character (except a newline, unless re.DOTALL is set).
* repeats it as many times as possible, but followed by a ? it means as few times as possible.
) closes your capturing group.
The other 5 \ followed by ] work as explained before (escaping the backslash/bracket sequence).
Hope I helped ;)

You can use a negated character class ([^][]*) with a capture group, and match the \ right before the closing ] outside of the group.
import re
s = "\\[this\\] is \\[some\n text\\]."
print(re.findall(r"\[([^][]*)\\]", s))
Output
['this', 'some\n text']
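One way to check which of the patterns above applies is to count the backslashes the string really contains. The sketch below reuses the string and the patterns from the answers and assumes nothing beyond the standard re module. The variable names are only for illustration.

```python
import re

s = "\\[this\\] is \\[some\n text\\]."

# In a normal (non-raw) literal, "\\[" is ONE backslash followed by "[".
backslashes = s.count("\\")
print(backslashes)

# One escaped backslash plus an escaped bracket, lazy dot with DOTALL.
lazy = re.findall(r'\\\[(.*?)\\\]', s, flags=re.DOTALL)

# Negated character class: no DOTALL needed, since [^][] also matches "\n".
negated = re.findall(r'\[([^][]*)\\]', s)

# A pattern that expects TWO backslashes before each bracket finds nothing here.
doubled = re.findall(r'\\\\\[(.*?)\\\\\]', s, flags=re.DOTALL)

print(lazy)
print(negated)
print(doubled)
```

The two-backslash pattern would only match if the string had been written as a raw literal, or read from a file that contains doubled backslashes.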

Python Regex: alternation gives empty matches [duplicate]

This question already has answers here:
Why do some regex engines match .* twice in a single input string?
(1 answer)
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I was doing some regex which simplifies to this code:
>>> import re
>>> re.sub(r'^.*$|', "xyz", "abc")
'xyzxyz'
I was expecting it to replace abc with xyz: since the RE ^.*$ matches the whole string, the engine should just return that match and stop. So I ran the same regex with re.findall().
>>> re.findall(r'^.*$|', 'abcd')
['abcd', '']
The docs say:
A|B, where A and B can be arbitrary REs. As the target string is scanned, REs separated by '|'
are tried from left to right. When one pattern completely matches,
that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match.
But then why is the regex matching an empty string?
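The quoted passage describes how a single match attempt picks a branch. It does not stop the scan from continuing after that match has been found. Here is a minimal sketch, assuming Python 3.7 or later (where an empty match may directly follow a non-empty one), that prints where each match starts and ends:

```python
import re

pattern = re.compile(r'^.*$|')

# First match: the left branch ^.*$ consumes the whole string.
# The scan then resumes at the end of the string, where ^ can no longer
# match, so the empty right-hand branch matches there instead.
spans = [m.span() for m in pattern.finditer('abcd')]
print(spans)

# Dropping the empty alternative leaves a single match to replace.
replaced = re.sub(r'^.*$', 'xyz', 'abc')
print(replaced)
```

So re.sub performs two replacements: one for the whole string and one for the empty match at its end.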

How to match a full string, instead of partial string? [duplicate]

This question already has answers here:
Order of regular expression operator (..|.. ... ..|..)
(1 answer)
Checking whole string with a regex
(5 answers)
Closed 2 years ago.
pattern = (1|2|3|4|5|6|7|8|9|10|11|12)
str = '11'
This only matches '1', not '11'. How to match the full '11'? I changed it to:
pattern = (?:1|2|3|4|5|6|7|8|9|10|11|12)
The result is the same.
I am testing here first:
https://regex101.com/

It matches 1 instead of 11 because 1 comes before 11 in your alternation. If you use re.findall, it will match 1 twice for the input string 11.
However, to match numbers from 1 to 12 you can avoid the long alternation and use:
\b(?:1[0-2]|[1-9])\b
The group is needed so that both word boundaries apply to every alternative. The word boundaries make it safer by avoiding matches on digits inside a longer word or number.

A regex always tries the left alternative before the right one.
In an alternation, you'd put the longest alternative first.
However, factoring should take precedence. Factor
(1|2|3|4|5|6|7|8|9|10|11|12)
and it turns into
1[012]?|[2-9]
https://regex101.com/r/qmlKr0/1
I purposely didn't add boundary parts, as everybody has their own preference.

Do you mean this solution?
[\d]+
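The effect of alternation order can be checked directly in Python. This is a small sketch comparing the original ordering, a longest-first ordering, and the factored form. The variable names are only for illustration.

```python
import re

naive = re.compile(r'1|2|3|4|5|6|7|8|9|10|11|12')
longest_first = re.compile(r'12|11|10|9|8|7|6|5|4|3|2|1')
factored = re.compile(r'1[012]?|[2-9]')

# Scanning "11": the naive order accepts "1" twice, the other two accept "11".
naive_hits = naive.findall('11')
longest_hits = longest_first.findall('11')
factored_hits = factored.findall('11')
print(naive_hits, longest_hits, factored_hits)

# fullmatch forces the whole string to match, so the engine backtracks
# through the alternatives until it reaches the branch "11".
whole = naive.fullmatch('11')
print(whole.group())
```

When the goal is to validate a whole string rather than to search inside a longer text, re.fullmatch makes the order of the alternatives irrelevant.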

understanding (|) regex in python [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 4 years ago.
I want to match something like 12.12a or 13.12b, but the regex below returns only 'a' and I have no clue why.
import re
pattern = re.compile(r'\d\d?\.\d\d?(a|b)')
txt = "12.12a"
pattern_list = re.findall(pattern, txt)
for item in pattern_list:
    print(item)  # result: a

Put the whole expression in parentheses. When the pattern contains groups, re.findall returns only the text captured by the groups rather than the whole match.
pattern = re.compile(r'(\d\d?\.\d\d?(a|b))')
The result is now ('12.12a', 'a') because of the inner group. To drop the inner group's match, use item[0] or another appropriate operation. Or simply unfold the regex (it might be a little slower):
pattern = re.compile(r'\d\d?\.\d\d?a|\d\d?\.\d\d?b')
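Other ways to get the whole match back are a non-capturing group, a character class, or re.finditer with group(0). This is a short sketch of those options, not taken from the answer above. The sample text with two items is an assumption added for illustration.

```python
import re

txt = "12.12a 13.12b"  # hypothetical sample with two items

# One capturing group: findall returns only what the group captured.
capturing = re.findall(r'\d\d?\.\d\d?(a|b)', txt)

# (?:...) groups without capturing, so findall returns the whole match.
non_capturing = re.findall(r'\d\d?\.\d\d?(?:a|b)', txt)

# For single characters a character class does the same job.
char_class = re.findall(r'\d\d?\.\d\d?[ab]', txt)

# finditer exposes match objects; group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r'\d\d?\.\d\d?(a|b)', txt)]

print(capturing)
print(non_capturing)
print(char_class)
print(whole)
```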

My regular expression is not getting matched exactly in python [duplicate]

This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed 6 years ago.
Here's my code...
import re
l=["chap","chap11","chapa","chapb","chapc","chap3","chap2","chapf","chap4","chap55","chapf","chap33","chap54","chapgk"]
for i in l:
    matchobj=re.match(r'chap[0-9]',i,re.M|re.I)
    if matchobj:
        print(i)
I used chap[0-9], so it should match only those strings that have exactly one digit after chap.
So I should get the following output:
chap3
chap2
chap4
but I am getting the following output...
chap11
chap3
chap2
chap4
chap55
chap33
chap54

re.match matches your pattern only at the beginning of the string; it does not require the match to reach the end. Append e.g. the end-of-string anchor '$' or a word boundary '\b' to your pattern:
matchobj=re.match(r'chap\d$',i,re.M|re.I)
# \d (digit) is shortcut for [0-9]
From the docs on re.match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.

You should add a dollar sign to the end of your regular expression. The dollar ($) means the end of the string and, for future reference, the caret (^) signifies the beginning.
import re
l=["chap","chap11","chapa","chapb","chapc","chap3","chap2","chapf","chap4","chap55","chapf","chap33","chap54","chapgk"]
for i in l:
    matchobj=re.match(r'chap[0-9]$',i,re.M|re.I)
    if matchobj:
        print(i)
Output
chap3
chap2
chap4
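An alternative to appending $ is re.fullmatch (available since Python 3.4), which succeeds only if the pattern matches the entire string. Here is a minimal sketch on a shortened version of the list from the question. The variable names are only for illustration.

```python
import re

names = ["chap", "chap11", "chapa", "chap3", "chap2", "chap4", "chap55", "chapgk"]

# fullmatch requires the pattern to cover the entire string.
single_digit = [n for n in names if re.fullmatch(r'chap[0-9]', n, re.I)]
print(single_digit)

# One difference from "$": "$" also matches just before a trailing newline.
dollar_accepts_newline = bool(re.match(r'chap[0-9]$', 'chap3\n'))
fullmatch_accepts_newline = bool(re.fullmatch(r'chap[0-9]', 'chap3\n'))
print(dollar_accepts_newline, fullmatch_accepts_newline)
```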
