This question already has an answer here:
Python regex alternation
(1 answer)
Closed 8 years ago.
I am trying to find the characters immediately before and after a regex match in a given string. This is the code.
>>>import re
>>>s='dafddadffdbdasbffsbbfdbabbfsdfadsfdfddf' #completely garbage test string
>>>re.findall('.{0,5}(abb).{0,5}',s)
['abb']
The test string has an occurence of 'abb' here ...fdbabbfsd... I am under the impression that the special character . matches any character other than \n and the {m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible as stated here
So I expect my re to return ['bbfdbabbfsdfa'] and not just ['abb']. What am I missing?
It's because of the capturing group. Just move the parentheses:
re.findall('(.{0,5}abb.{0,5})',s)
findall only matches groups, so everything you want to match needs to be in the parentheses.
According to re.findall documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.
So by surrounding whole pattern as a group or removing group will give you what you want.
>>> re.findall('(.{0,5}abb.{0,5})',s) # Entire pattern as a group
['bbfdbabbfsdfa']
>>> re.findall('.{0,5}abb.{0,5}',s) # No capturing group
['bbfdbabbfsdfa']
Related
I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']
When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
This is because capturing groups return only the last match if they're repeated.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if there is no capturing group, the whole match is automatically selected (like \0), so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'
This question already has answers here:
Regular expression to return text between parenthesis
(11 answers)
Closed 2 years ago.
I have long string S, and I want to find value (numeric) in the following format "Value(**)", where ** is values I want to extract.
For example, S is "abcdef Value(34) Value(56) Value(13)", then I want to extract values 34, 56, 13 from S.
I tried to use regex as follows.
import re
regex = re.compile('\Value(.*'))
re.findall(regex, S)
But the code yields the result I did not expect.
Edit. I edited some mistakes.
You should escape the parentheses, correct the typo of Value (as opposed to Values), use a lazy repeater *? instead of *, add the missing right parenthesis, and capture what's enclosed in the escaped parentheses with a pair of parentheses:
regex = re.compile(r'Value\((.*?)\)')
Only one of your numbers follows the word 'Value', so you can extract anything inside parentheses. You also need to escape the parentheses which are special characters.
regex = re.compile('\(.*?\)')
re.findall(regex, S)
Output:
['(34)', '(56)', '(13)']
I think what you're looking for is a capturing group that can return multiple matches. This string is: (\(\d{2}\))?. \d matches an digit and {2} matches exactly 2 digits. {1,2} will match 1 or 2 digits ect. ? matches 0 to unlimited number of times. Each group can be indexed and combined into a list. This is a simple implementation and will return the numbers within parentheses.
eg. 'asdasd Value(102222), fgdf(20), he(77)' will match 20 and 77 but not 102222.
This question already has answers here:
Regular expressions: Ensuring b doesn't come between a and c
(4 answers)
Closed 3 years ago.
I have this string;
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
I would like to match and capture all substrings that appear in between START and FINISH but only if the word "poison" does NOT appear in that substring. How do I exclude this word and capture only the desired substrings?
re.findall(r'START(.*?)FINISH', string)
Desired captured groups:
candy
sugar
Using a tempered dot, we can try:
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
matches = re.findall(r'START((?:(?!poison).)*?)FINISH', string)
print(matches)
This prints:
['candy', 'sugar']
For an explanation of how the regex pattern works, we can have a closer look at:
(?:(?!poison).)*?
This uses a tempered dot trick. It will match, one character at a time, so long as what follows is not poison.
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 5 years ago.
>>> reg = re.compile(r'^\d{1,3}(,\d{3})*$')
>>> str = '42'
>>> reg.search(str).group()
'42'
>>> reg.findall(str)
['']
>>>
python regex
Why does reg.findall find nothing, but reg.search works in this piece of code above?
When you have capture groups (wrapped with parenthesis) in the regex, findall will return the match of the captured group; And in your case the captured group matches an empty string; You can make it non capture with ?: if you want to return the whole match; re.search ignores capture groups on the other hand. These are reflected in the documentation:
re.findall:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.
re.search:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
import re
reg = re.compile(r'^\d{1,3}(?:,\d{3})*$')
s = '42'
reg.search(s).group()
# '42'
reg.findall(s)
# ['42']
This question already has answers here:
How to use regex for words
(4 answers)
Closed 9 years ago.
I am trying to match text containing a word (let's say 'word'). I am using the following regex:
r = re.compile(r'\bword\b')
When I try this regex I get the following results:
r.match('a word a') > None
r.match(' word ') > None
r.match('word') > match
Shouldn't all three strings match?
From the docs:
re.search(pattern, string, flags=0) Scan through string looking for a
location where the regular expression pattern produces a match, and
return a corresponding MatchObject instance. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
re.match(pattern, string, flags=0) If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding MatchObject instance. Return None if the string does not
match the pattern; note that this is different from a zero-length
match.
Note that even in MULTILINE mode, re.match() will only match at the
beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead
So, just do r.search(...) and you should get what you want.