I have the following Python regex:
>>> p = re.compile(r"(\b\w+)\s+\1")
\b : word boundary
\w+ : one or more word characters (letters, digits, or underscore)
\s+ : one or more whitespace characters (space, \t, \n, ...)
\1 : backreference to group 1 ( = the part between (..))
This regex should find all doubled occurrences of a word, provided the two occurrences are next to each other with only whitespace in between.
The regex seems to work fine when using the search function:
>>> p.search("I am in the the car.")
<_sre.SRE_Match object; span=(8, 15), match='the the'>
The found match is the the, just as I had expected. The weird behaviour is in the findall function:
>>> p.findall("I am in the the car.")
['the']
The found match is now only the. Why the difference?
When using groups in a regular expression, findall() returns only the groups; from the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:
>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]
The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:
>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")
You can filter that back to just the outer group result:
>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
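If you would rather not add an outer group at all, one alternative (not part of the answer above, just a sketch) is to iterate over match objects with finditer() and take group 0, which is always the whole match. The variable names here are only illustrative:

```python
import re

# Same pattern as in the question: a word, whitespace, then the same word again.
p = re.compile(r"(\b\w+)\s+\1")
text = "I am in the the car."

# finditer() yields match objects, so group(0) gives the full matched text
# regardless of how many capturing groups the pattern contains.
whole_matches = [m.group(0) for m in p.finditer(text)]
print(whole_matches)  # ['the the']
```

This keeps the backreference as \1 and avoids renumbering groups.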
I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']
When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
This is because capturing groups return only the last match if they're repeated.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if the pattern has no capturing group, the whole match (group 0) is returned automatically, so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'
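To see the difference side by side, here is a small sketch that runs both variants on the line from the question. The variable names capturing and non_capturing are mine, not from the original post:

```python
import re

line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'

# Repeated capturing group: findall returns only the group, and a repeated
# group keeps only its last iteration.
capturing = re.findall(r'(\d{1,3}\.){3}\d{1,3}', line)

# Non-capturing group: no groups in the pattern, so findall returns whole matches.
non_capturing = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', line)

print(capturing)      # ['2.', '34.']
print(non_capturing)  # ['1.2.2.3', '123.345.34.3']
```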
Let's say I have a string :
s = "id_john, num847, id_000, num___"
I know how to retrieve either of 2 patterns with |:
re.findall("id_[a-z]+|num[0-9]+", s)
#### ['id_john', 'num847'] # OK
I know how to capture only a portion of a match with parentheses:
re.findall("id_([a-z]+)", s)
#### ['john']
But I fail when I try to combine those two features. This is my desired output:
#### ['john', '847']
Thanks for your help.. (I work with python)
No need for lookaheads or complex patterns.
Consider this:
>>> re.findall('id_([a-z]+)|num([0-9]+)', s)
[('john', ''), ('', '847')]
When the first pattern matches, the first group will contain the match, and the second group will be empty. When the second pattern matches, the first group is empty, and the second group contains the match.
Since one of the two groups will always be empty, joining them couldn't hurt.
>>> [a+b for a,b in re.findall('id_([a-z]+)|num([0-9]+)', s)]
['john', '847']
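A variation on the same idea, offered as a sketch rather than the answer's own method: with finditer() each match object knows which group actually matched, via its lastindex attribute, so no string concatenation is needed. This assumes exactly one group participates in every match, which holds for this pattern:

```python
import re

s = "id_john, num847, id_000, num___"
pattern = re.compile(r'id_([a-z]+)|num([0-9]+)')

# m.lastindex is the index of the last capturing group that matched.
# Each alternative has exactly one group, so it identifies the right one.
values = [m.group(m.lastindex) for m in pattern.finditer(s)]
print(values)  # ['john', '847']
```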
You may use this code in Python with lookaheads:
>>> s = "id_john, num847, id_000, num___"
>>> print(re.findall(r'(?:id_(?=[a-z]+\b)|num(?=\d+\b))([a-z\d]+)', s))
['john', '847']
RegEx Details:
(?: : Start non-capture group
id_(?=[a-z]+\b) : Match id_ with a lookahead assertion to make sure we have [a-z]+ characters ahead followed by a word boundary
| : OR
num(?=\d+\b) : Match num with a lookahead assertion to make sure we have digits ahead followed by a word boundary
) : End non-capture group
([a-z\d]+) : Match 1+ characters that are lowercase letters or digits
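The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex laid out differently:

```python
import re

s = "id_john, num847, id_000, num___"

pattern = re.compile(r"""
    (?:                   # start non-capture group
        id_(?=[a-z]+\b)   # id_ only if lowercase letters and a word boundary follow
      |                   # OR
        num(?=\d+\b)      # num only if digits and a word boundary follow
    )                     # end non-capture group
    ([a-z\d]+)            # group 1: the lowercase letters or digits after the prefix
""", re.VERBOSE)

print(pattern.findall(s))  # ['john', '847']
```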
I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match an int/float/double value in a string. When I call findall, the output contains empty strings. The empty strings disappear when I remove the | operator and match each alternative individually. I also tried this on regex101, which doesn't show any empty strings. How can I remove these empty strings? Here's my code:
>>> import re
>>> match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>> match_float.findall("CA$1.90")
[('CA$', '1.90', '', '')]
>>> match_float.findall("RM1")
[('', '', 'RM', '1')]
Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove them (say, by using filter(None, ...)).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one or more digits
(?:\.\d+)? - an optional non-capturing group that matches 1 or 0 occurrences of a . followed by 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
    print(re.findall(rx, s))
# => [('CA$', '1.90')] [('RM', '1')]
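For completeness, the filter(None, ...) route mentioned at the start of this answer could look roughly like the sketch below, which keeps the original four-group pattern. The helper name non_empty_groups is hypothetical:

```python
import re

# The original pattern from the question, with four capturing groups.
match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')

def non_empty_groups(text):
    """Return findall tuples with the empty-string groups dropped."""
    return [tuple(filter(None, groups)) for groups in match_float.findall(text)]

print(non_empty_groups("CA$1.90"))  # [('CA$', '1.90')]
print(non_empty_groups("RM1"))      # [('RM', '1')]
```

Note that this also drops a prefix group that is legitimately empty (for input like "1.90"), so the shortened pattern is probably the safer choice.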
How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = r'''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The ? makes the first [a-zA-Z]* lazy, so that for [[go]] it matches the empty (shortest) string and the second [a-zA-Z]* captures the actual word go.
\1 substitutes the first (in this case the only) capture group of the pattern for each non-overlapping match in the string s. The raw string r'\1' is used so that \1 is not interpreted as the character with code 0x1.
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = r'(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and exposes the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which drops the group around the entire match and captures only what you want. You can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\] # literal matches of the brackets
    (?:           )*                # non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                  # matches any word that is followed by a '|' character
                    (   ...   )     # captures the final word into group one
I feel like this is more robust than what you originally had because it also works when there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
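Since the question is about a replacement that varies per match, it may also help to know that re.sub() accepts a function as the replacement; it is called with each match object and its return value is substituted. The sketch below uses a simpler pattern than the one above and splits on | in Python instead. The names bracket and keep_last_option are made up for illustration:

```python
import re

s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'

# Capture everything between the double brackets (letters and '|' only).
bracket = re.compile(r'\[\[([a-zA-Z|]*)\]\]')

def keep_last_option(match):
    # The replacement is computed per match: keep the text after the last '|'.
    return match.group(1).split('|')[-1]

result = bracket.sub(keep_last_option, s)
print(result)  # merited and eaten and go
```

A callable is overkill for this exact case, where \1 is enough, but it is the general tool when the replacement cannot be written as a fixed template.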
I came across a regular expression today but it was very poorly and scarcely explained. What is the purpose of (?:) regex in python and where & when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expressoin = re.findall(r'(?:a-z\+a-z)', word);
From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was not.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
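A quick sketch of what that means in practice with re.findall(). The sample text is made up for illustration:

```python
import re

text = "monty's spam, monty's ham, monty's toast"

# Non-capturing group: the pattern has no groups, so findall returns whole matches.
non_capturing = re.findall(r"monty's (?:spam|ham|eggs)", text)

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r"monty's (spam|ham|eggs)", text)

print(non_capturing)  # ["monty's spam", "monty's ham"]
print(capturing)      # ['spam', 'ham']
```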
As for your sample attempt; using re.findall() you often do want to capture output. You most likely are looking for:
re.findall(r'([a-z]\+[a-z])', word)
where re.findall() returns a list of tuples of all captured groups; if there is only one capturing group, it returns a list of strings containing just that group for each match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall(r'([a-z]\+[a-z])', word)
['x+y']
?: is used to prevent a group from capturing.
For example, with the regex (\d+) the match is stored in group 1.
But if you use (?:\d+), nothing is stored in group 1.
It is used for a non-capturing group:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'