Explain the behavior of this re - python

I have the following:
>>> re.sub('(..)+?/story','\\g<1>','money/story')
'mey'
>>>
Why is capture group 1 the first letter and last two letters of money and not the first two letters?

The first capture group does not contain m at all. What is being matched by (..)+?/story is oney/story.
The (..)+? matches an even number of characters, so the following is matched (spaced out to make it clearer):
m o n e y / s t o r y
  ^-^ ^-^
Then the replacement is the first capture group. Something you might not know is that when you have a repeated capture group (in this case (..)+?), then only the last captured group is kept.
To summarise, oney/story is matched, and replaced with ey, so the result is mey.
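The behaviour described above can be checked directly. The following is a minimal sketch using only the standard re module; the variable names are illustrative and not part of the original question.

```python
import re

pattern = r'(..)+?/story'
m = re.search(pattern, 'money/story')

# The overall match starts at index 1, so the leading "m" is left untouched.
print(m.group(0))  # oney/story
print(m.span())    # (1, 11)

# A repeated capture group keeps only its last iteration ("ey", not "on").
print(m.group(1))  # ey

# Replacing the whole match with group 1 therefore leaves "m" + "ey".
result = re.sub(pattern, r'\g<1>', 'money/story')
print(result)      # mey
```

If every pair is needed rather than just the last one, one option is to capture the whole repetition with ((?:..)+?) and split the captured text afterwards.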

Because money has 5 letters (an odd number), the match cannot start at the first letter m. (..)+? captures two characters and repeats non-greedily one or more times. Because the + quantifier is applied to the capturing group itself, the group is overwritten on every repetition, so after the match it holds only the last two characters matched by (..)+?. That is why you get ey as the captured string and not the first pair. Replacing all the matched characters (oney/story) with the contents of group 1, ey, gives mey.

Related

Match Two Sets of Different Consecutive Numbers Regex Python

I am classifying a list of vanity phone numbers based on their patterns using regex.
I would like to capture this pattern 5ABXXXYYY
Sample 534666999
I wrote the below regex that captures XXXYYY.
(\d)\1{2}(\d)\2{2}
I want to add a condition to assert that B is not the same digit as X.
The desired output matches the given pattern exactly and replaces it with the word Silver.
S_2 = 534666999
S_2_pattern = re.sub(r"(\d)\2{2}(\d)\3{2}", "Silver", str(S_2))
print(S_2_pattern)
Silver
Thanks
If you want to match 9 digits where the 3rd digit is not the same as the 4th, you can add another capture group for the 3rd digit; the numbers of all groups after it are then incremented by 1.
\b\d\d(\d)(?!\1)(\d)\2\2(\d)\3\3\b
\b A word boundary to prevent a partial word match
\d\d Match 2 digits
(\d)(?!\1) Capture a single digit in group 1, and assert that it is not directly followed by the same digit
(\d)\2\2 Capture a single digit in group 2 and match the same digit 2 more times after it
(\d)\3\3 Capture a single digit in group 3 and match the same digit 2 more times after it
\b A word boundary
If the digit repeated 3 times by group 2 should also be different from the digit repeated 3 times by group 3:
\b\d\d(\d)(?!\1)(\d)(?!\d\d\2)\2\2(\d)\3\3\b
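Both patterns can be tried from Python. This is a small sketch, assuming the numbers arrive as integers or digit strings; the helper name classify and the extra sample numbers are made up for illustration.

```python
import re

# Group 1 = B, group 2 = X, group 3 = Y (pattern taken from the answer above).
b_differs_from_x = re.compile(r"\b\d\d(\d)(?!\1)(\d)\2\2(\d)\3\3\b")
# Stricter variant: X must also differ from Y.
x_differs_from_y = re.compile(r"\b\d\d(\d)(?!\1)(\d)(?!\d\d\2)\2\2(\d)\3\3\b")

def classify(number, pattern):
    """Replace a 9-digit number that fits the pattern with "Silver"."""
    return pattern.sub("Silver", str(number))

print(classify(534666999, b_differs_from_x))  # Silver
print(classify(536666999, b_differs_from_x))  # 536666999 (B equals X)
print(classify(534666666, b_differs_from_x))  # Silver
print(classify(534666666, x_differs_from_y))  # 534666666 (X equals Y)
```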

How to find repeated word in a string with regex

I have a string like codecodecodecodecode...... I need to find a repeated word in that string.
I found a way but the regular expression always returns half of the repeated part I want.
^(.*)\1+$
In group(1) I want to see just "code".
If the quantifier is greedy, (.*) first matches to the end of the line and then backtracks until the captured text can be repeated 1 or more times up to the end of the string. For a string that divides evenly, such as 4 words, it can capture 2 words and match the same 2 words with the backreference \1, which is why you get half of the string.
If you have 5 words like codecodecodecodecode as in your example, group 1 will hold a single word, as the only repetition that reaches the end of the string is that word repeated 5 times.
The quantifier should be non-greedy (and repeat 1+ times so that it does not match an empty string) to match as few characters as possible that can be repeated to the right until the end of the string.
^(.+?)\1+$
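The difference between the greedy and the non-greedy version can be seen by running both on 4 and on 5 repetitions. A minimal sketch; the variable names are illustrative.

```python
import re

greedy = r"^(.*)\1+$"
lazy = r"^(.+?)\1+$"

four = "code" * 4
five = "code" * 5

# Greedy: takes the longest prefix that can still be repeated to the end.
print(re.match(greedy, four).group(1))  # codecode
print(re.match(greedy, five).group(1))  # code

# Non-greedy: takes the shortest prefix that can be repeated to the end.
print(re.match(lazy, four).group(1))    # code
print(re.match(lazy, five).group(1))    # code
```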

Python regex remove dots from dot separated letters

I would like to remove the dots within a word, such that a.b.c.d becomes abcd, but under some conditions:
There should be at least 2 dots within the word. For example, a.b remains a.b, but a.b.c is a match.
Each part should consist of 1 or 2 letters only. For example, a.bb.c is a match (because a, bb and c are 1 or 2 letters each), but aaa.b.cc is not a match (because aaa consists of 3 letters).
Here is what I've tried so far:
import re
texts = [
    'a.b.c',  # Should be: 'abc'
    'ab.c.dd.ee',  # Should be: 'abcddee'
    'a.b'  # Should remain: 'a.b'
]
for text in texts:
    text = re.sub(r'((\.)(?P<word>[a-zA-Z]{1,2})){2,}', r'\g<word>', text)
    print(text)
This selects "any dot followed by 1 or 2 letters", repeated 2 or more times. The selection works fine, but the replacement uses only the last iteration of the group; the earlier repetitions are dropped.
So, it prints:
ac
abee
a.b
Which is not what I want. I would appreciate any help, thanks.
Starting the match with a dot does not ensure that there is a char a-zA-Z before it.
If you use the named group word in the replacement, it will contain only the value of the last iteration, because the group itself is inside a repeated group.
Instead, you can match 1 or 2 chars a-zA-Z followed by 2 or more repetitions of a dot and 1 or 2 chars a-zA-Z, and when there is a match replace the dots in it with an empty string.
To prevent aaa.b.cc from matching, you could make use of word boundaries \b
\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b
The pattern matches:
\b A word boundary to prevent the word being part of a larger word
[a-zA-Z]{1,2} Match 1 or 2 times a char a-zA-Z
(?: Non capture group
\.[a-zA-Z]{1,2} Match a dot and 1 or 2 times a char a-zA-Z
){2,} Close non capture group and repeat 2 or more times to match at least 2 dots
\b A word boundary
import re
pattern = r"\b[a-zA-Z]{1,2}(?:\.[a-zA-Z]{1,2}){2,}\b"
texts = [
    'a.b.c',
    'ab.c.dd.ee',
    'a.b',
    'aaa.b.cc'
]
for s in texts:
    print(re.sub(pattern, lambda x: x.group().replace(".", ""), s))
Output
abc
abcddee
a.b
aaa.b.cc
^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$
You can use this to match the string. If it's a match, you can just remove the dots using any naive method.
See demo.
https://regex101.com/r/BrNBtk/1
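One way to apply this second pattern is sketched below. It assumes each word is tested on its own, since the pattern is anchored with ^ and $, and it only accepts lowercase letters as written. The helper name remove_dots is made up for illustration, and str.replace stands in for the "naive method".

```python
import re

# Anchored pattern from the answer above: whole string, lowercase letters only.
pattern = re.compile(r"^(?=(?:.*?\.){2,}.*$)[a-z]{1,2}(?:\.[a-z]{1,2})+$")

def remove_dots(word):
    """Strip the dots only when the whole word fits the pattern."""
    return word.replace(".", "") if pattern.match(word) else word

for word in ['a.b.c', 'ab.c.dd.ee', 'a.b', 'aaa.b.cc']:
    print(remove_dots(word))
# abc
# abcddee
# a.b
# aaa.b.cc
```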

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where neither of the above patterns (Zero Rated or something with %) is present, and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value (100.00 for the second example). This works for lines where Zero Rated is present.
However, when I also wanted to handle lines where Zero Rated is not present but % is, as in the first example above, I tried to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time the groups it returns are not the ones I expect.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; | has the lowest precedence, so it splits the whole regex into two alternatives: (.*)Zero.*Rated on the left and %\s*(\S*) on the right. You can group parts of the regex without creating additional capture groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
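The effect of the precedence can be seen by comparing the groups returned by the two regexes on the question's sample lines. A minimal sketch; the variable names are illustrative. Note that for the % line the grouped version still stops at the first non-whitespace run after %, so group 2 is '=126.00' rather than the trailing amount.

```python
import re

lines = [
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00",
    "Unpacking/Unremoval fee Zero Rated 100.00",
]

ungrouped = r'(.*)Zero.*Rated|%\s*(\S*)'
grouped = r'(.*)(?:Zero.*Rated|%)\s*(\S*)'

# Without grouping, each alternative can fill only one of the two groups.
ungrouped_results = [re.search(ungrouped, line).groups() for line in lines]
print(ungrouped_results)
# [(None, '=126.00'), ('Unpacking/Unremoval fee ', None)]

# With (?: ), both capture groups take part in every match.
grouped_results = [re.search(grouped, line).groups() for line in lines]
print(grouped_results)
# [('Handling - Uncrating of 3 crates - USD600 each 7', '=126.00'),
#  ('Unpacking/Unremoval fee ', '100.00')]
```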
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few times as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
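The question asked for numeric values, while re.findall returns strings. One possible follow-up, sketched here as an assumption rather than as part of the answer above, is to strip the thousands separators and convert with float. The third input line is made up to show that lines matching neither alternative are skipped.

```python
import re

regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
            "Unpacking/Unremoval fee Zero Rated 100.00\n"
            "Some line without either marker 5.00")  # hypothetical non-matching line

# Lines that match neither alternative produce no findall entry at all.
amounts = {
    label: float(value.replace(",", ""))  # "1,800.00" -> 1800.0
    for label, value in re.findall(regex, test_str, re.MULTILINE)
}
print(amounts)
# {'Handling - Uncrating of 3 crates - USD600 each': 1800.0, 'Unpacking/Unremoval fee': 100.0}
```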

Extract age from a string-python

Consider this string:
s="""A25-54 plus affinities targeting,Demo (AA F21-54),
A25-49 Artist Affinity Targeting,M21-49 plus,plus plus A 21+ targeting"""
I am looking to fix my pattern which currently does not pull all the age groups in the string (A 21+ is missing from the current output).
Current try:
import re
re.findall(r'(?:A|A |AA F|M)(\d+-\d+)',s)
Output:
['25-54', '21-54', '25-49', '21-49'] # does not capture the last group A 21+
Expected Output:
['A25-54','AA F21-54','A25-49','M21-49','A 21+']
As you see, I would like to have the last group too which is A 21+ which is currently missing from my output.
I would also like to get the prefix associated with each age group. At present my output, apart from missing a group, doesn't include the string before the age range. E.g. I want 'A25-54' instead of '25-54'; I guess this is because of ?: .
Appreciate any help I can get.
The missing part of each match is due to the fact that your pattern contains one capturing group, and once there is a capturing group in the regex, re.findall returns only the captured part. The second issue is that, after the first one or more digits are matched, you should match either - followed by 1 or more digits or a literal + symbol.
You may use
(?:A|A |AA F|M)\d+(?:-\d+|\+)
NOTE: You might want to add a word boundary at the start to only match those A, AA F, etc. as whole words: r'\b(?:A|A |AA F|M)\d+(?:-\d+|\+)'.
Details
(?:A|A |AA F|M) - a non-capturing group matching A, A followed by a space, AA F or M
\d+ - 1+ digits
(?:-\d+|\+) - a non-capturing group matching - and 1+ digits after it or a single + symbol.
Python demo:
import re
s="""A25-54 plus affinities targeting,Demo (AA F21-54),
A25-49 Artist Affinity Targeting,M21-49 plus,plus plus A 21+ targeting"""
print(re.findall(r'(?:A|A |AA F|M)\d+(?:-\d+|\+)',s))
# => ['A25-54', 'AA F21-54', 'A25-49', 'M21-49', 'A 21+']
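The word-boundary note above can be checked with a made-up string in which A25-54 also appears at the end of a longer word. This is only a sketch; the sample text is hypothetical.

```python
import re

sample = "DATA25-54 targeting,A25-54 plus"  # made-up string for illustration

# Without \b the trailing "A25-54" inside "DATA25-54" is picked up as well.
without_boundary = re.findall(r'(?:A|A |AA F|M)\d+(?:-\d+|\+)', sample)
# With \b the prefix must start at a word boundary.
with_boundary = re.findall(r'\b(?:A|A |AA F|M)\d+(?:-\d+|\+)', sample)

print(without_boundary)  # ['A25-54', 'A25-54']
print(with_boundary)     # ['A25-54']
```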
