Regex Replace w Match - python

I have names like "Western Michigan" and "Northern Illinois" and I need to change them to "W Michigan" and "N Illinois". The following is the closest I have, but it fails: when the name is, say, "Western Michigan", it throws an error saying \2 is an unmatched group (\3 seems to return the W I want). This is Python.
re.sub("^((S)outhern|(E)astern|(W)estern|(N)orthern)", r"\2", long_name)

You have 5 capturing groups - but that's already been explained. You can get what you want easily if you reduce it to 1 capturing group, but it's a little subtle. First you use a "positive lookahead assertion" to ensure that you're looking at one of the "long words" of interest. An assertion doesn't match anything, though. It just constrains the search. Then you can capture the letter following, and consume the rest. Like so:
import re
pat = r"""(?=Southern|Eastern|Western|Northern) # looking at one of these words
(.) # just capture the first character
(outhern|astern|estern|orthern) # and consume the rest"""
pat = re.compile(pat, re.VERBOSE)
pat.sub(r"\1", long_name)
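As a quick check of the lookahead approach, here is a minimal sketch. The helper name `abbreviate` and the sample names are illustrative assumptions; the `^` anchor (borrowed from the question) and the non-capturing `(?:...)` group are small variations, not part of the answer above.

```python
import re

# Variation on the pattern above: anchored with ^ (as in the question) and
# using a non-capturing group for the part that is only consumed.
PAT = re.compile(
    r"""^(?=Southern|Eastern|Western|Northern)  # must be at one of these words
        (.)                                     # capture only the first letter
        (?:outhern|astern|estern|orthern)       # consume the rest of the word""",
    re.VERBOSE,
)


def abbreviate(long_name):
    """Hypothetical helper: shorten a leading compass word to its initial."""
    return PAT.sub(r"\1", long_name)


print(abbreviate("Western Michigan"))   # W Michigan
print(abbreviate("Northern Illinois"))  # N Illinois
print(abbreviate("Ohio State"))         # Ohio State (no compass word, unchanged)
```

Because only one capturing group exists, `\1` is always set whenever the pattern matches, so the unmatched-group problem cannot occur.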

Instead of passing a replace pattern, you can pass a callback:
re.sub("^(?P<word>Southern|Eastern|Western|Northern)",
       lambda match: match.group('word')[0],
       'Northern Illinois')

The grouping for the regular expression is by the nth open paren:
#        12          3         4         5
re.sub("^((S)outhern|(E)astern|(W)estern|(N)orthern)", r"\2", long_name)
Thus, the 2nd group would be 'S' if it matched, the third group the 'E' if it matched, and so on.
To rectify this, instead match the word and use the first character of the matched word.
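To make the group numbering concrete, the sketch below inspects the groups for one input and then applies the "first character of the matched word" fix. The variable names are illustrative assumptions.

```python
import re

PATTERN = r"^((S)outhern|(E)astern|(W)estern|(N)orthern)"

m = re.match(PATTERN, "Western Michigan")
# Group 1 is the whole word; groups 2-5 are S, E, W, N, and only the
# alternative that actually matched is filled in.
print(m.groups())  # ('Western', None, None, 'W', None)

# Taking the first character of group 1 works whichever alternative matched.
short = re.sub(PATTERN, lambda mo: mo.group(1)[0], "Western Michigan")
print(short)  # W Michigan
```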

Related

python regex return non-capturing group

I want to generate a username from an email with :
firstname's first letter
lastname's first 7 letters
eg :
getUsername("my-firstname.my-lastname@email.com")
mmylastn
Here is getUsername's code :
def getUsername(email):
    return re.match(r"(.){1}[a-z]+.([a-z]{7})", email.replace('-', '')).group()
email.replace('-','') gets rid of the - symbol
the regex captures the 2 groups I described above
If I do .group(1,2) I can see the captured groups are m and mylastn, so it's all good.
But .group() doesn't return just the capturing groups; it also returns everything between them: myfirstname.mylastn
Can someone explain this behavior to me?
First of all, a . in a pattern is a metacharacter that matches any character except line break characters. You need to escape the . in the regex pattern to match a literal dot.
Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex.
Next, if you need mmylastn as the result, you cannot use match.group(), because .group() returns the overall match value, not the concatenated capturing group values.
So, in your case:
- Check whether there is a match first, since calling .groups() on None raises an exception
- Then join the match.groups()
You can use
import re
def getUsername(email):
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    return email
print(getUsername("my-firstname.my-lastname@email.com"))
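The difference between the overall match and the captured groups can be seen side by side in this small sketch. The input string is an illustrative assumption (the address with hyphens already removed and the domain dropped).

```python
import re

m = re.match(r"(.)[a-z]+\.([a-z]{7})", "myfirstname.mylastname")

whole = m.group()         # the entire text the pattern matched
parts = m.groups()        # only the captured groups, as a tuple
username = "".join(parts)

print(whole)     # myfirstname.mylastn
print(parts)     # ('m', 'mylastn')
print(username)  # mmylastn
```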

How to check that selected regex from char set in two sections are equal?

Test String:gcaaaattatacccacatttctttttaaaatttcagcaaaattttaaactatacg
What I want to detect: any combination of two characters including "a" in them and the "a" cannot be the first character.
Example: gcaaaattata cccaca tttc tttttaaaattt cagcaaaattttaaac tata cg
My Regex: [{g,t,c}]{2,}a[{a,g,t,c}]
Problem: Once it matches a character from the first set {g,t,c}, it will match any character from the second set.
My Question: How do I make the second set match only the character already selected from the first set, so that I get output like the example above?
Update
Further explanation:
- The combination consists of two characters only, one of which is "a"
- "a" must be in between and cannot be the start. So "ttttaaa" starts with t, but nothing interrupts the "a"s; if they were followed by the same character "t", it would match the pattern. Matching stops at any character that is not "a" or "t".
So these should match: "tttaaat", "tattttatatat"
These should not match: "taaaaaaa", "attttta"
I'm writing in python if that could help.
You could try the following:
import re
s = 'gcaaaattatacccacatttctttttaaaatttcagcaaaattttaaactatacg'
for match in re.finditer(r'(g|c|t)\1*a+(\1)(\1|a)*', s):
    print(match.group())
Output:
ttata
cccaca
tttttaaaattt
tata
(g|c|t) matches one of the characters g, c, or t and captures it. \1*a+\1 repeats that first character 0 or more times, followed by at least one a, followed by the first character again. (\1|a)* at the end then allows any combination of a and the first character.
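The sketch below checks this idea against the strings from the question's update. It uses an equivalent spelling of the pattern (a character class and a non-capturing tail); the names `RUN` and `is_valid_run` are illustrative assumptions.

```python
import re

# Same idea as the pattern above: first letter, optional repeats of it,
# one or more a, the first letter again, then any mix of that letter and a.
RUN = re.compile(r"([gct])\1*a+\1(?:\1|a)*")


def is_valid_run(text):
    """Hypothetical helper: True if the whole string is one valid run."""
    return RUN.fullmatch(text) is not None


for sample in ["tttaaat", "tattttatatat", "taaaaaaa", "attttta"]:
    print(sample, is_valid_run(sample))

s = "gcaaaattatacccacatttctttttaaaatttcagcaaaattttaaactatacg"
found = [m.group() for m in RUN.finditer(s)]
print(found)  # ['ttata', 'cccaca', 'tttttaaaattt', 'tata']
```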
You can use ([gtc])\1*(a+)(\1+\2*)+, which looks for at least one g, t, or c followed by one or more a and then any combination of those two characters:
import re
word='gcaaaattatacccacatttctttttaaaatttcagcaaaattttaaactatacg'
matches = re.finditer(r'([gtc])\1*(a+)(\1+\2*)+', word)
for matchNum, match in enumerate(matches):
    print(match.group())
One way to accomplish your goal would be to capture the first character and backreference it in the third portion of your expression, like so:
(?P<first>[gtc])(?P=first)?a(?:a|(?P=first))*

unexpected re.sub behavior

I defined
s='f(x) has an occ of x but no y'
def italicize_math(line):
    p = r"(\W|^)(x|y|z|f|g|h)(\W|$)"
    repl = r"\1<i>\2</i>\3"
    return re.sub(p, repl, line)
and made the following call:
print(italicize_math(s))
The result is
'<i>f</i>(x) has an occ of <i>x</i> but no <i>y</i>'
which is not what I expected. I wanted this instead:
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Can anyone tell me why the first occurrence of x was not enclosed in the "i" tags?
You seem to be trying to match non-alphanumeric characters (\W) when you really want a word boundary (\b):
>>> p=r"(\b)(x|y|z|f|g|h)(\b)"
>>> re.sub(p,repl,s)
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Of course, ( is non-alphanumeric. The reason your inner content doesn't match is that \W consumes a character in the match. So with a string like 'f(x)', you match the ( when you match f. Since ( was already consumed, it cannot be matched again when you try to match x. By contrast, word boundaries don't consume any characters.
Because the first match is "f(" (the start-of-string position, then f, then the ( taken by the third group), a match for x would have to begin with that same ( and would overlap the previous match. Also, the first and third groups are redundant, since they can be replaced by word boundaries, and you can use a character class to combine the letters.
p = r'\b([fghxyz])\b'
repl = r'<i>\1</i>'
As the previous answers mention, it's because the ( character is consumed when matching f, which causes the subsequent x to fail to match.
Besides replacing with the word boundary \b, you could also use a lookahead, which only peeks ahead and doesn't consume anything it matches. Since it doesn't consume anything, you don't need the \3 either:
p=r"(\W|^)(x|y|z|f|g|h)(?=\W|$)"
repl=r"\1<i>\2</i>"
re.sub(p,repl,line)
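To compare the approaches from this thread on the same input, here is a small sketch. The variable names `consuming`, `boundary`, and `lookahead` are illustrative assumptions; the patterns are the ones quoted above.

```python
import re

s = "f(x) has an occ of x but no y"

# Original pattern: the trailing (\W|$) consumes "(", so x cannot match.
consuming = re.sub(r"(\W|^)(x|y|z|f|g|h)(\W|$)", r"\1<i>\2</i>\3", s)

# Word boundaries are zero-width, so nothing is consumed around the letter.
boundary = re.sub(r"\b([fghxyz])\b", r"<i>\1</i>", s)

# A lookahead on the right side only peeks at the next character.
lookahead = re.sub(r"(\W|^)(x|y|z|f|g|h)(?=\W|$)", r"\1<i>\2</i>", s)

print(consuming)  # <i>f</i>(x) has an occ of <i>x</i> but no <i>y</i>
print(boundary)   # <i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>
print(lookahead)  # <i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>
```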

Regular Expression: How to match using previous matches?

I am searching for string patterns of the form:
XXXAXXX
# exactly 3 Xs, followed by a non-X, followed by 3Xs
All of the Xs must be the same character and the A must not be an X.
Note: I am not searching explicitly for Xs and As - I just need to find this pattern of characters in general.
Is it possible to build this using a regular expression? I will be implementing the search in Python if that matters.
Thanks in advance!
-CS
Update:
@rohit-jain's answer in Python
x = re.search(r"(\w)\1{2}(?:(?!\1)\w)\1{3}", data_str)
@jerry's answer in Python
x = re.search(r"(.)\1{2}(?!\1).\1{3}", data_str)
You can try this:
(\w)\1{2}(?!\1)\w\1{3}
Breakdown:
(\w) # Match a word character and capture in group 1
\1{2} # Match group 1 twice, to make the same character thrice - `XXX`
(?!\1) # Make sure the character in group 1 is not ahead. (X is not ahead)
\w # Then match a word character. This is `A`
\1{3} # Match the group 1 thrice - XXX
You can perhaps use this regex:
(.)\1{2}(?!\1).\1{3}
The first dot matches any character and captures it. The backreference \1{2} then repeats it twice, the negative lookahead makes sure the captured character does not come next, another dot accepts any single character, and \1{3} requires the captured character three more times.
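A short sketch of the dot-based pattern in use follows. The helper name `find_xxxaxxx` and the sample strings are illustrative assumptions.

```python
import re

PAT = re.compile(r"(.)\1{2}(?!\1).\1{3}")


def find_xxxaxxx(data_str):
    """Hypothetical helper: return the first XXXAXXX-shaped substring, or None."""
    m = PAT.search(data_str)
    return m.group() if m else None


print(find_xxxaxxx("zzaaabaaazz"))  # aaabaaa
print(find_xxxaxxx("111x111"))      # 111x111
print(find_xxxaxxx("aaaaaaa"))      # None (the middle character equals X)
```

Note that the pattern does not look at the characters on either side of the match, so "aaaabaaa" still yields "aaabaaa". If "exactly 3" must also exclude a longer run of X, extra lookaround checks would be needed.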

Python regex matching only if digit

Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using groups but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this using grouping?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][0-9]*$)?', word)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires the number to run to the end of the string, but the group is optional so the first group can still be captured.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
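To see how this pattern behaves on the examples from the question, here is a minimal sketch. The helper name `split_code` is an illustrative assumption, and `re.match` is used so that only the start of each word is examined.

```python
import re

PAT = re.compile(r"\b([A-Z0-9]+)(?:[ _-](\d+))?\b")


def split_code(word):
    """Hypothetical helper: return (prefix, number-or-None) for one word."""
    m = PAT.match(word)
    return m.groups() if m else None


for word in ["BR0227-3", "BR0227 3", "BR0227_3",
             "BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]:
    print(word, split_code(word))
```

The first three inputs give ('BR0227', '3') and the rest give ('BR0227', None). One caveat: an underscore counts as a word character for \b, so an input such as "BR0227_3G1" does not match at all with this pattern.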
This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)
