Repeated pattern in Python regex

New to python regex and would like to write something that matches this
<name>.name.<age>.age#<place>
I can do this, but I would like the pattern to contain and check the literal words name and age.
pat = re.compile("""
^(?P<name>.*)
\.
(?P<name>.*)
\.
(?P<age>.*)
\.
(?P<age>.*?)
\#
(?P<place>.*?)
$""", re.X)
I then match and extract the values.
res = pat.match('alan.name.65.age#jamaica')
Would like to know the best practice to do this?

Match .name and .age literally. You don't need new groups for that.
pat = re.compile(r"""
^(?P<name>[^.]*)\.name
\.
(?P<age>[^.]*)\.age
\#
(?P<place>.*)
$""", re.X)
Notes
I've replaced .* ("anything") by [^.]* ("anything except a dot"), because the dot cannot really be part of the name in the pattern you show.
Think whether you mean * (0-unlimited occurrences) or rather + (1-unlimited occurrences).
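To illustrate the second note, here is a small sketch comparing the two quantifiers. The names `loose` and `strict` are made up for this example, and the patterns are written on one line (without re.X) purely for brevity:

```python
import re

# Same structure as the pattern above, once with * and once with +.
loose = re.compile(r"^(?P<name>[^.]*)\.name\.(?P<age>[^.]*)\.age\#(?P<place>.*)$")
strict = re.compile(r"^(?P<name>[^.]+)\.name\.(?P<age>[^.]+)\.age\#(?P<place>.+)$")

sample = ".name..age#"  # every field is empty

loose_match = loose.match(sample)    # * accepts empty fields
strict_match = strict.match(sample)  # + requires at least one character

print(loose_match is not None)  # True
print(strict_match)             # None

# A well-formed input is accepted by the strict version as well.
fields = strict.match("alan.name.65.age#jamaica").groupdict()
print(fields)  # {'name': 'alan', 'age': '65', 'place': 'jamaica'}
```

Whether empty fields should be accepted depends on your data, so treat `strict` as one possible choice rather than the required one.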

No reason not to allow . in names, e.g. John Q. Public.
import re
pat = re.compile(r"""(?P<name>.*?)\.name
\.(?P<age>\d+)\.age
\#(?P<place>.*$)  # escaped: a bare '#' starts a comment under re.X""",
flags=re.X)
m = pat.match('alan.name.65.age#jamaica')
print(m.group('name'))
print(m.group('age'))
print(m.group('place'))
Prints:
alan
65
jamaica

You don't need the groups if you use re.split:
re.split(r'\.name\.|\.age', "alan.name.65.age#jamaica")
This returns the name and age as the first two elements of the list; the third element is the remainder, '#jamaica'.
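As a quick check of what the split actually produces, here is a minimal sketch. Stripping the leading # to obtain the place is an extra step assumed here; it is not part of the answer above:

```python
import re

parts = re.split(r"\.name\.|\.age", "alan.name.65.age#jamaica")
print(parts)  # ['alan', '65', '#jamaica']

# The last element still carries the '#' separator, so strip it
# if the place is needed too (an assumption for this example).
name, age, rest = parts
place = rest.lstrip("#")
print(name, age, place)  # alan 65 jamaica
```

Unlike the grouped patterns, this approach does not validate the input: a string without the expected markers is simply returned unsplit.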

Replace a character enclosed with lowercase letters

All the examples I've found on stack overflow are too complicated for me to reverse engineer.
Consider this toy example
s = "asdfasd a_b dsfd"
I want s = "asdfasd a'b dsfd"
That is: find two characters separated by an underscore and replace that underscore with an apostrophe
Attempt:
re.sub("[a-z](_)[a-z]","'",s)
# "asdfasd ' dsfd"
I thought the () were supposed to solve this problem?
Even more confusing is the fact that it seems that we successfully found the character we want to replace:
re.findall("[a-z](_)[a-z]",s)
#['_']
why doesn't this get replaced?
Thanks
Use look-ahead and look-behind patterns:
re.sub("(?<=[a-z])_(?=[a-z])","'",s)
Lookahead/lookbehind patterns have zero width, so the letters they check are not part of the match and are not replaced.
UPD:
The problem was that re.sub replaces the whole matched expression, including the preceding and the following letter.
re.findall was still matching the whole expression, but when a pattern contains a group (the parentheses inside), findall returns the group rather than the whole match, which is what you observed. The whole match was still a_b.
Lookahead/lookbehind expressions check that the match is preceded/followed by a pattern, but do not include it in the match.
Another option is to create several groups and put those groups into the replacement: re.sub("([a-z])_([a-z])", r"\1'\2", s)
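The difference between the whole match and the group can be seen directly. The following is a small sketch; the variable names are chosen for illustration only:

```python
import re

s = "asdfasd a_b dsfd"

# The group holds only '_', but the whole match (group 0) is 'a_b',
# and the whole match is what re.sub replaces.
m = re.search(r"[a-z](_)[a-z]", s)
whole, inner = m.group(0), m.group(1)
print(whole, inner)  # a_b _

# Fix 1: zero-width lookarounds keep the letters out of the match.
with_lookaround = re.sub(r"(?<=[a-z])_(?=[a-z])", "'", s)

# Fix 2: capture the letters and put them back in the replacement.
with_groups = re.sub(r"([a-z])_([a-z])", r"\1'\2", s)

print(with_lookaround)  # asdfasd a'b dsfd
print(with_groups)      # asdfasd a'b dsfd

# The two fixes differ on chained underscores: the grouped version
# consumes 'b', so the second underscore has no letter left before it.
chained_lookaround = re.sub(r"(?<=[a-z])_(?=[a-z])", "'", "a_b_c")
chained_groups = re.sub(r"([a-z])_([a-z])", r"\1'\2", "a_b_c")
print(chained_lookaround)  # a'b'c
print(chained_groups)      # a'b_c
```

If inputs like a_b_c can occur, the lookaround version is probably the safer choice.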
When using re.sub, the text to keep must be captured, the text to remove should not.
Use
re.sub(r"([a-z])_(?=[a-z])",r"\1'",s)
EXPLANATION
NODE        EXPLANATION
(           group and capture to \1:
  [a-z]       any character of: 'a' to 'z'
)           end of \1
_           '_'
(?=         look ahead to see if there is:
  [a-z]       any character of: 'a' to 'z'
)           end of look-ahead
Python code:
import re
s = "asdfasd a_b dsfd"
print(re.sub(r"([a-z])_(?=[a-z])",r"\1'",s))
Output:
asdfasd a'b dsfd
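re.sub replaces everything it matched.
There is a more general way to solve your problem that does not require modifying your regular expression: pass a function as the replacement and substitute only the captured texts inside the match.
Code below:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
    # Alternation of the captured texts, escaped so they match literally
    _reg = "|".join(re.escape(g) for g in match.groups())
    # Replace the captured texts inside the whole match; with no groups, replace the whole match
    return re.sub(_reg, r, match.group(0)) if _reg else r

re.sub(reg, repl, s)
Output: 'Data: year=am_replace_str, monthday=am_replace_str, month=am_replace_str, some other text'
Note that this replaces every occurrence of a captured text inside the match, not only the captured positions.
If your pattern always contains at least one group, the fallback is unnecessary and the function can be simplified:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
    _reg = "|".join(re.escape(g) for g in match.groups())
    return re.sub(_reg, r, match.group(0))

re.sub(reg, repl, s)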

Remove title using regular expression

How can I remove the 2 or 3 characters at the beginning of a string that are followed by a dot, which may or may not be followed by a space?
i = 'mr.john'
i.replace("mr.","")
The above returns the name 'john' correctly, but not in all cases. For example:
i = 'smr. john'
i.replace("mr.","")
's john'
Expected result was 'john'
If you needed a more generic approach (i possibly having more names), you may use this code. You can define your own prefixes to remove:
import re
prefixes = ['mr', 'smr']
regex = r'\b(?:' + '|'.join(prefixes) + r')\.\s*'
i = 'hi mr.john, smr. john, etc. Previous etc should not be removed'
i = re.sub(regex,'',i)
print(i)
The created regex is this:
\b # Word boundary (to match 'mr' but not 'zmr' unless specified)
(?:group|of|prefixes|that|we|want|to|remove) # example
\. # Literal '.'
\s* # 0 or more spaces
You want two or three characters at the start of the string followed by a dot and then maybe a space. As a regular expression this looks like ^\w{2,3}\. ?.
Now you can use re.sub to replace this part with an empty string.
cleaned_name = re.sub(r'(^\w{2,3}\. ?)', r'', name)
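Here is a short sketch applying that pattern to the inputs from the question. The helper name `strip_title` is hypothetical and only used for this example:

```python
import re

# Two or three word characters at the start, a literal dot, an optional space.
TITLE = re.compile(r"^\w{2,3}\. ?")

def strip_title(name):
    """Remove a leading 2-3 character title such as 'mr.' or 'smr. '."""
    return TITLE.sub("", name)

results = [strip_title(n) for n in ("mr.john", "smr. john", "john", "prof. john")]
print(results)  # ['john', 'john', 'john', 'prof. john']
```

The pattern is purely length-based: a four-letter title such as prof. is left alone, while any short first token ending in a dot (for example an abbreviated first name) would be removed as well.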
Use str.find with slicing.
Ex:
i = 'smr. john'
print(i[i.find(".")+1:].strip())
i2 = 'mr.john'
print(i2[i2.find(".")+1:].strip())
Output:
john
john

Conditional Regex: if A and B, choose B

I need to extract IDs from a string of the following format: Name ID, where the two are separated by white space.
Example:
'Riverside 456'
Sometimes, the ID is followed by the letter A or B (separated by white space):
'Riverside 456 A'
In this case I want to extract '456 A' instead of just '456':
I tried to accomplish this with the following regex:
(\d{1,3}) | (\d{1,3}\s[AB])
The conditional operator | does not quite work in this setting as I only get numerical IDs. Any suggestions how to properly set up regex in this setting?
Any help would be appreciated.
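Try reversing the order of the alternatives so that the more specific one comes first. Also drop the spaces around |, since they are matched literally unless the pattern is compiled with re.X. I.e.:
(\d{1,3}\s[AB])|(\d{1,3})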
If you have an optional part that you might want to include, but not necessarily need, you could just use an "at most one time" quantifier:
Riverside (\d{1,3}(?: [AB])?)
The ?: marks the group as non-capturing, so it won't be returned on its own. The trailing ? tells the regex to match the group once or not at all.
Your (\d{1,3})|(\d{1,3}\s[AB]) will always match the first branch. In an NFA regex engine, when the alternation is not anchored on either side, the first branch that matches "wins", and the branches to its right are not tried.
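The effect of branch order can be checked with a minimal sketch (variable names are illustrative):

```python
import re

s = "Riverside 456 A"

# Shorter branch first: it succeeds at once, the longer branch is never tried.
short_first = re.search(r"\d{1,3}|\d{1,3}\s[AB]", s).group()

# Longer branch first: it is tried first and falls back to the short one.
long_first = re.search(r"\d{1,3}\s[AB]|\d{1,3}", s).group()
long_first_plain = re.search(r"\d{1,3}\s[AB]|\d{1,3}", "Riverside 456").group()

print(short_first)       # 456
print(long_first)        # 456 A
print(long_first_plain)  # 456
```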
You can use an optional group:
\d{1,3}(?:\s[AB])?
Add a $ at the end if the value you need is always at the end of the string.
If there can be more than one whitespace character, add + after \s, or * if there can be zero or more.
Note that the last ? quantifier is greedy, so if there is a whitespace and A or B, they will be part of the match.
See the Python demo:
import re
rx = r'\d{1,3}(?:\s[AB])?'
s = ['Riverside 456 A', 'Riverside 456']
print([re.search(rx, x).group() for x in s])
import re
pattern = re.compile(r'(\d{1,3}\s?[AB]?)$')
print(pattern.search('Riverside 456').group(0)) # => '456'
print(pattern.search('Riverside 456 A').group(0)) # => '456 A'
You could use alternation
p = re.compile(r'''(\d{1,3}\s[AB]|\d{1,3})$''')
NB: the $ (or maybe \s) at the end, outside the group, is important; otherwise it will capture both 123 C and 1234 as 123 rather than fail to match.

how to replace via regex with group enclosed in python

Here's the scenario:
import re
if __name__ == '__main__':
    s = "s = \"456\";"
    ss = re.sub(r'(.*s\s+=\s+").*?(".*)', r"\1123\2", s)
    print(ss)
What I intend to do is replace '456' with 123, but the result is 'J3";'. When I print '\112', it turns out to be the character 'J'. Is there any way to specify that \1 is the regex group rather than an escape sequence in Python? Thanks in advance.
Just change \1 to \g<1>
>>> re.sub(r'(.*s\s+=\s+").*?(".*)', r"\g<1>123\2", s)
's = "123";'
If no digit follows the backreference, you may use \1 or \2 directly. But if a digit comes right after it, as in \1123, the sequence is ambiguous: here \112 is read as an octal escape (the character 'J') instead of group 1 followed by 123. To distinguish the backreference from the digits that follow it, use \g<num>, where num is the index of the capturing group.
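Both behaviours can be reproduced side by side. This is a small sketch using the string and pattern from the question; the variable names are illustrative:

```python
import re

s = 's = "456";'
pattern = r'(.*s\s+=\s+").*?(".*)'

# \112 is parsed as an octal escape ('J'), then a literal '3', then group 2.
ambiguous = re.sub(pattern, r"\1123\2", s)

# \g<1> ends the group reference explicitly, so 123 stays literal text.
explicit = re.sub(pattern, r"\g<1>123\2", s)

print(ambiguous)  # J3";
print(explicit)   # s = "123";
```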

Regex help to match groups

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480 " and "Veth1379" and put them in group(1) & group(2) for using later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point to any obvious error I am doing here.
Thanks,
~Newbie
Try this:
^\* [0-9]{3} +([0-9]{4}\.[0-9a-z]{4}\.[0-9a-z]{4}).*(Veth[0-9]{4})$
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.
I don't think you need a regex for this:
with open('myfile') as f:
    for line in f:
        fields = line.split()
        # fields[0] is '*', fields[1] the number; the address is fields[2], the Veth name fields[7]
        print("\n" + fields[2] + "\n" + fields[7])
A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
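As a rough check, the strict pattern above can be applied to the two sample lines from the question. The use of re.MULTILINE with findall is an assumption made here so that both lines can be processed from one string:

```python
import re

# The strict pattern from this answer, compiled with MULTILINE so that
# ^ and $ apply to every line of the input.
row = re.compile(
    r"^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$",
    re.MULTILINE,
)

text = (
    "* 964 0050.56aa.3480 dynamic 200 F F Veth1379\n"
    "* 930 0025.b52a.dd7e static 0 F F Veth1469"
)

# findall returns one (group 1, group 2) tuple per matching line.
pairs = row.findall(text)
print(pairs)
# [('0050.56aa.3480', 'Veth1379'), ('0025.b52a.dd7e', 'Veth1469')]
```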
This will do it:
import re
reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)  # subject holds the text to search
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    # No match: leave both results empty
    group1 = group2 = ""
