Regex to match a capturing group one or more times - python

I'm trying to match pairs of digits in a string and capture them in groups; however, I seem to be able to capture only the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
Update:
It looks like Python does not keep every capture of a repeated group: only the last iteration is retained. In .NET, by contrast, you can retrieve all the captures in a single pass. So re.findall(r'\d\d', '123456') does the job.

You cannot do that with a single regular expression. It is a special case of counting, which a regex pattern alone cannot do. Matching \d\d at every position would get you:
Group1: 12
Group2: 23
Group3: 34
...
Python's re library comes with a non-overlapping routine, namely re.findall(), that does the trick, as in:
re.findall(r'\d\d', '123456')
which returns ['12', '34', '56']
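To make the difference concrete, here is a minimal sketch (the variable names are illustrative, not from the question). It shows that a repeated group keeps only its last iteration, and one possible way to recover every pair: match each run first, then split it with re.findall.

```python
import re

text = "123456 789101"

# A repeated capturing group keeps only its final iteration.
last_only = [(m.group(0), m.group(1)) for m in re.finditer(r"(\d\d){1,3}", text)]
print(last_only)

# Workaround: find each run of up to three pairs, then split every run into pairs.
runs = re.findall(r"(?:\d\d){1,3}", text)
pairs_per_run = [re.findall(r"\d\d", run) for run in runs]
print(pairs_per_run)
```

The two-step approach costs a second pass over each run, but it keeps both the whole match and its pairs available.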

(\d{2})+(\d)?
I'm not sure how Python handles its matching, but this is how I would do it.

Try this:
import re
re.findall(r'\d\d','123456')

Is this what you want?
import re
# Up to three pairs of digits, delimited by a space, a line break, or the string boundaries
regx = re.compile(r'(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  r'(\d\d)(\d\d)?(\d\d)?'
                  r'(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print(repr(s))
    print(regx.findall(s))
    print()
Result:
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]

Related

Regex "AND" in an expression extract this and that

I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried different things; this is the latest code I tried, without success.
If I do this:
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])
You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
import re
p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = p1.findall(tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list of lists with three entries, one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
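As an alternative sketch of the same rearrangement, zip(*matches) transposes the list of tuples in one step. The name pattern is mine; the regex and input are the ones from the answer above.

```python
import re

tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
pattern = re.compile(r'(\d{3})/(\d{2})R\s?(\d{2})')

matches = pattern.findall(tire)

# zip(*matches) pairs up the first items, the second items, and so on.
groups = [list(column) for column in zip(*matches)]
print(groups)
```

One difference to be aware of: if there are no matches at all, this gives an empty list rather than three empty lists.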

python regex for incomplete decimals numbers

I have a string of numbers which may have incomplete decimal representations,
for example:
a = '1. 1,00,000.00 1 .99 1,000,000.999'
desired output
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two:
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which gives undesirable empty strings as well.
This is for finding currency values in a string, so the comma separators don't follow a set pattern or may not be present at all.
My suggestion is to use the regex below; I've implemented a snippet in Python.
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
# Split on whitespace, also dropping a dangling "." right before it (as in "1. ")
result = re.split(r'\.?\s+', a)
print(result)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split(r'(?<=\d)\.\s+|(?<=\d)\s+', a)
print(d)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
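Because that pattern contains capturing groups, re.findall would return a tuple per match instead of the matched text. A minimal sketch of one way to use it from Python is to iterate with re.finditer and take group(0), the whole match (the names pattern and values are mine):

```python
import re

a = '1. 1,00,000.00 1 .99 1,000,000.999'
pattern = re.compile(r'([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)')

# group(0) is the full match, regardless of which alternative succeeded.
values = [m.group(0) for m in pattern.finditer(a)]
print(values)
```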

regex for IPv4 matching [duplicate]

I was trying to match IPv4 addresses using regex. I got following regex.
But I am not able to understand ?: in it.
## r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
>>> import re
>>> re.findall(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
['254.123.11.13', '254.123.11.14', '254.123.12.13', '254.123.12.14', '254.124.11.13', '254.124.11.14', '254.124.12.13']
I know ?: keeps a group from capturing, but I can't make sense of why it is needed here.
Update:
If I remove ?:, I get the following result. I expected to get the IP address along with the captured groups in tuples.
>>> re.findall(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
[('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13'), ('12.', '12', '14'), ('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13')]
The non-capturing group is needed here because a capturing group under the {3} repeat specifier retains only its last (third) iteration. An outer group wrapped around the whole repetition, ( q{3} ) where q is the regex for one number in your quartet, would capture all three repetitions together. Here, though, we want to hide the inner captures, so the inner groups are marked as non-capturing.
See below for the problem with a regex that has no non-capturing groups, and a solution.
q = r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
Reproducing the {3} repeat problem without non-capturing:
t = r'(%s\.){3}%s' % (q, q)
>>> re.findall(t,txt)
[('11.', '11', '13'), ('11.', '11', '14')]
Solution if you wanted tuples captured separately:
s = r'{0}\.{0}\.{0}\.{0}'.format(q)
>>> re.findall(s, txt)
[('254', '123', '11', '13'), ('254', '123', '11', '14')]
or
s = r'({0}\.{0}\.{0}\.{0})'.format(q)
>>> re.findall(s,txt)
[('254.123.11.13', '254', '123', '11', '13'), ('254.123.11.14', '254', '123', '11', '14')]
As I said in my comment, if you don't use non-capturing groups, findall returns the captured groups instead of the whole match; since your regex then has 3 groups, you get 3 results for each IP.
For a better demonstration, see the following state machines.
Without non-capturing groups:
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
Debuggex Demo
Using non-capturing groups:
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
Debuggex Demo
As you can see, when you use non-capturing groups there are no groups at all, and the whole of your regex is treated as a single group, namely group 0!
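The behaviour described in both answers can be reproduced with a short sketch. The sample txt below is hypothetical (the question does not show its input), and the variable names are mine. It compares findall with and without capturing groups, and shows that re.finditer with group(0) returns the whole address either way.

```python
import re

# Hypothetical sample text; the question's own txt is not shown.
txt = "hosts: 254.123.11.13 and 254.123.11.14"

octet = r"25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?"
non_capturing = r"(?:(?:%s)\.){3}(?:%s)" % (octet, octet)
capturing = r"((%s)\.){3}(%s)" % (octet, octet)

# No capturing groups: findall returns the whole matches.
whole_matches = re.findall(non_capturing, txt)

# Capturing groups: findall returns only the groups, and the
# repeated groups hold just their last (third) iteration.
group_tuples = re.findall(capturing, txt)

# group(0) is always the whole match, whatever groups exist.
via_finditer = [m.group(0) for m in re.finditer(capturing, txt)]

print(whole_matches)
print(group_tuples)
print(via_finditer)
```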

regex expression not recognising the other lines

I have a regex with which I would like to match a couple of things.
Here is a link to the examples and the code I have started; for reasons I cannot determine, my regex is not recognising some lines: http://regex101.com/r/oL4bB5/1
The string examples:
eg1: Tommy Berry
eg2: Ms Winona Costin (a3/47kg)
eg3: Ms Kathy O'Hara
End result using findall in python:
eg1: ['Tommy Berry']
eg2: ['Ms','Winona Costin', '3', '47']
eg3: ['Ms', "Kathy O'Hara"]
As you can see, I want to isolate the Ms at the beginning of the string, the digits within the parenthesis and maintain the full name.
I appreciate the help, thanks!
EDIT
The name may contain numbers and special characters such as '-. etc.:
eg: Samuel L. Jackson-Pitt
I think you want something like this,
^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$
DEMO
>>> import re
>>> s = """Brodie Loy (a3/53kg)
Hugh Bowman
Ms Winona Costin (a3/47kg)
James McDonald
Ms Kathy O'Hara"""
>>> m = re.findall(r"^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$", s, re.M)
>>> m
[('', 'Brodie Loy', '3', '53'), ('', 'Hugh Bowman', '', ''), ('Ms', 'Winona Costin', '3', '47'), ('', 'James McDonald', '', ''), ('Ms', "Kathy O'Hara", '', '')]
>>> [tuple(s for s in tup if s) for tup in m]
[('Brodie Loy', '3', '53'), ('Hugh Bowman',), ('Ms', 'Winona Costin', '3', '47'), ('James McDonald',), ('Ms', "Kathy O'Hara")]
What you are looking for is: (demo)
^(Ms)?([\w '-]+)(?:.*?(\d+)\/(\d+))?
Remember to use re.MULTILINE.
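A minimal sketch of how that second pattern might be used, assuming the three example lines from the question as input. Its name group also picks up the surrounding spaces, so this sketch strips each part and drops the empty ones; the names pattern and rows are mine.

```python
import re

s = "Tommy Berry\nMs Winona Costin (a3/47kg)\nMs Kathy O'Hara"
pattern = re.compile(r"^(Ms)?([\w '-]+)(?:.*?(\d+)/(\d+))?", re.MULTILINE)

rows = []
for groups in pattern.findall(s):
    # Unmatched optional groups come back as '', so filter them out,
    # and strip the spaces captured around the name.
    rows.append([part.strip() for part in groups if part])

print(rows)
```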

python re, multiple matching groups

I have a string:
s = ' <span>Mil<\/span><\/th><td align=\"right\" headers=\"Y0 i7\">112<\/td><td align=\"right\" headers=\"Y1 i7\">113<\/td><td align=\"right\" headers=\"Y2 i7\">110<\/td><td align=\"right\" headers=\"Y3 i7\">107<\/td><td align=\"right\" headers=\"Y4 i7\">105<\/td><td align=\"right\" headers=\"Y5 i7\">95<\/td><td align=\"right\" headers=\"Y6 i7\">95<\/td><td align=\"right\" headers=\"Y7 i7\">87<\/td><td align=\"right\" headers=\"Y8 i7\">77<\/td><td align=\"right\" headers=\"Y9 i7\">74<\/td><td align=\"right\" headers=\"Y10 i7\">74<\/td><\/tr>'
I want to extract these numbers from the string:
112 113 110 107 105 95 95 87 77 74 74
I am no expert on regular expressions, so can anyone tell me, why this isn't returning any matches:
p = re.compile(r' .*(>\d*<\\/td>.*)*<\\/tr>')
m = p.match(s)
I'm sure there is an html/xml parsing module that can solve my problem and I could also just split the string and work on that output, but I really want to do it with the re module. Thanks!
>>> r = re.compile(r'headers="Y\d+ i\d+">(\d+)<\\/td>')
>>> r.findall(s)
['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74']
>>>
All of the numbers you want are in between ">" and "<". So, you can just do this:
re.findall(r">(\d+)<", s)
output:
['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74']
Basically, it says: get every run of digits that sits between ">" and "<". If you then want only the unique values, you can pass the result to set().
The other answers give regexes that will work, but it's worth understanding why your regex doesn't.
All of your matches are both greedy and optional (*). So your regex says:
0 or more characters of anything
0 or more occurrences of your capture group
</tr>
"0 or more characters of anything" eats the rest of the string, leaving nothing for the capture group, and since it's optional, that successfully matches.
If you wanted to redesign your regex to work, you would want to use .*? instead of .* to match the junk at the beginning of the string. The ? makes the match nongreedy, so that it will match as few characters as possible rather than as many as possible.
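The effect is easy to see on a shortened, hypothetical version of the string (two cells instead of eleven, and without the leading space). This is only a sketch: even the non-greedy variant captures one long chunk rather than each number, because the group itself ends in a greedy .* and a repeated group keeps only its last iteration, so findall on a narrower pattern remains the simpler route.

```python
import re

# Shortened, hypothetical sample with the same escaped closing tags.
s = r'<td>112<\/td><td>113<\/td><\/tr>'

greedy = re.match(r'.*(>\d*<\\/td>.*)*<\\/tr>', s)
lazy = re.match(r'.*?(>\d*<\\/td>.*)*<\\/tr>', s)

# Greedy: the leading .* swallows the cells, the group matches zero times.
print(greedy.group(1))

# Non-greedy: the group gets a chance to match, but grabs one big chunk.
print(lazy.group(1))

# Extracting the numbers directly is simpler.
numbers = re.findall(r'>(\d+)<', s)
print(numbers)
```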
Your expression isn't returning any matches because it is written a bit wrong. Instead of:
p = re.compile(r' .*(>\d*<\\/td>.*)*<\\/tr>')
m = p.match(s)
You should probably write this:
>>> p = re.compile(r'headers="Y\d+ i\d+">(\d+)<\\/td>')
>>> p.findall(s)
['112', '113', '110', '107', '105', '95', '95', '87', '77', '74', '74']
