I was trying to match IPv4 addresses using a regex, and I found the following one.
But I am not able to understand the ?: in it.
## r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
>>> import re
>>> re.findall(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
['254.123.11.13', '254.123.11.14', '254.123.12.13', '254.123.12.14', '254.124.11.13', '254.124.11.14', '254.124.12.13']
I know ?: prevents a group from capturing, but I can't make sense of why it is needed here.
Update:
If I remove the ?:, I get the following result. I thought I would get each IP address along with the captured groups in tuples.
>>> re.findall(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
[('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13'), ('12.', '12', '14'), ('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13')]
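The non-capturing groups are needed here because re.findall returns the captured groups, not the whole match, as soon as the pattern contains any capturing group. On top of that, a capturing group repeated with {3} keeps only its last iteration. So the outer group (q\.){3}, where q is the regex for one number in your quartet, reports only the third number with its dot ('11.'), the inner group reports the same number without the dot ('11'), and the final group reports the fourth number. Marking every group as non-capturing leaves no groups at all, so re.findall falls back to returning the whole match.
See below for a regex without non-capturing groups, the problem it causes, and a solution.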
q = r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
Reproducing the {3} repeat problem without non-capturing:
t = r'(%s\.){3}%s' % (q, q)
>>> re.findall(t,txt)
[('11.', '11', '13'), ('11.', '11', '14')]
Solution if you wanted tuples captured separately:
s = r'{0}\.{0}\.{0}\.{0}'.format(q)
>>> re.findall(s, txt)
[('254', '123', '11', '13'), ('254', '123', '11', '14')]
or
s = r'({0}\.{0}\.{0}\.{0})'.format(q)
>>> re.findall(s,txt)
[('254.123.11.13', '254', '123', '11', '13'), ('254.123.11.14', '254', '123', '11', '14')]
As I said in the comments, if you don't use non-capturing groups, re.findall returns the captured groups instead of the whole match. Since your regex has 3 groups, you get 3 results for each IP.
For a better demonstration, see the following state machines:
Without non-capturing groups:
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
Debuggex Demo
Using non-capturing groups:
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
Debuggex Demo
As you can see, when you use non-capturing groups the regex has no capturing groups at all, and the whole match is treated as a single group, namely group 0.
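A minimal sketch of the difference, assuming a made-up sample text (the original txt is not shown in the question). It runs the same pattern with and without capturing groups, and uses re.finditer to recover the whole match even when groups are present:

```python
import re

# Hypothetical sample text; the question's `txt` is not shown.
txt = "hosts: 254.123.11.13 and 254.123.11.14"

octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
non_capturing = r"(?:" + octet + r"\.){3}" + octet
# Same pattern with every (?: turned into a plain capturing group.
capturing = non_capturing.replace("(?:", "(")

# No capturing groups: findall returns the whole match.
whole = re.findall(non_capturing, txt)

# Three capturing groups: findall returns tuples of groups, and the
# repeated groups hold only their last iteration.
groups = re.findall(capturing, txt)

# group(0) is always the whole match, regardless of capturing groups.
full = [m.group(0) for m in re.finditer(capturing, txt)]

print(whole)
print(groups)
print(full)
```

If both the full address and the groups are needed, re.finditer is usually the simpler route, since each match object exposes group(0) alongside the numbered groups.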
Related
I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternatives. Put the name and number patterns one after the other in a single pattern, and add a third group that captures the rest of the line.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
all = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(all, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
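Note that (.*) stops at the end of each line, so the trailing URL line is not captured. If the third piece should instead run until the next **Name**, one possible variation (a sketch, not part of the answer above) is a lazy group followed by a lookahead, with re.S so that the dot also matches newlines. The name pattern is an assumption made for this illustration:

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# Group 3 is lazy and stops right before a newline followed by "**",
# or at the end of the string for the last entry.
pattern = re.compile(r"\*\*(.+?)\*\*\s*`#(\d+)`:(.*?)(?=\n\*\*|\Z)", re.S)

entries = pattern.findall(note)
for name, number, text in entries:
    print(name, number, repr(text))
```

With this variant the last tuple keeps the URL line as part of its text.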
I have a line that looks like this:
Amount:Category:Date:Description:55544355
My requirement is to find a sequence of two characters, followed later by that same sequence of two characters, followed later by that same sequence of two characters again till all sequences are found. I achieved this as follows:
>>> my_str = 'Amount:Category:Date:Description:55544355'
>>> [item[0] for item in re.findall(r"((..)\2*)", my_str)]
['Am', 'ou', 'nt', ':C', 'at', 'eg', 'or', 'y:', 'Da', 'te', ':D', 'es', 'cr', 'ip', 'ti', 'on', ':5', '55', '44', '35']
This is obviously not the right output since the desired output is:
[[':D',':D'],['55','55'],['at', 'at']]
What am I doing wrong?
Would you please try the following:
my_str = 'Amount:Category:Date:Description:55544355'
print(re.findall(r'(..)(?=.*?\1)', my_str))
Output:
['at', ':D', '55']
If you want to print all occurrences of the characters, another step is required.
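One way that extra step might look, as a sketch: for each repeated pair found by the lookahead pattern, run a second re.findall to list its non-overlapping occurrences. The variable names repeated and occurrences are illustrative only.

```python
import re

my_str = 'Amount:Category:Date:Description:55544355'

# Pairs that occur again later in the string.
repeated = re.findall(r'(..)(?=.*?\1)', my_str)

# For each pair, collect every non-overlapping occurrence.
# re.escape guards against pairs containing regex metacharacters.
occurrences = [re.findall(re.escape(pair), my_str) for pair in repeated]

print(occurrences)
```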
You have to use a lookahead with a backreference. To get both values, you can wrap the backreference also in a capture group which will be returned as a tuple by re.findall.
import re
print(re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355"))
Output
[('at', 'at'), (':D', ':D'), ('55', '55')]
If you want a list of lists:
import re
print([list(elem) for elem in re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355")])
Output
[['at', 'at'], [':D', ':D'], ['55', '55']]
I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried different stuff and this is the last code I tried with unsuccessful results
if I do this
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])
You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(p1, tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list-of-list with three entries - one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
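As an alternative to the explicit loop, the same rearrangement can be sketched with zip(*matches), which transposes the list of tuples so that each capture group becomes its own column:

```python
import re

p1 = re.compile(r'(\d{3})/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = p1.findall(tire)

# zip(*matches) pairs up the first items, then the second items, and so on.
groups = [list(column) for column in zip(*matches)]

print(groups)
```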
I have written a regex expression to parse the system date and time and I can capture all with this script ( I know there are modules to parse date, this is only for regex learning)
import re
s = "Sun Oct 14 13:47:03 CEST 2012"
x = r"([A-Za-z]+\b)\s([A-Za-z]+\b)\s(\d\d)\s(\d\d)([/:])(\d\d)([/:])(\d\d)\s([A-Za-z]+\b)\s(\d\d\d\d)"
toll = re.search(x, s)
for i in range(11):
    print(toll.group(i))
Objective:
To get all the individual elements in groups
Questions:
How can I make my regex expression simpler (if there is any way)?
How can I simply drop the colon from my regex expression (i.e. I don't want : to be captured at all)?
Here's my output:
Sun Oct 14 13:47:03 CEST 2012
Sun
Oct
14
13
:
47
:
03
CEST
2012
Solution: simply don't put parentheses around the groups matching colons, then they won't show up as capture groups:
>>> x = r"([A-Za-z]+\b)\s([A-Za-z]+\b)\s(\d\d)\s(\d\d)[/:](\d\d)[/:](\d\d)\s([A-Za-z]+\b)\s(\d\d\d\d)"
>>> re.search(x,s).groups()
('Sun', 'Oct', '14', '13', '47', '03', 'CEST', '2012')
But if you really want to simplify this big regex, it looks like you can get by with simply regex-splitting on space or colon, and avoid the big regex entirely:
>>> re.split(r'[ :/]', s)
['Sun', 'Oct', '14', '13', '47', '03', 'CEST', '2012']
If you put parentheses around part of a pattern, it becomes a "capturing group".
To prevent this, either leave out the parentheses or create a non-capturing group:
(?:[a-z]*)
However, my solution would be:
([A-Za-z]+)\s([A-Za-z]+)\s(\d\d)\s(\d\d)[/:](\d\d)[/:](\d\d)\s([A-Za-z]+)\s(\d{4})
Note that I removed the word boundaries, as they are irrelevant, due to the condition before them being only the alphabet, followed by a space character.
I also unbracketed the colons, and specified the number of digits on the last statement, with {4}
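A short sketch of both options on the sample string from the question: the simplified pattern with the separators left ungrouped, and a smaller pattern where the separators sit inside (?:...) non-capturing groups. Either way the colons do not appear in groups().

```python
import re

s = "Sun Oct 14 13:47:03 CEST 2012"

# Separators matched but not parenthesised, so they are not captured.
pattern = re.compile(
    r"([A-Za-z]+)\s([A-Za-z]+)\s(\d\d)\s(\d\d)[/:](\d\d)[/:](\d\d)"
    r"\s([A-Za-z]+)\s(\d{4})"
)
fields = pattern.search(s).groups()

# Same idea for the time part only, with separators in non-capturing groups.
time_parts = re.search(r"(\d\d)(?:[/:])(\d\d)(?:[/:])(\d\d)", s).groups()

print(fields)
print(time_parts)
```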
I'm trying to match pairs of digits in a string and capture them in groups, but I seem to be able to capture only the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
* Update
It looks like Python does not let you capture every iteration of a repeated group (in .NET, for example, you could get all the captures in a single pass), so re.findall(r'\d\d', '123456') does the job instead.
You cannot do that with a single match: a repeated capturing group keeps only its last iteration, and a regex pattern alone cannot count the repetitions into separate groups. Trying \d\d at every position would give you overlapping pairs:
Group1: 12
Group2: 23
Group3: 34
...
Python's re module comes with a routine that returns non-overlapping matches, namely re.findall(), which does the trick, as in:
re.findall(r'\d\d', '123456')
will return ['12', '34', '56']
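To keep the pairs grouped per match, as in the desired output from the question, one possible sketch combines re.finditer for the outer matches with re.findall for the pairs inside each one. The variable names are illustrative only.

```python
import re

text = "123456 789101"

# Outer pass: each run of one to three digit pairs is one match.
# Inner pass: split that match into its non-overlapping pairs.
pairs_per_match = [
    re.findall(r"\d\d", m.group(0))
    for m in re.finditer(r"(?:\d\d){1,3}", text)
]

print(pairs_per_match)
```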
(\d{2})+(\d)?
I'm not sure how Python handles its matching, but this is how I would do it.
Try this:
import re
re.findall(r'\d\d','123456')
Is this what you want?
import re
# Raw strings keep the regex escapes intact; print() is the Python 3 form.
regx = re.compile(r'(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  r'(\d\d)(\d\d)?(\d\d)?'
                  r'(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print(repr(s), regx.findall(s), sep='\n', end='\n\n')
result
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]