regex for IPv4 matching [duplicate] - python

I was trying to match IPv4 addresses using a regex, and I found the following one.
But I don't understand the ?: in it.
## r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
>>> import re
>>> re.findall(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
['254.123.11.13', '254.123.11.14', '254.123.12.13', '254.123.12.14', '254.124.11.13', '254.124.11.14', '254.124.12.13']
I know ?: prevents a group from capturing, but I can't make sense of it here.
Update:
If I remove ?:, I get the following result. I expected to get the IP address along with the captured groups in tuples.
>>> re.findall(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', txt)
[('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13'), ('12.', '12', '14'), ('11.', '11', '13'), ('11.', '11', '14'), ('12.', '12', '13')]

The non-capturing groups are needed here because re.findall returns the captured groups, not the whole match, whenever the pattern contains capturing groups. In addition, a capturing group repeated with {3} keeps only its last iteration (here, the third octet together with its dot), so a pattern of the form (q\.){3}q, where q is the regex for one octet of the address, can never return all four octets. Making every group non-capturing hides these partial captures, so re.findall returns the full address instead.
See below for the pattern without non-capturing groups, the problem it causes, and a solution.
q = r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
Reproducing the {3} repeat problem without non-capturing:
t = r'(%s\.){3}%s' % (q, q)
>>> re.findall(t,txt)
[('11.', '11', '13'), ('11.', '11', '14')]
Solution if you wanted tuples captured separately:
s = r'{0}\.{0}\.{0}\.{0}'.format(q)
>>> re.findall(s, txt)
[('254', '123', '11', '13'), ('254', '123', '11', '14')]
or
s = r'({0}\.{0}\.{0}\.{0})'.format(q)
>>> re.findall(s,txt)
[('254.123.11.13', '254', '123', '11', '13'), ('254.123.11.14', '254', '123', '11', '14')]
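To make the behaviour described above concrete, here is a minimal sketch that runs both variants side by side. The contents of txt are an assumption (the question never shows them), and the variable names are only illustrative.

```python
import re

# Hypothetical sample text; the question does not show what `txt` contains.
txt = "hosts: 254.123.11.13 and 254.123.11.14"

# One octet, written as a non-capturing group.
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# No capturing groups at all: findall returns each whole match.
non_capturing = rf"(?:{octet}\.){{3}}{octet}"
whole = re.findall(non_capturing, txt)

# Capturing groups: findall returns one tuple of groups per match, and the
# two groups inside the {3} repetition keep only their last iteration.
capturing = (r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
             r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")
groups = re.findall(capturing, txt)

print(whole)   # ['254.123.11.13', '254.123.11.14']
print(groups)  # [('11.', '11', '13'), ('11.', '11', '14')]
```

The first two octets never appear in the second result because each new iteration of the repeated group overwrites the previous capture.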

As I said in the comments, if you don't use non-capturing groups, re.findall returns the captured groups instead of the whole match. Since your regex has 3 groups, you get 3 results for each IP.
For a better demonstration, compare the two patterns below.
Without non-capturing groups:
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
Using non-capturing groups:
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
As you can see, when you use non-capturing groups the pattern has no capturing groups at all, so the whole match is returned, which is group 0.
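If you would rather keep the capturing groups, one option is to iterate over match objects with re.finditer and ask for group 0 explicitly. The sketch below assumes a made-up txt; it is only an illustration of that idea, not part of the original answer.

```python
import re

# Hypothetical sample text; the question does not show what `txt` contains.
txt = "254.123.11.13 254.124.12.14"

q = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
capturing = rf"({q}\.){{3}}{q}"

# Each match object exposes the whole match as group 0,
# regardless of how many capturing groups the pattern has.
matches = list(re.finditer(capturing, txt))
ips = [m.group(0) for m in matches]
captured = [m.groups() for m in matches]

print(ips)       # ['254.123.11.13', '254.124.12.14']
print(captured)  # [('11.', '11', '13'), ('12.', '12', '14')]
```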

Related

Regex pattern to match multiple characters and split

I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternation. Put the name and number patterns one after the other in a single pattern, and add a third group for the text that follows. Because . does not match a newline by default, (.*) captures the rest of the line.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
all = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(all, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
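The pattern above stops at the end of each line. If the third piece should really run until the next **Name** marker, even across line breaks, a lazy group with a lookahead and re.DOTALL is one way to sketch it. The name pattern and the .strip() call are my own choices here, not part of the answer.

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# (.*?) is lazy and, with DOTALL, may cross newlines; the lookahead stops it
# just before the next "**" or at the end of the string.
pattern = re.compile(r"\*\*(.+?)\*\*\s*`#(\d+)`:\s*(.*?)(?=\*\*|\Z)", re.DOTALL)

result = [(name, num, rest.strip()) for name, num, rest in pattern.findall(note)]
for entry in result:
    print(entry)
```

With this variant the trailing URL becomes part of the last entry's text, which may or may not be what you want.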

Find same sequence of 2 characters in a sentence

I have a line that looks like this:
Amount:Category:Date:Description:55544355
My requirement is to find a sequence of two characters, followed later by that same sequence of two characters, followed later by that same sequence of two characters again till all sequences are found. I achieved this as follows:
>>> my_str = 'Amount:Category:Date:Description:55544355'
>>> [item[0] for item in re.findall(r"((..)\2*)", my_str)]
['Am', 'ou', 'nt', ':C', 'at', 'eg', 'or', 'y:', 'Da', 'te', ':D', 'es', 'cr', 'ip', 'ti', 'on', ':5', '55', '44', '35']
This is obviously not the right output since the desired output is:
[[':D',':D'],['55','55'],['at', 'at']]
What am I doing wrong?
Would you please try the following:
my_str = 'Amount:Category:Date:Description:55544355'
print(re.findall(r'(..)(?=.*?\1)', my_str))
Output:
['at', ':D', '55']
If you want to print all occurrences of the characters, another step is required.
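A possible version of that extra step, as a rough sketch: take each repeated pair found by the lookahead pattern and collect its non-overlapping occurrences with a second findall. The dictionary layout is just one way to present the result.

```python
import re

my_str = 'Amount:Category:Date:Description:55544355'

# Step 1: pairs that occur again later in the string.
pairs = re.findall(r'(..)(?=.*?\1)', my_str)

# Step 2: every non-overlapping occurrence of each pair.
# re.escape guards against pairs containing regex metacharacters.
occurrences = {pair: re.findall(re.escape(pair), my_str) for pair in pairs}

print(occurrences)
# {'at': ['at', 'at'], ':D': [':D', ':D'], '55': ['55', '55']}
```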
You have to use a lookahead with a backreference. To get both values, you can wrap the backreference also in a capture group which will be returned as a tuple by re.findall.
import re
print(re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355"))
Output
[('at', 'at'), (':D', ':D'), ('55', '55')]
If you want a list of lists:
import re
print([list(elem) for elem in re.findall(r"(..)(?=.*?(\1))", "Amount:Category:Date:Description:55544355")])
Output
[['at', 'at'], [':D', ':D'], ['55', '55']]

Regex "AND" in an expression extract this and that

I'm struggling to write a regex that extracts the following numbers in bold below.
I set up 3 different regex for each value, but since the last value might have a space in between I don't know how to accommodate an "AND" here.
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
I have tried this and it is working for the first 2 but not for the last one. I'd like to have the last one in a single regex.
p1 = re.compile(r'(\d+)/')
p2 = re.compile(r'/(\d+)')
p3 = re.compile(r'(?=.*[R](\d+))(?=.*[R]\s(\d+))')
I've tried different things, and this is the latest code I tried, without success.
If I run this:
p1.findall(tire), p2.findall(tire), p3.findall(tire)
I would like to see this:
(['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17'])
You were almost there! You don't need three separate regular expressions.
Instead, use multiple capturing groups in a single regex.
(\d{3})\/(\d{2})R\s?(\d{2})
Try it: https://regex101.com/r/Xn6bry/1
Explanation:
(\d{3}): Capture three digits
\/: Match a forward-slash
(\d{2}): Capture two digits
R\s?: Match an R followed by an optional whitespace
(\d{2}): Capture two digits.
In Python, do:
p1 = re.compile(r'(\d{3})\/(\d{2})R\s?(\d{2})')
tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(p1, tire)
Now if you look at matches, you get
[('275', '65', '18'), ('275', '65', '18'), ('265', '70', '17')]
Rearranging this to the format you want should be pretty straightforward:
# Make an empty list of lists with three entries, one per group
groups = [[], [], []]
for match in matches:
    for groupnum, item in enumerate(match):
        groups[groupnum].append(item)
Now groups is [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
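The same rearrangement can also be written with zip, which transposes the list of tuples in one step. This is an alternative sketch, not what the answer itself uses; the unescaped / is equivalent to \/ in Python.

```python
import re

tire = 'Tire: P275/65R18 A/S; 275/65R 18 A/T OWL;265/70R 17 A/T OWL;'
matches = re.findall(r'(\d{3})/(\d{2})R\s?(\d{2})', tire)

# zip(*matches) pairs up the first items, the second items, and so on.
groups = [list(column) for column in zip(*matches)]

print(groups)
# [['275', '275', '265'], ['65', '65', '70'], ['18', '18', '17']]
```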

Timestring regex: please simplify it and show how the colons can be dropped?

I have written a regex to parse the system date and time, and I can capture everything with this script (I know there are modules for parsing dates; this is only for learning regex).
import re
s = "Sun Oct 14 13:47:03 CEST 2012"
x = r"([A-Za-z]+\b)\s([A-Za-z]+\b)\s(\d\d)\s(\d\d)([/:])(\d\d)([/:])(\d\d)\s([A-Za-z]+\b)\s(\d\d\d\d)"
toll = re.search(x, s)
for i in range(11):
    print(toll.group(i))
Objective:
To get all the individual elements in groups
Questions:
How can I make my regex simpler (if there is any way)?
How can I drop the colons from my regex (I don't want : to be captured at all)?
Here's my output:
Sun Oct 14 13:47:03 CEST 2012
Sun
Oct
14
13
:
47
:
03
CEST
2012
Solution: simply don't put parentheses around the groups matching colons, then they won't show up as capture groups:
>>> x = r"([A-Za-z]+\b)\s([A-Za-z]+\b)\s(\d\d)\s(\d\d)[/:](\d\d)[/:](\d\d)\s([A-Za-z]+\b)\s(\d\d\d\d)"
>>> re.search(x,s).groups()
('Sun', 'Oct', '14', '13', '47', '03', 'CEST', '2012')
But if you really want to simplify this big regex, it looks like you can get by with simply regex-splitting on space or colon, and avoid the big regex entirely:
>>> re.split(r'[ :/]', s)
['Sun', 'Oct', '14', '13', '47', '03', 'CEST', '2012']
If you put parentheses around part of a pattern, it becomes a "capturing group".
To prevent this, either leave out the parentheses or create a non-capturing group:
(?:[a-z]*)
However, my solution would be:
([A-Za-z]+)\s([A-Za-z]+)\s(\d\d)\s(\d\d)[/:](\d\d)[/:](\d\d)\s([A-Za-z]+)\s(\d{4})
Note that I removed the word boundaries. They are redundant because each one follows a run of letters and is followed by a whitespace character.
I also removed the parentheses around the colons and specified the number of digits in the year with {4}.
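Since the goal is to get at the individual elements, named groups are another option worth knowing about. The sketch below uses the same pattern as above with names attached; the group names are my own choice, not something from the question.

```python
import re

s = "Sun Oct 14 13:47:03 CEST 2012"

# Same structure as the pattern above; the separators stay outside any group.
pattern = re.compile(
    r"(?P<weekday>[A-Za-z]+)\s(?P<month>[A-Za-z]+)\s(?P<day>\d\d)\s"
    r"(?P<hour>\d\d)[/:](?P<minute>\d\d)[/:](?P<second>\d\d)\s"
    r"(?P<tz>[A-Za-z]+)\s(?P<year>\d{4})"
)

parts = pattern.search(s).groupdict()
print(parts["hour"], parts["minute"], parts["second"])  # 13 47 03
```

groupdict() returns only the named groups, so the colons never show up in the result.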

Regex to match a capturing group one or more times

I'm trying to match pairs of digits in a string and capture them in groups, but I seem to be able to capture only the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
Update:
It looks like Python does not let you capture every repetition of a group. In .NET, for example, you can retrieve all the captures in a single pass. In Python, re.findall(r'\d\d', '123456') does the job instead.
You cannot do that with a single regular expression in Python's re module, because a repeated capturing group keeps only its last repetition. Matching \d\d at every possible position, overlaps included, would give you:
Group1: 12
Group2: 23
Group3: 34
...
Python's re library comes with a non-overlapping search routine, re.findall(), that does the trick:
re.findall(r'\d\d', '123456')
will return ['12', '34', '56']
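To get the pairs grouped per match, as in the desired output of the question, a two-step approach is one possibility: first find each run of one to three pairs, then split every run into pairs. This is a sketch of that idea using the question's input string.

```python
import re

text = "123456 789101"

# Step 1: each run of one to three digit pairs (non-capturing, so findall
# returns the whole run).
runs = re.findall(r"(?:\d\d){1,3}", text)

# Step 2: split every run into its individual pairs.
pairs_per_match = [re.findall(r"\d\d", run) for run in runs]

print(runs)             # ['123456', '789101']
print(pairs_per_match)  # [['12', '34', '56'], ['78', '91', '01']]
```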
(\d{2})+(\d)?
I'm not sure how Python handles its matching, but this is how I would do it.
Try this:
import re
re.findall(r'\d\d','123456')
Is this what you want?
import re
# Up to three pairs of digits, delimited by a space, a line break, or the string boundary
regx = re.compile(r'(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  r'(\d\d)(\d\d)?(\d\d)?'
                  r'(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print(repr(s), regx.findall(s), '', sep='\n')
result
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]
