python - regex putting repeating patterns in a single group - python

I'm trying to parse a string in regex and am 99% there.
my test string is
1
1234 1111 5555 88945
172.255.255.255 from 172.255.255.255 (1.1.1.1)
Origin IGP, localpref 300, valid, external, best
rx pathid: 0, tx pathid: 0x0
my current regex pattern is:
(?P<as_path>(\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>\S+,\s{0,4})
im using regex101 to test and have a link to the test here https://regex101.com/r/iGM8ye/1
So currently i have a group2 I don't want this group, could someone tell me why im getting this group and how to remove it?
and the second is, in the attributes I want to match the words, "valid, external, best" currently my pattern only matches "valid," I thought adding the repeat of within the group would of matched all three of those but it hasn't.
How would I achieve matching the repeat of "string, string, string," (string comma space) into one group?
Thanks
EDIT
Desired output
as_path : 1234 1111 5555 88945
peer_addr : 172.255.255.255
peer_rid : 1.1.1.1
local_pref : 300
attribs : valid, external, best
attiribs may also just be valid, external, or just external, or another entry in the format (stringcommaspace)

Try Regex: (?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>[\S]+,(?: [\S]+,?)*){0,4}
Demo
Regex in the question had a capturing group (Group 2) for (\d{4,10}\s). it is changed to a non capturing group now (?:\d{4,10}\s)

See regex in use here.
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}(?:\.\d{0,3}){3}).*\((?P<peer_rid>\d{0,3}(?:\.\d{0,3}){3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You were getting group 2 because your as_path group contained a group. I changed that to a non-capturing group.
I changed attribs to \S+(?:,\s+\S+){2}
This will match any non-space character one or more times \S+, followed by the following exactly twice:
,\s+\S+ the comma character, followed by the space character one or more times, followed by any non-space character one or more times
I changed peer_addr and peer_rid to \d{0,3}(?:\.\d{0,3}){3} instead of \d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}. This is a preference, but shortens the expression.
Without that last modification, you can use the following regex (it performs slightly better anyway (as seen here):
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You can also improve the performance by using more specific tokens as the following suggests (notice I also added the x modifier to make it more legible) and as seen here:
(?P<as_path>\d{4,10}(?:\s\d{4,10}){0,19})\s+
(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})[^)]*
\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+
.*localpref\s(?P<local_pref>\d+),\s+
(?P<attribs>\w+(?:,\s+\w+){2})

You get that separate group because your are repeating a capturing group were the last iteration will be the capturing group, in this case 88945 You could make it non capturing instead (?:
For the second part you could use an alternation to exactly match one of the options (?:valid|external|best)
Your pattern might look like:
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>(?:valid|external|best)(?:,\s{0,4}(?:valid|external|best))+)
regex101 demo

Related

python regex, capturing a pattern with trimming repeated subpattern in string

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
Leading "collect_" is fixed, and then follows "${project_version}" which doesn't have hard rule to set this variable, the naming will be very different by different users.
Then, there will be repeating "${part}${length}cm_.......", but the number of repeatence is not fixed.
I'd like to capture the the variable ${project_version}.
Then, I try using the following re.match to capture it.
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Is there anyone give me a hint that what's wrong in my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_).
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)

capture the number iwth comma or dot with regex

I have regex code
https://regex101.com/r/o5gdDt/8
As you see this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all digits which sperated by 3 digits in text like
"here is 100,100"
"23,456"
"1,435"
all more than 4 digit number like without comma separated
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is if there is a comma(dot) after the first two types it can not capture those. for example, it can not capture
"here is 100,100,"
"23,456,"
"1,435,"
unfortunately, there is a few number intext which ends with comma...can someone gives me an idea of how to modify this to capture above also?
I have tried to do it and modified version is so:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
basically I delete comma in (?![\d,]) but it causes to another problem in my context
it captures part of a number that is part of equation like this :
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know that is kind of special question I would be happy to know your ides
The main problem here is that (?![\d,]) fails any match followed with a digit or comma while you want to fail the match when it is followed with a digit or a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires a whitespace or start of string and that is certainly not a digit or ,. Either use (?<!\S) or (?<!\d)(?<!\d,) depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You wnat to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead, (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. Once the engine fails to find the match for x or /, it will backtrack and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. II just want to understand regex better.
I checked the responses in the Regex to match pattern.one skip newlines and characters until pattern.two and Regex to match if given text is not found and match as little as possible threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
.
Problem 1:
Given Strings below with characters followed by / how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups which must use [\S]+/ to get the desired output
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?: # Optional group
(?! / ) # Not a forward slash ahead
[\S] # Not whitespace class
)* # End group, do 0 to many times
- # A dash must exist
(?: # Optional group, same as above
(?! / )
[\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/, I am aware, that I am trying with \author and not \maketitle
In python, I try the following:
import re
text = str(r'
\author{
\small
}
\maketitle
')
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
for m in p.finditer(text):
print(m.group())
Python freezes, I am suspecting that this has something to do with my pattern, and the SRE fails.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? Still I get the same results on my machine.
EDIT 2: Can this be fixed somehow so the pattern supports optional followed by ?: or ?= look-heads? So that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.

Regex match only if multiple patterns found (python)

I'm trying to extract data from sentences such as:
"monthly payment of 525 and 5000 drive off"
using a python regex search function: re.search()
My regex query string is as follows for down payment:
match1 = "(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
My problem is that it matches the wrong numerical value as down payment, it gets both 525, and 5000.
How can I improve my regex string such that it only matches an element if another element is successfully matched as well?
In this case, for example, both 5000 and drive-off matched so we can extract 5000 as down_payment, but 525 did not match with the any down payment values, so it should not even consider the 525.
Clearer explanation here
The point is that you want to match a sequence of patterns. In order to make sure the trailing patterns are taken into account, they cannot be all optional. Look, \s*, (|\$|dollars*|money)*, \s*, (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can match empty strings.
I suggest removing the final * quantifier to match exactly one occurrence of the pattern:
(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)
See the regex demo
Also note that I contracted a (\s|-) group into a character class [\s-] as you only alternate single char patterns, and also turned (|\$|dollars*|money)* into a non-capturing optional group (?:\$|dollars*|money)? that matches just 1 or 0 occurrences of $, dollar(s) or money.

Categories

Resources