Regexp to match random order words - python

I have the following pseudo-DSL:
< allow | deny >
< tcp | udp | any >
src < prefix | $ip | #hostgroup | any > [ port number | range | #portgroup | any ]
dst < prefix | $ip | #hostgroup | any > [ port number | range | #portgroup | any ]
[ stateful ]
[ expire YYYYMMDD ] [ log ]
[ # comment ]
The order is fixed, starting from allow up to dst and its port.
That I'm matching with the following, rather dumb, regexp:
m = re.search("^(allow|deny)?\s+(tcp|udp|tcpudp|any)\s+?(src\s\S+)\s*?(port\s+\S+)?\s*?(dst\s\S+)\s?(port\s+\S+)?\s*?(\S+)?\s*?(\S+)?", line)
Pardon me for the n00bness of the questions, but the parts I'm having problems with are:
How can I match stateful, expire <value>, log if all 3 are optional but in case they are present I want to match them in separate groups.
How can I match optional statement port <value> in such a way that the match group will contain only the value, without creating an extra matching group, i.e. without using (port\s+(\S+))?
Thanks!
[edit for more of a problem statement]
To elaborate a bit more, sure I can check whether one of the 3 groups contain either log or stateful, but if I use the same approach, a non-capturing group for expire, aka (?:expire\s(\S+)), I'd need to make an assumption. Unless I can somehow have order-less matching? i.e. match on (stateful|log|(?:expire\s(\S+)))?

How can I match stateful, expire <value>, log if all 3 are optional but in case they are present I want to match them in separate groups.
Use capture groups that have a ? after them so that they will be optional.
Ex. \s*(stateful)?\s*(?:expire (\d{8}))?\s*(log)?
To allow those optional groups to appear in any order in the match string, but still always have them in the same numbered capture group, use a look-ahead (?= ).
Ex. (?=(?:.*(stateful))?)(?=(?:.*expire (\d{8}))?)(?=(?:.*(log))?)
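A minimal Python sketch of the look-ahead approach, assuming it is applied to the part of the line that follows the dst clause. The name parse_options and the added \b word boundaries are illustrative additions, not part of the answer above. One caveat: a trailing # comment that happens to contain the word log would also be picked up.

```python
import re

# Three look-aheads, each wrapping an optional group, so every keyword
# lands in a fixed group number regardless of where it appears.
OPTIONS = re.compile(
    r"(?=(?:.*\b(stateful)\b)?)"
    r"(?=(?:.*\bexpire\s+(\d{8}))?)"
    r"(?=(?:.*\b(log)\b)?)"
)

def parse_options(tail):
    """Return (stateful, expire_date, log); missing options are None."""
    return OPTIONS.match(tail).groups()

print(parse_options("log expire 20250101 stateful"))  # ('stateful', '20250101', 'log')
print(parse_options("stateful"))                      # ('stateful', None, None)
```

Because every look-ahead is zero-width and optional, the pattern always matches at position 0 and only the groups carry information.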
How can I match optional statement port <value> in such a way that the match group will contain only the value, without creating an extra matching group, i.e. without using
(port\s+(\S+))?
Use a non-capturing group (?: ) to put those characters together for the following ? without capturing them. (You probably want to do this for expire above also)
(?:port\s+(\S+))?
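As a small illustration of the non-capturing group, here is a sketch that handles only the src clause. The pattern name SRC and the sample rules are made up for the example.

```python
import re

# (?: ... ) groups "port <value>" so the trailing ? applies to the whole
# phrase, while only the value itself ends up in a capture group.
SRC = re.compile(r"src\s+(\S+)(?:\s+port\s+(\S+))?")

with_port = SRC.search("allow tcp src 10.0.0.0/8 port 80 dst any")
without_port = SRC.search("allow tcp src 10.0.0.0/8 dst any")

print(with_port.groups())     # ('10.0.0.0/8', '80')
print(without_port.groups())  # ('10.0.0.0/8', None)
```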
Complete Regex


Regex group doesn't match with "?" even if it should

input strings:
"| VLAN56 | LAB06 | Labor 06 | 56 | 172.16.56.0/24 | VLAN56_LAB06 | ✔️ | |",
"| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"
regex:
'\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?'
The goal is to get the VLAN number, hostname, and if there is one, the color code, but with a "?" it ignores the color code every time, even when it should match.
With the "?" the last capture group is always None.
You may use this regex:
\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|[^|]+\|[^#|]*(#[0-9A-Fa-f]{6})?
You have a demo here: https://regex101.com/r/SWe42v/1
The reason why it didn't work with your regex is that .+ is a greedy quantifier: It matches as much as it can.
So, when you added the ? to the last part of the regex, you gave the engine no reason to backtrack. The .+ matches the rest of the string/line and the group captures nothing (which is still a valid match because the group is optional).
In order to fix it, you can simply match the column with the emoji explicitly. You don't care about its content, so you use \|[^|]+ to skip the column.
This sort of construct is widely used in regexes: SEPARATOR[^SEPARATOR]*
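The difference can be seen in a short Python sketch. The sample row here is a simplified, hypothetical line with a single # before the hex digits; with the doubled ## shown in the question, the optional group would stay empty unless the pattern also allows for the extra #.

```python
import re

# Hypothetical, shortened table row (not one of the question's lines).
line = "| 57 | VLAN57_LAB07 | ok | #848484 |"

# Greedy .+ swallows the rest of the line, so the optional group is empty.
greedy = re.search(
    r"\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|.+(#[0-9A-Fa-f]{6})?", line)

# Skipping one column with [^|]+ stops right before the color column.
fixed = re.search(
    r"\|\s+(\d+).+(VLAN\d+_[0-9A-Za-z]+)\s+\|[^|]+\|[^#|]*(#[0-9A-Fa-f]{6})?", line)

print(greedy.groups())  # ('57', 'VLAN57_LAB07', None)
print(fixed.groups())   # ('57', 'VLAN57_LAB07', '#848484')
```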
The reason why the last capture group is None is that the preceding .+ can capture the rest of the line.
I would however first use the fact that this is a pipe-separated format, and split by that pipe symbol and then retrieve the elements of interest needed by slicing them from that result by their index:
import re
s = "| VLAN57 | LAB07 | Labor 07 | 57 | 172.16.57.0/24 | VLAN57_LAB07 | ✔️ | ##848484: |"
vlan,name,color = re.split(r"\s*\|\s*", s)[4:9:2]
print(vlan, name, color)
This code is in my opinion easier to read and to maintain.
I think this is what you're after:
^\|\s+(VLAN[0-9A-Za-z]+)\s+\|\s+([0-9A-Za-z]+)\s+\|.*((?<=\#)[0-9A-Fa-f]{6})?.*$
^\|\s+ - the start of the line must be a pipe followed by some whitespace.
(VLAN[0-9A-Za-z]+) - What comes next is the VLAN - so we capture it; with the VLAN and all (at least 1) following alpha-numeric chars.
\s+\|\s+ - there's then another pipe delimiter, with whitespace either side.
([0-9A-Za-z]+) - the column after the vlan name is the device name; so we capture the alphanumeric value from that.
\s+\| - after our device there's more whitespace and then the delimiter
.* - following that there's a load of stuff that we're not interested in; could be anything.
((?<=\#)[0-9A-Fa-f]{6})? - next there may be a 6 hex char value preceded by a hash; we want to capture only the hex value part.
(...) says this is another capture group
(?<=\#) is a positive look behind; i.e. checks that we're preceded by some value (in this case #) but doesn't include it within the surrounding capture
[0-9A-Fa-f]{6} is the 6 hex chars to capture
? after the parenthesis says there's 0 or 1 of these (i.e. it's optional); so if it's there we capture it, but if it's not that's not an issue.
.*$ says we can have whatever else through to the end of the string.
We could strip a few of those bits out, or add more in; e.g. if we know exactly which column everything will be in, we can massively simplify by just capturing content from those columns:
^\|\s*([^\|\s]+)\s*\|\s*([^\|\s]+)\s*\|\s*[^\|]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\s]*\s*\|\s*[^\|\d]*(\d{6})?[^\|]*\s*\|$
... But amend per your requirements / whatever feels most robust and suitable for your purposes.

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value (100.00 for the second example). This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
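To make the precedence point concrete, here is a short Python comparison on the Zero Rated line from the question. The variable names loose and grouped are just for illustration.

```python
import re

line = "Unpacking/Unremoval fee Zero Rated 100.00"

# Without grouping, | splits the whole pattern into two alternatives:
#   (.*)Zero.*Rated    OR    %\s*(\S*)
loose = re.search(r"(.*)Zero.*Rated|%\s*(\S*)", line)

# (?: ... ) limits the alternation to the "Zero Rated or %" part.
grouped = re.search(r"(.*)(?:Zero.*Rated|%)\s*(\S*)", line)

print(loose.groups())    # ('Unpacking/Unremoval fee ', None)
print(grouped.groups())  # ('Unpacking/Unremoval fee ', '100.00')
```

In the ungrouped version the first alternative wins, and group 2 belongs to the other alternative, so it stays None.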
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
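The question asked for numeric values such as 1800.00, while the output above holds strings. A possible follow-up step, sketched here with a hypothetical helper named parse_amounts, strips the thousands separators and converts each amount to float.

```python
import re

PATTERN = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:,\d{3})*\.\d{2})"
    r"\s*(\d{1,3}(?:,\d{3})*\.\d{2})",
    re.MULTILINE,
)

def parse_amounts(text):
    """Map each description to its final amount as a float."""
    return {
        description: float(amount.replace(",", ""))
        for description, amount in PATTERN.findall(text)
    }

sample = (
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
    "Unpacking/Unremoval fee Zero Rated 100.00\n"
    "No marker on this line 5.00"
)
print(parse_amounts(sample))
```

Lines that contain neither Zero Rated nor a percentage, like the last sample line, are simply skipped.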

re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

In a Python regular expression, I encounter this singular problem.
Could you give instruction on the differences between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?
import re
string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)
Actual Output is:
['ab', 'cd']
['cd']
I'm confused as to why the second result doesn't contain 'ab' as well.
+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. This will only capture the last iteration.
You can reason about this behaviour as follows:
Say your string is abcdla and regex is (ab|cd)+. Regex engine will find a match for the group between positions 0 and 1 as ab and exits the capture group. Then it sees + quantifier and so tries to capture the group again and will capture cd between positions 2 and 3.
If you want to capture all iterations, you should capture the repeating group instead with ((ab|cd)+) which matches abcd and cd. You can make the inner group non-capturing as we don't care about inner group matches with ((?:ab|cd)+) which matches abcd
https://www.regular-expressions.info/captureall.html
From the Docs,
Let’s say you want to match a tag like !abc! or !123!. Only these two
are possible, and you want to capture the abc or 123 to figure out
which tag you got. That’s easy enough: !(abc|123)! will do the trick.
Now let’s say that the tag can contain multiple sequences of abc and
123, like !abc123! or !123abcabc!. The quick and easy solution is
!(abc|123)+!. This regular expression will indeed match these tags.
However, it no longer meets our requirement to capture the tag’s label
into the capturing group. When this regex matches !abc123!, the
capturing group stores only 123. When it matches !123abcabc!, it only
stores abc.
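The quoted tag example can be reproduced in Python as a quick check. The variable names are illustrative only.

```python
import re

# A repeated capturing group keeps only its last iteration.
last_only = re.match(r"!(abc|123)+!", "!123abcabc!")
print(last_only.group(0))  # !123abcabc!
print(last_only.group(1))  # abc

# Capturing the repetition as a whole keeps the full label.
whole = re.match(r"!((?:abc|123)+)!", "!123abcabc!")
print(whole.group(1))      # 123abcabc

# The same idea applied to the string from the question.
print(re.findall(r"((?:ab|cd)+)", "abcdla"))  # ['abcd']
```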
I don't know if this will make things clearer, but let's try to picture what happens under the hood in a simple way.
We're going to simulate what happens using match:
import re

# group(0) returns the whole matched string; the captured groups are available
# via groups() or group(1), group(2), ... Here there is only one group, and a
# group holds only one value at a time.
string = 'abcdla'
print(re.match('(ab|cd)', string).group(0))  # matches 'ab'; the group captures 'ab'
print(re.match('(ab|cd)+', string).group(0))  # matches 'abcd'; the group keeps only 'cd', the last iteration
findall matches and consumes the string at the same time. Let's imagine what happens with the regex '(ab|cd)':
'abcdabla' ---> 1: match: 'ab' | capture : ab | left to process: 'cdabla'
'cdabla' ---> 2: match: 'cd' | capture : cd | left to process: 'abla'
'abla' ---> 3: match: 'ab' | capture : ab | left to process: 'la'
'la' ---> 4: match: '' | capture : None | left to process: ''
--- final : result captured ['ab', 'cd', 'ab']
Now the same thing with '(ab|cd)+'
'abcdabla' ---> 1: match: 'abcdab' | capture : 'ab' | left to process: 'la'
'la' ---> 2: match: '' | capture : None | left to process: ''
---> final result : ['ab']
I hope this clears thing a little bit.
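The walk-through above can be checked with re.finditer, which exposes both the full match and the captured group at each step. The helper name trace is made up for this sketch.

```python
import re

def trace(pattern, text):
    """Return (full match, group 1) for every match findall would visit."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

print(trace(r"(ab|cd)", "abcdabla"))   # [('ab', 'ab'), ('cd', 'cd'), ('ab', 'ab')]
print(trace(r"(ab|cd)+", "abcdabla"))  # [('abcdab', 'ab')]
```

findall returns only the second element of each pair, which is why the repeated version yields ['ab'] for this string.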
So, for me the confusing part was this sentence from the docs:
If one or more groups are present in the pattern, return a list of groups;
So it's returning not the full match but only what the capture group matched. If you make the group non-capturing, re.findall('(?:ab|cd)+', string), it returns ['abcd'], as I initially expected.

Match every other character in a string

I have a string
k1|v1|k2|v2|k3|v3|k4|v4
and I want to match on every other | so I can change the string to
k1:v1|k2:v2|k3:v3|k4:v4
I know I can match on | by doing a grouping like (|) but I can't figure out how to match only every other pipe.
Thanks.
Match with:
([^|]*)\|([^|]*(\||$))
Replace with $1:$2.
General idea:
[^|]* - multiple non-| characters
() defines a group
(\||$) - a | or the end of the string
The entire regex reads as multiple non | characters in the first group, followed by a |, followed by multiple non | characters and a | or end of string in the second group
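In Python the same substitution might look like the sketch below. Note that Python's re.sub uses \1 and \2 in the replacement rather than $1 and $2; the function name pair_up is made up for the example.

```python
import re

def pair_up(text):
    """Replace every other | with : so that k|v pairs become k:v."""
    return re.sub(r"([^|]*)\|([^|]*(\||$))", r"\1:\2", text)

print(pair_up("k1|v1|k2|v2|k3|v3|k4|v4"))  # k1:v1|k2:v2|k3:v3|k4:v4
```

Each match consumes one key, the pipe after it, one value and the following pipe (or the end of the string), so the next match starts at the next key.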

Write regex for matching all 4 digit numbers between patterns

I am trying to write a regex to find a pattern in a string. It's going to have the word 'LAT_LON', then some non-word characters, then many 4-digit numbers, and after that some letters or the end of the string.
Eg1.
SOME EXAMPLE STRING 12334...
LAT_LON .... 1234 5678 9012 1234
1234 1234
Eg2.
SOME EXAMPLE STRING 1234...
LAT_LON ... 1234 5678 9012 1234
1234 1234 SOMETHING_ELSE
In both the examples I need those 6 4-digit numbers after the pattern 'LAT_LON' and before any other alphabet.
EDIT: I am working in python, although I don't care much about the language. I am fairly new to regex world. So I am just trying some random stuff, nothing very conclusive at all till now.
One way is to capture the numbers then split on whitespace.
LAT_LON[^\da-zA-Z]*(\d{4}(?:\s+\d{4})*)
Then split capture group 1 on whitespace.
LAT_LON [^\da-zA-Z]*
( # (1 start)
\d{4}
(?:
\s+
\d{4}
)*
) # (1 end)
Here is a more verbose formatted version.
( Regex's constructed by RegexFormat 6 )
LAT_LON # Exact 'LAT_LON'
[^\da-zA-Z]* # Optional chars, 0 to many times
# not digit nor letter (case insensitive)
( # (1 start), Capture all 4 digit numbers
\d{4} # Single 4 digit number
(?: # Cluster group
\s+ # Whitespace(s)
\d{4} # Single 4 digit number
)* # End Cluster, do 0 to many times
) # (1 end)
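Since the question mentions Python, the commented layout above maps naturally onto re.VERBOSE. This is a sketch of the capture-then-split idea; the function name extract_numbers is hypothetical.

```python
import re

LAT_LON = re.compile(r"""
    LAT_LON            # literal marker
    [^\da-zA-Z]*       # anything that is not a digit or a letter
    (                  # group 1: the whole run of 4-digit numbers
        \d{4}
        (?: \s+ \d{4} )*
    )
""", re.VERBOSE)

def extract_numbers(text):
    """Return the 4-digit numbers that follow LAT_LON, or [] if absent."""
    match = LAT_LON.search(text)
    return match.group(1).split() if match else []

example = (
    "SOME EXAMPLE STRING 1234...\n"
    "LAT_LON ... 1234 5678 9012 1234\n"
    "1234 1234 SOMETHING_ELSE"
)
print(extract_numbers(example))  # ['1234', '5678', '9012', '1234', '1234', '1234']
```

Because \s+ also matches newlines, the numbers on the continuation line are included in the same capture.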
Let me try it another way, just to have some variation in the answers. I'm going to use awk for the job.
awk '/LAT_LON/,/\n[^0-9]/{printf gensub(/[^0-9 ]/, "", "g", $0) " "}' /path/to/input/file
With a possible pipe to clean up the output | tr -s ' '.
This code searches for lines containing LAT_LON, then processes each line from there until a non-number is found. On these lines we strip everything that is not a digit or a space using gensub (a GNU awk extension).
Note that the regex is fairly simple because we have filtered out all irrelevant parts. A simple non-numerical removal does the job here. See also grep if you want to mess around with regex; in my opinion it's the best way to learn. In particular egrep, which supports extended regular expressions!
