multiple split in string using regex - python

I have a string:
Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2
I want to split this string so that the result looks like this:
'Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8' 'StaMAC:00:9F:0B:00:38:B8' 'BSSID:00 9F' 'Radioid:2'
I tried msgRegex = re.compile('[\w\s]+:') and also the split function.
How can I do this? Please help. Thank you.

From what I see, the problem is that whitespace can occur inside the hex values (as in BSSID:00 9F).
Because of that, I believe a splitting approach will not work here. Match your tokens with a regex like
(?<!\S)\b([^:]+):((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+)\b
See the regex demo
Python code:
import re
rx = r"(?<!\S)\b([^:]+):((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+)\b"
ss = ["Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2",
"Station Deassoc:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.5 StaMac1:40:83:DE:34:04:75 StaMac2:40:83:DE:34:04:75 UserName:4083de340475 StaMac3:40:83:DE:34:04:75 VLANId:1 Radioid:2 SSIDName:Devices SessionDuration:12 APID:CN58G6749V AP Name:1023-noida-racking-zopnow BSSID:BC:EA:FA:DC:A6:F1"]
for s in ss:
    matches = re.findall(rx, s)
    print(matches)
Result:
[('Station Disconnect', '1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8'), ('StaMAC', '00:9F:0B:00:38:B8'), ('BSSID', '00 9F'), ('Radioid', '2')]
[('Station Deassoc', '1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.5'), ('StaMac1', '40:83:DE:34:04:75'), ('StaMac2', '40:83:DE:34:04:75'), ('UserName', '4083de340475'), ('StaMac3', '40:83:DE:34:04:75'), ('VLANId', '1'), ('Radioid', '2'), ('SSIDName', 'Devices'), ('SessionDuration', '12'), ('APID', 'CN58G6749V'), ('AP Name', '1023-noida-racking-zopnow'), ('BSSID', 'BC:EA:FA:DC:A6:F1')]
NOTE: If you do not need tuples in the result, remove the capturing parentheses from the pattern.
Pattern details:
(?<!\S)\b - start of string or whitespace followed with a word boundary (next char must be a letter/digit or _)
([^:]+) - Capturing group #1: 1+ chars other than :
: - a colon
((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+) - Capturing group 2 matching one or more occurrences of:
[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})* - 2 hex chars followed with zero or more occurrences of a space or : and 2 hex chars
| - or
\S - a non-whitespace char
\b - trailing word boundary.
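The NOTE above can be sketched as follows, for the case where plain key:value strings are wanted, as in the question. This is a minimal sketch, assuming the same pattern with both capturing groups removed; the name rx_flat is made up for illustration.

```python
import re

# Same pattern as in the answer, but with both capturing groups removed,
# so re.findall returns whole matches instead of (key, value) tuples.
# The name rx_flat is illustrative only.
rx_flat = r"(?<!\S)\b[^:]+:(?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+\b"

s = ("Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 "
     "StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2")

parts = re.findall(rx_flat, s)
print(parts)
```

With the tuple version, dict(re.findall(rx, s)) is another option if a key-to-value lookup is more convenient than a list.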

In this particular case you can implement it like so:
import re
a = 'Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2'
print(re.split(r'(?<=[A-Z0-9]) (?=[A-Z])', a))
Output:
['Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8', 'StaMAC:00:9F:0B:00:38:B8', 'BSSID:00 9F', 'Radioid:2']
Regex:
(?<=[A-Z0-9]) - Positive lookbehind for A-Z or 0-9
' ' (a literal space) - 1 space character
(?=[A-Z]) - Positive look ahead for A-Z
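The "in this particular case" caveat matters. As a rough check, the same split can be tried on a fragment of the longer sample line shown earlier on this page, where a key contains a space after an uppercase letter and a value ends in a lowercase letter. The variable names b and pieces are only for illustration.

```python
import re

# Fragment of the longer sample line: the key "AP Name" has a space after
# an uppercase letter, and one value ends with a lowercase letter.
b = 'APID:CN58G6749V AP Name:1023-noida-racking-zopnow BSSID:BC:EA:FA:DC:A6:F1'

pieces = re.split(r'(?<=[A-Z0-9]) (?=[A-Z])', b)
print(pieces)
# "AP Name" is cut in two, and BSSID stays glued to the previous value.
```

So this split relies on keys never containing "uppercase, space, uppercase" and on values always ending in a digit or an uppercase letter.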

Related

regex to match literally the characters of an abbreviation

I am new to regex and would like some help. I have the string below, and I want my regex to match the first character of the acronym literally, followed by any number of [a-z] characters, but only for the first character. The rest of the characters should be matched exactly as they are. Any help on what to change in my regex line to achieve this would be highly appreciated.
import re
s = 'nUSA stands for northern USA'
x = (f'({"nUSA"}).+?({" ".join( t[0] + "[a-z]+" + t[1:] for t in "nUSA")})(?: )')
print(x)
out: (nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
What I want to achieve with my regex line is something like the pattern below, so that it can match northern USA:
(nUSA).+?(n[a-z]+ U + S + A)(?: )
instead of the one I get:
(nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
I would like it to work for arbitrary text, not only for this specific string. I am not sure whether I have expressed my problem properly.

You may use
import re
s = 'nUSA stands for northern USA'
key='nUSA'
x = rf'\b({key})\b.+?\b({key[0]}[a-z]*\s*{key[1:]})(?!\S)'
# => print(x) => \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S)
# Or, if the key can contain special chars at the end:
# x = rf'\b({re.escape(key)})(?!\w).+?(?<!\w)({re.escape(key[0])}[a-z]*\s*{re.escape(key[1:])})(?!\S)'
print(re.findall(x, s))
# => [('nUSA', 'northern USA')]
See the Python demo. The resulting regex will look like \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S), see its demo. Details:
\b - word boundary
(nUSA) - Group 1 capturing the key word
\b / (?!\w) - word boundary (right-hand word boundary)
.+? - any 1+ chars other than linebreak chars as few as possible
\b - word boundary
(n[a-z]*\s*USA) - Group 2: n (first char), then any 0+ lowercase ASCII letters, 0+ whitespaces and the rest of the key string.
(?!\S) - a right-hand whitespace boundary (you may consider using (?!\w) again here).
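To reuse this for arbitrary keys, the escaped variant from the comment in the code above could be wrapped in a small helper. This is only a sketch: the function name find_expansion is made up here, and it uses (?<!\w) in place of the leading \b so that keys starting with a non-word character would still be handled.

```python
import re

def find_expansion(key, text):
    """Return (key, expansion) pairs; only the first key character is expanded."""
    first = re.escape(key[0])   # first character, followed by [a-z]* in the pattern
    rest = re.escape(key[1:])   # remaining characters, matched literally
    pattern = (rf'(?<!\w)({re.escape(key)})(?!\w)'
               rf'.+?(?<!\w)({first}[a-z]*\s*{rest})(?!\S)')
    return re.findall(pattern, text)

print(find_expansion('nUSA', 'nUSA stands for northern USA'))
print(find_expansion('sUK', 'sUK means southern UK'))
```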

Regex return match and extended matches

Can a regex return matches and extended matches? What I mean is a single regular expression that returns a different number of elements depending on the structure of the input. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return it as well. But if it is not there, the regex should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Where match1 only matches the first occurrence and match2 only the second. What would be the regex to work in both cases?

The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
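If the individual parts are wanted as separate items, as in the expected output in the question, capturing groups can be kept and the optional one made to wrap only the digits. A minimal sketch under that assumption (the names pattern and rows are illustrative, and this is a variation on the pattern above, not the pattern itself):

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Three mandatory groups plus one group inside the optional ";digits" tail.
pattern = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')

# groups() yields None for the optional group when the tail is absent,
# so those entries are dropped.
rows = [[g for g in m.groups() if g is not None] for m in pattern.finditer(s)]
print(rows)
```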

Regex for a third-person verb

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?

You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))
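For checking single words rather than scanning a sentence, the same two lookbehinds can be combined with fullmatch. A small sketch, assuming each input is one bare word; the names verb_rx and is_e_plus_s are made up for this example, and \w+ is used instead of \w* as suggested above.

```python
import re

# Two fixed-width negative lookbehinds placed right before the final "es".
verb_rx = re.compile(r'\w+(?<![iosxz])(?<![cs]h)es')

def is_e_plus_s(word):
    """True if word looks like <stem ending in e> + s under the stated rule."""
    return verb_rx.fullmatch(word) is not None

words = ['likes', 'hates', 'bathes', 'matches', 'washes', 'does', 'fixes']
print([w for w in words if is_e_plus_s(w)])
```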

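If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, a regex engine that allows alternatives of different lengths in a lookbehind (PCRE, for example) accepts:
(?<!i|o|s|x|z|ch|sh)e
Python's re rejects this pattern with "look-behind requires fixed-width pattern", so there the condition has to be split into (?<![iosxz])(?<![cs]h)e.
Your regex [^iosxz(sh)(ch)] consists of a character class: the ^ simply negates it, and every other character, parentheses included, is taken literally, so it's equivalent to:
[^iosxzch()]
which actually means: "match anything that's not one of iosxzch()". The h before es in "bathes" is in that set, which is why "bathes" is not matched.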
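Both points can be checked quickly in Python. The sketch below compares the original character class with a flattened spelling that lists every character inside the brackets once, and tries to compile the variable-width lookbehind with the standard re module. The variable names are illustrative.

```python
import re

text = 'likes hates bathes'

# Inside [...] the parentheses are literal characters, not groups.
original = re.findall(r'\b\w*[^iosxz(sh)(ch)]es\b', text)
flattened = re.findall(r'\b\w*[^iosxzch()]es\b', text)
print(original, flattened)

# The standard re module only accepts fixed-width lookbehinds.
try:
    re.compile(r'(?<!i|o|s|x|z|ch|sh)e')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)
```

Since h is in the negated set, any word with h right before es ("bathes" here) is skipped by both spellings.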

Python multiline regex delimiter

Having this multiline variable:
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
The structure is always TAG = CONTENT, both strings are NOT fixed and CONTENT could contain new lines.
I need a regex to get:
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\n'), ('PARALLEL', '4')]
Tried multiple combinations but I'm not able to stop the regex engine at the right point for TABLES tag as its content is a multiline string delimited by the next tag.
Some attempts from the interpreter:
>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]
>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\nPARALLEL = 4\n')]
Thanks!

You can use a positive lookahead to make sure you lazily match the value correctly:
(\w+)\s=\s(.+?)(?=$|\n[A-Z])
               ^^^^^^^^^^^^^
To be used with a DOTALL modifier so that a . could match a newline symbol. The (?=$|\n[A-Z]) lookahead will require .+? to match up to the end of string, or up to the newline followed with an uppercase letter.
See the regex demo.
An alternative, faster regex (as it is an unrolled version of the expression above), with which the DOTALL modifier should NOT be used:
(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)
See another regex demo
Explanation:
(\w+) - Group 1 capturing 1+ word chars
\s*=\s* - a = symbol wrapped with optional (0+) whitespaces
(.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
.* - any 0+ characters other than a newline
(?:\n(?![A-Z]).*)* - 0+ sequences of:
\n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
.* - any 0+ characters other than a newline
Python demo:
import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
print(p.findall(raw))
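The demo above uses the lookahead version. The unrolled pattern can be tried in a similar way, without DOTALL. This is a sketch with a few assumptions: continuation lines are indented here (they only need to start with something other than an uppercase letter), values are split on commas and stripped because the last value picks up the trailing newline, and the names unrolled and parsed are illustrative.

```python
import re

raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
    , TEST.RAW_2
    , TEST.RAW_3
    , TEST.RAW_4
PARALLEL = 4
'''

# No re.DOTALL here: "." must stop at newlines for the unrolled loop to work.
unrolled = re.compile(r'(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)')

# Split each value on commas and strip the surrounding whitespace/newlines.
parsed = {tag: [item.strip() for item in value.split(',')]
          for tag, value in unrolled.findall(raw)}
print(parsed)
```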

regex of symbolic expression grouped

In Python, I am trying to parse an expression like this with a regex:
function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)
I am using this regex
(?P<perf_name>\w*?)\((?P<perf_param>[\w]+)*(?:,*(?P<perf_param2>[\w]+)?)*\)
but I'm stuck because so far I can't get the params that are not next to a bracket (param_2, param_8 and param_9).
Plus, I am pretty sure there is a solution that would let me use a single perf_param group instead of the two groups perf_param and perf_param2.
Any ideas?

You should do that in 2 steps:
(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)
This regex will get you the name and params as two groups. Then, just split the second group with ,.
import re
p = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')
s = "function_1(param_1,param_2,param_3)+function_2(param_4,param_5)*function_3(param_6)+function_4()-function_5(param_7,param_8,param_9,param_10)"
res = [(x.group("perf_name"), x.group("perf_params").split(",")) for x in p.finditer(s)]
print(res)
# => [('function_1', ['param_1', 'param_2', 'param_3']), ('function_2', ['param_4', 'param_5']), ('function_3', ['param_6']), ('function_4', ['']), ('function_5', ['param_7', 'param_8', 'param_9', 'param_10'])]
See the Python demo
The regex matches:
(?P<perf_name>\w*) - Group "perf_name": 0 or more alphanumeric/underscore characters
\( - a literal (
(?P<perf_params>\w*(?:,\w+)*) - Group "perf_params": 0+ word characters (\w*) followed with 0+ sequences of a comma and 1+ word characters
\) - closing ).
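Note that a call without arguments, such as function_4(), yields [''] in the result above, because ''.split(',') returns a list with one empty string. If an empty list is preferred, the two steps could be wrapped like this. The helper name parse_calls is hypothetical and this is only a sketch.

```python
import re

CALL_RX = re.compile(r'(?P<perf_name>\w*)\((?P<perf_params>\w*(?:,\w+)*)\)')

def parse_calls(expr):
    """Return (name, params) pairs; calls without arguments get an empty list."""
    result = []
    for m in CALL_RX.finditer(expr):
        params = m.group('perf_params')
        # ''.split(',') would give [''], so handle the empty case explicitly.
        result.append((m.group('perf_name'), params.split(',') if params else []))
    return result

print(parse_calls('function_1(param_1,param_2)+function_4()'))
```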
