Regex return match and extended matches - python

Can a regex return both matches and extended matches? What I mean is a single regex that returns a different number of found elements depending on the structure of the input. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return it as well. But if it is not there, the regex should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Here match1 only matches the first occurrence and match2 only the second. What regex would work in both cases?

The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
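If the separate fields are needed rather than the whole match, one option is to keep capturing groups and capture only the digits inside the optional part. This is a minimal sketch, assuming the two-line sample input from the question; the names FIELD_RE and rows are illustrative and not part of the original answer:

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Same structure as the pattern above, but each field is captured;
# the part after the semicolon stays optional.
FIELD_RE = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')

rows = []
for m in FIELD_RE.finditer(s):
    # An optional group that did not take part in the match is None: drop it.
    rows.append(tuple(g for g in m.groups() if g is not None))

for row in rows:
    print(row)
```

Each row then has three or four elements depending on whether the semicolon part was present.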

Related

regex doesn't seem to work on the input given as expected

My regex doesn't seem to work as expected. Can someone help me fix it?
import re
a = """
xyz # (.C (0),
.H (1)
)
mv [F-1:0] (/*AUTOINST*/
except_check
#(
.a (m),
.b (w),
.c (x),
.d (1),
.e (1)
)
data_check
(// Outputs
abc
#(
.a (b::c)
)
mask
(/*AUTOINST*/
"""
op = re.findall(r'^\s*(\w+)\s*$\n(?:^\s*[^\w\s].*$\n)*^\s*(\w+)\s*\(', a, re.MULTILINE)
for i in op:
    print(i)
This is the output I get:
('except_check', 'data_check')
('abc', 'mask')
This is the expected output:
('xyz', 'mv')
('except_check', 'data_check')
('abc', 'mask')
Somehow, the regex doesn't work for the first block of input but works fine for the other two blocks.
This works:
"(\w+)\s+#\s?(\D*\S*\D*\s*\d?\W+)\s*(\w+)"gm
You can simplify it further.
Here is a regex with the minimal changes:
^\s*(\w+)(?:\s*[^\w\s].*$\n)*^\s*(\w+)[^()]*\(
See the regex demo.
The \s*$\n(?:^\s*[^\w\s] part is replaced with (?:\s*[^\w\s], as your first block does not contain a line break.
At the end, \s*\( is replaced with [^()]*\( because there are chars other than whitespace between the word you want to extract and a ( char.
Details:
^ - start of a line (granted you use re.M)
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
(?:\s*[^\w\s].*$\n)* - zero or more occurrences of: zero or more whitespaces, a char other than a word or whitespace char, the rest of the line and an LF char
^ - start of a line
\s* - zero or more whitespaces
(\w+) - Group 2: one or more word chars
[^()]* - zero or more chars other than ( and )
\( - a ( char.
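The minimally changed pattern can be checked against the sample input with the standard re module. This is a sketch that assumes the input exactly as posted in the question (without indentation); the variable names pattern and pairs are illustrative:

```python
import re

a = """
xyz # (.C (0),
.H (1)
)
mv [F-1:0] (/*AUTOINST*/
except_check
#(
.a (m),
.b (w),
.c (x),
.d (1),
.e (1)
)
data_check
(// Outputs
abc
#(
.a (b::c)
)
mask
(/*AUTOINST*/
"""

# Group 1 is the first word of a block, Group 2 the word right before the "(".
pattern = r'^\s*(\w+)(?:\s*[^\w\s].*$\n)*^\s*(\w+)[^()]*\('
pairs = re.findall(pattern, a, re.MULTILINE)

for pair in pairs:
    print(pair)
```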
Or, I think you can leverage the recursion feature available in the PyPI regex module. Run pip install regex in the terminal/console and then use
import regex
a = 'your_string_here'
rx = r'^\s*(\w+)\s*#\s*(\((?:[^()]++|(?2))*\))\s*(\w+)'
matches = [(x.group(1), x.group(3)) for x in regex.finditer(rx, a, regex.M)]
Here is the regex demo. It matches:
^ - start of a line
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
\s*#\s* - a # enclosed with zero or more whitespaces
(\((?:[^()]++|(?2))*\)) - Group 2: a ( char, then any zero or more occurrences of any one or more chars other than ( and ) or Group 2 pattern, and then a )
\s* - zero or more whitespaces
(\w+) - Group 3: one or more word chars.

Finding exact values associated with given word using regex in python

I am trying to find the values associated with a particular word using regex, but I am not getting the expected results.
I wrote a pattern that works for standard input only, and I want to do the same for all sorts of inputs.
What I have now:
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'''
Pattern which I wrote:
re.findall(r'{}=(.*)'.format(detected_word), search_query)[0].split(',')[0]
detected_word is a variable holding the left-hand side of the equals sign (WBC, RBC, ...), which I detect using another technique.
In the case above it works fine, but if I change the sentence pattern as below, I am unable to find a generic pattern.
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'''
string = r'''results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'''
Whatever the string format, I am able to detect the words WBC, RBC and Hgb, but detecting the value for the associated word is what troubles me.
Could anyone please help me with this?
Thanks in advance
Here is an idea: use two separate patterns for the strings you provided as sample input. The first one will extract values coming after expected word=, and the other will extract them from clauses of the form expected word1 + optional expected word2 + optional expected word3 + "to be" verb + value1, optional value2 and optional value3.
Pattern 1:
\b(WBC|RBC|Hgb)=(\S*)\b
See the regex demo.
\b(WBC|RBC|Hgb) - a whole word WBC, RBC or Hgb
= - a = char
(\S*)\b - Group 2: 0 or more non-whitespaces, that stops at last word boundary position
Pattern 2:
\b(WBC|RBC|Hgb)(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*(\d\S*)\b(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?
See regex demo.
\b(WBC|RBC|Hgb) - Group 1 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - an optional pattern:
(?:\s+and)? - an optional sequence of 1+ whitespaces and then and
(?:\s*,)? - an optional sequence of 0+ whitespaces and then a comma
\s*(WBC|RBC|Hgb) - 0+ whitespaces and Group 2 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - same as above, captures the 3rd optional searched word into Group 3
\s+ - 1+ whitespaces
(?:is|(?:a|we)re|was|will\s+be) - a VERB, you may add more if you expect them to be at this position, or plainly try a \S+ or \w+ pattern instead
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma sequence, then 0+ whitespaces
(\d\S*)\b - Group 4 (pair it with Group 1 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - an optional group matching
(?:\s+and)? - an optional sequence of 1+ whitespaces and and
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma, then 0+ whitespaces
(\d\S*)\b - Group 5 (pair it with Group 2 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - same as above, with a capture group 6 that must be paired with Group 3.
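The two patterns can be combined in one helper that pairs each word with its value. This is only a sketch under the assumption that the input looks like the three sample strings from the question; extract_values, PATTERN_1 and PATTERN_2 are hypothetical names introduced here for illustration:

```python
import re

# Pattern 1: word=value pairs.
PATTERN_1 = re.compile(r'\b(WBC|RBC|Hgb)=(\S*)\b')

# Pattern 2, split over several lines for readability only.
PATTERN_2 = re.compile(
    r'\b(WBC|RBC|Hgb)'
    r'(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?'
    r'\s+(?:is|(?:a|we)re|was|will\s+be)'
    r'(?:\s*,)?\s*(\d\S*)\b'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
)

def extract_values(text):
    """Return a dict mapping each detected word to its value."""
    pairs = dict(PATTERN_1.findall(text))
    for m in PATTERN_2.finditer(text):
        # Groups 1-3 hold the words, Groups 4-6 the values, in the same order.
        for name, value in zip(m.group(1, 2, 3), m.group(4, 5, 6)):
            if name and value:
                pairs[name] = value
    return pairs

samples = [
    'results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL',
    'results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL',
    'results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL',
]
results = [extract_values(text) for text in samples]
for result in results:
    print(result)
```

Inputs phrased differently from these samples would likely need further patterns or a different verb list.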

Issues with regular expression and Named Reference

I have the following string:
"1 Compensation for the month Jan,2020 10 160 1600"
I would like to split the string into multiple groups using named groups in a regular expression. I would like to split it into the following groups:
'Index' : 1
'Description': 'Compensation for the month Jan,2020'
'HourlyRate': '10'
'TotalHours': '160'
'Total': '1600'
I used the following Regular expression:
(?P<Index>\w+)\s+(?P<Description>\w+)\s+(?P<HourlyRate>.+)\s+(?P<TotalHours>.+)\s+(?P<Total>)
Any idea how to accomplish this?
You may leverage the fact that the first field and the last three fields are numeric; thus, in the second field, you may match any amount of any chars:
^(?P<Index>\d+)\s+(?P<Description>.*?)\s+(?P<HourlyRate>\d+)\s+(?P<TotalHours>\d+)\s+(?P<Total>\d+)$
See the regex demo. If the numbers can have fractional parts, replace the relevant \d+ patterns with \d+(?:\.\d+)? (or \d+(?:,\d+)? if you have a comma as a decimal separator).
Details
^ - start of string
(?P<Index>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<Description>.*?) - any 0+ chars other than line break chars, as few as possible
\s+ - 1+ whitespaces
(?P<HourlyRate>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<TotalHours>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<Total>\d+) - 1+ digits
$ - end of string.
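In Python, the named groups can be read back as a dictionary with groupdict(). A minimal sketch using the sample string from the question; ROW_RE, line and fields are illustrative names:

```python
import re

line = "1 Compensation for the month Jan,2020 10 160 1600"

ROW_RE = re.compile(
    r'^(?P<Index>\d+)\s+(?P<Description>.*?)\s+'
    r'(?P<HourlyRate>\d+)\s+(?P<TotalHours>\d+)\s+(?P<Total>\d+)$'
)

m = ROW_RE.match(line)
# groupdict() maps each group name to the text it captured.
fields = m.groupdict() if m else {}
print(fields)
```

All values come back as strings, so convert them with int() where numbers are needed.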

Match arithmetic operators only one time foreach

I have to match the following type of strings:
HELLO
HELLO+2.20
HELLO*1.10
HELLO+2.12*2.99
HELLO*2.30+5.40
The plus and star operators can each appear only once (with their respective amount), so
HELLO+2.20+3.50
HELLO*2.11+1.25*9.99
HELLO*3.33*4.44
aren't valid matches
I tried this regex:
([A-Z]{2,12}(\*(\d+(?:\.\d{1,2})?))?(\+(\d+(?:\.\d{1,2})?))?)
but it only matches the star operator first and the plus operator last (both optional). This regex doesn't support this case:
HELLO+2.11*3.56
You might use an alternation | to match either of the 2 variations of + and *
^[A-Z]{2,12}(?:\+\d+\.\d{1,2}(?:\*\d+\.\d{1,2})?|\*\d+\.\d{1,2}(?:\+\d+\.\d{1,2})?)?$
In parts
^ Start of string
[A-Z]{2,12} Match 2-12 uppercase chars
(?: Non capturing group
\+\d+\.\d{1,2} Match + 1+ digits . and 1-2 digits
(?:\*\d+\.\d{1,2})? Optionally match the same as previous starting with *
| Or
\*\d+\.\d{1,2} Match * 1+ digits . and 1-2 digits
(?:\+\d+\.\d{1,2})? Optionally match the same as previous starting with +
)? Close group and make it optional to also match only the 2-12 uppercase chars
$ End of string
Regex demo
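A quick way to check the pattern against the samples from the question is sketched below; the names VALID_RE, candidates and valid are illustrative:

```python
import re

VALID_RE = re.compile(
    r'^[A-Z]{2,12}'
    r'(?:\+\d+\.\d{1,2}(?:\*\d+\.\d{1,2})?'   # + first, optional * after it
    r'|\*\d+\.\d{1,2}(?:\+\d+\.\d{1,2})?)?$'  # or * first, optional + after it
)

candidates = [
    'HELLO',
    'HELLO+2.20',
    'HELLO*1.10',
    'HELLO+2.12*2.99',
    'HELLO*2.30+5.40',
    'HELLO+2.20+3.50',
    'HELLO*2.11+1.25*9.99',
    'HELLO*3.33*4.44',
]

valid = [c for c in candidates if VALID_RE.match(c)]
print(valid)
```

Unlike the pattern in the question, this one requires a fractional part in each amount.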
A more straightforward alternative with builtin features (without regex search):
test_str = '''
HELLO+2.20
HELLO*1.10
HELLO+2.12*2.99
HELLO*2.30+5.40
HELLO+2.20+3.50
HELLO*2.11+1.25*9.99
HELLO*3.33*4.44'''
valid_strings = [s for s in test_str.splitlines()
                 if s and s.count('+') < 2 and s.count('*') < 2]
print(valid_strings)
The output:
['HELLO+2.20', 'HELLO*1.10', 'HELLO+2.12*2.99', 'HELLO*2.30+5.40']

multiple split in string using regex

I have a string:
Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2
I want to split this string so that it looks like this:
'Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8' 'StaMAC:00:9F:0B:00:38:B8' 'BSSID:00 9F' 'Radioid:2'
I tried this logic, msgRegex = re.compile('[\w\s]+:'), and the split function as well.
How can I do this? Please help. Thank you.
From what I see, you have a problem when you have a whitespace inside the matches with hex values.
Because of that, I believe you cannot use a splitting approach here. Match your tokens with a regex like
(?<!\S)\b([^:]+):((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+)\b
See the regex demo
Python code:
import re
rx = r"(?<!\S)\b([^:]+):((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+)\b"
ss = ["Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2",
"Station Deassoc:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.5 StaMac1:40:83:DE:34:04:75 StaMac2:40:83:DE:34:04:75 UserName:4083de340475 StaMac3:40:83:DE:34:04:75 VLANId:1 Radioid:2 SSIDName:Devices SessionDuration:12 APID:CN58G6749V AP Name:1023-noida-racking-zopnow BSSID:BC:EA:FA:DC:A6:F1"]
for s in ss:
    matches = re.findall(rx, s)
    print(matches)
Result:
[('Station Disconnect', '1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8'), ('StaMAC', '00:9F:0B:00:38:B8'), ('BSSID', '00 9F'), ('Radioid', '2')]
[('Station Deassoc', '1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.5'), ('StaMac1', '40:83:DE:34:04:75'), ('StaMac2', '40:83:DE:34:04:75'), ('UserName', '4083de340475'), ('StaMac3', '40:83:DE:34:04:75'), ('VLANId', '1'), ('Radioid', '2'), ('SSIDName', 'Devices'), ('SessionDuration', '12'), ('APID', 'CN58G6749V'), ('AP Name', '1023-noida-racking-zopnow'), ('BSSID', 'BC:EA:FA:DC:A6:F1')]
NOTE: If you need no tuples in the result, remove the capturing parentheses from the pattern.
Pattern details:
(?<!\S)\b - start of string or whitespace followed with a word boundary (next char must be a letter/digit or _)
([^:]+) - Capturing group #1: 1+ chars other than :
: - a colon
((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+) - Capturing group 2 matching one or more occurrences of:
[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})* - 2 hex chars followed with zero or more occurrences of a space or : and 2 hex chars
| - or
\S - a non-whitespace char
\b - trailing word boundary.
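If a lookup by field name is more convenient than a list of tuples, the (name, value) pairs returned by re.findall can be passed to dict(). A small sketch using the first sample string; note that it assumes field names are unique within one message, since a repeated name would overwrite the earlier value:

```python
import re

rx = r"(?<!\S)\b([^:]+):((?:[a-fA-F0-9]{2}(?:[ :][a-fA-F0-9]{2})*|\S)+)\b"
s = ("Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 "
     "StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2")

# Each match is a (name, value) tuple, which dict() accepts directly.
fields = dict(re.findall(rx, s))

print(fields["StaMAC"])
print(fields["BSSID"])
```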
In this particular case you can implement it like so:
import re
a = 'Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8 StaMAC:00:9F:0B:00:38:B8 BSSID:00 9F Radioid:2'
print(re.split(r'(?<=[A-Z0-9]) (?=[A-Z])', a))
Output:
['Station Disconnect:1.3.6.1.4.1.11.2.14.11.15.2.75.3.2.0.8', 'StaMAC:00:9F:0B:00:38:B8', 'BSSID:00 9F', 'Radioid:2']
Regex:
(?<=[A-Z0-9]) - Positive lookbehind for A-Z or 0-9
(a literal space) - 1 space character
(?=[A-Z]) - Positive look ahead for A-Z
