Finding exact values associated with given word using regex in python - python

I am trying to find the values associated with a particular word using regex, but I am not getting the expected results.
I wrote a pattern that works only for the standard input, and I want to do the same for all sorts of inputs.
What I have now:
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'''
Pattern which I wrote:
re.findall(r'{}=(.*)'.format(detected_word), search_query)[0].split(',')[0]
detected_word is a variable holding the left-hand side of the equals sign (WBC, RBC, ...), which I detect using another technique.
In the case above it works fine, but if I change the sentence structure as shown below, I am unable to find a generic pattern.
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'''
string = r'''results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'''
Regardless of the string format, I am able to detect the words WBC, RBC and Hgb, but extracting the value associated with each word is the problem.
Could anyone please help me with this?
Thanks in advance

Here is an idea: use two separate patterns for the sample strings you provided. The first extracts values that come right after expected word=, and the second extracts them from clauses of the form expected word1 + optional expected word2 + optional expected word3 + "to be" verb + value1, optional value2 and optional value3.
Pattern 1:
\b(WBC|RBC|Hgb)=(\S*)\b
See the regex demo.
\b(WBC|RBC|Hgb) - a whole word WBC, RBC or Hgb
= - a = char
(\S*)\b - Group 2: 0 or more non-whitespace chars, stopping at the last word boundary position
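As a minimal sketch of how Pattern 1 might be used from Python, the snippet below runs it over the two word=value sample strings from the question. The helper name extract_pairs is hypothetical and only serves the illustration.

```python
import re

# Pattern 1 from the answer: a known word, "=", then the value up to
# the last word boundary before the next whitespace.
PATTERN_1 = re.compile(r'\b(WBC|RBC|Hgb)=(\S*)\b')

samples = [
    "results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL",
    "results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL",
]


def extract_pairs(text):
    """Return a {word: value} dict built from all (word, value) matches."""
    return dict(PATTERN_1.findall(text))


results = [extract_pairs(sample) for sample in samples]
for result in results:
    print(result)
```

The trailing \b is what drops the comma after each value, since there is no word boundary between a comma and the following space.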
Pattern 2:
\b(WBC|RBC|Hgb)(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*(\d\S*)\b(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?
See regex demo.
\b(WBC|RBC|Hgb) - Group 1 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - an optional pattern:
(?:\s+and)? - an optional sequence of 1+ whitespaces and then and
(?:\s*,)? - an optional sequence of 0+ whitespaces and then a comma
\s*(WBC|RBC|Hgb) - 0+ whitespaces and Group 2 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - same as above, captures the 3rd optional searched word into Group 3
\s+ - 1+ whitespaces
(?:is|(?:a|we)re|was|will\s+be) - a VERB, you may add more if you expect them to be at this position, or plainly try a \S+ or \w+ pattern instead
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma sequence, then 0+ whitespaces
(\d\S*)\b - Group 4 (pair it with Group 1 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - an optional group matching
(?:\s+and)? - an optional sequence of 1+ whitespaces and and
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma, then 0+ whitespaces
(\d\S*)\b - Group 5 (pair it with Group 2 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - same as above, with a capture group 6 that must be paired with Group 3.
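The pairing of Groups 1-3 with Groups 4-6 could be sketched in Python as follows. The pattern is the one given above, split across several string literals only for readability; the helper name pair_words_with_values is hypothetical.

```python
import re

# Pattern 2 from the answer, split into pieces for readability.
PATTERN_2 = re.compile(
    r'\b(WBC|RBC|Hgb)'                                # Group 1: first word
    r'(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?'      # Group 2: optional word
    r'(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?'      # Group 3: optional word
    r'\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*' # the verb
    r'(\d\S*)\b'                                      # Group 4: first value
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'          # Group 5: optional value
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'          # Group 6: optional value
)


def pair_words_with_values(text):
    """Pair Groups 1-3 (words) with Groups 4-6 (values), skipping empty slots."""
    m = PATTERN_2.search(text)
    if m is None:
        return {}
    words = m.group(1, 2, 3)
    values = m.group(4, 5, 6)
    return {w: v for w, v in zip(words, values) if w is not None and v is not None}


text = "results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL"
pairs = pair_words_with_values(text)
print(pairs)
```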

Related

How to match part of the string contains certain word

I want to be able to match the following texts
top of radio 54 / bottom 27
radio top 54 / bottom 27
The word top can be before or after radio only for the first half of the text (before /). top that appears after / should not be matched.
I tried to use the following pattern that encloses the lookahead for the first half with parentheses.
((?=top).*radio\s\d{2}\s)\/\D*bottom\s\d{2}
But it doesn't work as I expected: it only matches the first string, not the second.
You may use this regex with a lookahead:
^(?=[^/]*\btop\s)[^/]*radio[^/]+\d{2}\s+/\D*bottom\s+\d{2}$
RegEx Demo
RegEx Breakup:
^: Start
(?=[^/]*\btop\s): Lookahead asserting that the word top appears before the first /. This ensures top is only accepted in the first half of the text
[^/]*: Match 0 or more of any char that is not a /
radio: Match radio
[^/]+: Match 1+ of any char that is not a /
\d{2}: Match 2 digits
\s+: Match 1+ whitespaces
/: Match a /
\D*: Match 0 or more non-digits
bottom: Match bottom
\s+: Match 1+ whitespace
\d{2}: Match 2 digits
$: End
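A short check of the breakdown above, as a sketch: the first two candidates are the strings from the question, and the third is a hypothetical negative case in which top only appears after the /.

```python
import re

PATTERN = re.compile(
    r'^(?=[^/]*\btop\s)[^/]*radio[^/]+\d{2}\s+/\D*bottom\s+\d{2}$'
)

candidates = [
    "top of radio 54 / bottom 27",  # top before radio
    "radio top 54 / bottom 27",     # top after radio, still before the /
    "radio 54 / top bottom 27",     # hypothetical: top only after the /
]

# The lookahead cannot cross a "/", so the third candidate is rejected.
matched = [PATTERN.search(candidate) is not None for candidate in candidates]
print(matched)
```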

Not able get desired output after string parsing through regex

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this matches the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
For the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
An example is below:
https://regex101.com/r/pAduvy/1
You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match 6: and then any of the chars listed in the character class, as few times as possible
) Close group 1
(?::\d+:[a-z]+)? optionally match the part at the end that you don't want to include
$ End of string
Regex demo
Note: not sure if it is intended, but the last part of your pattern, [\/-:], denotes a range from ASCII 47 (/) to 58 (:), which also includes the digits 0-9.
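A sketch of the capturing-group approach in Python, applied to the two input lines from the question. It also checks the point about the [\/-:] range by testing a single digit against that character class.

```python
import re

lines = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

# Group 1 is lazy, so it stops as soon as the optional ":digits:letters"
# tail followed by the end of the string can match.
PATTERN = re.compile(r'(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$')

prefixes = [PATTERN.search(line).group(1) for line in lines]
for prefix in prefixes:
    print(prefix)

# [\/-:] is a range from "/" to ":", so it matches a digit as well.
range_matches_digit = re.fullmatch(r'[\/-:]', '5') is not None
print(range_matches_digit)
```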
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+ Match 6:, then twice a / followed by 1+ word chars, and then a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match a colon, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
Regex demo
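The more precise pattern returns the wanted text as the whole match, so no group is needed. A minimal sketch follows; the helper name extract_reference and the third, non-matching input are hypothetical.

```python
import re

PRECISE = re.compile(
    r'6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?'
)


def extract_reference(line):
    """Return the matched prefix, or None when the line has another shape."""
    m = PRECISE.search(line)
    return m.group(0) if m else None


extracted = [
    extract_reference("6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh"),
    extract_reference("6:/BENM/Tabuler Trading/REM//IMP/2020-341"),
    extract_reference("6:/BENM/REM"),  # hypothetical line that does not fit
]
print(extracted)
```

The trade-off is that this pattern encodes the exact field layout, so lines with a different number of segments are not matched at all.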

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value, e.g. 100.00. This works for lines where Zero Rated is present.
However, I also want to handle lines where Zero Rated is not present but % is, as in the first example above. I tried to use | for that, but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is operator precedence: | has the lowest precedence, so it splits the entire regex into two alternatives, (.*)Zero.*Rated and %\s*(\S*). You can group part of a regex without creating an additional capturing group by using (?: ).
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
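To make the precedence point concrete, here is a small sketch comparing the ungrouped and grouped patterns on the two example lines from the question. Note that the grouping only fixes which groups get filled: on the % line, group 2 still captures the text right after %, not the trailing amount.

```python
import re

pct_line = "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00"
zero_line = "Unpacking/Unremoval fee Zero Rated 100.00"

ungrouped = re.compile(r'(.*)Zero.*Rated|%\s*(\S*)')
grouped = re.compile(r'(.*)(?:Zero.*Rated|%)\s*(\S*)')

# Ungrouped: only the right-hand alternative matches, so group 1 is None.
ungrouped_pct = ungrouped.search(pct_line).groups()

# Grouped: both groups are filled on both kinds of line.
grouped_zero = grouped.search(zero_line).groups()
grouped_pct = grouped.search(pct_line).groups()

print(ungrouped_pct)
print(grouped_zero)
print(grouped_pct)
```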
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few times as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
Regex demo | Python demo
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
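The question asks for numeric values, while the output above holds strings. One way to convert them, as a sketch that assumes commas are only ever thousands separators, is to strip the commas before calling float. The Subtotal line is a hypothetical input added to show that lines without Zero Rated or a percentage part are skipped.

```python
import re

regex = (r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})"
         r"\s*(\d{1,3}(?:,\d{3})*\.\d{2})")

test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
            "Unpacking/Unremoval fee Zero Rated 100.00\n"
            "Subtotal 1,900.00")  # hypothetical line with neither marker

# Assumption: "," is a thousands separator, so it can simply be dropped.
amounts = {
    description: float(amount.replace(",", ""))
    for description, amount in re.findall(regex, test_str, re.MULTILINE)
}
print(amounts)
```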

Python RegEx: Not capturing all the data (python3.6, scrapy)

I was trying to scrape length information from a website using the following simple code:
list = re.findall(r'(?<=Length:\s\s)[:\d]+', response.text)
if len(list) > 0:
    data['Length'] = list[0]
else:
    data['Length'] = '00:00'
However, it only gets the information if the length is less than one hour. For example, it gets 51:00 but not 01:08:47. I checked the page source for lengths both shorter and longer than one hour, and it seems that for lengths over one hour there is one less whitespace character. So I tried the version below, but this time list only contains whitespace. Does anybody know how to get both short and long lengths? Thank you very much!
list = re.findall(r'(?<=Length:)[\s:\d]+', response.text)
if len(list) > 0:
    data['Length'] = list[0]
else:
    data['Length'] = '00:00'
You need '(?<=Length:)\s*(\d\d[\s*:\s*\d\d]+)'.
Try this Regex and extract whatever is present in group 1:
Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})
Click for Demo
Explanation:
Length\s* - matches Length literally followed by 0+ occurrences of a white-space, as many as possible
:\s* - matches a : followed by 0+ white-spaces
\d+\s* - matches 1+ occurrences of a digit followed by 0+ white-spaces. We start capturing the text from here in Group 1. We capture until the end of the match.
(?::\s*\d+\s*){1,2} - matches either 1 or 2 occurrences of the pattern (?::\s*\d+\s*)
(?:) - indicates a non-capturing group
:\s* - matches a : followed by 0+ occurrences of a white-space
\d+ - matches 1+ occurrences of a digit
\s* - matches 0+ occurrences of a white-space
Alternative regex (without any group; the fixed-width lookbehind requires exactly two whitespace chars after Length:):
(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}
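A sketch of the group-based pattern in use. The page fragments are hypothetical stand-ins for response.text, with two spaces after Length: for the short duration and one for the long one, as described in the question. Group 1 can include trailing whitespace, so the helper (a hypothetical name) strips it.

```python
import re

LENGTH_RE = re.compile(r'Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})')

# Hypothetical fragments standing in for response.text.
pages = [
    "<li>Length:  51:00 </li>",
    "<li>Length: 01:08:47 </li>",
]


def extract_length(html, default='00:00'):
    """Return the captured length without trailing whitespace, or a default."""
    m = LENGTH_RE.search(html)
    return m.group(1).strip() if m else default


lengths = [extract_length(page) for page in pages]
missing = extract_length("<li>No duration here</li>")
print(lengths, missing)
```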

Regex return match and extended matches

Can a regex return matches and extended matches? What I mean is a single regex that can return a different number of found elements depending on the structure. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return that as well. But if it is not there, the regex should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Here match1 only matches the first occurrence and match2 only the second. What would be the regex to work in both cases?
The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
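A sketch of the suggested pattern in use, returning one whole match per line. The second pattern is a variation that is not part of the answer: it assumes the separate fields listed in the question are wanted, so it restores the capturing groups and keeps only the semicolon part optional.

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Pattern from the answer: whole matches only, optional "; digits" tail.
WHOLE = re.compile(r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?')
whole_matches = WHOLE.findall(s)
print(whole_matches)

# Variation (assumption): capture each field, with an optional 4th group.
FIELDS = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')
fields = [
    tuple(group for group in m.groups() if group is not None)
    for m in FIELDS.finditer(s)
]
print(fields)
```

Filtering out None drops the fourth field on lines that have no value after a semicolon, which gives a different number of elements per line.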
