Regex always matches last group clause - python

I have this string
AC7640 Montreal Trudeau (YUL) La Guardia/New York (LGA) E75 Business (P) Confirmed
I want it to match AC, 7640, YUL and LGA.
But I also want it to match when the last airport doesn't exist, for instance:
AC7640 Montreal Trudeau (YUL) E75 Business (P) Confirmed
AC, 7640 and YUL
I came up with this regex:
([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z])\s*([0-9]{1,4})(?:.*?\(([A-Z]{3})\)){1,2}
The problem is that it only captures one airport in both strings.
I'm using the Python flavor.

You need to "unroll" the pattern since the repeated capturing groups only store the last occurrence:
^([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z])\s*([0-9]{1,4}).*?\(([A-Z]{3})\)(?:.*?\(([A-Z]{3})\))?
See the regex demo. Also, note that the last part, (?:.*?\(([A-Z]{3})\))?, is wrapped in an optional non-capturing group, so it matches 1 or 0 occurrences. The ^ at the start makes the regex engine search from the beginning of the string only.
Details:
^ - start of string
([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z]) - Group 1: two uppercase letters, or an uppercase letter and a digit, or a digit and an uppercase letter
\s* - 0+ whitespaces
([0-9]{1,4}) - Group 2: one to four digits
.*? - any 0+ chars as few as possible up to the first...
\( - a (
([A-Z]{3}) - Group 3: three uppercase letters
\) - )
(?:.*?\(([A-Z]{3})\))? - a non-capturing group matching 1 or 0 (optional) occurrences of:
.*? - any 0+ chars as few as possible up to the first ....
\( - a (
([A-Z]{3}) - Group 4: three uppercase letters
\) - a ).
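As a quick check in Python, here is a minimal sketch that runs the unrolled pattern against both sample strings from the question. The variable names (unrolled, original, with_two, with_one) are illustrative only. The last two lines run the question's own pattern to show that a repeated capturing group keeps just its final capture.

```python
import re

# Unrolled pattern from the answer: the second airport gets its own optional group.
unrolled = re.compile(
    r'^([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z])\s*([0-9]{1,4})'
    r'.*?\(([A-Z]{3})\)(?:.*?\(([A-Z]{3})\))?'
)
# Pattern from the question, kept only for comparison.
original = re.compile(
    r'([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z])\s*([0-9]{1,4})(?:.*?\(([A-Z]{3})\)){1,2}'
)

with_two = 'AC7640 Montreal Trudeau (YUL) La Guardia/New York (LGA) E75 Business (P) Confirmed'
with_one = 'AC7640 Montreal Trudeau (YUL) E75 Business (P) Confirmed'

print(unrolled.match(with_two).groups())   # ('AC', '7640', 'YUL', 'LGA')
print(unrolled.match(with_one).groups())   # ('AC', '7640', 'YUL', None)

# The repeated group overwrites Group 3 on each iteration, so YUL is lost here.
print(original.search(with_two).group(3))  # LGA
print(original.search(with_one).group(3))  # YUL
```

When the second airport is absent, Group 4 is None, so the caller should check for that before using it.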

Related

Regex to match end of line or whitespace followed by wildcard characters

I have a string where I'm trying to match a city and state with a regular expression in Python. Some of the strings have a final country code that is preceded by a space. I'm having trouble writing a regular expression that matches all the cases, and captures the city in the first capture group, and the state in the second capture g
[^.*]?Born:.*in[^.](.*),[^.*](.*)
This is the regular expression that I have so far, and these are some example strings that I'm trying to match.
Born: November 8, 1961 in Chicago, Illinois
Born: February 19, 1995 in Sombor, Serbia rs
Born: May 19, 1976 in Greenville, South Carolina us
Based on my current regular expression this is my current output:
(Chicago) (Illinois)
(Sombor) (Serbia rs )
(Greenville) (South Carolina us)
Expected outputs would be
(Chicago) (Illinois)
(Sombor) (Serbia)
(Greenville) (South Carolina)
How can I account for this trailing string of a space and two characters? Any help would be greatly spp
Use
Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)
See regex proof.
EXPLANATION
Born: - matches the characters Born: literally (case sensitive)
.* - matches any character (except for line terminators), between zero and unlimited times, as many times as possible, giving back as needed (greedy)
in - matches the characters in literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([^,]*)
Match a single character not present in the list below [^,]* between zero and unlimited times, as many times as possible, giving back as needed (greedy)
, - matches the character , with index 44₁₀ (2C₁₆ or 54₈) literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (.*?)
.*? - matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=(?:\s[A-Za-z]{2})?$)
Assert that the Regex below matches
Non-capturing group (?:\s[A-Za-z]{2})?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
Match a single character present in the list below [A-Za-z]
{2} matches the previous token exactly 2 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line
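The pattern can be tried in Python roughly as follows. This is a small sketch using the three sample lines from the question; the names born, samples and results are made up for the illustration.

```python
import re

# Pattern from the answer: the lookahead leaves an optional trailing
# " xx" country code outside of Group 2.
born = re.compile(r'Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)')

samples = [
    'Born: November 8, 1961 in Chicago, Illinois',
    'Born: February 19, 1995 in Sombor, Serbia rs',
    'Born: May 19, 1976 in Greenville, South Carolina us',
]

# Each line is searched on its own, so $ means "end of that line's string".
results = [born.search(line).groups() for line in samples]
for city, state in results:
    print(f'({city}) ({state})')
```

Because the country code sits inside a lookahead, it is checked but not consumed, which is why it never ends up in the second group.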

Finding exact values associated with given word using regex in python

I am trying to find the values associated with a particular word using regex, but I'm not getting the expected results.
I wrote a pattern that works for the standard input only, and I want to do the same for all sorts of inputs.
What I have now:
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'''
Pattern which I wrote:
re.findall(r'{}=(.*)'.format(detected_word), search_query)[0].split(',')[0]
detected_word is a variable holding the left-hand side of the equals sign (WBC, RBC, ...), which I detect using another technique.
In the case above it works fine, but if I change the sentence pattern as shown below, I am unable to find a generic pattern.
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'''
string = r'''results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'''
No matter the string format, I am able to detect the words WBC, RBC and Hgb, but detecting the value associated with each word is what troubles me.
Could anyone please help me with this?
Thanks in advance
Here is an idea: use two separate patterns for the strings you provided as sample input, the first one will extract values coming after expected word= and the other will extract them from clauses of expected word1 + optional expected word2 + optional expected word3 + "to be" verb + value1, optional value2 and optional value3.
Pattern 1:
\b(WBC|RBC|Hgb)=(\S*)\b
See the regex demo.
\b(WBC|RBC|Hgb) - a whole word WBC, RBC or Hgb
= - a = char
(\S*)\b - Group 2: 0 or more non-whitespaces, that stops at last word boundary position
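A short Python sketch of Pattern 1, run against the second sample string from the question (the one mixing commas and "and"). The names pattern1, text and pairs are illustrative, not part of the original answer.

```python
import re

# Pattern 1 from the answer: word, "=", then non-whitespace chars
# trimmed back to the last word boundary (this drops a trailing comma).
pattern1 = re.compile(r'\b(WBC|RBC|Hgb)=(\S*)\b')

text = 'results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'

# findall returns one (word, value) tuple per match; dict() pairs them up.
pairs = dict(pattern1.findall(text))
print(pairs)  # {'WBC': '8.110*3', 'RBC': '3.3010*6', 'Hgb': '11.3gm/dL'}
```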
Pattern 2:
\b(WBC|RBC|Hgb)(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*(\d\S*)\b(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?
See regex demo.
\b(WBC|RBC|Hgb) - Group 1 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))? - an optional pattern:
(?:\s+and)? - an optional sequence of 1+ whitespaces and then and
(?:\s*,)? - an optional sequence of 0+ whitespaces and then a comma
\s+(WBC|RBC|Hgb) - 1+ whitespaces and Group 2 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - same as above, but with \s* (0+ whitespaces) before the word; captures the 3rd optional searched word into Group 3
\s+ - 1+ whitespaces
(?:is|(?:a|we)re|was|will\s+be) - a VERB, you may add more if you expect them to be at this position, or plainly try a \S+ or \w+ pattern instead
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma sequence, then 0+ whitespaces
(\d\S*)\b - Group 4 (pair it with Group 1 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - an optional group matching
(?:\s+and)? - an optional sequence of 1+ whitespaces and and
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma, then 0+ whitespaces
(\d\S*)\b - Group 5 (pair it with Group 2 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - same as above, with a capture group 6 that must be paired with Group 3.
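To see how the groups pair up, here is a hedged Python sketch that applies Pattern 2 to the third sample string from the question. The names pattern2, text and paired are assumptions made for the example; the pairing of Groups 1-3 with Groups 4-6 follows the description above.

```python
import re

# Pattern 2 from the answer, split over several lines for readability only.
pattern2 = re.compile(
    r'\b(WBC|RBC|Hgb)'
    r'(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?'
    r'\s+(?:is|(?:a|we)re|was|will\s+be)'
    r'(?:\s*,)?\s*(\d\S*)\b'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
)

text = 'results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'

m = pattern2.search(text)
groups = m.groups()
# Groups 1-3 are the words, Groups 4-6 the values; skip unmatched optional slots.
paired = {w: v for w, v in zip(groups[:3], groups[3:]) if w is not None and v is not None}
print(paired)  # {'WBC': '8.110*3', 'RBC': '3.3010*6', 'Hgb': '11.3gm/dL'}
```

In practice one might try Pattern 1 first and fall back to Pattern 2 only when it finds nothing, since the two patterns target different sentence shapes.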

Issues with regular expression and Named Reference

I have the following string:
"1 Compensation for the month Jan,2020 10 160 1600"
I would like to split the string into multiple groups using "Named Regular expression". I would like to split into the following groups:
'Index' : 1
'Description': 'Compensation for the month Jan,2020'
'HourlyRate': '10'
'TotalHours': '160'
'Total': '1600'
I used the following Regular expression:
(?P<Index>\w+)\s+(?P<Description>\w+)\s+(?P<HourlyRate>.+)\s+(?P<TotalHours>.+)\s+(?P<Total>)
Any idea how to accomplish this?
You may leverage the fact that the first and last three fields are number fields, thus, in the second field, you may match any amount of any chars:
^(?P<Index>\d+)\s+(?P<Description>.*?)\s+(?P<HourlyRate>\d+)\s+(?P<TotalHours>\d+)\s+(?P<Total>\d+)$
See the regex demo. If the numbers can have fractional parts, replace that \d+ pattern with \d+(?:\.\d+)? (or \d+(?:,\d+)? if you have a comma as a decimal separator).
Details
^ - start of string
(?P<Index>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<Description>.*?) - any 0+ chars other than line break chars, as few as possible
\s+ - 1+ whitespaces
(?P<HourlyRate>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<TotalHours>\d+) - 1+ digits
\s+ - 1+ whitespaces
(?P<Total>\d+) - 1+ digits
$ - end of string.
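In Python, the named groups can be read back with groupdict(). The following is a minimal sketch using the sample string from the question; the names row_re, line and fields are illustrative.

```python
import re

# Pattern from the answer: numeric fields anchor both ends, so the lazy
# Description group absorbs everything in between, spaces included.
row_re = re.compile(
    r'^(?P<Index>\d+)\s+(?P<Description>.*?)\s+'
    r'(?P<HourlyRate>\d+)\s+(?P<TotalHours>\d+)\s+(?P<Total>\d+)$'
)

line = '1 Compensation for the month Jan,2020 10 160 1600'

fields = row_re.match(line).groupdict()
print(fields['Description'])  # Compensation for the month Jan,2020
print(fields['Total'])        # 1600
```

All values come back as strings, so convert with int() where numbers are needed.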

regular expression of python

I am struggling to write a regular expression in Python.
For instance, I get the following right:
"GET /images/launch-logo.gif HTTP/1.0" 220 1839
is matched by
"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)
however I still need to include the following cases all together
"GET /history/history.html hqpao/hqpao_home.html
HTTP/1.0" 200 1502
"GET /shuttle/missions/missions.html Shuttle Launches from
Kennedy Space Center HTTP/1.0"200 8677
"GET /finger #net.com HTTP/1.0"404 -
obviously I should change the bold part of the expression
"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)
But how should I change it. I have one approach in mind which is change the bold part to
[\s |(\s*)(\S+) |(\S+)(12) |(\S+)]
where the 2nd, 3rd , 4th expression is the (1), (2), (3) extra cases I need to deal with.
But my expression does not work. What am I misunderstanding about regular expressions, given that I simply deal with them case by case?
This might be a bit messy, but it works:
\"(\S+) (\S+[\s\w\.\#]*)\s*(\S*)\"\s?(\d{3})\s(\S+)*
You can play with it on Regexr.
You may use
^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$
See the regex demo
Details
^ - start of a line (use re.M if you are reading the whole file into a variable, f.read())
" - a double quotation mark
([^\s"]+) - Group 1: one or more chars other than whitespace and a double quotation mark
\s+ - 1+ whitespaces
([^\s"]+) - Group 2: one or more chars other than whitespace and a double quotation mark
(?:\s+([^"]+?))? - an optional non-capturing group matching
\s+ - 1+ whitespaces
([^"]+?) - Group 3: any 1 or more chars other than ", as few as possible
\s+ - 1+ whitespaces
([A-Z]+/\d[\d.]*) - Group 4: 1+ uppercase letters, / and then 1 digit followed with any 0+ digits or . chars
" - a double quotation mark
\s* - 0+ whitespaces
(\d{3}) - Group 5: three digits
\s* - 0+ whitespaces
(\S+) - Group 6: 1 or more non-whitespace chars
$ - end of string.
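Here is a small Python sketch that runs this pattern over the four request lines from the question. It assumes each log entry sits on a single line (the wrapped samples are joined here); log_line, samples and parsed are names chosen for the example.

```python
import re

# Pattern from the answer. Group 3 (extra text between the path and the
# protocol) is optional and is None when absent.
log_line = re.compile(
    r'^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$'
)

samples = [
    '"GET /images/launch-logo.gif HTTP/1.0" 220 1839',
    '"GET /history/history.html hqpao/hqpao_home.html HTTP/1.0" 200 1502',
    '"GET /shuttle/missions/missions.html Shuttle Launches from Kennedy Space Center HTTP/1.0"200 8677',
    '"GET /finger #net.com HTTP/1.0"404 -',
]

parsed = [log_line.match(line).groups() for line in samples]
for row in parsed:
    print(row)
```

When matching against a whole file read into one string, compile with re.M and use finditer so that ^ and $ anchor at each line.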

Regex return match and extended matches

Can regex return matches and extended matches? What I mean is a single regex that can return a different number of found elements depending on the structure. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return that as well. But if it is not there, it should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Where match1 only matches the first occurrence and match2 only the second. What would be the regex to work in both cases?
The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
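A brief Python sketch of the pattern with re.findall, using the text from the question. The second half is a variation that is not part of the answer: it puts capturing groups back (with_groups is a hypothetical name) in case the individual fields are wanted, as in the expected output from the question.

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Pattern from the answer: no capturing groups, so findall returns whole matches.
whole = re.findall(r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?', s)
print(whole)  # ['AB : CDE / 123.456.1; 1', 'AC : DEF / 3.1.2']

# Variation (an assumption, not from the answer): capture each field and
# drop the optional last one when it did not participate in the match.
with_groups = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')
fields = [
    tuple(g for g in m.groups() if g is not None)
    for m in with_groups.finditer(s)
]
print(fields)  # [('AB', 'CDE', '123.456.1', '1'), ('AC', 'DEF', '3.1.2')]
```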
