Regex match only if multiple patterns found (python) - python

I'm trying to extract data from sentences such as:
"monthly payment of 525 and 5000 drive off"
using a python regex search function: re.search()
My regex query string is as follows for down payment:
match1 = r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
My problem is that it matches the wrong numerical value as the down payment: it picks up both 525 and 5000.
How can I improve my regex string such that it only matches an element if another element is successfully matched as well?
In this case, for example, both 5000 and drive-off matched, so we can extract 5000 as down_payment, but 525 was not followed by any of the down payment keywords, so it should not even be considered.

The point is that you want to match a sequence of patterns. To make sure the trailing patterns are taken into account, they cannot all be optional. Note that each of \s*, (|\$|dollars*|money)*, \s*, and (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can match an empty string, so a bare number such as 525 already satisfies the whole pattern.
I suggest removing the final * quantifier to match exactly one occurrence of the pattern:
(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)
See the regex demo
Also note that I contracted a (\s|-) group into a character class [\s-] as you only alternate single char patterns, and also turned (|\$|dollars*|money)* into a non-capturing optional group (?:\$|dollars*|money)? that matches just 1 or 0 occurrences of $, dollar(s) or money.
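As a minimal sketch of how the corrected pattern behaves with re.search(), the snippet below wraps it in a helper. The names DOWN_PAYMENT_RE and extract_down_payment are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above: the trailing keyword group is mandatory,
# so a number only matches when a down payment keyword follows it.
DOWN_PAYMENT_RE = re.compile(
    r"(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*"
    r"(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)"
)


def extract_down_payment(sentence):
    """Return the down payment as a string, or None when no keyword follows a number."""
    match = DOWN_PAYMENT_RE.search(sentence)
    return match.group("down_payment") if match else None


print(extract_down_payment("monthly payment of 525 and 5000 drive off"))  # 5000
print(extract_down_payment("monthly payment of 525"))                     # None
```

Because 525 is followed by "and" rather than one of the keywords, the search skips it and only succeeds at 5000.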

Related

Regex pattern for tweets

I am building a tweet classification model, and I am having trouble finding a regex pattern that fits what I am looking for.
what I want the regex pattern to pick up:
Any hashtags used in the tweet but without the hash mark (example - #omg to just omg)
Any mentions used in the tweet but without the @ symbol (example - @username to just username)
I don't want any numbers or any words containing numbers returned (this is the most difficult task for me)
Other than that, I just want all words returned
Thank you in advance if you can help
Currently I am using this pattern: r"(?u)\b\w\w+\b" but it is failing to remove numbers.
This regex should work.
(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)
Explanation:
(#|@)?: A 'hash' or 'at' character. Match 0 or 1 times.
(?!): Negative lookahead. Check ahead to see if the pattern in the brackets matches. If so, negate the match.
[^ ]*\d[^ ]*: any number of non-space characters, followed by a digit, followed by any number of non-space characters. This is nested in the negative lookahead, so if a number is found in the username or hashtag, the match is negated.
([^ ]+): One or more not space characters. The negative lookahead is a 0-length match, so if it passes, fetch the rest of the username/hashtag (Grouped with brackets so you can replace with $2).
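A rough sketch of how this could be used from Python with re.findall(). Two small changes are assumptions of this example rather than part of the answer: the leading (#|@)? is made non-capturing so that findall returns plain strings instead of tuples, and the redundant trailing [^ ]* inside the lookahead is dropped. The name extract_words is hypothetical.

```python
import re

# Optional leading # or @ (not captured), then a run of non-space characters
# that is only accepted when no digit appears anywhere in it.
WORD_RE = re.compile(r"(?:#|@)?(?![^ ]*\d)([^ ]+)")


def extract_words(tweet):
    """Return the words of a tweet, minus #/@ prefixes and minus tokens containing digits."""
    return WORD_RE.findall(tweet)


print(extract_words("#omg @username won 100 tickets to gr8 show"))
# ['omg', 'username', 'won', 'tickets', 'to', 'show']
```

Note that tokens are split on spaces only, so punctuation attached to a word stays attached.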

Regex - Exclude pattern if a certain word appears after the desired word

I want my regex to match the appearance of a certain word, except if it is followed by another specific word.
More specifically, I would like it to match "union" (in the sense of union or loyalty to a group, so it would not include words like "reunion", i.e. with word boundaries at the beginning and end of the string) in all cases, except when the string says "union europea" (which is understood as an administration and does not appeal to a group in the same way).
Using the pattern union\b does not help, because it would also match the aforementioned sentence.
You can use a negative lookahead:
pattern = r'\W(union)\W(?!europea)'
As pointed out by @Michael Ruth, you probably don't want to capture words other than union. So, with some test data:
unionize
union
union europea
reunion
This pattern only captures union in the second case (i.e., it does not capture reunion or unionize). The \W tokens match non-word characters, so words with additional letters (like reunion and unionize) are not matched.
Use
pattern = r'\bunion\b(?!\W*europea)'
(?!\W*europea) excludes matches where union is followed with nonword characters (if any) and then europea string.
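A quick check of this pattern against the test data used earlier, as a small sketch. The helper name mentions_union is made up for this example.

```python
import re

# Whole word "union", not followed by optional non-word characters and "europea".
UNION_RE = re.compile(r"\bunion\b(?!\W*europea)")


def mentions_union(text):
    """Return True when 'union' appears as a whole word and is not followed by 'europea'."""
    return UNION_RE.search(text) is not None


for sample in ["unionize", "union", "union europea", "reunion"]:
    print(sample, "->", mentions_union(sample))
# unionize -> False
# union -> True
# union europea -> False
# reunion -> False
```

Unlike the \W-based pattern, \b also matches at the very start and end of the string, so no surrounding character is required.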

Regex (python) to match same group several times only when preceded or followed by specific pattern

Suppose I have the following text:
Products to be destroyed: «Prabo», «Palox 2000», «Remadon strong» (Rule). The customers «Dilora» and «Apple» has to be notified.
I need to match every string within the «» quotes but ONLY in the period starting with the "Products to be destroyed:" pattern or ending with the (Rule) pattern.
In other words in this example I do NOT want to match Dilora nor Apple.
The regex to get the quoted contents in the capturing group is:
«(.+?)»
Is it possible to "anchor" it to either a following pattern (such as Rule) or even to a prior pattern (such as "Products to be destroyed:")?
This is my saved attempt on regex101
Thank you very much.
You can first match the whole sentence, requiring at least one part between the «» quotes, and when there is a match, extract all the parts from it, using re.findall for example.
The example sentence appears to end at a dot. In that case, you can use the negated character class [^.] to match any character except a dot, so the match stays within that sentence.
Regex demo for at least a single match, and another demo to match the separate parts afterwards
import re
regex = r"\bProducts to be destroyed:[^.]*«[^«»]*»[^.]*\."
s = 'Products to be destroyed: «Prabo», «Palox 2000», «Remadon strong» (Rule). The customers «Dilora» and «Apple» has to be notified.'
result = re.search(regex, s)
if result:
    print(re.findall(r"«([^«»]*)»", result.group()))
Output
['Prabo', 'Palox 2000', 'Remadon strong']

python - regex putting repeating patterns in a single group

I'm trying to parse a string in regex and am 99% there.
my test string is
1
1234 1111 5555 88945
172.255.255.255 from 172.255.255.255 (1.1.1.1)
Origin IGP, localpref 300, valid, external, best
rx pathid: 0, tx pathid: 0x0
my current regex pattern is:
(?P<as_path>(\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>\S+,\s{0,4})
I'm using regex101 to test and have a link to the test here: https://regex101.com/r/iGM8ye/1
First, I currently get a Group 2 that I don't want. Could someone tell me why I'm getting this group and how to remove it?
Second, in the attributes I want to match the words "valid, external, best", but my pattern currently only matches "valid,". I thought adding the repeat inside the group would have matched all three, but it hasn't.
How would I match the repeated "string, string, string," (string, comma, space) into one group?
Thanks
EDIT
Desired output
as_path : 1234 1111 5555 88945
peer_addr : 172.255.255.255
peer_rid : 1.1.1.1
local_pref : 300
attribs : valid, external, best
attribs may also be just valid, external, or just external, or another entry in the format (string, comma, space)
Try Regex: (?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>[\S]+,(?: [\S]+,?)*){0,4}
Demo
The regex in the question had a capturing group (Group 2) for (\d{4,10}\s). It is now changed to a non-capturing group: (?:\d{4,10}\s)
See regex in use here.
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}(?:\.\d{0,3}){3}).*\((?P<peer_rid>\d{0,3}(?:\.\d{0,3}){3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You were getting group 2 because your as_path group contained a group. I changed that to a non-capturing group.
I changed attribs to \S+(?:,\s+\S+){2}
This will match any non-whitespace character one or more times (\S+), followed by the following exactly twice:
,\s+\S+ the comma character, followed by a whitespace character one or more times, followed by any non-whitespace character one or more times
I changed peer_addr and peer_rid to \d{0,3}(?:\.\d{0,3}){3} instead of \d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}. This is a preference, but shortens the expression.
Without that last modification, you can use the following regex (it performs slightly better anyway, as seen here):
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You can also improve the performance by using more specific tokens as the following suggests (notice I also added the x modifier to make it more legible) and as seen here:
(?P<as_path>\d{4,10}(?:\s\d{4,10}){0,19})\s+
(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})[^)]*
\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+
.*localpref\s(?P<local_pref>\d+),\s+
(?P<attribs>\w+(?:,\s+\w+){2})
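For reference, here is a sketch of how the verbose pattern above might be used from Python with the re.VERBOSE flag, run against the test string from the question. The names ROUTE_RE and SAMPLE are illustrative only, and the sample text is assumed to have no leading indentation.

```python
import re

# Same pattern as above; re.VERBOSE lets it span several lines.
ROUTE_RE = re.compile(
    r"""
    (?P<as_path>\d{4,10}(?:\s\d{4,10}){0,19})\s+
    (?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})[^)]*
    \((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+
    .*localpref\s(?P<local_pref>\d+),\s+
    (?P<attribs>\w+(?:,\s+\w+){2})
    """,
    re.VERBOSE,
)

SAMPLE = """1
1234 1111 5555 88945
172.255.255.255 from 172.255.255.255 (1.1.1.1)
Origin IGP, localpref 300, valid, external, best
rx pathid: 0, tx pathid: 0x0
"""

match = ROUTE_RE.search(SAMPLE)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(name, ":", value)
```

Since every group in the pattern is either named or non-capturing, groupdict() returns exactly the five desired fields and no stray Group 2.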
You get that separate group because you are repeating a capturing group, and a repeated capturing group keeps only the value of its last iteration, in this case 88945. You could make it non-capturing instead with (?:...).
For the second part, you could use an alternation to match exactly one of the options: (?:valid|external|best)
Your pattern might look like:
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>(?:valid|external|best)(?:,\s{0,4}(?:valid|external|best))+)
regex101 demo

Regex, capture using word boundaries without stopping at "dot" and/or other characters

Given for example a string like this:
random word, random characters##?, some dots. username bob.1234 other stuff
I'm currently using this regex to capture the username (bob.1234):
\busername (.+?)(,| |$)
But my code needs a regex with only one capture group as python's re.findall returns something different when there are multiple capture groups. Something like this would almost work, except it will capture the username "bob" instead of "bob.1234":
\busername (.+?)\b
Anybody knows if there is a way to use the word boundary while ignoring the dot and without using more than one capture group?
NOTES:
Sometimes there is a comma after the username
Sometimes there is a space after the username
Sometimes the string ends with the username
The \busername (.+?)(,| |$) pattern contains 2 capturing groups, and re.findall will return a list of tuples once a match is found. See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, there are three approaches here:
Use a (?:...) non-capturing group rather than the capturing one: re.findall(r'\busername (.+?)(?:,| |$)', s). It will consume a , or space, but since only captured part will be returned and no overlapping matches are expected, it is OK.
Use a positive lookahead instead: re.findall(r'\busername (.+?)(?=,| |$)', s). The space and comma will not be consumed, that is the only difference from the first approach.
You may turn the (.+?)(,| |$) into a simple negated character class [^ ,]+ that matches one or more chars other than a space or comma. It will match till end of string if there are no , or space after username.
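The three approaches can be compared side by side in a short sketch. The helper names below are made up for illustration, and the third pattern wraps the character class in a single capturing group so that re.findall still returns only the username.

```python
import re

s = "random word, random characters##?, some dots. username bob.1234 other stuff"


def with_non_capturing_group(text):
    # The delimiter is consumed but not captured.
    return re.findall(r"\busername (.+?)(?:,| |$)", text)


def with_lookahead(text):
    # The delimiter is only checked, not consumed.
    return re.findall(r"\busername (.+?)(?=,| |$)", text)


def with_negated_class(text):
    # Grab everything up to the next space or comma (or the end of the string).
    return re.findall(r"\busername ([^ ,]+)", text)


for extract in (with_non_capturing_group, with_lookahead, with_negated_class):
    print(extract.__name__, extract(s))
# with_non_capturing_group ['bob.1234']
# with_lookahead ['bob.1234']
# with_negated_class ['bob.1234']
```

All three keep a single capturing group, so re.findall returns a flat list of strings rather than a list of tuples.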
