Guaranteed method to match longest string in regular expression alternation [duplicate] - python

This question already has answers here:
How to extract longest of overlapping groups?
(4 answers)
Closed 4 years ago.
For some reason I need to generate the regular expression from some arbitrary list by using alternations.
Let's say the user can input "cat", "dog" and "!#[]"; it will generate "cat|dog|!#\\[\\]".
The question is: can I make the regex match the longest term when several of the inputs share a common prefix?
For example:
"god", "godspeed", "godzilla" will generate "god|godspeed|godzilla"
I want it to match the longest term if there are several candidates. That is, it should match "godspeed" rather than "god" when I use re.finditer() on the string "godspeeding".
I have tried this in Python 3.7.1, and it seems to report matches according to the order of the alternatives in the regular expression. If this is always true, I can just sort the inputs by length, longest first, before converting them to a regular expression.
However, I cannot find any documentation about this behavior, and I am not sure whether it will stay unchanged in the future.

From the docs:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted.
This is specified behavior and is unlikely to change in the future. You should be fine sorting the terms by length, longest first, and building the alternation from the sorted list.
Does this answer your question?
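As a minimal sketch of the sorting approach described above (the helper name build_alternation is hypothetical, and re.escape is assumed to be how the user input gets escaped):

```python
import re

def build_alternation(terms):
    """Join terms into an alternation, longest first (hypothetical helper)."""
    # dict.fromkeys drops duplicates while keeping a deterministic order.
    unique = list(dict.fromkeys(terms))
    # Longest first, so a longer term is tried before any of its prefixes.
    ordered = sorted(unique, key=len, reverse=True)
    return "|".join(re.escape(t) for t in ordered)

pattern = re.compile(build_alternation(["god", "godspeed", "godzilla"]))
matches = [m.group(0) for m in pattern.finditer("godspeeding god godzilla")]
print(matches)
```

Because alternatives are tried left to right, "godspeed" is attempted before "god" at every position, so the shorter term only matches where no longer one does.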

Related

Regex: add to group only if no specific pattern found

I am trying to match a rather trivial pattern but still cannot figure it out.
Imagine, we have a string with "words" split by underscores. Sometimes the sequence can end up with a specific keyword with a follow-up number ("kw#"). Example:
"foo_bar_omega"
"bar_omega_kw1"
"alpha_betta_foo_bar_kw15"
I need the regex that will place the sequence of the words in one group and the keyword with the number into another group.
I've tried:
regex = '(?P<grp1>[a-zA-Z]+(_[a-zA-Z]+)*)(_(?P<grp2>kw\d+))?'
but it places the whole "bar_omega_kw" in grp1 and nothing in grp2.
I could also match only the keyword with regex='(_(?P<grp2>kw\d+$))?' and then split the initial string to identify grp1, but that looks like an over-complication.
Is it possible to do the job with a single regex?
What if this pattern is surrounded by other characters, e.g. 123_bar_omega_kw1_&&?
Edit
The question was closed with the "repeated question" flag and referencing this question. While one can get some idea/direction from the suggested old question, here, I believe, we have a little different situation: the pattern of the 1st group "catches" part of symbols that are supposed to be in the second group.
I think that the good answer here would contain an explanation of the importance of using anchors (^/$) for such types of problems.
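One possible way to do this in a single pass, sketched under the assumption that the whole string is the word sequence (the surrounded case from the question is not handled here): make the repetition in grp1 lazy and anchor the pattern with ^ and $, so grp1 only grows as far as needed for the rest to match. The function name split_keyword is hypothetical.

```python
import re

# Lazy "*?" lets the optional keyword group claim "_kw<digits>" first;
# the "$" anchor forces grp1 to expand when no keyword is present.
PATTERN = re.compile(
    r'^(?P<grp1>[a-zA-Z]+(?:_[a-zA-Z]+)*?)'
    r'(?:_(?P<grp2>kw\d+))?$'
)

def split_keyword(s):
    """Return (words, keyword-or-None), or None if s does not fit the format."""
    m = PATTERN.match(s)
    if m is None:
        return None
    return m.group('grp1'), m.group('grp2')

for sample in ("foo_bar_omega", "bar_omega_kw1", "alpha_betta_foo_bar_kw15"):
    print(sample, "->", split_keyword(sample))
```

Without the $ anchor the lazy group would stop after the first word, which is why the anchors matter for this kind of pattern.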

How can I re-write my Regex Expression to begin the search at the occurrence of a separate pattern? [duplicate]

This question already has answers here:
Python extract pattern matches
(10 answers)
Closed 2 years ago.
Apologies if this is a duplicate - I wasn't exactly sure what to search for and everything I found came up short.
I'm using Python, and if anybody's interested I drafted up a quick example here:
Regex101 Example I created
I'm trying to use regex to grab the first part of a string that might be formatted like so:
**This is a Location** 8:20
or it could be formatted like...
Irrelevant information - **Relevant Information** 6:90
I wrote the following expression which does the job almost perfectly, pulling the relevant part of the string (words) out but it also pulls in the second part of the string (numbers). This is annoying as I then need to do a second regex/python expression to split that out.
r'(\w* ){1,5}\d+:\d+'
I'm using Python so I know I can quite easily separate the info manually with a slice etc but I feel like there must be a more elegant solution to my Regex that would negate the need for this step. Essentially I think the solution would be to match '\d+:\d+' and look back from there.
Ok - perhaps this isn't the most elegant solution but I've just realised I think I can use capturing groups like so:
import re

# Pattern with groups: group 1 holds the words, group 3 the time/number value
pattern = r'((\w* ){1,5})(\d+:\d+)'
string = "useless something else - useful 2:2"
r = re.search(pattern, string)
if r:
    useful_info = r.group(1)
    boundary = r.group(3)
Theoretically, I'm always going to have the same number of groups with group 1 containing the relevant string I'm trying to grab and group 3 the time/number value. I'll test this now and update/close this thread.
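An alternative to capturing groups, sketched here as one option rather than the definitive fix, is a lookahead: it requires the time to follow but keeps it out of the match. The function name words_before_time is hypothetical, and \w+ is used instead of \w* on the assumption that empty "words" are not wanted.

```python
import re

# (?=\d+:\d+) asserts that a time follows without consuming it.
WORDS_BEFORE_TIME = re.compile(r'(?:\w+ ){1,5}(?=\d+:\d+)')

def words_before_time(s):
    """Return up to five words directly preceding a d:d value, or None."""
    m = WORDS_BEFORE_TIME.search(s)
    return m.group(0).strip() if m else None

result = words_before_time("Irrelevant information - Relevant Information 6:90")
print(result)
```

The " - " separator stops the match from reaching back into the irrelevant part, since "-" is not a word character.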

What difference do round brackets in a regular expression make? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I am currently going through pythonchallenge.com, and am now trying to write code that searches for a lowercase letter with exactly three uppercase letters on both sides of it. I got stuck trying to write a regular expression for it. This is what I have tried:
import re
#text is in https://pastebin.com/pAFrenWN since it is too long
p = re.compile("[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+")
print("".join(p.findall(text)))
This is what I got with it:
dqIQNlQSLidbzeOEKiVEYjxwaZADnMCZqewaebZUTkLYNgouCNDeHSBjgsgnkOIXdKBFhdXJVlGZVme
gZAGiLQZxjvCJAsACFlgfe
qKWGtIDCjn
I later searched for the solution, which had this regular expression:
p = re.compile("[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+")
So there is a bracket around [a-z], and I couldn't figure out what difference it makes. I would like some explanation on this.
Use Parentheses for Grouping and Capturing By placing part of a
regular expression inside round brackets or parentheses, you can group
that part of the regular expression together. This allows you to apply
a quantifier to the entire group or to restrict alternation to part of
the regex.
https://www.regular-expressions.info/brackets.html
Basically, the regex engine finds every string matching the whole search pattern, and when the pattern contains a group, findall() returns only the part inside the () instead of the whole match.
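A small illustration of that difference, using one of the output lines from the question as input (the variable names are just for this sketch):

```python
import re

text = "qKWGtIDCjn"  # one of the whole matches printed in the question

# No group: findall returns the entire matched substring.
without_group = re.findall(r"[^A-Z]+[A-Z]{3}[a-z][A-Z]{3}[^A-Z]+", text)

# One capturing group: findall returns only what the group captured.
with_group = re.findall(r"[^A-Z]+[A-Z]{3}([a-z])[A-Z]{3}[^A-Z]+", text)

print(without_group)
print(with_group)
```

Both patterns match the same span of text; only what findall() reports changes.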

Remove everything after regex pattern match but keep pattern [duplicate]

This question already has answers here:
Using regex to remove all text after the last number in a string
(2 answers)
Closed 4 years ago.
I was searching for a way to remove all characters past a certain pattern match. I know that there are many similar questions here on SO, but I was unable to find one that works for me. Basically I have a fixed pattern (\w\w\d\d\d\d), and I want to remove everything after it but keep the pattern.
I've tried using:
test = 'PP1909dfgdfgd'
done = re.sub ('(\w\w\d\d\d\d/w*)', '\w\w\d\d\d\d/', test)
but still get the same string ..
example:
dirty = 'AA1001dirtydata'
dirty2 = 'AA1001222%^&*'
Desired output:
clean = 'AA1001'
You can use re.match() instead of re.sub():
re.match(r'\w\w\d\d\d\d', dirty).group(0) # returns 'AA1001'
Note: match() only looks for the regular expression at the beginning of the string you provide, and group(0) contains only the characters that correspond to the pattern. If the pattern may start partway through the string, use re.search() instead.
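If re.sub() is preferred, as in the original attempt, one way it could work is to capture the pattern in a group and refer back to it with \1 in the replacement. This is a sketch, and the function name keep_through_code is hypothetical:

```python
import re

def keep_through_code(s):
    """Drop everything after the first \\w\\w\\d\\d\\d\\d occurrence."""
    # Group 1 is the code to keep; ".*" consumes the tail that gets dropped.
    # re.DOTALL lets ".*" also swallow any newlines in the tail.
    return re.sub(r'(\w\w\d\d\d\d).*', r'\1', s, count=1, flags=re.DOTALL)

for dirty in ('AA1001dirtydata', 'AA1001222%^&*', 'PP1909dfgdfgd'):
    print(keep_through_code(dirty))
```

The replacement string is not itself a regex, so the part to keep has to come from a backreference rather than from repeating \w\w\d\d\d\d.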

Python, how to compare substrings? [duplicate]

This question already has answers here:
Splitting on last delimiter in Python string?
(3 answers)
Checking whether a string starts with XXXX
(5 answers)
Does Python have a string 'contains' substring method?
(10 answers)
Closed 5 months ago.
I'm trying to compare substrings, and if I find a match, I break out of my loop. Here's an example of a few strings:
'something_tag_05172015.3', 'B_099.z_02112013.1', 'something_tag_05172015.1' ,'BHO98.c_TEXT_TEXT_05172014.88'.
The comparison should be between the string I'm looking for and everything to the left of the last underscore '_' in each string. So 'something_tag' should match only 'something_tag_05172015.3' and 'something_tag_05172015.1'.
To do this, I split on the underscores and joined all elements but the last one to compare against my test string (this drops everything to the right of the last underscore). Though it works, there's gotta be a better way. I was thinking maybe a regex to remove the last underscore and digits, but it didn't work properly on a few tags.
Here's an example of the regex I was trying: re.sub('_\d+\.\d+', '', string_to_test)
If you are sure that something_tag is at the beginning, you can try:
your_tag.startswith('something_tag')
If you are not sure about that:
res = 'something_tag' in your_tag
sobolevn beat me to it. For more complicated scenarios, use a regular expression with named groups and/or non-capturing groups.
That way the overall string needs to match a specific format, but you can just pull out the sub parts that you're interested in.
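Both ideas can be sketched as follows. The helper name and the group names tag, date and rev are hypothetical, and the trailing "_digits.digits" format is an assumption based on the example strings:

```python
import re

tags = ['something_tag_05172015.3', 'B_099.z_02112013.1',
        'something_tag_05172015.1', 'BHO98.c_TEXT_TEXT_05172014.88']

def prefix_before_last_underscore(s):
    # rsplit with maxsplit=1 cuts only at the last underscore.
    return s.rsplit('_', 1)[0]

matching = [t for t in tags
            if prefix_before_last_underscore(t) == 'something_tag']

# Named groups: the whole string must fit the format,
# but each part can be pulled out by name.
TAG_RE = re.compile(r'^(?P<tag>.+)_(?P<date>\d+)\.(?P<rev>\d+)$')
parsed = TAG_RE.match('B_099.z_02112013.1')

print(matching)
print(parsed.group('tag'), parsed.group('date'), parsed.group('rev'))
```

Unlike startswith() or in, comparing the full prefix avoids false positives such as a tag that merely begins with the same characters.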
