Regex - Group everything until last occurrence - python

When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurrence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like them to contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.

You almost had it.
In the first group, just add the lazy modifier ? after the .* quantifier. That makes the group take as few characters as possible, so it stops at the first position where the rest of the pattern can match.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
This will give you
see.Ya23.v2. and 0023
And if you also want to keep the dot out of both groups:
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
The result is see.Ya23.v2 and 0023
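To see the difference between the greedy and the lazy first group, both patterns can be run against the sample filename from the question. This is only a minimal sketch; the variable names are illustrative.

```python
import re

filename = "see.Ya23.v2.0023.jpg"

# Greedy first group: .* grabs everything and gives back only one digit.
greedy = re.search(r"(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))", filename)

# Lazy first group: .*? grows one character at a time until the rest matches.
lazy = re.search(r"(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))", filename)

print(greedy.group("Sequence"), greedy.group("Frame"))  # see.Ya23.v2.002 3
print(lazy.group("Sequence"), lazy.group("Frame"))      # see.Ya23.v2. 0023
```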

The simplest and quickest way is to put a negative lookbehind for a digit
before your digit expression at the start of the Frame group.
This makes sure the Frame is the last complete run of digits and
still allows a greedy Sequence match, which may give a performance boost.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
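A short sketch of how the lookbehind version behaves in Python. Only the first filename comes from the question; the other two are made up here to cover a single-digit frame and a name that starts with the frame number.

```python
import re

pattern = re.compile(r"(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))")

# The last two names are hypothetical edge cases, not from the question.
names = ["see.Ya23.v2.0023.jpg", "shot_7.exr", "0042.png"]
results = [pattern.search(name).groupdict() for name in names]

for name, groups in zip(names, results):
    print(name, "->", groups)
```

The (?<!\d) check rejects any start position that sits in the middle of a digit run, so the greedy .* has to back off to the beginning of the last number.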

The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the char before the \d+ is a non-digit char (or that \d+ starts the string), so you need to use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.
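As a rough illustration, this pattern could be wrapped in a small helper. The function name split_frame and the extra test names are assumptions made for this sketch, not part of the answer.

```python
import re

FRAME_RE = re.compile(r"(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)")

def split_frame(name):
    """Return (sequence, frame), or None when the name contains no digits."""
    m = FRAME_RE.match(name)
    return (m.group("Sequence"), m.group("Frame")) if m else None

print(split_frame("see.Ya23.v2.0023.jpg"))  # ('see.Ya23.v2.', '0023')
print(split_frame("0023.jpg"))              # ('', '0023')
print(split_frame("notes.txt"))             # None
```

The optional (?:.*\D)? group is what lets a name that begins with the frame number still match, with an empty Sequence.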

Related

Python regex expression example

I have an input that is valid if it has these parts, in this order:
a part made of letters (upper and lower case), numbers and some of the following characters (!,#,#,$,?)
a part that begins with = and contains only numbers
a part that begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into five capture groups and not into three as I understand you want to have from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.
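For the validity check itself, one option is re.fullmatch, which only succeeds when the whole input fits the pattern. The sketch below assumes that is the intended meaning of "valid"; the helper name parse_input and the two invalid samples are made up for illustration, and the pattern is the three-group one with the duplicate # and the redundant {1} dropped.

```python
import re

# Three groups: allowed characters, "=" plus digits, "<<" plus anything.
INPUT_RE = re.compile(r"([a-zA-Z0-9!#$?]{2,})(=[0-9]+)(<<.*)")

def parse_input(text):
    """Return the three parts as a tuple, or None if the input is invalid."""
    m = INPUT_RE.fullmatch(text)
    return m.groups() if m else None

print(parse_input("!!Hel##lo!#=7<<vbnfhfg"))  # ('!!Hel##lo!#', '=7', '<<vbnfhfg')
print(parse_input("Hello=x<<abc"))            # None (no digits after =)
print(parse_input("Hello=7"))                 # None (missing << part)
```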

Re module and positive look behind of variable width

I am new to programming and Python, so I apologize if this is an obvious question. I tried looking at similar questions on this website, but the solutions seem to be outside of my reach.
Problem: Consider the following text:
12/19 Paul 1/20
1/20 Jacob 10/2
Using the module re, extract the names from the above. In other words, your output should be:
['Paul', 'Jacob']
First, I tried using positive look arounds. I tried:
import re
name_regex=re.compile(r'''(
(?<=\d{1,2}/\d{1,2}\s) #looks for one or two digits followed by a forward slash followed by one or two digits, followed by a space
.*? #looks for anything besides the newline in a non-greedy manner (is the non-greedy part necessary? I am not sure...)
(?=\s\d{1,2}/\d{1,2}) #looks for a space followed by one or two digits followed by a forward slash followed by one or two digits
)''', re.VERBOSE)
text=str("12/19 Paul 1/20\n1/20 Jacob 10/2")
print(name_regex.findall(text))
However, the above yields the error:
re.error: look-behind requires fixed-width pattern
From reading similar questions, I believe that this means that look arounds cannot have variable length (i.e., they cannot look for "1 or 2 digits").
However, how can I fix this?
Any help would be greatly appreciated. Especially the help suited for nearly a complete beginner like me!
PS. Ultimately, the list of names surrounded by dates can be very long. The dates can have one or two digits that are separated by a slash. I just wanted to give a minimal working example.
Thank you!
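If you want to match at least a single non-whitespace char between the digit patterns, you might use the pattern below. Note that Python's built-in re module rejects this variable-width lookbehind with the same error as in the question, so it needs an engine that supports it, such as the third-party regex module from PyPI.
(?<=\d{1,2}/\d{1,2}\s)\S.*?(?=\s\d{1,2}/\d{1,2})
The \S.*? part matches a non-whitespace char followed, non-greedily, by any chars except a newline, so it matches up to the first position where (?=\s\d{1,2}/\d{1,2}) succeeds.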
Python demo
Note that if you would use .*? then match would also return an empty entry ['Paul', '', 'Jacob'] , see this example.
You could also use a capturing group instead of lookarounds:
\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}
Regex demo
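The capturing-group variant avoids the lookbehind entirely, so it runs with the standard re module. A minimal sketch using the text from the question:

```python
import re

text = "12/19 Paul 1/20\n1/20 Jacob 10/2"

# The dates are matched (consumed) normally; only the name is captured.
name_regex = re.compile(r"\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}")

names = name_regex.findall(text)
print(names)  # ['Paul', 'Jacob']
```

Because the pattern has exactly one capturing group, findall returns the captured names instead of the full matches. One caveat: the dates are consumed, so a date shared by two names on the same line would only be usable once.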

Python, regular expression matching digits, x,xxx,xxx but not xx,xx,x,

first time posting, I've lurked for a little while, really excited about the helpful community here.
So, working with "Automate the boring stuff" by Al Sweigart
Doing an exercise that requires I build a regex that finds numbers in standard number format. Three digit, comma, three digits, comma, etc...
So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.
I have the following.
import re
testStr = '1,234,343'
matches = []
numComma = re.compile(r'^(\d{1,3})*(,\d{3})*$')
for group in numComma.findall(str(testStr)):
    Num = group
    print(str(Num) + '-') #Printing here to test each loop
    matches.append(str(Num[0]))
#if len(matches) > 0:
# print(''.join(matches))
Which outputs this....
('1', ',343')-
I'm not sure why the middle ",234" is being skipped over. Something wrong with the regex, I'm sure. Just can't seem to wrap my head around this one.
Any help or explanation would be appreciated.
FOLLOW UP EDIT. So after following all your advice that I could assimilate, I got it to work perfectly for several inputs.
import re
testStr = '1,234,343'
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Num = numComma.findall(testStr)
print(Num)
gives me....
['1,234,343']
Great! BUT! What about when I change the string input to something like
'1,234,343 and 12,345'
Same code returns....
[]
Grrr... lol, this is fun, I must admit.
So the purpose of the exercise is to be able to eventually scan a block of text and pick out all the numbers in this format. Any insight? I thought this would add an additional tuple, not return an empty one...
FOLLOW UP EDIT:
So, a day later(Been busy with 3 daughters and Honey-do lists), I've finally been able to sit down and examine all the help I've received. Here's what I've come up with, and it appears to work flawlessly. Included comments for my own personal understanding. Thanks again for everything, Blckknght, Saleem, mhawke, and BHustus.
My final code:
import re
testStr = '12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.'
numComma = re.compile(r'''
(?:(?<=^)|(?<=\s)) # Looks behind the Match for start of line and whitespace
((?:\d{1,3}) # Matches on groups of 1-3 numbers.
(?:,\d{3})*) # Matches on groups of 3 numbers preceded by a comma
(?=\s|$)''', re.VERBOSE) # Looks ahead of match for end of line and whitespace
Num = numComma.findall(testStr)
print(Num)
Which returns:
['12,454', '1,234', '23,322', '1,234,567', '12']
Thanks again! I have had such a positive first posting experience here, amazing. =)
The issue is due to the fact you're using a repeated capturing group, (,\d{3})* in your pattern. Python's regex engine will match that against both the thousands and ones groups of your number, but only the last repetition will be captured.
I suspect you want to use non-capturing groups instead. Add ?: to the start of each set of parentheses (I'd also recommend, on general principle, to use a raw string, though you don't have escaping issues in your current pattern):
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Since there are no groups being captured, re.findall will return the whole matched text, which I think is what you wanted. You can also use re.match or re.search and call the group() method on the returned match object to get the whole matched text.
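The effect of capturing versus non-capturing groups on findall can be checked side by side. A small sketch using the string and patterns from the question and this answer:

```python
import re

test_str = "1,234,343"

capturing = re.compile(r"^(\d{1,3})*(,\d{3})*$")
non_capturing = re.compile(r"^(?:\d{1,3})(?:,\d{3})*$")

# With groups, findall returns one tuple per match holding each group's
# last capture, so ",234" is overwritten by ",343".
print(capturing.findall(test_str))             # [('1', ',343')]

# Without capturing groups, findall returns the whole matched text.
print(non_capturing.findall(test_str))         # ['1,234,343']
print(non_capturing.search(test_str).group())  # 1,234,343
```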
The problem is:
A regex match will return a tuple item for each group. However, it is important to distinguish a group from a capture. Since you only have two parenthesis-delimited groups, the matches will always be tuples of two: the first group, and the second. But the second group matches twice.
1: first group, captured
,234: second group, captured
,343: also second group, which means it overwrites ,234.
Unfortunately, it seems that vanilla Python does not have a way to access any captures of a group other than the last one in a manner similar to .NET's regex implementation. However, if you are only interested in checking a specific number, your best bet would be to call the compiled pattern's search method on it, e.g. numComma.search(number). If it returns a non-None value, then the input string is a valid number. Otherwise, it is not.
Additionally: A test on your regex. Note that, as Paul Hankin stated, test cases 6 and 7 match even though they shouldn't, due to the first * following the first capturing group, which will make the initial group match any number of times. Otherwise, your regex is correct. Fixed version.
RESPONSE TO EDIT:
The reason your regex now returns an empty list on '1,234,343 and 12,345' is the ^ and $ anchors in your regex. The ^ anchor, at the start of the regex, says 'this point needs to be at the start of a string'. The $ is its counterpart, saying 'This needs to be at the end of the string'. This is good if you want your entire string from start to end to match the pattern, but if you want to pick out multiple numbers, you should do away with them.
HOWEVER!
If you leave the regex in its current form sans anchors, it will now match the individual elements of 1,23,45 as separate numbers. So for this we need to add a zero-width positive lookahead assertion and say, 'make sure that after this number is either whitespace or the end of a line'. You can see the change here. The tail end, (?=\s|$), is our lookahead assertion: it doesn't capture anything, but just makes sure the criteria are met, in this case whitespace (\s) or (|) the end of a line ($).
BUT: In a similar vein, the previous regex would have matched 2 onward in "1234,567", giving us the number "234,567", which would be bad. So we use a positive lookbehind assertion similar to our lookahead at the end: (?<=^|\s), only match if at the beginning of the string or there is whitespace before the number. Python's re module rejects that exact form because the two alternatives differ in width, so in Python write it as (?:(?<=^)|(?<=\s)) or (?<!\S). This version can be found here, and should soundly satisfy any non-decimal number related needs.
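Put together in Python, the unanchored pattern with both lookarounds might look like the sketch below. It swaps in (?<!\S) for the lookbehind, which is an equivalent fixed-width form that the re module accepts; the test text is the one from the question's final edit.

```python
import re

# (?<!\S): not preceded by a non-whitespace char (start of string or whitespace).
# (?=\s|$): followed by whitespace or the end of the string.
num_comma = re.compile(r"(?<!\S)\d{1,3}(?:,\d{3})*(?=\s|$)")

text = ("12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 "
        "and 12 but not 1,23,1 or ,,1111, or anything else silly.")

numbers = num_comma.findall(text)
print(numbers)  # ['12,454', '1,234', '23,322', '1,234,567', '12']
```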
Try:
import re
p = re.compile(r'(?:(?<=^)|(?<=\s))((?:\d{1,3})(?:,\d{3})*)(?=\s|$)', re.DOTALL)
test_str = """1,234 and 23,322 and 1,234,567 1,234,567,891 200 and 12 but
not 1,23,1 or ,,1111, or anything else silly"""
for m in re.findall(p, test_str):
    print(m)
and its output will be
1,234
23,322
1,234,567
1,234,567,891
200
12
You can see demo here
This regex matches any valid number and never matches an invalid one (Python's re rejects the (?<=^|\s) lookbehind as variable-width, so use (?<!\S) in its place there):
(?<=^|\s)(?:(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*))(?=\s|$)
https://regex101.com/r/dA4yB1/1
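A sketch of this stricter pattern adapted for Python's re, with (?<!\S) substituted for the lookbehind. The sample string is made up here to show that leading zeros and ungrouped four-digit runs are rejected.

```python
import re

strict_number = re.compile(
    r"(?<!\S)(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*)(?=\s|$)"
)

# Hypothetical sample: "012", "01,234" and "1234" should not match.
sample = "0 7 012 1,234 01,234 1234 999,999"

found = strict_number.findall(sample)
print(found)  # ['0', '7', '1,234', '999,999']
```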

Regexp - find a value between a part of the string and a second part of the string OR end of line

I've looked through many regexp examples here, but still fail to find a solution.
I have to check a request string for a certain substring in it. The substring in question will have something before it and might have something after it:
?something=xxx&to_dep=YYY&from_dep=zzz&...
OR
?something=xxx&to_dep=YYY
I need to extract YYY without a & in first case and simply YYY in the second case.
For now I use this kind of regexp:
re.search('to_dep=(.+?)&', req.query_string)
but works only in one case and can't be used if I want to re.sub it. (replace YYY with something else - & gets replaced too)
Any help?
Just try with:
[?&]to_dep=([^&]*)
[^&]* matches any run of characters other than &, so it stops either at the next & (first case) or at the end of the string (second case).
For both, you might use a positive lookbehind and a negated class:
re.search(r'(?<=to_dep=)[^&]+', req.query_string)
And this will give you only YYY, which then means you can also use it in re.sub:
re.sub(r'(?<=to_dep=)[^&]+', 'new_value', req.query_string)
[^&] matches any character except &.
(?<=to_dep=) makes sure there's a to_dep= before the part to match.
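Both suggestions can be tried on the two query strings from the question. In this sketch plain strings stand in for req.query_string, which is an assumption made to keep the example self-contained.

```python
import re

queries = [
    "?something=xxx&to_dep=YYY&from_dep=zzz",
    "?something=xxx&to_dep=YYY",
]

capture = re.compile(r"[?&]to_dep=([^&]*)")     # value is in group 1
lookbehind = re.compile(r"(?<=to_dep=)[^&]+")   # whole match is the value

values = [capture.search(q).group(1) for q in queries]
replaced = [lookbehind.sub("new_value", q) for q in queries]

print(values)    # ['YYY', 'YYY']
print(replaced)
```

Because the lookbehind version matches only the value, sub leaves the to_dep= prefix and any following & untouched.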

Negating match if a string is just before another string

I'm struggling to get a regex to work where it matches a certain pattern, so long as it isn't preceded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory
Just use a negative look-behind in your regex:
(?<!with\s)accessory
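A minimal sketch of the negative lookbehind applied to the two example titles. The classify helper is hypothetical, and re.IGNORECASE is an assumption added so that the capitalized "Accessory" in the first title is also found.

```python
import re

# "with\s" is a fixed five characters wide, so re accepts it in a lookbehind.
accessory_re = re.compile(r"(?<!with\s)accessory", re.IGNORECASE)

def classify(title):
    """Label a title as 'accessory' unless the word directly follows 'with '."""
    return "accessory" if accessory_re.search(title) else "product"

print(classify("Accessory for MyProduct01"))   # accessory
print(classify("MyProduct01 with accessory"))  # product
```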
Since Python doesn't support unbounded lookbehinds, I think you are going to have to use a lookahead similar to what you are currently using, but change the original pattern a bit.
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead is used to ensure that "accessory" doesn't come after the word "with" anywhere in the string (the leading .* lets "with" appear at any position, not only at the start). Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
Both lookaheads are zero-width, so the pattern above works with either the search method or the match method (match only anchors at the start of the string). If you need the match to span the entire string, for example with fullmatch, add a bit more to the pattern:
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$
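A sketch of the lookahead approach on the two example titles. It uses a variant with .* at the start of the negative lookahead so that "with" is detected anywhere before "accessory", and re.IGNORECASE is an assumption added to cover the capitalized "Accessory".

```python
import re

# Negative lookahead: no "with ... accessory" sequence anywhere in the string.
# Positive lookahead: the word "accessory" must occur somewhere.
accessory_re = re.compile(
    r"^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)",
    re.IGNORECASE,
)

titles = ["Accessory for MyProduct01", "MyProduct01 with accessory"]
labels = ["accessory" if accessory_re.search(t) else "product" for t in titles]
print(labels)  # ['accessory', 'product']
```

Unlike the lookbehind version, this rejects "accessory" whenever "with" appears anywhere earlier in the string, not only directly before it.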
