non greedy Python regex from end of string - python

I need to search a string in Python 3 and I'm having trouble implementing non-greedy logic starting from the end.
I'll try to explain with an example:
Input can be one of the following
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'
I need to find the last part of the string ending with
somestring_otherstring.xml
In all the above cases the regex should return XX1234567890_84481.xml
My best try is:
result = re.search('(_.+)?\.xml$', test1, re.I).group()
print(result)
Here I used:
(_.+)? to match "_anystring" in a non greedy mode
\.xml$ to match ".xml" in the final part of the string
The output I get is not correct:
_x-y-z_XX1234567890_84481.xml
I found some SO questions (link) explaining that the regex engine starts from the left even with a non-greedy quantifier.
Could anyone explain to me how to implement a non-greedy regex from the right?

Your pattern (_.+)?\.xml$ captures, in an optional group, everything from the first underscore until it can match .xml at the end of the string. It does not take into account how many underscores should appear in between.
To only match the last part you can omit the capturing group. You could use a negated character class and use the anchor $ to assert the end of the line as it is the last part:
[^_]+_[^_]+\.xml$
Regex demo | Python demo
That will match
[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string
For example:
import re
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
# Raw string so that \. reaches the regex engine unchanged
result = re.search(r'[^_]+_[^_]+\.xml$', test1, re.I)
if result:
    print(result.group())

Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:
'[^_]+_[^_]+\.xml$'
The [^_] is a character class matching any character which is not an underscore.

You need to use this regex to capture what you want,
[^_]*_[^_]*\.xml
Demo
Check out this Python code,
import re
arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']
for s in arr:
    m = re.search(r'[^_]*_[^_]*\.xml', s)
    if m:
        print(m.group(0))
Prints,
XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
The problem with your regex (_.+)?\.xml$ is that the (_.+)? part starts matching at the first _ and matches everything up to the literal .xml; the trailing ? only makes the whole group optional. As a result, in the string _x-y-z_XX1234567890_84481.xml it also matches _x-y-z_XX1234567890_84481, which is not the behavior you wanted.
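To see why a lazy quantifier does not help here, the following sketch compares the pattern from the question, a lazy variant, and the negated-character-class pattern on the three inputs. The lazy variant (_.+?)?\.xml$ and the helper name last_part are assumptions made for illustration, not something from the question. The engine still takes the leftmost position where a match can start, so laziness only affects how far the match extends to the right.

```python
import re

tests = [
    'AB_x-y-z_XX1234567890_84481.xml',
    'x-y-z_XX1234567890_84481.xml',
    'XX1234567890_84481.xml',
]

patterns = {
    'greedy': r'(_.+)?\.xml$',        # pattern from the question
    'lazy': r'(_.+?)?\.xml$',         # hypothetical lazy variant
    'negated': r'[^_]+_[^_]+\.xml$',  # pattern from the answers
}

def last_part(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text, re.I)
    return m.group() if m else None

for name, pattern in patterns.items():
    print(name, [last_part(pattern, t) for t in tests])
```

For the first input, the greedy and lazy variants both start at the first underscore, while the negated character class cannot cross an underscore and so can only match the last two segments.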

Related

Regex string between square brackets only if '.' is within string

I'm trying to detect the text between two square brackets in Python; however, I only want the result where there is a "." within it.
I currently have \[(.*?)\] as my regex, using the following example:
String To Search:
CASE[Data Source].[Week] = 'THIS WEEK'
Result:
Data Source, Week
However I need the whole string as [Data Source].[Week], (square brackets included, only if there is a '.' in the middle of the string). There could also be multiple instances where it matches.
You might write a pattern that matches [...] and then repeats, 1 or more times, a . followed by another [...]:
\[[^][]*](?:\.\[[^][]*])+
Explanation
\[[^][]*] Match from [...] using a negated character class
(?: Non capture group to repeat as a whole part
\.\[[^][]*] Match a dot and again [...]
)+ Close the non capture group and repeat 1+ times
See a regex demo.
To get multiple matches, you can use re.findall
import re
pattern = r"\[[^][]*](?:\.\[[^][]*])+"
s = ("CASE[Data Source].[Week] = 'THIS WEEK'\n"
"CASE[Data Source].[Week] = 'THIS WEEK'")
print(re.findall(pattern, s))
Output
['[Data Source].[Week]', '[Data Source].[Week]']
If you also want the values between square brackets when there is no dot, you can use an alternation with lookaround assertions:
\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])
Explanation
\[[^][]*](?:\.\[[^][]*])+ The same as the previous pattern
| Or
(?<=\[)[^][]*(?=]) Match the text inside [...], asserting [ to the left and ] to the right
See another regex demo
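As a rough illustration of the difference between the two patterns, the sketch below runs both with re.findall. The input string, with an extra lone [Other] value, is a made-up example and not from the question.

```python
import re

dotted_only = r"\[[^][]*](?:\.\[[^][]*])+"
dotted_or_inner = r"\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])"

# Hypothetical input: one dotted pair plus a lone bracketed value.
s = "CASE[Data Source].[Week] = [Other]"

print(re.findall(dotted_only, s))      # dotted pairs only
print(re.findall(dotted_or_inner, s))  # dotted pairs, or the inner text of lone brackets
```

Note that the second alternative returns only the text inside the brackets, because the brackets themselves are asserted by the lookarounds and not consumed.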
I think an alternative approach could be:
import re
sss = "CASE[Data Source].[Week] = 'THIS WEEK'"
pattern = re.compile(r"(\[[^\]]*\]\.\[[^\]]*\])")
print(pattern.findall(sss))
OUTPUT
['[Data Source].[Week]']

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = 'The special code is 034567 in this particular case and not 98675'
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
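Your expression requires two whitespace characters after code (a literal space followed by \s), then \w. matches exactly one word char plus any one char other than a line break char, and then \s followed by a literal space again requires two whitespaces. Since the words in your text are separated by single spaces, the pattern cannot match.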
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
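Since re.findall returns only strings, it cannot report the start and end index the question asks for. A minimal sketch using re.finditer is shown here; the helper name find_codes and the constant CODE_RE are hypothetical, and the pattern assumes exactly one word sits between the phrase and the digits.

```python
import re

# Assumed pattern: the phrase, one intervening word, then the digits.
CODE_RE = re.compile(r'special code\s+\S+\s+(\d+)')

def find_codes(text):
    """Return (digits, start, end) for every number that follows the phrase."""
    return [(m.group(1), m.start(1), m.end(1)) for m in CODE_RE.finditer(text)]

text = 'The special code is 034567 in this particular case and not 98675'
print(find_codes(text))
```

The start and end values are those of group 1, so slicing text[start:end] gives back the digits.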

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. But today, when I used both of them together, I found some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The first match is all right, but the second match is beyond my expectation, for I expected the second match to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be exactly what you want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
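The non-overlap rule can be checked directly. The sketch below first runs a combined pattern through re.findall, then runs each alternative as a separate re.search. Splitting the alternatives into separate searches is a workaround assumed here for illustration, not something the answers propose.

```python
import re

s = "abcdefg"

# One combined pattern: findall returns non-overlapping matches only.
print(re.findall(r"a.*?c|.*g", s))

# Separate searches: each starts from the beginning of the string,
# so the results are allowed to overlap.
overlapping = []
for pattern in (r"a.*?c", r".*g"):
    m = re.search(pattern, s)
    if m:
        overlapping.append(m.group(0))
print(overlapping)
```

The combined pattern yields abc and defg, while the separate searches yield abc and abcdefg.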

python regExp search with lookarounds

In my test program I get an input that goes like
str = "TestID277RStep01CtrAx-mn00112345"
Here, I want to use regExp to form groups that return me the following
str = "Test(ID277)(R)(Step01)(CtrAx-mn001)12345"
My goal is to end up with 4 vars
var1 = "ID277"
var2 = "R"
var3 = "Step01"
var4 = "CtrAx-mn001"
I have so far tried
regx = ".*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]-/d{3}))?.*"
re_testInp = re.compile ( regx, re.IGNORECASE )
srch = re_testInp.search( r'^' + str )
print srch.groups()
I seem to be getting the first 3 groups right but unable to get the last one.
Almost close to pulling all my hair out with this one. Any help will be much appreciated.
Works for me fine with Python3.6.0 and the following pattern:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(.*\-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})?.*
I only changed the last capturing group as I'll explain what was wrong, in my opinion, with the pattern you included:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]/d{3}))?.*
Do notice that the last capture group, (Ctr(?=[A-Z][a-z]/d{3}))?, will not find a match because:
You attempt to match a literal 'Ctr', also you did not consider the literal '-'. I do not know what is the possible text you try to match there exactly but I generalized it to: .*-
You wrote /d{3} instead of \d{3}
In the test string you included: '...ReqAx-mn...' the m is lower cased. You should change the pattern to: (Ctr(?=[A-Za-z][a-z]/d{3})) if you want to support lowercase as well.
You do not use the lookahead assertion properly. As stated in: https://docs.python.org/3/library/re.html
(?=...)
Matches if ... matches next, but doesn’t consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match 'Isaac ' only if it’s followed by 'Asimov'.
Meaning you should change the capturing group to: (.*-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})
In (Step(?=\d)\d+), I assume you thought the first digit would be captured by the lookahead assertion, but a lookahead consumes nothing, so both digits are captured by the following \d+.
Ben.
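The points above can be sketched as follows. The first pattern is the one given in this answer; the second, with the lookaheads removed, is an assumed simplification added here to show that a lookahead followed by the same consuming subpattern does not change what is captured.

```python
import re

s = "TestID277RStep01CtrAx-mn00112345"

# Pattern from the answer above, with lookaheads.
with_lookaheads = re.compile(
    r".*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?"
    r"(.*\-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})?.*"
)
# Assumed simplification: the same pattern with the lookaheads removed.
without_lookaheads = re.compile(
    r".*Test(ID[0-9]+)([RP]?)(Step\d+)?(.*-[A-Za-z][a-z]\d{3})?.*"
)

for rx in (with_lookaheads, without_lookaheads):
    m = rx.search(s)
    if m:
        var1, var2, var3, var4 = m.groups()
        print(var1, var2, var3, var4)

# The documentation example: the lookahead is checked but not consumed.
print(repr(re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()))
```

Both patterns produce the four values the question asks for on this input; other inputs may need a tighter fourth group than .*- allows.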

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for a similar post, but couldn't find one :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
               ^     ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
re.search finds only the first match. Perhaps you'd want re.findall, which returns a list of strings, or re.finditer, which returns an iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more of any character besides $", which can potentially avoid pathological backtracking behaviour.
Your regex is nearly fine, except that $ must be escaped as \$ to match a literal dollar sign. The other issue is that re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in re.findall(r'\$.*?\$', line):
    print(m)
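To make the non-overlapping point concrete, here is a small sketch that walks the two sample lines with re.finditer and reports each $...$ chunk with its span. The names DOLLAR_PAIR and dollar_fields are hypothetical, and the lines are held in a list here where the question reads them from a file.

```python
import re

# Escaped dollars, with a negated class for the text in between.
DOLLAR_PAIR = re.compile(r'\$[^$]+\$')

def dollar_fields(lines):
    """Yield (matched_text, (start, end)) for each $...$ chunk, line by line."""
    for line in lines:
        for m in DOLLAR_PAIR.finditer(line):
            yield m.group(0), m.span()

lines = ["aaa$bb$ccc$ddd$eee", "fff$ggg$hh$iii$jj"]
for text, span in dollar_fields(lines):
    print(text, span)
```

Because each closing $ is consumed by its match, $ccc$ and $hh$ are never reported: the next match can only start after the previous one ends.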
