Python RegEx Overlapping - python

The title of this question probably isn't sufficient to describe the problem I'm trying to solve so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:
First, we're lookig for any one of these strings:
CATGTG
CATTTG
CACGTG
Second, the pattern is:
string
6-7 letters
string
Example
match: CATGTGXXXXXXCACGTG
no match: CATGTGXXXCACGTG (because 3 letters between)
Third, when a match is found, begin the next search from the end of the previous match, inclusive. Report index of each match.
Example:
input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX
workflow (spaces for readability):
found match: CATGTG XXXXXX CATTTG
it starts at 3
resuming search at C in CATTTG
found match: CATTTG XXXXXXX CACGTG
it starts at 15
and so on...
After a few hours of tinkering, my sorry attempt did not yield what I expected:
regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
print(m.start(), m.group())
3 CATGTG
15 CATTTG (incorrect)
You're a genius if you can figure this out with a RegEx. Thanks :D

You can use this kind of pattern:
import re
s='XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'
regex = re.compile(r'(?=(((?:CATGTG|CATTTG|CACGTG).{6,7}?)(?:CATGTG|CATTTG|CACGTG)))\2')
for m in regex.finditer(s):
print(m.start(), m.group(1))
The idea is to put the whole string inside the lookahead and to use a backreference to consume characters you don't want to test after.
The first capture group contains the whole sequence, the second contains all characters until the next start position.
Note that you can change (?:CATGTG|CATTTG|CACGTG) to CA(?:TGTG|TTTG|CGTG) to improve the pattern.

The main issue is that in order to use the | character, you need to enclose the alternatives in parentheses.
Assuming from your example that you want only the first matching string, try the following:
regex = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
print(m.start(), m.group(1))
Note the .group(1), which will match only what's in the first set of parentheses, as opposed to .group() which will return the whole match.

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings possibly. I current do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails when mystring=' 20220220519AX' to return a='202'. Therefore, I want the code to split the string on "2022" that is not at the beginning non-whitespace characters in the string.
Can you please guide with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2020 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
' 1020220519AX',
' 20220220519AX'
]
for s in strings:
parts = re.split(r"(?<!^)2022", s.strip())
if parts:
print(parts[0])
for s in strings:
m = re.match(r"\s*(\d+?)2022", s)
if m:
print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits, it is only splitted.
If the string consists of only word characters, splitting on \B2022 where \B means non a word boundary, will also prevent splitting at the start of the example string.

non greedy Python regex from end of string

I need to search a string in Python 3 and I'm having troubles implementing a non greedy logic starting from the end.
I try to explain with an example:
Input can be one of the following
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'
I need to find the last part of the string ending with
somestring_otherstring.xml
In all the above cases the regex should return XX1234567890_84481.xml
My best try is:
result = re.search('(_.+)?\.xml$', test1, re.I).group()
print(result)
Here I used:
(_.+)? to match "_anystring" in a non greedy mode
\.xml$ to match ".xml" in the final part of the string
The output I get is not correct:
_x-y-z_XX1234567890_84481.xml
I found some SO questions (link) explaining the regex starts from the left even with non greedy qualifier.
Could anyone explain me how to implement a non greedy regex from the right?
Your pattern (_.+)?\.xml$ captures in an optional group from the first underscore until it can match .xml at the end of the string and it does not take the number of underscores that should be between into account.
To only match the last part you can omit the capturing group. You could use a negated character class and use the anchor $ to assert the end of the line as it is the last part:
[^_]+_[^_]+\.xml$
Regex demo | Python demo
That will match
[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string
For example:
import re
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search('[^_]+_[^_]+\.xml$', test1, re.I)
if result:
print(result.group())
Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:
'[^_]+_[^_]+\.xml$'
The [^_] is a character class matching any character which is not an underscore.
You need to use this regex to capture what you want,
[^_]*_[^_]*\.xml
Demo
Check out this Python code,
import re
arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']
for s in arr:
m = re.search(r'[^_]*_[^_]*\.xml', s)
if (m):
print(m.group(0))
Prints,
XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
The problem in your regex (_.+)?\.xml$ is, (_.+)? part will start matching from the first _ and will match anything until it sees a literal .xml and whole of it is optional too as it is followed by ?. Due to which in string _x-y-z_XX1234567890_84481.xml, it will also match _x-y-z_XX1234567890_84481 which isn't the correct behavior you desired.

The Behavior of Alternative Match "|" with .* in a Regex

I seldom use | together with .* before. But today when I use both of them together, I find some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc',''),('','defg')]
The result of the first caputure is all right, but the second capture is beyond my expectation, for I have expected the second capture would be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
See the regex demo
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
print(m.group(0)) # => abcdefg
print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.

Regex for the repeated pattern

Can you please get me python regex that can match
9am, 5pm, 4:30am, 3am
Simply saying - it has the list of times in csv format
I know the pattern for time, here it is:
'^(\\d{1,2}|\\d{1,2}:\\d{1,2})(am|pm)$'
^(\d+(:\d+)?(am|pm)(, |$))+ will work for you.
Demo here
If you have a regex X and you want a list of them separated by comma and (optional) spaces, it's a simple matter to do:
^X(,\s*X)*$
The X is, of course, your current search pattern sans anchors though you could adapt that to be shorter as well. To my mind, a better pattern for the times would be:
\d{1,2}(:\d{2})?[ap]m
meaning that the full pattern for what you want would be:
^\d{1,2}(:\d{2})?[ap]m(,\s*\d{1,2}(:\d{2})?[ap]m)*$
You can use re.findall() to get all the matches for a given regex
>>> str = "hello world 9am, 5pm, 4:30am, 3am hai"
>>> re.findall(r'\d{1,2}(?::\d{1,2})?(?:am|pm)', str)
['9am', '5pm', '4:30am', '3am']
What it does?
\d{1,2} Matches one or two digit
(?::\d{1,2}) Matches : followed by one ore 2 digits. The ?: is to prevent regex from capturing the group.
The ? at the end makes this part optional.
(?:am|pm) Match am or pm.
Use the following regex pattern:
tstr = '9am, 5pm, 4:30am, 3amsdfkldnfknskflksd hello'
print(re.findall(r'\b\d+(?::\d+)?(?:am|pm)', tstr))
The output:
['9am', '5pm', '4:30am', '3am']
Try this,
((?:\d?\d(?:\:?\d\d?)?(?:am|pm)\,?\s?)+)
https://regex101.com/r/nkcWt5/1

Python regex matching only if digit

Given the regex and the word below I want to match the part after the - (which can also be a _ or space) only if the part after the delimiter is a digit and nothing comes after it (I basically want to to be a number and number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping) ?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall('^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ groups can assist in capture. The $ at the end requires all numbers to be captured, but it's optional so the first group can still be captured.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)

Categories

Resources