Regex and a sequence of patterns? - python

Is there a way to match a pattern (e\d\d) several times, capturing each one into a group? For example, given the string..
blah.s01e24e25
..I wish to get four groups:
1 -> blah
2 -> 01
3 -> 24
4 -> 25
The obvious regex to use is (in Python regex):
import re
re.match(r"(\w+)\.s(\d+)e(\d+)e(\d+)", "blah.s01e24e25").groups()
..but I also want to match either of the following:
blah.s01e24
blah.s01e24e25e26
You can't seem to do (e\d\d)+, or rather you can, but it only captures the last occurrence:
>>> re.match(r"(\w+)\.s(\d+)(e\d\d){2}", "blah.s01e24e25e26").groups()
('blah', '01', 'e25')
>>> re.match(r"(\w+)\.s(\d+)(e\d\d){3}", "blah.s01e24e25e26").groups()
('blah', '01', 'e26')
I want to do this in a single regex because I have multiple patterns to match TV episode filenames, and do not want to duplicate each expression to handle multiple episodes:
\w+\.s(\d+)e(\d+) # matches blah.s01e01
\w+\.s(\d+)e(\d+)e(\d+) # matches blah.s01e01e02
\w+\.s(\d+)e(\d+)e(\d+)e(\d+) # matches blah.s01e01e02e03
\w+ - \d+x\d+ # matches blah - 01x01
\w+ - \d+x\d+x\d+ # matches blah - 01x01x02
\w+ - \d+x\d+x\d+x\d+ # matches blah - 01x01x02x03
..and so on for numerous other patterns.
Another thing to complicate matters: I wish to store these regexes in a config file, so a solution using multiple regexes and function calls is not desired. If this proves impossible, I'll just allow the user to add simple regexes.
Basically, is there a way to capture a repeating pattern using regex?

Do it in two steps: one to find the whole season/episode run, then one to split out the numbers:
import re
def get_pieces(s):
    # Error checking omitted!
    whole_match = re.search(r'\w+\.(s\d+(?:e\d+)+)', s)
    return re.findall(r'\d+', whole_match.group(1))
print(get_pieces(r"blah.s01e01"))
print(get_pieces(r"blah.s01e01e02"))
print(get_pieces(r"blah.s01e01e02e03"))
# prints:
# ['01', '01']
# ['01', '01', '02']
# ['01', '01', '02', '03']

The number of captured groups equals the number of parenthesized groups in the pattern, no matter how often a group repeats. Look at findall or finditer to solve your problem.

Non-capturing parentheses:
(?:asdfasdg)
which can also be made optional:
(?:adsfasdf)?
c = re.compile(r"""(\w+)\.s(\d+)
    (?:
        e(\d+)
        (?:
            e(\d+)
        )?
    )?
    """, re.X)
or
c = re.compile(r"(\w+)\.s(\d+)(?:e(\d+)(?:e(\d+))?)?")
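As a rough check of how the nested optional groups behave, the sketch below runs the one-line form against names with zero, one, and two episodes. The dot before s is escaped here, and the sample names and the captured helper are made up for illustration; groups that did not take part in the match come back as None.

```python
import re

# Nested optional non-capturing groups: each e(\d+) is only attempted
# if the one before it matched.
c = re.compile(r"(\w+)\.s(\d+)(?:e(\d+)(?:e(\d+))?)?")

def captured(name):
    """Return the groups that actually matched, dropping the None placeholders."""
    return [g for g in c.match(name).groups() if g is not None]

print(captured("blah.s01"))        # ['blah', '01']
print(captured("blah.s01e24"))     # ['blah', '01', '24']
print(captured("blah.s01e24e25"))  # ['blah', '01', '24', '25']
```

The pattern is capped at however many optional groups are written out, so a third episode (e26) would simply be left unmatched.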

After thinking about the problem, I think I have a simpler solution, using named groups.
The simplest regex a user (or I) could use is:
(\w+)\.s(\d+)e(\d+)
The filename parsing class will take the first group as the show name, the second as the season number, and the third as the episode number. This covers the majority of files.
I'll allow a few different named groups for these:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<episodenumber>\d+)
To support multiple episodes, I'll support two named groups, something like startingepisodenumber and endingepisodenumber, to handle things like showname.s01e01-03:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)
And finally, allow named groups with names matching episodenumber\d+ (episodenumber1, episodenumber2, etc.):
(?P<showname>\w+)\.
s(?P<seasonnumber>\d+)
e(?P<episodenumber1>\d+)
e(?P<episodenumber2>\d+)
e(?P<episodenumber3>\d+)
It may still require duplicating the patterns for different numbers of e01s, but there will never be a file with two non-consecutive episodes (like show.s01e01e03e04), so using the starting/endingepisodenumber groups should solve this, and for weird cases users come across, they can use the episodenumber\d+ group names.
This doesn't really answer the sequence-of-patterns question, but it solves the problem that led me to ask it! (I'll still accept another answer that shows how to match s01e23e24...e27 in one regex, if someone works this out!)
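To make the named-group idea concrete, here is a minimal sketch of how a parser might consume such patterns. The function name parse_filename and the two sample patterns are hypothetical, and the patterns are written without a dot before e so that they match names like showname.s01e01-03; it is not the actual parsing class, just one way the group names could be interpreted.

```python
import re

def parse_filename(pattern, filename):
    """Return (showname, season, [episodes]), or None if the pattern does not match."""
    match = re.match(pattern, filename)
    if match is None:
        return None
    groups = match.groupdict()
    if groups.get("startingepisodenumber") is not None:
        # Range form, e.g. show.s01e01-03 -> episodes 1, 2, 3
        start = int(groups["startingepisodenumber"])
        end = int(groups["endingepisodenumber"])
        episodes = list(range(start, end + 1))
    else:
        # Single "episodenumber" group, or numbered episodenumber1, episodenumber2, ...
        keys = [k for k, v in groups.items()
                if re.fullmatch(r"episodenumber\d*", k) and v is not None]
        keys.sort(key=lambda k: int(k[len("episodenumber"):] or 0))
        episodes = [int(groups[k]) for k in keys]
    return groups["showname"], int(groups["seasonnumber"]), episodes

# Hypothetical config entries, one per filename style
RANGE = (r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
         r"e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)")
MULTI = (r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
         r"e(?P<episodenumber1>\d+)e(?P<episodenumber2>\d+)")

print(parse_filename(RANGE, "showname.s01e01-03"))  # ('showname', 1, [1, 2, 3])
print(parse_filename(MULTI, "blah.s01e24e25"))      # ('blah', 1, [24, 25])
```

Because the interpretation lives in the group names, the config file only needs to hold plain regex strings.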

Perhaps something like this?
import re
def episode_matcher(filename):
    m1 = re.match(r"(?i)(.*?)\.s(\d+)((?:e\d+)+)", filename)
    if m1:
        m2 = re.findall(r"\d+", m1.group(3))
        return m1.group(1), m1.group(2), m2
    # auto return None here
>>> episode_matcher("blah.s01e02")
('blah', '01', ['02'])
>>> episode_matcher("blah.S01e02E03")
('blah', '01', ['02', '03'])

Related

capture pattern_X repeatedly, then capture pattern_Y once, then repeat until EOS

[Update:] The accepted answer suggests that this cannot be done with the Python re library in one step. If you know otherwise, please comment.
I'm reverse-engineering a massive ETL pipeline, and I'd like to extract the full data lineage from stored procedures and views.
I'm struggling with the following regexp.
import re
select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"
# this captures every occurrence of the source fields, but not the target
okay_pattern = r"(?i)((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?"
# this captures the target too, but captures only the first input field
wrong_pattern = r"(?i)((((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?AS (`[a-z0-9_]+)`).*?)"
re.findall(okay_pattern, select_clause)
re.findall(wrong_pattern, select_clause)
TLDR: I'd like to capture
[aaa, bbb, XXX],
[eee, fff, ..., ooo, YYY],
[ppp, ZZZ]
from a string like
"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."
where a, b, e, f, o, p match one pattern, X, Y, Z match another, and the first pattern might occur up to ~20 times before the second one appears, which always appears alone.
I'm open to solutions using the sqlglot, sql-metadata, or sqlparse libraries as well; it's just that regex is better documented.
(Probably I'm code golfing, and I should do this in several steps, starting with splitting the string into individual expressions.)
You may use this regex with three capturing groups and one non-capturing group:
(\w+)\.+(\w+)(?:\.+(\w+))?
Code:
import re
s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....'
print(re.findall(r'(\w+)\.+(\w+)(?:\.+(\w+))?', s))
Output:
[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]
Here are two regexes, one to group things by the outside pattern, and one for the inside:
(.*?)(XXX|YYY|ZZZ)
(aaa|bbb|ccc|ddd|eee|fff|ggg)
What I would suggest is matching the whole string with the first regex, and then running the second regex on whatever the first regex's (.*?) group captured.
By using these two regexes, your matches will be grouped first by the outer pattern and then by the inner pattern, but the regexes themselves don't have to be overly complicated.
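The two-regex approach above could be wired together roughly as follows. This is only a sketch: the function name group_sources_by_target is made up, and the lowercase/uppercase stand-in patterns are assumptions chosen to fit the toy string from the question, not the real SQL patterns.

```python
import re

def group_sources_by_target(text, source_pat, target_pat):
    """Return [[source, ..., target], ...], one list per target found in text."""
    # Outer pass: lazily take everything up to the next target.
    outer = re.compile(r"(.*?)(" + target_pat + r")", re.S)
    groups = []
    for chunk, target in outer.findall(text):
        # Inner pass: pull every source out of the text preceding this target.
        groups.append(re.findall(source_pat, chunk) + [target])
    return groups

s = "...aaa....bbb...XXX....eee...fff...ooo...YYY...ppp...ZZZ...."
print(group_sources_by_target(s, r"[a-z]{3}", r"[A-Z]{3}"))
# [['aaa', 'bbb', 'XXX'], ['eee', 'fff', 'ooo', 'YYY'], ['ppp', 'ZZZ']]
```

Unlike a single pattern with a fixed number of groups, the inner findall places no limit on how many sources precede each target.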

Regex in python repetition Error

I want my code to return [('22', '254', '15', '36')], but I get [('15', '36')]. Maybe the repeated part of my regex, (?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}, is not being run 3 times?
import re
def fun(st):
    print(re.findall(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)", st))
ip = "22.254.15.36"
fun(ip)
Overview
As I mentioned in the comments below your question, most regex engines only capture the last match. So when you do (...){3}, only the last match is captured: E.g. (.){3} used against abc will only return c.
Also, note that changing your regex to (2[0-4]\d|25[0-5]|[01]?\d{1,2}) performs much better and catches full numbers (currently you'll grab 25 instead of 255 on the last octet for example - unless you anchor it to the end).
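Both points are easy to confirm interactively. The snippet below is just an illustration; the variable names old_octet and new_octet are made up, and each alternation is tested on its own, unanchored.

```python
import re

# A repeated capturing group keeps only its final iteration.
print(re.match(r"(.){3}", "abc").groups())  # ('c',)

old_octet = r"[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?"
new_octet = r"2[0-4]\d|25[0-5]|[01]?\d{1,2}"

# Unanchored, the original alternation stops after "25"; the reordered one
# tries the 200-255 branches first and takes all three digits.
old_hit = re.match(old_octet, "255").group()
new_hit = re.match(new_octet, "255").group()
print(old_hit, new_hit)  # 25 255
```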
To give you a fully functional regex for capturing each octet of the IP:
(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})
Personally, however, I'd separate the logic from the validation. The code below first validates the format of the string and then checks whether or not the logic (no octets greater than 255) passes while splitting the string on the '.' character.
Code
import re
ip='22.254.15.36'
if re.match(r"(?:\d{1,3}\.){3}\d{1,3}$", ip):
    print([octet for octet in ip.split('.') if int(octet) < 256])
Result: ['22', '254', '15', '36']
If you're using this method to extract IPs from an arbitrary string, you can replace re.match() with re.search() or re.findall(). In that case you may want to remove $ and add some logic to ensure you're not matching special cases like 11.11.11.11.11: (?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)
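For the extraction case, a sketch along these lines may help. The names IP_CANDIDATE and extract_ips and the sample text are hypothetical; the regex only finds dotted-quad candidates, and the range check stays in plain Python.

```python
import re

# Format check only: four dot-separated groups of 1-3 digits that are not
# part of a longer dotted sequence.
IP_CANDIDATE = re.compile(r"(?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)")

def extract_ips(text):
    """Return each plausible IP found in text as a list of octet strings."""
    ips = []
    for candidate in IP_CANDIDATE.findall(text):
        octets = candidate.split(".")
        # Logic check, kept separate from the regex: every octet must be < 256.
        if all(int(octet) < 256 for octet in octets):
            ips.append(octets)
    return ips

print(extract_ips("hosts 22.254.15.36, 11.11.11.11.11 and 300.1.1.1"))
# [['22', '254', '15', '36']]
```

Here 11.11.11.11.11 is rejected by the lookarounds and 300.1.1.1 by the range check.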
You only have two capturing groups in your regex:
(?:            # non-capturing group
    (          # group 1
        [0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
    )\.
){3}
(              # group 2
    [0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)
That the first group can be repeated 3 times doesn't make it capture 3 times. The regex engine will only ever return 2 groups, and the last match in a given group will fill that group.
If you want to capture each of the parts of an IP address into separate groups, you'll have to explicitly define groups for each:
import re
pattern = (
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
    r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)')
def fun(st, p=re.compile(pattern)):
    return p.findall(st)
You could avoid that much repetition with a little string and list manipulation:
octet = r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)'
pattern = r'\.'.join([octet] * 4)
Next, the pattern will just as happily match the 25 portion of 255. It's better to try the 200-255 range first, before the alternatives for smaller numbers:
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)
This still allows leading 0 digits, by the way.
If all you are doing is passing in single IP addresses, then re.findall() is overkill; just use p.match() (matching only at the string start) or p.search(), and return the .groups() result if there is a match:
def fun(st, p=re.compile(pattern + '$')):
    match = p.match(st)
    return match and match.groups()
Note that no validation is done on the surrounding data, so if you are trying to extract IP addresses from a larger body of text you can't use re.match() and can't add the $ anchor, and the match could come from a larger number of octets (e.g. 22.22.22.22.22.22). You'd have to add some look-around assertions for that:
# only match an IP address if there is no indication that it is part of a larger
# set of octets; no leading or trailing dot or digits
pattern = r'(?<![\.\d])' + pattern + r'(?![\.\d])'
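Putting the pieces of this answer together, a self-contained version might look like the following sketch. The names guarded and find_ips and the sample text are made up for illustration.

```python
import re

# One octet, trying the 200-255 range before shorter numbers
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)

# Reject matches that are glued to a neighbouring dot or digit
guarded = re.compile(r'(?<![\.\d])' + pattern + r'(?![\.\d])')

def find_ips(text):
    """Return one 4-tuple of octet strings per IP address found in text."""
    return guarded.findall(text)

print(find_ips("a 22.254.15.36 b 22.22.22.22.22.22 c"))
# [('22', '254', '15', '36')]
```

The six-octet run produces no match at all, because every four-octet window inside it touches a dot or a digit on at least one side.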
I encountered a very similar issue.
I found two solutions, using the official documentation.
The answer by @ctwheels above did mention the cause of the problem, and I really appreciate it, but it did not provide a solution.
Even trying the lookbehind and the lookahead did not work.
First solution:
re.finditer
re.finditer iterates over match objects!
You can use each one's group() method!
>>> def fun(st):
...     pr = re.finditer(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)", st)
...     for p in pr:
...         print(p.group(), end="")
...
>>> fun(ip)
22.254.15.36
Or!
Another solution, haha: you can still use findall, but you'll have to make every group a non-capturing group! The main problem is not with findall itself but with the capturing groups whose contents findall returns (and a repeated group, as we know, keeps only its last match):
"re.findall:
...If one or more groups are present in the pattern, return a list of groups"
(Python 3.8 Manuals)
So:
>>> def fun(st):
...     print(re.findall(r"(?:(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)", st))
...
>>> fun(ip)
['22.254.15.36']
Have fun !

Regex to match repeated (unknown) substrings

I'm trying to find "laughter words" or similar such as hahaha, hihihi, hueheu within user messages. My current approach is as follows:
>>> substring_list = ['ha', 'ah', 'he', 'eh', 'hi', 'ih', 'ho', 'hu', 'hue']
>>> pattern_core = '|'.join(substring_list)
>>> self.regex_pattern = re.compile(r'\b[a-z]*(' + pattern_core + r'){2,}[a-z]*\b', re.IGNORECASE)
The [a-z]* allows for some leeway when it comes to typos (e.g., ahhahah). In principle this works reasonably well. The problem is that it needs to be maintained, in the sense that substring_list needs to be updated to match new forms of "laughter words" (e.g., adding xi); "laughter words" seem to vary quite noticeably between countries.
Now I wonder if I can somehow find words based on repeated patterns (of sizes, say, 2-4) without knowing the individual pattern. For example, hurrhurr contains hurr as repeated pattern. In the ideal case I can (a) match hurrhurr and (b) identify the core pattern hurr. I have no idea if this is possible with regular expressions.
This regex will do it:
\b[a-z]*?([a-z]{2,}?)\1+[a-z]*?\b
Usage:
self.regex_pattern = re.compile(r'\b[a-z]*?([a-z]{2,}?)\1+[a-z]*?\b', re.IGNORECASE)
The gist is similar to what you were doing, but the "core" is different. The heart of the regex is this piece:
([a-z]{2,}?)\1+
The logic is to find a group consisting of 2 or more letters, then match the same group (\1) one or more additional times.
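To cover both (a) and (b) from the question, the match object already holds everything needed: group(0) is the whole word and group(1) is the repeated core. A small sketch, where the names LAUGH and find_repeats and the sample message are made up for illustration:

```python
import re

LAUGH = re.compile(r'\b[a-z]*?([a-z]{2,}?)\1+[a-z]*?\b', re.IGNORECASE)

def find_repeats(message):
    """Return (whole_word, core_pattern) for every word with a repeated chunk."""
    return [(m.group(0), m.group(1)) for m in LAUGH.finditer(message)]

print(find_repeats("ahhahah that was hurrhurr funny"))
# [('ahhahah', 'ha'), ('hurrhurr', 'hurr')]
```

Note that the leeway on both sides also lets ordinary words with an internal repeat through (banana would match with core an), so some filtering on top is probably still needed.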
In the ideal case I can (a) match hurrhurr and (b) identify the core
pattern hurr. I have no idea if this is possible with regular expressions.
import re
string = """hahaha, huehue, heehee,
axaxaxax, x the theme, ------, hhxhhxhhx,
bananas, if I imagine, HahHaH"""
pattern = r"""
    (
        \b              # Match a word boundary...
        (
            [a-z]{2,}?  # Followed by a letter, 2 or more times, non-greedy...
        )               # Captured in group 2,
        \2+             # Followed by whatever matched group 2, one or more times...
        \b              # Followed by a word boundary.
    )                   # Captured in group 1.
"""
results = re.findall(pattern, string, re.X|re.I)
print(results)
--output:--
[('hahaha', 'ha'), ('huehue', 'hue'), ('heehee', 'hee'), ('axaxaxax', 'ax'), ('hhxhhxhhx', 'hhx'), ('HahHaH', 'Hah')]

How can I express 'repeat this part' in a regular expression?

Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using the Python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole string:
[\d()]+
But if you want to match the separate parts in separate groups, you can use re.findall with a specific regex based on your need, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
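The whole-string check and the findall extraction can also be combined. The sketch below is one possible arrangement, not something from the answers themselves: the names WHOLE, TOKEN and tokenize are made up, and the strict pattern is used for validation because the character class [\d()]+ would also accept malformed input such as 123)(432.

```python
import re

WHOLE = re.compile(r"^(?:\d|\(\d+\))+$")  # strict whole-string check
TOKEN = re.compile(r"\(\d+\)|\d+")        # one parenthesized or bare number

def tokenize(s):
    """Return the parts of s in order, or None if s is not made of digits and (digits)."""
    if not WHOLE.match(s):
        return None
    return TOKEN.findall(s)

print(tokenize("123(432)123(342)2348(34)"))
# ['123', '(432)', '123', '(342)', '2348', '(34)']
print(tokenize("123)(432"))  # None
```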
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: with a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A can't match the beginning of B, and vice versa), you can rewrite the sub-pattern as A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the number of steps needed to obtain a match and avoids a great part of the potential backtracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
Note: if you don't want to allow two consecutive numbers enclosed in parentheses, you only need to change the quantifier * to + inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
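As a sanity check that the rewrites accept the same strings as the plain alternation, the three forms can be compared side by side in Python. The constant names, the verdicts helper, and the sample inputs are made up for illustration; this only checks equivalence on a few inputs and does not measure the step counts.

```python
import re

# Three ways to accept runs of digits and (digits); only the amount of
# backtracking differs.
ALTERNATION = re.compile(r"^(?=.)(?:\d+|\(\d+\))*$")
UNROLLED = re.compile(r"^(?=.)\d*(?:\(\d+\)\d*)*$")
ATOMIC_EMULATED = re.compile(r"^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$")

def verdicts(s):
    """Return (alternation, unrolled, atomic-emulated) match results for s."""
    return tuple(bool(p.match(s)) for p in (ALTERNATION, UNROLLED, ATOMIC_EMULATED))

for sample in ("123(432)123(342)2348(34)", "(1)(2)", "42", "", "12(", "(12)x"):
    print(repr(sample), verdicts(sample))
```

The first three samples should be accepted by all three patterns and the last three rejected by all of them.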

Regex to ensure group match doesn't end with a specific character

I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a trailing '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile(r"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.
I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
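For completeness, here is the same pattern run against all four formats from the question, including the 0101 case that the original regex split in the wrong place. The helper name split_show is made up for illustration.

```python
import re

SHOW_INFO = re.compile(
    r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$',
    re.I,
)

def split_show(filename):
    """Return (show_name, episode_part), or None if the name does not fit."""
    m = SHOW_INFO.match(filename)
    return m.groups() if m else None

for name in ("Name.Of.Show.S01E01", "Name.Of.Show.0101",
             "Name.Of.Show.01x01", "Name.Of.Show.101"):
    print(split_show(name))
# ('Name.Of.Show', 'S01E01')
# ('Name.Of.Show', '0101')
# ('Name.Of.Show', '01x01')
# ('Name.Of.Show', '101')
```

Requiring the literal dot and anchoring with $ is what keeps the greedy first group from swallowing part of the episode number.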
So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.
It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
I believe this will do what you want:
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
If the last part never contains a dot: ^(.*)\.([^\.]+)$
