How can I express 'repeat this part' in a regular expression? - python

Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the two patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using Python's re module.

I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"

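As a quick check of the second pattern above, here is a minimal sketch using re.fullmatch, which makes the ^ and $ anchors unnecessary. The helper name is_valid and the extra test strings are illustrative assumptions, not part of the question.

```python
import re

# One digit per alternation step, so a run of digits can only be consumed one way.
PATTERN = re.compile(r'(?:\d|\(\d+\))+')

def is_valid(s):
    """Return True if s consists only of digits and parenthesized digit runs."""
    return PATTERN.fullmatch(s) is not None

print(is_valid("123(432)123(342)2348(34)"))  # True
print(is_valid("123(432"))                   # False: unclosed parenthesis
print(is_valid("12()34"))                    # False: empty parentheses
```
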
You can use a character class to match the whole string:
[\d()]+
But if you want to match the separate parts in separate groups, you can use re.findall with a special regex based on your needs, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
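To see how the character class differs from the repeated pattern, the following sketch compares both with re.fullmatch. The variable names and the malformed test string are assumptions made for illustration.

```python
import re

s = "123(432)123(342)2348(34)"

loose = re.compile(r'[\d()]+')           # any mix of digits and parentheses
strict = re.compile(r'(?:\d+\(\d+\))+')  # repeated "digits(digits)" units

both_accept = bool(loose.fullmatch(s)) and bool(strict.fullmatch(s))
loose_bad = bool(loose.fullmatch("))12(("))    # structure is not checked
strict_bad = bool(strict.fullmatch("))12(("))

print(both_accept)  # True
print(loose_bad)    # True
print(strict_bad)   # False
```

The character class is the simplest option when the input is trusted; the repeated pattern is the safer choice if malformed input is possible.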

You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match and avoids a large part of the potential backtracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
(Compare the number of steps needed with the previous version in a regex debugger to see what happens.)
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to change the quantifier * to + inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
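The patterns in this answer can be tried directly in Python. This is only a sketch: the variable names and the sample strings other than the one from the question are assumptions chosen to exercise each variant.

```python
import re

unrolled = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')
# (?=(...))\1 emulates an atomic group: the lookahead captures once, \1 consumes it.
atomic = re.compile(r'^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$')
# Variant that rejects two consecutive parenthesized numbers.
no_adjacent = re.compile(r'^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$')

samples = ["123(432)123(342)2348(34)", "(1)(2)", "(1)2(3)", "", "12(3"]
results = {
    s: (bool(unrolled.match(s)), bool(atomic.match(s)), bool(no_adjacent.match(s)))
    for s in samples
}
for s, flags in results.items():
    print(repr(s), flags)
```

Only the third pattern rejects "(1)(2)"; all three reject the empty string and the unclosed "12(3".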

Related

Regex in python repetition Error

In my code I want the answer [('22', '254', '15', '36')] but got [('15', '36')]. Maybe the (?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3} part of my regex is not being run 3 times?
import re
def fun(st):
    print(re.findall(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
ip="22.254.15.36"
print(fun(ip))
Overview
As I mentioned in the comments below your question, most regex engines only capture the last match. So when you do (...){3}, only the last match is captured: E.g. (.){3} used against abc will only return c.
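The "only the last match is captured" behaviour is easy to reproduce. The sketch below uses the (.){3} example from the paragraph above; the two workarounds shown after it are illustrative and use assumed variable names.

```python
import re

m = re.match(r'(.){3}', 'abc')
print(m.group(0))  # abc  (the whole match)
print(m.group(1))  # c    (only the last repetition is kept)

# Capturing the repetition as a whole keeps all three characters together.
whole = re.match(r'((?:.){3})', 'abc').group(1)
print(whole)  # abc

# One group per repetition keeps each character separately.
each = re.match(r'(.)(.)(.)', 'abc').groups()
print(each)  # ('a', 'b', 'c')
```
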
Also, note that changing your regex to (2[0-4]\d|25[0-5]|[01]?\d{1,2}) performs much better and catches full numbers (currently you'll grab 25 instead of 255 on the last octet for example - unless you anchor it to the end).
To give you a fully functional regex for capturing each octet of the IP:
(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})
Personally, however, I'd separate the logic from the validation. The code below first validates the format of the string and then checks whether or not the logic (no octets greater than 255) passes while splitting the string on the dot (.).
Code
import re
ip='22.254.15.36'
if re.match(r"(?:\d{1,3}\.){3}\d{1,3}$", ip):
    print([octet for octet in ip.split('.') if int(octet) < 256])
Result: ['22', '254', '15', '36']
If you're using this method to extract IPs from an arbitrary string, you can replace re.match() with re.search() or re.findall(). In that case you may want to remove $ and add some logic to ensure you're not matching special cases like 11.11.11.11.11: (?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)
You only have two capturing groups in your regex:
(?: # non-capturing group
( # group 1
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)\.
){3}
( # group 2
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)
That the first group can be repeated 3 times doesn't make it capture 3 times. The regex engine will only ever return 2 groups, and the last match in a given group will fill that group.
If you want to capture each of the parts of an IP address into separate groups, you'll have to explicitly define groups for each:
pattern = (
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)')
def fun(st, p=re.compile(pattern)):
    return p.findall(st)
You could avoid that much repetition with a little string and list manipulation:
octet = r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)'
pattern = r'\.'.join([octet] * 4)
Next, the pattern will just as happily match the 25 portion of 255. Better to put matching of the 200-255 range at the start over matching smaller numbers:
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)
This still allows leading 0 digits, by the way.
If all you are doing is passing in single IP addresses, then re.findall() is overkill; just use p.match() (matching only at the string start) or p.search(), and return the .groups() result if there is a match:
def fun(st, p=re.compile(pattern + '$')):
    match = p.match(st)
    return match and match.groups()
Note that no validation is done on the surrounding data, so if you are trying to extract IP addresses from a larger body of text you can't use re.match() and can't add the $ anchor, and the match could come from a longer run of octets (e.g. 22.22.22.22.22.22). You'd have to add some look-around operators for that:
# only match an IP address if there is no indication that it is part of a larger
# set of octets; no leading or trailing dot or digits
pattern = r'(?<![\.\d])' + pattern + r'(?![\.\d])'
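Putting the pieces of this answer together, extraction from a larger text might look like the sketch below. The function name find_ips and the sample sentences are assumptions for illustration; the octet pattern and look-arounds are the ones given above.

```python
import re

# Octet pattern from the answer above: try the 200-255 range first.
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
dotted = r'\.'.join([octet] * 4)
# Reject candidates that touch another digit or dot on either side.
ip_in_text = re.compile(r'(?<![.\d])' + dotted + r'(?![.\d])')

def find_ips(text):
    """Return one tuple of four octet strings per IPv4-looking address in text."""
    return ip_in_text.findall(text)

print(find_ips("host 22.254.15.36 replied"))     # [('22', '254', '15', '36')]
print(find_ips("not an IP: 22.22.22.22.22.22"))  # []
```

Note that the negative lookahead also rejects an address that is immediately followed by a sentence-ending dot, so real text may need a looser trailing check.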
I encountered a very similar issue.
I found two solutions, using the official documentation.
The answer of @ctwheels above did mention the cause of the problem, and I really appreciate it, but it did not provide a solution.
Even when trying the lookbehind and the lookahead, it did not work.
First solution:
re.finditer
re.finditer iterates over match objects !!
You can use each one's 'group' method !
>>> def fun(st):
    pr=re.finditer(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st)
    for p in pr:
        print(p.group(),end="")
>>> fun(ip)
22.254.15.36
Or !!!
Another solution haha: you can still use findall, but you'll have to make every group a non-capturing group! (The main problem is not with findall itself, but with the group function that findall uses, which, as we all know, only returns the last match.)
"re.findall:
...If one or more groups are present in the pattern, return a list of groups"
(Python 3.8 Manuals)
So:
>>> def fun(st):
    print(re.findall(r"(?:(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
>>> fun(ip)
['22.254.15.36']
Have fun !

several replacements in single regular expression

I'm trying to read a set of data from a file such that it can be cast to complex. The entries are of the form
line='0.2741564350068515+2.6100840481550604*^-10*I\n',
which is supposed to be rendered as
'(0.2741564350068515+2.6100840481550604e-10j)'.
Hence I need to insert the pair of parentheses and change the symbols for imaginary unit and exponential notation. My clumsy solution is to perform each substitution individually,
re.sub("\*\^","e",re.sub("[\.]{0,1}\*I","j)",re.sub("(^)","(",line))).strip(),
but this is not exactly readable, or sane. Is there a way to use a single regex to do this substitution?
It seems that you can do without a regex at all:
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(line.strip().replace("*^", "e").replace("*I", "j")))
# => (0.2741564350068515+2.6100840481550604e-10j)
A "funny" regex way showing how to use capturing groups and check what was captured in the replacement with a lambda:
import re
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(re.sub(r"(\*\^)|([.]?\*I)", lambda m: "e" if m.group(1) else "j", line.strip())))
# => (0.2741564350068515+2.6100840481550604e-10j)
If Group 1 ((\*\^)) was matched we replace with e, if Group 2 matched, replace with j.
Note that the {0,1} limiting quantifier means the same as the ? quantifier: 1 or 0 times.
The easiest way to do this with regex is to make a pattern that matches the whole number and captures all the important parts in capture groups:
(.*?)\*\^(.*?)\*I
This would capture 0.2741564350068515+2.6100840481550604 in group 1 and -10 in group 2, so substituting with (\1e\2j) will give you the expected result:
(0.2741564350068515+2.6100840481550604e-10j)
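In Python, that single substitution could be written as follows. This is a sketch that assumes every line has the exact shape shown in the question (an exponent marked with *^ followed by *I); the variable names are illustrative.

```python
import re

line = '0.2741564350068515+2.6100840481550604*^-10*I\n'

# Group 1: everything before "*^"; group 2: the exponent before "*I".
converted = re.sub(r'(.*?)\*\^(.*?)\*I', r'(\1e\2j)', line.strip())
print(converted)  # (0.2741564350068515+2.6100840481550604e-10j)

# complex() accepts the parenthesized string form.
value = complex(converted)
print(value.real, value.imag)
```

Lines whose imaginary part has no exponent would not match this pattern and would pass through unchanged, so they need separate handling.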

Regex Price Matching

I have a web scraper that scrapes prices; for that I need it to find the following prices in strings:
762,50
1.843,75
In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:
re.findall("\d+,\d+", string)[0]
Now I need to match both cases and my initial idea was this:
re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]
My idea was that the or operator would find either the first or the second form, but this doesn't work. Any suggestions?
No need to use an or; just add the first part as an optional group:
(?:\d+\.)?\d+,\d+
The ? after (?:\d+\.) makes that group optional.
The ?: indicates that the group should not be captured, just matched.
>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']
Also note that you have to escape the . (dot) that would match any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax)
In a regular expression, dot (.) matches any character (except newline, unless the DOTALL flag is set). Escape it to match . literally:
\d+\.\d+,\d+|\d+,\d+
   ^^
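The difference between the unescaped and the escaped dot can be checked with a small sketch; the test string 1x843,75 is made up for illustration.

```python
import re

unescaped = r'\d+.\d+,\d+'  # "." matches any character except a newline
escaped = r'\d+\.\d+,\d+'   # "\." matches only a literal dot

accepts_x = bool(re.fullmatch(unescaped, '1x843,75'))
rejects_x = not re.fullmatch(escaped, '1x843,75')
accepts_dot = bool(re.fullmatch(escaped, '1.843,75'))

print(accepts_x)    # True: "x" is accepted in place of the dot
print(rejects_x)    # True
print(accepts_dot)  # True
```
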
To match multiple leading digits, the regular expression should be:
>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75 123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']
NOTE: I used a non-capturing group because re.findall returns a list of groups if one or more groups are present in the pattern.
UPDATE
>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
... '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
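Once the prices are matched, they usually need converting to numbers. A possible sketch with the pattern from the UPDATE above and decimal.Decimal follows; the function name parse_prices and the sample sentence are assumptions.

```python
import re
from decimal import Decimal

PRICE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+')

def parse_prices(text):
    """Find prices like 1.843,75 and return them as Decimal values."""
    # Drop the thousands dots, then turn the decimal comma into a dot.
    return [Decimal(p.replace('.', '').replace(',', '.')) for p in PRICE.findall(text)]

print(parse_prices('Total 1.843,75 kr, shipping 762,50 kr'))
# [Decimal('1843.75'), Decimal('762.50')]
```

Decimal is used here rather than float because it keeps the price exactly as written.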
How about:
(\d+[,.]\d+(?:[.,]\d+)?)
Matches:
- some digits followed by , or . and some digits
OR
- some digits followed by , or . and some digits followed by , or . and some digits
It matches: 762,50 and 1.843,75 and 1,75
It will also match 1.843.75 are you OK with that?
I'd use this:
\d{1,3}(?:\.\d{3})*,\d\d
This will match numbers that have a dot as the thousands separator.
\d*\.?\d{3},\d{2}
This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.
Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.
import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result
text = 'The price is 762,50 kroner'
print(float_filter(text.split()))
yields
[762.5]
The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.
In [107]: import locale
In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'
In [109]: locale.atof('762,50')
Out[109]: 762.5
In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
In general, you have a set of zero or more XXX., followed by one or more XXX,, each up to 3 numbers, followed by two numbers (always). Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detection cases.
That looks like this:
matcher=r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'
re.findall(matcher, '1.843,75 762,50')
This detects a lot of boundary cases, but may not catch everything....

repeating multiple characters regex

Is there a way using a regex to match a repeating set of characters? For example:
ABCABCABCABCABC
ABC{5}
I know that's wrong. But is there anything to match that effect?
Update:
Can you use nested capture groups? So something like (?<cap>(ABC){5}) ?
Enclose the regex you want to repeat in parentheses. For instance, if you want 5 repetitions of ABC:
(ABC){5}
Or if you want any number of repetitions (0 or more):
(ABC)*
Or one or more repetitions:
(ABC)+
edit to respond to update
Parentheses in regular expressions do two things; they group together a sequence of items in a regular expression, so that you can apply an operator to an entire sequence instead of just the last item, and they capture the contents of that group so you can extract the substring that was matched by that subexpression in the regex.
You can nest parentheses; they are counted from the first opening paren. For instance:
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(0)
'123 ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(1)
'ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(2)
'DEF'
If you would like to avoid capturing when you are grouping, you can use (?:...). This can be helpful if you don't want the parentheses that you're only using to group a sequence (for the purpose of applying an operator) to change the numbering of your matches. It is also faster.
>>> re.search('[0-9]* (?:ABC(...))', '123 ABCDEF 456').group(1)
'DEF'
So to answer your update, yes, you can use nested capture groups, or even avoid capturing with the inner group at all:
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(1)
'ABCABCABCABCABC'
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(2)
'DEF'
ABC{5} matches ABCCCCC. To match 5 ABC's, you should use (ABC){5}. Parentheses are used to group a set of characters. You can also set an interval for occurrences like (ABC){3,5} which matches ABCABCABC, ABCABCABCABC, and ABCABCABCABCABC.
(ABC){1,} means 1 or more repetitions, which is exactly the same as (ABC)+.
(ABC){0,} means 0 or more repetitions, which is exactly the same as (ABC)*.
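These quantifier rules can be verified in Python with re.fullmatch. The sketch below uses non-capturing groups, since only grouping is needed; the variable names are illustrative.

```python
import re

c_only = bool(re.fullmatch(r'ABC{5}', 'ABCCCCC'))          # {5} applies to "C" only
grouped = bool(re.fullmatch(r'(?:ABC){5}', 'ABC' * 5))     # {5} applies to the group
too_few = bool(re.fullmatch(r'(?:ABC){3,5}', 'ABC' * 2))   # fewer than 3 repetitions
open_ended = bool(re.fullmatch(r'(?:ABC){1,}', 'ABC' * 7)) # same as (?:ABC)+

print(c_only, grouped, too_few, open_ended)  # True True False True
```
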
(ABC){5} Should work for you
Parentheses "()" are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group.
(ABC){5}
As to the update to the question-
You can nest capture groups. The capture group index is incremented per open paren.
(((ABC)*)(DEF)*)
Feeding that regex ABCABCABCDEFDEFDEF, capture group 0 matches the whole thing, 1 is also the whole thing, 2 is ABCABCABC, 3 is ABC, and 4 is DEF (because the star is outside of the capture group).
If you have variation inside a capture group and a repeat just outside, then things can get a little wonky if you're not expecting it...
(a[bc]*c)*
when fed abbbcccabbc will return the last match as capture group 1, in this example just the abbc, since the capture group gets reset with the repeat operator.
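The reset behaviour is reproducible in Python. As a sketch, the second half shows one way to get every repetition instead of only the last: drop the outer repeat and let re.findall iterate. The variable names are assumptions.

```python
import re

text = 'abbbcccabbc'

# The repeated group only remembers its final iteration.
last = re.fullmatch(r'(a[bc]*c)*', text).group(1)
print(last)  # abbc

# Without the outer repeat, findall returns each repetition in turn.
every = re.findall(r'a[bc]*c', text)
print(every)  # ['abbbccc', 'abbc']
```
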

Regex to ensure group match doesn't end with a specific character

I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a trailing '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile("^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.
I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.
It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
I believe this will do what you want:
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
If the last part never contains a dot: ^(.*)\.([^\.]+)$
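For reference, here is a sketch that runs the pattern from the first answer (required dot, anchored with $, case-insensitive) against all four formats in the question. The function name split_show is an assumption, not something from the thread.

```python
import re

SHOW_INFO = re.compile(
    r'^([0-9a-z.]+)'                                        # show name (greedy)
    r'\.'                                                   # required separator dot
    r'(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$',  # episode marker
    re.IGNORECASE,
)

def split_show(filename):
    """Return (name, episode), or None if the filename fits no known format."""
    m = SHOW_INFO.match(filename)
    return m.groups() if m else None

for name in ["Name.Of.Show.S01E01", "Name.Of.Show.0101",
             "Name.Of.Show.01x01", "Name.Of.Show.101"]:
    print(split_show(name))
```

Because the dot before the episode marker is mandatory and sits outside group 1, the greedy name group can no longer swallow it or the first digit of 0101.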
