repeating multiple characters regex - python

Is there a way using a regex to match a repeating set of characters? For example:
ABCABCABCABCABC
ABC{5}
I know that's wrong. But is there anything to match that effect?
Update:
Can you use nested capture groups? So Something like (?<cap>(ABC){5}) ?

Enclose the regex you want to repeat in parentheses. For instance, if you want 5 repetitions of ABC:
(ABC){5}
Or if you want any number of repetitions (0 or more):
(ABC)*
Or one or more repetitions:
(ABC)+
edit to respond to update
Parentheses in regular expressions do two things; they group together a sequence of items in a regular expression, so that you can apply an operator to an entire sequence instead of just the last item, and they capture the contents of that group so you can extract the substring that was matched by that subexpression in the regex.
You can nest parentheses; they are counted from the first opening paren. For instance:
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(0)
'123 ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(1)
'ABCDEF'
>>> re.search('[0-9]* (ABC(...))', '123 ABCDEF 456').group(2)
'DEF'
If you would like to avoid capturing when you are grouping, you can use (?:. This can be helpful if you don't want parentheses that you're just using to group together a sequence for the purpose of applying an operator to change the numbering of your matches. It is also faster.
>>> re.search('[0-9]* (?:ABC(...))', '123 ABCDEF 456').group(1)
'DEF'
So to answer your update, yes, you can use nested capture groups, or even avoid capturing with the inner group at all:
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(1)
'ABCABCABCABCABC'
>>> re.search('((?:ABC){5})(DEF)', 'ABCABCABCABCABCDEF').group(2)
'DEF'

ABC{5} matches ABCCCCC. To match 5 ABC's, you should use (ABC){5}. Parentheses are used to group a set of characters. You can also set an interval for occurrences like (ABC){3,5} which matches ABCABCABC, ABCABCABCABC, and ABCABCABCABCABC.
(ABC){1,} means 1 or more repetition which is exactly the same as (ABC)+.
(ABC){0,} means 0 or more repetition which is exactly the same as (ABC)*.

(ABC){5} Should work for you

Parentheses "()" are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group.
(ABC){5}

As to the update to the question-
You can nest capture groups. The capture group index is incremented per open paren.
(((ABC)*)(DEF)*)
Feeding that regex ABCABCABCDEFDEFDEF, capture group 0 matches the whole thing, 1 is also the whole thing, 2 is ABCABCABC, 3 is ABC, and 4 is DEF (because the star is outside of the capture group).
If you have variation inside a capture group and a repeat just outside, then things can get a little wonky if you're not expecting it...
(a[bc]*c)*
when fed abbbcccabbc will return the last match as capture group 1, in this example just the abbc, since the capture group gets reset with the repeat operator.

Related

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3. 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote witch successfully matches the first type but does not have the conditional. I was thinking about doing something like match group range of non digits after hyphen but can't quite get it to work and don't not know to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall('[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
print(i)
print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a regex group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:..) that captures \.\d+ a dot followed by a sequence of one or more digits. We repeat such subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a capture group that starts with a hyphen [-] and then one or more [A-Za-z] characters. The capture group is however optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

several replacements in single regular expression

I'm trying to read a set of data from a file such that it can be cast to complex. The entries are of the form
line='0.2741564350068515+2.6100840481550604*^-10*I\n',
which is supposed to be rendered as
'(0.2741564350068515+2.6100840481550604e-10j)'.
Hence I need to insert the pair of parentheses and change the symbols for imaginary unit and exponential notation. My clumsy solution is to perform each substitution individually,
re.sub("\*\^","e",re.sub("[\.]{0,1}\*I","j)",re.sub("(^)","(",line))).strip(),
but this is not exactly readable, or sane. Is there a way to use a single regex to do this substitution?
It seems that you can do without a regex at all:
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(line.strip().replace("*^", "e").replace("*I", "j")))
# => (0.2741564350068515+2.6100840481550604e-10j)
See the IDEONE demo
A "funny" regex way showing how to use capturing groups and check what was captured in the replacement with a lambda:
import re
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(re.sub(r"(\*\^)|([.]?\*I)", lambda m: "e" if m.group(1) else "j", line.strip())))
# => (0.2741564350068515+2.6100840481550604e-10j)
If Group 1 ((\*\^)) was matched we replace with e, if Group 2 matched, replace with j.
Note that {0,1} limiting quantifier means the same as ? quantifier - 1 or 0 times.
The easiest way to do this with regex is to make a pattern that matches the whole number and captures all the important parts in capture groups:
(.*?)\*\^(.*?)\*I
This would capture 0.2741564350068515+2.6100840481550604 in group 1 and 10 in group 2, so substituting with (\1e\2j) will give you the expected result:
(0.2741564350068515+2.6100840481550604e-10j)

How can I express 'repeat this part' in a regular expression?

Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole of string :
[\d()]+
But if you want to match the separate parts in separate groups you can use re.findall with a spacial regex based on your need, for example :
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the patter \d+\(\d+\) repeatedly you can use following regex :
(?:\d+\(\d+\))+
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
demo
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: With a bactracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match, it avoids a great part of the eventual bactracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
Note: if you don't want two consecutive numbers enclosed in parenthesis, you only need to change the quantifier * with + inside the non-capturing group and to add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$

Regular expression - number with spaces and decimal comma

I'd like to write a regular expression for following type of strings in Pyhton:
1 100
1 567 865
1 474 388 346
i.e. numbers separated from thousand. Here's my regexp:
r"(\d{1,3}(?:\s*\d{3})*)
and it works fine. However, I also wanna parse
1 100,34848
1 100 300,8
19 328 383 334,23499
i.e. separated numbers with decimal digits. I wrote
rr=r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?\s
It doesn't work. For instance, if I make
sentence = "jsjs 2 222,11 dhd"
re.findall(rr, sentence)
[('2 222', ',11')]
Any help appreciated, thanks.
This works:
import re
rr=r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)"
sentence = "jsjs 2 222,11 dhd"
print re.findall(rr, sentence) # prints ['2 222,11']
TL;DR: This regular expresion will print ['2 222,11 ']
r"(?:\d{1,3}(?:\s*\d{3})*)(?:,\d+)?"
The result of the search are expresions in parentheses except those starting (?: or whole expresion if the're aren't any subexpresion
So in your first regex it will match your string and return the whole expresion, since there aren't subexpressions (the only parenteses starts with (?:)
In the second it will find the string 2 222,11 and match it, then it looks at subexpresions ((\d{1,3}(?:\s*\d{3})*) and (,\d+), and will return tuple containing those: namely part before decimal comma, and the part after
So to fix your expresion, you'll need to either add to all parentheses ?: or remove them
Also the last \s is redundant as regexes always match as much characters as possible - meaning it will match all numbers after comma
The only problem with your result is that you're getting two match groups instead of one. The only reason that's happening is that you're creating two capture groups instead of one. You're putting separate parentheses around the first half and the second half, and that's what parentheses mean. Just don't do that, and you won't have that problem.
So, with this, you're half-way there:
(\d{1,3}(?:\s*\d{3})*,\d+)\s
Debuggex Demo
The only problem is that the ,\d+ part is now mandatory instead of optional. You obviously need somewhere to put the ?, as you were doing. But without a group, how do you do that? Simple: you can use a group, just make it a non-capturing group ((?:…) instead of (…)). And put it inside the main capturing group, not separate from it. Exactly as you're already doing for the repeated \s*\d{3} part.
(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)\s
Debuggex Demo

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Categories

Resources