I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns me "(19561958-)". How can I achieve my purpose? Many thanks!
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d is written more concisely as \d{4}.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".
The 2nd group is referenced with \2.
You could use capturing groups or look arounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4} matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4}) Positive lookbehind which asserts that the match must be preceded by ( and four digit characters.
(?=\d{4}\)) Posiitve lookahead which asserts that the match must be followed by 4 digits plus ) symbol.
Here a boundary got matched. Replacing the matched boundary with - will give you the desired output.
Related
I have three classes of characters, say letters [A-Za-z], numbers [0-9], and symbols [!##$]. The particular symbols don't really matter for the sake of the argument. I would like to use a regular expression in Python so that I can select all permutations on these three classes with a length of three, without repetition.
For example, the following would successfully match:
a1!
4B_
*x7
And the following would fail:
ab!
BBB
*x_
a1!B
How would I go about this without explicitly writing out each potential permutation of classes in my regular expression?
I have previously attempted the following solution:
import re
regex = r"""
([A-Za-z]|[0-9]|[!##$])
(?!\1) ([A-Za-z]|[0-9]|[!##$])
(?![\1\2])([A-Za-z]|[0-9]|[!##$])
"""
s = "ab1"
re.fullmatch(regex, s, re.VERBOSE)
However the string ab1 is improperly matched. This is because the group references \1 and \2 refer to the actual matched contents of the groups, and not to the regular expressions contained within the groups.
Then, how do I refer to the regular expressions contained within prior matched groups, and not to their actual contents?
Your main problem is that you can't use backreferences to negate a part of a pattern, you may only use them to match/negate the same values as captured in the corresponding capturing group.
Note [^\1] matches any char other than the \x01 char, not any char other than what Group 1 holds, because, inside character classes, backreferences are no longer reated as such.
ab1 is matched becauie b is not equal to a and 1 is not equal to a and 1.
What you can use is a sequence of negative lookaheads that would "exclude" matching under certain conditions, like the string cannot have two digits/letters/special chars.
rx = re.compile(r"""
(?!(?:[\W\d_]*[^\W\d_]){2}) # no two letters allowed
(?!(?:\D*\d){2}) # no two digits allowed
(?!(?:[^_!#\#$*]*[_!#\#$*]){2}) # no two special chars allowed
[\w!#\#$*]{3} # three allowed chars
""", re.ASCII | re.VERBOSE)
See the regex demo. The negated character classes are replaced with .* in the demo since the test is performed against a single multiline text and not separate strings.
See the Python demo:
import re
passes = ['a1!','4B_','*x7']
fails = ['ab!','BBB','*x_','a1!B']
rx = re.compile(r"""
(?!(?:[\W\d_]*[^\W\d_]){2}) # no two letters allowed
(?!(?:\D*\d){2}) # no two digits allowed
(?!(?:[^_!#\#$*]*[_!#\#$*]){2}) # no two special chars allowed
[\w!#\#$*]{3} # three allowed chars
""", re.ASCII | re.VERBOSE)
for s in passes:
print(s, ' should pass, result:', bool(rx.fullmatch(s)))
for s in fails:
print(s, ' should fail, reuslt:', bool(rx.fullmatch(s)))
Output:
a1! should pass, result: True
4B_ should pass, result: True
*x7 should pass, result: True
ab! should fail, reuslt: False
BBB should fail, reuslt: False
*x_ should fail, reuslt: False
a1!B should fail, reuslt: False
A straightforward solution is to not write out the permutations yourself, but to let Python do it with the help of itertools.
from itertools import permutations
patterns = [
'[a-zA-Z]',
'[0-9]',
'[!##$]'
]
regex = '|'.join(
''.join(p)
for p in permutations(patterns)
)
I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3. 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote witch successfully matches the first type but does not have the conditional. I was thinking about doing something like match group range of non digits after hyphen but can't quite get it to work and don't not know to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall('[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
print(i)
print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a regex group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:..) that captures \.\d+ a dot followed by a sequence of one or more digits. We repeat such subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a capture group that starts with a hyphen [-] and then one or more [A-Za-z] characters. The capture group is however optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]
I need to create the regex that will match such string:
AA+1.01*2.01,BB*2.01+1.01,CC
Order of * and + should be any
I've created the following regex:
^(([A-Z][A-Z](([*+][0-9]+(\.[0-9])?[0-9]?){0,2}),)*[A-Z]{2}([*+][0-9]+(\.[0-9])?[0-9]?){0,2})$
But the problem is that with this regex + or * could be used twice but I only need any of them once so the following strings matches should be:
AA+1*2,CC - true
AA+1+2,CC - false (now is true with my regex)
AA*1+2,CC - true
AA*1*2,CC - false (now is true with my regex)
Either of the [+*] should be captured first and then use negative lookahead to match the other one.
Regex: [A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),[A-Z]{2}
Explanation:
[A-Z]{2} Matches two upper case letters.
([+*]) captures either of + or *.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
(?!\1)[+*] looks ahead for symbol captured and matched the other one. So if + is captured previously then * will be matched.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
,[A-Z]{2} matches , followed by two upper case letters.
Regex101 Demo
To match the first case AA+1.01*2.01,BB*2.01+1.01,CC which is just a little advancement over previous pattern, use following regex.
Regex: (?:[A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),)+[A-Z]{2}
Explanation: Added whole pattern except ,CC in first group and made it greedy by using + to match one or more such patterns.
Regex101 Demo
To get a regex to match your given example, extended to an arbitrary number of commas, you could use:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*$
Note that this example will also allow a trailing comma. I'm not sure if there is much you can do about that.
Regex 101 Example
If the trailing comma is an issue:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*?(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\2)[+*]?\d*\.?\d*?)$
Regex 101 Example
I'm a Python beginner, so keep in mind my regex skills are level -122.
I need to convert a string with text containing file1 to file01, but not convert file10 to file010.
My program is wrong, but this is the closest I can get, I've tried dozens of combinations but I can't get close:
import re
txt = 'file8, file9, file10'
pat = r"[0-9]"
regexp = re.compile(pat)
print(regexp.sub(r"0\d", txt))
Can someone tell me what's wrong with my pattern and substitution and give me some suggestions?
You could capture the number and check the length before adding 0, but you might be able to use this instead:
import re
txt = 'file8, file9, file10'
pat = r"(?<!\d)(\d)(?=,|$)"
regexp = re.compile(pat)
print(regexp.sub(r"0\1", txt))
regex101 demo
(?<! ... ) is called a negative lookbehind. This prevents (negative) a match if the pattern after it has the pattern in the negative lookbehind matches. For example, (?<!a)b will match all b in a string, except if it has an a before it, meaning bb, cb matches, but ab doesn't match. (?<!\d)(\d) thus matches a digit, unless it has another digit before it.
(\d) is a single digit, enclosed in a capture group, denoted by simple parentheses. The captured group gets stored in the first capture group.
(?= ... ) is a positive lookahead. This matches only if the pattern inside the positive lookahead matches after the pattern before this positive lookahead. In other words, a(?=b) will match all a in a string only if there's a b after it. ab matches, but ac or aa don't.
(?=,|$) is a positive lookahead containing ,|$ meaning either a comma, or the end of the string.
(?<!\d)(\d)(?=,|$) thus matches any digit, as long as there's no digit before it and there's a comma after it, or if that digit is at the end of the string.
how about?
a='file1'
a='file' + "%02d" % int(a.split('file')[1])
This approach uses a regex to find every sequence of digits and str.zfill to pad with zeros:
>>> txt = 'file8, file9, file10'
>>> re.sub(r'\d+', lambda m : m.group().zfill(2), txt)
'file08, file09, file10'
I'm attempting to string match 5-digit coupon codes spread throughout a HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
if they can occur at the very beginning or the very end, it's easier to pad the string than mess with special cases
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Without padding the string for special case start and end of string, as in John La Rooy answer one can use the negatives lookahead and lookbehind to handle both cases with a single regular expression
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: There is problem in using \D since \D matches any character that is not a digit , instead use \b.
\b is important here because it matches the word boundary but only at end or beginning of a word .
import re
input = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", input)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output : ['56789', '01234']
\D is unable to handle comma or any continuously entered numerals.
\b is important part here it matches the empty string but only at end or beginning of a word .
More documentation: https://docs.python.org/2/library/re.html
More Clarification on usage of \D vs \b:
This example uses \D but it doesn't capture all the five digits number.
This example uses \b while capturing all five digits number.
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use Regex with easier expression :
re.findall(r"\d{5}", mystring)
It will research 5 numerical digits. But you have to be sure not to have another 5 numerical digits in the string