I have the following regex (example is in Python):
pattern = re.compile(r'^(([a-zA-Z0-9]*[a-zA-Z]+)([\d]+)|([\d]+))$')
This correctly parses any string that has a numerical suffix and an optional prefix that is alphanumerics:
a123
a2a123
123
All will correctly see 123 as a suffix. It will correctly reject bad inputs:
abc
123abc
()123 # Or other non-alphanumerics
The regex itself is fairly unwieldy, though, and several of the capture groups are often empty as a result, meaning I have to go through the additional step of filtering them out. Is there a better way to think about this regex than "a number OR a number preceded by an alphanumeric string that ends in a letter"?
You may use
^[A-Za-z0-9]*?([0-9]+)$
Details
^ - start of string
[A-Za-z0-9]*? - any letters/digits, zero or more times, as few as possible (because this part is non-greedy, the next pattern, ([0-9]+), matches all the digits at the end of the string)
([0-9]+) - Group 1: one or more digits
$ - end of string.
In Python:
import re
s = "a2a123" # sample input
m = re.search(r'^[A-Za-z0-9]*?([0-9]+)$', s) # Or, see below
# m = re.match(r'[A-Za-z0-9]*?([0-9]+)$', s) # re.match only searches at the start of the string
# m = re.fullmatch(r'[A-Za-z0-9]*?([0-9]+)', s) # Only in Python 3.4+
if m:
    print(m.group(1))
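As a quick check, the lazy pattern can be run against the accepted and rejected inputs from the question. This is only a sketch; the helper name `numeric_suffix` and the constant `SUFFIX_RE` are made up for illustration.

```python
import re

# Same pattern as above: the lazy prefix leaves every trailing digit to group 1.
SUFFIX_RE = re.compile(r'^[A-Za-z0-9]*?([0-9]+)$')

def numeric_suffix(s):
    """Return the numeric suffix of s, or None if s is rejected (hypothetical helper)."""
    m = SUFFIX_RE.search(s)
    return m.group(1) if m else None

for s in ["a123", "a2a123", "123", "abc", "123abc", "()123"]:
    print(repr(s), "->", numeric_suffix(s))
```

The first three inputs should yield 123 and the last three None, with no empty groups to filter out.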
If you use a non-capturing group and make the whole prefix optional, the problem becomes much simpler.
pattern = re.compile(r'^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$')
There's only one capturing group (group 1), for the suffix; the alphanumeric prefix before it is not captured.
Alternatively, using named groups is another option, and it often makes long, structured regexes easier to maintain:
pattern = re.compile(r'^(?P<a>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')
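A short sketch of how the named groups might be read back. The group name `prefix` is my own choice in place of `a`, purely for readability; the rest of the pattern is unchanged.

```python
import re

# "prefix" is an illustrative name; the answer above calls this group "a".
pattern = re.compile(r'^(?P<prefix>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')

with_prefix = pattern.match("a2a123")
print(with_prefix.group("prefix"), with_prefix.group("suffix"))   # a2a 123

digits_only = pattern.match("123")
# An optional group that did not take part in the match is reported as None.
print(digits_only.groupdict())   # {'prefix': None, 'suffix': '123'}
```

Accessing groups by name means the calling code does not break if the pattern later gains or loses a group.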
I need to create the regex that will match such string:
AA+1.01*2.01,BB*2.01+1.01,CC
Order of * and + should be any
I've created the following regex:
^(([A-Z][A-Z](([*+][0-9]+(\.[0-9])?[0-9]?){0,2}),)*[A-Z]{2}([*+][0-9]+(\.[0-9])?[0-9]?){0,2})$
But the problem is that this regex allows + or * to be used twice, whereas each of them may appear only once, so the expected results are:
AA+1*2,CC - true
AA+1+2,CC - false (now is true with my regex)
AA*1+2,CC - true
AA*1*2,CC - false (now is true with my regex)
Either of the [+*] should be captured first and then use negative lookahead to match the other one.
Regex: [A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),[A-Z]{2}
Explanation:
[A-Z]{2} Matches two upper case letters.
([+*]) captures either of + or *.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
(?!\1)[+*] asserts that the next symbol is not the one already captured, then matches the other one. So if + was captured previously, only * will match here.
(?:\d+(?:\.\d+)?) matches number with optional decimal part.
,[A-Z]{2} matches , followed by two upper case letters.
To match the first case AA+1.01*2.01,BB*2.01+1.01,CC which is just a little advancement over previous pattern, use following regex.
Regex: (?:[A-Z]{2}([+*])(?:\d+(?:\.\d+)?)(?!\1)[+*](?:\d+(?:\.\d+)?),)+[A-Z]{2}
Explanation: The whole pattern except the final two letters (CC) is wrapped in a non-capturing group, and + repeats it to match one or more such items.
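The backreference-plus-lookahead idea can be checked against the four cases from the question. This is a sketch that assumes the whole string must match (hence `fullmatch`); the names `NUM`, `ITEM` and `is_valid` are illustrative only.

```python
import re

NUM = r'\d+(?:\.\d+)?'                       # number with optional decimal part
# One item: two letters, an operator (group 1), a number,
# then the *other* operator (enforced by (?!\1)) and a second number.
ITEM = r'[A-Z]{2}([+*])' + NUM + r'(?!\1)[+*]' + NUM
pattern = re.compile(r'(?:' + ITEM + r',)+[A-Z]{2}')

def is_valid(s):
    """True if the whole string consists of valid items followed by a two-letter code."""
    return pattern.fullmatch(s) is not None

for s in ["AA+1*2,CC", "AA+1+2,CC", "AA*1+2,CC", "AA*1*2,CC",
          "AA+1.01*2.01,BB*2.01+1.01,CC"]:
    print(s, is_valid(s))
```

Inside the repeated group, \1 always refers to the operator captured in the current item, so each item is checked independently.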
To get a regex to match your given example, extended to an arbitrary number of commas, you could use:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*$
Note that this example will also allow a trailing comma. I'm not sure if there is much you can do about that.
If the trailing comma is an issue:
^(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\1)[+*]?\d*\.?\d*,?)*?(?:[A-Z]{2}([+*])?\d*\.?\d*(?!\2)[+*]?\d*\.?\d*?)$
I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns me "(19561958-)". How can I achieve my purpose? Many thanks!
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d is written more concisely as \d{4}.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
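To make the difference concrete, here is a small sketch comparing the unanchored pattern with the word-boundary version on a string that also contains a longer number. The sample text and the variable names `loose` and `strict` are made up for illustration.

```python
import re

text = "(19561958) and 1234567890"

# Unanchored: also splits the first eight digits of the ten-digit number.
loose = re.sub(r'(\d{4})(\d{4})', r'\1-\2', text)
print(loose)    # (1956-1958) and 1234-567890

# Word boundaries: only a standalone run of exactly eight digits is rewritten.
strict = re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', text)
print(strict)   # (1956-1958) and 1234567890
```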
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".
The 2nd group is referenced with \2.
You could use capturing groups or lookarounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4} matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4}) Positive lookbehind which asserts that the match must be preceded by ( and four digit characters.
(?=\d{4}\)) Positive lookahead which asserts that the match must be followed by four digits and a ) symbol.
Here a zero-width position between the two years is matched. Replacing that position with - gives the desired output.
I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing and also the problem here is I'm not checking for white spaces, tabs and new line characters and also hard-coded the number for integers before and after decimal.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method, which anchors at the start, you ensure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with '^[0-9][0-9][.]' because that will only match strings with exactly two digits at the beginning. A second issue with your regex is at the end: [0-9][-]* only allows hyphens after the last digit; if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
import re
testString = "87.98-A1-help"
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')  # raw string keeps \. intact
checkMatch = regCompiled.match(testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
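The two points above can be combined with the \S* idea from the other answer into one strict check. This is a sketch under the assumption that the whole string must conform and contain no whitespace; `FORMAT_RE` and `is_valid` are illustrative names.

```python
import re

# fullmatch anchors both ends, so no ^, $ or \Z is needed in the pattern.
# \S* (instead of .*) rejects spaces, tabs and newlines in the tail.
FORMAT_RE = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*')

def is_valid(s):
    """True if s looks like 999.999-A9-<anything without whitespace>."""
    return FORMAT_RE.fullmatch(s) is not None

for s in ["87.98-A8-abcdef", "123.456-A9-Won", "1.98-A8-x",
          "87.98-A8-has space", "87.98-B8-x", "87.98-A8-x\n"]:
    print(repr(s), is_valid(s))
```

Only the first two samples should be accepted: the others fail on the digit count, the whitespace, the letter after the hyphen, or the trailing newline.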
I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} three repetitions of the preceding element
(...)* zero or more repetitions of the group in parentheses
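A minimal sketch that runs this pattern over the examples from the question plus a few malformed inputs; the name `TRIPLES_RE` and the extra test strings are my own additions.

```python
import re

TRIPLES_RE = re.compile(r'^[abc]{3}(,[abc]{3})*$')

samples = ["abc,bca,cbb", "ccc,abc,aab,baa", "bcb",   # valid examples from the question
           "abc,defx,df", "abc,", "abcabc"]           # invalid: bad letters, trailing comma, missing comma

for s in samples:
    print(repr(s), bool(TRIPLES_RE.match(s)))
```

Because the comma belongs to every triple except the first, a trailing comma or a missing separator is rejected without any extra logic.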
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check if the string matches pattern will be
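data_string = "abc,bca,df"
# The trailing $ forces the whole string to match; without it a valid prefix such as "abc,bca," is enough.
# Note that a trailing comma is still accepted by this pattern.
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")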
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, and re.match only requires this at the start of the string; whatever follows is ignored. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
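Looking at what the original pattern actually consumed makes this visible. The sketch below uses the failing input from the question; the variable names are illustrative.

```python
import re

prefix_match = re.match(r'([abc][abc][abc],)+', "abc,defx,df")
print(prefix_match.group(0))   # abc,   (only the valid prefix was consumed)

# Anchoring the end, and not requiring a comma after the last triple, rejects it.
anchored_match = re.match(r'([abc]{3},)*[abc]{3}$', "abc,defx,df")
print(anchored_match)          # None
```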
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group has unwelcome side effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example: re.findall, and every other regex method that uses this function behind the scenes, returns only the captured substrings if there is a capturing group in the pattern.
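The re.findall behaviour is easy to reproduce with the pattern from this answer. A small sketch; the variable names are illustrative.

```python
import re

s = "abc,bca,cbb"

# With a capturing group, findall returns only the last text captured by group 1.
with_capture = re.findall(r'[abc]{3}(,[abc]{3})*', s)
print(with_capture)      # [',cbb']

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'[abc]{3}(?:,[abc]{3})*', s)
print(without_capture)   # ['abc,bca,cbb']
```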
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.