Regular expression for repeating sequence - python

I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems to check only the first group of three letters and to ignore the rest. How do I write this regular expression correctly?

Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ matches from the start to the end of the string
[...] matches one of the given characters
...{3} repeats the preceding element three times
(...)* repeats the group in the parentheses zero or more times
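A minimal sketch of how the pattern above could be used from Python. The helper name is_valid is made up for illustration and is not part of the original answer.

```python
import re

# Pattern from the answer above: one triple, then zero or more ",triple" groups.
TRIPLES = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(text):
    # Hypothetical helper name, chosen only for this sketch.
    return TRIPLES.match(text) is not None

for sample in ["abc,bca,cbb", "ccc,abc,aab,baa", "bcb", "abc,defx,df", "abc,"]:
    print(sample, is_valid(sample))
```

The three examples from the question are accepted, while the string with defx and the string with a trailing comma are rejected.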

What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.

The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))

You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# The trailing $ is required: without it re.match would be satisfied
# with the valid prefix "abc,bca," and accept this string.
match = re.match(r'^[abc]{3}(,[abc]{3})*$', data_string)
if match:
    print("data string is correct")

Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, starting at the beginning of the string; whatever follows the last successful repetition is ignored. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
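To see what the original pattern actually consumed, you can inspect the match object. This is only a small illustration using the two strings from the question:

```python
import re

pattern = r'([abc][abc][abc],)+'

m = re.match(pattern, "abc,defx,df")
# Only the leading "abc," was matched; the rest of the string was never checked.
print(m.group(0))

# Here the very first triple already contains a disallowed letter, so there is no match.
print(re.match(pattern, "axc,defx,df"))
```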

An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False

If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb   True
acc   True
abc,bbc,abb,bXa,bbb   False
abc,bbc,ab,baa,bbb   False
\Z means the end of the string. Its presence forces the match to extend to the very end of the string. Note that, because the comma is optional in this pattern, it also accepts strings such as abcabc (no separator) and abc, (trailing comma).
By the way, I like Sonya's form too; in a way it is clearer, and it requires a comma between triples and none after the last one:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
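One practical difference between $ and \Z in Python is worth a quick check: without the MULTILINE flag, $ also matches just before a newline at the very end of the string, while \Z matches only at the very end. A small sketch (the sample text is made up):

```python
import re

text = "abc,bca\n"  # note the trailing newline

# "$" also matches just before a newline at the very end of the string.
dollar = re.match(r'([abc]{3},)*[abc]{3}$', text)
# "\Z" matches only at the very end of the string.
strict = re.match(r'([abc]{3},)*[abc]{3}\Z', text)

print(dollar is not None, strict is not None)
```

So if a trailing newline must be rejected as well, \Z (or re.fullmatch) is the safer choice.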

To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "This pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
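The effect on re.findall is easy to reproduce with the patterns from this thread. A short sketch, using one of the sample strings from the question:

```python
import re

text = "abc,bca,cbb"

# With a capturing group, findall returns what the group captured
# (the last repetition of the group), not the whole match.
with_capture = re.findall(r'^[abc]{3}(,[abc]{3})*$', text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'^[abc]{3}(?:,[abc]{3})*$', text)

print(with_capture)
print(without_capture)
```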

Related

using Python re to check a string

I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets, because a character class matches any one of the characters listed inside it. For example, a regex such as [A-Z,0-9] is going to match an uppercase letter, a comma, or a digit, whereas [A-Z0-9] is going to match an uppercase letter or a digit. Furthermore, you don't need the {1}, as the regex matches exactly one occurrence by default if no quantifier is specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
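status = (len(string) == 6 or len(string) == 10)
But you can also force the regex to match only certain lengths by using \b (word boundary), which you have already done. Note, however, that \b also matches between a digit and a hyphen, so search() can still find a valid ID inside a longer string, such as one with a -V suffix. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string, which constrains the whole string.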
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
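Putting it together, here is a rough sketch of how all three expressions could look once modified the same way. It follows the three formats as listed in the question and the single-digit suffix from the quoted comment; the names SUFFIX and UNIPROT_PATTERNS are my own and only illustrative.

```python
import re

# Optional "-digit" suffix, as suggested in the comment quoted above.
SUFFIX = r'(?:-[0-9])?'

# The three formats listed in the question, with the commas and {1} removed.
UNIPROT_PATTERNS = [
    re.compile(r'^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]' + SUFFIX + r'$'),
    re.compile(r'^[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]' + SUFFIX + r'$'),
    re.compile(r'^[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]'
               r'[A-Z][A-Z0-9][A-Z0-9][0-9]' + SUFFIX + r'$'),
]

def is_uniprot(candidate):
    # any() yields a plain True/False.
    return any(p.match(candidate) for p in UNIPROT_PATTERNS)

potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25',
                      'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [p for p in potential_uniprots if is_uniprot(p)]
print(correct_ids)
```

With the sample list from the question this keeps D4S359N116-2, Y6IT25 and V5PG90 and drops the rest.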
Code Suggestions
Your code looks good, but you could make a few minor changes to improve readability and follow conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
    status=True
return status
To this:
return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
# -OR-
status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status
Each search() call returns a match object or None, so the or-expression is truthy exactly when one of the patterns matched. Wrapping it in bool() gives a plain True or False, which keeps the is_uniprot(prot) == True comparison in your loop working.

re.match never returns None? [duplicate]

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is, that your current pattern matches empty strings. To prevent this you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
The way I understood the issue was that any letter of the alphabet should match, with at most one a character. Note that, like the pattern in the question, the two patterns below still match the empty string.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^                      # BOS
(?:                    # Cluster choice
     [bc]* a [bc]*     # only 1 [a] allowed, arbitrary [bc]'s
  |                    # or,
     [bc]+             # no [a]'s only [bc]'s ( so must be some )
)                      # End cluster
$                      # EOS
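A quick way to convince yourself is to run the pattern against the examples and non-examples from the question, plus the empty string. This is just a sketch:

```python
import re

pattern = re.compile(r'^(?:[bc]*a[bc]*|[bc]+)$')

examples = ["a", "abc", "bbca", "bbcabb"]
non_examples = ["aa", "bbaa", ""]

for s in examples + non_examples:
    print(repr(s), bool(pattern.match(s)))
```

All four examples match, and both non-examples as well as the empty string are rejected.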

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
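Strictly speaking, a formal regular expression (one describing a regular, Chomsky type-3 language) has no backreferences, so it could only express this by spelling out every possible pair of characters as a separate alternative, which quickly becomes unwieldy as the alphabet grows.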
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
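To compare the two interpretations side by side, here is a small sketch. It uses re.fullmatch so that the entire string has to fit the pattern; the variable names are only illustrative.

```python
import re

# Interpretation 1: the same two-character pair repeats (ababab).
pairs = re.compile(r'(..)\1*')
# Interpretation 2: only the 2nd, 4th, ... characters must be equal (abcbdb).
every_second = re.compile(r'.(.)(.\1)*')

for s in ["abababababab", "abcbdb", "ababacababab"]:
    # fullmatch requires the entire string to fit the pattern.
    print(s, bool(pairs.fullmatch(s)), bool(every_second.fullmatch(s)))
```

The first string satisfies both patterns, abcbdb satisfies only the second, and ababacababab satisfies neither.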
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The slice string[1::2] says: start at the 2nd character (index 1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False
This is a somewhat complicated expression; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.

How to specify the regex string in python

I have the following 2 strings of train station IDs (showing the direction of travel) separated by "-".
String A (strA):
NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27
String B (strB):
NS27-NS26-NS25-NS24-NS23-NS22-NS21-NS20-NS19-NS18-NS17-NS16-NS15-NS14-NS13-NS11-NS10-NS9-NS8-NS7-NS5-NS4-NS3-NS2-NS1
I want to find out which of String A or B contains stations "NS4" followed by "NS1" (answer should be String B).
My current code as follows:
searchStr = ".*NS4-.*NS1(-.*|)"
re.search(searchStr, strA)
re.search(searchStr, strB)
But the search keeps returning a match for String A.
How should I specify 'searchStr' so that it matches only String B?
Two ways to do it: tokenizing and improving the regex.
Tokenizing
tokA = strA.split('-')
tokB = strB.split('-')
print('NS4' in tokA and tokA.index('NS1') > tokA.index('NS4'))
print('NS4' in tokB and tokB.index('NS1') > tokB.index('NS4'))
# False
# True
Regex
import re
pattern = '(^|-)NS4.+NS1(-|$)'
print(re.search(pattern, strA) is not None)
print(re.search(pattern, strB) is not None)
# False
# True
Performance
Tokenization: 2.3072939129997394
Regex: 11.138173280000046
But if you really need performance, I'm sure there are faster ways. Even the tokenization method does multiple passes.
As an alternative to tokenizing, you could use the following expression.
NS4(?=.*?NS1(?!\d))
It literally means:
The characters "NS4" literally.
Followed by any characters, until it finds NS1.
NS1 cannot be followed by a digit.
To educate readers as to what I've used:
(?=) is a Positive Lookahead.
Whatever you place inside this token must be found for the match to be True.
I placed .*? to match anything, as few times as possible using the ? quantifier, followed by NS1 since that is what we want to find.
(?!) is a Negative Lookahead
Whatever you place inside this token, as you might guess, must NOT be found for the match to be True.
I placed a digit in here, so that things like NS10 or NS11 or NS19 are never matched.
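Here is how the lookahead expression behaves on the two strings from the question. In this sketch String B is built by reversing String A, which gives the same station order as in the question; the names str_a and str_b are just for illustration.

```python
import re

str_a = ("NS1-NS2-NS3-NS4-NS5-NS7-NS8-NS9-NS10-NS11-NS13-NS14-NS15-NS16-"
         "NS17-NS18-NS19-NS20-NS21-NS22-NS23-NS24-NS25-NS26-NS27")
# String B from the question is String A with the stations in reverse order.
str_b = "-".join(reversed(str_a.split("-")))

pattern = re.compile(r'NS4(?=.*?NS1(?!\d))')

print(pattern.search(str_a) is not None)
print(pattern.search(str_b) is not None)
```

Only String B produces a match, because in String A every NS1 that comes after NS4 is followed by another digit.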

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up with so far is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing. Also, I'm not checking for whitespace, tabs and newline characters, and I have hard-coded the number of digits before and after the decimal point.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method you ensure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with '^[0-9][0-9][.]' because that will only match strings with exactly two digits at the beginning. A second issue with your regex is at the end: [0-9][-]* matches a digit followed by zero or more hyphens; if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Also, [-A] is a character class that matches a single character, either a hyphen or an A, not the two-character sequence -A. Edit: see Martijn Pieters's answer regarding whitespace in the regular expression.
Here is an updated regular expression:
import re
testString = "87.98-A1-help"
# Raw string, so that \. reaches the regex engine unchanged.
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
checkMatch = regCompiled.match(testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
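To make the whitespace difference between the two answers concrete, here is a small comparison sketch. The names loose and strict are mine; the test strings come from the demonstration above, plus one with too few leading digits.

```python
import re

# Pattern from this answer: anything (including whitespace) may follow.
loose = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
# Pattern from the other answer: only non-whitespace up to the end of the string.
strict = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')

for s in ["87.98-A1-help", "123.45-A6-no whitespace allowed", "1.23-A1-x"]:
    print(repr(s), bool(loose.match(s)), bool(strict.match(s)))
```

The first string passes both patterns, the second passes only the .* version because of the spaces, and the third fails both since it starts with a single digit.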
