Avoid special values or space between values using python re - python

For any phone number which allows () in the area code and any space between area code and the 4th number, I want to create a tuple of the 3 sets of numbers.
For example: (301) 556-9018 or (301)556-9018 would return ('301','556','9018').
I will raise a Value error exception if the input is anything other than the original format.
How do I avoid () characters and include either \s or none between the area code and the next values?
This is my foundation so far:
phonenum=re.compile('''([\d)]+)\s([\d]+) - ([\d]+)$''',re.VERBOSE).match('(123) 324244-123').groups()
print(phonenum)
Do I need to make a if then statement to ignore the () for the first tuple element, or is there a re expression that does that more efficiently?
In addition the \s in between the first 2 tuples doesn't work if it's (301)556-9018.
Any hints on how to approach this?

When specifying a regular expression, you should use raw-string mode:
`r'abc'` instead of `'abc'`
That said, right now you are capturing three sets of numbers in groups. To allow parens, you will need to match parens. (The parens you currently have are for the capturing groups.)
You can match parens by escaping them: \( and \)
You can find various solutions to "what is a regex for XXX" by seaching one of the many "regex libary" web sites. I was able to find this one via DuckDuckGo: http://www.regexlib.com/Search.aspx?k=phone
To make a part of your pattern optional, you can make the individual pieces optional, or you can provide alternatives with the piece present or absent.
Since the parens have to be present or absent together - that is, you don't want to allow an opening paren but no closing paren - you probably want to provide alternatives:
# number, no parens: 800 555-1212
noparens = r'\d{3}\s+\d{3}-\d{4}'
# number with parens: (800) 555-1212
yesparens = r'\(\d{3}\)\s*\d{3}-\d{4}'
You can match the three pieces by inserting "grouping parens":
noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
Note that the quoted parens go outside of the grouping parens, so that the parens do not become part of the captured group.
You can join the alternatives together with the | operator:
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped

In regular expressions you can use special characters to specify some behavior of some part of the expression.
From python re documentation:
'*' =
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+' =
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?' =
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So to solve the blank space problem you can use either '?' if you know the occurrence will be no more than 1, or '+' if you can have more than 1.
In case of grouping information together and them returning a list, you can put your expression inside parenthesis and then use function groups() from re.
The result would be:
results = re.search('\((\d{3})\)\s?(\d{3})-(\d{4})', '(301) 556-9018')
if results:
print results.groups()
else:
print('Invalid phone number')

Related

python re regex matching in string with multiple () parenthesis

I have this string
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
What is a regex to first match exactly "IP(dev.maintserial):Y(dev.maintkeys)"
There might be a different path inside the parenthesis, like (name.dev.serial), so it is not like there will always be one dot there.
I though of something like this:
re.search('(IP\(.*?\):Y\(.*?\))', cmd) but this will also match the single IP(k1) and Y(y1
My usage will be:
If "IP(*):Y(*)" in cmd:
do substitution of IP(dev.maintserial):Y(dev.maintkeys) to Y(dev.maintkeys.IP(dev.maintserial))
How can I then do the above substitution? In the if condition I want to do this change in order: from IP(path_to_IP_key):Y(path_to_Y_key) to Y(path_to_Y_key.IP(path_to_IP_key)) , so IP is inside Y at the end after the dot.
This should work as it is more restrictive.
(IP\([^\)]+\):Y\(.*?\))
[^\)]+ means at least one character that isn't a closing parenthesis.
.*? in yours is too open ended allowing almost anything to be in until "):Y("
Something like this?
r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)"
You can try it with your string. It matches the entire string IP(dev.maintserial):Y(dev.maintkeys) with groups dev.maintserial and dev.maintkeys.
The RE matches IP(, zero or more characters that are not a closing parenthesis ([^)]*), a period . (\.), one or more of any characters (.+), then ):Y(, ... (between the parentheses -- same as above), ).
Example Usage
import re
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
# compile regular expression
p = re.compile(r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)")
s = p.search(cmd)
# if there is a match, s is not None
if s:
print(f"{s[0]}\n{s[1]}\n{s[2]}")
a = "Y(" + s[2] + ".IP(" + s[1] + "))"
print(f"\n{a}")
Above p.search(cmd) "[s]can[s] through [cmd] looking for the first location where this regular expression [p] produces a match, and return[s] a corresponding match object" (docs). None is the return value if there is no match. If there is a match, s[0] gives the entire match, s[1] gives the first parenthesized subgroup, and s[2] gives the second parenthesized subgroup (docs).
Output
IP(dev.maintserial):Y(dev.maintkeys)
dev.maintserial
dev.maintkeys
Y(dev.maintkeys.IP(dev.maintserial))
You can use 2 negated character classes [^()]* to match any character except parenthesis, and omit the outer capture group for a match only.
To prevent a partial word match, you might start matching IP with a word boundary \b
\bIP\([^()]*\):Y\([^()]*\)
Regex demo

using Python re to check a string

I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
status=False
uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
correctIDs=[]
for prot in potential_uniprots:
if is_uniprot(prot) == True:
correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the comma's from the character class (If you don't want to match comma's) The patterns by them selves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact amount of characters. To match an optional hyphen and digit, you can use an optional non capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets as the brackets dictate that the regex should match all characters in the square brackets. For example, a regex such as [A-Z,0-9] is going to match an uppercase character, comma, or a digit whereas a regex such as [A-Z0-9] is going to match an uppercase character or a digit. Furthermore, you don't need the {1} as the regex will match one by default if no quantifiers are specified. This means that you can just delete the {1} from the expression.
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
status = (len(string) == 6 or len(string) == 8)
But you can also force the regex to match certain lengths use \b (word-boundary), which you have already done. You can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
status=True
return status
To this:
return (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
# -OR-
stats = (uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID))
return status
Because uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID) is never going to return anything other than True or False, so it is safe to return that expression.

Regular expression misses match at beginning of string

I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounding by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphtongue, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphtongues end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookahead (?<=[{vowels}]|^|{diphtongues}), but regex won't accept different lengths (even removing the diphtongues doesn't work, apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using regex, which allows variable-width lookbehinds, which solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
See the regex demo
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.
Note that order of alternative matters. If you change to (?:^|a), the match starts at the start of the string, b* matches empty string, ab* grabs the first abb in abbabb, and since there is a right after, you get abb as a match. There is no way to match anything after the first a.
Remember that python "short-circuits", so, if it matches "^", its not going to continue looking to see if it matches "a" too. This will "consume" the matching character, so in cases where it matches "a", "a" is consumed and not available for the next group to match, and because using the (?:) syntax is non-capturing, that "a" is "lost", and not available to be captured by the next grouping (b*(?:a)b*), whereas when "^" is consumed by the first grouping, that first "a" would match in the second grouping.

Splitting a string with delimiters and conditions

I'm trying to split a general string of chemical reactions delimited by whitespace, +, = where there may be an arbitrary number of whitespaces. This is the general case but I also need it to split conditionally on the parentheses characters () when there is a + found inside the ().
For example:
reaction= 'C5H6 + O = NC4H5 + CO + H'
Should be split such that the result is
splitresult=['C5H6','O','NC4H5','CO','H']
This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M) which I'd also like to split off of as well leaving only the M. In this case, there will always be a +M inside the parentheses
For example:
reaction='C5H5 + H (+M)= C5H6 (+M)'
splitresult=['C5H5','H','M','C5H6','M']
However, there will be some cases where the parentheses will not be delimiters. In these cases, there will not be a +M but something else that doesn't matter.
For example:
reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
splitresult=['C5H5','HO2','C5H5O(2,4)','OH']
My best guess is to use negative lookahead and lookbehind to match the +M but I'm not sure how to incorporate that into the regex expression I used above on the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.
You could use re.findall() instead:
re.findall(pattern, string, flags=0)
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction0)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction1)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction2)
but, if you prefer re.split() and filter(), then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction0))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction1))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction2))
the pattern for findall is different from the pattern for split,
because findall and split are looking for different things; 'the opposite things', indeed.
findall, is looking for that you wanna (keep it).
split, is looking for that you don't wanna (get rid of it).
In findall, '[A-Z0-9]+(?:([1-9],[1-9]))?'
match any upper case or number > [A-Z0-9],
one or more times > +, follow by a pair of numbers, with a comma in the middle, inside of parenthesis > \([1-9],[1-9]\)
(literal parenthesis outside of character classes, must be escaped with backslashes '\'), optionally > ?
\([1-9],[1-9]\) is inside of (?: ), and then,
the ? (which make it optional); ( ), instead of (?: ) works, but, in this case, (?: ) is better; (?: ) is a no capturing group: read about this.
try it with the regex in the split
That seems overly complicated to handle with a single regular expression to split the string. It'd be much easier to handle the special case of (+M) separately:
halfway = re.sub("\(\+M\)", "M", reaction)
result = filter(None, re.split('[\s+=]', halfway))
So here is the regex which you are looking for.
Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))
Flags used:
g for global search. Or use them as per your situation.
Explanation:
((?=\(\+)\() checks for a ( which is present if (+ is present. This covers the first part of your (+M) problem.
((?<=M)\)) checks for a ) which is present if M is preceded by ). This covers the second part of your (+M) problem.
[\s+=] checks for all the remaining whitespaces, + and =. This covers the last part of your problem.
Note: The care for digits being enclosed by () ensured by both positive lookahead and positive lookbehind assertions.
Check Regex101 demo for working
P.S: Make it suitable for yourself as I am not a python programmer yet.

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Categories

Resources