Regex unbalanced parenthesis error while finding phone numbers - python

I am using regex to find phone numbers and get an "unbalanced parenthesis at position 32" error:
import re
Regex_digit = re.compile(r'((\d\d\d-)? \d\d\d-\d\d\d\d(,))?) {3}')
Regex_digit.search('Hello, you can call me at 144-245-1452,152-632,414-156-3552')

To get the supposed phone numbers in the string:
'Hello, you can call me at 144-245-1452,152-632,414-156-3552'
we expect to capture both of the following:
144-245-1452
414-156-3552
Given your regex, a pattern that works would be:
Regex_digit = re.compile(r'(\d\d\d-)?(\d\d\d)-(\d\d\d\d)')
There are 3 capturing groups here.
(\d\d\d-)? - optionally matches 3 digits [0-9] followed by a hyphen.
(\d\d\d) - matches exactly 3 digits [0-9].
- - matches the character - literally.
(\d\d\d\d) - matches exactly 4 digits [0-9].
result = Regex_digit.search('Hello, you can call me at 144-245-1452,152-632,414-156-3552')
print(result)
<re.Match object; span=(26, 38), match='144-245-1452'>
However, search only gives you the first match in the string.
To get all matches:
string = 'Hello, you can call me at 144-245-1452,152-632,414-156-3552'
result = Regex_digit.findall(string)
print(result)
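Because we have 3 capturing groups, you'd get a list of tuples, each containing the 3 items from the capturing groups.
To get the full matches back as a list of strings, iterate over the match objects and take group(0). (Joining each tuple with "".join would drop the hyphen between the last two groups, because that hyphen sits outside the groups.)
print([m.group(0) for m in Regex_digit.finditer(string)])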
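As a rough sketch of how search, findall and finditer differ with the three-group pattern above (the names phone_re and text are illustrative, not from the question):

```python
import re

phone_re = re.compile(r'(\d\d\d-)?(\d\d\d)-(\d\d\d\d)')
text = 'Hello, you can call me at 144-245-1452,152-632,414-156-3552'

# search() stops at the first match.
first = phone_re.search(text).group(0)

# findall() returns one tuple per match when the pattern has several groups.
groups = phone_re.findall(text)

# finditer() yields match objects, so group(0) gives each full match.
numbers = [m.group(0) for m in phone_re.finditer(text)]

print(first)    # 144-245-1452
print(groups)   # [('144-', '245', '1452'), ('414-', '156', '3552')]
print(numbers)  # ['144-245-1452', '414-156-3552']
```

The fragment 152-632 is skipped because it never completes the 3-3-4 or 3-4 digit shape the pattern requires.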
A more compact version of the pattern would be:
regex_digit = re.compile(r'(\d{3}-)?(\d{3})-(\d{4})')
It means the same thing as the pattern discussed above.

In your case the error occurs because the pattern has more closing parentheses than opening ones. Balanced, it would look like this:
r'((\d\d\d-)? \d\d\d-\d\d\d\d(,)?){3}'
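To see the compile error and the effect of balancing the parentheses, something like the following could be used. The helper compile_error is a hypothetical name introduced only for this sketch.

```python
import re

broken = r'((\d\d\d-)? \d\d\d-\d\d\d\d(,))?) {3}'   # pattern from the question
balanced = r'((\d\d\d-)? \d\d\d-\d\d\d\d(,)?){3}'   # one stray ")" removed

def compile_error(pattern):
    """Return the compile error message, or None if the pattern is valid."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

print(compile_error(broken))    # message mentions "unbalanced parenthesis"
print(compile_error(balanced))  # None
```

Balancing the parentheses only removes the compile error. The balanced pattern still contains a literal space before each number, so it would not match the comma-separated sample string from the question.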

Related

How to retrieve second value from search object?

I am unable to obtain the second match from a match object.
EDIT: found a fix
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+[.,][0-9]+)")
buy_in = re.findall(ptrn_buy_in, lines[0])
if buy_in[1] and buy_in[2]:
    parsed_hand["metadata"]["buy_in"] = float(buy_in[1])
    parsed_hand["metadata"]["rake"] = float(buy_in[2])
My string is: $14.69+$0.31
I have tried to see if the match object actually holds more values within the same group i.e. .group(0)[0] and [1]. This actually gave me the second digit of the number so not at all what I was expecting.
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+.[0-9]+)")
buy_in = re.search(ptrn_buy_in, lines[0])
if buy_in.group(0) and buy_in.group(1):
    parsed_hand["metadata"]["buy_in"] = float(buy_in.group(0))
    parsed_hand["metadata"]["rake"] = float(buy_in.group(1))
I am expecting to get 14.69 in .group(0) and 0.31 from .group(1) however I am getting 14.69 twice. Does anyone have any ideas?
Kind regards
There is only 1 capturing group, ([0-9]+.[0-9]+), which matches 1+ digits, then any single character, then 1+ digits again.
re.search returns a match object, where .group(0) returns the entire match and .group(1) the first capturing group. Because the group spans the whole match here, you get 14.69 twice.
You have to escape the dot to match it literally. And since you use a positive lookbehind, you could omit the group and get only the match:
(?<=[$€£])[0-9]+\.[0-9]+
ptrn_buy_in = re.compile(r"(?<=[$€£])[0-9]+\.[0-9]+")
print(re.findall(ptrn_buy_in, r"$14.69+$0.31"))
Result
['14.69', '0.31']
Or use a match with the capturing group:
[$€£]([0-9]+\.[0-9]+)
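A short sketch of both points, using the string from the question (the variable names text, amounts, buy_in and rake are assumptions for illustration):

```python
import re

text = "$14.69+$0.31"

# One capturing group: group(0) is the whole match, group(1) is the group.
# Here the group covers the whole match, so both are the same string.
m = re.search(r"(?<=[$€£])([0-9]+\.[0-9]+)", text)
print(m.group(0), m.group(1))  # 14.69 14.69

# findall() returns every match; with one group it returns the group text.
amounts = re.findall(r"[$€£]([0-9]+\.[0-9]+)", text)
buy_in, rake = (float(a) for a in amounts)
print(buy_in, rake)  # 14.69 0.31
```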

Capturing entire repeated string based on a repeated pattern

The following regex matches both 59-59-59 and 59-59-59-59, but captures only 59-.
The intent is to match exactly four numbers separated by -, with the maximum number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern that matches exactly 59-59-59-59
and does not match 59--59-59 or 59-59-59-59-59
Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This matches an initial number whose first digit is 0 through 5, followed by any second digit. It is then followed by a dash and a number with the same rules, exactly three times. Note that re.match anchors at the beginning by default, so we only need the ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by @Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
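As a quick check of the lookaround version with re.search, here is a small sketch. The sample sentences are made up for illustration.

```python
import re

pattern = re.compile(r'(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)')

samples = [
    'timer 59-59-59-59 set',     # exactly four groups
    'timer 59-59-59-59-59 set',  # five groups
    'timer 59--59-59 set',       # doubled hyphen
]

results = []
for s in samples:
    m = pattern.search(s)
    results.append(m.group(0) if m else None)

print(results)  # ['59-59-59-59', None, None]
```

The five-group sample fails at both ends: after four groups the lookahead sees a hyphen instead of whitespace, and starting from the second group the lookbehind sees a hyphen instead of whitespace.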
In your pattern, this part -{1,4} matches 1-4 times a hyphen so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
    print(res.group())
If there should not be partial matches, you could use anchors to assert the start ^ and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
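A minimal sketch of how the anchored backreference pattern behaves (the test strings and the names same_four and samples are illustrative):

```python
import re

# \1 must repeat exactly what the first group captured.
same_four = re.compile(r'^([0-5][0-9])(?:-\1){3}$')

samples = ['59-59-59-59', '12-12-12-12', '59-58-59-59', '59-59-59']
results = [bool(same_four.match(s)) for s in samples]

print(results)  # [True, True, False, False]
```

Unlike the [0-5][0-9] repetition, the backreference rejects 59-58-59-59 because the second number differs from the first.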

Regular expression to convert given number in the required format

This is my first time using regular expressions, so I need help with a slightly complex one. I have an input list of around 100-150 strings (numbers).
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
Expected output = ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
## I have tried this -
import re
new_list = []
for i in range (0, len(input)):
    new_list.append(re.sub('\d+-\d+-\d+','0000\\1', input[i]))
## problem is with second argument '0000\\1'. I know its wrong but unable to solve
print(new_list) ## new_list is the expected output.
As you can see, I need to convert strings of numbers that come in different formats into 15-digit numbers by adding leading zeros.
But there is a catch: some numbers, e.g. '000480087800784', are already 15 digits and should be left unchanged (that's why I cannot use Python's string formatting (.format) option). Regex has to be used here, so that only the required numbers are modified. I have already tried the following answers but have not been able to solve it.
Using regex to add leading zeroes
using regular expression substitution command to insert leading zeros in front of numbers less than 10 in a string of filenames
Regular expression to match defined length number with leading zeros
Your regex does not work as you used \1 in the replacement, but the regex pattern has no corresponding capturing group. \1 refers to the first capturing group in the pattern.
If you want to try your hand at regex, you may use
re.sub(r'^(\d+)-(\d+)-(\d+)$', lambda x: "{}-{}-{}".format(x.group(1).zfill(5), x.group(2).zfill(5), x.group(3).zfill(5)), input[i])
Here, ^(\d+)-(\d+)-(\d+)$ matches a string that starts with 1+ digits, then has -, then 1+ digits, - and again 1+ digits followed by the end of string. There are three capturing groups whose values can be referred to with \1, \2 and \3 backreferences from the replacement pattern. However, since we need to apply .zfill(5) on each captured text, a lambda expression is used as the replacement argument, and the captures are accessed via the match data object group() method.
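Put together as a self-contained sketch, this could look like the following. The function name pad_groups is hypothetical, and the join over m.groups() is just a compact way of writing the three .zfill(5) calls.

```python
import re

def pad_groups(value):
    """Zero-pad each dash-separated group to 5 digits; leave other strings alone."""
    return re.sub(
        r'^(\d+)-(\d+)-(\d+)$',
        # The replacement is a function, so zfill can run on each capture.
        lambda m: '-'.join(g.zfill(5) for g in m.groups()),
        value,
    )

numbers = ['90-10-07457', '000480087800784', '001-713-0926']
padded = [pad_groups(n) for n in numbers]
print(padded)
# ['00090-00010-07457', '000480087800784', '00001-00713-00926']
```

Strings without dashes do not match the pattern, so re.sub returns them unchanged.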
However, if your strings are already in correct format, you may just split the strings and format as necessary:
for i in range (0, len(input)):
    splits = input[i].split('-')
    if len(splits) == 1:
        new_list.append(input[i])
    else:
        new_list.append("{}-{}-{}".format(splits[0].zfill(5), splits[1].zfill(5), splits[2].zfill(5)))
Both solutions yield
['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
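How about counting the digits and dashes in the string, then adding leading zeros? Note that this pads the front of the whole string up to 15 digits rather than padding each group, so its output differs from the expected output in the question: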
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
output = []
for inp in input:
    # calculate length of string
    inpLen = len(inp)
    # count the dashes
    inpDashes = inp.count('-')
    # add the number of leading zeros needed to reach 15 digits
    zeros = "0" * (15-(inpLen-inpDashes))
    output.append(zeros + inp)
print (output)
>>> ['00000090-10-07457', '000480087800784', '00000001-713-0926', '00000012-710-8197', '00000001-345-1715', '000000009-23-4532', '000200007100272']

Python Regular expression for splitting mentions of two years appearing altogether

I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns "(19561958-)". How can I achieve my purpose? Many thanks!
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d is written more concisely as \d{4}.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
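To compare how the three variants treat other inputs, a small sketch follows. The helper split_years and the extra sample strings are assumptions made for illustration.

```python
import re

plain = r'(\d{4})(\d{4})'
bounded = r'\b(\d{4})(\d{4})\b'
parenthesized = r'(?<=\()(\d{4})(\d{4})(?=\))'

def split_years(pattern, text):
    """Insert a hyphen between the two captured four-digit groups."""
    return re.sub(pattern, r'\1-\2', text)

for text in ['(19561958)', 'id 1234567890', 'born 19561958']:
    print(text, '|',
          split_years(plain, text), '|',
          split_years(bounded, text), '|',
          split_years(parenthesized, text))
```

The plain pattern also rewrites the ten-digit number, the word-boundary pattern only touches standalone eight-digit numbers, and the lookaround pattern only touches numbers wrapped in parentheses.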
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".
The 2nd group is referenced with \2.
You could use capturing groups or look arounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4} matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4}) Positive lookbehind which asserts that the match must be preceded by ( and four digit characters.
(?=\d{4}\)) Positive lookahead which asserts that the match must be followed by 4 digits plus the ) symbol.
Here only the zero-width position between the two years is matched. Replacing that empty match with - gives the desired output.
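The zero-width nature of that match can be checked directly. This is a small sketch using the same string; the variable rebuilt is an illustrative name.

```python
import re

a = "(19561958)"
m = re.search(r"(?<=\(\d{4})(?=\d{4}\))", a)

# Both lookarounds consume nothing, so start and end coincide.
print(m.span())         # (5, 5)
print(repr(m.group()))  # ''

# Inserting "-" at that position is what re.sub does with the empty match.
rebuilt = a[:m.start()] + "-" + a[m.end():]
print(rebuilt)          # (1956-1958)
```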

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
This doesn't seem to work, and I'm not sure what I'm missing. I'm also not checking for whitespace, tabs and newline characters, and I have hard-coded the number of digits before and after the decimal point.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method you ensure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
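On Python 3.4 or later, a possible alternative to .match() with a trailing \Z is re.fullmatch, which requires the whole string to match. The sketch below assumes that approach; FORMAT_RE and is_valid are hypothetical names.

```python
import re

# Same pattern as above, without \Z: fullmatch() anchors both ends itself.
FORMAT_RE = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*')

def is_valid(s):
    """Return True if the whole string has the 999.999-A9-... format."""
    return FORMAT_RE.fullmatch(s) is not None

print(is_valid('87.98-A1-help'))                    # True
print(is_valid('123.45-A6-no whitespace allowed'))  # False
print(is_valid('87.98-A1-help\n'))                  # False
print(is_valid('1.98-A1-help'))                     # False
```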
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with ^[0-9][0-9][.] because that will only match strings with exactly two digits at the beginning. A second issue with your regex is at the end: [0-9][-]* - if you wish to match anything at the end of the string, you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding whitespace in the regular expression.
Here is an updated regular expression:
testString = "87.98-A1-help"
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
checkMatch = re.match(regCompiled, testString)
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
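To illustrate the point about square brackets: [-A] is a character class that matches a single character, either - or A, not the two-character sequence -A. A small comparison of the original and updated patterns (the extra sample strings are made up):

```python
import re

original = r'^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*'
fixed = r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*'

samples = ['87.98-A1-help', '87.98-1-help', '87.98A1']

# One (original matches?, fixed matches?) pair per sample.
results = [(bool(re.match(original, s)), bool(re.match(fixed, s)))
           for s in samples]

print(results)  # [(False, True), (True, False), (True, False)]
```

The original pattern rejects the intended string because [-A] consumes the hyphen and the following [0-9] then meets the letter A, while it accepts strings that have only one of the two characters.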
