sre_constants.error: missing ),unterminated subpattern - python

I am Gaurav and I am learning programming. I was reading about regular expressions in Dive Into Python 3, so I thought I would try something myself. I wrote this code in Eclipse, but I got a lot of errors. Can anyone please help me?
import re
def add_shtner(add):
    return re.sub(r"\bROAD\b","RD",add)
print(add_shtner("100,BROAD ROAD"))
# a code to check valid roman no.
ptn=r"^(M{0,3})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}$)"
def romancheck(num):
    num=num.upper()
    if re.search(ptn,num):
        return "VALID"
    else:
        return "INVALID"
print(romancheck("MMMLXXVIII"))
print(romancheck("MMMLxvviii"))
mul_line_str='''adding third argument
re.VERBOSE in re.search()
will ignore whitespace
and comments'''
print(re.search("re.search()will",mul_line_str,re.VERBOSE))
print(re.search("re.search() will",mul_line_str,re.VERBOSE))
print(re.search("ignore",mul_line_str,re.VERBOSE))
ptn='''
^ #beginning of the string
M{0,3} #thousands-0 to 3 M's
(CM|CD|D?C{0,3} #hundreds
(XC|XL|L?XXX) #tens
(IX|IV|V?III) #ones
$ #end of the string
'''
print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE))
def romanCheck(num):
    num=num.upper()
    if re.search(ptn,num,re.VERBOSE):
        return "VALID"
    else:
        return "INVALID"
print(romanCheck("mmCLXXXIV"))
print(romanCheck("MMMCCLXXXiv"))
I ran this code and got the following output:
100,BROAD RD
VALID
INVALID
None
None
<_sre.SRE_Match object; span=(120, 126), match='ignore'>
Traceback (most recent call last):
File "G:\pydev\xyz\rts\regular_expressions.py", line 46, in <module>
print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE))
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 856, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\Users\Owner\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 768, in _parse
source.tell() - start)
sre_constants.error: missing ), unterminated subpattern at position 113 (line 4, column 6)
What do these errors mean? Can anyone help me?
I understand all of the output, but I cannot make sense of this error.

The error means that you have passed a malformed regular expression to the search() function in line 46.
Although you have defined a valid RegEx in this line:
ptn=r"^(M{0,3})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}$)"
you overwrite this pattern (ptn) later with what seems to be some help/docstring:
ptn='''
^ #beginning of the string
M{0,3} #thousands-0 to 3 M's
(CM|CD|D?C{0,3} #hundreds
(XC|XL|L?XXX) #tens
(IX|IV|V?III) #ones
$ #end of the string
'''
This is not a valid RegEx pattern: it is missing a closing parenthesis after (CM|CD|D?C{0,3}.
You pass this new string as the regular expression in the next line, print(re.search(ptn,"MMMCDLXXIX",re.VERBOSE)), and compiling the pattern raises that error.
If you use another name for the variable to hold your help/docstring in line 27 (based on your sample code or line 38 based on your stacktrace) everything looks fine:
100,BROAD RD
VALID
INVALID
None
None
<_sre.SRE_Match object; span=(85, 91), match='ignore'>
<_sre.SRE_Match object; span=(0, 10), match='MMMCDLXXIX'>
VALID
VALID
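An alternative to renaming the variable is to repair the verbose pattern itself. The following is only a sketch: it assumes the multi-line pattern was meant to mirror the single-line one, so it adds the missing closing parenthesis and uses X{0,3} and I{0,3} in place of XXX and III. The names ROMAN_PATTERN and roman_check are made up for this example.

```python
import re

# Assumed intent: the verbose equivalent of the single-line pattern above.
ROMAN_PATTERN = r'''
    ^                   # beginning of the string
    M{0,3}              # thousands: 0 to 3 M characters
    (CM|CD|D?C{0,3})    # hundreds (closing parenthesis restored)
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of the string
'''

def roman_check(num):
    # re.VERBOSE makes the engine skip the whitespace and the # comments.
    if re.search(ROMAN_PATTERN, num.upper(), re.VERBOSE):
        return "VALID"
    return "INVALID"

print(roman_check("MMMCDLXXIX"))
print(roman_check("MMMLxvviii"))
```

Note that, like the original single-line pattern, this sketch also accepts an empty string, because every part of the pattern is optional.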

I have had this issue when using re.VERBOSE. I suggest not using it in that format. Create the pattern in a single line rather than over multiple lines and don't pass the verbose parameter.

Related

how do I get words with meta expressions from string regardless of spacing

What I want to do is get the start position of some words within a string.
For example,
context = "abcd e f g ( $ 150 )"
answer = "g($150)"
I want to get the start index of answer in context, which should be 9.
I tried something like this:
answer = ' ?'.join(answer)
try:
    answer = re.sub('[$]', '\$', answer)
    answer = re.sub('[(]', '\(', answer)
    answer = re.sub('[)]', '\)', answer)
except:
    pass
start_point = re.search(answer, context).span()[0]
Because some answers contain metacharacters and some do not, I used try/except.
I also used this kind of code:
answer = re.sub('[(]', '\(', answer)
because without it, re.search(answer, context) could not find my answer in context.
Then I get this error:
Traceback (most recent call last):
File "mc_answer_v2.py", line 42, in <module>
match = re.search(spaced_answer_text, mc_paragraph_text)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/home/hyelchung/data1/envs/albert/lib/python3.6/sre_parse.py", line 619, in _parse
source.tell() - here + len(this))
sre_constants.error: multiple repeat at position 3
How do I fix it and is there any other good way to get the start index?
It seems possible to do it by sticking \s* (variable number of white space characters) after each escaped character of answer string.
import re
def findPosition(context, answer):
    regex=r"\s*"
    regAnswer=regex.join([re.escape(w) for w in answer]) + regex
    # print(regAnswer)
    return re.search(regAnswer, context).start()
context = "abcd e f g ( $ 150 )"
answer = "g($150)"
print(findPosition(context, answer))
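As written, findPosition raises an AttributeError when the answer is not present, because re.search returns None in that case. The sketch below is one possible variant, not the answerer's code: find_position is a hypothetical name, and returning -1 on a miss is an assumption borrowed from str.find.

```python
import re

def find_position(context, answer):
    # Escape every non-space character so that (, ) and $ are matched literally.
    escaped_chars = [re.escape(ch) for ch in answer if not ch.isspace()]
    # Allow any amount of whitespace between consecutive characters.
    pattern = r"\s*".join(escaped_chars)
    match = re.search(pattern, context)
    # Assumed convention: -1 when there is no match, like str.find.
    return match.start() if match else -1

context = "abcd e f g ( $ 150 )"
print(find_position(context, "g($150)"))
print(find_position(context, "FLAG"))
```

Escaping character by character with re.escape also avoids the "multiple repeat" error from the question, since no quantifier can end up directly after another one.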
Use map to escape each character of the target string.
Use a regex substitution to replace the spaced-out match in the source string with the target string.
Then the string find method looks for the target string. If the target string does not exist, find returns -1 instead of raising an exception.
>>> import re
>>> context = 'abcd e f g ( $ 150 )'
>>> answer = 'g($150)'
>>> findSpacing = lambda target, src: re.sub(r"\s*".join(map(re.escape, target)), target, src).find(target)
>>> findSpacing(answer, context)
9
>>> findSpacing("FLAG", context)
-1
>>>

how to replace +xx in pandas str replace

I'm using Python 2.7.12 and pandas 0.20.3. I have a data frame like the one below and want to modify the column called number, whose dtype is object. When I try to replace +91 in that column, I get the error shown below.
number
0 +9185600XXXXX
1 +9199651XXXXX
2 99211XXXXX
3 99341XXXXX
4 +9199651XXXXX
sre_constants.error: nothing to repeat
full trace,
Traceback (most recent call last):
File "encoder.py", line 21, in
df['number']=df['number'].str.replace('+91','')
File "/home/hduser/.local/lib/python2.7/site-packages/pandas/core/strings.py", line 1574, in replace
flags=flags)
File "/home/hduser/.local/lib/python2.7/site-packages/pandas/core/strings.py", line 424, in str_replace
regex = re.compile(pat, flags=flags)
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
When I replace 91 it works as expected; it only fails when I prefix it with +.
Please help me solve this problem.
The error occurs at:
df['number']=df['number'].str.replace('+91','')
You can escape the special regex character + (one or more repetitions) like this:
df['number'] = df['number'].str.replace(r'\+91','')
Or use the parameter regex=False (added in pandas 0.23, so it is not available in 0.20.3):
df['number'] = df['number'].str.replace('+91','', regex=False)
import pandas as pd
data={'number':['+9185600XXXXX','+9199651XXXXX']}
f=pd.DataFrame(data)
f['number']=f['number'].str.replace(r'\+91','')
print(f)
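The traceback shows pandas handing the pattern to re.compile, so the failure can be reproduced with the re module alone. The sketch below is not pandas code: it uses a plain list in place of the column, and anchoring the pattern with ^ so that only a leading +91 is removed is an assumption about what is wanted.

```python
import re

numbers = ["+9185600XXXXX", "+9199651XXXXX", "99211XXXXX"]

# A bare "+" is a quantifier with nothing before it, so compilation fails.
try:
    re.compile("+91")
    compiled_ok = True
except re.error:
    compiled_ok = False

# re.escape produces a pattern that matches the text "+91" literally.
prefix = re.escape("+91")
# Assumption: only a leading +91 should be stripped, hence the ^ anchor.
stripped = [re.sub("^" + prefix, "", number) for number in numbers]

print(compiled_ok)
print(stripped)
```

In recent pandas releases (2.0 and later) str.replace treats the pattern as a literal string by default, so there the escaped form would need regex=True to behave as shown in the answers above.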

Trying to do a simple regex

I want to extract (abc)(def) using a regex,
but I ended up with the error below.
import re
def main():
    str = "-->(abc)(def)<--"
    match = re.search("\-->(.*?)\<--" , str).group(1)
    print match
The error is:
Traceback (most recent call last):
File "test.py", line 7, in <module>
match = re.search("\-->(.*?)\<--" , str).group()
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Corrected:
import re
def main():
    my_string = "-->(abc)(def)<--"
    match = re.search("\-->(.*?)\<--" , my_string).group(1)
    print match
    # (abc)(def)
main()
Note that I renamed str to my_string (do not use the names of built-ins such as str for your own variables!). You may still be able to optimize your regex with lookarounds; the lazy star (.*?) can be very inefficient at times.
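For reference, a Python 3 version of the same idea might look like the sketch below. The function name extract_between_arrows is hypothetical, and returning None when nothing matches is an assumption; the original code would raise an AttributeError in that case because re.search returns None.

```python
import re

def extract_between_arrows(text):
    # Neither - nor < is special outside a character class, so no escaping is needed.
    match = re.search(r"-->(.*?)<--", text)
    # Guard against a missing match before calling group().
    return match.group(1) if match else None

print(extract_between_arrows("-->(abc)(def)<--"))
print(extract_between_arrows("no arrows here"))
```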

Splitting up lines in a regular expression

I'm trying to break up a long regex into smaller chunks. Is it possible/good practice to change A to B?
A:
line = re.sub(r'\$\{([0-9]+)\}|\$([0-9]+)|\$\{(\w+?\=\w?+)\}|[^\\]\$(\w[^-]+)|[^\\]\$\{(\w[^-]+)\}',replace,line)
B:
line = re.sub(r'\$\{([0-9]+)\}|'
r'\$([0-9]+)|'
r'\$\{(\w+?\=\w?+)\}|'
r'[^\\]\$(\w[^-]+)|'
r'[^\\]\$\{(\w[^-]+)\}',replace,line)
Edit:
I receive the following error when running this in Python 2:
def main():
    while(1):
        line = raw_input("(%s)$ " % ncmd)
        line = re.sub(r'''
            \$\{([0-9]+)\}|
            \$([0-9]+)|
            \$\{(\w+?\=\w?+)\}|
            [^\\]\$(\w[^-]+)|
            [^\\]\$\{(\w[^-]+)\}
            ''',replace,line,re.VERBOSE)
        print '>> ' + line
Error:
(1)$ abc
Traceback (most recent call last):
File "Test.py", line 4, in <module>
main()
File "Test.py", line 2, in main
[^\\]\$\{(\w[^-]+)\}''',replace,line,re.VERBOSE)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat
You can use a triple-quoted (multi-line) string and set the re.VERBOSE flag, which allows you to break a regex pattern over multiple lines. Pass it as flags=re.VERBOSE, because the fourth positional parameter of re.sub is count, not flags. On Python 2.7 the multiple repeat error itself comes from \w?+, which stacks two quantifiers; the code below assumes \w+? was intended:
line = re.sub(r'''
    \$\{([0-9]+)\}|
    \$([0-9]+)|
    \$\{(\w+?\=\w+?)\}|
    [^\\]\$(\w[^-]+)|
    [^\\]\$\{(\w[^-]+)\}
    ''', replace, line, flags=re.VERBOSE)
You can even include comments directly inside the string:
line = re.sub(r'''
    \$\{([0-9]+)\}|         # Pattern 1
    \$([0-9]+)|             # Pattern 2
    \$\{(\w+?\=\w+?)\}|     # Pattern 3
    [^\\]\$(\w[^-]+)|       # Pattern 4
    [^\\]\$\{(\w[^-]+)\}    # Pattern 5
    ''', replace, line, flags=re.VERBOSE)
Lastly, it should be noted that you can likewise activate the verbose flag by using re.X or by placing (?x) at the start of your regex pattern.
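A reduced, runnable sketch of the verbose approach is shown below. It keeps only the first two alternatives of the pattern, and the replace callback is a made-up stand-in (the question never shows the real one); the point is only that the flag is passed through the flags keyword.

```python
import re

# Only the first two alternatives of the original pattern, for brevity.
PATTERN = r'''
    \$\{([0-9]+)\}  |   # braced positional reference, e.g. ${1}
    \$([0-9]+)          # bare positional reference, e.g. $2
'''

def replace(match):
    # Hypothetical callback: wrap the captured number in angle brackets.
    number = match.group(1) or match.group(2)
    return "<" + number + ">"

# flags must be given by keyword; the fourth positional argument is count.
result = re.sub(PATTERN, replace, "echo ${1} and $2", flags=re.VERBOSE)
print(result)
```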
You can also split the expression over multiple lines by relying on the implicit concatenation of adjacent string literals, like the following:
line = re.sub(r"\$\{([0-9]+)\}|\$([0-9]+)|"
              r"\$\{(.+-.+)\}|"
              r"\$\{(\w+?\=\w+?)\}|"
              r"\$(\w[^-]+)|\$\{(\w[^-]+)\}",replace,line)

How do i use list as variable in regexp in Python

How do I use a list variable in a regexp?
The problem is here:
re.search(re.compile(''.format('|'.join(map(re.escape, kand))), corpus.raw(fileid)))
The error is:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
A simple re.search works well, but I need to build the first argument of re.search from a list:
    for fileid in corpus.fileids():
        if re.search(r'[Чч]естны[й|м|ого].труд(а|ом)', corpus.raw(fileid)):
            dict_features[fileid]['samoprezentacia'] = 1
        else:
            dict_features[fileid]['samoprezentacia'] = 0
        if re.search(re.compile('\b(?:%s)\b'.format('|'.join(map(re.escape, kand))), corpus.raw(fileid))):
            dict_features[fileid]['up'] = 1
        else:
            dict_features[fileid]['up'] = 0
    return dict_features
By the way, kand is a list:
kand = [line.strip() for line in open('kand.txt', encoding="utf8")]
Printed out, kand is ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
Edit: I am using Python 3.3.2 with WinPython on Windows 7.
Full error stack:
Traceback (most recent call last):
File "F:/Python/NLTK packages/agit_classify.py", line 59, in <module>
print (regexp_features(agit_corpus))
File "F:/Python/NLTK packages/agit_classify.py", line 53, in regexp_features
if re.search(re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kandidats_all))), corpus.raw(fileid))):
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 214, in compile
return _compile(pattern, flags)
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 281, in _compile
p = sre_compile.compile(pattern, flags)
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_compile.py", line 494, in compile
p = sre_parse.parse(p, flags)
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 748, in parse
p = _parse_sub(source, pattern, 0)
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 360, in _parse_sub
itemsappend(_parse(source, state))
File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 453, in _parse
if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
The reason you're getting the actual exception is mismatched parentheses. Let's break it up to make it clearer:
re.search(
    re.compile(
        ''.format('|'.join(map(re.escape, kand))),
        corpus.raw(fileid)))
In other words, you're passing a string, corpus.raw(fileid), as the second argument to re.compile, not as the second argument to re.search.
That means you're trying to use it as the flags value, which is supposed to be an integer. When re.compile tries to use the & operator on your string to test each flag bit, it raises a TypeError.
And if you got past this error, the re.search would itself raise a TypeError because you're only passing it one argument rather than two.
This is exactly why you shouldn't write overly-complicated expressions. They're very painful to debug. If you'd written this in separate steps, it would be obvious:
escaped_kand = map(re.escape, kand)
alternation = '|'.join(escaped_kand)
whatever_this_was_supposed_to_do = ''.format(alternation)
regexpr = re.compile(whatever_this_was_supposed_to_do, corpus.raw(fileid))
re.search(regexpr)
This would also make it obvious that half the work you're doing isn't needed in the first place.
First, re.search takes a pattern, not a compiled regexpr. If it happens to work with a compiled regexpr, that's just an accident. So, that whole part of the expression is useless. Just pass the pattern itself.
Or, if you have a good reason to compile the regexpr, as re.compile explains, the result regular expression object "can be used for matching using its match() and search() methods". So use the compiled object's search method, not the top-level re.search function.
Second, I don't know what you expected ''.format(anything) to do, but it can't possibly return anything but ''.
You're mixing old and new string formatting rules. Also, you need to use raw strings with a regex, or \b will mean backspace, not word boundary.
'\b(?:%s)\b'.format('|'.join(map(re.escape, kand)))
should be
r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand)))
Furthermore, be aware that \b only works if your "words" start and end with alphanumeric characters (or _).
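Putting both answers together, a step-by-step version could look like the sketch below. The helper name build_word_pattern and the sample sentences are invented for illustration; kand is the list from the question.

```python
import re

def build_word_pattern(words):
    # Escape each word so any regex metacharacters in it are matched literally.
    alternation = "|".join(map(re.escape, words))
    # Raw string, so \b means word boundary rather than backspace.
    return re.compile(r"\b(?:{0})\b".format(alternation))

kand = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
pattern = build_word_pattern(kand)

# Use the compiled object's own search method; the text is its only argument.
found = bool(pattern.search("I ate a peach today"))
missing = bool(pattern.search("pineapples only"))

print(found, missing)
```

One caveat with this sketch: an empty word list yields the pattern \b(?:)\b, which matches at any word boundary, so the list may need to be checked before the pattern is built.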
