Python multiple repeat Error - python

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'

The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
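To see the difference concretely, here is a small sketch comparing a plain and a raw literal. The variable names are only illustrative and do not come from the question.

```python
# Illustrative names, not taken from the original code.
plain = "\""    # non-raw literal: the backslash is consumed, leaving just "
raw = r"\""     # raw literal: the backslash survives, giving \ followed by "

print(len(plain), len(raw))   # 1 2
print("\\s" == r"\s")         # True: doubling the backslash and using r"" agree
```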
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
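That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in effect the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group, at least before Python 3.11; from 3.11 on, ++ is accepted as a possessive quantifier, so this exact input compiles, although other doubled quantifiers such as ** still raise. In fact, you can test this in isolation: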
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the dot.
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
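As one possible sketch of the character-class version: build_term_pattern is a hypothetical helper name, and the suffix list simply mirrors the alternation from the question, so adjust it to whatever suffixes you actually want to allow.

```python
import re

def build_term_pattern(term):
    """Hypothetical helper: whitespace, the literal term, an optional suffix, whitespace."""
    # Multi-character suffixes stay as alternatives; the single-character
    # ones are folded into one character class.
    suffix = r"(?:'?s|!+|\?+|[,.;:()\"])?"
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

p = build_term_pattern('lg incite" OR author:"http++www.dealitem.com" OR "for sale')

print(bool(build_term_pattern("google").search("I love google!!! ")))  # True
print(bool(build_term_pattern("dog").search("I love dogs ")))          # True
print(bool(build_term_pattern("dog").search("hotdogs ")))              # False
```

Because the term goes through re.escape, characters such as + or . in it are matched literally rather than interpreted as regex syntax.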

The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    pass  # do whatever
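One caveat worth checking before relying on this: a plain substring test does not enforce the surrounding whitespace that the question asks for. A quick sketch with made-up inputs:

```python
term = "dog"
spaced = "I love dogs "
embedded = "hotdogs"

print(term in spaced)     # True
print(term in embedded)   # True as well: `in` ignores what surrounds the term
```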

Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
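A small sketch of how the swapped call can produce this error. The pattern and the HTML snippet are invented for illustration; any HTML that happens to contain a doubled quantifier such as ** would behave the same way.

```python
import re

pattern = r"<code>(.*?)</code>"
html = "<p>Compute <code>2**10</code> first.</p>"

# Correct order: pattern first, then the text to search.
print(re.findall(pattern, html))   # ['2**10']

# Swapped order: the HTML is compiled as if it were the regex.
swapped_error = None
try:
    re.findall(html, pattern)
except re.error as exc:
    swapped_error = str(exc)
    print(swapped_error)
```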

I have a string such as "i love you c++", and searching it for the term "c++" with a regex raises the "multiple repeat" error. The error occurs because the term contains "++", and + is a special character in a regex. My fix was to pass the term through re.escape. Note that a trailing \b cannot match after "+", so the end of the term is checked with (?!\w) instead. Here is my code:
word_filter = "c++"          # the term to look for
word_en = "i love you c++"   # the text to search
regex_word = re.search(rf'\b{re.escape(word_filter)}(?!\w)', word_en)

A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Related

Python regex: find words and emoticons

I want to find matches between a tweet and a list of strings containing words, phrases, and emoticons. Here is my code:
words = [':)','and i','sleeping','... :)','! <3','facebook']
regex = re.compile(r'\b%s\b|(:\(|:\))+' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
I keep receiving this error:
error: unbalanced parenthesis
Apparently there is something wrong with the code and it cannot match emoticons. Any idea how to fix it?
I tried the below and it stopped throwing the error:
words = [':\)','and i','sleeping','... :\)','! <3','facebook']
The re module has a function escape that takes care of correct escaping of words, so you could just use
words = map(re.escape, [':)','and i','sleeping','... :)','! <3','facebook'])
Note that word boundaries might not work as you expect when used with words that don't start or end with actual word characters.
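One way to sidestep the word-boundary issue is sketched below, under the assumption that "not adjacent to a word character" is the boundary you want. The lookarounds (?<!\w) and (?!\w) replace \b, and the sample tweet is made up.

```python
import re

words = [':)', 'and i', 'sleeping', '... :)', '! <3', 'facebook']

# Longest entries first, so '... :)' is tried before the shorter ':)'.
alternation = '|'.join(re.escape(w) for w in sorted(words, key=len, reverse=True))

# Lookarounds instead of \b: they also work next to punctuation.
regex = re.compile(r'(?<!\w)(?:%s)(?!\w)' % alternation, re.IGNORECASE)

found = regex.findall('Sleeping on Facebook ... :) and I <3 it')
print(found)   # ['Sleeping', 'Facebook', '... :)', 'and I']
```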
While words has all the necessary formatting, re uses ( and ) as special characters. You need to write \( or \) so that they are treated as the literal ASCII characters 40 and 41 rather than as group delimiters. Since you didn't understand what @Nicarus was saying, you need to use this:
words = [r':\)','and i','sleeping',r'... :\)','! <3','facebook']
Note: I'm only spelling it out because this doesn't seem like a school assignment, for all the people who might want to criticize this. Also, look at the documentation prior to going to Stack Overflow. This explains everything.

Backslashes in Python Regex

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:
searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
However - say pojo is 'MyObject' - the regex is not matching it to this line:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
If I print the string (while stopped in Pdb) I'm searching with, I see this:
'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'
I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:
searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)
And the string searched becomes
'<class name="(.*\\.|)MyObject".*table="(.*?)"'
It still doesn't return a match. What's going wrong here? The regex expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area.) Any idea what is going wrong here?
Given this:
re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
The first part of the pattern is interpreted like this:
1. <class name=" a literal string beginning with < and ending with "
2. ( the beginning of a group
3. .* zero or more of any characters
4. \\ a single literal backslash
5. . any single character
6. OR
7. nothing
8. ) end of the group
Since the string you're searching for does not have a literal backslash, it won't match.
If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
This seems to work for me:
import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"
pattern = r'<class name="(.*\.)?' + pojo + '".*table="(.*?)"'
assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))
On Pythex, I tried this regex:
<class name="(.*)\.MyObject" table="([^"]*)"
on this string:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
and got these two match captures:
com.place.package
my_cool_object
So I think in your case, this line
searchObj = re.search(r'<class name="(.*)\.' + pojo + '" table="([^"]*)"', contents)
will produce the result you want.
About the confusing backslashes (you add two and then four show up): the Python documentation, 7.2. re — Regular expression operations, explains that r'' is “raw string notation”, used to circumvent Python’s regular character escaping, which uses a backslash. So:
'\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python for its repr (which is what Pdb and the interactive prompt display), it will escape the output and show you '\\'; print shows the single \.
r'\\' means “a string composed of two actual backslashes”. It’s stored as character \ followed by character \. Its repr escapes each backslash and shows '\\\\'; print shows \\.
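A quick way to check this for yourself (the names one and two are only illustrative):

```python
one = '\\'    # non-raw literal: a single backslash
two = r'\\'   # raw literal: two backslashes

print(len(one), len(two))      # 1 2
print(one, two)                # the actual characters
print(repr(one), repr(two))    # the escaped form a debugger or the prompt shows
```

The doubled backslashes seen in Pdb come from repr, not from extra characters in the string.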

How to fix: b = re.search('\(.*).\d+\.tld', a)

This is the code:
a = '000.222.tld'
b = re.search('(.*).\d+\.tld', a)
would like to see it print
000
so far..
print b.group(0)
gives me this:
000.222.tld
print b.group(1)
gives me this:
000.2
There are a few problems with your expression:
b = re.match('\(.*)\.\d+\.com', a)
First, that \( means that you're escaping the (—it will only match a literal ( character in the search string. You're not trying to match any parentheses, you're trying to create a capturing group, so don't escape the parens. (Also, you're not escaping the matching ), so you'd get an error about mismatched parens trying to use this…)
Second, you're trying to match .com, but your sample input ends in .tld. Those obviously aren't going to match. Presumably you wanted to match any string of letters, or some other rule?
Finally, you're not using a raw string literal, or escaping your backslashes. Sometimes you get away with this, but do you know the Python backslash-escape rules by heart so well that you can be sure that \d or \. doesn't mean anything? Do you expect anyone who reads your code to also know?
If you fix all of those problems, your regex works:
>>> a = '1.2.tld'
>>> b = re.match(r'(.*)\.\d+\.[A-Za-z]+', a)
>>> b.group(1)
'1'
Now that you've completely changed both the expression and the input, you have completely different problems:
b = re.search('(.*).\d+\.tld', a)
The main problem here, besides again not using a raw string literal, is that you didn't escape the first ., so you're searching for any character there. Since regular expressions are greedy by default, the first .* will capture as much as it can while still leaving room for any character, 1 or more digits, and .tld, so it will match 000.2. But if you escape the ., it will capture as much as it can while still leaving room for a literal ., 1 or more digits, and .tld, which is exactly what you want.
>>> a = '000.222.tld'
>>> b = re.search(r'(.*)\.\d+\.tld', a)
>>> b.group(1)
'000'
Meanwhile, there are some great regular expression debuggers, both downloadable and online. I don't want to recommend one in particular, but Debuggex makes it easy to create a sharable link to a particular test, so here is your first one, and here is your second. Check out the examples and see how much easier it is to find the problems with your pattern that way.
You can do it without regex:
b = a.split('.', 1)[0]

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is, it has to be smart so it won't take a '//' sequence inside a string as the beginning of a comment. I have the following idea to do it:
Iterate through lines.
Find the '//' sequence first, then find all strings surrounded with quotes (' or ") in the line, and then iterate through all string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside of them, it's obvious that it is a proper comment beginning.
When testing the code on the following line (part of a bigger js file, of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I encountered a problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
    print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is that they both return the same, correct results, which differ from my code's output:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
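For reference, a complete sketch of the raw-string version, run against the line from the question. The layout and comments inside the pattern are my own; the regex itself is the one from the question.

```python
import re

# Same regex as in the question, but written as a raw string so that \\
# reaches the regex engine as an escaped backslash.
re_strings = re.compile(r'''
    " (?: \\. | [^\\"] )* "    # double-quoted string, honouring escapes
  | ' (?: [^\\'] | \\. )* '    # single-quoted string, honouring escapes
''', re.VERBOSE)

# Raw string here too, so the backslashes really are in the test input.
line = r'document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'

matches = [m.group(0) for m in re_strings.finditer(line)]
for s in matches:
    print(s)
```

This prints the four strings that RegexBuddy reported.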
You cannot deal with matching quotes with regex... in fact, you cannot guarantee any matching pairs of anything (and nested pairs especially)... you need a more sophisticated state machine for that (LLVM, etc.).
Source: lots of CS classes...
Also see: Matching pair tag with regex, for a more detailed explanation.
I know it's not what you wanted to hear, but it's basically just the way it is... and yes, different implementations of regex can return different results for stuff that regex can't really do.

How to find all words followed by symbol using Python Regex?

I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall('\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!
r'(\w+)\s*=\s*'
re.findall(r'(\w+)\s*=\s*', 'I think Python=amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python =amazing')  # returns ['Python']
You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
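To illustrate the false-positive point, here is a rough sketch that compares the regex with a parse using the standard ast module. assigned_names is a hypothetical helper and the source snippet is invented; a real refactoring tool does considerably more than this.

```python
import ast
import re

def assigned_names(source):
    """Hypothetical helper: names that are actually assigned to in the source."""
    tree = ast.parse(source)
    return sorted({
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    })

source = 'x = 1\nif x == 1:\n    y = "z = 3"\n'

regex_hits = re.findall(r'(\w+)\s*=', source)
print(regex_hits)              # ['x', 'x', 'y', 'z']
print(assigned_names(source))  # ['x', 'y']
```

The regex also reports the x in the == comparison and the z inside the string literal, neither of which is an assignment.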
Just add some optional whitespace before the =:
\w+(?=\s*=)
Use this instead
re.findall('^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»
You need to allow for whitespace between the word and the =:
re.findall(r'\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a lookahead around the equals sign:
re.findall(r'(\w+)\s*=', "I think Python = amazing")
r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.
re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']
