What is wrong with this Regular Expression? - python

I am trying to create a test to verify that a link is rendered on a webpage.
I'm not understanding what I'm doing wrong on this assertion test:
self.assertRegexpMatches( response.content, r'elite')
I know that the markup is on the page because I copied it from response.content
I tried to use the regular expression in the Python shell:
In [27]: links = """<div class="tabsA">activenewesthottestmost votedelite</div>"""
In [28]: re.search(r'elite', links)
For some reason it's not working there either.
How do I create the regular expression so it works?

Why are you using a regex here? There's absolutely no reason to. You're just matching a simple string. Use:
self.assertContains(response, 'elite')

The ? in your regex is getting interpreted as a ? quantifier (end of this part):
<a href="/questions/?...
Thus the engine never matches the literal ? that appears in the string, and instead matches an optional / at that position. Escape it with a backslash like so:
<a href="/questions/\?...
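To make the effect concrete, here is a minimal sketch. The markup in links is hypothetical (the full anchor tag from the question is not shown above), so treat the URL and attribute values as assumptions. It compares the unescaped pattern, the escaped pattern, and re.escape(), which escapes a literal string for you.

```python
import re

# Hypothetical markup standing in for the link in the question.
links = '<a href="/questions/?sort=elite">elite</a>'

# Unescaped: "/?" means "an optional slash", so the literal "?" is never matched.
unescaped = re.search(r'<a href="/questions/?sort=elite">elite</a>', links)

# Escaped: "\?" matches a literal question mark.
escaped = re.search(r'<a href="/questions/\?sort=elite">elite</a>', links)

# re.escape() escapes every special character in a literal string.
auto_escaped = re.search(re.escape(links), links)

print(unescaped)                 # None
print(escaped is not None)       # True
print(auto_escaped is not None)  # True
```

If the text to find is a fixed string, re.escape() (or a plain substring check) avoids having to spot each metacharacter by hand.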

You should escape "?", because that symbol has a special meaning in regex.
>>> re.search(r'elite', links)

The ? character is a special RegEx Character and must be escaped.
The following regexp would work:
elite
Note the \ before the ?
A great tool for messing around with RegEx can be found here:
http://regexpal.com/
It can save you an awful lot of time and headaches...

It's probably the "<" and ">" characters. In some regular expression syntaxes they are special characters that indicate beginning and end of line.
You might look at a regular expression tester tool to help you learn them.

Related

Is there a way to match ( ) brackets in regular expression

I am trying to match the following using a regular expression, but I am struggling to match the round brackets.
[??(Z)Z-axis Down Position Stroke]
Can anyone kindly advise?
My current expression is shown below.
[[][[a-zA-Z0-9_?. - ]{1,30}[]]
You can add a backslash \ before any character to escape it. Try this regex:
\[\?\?\(Z\)Z-axis Down Position Stroke\]
When writing regex, I find regex101.com to be really helpful. It's a free website that evaluates your regex and lets you specify test cases etc, then breaks those down and tells you about the various matching conditions in them. Worth a look if you're learning regular expressions.
Edit: Also, it's necessary to escape the brackets, parentheses, and question marks because those all have special meaning in regex.
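For reference, a small Python sketch of both approaches: the fully escaped literal from this answer, and a more general pattern in the spirit of the asker's character class. The general pattern and its upper bound of 40 are assumptions for illustration, not something from the question.

```python
import re

text = "[??(Z)Z-axis Down Position Stroke]"

# Every metacharacter escaped by hand, as in the answer above.
literal = re.compile(r"\[\?\?\(Z\)Z-axis Down Position Stroke\]")

# Assumed generalisation: any bracketed label built from these characters.
# Inside a character class, "?", ".", "(" and ")" need no escaping.
# The bound of 40 is an assumption; the sample label is 32 characters long.
general = re.compile(r"\[[a-zA-Z0-9_?.() -]{1,40}\]")

print(bool(literal.fullmatch(text)))  # True
print(bool(general.fullmatch(text)))  # True
print(re.escape("(Z)"))               # \(Z\)
```

Note that the label between the brackets is 32 characters long, so a {1,30} limit would reject it even once the parentheses are allowed.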

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries only allow restricted expressions inside lookbehind assertions, such as:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
only match strings with an upper bound length: (?<=\s{,4}) (up to four repetitions)
These limitations exist mainly because those libraries can't process regular expressions backwards at all, or can only do so for a limited subset.
Another reason could be to prevent authors from building overly complex regular expressions that are expensive to process because they exhibit so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
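As a quick way to see which category an engine falls into, here is a sketch for Python's built-in re module, which accepts alternatives only when they all have the same width. The helper name lookbehind_accepted is made up for this example.

```python
import re

def lookbehind_accepted(pattern):
    """Return True if Python's re module compiles the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# All alternatives are three characters wide: accepted.
print(lookbehind_accepted(r"(?<=foo|bar|\s,\s)x"))  # True

# Alternatives of different widths: rejected by re.
print(lookbehind_accepted(r"(?<=foobar|\r\n)x"))    # False

# Bounded but variable repetition: also rejected by re.
print(lookbehind_accepted(r"(?<=\s{0,4})x"))        # False
```

Other engines draw the line elsewhere, so the same check is worth running against whatever library you actually use.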
If you're not using Python and your engine lacks variable-length lookbehind, you can trick the regex engine with \K, which discards everything matched so far and starts the reported match over at that point.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
In short, when part of your expression has to match but you only want what comes after it, \K forces the reported match to start over from there.
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
Matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ causes the match to restart after the closing > of the opening div tag, so nothing before that point is included in the result. The (?=\<\/div) lookahead makes the engine take everything in front of the ending div tag.
What Amber said is true, but you can work around it with another approach: a non-capturing group.
(?<=this\sis\san)(?:\s*)example
That makes it a fixed-length lookbehind, so it should work.
You can use sub-expressions.
(this\sis\san\s*?)(example)
To retrieve group 2, "example", use $2 in most regex replacement syntaxes, or \2 in a replacement string (as with Python's re.sub).
Most regex engines don't support variable-length expressions for lookbehind assertions.
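The two workarounds suggested in this thread can be sketched in Python as follows. The sample text with three spaces is an assumption, chosen to show the variable amount of whitespace.

```python
import re

text = "this is an   example"

# Workaround 1: match the prefix normally and capture only the wanted part.
m = re.search(r"this\sis\san\s*(example)", text)
print(m.group(1), m.start(1))  # example 13

# Workaround 2: fixed-width lookbehind, with the whitespace moved outside it.
# The overall match now includes the leading whitespace.
m2 = re.search(r"(?<=this\sis\san)(?:\s*)example", text)
print(repr(m2.group(0)))       # '   example'
```

The capture-group version gives the exact position of "example" via m.start(1), while the lookbehind version needs the whitespace stripped from the match afterwards.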

Django URL Reg-Ex

Hi all,
How does this expression actually work?
urlpatterns = patterns('',
url(r'^get/(?P<app_id>\d+)/$', 'app.views.app'),
...
)
I understand what it does, at least to map a url entered by the user to the app() function in the app's view page. I also understand it is a regular expression that ends up taking the id of the app and mapping it to the url. But where is this function going? What is going on with the r'^...?P /$ (I get the d+ is a digit regex, of the id itself, but that's about it).
I also understand this url function draws from the django.conf.urls module.
Perhaps my misunderstanding is more buried in my lack of regex experience. Nonetheless, I need help! I do not like using things I do not understand, and I am guilty.
Let's take a look: r'^get/(?P<app_id>\d+)/$'
The r'' prefix makes this a raw string: every character inside the quotes, including backslashes, is taken literally instead of being processed as an escape sequence.
The ^ character anchors the match at the beginning of the string. For example, forget/123/ won't match the expression because it doesn't start with get. If the ^ weren't there, it would match, because the pattern would no longer force the string to begin with get, only to contain get... somewhere.
The $ character anchors the match at the end of the string. If it were absent, get/123/xd would match the expression, which is not desired.
(?P<>) is a way to give a name/alias to a group in the expression.
You should read Python's regular expression documentation. It's very good to know about regular expressions because they're very useful.
Hope this helps!
r just changes how the following string literal is interpreted. Backslashes (\) are not treated as escape sequences, that means that the regex in the string will be used as is.
^ at the beginning and $ at the end match the beginning and the end of the string respectively.
(?P<name>...) is a named capturing group: it lets you cut out a part of the URL and pass it as a parameter into the view. See more in the Django named groups docs.
Hope that helps.
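To see what the pattern does without Django in the way, here is a small sketch using only the re module. The resolve helper is hypothetical; it only approximates the matching step, assuming the path is given without its leading slash.

```python
import re

pattern = re.compile(r'^get/(?P<app_id>\d+)/$')

def resolve(path):
    """Return the named groups as a dict, or None if the path does not match."""
    match = pattern.search(path)
    return match.groupdict() if match else None

print(resolve('get/123/'))     # {'app_id': '123'}
print(resolve('forget/123/'))  # None, because of ^
print(resolve('get/123/xd'))   # None, because of $
```

The captured value is the string '123', not an integer; the named group is what ends up as the app_id keyword argument of the view.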

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is, it has to be smart enough not to take a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:
Iterate through lines.
Find the '//' sequence first, then find all strings surrounded with quotes (' or ") in the line, and then iterate through all string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be the beginning of a proper comment.
When testing the code on the following line (part of a bigger JS file, of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered a problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
    print(s.group(0))
In Python 3.2.3 (and 3.1.4) it returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is that they both return the same, correct results, which differ from my code's output:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
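The difference can be reproduced with a short sketch that compiles the same pattern text twice, once as a normal string and once as a raw string. The pattern is condensed onto one line here for brevity, and the variable names are made up for the example.

```python
import re

# The JavaScript line from the question, kept in a raw string so the
# backslashes really are present in the target text.
line = r'''document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'''

# Normal string: Python turns \\ into a single backslash before re sees it,
# so the regex becomes "(?:\.|[^\"])*" and no longer handles escaped quotes.
plain = re.compile(""" "(?:\\.|[^\\"])*" | '(?:[^\\']|\\.)*' """, re.VERBOSE)

# Raw string: the regex engine receives \\. and [^\\"] unchanged.
raw = re.compile(r""" "(?:\\.|[^\\"])*" | '(?:[^\\']|\\.)*' """, re.VERBOSE)

plain_matches = [m.group(0) for m in plain.finditer(line)]
raw_matches = [m.group(0) for m in raw.finditer(line)]

print(len(plain_matches))  # 7
for s in raw_matches:
    print(s)
```

With the raw string, the four printed matches are the ones the regex testers reported; the normal string gives the seven fragments from the question.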
You cannot deal with matching quotes with regex. In fact, you cannot guarantee any matching pairs of anything (nested pairs especially). You need a more sophisticated state machine for that (LLVM, etc.).
Source: lots of CS classes...
Also see: Matching pair tag with regex for a more detailed explanation.
I know it's not what you wanted to hear, but that's basically just the way it is. And yes, different implementations of regex can return different results for stuff that regex can't really do.

python re.search (regex) to search words who have pattern like {{world}} only

I have an HTML file in which I have inserted custom tags like {{name}}, {{surname}}. Now I want to find only the tags that exactly match the pattern {{world}}, and not {world}}, {{world}, {world}, { word }, {{ world }}, etc.
I wrote this small piece of code for it:
re.findall(r'\{(\w.+?)\}', html_string)
It returns the words that follow the patterns {{world}}, {world} and {world}}, which I don't want. I want to match exactly {{world}}. Can anybody please guide me?
Um, shouldn't the regex be:
'\{\{(\w.+?)\}\}'
Ok, after the comments, I understand your requirements more:
'\{\{\w+?\}\}'
should work for you.
Basically, you want {{any number of word characters including underscore}}. You don't actually need the lazy match in this case, so you may remove the ? in the expression.
Something like {{keyword1}} other stuff {{keyword2}} will not match as a whole now.
To get only the keyword without getting the {{}} use below:
'(?<=\{\{)\w+?(?=\}\})'
How about this?
re.findall(r'{{(\w+)}}', html_string)
Or, if you want the curly braces included in the results:
re.findall(r'({{\w+}})', html_string)
If you're trying to accomplish html templating, though, I recommend using a good template engine.
This will not allow any curly braces inside your result; is that what you want?
'\{\{(\w[^\{\}]+?)\}\}'
http://rubular.com/r/79YwR13MS0
If you want to match doubled curly brackets, you should specify them in your regex:
re.findall(r'\{\{(\w[^}]*?)\}\}', html_string)
You say the other answers don't work, but they seem to for me:
>>> import re
>>> html_string = '{{realword}} {fake1}} {{fake2} {fake3} fake4'
>>> re.findall(r'\{\{(\w.+?)\}\}', html_string)
['realword']
If it doesn't work for you, you'll need to give more details.
Edit: How about the following? Getting rid of the dot (.) and using only \w also allows you to use greedy quantifiers and works for the example HTML from your comment:
>>> html_string = 'html>\n <head>\n </head>\n <title>\n </title>\n <body>\n <h1>\n T - Shirts\n </h1>\n <img src="March-Tshirts/skull_headphones_tshirt.jpg" />\n <img src="/March-Tshirts/star-wars-t-shirts-6.jpeg" />\n <h2>\n we - we - we\n </h2>\n {{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} \n </body>\n</html>\n'
>>> re.findall(r'\{\{(\w+)\}\}', html_string)
['unsubscribe']
The \w matches alphanumeric characters and the underscore; if you need to match more characters you could add it to a set (e.g., [\w\+] to also match the plus sign).
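To compare the patterns from this thread side by side, here is a short sketch. The test string is an assumption, assembled from the malformed variants listed in the question.

```python
import re

html_string = "{{unsubscribe}} -- {{tracking_beacon} -- {web_url}} -- {name} -- {{ spaced }}"

# Original pattern: single braces, so malformed tags are picked up too.
loose = re.findall(r"\{(\w.+?)\}", html_string)

# Doubled braces on both sides, word characters only in between.
names = re.findall(r"\{\{(\w+)\}\}", html_string)

# Same idea, but keeping the braces in the result.
tags = re.findall(r"\{\{\w+\}\}", html_string)

# Lookaround variant: braces are required but not part of the match.
lookaround = re.findall(r"(?<=\{\{)\w+(?=\}\})", html_string)

print(loose)       # ['unsubscribe', 'tracking_beacon', 'web_url', 'name']
print(names)       # ['unsubscribe']
print(tags)        # ['{{unsubscribe}}']
print(lookaround)  # ['unsubscribe']
```

Only the patterns that spell out both pairs of braces reject the malformed tags; {{ spaced }} is rejected by all of them because \w cannot match the space after the opening braces.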
