Regular Expressions in unicode strings

I have some unicode text that I want to clean up using regular expressions. For example, I have cells whose content is just u'(2'. This happens because, for formatting reasons, the closing paren ends up in an adjacent HTML cell. My initial solution was to look ahead at the contents of the next cell and use a string function to determine whether it held the closing paren. I knew this was not a great solution, but it worked. Now I want to fix it, but I can't seem to make the regular expression work.
missingParen=re.compile(r"^\(\d[^\)]$")
My understanding of what I think I am doing:
^ at the beginning of the string I want to find
( an open paren, the paren has to be backslashed because it is a special character
\d I also want to find a single digit
[ I am creating a special character class
^ I don't want to find what follows
) which is a close paren
$ at the end of the string
And of course the plot thickens. I made a silly assumption that because I used \d I would not match (33, but I was wrong. So I added a {1} to my regular expression and that did not help; it matched (3333, so my problem is more complicated than I thought. I want the string to be only an open paren and a single digit. Is this the more clever approach?
missingParen=re.compile(r"^\(\d$")
And note, S Lott: I already tagged it beginner, so you can't pick up any cheap points. Not that I don't appreciate your insights. I keep meaning to read your book; it probably has the answer.

Okay, sorry for using this as a stream-of-consciousness thinking stimulator, but it appears that writing out my original question got me on the path. It seems to me that this is a solution for what I am trying to do:
missingParen=re.compile(r"^\(\d$")
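As a quick check of the two patterns, here is a minimal sketch (the sample strings are made up for illustration). The class [^\)] still has to consume one character, which is why the first pattern accepts (33 but rejects (2:

```python
import re

# First attempt: "(" + one digit + one character that is not ")".
original = re.compile(r"^\(\d[^\)]$")
# Second attempt: "(" + exactly one digit, then end of string.
fixed = re.compile(r"^\(\d$")

# Hypothetical cell contents, chosen only to exercise both patterns.
samples = ["(2", "(33", "(2)", "(3333"]

for cell in samples:
    print(cell, bool(original.match(cell)), bool(fixed.match(cell)))
```

Only the second pattern accepts (2 while rejecting the longer strings.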

Related

what does this python regex mean "([\w\/%]*)" [duplicate]

I am reading the Shinken source code in shinken/misc/perfdata.py and I finally found a regex that I cannot understand:
metric_pattern = re.compile('^([^=]+)=([\d\.\-\+eE]+)([\w\/%]*);?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE:~#]+)?;?([\d\.\-\+eE]+)?;?([\d\.\-\+eE]+)?;?\s*')
What confuses me is: what does \/ mean in ([\w\/%]*)?

You're rightfully confused, because that regex must have been written by someone who doesn't know Python regexes well.
In some languages (e.g. JavaScript), regexes are delimited by slashes. That means that if you need an actual slash in your regex, you have to escape it. Since Python doesn't use slashes, there's no need to escape the slash (but it doesn't cause an error, either).
Much more worrisome is that the author failed to use a raw string. In many cases, that won't matter (because Python will treat "\d" as "\\d", which then correctly translates to the regex \d), but in other cases, it will cause problems. One example is "\b", which means "a backspace character" and not "a word boundary anchor" like the regex \b would.
Also, the author has escaped a lot of characters that didn't need escaping at all. The entire regex could be rewritten as
metric_pattern = re.compile(r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?;?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*')
and even then, I'm surprised that it works at all. It looks very chaotic to me and is definitely not foolproof. For example, there appears to be a big potential for catastrophic backtracking, meaning that users could freeze your server with malicious input.
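To see the raw-string point in action, here is a small sketch. The perfdata string is a made-up sample, not taken from Shinken, and only the simplified raw-string pattern is exercised:

```python
import re

# Without the r prefix, "\b" is a single backspace character.
print(len("\b"), len(r"\b"))

# So the non-raw pattern looks for backspaces around "cat" and finds none.
print(re.search("\bcat\b", "a cat"))
print(re.search(r"\bcat\b", "a cat"))

# The simplified pattern, applied to a hypothetical metric string.
metric_pattern = re.compile(
    r'^([^=]+)=([\d.+eE-]+)([\w/%]*);?([\d.+eE:~#-]+)?;?([\d.+eE:~#-]+)?'
    r';?([\d.+eE-]+)?;?([\d.+eE-]+)?;?\s*'
)
match = metric_pattern.match("time=0.5s;1;2;0;10")
print(match.groups())
```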

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches every single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!

The [^\] is a character class for anything not a \, that is why it is matching everything. You want a negative lookbehind assertion:
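((?<!\\)[#\$%\^&_\{\}~\\])
(?<!...) will match whatever follows it as long as ... is not in front of it. Note that the backslash inside the lookbehind has to be doubled; a single \ would escape the closing parenthesis. You can check this out in the Python docs.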

The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving around the parenthesis should fix your original regex ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].
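A minimal sketch of how the lookbehind can be combined with re.sub to do the actual escaping. The function name escape_latex is made up for illustration, and the backslash is left out of the character class, as the edit to the question suggests:

```python
import re

# One reserved character that is not directly preceded by a backslash.
UNESCAPED_SPECIAL = re.compile(r'(?<!\\)([#$%^&_{}~])')

def escape_latex(text):
    # Put a single backslash in front of every match.
    return UNESCAPED_SPECIAL.sub(r'\\\1', text)

print(escape_latex("50% of $10"))
print(escape_latex(r"already \% escaped"))
```

Characters that already carry a backslash are left alone, so running the function twice does not double the escaping.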

If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.
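The even-number-of-backslashes idea can be sketched with re.sub as well. This is only an illustration: the backslash is dropped from the character class (an assumption based on the question's edit), and the name escape_unescaped is hypothetical:

```python
import re

# Group 1: zero or more backslash pairs not preceded by another backslash.
# Group 2: the reserved character that follows them.
UNESCAPED = re.compile(r'(?<!\\)((?:\\\\)*)([#$%^&_{}~])')

def escape_unescaped(text):
    # Keep the backslash pairs, then insert one backslash before the character.
    return UNESCAPED.sub(r'\1\\\2', text)

print(escape_unescaped(r'a\\$ and b\$ and c$'))
```

A character preceded by an odd number of backslashes is treated as already escaped and is not touched.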

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types, to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need to transform an input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
into this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be more clear, I want to add a single quote after every occurrence of "MM(" and before the ")" that follows it, and after every occurrence of "func(" and before the "," that follows it.

This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answer to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that it could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM\('), followed by any number (including 0) of characters that aren't a close-parenthesis ('[^\)]*'), which are stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma, and it has to run on the result of the first one so that both changes end up in the output.
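Put together as a runnable sketch, under the same two assumptions (no parentheses inside MM(...), no comma in the first argument of func); the intermediate name step1 is just for illustration:

```python
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"

# Quote the argument of MM(...), assuming it contains no parentheses.
step1 = re.sub(r"MM\(([^)]*)\)", r"MM('\1')", input_user)
# Quote the first argument of func(...), assuming it contains no comma.
output = re.sub(r"func\(([^,]*),", r"func('\1',", step1)

print(output)
```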
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is, it has to be smart enough not to take a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:
Iterate through lines.
Find the '//' sequence first, then find all strings surrounded with quotes (' or ") in the line, and then iterate through all string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside of them, it's obviously the beginning of a proper comment.
When testing the code on the following line (part of a bigger JS file, of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I encountered a problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
    print(s.group(0))
In Python 3.2.3 (and 3.1.4) this returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So I used RegexBuddy (with Python compatibility) and the Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is that they both return the same, correct results, which differ from those of my code:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S. I would like to know what I'm doing wrong and why my code behaves differently from the regex test applications, not find other ideas for how to parse JavaScript code.

You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
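The difference can be reproduced with a small sketch that compiles the same pattern text twice, once as a normal string and once as a raw string. Only the double-quote branch is kept, and the sample line is made up for illustration:

```python
import re

# Same pattern text, once as a normal string and once as a raw string.
cooked = re.compile(""" " (?: \\. | [^\\"] )* " """, re.VERBOSE)
raw = re.compile(r""" " (?: \\. | [^\\"] )* " """, re.VERBOSE)

# Hypothetical JavaScript line containing escaped quotes.
line = r'a = "say \"hi\" now";'

print(cooked.findall(line))
print(raw.findall(line))
```

The first pattern stops at the first escaped quote, while the raw-string version returns the whole string literal.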

You cannot deal with matching quotes with regex... in fact you cannot guarantee any matching pairs of anything (and nested pairs especially)... you need a more sophisticated state machine for that (LLVM, etc...)
source: lots of CS classes...
Also see: Matching pair tag with regex for a more detailed explanation
I know it's not what you wanted to hear, but it's basically just the way it is... and yes, different implementations of regex can return different results for stuff that regex can't really do

Python Regex (Search Multiple values in one string)

In Python regex, how would I match against a large string of text and flag if any one of the regex values is matched? I have tried this with "|" (or) statements and I have tried making a regex list; neither worked for me. Here is an example of what I am trying to do with the or.
I think my "or" gets commented out:
patterns=re.compile(r'[\btext String1\b] | [\bText String2\b]')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
The above code always says it matches, regardless of whether the string appears, and if I change it around a bit I get matches on the first regex but it never checks the second. I believe this is because the "raw" prefix is commenting out my or statement, but how would I get around this?
I also tried to get around this by taking out the "raw" prefix and putting double slashes on my \b for escaping, but that didn't work either :(
patterns=re.compile('\\btext String1\\b | \\bText String2\\b')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
I then tried to do 2 separate raw strings with the or, and the interpreter complains about unsupported str operands...
patterns=re.compile(r'\btext String1\b' | r'\bText String2\b')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")

patterns=re.compile(r'(\btext String1\b)|(\bText String2\b)')
You want a group (optionally capturing), not a character class. Technically, you don't need a group here:
patterns=re.compile(r'\btext String1\b|\bText String2\b')
will also work (without any capture).
The way you had it, it checked for either one of the characters between the first square brackets, or one of those between the second pair. You may find a regex tutorial helpful.
It should be clear where the "unsupported str operands" error comes from. You can't OR strings, and you have to remember the | is processed before the argument even gets to compile.
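If the phrases live in a list, the alternation can be built from it. A minimal sketch, assuming the phrases are plain text rather than regexes (hence re.escape); the names phrases and contains_any and the sample inputs are made up for illustration:

```python
import re

# Hypothetical phrases to look for.
phrases = ["text String1", "Text String2"]

# Escape each phrase, wrap it in word boundaries, and join with "|".
patterns = re.compile("|".join(r"\b" + re.escape(p) + r"\b" for p in phrases))

def contains_any(text):
    # True if at least one phrase occurs in the text.
    return patterns.search(text) is not None

print(contains_any("this file mentions Text String2 once"))
print(contains_any("nothing relevant here"))
```

Note that there are no spaces around the "|": outside of re.VERBOSE mode, a space in the pattern has to match a literal space in the text.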

This part, [\btext String1\b], means "is one of the characters in 'text String1' present?" (plus a backspace, which is what \b means inside a character class, not a word boundary). So that matches anything but an empty line, I think.

In a RE pattern, square brackets [ ] indicate a "character class" (depending on what's inside them, "any one of these characters" or "any character except one of these", the latter indicated by a caret ^ as the first character after the opening [). This is what you're expressing and it has absolutely nothing to do with what you want -- just remove the brackets and you should be fine ;-).
