Match LaTeX reserved characters with regex - python

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches every single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!

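In [^\], the backslash escapes the closing bracket, so the character class does not end there. The whole pattern becomes one negated character class that matches any character other than the ones you listed, which is why it matches almost everything. You want a negative lookbehind assertion:
((?<!\\)[#\$%\^&_\{\}~\\])
(?<!...) will match whatever follows it as long as ... is not in front of it. You can check this out in the Python docs for the re module.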

The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
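Moving the parentheses around should fix your original regex: ([^\\])[#\$%\^&_\{\}~\\]. Note that this version also consumes the character in front of the reserved character, so it cannot match a reserved character at the very start of the string.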
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].
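As a minimal sketch of how the lookbehind pattern could be used for the actual escaping step: the function name escape_latex and the sample string are made up for illustration, and the backslash is left out of the character class because the asker's edit says backslashes should not be escaped.

```python
import re

# Reserved characters from the question, without the backslash (see the asker's edit).
# The lookbehind skips any character that already has a backslash in front of it.
UNESCAPED_RESERVED = re.compile(r'(?<!\\)([#$%^&_{}~])')


def escape_latex(text):
    """Put a backslash in front of every reserved character that lacks one."""
    # In the replacement, \\ is a literal backslash and \1 is the matched character.
    return UNESCAPED_RESERVED.sub(r'\\\1', text)


print(escape_latex(r'50% of $10 & \#1'))  # prints: 50\% of \$10 \& \#1
```

Using re.sub with a capturing group keeps the whole thing in one pass, which is what a plain .replace() chain cannot do without double-escaping.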

If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.
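A sketch of that idea with re.sub, under two assumptions that are not in the answer: the run of backslash pairs is captured in its own group so it can be put back unchanged, and the backslash itself is dropped from the character class (per the asker's edit). The name escape_unescaped is hypothetical.

```python
import re

# Group 1: an even number of backslashes (possibly none), not preceded by another backslash.
# Group 2: the reserved character that follows them.
EVEN_BACKSLASHES = re.compile(r'(?<!\\)((?:\\\\)*)([#$%^&_{}~])')


def escape_unescaped(text):
    """Escape reserved characters that are preceded by an even number of backslashes."""
    # Keep the backslash pairs as they are and insert one backslash before the character.
    return EVEN_BACKSLASHES.sub(r'\1\\\2', text)


print(escape_unescaped(r'a#'))      # prints: a\#
print(escape_unescaped(r'a\#'))     # prints: a\#   (already escaped, left alone)
print(escape_unescaped(r'a\\#'))    # prints: a\\\# (escaped backslash, then escaped #)
```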

Related

Python regex OR expression

I have a file named Document.pdf and sometimes it is called Document-12345678.pdf where -12345678 is a random number.
I want to check whether a file has been downloaded into a folder. While the download is not finished, the file shows up as Document.pdf.fkasfmq or Document-12345678.pdf.fkasfmq, where .fkasfmq is a random hash from the downloader, and I don't want that to match.
I tried a regex like r'Document(?:[\-0-9]+).pdf', but when I test it with either Document.pdf or Document-12345678.pdf it always returns false.
From my understanding (?:[\-0-9]+) means it can be or not in the set that matches any hyphen and any numbers before .pdf, is that correct? I am very very rusty with regex...
The parentheses only perform grouping, not optionality. If you want to make the expression optional, the ? quantifier does that (and actually the parentheses are unnecessary, as the character class is a single expression). Though as @anubhava notes in a comment, you might as well use the * quantifier then.
r'Document[-0-9]*\.pdf'
Notice also the backslash to match a literal dot; an unescaped . matches any character (other than newline). Inside a character class, an initial or final hyphen does not need to be backslash-escaped.
On the other hand, perhaps prefer a more precise expression:
r'^Document(-\d+)?\.pdf$'
which says: optionally, a hyphen followed by one or more digits, and nothing before or after.
You should mark it as optional with the "?" symbol. Otherwise, you are requiring that the name always contains the hyphen and/or digits part.
r'Document(?:[\-0-9]+)?\.pdf'
Or as @anubhava pointed out in the comments, it can be simplified to:
r'Document[\-0-9]*\.pdf'
This way, it will also match e.g. "Document.pdf"
Also, you should consider putting the mark "$" to signify end of string so that it doesn't match e.g. "Document.pdf.fkasfmq"
r'^Document(?:[\-0-9]+)?\.pdf$'
Or
r'^Document[\-0-9]*\.pdf$'
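A quick way to check an anchored pattern against the file names from the question. This is only a sketch: the list of names is sample data, and the pattern used is the stricter variant with an optional hyphen followed by one or more digits.

```python
import re

# Anchored at both ends so the downloader's temporary suffix cannot sneak through.
DOWNLOADED = re.compile(r'^Document(-\d+)?\.pdf$')

names = [
    'Document.pdf',
    'Document-12345678.pdf',
    'Document.pdf.fkasfmq',
    'Document-12345678.pdf.fkasfmq',
]

# Keep only the names that look like finished downloads.
finished = [name for name in names if DOWNLOADED.match(name)]
print(finished)  # prints: ['Document.pdf', 'Document-12345678.pdf']
```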
You can just use (\d{8}) to see if there's a document there with 8 digits in the filename.

What does the regex [^\s]*? mean?

I am learning to write a Python spider to download some pictures from the web, and I found the following code. I know some basic regex.
I know that \.jpg means .jpg and | means or. What is the meaning of [^\s]*? in the first line? Why is \s used there?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
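To see what the lazy quantifier changes in practice, here is a small sketch comparing [^\s]*? with its greedy form [^\s]* on a made-up URL that contains two extensions.

```python
import re

# Hypothetical sample text: a URL with ".jpg" in the path and ".png" in the query string.
text = 'http://a.example/x.jpg?y=1.png rest'

# Lazy: stops at the first position where an extension can follow.
lazy = re.search(r'http:[^\s]*?(\.jpg|\.png|\.gif)', text).group(0)

# Greedy: runs to the end of the non-whitespace run, then backtracks to the last extension.
greedy = re.search(r'http:[^\s]*(\.jpg|\.png|\.gif)', text).group(0)

print(lazy)    # prints: http://a.example/x.jpg
print(greedy)  # prints: http://a.example/x.jpg?y=1.png
```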
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
These are equally illegal in URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.
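The differences between the three patterns can be checked directly. In this sketch the helper which_match and the sample strings are made up for illustration; it reports, for each string, whether the first, second, and suggested pattern find a match.

```python
import re

FIRST = re.compile(r'http:[^\s]*?(\.jpg|\.png|\.gif)')
SECOND = re.compile(r'http://.*?(\.jpg|\.png|\.gif)')
BETTER = re.compile(r'https?://\S+\.(jpe?g|png|gif)')


def which_match(s):
    """Return (first, second, better) booleans for one candidate string."""
    return tuple(bool(p.search(s)) for p in (FIRST, SECOND, BETTER))


print(which_match('http:.png'))                     # prints: (True, False, False)
print(which_match('http://foo bar.png'))            # prints: (False, True, False)
print(which_match('https://example.com/cat.jpeg'))  # prints: (False, False, True)
```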

Correctly parsing string literals with python's re module

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.
For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.
The regex is: (?<!\\)\"[^\"]+\"
Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.
To break it down:
(?<\\)\" # The group should begin with a quotation mark that is not escaped
[^\"]+ # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\" # and end at the first quotation mark it finds
That being said, I (obviously) start hitting problems with things like this:
"This is a string with an \"escaped quote\" inside it"
I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:
([^\"]|\\\")+ # a group of anything but a quote or an escaped quote
, but that led to very strange results.
I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.
Thanks for any assistance.
EDIT
As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using unutbu's pattern below):
import re

STRLIT = r'''(?x)  # verbose mode
    (?<!\\)  # not preceded by a backslash
    "        # a literal double-quote
    .*?      # non-greedy 0-or-more characters
    (?<!\\)  # not preceded by a backslash
    "        # a literal double-quote
    '''
# Read the whole script; the with block closes the file afterwards.
with open("jquery.min.js", "r") as f:
    jq = f.read()
literals = re.findall(STRLIT, jq)
The answer below fixes almost all issues. The ones that do arise are within jquery's own regular expressions, which is a very edge case. The solution no longer misidentifies valid javascript as markdown links, which was really the goal.
I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.
r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
Which does the following:
(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as few of those as it needs to.
And \1 is just the closing quote.
Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's any even number of backslashes before the closing quote.
If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).
Perhaps use two negative look behinds:
import re

text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r) :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''
pattern = r'''(?x)  # verbose mode
    (?<!\\)  # not preceded by a backslash
    "        # a literal double-quote
    .*?      # 0-or-more characters, as few as possible
    (?<!\\)  # not preceded by a backslash
    "        # a literal double-quote
    '''
for match in re.findall(pattern, text):
    print(match)
yields
"This is a string with an \"escaped quote\" inside it"
""
"data"
The question mark in .*? makes the pattern non-greedy. The non-greediness causes the match to end at the first unescaped double quotation mark it encounters.
In Python, a correct regex for matching a double-quoted string is:
pattern = r'"(\\.|[^"\\])*"'
It describes a string that starts and ends with ". Each character between the two double quotes is either an escaped character (a backslash followed by any character) OR any character except " and \.
unutbu's answer is wrong because the valid string "\\\\" cannot be matched by that pattern.
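A short sketch of the "escaped character OR ordinary character" approach in use. It assumes the form with a doubled backslash (\\. for a backslash followed by any character) and a non-capturing group so that findall returns whole literals; the sample source line is made up.

```python
import re

# Opening quote, then any run of escape sequences or plain characters, then closing quote.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

# Hypothetical snippet with escaped quotes, escaped backslashes, and an empty string.
source = r'x = "say \"hi\""; y = "\\\\"; z = ""'

for literal in STRING_LITERAL.findall(source):
    print(literal)
# prints:
# "say \"hi\""
# "\\\\"
# ""
```

Because each backslash is consumed together with the character after it, an even or odd number of backslashes before a quote is handled without any lookbehind.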

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.
VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and let you use backslashes without escaping them.
Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
When creating a regex to match Unicode characters, you do not want to use a Python unicode string. With u"\u0020" the pattern contains an actual space character, which VERBOSE then ignores. The regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020" (the re module parses \uXXXX escapes itself from Python 3.3 onward).
As other answers have mentioned, you should also use the r prefix for time_sig_pattern; after those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ], as described in the re module documentation.
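Putting those points together, a minimal sketch of the pattern from the question with a raw string and an escaped space, next to the same pattern with a plain space. The names WS, escaped, unescaped, and the sample line are assumptions made for this illustration.

```python
import re

WS = r"\s"  # raw string, as the answers recommend

# Same pattern as in the question, but raw and with the space written as "\ ".
escaped = (
    r"""^%(ws)s*time\ signature:%(ws)s*(?P<top>\d+)%(ws)s*/%(ws)s*(?P<bottom>\d+)%(ws)s*$"""
    % {"ws": WS}
)
# The variant with a plain space, which VERBOSE silently drops from the pattern.
unescaped = escaped.replace(r"\ ", " ")

FLAGS = re.U | re.M | re.X
line = "time signature: 3 / 4"

good = re.compile(escaped, FLAGS).match(line)
bad = re.compile(unescaped, FLAGS).match(line)

print(good.group("top"), good.group("bottom"))  # prints: 3 4
print(bad)                                      # prints: None
```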

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re

s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
# Backslash the literal parens and dots; the inner group captures the text we want.
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parentheses is quite a bit of a pain in regex. If that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until you see a close paren, then make sure there is another close paren somewhere after it.
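A small sketch of that pattern applied to the sample URL. The helper name inner_paren_text is hypothetical, and returning None when nothing matches is a choice made here, not something the answer specifies.

```python
import re

# Outer "(", anything up to the inner "(", capture up to the first ")", then one more ")".
NESTED = re.compile(r'\(.*?\((.*?)\).*?\)')


def inner_paren_text(url):
    """Return the text inside the inner parentheses, or None if the shape is absent."""
    mo = NESTED.search(url)
    return mo.group(1) if mo else None


print(inner_paren_text("http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"))  # prints: ThinkCode
```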
import re

mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
# Replace the whole URL with just the captured word inside the innermost parentheses.
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))
