Escape Characters in Regex sub of Markdown Links to HTML Links - python

I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would seem best not to depend on the browser to insert the quotes, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
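filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick='requestPage(\"\2\");'>\1</a>", pageContent)
results in
<a href='#' onclick='requestPage(\"Boards/boardManagement.md\");'>Board Management</a>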
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?

Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
       r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
       pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
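To check the triple-quoted approach end to end, here is a minimal sketch. The sample input is the link from the question; the variable names and the non-greedy .+? quantifiers are my own assumptions (non-greedy matching keeps two links on one line from being merged) and are not part of the original code.

```python
import re

# Sample input taken from the question.
page_content = "[Board Management](Boards/boardManagement.md)"

# Raw string for the pattern; .+? is a non-greedy variant (an assumption).
link_pattern = r'\[(.+?)\]\((.+?)\)'

# Triple-quoted raw string: both ' and " can appear unescaped.
replacement = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

filtered_page = re.sub(link_pattern, replacement, page_content)
print(filtered_page)
```

The printed line should be the link from the top of the question, with double quotes around the argument of requestPage and no stray backslashes.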
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
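The rule quoted above is easy to confirm in an interpreter. A small sketch (the variable names are illustrative):

```python
# Escaping the quote in a raw literal keeps the backslash in the string.
escaped = r"\""
print(len(escaped))        # 2: a backslash and a double quote

# A triple-quoted raw literal needs no escaping for a lone double quote.
unescaped = r"""a "quoted" word"""
print("\\" in unescaped)   # False: no backslash was introduced
```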
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.

Related

python 3: quoting result of random string generation

I'm new to python and things do not always work as I expect... but I am learning, slowly. Here is a case in point. If I randomly create a string via:
thing = ''.join([
random.SystemRandom().choice(
"{}{}{}".format(
string.ascii_letters, string.digits, string.punctuation
)
) for i in range(63)
])
then I could end up with a string with single quotes as well as backslashes. I assume that I should then go through the string and quote the possibly problematic characters. So, for example: if I generate the (short) string:
cs]b77e\IM>&4/,u.s_jr"xmMdHD7a'wrEw(
my instinct tells me that I should quote that into:
cs]b77e\\IM>&4/,u.s_jr"xmMdHD7a\'wrEw(
It looks like the string.replace() method is my friend...
thing = ''.join([
random.SystemRandom().choice(
"{}{}{}".format(
string.ascii_letters, string.digits, string.punctuation
)
) for i in range(63)
]).replace('\\', '\\').replace('\'', '\'')
but is there a better way?
Also, in the replace() methods the meaning of the single quoted strings seems to change depending on context. Coming from Perl this seems strange to me. My initial attempts had me doing things like replace('\\', '\\\\') thinking that I had to quote the characters going into the replacement string. Is this normal or am I missing something else?
Edit
My goal here is to end up with 63 characters in a string. I don't really think that I have to quote any generated single quotes but my thought is that if I later use the string and it has generated backslashes then the next character after the backslash would act like it was quoted, right? I mean:
len('1234')
yields 4 but
len('12\4')
yields 3 so I need to post-process the generated string to at least quote the backslashes, right? Is there a better way to quote problematic characters than a chain of replaces() methods?
A string can contain any valid characters; the quotes and backslashes are only useful or special when representing a string in Python code. So you don't normally need to do anything like this when you already have a string which contains the characters you want.
If you want a representation which can be parsed by Python (e.g. by writing it to a .py file), repr() does that.
You don't have to escape characters unless they are part of code you are writing or come from user input. If the backslash character or a quote character is generated by a Python program, then it is already stored as that character in memory. There is no need to do any additional escaping.
Why? Because Python is not interpreting a string literal, it is simply generating characters, which are stored as numbers in memory. When you ask Python to display a string containing one of the characters such as a single quote or a backslash, it will automatically escape them.
Here is an example. A double quote is 34, single quote is character 39, and backslash is 92.
'a'+chr(34)+'b'+chr(39)+'c'+chr(92)+'d'
# returns:
'a"b\'c\\d'
Because I included both a double quote and a single quote, Python surrounds the string with single quotes and shows an unescaped double quote, an escaped single quote, and an escaped backslash.
So there is no need to escape characters that are generated within a Python program; Python does it for you when it displays them.
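To make this concrete, here is a sketch of the generation step with no post-processing. It uses the same random.SystemRandom source as the question; the names alphabet and rng are illustrative. The round trip through repr() and ast.literal_eval() shows that Python adds whatever escaping the literal form needs by itself.

```python
import ast
import random
import string

alphabet = string.ascii_letters + string.digits + string.punctuation
rng = random.SystemRandom()

# 63 random characters; quotes and backslashes are stored as plain characters.
thing = ''.join(rng.choice(alphabet) for _ in range(63))
print(len(thing))  # 63

# repr() produces a literal with any needed escapes; parsing that literal
# gives the original string back unchanged.
round_tripped = ast.literal_eval(repr(thing))
print(round_tripped == thing)  # True
```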

Defining file paths in python with EOL string literal errors [duplicate]

Technically, any odd number of backslashes, as described in the documentation.
>>> r'\'
File "<stdin>", line 1
r'\'
^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
File "<stdin>", line 1
r'\\\'
^
SyntaxError: EOL while scanning string literal
It seems like the parser could just treat backslashes in raw strings as regular characters (isn't that what raw strings are all about?), but I'm probably missing something obvious.
The common misconception about Python's raw strings is that a backslash within a raw string is just a regular character like any other. It is NOT. The key to understanding this is the following passage from the Python tutorial:
When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string
So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are 2 characters (the backslash and the character following it).
This way:
r'abc\d' comprises a, b, c, \, d
r'abc\'d' comprises a, b, c, \, ', d
r'abc\'' comprises a, b, c, \, '
and:
r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.
The last case shows that, according to the documentation, the parser cannot find a closing quote: the last quote you see above is part of the string. In other words, a backslash cannot be the last character here because it would 'devour' the closing quote.
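The four cases can be checked directly. In this sketch the invalid literal is handed to compile() as source text, so that the script itself still parses; the variable names are illustrative.

```python
print(list(r'abc\d'))    # ['a', 'b', 'c', '\\', 'd']
print(list(r'abc\'d'))   # ['a', 'b', 'c', '\\', "'", 'd']
print(list(r'abc\''))    # ['a', 'b', 'c', '\\', "'"]

# Build the source text  r'abc\'  which has no closing quote.
bad_source = "r'abc" + chr(92) + "'"
try:
    compile(bad_source, "<example>", "eval")
    compiled = True
except SyntaxError:
    compiled = False
print(compiled)  # False
```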
The reason is explained in the part of that section which I highlighted in bold:
String quotes can be escaped with a
backslash, but the backslash remains
in the string; for example, r"\"" is a
valid string literal consisting of two
characters: a backslash and a double
quote; r"\" is not a valid string
literal (even a raw string cannot end
in an odd number of backslashes).
Specifically, a raw string cannot end
in a single backslash (since the
backslash would escape the following
quote character). Note also that a
single backslash followed by a newline
is interpreted as those two characters
as part of the string, not as a line
continuation.
So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
That's the way it is! I see it as one of those small defects in python!
I don't think there's a good reason for it, but it's definitely not parsing; it's really easy to parse raw strings with \ as a last character.
The catch is, if you allow \ to be the last character in a raw string then you won't be able to put " inside a raw string. It seems python went with allowing " instead of allowing \ as the last character.
However, this shouldn't cause any trouble.
If you're worried about not being able to easily write Windows folder paths such as c:\mypath\ then worry not: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, which is not the right way to do it anyway. Use os.path.join:
>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
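Note that os.path.join uses the separator of the platform it runs on, so the output above only appears on Windows. As a sketch that behaves the same everywhere, the standard-library ntpath module (the Windows implementation of os.path) and pathlib.PureWindowsPath can be used; the path values are the ones from the example above.

```python
import ntpath
from pathlib import PureWindowsPath

joined = ntpath.join(r"C:\mypath", "subfolder")
print(joined)  # prints C:\mypath\subfolder

# An empty last component yields a trailing separator, so no raw
# literal ending in a backslash is needed.
with_trailing = ntpath.join(r"C:\mypath", "")
print(with_trailing)  # prints C:\mypath followed by one backslash

as_path = PureWindowsPath(r"C:\mypath") / "subfolder"
print(as_path)  # prints C:\mypath\subfolder
```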
In order for you to end a raw string with a slash I suggest you can use this trick:
>>> print(r"c:\test"'\\')
c:\test\
It uses the implicit concatenation of string literals in Python and concatenates one string delimited with double quotes with another that is delimited by single quotes. Ugly, but works.
Another trick is to use chr(92) as it evaluates to "\".
I recently had to clean a string of backslashes and the following did the trick:
CleanString = DirtyString.replace(chr(92),'')
I realize that this does not take care of the "why" but the thread attracts many people looking for a solution to an immediate problem.
Since \" is allowed inside a raw string, the first " encountered can't be used to identify the end of the string literal.
Why not stop parsing the string literal when you encounter the first "?
If that was the case, then \" wouldn't be allowed inside the string literal. But it is.
The reason r'\' is syntactically incorrect is that, although the string expression is raw, the quote character used as the delimiter (single or double) always has to be escaped inside the string, since it would otherwise mark the end of the literal. So if you want to express a single quote inside a single-quoted raw string, there is no other way than using \'. The same applies to double quotes.
But you could use:
'\\'
Another user who has since deleted their answer (not sure if they'd like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).
I thought it was an interesting idea and am including it as community wiki for posterity.
Naive raw strings
The naive idea of a raw string is
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
and it will mean itself.
Unfortunately, this does not work, because if the whatever
happens to contain a quote, the raw string would end at that point.
It is simply impossible that I can put "whatever I want"
between fixed delimiters, because some of it could look like
the terminating delimiter -- no matter what that delimiter is.
Real-world raw strings (variant 1)
One possible approach to this problem would be to say
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
This restriction sounds harsh, until one recognizes that
Python's large offering of quotes can accommodate most situations
with this rule. The following are all valid Python quotes:
'
"
'''
"""
With this many possibilities for the delimiter, almost anything
can be made to work.
About the only exception would be if the string
literal is supposed to contain a complete list of all allowed
Python quotes.
Real-world raw strings (variant 2, as in Python)
Python, however, takes a different route using
an extended version of the above rule.
It effectively states
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
If I insist on including a quote, even that is allowed,
but I have to put a backslash before it.
So the Python approach is, in a sense, even more liberal
than variant 1 above -- but it has the side effect of
"mis"interpreting the closing quote as part of the string
if the last intended character of the string is a backslash.
Variant 2 is not helpful:
If I want the quote in my string,
but not the backslash, the allowed version of my string literal
will not be what I need.
However, given the three different other kinds of quotes I have
at my disposal, I will probably just pick one of those and my
problem will be solved -- so this is not a problematic case.
The problematic case is this one:
If I want my string to end with a backslash, I am at a loss.
I need to resort to concatenating a non-raw string literal
containing the backslash.
Conclusion
After writing this, I agree with several of the other posters
that variant 1 would have been easier to understand and to accept
and therefore more pythonic. That's life!
Coming from C, it is pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.
That does indeed disallow \ as last character since it will escape the " and make the parser choke. But as pointed out earlier \ is legal.
some tips :
1) if you need to manipulate backslash for path then standard python module os.path is your friend. for example :
os.path.normpath('c:/folder1/')
2) if you want to build strings with backslash in it BUT without backslash at the END of your string then raw string is your friend (use 'r' prefix before your literal string). for example :
r'\one \two \three'
3) if you need to prefix a string in a variable X with a backslash then you can do this :
X='dummy'
bs=r'\ ' # don't forget the space after backslash or you will get EOL error
X2=bs[0]+X # X2 now contains \dummy
4) if you need to create a string with a backslash at the end then combine tip 2 and 3 :
voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name
now lilypond_statement contains "\DisplayLilyMusic \upper"
long live python ! :)
n3on
Despite its role, even a raw string cannot end in a single
backslash, because the backslash escapes the following quote
character—you still must escape the surrounding quote character to
embed it in the string. That is, r"...\" is not a valid string
literal—a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use
two and slice off the second.
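The workarounds mentioned in this thread (implicit concatenation, slicing, and chr(92)) all build the same string. A short sketch, with an illustrative path:

```python
# Implicit concatenation of a raw literal with a regular literal.
concatenated = r"C:\some\path" "\\"

# Write two trailing backslashes and slice off the second.
sliced = r"C:\some\path\\"[:-1]

# Append the backslash by its code point.
by_code = r"C:\some\path" + chr(92)

print(concatenated == sliced == by_code)  # True
print(concatenated.endswith(chr(92)))     # True
```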
Given the confusion around the arbitrary-seeming restriction against an odd number of backslashes at the end of a Python raw-string, it's fair to say that this is a design mistake or legacy issue originating in a desire to have a simpler parser.
While workarounds (such as r'C:\some\path' '\\' yielding 'C:\\some\\path\\' (in Python notation) or C:\some\path\ (verbatim)) are simple, it's counterintuitive to be needing them. For comparison, let's have a look at C++ and Perl.
In C++, we can straightforwardly use raw string literal syntax
#include <iostream>
int main() {
std::cout << R"(Hello World!)" << std::endl;
std::cout << R"(Hello World!\)" << std::endl;
std::cout << R"(Hello World!\\)" << std::endl;
std::cout << R"(Hello World!\\\)" << std::endl;
}
to get the following output:
Hello World!
Hello World!\
Hello World!\\
Hello World!\\\
If we want to use the closing delimiter (above: )) within the string literal, we can even extend the syntax in an ad-hoc way to R"delimiterString(quotedMaterial)delimiterString". For example, R"asdf(some random delimiters: ( } [ ] { ) < > just for fun)asdf" produces the string some random delimiters: ( } [ ] { ) < > just for fun in the output. (Ain't that a good use of "asdf"!)
In Perl, this code
my $str = q{This is a test.\\};
print ($str);
print ("This is another test.\n");
will output the following: This is a test.\This is another test.
Replacing the first line by
my $str = q{This is a test.\};
would lead to an error message: Can't find string terminator "}" anywhere before EOF at main.pl line 1.
However, Perl treating a pre-delimiter \ as an escape character doesn't prevent the user from having an odd number of backslashes at the end of the resulting string; eg to place 3 backslashes \\\ into the end of $str, simply end the code with 6 backslashes: my $str = q{This is a test.\\\\\\};. Importantly, while we need to double the backslashes in the input, there is no Python-like inconsistent-seeming syntactic restriction.
Another way of looking at things is that these 3 languages use different ways to address the parsing issue of interaction between escape characters and closing delimiters:
Python: disallows an odd number of backslashes just before the closing delimiter; a simple workaround is r'stringWithoutFinalBackslash' '\\'
C++: allows essentially¹ everything between the delimiters
Perl: allows essentially² everything between the delimiters, but backslashes need to be consistently doubled
¹ The custom delimiterString itself cannot be more than 16 characters long, but that's hardly a limitation.
² If you need the delimiter itself, just escape it with \.
However, to be fair in a comparison to Python, we need to acknowledge that (1) C++ didn't have such string literals until C++11 and is famously hard to parse and (2) Perl is even harder to parse.
I encountered this problem and found a partial solution which is good for some cases. Although a Python raw string literal cannot end with a single backslash, a string that ends with one can still be written to a text file, where it appears with a single backslash at the end. So if what you need is to save text ending in a single backslash on your computer, it is possible:
x = 'a string\\'
x
'a string\\'
# Now save it in a text file and it will appear with a single backslash:
with open("my_file.txt", 'w') as h:
h.write(x)
BTW it is not working with json if you dump it using python's json library.
Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (not very helpful for most needs, but maybe for some).
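The doubled backslash seen in the interpreter is only the repr() of the string; the string itself holds one backslash. Here is a sketch that verifies this by reading the file back, using a temporary directory so nothing is left behind (the file name comes from the example above). It also shows that JSON text escapes the backslash, which may be the effect noted above, while json.loads restores the original.

```python
import json
import os
import tempfile

x = 'a string\\'
print(len(x))  # 9: eight characters plus one backslash

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_file.txt")
    with open(path, "w") as h:
        h.write(x)
    with open(path) as h:
        content = h.read()

print(content == x)  # True

# JSON escapes the backslash in its text form, but decoding undoes that.
encoded = json.dumps(x)
print(len(encoded))              # 12
print(json.loads(encoded) == x)  # True
```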

Slash replacement inside a raw string

Just a simple question concerning raw string, regex pattern and replacement:
I have a string variable defined as follow:
> print repr(foo)
'\n\t\t\n\t\tIf (GUTIAttach>=1) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\tUECapInfo;//Mps("( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 )");
My problem is the characters "(" and ")": I want to replace them with "\(" and "\)" inside the raw string because it will be used afterwards as a regular expression pattern.
I tried to use this method:
foo_tmp= [inc.replace(')', '\)') for inc in foo]
foo_tmp= [inc.replace('(', '\(') for inc in foo_tmp]
foo = "".join(foo_tmp)
the result gives:
> print repr(foo)
'\n\t\t\n\t\tIf \\(GUTIAttach>=1\\) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\t{\n\t\t\tUECapInfo;//Mps\\("\\( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 \\)"\\);
Characters "(" and ")" have been replaced by "\\(" and "\\)" instead of "\(" and "\)".
That's a bit unexpected for me, so do you know how I can proceed to get just a single backslash without changing the other parts of the string?
Note: The method .decode('string_escape') is also not working due to the rest of string. Double slashes already present in the original raw string must not change.
Thanks a lot for your help
Use the re.escape() function to escape regular expression meta characters for you.
What you are seeing is otherwise perfectly normal Python behaviour; you are looking at a python literal representation; the output can be pasted back into a Python interpreter and recreate the value. As such, anything that could be interpreted as an escape code is escaped for you; a single \ would normally be doubled to prevent it being interpreted as the start of an escape sequence:
>>> '\('
'\\('
>>> print '\\('
\(
You can see this at work in other places in your foo string; the \n character combination represents a newline character, not two separate characters \ and n. If you wanted to include a literal \ and n in the text, you'd have to double the backslash to \\n. Further on into the value of foo you'll find \\", which is a single backslash followed by a " quote.
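A short sketch of both points: the replacement already inserts a single backslash (only the repr() doubles it), and re.escape() produces a pattern that matches the original text literally. The sample text is a shortened piece of the foo value from the question; the variable names are illustrative.

```python
import re

single = '('.replace('(', r'\(')
print(len(single))   # 2: one backslash and one parenthesis
print(repr(single))  # '\\(' (the literal form doubles the backslash)
print(single)        # \(

# re.escape handles every metacharacter, not only parentheses.
text = 'If (GUTIAttach>=1) //In case of GUTI attach'
pattern = re.escape(text)
print(re.fullmatch(pattern, text) is not None)  # True
```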

Correctly parsing string literals with python's re module

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.
For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.
The regex is: (?<!\\)\"[^\"]+\"
Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.
To break it down:
(?<!\\)\" # The group should begin with a quotation mark that is not escaped
[^\"]+ # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\" # and end at the first quotation mark it finds
That being said, I (obviously) start hitting problems with things like this:
"This is a string with an \"escaped quote\" inside it"
I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:
([^\"]|\\\")+ # a group of anything but a quote or an escaped quote
, but that led to very strange results.
I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.
Thanks for any assistance.
EDIT
As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using the unutbu's pattern below):
import re

STRLIT = r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # non-greedy 0-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
'''
# Read the whole script; the with-block closes the file automatically.
with open("jquery.min.js", "r") as f:
    jq = f.read()
literals = re.findall(STRLIT, jq)
The answer below fixes almost all issues. The ones that do arise are within jquery's own regular expressions, which is a very edge case. The solution no longer misidentifies valid javascript as markdown links, which was really the goal.
I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.
r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
Which does the following:
(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as many of those as possible.
And \1 is just the closing quote.
Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's any even number of backslashes before the closing quote.
If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).
Perhaps use two negative look behinds:
import re
text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r) :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''
for match in (re.findall(r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # non-greedy 0-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
''', text)):
print(match)
yields
"This is a string with an \"escaped quote\" inside it"
""
"data"
The question mark in .*? makes the pattern non-greedy. The non-greediness causes the pattern to stop at the first unescaped double quotation mark it encounters.
Using Python, a correct regex for matching a double-quoted string is:
pattern = r'"(\\.|[^"\\])*"'
It describes strings that start and end with ". Each character between the two double quotes is either an escaped character (a backslash followed by any character) OR any character except " and \.
unutbu's answer is wrong because the valid string "\\\\" cannot be matched by that pattern.
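A common way to express "an escaped character, or anything that is neither a quote nor a backslash" is sketched below. This is a minimal illustration, not a full JavaScript tokenizer: the name STRING_RE and the sample text are assumptions, and since . does not match a newline by default, an escaped line break inside a string is not handled.

```python
import re

# \\.       a backslash followed by any character (an escape pair)
# [^"\\]    any character that is neither a quote nor a backslash
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')

sample = r'''"This is a string with an \"escaped quote\" inside it" and "\\\\" and "" and "data"'''

matches = STRING_RE.findall(sample)
for match in matches:
    print(match)
print(len(matches))  # 4
```

Because backslashes are always consumed in pairs with the character that follows them, a closing quote preceded by an even number of backslashes is recognized correctly, which is the case the look-behind patterns miss.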

Need regular expression expert: round bracket within stringliteral

I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. However, I'm using regexp in Access VBA, which doesn't have that method. I only have the replace, execute and test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
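Applied to the pattern and text from the question, this could look like the following sketch (the variable names are illustrative):

```python
import re

pattern_text = " before the bracket ("
text = ("this text is before the bracket (and this text is inside) "
        "and this text is after the bracket")

# Without escaping, the unbalanced "(" makes the pattern invalid.
try:
    re.compile(pattern_text)
    compiles_unescaped = True
except re.error:
    compiles_unescaped = False
print(compiles_unescaped)  # False

# With re.escape, the text is matched literally.
match = re.search(re.escape(pattern_text), text)
print(match.group(0) == pattern_text)  # True
```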
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
Dim regEx As RegExp
Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("
