I'm trying to use lxml to process a string in an XML file. The problem is that some special characters (e.g. "\n" and "'") are not escaped in the output file.
xml.sax.saxutils.escape only escapes &, <, and > by default, but it does provide an entities parameter to additionally escape other strings. Does lxml provide the same flexibility in entities parameter for escape?
For XML:
from xml.sax.saxutils import escape
def xmlescape(data):
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })
Thank you so much!!
I'm not sure about lxml. But you can remove these special characters using replace.
Here is an example:
string = "I have a string with special value ' in between."
print(string.replace("'", ""))
output:
I have a string with special value in between.
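If the goal is to escape these characters rather than drop them, the entities mapping from the question can be extended. Here is a minimal sketch using only xml.sax.saxutils (not lxml); the name xml_escape and the choice of &#10; for a newline are assumptions made for illustration:

```python
from xml.sax.saxutils import escape

# Extra replacements applied after the default &, <, > escaping.
EXTRA_ENTITIES = {
    "'": "&apos;",
    '"': "&quot;",
    "\n": "&#10;",  # numeric character reference for a line feed (assumed choice)
}

def xml_escape(data):
    """Escape &, <, > plus the extra characters listed above."""
    return escape(data, entities=EXTRA_ENTITIES)

print(xml_escape("it's \"a\" & <b>\n"))
# it&apos;s &quot;a&quot; &amp; &lt;b&gt;&#10;
```

Because escape handles & before it applies the extra entities, the ampersands introduced by the replacements are not escaped a second time.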
Related
I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would be better not to depend on the browser to insert the quotes, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick='requestPage(\"\2\");'>\1</a>", pageContent)
results in
<a href='#' onclick='requestPage(\"Boards/boardManagement.md\");'>Board Management</a>
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?
Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
       r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
       pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
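For reference, string literal concatenation for case (a) might look like the sketch below. The variable name and the text are made up for illustration; adjacent literals are simply joined by the parser into one string:

```python
# Two adjacent raw literals; each one uses the delimiter the other needs to contain.
both_kinds = (
    r"""triple single: ''' """
    r'''triple double: """'''
)

print(both_kinds)
# triple single: ''' triple double: """
```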
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.
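Putting the pieces together, a runnable sketch could look like this. The names LINK_PATTERN, REPLACEMENT and convert_links are hypothetical, and the non-greedy (.+?) groups are my own change so that two links on one line are not merged into a single match; the original used greedy (.+):

```python
import re

# Raw literal for the pattern: every backslash reaches the regex engine as written.
LINK_PATTERN = re.compile(r'\[(.+?)\]\((.+?)\)')

# Triple-quoted raw literal: both quote styles appear unescaped,
# and \1 / \2 remain group references for re.sub.
REPLACEMENT = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

def convert_links(markdown_text):
    """Replace [text](target) links with onclick anchors."""
    return LINK_PATTERN.sub(REPLACEMENT, markdown_text)

page_content = "[Board Management](Boards/boardManagement.md)"
print(convert_links(page_content))
# <a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
```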
With Python 3.x, I need to replace escaped double quotes in some text with a custom pattern, leaving non-escaped double quotes as they are. So I wrote this trivial code:
text = 'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\"', '~', text))
And expect to see:
These are "quotes", and these are ~escaped quotes~
But instead of above, I get:
These are ~quotes~, and these are ~escaped quotes~
So, what's the correct pattern to replace escaped quotes only?
The background of this issue is an attempt to read an 'invalid' JSON file that contains a JavaScript function, stored with its line feeds as is but with escaped quotes. If there is an easier way to parse JSON with newline characters in key values, I'd appreciate a hint on that.
First, you need to use a raw string to assign text, so that the backslashes will be kept literally (or you can escape the backslashes).
text = r'These are "quotes", and these are \"escaped quotes\"'
Second, you need to escape the backslash in the regexp so that it will be treated literally by the regexp engine.
print(re.sub(r'\\"', '~', text))
Using a raw string might help.
import re
text = r'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\\"', '~', text))
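To see why the original attempt replaced every quote, it may help to compare the two literals directly. This is a small sketch; the variable names are mine:

```python
import re

plain = 'These are "quotes", and these are \"escaped quotes\"'   # \" collapses to a bare "
raw = r'These are "quotes", and these are \"escaped quotes\"'    # backslashes are kept

print(plain.count("\\"))  # 0
print(raw.count("\\"))    # 2

# The regex \\" matches a literal backslash followed by a double quote.
result = re.sub(r'\\"', '~', raw)
print(result)
# These are "quotes", and these are ~escaped quotes~
```

In the non-raw literal there is no backslash left to match, so the string only contains plain quotes, and the pattern r'\"' (which the regex engine reads as just a quote) hits all of them.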
I need to handle backslash and tilde while using pyparsing in my piece of code and to keep it simple I used printables but this code raises an exception:
import string
import pyparsing as pp
from pyparsing import *
log_display = ("[pts\0]")
log_display1 = ("[~~ ]")
ut_user = "[" + Word(printables) + "]"
log = ut_user
data = log.parseString(log_display)
print(data.dump())
Thanks for any help!
"[pts\0]" does not have a backslash in it. It has a null character. If you wanted a string with a backslash, r"[pts\0]" would produce one. When reading input, this will generally not be a problem. String literal escape processing is only applied to string literals, not to user input.
The problem with "[~~ ]" has nothing to do with the tilde. The tilde is fine. The problem is the space, which doesn't count as a printable by the standards of pyparsing.printables. pyparsing.printables is a string containing all ASCII, printable, non-whitespace characters. The proper way to deal with this depends on what characters you actually want to allow.
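The first point can be checked with plain Python, without pyparsing. A short sketch; the variable names are only for illustration:

```python
escaped = "[pts\0]"   # \0 is the escape sequence for the null character
raw = r"[pts\0]"      # a raw literal keeps the backslash and the zero

print(len(escaped))        # 6
print(len(raw))            # 7
print("\\" in escaped)     # False
print("\x00" in escaped)   # True
print("\\" in raw)         # True
```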
I have a text file and I want to replace the following pattern:
\"
with:
"
The initial version of what I'm looking at looks like:
{"latestProfileVersion":51,
"scannerAccess":true,
"productRatings":"[{\"7H65018000\":{\"reviewCount\":0,\"avgRating\":0}}
So someone embedded a JSON string inside a JSON response.
This is what I have currently:
rawAuthResponseTextFile = open(rawAuthResponseFilename,'r')
formattedAuthResponse = open('formattedAuthResponse.txt', 'w')
try:
    stringVersionOfAuthResponse = rawAuthResponseTextFile.read().replace('\n','')
    cleanedStringVersionOfAuthResponse = re.sub(r'\"', '"', stringVersionOfAuthResponse)
    jsonVersionOfAuthResponse = json.dumps(cleanedStringVersionOfAuthResponse)
    formattedAuthResponse.write(jsonVersionOfAuthResponse)
finally:
    rawAuthResponseTextFile.close()
    formattedAuthResponse.close()
Using http://pythex.org/ I have found that r'\"' should match only \", but this is not the case when I look at the output which appears to be adding additional escape characters.
I know I am doing something wrong because I cannot get the quotes around the embedded string to look like the quotes in the regular JSON no matter how much I tweak it, escape characters or no.
You need to use this regex:
\\"
In a regex, the backslash itself has to be escaped with another backslash so that it is matched literally.
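As a sketch of how this plays out, here is the regex next to an alternative that parses the embedded string instead of rewriting it. The sample below is a hypothetical, shortened response shaped like the one in the question, not the real data. Note also that json.dumps applied to a str serializes it as a JSON string literal and escapes every quote again, which likely explains the extra escape characters in the output:

```python
import json
import re

# Hypothetical, shortened sample shaped like the response in the question.
raw_response = r'{"scannerAccess":true,"productRatings":"[{\"7H65018000\":{\"reviewCount\":0,\"avgRating\":0}}]"}'

# Regex route: r'\\"' is the regex \\" (a literal backslash followed by a quote).
unescaped = re.sub(r'\\"', '"', raw_response)
print(unescaped)

# Parsing route: decode the outer document, then decode the embedded string.
outer = json.loads(raw_response)
ratings = json.loads(outer["productRatings"])
print(ratings[0]["7H65018000"]["avgRating"])  # 0
```

The regex route removes the backslashes, but the embedded array is still wrapped in quotes, so the result is no longer valid JSON. Decoding twice avoids that.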
I'm trying to figure out how to remove \r's, \n's and "\ from the JSON returned by a URL, but every time I try, the output gets cut off. The text contains:
\r\n\r\n
\n\n
\n
\r
"\wordhere"\
If you can help me, I would appreciate it.
Use strict=False when loading; see the Python json docs:
>>> s
'\n{\n\r\n\r\n\n\n\n\n\n\n\r\n"wordhere": 0}\n'
>>> json.loads(s, strict=False)
{u'wordhere': 0}
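A Python 3 sketch of the same idea is below. The sample text is made up; the point is that strict=False matters when the control characters sit inside a JSON string value, since whitespace between tokens is accepted either way:

```python
import json

# A real newline character inside the JSON string value (not the two characters \ and n).
text = '{"wordhere": "line one\nline two"}'

try:
    json.loads(text)  # the default strict=True rejects control characters in strings
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

data = json.loads(text, strict=False)
print(strict_ok)         # False
print(data["wordhere"])  # prints the value across two lines
```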
You don't need regex for this.
You could use the replace method from string class.
string = 'abc\r\n\r\n\\\\'
string = string.replace('\r', '')
string = string.replace('\n', '')
string = string.replace('\\', '')
But if you really want to use regex, a possible approach would be:
string = re.sub('\\r*\\n*\\\\*', '', string)
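When matching special characters, they need to be escaped with a backslash. In a regular (non-raw) string literal every backslash also has to be doubled for Python, so matching a single literal backslash takes four backslashes: two for the regex engine, each doubled for the string literal.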
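With a raw string literal and a character class, the same removal can be written with fewer backslashes. A minimal sketch; the function name is hypothetical:

```python
import re

def strip_cr_lf_backslash(text):
    """Remove carriage returns, line feeds and backslashes."""
    # In a raw literal, \r and \n are interpreted by the regex engine,
    # and \\ is a single literal backslash.
    return re.sub(r'[\r\n\\]', '', text)

print(strip_cr_lf_backslash('abc\r\n\r\n\\\\'))  # abc
```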