Python XPath: single quotation marks and double quotation marks

I hope this is a small problem:
I want to search for a text that can contain double quotes " and/or single quotes '. Now I can use this:
"//a[contains(text(), '"+ mytext +"')]"
or this:
'//a[contains(text(), "'+ mytext +'")]'
but if mytext is something like this:
a'b"c
I get (of course) an XPath error (Invalid XPath-Expression). How can I avoid it?

In XPath 2.0 the delimiting quotes of a string literal can be included in the literal by doubling: "He said, ""I can't""".
In XPath 1.0, such a string can't be written as a literal, but it can be expressed using concat(): concat("He said, ", '"', "I can't", '"')
Of course, if such an XPath expression is then to be written as a string literal in a host language such as Python, the quotes must be further escaped according to Python rules.
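
The concat() approach can be automated. Below is a minimal sketch, assuming an XPath 1.0 engine; the helper name xpath_literal is hypothetical and not part of any library. It splits the text on apostrophes and rejoins the pieces with a double-quoted apostrophe, so neither quote style ever appears inside its own delimiters.

```python
def xpath_literal(text):
    """Return an XPath 1.0 expression that evaluates to `text`.

    Hypothetical helper for illustration, not a library function.
    """
    if "'" not in text:
        return "'" + text + "'"      # safe inside single quotes
    if '"' not in text:
        return '"' + text + '"'      # safe inside double quotes
    # Both quote types present: split on ' and glue the pieces back
    # together with a double-quoted apostrophe via concat().
    pieces = ["'" + part + "'" for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


mytext = "a'b\"c"
expr = "//a[contains(text(), " + xpath_literal(mytext) + ")]"
print(expr)  # //a[contains(text(), concat('a', "'", 'b"c'))]
```

The resulting string can then be passed to whatever XPath evaluator is in use. Engines that accept XPath variables offer an alternative that avoids building literals altogether.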

Use Python triple quotes r"""a'b"c""". Inside the xpath, use an escape .

Related

Why use an escape sequence instead of a different quote type?

Why would we want to use escape sequence characters like for example in this Python code:
print('It\'s alright.')
Why are we using this backslash to print a single quote when we can accomplish the same by using:
print("it's alright")

This is useful because you can do:
txt = 'in python you can have \'string\' or "string"'
print(txt)

No matter how many different kinds of quote you have, you may still need an escape mechanism now and then. Consider this:
If you want to use Python's "multiline string literal" you have to begin it and end it with a triple quote, which can be either """ or '''.
To put that into a string literal you are going to have to quote ' or ":
a = 'If you want to use Python\'s "multiline string literal" you have to begin it and end it with a triple quote, which can be either """ or \'\'\'.'
a = "If you want to use Python's \"multiline string literal\" you have to begin it and end it with a triple quote, which can be either \"\"\" or '''."
a = """If you want to use Python's "multiline string literal" you have to begin it and end it with a triple quote, which can be either ""\" or '''."""
Having different quote types is a great programming convenience, making it easier and less error prone to put quotes and apostrophes in the data without having to jump through hoops. But it can't cover every case. If you need to convince yourself of this, experiment with those three lines at a command prompt and see if you can come up with a way to avoid backslashes. You will find you always need at least one.

Without further context, I can only guess that whoever wrote the first example wasn't aware that it's possible to use double quotes "" for string literals in Python.

That's just a matter of style. Some people like to use single quotes to create string literals, and therefore they have to escape any single quotes that appear inside their strings (the same goes for double quotes). Each of the following lines raises a SyntaxError:
s = 'It's gonna be alright!'
s = "They used to call me "Big" but I was 4ft!"
So you may ask why they don't use " when their string has single quotes and ' when their string has double quotes. They can, but there are some unavoidable situations, such as this regex, which is itself a SyntaxError as written:
regexp = r"["']\w+["']"
Note that neither single nor double quotes alone can delimit this string, since both are present in the regex. Therefore, at least one of them has to be escaped.
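
As a sketch of how that regex could be written so that it compiles (the variable names below are made up for illustration): either escape the double quote, which is harmless here because the regex engine treats \" as a plain ", or switch to a triple-quoted raw string.

```python
import re

# Option 1: escape the double quote. In a raw string the backslash stays
# in the pattern, but the regex engine reads \" as a literal ".
quoted_word = re.compile(r"[\"']\w+[\"']")

# Option 2: a triple-quoted raw string needs no backslash before the quotes.
quoted_word_triple = re.compile(r"""["']\w+["']""")

sample = "say \"hello\" or 'bye'"
matches = quoted_word.findall(sample)
matches_triple = quoted_word_triple.findall(sample)
print(matches)  # ['"hello"', "'bye'"]
```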

In this case it's not needed, because you have used " " for the print statement.
case1) use: print(" It's alright.")
case2) use: print(' It\'s alright.')
Note the parentheses used for the print statements.
You can't use ' directly in case2, because Python would think that the string ends there, causing a SyntaxError.

In the code
txt = 'It\'s alright.'
you need the backslash (\) so Python understands that the second apostrophe is a character of the string. Without the backslash, Python would interpret it as the character used to mark the end of the string.

When you use a ' at the start, Python looks for a matching ' and considers whatever is present between these quotes as a string.
But if you use a ' in the middle of the string, Python considers that to be the end of the string. The ' at the end of the line then has no matching ', which results in a SyntaxError.
The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
Refer to the docs: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
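
A quick way to convince yourself that the escape only affects how the literal is written, not the resulting string (a small sketch; the variable names are arbitrary):

```python
escaped = 'It\'s alright.'   # apostrophe escaped inside single quotes
plain = "It's alright."      # same text, delimited by double quotes

# Both literals produce the identical string; the backslash is not stored.
same = escaped == plain
has_backslash = "\\" in escaped
print(same, has_backslash)  # True False
```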

How to add additional escape using lxml

I'm trying to use lxml to process a string in an XML file. The problem is that some special characters (e.g. "\n" and " ' ") are not escaped in the output file.
xml.sax.saxutils.escape only escapes &, <, and > by default, but it does provide an entities parameter to additionally escape other strings. Does lxml provide the same flexibility in entities parameter for escape?
For XML:
from xml.sax.saxutils import escape

def xmlescape(data):
    # Escape quotes in addition to the default &, <, and >.
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })
Thank you so much!!

I'm not sure about lxml. But you can remove these special characters using replace.
Here is an example:
string = "I have a string with special value ' in between."
print(string.replace("'", ""))
output:
I have a string with special value  in between.
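
If the goal is to keep the characters rather than drop them, the standard-library function mentioned in the question already does that. The sketch below only shows xml.sax.saxutils.escape with its entities parameter; whether lxml exposes an equivalent option is not something this example claims. The function name xml_escape is made up for illustration.

```python
from xml.sax.saxutils import escape

def xml_escape(data):
    # &, <, and > are always escaped; the entities mapping adds the quotes.
    return escape(data, entities={"'": "&apos;", '"': "&quot;"})

escaped_text = xml_escape("Tom & Jerry's \"show\" <live>")
print(escaped_text)
# Tom &amp; Jerry&apos;s &quot;show&quot; &lt;live&gt;
```

Because escape() replaces & before applying the extra entities, the ampersands in &apos; and &quot; are not escaped a second time.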

Escape Characters in Regex sub of Markdown Links to HTML Links

I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would seem best not to depend upon the browser to do the autoinsertion, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"\1", pageContent)
results in
Board Management
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?

Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
       r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
       pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.
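
To make the explanation concrete, here is a small runnable sketch using the link from the question. The replacement template is reconstructed from the desired output shown in the question, and the snake_case variable names are my own.

```python
import re

page_content = "[Board Management](Boards/boardManagement.md)"

# A backslash before a quote stays in a raw string: two characters, \ and ".
raw_escaped_quote = r"\""

# Triple-quoted raw string: both quote types appear without any backslash,
# while \1 and \2 still reach re.sub as group references.
link_pattern = r'\[(.+)\]\((.+)\)'
replacement = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

filtered_page = re.sub(link_pattern, replacement, page_content)
print(filtered_page)
# <a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
```

Note that (.+) is greedy, so a line containing several markdown links would be matched as one; that is outside the scope of the quoting problem discussed here.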

Python re.sub(): trying to replace escaped characters only

With Python 3.x, I need to replace escaped double quotes in some text with a custom pattern, leaving non-escaped double quotes as they are. So I wrote code as trivial as:
text = 'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\"', '~', text))
And expect to see:
These are "quotes", and these are ~escaped quotes~
But instead of above, I get:
These are ~quotes~, and these are ~escaped quotes~
So, what's the correct pattern to replace escaped quotes only?
The background of this issue is an attempt to read an 'invalid' JSON file that contains a JavaScript function, placed with its line feeds as is but with escaped quotes. If there is an easier way to parse JSON with newline characters in key values, I'd appreciate a hint on that.

First, you need to use a raw string to assign text, so that the backslashes will be kept literally (or you can escape the backslashes).
text = r'These are "quotes", and these are \"escaped quotes\"'
Second, you need to escape the backslash in the regexp so that it will be treated literally by the regexp engine.
print(re.sub(r'\\"', '~', text))

Using a raw string might help.
import re
text = r'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\\"', '~', text))
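
The difference between the two assignments is easy to check. In this sketch (variable names are illustrative), the non-raw literal from the question contains no backslashes at all, which is why the original pattern matched every quote:

```python
import re

plain = 'These are "quotes", and these are \"escaped quotes\"'
raw = r'These are "quotes", and these are \"escaped quotes\"'

# In the non-raw literal, \" is just another way to write ".
plain_has_backslash = "\\" in plain   # False
raw_has_backslash = "\\" in raw       # True

# The pattern \\" means: one literal backslash followed by a double quote.
result = re.sub(r'\\"', '~', raw)
print(result)  # These are "quotes", and these are ~escaped quotes~
```

When the text is read from a file rather than written as a literal, the backslashes are already present in the data, so only the pattern needs the raw string.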

Python XPath parsing tag with apostrophe

I'm new to XPath. I'm trying to parse a page using XPath. I need to get information from a tag, but an escaped apostrophe in the title screws up everything.
For parsing I use Grab.
tag from source:
<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>
Actual XPath:
g.xpath('.//tr/td/a[3]/img').get('title')
Returns
commission:Alfred\\
Is there any way to fix this?
Thanks

Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double-quotes; or 2) use &apos; to include a single quote.
From the XML spec:
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;".

As the provided "XML" isn't a well-formed document due to nested apostrophes, no XPath expression can be evaluated on it.
The provided non-well-formed text can be corrected to:
<img src="somelink"
border="0"
alt="commission:Alfred's misadventures"
title="commission:Alfred's misadventures"/>
In case there is a weird requirement not to use double quotes, then one correct conversion is:
<img src='somelink'
border='0'
alt='commission:Alfred&apos;s misadventures'
title='commission:Alfred&apos;s misadventures'/>
If you are provided with the incorrect input, in a language such as C# one can try to convert it to its correct counterpart using:
string correctXml = input.Replace("\\'s", "&apos;s");
Probably there is a similar way to do the same in Python.
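
A rough Python equivalent, as a sketch only: it assumes every backslash-apostrophe pair in the input is a mis-escaped apostrophe, and it uses a self-closed copy of the tag (the trailing /> is my addition) so that the standard-library XML parser accepts it. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The tag from the question, self-closed here so it parses as XML.
raw_tag = ("<img src='somelink' border='0' "
           "alt='commission:Alfred\\'s misadventures' "
           "title='commission:Alfred\\'s misadventures'/>")

# Assumption: \' never occurs for any other reason in the input.
fixed_tag = raw_tag.replace("\\'", "&apos;")

img = ET.fromstring(fixed_tag)
title = img.get("title")
print(title)  # commission:Alfred's misadventures
```

For real-world HTML pages the same replacement could be applied to the page source before handing it to the parser, but a blanket string replacement like this is fragile and worth testing against actual input.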
