Python XPath parsing tag with apostrophe - python

I'm new to XPath. I'm trying to parse a page using XPath. I need to get information from a tag, but an escaped apostrophe in the title attribute screws up everything.
For parsing I use Grab.
Tag from source:
<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>
Actual XPath:
g.xpath('.//tr/td/a[3]/img').get('title')
Returns
commission:Alfred\\
Is there any way to fix this?
Thanks

Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double-quotes; or 2) use &apos; to include a single quote.
From the XML spec:
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;".

As the provided "XML" isn't a well-formed document due to nested apostrophes, no XPath expression can be evaluated on it.
The provided non-well-formed text can be corrected to:
<img src="somelink"
border="0"
alt="commission:Alfred's misadventures"
title="commission:Alfred's misadventures"/>
In case there is a requirement not to use double quotes, then one correct conversion is:
<img src='somelink'
border='0'
alt='commission:Alfred&apos;s misadventures'
title='commission:Alfred&apos;s misadventures'/>
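If the markup is generated from Python, one way to avoid hand-written escapes is to let the standard library pick the delimiters. This is only a minimal sketch, assuming the attribute values are available as plain Python strings; the variable names are illustrative and not from the original post.

```python
from xml.sax.saxutils import quoteattr

title = "commission:Alfred's misadventures"

# quoteattr() returns the value already wrapped in quotes and chooses a
# delimiter that does not clash with the content where possible.
img_tag = "<img src={} title={}/>".format(quoteattr("somelink"), quoteattr(title))
print(img_tag)
```

Because the title contains an apostrophe but no double quote, the value ends up delimited by double quotes and needs no entity at all.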
If you are given the incorrect input, then in a language such as C# you can try to convert it to its correct counterpart using:
string correctXml = input.Replace("\\'s", "&apos;s");
Probably there is a similar way to do the same in Python.
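In Python the same repair can be sketched with str.replace. The example below is an assumption-laden illustration: it uses the standard library's xml.etree.ElementTree instead of Grab, the helper name fix_backslash_apostrophes is made up, and the tag is self-closed so that it parses as XML.

```python
import xml.etree.ElementTree as ET


def fix_backslash_apostrophes(markup: str) -> str:
    """Replace every backslash-escaped apostrophe with the XML entity."""
    return markup.replace("\\'", "&apos;")


# Raw string, so each \' below is a literal backslash followed by an apostrophe.
raw = (r"<img src='somelink' border='0' "
       r"alt='commission:Alfred\'s misadventures' "
       r"title='commission:Alfred\'s misadventures'/>")

fixed = fix_backslash_apostrophes(raw)
img = ET.fromstring(fixed)  # parses now that the markup is well-formed
title = img.get("title")
print(title)
```

A blanket replacement like this is only safe if a backslash followed by an apostrophe never appears legitimately elsewhere in the input.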

Related

Escape Characters in Regex sub of Markdown Links to HTML Links

I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would seem best not to depend upon the browser to do the auto-insertion, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
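filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick='requestPage(\"\2\");'>\1</a>", pageContent)
results in
<a href='#' onclick='requestPage(\"Boards/boardManagement.md\");'>Board Management</a>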
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?
Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
       r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
       pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.
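Putting the pieces together, here is a small runnable sketch of the substitution with a triple-quoted raw replacement string. The sample input is the link from the question; the snake_case variable names are just for illustration.

```python
import re

page_content = "[Board Management](Boards/boardManagement.md)"

# Raw literal for the pattern: every backslash reaches the regex engine as written.
pattern = r'\[(.+)\]\((.+)\)'

# Triple-quoted raw literal: both ' and " can appear unescaped inside it.
replacement = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

filtered_page = re.sub(pattern, replacement, page_content)
print(filtered_page)

# For comparison: escaping a quote inside a raw literal keeps the backslash.
two_chars = r"\""
print(len(two_chars))
```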

How to write XPath query that contains HTML entities?

I have this block of XML:
<bpmn:scriptTask id="UserTask_0qtrxsq" name="set variables app_from_user &amp; applist to &quot;ticketingsystem&quot;" scriptFormat="groovy">
... <bpmn:script> What should be matched is here ... </bpmn:script>
</bpmn:scriptTask>
in an XML file I'm trying to parse using Python and XPath. Below is the line that should match the script tag:
getLines = xml.xpath('//*[local-name()="scriptTask"][@name="%s"]/*[local-name()="script"]/text()' % script_name)
where script_name should be set variables app_from_user & applist to "ticketingsystem" in one of the iterations over all the existing scriptTask tags in the XML file.
It works fine for all other tags, but not for this one. When I removed the HTML entities (the placeholders for ampersands, quotes, etc.), it worked fine:
<bpmn:scriptTask id="UserTask_0qtrxsq" name="set variables app_from_user" scriptFormat="groovy">
... <bpmn:script> What should be matched is here ... </bpmn:script>
</bpmn:scriptTask>
But I don't have control over the XML files and I want the script to be as generic as possible. Is there a way I could make the XPath query to extract what's inside the script tag without errors?
You have a problem with your quotes. In XPath, nested quotes have to alternate between " and ' (or between &quot; and &apos; when the expression itself is embedded in XML). Because the value substituted for %s contains ", the quotes surrounding it have to be ' (or &apos;). So your XPath expression could look like this:
//*[local-name()='scriptTask'][@name='set variables app_from_user & applist to "ticketingsystem"']/*[local-name()='script']/text()
and therefore your whole expression could look like the following:
getLines = xml.xpath("//*[local-name()='scriptTask'][@name='%s']/*[local-name()='script']/text()" % script_name)
Now the " characters in the value are properly enclosed by the ' delimiters of [@name='%s'].
There is a reference about entities in XML at W3Resource which says:
The apostrophe (') and quote characters (") may also need to be encoded as entities when used in attribute values. If the delimiter for the attribute value is the apostrophe, then the quote character is legal but the apostrophe character is not, because it would signal the end of the attribute value. If an apostrophe is needed, the character entity &apos; must be used. Similarly, if a quote character is needed in an attribute value that is delimited by quotes, then the character entity &quot; must be used.
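The alternating-quote idea can be tried out with a small self-contained sketch. Several things here are assumptions rather than part of the question: it uses the standard library's xml.etree.ElementTree (which only supports a subset of XPath, so a namespace map replaces local-name() and .text replaces text()), and the wrapping bpmn:definitions element and the namespace URI are added only so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper element and namespace URI; the original post shows only the inner task.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
doc = (
    '<bpmn:definitions xmlns:bpmn="' + BPMN_NS + '">'
    '<bpmn:scriptTask id="UserTask_0qtrxsq" '
    'name="set variables app_from_user &amp; applist to &quot;ticketingsystem&quot;" '
    'scriptFormat="groovy">'
    '<bpmn:script>What should be matched is here</bpmn:script>'
    '</bpmn:scriptTask>'
    '</bpmn:definitions>'
)

root = ET.fromstring(doc)
ns = {"bpmn": BPMN_NS}

# The parser has already decoded the entities, so the name is compared in plain form.
script_name = 'set variables app_from_user & applist to "ticketingsystem"'

# The value contains ", so it is delimited with ' inside the path expression.
path = ".//bpmn:scriptTask[@name='%s']/bpmn:script" % script_name
scripts = [element.text for element in root.findall(path, ns)]
print(scripts)
```

This still breaks if the name contains an apostrophe, so it is only as generic as the set of names that avoid the chosen delimiter.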

Error in HTML escaping with Jinja

I have the following regex that searches through text and prepends and appends HTML 'a' tags for the matched substring. It successfully does everything I want except when the HTML is escaped by using the 'safe' filter by Jinja. The regex is below:
re.sub('(^#\w*|(?<=\s)#\w*)',
r'\1',
'here is some #text with a #hashtag')
The above should come out here is some #text with a #hashtag
where '#text' and '#hashtag' are clickable links. However by using Jinja's 'safe' filter it comes out
"here is some "#text" with a "#hashtag
There are a few things to note:
Unmatched substrings are being wrapped in quotations
The html links should come out #hashtag<a> not <a href="{{ url_for(\'main.tag\', tagname=tag) }}">#hashtag
I'm confident it has to do with the string that is being processed by Jinja. I am not confident with how I am escaping specific characters in the string and passing it to Jinja to process.
Am I escaping the characters wrong? Thoughts? Thank you in advance.

python xpath remove unicode chars

I have this text in html page
<div class="phone-content">
‪050 2836142‪
</div>
I am using XPath to extract the value inside that div like this:
normalize-space(.//div[@class='fieldset-content']/span[@class='listing-reply-phone']/div[@class='phone-content']/text())
I got this result:
"\u202a050 2836142\u202a"
Does anyone know how to tell XPath in Python to remove those Unicode characters?
If you're looking for an XPath solution: to remove all characters but those from a given set, you can use two nested translate(...) calls following this pattern:
translate($string, translate($string, ' 0123456789', ''), '')
This will remove all characters that are not the space character or a digit. You will have to replace both occurrences of $string by the complete XPath expression to fetch that string.
It might be more reasonable though to apply that outside XPath using more advanced string manipulation features. Those of XPath 1.0 are very limited.
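For the clean-up outside XPath, the nested translate() trick can be mirrored with Python's str.translate. This is a rough sketch that assumes the extracted value is already in a Python string; the variable names are illustrative.

```python
raw = "\u202a050 2836142\u202a"
allowed = " 0123456789"

# Inner step: delete the allowed characters, leaving only the unwanted ones.
unwanted = raw.translate(str.maketrans("", "", allowed))

# Outer step: delete exactly those unwanted characters from the original string.
cleaned = raw.translate(str.maketrans("", "", unwanted))
print(repr(cleaned))
```

A shorter alternative with the same effect on this input would be a comprehension such as "".join(ch for ch in raw if ch in allowed).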

python xpath: single quotationmarks and double quotationmarks

I hope this is a small problem:
I want to search for a text that can contain double quotes " and/or single quotes '. Now I can use this:
"//a[contains(text(), '"+ mytext +"')]"
or this:
'//a[contains(text(), "'+ mytext +'")]'
but if mytext is something like this:
a'b"c
I get (of course) an XPath error (Invalid XPath-Expression). How can I avoid it?
In XPath 2.0 the delimiting quotes of a string literal can be included in the literal by doubling: "He said, ""I can't""".
In XPath 1.0, such a string can't be written as a literal, but it can be expressed using concat(): concat("He said, ", '"', "I can't", '"')
Of course, if such an XPath expression is then to be written as a string literal in a host language such as Python, the quotes must be further escaped according to Python rules.
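The concat() approach for XPath 1.0 can be automated with a small helper. The function below is a hypothetical sketch (xpath_literal is not part of any library): it returns a plain quoted literal when one kind of quote is absent, and otherwise splits the text at each apostrophe and glues the pieces back together with concat().

```python
def xpath_literal(text: str) -> str:
    """Return an XPath 1.0 expression that evaluates to exactly `text`."""
    if "'" not in text:
        return "'%s'" % text
    if '"' not in text:
        return '"%s"' % text
    # Both kinds of quote are present: single-quote each piece and
    # insert a double-quoted apostrophe between the pieces.
    pieces = ["'%s'" % part for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


mytext = "a'b\"c"
expression = "//a[contains(text(), %s)]" % xpath_literal(mytext)
print(expression)
```

The resulting expression string can then be passed to whichever XPath 1.0 engine is in use.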
Use Python triple quotes r"""a'b"c""". Inside the xpath, use an escape .
