How to write XPath query that contains HTML entities?

I have this block of XML:
<bpmn:scriptTask id="UserTask_0qtrxsq" name="set variables app_from_user &amp; applist to &quot;ticketingsystem&quot;" scriptFormat="groovy">
... <bpmn:script> What should be matched is here ... </bpmn:script>
</bpmn:scriptTask>
in an XML file I'm trying to parse using Python and XPath. Below is the line that should match the script tag:
getLines = xml.xpath('//*[local-name()="scriptTask"][@name="%s"]/*[local-name()="script"]/text()' % script_name)
where script_name should be set variables app_from_user & applist to "ticketingsystem" in one of the iterations over all the existing scriptTask tags in the XML file.
It works fine for all other tags, but not for this one. When I removed the HTML entities (the placeholders for ampersands, quotes, etc.), it worked fine:
<bpmn:scriptTask id="UserTask_0qtrxsq" name="set variables app_from_user" scriptFormat="groovy">
... <bpmn:script> What should be matched is here ... </bpmn:script>
</bpmn:scriptTask>
But I don't have control over the XML files and I want the script to be as generic as possible. Is there a way I could make the XPath query to extract what's inside the script tag without errors?

You have a problem with your quotes. In XPath, nested quotes have to alternate between " and ' (or, written as entities, between &quot; and &apos;). Because your %s parameter contains ", the delimiters surrounding it have to be ' (or &apos;). So your XPath expression could look like this:
//*[local-name()='scriptTask'][@name='set variables app_from_user & applist to "ticketingsystem"']/*[local-name()='script']/text()
and therefore your whole expression could look like the following:
getLines = xml.xpath("//*[local-name()='scriptTask'][@name='%s']/*[local-name()='script']/text()" % script_name)
Now the " characters are properly enclosed by the ' delimiters of [@name='%s'].
There is a reference about entities in XML at W3Resource which says:
The apostrophe (') and quote characters (") may also need to be encoded as entities when used in attribute values. If the delimiter for the attribute value is the apostrophe, then the quote character is legal but the apostrophe character is not, because it would signal the end of the attribute value. If an apostrophe is needed, the character entity &apos; must be used. Similarly, if a quote character is needed in an attribute value that is delimited by quotes, then the character entity &quot; must be used.
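To keep the script generic, the delimiter can be chosen per value instead of being hard-coded. The following is a minimal sketch, not part of the original answer: xpath_literal is a hypothetical helper name, and the concat() fallback is the usual XPath 1.0 workaround for values that contain both kinds of quote.

```python
def xpath_literal(value: str) -> str:
    """Quote `value` as an XPath 1.0 string literal (hypothetical helper)."""
    if '"' not in value:
        return '"%s"' % value
    if "'" not in value:
        return "'%s'" % value
    # Both quote kinds occur: XPath 1.0 has no escape character, so build
    # the value from double-quoted pieces joined by a single-quoted '"'.
    pieces = []
    for i, part in enumerate(value.split('"')):
        if i > 0:
            pieces.append("'\"'")
        if part:
            pieces.append('"%s"' % part)
    return "concat(%s)" % ", ".join(pieces)


script_name = 'set variables app_from_user & applist to "ticketingsystem"'
query = (
    "//*[local-name()='scriptTask'][@name=%s]/*[local-name()='script']/text()"
    % xpath_literal(script_name)
)
print(query)
```

The resulting query string can then be passed to xml.xpath(query). If the library in use happens to be lxml (an assumption, the question does not say), its xpath() method also accepts XPath variables such as [@name=$name] with name=script_name as a keyword argument, which avoids quoting altogether.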

Related

Escape Characters in Regex sub of Markdown Links to HTML Links

I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would seem best not to depend upon the browser to do the auto-insertion, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick='requestPage(\"\2\");'>\1</a>", pageContent)
results in
<a href='#' onclick='requestPage(\"Boards/boardManagement.md\");'>Board Management</a>
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?

Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
       r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
       pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
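As a rough check, here is a self-contained run of the triple-quoted approach on the sample link from the question. The snake_case variable names are mine, and the replacement string is written to produce the exact output the question asks for.

```python
import re

# Sample markdown taken from the question.
page_content = "[Board Management](Boards/boardManagement.md)"

# Raw literal for the regex, triple-quoted raw literal for the replacement,
# so neither ' nor " needs a backslash.
link_pattern = r'\[(.+)\]\((.+)\)'
replacement = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

filtered_page = re.sub(link_pattern, replacement, page_content)
print(filtered_page)
```

Note that (.+) is greedy, so a line containing several links would be matched as one; that is a property of the pattern in the question and is left unchanged here.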
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
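The difference is easy to verify by comparing string lengths. A small sketch (the variable names are only for illustration):

```python
plain = "abc \" def"             # regular literal: the backslash is consumed
raw = r"abc \" def"              # raw literal: the backslash stays in the string
triple_raw = r"""abc " def"""    # triple-quoted raw literal: no backslash needed

print(len(plain), len(raw), len(triple_raw))
print(plain == triple_raw)       # same text as the regular literal
print(len(r"\""))                # a backslash plus a double quote
```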
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.

find the parent tag of the given text from its position in html string

I am using Python to manipulate an HTML string. I want to find the parent tag of a given piece of text (its start and end offsets are known) in the HTML string.
E.g., consider the following HTML string:
<html><body><span id="1234">The Dormouse's story</span><body></head>
The input is the offset (33,43), i.e. the string 'Dormouse's', and the parent tag is <span id="1234">

Right off the top of my head here, since you have the offset (which I think you may have to tweak, because I had to use (28,48)):
Create a substring based on the offset.
Split the full HTML string with split(), using the offset string as the delimiter.
Take the first substring created by the split and split that on >.
The second-to-last substring in that list is your parent tag (the last one is an empty string, because split() returns an empty string when the delimiter is at the end of the string you're splitting):
html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
offset_string = html_string[28:48]
tags_together = html_string.split(offset_string)[0]
list_of_tags = tags_together.split('>')
parent_tag = list_of_tags[-2]  # second-to-last piece; the last one is ''
Note you will be missing a '>' so you will have to add that back if necessary.
parent_tag = parent_tag + ">"
Also, the reason why I put the html_string in single quotes is because you have double quotes in there already.
This is gross and a little brutish but it should get the job done. I am sure there exists a python library out there that can do this kind of task for you. You just need to look hard enough!
I recommend opening up a python shell and printing out each variable after you create it so that you can see what split() does. Here are some docs for that!
Now that I think about it, using regex with your known offset could get you the tags too...
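Since the start offset is already known, the same idea can be sketched without split(), which would otherwise cut at the first occurrence of the text rather than at the given position. This is a minimal sketch under one assumption: the text directly follows its parent's opening tag. parent_tag_before is a hypothetical helper name, and the offset 32 is the 0-based position of "Dormouse's" in the sample string.

```python
def parent_tag_before(html_string: str, start: int) -> str:
    """Return the last complete tag that ends before position `start`."""
    prefix = html_string[:start]
    tag_end = prefix.rfind('>')
    if tag_end == -1:
        return ''  # no tag before the given offset
    tag_start = prefix.rfind('<', 0, tag_end)
    if tag_start == -1:
        return ''  # stray '>' with no opening '<'
    return prefix[tag_start:tag_end + 1]


html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
print(html_string[32:42])
print(parent_tag_before(html_string, 32))
```

If the text is preceded by a closing tag or a sibling element, this returns that tag instead, so for arbitrary HTML a real parser is the safer choice.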

Grabbing text between either double/single quote in Python regex

I have a bunch (thousands) of old unit testing scripts written with the Selenium RC interface in JavaScript. Since we're upgrading to Selenium 3, I want to try and get rid of some of the RC methods in an automated fashion using Python scripts. I'm iterating through these scripts line by line, picking up the Selenese methods, deconstructing them then attempting to rebuild with the WebDriver interface. For example:
selenium.type("xpath=//*[text()='test, xpath']", "test, text");
Would be output as...
driver.findElement(By.xpath("//*[text()='test, xpath']")).sendKeys("test, text");
I have a system for automatically identifying the Selenese methods, storing whitespace and separating the method from the parameters, so what I'm left with is the following string:
("xpath=//*[text()='test, xpath']", "test, text")
A problem I'm running into is, these aren't always consistent. Sometimes there are double-quotes nested in single-quotes, or vice-versa, or escaped double-quotes nested in double-quotes, etc. For example:
("xpath=//*[text()=\"test, xpath\"]", "test, text")
('xpath=//*[text()=\'test, xpath\']', 'test, text')
('xpath=//*[text()="test, xpath"]', 'test, text')
These are all valid. I want to be able to always match the arguments passed into the method, whether double-quotes are used or single-quotes, plus ignore nested quotes opposite of what's used to open the string as well as escaped quotes, then return them as lists.
['xpath=//*[text()="test, xpath"]', 'test, text']
...etc. I've attempted to use the re.findall using the following expression.
([\"'])(?:(?=(\\?))\2.)*?\1
What I'm getting back is this.
>>> print arguments
[('"', ''), ('"', '')]
Is there something I'm missing?

I would not make it this complex with lookbehind or lookahead. Rather, I would build a case-specific regex. In your case you have something like the following:
("param1", "param2")
('param1', 'param2')
Inside these params you may have additional escaped quotes, single quotes, and so on. But notice one thing: the separators ", " and ', ' will rarely occur inside param1 or param2.
So the simplest non-regex solution would be to split on ", " or ', '. But there may be extra spaces, or no spaces, around the comma, so we use a pattern:
^\(\s*["']\s*(?<first_param>.*?)("\s*,\s*"|'\s*,\s*')(?<second_param>.*?)\s*["']\s*\)$
\(\s*["']\s* to match the first brackets and any starting quote
(?<first_param>.*?) to match the first parameter
("\s*,\s*"|'\s*,\s*') to match our split command pattern
(?<second_param>.*?) to match the second param
\s*["']\s*\)$ to match the end.
This is not perfect, but it should work in 95%+ of your cases.
You can check the regex at the link below:
https://regex101.com/r/z9PytD/1/
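For use from Python, note that the re module spells named groups (?P<name>...) rather than (?<name>...). A sketch of the same pattern in that syntax, with split_args as a hypothetical helper name and the sample inputs taken from the question:

```python
import re

# Same pattern as above, with Python-style named groups.
ARGS_PATTERN = re.compile(
    r"""^\(\s*["']\s*(?P<first_param>.*?)("\s*,\s*"|'\s*,\s*')(?P<second_param>.*?)\s*["']\s*\)$"""
)


def split_args(text):
    """Return [first, second] for a two-argument call string, or None."""
    match = ARGS_PATTERN.match(text)
    if match is None:
        return None
    return [match.group("first_param"), match.group("second_param")]


samples = [
    r'''("xpath=//*[text()='test, xpath']", "test, text")''',
    r'''("xpath=//*[text()=\"test, xpath\"]", "test, text")''',
    r'''('xpath=//*[text()="test, xpath"]', 'test, text')''',
]
for sample in samples:
    print(split_args(sample))
```

Backslashes in front of escaped quotes are kept in the captured text; stripping them would be a separate step.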

How to return html content without escaping in serializer?

I have model with TextField that contains html. Let's say I have a row that contains google in TextField. The API returns "google".
How can I remove " escaping?

You can use the html module, which has a function named escape:
html.escape(s, quote=True)
Convert the characters &, < and > in string s to HTML-safe sequences. Use this if you need to display text that might contain
such characters in HTML. If the optional flag quote is true, the
characters (") and (') are also translated; this helps for inclusion
in an HTML attribute value delimited by quotes, as in <a href="...">.
New in version 3.2.
Let s be: s = '<a href="http://example.com">example</a>' then:
from html import escape
html_line = escape(s)
Now html_line contains the s string with its special characters escaped, looking like this:
&lt;a href=&quot;http://example.com&quot;&gt;example&lt;/a&gt;
If you want the characters < > & " etc. back instead of their escaped forms, you can use the other function of the html module, called unescape:
from html import unescape
html_line = unescape(html_line)
Now html_line will look like this:
<a href="http://example.com">example</a>
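A short round trip through both functions shows what each one does to quotes. This is only a sketch: the sample markup is an assumption based on the example.com link above.

```python
from html import escape, unescape

s = '<a href="http://example.com">example</a>'

escaped = escape(s)                        # &, <, > and both quote kinds become entities
escaped_keep_quotes = escape(s, quote=False)  # only &, < and > are converted
restored = unescape(escaped)               # entities back to the original characters

print(escaped)
print(escaped_keep_quotes)
print(restored == s)
```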

Python XPath parsing tag with apostrophe

I'm new to XPath. I'm trying to parse a page using XPath. I need to get information from a tag, but an escaped apostrophe in the title screws everything up.
For parsing I use Grab.
Tag from source:
<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>
Actual XPath:
g.xpath('.//tr/td/a[3]/img').get('title')
Returns
commission:Alfred\\
Is there any way to fix this?
Thanks

Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double-quotes; or 2) use &apos; to include a single quote.
From the XML spec:
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;".

As the provided "XML" isn't a well-formed document due to nested apostrophes, no XPath expression can be evaluated on it.
The provided non-well-formed text can be corrected to:
<img src="somelink"
border="0"
alt="commission:Alfred's misadventures"
title="commission:Alfred's misadventures"/>
In case there is a weird requirement not to use double quotes, then one correct conversion is:
<img src='somelink'
border='0'
alt='commission:Alfred&apos;s misadventures'
title='commission:Alfred&apos;s misadventures'/>
If you are given the incorrect input, in a language such as C# one can try to convert it to its correct counterpart using:
string correctXml = input.Replace("\\'s", "&apos;s");
There is probably a similar way to do the same in Python.
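In Python, str.replace can do the same pre-processing. A minimal sketch, assuming the tag is self-closed so that the standard library's xml.etree.ElementTree can parse the repaired text (the tag in the question is not self-closed, and a real page would need an HTML parser):

```python
import xml.etree.ElementTree as ET

# Malformed input modeled on the question; the trailing "/" is an assumption
# added so the repaired text is well-formed XML.
raw = r"<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'/>"

# Replace every backslash-escaped apostrophe with the XML entity.
fixed = raw.replace("\\'", "&apos;")

img = ET.fromstring(fixed)
print(img.get("title"))
```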
