Error in HTML escaping with Jinja - python

I have the following regex that searches through text and prepends and appends HTML 'a' tags for the matched substring. It successfully does everything I want except when the HTML is escaped by using the 'safe' filter by Jinja. The regex is below:
re.sub(r'(^#\w*|(?<=\s)#\w*)',
r'\1',
'here is some #text with a #hashtag')
The above should come out as: here is some #text with a #hashtag
where '#text' and '#hashtag' are clickable links. However, when I use Jinja's 'safe' filter it comes out as:
"here is some "#text" with a "#hashtag
There are a few things to note:
Unmatched substrings are being wrapped in quotations
The html links should come out #hashtag<a> not <a href="{{ url_for(\'main.tag\', tagname=tag) }}">#hashtag
I'm confident it has to do with the string that is being processed by Jinja. I am not confident with how I am escaping specific characters in the string and passing it to Jinja to process.
Am I escaping the characters wrong? Thoughts? Thank you in advance.

Related

Python - Remove apostrophe from Regular Expression

I have the following regular expression to extract song names from a certain website:
<h2 class="chart-row__song">(.*?)</h2>
It displays the results below :
Wherever ' appears in the output, it corresponds to an apostrophe on the website the song name is extracted from.
How would I go about changing my regular expression to remove those characters? '
TIA
As stated in the comments, you can't do that using a regex alone. You need to unescape the HTML entities present in the match separately.
import re
import html
regex = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')
result = [html.unescape(s) for s in regex.findall(mystring)]
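As a small worked illustration of the unescaping step, here is a sketch run against a made-up page fragment. The sample HTML and song titles are hypothetical (they are not from the question); only the regex and html.unescape come from the answer above.

```python
import html
import re

# Hypothetical page fragment; the titles are invented for illustration only.
sample = (
    '<h2 class="chart-row__song">Don&#39;t Stop</h2>'
    '<h2 class="chart-row__song">Rock &amp; Roll</h2>'
)

pattern = re.compile(r'<h2 class="chart-row__song">(.*?)</h2>')

# The regex returns the text exactly as it appears in the HTML source,
# so character references such as &#39; are still present here.
raw_titles = pattern.findall(sample)

# html.unescape converts the character references back to plain characters.
titles = [html.unescape(t) for t in raw_titles]

print(raw_titles)
print(titles)
```

The regex only locates the text; decoding the entities is a separate step, which is why changing the pattern alone cannot remove them.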

python xpath remove unicode chars

I have this text in html page
<div class="phone-content">
‪050 2836142‪
</div>
I am using XPath to extract the value inside that div like this:
normalize-space(.//div[@class='fieldset-content']/span[@class='listing-reply-phone']/div[@class='phone-content']/text())
I got this result:
"\u202a050 2836142\u202a"
Does anyone know how to tell XPath in Python to remove those Unicode characters?
If you're looking for an XPath solution: to remove all characters but those from a given set, you can use two nested translate(...) calls following this pattern:
translate($string, translate($string, ' 0123456789', ''), '')
This will remove all characters that are not the space character or a digit. You will have to replace both occurrences of $string by the complete XPath expression to fetch that string.
It might be more reasonable though to apply that outside XPath using more advanced string manipulation features. Those of XPath 1.0 are very limited.
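A minimal sketch of doing the same filtering outside XPath, in plain Python. It mirrors the nested translate(...) pattern: first collect the characters that are not in the allowed set, then delete them. The function name keep_only is hypothetical, and the input string is the result quoted in the question.

```python
def keep_only(text, allowed=" 0123456789"):
    """Remove every character from text that is not in allowed."""
    # Corresponds to the inner translate(): the characters of text
    # that are NOT in the allowed set.
    unwanted = "".join(ch for ch in text if ch not in allowed)
    # Corresponds to the outer translate(): delete those characters.
    return text.translate(str.maketrans("", "", unwanted))


raw = "\u202a050 2836142\u202a"  # value returned by the XPath query
cleaned = keep_only(raw)
print(repr(cleaned))
```

The allowed set is an assumption that phone numbers consist only of digits and spaces; widen it (for example with "+" or "-") if other characters must survive.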

Python regex: remove certain HTML tags and the contents in them

If I have a string that contains this:
<p><span class=love><p>miracle</p>...</span></p><br>love</br>
And I want to remove the string:
<span class=love><p>miracle</p>...</span>
and maybe some other HTML tags. At the same time, the other tags and their contents should be preserved.
The result should be like this:
<p></p><br>love</br>
How can I do this using a regex pattern?
What I have tried:
r=re.compile(r'<span class=love>.*?(?=</span>)')
r.sub('',s)
but it will leave the
</span>
Can you help me do this with the re module this time? I will learn an HTML parser next.
First things first: Don’t parse HTML using regular expressions
That being said, if there is no additional span tag within that span tag, then you could do it like this:
text = re.sub('<span class=love>.*?</span>', '', text)
On a side note: paragraph tags are not supposed to go within span tags (only phrasing content is).
The expression you have tried, <span class=love>.*?(?=</span>), is already quite good. The problem is that the lookahead (?=</span>) only asserts that the closing tag follows; it never consumes what it looks ahead for. So the match stops immediately before the closing span tag, which is why </span> is left behind. You could manually add a closing span at the end, i.e. <span class=love>.*?(?=</span>)</span>, but that's not really necessary: the .*? is a non-greedy expression that tries to match as little as possible. So in .*?</span> the .*? only matches up to the first closing span tag, where it stops.
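To make the difference concrete, here is a short sketch that runs both patterns on the string from the question. The variable names are illustrative only.

```python
import re

s = '<p><span class=love><p>miracle</p>...</span></p><br>love</br>'

# Lookahead version: (?=</span>) checks that </span> follows but does not
# include it in the match, so the closing tag survives the substitution.
with_lookahead = re.sub(r'<span class=love>.*?(?=</span>)', '', s)

# Plain non-greedy version: the closing tag is part of the match and is
# removed together with the rest of the span.
without_lookahead = re.sub(r'<span class=love>.*?</span>', '', s)

print(with_lookahead)     # <p></span></p><br>love</br>
print(without_lookahead)  # <p></p><br>love</br>
```

As noted above, this only works when no other span is nested inside the one being removed.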

How to make regex with unicode symbols?

I need to make regex which will capture the following:
Fixed unicode text:
<br>
<strong>
text I am looking for
</strong>
I do something like
regex = re.compile(unicode('Fixed unicode text:.*','utf-8'))
How to modify that to capture remaining text?
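Simply prefix the literal with u (in Python 2.x; nothing is needed in Python 3) to get a unicode string, and use parentheses to capture the remaining text. The example below uses Python 2 syntax; the ur prefix is a syntax error in Python 3, where a plain r'...' literal does the same job: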
import re
haystack = u'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
match = re.search(ur'Fixed unicode text:(.*)', haystack, re.DOTALL)
print(match.group(1))
However, it looks like your input is HTML. If that's the case, you should not use a regular expression, but parse the HTML with lxml, BeautifulSoup, or another HTML parser.
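For reference, a Python 3 sketch of the regex part of this answer. In Python 3 every str is already Unicode, so no u prefix or unicode() call is needed. The haystack value is the one used in the example above.

```python
import re

haystack = 'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'

# re.DOTALL lets "." match newlines too, so the group captures
# everything that follows the fixed prefix.
match = re.search(r'Fixed unicode text:(.*)', haystack, re.DOTALL)

captured = match.group(1) if match else None
print(captured)
```

This still captures the surrounding markup along with the text, which is one more reason to prefer an HTML parser for real pages.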

Python XPath parsing tag with apostrophe

I'm new to XPath. I'm trying to parse a page using XPath. I need to get information from a tag, but an escaped apostrophe in the title attribute breaks everything.
For parsing I use Grab.
tag from source:
<img src='somelink' border='0' alt='commission:Alfred\'s misadventures' title='commission:Alfred\'s misadventures'>
Actual XPath:
g.xpath('.//tr/td/a[3]/img').get('title')
Returns
commission:Alfred\\
Is there any way to fix this?
Thanks
Garbage in, garbage out. Your input is not well-formed, because it improperly escapes the single quote character. Many programming languages (including Python) use the backslash character to escape quotes in string literals. XML does not. You should either 1) surround the attribute's value with double-quotes; or 2) use &apos; to include a single quote.
From the XML spec:
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as "
&apos; ", and the double-quote character (") as " " ".
As the provided "XML" isn't a well-formed document due to nested apostrophes, no XPath expression can be evaluated on it.
The provided non-well-formed text can be corrected to:
<img src="somelink"
border="0"
alt="commission:Alfred's misadventures"
title="commission:Alfred's misadventures"/>
In case there is a strange requirement not to use double quotes, one correct conversion is:
<img src='somelink'
border='0'
alt='commission:Alfred&apos;s misadventures'
title='commission:Alfred&apos;s misadventures'/>
If you are provided the incorrect input, in a language such as C# one can try to convert it to its correct counterpart using:
string correctXml = input.Replace("\\'s", "&apos;s");
Probably there is a similar way to do the same in Python.
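A possible Python counterpart, as a sketch only: it replaces every backslash-escaped apostrophe with &apos; and then reads the title attribute back. The standard-library HTMLParser is used here purely to check the result (it is a stand-in, not the Grab library from the question), and TitleCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser

# The malformed tag from the question (raw string keeps the backslashes).
raw = (r"<img src='somelink' border='0' "
       r"alt='commission:Alfred\'s misadventures' "
       r"title='commission:Alfred\'s misadventures'>")

# Replace each backslash-escaped apostrophe with the XML/HTML entity.
fixed = raw.replace("\\'", "&apos;")


class TitleCollector(HTMLParser):
    """Collects the title attribute of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.titles.extend(value for name, value in attrs if name == "title")


collector = TitleCollector()
collector.feed(fixed)
collector.close()
print(collector.titles)
```

The replacement has to happen on the raw text before it reaches the parser; once the document has been parsed, the attribute value has already been cut off at the stray apostrophe.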
