I have a String in Python, which has some HTML in it. Basically it looks like this.
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
I am trying to display this HTML in a PDF. Because my PDF generator can't handle the style attribute (and no, I can't use a different generator), I have to remove it from the string. So basically, it should work like this:
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
>>> parsedString = someFunction(someString)
>>> print parsedString
"<img src='somepath'/>"
I guess the best way to do this is with RegEx, but I'm not very keen on it. Can someone help me out?
I wouldn't use RegEx for this, because:
Regex is not really suited for HTML parsing. Even though this is a simple case, there could be many variations and edge cases you need to consider, and the resulting regex could turn into a nightmare.
Regex sucks. It can be really useful, but honestly, it is the epitome of user-unfriendliness.
Alright, so how would I go about it? I would use trusty BeautifulSoup! Install it with pip using the following command:
pip install beautifulsoup4
Then you can do the following to remove the style:
from bs4 import BeautifulSoup as Soup
soup = Soup(someString, 'html.parser')
del soup.find('img')['style']
parsedString = str(soup)
This first parses your string, finds the img tag, deletes its style attribute, and then turns the soup back into a string.
It should also work with arbitrary strings but I can't promise that. Maybe you will come up with an edge case.
Remember, using RegEx to parse HTML is not the best of ideas. The internet and Stack Overflow are full of answers explaining why this is a bad idea.
Edit: Just for kicks you might want to check out this answer. You know stuff is serious when it is said that even Jon Skeet can't do it.
Using RegEx to work with HTML is a very bad idea but if you really want to use it, try this:
/style=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/ig
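For reference, here is a minimal sketch of how such a pattern could be applied in Python with re.sub. The pattern below is a simplified one assumed for illustration (it only strips quoted style attributes and will miss edge cases):

import re

someString = "<img style='height:50px;' src='somepath'/>"
# Strip a quoted style="..." or style='...' attribute; simplified, not robust.
parsedString = re.sub(r"""\sstyle=("[^"]*"|'[^']*')""", "", someString)
print(parsedString)  # <img src='somepath'/>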
I know this question has been asked many times but I can't seem to find the variation that I'm looking for specifically.
I have a URL, let's say it's:
https://somethingA/somethingB/somethingC/some spaces here
I want to convert it to:
https://somethingA/somethingB/somethingC/some%20spaces%20here
I know I can do it with the replace function like below:
url = 'https://somethingA/somethingB/somethingC/some spaces here'
url = url.replace(' ', '%20')
But I have a feeling that the best practice is probably to use the urllib.parse library. The problem is that when I use it, it encodes other special characters like : too.
So if I do:
url = 'https://somethingA/somethingB/somethingC/some spaces here'
urllib.parse.quote(url)
I get:
https%3A//somethingA/somethingB/somethingC/some%20spaces%20here
Notice the : also gets converted to %3A. So my question is, is there a way I can achieve the same thing as replace with urllib? I would rather use a tried and tested library that is designed specifically to encode URLs than reinvent the wheel and possibly miss something that leads to a security loophole. Thank you.
quote() is built to work on just the path portion of a URL, so you need to break things up a bit, like this:
from urllib.parse import urlparse, quote

url = 'https://somethingA/somethingB/somethingC/some spaces here'
parts = urlparse(url)
fixed_url = f"{parts.scheme}://{parts.netloc}{quote(parts.path)}"
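Another option one might consider (a sketch, not part of the original answer) is to tell quote() which characters to leave alone via its safe parameter, or to rebuild the URL with urlunparse so that any query string and fragment survive too:

from urllib.parse import urlparse, urlunparse, quote

url = 'https://somethingA/somethingB/somethingC/some spaces here'

# Option 1: quote the whole URL but leave ':' and '/' untouched.
print(quote(url, safe=':/'))

# Option 2: quote only the path and reassemble, preserving query and fragment.
parts = urlparse(url)
print(urlunparse(parts._replace(path=quote(parts.path))))

Both print https://somethingA/somethingB/somethingC/some%20spaces%20here.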
I am using the Readability Parser API to extract content from a web page. It is OK when the web page uses a Latin character set, but when I extract an article in Cyrillic, it ends up with the following:
<div>&#x412;&#x432;&#x43e;&#x441;&#x43a;&#x440;&#x435;&#x441;&#x435;&#x43d;&#x44c;</div>...etc
The interesting thing here is that the title of the web page is extracted correctly in Cyrillic, but not the content. My attempt was to do the following, as suggested in this SO answer:
content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')
but it did not work. Could you tell me if there is a way to convert this string before saving it to the database?
Please let me know if the title of my question explains correctly what I need. Thank you.
One way (Python 3.3):
>>> s='<div>&#x412;&#x432;&#x43e;&#x441;&#x43a;&#x440;&#x435;&#x441;&#x435;&#x43d;&#x44c;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'
Python 2.7:
>>> s='<div>&#x412;&#x432;&#x43e;&#x441;&#x43a;&#x440;&#x435;&#x441;&#x435;&#x43d;&#x44c;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>
P.S. I went to look for the documentation link and it looks like unescape isn't documented. Here's a way without using an undocumented API:
>>> import re
>>> re.sub(r'&#x(.*?);', lambda x: chr(int(x.group(1), 16)), s)
'<div>Ввоскресень</div>'
Per a comment, it looks like it is finally documented (and moved) in Python 3.4:
https://docs.python.org/3.4/library/html.html#html.unescape
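For anyone on Python 3.4 or later, here is a minimal sketch using that documented module-level function, with the sample string from the question:

>>> import html
>>> html.unescape('<div>&#x412;&#x432;&#x43e;&#x441;&#x43a;&#x440;&#x435;&#x441;&#x435;&#x43d;&#x44c;</div>')
'<div>Ввоскресень</div>'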
I have got a URL in this form: http:\/\/en.wikipedia.org\/wiki\/The_Truman_Show. How can I make it a normal URL? I have tried using urllib.unquote without much success.
I can always use regular expressions or some simple string replace stuff. But I believe that there is a better way to handle this...
urllib.unquote is for replacing %xx escape codes in URLs with the characters they represent. It won't be useful for this.
Your "simple string replace stuff" is probably the best solution.
Have you tried using json.loads from the json module?
>>> json.loads('"http:\\/\\/en.wikipedia.org\\/wiki\\/The_Truman_Show"')
'http://en.wikipedia.org/wiki/The_Truman_Show'
The input that I'm showing isn't exactly what you have. I've wrapped it in double quotes to make it valid json.
When you first get it from the json, how are you decoding it? That's probably where the problem is.
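To illustrate (a sketch with an assumed payload shape, not the asker's actual data): if the URL arrives inside a JSON document, decoding the whole document with json.loads already turns the \/ escapes back into plain slashes.

import json

# Assumed payload shape, for illustration only.
payload = r'{"url": "http:\/\/en.wikipedia.org\/wiki\/The_Truman_Show"}'
data = json.loads(payload)
print(data["url"])  # http://en.wikipedia.org/wiki/The_Truman_Show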
It is a bit childish to look for a library function when you can transform the URL yourself.
Since there is no visible rule other than "/" being replaced by "\/", you can simply replace it back:
def unescape_this(url):
    return url.replace(r"\/", "/")
I used BeautifulSoup to handle XML files that I have collected through a REST API.
The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely.
Unfortunately I need the HTML code.
How would I go about transforming the escaped HTML into proper markup?
Help would be very much appreciated!
I think you want xml.sax.saxutils.unescape from the Python standard library.
E.g.:
>>> from xml.sax import saxutils as su
>>> s = '&lt;foo&gt;bar&lt;/foo&gt;'
>>> su.unescape(s)
'<foo>bar</foo>'
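Note that unescape() only handles &amp;, &lt; and &gt; by default; additional entities can be supplied through its optional entities mapping. A small sketch with an example string of my own:

>>> s = '&lt;p class=&quot;intro&quot;&gt;hi&lt;/p&gt;'
>>> su.unescape(s, entities={'&quot;': '"'})
'<p class="intro">hi</p>'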
You could try the urllib module?
It has a method unquote() that might suit your needs.
Edit: on second thought (and more reading of your question), you might just want to use string.replace().
Like so:
string.replace('&lt;', '<')
string.replace('&gt;', '>')
I'm using Python with Pylons.
I want to display the saved data from a textarea in a mako file, with new lines formatted correctly for display.
Is this the best way of doing it?
${c.info['about_me'].replace("\n", "<br />") | n}
The problem with your solution is that you bypass the string escaping, which can lead to security issues. Here is my solution:
<%! import markupsafe %>
${text.replace('\n', markupsafe.Markup('<br />'))}
or, if you want to use it more than once:
<%!
import markupsafe
def br(text):
return text.replace('\n', markupsafe.Markup('<br />'))
%>
${text | br }
This solution uses markupsafe, which mako uses to mark strings as safe and to know which ones to escape. We mark only the <br /> as safe, not the rest of the string, so the rest will still be escaped if needed.
It seems to me that it is perfectly suitable.
Be aware that replace() returns a copy of the original string and does not modify it in place. So since this replacement is only for display purposes it should work just fine.
Here is a little visual example:
>>> s = """This is my paragraph.
...
... I like paragraphs.
... """
>>> print s.replace('\n', '<br />')
This is my paragraph.<br /><br />I like paragraphs.<br />
>>> print s
This is my paragraph.
I like paragraphs.
The original string remains unchanged. So... Is this the best way of doing it?
Ask yourself: Does it work? Did it get the job done quickly without resorting to horrible hacks? Then yes, it is the best way.
Beware that line breaks in <textarea>s should get submitted as \r\n, according to http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4
To be safe, try s.replace('\r\n', '<br />') then s.replace('\n', '<br />').
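A minimal sketch of that chain, assuming the text came from a submitted textarea:

# Handle Windows-style line endings first, then any remaining bare newlines.
text = "line one\r\nline two\nline three"
html_text = text.replace('\r\n', '<br />').replace('\n', '<br />')
print(html_text)  # line one<br />line two<br />line three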
This seems dangerous to me because it prints the whole string without escaping, which would allow arbitrary tags to be rendered. Make sure you cleanse the user's input with lxml or similar before printing. Beware that lxml will wrap the content in an html tag if it isn't one already; it just can't handle fragments that aren't like that, so be ready to remove that tag manually or adjust your CSS to accommodate it.
Try this; it will work:
${c.info['about_me'] | n}
There is also a simple helper function that can be called which will format and sanitize the text correctly, replacing \n with tags (see http://sluggo.scrapping.cc/python/WebHelpers/modules/html/converters.html).
In helpers.py add the following:
from webhelpers.html.converters import textilize
Then in your mako file simply say:
h.textilize(c.info['about_me'], sanitize=True)
The sanitize=True just means that it will make sure any other nasty code is escaped so users can't hack your site, as the default is False. Alternatively, I'd make a simple wrapper function in helpers so that sanitize always defaults to True, i.e.:
from webhelpers.html.converters import textilize as unsafe_textilize

def textilize(value, sanitize=True):
    return unsafe_textilize(value, sanitize)
This way you can just call h.textilize(c.info['about_me']) from your mako file, which, if you work with lots of designers, stops them from going crazy.