Replacing HTML representation to ascii using Python [duplicate] - python

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Decode HTML entities in Python string?
I have parsed some HTML text, but some punctuation marks, such as apostrophes, are replaced by HTML entities for ’. How do I convert them back to '?
P.S: I am using Python/Feedparser
Thanks

The PSF Wiki has some ways of doing it. Here is one way:
import htmllib

def unescape(s):
    # Python 2 only: the htmllib module was removed in Python 3
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()
See http://wiki.python.org/moin/EscapingHtml

This helped me
import HTMLParser
hparser=HTMLParser.HTMLParser()
new_text=hparser.unescape(raw_text)
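For readers on Python 3, where the HTMLParser module no longer exists under that name, the standard-library html.unescape covers the same case. This is only a minimal sketch: the sample string is made up for illustration (real text would come from feedparser), and the final replace step assumes a plain ASCII apostrophe is what is wanted.

```python
import html

# Hypothetical sample text; in practice this would come from the parsed feed.
raw_text = "It&#8217;s a test &amp; more"

# Decode named and numeric character references into Unicode characters.
new_text = html.unescape(raw_text)

# Optional: map the typographic apostrophe (U+2019) to the ASCII one.
ascii_text = new_text.replace("\u2019", "'")

print(new_text)
print(ascii_text)
```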

Related

python decode the words beginning with &#xe such as '&#xe091' and '&#xe3c4' [duplicate]

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 2 years ago.
I am scraping a site and ran into words that show up as '&#xe091' and '&#xe3c4'. I searched the web but found no answer on how to decode them, so I came here to ask: is there any way to decode them?
These are called "HTML entities". If you search for that name, you will find many ways to decode them in Python (see Decode HTML entities in Python string?).
import html
print(html.unescape('&#xe091;'))
P.S. The Unicode code points U+E091 and U+E3C4 are in the Private Use Area of Unicode. They have no meaning unless someone defines one (e.g. a webfont).
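One way to confirm the Private Use Area point is to look up the general category of each decoded character with the standard unicodedata module. A small sketch, assuming the two references from the question:

```python
import html
import unicodedata

# Decode the two numeric character references from the question.
decoded = html.unescape("&#xe091;&#xe3c4;")

# General category "Co" means "Other, private use".
categories = [unicodedata.category(ch) for ch in decoded]

for ch, cat in zip(decoded, categories):
    print(f"U+{ord(ch):04X}", cat)
```

If the category is "Co", the glyph shown on the page depends entirely on the site's font, so decoding alone will not recover readable text.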

Python - How to convert HTML entity to UTF-8 [duplicate]

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 3 years ago.
In Python 2.7, I want to convert strings like
"&euro;", "&#380;"
and similar to UTF-8 strings.
How can I do that?
Python 3:
>>> import html
>>> html.unescape('&copy;')
'©'
>>> html.unescape('&euro;')
'€'
>>> html.unescape('&#380;')
'ż'
html.unescape is in the html module of the standard library (Python 3.4 and later).
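Note that html.unescape returns a str of Unicode characters, not UTF-8 bytes. If actual UTF-8 bytes are needed, an explicit encode step can follow. A minimal Python 3 sketch, with the input string chosen for illustration:

```python
import html

# Decode the references into a str of Unicode characters.
text = html.unescape("&euro; &#380;")

# Encode that str as UTF-8 to get bytes.
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes)
```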

Python convert Hexadecimal Character to Respective Symbols? [duplicate]

This question already has answers here:
How do I url unencode in Python?
(3 answers)
Closed 5 years ago.
I'm trying to find a Python package or sample code that can convert the input "why+don%27t+you+want+to+talk+to+me" to "why+don't+you+want+to+talk+to+me".
That is, it should convert hex codes like %27 to the characters they stand for, such as '. I could hardcode the whole set of hex codes and swap each one for its symbol, but I want a simple and scalable solution.
Thanks for helping
You can use urllib's unquote function.
import urllib.parse
urllib.parse.unquote('why+don%27t+you+want+to+talk+to+me')
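As a quick check of what this returns: unquote leaves the + signs alone, which matches the output asked for in the question. If the + signs should become spaces as well (as in form-encoded data), urllib.parse.unquote_plus does that. A short sketch comparing the two:

```python
from urllib.parse import unquote, unquote_plus

s = 'why+don%27t+you+want+to+talk+to+me'

# unquote only decodes %XX escapes; '+' is left untouched.
kept_plus = unquote(s)

# unquote_plus also turns '+' into a space.
with_spaces = unquote_plus(s)

print(kept_plus)
print(with_spaces)
```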

encoding string that has been decoded with %' to unicode [duplicate]

This question already has answers here:
Transform URL string into normal string in Python (%20 to space etc)
(3 answers)
Url decode UTF-8 in Python
(5 answers)
Decode escaped characters in URL
(5 answers)
Closed 5 years ago.
The HTML POST method encoded my string like this:
Ostrołęka => Ostro%C5%82%C4%99ka
How do I decode it back into readable form in Python?
Sorry for possible duplicate.
EDIT: Solution in 'possible duplicate' doesn't solve above problem
Python 2:
from urllib import unquote
# In Python 2 this returns a byte string; call .decode('utf-8') on it to get unicode
x = unquote('Ostro%C5%82%C4%99ka')
Python 3:
from urllib.parse import unquote
x = unquote('Ostro%C5%82%C4%99ka')
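To see that the Python 3 version really reverses the encoding from the question, the string can be round-tripped through quote and unquote. A small sketch; both functions use UTF-8 by default:

```python
from urllib.parse import quote, unquote

original = 'Ostrołęka'

# Percent-encode the non-ASCII characters as UTF-8 bytes.
encoded = quote(original)

# Decode the %XX escapes back into a str, interpreting the bytes as UTF-8.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```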

Python character encoding for '%C5%9' and similar [duplicate]

This question already has an answer here:
Weird character encoding issue with python / nautilus scripts combo
(1 answer)
Closed 9 years ago.
I am working in Python with strings, but I can't manage to display certain characters properly.
For example, I have this string:
%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine
I have applied several functions to it, to no avail. How can I display the appropriate characters on a web site?
You need two things. First, unescape the URL-encoded data with urllib.unquote; then decode the resulting bytes from whatever charset they are in, which here looks like UTF-8:
>>> import urllib
>>> foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'
>>> print urllib.unquote(foo).decode('utf-8')
#ŞenŞakrakTakipleşelimYine
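The session above is Python 2. On Python 3 the same two steps can be kept explicit with urllib.parse.unquote_to_bytes followed by a decode. A sketch, assuming the data really is UTF-8 as guessed above:

```python
from urllib.parse import unquote_to_bytes

foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'

# Step 1: turn the %XX escapes back into raw bytes.
raw = unquote_to_bytes(foo)

# Step 2: decode those bytes using the assumed charset.
text = raw.decode('utf-8')

print(text)
```

If the charset turns out to be something else, only the argument to decode needs to change.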
