Python character encoding for '%C5%9' and similar [duplicate] - python

This question already has an answer here:
Weird character encoding issue with python / nautilus scripts combo
(1 answer)
Closed 9 years ago.
I am working in Python with strings, but I can't manage to display certain characters properly.
For example, I have this string:
%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine
I have applied several functions to it, to no avail. How can I display the appropriate characters on a website?

You need two things. First, unescape the URL-encoded data with urllib.unquote; then decode the resulting bytes from whatever charset they are in, which looks like UTF-8 here:
>>> import urllib
>>> foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'
>>> print urllib.unquote(foo).decode('utf-8')
#ŞenŞakrakTakipleşelimYine
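The transcript above is Python 2 (urllib.unquote and the print statement). A minimal sketch of the same two steps in Python 3, assuming the input really is UTF-8 as the answer guesses. urllib.parse.unquote decodes percent-escapes as UTF-8 by default, while unquote_to_bytes keeps the two steps separate:

```python
from urllib.parse import unquote, unquote_to_bytes

foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'

# One step: unquote() percent-decodes and reads the bytes as UTF-8 by default.
text = unquote(foo)

# Two explicit steps, mirroring the answer: raw bytes first, then decode.
raw = unquote_to_bytes(foo)
text_two_step = raw.decode('utf-8')

print(text)
```

If the data turns out to be in another charset, pass encoding=... to unquote or change the argument to decode.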

Related

python decode the words beginning with &#xe such as '&#xe091' and '&#xe3c4' [duplicate]

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 2 years ago.
I am scraping a site and have run into text that shows up as '&#xe091;' and '&#xe3c4;'. I have searched all over the web but found no answer on how to decode it, so I am asking here: is there any way to decode it?
These are called "HTML entities". Searching for that name turns up many ways to parse them in Python. (Decode HTML entities in Python string?)
import html
print(html.unescape('&#xe091;'))
P.S. The Unicode code points U+E091 and U+E3C4 are in the Private Use Area of Unicode; they have no meaning unless someone defines one (e.g. a web font).
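A short sketch of how one might confirm the Private Use point from Python: it unescapes the numeric entity and asks unicodedata for the character's general category. The entity string is taken from the code points named in the answer; the variable names are illustrative only.

```python
import html
import unicodedata

# Turn the numeric character reference into the actual character.
ch = html.unescape('&#xe091;')

# Format the code point the way Unicode charts write it.
code_point = 'U+{:04X}'.format(ord(ch))

# General category 'Co' means "Other, private use".
category = unicodedata.category(ch)

print(code_point, category)
```

A 'Co' result means the glyph comes from the page's own font, so decoding alone will not recover readable text.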

How to read unicode file in python [duplicate]

This question already has an answer here:
text with unicode escape sequences to unicode in python [duplicate]
(1 answer)
Closed 2 years ago.
I have a tab separated file written as following:
col_name cnt
\u7834\u6653\u5fae\u660e 8
\u9ed8\u8ba4 12
I use pandas.read_excel to read it into Python, and it displays the same thing.
How can I read the data and get the following? Thanks!
col_name cnt
破晓微明 8
默认 12
I am using Python 3.7.7 and pandas 1.0.4.
You need to decode the text with an appropriate codec. In this case unicode-escape works, but you first have to turn the text into bytes.
col_name = r'\u7834\u6653\u5fae\u660e'
print(bytes(col_name, 'ascii').decode('unicode-escape'))
This will give you 破晓微明.
I don't think this can be done during the call to pandas.read_excel, but I'm no pandas expert. You might have to change the content of the column after reading the file.
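As a rough illustration of "change the column after reading", here is a stdlib-only sketch that parses tab-separated text and decodes the first column afterwards. The helper name decode_unicode_escapes and the in-memory sample are assumptions for the example; with pandas, the same helper could presumably be applied to the column once the file is loaded.

```python
import csv
import io

def decode_unicode_escapes(text):
    """Turn literal \\uXXXX sequences in an ASCII-only string into characters."""
    return text.encode('ascii').decode('unicode-escape')

# In-memory stand-in for the tab-separated file from the question.
raw = "col_name\tcnt\n\\u7834\\u6653\\u5fae\\u660e\t8\n\\u9ed8\\u8ba4\t12\n"

rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))
header, data = rows[0], rows[1:]

# Decode the name column after reading; leave the count as an integer.
decoded = [(decode_unicode_escapes(name), int(cnt)) for name, cnt in data]

for name, cnt in decoded:
    print(name, cnt)
```

Note that encode('ascii') raises UnicodeEncodeError if a cell already contains non-ASCII text, so mixed columns would need a guard.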

Python convert Hexadecimal Character to Respective Symbols? [duplicate]

This question already has answers here:
How do I url unencode in Python?
(3 answers)
Closed 5 years ago.
I'm trying to find a Python package or sample code that can convert the input "why+don%27t+you+want+to+talk+to+me" to "why+don't+you+want+to+talk+to+me",
converting hex codes like %27 to ' and so on. I could hardcode the whole hex character set and swap each code for its symbol, but I want a simple and scalable solution.
Thanks for helping.
You can use urllib's unquote function.
import urllib.parse
urllib.parse.unquote('why+don%27t+you+want+to+talk+to+me')
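A small sketch of what that call returns, next to its sibling unquote_plus, since the input uses + between words. unquote leaves the + signs alone (which matches the output requested in the question), while unquote_plus also turns them into spaces:

```python
from urllib.parse import unquote, unquote_plus

s = 'why+don%27t+you+want+to+talk+to+me'

# Percent-escapes are decoded; '+' is left untouched.
kept_plus = unquote(s)

# Same decoding, but '+' is also treated as an encoded space.
with_spaces = unquote_plus(s)

print(kept_plus)
print(with_spaces)
```

Which one fits depends on whether the string came from a query string or form body, where + conventionally stands for a space.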

How to match English word with its equivalent Accented characters in Python [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 7 years ago.
I am trying to automate an application that works like the Google search engine. When the user enters "société" (with accented characters), it returns details containing either "société" (accented) or "societe" (unaccented).
My job is to validate that the details contain the given keyword. I don't have a problem comparing accented characters, e.g. "société" == "société", but the case "société" == "societe" fails. Python code below:
>>> "société".find("societe")
-1 #Fail
>>> "société".find("société")
0 #Success
>>>
So how can I match the equivalent English word against accented characters?
Any help will be highly appreciated. Thanks.
You can use unidecode:
>>> from unidecode import unidecode
>>> unidecode(u"société")
'societe'
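unidecode is a third-party package. If installing it is not an option, a standard-library sketch along the lines of the linked duplicate is to decompose the string and drop the combining marks. This is a different technique from unidecode (it only strips accents and does not transliterate), and the helper name strip_accents is made up for the example:

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents("société")

# Normalising both sides lets the accented and plain spellings match.
position = strip_accents("société").find(strip_accents("societe"))

print(plain, position)
```

Characters that have no decomposition, such as "ø" or "ß", pass through unchanged with this approach.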

Replacing HTML representation to ascii using Python [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Decode HTML entities in Python string?
I have parsed some HTML text, but some punctuation marks, such as the apostrophe, are replaced by ’. How can I revert them back to `?
P.S: I am using Python/Feedparser
Thanks
The PSF Wiki has some ways of doing it. Here is one way:
import htmllib
def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()
See http://wiki.python.org/moin/EscapingHtml
This helped me
import HTMLParser
hparser=HTMLParser.HTMLParser()
new_text=hparser.unescape(raw_text)
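Both snippets above target Python 2: htmllib and the HTMLParser module do not exist under those names in Python 3, and as far as I know the unescape method on the parser class was removed in Python 3.9. A minimal Python 3 sketch using html.unescape from the standard library, with a made-up raw_text as sample input:

```python
import html

# Hypothetical feed text with numeric and named entities.
raw_text = 'It&#8217;s here &amp; it&rsquo;s fine'

# html.unescape handles numeric (&#8217;) and named (&rsquo;, &amp;) references.
new_text = html.unescape(raw_text)

print(new_text)
```

The result contains the typographic apostrophe U+2019; replacing it with a plain ASCII ' would be a separate str.replace step.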
