How to read unicode file in python [duplicate] - python

This question already has an answer here:
text with unicode escape sequences to unicode in python [duplicate]
(1 answer)
Closed 2 years ago.
I have a tab separated file written as following:
col_name cnt
\u7834\u6653\u5fae\u660e 8
\u9ed8\u8ba4 12
I use pandas.read_excel to read it into Python, and it displays the same escaped text.
How can I read data and derive the following? Thanks!
col_name cnt
破晓微明 8
默认 12
I am using python 3.7.7 and pandas 1.0.4

You need to decode the text with an appropriate decoder. For this case we can use unicode-escape. But to decode the text you have to make bytes out of it first.
col_name = r'\u7834\u6653\u5fae\u660e'
print(bytes(col_name, 'ascii').decode('unicode-escape'))
This will give you 破晓微明.
I don't think this can be done during the call to pandas.read_excel, but I'm no pandas expert. You might have to change the content of the column after reading the file.
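A minimal sketch of that "decode after reading" step, using only the standard library's csv module instead of pandas. The helper name decode_escapes and the inline sample data are assumptions for illustration. The ASCII encoding step will raise an error if a cell already contains non-ASCII characters.

```python
import csv
import io

def decode_escapes(text):
    # Turn literal \uXXXX sequences into the characters they name.
    # Assumes the cell holds only ASCII text plus such escapes.
    return text.encode("ascii").decode("unicode_escape")

# Stand-in for the tab-separated file from the question.
raw = "col_name\tcnt\n\\u7834\\u6653\\u5fae\\u660e\t8\n\\u9ed8\\u8ba4\t12\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
for row in rows:
    row["col_name"] = decode_escapes(row["col_name"])
    print(row["col_name"], row["cnt"])
```

With a pandas DataFrame, the same helper could probably be applied to the column after loading, for example with df["col_name"].map(decode_escapes).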

Related

Python convert Hexadecimal Character to Respective Symbols? [duplicate]

This question already has answers here:
How do I url unencode in Python?
(3 answers)
Closed 5 years ago.
I'm trying to find a python package/sample code that can convert the following input "why+don%27t+you+want+to+talk+to+me" to "why+don't+you+want+to+talk+to+me".
That is, hex codes like %27 should be converted to their symbols, such as '. I could hardcode the whole hex character set and swap each code for its symbol, but I want a simple and scalable solution.
Thanks for helping
You can use urllib's unquote function.
import urllib.parse
urllib.parse.unquote('why+don%27t+you+want+to+talk+to+me')
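As a short illustration of what that call returns: unquote leaves the + signs alone, which matches the output asked for in the question, while unquote_plus from the same module also turns + into spaces. The variable names here are only for the example.

```python
from urllib.parse import unquote, unquote_plus

encoded = 'why+don%27t+you+want+to+talk+to+me'

# unquote only replaces %xx escapes; '+' is kept as-is.
kept_plus = unquote(encoded)
print(kept_plus)

# unquote_plus additionally treats '+' as an encoded space.
with_spaces = unquote_plus(encoded)
print(with_spaces)
```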

How can I match a Pandas Dataframe with invalid characters (accents) to an array? [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 6 years ago.
I have been trying to use a Pandas Dataframe in Python 3 to find a specific id matching a name from a CSV file. The API I am reading from gives me the name António, along with other names, with the accent intact, in a column called "first". I have an array of names that won't necessarily have all of the accents, and I need to match against it. This program seems to work for every name I try except those whose accented characters differ.
import pandas as pd

nameArray = ["Antonio", "Matt", "Mark", "Raul"]
playersUrl = 'https://www.FakeSite.com/players'
playerData = pd.read_csv(playersUrl, names=["PLAYERID", "FIRSTNAME"])

def find_player_id():
    # Return the id of the first player whose first name is in nameArray.
    for first, playerid in zip(playerData["FIRSTNAME"], playerData["PLAYERID"]):
        for testName in nameArray:
            # Exact comparison: "António" does not equal "Antonio".
            if first == testName:
                return playerid
If you want to compare names without diacritics, see the earlier Stack Overflow question on removing accents, which says:
Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.
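For accented Latin names like the ones in the question, a rough sketch of an accent-insensitive match can also be written with the standard library's unicodedata module. This is a swapped-in technique, not Unidecode. The helper strip_accents and the sample api_names list are assumptions for illustration.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "ó" into "o" plus a combining accent; drop the accents.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name_array = ["Antonio", "Matt", "Mark", "Raul"]
api_names = ["António", "Raúl", "Zoë"]  # hypothetical API output

# Keep the API names whose accent-free form appears in name_array.
matches = [name for name in api_names if strip_accents(name) in name_array]
print(matches)
```

This only handles characters that decompose into a base letter plus combining marks, so it is narrower than a full transliteration library.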

Is there a way to convert unicode to the nearest ASCII equivalent? [duplicate]

This question already has answers here:
Convert a Unicode string to a string in Python (containing extra symbols)
(12 answers)
Closed 7 years ago.
To give an example from Turkish: "şğüı" becomes "sgui".
I'm sure each language has its own conversion methods, and sometimes a character might be converted to multiple ASCII characters, like "alpha"/"phi" etc.
I'm wondering whether there is a library or method that achieves this conversion.
What you are asking is called transliteration.
Try the Unidecode library.
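To show what transliteration involves, here is a small standard-library sketch (not Unidecode itself). It strips combining marks after Unicode decomposition and uses a hand-written fallback table for letters that do not decompose, such as the Turkish dotless ı. The FALLBACK table and the to_ascii name are hypothetical and deliberately incomplete. A dedicated library covers far more scripts.

```python
import unicodedata

# Hypothetical, tiny table for letters with no Unicode decomposition.
FALLBACK = str.maketrans({"ı": "i", "ø": "o", "ß": "ss", "đ": "d"})

def to_ascii(text):
    # 1. Replace the letters we have explicit fallbacks for.
    text = text.translate(FALLBACK)
    # 2. Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 3. Drop anything still outside ASCII rather than guessing.
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_ascii("şğüı"))
```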

Python character encoding for '%C5%9' and similar [duplicate]

This question already has an answer here:
Weird character encoding issue with python / nautilus scripts combo
(1 answer)
Closed 9 years ago.
I am working in Python with strings, but I can't manage to display certain characters properly.
For example, I have this string:
%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine
I have applied several functions to it to no avail. How could I display the appropriate characters on a web site?
You need two things. First, unescape the URL-encoded data with urllib.unquote; then decode the resulting bytes from whatever charset they are in, which looks like UTF-8 here (Python 2 syntax):
>>> import urllib
>>> foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'
>>> print urllib.unquote(foo).decode('utf-8')
#ŞenŞakrakTakipleşelimYine
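The snippet above is written for Python 2. A sketch of the same two steps in Python 3, where the function lives in urllib.parse and decodes as UTF-8 by default, might look like this:

```python
from urllib.parse import unquote, unquote_to_bytes

foo = '%23%C5%9Een%C5%9EakrakTakiple%C5%9FelimYine'

# One step: unquote() unescapes and decodes as UTF-8 by default.
text = unquote(foo)
print(text)

# Two explicit steps, useful when the charset is something other than UTF-8.
raw = unquote_to_bytes(foo)
same_text = raw.decode('utf-8')
```

If the data were in another charset, the encoding argument of unquote (or the codec passed to decode) would need to change accordingly.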

Unicode values in strings are escaped when dumping to JSON in Python [duplicate]

This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 7 months ago.
For example:
>>> print(json.dumps('růže'))
"r\u016f\u017ee"
(Of course, in the real program it's not just a single string, and the escapes also appear in the file when using json.dump().) I'd like it to output simply "růže"; how can I do that?
Pass the ensure_ascii=False argument to json.dumps:
>>> print(json.dumps('růže', ensure_ascii=False))
"růže"
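Since the question also mentions json.dump() writing to a file, here is a minimal sketch of that case. The temporary file path and the sample dictionary are assumptions for illustration. Opening the file with an explicit UTF-8 encoding avoids depending on the platform default.

```python
import json
import os
import tempfile

data = {"flower": "růže"}  # sample payload for the example

# Write to a throwaway file; the explicit encoding matters for non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# Read the raw text back to confirm no \u escapes were written.
with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```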
