Python Convert Unicode-Hex utf-8 strings to Unicode strings - python

I have s = u'Gaga\xe2\x80\x99s' but need to convert it to t = u'Gaga\u2019s'.
How can this best be achieved?

s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x==t
print(x)
yields
Gaga’s

Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 is the first 256 codepoints of Unicode, this works:
>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
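The same round trip can be sketched for Python 3, where every str is already Unicode. This is a minimal sketch; fix_mojibake is a hypothetical helper name, and it assumes the text really is UTF-8 that was wrongly decoded as latin-1 (anything else raises an error).

```python
def fix_mojibake(text):
    # Hypothetical helper: latin-1 maps code points 0-255 to the same byte
    # values, so encoding recovers the original bytes; decoding those bytes
    # as UTF-8 then yields the intended text.
    return text.encode("latin-1").decode("utf-8")

repaired = fix_mojibake("Gaga\xe2\x80\x99s")
print(repaired == "Gaga\u2019s")  # True
```

If the input contains characters above U+00FF, the encode step raises UnicodeEncodeError, which is a useful signal that the string was not damaged in this particular way.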

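# Python 2 only: unicode() and the print statement do not exist in Python 3
import codecs
s = u"Gaga\xe2\x80\x99s"
# with no mapping given, charmap_encode encodes as latin-1 and returns (bytes, length)
s_as_str = codecs.charmap_encode(s)[0]
t = unicode(s_as_str, "utf-8")
print repr(t)
prints
u'Gaga\u2019s'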

Related

Decode base64 in python (for example :: Q29ycsOqYQ== into Corrêa)

I have base64-encoded values like: Q29ycsOqYQ==
and I tried this code to decode it to Corrêa.
import base64
encoded = ': Q29ycsOqYQ=='
data = base64.b64decode(encoded)
print(data)
I get this result: b'Corr\xc3\xaaa'
but the desired result is Corrêa.
ê is not an ASCII character. In Python 2.7, print writes the raw bytes to the terminal, so a UTF-8 terminal displays the text you want.
In Python 3 you are printing a bytes object. Decode it into a string:
import base64
encoded = 'Q29ycsOqYQ=='
data = base64.b64decode(encoded)
s = str(data, encoding='utf-8')
print(s)
Output:
Corrêa
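The two steps (base64 to bytes, then bytes to text) can be wrapped in one function. This is a sketch only; b64_to_text is a hypothetical name, and UTF-8 is an assumption about how the original text was encoded before it was base64-encoded.

```python
import base64

def b64_to_text(encoded, charset="utf-8"):
    # Step 1: base64 text -> raw bytes
    raw = base64.b64decode(encoded)
    # Step 2: raw bytes -> str, using the assumed character encoding
    return raw.decode(charset)

text = b64_to_text("Q29ycsOqYQ==")
print(text == "Corr\u00eaa")  # True
```

bytes.decode does the same job as str(data, encoding='utf-8') and is the more common spelling.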

How to get UTF-16 (decimal) in Python?

I have Unicode Code Point of an emoticon represented as U+1F498:
emoticon = u'\U0001f498'
I would like to get utf-16 decimal groups of this character, which according to this website are 55357 and 56472.
I tried print emoticon.encode("utf16"), but that did not help because it prints other characters.
Decoding from UTF-8 before encoding to UTF-16, as in print str(int("0001F498", 16)).decode("utf-8").encode("utf16"), does not help either.
How do I correctly get the utf-16 decimal groups of a unicode character?
You can encode the character with the utf-16 encoding, and then convert every 2 bytes of the encoded data to integers with int.from_bytes (or struct.unpack in python 2).
Python 3
def utf16_decimals(char, chunk_size=2):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')
    # convert every `chunk_size` bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), chunk_size):
        chunk = encoded_char[i:i+chunk_size]
        decimals.append(int.from_bytes(chunk, 'big'))
    return decimals
Python 2 + Python 3
import struct
def utf16_decimals(char):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')
    # convert every 2 bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), 2):
        chunk = encoded_char[i:i+2]
        decimals.append(struct.unpack('>H', chunk)[0])
    return decimals
Result:
>>> utf16_decimals(u'\U0001f498')
[55357, 56472]
In a Python 2 "narrow" build, it is as simple as:
>>> emoticon = u'\U0001f498'
>>> map(ord,emoticon)
[55357, 56472]
This works in Python 2 (narrow and wide builds) and Python 3:
from __future__ import print_function
import struct
emoticon = u'\U0001f498'
print(struct.unpack('<2H',emoticon.encode('utf-16le')))
Output:
(55357, 56472)
This is a more general solution that returns the UTF-16 code units (16-bit words) for a string of any length:
from __future__ import print_function,division
import struct
def utf16words(s):
    encoded = s.encode('utf-16le')
    num_words = len(encoded) // 2
    return struct.unpack('<{}H'.format(num_words),encoded)
emoticon = u'ABC\U0001f498'
print(utf16words(emoticon))
Output:
(65, 66, 67, 55357, 56472)
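The two values can also be computed directly from the code point, without going through an encoder. The sketch below applies the standard UTF-16 surrogate-pair arithmetic; utf16_code_units is a hypothetical name, and the function does not validate its input (for example, it does not reject code points above U+10FFFF).

```python
def utf16_code_units(code_point):
    # Code points below U+10000 fit into a single 16-bit unit.
    if code_point < 0x10000:
        return [code_point]
    # Otherwise subtract 0x10000 and split the remaining 20 bits:
    # the top 10 bits go into the high surrogate, the low 10 bits
    # into the low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

print(utf16_code_units(0x1F498))  # [55357, 56472]
```

For U+1F498 the offset is 0xF498, whose top ten bits are 61 and low ten bits are 152, which gives 55296 + 61 = 55357 and 56320 + 152 = 56472.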

How to replace accented characters?

My output looks like 'àéêöhello!'. I need to change it to 'aeeohello', replacing each accented character such as à with its unaccented form (a).
Use the code below:
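import unicodedata
def strip_accents(text):
    try:
        # Python 2: decode the UTF-8 byte string to unicode
        text = unicode(text, 'utf-8')
    except NameError: # unicode is a default on python 3
        pass
    # NFD splits each accented letter into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the marks
    text = unicodedata.normalize('NFD', text)\
        .encode('ascii', 'ignore')\
        .decode("utf-8")
    return str(text)
s = strip_accents('àéêöhello')
print(s)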
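Alternatively, use the third-party unidecode package (Python 2 syntax):
import unidecode
somestring = "àéêöhello"
# decode the UTF-8 byte string to a unicode object
u = unicode(somestring, "utf-8")
# transliterate the Unicode text to plain ASCII
print unidecode.unidecode(u)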
Output:
aeeohello
Alpesh Valaki's answer is the "nicest", but I had to do some adjustments for it to work:
# I changed the import
from unidecode import unidecode
somestring = "àéêöhello"
# In Python 3 a str is already Unicode, so no separate decoding step is needed
# (the unicode(somestring, "utf-8") call is dropped)
print(unidecode(somestring))
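A standard-library variant for Python 3 removes only the combining marks instead of discarding everything non-ASCII. This is a sketch under that design choice; strip_combining_marks is a hypothetical name. Unlike the encode('ascii', 'ignore') approach, it leaves letters that have no decomposition (such as ø) in place rather than deleting them.

```python
import unicodedata

def strip_combining_marks(text):
    # NFD separates each accented letter into a base letter followed by
    # one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_combining_marks("àéêöhello"))  # aeeohello
```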

How to handle UTF8 string from Pythons imaplib

Python imaplib sometimes returns strings that look like this:
=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=
What is the name for this notation?
How can I decode (or should I say encode?) it to UTF8?
In short:
>>> from email.header import decode_header
>>> msg = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')[0][0].decode('utf-8')
>>> msg
'Repertuar wydarze\u0144 z woj. Dolno\u015bl\u0105skie'
My computer doesn't show the Polish characters, but they should appear on yours (depending on locale settings etc.).
Explained:
Use the email.header decoder:
>>> from email.header import decode_header
>>> value = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')
>>> value
[(b'Repertuar wydarze\xc5\x84 z woj. Dolno\xc5\x9bl\xc4\x85skie', 'utf-8')]
That will return a list with the decoded header, usually containing one tuple with the decoded message and the encoding detected (sometimes more than one pair).
>>> msg, encoding = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')[0]
>>> msg
b'Repertuar wydarze\xc5\x84 z woj. Dolno\xc5\x9bl\xc4\x85skie'
>>> encoding
'utf-8'
And finally, if you want msg as a regular (Unicode) string, use the bytes decode method:
>>> msg = msg.decode('utf-8')
>>> msg
'Repertuar wydarze\u0144 z woj. Dolno\u015bl\u0105skie'
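This notation is the MIME "encoded-word" syntax defined in RFC 2047. Since decode_header can return several (payload, charset) pairs, and returns a plain str when the header contains no encoded words, a small wrapper that handles those cases might look like the sketch below. decode_mime_header is a hypothetical name, and falling back to ASCII with replacement characters for parts without a declared charset is an assumption.

```python
from email.header import decode_header

def decode_mime_header(raw, default_charset="ascii"):
    parts = []
    for payload, charset in decode_header(raw):
        # Encoded words come back as bytes plus their declared charset;
        # a header without encoded words comes back as a str with charset None.
        if isinstance(payload, bytes):
            payload = payload.decode(charset or default_charset, errors="replace")
        parts.append(payload)
    return "".join(parts)

subject = decode_mime_header(
    "=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?="
)
print(subject == "Repertuar wydarze\u0144 z woj. Dolno\u015bl\u0105skie")  # True
```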
You can use the bytes decoder directly instead. Here is an example (imapSession is assumed to be an already authenticated imaplib connection with a mailbox selected):
result, data = imapSession.uid('search', None, "ALL") #search and return uids
latest_email_uid = data[0].split()[-1] #data[] is a list, using split() to separate them by space and getting the latest one by [-1]
result, data = imapSession.uid('fetch', latest_email_uid, '(BODY.PEEK[])')
raw_email = data[0][1].decode("utf-8") #using utf-8 decoder

Python3 print in hex representation

I can find lots of threads that tell me how to convert values to and from hex. I do not want to convert anything. Rather, I want to print the bytes I already have in hex representation, e.g.
byteval = '\x60'.encode('ASCII')
print(byteval) # b'\x60'
Instead when I do this I get:
byteval = '\x60'.encode('ASCII')
print(byteval) # b'`'
Because ` is the ASCII character that my byte corresponds to.
To clarify: type(byteval) is bytes, not string.
>>> print("b'" + ''.join('\\x{:02x}'.format(x) for x in byteval) + "'")
b'\x60'
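The one-liner above can be wrapped in a small function. This is a sketch for Python 3, where iterating over a bytes object yields integers; hex_escaped is a hypothetical name. The built-in bytes.hex() method (Python 3.5+) is shown as well, for cases where the plain hex digits are enough.

```python
def hex_escaped(data):
    # Format every byte as a two-digit \xNN escape, regardless of whether
    # it corresponds to a printable ASCII character.
    return "b'" + "".join("\\x{:02x}".format(b) for b in data) + "'"

byteval = "\x60".encode("ascii")
print(hex_escaped(byteval))  # b'\x60'
print(byteval.hex())         # 60
```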
See this:
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
And
print(hexify("abcde"))
# ['0x61', '0x62', '0x63', '0x64', '0x65']
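Another example. Note that in Python 3, str(byteval) is the repr b'`', so the result contains the codes for b and the two quote characters as well as the byte 0x60 itself:
byteval='\x60'.encode('ASCII')
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
print(hexify(byteval))
# ['0x62', '0x27', '0x60', '0x27']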
Taken from https://helloacm.com/one-line-python-lambda-function-to-hexify-a-string-data-converting-ascii-code-to-hexadecimal/
