How to handle UTF-8 strings from Python's imaplib

Python's imaplib sometimes returns strings that look like this:
=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=
What is the name for this notation?
How can I decode (or should I say encode?) it to UTF8?

In short:
>>> from email.header import decode_header
>>> msg = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')[0][0].decode('utf-8')
>>> msg
'Repertuar wydarze\u0144 z woj. Dolno\u015bl\u0105skie'
My computer doesn't show the Polish characters, but they should appear on yours (depending on locale, etc.).
Explained:
Use the email.header decoder:
>>> from email.header import decode_header
>>> value = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')
>>> value
[(b'Repertuar wydarze\xc5\x84 z woj. Dolno\xc5\x9bl\xc4\x85skie', 'utf-8')]
This returns a list of (decoded bytes, charset) tuples. Usually there is a single tuple, but a header made of several parts yields more than one pair.
>>> msg, encoding = decode_header('=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=')[0]
>>> msg
b'Repertuar wydarze\xc5\x84 z woj. Dolno\xc5\x9bl\xc4\x85skie'
>>> encoding
'utf-8'
And finally, if you want msg as a regular str, use the bytes decode method:
>>> msg = msg.decode('utf-8')
>>> msg
'Repertuar wydarze\u0144 z woj. Dolno\u015bl\u0105skie'
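As for the name: this notation is the MIME "encoded-word" syntax defined in RFC 2047. A header can mix plain text with several encoded words, so a small helper that joins all the pairs may be handy. The following is only a sketch, assuming Python 3; decode_mime_header is a name made up for this example, while decode_header and make_header come from the standard email.header module.

```python
from email.header import decode_header, make_header

def decode_mime_header(raw_header):
    """Return the header as a str, decoding any RFC 2047 encoded words."""
    # decode_header yields (value, charset) pairs; make_header reassembles
    # them into a Header object, and str() renders it as Unicode text.
    return str(make_header(decode_header(raw_header)))

subject = decode_mime_header(
    '=?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?='
)
print(subject)
```

Headers without any encoded words pass through unchanged, so the helper can be applied to every header value.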

You can also decode the fetched bytes directly. Here is an example:
result, data = imapSession.uid('search', None, "ALL") #search and return uids
latest_email_uid = data[0].split()[-1] #data[] is a list, using split() to separate them by space and getting the latest one by [-1]
result, data = imapSession.uid('fetch', latest_email_uid, '(BODY.PEEK[])')
raw_email = data[0][1].decode("utf-8") #decode the raw message bytes as UTF-8
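Decoding the whole message as UTF-8 leaves encoded headers in their =?utf-8?...?= form, and it raises UnicodeDecodeError if the message contains bytes that are not valid UTF-8. An alternative sketch, assuming a recent Python 3: hand the raw bytes to the standard email parser with policy.default, which decodes headers on access. The sample message and the subject_from_raw name are invented for illustration; in the snippet above, data[0][1] would play the role of raw_bytes.

```python
import email
from email import policy

def subject_from_raw(raw_bytes):
    """Parse a raw RFC 822 message and return its Subject as a str."""
    # policy.default makes header lookups return decoded Unicode text.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    return str(msg['Subject'])

# Hypothetical message standing in for the bytes returned by the IMAP fetch.
sample = (
    b'Subject: =?utf-8?Q?Repertuar_wydarze=C5=84_z_woj._Dolno=C5=9Bl=C4=85skie?=\r\n'
    b'From: sender@example.com\r\n'
    b'\r\n'
    b'Body text\r\n'
)
print(subject_from_raw(sample))
```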

Related

How do you pass a variable argument to base64.b64encode when not using the prompt window?

This question is similar to this one here but if I put this into this code like so:
import base64
theone = input('Enter your plaintext: ')
encoded = str(base64.b64encode(theone))
encoded = base64.b64encode(encoded.encode('ascii'))
encoded = encoded[2:]
o = len(encoded)
o = o-1
encoded = encoded[:o]
print(encoded)
it raises this problem:
line 58, in b64encode
encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
And then if I remove this line of code:
encoded = base64.b64encode(encoded.encode('ascii'))
then it raises the same error. I'm not sure what to do from here and I would be grateful for any help.
You seem to be having problems with bytes and strings. The value returned by input is a string (str), but base64.b64encode expects bytes (bytes).
If you print a bytes instance you see something like
b'spam'
To remove the leading 'b' you need to decode back to a str.
To make your code work, pass bytes to base64.b64encode, and decode the result to print it.
>>> import base64
>>> theone = input('Enter your plaintext: ')
Enter your plaintext: Hello World!
>>> encoded = base64.b64encode(theone.encode())
>>> encoded
b'SGVsbG8gV29ybGQh'
>>> print(encoded.decode())
SGVsbG8gV29ybGQh
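The same idea can be wrapped in a pair of helpers so that the rest of the program only deals with str. This is a minimal sketch, assuming UTF-8 text; the function names are made up for this example.

```python
import base64

def b64_encode_text(text, charset='utf-8'):
    """Encode a str to base64 and return the result as a str."""
    # str -> bytes -> base64 bytes -> ASCII str
    return base64.b64encode(text.encode(charset)).decode('ascii')

def b64_decode_text(encoded, charset='utf-8'):
    """Decode base64 (str or bytes) back to the original str."""
    return base64.b64decode(encoded).decode(charset)

encoded = b64_encode_text('Hello World!')
print(encoded)
print(b64_decode_text(encoded))
```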

Decode base64 in python (for example :: Q29ycsOqYQ== into Corrêa)

I have base64 encoded values like: Q29ycsOqYQ==
and I tried this code to decode it to Corrêa.
import base64
encoded = ': Q29ycsOqYQ=='
data = base64.b64decode(encoded)
print(data)
I get this result: b'Corr\xc3\xaaa'
but the desired result is Corrêa.
ê is not an ASCII character, so the bytes repr shows its UTF-8 encoding as escape sequences. In Python 2.7, printing the data on a UTF-8 terminal would give you what you want.
You're printing the bytes. Turn them into a string:
import base64
encoded = 'Q29ycsOqYQ=='
data = base64.b64decode(encoded)
s = str(data, encoding='utf-8')
print(s)
Output:
Corrêa
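One detail worth knowing: by default b64decode silently skips characters outside the base64 alphabet, which is why the leading ': ' in the question's input did not raise an error. If stray characters should be reported instead, validate=True can be passed. A small sketch, with strict_b64_to_text as a made-up name and UTF-8 assumed as the text encoding:

```python
import base64
import binascii

def strict_b64_to_text(encoded, charset='utf-8'):
    """Decode base64 to str, rejecting non-alphabet characters."""
    # validate=True raises binascii.Error instead of skipping bad characters.
    raw = base64.b64decode(encoded, validate=True)
    return raw.decode(charset)

print(strict_b64_to_text('Q29ycsOqYQ=='))

try:
    strict_b64_to_text(': Q29ycsOqYQ==')
except binascii.Error as exc:
    print('rejected:', exc)
```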

base64decode a string like "b'Mw=='" (containing literal b' substring)

I encoded a comma delimited list (ex. "1,2,3") of ids to base64 then the returned data from the form looks like x below.
I tried decoding and encoding and all sorts of things but nothing seems to return the original string.
x = "b'Mw=='"
base64.b64decode(x)
# b'l\xcc'
x.decode()
# AttributeError: 'str' object has no attribute 'decode'
y = x.encode('utf-8')
print(y)
# b"b'Mw=='"
What am I missing?
If you have b'...' in your data, that's the repr()esentation of a bytestring.
If you can't get your data source to fix their content (it should just be Mw==: what they're giving you isn't valid base64 encoding!), you can use ast.literal_eval() to read it into a bytestring:
>>> import ast, base64
>>> x = "b'Mw=='"
>>> base64.b64decode(ast.literal_eval(x))
b'3'
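If the data source sometimes sends clean base64 and sometimes the repr form, a tolerant helper can cover both cases. The sketch below assumes Python 3 and UTF-8 content; decode_form_value is a hypothetical name. Since the quote character is not part of the base64 alphabet, the prefix check cannot misfire on valid base64.

```python
import ast
import base64

def decode_form_value(value):
    """Decode base64 text that may be wrapped in a bytes repr like "b'...'"."""
    if value.startswith(("b'", 'b"')):
        # literal_eval only parses Python literals, so this is safe to call
        # on untrusted input, unlike eval().
        value = ast.literal_eval(value)
    return base64.b64decode(value).decode('utf-8')

# Reproduce the mistake: str() applied to the bytes returned by b64encode.
mangled = str(base64.b64encode(b'1,2,3'))
print(mangled)
print(decode_form_value(mangled))
print(decode_form_value("b'Mw=='"))
```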

Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Well, let me introduce the problem first.
I received some data via POST/GET requests. The data was a UTF-8 encoded byte string. I didn't realize that and converted it with str(). Now I have a database full of "nonsense data" and can't find a way back.
Example code:
unicode_str - this is the string I should obtain
encoded_str - this is the string I got with POST/GET requests - initial data
bad_str - the data I have in the Database at the moment and I need to get unicode from.
So apparently I know how to convert:
unicode_str =(encode)=> encoded_str =(str)=> bad_str
But I couldn't come up with a way back:
bad_str =(???)=> encoded_str =(decode)=> unicode_str
In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'
In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'
In [3]: encoded_str = unicode_str.encode("UTF-8")
In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'
In [5]: bad_str = str(encoded_str)
In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"
In [7]: new_encoded_str = some_magical_function_here(bad_str) ???
You turned a bytes object to a string, which is just a representation of the bytes object. You can obtain the original bytes object by using ast.literal_eval() (credits to Mark Tolonen for the suggestion), then a simple decode() will do the job.
>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'
Since you were the one who generated the strings, using eval() would be safe, but why not be safer?
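For a whole database of such values, the conversion can be applied defensively so that rows which are already fine, or which cannot be parsed, are left alone. This is only a sketch under the assumption that every damaged value is the str() of UTF-8 bytes; repair_value and the sample rows are invented for illustration.

```python
import ast

def repair_value(stored):
    """Undo str(bytes) on a stored value; return other values unchanged."""
    if not stored.startswith(("b'", 'b"')):
        return stored
    try:
        recovered = ast.literal_eval(stored)
        if isinstance(recovered, bytes):
            return recovered.decode('utf-8')
    except (ValueError, SyntaxError, UnicodeDecodeError):
        # Not a valid bytes literal, or not valid UTF-8: keep the original.
        pass
    return stored

original = 'Příliš žluťoučký kůň úpěl ďábelské ódy'
bad_str = str(original.encode('utf-8'))
rows = [bad_str, 'already fine', "b'unterminated"]
print([repair_value(row) for row in rows])
```

A genuine text value that happens to look exactly like a bytes literal would also be converted, so it is worth checking a sample of the results before writing them back.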
Please do not use eval, instead:
import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))
# strip quotes
x = x[2:-1]
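# unescape (codecs.escape_decode is an undocumented CPython helper)
x = codecs.escape_decode(x)[0].decode('utf-8')
# check that the round trip restored the original
assert x == s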

Python Convert Unicode-Hex utf-8 strings to Unicode strings

Have s = u'Gaga\xe2\x80\x99s' but need to convert to t = u'Gaga\u2019s'
How can this be best achieved?
s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x==t
print(x)
yields
Gaga’s
Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 maps directly onto the first 256 code points of Unicode, this works:
>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
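In Python 3 the same round trip can be wrapped in a helper that leaves text alone when the repair does not apply. A hedged sketch: fix_mojibake is a made-up name, and it assumes the wrong decoding was latin-1. If the text went through cp1252 instead, the bytes 0x80 and 0x99 would have become € and ™, and 'cp1252' would be the encoding to try.

```python
def fix_mojibake(text):
    """Re-interpret latin-1-decoded text as UTF-8 when that is possible."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1, or the bytes are
        # not valid UTF-8; in both cases assume the text was already correct.
        return text

print(fix_mojibake('Gaga\xe2\x80\x99s'))
```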
import codecs
s = u"Gaga\xe2\x80\x99s"
s_as_str = codecs.charmap_encode(s)[0]
t = unicode(s_as_str, "utf-8")
print repr(t)  # Python 2 syntax
prints
u'Gaga\u2019s'
