Python: how to convert str to unicode (Persian)? - python

This string is the subject of an email. I get it with imaplib.
The type of this string is "str".
Thank you!
#-*- coding: utf-8 -*-
import imaplib
from email.parser import HeaderParser
conn = imaplib.IMAP4('imap.gmail.com')
conn.login('myuser', 'my_pass')
conn.select()
conn.search(None, 'ALL') # returns a nice list of messages...
data = conn.fetch(1, '(BODY[HEADER])')
header_data = data[1][0][1]
parser = HeaderParser()
msg = parser.parsestr(header_data)
print repr(msg['subject'].decode('utf-8'))
result:
u'=?UTF-8?B?V2VsY29tZSB0byBBdGxhc01haWw=?='

Use the decode_header and make_header functions from the email.header package to process the header, then convert the header object to unicode:
from email.header import make_header, decode_header
header = make_header(decode_header(msg['subject']))
unicode_header = unicode(header)
print repr(unicode_header) # prints: u'Welcome to AtlasMail'
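For readers on Python 3, the same two functions can be used; the only difference is that the header object is converted with str() instead of unicode(). This is a minimal sketch that assumes the raw subject is the encoded word shown in the question (the variable name raw_subject is made up for illustration):

```python
from email.header import decode_header, make_header

# Assumed input: the encoded-word subject from the question.
raw_subject = "=?UTF-8?B?V2VsY29tZSB0byBBdGxhc01haWw=?="

# decode_header splits the value into (bytes, charset) pairs;
# make_header reassembles them into a Header object.
subject = str(make_header(decode_header(raw_subject)))
print(subject)
```

Unlike manual splitting, this also copes with subjects made of several encoded words or a mix of plain and encoded parts.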

The encoding of non-ASCII characters in an e-mail subject is described in RFC 1342.
As you can see there, your UTF-8 bytes are, in this case, base64 encoded.
So, to actually read this, you could do something along these lines:
import base64, quopri
try:
    # An encoded word has the form =?charset?encoding?text?=
    _, encoding, enc_type, subject, _ = msg["subject"].split("?")
except ValueError:
    # Not an encoded word: treat the header as plain UTF-8
    subject = msg["subject"].decode("utf-8")
    enc_type = "N/A"
if enc_type.upper() == "B":
    subject = base64.decodestring(subject).decode(encoding.lower())
elif enc_type.upper() == "Q":
    subject = quopri.decodestring(subject, header=True).decode(encoding.lower())

Related

Python (requests) - incorrect encoding when fetching headers

I am using the requests library (Python 3.9) to get a file name from a URL.[1] For some reason the file name is incorrectly encoded.
I should get "Ogłoszenie_0320.pdf" instead of "OgÅ\x82oszenie_0320.pdf".
My code looks something like this:
import os
import re
from urllib.parse import urlparse
import requests
def getFilenameFromRequest(url : str, headers):
    # Parses from header information
    contentDisposition = headers.get('content-disposition')
    if contentDisposition:
        filename = re.findall('filename=(.+)', contentDisposition)
        print("oooooooooo: " + contentDisposition + " : " + str(filename))
        if len(filename) != 0:
            return filename[0]
    # Parses from url
    parsedUrl = urlparse(url)
    return os.path.basename(parsedUrl.path)
def getFilenameFromUrl(url : str):
    request = requests.head(url)
    headers = request.headers
    return getFilenameFromRequest(url, headers)
getFilenameFromUrl('https://przedszkolekw.bip.gov.pl'+
                   '/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html')
Any idea how to fix it?
I know that for a standard request I can set the encoding directly:
request.encoding = 'utf-8'
But what am I supposed to do in this case?
[1]
https://przedszkolekw.bip.gov.pl/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html
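Only characters from the ASCII-based Latin-1 set should be used as header values [rfc]. Here the UTF-8 bytes of the file name have been read as if they were Latin-1, which produces the same result as the escape round trip below.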
>>> s = "Ogłoszenie_0320.pdf"
>>> s.encode("utf8").decode("unicode-escape")
'OgÅ\x82oszenie_0320.pdf'
To reverse the process you can do
>>> sx = 'OgÅ\x82oszenie_0320.pdf'
>>> sx.encode("latin-1").decode("utf8")
'Ogłoszenie_0320.pdf'
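The reversal above can be wrapped in a small helper that falls back to the original value when the text does not look like mis-decoded UTF-8. This is only a sketch; the function name fix_header_text is hypothetical and not part of requests:

```python
def fix_header_text(value: str) -> str:
    """Re-interpret a header value that was decoded as Latin-1 as UTF-8."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the value contains characters outside Latin-1, or its
        # bytes are not valid UTF-8: leave it unchanged.
        return value

fixed = fix_header_text("OgÅ\x82oszenie_0320.pdf")
print(fixed)
```

It could be applied to the file name extracted from the content-disposition header before returning it.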
(updated after conversation in comments)

Invalid syntax error in python program which tries to figure out an IP

I have no experience whatsoever in coding but wanted to get this code snippet here working:
import re
import sys
import json
import GeoIP
import urllib
import string
import requests
gi = GeoIP.open("GeoLiteCity.dat",GeoIP.GEOIP STANDARD)
r = requests.get('http://lichess.org/stream', stream=True)
buff = ''
pattern = re.compile(sys.argv[1] + '.{30}')
for content in r.iter content():
    if content:
        buff = buff + content
        if len(buff) > 1000:
            result keys = re.findall(pattern, buff)
            for el in result keys:
                result = string.split(el)
                print result[0], result[1], result[2][:-8], gi.record by addr(result[2][:-8])['country name'],
                gi.record by addr(result[2][:-8])['region name'], gi.record by addr(result[2][:-8])['city']
            buff = buff[-30:]
The interpreter tells me there is invalid syntax in line 9, where it says STANDARD.
I looked the code up in order to find out the IP address of a user based on the ID of a game on a chess site called lichess.org. I expect some changes will be necessary, given that this code was posted 7 years ago and lichess has changed certain things since.
The OP of the thread where I found this additionally gave this advice:
usage: getip.py owlc08je
where getip.py is your script name and "owlc08je" is the id of the game. If someone makes a move in this game, their IP, country and city are printed to the console.
However, it does not work.
Thanks in advance.
Edited code with underscores and changes:
import re
import sys
import json
import GeoIP
import urllib
import string
import requests
gi = GeoIP.open("GeoLiteCity.dat",GeoIP.GEOIP_STANDARD)
r = requests.get('http://lichess.org/stream', stream=True)
buff = ''
pattern = re.compile(sys.argv[1] + '.{30}')
for content in r.iter_content():
    if content:
        buff = buff + content
        if len(buff) > 1000:
            result_keys = re.findall(pattern, buff)
            for el in result_keys:
                result = string.split(el)
                print(result[0], result[1], result[2][:-8], gi.record_by_addr(result[2][:-8])['country name'],
                gi.record by addr(result[2][:-8])['region name'], gi.record by addr(result[2][:-8])['city'])
            buff = buff[-30:]
I think you are missing an underscore between GEOIP and STANDARD.
Replacing Line 9 with this should probably solve the issue:
gi = GeoIP.open("GeoLiteCity.dat",GeoIP.GEOIP_STANDARD)
EDIT:
As mentioned in one of the comments, underscores have been left out in other places as well (for example in iter_content and record_by_addr); restoring them should resolve the remaining syntax errors.

MIMEBase: conversion to bytes and back again removes \r in binary data: Documented or just broken behavior in Python 3?

It seems to me like email.Message.as_bytes is broken:
import email
from email.encoders import encode_7or8bit
from email.mime.base import MIMEBase
orig_data = b"Zeilenenden\n<Unix\r\n<DOS\rMac"
msg = MIMEBase('application/octet-stream', "gzip")
msg.set_payload(orig_data)
encode_7or8bit(msg)
print("orig_data = %r" % orig_data)
print("payload = %r" % msg.get_payload(decode=1))
b = msg.as_bytes()
msg2 = email.message_from_bytes(b)
print("payload2 = %r" % msg2.get_payload(decode=1))
The output is
orig_data = b'Zeilenenden\n<Unix\r\n<DOS\rMac'
payload = b'Zeilenenden\n<Unix\r\n<DOS\rMac'
payload2 = b'Zeilenenden\n<Unix\n<DOS\nMac'
Note how the conversion message > bytes > message breaks the binary payload.
This used to work with similar code in Python 2.
Is this a bug or intended and if so where is it documented?
It seems like my use case is more or less invalid.
Obviously, Python 3 uses base64 encoding for binary data and does not support transferring pure bytes.
Anyway, by looking into the source code of the email package, I came up with the following workaround:
import io
import email
import email.generator
from email.encoders import encode_7or8bit
from email.mime.base import MIMEBase
orig_data = b"Zeilenenden\n<Unix\r\n<DOS\rMac"
msg = MIMEBase('application/octet-stream', "gzip")
msg.set_payload(orig_data)
encode_7or8bit(msg)
print("orig_data = %r" % orig_data)
print("payload = %r" % msg.get_payload(decode=1))
class MyBytesGenerator(email.generator.BytesGenerator):
    def _write_lines(self, lines):
        # Write the payload unchanged instead of normalizing line endings
        self.write(lines)
def as_bytes(msg, unixfrom=False, policy=None):
    """Return the entire formatted message as a bytes object.
    Optional 'unixfrom', when true, means include the Unix From_ envelope
    header. 'policy' is passed to the BytesGenerator instance used to
    serialize the message; if not specified the policy associated with
    the message instance is used.
    """
    policy = msg.policy if policy is None else policy
    fp = io.BytesIO()
    g = MyBytesGenerator(fp, mangle_from_=False, policy=policy)
    g.flatten(msg, unixfrom=unixfrom)
    return fp.getvalue()
b = as_bytes(msg)
msg2 = email.message_from_bytes(b)
print("payload2 = %r" % msg2.get_payload(decode=1))
With this workaround, the as_bytes function and email.message_from_bytes allow transferring the message in binary format without base64-overhead.
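Where the base64 overhead is acceptable, the standard route is to let the email package encode the payload, which keeps every byte intact across serialization. The following is a minimal sketch reusing the test data from the question; it swaps encode_7or8bit for encode_base64 and is not the workaround described above:

```python
import email
from email.encoders import encode_base64
from email.mime.base import MIMEBase

orig_data = b"Zeilenenden\n<Unix\r\n<DOS\rMac"

# MIMEBase takes the main type and the subtype as separate arguments.
msg = MIMEBase("application", "octet-stream")
msg.set_payload(orig_data)
encode_base64(msg)  # sets Content-Transfer-Encoding: base64

# Serialize and parse again, then decode the transfer encoding.
msg2 = email.message_from_bytes(msg.as_bytes())
roundtrip = msg2.get_payload(decode=True)
print("payload2 = %r" % roundtrip)
```

Because the serialized body contains only base64 text, line-ending normalization in the generator no longer touches the binary data.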

How to encode text to base64 in python

I am trying to encode a text string to base64.
I tried doing this:
name = "your name"
print('encoding %s in base64 yields = %s\n'%(name,name.encode('base64','strict')))
But this gives me the following error:
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs
How do I go about doing this? (I am using Python 3.4.)
Remember to import base64 and that b64encode takes bytes as an argument.
import base64
b = base64.b64encode(bytes('your string', 'utf-8')) # bytes
base64_str = b.decode('utf-8') # convert bytes to string
It turns out that this is important enough to get its own module...
import base64
base64.b64encode(b'your name') # b'eW91ciBuYW1l'
base64.b64encode('your name'.encode('ascii')) # b'eW91ciBuYW1l'
For py3, base64 encode and decode string:
import base64
def b64e(s):
    return base64.b64encode(s.encode()).decode()
def b64d(s):
    return base64.b64decode(s).decode()
1) This works without imports in Python 2:
>>>
>>> 'Some text'.encode('base64')
'U29tZSB0ZXh0\n'
>>>
>>> 'U29tZSB0ZXh0\n'.decode('base64')
'Some text'
>>>
>>> 'U29tZSB0ZXh0'.decode('base64')
'Some text'
>>>
(although this doesn't work in Python3 )
2) In Python 3 you'd have to import base64 and do base64.b64decode('...')
- will work in Python 2 too.
For compatibility with both py2 and py3:
import six
import base64
def b64encode(source):
    if six.PY3:
        # b64encode needs bytes on Python 3
        source = source.encode('utf-8')
    content = base64.b64encode(source).decode('utf-8')
    return content
It is essential to call decode() to get actual string data, even after calling base64.b64decode on a base64-encoded string, because b64decode always returns bytes.
import base64
conv_bytes = bytes('your string', 'utf-8')
print(conv_bytes) # b'your string'
encoded_str = base64.b64encode(conv_bytes)
print(encoded_str) # b'eW91ciBzdHJpbmc='
print(base64.b64decode(encoded_str)) # b'your string'
print(base64.b64decode(encoded_str).decode()) # your string
Whilst you can of course use the base64 module, you can also use the codecs module (referred to in your error message) for binary encodings (meaning non-standard and non-text encodings).
For example:
import codecs
my_bytes = b"Hello World!"
codecs.encode(my_bytes, "base64")
codecs.encode(my_bytes, "hex")
codecs.encode(my_bytes, "zip")
codecs.encode(my_bytes, "bz2")
This can come in useful for large data as you can chain them to get compressed and json-serializable values:
my_large_bytes = my_bytes * 10000
codecs.decode(
codecs.encode(
codecs.encode(
my_large_bytes,
"zip"
),
"base64"),
"utf8"
)
Refs:
https://docs.python.org/3/library/codecs.html#binary-transforms
https://docs.python.org/3/library/codecs.html#standard-encodings
https://docs.python.org/3/library/codecs.html#text-encodings
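To read such a value back, the transforms are undone in reverse order. This is a small sketch of the round trip; the helper names pack and unpack are made up for illustration:

```python
import codecs

def pack(data: bytes) -> str:
    """Compress, then base64-encode, giving a JSON-friendly string."""
    compressed = codecs.encode(data, "zip")
    return codecs.encode(compressed, "base64").decode("ascii")

def unpack(text: str) -> bytes:
    """Undo pack(): base64-decode first, then decompress."""
    compressed = codecs.decode(text.encode("ascii"), "base64")
    return codecs.decode(compressed, "zip")

payload = b"Hello World!" * 10000
packed = pack(payload)
restored = unpack(packed)
print(len(payload), len(packed), restored == payload)
```

Note that the "base64" codec inserts a newline after every 76 characters of output; the decoder accepts those newlines.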
Use the code below (Python 2):
import base64
# Take input through the terminal.
welcomeInput = raw_input("Enter 1 to convert String to Base64, 2 to convert Base64 to String: ")
if int(welcomeInput) == 1 or int(welcomeInput) == 2:
    # Convert String to Base64.
    if int(welcomeInput) == 1:
        inputString = raw_input("Enter the String to be converted to Base64:")
        base64Value = base64.b64encode(inputString.encode())
        print "Base64 Value = " + base64Value
    # Convert Base64 to String.
    elif int(welcomeInput) == 2:
        inputString = raw_input("Enter the Base64 value to be converted to String:")
        stringValue = base64.b64decode(inputString).decode('utf-8')
        print "String Value = " + stringValue
else:
    print "Please enter a valid value."
Base64 encoding is a process of converting binary data to an ASCII
string format by converting that binary data into a 6-bit character
representation. The Base64 method of encoding is used when binary
data, such as images or video, is transmitted over systems that are
designed to transmit data in a plain-text (ASCII) format.
Follow this link for further details about understanding and working of base64 encoding.
For those who want to implement base64 encoding from scratch for the sake of understanding, here's the code that encodes the string to base64.
encoder.py
#!/usr/bin/env python3.10
class Base64Encoder:
    # base64Encoding maps an integer to its encoded character; since it is a list, the index acts as the key
    base64Encoding:list = None
    # data must be of type str or bytes
    def encode(data)->str:
        if not isinstance(data, str) and not isinstance(data, bytes):
            raise AttributeError(f"Expected {type('')} or {type(b'')} but found {type(data)}")
        if isinstance(data, str):
            data = data.encode("ascii")
        if Base64Encoder.base64Encoding is None:
            # constructing base64Encoding
            Base64Encoder.base64Encoding = list()
            # mapping A-Z
            for key in range(0, 26):
                Base64Encoder.base64Encoding.append(chr(key + 65))
            # mapping a-z
            for key in range(0, 26):
                Base64Encoder.base64Encoding.append(chr(key + 97))
            # mapping 0-9
            for key in range(0, 10):
                Base64Encoder.base64Encoding.append(chr(key + 48))
            # mapping +
            Base64Encoder.base64Encoding.append('+')
            # mapping /
            Base64Encoder.base64Encoding.append('/')
        if len(data) == 0:
            return ""
        length=len(data)
        # number of '=' padding characters needed to reach a multiple of 3 bytes
        bytes_to_append = -(length%3)+(3 if length%3 != 0 else 0)
        binary_list = []
        for s in data:
            ascii_value = s
            binary = f"{ascii_value:08b}"
            for bit in binary:
                binary_list.append(bit)
        length=len(binary_list)
        # zero bits needed to reach a multiple of 6 bits
        bits_to_append = -(length%6) + (6 if length%6 != 0 else 0)
        binary_list.extend([0]*bits_to_append)
        base64 = []
        value = 0
        for index, bit in enumerate(reversed(binary_list)):
            # converting a block of 6 bits to its integer value
            value += ( 2**(index%6) if bit=='1' else 0)
            if (index+1)%6 == 0:
                base64.append(Base64Encoder.base64Encoding[value])
                # resetting value
                value = 0
        # padding if there are fewer bytes, and returning the result
        return ''.join(reversed(base64))+''.join(['=']*bytes_to_append)
testEncoder.py
#!/usr/bin/env python3.10
from encoder import Base64Encoder
if __name__ == "__main__":
    print(Base64Encoder.encode("Hello"))
    print(Base64Encoder.encode("1 2 10 13 -7"))
    print(Base64Encoder.encode("A"))
    with open("image.jpg", "rb") as file_data:
        print(Base64Encoder.encode(file_data.read()))
Output:
$ ./testEncoder.py
SGVsbG8=
MSAyIDEwIDEzIC03
QQ==

Python socket.send encoding

It seems I've run into a problem with the encoding itself when I need to pass on Bing translation results.
def _unicode_urlencode(params):
    if isinstance(params, dict):
        params = params.items()
    return urllib.urlencode([(k, isinstance(v, unicode) and v.encode('utf-8') or v) for k, v in params])
def _run_query(args):
    data = _unicode_urlencode(args)
    sock = urllib.urlopen(api_url + '?' + data)
    result = sock.read()
    if result.startswith(codecs.BOM_UTF8):
        result = result.lstrip(codecs.BOM_UTF8).decode('utf-8')
    elif result.startswith(codecs.BOM_UTF16_LE):
        result = result.lstrip(codecs.BOM_UTF16_LE).decode('utf-16-le')
    elif result.startswith(codecs.BOM_UTF16_BE):
        result = result.lstrip(codecs.BOM_UTF16_BE).decode('utf-16-be')
    return json.loads(result)
def set_app_id(new_app_id):
    global app_id
    app_id = new_app_id
def translate(text, source, target, html=False):
    """
    action=opensearch
    """
    if not app_id:
        raise ValueError("AppId needs to be set by set_app_id")
    query_args = {
        'appId': app_id,
        'text': text,
        'from': source,
        'to': target,
        'contentType': 'text/plain' if not html else 'text/html',
        'category': 'general'
    }
    return _run_query(query_args)
...
text = translate(sys.argv[2], 'en', 'tr')
HOST = '127.0.0.1'
PORT = 894
s = socket.socket()
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.connect((HOST, PORT))
s.send("Bing translation: " + text.encode('utf8') + "\r");
s.close()
As you can see, if the translated text contains Turkish characters, the script fails to send the text to the socket.
Do you have any idea how to fix this?
Regards.
Your problem is entirely unrelated to the socket. text is already a bytestring, and you're trying to encode it. What happens is that Python first tries to convert the bytestring to unicode with the default ASCII codec, so that it can then encode it as UTF-8, and fails because the bytestring contains non-ASCII characters.
You should fix translate to return a unicode object, by using a JSON library that returns unicode objects.
Alternatively, if text is already encoded as UTF-8, you can simply use
s.send("Bing translation: " + text + "\r")
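# -*- coding:utf-8 -*-
text = u"text in your language"
# Build the whole message as unicode, then encode it once
s.send((u"Bing translation: " + text + u"\r").encode('utf8'))
This should work, provided that text is a unicode object and the source file is saved in UTF-8 encoding.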
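On Python 3 the same rule applies: keep text as str inside the program and encode exactly once, right before the bytes go to the socket. A minimal sketch, with a hypothetical helper build_payload and a made-up Turkish sample word:

```python
def build_payload(text: str) -> bytes:
    """Assemble the message as text, then encode it once as UTF-8."""
    message = "Bing translation: " + text + "\r"
    return message.encode("utf-8")

payload = build_payload("Çeviri")
print(payload)
```

The resulting bytes can be passed directly to the sendall() method of a connected socket.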
