I have many emails coming in from different sources.
They all have attachments, and many of the attachment names are in Chinese, so these
names are converted to base64 by the senders' email clients.
When I receive these emails, I want to decode the name, but there are other names which are
not base64. How can I tell whether a string is base64 or not, using the Jython programming language?
For example:
First attachment:
------=_NextPart_000_0091_01C940CC.EF5AC860
Content-Type: application/vnd.ms-excel;
name="Copy of Book1.xls"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Copy of Book1.xls"
second attachment:
------=_NextPart_000_0091_01C940CC.EF5AC860
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Please note both "Content-Transfer-Encoding" have base64
The header value tells you this:
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
"=?" introduces an encoded value
"gb2312" denotes the character encoding of the original value
"B" denotes that B-encoding (equal to Base64) was used (the alternative
is "Q", which refers to something close to quoted-printable)
"?" functions as a separator
"uLG..." is the actual value, encoded using the encoding specified before
"?=" ends the encoded value
So splitting on "?" actually gets you this (JSON notation)
["=", "gb2312", "B", "uLGxvmhlbrixsb5nLnhscw==", "="]
In the resulting array, if "B" is at position 2, you have a base64-encoded string at position 3. Once you have decoded it, pay attention to the encoding at position 1; it would probably be best to convert the whole thing to UTF-8 using that information.
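The splitting approach above can be sketched as follows. This is Python 3 rather than Jython 2.x, the helper name decode_encoded_word is made up for illustration, and it assumes the value is a single well-formed encoded-word using B-encoding; anything else is returned unchanged.

```python
import base64

def decode_encoded_word(value):
    """Decode one RFC 2047 B-encoded word; return other input unchanged."""
    parts = value.split("?")
    # A well-formed encoded-word splits into ["=", charset, encoding, payload, "="].
    if len(parts) != 5 or parts[0] != "=" or parts[4] != "=":
        return value
    charset, encoding, payload = parts[1], parts[2], parts[3]
    if encoding.upper() != "B":
        return value  # Q-encoding is not handled in this sketch
    raw = base64.b64decode(payload)
    return raw.decode(charset)

plain = decode_encoded_word("Copy of Book1.xls")
decoded = decode_encoded_word("=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=")
```

Here plain comes back untouched, while decoded is a Unicode string that ends in "g.xls". A header containing several encoded-words, or mixed plain and encoded text, needs a real parser such as email.header.decode_header.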
Please note both Content-Transfer-Encoding have base64
Not relevant in this case, the Content-Transfer-Encoding only applies to the body payload, not to the headers.
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
That's an RFC2047-encoded header atom. The stdlib function to decode it is email.header.decode_header. It still needs a little post-processing to interpret the outcome of that function though:
import email.header

x = '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='
try:
    name = u''.join([
        unicode(b, e or 'ascii') for b, e in email.header.decode_header(x)
    ])
except email.Errors.HeaderParseError:
    pass  # leave name as it was
However...
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
This is simply wrong. What mailer created it? RFC2047 encoding can only happen in atoms, and a quoted-string is not an atom. RFC2047 §5 explicitly denies this:
An 'encoded-word' MUST NOT appear within a 'quoted-string'.
The accepted way to encode parameter headers when long strings or Unicode characters are present is RFC2231, which is a whole new bag of hurt. But you should be using a standard mail-parsing library which will cope with that for you.
So, you could detect the '=?' in filename parameters if you want, and try to decode it via RFC2047. However, the strictly-speaking-correct thing to do is to take the mailer at its word and really call the file =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=!
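A minimal sketch of that detection step, written in Python 3 (the question targets Jython 2.x, so treat it as an illustration only). The function name lenient_filename is hypothetical; it assumes the filename parameter has already been extracted from the header, and it deliberately goes beyond what RFC 2047 permits by decoding inside a quoted-string.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header

def lenient_filename(param):
    """Decode an RFC 2047 encoded-word found in a filename parameter.

    Strictly speaking the mailer should not have put it there, so this
    is a tolerance measure, not standard-conformant behaviour.
    """
    if "=?" not in param:
        return param  # ordinary name, nothing to decode
    try:
        return str(make_header(decode_header(param)))
    except (HeaderParseError, LookupError, UnicodeError):
        return param  # malformed or unknown charset: keep the literal name

first = lenient_filename("Copy of Book1.xls")
second = lenient_filename("=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=")
```

Falling back to the literal value on any decoding problem keeps the strictly correct interpretation as the default and only departs from it when the decode clearly succeeds.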
@gnud, @edg - Unless I misunderstand, he's asking about the filename, not the file content.
@setori - the Content-Transfer-Encoding is telling you how the CONTENT of the file is encoded, not the "filename".
I'm not an expert, but this part here in the filename is telling him about the characters that follow:
=?gb2312?B?
I'm looking for the documentation in the RFCs... Ah! here it is: https://www.rfc-editor.org/rfc/rfc2047
The RFC says:
Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between.
Something else to look at is the code in SharpMimeTools, a MIME parser (in C#) that I use in my bug tracking app, BugTracker.NET
There is a better way than bobince’s method to handle the output of decode_header. I found it here: http://mail.python.org/pipermail/email-sig/2007-March/000332.html
name = unicode(email.header.make_header(email.header.decode_header(x)))
Well, you parse the email header into a dictionary. Then you check whether Content-Transfer-Encoding is set, and whether its value is "base64" or "base-64".
Question: """Also I actually need to know what type of file it is ie .xls or .doc so I do need to decode the filename in order to correctly process the attachment, but as above, seems gb2312 is not supported in jython, know any roundabouts?"""
Data:
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Observations:
(1) The first line indicates Microsoft Excel, so .xls is looking better than .doc
(2)
>>> import base64
>>> base64.b64decode("uLGxvmhlbrixsb5nLnhscw==")
'\xb8\xb1\xb1\xbehen\xb8\xb1\xb1\xbeg.xls'
>>>
(a) The extension appears to be .xls -- no need for a gb2312 codec
(b) If you want a file-system-safe file name, you could use the "-_" variant of base64 OR you could percent-encode it
(c) For what it's worth, the file name is XYhenXYg.xls where X and Y are 2 Chinese characters that together mean "copy" and the remainder are literal ASCII characters.
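Observations (a) and (b) can be sketched in Python 3 as follows. The variable names are illustrative, and the choice of percent-encoding via urllib.parse.quote is one possible reading of "percent-encode it", not something the original answer specifies.

```python
import base64
import os.path
from urllib.parse import quote

raw = base64.b64decode("uLGxvmhlbrixsb5nLnhscw==")

# The extension is plain ASCII, so it can be read without a gb2312 codec.
extension = os.path.splitext(raw)[1]

# Two ASCII-only renderings of the full name.
percent_name = quote(raw)                      # percent-encoding
urlsafe_name = base64.urlsafe_b64encode(raw)   # base64 with the "-_" alphabet

print(extension)      # b'.xls'
print(percent_name)   # %B8%B1%B1%BEhen%B8%B1%B1%BEg.xls
print(urlsafe_name)   # b'uLGxvmhlbrixsb5nLnhscw=='
```

For this particular name the URL-safe base64 form is identical to the original payload, because the payload happens to contain neither "+" nor "/".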
Related
Some things that were trivial in Python 2 get a bit more tedious in Python 3. I am sending a string followed by some hex value:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
This gives an error when sending, and I have read in other posts that the solution is to use sendall and encode:
s.sendall(buffer.encode("UTF-8"))
However, what is sent over the network for the hex value is its UTF-8 encoding:
c3 8a c3 be c3 8a c3 be
instead of the exact bytes I defined. How should I do this without using external libraries and possibly without having to "convert" the data into another structure?
I know this question has been widely asked, but I can't find a satisfying solution
You may think Python 3 is making things more difficult, but the opposite is intended. You are running into its strict separation of text and bytes. In Python 2 there were many ways to confuse byte strings with Unicode strings; that has now been fixed.
First of all, if you need to send binary data, you should choose the appropriate type, which is bytes. In Python 3, it is sufficient to prefix your string literal with a b. This should fix your problem:
buffer = b"ABCD"
buffer += b"\xCA\xFE\xCA\xFE"
s.sendall(buffer)
Of course, a bytes object has no encode method, as it is already binary data. But it has the inverse method, decode.
When you create a str object using quotes with no prefix, Python 3 produces a Unicode string (which required the unicode type or the u prefix in Python 2). This means you need to call the encode method to get binary data.
Instead, use bytes directly to store binary data: no encoding operation occurs and the data stays as you typed it.
The error "can only concatenate str (not "bytes") to str" speaks for itself. Python is complaining that it cannot concatenate str with bytes, as the former requires a further step, namely encoding, to make the + operation meaningful.
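To see why the encode call changed the bytes on the wire, a small Python 3 sketch (variable names are illustrative) compares the two kinds of literal:

```python
text = "\xCA\xFE"     # str: two code points, U+00CA and U+00FE
data = b"\xCA\xFE"    # bytes: exactly the two bytes 0xCA 0xFE

utf8_hex = text.encode("utf-8").hex()   # each code point becomes two bytes
raw_hex = data.hex()                    # stays as typed

print(utf8_hex)  # c38ac3be
print(raw_hex)   # cafe

try:
    "ABCD" + data          # mixing str and bytes is rejected
except TypeError as exc:
    print(exc)
```

The first result matches the c3 8a c3 be sequence seen on the network; the bytes literal is what the question actually wanted to send.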
Based on the information in your question, you might be able to get away with encoding your data as latin-1, because this will not change any byte values:
buffer = "ABCD"
buffer += "\xCA\xFE\xCA\xFE"
payload = buffer.encode("latin-1")
print(payload)
b'ABCD\xca\xfe\xca\xfe'
On the other side, you could just decode from latin-1:
buffer = payload.decode('latin-1')
buffer
'ABCDÊþÊþ'
But you might prefer to keep the text and binary parts of your message as their respective types:
encoded_text = payload[:4]
encoded_text
b'ABCD'
text = encoded_text.decode('latin-1')
print(text)
ABCD
binary_data = payload[4:]
binary_data
b'\xca\xfe\xca\xfe'
If your text contains codepoints which cannot be encoded as latin-1 - '你好,世界' for example - you could follow the same approach, but you would need to encode the text as UTF-8 while encoding the binary data as 'latin-1'; the resulting bytes will need to be split into their text and binary sections and decoded separately.
Finally: encoding string literals like '\xca\xfe\xca\xfe' is poor style in Python 3; it is better to declare them as bytes literals like b'\xca\xfe\xca\xfe'.
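For the mixed case, where text must be UTF-8 and the binary part must stay untouched, the receiver needs some way to know where the text ends. The answer does not say how to do that; the sketch below assumes a 4-byte big-endian length prefix, and the function names pack_message and unpack_message are hypothetical.

```python
import struct

def pack_message(text, binary_data):
    """Frame a message as: 4-byte big-endian text length, UTF-8 text, raw bytes."""
    encoded_text = text.encode("utf-8")
    return struct.pack(">I", len(encoded_text)) + encoded_text + binary_data

def unpack_message(payload):
    """Reverse pack_message, returning (text, binary_data)."""
    (text_length,) = struct.unpack(">I", payload[:4])
    encoded_text = payload[4:4 + text_length]
    return encoded_text.decode("utf-8"), payload[4 + text_length:]

payload = pack_message("你好", b"\xca\xfe\xca\xfe")
text, binary_data = unpack_message(payload)
```

Only the text portion is ever decoded, so the binary bytes arrive exactly as they were sent.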
I'm using a Python 2.x-library email to iterate over some .eml-files, but I have Python 3.x installed.
I extract the filename in the header of each payload (attachment) using .get_filename(). Encoding is not set in the header and thus I believe Python 3.x interprets the returned string as utf-8. The string however looks like this, when it contains special characters, e.g. like "ø":
=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?=
I have tried and failed in numerous ways to convert this string to utf-8: turning it into bytes or not, and decoding and encoding it with latin-1, ISO-8859-1 (which should be the same thing) and utf-8.
I've also tried using:
ast.literal_eval(r"b'=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='")
and decoding that, but it still returns the original string containing the encoded characters.
How does one go about this?
You are handling email, so you can use email handling functions:
Try with https://docs.python.org/3.5/library/email.header.html.
See the last example on that page (it is a very small module):
>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
[(b'p\xf6stal', 'iso-8859-1')]
There is also a version for python 2.7.
So for your case:
import email.header

subj = '=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='
subject, encoder = email.header.decode_header(subj)[0]
print(subject.decode(encoder))
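The snippet above takes only the first part and assumes it is bytes with a known charset. A slightly more defensive Python 3 sketch follows; the helper name decode_mime_words is hypothetical, and falling back to ASCII with replacement characters when no charset is given is an assumption, not documented behaviour of the email package.

```python
from email.header import decode_header

def decode_mime_words(value):
    """Join every part returned by decode_header into one str."""
    pieces = []
    for part, charset in decode_header(value):
        if isinstance(part, bytes):
            # charset is None for undecoded bytes; ASCII is an assumption here
            pieces.append(part.decode(charset or "ascii", errors="replace"))
        else:
            pieces.append(part)  # already a str, nothing to decode
    return "".join(pieces)

filename = decode_mime_words("=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?=")
print(ascii(filename))  # 'Sp\xf8rgeskema.doc'
```

Values without any encoded-word pass through unchanged, so the helper can be applied to every filename returned by get_filename().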
I'm using python 3.3.0 in Windows 8.
requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()
source = source.decode('utf-8')
It works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or some other charset? I may have different website URLs with different charsets.
So, how do I deal with multiple charsets?
Here is what I tried in order to resolve this issue:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
source = source.decode('iso-8859-1')
It gave me an error: TypeError: Type str doesn't support the buffer API
So I'm assuming that it treats b1 as a string, and this is not the correct way! :(
Please don't tell me to change the charset manually in the source code, or ask whether I have read the Python docs!
I have already tried to dig into the Python 3 docs but still had no luck, or I may not be picking the right modules/contents to read!
In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source fails is you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1.decode('iso-8859-1'), it should work because you are now comparing two byte sequences.
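A minimal sketch of the corrected check, with a made-up source value standing in for response.read(). Newer Python 3 releases word the TypeError differently, but the cause is the same.

```python
# Hypothetical page content; in the question this comes from response.read().
source = b"<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>"
marker = b"charset=iso-8859-1"   # keep the marker as bytes; do not decode it

if marker in source:             # bytes searched within bytes: fine
    text = source.decode("iso-8859-1")
else:
    text = source.decode("utf-8")
```

This only fixes the TypeError; it is not a reliable way to detect the encoding, for the reasons given in the rest of this answer.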
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response. (See the rules below.) However, so many websites declare the wrong encoding in the header that we have had to develop other complicated encoding sniffing rules for html. Please read that link so you realize what a difficult problem this is!
I recommend you either:
Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly. (It's also much easier to use.) If conversion to unicode at this layer fails:
Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
Here are the rules for interpreting the charset declared in a content-type header.
With no explicit charset declared:
text/* (e.g., text/html) is in ASCII.
application/* (e.g. application/json, application/xhtml+xml) is utf-8.
With an explicit charset declared:
if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
otherwise use the charset declared.
(Note that the html5 spec willfully violates the w3c specs by looking for UTF8 and UTF16 byte markers in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
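The header rules listed above can be sketched as a small Python 3 function. The name charset_from_content_type is hypothetical, parsing the header with email.message.Message is just one convenient option, and returning None for media types the rules do not mention is an assumption.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Apply the rules listed above to a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()      # lowercased, e.g. "text/html"
    declared = msg.get_content_charset()     # lowercased, or None if absent

    if declared is None:
        if media_type.startswith("text/"):
            return "ascii"
        if media_type.startswith("application/"):
            return "utf-8"
        return None  # the rules above do not cover other types
    if media_type == "text/html" and declared == "iso-8859-1":
        return "cp1252"
    return declared
```

With urllib, the header value would come from response.headers.get("Content-Type"); the result is only a starting point, since the body may still contradict it.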
The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it defines a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.
In such cases chardet can be helpful.
You're checking whether a str is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
File "<pyshell#107>", line 1, in <module>
'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original value of b1. It's not clear why you call .decode on it.
Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow. For your purposes it boils down to the following:
Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
Use the character set supplied by HTTP (via the Content-Type header)
Apply the algorithm described a little later in Prescan a byte-stream to determine its encoding. This is basically searching for "charset=" in the document and extracting the value.
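The three steps above can be sketched roughly as follows in Python 3. This is a heavily simplified stand-in for the prescan in the HTML standard, not an implementation of it; the function name sniff_encoding, the regular expression, and the cp1252 fallback are all assumptions made for illustration.

```python
import codecs
import re

# Simplified stand-in for the HTML prescan: look for charset=... near the top.
_META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)

def sniff_encoding(body, http_charset=None, fallback="cp1252"):
    """Guess the encoding of an HTML byte string using the three steps above."""
    # 1. Byte order marks win over everything else.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Charset from the HTTP Content-Type header, if the server sent one.
    if http_charset:
        return http_charset.lower()
    # 3. Scan the first 1024 bytes for a charset declaration.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback
```

A caller would then do body.decode(sniff_encoding(body, http_charset), errors="replace"), accepting that a wrongly declared charset still produces garbage.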
I am experiencing some difficulty in my home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the danish letters "æøå".
gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw form (i.e. bytes C3A6 for the special character "æ"), it sends what I think are called character hash references (i.e. &#195;&#166;).
I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).
Anyway I guess gSOAP probably is obeying transport rules, or what?
When I parse the request from gSOAP in python with xml.dom.minidom.parseString() I get element values as unicode objects which is fine, but the character hash references are not decoded as UTF-8 character codes. It unescapes the character hash references, but does not decode the string afterwards. In the end I have a unicode string object with UTF-8 encoding:
So if the string "æble" is contained in the XML, it comes like this in the request:
"&#195;&#166;ble"
After parsing the XML the unicode string in the DOM Text Node's data member looks like this:
u'\xc3\xa6ble'
I would expect it to look like this:
u'\xe6ble'
What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?
Thanks in advance.
Best regards Jakob Simon-Gaarde
&#195;&#166;ble is actually Ã¦ble.
To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.
Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html
However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...
Your example character is LATIN SMALL LETTER AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.
However it appears that it is encoding to UTF-8 first and then doing the escaping.
What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.
On the receiving end, please show us the repr() of the raw XML that you receive.
Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""
It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.
(1) unescaping "&lt;" to "<" would blow up
(2) what would you unescape "&#256;" to? "\xc4\x80"?
(3) how could it unescape at all if the encoding was UTF-16xx?
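The parser behaviour can be checked directly. This Python 3 sketch (the thread uses Python 2, and the helper name text_of is made up) parses a double-encoded element and a correctly escaped one:

```python
from xml.dom.minidom import parseString

double_encoded = b'<?xml version="1.0" encoding="UTF-8"?><name>&#195;&#166;ble</name>'
correct = b'<?xml version="1.0" encoding="UTF-8"?><name>&#230;ble</name>'

def text_of(xml_bytes):
    """Return the text content of the document element."""
    return parseString(xml_bytes).documentElement.firstChild.data

garbled = text_of(double_encoded)
expected = text_of(correct)
print(ascii(garbled))   # '\xc3\xa6ble'
print(ascii(expected))  # '\xe6ble'
```

Each numeric reference names a code point, so two references yield two characters; the parser is doing exactly what the XML specification requires.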
Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes:
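import re

def unescape_hash_char(req):
    # Splitting on a pattern with one group puts the captured numbers at odd indexes.
    pat = re.compile(r'&#(\d+);', re.M)
    parts = pat.split(req)
    ret = ''
    for i, p in enumerate(parts):
        if i % 2:
            ret += chr(int(p))  # Python 2: one-byte str, so the value must be < 256
        else:
            ret += p
    return ret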
After doing this I parse the XML and I get the expected result.
Still I would like to know what you think, and if it is a good solution. Also I wrote the function because I couldn't find a function to do the job in the standard python modules, does such a function exist?
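For comparison, here is a Python 3 sketch of the same workaround that operates on bytes and is a little more cautious. The function name collapse_byte_references is hypothetical, and the whole approach assumes the sender is known to escape each UTF-8 byte separately; a correctly escaped reference such as &#230; would be damaged by it.

```python
import re

_NUMERIC_REF = re.compile(rb"&#(\d+);")

def collapse_byte_references(raw_xml):
    """Turn references such as &#195; back into the single byte 0xC3.

    Only values 128-255 are touched, so markup characters like "<" (60)
    and "&" (38) stay escaped and references above 255 are kept as they are.
    """
    def replace(match):
        value = int(match.group(1))
        if 128 <= value <= 255:
            return bytes([value])
        return match.group(0)
    return _NUMERIC_REF.sub(replace, raw_xml)

fixed = collapse_byte_references(b"<name>&#195;&#166;ble &#60; &#256;</name>")
```

The result is a UTF-8 byte string that can be handed to the XML parser, with the escaped "<" still escaped.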
Best regards
Jakob Simon-Gaarde
Note that
In [5]: u'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'
So what we have is the unicode object u'\xc3\xa6', and what we really want is the byte string '\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec:
In [1]: text=u'\xc3\xa6'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6'
In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6'
In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æ
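In Python 3 the same repair can be sketched with the latin-1 codec in place of raw-unicode-escape (a substitution on my part, valid only while every code point in the garbled string is below 256):

```python
garbled = "\xc3\xa6ble"          # what the parser returned

# Latin-1 maps code points 0-255 to the same byte values, recovering the
# original UTF-8 bytes, which can then be decoded properly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(ascii(repaired))   # '\xe6ble'
```

If the string contains any character above U+00FF, the encode step raises UnicodeEncodeError, which is a useful signal that the text was not double-encoded in this way.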
Unless someone can tell me that gSOAP is not producing valid encoded SOAP XML: (see http://pastebin.com/raw.php?i=9NS7vCMB or the codeblock below) I see no other solution than to unescape character hash references before parsing the XML.
Of course, as John Machin has pointed out, I cannot unescape XML control characters like "&lt;" and "&gt;".
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>
/ Jakob
I am trying to form-post a SQL file that consists of many INSERTs, e.g.
INSERT INTO `TABLE` VALUES ('abcdé', 2759);
Then I use re.search to parse it and extract the fields to put into my own datastore. The problem is that, although the file contains accented characters (note that the e is an é), once uploaded it loses them and either raises an error or stores a bytestring representation.
Here's what I am currently using (and I have tried loads of alternatives):
form = cgi.FieldStorage()
uFile = form['sql']
uSql = uFile.file.read()
lineX = uSql.split("\n") # to get each line
and so on.
Has anyone got a robust way of making this work? Remember I am on App Engine, so access to some libraries is restricted/forbidden.
You mention utf8 in the question's title but then never again: what are you doing (in terms of setting headers and checking them) to verify what encoding is in use? There should be headers of the form
Content-Type: text/plain; charset=utf-8
and the charset= part is where the encoding is specified. So what are the values upon sending and receiving this? If charset is erroneous, you may have to manually perform some encoding and decoding. To help us gauge what the encoding seems to be, besides the headers, what's the ord value of that accented-e? E.g., if the encoding was actually iso-8859-1, that ord value would be 233 (in decimal; 0xE9 in hex).
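If the charset turns out to be missing or wrong, the manual decoding step could look something like this Python 3 sketch. The function name decode_upload and the order of the fallbacks (declared charset, then UTF-8, then Latin-1) are assumptions for illustration, not something App Engine or the cgi module prescribes.

```python
def decode_upload(raw, declared_charset=None):
    """Decode uploaded bytes, trying the declared charset, then UTF-8, then Latin-1."""
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # Latin-1 accepts every byte value, so this last step cannot fail.
    return raw.decode("latin-1")

statement = "INSERT INTO `TABLE` VALUES ('abcd\xe9', 2759);"
from_utf8 = decode_upload(statement.encode("utf-8"))
from_latin1 = decode_upload(statement.encode("latin-1"))
```

Both calls return the same text here, because a lone 0xE9 byte followed by a quote is not valid UTF-8 and so triggers the Latin-1 fallback. Checking ord() of the accented character in the result (233 for é) confirms the decode went the right way.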