Python minidom and UTF-8 encoded XML with hash references

I am experiencing some difficulty in my home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".
gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw format (i.e. the bytes C3 A6 for the character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).
I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).
Anyway, I guess gSOAP is probably obeying some transport rule, or what?
When I parse the request from gSOAP in Python with xml.dom.minidom.parseString() I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8. The parser unescapes the references but does not decode the resulting byte values afterwards, so I end up with a unicode string object that still carries the UTF-8 byte sequence:
So if the string "æble" is contained in the XML, it arrives like this in the request:
"&#195;&#166;ble"
After parsing the XML the unicode string in the DOM Text Node's data member looks like this:
u'\xc3\xa6ble'
I would expect it to look like this:
u'\xe6ble'
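For comparison, a reference that encodes the code point itself parses as expected (a quick interactive check, Python 2):
>>> from xml.dom import minidom
>>> minidom.parseString('<a>&#230;ble</a>').documentElement.firstChild.data
u'\xe6ble'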
What am I doing wrong? Should I unescape the SOAP XML before parsing it, or should I be looking somewhere else for the solution, maybe in gSOAP?
Thanks in advance.
Best regards Jakob Simon-Gaarde

&#195;&#166;ble is actually Ã¦ble.
To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.

Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html
However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...
Your example character is LATIN SMALL LETTER AE (U+00E6). As you say, encoded in UTF-8 this is \xc3\xa6. 0xc3 == 195, 0xa6 == 166, and 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.
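A quick interactive check of those numbers (Python 2):
>>> u'\xe6'.encode('utf-8')
'\xc3\xa6'
>>> ord(u'\xe6'), 0xc3, 0xa6
(230, 195, 166)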
However, it appears that gSOAP is encoding to UTF-8 first and then escaping the individual bytes.
What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.
On the receiving end, please show us the repr() of the raw XML that you receive.
Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""
It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.
(1) unescaping "&lt;" to "<" would blow up
(2) what would you unescape "&#256;" to? "\xc4\x80"? (see the demonstration after this list)
(3) how could it unescape at all if the encoding was UTF-16xx?
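Point (2) is easy to demonstrate in Python 2, where chr() only covers single byte values:
>>> chr(230)
'\xe6'
>>> chr(256)
Traceback (most recent call last):
  ...
ValueError: chr() arg not in range(256)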

Some more detail about my problem. The project I am creating uses WSGI. The SOAP request is extracted using environ['wsgi.input'].read(), which always seems to return a raw byte string. I created a function that unescapes the character hashes:
import re

def unescape_hash_char(req):
    # re.split with a capturing group alternates literal text and the
    # captured digits, so odd indices hold the digits of each reference
    pat = re.compile(r'&#(\d+);', re.M)
    parts = pat.split(req)
    ret = ''
    for a, p in enumerate(parts):
        if a % 2:
            ret += chr(int(p))  # only works for code points <= 255
        else:
            ret += p
    return ret
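The same transformation can be written as a single re.sub with a replacement function, along the lines of the effbot recipe linked above (a sketch with the same below-256 limitation):
import re

def unescape_hash_char(req):
    # replace each &#NNN; with the byte chr(NNN)
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), req)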
After doing this I parse the XML and I get the expected result.
Still, I would like to know what you think, and whether this is a good solution. Also, I wrote the function because I couldn't find one that does the job in the standard Python modules; does such a function exist?
Best regards
Jakob Simon-Gaarde

Note that
In [5]: u'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'
So what we have is the unicode object u'\xc3\xa6ble' and what we really want is the byte string '\xc3\xa6ble'. This transformation can be performed with the raw-unicode-escape codec:
In [1]: text = u'\xc3\xa6ble'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6ble'
In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6ble'
In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æble

Unless someone can tell me that gSOAP is not producing validly encoded SOAP XML (see http://pastebin.com/raw.php?i=9NS7vCMB or the code block below), I see no other solution than to unescape the character hash references before parsing the XML.
Of course, as John Machin has pointed out, I cannot unescape the predefined XML entities like "&lt;" and "&gt;".
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>
/ Jakob

Related

Python - HTML to Unicode

I have a Python script where I am getting some HTML and parsing it using BeautifulSoup. In the HTML sometimes there are non-unicode characters, and they cause errors with my script and the file I am creating.
Here is how I am getting the HTML
html = urllib2.urlopen(url).read().replace('&nbsp;', "")
xml = etree.HTML(html)
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get an error UnicodeDecodeError
How could I change this into unicode, so that if there are non-unicode characters my code won't break?
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get an error UnicodeDecodeError. How could I change this into unicode.
unicode characters -> bytes = ‘encode’
bytes -> unicode characters = ‘decode’
You have bytes and you want unicode characters, so the method for that is decode. Because you used encode, Python thinks you want to go from characters to bytes, so it first tries to decode your bytes to characters so that they can be encoded back to bytes. It uses the default encoding for this implicit decode, which in your case is ASCII, so it fails on non-ASCII bytes.
However, it is unclear why you want to do this. etree parses bytes as-is. If you want to remove the character U+00A0 NO-BREAK SPACE from your data, you should do that on the extracted content you get after HTML parsing, rather than grapple with the HTML source. The markup might contain U+00A0 as raw bytes, as unterminated entity references, as numeric character references and so on. Let the HTML parser handle all of that for you; it's what it's good at.
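A minimal sketch of that approach, using the question's urllib2/etree setup (the .//title path is just an illustration):
import urllib2
from lxml import etree

html = urllib2.urlopen(url).read()      # raw bytes
tree = etree.HTML(html)                 # the parser resolves entities itself
title = tree.findtext('.//title') or u''
title = title.replace(u'\xa0', u' ')    # clean NBSP in the extracted text, not in the source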
If you feed HTML to BeautifulSoup, it will decode it to Unicode.
If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; there is a special class that comes with BeautifulSoup, UnicodeDammit, which might help you with such documents.
If you mention BeautifulSoup, why don't you do it like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url).read())
and work with the soup?
BTW, all HTML entities will be resolved to unicode characters.
The ascii character set is very limited and might lack many characters in your document. I'd use utf-8 instead whenever possible.

Python - How to get accented characters correct? (BeautifulSoup)

I've written some Python code with BeautifulSoup to fetch HTML, but I'm not getting the accented characters to come out correctly.
The charset of the HTML is this
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
I've this python code:
some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')
And I get this:
Calções
What am I doing wrong here? Any clues?
Best Regards,
The question here is about where you "get this".
If that's the output received in your terminal, it might as well be possible that your terminal expects a different encoding!
You can try this when using print:
import sys
outenc = sys.stdout.encoding or sys.getfilesystemencoding()
print t.decode("iso-8859-1").encode(outenc)
As bernie points out, BS uses Unicode internally.
For BS3:
Beautiful Soup Gives You Unicode, Dammit
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
For BS4, the docs explain a bit more clearly when this happens:
You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…
In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.
The input to the BeautifulSoup constructor can take 8-bit byte strings or files, and try to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing out soup.original_encoding. If it didn't guess ISO-8859-1 or a synonym, your only option is to make it explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str there is no encoding). So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.
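A minimal sketch of that workflow (BS4; from_encoding is how you make the charset explicit, and the span lookup mirrors the question):
import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data, from_encoding='iso-8859-1')
print soup.original_encoding              # check what BS actually used
some_text = soup.find("span", {"class": "h1_span"}).get_text()
# some_text is already unicode; store it as-is, sqlite3 is happy with unicode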

dealing with multiple charset in python 3

I'm using python 3.3.0 in Windows 8.
requrl = urllib.request.Request(url)
response = urllib.request.urlopen(requrl)
source = response.read()
source = source.decode('utf-8')
It works fine if the website uses the utf-8 charset, but what if it uses iso-8859-1 or any other charset? I may have different website URLs with different charsets.
So, how do I deal with multiple charsets?
Now let me tell you my efforts when I tried to resolve this issue like:
b1 = b'charset=iso-8859-1'
b1 = b1.decode('iso-8859-1')
if b1 in source:
    source = source.decode('iso-8859-1')
It gave me an error like TypeError: Type str doesn't support the buffer API
So, I'm assuming that it's considering b1 as string! and this is not the correct way! :(
Please don't tell me to manually change the charset in the source code, or ask whether I have read the Python docs!
I have already tried to dig into the Python 3 docs but still have no luck, or maybe I'm not picking the correct modules/contents to read!
In Python 3, a str is actually a sequence of unicode characters (equivalent to u'mystring' syntax in Python 2). What you get back from response.read() is a byte string (a sequence of bytes).
The reason your b1 in source fails is you are trying to find a unicode character sequence inside a byte string. This makes no sense, so it fails. If you take out the line b1.decode('iso-8859-1'), it should work because you are now comparing two byte sequences.
Now back to your real underlying issue. To support multiple charsets, you need to determine the character set so you can decode it to a Unicode string. This is tricky to do. Normally you can examine the Content-Type header of the response (see the rules below). However, so many websites declare the wrong encoding in the header that other, complicated encoding-sniffing rules had to be developed for HTML. Please read that link so you realize what a difficult problem this is!
I recommend you either:
Use the requests library instead of urllib, because it automatically takes care of most unicode conversions properly (it's also much easier to use; see the sketch after this list). If conversion to unicode at this layer fails:
Try to pass the bytes directly to an underlying library you are using (e.g. lxml or html5lib) and let them deal with determining the encoding. They often implement the right charset-sniffing algorithms for the document type.
If neither of these work, you can get more aggressive and use libraries like chardet to detect the encoding, but in my experience people who serve their web pages this incorrectly are so incompetent that they produce mixed-encoding documents, so you will end up with garbage characters no matter what you do!
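A minimal sketch of the first option (requests; apparent_encoding is its chardet-based guess):
import requests

r = requests.get(url)
print(r.encoding)      # charset taken from the Content-Type header (or a default)
text = r.text          # the body decoded with that charset
# r.apparent_encoding gives a chardet-based guess if the header lies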
Here are the rules for interpreting the charset declared in a content-type header.
With no explicit charset declared:
text/* (e.g., text/html) is in ASCII.
application/* (e.g. application/json, application/xhtml+xml) is utf-8.
With an explicit charset declared:
if type is text/html and charset is iso-8859-1, it's actually win-1252 (==CP1252)
otherwise use the charset declared.
(Note that the HTML5 spec willfully violates the W3C specs by looking for UTF-8 and UTF-16 byte-order marks in preference to the Content-Type header. Please read that encoding detection algorithm link and see why we can't have nice things...)
The big problem here is that in many cases you can't be sure about the encoding of a webpage, even if it declares a charset. I've seen enough pages declaring one charset but actually being in another, or having a different charset in their Content-Type header than in their meta tag or XML declaration.
In such cases chardet can be helpful.
You're checking whether a str is contained within a bytes object:
>>> 'df' in b'df'
Traceback (most recent call last):
File "<pyshell#107>", line 1, in <module>
'df' in b'df'
TypeError: Type str doesn't support the buffer API
So, yes, it considers b1 a str, because you've decoded the bytes object into a str object with a certain encoding. Instead, you should check against the original value of b1. It's not clear why you call .decode on it at all.
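A minimal fix in terms of the question's own variables, comparing bytes with bytes:
source = response.read()                  # bytes
if b'charset=iso-8859-1' in source:       # bytes on both sides, so this works
    source = source.decode('iso-8859-1')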
Have a look at the HTML standard, Parsing HTML documents, Determine character set (HTML5 is sufficient for our purposes).
There is an algorithm to follow. For your purposes it boils down to the following:
Check for identifying sequences for UTF-16 or UTF-8 (see provided link)
Use the character set supplied by HTTP (via the Content-Type header)
Apply the algorithm described a little later in Prescan a byte-stream to determine its encoding. This is basically searching for "charset=" in the document and extracting the value.
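A crude approximation of that prescan step (a sketch only; the real algorithm in the spec handles more cases):
import re

m = re.search(br'charset=["\']?([\w-]+)', source[:1024])
encoding = m.group(1).decode('ascii') if m else 'utf-8'
text = source.decode(encoding, errors='replace')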

Parsing a utf-8 encoded web page with some gb2312 body text with Python

I'm trying to parse a web page using Python's BeautifulSoup parser, and am running into an issue.
The header of the HTML we get from them declares a utf-8 character set, so BeautifulSoup decodes the whole document as utf-8, and indeed the HTML tags are encoded in UTF-8, so we get back a nicely structured HTML page.
The trouble is, this stupid website injects gb2312-encoded body text into the page that gets parsed as utf-8 by beautiful soup. Is there a way to convert the text from this "gb2312 pretending to be utf-8" state to "proper expression of the character set in utf-8?"
The simplest way might be to parse the page twice, once as UTF-8, and once as GB2312. Then extract the relevant section from the GB2312 parse.
I don't know much about GB2312, but looking it up it appears to at least agree with ASCII on the basic letters, numbers, etc. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.
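A sketch of the parse-twice idea with BS4, where from_encoding forces the codec (html holds the raw bytes):
from bs4 import BeautifulSoup

soup_utf8 = BeautifulSoup(html, from_encoding='utf-8')
soup_gb = BeautifulSoup(html, from_encoding='gb2312')
# navigate soup_utf8 for the page structure, but pull the injected
# body text out of soup_gb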
This may be the only way to do it, actually. In general, GB2312-encoded text won't be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:
In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object.
This makes it sound like BeautifulSoup just ignores decoding errors and replaces the erroneous characters with U+FFFD. If this is the case (i.e., if your document has contains_replacement_characters == True), then there is no way to get the original data back from the document once it has been decoded as UTF-8. You will have to do something like what I suggested above: decode the entire document twice, with different codecs.

how to tell if a string is base64 or not

I have many emails coming in from different sources.
They all have attachments, and many of them have attachment names in Chinese, so these names are converted to base64 by the senders' email clients.
When I receive these emails, I wish to decode the name, but there are other names which are not base64-encoded. How can I tell whether a string is base64 or not, using the Jython programming language?
For example:
First attachment:
------=_NextPart_000_0091_01C940CC.EF5AC860
Content-Type: application/vnd.ms-excel;
name="Copy of Book1.xls"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Copy of Book1.xls"
second attachment:
------=_NextPart_000_0091_01C940CC.EF5AC860
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Please note both "Content-Transfer-Encoding" have base64
The header value tells you this:
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
"=?" introduces an encoded value
"gb2312" denotes the character encoding of the original value
"B" denotes that B-encoding (equal to Base64) was used (the alternative
is "Q", which refers to something close to quoted-printable)
"?" functions as a separator
"uLG..." is the actual value, encoded using the encoding specified before
"?=" ends the encoded value
So splitting on "?" actually gets you this (JSON notation)
["=", "gb2312", "B", "uLGxvmhlbrixsb5nLnhscw==", "="]
In the resulting array, if "B" is at position 2, you are facing a base64-encoded string at position 3. Once you have decoded it, pay attention to the encoding at position 1; it would probably be best to convert the whole thing to UTF-8 using that information.
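A quick check of that split:
>>> '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='.split('?')
['=', 'gb2312', 'B', 'uLGxvmhlbrixsb5nLnhscw==', '=']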
Please note both Content-Transfer-Encoding have base64
Not relevant in this case, the Content-Transfer-Encoding only applies to the body payload, not to the headers.
=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=
That's an RFC2047-encoded header atom. The stdlib function to decode it is email.header.decode_header. It still needs a little post-processing to interpret the outcome of that function though:
import email.header
import email.errors

x = '=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?='
try:
    name = u''.join([
        unicode(b, e or 'ascii')
        for b, e in email.header.decode_header(x)
    ])
except email.errors.HeaderParseError:
    pass  # leave name as it was
However...
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
This is simply wrong. What mailer created it? RFC2047 encoding can only happen in atoms, and a quoted-string is not an atom. RFC2047 §5 explicitly denies this:
An 'encoded-word' MUST NOT appear within a 'quoted-string'.
The accepted way to encode parameter headers when long string or Unicode characters are present is RFC2231, which is a whole new bag of hurt. But you should be using a standard mail-parsing library which will cope with that for you.
So, you could detect the '=?' in filename parameters if you want, and try to decode it via RFC2047. However, the strictly-speaking-correct thing to do is to take the mailer at its word and really call the file =?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?=!
#gnud, #edg - Unless I misunderstand, he's asking about the filename, not the file content
#setori - the Content-Transfer-Encoding is telling you how the CONTENT of the file is encoded, not the "filename".
I'm not an expert, but this part here in the filename is telling him about the characters that follow:
=?gb2312?B?
I'm looking for the documentation in the RFCs... Ah! here it is: https://www.rfc-editor.org/rfc/rfc2047
The RFC says:
Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between.
Something else to look at is the code in SharpMimeTools, a MIME parser (in C#) that I use in my bug tracking app, BugTracker.NET
There is a better way than bobince’s method to handle the output of decode_header. I found it here: http://mail.python.org/pipermail/email-sig/2007-March/000332.html
name = unicode(email.header.make_header(email.header.decode_header(x)))
Well, you parse the email headers into a dictionary, and then you check whether Content-Transfer-Encoding is set and whether it equals "base64" or "base-64".
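A sketch of that with the stdlib email package (raw_mail is assumed to hold the full message source):
import email

msg = email.message_from_string(raw_mail)
for part in msg.walk():
    # per-part transfer encoding and (possibly still RFC2047-encoded) filename
    print part.get('Content-Transfer-Encoding'), part.get_filename()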
Question: """Also I actually need to know what type of file it is ie .xls or .doc so I do need to decode the filename in order to correctly process the attachment, but as above, seems gb2312 is not supported in jython, know any roundabouts?"""
Data:
Content-Type: application/vnd.ms-excel;
name="=?gb2312?B?uLGxvmhlbrixsb5nLnhscw==?="
Observations:
(1) The first line indicates Microsoft Excel, so .xls is looking better than .doc
(2)
>>> import base64
>>> base64.b64decode("uLGxvmhlbrixsb5nLnhscw==")
'\xb8\xb1\xb1\xbehen\xb8\xb1\xb1\xbeg.xls'
>>>
(a) The extension appears to be .xls -- no need for a gb2312 codec
(b) If you want a file-system-safe file name, you could use the "-_" variant of base64 OR you could percent-encode it
(c) For what it's worth, the file name is XYhenXYg.xls where X and Y are 2 Chinese characters that together mean "copy" and the remainder are literal ASCII characters.
