Is there a way to try to decode a bytearray without raising an error if the encoding fails?
EDIT: The solution needn't use bytearray.decode(...). Anything library (preferably standard) that does the job would be great.
Note: I don't want to ignore errors, (which I could do using bytearray.decode(errors='ignore')). I also don't want an exception to be raised. Preferably, I would like the function to return None, for example.
my_bytearray = bytearray('', encoding='utf-8')
# ...
# Read some stream of bytes into my_bytearray.
# ...
text = my_bytearray.decode()
If my_bytearray doesn't contain valid UTF-8 text, the last line will raise an error.
Question: Is there a way to perform the validation but without raising an error?
(I realize that raising an error is considered "pythonic". Let's assume this is undesirable for some or other good reason.)
I don't want to use a try-catch block because this code gets called thousands of times and I don't want my IDE to stop every time this exception is raised (whereas I do want it to pause on other errors).
You could use the suppress context manager to suppress the exception and have slightly prettier code than with try/except/pass:
import contextlib
...
return_val = None
with contextlib.suppress(UnicodeDecodeError):
return_val = my_bytearray.decode('utf-8')
The chardet module can be used to detect the encoding of a bytearray before calling bytearray.decode(...).
The Code:
import chardet
identity = chardet.detect(my_bytearray)
The method chardet.detect(...) returns a dictionary with the following format:
{
'confidence': 0.99,
'encoding': 'ascii',
'language': ''
}
One could check analysis['encoding'] to confirm that my_bytearray is compatible with an expected set of text encoding before calling my_bytearray.decode().
One consideration of using this approach is that the encoding indicated by the analysis might indicate one of a number of equivalent encodings. In this case, for instance, the analysis indicates that the encoding is ASCII whereas it could equivalently be UTF-8.
(Credit to #simon who pointed this out on StackOverflow here.)
Related
Armin Ronacher, http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/
If you for instance pass [the result of os.fsdecode() or equivalent] to a template engine you [sometimes get a UnicodeEncodeError] somewhere else entirely and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens the issue becomes much easier to debug
Armin suggests a function
def remove_surrogate_escaping(s, method='ignore'):
assert method in ('ignore', 'replace'), 'invalid removal method'
return s.encode('utf-8', method).decode('utf-8')
Nick Coghlan, 2014, [Python-Dev] Cleaning up surrogate escaped strings
The current proposal on the issue tracker is to ... take advantage of
the existing error handlers:
def convert_surrogateescape(data, errors='replace'):
return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)
The functions are slightly different. The second was written with knowledge of the first.
Since Python 3.5, the backslashreplace error handler now works on decoding as well as encoding. The first approach is not designed to use backslashreplace e.g. an error decoding the byte 0xff would get printed as "\udcff". The second approach is designed to solve this; it would print "\xff".
If you did not need backslashreplace, you might prefer the first version if you had the misfortune to be supporting Python < 3.5 (including polyglot 2/3 code, ouch).
Question
Is there a better idiom for this purpose yet? Or do we still use this drop-in function?
Nick referred to an issue for adding such a function to the codecs module. As of 2019 the function has not been added, and the ticket remains open.
The latest comment says
msg314682 Nick Coghlan, 2018
A recent discussion on python-ideas also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data.
That includes a lone surrogate fixer: ftfy.fixes.fix_surrogates(text)
...
I do not find the function in ftfy appealing. The documentation does not say so, but it appears to be designed to handle both surrogateescape and ... be part of a workaround for CESU-8, or something like that ?
Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.
I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open file with i.e. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string, I know those two methods are the opposites but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to text file, then with python read the file, re-write what I've just read back to the same file and save, symbols get turned to the code \u9a6c. Not sure if this is importat, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects repr method, which turned out to be where the problem lied. I didn't mention it because it just didn't occur to me it can be related to the problem somehow. Thanks again for your help!
From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead, codecs.open() handles encoding to UTF-8 correctly if you do.
The Code Furies have turned their baleful glares upon me, and it's fallen to me to implement "Secure Transport" as defined by The Direct Project. Whether or not we internally use DNS rather than LDAP for sharing certificates, I'm obviously going to need to set up the former to test against, and that's what's got me stuck. Apparently, an X509 cert needs some massaging to be used in a CERT record, and I'm trying to work out how that's done.
The clearest thing I've found is a script on Videntity's blog, but not being versed in python, I'm hitting a stumbling block. Specifically, this line crashes:
decoded_clean_pk = clean_pk.decode('base64', strict)
since it doesn't seem to like (or rather, to know) whatever 'strict' is supposed to represent. I'm making the semi-educated guess that the line is supposed to decode the base64 data, but I learned from the Debian OpenSSL debacle some years back that blindly diddling with crypto-related code is a Bad Thing(TM).
So I turn the illustrious python wonks on SO to ask if that line might be replaced by this one (with the appropriate import added):
decoded_clean_pk = base64.b64decode(clean_pk)
The script runs after that change, and produces correct-looking output, but I've got enough instinct to know that I can't necessarily trust my instincts here. :)
This line should've work if you would've called like this:
decoded_clean_pk = clean_pk.decode('base64', 'strict')
Notice that strict has to be a string, otherwise python interpreter would try to search for a variable named strict and if it didn't find it or otherwise has other value than: strict, ignore, and replace, it'll probably would've complain about it.
Take a look at this code:
>>>b=base64.b64encode('hello world')
>>>b.decode('base64')
'hello world'
>>>base64.b64decode(b)
'hello world'
Both decode and b64decode works the same when .decode is passed the base64 argument string.
The difference is that str.decode will take a string of bytes as arguments and will return it's Unicode representation depending on the encoding argument you pass as first parameter. In this case, you're telling it to handle a bas64 string so it will do it ok.
To answer your question, both works the same, although b64decode/encode are meant to work only with base64 encodings and str.decode can handle as many encodings as the library is aware of.
For further information take a read at both of the doc sections: decode and b64decode.
UPDATE: Actually, and this is the most important example I guess :) take a look at the source code for encodings/base64_codec.py which is that decode() uses:
def base64_decode(input,errors='strict'):
""" Decodes the object input and returns a tuple (output
object, length consumed).
input must be an object which provides the bf_getreadbuf
buffer slot. Python strings, buffer objects and memory
mapped files are examples of objects providing this slot.
errors defines the error handling to apply. It defaults to
'strict' handling which is the only currently supported
error handling for this codec.
"""
assert errors == 'strict'
output = base64.decodestring(input)
return (output, len(input))
As you may see, it actually uses base64 module to do it :)
Hope this clarify in some way your question.
I'm building an application that in the database has data with latin symbols. Users are able to enter this data.
What I've been doing so far is encode('latin2') every user input and decode('latin2') at the very end when displaying data in the template.
This is a bit annoying and I'm wondering if there is any better way of handling this.
Python's unicode type is designed to be the "natural" representation for strings. Besides the unicode type, strings are expected to be in some unspecified encoding but there's no way to "tag" them with the encoding used, and python will very insistently assume that strings are in ASCII or UTF-8 encoding. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes getting bad data in your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and impossible to debug.
I would recommend you see about converting your db data to UTF-8.
If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).
For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:
<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>
See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).
try add this to the top of your program.
import sys
reload(sys)
sys.setdefaultencoding('latin2')
We have to reload sys because:
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>
I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml.
Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of the Python built-in Unicode class. In essence, it would be comparable with this Python statement:
u'<?xml version="1.0" encoding="utf-8" ?><root><lotsofelements \></root>'
Which is incongrous, since if the XML declaration is correct, the contents are UTF-8 encoded, and thus not a Python Unicode object (because those are stored in some internal encoding like UCS4).
lxml will refuse to parse this, as documented, since there is no clear answer to what encoding it should be interpreted as.
As I see it, there are two ways out of this bind:
Strip the <?xml> declaration, including the encoding.
Convert the output from Suds into a bytestring, using the specified encoding.
Currently, the data I’m receiving from the webservice is within the ASCII-range, so either way will work, but both feels very much like ugly hacks to me, and I’m not quite sure what would happen, if I start to receive data that would need a wider range of Unicode characters.
Any good ideas? I can’t imagine I’m the first one in this position…
You and lxml are correct; a valid XML document must be a stream of bytes encoded as declared in the <?xml ..... header (default: UTF-8).
I'd suggest a third option: leave it in unicode with an XML header that omits the encoding declaration but leaves the version in there (future-safe). That will keep lxml happy and avoid the overhead of you encoding it again.
I'd also suggest some gentle enquiry at the suds site and having a poke around in their source.
Hmm, I'm currently implementing my first Suds-based solution and parsing my responses with lxml without a problem, but I think this could be because I'm doing it in a pretty blunt and dumb way. Here's what my code looks like:
try:
result = self.client.service.ExportOwnersDetails(fAccess=self.access_id, fParams=params)
except URLError:
# TODO: Log timeout here, handle
return
response = str(result.fReturn)
if len(response) == 0 or response.find('<?xml ') == -1:
# TODO: Log import error here, handle
return
response = StringIO(response)
xml = etree.parse(response)
Like I said, not very clever (and obviously I still have some logging to do), but that's my approach. The fAccess, fParams, fReturn nonsense is the naming convention at the third-party provider I'm integrating with.