From an API I receive the following kinds of byte arrays:
b'2021:09:30 08:28:24'
b'\x01\x02\x03\x00'
I know how to get the values from them: value.decode() for the first one and ''.join([str(c) for c in value]) for the second.
The problem is that I need to do this dynamically. I don't know what the second kind is called (is it a hex literal?), and I can't even check for value.decode().startswith('\x'), because that gives me a
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape
which makes sense, because of the escape sequence.
So I need something that checks whether the value has the \x... form or is a simple string that I can just pass through decode().
Use a combination of try/except with isprintable():
def bytes_to_string(data):
    try:
        rv = data.decode()
        return rv if rv.isprintable() else data.hex()
    except UnicodeDecodeError:
        return data.hex()

print(bytes_to_string(b'\x01\x02\x03\x00'))  # Will decode but to unprintable string
print(bytes_to_string(b'2021:09:30 08:28:24'))  # Will decode to printable string
print(bytes_to_string(b'Z\xfcrich'))  # Will throw an exception on decode
You can use isprintable to approximate a check for escape sequences inside a string literal:
str_val = value.decode()
if not str_val.isprintable():
    str_val = ''.join([str(c) for c in value])
However, as noted in my comment, this seems like an unnecessary hack, and unreliable at that. The fundamental issue is that there is no hard boundary for telling different byte buffers apart, only heuristics (this is a fundamental theorem of communication theory; there is no way around it!). The API that sends the data should therefore tell you how to interpret the data.
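To illustrate what "the API tells you how to interpret the data" could look like on the receiving side, here is a minimal sketch. The field names timestamp and version and the DECODERS table are hypothetical, made up for this example; substitute whatever your API actually documents about each field.

```python
# Hypothetical mapping from field name to the decoder that field needs.
# The names "timestamp" and "version" are assumptions for illustration only.
DECODERS = {
    "timestamp": lambda raw: raw.decode("ascii"),
    "version": lambda raw: "".join(str(byte) for byte in raw),
}


def decode_field(name, raw):
    """Decode raw bytes according to the declared field, not by guessing."""
    return DECODERS[name](raw)


print(decode_field("timestamp", b"2021:09:30 08:28:24"))
print(decode_field("version", b"\x01\x02\x03\x00"))
```

Dispatching on the field avoids the ambiguity of a value such as b'12', which is printable text and at the same time a valid pair of small integers (49 and 50).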
I am new to Python. I am trying to instantiate a grid from DEM data. Then I will try to create a flow direction map from raster data. But when I write the following line:
(grid = Grid.from_raster("C:\Users\ogun_\Masaüstü\DEM", data_name = 'dem')
I have got this error.
(SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape)
When I try the r or R functions, they don't work for this code.
Strings in Python allow "escape codes", which start with a \. For example, \n signifies a newline character.
In your case, you've specified \U as part of \Users. Python tries to interpret that as the start of a Unicode escape, \UXXXXXXXX, which must be followed by eight hexadecimal digits. The text after \U is not eight hex digits, hence the "truncated \UXXXXXXXX escape" error.
You can solve this either with an escape for your \ characters (which makes the string "C:\\Users\\ogun_\\Masaüstü\\DEM") or with a "raw string" which doesn't do escape codes (which makes the string r"C:\Users\ogun_\Masaüstü\DEM").
I suspect the latter might be what you mean by "r functions". These are not functions but lexical markers (string prefixes). They're much lower-level than functions and can be thought of as part of the quoting itself. If you tried to call them as functions, that would be why it didn't work.
You can read more about marking a string as raw here.
NOTE: This also is different between Python 2 and 3. In 2, strings aren't Unicode by default, so you don't experience this unless you ask for a unicode string. In Python 3, strings are Unicode by default, so this happens unless you explicitly ask for bytes with a byte string.
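As a quick check, assuming Python 3, the two spellings produce the same string. This can be confirmed without touching the file system:

```python
# Escaped backslashes and a raw string literal describe the same path text.
escaped = "C:\\Users\\ogun_\\Masaüstü\\DEM"
raw = r"C:\Users\ogun_\Masaüstü\DEM"

print(escaped == raw)  # both literals yield identical str objects
print(raw)
```

One caveat: a raw string literal cannot end in a single backslash, so a path with a trailing separator still needs the escaped form.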
From another source I get names containing Polish letters (ń and ó), like the two below:
piaseczyÅ„ski
zielonogórski
Of course, there are more than two such names.
The first should look like piaseczyński, and the second looks fine. But when I try to fix them using
str(entity_name).encode('1252').decode('utf-8'), the first is fixed, but the second raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
Why are the Polish letters not treated the same?
How can I fix it?
As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:
def decode_crappy_bytes(b):
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        return b.decode('1252')
Note that this can still fail, in two ways:
If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
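If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Any byte string decodes without error as Latin-1, and all but a handful of byte values decode as 1252 as well, so a wrong guess between such encodings yields wrong characters rather than an exception.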
If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
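The scoring idea could be sketched roughly as follows. The candidate list (UTF-8, cp1250, ISO-8859-2, cp1252) and the helper name guess_decode are my own assumptions, not something the source dictates; adjust them to the encodings you actually expect. Ties go to the earlier candidate, so UTF-8 is listed first because it is the strictest.

```python
# Non-ASCII Polish letters used as the scoring alphabet.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

# Assumed candidate encodings, strictest first (hypothetical choice).
CANDIDATES = ("utf-8", "cp1250", "iso-8859-2", "cp1252")


def guess_decode(raw, encodings=CANDIDATES):
    """Return the decoding of raw that contains the most Polish letters."""
    best_text = None
    best_score = -1
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes at all
        score = sum(ch in POLISH_LETTERS for ch in text)
        if score > best_score:  # strict '>' keeps the earlier candidate on ties
            best_text, best_score = text, score
    if best_text is None:
        raise ValueError("no candidate encoding could decode the input")
    return best_text


print(guess_decode("piaseczyński".encode("utf-8")))
print(guess_decode("zielonogórski".encode("cp1250")))
```

This remains a heuristic: short strings with few non-ASCII letters give the scorer very little to work with.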
@Thomas I added another except clause and now it works perfectly:
try:
    entity_name = entity_name.encode('1252').decode('utf-8')
except UnicodeDecodeError:
    pass
except UnicodeEncodeError:
    pass
Passed for żarski.
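For context on why both exceptions are needed, here is the same round trip wrapped in a function (the name fix_mojibake is hypothetical). A correctly decoded ó encodes to 1252 but the resulting byte is not valid UTF-8, so decoding fails; ż does not exist in 1252 at all, so encoding fails first.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as cp1252, if possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252 (encode failed) or
        # its cp1252 bytes are not valid UTF-8 (decode failed). In both
        # cases assume the text was already correct and leave it alone.
        return text


for name in ("piaseczyÅ„ski", "zielonogórski", "żarski"):
    print(fix_mojibake(name))
```

This shares the weakness mentioned above: a correct string whose 1252 bytes happen to form valid UTF-8 would be altered.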
I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (got using MySQLdb.cursors.DictCursor) row by row to transform all the string values to unicode (example {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
The problem is that one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this it seems that this has to do with ascii encoding.
The solutions I tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected it ignores the non-ascii character and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... does the job, but I am afraid it will support only Western European characters (as per here)
Can anybody point me in the right direction, please?
Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
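The behaviour described in this thread can be reproduced in a few lines. Note that this sketch is written for Python 3 (the question uses Python 2), and it assumes the database really returned Latin-1 bytes, which is simulated here by encoding the name by hand.

```python
# Simulate the bytes the database driver handed back (assumed Latin-1).
raw = "Gabriel García Márquez".encode("latin-1")

try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the UTF-8 decoder rejected.
    error_position = exc.start

print(error_position, hex(raw[12]))   # the offending byte is the Latin-1 'í'
print(raw.decode("latin-1"))          # the correct decoding
print(raw.decode("utf-8", "ignore"))  # silently drops the accented letters
```

The reported position and byte match the error in the question, and the 'ignore' variant reproduces the damaged output from the second attempted fix.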
I'm doing this:
word.rstrip(s)
Where word and s are strings containing unicode characters.
I'm getting this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
There's a bug report where this error happens on some Windows Django systems. However, my situation seems unrelated to that case.
What could be the problem?
EDIT: The code is like this:
def Strip(word):
    for s in suffixes:
        return word.rstrip(s)
The issue is that s is a bytestring, while word is a unicode string - so, Python tries to turn s into a unicode string so that the rstrip makes sense. The issue is, it assumes s is encoded in ASCII, which it clearly isn't (since it contains a character outside the ASCII range).
So, since you initialise it as a literal, it is very easy to turn it into a unicode string by putting a u in front of it:
suffixes = [u'ি']
That will work. As you add more suffixes, you'll need the u in front of each of them individually.
I guess this happens because of implicit conversion in python2.
It's explained in this document, but I recommend that you read the whole presentation about handling unicode in python 2 and 3 (and why python3 is better ;-))
So, I think the solution to your problem would be to force the decoding of strings as utf8 before stripping.
Something like :
def Strip(word):
    word = word.decode("utf8")
    for s in suffixes:
        return word.rstrip(s.decode("utf8"))
Second try :
def Strip(word):
    if type(word) == str:
        word = word.decode("utf8")
    for s in suffixes:
        if type(s) == str:
            s = s.decode("utf8")
        return word.rstrip(s)
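For comparison, a rough Python 3 version of the same "decode first, then strip" idea. Python 3 never converts between bytes and str implicitly, so the normalisation has to be explicit. The helper names to_text and strip_chars are hypothetical, and unlike the snippets above this sketch applies every suffix instead of returning after the first one.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def strip_chars(word, suffixes):
    """Strip trailing characters after normalising everything to str."""
    word = to_text(word)
    for s in suffixes:
        # rstrip treats its argument as a set of characters to remove,
        # not as a literal suffix.
        word = word.rstrip(to_text(s))
    return word


print(strip_chars("কি".encode("utf-8"), ["ি"]))
print(strip_chars("walking", [b"g"]))
```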
I have a site that displays user input by decoding it to unicode using utf-8. However, user input can include binary data, which is obviously not always able to be 'decoded' by utf-8.
I'm using Python, and I get an error saying:
'utf8' codec can't decode byte 0xbf in position 0: unexpected code byte. You passed in '\xbf\xcd...
Is there a standard efficient way to convert those undecodable characters into question marks?
It would be most helpful if the answer uses Python.
Try:
inputstring.decode("utf8", "replace")
See here for reference
I think what you are looking for is:
str.decode('utf8','ignore')
which should drop invalid bytes rather than raise an exception.
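A small comparison of the two error handlers, written for Python 3, where bytes.decode takes the same errors argument. Note that "replace" inserts U+FFFD, the Unicode replacement character, rather than a literal question mark; the last step swaps it for ? if that is what you want. The sample bytes are made up for illustration.

```python
# Made-up input: two bytes that are invalid UTF-8, followed by plain ASCII.
raw = b"\xbf\xcdabc"

replaced = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # invalid bytes are dropped
question_marks = replaced.replace("\ufffd", "?")  # literal question marks

print(repr(replaced))
print(repr(ignored))
print(question_marks)
```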