utf-8 plus question marks - python

I have a site that displays user input by decoding it to unicode using UTF-8. However, user input can include binary data, which obviously can't always be decoded as UTF-8.
I'm using Python, and I get an error saying:
'utf8' codec can't decode byte 0xbf in position 0: unexpected code byte. You passed in '\xbf\xcd...
Is there a standard efficient way to convert those undecodable characters into question marks?
It would be most helpful if the answer uses Python.

Try:
inputstring.decode("utf8", "replace")
See the error handlers section of the codecs documentation for reference.

I think what you are looking for is:
str.decode('utf8','ignore')
which should drop the invalid bytes rather than raising an exception
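For illustration, a quick Python 2 sketch of what the two error handlers do with bytes like those in the error message. Note that 'replace' substitutes U+FFFD (the Unicode replacement character), not a literal question mark, so an extra substitution is shown for that:

data = '\xbf\xcd'  # a byte string that is not valid UTF-8

print repr(data.decode('utf-8', 'replace'))  # u'\ufffd\ufffd' -- bad bytes replaced
print repr(data.decode('utf-8', 'ignore'))   # u''             -- bad bytes dropped

# If you literally want question marks rather than U+FFFD:
print data.decode('utf-8', 'replace').replace(u'\ufffd', u'?')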

Related

Encoding and decoding are not treated the same for Polish letters

From another source I get two names with two Polish letters (ń and ó), like below:
piaseczyński
zielonogórski
Of course there are more than two such names.
The 1st should look like piaseczyński, and the 2nd looks good. But when I try to fix it using:
str(entity_name).encode('1252').decode('utf-8')
the 1st is fixed, but the 2nd returns an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
Why are the Polish letters not treated the same?
How do I fix it?
As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:
def decode_crappy_bytes(b):
    try:
        return b.decode('utf-8')
    except UnicodeDecodeError:
        return b.decode('1252')
Note that this can still fail, in two ways:
If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Almost every bytestring that's valid in one is also valid in the other, so you can't reliably tell them apart.
If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
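If you go that route, a minimal sketch of the scoring idea could look like this (the helper name and candidate encodings are illustrative, not part of the original answer):

# -*- coding: utf-8 -*-
POLISH_CHARS = u'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'

def decode_polish_guess(b, candidates=('utf-8', 'cp1252', 'latin-1')):
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = b.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, skip it
        score = sum(1 for ch in text if ch in POLISH_CHARS)
        if score > best_score:
            best, best_score = text, score
    return best

For example, decode_polish_guess('piaseczy\xc5\x84ski') prefers the UTF-8 reading because it yields a Polish letter, while the cp1252 reading yields none.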
@Thomas I added another except clause and now it works perfectly:
try:
    entity_name = entity_name.encode('1252').decode('utf-8')
except UnicodeDecodeError:
    pass
except UnicodeEncodeError:
    pass
Passed for żarski.

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into a problem where, when I try to decode a string, I get one error, and when I try to encode it, I get another; both errors are below. Is there a permanent solution for this?
P.S. Please note that you may not be able to reproduce the encoding error with the string I provided, as I couldn't copy/paste some of it.
text = "sometext"
string = '\n'.join(list(set(text)))
try:
print "decode"
text = string.decode('UTF-8')
except Exception as e:
print e
text = string.encode('UTF-8')
Errors:
Error while using string.decode('UTF-8'):
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8'):
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work, as the text is a bytestring (since you are using Python 2). But what you're effectively doing is converting a UTF-8 string to an ASCII one, which is possible only if the Unicode string contains nothing but characters that have an ASCII equivalent (code points 0-127). In your case, it's encountering a Unicode character (specifically ☂, U+2602) which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
Which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. It appears the text you are trying to encode into UTF-8 contains either NULL bytes or certain control characters, which the consumer of the string (judging by the error message, an XML-compatibility check) does not allow. Again, the code that you have actually provided works, but something in the text that you are trying to encode is violating that restriction. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
Which will simply remove the offending characters, or you can look into what it is in your specific text input that is causing the problem.
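If you want to track down the offending input, a small sketch along these lines (the helper name and sample text are made up for illustration) lists the NULL bytes and other control characters in a Unicode string:

import unicodedata

def suspicious_chars(text):
    # Yield position and repr of NULs and other control characters (category 'Cc')
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == 'Cc':
            yield i, repr(ch)

for pos, ch in suspicious_chars(u'abc\x00def\n\u2602'):
    print pos, ch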

Python2: Using .decode with errors='replace' still returns errors

So I have a message which is read from a file of unknown encoding. I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors and have gone through many Q&As on StackOverflow, and I think I have a decent understanding of how Unicode and encodings work. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
I assumed that because I'm using errors="replace" on the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.
However, it seems that I was too hopeful, and for some files I still get a UnicodeDecodeError saying "ascii codec can't decode" for some character. Why doesn't errors="replace" just take care of this?
(also as a bonus question, what does ascii have to do with any of this?.. I'm specifying UTF-8)
You should not get a UnicodeDecodeError with errors='replace'. Also str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been ‘decoded’ from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
(Python 3 no longer does this as it is terribly confusing.)
Check the type of message and assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.
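A small Python 2 sketch of that trap, with an assumed example value for message:

message = u'caf\xe9 \u2602'            # already unicode, not bytes
print type(message)                     # <type 'unicode'>

try:
    message.decode('utf-8', 'replace')
except UnicodeEncodeError as e:         # an *Encode* error from calling decode()
    print e

# Only decode when you actually have bytes (str in Python 2):
if isinstance(message, str):
    message = message.decode('utf-8', 'replace')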
The 'replace' error handler (for text encodings only) substitutes '?' for encoding errors and '\ufffd' (the Unicode replacement character) for decoding errors; "text encodings" here means codecs which encode Unicode strings to bytes.
Maybe your data is malformed - you could try the 'ignore' error handler, where malformed data is ignored and encoding or decoding continues without further notice:
message.decode(encoding='utf-8', errors="ignore")

Get a UTF-16 string length from memory in Python

I need to read a utf-16 encoded string that is stored in memory in a python script for LLDB. According to their documentation I'm able to use ReadMemory(address, length, error) but I need to know its length in advance.
Otherwise, Python's decode function fails when it stumbles upon a character it cannot decode (even with the 'ignore' option) and the process stops:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u018e' in position 12: ordinal not in range(128)
Can anyone suggest a way of achieving this? (either using a "python" or "lldb python" implementation). I don't have the original string's length.
Thanks.
Is the string 0-terminated? If so, you could read 2 bytes at a time, until you encounter 0x0000, and then you'd know you have a complete string.
If you do this, you'd want to give yourself a constraint (e.g. "I will give up after reading - say - 1MB of data", in case you're running into corrupted memory).
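A rough sketch of that approach with the LLDB Python API (the function name, the assumed little-endian byte order, and the 1 MB cap are all illustrative):

import lldb

def read_utf16_cstring(process, address, max_bytes=1024 * 1024):
    error = lldb.SBError()
    raw = b''
    offset = 0
    while offset < max_bytes:
        chunk = process.ReadMemory(address + offset, 2, error)
        if not error.Success() or chunk == b'\x00\x00':
            break  # read failed or hit the two-byte UTF-16 terminator
        raw += chunk
        offset += 2
    # 'replace' keeps one undecodable pair from aborting the whole read
    return raw.decode('utf-16-le', 'replace')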

Conversion of Unicode

I am a newbie in python.
I have a unicode string in Tamil.
When I use sys.getdefaultencoding() I get the output "Cp1252".
My problem is that when I use text = testString.decode("utf-8") I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>".
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decode means to convert from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', encoding will fail. That's why you get a UnicodeEncodeError instead of a UnicodeDecodeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
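To make the direction of the two operations concrete, a tiny Python 2 sketch (the Tamil sample text is assumed):

tamil_text = u'\u0b95\u0bbe'              # a unicode object (two Tamil characters)

utf8_bytes = tamil_text.encode('utf-8')   # unicode -> bytes (encode)
round_trip = utf8_bytes.decode('utf-8')   # bytes -> unicode (decode)

print repr(utf8_bytes)                    # '\xe0\xae\x95\xe0\xae\xbe'
print round_trip == tamil_text            # True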
Add this as the 1st line of your code:
# -*- coding: utf-8 -*-
then later in your code...
text = unicode(testString, "UTF-8")
You need to know which character encoding testString is using. If it is not UTF-8, an error will occur when using decode('utf8').
