Here's what I need to do in Python 2.4 (yes, 2.4 :-( ).
I've got a plain string object which represents some text encoded with UTF-8. It comes from an external library that can't be modified.
So what I think I need to do is create a Unicode object from the bytes of that source object, and then convert it to some other encoding (iso-8859-2, actually).
The plain string object is x. unicode() doesn't seem to work:
>>> x
'Sk\xc5\x82odowski'
>>> str(unicode(x, encoding='iso-8859-2'))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
>>> unicode(x, encoding='iso-8859-2')
u'Sk\u0139\x82odowski'
>>> x.decode('utf8').encode('iso-8859-2')
'Sk\xb3odowski'
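The last attempt above is the right chain. In Python 3 terms (where the plain string would be a bytes object), the same two-step transcode looks like this:

```python
# The bytes from the external library, UTF-8 encoded:
raw = b'Sk\xc5\x82odowski'

# Step 1: decode the UTF-8 bytes to text.
# Step 2: encode that text to iso-8859-2.
converted = raw.decode('utf-8').encode('iso-8859-2')
print(converted)  # b'Sk\xb3odowski'
```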
I have a problem: I'm trying to search with the Google Custom Search API for Python, but when I search for things that are stored in a variable instead of typing them manually, I get UnicodeEncodeError: 'ascii' codec can't encode character '\xa2' in position 104: ordinal not in range(128). When I work around it with
.encode('ascii', 'ignore').decode('ascii')
another error appears from the Google Custom Search call:
TypeError: can only concatenate str (not "bytes") to str.
P.S.: I have also tried things such as str() or .decode alone.
Edit: Sure, the input stored in the variables comes from Pytesseract, which reads the text of an image. I store this information in a variable and then try to search for it with the Google Custom Search API. As it displayed a Unicode error, I looked on Stack Overflow for a solution and found that I could try to .decode the variable to avoid the problem. That problem was indeed solved, but then another one appeared: TypeError: can only concatenate str (not "bytes") to str. So I can't use the .decode function, because it displays another error. What can I do?
Edit 2.0
text_photo = pytesseract.image_to_string(img2)  # read the text from the image into a variable
text_photo = text_photo.replace('\r', '').replace('\n', '')  # eliminate the \r and \n characters
rawData = urllib.request.urlopen(url_google_1 + text_photo1 + '+' + text_photo2 + url_google_2).read()
json_data = json.loads(rawData)
url_google_1 contains the first part of the link (API key, etc.) for a Google search, and url_google_2 contains what I want to get from Google. In the middle I add the variable because that's what I want to search for. If I write hello it works perfectly; the problem is that the format Tesseract produces is not compatible. I have tried str(text_photo) and .decode, but they don't work.
I wasn't able to understand all the details of your specific problem, but I'm quite sure the root cause is the following:
Python 3 distinguishes two string types, str and bytes, which are similar, yet incompatible.
Once you understand what this means, what each of them can/can't do, and how to go from one to the other, I'm sure you can figure out how to properly construct the URL for the API call.
Different types, incompatible:
>>> type('abc'), type(b'abc')
(<class 'str'>, <class 'bytes'>)
>>> 'abc' + b'abc'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
>>> b'abc' + 'abc'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
If you want to combine them, you need to convert everything to the same type.
For conversion, encode str to bytes, decode bytes to str:
>>> 'abc'.encode()
b'abc'
>>> b'abc'.decode()
'abc'
The str.encode and bytes.decode methods take an optional encoding= parameter, which defaults to UTF-8.
This parameter defines the mapping between the characters in a str and the octets in a bytes object.
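For example (the sample string here is just illustrative):

```python
s = 'Skłodowski'

# With no argument, encode() uses UTF-8.
assert s.encode() == s.encode('utf-8')

# A different encoding produces different bytes for the same text:
# 'ł' is the single byte 0xb3 in iso-8859-2.
assert s.encode('iso-8859-2') == b'Sk\xb3odowski'

# decode() with the matching encoding recovers the original str.
assert s.encode('utf-8').decode('utf-8') == s
```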
If there's a problem mapping characters to bytes with the given encoding, you'll encounter a UnicodeEncodeError.
This happens if you use a character that isn't defined in the given mapping:
>>> '5 £'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 2: ordinal not in range(128)
Similarly, if some text had been encoded with encoding X but you try to decode it with encoding Y, you might see a UnicodeDecodeError:
>>> b = '5 £'.encode('utf8')
>>> b
b'5 \xc2\xa3'
>>> b.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
You can avoid the exception with the errors="ignore" strategy, but you will lose information this way:
>>> '5 £'.encode('ascii', errors='ignore')
b'5 '
Typically, if you work with text, you use str everywhere.
You also shouldn't often need to call .encode/.decode directly; file handles etc. often accept str and convert it to bytes behind the scenes.
In your case, you need to find out where and why you have a mixture of str and bytes, then make sure everything has the same type before concatenating.
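As a sketch of one possible fix for the search-API case (all names below are stand-ins for your actual variables, and the URL is a made-up example): keep every piece as str, and percent-encode the query before concatenating, e.g. with urllib.parse.quote_plus:

```python
from urllib.parse import quote_plus

# Stand-ins for the real values: text_photo is the OCR output (a str),
# url_google_1/url_google_2 are the fixed parts of the request URL.
url_google_1 = 'https://www.googleapis.com/customsearch/v1?key=KEY&cx=CX&q='
url_google_2 = '&num=10'
text_photo = 'Skłodowski equation'

# Everything here is str, so '+' concatenation works; quote_plus also
# percent-encodes non-ASCII characters so the URL stays valid.
url = url_google_1 + quote_plus(text_photo) + url_google_2
print(url)
```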
My database contains some bytes that aren't valid ASCII; how can I concatenate those strings without errors?
My example situation looks like this (some byte values are larger than 127):
>>> s=b'\xb0'
>>> addstr='read '+s
>>> print addstr
read ░
>>> addstr.encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)
>>> addstr.encode('utf_8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)
I can do:
>>> addstr.decode("windows-1252").encode('utf-8')
'read \xc2\xb0'
but you can see the windows-1252 coding will change my character.
I would like to convert addstr to unicode. How do I do it?
addstrUnicode = addstr.decode("unicode-escape")
You should not be concerned about the character changing; it's just that the UTF-8 encoding requires two bytes, not one, for characters between 0x80 and 0x7FF, so when you encode as UTF-8, an extra byte (0xC2) is added.
This is a useful link to read to help understand different types of encodings.
Additionally, make sure you know the original encoding of the character before you start trying to decode it. While you mentioned it was "ascii code", the ASCII character set only extends up to 127, which means this character cannot be ASCII-encoded. I'm assuming here it's just the Unicode code point U+00B0.
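A minimal Python 3 sketch of what's going on (latin-1 here is an assumption; substitute whatever encoding your database actually uses): latin-1 maps every byte 0x00-0xFF directly to the code point with the same value, so it can decode any byte string without errors:

```python
raw = b'read \xb0'              # the problematic byte string

# latin-1 maps each byte straight to the code point with the same number.
text = raw.decode('latin-1')
assert text == 'read \xb0'      # '\xb0' is now U+00B0, the degree sign

# Re-encoding as UTF-8 needs two bytes for U+00B0: 0xC2 0xB0.
assert text.encode('utf-8') == b'read \xc2\xb0'
```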
What is a good way to remove all characters that are out of the range: ordinal(128) from a string in python?
I'm using hashlib.sha256 in python 2.7. I'm getting the exception:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200e' in position 13: ordinal not in range(128)
I assume this means that some funky character found its way into the string that I am trying to hash.
Thanks!
new_safe_str = some_string.encode('ascii', 'ignore')
I think that would work.
Or you could use a list comprehension (ASCII covers ordinals 0-127):
"".join([ch for ch in orig_string if ord(ch) < 128])
[edit] However, as others have said, it may be better to figure out how to deal with Unicode in general... unless you really need it encoded as ASCII for some reason.
Instead of removing those characters, it would be better to use an encoding that hashlib won't choke on, utf-8 for example:
>>> data = u'\u200e'
>>> hashlib.sha256(data.encode('utf-8')).hexdigest()
'e76d0bc0e98b2ad56c38eebda51da277a591043c9bc3f5c5e42cd167abc7393e'
This is an example of where the changes in Python 3 make an improvement, or at least generate a clearer error message.
Python2
>>> import hashlib
>>> funky_string=u"You owe me £100"
>>> hashlib.sha256(funky_string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 11: ordinal not in range(128)
>>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
'81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
>>>
Python3
>>> import hashlib
>>> funky_string="You owe me £100"
>>> hashlib.sha256(funky_string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing
>>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
'81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
>>>
The real problem is that sha256 takes a sequence of bytes, which Python 2 doesn't have a clear concept of. Using .encode("utf-8") is what I'd suggest.
I want my function to take an argument that could be an unicode object or a utf-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this:
def myfunction(text):
    if not isinstance(text, unicode):
        text = unicode(text, 'utf-8')
    ...
Is it possible to avoid the use of isinstance? I was looking for something more duck-typing friendly.
During my experiments with decoding, I have run into several weird behaviours of Python. For instance:
>>> u'hello'.decode('utf-8')
u'hello'
>>> u'cer\xf3n'.decode('utf-8')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3: ordinal not in range(128)
Or
>>> u'hello'.decode('utf-8')
u'hello'
>>> unicode(u'hello', 'utf-8')
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: decoding Unicode is not supported
By the way. I'm using Python 2.6
You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.
def myfunction(text):
    try:
        text = unicode(text, 'utf-8')
    except TypeError:
        pass  # decoding raises TypeError when text is already unicode
    return text

print(myfunction(u'cer\xf3n'))
# cerón
When you take a unicode object and call its decode method with the 'utf-8' codec, Python first tries to convert the unicode object to a string object, and then it calls the string object's decode('utf-8') method.
Sometimes the conversion from unicode object to string object fails because Python2 uses the ascii codec by default.
So, in general, never try to decode unicode objects. Or, if you must try, trap it in a try..except block. There may be a few codecs for which decoding unicode objects works in Python2 (see below), but they have been removed in Python3.
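As a quick Python 3 illustration of how this trap was closed off:

```python
# In Python 3, decoding is only defined on bytes, never on str,
# so "decoding text" fails immediately with an AttributeError
# instead of silently round-tripping through the ascii codec.
assert hasattr(b'abc', 'decode')
assert not hasattr('abc', 'decode')
```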
See this Python bug ticket for an interesting discussion of the issue,
and also Guido van Rossum's blog:
"We are adopting a slightly different
approach to codecs: while in Python 2,
codecs can accept either Unicode or
8-bits as input and produce either as
output, in Py3k, encoding is always a
translation from a Unicode (text)
string to an array of bytes, and
decoding always goes the opposite
direction. This means that we had to
drop a few codecs that don't fit in
this model, for example rot13, base64
and bz2 (those conversions are still
supported, just not through the
encode/decode API)."
I'm not aware of any good way to avoid the isinstance check in your function, but maybe someone else will be. I can point out that the two weirdnesses you cite are because you're doing something that doesn't make sense: Trying to decode into Unicode something that's already decoded into Unicode.
The first should instead look like this, which decodes the UTF-8 encoding of that string into the Unicode version:
>>> 'cer\xc3\xb3n'.decode('utf-8')
u'cer\xf3n'
And your second should look like this (not using a u'' Unicode string literal):
>>> unicode('hello', 'utf-8')
u'hello'
I retrieved data encoded in big5 from a database, and I want to send the data as an email with HTML content. The code is like this:
html += """<tr><td>"""
html += unicode(rs[0], 'big5') # rs[0] is data encoded in big5
I run the script, but this error is raised: UnicodeDecodeError: 'ascii' codec can't decode byte...... However, when I tried the code in an interactive Python command line, no error was raised. Can you give me a clue?
If html is not already a unicode object but a normal byte string, it is converted to unicode when it is concatenated with the converted version of rs[0]. If html contains special characters at that point, you can get a Unicode error.
So the other contents of html also need to be correctly decoded to unicode. If the special characters come from string literals, you could use unicode literals (like u"abcä") instead.
Your call to unicode() is working correctly. It is the concatenation, which is adding a unicode object to a byte string, that is causing trouble. If you change the first line to u'''<tr><td>''', (or u'<tr><td>') it should work fine.
Edit: This means your error lies in the data that is already in html by the time python reaches this snippet:
>>> '\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 0: ordinal not in range(128)
>>> u'\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
u'\x9f<tr><td>\u56a5'
>>>
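For comparison, a Python 3 sketch of the same fix (rs0 below is a stand-in for rs[0]): decode the database bytes explicitly once, then build the message entirely from str:

```python
# Simulate a big5-encoded value coming from the database;
# U+56A5 is the character from the interpreter session above.
rs0 = '\u56a5'.encode('big5')

html = '<tr><td>'           # string literals are already text in Python 3
html += rs0.decode('big5')  # decode the database bytes explicitly

assert html == '<tr><td>\u56a5'
```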