Append bytes to string in Python

I have a byte array as str and want to send it (if I look at it through a debugger, it shows me that File.body is str). For that reason I have to build the message to send:
request_text += '\n'.join([
'',
'--%s' % boundary_id,
attachment_headers,
File.body,
])
But as soon as it tries to join the file body, I receive an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Still, the example I took this from was implemented exactly this way. How should I make the Python string join work with a byte string? Should I decode it somehow? But how, if it is not text, just bytes?

You are probably getting the error because your string has non-ASCII characters. The following is an example of how to encode/decode a string containing non-ASCII characters:
1) Convert the string to unicode:
string = "helloé"
u = unicode(string, 'utf-8')
2) Encode the string to UTF-8 before sending it over the network:
encoded = u.encode('utf-8')
3) Decode it from UTF-8 back to unicode on the other side:
decoded = encoded.decode('utf-8')
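In the questioner's situation, the practical fix is to keep every part of the message as a byte string: encode the text parts, then join, rather than mixing text with raw bytes. A minimal sketch in Python 3 syntax (the b'' literals also work in Python 2.7); boundary_id, attachment_headers, and file_body are hypothetical stand-ins for the question's values:

```python
boundary_id = "f3a9"                                       # hypothetical boundary
attachment_headers = "Content-Type: application/octet-stream"
file_body = b'\xff\xd8\xff'                                # raw bytes, e.g. the start of a JPEG

# Encode the text parts first, then join everything as bytes --
# never mix unicode text and raw bytes in one join.
request_body = b'\n'.join([
    b'',
    ('--%s' % boundary_id).encode('ascii'),
    attachment_headers.encode('ascii'),
    file_body,
])
```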

Related

UTF-8 encoding in str type Python 2

I have Python 2.7 code which retrieves a base64-encoded response from a server. This response is decoded using the base64 module (the b64decode / decodestring functions, which return str). Its decoded content has the Unicode code points of the original strings.
I need to convert these Unicode code points to UTF-8.
The original string has a substring content "Não". When I decode the responded string, it shows:
>>> encoded_str = ... # server response
>>> decoded_str = base64.b64decode(encoded_str)
>>> type(decoded_str)
<type 'str'>
>>> decoded_str[x:y]
'N\xe3o'
When I try to encode it to UTF-8, it leads to errors such as:
>>> decoded_str[x:y].encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
However, when this string is manually written in Unicode type, I can correctly convert it to my desired UTF-8 string.
>>> test_str = u'N\xe3o'
>>> test_str.encode('utf-8')
'N\xc3\xa3o'
I have to retrieve this response from the server and correctly generate a UTF-8 string which can be printed as "Não". How can I do this in Python 2?
You want to decode, not encode the byte string.
Think of it like this: a Unicode string was encoded into bytes, and these bytes were further encoded into base64.
To reverse this, you need to reverse both encodings, in the opposite order.
However, the sample you show most definitely isn't a valid UTF-8 byte string - 0xE3 in isolation is not a valid UTF-8 encoding. Most likely, the Unicode string was encoded using Latin-1 or a related encoding (the sample is much too small to establish this conclusively; other common candidates are the fugly Windows code page CP1252 and Latin-9).
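The full reversal can be sketched like this, assuming the server really did use Latin-1 (the server response is simulated locally here; u''/b'' literals work in both 2.7 and 3):

```python
import base64

# Simulate the server response: a Latin-1 encoded string, then base64
encoded_str = base64.b64encode(u'N\xe3o'.encode('latin-1'))

decoded_str = base64.b64decode(encoded_str)   # undo the base64 layer -> b'N\xe3o'
text = decoded_str.decode('latin-1')          # undo the character encoding -> u'Não'
utf8_bytes = text.encode('utf-8')             # re-encode as UTF-8 -> b'N\xc3\xa3o'
```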

Printing decoded JSON string

I am receiving a JSON string, passing it through json.loads, and ending up with an array of unicode strings. That's all well and good. One of the strings in the array is:
u'\xc3\x85sum'
which should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong, I did the following:
>>> 'Åsum'.encode('utf8')
'\xc3\x85sum'
>>> print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:
>>> print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding = 'utf8') but that changes nothing.
Is there a way to solve it? Make json.loads not make it unicode or make it decode using 'utf8' as I ask it to.
Edit:
The original string I receive look like this, or the part that causes trouble:
"\\u00c3\\u0085sum"
You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 data in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode from that as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
json.dumps(u'Åsum')
'"\\u00c5sum"'
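Putting the rescue together in one place, a sketch (Python 3 syntax; u'' literals also work in 2.7) that loads the malformed JSON from the question and repairs it with the Latin-1 round trip:

```python
import json

malformed = '"\\u00c3\\u0085sum"'   # the doubly-encoded JSON from the question
value = json.loads(malformed)       # -> u'\xc3\x85sum', mojibake
fixed = value.encode('latin1').decode('utf8')
print(fixed)                        # Åsum
```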

A script in python 2.7 urllib2 and json raises unicode error

import json
import urllib2
url='http://search.twitter.com/search.json?q=python'
open=urllib2.urlopen(url)
response=open.read().encode('utf8')
data=json.loads(response)
results=data['results']
for result in results:
    print result['from_user'] + ': ' + result['text'] + '\n'
gives the error UnicodeEncodeError: 'charmap' codec can't encode characters in position 16-24: character maps to <undefined>.
Anyone have a solution for this?
What you are looking to do is probably to decode and not encode the response.
A very short explanation of why: an HTTP server doesn't know how to send unicode characters, only bytes, so it uses an encoding like UTF-8 to translate those characters into bytes.
When you receive a response from the server, you receive this chunk of bytes, and if you want to translate it back into a sequence of unicode characters (basically a unicode object in Python), you have to decode it.
What adds to the confusion is that ASCII characters (codepoints below 128) are encoded by UTF-8 to exactly the same single bytes, so for those characters a unicode codepoint and its encoded form happen to coincide.
Hope this is helpful.
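Concretely, the fix is to decode the raw response bytes before handing them to json.loads, not encode them. A minimal sketch with a canned response standing in for open.read() (the Twitter endpoint in the question no longer exists):

```python
import json

# Pretend this is open.read() -- raw UTF-8 bytes from the server
raw = u'{"results": [{"from_user": "ana", "text": "ol\u00e1"}]}'.encode('utf-8')

data = json.loads(raw.decode('utf-8'))   # decode bytes -> unicode, then parse
for result in data['results']:
    print(result['from_user'] + ': ' + result['text'])
```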

Encoding and decoding in Python with MD5()

Running this code on Ubuntu 10.10 in Python 3.1.1
I am getting the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
And the position of the error changes depending on when I run the following code (not the real keys or secret):
sandboxAPIKey = "wed23hf5yxkbmvr9jsw323lkv5g"
sandboxSharedSecret = "98HsIjh39z"
import hashlib
import time

def buildAuthParams():
    authHash = hashlib.md5()
    # encoding because the update on md5() needs a binary rep of the string
    temp = str.encode(sandboxAPIKey + sandboxSharedSecret + repr(int(time.time())))
    print(temp)
    authHash.update(temp)
    # look at the string representation of the binary digest
    print(authHash.digest())
    # now I want to look at the string representation of the digest
    print(bytes.decode(authHash.digest()))
Here is the output of a run (with the sig and key information changed from the real output)
b'sdwe5yxkwewvr9j343434385gkbH4343h4343dz129443643474'
b'\x945EM3\xf5\xa6\xf6\x92\xd1\r\xa5K\xa3IO'
print(bytes.decode(authHash.digest()))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
I am assuming I am not getting something right with my call to decode, but I can't figure out what it is. The printed authHash.digest() looks valid to me.
I would really appreciate any ideas on how to get this to work.
When you try to decode a bytearray into a string, Python sequentially matches the bytes against valid characters of an encoding (by default, UTF-8); the exception is raised because a sequence of bytes can't be matched to a valid character in the UTF-8 alphabet.
The same will happen if you try to decode it using ASCII; any value greater than 127 is an invalid ASCII character.
So, if you are trying to get a printable version of the MD5 hash, you should use hexdigest(); this is the standard way of printing any type of hash, with each byte represented by 2 hexadecimal digits.
In order to do this you can use:
authHash.hexdigest()
If you need to use it in a url, you probably need to encode the bytearray into base64:
base64.b64encode(authHash.digest())
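Both options side by side, using throwaway key material instead of the question's real keys (works the same in 2.7 and 3, except that b64encode returns bytes in Python 3):

```python
import base64
import hashlib

authHash = hashlib.md5()
authHash.update(b'apikey' + b'secret' + b'1300000000')   # dummy inputs

hex_form = authHash.hexdigest()                # 32 hex characters, always printable
b64_form = base64.b64encode(authHash.digest()) # 24 characters, shorter but needs escaping in URLs

print(hex_form)
print(b64_form)
```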

Convert or strip out "illegal" Unicode characters

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However for some characters, it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?
Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
When you decode, just pass 'ignore' to strip those characters.
There are more error handlers for stripping or converting malformed data:
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
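Since the bytes come from MSSQL, they are quite possibly CP1252 rather than true Latin-1; 0x97, for instance, is CP1252's em dash. A sketch of the options, with the CP1252 guess clearly an assumption:

```python
raw = b'abcd\x97'                           # byte string with a non-ASCII byte

stripped = raw.decode('ascii', 'ignore')    # drops the bad byte       -> u'abcd'
marked = raw.decode('ascii', 'replace')     # replacement marker       -> u'abcd\ufffd'
proper = raw.decode('cp1252')               # if CP1252 is the source  -> u'abcd\u2014'
```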
