I have data in the bytes format below:
{'command': 'MESSAGE', 'body': b'\x04\x08{\x0b:\tbody"n\x04\x08{\x08:\tdata{\n:\x0bstdout"\x14output-data\n:\rexitcodei\x00:\x0bstderr"\x00:\x0boutput0:\nerror0:\x0estatusmsg"\x07OK:\x0fstatuscodei\x00:\rsenderid"\x13server1:\thash"%903ff3bf7e9212105df23c92dd8f718a:\x10senderagent"\ntoktok:\x0cmsgtimel+\x07\xf6\xb9hZ:\x0erequestid"%7a358c34f8f9544sd2350c99953d0eec', 'rawHeaders': [('content-length', '264'), ('expires', '1516812860547'), ('destination', '/queue/test.queue'), ('priority', '4'), ('message-id', '12345678'), ('content-type', 'text/plain; charset=UTF-8'), ('timestamp', '1516812790347')]}
and I am trying to decode it and convert it to JSON-formatted data, but it is not working. I tried data.decode() and data.decode('utf-8'), and json.loads as well, but nothing works.
When I tried data.decode('utf-8') I got the error below:
'utf-8' codec can't decode byte 0xf6 in position 215: invalid start byte
and when I tried data.decode('ascii') I got the error below:
'ascii' codec can't decode byte 0xa9 in position 215: ordinal not in range(128)
I am not sure whether I am going about this data conversion and parsing the right way, or whether I am missing something.
Update 1:
I just found out that this data is generated by Ruby with the PSK security plugin, and that the message object has a .decode! public method. Is there any way to use the same method in Python to decode it? If possible, using PSK would also be fine.
JSON is a Unicode format; it cannot (easily and transparently) accommodate arbitrary byte strings. What you can easily do is save a blob in some textual format -- base64 is a common choice. But of course, all producers and consumers need to share an understanding of how to decode the blob, rather than just use it as text.
Python 3.5.1 (default, Dec 26 2015, 18:08:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> d = {'json': True, 'data': b'\xff\xff\xff'}
>>> json.dumps(d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
... yada yada yada ...
TypeError: b'\xff\xff\xff' is not JSON serializable
>>> import base64
>>> base64.b64encode(d['data'])
b'////'
>>> base64.b64encode(d['data']).decode('ascii')
'////'
>>> d['data_base64'] = base64.b64encode(d['data']).decode('ascii')
>>> del d['data']
>>> json.dumps(d)
'{"json": true, "data_base64": "////"}'
I very specifically used a different name for the encoded field to avoid having any consumer think that the base64 blob is the actual value for the data member.
Random binary data most definitely isn't valid UTF-8 so obviously cannot be decoded using that codec. UTF-8 is a very specific encoding for Unicode text which cannot really be used for data which isn't exactly that. You usually encode, rather than decode, binary data for transport, and need to have something at the other end decode it back into bytes. Here, that encoding is base64, but anything which can transparently embed binary as text will do.
If that is your data and you are trying to round-trip it through the JSON serializable format, this will do it:
import json
import base64
data = {'command': 'MESSAGE',
        'body': b'\x04\x08{\x0b:\tbody"n\x04\x08{\x08:\tdata{\n:\x0bstdout"\x14output-data\n:\rexitcodei\x00:\x0bstderr"\x00:\x0boutput0:\nerror0:\x0estatusmsg"\x07OK:\x0fstatuscodei\x00:\rsenderid"\x13server1:\thash"%903ff3bf7e9212105df23c92dd8f718a:\x10senderagent"\ntoktok:\x0cmsgtimel+\x07\xf6\xb9hZ:\x0erequestid"%7a358c34f8f9544sd2350c99953d0eec',
        'rawHeaders': [('content-length', '264'), ('expires', '1516812860547'), ('destination', '/queue/test.queue'), ('priority', '4'), ('message-id', '12345678'), ('content-type', 'text/plain; charset=UTF-8'), ('timestamp', '1516812790347')]}
# Make a copy of the original data and base64 for bytes content.
datat = data.copy()
datat['body'] = base64.encodebytes(datat['body']).decode('ascii')
# Now it serializes
jsondata = json.dumps(datat)
print(jsondata)
# Read it back and decode the base64 field back to its original bytes value
data2 = json.loads(jsondata)
data2['body'] = base64.decodebytes(data2['body'].encode('ascii'))
# For comparison, since the tuples in 'rawHeaders' are read back as lists by JSON,
# convert the list entries back to tuples.
data2['rawHeaders'] = [tuple(x) for x in data2['rawHeaders']]
# Did the data restore correctly?
print(data == data2)
Output:
{"command": "MESSAGE", "body": "BAh7CzoJYm9keSJuBAh7CDoJZGF0YXsKOgtzdGRvdXQiFG91dHB1dC1kYXRhCjoNZXhpdGNvZGVp\nADoLc3RkZXJyIgA6C291dHB1dDA6CmVycm9yMDoOc3RhdHVzbXNnIgdPSzoPc3RhdHVzY29kZWkA\nOg1zZW5kZXJpZCITc2VydmVyMToJaGFzaCIlOTAzZmYzYmY3ZTkyMTIxMDVkZjIzYzkyZGQ4Zjcx\nOGE6EHNlbmRlcmFnZW50Igp0b2t0b2s6DG1zZ3RpbWVsKwf2uWhaOg5yZXF1ZXN0aWQiJTdhMzU4\nYzM0ZjhmOTU0NHNkMjM1MGM5OTk1M2QwZWVj\n", "rawHeaders": [["content-length", "264"], ["expires", "1516812860547"], ["destination", "/queue/test.queue"], ["priority", "4"], ["message-id", "12345678"], ["content-type", "text/plain; charset=UTF-8"], ["timestamp", "1516812790347"]]}
True
I'm using FastAPI to retrieve a mongo document that contains some bytes. The structure is as follows:
item = {
    "name": "xyz",
    "value1": b"\x89PNG\r\n\sla\...",
    ...
    "some_other_byte": b"\x89PNG\r\n\sla\..."
}
When returning the above data from a POST request in FastAPI, it tries to convert it to JSON, but fails to do so automatically.
So I tried this:
json_compatible_item_data = jsonable_encoder(item)
but then I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
Is there a way to automatically convert the above dict into JSON so it can be returned from a REST API? What would be the best way to do that?
With FastAPI's jsonable_encoder you can pass custom encoders. Example of converting an arbitrary bytes object to a base64 str:
import base64
from fastapi.encoders import jsonable_encoder

json_compatible_item_data = jsonable_encoder(item, custom_encoder={
    bytes: lambda v: base64.b64encode(v).decode('utf-8')})
Decoding target fields on the client side can be done like this:
value1 = base64.b64decode(response_dict["value1"])
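The same round trip can be sketched without FastAPI at all; here is a minimal, self-contained illustration (the field names and payload are invented for the example):

```python
import base64
import json

item = {"name": "xyz", "value1": b"\x89PNG\r\n"}  # hypothetical payload with raw bytes

# Server side: replace every bytes value with its base64 text form before serializing.
encodable = {k: base64.b64encode(v).decode("ascii") if isinstance(v, bytes) else v
             for k, v in item.items()}
body = json.dumps(encodable)  # now serializes without an encoding error

# Client side: decode the fields known to be binary back to bytes.
received = json.loads(body)
value1 = base64.b64decode(received["value1"])
assert value1 == item["value1"]
```

As with the custom_encoder approach above, both sides must agree on which fields carry base64 text rather than plain strings.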
In my case, the jsonable_encoder is an unnecessary wrapper around the lambda. I just call base64 directly on the pyodbc data row...
import base64

column_names = tuple(c[0] for c in cursor.description)
for row in cursor:
    row = [base64.b64encode(x).decode() if isinstance(x, bytes) else x for x in row]
    yield dict(zip(column_names, row))
But seriously, why is this necessary? For everything else, FastAPI just works "out of the box". This seems like a bug.
Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json; more context here) into the actual JSON file (I need that format, as Ansible only accepts that).
The foo.json file is obtained using this google python api method
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64

with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)

b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds, validate=True)
outputStr = b64decodedBytes
print(outputStr)

# issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
File "./test.py", line 17, in <module>
outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas and spent now more than a day on this :(
what am I doing wrong?
Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try decoding your base64-decoded string with the encoding given in the content type:
b64decodedBytes.decode('Shift_JIS')
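If the encoding is not known in advance, one hedged approach is to try a few candidate codecs in order; a sketch (the payload here is invented for illustration: base64 of UTF-8 Japanese text):

```python
import base64

payload = "44GT44KT44Gr44Gh44Gv"  # invented sample: base64 of UTF-8-encoded Japanese text
raw = base64.b64decode(payload)

decoded = None
for enc in ("utf-8", "shift_jis", "euc_jp"):
    try:
        decoded = raw.decode(enc)
        break  # first codec that accepts the bytes wins
    except UnicodeDecodeError:
        continue

print(decoded)
```

Note the caveat: some codecs (latin-1, for instance) accept any byte sequence, so a "successful" decode does not prove the guess was right; put stricter codecs first. And if the data is genuinely binary (e.g. a PKCS#12 key), no text codec will give a meaningful result.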
I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},
This is the code I'm using:
def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        item['name'] = sel.xpath('a/div/h3/text()').extract()
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract().replace("$", "")
        yield item
How would I output unescaped Unicode characters onto the JSON?
Edit (2016-10-19):
With Scrapy 1.2+, you can set FEED_EXPORT_ENCODING to the character encoding you need for the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value being None, which means \uXXXX escaping).
Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.
Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965
Scrapy's default JSON exporter uses (the default) ensure_ascii=True argument, so it outputs Unicode characters as \uXXXX sequences before writing to file. (This is what is used when doing -o somefile.json)
Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up as UTF-8 encoded on file. See custom exporter code at the bottom here.
To illustrate, let's read your input JSON string back into some data to work on:
>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}
The input with \uXXXX sequences is valid JSON for Python (as it should), and loads() produces a valid Python dict.
Now let's serialize to JSON again:
>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>
And now with ensure_ascii=False
>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>
Let's print to see the difference:
>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}
>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
If you want to write JSON items as UTF-8, you can do it like this:
1. Define a custom item exporter, e.g. in an exporters.py file in your project:
$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter
class Utf8JsonItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
2. Replace the default JSON item exporter in your settings.py:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
Use the codecs module for text -> text decoding (in Python 2 it's not strictly necessary, but in Python 3 str doesn't have a decode method, because its methods are for str -> bytes and back, not str -> str). Decoding with the unicode_escape codec will get you the correct data back:
import codecs
somestr = codecs.decode(strwithescapes, 'unicode-escape')
So to fix the names you're getting, you'd do:
item['name'] = codecs.decode(sel.xpath('a/div/h3/text()').extract_first(), 'unicode-escape')  # extract_first(): decode needs a single string, not the list extract() returns
If the problem is in JSON you're producing, you'd want to just make sure the json module isn't forcing strings to be ASCII with character encodings; it does so by default because not all JSON parsers can handle true Unicode characters (they often assume data is sent as ASCII bytes with escapes). So wherever you call json.dump/json.dumps (or create a json.JSONEncoder), make sure to explicitly pass ensure_ascii=False.
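Both fixes can be shown in a few lines; a minimal sketch (the sample strings are invented for illustration):

```python
import codecs
import json

# A str containing literal \uXXXX escape sequences, not real characters:
escaped = "\\u58c1\\u6bb4"
print(codecs.decode(escaped, "unicode-escape"))  # 壁殴

# ensure_ascii=False keeps real Unicode characters in the JSON output:
print(json.dumps({"name": "壁殴"}, ensure_ascii=False))  # {"name": "壁殴"}
```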
I have the following functions to create cipher text and then save it:
def create_credential(self):
    des = DES.new(CIPHER_N, DES.MODE_ECB)
    text = str(uuid.uuid4()).replace('-', '')[:16]
    cipher_text = des.encrypt(text)
    return cipher_text

def decrypt_credential(self, text):
    des = DES.new(CIPHER_N, DES.MODE_ECB)
    return des.decrypt(text)

def update_access_credentials(self):
    self.access_key = self.create_credential()
    print repr(self.access_key)  # "\xf9\xad\xfbO\xc1lJ'\xb3\xda\x7f\x84\x10\xbbv&"
    self.access_password = self.create_credential()
    self.save()
And I will call:
>>> from main.models import *
>>> u=User.objects.all()[0]
>>> u.update_access_credentials()
And this is the stacktrace I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 738: invalid start byte
Why is this occurring and how would I get around it?
You are storing a bytestring in a Unicode database field, so the database layer will try to decode it to Unicode.
Either use a database field that can store opaque binary data, decode explicitly to Unicode (latin-1 maps bytes one-on-one to Unicode codepoints) or wrap your data into a representation that can be stored as text.
For Django 1.6 and up, use a BinaryField, for example. For earlier versions, using a binary-to-text conversion (such as Base64) would be preferable over decoding to Latin-1; the result of the latter would not give you meaningful textual data but Django may try to display it as such (in the admin interface for example).
It's occurring because you're attempting to save non-text data in a text field. Either use a non-text field instead, or encode the data as text via e.g. Base-64 encoding.
Using base64 encoding and decoding here fixed this:
import base64

def create_credential(self):
    des = DES.new(CIPHER_N, DES.MODE_ECB)
    text = str(uuid.uuid4()).replace('-', '')[:16]
    cipher_text = des.encrypt(text)
    base64_encrypted_message = base64.b64encode(cipher_text)
    return base64_encrypted_message

def decrypt_credential(self, text):
    text = base64.b64decode(text)
    des = DES.new(CIPHER_N, DES.MODE_ECB)
    message = des.decrypt(text)
    return message
I have a problem with a UnicodeEncodeError in my users_information list:
{u'\u0633\u062a\u064a\u062f#nimbuzz.com': {'UserName': u'\u0633\u062a\u064a\u062f#nimbuzz.com', 'Code': 5, 'Notes': '', 'Active': 0, 'Date': '12/07/2014 14:16', 'Password': '560pL390T', 'Email': u'yuyb0y#gmail.com'}}
And I need to run this code to get users information:
def get_users_info(type, source, parameters):
    users_registertion_file = 'static/users_information.txt'
    fp = open(users_registertion_file, 'r')
    users_information = eval(fp.read())
    if parameters:
        jid = parameters + "#nimbuzz.com"
        if users_information.has_key(jid):
            reply(type, source, u"User name:\n" + str(users_information[jid]['UserName']) + u"\nPassword:\n" + str(users_information[jid]['Password']) + u"\nREG-code:\nP" + str(users_information[jid]['Code']) + u"\nDate:\n" + str(users_information[jid]['Date']) + u"\naccount status:\n " + str(users_information[jid]['Active']))
        else:
            reply(type, source, u"This user " + parameters + u" not in user list")
    else:
        reply(type, source, u"write the id after command")
but when I try to get users information I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I tried encoding the jid using .encode('utf8'):
jid = parameters.encode('utf8')+"#nimbuzz.com"
but I get the same error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
How can I solve this problem? As you can see, the UserName key in the users_information list looks like:
u'\u0633\u062a\u064a\u062f#nimbuzz.com'
and the users_information list is located in a txt file.
You'll not find your user information unless jid is a unicode string. Make sure parameters is a unicode value here, and it'll be easier to use string formatting here:
jid = u"{}#nimbuzz.com".format(parameters)
If you use an encoded bytestring, Python will not find your username in the dictionary as it won't know what encoding you used for the string and won't implicitly decode or encode to make the comparisons.
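The lookup mismatch is easy to demonstrate with the same kind of key (the value stored under it is invented):

```python
# The dictionary key is a Unicode (text) string with non-ASCII characters:
jid_text = u"\u0633\u062a\u064a\u062f#nimbuzz.com"
users = {jid_text: {"Code": 5}}

assert jid_text in users                      # text key: found
assert jid_text.encode("utf-8") not in users  # encoded bytestring: not found
```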
Next, you cannot call str() on a Unicode value without specifying a codec:
str(users_information[jid]['UserName'])
This is guaranteed to throw an UnicodeEncodeError exception if users_information[jid]['UserName'] contains anything other than ASCII codepoints.
You need to use Unicode values throughout, leave encoding the value to the last possible moment (preferably by leaving it to a library).
You can use string formatting with unicode objects here too:
reply(type, source,
      u"User name:\n{0[UserName]}\nPassword:\n{0[Password]}\n"
      u"REG-code:\nP{0[Code]}\nDate:\n{0[Date]}\n"
      u"account status:\n {0[Active]}".format(users_information[jid]))
This interpolates the various keys from users_information[jid] without calling str on each value.
Note that dict.has_key() has been deprecated; use the in operator to test for a key instead:
if jid in users_information:
Last but not least, don't use eval() if you can avoid it. You should use JSON here for the file format, but if you cannot influence that then at least use ast.literal_eval() on the file contents instead of eval() and limit permissible input to just Python literal syntax:
import ast
# ...
users_information = ast.literal_eval(fp.read())
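To illustrate the difference, ast.literal_eval() accepts the literal syntax this kind of file uses but refuses anything executable (the sample data below is invented):

```python
import ast

# A file payload consisting only of Python literals parses fine:
raw = "{u'\u0633\u062a\u064a\u062f#nimbuzz.com': {'Code': 5, 'Active': 0}}"
users_information = ast.literal_eval(raw)
print(users_information[u'\u0633\u062a\u064a\u062f#nimbuzz.com']['Code'])  # 5

# Anything beyond plain literals is rejected rather than executed:
try:
    ast.literal_eval("__import__('os').system('true')")
except ValueError:
    print("rejected")
```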
I had a similar problem years ago:
jid = parameters+"#nimbuzz.com"
must be
jid = parameters+u"#nimbuzz.com"
and put this at the first or second line of the file:
#coding:utf8
An example for Martijn Pieters, on my machine:
Python 2.7.8 (default, Jul 1 2014, 17:30:21)
[GCC 4.9.0 20140604 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'asdf'
>>> b='ваывап'
>>> a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
>>> c=u'аыиьт'
>>> a+c
u'asdf\u0430\u044b\u0438\u044c\u0442'
>>>