BOM in server response screws up json parsing

BOM in server response screws up json parsing - python

I'm trying to write a Python script that posts some JSON to a web server and gets some JSON back. I patched together a few different examples on StackOverflow, and I think I have something that's mostly working.
import urllib2
import json
url = "http://foo.com/API.svc/SomeMethod"
payload = json.dumps( {'inputs': ['red', 'blue', 'green']} )
headers = {"Content-type": "application/json;"}
req = urllib2.Request(url, payload, headers)
f = urllib2.urlopen(req)
response = f.read()
f.close()
data = json.loads(response) # <-- Crashes
The last line throws an exception:
ValueError: No JSON object could be decoded
When I look at response, I see valid JSON, but the first few characters are a BOM:
>>> response
'\xef\xbb\xbf[\r\n {\r\n ... Valid JSON here
So, if I manually strip out the first three bytes:
data = json.loads(response[3::])
Everything works and response is turned into a dictionary.
My Question:
It seems kinda silly that json barfs when you give it a BOM. Is there anything different I can do with urllib or the json library to let it know this is a UTF8 string and to handle it as such? I don't want to manually strip out the first 3 bytes.

You should probably yell at whoever's running this service, because a BOM on UTF-8 text makes no sense. The BOM exists to disambiguate byte order, and UTF-8 is defined as being little-endian.
That said, ideally you should decode bytes before doing anything else with them. Luckily, Python has a codec that recognizes and removes the BOM: utf-8-sig.
>>> '\xef\xbb\xbffoo'.decode('utf-8-sig')
u'foo'
So you just need:
data = json.loads(response.decode('utf-8-sig'))

In case I'm not the only one who experienced the same problem, but is using requests module instead of urllib2, here is a solution that works in Python 2.6 as well as 3.3:
import requests
r = requests.get(url, params=my_dict, auth=(user, pass))
print(r.headers['content-type']) # 'application/json; charset=utf8'
if r.text[0] == u'\ufeff': # bytes \xef\xbb\xbf in utf-8 encoding
r.encoding = 'utf-8-sig'
print(r.json())

Since I lack enough reputation for a comment, I'll write an answer instead.
I usually encounter that problem when I need to leave the underlying Stream of a StreamWriter open. However, the overload that has the option to leave the underlying Stream open needs an encoding (which will be UTF8 in most cases), here's how to do it without emitting the BOM.
/* Since Encoding.UTF8 (the one you'd normally use in those cases) **emits**
* the BOM, use whats below instead!
*/
// UTF8Encoding has an overload which enables / disables BOMs in the output
UTF8Encoding encoding = new UTF8Encoding(false);
using (MemoryStream ms = new MemoryStream())
using (StreamWriter sw = new StreamWriter(ms, encoding, 4096, true))
using (JsonTextWriter jtw = new JsonTextWriter(sw))
{
serializer.Serialize(jtw, myObject);
}

Related

How can i parse a json response? [duplicate]

I am getting error Expecting value: line 1 column 1 (char 0) when trying to decode JSON.
The URL I use for the API call works fine in the browser, but gives this error when done through a curl request. The following is the code I use for the curl request.
The error happens at return simplejson.loads(response_json)
response_json = self.web_fetch(url)
response_json = response_json.decode('utf-8')
return json.loads(response_json)
def web_fetch(self, url):
buffer = StringIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, url)
curl.setopt(curl.TIMEOUT, self.timeout)
curl.setopt(curl.WRITEFUNCTION, buffer.write)
curl.perform()
curl.close()
response = buffer.getvalue().strip()
return response
Traceback:
File "/Users/nab/Desktop/myenv2/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
111. response = callback(request, *callback_args, **callback_kwargs)
File "/Users/nab/Desktop/pricestore/pricemodels/views.py" in view_category
620. apicall=api.API().search_parts(category_id= str(categoryofpart.api_id), manufacturer = manufacturer, filter = filters, start=(catpage-1)*20, limit=20, sort_by='[["mpn","asc"]]')
File "/Users/nab/Desktop/pricestore/pricemodels/api.py" in search_parts
176. return simplejson.loads(response_json)
File "/Users/nab/Desktop/myenv2/lib/python2.7/site-packages/simplejson/__init__.py" in loads
455. return _default_decoder.decode(s)
File "/Users/nab/Desktop/myenv2/lib/python2.7/site-packages/simplejson/decoder.py" in decode
374. obj, end = self.raw_decode(s)
File "/Users/nab/Desktop/myenv2/lib/python2.7/site-packages/simplejson/decoder.py" in raw_decode
393. return self.scan_once(s, idx=_w(s, idx).end())
Exception Type: JSONDecodeError at /pricemodels/2/dir/
Exception Value: Expecting value: line 1 column 1 (char 0)

Your code produced an empty response body, you'd want to check for that or catch the exception raised. It is possible the server responded with a 204 No Content response, or a non-200-range status code was returned (404 Not Found, etc.). Check for this.
Note:
There is no need to use simplejson library, the same library is included with Python as the json module.
There is no need to decode a response from UTF8 to unicode, the simplejson / json .loads() method can handle UTF8 encoded data natively.
pycurl has a very archaic API. Unless you have a specific requirement for using it, there are better choices.
Either the requests or httpx offers much friendlier APIs, including JSON support. If you can, replace your call with:
import requests
response = requests.get(url)
response.raise_for_status() # raises exception when not a 2xx response
if response.status_code != 204:
return response.json()
Of course, this won't protect you from a URL that doesn't comply with HTTP standards; when using arbirary URLs where this is a possibility, check if the server intended to give you JSON by checking the Content-Type header, and for good measure catch the exception:
if (
response.status_code != 204 and
response.headers["content-type"].strip().startswith("application/json")
):
try:
return response.json()
except ValueError:
# decide how to handle a server that's misbehaving to this extent

Be sure to remember to invoke json.loads() on the contents of the file, as opposed to the file path of that JSON:
json_file_path = "/path/to/example.json"
with open(json_file_path, 'r') as j:
contents = json.loads(j.read())
I think a lot of people are guilty of doing this every once in a while (myself included):
contents = json.load(json_file_path)

Check the response data-body, whether actual data is present and a data-dump appears to be well-formatted.
In most cases your json.loads- JSONDecodeError: Expecting value: line 1 column 1 (char 0) error is due to :
non-JSON conforming quoting
XML/HTML output (that is, a string starting with <), or
incompatible character encoding
Ultimately the error tells you that at the very first position the string already doesn't conform to JSON.
As such, if parsing fails despite having a data-body that looks JSON like at first glance, try replacing the quotes of the data-body:
import sys, json
struct = {}
try:
try: #try parsing to dict
dataform = str(response_json).strip("'<>() ").replace('\'', '\"')
struct = json.loads(dataform)
except:
print repr(resonse_json)
print sys.exc_info()
Note: Quotes within the data must be properly escaped

With the requests lib JSONDecodeError can happen when you have an http error code like 404 and try to parse the response as JSON !
You must first check for 200 (OK) or let it raise on error to avoid this case.
I wish it failed with a less cryptic error message.
NOTE: as Martijn Pieters stated in the comments servers can respond with JSON in case of errors (it depends on the implementation), so checking the Content-Type header is more reliable.

Check encoding format of your file and use corresponding encoding format while reading file. It will solve your problem.
with open("AB.json", encoding='utf-8', errors='ignore') as json_data:
data = json.load(json_data, strict=False)

I had the same issue trying to read json files with
json.loads("file.json")
I solved the problem with
with open("file.json", "r") as read_file:
data = json.load(read_file)
maybe this can help in your case

A lot of times, this will be because the string you're trying to parse is blank:
>>> import json
>>> x = json.loads("")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
You can remedy by checking whether json_string is empty beforehand:
import json
if json_string:
x = json.loads(json_string)
else:
# Your code/logic here
x = {}

I encounterred the same problem, while print out the json string opened from a json file, found the json string starts with 'ï»¿', which by doing some reserach is due to the file is by default decoded with UTF-8, and by changing encoding to utf-8-sig, the mark out is stripped out and loads json no problem:
open('test.json', encoding='utf-8-sig')

This is the minimalist solution I found when you want to load json file in python
import json
data = json.load(open('file_name.json'))
If this give error saying character doesn't match on position X and Y, then just add encoding='utf-8' inside the open round bracket
data = json.load(open('file_name.json', encoding='utf-8'))
Explanation
open opens the file and reads the containts which later parse inside json.load.
Do note that using with open() as f is more reliable than above syntax, since it make sure that file get closed after execution, the complete sytax would be
with open('file_name.json') as f:
data = json.load(f)

There may be embedded 0's, even after calling decode(). Use replace():
import json
struct = {}
try:
response_json = response_json.decode('utf-8').replace('\0', '')
struct = json.loads(response_json)
except:
print('bad json: ', response_json)
return struct

I had the same issue, in my case I solved like this:
import json
with open("migrate.json", "rb") as read_file:
data = json.load(read_file)

I was having the same problem with requests (the python library). It happened to be the accept-encoding header.
It was set this way: 'accept-encoding': 'gzip, deflate, br'
I simply removed it from the request and stopped getting the error.

Just check if the request has a status code 200. So for example:
if status != 200:
print("An error has occured. [Status code", status, "]")
else:
data = response.json() #Only convert to Json when status is OK.
if not data["elements"]:
print("Empty JSON")
else:
"You can extract data here"

I had exactly this issue using requests.
Thanks to Christophe Roussy for his explanation.
To debug, I used:
response = requests.get(url)
logger.info(type(response))
I was getting a 404 response back from the API.

In my case I was doing file.read() two times in if and else block which was causing this error. so make sure to not do this mistake and hold contain in variable and use variable multiple times.

In my case it occured because i read the data of the file using file.read() and then tried to parse it using json.load(file).I fixed the problem by replacing json.load(file) with json.loads(data)
Not working code
with open("text.json") as file:
data=file.read()
json_dict=json.load(file)
working code
with open("text.json") as file:
data=file.read()
json_dict=json.loads(data)

For me, it was not using authentication in the request.

For me it was server responding with something other than 200 and the response was not json formatted. I ended up doing this before the json parse:
# this is the https request for data in json format
response_json = requests.get()
# only proceed if I have a 200 response which is saved in status_code
if (response_json.status_code == 200):
response = response_json.json() #converting from json to dictionary using json library

I received such an error in a Python-based web API's response .text, but it led me here, so this may help others with a similar issue (it's very difficult to filter response and request issues in a search when using requests..)
Using json.dumps() on the request data arg to create a correctly-escaped string of JSON before POSTing fixed the issue for me
requests.post(url, data=json.dumps(data))

In my case it is because the server is giving http error occasionally. So basically once in a while my script gets the response like this rahter than the expected response:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<h1>502 Bad Gateway</h1>
<p>The proxy server received an invalid response from an upstream server.<hr/>Powered by Tengine</body>
</html>
Clearly this is not in json format and trying to call .json() will yield JSONDecodeError: Expecting value: line 1 column 1 (char 0)
You can print the exact response that causes this error to better debug.
For example if you are using requests and then simply print the .text field (before you call .json()) would do.

I did:
Open test.txt file, write data
Open test.txt file, read data
So I didn't close file after 1.
I added
outfile.close()
and now it works

If you are a Windows user, Tweepy API can generate an empty line between data objects. Because of this situation, you can get "JSONDecodeError: Expecting value: line 1 column 1 (char 0)" error. To avoid this error, you can delete empty lines.
For example:
def on_data(self, data):
try:
with open('sentiment.json', 'a', newline='\n') as f:
f.write(data)
return True
except BaseException as e:
print("Error on_data: %s" % str(e))
return True
Reference:
Twitter stream API gives JSONDecodeError("Expecting value", s, err.value) from None

if you use headers and have "Accept-Encoding": "gzip, deflate, br" install brotli library with pip install. You don't need to import brotli to your py file.

In my case it was a simple solution of replacing single quotes with double.
You can find my answer here

Validating webhook with sha256 hmac using PHP and Python

I am working with webhooks from Bold Commerce, which are validated using a hash of the timestamp and the body of the webhook, with a secret key as the signing key. The headers from the webhook looks like this :
X-Bold-Signature: 06cc9aab9fd856bdc326f21d54a23e62441adb5966182e784db47ab4f2568231
timestamp: 1556410547
Content-Type: application/json
charset: utf-8
According to their documentation, the hash is built like so (in PHP):
$now = time(); // current unix timestamp
$json = json_encode($payload, JSON_FORCE_OBJECT);
$signature = hash_hmac('sha256', $now.'.'.$json, $signingKey);
I am trying to recreate the same hash using python, and I am always getting the wrong value for the hash. I've tried several combinations, with and without base64 encoding. In python3, a bytes object is expected for the hmac, so I need to encode everything before I can compare it. At this point my code looks like so :
json_loaded = json.loads(request.body)
json_dumped = json.dumps(json_loaded)
# if I dont load and then dump the json, the message has escaped \n characters in it
message = timestamp + '.' + json_dumped
# => 1556410547.{"event_type" : "order.created", "date": "2020-06-08".....}
hash = hmac.new(secret.encode(), message.encode(), hashlib.sha256)
hex_digest = hash.hexdigest()
# => 3e4520c869552a282ed29b6523eecbd591fc19d1a3f9e4933e03ae8ce3f77bd4
# hmac_to_verify = 06cc9aab9fd856bdc326f21d54a23e62441adb5966182e784db47ab4f2568231
return hmac.compare_digest(hex_digest, hmac_to_verify)
Im not sure what I am doing wrong. For the other webhooks I am validating, I used base64 encoding, but it seems like here, that hasnt been used on the PHP side. I am not so familiar with PHP so maybe there is something I've missed in how they built the orginal hash. There could be complications coming from the fact that I have to convert back and forth between byte arrays and strings, maybe I am using the wrong encoding for that ? Please someone help, I feel like I've tried every combination and am at a loss.
EDIT : Tried this solution by leaving the body without encoding it in json and it still fails :
print(type(timestamp)
# => <class 'str'>
print(type(body))
# => <class 'bytes'>
# cant concatenate bytes to string...
message = timestamp.encode('utf-8') + b'.' + body
# => b'1556410547.{\n "event_type": "order.created",\n "event_time": "2020-06-08 11:16:04",\n ...
hash = hmac.new(secret.encode(), message, hashlib.sha256)
hex_digest = hash.hexdigest()
# ...etc
EDIT EDIT :
Actually it is working in production ! Thanks to the solution described above (concatenating everything as bytes). My Postman request with the faked webhook was still failing, but that's probably because of how I copied and pasted the webhook data from my heroku logs :) .

How to return this valid json data in Python?

I tested using Python to translate a curl to get some data.
import requests
import json
username="abc"
password="123"
headers = {
'Content-Type': 'application/json',
}
params = (
('version', '2017-05-01'),
)
data = '{"text":["This is message one."], "id":"en-es"}'
response = requests.post('https://somegateway.service/api/abc', headers=headers, params=params, data=data, auth=(username, password))
print(response.text)
The above works fine. It returns json data.
It seems ["This is message one."] is a list. I want to use a variable that loads a file to replace this list.
I tried:
with open(f,"r",encoding='utf-8') as fp:
file_in_list=fp.read().splitlines()
toStr=str(file_in_list)
data = '{"text":'+toStr+', "id":"en-es"}'
response = requests.post('https://somegateway.service/api/abc', headers=headers, params=params, data=data, auth=(username, password))
print(response.text)
But it returned error below.
{
"code" : 400,
"error" : "Mapping error, invalid JSON"
}
Can you help? How can I have valid response.text?
Thanks.
update:
The content of f contains only five lines below:
This is message one.
this is 2.
this is three.
this is four.
this is five.

The reason your existing code fails is that str applied to a list of strings will only rarely give you valid JSON. They're not intended to do the same thing. JSON only allows double-quoted strings; Python allows both single- and double-quoted strings. And, unless your strings all happen to include ' characters, Python will render them with single quotes:
>>> print(["abc'def"]) # gives you valid JSON, but only by accident
["abc'def"]
>>> print(["abc"]) # does not give you valid JSON
['abc']
If you want to get the valid JSON encoding of a list of strings, don't try to trick str into giving you valid JSON by accident, just use the json module:
toStr = json.dumps(file_in_list)
But, even more simply, you shouldn't be trying to figure out how to construct JSON strings in the first place. Just create a dict and json.dumps the whole thing:
data = {"text": file_in_list, "id": "en-es"}
data_str = json.dumps(data)
Being able to do this is pretty much the whole point of JSON: it's a simple way to automatically serialize all of the types that are common to all the major scripting languages.
Or, even better, let requests do it for you by passing a json argument instead of a data argument:
data = {"text": file_in_list, "id": "en-es"}
response = requests.post('https://somegateway.service/api/abc', headers=headers, params=params, json=data, auth=(username, password))
This also automatically takes care of setting the Content-Type header to application/json for you. You weren't doing that—and, while many servers will accept your input without it, it's illegal, and some servers will not allow it.
For more details, see the section More complicated POST requests in the requests docs. But there really aren't many more details.

tldr;
toStr = json.dumps(file_in_list)
Explanation
Assuming your file contains something like
String_A
String_B
You need to ensure that toStr is:
Enclosed by [ and ]
Every String in the list is enclosed by quotation marks.
So your raw json (as a String) is equal to '{"text":["String_A", "String_B"], "id":"en-es"}'

curl post request failing in the presence of special characters

Ok, I know there are too many questions on this topic already; reading every one of those hasn't helped me solve my problem.
I have " hello'© " on my webpage. My objective is to get this content as json, strip the "hello" and write back the remaining contents ,i.e, "'©" back on the page.
I am using a CURL POST request to write back to the webpage. My code for getting the json is as follows:
request = urllib2.Request("http://XXXXXXXX.json")
user = 'xxx'
base64string = base64.encodestring('%s:%s' % (xxx, xxx))
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request) #send URL request
newjson = json.loads(result.read().decode('utf-8'))
At this point, my newres is unicode string. I discovered that my curl post request works only with percentage-encoding (like "%A3" for £).
What is the best way to do this? The code I wrote is as follows:
encode_dict = {'!':'%21',
'"':'%22',
'#':'%24',
'$':'%25',
'&':'%26',
'*':'%2A',
'+':'%2B',
'#':'%40',
'^':'%5E',
'`':'%60',
'©':'\xa9',
'®':'%AE',
'™':'%99',
'£':'%A3'
}
for letter in text1:
print (letter)
for keyz, valz in encode_dict.iteritems():
if letter == keyz:
print(text1.replace(letter, valz))
path = "xxxx"
subprocess.Popen(['curl','-u', 'xxx:xxx', 'Content-Type: text/html','-X','POST','--data',"text="+text1, ""+path])
This code gives me an error saying " UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if letter == keyz:"
Is there a better way to do this?

The problem was with the encoding. json.loads() returns a stream of bytes and needs to be decoded to unicode, using the decode() fucntion. Then, I replaced all non-ascii characters by encoding the unicode with ascii encoding using encode('ascii','xmlcharrefreplace').
newjson = json.loads(result.read().decode('utf-8').encode("ascii","xmlcharrefreplace"))
Also, learning unicode basics helped me a great deal! This is an excellent tutorial.

python jsonify dictionary in utf-8

I want to get json data into utf-8
I have a list my_list = []
and then many appends unicode values to the list like this
my_list.append(u'ტესტ')
return jsonify(result=my_list)
and it gets
{
"result": [
"\u10e2\u10d4\u10e1\u10e2",
"\u10e2\u10dd\u10db\u10d0\u10e8\u10d5\u10d8\u10da\u10d8"
]
}

Use the following config to add UTF-8 support:
app.config['JSON_AS_ASCII'] = False

Use the standard-library json module instead, and set the ensure_ascii keyword parameter to False when encoding, or do the same with flask.json.dumps():
>>> data = u'\u10e2\u10d4\u10e1\u10e2'
>>> import json
>>> json.dumps(data)
'"\\u10e2\\u10d4\\u10e1\\u10e2"'
>>> json.dumps(data, ensure_ascii=False)
u'"\u10e2\u10d4\u10e1\u10e2"'
>>> print json.dumps(data, ensure_ascii=False)
"ტესტ"
>>> json.dumps(data, ensure_ascii=False).encode('utf8')
'"\xe1\x83\xa2\xe1\x83\x94\xe1\x83\xa1\xe1\x83\xa2"'
Note that you still need to explicitly encode the result to UTF8 because the dumps() function returns a unicode object in that case.
You can make this the default (and use jsonify() again) by setting JSON_AS_ASCII to False in your Flask app config.
WARNING: do not include untrusted data in JSON that is not ASCII-safe, and then interpolate into a HTML template or use in a JSONP API, as you can cause syntax errors or open a cross-site scripting vulnerability this way. That's because JSON is not a strict subset of Javascript, and when disabling ASCII-safe encoding the U+2028 and U+2029 separators will not be escaped to \u2028 and \u2029 sequences.

If you still want to user flask's json and ensure the utf-8 encoding then you can do something like this:
from flask import json,Response
#app.route("/")
def hello():
my_list = []
my_list.append(u'ტესტ')
data = { "result" : my_list}
json_string = json.dumps(data,ensure_ascii = False)
#creating a Response object to set the content type and the encoding
response = Response(json_string,content_type="application/json; charset=utf-8" )
return response
#I hope this helps

In my case the above solution was not sufficient. (Running flask on the GCP App Engine flexible environment). I ended up doing:
json_str = json.dumps(myDict, ensure_ascii = False, indent=4, sort_keys=True)
encoding = chardet.detect(json_str)['encoding']
json_unicode = json_str.decode(encoding)
json_utf8 = json_unicode.encode('utf-8')
response = make_response(json_utf8)
response.headers['Content-Type'] = 'application/json; charset=utf-8'
response.headers['mimetype'] = 'application/json'
response.status_code = status

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

BOM in server response screws up json parsing - python

Related

How can i parse a json response? [duplicate]

Validating webhook with sha256 hmac using PHP and Python

How to return this valid json data in Python?

curl post request failing in the presence of special characters

python jsonify dictionary in utf-8

Categories

Resources