I am calling the URL
http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json
using urllib2 and decoding the response with the json module:
url = "http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json"
request = urllib2.Request(query)
response = urllib2.urlopen(request)
issue_report = json.loads(response.read())
I run into the following error:
ValueError: Invalid control character at: line 1 column 1120 (char 1120)
I tried checking the headers and got the following:
Content-Type: application/json; charset=UTF-8
Access-Control-Allow-Origin: *
Expires: Sun, 03 Jul 2011 17:38:38 GMT
Date: Sun, 03 Jul 2011 17:38:38 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"CUEGQX47eCl7ImA9WxJaFEw."
Last-Modified: Tue, 04 Aug 2009 19:20:20 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
I also tried adding an encoding parameter as follows:
issue_report = json.loads(response.read(), encoding='UTF-8')
I still run into the same error.
The feed has raw data from a JPEG in it at that point; the JSON is malformed, so it's not your fault. Report a bug to Google.
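If you just need to get at the rest of the feed in the meantime, json.loads accepts a strict=False flag that tolerates control characters inside string literals. This is a workaround sketch, not a fix for the malformed feed, and it assumes the stray bytes only ever appear inside string values:

import json
import urllib2

url = "http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json"
raw = urllib2.urlopen(url).read()

# strict=False lets the decoder accept control characters (< 0x20)
# inside strings instead of raising ValueError
issue_report = json.loads(raw, strict=False)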
You could also consider using lxml instead, since the JSON is malformed. Its XPath support makes working with the XML feed pretty straightforward:
import lxml.etree

# fetch the Atom feed (no ?alt=json) and parse it straight from the URL
url = 'http://code.google.com/feeds/issues/p/chromium/issues/full/291'
doc = lxml.etree.parse(url)

# select every element in the project-hosting issues namespace
ns = {'issues': 'http://schemas.google.com/projecthosting/issues/2009'}
issues = doc.xpath('//issues:*', namespaces=ns)
It's fairly easy to manipulate the elements; for instance, to strip the namespace from the tags and convert to a dict:
>>> dict((x.tag[len(ns['issues'])+2:], x.text) for x in issues)
{'closedDate': '2009-08-04T19:20:20.000Z',
'id': '291',
'label': 'Area-BrowserUI',
'stars': '13',
'state': 'closed',
'status': 'Verified'}
Building off of this question, I'm using a Python script to call the API detailed in the link below:
https://developer.wmata.com/docs/services/gtfs/operations/5cdc51ea7a6be320cab064fe?
I use the code below to call the API:
import json
import requests

API_URL = 'https://api.wmata.com'

# helpers for building the request URL and fetching data
def _prepare_url(path):
    return f'{API_URL}/{path.lstrip("/")}'

def pull_data(path, params=None, headers=None):
    url = _prepare_url(path)
    return requests.get(url, params=params, headers=headers)

# print results in a cleaner format
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

# authenticate with your api key
headers = {
    "api_key": "myKey",
}

response = pull_data('/gtfs/bus-gtfsrt-tripupdates.pb', headers=headers)
print(response.content)
print(response.headers)
print(response.url)
But it returns a meaningless stream of data along with the following headers:
Request-Context: appId=cid-v1:2833aead-1a1f-4ffd-874e-ef3a5ceb1de8
Cache-Control: public, must-revalidate, max-age=5
Date: Thu, 11 Feb 2021 22:05:31 GMT
ETag: 0x8D8CED90CC8419C
Content-Length: 625753
Content-MD5: fspEFl7LJ8QbZPgf677WqQ==
Content-Type: application/octet-stream
Expires: Thu, 11 Feb 2021 22:05:37 GMT
Last-Modified: Thu, 11 Feb 2021 22:04:49 GMT
b'\n\r\n\x031.0\x10\x00\x18\xd9\xef\xa5\x81\x06\x12\xee\x02\n\n1932817010\x1a\xdf\x02\n\x1a\n\n1932817010\x1a\x0820210214*\x0233\x12\x13\x08\x02\x1a\x06\x10\x9c\xff\xa5\x81\x06"\x0513752...'
Any guidance on how to go about reading this kind of response?
GTFS-rt is transported in a compact binary encoding called a "protobuf" (Protocol Buffers). Your Python script will need the gtfs-realtime.proto file (which contains a definition of the expected contents of a GTFS-rt feed) along with Google's protobuf Python package in order to decode the response.
Here is an example of how to read a GTFS-rt API in Python from the documentation: https://developers.google.com/transit/gtfs-realtime/examples/python-sample.
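As a minimal sketch, assuming the gtfs-realtime-bindings package (which bundles the compiled gtfs-realtime.proto schema) and reusing the endpoint and api_key header from the question:

import requests
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

headers = {"api_key": "myKey"}  # your WMATA key, as in the question
url = 'https://api.wmata.com/gtfs/bus-gtfsrt-tripupdates.pb'
response = requests.get(url, headers=headers)

# parse the binary protobuf payload into a typed FeedMessage
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)

for entity in feed.entity:
    if entity.HasField('trip_update'):
        print(entity.trip_update.trip.trip_id)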
I have built a REST interface. On '400 Bad Request' it returns a JSON body with specific information about the error.
(Pdb) error.code
400
Python correctly throws a URLError with these headers
(Pdb) print(error.headers)
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 20 Aug 2016 13:01:05 GMT
Connection: close
Content-Length: 236
There is a body of 236 characters, but I cannot find a way to read it.
I can see the extra information using the DHC Chrome plugin:
{
    "error_code": "00000001",
    "error_message": "The json data is not in the correct json format.\r\nThe json data is not in the correct json format.\r\n'Execution Start Time' must not be empty.\r\n'Execution End Time' must not be empty.\r\n"
}
Here are some of the things I have tried in Python and what was returned:
(Pdb) len(error.read())
0
(Pdb) error.read().decode('utf-8', 'ignore')
''
(Pdb) error.readline()
b''
I found that this works the first time it is called, but does not work if called again:
error.read().decode('utf-8')
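That behavior is expected: the error object doubles as a file-like response, so its body is a stream that can only be consumed once. A minimal sketch of reading it once and caching the bytes (Python 3 urllib shown, urllib2 behaves the same; the endpoint is hypothetical):

import json
from urllib import error, request

try:
    request.urlopen('https://example.invalid/api')  # hypothetical endpoint
except error.HTTPError as e:
    body = e.read()  # consume the stream exactly once and keep the bytes
    details = json.loads(body.decode('utf-8'))
    print(details['error_code'], details['error_message'])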
I want to specify the HTTP response charset by modifying the Content-Type header. However, it doesn't work.
Here is a short example:
# coding=utf-8
import cherrypy

class Website:
    @cherrypy.expose
    def index(self):
        cherrypy.response.headers['Content-Type'] = 'text/plain; charset=gbk'
        return '。。。'.encode('gbk')

cherrypy.quickstart(Website(), '/', {
    '/': {
        'tools.response_headers.on': True,
    }
})
And when I visit that page, the Content-Type is mysteriously changed to text/plain;charset=utf-8, causing mojibake in the browser.
C:\Users\Administrator>ncat 127.0.0.1 8080 -C
GET / HTTP/1.1
Host: 127.0.0.1:8080
HTTP/1.1 200 OK
Server: CherryPy/7.1.0
Content-Length: 6
Content-Type: text/plain;charset=utf-8
Date: Mon, 22 Aug 2016 01:08:13 GMT
。。。^C
It seems that CherryPy detects the content encoding and overrides the charset automatically. If so, how can I disable this behavior?
All right, I solved this problem by tampering with cherrypy.response.header_list directly:
# coding=utf-8
import cherrypy

def set_content_type():
    # swap in (or append) the desired Content-Type after the body is rendered
    header = (b'Content-Type', cherrypy.response._content_type.encode())
    for ind, (key, _) in enumerate(cherrypy.response.header_list):
        if key.lower() == b'content-type':
            cherrypy.response.header_list[ind] = header
            break
    else:
        cherrypy.response.header_list.append(header)

cherrypy.tools.set_content_type = cherrypy.Tool('on_end_resource', set_content_type)

class Website:
    @cherrypy.expose
    @cherrypy.tools.set_content_type()
    def index(self):
        cherrypy.response._content_type = 'text/plain; charset=gbk'
        return '。。。'.encode('gbk')

cherrypy.quickstart(Website(), '/')
I had success setting the Content-Type charset by manipulating the request header "Accept-Charset":
cherrypy.request.headers["Accept-Charset"] = "ISO-8859-1"
cherrypy.response.headers["Content-Type"] = "text/xml"
The result:
>curl -I https://127.0.0.1/url?param=value
HTTP/1.1 200 OK
Content-Type: text/xml;charset=ISO-8859-1
Server: CherryPy/18.6.0
Date: Mon, 10 Aug 2020 11:54:49 GMT
Content-Length: 288
Set-Cookie: session_id=d28fa46a1a3d901d9502038255ce869b21add5cc; expires=Mon, 10 Aug 2020 12:54:49 GMT; Max-Age=3600; Path=/
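For context, here is a minimal sketch of where those two lines sit in a handler. It relies on CherryPy's default encode tool negotiating the output charset from Accept-Charset, so treat that as an assumption to verify against your CherryPy version:

import cherrypy

class Root:
    @cherrypy.expose
    def index(self, param=None):
        # overriding the request's Accept-Charset steers the encode tool's
        # charset negotiation (assumption: the default encode tool is active)
        cherrypy.request.headers["Accept-Charset"] = "ISO-8859-1"
        cherrypy.response.headers["Content-Type"] = "text/xml"
        return '<resp>value</resp>'

cherrypy.quickstart(Root(), '/')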
I am trying to access OAuth2 in Python (the code is the same as http://code.google.com/p/google-api-ads-python/source/browse/trunk/examples/adspygoogle/adwords/v201302/misc/use_oauth2.py?spec=svn139&r=139):
import sys

from oauth2client.client import FlowExchangeError, OAuth2WebServerFlow

flow = OAuth2WebServerFlow(client_id='XXX',
                           client_secret='YYY',
                           scope='https://adwords.google.com/api/adwords',
                           user_agent='ZZZ')
authorize_url = flow.step1_get_authorize_url('urn:ietf:wg:oauth:2.0:oob')
code = raw_input('Code: ').strip()

credential = None
try:
    credential = flow.step2_exchange(code)  # <- error
except FlowExchangeError, e:
    sys.exit('Authentication has failed: %s' % e)
This produces a "socket.error: [Errno 10054]" at step2_exchange, and Python cuts off the exact message.
So after checking the key with OAuthPlayground (to get a better error message), I get this error:
HTTP/1.1 400 Bad Request
Content-length: 37
X-xss-protection: 1; mode=block
X-content-type-options: nosniff
X-google-cache-control: remote-fetch
-content-encoding: gzip
Server: GSE
Via: HTTP/1.1 GWA
Pragma: no-cache
Cache-control: no-cache, no-store, max-age=0, must-revalidate
Date: Thu, 06 Jun 2013 13:54:29 GMT
X-frame-options: SAMEORIGIN
Content-type: application/json
Expires: Fri, 01 Jan 1990 00:00:00 GMT
{
    "error" : "unauthorized_client"
}
I checked that the client_id (for installed apps) and client_secret are identical to the ones specified in the Google API Console (https://code.google.com/apis/console/).
If I do the whole process through OAuthPlayground it works, but if I try to use a token created by the Playground, the app also fails.
Does anyone know how to fix this?
Fixed it. I am behind a proxy which lets the step1 auth through, but apparently not the step2 auth. So a simple
h = httplib2.Http(proxy_info=httplib2.ProxyInfo(...))  # proxy data elided
flow.step2_exchange(code, h)
fixed it.
There is an example of how to configure proxy_info in httplib2 at https://code.google.com/p/httplib2/wiki/Examples, which says:
import httplib2
import socks
httplib2.debuglevel=4
h = httplib2.Http(proxy_info = httplib2.ProxyInfo(socks.PROXY_TYPE_HTTP, 'localhost', 8000))
r,c = h.request("http://bitworking.org/news/")
However, I've found that the latest httplib2 ships with a cleaned-up socks module, so you really want to do something more like:
import httplib2

ht = httplib2.Http(proxy_info=httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP,
                                                 'name_or_ip_of_the_proxy_box',
                                                 proxy_port))
flow.step2_exchange(code, ht)
Also, you want to be using a version of oauth2client >= 1.0beta8, which requires httplib2 >= 0.7.4; that is where HTTP proxy support was cleaned up in both packages.
I would like to print ONLY the line which contains "Server" in the below piece of output:
Date: Sun, 16 Dec 2012 20:07:44 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Connection: close
This information is fetched from a list called websiteheaders. I have the below piece of code, which is driving me crazy because it is not working properly...
for line in websiteheaders:
    if "Server" in line:
        print line
Now this piece of code prints exactly the same block of text shown at the beginning of my post. I just don't see why it does that...
As I've said, I only want to print the line that contains "Server", if possible without regex. And if not possible, with regex.
Please help and thanks!
EDIT: My complete code so far is pasted here: http://pastebin.com/sYuZyvX9
EDIT2: For completeness, in hosts.txt there currently is 1 host named "google.com"
Update
My code was actually working fine, but there was a mistake in another piece of my code which meant that the data put into the list websiteheaders was one large string instead of multiple entries. The above piece of code will of course find "Server" in that string and print the whole entry, which in my case was the full (large) string.
Using
websiteheaders.extend(headers.splitlines())
instead of
websiteheaders.append(headers)
did the trick for me. Thanks a lot, guys.
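To illustrate the difference with a toy header string (made up for this example):

headers = "Server: gws\nConnection: close"

as_appended = []
as_appended.append(headers)               # one entry: the whole blob
print(as_appended)                        # ['Server: gws\nConnection: close']

as_extended = []
as_extended.extend(headers.splitlines())  # one entry per header line
print(as_extended)                        # ['Server: gws', 'Connection: close']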
Is websiteheaders really a list that is split per line? Because if it's a string you should use:
for line in websiteheaders.splitlines():
    if "Server" in line:
        print line
Also, a good tip: I would recommend adding some print statements when you hit this kind of problem. If you had added something like:
else:
    print 'WRONG LINE:', line
you probably would have caught that this loop was not iterating over every line but over every character.
Update
I can't see what's wrong with your code, then. This is what I get:
In [3]: websiteheaders
Out[3]:
['Date: Sun, 16 Dec 2012 20:07:44 GMT',
'Expires: -1',
'Cache-Control: private, max-age=0',
'Content-Type: text/html; charset=ISO-8859-1',
'Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com',
'Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly',
'P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."',
'Server: gws',
'X-XSS-Protection: 1; mode=block',
'X-Frame-Options: SAMEORIGIN',
'Connection: close"']
In [4]: for line in websiteheaders:
...: if 'Server' in line:
...: print line
...:
Server: gws
for single_line in websiteheaders.splitlines():
    if "Server" in single_line:
        print single_line