How to read application/octet-stream in Python

Building off of this question, I'm using a Python script to call the API detailed in the link below:
https://developer.wmata.com/docs/services/gtfs/operations/5cdc51ea7a6be320cab064fe?
I use the code below to call the API:
import json
import requests
# define functions
def _prepare_url(path):
    return f'{API_URL}/{path.lstrip("/")}'
def pull_data(path, params=None, headers=None):
    url = _prepare_url(path)
    return requests.get(url, params=params, headers=headers)
# print results in cleaner format
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)
API_URL = 'https://api.wmata.com'
# authenticate with your api key
headers = {
    "api_key": "myKey",
}
response = pull_data('/gtfs/bus-gtfsrt-tripupdates.pb', headers=headers)
print(response.content)
print(response.headers)
print(response.url)
But it returns a meaningless stream of data along with the following headers:
Request-Context: appId=cid-v1:2833aead-1a1f-4ffd-874e-ef3a5ceb1de8
Cache-Control: public, must-revalidate, max-age=5
Date: Thu, 11 Feb 2021 22:05:31 GMT
ETag: 0x8D8CED90CC8419C
Content-Length: 625753
Content-MD5: fspEFl7LJ8QbZPgf677WqQ==
Content-Type: application/octet-stream
Expires: Thu, 11 Feb 2021 22:05:37 GMT
Last-Modified: Thu, 11 Feb 2021 22:04:49 GMT
'''b'\n\r\n\x031.0\x10\x00\x18\xd9\xef\xa5\x81\x06\x12\xee\x02\n\n1932817010\x1a\xdf\x02\n\x1a\n\n1932817010\x1a\x0820210214*\x0233\x12\x13\x08\x02\x1a\x06\x10\x9c\xff\xa5\x81\x06"\x0513752...'''
Any guidance on how to go about reading this kind of response?

GTFS-rt is transported as a Protocol Buffers ("protobuf") message, a compact binary serialization format, which is why the raw bytes look meaningless. To decode the response, your Python script needs the Google Protobuf Python package together with Python classes generated from the gtfs-realtime.proto file, which defines the expected contents of a GTFS-rt feed.
Here is an example of how to read a GTFS-rt feed in Python from the documentation: https://developers.google.com/transit/gtfs-realtime/examples/python-sample.
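To make the point concrete, here is a minimal sketch that peeks at the first 15 bytes of the response shown in the question using only the standard library. It hand-decodes the protobuf wire format for two wire types (varint and length-delimited) and is only meant to show that the bytes are structured, not to replace a real decoder. The helper names read_varint, read_fields and parse_feed are made up for this illustration. parse_feed assumes the third-party gtfs-realtime-bindings package is installed and is not called here.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def read_fields(buf):
    """Yield (field_number, wire_type, value) for each top-level field in buf.

    Only wire types 0 (varint) and 2 (length-delimited) are handled, which is
    enough for the feed header; anything else raises ValueError.
    """
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unhandled wire type {wire_type}")
        yield field_number, wire_type, value


def parse_feed(raw):
    """Full decode. Assumes the gtfs-realtime-bindings package is installed."""
    from google.transit import gtfs_realtime_pb2  # optional dependency, imported lazily
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(raw)
    return feed


# First 15 bytes of the response body shown in the question.
sample = b'\n\r\n\x031.0\x10\x00\x18\xd9\xef\xa5\x81\x06'

# Field 1 of FeedMessage is the header, which is itself an embedded message.
(_, _, header_bytes), = read_fields(sample)

# In gtfs-realtime.proto, FeedHeader field 1 is gtfs_realtime_version,
# field 2 is incrementality and field 3 is timestamp.
header = {number: value for number, _, value in read_fields(header_bytes)}
print(header)  # {1: b'1.0', 2: 0, 3: 1613330393}
```

With the bindings installed, parse_feed(response.content) should return a FeedMessage whose entity list can be iterated as in the linked sample.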

Related

Python HTTP Request Returns 404 or Bytes

I'm trying to use a Python script to call the API detailed in the link below:
https://developer.wmata.com/docs/services/gtfs/operations/5cdc51ea7a6be320cab064fe?
When I use the code below, it always returns a 404 error:
import requests
import json
def _url(path):
    return "http://api.wmata.com" + path
def pull_data():
    return requests.get(_url("/gtfs/bus-gtfsrt-tripupdates.pb"), params=params)
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)
# authenticate with your api key
params = {
    "apiKey": "mykey",
}
response = pull_data()
print(response)
jprint(response.json())
I have also tried using the python code provided in the link, but it returns meaningless response content as shown below. Any attempts to decode the content have been unsuccessful.
Request-Context: appId=cid-v1:2833aead-1a1f-4ffd-874e-ef3a5ceb1de8
Cache-Control: public, must-revalidate, max-age=5
Date: Thu, 11 Feb 2021 22:05:31 GMT
ETag: 0x8D8CED90CC8419C
Content-Length: 625753
Content-MD5: fspEFl7LJ8QbZPgf677WqQ==
Content-Type: application/octet-stream
Expires: Thu, 11 Feb 2021 22:05:37 GMT
Last-Modified: Thu, 11 Feb 2021 22:04:49 GMT
1.0�Ԗ��
1818410080�
181841008020210211*52!�Ԗ�"19032("�Ԗ�"18173(#�Ԗ�"7779($�Ֆ�"18174(%�Ֆ�"7909(&�Ֆ�"7986('�Ֆ�"8039((�Ֆ�"8130()�֖�"8276(+�֖�"8313(,�ז�"8403(-�ז�"8452(.�ז�"8520(/�ؖ�"8604(1�ؖ�"8676(
7070 �Ӗ�(����������
1814174080�
181417408020210211*P129�Ԗ�"2373(:�Ԗ�"2387(;�Ԗ�"17296(=�Ֆ�"17212(>�֖�"2444(?�֖�"2493(#�֖�"2607(A�֖�"14633(B�֖�"2784(C�ז�"2832(D�ז�"2843(E�ז�"2848(F�ז�"2875(G�ؖ�"2945(H�ؖ�"2987(I�ؖ�"21946(K�ٖ�"14636(L�ٖ�"3122(M�ٖ�"3227(N�ٖ�"3308(O�ٖ�"3411(P�ٖ�"3500(Q�ٖ�"3539(R�ٖ�"14637(S�ږ�"3685(T�ږ�"15195(U�ږ�"15196(V�ۖ�"4243(W�ۖ�"4443(X�ۖ�"4517(Y�ۖ�"4631([�ܖ�"11962(
8002 �Ӗ�(/�
1825989080�
182598908020210211*7Y�Ӗ�"2158(�Ԗ�"2215(�Ԗ�"2259(�Ԗ�"2292(�Ԗ�"2299(�Ֆ�"18701(�Ֆ�"2310( �Ֆ�"2245(!�Ֆ�"2174("�Ֆ�"1987(#�֖�"1937(%�֖�"1864(
3191 �Ӗ�(��
1819988080�
Any guidance or direction would be greatly appreciated!

Change the pull_data function as follows:
def pull_data():
    return requests.get(_url("/gtfs/bus-gtfsrt-tripupdates.pb"), headers=headers)
Then rename the module-level params variable to headers:
headers = {"apiKey": "mykey"}
WMATA looks for the API key in the request headers, not in the query parameters.
Update: I noticed they use api_key in some samples and apiKey in others. For example, see:
https://developer.wmata.com/docs/services/gtfs/operations/5cdc51ea7a6be320cab064fe
Update 2: Notice the content type in the response headers:
print(response.headers['content-type'])
# application/octet-stream
The response is not JSON, so response.json() will fail. You can get the raw contents as follows:
print(response.content)
Worked example:
import requests
API_URL = 'https://api.wmata.com'
def _prepare_url(path):
    return f'{API_URL}/{path.lstrip("/")}'
def pull_data(**options):
    url = _prepare_url('/gtfs/bus-gtfsrt-tripupdates.pb')
    return requests.get(url, **options)
response = pull_data(headers={'api_key': 'secret'})
print(response.content)
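The content-type check from Update 2 can be folded into a small helper so that JSON is only parsed when the server actually says it sent JSON. This is a sketch, not part of the original answer: decode_body is a hypothetical name, and it takes the header value and body as plain arguments so it can be tried without a network call.

```python
import json


def decode_body(content_type, body):
    """Return parsed JSON for JSON responses, otherwise the raw bytes."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    return body


json_result = decode_body("application/json; charset=utf-8", b'{"a": 1}')
binary_result = decode_body("application/octet-stream", b"\n\r\n\x031.0")
print(json_result)
print(binary_result)
```

With requests, the call would look like decode_body(response.headers['content-type'], response.content).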

Try changing your URL so that it is using https instead of http. The documentation that you have linked at Bus RT Trip Updates seems to indicate that https is required.
Change this:
def _url(path):
    return "http://api.wmata.com" + path
to this:
def _url(path):
    return "https://api.wmata.com" + path

URLError doesn't return json body

I have built a REST interface. On '400 Bad Request' it returns a json body with specific information about the error.
(Pdb) error.code
400
Python correctly raises a URLError with these headers:
(Pdb) print(error.headers)
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 20 Aug 2016 13:01:05 GMT
Connection: close
Content-Length: 236
The Content-Length header says the body is 236 bytes long, and I can see the extra information using the DHC Chrome plugin:
{
    "error_code": "00000001",
    "error_message": "The json data is not in the correct json format.\r\nThe json data is not in the correct json format.\r\n'Execution Start Time' must not be empty.\r\n'Execution End Time' must not be empty.\r\n"
}
However, I cannot find a way to read the body in Python.
Here are some of the things I have tried and what was returned:
(Pdb) len(error.read())
0
(Pdb) error.read().decode('utf-8', 'ignore')
''
(Pdb) error.readline()
b''

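I found that the following works the first time it is called but returns an empty string if called again, because read() consumes the response stream. Store the result if you need it more than once:
error.read().decode('utf-8')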
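The read-once behaviour can be reproduced without a server. The sketch below builds a urllib.error.HTTPError by hand around an in-memory body (the URL and the shortened payload are stand-ins, not the real service), reads it once, and shows that a second read returns nothing.

```python
import io
import json
import urllib.error
from email.message import Message

# Stand-in for the error a real urlopen() call would raise on a 400 response.
fake_body = b'{"error_code": "00000001"}'
error = urllib.error.HTTPError(
    "http://example.invalid/api",  # placeholder URL
    400,
    "Bad Request",
    Message(),                      # empty header set
    io.BytesIO(fake_body),          # the response body stream
)

body = error.read().decode("utf-8")  # first read returns the payload
second = error.read()                # the stream is now exhausted

details = json.loads(body)
print(details["error_code"])
print(second)
```

In real code the same pattern goes inside an except urllib.error.HTTPError as error: block, reading the body once and keeping the result in a variable.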

Forcing response charset in CherryPy

I want to specify the HTTP response charset by modifying the Content-Type header. However, it doesn't work.
Here is a short example:
#coding=utf-8
import cherrypy
class Website:
    @cherrypy.expose()
    def index(self):
        cherrypy.response.headers['Content-Type']='text/plain; charset=gbk'
        return '。。。'.encode('gbk')
cherrypy.quickstart(Website(),'/',{
    '/': {
        'tools.response_headers.on':True,
    }
})
And when I visit that page, the Content-Type is changed mysteriously to text/plain;charset=utf-8, causing mojibake in the browser.
C:\Users\Administrator>ncat 127.0.0.1 8080 -C
GET / HTTP/1.1
Host: 127.0.0.1:8080
HTTP/1.1 200 OK
Server: CherryPy/7.1.0
Content-Length: 6
Content-Type: text/plain;charset=utf-8
Date: Mon, 22 Aug 2016 01:08:13 GMT
。。。^C
It seems that CherryPy detects the content encoding and overrides the charset automatically. If so, how can I disable this feature?
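Why the mismatched header produces mojibake can be checked with the standard library alone. This sketch encodes the same three characters as GBK, as the handler does, and then decodes the bytes as UTF-8, as a browser trusting the rewritten header would.

```python
# The three characters returned by index() in the question (U+3002 each).
text = '\u3002' * 3

gbk_bytes = text.encode('gbk')      # what the handler actually sends
utf8_bytes = text.encode('utf-8')   # what the header claims was sent

# A client that trusts "charset=utf-8" decodes the GBK bytes as UTF-8.
misread = gbk_bytes.decode('utf-8', errors='replace')

print(len(gbk_bytes), len(utf8_bytes))  # 6 9
print(misread == text)                  # False
```

The 6-byte length matches the Content-Length: 6 in the ncat transcript, which suggests the body itself is still GBK and only the header was changed.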

All right. I solved this problem by tampering with cherrypy.response.header_list directly.
#coding=utf-8
import cherrypy
def set_content_type():
    header=(b'Content-Type',cherrypy.response._content_type.encode())
    for ind,(key,_) in enumerate(cherrypy.response.header_list):
        if key.lower()==b'content-type':
            cherrypy.response.header_list[ind]=header
            break
    else:
        cherrypy.response.header_list.append(header)
cherrypy.tools.set_content_type=cherrypy.Tool('on_end_resource',set_content_type)
class Website:
    @cherrypy.expose()
    @cherrypy.tools.set_content_type()
    def index(self):
        cherrypy.response._content_type='text/plain; charset=gbk'
        return '。。。'.encode('gbk')
cherrypy.quickstart(Website(),'/')

I was able to set the charset in the Content-Type header by setting the "Accept-Charset" request header:
cherrypy.request.headers["Accept-Charset"] = "ISO-8859-1"
cherrypy.response.headers["Content-Type"] = "text/xml"
The result:
>curl -I https://127.0.0.1/url?param=value
HTTP/1.1 200 OK
Content-Type: text/xml;charset=ISO-8859-1
Server: CherryPy/18.6.0
Date: Mon, 10 Aug 2020 11:54:49 GMT
Content-Length: 288
Set-Cookie: session_id=d28fa46a1a3d901d9502038255ce869b21add5cc; expires=Mon, 10 Aug 2020 12:54:49 GMT; Max-Age=3600; Path=/

JSON string decoding error

I am calling the URL :
http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json
using urllib2 and decoding using the json module
import json
import urllib2
url = "http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json"
request = urllib2.Request(url)
response = urllib2.urlopen(request)
issue_report = json.loads(response.read())
I run into the following error :
ValueError: Invalid control character at: line 1 column 1120 (char 1120)
I tried checking the header and I got the following :
Content-Type: application/json; charset=UTF-8
Access-Control-Allow-Origin: *
Expires: Sun, 03 Jul 2011 17:38:38 GMT
Date: Sun, 03 Jul 2011 17:38:38 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"CUEGQX47eCl7ImA9WxJaFEw."
Last-Modified: Tue, 04 Aug 2009 19:20:20 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
I also tried adding an encoding parameter as follows :
issue_report = json.loads(response.read() , encoding = 'UTF-8')
I still run into the same error.

The feed has raw data from a JPEG in it at that point; the JSON is malformed, so it's not your fault. Report a bug to Google.
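The error message itself can be reproduced with a tiny payload. The sketch below uses a made-up JSON string containing a literal newline inside a string value, which the json module rejects by default. Passing strict=False tells the decoder to accept control characters inside strings. Whether that is enough for this particular feed is not certain, since embedded binary data may break the JSON in other ways too.

```python
import json

# Made-up payload: the \n here is a real newline character inside the string value.
raw = '{"summary": "line one\nline two"}'

try:
    json.loads(raw)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

# strict=False allows control characters inside JSON strings.
lenient = json.loads(raw, strict=False)

print(strict_error)
print(lenient)
```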

You could consider using lxml instead, since the JSON is malformed. Its XPath support makes working with XML pretty straightforward:
import lxml.etree
url = 'http://code.google.com/feeds/issues/p/chromium/issues/full/291'
doc = lxml.etree.parse(url)
ns = {'issues': 'http://schemas.google.com/projecthosting/issues/2009'}
issues = doc.xpath('//issues:*', namespaces=ns)
It is fairly easy to manipulate the elements, for instance to strip the namespace from the tags and convert them to a dict:
>>> dict((x.tag[len(ns['issues'])+2:], x.text) for x in issues)
{'closedDate': '2009-08-04T19:20:20.000Z',
'id': '291',
'label': 'Area-BrowserUI',
'stars': '13',
'state': 'closed',
'status': 'Verified'}

How can I view error messages with pycurl?

I have the following pycurl code:
curl = pycurl.Curl()
foo = StringIO()
curl.setopt(pycurl.WRITEFUNCTION, foo.write)
curl.setopt(pycurl.POST, 1)
curl.setopt(pycurl.URL, finalURL)
curl.setopt(pycurl.POSTFIELDS, encodedArgs)
curl.perform()
responseCode = curl.getinfo(pycurl.RESPONSE_CODE)
effectiveURL = curl.getinfo(pycurl.EFFECTIVE_URL)
curl.close()
When the command line curl command comes back I see:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/xml;charset=UTF-8
Content-Length: 216
Date: Thu, 06 Jan 2011 15:49:36 GMT
Some XML Error Here: Something you are trying to do is not permitted.
But I don't see this from pycurl.
How can I extract this alert/error message when using pycurl?

The response from the server is written using the curl option pycurl.WRITEFUNCTION.
In your case, since you are passing it a StringIO object, the response data should be in the foo variable: foo.getvalue()
Reference: http://pycurl.sourceforge.net/doc/curlobject.html
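How the write callback fills the buffer can be seen without pycurl. In this sketch the chunks are simulated by hand (a real transfer would have libcurl calling the function once per chunk received), and io.BytesIO stands in for the StringIO object from the question.

```python
import io

buffer = io.BytesIO()
write_callback = buffer.write  # the callable that would be passed as pycurl.WRITEFUNCTION

# Simulated chunks; during curl.perform() these would come from the server.
for chunk in (b"Some XML Error Here: ", b"Something you are trying to do is not permitted."):
    write_callback(chunk)

# After the transfer, the whole body is available from the buffer.
body = buffer.getvalue().decode("utf-8")
print(body)
```

The key point is to call getvalue() after perform() has returned, because the buffer only contains the full response once the transfer has finished.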
