Get a header with Python and convert in JSON (requests - urllib2 - json) - python

I’m trying to get the headers from a website and encode them as JSON so I can write them to a file.
I’ve tried two different ways without success.
FIRST with urllib2 and json
import urllib2
import json
host = ("https://www.python.org/")
header = urllib2.urlopen(host).info()
json_header = json.dumps(header)
print json_header
This way I get the error:
TypeError: is not JSON serializable
So I tried to bypass this issue by converting the object to a string: json_header = str(header)
That way json.dumps works, but the output is weird:
"Date: Wed, 02 Jul 2014 13:33:37 GMT\r\nServer: nginx\r\nContent-Type:
text/html; charset=utf-8\r\nX-Frame-Options:
SAMEORIGIN\r\nContent-Length: 45682\r\nAccept-Ranges: bytes\r\nVia:
1.1 varnish\r\nAge: 1263\r\nX-Served-By: cache-fra1220-FRA\r\nX-Cache: HIT\r\nX-Cache-Hits: 2\r\nVary: Cookie\r\nStrict-Transport-Security:
max-age=63072000; includeSubDomains\r\nConnection: close\r\n"
SECOND with requests
import requests
r = requests.get("https://www.python.org/")
rh = r.headers
print rh
{'content-length': '45682', 'via': '1.1 varnish', 'x-cache': 'HIT',
'accept-ranges': 'bytes', 'strict-transport-security':
'max-age=63072000; includeSubDomains', 'vary': 'Cookie', 'server':
'nginx', 'x-served-by': 'cache-fra1226-FRA', 'x-cache-hits': '14',
'date': 'Wed, 02 Jul 2014 13:39:33 GMT', 'x-frame-options':
'SAMEORIGIN', 'content-type': 'text/html; charset=utf-8', 'age':
'1619'}
This way the output is more JSON-like but still not valid JSON (note the ' ' instead of " ", and other things like the = and ;).
Evidently there’s something (or a lot) I’m not doing in the right way.
I’ve tried to read the documentation of the modules but I can’t understand how to solve this problem.
Thank you for your help.

There are more than a couple of ways to encode headers as JSON, but my first thought would be to convert the headers attribute to an actual dictionary instead of accessing it as a requests.structures.CaseInsensitiveDict:
import requests, json
r = requests.get("https://www.python.org/")
rh = json.dumps(r.headers.__dict__['_store'])
print rh
{'content-length': ('content-length', '45474'), 'via': ('via', '1.1
varnish'), 'x-cache': ('x-cache', 'HIT'), 'accept-ranges':
('accept-ranges', 'bytes'), 'strict-transport-security':
('strict-transport-security', 'max-age=63072000; includeSubDomains'),
'vary': ('vary', 'Cookie'), 'server': ('server', 'nginx'),
'x-served-by': ('x-served-by', 'cache-iad2132-IAD'), 'x-cache-hits':
('x-cache-hits', '1'), 'date': ('date', 'Wed, 02 Jul 2014 14:13:37
GMT'), 'x-frame-options': ('x-frame-options', 'SAMEORIGIN'),
'content-type': ('content-type', 'text/html; charset=utf-8'), 'age':
('age', '1483')}
Depending on exactly what you want from the headers, you can access specific ones after this. This gives you all the information contained in the headers, albeit in a slightly different format. Note that _store is a private attribute, so this approach may break between versions of requests.
If you prefer a different format, you can also convert your headers to a dictionary:
import requests, json
r = requests.get("https://www.python.org/")
print json.dumps(dict(r.headers))
{"content-length": "45682", "via": "1.1 varnish", "x-cache": "HIT",
"accept-ranges": "bytes", "strict-transport-security":
"max-age=63072000; includeSubDomains", "vary": "Cookie", "server":
"nginx", "x-served-by": "cache-at50-ATL", "x-cache-hits": "5", "date":
"Wed, 02 Jul 2014 14:08:15 GMT", "x-frame-options": "SAMEORIGIN",
"content-type": "text/html; charset=utf-8", "age": "951"}

If you are only interested in the headers, make a HEAD request. Convert the CaseInsensitiveDict to a dict object and then convert that to JSON.
import requests
import json
r = requests.head('https://www.python.org/')
rh = dict(r.headers)
json.dumps(rh)
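The same dict-then-dumps idea should also cover the urllib2 attempt from the question, because the header object exposes items(). Below is a minimal Python 3 sketch, assuming http.client.HTTPMessage as the Python 3 counterpart of the object returned by .info(). The header text is a shortened copy of the question's output, parsed locally rather than fetched, and the file name headers.json is only illustrative.

```python
import json
import os
import tempfile
from email.parser import Parser
from http.client import HTTPMessage

# Shortened copy of the raw header text shown in the question (no network needed).
RAW_HEADERS = (
    "Date: Wed, 02 Jul 2014 13:33:37 GMT\r\n"
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Connection: close\r\n"
)

# Build the same kind of object that urllib.request.urlopen(url).info() returns.
message = Parser(_class=HTTPMessage).parsestr(RAW_HEADERS, headersonly=True)

# items() yields (name, value) pairs, so dict() gives something json can serialize.
header_dict = dict(message.items())
json_header = json.dumps(header_dict, indent=2)
print(json_header)

# Write the JSON to a file (illustrative path inside a temporary directory).
output_path = os.path.join(tempfile.mkdtemp(), "headers.json")
with open(output_path, "w") as handle:
    handle.write(json_header)
```

Keep in mind that dict() keeps only one value per header name, so repeated headers such as Set-Cookie collapse to a single entry.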

import requests
import json
r = requests.get('https://www.python.org/')
rh = r.headers
print json.dumps( dict(rh) ) # use dict()
result:
{"content-length": "45682", "via": "1.1 varnish", "x-cache": "HIT", "accept-ranges": "bytes", "strict-transport-security": "max-age=63072000; includeSubDomains", "vary": "Cookie", "server": "nginx", "x-served-by": "cache-fra1224-FRA", "x-cache-hits": "5", "date": "Wed, 02 Jul 2014 14:08:04 GMT", "x-frame-options": "SAMEORIGIN", "content-type": "text/html; charset=utf-8", "age": "3329"}

I know this is an old question, but I stumbled across it when trying to put together a quick and dirty Python curl-esque URL getter. I kept getting an error:
TypeError: Object of type 'CaseInsensitiveDict' is not JSON serializable
The above solutions are good if you need to output a JSON string immediately, but in my case I needed to return a Python dictionary of the headers, and I wanted to normalize the capitalization to make all keys lowercase.
My solution was to use a dict comprehension:
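import requests
# Note: a HEAD response carries no body, so response.text is empty here;
# use requests.get instead if the body is needed.
response = requests.head('https://www.python.org/')
my_dict = {
    'body': response.text,
    'http_status_code': response.status_code,
    'headers': {k.lower(): v for (k, v) in response.headers.items()}
}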
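The TypeError in this answer comes up because json.dumps only serializes dict objects (and subclasses of dict), and CaseInsensitiveDict is a mapping that does not subclass dict. The comprehension can be wrapped in a small helper. This is a sketch only: normalize_headers is a hypothetical name, and the sample mapping stands in for response.headers so the code runs without a network call.

```python
import json


def normalize_headers(headers):
    """Return a plain dict with lowercase header names (hypothetical helper)."""
    return {name.lower(): value for name, value in headers.items()}


# Stand-in for response.headers; any mapping with .items() works the same way.
sample_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Frame-Options": "SAMEORIGIN",
    "Server": "nginx",
}

normalized = normalize_headers(sample_headers)
print(json.dumps(normalized, sort_keys=True))
```

Because the result is a plain dict, it can be returned as-is or passed straight to json.dumps.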

Related

Boto3 AWS TypeError: string indices must be integers

I have the code below:
import boto3
client = session.client('ec2')
response = client.describe_iam_instance_profile_associations()
print(response)
{'IamInstanceProfileAssociations': [{'AssociationId': 'iip-assoc-0c7941c0858c84652', 'InstanceId': 'i-xxx', 'IamInstanceProfile': {'Arn': 'arn:aws:iam::xxxx:instance-profile/xxx', 'Id': 'xxx'}, 'State': 'associated'}], 'ResponseMetadata': {'RequestId': '15bdc08e-ff66-431e-968e-1930557847ef', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '15bdc08e-ff66-431e-968e-1930557847ef', 'cache-control': 'no-cache, no-store', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'vary': 'accept-encoding', 'content-type': 'text/xml;charset=UTF-8', 'transfer-encoding': 'chunked', 'date': 'Wed, 27 Jul 2022 22:04:12 GMT', 'server': 'AmazonEC2'}, 'RetryAttempts': 0}}
I would like to extract some values from this output, so I tried:
for key in response:
... print (key)
and I get the expected response.
The same goes for:
for r in response['IamInstanceProfileAssociations']:
... print (r['InstanceId'] , r['AssociationId'])
With the above I also get the expected response.
Now I need to get the InstanceId and the ARN of the instance profile, so I tried the code below, but I got the error 'TypeError: string indices must be integers'.
Any suggestions, please? I checked pages about nested dictionaries on Google and elsewhere but was not able to find a solution.
for r in response['IamInstanceProfileAssociations']:
... #print (r['InstanceId'] , r['AssociationId'])
... for i in r['IamInstanceProfile']:
... print (i['Arn'],r['InstanceId'])
The variable i is a string, because iterating over the r['IamInstanceProfile'] dictionary yields its keys, so you cannot index it with 'Arn'.
Here is an example showing how you can get the InstanceId and the profile's Arn:
import boto3
client = boto3.client('ec2')
response = client.describe_iam_instance_profile_associations()
for association in response['IamInstanceProfileAssociations']:
    print((association['InstanceId'], association['IamInstanceProfile']['Arn']))
Using your example, the output would be the following:
('i-xxx', 'arn:aws:iam::xxxx:instance-profile/xxx')
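To see why the inner loop fails, the behaviour can be reproduced without AWS. This sketch uses a trimmed copy of the response printed in the question as a literal dictionary, so boto3 is not needed; the variable names are illustrative.

```python
# Trimmed copy of the response shown in the question (no AWS call needed).
response = {
    'IamInstanceProfileAssociations': [
        {
            'AssociationId': 'iip-assoc-0c7941c0858c84652',
            'InstanceId': 'i-xxx',
            'IamInstanceProfile': {
                'Arn': 'arn:aws:iam::xxxx:instance-profile/xxx',
                'Id': 'xxx',
            },
            'State': 'associated',
        }
    ]
}

pairs = []
for association in response['IamInstanceProfileAssociations']:
    profile = association['IamInstanceProfile']
    # Iterating over a dict yields its keys, which are plain strings.
    keys_seen = list(profile)
    # Index the nested dict directly instead of looping over it.
    pairs.append((association['InstanceId'], profile['Arn']))

# Indexing one of those key strings with 'Arn' reproduces the original error.
key = keys_seen[0]
try:
    key['Arn']
except TypeError as error:
    indexing_error = error

print(keys_seen)
print(pairs)
```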

Download large number of byte ranges from file using Python requests

I've got a script that identifies byte ranges in a very large file that I'd like to download. I'm using Python's requests library to download the content, specifying the byte ranges of interest in the range header. Here's a simplified version of the code (without the logic that constructs the byte range string):
import requests
URL = 'https://ncei.noaa.gov/data/rapid-refresh/access/historical/analysis/201602/20160217/rap_130_20160217_2200_000.grb2'
byte_range = '0-33510, 110484-147516, 219121-253904, 421175-454081, 685402-719065, 1039572-1076567, 1299982-1333158, 1398139-1429817, 1492109-1522167, \
1662765-1689414, 1870865-1896117, 2120537-2145725, 2301018-2335355, 2404445-2439381, 2511283-2547104, 2717931-2750956, 2971504-3001716, 3268591-3295610, \
3395201-3395200, 3395201-3461393, 3593639-3593638, 3593639-3659732, 3792859-3792858, 3792859-3859312, 4183232-4183231, 4183232-4245378, 4668359-4668358, \
4668359-4728450, 5283559-5283558, 5283559-5344745, 7251508-7317016, 7498496-7558460'
response = requests.get(URL, headers={"Range": "bytes={}".format(byte_range)})
print(response.headers)
As far as I can tell, this is a valid request and I don't get any errors. However, it downloads the entire file rather than the specified ranges. The output:
{'Date': 'Sat, 16 Oct 2021 15:26:17 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000', 'Last-Modified': 'Thu, 18 Feb 2016 15:22:25 GMT', 'ETag': '"cfde25-52c0cef11064f"', 'Accept-Ranges': 'bytes', 'Content-Length': '13622821', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': 'X-Requested-With, Content-Type', 'Connection': 'close'}
To debug, I tried shortening the number of byte ranges and that seemed to work. Seems like the max number of ranges for the request to return a subset of the file is in the low 20s.
# Shorter list of byte ranges
URL = 'https://ncei.noaa.gov/data/rapid-refresh/access/historical/analysis/201602/20160217/rap_130_20160217_2200_000.grb2'
byte_range = '0-33510, 110484-147516, 219121-253904, 421175-454081, 685402-719065, 1039572-1076567, 1299982-1333158, 1398139-1429817, 1492109-1522167, \
1662765-1689414, 1870865-1896117, 2120537-2145725, 2301018-2335355, 2404445-2439381, 2511283-2547104, 2717931-2750956, 2971504-3001716, 3268591-3295610'
response = requests.get(URL, headers={"Range": "bytes={}".format(byte_range)})
print(response.headers)
In this case, the content type is a multipart byte range, as expected.
{'Date': 'Sat, 16 Oct 2021 15:26:41 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000', 'Last-Modified': 'Thu, 18 Feb 2016 15:22:25 GMT', 'ETag': '"cfde25-52c0cef11064f"', 'Accept-Ranges': 'bytes', 'Content-Length': '577544', 'Content-Type': 'multipart/byteranges; boundary=ccc35e3764d85dea', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': 'X-Requested-With, Content-Type', 'Connection': 'close'}
My question now is where's the limitation - is this an issue with requests, the server, or just an issue with HTTP headers? I could break this up into multiple requests, but I need to try to avoid spamming the server with a lot of requests near the same time (this byte range list could get pretty long depending on what I want from the file). If I had to break up the request, what's the most efficient way to do so? I really don't want to download more data than I need as these files can be quite large.
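The question leaves open where the limit comes from, so the following only sketches the fallback it mentions: splitting the ranges into a few batched requests. Everything here is an assumption rather than a confirmed fix. batch_range_headers is a hypothetical helper, and the default of 18 ranges per request is simply the count that worked in the shorter example, not a documented limit. The helper also drops ranges whose end precedes their start (the long list contains a few, such as 3395201-3395200). As far as I recall, HTTP treats such a range spec as invalid, which may or may not be related to the behaviour observed.

```python
def batch_range_headers(ranges, max_per_request=18):
    """Yield Range header values, each covering at most max_per_request ranges."""
    # Skip empty ranges such as (3395201, 3395200), where the end precedes the start.
    valid = [(first, last) for first, last in ranges if last >= first]
    for start in range(0, len(valid), max_per_request):
        batch = valid[start:start + max_per_request]
        yield "bytes=" + ", ".join(f"{first}-{last}" for first, last in batch)


# A few (first, last) pairs taken from the question's byte_range string.
sample_ranges = [(0, 33510), (110484, 147516), (3395201, 3395200), (3395201, 3461393)]
range_headers = list(batch_range_headers(sample_ranges, max_per_request=2))
for value in range_headers:
    print(value)  # each value would be sent as headers={"Range": value}
```

Each request's status code is worth checking: 206 means the server returned partial content, while 200 means it ignored the Range header and sent the whole file.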

Python request module gives error 400(bad request) for POST

I am new to HTTP requests and am trying to automate some work, but I am unable to get the required result. I have looked at many posts and at the documentation of the Python requests module, but the result does not change.
Code I wrote
import requests

def installFont():
    print "Installing font"
    urlToHit = "some http address"
    header_ = { "UserID": "00000", "PortalName": "EDC", "ModifyBy" : "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "Content-Type" : "application/json"}
    body_ = {
        "Email": "abc#xyz.com",
        "AssetLicenseType": "Trial",
        "MachineIds": ["machine1", "machine2"],
        "fontAsset":
        [
            {
                "FontId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
                "FontName": "Neue Aachen™ Pro Ultra Light",
                "FontUrl": "http://helveticaurl",
                "FontDownloadUrlAPI": "url",
                "FontDownloadUrl" : "url1",
                "FontFamilyName": "Neue Aachen™ Pro",
                "FontFamilyUrl": "http://FontFamilyUrl",
                "FontStyle": "Normal",
                "FontWeight": "100",
                "ExpiryDate": "2017-2-27 11:17:01",
                "FontFamilyId": "34"
            }
        ]
    }
    r = requests.request("POST", urlToHit, data=body_, headers=header_)
    print r.headers
    print r.status_code
    print r.text
I tried the same thing with Postman, which gives me the correct result, but via Python I get this output:
{'X-Processing-Time-Milliseconds': '3', 'Transfer-Encoding': 'chunked', 'X-Powered-By': 'ASP.NET', 'Server': 'Kestrel', 'Date': 'Mon, 06 Feb 2017 14:51:59 GMT', 'Content-Type': 'application/json'}
400
{"Message":"''"}
I think I am making a mistake when passing body_ in:
r = requests.request("POST", urlToHit, data=body_, headers=header_)
Output via postman
{"Message":"Created Successfully","SuccessCount":2,"FailCount":0}
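You need to send the body as JSON. With data=body_, requests form-encodes the dictionary, so the server does not receive the JSON document it expects. Use the json parameter instead:
r = requests.post(urlToHit, json=body_, headers=header_)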
Please go through the documentation.
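The difference between the two parameters can be illustrated without sending anything. This is a standard-library approximation, not the actual requests internals: urlencode shows roughly what data= produces for a dictionary, and json.dumps shows what json= produces. The body is a trimmed, hypothetical version of body_ from the question.

```python
import json
from urllib.parse import urlencode

# Trimmed, hypothetical version of body_ from the question.
body = {
    "Email": "abc#xyz.com",
    "AssetLicenseType": "Trial",
    "MachineIds": ["machine1", "machine2"],
}

# Roughly what data=body puts on the wire: a form-encoded string.
form_body = urlencode(body, doseq=True)

# What json=body puts on the wire: a JSON document.
json_body = json.dumps(body)

print(form_body)
print(json_body)
```

A server that expects application/json cannot parse the first form, which is consistent with the 400 response shown in the question.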

Python 3 - Urllib3 read internet radio metadata

How can I read information about the currently playing song using urllib3? Which headers should I use?
import urllib3
http = urllib3.PoolManager()
response = http.request("GET", "http://pool.cdn.lagardere.cz/fm-evropa2-128", headers={
'User-Agent': 'User-Agent: VLC/2.0.5 LibVLC/2.0.5',
'Icy-MetaData': '1',
'Range': 'bytes=0-',
})
print(response.data)
I tried this, but it hangs when sending the request. Can anyone help me? Thanks for any answers.
The following code returns all the header data of the given stream. Unfortunately I was not able to obtain song names this way.
import requests
url = 'http://pool.cdn.lagardere.cz/fm-evropa2-128'
def print_url(r, *args, **kwargs):
    print(r.headers)
requests.get(url, hooks=dict(response=print_url))
Output is as follows:
{'icy-description': 'Evropa 2', 'Via': '1.1 s670-6.noc.rwth-aachen.de:80 (Cisco-WSA/8.8.0-085)', 'icy-genre': 'Various', 'icy-url': 'http://www.evropa2.cz', 'icy-pub': '0', 'ice-audio-info': 'ice-samplerate=44100;ice-bitrate=128;ice-channels=2', 'Date': 'Fri, 29 Jan 2016 17:24:20 GMT', 'icy-br': '128, 128', 'Content-Type': 'audio/mpeg', 'Connection': 'keep-alive', 'Transfer-Encoding': 'chunked', 'icy-name': 'Evropa 2', 'Server': 'Icecast 2.3.2', 'Cache-Control': 'no-cache'}
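The header output above does not include the song title. As commonly implemented by SHOUTcast and Icecast servers, the title travels inside the audio stream itself: when the client sends Icy-MetaData: 1, the server is expected to reply with an icy-metaint header giving the number of audio bytes between metadata blocks. Each block starts with one length byte (multiplied by 16) followed by text such as StreamTitle='...';. The sketch below parses that layout from a fabricated byte string. It assumes the server follows this convention, and extract_stream_title is a hypothetical helper name.

```python
import re


def extract_stream_title(stream_bytes, metaint):
    """Return the StreamTitle from the first ICY metadata block, or None."""
    if len(stream_bytes) <= metaint:
        return None
    # The byte right after the audio chunk holds the block length divided by 16.
    length = stream_bytes[metaint] * 16
    block = stream_bytes[metaint + 1:metaint + 1 + length]
    text = block.rstrip(b"\0").decode("utf-8", errors="replace")
    match = re.search(r"StreamTitle='(.*?)';", text)
    return match.group(1) if match else None


# Fabricated sample: 8 bytes of "audio", then one zero-padded metadata block.
metaint = 8
metadata = b"StreamTitle='Artist - Song';"
padded = metadata + b"\0" * (-len(metadata) % 16)
sample = b"\xff" * metaint + bytes([len(padded) // 16]) + padded + b"\xff" * metaint

title = extract_stream_title(sample, metaint)
print(title)
```

With a live stream, only the first icy-metaint bytes plus the metadata block need to be read, for example with stream=True in requests, because the body of a radio stream never ends.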

Python Requests: response object does not contain 'status' header

This is Requests 1.1.0 and Python 2.6.4 (also same behavior on Python 2.7.2).
>>> import requests
>>> response = requests.get('http://www.google.com')
>>> response.status_code
200
>>> print response.headers.get('status')
None
According to the docs, there should be a headers['status'] entry with a string like "200 OK".
Here is the full contents of the headers dict:
>>> response.headers
{'x-xss-protection': '1; mode=block', 'transfer-encoding': 'chunked', 'set-cookie': 'PREF=ID=74b29ee465454efd:FF=0:TM=1362094463:LM=1362094463:S=Xa96iJQX_9BrC-Vm; expires=Sat, 28-Feb-2015 23:34:23 GMT; path=/; domain=.google.com, NID=67=IH21bLPTK2gLTHCyDCMEs3oN5g1uMV99U4Wsc2YA00AbFt4fQCoywQNEQU0pR6VuaNhhQGFCsqdr0FnWbPcym-pizo0xVuS6WBJ9EOTeSFARpzrsiHh6HNnaQeCnxCSH; expires=Fri, 30-Aug-2013 23:34:23 GMT; path=/; domain=.google.com; HttpOnly', 'expires': '-1', 'server': 'gws', 'cache-control': 'private, max-age=0', 'date': 'Thu, 28 Feb 2013 23:34:23 GMT', 'p3p': 'CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."', 'content-type': 'text/html; charset=ISO-8859-1', 'x-frame-options': 'SAMEORIGIN'}
Here is where I got the idea that this dict should contain a 'status' entry.
Am I doing something wrong?
You're looking for the "reason" attribute, which holds the textual part of the status line:
>>> x=requests.get("http://apple.adam.gs")
>>> x.reason
'OK'
>>>
custom.php contains:
header("HTTP/1.1 200 Testing")
Results in:
>>> x=requests.get("http://apple.adam.gs/custom.php")
>>> print x.reason
Testing
>>>
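If a combined string like "200 OK" is what you need, it can be assembled from the two attributes. A minimal Python 3 sketch follows, where status_line is a hypothetical helper. With requests you would pass response.status_code and response.reason; the standard-library phrase is used only as a fallback when no reason is given.

```python
from http import HTTPStatus


def status_line(status_code, reason=None):
    """Build a '200 OK' style string (hypothetical helper)."""
    # Fall back to the standard phrase when the server's reason is not available.
    phrase = reason if reason else HTTPStatus(status_code).phrase
    return f"{status_code} {phrase}"


default_line = status_line(200)
custom_line = status_line(200, "Testing")
print(default_line)
print(custom_line)
```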
