I'm trying to extract the response headers of a URL request. When I use Firebug to analyze the response of the request, it returns:
Content-Type text/html
However, when I use the Python code:
urllib2.urlopen(URL).info()
the resulting output is:
Content-Type: video/x-flv
I am new to Python and to web programming in general; any helpful insight is much appreciated. Also, if more info is needed, please let me know.
Thanks in advance for reading this post
Try making the request the way Firefox does. You can see the request headers in Firebug, so add them to your request object:
import urllib2
request = urllib2.Request('http://your.tld/...')
request.add_header('User-Agent', 'some fake agent string')
request.add_header('Referer', 'fake referrer')
...
response = urllib2.urlopen(request)
# check content type:
print response.info().getheader('Content-Type')
There's also HTTPCookieProcessor, which can help when cookies are involved, but I don't think you'll need it in most cases. Have a look at Python's documentation:
http://docs.python.org/library/urllib2.html
Content-Type text/html
Really, like that, without the colon?
If so, that might explain it: it's an invalid header, so it gets ignored, so urllib guesses the content-type instead, by looking at the filename. If the URL happens to have ‘.flv’ at the end, it'll guess the type should be video/x-flv.
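You can see the kind of filename-based guess being described with the standard mimetypes module. A quick Python 2 sketch (the URL is a placeholder, and the exact mapping depends on the platform's type tables):
import mimetypes

# guess_type maps a filename/URL extension to a MIME type -- the same
# style of guess described above; results depend on the platform's tables
print mimetypes.guess_type('http://example.com/clip.flv')
# e.g. ('video/x-flv', None) on systems whose type tables know .flv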
This peculiar discrepancy might be explained by different headers (maybe ones of the Accept kind) being sent by the two requests -- can you check that...? Or, if JavaScript is running in Firefox (which I assume you're using, since you're running Firebug?) -- since it's definitely NOT running in the Python case -- "all bets are off", as they say ;-).
Keep in mind that a web server can return different results for the same URL based on differences in the request. For example, content-type negotiation: the requester can specify a list of content-types it will accept, and the server can return different results to try to accommodate different needs.
Also, you may be getting an error page for one of your requests, for example, because it is malformed, or you don't have cookies set that authenticate you properly, etc. Look at the response itself to see what you are getting.
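For instance, here's a hedged sketch of how a client can state its preferences with urllib2 (the URL is a placeholder; whether the server actually negotiates is up to the server):
import urllib2

# ask for HTML first, falling back to anything; a negotiating server
# may return different bodies (and Content-Types) for different Accept values
request = urllib2.Request('http://example.com/resource')  # placeholder URL
request.add_header('Accept', 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8')
response = urllib2.urlopen(request)
print response.info().getheader('Content-Type')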
According to http://docs.python.org/library/urllib2.html there is only a get_header() method and nothing about getheader.
Asking because your code works fine for
response.info().getheader('Set-Cookie')
but once i execute
response.info().get_header('Set-Cookie')
i get:
Traceback (most recent call last):
File "baza.py", line 11, in <module>
cookie = response.info().get_header('Set-Cookie')
AttributeError: HTTPMessage instance has no attribute 'get_header'
Edit: moreover, response.headers.get('Set-Cookie') works fine as well, though it's not mentioned in the urllib2 docs...
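For what it's worth, the get_header() in the urllib2 docs appears to be a method of the Request object, while response.info() returns an httplib.HTTPMessage (built on the old mimetools/rfc822 classes), which offers getheader() and a dict-style get() alias. A quick Python 2 sketch of the equivalent spellings:
import urllib2

response = urllib2.urlopen('http://www.google.com/')
info = response.info()  # httplib.HTTPMessage (an rfc822.Message subclass)

# these should all return the same header value in Python 2:
print info.getheader('Content-Type')        # rfc822.Message method
print info.get('Content-Type')              # alias for getheader
print response.headers.get('Content-Type')  # response.headers is the same object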
For getting the raw header data in Python 2, here's a little bit of a hack, but it works:
"".join(urllib2.urlopen("http://google.com/").info().__dict__["headers"])
basically "".join(list) will the list of headers, which all include "\n" at the end.
__dict__ is a built in python variable for all dicts, basically you can select a list out of a 2d array with it.
and ofcourse ["headers"] is selecting the list value from the .info() response value dict
Hope this helped you learn a few easy Python tricks :)
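That said, a slightly cleaner equivalent (assuming the same Python 2 HTTPMessage internals): the object already exposes the raw lines as a plain headers attribute, so there's no need to go through __dict__:
import urllib2

response = urllib2.urlopen("http://google.com/")
# HTTPMessage keeps the raw header lines (newlines included) in .headers
print "".join(response.info().headers)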
I'm working with the Python requests module and have run into an issue I can't seem to resolve. I'm relatively new to requests, but even with googling/documentation I have hit a wall.
I am trying to get a link (or "domain") from a specific cookie that is in the response of a GET request I've made. I can only seem to print the cookie's value, and not the domain. Explanation below:
Code (See comments):
import pickle
import requests
import time
r = requests.post('https://example.com/AddToCartURL')
print("----------------------------------------")
cookies = r.cookies  # save the cookies for the next request
print(r.cookies)     # prints all cookies
time.sleep(3)
# Code below will now use cookies and do a GET request to checkout URL
r = requests.get("https://example.com/checkout", cookies=cookies)
print("HEADERS")
time.sleep(1)
print(r.headers)
print("---------")
print("Cookie")
print(r.cookies)
print("---------") # THIS IS WHERE MY ISSUE IS :
print(r.cookies['checkout'])  # this prints the cookie's value
print(r.cookies['checkout']['domain'])  # this raises a TypeError (see below)
outputs & issues:
So here are my issues:
#1 - The cookieJar cookie is shown like so when I print r.cookies:
<Cookie checkout=r3jk43nb42knj32--fjnk3jk2jkn2323njk for www.example.com/checkout/url>
And when I print(r.cookies['checkout']) I get the cookie, obviously:
r3jk43nb42knj32--fjnk3jk2jkn2323njk
Well what I'm trying to do here, essentially, is get the domain from it, which I try to do as:
print(r.cookies['checkout']['domain'])
getting the response:
Traceback (most recent call last):
File "/Users/user/PycharmProjects/project/Main.py", line 29, in <module>
print(r.cookies['checkout']['domain'])
TypeError: string indices must be integers
That error is something you'll find from a quick Google search. Documentation-wise I wasn't able to find a clear answer, probably because I'm still bad at searching. I tried the obvious route of using an integer, presuming the message refers to indexing; however, that just prints a single character of the cookie value.
What I'm trying to print, specifically, is example.com/checkout/url from the cookie above. I'm trying to interact with it to continue my code on through the checkout process.
This is an example of the full cookie jar: (called by print(r.cookies))
<RequestsCookieJar[<Cookie random_other_cookie= 3j32fj302fj023jfi for example.com/checkoutURL>, <Cookie checkout=r3jk43nb42knj32--fjnk3jk2jkn2323njk for example.com/checkoutURL>]>
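Side note: iterating the jar yields Cookie objects, which carry name, value, domain, and path attributes, unlike r.cookies['checkout'], which returns just the value string. A minimal sketch (names taken from the example above):
# each item is a cookielib/http.cookiejar Cookie object, not a string
for cookie in r.cookies:
    if cookie.name == 'checkout':
        print(cookie.domain, cookie.path)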
--------------------------------
#2 - Am I tackling this the wrong way?
A little more background information: the GET request above gives a response of 301 Moved Permanently. I am 99% sure that the URL I end up at (at least, front-end wise) is the URL I need, i.e. the same URL as in the cookie above.
My question is should I not be trying to grab the domain from the cookie, and instead somehow try to grab the redirection URL?
(i.e. the URL that the request ends up at, not the original URL https://example.com/checkout)
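A sketch of that alternative, assuming requests exposes redirects the way I think it does (allow_redirects, r.history, and the Location header):
# ask requests not to follow the redirect so the Location header is visible
r = requests.get("https://example.com/checkout", cookies=cookies,
                 allow_redirects=False)
print(r.status_code)              # e.g. 301
print(r.headers.get('Location'))  # the URL the server redirects to

# or follow redirects normally and inspect where we ended up
r = requests.get("https://example.com/checkout", cookies=cookies)
print(r.url)      # final URL after redirects
print(r.history)  # the intermediate redirect responses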
------------------------------------
I hope I outlined my issues well enough. This is my first post on StackOverflow after months of lurking around for answers.
Thank you.
So, I am playing around with Etilbudsavis' API (a Danish directory containing offers from retail stores). My goal is to retrieve data based on a search query. The API actually allows this out of the box. However, when I try to do this, I end up with an error saying that my token is missing. Anyway, here is my code:
from urllib2 import urlopen
from json import load
import requests
body = {'api_key': 'secret_api_key'}
response = requests.post('https://api.etilbudsavis.dk/v2/sessions', data=body)
print response.text
new_body = {'_token': 'token_obtained_from_POST_method', 'query': 'coca-cola'}
new_response = requests.get('https://api.etilbudsavis.dk/v2/offers/search', data=new_body)
print new_response.text
Full error:
{"code":1107,"id":"00ilpgq7etum2ksrh4nr6y1jlu5ng8cj","message":"missing token","
details":"Missing token\nNo token found in request to an endpoint that requires
a valid token.","previous":null,"#note.1":"Hey! It looks like you found an error
. If you have any questions about this error, feel free to contact support with
the above error id."}
Since this is a GET request, you should use the params argument to pass the data in the URL.
new_response = requests.get('https://api.etilbudsavis.dk/v2/offers/search', params=new_body)
See the requests docs.
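Putting it together, a minimal sketch of the whole flow. Note that the 'token' key in the session response is an assumption; inspect the actual response text for the real field name:
import requests

body = {'api_key': 'secret_api_key'}
session_response = requests.post('https://api.etilbudsavis.dk/v2/sessions', data=body)
token = session_response.json().get('token')  # assumed key name; check session_response.text

new_body = {'_token': token, 'query': 'coca-cola'}
# params puts the data in the query string, as a GET expects
new_response = requests.get('https://api.etilbudsavis.dk/v2/offers/search', params=new_body)
print new_response.text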
I managed to solve the problem with the help of Daniel Roseman, who reminded me of the fact that playing with an API in the Python shell is different from interacting with the API in the browser. The docs clearly stated that you'd have to sign the API token in order for it to work. I missed that tiny detail... Nevertheless, Daniel helped me figure everything out. Thanks again, Dan.
I am attempting to extract some information from a website that requires a POST to an ajax script.
I am trying to create an automated script, however I am consistently running into an HTTP 500 error. This is in contrast to a different data pull I did earlier, which worked. Here is my code:
import requests

url = 'http://www.ise.com/ExchangeDataService.asmx/Get_ISE_Dividend_Volume_Data/'
paramList = ''
paramList += '"' + 'dtStartDate' + '":07/25/2014"'
paramList += ','
paramList += '"' + 'dtEndDate' + '":07/25/2014"';
paramList = '{' + paramList + '}';
response = requests.post(url, headers={
'Content-Type': 'application/json; charset=UTF-8',
'data': paramList,
'dataType':'json'
})
I was wondering if anyone had any recommendations as to what is happening. This isn't proprietary data as they allow you to manually download it in excel format.
The input you're generating is not valid JSON. It looks like this:
{"dtStartDate":07/25/2014","dtEndDate":07/25/2014"}
If you look carefully, you'll notice a missing " before the first 07.
This is one of many reasons you shouldn't be trying to generate JSON by string concatenation. Either build a dict and use json.dumps, or if you must, use a multi-line string as a template for str.format or %.
Also, as bruno desthuilliers points out, you almost certainly want to be sending the JSON as the POST body, not as a data header in an empty POST. Doing it the wrong way does happen to work with some back-ends, but only by accident, and that's certainly not something you should be relying on. And if the server you're talking to isn't one of those back-ends, then you're sending the empty string as your JSON data, which is just as invalid.
So, why does this give you a 500 error? Probably because the backend is some messy PHP code that doesn't have an error handler for invalid JSON, so it just bails with no information on what went wrong, so the server can't do anything better than send you a generic 500 error.
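A hedged sketch of the fix; the field names come from the question, but whether the endpoint wants exactly this shape is an assumption:
import json
import requests

url = 'http://www.ise.com/ExchangeDataService.asmx/Get_ISE_Dividend_Volume_Data/'
payload = {'dtStartDate': '07/25/2014', 'dtEndDate': '07/25/2014'}

# serialize with json.dumps and send it as the POST body, not as a header
response = requests.post(
    url,
    data=json.dumps(payload),
    headers={'Content-Type': 'application/json; charset=UTF-8'},
)
print(response.status_code)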
If that's a copy/paste from your actual code, 'data' is probably not supposed to be part of the request headers. As a side note: you don't "post to an ajax script", you post to a URL. The fact that this URL is called via an asynchronous request from some JavaScript on some page of the site is totally irrelevant.
It sounds like a server error, so what you're posting could be breaking their API due to its formatting.
Or their API could be down.
http://pcsupport.about.com/od/findbyerrormessage/a/500servererror.htm
I'm using urllib2's urlopen function to try and get a JSON result from the StackOverflow API.
The code I'm using:
>>> import urllib2
>>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
>>> conn.readline()
The result I'm getting:
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ\...
I'm fairly new to urllib, but this doesn't seem like the result I should be getting. I've tried it in other places and I get what I expect (the same as visiting the address with a browser gives me: a JSON object).
Using urlopen on other sites (e.g. "http://google.com") works fine, and gives me actual html. I've also tried using urllib and it gives the same result.
I'm pretty stuck, not even knowing where to look to solve this problem. Any ideas?
That almost looks like something you would be feeding to pickle. Maybe something in the User-Agent string or Accept header that urllib2 is sending is causing StackOverflow to send something other than JSON.
One telltale is to look at conn.headers.headers to see what the Content-Type header says.
And this question, Odd String Format Result from API Call, may have your answer. Basically, you might have to run your result through a gzip decompressor.
Double checking with this code:
>>> req = urllib2.Request("http://api.stackoverflow.com/0.8/users/",
...                       headers={'Accept-Encoding': 'gzip, identity'})
>>> conn = urllib2.urlopen(req)
>>> val = conn.read()
>>> conn.close()
>>> val[0:25]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ'
Yes, you are definitely getting gzip encoded data back.
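If you need to handle it yourself, a minimal Python 2 sketch of decompressing a gzipped body (urllib2 won't do it for you):
import gzip
import urllib2
from StringIO import StringIO

response = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
data = response.read()
# the Content-Encoding header says whether the body is gzip-compressed
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO(data)).read()
print data[:100]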
Since you seem to be getting different results on different machines with the same version of Python, and in general it looks like the urllib2 API would require you do something special to request gzip encoded data, my guess is that you have a transparent proxy in there someplace.
I saw a presentation by the EFF at CodeCon in 2009. They were doing end-to-end connectivity testing to discover dirty ISP tricks of various kinds. One of the things they discovered while doing this testing is that a surprising number of consumer level NAT routers add random HTTP headers or do transparent proxying. You might have some piece of equipment on your network that's adding or modifying the Accept-Encoding header in order to make your connection seem faster.
I have asked this question here about a Python command that fetches a URL of a web page and stores it in a variable. The first thing that I wanted to know then was whether or not the variable in this code contains the HTML code of a web page:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
    doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who was answering said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials, but couldn't find anything about headers. So my question is where is this Content-Type header located and how can I check it? Could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engine.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() has a headers member, which contains the HTTP response headers as a mapping of names to values. So all you probably need to do is:
if result.headers['Content-Type'] in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use the content for something else
    doTheOtherThing(result.content)
As far as emailing the variable's contents, I suggest the Python email module.
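On App Engine specifically, a minimal sketch using its mail API (the addresses are placeholders, and the sender must be an address authorized for your app):
from google.appengine.api import mail

# placeholder addresses; App Engine requires the sender to be authorized
mail.send_mail(sender='something@yourapp.appspotmail.com',
               to='you@example.com',
               subject='Fetched page',
               body=result.content)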
For info on sending the Content-Type request header, see: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers