I know I am probably going about this the wrong way, but I am trying to figure out which resource URLs in our proxy server config are broken or redirect to a different URL than the one we have on file.
An example of a resource being passed to our proxy prefix URL is:
https://login.proxy.library.ohio.edu/login?auth=ou&url=https://www.whatismyip.com/
When this URL is resolved, it should redirect to the proxied link:
https://www-whatismyip-com.proxy.library.ohio.edu/
What I want is the status code of the final URL after it has been resolved and redirected.
Here is a snippet of the code I have so far:
proxy_url = "https://login.proxy.library.ohio.edu/login?auth=ou&url=https://www.whatismyip.com/"
conn = requests.head(proxy_url, allow_redirects=True)
print conn.url[:-3]
The [:-3] is there to strip some unwanted characters from the end of the string.
However, it only returns the original link I pass in.
How can I get the correct proxied URL after it is resolved and redirected?
You can use the history property of the Response object to track redirects:
import requests

resp = requests.head("http://some.url", allow_redirects=True)
for elem in resp.history:
    print elem.url  # "some_other.url"
Response.history contains all the intermediate responses along with their URLs. Read more here.
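If what you ultimately need is the final URL and its status code, the Response object returned after all the redirects already carries both. A minimal sketch, assuming the proxy answers HEAD requests with ordinary HTTP redirects (if it only redirects on GET, swap requests.head for requests.get):
import requests

proxy_url = "https://login.proxy.library.ohio.edu/login?auth=ou&url=https://www.whatismyip.com/"
resp = requests.head(proxy_url, allow_redirects=True)

# resp.url is the last URL in the chain; resp.status_code is its status code
print resp.url
print resp.status_code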
I have a task to crawl all the Pulitzer Prize winners, and I found a page that has everything I want: https://www.pulitzer.org/prize-winners-by-year/2018.
But I ran into the following problems.
Problem 1: How do I crawl a dynamic page? I use Python's urllib2.urlopen to get the page's content, but this dynamic page doesn't return the real content that way.
Problem 2: I then found an API URL in the browser's dev tools: https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json. But when I send a GET request with urllib2.urlopen, I always get null. Why does this happen, and how can I handle it?
If this is too naive a question, please give me some keywords so I can learn about it from Google.
Thanks in advance!
One way to handle this is to create a session using the requests module. The session carries over the details the site needs for the subsequent API call. You also have to pass a Referer header, which tells the API which year you are asking about.
import requests
s = requests.session()
url = "https://www.pulitzer.org/prize-winners-by-year/2017"
resp1 = s.get(url)
headers = {'Referer': 'https://www.pulitzer.org/prize-winners-by-year/2017'}
api = "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json"
data = s.get(api, headers=headers)
Now you can extract the data from the response stored in data.
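For example, the JSON body can be parsed with the response's .json() method; the exact structure of the payload is specific to this API, so inspecting it first is a reasonable way to find the fields you need:
winners = data.json()  # parse the JSON returned by the API
print(winners)         # inspect the structure before extracting fields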
Here is my scenario.
I have a lot of links. I want to know whether any of them redirect to a different site (maybe a particular one), and I only want to get those redirect URLs. (I want to preserve them for further scraping.)
I don't want to fetch the contents of the webpage; I only want the link it redirects to. If there are multiple redirects, I may only want to follow the URLs up to, say, the third redirect (so that I don't get stuck in a redirect loop).
How do I achieve this?
Can I do this in requests?
Requests seems to have r.status_code, but it only works after fetching the page.
You can use requests.head(url, allow_redirects=True), which only fetches the headers. If a response has a Location header, requests follows the redirect and sends a HEAD request to the next URL.
import requests
response = requests.head('http://httpbin.org/redirect/3', allow_redirects=True)
for redirect in response.history:
    print(redirect.url)
print(response.url)
Output:
http://httpbin.org/redirect/3
http://httpbin.org/relative-redirect/2
http://httpbin.org/relative-redirect/1
http://httpbin.org/get
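If you only want to follow up to the third redirect, as mentioned in the question, one approach is to disable automatic redirects and walk the Location headers yourself. A minimal sketch, assuming each hop is an ordinary 3xx response with a Location header (on Python 2 the urljoin import comes from urlparse instead):
import requests
from urllib.parse import urljoin

def follow_redirects(url, max_hops=3):
    # Collect redirect targets for url, stopping after max_hops hops.
    hops = []
    current = url
    for _ in range(max_hops):
        resp = requests.head(current, allow_redirects=False)
        if not resp.is_redirect:  # not a 3xx with a Location header
            break
        current = urljoin(current, resp.headers['Location'])
        hops.append(current)
    return hops

print(follow_redirects('http://httpbin.org/redirect/3'))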
I am using the Requests API with Python 2.7.
I am trying to download certain webpages through proxy servers. I have a list of available proxy servers, but not all of them work as desired: some proxies require authentication, others redirect to advertisement pages, and so on. To detect or verify incorrect responses, I have included two checks in my request code. It looks similar to this:
import requests
proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies = {'http' : 'http://%s' % proxy})
if response.url != url or response.status_code != 200:
    print 'incorrect response'
else:
    print 'response correct'
    print response.text
With some proxy servers the requests.get call succeeds and passes both of these conditions, yet response.text still contains invalid HTML. However, if I use the same proxy in my Firefox browser and open the same webpage, I am shown an invalid page, while my Python script says the response is valid.
Can someone point out what other checks I am missing to weed out incorrect HTML results?
or
How can I verify that the webpage I received is the one I intended to get?
Regards.
What is an "invalid webpage" when displayed by your browser? The server can return an HTTP status code of 200 while the content is an error message. You recognize it as an error message because you can comprehend it; a browser or your code cannot.
If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis.
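A minimal sketch of that idea, reusing the proxy and URL from the question; the marker string is just a hypothetical placeholder for text you know appears on the real page:
import requests

EXPECTED_MARKER = 'expected phrase from the real page'  # hypothetical marker

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})

# Status code and final URL alone are not enough; also require known content.
if response.status_code == 200 and EXPECTED_MARKER in response.text:
    print 'response correct'
else:
    print 'incorrect response'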
I'm trying to crawl my college website. I set a cookie, add headers, and then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what is happening.
Is my code wrong?
Or is it the website?
Can a single geturl() call be used to resolve two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It usually returns the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
import requests

r = requests.get('website', allow_redirects=True)
print r.text
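requests also keeps every intermediate hop on the response, so you can inspect the whole chain as well as the final URL; a short sketch using the same placeholder URL as above:
for hop in r.history:           # one entry per intermediate redirect
    print hop.status_code, hop.url
print r.status_code, r.url      # the final response after all redirects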
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
With the first three URLs the code works, but with the fourth URL Python gets back the following 302 page:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: All the URLs work fine in a browser, but from the Python command line the fourth URL gives a 302.
urllib ignores cookies and sends each new request without them, which causes a redirect loop at that URL. To handle this, you can use urllib2 (which is more up to date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
It most likely has to do with the headers, and perhaps cookies. I did a quick test on the command line using curl, and it also gives me the 302 Moved. The Location header it returns is different from the requested URL, as is the URL in the document body. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up in the same circular redirect you describe.
The Set-Cookie header is probably important: the server may keep redirecting until an appropriate cookie is set. It may also be inspecting the User-Agent and behaving differently based on it. Those are the big things that differentiate a browser from a tool like requests or urllib: the browser keeps a session, stores cookies, and sends different headers.
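A minimal sketch of mimicking that with requests: a Session to keep any Set-Cookie values across the redirects, plus a browser-like User-Agent (the exact string below is just an example):
import requests

url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'

session = requests.Session()                    # keeps cookies across redirects
session.headers['User-Agent'] = 'Mozilla/5.0'   # example browser-like value

response = session.get(url)                     # redirects are followed automatically
print(response.status_code)
print(response.text)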
I don't know why urllib fails (I get the same response); however, the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.