I have a URL, and as soon as I click on it, it redirects me to another webpage. I want to get that redirected URL in my code with urllib2.
Sample code:
import urllib2
link = 'http://mywebpage.com'  # urlopen needs the scheme
html = urllib2.urlopen(link).read()
Any help is much appreciated
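Use the requests library. By default, Requests performs location redirection for all verbs except HEAD, and the URL it ended up at is available as r.url:
r = requests.get('https://mywebpage.com')
print r.url  # final URL after any redirects
Or turn off redirects and read the target from the Location header yourself:
r = requests.get('https://mywebpage.com', allow_redirects=False)
print r.headers.get('Location')  # None if the response was not a redirect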
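If you would rather stay with the standard library, as the question asks, the response object returned by urlopen has geturl(), which reports the URL that was actually served after HTTP redirects. Below is a minimal sketch, written for Python 3 (where urllib2 became urllib.request). The local test server, its /start and /landing paths, and the final_url helper are all made up for the demonstration so that it runs without touching a real site.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /start redirects to /landing."""

    def do_GET(self):
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/landing")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"landing page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def final_url(url):
    """Open url, let urllib follow HTTP redirects, return where it ended up."""
    # An empty ProxyHandler keeps the local demo from going through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.geturl()


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

resolved = final_url(base + "/start")
print(resolved)  # the /landing URL, not the /start URL that was requested
server.shutdown()
```

Against a real site you would only need the final_url part; the server is there just to have something that redirects.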
I'm using the mechanize library to log in to a website. I checked, and the login works. The problem is that I can't use response.read() with BeautifulSoup or lxml.
# BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)  # source.txt doesn't work either
some_list = []
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.append(link)
This doesn't work; it doesn't find any tags. It works well when I use requests.get(url).
# lxml.html
from lxml import html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  # source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  # /text() doesn't work either
print like_pages
It doesn't print anything. I suspect the problem is the return type of the response, since the same parsing works well with requests.get(). What can I do? Could you please provide sample code where response.read() is used in HTML parsing?
By the way, what is the difference between response and request objects?
Thank you!
I found the solution. mechanize.Browser is an emulated browser, and it only gets the raw HTML. The page I wanted to scrape adds the class to the tag with JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: You need to have Firefox installed. Alternatively, you can choose another driver to match the browser you want to use.
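To see why the original parsing found nothing, it helps to run a parser over markup where the class only appears at runtime. The sketch below uses the standard library's html.parser (Python 3) rather than BeautifulSoup or lxml, purely so it has no dependencies; the ClassLinkCollector class, the links_with_class helper, and the sample markup are invented for illustration and are not the page from the question.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes:
            self.hrefs.append(attributes.get("href"))


def links_with_class(html_text, wanted_class):
    """Parse raw HTML text and return hrefs of matching links."""
    collector = ClassLinkCollector(wanted_class)
    collector.feed(html_text)
    return collector.hrefs


# Made-up markup: the second link gets its class only from a script at runtime.
raw_html = """
<a class="someClass" href="/static">present in the markup</a>
<a id="late" href="/dynamic">class added later by JavaScript</a>
<script>document.getElementById("late").className = "someClass";</script>
"""

found = links_with_class(raw_html, "someClass")
print(found)  # only the link whose class is in the raw HTML
```

Any parser fed the raw response body sees the same thing: the script is just text to it, so the second link never matches. That is the gap a real browser driven by Selenium closes.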
A request is what a web client sends to a server, with details about what URL the client wants, what http verb to use (get / post, etc), and if you are submitting a form the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates if the request was successful (code 200 usually if there were no problems, or an error code like 404 or 500). The response usually contains data, like the html in a page, or the binary data in a jpeg. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header which says what format the data is in).
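As a rough illustration of the request half of that description, here is a sketch that builds a request object with Python 3's urllib.request and inspects its parts without sending anything. The URL, form fields, and User-Agent value are placeholders, not anything from the question.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and URL, used only to show the parts of a request.
form_data = urlencode({"user": "alice", "comment": "hello"}).encode("ascii")
request = Request(
    "http://example.com/submit",
    data=form_data,
    headers={"User-Agent": "demo-client/0.1"},
)

method = request.get_method()  # the HTTP verb; a body makes urllib default to POST
print(method, request.full_url)
print(request.header_items())  # headers the client would send
print(request.data)            # the encoded form body
```

The response is the object you get back once such a request is actually opened; it is the thing that has read(), a status code, and headers such as Content-Type.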
Quote from @davidbuxton's answer on this link.
Good luck!
I'm trying to crawl my college website. I set a cookie, add headers, and then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what is happening.
Is my code wrong, or is it the website?
Also, can a single geturl() call be used to resolve two or more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It can return the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
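One thing worth checking when geturl() stops at an intermediate URL: it only reflects HTTP-level (3xx) redirects. If the intermediate page moves the browser on with a meta refresh tag or JavaScript, no HTTP client will follow that by itself. Whether that is what happens on the site in the question is an assumption; the sketch below (Python 3, standard library only) just shows how a meta refresh target could be detected. The MetaRefreshFinder class, the meta_refresh_target helper, and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/next".
        content = attributes.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def meta_refresh_target(page_url, html_text):
    """Return the absolute URL a meta refresh points to, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    if finder.target is None:
        return None
    return urljoin(page_url, finder.target)


# Made-up intermediate page that forwards the browser without an HTTP redirect.
middle_page = '<html><head><meta http-equiv="refresh" content="0; url=/final"></head></html>'
target = meta_refresh_target("http://example.com/middle", middle_page)
print(target)
```

If this returns a URL for the page you got stuck on, you can open that URL next and repeat until it returns None.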
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser. But in Python command line the 4th URL is giving a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
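The cookie explanation can be reproduced locally. The sketch below (Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar) starts a throwaway server that keeps redirecting to the same URL until the client sends back the cookie it was given, then fetches it with and without a cookie processor. The CookieGate server, the cookie name, and the fetch helper are invented for the demo; this is not a claim about what Google's server actually does.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGate(BaseHTTPRequestHandler):
    """Hypothetical server: redirects to itself until the client sends a cookie."""

    def do_GET(self):
        if "visited=1" in (self.headers.get("Cookie") or ""):
            body = b"real content"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header("Set-Cookie", "visited=1; Path=/")
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging


def fetch(url, with_cookies):
    """Fetch url, optionally remembering cookies between redirect hops."""
    handlers = [urllib.request.ProxyHandler({})]  # no proxy for the local demo
    if with_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # urllib stops when it detects that the redirects are going in circles.
        return "gave up: HTTP %d" % error.code


server = HTTPServer(("127.0.0.1", 0), CookieGate)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/search" % server.server_port

without_cookies = fetch(url, with_cookies=False)
with_cookies = fetch(url, with_cookies=True)
print(without_cookies)
print(with_cookies)
server.shutdown()
```

Without the cookie processor the client never presents the cookie, so the server never stops redirecting; with it, the second hop carries the cookie and gets the page.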
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command-line using curl. It also gives me the 302 moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up getting a circular response like you indicate.
The Set-Cookie header may be important. The server may keep redirecting until an appropriate cookie is set. It may also be inspecting the User-Agent and acting on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib: the browser creates sessions, stores cookies, and sends different headers.
I don't know why urllib fails (I get the same response), however requests lib works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'  # the URL that fails with urllib
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.
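As a sketch of what "smart enough" means in practice: on a redirect status the client reads the Location header, resolves it against the URL it just requested, and asks again, with a cap so it cannot loop forever. The code below is Python 3 and deliberately does no networking; next_hop, follow, and the canned responses are hypothetical stand-ins so the logic can be read and run in isolation.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_hop(request_url, status, headers):
    """Return the URL a client should request next, or None if no redirect."""
    if status not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    if not location:
        return None
    # Location may be relative, so resolve it against the current URL.
    return urljoin(request_url, location)


def follow(start_url, fetch, max_hops=10):
    """Follow redirects using fetch(url) -> (status, headers); return all hops."""
    hops = [start_url]
    for _ in range(max_hops):
        status, headers = fetch(hops[-1])
        target = next_hop(hops[-1], status, headers)
        if target is None:
            return hops
        hops.append(target)
    raise RuntimeError("too many redirects")


# Canned responses standing in for a real server (made-up URLs).
canned = {
    "http://example.com/a": (302, {"Location": "/b"}),
    "http://example.com/b": (301, {"Location": "http://example.com/c"}),
    "http://example.com/c": (200, {}),
}
hops = follow("http://example.com/a", canned.__getitem__)
print(hops)
```

Libraries such as urllib2 and requests already do this for you; writing it out mainly helps when you need to see or log each intermediate hop.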
I am using requests to fetch data. On USPS.COM the tracking URL redirects permanently (301), so I can't see the desired page. The URL works perfectly in a browser.
Update:
Added the Real URL for clarification/debugging
According to Redirection and History - Requests documentation:
Requests will automatically perform location redirection for all verbs
except HEAD.
So, you don't need to worry about redirection.
The problem is that USPS.COM checks the User-Agent header and returns a different result depending on its value. You need to specify the header to get the same result as the browser.
For example:
import requests
url = 'http://.....'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
assert 'Delivered' in r.content
I am using cookielib. Sometimes opening a URL in a browser downloads many other files, because the browser makes many additional requests. Can I replicate the same behaviour using cookielib or any other Python library?
For example: To get all the required information from page https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1
I have to make more than one GET request from my Python script. I got the URLs of all the requests the browser makes by analysing the network requests when I opened the page.
I would like to know whether there is any way to make just one request and have it fetch all the related requests by itself, like a browser does.
I am not very interested in the JS or CSS, only the main HTML.
I tried the following code, but it couldn't download the whole page:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
html = response.read()
But when I fetched three other GET URLs in sequence, I got the required HTML in the third GET response. I got these URLs by examining the network tab of the browser:
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
'https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent'
'https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes'
The following is the complete code for the other fetches I am making:
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PT_NAV.ISCRIPT1.FieldFormula.IScript_UniHeader_Frame?c=NNTCgkqGs001AcPaisqGbYpTu%2fbGx4jx&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes')
response.read()
response = opener.open('https://applicant.keybank.com/psc/hrsappl/EMPLOYEE/EMPL/s/WEBLIB_PTPPB.ISCRIPT2.FieldFormula.IScript_TemplatePageletBuilder?PTPPB_PAGELET_ID=KC_LNAV_APPLICANT&target=KCNV_KC_LNAV_APPLICANT_TMPL&Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&PortalIsPagelet=true&NoCrumbs=yes&PortalTargetFrame=TargetContent')
response.read()
response = opener.open('https://hronline.keybank.com/psc/hrshrm/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1&PortalActualURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentURL=https%3a%2f%2fhronline.keybank.com%2fpsc%2fhrshrm%2fEMPLOYEE%2fHRMS%2fc%2fHRS_HRAM.HRS_CE.GBL%3fPage%3dHRS_CE_HM_PRE%26Action%3dA%26SiteId%3d1&PortalContentProvider=HRMS&PortalCRefLabel=Careers&PortalRegistryName=EMPLOYEE&PortalServletURI=https%3a%2f%2fapplicant.keybank.com%2fpsp%2fhrsappl%2f&PortalURI=https%3a%2f%2fapplicant.keybank.com%2fpsc%2fhrsappl%2f&PortalHostNode=EMPL&NoCrumbs=yes&PortalKeyStruct=yes')
required_html = response.read()
requests can handle cookies, as you can see here.
It's a great library, far more powerful than urllib2, and yet simpler-looking.
>>> import requests
>>> r = requests.get('https://applicant.keybank.com/psp/hrsappl/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?Page=HRS_CE_HM_PRE&Action=A&SiteId=1')
>>> r.cookies
Edit: This answer does not really address the problem; I read too fast. Sorry about that.
As suggested by @J.F.Sebastian, I'm adding a link to a Python WebKit client, Ghost.py, that could emulate a browser, as you requested.
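If a full browser emulator is more than you need, a lighter option may work when the extra content is loaded through frames. That is only a guess about the page in the question; the thread does not confirm it. Under that assumption, the sketch below (Python 3, standard library) pulls the frame and iframe sources out of the first response and resolves them to absolute URLs, which you could then fetch with the same cookie-aware opener. The FrameSourceCollector class, the frame_urls helper, and the sample markup and URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceCollector(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(page_url, html_text):
    """Return absolute URLs of the frames embedded in html_text."""
    collector = FrameSourceCollector()
    collector.feed(html_text)
    return [urljoin(page_url, src) for src in collector.sources]


# Made-up outer page that loads its real content through frames.
outer_url = "https://example.com/psp/portal/start"
outer_html = (
    '<iframe src="/psc/portal/content?Page=HOME"></iframe>'
    '<frame src="header.html">'
)
to_fetch = frame_urls(outer_url, outer_html)
print(to_fetch)
```

This only helps when the follow-up URLs appear in the HTML itself. If they are assembled by JavaScript, a browser emulator such as the one linked above is the more realistic route.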