I'm trying to crawl my college website. I set a cookie and added headers, then:
homepage=opener.open("website")
content = homepage.read()
print content
Sometimes I get the source code, but sometimes I get nothing at all.
I can't figure out what's happening.
Is my code wrong, or is it something on the website's side?
Can a single geturl() call be used to resolve two or more chained redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It usually returns the final URL, but sometimes it gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
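If you also need the intermediate hops rather than just the final URL, the redirect chain can be observed directly. Here is a self-contained sketch (Python 3, standard library only; it spins up a throwaway local server purely so the example is runnable, and the paths /a, /b, /c are invented for the demo):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Chain(BaseHTTPRequestHandler):
    # /a -> /b -> /c is a two-hop redirect chain; /c serves the real body.
    hops = {'/a': '/b', '/b': '/c'}

    def do_GET(self):
        if self.path in self.hops:
            self.send_response(302)
            self.send_header('Location', self.hops[self.path])
            self.end_headers()
        else:
            body = b'final page'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Chain)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

# urlopen follows every hop; geturl() reports the last URL in the chain,
# not an intermediate one.
resp = urllib.request.urlopen(base + '/a')
final_url = resp.geturl()
body = resp.read()
server.shutdown()
```

With requests, the same information is available as r.url (the final URL) and r.history (the list of intermediate 302 responses).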
Related
When you go to the site, https://www.jimmyjazz.com/search?keywords=11468285, you are redirected to https://www.jimmyjazz.com/mens/footwear/adidas-solar-hu-nmd/BB9528.
I would like to use requests to enter that search link, then return the url that it is redirected to.
Here is my code to do that:
import requests
from bs4 import BeautifulSoup
sitename = "https://www.jimmyjazz.com/search?keywords=11468285"
response = requests.get(sitename, allow_redirects=True)
print(response.url)
But it still returns the original url:
PS C:\Users\jokzc\Desktop\python\learning requests> py test2.py
https://www.jimmyjazz.com/search?keywords=11468285
How would I change my code to fix that? Thanks :)
This page doesn't actually send back a 302 redirect. I made the same HTTP GET call in Postman, and it returns a 200 OK response:
The same goes for Chrome dev tools, looking at the network traffic:
I think their JavaScript code sets location.href to the new URL somewhere. I didn't walk the whole JS stack trace to prove it, but that is my best guess.
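If the redirect really is done in JavaScript, requests will never see a Location header; the only thing in the response is the script that assigns location.href. One workaround is to scan the returned HTML for that assignment. This is only a sketch: the regex below assumes a plain location.href = '...' pattern and will miss anything more dynamic:

```python
import re

def find_js_redirect(html):
    """Return the URL assigned to location.href in the page source,
    or None if no such assignment is found."""
    m = re.search(r"""(?:window\.)?location\.href\s*=\s*['"]([^'"]+)['"]""",
                  html)
    return m.group(1) if m else None
```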
I am using Python requests to get some information from a company site.
First I need to log in; then I use BeautifulSoup to collect some other URLs, and finally I visit those URLs to get the information.
I use a session, but now the problem is:
after I log in and fetch those URLs, the text they return is not what I want. I'm not sure whether it is because of non-persistent cookies.
s = requests.session()
s.post(url="https://login.company.com/login/login.do",
       data={'uid': user, 'password': password, 'actionFlag': 'loginAuthenticate'})
r = s.get("http://3ms.company.com/hi/space/?l=zh-cn")
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find('div', attrs={'class': 'top_pop mt10'})
for a in div.find_all('li'):
    url = a.find('div', attrs={'class': 'top_pop_P_right fn'}).find('a')['href']
    r1 = s.get(url)
    print(r1.text)
I tried the following code to update the cookies, but it doesn't work:
if r1.cookies.get_dict():
    s.cookies.update(r1.cookies)
Any idea how to resolve this problem?
I have a URL, and as soon as I click on it, it redirects me to another webpage. I want to get that redirected URL in my code with urllib2.
Sample code:
link = 'http://mywebpage.com'
html = urllib2.urlopen(link).read()
Any help is much appreciated
Use the requests library; by default, requests performs location redirection for all verbs except HEAD, so r.url ends up holding the final URL after the redirects.
r = requests.get('https://mywebpage.com')
Or turn redirects off:
r = requests.get('https://mywebpage.com', allow_redirects=False)
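When redirects are turned off, the target of the redirect is still available in the Location header of the 302 response (with requests, that is r.headers['Location']). The same mechanics can be shown with nothing but the standard library, since http.client never follows redirects on its own; the local server below exists only to make the sketch runnable:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        # every request gets a 302 pointing at /target
        self.send_response(302)
        self.send_header('Location', '/target')
        self.end_headers()

    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), AlwaysRedirect)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does not follow redirects, so the raw 302 and its
# Location header are visible directly -- the same thing you get from
# requests.get(url, allow_redirects=False)
conn = http.client.HTTPConnection('127.0.0.1', server.server_address[1])
conn.request('GET', '/start')
resp = conn.getresponse()
status = resp.status
location = resp.getheader('Location')
conn.close()
server.shutdown()
```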
I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is that I can't use response.read() with BeautifulSoup or lxml.
#BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)  # source.txt doesn't work either
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.add(link)
This doesn't work; it doesn't find any tags at all. It works fine when I use requests.get(url).
#lxml->html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  # source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  # /text() doesn't work either
print like_pages
It doesn't print anything. I know it has a problem with the return type of response, since it works fine with requests.get(). What could I do? Could you please provide sample code where response.read() is used in HTML parsing?
By the way, what is difference between response and requests objects?
Thank you!
I found the solution. mechanize.Browser is an emulated browser and only gets the raw HTML. The page I wanted to scrape adds the class to the tag with JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver; I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: You need to have Firefox installed. Or you can choose another driver for the browser you want to use.
A request is what a web client sends to a server, with details about what URL the client wants, what http verb to use (get / post, etc), and if you are submitting a form the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates if the request was successful (code 200 usually if there were no problems, or an error code like 404 or 500). The response usually contains data, like the html in a page, or the binary data in a jpeg. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header which says what format the data is in).
Quote from @davidbuxton's answer on this link.
Good luck!
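The request/response split described above can be made concrete with one short round trip: the client sends a verb, a URL, and form data; the server answers with a status code, headers describing the body, and the body itself. A runnable sketch (Python 3, throwaway local server; the /form path and the 'name' field are invented for the demo):

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FormEcho(BaseHTTPRequestHandler):
    def do_POST(self):
        # the request: a verb (POST), a URL path, and the submitted form data
        length = int(self.headers.get('Content-Length', 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        body = ('hello %s' % form['name'][0]).encode()
        # the response: a status code, headers describing the data, the data
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), FormEcho)
threading.Thread(target=server.serve_forever, daemon=True).start()

data = urllib.parse.urlencode({'name': 'world'}).encode()
resp = urllib.request.urlopen(
    'http://127.0.0.1:%d/form' % server.server_address[1], data=data)
status = resp.status                  # 200 means the request succeeded
ctype = resp.headers['Content-Type']  # what format the body is in
body = resp.read()
server.shutdown()
```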
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: All the URLs work fine in a browser, but from the Python command line the 4th URL gives a 302.
urllib ignores the cookies and sends each new request without them, which causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up to date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
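In Python 3, urllib and urllib2 were merged into urllib.request, and the same cookie-aware opener is built with http.cookiejar. A minimal sketch, using a throwaway local server (invented for the demo) to show the jar carrying a cookie across requests:

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieEcho(BaseHTTPRequestHandler):
    def do_GET(self):
        # echo back whatever Cookie header the client sent,
        # and set a session cookie on every response
        body = (self.headers.get('Cookie') or 'no cookie').encode()
        self.send_response(200)
        self.send_header('Set-Cookie', 'sid=abc123')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(('127.0.0.1', 0), CookieEcho)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_address[1]

jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
first = opener.open(base).read()   # no cookie yet; server sets sid=abc123
second = opener.open(base).read()  # the jar sends the cookie back
server.shutdown()
```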
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command line using curl. It also gives me the 302 Moved. The Location header it provides is different from the one in the document body. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up in a circular redirect, as you indicate.
Perhaps important is the Set-Cookie header. It may be redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and acting on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib. A browser creates sessions, stores cookies, and sends different headers.
I don't know why urllib fails (I get the same response); however, the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.