Invalid response from proxy with Python Requests

I am using the Requests library with Python 2.7.
I am trying to download certain webpages through proxy servers. I have a list of available proxy servers, but not all of them work as desired: some require authentication, others redirect to advertisement pages, and so on. In order to detect/verify incorrect responses, I have included two checks in my URL request code. It looks similar to this:
import requests
proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})
if response.url != url or response.status_code != 200:
    print 'incorrect response'
else:
    print 'response correct'
    print response.text
There are some proxy servers for which the requests.get call succeeds and passes both checks, yet response.text still contains invalid HTML. However, if I use the same proxy in my Firefox browser and open the same webpage, I am shown an invalid page, while my Python script says the response is valid.
Can someone point out what other checks I am missing to weed out incorrect HTML results?
or
How can I verify that the webpage I received is the one I intended to get?
Regards.

What is an "invalid webpage" when displayed in your browser? The server can return an HTTP status code of 200 while the content is an error message. You recognize it as an error message because you can comprehend it; a browser or your code cannot.
If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis, as in the sketch below.
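A minimal sketch of such a content check, reusing the proxy and URL from the question (expected_marker is an assumption: a phrase you know the genuine page contains):
import requests
proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
expected_marker = 'Google'  # assumed phrase that the genuine page contains
response = requests.get(url, proxies={'http': 'http://%s' % proxy})
# Accept the page only if the status, final URL and content all look right.
if (response.status_code == 200
        and response.url == url
        and expected_marker in response.text):
    print 'response correct'
else:
    print 'incorrect response'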

Related

BeautifulSoup not returning the title of page

I tried to get the title of a web page by web scraping with the Beautifulsoup4 Python module, and it returns the string "Not Acceptable!" as the title, but when I open the webpage in a browser the title is different. I tried looping through a list of links and extracting the titles of all the webpages, but it returns the same string "Not Acceptable!" for every link.
Here is the Python code:
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())
Here is the link to the corresponding web page: webpage link
I don't know if this is a problem with Beautifulsoup4 or with the requests library. Is it because the site has bot protection enabled and doesn't return the HTML when I send the request?
The server expects the User-Agent header. Interestingly, it is happy with any User-Agent, even a fictitious one:
result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
An easy way to debug this kind of issue is to print (or write to a file) response.text. Some servers don't allow scraping, and some websites generate their HTML with JavaScript at runtime (e.g. YouTube); in these scenarios response.text can differ from the source HTML you see in the browser. The text below is what the server returned:
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Edit:
As pointed out by DYZ, this is a 406 error and the User-Agent header was missing from the request.
https://www.exai.com/blog/406-not-acceptable
The 406 Not Acceptable status code is a client-side error. It's part of the HTTP response status codes in the 4xx category, which are considered client error responses.
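Putting it together, a minimal sketch of the corrected scraper (the User-Agent string is arbitrary, as noted above):
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
# Any User-Agent value satisfies the server's check; this one is made up.
headers = {'User-Agent': 'My User Agent 1.0'}
result = requests.get(URL, headers=headers)
doc = BeautifulSoup(result.text, 'html.parser')
print(doc.title.get_text())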

Why BeautifulSoup and lxml don't work?

I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is that I can't use response.read() with BeautifulSoup and lxml.
# BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)  # source.txt doesn't work either
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.add(link)
This doesn't work; it doesn't find any tags at all. It works well when I use requests.get(url).
# lxml -> html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  # source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  # /text() doesn't work either
print like_pages
It doesn't print anything. I know the problem is with the return type of the response, since it works well with requests.get(). What can I do? Could you please provide sample code where response.read() is used for HTML parsing?
By the way, what is the difference between the response and requests objects?
Thank you!
I found the solution. mechanize.Browser is an emulated browser and it gets only the raw HTML. The page I wanted to scrape adds the class to the tag with JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: you need to have Firefox installed. Or you can choose another webdriver according to the browser you want to use.
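Since the classes are added by JavaScript, it can also help to wait explicitly for them to appear before querying. A minimal sketch using Selenium's explicit waits (the XPath is the same placeholder class as above, and url is assumed to be defined elsewhere):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get(url)  # url defined elsewhere in your script
# Wait up to 10 seconds for the JavaScript-added links to show up.
links = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//a[@class="someClass"]'))
)
print(len(links))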
A request is what a web client sends to a server, with details about what URL the client wants, what http verb to use (get / post, etc), and if you are submitting a form the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates if the request was successful (code 200 usually if there were no problems, or an error code like 404 or 500). The response usually contains data, like the html in a page, or the binary data in a jpeg. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header which says what format the data is in).
Quote from @davidbuxton's answer on this link.
Good luck!
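To see the difference in practice, here is a small sketch with the requests library that prints the parts of a response described above (the URL is just an example):
import requests
response = requests.get('http://github.com')
print(response.status_code)              # e.g. 200 if the request succeeded
print(response.headers['Content-Type'])  # what format the body is in
print(response.history)                  # earlier redirect responses, if any
print(len(response.text))                # the body itself (HTML in this case)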

Why doesn't urllib2 throw a 404?

I have a public folder in Google Drive, in which I store pictures.
In Python, I am trying to detect whether a picture with a particular name exists or not. I am using this code:
import urllib2
url = "http://googledrive.com/host/0B7K23HtYjKyBfnhYbkVyUld3YUVqSWgzWm1uMXdrMzQ0NlEwOXVUd3o0MWVYQ1ZVMlFSNms/0000.png"
resp = urllib2.urlopen(url)
print resp.getcode()
And even though there is no file with this name in this folder, this code is not throwing an exception and is printing "200" as the return code. I have checked in my browser and this URL (http://googledrive.com/host/0B7K23HtYjKyBfnhYbkVyUld3YUVqSWgzWm1uMXdrMzQ0NlEwOXVUd3o0MWVYQ1ZVMlFSNms/0000.png) does return a 404, after a few redirects.
Why doesn't urllib2 detect that this file actually doesn't exist?
When you make the request, it goes to Google's web servers and is processed there. You would see a 404 on your end if and only if Google's servers returned a 404; urllib2 simply encapsulates the underlying handshaking and data-transfer logic.
In this particular case, Google's server-side code requires the request to be authenticated, and your request URL is simply unauthenticated. As such, the request is redirected to the login page, and since that is a valid, existing page/response, urllib2 shows the correct code, 200. You can get the same page if you open the link in a private window.
However, if you are authenticated (basically logged in to your Gmail/Google Docs account) and then open the URL, you will get the 404 error.
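One way to detect this from code is to compare the URL you requested with the URL you actually ended up at after the redirects; a minimal sketch using the folder URL from the question:
import urllib2
url = "http://googledrive.com/host/0B7K23HtYjKyBfnhYbkVyUld3YUVqSWgzWm1uMXdrMzQ0NlEwOXVUd3o0MWVYQ1ZVMlFSNms/0000.png"
resp = urllib2.urlopen(url)
print resp.getcode()  # 200, because the redirect target (the login page) exists
print resp.geturl()   # the URL you actually landed on after the redirects
if resp.geturl() != url:
    print 'redirected away from the requested file -- it is probably missing'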

In Python why does urllib.urlopen make Google give an http status "302 Moved"?

Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python gives the following 302:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says the URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser, but on the Python command line the fourth URL gives a 302.
urllib is ignoring the cookies and sending the new request without cookies, so it causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up-to-date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command line using curl. It also gives me the 302 Moved. The Location header it provides is different, as is the one in the document. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up in a circular response, as you describe.
Perhaps important is the Set-Cookie header: the server may keep redirecting until it gets an appropriate cookie set. It may also be inspecting the User-Agent and acting on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib: the browser creates sessions, stores cookies, and sends different headers.
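If you want to mimic those browser behaviours explicitly, a small sketch with a requests Session that persists cookies and sends a browser-like User-Agent (the header value is just an example):
import requests
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; example)'})
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'
# The session stores any Set-Cookie values and re-sends them on the
# follow-up requests triggered by the redirects.
response = session.get(url)
print(response.status_code)
print(response.history)  # the intermediate 302 responses, if any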
I don't know why urllib fails (I get the same response); however, the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print (requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open that URL in your browser, you'll see that you also get the initial 302 response; your browser is just smart enough to follow the redirect automatically. So your code is receiving the correct response. If you want your code to follow the redirect to the new URL automatically, you have to make it smart enough to do so.

Python requests library doesn't have all headers

I am using the Python requests library for a POST request and I expect a response with an empty payload. I am interested in the headers of the returned message, specifically the Location attribute. I tried the following code:
response = requests.request(method='POST', url=url, headers={'Content-Type': 'application/json'}, data=data)
print response.headers              # displays a case-insensitive map
print response.headers['Location']  # blows up
Strangely, the Location attribute is missing from the headers map. If I try the same POST request in Postman, I do get a valid Location attribute. Has anyone else seen this? Is this a bug in the requests library?
Sounds like everything is working as expected. Check your response.history.
From the Requests documentation:
Requests will automatically perform location redirection for all verbs except HEAD.
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
From the HTTP Location page on wikipedia:
The HTTP Location header field is returned in responses from an HTTP server under two circumstances:
To ask a web browser to load a different web page. In this circumstance, the Location header should be sent with an HTTP status code of 3xx. It is passed as part of the response by a web server when the requested URI has:
Moved temporarily, or
Moved permanently
To provide information about the location of a newly created resource. In this circumstance, the Location header should be sent with an HTTP status code of 201 or 202.
The requests library follows redirects automatically.
To take a look at the redirects, look at the history of the response. More details in the docs.
Or you can pass the extra allow_redirects=False parameter when making the request, as in the sketch below.
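A minimal sketch of both approaches (the URL and payload are placeholders, not the original poster's endpoint):
import requests
url = 'https://example.com/api/items'  # placeholder endpoint
data = '{"name": "example"}'           # placeholder JSON payload
# Option 1: stop at the first response so its Location header stays visible.
response = requests.post(url, headers={'Content-Type': 'application/json'},
                         data=data, allow_redirects=False)
print(response.status_code)
print(response.headers.get('Location'))
# Option 2: let requests follow the redirect and inspect the history.
response = requests.post(url, headers={'Content-Type': 'application/json'}, data=data)
for earlier in response.history:
    print(earlier.status_code, earlier.headers.get('Location'))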
