Python check if website exists

I wanted to check whether a certain website exists; this is what I'm doing:
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com"
req = urllib2.Request(link, headers = headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other error), what can I do in the page = ... line to make sure that the page I'm reading does exist?

You can use a HEAD request instead of GET. It will only download the headers, not the content, and you can then check the response status.
For Python 2.7.x, you can use httplib:
import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or for 2.7 and 3.x, you can install requests:
import requests
response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')

It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether a page exists and don't want to download the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2

from urllib2 import Request, urlopen, HTTPError, URLError
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com/"
req = Request(link, headers = headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer the comment of unutbu:
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
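For illustration, a quick check of this behaviour (a sketch, assuming Python 2 and urllib2, not part of the original answer):
import urllib2
# urlopen follows the 301 from http://google.com automatically,
# so the final response reports 200 rather than a redirect code.
resp = urllib2.urlopen('http://google.com')
print(resp.getcode())  # 200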

There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved upon for large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
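For completeness, a short usage sketch of the streaming version above, using the same URLs:
print(uri_exists_stream("http://bbb3d.renderfarming.net/download.html"))  # True
print(uri_exists_stream("http://abcdefghblahblah.com/test.mp4"))          # False (host doesn't exist)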

I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also easier on the webserver since it doesn't need to send the body back.
import requests
def check_url_exists(url: str):
    """
    Checks if a url exists.
    :param url: url to check
    :return: True if the url exists, False otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
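As a rough illustration of that point (a sketch, not part of the original answer), you can compare a header reported by HEAD and by GET for the same URL:
import requests
# The Content-Type reported for HEAD should match the one reported for GET.
head_resp = requests.head('http://www.example.com', allow_redirects=True)
get_resp = requests.get('http://www.example.com')
print(head_resp.headers.get('Content-Type'))
print(get_resp.headers.get('Content-Type'))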

Code:
import urllib
a = "http://www.example.com"
try:
    print urllib.urlopen(a)
except:
    print a + " site does not exist"

You can simply use the stream parameter to avoid downloading the full file. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests library. This simple function will solve your problem.
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False

import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1

Try this one:
import urllib2
website='https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)

Related

(Python) How to check http response on status

Can someone tell me how to check the status code of an HTTP response with http.client? I didn't find anything specific to that in the documentation of http.client.
Code would look like this:
if conn.getresponse():
    return True #Statuscode = 200
else:
    return False #Statuscode != 200
My code looks like this:
from urllib.parse import urlparse
import http.client, sys
def check_url(url):
    url = urlparse(url)
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    r = conn.getresponse()
    if r.status == 200:
        return True
    else:
        return False

if __name__ == "__main__":
    input_url = input("Enter the website to be checked (beginning with www):")
    url = "http://" + input_url
    url_https = "https://" + input_url
    if check_url(url_https):
        print("The entered Website supports HTTPS.")
    else:
        if check_url(url):
            print("The entered Website doesn't support HTTPS, but supports HTTP.")
    if check_url(url):
        print("The entered Website supports HTTP too.")
Take a look at the documentation here; you simply need to do:
r = conn.getresponse()
print(r.status, r.reason)
Update: If you want (as said in the comments) to check an HTTP connection, you can use an HTTPConnection and read the status:
import http.client
conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
print(r1.status, r1.reason)
If the website is correctly configured to implement HTTPS, you should not get a status code of 200 for the plain HTTP request; in this example, you'll get a 301 Moved Permanently response, which means the request was redirected, in this case rewritten to HTTPS.
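If you want to see where the redirect points, a small follow-up sketch (assuming the server includes a Location header, which redirect responses normally do):
import http.client
conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
# A 301/302 response carries the target URL in the Location header.
print(r1.status, r1.getheader("Location"))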

Connect with page (Error 403)

I can't connect to the page. Here is my code and the error which I get:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import urllib
someurl = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET"
req = Request(someurl)
try:
    response = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")
Error code: 403
Some websites require a browser-like "User-Agent" header, others require specific cookies. In this case, I found out by trial and error that both are required. What you need to do is:
Send an initial request with a browser-like user-agent. This will fail with 403, but you will also obtain a valid cookie in the response.
Send a second request with the same user-agent and the cookie that you got before.
In code:
import urllib.request
from urllib.error import URLError
# This handler will store and send cookies for us.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)
# Browser-like user agent to make the website happy.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
request = urllib.request.Request(url, headers=headers)
for i in range(2):
    try:
        response = opener.open(request)
    except URLError as exc:
        print(exc)
print(response)
# Output:
# HTTP Error 403: Forbidden (expected, first request always fails)
# <http.client.HTTPResponse object at 0x...> (correct 200 response)
Or, if you prefer, using requests:
import requests
session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
for i in range(2):
    response = session.get(url, cookies=jar, headers=headers)
    print(response)
# Output:
# <Response [403]>
# <Response [200]>
You can use http.client. First, you need to open a connection with the server, and then make a GET request, like this:
import http.client
from urllib.error import HTTPError, URLError

conn = http.client.HTTPConnection("genecards.org:80")
conn.request("GET", "/cgi-bin/carddisp.pl?gene=MET")
try:
    response = conn.getresponse().read().decode("UTF-8")
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")

Re-attempt to open url with urllib in python on timeout

I am looking to parse data from a large number of webpages (>10k) using Python, and I am finding that the function I have written to do this often encounters a timeout error every 500 loops. I have attempted to fix this with a try-except code block, but I would like to improve the function so it will re-attempt to open the url four or five times before returning the error. Is there an elegant way to do this?
My code below:
def url_open(url):
    from urllib.request import Request, urlopen
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        s = urlopen(req, timeout=50).read()
    except urllib.request.HTTPError as e:
        if e.code == 404:
            print(str(e))
        else:
            print(str(e))
            s = urlopen(req, timeout=50).read()
            raise
    return BeautifulSoup(s, "lxml")
I've used a pattern like this for retrying in the past:
def url_open(url):
    from urllib.request import Request, urlopen
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    retrycount = 0
    s = None
    while s is None:
        try:
            s = urlopen(req, timeout=50).read()
        except urllib.request.HTTPError as e:
            print(str(e))
            if canRetry(e.code):
                retrycount += 1
                if retrycount > 5:
                    raise
                # time.sleep() for a bit before retrying
            else:
                raise
    return BeautifulSoup(s, "lxml")
You just have to define canRetry somewhere else.
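For illustration, a minimal canRetry might treat rate limiting and transient server errors as retryable (this is only an assumption; pick the codes that make sense for your crawl):
def canRetry(code):
    # Hypothetical policy: retry on 429 and common transient 5xx errors.
    return code in (429, 500, 502, 503, 504)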

how to handle python crawler's urlopen error?

When I write a Python crawler, I often use urlopen. Sometimes it can't open the url (so I get an error), but when I retry to open this url, it succeeds. So I handle this situation by writing my crawler like this:
def url_open(url):
    '''open the url and return its content'''
    req = urllib.request.Request(headers=header, url=url)
    while True:
        try:
            response = urllib.request.urlopen(req)
            break
        except:
            continue
    contents = response.read().decode('utf8')
    return contents
I think this code is ugly... but it works, so is there some elegant way to do this?
I would strongly recommend using the requests library. You may end up with the same problem, but I found requests easier to work with and also more reliable.
The same request would go like this:
def url_open(url):
    while True:
        try:
            response = requests.get(url, headers=header)
            break
        except:
            continue
    return response.text
What error are you getting?
I would recommend going ahead and using the requests API with Sessions and Adapters so that you can explicitly set the number of retries. It is more code, but it is definitely cleaner:
import requests

session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(max_retries=3)
https_adapter = requests.adapters.HTTPAdapter(max_retries=3)
session.mount('http://', http_adapter)
session.mount('https://', https_adapter)

response = session.get(url)
if response.status_code != 200:
    # Handle the request failure here
    pass
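Note that an integer max_retries only retries failed connections, not bad status codes. If you also want to retry on specific status codes, you can pass a urllib3 Retry object instead (a sketch with illustrative settings; adjust the codes and backoff to your needs):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

url = 'http://www.example.com'  # illustrative URL
response = session.get(url)
print(response.status_code)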

How to use exceptions for different cases with python requests

I have this code:
try:
    response = requests.post(url, data=json.dumps(payload))
except (ConnectionError, HTTPError):
    msg = "Connection problem"
    raise Exception(msg)
Now I want the following:
if status_code == 401
login() and then try the request again
if status_code == 400
then send the response as normal
if status_code == 500
then server problem, try the request again and if not successful raise an Exception
Now these are status codes, and I don't know how I can mix status codes with exceptions. I also don't know which codes will be covered under HTTPError.
requests has a method called raise_for_status, available on your response object, which will raise an HTTPError exception if the response status code is in the 4xx or 5xx range.
Documentation for raise_for_status is here
So, what you can do is, after you make your call:
response = requests.post(url, data=json.dumps(payload))
you make a call to raise_for_status:
response.raise_for_status()
Now, you are already catching this exception, which is great, so all you have to do is check which status code you have in your error. This is available to you in two ways: you can get it from your exception object, or from the response object. Here is the example for this:
from requests import get
from requests.exceptions import HTTPError
try:
    r = get('http://google.com/asdf')
    r.raise_for_status()
except HTTPError as e:
    # Get your code from the exception object like this
    print(e.response.status_code)
    # Or you can get the code which will be available from r.status_code
    print(r.status_code)
So, with the above in mind, you can now use the status codes in your conditional statements
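Putting that together for the cases in the question, a rough sketch (url, payload and login() come from the question and are not defined here; treat them as placeholders):
import json
import requests

# url, payload and login() are assumed from the question (hypothetical here).
response = requests.post(url, data=json.dumps(payload))
if response.status_code == 401:
    # Re-authenticate, then retry the request once.
    login()
    response = requests.post(url, data=json.dumps(payload))
elif response.status_code == 500:
    # Server problem: retry once and raise if it still fails.
    response = requests.post(url, data=json.dumps(payload))
    if response.status_code == 500:
        raise Exception("Server problem")
# A 400 (or any other) response is simply returned as normal.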
https://docs.python.org/2/library/urllib2.html#urllib2.URLError
code
An HTTP status code as defined in RFC 2616. This numeric value
corresponds to a value found in the dictionary of codes as found in
BaseHTTPServer.BaseHTTPRequestHandler.responses.
You can get the error code from an HTTPError via its code member, like so:
try:
    # ...
except HTTPError as ex:
    status_code = ex.code
