How to check the HTTP response status in Python

Can someone tell me how to check the status code of an HTTP response with http.client? I didn't find anything specific to that in the documentation of http.client.
Code would look like this:
if conn.getresponse():
    return True   # status code == 200
else:
    return False  # status code != 200
My code looks like that:
from urllib.parse import urlparse
import http.client, sys

def check_url(url):
    url = urlparse(url)
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    r = conn.getresponse()
    if r.status == 200:
        return True
    else:
        return False
if __name__ == "__main__":
    input_url = input("Enter the website to be checked (beginning with www): ")
    url = "http://" + input_url
    url_https = "https://" + input_url
    if check_url(url_https):
        print("The entered Website supports HTTPS.")
    else:
        if check_url(url):
            print("The entered Website doesn't support HTTPS, but supports HTTP.")
    if check_url(url):
        print("The entered Website supports HTTP too.")

Take a look at the documentation here; you simply need to do:
r = conn.getresponse()
print(r.status, r.reason)
Update: if you want (as said in the comments) to check an HTTP connection, you can use an HTTPConnection and read the status:
import http.client
conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
print(r1.status, r1.reason)
If the website is correctly configured to enforce HTTPS, you should not get a status code of 200 over plain HTTP; in this example you'll get a 301 Moved Permanently response, which means the request was redirected, in this case to the HTTPS version of the URL.
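To see that redirect explicitly, here is a minimal sketch building on the snippet above (docs.python.org is just the example host again; getheader('Location') reads the standard redirect target from the response):
import http.client

conn = http.client.HTTPConnection("docs.python.org")
conn.request("HEAD", "/")
r = conn.getresponse()
print(r.status, r.reason)  # e.g. 301 Moved Permanently
if 300 <= r.status < 400:
    # the redirect target, typically the HTTPS version of the URL
    print(r.getheader("Location"))
conn.close()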

Related

how to handle python crawler's urlopen error?

When I write a Python crawler, I often use urlopen. Sometimes it can't open the URL (so I get an error), but when I retry opening the URL, it succeeds. So I handle this situation by writing my crawler like this:
import urllib.request

def url_open(url):
    '''Open the url and return its content.'''
    req = urllib.request.Request(headers=header, url=url)
    while True:
        try:
            response = urllib.request.urlopen(req)
            break
        except:
            continue
    contents = response.read().decode('utf8')
    return contents
I think this code is ugly... but it works, so is there some elegant way to do this?
I would strongly recommend using the requests library. You may end up with the same problem, but I found requests easier to work with and also more reliable.
The same request would go like this
import requests

def url_open(url):
    while True:
        try:
            response = requests.get(url, headers=header)
            break
        except:
            continue
    return response.text
What error are you getting?
I would recommend going ahead and using the requests API with Sessions and Adapters so that you can explicitly set the number of retries. It is more code, but it is definitely cleaner:
import requests

session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(max_retries=3)
https_adapter = requests.adapters.HTTPAdapter(max_retries=3)
session.mount('http://', http_adapter)
session.mount('https://', https_adapter)

response = session.get(url)
if response.status_code != 200:
    # Handle the request failure here
    pass
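If you also want a delay between attempts, max_retries can take a urllib3 Retry object instead of a plain integer; a sketch of that, assuming the urllib3 bundled with requests (the URL is a placeholder):
import requests
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                 # at most 3 retries
    backoff_factor=0.5,                      # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],   # also retry on these status codes
)
adapter = requests.adapters.HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('http://www.example.com')
if response.status_code != 200:
    pass  # handle the request failure here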

Python check if website exists

I wanted to check if a certain website exists. This is what I'm doing:
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent': user_agent }
link = "http://www.abc.com"
req = urllib2.Request(link, headers=headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other errors), what can I do in the page = ... line to make sure that the page I'm reading does exist?
You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.
For Python 2.7.x, you can use httplib:
import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or for 2.7 and 3.x, you can install requests:
import requests
response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether a page exists and don't want to download the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2
from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent': user_agent }
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer the comment of unutbu:
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved for the case of large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also easier on the web server since it doesn't need to send the body back.
import requests

def check_url_exists(url: str):
    """
    Checks if a url exists.
    :param url: url to check
    :return: True if the url exists, False otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
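A quick usage sketch, assuming the check_url_exists function above (example.com is only a placeholder):
print(check_url_exists("http://www.example.com"))               # True if it answers HEAD with 200
print(check_url_exists("http://www.example.com/no-such-page"))  # likely False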
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
code:
import urllib

a = "http://www.example.com"
try:
    print urllib.urlopen(a)
except:
    print a + " site does not exist"
You can simply use the stream parameter to avoid downloading the full file. In the latest Python 3 you won't have urllib2, so it's best to use the proven requests approach. This simple function will solve your problem:
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
Try this one:
import urllib2

website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)

Python Requests - Is it possible to receive a partial response after an HTTP POST?

I am using the Python Requests Module to datamine a website. As part of the datamining, I have to HTTP POST a form and check if it succeeded by checking the resulting URL. My question is, after the POST, is it possible to request the server to not send the entire page? I only need to check the URL, yet my program downloads the entire page and consumes unnecessary bandwidth. The code is very simple
import requests

r = requests.post(URL, payload)
if 'keyword' in r.url:
    ...  # success
else:
    ...  # fail
An easy solution, if it's implementable for you, is to go low-level and use the socket library. For example, suppose you need to send a POST with some data in its body. I used this in my crawler for one site.
import socket
from urllib import quote  # POST body is escaped, use quote

req_header = "POST /{0} HTTP/1.1\r\nHost: www.yourtarget.com\r\nUser-Agent: For the lulz..\r\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\r\nContent-Length: {1}"
req_body = quote("data1=yourtestdata&data2=foo&data3=bar=")
req_url = "test.php"
# plug req_url in as {0} and the length of req_body in as Content-Length
header = req_header.format(req_url, str(len(req_body)))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a socket
s.connect(("www.yourtarget.com", 80))                  # connect it
# send header + CRLF CRLF + body + CRLF CRLF to complete the request
s.send(header + "\r\n\r\n" + req_body + "\r\n\r\n")
page = ""
while True:
    buf = s.recv(1024)  # receive up to 1024 bytes; this should be enough to get the header in one try
    if not buf:
        break
    page += buf
    if "\r\n\r\n" in page:  # if we received the whole header (ending with 2x CRLF), stop
        break
s.close()  # close the socket, which should close the TCP connection even if data is still flowing in
# This should leave you with a header in which you should find a 302 redirect
# and your target URL in the "Location:" header.
There's a chance the site uses the Post/Redirect/Get (PRG) pattern. If so, it's enough not to follow the redirect and to read the Location header from the response.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
If you need more information on what you would get if you had followed the redirection, you can use HEAD on the URL given in the Location header.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
>>> response2 = requests.head(response.headers['location'])
>>> response2.status_code
200
>>> response2.headers
{'date': 'Wed, 07 Nov 2012 20:04:16 GMT', 'content-length': '352', 'content-type':
'application/json', 'connection': 'keep-alive', 'server': 'gunicorn/0.13.4'}
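Applied to the original POST, a minimal sketch of the same idea (URL, payload, and 'keyword' are placeholders from the question; is_redirect is True for a well-formed 3xx response with a Location header):
import requests

URL = 'http://www.example.com/form'   # placeholder
payload = {'field': 'value'}          # placeholder

response = requests.post(URL, data=payload, allow_redirects=False)
if response.is_redirect:
    # only the redirect headers were downloaded, not the target page
    target = response.headers.get('location', '')
    success = 'keyword' in target
else:
    success = 'keyword' in response.url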
It would help if you gave some more data, for example, a sample URL that you're trying to request. That being said, it seems to me that generally you're checking if you had the correct URL after your POST request using the following algorithm relying on redirection or HTTP 404 errors:
if original_url == returned request url:
    correct url to a correctly made request
else:
    wrong url and a wrongly made request
If this is the case, what you can do here is use the HTTP HEAD request (another type of HTTP request like GET, POST, etc.) in Python's requests library to get only the header and not also the page body. Then, you'd check the response code and redirection url (if present) to see if you made a request to a valid URL.
For example:
import requests

def attempt_url(url):
    '''Checks the url to see if it is valid, or returns a redirect or error.
    Returns True if valid, False otherwise.'''
    r = requests.head(url)
    if r.status_code == 200:
        return True
    elif r.status_code in (301, 302):
        if r.headers['location'] == url:
            return True
        else:
            return False
    elif r.status_code == 404:
        return False
    else:
        raise Exception("A status code we haven't prepared for has arisen!")
If this isn't quite what you're looking for, additional detail on your requirements would help. At the very least, this gets you the status code and headers without pulling all of the page data.

Calling a REST API from django view

Is there any way to make a RESTful api call from django view?
I am trying to pass headers and parameters along with a URL from the Django views. I have been googling for half an hour but could not find anything useful.
Any help would be appreciated
Yes of course there is. You could use urllib2.urlopen but I prefer requests.
import requests
from django.http import HttpResponse

def my_django_view(request):
    if request.method == 'POST':
        r = requests.post('https://www.somedomain.com/some/url/save', params=request.POST)
    else:
        r = requests.get('https://www.somedomain.com/some/url/save', params=request.GET)
    if r.status_code == 200:
        return HttpResponse('Yay, it worked')
    return HttpResponse('Could not save data')
The requests library is a very simple API on top of urllib3; everything you need to know about making a request with it can be found here.
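For comparison, a rough standard-library equivalent using urllib.request (Python 3) is sketched below; the URL is the same placeholder as above and the JSON payload is purely illustrative, so treat this as a sketch rather than a drop-in replacement:
import json
import urllib.error
import urllib.request

from django.http import HttpResponse

def my_django_view(request):
    # illustrative payload; in practice build it from request.POST / request.GET
    data = json.dumps({'key': 'value'}).encode('utf-8')
    req = urllib.request.Request(
        'https://www.somedomain.com/some/url/save',
        data=data,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    try:
        with urllib.request.urlopen(req) as resp:
            if resp.status == 200:
                return HttpResponse('Yay, it worked')
    except urllib.error.URLError:
        pass
    return HttpResponse('Could not save data')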
Yes, I am posting my source code; it may help you.
import json
import requests
from django.http import HttpResponse

def my_django_view(request):
    url = "https://test"
    header = {
        "Content-Type": "application/json",
        "X-Client-Id": "6786787678f7dd8we77e787",
        "X-Client-Secret": "96777676767585",
    }
    payload = {
        "subscriptionId": "3456745",
        "planId": "check",
        "returnUrl": "https://www.linkedin.com/in/itsharshyadav/"
    }
    result = requests.post(url, data=json.dumps(payload), headers=header)
    if result.status_code == 200:
        return HttpResponse('Successful')
    return HttpResponse('Something went wrong')
In the case of a GET API:
import requests
from django.http import HttpResponse

def my_django_view(request):
    url = "https://test"
    header = {
        "Content-Type": "application/json",
        "X-Client-Id": "6786787678f7dd8we77e787",
        "X-Client-Secret": "96777676767585",
    }
    result = requests.get(url, headers=header)
    if result.status_code == 200:
        return HttpResponse('Successful')
    return HttpResponse('Something went wrong')
## POST data to a Django server using a Python script ##
import logging
import requests

def sendDataToServer(server_url, people_count, store_id, brand_id, time_slot, footfall_time):
    try:
        result = requests.post(url=server_url, data={"people_count": people_count, "store_id": store_id, "brand_id": brand_id, "time_slot": time_slot, "footfall_time": footfall_time})
        print(result)
        lJsonResult = result.json()
        if lJsonResult['ResponseCode'] == 200:
            print("Data Sent")
            logging.info("Data sent to the server successfully")
    except Exception as e:
        print("Failed to send json to server....", e)

How can I unshorten a URL?

I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I write a Python program to do this?
Additional Clarification:
Case 1: shortened --> unshortened
Case 2: unshortened --> unshortened
e.g. bit.ly/silly in the input array should be google.com in the output array
e.g. google.com in the input array should be google.com in the output array
Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
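Following the comment at the top of that snippet, a Python 3 translation might look like the sketch below (http.client and urllib.parse, // for the integer division); an HTTPS branch is added here since many shorteners redirect to HTTPS targets:
import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme == 'https':
        h = http.client.HTTPSConnection(parsed.netloc)
    else:
        h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')
    response = h.getresponse()
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    return url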
Using requests:
import requests
session = requests.Session() # so connections are recycled
resp = session.head(url, allow_redirects=True)
print(resp.url)
Unshorten.me has an API that lets you send a JSON or XML request and get the full URL returned.
If you are using Python 3.5+ you can use the Unshortenit module that makes this very easy:
from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Open the url and see what it resolves to:
>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908#N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
To unshorten, you can use requests. This is a simple solution that works for me.
import requests
url = "http://foo.com"
site = requests.get(url)
print(site.url)
http://github.com/stef/urlclean
sudo pip install urlclean
urlclean.unshorten(url)
Here is source code that takes into account almost all of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl
comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status / 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            import traceback
            traceback.print_exc()
            return None
You can use geturl():
from urllib.request import urlopen

url = "http://bit.ly/silly"
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# google.com
This is a very easy task; you just need to add four lines of code, that's it :)
import requests
url = input('Enter url : ')
site = requests.get(url)
print(site.url)
Just run this code and you will successfully unshorten the URL.
