Handling bad URLs with requests - python

Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try:
    uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()
text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()
text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?

Try catching the connection error:
from requests.exceptions import ConnectionError

try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
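If you also want to treat HTTP error statuses (4xx/5xx) as failures rather than only failed connections, a minimal sketch of my own (written for Python 3, not part of the original answer) is to call raise_for_status() inside the try and catch requests.exceptions.RequestException, the common base class of ConnectionError, HTTPError and Timeout:

import sys
import requests

url = 'https://httpbinTYPO.org/'  # the bad URL from the question

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx status
except requests.exceptions.RequestException as err:
    # Covers connection failures, timeouts and bad HTTP statuses alike.
    print('Failed to open url:', err)
    sys.exit(1)

print(r.text)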

You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try:
    uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()
text = uh.read()
print text
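For what it's worth, in Python 3 the call becomes urllib.request.urlopen and the exception to catch is urllib.error.URLError (itself a subclass of OSError). A rough sketch of the same idea:

import sys
from urllib.error import URLError
from urllib.request import urlopen

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

try:
    uh = urlopen(url)
except URLError:
    print('Failed to open url.')
    sys.exit(1)

text = uh.read().decode('utf-8')  # assuming the response body is UTF-8
print(text)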

Related

How to filter out crawler traps in a static corpus

I am doing a homework assignment where we are asked to write a program to crawl a given static corpus. In the output, my code prints all the URLs crawled, but I know some of them are traps, and I can't think of a way to filter those out in a Pythonic way.
I used regex to filter out the trap-like URL contents, but this is not allowed in the homework as it is considered hard-coding. Examples of the trap URLs:
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094
http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d504a3676483838e82f07064ca3e12ee
and more with a similar structure. There are also calendar URLs with a similar structure, where only the day changes:
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=26&month=01&year=2017
http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=27&month=01&year=2017
I want to filter those out of my results but I can't think of any way.
I think this will solve your problem
import requests

for url in urls:
    try:
        response = requests.get(url)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except Exception as err:
        print(f'Other error occurred: {err}')
    else:
        print('Url is valid!')
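Checking each URL with requests only tells you whether it resolves, not whether it is a trap. A different sketch (my own, and only an assumption about which query parameters are "volatile" for this corpus) is to canonicalize every URL with urllib.parse and deduplicate on the parts that stay stable:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Query parameters assumed to generate near-duplicate trap pages (adjust for your corpus).
VOLATILE_PARAMS = {'sectok', 'do', 'day', 'month', 'year'}

def canonicalize(url):
    """Collapse trap variants of a URL onto a single key."""
    parts = urlparse(url)
    stable_query = [(k, v) for k, v in parse_qsl(parts.query)
                    if k not in VOLATILE_PARAMS]
    # Force one scheme so http/https duplicates also collapse.
    return urlunparse(('http', parts.netloc, parts.path, parts.params,
                       urlencode(stable_query), ''))

def filter_traps(urls):
    seen = set()
    kept = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept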

Python - Page Source when calling a URL

I'm looking for a really simple piece of code to call a URL and print the HTML source code. This is what I am using. I'm following an online course which has the code:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except:
        return ""

print(get_page('https://www.yahoo.com/'))
This prints nothing, but also no errors. Alternatively, from browsing these forums, I've tried:
from urllib.request import urlopen
print (urlopen('https://xkcd.com/353/'))
when I do this I get
<http.client.HTTPResponse object at 0x000001E947559710>
from urllib.request import urlopen
print (urlopen('https://xkcd.com/353/').read().decode())
Assuming UTF-8 encoding was used
from urllib import request

def get_src_code(url):
    r = request.urlopen(url)
    byte_code = r.read()
    src_code = byte_code.decode()
    return src_code
It prints the empty string returned by the except block. Your code is generating an error because there is no attribute called open in the urllib module. You can't see the error because you are using a try/except block which returns an empty string on every error. In your code, you can see the error like this:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except Exception as e:
        return e.args[0]
To get your expected output, do it like this:
def get_page(url):
    try:
        from urllib.request import urlopen
        return urlopen(url).read().decode('utf-8')
    except Exception as e:
        return e.args[0]
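As an aside (not part of the original answers), the same thing can be sketched with the requests package, assuming it is installed; it picks a decoding for you from the response headers:

import requests

def get_page(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return r.text  # already decoded to str
    except requests.exceptions.RequestException:
        return ""

print(get_page('https://www.yahoo.com/'))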

Python 3 variable in request causing 404?

I've got two lists, one a list of dates the other a list of hours and I'm scanning a website looking for content.
date = ['1','2','3']
hour = ['1','2','3']
I've written the following while/for loops to iterate through the dates and hours and open all the combinations:
datetotry = 0
while (datetotry < len(date)):
    for i in range(len(hour)):
        print ('https://www.website.org/backups/backup_2004-09-'+date[datetotry]+'_'+hour[i]+".sql")
        html = opener.open('https://www.website.org/backups/backup_2004-09-'+date[datetotry]+'_'+hour[i]+'.sql').read()
    datetotry += 1
When the console prints the URL, it looks okay, with the variables being replaced by the numbers from the list.
But it's possibly not replacing the variables in the actual URL request.
The code was stopping due to a 404 error, but I think I handled that with the info I found here:
https://docs.python.org/3/howto/urllib2.html#wrapping-it-up
The first part of the 404 error was showing the
date[datetotry]+'_'+hour[i]+
section, instead of the items from the list like when it's printed to the console.
Does this mean I have to do something like urllib.parse.urlencode to actually replace the variables?
I imported the libraries mentioned in the article and changed the code to:
from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen

while (datetotry < len(date)):
    for i in range(len(hour)):
        html = Request('https://www.website.org/backups/backup_2004-09-'+date[datetotry]+'_'+hour[i]+'.sql')
        try:
            response = urlopen(html)
        except HTTPError as e:
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        except URLError as e:
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        else:
So the code actually runs, not stopping due to a 404 being returned. What is the best way to see what it's actually requesting, and do I have to do some kind of encoding? I'm new to programming, especially Python 3.
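The question is left unanswered here, so as a hedged sketch of my own: string concatenation does substitute the list items, so no urlencode call is needed for a URL built like this; you can confirm exactly what is being requested by printing the Request object's full_url attribute, and catching HTTPError before URLError separates 404s from connection failures:

from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen

date = ['1', '2', '3']
hour = ['1', '2', '3']

for d in date:
    for h in hour:
        url = 'https://www.website.org/backups/backup_2004-09-' + d + '_' + h + '.sql'
        req = Request(url)
        print('Requesting:', req.full_url)  # shows the exact URL that will be sent
        try:
            response = urlopen(req)
        except HTTPError as e:
            # HTTPError is a subclass of URLError, so it must be caught first.
            print('The server couldn\'t fulfill the request. Error code:', e.code)
        except URLError as e:
            print('We failed to reach a server. Reason:', e.reason)
        else:
            html = response.read()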

IncompleteRead using httplib

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.
I have read about a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the number of bytes that are actually retrieved, I have no certainty that such a solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e

# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()

content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission critical problem, yet a curiosity, as even though I can expect urllib2 to have this problem, I am surprised that this error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure python method to work, excepting my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without an explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.
I found out that, in my case, sending an HTTP/1.0 request fixes the problem; I just add this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I do the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards, I switch back to HTTP/1.1 (for connections that support 1.1) with:
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
I fixed the issue by using HTTPS instead of HTTP and it's working fine. No code change was required.

Python: handling exceptions when downloading non-existing files using urllib

I know how to download a file from the web using Python; however, I wish to handle cases where the file being requested does not exist, in which case I want to print an error message ("404: File not found") and not write anything to disk. However, I still want to be able to continue executing the program (i.e., downloading other files in a list that may exist).
How do I do this? Below is some template code to download a file given its url (feel free to modify it if you believe there is a better way, but please keep it concise and simple).
import urllib
urllib.urlretrieve ("http://www.example.com/myfile.mp3", "myfile.mp3")
from urllib2 import URLError

try:
    # your file request code here
except URLError, e:
    if e.code == 404:
        # your appropriate code here
    else:
        # raise maybe?
I followed this guide, which has a specific section about handling exceptions, and found it really helpful.
import urllib, urllib2
try:
    urllib.urlretrieve ("http://www.example.com/", "myfile.mp3")
except URLError, e:
    if e.code == 404:
        print "4 0 4"
    else:
        print "%s" % e
This is what your code does. It basically tries to retrieve the web page at www.example.com and writes it to myfile.mp3. It does not end in an exception because it is not looking for myfile.mp3; it basically writes everything it gets as HTML to myfile.mp3.
If you are looking for code to download files at a certain location on the web, try this
How do I download a zip file in python using urllib2?
Your code should look like this:
try:
urllib.urlretrieve ("http://www.example.com/myfile.mp3", "myfile.mp3")
except URLError,e:
if e.code==404:
print 'file not found. moving on...'
pass
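For reference, a rough Python 3 translation of that idea (my own sketch, not from the answer): urllib.request.urlretrieve raises urllib.error.HTTPError on a 404, so the loop can skip missing files and keep going:

from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve

urls = ['http://www.example.com/myfile.mp3']  # hypothetical list of files to fetch

for url in urls:
    filename = url.rsplit('/', 1)[-1]
    try:
        urlretrieve(url, filename)
    except HTTPError as e:
        if e.code == 404:
            print('404: File not found')
        else:
            print('HTTP error:', e.code)
    except URLError as e:
        print('Failed to reach the server:', e.reason)
    else:
        print('Saved', filename)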
