I know how to download a file from the web using python, however I wish to handle cases where the file being requested does not exist. In which case, I want to print an error message ("404: File not found") and not write anything to disk. However, I still want to be able to continue executing the program (i.e. downloading other files in a list that may exist).
How do I do this? Below is some template code to download a file given its url (feel free to modify it if you believe there is a better way, but please keep it concise and simple).
import urllib
urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
from urllib2 import URLError

try:
    # your file request code here
except URLError, e:
    if e.code == 404:
        # your appropriate code here
    else:
        # raise maybe?
I followed this guide, which has a specific section about handling exceptions, and found it really helpful.
import urllib
from urllib2 import URLError

try:
    urllib.urlretrieve("http://www.example.com/", "myfile.mp3")
except URLError, e:
    if e.code == 404:
        print "404"
    else:
        print "%s" % e
This is what your code does: it tries to retrieve the web page at www.example.com and writes it to myfile.mp3. It does not raise an exception because it is not requesting myfile.mp3; it simply writes whatever HTML it receives to myfile.mp3.
If you are looking for code to download files at a certain location on the web, try this
How do I download a zip file in python using urllib2?
Your code should look like this:
import urllib
from urllib2 import URLError

try:
    urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
except URLError, e:
    if e.code == 404:
        print 'file not found. moving on...'
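For reference, here is a minimal Python 3 sketch of the same pattern (in Python 3 these names moved to urllib.request and urllib.error, and urlopen raises HTTPError on a 404; the download function name and the byte-for-byte write are illustrative):

```python
import urllib.error
import urllib.request

def download(url, filename):
    """Save url to filename; on a 404, print a message, write nothing, return False."""
    try:
        data = urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print("404: File not found")
            return False
        raise  # some other HTTP error: let the caller decide
    with open(filename, "wb") as f:
        f.write(data)
    return True
```

Calling download(url, filename) for each entry in your list means a 404 just prints the message and returns False without creating the file, and the loop over the remaining files keeps going.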
Related
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib.request

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle them).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping, as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-alls etc.).
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For more on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
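If pulling in BeautifulSoup is not an option, the standard library's html.parser can do the same job. Here is a hedged sketch (the LinkExtractor class and the sample markup are made up for illustration) that collects (href, link text) pairs the way get_book_url and get_book_name try to with regexes:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor()
parser.feed('<ul><li><a href="/ebooks/11">Alice in Wonderland</a></li></ul>')
print(parser.links)  # [('/ebooks/11', 'Alice in Wonderland')]
```

Unlike the regexes, this keeps working when the page has newlines or extra attributes inside the tags.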
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

try:
    uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()

text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try to catch connection error:
from requests.exceptions import ConnectionError

try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

try:
    uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
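Combining the two answers above in Python 3 terms: call raise_for_status() yourself and catch requests.exceptions.RequestException, the base class of ConnectionError, HTTPError, Timeout, and the rest. A hedged sketch (the fetch_text name and the timeout value are illustrative, not from the question):

```python
import requests

def fetch_text(url):
    """Return the page body, or None if anything about the request fails."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # turns 4xx/5xx responses into HTTPError
    except requests.exceptions.RequestException as e:
        print('Failed to open url:', e)
        return None
    return r.text

# A URL with no scheme fails before any network I/O: the MissingSchema
# error is caught, the message is printed, and None comes back.
print(fetch_text('httpbinTYPO.org/'))
```

This keeps the single-check shape the questioner wanted while still distinguishing "couldn't connect" from "connected but got an error status".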
How to bypass missing link and continue to scrape good data?
I am using Python 2 on Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associcated links
Web page n+
more part descriptions
I tried:
try:
    # Do some things.
    # Error caused by missing link.
except Index Error as e:
    print "I/O error({0}): {1}".format(e.errno, e.strerror)
    break  # to go on to next link.
# Did not work because program stopped to report error!
Since the link is missing from the web page, I can not use an if-missing-link statement.
Thanks again for your help!!!
I corrected my faulty except clause by following the Python 2 documentation. The corrected except skips the missing link on the faulty web site and continues scraping data.
Except correction:
except:
    # catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
    e = sys.exc_info()[0]
    print "Error: %s" % e
    break
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib

def get_content_safe(url):
    try:
        contents = urllib.urlopen(url)
        return contents
    except IOError, ex:
        # Report ex your way
        return None

def scrape():
    # ....
    content = get_content_safe(url)
    if content is None:
        pass  # or continue or whatever
    # ....
Long story short, just like Basilevs said, when you catch the exception, your code will not break and will keep executing.
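A small Python 3 sketch of the pattern both answers describe: catch the specific exception the missing link raises and continue (not break) to the next item. The dict-based items and the KeyError here are illustrative; substitute whatever your scraper actually raises (IndexError, AttributeError, etc.):

```python
# Parts scraped from a page: some have an associated link, some don't.
links = [{'href': 'http://example.com/a'}, {}, {'href': 'http://example.com/b'}]

good = []
for item in links:
    try:
        good.append(item['href'])  # a missing link raises KeyError here
    except KeyError:
        continue                   # skip this part, keep scraping the rest
print(good)  # ['http://example.com/a', 'http://example.com/b']
```

The key differences from the question's attempt: the except names one concrete exception class, and continue moves on to the next item instead of break ending the loop.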
I've got two lists, one of dates and the other of hours, and I'm scanning a website looking for content.
date = ['1','2','3']
hour = ['1','2','3']
I've written the following while / for to loop through the dates and hours and open all the combinations:
datetotry = 0
while (datetotry < len(date)):
    for i in range(len(hour)):
        print('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql')
        html = opener.open('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql').read()
    datetotry += 1
When the console prints the url it looks okay with the variables being replaced with the numbers from the list.
But it's possibly not replacing the variables in the actual url request.
The code was stopping due to a 404 error but I think I handled that with the info I found here:
https://docs.python.org/3/howto/urllib2.html#wrapping-it-up
The first part of the 404 error was showing the date[datetotry]+'_'+hour[i]+ section literally, instead of the items from the list like when it prints to the console.
Does this mean I have to do something like urllib.parse.urlencode to actually replace the variables?
I imported the libraries mentioned in the article and changed code to:
from urllib.error import URLError, HTTPError
from urllib.request import Request, urlopen
while (datetotry < len(date)):
    for I in range(len(hour)):
        HTML = Request('https://www.website.org/backups/backup_2004-09-' + date[datetotry] + '_' + hour[i] + '.sql')
        try:
            response = urlopen(html)
        except URLError as e:
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        except URLError as e:
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        else:
So the code actually runs, not stopping due to a 404 being returned. What is the best way to see what it's actually requesting, and do I have to do some kind of encoding? I'm new to programming, especially Python 3.
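No answer is recorded here, but two things stand out in the snippet above: plain string concatenation does substitute the list items (no urlencode is needed; printing the URL shows exactly what will be requested), and the urllib howto being followed catches HTTPError before URLError, whereas the snippet has except URLError twice, so the first clause swallows everything (HTTPError is a subclass of URLError) and the second is never reached. A hedged sketch of the corrected loop (the URL pattern is the question's own; build_url and fetch_all are illustrative names):

```python
import urllib.request
from urllib.error import URLError, HTTPError

def build_url(d, h):
    """Build one backup URL; print it to see exactly what will be requested."""
    return 'https://www.website.org/backups/backup_2004-09-{}_{}.sql'.format(d, h)

def fetch_all(dates, hours):
    """Try every date/hour combination, skipping any that fail."""
    results = {}
    for d in dates:
        for h in hours:
            url = build_url(d, h)
            print('Requesting:', url)
            try:
                results[url] = urllib.request.urlopen(url).read()
            except HTTPError as e:  # must come first: HTTPError subclasses URLError
                print("The server couldn't fulfill the request. Error code:", e.code)
            except URLError as e:
                print('We failed to reach a server. Reason:', e.reason)
    return results
```

Note also that the snippet builds the Request as HTML but opens html; Python names are case-sensitive, so that would open whatever an earlier html variable held, not the new URL.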
I am creating a python application where I am downloading a list of rss content from internet. I am having a list of 10 url's which I need to download.
I am using the urllib2 library provided by python. The code I am using is:
for url in urls:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
    with open(self.SERVER_PATH + "\\feeds\\" + str(feedID) + str(extension), "w") as requiredData:
        requiredData.write(str(data))
Here the first url is downloaded but while downloading the next url I get an error:
<urlopen error [Errno 66] unknown>
Is there any event which can notify me for the completion of the downloading of the first URL? Or is there any other way with the help of which I can avoid this issue?
Thanks in advance.
Is there any event which can notify me for the completion of the downloading of the first URL?
The raising of the Exception is notification that the URL cannot be downloaded.
Or is there any other way with the help of which I can avoid this issue?
Yes, you can catch the exception.
from urllib2 import URLError

try:
    rssObject = urllib2.urlopen(url)
    data = rssObject.read()
except URLError:
    pass  # do stuff which handles the error
I'm not perfectly sure that's the error you need to catch, but hopefully you have the skills to work out exactly what to catch (if it's not URLError).
As a follow-up to John Mee's answer, and after reading your comments, you can try something like the following:
def complete_urlopen(url):
    complete = False
    while not complete:
        try:
            obj = urllib2.urlopen(url)
            complete = True
        except URLError, e:
            pass
    return obj.read()
And then use it by replacing:
rssObject = urllib2.urlopen(url)
data = rssObject.read()
with:
data = complete_urlopen(url)
Be aware that this code assumes that the urlopen() will eventually succeed. If it never does, your process will hang in that while loop forever. A more sophisticated version of this could contemplate a max number of iterations, such that when they're reached, the process exits.
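To make that concrete, here is a hedged Python 3 sketch of such a version (the max_attempts name and its default of 5 are illustrative choices, not part of the original answer):

```python
import urllib.request
from urllib.error import URLError

def complete_urlopen(url, max_attempts=5):
    """Retry a flaky URL; give up and re-raise the last URLError after max_attempts."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url).read()
        except URLError as e:
            last_error = e  # transient failure: try again
    raise last_error
```

Re-raising the last error, rather than returning None, lets the caller decide whether a permanently unreachable feed should stop the program or just be logged and skipped.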