I am parsing through the HTML returned from a list of links. When I reach a certain point in each HTML document I raise an Exception.
import urllib2, time
from HTMLParser import HTMLParser

class MyHTMLParser2(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # ... other tag handling ...
        if tag == "div" and "section2" in attrs[0][1]:
            raise NameError('End')

parser2 = MyHTMLParser2()
cntr = 0
for links in ls:
    try:
        f = urllib2.urlopen(links)
        parser2.feed(f.read())
        cntr += 1
        if cntr % 10 == 0:
            print "Parsing...", "It has been", (time.clock() - start) / 60, 'mins.'
            break
    except Exception, e:
        print 'There has been an error Jim. url_check number', cntr
        error_log.write(links)
        continue
It just executes the try block once, for the first link, and then executes the except clause forever.
How can I get it to move on to the next link once the exception is raised?
The error_log is for other errors it would run into related to urllib2; mostly it seemed like it couldn't connect to the webpage fast enough. So if there were a way to quit MyHTMLParser2 without throwing an exception, that would be great. That way I could re-implement the error_log.
No, your diagnosis is not correct; there is no infinite exception loop here. Each URL raises an entirely separate exception.
The cntr variable won't update whenever you hit an exception, which is perhaps what gives you the impression that you end up in an exception loop. Either move the cntr += 1 line out of the try: statement, or use enumerate() to generate the counter for you.
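For example, a sketch reusing the names from your snippet; enumerate() keeps the counter accurate even when an iteration raises:
for cntr, links in enumerate(ls, 1):
    try:
        parser2.feed(urllib2.urlopen(links).read())
    except Exception, e:
        print 'There has been an error Jim. url_check number', cntr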
That said, why are you trying to parse multiple HTML pages with one parser instance? Most likely the exception you keep getting means a specific page is malformed and put the parser into a state it cannot continue from, so every subsequent feed() on that same instance fails as well.
You should not stop the parser with an exception. Parsing is a fairly complex process, and usually it's better to let the parser complete, collecting the information you need, and to process that information once the parser has done its job. That way you keep different concerns in your software separate, making everything easier to maintain, debug and understand.
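For instance, here is a rough sketch of that approach; it assumes the marker div carries a class attribute containing "section2" (as your attrs check suggests) and creates a fresh parser per page:
import urllib2
from HTMLParser import HTMLParser

class SectionParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.done = False       # set once the "section2" div is seen
        self.collected = []     # whatever data you want from the page

    def handle_starttag(self, tag, attrs):
        if tag == "div" and "section2" in dict(attrs).get("class", ""):
            self.done = True    # stop collecting, but let the parse finish cleanly

    def handle_data(self, data):
        if not self.done:
            self.collected.append(data)

for links in ls:
    parser = SectionParser()    # a fresh parser per page avoids stale state
    parser.feed(urllib2.urlopen(links).read())
    # process parser.collected here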
TLDR: Is there a good way to cascade a hierarchy of exception handlers, similar to what is possible with a series of if statements? E.g. one handler may attempt to handle a problem but throw an exception to be caught by the next handler, or the initial try block throws the second exception directly.
Basic premise: (there may be poorly conceived code here, so bear with me). This may also be a duplicate, but I couldn't find one.
I am trying to verify the validity of a url with a head request. If I get a ConnectionError, the url is not valid. The head request will helpfully throw a MissingSchema exception for a missing "http://", so I added an exception handler that retries the url with "http://" prepended. However, if the url is still invalid it throws the expected ConnectionError. Is there a good way to pass that exception back to the exception handler that takes the ConnectionError directly from the try block? This would be similar to how you can cascade if statements. I could solve this particular example with some copy-paste or recursion, but I could see both solutions becoming pretty annoying in more complex functions. Sample code below:
def checkURL(url):
    try:
        resp = requests.head(url)
        return True  # if request does not raise exception
    except requests.exceptions.MissingSchema as exception:
        # try url with schema
        resp = requests.head('http://' + url)
        return True
        # if url is still bad it will throw a ConnectionError;
        # I would like this to be also handled by the block below
    except requests.ConnectionError as exception:
        # ConnectionError == bad url
        return False
I could solve this by duplicating my ConnectionError handler in the secondary try/except block, but that seems like a bad solution. Or I could recursively call checkURL('http://' + url) in the MissingSchema handler, but I could see that being problematic or inefficient too if there were more work being done in the initial try block. There's a good chance I'm missing something obvious here, but I'd appreciate any feedback.
In this case it would be easier to check the protocol "http://" or "https://" with .startswith() or a regular expression.
Nesting exception handling is rarely a good design choice.
Also your strategy for checking the url by trying the request multiple times can have serious performance issues if you are going to do a lot of checks.
Your best option here would be to check whatever you can without performing any request first, and only then have a single try block with multiple except clauses if you need to address the failures separately.
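For example, a sketch of that shape (a hypothetical rewrite of checkURL; note that str.startswith() accepts a tuple of prefixes):
def checkURL(url):
    # normalize the schema up front instead of waiting for MissingSchema
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    try:
        requests.head(url)
        return True
    except requests.ConnectionError:
        # bad url / host unreachable
        return False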
Edit: in the context of your question, where you want to retry the request in case of failure, the up-front check by itself is not really helpful and will force you to repeat code.
If you need a mechanism that keeps retrying some operation, then you need to enclose the try/except block inside a loop. For example:
for url in url_variations:
    try:
        requests.head(url)
        return url
    except BlaBlaError:  # substitute the exception(s) you actually expect
        continue
return None
Many a time we write code inside a try:/except: block in Django. But my confusion is what to write inside the except: block, and how to know what the exact error is and raise it.
For example: in this particular case I am invoking a utility function from my views.py, and I have written it like this:
try:
    res = process_my_data(request, name, email)
except:
    import sys
    print sys.exc_value
Inside the process_my_data() definition I am doing some DB operations. If that fails and control reaches the except block, what should I write there? I am not sure, which is why I wrote print sys.exc_value.
I think the biggest question you have to ask yourself is "why am I writing this as a try except in the first place?" Do you know that what is inside the try is capable of throwing an exception? If the answer to that is yes, then surely you know what type of exception to catch. In this instance it seems like it depends on what backend DB library you are using. Look up the documentation for the one you're using. From there it's really up to you what you want to do within the except - how do you want your program to behave if an exception is thrown? At the absolute minimum I would catch a general exception and print the output to the console with something like this:
except Exception, e:
    message = str(e)
    print(message)
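If the DB work goes through Django's ORM, you could catch its documented base exception django.db.DatabaseError instead of the general one; a sketch reusing the names from your snippet:
from django.db import DatabaseError

try:
    res = process_my_data(request, name, email)
except DatabaseError, e:
    # the DB operation failed; log it and decide how the view should respond
    print 'process_my_data failed:', str(e)
    res = None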
I'm attempting to scrape the HTML from various webpages of a website. However, I am occasionally getting the following error:
urllib2.HTTPError: HTTP Error 500: Internal Server Error
I'm trying to do a "while" loop to keep trying until the error goes away, but I haven't figured out the correct format for the loop. It seems as though the website is a bit flaky, since it seems to fail on a different webpage each time.
I'm trying to do something like this:
web_raw_results = urllib2.urlopen(web_url)
while urllib2.HTTPError:
    web_raw_results = urllib2.urlopen(web_url)
But, that seems like it's doing just the opposite when I run it (it seems like it keeps repeatedly pulling the same webpage until it gets an error).
I'm pretty new to Python and I'm just messing around with a hobby project, so don't assume that I understand very much. I'm sure I've made a stupid mistake, but I can't figure out what I did wrong.
urllib2.urlopen is throwing an exception. You need to use the try and except statements to "catch" the exception, like this:
while True:
    try:
        web_raw_results = urllib2.urlopen(web_url)
        break
    except urllib2.HTTPError:
        continue
This will loop continuously until the fetch succeeds. You don't really want to do this; repeatedly requesting a URL in this sort of tight loop would probably be frowned upon by the server operator. You might want to insert a delay before retrying, and you might want to behave differently depending on whether you get a 500 error or something else. Maybe:
import time

while True:
    try:
        web_raw_results = urllib2.urlopen(web_url)
        break
    except urllib2.HTTPError, detail:
        if detail.code == 500:  # HTTPError carries the HTTP status in .code
            time.sleep(1)
            continue
        else:
            raise
This will pause for 1 second and continue the loop in the event of a 500 error; otherwise it will raise the exception (pass it on up the call stack).
The Python tutorial has lots of good information.
I would do something like this:
import time

RETRY_TIME = 20.0

while True:
    try:
        web_raw_results = urllib2.urlopen(web_url)
        break
    except urllib2.HTTPError:
        time.sleep(RETRY_TIME)
import time
import traceback
import sys
import tools
from BeautifulSoup import BeautifulSoup

f = open("randomwords.txt", "w")
while 1:
    try:
        page = tools.download("http://wordnik.com/random")
        soup = BeautifulSoup(page)
        si = soup.find("h1")
        w = si.string
        print w
        f.write(w)
        f.write("\n")
        time.sleep(3)
    except:
        traceback.print_exc()
        continue
f.close()
It prints just fine. It just won't write to the file. It's 0 bytes.
You can never leave the while loop, so the f.close() call is never reached and the stream buffer for the file is never flushed.
Let me explain a little further: in your except clause you've included continue, so there is no exit from the loop. Perhaps you should add some sort of indicator that you've reached the end of the input instead of looping on a static 1. Then the close call would be reached and the information would be written to the file.
A bare except is almost certainly a bad idea; you should only handle the exceptions you expect to see. Then if something totally unexpected happens, you will still get a useful traceback for it.
import time
import tools
from BeautifulSoup import BeautifulSoup

def scan_file(url, logf):
    try:
        page = tools.download(url)
    except IOError:
        print("Couldn't read url {0}".format(url))
        return
    try:
        soup = BeautifulSoup(page)
        w = soup.find("h1").string
    except AttributeError:
        print("Couldn't find <h1> tag")
        return
    print(w)
    logf.write(w)
    logf.write('\n')

def main():
    with open("randomwords.txt", "a") as logf:
        try:
            while True:
                time.sleep(3)
                scan_file("http://wordnik.com/random", logf)
        except KeyboardInterrupt:
            pass  # Ctrl-C ends the loop; 'break' is not valid outside a loop

if __name__ == "__main__":
    main()
Now you can stop the program by typing Ctrl-C, and the with statement will ensure that the log file is closed properly.
From what I understand, you want to output a random word to a file every three seconds. But buffering takes place, so you will not see your words until the buffer has grown too large, typically on the order of 4K bytes.
I suggest that in your loop you add an f.flush() before the sleep() line.
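For example, the tail of the loop body would become:
f.write(w)
f.write("\n")
f.flush()   # push the buffer out to disk on every iteration
time.sleep(3)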
Also, as wheaties suggested, you should have proper exception handling (if I want to stop your program, I will likely send a SIGINT with Ctrl+C, and your program won't stop in that case) and a proper exit path.
I'm sure that when you test your program you kill it hard to stop it, so anything it has written is lost because the file is never properly closed. If your program could exit normally, you would have close()d the file, and close() triggers a flush(), so you would have something written in your file.
Read the answer posted by wheaties.
And if you want to force the file's buffer to be written to disk, read:
http://docs.python.org/library/stdtypes.html#file.flush
I've been confused about when I don't need to use try..except. For the last few days I have used it in almost every function I define, which I think may be a bad practice. For example:
class mongodb(object):
    def getRecords(self, tname, conditions=''):
        try:
            col = eval("self.db.%s" % tname)
            recs = col.find(conditions)
            return recs
        except Exception, e:
            # here make some error log with e.message
            pass
What I thought was: exceptions may be raised everywhere, so I have to use try to catch them.
And my question is: is it good practice to use it everywhere when defining functions? If not, are there any principles for it? Help would be appreciated!
That may not be the best thing to do. The whole point of exceptions is that you can catch them at a very different level from where they are raised. It's best to handle them in the place where you have enough information to do something useful with them (and that is very application- and context-dependent).
For example, the code below can throw IOError("[Errno 2] No such file or directory"):
def read_data(filename):
    return open(filename).read()
Inside that function you don't have enough information to do anything about it, but in the place where you actually use this function you might: in case of such an exception you may decide to try a different filename, or display an error to the user, or something else:
try:
    data = read_data('data-file.txt')
except IOError:
    data = read_data('another-data-file.txt')
    # or
    show_error_message("Data file was not found.")
    # or something else
This (catching all possible exceptions very broadly) is indeed considered bad practice: you'll mask the real reason for the exception.
Catch only explicitly named types of exceptions, the ones you expect to happen and can/will handle gracefully. Let the rest (the unexpected ones) bubble up as they should.
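For example, in your mongodb wrapper you could name the driver's base exception; this sketch assumes the pymongo driver, whose documented base class is pymongo.errors.PyMongoError, and reuses col and conditions from your snippet:
import pymongo.errors

try:
    recs = col.find(conditions)
except pymongo.errors.PyMongoError, e:
    # log the query failure, then let callers see it
    print 'query failed:', str(e)
    raise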
You can log these (uncaught) exceptions (globally) by overriding sys.excepthook:
import sys
import traceback

# ...

def my_uncaught_exception_hook(exc_type, exc_value, exc_traceback):
    msg_exc = "".join(
        traceback.format_exception(exc_type, exc_value, exc_traceback))
    # ... log msg_exc here ...

sys.excepthook = my_uncaught_exception_hook  # install our uncaught exception hook
You must find a balance between several goals:
1. An application should recover from as many errors as possible by itself.
2. An application should report all unrecoverable errors with enough detail to fix the cause of the problem.
3. Errors can happen everywhere, but you don't want to pollute your code with all the error handling code.
4. Applications shouldn't crash.
To solve #3, you can use an exception hook. All unhandled exceptions will cause the current transaction to abort; catch them at the highest level, roll back the transaction (so the database doesn't become inconsistent), and either rethrow them or swallow them (so the app doesn't crash). You should use decorators for this. This solves #4 and #1.
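A minimal sketch of such a decorator; rollback_transaction() and log_exception() are hypothetical hooks standing in for whatever your application provides:
from functools import wraps

def transactional(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception, e:
            rollback_transaction()  # hypothetical: undo the current transaction
            log_exception(e)        # hypothetical: report with enough detail
            raise                   # or swallow it here to keep the app alive
    return wrapper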
The solution for #2 is experience. You will learn with time what information you need to solve problems. The hard part is to still have that information when an error happens. One solution is to add debug logging calls in the low-level methods.
Another solution is a dictionary per thread in which you can store some bits of context, and which you dump when an error happens.
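A sketch of that idea using threading.local; remember() and dump_context() are made-up helper names:
import threading

_context = threading.local()  # one private dict per thread

def remember(**bits):
    _context.__dict__.update(bits)  # drop breadcrumbs as work progresses

def dump_context():
    return dict(_context.__dict__)  # call this from your error handler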
Another option is to wrap a large section of code in a try: except: (for instance, one specific GUI page in a web application) and then use sys.exc_info() to print out the error and also the stack where it occurred:
import sys
import traceback

try:
    x = 1 / 0  # some buggy code
except:
    print sys.exc_info()[0]  # prints the exception class
    print sys.exc_info()[1]  # prints the error message
    print repr(traceback.format_tb(sys.exc_info()[2]))  # prints the stack