I am writing a web scraper using Python and mechanize. The scraper looks for the "Next" button and loops until it reaches the last page, which does not have a "Next" button. That raises a FormNotFoundError exception, which stops the script. When I try to catch the exception, I get a NameError instead of the actual error.
What am I doing wrong?
Alternatively, is there a better way to stop the loop when I have reached the end?
Here is the relevant code.
import mechanize

br = mechanize.Browser()
br.open("http://example.com")

while True:  # loop until the "Next" form is no longer found
    try:
        br.select_form(nr=2)
        response = br.submit("next")
        # ... other scraping code ...
    except FormNotFoundError:
        break
Here is the error output.
File "scraping.py", line 32, in <module>
except FormNotFoundError:
NameError: name 'FormNotFoundError' is not defined
FormNotFoundError is defined in the mechanize package, not as a built-in name, so the bare name is undefined in your script. Try changing this:
except FormNotFoundError:
to this:
except mechanize._mechanize.FormNotFoundError:
(The exception is also re-exported at the top of the package, so except mechanize.FormNotFoundError: should work as well.)
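A minimal sketch of the corrected loop body, assuming the same form index and submit control as in your snippet:

while True:
    try:
        br.select_form(nr=2)
        response = br.submit("next")
        # ... scrape the page here ...
    except mechanize._mechanize.FormNotFoundError:
        break  # no "Next" form: last page reached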
Related
I'm scraping the content out of 100k sequential URLs (example.com/entry/1 through example.com/entry/100000).
However, around 10% of the URLs have been deleted, so when the script reaches one of them it raises "urllib2.HTTPError: HTTP Error 404" and stops running.
I'm relatively new to Python and was wondering if there's a way to do something like this:
if result == error:
    div_text = "missing"
So that the loop can continue to the next URL, but make a note that it failed.
urllib2.HTTPError is the exception urllib2 raises for error responses such as a 404. You can wrap your URL call in a try/except block:
try:
    # ... put your URL open call here ...
except urllib2.HTTPError:
    div_text = 'missing'
This way, whenever that exception is raised, the interpreter runs the code inside the except block instead of stopping, so the loop can move on to the next URL.
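A fuller sketch of the loop, assuming the entry URLs from your description; parse_div stands in for whatever parsing you already do and is not a real function:

import urllib2

results = {}
for i in range(1, 100001):
    url = "http://example.com/entry/%d" % i
    try:
        html = urllib2.urlopen(url).read()
        div_text = parse_div(html)  # hypothetical placeholder for your existing parsing
    except urllib2.HTTPError:
        div_text = "missing"        # note the failure and keep going
    results[i] = div_text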
How to bypass missing link and continue to scrape good data?
I am using Python2 and Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associated links
Web page n+
more part descriptions
I tried:
try:
    # Do some things.
    # Error caused by missing link.
except IndexError as e:
    print "I/O error({0}): {1}".format(e.errno, e.strerror)
    break  # to go on to next link
# Did not work because the program stopped to report the error!
Since the link is simply missing from the web page, I can't just test for it with an if statement.
Thanks again for your help!!!
I corrected my faulty except clause by following the Python 2 documentation. The corrected except skips past the missing link on the faulty web page and continues scraping data.
Except correction:
except:
    # catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
    e = sys.exc_info()[0]  # needs "import sys" at the top of the script
    print "Error: %s" % e
    break
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib

def get_content_safe(url):
    try:
        contents = urllib.urlopen(url)  # urllib.open does not exist; urlopen is the Python 2 call
        return contents
    except IOError as ex:
        # Report ex your way
        return None

def scrape():
    # ....
    content = get_content_safe(url)
    if content is None:
        pass  # or continue, or whatever fits your loop
    # ....
Long story short, just like Basilevs said, when you catch the exception your code will not break and will keep executing.
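For example, in the loop that walks the part links (the list name part_links is hypothetical):

for url in part_links:
    content = get_content_safe(url)
    if content is None:
        continue  # missing link: note or log it, then move on
    # ... scrape the part description from content ...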
I would like some help on handling a URL that fails to open; currently the whole program is interrupted when the line tree = ET.parse(opener.open(input_url)) fails.
If opening the URL fails on my first function call (motgift), I would like it to wait 10 seconds and then try to open the URL again; if it fails a second time, I would like the script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)

motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
I would be very happy and would appreciate an example of how to make this happen.
Would a try/except block work?
from time import sleep

error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except:
        error += 1
        sleep(10)
import time

try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        pass
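If you would rather wrap the retry up so that the second function call (observer) still runs after a repeated failure, a possible sketch is below; the helper name and the return-None-on-failure convention are my own, not part of the original code:

import time

def open_with_retry(opener, url, wait=10):
    # Try once; on failure wait, then try once more; give up by returning None.
    for attempt in range(2):
        try:
            return opener.open(url)
        except Exception:
            if attempt == 0:
                time.sleep(wait)
    return None

# Inside spider_xml, bail out early when the URL cannot be opened,
# so the next spider_xml call still runs:
#
#     resp = open_with_retry(opener, input_url)
#     if resp is None:
#         return
#     tree = ET.parse(resp)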
Are you looking for this?
website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()
        except URLError as e:
            pass
        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'
                else:
                    pass
        else:
            pass
It's working, but for some websites I get this error:
if abrindo.code == 200:
NameError: name 'abrindo' is not defined
How to fix it?
Replace pass with continue, and at least do some error logging; right now you silently skip erroneous links.
If your request resulted in a URLError, the variable abrindo was never defined, hence your error.
abrindo is assigned only inside the try block, so it will not exist if the except block runs. To fix this, move the block of code starting with
if abrindo.code == 200:
inside the try block. One more suggestion: if you are not doing anything in the else branches, simply remove them instead of writing pass.
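For illustration, the restructured loop body might look like this (same names as in the question, with the do-nothing else branches removed):

try:
    abrindo = urllib2.urlopen(msmwebsite)
    abrindo2 = abrindo.read()
    if abrindo.code == 200:
        palavras = ['registration', 'there is no form']
        for palavras2 in palavras:
            if palavras2 in abrindo2:
                print msmwebsite, 'up'
except URLError as e:
    continue  # better: log the failure before moving on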
I am trying to search through http://www.wegottickets.com/ with the keywords "Live music". But the returned result is still the main page, not the search results page with the live music listings. Could anyone show me what the problem is?
from urllib2 import urlopen
from ClientForm import ParseResponse
response = urlopen("http://www.wegottickets.com/")
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
form.set_value("Live music", name="unified_query")
form.set_all_readonly(False)
control = form.find_control(type="submit")
print control.disabled
print control.readonly
#print form
request2 = form.click()
try:
    response2 = urlopen(request2)
except:
    print "Unsuccessful query"
print response2.geturl()
print response2.info()
print response.read()
response2.close()
Thank you very much!
I've never used ClientForm myself, but I've had success with the Python mechanize module, so that is worth trying if this turns out to be a fault in ClientForm.
However, as a first step, I'd suggest removing your try...except wrapper. What you're basically doing is saying "catch any error, then ignore the actual error and print 'Unsuccessful Query' instead". Not helpful for debugging. The exception will stop the program and print a useful error message, if you don't get in its way.
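If you do want to try mechanize instead, a rough sketch of the same search follows; the form index and field name are taken from your ClientForm code and are not verified against the live site:

import mechanize

br = mechanize.Browser()
br.open("http://www.wegottickets.com/")
br.select_form(nr=0)               # assuming the search form is the first form on the page
br["unified_query"] = "Live music"
response2 = br.submit()
print response2.geturl()
print response2.read()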