After timeout keep trying request - python

I've just started using Python to scrape data. My code below sometimes freezes, and I guess that's because some URL did not respond at all; I think it would work if I just tried that URL again. My question is, if I revise the code like this,
reshomee = requests.get(homeUrl, headers=headerss, timeout=10)
does it try the URL again after 10 seconds with no response? I am worried that it would simply give up without trying again.
I couldn't test this myself because the URL freezes only rarely and at random. Thank you!
def reshome(tries=0):
    try:
        reshomee = requests.get(homeUrl, headers=headerss)
        return reshomee
    except Exception as e:
        print(e)
        if tries < 10:
            print('try:' + str(tries))
            sleep(tries*30+100)
            return reshome(tries+1)
        else:
            print('cannot make it')

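No. With timeout=10, requests.get does not retry on its own; it raises requests.exceptions.Timeout if the server has not responded in time. You can catch that exception from requests.exceptions and retry yourself: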
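def reshome(tries=0):
    try:
        # a tiny timeout such as 0.001 makes the Timeout easy to reproduce;
        # use a realistic value (e.g. 10) in practice
        reshomee = requests.get(homeUrl, headers=headerss, timeout=0.001)
        return reshomee
    except requests.exceptions.Timeout as e:
        if tries >= 10:
            raise  # give up after 10 retries instead of recursing forever
        return reshome(tries+1)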
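The same retry idea can also be written as a loop with a bounded number of attempts, which avoids deep recursion. This is only a minimal Python 3 sketch: `retry_call`, `retry_on`, `max_tries` and `delay` are hypothetical names, and `fetch` stands for any zero-argument callable that performs the request.

```python
import time


def retry_call(fetch, retry_on=(Exception,), max_tries=3, delay=1.0):
    """Call fetch() until it succeeds or max_tries attempts have failed.

    fetch     -- zero-argument callable that performs the request
    retry_on  -- tuple of exception types that should trigger a retry
    max_tries -- total number of attempts (hypothetical default)
    delay     -- base wait in seconds; grows linearly with each attempt
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except retry_on as exc:
            print('attempt %d failed: %s' % (attempt, exc))
            if attempt == max_tries:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)
```

With requests, this could be called as `retry_call(lambda: requests.get(homeUrl, headers=headerss, timeout=10), retry_on=(requests.exceptions.Timeout,))`.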

Related

Handling bad URLs with requests

Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try:
    uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()
text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()
text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try catching the connection error:
from requests.exceptions import ConnectionError
try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try:
    uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()
text = uh.read()
print text
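For readers on Python 3, where `urllib.urlopen` has moved to `urllib.request.urlopen`, the idea of catching specific exceptions rather than everything might be sketched as follows. The function name `describe_url` and its return strings are made up for illustration; only the exception classes come from the standard library.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_url(url, timeout=5):
    """Return a short string describing what happened when opening url."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 'ok %d' % resp.status
    except HTTPError as exc:
        # The server answered, but with an error status such as 404.
        # HTTPError subclasses URLError, so it has to be caught first.
        return 'http error %d' % exc.code
    except URLError as exc:
        # No usable answer: DNS failure, refused connection, and so on.
        return 'unreachable: %s' % exc.reason
    except ValueError as exc:
        # Malformed URL, for example one with no scheme.
        return 'bad url: %s' % exc
```

Ordering the `except` clauses from most to least specific keeps a 404 from being reported as a connection problem.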

Handling a url which fails to open, error handling using urllib

I would like some help with handling a URL that fails to open. Currently the whole program is interrupted when it fails to open the URL ( tree = ET.parse(opener.open(input_url)) )...
If opening a URL fails on my first function call (motgift), I would like the script to wait 10 seconds and then try to open the URL again. If it fails again, I would like the script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)
motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
I would really appreciate an example of how to make this happen.
Would a try/except block work?
error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except Exception:
        error += 1
        sleep(10)
try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        pass
Are you looking for this?

URLRetrieve Error Handling

I have the following code that grabs images using urlretrieve. It works, up to a point.
def Opt3():
    global conn
    curs = conn.cursor()
    results = curs.execute("SELECT stock_code FROM COMPANY")
    for row in results:
        #for image_name in list_of_image_names:
        page = requests.get('url?prodid=' + row[0])
        tree = html.fromstring(page.text)
        pic = tree.xpath('//*[@id="bigImg0"]')
        #print pic[0].attrib['src']
        print 'URL'+pic[0].attrib['src']
        try:
            urllib.urlretrieve('URL'+pic[0].attrib['src'],'images\\'+row[0]+'.jpg')
        except:
            pass
I am reading a CSV to input the image names. It works except when it hits an error/corrupt url (where there is no image I think). I was wondering if I could simply skip any corrupt urls and get the code to continue grabbing images? Thanks
urllib has very poor support for error handling; urllib2 is a much better choice. The urllib2 equivalent of urlretrieve is:
resp = urllib2.urlopen(im_url)
with open(sav_name, 'wb') as f:
    f.write(resp.read())
And the errors to catch are:
urllib2.URLError, urllib2.HTTPError, httplib.HTTPException
You can also catch socket.error in case the network is down.
Simply using except Exception is a bad idea: it will catch every error in the block above, even your typos.
Just use a try/except and continue if it fails
try:
    page = requests.get('url?prodid=' + row[0])
except Exception,e:
    print e
    continue # continue to next row
Instead of pass, why not use continue when an error occurs?
try:
    urllib.urlretrieve('URL'+pic[0].attrib['src'],'images\\'+row[0]+'.jpg')
except Exception as e:
    continue
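In Python 3, `urllib2` was merged into `urllib.request` and `urllib.error`, so the skip-on-failure download discussed in these answers might look like the sketch below. `download` is a hypothetical helper, and the choice of which exceptions to treat as "skip this one" is an assumption that may need adjusting.

```python
import shutil
from urllib.error import URLError
from urllib.request import urlopen


def download(url, dest_path, timeout=10):
    """Save url to dest_path. Return True on success, False on failure."""
    try:
        # urlopen runs first, so no file is created if the URL cannot be opened.
        with urlopen(url, timeout=timeout) as resp, open(dest_path, 'wb') as out:
            shutil.copyfileobj(resp, out)
        return True
    except (URLError, OSError, ValueError) as exc:
        # URLError also covers HTTPError; OSError covers socket and disk
        # errors; ValueError covers malformed URLs.
        print('skipping %s: %s' % (url, exc))
        return False
```

Inside the loop over rows, the call then becomes `if not download(image_url, target): continue`, and a typo elsewhere in the loop body still surfaces as a normal traceback.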

Python, NameError in urllib2 module but only in a few websites

website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()
        except URLError as e:
            pass
        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'
                else:
                    pass
        else:
            pass
It works, but for some websites I get this error:
if abrindo.code == 200:
NameError: name 'abrindo' is not defined
How to fix it?
Replace pass with continue. And at least do some error logging, as you silently skip erroneous links.
If your request results in a URLError, the variable abrindo is never defined, hence your error.
abrindo is created only in the try block. It will not be available if the except block is executed. To fix this, move the block of code starting with
if abrindo.code == 200:
inside the try block. One more suggestion: if you are not doing anything in the else branches, remove them instead of writing them out with pass.
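The two suggestions (skip the rest of the iteration on failure, and only touch the response when the request succeeded) fit naturally into `try`/`except`/`else`. The following is an illustrative Python 3 sketch rather than a drop-in fix: `find_pages` and the injected `fetch` callable are assumptions, chosen so the control flow can be shown without a network. In Python 3, `urllib.error.URLError` is a subclass of `OSError`, so a urllib-based `fetch` would fit this shape.

```python
def find_pages(urls, fetch, keywords):
    """Return the urls whose body contains at least one keyword.

    fetch is any callable taking a URL and returning (status, body);
    it is expected to raise OSError when the URL cannot be opened.
    """
    hits = []
    for url in urls:
        try:
            status, body = fetch(url)
        except OSError as exc:
            # Nothing was assigned in the try block, so skip this URL
            # instead of falling through to code that uses the result.
            print('skipping %s: %s' % (url, exc))
            continue
        else:
            # Runs only when fetch() did not raise.
            if status == 200 and any(word in body for word in keywords):
                hits.append(url)
    return hits
```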

Verify URL exists from file

I have some code that scrapes my mailbox looking for certain URLs. Once it finishes, it creates a file called links.txt.
I want to run a script against that file to get a list of all the URLs in it that are currently live. The script I have only lets me check one URL at a time:
import urllib2
for url in ["www.google.com"]:
    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
Use requests:
import requests
with open(filename) as f:
    good_links = []
    for link in f:
        try:
            r = requests.get(link.strip())
        except Exception:
            continue
        good_links.append(r.url) #resolves redirects
You can also consider extracting the call to requests.get into a helper function:
def make_request(method, url, **kwargs):
    for i in range(10):
        try:
            r = requests.request(method, url, **kwargs)
            return r
        except requests.ConnectionError as e:
            print e.message
        except requests.HTTPError as e:
            print e.message
        except requests.RequestException as e:
            print e.message
    raise Exception("requests did not succeed")
It is trivial to make this change, given that you're already iterating over a list of URLs:
import urllib2
for url in open("urllist.txt"): # change 1
    try:
        connection = urllib2.urlopen(url.rstrip()) # change 2
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
Iterating over a file returns the lines of the file (complete with line endings). We use rstrip() on the URL to strip off the line endings.
There are other improvements you can make. For example, some will suggest you use with to make sure your file is closed. This is good practice but probably not necessary in this script.
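As a small illustration of the `with` suggestion, the file-reading step could be sketched like this in Python 3. `read_urls` is a hypothetical helper; it strips line endings and skips blank lines so that each remaining entry can be handed to whichever URL check is used.

```python
def read_urls(path):
    """Return the non-empty, whitespace-stripped lines of the file at path."""
    with open(path) as handle:
        # Iterating over the file yields one line at a time, newline included.
        return [line.strip() for line in handle if line.strip()]
```

The file is closed as soon as the `with` block ends, even if reading raises an exception.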
