Python and urllib: how to handle network exceptions?

I'm working on a simple script that downloads a file over HTTP using urllib and urllib.request. Everything works well except that I would like to be able to handle the network problems that could happen:
Checking whether the computer is online (connected to the Internet), and proceeding only if it is.
Restarting the download of the file if the connection is lost or too poor during it.
If possible, I would like to use as few packages as possible.
Here is my current code:
import urllib
import urllib.request
url = "http://my.site.com/myFile"
urlSplited = url.split('/')[-1]
print("Downloading : " + urlSplited)
urllib.request.urlretrieve(url, urlSplited)
To check whether a connection is established, I believe I can do something like:
while connection() is True:
    Download()
But that would run the download many times.
I'm working on Linux.

I suggest you use a combination of try, while, and the sleep function, like this:
import socket
import time
import urllib.request

url = "http://my.site.com/myFile"
urlSplited = url.split('/')[-1]

# urlretrieve has no timeout argument, so set a global socket timeout instead
socket.setdefaulttimeout(100)

try_again = True
print("Downloading : " + urlSplited)
while try_again:
    try:
        urllib.request.urlretrieve(url, urlSplited)
        try_again = False
    except Exception as e:
        print(e)
        time.sleep(600)
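If the network stays down, though, that loop will retry every ten minutes forever. A minimal variant (just a sketch, with an arbitrary cap of five attempts) bounds the retries and backs off a little more after each failure:

import socket
import time
import urllib.request

url = "http://my.site.com/myFile"
filename = url.split('/')[-1]

socket.setdefaulttimeout(30)           # applies to the connection urlretrieve opens

max_attempts = 5
for attempt in range(1, max_attempts + 1):
    try:
        urllib.request.urlretrieve(url, filename)
        break                          # success, stop retrying
    except OSError as e:               # URLError and socket errors are subclasses of OSError
        print("Attempt", attempt, "failed:", e)
        time.sleep(30 * attempt)       # wait a bit longer after each failure
else:
    print("Giving up after", max_attempts, "attempts")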

Replace while with if; then everything under it will run only once.
if connection():
    Download()
Also, the connection() function could be something like this:
def connection():
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except Exception:
        return False
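For the second part of the question (continuing a download that was interrupted), one option is an HTTP Range request: check how many bytes are already on disk and ask the server for only the rest. This is a sketch rather than a drop-in solution; it assumes the server honours Range headers, and the URL and file name are the ones from the question:

import os
import urllib.request

def resume_download(url, filename):
    # Ask the server to skip the bytes we already have, if any.
    start = os.path.getsize(filename) if os.path.exists(filename) else 0
    req = urllib.request.Request(url)
    if start > 0:
        req.add_header('Range', 'bytes=%d-' % start)
    # A more robust version would check that resp.status is 206 before appending.
    with urllib.request.urlopen(req, timeout=30) as resp, open(filename, 'ab') as out:
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)

resume_download("http://my.site.com/myFile", "myFile")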

Related

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using:
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list: it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest. Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance.
OK, I have found an answer.
A possible culprit was that the script was getting stuck while connecting to or downloading from a URL, so I added a socket timeout to limit the time it can spend connecting and downloading an image.
And now the issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # filename generation from global variables and rand stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue.
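Note that socket.setdefaulttimeout changes the timeout for every new socket in the process. If you would rather keep the timeout local to each download, a per-request timeout with urlopen is an alternative; this is only a sketch of that idea, not part of the answer above:

import shutil
import urllib.request

def fetch(url, local, timeout=20):
    # The timeout applies to this request only, not to the whole process.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(local, 'wb') as out:
        shutil.copyfileobj(resp, out)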
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the IO process; the one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
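For what it's worth, since this workload is I/O-bound rather than CPU-bound, a thread pool from concurrent.futures is another common way to parallelize such downloads. A rough, self-contained sketch (get_image is a simplified stand-in for the getImages function in the question, and urls.txt is the same input file):

import concurrent.futures
import socket
import urllib.request

socket.setdefaulttimeout(20)              # same timeout trick as in the accepted answer

def get_image(url):
    # Derive a local file name from the URL; the question builds it differently.
    local = url.split('/')[-1] or 'index'
    try:
        urllib.request.urlretrieve(url, local)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD -", url, e)
        return 0

if __name__ == '__main__':
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(get_image, urls))
    print(results)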

Lambda Python Pool.map and urllib2.urlopen : Retry only failing processes, log only errors

I have an AWS Lambda function which calls a set of URLs using pool.map. The problem is that if one of the URLs returns anything other than a 200, the Lambda function fails and immediately retries. The problem is that it immediately retries the ENTIRE Lambda function. I'd like it to retry only the failed URLs, and if (after a second try) they still fail, call a fixed URL to log an error.
This is the code as it currently sits (with some details removed); it works only when all of the URLs succeed:
from __future__ import print_function
import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls);

    # close the pool and wait for the work to finish
    pool.close();
    return pool.join();
I tried reading the official documentation, but it seems to be a bit lacking in its explanation of the map function, specifically of its return values.
Using the urlopen documentation, I've tried modifying my code to the following:
from __future__ import print_function
import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    try:
        results = pool.map(urllib2.urlopen, urls);
    except URLError:
        try: # try once more before logging error
            urllib2.urlopen(URLError.url); # TODO: figure out which URL errored
        except URLError: # log error
            urllib2.urlopen("https://example.com/error/?url="+URLError.url);

    # close the pool and wait for the work to finish
    pool.close();
    return true; # always return true so we never duplicate successful calls
I'm not sure if I'm handling the exceptions the right way, or even if my Python exception syntax is correct. Again, my goal is for it to retry only the failed URLs, and if (after a second try) they still fail, call a fixed URL to log an error.
I figured out the answer thanks to a "lower-level" look at this question I posted here.
The answer was to create my own custom wrapper around the urllib2.urlopen function, since each thread's call needed its own try/except rather than wrapping the whole pool.map call. That function looked like this:
def my_urlopen(url):
    try:
        return urllib2.urlopen(url)
    except urllib2.URLError:
        urllib2.urlopen("https://example.com/log_error/?url=" + url)
        return None
I put that above the def lambda_handler function declaration, and then I could replace the whole try/except inside the handler, from this:
try:
    results = pool.map(urllib2.urlopen, urls);
except URLError:
    try: # try once more before logging error
        urllib2.urlopen(URLError.url);
    except URLError: # log error
        urllib2.urlopen("https://example.com/error/?url="+URLError.url);
To this:
results = pool.map(my_urlopen, urls);
Q.E.D.
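Since the stated goal was to retry each failing URL once more before logging it, the wrapper can be extended slightly. A sketch under that assumption (the example.com endpoints are the same placeholders used in the question):

import urllib2

def my_urlopen(url, retries=1):
    for attempt in range(retries + 1):
        try:
            return urllib2.urlopen(url)
        except urllib2.URLError:
            pass                          # swallow the error and try again
    # Still failing after the extra attempt: log it and move on.
    urllib2.urlopen("https://example.com/error/?url=" + url)
    return None

# inside lambda_handler, as before:
# results = pool.map(my_urlopen, urls)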

DeadLink exception from Python2 to Python3

I found this code, written in Python 2.7, to skip a dead link while reading a list of URLs and retrieving their content:
for i in xrange(len(lines)):
    try:
        t = urllib2.urlopen(urllib2.Request(lines[i]))
        deadlinkfound = False
    except:
        deadlinkfound = True
    if not deadlinkfound:
        urllib.urlretrieve(lines[i], "Images/imag" + "-%s" % i)
It worked fine in Python 2 but I can't find the equivalent in Python 3 because of the urllib2 merge.
You can do the exact same thing with urllib.request here. Don't catch every conceivable exception; only catch what is reasonably going to be thrown:
from urllib import request, error
from http.client import HTTPException

for i, url in enumerate(lines):
    try:
        t = request.urlopen(request.Request(url, method='HEAD'))
    except (HTTPException, error.HTTPError):
        continue
    request.urlretrieve(url, 'Images/imag-{}'.format(i))
This code does the same, but more efficiently: the HEAD request only checks that the URL is reachable, so the body is downloaded just once, by urlretrieve.
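A small usage sketch around that loop, assuming the URLs live one per line in a text file (urls.txt is a made-up name) and that the Images directory may not exist yet:

import os
from urllib import request, error
from http.client import HTTPException

os.makedirs('Images', exist_ok=True)          # urlretrieve will not create the directory
with open('urls.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

for i, url in enumerate(lines):
    try:
        request.urlopen(request.Request(url, method='HEAD'))
    except (HTTPException, error.HTTPError):
        continue
    request.urlretrieve(url, 'Images/imag-{}'.format(i))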

How to auto-reconnect on IOError in Python

I'm fetching links using Python, and suddenly I lost the connection and got the error below:
IOError: [Errno socket error] [Errno 110] Connection timed out
How can I reconnect to the same link?
For instance:
import urllib
a = 'http://anzaholyman.files.wordpress.com/2011/12/zip-it.gif'
image = urllib.URLopener()
image.retrieve(a,'1.jpg')
You can simply use the try..except syntax:
import urllib

a = 'http://anzaholyman.files.wordpress.com/2011/12/zip-it.gif'
image = urllib.URLopener()

while True:
    try:
        image.retrieve(a, '1.jpg')
        break
    except IOError:
        pass
If there is an actual problem downloading your image, then a simple while loop will never end and your application will appear to hang.
To prevent this, I usually use a counter:
tries = 5
while tries:
    try:
        image.retrieve(a, '1.jpg')
        break
    except IOError:
        tries -= 1  # and maybe with a 0.1-1 second rest here
else:
    warn_or_raise_something
To ride out transient problems, I also sometimes add a delay (time.sleep) between successive tries after failed calls.
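Putting the counter and the delay together, a combined version might look like this (five tries and a one-second pause are arbitrary choices, not part of the original answer):

import time
import urllib

a = 'http://anzaholyman.files.wordpress.com/2011/12/zip-it.gif'
image = urllib.URLopener()

tries = 5
while tries:
    try:
        image.retrieve(a, '1.jpg')
        break
    except IOError:
        tries -= 1
        time.sleep(1)                 # short rest before the next attempt
else:
    raise IOError("could not download %s after several attempts" % a)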

Argument is URL or path

What is the standard practice in Python when I have a command-line application taking one argument which is either
a URL to a web page
or
a path to an HTML file somewhere on disk
(only one of the two)?
Is the following code sufficient?
if "http://" in sys.argv[1]:
print "URL"
else:
print "path to file"
import urlparse

def is_url(url):
    return urlparse.urlparse(url).scheme != ""

is_url(sys.argv[1])
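For example (repeating the function above for completeness):

import urlparse

def is_url(url):
    return urlparse.urlparse(url).scheme != ""

print is_url("http://example.com/page.html")   # True
print is_url("/home/user/page.html")           # False

Note that a Windows drive path such as C:\page.html may be parsed as having scheme 'c', so the check is not bulletproof.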
Depends on what the program must do. If it just prints whether it got a URL, sys.argv[1].startswith('http://') might do. If you must actually use the URL for something useful, do
from urllib2 import urlopen

try:
    f = urlopen(sys.argv[1])
except ValueError:  # invalid URL
    f = open(sys.argv[1])
Larsmans' answer might work, but it doesn't check whether the user actually specified an argument.
import urllib
import sys

try:
    arg = sys.argv[1]
except IndexError:
    print "Usage: " + sys.argv[0] + " file/URL"
    sys.exit(1)

try:
    site = urllib.urlopen(arg)
except ValueError:
    file = open(arg)
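A sketch that combines the argument check above with the urlopen/ValueError fallback from the earlier answer (the final print is just a placeholder for whatever the program actually does):

import sys
from urllib2 import urlopen

try:
    arg = sys.argv[1]
except IndexError:
    print "Usage: " + sys.argv[0] + " file/URL"
    sys.exit(1)

try:
    f = urlopen(arg)          # succeeds when arg is a URL urllib2 understands
except ValueError:            # "unknown url type": treat arg as a local file path
    f = open(arg)

print f.read()[:200]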
