I found this code written in Python 2.7 to skip dead links while reading a list of URLs and retrieving their content:
for i in xrange(len(lines)):
    try:
        t = urllib2.urlopen(urllib2.Request(lines[i]))
        deadlinkfound = False
    except:
        deadlinkfound = True
    if not deadlinkfound:
        urllib.urlretrieve(lines[i], "Images/imag" + "-%s" % i)
It worked fine in Python 2, but I can't find the equivalent in Python 3 because of the urllib2 merge into urllib.
You can do the exact same thing with urllib.request here. Don't catch every conceivable exception; only catch what is reasonably going to be thrown:
from urllib import request, error
from http.client import HTTPException

for i, url in enumerate(lines):
    try:
        t = request.urlopen(request.Request(url, method='HEAD'))
    except (HTTPException, error.HTTPError):
        continue
    request.urlretrieve(url, 'Images/imag-{}'.format(i))
This does the same work, but more efficiently: the HEAD request checks each link without downloading its body, and enumerate() removes the need for manual indexing.
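If you also want to skip links whose host is unreachable, not just those that answer with an HTTP error status, a variant that additionally catches error.URLError could look like this (a sketch; it still assumes lines holds the URLs and that the Images/ directory exists):

from urllib import request, error
from http.client import HTTPException

for i, url in enumerate(lines):
    try:
        # HEAD request: checks the link without downloading the body
        request.urlopen(request.Request(url, method='HEAD'))
    except (HTTPException, error.URLError):
        # error.HTTPError is a subclass of error.URLError, so both are skipped here
        continue
    request.urlretrieve(url, 'Images/imag-{}'.format(i))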
I'm currently trying to code a Maltego transform to search through the Breach Compilation data, using query.sh.
Code:
#!/usr/bin/env python
# Maltego transform for grabbing Breach Compilation results locally.
from MaltegoTransform import *
import sys
import os
import subprocess

email = sys.argv[1]
mt = MaltegoTransform()
try:
    dataleak = subprocess.check_output("LOCATION OF QUERY.SH" + email, shell=True).splitlines()
    for info in dataleak:
        mt.addEntity('maltego.Phrase', info)
    else:
        mt.addUIMessage("")
except Exception as e:
    mt.addUIMessage(str(e))
mt.returnOutput()
When I run it, I get this error:
Empty value for #org.simpleframework.xml.Text(data=false, required=true, empty=) on method 'value' in class com.paterva.maltego.transform.protocol.v2api.messaging.TransformResponse$Notification
at line 26
Not sure what the problem is.
Turns out this bit of code was the problem:
else:
    mt.addUIMessage("")
except Exception as e:
    mt.addUIMessage(str(e))
The mt.addUIMessage("") call in the else branch sends an empty notification, which is what produced the "Empty value" error. Removing it got the transform working just fine.
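For reference, a minimal sketch of the working transform with that for/else block removed (the query.sh location is still a placeholder, and the extra space before the email argument is my assumption about how the shell command is meant to be built):

#!/usr/bin/env python
# Maltego transform for grabbing Breach Compilation results locally.
from MaltegoTransform import *
import subprocess
import sys

email = sys.argv[1]
mt = MaltegoTransform()
try:
    # "LOCATION OF QUERY.SH" is a placeholder for the script's real path
    dataleak = subprocess.check_output("LOCATION OF QUERY.SH " + email,
                                       shell=True).splitlines()
    for info in dataleak:
        mt.addEntity('maltego.Phrase', info)
except Exception as e:
    mt.addUIMessage(str(e))
mt.returnOutput()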
I'm working on a simple script that downloads a file over HTTP using urllib and urllib.request. Everything works well, except that I would like to be able to handle the network problems that could happen:
Checking whether the computer is online (connected to the internet), and proceeding only if it is.
Restarting the download of the file if the connection is lost or becomes too bad while it is in progress.
I would like, if possible, to use as few packages as possible.
Here is my current code:
import urllib
import urllib.request
url = "http://my.site.com/myFile"
urlSplited = url.split('/')[-1];
print ("Downloading : "+urlSplited)
urllib.request.urlretrieve(url, urlSplited)
To check whether a connection is established, I believe I can do something like this:
while connection() is true:
    Download()
But that would run the download many times over.
I'm working on Linux.
I suggest you use a combination of try, while and the sleep function, like this:
import urllib
import urllib.request
import socket
import time

url = "http://my.site.com/myFile"
urlSplited = url.split('/')[-1];

# urlretrieve() has no timeout parameter, so set a global socket timeout instead
socket.setdefaulttimeout(100)

try_again = True
print ("Downloading : "+urlSplited)
while try_again:
    try:
        urllib.request.urlretrieve(url, urlSplited)
        try_again = False
    except Exception as e:
        print(e)
        time.sleep(600)
Replace while with if, then everything under it will run only once.
if connection() == True:
    Download()
Also, the connection() function could be something like this:
def connection(url):
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except Exception:
        return False
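Putting the two pieces together, a rough sketch (reusing the url and urlSplited names from the question, and the 600-second wait from the answer above; adjust both to taste):

import urllib.request
import time

url = "http://my.site.com/myFile"
urlSplited = url.split('/')[-1]

def connection(test_url):
    # True if the server answers within 5 seconds
    try:
        urllib.request.urlopen(test_url, timeout=5)
        return True
    except Exception:
        return False

done = False
while not done:
    if connection(url):
        try:
            urllib.request.urlretrieve(url, urlSplited)
            done = True
        except Exception as e:
            print(e)
    if not done:
        time.sleep(600)  # wait before checking the connection again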
I have an AWS Lambda function which calls a set of URLs using pool.map. The problem is that if one of the URLs returns anything other than a 200, the Lambda function fails and immediately retries, and it retries the ENTIRE Lambda function. I'd like it to retry only the failed URLs, and if (after a second try) they still fail, call a fixed URL to log an error.
This is the code as it currently sits (with some details removed), working only when all of the URLs are good:
from __future__ import print_function
import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls);

    #close the pool and wait for the work to finish
    pool.close();
    return pool.join();
I tried reading the official documentation, but it seems to be lacking when it comes to explaining the map function, specifically its return values.
Using the urlopen documentation I've tried modifying my code to the following:
from __future__ import print_function
import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    try:
        results = pool.map(urllib2.urlopen, urls);
    except URLError:
        try: # try once more before logging error
            urllib2.urlopen(URLError.url); # TODO: figure out which URL errored
        except URLError: # log error
            urllib2.urlopen("https://example.com/error/?url="+URLError.url);

    #close the pool and wait for the work to finish
    pool.close();
    return True; # always return True so we never duplicate successful calls
I'm not sure whether handling exceptions that way is correct, or even whether my Python exception syntax is right. Again, my goal is to retry only the failed URLs and, if they still fail after a second try, to call a fixed URL to log an error.
I figured out the answer thanks to a "lower-level" look at this question I posted here.
The answer was to create my own custom wrapper around the urllib2.urlopen function, since each thread needed its own try/except rather than wrapping the whole pool.map call. That function looked like this:
def my_urlopen(url):
    try:
        return urllib2.urlopen(url)
    except urllib2.URLError:
        # log the failing URL, then signal failure with None
        urllib2.urlopen("https://example.com/log_error/?url="+url)
        return None
I put that above the def lambda_handler function declaration; then I could replace the whole try/except inside it, from this:
try:
    results = pool.map(urllib2.urlopen, urls);
except URLError:
    try: # try once more before logging error
        urllib2.urlopen(URLError.url);
    except URLError: # log error
        urllib2.urlopen("https://example.com/error/?url="+URLError.url);
To this:
results = pool.map(my_urlopen, urls);
Q.E.D.
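The wrapper above logs the error on the first failure. If you also want the single retry described in the question before logging, a variant along those lines (still just a sketch, using the /error/ endpoint from the question) could be:

def my_urlopen(url):
    try:
        return urllib2.urlopen(url)
    except urllib2.URLError:
        try:
            # one more attempt before reporting the failure
            return urllib2.urlopen(url)
        except urllib2.URLError:
            # still failing after the retry: hit the fixed logging URL
            urllib2.urlopen("https://example.com/error/?url=" + url)
            return None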
What is the standard practice in Python when I have a command-line application taking one argument which is either
a URL to a web page
or
a path to an HTML file somewhere on disk
(only one of the two)?
Is code like this sufficient?
if "http://" in sys.argv[1]:
print "URL"
else:
print "path to file"
import urlparse

def is_url(url):
    return urlparse.urlparse(url).scheme != ""

is_url(sys.argv[1])
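On Python 3 the same check would use urllib.parse instead of urlparse (just a note for completeness, since this thread mixes Python 2 and 3):

from urllib.parse import urlparse

def is_url(url):
    # Any non-empty scheme (http, https, file, ...) counts as a URL here
    return urlparse(url).scheme != ""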
Depends on what the program must do. If it just prints whether it got a URL, sys.argv[1].startswith('http://') might do. If you must actually use the URL for something useful, do
from urllib2 import urlopen

try:
    f = urlopen(sys.argv[1])
except ValueError: # invalid URL
    f = open(sys.argv[1])
larsmans' answer might work, but it doesn't check whether the user actually specified an argument.
import urllib
import sys

try:
    arg = sys.argv[1]
except IndexError:
    print "Usage: "+sys.argv[0]+" file/URL"
    sys.exit(1)

try:
    site = urllib.urlopen(arg)
except ValueError:
    file = open(arg)
I'm using the mechanize module to execute some web queries from Python. I want my program to be error-resilient and handle all kinds of errors (wrong URLs, 403/404 responses) gracefully. However, I can't find in mechanize's documentation the errors / exceptions it throws for various errors.
I just call it with:
self.browser = mechanize.Browser()
self.browser.addheaders = [('User-agent', browser_header)]
self.browser.open(query_url)
self.result_page = self.browser.response().read()
How can I know what errors / exceptions can be thrown here, and how do I handle them?
$ perl -0777 -ne'print qq($1) if /__all__ = \[(.*?)\]/s' __init__.py | grep Error
'BrowserStateError',
'ContentTooShortError',
'FormNotFoundError',
'GopherError',
'HTTPDefaultErrorHandler',
'HTTPError',
'HTTPErrorProcessor',
'LinkNotFoundError',
'LoadError',
'ParseError',
'RobotExclusionError',
'URLError',
Or:
>>> import mechanize
>>> filter(lambda s: "Error" in s, dir(mechanize))
['BrowserStateError', 'ContentTooShortError', 'FormNotFoundError', 'GopherError', 'HTTPDefaultErrorHandler', 'HTTPError', 'HTTPErrorProcessor', 'LinkNotFoundError', 'LoadError', 'ParseError', 'RobotExclusionError', 'URLError']
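Since all of those names are exported by the package, you can catch them directly from mechanize. A rough sketch, with a placeholder URL and only the two most common cases handled:

import mechanize

br = mechanize.Browser()
try:
    br.open("http://www.example.org/some-page")
    page = br.response().read()
except mechanize.HTTPError as e:
    # The server responded, but with an error status (403, 404, ...)
    print("HTTP error:", e.code)
except mechanize.URLError as e:
    # No response at all: bad host name, refused connection, etc.
    print("Network error:", e.reason)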
While this was posted a long time ago, I think there is still a need to answer the question correctly, since it comes up in Google's search results for this very question.
As I write this, mechanize (version = (0, 1, 11, None, None)) on Python 2.6.5 raises urllib2.HTTPError, so the HTTP status is available by catching this exception, e.g.:
>>> import urllib2
>>> try:
...     br.open("http://www.example.org/invalid-page")
... except urllib2.HTTPError, e:
...     print e.code
...
404
I found this in their docs:
One final thing to note is that there are some catch-all bare except: statements in the module, which are there to handle unexpected bad input without crashing your program. If this happens, it's a bug in mechanize, so please mail me the warning text.
So I guess they don't raise any exceptions. You can also search the source code for Exception subclasses and see how they are used.