How to continue to the next link when urllib raises a 404 error - Python

I'm trying to download images from a CSV with a lot of links. It works fine until some link is broken (urllib.error.HTTPError: HTTP Error 404: Not Found).
import pandas as pd
import urllib.request
import urllib.error

opener = urllib.request.build_opener()

def url_to_jpg(i, url, file_path):
    filename = "image-{}".format(i)
    full_path = "{}{}".format(file_path, filename)
    opener.addheaders = [('User-Agent', 'Chrome/5.0')]
    urllib.request.install_opener(opener)
    urllib.request.urlretrieve(url, full_path)
    print("{} Saved".format(filename))
    return None

filename = "listado.csv"
file_path = "/Users/marcelomorelli/Downloads/tapas/imagenes"

urls = pd.read_csv(filename)
for i, url in enumerate(urls.values):
    url_to_jpg(i, url[0], file_path)
Any idea how I can make Python continue to the next link in the list every time it gets that error? Thanks!

You can use a try/except pattern and ignore the errors.
Your code would look like this:
for i, url in enumerate(urls.values):
    try:
        url_to_jpg(i, url[0], file_path)
    except Exception as e:
        print(f"Failed due to: {e}")
Reference: https://docs.python.org/3/tutorial/errors.html
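If you would rather skip only broken links and still see unexpected failures, a narrower variant (my own sketch, reusing the url_to_jpg function and urls DataFrame from the question) could catch just urllib.error.HTTPError:
import urllib.error

for i, url in enumerate(urls.values):
    try:
        url_to_jpg(i, url[0], file_path)
    except urllib.error.HTTPError as e:
        # A broken link (e.g. 404): report it and continue with the next URL
        print("Skipping {}: {}".format(url[0], e))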

Related

Python Short Url expander

I have a problem with expanding short URLs, since not all of the ones I work with use the same redirection.
The idea is to expand shortened URLs. Here are a few examples of short URL --> final URL; I need a function that takes the shortened URL and returns the expanded URL:
http://chollo.to/675za --> http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094&dclid=COvjy8Xrz9UCFeMi0wod4ZULuw
So far I have something semi-working, but it fails on some of the above examples:
import requests
import httplib
import urlparse

def unshorten_url(url):
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', parsed.path)
        response = h.getresponse()
        if response.status / 100 == 3 and response.getheader('Location'):
            url = requests.get(response.getheader('Location')).url
            print url
            return url
        else:
            url = requests.get(url).url
            print url
            return url
    except Exception as e:
        print(e)
The expected redirect does not appear to be well-formed according to requests:
import requests

response = requests.get('http://chollo.to/675za')
for resp in response.history:
    print(resp.status_code, resp.url)

print(response.url)
print(response.is_redirect)
Output:
301 http://chollo.to/675za
http://web.epartner.es/click.asp?ref=754218&site=14010&type=text&tnb=39&diurl=https%3A%2F%2Fad.doubleclick.net%2Fddm%2Fclk%2F302111021%3B129203261%3By%3Fhttp%3A%2F%2Fwww.elcorteingles.es%2Flimite-48-horas%2Fequipaje%2F%3Fsorting%3DpriceAsc%26aff_id%3D2118094
False
This is likely intentional on the part of epartner or DoubleClick. For these types of nested URLs you would need an extra step, like:
from urllib.parse import unquote
# from urllib import unquote # python2
# if response.url.count('http') > 1:
url = 'http' + response.url.split('http')[-1]
unquote(url)
# http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094
Note: by doing this you might be avoiding intended ad revenues.
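Putting the two steps together, a rough sketch (my own combination of the snippets above, assuming the embedded target is always the last percent-encoded http fragment, as in this example) could look like:
import requests
from urllib.parse import unquote

def expand_nested_url(short_url):
    # Follow the ordinary redirects first
    response = requests.get(short_url)
    final_url = response.url
    # If the tracking URL embeds another URL, take the last 'http' chunk and decode it
    if final_url.count('http') > 1:
        final_url = unquote('http' + final_url.split('http')[-1])
    return final_url

print(expand_nested_url('http://chollo.to/675za'))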

Python - Page Source when calling a URL

I'm looking for really simple code to call a URL and print the HTML source code. This is what I am using; I'm following an online course which has the code:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except:
        return ""

print(get_page('https://www.yahoo.com/'))
It prints nothing, but also raises no errors. Alternatively, from browsing these forums, I've tried:
from urllib.request import urlopen
print (urlopen('https://xkcd.com/353/'))
When I do this, I get:
<http.client.HTTPResponse object at 0x000001E947559710>
To get the page source, call read() on the response and decode the bytes:
from urllib.request import urlopen
print(urlopen('https://xkcd.com/353/').read().decode())
This assumes UTF-8 encoding was used. As a function:
from urllib import request

def get_src_code(url):
    r = request.urlopen(url)
    byte_code = r.read()
    src_code = byte_code.decode()
    return src_code
Your code prints the empty string returned by the except block. It is raising an error because there is no attribute called open in the urllib module, but you can't see that error because your try/except block returns an empty string on every failure. You can surface the error like this:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except Exception as e:
        return e.args[0]
To get your expected output, do it like this:
def get_page(url):
    try:
        from urllib.request import urlopen
        return urlopen(url).read().decode('utf-8')
    except Exception as e:
        return e.args[0]
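Rather than hard-coding UTF-8, a small sketch (my own addition, not part of the answers above) can use the charset the server declares in its Content-Type header and fall back to UTF-8 when none is given:
from urllib.request import urlopen

def get_page(url):
    with urlopen(url) as resp:
        # Prefer the charset declared by the server, defaulting to UTF-8
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset)

print(get_page('https://xkcd.com/353/'))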

Handling bad URLs with requests

Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

try: uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()

text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try catching the connection error:
from requests.exceptions import ConnectionError

try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys

url = 'https://httpbinTYPO.org/' # Note the typo in my URL

try: uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
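For completeness, here is a sketch of how the same check might look in Python 3 with requests, catching the library's base RequestException so both unreachable hosts and HTTP error statuses are handled (this combines the ideas above rather than quoting either answer):
import requests
import sys

url = 'https://httpbinTYPO.org/'  # Note the typo in the URL

try:
    r = requests.get(url)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
except requests.exceptions.RequestException as e:
    print('Failed to open url: {}'.format(e))
    sys.exit()

print(r.text)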

Unshorten Flic.kr URLs

I have a Python script that unshortens URLs, based on the answer posted here. So far it has worked pretty well, e.g., with youtu.be, goo.gl, t.co, bit.ly, and tinyurl.com. But now I've noticed that it doesn't work for Flickr's own URL shortener, flic.kr.
For example, when I enter the URL
https://flic.kr/p/qf3mGd
into a browser, I get redirected correctly to
https://www.flickr.com/photos/106783633#N02/15911453212/
However, when I use the Python script to unshorten the same URL, I get the following redirects:
https://flic.kr/p/qf3mgd
http://www.flickr.com/photo.gne?short=qf3mgd
http://www.flickr.com/signin/?acf=%2Fphoto.gne%3Fshort%3Dqf3mgd
https://login.yahoo.com/config/login?.src=flickrsignin&.pc=8190&.scrumb=[...]
thus eventually ending up on the Yahoo login page. Unshort.me, by the way, can unshorten the URL correctly. What am I missing here?
Here is the full source code of my script. I stumbled upon some pathological cases with the original script:
import urlparse
import httplib

def unshorten_url(url, max_tries=10):
    return __unshorten_url(url, [], max_tries)

def __unshorten_url(url, check_urls, max_tries):
    if max_tries == 0:
        if len(check_urls) > 0:
            return check_urls[0]
        return url
    if url in check_urls:
        return url
    unshortended = ''
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', url)
    except:
        return None
    try:
        response = h.getresponse()
    except:
        return url
    if response.status/100 == 3 and response.getheader('Location'):
        unshortended = response.getheader('Location')
    else:
        return url
    #print max_tries, unshortended
    if unshortended != url:
        if 'http' not in unshortended:
            return url
        check_urls.append(url)
        return __unshorten_url(unshortended, check_urls, (max_tries-1))
    else:
        return unshortended

print unshorten_url('http://t.co/5skmePb7gp')
EDIT: Full working example with a t.co URL
I'm using Requests [0] rather than httplib in this way, and it works fine with URLs like https://flic.kr/p/qf3mGd:
>>> import requests
>>> requests.head("https://flic.kr/p/qf3mGd", allow_redirects=True, verify=False).url
u'https://www.flickr.com/photos/106783633#N02/15911453212/'
[0] http://docs.python-requests.org/en/latest/
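Wrapped into a small helper (the timeout and the exception handling are my own additions, not part of the answer above), that approach might look like:
import requests

def unshorten(url):
    try:
        # Follow every redirect and return the final URL
        return requests.head(url, allow_redirects=True, timeout=10).url
    except requests.exceptions.RequestException:
        # If the request fails, fall back to the original URL
        return url

print(unshorten("https://flic.kr/p/qf3mGd"))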

Verify URL exists from file

So I have some code that I use to scrape through my mailbox looking for certain URLs. Once this is completed, it creates a file called links.txt.
I want to run a script against that file to get an output of all the URLs in that list that are currently live. The script I have only allows me to check one URL at a time:
import urllib2

for url in ["www.google.com"]:
    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
Use requests:
import requests

with open(filename) as f:
    good_links = []
    for link in f:
        try:
            r = requests.get(link.strip())
        except Exception:
            continue
        good_links.append(r.url)  # resolves redirects
You can also consider extracting the call to requests.get into a helper function:
def make_request(method, url, **kwargs):
    for i in range(10):
        try:
            r = requests.request(method, url, **kwargs)
            return r
        except requests.ConnectionError as e:
            print e.message
        except requests.HTTPError as e:
            print e.message
        except requests.RequestException as e:
            print e.message
    raise Exception("requests did not succeed")
It is trivial to make this change, given that you're already iterating over a list of URLs:
import urllib2

for url in open("urllist.txt"): # change 1
    try:
        connection = urllib2.urlopen(url.rstrip()) # change 2
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
Iterating over a file returns the lines of the file (complete with line endings). We use rstrip() on the URL to strip off the line endings.
There are other improvements you can make. For example, some will suggest you use with to make sure your file is closed. This is good practice but probably not necessary in this script.
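Pulling the pieces together, a minimal sketch with requests (assuming the file is called links.txt as in the question, and treating any status code below 400 as "live") might be:
import requests

live_links = []
with open("links.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable URL: skip it
        if r.status_code < 400:
            live_links.append(r.url)  # final URL after redirects

print("\n".join(live_links))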
