How to bypass missing link and continue to scrape good data?
I am using Python2 and Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associated links
Web page n+
more part descriptions
I tried:
try:
    # Do some things.
    # Error caused by missing link.
except IndexError as e:
    print "I/O error({0}): {1}".format(e.errno, e.strerror)
    break  # to go on to next link.
# Did not work because program stopped to report error!
Since the link is simply absent from the web page, I cannot test for it with a plain if statement.
Thanks again for your help!!!
I corrected my faulty except clause by following the Python 2 documentation. With the correction, the program skips the missing link and continues scraping data.
Except correction:
except:
    # catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
    e = sys.exc_info()[0]
    print "Error: %s" % e
    break
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib

def get_content_safe(url):
    try:
        contents = urllib.urlopen(url)
        return contents
    except IOError, ex:
        # Report ex your way
        return None

def scrape():
    # ....
    content = get_content_safe(url)
    if content is None:
        pass  # or continue or whatever
    # ....
Long story short, just like Basilevs said: when you catch the exception, your code will not break and will keep executing.
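That control flow can be sketched end to end (the URLs and the `fetch` helper below are hypothetical stand-ins, not the asker's real code):

```python
def fetch(url):
    # Hypothetical stand-in for a real download; fails for one input.
    if "missing" in url:
        raise IOError("not found: %s" % url)
    return "content of %s" % url

urls = ["http://example.com/a", "http://example.com/missing", "http://example.com/b"]
results = []
for url in urls:
    try:
        results.append(fetch(url))
    except IOError as e:
        print("skipping %s: %s" % (url, e))
        continue  # move on to the next link

print(len(results))  # 2 -- the good links survive the bad one
```

The failing link is reported and skipped; the loop finishes with the two good results.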
Related
I'm trying to suppress the AttributeError, so that the script does not stop with the error message and just continues. The script finds and prints the office_manager name, but on some occasions there is no manager listed, so I need it to just ignore those occasions. Can anyone help?
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        continue
    finally:
        print("none")
Since the error comes from .find, that call is the one that should be inside the try/except, or even better, structure it like this:
try:
    office_manager = soup.find(text="Office_Manager").findPrevious('h4')
except AttributeError as err:
    print(err)  # or print("none")
    pass  # or return or continue
else:
    for title in office_manager:
        print(title)
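An alternative that avoids try/except entirely is to check the intermediate result for None before chaining the second call; a minimal sketch against made-up markup (the HTML below is hypothetical, not the asker's real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an h4 with the manager's name preceding the marker text.
html = "<h4>Alice</h4><p>Office_Manager</p>"
soup = BeautifulSoup(html, "html.parser")

marker = soup.find(text="Office_Manager")
manager = marker.findPrevious("h4") if marker is not None else None
print(manager.text if manager is not None else "none")  # Alice
```

When the marker text is absent, `soup.find` returns None and the script prints "none" instead of raising AttributeError.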
With bs4 4.7.1 you can use :contains, :has and :not. The following prints the directors' names (if there are no directors you will get an empty list):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
I thought someone less lazy than me would convert my comment to an answer, but since no one did, here you go:
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        pass
    finally:
        print("none")
Using pass will skip the entry instead.
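One caveat, sketched with hypothetical stand-in values: `finally` runs on every iteration, success or failure, so the print("none") above fires even when a manager is found. If "none" should only appear for missing entries, it belongs in the `except` block:

```python
found = []
for office_manager in ["Alice", None, "Bob"]:  # None stands in for a missing manager
    try:
        found.append(office_manager.upper())  # None.upper() raises AttributeError
    except AttributeError:
        print("none")  # reported only for the missing entry
print(found)  # ['ALICE', 'BOB']
```

The two good entries are kept and "none" prints exactly once, for the missing one.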
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse html. DON'T - regexps cannot reliably parse html; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc.).
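A minimal sketch of the BeautifulSoup alternative, run against a made-up fragment of the page (the markup below is an assumption, not the real gutenberg.org structure):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page content.
html = '''
<h2><span class="mw-headline">Fiction</span></h2>
<ul><li><a href="/ebooks/1342">Pride and Prejudice</a></li></ul>
'''
soup = BeautifulSoup(html, "html.parser")
for link in soup.select("ul li a"):
    print(link["href"], link.text)  # /ebooks/1342 Pride and Prejudice
```

The parser copes with newlines and attribute order for free, which is exactly where the regexps break down.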
start_url is not defined in main()
You need to use a global variable; otherwise, a better (cleaner) approach is to pass in the variable you are using. Either way, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For why you shouldn't parse html with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
I am finding prices of products from Amazon using their API with Bottlenose and parsing the xml response with BeautifulSoup.
I have a predefined list of products that the code iterates through.
This is my code:
import bottlenose as BN
import lxml
from bs4 import BeautifulSoup

i = 0
amazon = BN.Amazon('myid', 'mysecretkey', 'myassoctag', Region='UK', MaxQPS=0.9)
list = open('list.txt', 'r')

print "Number", "New Price:", "Used Price:"
for line in list:
    i = i + 1
    listclean = line.strip()
    response = amazon.ItemLookup(ItemId=listclean, ResponseGroup="Large")
    soup = BeautifulSoup(response, "xml")
    usedprice = soup.LowestUsedPrice.Amount.string
    newprice = soup.LowestNewPrice.Amount.string
    print i, newprice, usedprice
This works fine and runs through my list of Amazon products until it gets to a product which doesn't have any value for that set of tags, e.g. no new/used price. At that point Python throws this error:
AttributeError: 'NoneType' object has no attribute 'Amount'
Which makes sense, as there are no tags/strings found by BS for what I searched. Having no value is perfectly fine for what I'm trying to achieve, but the code collapses at this point and will not continue.
I have tried:
if soup.LowestNewPrice.Amount != None:
    newprice = soup.LowestNewPrice.Amount.string
else:
    continue
and also tried:
newprice = 0
if soup.LowestNewPrice.Amount != 0:
    newprice = soup.LowestNewPrice.Amount.string
else:
    continue
I am at a loss for how to continue after receiving the nonetype value return. Unsure whether the problem lies fundamentally in the language or in the libraries I'm using.
You can use exception handling:
try:
    # operation which causes AttributeError
except AttributeError:
    continue
The code in the try block will be executed and, if an AttributeError is raised, execution immediately drops into the except block (which causes the next item in the loop to be run). If no error is raised, the code happily skips the except block.
If you just wish to set the missing values to zero and print, you can do
try:
    newprice = soup.LowestNewPrice.Amount.string
except AttributeError:
    newprice = 0

try:
    usedprice = soup.LowestUsedPrice.Amount.string
except AttributeError:
    usedprice = 0

print i, newprice, usedprice
The correct way of comparing with None is is None (or is not None), not == None (or != None). Secondly, you also need to check soup.LowestNewPrice itself for None, not the Amount, i.e.:
if soup.LowestNewPrice is not None:
    ...  # read soup.LowestNewPrice.Amount
I know how to download a file from the web using python, however I wish to handle cases where the file being requested does not exist. In which case, I want to print an error message ("404: File not found") and not write anything to disk. However, I still want to be able to continue executing the program (i.e. downloading other files in a list that may exist).
How do I do this? Below is some template code to download a file given its url (feel free to modify it if you believe there is a better way, but please keep it concise and simple).
import urllib
urllib.urlretrieve ("http://www.example.com/myfile.mp3", "myfile.mp3")
from urllib2 import URLError

try:
    # your file request code here
except URLError, e:
    if e.code == 404:
        # your appropriate code here
    else:
        # raise maybe?
I followed this guide, which has a specific section about handling exceptions, and found it really helpful.
import urllib, urllib2

try:
    urllib.urlretrieve("http://www.example.com/", "myfile.mp3")
except urllib2.URLError, e:
    if e.code == 404:
        print "4 0 4"
    else:
        print "%s" % e
This is what your code does: it tries to retrieve the web page at www.example.com and writes it to myfile.mp3. It never hits the exception because it is not looking for myfile.mp3 on the server; it simply writes whatever HTML it receives to myfile.mp3.
If you are looking for code to download files at a certain location on the web, try this
How do I download a zip file in python using urllib2?
Your code should look like this:
try:
    urllib.urlretrieve("http://www.example.com/myfile.mp3", "myfile.mp3")
except URLError, e:
    if e.code == 404:
        print 'file not found. moving on...'
        pass
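Worth noting: urllib.urlretrieve itself does not raise on a 404, it saves the error page, while urllib2.urlopen raises HTTPError (which carries the .code attribute). A sketch of that approach in Python 3 syntax, where urllib2's pieces moved to urllib.request and urllib.error (URL and path are placeholders):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def download(url, path):
    # Returns True on success, False on a 404; other HTTP errors propagate.
    try:
        data = urlopen(url).read()
    except HTTPError as e:
        if e.code == 404:
            print("404: File not found")
            return False
        raise
    with open(path, "wb") as f:
        f.write(data)
    return True
```

Because urlopen raises instead of silently saving the error page, the 404 case never writes anything to disk and the caller can move on to the next file in its list.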
I am trying to search through http://www.wegottickets.com/ with the keywords "Live music". But the returned result is still the main page, not the search results page with the live music information. Could anyone show me what the problem is?
from urllib2 import urlopen
from ClientForm import ParseResponse

response = urlopen("http://www.wegottickets.com/")
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
form.set_value("Live music", name="unified_query")
form.set_all_readonly(False)

control = form.find_control(type="submit")
print control.disabled
print control.readonly

#print form

request2 = form.click()
try:
    response2 = urlopen(request2)
except:
    print "Unsuccessful query"

print response2.geturl()
print response2.info()
print response.read()
response2.close()
Thank you very much!
I've never used ClientForm, but I've had success with the python mechanize module, if it turns out to be a fault in ClientForm.
However, as a first step, I'd suggest removing your try...except wrapper. What you're basically doing is saying "catch any error, then ignore the actual error and print 'Unsuccessful Query' instead". Not helpful for debugging. The exception will stop the program and print a useful error message, if you don't get in its way.
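If you do keep a catch-all while debugging, at least print the full traceback rather than a fixed message; a minimal sketch (the failing function here is a hypothetical stand-in for urlopen(request2)):

```python
import traceback

def submit_query():
    # Hypothetical stand-in that fails the way a bad form submission might.
    raise ValueError("bad form submission")

try:
    response2 = submit_query()
except Exception:
    print("Unsuccessful query")
    print(traceback.format_exc())  # the real error and where it happened
```

The traceback names the actual exception and line, which is what "Unsuccessful query" alone throws away.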