I'm trying to suppress the AttributeError so that it doesn't print an error message and the script just continues. The script finds and prints the office_manager name, but on some occasions there is no manager listed, so I need it to simply ignore those occasions. Can anyone help?
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        continue
    finally:
        print("none")
Since the error comes from .find, that call is what should be inside the try/except. Better yet, structure it like this:
try:
    office_manager = soup.find(text="Office_Manager").findPrevious('h4')
except AttributeError as err:
    print(err)  # or print("none")
    pass  # or return/continue, depending on context
else:
    for title in office_manager:
        print(title)
With bs4 4.7.1+ you can use the :contains, :has and :not CSS pseudo-selectors. The following prints the directors' names (if there are no directors you will get an empty list):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
I thought someone less lazy than me would convert my comment to an answer, but since no one did, here you go:
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        pass
    finally:
        print("none")
Using pass here skips the failing entry instead of aborting the loop.
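A minimal, self-contained sketch of the difference (the list and values are illustrative): in an except block inside a loop, pass falls through to the code after the try, while continue would jump straight to the next iteration.

```python
results = []
for item in [1, None, 3]:
    try:
        results.append(item + 1)   # raises TypeError for None
    except TypeError:
        pass                       # fall through; the line below still runs
    results.append("checked")      # with `continue` above, this would be skipped for None
print(results)                     # [2, 'checked', 'checked', 4, 'checked']
```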
The code below comes up with the error:
"if soup.find(text=bbb).parent.parent.get_text(strip=True
AttributeError: 'NoneType' object has no attribute 'parent'"
Any help would be appreciated, as I can't quite get it to run fully; Python only returns results up to the error. I need it to return empty if there is no item and move on. I tried putting an if statement in, but that doesn't work.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_e in soup.findAll('table', {'class': 'neither'}):
        Sold = item_e.get_text(strip=True)
        bbb = re.compile('First listed')
        try:
            next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
        except:
            pass
        try:
            writer.writerow([Sold, next_s])
        except:
            pass

trade_spider(2)
Your exception comes from trying to access an attribute on None. You don't intend to do that, but because some earlier part of your expression turns out to be None where you expected something else, the later parts break.
Specifically, either soup.find(text=bbb) or soup.find(text=bbb).parent is None (probably the former, since find returns None when it doesn't find a match).
There are a few ways you can write your code to address this issue. You could either try to detect that it's going to happen ahead of time (and do something else instead), or you can just go ahead and try the attribute lookup and react if it fails. These two approaches are often called "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).
Here's a bit of code using an LBYL approach that checks to make sure the values are not None before accessing their attributes:
val = soup.find(text=bbb)
if val and val.parent:  # assuming the non-None values are never falsey
    next_s = val.parent.parent.get_text(strip=True)
else:
    pass  # do something else here?
The EAFP approach is perhaps simpler, but there's some risk that it could catch other unexpected exceptions instead of the ones we expect (so be careful using this design approach during development):
try:
    next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
except AttributeError:  # try to catch the fewest exceptions possible (so you don't miss bugs)
    pass  # do something else here?
It's not obvious to me what your code should do in the "do something else here" sections in the code above. It might be that you can ignore the situation, but probably you'd need an alternative value for next_s to be used by later code. If there's no useful value to substitute, you might want to bail out of the function early instead (with a return statement).
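To make the EAFP version concrete without a live page, here is a runnable sketch using a minimal stand-in class in place of a real bs4 tag (the Node class and grandparent_text helper are illustrative, not part of bs4):

```python
class Node:
    """Minimal stand-in for a bs4 tag: has .parent and .get_text()."""
    def __init__(self, text=None, parent=None):
        self._text = text
        self.parent = parent

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text

def grandparent_text(node, default=None):
    """EAFP: attempt the full attribute chain; fall back to a default
    (or bail out with a return in your own function) if any link is None."""
    try:
        return node.parent.parent.get_text(strip=True)
    except AttributeError:
        return default

root = Node(text="  First listed: 2016  ")
mid = Node(parent=root)
leaf = Node(parent=mid)
print(grandparent_text(leaf))         # "First listed: 2016"
print(grandparent_text(None, "n/a"))  # "n/a" - chain broke at the first step
```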
I am finding prices of products from Amazon using their API with Bottlenose and parsing the xml response with BeautifulSoup.
I have a predefined list of products that the code iterates through.
This is my code:
import bottlenose as BN
import lxml
from bs4 import BeautifulSoup

i = 0
amazon = BN.Amazon('myid', 'mysecretkey', 'myassoctag', Region='UK', MaxQPS=0.9)
list = open('list.txt', 'r')
print "Number", "New Price:", "Used Price:"
for line in list:
    i = i + 1
    listclean = line.strip()
    response = amazon.ItemLookup(ItemId=listclean, ResponseGroup="Large")
    soup = BeautifulSoup(response, "xml")
    usedprice = soup.LowestUsedPrice.Amount.string
    newprice = soup.LowestNewPrice.Amount.string
    print i, newprice, usedprice
This works fine and runs through my list of Amazon products until it hits a product which doesn't have a value for that set of tags, e.g. no new/used price.
At which point Python throws this error:
AttributeError: 'NoneType' object has no attribute 'Amount'
Which makes sense, as BS found no tags/string matching what I searched for. Having no value is perfectly fine for what I'm trying to achieve, but the code collapses at this point and will not continue.
I have tried:
if soup.LowestNewPrice.Amount != None:
    newprice = soup.LowestNewPrice.Amount.string
else:
    continue
and also tried:
newprice = 0
if soup.LowestNewPrice.Amount != 0:
    newprice = soup.LowestNewPrice.Amount.string
else:
    continue
I am at a loss for how to continue after receiving the NoneType return value. I'm unsure whether the problem lies fundamentally in the language or in the libraries I'm using.
You can use exception handling:
try:
    # operation which causes AttributeError
except AttributeError:
    continue
The code in the try block will be executed, and if an AttributeError is raised, execution will immediately drop into the except block (which will cause the next item in the loop to be run). If no error is raised, the code will happily skip the except block.
If you just wish to set the missing values to zero and print, you can do
try: newprice = soup.LowestNewPrice.Amount.string
except AttributeError: newprice = 0

try: usedprice = soup.LowestUsedPrice.Amount.string
except AttributeError: usedprice = 0

print i, newprice, usedprice
The correct way to compare with None is is None (not == None) and is not None (not != None).
Secondly, you also need to check soup.LowestNewPrice for None, not the Amount, i.e.:
if soup.LowestNewPrice is not None:
    ...  # read soup.LowestNewPrice.Amount
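Putting both points together, here is a runnable sketch, using a bare None in place of the missing tag (the variable names are illustrative):

```python
price_tag = None  # what soup.LowestNewPrice evaluates to when the tag is missing

if price_tag is not None:   # identity check with `is`, not `!=`
    newprice = price_tag.Amount.string
else:
    newprice = 0            # substitute a default instead of crashing
print(newprice)             # 0
```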
How to bypass missing link and continue to scrape good data?
I am using Python 2 on Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associated links
Web page n+
more part descriptions
I tried:
try:
    # Do some things.
    # Error caused by missing link.
except IndexError as e:
    print "I/O error({0}): {1}".format(e.errno, e.strerror)
    break  # to go on to next link.
# Did not work because the program stopped to report an error!
Since the link is missing from the web page, I can not use an if-missing-link check.
Thanks again for your help!!!
I corrected my faulty except clause by following the Python 2 documentation. The corrected except skips the missing link on the faulty web page and continues scraping data.
The corrected except:
except:
    # catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
    e = sys.exc_info()[0]
    print "Error: %s" % e
    break
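The underlying problem with the first attempt is that errno and strerror exist on IOError/OSError but not on IndexError, so the format call itself raised an AttributeError. A small sketch of catching the right exception without touching attributes it doesn't have:

```python
caught = None
try:
    [][0]   # any missing-link lookup that raises IndexError
except IndexError as e:
    caught = e
    print("Index error: %s" % e)   # e has no .errno/.strerror, so don't use them
```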
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib
def get_content_safe(url):
    try:
        contents = urllib.urlopen(url)
        return contents
    except IOError, ex:
        # Report ex your way
        return None

def scrape():
    # ....
    content = get_content_safe(url)
    if content is None:
        pass  # or continue or whatever
    # ....
Long story short, just like Basilevs said, when you catch the exception your code will not break and will keep executing.
I have the following code that grabs images using urlretrieve, working... to a point.
def Opt3():
    global conn
    curs = conn.cursor()
    results = curs.execute("SELECT stock_code FROM COMPANY")
    for row in results:
        #for image_name in list_of_image_names:
        page = requests.get('url?prodid=' + row[0])
        tree = html.fromstring(page.text)
        pic = tree.xpath('//*[@id="bigImg0"]')
        #print pic[0].attrib['src']
        print 'URL' + pic[0].attrib['src']
        try:
            urllib.urlretrieve('URL'+pic[0].attrib['src'], 'images\\'+row[0]+'.jpg')
        except:
            pass
I am reading a CSV to input the image names. It works except when it hits an error/corrupt URL (where there is no image, I think). I was wondering if I could simply skip any corrupt URLs and get the code to continue grabbing images? Thanks
urllib has very poor support for error catching; urllib2 is a much better choice. The urlretrieve equivalent in urllib2 is:
resp = urllib2.urlopen(im_url)
with open(sav_name, 'wb') as f:
f.write(resp.read())
And the errors to catch are:
urllib2.URLError, urllib2.HTTPError, httplib.HTTPException
And you can also catch socket.error in case that the network is down.
Simply using except Exception is a bad idea: it'll catch every error in the above block, even your typos.
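A sketch of catching only those specific errors; to keep it runnable offline, the opener here is a plain callable standing in for urllib2.urlopen, and the helper name is illustrative:

```python
import socket

def fetch_safe(opener, url):
    """Catch only network-related errors, not every Exception."""
    try:
        return opener(url)
    except (IOError, socket.error) as e:  # in urllib2, HTTPError/URLError subclass IOError
        print("fetch failed: %s" % e)
        return None

def broken_opener(url):
    raise IOError("network is down")

print(fetch_safe(broken_opener, "http://example.com"))  # None
```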
Just use a try/except and continue if it fails
try:
    page = requests.get('url?prodid=' + row[0])
except Exception as e:
    print e
    continue  # continue to next row
Instead of pass, why don't you continue when an error occurs:
try:
    urllib.urlretrieve('URL'+pic[0].attrib['src'], 'images\\'+row[0]+'.jpg')
except Exception as e:
    continue
Is there a resume() function in Python? I need to apply it to my program. I searched a lot but couldn't find a proper explanation.
Here is my code where I need to place the resume function.
try:
    soup = BeautifulSoup(urllib2.urlopen(url))
    abc = soup.find('div', attrs={})
    link = abc.find('a')['href']
    # results is a dictionary
    results['Link'] = "http://{0}".format(link)
    print results
    #pause.minute(1)
    #time.sleep(10)
except Exception:
    print "socket error continuing the process"
    time.sleep(4)
    #pause.minute(1)
    #break
I applied pause, time.sleep and break, but am not getting the required result. If any error appears in the try block then I want to pause the program and resume. The try block is already inside a loop.
To resume the code in case of an exception, put it inside a loop:
import time
import urllib2
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

for _ in range(max_retries):
    try:
        r = urllib2.urlopen(url)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout)
    else:
        break
else:  # all max_retries attempts failed
    raise last_error

soup = BeautifulSoup(html, from_encoding=encoding)
# ...
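The same retry pattern can be wrapped into a reusable helper; max_retries and retry_timeout mirror the names in the snippet above (the values and the flaky test callable here are only examples):

```python
import time

def retry(func, max_retries=3, retry_timeout=0.01):
    """Call func() up to max_retries times, sleeping between failures."""
    for _ in range(max_retries):
        try:
            return func()
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout)
    raise last_error  # all attempts failed

# A callable that fails twice, then succeeds:
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("temporary failure")
    return "ok"

result = retry(flaky)
print(result)  # "ok" after two failed attempts
```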