I am finding prices of products from Amazon using their API with Bottlenose and parsing the xml response with BeautifulSoup.
I have a predefined list of products that the code iterates through.
This is my code:
import bottlenose as BN
import lxml
from bs4 import BeautifulSoup
i = 0
amazon = BN.Amazon('myid','mysecretkey','myassoctag',Region='UK',MaxQPS=0.9)
list = open('list.txt', 'r')
print "Number", "New Price:","Used Price:"
for line in list:
i = i + 1
listclean = line.strip()
response = amazon.ItemLookup(ItemId=listclean, ResponseGroup="Large")
soup = BeautifulSoup(response, "xml")
usedprice=soup.LowestUsedPrice.Amount.string
newprice=soup.LowestNewPrice.Amount.string
print i , newprice, usedprice
This works fine and will run through my list of amazon products until it gets to a product which doesn't have any value for that set of tags, like no new/used price.
At which Python will throw up this response:
AttributeError: 'NoneType' object has no attribute 'Amount'
Which makes sense as there is no tags/string found by BS that I searched for. Having no value is perfectly fine from what I'm trying to achieve, however the code collapses at this point and will not continue.
I have tried:
if soup.LowestNewPrice.Amount != None:
newprice=soup.LowestNewPrice.Amount.string
else:
continue
and also tried:
newprice=0
if soup.LowestNewPrice.Amount != 0:
newprice=soup.LowestNewPrice.Amount.string
else:
continue
I am at a loss for how to continue after receiving the nonetype value return. Unsure whether the problem lies fundamentally in the language or in the libraries I'm using.
You can use exception handling:
try:
# operation which causes AttributeError
except AttributeError:
continue
The code in the try block will be executed and if an AttributeError is raised, the execution will immediately drop into the except block (which will cause the next item in the loop to be ran). If no error is raised, the code will happily skip the except block.
If you just wish to set the missing values to zero and print, you can do
try: newprice=soup.LowestNewPrice.Amount.string
except AttributeError: newprice=0
try: usedprice=soup.LowestUsedPrice.Amount.string
except AttributeError: usedprice=0
print i , newprice, usedprice
The correct way of comparing with None is is None, not == None or is not None, not != None.
Secondly, you also need to check soup.LowestNewPrice for None, not the Amount, i.e.:
if soup.LowestNewPrice is not None:
... read soup.LowestNewPrice.Amount
Related
I'm trying to override the AttributeError message, so that it does not give me the error message and just continues with the script. The script finds and prints the office_manager name, but on some occasions there is no manager listed, as such I need it to just ignore those occasions. Can anyone help?
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
try:
print(office_manager)
except AttributeError:
continue
finally:
print("none")
Since the error came from .find, then it should be the one to be on the try catch, or even better it should be like this.
try:
office_manager = soup.find(text="Office_Manager").findPrevious('h4')
except AttributeError as err:
print(err) # or print("none")
pass # return or continue
else:
for title in office_manager:
print(title)
With bs4 4.7.1. you can use :contains, :has and :not. The following prints the directors names (if there are no directors you will get an empty list)
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
I thought someone less lazy than me would convert my comment to an answer, but as not, here you go:
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
try:
print(office_manager)
except AttributeError:
pass
finally:
print("none")
Using pass will skip the entry instead.
The below come's up with the error:
"if soup.find(text=bbb).parent.parent.get_text(strip=True
AttributeError: 'NoneType' object has no attribute 'parent'"
Any help would be appreciated as I can't quite get it to run fully, python only returns results up to the error, I need it to return empty if there is no item and move on. I tried putting a IF statement but that doesnt work.
import csv
import re
import requests
from bs4 import BeautifulSoup
f = open('dataoutput.csv','w', newline= "")
writer = csv.writer(f)
def trade_spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
href = "http://www.zoopla.co.uk" + link.get('href')
title = link.string
get_single_item_data(href)
page += 1
def get_single_item_data(item_url):
source_code = requests.get(item_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for item_e in soup.findAll('table', {'class' : 'neither'}):
Sold = item_e.get_text(strip=True)
bbb = re.compile('First listed')
try:
next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
except:
Pass
try:
writer.writerow([ Sold, next_s])
except:
pass
trade_spider(2)
Your exception comes from trying to access an attribute on None. You don't intend to do that, but because some earlier part of your expression turns out to be None where you expected something else, the later parts break.
Specifically, either soup.find(text=bbb) or soup.find(text=bbb).parent is None (probably the former, since I think None is the returned value if find doesn't find anything).
There are a few ways you can write your code to address this issue. You could either try to detect that it's going to happen ahead of time (and do something else instead), or you can just go ahead and try the attribute lookup and react if it fails. These two approaches are often called "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).
Here's a bit of code using an LBYL approach that checks to make sure the values are not None before accessing their attributes:
val = soup.find(text=bbb)
if val and val.parent: # I'm assuming the non-None values are never falsey
next_s = val.parent.parent.get_text(strip=True)
else:
# do something else here?
The EAFP approach is perhaps simpler, but there's some risk that it could catch other unexpected exceptions instead of the ones we expect (so be careful using this design approach during development):
try:
next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
except AttributeError: # try to catch the fewest exceptions possible (so you don't miss bugs)
# do something else here?
It's not obvious to me what your code should do in the "do something else here" sections in the code above. It might be that you can ignore the situation, but probably you'd need an alternative value for next_s to be used by later code. If there's no useful value to substitute, you might want to bail out of the function early instead (with a return statement).
Python newbie here practicing my skills. I came across a roadbump and would be very happy to receive some help. What i'm trying to do is to get a list of links from a spreadsheet. From there, Python will get the data, extract a specific class and paste the data to ColB. Problem is, there are instances when the link is broken, hence there will be no data scraped. I used try and except to get around this but it seems like it's not working. What it seems to do is that when an error occurs, it just skips writing the data and proceeds to write the data on the wrong cell. here is my code:
credentials = ServiceAccountCredentials.from_json_keyfile_name('Te....4e.json', scope)
gc = gspread.authorize(credentials)
#selects the spreadsheet
sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1u7....0')
worksheet = sh.worksheet('Keywords')
colvalue = "A"
rownumber = 2
updaterowvalue = 2
while rownumber <100:
try:
val = worksheet.acell(colvalue +str(rownumber)).value
rownumber += 1
url = val
#scrape elements
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
#print titles only
h1 = soup.find("h1", class_= "sg-text--headline")
updatecolvalue = "B"
worksheet.update_acell(updatecolvalue +str(updaterowvalue), h1.get_text())
updaterowvalue +=1
except AttributeError:
pass
print('DONE')
I assume that the extra indentation on the line starting worksheet.update_acell is an error, since your code is invalid as given.
The problem is that when an exception occurs, updaterowvalue +=1 is not executed, which causes the results to get out of sync with the URLs.
Fixing this is simple, just stop using updaterowvalue and just use rownumber in the worksheet.update_acell() call. Since you want the result to be in the same row as the URL, updaterowvalue is unnecessary.
A more pythonic way of writing the loop would be:
for rownumber in range(2,100):
which allows you to eliminate the rownumber += 1 line too.
How to bypass missing link and continue to scrape good data?
I am using Python2 and Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associcated links
Web page n+
more part descriptions
I tried:
try:
Do some things.
Error caused by missing link.
except Index Error as e:
print "I/O error({0}): {1}".format(e.errno, e.strerror)
break # to go on to next link.
# Did not work because program stopped to report error!
Since link is missing on web page can not use if missing link statement.
Thanks again for your help!!!
I corrected my faulty except error by following Python 2 documentation. Except correction jumped faulty web site missing link and continued on scraping data.
Except correction:
except:
# catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
e = sys.exc_info()[0]
print "Error: %s" % e
break
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib
def get_content_safe(url):
try:
contents = urllib.open(url)
return contents
except IOError, ex:
# Report ex your way
return None
def scrape:
# ....
content = get_content_safe(url)
if content == None:
pass # or continue or whatever
# ....
Long story short, just like Basilevs said, when you catch exception, your code will not break and will keep its execution.
website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
for lendo in arquivo.readlines():
msmwebsite = website + lendo
try:
abrindo = urllib2.urlopen(msmwebsite)
abrindo2 = abrindo.read()
except URLError as e:
pass
if abrindo.code == 200:
palavras = ['registration', 'there is no form']
for palavras2 in palavras:
if palavras2 in abrindo2:
print msmwebsite, 'up'
else:
pass
else:
pass
It's working but for some reason, some websites I got this error:
if abrindo.code == 200:
NameError: name 'abrindo' is not defined
How to fix it?
.......................................................................................................................................................................................
Replace pass with continue. And at least do some error logging, as you silently skip erroneous links.
In case your request resulted in an URLError, no variable abrindo is defined, hence your error.
abrindo is created only in the try block. It will not be available if the catch block is executed. To fix this, move the block of code starting with
if abrindo.code == 200:
inside the try block. One more suggestion, if you are not doing anything in the else part, instead of explicitly writing that with pass, simply remove them.