I am trying to write code that allows me to do 4 things, and I am using try and except.
The code is as follows:
try:
    for i in lista:
        a = url1 + i
        print(a)
        wget.download(a, '/Users/******/downloads')
except:
    for i in lista:
        b = url2 + i
        wget.download(b, '/Users/*****/downloads')
But I need to use 2 more exceptions. Can you explain to me how I can do it?
The main goal is to download a file; if it is still not there, download a second file, and so on and so forth.
You can specify the error after the except statement. For example:
urls = [
    "algumsite.com",
    "outrosite.org",
    "sitezinho.com.br"
]
for url in urls:
    try:
        wget.download(url, "path_to_download_folder/")
    except <error/s that can be raised in the previous try block>:
        # code that will be executed if the error were raised
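To make the "try the next URL if this one fails" idea concrete, here is a minimal sketch. It uses the standard library's urllib.request.urlretrieve instead of wget, and the function name download_first_available and its fetch parameter are made up for illustration, so treat it as one possible shape rather than the only way to do it.

```python
import urllib.error
import urllib.request


def download_first_available(urls, dest, fetch=urllib.request.urlretrieve):
    """Try each URL in order; return the first one that downloads, else None.

    `fetch` is a hypothetical hook (any callable taking url and dest) so the
    download function can be swapped, e.g. for wget.download or a test stub.
    """
    for url in urls:
        try:
            fetch(url, dest)
        except (urllib.error.URLError, ValueError) as exc:
            # URLError also covers HTTPError (404, 500, ...);
            # ValueError is raised by urllib for URLs without a scheme.
            print("failed: {} ({})".format(url, exc))
            continue
        return url  # success: stop trying further candidates
    return None
```

Calling it with a list of four candidate URLs gives the "first file, else second, else third, else fourth" behaviour with a single try/except instead of nested ones.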
Related
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
    response = requests.get(url).text
    return response
def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S)
    return re.findall(reg,html)
def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg,response)
def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg,response)
def download_book(book_url,path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path) #my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url,path)
        print('ok!!!')
    else:
        print('no!!!')
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0],book_name[0])
            except:
                continue
def main():
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing,no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:
try:
    download_book(book_url[0],book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0],book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch exception at all (at least until you know what to expect and how to handle it).
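As a variation on printing the error, here is a small sketch that logs the full traceback with the standard logging module and keeps a list of what failed. The names download_all and failures are invented for this example, and download stands in for a function like download_book().

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("downloader")


def download_all(items, download):
    """Call download(url, name) for each item and report what failed.

    `items` is an iterable of (url, name) pairs; `download` is whatever
    function does the real work (hypothetical here).
    """
    failures = []
    for url, name in items:
        try:
            download(url, name)
        except Exception:
            # log.exception records the message plus the full traceback
            log.exception("while downloading %s", url)
            failures.append(url)
    return failures
```

This way the loop still continues past a bad book, but nothing fails silently.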
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
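For point 2/, testing a function in isolation can be as small as the following. The sample string is a made-up fragment, not the real page content, and get_book_url() is copied from the question.

```python
import re


def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)


# Hypothetical input, just to see what the function returns on known data.
sample = '<li><a href="/ebooks/1342">Pride and Prejudice</a></li>'
urls = get_book_url(sample)
print(urls)
```

If the printed list is not what you expected for input you control, you know the bug is in that function and not in the network code around it.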
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse html. DONT - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
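To illustrate point 2/, here is a rough sketch of pulling links out with a real parser. Note that it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; the class and function names (LinkCollector, extract_links) are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike the regexps, this does not care whether there is a newline or extra attribute between two tags. With BeautifulSoup the same idea is shorter still, but the principle is the same.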
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined
def main(start_url):
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup, from bs4 import BeautifulSoup. For any information regarding why you shouldn't parse html with regex, see this answer RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0],book_name[0])
            except:
                continue
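The newline problem mentioned above is easy to reproduce offline. In this sketch the html string is a made-up fragment shaped like the description (a newline between </h2> and <ul>), not the actual page source:

```python
import re

# Hypothetical fragment with a newline between </h2> and <ul>.
html = (
    '<h2><span class="mw-headline" id="x">Title</span></h2>\n'
    '<ul><li><a href="/ebooks/1">Book</a></li></ul>'
)

old = re.compile(
    r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
new = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)', re.S)

old_matches = old.findall(html)  # original pattern: requires </h2><ul> with nothing between
new_matches = new.findall(html)  # pattern with the explicit \n
print(len(old_matches), len(new_matches))
```

A single whitespace difference is enough to make the original pattern match nothing, which is one reason a parser is the more robust choice here.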
Python newbie here, practicing my skills. I ran into a roadblock and would be very happy to receive some help. I'm trying to read a list of links from a spreadsheet. For each link, Python fetches the page, extracts the text of a specific class, and writes it to column B. The problem is that some links are broken, so nothing is scraped for them. I used try and except to get around this, but it doesn't seem to work: when an error occurs, it skips writing the data and then writes later data to the wrong cell. Here is my code:
credentials = ServiceAccountCredentials.from_json_keyfile_name('Te....4e.json', scope)
gc = gspread.authorize(credentials)
#selects the spreadsheet
sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1u7....0')
worksheet = sh.worksheet('Keywords')
colvalue = "A"
rownumber = 2
updaterowvalue = 2
while rownumber <100:
    try:
        val = worksheet.acell(colvalue +str(rownumber)).value
        rownumber += 1
        url = val
        #scrape elements
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        #print titles only
        h1 = soup.find("h1", class_= "sg-text--headline")
        updatecolvalue = "B"
            worksheet.update_acell(updatecolvalue +str(updaterowvalue), h1.get_text())
        updaterowvalue +=1
    except AttributeError:
        pass
print('DONE')
I assume that the extra indentation on the line starting worksheet.update_acell is an error, since your code is invalid as given.
The problem is that when an exception occurs, updaterowvalue +=1 is not executed, which causes the results to get out of sync with the URLs.
Fixing this is simple, just stop using updaterowvalue and just use rownumber in the worksheet.update_acell() call. Since you want the result to be in the same row as the URL, updaterowvalue is unnecessary.
A more pythonic way of writing the loop would be:
for rownumber in range(2,100):
which allows you to eliminate the rownumber += 1 line too.
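Put together, the loop could look roughly like the sketch below. To keep it runnable without a spreadsheet, the worksheet and the scraping are replaced by stand-ins: urls_by_row, scrape_title and write_cell are hypothetical names, not gspread API.

```python
def fill_titles(urls_by_row, scrape_title, write_cell):
    """Write each scraped title into column B of the same row as its URL.

    urls_by_row:  dict mapping row number -> URL (stand-in for column A)
    scrape_title: callable returning the title, may raise AttributeError
    write_cell:   callable(cell_label, value), stand-in for update_acell
    """
    for rownumber in range(2, 100):
        url = urls_by_row.get(rownumber)
        if not url:
            continue
        try:
            title = scrape_title(url)
        except AttributeError:
            # broken link or missing <h1>: leave this row's B cell empty
            continue
        # same row number for reading and writing, so nothing gets out of sync
        write_cell("B" + str(rownumber), title)
```

Because the row number comes from the loop itself, a failed row can no longer shift the results of the rows after it.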
I am finding prices of products from Amazon using their API with Bottlenose and parsing the xml response with BeautifulSoup.
I have a predefined list of products that the code iterates through.
This is my code:
import bottlenose as BN
import lxml
from bs4 import BeautifulSoup
i = 0
amazon = BN.Amazon('myid','mysecretkey','myassoctag',Region='UK',MaxQPS=0.9)
list = open('list.txt', 'r')
print "Number", "New Price:","Used Price:"
for line in list:
    i = i + 1
    listclean = line.strip()
    response = amazon.ItemLookup(ItemId=listclean, ResponseGroup="Large")
    soup = BeautifulSoup(response, "xml")
    usedprice=soup.LowestUsedPrice.Amount.string
    newprice=soup.LowestNewPrice.Amount.string
    print i , newprice, usedprice
This works fine and runs through my list of Amazon products until it reaches a product that has no value for that set of tags, such as no new or used price.
At that point Python raises this error:
AttributeError: 'NoneType' object has no attribute 'Amount'
This makes sense, as BS finds none of the tags/strings I searched for. Having no value is perfectly fine for what I'm trying to achieve; however, the code stops at this point and does not continue.
I have tried:
if soup.LowestNewPrice.Amount != None:
    newprice=soup.LowestNewPrice.Amount.string
else:
    continue
and also tried:
newprice=0
if soup.LowestNewPrice.Amount != 0:
    newprice=soup.LowestNewPrice.Amount.string
else:
    continue
I am at a loss for how to continue after receiving the nonetype value return. Unsure whether the problem lies fundamentally in the language or in the libraries I'm using.
You can use exception handling:
try:
    # operation which causes AttributeError
except AttributeError:
    continue
The code in the try block will be executed, and if an AttributeError is raised, execution will immediately drop into the except block (which will cause the next item in the loop to be run). If no error is raised, the code will happily skip the except block.
If you just wish to set the missing values to zero and print, you can do
try: newprice=soup.LowestNewPrice.Amount.string
except AttributeError: newprice=0
try: usedprice=soup.LowestUsedPrice.Amount.string
except AttributeError: usedprice=0
print i , newprice, usedprice
The correct way to compare with None is to use "is None" rather than "== None", and "is not None" rather than "!= None".
Secondly, you also need to check soup.LowestNewPrice for None, not the Amount, i.e.:
if soup.LowestNewPrice is not None:
    ... read soup.LowestNewPrice.Amount
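Here is a small runnable sketch of the "check for None before going deeper" idea. It deliberately swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs anywhere, and the XML string is a simplified, made-up stand-in (real Amazon responses are larger and namespaced). The helper name lowest_price is also invented.

```python
import xml.etree.ElementTree as ET


def lowest_price(root, tag):
    """Return the text of <tag>/<Amount> anywhere under root, or None."""
    node = root.find(".//{}/Amount".format(tag))
    # find() returns None when the path does not exist, so test with "is not None"
    return node.text if node is not None else None


# Hypothetical response: this item has a new price but no used price.
sample = (
    "<Item><OfferSummary>"
    "<LowestNewPrice><Amount>1299</Amount></LowestNewPrice>"
    "</OfferSummary></Item>"
)
root = ET.fromstring(sample)
newprice = lowest_price(root, "LowestNewPrice")
usedprice = lowest_price(root, "LowestUsedPrice")
print(newprice, usedprice)
```

The missing price simply comes back as None and the loop can carry on, with no exception involved.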
I would like some help handling a URL that fails to open. Currently the whole program is interrupted when the URL fails to open ( tree = ET.parse(opener.open(input_url)) )...
If opening a URL fails on my first function call (motgift), I would like it to wait 10 seconds and then try to open the URL again; if it fails again, I would like my script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)
motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
Would be very happy and appreciate an example on how to make this happen.
Would a Try Except block work?
from time import sleep

error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except Exception:  # a bare except would also swallow Ctrl-C during the sleep
        error += 1
        sleep(10)
try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        pass
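Both answers above can be folded into one small helper. This is only a sketch: call_with_retry and its parameters are made-up names, and it catches OSError (which urllib's URLError derives from) by default rather than every Exception.

```python
import time


def call_with_retry(func, attempts=2, delay=10, exceptions=(OSError,),
                    sleep=time.sleep):
    """Call func(); on failure wait `delay` seconds and retry.

    Returns func()'s result, or None if every attempt failed.
    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            print("attempt {} failed: {}".format(attempt, exc))
            if attempt < attempts:
                sleep(delay)  # wait only if another attempt follows
    return None
```

It could then be used along the lines of motgift = call_with_retry(lambda: spider_xml(motgift_url, ...)) followed by the observer call, so that a motgift failure after two tries no longer stops the script.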
Are you looking for this?
website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()
        except URLError as e:
            pass
        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'
                else:
                    pass
        else:
            pass
It works, but for some websites I get this error:
if abrindo.code == 200:
NameError: name 'abrindo' is not defined
How to fix it?
Replace pass with continue. And at least do some error logging, as you silently skip erroneous links.
If your request results in a URLError, the variable abrindo is never defined, hence your error.
abrindo is created only in the try block. It will not be available if the except block is executed. To fix this, move the block of code starting with
if abrindo.code == 200:
inside the try block. One more suggestion, if you are not doing anything in the else part, instead of explicitly writing that with pass, simply remove them.
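Combining the suggestions above (continue instead of pass, some error reporting, no empty else branches), a Python 3 version might look roughly like this. The function name find_live_pages and the opener parameter are invented for the sketch; opener defaults to urllib.request.urlopen and is only there so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def find_live_pages(base, paths, keywords, opener=urllib.request.urlopen):
    """Return the URLs that answer 200 and contain one of the keywords."""
    found = []
    for path in paths:
        url = base + path.strip()  # strip the newline read from the file
        try:
            response = opener(url)
            body = response.read().decode("utf-8", errors="replace")
        except urllib.error.URLError as exc:
            # report and move on; response is not defined on this path
            print("skipping {}: {}".format(url, exc))
            continue
        # only reached when the request succeeded, so response exists here
        if response.status == 200 and any(word in body for word in keywords):
            found.append(url)
    return found
```

The key point is that everything using the response sits either inside the try block or after a continue, so it can never run when the request failed.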