Is there a resume() function in Python? I need to apply it to my program and would like a proper explanation; I searched a lot but couldn't find one.
Here is my code, where I need to place the resume function:
try:
    soup = BeautifulSoup(urllib2.urlopen(url))
    abc = soup.find('div', attrs={})
    link = abc.find('a')['href']
    # results is a dictionary
    results['Link'] = "http://{0}".format(link)
    print results
    #pause.minute(1)
    #time.sleep(10)
except Exception:
    print "socket error continuing the process"
    time.sleep(4)
    #pause.minute(1)
    #break
I tried pause, time.sleep and break, but none gave the required result. If any error appears in the try block, I want to pause the program and then resume. The try block is already inside a loop.
To resume the code in case of an exception, put it inside a loop:
import time
import urllib2
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

max_retries = 3     # how many times to attempt the download
retry_timeout = 4   # seconds to wait between attempts

for _ in range(max_retries):
    try:
        r = urllib2.urlopen(url)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout)
    else:
        break
else:  # all max_retries attempts failed
    raise last_error

soup = BeautifulSoup(html, from_encoding=encoding)
# ...
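If the scraper fetches many URLs, the same for/else retry pattern can be pulled into a small helper so every fetch reuses it. This is only a sketch built on the answer above; the fetch name and the default values for max_retries and retry_timeout are illustrative, not part of the original code.

import time
import urllib2

def fetch(url, max_retries=3, retry_timeout=4):
    # Return the raw page contents for url, retrying on any error.
    # Re-raises the last error if every attempt fails.
    for _ in range(max_retries):
        try:
            return urllib2.urlopen(url).read()
        except Exception as e:
            last_error = e
            time.sleep(retry_timeout)
    raise last_error

# html = fetch(url)
# soup = BeautifulSoup(html)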
First of all, thank you for taking the time to read through this post. I'd like to begin by saying that I'm very new to programming in general and that I'm seeking advice to solve a problem.
I'm trying to create a script that checks whether the content of an HTML page has changed, in order to monitor certain website pages for changes. I found a script and made some alterations so that it goes through a list of URLs, checking whether each page has changed. The problem is that it checks the pages sequentially: it goes through the list one URL at a time, while I want the script to check the URLs in parallel. I'm also using a while loop to keep checking the pages, because even after a change has taken place the page still has to be monitored. I could write a thousand more words explaining what I'm trying to do, so instead have a look at the code:
import requests
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen

url = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
i = 0

response = urlopen(url[i]).read()
currentHash = hashlib.sha224(response).hexdigest()

while True:
    try:
        response = urlopen(url[i]).read()
        currentHash = hashlib.sha224(response).hexdigest()
        print('checking')
        time.sleep(10)

        response = urlopen(url[i]).read()
        newHash = hashlib.sha224(response).hexdigest()
        i += 1
        if newHash == currentHash:
            continue
        else:
            print('Change detected')
            print(url[i])
            time.sleep(10)
            continue
    except Exception as e:
        i = 0
        print('resetting increment')
        continue
What you want to do is called multi-threading.
Conceptually this is how it works:
import hashlib
import time
from urllib.request import urlopen
import threading

# Define a function for the thread
def f(url):
    initialHash = None
    while True:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        if not initialHash:
            initialHash = currentHash
        if currentHash != initialHash:
            print('Change detected')
            print(url)
            time.sleep(10)
            continue
        return

# Create one thread per URL
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,))
    t.start()
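One design note on the snippet above: the threads are non-daemon, so the program keeps running until every f() returns on its own. If instead the monitors should stop as soon as the main script exits, a variant (assuming Python 3.3+ for the daemon keyword argument) is to start them as daemon threads and keep the main thread alive yourself:

# Variant: daemon threads are stopped automatically when the main program exits.
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,), daemon=True)
    t.start()

# Keep the main thread alive; Ctrl-C here stops everything.
while True:
    time.sleep(1)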
Running example of the OP's code using a thread executor
Code
import concurrent.futures
import time
import hashlib
from urllib.request import urlopen

def check_change(url):
    '''
    Checks for a change in web page contents by comparing the current hash to the previous one
    '''
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(10)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash != currentHash:
            return "Change to:", url
        else:
            return None
    except Exception as e:
        return "Error", e, url

page_urls = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]

while True:
    # Use a thread pool executor to ensure threads are cleaned up properly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            # Output the result of each thread upon its completion
            url = future_to_url[future]
            try:
                status = future.result()
                if status:
                    print(*status)
                else:
                    print(f'No change to: {url}')
            except Exception as exc:
                print('Site %r generated an exception: %s' % (url, exc))
    time.sleep(10)  # Wait 10 seconds before rechecking sites
Output
Change to: https://www.google.com
Change to: https://www.google.be
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
...
I'm trying to suppress the AttributeError so that the script does not print the error message and just continues. The script finds and prints the office_manager name, but on some occasions there is no manager listed, so I need it to simply ignore those cases. Can anyone help?
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        continue
    finally:
        print("none")
Since the error comes from .find, that call is the one that should be inside the try/except. Even better, structure it like this:
try:
    office_manager = soup.find(text="Office_Manager").findPrevious('h4')
except AttributeError as err:
    print(err)  # or print("none")
    pass  # or return / continue
else:
    for title in office_manager:
        print(title)
With bs4 4.7.1 you can use :contains, :has and :not. The following prints the directors' names (if there are no directors you will get an empty list):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
I thought someone less lazy than me would convert my comment into an answer, but since no one did, here you go:
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        pass
    finally:
        print("none")
Using pass will skip the entry instead.
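To make the pass/continue distinction concrete, here is a toy sketch (the list and print calls are made up for illustration): pass is a no-op, so execution falls through to whatever follows the try/except, while continue jumps straight to the next loop iteration.

items = ["a", None, "b"]

for item in items:
    try:
        print(item.upper())
    except AttributeError:
        pass  # fall through: the line below still runs for None
    print("finished", item)

for item in items:
    try:
        print(item.upper())
    except AttributeError:
        continue  # jump to the next item: the line below is skipped for None
    print("finished", item)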
I'm crawling some financial statements from 'naver'. My code list (kospi_list) contains an invalid code, so I added an exception handler to skip it. However, it doesn't work well.
Once the program hits the invalid code, it treats the rest of the run as an exception until the end. I just want to continue the loop.
Could you help me?
Thanks in advance.
def connect(self):
    for code in self.kospi_list:
        url = "https://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cmp_cd=%s&cn=" % code
        try:
            self.driver.get(url)
            html = self.driver.page_source
            self.soup = BeautifulSoup(html, 'html.parser')
        except:
            pass
        else:
            name = self.get_name()
            result = self.loop()
            if result is not None:
                self.save_sql(result, name)
                print(result)
    self.driver.close()
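One possible culprit in the snippet above: the calls that may actually raise for an invalid code (self.get_name() and self.loop()) sit in the else block, outside the try, so their exceptions are never caught. A minimal sketch, assuming that is indeed where the failure happens, is to pull those calls into the try and explicitly continue on error:

def connect(self):
    for code in self.kospi_list:
        url = "https://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cmp_cd=%s&cn=" % code
        try:
            self.driver.get(url)
            html = self.driver.page_source
            self.soup = BeautifulSoup(html, 'html.parser')
            name = self.get_name()   # assumption: these are the calls that fail
            result = self.loop()
        except Exception as e:
            print('skipping %s: %s' % (code, e))
            continue  # move on to the next code instead of aborting
        if result is not None:
            self.save_sql(result, name)
            print(result)
    self.driver.close()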
I would like some help on handling a URL that fails to open; currently the whole program gets interrupted when it fails to open the URL (tree = ET.parse(opener.open(input_url))).
If opening the URL fails on my first function call (motgift), I would like it to wait 10 seconds and then try to open the URL again; if it fails once more, I would like the script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)

motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
I would be very happy and would appreciate an example of how to make this happen.
Would a try/except block work?
error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except:
        error += 1
        sleep(10)
try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        pass
Are you looking for this?
I have the following code that grabs images using urlretrieve, and it works... to a point.
def Opt3():
    global conn
    curs = conn.cursor()
    results = curs.execute("SELECT stock_code FROM COMPANY")
    for row in results:
        #for image_name in list_of_image_names:
        page = requests.get('url?prodid=' + row[0])
        tree = html.fromstring(page.text)
        pic = tree.xpath('//*[@id="bigImg0"]')
        #print pic[0].attrib['src']
        print 'URL' + pic[0].attrib['src']
        try:
            urllib.urlretrieve('URL' + pic[0].attrib['src'], 'images\\' + row[0] + '.jpg')
        except:
            pass
I am reading a CSV to input the image names. It works except when it hits an erroneous/corrupt URL (where there is no image, I think). I was wondering if I could simply skip any corrupt URLs and have the code continue grabbing images? Thanks.
urllib has very poor support for error catching; urllib2 is a much better choice. The urlretrieve equivalent in urllib2 is:
resp = urllib2.urlopen(im_url)
with open(sav_name, 'wb') as f:
    f.write(resp.read())
And the errors to catch are:
urllib2.URLError, urllib2.HTTPError, httplib.HTTPException
And you can also catch socket.error in case the network is down.
Simply using except Exception is a very stupid idea: it'll catch every error in the block above, even your typos.
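Put together, a sketch of the download that catches only the errors listed above might look like this (im_url and sav_name are the names used in the snippet above; the print is just for illustration):

import socket
import httplib
import urllib2

try:
    resp = urllib2.urlopen(im_url)
    with open(sav_name, 'wb') as f:
        f.write(resp.read())
except (urllib2.HTTPError, urllib2.URLError, httplib.HTTPException, socket.error) as e:
    print 'failed to fetch %s: %s' % (im_url, e)
    # continue here if this sits inside the download loop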
Just use a try/except and continue if it fails
try:
    page = requests.get('url?prodid=' + row[0])
except Exception as e:
    print e
    continue  # continue to the next row
Instead of pass, why don't you try continue when an error occurs?
try:
    urllib.urlretrieve('URL' + pic[0].attrib['src'], 'images\\' + row[0] + '.jpg')
except Exception as e:
    continue