I'm crawling financial statements from Naver. My list of stock codes (kospi_list) contains
an invalid code, so I added exception handling to skip it. However, it doesn't work as expected.
Once the program hits the invalid code, every remaining code is handled as an exception until the end. I just want the loop to continue with the next code.
Could you help me?
Thanks in advance.
def connect(self):
    for code in self.kospi_list:
        url = "https://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cmp_cd=%s&cn=" % code
        try:
            self.driver.get(url)
            html = self.driver.page_source
            self.soup = BeautifulSoup(html, 'html.parser')
        except:
            pass
        else:
            name = self.get_name()
            result = self.loop()
            if result is not None:
                self.save_sql(result, name)
                print(result)
    self.driver.close()
I am trying to use the code below to search for a keyword at a given URL (an internal website at work), and I keep getting an error. It works fine on public sites.
from html.parser import HTMLParser
import urllib.request

class CustomHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_flag = False
        self.tag_line_num = 0
        self.tag_string = 'temporary_tag'

    def initiate_vars(self, tag_string):
        self.tag_string = tag_string

    def handle_starttag(self, tag, attrs):
        #if tag == 'tag_to_search_for':
        if tag == self.tag_string:
            self.tag_flag = True
            self.tag_line_num = self.getpos()

if __name__ == '__main__':
    #simple_str = 'string_to_search_for'
    simple_str = 'Host Status'
    my_url = 'TEST_URL'
    parser_obj = CustomHTMLParser()
    #parser_obj.initiate_vars('tag_to_search_for')
    parser_obj.initiate_vars('script')
    #html_file = open('location_of_html_file//file.html')
    my_request = urllib.request.Request(my_url)
    try:
        url_data = urllib.request.urlopen(my_request)
    except:
        print("There was some error opening the URL")
    html_str = url_data.read().decode('utf8')
    #html_str = html_file.read()
    #print (html_str)
    html_search_result = html_str.lower().find(simple_str.lower())
    if html_search_result != -1:
        print ('The word {} was found'.format(simple_str))
    else:
        print ('The word {} was not found'.format(simple_str))
    parser_obj.feed(html_str)
    if parser_obj.tag_flag:
        print ('Tag {0} was found at position {1}'.format(parser_obj.tag_string, parser_obj.tag_line_num))
    else:
        print ('Tag {} was not found'.format(parser_obj.tag_string))
but I keep getting the error
There was some error opening the URL
Traceback (most recent call last):
  File "C:\TEMP\parse.py", line 40, in <module>
    html_str = url_data.read().decode('utf8')
NameError: name 'url_data' is not defined
I believe I have already tried urllib2. I am using Python 3.7.
I'm not sure what to do. Is it worth trying to set a User-Agent?
EDIT1: I have now tried the below
>>> import urllib
>>> url = urllib.request.urlopen('https://concernedURL.com')
and I am getting this error: "urllib.error.HTTPError: HTTP Error 401: Unauthorized". Should I be sending the headers my browser uses, as well as SSL certificates?
The problem is that you get an error in the try-block, and that leaves the url_data variable undefined:
try:
    # if this errors, no url_data will exist
    url_data = urllib.request.urlopen(my_request)
except:
    # really bad to catch all exceptions!
    print("There was some error opening the URL")
html_str = url_data.read().decode('utf8')
You should probably just remove the try-except, or handle the error properly. It is almost never advisable to use a bare except without a specific exception type, since it also swallows KeyboardInterrupt, SystemExit and bugs you did not anticipate.
In this case your program should probably just stop running if you cannot open the requested url, since it really doesn't make any sense to try to operate on the url's data if the opening failed in the first place.
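As a minimal sketch of "handling the error better", the snippet below catches only the urllib exceptions and reports what went wrong, so a 401 like the one in EDIT1 is shown instead of being hidden. The function name fetch_html and its opener parameter are hypothetical additions for illustration (the parameter only exists so the function can be exercised without a network); it does not attempt to solve the authentication problem itself.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded page body, or None if the URL could not be opened.

    `opener` is a hypothetical hook (defaults to urllib.request.urlopen).
    """
    try:
        with opener(url) as response:
            return response.read().decode("utf8")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401.
        print("HTTP error {} for {}: {}".format(err.code, url, err.reason))
    except urllib.error.URLError as err:
        # No usable response at all (DNS failure, refused connection, ...).
        print("Could not reach {}: {}".format(url, err.reason))
    return None
```

The caller can then stop early, for example with `if html_str is None: raise SystemExit(1)`, instead of continuing with an undefined variable. HTTPError is listed first because it is a subclass of URLError.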
I'm trying to handle the AttributeError so that the script does not stop with an error message and simply carries on. The script finds and prints the office manager's name, but some pages list no manager, and I need it to ignore those cases. Can anyone help?
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        continue
    finally:
        print("none")
Since the error comes from .find, that is the call that belongs inside the try/except. Better still, structure it like this:
try:
    office_manager = soup.find(text="Office_Manager").findPrevious('h4')
except AttributeError as err:
    print(err) # or print("none")
    pass # return or continue
else:
    for title in office_manager:
        print(title)
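An alternative to catching AttributeError is to check the result of .find for None before chaining the next call, since .find returns None when nothing matches. The sketch below assumes `soup` is a BeautifulSoup object as in the question; the helper name previous_heading and its default arguments are hypothetical.

```python
def previous_heading(soup, label="Office_Manager", heading="h4"):
    """Return the heading that precedes `label`, or None if `label` is absent."""
    node = soup.find(text=label)
    if node is None:
        # No manager listed on this page: nothing to look up.
        return None
    return node.findPrevious(heading)
```

A caller could then write `heading = previous_heading(soup)` and skip the page when the result is None.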
With bs4 4.7.1 you can use :contains, :has and :not. The following prints the directors' names (if there are no directors you will get an empty list):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
I thought someone less lazy than me would convert my comment to an answer, but as not, here you go:
for office_manager in soup.find(text="Office_Manager").findPrevious('h4'):
    try:
        print(office_manager)
    except AttributeError:
        pass
    finally:
        print("none")
Using pass will skip the entry instead.
I've just started using Python to scrape data. My code below sometimes freezes while running, and I guess that's because some URL does not respond at all; I think it would work if I simply tried that URL again. My question is: if I revise the code like this,
reshomee = requests.get(homeUrl, headers=headerss, timeout=10)
does it try the URL again after 10 seconds without a response? I'm worried that it would just give up without retrying.
I have to ask because I have no way of testing this: the URL freezes very rarely and at random. Thank you!
def reshome(tries=0):
    try:
        reshomee = requests.get(homeUrl, headers=headerss)
        return reshomee
    except Exception as e:
        print(e)
        if tries < 10:
            print('try:' + str(tries))
            sleep(tries*30+100)
            return reshome(tries+1)
        else:
            print('cannot make it')
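A timeout does not retry the request by itself: if the server has not responded within the timeout, requests raises requests.exceptions.Timeout. You can catch that exception from requests.exceptions and retry yourself:
def reshome(tries=0):
    try:
        reshomee = requests.get(homeUrl, headers=headerss, timeout=0.001)
        return reshomee
    except requests.exceptions.Timeout as e:
        if tries < 10:  # same limit as in the question, so the recursion is bounded
            return reshome(tries+1)
        raise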
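The same idea can be written as a loop instead of recursion. The helper below is only a sketch: call_with_retries and all of its parameters are hypothetical names, and the sleep parameter is there purely so the waiting can be replaced in tests. It retries only the exception types it is told about and re-raises the last error once the attempts are used up.

```python
import time


def call_with_retries(func, retry_on=(Exception,), max_tries=3, wait=10,
                      sleep=time.sleep):
    """Call func() until it succeeds or max_tries attempts have failed."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except retry_on as err:
            print("attempt {} failed: {}".format(attempt, err))
            if attempt == max_tries:
                raise  # give up and let the caller see the last error
            sleep(wait)
```

With requests, this might be called as `call_with_retries(lambda: requests.get(homeUrl, headers=headerss, timeout=10), retry_on=(requests.exceptions.Timeout,))`, using the names from the question.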
I would like some help handling a URL that fails to open. Currently the whole program is interrupted when the URL fails to open ( tree = ET.parse(opener.open(input_url)) )...
If opening the URL fails on my first function call (motgift), I would like the script to wait 10 seconds and then try again; if it fails a second time, I would like the script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)

motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
I would appreciate an example of how to do this.
Would a try/except block work?
from time import sleep

error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except Exception:
        error += 1
        sleep(10)
try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        resp = None  # both attempts failed; check for None before parsing
Are you looking for this?
Is there a resume() function in Python? I need something like it in my program. I have searched a lot but could not find a proper explanation.
Here is the code where I need the resume behaviour.
try:
    soup = BeautifulSoup(urllib2.urlopen(url))
    abc = soup.find('div', attrs={})
    link = abc.find('a')['href']
    #result is dictionary
    results['Link'] = "http://{0}".format(link)
    print results
    #pause.minute(1)
    #time.sleep(10)
except Exception:
    print "socket error continuing the process"
    time.sleep(4)
    #pause.minute(1)
    #break
I tried pause, time.sleep and break but did not get the required result. If any error occurs in the try block, I want to pause the program. The try block is already inside a loop.
To resume the code in case of an exception, put it inside a loop:
import time
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

for _ in range(max_retries):
    try:
        r = urllib2.urlopen(url)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout)
    else:
        break
else: # all max_retries attempts failed
    raise last_error
soup = BeautifulSoup(html, from_encoding=encoding)
# ...
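The answer above targets Python 2 (urllib2 and r.info().getparam). A rough Python 3 equivalent of the same for/else retry loop might look like the sketch below. The function name read_with_retries, its defaults, and the opener and sleep parameters are hypothetical additions so the loop can be exercised without a network; the else branch of the for loop runs only when no attempt reached break.

```python
import time
import urllib.request


def read_with_retries(url, max_retries=3, retry_timeout=4,
                      opener=urllib.request.urlopen, sleep=time.sleep):
    """Return (body_bytes, charset) for url, retrying on any exception."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for _ in range(max_retries):
        try:
            with opener(url) as r:
                # Python 3 replacement for r.info().getparam('charset')
                encoding = r.headers.get_content_charset()
                html = r.read()
        except Exception as e:
            last_error = e
            sleep(retry_timeout)
        else:
            break  # success: leave the loop and skip the for/else branch
    else:  # all max_retries attempts failed
        raise last_error
    return html, encoding
```

The returned pair can then be passed on as in the original answer, for example `BeautifulSoup(html, 'html.parser', from_encoding=encoding)`.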