I am trying to use the code below to search for a keyword at a given URL (an internal website at work), and I keep getting an error. It works fine on a public site.
from html.parser import HTMLParser
import urllib.request

class CustomHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_flag = False
        self.tag_line_num = 0
        self.tag_string = 'temporary_tag'

    def initiate_vars(self, tag_string):
        self.tag_string = tag_string

    def handle_starttag(self, tag, attrs):
        #if tag == 'tag_to_search_for':
        if tag == self.tag_string:
            self.tag_flag = True
            self.tag_line_num = self.getpos()

if __name__== '__main__':
    #simple_str = 'string_to_search_for'
    simple_str = 'Host Status'
    my_url = 'TEST_URL'
    parser_obj = CustomHTMLParser()
    #parser_obj.initiate_vars('tag_to_search_for')
    parser_obj.initiate_vars('script')

    #html_file = open('location_of_html_file//file.html')
    my_request = urllib.request.Request(my_url)
    try:
        url_data = urllib.request.urlopen(my_request)
    except:
        print("There was some error opening the URL")

    html_str = url_data.read().decode('utf8')
    #html_str = html_file.read()
    #print (html_str)

    html_search_result = html_str.lower().find(simple_str.lower())
    if html_search_result != -1:
        print ('The word {} was found'.format(simple_str))
    else:
        print ('The word {} was not found'.format(simple_str))

    parser_obj.feed(html_str)
    if parser_obj.tag_flag:
        print ('Tag {0} was found at position {1}'.format(parser_obj.tag_string, parser_obj.tag_line_num))
    else:
        print ('Tag {} was not found'.format(parser_obj.tag_string))
but I keep getting the error
There was some error opening the URL
Traceback (most recent call last):
  File "C:\TEMP\parse.py", line 40, in <module>
    html_str = url_data.read().decode('utf8')
NameError: name 'url_data' is not defined
I believe I have already tried using urllib2; I am using Python v3.7.
Not sure what to do. Is it worth trying user_agent?
EDIT1: I have now tried the below
>>> import urllib
>>> url = urllib.request.urlopen('https://concernedURL.com')
and I am getting this error "urllib.error.HTTPError: HTTP Error 401: Unauthorized". Should I be using the headers I have from my browser as well as SSL certs?
The problem is that you get an error in the try-block, and that leaves the url_data variable undefined:
try:
    # if this errors, no url_data will exist
    url_data = urllib.request.urlopen(my_request)
except:
    # really bad to catch all exceptions!
    print("There was some error opening the URL")

html_str = url_data.read().decode('utf8')
You should probably just remove the try-except, or handle the error better. It's almost never advisable to use a bare except without a specific error, since it can create all kinds of problems.
In this case your program should probably just stop running if you cannot open the requested url, since it really doesn't make any sense to try to operate on the url's data if the opening failed in the first place.
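For example, a minimal sketch along those lines (reusing my_url and my_request from the question) that catches only the URL error and stops cleanly:

import sys
import urllib.request
import urllib.error

my_request = urllib.request.Request(my_url)
try:
    url_data = urllib.request.urlopen(my_request)
except urllib.error.URLError as err:
    # stop here: there is nothing useful to do without the response
    sys.exit("Could not open {}: {}".format(my_url, err))

html_str = url_data.read().decode('utf8')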
Related
So BS4 was working earlier today; however, it now has problems when trying to load a page.
import requests
from bs4 import BeautifulSoup

name = input("")

twitter = requests.get("https://twitter.com/" + name)
#instagram = requests.get("https//instagram.com/" + name)
#website = requests.get("https://" + name + ".com")

twitter_soup = BeautifulSoup(twitter, 'html.parser')

twitter_available = twitter_soup.body.findAll(text="This account doesn't exist")

if twitter_available == True:
    print("Available")
else:
    print("Not Available")
On the line where twitter_soup is declared, I get the following error:
Traceback (most recent call last):
  File "D:\Programming\Python\name-checker.py", line 12, in <module>
    twitter_soup = BeautifulSoup(twitter, 'html.parser')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\bs4\__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
I have also tried the other parsers the docs suggest, but none of them work.
I just figured it out.
So in this situation I had to pass the actual HTML, which is twitter.text, instead of just the request object.
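In other words, BeautifulSoup needs the response's text. A minimal sketch with the question's variable names (also replacing the == True check, since findAll returns a list that should simply be tested for emptiness):

import requests
from bs4 import BeautifulSoup

name = input("")
twitter = requests.get("https://twitter.com/" + name)

# pass the page's HTML as a string, not the Response object itself
twitter_soup = BeautifulSoup(twitter.text, 'html.parser')

# findAll returns a list of matches; a non-empty list means the
# "doesn't exist" message was found on the page
if twitter_soup.body.findAll(text="This account doesn't exist"):
    print("Available")
else:
    print("Not Available")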
I would like some help on how to handle a URL that fails to open; currently the whole program gets interrupted when opening the URL fails ( tree = ET.parse(opener.open(input_url)) ).
If opening the URL fails on my first function call (motgift), I would like it to wait 10 seconds and then try to open the URL again; if it fails once more, I would like my script to continue with the next function call (observer).
def spider_xml(input_url, extract_function, input_xpath, pipeline, object_table, object_model):
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(input_url))
    print(object_table)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), object_model)

motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
observer = spider_xml(observer_url, extract_xml_item, observer_xpath, observer_pipeline, observer_table, observer_model)
I would be very happy with, and would really appreciate, an example of how to make this happen.
Would a try/except block work?
error = 0
while error < 2:
    try:
        motgift = spider_xml(motgift_url, extract_xml_item, motgift_xpath, motgift_pipeline, motgift_table, motgift_model)
        break
    except:
        error += 1
        sleep(10)
try:
    resp = opener.open(input_url)
except Exception:
    time.sleep(10)
    try:
        resp = opener.open(input_url)
    except Exception:
        pass
Are you looking for this?
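Combining the two, here is a hedged sketch of what the question describes (wait 10 seconds, retry once, then carry on with the next call). try_spider is a made-up helper name, and the spider_xml arguments are the ones from the question:

import time

def try_spider(*args, retries=1, wait=10):
    # call spider_xml, retrying after `wait` seconds; give up after `retries` retries
    for attempt in range(retries + 1):
        try:
            return spider_xml(*args)
        except Exception as e:  # ideally narrow this to the exceptions urlopen can raise
            print("spider_xml failed on attempt {}: {}".format(attempt + 1, e))
            if attempt < retries:
                time.sleep(wait)
    return None  # give up so the next call can still run

motgift = try_spider(motgift_url, extract_xml_item, motgift_xpath,
                     motgift_pipeline, motgift_table, motgift_model)
observer = try_spider(observer_url, extract_xml_item, observer_xpath,
                      observer_pipeline, observer_table, observer_model)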
Is there a resume() function in Python? I need to apply it to my program and need a proper explanation; I have searched a lot but haven't found one.
Here is my code where I need to place the resume function.
try:
    soup = BeautifulSoup(urllib2.urlopen(url))
    abc = soup.find('div', attrs={})
    link = abc.find('a')['href']
    #result is dictionary
    results['Link'] = "http://{0}".format(link)
    print results
    #pause.minute(1)
    #time.sleep(10)
except Exception:
    print "socket error continuing the process"
    time.sleep(4)
    #pause.minute(1)
    #break
I applied pause, time.sleep and break but am not getting the required result. If any error appears in the try block, I want to pause the program. The try block is already inside a loop.
To resume the code in case of an exception, put it inside a loop:
import time
import urllib2
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

for _ in range(max_retries):
    try:
        r = urllib2.urlopen(url)
        encoding = r.info().getparam('charset')
        html = r.read()
    except Exception as e:
        last_error = e
        time.sleep(retry_timeout)
    else:
        break
else:  # all max_retries attempts failed
    raise last_error

soup = BeautifulSoup(html, from_encoding=encoding)
# ...
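The for/else construct does the retry bookkeeping here (assuming max_retries and retry_timeout have been defined earlier): the else branch of a for loop runs only if the loop finishes without hitting break, so last_error is re-raised only when every attempt has failed, while a successful urlopen breaks out early and execution continues with the BeautifulSoup call.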
website = raw_input('website: ')

with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()
        except URLError as e:
            pass

        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'
                else:
                    pass
        else:
            pass
It's working, but for some reason I get this error for some websites:
if abrindo.code == 200:
NameError: name 'abrindo' is not defined
How to fix it?
Replace pass with continue. And at least do some error logging, as you silently skip erroneous links.
In case your request resulted in a URLError, the variable abrindo never gets defined, hence your error.
abrindo is created only in the try block, so it will not be available if the except block is executed. To fix this, move the block of code starting with
if abrindo.code == 200:
inside the try block. One more suggestion: if you are not doing anything in the else branches, instead of explicitly writing them out with pass, simply remove them.
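A minimal sketch of that rearrangement, keeping the question's (Python 2) variable names and assuming URLError has been imported from urllib2:

for lendo in arquivo.readlines():
    msmwebsite = website + lendo
    try:
        abrindo = urllib2.urlopen(msmwebsite)
        abrindo2 = abrindo.read()
        # abrindo only exists if urlopen succeeded, so use it inside the try
        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'
    except URLError as e:
        # log the failure instead of silently skipping the link
        print msmwebsite, 'failed:', e
        continue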
Getting the following error:
Traceback (most recent call last):
  File "stack.py", line 31, in ?
    print >> out, "%s" % escape(p)
  File "/usr/lib/python2.4/cgi.py", line 1039, in escape
    s = s.replace("&", "&amp;") # Must be done first!
TypeError: 'NoneType' object is not callable
For the following code:
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

def talk_description(tag):
    return tag.name == "p" and tag.findParent("h3")

links = []
desc = []

for pagenum in xrange(1, 5):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))
    page = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks/arvind_gupta_turning_trash_into_toys_for_learning.html"))
    desc.extend(soup.findAll(talk_description))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th><th>Description</th></tr>"""

for x, a in enumerate(links):
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))

for y, p in enumerate(page):
    print >> out, "<td>%s</td>" % escape(p)

print >>out, "</tr></table>"
I think the issue is with % escape(p). I'm trying to take the contents of that <p> out. Am I not supposed to use escape?
Also having an issue with the line:
page = BeautifulSoup(urllib2.urlopen("%s") % a["href"])
That's what I want to do, but I keep running into errors and am wondering whether there's an alternative way of doing it. I'm just trying to collect the links found in the previous lines and run them through BeautifulSoup again.
You have to investigate (using pdb) why one of your links comes back as None.
In particular: the traceback is self-explanatory. escape() is being called with None, so you have to work out which argument is None... it's one of your items in 'links'. So why is one of your items None?
Likely because one of your calls to
def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")
returns None because tag.findParent("dt", "thumbnail") returns None (due to your given HTML input).
So you have to check or filter your items in 'links' for None (or adjust your parser code above) in order to pickup only existing links according to your needs.
And please read your tracebacks carefully and think about what the problem might be - tracebacks are very helpful and provide you with valuable information about your problem.
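As a sketch of that filtering idea, the write-out loop could drop anchors that are missing the attributes it needs before calling escape() (names taken from the question's code; this is one possible guard, not a full fix):

# keep only anchors that actually carry a title and an href,
# so escape() is never handed None
links = [a for a in links if a.get("title") and a.get("href")]

for x, a in enumerate(links):
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (
        x + 1, escape(a["title"]), escape(a["href"]))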