BeautifulSoup error when loading the parser - python

BS4 was working earlier today, but now it has problems when trying to load a page.
import requests
from bs4 import BeautifulSoup

name = input("")
twitter = requests.get("https://twitter.com/" + name)
#instagram = requests.get("https://instagram.com/" + name)
#website = requests.get("https://" + name + ".com")
twitter_soup = BeautifulSoup(twitter, 'html.parser')
twitter_available = twitter_soup.body.findAll(text="This account doesn't exist")
if twitter_available == True:
    print("Available")
else:
    print("Not Available")
On the line where twitter_soup is declared, I get the following error:
Traceback (most recent call last):
  File "D:\Programming\Python\name-checker.py", line 12, in <module>
    twitter_soup = BeautifulSoup(twitter, 'html.parser')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\bs4\__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
I have also tried the other parsers the docs suggest, but none of them work.

I just figured it out. I had to pass the actual HTML, which in this situation is twitter.text, instead of the Response object itself.
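
For reference, a minimal corrected sketch (note also that findAll() returns a list, so comparing it with == True never matches; testing its truthiness is presumably what the original code intended):

import requests
from bs4 import BeautifulSoup

name = input("")
twitter = requests.get("https://twitter.com/" + name)
# Pass the HTML string (twitter.text), not the Response object itself
twitter_soup = BeautifulSoup(twitter.text, 'html.parser')
# findAll() returns a list; a non-empty list means the phrase was found
twitter_available = twitter_soup.body.findAll(text="This account doesn't exist")
if twitter_available:
    print("Available")
else:
    print("Not Available")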


Python script works locally but not on Ubuntu server

This is my first post on this forum and I hope to explain my problem the right way. I wrote this little web crawler that checks the price of a product on Amazon and, when the price changes, sends me a notification on Telegram.
def check_price():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find(id='priceblock_ourprice').get_text()  # this is the problematic line
    converted_price = title[0:6]
    converted_price = float(converted_price.replace(',', '.'))
    if os.path.exists('data.txt'):
        with open('data.txt', 'r+') as f:
            f_contents = f.read()
            if converted_price != float(f_contents):
                send_msg('The price was updated to: ' + str(converted_price) + '€')
                f.write(str(converted_price))
    else:
        send_msg('The price was updated to: ' + str(converted_price) + '€')
        with open('data.txt', 'w') as f:
            f.write(str(converted_price))
    return
The problem is that it works on my local machine and I get the notification, but when I run the code on the server I get this message:
Traceback (most recent call last):
  File "main.py", line 44, in <module>
    check_price()
  File "main.py", line 16, in check_price
    title = soup.find(id='priceblock_ourprice').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
I only posted the main function that checks the price, not the sending code, because the problem occurs before that. I can't find the error in what I did. I hope you can help me, and thanks.
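
For what it's worth, the traceback means soup.find(id='priceblock_ourprice') returned None, i.e. that element was not in the HTML the server received; Amazon commonly serves different markup (or a captcha page) to requests coming from server IPs. A minimal defensive check, assuming the same id and variables as above:

price_tag = soup.find(id='priceblock_ourprice')
if price_tag is None:
    # Dump the start of the HTML to see what Amazon actually sent the
    # server; a captcha or interstitial page is a common culprit
    print(page.content[:500])
    return
title = price_tag.get_text()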

delete image using msg_id in python?

I have a camera whose pictures can be accessed through the camera's IP address in a web browser.
I can fetch the image links and then download the images to my local system.
After that, I have to delete each image using msg_id, token, and param.
I added the image link for deletion using msg_id.
from time import sleep
import os
import sys
import requests
from bs4 import BeautifulSoup
import piexif
from fractions import Fraction

archive_url = "http://192.168.42.1/SD/AMBA/191123000/"

def get_img_links():
    # create response object
    r = requests.get(archive_url)
    # create beautiful-soup object
    soup = BeautifulSoup(r.content, 'html5lib')
    # find all links on web-page
    links = soup.findAll('a')
    # keep only the links ending with .JPG
    img_links = [archive_url + link['href'] for link in links if link['href'].endswith('JPG')]
    return img_links

def FileDelete():
    FilesToProcess = get_img_links()
    print(FilesToProcess)
    FilesToProcessStr = "\n".join(FilesToProcess)
    for FileTP in FilesToProcess:
        tosend = '{"msg_id":1281,"token":%s,"param":"%s"}' % (token, FileTP)
        print("Delete successfully")
Getting this error:
NameError: name 'token' is not defined
runfile('D:/EdallSystem/socket_pro/pic/hy/support.py', wdir='D:/EdallSystem/socket_pro/pic/hy')
['http://192.168.42.1/SD/AMBA/191123000/13063800.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13064200.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13064600.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13065000.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13065400.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13065800.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13072700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13073100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13073500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13073900.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13074300.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13074700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13075100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13075500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13075900.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13080300.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13080700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13081100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13081500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13081900.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13082300.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13082700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13083100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13083500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13083900.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13084300.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13084700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13085100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13085500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13085900.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13090300.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13090700.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13091100.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13091500.JPG', 'http://192.168.42.1/SD/AMBA/191123000/13091900.JPG']
Traceback (most recent call last):
  File "D:\EdallSystem\socket_pro\pic\hy\support.py", line 82, in <module>
    FileDelete()
  File "D:\EdallSystem\socket_pro\pic\hy\support.py", line 74, in FileDelete
    tosend = '{"msg_id":1281,"token":%s,"param":"%s"}' % (FileTP)
TypeError: not enough arguments for format string
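
Both errors point at the same line: token is read before it is ever assigned (the NameError), and removing it from the format call leaves two placeholders with only one argument (the TypeError). A sketch of a fix, assuming a placeholder token since the post never shows how the camera issues one; json.dumps builds the payload and handles the quoting:

import json

token = 1  # placeholder: the real session token must come from the camera's API

for FileTP in FilesToProcess:
    # Build the JSON payload safely instead of hand-formatting the string
    tosend = json.dumps({"msg_id": 1281, "token": token, "param": FileTP})
    print(tosend)  # the payload still has to be sent to the camera before printing success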

NameError: name 'url_data' is not defined

I am trying to use the code below to search for a keyword in a given URL (an internal website at work) and I keep getting the error. It works fine on public sites.
from html.parser import HTMLParser
import urllib.request

class CustomHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_flag = False
        self.tag_line_num = 0
        self.tag_string = 'temporary_tag'

    def initiate_vars(self, tag_string):
        self.tag_string = tag_string

    def handle_starttag(self, tag, attrs):
        #if tag == 'tag_to_search_for':
        if tag == self.tag_string:
            self.tag_flag = True
            self.tag_line_num = self.getpos()

if __name__ == '__main__':
    #simple_str = 'string_to_search_for'
    simple_str = 'Host Status'
    my_url = 'TEST_URL'
    parser_obj = CustomHTMLParser()
    #parser_obj.initiate_vars('tag_to_search_for')
    parser_obj.initiate_vars('script')
    #html_file = open('location_of_html_file//file.html')
    my_request = urllib.request.Request(my_url)
    try:
        url_data = urllib.request.urlopen(my_request)
    except:
        print("There was some error opening the URL")
    html_str = url_data.read().decode('utf8')
    #html_str = html_file.read()
    #print(html_str)
    html_search_result = html_str.lower().find(simple_str.lower())
    if html_search_result != -1:
        print('The word {} was found'.format(simple_str))
    else:
        print('The word {} was not found'.format(simple_str))
    parser_obj.feed(html_str)
    if parser_obj.tag_flag:
        print('Tag {0} was found at position {1}'.format(parser_obj.tag_string, parser_obj.tag_line_num))
    else:
        print('Tag {} was not found'.format(parser_obj.tag_string))
but I keep getting the error
There was some error opening the URL
Traceback (most recent call last):
  File "C:\TEMP\parse.py", line 40, in <module>
    html_str = url_data.read().decode('utf8')
NameError: name 'url_data' is not defined
I believe I already tried urllib2; I'm using Python 3.7.
Not sure what to do. Is it worth trying a user agent?
EDIT1: I have now tried the below
>>> import urllib
>>> url = urllib.request.urlopen('https://concernedURL.com')
and I am getting this error "urllib.error.HTTPError: HTTP Error 401: Unauthorized". Should I be using the headers I have from my browser as well as SSL certs?
The problem is that you get an error in the try-block, and that leaves the url_data variable undefined:
try:
    # if this errors, no url_data will exist
    url_data = urllib.request.urlopen(my_request)
except:
    # really bad to catch all exceptions!
    print("There was some error opening the URL")
html_str = url_data.read().decode('utf8')
You should probably just remove the try-except, or handle the error better. It's almost never advisable to use a bare except without a specific error, since it can create all kinds of problems.
In this case your program should probably just stop running if you cannot open the requested URL, since it doesn't make sense to operate on the URL's data if opening it failed in the first place.
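
A minimal sketch of that advice, using the placeholder URL from the question's EDIT1; the 401 would surface here as an HTTPError, which is where credentials (for example an Authorization header) would have to be supplied:

import sys
import urllib.error
import urllib.request

my_request = urllib.request.Request('https://concernedURL.com')  # placeholder URL

try:
    url_data = urllib.request.urlopen(my_request)
except urllib.error.HTTPError as err:
    # Stop instead of continuing with an undefined url_data; a 401 here
    # means the server expects credentials
    sys.exit("There was an error opening the URL: %s" % err)

html_str = url_data.read().decode('utf8')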

python's beautiful soup module giving error

I am using the following code in an attempt to do web scraping.
import sys, os
import requests, webbrowser, bs4
from PIL import Image
import pyautogui

p = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
n = open("exml.txt", 'wb')
for i in p.iter_content(1000):
    n.write(i)
n.close()
n = open("exml.txt", 'r')
soupy = bs4.BeautifulSoup(n, "html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)
What I intend to do is extract all the image links from the HTML response obtained from the page.
(Please correct me if I am wrong in thinking that requests.get returns the whole static HTML file of the webpage that opens on entering the URL.)
However, on the line:
soupy = bs4.BeautifulSoup(n, "html.parser")
I am getting the following error:
Traceback (most recent call last):
  File "../../perl/webscratcher.txt", line 24, in <module>
    soupy= bs4.BeautifulSoup(n,"html.parser")
  File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
    markup = markup.read()
  File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 24662: character maps to <undefined>
I am clueless about the error, and the "AppData" folder is empty.
How do I proceed further?
After trying the suggestions:
I changed the file's extension to .py and that error went away. However, on the following line:
soupy = bs4.BeautifulSoup(n, "lxml")
I am getting the following error:
Traceback (most recent call last):
  File "C:\perl\webscratcher.py", line 23, in <module>
    soupy= bs4.BeautifulSoup(p,"lxml")
  File "C:\Users\PREMRAJ\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 192, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
How do I tackle this?
You are over-complicating things. Pass the bytes content of the Response object directly into the BeautifulSoup constructor instead of writing it to a file first.
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
soup = BeautifulSoup(response.content, 'lxml')
for element in soup.select('img[src]'):
    print(element)
Okay, so you might want to review how to work with BeautifulSoup. I referenced an old project of mine, and this is all you need to print them. Check the BeautifulSoup documentation for the exact syntax you want with the select method.
This will print all the img tags from the HTML:
import requests, bs4

site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
p = requests.get(site).text
soupy = bs4.BeautifulSoup(p, "html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)

BeautifulSoup Error (CGI Escape)

Getting the following error:
Traceback (most recent call last):
  File "stack.py", line 31, in ?
    print >> out, "%s" % escape(p)
  File "/usr/lib/python2.4/cgi.py", line 1039, in escape
    s = s.replace("&", "&amp;") # Must be done first!
TypeError: 'NoneType' object is not callable
For the following code:
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

def talk_description(tag):
    return tag.name == "p" and tag.findParent("h3")

links = []
desc = []
for pagenum in xrange(1, 5):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))
    page = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks/arvind_gupta_turning_trash_into_toys_for_learning.html"))
    desc.extend(soup.findAll(talk_description))

out = open("test.html", "w")
print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th><th>Description</th></tr>"""
for x, a in enumerate(links):
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))
    for y, p in enumerate(page):
        print >> out, "<td>%s</td>" % escape(p)
print >>out, "</tr></table>"
I think the issue is with % escape(p). I'm trying to pull the contents of that <p> out. Am I not supposed to use escape?
I'm also having an issue with the line:
page = BeautifulSoup(urllib2.urlopen("%s") % a["href"])
That's what I want to do, but it keeps raising errors, and I'm wondering if there's an alternate way of doing it. I'm just trying to collect the links found by the previous lines and run them through BeautifulSoup again.
You have to investigate (using pdb) why one of your links is returned as a None instance.
In particular, the traceback speaks for itself: escape() is being called with None. So you have to investigate which argument is None; it's one of your items in links. So why is one of your items None?
Likely because one of your calls to

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

returns None because tag.findParent("dt", "thumbnail") returns None (due to your given HTML input).
So you have to check or filter the items in links for None (or adjust your parser code above) so that you only pick up links that actually exist.
And please read your tracebacks carefully and think about what the problem might be: tracebacks are very helpful and give you valuable information about the problem.
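
A small sketch of that filtering, together with the formatting fix for the urlopen line from the question (the % has to be applied to the URL string before it is passed to urlopen); both assume the surrounding Python 2 code from the post:

# Keep only the items that are real anchor tags (drop any None results)
links = [a for a in links if a is not None]

for a in links:
    # Format the URL first, then open it; applying % to the result of
    # urlopen() is what the question's attempt got wrong
    page = BeautifulSoup(urllib2.urlopen("http://www.ted.com%s" % a["href"]))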
