BeautifulSoup Error (CGI Escape) - python

Getting the following error:
Traceback (most recent call last):
File "stack.py", line 31, in ?
print >> out, "%s" % escape(p) File
"/usr/lib/python2.4/cgi.py", line
1039, in escape
s = s.replace("&", "&") # Must be done first! TypeError: 'NoneType'
object is not callable
For the following code:
import urllib2
from cgi import escape # Important!
from BeautifulSoup import BeautifulSoup
def is_talk_anchor(tag):
return tag.name == "a" and tag.findParent("dt", "thumbnail")
def talk_description(tag):
return tag.name == "p" and tag.findParent("h3")
links = []
desc = []
for pagenum in xrange(1, 5):
soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
links.extend(soup.findAll(is_talk_anchor))
page = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks/arvind_gupta_turning_trash_into_toys_for_learning.html"))
desc.extend(soup.findAll(talk_description))
out = open("test.html", "w")
print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th><th>Description</th></tr>"""
for x, a in enumerate(links):
print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))
for y, p in enumerate(page):
print >> out, "<td>%s</td>" % escape(p)
print >>out, "</tr></table>"
I think the issue is with % escape(p). I'm trying to take the contents of that <p> out. Am I not supposed to use escape?
Also having an issue with the line:
page = BeautifulSoup(urllib2.urlopen("%s") % a["href"])
That's what I want to do, but again running into errors and wondering if there's an alternate way of doing it. Just trying to collect the links I found from previous lines and run it through BeautifulSoup again.

You have to investigate (using pdb) why one of your links is returned as None instance.
In particular: the traceback is self-speaking. The escape() is called with None. So you have to investigate which argument is None...it's one of of your items in 'links'. So why is one of your items None?
Likely because one of your calls to
def is_talk_anchor(tag):
return tag.name == "a" and tag.findParent("dt", "thumbnail")
returns None because tag.findParent("dt", "thumbnail") returns None (due to your given HTML input).
So you have to check or filter your items in 'links' for None (or adjust your parser code above) in order to pickup only existing links according to your needs.
And please read your tracebacks carefully and think about what the problem might be - tracebacks are very helpful and provide you with valuable information about your problem.

Related

Can't print only text using Beautiful soup

I am struggling creating one of my first projects on python3. When I use the following code:
def scrape_offers():
r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc", cookies=all_cookies)
soup = BeautifulSoup(r.text,"html.parser")
offers = soup.find_all("div",{'class':'offer-wrapper'})
for offer in offers:
offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
print(offer_name.text.strip())
I get the following error:
Traceback (most recent call last):
File "scrape_products.py", line 45, in <module>
scrape_offers()
File "scrape_products.py", line 40, in scrape_offers
print(offer_name.text.strip())
File "/usr/local/lib/python3.7/site-packages/bs4/element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I've read many similar cases on StackOverFlow but I still can't help myself. If someone have any ideas, please help :)
P.S.: If i run the code without .text it show the entire <a class=...> ... </a>
findchildren returns a list. Sometimes you get an empty list, sometimes you get a list with one element.
You should add an if statement to check if the length of the returned list is greater than 1, then print the text.
import requests
from bs4 import BeautifulSoup
def scrape_offers():
r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
soup = BeautifulSoup(r.text,"html.parser")
offers = soup.find_all("div",{'class':'offer-wrapper'})
for offer in offers:
offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
if (len(offer_name) >= 1):
print(offer_name[0].text.strip())
scrape_offers()

AttributeError: 'NoneType' object has no attribute 'findAll' in a web scraper

I am making a program for web scraping but this is my first time. The tutorial that I am using is built for python 2.7, but I am using 3.8.2. I have mostly edited my code to fit it to python 3, but one error pops up and I can't fix it.
import requests
import csv
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(features="html.parser")
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
for row in results_table.findAll('tr'):
output_rows = []
for cell in tr.findAll('td'):
output_rows.append(cell.text.replace(' ', ''))
output.append(output_rows)
print(output)
handle = open('out-using-requests.csv', 'a')
outfile = csv.writer(handle)
outfile.writerows(output)
The error I get is:
Traceback (most recent call last):
File "C:\Code\scrape.py", line 17, in <module>
for row in results_table.findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
The tutorial I am using is https://first-web-scraper.readthedocs.io/en/latest/
I tried some other questions, but they didn't help.
Please help!!!
Edit: Never mind, I got a good answer.
find returns None if it doesn't find a match. You need to check for that before attempting to find any sub elements in it:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
if results_table:
for row in results_table.findAll('tr'):
output_rows = []
for cell in tr.findAll('td'):
output_rows.append(cell.text.replace(' ', ''))
output.append(output_rows)
The error allows the following conclusion:
results_table = None
Therefore, you cannot access the findAll() method because None.findAll() does not exist.
You should take a look, it is best to use a debugger to run through your program and see how the variables change line by line and why the mentioned line only returns ```None''. Especially important is the line:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
Because in this row results_table is initialized yes, so here the above none'' value is returned andresults_table'' is assigned.

TypeError on matching pattern with module "re"

I'm trying to extract the price of the item from my programme by parsing the HTML with help of "bs4" BeautifulSoup library
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"},text= not None)
pattern_1 = re.compile("/d+./d+").findall(element).text.strip()
print(pattern_1)
print(element)
and here is what I get as output :
Traceback (most recent call last):
File "/root/Desktop/Visual_Studio_Files/Python_sample.py", line 9, in <module>
pattern_1 = (re.compile("/d+./d+").findall(str_ele)).text.strip()
TypeError: expected string or bytes-like object
re.findall freaks out because your element variable has the type bs4.element.Tag.
You can find this out by adding print(type(element)) in your script.
Based on some quick poking around, I think you can extract the string you need from the tag using the contents attribute (which is a list) and taking the first member of this list (index 0).
Moreover, re.findall also returns a list, so instead of .text you need to use [0] to access its first member. Thus you will once again have a string which supports the .strip() method!
Last but not least, it seems you may have mis-typed your slashes and meant to use \ instead of /.
Here's a working version of your code:
pattern_1 = re.findall("\d+.\d+", element.contents[0])[0].strip()
This is definitely not pretty or very pythonic, but it will get the job done.
Note that I dropped the call to re.compile because that gets run in the background when you call re.findall anyway.
here is what it finally look like :)
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"}).text.strip()
# pattern_1 = re.compile("/d+./d+").findall(element)
# print (pattern_1)
print (element)
and this is the output :)
146.00
thank you every one :)

AttributeError: 'NoneType' object has no attribute 'lower' and warning

I was trying to make a simple program to extract words from paragraphs in a web page.
my code looks like this -
import requests
from bs4 import BeautifulSoup
import operator
def start(url):
word_list = []
source_code = requests.get(url).text
soup = BeautifulSoup(source_code)
for post_text in soup.find_all('p'):
cont = post_text.string
words = cont.lower().split()
for each_word in words:
print(each_word)
word_list.append(each_word)
start('https://lifehacker.com/why-finding-your-passion-isnt-enough-1826996673')
First I am getting this warning -
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 17 of the file D:/Projects/Crawler/Main.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html.parser")
markup_type=markup_type))
and then there is this error in the end:
Traceback (most recent call last):
File "D:/Projects/Crawler/Main.py", line 17, in <module>
start('https://lifehacker.com/why-finding-your-passion-isnt-enough-1826996673')
File "D:/Projects/Crawler/Main.py", line 11, in start
words = cont.lower().split()
AttributeError: 'NoneType' object has no attribute 'lower'
I have tried searching, but not able to resolve or understand the problem.
You are parsing that page using the paragraph tag <p>, but that tag does not always have textual content associated to it. For instance, if you were to instead run:
def start(url):
word_list = []
source_code = requests.get(url).text
soup = BeautifulSoup(source_code)
for post_text in soup.find_all('p'):
print(post_text)
You would see that you're getting hits off of things like advertisements: <p class="ad-label=bottom"></p>. As others have stated in the comment, None type does not have string methods, which is literally what your error is referring to.
A simple way to guard against this would be to wrap a section of your function in a try/except block:
for post_text in soup.find_all('p'):
try:
cont = post_text.string
words = cont.lower().split()
for each_word in words:
print(each_word)
word_list.append(each_word)
except AttributeError:
pass

BeautifulSoup KeyError Issue

I know that KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. Now that that's aside, I still haven't a clue what's going on with KeyErrors.
Here's the program I'm trying to run which constantly and consistently results in a KeyError on the last element of the URLs list.
I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work, doing this in C++ would be an imaginable nightmare!
The idea is to return a list of all URLs in a website that contain on their pages links to a certain URL.
Here's what I got so far:
import urllib
from BeautifulSoup import BeautifulSoup
URLs = []
Locations = []
URLs.append("http://www.tuftsalumni.org")
def print_links (link):
if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
if (link.startswith('/')):
link = "STARTING_WEBSITE" + link
print (link)
htmlSource = urllib.urlopen(link).read(200000)
soup = BeautifulSoup(htmlSource)
for item in soup.fetch('a'):
if (item['href'].startswith('/') or
"tuftsalumni" in item['href']):
URLs.append(item['href'])
length = len(URLs)
if (item['href'] == "SITE_ON_PAGE"):
if (check_list(link, Locations) == "no"):
Locations.append(link)
def check_list (link, array):
for x in range (0, len(array)):
if (link == array[x]):
return "yes"
return "no"
print_links(URLs[0])
for x in range (0, (len(URLs))):
print_links(URLs[x])
The error I get is on the next to last element of URLs:
File "scraper.py", line 35, in <module>
print_links(URLs[x])
File "scraper.py", line 16, in print_links
if (item['href'].startswith('/') or
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/BeautifulSoup.py", line 613, in __getitem__
return self._getAttrMap()[key]
KeyError: 'href'
Now I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.
Thank you, if I can clarify this at all please do let me know.
If you just want to handle the error, you can catch the exception:
for item in soup.fetch('a'):
try:
if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
(...)
except KeyError:
pass # or some other fallback action
You can specify a default using item.get('key','default'), but I don't think that's what you need in this case.
Edit: If everything else fails, this is a barebones version that should be a reasonable starting point:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
from BeautifulSoup import BeautifulSoup
links = ["http://www.tuftsalumni.org"]
def print_hrefs(link):
htmlSource = urllib.urlopen(link).read()
soup = BeautifulSoup(htmlSource)
for item in soup.fetch('a'):
print item['href']
for link in links:
print_hrefs(link)
Also, check_list(item, l) can be replaced by item in l.

Categories

Resources