BeautifulSoup KeyError Issue - python

I know that KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. Now that that's out of the way, I still don't have a clue what's going on with these KeyErrors.
Here's the program I'm trying to run which constantly and consistently results in a KeyError on the last element of the URLs list.
I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work; doing this in C++ would be an unimaginable nightmare!
The idea is to return a list of all URLs in a website that contain on their pages links to a certain URL.
Here's what I got so far:
import urllib
from BeautifulSoup import BeautifulSoup

URLs = []
Locations = []
URLs.append("http://www.tuftsalumni.org")

def print_links (link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print (link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or
                    "tuftsalumni" in item['href']):
                URLs.append(item['href'])
                length = len(URLs)
                if (item['href'] == "SITE_ON_PAGE"):
                    if (check_list(link, Locations) == "no"):
                        Locations.append(link)

def check_list (link, array):
    for x in range (0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"

print_links(URLs[0])
for x in range (0, (len(URLs))):
    print_links(URLs[x])
The error I get is on the next to last element of URLs:
File "scraper.py", line 35, in <module>
print_links(URLs[x])
File "scraper.py", line 16, in print_links
if (item['href'].startswith('/') or
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/BeautifulSoup.py", line 613, in __getitem__
return self._getAttrMap()[key]
KeyError: 'href'
Now I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.
Thank you, if I can clarify this at all please do let me know.

If you just want to handle the error, you can catch the exception:
for item in soup.fetch('a'):
    try:
        if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
            (...)
    except KeyError:
        pass  # or some other fallback action
You can specify a default using item.get('key','default'), but I don't think that's what you need in this case.
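For completeness, the .get() variant of the loop would look roughly like this (a sketch based on the code above; it simply substitutes an empty string when an anchor has no href):

for item in soup.fetch('a'):
    href = item.get('href', '')  # '' when the <a> tag has no href attribute
    if (href.startswith('/') or "tuftsalumni" in href):
        URLs.append(href)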
Edit: If everything else fails, this is a barebones version that should be a reasonable starting point:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
from BeautifulSoup import BeautifulSoup

links = ["http://www.tuftsalumni.org"]

def print_hrefs(link):
    htmlSource = urllib.urlopen(link).read()
    soup = BeautifulSoup(htmlSource)
    for item in soup.fetch('a'):
        print item['href']

for link in links:
    print_hrefs(link)
Also, check_list(item, l) can be replaced by item in l.
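For instance, the check_list call inside print_links could simply become:

if link not in Locations:
    Locations.append(link)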

Related

Building XPath Query from Easylist.txt to count number of ads on webpage

So, I'm trying to write a script to gather the number of ads on a webpage. I'm basing this script on the following answer; however, I keep getting the following error:
File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: unknown error
This is the script:
import lxml.etree
import lxml.html
import requests
import cssselect

translator = cssselect.HTMLTranslator()
rules = []
rules_file = "easylist.txt"
with open(rules_file, 'r', encoding="UTF-8") as f:
    for line in f:
        # elemhide rules are prefixed by ## in the adblock filter syntax
        if line[:3] == '##.':
            try:
                rules.append(translator.css_to_xpath(line[2:], prefix=""))
            except cssselect.SelectorError:
                # just skip bad selectors
                pass

query = "|".join(rules)
url = 'http://google.com'  # replace it with a url you want to apply the rules to
html = requests.get(url).text
document = lxml.html.document_fromstring(html)
print(len(document.xpath(query)))
Any ideas on how to fix this error or potential alternative solutions to count the number of ads on a webpage would be appreciated. This is my first time working with lxml so I'm not sure what's likely to be causing the issue in the query. For your reference, the EasyList I'm using is linked here
I'm pretty sure that this is an issue with the query that's being built from EasyList as the code works when I hardcode a simple xpath query.
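One way to narrow down which selector is at fault is to evaluate each generated rule on its own and print the ones lxml rejects; a rough debugging sketch:

for rule in rules:
    try:
        document.xpath(rule)
    except lxml.etree.XPathEvalError:
        print("offending rule:", rule)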
Appreciate this is an old post so answering for the benefit of anyone else who needs a similar function. This worked for me to count the ads:
import lxml.etree
import lxml.html
import requests
import cssselect

def count_ads(url):
    print("Counting ads")
    translator = cssselect.HTMLTranslator()
    rules_file = f"\\easylist.txt"
    html = requests.get(url).text
    count = 0
    with open(rules_file, 'r', encoding="UTF-8") as f:
        for line in f:
            if line[:2] == '##':  # elemhide rules are prefixed by ## in the adblock filter syntax
                try:
                    rule = translator.css_to_xpath(line[2:])
                    document = lxml.html.document_fromstring(html)
                    result = len(document.xpath(rule))
                    if result > 0:
                        count = count + result
                except cssselect.SelectorError:
                    pass  # skip bad selectors
    return count
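A minimal usage sketch (assuming easylist.txt really is at the hard-coded path above, and using an arbitrary example URL):

if __name__ == "__main__":
    print(count_ads("https://www.nytimes.com"))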

TypeError on matching pattern with module "re"

I'm trying to extract the price of an item in my program by parsing the HTML with the help of the "bs4" BeautifulSoup library.
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"},text= not None)
pattern_1 = re.compile("/d+./d+").findall(element).text.strip()
print(pattern_1)
print(element)
and here is what I get as output:
Traceback (most recent call last):
File "/root/Desktop/Visual_Studio_Files/Python_sample.py", line 9, in <module>
pattern_1 = (re.compile("/d+./d+").findall(str_ele)).text.strip()
TypeError: expected string or bytes-like object
re.findall freaks out because your element variable has the type bs4.element.Tag.
You can find this out by adding print(type(element)) in your script.
Based on some quick poking around, I think you can extract the string you need from the tag using the contents attribute (which is a list) and taking the first member of this list (index 0).
Moreover, re.findall also returns a list, so instead of .text you need to use [0] to access its first member. Thus you will once again have a string which supports the .strip() method!
Last but not least, it seems you may have mis-typed your slashes and meant to use \ instead of /.
Here's a working version of your code:
pattern_1 = re.findall("\d+.\d+", element.contents[0])[0].strip()
This is definitely not pretty or very pythonic, but it will get the job done.
Note that I dropped the call to re.compile because that gets run in the background when you call re.findall anyway.
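If you do want a precompiled pattern, the equivalent would be along these lines (a sketch, with the dot escaped so it only matches a literal decimal point):

price_re = re.compile(r"\d+\.\d+")
pattern_1 = price_re.findall(element.contents[0])[0].strip()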
here is what it finally looks like :)
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"}).text.strip()
# pattern_1 = re.compile("/d+./d+").findall(element)
# print (pattern_1)
print (element)
and this is the output :)
146.00
thank you every one :)

TypeError Using regex and beautifulsoup

I'm working on some code that reads an HTML page, parses it with BeautifulSoup, and then uses a regex to find some numbers (part of an assignment).
In an earlier assignment I used socket instead of urllib, and I know the error comes from data types (it expects a string or bytes), but further down the line I'm missing what I need to encode/decode to process the data. The error occurs at my re.findall call.
Besides a fix, what is causing the issue, and perhaps more importantly, what are the data type differences? I seem to be missing something that should feel inherent.
Thanks ahead.
#Py3 urllib is urllib.request
import urllib.request
import re  # needed for re.findall below
#BeautifulSoup stuff is in bs4 in Py3
from bs4 import *

#raw_input is now input in Py3
#url = 'http://' + input('Enter - ')
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
html = url.read()
#html.parser is the default parser. Useful most of the time (according to the web)
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the tags specified
tags = soup('span')
for tag in tags:
    print(re.findall('[0-9]+', tag))
So, I've been caught off guard with this before: BeautifulSoup returns objects, which just appear to be strings when you call print.
Just as a sanity check, try this:
import urllib.request
from bs4 import *
url = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_42.html')
soup = BeautifulSoup(url.read(), 'html.parser')
single_tag = soup('span')[0]
print("Type is: \"%s\"; prints as \"%s\"" % (type(single_tag), single_tag))
print("As a string: \"%s\"; prints as \"%s\"" % (type(str(single_tag)), str(single_tag)))
The following should be output:
Type is: "< class 'bs4.element.Tag' >"; prints as "< span
class="comments" >97< /span >"
As a string: "< class 'str' >"; prints as "< span class="comments" >97< /span >"
So, if you encapsulate "tag" in a str() call before sending it to the regex, that problem should be taken care of.
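Applied to the loop from the question, that fix would look like this:

for tag in tags:
    print(re.findall('[0-9]+', str(tag)))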
I've always found that adding sanity print(type(var)) checks when things start to complain about unexpected variable types to be a useful debugging technique!

How to retrieve google URL from search query

So I'm trying to create a Python script that will take a search term or query, then search google for that term. It should then return 5 URLs from the results of that search.
I spent many hours trying to get PyGoogle to work. But later found out Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the google search results
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
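If you need several results rather than just the first, the same idea extends to all the cite elements inside the results container (a sketch; Google's markup changes frequently, so treat these selectors as assumptions):

for cite in container.find_all("cite")[:5]:
    print(cite.text)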
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious though--so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:
import google
for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Alternatively, the xgoogle library can do the same.
I tried something similar to get the top 10 links, counting how often the target word appears on each linked page. I have added the code snippet for your reference:
import operator
import urllib
# This line will import the GoogleSearch and SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
# read user input
yourword = raw_input()

try:
    # This will perform a google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    # get the google search results
    results = gs.get_results()
    source = ''
    # loop through all results to get each link and its content
    for res in results:
        # print res.url.encode('utf8')
        # this will give the url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        # the line above opens the url; the line below reads the content of that web page
        source = myurl.read()
        # This line will count occurrences of the entered keyword in our webpage
        count = source.count(yourword)
        # We store our result in a dictionary: for each url, the number of times the word occurs
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
# sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])

BeautifulSoup Error (CGI Escape)

Getting the following error:
Traceback (most recent call last):
  File "stack.py", line 31, in ?
    print >> out, "%s" % escape(p)
  File "/usr/lib/python2.4/cgi.py", line 1039, in escape
    s = s.replace("&", "&amp;")  # Must be done first!
TypeError: 'NoneType' object is not callable
For the following code:
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

def talk_description(tag):
    return tag.name == "p" and tag.findParent("h3")

links = []
desc = []
for pagenum in xrange(1, 5):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))
    page = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks/arvind_gupta_turning_trash_into_toys_for_learning.html"))
    desc.extend(soup.findAll(talk_description))

out = open("test.html", "w")
print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th><th>Description</th></tr>"""

for x, a in enumerate(links):
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))
    for y, p in enumerate(page):
        print >> out, "<td>%s</td>" % escape(p)

print >>out, "</tr></table>"
I think the issue is with % escape(p). I'm trying to take the contents of that <p> out. Am I not supposed to use escape?
Also having an issue with the line:
page = BeautifulSoup(urllib2.urlopen("%s") % a["href"])
That's what I want to do, but again running into errors and wondering if there's an alternate way of doing it. Just trying to collect the links I found from previous lines and run it through BeautifulSoup again.
You have to investigate (using pdb) why one of your links is returned as a None instance.
In particular, the traceback is self-explanatory: escape() is being called with None. So you have to investigate which argument is None... it's one of your items in 'links'. So why is one of your items None?
Likely because one of your calls to
def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")
returns None because tag.findParent("dt", "thumbnail") returns None (due to your given HTML input).
So you have to check or filter your items in 'links' for None (or adjust your parser code above) in order to pick up only existing links according to your needs.
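A sketch of that kind of filtering, assuming you only want anchors that actually carry the title and href attributes used later:

links = [a for a in links if a is not None and a.get("title") and a.get("href")]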
And please read your tracebacks carefully and think about what the problem might be - tracebacks are very helpful and provide you with valuable information about your problem.
