SoupStrainer with encoding - python

I wrote the line below:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
The data is obtained with urllib.urlopen(XXX).read() in Python 2.7.
It works well when XXX is a page consisting entirely of English characters, such as http://python.org. But when the page contains some Chinese characters, it fails: there is a KeyError, and a plain [x for ...] returns an empty list.
What's more, if parseOnlyThese=SoupStrainer('a') is omitted, both pages work fine.
Is this a bug in SoupStrainer?
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
data = urllib.urlopen('http://tudou.com').read()
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
gives the traceback:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
File "F:\ActivePython27\lib\site-packages\beautifulsoup-3.2.1-py2.7.egg\BeautifulSoup‌​.py", line 613, in __getitem__
return self._getAttrMap()[key]
KeyError: 'href'

There are <a> links on that page that do not have an href attribute. Use the following instead:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a')) if x.has_key('href')]
For example, it is perfectly normal to declare a link target with <a name="something" />; your code selects those tags too, but they have no href attribute, so the lookup fails.
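If I remember the BS3 API correctly, SoupStrainer accepts the same attribute filters as findAll, so the check can also be pushed into the strainer itself (a sketch, untested against that particular page):
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib

data = urllib.urlopen('http://tudou.com').read()
# Only parse <a> tags that actually carry an href attribute,
# so the comprehension never hits a KeyError.
links_only = SoupStrainer('a', href=True)
hrefs = [x['href'] for x in BeautifulSoup(data, parseOnlyThese=links_only)]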

Related

Bs4 and requests Python

I was trying to make a Pokédex (https://replit.com/#Yoplayer1py/Gui-Pokedex) and I wanted to get the Pokémon's description from https://www.pokemon.com/us/pokedex/{__pokename__}, where pokename is the name of the Pokémon, for example: https://www.pokemon.com/us/pokedex/unown
There is a p tag that contains the description, and the p tag's class is version-xactive.
When I print the description I get nothing, or sometimes I get None.
Here's the code:
import requests
from bs4 import BeautifulSoup
# Assign URL
url = "https://www.pokemon.com/us/pokedex/"+text_id_name.get(1.0, "end-1c")
# Fetch raw HTML content
html_content = requests.get(url).text
# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(html_content, "html.parser")
# get the first occurrence of the description tag and print its text
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
The text_id_name.get(1.0, "end-1c") comes from a tkinter text input.
It shows:
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python3.8/tkinter/__init__.py", line 1883, in __call__
return self.func(*args)
File "main.py", line 57, in load_pokemon
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
AttributeError: 'NoneType' object has no attribute 'text'
Thanks in advance!
It looks like the description element actually has two classes, version-x and active (at least for Unown). That is why soup.find('p', attrs={'class': 'version-xactive'}) does not find the element and returns None (hence the error you are getting).
Adding a space fixes your problem: print(soup.find('p', attrs={'class': 'version-x active'}).text). Just to note: if there are multiple p elements with the same classes, find might not return the one you want.
Adding a None check will also prevent the error from occurring:
description = soup.find('p', attrs={'class': 'version-x active'})
if description:
    print(description.text)
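As an aside, a CSS selector can be more robust here, since it matches each class independently of their order inside the class attribute (a short sketch using bs4's select_one):
# p.version-x.active matches a <p> carrying both classes, in any order
description = soup.select_one('p.version-x.active')
if description is not None:
    print(description.text)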
You should probably separate out your calls so you can do a safety check and a type check.
Replace
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
with
tags = soup.find('p', attrs={'class': 'version-xactive'})
print("Tags:", tags)
if tags is not None:
    print(tags.text)
That should give you more information at least. It might still break on tags.text; if it does, post the output of print("Tags:", tags) so we can see what the data looks like.

Getting href using beautiful soup with different methods

I'm trying to scrape a website. I learned to scrape from two resources: one used tag.get('href') to get the href from an a tag, and one used tag['href'] to get the same. As far as I understand it, they both do the same thing. But when I tried this code:
link_list = [l.get('href') for l in soup.find_all('a')]
it worked with the .get method, but not with dictionary-style access.
link_list = [l['href'] for l in soup.find_all('a')]
This throws a KeyError. I'm very new to scraping, so please pardon if this is a silly one.
Edit: both methods worked when I used find instead of find_all.
You can let BeautifulSoup find only the links that actually have an href attribute.
You can do it in two common ways, via find_all():
link_list = [a['href'] for a in soup.find_all('a', href=True)]
Or, with a CSS selector:
link_list = [a['href'] for a in soup.select('a[href]')]
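If you would rather keep every anchor and tolerate a missing attribute, .get can fall back to a default instead of raising (the '#' placeholder is my own choice):
# None-safe lookup: anchors without an href yield '#' instead of a KeyError
link_list = [a.get('href', '#') for a in soup.find_all('a')]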
Maybe the HTML string does not have an href at all?
For example:
from bs4 import BeautifulSoup
doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')
Nothing happens (get simply returns None), but
ahref['href']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'href'

Strange BeautifulSoup soup.findAll error: not working within functions

I'm trying to build a pretty simple scraper to harvest links as part of a crawler project. I've set up the following function to do the scraping:
import requests as rq
from bs4 import BeautifulSoup
def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    text = response.text
    soup = BeautifulSoup(text)
    for a in soup.findAll('a'):
        homepageLinks.append(a['href'])
    return homepageLinks
I saved this file as "scraper2.py". When I try to run the code, I get the following error:
>>> import scraper2 as sc
>>> sc.getHomepageLinks('http://washingtonpost.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "scraper2.py", line 9, in getHomepageLinks
for a in soup.findAll('a'):
TypeError: 'NoneType' object is not callable
Now for the odd part: if I debug by running the same steps interactively and printing the links, it works fine:
>>> response = rq.get('http://washingtonpost.com')
>>> text = response.text
>>> soup = BeautifulSoup(text)
>>> for a in soup.findAll('a'):
...     print(a['href'])
...
https://www.washingtonpost.com
#
#
http://www.washingtonpost.com/politics/
https://www.washingtonpost.com/opinions/
http://www.washingtonpost.com/sports/
http://www.washingtonpost.com/local/
http://www.washingtonpost.com/national/
http://www.washingtonpost.com/world/
...
If I'm reading the error messages correctly, the problem is occurring with soup.findAll, but only when the findAll is part of a function. I'm sure I'm spelling it correctly (not findall or Findall, as many of the errors on here are), and I've tried a fix using lxml suggested on a previous post that didn't fix it. Does anyone have any ideas?
Try replacing your for-loop with the following:
for a in soup.findAll('a'):
    url = a.get("href")
    if url is not None:
        homepageLinks.append(url)
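For completeness, here is the whole function with that change folded in and a parser named explicitly (a sketch; "html.parser" is my choice, the original code let bs4 pick one):
import requests as rq
from bs4 import BeautifulSoup

def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    # Naming the parser avoids bs4 silently choosing different
    # parsers depending on what happens to be installed.
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.findAll('a'):
        url = a.get("href")
        if url is not None:
            homepageLinks.append(url)
    return homepageLinks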

beautiful soup: re-parse a returned set of table rows

I'm trying to parse a second set of data. I make a GET request to the Gigya status page and parse out the important part with Beautiful Soup. Then I take the returned HTML and try to parse it with Beautiful Soup as well, but I get a markup error, even though the returned content is a string too, so I'm not sure why.
error
Traceback (most recent call last):
File "C:\Users\Administraor\workspace\ChronoTrack\get_gigiya.py", line 17, in <module>
soup2 = BeautifulSoup(rows)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
code
import requests
import sys
from bs4 import BeautifulSoup
url = ('https://console.gigya.com/site/apiStatus/getTable.ashx')
r = requests.request('GET', url)
content = str(r.content)
soup = BeautifulSoup(content)
table = soup.findAll('table')
rows = soup.findAll('tr')
rows = rows[8]
soup2 = BeautifulSoup(rows) #this is where it fails
items = soup2.findAll('td')
print items
The line soup2 = BeautifulSoup(rows) is unnecessary; rows at that point is already a Tag object, which you can search directly. You can simply do:
rows = rows[8]
items = rows.findAll('td')
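For reference, a minimal corrected version of the script (a sketch using the same URL; I have not run it against the live endpoint):
import requests
from bs4 import BeautifulSoup

url = 'https://console.gigya.com/site/apiStatus/getTable.ashx'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

row = soup.findAll('tr')[8]   # a Tag object; no second parse needed
items = row.findAll('td')     # search the Tag directly
print(items)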

Why does BeautifulSoup reformat my XML?

I do the following:
from BeautifulSoup import *
html = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(html)
soup.contents
As a result I get:
[<body><b>In Body</b><b>Second level</b></body>]
It looks strange to me since I do not see the original XML. Originally I have a tag <b> that contains some text (In Body) and then contains another tag <b>. However, BeautifulSoup "thinks" that I have a tag <b> and after it (after it is closed) another tag <b>. So, the tags are not perceived as nested into each other. Why is that?
ADDED
For the people who complain about validity of the HTML in my example I made the following example:
xml = u'<aaa><bbb>In Body<bbb>Second level</bbb></bbb></aaa>'
soup = BeautifulSoup(xml)
soup.contents
which returns:
[<aaa><bbb>In Body</bbb><bbb>Second level</bbb></aaa>]
ADDED 2
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1522, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1147, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1189, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 138, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/sgmllib.py", line 296, in parse_starttag
self.finish_starttag(tag, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 338, in finish_starttag
self.unknown_starttag(tag, attrs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1344, in unknown_starttag
and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
AttributeError: 'list' object has no attribute 'text'
Note that you're using the obsolete package, BeautifulSoup:
This package is OBSOLETE. It has been replaced by the beautifulsoup4
package. You should use Beautiful Soup 4 for all new projects
BeautifulSoup 3 contained some XML parsing features (BeautifulStoneSoup) that really did not understand the same tag being nested again (as noted by 7stud in his answer); thus, for all XML parsing needs, it should be considered totally and utterly replaced by BeautifulSoup 4. Note that the two packages can coexist even within one application: BeautifulSoup.BeautifulSoup for BS3, and bs4.BeautifulSoup for BS4.
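For instance, both imports can live side by side in the same module (the aliases below are my own):
from BeautifulSoup import BeautifulSoup as BeautifulSoup3  # BS3, obsolete
from bs4 import BeautifulSoup as BeautifulSoup4            # BS4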
BeautifulSoup 4 parses using HTML rules by default; you need to tell it explicitly to use XML (which requires lxml to be installed). An example with BeautifulSoup 4 (PyPI beautifulsoup4):
>>> import bs4
>>> from bs4 import BeautifulSoup
>>> xml = u'<body><b>In Body<b>Second level</b></b></body>'
>>> soup = BeautifulSoup(xml, 'xml')
>>> soup.contents
[<body><b>In Body<b>Second level</b></b></body>]
>>> bs4.__version__
'4.1.3'
Notice that the document must then be well-formed XML; there is no leniency.
If you do not use the 'xml' argument, you will get incorrectly parsed documents:
>>> bs4.BeautifulSoup('<p><p></p></p>')
<html><body><p></p><p></p></body></html>
and with
>>> bs4.BeautifulSoup('<p><p></p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p/></p>
So, the tags are not perceived as nested into each other. Why is that?
According to the comments in the BeautifulSoup source code:
Tag nesting rules:
Most tags can't be nested at all. For instance, the occurrence of a
<p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2
should be transformed into:
<p>Para1</p><p>Para2
The source code then specifies several lists of tag names that, according to the HTML standard, are allowed to nest within themselves, and <b> isn't one of them.
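You can see the rule in action by comparing <b> with a tag that is on BS3's nestable list, such as <div> (a Python 2 / BS3 sketch; outputs shown as comments):
from BeautifulSoup import BeautifulSoup
print BeautifulSoup(u'<b>one<b>two</b></b>')
# -> <b>one</b><b>two</b>      (repeated <b> closes the previous one)
print BeautifulSoup(u'<div>one<div>two</div></div>')
# -> <div>one<div>two</div></div>   (div is allowed to nest)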
If I use:
xml = u'<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, ['lxml', 'xml'])
I get:
AttributeError: 'list' object has no attribute 'text'
You are getting that error because you can't pass a list as an argument to BeautifulSoup().
In order to alert BeautifulSoup that you aren't parsing html, you need to use BeautifulStoneSoup(). Unfortunately, my tests show that BeautifulStoneSoup() produces the same xml, so it appears that BeautifulStoneSoup() applies a similar nesting rule to your <b> tag.
If you aren't locked into using BeautifulSoup 3, you should use lxml or BeautifulSoup 4. lxml is considered by many to be the superior package (e.g. it's faster, you can use xpaths), but it can be tough to install. So I suggest you try to install lxml, and if that works, then great. Otherwise, install BeautifulSoup 4.
I've been using BeautifulSoup for so many years, that I prefer it; but I also use lxml when I want to use xpaths to search a document.
lxml example:
from lxml import etree
xml = '<body><b>In Body<b>Second level</b></b></body>'
tree = etree.fromstring(xml)
print etree.tostring(tree)
matching_tags = tree.xpath('/body/b/b')
inner_b_tag = matching_tags[0]
print inner_b_tag.text
--output:--
<body><b>In Body<b>Second level</b></b></body>
Second level
bs4 example:
from bs4 import BeautifulSoup
xml = '<body><b>In Body<b>Second level</b></b></body>'
soup = BeautifulSoup(xml, 'xml') #In BeautifulSoup 4, you pass a second argument to BeautifulSoup() to indicate that you are parsing xml.
print(soup)
body = soup.find('body')
inner_b_tag = body.b.b
print(inner_b_tag.string)
--output:--
<?xml version="1.0" encoding="utf-8"?>
<body><b>In Body<b>Second level</b></b></body>
Second level
