Beautiful Soup finds only half the wanted table - python

So, I am trying to scrape the table named 'Germany, Kempten Average prices of Gouda', which exists on this page, using Python and BeautifulSoup. It should be as straightforward as something like the following block of code:
import requests
import re
from bs4 import BeautifulSoup
web_page = 'https://www.clal.it/en/index.php?section=gouda_k'
page = requests.get(web_page)
soup = BeautifulSoup(page.text, "html.parser")
table = str(soup.find_all('table')[17])
Through trial and error I found that, out of all the tables, the one I want is at index 17. The problem is that only the first two rows of the table get scraped. If we check exactly what lives in the variable table, we see the following:
<table align="center" width="100%">
THE CONTENTS OF THE TABLE UP TO THE SECOND LINE
</table>
But if we inspect page.text, the closing </table> tag is not after the second row, but at the end of the table, as one would expect.
Question #1: Why does bs4 find a </table> tag where there should not be one?
Question #2: Any suggestions on how to parse the entirety of the table?

It seems the problem is caused by the html.parser parser. You can use html5lib or lxml instead, though each of these parsers has its own limitations. The advantages and disadvantages of each parser library are covered in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
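For example, here is a minimal variation of the original snippet that swaps in the lxml parser (a sketch, assuming lxml is installed; note that the table index was found against html.parser and may shift under a different parser):

import requests
from bs4 import BeautifulSoup

web_page = 'https://www.clal.it/en/index.php?section=gouda_k'
page = requests.get(web_page)

# lxml is more tolerant of malformed markup than html.parser,
# so the table should no longer be cut off after two rows
soup = BeautifulSoup(page.text, "lxml")
table = soup.find_all('table')[17]
print(len(table.find_all('tr')))  # should now count all the rows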

Related

Beautifulsoup4, BS4, Python Parsing Question

I am parsing a webpage using bs4. There is more than one data type I would like to select, with the same class name.
My parsing code:
rows_ranking = soup_ranking.select('#current-poll tbody tr .left')
The page I want to parse has two different ".left" identifiers in the table rows. How can I choose which one I would like? Here is an example of two of these table rows (one I would like my program to parse, the other I would like to ignore):
1 - <td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
2 - <td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
As you can see they have the same class identifier. Is there a way I can have bs4 look only for the first of the two?
I hope my question makes sense, thanks in advance!
Haven't used BS4 or Python for a while, but if I remember correctly something like this should work to get all td elements whose data-stat attribute is school_name:
results = soup.findAll("td", {"data-stat": "school_name"})
Or, if you want all td elements that have a data-stat attribute regardless of its value, use:
results = soup.findAll("td", {"data-stat": True})
You have a couple of options:
You can use soup.find_all and loop through your results.
Use the CSS selector for the first element.
Inspect the element and copy its selector.
A sketch of the first two options follows below.
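For instance (a sketch; the selectors are assumptions based on the markup shown in the question):

# Option 1: filter by the data-stat attribute, which is what
# actually distinguishes the two kinds of cell
schools = soup_ranking.select('#current-poll tbody tr td[data-stat="school_name"]')
for td in schools:
    print(td.get_text())

# Option 2: take only the first .left cell in each row
firsts = [row.select('.left')[0]
          for row in soup_ranking.select('#current-poll tbody tr')
          if row.select('.left')]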

How does table parsing work in python? Is there an easy way other than beautiful soup?

I am trying to understand how one can use beautiful soup to extract the href links for the contents under a particular column in a table on a webpage. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page the table with class wikitable has a column title; I need to extract the href links behind each of the values under that column and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the beautiful soup table-parsing documentation.
You don't really have to literally navigate the tree; you can simply look for what identifies the elements you want.
In this example, the URLs you are looking for reside in a table with class="wikitable", and within that table they reside in td tags with align=center. That gives us a reasonably unique identification for our links, so we can start extracting them.
However, keep in mind that multiple tables with class="wikitable" and td tags with align=center may exist; if you want only the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
One more thing to note here is the use of SoupStrainer: it specifies a filter for the content you want to process, which helps speed up parsing. Try removing the parse_only argument from this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
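As a rough way to see that difference yourself, here is a small timing sketch (Python 3, assuming requests and bs4 are installed; the absolute timings will vary by machine):

import time
import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").text
filter_tag = SoupStrainer("table", {"class": "wikitable"})

# parse only the wikitable tables
start = time.perf_counter()
BeautifulSoup(content, "html.parser", parse_only=filter_tag)
print("with SoupStrainer:", time.perf_counter() - start)

# parse the whole document
start = time.perf_counter()
BeautifulSoup(content, "html.parser")
print("full parse:", time.perf_counter() - start)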

Beautiful Soup filter function fails to find all rows of a table

I am trying to parse a large html document using the Python Beautiful Soup 4 library.
The page contains a very large table, structured like so:
<table summary='foo'>
  <tbody>
    <tr>
      A bunch of data
    </tr>
    <tr>
      More data
    </tr>
    .
    .
    .
    100s of <tr> tags later
  </tbody>
</table>
I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This is necessary because the page is large (BeautifulSoup tells me the document contains around 4000 tags).
It is like so:
def isrow(tag):
    if tag.name == u'tr':
        if tag.parent.parent.name == u'table' and \
           tag.parent.parent.has_attr('summary'):
            return True
My problem is that when I iterate through soup.descendants, the function only returns True for the first 77 rows in the table, when I know that the <tr> tags continue on for hundreds of rows.
Is this a problem with my function or is there something I don't understand about how BeautifulSoup generates its collection of descendants? I suspect it might be a Python or a bs4 memory issue but I don't know how to go about troubleshooting it.
Still more like an educated guess, but I'll give it a try.
The way BeautifulSoup parses HTML heavily depends on the underlying parser. If you don't specify it explicitly, BeautifulSoup will choose the one automatically based on an internal ranking:
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
In your case, I'd try to switch the parsers and see what results you would get:
soup = BeautifulSoup(data, "lxml") # needs lxml to be installed
soup = BeautifulSoup(data, "html5lib") # needs html5lib to be installed
soup = BeautifulSoup(data, "html.parser") # uses built-in html.parser
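A quick way to check which parser recovers the whole table is to count the rows your filter matches under each one (a sketch, assuming lxml and html5lib are installed; data and isrow are as defined above):

from bs4 import BeautifulSoup

for parser in ("lxml", "html5lib", "html.parser"):
    soup = BeautifulSoup(data, parser)
    # find_all accepts the filter function directly and only calls it on tags
    print(parser, len(soup.find_all(isrow)))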

Pull Tag Value using BeautifulSoup

Can someone direct me as to how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by using its CSS class; the contents are then available by index, and the title attribute can be read like a dictionary entry:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', "html.parser")
soup.select('.thisClass')[0].contents[0]  # 'Fun Text'
soup.select('.thisClass')[0]['title']     # 'Funstuff'
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the necessary details.

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
<body>
  <div id="container">
    <div id="contents">
      <table>
        <tbody>
          <tr>
            <td class="author">####I want whatever is located here ###</td>
          </tr>
        </tbody>
      </table>
    </div>
  </div>
</body>
</html>
I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
At the moment, my code looks like what is below:
import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString
address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace('&nbsp', '&#160')
html=html.replace('&iacute','&#237')
root=fromstring(html)
I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.
EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in page that I want to scrape.
It's not clear to me from your question why you need to worry about the div tags -- what about doing just:
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string
On the HTML you give, running this emits exactly:
####I want whatever is located here ###
which appears to be what you want. Maybe you can specify better exactly what you need that this super-simple snippet doesn't do: multiple td tags, all of class author, that you need to consider (all? just some? which ones?), possibly missing any such tag (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are from this simple example and overabundant code ;-).
Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string
...i.e., not much harder at all!-)
Or you could use pyquery, since BeautifulSoup 3 is no longer actively maintained; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
first, install pyquery with
easy_install pyquery
then your script could be as simple as
from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
allauthors = [td.text() for td in d('td.author').items()]
pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google's App Engine as well.
The lxml library is now the standard for parsing html in python. The interface can seem awkward at first, but it is very serviceable for what it does.
You should let the library handle the XML specialisms, such as those escaped &entities;
import lxml.html
html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">####I want whatever is located here, eh? í ###</td>
</tr></tbody></table></div></div></body></html>"""
root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")
print tds # gives [<Element td at 84ee2cc>]
print tds[0].text # what you want, including the 'í'
BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:
from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class", "author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):  # html as defined in the snippet above
    print match.body
Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:
caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between tag and symbols, or attribute name, '=', and value
attributes are accessible after parsing as named results
These are the common pitfalls that trip up regex-based approaches to HTML scraping; a quick demo of this tolerance is sketched below.
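For example (a Python 3 sketch; the sample rows are made up to show the variations in case, quoting, and spacing that makeHTMLTags tolerates):

from pyparsing import makeHTMLTags, withAttribute, SkipTo

td_start, td_end = makeHTMLTags("td")
td_start.setParseAction(withAttribute(("class", "author")))
search = td_start + SkipTo(td_end)("body") + td_end

samples = [
    '<td class="author">Jane Doe</td>',    # plain
    "<TD class='author'>Jane Doe</TD>",    # different tag case, single quotes
    '<td  class = author >Jane Doe</td>',  # unquoted value, extra whitespace
]
for s in samples:
    print(search.searchString(s)[0].body)  # prints 'Jane Doe' each time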
