Beautifulsoup4, BS4, Python Parsing Question - python

I am parsing a webpage using bs4. There is more than one element I would like to select, and they share the same class name.
My parsing code:
rows_ranking = soup_ranking.select('#current-poll tbody tr .left')
The page I want to parse has two different ".left" cells in its table rows. How can I choose which one I want? Here is an example of two of these cells (one I would like my program to parse, the other I would like to ignore):
1 - <td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
2 - <td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
As you can see they have the same class identifier. Is there a way I can have bs4 look only for the first of the two?
I hope my question makes sense, thanks in advance!

Haven't used BS4 or Python for a while, but if I remember correctly something like this should work for getting all elements whose data-stat attribute is school_name (note the hyphen in the attribute name, matching the HTML above):
results = soup.find_all("td", {"data-stat": "school_name"})
Or, if you want all results that have a data-stat attribute, whatever its value, use:
results = soup.find_all("td", {"data-stat": True})
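As a quick, self-contained check, here is how that attribute filter might look against the two rows from the question (the surrounding table markup is assumed; the attribute in the page HTML is data-stat, with a hyphen):

```python
from bs4 import BeautifulSoup

html = """
<table id="current-poll"><tbody><tr>
<td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
<td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
</tr></tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# filter on the data-stat attribute (hyphen, not underscore)
schools = soup.find_all("td", {"data-stat": "school_name"})
print([td.text for td in schools])  # ['Baylor']
```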

You have a couple of options:
You can use soup.find_all and loop through the results.
Use a CSS selector that matches only the element you want (e.g. an attribute selector or :first-of-type).
Inspect the element in your browser and copy its selector.
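If you'd rather stay with select, a CSS attribute selector can do the filtering inside the selector itself. A sketch using the two rows from the question (the surrounding table markup here is assumed):

```python
from bs4 import BeautifulSoup

html = """
<table id="current-poll"><tbody><tr>
<td class="left " data-stat="school_name" csk="Baylor.015">Baylor</td>
<td class="left " data-stat="conf_abbr" csk="Big 12 Conference.015.001">Big 12</td>
</tr></tbody></table>
"""
soup_ranking = BeautifulSoup(html, "html.parser")

# keep the original selector, but narrow it by the data-stat attribute
rows_ranking = soup_ranking.select('#current-poll tbody tr td[data-stat="school_name"]')
print([td.text for td in rows_ranking])  # ['Baylor']
```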


Python Beautifulsoup get previous element using find_all_previous

I would like to extract a figure under a specific category. For example, I would like to scrape '(2)募入決定額' (amount of accepted bids) under the categories '6.価格競争入札について' (on the price-competitive auction) and '7.非競争入札について' (on the non-competitive auction).
But the structure here is a bit tricky, as there is no hierarchy between these elements.
The website I use is :
https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm
And I tried the following code, but nothing prints out.
rows = soup.findAll('span')
for cell in rows:
    if "募入決定額" in cell:
        a = rows[0].find_all_previous("td")
        for i in a:
            print(a.get('text'))
Much appreciate for any help!
In newer code, avoid the old syntax findAll(); use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
You could select every <td> that contains 募入決定額 and, from there, its nearest sibling <td> that contains a <span>:
soup.select('td:-soup-contains("募入決定額") ~ td>span')
To get its previous category, iterate over all previous <tr> elements:
[x.td.text for x in e.find_all_previous('tr') if x.td.span][0]
Read more about bs4 CSS selectors in the docs and at developer.mozilla.org.
Example
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.mof.go.jp/jgbs/auction/calendar/nyusatsu/resul20211101.htm'
soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')

for e in soup.select('td:-soup-contains("募入決定額") ~ td>span'):
    print(e.text)
    # or
    print([x.td.text for x in e.find_all_previous('tr') if x.td.span][0], e.text)
Output
2兆1,205億円
4億8,500万円
4,785億円
or
6. 2兆1,205億円
7. 4億8,500万円
8. 4,785億円

Beautiful Soup finds only half the wanted table

So, I am trying to scrape the table named 'Germany, Kempten Average prices of Gouda' that exists on this page, using Python and BeautifulSoup. It should be as straightforward as implementing something like the following block of code:
import requests
import re
from bs4 import BeautifulSoup
web_page = 'https://www.clal.it/en/index.php?section=gouda_k'
page = requests.get(web_page)
soup = BeautifulSoup(page.text, "html.parser")
table = str(soup.find_all('table')[17])
Through trial and error I found that, out of all the tables, the one I want is at index 17. The problem is that it scrapes only the first two rows. If we check exactly what lives in the variable table, we see the following:
<table align="center" width="100%">
THE CONTENTS OF THE TABLE UP TO THE SECOND LINE
</table>
But if we review page.text, the </table> tag is not after the second row but at the end of the table, as one would expect.
Question #1: Why does bs4 find a </table> tag where there should not be one?
Question #2: Any suggestions to actually manage to parse the entire table?
It seems the problem is caused by the html.parser feature.
You can use the html5lib or lxml feature instead, though these parsers have their own limitations too.
Here are the advantages and disadvantages of each parser library:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
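To see the difference in practice, you can feed the same malformed markup to each parser and compare what they build. A minimal sketch (the broken string is an invented example, not the actual page; html5lib and lxml are optional installs, so the loop skips any parser that is missing):

```python
from bs4 import BeautifulSoup

# unclosed tags, the kind of markup that trips up html.parser
broken = "<table><tr><td>Gouda<td>Kempten</table>"

for features in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, features)
    except Exception:  # parser not installed
        print(features, "is not available")
        continue
    # each parser may nest or repair the cells differently
    print(features, "->", len(soup.find_all("td")), "cells:", soup.find("table"))
```

The exact tree differs per parser, which is why switching the feature argument can change how much of a table BeautifulSoup "sees".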

Select a html a tag with specified display content

I'm new to scrapy and have been struggling for this problem for hours.
I need to scrape a page, with its source somehow looks like this:
<tr class="odd">
<td class="pfama_PF02816">Pfam</td>
<td><a href="http://pfam.xfam.org/family/Alpha_kinase">Alpha_kinase</a></td>
<td>1389</td>
<td>1590</td>
<td class="sh" style="display: none">21.30</td>
</tr>
I need to get the information in the tr.odd tag, if and only if the a tag has the value "Alpha_kinase".
I can get all of that content (including "Alpha_kinase", 1389, 1590 and many other values) and then post-process the output to keep "Alpha_kinase" only, but that approach would be significantly fragile and ugly. Currently I have to do it that way:
positions = response.css('tr.odd td:not([class^="sh"]) td a::text').extract()
then do a for-loop to check.
Is there any conditional expression (like the td:not(...) above) I can put in response.css to solve my problem?
Thanks in advance. Any advice will be highly appreciated!
You can use another selector: response.xpath to select element from the html,
and filter the text with xpath contains function.
>>> response.xpath("//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]")
[<Selector xpath="//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]" data='<a href="http://pfam.xfam.org/family/Alp'>]
I assume there are multiple such tr elements on the page. If so, I would probably do something like:
# get only rows containing 'Alpha_kinase' in the link text
for row in response.xpath('//tr[@class="odd" and contains(./td/a/text(), "Alpha_kinase")]'):
    # extract all the information
    item['link'] = row.xpath('./td[2]/a/@href').extract_first()
    ...
    yield item
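Outside of a running spider, the same XPath can be tried with plain lxml. The row below is reconstructed from the question, and the pfam URL is a guess based on the truncated selector output above:

```python
from lxml import html

doc = html.fromstring('''
<table><tr class="odd">
  <td class="pfama_PF02816">Pfam</td>
  <td><a href="http://pfam.xfam.org/family/Alpha_kinase">Alpha_kinase</a></td>
  <td>1389</td>
  <td>1590</td>
  <td class="sh" style="display: none">21.30</td>
</tr></table>''')

# keep only rows whose link text contains 'Alpha_kinase'
rows = doc.xpath('//tr[@class="odd" and contains(./td/a/text(), "Alpha_kinase")]')
for row in rows:
    print(row.xpath('./td[2]/a/@href')[0])  # the link in the second cell
```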

Extracting data from a wikipedia page

This question might be really specific. I am trying to extract the number of employees from the Wikipedia pages of companies such as https://en.wikipedia.org/wiki/3M.
I tried using the Wikipedia Python API and some regex queries. However, I couldn't find anything solid that I could generalize to any company (not counting exceptions).
Also, because the table row does not have an id or a class I cannot directly access the value. Following is the source:
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference">[1]</sup></td>
</tr>
So, even though I have the table's identifier (the class infobox vcard), I couldn't figure out a way to scrape this information using BeautifulSoup.
Is there a way to extract this information? It is present in the summary table on the right at the beginning of the page.
Using lxml.etree instead of BeautifulSoup, you can get what you want with an XPath expression:
>>> from lxml import etree
>>> import requests
>>> r = requests.get('https://en.wikipedia.org/wiki/3M')
>>> doc = etree.fromstring(r.text)
>>> e = doc.xpath('//table[#class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td')
>>> e[0].text
'89,800 (2015)'
Let's take a closer look at that expression:
//table[#class="infobox vcard"]/tr[th/div/text()="Number of employees"]/td
That says:
Find all table elements that have the attribute class set to infobox vcard; inside those elements, look for tr elements that have a child th element with a child div containing the text "Number of employees"; and inside that tr element, get the first td element.
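For completeness, since the question asked about BeautifulSoup: the same row can be reached by finding the label div and walking up to its tr. A sketch against the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '''<table class="infobox vcard"><tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;">89,800 (2015)<sup id="cite_ref-FY_1-5" class="reference">[1]</sup></td>
</tr></table>'''

soup = BeautifulSoup(html, "html.parser")
# find the label, climb to its row, then take the value cell
label = soup.find("div", string="Number of employees")
value = label.find_parent("tr").find("td")
print(value.contents[0])  # the text node before the [1] citation
```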
Why reinvent the wheel?
DBpedia has this information as RDF triples. See e.g. http://dbpedia.org/page/3M

Pull Tag Value using BeautifulSoup

Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets its title; the other looks for a span with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by its CSS class and extract the contents:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the necessary details.
