Access value in BeautifulSoup4 - python

I have made an HTML request from which I would like to retrieve specific elements, but I don't know how to access them with BeautifulSoup4.
Here is an example of the returned html:
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
I would like to access the element AF118324 (which is the name after the Identifiers span class).
How could I access it? (without using a substring method of course)

Does this work for you?
from bs4 import BeautifulSoup
html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
# The identifier is the text node that directly follows the "Identifiers:" span.
obj = soup.find('span', string='Identifiers:').next_sibling
print(obj.strip())
Which prints:
AF118324[sampleid]
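The printed value still carries the [sampleid] suffix. A minimal sketch of one way to isolate the bare identifier, assuming the identifier is always the leading run of letters and digits in the text node after the Identifiers: span. The regular expression and the name sample_id are illustrative and not part of the original answer:

```python
import re

from bs4 import BeautifulSoup

html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the label span by its exact text, then take the text node after it.
label = soup.find('span', string='Identifiers:')
raw = label.next_sibling

# Assumption: the identifier is the first run of word characters in that node.
match = re.match(r'\s*(\w+)', raw)
sample_id = match.group(1) if match else None
print(sample_id)
```

If the identifier can contain other characters (dots or hyphens, for example), the pattern would need to be widened accordingly.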

Related

BeautifulSoup4 - Requests - How to find TBODY classes?

I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why doesn't the following code return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected output: the content of all table columns and rows.
The page you link to actually loads an iframe with the table in it. The document in the frame is at this URL:
http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br
If you request that URL instead, you'll see the <tbody> in the response.
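Once the frame's document has been fetched, the rows can be read cell by cell. A minimal sketch, using a trimmed copy of the sample markup from the question instead of a live request (the variable names are illustrative, and the live page may no longer look like this):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question; ids and styles omitted.
html = '''
<table><tbody>
<tr class="rgRow GridBovespaItemStyle">
<td class="rgSorted" align="left"><span>ABEV3</span></td>
<td align="left"><span>AMBEV S/A</span></td>
<td align="left"><span>ON</span></td>
<td class="text-right"><span>4.354.228.928</span></td>
<td class="text-right"><span>3,003</span></td>
</tr>
</tbody></table>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find('tbody').find_all('tr'):
    # get_text(strip=True) drops the whitespace around each cell's content.
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

print(rows)
```

With a real response, the html string would be replaced by the text returned for the frame's URL.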

How to get value from piece of html code with BeautifulSoup?

I just started using python for some web page scraping and BeautifulSoup seems to be recommended everywhere.
I have the content like below:
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">1</span>
<span class="game-result">0</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost"></i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">30 min</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">25</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">Aug 9, 2017</a>
</td>
</tr>
100 <tr></tr> here
</tbody>
</table>
My code stops here. How do I write Python code that loops over every <tr></tr> pair and extracts the classes of each <span> in each <td>?
Edit:
I think I didn't explain this clearly. Your code returns the names of the classes in that HTML, while I am looking for the corresponding values. For example, for the class username I want the values player1 and player2, and for the class country-flag-small flag-70 I want tip=Indonesia.
This should do the trick:
import requests
from bs4 import BeautifulSoup
res = requests.get('someLink')
soup = BeautifulSoup(res.text, 'html.parser')
classes = []
# class_=True matches every tag that has a class attribute.
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)
I tested this using your html file and got the following results:
['table', 'with-row-highlight', 'table-archive', 'user-tagline', 'username', 'user-rating', 'country-flag-small', 'flag-113', 'user-tagline', 'username', 'user-rating', 'country-flag-small','flag-70', 'clickable-link', 'text-middle', 'pull-left', 'game-result', 'game-result', 'result', 'icon-square-minus', 'loss', 'text-center', 'clickable-link', 'text-right', 'clickable-link', 'text-middle', 'moves', 'text-right', 'miniboard', 'clickable-link', 'archive-date']
Note that you will have to run pip3 install requests if you haven't already.
Also, if you want to test this with a file on your computer, you can do this:
from bs4 import BeautifulSoup
# Read the whole file; the with block closes it afterwards.
with open('/path/To/Your/HtmlFile.html', 'r') as file:
    lines = file.read()
soup = BeautifulSoup(lines, 'html.parser')
classes = []
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)
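The snippets above collect class names. For the values asked about in the edit (the text of each username span and the tip attribute of each flag span), a minimal sketch along the following lines may help. It uses a shortened copy of the question's markup, and the name players is illustrative:

```python
from bs4 import BeautifulSoup

html = '''
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
</tr>
</tbody>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

players = []
for tr in soup.find_all('tr'):
    for tagline in tr.find_all('div', class_='user-tagline'):
        # Text content is read with get_text().
        name = tagline.find('span', class_='username').get_text(strip=True)
        rating = tagline.find('span', class_='user-rating').get_text(strip=True)
        # Attributes such as tip are read with dictionary-style access.
        country = tagline.find('span', class_='country-flag-small')['tip']
        players.append((name, rating, country))

print(players)
```

Rows that lack one of these spans would make find return None, so real data may need a check before calling get_text or indexing.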

Beautiful Soup Parse Python

I've captured the following html using BS4, but can't seem to search for the artist tag.
I've assigned this block of code to a variable called container, and then tried
print container.tr.td["artist"]
without luck.
Any advice appreciated?
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">kool as the gang</td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
Your syntax is wrong: "artist" is the value of the "class" attribute, not the name of an attribute. Try this:
from bs4 import BeautifulSoup
html = """
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
kool as the gang </td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td', {'class': 'artist'})
print(td.text.strip())
Outputs:
kool as the gang
Another way:
Use the select method to look for the elements within container whose class is 'artist'. select returns a list because more than one element could match; since you know there is only one, take the first item in the list and read its text attribute.
>>> HTML = open('sven.htm').read()
>>> import bs4
>>> container = bs4.BeautifulSoup(HTML, 'lxml')
>>> container.select('.artist')[0].text
'\n kool as the gang '
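A variation on the same idea, sketched under the assumption that a reasonably recent bs4 is installed (select_one relies on its bundled CSS selector support). It takes the first match directly and strips the surrounding whitespace; the inline markup is a shortened copy of the question's row:

```python
from bs4 import BeautifulSoup

html = '''
<table>
<tr class="item">
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
 kool as the gang </td>
<td class="venue">100 club</td>
</tr>
</table>
'''

container = BeautifulSoup(html, 'html.parser')

# select_one returns the first match, or None when nothing matches.
cell = container.select_one('td.artist')
artist = cell.get_text(strip=True) if cell is not None else None
print(artist)
```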

How To Grab <a href="url"> Links With No Classes Or ID's with BeautifulSoup4 (Python 2.7)

I am struggling trying to grab a tag that doesn't contain any class or id. It is just the a href, and then the link.
HTML code (there is more, but this is just a short bit of it). I'm trying to grab the a href="url is here", but I can't just grab "a" because it will grab every link on the page.
<table>
<tbody>
<tr class="">
<td class="col1 align">
<a href="url is here">
1
</a>
</td>
<td class="col2">
<a href="www.example.com">
<img class="avatar" src="www.example.com" alt="le me">
le me
<img class="test" alt="test" title="test" src="test-icon.png">
</a>
</td>
<td class="col3 align">
<a href="www.example.com">
2,715
</a>
</td>
<td class="col4 align">
<a href="www.example.com">
5,400,000,000
</a>
</td>
</tr>
My code:
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll():
username = link.get()
print(username)
I don't have these filled in because anything I try won't work. Not sure what else to do.
You can select all a tags and use the has_attr method to skip any that have a class or id attribute:
for link in soup.find_all('a'):
    # Skip links that carry a class or an id.
    if link.has_attr('class') or link.has_attr('id'):
        continue
    username = link.get('href')
    print(username)
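Since every link in the sample lacks a class and an id, the filter above may still return all of them. A minimal sketch of an alternative, assuming the wanted link always sits in the cell with the col1 class as in the sample markup, is to select by the parent cell instead:

```python
from bs4 import BeautifulSoup

html = '''
<table>
<tbody>
<tr class="">
<td class="col1 align">
<a href="url is here">
1
</a>
</td>
<td class="col2">
<a href="www.example.com">le me</a>
</td>
</tr>
</tbody>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# 'td.col1 a[href]' selects <a> tags that have an href and are nested
# inside a <td> carrying the col1 class.
hrefs = [a['href'] for a in soup.select('td.col1 a[href]')]
print(hrefs)
```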

problem scraping with BeautifulSoup

I am trying to scrape the url http://www.kat.ph/search/beatles/?categories[]=music using BeautifulSoup
torrents = bs.findAll('tr',id = re.compile('torrent_*'))
torrents holds all the torrents on that page; every element of torrents is a tr element.
My problem is that len(torrents[0].td) is 5, but I am not able to iterate over the tds. Something like for x in torrents[0].td does not work.
The data that I am getting for torrents[0] is:
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell">
<div class="iaconbox floatedRight">
<a title="Torrent magnet link" href="magnet:?xt=urn:btih:0898a4b562c1098eb69b9b801c61a51d788df0f5&dn=the+beatles+2009+greatest+hits+cdrip+ikmn+reupld&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce" onclick="_gaq.push(['_trackEvent', 'Download', 'Magnet Link', 'Music']);" class="imagnet icon16"></a>
<a title="Download torrent file" href="http://torrage.com/torrent/0898A4B562C1098EB69B9B801C61A51D788DF0F5.torrent?title=[kat.ph]the.beatles.2009.greatest.hits.cdrip.ikmn.reupld" onclick="_gaq.push(['_trackEvent', 'Download', 'Download torrent file', 'Music']);" class="idownload icon16"></a>
<a class="iPartner2 icon16" href="http://www.downloadweb.org/checking.php?acode=b146a357c57fddd450f6b5c446108672&r=d&qb=VGhlIEJlYXRsZXMgWzIwMDldIEdyZWF0ZXN0IEhpdHMgQ0RSaXAtIGlLTU4gUmVVUGxk" onclick="_gaq.push(['_trackEvent', 'Download', 'Download movie']);"></a>
<a class="iverif icon16" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html" title="Verified Torrent"></a> <a rel="2962816,0" class="icomment" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html#comments_tab">
<span class="icommentdiv"></span>145
</a>
</div>
<div class="torrentname">
The <strong class="red">Beatles</strong> [2009] Greatest Hits CDRip- iKMN ReUPld
<span>
Posted by <a class="plain" href="/user/iKMN/">iKMN</a>
<img src="http://static.kat.ph/images/verifup.png" alt="verified" /> in
<span id="cat_2962816">
Music
</span></span>
</div>
</td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
I'd recommend using lxml instead of BeautifulSoup. Among other great features, you can use XPath to grab your links:
import lxml.html
doc = lxml.html.parse('http://www.kat.ph/search/beatles/?categories[]=music')
# @class and @href address attributes in XPath.
links = doc.xpath('//a[contains(@class,"idownload")]/@href')
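If you would rather stay with BeautifulSoup, note that torrents[0].td refers only to the first td of the row. In bs4, len() of a tag counts its child nodes, so the 5 in the question presumably reflects the children of that first cell rather than the number of cells. A minimal sketch, assuming bs4 and a shortened copy of the row from the question:

```python
import re

from bs4 import BeautifulSoup

html = '''
<table>
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell"><div class="torrentname">The Beatles [2009] Greatest Hits</div></td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Match rows whose id starts with "torrent_".
torrents = soup.find_all('tr', id=re.compile(r'^torrent_'))

# find_all('td') returns every cell in the row; .td alone is just the first one.
cells = [td.get_text(' ', strip=True) for td in torrents[0].find_all('td')]
print(cells)
```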
