problem scraping with BeautifulSoup - python

I am trying to scrape the URL http://www.kat.ph/search/beatles/?categories[]=music using BeautifulSoup:
torrents = bs.findAll('tr',id = re.compile('torrent_*'))
torrents holds all the torrent rows on that page, so every element of torrents is a tr element.
My problem is that len(torrents[0].td) is 5, but I am not able to iterate over the td's. I mean that something like for x in torrents[0].td is not working.
The data that I am getting for torrents[0] is:
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell">
<div class="iaconbox floatedRight">
<a title="Torrent magnet link" href="magnet:?xt=urn:btih:0898a4b562c1098eb69b9b801c61a51d788df0f5&dn=the+beatles+2009+greatest+hits+cdrip+ikmn+reupld&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce" onclick="_gaq.push(['_trackEvent', 'Download', 'Magnet Link', 'Music']);" class="imagnet icon16"></a>
<a title="Download torrent file" href="http://torrage.com/torrent/0898A4B562C1098EB69B9B801C61A51D788DF0F5.torrent?title=[kat.ph]the.beatles.2009.greatest.hits.cdrip.ikmn.reupld" onclick="_gaq.push(['_trackEvent', 'Download', 'Download torrent file', 'Music']);" class="idownload icon16"></a>
<a class="iPartner2 icon16" href="http://www.downloadweb.org/checking.php?acode=b146a357c57fddd450f6b5c446108672&r=d&qb=VGhlIEJlYXRsZXMgWzIwMDldIEdyZWF0ZXN0IEhpdHMgQ0RSaXAtIGlLTU4gUmVVUGxk" onclick="_gaq.push(['_trackEvent', 'Download', 'Download movie']);"></a>
<a class="iverif icon16" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html" title="Verified Torrent"></a> <a rel="2962816,0" class="icomment" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html#comments_tab">
<span class="icommentdiv"></span>145
</a>
</div>
<div class="torrentname">
The <strong class="red">Beatles</strong> [2009] Greatest Hits CDRip- iKMN ReUPld
<span>
Posted by <a class="plain" href="/user/iKMN/">iKMN</a>
<img src="http://static.kat.ph/images/verifup.png" alt="verified" /> in
<span id="cat_2962816">
Music
</span></span>
</div>
</td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>

I'd recommend using lxml instead of BeautifulSoup. Among other great features, it lets you use XPath to grab your links:
import lxml.html
doc = lxml.html.parse('http://www.kat.ph/search/beatles/?categories[]=music')
links = doc.xpath('//a[contains(@class,"idownload")]/@href')
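If you would rather stay with BeautifulSoup, the iteration problem itself can be sketched as follows. This is a minimal example, assuming the modern bs4 package and a trimmed-down stand-in for one result row (the html string below is illustrative, not the live page). The point is that row.td is shorthand for the first td only, while find_all('td') returns every cell in the row.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for one result row from the question (assumption).
html = """
<table>
<tr class="odd" id="torrent_2962816">
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr", id=re.compile(r"^torrent_"))

# rows[0].td would be only the first cell; find_all returns all of them.
cells = [td.get_text(" ", strip=True) for td in rows[0].find_all("td")]
print(cells)
```

Each entry of cells is the text of one column, so the size, file count, age, seeders and leechers can be read by position.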

Related

BeautifulSoup4 - Requests - How to find TBODY classes?

I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why doesn't the following code return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected output: the content of all table columns and rows.
The page you link to actually loads an iframe with the table in it. The URL of the document in the frame is http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br and if you request that URL instead, you'll see the <tbody>
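One way to discover such a frame URL programmatically is to look for iframe elements in the outer page and resolve their src against the page address. The sketch below is only illustrative: outer_url and outer_html are hypothetical stand-ins rather than the real B3 page, and it assumes the iframe is present in the static HTML and not injected by JavaScript.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical outer page and address, used only for illustration.
outer_url = "http://www.example.com/page.htm"
outer_html = """
<html><body>
<iframe src="/frames/table.aspx?Indice=IBOV"></iframe>
</body></html>
"""

soup = BeautifulSoup(outer_html, "html.parser")

# Collect every iframe that has a src and make the URL absolute.
frame_urls = [urljoin(outer_url, frame["src"])
              for frame in soup.find_all("iframe", src=True)]
print(frame_urls)
```

Each URL in frame_urls can then be fetched with requests.get and parsed on its own, which is where the tbody would be found.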

How to get value from piece of html code with BeautifulSoup?

I just started using python for some web page scraping and BeautifulSoup seems to be recommended everywhere.
I have the content like below:
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">1</span>
<span class="game-result">0</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost"></i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">30 min</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">25</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">Aug 9, 2017</a>
</td>
</tr>
100 <tr></tr> here
</tbody>
</table>
My code stops here. How do I write the Python code to loop over all the <tr></tr> pairs and extract all the classes for each <span> in each <td>?
Edit:
I think I didn't explain this clearly. What your code returns is the names of the classes in that HTML, while what I am looking for is the corresponding values. For example, there is a class username, and I want to get its values player1 and player2; there is a class country-flag-small flag-70, and I want to get tip=Indonesia.
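This should do the trick:
import requests
from bs4 import BeautifulSoup
res = requests.get('someLink')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(res.text, 'html.parser')
classes = []
# class_=True matches every element that has a class attribute.
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)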
I tested this using your html file and got the following results:
['table', 'with-row-highlight', 'table-archive', 'user-tagline', 'username', 'user-rating', 'country-flag-small', 'flag-113', 'user-tagline', 'username', 'user-rating', 'country-flag-small','flag-70', 'clickable-link', 'text-middle', 'pull-left', 'game-result', 'game-result', 'result', 'icon-square-minus', 'loss', 'text-center', 'clickable-link', 'text-right', 'clickable-link', 'text-middle', 'moves', 'text-right', 'miniboard', 'clickable-link', 'archive-date']
Do note that you will have to run pip3 install requests if you haven't already.
Also, if you want to test this with a file on your computer, you can do this:
from bs4 import BeautifulSoup
# The with block closes the file once it has been read.
with open('/path/To/Your/HtmlFile.html', 'r') as file:
    lines = file.read()
soup = BeautifulSoup(lines, 'html.parser')
classes = []
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)
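The edit to the question asks for the values rather than the class names. A minimal sketch of that, assuming the structure shown in the question (the html string below is a shortened copy of it, and the dictionary keys are names chosen here for illustration), could select each user-tagline block and read the text and the tip attribute from its spans:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
players = []
for row in soup.select("table.table-archive tr"):
    for tagline in row.select("div.user-tagline"):
        players.append({
            # Element text is read with get_text().
            "username": tagline.select_one("span.username").get_text(strip=True),
            "rating": tagline.select_one("span.user-rating").get_text(strip=True),
            # Attribute values are read like dictionary keys.
            "country": tagline.select_one("span.country-flag-small")["tip"],
        })
print(players)
```

If some rows may lack one of these spans, select_one returns None for it, so a real scraper would want to check for that before reading the value.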

Parse multiple href within one parent using BeautifulSoup

I have one line in my program, using BeautifulSoup's find():
print(table.find('td','monsters'))
This is the output of the above line:
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
Now I want to parse all five hrefs, so that it would output something like this:
/m154
/m153
/m152
/m155
/m147
I have attempted to convert my print line into a for loop by changing find() to find_all(), and then to retrieve the href by using .a['href'] within the for loop. However, no matter what I try, I always get only one entry instead of five. Any suggestions for retrieving multiple hrefs? Seeing that find_all() returns an array, would it make sense to call find_all() directly on the parent of a?
Input:
page = """<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "html.parser") # your source page parsed as html
links = soup.find_all('a', href=True) # get all links having href attribute
for i in links:
    print(i['href'])
Result:
/m154
/m153
/m152
/m155
/m147
What you want to do is something like the following:
cell = table.find('td', 'monsters')
# Search for the links inside the cell, then print each one's href.
for a_tag in cell.find_all('a'):
    print(a_tag['href'])
Full Code, similar to posts above
import bs4
HTML= """<html>
<table>
<tr>
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
</tr>
</table>
</html>
"""
table = bs4.BeautifulSoup(HTML, 'lxml')
anker = table.find('td', 'monsters').find_all('a')
for a in anker:
    print(a['href'])
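The markup as quoted here contains no a elements, so the sketch below assumes each div is wrapped in a link such as <a href="/m154">; that wrapper is a guess chosen to match the expected output. Under that assumption it shows why the original attempt produced a single entry: find_all('td', 'monsters') yields one cell, and cell.a is only the first link inside it.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <a href> wrappers are hypothetical.
page = """
<table><tr><td class="monsters">
<a href="/m154"><div class="mim mim-154"></div></a>
<a href="/m153"><div class="mim mim-153"></div></a>
<a href="/m152"><div class="mim mim-152"></div></a>
</td></tr></table>
"""

soup = BeautifulSoup(page, "html.parser")

# One matching cell, and .a returns only its first link.
first_only = [cell.a["href"] for cell in soup.find_all("td", "monsters")]

# Searching for the links inside the cell returns all of them.
all_links = [a["href"] for a in soup.select("td.monsters a[href]")]

print(first_only)
print(all_links)
```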

How To Grab <a href="url"> Links With No Classes Or ID's with BeautifulSoup4 (Python 2.7)

I am struggling to grab a tag that doesn't contain any class or id. It is just the a href, and then the link.
HTML code: there is more, but this is just a short bit of it. I'm trying to grab the a href="url is here", but I can't just grab "a" because that will grab every link on the page.
<table>
<tbody>
<tr class="">
<td class="col1 align">
<a href="url is here">
1
</a>
</td>
<td class="col2">
<a href="www.example.com">
<img class="avatar" src="www.example.com" alt="le me">
le me
<img class="test" alt="test" title="test" src="test-icon.png">
</a>
</td>
<td class="col3 align">
<a href="www.example.com">
2,715
</a>
</td>
<td class="col4 align">
<a href="www.example.com">
5,400,000,000
</a>
</td>
</tr>
My code:
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll():
    username = link.get()
    print(username)
I don't have these filled in because anything I try won't work. Not sure what else to do.
You can select all a tags and use the has_attr function to check whether each one has a class or id attribute:
for link in soup.findAll('a'):
    # Skip any link that carries a class or an id.
    if link.has_attr('class') or link.has_attr('id'):
        continue
    username = link.get('href')
    print(username)
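Since none of the links in the sample has a class or id, filtering on those attributes would still return every link in the row. An alternative sketch, assuming the link you want is always inside the cell with class col1 as in the sample (the html string below is a shortened copy of it), is to scope the search by the parent cell with a CSS selector:

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question.
html = """
<table><tbody><tr class="">
<td class="col1 align"><a href="url is here">1</a></td>
<td class="col2"><a href="www.example.com">le me</a></td>
<td class="col3 align"><a href="www.example.com">2,715</a></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only links that sit inside a td with class col1 are matched.
hrefs = [a["href"] for a in soup.select("td.col1 a[href]")]
print(hrefs)
```

With many rows on the page, hrefs would contain one entry per row, in document order.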

Access value in BeautifulSoup4

I have made an HTML request from which I would like to retrieve specific elements, but I don't know how to access them with BeautifulSoup4.
Here is an example of the returned html:
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
I would like to access the element AF118324 (which is the name after the Identifiers span class).
How could I access it? (without using a substring method of course)
Does this work for you?
html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''
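from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# string= and next_sibling are the current names for text= and nextSibling.
obj = soup.find('span', string='Identifiers:').next_sibling
print(obj)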
Which prints:
AF118324[sampleid]
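The sibling text still carries surrounding whitespace and the [sampleid] suffix. If only the accession itself is wanted, a small follow-up sketch could look like the one below. It assumes the suffix always starts with an opening bracket, and it does fall back on plain string handling for that last step, since the identifier and its suffix live in the same text node.

```python
from bs4 import BeautifulSoup

html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''

soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", class_="recordAttribute", string="Identifiers:")

# Guard against a missing label or a label with nothing after it.
raw = label.next_sibling.strip() if label and label.next_sibling else ""

# Keep everything before the first "[" (assumed suffix marker).
identifier = raw.split("[", 1)[0]
print(identifier)
```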
