I just started using Python for web page scraping, and BeautifulSoup seems to be recommended everywhere.
I have content like this:
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">1</span>
<span class="game-result">0</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost"></i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">30 min</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">25</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">Aug 9, 2017</a>
</td>
</tr>
100 <tr></tr> here
</tbody>
</table>
My code stops here. How do I write Python code that loops over every <tr></tr> pair and extracts all the classes of each <span> pair in each <td> pair?
Edit:
I think I didn't explain this clearly. Your code returns the names of the classes in that HTML, whereas I am looking for the corresponding values. For example, for the class username I want the values player1 and player2, and for the class country-flag-small flag-70 I want tip=Indonesia.
This should do the trick:
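import requests
from bs4 import BeautifulSoup

res = requests.get('someLink')
# Name the parser explicitly instead of letting bs4 pick one.
soup = BeautifulSoup(res.text, 'html.parser')

classes = []
# class_=True matches every tag that has a class attribute.
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)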
I tested this using your HTML file and got the following results:
['table', 'with-row-highlight', 'table-archive', 'user-tagline', 'username', 'user-rating', 'country-flag-small', 'flag-113', 'user-tagline', 'username', 'user-rating', 'country-flag-small','flag-70', 'clickable-link', 'text-middle', 'pull-left', 'game-result', 'game-result', 'result', 'icon-square-minus', 'loss', 'text-center', 'clickable-link', 'text-right', 'clickable-link', 'text-middle', 'moves', 'text-right', 'miniboard', 'clickable-link', 'archive-date']
Note that you will have to run pip3 install requests if you haven't already.
Also, if you want to test this with a file on your computer, you can do this:
from bs4 import BeautifulSoup

# The with block closes the file once it has been read.
with open('/path/To/Your/HtmlFile.html', 'r') as file:
    soup = BeautifulSoup(file.read(), 'html.parser')

classes = []
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)
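The edit to the question asks for the values rather than the class names. Here is a minimal sketch of one way to get them, assuming the real page matches the sample markup. The `extract_players` helper and the trimmed `sample` string are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question (illustrative only).
sample = """
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
</tr>
</tbody>
</table>
"""


def extract_players(html):
    """Return one list per <tr>, holding a dict for each player in that row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        players = []
        # Each player sits in its own <div class="user-tagline">.
        for tagline in tr.find_all("div", class_="user-tagline"):
            name = tagline.find("span", class_="username")
            rating = tagline.find("span", class_="user-rating")
            flag = tagline.find("span", class_="country-flag-small")
            players.append({
                "username": name.get_text(strip=True) if name else None,
                "rating": rating.get_text(strip=True) if rating else None,
                # .get() reads an attribute and returns None when it is absent.
                "country": flag.get("tip") if flag else None,
            })
        rows.append(players)
    return rows


players_by_row = extract_players(sample)
print(players_by_row)
```

Text between tags comes from `get_text()`, while attribute values such as `tip` come from `.get("tip")` or `tag["tip"]`.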
I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why doesn't the following code return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected Output - The content of all table columns and rows:
The page you link to actually loads an iframe with the table in it. The URL of the document in the frame is:
http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br
If you use that URL, you'll see the <tbody>.
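Once the frame's document is the one being parsed, the rows can be read with the same BeautifulSoup calls. Below is a rough sketch that runs on a shortened copy of the markup quoted in the question, so it needs no network access. The `table_rows` helper is a hypothetical name, and it assumes the live frame has the same structure.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table markup quoted in the question.
sample = """
<table>
<tbody>
<tr class="rgRow GridBovespaItemStyle">
<td class="rgSorted" align="left"><span>ABEV3</span></td>
<td align="left"><span>AMBEV S/A</span></td>
<td align="left"><span>ON</span></td>
<td class="text-right"><span>4.354.228.928</span></td>
<td class="text-right"><span>3,003</span></td>
</tr>
</tbody>
</table>
"""


def table_rows(html):
    """Return the text of every cell, one list per <tr> inside <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        # A document without the table (such as an outer page that only
        # embeds an iframe) yields no rows.
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in tbody.find_all("tr")
    ]


rows = table_rows(sample)
print(rows)
```

With the frame URL from the answer, the call would be `table_rows(requests.get(url).text)`, provided the site still serves that page.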
I'm working on a web scraping project using the Selenium library, in which I need to extract data from some tables. As part of the project, I need to iterate over the table rows and extract the condition of each article's author, but it only works for the first row. The variable seems to keep the data from the first row and doesn't change on later iterations.
This is the relevant part of my code:
div_result = driver.find_element_by_class_name("result-body-paper")
papers = div_result.find_elements_by_tag_name("tr")
papers_information = []
for paper in papers:
    data = paper.find_elements_by_tag_name("td")
    result_title = data[1].text
    author = paper.find_element_by_xpath('//span[@data-paper-person="{id}"]'.format(id=person_id))
    try:
        first_author = author.find_element_by_tag_name("i").get_attribute("class")
    except:
        first_author = ""
    author_condition = "Helper"
    if first_author != "":
        if "pencil" in first_author:
            author_condition = "First Writer"
        if "asterisk" in first_author:
            author_condition = "Original Writer"
        if "star" in first_author:
            author_condition = "Original Worker"
    papers_information.append([author_condition, result_title])
Contrary to what I expect, author and first_author are always the same as they were for the first row of the table. The other parts work correctly.
Is this a bug or something?
By the way, this is the part of the HTML I'm trying to extract data from (just two of the table rows):
<tr class="zarEn selectable">
<td class="result row center" width="35">1</td>
<td class="result title ">Hepatic insulin resistance, metabolic syndrome and cardiovascular disease</td>
<td class="result author zarsmallEn" width="200">
<span data-paper-person="98155">
<a href="...">
<img src="..." class="person-avatar-mini">
<i class="fa fa-fw fa-pencil crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Clinical Biochemistry
</td>
<td class="result source_cs">
2.35
</td>
<td class="result published_year center">2009</td>
<td class="result citation center">217</td>
</tr>
<tr class="zarEn selectable">
<td class="result row center">2</td>
<td class="result title ">Molecular and cellular mechanisms linking inflammation to insulin resistance and β-cell dysfunction</td>
<td class="result author zarsmallEn">
<span data-paper-person="14144442">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-pencil lightgray absolute"></i>
</a>
</span>
<span data-paper-person="14137800">
<img src="...">
</span>
<span data-paper-person="98155">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-asterisk crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Translational Research
</td>
<td class="result source_cs">
4.26
</td>
<td class="result published_year center">2016</td>
<td class="result citation center">71</td>
</tr>
As you can see, the class names of the two <i> elements are different, but first_author gets the first one and never changes!
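One likely cause, offered as a reading of the code and not a tested diagnosis: an XPath that starts with `//` is evaluated from the document root even when it is called on a row element, so every iteration finds the first matching span on the page. Prefixing it with a dot, as in `.//span[@data-paper-person="..."]`, keeps the search inside the current row. The sketch below shows the same per-row lookup offline, using the standard library's `xml.etree.ElementTree` in place of Selenium. The `rows_xml` sample and both helper functions are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the two rows in the question (illustrative only).
rows_xml = """
<table>
  <tr>
    <td>Paper one</td>
    <td>
      <span data-paper-person="98155"><a><i class="fa fa-fw fa-pencil crimson absolute"/></a></span>
    </td>
  </tr>
  <tr>
    <td>Paper two</td>
    <td>
      <span data-paper-person="14144442"><a><i class="fa fa-fw fa-pencil lightgray absolute"/></a></span>
      <span data-paper-person="14137800"/>
      <span data-paper-person="98155"><a><i class="fa fa-fw fa-asterisk crimson absolute"/></a></span>
    </td>
  </tr>
</table>
"""


def author_condition(icon_class):
    """Map the <i> class string to a label like those used in the question."""
    if "pencil" in icon_class:
        return "First Writer"
    if "asterisk" in icon_class:
        return "Original Writer"
    if "star" in icon_class:
        return "Original Worker"
    return "Helper"


def conditions_for(person_id):
    """Return [condition, title] per row, searching only inside each row."""
    table = ET.fromstring(rows_xml)
    results = []
    for row in table.findall("tr"):
        title = row.findall("td")[0].text
        # The leading "." makes the path relative to this row, which is the
        # role ".//span[...]" would play in the Selenium call.
        span = row.find('.//span[@data-paper-person="{}"]'.format(person_id))
        icon = span.find(".//i") if span is not None else None
        icon_class = icon.get("class", "") if icon is not None else ""
        results.append([author_condition(icon_class), title])
    return results


print(conditions_for("98155"))
```

In the Selenium loop, the equivalent change would be `paper.find_element_by_xpath('.//span[@data-paper-person="{id}"]'.format(id=person_id))`.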
From this Deutsche Börse web page, under the table header Issuer I want to get the string content 'db X-trackers' in the cell next to the one with Name in it.
Using my web browser, I inspect that table area and get the code, which I've pasted into this XML tree just so that I can test my xPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my xPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
import re
from selenium.common.exceptions import NoSuchElementException

def get_sibling(driver, my_xpath):
    try:
        find_value = driver.find_element_by_xpath(my_xpath).text
    except NoSuchElementException:
        return None
    else:
        value = re.search(r"(.+)", find_value).group()
        return value
I don't believe anything is wrong in the function itself, so either the xPath must be faulty or there is something in the actual web page source code that throws it off.
When studying the actual Source code in Chrome, it looks a bit messier than what I see with Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my xPath (or function) wrong?
I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element, then get the following table element. In the table element we look for the td element with Name text and then get the following td sibling.
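Two details may matter on the real page. The raw source pads the cell text with newlines and spaces, so an exact comparison such as `. = "Name"` can miss it, while `normalize-space(.)` trims before comparing. Selenium's `find_element` calls also expect an element, so a path ending in `/text()` is likely to fail there. The sketch below checks the whitespace point offline with lxml on a shortened copy of the page source. The `ISSUER_NAME_XPATH` name is illustrative, and it assumes the live markup matches the excerpt.

```python
import lxml.html

# Shortened copy of the page source from the question, whitespace included.
source = """
<div class="box">
  <div class="row">
    <div class="col-lg-12">
      <h2>Issuer</h2>
    </div>
  </div>
  <div class="table-responsive">
    <table class="table">
      <tbody>
        <tr>
          <td >
            Name
          </td>
          <td class="text-right" >
            db X-trackers
          </td>
        </tr>
        <tr>
          <td >
            Product Family
          </td>
          <td class="text-right" >
            db X-trackers
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</div>
"""

# normalize-space(.) trims the padding before comparing, so "Name" matches
# even though the raw cell text is surrounded by newlines and spaces.
ISSUER_NAME_XPATH = (
    '//h2[normalize-space(.) = "Issuer"]'
    '/following::table[1]'
    '//td[normalize-space(.) = "Name"]'
    '/following-sibling::td[1]'
)

doc = lxml.html.fromstring(source)
cells = doc.xpath(ISSUER_NAME_XPATH)
issuer = cells[0].text_content().strip() if cells else None
print(issuer)

# The exact-match comparison finds no "Name" cell in this padded source.
exact = doc.xpath('//h2[. = "Issuer"]/following::table//td[. = "Name"]')
print(exact)
```

The same XPath string can be passed to `driver.find_element_by_xpath(...)`, reading `.text` from the returned element.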
I have made an HTML request from which I would like to retrieve specific elements, but I don't know how to access them with BeautifulSoup4.
Here is an example of the returned html:
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
I would like to access the element AF118324 (which is the name after the Identifiers span class).
How could I access it? (without using a substring method of course)
Does this work for you?
from bs4 import BeautifulSoup

html = '''
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
# Find the label span, then take the text node that follows it.
obj = soup.find('span', string='Identifiers:').next_sibling
print(obj)
Which prints:
AF118324[sampleid]
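The printed value still carries the surrounding whitespace and the `[sampleid]` suffix. If only the identifier is wanted, one option is to locate the text node structurally and then trim it. This does post-process the string, which the question hoped to avoid, but the sample markup has no tag around the identifier itself. The names `label`, `raw` and `identifier` are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<td valign="top" >
<span class="recordAttribute" >Taxonomy</span>: Mollusca, Gastropoda, Littorinimorpha, Hydrobiidae, Hydrobia<br>
<span class="recordAttribute" >Identifiers:</span> AF118324[sampleid] <br>
<span class="recordAttribute" >Depository</span>: Mined from GenBank, NCBI
</td>
"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", string="Identifiers:")
# next_sibling is the text node right after the span; strip its padding.
raw = label.next_sibling.strip()
# Keep what precedes the first "[", i.e. the identifier without its type tag.
identifier = raw.partition("[")[0]
print(identifier)
```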
I am trying to scrape the URL http://www.kat.ph/search/beatles/?categories[]=music using BeautifulSoup:
torrents = bs.findAll('tr',id = re.compile('torrent_*'))
torrents gets all the torrents on that page, so every element of torrents is a tr element.
My problem is that len(torrents[0].td) is 5, but I am not able to iterate over the td elements. Something like for x in torrents[0].td does not work.
The data that I am getting for torrents[0] is:
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell">
<div class="iaconbox floatedRight">
<a title="Torrent magnet link" href="magnet:?xt=urn:btih:0898a4b562c1098eb69b9b801c61a51d788df0f5&dn=the+beatles+2009+greatest+hits+cdrip+ikmn+reupld&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce" onclick="_gaq.push(['_trackEvent', 'Download', 'Magnet Link', 'Music']);" class="imagnet icon16"></a>
<a title="Download torrent file" href="http://torrage.com/torrent/0898A4B562C1098EB69B9B801C61A51D788DF0F5.torrent?title=[kat.ph]the.beatles.2009.greatest.hits.cdrip.ikmn.reupld" onclick="_gaq.push(['_trackEvent', 'Download', 'Download torrent file', 'Music']);" class="idownload icon16"></a>
<a class="iPartner2 icon16" href="http://www.downloadweb.org/checking.php?acode=b146a357c57fddd450f6b5c446108672&r=d&qb=VGhlIEJlYXRsZXMgWzIwMDldIEdyZWF0ZXN0IEhpdHMgQ0RSaXAtIGlLTU4gUmVVUGxk" onclick="_gaq.push(['_trackEvent', 'Download', 'Download movie']);"></a>
<a class="iverif icon16" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html" title="Verified Torrent"></a> <a rel="2962816,0" class="icomment" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html#comments_tab">
<span class="icommentdiv"></span>145
</a>
</div>
<div class="torrentname">
The <strong class="red">Beatles</strong> [2009] Greatest Hits CDRip- iKMN ReUPld
<span>
Posted by <a class="plain" href="/user/iKMN/">iKMN</a>
<img src="http://static.kat.ph/images/verifup.png" alt="verified" /> in
<span id="cat_2962816">
Music
</span></span>
</div>
</td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
I'd recommend using lxml instead of BeautifulSoup. Among other great features, it lets you use XPath to grab your links:
import lxml.html
doc = lxml.html.parse('http://www.kat.ph/search/beatles/?categories[]=music')
links = doc.xpath('//a[contains(@class,"idownload")]/@href')
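For the BeautifulSoup approach in the question: `torrents[0].td` is shorthand for the first <td> only, so its length counts that cell's children, not the row's cells. `find_all('td')` returns every cell in the row. Here is a minimal sketch on a cut-down row. The `row_html` sample is illustrative, and the anchored pattern is a suggested tightening of the original regular expression.

```python
import re
from bs4 import BeautifulSoup

# Cut-down version of the row shown in the question.
row_html = """
<table>
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell"><div class="torrentname">The Beatles [2009] Greatest Hits</div></td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
</table>
"""

soup = BeautifulSoup(row_html, "html.parser")
# Anchor the pattern so only ids that start with "torrent_" match.
torrents = soup.find_all("tr", id=re.compile(r"^torrent_"))

for row in torrents:
    # find_all("td") yields every cell; row.td would give only the first one.
    cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    print(cells)
```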