I've captured the following HTML using BS4, but can't seem to search for the artist cell.
I've assigned this block of code to a variable called container, and then tried
print container.tr.td["artist"]
without luck.
Any advice would be appreciated.
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">kool as the gang</td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
Your syntax is wrong: "artist" is the value of the "class" attribute, not the name of an attribute. Try this:
from bs4 import BeautifulSoup
html = """
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
kool as the gang </td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td',{'class': 'artist'})
print (td.text.strip())
Outputs:
kool as the gang
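To see why the original attempt fails, here is a minimal sketch (using a trimmed copy of the HTML from the question). Subscripting a tag looks up an attribute by name, so td["artist"] asks for an attribute called "artist", which does not exist. Note also that container.tr.td only ever reaches the first td in the row.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question
html = """
<tr class="item">
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">kool as the gang</td>
<td class="venue">100 club</td>
</tr>
"""
container = BeautifulSoup(html, "html.parser")

# .tr.td navigates to the FIRST td only, which is the date cell
first_td = container.tr.td
first_td_classes = first_td["class"]  # subscripting reads an attribute by name

# "artist" is a class value, not an attribute name, so this lookup fails
try:
    first_td["artist"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Searching by class value finds the right cell
artist = container.find("td", class_="artist").get_text(strip=True)
print(first_td_classes, lookup_failed, artist)
```

The class attribute comes back as a list because an element can carry several classes.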
Another way.
Use the select method to look for the elements within container whose class is 'artist'. select returns a list because there could be more than one match; since you know there is only one here, take the first element of the list and read its text attribute.
>>> HTML = open('sven.htm').read()
>>> import bs4
>>> container = bs4.BeautifulSoup(HTML, 'lxml')
>>> container.select('.artist')[0].text
'\n kool as the gang '
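A possible variation, sketched on an inline snippet rather than the sven.htm file: select_one returns the first match directly (or None when nothing matches), and get_text(strip=True) removes the surrounding whitespace seen in the output above.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the file used above
html = '<tr class="item"><td class="artist">\n kool as the gang </td></tr>'
container = BeautifulSoup(html, "html.parser")

matches = container.select(".artist")          # list of all matches
first = container.select_one(".artist")        # first match, or None
missing = container.select_one(".no-such-class")

# Guard against None before reading the text
artist = first.get_text(strip=True) if first is not None else None
print(len(matches), repr(artist), missing)
```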
I want to scrape several columns of text contained in td tags that share a common attribute. The td tags sit inside tr tags that also share a common attribute, inside a table with a specific class, inside a div.
For example, this is exactly how the website is structured.
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
.
.
.
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
I want to get the text enclosed in the td tags.
I have tried solving this problem with the code below:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.select(".stats_table")
all_data = [l.get_text(strip=True) for l in soup.select(".stats_table:has(> [data-row])")]
print(all_data)
But when I try to execute this code, I get an empty list. I need your help on this matter, thanks.
Why did your solution not work?
:has(> [data-row]) makes the selector match an element with class stats_table that directly contains an element with a data-row attribute. > is the child combinator, so only direct children count. In your HTML the parent of each tr with a data-row attribute is tbody, not the element with class stats_table, so nothing matches and you get an empty list.
If you name the actual parent in the selector, it matches as expected (the tr tag in the selector below is optional):
soup.select(".stats_table tbody:has(> tr[data-row])")
But this won't give you the expected output, because it selects the tbody element itself rather than the individual td tags. To get the expected output, follow the solution below.
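A small sketch of the difference, using a shortened copy of the table and html.parser as in the question. The first selector finds nothing, and the second finds the single tbody, whose text is all the cell values run together.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question
html = """
<table class="stats_table">
<tbody>
<tr data-row="0"><td data-stat="games">38</td><td data-stat="wins">29</td></tr>
<tr data-row="1"><td data-stat="games">38</td><td data-stat="wins">28</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The table's direct children are tbody and whitespace, none with data-row
direct = soup.select(".stats_table:has(> [data-row])")

# tbody does have tr[data-row] as direct children, so this matches the tbody
via_tbody = soup.select(".stats_table tbody:has(> tr[data-row])")

# The match is one element, so its text is every cell concatenated
tbody_text = [t.get_text(strip=True) for t in via_tbody]
print(len(direct), len(via_tbody), tbody_text)
```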
Solution
I see that you want the text of every td inside the rows that have a data-row attribute, within the table with class stats_table.
There are two ways to do this.
Using regex
import re
from bs4 import BeautifulSoup
html = '''
<div class="stats-table">
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# re.compile(r".*") matches any value, so this finds every td that has a data-stat attribute
datastats = soup.find_all("td", {"data-stat" : re.compile(r".*")})
for stat in datastats:
    print(stat.text)
which prints the expected values, one per line (written here as a list):
[38,29,6,3,93,38,28,8,2,92,38,5,7,26,22]
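As a side note, the regex is not strictly needed for "has this attribute with any value". A sketch of the same search on a shortened table, passing True as the attribute filter, which Beautiful Soup treats as "the attribute is present":

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question
html = """
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# True as the filter value matches any td that has a data-stat attribute
cells = soup.find_all("td", attrs={"data-stat": True})
values = [td.get_text(strip=True) for td in cells]
print(values)
```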
Using CSS Selector
The selector below selects every element that has a data-stat attribute inside the table with class stats_table. You can optionally put td in front of [data-stat], as in ("table.stats_table td[data-stat]").
datastats = soup.select("table.stats_table [data-stat]")
for stat in datastats:
    print(stat.text)
which prints the same values:
[38,29,6,3,93,38,28,8,2,92,38,5,7,26,22]
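If you would rather keep the values grouped by row than in one flat sequence, something along these lines might work. It is only a sketch on a shortened table, and it assumes every cell holds an integer, which is true for the sample but may not hold for the real page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question
html = """
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table.stats_table tr[data-row]"):
    # Map each data-stat name to its value; int() assumes numeric cells
    row = {
        td["data-stat"]: int(td.get_text(strip=True))
        for td in tr.select("td[data-stat]")
    }
    rows.append(row)

print(rows)
```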
You can find more information on CSS selectors in the Beautiful Soup documentation.
I have an HTML document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
I have used the code below, but it also returns the text from the first tr, which has no class, and I need to ignore that row:
soup.findAll('tr').text
Also, when I try to do just a class, this doesn't seem to be valid python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS selector to exclude the first <tr> tag and select the rest:
from bs4 import BeautifulSoup
# html is assumed to hold the document from the question
soup = BeautifulSoup(html, 'html.parser')
# Select every tr inside .product except the first one
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
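Since the question was really about keeping only the rows that have a class, another option is the attribute selector tr[class], which matches any tr that carries a class attribute regardless of its value. A sketch on a shortened copy of the document (the first row is wrapped in a td here to keep the sample tidy):

```python
from bs4 import BeautifulSoup

# Shortened copy of the document from the question
html = """
<div class='product'>
<table>
<tr><td>random stuff here</td></tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tr[class] keeps only rows that have a class attribute at all
texts = [tr.get_text(strip=True) for tr in soup.select(".product tr[class]")]
print(texts)
```

Unlike the nth-of-type approach, this does not depend on the unwanted row being first.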
I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why doesn't the following code return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected output: the content of all table columns and rows.
The page you link to actually loads an iframe with the table in it. The URL of the document in the frame is:
http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br
If you request that URL instead, the response contains the <tbody>.
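One way to discover frame URLs like this without opening the browser tools is to list the iframe sources found in the outer page. The sketch below runs on a made-up HTML string with an example.com base URL; the helper name iframe_urls is hypothetical, and it only works when the iframe tag is present in the raw HTML rather than injected by JavaScript.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(page_html, base_url):
    """Return absolute URLs for every iframe that has a src attribute."""
    soup = BeautifulSoup(page_html, "html.parser")
    # src=True keeps only iframes that actually declare a source
    return [urljoin(base_url, frame["src"]) for frame in soup.find_all("iframe", src=True)]


# Made-up stand-in for the outer page; not the real markup of the site
sample = '<html><body><iframe src="/indices/table.aspx?Indice=IBOV"></iframe></body></html>'
urls = iframe_urls(sample, "http://example.com/page.htm")
print(urls)
```

Each URL found this way can then be fetched with requests and parsed on its own.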
I'm trying to extract a value from an HTML table using bs4, however the structure of the table is in the form of:
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
The value I'm interested in is 575,42, however it has no id or other obvious identifier that bs4 could use to extract it.
How can I select this value, and by which identifier?
You can select the cell by any of its attributes. For example, using the class="celda400" attribute (response here is presumably the BeautifulSoup object):
response.find('td', {'class':"celda400"}).string
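One caveat with .string: it returns the cell's text node unchanged, including the newlines around the number. A quick sketch on a shortened copy of the cell, comparing it with get_text(strip=True):

```python
from bs4 import BeautifulSoup

# Shortened copy of the cell from the question
html = '''<td class="celda400" align="right">
575,42
</td>'''
soup = BeautifulSoup(html, "html.parser")

cell = soup.find("td", {"class": "celda400"})
raw = cell.string                     # text node as-is, whitespace included
clean = cell.get_text(strip=True)     # same text with whitespace removed
print(repr(raw), repr(clean))
```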
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,43
</td>
'''
doc = SimplifiedDoc(html)
texts = doc.selects('td.celda400').text
print (texts)
Result:
['575,42', '575,43']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
You can try this; it should be easy to follow:
from bs4 import BeautifulSoup
html_doc = """
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
875,42
</td>
"""
soup = BeautifulSoup(html_doc, 'lxml')
all_td = soup.find_all('td', {'class':"celda400"})
for td in all_td:
    value = td.text.strip()
    print(value)
I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is None.
Where is the error?
I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right element when doing percent = soup.find('td', {'id': 'percentageChange'}).
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the below:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())
This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
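Because find returns None when nothing matches, it can help to check the result before reading .text. Below is a small sketch on an inline snippet; the span markup is a stand-in modelled on the answer above, not the live page, and text_or_none is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Stand-in markup modelled on the span described above
html = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the search found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


# The td with this id is not in the markup, so find returns None
by_id = text_or_none(soup.find("td", {"id": "percentageChange"}))
# The span with data-field="CPC" is present
by_field = text_or_none(soup.find("span", {"data-field": "CPC"}))
print(by_id, by_field)
```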