Extracting a value from html table using BeautifulSoup

Extracting a value from html table using BeautifulSoup - python

I'm trying to extract a value from a html table using bs4, however the structure of the table is in the form of:
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
The value I'm interested in is 575,42, however it has no id or other identifier to be used by bs4 to be extracted.
How can I call this value? Or under what id?

You can use any of the attributes to extract. For example, to use the
class = "celda400" attribute
response.find('td', {'class':"celda400"}).string

Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,43
</td>
'''
doc = SimplifiedDoc(html)
texts = doc.selects('td.celda400').text
print (texts)
Result:
['575,42', '575,43']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples

You can try it. I think, you can understand it:
from bs4 import BeautifulSoup
html_doc = """
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
875,42
</td>
"""
soup = BeautifulSoup(html_doc, 'lxml')
all_td = soup.find_all('td', {'class':"celda400"})
for td in all_td:
value = td.text.strip()
print(value)

Related

BeautifulSoup how to only return class objects

I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
So i have used this code but i am getting the first text from the tr that's not a class, and i need to ignore it:
soup.findAll('tr').text
Also, when I try to do just a class, this doesn't seem to be valid python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.

To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED

BeautifulSoup4 - Requests - How to find TBODY classes?

I'm trying to retrieve data from the following website: http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm
Why the following code doesn't return anything?
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm').text
soup = BeautifulSoup(source, 'lxml')
soup.find('tbody')
Sample of the elements of the website:
<tbody>
<tr class="rgRow GridBovespaItemStyle" id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00__0" style="font-weight:normal;font-style:normal;text-decoration:none;">
<td class="rgSorted" align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>
</td><td align="left">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.354.228.928</span>
</td><td class="text-right">
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">3,003</span>
</td>
</tr>
</tbody>
Expected Output - The content of all table columns and rows:

The page you link to actually loads a iframe with the table in it. The URL of the document in the frame is http://bvmf.bmfbovespa.com.br/indices/ResumoCarteiraTeorica.aspx?Indice=IBOV&idioma=pt-br If you use that URL you'll see the <tbody>

How can I extract the name of a <tr> tag by using Beautiful Soup?

Given the following html document, how can I extract the string "CLN8"?
<tr data-table-chart-alt-symbol="CLN8" data-table-chart-symbol="#CL.1">
<td class="first text" data-field="symbol">OIL</td>
<td data-field="last"></td>
<td class="arrow" data-field="change_arrow"><div class="icon unch">---</div></td>
<td data-field="change"></td>
<td data-field="change_pct"></td>
<td data-field="volume"></td>
</tr>

You can use BeautifulSoup.find:
from bs4 import BeautifulSoup as soup
d = soup(html, 'html.parser').find('tr')['data-table-chart-alt-symbol']
Output:
'CLN8'

Beautiful Soup Parse Python

I've captured the following html using BS4, but can't seem to search for the artist tag.
I've assigned this block of code to a variable called container, and then tried
print container.tr.td["artist"]
without luck.
Any advice appreciated?
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">kool as the gang</td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>

Your syntax is wrong, "artist" is the value of the "class" attribute try this:
from bs4 import BeautifulSoup
html = """
<tr class="item">
<!-- <td class="image"><img src="https://www.stargreen.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/K/o/KoolAsTheGang.jpg" width="135" height="135" alt="KOOL AS THE GANG " /></td> -->
<td class="date">Sat, 30 Dec 2017</td>
<td class="artist">
kool as the gang </td>
<td class="venue">100 club</td>
<td class="link">
<p class="availability out-of-stock">
<span>Off Sale</span></p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td',{'class': 'artist'})
print (td.text.strip())
Outputs:
kool as the gang

Another way.
Look for the element within container whose class is 'artist' with the select method. Since there could be more than one, but you know there is only one, select the only element in the list, and request its text attribute.
>>> HTML = open('sven.htm').read()
>>> import bs4
>>> container = bs4.BeautifulSoup(HTML, 'lxml')
>>> container.select('.artist')[0].text
'\n kool as the gang '

Using Python with BeautifulSoup to extract numbers (multiple spans and classes)

I am trying to use Python with BeautifulSoup in order to pull multiple numbers from a web page. I know I am doing something wrong though because my script is returning an empty array. The fact that there are multiple spans and classes confuses me as well. Here is a sample of the HTML data I am working with:
<td class="confluenceTd" colspan="1">
<span>
Autoworks
</span>
</td>
<td class="confluenceTd" colspan="1">
900009
</td>
<td class="confluenceTd" colspan="1">
<p>
uyi: 3456778, 33344778, 11199087
</p>
<p>
PRY: 54675389
</p>
</td>
<td class="confluenceTd" colspan="1">
AutoNone
</td>
<td class="confluenceTd" colspan="1">
9998887
</td>
<td class="confluenceTd" colspan="1">
<p>
YUN: 232323, 6788889, 78695554
</p>
<p>
IOY: 3444666, 2343233, 1232322
</p>
</td>
Here is my Python code:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user': "user1", 'password':
'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("span", {"class":"confluenceTd"})
print data
Again, I am only trying to pull the actual numbers. Any help would be greatly appreciated. Thanks.

if you like to get all numbers present under specific class use regex/regular expressions to pull numbers and make sure requests is pulling html
import requests,re
from bs4 import BeautifulSoup
s = requests.Session()
s.post('https://wiki.example.com/login', data={'user':"user1",'password': 'pass1'})
r = s.get('https://wiki.example.com/example/section')
data_payload = r.content
soup = BeautifulSoup(data_payload, 'html.parser')
data = soup.findAll("td", {"class":"confluenceTd"})
for d in data:
m=re.search('([0-9]+)',str(d.findAll(text=True)))
if m:
print m.group(0)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting a value from html table using BeautifulSoup - python

You can use any of the attributes to extract. For example, to use the class = "celda400" attribute response.find('td', {'class':"celda400"}).string

Related

BeautifulSoup how to only return class objects

BeautifulSoup4 - Requests - How to find TBODY classes?

How can I extract the name of a <tr> tag by using Beautiful Soup?

Beautiful Soup Parse Python

Using Python with BeautifulSoup to extract numbers (multiple spans and classes)

Categories

Resources