Beautifulsoup add attribute to first <td> item in a table - python

I would like to get a table's HTML from a website with BeautifulSoup, and I need to add an attribute to the first <td> element. I have:
try:
    description=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
except:
    description=None
The selected description's code:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I would like to add a colspan attribute to the first <td> and keep changes in the description variable:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="" colspan="4">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I tried:
hun=BeautifulSoup(f,'html.parser')
try:
    description2=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    description = soup.td['colspan'] = 4
...but it does not work: the output is "4" instead of the table's HTML with the attribute added.
I found it. The attribute assignment and the assignment to description must be separate statements:
hun=BeautifulSoup(f,'html.parser')
try:
    # str() turns the selected Tag into markup so the <style> block can be appended
    description2=str(hun.select('#description > div.tab-pane-body > div > div > div > table')[0])
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    soup.td['colspan'] = 4
    description = soup
except:
    description=None
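The first attempt produced 4 because of Python's chained assignment: in a = b = value, the value on the far right is bound to every target. A minimal sketch on a shortened, made-up table (the html string and the name wrong exist only for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table from the question (illustrative only).
html = (
    '<table><tr><td width="704">Short description</td></tr>'
    '<tr><td width="123">Additional data</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# Chained assignment: 4 is assigned to BOTH targets, so `wrong` is the int 4.
wrong = soup.td["colspan"] = 4
print(type(wrong).__name__, wrong)

# The tree itself was still modified in place; bind it in a separate statement.
description = soup
print(description)
```

The attribute is set either way; only the name on the left ends up holding the wrong object.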

Just select the first <td> and add the colspan attribute:
from bs4 import BeautifulSoup
html_doc = '''\
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html_doc, 'html.parser')
soup.td['colspan'] = 4
print(soup.prettify())
Prints:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td colspan="4" style="" valign="top" width="704">
<p>
<span>
Short description
</span>
</p>
</td>
</tr>
<tr>
<td style="" valign="top" width="123">
<p>
<span>
Additional data
</span>
</p>
</td>
</tr>
</tbody>
</table>
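If the result is needed as a string with the <style> block from the question appended, one possible approach is to modify the Tag first and serialize it afterwards with str(). The sketch below uses a shortened table and a shortened style rule; both are stand-ins, not the real page.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page fragment (assumption, not the real site).
html_doc = """<div id="description"><table width="704">
<tr><td valign="top" width="704">Short description</td></tr>
<tr><td valign="top" width="123">Additional data</td></tr>
</table></div>"""
STYLE = "<style type=text/css>td:first-child { font-weight: bold; }</style>"

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.select_one("#description table")

if table is not None and table.td is not None:
    # table.td is the first <td> below this table; edit it in place.
    table.td["colspan"] = "4"
    # Serialize only after the edit, then append the extra markup.
    description = str(table) + STYLE
else:
    description = None
print(description)
```

Checking for None explicitly avoids the bare except from the question, which would also hide unrelated errors.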

Related

Returning None when scraping href using Python

Hi, I'm trying to scrape the product name "151 Heavy Duty Rubber Gloves - Ex Large" from a table with the following markup, copied from the browser inspector. Can someone please help with the right Python code?
[<table border="0" class="ProductBox" id="Added0">
<tr>
<td align="center" colspan="2">
<div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div>
</td></tr><tr>
<td align="center" colspan="2" height="60px;" valign="top">
<div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div>
</td>
</tr>
<tr>
<td align="center" colspan="2" height="185">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
<img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a>
</td>
</tr>
<tr>
<td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td>
<td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
POR 0%
</td>
<td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
VAT 20%
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
**151 Heavy Duty Rubber Gloves - Ex Large**</a></td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
1s x 1
</td>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;">
<div class="tooltip">
<div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;">
<span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div>
</div>
<span class="OKStatus">In Stock </span>
</td>
</tr>
<tr>
<td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
<table style="margin-top : 10px;" width="100%">
<tr>
<td>
<img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/>
</td>
<td>
<input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/>
</td>
<td>
<img align="middle" alt="Add 1 To Qty" src="/images/add.png"/>
</td>
<td align="right">
<button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button>
</td>
</tr>
</table>
I tried the following:
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
    links = [link.get('href') for link in i.find_all('a')]
    print(links)
which unfortunately returns: ['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']
You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
DATA = """<table border="0" class="ProductBox" id="Added0">
<tr>
...
</table>"""
from bs4 import BeautifulSoup
from typing import Optional
def extract_name(data: str) -> Optional[str]:
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        # Strip whitespace, then the literal asterisks, then whitespace again.
        return links[0].text.strip().strip("*").strip()
    else:
        return None
print(extract_name(DATA))
# With the live page, as in the question (s and url come from there):
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
# extract_name() expects markup as a string, so serialize the Tag first.
text = extract_name(str(tables[0]))
Output: 151 Heavy Duty Rubber Gloves - Ex Large
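To collect the link target together with the product name, the same selector can return both from each matching <a>. A small sketch on a trimmed copy of the markup (the SAMPLE string is an abbreviated stand-in for the real page, and the shortened href is an assumption):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the product box in the question.
SAMPLE = """<table class="ProductBox" id="Added0">
<tr><td colspan="2"><a href="/products/DetailsPortal.asp?product_code=104373">
<img alt="" src="/~uldir/104373t.jpg"/></a></td></tr>
<tr><td class="ProdDetails" colspan="2">
<a href="/products/DetailsPortal.asp?product_code=104373">
151 Heavy Duty Rubber Gloves - Ex Large</a></td></tr>
</table>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Only anchors inside td.ProdDetails that actually carry an href attribute.
products = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("td.ProdDetails a[href]")
]
for name, href in products:
    print(name, "->", href)
```

The image link in the first row is skipped because its <td> does not have the ProdDetails class.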

Python Web Scraping - HTML error returning incomplete

When I run my code, the HTML comes back with data missing. What could be causing this?
Everything worked fine until I changed the code to use Selenium expected conditions.
The code is not complete because the full version was not accepted here, but I think it shows what is happening.
navegador = webdriver.Firefox(options = options)
wait = WebDriverWait(navegador, 30)
link = '******'
navegador.get(url = link)
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtLogin"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtSenha"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_btnEnviar"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_TreeView2t8"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[title='07 de dezembro']"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"]/option[2]'))).click()
teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')
soup = BeautifulSoup(teste, "html.parser")
I get the following back.
<table align="center" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid" width="100%">
<tbody><tr>
<td>
<table>
<tbody><tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_Label1" style="font-size:12px;">Terminal - Empresa - Exportador:</span>
</td>
<td>
<select class="TextBox" id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa" name="ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa" onchange="javascript:setTimeout('__doPostBack(\'ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa\',\'\')', 0)" style="width: 475px;">
<option selected="selected" value="0">Selecione um Terminal.</option>
<option value="68623">TEAG - CARGILL - 04 CARGILL AGRICOLA S A - GUARUJA - SP</option>
<option value="68594">TEG - CARGILL - 04 CARGILL AGRICOLA S A - GUARUJA - SP</option>
</select>
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_lbl_titulo_principal" style="font-size:12px;">Disponibilização de vagas do dia: 07/12/2022</span></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
</td>
</tr>
<tr>
This is what I should get back:
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
<div id="ctl00_ctl00_Content_Content_pn_turno_1" style="width:100%;">
<table width="100%" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid">
<tbody><tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_lbl_turno_1">Turno 01 - intervalo: 7/12/2022 0:00:00 as 7/12/2022 1:00:00</span></td>
</tr>
<tr>
<td style="height:200px;width: 100%;" valign="top">
<table border="0" class="Grid" cellpadding="4" cellspacing="2" style="font-size:14;width: 100%;z-index: -1;">
</table>
<table border="0" class="Grid" cellpadding="3" cellspacing="2" style="font-size:14;width: 100%">
<tbody><tr class="GridRow">
<td width="12%" align="center">
<span id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_lblEmpresaTerminal_1" title="TEAG - CARGILL - 04 CARGILL AGRICOLA S A - GUARUJA - SP" style="font-size:7px;">CARGILL - TEAG</span>
<input type="image" name="ctl00$ctl00$Content$Content$rpt_turno_1$ctl01$imb_vaga_1" id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_imb_vaga_1" title="Vaga agendada." src="../App_Themes/SisLog/Images/caminhao.png" onclick="javascript:window.open('Cadastro.aspx?id_agenda=7054462&id_turno=7/12/2022 0:00:00;7/12/2022 1:00:00&data=07/12/2022&id_turno_exportador=198574&id_turno_agenda=61348&id_transportadora=23213&id_turno_transp=68623&id_Cliente=7708&codigo_terminal=7708&codigo_empresa=1&codigo_exportador=24978&codigo_transportador=23213&codigo_turno=1&turno_transp_vg=68623','_blank','height=850,width=1000,top=(screen.width)?(screen.width-1000)/2 : 0,left=(screen.height)?(screen.height-700)/2 : 0,toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=yes,resizable=no');" style="height:20px;border-width:0px;">
</td>
Since you did not share a link to the page you are working on, we can only guess at the cause.
My guess is that you are reading the element's HTML before it has been fully rendered.
To fix this, try changing presence_of_element_located to visibility_of_element_located in the line teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML'), so that it becomes:
teste = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')
If that is not enough, try adding a short delay before reading the HTML (this needs import time):
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
If the element is not visible, so that visibility_of_element_located cannot be applied to it, use presence_of_element_located with a delay:
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
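A fixed time.sleep(2) either wastes time or is still too short on a slow day. One alternative, sketched below as a plain-Python helper, is to keep re-reading the element's HTML until it stops changing. The helper name wait_until_stable and its parameters are hypothetical and not part of Selenium; the commented usage line assumes the navegador driver from the question.

```python
import time
from typing import Callable


def wait_until_stable(read: Callable[[], str], timeout: float = 30.0,
                      interval: float = 0.5, settle: int = 3) -> str:
    """Poll read() until it returns the same value `settle` times in a row."""
    deadline = time.monotonic() + timeout
    last = read()
    same = 1
    while same < settle:
        if time.monotonic() > deadline:
            raise TimeoutError("content did not stabilise in time")
        time.sleep(interval)
        current = read()
        if current == last:
            same += 1
        else:
            # Content changed: remember it and restart the count.
            last, same = current, 1
    return last


# Hypothetical usage with the driver from the question:
# teste = wait_until_stable(
#     lambda: navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute("innerHTML")
# )

# Demo with a fake reader that "finishes rendering" after three reads.
snapshots = iter(["<tr>", "<tr><td>", "<tr><td>ok</td></tr>"])
seen = [""]


def fake_read() -> str:
    seen[0] = next(snapshots, seen[0])
    return seen[0]


result = wait_until_stable(fake_read, interval=0)
print(result)
```

This only detects that the markup has stopped changing for a few polls; it cannot know whether the page would have loaded more data later.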

I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project

I have this HTML code for example:
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
<span style="font-size: 8pt; color: #78578e"> MCO Date : 10-Aug-2018</span>
<br />
</td>
<td>
<h3 style="font-size: 14pt;"> </h3>
<h2> <br /> </h2>
</td>
</tr>
<tr>
<td style="margin-top: 0;" class="border-b">
<br />
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
</tr>
<tr style="height: 7%;" class="border-tb">
<td style="width: 130px" class="border-r">
<h4>METER NO</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PREVIOUS READING</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PRESENT READING</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>MF</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>UNITS</h4>
</td>
<td>
<h4>STATUS</h4>
</td>
</tr>
<tr style="height: 30px" class="content">
<td class="border-r">
3-P I 3301539<br> I 3301539<br> E 3301539<br> E 3301539<br>
</td>
<td class="border-r">
78693<br>16823<br>19740<br>8<br>
</td>
<td class="border-r">
80086<br>17210<br>20139<br>8<br>
</td>
<td class="border-r">
1<br>1<br>1<br>1<br>
</td>
<td class="border-r">
1393<br>387<br>399<br>0<br>
</td>
<td>
</td>
</tr>
<tr id="roshniMsg" style="height: 30px" class="content">
<td colspan="6">
<div style="width: 452pt">
<img style="max-width: 100%; max-height: 35%" src="/images/companies/iesco/roshniMsg.jpg"
alt="Roshni Message" />
</div>
</td>
</tr>
</table>
From this table I want to extract the paragraph, and from that paragraph I want to get all the span tags.
I used soup.find_all() to get the table, but I don't know how to continue the search from that result, rather than from the original soup object, so that I can find the paragraph and then the span tags inside it.
This is the Python code I wrote:
soup = BeautifulSoup(string, 'html.parser')
#Getting the table tag
results = soup.find_all('table', attrs={'class':'nested4'})
#Getting the paragraph tag
results = soup.find_all('p', attrs={'style':'margin: 0; text-align: left; padding-left: 5px'})
#Getting all the span tags
results = soup.find_all('span', attrs={})
I just want help with getting the paragraphs within the table, and then the spans within each paragraph; at the moment I get the spans from the whole of the original HTML. I don't know how to run find_all on the list of bs4 objects that the previous call returned.
from bs4 import BeautifulSoup
html = '''
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
'''
soup = BeautifulSoup(html, 'html.parser')
spans = soup.select_one('table.nested4').select('span')
for span in spans:
    print(span.text)
This returns:
NAME & ADDRESS
MUHAMMAD AMIN
S/O MUHAMMAD KHAN
H-NO.38 MARGALLA ROAD
F-6/3 ISLAMABAD3
If you have one table:
soup = BeautifulSoup(string, 'html.parser')
table = soup.find('table', attrs={'class': 'nested4'})
p = table.find('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'})
results = p.find_all('span')
for result in results:
    print(result.get_text(strip=True))
If you have a list of tables:
soup = BeautifulSoup(string, 'html.parser')
for table in soup.find_all('table', attrs={'class': 'nested4'}):
    for p in table.find_all('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'}):
        for span in p.find_all('span'):
            print(span.get_text(strip=True))
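The nested lookups can also be written as one CSS selector, where each space means "somewhere inside". A short sketch on a reduced, made-up version of the table (the html string is a stand-in, and the extra <p> outside the table is added only to show that it is ignored):

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the bill table from the question.
html = """<table class="nested4"><tr><td>
<p style="margin: 0"><span>NAME &amp; ADDRESS</span><br/><span>MUHAMMAD AMIN </span><br/><span></span></p>
</td><td><span style="font-size: 8pt"> MCO Date : 10-Aug-2018</span></td></tr></table>
<p><span>outside the table</span></p>"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinators scope the search:
# spans inside a <p> that is itself inside table.nested4.
texts = [s.get_text(strip=True) for s in soup.select("table.nested4 p span")]
# Drop spans that are empty after stripping.
texts = [t for t in texts if t]
print(texts)
```

The span holding the MCO date is left out because it is not inside a <p>.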

How to handle nested html tables with beautifulsoup?

I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this as it's giving me an AssertionError...trying to load 4 columns when there are only 3 columns in the data frame.
Here is the beginning of the html table showing the headers and the first row of data:
<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>
Below is the code that I wrote to read the data into a dataframe:
table = soup.find('table', id = 'tableid1')
table_rows = table.find_all('tr')
allData=[]
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    allData.append(row)
headers = allData.pop(0)
self.d1_bundle_df = pd.DataFrame(allData, columns = headers)
When the above code is running, it generates the following error:
AssertionError: 3 columns passed, passed data had 4 columns
What's the best way to handle these nested tables?
This is still relatively new to me, so any direction would be greatly appreciated.
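The problem is that you search each row for all <td> elements, and find_all also descends into the nested table, so the inner <td> is returned in addition to the outer one that contains it. One solution is to use CSS selectors and select only the <td> elements that do not contain another <td>: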
data = '''<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('#tableid1 > tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])
from pprint import pprint
pprint(rows)
Prints:
[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]
The CSS selector #tableid1 > tr will search for all <tr> that are directly under the tag with id=tableid1
The CSS selector td:not(:has(td)) will search for all <td> that don't contain other <td>.
Further reading:
CSS Selectors Reference
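Another way to keep nested cells out of the result is to pass recursive=False to find_all, which looks only at direct children. The sketch below uses a condensed copy of the table and the html.parser backend; it assumes the rows are direct children of the table, which would not hold if the markup or the parser supplies a <tbody>.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the table from the question.
data = """<table id="tableid1">
<tr class="header"><td>Bundle Name</td><td>Insulation Name / Layer / Layer PN</td><td>Bundle Width</td></tr>
<tr><td>BN100175-100861</td>
<td><table><tr><td>B29* / 10 / POLYETHYLENE_CONDUIT</td></tr></table></td>
<td>25.53825</td></tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", id="tableid1")

rows = []
# recursive=False: only rows that are direct children of the outer table.
for tr in table.find_all("tr", recursive=False):
    # Only the row's own cells; the nested table stays inside its parent cell.
    cells = tr.find_all("td", recursive=False)
    rows.append([td.get_text(" ", strip=True) for td in cells])

headers, body = rows[0], rows[1:]
print(headers)
print(body)
```

Each outer cell still yields the text of its nested table through get_text, so every row has exactly three values and can be passed to pd.DataFrame(body, columns=headers).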

Scraping table with BeautifulSoup4

I am trying to scrape some particular rows inside a table, but I don't know how to access the information properly. Here is the HTML:
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Michael</td>
<td class="right">57</td>
<td class="right">0</td>
<td class="right">5</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">1 </td>
<td>Clara</td>
<td class="right">48</td>
<td class="right">0</td>
<td class="right">5</td>
</tr>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Lisa</td>
<td class="right">44</td>
<td class="right">2</td>
<td class="right">5</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Joe</td>
<td class="right">43</td>
<td class="right">0</td>
<td class="right">13</td>
</tr>
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>John</td>
<td class="right">38</td>
<td class="right">3</td>
<td class="right">4</td>
</tr>
<tr class="odd">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Francesca</td>
<td class="right">35</td>
<td class="right">2</td>
<td class="right">5</td>
</tr>
<tr class="even">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Carlos</td>
<td class="right">27</td>
<td class="right">1</td>
<td class="right">2</td>
</tr>
What I am trying to obtain is the text of the td that comes right after every td whose style uses the color F5645C, but unfortunately I am running into problems.
This is what I want the script to return:
Michael
Lisa
John
Francesca
Here is the code I currently have:
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find('td', style='background: #F5645C; color: #F5645C;').find_next_sibling('td').get_text()
    print(td)
Running the script raises: AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
You can use a CSS selector to select all <td> tags whose style attribute contains the string color: #F5645C, and then call find_next() on each of them:
for td in soup.select('td[style*="color: #F5645C"]'):
    print(td.find_next('td').text)
This prints:
Michael
Lisa
John
Francesca
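The AttributeError in the question comes from rows that have no matching cell: tr.find() returns None for them, and None has no find_next_sibling. A minimal sketch that keeps the row loop but skips such rows, using a three-row stand-in for the table:

```python
from bs4 import BeautifulSoup

# Three-row stand-in for the table from the question.
html = """<table>
<tr class="even"><td style="background: #F5645C; color: #F5645C;">1 </td><td>Michael</td></tr>
<tr class="odd"><td style="background: #8FB9B0; color: #8FB9B0;">1 </td><td>Clara</td></tr>
<tr class="even"><td style="background: #F5645C; color: #F5645C;">1 </td><td>Lisa</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

names = []
for tr in soup.find_all("tr"):
    marker = tr.find("td", style="background: #F5645C; color: #F5645C;")
    if marker is None:
        # This row has no cell with the wanted style; skip it.
        continue
    names.append(marker.find_next_sibling("td").get_text(strip=True))
print(names)
```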
data = BeautifulSoup(html, 'html.parser')
for tr in data.find_all('tr'):
    td = tr.find_all('td')
    print(td[1].text)
Now you can take it further, I think.
Use .find_next("td").text
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tr in soup.find_all("tr"):
    print(tr.td.find_next("td").text)
Output:
Michael
Clara
Lisa
Joe
John
Francesca
Carlos
You can use find_all with a filter on the style attribute (the string must match the attribute value exactly):
bs = BeautifulSoup(htmlcontent, 'html.parser')
bs.find_all('td', attrs={'style': 'background: #F5645C; color: #F5645C;'})
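An exact string filter stops matching as soon as the whitespace or the order of the declarations in the style attribute changes. A more tolerant variant, sketched here on a three-row stand-in table, passes a compiled regular expression as the attribute filter:

```python
import re

from bs4 import BeautifulSoup

# Three-row stand-in for the table from the question.
html = """<table>
<tr class="even"><td style="background: #F5645C; color: #F5645C;">1 </td><td>Michael</td></tr>
<tr class="odd"><td style="background: #8FB9B0; color: #8FB9B0;">1 </td><td>Clara</td></tr>
<tr class="even"><td style="background: #F5645C; color: #F5645C;">1 </td><td>Lisa</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Matches "color: #F5645C" with any amount of whitespace, in either letter case.
red = re.compile(r"color:\s*#F5645C", re.IGNORECASE)

names = [
    td.find_next_sibling("td").get_text(strip=True)
    for td in soup.find_all("td", style=red)
]
print(names)
```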
