How to handle nested html tables with beautifulsoup? - python

I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this: it raises an AssertionError because the code tries to load 4 columns when there are only 3 columns in the data frame.
Here is the beginning of the html table showing the headers and the first row of data:
<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>
Below is the code that I wrote to read the data into a dataframe:
table = soup.find('table', id = 'tableid1')
table_rows = table.find_all('tr')
allData=[]
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    allData.append(row)
headers = allData.pop(0)
self.d1_bundle_df = pd.DataFrame(allData, columns = headers)
When the above code is running, it generates the following error:
AssertionError: 3 columns passed, passed data had 4 columns
What's the best way to handle these nested tables?
This is still relatively new to me, so any direction would be greatly appreciated.

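The problem is that you search each row for all <td> tags, but in your case some of these <td> tags contain other <td> tags (the nested table), so the outer row yields 4 cells instead of 3. One solution is to use CSS selectors and select only the <td> tags that don't contain other <td> tags: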
data = '''<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('#tableid1 > tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])
from pprint import pprint
pprint(rows)
Prints:
[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]
The CSS selector #tableid1 > tr selects every <tr> that is a direct child of the tag with id=tableid1, so the rows of the nested table are skipped.
The CSS selector td:not(:has(td)) selects every <td> that doesn't contain another <td>.
Further reading:
CSS Selectors Reference
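As an alternative to the CSS selectors, the same rows can probably be collected with find_all(..., recursive=False), which only looks at direct children. This is a minimal sketch, assuming the markup is parsed with html.parser (a parser that inserts <tbody>, such as html5lib, would need the extra level handled); the shortened HTML string is an illustrative stand-in for the question's table:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (styling attributes dropped).
html = '''<table id="tableid1">
<tr class="header"><td>Bundle Name</td><td>Insulation Name / Layer / Layer PN</td><td>Bundle Width</td></tr>
<tr><td>BN100175-100861</td>
<td><table><tr><td>B29* / 10 / POLYETHYLENE_CONDUIT</td></tr></table></td>
<td>25.53825</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='tableid1')

rows = []
# recursive=False restricts the search to direct children,
# so the <tr> and <td> of the nested table are not visited separately.
for tr in table.find_all('tr', recursive=False):
    cells = tr.find_all('td', recursive=False)
    # get_text() still includes the text of the nested table inside the cell.
    rows.append([td.get_text(strip=True) for td in cells])

headers, data = rows[0], rows[1:]
print(headers)
print(data)
```

Each row now has exactly three cells, so headers and data can be passed to pd.DataFrame(data, columns=headers) as in the question.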

Related

Beautifulsoup add attribute to first <td> item in a table

I would like to get a table's HTML code from a website with BeautifulSoup, and I need to add an attribute to the first <td> item. I have:
try:
    description=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
except:
    description=None
The selected description's code:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I would like to add a colspan attribute to the first <td> and keep changes in the description variable:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="" colspan="4">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I tried:
hun=BeautifulSoup(f,'html.parser')
try:
    description2=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    description = soup.td['colspan'] = 4
...but it is not working: the output is "4" instead of the table's HTML code with the attribute added.
I found it. It must be like this:
hun=BeautifulSoup(f,'html.parser')
try:
    description2=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    soup.td['colspan'] = 4
    description = soup
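Select the first <td> and set its colspan attribute. In the original attempt, the chained assignment description = soup.td['colspan'] = 4 binds both targets to 4, which is why description came out as 4: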
from bs4 import BeautifulSoup
html_doc = '''\
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html_doc, 'html.parser')
soup.td['colspan'] = 4
print(soup.prettify())
Prints:
<table border="0" cellpadding="0" cellspacing="0" width="704">
 <tbody>
  <tr>
   <td colspan="4" style="" valign="top" width="704">
    <p>
     <span>
      Short description
     </span>
    </p>
   </td>
  </tr>
  <tr>
   <td style="" valign="top" width="123">
    <p>
     <span>
      Additional data
     </span>
    </p>
   </td>
  </tr>
 </tbody>
</table>
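To also keep the <style> block from the question together with the modified table, one option is to build it as a tag rather than concatenating a string onto a Tag object. This is a rough sketch, assuming html.parser and a shortened version of the table; the variable name description follows the question, and the CSS rule is abbreviated:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the selected table (most attributes dropped).
html_doc = '''<table width="704">
<tbody>
<tr><td width="704"><p><span>Short description </span></p></td></tr>
<tr><td width="123"><p><span>Additional data</span></p></td></tr>
</tbody>
</table>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Set the attribute in its own statement; soup.td is the first <td> in the document.
soup.td['colspan'] = '4'

# Create the <style> element as a real tag and place it right after the table.
style = soup.new_tag('style')
style['type'] = 'text/css'
style.string = 'td:first-child { font-weight: bold; width: 5%; }'
soup.table.insert_after(style)

# str() serializes the modified tree back to HTML.
description = str(soup)
print(description)
```

Only the first <td> receives the colspan attribute; the second row is left unchanged.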

Regular expression to match field name value pairs from html

I'm trying to parse an HTML email from python code to extract various details and would appreciate a regular expression or two to help achieve this as it is too complex for my limited regex understanding. e.g. look for 'Travel Date' and extract 'October 30 2018 (Tue)'.
In all cases there is a field name contained within <td> tags followed by the field value contained within another set of <td> tags. Sometimes the name and value are contained within the same row <tr> tags (Case 1) and other times they are in separate row tags (Case 2). Other items like <span> and <img> need to be skipped over as well.
Case 1
<tr>
<td colspan="2"> </td></tr>
<tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #777777;">Travel Date</td>
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #444444;">October 30 2018 (Tue)</td>
</tr>
Case 2
<tr><td style="vertical-align: top;">
<span style="font-size: 10px; font-family: Arial; color: #999999; font-weight: bold; line-height: 19px; text-transform: uppercase;">Drop-off to Address</span>
</td></tr>
<tr><td style="vertical-align: top;">
<span style="font-size: 13px; font-family: Arial; color: #444444;"><img style="vertical-align:text-bottom;" src="https://d1lk4k9zl9klra.cloudfront.net/Email/Common/address_icon.png" alt="" width="14" height="14" /> 200 George St, Sydney NSW 2000, Australia</span>
</td></tr>
Instead of using regex, I would use Beautiful Soup. It makes it easier to go through HTML elements and scrape what you need. If you know the relationship between the key and value, then you could use that to extract information. Here's an example for case 1:
In [8]: from bs4 import BeautifulSoup
In [9]: text = """
...: <tr>
...: <td colspan="2"> </td></tr>
...: <tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#777777;">Travel Date</td>
...: <td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#444444;">October 30 2018 (Tue)</td>
...: </tr>"""
In [11]: soup = BeautifulSoup(text, 'lxml')
In [13]: soup.find_all('td')
Out[13]:
[<td colspan="2"> </td>,
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#777777;">Travel Date</td>,
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#444444;">October 30 2018 (Tue)</td>]
In [15]: for tag in soup.find_all('td'):
...:     if tag.text == "Travel Date":
...:         print(tag.find_next('td').text)
...:
October 30 2018 (Tue)
Beautiful Soup gives a lot of flexibility when scraping HTML from the web.
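The same idea can be wrapped in a small helper that covers both Case 1 and Case 2, since in both layouts the value sits in the next <td> after the one holding the field name. This is only a sketch: field_value is a hypothetical helper name, the style attributes are shortened, and the image URL is abbreviated to a placeholder file name:

```python
from bs4 import BeautifulSoup

# Shortened versions of Case 1 and Case 2 from the question.
html = '''<tr>
<td colspan="2"> </td></tr>
<tr><td style="color: #777777;">Travel Date</td>
<td style="color: #444444;">October 30 2018 (Tue)</td>
</tr>
<tr><td style="vertical-align: top;">
<span style="text-transform: uppercase;">Drop-off to Address</span>
</td></tr>
<tr><td style="vertical-align: top;">
<span><img src="address_icon.png" alt="" width="14" height="14" /> 200 George St, Sydney NSW 2000, Australia</span>
</td></tr>'''

def field_value(soup, label):
    """Return the text of the <td> that follows the <td> whose text equals label."""
    for td in soup.find_all('td'):
        # get_text(strip=True) ignores wrappers such as <span> and surrounding whitespace.
        if td.get_text(strip=True) == label:
            value_td = td.find_next('td')
            return value_td.get_text(strip=True) if value_td is not None else None
    return None

soup = BeautifulSoup(html, 'html.parser')
print(field_value(soup, 'Travel Date'))
print(field_value(soup, 'Drop-off to Address'))
```

Because find_next('td') walks forward in document order, it does not matter whether the value cell is in the same <tr> or in the following one.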

Searching BeautifulSoup after text, need to get all data from table row

I have a table like this:
<table id="test" class="tablesorter">
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Major Lazer</td>
<td class="right">64</td>
<td>93.1.15.107</td>
<td>0x0110000105DAB310</td>
<td class="center">No</td>
<td class="center">No</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Michael gunin</td>
<td class="right">64</td>
<td>57.48.41.27</td>
<td>0x0110000631HDA213</td>
<td class="center">No</td>
<td class="center">No</td>
</tr>
...
</table>
This table has over 100 rows, all in the same format. I want to search for the long ID, find the table row that contains it, and get the IP and name from that row.
For example, search for: 0x0110000105DAB310
Then find the table row in which this text appears, and grab the rest of the info, such as Major Lazer and 93.1.15.107.
table = playerssoup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find('td', text='0x0110000101517CC6')
This gives me the td, but I don't know what to do from here.
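One approach is to use find_previous_sibling('td') to walk back from the matched cell to the cells before it in the same row.
Example:
for tr in table_rows:
    td = tr.find('td', text='0x0110000105DAB310')
    if td is not None:
        # the cell just before the ID holds the IP address
        print( td.find_previous_sibling('td').text )
        # three cells before the ID is the name
        print( td.find_previous_sibling('td').find_previous_sibling('td').find_previous_sibling('td').text )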
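Another way to get at the same cells is to go up to the enclosing row and index into its cells. This is a minimal sketch, assuming every row has the same column order as in the question; lookup is a hypothetical helper name, the style attributes are dropped, and string= is used as the current spelling of the text= argument:

```python
from bs4 import BeautifulSoup

# Two rows in the same format as the table in the question.
html = '''<table id="test" class="tablesorter">
<tr class="even">
<td>1 </td><td>Major Lazer</td><td class="right">64</td>
<td>93.1.15.107</td><td>0x0110000105DAB310</td>
<td class="center">No</td><td class="center">No</td>
</tr>
<tr class="odd">
<td>0 </td><td>Michael gunin</td><td class="right">64</td>
<td>57.48.41.27</td><td>0x0110000631HDA213</td>
<td class="center">No</td><td class="center">No</td>
</tr>
</table>'''

def lookup(soup, long_id):
    """Return (name, ip) for the row whose ID cell equals long_id, or None."""
    td = soup.find('td', string=long_id)
    if td is None:
        return None
    # Collect the text of every cell in the row that contains the match.
    cells = [c.get_text(strip=True) for c in td.find_parent('tr').find_all('td')]
    # Column 1 is the name and column 3 is the IP in this layout.
    return cells[1], cells[3]

soup = BeautifulSoup(html, 'html.parser')
print(lookup(soup, '0x0110000105DAB310'))
```

Indexing by column keeps the code readable when several cells are needed, at the cost of depending on the column order.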

Python - How to get tr td after tf was already used

So, here's my code:
link = "https://nookipedia.com/w/api.php?action=query&list=categorymembers&cmtitle=Category:Insect&cmlimit=500&format=json"
async with aiohttp.get(link) as t:
    result = await t.json()
    foundCheck = False
    for list in result["query"]["categorymembers"]:
        print(list["title"])
        if bug.lower() == list["title"].lower():
            print(bug)
            await self.bot.say("{} is a real bug".format(bug.title()))
            bug2 = bug.replace(" ", "_")
            url = "https://nookipedia.com/wiki/{}".format(bug2)
            await self.bot.say(url)
            async with aiohttp.get(url) as response:
                soupObject = BeautifulSoup(await response.text(), "html.parser")
            try:
                info = soupObject.find(id="Infobox-bug").tr.td.get_text()
                await self.bot.say("{}".format(info))
            except:
                await self.bot.say("Can't get the content from {}".format(url))
            foundCheck = True
            return
    if not foundCheck:
        await self.bot.say("That bug does not exist")
        return
else:
await self.bot.say("Error")
And here's the HTML I'm trying to extract data from:
<table id="Infobox-bug" align="right" style="background: #adff2f; margin-left: 10px; margin-bottom: 10px; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px; border: 3px solid #9acd32; width: 25%">
<tr align="center">
<td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr align="center">
<td style="background: #caecc9; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px;" colspan="2"> <img alt="Pill Bug Picture.jpg" src="/w/images/b/bb/Pill_Bug_Picture.jpg" width="199" height="186" />
</td></tr>
<tr>
<th style="background: #86df2d; border-top-left-radius: 10px; -moz-border-radius-topleft: 10px; -webkit-border-top-left-radius: 10px; -khtml-border-top-left-radius: 10px; -icab-border-top-left-radius: 10px; -o-border-top-left-radius: 10px;" align="right"> Scientific name
</th>
<td style="background:#ffffff; border-top-right-radius: 10px; -moz-border-radius-topright: 10px; -webkit-border-top-right-radius: 10px; -khtml-border-top-right-radius: 10px; -icab-border-top-right-radius: 10px; -o-border-top-right-radius: 10px;" align="left"> <i>Armadillidium vulgare</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Family
</th>
<td style="background:#ffffff" align="left"> <i>Armadillidiidae - Terrestrial Custaceans</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of year
</th>
<td style="background:#ffffff" align="left"> All year
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of day
</th>
<td style="background:#ffffff" align="left"> All day
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Location
</th>
<td style="background:#ffffff" align="left"> Under rocks
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Size
</th>
<td style="background:#ffffff" align="left"> 2 mm
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Rarity
</th>
<td style="background:#ffffff" align="left"> Common
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Selling price
</th>
<td style="background:#ffffff" align="left"> 250 Bells
</td></tr>
<tr>
<th style="background: #86df2d; border-bottom-left-radius: 10px; -moz-border-radius-bottomleft: 10px; -webkit-border-bottom-left-radius: 10px; -khtml-border-bottom-left-radius: 10px; -icab-border-bottom-left-radius: 10px; -o-border-bottom-left-radius: 10px;" align="right"> Appearances
</th>
<td style="background:#ffffff; border-bottom-right-radius: 10px; -moz-border-radius-bottomright: 10px; -webkit-border-bottom-right-radius: 10px; -khtml-border-bottom-right-radius: 10px; -icab-border-bottom-right-radius: 10px; -o-border-bottom-right-radius: 10px;" align="left"> <i>Doubutsu no Mori</i>,<br /><i>Animal Crossing</i>,<br /><i>Animal Crossing: Wild World</i>,<br /><i>Animal Crossing: City Folk</i>,<br /><i>Animal Crossing: New Leaf</i>
</td></tr></table>
So, basically, I got "Pill Bug" (aka info) as its own string, but I'm not sure how to get everything else after it (within the tr and td tags) without getting "Pill Bug" again. How would I do that so that I get each text as its own string?
Thank you so much for the help.
Beautiful Soup has many methods for getting tags and their attributes:
soup.find(args)
soup.find_all(args)
soup.select(CSS_selection)
tag.get(param) or tag.get(param, default) or tag[param]
tag.text or tag.get_text()
tag.name
etc.
find() and find_all() accept many different kinds of arguments, so see the Beautiful Soup documentation for more.
Example:
html = '''<table id="Infobox-bug" align="right" style="background: #adff2f; margin-left: 10px; margin-bottom: 10px; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px; border: 3px solid #9acd32; width: 25%">
<tr align="center">
<td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr align="center">
<td style="background: #caecc9; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px;" colspan="2"> <img alt="Pill Bug Picture.jpg" src="/w/images/b/bb/Pill_Bug_Picture.jpg" width="199" height="186" />
</td></tr>
<tr>
<th style="background: #86df2d; border-top-left-radius: 10px; -moz-border-radius-topleft: 10px; -webkit-border-top-left-radius: 10px; -khtml-border-top-left-radius: 10px; -icab-border-top-left-radius: 10px; -o-border-top-left-radius: 10px;" align="right"> Scientific name
</th>
<td style="background:#ffffff; border-top-right-radius: 10px; -moz-border-radius-topright: 10px; -webkit-border-top-right-radius: 10px; -khtml-border-top-right-radius: 10px; -icab-border-top-right-radius: 10px; -o-border-top-right-radius: 10px;" align="left"> <i>Armadillidium vulgare</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Family
</th>
<td style="background:#ffffff" align="left"> <i>Armadillidiidae - Terrestrial Custaceans</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of year
</th>
<td style="background:#ffffff" align="left"> All year
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of day
</th>
<td style="background:#ffffff" align="left"> All day
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Location
</th>
<td style="background:#ffffff" align="left"> Under rocks
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Size
</th>
<td style="background:#ffffff" align="left"> 2 mm
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Rarity
</th>
<td style="background:#ffffff" align="left"> Common
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Selling price
</th>
<td style="background:#ffffff" align="left"> 250 Bells
</td></tr>
<tr>
<th style="background: #86df2d; border-bottom-left-radius: 10px; -moz-border-radius-bottomleft: 10px; -webkit-border-bottom-left-radius: 10px; -khtml-border-bottom-left-radius: 10px; -icab-border-bottom-left-radius: 10px; -o-border-bottom-left-radius: 10px;" align="right"> Appearances
</th>
<td style="background:#ffffff; border-bottom-right-radius: 10px; -moz-border-radius-bottomright: 10px; -webkit-border-bottom-right-radius: 10px; -khtml-border-bottom-right-radius: 10px; -icab-border-bottom-right-radius: 10px; -o-border-bottom-right-radius: 10px;" align="left"> <i>Doubutsu no Mori</i>,<br /><i>Animal Crossing</i>,<br /><i>Animal Crossing: Wild World</i>,<br /><i>Animal Crossing: City Folk</i>,<br /><i>Animal Crossing: New Leaf</i>
</td></tr></table>'''
from bs4 import BeautifulSoup
#import requests
#r = requests.get('https://nookipedia.com/wiki/Pill_Bug')
#html = r.content
soup = BeautifulSoup(html, "html.parser")
tds = soup.find(id="Infobox-bug").find_all('td')
print('--- all td text ---')
for x in tds:
    print('>', x.get_text().strip())
    # or
    print('>', x.text.strip())
print('--- one td text ---')
print(tds[0].text.strip())
# Note: the <a> lookups below need the live page (see the commented-out requests lines);
# the pasted html string contains no <a> tags, so find('a') would return None for it.
print('--- one td a href ---')
print(tds[1].find('a').get('href'))
# or
print(tds[1].find('a')['href'])
print('--- all a href (using CSS selector) ---')
for a in soup.select('#Infobox-bug td a'):
    print(a['href'])
print('--- all td and th ---')
for tt in soup.find(id='Infobox-bug').find_all({'td', 'th'}):
    if tt.name == 'th':
        print('[', tt.name, ']', tt.text.strip(), end=" --> ")
    elif tt.name == 'td':
        a = tt.find('a')
        if a:
            a = a['href']
        else:
            a = 'None'
        print('[', tt.name, ']', tt.text.strip(), '(', a, ')')
Result:
--- all td text ---
> Pill Bug
> Pill Bug
>
>
> Armadillidium vulgare
> Armadillidium vulgare
> Armadillidiidae - Terrestrial Custaceans
> Armadillidiidae - Terrestrial Custaceans
> All year
> All year
> All day
> All day
> Under rocks
> Under rocks
> 2 mm
> 2 mm
> Common
> Common
> 250 Bells
> 250 Bells
> Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf
> Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf
--- one td text ---
Pill Bug
--- one td a href ---
/wiki/File:Pill_Bug_Picture.jpg
/wiki/File:Pill_Bug_Picture.jpg
--- all a href (using CSS selector) ---
/wiki/File:Pill_Bug_Picture.jpg
/wiki/Bells
/wiki/Doubutsu_no_Mori_(game)
/wiki/Animal_Crossing_(GCN)
/wiki/Animal_Crossing:_Wild_World
/wiki/Animal_Crossing:_City_Folk
/wiki/Animal_Crossing:_New_Leaf
--- all td and th ---
[ td ] Pill Bug ( None )
[ td ] ( /wiki/File:Pill_Bug_Picture.jpg )
[ th ] Scientific name --> [ td ] Armadillidium vulgare ( None )
[ th ] Family --> [ td ] Armadillidiidae - Terrestrial Custaceans ( None )
[ th ] Time of year --> [ td ] All year ( None )
[ th ] Time of day --> [ td ] All day ( None )
[ th ] Location --> [ td ] Under rocks ( None )
[ th ] Size --> [ td ] 2 mm ( None )
[ th ] Rarity --> [ td ] Common ( None )
[ th ] Selling price --> [ td ] 250 Bells ( /wiki/Bells )
[ th ] Appearances --> [ td ] Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf ( /wiki/Doubutsu_no_Mori_(game) )
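If the goal is to have each value as its own string, keyed by its label, the th/td pairs can be collected into a dictionary. This is a small sketch, assuming each data row holds exactly one <th> and one <td> as in the infobox above; the HTML is a shortened excerpt with the styling removed:

```python
from bs4 import BeautifulSoup

# Shortened excerpt of the infobox (title row plus two data rows).
html = '''<table id="Infobox-bug">
<tr align="center">
<td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr>
<th align="right"> Scientific name
</th>
<td align="left"> <i>Armadillidium vulgare</i>
</td></tr>
<tr>
<th align="right"> Size
</th>
<td align="left"> 2 mm
</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

info = {}
for tr in soup.find(id='Infobox-bug').find_all('tr'):
    th = tr.find('th')
    td = tr.find('td')
    # Rows without a <th> (the title and image rows) are skipped,
    # which avoids picking up "Pill Bug" again.
    if th is not None and td is not None:
        info[th.get_text(strip=True)] = td.get_text(strip=True)

print(info)
```

After this, a single value can be read with, for example, info['Size'].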

Parsing HTML with BeautifulSoup in Python

I am trying to parse HTML with Python using BeautifulSoup, but I can't manage to get what I need.
This is a small module of a personal app I want to build. It consists of a web login step with credentials; once the script is logged in to the site, I need to parse some information in order to manage and process it.
The HTML code after logging in is:
<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>
As you can see, it's not very well-formatted HTML, but I need to extract the elements and their values, for example: "Daily earnings" and "150" | "Weekly earnings" and "500"...
I think that the "id" attribute may help, but when I try to parse it, it crashes.
The Python code I'm working with is:
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par
Here archivohtml is the HTML file saved after logging in to the site.
When I run the script, I only get errors.
I've also tried doing this:
def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par
But the result is still the same.
The tag with id="west1" is an <a> tag. You are looking for the <td> tag that comes after this <a> tag:
import bs4 as bs  # Beautiful Soup 4; the old "import BeautifulSoup" (version 3) does not run on Python 3
content = '''<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>'''
def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    # find the <a id="west1"> tag, then the first <td> that follows it
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    print(par.string.strip())
parseo(content)
yields
150
I can't tell from your question if this will be applicable to you, but here's another method:
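from bs4 import BeautifulSoup  # stripped_strings requires Beautiful Soup 4
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html, 'html.parser')
    # stripped_strings yields every text node with surrounding whitespace removed
    for line in parsed_html.stripped_strings:
        print(line.strip())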
which yields:
Account Balance
Daily Earnings
150
Weekly Earnings
500
Monthly Earnings
1500
Total expended
430
Account Balance
840
And if you wanted the data in a list:
data = [line.strip() for line in parsed_html.stripped_strings]
[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
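Since the flat list loses the pairing between a label and its value, it may be more convenient to walk the table row by row and build a dictionary. This is a rough sketch, assuming each data row has a label cell followed by an integer value cell; the HTML is a shortened excerpt of the table above, and earnings is a hypothetical variable name:

```python
from bs4 import BeautifulSoup

# Shortened excerpt: two data rows plus the row that only holds the form.
content = '''<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right;">
500 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</td>
</tr>
</table>'''

soup = BeautifulSoup(content, 'html.parser')

earnings = {}
for tr in soup.find('table', class_='simple').find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # Keep only rows with a non-empty label; the form row has an empty first cell.
    if len(cells) == 2 and cells[0]:
        earnings[cells[0]] = int(cells[1])

print(earnings)
```

Note that the full table has two "Account Balance" strings (the heading and a row label), but only the row label ends up as a key here because the heading is outside the table.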
