How to get values from nested tables using BeautifulSoup (Python)

I need to get the name and the price of each row in the sample HTML below. However, when I use BeautifulSoup's find_all('tr'), it returns every tr in the main table and in the nested tables. What is the best way to extract only the name and the price of each row?
soup = BeautifulSoup(f, 'html.parser')
priceTable = soup.find('table', attrs={"class":"table table-hover table-responsive"})
The above is what I have so far; searching priceTable for tr returns all of them, including those of the nested tables.
What I need is the name of each item and the price shown next to it, saved to a CSV file at the end.
<table class="table table-hover table-responsive">
<tbody><tr>
<td style="vertical-align: middle; width: 20%;" class="hidden-xs">
<img class="retailer-logo" data-placement="right" src="/images/20180813125BhYNMEK8lgOpXj3zxze53WmqeRWov7h.jpg" alt="Contact Energy" style="width:150px;" title="" data-original-title="" />
</td>
<td style="vertical-align: middle; width: 75px;" class="hidden-xs">
<img src="/images/result-arrow.png" />
</td>
<td>
<table style="width: 100%;">
<tbody><tr class="visible-xs">
<td class="text-center" colspan="2">
<img class="retailer-logo" data-placement="right" src="/images/20180813125BhYNMEK8lgOpXj3zxze53WmqeRWov7h.jpg" alt="Contact Energy" style="width:150px;" title="" data-original-title="" />
</td>
</tr>
<tr>
<td colspan="3"><h4>Contact Energy Saver Plus</h4></td>
</tr>
<tr style="text-transform: uppercase">
<td width="150px">Electricity:</td>
<td>$242.85 <a class="plan-breakdown" data-placement="right" title="" data-original-title="<table><tr><td>Anytime</td><td>$0.334</td><td>per kWh</td><tr><td>Daily</td><td>$0.333</td><td>per day</td><tr><td>EA Levy</td><td>$0.0013</td><td>per kWh</td></table>"><i class="glyphicon glyphicon-info-sign"> </i></a>
</td>
</tr>
<tr style="text-transform: uppercase">
<td>Discount:</td>
<td>$63.14 (26%)
</td>
</tr>
<tr>
<td colspan="3">
<a class="plan-detail" data-placement="right" title="" data-original-title="<ul><li>Provides fixed pricing until 31 June 2021 unless there are changes to taxes and levies.</li><li>24% Prompt Payment Discount when you pay on time. additional 1% discount for paying by direct debit (excl. credit card), and 1% discount for getting bills and correspondence by email. Up to 26% PPD available.</li><li>An early termination fee of $150 per contracted ICP if you terminate the contract before the end date (31/06/2021). Fee may be waived if you are moving house and take Contact Energy to the new property.</li><li>Not available to prepay customers.</li></ul>"><i class="glyphicon glyphicon-info-sign"> </i> What you need to know</a>
</td>
</tr>
<tr class="visible-xs">
<td colspan="2">
<h3 class="total">$179.71</h3>
<div class="incentive">
<b style="text-transform: uppercase">SPECIAL SwitchMe OFFER</b><br />
Special PPD & Fixed rates<br />
<a style="font-size: 0.9em;" class="incentive-info" title="" data-original-title="Receive a special Prompt Payment Discount and fixed rates until 31 June 2021 unless there are changes to taxes and levies">More Info</a>
</div>
</td>
</tr>
<tr class="visible-xs">
<td colspan="2">
<form id="w0" action="/switch/" method="post">
<input type="hidden" name="_csrf" value="Hi21xBvkP6NpUl0UcaFwxn4U5-94Jj8KqEeprOfuG9tMfP2gStRY6RFrBGdF6gGvT0uM3CAQaVvOPpnq1IddtQ==" /> <input type="hidden" name="query_id" value="409884" /> <input type="hidden" name="plan_group_id" value="54" /> <input type="hidden" name="plan_stage_id" value="367" /> <button type="submit" class="btn btn-block btn-switch" style="max-width: 100%; margin-top: 10px">Switch Now!</button> </form> <div class="wannatalk" style="max-width: 100%">
Want to talk?<br />
Call our friendly team on<br />
<b>0800 179 482</b>
</div>
</td>
</tr>
</tbody></table>
</td>
<td style="text-align: center" class="hidden-xs">
<h3 class="total">$179.71</h3>
<div class="incentive">
<b style="text-transform: uppercase">SPECIAL SwitchMe OFFER</b><br />
Special PPD & Fixed rates<br />
<a style="font-size: 0.9em;" class="incentive-info" title="" data-original-title="Receive a special Prompt Payment Discount and fixed rates until 31 June 2021 unless there are changes to taxes and levies">More Info</a>
</div>
</td>
<td class="hidden-xs">
<form id="w1" action="/switch/" method="post">
<input type="hidden" name="_csrf" value="Hi21xBvkP6NpUl0UcaFwxn4U5-94Jj8KqEeprOfuG9tMfP2gStRY6RFrBGdF6gGvT0uM3CAQaVvOPpnq1IddtQ==" /> <input type="hidden" name="query_id" value="409884" /> <input type="hidden" name="plan_group_id" value="54" /> <input type="hidden" name="plan_stage_id" value="367" /> <button type="submit" class="btn btn-block btn-switch">Switch Now!</button> </form> <div class="wannatalk">
Want to talk?<br />
Call our friendly team on<br />
<b>0800 179 482</b>
</div>
</td>
</tr>
<tr>
<td style="vertical-align: middle; width: 20%;" class="hidden-xs">
<img class="retailer-logo" data-placement="right" src="/images/20171013102LzWd_kdtQOk4yxxyZuCZBG6q7xIuClx.jpg" alt="Powershop" style="width:150px;" title="" data-original-title="" />
</td>
<td style="vertical-align: middle; width: 75px;" class="hidden-xs">
<img src="/images/result-arrow.png" />
</td>
<td>
<table style="width: 100%;">
<tbody><tr class="visible-xs">
<td class="text-center" colspan="2">
<img class="retailer-logo" data-placement="right" src="/images/20171013102LzWd_kdtQOk4yxxyZuCZBG6q7xIuClx.jpg" alt="Powershop" style="width:150px;" title="" data-original-title="" />
</td>
</tr>
<tr>
<td colspan="3"><h4>Powershop Saver</h4></td>
</tr>
<tr style="text-transform: uppercase">
<td width="150px">Electricity:</td>
<td>$183.40 <a class="plan-breakdown" data-placement="right" title="" data-original-title="<table><tr><td>Anytime</td><td>$0.2508</td><td>per kWh</td><tr><td>Daily</td><td>$0.30</td><td>per day</td><tr><td>EA Levy</td><td>$0.00</td><td>per kWh</td></table>"><i class="glyphicon glyphicon-info-sign"> </i></a>
</td>
</tr>
<tr style="text-transform: uppercase">
<td>Discount:</td>
<td>$0.00 (0%)
</td>
</tr>
<tr>
<td colspan="3">
<a class="plan-detail" data-placement="right" title="" data-original-title="<ul><li>The price estimate is based on forecast charges from Powershop for the next 12 months.</li><li>It assumes you purchase the Powershop Simple Saver powerpack once a month and special powerpacks that are made available from time to time.</li><li>This offer does not require a contract or a minimum supply period.</li><li>New customers will get a $150 power credit applied over their first 12 months ($25 straight away, $10 on the next 10 monthly account
review periods, and a final credit of $25 in the final account review period of
your first year as a Powershop customer).</li></ul>"><i class="glyphicon glyphicon-info-sign"> </i> What you need to know</a>
</td>
</tr>
<tr class="visible-xs">
<td colspan="2">
<h3 class="total">$183.40</h3>
<div class="incentive">
<b style="text-transform: uppercase">SPECIAL SwitchMe OFFER</b><br />
Get $150 off your bill over 12 months!<br /> <a style="font-size: 0.9em;" class="incentive-info" title="" data-original-title="<div><div>New customers will get a $150 power credit applied over their first 12 months ($25 straight away, then $10 for the next 10 monthly account
review periods, and a final credit of $25 in the final account review period of
your first year as a Powershop customer).</div><div> </div></div><div><br></div><div><br></div>">More Info</a> </div>
</td>
</tr>
<tr class="visible-xs">
<td colspan="2">
<form id="w2" action="/switch/" method="post">
<input type="hidden" name="_csrf" value="Hi21xBvkP6NpUl0UcaFwxn4U5-94Jj8KqEeprOfuG9tMfP2gStRY6RFrBGdF6gGvT0uM3CAQaVvOPpnq1IddtQ==" /> <input type="hidden" name="query_id" value="409884" /> <input type="hidden" name="plan_group_id" value="53" /> <input type="hidden" name="plan_stage_id" value="273" /> <button type="submit" class="btn btn-block btn-switch" style="max-width: 100%; margin-top: 10px">Switch Now!</button> </form><div class="wannatalk" style="max-width: 100%">
Want to talk?<br />
Call our friendly team on<br />
<b>0800 179 482</b>
</div>
</td>
</tr>
</tbody></table>
</td>
So the output should be as follows. From td[3] and td[4] in the first row:
Contact Energy Saver Plus
$179.71
and then from the next row:
Powershop Saver
$183.40
and so on until the last row of the main table.

This is a similar process to the one given in the comments, but with different selectors:
from bs4 import BeautifulSoup as bs
html = '''yourhtml'''
soup = bs(html, 'lxml')
# plan names sit in the h4 elements inside the main table
names = [item.text for item in soup.select('.table h4')]
# each nested table holds one h3.total directly inside a colspan="2" cell
prices = [item.text for item in soup.select('[colspan="2"] > .total')]
results = list(zip(names, prices))
print(results)
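Since the question also asks for a CSV file, here is a minimal sketch of one way to restrict the search to the rows of the outer table and write the pairs out. It assumes the structure shown in the question (one h4 name and at least one h3 with class total per outer row). SAMPLE_HTML, extract_plans and write_csv are hypothetical names made up for this illustration, and the sample markup is a trimmed-down stand-in for the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumption: same nesting).
SAMPLE_HTML = """
<table class="table table-hover table-responsive">
<tbody>
<tr>
  <td><table><tbody>
    <tr><td colspan="3"><h4>Contact Energy Saver Plus</h4></td></tr>
    <tr class="visible-xs"><td colspan="2"><h3 class="total">$179.71</h3></td></tr>
  </tbody></table></td>
  <td class="hidden-xs"><h3 class="total">$179.71</h3></td>
</tr>
<tr>
  <td><table><tbody>
    <tr><td colspan="3"><h4>Powershop Saver</h4></td></tr>
    <tr class="visible-xs"><td colspan="2"><h3 class="total">$183.40</h3></td></tr>
  </tbody></table></td>
  <td class="hidden-xs"><h3 class="total">$183.40</h3></td>
</tr>
</tbody>
</table>
"""


def extract_plans(html):
    """Return (name, price) pairs, one per row of the outer table."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("table", class_="table-hover")
    # html.parser does not add a missing <tbody>, so fall back to the table itself.
    body = main.find("tbody", recursive=False) or main
    plans = []
    # recursive=False keeps the rows of the nested tables out of the loop.
    for row in body.find_all("tr", recursive=False):
        name = row.find("h4")
        price = row.find("h3", class_="total")
        if name is not None and price is not None:
            plans.append((name.get_text(strip=True), price.get_text(strip=True)))
    return plans


def write_csv(plans, fileobj):
    """Write a header line followed by one line per plan."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "price"])
    writer.writerows(plans)


plans = extract_plans(SAMPLE_HTML)
buffer = io.StringIO()  # swap for open("prices.csv", "w", newline="") to write a real file
write_csv(plans, buffer)
print(buffer.getvalue())
```

Looping over the outer rows keeps each name paired with the price from the same row, whereas zipping two independent lists can silently misalign if one row lacks a price.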

I actually managed to solve this using regex. I like the approach in the answer above much better, especially the use of zip(), but I thought I would paste my solution here in case it comes in handy for other readers.
import re
from bs4 import BeautifulSoup

deals = []
prices = []
results = {}
with open("prices.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')
priceTable = soup.find('table', attrs={"class": "table table-hover table-responsive"})
tbody = priceTable.find('tbody')
planPattern = r'<td colspan="3"><h4>([^<]+)</h4></td>'
pricePattern = r'<h3 class="total">([^<]+)</h3>'
# Visit only the direct <tr> children of the main table body.
for rw in tbody.find_all('tr', recursive=False):
    # re.search needs a string, so serialise the row first
    row_html = str(rw)
    plan = re.search(planPattern, row_html)
    price = re.search(pricePattern, row_html)
    if plan and price:
        deals.append(plan.group(1))
        prices.append(price.group(1))
        results[plan.group(1)] = price.group(1)

Related

I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project

I have this HTML code for example:
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
<span style="font-size: 8pt; color: #78578e"> MCO Date : 10-Aug-2018</span>
<br />
</td>
<td>
<h3 style="font-size: 14pt;"> </h3>
<h2> <br /> </h2>
</td>
</tr>
<tr>
<td style="margin-top: 0;" class="border-b">
<br />
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
</tr>
<tr style="height: 7%;" class="border-tb">
<td style="width: 130px" class="border-r">
<h4>METER NO</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PREVIOUS READING</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PRESENT READING</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>MF</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>UNITS</h4>
</td>
<td>
<h4>STATUS</h4>
</td>
</tr>
<tr style="height: 30px" class="content">
<td class="border-r">
3-P I 3301539<br> I 3301539<br> E 3301539<br> E 3301539<br>
</td>
<td class="border-r">
78693<br>16823<br>19740<br>8<br>
</td>
<td class="border-r">
80086<br>17210<br>20139<br>8<br>
</td>
<td class="border-r">
1<br>1<br>1<br>1<br>
</td>
<td class="border-r">
1393<br>387<br>399<br>0<br>
</td>
<td>
</td>
</tr>
<tr id="roshniMsg" style="height: 30px" class="content">
<td colspan="6">
<div style="width: 452pt">
<img style="max-width: 100%; max-height: 35%" src="/images/companies/iesco/roshniMsg.jpg"
alt="Roshni Message" />
</div>
</td>
</tr>
</table>
From this table I want to extract the paragraph and from there I want to get all the span tags in that paragraph.
I used soup.find_all() to get the table but I don't know how to use this function iteratively to pass it back to the original soup object so that I could find the paragraph and, moreover the span tags in that paragraph.
This is the Python code I wrote:
soup = BeautifulSoup(string, 'html.parser')
# Getting the table tag
results = soup.find_all('table', attrs={'class':'nested4'})
# Getting the paragraph tag
results = soup.find_all('p', attrs={'style':'margin: 0; text-align: left; padding-left: 5px'})
# Getting all the span tags
results = soup.find_all('span', attrs={})
I just want help on how to get the paragraphs within the table. And then how to get the spans within the paragraph as I am getting the spans in all of the original HTML code. I don't know how to pass the bs4 object list back to the soup object to use soup.find_all iteratively.
from bs4 import BeautifulSoup
html = '''
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
'''
soup = BeautifulSoup(html, 'html.parser')
# select_one returns the first matching table; select then searches only inside it
spans = soup.select_one('table.nested4').select('span')
for span in spans:
    print(span.text)
This returns:
NAME & ADDRESS
MUHAMMAD AMIN
S/O MUHAMMAD KHAN
H-NO.38 MARGALLA ROAD
F-6/3 ISLAMABAD3
If you have one table:
soup = BeautifulSoup(string, 'html.parser')
table = soup.find('table', attrs={'class': 'nested4'})
# find/find_all can be called on any tag, which limits the search to that tag's descendants
p = table.find('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'})
results = p.find_all('span')
for result in results:
    print(result.get_text(strip=True))
If you have a list of tables:
soup = BeautifulSoup(string, 'html.parser')
for table in soup.find_all('table', attrs={'class': 'nested4'}):
    for p in table.find_all('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'}):
        for span in p.find_all('span'):
            print(span.get_text(strip=True))
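For comparison, the same scoping can be written as a single CSS selector with descendant combinators. This is only a sketch against a shortened, made-up copy of the markup in the question; SAMPLE_HTML and span_texts are illustrative names, and the function also drops the empty span.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup, plus one span outside the table.
SAMPLE_HTML = """
<span>outside the table</span>
<table class="nested4">
<tr><td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME &amp; ADDRESS</span><br />
<span>MUHAMMAD AMIN </span><br />
<span></span>
</p>
</td></tr>
</table>
"""


def span_texts(html):
    """Return the non-empty span texts found inside a <p> inside table.nested4."""
    soup = BeautifulSoup(html, "html.parser")
    # "table.nested4 p span" matches only spans nested in a <p> within that table.
    spans = soup.select("table.nested4 p span")
    texts = [span.get_text(strip=True) for span in spans]
    return [text for text in texts if text]


print(span_texts(SAMPLE_HTML))
```

The span outside the table is ignored because the selector requires the table and paragraph ancestors.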

BeautifulSoup only scraping half my table?

I'm scraping a webpage for a table using BeautifulSoup, but for some reason it is only scraping half the table. The half I'm getting is the part that doesn't contain the input fields. Here is the html data:
<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>
Here is my code:
url = driver.page_source
soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
for item in list_of_rows:
    print(' '.join(item))
What am I doing wrong? Why is it only printing the left side of the table? Any recommendations about what to change would be much appreciated.
Results:
Portfolio Allocation (%)
AdvisorGuided (Capital Portfolio)
100 100
AdvisorGuided 2 (Capital Portfolio)
0 100
Client Directed (Capital Portfolio)
0 100
Holding MMKT (Capital Portfolio)
0 100
Total
100 100
You'll have to go further into the child and sibling nodes and pull out the attributes: those values are stored in the value and maxvalue attributes of the input tags, so they aren't actual text content.
import pandas as pd
import bs4
html = '''<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>'''
soup = bs4.BeautifulSoup(html, "lxml")
table = soup.find('table', id="portAllocTable")
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th","td"]):
        text = cell.text
        try:
            # the hidden input holds the value; the text input right after it holds maxvalue
            val = cell.find('input')['value']
            max_val = cell.find('input').next_sibling['maxvalue']
            list_of_cells.append(val)
            list_of_cells.append(max_val)
        except (TypeError, KeyError):
            # cells without input fields (or without these attributes) only contribute their text
            pass
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
for item in list_of_rows:
    print(' '.join(item))
To make a table, you could do something like this. You'll have to do a bit of clean-up, but it should get you going:
rows = []
for row in table.find_all('tr'):
    for cell in row.find_all(["th","td"]):
        text = cell.text
        try:
            val = cell.find('input')['value']
            max_val = cell.find('input').next_sibling['maxvalue']
        except (TypeError, KeyError):
            val = ''
            max_val = ''
        rows.append([text, val, max_val])
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
results = pd.DataFrame(rows, columns=['text','value','maxvalue'])
A few things come to mind.
First: it should be rows = table.findAll('tr'), as the tr HTML tag designates rows, and the inner loop should then iterate over the cells of each row, since td is the cell tag. But you're not even using the rows variable, so the point is moot. If you want, you could do something like this:
soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll("tr")
list_of_rows = []
for row in rows:
    list_of_cells = []
    for cell in row.findAll(['th', 'td']):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
for item in list_of_rows:
    print(' '.join(item))
Second, this code wouldn't get the text in the input fields, because those values are stored in attributes rather than as text, so this is probably why you only see the text on the left side.
Finally, you could try a different parser, such as html5lib.
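To illustrate the second point, here is a minimal sketch that reads the value attribute when a cell contains a text input and falls back to the cell text otherwise. The markup is a shortened version of the table in the question with invented field names (r0, r1), table_rows is a hypothetical helper, and html.parser is used only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question; field names are invented.
SAMPLE_HTML = """
<table id="portAllocTable">
  <tr><th colspan="2"><span> Portfolio Allocation (%) </span></th></tr>
  <tr>
    <td><span>AdvisorGuided (Capital Portfolio)</span></td>
    <td><span><input type="hidden" name="r0" value="100"><input type="text" name="r0INPUT" value="100" maxvalue="100"></span></td>
  </tr>
  <tr>
    <td><span>AdvisorGuided 2 (Capital Portfolio)</span></td>
    <td><span><input type="hidden" name="r1" value="0"><input type="text" name="r1INPUT" value="0" maxvalue="100"></span></td>
  </tr>
</table>
"""


def table_rows(html):
    """Return one list per <tr>, using input values where a cell has a text input."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="portAllocTable")
    rows = []
    for tr in table.find_all("tr"):
        cells = []
        for cell in tr.find_all(["th", "td"]):
            field = cell.find("input", attrs={"type": "text"})
            if field is not None:
                # The number lives in the attribute, not in the cell text.
                cells.append(field.get("value", ""))
            else:
                cells.append(cell.get_text(strip=True))
        rows.append(cells)
    return rows


for row in table_rows(SAMPLE_HTML):
    print(" | ".join(row))
```

Note that page_source reflects the attributes in the DOM; if a user edits a field in the browser, the typed value may not appear in the value attribute.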

PYTHON Beautiful Soup Web Scraping Handling Dynamic Values

The HTML below has a varying set of attributes for each individual series. For example, one series can be published in multiple units, such as Millions or Thousands.
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
Total Vehicle Sales
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:49px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-meta">
<span class="attributes">Monthly</span>
<br class="clear">
</div>
<div class="series-meta">
<input class="pager-item-checkbox" type="checkbox" name="sids[0]" value="TOTALSA">
<a href="/series/TOTALSA">
Millions of Units,
Seasonally Adjusted Annual Rate
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
<br class="clear">
<input class="pager-item-checkbox" type="checkbox" name="sids[1]" value="TOTALNSA">
<a href="/series/TOTALNSA">
Thousands of Units,
Not Seasonally Adjusted
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
</div>
</td>
</tr>
<tr><td colspan="2" style="font-size:9px"> </td></tr>
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
Light Weight Vehicle Sales: Autos and Light Trucks
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:46px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-single">
<input class="pager-item-checkbox" type="checkbox" name="sids[2]" value="ALTSALES">
<span class="attributes" style="width:350px;">Millions of Units, Monthly, Seasonally Adjusted Annual Rate</span><span class="series-meta-dates">Jan 1976 to Jul 2017 (4 days ago)</span>
<br class="clear">
</div>
<a href="/series/ALTSALES">
</a>
</td>
The code below gets me somewhat close; however, it fails to obtain the second entry for "Total Vehicle Sales" and only obtains the first, "Millions of Units, Seasonally Adjusted Annual Rate". Aside from this issue, I suspect that my current query mis-classifies things in general. The code I have created thus far:
browser=webdriver.Chrome(executable_path=r'F:\Anaconda\chromedriver\chromedriver_win32\chromedriver.exe')
browser.get('https://fred.stlouisfed.org/categories/32993')
soup=BeautifulSoup(browser.page_source,'lxml')
for l in soup.find_all('tbody'):
    series_count=len(l.find_all('tr',attrs={'class':'series-pager-title'}))
    series_data=l.find_all('tr',attrs={'class':'series-pager-title'})
    attrs_data=l.find_all('tr',attrs={'class':'series-pager-attr'})
    print(series_count)
    print(len(attrs_data))
    for m in range(0,series_count):
        print(series_data[m].find('a',href=True).text+' | '+attrs_data[m].find('a',href=True).text.strip().replace(' ',' '))
In the above query, can someone please assist in creating the desired outcome:
If someone comes across this with a better solution, I am all ears. In the interim, this seems to do the trick:
import re
import pandas as pd
from bs4 import BeautifulSoup

# browser is the Selenium webdriver created in the question above
browser.get('https://fred.stlouisfed.org/categories/32993')
soup=BeautifulSoup(browser.page_source,'lxml')
test=soup.tbody
children=[child for child in test if child != '\n']
series_rows=[]
sub_series_rows=[]
series_index=0
for index,child in enumerate(children):
    if child.find('a',attrs={'class':'series-title'}):
        series_index+=1
        series_title=child.text.strip()
        series_link=child.find('a',href=True).attrs['href']
        series_rows.append({'series_index':series_index,
                            'series_title':series_title,
                            'series_href':series_link})
    if child.find('div',attrs={'class':'series-meta'}):
        frequency=child.find('span',attrs={'class':'attributes'}).text.strip()
        for i in child.find_all('a',href=True):
            sub_series_rows.append({'series_index':series_index,
                                    'frequency':frequency,
                                    # collapse newlines and runs of spaces in the link text
                                    'sub_series_units':re.sub(' +',' ',re.sub('\n',' ',i.text)),
                                    'sub_series_href':'https://fred.stlouisfed.org'+i.attrs['href']})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in lists and converted once
series_data=pd.DataFrame(series_rows,columns=['series_index','series_title','series_href'])
sub_series_data=pd.DataFrame(sub_series_rows,columns=['series_index','frequency','sub_series_units','sub_series_href'])
print(series_data)
print(sub_series_data)
combine_series_data=pd.merge(series_data,sub_series_data,how='left',on=['series_index'])
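A smaller variant of the same idea, without pandas or a browser, might pair each title row with the attribute row that follows it and then loop over every link in that row. This is a sketch against a cut-down copy of the HTML in the question; SAMPLE_HTML and series_links are made-up names, and it assumes every series-pager-title row is followed by its own series-pager-attr row.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the rows shown in the question.
SAMPLE_HTML = """
<table><tbody>
<tr class="series-pager-title"><td colspan="2"><div>Total Vehicle Sales</div></td></tr>
<tr class="series-pager-attr"><td colspan="2">
  <div class="series-meta series-group-meta"><span class="attributes">Monthly</span></div>
  <div class="series-meta">
    <a href="/series/TOTALSA">
      Millions of Units,
      Seasonally Adjusted Annual Rate
    </a>
    <a href="/series/TOTALNSA">
      Thousands of Units,
      Not Seasonally Adjusted
    </a>
  </div>
</td></tr>
</tbody></table>
"""


def series_links(html):
    """Return (title, units, href) for every link under each series title."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for title_row in soup.find_all("tr", class_="series-pager-title"):
        title = title_row.get_text(" ", strip=True)
        # Assumption: the matching attribute row is the next sibling with this class.
        attr_row = title_row.find_next_sibling("tr", class_="series-pager-attr")
        if attr_row is None:
            continue
        # find_all (not find) is what picks up the second and later links.
        for link in attr_row.find_all("a", href=True):
            units = " ".join(link.get_text().split())
            found.append((title, units, link["href"]))
    return found


for item in series_links(SAMPLE_HTML):
    print(item)
```

The original attempt used find('a', href=True), which returns only the first link in a row; switching to find_all is what recovers the "Thousands of Units" entry.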

Parsing a string in pandas where there isn't a delimiter

I've just parsed a web page via pandas:
r = requests.post("https://www.eigroup.co.uk/clients/auctions/fulldetails.aspx?auctionid=17999", params=payload)
parsed_page = pd.read_html(r.text, attrs={"class": "table-search-result"})
(an example of the HTML being parsed)
<table cellspacing="0" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1" style="width:100%;border-collapse:collapse;">
<tr>
<td colspan="2">
<table class="table-search-result">
<tr>
<th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th>
<th style="text-align: right; white-space: nowrap;">
<a href="http://www.englishhouseprices.com/results.aspx?postcode=SW1V 4PQ" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A2" class="icon" target="_blank">
<img src="/content/images/icons/32/houseprices.png" alt="Compare with Property Prices" title="Compare with Property Prices in this Postcode" /></a>
<a id="" title="View Auction Details" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank"><img title="View Auction Details" src="/content/images/icons/32/auctiondetails.png" alt="" /></a>
<a id="" title="Trend Analysis" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/lots/trend-analysis.aspx?lotid=756425" target="_blank"><img title="Trend Analysis" src="/content/images/icons/32/piechart.png" alt="" /></a>
<a href='http://maps.google.co.uk?q=SW1V 4PQ' target="_blank">
<img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageLocationMap" title="Location Map" class="icon" src="/content/images/icons/32/compass.png" /></a>
<a href='http://www.multimap.com/map/photo.cgi?scale=5000&mapsize=big&pc=SW1V 4PQ' target="_blank">
<img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageAerialPhoto" title="Aerial Photo" class="icon" src="/content/images/icons/32/camera.png" /></a>
<a href='/clients/search/search-results.aspx?searchtype=comparable&lotid=756425' title="Find similar properties like this one">
<img src="/content/images/icons/32/find.png" alt="Find other properties matching this tenant" title="Find similar properties like this one" class="icon" /></a>
<a href='/clients/search/search-results.aspx?searchtype=history&lotid=756425'>
<img src="/content/images/icons/32/history.png" alt="Find history of property in this street" title="Find history of property in this street" class="icon" /></a>
<a id="" title="Add to one of my portfolios" class="icon" Title="Add to portfolio" onclick="return o(this,650,500,1,1)" href="/clients/portfolios/lot.aspx?lotid=756425" target="_blank"><img title="Add to one of my portfolios" src="/content/images/icons/32/briefcase.png" alt="" /></a>
<a href="https://www.eigroup.co.uk/files/55/17999/6ec339ec-d59e-4b8a-9136-dc6e9a583328.pdf" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A4" target="_blank">
<img src="/content/images/icons/32/catalogue.png" alt="Catalogue Entry" class="icon" title="Full Catalogue Entry" /></a>
<a id="" title="Add to my shortlist" class="icon" Title="Add to shortlist" onclick="return o(this,900,650,1,1)" href="/clients/lots/shortlist.aspx?lotid=756425" target="shortlist"><img title="Add to my shortlist" src="/content/images/icons/32/shortlist.png" alt="" /></a>
</th>
</tr>
<tr>
<td colspan="2" style="background-color: #f5f5f5;">
<table style="width: 100%">
<tr>
<td style="background-color: #f1f1f1; width: 170px; text-align: center;">
<a href='/clients/lots/details.aspx?lotid=756425&hb=1' target='756425' onclick="window.open(this.href,this.target,'width=900,height=650,resizable=yes,scrollbars=yes');return false" title="Auction property in Pimlico, London, SW1">
<img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_Image1" src="https://www.eigroup.co.uk/files/55/17999/de591a4f-7da1-4bcd-a42c-76731bd72a23.jpg" alt="Pimlico, London, SW1" style="border-color:Black;border-width:2px;border-style:Solid;width:150px;" />
</a>
</td>
<td style="padding-left: 10px; width: 50%;">
<p>
<b>Description</b><br />
Leasehold 2nd Floor Studio Flat Unmodernised Vacant
</p>
<p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P1">
<b>Guide Price</b><br />
£450,000 Plus
</p>
<p>
<b>Lot Number</b><br />
2
</p>
<p>
<b> </b>
</p>
</td>
<td style="white-space: nowrap;">
<p>
<b>Auctioneer</b><br />
<a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctioneers/details.aspx?auctioneerid=55" target="_blank">Savills (London - National)</a>
</p>
<p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P3">
<b>Vendor</b><br />
Housing Association
</p>
</td>
<td style="white-space: nowrap;">
<p>
<b>Auction Date</b><br />
<a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank">28 October 2014</a>
</p>
<p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P7">
<b>Lease Details</b><br />
125 Yr, commencing 01/01/2013 (GR.£250.PA)
</p>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
and I get the following:
In [86]: parsed_page[1][0][1]
Out[86]: u'Description Leasehold 2nd Floor Studio Flat Unmodernised Vacant Guide Price \xa3450,000 Plus Lot Number 2 Auctioneer Savills (London - National) Vendor Housing Association Auction Date 28 October 2014 Lease Details 125 Yr, commencing 01/01/2013 (GR.\xa3250.PA)'
The problem is, I want to be able to extract the description, guide price etc, but there aren't any delimiters and the number of characters afterwards is variable. Am I missing a keyword when I'm parsing?
How can I then split them into new columns?
Using BeautifulSoup, as I recommended in an answer to your last question, you can split the text of each <p> element on newlines and build a dict:
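from bs4 import BeautifulSoup

# html holds the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
s = soup.find_all("p")
# Each <p> holds a label and its value on separate lines
details = (ele.text.strip().split("\n") for ele in s)
d = {}
for det in details:
    # Skip paragraphs that are not a simple label/value pair
    if len(det) == 2:
        d[det[0].strip()] = det[1].strip()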
{u'Vendor': u'Housing Association', u'Description': u'Leasehold 2nd Floor Studio Flat Unmodernised Vacant', u'Auction Date': u'28 October 2014', u'Auctioneer': u'Savills (London - National)', u'Lot Number': u'2', u'Guide Price': u'\xc2\u0141450,000 Plus', u'Lease Details': u'125 Yr, commencing 01/01/2013 (GR.\xc2\u0141250.PA)'}
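To then get these into separate columns, one option is the standard csv module. The following is a minimal sketch rather than part of the original answer: the dict is retyped from the data in the question, and `rows_to_csv` and `columns` are hypothetical names chosen for illustration. It assumes each lot has been parsed into one such dict.

```python
import csv
import io

# Sample dict shaped like the one built above (values retyped from the question).
d = {
    "Description": "Leasehold 2nd Floor Studio Flat Unmodernised Vacant",
    "Guide Price": "£450,000 Plus",
    "Lot Number": "2",
    "Auctioneer": "Savills (London - National)",
    "Vendor": "Housing Association",
    "Auction Date": "28 October 2014",
    "Lease Details": "125 Yr, commencing 01/01/2013 (GR.£250.PA)",
}

# Hypothetical column order; any ordering or subset of the labels would work.
columns = ["Description", "Guide Price", "Lot Number", "Auctioneer",
           "Vendor", "Auction Date", "Lease Details"]


def rows_to_csv(rows, fieldnames):
    """Return CSV text with one column per label and one line per dict."""
    buf = io.StringIO()
    # restval fills labels missing from a row; extra keys are ignored.
    writer = csv.DictWriter(buf, fieldnames=fieldnames,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = rows_to_csv([d], columns)
# Show just the header row that was written.
print(csv_text.splitlines()[0])
```

Values that contain commas, such as the guide price, are quoted by the writer automatically, so they stay in a single column when the file is read back.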

Extracting table data from html with python and BeautifulSoup

I'm new to Python and the BeautifulSoup library. I have tried many things, but no luck.
My HTML looks something like this:
<form method = "post" id="FORM1" name="FORM1">
<table cellpadding=0 cellspacing=1 border=0 align="center" bgcolor="#cccccc">
<tr>
<td class="producto"><b>Club</b><br>
<input value="CLUB TENIS DE MESA PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtClub" size="60" maxlength="55">
</td>
<tr>
<td colspan="2" class="producto"><b>Nombre Equipo</b><br>
<input value="C.T.M. PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtNomEqu" size="100" maxlength="80">
</td>
</tr>
<tr>
<td class="producto"><b>Telefono fijo</b><br>
<input value="63097005534" disabled class="txtmascaraform" type="TEXT" name="txtTelf" size="15" maxlength="10">
</td>
I need to extract JUST the text inside each <b>...</b> tag and the value attribute of its input.
Many thanks!!
First find() the form by its id, then find_all() the input elements inside it and read the value attribute of each one:
from bs4 import BeautifulSoup
data = """<form method = "post" id="FORM1" name="FORM1">
<table cellpadding=0 cellspacing=1 border=0 align="center" bgcolor="#cccccc">
<tr>
<td class="producto"><b>Club</b><br>
<input value="CLUB TENIS DE MESA PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtClub" size="60" maxlength="55">
</td>
<tr>
<td colspan="2" class="producto"><b>Nombre Equipo</b><br>
<input value="C.T.M. PORTOBAIL" disabled class="txtmascaraform" type="TEXT" name="txtNomEqu" size="100" maxlength="80">
</td>
</tr>
<tr>
<td class="producto"><b>Telefono fijo</b><br>
<input value="63097005534" disabled class="txtmascaraform" type="TEXT" name="txtTelf" size="15" maxlength="10">
</td>
</tr>
</table>
</form>"""
soup = BeautifulSoup(data, "html.parser")
form = soup.find("form", {'id': "FORM1"})
# value attribute of every input inside the form
print([item.get('value') for item in form.find_all('input')])
# UPDATE for getting table cell values
table = form.find("table")
# stripped text of every cell, i.e. the text inside each <b>
print([item.text.strip() for item in table.find_all('td')])
prints:
['CLUB TENIS DE MESA PORTOBAIL', 'C.T.M. PORTOBAIL', '63097005534']
['Club', 'Nombre Equipo', 'Telefono fijo']
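Since the two lists line up by position, one way to get each label next to its value is to zip them together. This is only a sketch under the assumption that every cell holds exactly one label and one input; `pair_fields` is a hypothetical helper name, and the lists are copied from the output above rather than scraped here.

```python
# Lists copied from the output above; in practice they would come from
# form.find_all('input') and table.find_all('td') as in the answer.
values = ['CLUB TENIS DE MESA PORTOBAIL', 'C.T.M. PORTOBAIL', '63097005534']
labels = ['Club', 'Nombre Equipo', 'Telefono fijo']


def pair_fields(labels, values):
    """Pair each label with the value found at the same position."""
    # A length mismatch means some cell lacked a label or an input,
    # so positional pairing would silently misalign the data.
    if len(labels) != len(values):
        raise ValueError("labels and values must have the same length")
    return dict(zip(labels, values))


fields = pair_fields(labels, values)
for label, value in fields.items():
    print(f"{label}: {value}")
```

If some cells may have no input, pairing inside a loop over each td would be safer than zipping two independent lists.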
