Parsing HTML with BeautifulSoup in Python - python
I am trying to parse HTML with Python using BeautifulSoup, but I can't get the data I need.
This is a small module of a personal app I am writing. It logs in to a website with credentials, and once the script is logged in, I need to parse some information from the page so I can process it.
The HTML after logging in is:
<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>
As you can see, the HTML is not very well formatted, but I need to extract the labels and their values, for example "Daily Earnings" and "150", "Weekly Earnings" and "500", and so on.
I think the "id" attribute may help, but when I try to parse it, the script crashes.
The Python code I'm working with is:
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par
Here archivohtml is the HTML file saved after logging in to the site.
When I run the script, I only get errors.
I've also tried doing this:
def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par
But the result is still the same.
The tag with id="west1" is an <a> tag. You are looking for the <td> tag that comes after this <a> tag:
import BeautifulSoup as bs
content = '''<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>'''
def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    print par.string.strip()

parseo(content)
yields
150
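One likely cause of the errors in the question is that `find` returns `None` when no tag matches, so the `.string` that follows raises `AttributeError`. Below is a minimal sketch of a guarded lookup. It assumes the `bs4` package (BeautifulSoup 4) rather than the older module imported above; the helper name `first_cell_after` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: an <a id="west1"> followed by the cell we want.
SAMPLE = (
    '<table><tr>'
    '<td><a id="west1">Daily Earnings</a></td>'
    '<td> 150 </td>'
    '</tr></table>'
)


def first_cell_after(html, tag_name, tag_id):
    """Return the stripped text of the first <td> after the matching tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find(tag_name, id=tag_id)
    if anchor is None:
        # find() returns None when nothing matches, so stop here
        # instead of calling a method on None.
        return None
    cell = anchor.find_next("td")
    if cell is None:
        return None
    return cell.get_text(strip=True)


found = first_cell_after(SAMPLE, "a", "west1")     # the id is on an <a> tag
missing = first_cell_after(SAMPLE, "td", "west1")  # no <td> carries that id
print(found, missing)
```

Returning `None` keeps the failure visible to the caller without a traceback; raising a custom error would work equally well.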
I can't tell from your question if this will be applicable to you, but here's another method:
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    for line in parsed_html.stripped_strings:
        print line.strip()
which yields:
Account Balance
Daily Earnings
150
Weekly Earnings
500
Monthly Earnings
1500
Total expended
430
Account Balance
840
And if you wanted the data in a list:
data = [line.strip() for line in parsed_html.stripped_strings]
[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
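If the goal is to keep each label together with its value, as the question asks, one option is to walk the table row by row instead of flattening all strings. This is a sketch under a few assumptions: it uses `bs4` with the built-in `html.parser`, the HTML is a shortened copy of the table from the question, and `earnings_by_label` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
HTML = """
<table class="simple">
<tr><td>Daily Earnings</td><td style="text-align: right;">
150
</td></tr>
<tr><td>Weekly Earnings</td><td style="text-align: right;">
500 </td></tr>
<tr><td></td><td><form id="request_bill"><input type="submit" value="Request Payout" /></form></td></tr>
</table>
"""


def earnings_by_label(html):
    """Map each row's first-cell label to its second-cell text."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.select("table.simple tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        if label:  # skip the row that only holds the payout form
            result[label] = value
    return result


earnings = earnings_by_label(HTML)
print(earnings)
```

Working per row avoids relying on the position of each string in the flat list, which would shift if the page gained or lost a heading.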
Related
I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project
I have this HTML code for example: <table class="nested4"> <tr> <td colspan="1"></td> <td colspan="2"> <h2 class="zeroMargin" id="govtMsg" visible="false"></h2> </td> <td colspan="2"> <h2 class="zeroMargin "> Net Metering Conn. </h2> </td> <td colspan="2"> <h2 class="zeroMargin" hidden> Life Line Consumer</h2> </td> </tr> <tr> <td colspan="2"> <p style="margin: 0; text-align: left; padding-left: 5px"> <span>NAME & ADDRESS</span> <br /> <span>MUHAMMAD AMIN </span> <br /> <span>S/O MUHAMMAD KHAN </span> <br /> <span>H-NO.38 MARGALLA ROAD </span> <br /> <span>F-6/3 ISLAMABAD3 </span> <br /> <span></span> </p> </td> <td colspan="3" style="text-align: left"> <h2 class="color-red">Say No To Corruption</h2> <span style="font-size: 8pt; color: #78578e"> MCO Date : 10-Aug-2018</span> <br /> </td> <td> <h3 style="font-size: 14pt;"> </h3> <h2> <br /> </h2> </td> </tr> <tr> <td style="margin-top: 0;" class="border-b"> <br /> </td> <td colspan="1" style="margin-top: 0;" class="border-b"> </td> <td colspan="1" style="margin-top: 0;" class="border-b"> </td> </tr> <tr style="height: 7%;" class="border-tb"> <td style="width: 130px" class="border-r"> <h4>METER NO</h4> </td> <td style="width: 90px" class="border-r"> <h4>PREVIOUS READING</h4> </td> <td style="width: 90px" class="border-r"> <h4>PRESENT READING</h4> </td> <td style="width: 60px" class="border-r"> <h4>MF</h4> </td> <td style="width: 60px" class="border-r"> <h4>UNITS</h4> </td> <td> <h4>STATUS</h4> </td> </tr> <tr style="height: 30px" class="content"> <td class="border-r"> 3-P I 3301539<br> I 3301539<br> E 3301539<br> E 3301539<br> </td> <td class="border-r"> 78693<br>16823<br>19740<br>8<br> </td> <td class="border-r"> 80086<br>17210<br>20139<br>8<br> </td> <td class="border-r"> 1<br>1<br>1<br>1<br> </td> <td class="border-r"> 1393<br>387<br>399<br>0<br> </td> <td> </td> </tr> <tr id="roshniMsg" style="height: 30px" class="content"> <td colspan="6"> <div style="width: 452pt"> <img style="max-width: 100%; max-height: 35%" 
src="/images/companies/iesco/roshniMsg.jpg" alt="Roshni Message" /> </div> </td> </tr> </table>
From this table I want to extract the paragraph, and from that paragraph I want all the span tags. I used soup.find_all() to get the table, but I don't know how to apply the function to its own result so that I can find the paragraph and then the span tags inside it.
This is the Python code I wrote:
soup = BeautifulSoup(string, 'html.parser')
# Getting the table tag
results = soup.find_all('table', attrs={'class':'nested4'})
# Getting the paragraph tag
results = soup.find_all('p', attrs={'style':'margin: 0; text-align: left; padding-left: 5px'})
# Getting all the span tags
results = soup.find_all('span', attrs={})
I just want help getting the paragraphs within the table, and then the spans within the paragraph; at the moment I get the spans from the whole HTML document. I don't know how to pass the list of bs4 objects back to the soup object to use soup.find_all iteratively.
from bs4 import BeautifulSoup

html = '''
<table class="nested4"> <tr> <td colspan="1"></td> <td colspan="2"> <h2 class="zeroMargin" id="govtMsg" visible="false"></h2> </td> <td colspan="2"> <h2 class="zeroMargin "> Net Metering Conn. </h2> </td> <td colspan="2"> <h2 class="zeroMargin" hidden> Life Line Consumer</h2> </td> </tr> <tr> <td colspan="2"> <p style="margin: 0; text-align: left; padding-left: 5px"> <span>NAME & ADDRESS</span> <br /> <span>MUHAMMAD AMIN </span> <br /> <span>S/O MUHAMMAD KHAN </span> <br /> <span>H-NO.38 MARGALLA ROAD </span> <br /> <span>F-6/3 ISLAMABAD3 </span> <br /> <span></span> </p> </td> <td colspan="3" style="text-align: left"> <h2 class="color-red">Say No To Corruption</h2>
'''
soup = BeautifulSoup(html, 'html.parser')
spans = soup.select_one('table.nested4').select('span')
for span in spans:
    print(span.text)
This returns:
NAME & ADDRESS
MUHAMMAD AMIN
S/O MUHAMMAD KHAN
H-NO.38 MARGALLA ROAD
F-6/3 ISLAMABAD3
If you have one table:
soup = BeautifulSoup(string, 'html.parser')
table = soup.find('table', attrs={'class': 'nested4'})
p = table.find('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'})
results = p.find_all('span')
for result in results:
    print(result.get_text(strip=True))
If you have a list of tables:
soup = BeautifulSoup(string, 'html.parser')
for table in soup.find_all('table', attrs={'class': 'nested4'}):
    for p in table.find_all('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'}):
        for span in p.find_all('span'):
            print(span.get_text(strip=True))
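The nested lookups can also be written as a single CSS descendant selector, which avoids matching on the exact `style` string. This is a sketch assuming `bs4` with the built-in `html.parser`; the HTML is a trimmed, made-up variant of the table in the question with two extra spans added to show what is excluded.

```python
from bs4 import BeautifulSoup

HTML = """
<span>outside the table</span>
<table class="nested4"><tr><td>
<p style="margin: 0"><span>NAME &amp; ADDRESS</span><br /><span>MUHAMMAD AMIN </span><br /><span></span></p>
</td><td><span>not in a paragraph</span></td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "table.nested4 p span" matches only <span> tags that sit inside a <p>
# that is itself inside the table with class "nested4".
texts = [span.get_text(strip=True) for span in soup.select("table.nested4 p span")]
texts = [text for text in texts if text]  # drop empty spans
print(texts)
```

The selector does the scoping that the question tried to get by calling `soup.find_all` repeatedly on the whole document.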
Problem extracting text of td from table row (tr) with scrapy
I am parsing data table from the following URL: https://www.signalstart.com/search-signals In particular, I am trying to extract the data from the table rows. The table row has a series of table-data cells: <table class="table table-striped table-bordered dataTable table-hover" id="searchSignalsTable"> <thead> <tr> <th class="sorting sorting_asc">Rank</th> <th class="sorting ">Name</th> <th class="sorting ">Gain</th> <th class="sorting ">Pips</th> <th class="sorting ">DD</th> <th class="sorting ">Trades</th> <th class="sorting ">Type</th> <th>Monthly</th> <th>Chart</th> <th class="sorting ">Price</th> <th class="sorting " style="width: 40px">Age</th> <th class="sorting " style="width: 70px">Added</th> <th>Action</th> </tr> </thead> <tbody> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/joker-1k/110059">Joker 1k</a> </td> <td><span class="red">-9.99%</span></td> <td><span class="green">2,092.3</span></td> <td>15.3%</td> <td>108</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark110059"><canvas width="12" height="25" style="display: inline-block; vertical-align: top; width: 12px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark110059"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 1m 24d </td> <td> Mar 29, 2020 </td> <td><a onclick="getMasterPricingData('110059');" data-toggle="modal"><button id="subscribeToMasterBtn110059" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="110059" value="-1.78,-3.68,-4.86"> <input type="hidden" class="dailyGrowthData" oid="110059" value="0.00,-0.03,-1.78,-5.69,-6.75,-5.59,-7.61,-5.31,-6.20,-3.81,-4.40,-8.00,-2.88,-3.78,-4.38,-0.20,-5.40,-10.66,-13.69,-12.51,-13.23,-9.99"> <input type="hidden" 
class="dailyEquityData" oid="110059" value="0.00,-0.23,-1.41,-5.02,-6.25,-4.29,-6.68,-3.91,-5.37,-4.10,-4.40,-3.59,-1.78,-1.75,-2.65,-0.21,-4.87,-10.76,-13.90,-11.58,-13.23,-10.18"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/fxabakus/56043">FXabakus</a> </td> <td><span class="red">-19.57%</span></td> <td><span class="red">-8,615.2</span></td> <td>42%</td> <td>1642</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark56043"><canvas width="80" height="25" style="display: inline-block; vertical-align: top; width: 80px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark56043"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 1y 7m </td> <td> May 4, 2019 </td> <td><a onclick="getMasterPricingData('56043');" data-toggle="modal"><button id="subscribeToMasterBtn56043" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="56043" value="1.22,1.35,3.92,1.35,-1.57,1.77,2.01,1.11,0.38,-14.89,-14.70,-5.21,5.97,7.03,-17.54,2.92,3.11,-8.94,13.38,1.77"> <input type="hidden" class="dailyGrowthData" oid="56043" value="-27.87,-29.29,-29.01,-26.76,-25.76,-25.59,-30.57,-30.13,-29.78,-29.60,-29.25,-28.34,-28.07,-27.89,-25.20,-25.08,-23.66,-23.46,-21.54,-21.02,-21.62,-20.28,-18.31,-26.97,-27.48,-27.00,-28.21,-24.20,-23.46,-30.04,-31.37,-34.62,-33.84,-32.87,-32.20,-30.99,-30.43,-30.30,-29.75,-27.64,-27.45,-24.34,-24.71,-24.09,-24.15,-21.48,-21.08,-20.97,-19.54,-19.57"> <input type="hidden" class="dailyEquityData" oid="56043" 
value="-27.87,-29.29,-28.89,-26.76,-25.76,-28.10,-34.47,-32.34,-31.54,-40.80,-32.76,-32.90,-33.50,-30.65,-25.37,-25.05,-22.88,-23.29,-21.54,-21.02,-21.54,-20.90,-19.11,-27.76,-35.15,-29.17,-27.79,-24.20,-26.23,-34.32,-35.95,-51.20,-33.84,-32.76,-32.71,-31.62,-30.43,-39.93,-29.75,-27.64,-28.35,-27.62,-28.41,-24.20,-24.51,-22.06,-21.08,-20.97,-18.82,-30.27"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/af-investing-pro-final/122603">AF Investing Pro Final</a> </td> <td><span class="green">56.69%</span></td> <td><span class="green">29,812</span></td> <td>8.6%</td> <td>476</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark122603"><canvas width="8" height="25" style="display: inline-block; vertical-align: top; width: 8px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark122603"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$250</td> <td> 17d 12h </td> <td> Apr 30, 2020 </td> <td><a onclick="getMasterPricingData('122603');" data-toggle="modal"><button id="subscribeToMasterBtn122603" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="122603" value="55.18,0.98"> <input type="hidden" class="dailyGrowthData" oid="122603" value="-0.02,0.04,54.78,55.02,55.18,55.82,55.86,55.99,56.06,56.25,56.69"> <input type="hidden" class="dailyEquityData" oid="122603" value="-8.60,16.85,54.86,54.11,55.44,55.85,54.38,52.15,45.00,51.07,56.25"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/rapid-growth/111340">Rapid growth</a> </td> <td><span class="green">130.78%</span></td> <td><span class="green">1,102.9</span></td> <td>44.3%</td> <td>126</td> <td>Real</td> 
<td><span class="monthlySparkline" id="monthlySpark111340"><canvas width="12" height="25" style="display: inline-block; vertical-align: top; width: 12px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark111340"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$31</td> <td> 2m 8d </td> <td> Apr 1, 2020 </td> <td><a onclick="getMasterPricingData('111340');" data-toggle="modal"><button id="subscribeToMasterBtn111340" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="111340" value="87.85,18.28,3.87"> <input type="hidden" class="dailyGrowthData" oid="111340" value="0.00,0.64,1.40,1.40,1.90,2.91,7.53,8.21,11.19,11.30,17.60,19.60,23.03,37.74,47.75,54.75,59.91,69.79,73.60,79.36,87.85,93.14,93.40,94.70,95.93,96.01,99.95,100.71,101.85,102.10,102.12,104.36,108.76,110.11,110.14,110.23,112.58,115.10,115.54,117.17,121.24,122.19,123.40,124.18,124.88,124.89,130.09,130.78"> <input type="hidden" class="dailyEquityData" oid="111340" value="-1.80,0.67,0.97,1.91,-0.64,2.58,6.82,6.72,8.65,8.46,16.29,17.71,19.96,34.10,47.24,51.91,59.07,69.79,73.58,79.26,88.01,91.03,93.43,87.85,96.19,95.80,100.29,95.63,98.94,101.71,98.33,104.12,108.26,108.46,86.24,108.42,112.83,114.51,94.42,116.29,120.16,121.93,123.05,115.67,122.81,124.45,130.47,130.14"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/dream-presentation-1/66543">Dream Presentation 1</a> </td> <td><span class="red">-99.9%</span></td> <td><span class="red">-2,724.1</span></td> <td>99.9%</td> <td>1612</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark66543"><canvas width="28" height="25" style="display: inline-block; vertical-align: top; width: 28px; height: 25px;"></canvas></span></td> <td><span 
class="dayliSparkline" id="dayliSpark66543"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 6m 13d </td> <td> Nov 8, 2019 </td> <td><a onclick="getMasterPricingData('66543');" data-toggle="modal"><button id="subscribeToMasterBtn66543" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="66543" value="-100.14,-98.54,-98.79,-91.71,-98.23,-100.00,-88.82"> <input type="hidden" class="dailyGrowthData" oid="66543" value="24.18,-99.90,-99.89,-99.88,-99.88,-99.88,-99.87,-99.87,-99.86,-99.84,-99.83,-99.90,-99.89,-99.90,-99.90,-99.81,-99.81,-99.80,-99.90,-99.90,-99.86,-99.83,-99.79,-99.90,-99.90,-99.90,-99.88,-99.89,-99.89,-99.88,-99.82,-99.74,-99.85,-99.37,-99.88,-99.90,-99.90,-99.90,-99.90,-99.87,-99.83,-99.80,-99.75,-99.64,-99.56,-99.90,-99.90"> <input type="hidden" class="dailyEquityData" oid="66543" value="7.87,-99.90,-99.89,-99.88,-99.88,-99.88,-99.88,-99.87,-99.86,-99.84,-99.83,-99.90,-99.89,-99.90,-99.89,-99.83,-99.88,-99.88,-99.90,-99.90,-99.87,-99.83,-99.84,-99.72,-99.90,-99.90,-99.88,-99.89,-99.88,-99.92,-99.86,-99.74,-99.86,-99.39,-99.88,-99.90,-99.90,-99.90,-99.90,-99.87,-99.83,-99.79,-99.76,-99.63,-99.55,-100.16,-99.83"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/limerence-ea-suite-3/93679">Limerence EA Suite 3</a> </td> <td><span class="green">1,246.66%</span></td> <td><span class="green">199.8</span></td> <td>34.2%</td> <td>8</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark93679"><canvas width="20" height="25" style="display: inline-block; vertical-align: top; width: 20px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark93679"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; 
width: 100px; height: 25px;"></canvas></span></td> <td>$75</td> <td> 7m 11d </td> <td> Feb 11, 2020 </td> <td><a onclick="getMasterPricingData('93679');" data-toggle="modal"><button id="subscribeToMasterBtn93679" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="93679" value="95.40,82.01,94.38,87.49,3.90"> <input type="hidden" class="dailyGrowthData" oid="93679" value="0.00,95.40,255.64,591.28,552.49,1234.12,1196.10,1246.66"> <input type="hidden" class="dailyEquityData" oid="93679" value="0.00,95.40,255.64,591.28,1034.76,1234.12,1196.10,1246.66"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/easy-money/31727">Easy Money</a> </td> <td><span class="red">-99.9%</span></td> <td><span class="green">2,430.6</span></td> <td>100%</td> <td>1095</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark31727"><canvas width="96" height="25" style="display: inline-block; vertical-align: top; width: 96px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark31727"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 2y 2m </td> <td> Apr 1, 2018 </td> <td><a onclick="getMasterPricingData('31727');" data-toggle="modal"><button id="subscribeToMasterBtn31727" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="31727" value="6.22,-6.15,22.04,-5.08,0.08,12.08,-69.31,-99.82,245.26,88.44,113.73,52.29,25.38,77.72,-29.07,-24.73,-86.48,-89.27,195.77,-7.65,-99.98,278.89,-69.98,-65.48"> <input type="hidden" class="dailyGrowthData" oid="31727" 
value="-99.66,-99.69,-99.72,-99.73,-99.77,-99.77,-99.78,-99.81,-99.90,-99.90,-99.89,-99.84,-99.83,-99.82,-99.81,-99.75,-99.78,-99.77,-99.79,-99.78,-99.77,-99.48,-99.46,-99.36,-99.34,-99.33,-99.33,-99.31,-99.33,-99.34,-99.40,-99.45,-99.33,-99.58,-99.65,-99.73,-99.71,-99.70,-99.68,-99.68,-99.69,-99.68,-99.71,-99.68,-99.80,-99.80,-99.77,-99.81,-99.84,-99.90"> <input type="hidden" class="dailyEquityData" oid="31727" value="-99.66,-99.69,-99.73,-99.70,-99.85,-99.89,-99.95,-99.77,-99.85,-99.90,-99.88,-99.84,-99.83,-99.82,-99.79,-99.75,-99.78,-99.77,-99.70,-99.68,-99.59,-99.48,-99.46,-99.36,-99.34,-99.33,-99.32,-99.25,-99.30,-99.34,-99.37,-99.37,-99.35,-99.58,-99.61,-99.73,-99.71,-99.69,-99.68,-99.68,-99.68,-99.68,-99.71,-99.68,-99.80,-99.76,-99.73,-99.79,-99.80,-99.89"> </div> </td> </tr> </tbody> </table>
My code successfully extracts the data from the first table-data cell (the rank), but the second table-data cell (the name) comes back blank. What is wrong with this source code?
import scrapy
from behold import Behold

class SignalStartSpider(scrapy.Spider):
    name = 'signalstart'
    start_urls = [
        'https://www.signalstart.com/search-signals',
    ]

    def parse(self, response):
        for provider in response.xpath("//div[@class='row']//tr"):
            yield {
                'rank': provider.xpath('td[1]/text()').get(),
                'name': provider.xpath('td[2]/text()').get(),
            }
UPDATE
I am now iterating over the td cells within each tr, but my final problem is: how do I get the text from the td cells that I have?
import scrapy
from behold import Behold

class SignalStartSpider(scrapy.Spider):
    name = 'signalstart'
    start_urls = [
        'https://www.signalstart.com/search-signals',
    ]

    def parse(self, response):
        cols = "rank name gain pips drawdown trades type monthly chart price age added action"
        skip = [9,13]
        td = dict()
        for i, col in enumerate(cols.split()):
            td[i] = col
        Behold().show('td')
        for provider in response.xpath("//div[@class='row']//tr"):
            data_row = dict()
            for i, datum in enumerate(provider.xpath('td')):
                if i in skip:
                    continue
                data_row[td[i]] = datum
                # Behold().show('datum')
            yield data_row
The correct answer was provided by gallaecio_ in the Scrapy IRC channel. Here is the code:
import scrapy
from behold import Behold

class SignalStartSpider(scrapy.Spider):
    name = 'signalstart'
    start_urls = [
        'https://www.signalstart.com/search-signals',
    ]

    def parse(self, response):
        cols = "rank name gain pips drawdown trades type monthly chart price age added action"
        skip = [9,13]
        td = dict()
        for i, col in enumerate(cols.split()):
            td[i] = col
        Behold().show('td')
        for provider in response.xpath("//div[@class='row']//tr"):
            data_row = dict()
            for i, datum in enumerate(provider.xpath('td/text()')):
                if i in skip:
                    continue
                data_row[td[i]] = datum.get()
                # Behold().show('datum')
            yield data_row
For more involved cases you may need https://github.com/TeamHG-Memex/html-text
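The name cell appears blank in the original spider because `td[2]/text()` selects only text nodes that are direct children of the cell, while the name sits inside the nested `<a>` tag. The sketch below illustrates the difference with `lxml` used as a stand-in for Scrapy's selectors (an assumption made so the example runs without a spider); the same XPath expressions should be usable in Scrapy as `provider.xpath(...).get()`. The row is a shortened copy of the first row from the question.

```python
from lxml import html as lxml_html

# Shortened copy of the first table row from the question.
ROW = """
<table><tbody><tr>
<td style="text-align: center;"> - </td>
<td><a class="pointer" href="https://www.signalstart.com/analysis/joker-1k/110059">Joker 1k</a> </td>
<td><span class="red">-9.99%</span></td>
</tr></tbody></table>
"""

tree = lxml_html.fromstring(ROW)
row = tree.xpath("//tr")[0]

# Direct text children of the second cell: only the whitespace after </a>.
direct = row.xpath("td[2]/text()")

# normalize-space() takes the string value of the whole cell, nested tags
# included, and trims the surrounding whitespace.
full = row.xpath("normalize-space(td[2])")

# The same idea applied to every cell of the row.
cells = [td.xpath("normalize-space(.)") for td in row.xpath("td")]
print(direct, full, cells)
```

Evaluating one expression per cell also keeps the column index aligned with the cell position, which is not guaranteed when iterating over `td/text()` across the whole row.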
Scraper problems with ASP.NET locating objects - Selenium
I'm new to Python, and I'm trying to write a scraper for an ASPX website. The page returns two kinds of results: empty ones and actual results. My code can read the empty ones, but I can't get the results when they exist. I have tried all kinds of paths and still can't get the result. Can someone help me? This is my code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import openpyxl
from openpyxl import load_workbook

planilha = load_workbook('./BASE 05-09.xlsx')
driver = webdriver.Chrome(executable_path=r'C:\Python37\webdriver\chromedriver.exe')
wait = WebDriverWait(driver, 10)
sheet = planilha['Aba1']
driver.get("http://www1.cfc.org.br/sisweb/siscnai/externaConsultaCadastro.aspx")
for Count in range(2, 1101):
    driver.find_element_by_id("ContentPlaceHolder1_tbxCPF").send_keys(sheet.cell(row=Count, column=5).value, Keys.RETURN)
    results = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[id*='ContentPlaceHolder1_gvwProfissional'] > tbody > tr")))
    resultado_pesquisa = results[0].text.strip() if "ContentPlaceHolder1_gvwProfissional" in results[0].get_attribute("class") else results[0].find_element_by_xpath("./td[1]").text.strip()
    driver.find_element_by_id("ContentPlaceHolder1_tbxCPF").clear()
    sheet.cell(row=Count, column=7).value = resultado_pesquisa
planilha.save("BASE 05-09.xlsx")
driver.quit()
This is the page source when there are results; I want to get the "5433":
<html> <head id="Head1"><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" /><title> CNAI </title></head> <body> <form method="post" action="externaConsultaCadastro.aspx" id="form1"> <div class="aspNetHidden"> <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" 
value="AM7kGtthPHcCeQeZ3wWqQvzOI0fCr5HN29F2i/xZ5Ix7EkcYSSc9FlCfCcbHtX2Qulw1TLFpz/+RNvGQPU1/OqZpxByvUPSE2gaVonfaQsQn7zvoossHNNUDTiQmHv9XT5KkXiFi4Oa2B2Ix/MNkWIIw86rgaBK3NhQHUE7S+DsAlvsqZ1sy59fb1+/d/FF32dYRXcocqfcP4TL8ZtLhlRKt3rP1C+kS8/CkywxSTqBxQQ3h52z9Fm9dxcfjgHXQzisjVQuYhYWPnV6gcfJU2r9Hed49Zmx/mC4ydsTI7mbNpVYbwi4AqZKQvg0KAa+K+5ZLto2yg61qut6rUG0HyrpY5yOQk5XEH/BfK8qYoHbouJbYY9mbMwspkzg0bkNSFPz1dG45NLdibrvGoO5PSrHOzJpZhTufdzUPu5gVpUhlhrpU98c8ZzHJjS07xBZ72BwPp1eb1e9hPwUuPkD+SQ7w4ekSdaFVqUi2dWVP+uTcgL8pISRKt7viiraxvarsnQBiuyI7I+8gIMb5KMP0rB6R/AIKHNZZJI9fFipjabgtixU/+c5qsCvT1yLxx9XhO+nLBdYtgxOXuhjZ1dQ2DGe5E19ypAYDcqyGJotx4xQwXjMyYAhKLCWwZV9hPFVuQ3I/FRkI9u4+zWB782qmVkRZPl8Hde5wHrOW4V1DfxQz0191Ti+esid2SicQZZReSA1U5l1rv7qtKfWx+5nSJRdP13Z/vZVazAdpq1N6r2WzSOaDaa/1To87twg4kZP8kz/7VHU6fIoGIrrovke0XWvgsKiOUa9xqQ4fiW+Dl7HB1JrnLOPENKOnvmFfaI0DnWbKuWwB0CBao2pzxUtpd5Up195UesvowkUjNq4GgtsYo3I4NRag/M0ALN+0zz+3XVoqKzWHMWcy0yGJtbHcR5B++S66UlJOKdX0mGS6swfHz5twjLIOYxiuhRN6PBX0ZukZajaoRH3/GfN/kaj2GykyeVvhd+ds+qIpWKz+7d9PKqkwZiQLbXgaY3YjxjS9LpHseL5bAJkEMnundiHnjMVpjt0fZARNugggeEbei0xNntXUltc5A8xqQ3O5LXmUsw+i9QpsGcb5rFPO6ybOwAchyvZckeuEWsNC+blZY9iybQzGR7dyI1XhMHnJyEPvodso2tqwzVP/R4W9jMcUhr/V6gOnztsvGUnY6dfEW949ep9x9kkVPNJIpabJF1Cmgl/SVVm1/4TR7FZPx0PNpgyeieHvL0ieRSdlwgcuJm/rrgpNT8ka8u40I3PZB05288oTVagKY2fwdLUiU4gE9E2PSzyi/i224cjSPZ9b+yrnJz+Kn27Q+spsgzo0WW6QkwtxZx2hJ5q2n1WQRICU9oVmCY1BLyUxdIHq2jcb0gQ=" /> </div> <script src='masks.js'></script> <div class="aspNetHidden"> <input type="hidden" name="__VIEWSTATEENCRYPTED" id="__VIEWSTATEENCRYPTED" value="" /> <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" 
value="8EDkWLerXUVmMKlcXe/qqVujpBci2zX1ECnhJ4g+15vGRAo3rFrF5XP2X12Kr6nfAFlOpsUn3Sk/WM9LlAM0W+s6+LTZ6pSJUuUQu6ct75AmlJs+cWh7RTtWu1D6Arg8HDLFWhdvHDK/0sPW+VZ2wM58r+zcvQx/1wJmx/xkhtWuh0gkFFHVfq6zEAaL4SEnWvfH0wF5JZtGdnWgKhq0PQPkQPCk4gwWjZf9UJWX/I7BMFZetip0QShBtkQKaYFPyQ1riFre9eizciXNPJYrSU42IhZGnEWK4CCOBKetrpMTHaJiO2/lCpYWtMiMArUqeJz6gicoZc/q4GF6bgWAYIT+ItMiQC6N5eQFhwwGgKr/oRDush9H1IKBmg2kty1juv54o20yTrR19urRTyMut35n55+dHkkbMc2QKouXCKGrxXNE7t8/tOhAbaV+56FJjYFydcxrvWCpOKJzy5By3QR6xl4RPAFZrcAP5qGsSxugndJVM8lbgneoQEqjceeC8b8BFcZOSYIPOLD0CRAOSXD9FljgX8N5yz1RkJkOvYPpi6TIjugrILSgXMJtOx1BKfSL7vmYLVmm8hAHGssGnQXfBWnCqTu7e242s6TUotUbIuiJKFGpGhXnzbleDqXBMxjXLbOHQgsMxDPw9SoZYEVgtA2DZMfDWobpetTeQTc/ykyDmwXyCS9q+VK6seNRtFUIG62lVnzlMloIvGIWZkm7RVpz+FdtVXo75qAotGIhzDMhnbw1tvSW+huEdnBllFEJDedPdiUTM8ONKdkdaKsDbpPDI/K3vXGvc9V8t1MKihxXD42SPHdhzhSUNmsB6uxgOFP4iXBSATzdLBDD5FaaoJI/EaLVzSCpQGAMNwHilXBGMo97h77TLSnQu8x1adkEFUmkF/wmiQcyzEHhmxwI/bY7lKdtELEDO4JOP3g=" /> </div> <div> <table border="0" cellpadding="0" cellspacing="0"> <tr> <td colspan="2" style="height: 68px; width: 801px;"> <img src="Imagens/banner_cnai_externo.jpg" /></td> </tr> <tr> <td colspan="2" style="width: 801px; height: 232px;"> <div align=center> <br /> <table style="font-weight: bold; font-size: 12pt; width: 800px; color: white; font-family: verdana; height: 7px; background-color: firebrick"> <tr> <td> CONSULTAR CADASTRO CNAI</td> </tr> </table> <br /> <span style="font-size: 10pt; color: red; font-family: Verdana"><strong>Utilize <span style="text-decoration: underline">qualquer um</span> dos campos abaixo para fazer a pesquisa:</strong></span><br /> <br /> <table> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Nome:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNome" type="text" maxlength="100" id="ContentPlaceHolder1_tbxNome" style="font-family:Verdana;font-size:10pt;width:295px;" 
/></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Número CNAI:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNumeroCNAI" type="text" maxlength="8" id="ContentPlaceHolder1_tbxNumeroCNAI" style="font-family:Verdana;font-size:10pt;width:100px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">CPF:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxCPF" type="text" value="057.367.539-28" maxlength="14" id="ContentPlaceHolder1_tbxCPF" style="font-family:Verdana;font-size:10pt;width:150px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Registro:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNumeroRegistro" type="text" maxlength="8" id="ContentPlaceHolder1_tbxNumeroRegistro" style="font-family:Verdana;font-size:10pt;width:100px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Habilitação:</span></td> <td style="text-align: left"> <table id="ContentPlaceHolder1_cbxlCredenciamento" style="font-family:Verdana;font-size:10pt;"> <tr> <td><input id="ContentPlaceHolder1_cbxlCredenciamento_0" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$0" value="1" /><label for="ContentPlaceHolder1_cbxlCredenciamento_0">QTG</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_1" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$1" value="2" /><label for="ContentPlaceHolder1_cbxlCredenciamento_1">BCB</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_2" type="checkbox" 
name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$2" value="3" /><label for="ContentPlaceHolder1_cbxlCredenciamento_2">SUSEP</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_3" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$3" value="4" /><label for="ContentPlaceHolder1_cbxlCredenciamento_3">CVM</label></td> </tr> </table></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">UF:</span></td> <td style="text-align: left"> <select name="ctl00$ContentPlaceHolder1$ddlUF" id="ContentPlaceHolder1_ddlUF" style="font-family:Verdana;font-size:10pt;"> <option selected="selected" value=""></option> <option value="AC">AC</option> <option value="AL">AL</option> <option value="AM">AM</option> <option value="AP">AP</option> <option value="BA">BA</option> <option value="CE">CE</option> <option value="DF">DF</option> <option value="ES">ES</option> <option value="GO">GO</option> <option value="MA">MA</option> <option value="MG">MG</option> <option value="MS">MS</option> <option value="MT">MT</option> <option value="PA">PA</option> <option value="PB">PB</option> <option value="PE">PE</option> <option value="PI">PI</option> <option value="PR">PR</option> <option value="RJ">RJ</option> <option value="RN">RN</option> <option value="RO">RO</option> <option value="RR">RR</option> <option value="RS">RS</option> <option value="SE">SE</option> <option value="SC">SC</option> <option value="SP">SP</option> <option value="TO">TO</option> </select></td> </tr> <tr> <td colspan="2"> <br /> <input type="submit" name="ctl00$ContentPlaceHolder1$btnConsultar" value="Consultar" id="ContentPlaceHolder1_btnConsultar" style="font-family:Verdana;font-size:8pt;width:100px;" /> <input type="submit" name="ctl00$ContentPlaceHolder1$btnVoltar" value="<<< Voltar" id="ContentPlaceHolder1_btnVoltar" style="font-family:Verdana;font-size:8pt;width:100px;" /></td> </tr> 
</table> <br /> <span id="ContentPlaceHolder1_lblQtdRegistros" style="color:Firebrick;font-family:Verdana;font-size:10pt;font-weight:bold;">Quantidade de registros encontrados: 1</span><br /> <br /> <div> <table cellspacing="0" cellpadding="4" id="ContentPlaceHolder1_gvwProfissional" style="color:#333333;font-family:Verdana;font-size:8pt;width:790px;border-collapse:collapse;"> <tr style="color:White;background-color:DimGray;font-weight:bold;"> <th scope="col">Nº CNAI</th><th scope="col">Nome</th><th scope="col">Registro CRC</th><th scope="col">UF</th><th scope="col">Ativo Desde</th><th scope="col">Habilitação</th> </tr><tr style="color:#333333;background-color:#FFFBD6;"> <td>5433</td><td align="left" valign="middle">ADRIEL PAUL</td><td>SC-038746/O</td><td>SC</td><td>16/10/2017</td><td>QTG</td> </tr> </table> </div> <br /> <br /> </div> </td> </tr> <tr> <td colspan="2" style="height: 29px; background-color: #ffff92; text-align: center"> <span style="font-size: 8pt; color: firebrick; font-family: Verdana"><strong> <hr style="width: 790px" /> <span style="color: firebrick">CFC/DEINF - Departamento de Informática</span></strong></span></td> </tr> </table> </div> <script>_b0ea08358a064398935a96570c90f08e = new Mask("###.###.###-##");_b0ea08358a064398935a96570c90f08e.attach(document.getElementById('ContentPlaceHolder1_tbxCPF'));</script></form> </body> </html>
This is the page code when the result is empty; in this case I want to get the "Nenhum registro encontrado." text:
<html xmlns="http://www.w3.org/1999/xhtml" > <head id="Head1"><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" /><title> CNAI </title></head> <body> <form method="post" action="externaConsultaCadastro.aspx" id="form1"> <div class="aspNetHidden"> <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="N4CP+jvK/5b+U+rTB9wr1ebhSvqp5jRbCS2nCn9YQmpBGkOzPBNz77XaZqDmpks4sRpruRnk5/iODtmwHpy/TgS6IoY1opVEWGrstOsGKd9qS12fLEJcrl0C4qMMX6749LvuwRu85AopjkujK6QBv1+IEz18b30UAbvkGt9UELokaKjcjtOSOLK7AsBGf0EQ20q97wEeiJm9TE85TMflKNLDXWm/juP5rpG9cU/THT/piFUCakmhaupUwYKt84cRk2Ax7Cg45MUJXLMlOBqqiBvZYiDachCY4HYWVzt0/HNny5+Ylsw9GS3Ay/VnSVJ3+FFQnhAzpEgQqGubFeW3/fmeOI/vcA/JWB6cFux8rfKD0jnCjJvwWetFPlrtRr+O1xj9jmrzwo6cpV+KsAIQvdkmDN4rPQocbKH8gL7Na3zEUM9eCse8IGFIb4ZTdspkD7LcN9irH3bYyrBZsR1P6RQPWwX//nw99cFO72DDrCAZPUQZ/oyxNt7OPolmL88KEtCvedK/aNdbrjjZLlUeqQk41VwNZ/H8CO6NX2Gv1Kf/F6bQoWfVsUP5UZN53kCaaYitCdsgJp+Pnvyrh2oh49IhYp7VKXCK5a5HcZuWFPB7iabfi2EU8W1xonpvSG2PPsrg0rU4/CdLIKuhHtXV9fNiAREpqkq4g7m6u8heKmCXBrvxwODcpScXuFnSwRgGh3Yfv2EDQWcpV23Gcz/aBSoSw0i+g9tU8RmQgVI3KqlyEPQ29T95wAlS4inUiyXzhf5x4egIgJ8pd9/2XxS2+N29HSlWuuOYetLezzA+SL9CWP7QB9kg73o6vvJNmLAsQju91/H0pF1dDkJYb/Gd1hO3vATKttcvGtyEN/GmI6grXnwgx4bTkhJTEdoEuN8C6kD7x77sTXk1IqTSgBLvWF4KeOJvzgic6BgIFDxJyb0REGmXTgLnB/b6NA7fjLP/" /> </div> <script src='masks.js'></script> <div class="aspNetHidden"> <input type="hidden" name="__VIEWSTATEENCRYPTED" id="__VIEWSTATEENCRYPTED" value="" /> <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" 
value="KcA31Q+gCa7NxEiVU/OBNKTzdcD+hDJlBKesntp7xzs3YJoMskiWyBNqo48LSDxqEwpoAVhjW6XfPoOB6lAyHWd+/ffQ/UCoXcJLUVk4OsecSTxThSzxv3SdnIx4pE+ytZJVeAG7Ix3UCOgVU6tAOMY0Atbta3Kz9cnNAsQ8C2IHF/vORmd1XwPBYHXCe2FSjU+G1kfQwKV1du386WIfBCbbwl5DBW7qsVdbVaGMR+qgOd6Tjk3IV1IuJU0oCUDUm8CcVhm/R6mFrXfCUXl6LVyPHVPKiKaMsdqnGI/IKjI2TjkwkU4+UJJjjobo6ABr4v+Xc1Gwpj4/QVxMBoF5g6izDSGDO9sk5WWeQqFBQKRhABUHEpHnuNgZwYmDC+UbjJ8pArD4Gg9SJexKzZgXkAgwHp/glsGoa5/dYolKx2Nu03tomY14YXkbNq/ml4LmZ3HSPKGuEniZq5gcmd+oCNtQulHCFijcUW39e7PmrKp4MGPk9/0sjmYPa2UZAwF0/RQ0QikZQmOxLokzN/5U865m8hjp4Gj3ndmZpPHKPBa5iHbTqTHSj1qPVnn/v+9wlU4mG7fISLwaALSQHBtOGXyNHNq2F4JExT7R1QskvwzQMF8kJPnysoLhqVmN04i2rXLTH6xY+iUnAN4NOPoIP+T5YBs5DniT5K4RyjMioWQmv6a2eQES1tRxtkKBaPbztolYIVxKmabkzsEjXdOxHIxj21Z/R5UHa6bVnOPaeHKgSpSqyqhDMRu9e5vLkbA3o953g0TZx9xEfB0lw+j/MhqnI35mwplWucjxm9uA/0zTEDAHZ2ATd//iCKR4SWaxjL+y3BTBEn9Icy+LFh77qfj4yHn4Ye7Y5gyIn8oiFJOiNei51in80ZJyGkDP/MG5bKsC+f8R1LukFlur5JoefSmB6oRj7g9KVOw+FW31suQ=" /> </div> <div> <table border="0" cellpadding="0" cellspacing="0"> <tr> <td colspan="2" style="height: 68px; width: 801px;"> <img src="Imagens/banner_cnai_externo.jpg" /></td> </tr> <tr> <td colspan="2" style="width: 801px; height: 232px;"> <div align=center> <br /> <table style="font-weight: bold; font-size: 12pt; width: 800px; color: white; font-family: verdana; height: 7px; background-color: firebrick"> <tr> <td> CONSULTAR CADASTRO CNAI</td> </tr> </table> <br /> <span style="font-size: 10pt; color: red; font-family: Verdana"><strong>Utilize <span style="text-decoration: underline">qualquer um</span> dos campos abaixo para fazer a pesquisa:</strong></span><br /> <br /> <table> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Nome:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNome" type="text" maxlength="100" id="ContentPlaceHolder1_tbxNome" style="font-family:Verdana;font-size:10pt;width:295px;" 
/></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Número CNAI:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNumeroCNAI" type="text" maxlength="8" id="ContentPlaceHolder1_tbxNumeroCNAI" style="font-family:Verdana;font-size:10pt;width:100px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">CPF:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxCPF" type="text" value="462.929.158-08" maxlength="14" id="ContentPlaceHolder1_tbxCPF" style="font-family:Verdana;font-size:10pt;width:150px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Registro:</span></td> <td style="text-align: left"> <input name="ctl00$ContentPlaceHolder1$tbxNumeroRegistro" type="text" maxlength="8" id="ContentPlaceHolder1_tbxNumeroRegistro" style="font-family:Verdana;font-size:10pt;width:100px;" /></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">Habilitação:</span></td> <td style="text-align: left"> <table id="ContentPlaceHolder1_cbxlCredenciamento" style="font-family:Verdana;font-size:10pt;"> <tr> <td><input id="ContentPlaceHolder1_cbxlCredenciamento_0" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$0" value="1" /><label for="ContentPlaceHolder1_cbxlCredenciamento_0">QTG</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_1" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$1" value="2" /><label for="ContentPlaceHolder1_cbxlCredenciamento_1">BCB</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_2" type="checkbox" 
name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$2" value="3" /><label for="ContentPlaceHolder1_cbxlCredenciamento_2">SUSEP</label></td><td><input id="ContentPlaceHolder1_cbxlCredenciamento_3" type="checkbox" name="ctl00$ContentPlaceHolder1$cbxlCredenciamento$3" value="4" /><label for="ContentPlaceHolder1_cbxlCredenciamento_3">CVM</label></td> </tr> </table></td> </tr> <tr> <td style="text-align: right; font-weight: bold; color: firebrick; font-family: verdana;"> <span style="font-size: 10pt; font-family: Verdana">UF:</span></td> <td style="text-align: left"> <select name="ctl00$ContentPlaceHolder1$ddlUF" id="ContentPlaceHolder1_ddlUF" style="font-family:Verdana;font-size:10pt;"> <option selected="selected" value=""></option> <option value="AC">AC</option> <option value="AL">AL</option> <option value="AM">AM</option> <option value="AP">AP</option> <option value="BA">BA</option> <option value="CE">CE</option> <option value="DF">DF</option> <option value="ES">ES</option> <option value="GO">GO</option> <option value="MA">MA</option> <option value="MG">MG</option> <option value="MS">MS</option> <option value="MT">MT</option> <option value="PA">PA</option> <option value="PB">PB</option> <option value="PE">PE</option> <option value="PI">PI</option> <option value="PR">PR</option> <option value="RJ">RJ</option> <option value="RN">RN</option> <option value="RO">RO</option> <option value="RR">RR</option> <option value="RS">RS</option> <option value="SE">SE</option> <option value="SC">SC</option> <option value="SP">SP</option> <option value="TO">TO</option> </select></td> </tr> <tr> <td colspan="2"> <br /> <input type="submit" name="ctl00$ContentPlaceHolder1$btnConsultar" value="Consultar" id="ContentPlaceHolder1_btnConsultar" style="font-family:Verdana;font-size:8pt;width:100px;" /> <input type="submit" name="ctl00$ContentPlaceHolder1$btnVoltar" value="<<< Voltar" id="ContentPlaceHolder1_btnVoltar" style="font-family:Verdana;font-size:8pt;width:100px;" /></td> </tr> 
</table> <br /> <span id="ContentPlaceHolder1_lblQtdRegistros" style="color:Firebrick;font-family:Verdana;font-size:10pt;font-weight:bold;">Quantidade de registros encontrados: 0</span><br /> <br /> <div> <table cellspacing="0" cellpadding="4" id="ContentPlaceHolder1_gvwProfissional" style="color:#333333;font-family:Verdana;font-size:8pt;width:790px;border-collapse:collapse;"> <tr style="color:Red;font-family:verdana;font-size:10pt;"> <td colspan="9">Nenhum registro encontrado.</td> </tr> </table> </div> <br /> <br /> </div> </td> </tr> <tr> <td colspan="2" style="height: 29px; background-color: #ffff92; text-align: center"> <span style="font-size: 8pt; color: firebrick; font-family: Verdana"><strong> <hr style="width: 790px" /> <span style="color: firebrick">CFC/DEINF - Departamento de Informática</span></strong></span></td> </tr> </table> </div> <script>_20d372f0c34740b2ae81fb5d201835ad = new Mask("###.###.###-##");_20d372f0c34740b2ae81fb5d201835ad.attach(document.getElementById('ContentPlaceHolder1_tbxCPF'));</script></form> </body> </html>
I keep receiving this error:
---------------------------------------------------------------------------
NoSuchElementException Traceback (most recent call last)
<ipython-input-48-eb337bf8471d> in <module>
     19
     20 results = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[id*='ContentPlaceHolder1_gvwProfissional'] > tbody > tr")))
---> 21 resultado_pesquisa = results[0].text.strip() if "ContentPlaceHolder1_gvwProfissional" in results[0].get_attribute("class") else results[0].find_element_by_xpath("./td[1]").text.strip()
     22
     23 driver.find_element_by_id("ContentPlaceHolder1_tbxCPF").clear()
c:\python37\lib\site-packages\selenium\webdriver\remote\webelement.py in find_element_by_xpath(self, xpath)
    349 element = element.find_element_by_xpath('//div/td[1]')
    350 """
--> 351 return self.find_element(by=By.XPATH, value=xpath)
    352
    353 def find_elements_by_xpath(self, xpath):
c:\python37\lib\site-packages\selenium\webdriver\remote\webelement.py in find_element(self, by, value)
    657
    658 return self._execute(Command.FIND_CHILD_ELEMENT,
--> 659 {"using": by, "value": value})['value']
    660
    661 def find_elements(self, by=By.ID, value=None):
c:\python37\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    631 params = {}
    632 params['id'] = self._id
--> 633 return self._parent.execute(command, params)
    634
    635 def find_element(self, by=By.ID, value=None):
c:\python37\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319 response = self.command_executor.execute(driver_command, params)
    320 if response:
--> 321 self.error_handler.check_response(response)
    322 response['value'] = self._unwrap_value(
    323 response.get('value', None))
c:\python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240 alert_text = value['alert'].get('text')
    241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
    243
    244 def _value_or_default(self, obj, key, default):
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"./td[1]"} (Session info: chrome=77.0.3865.90)
Change your code to check for the empty-result message explicitly, as below:
results = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[id*='ContentPlaceHolder1_gvwProfissional'] > tbody > tr")))
resultado_pesquisa = "Nenhum registro encontrado." if "Nenhum registro encontrado." in results[0].text else results[1].find_element_by_xpath("./td[1]").text.strip()
To test the non-empty case, please share a value that can be entered in the CPF field.
Looking at the table, the data you are trying to get is in the second row, not the first (the first row is the header). When nothing is found, the table has a single row that holds only the message. Try this one:
results = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[id*='ContentPlaceHolder1_gvwProfissional'] > tbody > tr")))
if len(results) == 1:
    # only the "Nenhum registro encontrado." row is present
    resultado_pesquisa = results[0].text.strip()
else:
    # row 0 is the header, row 1 is the first data row
    resultado_pesquisa = results[1].find_element_by_xpath("./td[1]").text.strip()
print(resultado_pesquisa)
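The row logic in the answers above can be checked without a browser. The following is only a minimal sketch, assuming the page source is already available as a string (for example from driver.page_source) and using the standard library's html.parser instead of Selenium. The names TableRows, first_result, FOUND and EMPTY are made up for this illustration, and the two sample strings are trimmed copies of the tables in the question.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of every row of the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0       # 1 while inside the target table, >1 in nested tables
        self.rows = []       # one list of cell strings per <tr>
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.rows.append([])
        elif self.depth == 1 and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.depth == 1:
            self.rows[-1][-1] += data


def first_result(html, table_id="ContentPlaceHolder1_gvwProfissional"):
    """Return the first data cell, or the lone message when nothing was found."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return None
    if len(rows) == 1:
        # Only the "Nenhum registro encontrado." row is present.
        return rows[0][0].strip()
    # Row 0 is the header; row 1 is the first data row.
    return rows[1][0].strip()


# Trimmed samples based on the two pages shown in the question.
FOUND = ('<table id="ContentPlaceHolder1_gvwProfissional">'
         '<tr><th scope="col">Nº CNAI</th><th scope="col">Nome</th></tr>'
         '<tr><td>5433</td><td>ADRIEL PAUL</td></tr></table>')
EMPTY = ('<table id="ContentPlaceHolder1_gvwProfissional">'
         '<tr><td colspan="9">Nenhum registro encontrado.</td></tr></table>')

print(first_result(FOUND))  # 5433
print(first_result(EMPTY))  # Nenhum registro encontrado.
```

Counting rows keeps the check independent of the exact wording of the message, at the cost of assuming that a successful search always renders a header row first.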
Selenium Python - How to click a span with given text
I need to click a row in a table, but I'm unable to do so. To be more specific, I need to click ALL_USA, which you can see in the code below. My HTML code is:
<div id="table" arid="1" arwindowid="0" style="height: 299px; width: 638px;"> <div class="TableHdr" style="visibility: hidden; display: none; width: 638px;"> <div class="TableInner" style="top: 0px; height: 277px; width: 638px;"> <div class="BaseTableOuter" draghandler="BaseTable_DragHandler" style="height: 275px; width: 636px;"> <div class="BaseTableColHeaders" style="width: 636px; left: 0px;"> <div class="BaseTableInner" style="top: 16px; height: 259px; width: 636px; overflow-y: auto; overflow-x: hidden;"> <table id="T1" class="BaseTable" title="" style="width: 2px;"> <colgroup cols="1"> <tbody> <tr class="hiddentablehdr"> <tr class="" tabindex="0" arrow="0"> <tr tabindex="0" arrow="1"> <tr tabindex="0" arrow="2"> <tr tabindex="0" arrow="3"> <tr class="SelPrimary" tabindex="0" arrow="4"> <td class="BaseTableCellOdd BaseTableCellOddColor BaseTableStaticText" "scope="row" style="width: 636px;"> <nobr class="dp " style="text-align: left; width: 636px;"> <span style="padding: 1px 4px;float:left;">ALL_USA</span> </nobr> </td> </tr> <tr tabindex="0" arrow="5">
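You can use an XPath expression that matches the span by its text:
driver.find_element_by_xpath("//span[text()='ALL_USA']").click()
In recent Selenium 4 releases, where the find_element_by_* helpers have been removed, the equivalent call is driver.find_element(By.XPATH, "//span[text()='ALL_USA']").click(), with By imported from selenium.webdriver.common.by.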
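To see what a text-based match selects without starting a browser, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for the driver. It assumes a trimmed, well-formed version of the table from the question (the name FRAGMENT and the extra OTHER row are made up for the illustration). ElementTree supports only a subset of XPath, so the text test is written as [.='ALL_USA'] (Python 3.7 or later) rather than [text()='ALL_USA'].

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the table in the question.
FRAGMENT = """
<table id="T1" class="BaseTable">
  <tbody>
    <tr tabindex="0" arrow="3"><td><nobr><span>OTHER</span></nobr></td></tr>
    <tr class="SelPrimary" tabindex="0" arrow="4">
      <td><nobr class="dp "><span style="padding: 1px 4px;float:left;">ALL_USA</span></nobr></td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(FRAGMENT.strip())

# Every <span>, at any depth, whose complete text is exactly ALL_USA.
matches = root.findall(".//span[.='ALL_USA']")
print(len(matches), matches[0].text)  # 1 ALL_USA
```

The match is exact, so a span containing "ALL_USA " with a trailing space or any extra text would not be selected; the same holds for text()='ALL_USA' in the browser.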
If the text ALL_USA is subject to change, you can target the span through the selected row's class instead:
driver.find_element_by_css_selector("table.BaseTable tr.SelPrimary td span").click()
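The structural selector can be tried offline in the same spirit. This is a rough sketch only: it uses xml.etree.ElementTree on a made-up, well-formed reduction of the question's markup (PAGE and the OTHER row are hypothetical) and rewrites the CSS selector as an ElementTree path. One difference to keep in mind is that @class='SelPrimary' compares the whole attribute value, whereas the CSS form tr.SelPrimary also matches rows that carry additional classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the markup in the question.
PAGE = """
<div id="table">
  <table id="T1" class="BaseTable">
    <tbody>
      <tr tabindex="0" arrow="3"><td><nobr><span>OTHER</span></nobr></td></tr>
      <tr class="SelPrimary" tabindex="0" arrow="4">
        <td><nobr class="dp "><span>ALL_USA</span></nobr></td>
      </tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(PAGE.strip())

# Rough equivalent of the CSS selector "table.BaseTable tr.SelPrimary td span".
path = ".//table[@class='BaseTable']//tr[@class='SelPrimary']//td//span"
hits = root.findall(path)
print([span.text for span in hits])  # ['ALL_USA']
```

Because the path never mentions the text, it keeps working if the label changes, but it depends on the row still carrying the SelPrimary class at the moment of the lookup.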
Extract table from html file using python
I want to extract a table from an HTML file. I have written the following code snippet to extract the first table:
import urllib2
import os
import time
import traceback
from bs4 import BeautifulSoup
#find('table',{'class':'tbl_with_brdr'})
outfile= open('D:/Dropbox/Python/apelec.txt','wb')
rfile = open('D:/Dropbox/PRI/Data/AP/195778.html')
rsoup = BeautifulSoup(rfile)
nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr')
for node in nodes[1:]:
    x = node.find('th').find('b').get_text().encode("utf-8")
    print x
    y = node.find('th').findNext('th').find('b').get_text().encode("utf-8")
    print y
    outfile.write(str(x)+"\t"+str(y)+"\n")
outfile.close()
Here is the error:
      9 rfile = open('D:/Dropbox/PRI/Data/AP/195778.html')
     10 rsoup = BeautifulSoup(rfile)
---> 11 nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr')
     12 for node in nodes[1:]:
     13 x = node.find('th').find('b').get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'find'
And the HTML file is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <link rel="icon" type="image/ico" href="images/favicon.ico"/> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <link rel="stylesheet" href="themes/panchayat_default.css" type="text/css"/> <title>consolidated Election Report</title> </head> <body> <!-- To blur the background while processing dwr --> <div class="faded_div process"></div> <div class="popup_block_div process" style="display: none;"> <img alt="" src="images/loading_animation.gif" style="margin-left: auto; margin-right: auto;"> </div> <div id="maincontainer" class="resize"> <div id="headerwrap"> <!-- Header --> <html> <head> <script type='text/javascript' src="/profilerdwr/engine.js"> </script> <script type='text/javascript' src="/profilerdwr/util.js"> </script> <script type="text/javascript" src="/profilerdwr/interface/lgdDao.js"></script> <script type="text/javascript" 
src="js/common_util_js.js"></script> <link rel="stylesheet" href="css/common_css.css" type="text/css"></link> <meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /> </head> <body > <div class="clear"></div> <div id="headerwrap"> <div id="header"> <div id="new_header"> <div id="logoleft">Area Profiler</div> <div id="logoright"></div> <div class="clear"></div> </div> <div class="clear"></div> <div id="loginnav" align="right"> <table width="100%" class="tbl_no_brdr"> <tr> <td class="tblclear" align="left"> <div id="mainnav">Home </div> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="topnav"> <table width="100%" class="tbl_no_brdr"> <tr> <td width="85" class="tblclear">Choose Theme :</td> <td width="200" class="tblclear"> <form id="themeForm" name="themeForm" method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select name="theme" id="themeId" class="combofield" onchange="submitThemeForm()" style="width: 120px;"> <option value="default">Default Theme</option> <option value="mustard">Mustard Theme</option> <option value="peach">Peach Theme</option> <option value="green">Green Theme</option> <option value="blue">Blue Theme</option> </select> </form> </td> <td style="padding: 0px"> </td> <td class="tblclear"> </td> <td width="14" class="tblclear txticon"><img src="images/btnMinus.jpg" width="16" height="14" border="0" /></div></td> <td width="14" class="tblclear txticon"><img src="images/btnDefault.jpg" width="16" height="14" border="0" /> </td> <td width="28" class="tblclear txticon"><img src="images/btnPlus.jpg" width="16" height="14" border="0" /></td> <script type="text/javascript" > //documenttextsizer.setup("shared_css_class_of_toggler_controls") documenttextsizer.setup("texttoggler") </script> <td width="100" align="right" class="tblclear">Select Language :</td> <td width="108" align="right" class="tblclear"> <form id="languageForm" name="languageForm" 
method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select id="languageId" name="language" class="combofield" style="width: 120px;" onchange="submitLanguageForm()" > <option value=""> Select Language </option> </select> </form> </td> </tr> </table> </div> <div id="breadcrumbnav"> </div> </div> <script type="text/javascript"> function submitThemeForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('themeForm').submit(); } else { return; } } function submitLanguageForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('languageForm').submit(); } else { return; } } </script> </body> </html> </div> <div class="clear"></div> <div id="content"> <div id="leftpnl"> <table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td width="100%" valign="top" class="tblclear"> <!-- content -->. 
<script type="text/javascript" src="js/common_js.js"></script> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <script type="text/javascript"> var pathname; $(document).ready(function() {pathname = window.location.pathname;}); function onBack(s) { var position =pathname.indexOf("/", 2); var newPath = ""; var val = s.indexOf("?", 1); if(val>0) { newPath = s+"&redirect=true"; } else { newPath = s+"?redirect=true"; } window.location.replace(".."+pathname.substring(0,position)+"/"+newPath); } function downloadReport(repformat){ //window.location="downloadConsolidatedElectionReportPDF.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; //document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?repformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?reportformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].method="POST"; document.getElementById('electionReportForm').target="_blank"; document.forms["electionReportForm"].submit(); } </script> <style type="text/css"> .data_link{ color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .disable_link { cursor:default; color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:VISITED { color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:HOVER{ text-decoration: underline; } </style> </head> <body> <div id="frmcontent"> <div class="frmhd"> <table width="100%" class="tbl_no_brdr"> <tr> <td align="left" width="90%"> Consolidated Election</td> </tr> </table> </div> <div class="clear"></div> <div class="frmpnlbrdr"> <div 
class="frmpnlbg"> <div class="frmtxt"> <table width="100%" style="margin-bottom: 10px;" class="tbl_with_brdr"> <tr class="tblRowTitle tblclear" > <th align="left" ><b>State Name</b></th> <th align="left" ><b>Local Body Type</b></th> <th align="left" ><b>Election Term</b></th> <th align="left" ><b>Local Body Name</b></th> </tr> <tr class="tblRowB" style="color: blue;"> <th align="left" >ANDHRA PRADESH</th> <th align="left" >Village Panchayat</th> <th align="left" > 02-Aug-2013 To 01-Aug-2018 </th> <th align="left" >KODIHALLI</th> </tr> </table> <div class="frmhdtitle">Consolidated Election</div> <table width="100%" class="tbl_with_brdr"> <thead> <tr class="tblRowTitle tblclear"> <th align="center" width="5%" ><b>S.No.</b></th> <th align="left" width="9%"><b>Name</b></th> 0 <th align="left" width="9%"><b>Age</b></th> 1 <th align="left" width="9%"><b>Caste Category</b></th> 2 <th align="left" width="9%"><b>Gender</b></th> 3 <th align="left" width="9%"><b>Qualification</b></th> 4 <th align="left" width="9%"><b>Occupation</b></th> 5 <th align="left" width="9%"><b>Email Address</b></th> 6 <th align="left" width="9%"><b>Ward Name</b></th> 7 <th align="left" width="9%"><b>Reservation</b></th> 8 </tr> </thead> <tbody> <tr class="tblRowB"> <td align="center" >1</td> <td>Kambanna</td> <td>36</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>N/A</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >2</td> <td>Ramesh</td> <td>39</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 1</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >3</td> <td>S.Manjunath</td> <td>29</td> <td>OBC</td> <td>Male</td> <td>Higher Secondary or Intermediate or Pre University or Senior Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 2</td> <td > No (General / Others) 
</td> </tr> <tr class="tblRowA"> <td align="center" >4</td> <td>Obuleshu</td> <td>48</td> <td>OBC</td> <td>Male</td> <td>Below Primary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 3</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >5</td> <td>Mamatha</td> <td>24</td> <td>OBC</td> <td>Female</td> <td>Matriculation or Junior School Certificate or Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 4</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >6</td> <td>Shivamma</td> <td>38</td> <td>OBC</td> <td>Female</td> <td>Below Primary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 5</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >7</td> <td>Hanumantappa</td> <td>46</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 6</td> <td > No (General / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >8</td> <td>Malingappa</td> <td>45</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 7</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >9</td> <td>Kamalamma</td> <td>52</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 8</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >10</td> <td>Muddamma</td> <td>48</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 9</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >11</td> <td>Patta Tayamma</td> <td>45</td> <td>SC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 10</td> <td > Yes (SC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >12</td> <td>Sujatha</td> <td>35</td> <td>OBC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> 
<td>Ward no 11</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >13</td> <td>Kadurappa</td> <td>35</td> <td>SC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 12</td> <td > Yes (SC / Others) </td> </tr> </tbody> </table> <br /> <table width="100%" class="tbl_no_brdr"> <tr> <td align="center"> <input type="button" class="btn" onclick="onClose('welcome.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU')" value=Close /> <input type="button" class="btn" onclick="this.disabled=true; this.value='Please Wait .!';onBack('consolidatedElectionReport.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU&electionTermId=35107&stateId=28')" value=Back /> </td> </tr> </table> <form id="electionReportForm" name="electionReportForm" action="#" method="post"> <div align="center"><br/> <input type="button" class="btn" onclick="downloadReport('pdf');" value="Export to PDF" size="5" /> <input type="button" class="btn" onclick="downloadReport('xls');" value="Export to Excel" size="5" /> </div> </form> </div> <div class="myclass" style="font-family: Times; text-align: center; font-size: 10.0pt; color: white; font-weight: bold; border: 1px solid gray"> Report generated through Area Profiler (http://areaprofiler.gov.in)Thu Oct 02 22:34:20 IST 2014 </div> </div> </div> </div> </body> </html> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="footer"> <!-- Footer --> <html> <head> </head> <body> <table width="100%" class="tbl_no_brdr"> <tr> <td colspan="3" class="fotbrdr"></td> </tr> <tr> <td width="161" class="btmlogospace"><a href="http://www.negp.gov.in/" target= "_blank" ><img src="images/e_governance_logo.jpg" width="161" height="38" /></a></td> <td width="93" class="btmlogospace"><a href="http://www.panchayat.gov.in/" target= "_blank" ><img src="images/panchayatilogo.jpg" width="93" height="38" /></a></td> <td align="right" class="btmlogospace">Site is designed, hosted and 
maintained by National Informatics Centre<br /> Contents on this website is owned,updated and managed by the Ministry of Panchayati Raj</td> </tr> </table> </body> </html> </div> </div> </body> </html>
Here is an approach. It is not the exact solution, but you can use it as a guide. You have to traverse the DOM tree and extract the values you want. I changed the class of the div you look for from frmtext to frmtxt, which is the class that actually appears in the HTML, and during the traversal you have to check whether each lookup found anything before calling a method on the result.
import urllib2
import os
import time
import traceback
from bs4 import BeautifulSoup

outfile = open('out.txt', 'wb')
rfile = open('195778.html')
rsoup = BeautifulSoup(rfile)
nodes1 = rsoup.find('div', {'class': 'frmtxt'})
nodes = nodes1.find('table').find_all('tr')
for node in nodes:
    # first <th> of the row: take the text of its <b>, if there is one
    a = node.find('th')
    x = None
    if a != None:
        x1 = a.find('b')  # search inside the <th> that was found
        if x1 != None:
            x2 = x1.get_text().encode("utf-8")
            print x2
            x = x2
    # second <th> of the row, handled the same way
    y = node.find('th')
    if y != None:
        print 'y', y
        y2 = y.findNext('th')
        if y2 != None:
            print 'y2', y2
            y3 = y2.find('b')
            if y3 != None:
                y = y3.get_text().encode("utf-8")
                print y
    outfile.write(str(x)+"\t"+str(y)+"\n")
outfile.close()
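Since the original error came from a class name that does not exist in the file, it can help to list the class names that are really present before chaining find() calls. The following is a minimal sketch in Python 3 that uses only the standard library's html.parser rather than BeautifulSoup. ClassCollector, classes_of and SAMPLE are names invented for this illustration, and SAMPLE is a trimmed piece of the HTML from the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Gather every class name used on a given tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get("class")
            if value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_of(html, tag="div"):
    """Return the set of class names found on `tag` elements in `html`."""
    collector = ClassCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.classes


# Trimmed piece of the page from the question.
SAMPLE = ('<div class="frmpnlbrdr"> <div class="frmpnlbg"> <div class="frmtxt"> '
          '<table width="100%" class="tbl_with_brdr"></table> </div> </div> </div>')

found = classes_of(SAMPLE)
print("frmtxt" in found, "frmtext" in found)  # True False
```

Running classes_of on the real file and looking for the expected name shows immediately whether find('div', {'class': ...}) can succeed or will return None.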