Selecting Current Day in Selenium - Python
I'm trying to open a calendar and select only the current day in that calendar (for the current month, using Selenium).
So far, I have this:
# Click on the calendar field to open the calendar
self.search_field = self.driver.find_element_by_xpath("//form[@id='searchform']/div/div/div[2]/div/input")
self.search_field.click()
# Get current day
from datetime import date
current_date = date.today()
today_day = current_date.day
print(today_day)
Now that I've got the "current day," how would I select that "current day" within the given calendar?
[Edit]
Calendar HTML
<div id="ui-datepicker-div" class="ui-datepicker ui-widget ui-widget-content ui-helper-clearfix ui-corner-all" style="position: absolute; top: 151px; left: 222.5px; z-index: 1; display: block;">
<div class="ui-datepicker-header ui-widget-header ui-helper-clearfix ui-corner-all">
<table class="ui-datepicker-calendar">
<thead>
<tbody>
<tr>
<tr>
<tr>
<tr>
<td class=" ui-datepicker-week-end " data-year="2015" data-month="10" data-event="click" data-handler="selectDay">
<td class=" " data-year="2015" data-month="10" data-event="click" data-handler="selectDay">
<td class=" ui-datepicker-days-cell-over ui-datepicker-today" data-year="2015" data-month="10" data-event="click" data-handler="selectDay">
<a class="ui-state-default ui-state-highlight" href="#">24</a>
</td>
<td class=" ui-datepicker-unselectable ui-state-disabled ">
<td class=" ui-datepicker-unselectable ui-state-disabled ">
<td class=" ui-datepicker-unselectable ui-state-disabled ">
<td class=" ui-datepicker-week-end ui-datepicker-unselectable ui-state-disabled ">
</tr>
<tr>
</tbody>
</table>
</div>
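One possible approach, sketched under assumptions: in the HTML above, the cell for the current day carries the class ui-datepicker-today, so it may be enough to click the link inside that cell rather than search for the day number. The helper names below (TODAY_CSS, day_link_xpath, click_today) are hypothetical and not part of Selenium; the 10-second wait is an arbitrary choice. The XPath variant matches by day number instead and skips disabled cells, which may still be ambiguous if the widget shows days from adjacent months.

```python
from datetime import date

CALENDAR_ID = "ui-datepicker-div"  # id taken from the calendar HTML in the question

# The cell for the current day carries the class "ui-datepicker-today".
TODAY_CSS = "#{} td.ui-datepicker-today a".format(CALENDAR_ID)


def day_link_xpath(day):
    """Build an XPath for the enabled link whose text is the given day number."""
    return (
        "//div[@id='{0}']"
        "//td[not(contains(@class, 'ui-state-disabled'))]"
        "/a[normalize-space()='{1}']"
    ).format(CALENDAR_ID, day)


def click_today(driver):
    """Click today's cell; `driver` is an already-open Selenium WebDriver."""
    # Imported here so the locator helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, 10)
    # Wait until the highlighted cell is clickable, then click it.
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, TODAY_CSS))).click()


print(TODAY_CSS)
print(day_link_xpath(date.today().day))
```

In the test class from the question, this would be called as click_today(self.driver) right after self.search_field.click().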
Related
Returning None when scraping href using Python
Hi I'm trying to scrape 151 Heavy Duty Rubber Gloves - Ex Large from table with following inspect script. Can someone please help with the right Python script? [<table border="0" class="ProductBox" id="Added0"> <tr> <td align="center" colspan="2"> <div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div> </td></tr><tr> <td align="center" colspan="2" height="60px;" valign="top"> <div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div> </td> </tr> <tr> <td align="center" colspan="2" height="185"> <a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;"> <img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a> </td> </tr> <tr> <td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td> <td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> </td> </tr> <tr> <td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> POR 0% </td> <td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> VAT 20% </td> </tr> <tr> <td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;"> <a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;"> **151 Heavy Duty Rubber Gloves - Ex Large**</a></td> </tr> <tr> <td class="ProdDetails" colspan="1" 
style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> 1s x 1 </td> <td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;"> <div class="tooltip"> <div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;"> <span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div> </div> <span class="OKStatus">In Stock </span> </td> </tr> <tr> <td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> <table style="margin-top : 10px;" width="100%"> <tr> <td> <img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/> </td> <td> <input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/> </td> <td> <img align="middle" alt="Add 1 To Qty" src="/images/add.png"/> </td> <td align="right"> <button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button> </td> </tr> </table>
I tried to use the following:
r = s.get(url) soup = BeautifulSoup(r.text, 'lxml') table = soup.find_all('table') for i in table: links = [link.get('href') for link in i.find_all('a')] print(links)
which unfortunately returns:
['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']
You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
from bs4 import BeautifulSoup
from typing import Optional

DATA = """<table border="0" class="ProductBox" id="Added0"> <tr> ... </table>"""

def extract_name(data: str) -> Optional[str]:
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        return links[0].text.strip().strip("*").strip()
    else:
        return None

print(extract_name(DATA))

# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
# extract_name expects markup as a string, so convert the Tag first
text = extract_name(str(tables[0]))
Output:
151 Heavy Duty Rubber Gloves - Ex Large
I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project
I have this HTML code for example: <table class="nested4"> <tr> <td colspan="1"></td> <td colspan="2"> <h2 class="zeroMargin" id="govtMsg" visible="false"></h2> </td> <td colspan="2"> <h2 class="zeroMargin "> Net Metering Conn. </h2> </td> <td colspan="2"> <h2 class="zeroMargin" hidden> Life Line Consumer</h2> </td> </tr> <tr> <td colspan="2"> <p style="margin: 0; text-align: left; padding-left: 5px"> <span>NAME & ADDRESS</span> <br /> <span>MUHAMMAD AMIN </span> <br /> <span>S/O MUHAMMAD KHAN </span> <br /> <span>H-NO.38 MARGALLA ROAD </span> <br /> <span>F-6/3 ISLAMABAD3 </span> <br /> <span></span> </p> </td> <td colspan="3" style="text-align: left"> <h2 class="color-red">Say No To Corruption</h2> <span style="font-size: 8pt; color: #78578e"> MCO Date : 10-Aug-2018</span> <br /> </td> <td> <h3 style="font-size: 14pt;"> </h3> <h2> <br /> </h2> </td> </tr> <tr> <td style="margin-top: 0;" class="border-b"> <br /> </td> <td colspan="1" style="margin-top: 0;" class="border-b"> </td> <td colspan="1" style="margin-top: 0;" class="border-b"> </td> </tr> <tr style="height: 7%;" class="border-tb"> <td style="width: 130px" class="border-r"> <h4>METER NO</h4> </td> <td style="width: 90px" class="border-r"> <h4>PREVIOUS READING</h4> </td> <td style="width: 90px" class="border-r"> <h4>PRESENT READING</h4> </td> <td style="width: 60px" class="border-r"> <h4>MF</h4> </td> <td style="width: 60px" class="border-r"> <h4>UNITS</h4> </td> <td> <h4>STATUS</h4> </td> </tr> <tr style="height: 30px" class="content"> <td class="border-r"> 3-P I 3301539<br> I 3301539<br> E 3301539<br> E 3301539<br> </td> <td class="border-r"> 78693<br>16823<br>19740<br>8<br> </td> <td class="border-r"> 80086<br>17210<br>20139<br>8<br> </td> <td class="border-r"> 1<br>1<br>1<br>1<br> </td> <td class="border-r"> 1393<br>387<br>399<br>0<br> </td> <td> </td> </tr> <tr id="roshniMsg" style="height: 30px" class="content"> <td colspan="6"> <div style="width: 452pt"> <img style="max-width: 100%; max-height: 35%" 
src="/images/companies/iesco/roshniMsg.jpg" alt="Roshni Message" /> </div> </td> </tr> </table>
From this table I want to extract the paragraph, and from there I want to get all the span tags in that paragraph. I used soup.find_all() to get the table, but I don't know how to apply this function iteratively to the result so that I can find the paragraph and then the span tags inside it. This is the Python code I wrote:
soup = BeautifulSoup(string, 'html.parser')
# Getting the table tag
results = soup.find_all('table', attrs={'class':'nested4'})
# Getting the paragraph tag
results = soup.find_all('p', attrs={'style':'margin: 0; text-align: left; padding-left: 5px'})
# Getting all the span tags
results = soup.find_all('span', attrs={})
I just want help on how to get the paragraphs within the table, and then how to get the spans within the paragraph, because right now I am getting the spans from the whole original HTML. I don't know how to pass the bs4 result list back to the soup object to use soup.find_all iteratively.
from bs4 import BeautifulSoup

html = ''' <table class="nested4"> <tr> <td colspan="1"></td> <td colspan="2"> <h2 class="zeroMargin" id="govtMsg" visible="false"></h2> </td> <td colspan="2"> <h2 class="zeroMargin "> Net Metering Conn. </h2> </td> <td colspan="2"> <h2 class="zeroMargin" hidden> Life Line Consumer</h2> </td> </tr> <tr> <td colspan="2"> <p style="margin: 0; text-align: left; padding-left: 5px"> <span>NAME & ADDRESS</span> <br /> <span>MUHAMMAD AMIN </span> <br /> <span>S/O MUHAMMAD KHAN </span> <br /> <span>H-NO.38 MARGALLA ROAD </span> <br /> <span>F-6/3 ISLAMABAD3 </span> <br /> <span></span> </p> </td> <td colspan="3" style="text-align: left"> <h2 class="color-red">Say No To Corruption</h2> '''
soup = BeautifulSoup(html, 'html.parser')
spans = soup.select_one('table.nested4').select('span')
for span in spans:
    print(span.text)
This returns:
NAME & ADDRESS
MUHAMMAD AMIN
S/O MUHAMMAD KHAN
H-NO.38 MARGALLA ROAD
F-6/3 ISLAMABAD3
If you have one table:
soup = BeautifulSoup(string, 'html.parser')
table = soup.find('table', attrs={'class': 'nested4'})
p = table.find('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'})
results = p.find_all('span')
for result in results:
    print(result.get_text(strip=True))
If you have a list of tables:
soup = BeautifulSoup(string, 'html.parser')
for table in soup.find_all('table', attrs={'class': 'nested4'}):
    for p in table.find_all('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'}):
        for span in p.find_all('span'):
            print(span.get_text(strip=True))
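The nesting can also be written as a single CSS descendant selector, which may be easier to read than chained find calls. The sketch below assumes a trimmed-down copy of the bill HTML (the HTML constant) and a hypothetical helper named spans_in_table; it also skips empty spans, which the question did not ask for explicitly.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the bill HTML in the question.
HTML = """
<span>outside the table</span>
<table class="nested4">
  <tr><td>
    <p style="margin: 0; text-align: left; padding-left: 5px">
      <span>NAME &amp; ADDRESS</span><br />
      <span>MUHAMMAD AMIN </span><br />
      <span></span>
    </p>
  </td></tr>
</table>
"""


def spans_in_table(markup):
    """Return the text of every non-empty span inside a p inside table.nested4."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    # "table.nested4 p span" only matches spans nested under that table and a p,
    # so the span outside the table is ignored.
    for span in soup.select("table.nested4 p span"):
        text = span.get_text(strip=True)
        if text:
            texts.append(text)
    return texts


result = spans_in_table(HTML)
print(result)
```

The key point is that every Tag returned by find, find_all or select supports the same search methods as the soup object, so a search can always be continued from a result instead of going back to soup.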
Problem extracting text of td from table row (tr) with scrapy
I am parsing data table from the following URL: https://www.signalstart.com/search-signals In particular, I am trying to extract the data from the table rows. The table row has a series of table-data cells: <table class="table table-striped table-bordered dataTable table-hover" id="searchSignalsTable"> <thead> <tr> <th class="sorting sorting_asc">Rank</th> <th class="sorting ">Name</th> <th class="sorting ">Gain</th> <th class="sorting ">Pips</th> <th class="sorting ">DD</th> <th class="sorting ">Trades</th> <th class="sorting ">Type</th> <th>Monthly</th> <th>Chart</th> <th class="sorting ">Price</th> <th class="sorting " style="width: 40px">Age</th> <th class="sorting " style="width: 70px">Added</th> <th>Action</th> </tr> </thead> <tbody> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/joker-1k/110059">Joker 1k</a> </td> <td><span class="red">-9.99%</span></td> <td><span class="green">2,092.3</span></td> <td>15.3%</td> <td>108</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark110059"><canvas width="12" height="25" style="display: inline-block; vertical-align: top; width: 12px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark110059"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 1m 24d </td> <td> Mar 29, 2020 </td> <td><a onclick="getMasterPricingData('110059');" data-toggle="modal"><button id="subscribeToMasterBtn110059" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="110059" value="-1.78,-3.68,-4.86"> <input type="hidden" class="dailyGrowthData" oid="110059" value="0.00,-0.03,-1.78,-5.69,-6.75,-5.59,-7.61,-5.31,-6.20,-3.81,-4.40,-8.00,-2.88,-3.78,-4.38,-0.20,-5.40,-10.66,-13.69,-12.51,-13.23,-9.99"> <input type="hidden" 
class="dailyEquityData" oid="110059" value="0.00,-0.23,-1.41,-5.02,-6.25,-4.29,-6.68,-3.91,-5.37,-4.10,-4.40,-3.59,-1.78,-1.75,-2.65,-0.21,-4.87,-10.76,-13.90,-11.58,-13.23,-10.18"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/fxabakus/56043">FXabakus</a> </td> <td><span class="red">-19.57%</span></td> <td><span class="red">-8,615.2</span></td> <td>42%</td> <td>1642</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark56043"><canvas width="80" height="25" style="display: inline-block; vertical-align: top; width: 80px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark56043"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 1y 7m </td> <td> May 4, 2019 </td> <td><a onclick="getMasterPricingData('56043');" data-toggle="modal"><button id="subscribeToMasterBtn56043" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="56043" value="1.22,1.35,3.92,1.35,-1.57,1.77,2.01,1.11,0.38,-14.89,-14.70,-5.21,5.97,7.03,-17.54,2.92,3.11,-8.94,13.38,1.77"> <input type="hidden" class="dailyGrowthData" oid="56043" value="-27.87,-29.29,-29.01,-26.76,-25.76,-25.59,-30.57,-30.13,-29.78,-29.60,-29.25,-28.34,-28.07,-27.89,-25.20,-25.08,-23.66,-23.46,-21.54,-21.02,-21.62,-20.28,-18.31,-26.97,-27.48,-27.00,-28.21,-24.20,-23.46,-30.04,-31.37,-34.62,-33.84,-32.87,-32.20,-30.99,-30.43,-30.30,-29.75,-27.64,-27.45,-24.34,-24.71,-24.09,-24.15,-21.48,-21.08,-20.97,-19.54,-19.57"> <input type="hidden" class="dailyEquityData" oid="56043" 
value="-27.87,-29.29,-28.89,-26.76,-25.76,-28.10,-34.47,-32.34,-31.54,-40.80,-32.76,-32.90,-33.50,-30.65,-25.37,-25.05,-22.88,-23.29,-21.54,-21.02,-21.54,-20.90,-19.11,-27.76,-35.15,-29.17,-27.79,-24.20,-26.23,-34.32,-35.95,-51.20,-33.84,-32.76,-32.71,-31.62,-30.43,-39.93,-29.75,-27.64,-28.35,-27.62,-28.41,-24.20,-24.51,-22.06,-21.08,-20.97,-18.82,-30.27"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/af-investing-pro-final/122603">AF Investing Pro Final</a> </td> <td><span class="green">56.69%</span></td> <td><span class="green">29,812</span></td> <td>8.6%</td> <td>476</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark122603"><canvas width="8" height="25" style="display: inline-block; vertical-align: top; width: 8px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark122603"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$250</td> <td> 17d 12h </td> <td> Apr 30, 2020 </td> <td><a onclick="getMasterPricingData('122603');" data-toggle="modal"><button id="subscribeToMasterBtn122603" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="122603" value="55.18,0.98"> <input type="hidden" class="dailyGrowthData" oid="122603" value="-0.02,0.04,54.78,55.02,55.18,55.82,55.86,55.99,56.06,56.25,56.69"> <input type="hidden" class="dailyEquityData" oid="122603" value="-8.60,16.85,54.86,54.11,55.44,55.85,54.38,52.15,45.00,51.07,56.25"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/rapid-growth/111340">Rapid growth</a> </td> <td><span class="green">130.78%</span></td> <td><span class="green">1,102.9</span></td> <td>44.3%</td> <td>126</td> <td>Real</td> 
<td><span class="monthlySparkline" id="monthlySpark111340"><canvas width="12" height="25" style="display: inline-block; vertical-align: top; width: 12px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark111340"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$31</td> <td> 2m 8d </td> <td> Apr 1, 2020 </td> <td><a onclick="getMasterPricingData('111340');" data-toggle="modal"><button id="subscribeToMasterBtn111340" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="111340" value="87.85,18.28,3.87"> <input type="hidden" class="dailyGrowthData" oid="111340" value="0.00,0.64,1.40,1.40,1.90,2.91,7.53,8.21,11.19,11.30,17.60,19.60,23.03,37.74,47.75,54.75,59.91,69.79,73.60,79.36,87.85,93.14,93.40,94.70,95.93,96.01,99.95,100.71,101.85,102.10,102.12,104.36,108.76,110.11,110.14,110.23,112.58,115.10,115.54,117.17,121.24,122.19,123.40,124.18,124.88,124.89,130.09,130.78"> <input type="hidden" class="dailyEquityData" oid="111340" value="-1.80,0.67,0.97,1.91,-0.64,2.58,6.82,6.72,8.65,8.46,16.29,17.71,19.96,34.10,47.24,51.91,59.07,69.79,73.58,79.26,88.01,91.03,93.43,87.85,96.19,95.80,100.29,95.63,98.94,101.71,98.33,104.12,108.26,108.46,86.24,108.42,112.83,114.51,94.42,116.29,120.16,121.93,123.05,115.67,122.81,124.45,130.47,130.14"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/dream-presentation-1/66543">Dream Presentation 1</a> </td> <td><span class="red">-99.9%</span></td> <td><span class="red">-2,724.1</span></td> <td>99.9%</td> <td>1612</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark66543"><canvas width="28" height="25" style="display: inline-block; vertical-align: top; width: 28px; height: 25px;"></canvas></span></td> <td><span 
class="dayliSparkline" id="dayliSpark66543"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 6m 13d </td> <td> Nov 8, 2019 </td> <td><a onclick="getMasterPricingData('66543');" data-toggle="modal"><button id="subscribeToMasterBtn66543" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="66543" value="-100.14,-98.54,-98.79,-91.71,-98.23,-100.00,-88.82"> <input type="hidden" class="dailyGrowthData" oid="66543" value="24.18,-99.90,-99.89,-99.88,-99.88,-99.88,-99.87,-99.87,-99.86,-99.84,-99.83,-99.90,-99.89,-99.90,-99.90,-99.81,-99.81,-99.80,-99.90,-99.90,-99.86,-99.83,-99.79,-99.90,-99.90,-99.90,-99.88,-99.89,-99.89,-99.88,-99.82,-99.74,-99.85,-99.37,-99.88,-99.90,-99.90,-99.90,-99.90,-99.87,-99.83,-99.80,-99.75,-99.64,-99.56,-99.90,-99.90"> <input type="hidden" class="dailyEquityData" oid="66543" value="7.87,-99.90,-99.89,-99.88,-99.88,-99.88,-99.88,-99.87,-99.86,-99.84,-99.83,-99.90,-99.89,-99.90,-99.89,-99.83,-99.88,-99.88,-99.90,-99.90,-99.87,-99.83,-99.84,-99.72,-99.90,-99.90,-99.88,-99.89,-99.88,-99.92,-99.86,-99.74,-99.86,-99.39,-99.88,-99.90,-99.90,-99.90,-99.90,-99.87,-99.83,-99.79,-99.76,-99.63,-99.55,-100.16,-99.83"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/limerence-ea-suite-3/93679">Limerence EA Suite 3</a> </td> <td><span class="green">1,246.66%</span></td> <td><span class="green">199.8</span></td> <td>34.2%</td> <td>8</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark93679"><canvas width="20" height="25" style="display: inline-block; vertical-align: top; width: 20px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark93679"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; 
width: 100px; height: 25px;"></canvas></span></td> <td>$75</td> <td> 7m 11d </td> <td> Feb 11, 2020 </td> <td><a onclick="getMasterPricingData('93679');" data-toggle="modal"><button id="subscribeToMasterBtn93679" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="93679" value="95.40,82.01,94.38,87.49,3.90"> <input type="hidden" class="dailyGrowthData" oid="93679" value="0.00,95.40,255.64,591.28,552.49,1234.12,1196.10,1246.66"> <input type="hidden" class="dailyEquityData" oid="93679" value="0.00,95.40,255.64,591.28,1034.76,1234.12,1196.10,1246.66"> </div> </td> </tr> <tr> <td style="text-align: center;"> - </td> <td><a class="pointer" target="_blank" href="https://www.signalstart.com/analysis/easy-money/31727">Easy Money</a> </td> <td><span class="red">-99.9%</span></td> <td><span class="green">2,430.6</span></td> <td>100%</td> <td>1095</td> <td>Real</td> <td><span class="monthlySparkline" id="monthlySpark31727"><canvas width="96" height="25" style="display: inline-block; vertical-align: top; width: 96px; height: 25px;"></canvas></span></td> <td><span class="dayliSparkline" id="dayliSpark31727"><canvas width="100" height="25" style="display: inline-block; vertical-align: top; width: 100px; height: 25px;"></canvas></span></td> <td>$30</td> <td> 2y 2m </td> <td> Apr 1, 2018 </td> <td><a onclick="getMasterPricingData('31727');" data-toggle="modal"><button id="subscribeToMasterBtn31727" class="btn btn-circle btn-sm green" type="button">Copy</button></a> <div style="display: none;"> <input type="hidden" class="monthlyData" oid="31727" value="6.22,-6.15,22.04,-5.08,0.08,12.08,-69.31,-99.82,245.26,88.44,113.73,52.29,25.38,77.72,-29.07,-24.73,-86.48,-89.27,195.77,-7.65,-99.98,278.89,-69.98,-65.48"> <input type="hidden" class="dailyGrowthData" oid="31727" 
value="-99.66,-99.69,-99.72,-99.73,-99.77,-99.77,-99.78,-99.81,-99.90,-99.90,-99.89,-99.84,-99.83,-99.82,-99.81,-99.75,-99.78,-99.77,-99.79,-99.78,-99.77,-99.48,-99.46,-99.36,-99.34,-99.33,-99.33,-99.31,-99.33,-99.34,-99.40,-99.45,-99.33,-99.58,-99.65,-99.73,-99.71,-99.70,-99.68,-99.68,-99.69,-99.68,-99.71,-99.68,-99.80,-99.80,-99.77,-99.81,-99.84,-99.90"> <input type="hidden" class="dailyEquityData" oid="31727" value="-99.66,-99.69,-99.73,-99.70,-99.85,-99.89,-99.95,-99.77,-99.85,-99.90,-99.88,-99.84,-99.83,-99.82,-99.79,-99.75,-99.78,-99.77,-99.70,-99.68,-99.59,-99.48,-99.46,-99.36,-99.34,-99.33,-99.32,-99.25,-99.30,-99.34,-99.37,-99.37,-99.35,-99.58,-99.61,-99.73,-99.71,-99.69,-99.68,-99.68,-99.68,-99.68,-99.71,-99.68,-99.80,-99.76,-99.73,-99.79,-99.80,-99.89"> </div> </td> </tr> </tbody> </table> My code successfully extracts the data from the first table-data cell (the rank). But it is showing as blank for the second table data cell (the name). What is wrong with this source code: import scrapy from behold import Behold class SignalStartSpider(scrapy.Spider): name = 'signalstart' start_urls = [ 'https://www.signalstart.com/search-signals', ] def parse(self, response): for provider in response.xpath("//div[#class='row']//tr"): yield { 'rank': provider.xpath('td[1]/text()').get(), 'name': provider.xpath('td[2]/text()').get(), } UPDATE I am now iterating over the td cells within tr and getting the td cells, but my final problem is: how to get the text from the td cells that I have? 
import scrapy
from behold import Behold


class SignalStartSpider(scrapy.Spider):
    name = 'signalstart'
    start_urls = [
        'https://www.signalstart.com/search-signals',
    ]

    def parse(self, response):
        cols = "rank name gain pips drawdown trades type monthly chart price age added action"
        skip = [9,13]
        td = dict()
        for i, col in enumerate(cols.split()):
            td[i] = col
        Behold().show('td')
        for provider in response.xpath("//div[@class='row']//tr"):
            data_row = dict()
            for i, datum in enumerate(provider.xpath('td')):
                if i in skip:
                    continue
                data_row[td[i]] = datum
                # Behold().show('datum')
            yield data_row
The correct answer was provided by gallaecio_ in the Scrapy IRC channel. Here is the code:
import scrapy
from behold import Behold


class SignalStartSpider(scrapy.Spider):
    name = 'signalstart'
    start_urls = [
        'https://www.signalstart.com/search-signals',
    ]

    def parse(self, response):
        cols = "rank name gain pips drawdown trades type monthly chart price age added action"
        skip = [9,13]
        td = dict()
        for i, col in enumerate(cols.split()):
            td[i] = col
        Behold().show('td')
        for provider in response.xpath("//div[@class='row']//tr"):
            data_row = dict()
            for i, datum in enumerate(provider.xpath('td/text()')):
                if i in skip:
                    continue
                data_row[td[i]] = datum.get()
                # Behold().show('datum')
            yield data_row
For more involved cases you may need https://github.com/TeamHG-Memex/html-text
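A likely reason the name came back blank is that its text sits inside an a tag, so td[2]/text() only sees the whitespace around the link. One alternative, sketched here as an assumption rather than as the answer's method, is to ask XPath for normalize-space(.) on each cell, which returns the text of the cell and all its children. The sketch uses lxml instead of Scrapy so that it runs offline; PAGE is a shortened stand-in for the table, and COLUMNS and parse_rows are hypothetical names.

```python
from lxml import html

# Two-row stand-in for the signals table in the question.
PAGE = """
<html><body>
<table id="searchSignalsTable"><tbody>
<tr>
  <td style="text-align: center;"> - </td>
  <td><a class="pointer" href="https://www.signalstart.com/analysis/joker-1k/110059">Joker 1k</a> </td>
  <td><span class="red">-9.99%</span></td>
</tr>
<tr>
  <td style="text-align: center;"> - </td>
  <td><a class="pointer" href="https://www.signalstart.com/analysis/fxabakus/56043">FXabakus</a> </td>
  <td><span class="red">-19.57%</span></td>
</tr>
</tbody></table>
</body></html>
"""

COLUMNS = ["rank", "name", "gain"]


def parse_rows(page):
    """Return one dict per data row, keyed by column name."""
    tree = html.fromstring(page)
    rows = []
    # tr[td] keeps only rows that contain data cells (skips header rows).
    for tr in tree.xpath("//table[@id='searchSignalsTable']//tr[td]"):
        # normalize-space(.) yields the cell's full text, including text inside
        # child tags such as <a> or <span>, with whitespace collapsed.
        cells = [td.xpath("normalize-space(.)") for td in tr.xpath("td")]
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


rows = parse_rows(PAGE)
for row in rows:
    print(row)
```

In a Scrapy spider the same expression should work per cell as datum.xpath('normalize-space(.)').get(), iterating over provider.xpath('td') so that the index still corresponds to the column.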
Python Beautiful Soup Iterate over Multiple Tables
Trying to find multiple tables using the CSS names and I am only getting the CSS in the output initially. I want to loop over each of the small tables and from there each row contains player info with the tds attributes about each player. How come what I have there doesn't actually print the table contents to begin with? I want to confirm I have made this first step right, before I then go on and into the tr and tds for each mini table. I think part of the issue is that the first table. My program - import requests from bs4 import BeautifulSoup #url = 'https://www.skysports.com/premier-league-table' base_url = 'https://www.skysports.com' # Squad Data squad_url = base_url + '/liverpool-squad' squad_r = requests.get(squad_url) print(squad_r.status_code) premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser') premier_squad_table = premier_squad_soup.find_all = ('table', {'class': 'table -small no-wrap football-squad-table '}) print(premier_squad_table) HTML - each table looks like the following but with a different title <table class="table -small no-wrap football-squad-table " title="Goalkeeper"> <colgroup> <col class="" style=""> <col class="digit-4 -bp30-hdn"> <col class="digit-3 "> <col class="digit-3 "> <col class="digit-3 "> </colgroup> <thead> <tr class="text-s -interact text-h6" style=""> <th class=" text-h4 -txt-left" title="">Goalkeeper</th> <th class=" text-h6" title="Played">Pld</th> <th class=" text-h6" title="Goals">G</th> <th class=" text-h6" title="Yellow Cards ">YC</th> <th class=" text-h6" title="Red Cards">RC</th> </tr> </thead> <tbody> <tr class="text-h6 -center"> <td> <a href="/football/player/141016/alisson-ramses-becker"> <div class="row-table -2cols"> <span class="col span4/5 -txt-left"><h6 class=" text-h5">Alisson Ramses Becker</h6></span> </div> </a> </td> <td> 13 (0) </td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr class="text-h6 -center"> <td> <a href="/simon-mignolet"> <div class="row-table -2cols"> <span class="col span4/5 
-txt-left"><h6 class=" text-h5">Simon Mignolet</h6></span> </div> </a> </td> <td> 1 (0) </td> <td>0</td> <td>0</td> <td>0</td> </tr> <tr class="text-h6 -center"> <td> <a href="/football/player/153304/kamil-grabara"> <div class="row-table -2cols"> <span class="col span4/5 -txt-left"><h6 class=" text-h5">Kamil Grabara</h6></span> </div> </a> </td> <td> 1 (1) </td> <td>0</td> <td>0</td> <td>0</td> </tr> </tbody> </table> Output - 200 ('table', {'class': 'table -small no-wrap football-squad-table '})
I had to find the div first and then get the tables inside that div:
premier_squad_div = premier_squad_soup.find('div', {'class': '-bp30-box col span1/1'})
premier_squad_table = premier_squad_div.find_all('table', {'class': 'table -small no-wrap football-squad-table '})
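Note that the question's code assigns a tuple to premier_squad_soup.find_all (with =) instead of calling it, which is why the output shown is the tuple itself. A minimal sketch of the loop over tables and rows follows, assuming a shortened copy of one squad table (the HTML constant) and a hypothetical helper named squad_by_position. It selects on the single class football-squad-table, which avoids depending on the exact spacing of the full class attribute.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one squad table from the question.
HTML = """
<div class="-bp30-box col span1/1">
<table class="table -small no-wrap football-squad-table " title="Goalkeeper">
  <thead><tr><th>Goalkeeper</th><th title="Played">Pld</th></tr></thead>
  <tbody>
    <tr><td><a href="/football/player/141016/alisson-ramses-becker"><h6 class=" text-h5">Alisson Ramses Becker</h6></a></td><td> 13 (0) </td></tr>
    <tr><td><a href="/simon-mignolet"><h6 class=" text-h5">Simon Mignolet</h6></a></td><td> 1 (0) </td></tr>
  </tbody>
</table>
</div>
"""


def squad_by_position(markup):
    """Map each table's title (e.g. "Goalkeeper") to its rows of cell text."""
    soup = BeautifulSoup(markup, "html.parser")
    squad = {}
    # select() is called here, not assigned to; it returns a list of Tag objects.
    for table in soup.select("table.football-squad-table"):
        players = []
        for row in table.select("tbody tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            players.append(cells)
        squad[table.get("title")] = players
    return squad


squad = squad_by_position(HTML)
print(squad)
```

With the live page, the markup argument would be squad_r.text from the question's requests call.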
Extract table from html file using python
I want to extract table from an html file. I have written the following code-snippet to extract the first table: import urllib2 import os import time import traceback from bs4 import BeautifulSoup #find('table',{'class':'tbl_with_brdr'}) outfile= open('D:/Dropbox/Python/apelec.txt','wb') rfile = open('D:/Dropbox/PRI/Data/AP/195778.html') rsoup = BeautifulSoup(rfile) nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr') for node in nodes[1:]: x = node.find('th').find('b').get_text().encode("utf-8") print x y = node.find('th').findNext('th').find('b').get_text().encode("utf-8") print y outfile.write(str(x)+"\t"+str(y)+"\n") outfile.close() Here is the error: 9 rfile = open('D:/Dropbox/PRI/Data/AP/195778.html') 10 rsoup = BeautifulSoup(rfile) ---> 11 nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr') 12 for node in nodes[1:]: 13 x = node.find('th').find('b').get_text().encode("utf-8") AttributeError: 'NoneType' object has no attribute 'find' And the html file is: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <link rel="icon" type="image/ico" href="images/favicon.ico"/> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <link rel="stylesheet" href="themes/panchayat_default.css" type="text/css"/> <title>consolidated Election Report</title> </head> <body> <!-- To blur the background while processing dwr --> <div class="faded_div process"></div> <div class="popup_block_div process" style="display: none;"> <img alt="" src="images/loading_animation.gif" style="margin-left: auto; margin-right: auto;"> </div> <div id="maincontainer" class="resize"> <div id="headerwrap"> <!-- Header --> <html> <head> <script type='text/javascript' src="/profilerdwr/engine.js"> </script> <script type='text/javascript' src="/profilerdwr/util.js"> </script> <script type="text/javascript" src="/profilerdwr/interface/lgdDao.js"></script> <script type="text/javascript" 
src="js/common_util_js.js"></script> <link rel="stylesheet" href="css/common_css.css" type="text/css"></link> <meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /> </head> <body > <div class="clear"></div> <div id="headerwrap"> <div id="header"> <div id="new_header"> <div id="logoleft">Area Profiler</div> <div id="logoright"></div> <div class="clear"></div> </div> <div class="clear"></div> <div id="loginnav" align="right"> <table width="100%" class="tbl_no_brdr"> <tr> <td class="tblclear" align="left"> <div id="mainnav">Home </div> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="topnav"> <table width="100%" class="tbl_no_brdr"> <tr> <td width="85" class="tblclear">Choose Theme :</td> <td width="200" class="tblclear"> <form id="themeForm" name="themeForm" method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select name="theme" id="themeId" class="combofield" onchange="submitThemeForm()" style="width: 120px;"> <option value="default">Default Theme</option> <option value="mustard">Mustard Theme</option> <option value="peach">Peach Theme</option> <option value="green">Green Theme</option> <option value="blue">Blue Theme</option> </select> </form> </td> <td style="padding: 0px"> </td> <td class="tblclear"> </td> <td width="14" class="tblclear txticon"><img src="images/btnMinus.jpg" width="16" height="14" border="0" /></div></td> <td width="14" class="tblclear txticon"><img src="images/btnDefault.jpg" width="16" height="14" border="0" /> </td> <td width="28" class="tblclear txticon"><img src="images/btnPlus.jpg" width="16" height="14" border="0" /></td> <script type="text/javascript" > //documenttextsizer.setup("shared_css_class_of_toggler_controls") documenttextsizer.setup("texttoggler") </script> <td width="100" align="right" class="tblclear">Select Language :</td> <td width="108" align="right" class="tblclear"> <form id="languageForm" name="languageForm" 
method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select id="languageId" name="language" class="combofield" style="width: 120px;" onchange="submitLanguageForm()" > <option value=""> Select Language </option> </select> </form> </td> </tr> </table> </div> <div id="breadcrumbnav"> </div> </div> <script type="text/javascript"> function submitThemeForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('themeForm').submit(); } else { return; } } function submitLanguageForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('languageForm').submit(); } else { return; } } </script> </body> </html> </div> <div class="clear"></div> <div id="content"> <div id="leftpnl"> <table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td width="100%" valign="top" class="tblclear"> <!-- content -->. 
<script type="text/javascript" src="js/common_js.js"></script> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <script type="text/javascript"> var pathname; $(document).ready(function() {pathname = window.location.pathname;}); function onBack(s) { var position =pathname.indexOf("/", 2); var newPath = ""; var val = s.indexOf("?", 1); if(val>0) { newPath = s+"&redirect=true"; } else { newPath = s+"?redirect=true"; } window.location.replace(".."+pathname.substring(0,position)+"/"+newPath); } function downloadReport(repformat){ //window.location="downloadConsolidatedElectionReportPDF.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; //document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?repformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?reportformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].method="POST"; document.getElementById('electionReportForm').target="_blank"; document.forms["electionReportForm"].submit(); } </script> <style type="text/css"> .data_link{ color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .disable_link { cursor:default; color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:VISITED { color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:HOVER{ text-decoration: underline; } </style> </head> <body> <div id="frmcontent"> <div class="frmhd"> <table width="100%" class="tbl_no_brdr"> <tr> <td align="left" width="90%"> Consolidated Election</td> </tr> </table> </div> <div class="clear"></div> <div class="frmpnlbrdr"> <div 
class="frmpnlbg"> <div class="frmtxt"> <table width="100%" style="margin-bottom: 10px;" class="tbl_with_brdr"> <tr class="tblRowTitle tblclear" > <th align="left" ><b>State Name</b></th> <th align="left" ><b>Local Body Type</b></th> <th align="left" ><b>Election Term</b></th> <th align="left" ><b>Local Body Name</b></th> </tr> <tr class="tblRowB" style="color: blue;"> <th align="left" >ANDHRA PRADESH</th> <th align="left" >Village Panchayat</th> <th align="left" > 02-Aug-2013 To 01-Aug-2018 </th> <th align="left" >KODIHALLI</th> </tr> </table> <div class="frmhdtitle">Consolidated Election</div> <table width="100%" class="tbl_with_brdr"> <thead> <tr class="tblRowTitle tblclear"> <th align="center" width="5%" ><b>S.No.</b></th> <th align="left" width="9%"><b>Name</b></th> 0 <th align="left" width="9%"><b>Age</b></th> 1 <th align="left" width="9%"><b>Caste Category</b></th> 2 <th align="left" width="9%"><b>Gender</b></th> 3 <th align="left" width="9%"><b>Qualification</b></th> 4 <th align="left" width="9%"><b>Occupation</b></th> 5 <th align="left" width="9%"><b>Email Address</b></th> 6 <th align="left" width="9%"><b>Ward Name</b></th> 7 <th align="left" width="9%"><b>Reservation</b></th> 8 </tr> </thead> <tbody> <tr class="tblRowB"> <td align="center" >1</td> <td>Kambanna</td> <td>36</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>N/A</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >2</td> <td>Ramesh</td> <td>39</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 1</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >3</td> <td>S.Manjunath</td> <td>29</td> <td>OBC</td> <td>Male</td> <td>Higher Secondary or Intermediate or Pre University or Senior Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 2</td> <td > No (General / Others) 
</td> </tr> <tr class="tblRowA"> <td align="center" >4</td> <td>Obuleshu</td> <td>48</td> <td>OBC</td> <td>Male</td> <td>Below Primary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 3</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >5</td> <td>Mamatha</td> <td>24</td> <td>OBC</td> <td>Female</td> <td>Matriculation or Junior School Certificate or Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 4</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >6</td> <td>Shivamma</td> <td>38</td> <td>OBC</td> <td>Female</td> <td>Below Primary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 5</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >7</td> <td>Hanumantappa</td> <td>46</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 6</td> <td > No (General / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >8</td> <td>Malingappa</td> <td>45</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 7</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >9</td> <td>Kamalamma</td> <td>52</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 8</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >10</td> <td>Muddamma</td> <td>48</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 9</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >11</td> <td>Patta Tayamma</td> <td>45</td> <td>SC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 10</td> <td > Yes (SC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >12</td> <td>Sujatha</td> <td>35</td> <td>OBC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> 
<td>Ward no 11</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >13</td> <td>Kadurappa</td> <td>35</td> <td>SC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 12</td> <td > Yes (SC / Others) </td> </tr> </tbody> </table> <br /> <table width="100%" class="tbl_no_brdr"> <tr> <td align="center"> <input type="button" class="btn" onclick="onClose('welcome.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU')" value=Close /> <input type="button" class="btn" onclick="this.disabled=true; this.value='Please Wait .!';onBack('consolidatedElectionReport.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU&electionTermId=35107&stateId=28')" value=Back /> </td> </tr> </table> <form id="electionReportForm" name="electionReportForm" action="#" method="post"> <div align="center"><br/> <input type="button" class="btn" onclick="downloadReport('pdf');" value="Export to PDF" size="5" /> <input type="button" class="btn" onclick="downloadReport('xls');" value="Export to Excel" size="5" /> </div> </form> </div> <div class="myclass" style="font-family: Times; text-align: center; font-size: 10.0pt; color: white; font-weight: bold; border: 1px solid gray"> Report generated through Area Profiler (http://areaprofiler.gov.in)Thu Oct 02 22:34:20 IST 2014 </div> </div> </div> </div> </body> </html> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="footer"> <!-- Footer --> <html> <head> </head> <body> <table width="100%" class="tbl_no_brdr"> <tr> <td colspan="3" class="fotbrdr"></td> </tr> <tr> <td width="161" class="btmlogospace"><a href="http://www.negp.gov.in/" target= "_blank" ><img src="images/e_governance_logo.jpg" width="161" height="38" /></a></td> <td width="93" class="btmlogospace"><a href="http://www.panchayat.gov.in/" target= "_blank" ><img src="images/panchayatilogo.jpg" width="93" height="38" /></a></td> <td align="right" class="btmlogospace">Site is designed, hosted and 
maintained by National Informatics Centre<br /> Contents on this website is owned,updated and managed by the Ministry of Panchayati Raj</td> </tr> </table> </body> </html> </div> </div> </body> </html>
Here is an approach. It is not a complete solution, but you can use it as a guide. You have to traverse the DOM tree and extract the values you want. I changed the class of the div you look for from frmtext to frmtxt, which is the class used in the HTML. At each step of the traversal you have to check whether anything was found, because find returns None when there is no match.

    from bs4 import BeautifulSoup

    outfile = open('out.txt', 'wb')
    rfile = open('195778.html')
    rsoup = BeautifulSoup(rfile, 'html.parser')

    # The div class in the page is 'frmtxt', not 'frmtext'
    nodes1 = rsoup.find('div', {'class': 'frmtxt'})
    nodes = nodes1.find('table').find_all('tr')

    for node in nodes:
        # First column: text of the <b> inside the first <th>, if any
        a = node.find('th')
        x = None
        if a != None:
            x1 = a.find('b')  # was x.find('b'), which fails because x is None
            if x1 != None:
                x2 = x1.get_text().encode("utf-8")
                print x2
                x = x2

        # Second column: text of the <b> inside the next <th>, if any
        y = node.find('th')
        if y != None:
            print 'y', y
            y2 = y.findNext('th')
            if y2 != None:
                print 'y2', y2
                y3 = y2.find('b')
                if y3 != None:
                    y = y3.get_text().encode("utf-8")
                    print y

        outfile.write(str(x) + "\t" + str(y) + "\n")

    outfile.close()
    rfile.close()
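For the candidate rows themselves (the td cells inside tbody), the same row-by-row idea can be sketched as follows. This is only an illustration, not the method from the answer: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without third-party packages, and it is written for Python 3. The class name TableRowParser and the SAMPLE markup (a trimmed copy of the table in the question) are made up for the example.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> currently open, or None
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside an open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Trimmed sample based on the table in the question (assumed input)
SAMPLE = """
<table class="tbl_with_brdr">
<thead><tr><th><b>S.No.</b></th><th><b>Name</b></th><th><b>Age</b></th></tr></thead>
<tbody>
<tr class="tblRowB"><td align="center" >1</td><td>Kambanna</td><td>36</td></tr>
<tr class="tblRowA"><td align="center" >2</td><td>Ramesh</td><td> 39 </td></tr>
</tbody>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE)
parser.close()

# First row is the header, the rest are data rows
header, records = parser.rows[0], parser.rows[1:]
for record in records:
    print("\t".join(record))
```

To run it on the real page, read the saved file and pass its contents to parser.feed instead of SAMPLE. Note that this collects rows from every table in the input, so on the full page you would still need to pick out the rows you want, for example by their number of cells.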