Clicking a link using Selenium Python Library - python
How can I click the link highlighted in the attached image using the Python Selenium library? I have tried everything I can think of (most attempts are shown below), but nothing works.
Note the element tree in the attached picture.
Page code (the link in question is the last link in the code below):
<HEAD>
<TITLE> Constructor Self Service</TITLE>
</HEAD>
<script type="text/javascript">
......
</script>
<link rel="stylesheet"
href="/global/res/themes/corporate/css/style3.css"
type="text/css">
<script type="text/javascript"
src="/global/res/javascript/horizontal_subsection_HM_Loader.js"></script>
<link rel=stylesheet href="css/tcss.css" TYPE="text/css">
<script type="text/javascript" src="arrays/aaclib.js"></script>
<script type="text/javascript" src="arrays/csslib.js"></script>
<table cellspacing=0 cellpadding=0 border=0 width="100%" height="100%" >
<tr><td valign="top">
<form name="mainForm" action="/TCSSPRODapp/Controller"
method="POST" onsubmit="return false;">
<input type="hidden" name="command">
<input type="hidden" name="page">
<input type="hidden" name="menuIndex">
<input type="hidden" name="submitted" value="false">
<input type="hidden" name="pageToken"
value=679>
<input type="hidden" name="isInternal"
value=false>
<body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0"
marginheight="0">
<!--begin global header--><!--Don't place any other HTML code on this line!!-->
<script language="javascript" src="/global/javascript/head_array.js" type="text/javascript"></script>
<script language="javascript" type="text/javascript">
var v6=0;
if(typeof hmVisi!="undefined"&&typeof sectionId!="undefined" )v6=1;
if(v6)document.write('</head><body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onLoad="TEL_onLoad()">');
function alertUser() {
if(confirm("Warning: Any unsaved data would be lost.\rDo you want to Continue?")) {
return true;
} else {
return false;
}
}
</script>
<link rel="stylesheet" href="/global/css/gwy_hed.css" type="text/css">
<a name="top"></a>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tr>
<td rowspan="4" valign="top"><img src="/global/images/telstra_logo2.gif" width="85" height="85" border="0" alt="telstra.com"></td>
<td align="right" class="ap1"><span class="hd1"><img src="/global/images/sin.gif" width="1" height="37" border="0" alt="press enter now to toggle the accessibile text mode of this page" align="absmiddle">
Telstra Homepage | Contact Us | Search<img src="/global/images/sin.gif" width="12" height="1" alt=""></span></td>
<td width="50%" class="ap1"><img src="/global/images/sin.gif" width="1" height="57" alt=""></td>
</tr>
<tr class="hd3">
<!--<td><img src="/global/images/sin.gif" width="500" height="2" alt=""></td>-->
<td colspan=50 width="100%"><img src="/global/images/sin.gif" width="100" height="2" alt=""></td>
<td><img src="/global/images/sin.gif" width="1" height="2" alt=""></td>
</tr>
<tr>
<td align="right"><img src="/global/images/gwy_grad.gif" width="328" height="6" alt=""></td>
<td class="hd4"><img src="/global/images/sin.gif" width="1" height="1" alt=""></td>
</tr>
<tr>
<td colspan=2 class="hd5" nowrap><NOBR>
<table cellpadding=1 cellspacing=1 border=0>
<tr>
<!-- AMCO-WORKFLOW:MODIFY:START -->
<!--<td rowspan=3 class="hd5" nowrap><img src="/global/images/sin.gif" width="15" height="10" alt="" align="top">Telstra Contractor Self Service<img src="/global/images/sin.gif" width="25" height="1" alt="" align="top"></td>-->
<td rowspan=4 class="hd5" nowrap><img src="/global/images/sin.gif" width="5" height="10" alt="" align="top">Telstra Constructor Self Service<img src="/global/images/sin.gif" width="5" height="1" alt="" align="top"></td>
<!-- AMCO-WORKFLOW:MODIFY:END -->
<TD nowrap><img src="/global/images/sin.gif" width="10" height="12"></TD>
<TD nowrap></TD>
<TD nowrap></TD>
<!-- AMCO-WORKFLOW:ADD:START -->
<!--5.14.01 Anand Changes: start-->
<!--<TD nowrap></TD>-->
<!--5.14.01 Anand Changes: start-->
<!-- AMCO-WORKFLOW:ADD:END -->
<!--5.14.01 Anand Changes: start-->
</tr>
<tr style="background-color:#99ccff;">
<!-- AMCO-WORKFLOW:ADD:START -->
<TD class=button>
<a class=button style="text-decoration:none"
title="AMCO"
href="javascript:doCommand('doWidebandWorkSummary')"
onmousemove="window.status='View Work Summary'"
onmouseout="window.status=window.defaultStatus">
AMCO</a>
</TD>
<!-- AMCO-WORKFLOW:ADD:END -->
<TD nowrap style="background-color:#ffffff;"></TD>
<TD class=button nowrap>
<a class=button style="text-decoration:none"
title="View Inbox"
href="javascript:doCommand('doInboxStat')"
onmousemove="window.status='View Inbox'"
onmouseout="window.status=window.defaultStatus">
<NOBR>Inbox 0</NOBR></a>
</TD>
<TD class=button>
<a class=button style="text-decoration:none"
title="View Outbox"
href="javascript:doCommand('doOutboxStat')"
onmousemove="window.status='View Outbox'"
onmouseout="window.status=window.defaultStatus">
<NOBR>Outbox 0</NOBR></a>
</TD>
<!--Start : IPaC Stage II Drop1 : Refresh button added-->
<TD class=button>
<a class=button style="text-decoration:none"
title="Refresh Inbox and Outbox"
href="javascript:doCommand('doInboxOutboxRefresh')"
onmousemove="window.status='Refresh Inbox and Outbox'"
onmouseout="window.status=window.defaultStatus"
onclick="return alertUser()">
<NOBR>Refresh</NOBR></a>
</TD>
<!--End : IPaC Stage II Drop1-->
<TD nowrap style="background-color:#ffffff;"></TD>
<TD class=button nowrap>
<a class=button style="text-decoration:none"
title="Display TCSS help"
href="javascript:loadHelpWindow()"
onmousemove="window.status='Display TCSS help'"
onmouseout="window.status=window.defaultStatus">
<NOBR>Help</NOBR></a>
</TD>
<!-- AMCO-WORKFLOW:ADD:START -->
<!--5.14.01 Anand Changes: start-->
<!--<TD nowrap width="151" style="background-color:#ffffff;"></TD>-->
<!--5.14.01 Anand Changes: end-->
</tr>
<tr>
<TD nowrap><img src="/global/images/sin.gif" width="38" height="12"></TD>
<!--Start : IPaC Stage II Drop1 : Display message if refresh didnt occur-->
<TD nowrap style="background-color:#ffffff;"></TD>
<TD nowrap><img src="/global/images/sin.gif" width="38" height="12"></TD>
<TD nowrap style="background-color:#ffffff;"></TD></tr>
<tr>
<TD nowrap><img src="/global/images/sin.gif" width="38" height="12"></TD>
<TD nowrap style="background-color:#ffffff;"></TD>
</tr>
<!--<TD nowrap></TD>-->
<!--<TD nowrap></TD>-->
<!-- AMCO-WORKFLOW:ADD:START -->
<!--<TD nowrap></TD>-->
<!-- AMCO-WORKFLOW:ADD:END -->
<!--End : IPaC Stage II Drop1-->
</tr>
</table>
</NOBR>
</td>
</tr>
</table>
<script language="javascript" type="text/javascript">
<!--5.14.01 Anand Changes: changed from 100% to 15 % start-->
var s='<table width="50%" cellpadding="0" cellspacing="0" border="0"><tr><td class="siteletTitleTab"><img src="/global/images/sin.gif" width="1" height="3" alt=""></td></tr></table>';
<!--5.14.01 Anand Changes: changed from 100% to 15 % end-->
if(v6)document.write(s);
</script>
<!--end global header--><!--Don't place any other HTML code on this line!!-->
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr class="">
<td width="100%" rowspan="2" align=right>
</td>
<TD>
<INPUT type="hidden" name="internalConstructor" size=10 value="false"> </TD>
</tr>
</table>
<table border=0 cellspacing=0 cellpadding=0 width=100%>
<tr><td colspan=50 class=sitelettitletab>
<img src=/global/res/images/sin.gif width=0 height=3>
</td></tr><tr class=sitelettitletab>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doCommand("doFrontPage")' onmousemove="window.status='Home Page'" onmouseout="window.status=window.defaultStatus" title='Home Page'>Home Page</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doCommand("doTransmittalsInbox")' onmousemove="window.status='My TCSS'" onmouseout="window.status=window.defaultStatus" title='My TCSS'>My TCSS</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=siteactivetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=siteactivetabnav valign=top><img src=/global/res/images/sin.gif width=5 height=20></td><td class=siteactivetabnav valign=middle align=center nowrap><img src=/global/res/themes/corporate/images/tick.gif width=8 height=9> <a class=siteactivetabnav onmousemove="window.status='Work Under Contract'" onmouseout="window.status=window.defaultStatus">Work Under Contract</a></td><td align=right class=siteactivetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20></td>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doCommand("doGenDocSearch")' onmousemove="window.status='Documents'" onmouseout="window.status=window.defaultStatus" title='Documents'>Documents</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doCommand("doAssetSearch")' onmousemove="window.status='Reference Library'" onmouseout="window.status=window.defaultStatus" title='Reference Library'>Reference Library</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doCommand("doAbout")' onmousemove="window.status='Support'" onmouseout="window.status=window.defaultStatus" title='Support'>Support</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td><img src=/global/res/images/sin.gif width=3 height=1></td><td class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_left_inactive.gif width=8 height=20></td><td class=sitetabnav valign=middle align=center nowrap><a class=sitetabnav href='javascript:doLogout()' onmousemove="window.status='Close TCSS'" onmouseout="window.status=window.defaultStatus" title='Close TCSS'>Close TCSS</a></td><td align=right class=sitetabnav valign=top><img src=/global/res/themes/corporate/images/tab_right_inactive.gif width=8 height=20>
<td width=100%><img src=/global/res/images/sin.gif width=1 height=1></td></tr><tr><td colspan=50><img src=/global/res/images/sin.gif width=0 height=2></td></tr></table>
<table cellpadding=0 cellspacing=0 border=0 width=100%><tr><td>
<img src='/global/res/images/sin.gif' width=1 height=3></td></tr><tr class=sitetabnav><td>
<table cellpadding=0 cellspacing=0 border=0><tr><script type='text/javascript'>
TEL_horizontalSubsectionNav('corporate')
</script><td width=100%> </td></tr></table></td></tr></table>
<table cellspacing=10 cellpadding=0 border=0 width="100%">
<tr><td valign=top width="100%">
<TABLE border=0 cellpadding=0 cellspacing=0 width=100%>
<TR><TD bgcolor=#99ccff>
<TABLE cellspacing=2 cellpadding=0 border=0 width=100%>
<TR><TD bgcolor=#F1F8FE>
<TABLE border=0 cellpadding=0 cellspacing=0>
<TR><TD colspan=2 height=1 bgcolor=#F1F8FE></TD>
<TD rowspan=3><IMG border=0 src='images/corner.jpg'></TD>
</TR><TR><TD><IMG border=0 width=1 height=1 src='images/whitedot.gif'></TD>
<TD class='groupboxheader'> WUC CSA - Scope Variation - Site & Financial Details </TD></TR></TABLE>
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="3" WIDTH="100%">
<TR>
<TD CLASS="label" WIDTH="20%">Contract Number:</TD>
<TD WIDTH="25%" CLASS="fieldlabel">20003171</TD>
<TD CLASS="label" WIDTH="20%">Separable Portion:</TD>
<TD WIDTH="25%" CLASS="fieldlabel">30056120</TD>
</TR>
<TR>
<TD CLASS="label">Work Order:</TD>
<TD CLASS="fieldLabel">1</TD>
<TD CLASS="label">CSA Number:</TD>
<TD CLASS="fieldLabel">4117298</TD>
</TR>
<TR>
<TD CLASS="label">Current CSA Status:</TD>
<TD class="fieldLabel">
Draft</TD>
<TD CLASS="label">Initiated By:</TD>
<TD CLASS="fieldlabel">CONSTRUCTOR</TD>
</TR>
<TR>
<TD CLASS="label">Issue Number:</TD>
<TD CLASS="fieldlabel">1</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
<SCRIPT type="text/javascript">
function showQuote(sequenceNo) {
var f = document.mainForm;
switch (parseInt(sequenceNo)) {
case 10001:
f.command.value = "doCsaViewPSDetails";
break;
case 10002:
f.command.value = "doCsaViewDRDetails";
break;
case 10003:
f.command.value = "doCsaViewLSDetails";
break;
default:
f.sequenceNo.value = sequenceNo;
f.command.value = "doCsaViewDIDetails";
}
doSubmit(f);
}
function showMaterialQuote(sequenceNo) {
var f = document.mainForm;
f.sequenceNo.value = sequenceNo;
f.command.value = "doCsaViewMatDetails";
doSubmit(f);
}
//NDCG:For Link to Non Catalogued material screen
//NDCG:Add:Start
function showNonCatMaterialQuote(sequenceNo) {
var f = document.mainForm;
f.command.value = "doCsaViewMatExDetails";
f.sequenceNo.value = sequenceNo;
doSubmit(f);
}
//NDCG:ADD:End
</SCRIPT>
<INPUT type="hidden" name="sequenceNo">
<BR>
<TABLE border="0" cellspacing="1" cellpadding="0" class="table" width="100%">
<TR align="middle">
<TD rowspan="2" class=colHeader>Type</TD>
<TD colspan="3" class=colHeader>TCSS Calculated</TD>
<TD colspan="3" class=colHeader>Quoted</TD>
<TD colspan="3" class=colHeader>Approved</TD>
<TD rowspan="2" class=colHeader>View</TD>
</TR>
<TR align="middle">
<TD class=colHeader>Value</TD>
<TD class=colHeader>GST</TD>
<TD class=colHeader>Price</TD>
<TD class=colHeader>Value</TD>
<TD class=colHeader>GST</TD>
<TD class=colHeader>Price</TD>
<TD class=colHeader>Value</TD>
<TD class=colHeader>GST</TD>
<TD class=colHeader>Price</TD>
</TR>
<TR>
<TD class=cell>1. PENRITH - Generic Land & Building - Project</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD class=cell align="center" nowrap>
<IMG SRC="/global/res/images/view_white.gif" width=44 height=18 onmousemove="window.status='View Item'" onmouseout="window.status=window.defaultStatus" ALT="View Item" BORDER=0>
<IMG SRC=/global/res/images/material.gif width=16 height=16 border=0 onmousemove="window.status='Catalogued Material'" onmouseout="window.status=window.defaultStatus" alt='Catalogued Material'>
<IMG SRC=/global/res/images/material.gif width=16 height=16 border=0 onmousemove="window.status='Non Catalogued Material'" onmouseout="window.status=window.defaultStatus" alt='Non Catalogued Material'>
</TD>
</TR>
<TR>
<TD class=cell>Daywork Rates</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD class=cell align="center" nowrap>
<IMG SRC="/global/res/images/view_white.gif" width=44 height=18 onmousemove="window.status='View Item'" onmouseout="window.status=window.defaultStatus" ALT="View Item" BORDER=0>
</TD>
</TR>
<TR>
<TD class=cell>Lump Sums</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD nowrap class=rightcell>$0.00</TD>
<TD class=cell align="center" nowrap>
<IMG SRC="/global/res/images/view_white.gif" width=44 height=18 onmousemove="window.status='View Item'" onmouseout="window.status=window.defaultStatus" ALT="View Item" BORDER=0>
</TD>
</TR>
Attempts:
driver.find_element_by_css_selector("a[href='javascript:showQuote('10003')']").click()
driver.find_element_by_xpath("//a[@href="javascript:showQuote('10003')"]").click()
driver.find_element_by_partial_link_text('showQuote('10003')').click()
driver.find_element_by_css_selector('[href^=javascript:showQuote('10003')]').click()
driver.execute_script('showQuote('10003')')
Have you tried using an XPath like the one below?
//tbody/tr[5]/td[11]/a/img
I know that relying on tr and td positions such as 5 and 11 is fragile if they keep changing, but try it once to see whether it is able to click.
You can try it like this:
//a[starts-with(@href, 'javascript:showQuote') and contains(@href, '10003')]
If you are using Firefox, use Firebug with the FirePath add-on to check whether the expression fetches the required unique element.
Thank You,
Murali
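To see which anchors the starts-with/contains predicate above would select, the same test can be sketched in plain Python with the standard library. The sample markup and the class name QuoteLinkFinder are assumptions for illustration: the anchors are reconstructed from the attempts in the question, so the real page may differ.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed stand-in for the page's "View" links.
SAMPLE = """
<a href="javascript:showQuote('10001')"><img alt="View Item"></a>
<a href="javascript:showQuote('10003')"><img alt="View Item"></a>
<a href="javascript:doCommand('doAbout')">Support</a>
"""


class QuoteLinkFinder(HTMLParser):
    """Collect hrefs that satisfy the same two conditions as the XPath predicate."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # starts-with(@href, 'javascript:showQuote') and contains(@href, '10003')
        if href.startswith("javascript:showQuote") and "10003" in href:
            self.matches.append(href)


finder = QuoteLinkFinder()
finder.feed(SAMPLE)
print(finder.matches)
```

If exactly one href is collected, the XPath should be unique on the page as well, which is the same check the FirePath suggestion performs in the browser.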
You need to escape the quotes to make it work:
driver.find_element_by_css_selector("[href='javascript:showQuote(\\'10003\\')']").click()
Or with a raw string:
driver.find_element_by_css_selector(r"[href='javascript:showQuote(\'10003\')']").click()
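The two spellings above produce the same selector text, which can be checked without a browser. The sketch below also builds the selector with a small helper; css_attr_equals is a hypothetical name introduced here, not part of Selenium.

```python
def css_attr_equals(attr: str, value: str) -> str:
    """Build [attr='value'], backslash-escaping backslashes and single quotes in value."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"[{attr}='{escaped}']"


# Doubled backslashes in a normal string and single backslashes in a raw string
# both leave one literal backslash before each inner quote.
escaped_form = "[href='javascript:showQuote(\\'10003\\')']"
raw_form = r"[href='javascript:showQuote(\'10003\')']"
built = css_attr_equals("href", "javascript:showQuote('10003')")

# Alternative: double quotes around the CSS value, so the inner single
# quotes need no CSS-level escaping at all.
double_quoted = '[href="javascript:showQuote(\'10003\')"]'

print(escaped_form)
print(double_quoted)

# Assumed usage with a live driver (Selenium 4 style, not run here):
# driver.find_element(By.CSS_SELECTOR, built).click()
```

The original attempts fail earlier than the selector engine: a single-quoted Python string that contains unescaped single quotes is a syntax error, which also applies to the execute_script attempt.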
Related
Returning None when scraping href using Python
Hi I'm trying to scrape 151 Heavy Duty Rubber Gloves - Ex Large from table with following inspect script. Can someone please help with the right Python script? [<table border="0" class="ProductBox" id="Added0"> <tr> <td align="center" colspan="2"> <div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div> </td></tr><tr> <td align="center" colspan="2" height="60px;" valign="top"> <div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div> </td> </tr> <tr> <td align="center" colspan="2" height="185"> <a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;"> <img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a> </td> </tr> <tr> <td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td> <td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> </td> </tr> <tr> <td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> POR 0% </td> <td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> VAT 20% </td> </tr> <tr> <td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;"> <a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;"> **151 Heavy Duty Rubber Gloves - Ex Large**</a></td> </tr> <tr> <td class="ProdDetails" colspan="1" 
style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> 1s x 1 </td> <td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;"> <div class="tooltip"> <div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;"> <span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div> </div> <span class="OKStatus">In Stock </span> </td> </tr> <tr> <td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"> <table style="margin-top : 10px;" width="100%"> <tr> <td> <img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/> </td> <td> <input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/> </td> <td> <img align="middle" alt="Add 1 To Qty" src="/images/add.png"/> </td> <td align="right"> <button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button> </td> </tr> </table>
I tried to use the following:
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
    links = [link.get('href') for link in i.find_all('a')]
    print(links)
which unfortunately returns:
['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']
You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
from typing import Optional
from bs4 import BeautifulSoup

DATA = """<table border="0" class="ProductBox" id="Added0"> <tr> ... </table>"""

def extract_name(data: str) -> Optional[str]:
    # Parse the markup and pick the anchors inside td.ProdDetails cells.
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        # Strip whitespace, then the surrounding asterisks, then whitespace again.
        return links[0].text.strip().strip("*").strip()
    else:
        return None

print(extract_name(DATA))

# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
text = extract_name(str(tables[0]))  # extract_name expects a string, not a Tag
Output:
151 Heavy Duty Rubber Gloves - Ex Large
Xpath Python Extract Data From Table Between Two Headings
I'm trying to extract data from a table that lies between two headers in an HTML file using Python. In this case, the id to look up sits in a span inside a header (I need id="Perlis"; the table lies between the Perlis and Kedah headings):
<h2> <span class="mw-headline" id="Perlis">Perlis</span> <span class="mw-editsection"> <span class="mw-editsection-bracket">[</span> edit <span class="mw-editsection-bracket">]</span> </span> </h2> <table class="wikitable" style="text-align:center; font-size:90%; width:100%;"> <tbody> <tr> <th width="30"># </th> <th width="150">Constituency s </th> <th width="150">Winner </th> <th width="80">Votes </th> <th width="80">Majority </th> <th width="150">Opponent(s) </th> <th width="80">Votes </th> <th width="150">Incumbent </th> <th width="80"> <b>Incumbent Majority</b> </th> </tr> <tr> <td colspan="13"> BN <b>2</b> | GS <b>0</b> | PH <b>1</b> | Independent <b>0</b> </td> </tr> <tr align="center"> <td rowspan="2">P1 </td> <td rowspan="2"> Padang Besar </td> <td rowspan="2" bgcolor="#B5BED9"> Zahidi Zainul Abidin <br /> ( <b>BN</b>- <b>UMNO</b>) </td> <td rowspan="2"> <b>15,032</b> </td> <td rowspan="2"> <b>1,438</b> </td> <td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td> <td> <b>13,594</b> </td> <td rowspan="2" bgcolor="#B5BED9"> Zahidi Zainul Abidin <br /> ( <b>BN</b>- <b>UMNO</b>) </td> <td rowspan="2"> <b>7,426</b> </td> </tr> <tr> <td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td> <td> <b>7,874</b> </td> </tr> <tr align="center"> <td rowspan="2">P2 </td> <td rowspan="2"> Kangar </td> <td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td> <td rowspan="2"> <b>20,909</b> </td> <td rowspan="2"> <b>5,603</b> </td> <td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td> <td> <b>15,306</b> </td> <td rowspan="2" bgcolor="#B5BED9"> Shaharuddin Ismail <br /> ( <b>BN</b>- <b>UMNO</b>) </td> <td rowspan="2"> <b>4,037</b> </td> </tr> <tr> <td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td> <td> <b>8,465</b> </td> </tr> </tbody> </table> <h2> <span class="mw-headline" id="Kedah">Kedah</span> <span class="mw-editsection"> <span class="mw-editsection-bracket">[</span> edit <span class="mw-editsection-bracket">]</span> </span> </h2> <table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>
This is the resulting JSON that I am trying to construct:
[
  {
    "state": "Perlis",
    "constituencies": [
      { "id": "P1", "name": "Padang Besar" },
      { "id": "P2", "name": "Kangar" }
    ]
  }
]
I'd like to know how to reference the specific table so I can extract the data into JSON. I have used Scrapy before but am not sure how to approach this case. This is what I had in mind:
import scrapy

class PostSpider(scrapy.Spider):
    name = 'manual_spider'
    start_urls = [ '%URL%' ]

    def parse(self, response):
        doc = response.xpath('//comment()').getall()
        # This is the bit I need
        # code continues here
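One way to reference "the table right after the Perlis heading" is to walk the siblings of the heading. In XPath 1.0 (which Scrapy's response.xpath accepts) that would be //h2[span[@id="Perlis"]]/following-sibling::table[1]. The sketch below swaps in the standard library's xml.etree.ElementTree instead of Scrapy so it runs on its own; SAMPLE is a trimmed, well-formed stand-in for the page, and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample: same shape as the question's markup, fewer columns.
SAMPLE = """<div>
<h2><span class="mw-headline" id="Perlis">Perlis</span></h2>
<table class="wikitable"><tbody>
<tr><th>#</th><th>Constituency</th></tr>
<tr><td colspan="13">BN 2 | GS 0 | PH 1</td></tr>
<tr><td rowspan="2">P1 </td><td rowspan="2"> Padang Besar </td><td>x</td></tr>
<tr><td>Mokhtar Senik</td></tr>
<tr><td rowspan="2">P2 </td><td rowspan="2"> Kangar </td><td>x</td></tr>
</tbody></table>
<h2><span class="mw-headline" id="Kedah">Kedah</span></h2>
<table class="wikitable"></table>
</div>"""


def table_after_heading(root, state_id):
    """Return the first <table> that follows the <h2> containing span#state_id."""
    found = False
    for node in root:
        if node.tag == "h2":
            if found:
                return None  # reached the next heading without seeing a table
            found = node.find(f"span[@id='{state_id}']") is not None
        elif found and node.tag == "table":
            return node
    return None


def constituencies(table):
    """Collect id/name pairs from rows whose first two cells both carry rowspan."""
    rows = []
    for tr in table.iter("tr"):
        cells = tr.findall("td")
        if len(cells) >= 2 and cells[0].get("rowspan") and cells[1].get("rowspan"):
            rows.append({
                "id": "".join(cells[0].itertext()).strip(),
                "name": "".join(cells[1].itertext()).strip(),
            })
    return rows


root = ET.fromstring(SAMPLE)
table = table_after_heading(root, "Perlis")
result = [{"state": "Perlis", "constituencies": constituencies(table)}]
print(json.dumps(result, indent=2))
```

The rowspan test is an assumption drawn from the sample rows (opponent-only rows have no rowspan cells); a real page may need a stricter rule, such as checking that the first cell starts with "P".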
Can this beautifulsoup script be simplified with Regex?
I wrote some BeautifulSoup scripts, and one part seems really redundant; I am wondering whether it can be simplified with a regex. All posts from this forum are marked with different colors, and what I did was search for each color with one line. For six colors I wrote six lines with only one word of difference.
red = soup.find_all('a', style="font-weight: bold;color: red")
blue = soup.find_all('a', style="font-weight: bold;color: blue")
green = soup.find_all('a', style="font-weight: bold;color: green")
purple = soup.find_all('a', style="font-weight: bold;color: purple")
orange = soup.find_all('a', style="font-weight: bold;color: orange")
lime = soup.find_all('a', style="color: green")
I am not sure whether it can be simplified. Maybe something like:
re.compile("(color: red|blue|green|purple|orange)", re.(whatever the letter is))
If not a regex, could it be something else? This is the partial DOM:
<th class="common"> <label> <img alt="" src="images/green001/agree.gif"/> <img alt="本版置顶" src="images/green001/pin_1.gif"/> </label> <em>[美臀]</em> <span id="thread_10431427">(本中)(HND-???) 
二宮ひかり</span> <img alt="附件" class="attach" src="images/attachicons/common.gif"/> </th> <td class="author"> <cite> 第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>6 </cite> <em>2019-4-22</em> </td> <td class="nums"><strong>2</strong> / <em>12234</em></td> <td class="nums">5.02G / MP4 </td> <td class="lastpost"> <em>2019-4-23 20:22</em> <cite>by zj376104288</cite> </td> </tr> </tbody><!-- 三級置頂分開 --> <!-- 三級置頂分開 --> <tbody id="stickthread_10431424"> <tr> <td class="folder"><img src="images/green001/folder_common.gif"/></td> <td class="icon"> </td> <th class="common"> <label> <img alt="" src="images/green001/agree.gif"/> <img alt="本版置顶" src="images/green001/pin_1.gif"/> </label> <em>[VR]</em> <span id="thread_10431424">(WAAP)(WPVR-???)葵百合香</span> <img alt="附件" class="attach" src="images/attachicons/common.gif"/> </th> <td class="author"> <cite> 第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>5 </cite> <em>2019-4-22</em> </td> <td class="nums"><strong>0</strong> / <em>7265</em></td> <td class="nums">3.85G / MP4 </td> <td class="lastpost"> <em>2019-4-22 20:57</em> <cite>by 第一會所新片</cite> </td> </tr> </tbody><!-- 三級置頂分開 --> <!-- 三級置頂分開 --> <tbody id="stickthread_10431423"> <tr> <td class="folder"><img src="images/green001/folder_common.gif"/></td> <td class="icon"> </td> <th class="common"> <label> <img alt="" src="images/green001/agree.gif"/> <img alt="本版置顶" src="images/green001/pin_1.gif"/> </label> <em>[VR]</em> <span id="thread_10431423">(KMP)(SAVR-???)舞島あかり</span> <img alt="附件" class="attach" src="images/attachicons/common.gif"/> </th> <td class="author"> <cite> 第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>4 </cite> <em>2019-4-22</em> </td> <td class="nums"><strong>0</strong> / <em>6226</em></td> <td class="nums">23.39G / MP4 </td> <td class="lastpost"> <em>2019-4-22 20:57</em> <cite>by 第一會所新片</cite> </td> </tr> </tbody><!-- 三級置頂分開 --> <!-- 三級置頂分開 --> <tbody id="stickthread_10431422"> <tr> <td 
class="folder"><img src="images/green001/folder_common.gif"/></td> <td class="icon"> </td>
You can pass a list of attribute selectors to CSS select, using the ends-with operator ($=):
[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']
So:
items = [item for item in soup.select("[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']")]
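The regex route from the question can also work once the alternatives are grouped. In the pattern as written there, "color: " binds only to "red", so a bare "blue" anywhere in the style would match. The sketch below tests a grouped pattern on plain strings; the sample style values are made up for illustration.

```python
import re

COLORS = ["red", "blue", "green", "purple", "orange"]

# Group the alternatives so "color: " applies to every colour name,
# and anchor at the end like the $= selectors do.
COLOR_STYLE = re.compile(r"color: (?:%s)$" % "|".join(COLORS))

# Hypothetical style attribute values.
styles = [
    "font-weight: bold;color: red",
    "color: green",
    "font-weight: bold;color: black",
    "color: red; margin: 0",
]

matched = [s for s in styles if COLOR_STYLE.search(s)]
print(matched)
```

BeautifulSoup accepts a compiled pattern as an attribute filter, so soup.find_all('a', style=COLOR_STYLE) should replace the six separate calls; soup is assumed to be the object from the question and is not built here.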
How to scrape all values in html table including spans and a href
I have the following html code: <!DOCTYPE html> <html> <head> <title></title> </head> <body> <table border="1" cellspacing="0" class="Quote xaltrow" id="MainContent_Quote1_Table1_Table1" style="border-collapse:collapse;border-collapse:collapse;"> <thead> <tr class="xheader"> <th> <span>Sym</span> <a class="ToggleNames" href="/Analytics/MostActive.aspx">-Names</a> <span class="arrow"></span> </th> <th colspan="3">Bid - Ask</th> <th>Last <span class="arrow"></span></th> <th>Chg <span class="arrow"></span></th> <th>%Ch <span class="arrow"></span></th> <th>Vol <span class="arrow"></span></th> <th>$Vol <span class="arrow"></span></th> <th>#Tr <span class="arrow"></span></th> <th>Open-Hi-Lo</th> <th>Year Hi-Lo</th> <th>Last Tr</th> <th>News</th> <th>Delay</th> </tr> </thead> <tbody> <tr class="Upd UpdURHHBY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=RHHBY®ion=U">RHHBY</a> <span>- Q</span> <span class="Name">- ROCHE HLDG LTD SPONS</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">31.92</td> <td class=" xred UpdC">-0.31</td> <td class="xsmall UpdCP xred">-1.0</td> <td class="q-regright UpdV">851.0</td> <td class="q-smright UpdW">27,163</td> <td class="xsmall UpdT">1,461</td> <td class="xsmall xcentre"><span class="UpdO">32.03</span> <span class="UpdH">32.067</span> <span class="UpdI">31.84</span></td> <td class="xsmall xcentre">33.74 27.09</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUNSRGY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=NSRGY®ion=U">NSRGY</a> <span>- Q</span> <span class="Name">- NESTLE SA REG SHRS S</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">76.07</td> <td class=" xred UpdC">-0.23</td> <td class="xsmall UpdCP xred">-0.3</td> <td class="q-regright UpdV">336.2</td> <td class="q-smright UpdW">25,574</td> <td class="xsmall UpdT">1,785</td> <td class="xsmall 
xcentre"><span class="UpdO">75.89</span> <span class="UpdH">76.07</span> <span class="UpdI">75.66</span></td> <td class="xsmall xcentre">83.00 66.28</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUNTTYY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=NTTYY®ion=U">NTTYY</a> <span>- Q</span> <span class="Name">- NIPPON TELEGRAPH AND TELEPHONE C</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">43.90</td> <td class=" xred UpdC">-0.56</td> <td class="xsmall UpdCP xred">-1.3</td> <td class="q-regright UpdV">316.2</td> <td class="q-smright UpdW">13,883</td> <td class="xsmall UpdT">889</td> <td class="xsmall xcentre"><span class="UpdO">44.145</span> <span class="UpdH">44.15</span> <span class="UpdI">43.89</span></td> <td class="xsmall xcentre">44.57 43.00</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUTCEHY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=TCEHY®ion=U">TCEHY</a> <span>- Q</span> <span class="Name">- TENCENT HOLDINGS ADR</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">29.63</td> <td class=" xgreen UpdC">+0.06</td> <td class="xsmall UpdCP xgreen">0.2</td> <td class="q-regright UpdV">380.1</td> <td class="q-smright UpdW">11,263</td> <td class="xsmall UpdT">1,341</td> <td class="xsmall xcentre"><span class="UpdO">29.65</span> <span class="UpdH">29.78</span> <span class="UpdI">29.60</span></td> <td class="xsmall xcentre">29.85 19.74</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUATLKY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=ATLKY®ion=U">ATLKY</a> <span>- Q</span> <span class="Name">- ATLAS COPCO AB SER A</span> </td> <td class="bac" colspan="3">no orders</td> 
<td class="UpdL">35.46</td> <td class=" xred UpdC">-0.23</td> <td class="xsmall UpdCP xred">-0.6</td> <td class="q-regright UpdV">316.2</td> <td class="q-smright UpdW">11,213</td> <td class="xsmall UpdT">209</td> <td class="xsmall xcentre"><span class="UpdO">35.74</span> <span class="UpdH">35.81</span> <span class="UpdI">35.46</span></td> <td class="xsmall xcentre">35.72 23.58</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUVLKAY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=VLKAY&region=U">VLKAY</a> <span>- Q</span> <span class="Name">- VOLKSWAGEN A G SPONS</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">29.15</td> <td class=" xred UpdC">-0.34</td> <td class="xsmall UpdCP xred">-1.2</td> <td class="q-regright UpdV">323.6</td> <td class="q-smright UpdW">9,432</td> <td class="xsmall UpdT">782</td> <td class="xsmall xcentre"><span class="UpdO">28.935</span> <span class="UpdH">29.25</span> <span class="UpdI">28.90</span></td> <td class="xsmall xcentre">33.60 25.88</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd UpdUTMICY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=TMICY&region=U">TMICY</a> <span>- Q</span> <span class="Name">- TREND MICRO ADR #</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">42.78</td> <td class=" xred UpdC">-0.64</td> <td class="xsmall UpdCP xred">-1.5</td> <td class="q-regright UpdV">210.6</td> <td class="q-smright UpdW">9,011</td> <td class="xsmall UpdT">155</td> <td class="xsmall xcentre"><span class="UpdO">42.905</span> <span class="UpdH">42.93</span> <span class="UpdI">42.78</span></td> <td class="xsmall xcentre">44.75 32.04</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> <tr class="Upd 
UpdUALIOY-"> <td class="sym"> <a class="qn Name" href="/Quote/Detail.aspx?symbol=ALIOY&region=U">ALIOY</a> <span>- Q</span> <span class="Name">- ACTELION LTD</span> </td> <td class="bac" colspan="3">no orders</td> <td class="UpdL">70.66</td> <td class=" xgreen UpdC">+0.06</td> <td class="xsmall UpdCP xgreen">0.1</td> <td class="q-regright UpdV">123.3</td> <td class="q-smright UpdW">8,715</td> <td class="xsmall UpdT">56</td> <td class="xsmall xcentre"><span class="UpdO">70.538</span> <span class="UpdH">70.70</span> <span class="UpdI">70.50</span></td> <td class="xsmall xcentre">70.89 34.83</td> <td class="xsmall xcentre UpdE"></td> <td class="xsmall xcentre"></td> <td class="xsmall xcentre">realtime</td> </tr> </tbody> </table> </body> </html>
This is my code for grabbing the table:
    import lxml.html

    response = open('test.html')
    html2 = response.read()
    root = lxml.html.fromstring(html2)
    for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
        cells = row.xpath('.//td/text()')
        print cells
This is the result:
    ['no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', u'\xa0\xa0', u'\xa0\xa0', u'33.74\xa0\xa027.09', 'realtime']
    ['no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', u'\xa0\xa0', u'\xa0\xa0', u'83.00\xa0\xa066.28', 'realtime']
    ['no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', u'\xa0\xa0', u'\xa0\xa0', u'44.57\xa0\xa043.00', 'realtime']
    ['no orders', '29.63', '+0.06', '0.2', '380.1', '11,263', '1,341', u'\xa0\xa0', u'\xa0\xa0', u'29.85\xa0\xa019.74', 'realtime']
    ['no orders', '35.46', '-0.23', '-0.6', '316.2', '11,213', '209', u'\xa0\xa0', u'\xa0\xa0', u'35.72\xa0\xa023.58', 'realtime']
    ['no orders', '29.15', '-0.34', '-1.2', '323.6', '9,432', '782', u'\xa0\xa0', u'\xa0\xa0', u'33.60\xa0\xa025.88', 'realtime']
    ['no orders', '42.78', '-0.64', '-1.5', '210.6', '9,011', '155', u'\xa0\xa0', u'\xa0\xa0', u'44.75\xa0\xa032.04', 'realtime']
    ['no orders', '70.66', '+0.06', '0.1', '123.3', '8,715', '56', u'\xa0\xa0', u'\xa0\xa0', u'70.89\xa0\xa034.83', 'realtime']
I would like it to be the following instead:
    ['RHHBY', 'ROCHE HLDG LTD SPONS', 'no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', '32.03', '32.067', '31.84', '33.74', '27.09', '', '', 'realtime']
    ['NSRGY', 'NESTLE SA REG SHRS S', 'no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', '75.89', '76.07', '75.66', '83.00', '66.28', '', '', 'realtime']
    ['NTTYY', 'NIPPON TELEGRAPH AND TELEPHONE C', 'no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', '44.145', '44.15', '43.89', '44.57', '43.00', 'realtime']
    ...
How do I get the values inside the <td>s that contain <span> and/or <a> elements? This table can be very large, so I would like the script to stay as fast as the one above. I plan to take this array and write it to a database or CSV.
I don't know if this is exactly what you want, but you could also grab the text in the td/a and td/span elements:
    import lxml.html

    response = open('test.html')
    html2 = response.read()
    root = lxml.html.fromstring(html2)
    for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
        # Link text first, then text directly inside each td, then span text.
        cells = row.xpath('.//td/a/text()')
        cells = cells + row.xpath('.//td/text()')
        cells = cells + row.xpath('.//td/span/text()')
        print(cells)
Note that this groups the values by element type (links, then bare td text, then spans) rather than in column order. To strip the non-breaking spaces (\xa0) from the values, you could use something like:
    print([c.replace(u'\xa0', u'') for c in cells])
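If the values are needed in column order, one option is to walk the cells one at a time and collect every text node beneath each td. The following is only a sketch under stated assumptions: the SAMPLE markup is a hypothetical, trimmed-down copy of one row from the question (with the non-breaking spaces written as &nbsp;), and the helper names cell_values and table_rows are made up for illustration. It keeps the "Q" marker from the bare span, which the desired output in the question leaves out.

```python
import lxml.html

# Hypothetical sample: one shortened row modelled on the question's table.
SAMPLE = u"""<html><body>
<table id="MainContent_Quote1_Table1_Table1">
  <tbody>
    <tr class="Upd">
      <td class="sym"> <a class="qn Name" href="#">RHHBY</a> <span>- Q</span>
        <span class="Name">- ROCHE HLDG LTD SPONS</span> </td>
      <td class="bac" colspan="3">no orders</td>
      <td class="UpdL">31.92</td>
      <td class="xsmall xcentre"><span class="UpdO">32.03</span>&nbsp;&nbsp;<span class="UpdH">32.067</span>&nbsp;&nbsp;<span class="UpdI">31.84</span></td>
      <td class="xsmall xcentre">33.74&nbsp;&nbsp;27.09</td>
      <td class="xsmall xcentre UpdE"></td>
      <td class="xsmall xcentre">realtime</td>
    </tr>
  </tbody>
</table>
</body></html>"""


def cell_values(td):
    """Return the values found in one <td>, in document order."""
    values = []
    # './/text()' yields the text of the td and of all its descendants.
    for fragment in td.xpath('.//text()'):
        # Non-breaking spaces separate values, so split on them.
        for piece in fragment.split(u'\xa0'):
            piece = piece.strip()
            # Drop the leading "- " decoration; "-0.31" is left untouched.
            if piece.startswith(u'- '):
                piece = piece[2:]
            if piece:
                values.append(piece)
    # Represent an empty cell as a single empty string.
    return values or [u'']


def table_rows(html, table_id):
    """Return one flat list of values per body row of the given table."""
    root = lxml.html.fromstring(html)
    rows = root.xpath('//table[@id=$tid]/tbody/tr', tid=table_id)
    return [[v for td in row.xpath('./td') for v in cell_values(td)]
            for row in rows]


rows = table_rows(SAMPLE, "MainContent_Quote1_Table1_Table1")
for row in rows:
    print(row)
```

Each returned row is a plain list of strings, so it can be handed directly to csv.writer or a database insert.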
Extract a table from an HTML file using Python
I want to extract a table from an HTML file. I have written the following code snippet to extract the first table:
    import urllib2
    import os
    import time
    import traceback
    from bs4 import BeautifulSoup

    #find('table',{'class':'tbl_with_brdr'})
    outfile= open('D:/Dropbox/Python/apelec.txt','wb')
    rfile = open('D:/Dropbox/PRI/Data/AP/195778.html')
    rsoup = BeautifulSoup(rfile)
    nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr')
    for node in nodes[1:]:
        x = node.find('th').find('b').get_text().encode("utf-8")
        print x
        y = node.find('th').findNext('th').find('b').get_text().encode("utf-8")
        print y
        outfile.write(str(x)+"\t"+str(y)+"\n")
    outfile.close()
Here is the error:
          9 rfile = open('D:/Dropbox/PRI/Data/AP/195778.html')
         10 rsoup = BeautifulSoup(rfile)
    ---> 11 nodes = rsoup.find('div',{'class':'frmtext'}).find('table').find('tr')
         12 for node in nodes[1:]:
         13     x = node.find('th').find('b').get_text().encode("utf-8")
    AttributeError: 'NoneType' object has no attribute 'find'
And the HTML file is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <link rel="icon" type="image/ico" href="images/favicon.ico"/> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <link rel="stylesheet" href="themes/panchayat_default.css" type="text/css"/> <title>consolidated Election Report</title> </head> <body> <!-- To blur the background while processing dwr --> <div class="faded_div process"></div> <div class="popup_block_div process" style="display: none;"> <img alt="" src="images/loading_animation.gif" style="margin-left: auto; margin-right: auto;"> </div> <div id="maincontainer" class="resize"> <div id="headerwrap"> <!-- Header --> <html> <head> <script type='text/javascript' src="/profilerdwr/engine.js"> </script> <script type='text/javascript' src="/profilerdwr/util.js"> </script> <script type="text/javascript" src="/profilerdwr/interface/lgdDao.js"></script> <script type="text/javascript" 
src="js/common_util_js.js"></script> <link rel="stylesheet" href="css/common_css.css" type="text/css"></link> <meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /> </head> <body > <div class="clear"></div> <div id="headerwrap"> <div id="header"> <div id="new_header"> <div id="logoleft">Area Profiler</div> <div id="logoright"></div> <div class="clear"></div> </div> <div class="clear"></div> <div id="loginnav" align="right"> <table width="100%" class="tbl_no_brdr"> <tr> <td class="tblclear" align="left"> <div id="mainnav">Home </div> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="topnav"> <table width="100%" class="tbl_no_brdr"> <tr> <td width="85" class="tblclear">Choose Theme :</td> <td width="200" class="tblclear"> <form id="themeForm" name="themeForm" method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select name="theme" id="themeId" class="combofield" onchange="submitThemeForm()" style="width: 120px;"> <option value="default">Default Theme</option> <option value="mustard">Mustard Theme</option> <option value="peach">Peach Theme</option> <option value="green">Green Theme</option> <option value="blue">Blue Theme</option> </select> </form> </td> <td style="padding: 0px"> </td> <td class="tblclear"> </td> <td width="14" class="tblclear txticon"><img src="images/btnMinus.jpg" width="16" height="14" border="0" /></div></td> <td width="14" class="tblclear txticon"><img src="images/btnDefault.jpg" width="16" height="14" border="0" /> </td> <td width="28" class="tblclear txticon"><img src="images/btnPlus.jpg" width="16" height="14" border="0" /></td> <script type="text/javascript" > //documenttextsizer.setup("shared_css_class_of_toggler_controls") documenttextsizer.setup("texttoggler") </script> <td width="100" align="right" class="tblclear">Select Language :</td> <td width="108" align="right" class="tblclear"> <form id="languageForm" name="languageForm" 
method="get" action="welcome.do"> <input type="hidden" name='OWASP_CSRFTOKEN' value='CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU' /> <select id="languageId" name="language" class="combofield" style="width: 120px;" onchange="submitLanguageForm()" > <option value=""> Select Language </option> </select> </form> </td> </tr> </table> </div> <div id="breadcrumbnav"> </div> </div> <script type="text/javascript"> function submitThemeForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('themeForm').submit(); } else { return; } } function submitLanguageForm() { var isOK = confirm("This will Refresh Your Page. Any Unsaved data will be Lost. Do You still want to Continue?"); if(isOK) { document.getElementById('languageForm').submit(); } else { return; } } </script> </body> </html> </div> <div class="clear"></div> <div id="content"> <div id="leftpnl"> <table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td width="100%" valign="top" class="tblclear"> <!-- content -->. 
<script type="text/javascript" src="js/common_js.js"></script> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <script type="text/javascript"> var pathname; $(document).ready(function() {pathname = window.location.pathname;}); function onBack(s) { var position =pathname.indexOf("/", 2); var newPath = ""; var val = s.indexOf("?", 1); if(val>0) { newPath = s+"&redirect=true"; } else { newPath = s+"?redirect=true"; } window.location.replace(".."+pathname.substring(0,position)+"/"+newPath); } function downloadReport(repformat){ //window.location="downloadConsolidatedElectionReportPDF.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; //document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?repformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].action="downloadConsolidatedElectionReportPDF.do?reportformat="+repformat+"&OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU"; document.forms["electionReportForm"].method="POST"; document.getElementById('electionReportForm').target="_blank"; document.forms["electionReportForm"].submit(); } </script> <style type="text/css"> .data_link{ color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .disable_link { cursor:default; color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:VISITED { color:blue; display: block; text-decoration: none; font-size: 1em; font-weight: bolder; } .data_link:HOVER{ text-decoration: underline; } </style> </head> <body> <div id="frmcontent"> <div class="frmhd"> <table width="100%" class="tbl_no_brdr"> <tr> <td align="left" width="90%"> Consolidated Election</td> </tr> </table> </div> <div class="clear"></div> <div class="frmpnlbrdr"> <div 
class="frmpnlbg"> <div class="frmtxt"> <table width="100%" style="margin-bottom: 10px;" class="tbl_with_brdr"> <tr class="tblRowTitle tblclear" > <th align="left" ><b>State Name</b></th> <th align="left" ><b>Local Body Type</b></th> <th align="left" ><b>Election Term</b></th> <th align="left" ><b>Local Body Name</b></th> </tr> <tr class="tblRowB" style="color: blue;"> <th align="left" >ANDHRA PRADESH</th> <th align="left" >Village Panchayat</th> <th align="left" > 02-Aug-2013 To 01-Aug-2018 </th> <th align="left" >KODIHALLI</th> </tr> </table> <div class="frmhdtitle">Consolidated Election</div> <table width="100%" class="tbl_with_brdr"> <thead> <tr class="tblRowTitle tblclear"> <th align="center" width="5%" ><b>S.No.</b></th> <th align="left" width="9%"><b>Name</b></th> 0 <th align="left" width="9%"><b>Age</b></th> 1 <th align="left" width="9%"><b>Caste Category</b></th> 2 <th align="left" width="9%"><b>Gender</b></th> 3 <th align="left" width="9%"><b>Qualification</b></th> 4 <th align="left" width="9%"><b>Occupation</b></th> 5 <th align="left" width="9%"><b>Email Address</b></th> 6 <th align="left" width="9%"><b>Ward Name</b></th> 7 <th align="left" width="9%"><b>Reservation</b></th> 8 </tr> </thead> <tbody> <tr class="tblRowB"> <td align="center" >1</td> <td>Kambanna</td> <td>36</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>N/A</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >2</td> <td>Ramesh</td> <td>39</td> <td>OBC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 1</td> <td > Yes (OBC / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >3</td> <td>S.Manjunath</td> <td>29</td> <td>OBC</td> <td>Male</td> <td>Higher Secondary or Intermediate or Pre University or Senior Secondary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 2</td> <td > No (General / Others) 
</td> </tr> <tr class="tblRowA"> <td align="center" >4</td> <td>Obuleshu</td> <td>48</td> <td>OBC</td> <td>Male</td> <td>Below Primary</td> <td>Workers not reporting any occupations</td> <td> N/A </td> <td>Ward no 3</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >5</td> <td>Mamatha</td> <td>24</td> <td>OBC</td> <td>Female</td> <td>Matriculation or Junior School Certificate or Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 4</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >6</td> <td>Shivamma</td> <td>38</td> <td>OBC</td> <td>Female</td> <td>Below Primary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 5</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >7</td> <td>Hanumantappa</td> <td>46</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 6</td> <td > No (General / Others) </td> </tr> <tr class="tblRowA"> <td align="center" >8</td> <td>Malingappa</td> <td>45</td> <td>SC</td> <td>Male</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 7</td> <td > No (General / Others) </td> </tr> <tr class="tblRowB"> <td align="center" >9</td> <td>Kamalamma</td> <td>52</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 8</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >10</td> <td>Muddamma</td> <td>48</td> <td>OBC</td> <td>Female</td> <td>Illiterate</td> <td>N/A</td> <td> N/A </td> <td>Ward no 9</td> <td > Yes (General / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >11</td> <td>Patta Tayamma</td> <td>45</td> <td>SC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 10</td> <td > Yes (SC / Female) </td> </tr> <tr class="tblRowA"> <td align="center" >12</td> <td>Sujatha</td> <td>35</td> <td>OBC</td> <td>Female</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> 
<td>Ward no 11</td> <td > Yes (OBC / Female) </td> </tr> <tr class="tblRowB"> <td align="center" >13</td> <td>Kadurappa</td> <td>35</td> <td>SC</td> <td>Male</td> <td>Middle or Lower Secondary</td> <td>N/A</td> <td> N/A </td> <td>Ward no 12</td> <td > Yes (SC / Others) </td> </tr> </tbody> </table> <br /> <table width="100%" class="tbl_no_brdr"> <tr> <td align="center"> <input type="button" class="btn" onclick="onClose('welcome.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU')" value=Close /> <input type="button" class="btn" onclick="this.disabled=true; this.value='Please Wait .!';onBack('consolidatedElectionReport.do?OWASP_CSRFTOKEN=CN72-BGJW-G7FM-K1S3-P5FF-V1EN-IO4T-GHWU&electionTermId=35107&stateId=28')" value=Back /> </td> </tr> </table> <form id="electionReportForm" name="electionReportForm" action="#" method="post"> <div align="center"><br/> <input type="button" class="btn" onclick="downloadReport('pdf');" value="Export to PDF" size="5" /> <input type="button" class="btn" onclick="downloadReport('xls');" value="Export to Excel" size="5" /> </div> </form> </div> <div class="myclass" style="font-family: Times; text-align: center; font-size: 10.0pt; color: white; font-weight: bold; border: 1px solid gray"> Report generated through Area Profiler (http://areaprofiler.gov.in)Thu Oct 02 22:34:20 IST 2014 </div> </div> </div> </div> </body> </html> </td> </tr> </table> </div> </div> <div class="clear"></div> <div id="footer"> <!-- Footer --> <html> <head> </head> <body> <table width="100%" class="tbl_no_brdr"> <tr> <td colspan="3" class="fotbrdr"></td> </tr> <tr> <td width="161" class="btmlogospace"><a href="http://www.negp.gov.in/" target= "_blank" ><img src="images/e_governance_logo.jpg" width="161" height="38" /></a></td> <td width="93" class="btmlogospace"><a href="http://www.panchayat.gov.in/" target= "_blank" ><img src="images/panchayatilogo.jpg" width="93" height="38" /></a></td> <td align="right" class="btmlogospace">Site is designed, hosted and 
maintained by National Informatics Centre<br /> Contents on this website is owned,updated and managed by the Ministry of Panchayati Raj</td> </tr> </table> </body> </html> </div> </div> </body> </html>
Here is an approach. It is not a complete solution, but you can use it as a guide. You have to traverse the DOM tree and extract the values you want. The div in your HTML has the class frmtxt, not frmtext, so the original find() returned None, which caused the AttributeError; I changed the lookup accordingly. During the traversal you also have to check whether each lookup actually found something.
    from bs4 import BeautifulSoup

    outfile = open('out.txt', 'wb')
    rfile = open('195778.html')
    rsoup = BeautifulSoup(rfile)
    # The class in the HTML is 'frmtxt', not 'frmtext'.
    nodes1 = rsoup.find('div', {'class': 'frmtxt'})
    # find_all returns every row; find would return only the first one.
    nodes = nodes1.find('table').find_all('tr')
    for node in nodes:
        x = None
        y = None
        a = node.find('th')
        if a is not None:
            # Text of the <b> inside the first <th>, if there is one.
            x1 = a.find('b')
            if x1 is not None:
                x = x1.get_text().encode("utf-8")
                print x
            # Text of the <b> inside the following <th>, if there is one.
            y2 = a.findNext('th')
            if y2 is not None:
                y3 = y2.find('b')
                if y3 is not None:
                    y = y3.get_text().encode("utf-8")
                    print y
        outfile.write(str(x) + "\t" + str(y) + "\n")
    outfile.close()
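The traversal described above can also be written more generally, collecting every row of every table inside the div rather than only the bold header cells. This is a minimal Python 3 sketch, not the answerer's code: SAMPLE is a hypothetical, shortened version of the question's markup, and extract_tables and to_tsv are illustrative helper names. It uses the built-in "html.parser" backend and assumes the tables are not nested.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample: a shortened version of the question's two tables.
SAMPLE = """<html><body>
<div class="frmtxt">
  <table class="tbl_with_brdr">
    <tr class="tblRowTitle tblclear">
      <th align="left"><b>State Name</b></th>
      <th align="left"><b>Local Body Type</b></th>
    </tr>
    <tr class="tblRowB">
      <th align="left">ANDHRA PRADESH</th>
      <th align="left">Village Panchayat</th>
    </tr>
  </table>
  <table class="tbl_with_brdr">
    <thead>
      <tr><th><b>S.No.</b></th><th><b>Name</b></th><th><b>Age</b></th></tr>
    </thead>
    <tbody>
      <tr class="tblRowB"><td align="center">1</td><td>Kambanna</td><td>36</td></tr>
      <tr class="tblRowA"><td align="center">2</td><td>Ramesh</td><td>39</td></tr>
    </tbody>
  </table>
</div>
</body></html>"""


def extract_tables(html, div_class):
    """Return a list of tables; each table is a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=div_class)
    if container is None:
        # A wrong class name yields an empty result instead of an AttributeError.
        return []
    tables = []
    for table in container.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            # Header (<th>) and data (<td>) cells are treated alike.
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables


def to_tsv(rows):
    """Serialise rows as tab-separated text, one row per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


tables = extract_tables(SAMPLE, "frmtxt")
print(to_tsv(tables[1]))
```

Returning an empty list when the div is missing keeps a typo in the class name from surfacing as an AttributeError further down the chain of lookups.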