Not able to scrape checkbox value from each table tr

Not able to scrape checkbox value from each table tr - python

Please see the below html table
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<input checked type=checkbox name=jobs[] value='610974'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'></div>
</div>
</td>
</td>
<tr >
<td style='border-bottom:1px solid silver;background:#ffffff;'>
<div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div>
<input checked type=checkbox name=jobs[] value='610975'>
<table border=0 cellpadding=2 cellspacing=0 style='border:4px #70797a; border-radius: 5px;'>
<tr>
<td style='background:lightgreen;' valign=top>
<img src='../images/checkwhite.png' style='width:30px;'>
</td>
<td style='background:lightgreen;'> 123 Charter Rd Wethersfield CT 06109 </td>
<tr>
<td>Your Input</td>
<td>123 CHARTER RD WETHERSFIELD CT 06109</td>
</tr>
</table>
<br clear=all>
<div style='margin-left:40px;'>09/11/2018
<br>Exterior BPO - Light Photo Set (3 photos*)
<br>$9.00 We found a rep 6.2 miles from job.
<span style='color:silver'> 640x480 Add Datestamp, </span>
<br clear=all>
<div style=float:left;'>
I need the output as :
id="610974" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [Ist checkbox value is id and corresponding address]
id="610975" and Address="123 CHARTER RD WETHERSFIELD CT 06109" [Ist checkbox value is id and corresponding address]
etc....
soup = BeautifulSoup(bodystrip, "lxml")
for tr in response.find_all('tr'):
tds = tr.find_all('td')
print(tds[0].text)
jobid = tds[0].find('input')
print(jobid)
this is getting error on address are properly getting

With Scrapy:
for input_node in response.xpath('//input[#name="jobs[]"]'):
id = input_node.xpath(./#value).extract_first()
address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()

With beautifulsoup this should work:
for job in soup.find_all('input',attrs={"type":"checkbox"}):
print(job['value'])
print(job.parent.find_all('td',attrs={'style':True})[1].text)

Related

Returning None when scraping href using Python

Hi I'm trying to scrape 151 Heavy Duty Rubber Gloves - Ex Large from table with following inspect script. Can someone please help with the right Python script?
[<table border="0" class="ProductBox" id="Added0">
<tr>
<td align="center" colspan="2">
<div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div>
</td></tr><tr>
<td align="center" colspan="2" height="60px;" valign="top">
<div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div>
</td>
</tr>
<tr>
<td align="center" colspan="2" height="185">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
<img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a>
</td>
</tr>
<tr>
<td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td>
<td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
POR 0%
</td>
<td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
VAT 20%
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
**151 Heavy Duty Rubber Gloves - Ex Large**</a></td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
1s x 1
</td>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;">
<div class="tooltip">
<div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;">
<span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div>
</div>
<span class="OKStatus">In Stock </span>
</td>
</tr>
<tr>
<td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
<table style="margin-top : 10px;" width="100%">
<tr>
<td>
<img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/>
</td>
<td>
<input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/>
</td>
<td>
<img align="middle" alt="Add 1 To Qty" src="/images/add.png"/>
</td>
<td align="right">
<button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button>
</td>
</tr>
</table>
I tied to use the following
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
links = [link.get('href') for link in i.find_all('a')]
print(links)
which unfortunately returns: ['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']

Can use the td.ProductDetails a selector (an a tag inside td with the class ProductDetails) to target the text you are interested in, then call .strip() a few times to remove extra characters:
DATA = """<table border="0" class="ProductBox" id="Added0">
<tr>
...
</table>"""
from bs4 import BeautifulSoup
from typing import Optional
def extract_name(data: str) -> Optional[str]:
soup = BeautifulSoup(data, "html.parser")
links = soup.select("td.ProdDetails a")
if len(links) >= 1:
return links[0].text.strip().strip("*").strip()
else:
return None
print(extract_name(DATA))
# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
text = extract_name(tables[0])
Output: 151 Heavy Duty Rubber Gloves - Ex Large

I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project

I have this HTML code for example:
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
<span style="font-size: 8pt; color: #78578e"> MCO Date : 10-Aug-2018</span>
<br />
</td>
<td>
<h3 style="font-size: 14pt;"> </h3>
<h2> <br /> </h2>
</td>
</tr>
<tr>
<td style="margin-top: 0;" class="border-b">
<br />
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
<td colspan="1" style="margin-top: 0;" class="border-b">
</td>
</tr>
<tr style="height: 7%;" class="border-tb">
<td style="width: 130px" class="border-r">
<h4>METER NO</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PREVIOUS READING</h4>
</td>
<td style="width: 90px" class="border-r">
<h4>PRESENT READING</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>MF</h4>
</td>
<td style="width: 60px" class="border-r">
<h4>UNITS</h4>
</td>
<td>
<h4>STATUS</h4>
</td>
</tr>
<tr style="height: 30px" class="content">
<td class="border-r">
3-P I 3301539<br> I 3301539<br> E 3301539<br> E 3301539<br>
</td>
<td class="border-r">
78693<br>16823<br>19740<br>8<br>
</td>
<td class="border-r">
80086<br>17210<br>20139<br>8<br>
</td>
<td class="border-r">
1<br>1<br>1<br>1<br>
</td>
<td class="border-r">
1393<br>387<br>399<br>0<br>
</td>
<td>
</td>
</tr>
<tr id="roshniMsg" style="height: 30px" class="content">
<td colspan="6">
<div style="width: 452pt">
<img style="max-width: 100%; max-height: 35%" src="/images/companies/iesco/roshniMsg.jpg"
alt="Roshni Message" />
</div>
</td>
</tr>
</table>
From this table I want to extract the paragraph and from there I want to get all the span tags in that paragraph.
I used soup.find_all() to get the table but I don't know how to use this function iteratively to pass it back to the original soup object so that I could find the paragraph and, moreover the span tags in that paragraph.
This is the code Python code I wrote:
soup = BeautifulSoup(string, 'html.parser')
#Getting the table tag
results = soup.find_all('table', attrs={'class':'nested4'})
#Getting the paragragh tag
results = soup.find_all('p', attrs={'style':'margin: 0; text-align: left; padding-left: 5px'})
#Getting all the span tags
results = soup.find_all('span', attrs={})
I just want help on how to get the paragraphs within the table. And then how to get the spans within the paragraph as I am getting the spans in all of the original HTML code. I don't know how to pass the bs4 object list back to the soup object to use soup.find_all iteratively.

from bs4 import BeautifulSoup
html = '''
<table class="nested4">
<tr>
<td colspan="1"></td>
<td colspan="2">
<h2 class="zeroMargin" id="govtMsg" visible="false"></h2>
</td>
<td colspan="2">
<h2 class="zeroMargin "> Net Metering Conn. </h2>
</td>
<td colspan="2">
<h2 class="zeroMargin" hidden> Life Line Consumer</h2>
</td>
</tr>
<tr>
<td colspan="2">
<p style="margin: 0; text-align: left; padding-left: 5px">
<span>NAME & ADDRESS</span>
<br />
<span>MUHAMMAD AMIN </span>
<br />
<span>S/O MUHAMMAD KHAN </span>
<br />
<span>H-NO.38 MARGALLA ROAD </span>
<br />
<span>F-6/3 ISLAMABAD3 </span>
<br />
<span></span>
</p>
</td>
<td colspan="3" style="text-align: left">
<h2 class="color-red">Say No To Corruption</h2>
'''
soup = BeautifulSoup(html, 'html.parser')
spans = soup.select_one('table.nested4').select('span')
for span in spans:
print(span.text)
This returns:
NAME & ADDRESS
MUHAMMAD AMIN
S/O MUHAMMAD KHAN
H-NO.38 MARGALLA ROAD
F-6/3 ISLAMABAD3

if you have one table:
soup = BeautifulSoup(string, 'html.parser')
table = soup.find('table', attrs={'class': 'nested4'})
p = table.find('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'})
results = p.find_all('span')
for result in results:
print(result.get_text(strip=True))
if you have list of tables:
soup = BeautifulSoup(string, 'html.parser')
for table in soup.find_all('table', attrs={'class': 'nested4'}):
for p in table.find_all('p', attrs={'style': 'margin: 0; text-align: left; padding-left: 5px'}):
for span in p.find_all('span'):
print(span.get_text(strip=True))

Can this beautifulsoup script be simplified with Regex?

I wrote some beautifulsoup scripts, and one part seems really redundant, I am thinking if it can be simplified with Regex.
All posts from this forum are marked with different colors, what I did is to search each color with one line. For six colors I did six lines with only one words difference.
red = soup.find_all('a', style="font-weight: bold;color: red")
blue = soup.find_all('a', style="font-weight: bold;color: blue")
green = soup.find_all('a', style="font-weight: bold;color: green")
purple = soup.find_all('a', style="font-weight: bold;color: purple")
orange = soup.find_all('a', style="font-weight: bold;color: orange")
lime = soup.find_all('a', style="color: green")
I am not sure if it is possible to be simplified. Maybe something like:
re.compile("(color: red|blue|green|purple|orange)", re.(whatever the letter is))
if it's not regex, or could it be something else?
This is partial DOM:
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[美臀]</em> <span id="thread_10431427">(本中)(HND-???) 二宮ひかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>6 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>2</strong> / <em>12234</em></td>
<td class="nums">5.02G / MP4
</td>
<td class="lastpost">
<em>2019-4-23 20:22</em>
<cite>by zj376104288</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431424">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431424">(WAAP)(WPVR-???)葵百合香</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>5 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>7265</em></td>
<td class="nums">3.85G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431423">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431423">(KMP)(SAVR-???)舞島あかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>4 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>6226</em></td>
<td class="nums">23.39G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431422">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>

You can pass a attribute list to css select with ends with operator
[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']
So,
items = [item for item in soup.select("[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']")

How to scrape span ids' texts in beautifulsoup in the following html?

<div align="justify" style="text-align: center">
<div>
<table cellspacing="0" rules="all" border="1" id="ContentPlaceHolder1_grd_reminder" style="width:555px;border-collapse:collapse;">
<tr>
<th class="grdheading2" scope="col">Book</th>
<th class="grdheading2" scope="col">Issue Date</th>
<th class="grdheading2" scope="col">Submition Date</th>
</tr>
<tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_0">Engineering Mechanics</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_0">17-Oct-2016</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_0">31-Oct-2016</span>
</td>
</tr>
<tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_1">ATB of Engineering Mathematics</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_1">17-Oct-2016</span>
</td>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_1">31-Oct-2016</span>
</td>
</tr>
</table>
</div>
</div>
I want to extract the text Engineering mechanics and it's corresponding date (text) 31-Oct-2016 and the textATB of Engineering Mathematics and it's corresponding date (text) 31-Oct-2016. All of these all located in the span ids. How can I extract and print them? I'm new to web scraping.

First you can use find_all() to find all tr tags, and using loop you can use find_all() to find all span tags in every tr. This way you can control scraped data
html = '''<div align="justify" style="text-align: center">
<div>
<table cellspacing="0" rules="all" border="1" id="ContentPlaceHolder1_grd_reminder" style="width:555px;border-collapse:collapse;">
<tr>
<th class="grdheading2" scope="col">Book</th><th class="grdheading2" scope="col">Issue Date</th><th class="grdheading2" scope="col">Submition Date</th>
</tr><tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_0">Engineering Mechanics</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_0">17-Oct-2016</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_0">31-Oct-2016</span>
</td>
</tr><tr>
<td>
<span id="ContentPlaceHolder1_grd_reminder_Label1_1">ATB of Engineering Mathematics</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label2_1">17-Oct-2016</span>
</td><td>
<span id="ContentPlaceHolder1_grd_reminder_Label3_1">31-Oct-2016</span>
</td>
</tr>
</table>
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
trs = soup.find_all('tr')
for tr in trs:
spans = tr.find_all('span')
if spans:
print 'title:', spans[0].text
print 'date:', spans[2].text
Result
title: Engineering Mechanics
date: 31-Oct-2016
title: ATB of Engineering Mathematics
date: 31-Oct-2016

Accesing <td> elements in a table that do not have ID or class when using python and BeautifulSoup on html/css pages

I am scraping a page using Selenium, Python and Beautiful Soup, and I want to output the rows of a table as comma delimited values. Unfortunately the HTML of the page is all over the place. So far I have managed to extract two columns by using the IDs of their elements. The rest of the values are just contained in without an identifier such as class or id. Here is a sample of the results.
<table id="tblResults" style="z-index: 102; left: 18px; width: 956px;
height: 547px" cellspacing="1" width="956" border="0">
<tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
<td> </td>
<td> </td>
<td>Select</td>
<td>T</td>
<td>Party</td>
<td>Opposite Party</td>
<td style="width:50px;">Type</td>
<td style="width:100px;">Book-Page</td>
<td style="width:70px;">Date</td>
<td>Town</td>
</tr>
<tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
">MOSES ALBERT G</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
"></span>
</td>
<td valign="top">MAP</td>
<td valign="top">- </td>
<td valign="top">01/16/1953</td>
<td valign="top">TOWN OF BINGHAMTON</td>
</tr>
<tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">MOSES ALEXANDRA/GDN</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">GOODRICH MERLE L</span>
</td>
</table>
This is the script that i have written so far that works for two columns:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = open('searched.html')
bsObj = BeautifulSoup(html)
myTable = bsObj.findAll("tr",{ "style":re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")} )
for table_ in myTable:
party = table_.find("span", {"id": re.compile("Party1_*")})
oppositeParty= table_.find("span", {"id": re.compile("Party2_*")})
print(party.get_text()+ "," + oppositeParty.get_text())
I have tried doing using children of myTable as follows:
myTable.children

If all you want is to just dump out the content, something like this should do:
myTable = bsObj.find_element_by_tag_name("table")
for table_ in myTable:
rows = table_.find_elements_by_tag_name("tr")
for row_ in rows:
columns = row_.find_elements_by_tag_name("td")
for column_ in columns:
# print out comma delimited text of columns...
# print the end of your row
If you're really wanting to scrape specific information, you'll need to provide us with more instructions about what your ultimate goal is.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Not able to scrape checkbox value from each table tr - python

With Scrapy: for input_node in response.xpath('//input[#name="jobs[]"]'): id = input_node.xpath(./#value).extract_first() address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()

With beautifulsoup this should work: for job in soup.find_all('input',attrs={"type":"checkbox"}): print(job['value']) print(job.parent.find_all('td',attrs={'style':True})[1].text)

Related

Returning None when scraping href using Python

I need to pass the result of soup.find_all to another soup.find_all function to filter the HTML code for a project

Can this beautifulsoup script be simplified with Regex?

How to scrape span ids' texts in beautifulsoup in the following html?

Accesing <td> elements in a table that do not have ID or class when using python and BeautifulSoup on html/css pages

Categories

Resources