Xpath Python Extract Data From Table Between Two Headings - python

I'm trying to extract data from a table that lies between two headings in an HTML file using Python. In this case, the id to look up is in a span inside a heading (I need id="Perlis"; the table lies between the Perlis and Kedah headings):
<h2>
<span class="mw-headline" id="Perlis">Perlis</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;">
<tbody>
<tr>
<th width="30"># </th>
<th width="150">Constituency s </th>
<th width="150">Winner </th>
<th width="80">Votes </th>
<th width="80">Majority </th>
<th width="150">Opponent(s) </th>
<th width="80">Votes </th>
<th width="150">Incumbent </th>
<th width="80">
<b>Incumbent Majority</b>
</th>
</tr>
<tr>
<td colspan="13">
BN
<b>2</b> | GS
<b>0</b> | PH
<b>1</b> | Independent
<b>0</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P1 </td>
<td rowspan="2">
Padang Besar
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>15,032</b>
</td>
<td rowspan="2">
<b>1,438</b>
</td>
<td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td>
<td>
<b>13,594</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>7,426</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>7,874</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P2 </td>
<td rowspan="2">
Kangar
</td>
<td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td>
<td rowspan="2">
<b>20,909</b>
</td>
<td rowspan="2">
<b>5,603</b>
</td>
<td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td>
<td>
<b>15,306</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Shaharuddin Ismail
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>4,037</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>8,465</b>
</td>
</tr>
</tbody>
</table>
<h2>
<span class="mw-headline" id="Kedah">Kedah</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>
This is the resulting JSON that I am trying to construct:
[
{
"state": "Perlis",
"constituencies": [
{
"id": "P1",
"name": "Padang Besar"
},
{
"id": "P2",
"name": "Kangar"
}
]
}
]
I'd like to know how to reference the specific table so I can extract the data into JSON. I have used Scrapy before, but I'm not sure how to do it in this case. This is what I had in mind:
import scrapy

class PostSpider(scrapy.Spider):
    name = 'manual_spider'
    start_urls = [
        '%URL%'
    ]

    def parse(self, response):
        doc = response.xpath('//comment()').getall()  # This is the bit I need
        # code continues here
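One way to reference the table is an XPath that starts at the h2 containing the span with the wanted id and takes the first table among its following siblings. The sketch below uses lxml on a trimmed copy of the markup from the question; the function name constituencies_for is made up for illustration, and it assumes (as in the sample) that the heading and the table are siblings and that only the first row of each constituency carries align="center". The same XPath string should also be usable with response.xpath(...) inside the spider.

```python
import json
from lxml import html

# Trimmed copy of the markup from the question (assumption: real page has the same shape).
SAMPLE = """
<h2><span class="mw-headline" id="Perlis">Perlis</span></h2>
<table class="wikitable">
<tbody>
<tr><th>#</th><th>Constituency</th></tr>
<tr><td colspan="13">BN <b>2</b></td></tr>
<tr align="center"><td rowspan="2">P1 </td><td rowspan="2">
Padang Besar
</td><td>Izizam Ibrahim</td></tr>
<tr><td>Mokhtar Senik</td><td><b>7,874</b></td></tr>
<tr align="center"><td rowspan="2">P2 </td><td rowspan="2">
Kangar
</td><td>Ramli Shariff</td></tr>
</tbody>
</table>
<h2><span class="mw-headline" id="Kedah">Kedah</span></h2>
<table class="wikitable"></table>
"""


def constituencies_for(doc, state_id):
    """Return id/name pairs from the first table after the heading with state_id."""
    # First table that follows the <h2> whose child span has the given id.
    tables = doc.xpath(
        '//h2[span[@id=$sid]]/following-sibling::table[1]', sid=state_id
    )
    if not tables:
        return []
    result = []
    # In the sample, only the first row of each constituency has align="center".
    for row in tables[0].xpath('.//tr[@align="center"]'):
        cells = row.xpath('./td')
        result.append({
            "id": cells[0].text_content().strip(),
            "name": cells[1].text_content().strip(),
        })
    return result


doc = html.fromstring(SAMPLE)
data = [{"state": "Perlis", "constituencies": constituencies_for(doc, "Perlis")}]
print(json.dumps(data, indent=2))
```

Taking following-sibling::table[1] keeps the selection tied to the heading rather than to the table's position on the page, so it should survive tables being added or removed elsewhere.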

Related

Returning None when scraping href using Python

Hi, I'm trying to scrape the product name "151 Heavy Duty Rubber Gloves - Ex Large" from a table with the following HTML (copied from the inspector). Can someone please help with the right Python script?
[<table border="0" class="ProductBox" id="Added0">
<tr>
<td align="center" colspan="2">
<div style="width:100%;float:left;display:inline;float:left;height:37px;"><div style="float:left;font-size:16px;font-family: 'Roboto Condensed', sans-serif;color:white;margin-top:4%;margin-left:6%;"> </div></div>
</td></tr><tr>
<td align="center" colspan="2" height="60px;" valign="top">
<div class="PromoPriceText"> <br/><br/></div><div class="StdPrice">£0.69</div><div class="UnitCost">(£0.69/Unit)</div>
</td>
</tr>
<tr>
<td align="center" colspan="2" height="185">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
<img alt="" class="effectfront" id="prod" src="/~uldir/104373t.jpg" style="height:165px !important;"/></a>
</td>
</tr>
<tr>
<td class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;"><input name="product_code" type="hidden" value="104373"/>104373</td>
<td align="right" class="ProdDetails" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
POR 0%
</td>
<td align="right" class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
VAT 20%
</td>
</tr>
<tr>
<td class="ProdDetails" colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;height:50px;">
<a href="/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211" style=" line-height: 20px; padding-left: 0px;">
**151 Heavy Duty Rubber Gloves - Ex Large**</a></td>
</tr>
<tr>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
1s x 1
</td>
<td class="ProdDetails" colspan="1" style="padding-left:10px;padding-right:10px;margin-bottom:5px;float:right;width:98%;text-align:right;">
<div class="tooltip">
<div class="IconWishNS" id="IconWishNS104373" onclick="AddToWish('104373','A')" style="display:inline-block;">
<span class="tooltiptext tooltip-bottom" style="font-size:12px;">Add to Wish List</span></div>
</div>
<span class="OKStatus">In Stock </span>
</td>
</tr>
<tr>
<td colspan="2" style="padding-left:10px;padding-right:10px;margin-bottom:5px;">
<table style="margin-top : 10px;" width="100%">
<tr>
<td>
<img align="middle" alt="Take 1 Off Qty" src="/images/minus.png"/>
</td>
<td>
<input class="iQtyBox" id="104373_qty" maxlength="4" name="104373_qty" oninput="this.value=(parseInt(this.value)||'')" tabindex="1" type="text" value="1"/>
</td>
<td>
<img align="middle" alt="Add 1 To Qty" src="/images/add.png"/>
</td>
<td align="right">
<button class="subBlackButtonDiv subButtonDiv" style="width:70px;margin:0px;" type="button" value="add">Add</button>
</td>
</tr>
</table>
I tried to use the following:
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find_all('table')
for i in table:
    links = [link.get('href') for link in i.find_all('a')]
    print(links)
which unfortunately returns: ['/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '/products/DetailsPortal.asp?product_code=104373&Page=Products&BreadPath=/products/gridlist.asp?DeptCode=14*prodgroup=211', '#', '#', '#']
You can use the td.ProdDetails a selector (an a tag inside a td with the class ProdDetails) to target the text you are interested in, then call .strip() a few times to remove the extra characters:
DATA = """<table border="0" class="ProductBox" id="Added0">
<tr>
...
</table>"""

from bs4 import BeautifulSoup
from typing import Optional

def extract_name(data: str) -> Optional[str]:
    soup = BeautifulSoup(data, "html.parser")
    links = soup.select("td.ProdDetails a")
    if len(links) >= 1:
        # Drop surrounding whitespace and the literal ** markers around the name.
        return links[0].text.strip().strip("*").strip()
    else:
        return None

print(extract_name(DATA))

# like above
r = s.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.find_all('table')
text = extract_name(str(tables[0]))  # extract_name expects an HTML string, not a Tag
Output: 151 Heavy Duty Rubber Gloves - Ex Large

Concatenate and remove td cells in beautifulsoup python

I have a table like this (old html):
<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>
and I want it to look like:
U.S. federal statutory income tax rate 35.0% 35.0%
Federal income tax at statutory rate $(2,813) $5,834
State and local income taxes, net of federal income tax effect (733) 812
Provision (benefit) for income taxes $(3,546) $6,646
Effective income tax rate 44.1% 39.9%
I have two problems getting from the HTML table to the desired output shown above:
1. there are empty cells like <td> </td>
2. some values are distributed over several cells
I want to get rid of the empty cells by decomposing them, and to concatenate cells such as (2,813 and ) or 44.1 and %.
I tried the following code for decomposing, but it does not work, and I have no clue how to concatenate cells in BeautifulSoup:
s= """<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>"""
soup = bs(s, "lxml")
table = soup.find('table')
for row in table.find_all('tr'):
    for cell in row.find_all('td'):
        if cell.text=='':
            cell.decompose()
df = pd.read_html(str(soup))
print(df)
Provided you can isolate the right table, just loop over the tr elements that have a valign attribute and concatenate the text of the td cells that are not just a single space:
from bs4 import BeautifulSoup as bs
html = '''<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('table tr[valign]'):
    print(' '.join([td.text for td in tr.select('td') if td.text != ' ']))
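Joining with a space still leaves fragments such as "$", "(2,813" and ")" separated. A minimal sketch of one way to glue them together is shown below; it works on a plain list of cell texts, which could be built with something like [td.get_text() for td in tr.find_all('td')]. The function name merge_cells and the rule it applies ("$" attaches to the next value, ")" and "%" attach to the previous one) are assumptions based only on the sample rows in the question.

```python
def merge_cells(cells):
    """Drop blank cells and glue split values such as '$', '(2,813', ')' together."""
    merged = []
    prefix = ""
    for text in (cell.strip() for cell in cells):
        if not text:
            # Empty or whitespace-only cell (str.strip also removes non-breaking spaces).
            continue
        if text == "$":
            # Remember the currency sign and attach it to the next value.
            prefix = text
        elif text in (")", "%") and merged:
            # Closing parenthesis or percent sign belongs to the previous value.
            merged[-1] += text
        else:
            merged.append(prefix + text)
            prefix = ""
    return merged


row = ["Federal income tax at statutory rate", " ", "$", "(2,813", ")", " ", "$", "5,834", " "]
print(" ".join(merge_cells(row)))
```

Keeping the merge step separate from the parsing makes it easy to check against each row of the table before wiring it into the BeautifulSoup loop.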

Can this beautifulsoup script be simplified with Regex?

I wrote some BeautifulSoup scripts, and one part seems really redundant. I am wondering whether it can be simplified with regex.
All posts on this forum are marked with different colors, and I search for each color with its own line. For six colors I wrote six lines that differ by only one word.
red = soup.find_all('a', style="font-weight: bold;color: red")
blue = soup.find_all('a', style="font-weight: bold;color: blue")
green = soup.find_all('a', style="font-weight: bold;color: green")
purple = soup.find_all('a', style="font-weight: bold;color: purple")
orange = soup.find_all('a', style="font-weight: bold;color: orange")
lime = soup.find_all('a', style="color: green")
I am not sure whether it can be simplified. Maybe something like:
re.compile("(color: red|blue|green|purple|orange)", re.(whatever the letter is))
If regex is not the right tool, could it be done some other way?
This is partial DOM:
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[美臀]</em> <span id="thread_10431427">(本中)(HND-???) 二宮ひかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>6 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>2</strong> / <em>12234</em></td>
<td class="nums">5.02G / MP4
</td>
<td class="lastpost">
<em>2019-4-23 20:22</em>
<cite>by zj376104288</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431424">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431424">(WAAP)(WPVR-???)葵百合香</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>5 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>7265</em></td>
<td class="nums">3.85G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431423">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[VR]</em> <span id="thread_10431423">(KMP)(SAVR-???)舞島あかり</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
第一會所新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>4 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>6226</em></td>
<td class="nums">23.39G / MP4
</td>
<td class="lastpost">
<em>2019-4-22 20:57</em>
<cite>by 第一會所新片</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431422">
<tr>
<td class="folder"><img src="images/green001/folder_common.gif"/></td>
<td class="icon">
  </td>
You can pass a comma-separated list of attribute selectors to CSS select, using the ends-with operator ($=):
[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']
So,
items = [item for item in soup.select("[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']")]
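If a regex is preferred, the alternation has to sit inside a group after "color:", otherwise "color: red|blue" means "color: red" or just "blue". The sketch below shows such a pattern on plain style strings; the names COLOR_STYLE and color_of are made up for illustration. BeautifulSoup accepts a compiled pattern as an attribute filter and applies it with search(), so soup.find_all('a', style=COLOR_STYLE) should return all colored links in one call.

```python
import re

# Matches a style value that ends in one of the listed colors, with or
# without a leading "font-weight: bold;" part.
COLOR_STYLE = re.compile(r"color:\s*(red|blue|green|purple|orange)\s*;?\s*$")


def color_of(style):
    """Return the color named at the end of a style string, or None."""
    match = COLOR_STYLE.search(style or "")
    return match.group(1) if match else None


styles = [
    "font-weight: bold;color: red",
    "font-weight: bold;color: purple",
    "color: green",
    "font-weight: bold",
]
for style in styles:
    print(style, "->", color_of(style))
```

Grouping the results by color_of(tag['style']) afterwards would give back the separate red, blue, green lists without six nearly identical find_all calls.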

Web Scraping tables from an HTML file

Hello all, I am hoping to get some help with taking the tables in my HTML file and importing them into a CSV file. I am very new to web scraping, so forgive me if I am completely wrong with my code. The HTML file holds three separate tables that I am trying to extract: estimate, sampling error, and number of non-zero plots in estimate.
My code is shown below:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
soup.find('table').find_all('td', {'align': 'right'})
#Remove the tags and extract table info into CSV
???
Here is the HTML for the first table "Estimate":
` Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>`
I made four tables almost identical to yours and put them into a fairly respectable page of HTML. Then I ran this code.
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)
The results were four files like this, named table0.csv through table3.csv.
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
Perhaps the main thing I should mention is that I skipped the same number of rows in each table that BeautifulSoup delivered. If the number of header lines in the tables varies then you will have to do something more clever or just discard lines in the output files and omit the skiprows parameter.
Unsure as to what the exact question is here but right off the bat I can see an error that will throw you off a bit.
new_table = pd.DataFrame(columns=range(0-4))
Needs to be
new_table = pd.DataFrame(columns=range(0,4))
The result of range(0-4) is actually range(-4) which evaluates to range(0,-4) whereas you want range(0,4). You can just pass range(4) as the parameter or range(0,4).
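A quick check of the difference, as a small sketch (the variable names are only for illustration):

```python
# range(0-4) evaluates the subtraction first, so it is the same as range(-4),
# which counts up from 0 and stops immediately: it yields nothing.
wrong = list(range(0 - 4))

# range(0, 4) and range(4) both yield the four column labels 0..3.
right = list(range(0, 4))
short = list(range(4))

print(wrong)
print(right)
print(short)
```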

Accessing <td> elements in a table that do not have ID or class when using python and BeautifulSoup on html/css pages

I am scraping a page using Selenium, Python and Beautiful Soup, and I want to output the rows of a table as comma-delimited values. Unfortunately the HTML of the page is all over the place. So far I have managed to extract two columns by using the IDs of their elements. The rest of the values are just contained in <td> elements without an identifier such as class or id. Here is a sample of the results.
<table id="tblResults" style="z-index: 102; left: 18px; width: 956px;
height: 547px" cellspacing="1" width="956" border="0">
<tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
<td> </td>
<td> </td>
<td>Select</td>
<td>T</td>
<td>Party</td>
<td>Opposite Party</td>
<td style="width:50px;">Type</td>
<td style="width:100px;">Book-Page</td>
<td style="width:70px;">Date</td>
<td>Town</td>
</tr>
<tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
">MOSES ALBERT G</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
"></span>
</td>
<td valign="top">MAP</td>
<td valign="top">- </td>
<td valign="top">01/16/1953</td>
<td valign="top">TOWN OF BINGHAMTON</td>
</tr>
<tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">MOSES ALEXANDRA/GDN</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">GOODRICH MERLE L</span>
</td>
</table>
This is the script that I have written so far, which works for two columns:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = open('searched.html')
bsObj = BeautifulSoup(html, 'html.parser')
myTable = bsObj.findAll("tr",{ "style":re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")} )
for table_ in myTable:
    party = table_.find("span", {"id": re.compile("Party1_*")})
    oppositeParty= table_.find("span", {"id": re.compile("Party2_*")})
    print(party.get_text()+ "," + oppositeParty.get_text())
I have also tried using the children of myTable as follows:
myTable.children
If all you want is to just dump out the content, something like this should do:
myTable = bsObj.find("table", id="tblResults")
for row_ in myTable.find_all("tr"):
    columns = row_.find_all("td")
    # print the comma-delimited text of the columns, one line per row
    print(",".join(column_.get_text(strip=True) for column_ in columns))
If you really want to scrape specific information, you'll need to give us more detail about what your ultimate goal is.
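Once each row is available as a list of cell texts, the standard csv module can take care of quoting values that themselves contain commas. The sketch below assumes the rows have already been extracted (the sample values are taken from the table in the question); the helper name rows_to_csv is made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Collapse internal whitespace and newlines left over from the HTML.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return buffer.getvalue()


rows = [
    ["1", "MOSES ALBERT G", "", "MAP", "- ", "01/16/1953", "TOWN OF BINGHAMTON"],
    ["1", "MOSES ALEXANDRA/GDN", "GOODRICH MERLE L"],
]
print(rows_to_csv(rows), end="")
```

Writing to a file instead only needs the StringIO buffer swapped for a file opened with newline="".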
