Web Scraping tables from an HTML file

Hello all, I am hoping to get some help with taking the tables in my HTML file and importing them into a CSV file. I am very new to web scraping, so forgive me if my code is completely wrong. The HTML file holds three separate tables that I am trying to extract: estimate, sampling error, and number of non-zero plots in estimate.
My code is shown below:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
soup.find('table').find_all('td'),{'align':'right'}
#Remove the tags and extract table info into CSV
???
Here is the HTML for the first table "Estimate":
` Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>`

I made four tables almost identical to yours and put them into a fairly respectable page of HTML. Then I ran this code.
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)
The results were four files like this, named table0.csv through table3.csv.
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
One caveat: I skipped the same number of rows (skiprows=2) in every table that BeautifulSoup delivered. If the number of header rows varies between tables, you will have to do something more clever, or omit the skiprows parameter and discard the extra lines from the output files afterwards.
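If the header rows do vary, one option is to not count rows at all and keep only the rows that contain a right-aligned td, since in the HTML from the question those are the numeric cells. This is a minimal sketch, not what I ran above: SAMPLE_HTML is a cut-down copy of the Estimate table, and data_rows is a helper name made up for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Cut-down stand-in for the question's file (assumption: same cell attributes).
SAMPLE_HTML = """
<table>
<caption><b>Estimate:</b></caption>
<tr><td></td><td align="center" colspan="5"><b>Ownership group</b></td></tr>
<tr><th><b>Forest type group</b></th><td><b>Total</b></td><td><b>National Forest</b></td></tr>
<tr><td nowrap=""><b>Total</b></td><td align="right">4,875,993</td><td align="right">195,438</td></tr>
<tr><td nowrap=""><b>Exotic softwoods group</b></td><td align="right">5,868</td><td align="right">-</td></tr>
</table>
"""


def data_rows(table):
    """Return only the rows that contain right-aligned (numeric) cells."""
    rows = []
    for tr in table.find_all("tr"):
        if tr.find("td", attrs={"align": "right"}) is None:
            continue  # header or spacer row, wherever it appears
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        # Keep the label as is; drop thousands separators from the values.
        rows.append([cells[0]] + [c.replace(",", "") for c in cells[1:]])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = [data_rows(t) for t in soup.find_all("table")]

# Write the first table as CSV text (use open('table0.csv', 'w', newline='') for a file).
buffer = io.StringIO()
csv.writer(buffer).writerows(tables[0])
print(buffer.getvalue())
```

The trade-off is that this relies on the align="right" attribute being present on every data cell, which holds for the snippet in the question but is an assumption about the other two tables.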

I'm not sure what the exact question is, but right off the bat I can see an error that will throw you off:
new_table = pd.DataFrame(columns=range(0-4))
needs to be
new_table = pd.DataFrame(columns=range(0,4))
The expression range(0-4) is actually range(-4), which is the same as range(0,-4) and therefore empty, whereas you want range(0,4). You can pass either range(4) or range(0,4) as the parameter.
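A quick way to see the difference in an interpreter (the variable names are only for illustration):

```python
wrong = list(range(0 - 4))   # 0 - 4 is evaluated first, so this is range(-4)
right = list(range(0, 4))
also_right = list(range(4))

print(wrong)       # []
print(right)       # [0, 1, 2, 3]
print(also_right)  # [0, 1, 2, 3]
```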

Related

Xpath Python Extract Data From Table Between Two Headings

I'm trying to extract data from a table that lies between two headings in an HTML file using Python. In this case, the id I need to look up is in a span inside a heading (I need id="Perlis"; the table lies between the Perlis and Kedah headings):
<h2>
<span class="mw-headline" id="Perlis">Perlis</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;">
<tbody>
<tr>
<th width="30"># </th>
<th width="150">Constituency s </th>
<th width="150">Winner </th>
<th width="80">Votes </th>
<th width="80">Majority </th>
<th width="150">Opponent(s) </th>
<th width="80">Votes </th>
<th width="150">Incumbent </th>
<th width="80">
<b>Incumbent Majority</b>
</th>
</tr>
<tr>
<td colspan="13">
BN
<b>2</b> | GS
<b>0</b> | PH
<b>1</b> | Independent
<b>0</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P1 </td>
<td rowspan="2">
Padang Besar
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>15,032</b>
</td>
<td rowspan="2">
<b>1,438</b>
</td>
<td bgcolor="#F18A8F">Izizam Ibrahim <br /> ( <b>PH</b>- <b>PPBM</b>) </td>
<td>
<b>13,594</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Zahidi Zainul Abidin
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>7,426</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mokhtar Senik <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>7,874</b>
</td>
</tr>
<tr align="center">
<td rowspan="2">P2 </td>
<td rowspan="2">
Kangar
</td>
<td rowspan="2" bgcolor="#C7F2F2">Noor Amin Ahmad <br /> ( <b>PH</b>- <b>PKR</b>) </td>
<td rowspan="2">
<b>20,909</b>
</td>
<td rowspan="2">
<b>5,603</b>
</td>
<td bgcolor="#B5BED9">Ramli Shariff <br /> ( <b>BN</b>- <b>UMNO</b>) </td>
<td>
<b>15,306</b>
</td>
<td rowspan="2" bgcolor="#B5BED9">
Shaharuddin Ismail
<br /> ( <b>BN</b>- <b>UMNO</b>)
</td>
<td rowspan="2">
<b>4,037</b>
</td>
</tr>
<tr>
<td bgcolor="#B2DBB2">Mohamad Zahid Ibrahim <br /> ( <b>GS</b>- <b>PAS</b>) </td>
<td>
<b>8,465</b>
</td>
</tr>
</tbody>
</table>
<h2>
<span class="mw-headline" id="Kedah">Kedah</span>
<span class="mw-editsection">
<span class="mw-editsection-bracket">[</span>
edit
<span class="mw-editsection-bracket">]</span>
</span>
</h2>
<table class="wikitable" style="text-align:center; font-size:90%; width:100%;"></table>
This is the resulting JSON that I am trying to construct:
[
  {
    "state": "Perlis",
    "constituencies": [
      {
        "id": "P1",
        "name": "Padang Besar"
      },
      {
        "id": "P2",
        "name": "Kangar"
      }
    ]
  }
]
I'd like to know how to reference that specific table so I can extract the data into JSON. I have used Scrapy before, but I'm not sure how to do it in this case. This is what I had in mind:
import scrapy
class PostSpider(scrapy.Spider):
    name = 'manual_spider'
    start_urls = [
        '%URL%'
    ]
    def parse(self, response):
        doc = response.xpath('//comment()').getall()  # This is the bit I need
        # code continues here
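
One possible approach, sketched here with lxml rather than a full spider: select the h2 whose span has the wanted id, then take the first table among its following siblings. The same XPath expression should also be usable with response.xpath in Scrapy, but I have only tried it against the cut-down SAMPLE below. SAMPLE and constituencies_for are names made up for this example, and the rule "a constituency row is a tr whose first td has a rowspan" is an assumption based on the HTML in the question.

```python
import json

import lxml.html

# Cut-down stand-in for the page in the question.
SAMPLE = """<html><body>
<h2><span class="mw-headline" id="Perlis">Perlis</span></h2>
<table class="wikitable"><tbody>
<tr><th>#</th><th>Constituency</th><th>Winner</th></tr>
<tr><td colspan="13">BN <b>2</b> | PH <b>1</b></td></tr>
<tr align="center"><td rowspan="2">P1 </td><td rowspan="2">Padang Besar</td>
<td rowspan="2">Zahidi Zainul Abidin</td><td>Izizam Ibrahim</td></tr>
<tr><td>Mokhtar Senik</td></tr>
<tr align="center"><td rowspan="2">P2 </td><td rowspan="2">Kangar</td>
<td rowspan="2">Noor Amin Ahmad</td><td>Ramli Shariff</td></tr>
<tr><td>Mohamad Zahid Ibrahim</td></tr>
</tbody></table>
<h2><span class="mw-headline" id="Kedah">Kedah</span></h2>
<table class="wikitable"></table>
</body></html>"""


def constituencies_for(doc, state):
    """Return [{'id': ..., 'name': ...}] from the table right after the state's heading."""
    tables = doc.xpath(
        "//h2[span[@id=$state]]/following-sibling::table[1]", state=state
    )
    if not tables:
        return []
    found = []
    # Assumption: only constituency rows start with a td that carries a rowspan.
    for row in tables[0].xpath(".//tr[td[1][@rowspan]]"):
        found.append({
            "id": row.xpath("string(td[1])").strip(),
            "name": row.xpath("string(td[2])").strip(),
        })
    return found


doc = lxml.html.fromstring(SAMPLE)
result = [{"state": "Perlis", "constituencies": constituencies_for(doc, "Perlis")}]
print(json.dumps(result, indent=2))
```

Taking table[1] on the following-sibling axis is what keeps the Kedah table out of the Perlis result: the axis is ordered by document position, so index 1 is the nearest table after the heading.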

How to split dataframe at headers that are in a row

I've got a page I'm scraping, and most of the tables are in the format Heading --info. I can iterate through most of the tables and create separate dataframes for all the various information using pandas.read_html.
However, in some cases they've combined information into one table with subheadings. I want each subsection to be a separate dataframe, with the text of the subheading row as its heading (appending each dataframe to a list).
Is there an easy way to split this dataframe? It will always be a heading followed by its associated rows, then a new heading followed by its associated rows.
E.g.
   Col1    Col2
0  thing
1  1       2
2  2       3
3  thing2
4  1       2
5  2       3
6  3       4
Should be
   thing
1  1       2
2  2       3
   thing2
4  1       2
5  2       3
6  3       4
It'd be nice if people would just create web pages that made sense with the data but that's not the case here.
I've tried iterrows but cannot seem to come up with a good way to create what I want.
Help would be very much appreciated!
<div class="ranking">
<h6>Sprint</h6>
<table>
<tbody>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Canneto - km 137</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">21</span></td>
<td class="name">
<a class="10010085859" href="javascript:;">
<abbr title="Young rider">*</abbr>
BAGIOLI Nicola
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">54</span></td>
<td class="name">
<a class="10008688453" href="javascript:;">
ORSINI Umberto
</a>
</td>
<td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">247</span></td>
<td class="name">
<a class="10005658114" href="javascript:;">
ZARDINI Edoardo
</a>
</td>
<td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">63</span></td>
<td class="name">
<a class="10003349312" href="javascript:;">
BODNAR Maciej
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
<tbody>
<tr>
<td class="title" colspan="8">Follonica - km 190</td>
</tr>
<tr>
<td class="rank"><span title="Rank">1</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">62</span></td>
<td class="name">
<a class="10007738055" href="javascript:;">
ACKERMANN Pascal
</a>
</td>
<td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">12</td>
</tr>
<tr>
<td class="rank"><span title="Rank">2</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">231</span></td>
<td class="name">
<a class="10008656828" href="javascript:;">
GAVIRIA RENDON Fernando
</a>
</td>
<td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
<td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">10</td>
</tr>
<tr>
<td class="rank"><span title="Rank">3</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">137</span></td>
<td class="name">
<a class="10007506366" href="javascript:;">
ZABEL Rick
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">8</td>
</tr>
<tr>
<td class="rank"><span title="Rank">4</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">91</span></td>
<td class="name">
<a class="10008661777" href="javascript:;">
BALLERINI Davide
</a>
</td>
<td class="team"><img alt="DECEUNINCK - QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK - QUICK - STEP "/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">7</td>
</tr>
<tr>
<td class="rank"><span title="Rank">5</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">12</span></td>
<td class="name">
<a class="10007096239" href="javascript:;">
MERLIER Tim
</a>
</td>
<td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">6</td>
</tr>
<tr>
<td class="more" colspan="8">More...</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">6</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">133</span></td>
<td class="name">
<a class="10028417041" href="javascript:;">
CIMOLAI Davide
</a>
</td>
<td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">5</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">7</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">213</span></td>
<td class="name">
<a class="10007216275" href="javascript:;">
MANZIN Lorrenzo
</a>
</td>
<td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
<td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">4</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">8</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">23</span></td>
<td class="name">
<a class="10007744624" href="javascript:;">
PACIONI Luca
</a>
</td>
<td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
<td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">3</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">9</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">147</span></td>
<td class="name">
<a class="10010946028" href="javascript:;">
<abbr title="Young rider">*</abbr>
VERMEERSCH Florian
</a>
</td>
<td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
<td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">2</td>
</tr>
<tr style="display: none;">
<td class="rank"><span title="Rank">10</span></td>
<td class="any-progression"></td>
<td class="bib"><span title="Bib">195</span></td>
<td class="name">
<a class="10006631548" href="javascript:;">
TEUNISSEN Mike
</a>
</td>
<td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
<td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
<td class="bonif">
</td>
<td class="points" title="Points">1</td>
</tr>
</tbody>
</table>
</div>

You can use np.split()
import numpy as np
res = [x.reset_index(drop=True) for x in np.split(df, np.where(df.applymap(lambda x: x == ''))[0]) if not x.empty]
for x in res:
    x = x.rename(columns=x.iloc[0]).drop(x.index[0])
    print(x)
Output:
thing
1 1 2
2 2 3
thing2
1 1 2
2 2 3
3 3 4

Identify the headers and use cumsum() to groupby then append each group to a list.
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'thing', 1: '1', 2: '2', 3: 'thing2', 4: '1', 5: '2', 6: '3'},
                   'Col2': {0: '', 1: 2, 2: 3, 3: '', 4: 2, 5: 3, 6: 4}})
gb = df.groupby((~df.Col2.astype(bool)).cumsum())
dfs = []
for k, g in gb:
    dfs.append(g.copy())
In [42]: dfs[0]
Out[42]:
    Col1 Col2
0  thing
1      1    2
2      2    3
In [43]: dfs[1]
Out[43]:
     Col1 Col2
3  thing2
4       1    2
5       2    3
6       3    4
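If you also want the heading text as the label of each piece, with the heading row itself removed, the same grouping can feed a dict. This is a sketch under the assumption that heading rows are exactly the rows where Col2 is an empty string; sections is a name made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["thing", "1", "2", "thing2", "1", "2", "3"],
    "Col2": ["", 2, 3, "", 2, 3, 4],
})

is_heading = df["Col2"].eq("")   # True on the subheading rows
group_id = is_heading.cumsum()   # 1, 1, 1, 2, 2, 2, 2

sections = {}
for _, group in df.groupby(group_id):
    title = group["Col1"].iloc[0]                 # text of the heading row
    sections[title] = group.iloc[1:].reset_index(drop=True)

for title, section in sections.items():
    print(title)
    print(section)
```

A dict keeps the heading next to its rows; if two headings can share the same text, append (title, section) tuples to a list instead so nothing is overwritten.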

Concatenate and remove td cells in beautifulsoup python

I have a table like this (old html):
<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>
and I want it to look like:
U.S. federal statutory income tax rate 35.0% 35.0%
Federal income tax at statutory rate $(2,813) $5,834
State and local income taxes, net of federal income tax effect (733) 812
Provision (benefit) for income taxes $(3,546) $6,646
Effective income tax rate 44.1% 39.9%
I have two problems getting from the HTML above to the output I want:
1. there are empty cells like <td> </td>
2. some values are distributed over several cells
I want to get rid of the empty cells by decomposing them, and to concatenate cells that belong together, like (2,813 and ) or 44.1 and %.
I tried the following code for decomposing, but it does not work, and I have no clue how to concatenate cells in BeautifulSoup:
s= """<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>"""
soup = bs(s, "lxml")
table = soup.find('table')
for row in table.find_all('tr'):
    for cell in row.find_all('td'):
        if cell.text=='':
            cell.decompose()
df = pd.read_html(str(soup))
print(df)

Provided you can isolate the right table, just loop over the trs that have a valign attribute and concatenate the tds whose text is not ' ':
from bs4 import BeautifulSoup as bs
html = '''<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('table tr[valign]'):
    print(' '.join([td.text for td in tr.select('td') if td.text != ' ']))
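To actually glue the split values together (so that $, (2,813 and ) end up in one field), one option is to strip every cell, drop the empty ones, and attach prefix and suffix cells to their neighbours. This is a sketch, not part of the code above: merge_cells is a made-up helper, the sample is a shortened copy of the table, and I am assuming the blank cells hold a space or a non-breaking space.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table (assumption: blank cells are &nbsp; or a space).
HTML = """<table>
<tr valign="bottom"><td><div>Federal income tax at statutory rate</div></td><td>&nbsp;</td>
<td align="right">$</td><td align="right">(2,813</td><td>)</td><td>&nbsp;</td>
<td align="right">$</td><td align="right">5,834</td><td> </td></tr>
<tr style="font-size: 1px"><td><div>&nbsp;</div></td><td align="right"><hr/>&nbsp;</td></tr>
<tr valign="bottom"><td><div>Effective income tax rate</div></td><td>&nbsp;</td>
<td align="right">&nbsp;</td><td align="right">44.1</td><td>%</td><td>&nbsp;</td>
<td align="right">&nbsp;</td><td align="right">39.9</td><td>%</td></tr>
</table>"""


def merge_cells(cells):
    """Join '$' to the value after it and ')' or '%' to the value before it."""
    merged = []
    prefix = ""
    for text in cells:
        if text == "$":
            prefix = text            # hold the currency sign for the next value
        elif text in (")", "%") and merged:
            merged[-1] += text       # glue the suffix onto the previous value
        else:
            merged.append(prefix + text)
            prefix = ""
    return merged


soup = BeautifulSoup(HTML, "html.parser")
rows = []
for tr in soup.find_all("tr", valign=True):   # skips the 1px ruler rows
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    cells = [c for c in cells if c]           # drop cells that were only whitespace
    rows.append(merge_cells(cells))

for row in rows:
    print(" ".join(row))
```

Stripping before comparing is also why the decompose attempt in the question finds nothing to remove: the blank cells contain a space, so their text is never equal to ''.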

BeautifulSoup, get text of all td's (some text with commas) inside tr's

I'm currently working on a table that is generated by ASP. It's very messy, but with some help on the code I think I'll be able to get what I need from it.
I have some HTML, and I want the output to be one array for each tr, containing the text of its td's. I also do not want the "-" to be part of the output arrays.
Some td's contain 2 commas, and in some td's the parts of the text are separated only by a space " ".
The code looks like this:
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Golbasi Ankara, Turkey
</td>
<td width="10">
-
</td>
<td>
Konya Havalimani Turkey
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Kucuktepe, Van, Turkey
</td>
<td width="10">
-
</td>
<td>
Maltepe, Istanbul, Turkey
</td>
<td colspan="2">
</td>
</tr>
This is the kind of output I am after, one array per tr:
[['4342141', '25.07.2018', '09:00', 'Golbasi Ankara, Turkey', '-', 'Konya Havalimani Turkey', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Kucuktepe, Van, Turkey', '-', 'Maltepe, Istanbul, Turkey', 'free.asp?detail=hide&c_id=4134123']]

Assuming data will hold the HTML text:
from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('tr'):
    row = [td.text.strip() for td in tr.select('td') if td.text.strip() and td.text.strip() != '-']
    if row:
        rows.append(row)
pprint(rows, width=120)
This will print:
[['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey']]
For writing the rows list to csv you can use this script:
import csv
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
Then in data.csv file you will have:
4342141,25.07.2018 09:00,"Golbasi Ankara, Turkey",Konya Havalimani Turkey
4134123,26.07.2018 09:00,"Kucuktepe, Van, Turkey","Maltepe, Istanbul, Turkey"
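The output in the question also splits the date from the time and ends each array with the link's href. A variation that does this could look like the sketch below. It assumes every data row has the id first and the timestamp second, as in the sample, and it still drops the "-" cell as requested in the text of the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the rows from the question.
data = """<table>
<tr bgcolor="#EFEFEF">
<td><a href="free.asp?detail=hide&amp;c_id=4342141"><img src="pic/bullet.gif"/></a></td>
<td> 4342141 </td><td width="10"></td><td> 25.07.2018 09:00 </td><td width="10"></td>
<td> Golbasi Ankara, Turkey </td><td width="10"> - </td><td> Konya Havalimani Turkey </td>
<td colspan="2"></td></tr>
<tr bgcolor="#EFEFEF" height="3"><td colspan="10"></td></tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    cells = [c for c in cells if c and c != "-"]
    if len(cells) < 2:
        continue                              # spacer row without data
    record_id, timestamp, *places = cells
    date, time = timestamp.split()            # '25.07.2018 09:00' -> two fields
    link = tr.find("a", href=True)
    rows.append([record_id, date, time, *places, link["href"] if link else ""])

print(rows)
```

Splitting only the timestamp cell, rather than every cell, keeps place names such as 'Golbasi Ankara, Turkey' intact even though they contain spaces and commas.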

How to parse an HTML table with rowspans in Python?

The problem
I'm trying to parse an HTML table that contains rowspans; specifically, I'm trying to parse my college schedule.
The problem I'm running into is that when a cell in an earlier row has a rowspan, the following row is missing a TD: the spanned cell occupies that position, so the TD never appears in that row's markup, and a parser that simply counts cells assigns the remaining cells to the wrong column.
I have no clue how to account for this, and I hope to be able to parse this schedule.
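A common way to account for this, shown here only as a sketch on a tiny made-up table and not as a parser for the schedule below, is to expand the table into a grid: every cell is written into all the slots that its rowspan and colspan cover, and cells in later rows skip slots that are already taken. expand_table and HTML are hypothetical names, and the sketch assumes a flat table without the nested tables that the real schedule uses.

```python
from bs4 import BeautifulSoup

# Tiny made-up table: WEBD spans blocks 1 and 2 in the first day column.
HTML = """<table>
<tr><td>time</td><td>Mon</td><td>Tue</td></tr>
<tr><td>1</td><td rowspan="2">WEBD</td><td>PROJ-T</td></tr>
<tr><td>2</td><td>MATH</td></tr>
</table>"""


def expand_table(table):
    """Return a list of rows in which spanned cells are repeated in every slot."""
    grid = {}  # (row, column) -> cell text
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for cell in tr.find_all(["td", "th"]):
            while (r, c) in grid:        # slot claimed by a rowspan from above
                c += 1
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            text = cell.get_text(strip=True)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(row for row, _ in grid) + 1
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((row, col), "") for col in range(n_cols)] for row in range(n_rows)]


soup = BeautifulSoup(HTML, "html.parser")
grid = expand_table(soup.find("table"))
for row in grid:
    print(row)
```

In the expanded grid the column index of a cell no longer depends on how many TDs precede it in the markup, so MATH lands in the Tue column even though it is only the second TD of its row.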
What I tried
Pretty much everything I can think of.
The result I get
[
    {
        'blok_eind': 4,
        'blok_start': 3,
        'dag': 4, # Should be 5
        'leraar': 'DOODF000',
        'lokaal': 'ALK C212',
        'vak': 'PROJ-T',
    },
]
As you can see in the output snippet above, the entry whose vak key has the value PROJ-T has dag set to 4, while it is supposed to be 5 (i.e. Friday/Vrijdag).
The result I want
A Python dict() that looks like the one posted above, but with the right value
Where:
day/dag is an int from 1~5 representing Monday~Friday
block_start/blok_start is an int that represents when the course starts (Time block, left side of table)
block_end/blok_eind is an int that represent in what block the course ends
classroom/lokaal is the classroom's code the course is in
teacher/leraar is the teacher's ID
course/vak is the ID of the course
Basic HTML Structure for above data
<center>
<table>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<font>
TEACHER-ID
</font>
</td>
<td>
<font>
<b>
CLASSROOM ID
</b>
</font>
</td>
</tr>
<tr>
<td>
<font>
COURSE ID
</font>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</table>
</center>
The code
HTML
<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
<font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
</font> <font size="4" face="Arial">
IO1B
</font>
<BR>
<TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
<TR>
<TD align="center">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
ICT algemeen Prakti
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
</TABLE>
<TABLE cellspacing="1" cellpadding="1">
<TR>
<TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r & P e t t e r s S o f t w a r e
</font></CENTER>
Python
from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
daytable = {
    1: "Maandag",
    2: "Dinsdag",
    3: "Woensdag",
    4: "Donderdag",
    5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}
page = BeautifulSoup(r.content, "lxml")
roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")
        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")
        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")
        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass
        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]
        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1
        if table or table_bold:
            # Store the data
            roster.append(dayroster)
# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)

You'll have to track the rowspans on previous rows, one per column.
You could do this simply by copying the integer value of a rowspan into a dictionary keyed by column, and having each subsequent row decrement that value until it drops to 1 (or store the integer value minus 1 and count down to 0, which is easier to code). Then you can adjust the column position of the cells in subsequent rows based on the rowspans that are still active from preceding rows.
Your table complicates this a little by using a default span of size 2, incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.
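To make the bookkeeping concrete, here is a minimal, self-contained sketch of the per-column rowspan counter on a toy table. It is a simplified illustration and not the code for your schedule: the function name `assign_columns` and the `(label, rowspan)` input format are invented for the example, and it assumes plain rowspans (no colspan handling, no halving of the span).

```python
def assign_columns(rows):
    """Return (row, column, label) for every cell, honouring rowspans.

    rows: list of rows, each a list of (label, rowspan) tuples that lists
    only the cells physically present in that row, like the <td> tags.
    """
    remaining = {}  # column -> how many further rows a span still covers
    placed = []
    for r, cells in enumerate(rows):
        # columns occupied in this row by a cell that started higher up
        covered = {c for c, n in remaining.items() if n > 0}
        for c in covered:
            remaining[c] -= 1  # this row uses up one row of the span
        col = 0
        for label, rowspan in cells:
            while col in covered:  # skip slots taken by a span from above
                col += 1
            placed.append((r, col, label))
            if rowspan > 1:
                remaining[col] = rowspan - 1
            col += 1
    return placed


toy = [
    [("a", 1), ("b", 2), ("c", 1)],
    [("d", 1), ("e", 1)],            # "b" still occupies column 1 here
    [("f", 1), ("g", 1), ("h", 1)],
]
result = assign_columns(toy)
print(result[4])  # (1, 2, 'e')
```

Without the counter, cell "e" would be reported in column 1; with it, "e" lands in column 2, which is the same kind of shift that turns day 4 into day 5 in the schedule.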
Rather than use massive CSS selectors, select just the table rows and we'll iterate over those:
roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
    # take direct child td cells, but skip the first cell
    # (find_all with recursive=False is used instead of select('> td'),
    # because newer Beautiful Soup releases reject a selector that starts with '>')
    daycells = row.find_all('td', recursive=False)[1:]
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1
        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.
        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan
        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })
    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1
This produces correct output:
[{'blok_eind': 2,
'blok_start': 1,
'dag': 5,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021',
'vak': u'WEBD'},
{'blok_eind': 3,
'blok_start': 2,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'WEBD'},
{'blok_eind': 4,
'blok_start': 3,
'dag': 5,
'leraar': u'DOODF000',
'lokaal': u'ALK C212',
'vak': u'PROJ-T'},
{'blok_eind': 5,
'blok_start': 4,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'MENT'},
{'blok_eind': 7,
'blok_start': 6,
'dag': 5,
'leraar': u'JONGJ003',
'lokaal': u'ALK B008',
'vak': u'BURG'},
{'blok_eind': 8,
'blok_start': 7,
'dag': 3,
'leraar': u'FLUIP000',
'lokaal': u'ALK B004',
'vak': u'ICT algemeen Prakti'},
{'blok_eind': 9,
'blok_start': 8,
'dag': 5,
'leraar': u'KOOLE000',
'lokaal': u'ALK B008',
'vak': u'NED'}]
Moreover, this code will continue to work even if courses span more than 2 blocks, or just one block; any rowspan size is supported.
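The reason any span size works is the arithmetic on the rowspan attribute. A small sketch of just that step, assuming (as in this schedule) that one block is rendered as two table rows; the helper name `block_range` and its `rows_per_block` parameter are made up for illustration:

```python
def block_range(block_start, rowspan_attr, rows_per_block=2):
    """Return (first_block, last_block) for a cell.

    rowspan_attr is the raw attribute value, e.g. "4"; a cell that covers
    exactly one block has a rowspan equal to rows_per_block.
    """
    extra_blocks = int(rowspan_attr) // rows_per_block - 1
    return block_start, block_start + extra_blocks


one_block = block_range(1, "2")     # (1, 1)
two_blocks = block_range(3, "4")    # (3, 4), like the PROJ-T entry
three_blocks = block_range(6, "6")  # (6, 8)
print(one_block, two_blocks, three_blocks)
```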

It may be easier to parse the table with bs4's built-in functions such as findAll instead of long CSS selectors.
You could use the following code:
from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
# only the direct child rows of the outer table; the first one holds the day names
trs=table.findAll("tr", {},recursive=False)
tr_count=0
trs.pop(0)
# maps "row-column" to the text of the cell that covers that slot
final_table={}
for tr in trs:
    tds=tr.findAll("td", {},recursive=False)
    if tds:
        td_count=0
        # drop the first column (block number and times)
        tds.pop(0)
        for td in tds:
            if td.has_attr('rowspan'):
                final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                if int(td.attrs['rowspan'])==4:
                    # the cell also covers the same column in the next block
                    final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
            # the next slot was already filled by a rowspan from the previous block: skip it
            # ("in" replaces dict.has_key(), which only exists in Python 2)
            if str(tr_count)+"-"+str(td_count+1) in final_table:
                td_count=td_count+1
            td_count=td_count+1
        tr_count=tr_count+1
roster=[]
for i in range(0,10): #iterate over time
    for j in range(0,5): #iterate over day
        item=final_table[str(i)+"-"+str(j)]
        if len(item)!=0:
            block_eind=i+1
            try:
                if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                    block_eind=i+2
            except:
                pass
            try:
                # the cell text is ordered teacher, classroom, course
                leraar=item.split('\r\n \n\n')[0]
                lokaal=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                vak=item.split('\n \n\r\n')[1]
            except:
                lokaal=leraar=vak="---"
            dayroster = {
                "dag": j+1,
                "blok_start": i+1,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }
            dayroster_double = {
                "dag": j+1,
                "blok_start": i,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }
            #use to prevent double dict for same event
            if dayroster_double not in roster:
                roster.append(dayroster)
print (roster)
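The idea behind the code above is to expand the table into a full grid, where a cell with a rowspan is copied into every slot it covers. A minimal sketch of that idea on toy data, independent of bs4; the function name `expand_grid` and the `(label, rowspan)` input format are assumptions made for this example, and colspans are not handled:

```python
def expand_grid(rows, n_cols):
    """Expand rows of (label, rowspan) cells into a full 2D grid.

    None marks a slot that no cell has claimed yet, so empty cells
    should use "" as their label rather than None.
    """
    grid = [[None] * n_cols for _ in rows]
    for r, cells in enumerate(rows):
        col = 0
        for label, rowspan in cells:
            # skip slots already filled by a cell spanning down from above
            while col < n_cols and grid[r][col] is not None:
                col += 1
            if col == n_cols:
                raise ValueError("row %d has more cells than columns" % r)
            # copy the label into every row this cell covers
            for rr in range(r, min(r + rowspan, len(rows))):
                grid[rr][col] = label
            col += 1
    return grid


toy = [
    [("a", 1), ("b", 2), ("c", 1)],
    [("d", 1), ("e", 1)],
    [("f", 1), ("g", 1), ("h", 1)],
]
grid = expand_grid(toy, 3)
print(grid[1])  # ['d', 'b', 'e']
```

Once the grid is complete, every slot can be addressed by row and column directly, and a course that spans several blocks shows up as the same label in consecutive rows of one column.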
