How to parse an HTML table with rowspans in Python?

The problem
I'm trying to parse an HTML table that contains rowspans; specifically, I'm trying to parse my college schedule.
The problem: when a cell in an earlier row has a rowspan, the following row contains one TD fewer, because the spanning cell occupies that column. My column count is then off for every cell after it.
I have no clue how to account for this, and I hope to be able to parse this schedule.
What I tried
Pretty much everything I can think of.
The result I get
[
{
'blok_eind': 4,
'blok_start': 3,
'dag': 4, # Should be 5
'leraar': 'DOODF000',
'lokaal': 'ALK C212',
'vak': 'PROJ-T',
},
]
As you can see, the entry whose vak key has the value PROJ-T in the output snippet above has dag set to 4, while it is supposed to be 5 (a.k.a. Friday/Vrijdag).
The result I want
A Python dict() that looks like the one posted above, but with the right value
Where:
day/dag is an int from 1 to 5 representing Monday to Friday
block_start/blok_start is an int giving the time block in which the course starts (the time blocks are listed on the left side of the table)
block_end/blok_eind is an int giving the time block in which the course ends
classroom/lokaal is the code of the classroom the course is held in
teacher/leraar is the teacher's ID
course/vak is the ID of the course
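To make the field meanings concrete, here is a minimal sketch that turns one entry into a readable line. It reuses the day and time lookup tables from the Python code further down in this question; the helper name describe and the output format are assumptions made for illustration only.

```python
# Lookup tables copied from the question's own Python code.
DAYTABLE = {1: "Maandag", 2: "Dinsdag", 3: "Woensdag", 4: "Donderdag", 5: "Vrijdag"}
TIMETABLE = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}


def describe(entry):
    """Render one roster entry as text (hypothetical helper, not in the original code)."""
    day = DAYTABLE[entry["dag"]]
    start = TIMETABLE[entry["blok_start"]][0]  # start time of the first block
    end = TIMETABLE[entry["blok_eind"]][1]     # end time of the last block
    return "{} {}-{}: {} ({}, {})".format(
        day, start, end, entry["vak"], entry["leraar"], entry["lokaal"])


# The entry from "The result I get", with dag corrected to 5.
example = {
    "blok_eind": 4,
    "blok_start": 3,
    "dag": 5,
    "leraar": "DOODF000",
    "lokaal": "ALK C212",
    "vak": "PROJ-T",
}
print(describe(example))  # Vrijdag 10:25-12:05: PROJ-T (DOODF000, ALK C212)
```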
Basic HTML Structure for above data
<center>
<table>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<font>
TEACHER-ID
</font>
</td>
<td>
<font>
<b>
CLASSROOM ID
</b>
</font>
</td>
</tr>
<tr>
<td>
<font>
COURSE ID
</font>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</table>
</center>
The code
HTML
<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
<font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
</font> <font size="4" face="Arial">
IO1B
</font>
<BR>
<TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
<TR>
<TD align="center">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
ICT algemeen Prakti
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
</TABLE>
<TABLE cellspacing="1" cellpadding="1">
<TR>
<TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r & P e t t e r s S o f t w a r e
</font></CENTER>
Python
from pprint import pprint
from bs4 import BeautifulSoup
import requests
r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
"/c/c00025.htm")
daytable = {
1: "Maandag",
2: "Dinsdag",
3: "Woensdag",
4: "Donderdag",
5: "Vrijdag"
}
timetable = {
1: ("8:30", "9:20"),
2: ("9:20", "10:10"),
3: ("10:25", "11:15"),
4: ("11:15", "12:05"),
5: ("12:05", "12:55"),
6: ("12:55", "13:45"),
7: ("13:45", "14:35"),
8: ("14:50", "15:40"),
9: ("15:40", "16:30"),
10: ("16:30", "17:20"),
}
page = BeautifulSoup(r.content, "lxml")
roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")
        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")
        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")
        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass
        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]
        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1
        if table or table_bold:
            # Store the data
            roster.append(dayroster)
# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)

You'll have to track the rowspans on previous rows, one per column.
You could do this simply by copying the integer value of a rowspan into a dictionary, and subsequent rows decrement the rowspan value until it drops to 1 (or we could store the integer value minus 1 and drop to 0 for ease of coding). Then you can adjust subsequent table counts based on preceding rowspans.
Your table complicates this a little by using a default span of size 2, incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.
Rather than use massive CSS selectors, select just the table rows and we'll iterate over those:
roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
    # take direct child td cells, but skip the first cell
    # (find_all with recursive=False is used here instead of the selector '> td',
    # because a selector that starts with a combinator may be rejected by newer bs4 releases)
    daycells = row.find_all('td', recursive=False)[1:]
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1
        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.
        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan
        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })
    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1
This produces correct output:
[{'blok_eind': 2,
'blok_start': 1,
'dag': 5,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021',
'vak': u'WEBD'},
{'blok_eind': 3,
'blok_start': 2,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'WEBD'},
{'blok_eind': 4,
'blok_start': 3,
'dag': 5,
'leraar': u'DOODF000',
'lokaal': u'ALK C212',
'vak': u'PROJ-T'},
{'blok_eind': 5,
'blok_start': 4,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'MENT'},
{'blok_eind': 7,
'blok_start': 6,
'dag': 5,
'leraar': u'JONGJ003',
'lokaal': u'ALK B008',
'vak': u'BURG'},
{'blok_eind': 8,
'blok_start': 7,
'dag': 3,
'leraar': u'FLUIP000',
'lokaal': u'ALK B004',
'vak': u'ICT algemeen Prakti'},
{'blok_eind': 9,
'blok_start': 8,
'dag': 5,
'leraar': u'KOOLE000',
'lokaal': u'ALK B008',
'vak': u'NED'}]
Moreover, this code will continue to work even if courses span more than 2 blocks, or just one block; any rowspan size is supported.
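The rowspan bookkeeping can also be looked at in isolation, without any HTML. The following is a minimal sketch, not the code from this answer: it assumes the cells have already been extracted as (text, rowspan) tuples in source order, and the names place_cells, rows and width are hypothetical. It keeps one counter per column (a list rather than a dict) that records how many more rows are covered by a cell spanning down from above.

```python
def place_cells(rows, width):
    """Assign each cell its true column index, accounting for rowspans.

    rows:  list of rows; each row is a list of (text, rowspan) tuples
           in the order the cells appear in the source.
    width: number of columns in the full grid.
    Returns a list of (row_index, column_index, rowspan, text) tuples.
    """
    remaining = [0] * width  # rows still covered from above, per column
    placed = []
    for r, row in enumerate(rows):
        cells = iter(row)
        for col in range(width):
            if remaining[col]:
                # Column is occupied by a cell spanning down from an earlier row.
                remaining[col] -= 1
                continue
            cell = next(cells, None)
            if cell is None:
                continue  # row has fewer cells than free columns
            text, rowspan = cell
            remaining[col] = rowspan - 1
            placed.append((r, col, rowspan, text))
    return placed


# Row 1 has only two cells because "b" spans down from row 0 into column 1.
rows = [
    [("a", 1), ("b", 2), ("c", 1)],
    [("d", 1), ("e", 1)],
    [("f", 1), ("g", 1), ("h", 1)],
]
placed = place_cells(rows, 3)
print(placed)
```

In this example "e" is the second cell of its row in the source but ends up in column index 2, which is the same correction the schedule needs for dag.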

It may be better to use a built-in bs4 method such as findAll to parse your table.
You could use the following code:
from pprint import pprint
from bs4 import BeautifulSoup
import requests
r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
"/c/c00025.htm")
content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
trs=table.findAll("tr", {},recursive=False)
tr_count=0
trs.pop(0)
final_table={}
for tr in trs:
tds=tr.findAll("td", {},recursive=False)
if tds:
td_count=0
tds.pop(0)
for td in tds:
if td.has_attr('rowspan'):
final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
if int(td.attrs['rowspan'])==4:
final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
if str(tr_count)+"-"+str(td_count+1) in final_table:  # dict.has_key() was removed in Python 3
td_count=td_count+1
td_count=td_count+1
tr_count=tr_count+1
roster=[]
for i in range(0,10): #iterate over time
for j in range(0,5): #iterate over day
item=final_table[str(i)+"-"+str(j)]
if len(item)!=0:
block_eind=i+1
try:
if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
block_eind=i+2
except:
pass
try:
lokaal=item.split('\r\n \n\n')[0]
leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
vak=item.split('\n \n\r\n')[1]
except:
lokaal=leraar=vak="---"
dayroster = {
"dag": j+1,
"blok_start": i+1,
"blok_eind": block_eind,
"lokaal": lokaal,
"leraar": leraar,
"vak": vak
}
dayroster_double = {
"dag": j+1,
"blok_start": i,
"blok_eind": block_eind,
"lokaal": lokaal,
"leraar": leraar,
"vak": vak
}
#use to prevent double dict for same event
if dayroster_double not in roster:
roster.append(dayroster)
print (roster)

Related

Concatenate and remove td cells in beautifulsoup python

I have a table like this (old html):
<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>
and I want it to look like:
U.S. federal statutory income tax rate 35.0% 35.0%
Federal income tax at statutory rate $(2,813) $5,834
State and local income taxes, net of federal income tax effect (733) 812
Provision (benefit) for income taxes $(3,546) $6,646
Effective income tax rate 44.1% 39.9%
I have two problems getting from the HTML above to the output I want:
1. there are empty cells like <td> </td>
2. some values are distributed over several cells
I want to get rid of the empty cells by decomposing them, and concatenate cells such as (2,813 and ) or 44.1 and %.
I tried the following code for decomposing, but it does not work, and I have no clue how to concatenate cells in BeautifulSoup:
from bs4 import BeautifulSoup as bs
import pandas as pd

s = """<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>"""
soup = bs(s, "lxml")
table = soup.find('table')
for row in table.find_all('tr'):
    for cell in row.find_all('td'):
        if cell.text=='':
            cell.decompose()
df = pd.read_html(str(soup))
print(df)
Provided you can isolate the right table, just loop over the tr elements that have a valign attribute and concatenate the text of the td cells whose text is not ' ':
from bs4 import BeautifulSoup as bs
html = '''<table>
<!-- Begin Table Body -->
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">U.S. federal statutory income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">35.0</td>
<td nowrap="">%</td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Federal income tax at statutory rate</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(2,813</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">5,834</td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">State and local income taxes, net of federal income tax effect</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">(733</td>
<td nowrap="">)</td>
<td> </td>
<td> </td>
<td align="right">812</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="1"/> </td>
<td> </td>
</tr>
<tr valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Provision (benefit) for income taxes</div></td>
<td> </td>
<td align="right" nowrap="">$</td>
<td align="right">(3,546</td>
<td nowrap="">)</td>
<td> </td>
<td align="right">$</td>
<td align="right">6,646</td>
<td> </td>
</tr>
<tr style="font-size: 1px">
<td><div style="margin-left:10px; text-indent:-10px"> </div></td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
<td> </td>
<td> </td>
<td align="right"><hr noshade="" size="4"/> </td>
<td> </td>
</tr>
<tr style="background: #eeeeee" valign="bottom">
<td><div style="margin-left:10px; text-indent:-10px">Effective income tax rate</div></td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">44.1</td>
<td nowrap="">%</td>
<td> </td>
<td align="right" nowrap=""> </td>
<td align="right">39.9</td>
<td nowrap="">%</td>
</tr>
<!-- End Table Body -->
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('table tr[valign]'):
print(' '.join([td.text for td in tr.select('td') if td.text != ' ']))
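Joining with spaces leaves values such as $, (2,813 and ) as separate tokens. If they should be glued back together as in the desired output, one option is a small post-processing step on the cell texts. This is a minimal sketch under stated assumptions: merge_cells is a hypothetical helper, it works on a plain list of strings (for example [td.text for td in tr.select('td')]), and it assumes that '$' always prefixes the next value while ')' and '%' always close the previous one, as in the table above.

```python
def merge_cells(cells):
    """Drop blank cells and glue split values such as '$', '(2,813', ')' together."""
    merged = []
    glue_next = False  # True when the previous non-blank cell was the prefix '$'
    for raw in cells:
        # Treat non-breaking spaces like ordinary whitespace.
        text = raw.replace('\xa0', ' ').strip()
        if not text:
            continue  # blank cell: skip it
        if merged and (glue_next or text in (')', '%')):
            merged[-1] += text  # attach to the previous value
        else:
            merged.append(text)
        glue_next = (text == '$')
    return merged


row = ['Federal income tax at statutory rate', ' ', '$', '(2,813', ')',
       ' ', '$', '5,834', ' ']
print(' '.join(merge_cells(row)))
# Federal income tax at statutory rate $(2,813) $5,834
```

Because blank cells are detected with strip(), this also covers cells that contain only whitespace, which a comparison against the empty string does not match.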

Unable to select Month and Year at DatePicker use Selenium + Python

I am able to select the day and enter the value.
I have already tried solutions from these other questions:
1. Getting availability from datepicker
2. Python Selenium Date Picker
3. Python & Selenium Cannot select date in datepicker
However, when I try to select the month and year, I still have no luck getting the result.
Below is my code:
start_date = wait.until(EC.visibility_of_element_located((
By.CSS_SELECTOR, "#departureDate_i")))
start_date.click()  # Show the datepicker
browser.execute_script("document.getElementsByClassName('next')[0].click()")
current_month = browser.find_element_by_css_selector(".datepicker-months").text
print("current_month:", current_month)
Below is the HTML:
<div class="datepicker datepicker-dropdown dropdown-menu datepicker-orient-left datepicker-orient-top" style="display: none; top: 176.4px; left: 448.667px;">
<div class="datepicker-days" style="display: block;">
<table class=" table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: hidden;"></th>
<th colspan="5" class="datepicker-switch">January 2019</th>
<th class="next" style="visibility: visible;"></th>
</tr>
<tr>
<th class="dow">Su</th>
<th class="dow">Mo</th>
<th class="dow">Tu</th>
<th class="dow">We</th>
<th class="dow">Th</th>
<th class="dow">Fr</th>
<th class="dow">Sa</th>
</tr>
</thead>
<tbody>
<tr>
<td class="day disabled old">30</td>
<td class="day disabled old">31</td>
<td class="day disabled">1</td>
<td class="day disabled">2</td>
<td class="day">3</td>
<td class="day today">4</td>
<td class="day">5</td>
</tr>
<tr>
<td class="day">6</td>
<td class="day">7</td>
<td class="day">8</td>
<td class="day">9</td>
<td class="day">10</td>
<td class="day">11</td>
<td class="day">12</td>
</tr>
<tr>
<td class="day">13</td>
<td class="day">14</td>
<td class="day">15</td>
<td class="day active">16</td>
<td class="day">17</td>
<td class="day">18</td>
<td class="day">19</td>
</tr>
<tr>
<td class="day">20</td>
<td class="day">21</td>
<td class="day">22</td>
<td class="day">23</td>
<td class="day">24</td>
<td class="day">25</td>
<td class="day">26</td>
</tr>
<tr>
<td class="day">27</td>
<td class="day">28</td>
<td class="day">29</td>
<td class="day">30</td>
<td class="day">31</td>
<td class="day new">1</td>
<td class="day new">2</td>
</tr>
<tr>
<td class="day new">3</td>
<td class="day new">4</td>
<td class="day new">5</td>
<td class="day new">6</td>
<td class="day new">7</td>
<td class="day new">8</td>
<td class="day new">9</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-months" style="display: none;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: hidden;"></th>
<th colspan="5" class="datepicker-switch">2019</th>
<th class="next" style="visibility: visible;"></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7" style=""><span class="month active">Jan</span><span class="month">Feb</span><span class="month">Mar</span><span class="month">Apr</span><span class="month">May</span><span class="month">Jun</span><span class="month">Jul</span><span class="month">Aug</span><span class="month">Sep</span><span class="month">Oct</span><span class="month">Nov</span><span class="month">Dec</span></td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-years" style="display: none;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: hidden;"></th>
<th colspan="5" class="datepicker-switch">2010-2019</th>
<th class="next" style="visibility: visible;"></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7"><span class="year old disabled">2009</span><span class="year disabled">2010</span><span class="year disabled">2011</span><span class="year disabled">2012</span><span class="year disabled">2013</span><span class="year disabled">2014</span><span class="year disabled">2015</span><span class="year disabled">2016</span><span class="year disabled">2017</span><span class="year disabled">2018</span><span class="year active">2019</span><span class="year new">2020</span></td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
</div>
I would much appreciate suggestions on how to handle this.
Thank you.

How to add a 2nd Y-axis to a grouped bar chart using Altair, and sort the bars by the value of one of the data columns?

I'm trying to add a third axis (a second Y-axis) to the grouped bar chart. I'm not sure if it is possible.
Ideally, I want to:
1) add a line to this chart that represents the percentage of arrests made for the given year and crime type.
2) sort the bars within each group using the value of the "rank" column from the data.
Here is my code. Your feedback is much appreciated. Thank you.
import altair as alt
base = alt.Chart().encode(
    x=alt.X('primary_type', scale=alt.Scale(rangeStep=12), title=None,
            sort=alt.EncodingSortField(op='sum', field='rank')),
    color=alt.Color('primary_type:N')
)
bar = base.mark_bar().encode(
    alt.Y('sum(Number_of_Incidents):Q', title='Total Number of Incidents')
)
line = base.mark_line(color='red').encode(
    alt.Y('percent_arrest',
          axis=alt.Axis(title=None))
)
combined = alt.layer(bar, line, data=q13a)
combined.facet(
    column=alt.Column('year')
).resolve_scale(x='independent'
).configure_view(
    stroke='transparent'
)
Sample Data -
<table class="table table-bordered table-hover table-condensed">
<thead><tr><th title="Field #1">year</th>
<th title="Field #2">primary_type</th>
<th title="Field #3">Number_of_Incidents</th>
<th title="Field #4">number_of_arrests</th>
<th title="Field #5">percent_arrest</th>
<th title="Field #6">rank</th>
</tr></thead>
<tbody><tr>
<td align="right">2018</td>
<td>THEFT</td>
<td align="right">57330</td>
<td align="right">5503</td>
<td align="right">9.6</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2018</td>
<td>BATTERY</td>
<td align="right">44667</td>
<td align="right">8886</td>
<td align="right">19.89</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2018</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">24889</td>
<td align="right">1498</td>
<td align="right">6.02</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2018</td>
<td>ASSAULT</td>
<td align="right">18229</td>
<td align="right">2931</td>
<td align="right">16.08</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2018</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">15879</td>
<td align="right">713</td>
<td align="right">4.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2017</td>
<td>THEFT</td>
<td align="right">64334</td>
<td align="right">6459</td>
<td align="right">10.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2017</td>
<td>BATTERY</td>
<td align="right">49213</td>
<td align="right">10060</td>
<td align="right">20.44</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2017</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">29040</td>
<td align="right">1747</td>
<td align="right">6.02</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2017</td>
<td>ASSAULT</td>
<td align="right">19298</td>
<td align="right">3455</td>
<td align="right">17.9</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2017</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">18816</td>
<td align="right">805</td>
<td align="right">4.28</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2016</td>
<td>THEFT</td>
<td align="right">61600</td>
<td align="right">6518</td>
<td align="right">10.58</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2016</td>
<td>BATTERY</td>
<td align="right">50292</td>
<td align="right">10328</td>
<td align="right">20.54</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2016</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">31018</td>
<td align="right">1668</td>
<td align="right">5.38</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2016</td>
<td>ASSAULT</td>
<td align="right">18738</td>
<td align="right">3490</td>
<td align="right">18.63</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2016</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">18733</td>
<td align="right">815</td>
<td align="right">4.35</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2015</td>
<td>THEFT</td>
<td align="right">57335</td>
<td align="right">6771</td>
<td align="right">11.81</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2015</td>
<td>BATTERY</td>
<td align="right">48918</td>
<td align="right">11558</td>
<td align="right">23.63</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2015</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">28675</td>
<td align="right">1835</td>
<td align="right">6.4</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2015</td>
<td>NARCOTICS</td>
<td align="right">23883</td>
<td align="right">23875</td>
<td align="right">99.97</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2015</td>
<td>OTHER OFFENSE</td>
<td align="right">17552</td>
<td align="right">4795</td>
<td align="right">27.32</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2014</td>
<td>THEFT</td>
<td align="right">61561</td>
<td align="right">7415</td>
<td align="right">12.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2014</td>
<td>BATTERY</td>
<td align="right">49447</td>
<td align="right">12517</td>
<td align="right">25.31</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2014</td>
<td>NARCOTICS</td>
<td align="right">29116</td>
<td align="right">29000</td>
<td align="right">99.6</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2014</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">27798</td>
<td align="right">2095</td>
<td align="right">7.54</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2014</td>
<td>OTHER OFFENSE</td>
<td align="right">16979</td>
<td align="right">4159</td>
<td align="right">24.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2013</td>
<td>THEFT</td>
<td align="right">71530</td>
<td align="right">7727</td>
<td align="right">10.8</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2013</td>
<td>BATTERY</td>
<td align="right">54002</td>
<td align="right">12927</td>
<td align="right">23.94</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2013</td>
<td>NARCOTICS</td>
<td align="right">34127</td>
<td align="right">33819</td>
<td align="right">99.1</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2013</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">30853</td>
<td align="right">2107</td>
<td align="right">6.83</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2013</td>
<td>OTHER OFFENSE</td>
<td align="right">17993</td>
<td align="right">3400</td>
<td align="right">18.9</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2012</td>
<td>THEFT</td>
<td align="right">75460</td>
<td align="right">8249</td>
<td align="right">10.93</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2012</td>
<td>BATTERY</td>
<td align="right">59135</td>
<td align="right">13061</td>
<td align="right">22.09</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2012</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">35854</td>
<td align="right">2462</td>
<td align="right">6.87</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2012</td>
<td>NARCOTICS</td>
<td align="right">35488</td>
<td align="right">35226</td>
<td align="right">99.26</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2012</td>
<td>BURGLARY</td>
<td align="right">22843</td>
<td align="right">1285</td>
<td align="right">5.63</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2011</td>
<td>THEFT</td>
<td align="right">75148</td>
<td align="right">8468</td>
<td align="right">11.27</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2011</td>
<td>BATTERY</td>
<td align="right">60458</td>
<td align="right">14139</td>
<td align="right">23.39</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2011</td>
<td>NARCOTICS</td>
<td align="right">38605</td>
<td align="right">38544</td>
<td align="right">99.84</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2011</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">37332</td>
<td align="right">2583</td>
<td align="right">6.92</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2011</td>
<td>BURGLARY</td>
<td align="right">26619</td>
<td align="right">1272</td>
<td align="right">4.78</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2010</td>
<td>THEFT</td>
<td align="right">76754</td>
<td align="right">7844</td>
<td align="right">10.22</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2010</td>
<td>BATTERY</td>
<td align="right">65403</td>
<td align="right">14277</td>
<td align="right">21.83</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2010</td>
<td>NARCOTICS</td>
<td align="right">43393</td>
<td align="right">43294</td>
<td align="right">99.77</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2010</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">40653</td>
<td align="right">2641</td>
<td align="right">6.5</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2010</td>
<td>BURGLARY</td>
<td align="right">26422</td>
<td align="right">1382</td>
<td align="right">5.23</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2009</td>
<td>THEFT</td>
<td align="right">80973</td>
<td align="right">9900</td>
<td align="right">12.23</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2009</td>
<td>BATTERY</td>
<td align="right">68462</td>
<td align="right">16325</td>
<td align="right">23.85</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2009</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">47724</td>
<td align="right">3270</td>
<td align="right">6.85</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2009</td>
<td>NARCOTICS</td>
<td align="right">43543</td>
<td align="right">43193</td>
<td align="right">99.2</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2009</td>
<td>BURGLARY</td>
<td align="right">26766</td>
<td align="right">1412</td>
<td align="right">5.28</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2008</td>
<td>THEFT</td>
<td align="right">88433</td>
<td align="right">9291</td>
<td align="right">10.51</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2008</td>
<td>BATTERY</td>
<td align="right">75922</td>
<td align="right">15520</td>
<td align="right">20.44</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2008</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">52841</td>
<td align="right">3403</td>
<td align="right">6.44</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2008</td>
<td>NARCOTICS</td>
<td align="right">46507</td>
<td align="right">45459</td>
<td align="right">97.75</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2008</td>
<td>OTHER OFFENSE</td>
<td align="right">26533</td>
<td align="right">3496</td>
<td align="right">13.18</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2007</td>
<td>THEFT</td>
<td align="right">85156</td>
<td align="right">9783</td>
<td align="right">11.49</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2007</td>
<td>BATTERY</td>
<td align="right">79591</td>
<td align="right">19386</td>
<td align="right">24.36</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2007</td>
<td>NARCOTICS</td>
<td align="right">54454</td>
<td align="right">53251</td>
<td align="right">97.79</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2007</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">53749</td>
<td align="right">3994</td>
<td align="right">7.43</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2007</td>
<td>OTHER OFFENSE</td>
<td align="right">26863</td>
<td align="right">4230</td>
<td align="right">15.75</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2006</td>
<td>THEFT</td>
<td align="right">86240</td>
<td align="right">10108</td>
<td align="right">11.72</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2006</td>
<td>BATTERY</td>
<td align="right">80666</td>
<td align="right">18892</td>
<td align="right">23.42</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2006</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">57124</td>
<td align="right">4135</td>
<td align="right">7.24</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2006</td>
<td>NARCOTICS</td>
<td align="right">55813</td>
<td align="right">55236</td>
<td align="right">98.97</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2006</td>
<td>OTHER OFFENSE</td>
<td align="right">27100</td>
<td align="right">4010</td>
<td align="right">14.8</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2005</td>
<td>THEFT</td>
<td align="right">85685</td>
<td align="right">11338</td>
<td align="right">13.23</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2005</td>
<td>BATTERY</td>
<td align="right">83965</td>
<td align="right">19994</td>
<td align="right">23.81</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2005</td>
<td>NARCOTICS</td>
<td align="right">56234</td>
<td align="right">56121</td>
<td align="right">99.8</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2005</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">54548</td>
<td align="right">4083</td>
<td align="right">7.49</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2005</td>
<td>OTHER OFFENSE</td>
<td align="right">28028</td>
<td align="right">4726</td>
<td align="right">16.86</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2004</td>
<td>THEFT</td>
<td align="right">95463</td>
<td align="right">12068</td>
<td align="right">12.64</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2004</td>
<td>BATTERY</td>
<td align="right">87136</td>
<td align="right">20718</td>
<td align="right">23.78</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2004</td>
<td>NARCOTICS</td>
<td align="right">57060</td>
<td align="right">57034</td>
<td align="right">99.95</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2004</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">53164</td>
<td align="right">3965</td>
<td align="right">7.46</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2004</td>
<td>OTHER OFFENSE</td>
<td align="right">29532</td>
<td align="right">5386</td>
<td align="right">18.24</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2003</td>
<td>THEFT</td>
<td align="right">98875</td>
<td align="right">12889</td>
<td align="right">13.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2003</td>
<td>BATTERY</td>
<td align="right">88378</td>
<td align="right">20459</td>
<td align="right">23.15</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2003</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55011</td>
<td align="right">4060</td>
<td align="right">7.38</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2003</td>
<td>NARCOTICS</td>
<td align="right">54288</td>
<td align="right">54283</td>
<td align="right">99.99</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2003</td>
<td>OTHER OFFENSE</td>
<td align="right">31147</td>
<td align="right">5856</td>
<td align="right">18.8</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2002</td>
<td>THEFT</td>
<td align="right">98327</td>
<td align="right">13697</td>
<td align="right">13.93</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2002</td>
<td>BATTERY</td>
<td align="right">94153</td>
<td align="right">21331</td>
<td align="right">22.66</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2002</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55940</td>
<td align="right">4403</td>
<td align="right">7.87</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2002</td>
<td>NARCOTICS</td>
<td align="right">51789</td>
<td align="right">51781</td>
<td align="right">99.98</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2002</td>
<td>OTHER OFFENSE</td>
<td align="right">32599</td>
<td align="right">5701</td>
<td align="right">17.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2001</td>
<td>THEFT</td>
<td align="right">99264</td>
<td align="right">15543</td>
<td align="right">15.66</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2001</td>
<td>BATTERY</td>
<td align="right">93447</td>
<td align="right">20463</td>
<td align="right">21.9</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2001</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55851</td>
<td align="right">4548</td>
<td align="right">8.14</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2001</td>
<td>NARCOTICS</td>
<td align="right">50567</td>
<td align="right">50559</td>
<td align="right">99.98</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2001</td>
<td>ASSAULT</td>
<td align="right">31384</td>
<td align="right">7150</td>
<td align="right">22.78</td>
<td align="right">5</td>
</tr>
</tbody></table>
The trouble is that, as far as I know, you cannot draw lines across charts. When creating a grouped bar chart, you have to facet across a column of your data. In effect, this produces several charts that are horizontally concatenated. So, for each chart you have only one point (for each color). If you want to have a line across years, you have to define your x axis to be years, and not facet it, and plot it separately. I would suggest vertical concatenation, to have the lines below the bars.
Note that I have taken the data from your previous question (How to create a nested Grouped Bar Chart using Altair? - Added sample data) because the way you provided it is not practical and I already had this one.
import altair as alt
import pandas as pd
from io import StringIO
# Comma-separated so that multi-word crime types such as CRIMINAL DAMAGE stay in one column.
q13a = pd.read_csv(StringIO("""year,primary_type,Number_of_Incidents,number_of_arrests,percent_arrest,rank
2018,THEFT,57330,5503,9.6,1
2018,BATTERY,44667,8886,19.89,2
2018,CRIMINAL DAMAGE,24889,1498,6.02,3
2018,ASSAULT,18229,2931,16.08,4
2018,DECEPTIVE PRACTICE,15879,713,4.49,5
2017,THEFT,64334,6459,10.04,1
2017,BATTERY,49213,10060,20.44,2
2017,CRIMINAL DAMAGE,29040,1747,6.02,3
2017,ASSAULT,19298,3455,17.9,4
2017,DECEPTIVE PRACTICE,18816,805,4.28,5
2016,THEFT,61600,6518,10.58,1
2016,BATTERY,50292,10328,20.54,2
2016,CRIMINAL DAMAGE,31018,1668,5.38,3
2016,ASSAULT,18738,3490,18.63,4
2016,DECEPTIVE PRACTICE,18733,815,4.35,5
2015,THEFT,57335,6771,11.81,1
2015,BATTERY,48918,11558,23.63,2
2015,CRIMINAL DAMAGE,28675,1835,6.4,3
2015,NARCOTICS,23883,23875,99.97,4
2015,OTHER OFFENSE,17552,4795,27.32,5
2014,THEFT,61561,7415,12.04,1
2014,BATTERY,49447,12517,25.31,2
2014,NARCOTICS,29116,29000,99.6,3
2014,CRIMINAL DAMAGE,27798,2095,7.54,4
2014,OTHER OFFENSE,16979,4159,24.49,5
2013,THEFT,71530,7727,10.8,1
2013,BATTERY,54002,12927,23.94,2
2013,NARCOTICS,34127,33819,99.1,3
2013,CRIMINAL DAMAGE,30853,2107,6.83,4
2013,OTHER OFFENSE,17993,3400,18.9,5"""))
bar = alt.Chart(height=200, width=100).mark_bar().encode(
    x=alt.X('primary_type:N',
            axis=None,
            title=None,
            sort=alt.EncodingSortField(op='sum', field='rank')),
    y=alt.Y('sum(Number_of_Incidents):Q',
            title='Total Number of Incidents'),
    color=alt.Color('primary_type:N')
).facet(
    column=alt.Column('year:O')
).resolve_scale(
    x='independent'
)
line = alt.Chart().mark_line(point=True, color='red').encode(
    x=alt.X('year:O', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('percent_arrest:Q'),
    color=alt.Color('primary_type:N', legend=None)
).properties(height=80, width=680)
alt.vconcat(bar, line, data=q13a).configure_view(stroke='transparent')
Created on 2018-11-29 by the reprexpy package
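If a second y-axis within a single year is enough, one possible approach is to layer the bars and the line and let each layer resolve its own y scale. The sketch below is not the code from the answer above: it builds the corresponding Vega-Lite specification (the JSON format that Altair charts compile to) as a plain dictionary, so it runs without Altair installed. The function name `dual_axis_spec`, the axis title "Percent arrested" and the restriction to the 2018 rows are assumptions made for illustration. In Altair the same idea is usually written as `alt.layer(bar, line).resolve_scale(y='independent')`.

```python
import json


def dual_axis_spec(rows):
    """Build a layered Vega-Lite spec: bars on one y-axis, a line on a second one."""
    # Both layers share the x encoding; categories are ordered by the 'rank' column.
    x = {
        "field": "primary_type",
        "type": "nominal",
        "sort": {"field": "rank", "op": "min", "order": "ascending"},
        "title": None,
    }
    return {
        "data": {"values": rows},
        "layer": [
            {
                "mark": "bar",
                "encoding": {
                    "x": x,
                    "y": {"field": "Number_of_Incidents", "type": "quantitative",
                          "title": "Total Number of Incidents"},
                },
            },
            {
                "mark": {"type": "line", "color": "red", "point": True},
                "encoding": {
                    "x": x,
                    "y": {"field": "percent_arrest", "type": "quantitative",
                          "title": "Percent arrested"},
                },
            },
        ],
        # Independent y scales give each layer its own axis.
        "resolve": {"scale": {"y": "independent"}},
    }


# The 2018 rows from the sample data in the question.
rows_2018 = [
    {"primary_type": "THEFT", "Number_of_Incidents": 57330, "percent_arrest": 9.6, "rank": 1},
    {"primary_type": "BATTERY", "Number_of_Incidents": 44667, "percent_arrest": 19.89, "rank": 2},
    {"primary_type": "CRIMINAL DAMAGE", "Number_of_Incidents": 24889, "percent_arrest": 6.02, "rank": 3},
    {"primary_type": "ASSAULT", "Number_of_Incidents": 18229, "percent_arrest": 16.08, "rank": 4},
    {"primary_type": "DECEPTIVE PRACTICE", "Number_of_Incidents": 15879, "percent_arrest": 4.49, "rank": 5},
]

spec = dual_axis_spec(rows_2018)
print(json.dumps(spec, indent=2))
```

The line here connects the crime types within one year, which is a different picture from the per-crime trend across years in the concatenated chart above, so which one is appropriate depends on what the arrest percentage is meant to show.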

Web Scraping tables from an HTML file

Hello all, I am hoping to get some help with taking the tables in my HTML file and importing them into a CSV file. I am very new to web scraping, so forgive me if my code is completely wrong. The HTML file holds three separate tables that I am trying to extract: estimate, sampling error, and number of non-zero plots in estimate.
My code is shown below:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
cells = soup.find('table').find_all('td', {'align': 'right'})
#Remove the tags and extract table info into CSV
???
Here is the HTML for the first table "Estimate":
Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>
I made four tables almost identical to yours and put them into a fairly respectable page of HTML. Then I ran this code.
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)
The results were four files like this, named table0.csv through table3.csv.
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
Perhaps the main thing I should mention is that I skipped the same number of rows in each table that BeautifulSoup delivered. If the number of header lines varies between tables, you will have to do something more clever, or omit the skiprows parameter and discard the extra lines from the output files afterwards.
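For comparison, here is a minimal sketch of the same table-to-CSV idea using only the standard library (`html.parser` and `csv`) instead of BeautifulSoup and pandas. It is a swapped-in technique, not the code I ran above. The names `TableCollector`, `clean_number` and `table_to_csv` are made up for illustration, the sample HTML is a cut-down version of the "Estimate" table, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (captions, whitespace between rows) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace and newlines inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def clean_number(text):
    """Turn '4,875,993' into 4875993; leave labels and '-' placeholders alone."""
    digits = text.replace(",", "")
    return int(digits) if digits.isdigit() else text


def table_to_csv(rows, header_rows=2):
    """Return CSV text for one table, skipping its leading header rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows[header_rows:]:
        writer.writerow([clean_number(cell) for cell in row])
    return out.getvalue()


sample = """
<table><caption><b>Estimate:</b></caption>
<tr><td></td><td colspan="5"><b>Ownership group</b></td></tr>
<tr><th><b>Forest type group</b></th><td><b>Total</b></td><td><b>Private</b></td></tr>
<tr><td><b>Total</b></td><td align="right">4,875,993</td><td align="right">4,119,025</td></tr>
<tr><td><b>Exotic hardwoods group</b></td><td align="right">2,441</td><td align="right">-</td></tr>
</table>
"""

collector = TableCollector()
collector.feed(sample)
csv_text = table_to_csv(collector.tables[0])
print(csv_text)
```

The `header_rows` argument plays the same role as `skiprows=2` above, so the same caveat applies when the tables have different numbers of header lines.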
Unsure as to what the exact question is here but right off the bat I can see an error that will throw you off a bit.
new_table = pd.DataFrame(columns=range(0-4))
Needs to be
new_table = pd.DataFrame(columns=range(0,4))
The result of range(0-4) is actually range(-4) which evaluates to range(0,-4) whereas you want range(0,4). You can just pass range(4) as the parameter or range(0,4).
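A quick check of the three spellings, as a small sketch (the variable names are only for illustration):

```python
# range(0-4) evaluates the subtraction first, so it is the same as range(-4).
# Counting up from 0 never reaches -4, which makes the range empty.
broken = list(range(0 - 4))
fixed = list(range(0, 4))
short = list(range(4))  # same as range(0, 4)

print(broken)  # []
print(fixed)   # [0, 1, 2, 3]
print(short)   # [0, 1, 2, 3]
```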

Python & BS4: Exclude blank and Grand Total Row

Below is my code to extract data out of an HTML document and place it into variables. I need to exclude the blank lines, as well as the "grand total" line. I've added the HTML input of those segments beneath my code. I'm not sure how to make it work. I can't use len() because the length is variable. Any help?
from bs4 import BeautifulSoup
import urllib
import re
import HTMLParser
html = urllib.urlopen('RanpakAllocations.html').read()
parser = HTMLParser.HTMLParser()
#unescape doesn't seem to work
output = parser.unescape(html)
soup1 = BeautifulSoup(output, "html.parser")
Customer_No = []
Serial_No = []
data = []
#for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
rows = soup1.find_all("tr")
title = rows[0]
headers = rows[1]
datarows = rows[2:]
fields = []
try:
    for row in datarows:
        find_data = row.find_all(attrs={'face': 'Arial,Helvetica,sans-serif'})
        count = 0
        for hit in find_data:
            data = hit.text
            count = count + 1
            if count == 3:
                CSNO = data
            if count == 9:
                ITNO = data
            else:
                continue
        print CSNO, ITNO
        print "new row"
except:
    pass
Here is the input. The first <tr> is my last row of data, however my loop is repeating for the blank rows and the grand total row below it.
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
I would do something like this:
from bs4 import BeautifulSoup
content = '''
<root>
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</root>'''
soup = BeautifulSoup(content, 'html')
answer = []
rows = soup.find_all('tr')
for row in rows:
    if not row.text.strip():
        continue
    row_text = []
    for cell in row.find_all('td'):
        if cell.text.strip():
            row_text.append(cell.text)
    answer.append(row_text)
print(answer)
Output
[[u'12', u'F5684', u'20182', u'VELOCITY SOLUTIONS INC.', u'EQPRAN77717', u'RANPAK FILLPAK TT 2', u'W/UNIVERSAL STAND S/N 51345563', u'1', u'51345563'], [u'Grand Total']]
You can skip over entire empty rows using if not row.text.strip(): continue (row.text.strip() returns an empty string, which evaluates to False).
For rows that you do iterate over, you can check each cell is not empty using if cell.text.strip() before saving the relevant text.
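The output above still contains the Grand Total row. Here is a minimal sketch of one way to drop it as well and pick out the two fields the question reads (the 3rd and 9th cells). It assumes each row has already been reduced to a list of cell texts with empty cells kept, so that positions stay stable; the loop above drops empty cells, which would shift the indexes. The function name `extract_customer_and_serial` and the `min_cells` parameter are made up for illustration.

```python
def extract_customer_and_serial(rows, min_cells=9):
    """Return (customer_no, serial_no) pairs from real data rows only."""
    pairs = []
    for cells in rows:
        texts = [c.strip() for c in cells]
        if not any(texts):
            continue  # blank spacer row
        if any(t.lower() == "grand total" for t in texts):
            continue  # summary row
        if len(texts) < min_cells:
            continue  # assumption: too short to hold a 9th cell, so not a data row
        pairs.append((texts[2], texts[8]))  # 3rd and 9th cell, as in the question
    return pairs


rows = [
    ["12", "F5684", "20182", "VELOCITY SOLUTIONS INC.", "EQPRAN77717",
     "RANPAK FILLPAK TT 2", "W/UNIVERSAL STAND S/N 51345563", "1", "51345563"],
    ["", "", "", "", ""],     # blank row with a colspan cell
    ["", "Grand Total", ""],  # grand total row
    [""] * 9,                 # trailing row of empty cells
]

print(extract_customer_and_serial(rows))
```

With BeautifulSoup, the input lists could be built with something like `[td.text for td in row.find_all('td')]` for each row, without the `if cell.text.strip()` filter.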
