Python & BS4: Exclude blank and Grand Total Row - python

Below is my code to extract data from an HTML document and store it in variables. I need to exclude the blank rows as well as the "Grand Total" row; the relevant HTML is shown beneath my code. I can't use len() because the row length varies. How can I skip those rows?
from bs4 import BeautifulSoup
import urllib
import re
import HTMLParser
html = urllib.urlopen('RanpakAllocations.html').read()
parser = HTMLParser.HTMLParser()
#unescape doesn't seem to work
output = parser.unescape(html)
soup1 = BeautifulSoup(output, "html.parser")
Customer_No = []
Serial_No = []
data = []
#for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
rows = soup1.find_all("tr")
title = rows[0]
headers = rows[1]
datarows = rows[2:]
fields = []
try :
    for row in datarows :
        find_data = row.find_all(attrs={'face' : 'Arial,Helvetica,sans-serif'})
        count = 0
        for hit in find_data:
            data = hit.text
            count = count + 1
            if count == 3 :
                CSNO = data
            if count == 9 :
                ITNO = data
            else :
                continue
        print CSNO, ITNO
        print "new row"
except:
    pass
Here is the input. The first <tr> is my last row of data; however, my loop also runs for the blank rows and the Grand Total row below it.
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>

I would do something like this:
from bs4 import BeautifulSoup
content = '''
<root>
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</root>'''
soup = BeautifulSoup(content, 'html')
answer = []
rows = soup.find_all('tr')
for row in rows:
    # skip rows that contain only whitespace
    if not row.text.strip():
        continue
    row_text = []
    for cell in row.find_all('td'):
        # keep only the cells that contain text
        if cell.text.strip():
            row_text.append(cell.text)
    answer.append(row_text)
print(answer)
Output
[[u'12', u'F5684', u'20182', u'VELOCITY SOLUTIONS INC.', u'EQPRAN77717', u'RANPAK FILLPAK TT 2', u'W/UNIVERSAL STAND S/N 51345563', u'1', u'51345563'], [u'Grand Total']]
You can skip over entire empty rows using if not row.text.strip(): continue (row.text.strip() returns an empty string, which evaluates to False).
For rows that you do iterate over, you can check each cell is not empty using if cell.text.strip() before saving the relevant text.
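The loop above still keeps the Grand Total row, which the question also wants to drop. A minimal Python 3 sketch of that extra filter follows, working on plain lists of cell strings such as those collected from cell.text. The helper name keep_row and its skip_labels parameter are hypothetical, not part of BeautifulSoup.

```python
def keep_row(cell_texts, skip_labels=("grand total",)):
    """Return the non-blank cells of a row, or None if the row should be skipped."""
    cells = [text.strip() for text in cell_texts if text.strip()]
    if not cells:
        return None  # blank row
    if any(cell.lower() in skip_labels for cell in cells):
        return None  # summary row such as "Grand Total"
    return cells


# Cell texts shaped like the four rows in the question's HTML.
rows = [
    ["12", "F5684", "20182", "VELOCITY SOLUTIONS INC.", "EQPRAN77717",
     "RANPAK FILLPAK TT 2", "W/UNIVERSAL STAND S/N 51345563", "1", "51345563"],
    [" ", " ", " ", " ", " "],
    [" ", "Grand Total", " "],
    [" "] * 9,
]

data = [cells for cells in map(keep_row, rows) if cells is not None]
for cells in data:
    # the question reads the 3rd and 9th columns of each data row
    print(cells[2], cells[8])
```

Because blank cells are dropped, the positional indexes are only reliable when every data row has all nine cells filled in, as in the sample.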

Related

Searching for a color tag in html (Python 3)

I am trying to grab elements from a table if a cell has a certain color. Only issue is that for the color tags, grabbing the color does not seem possible just yet.
jump = []
for tr in site.findAll('tr'):
    for td in site.findAll('td'):
        if td == 'td bgcolor':
            jump.append(td)
print(jump)
This returns an empty list
How do I grab just the color from the below html?
I need to get the color from the [td] tag (it would also be useful to get the color from the [tr] tag)
<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> CME_ES </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
How about this:
jump = []
for tr in site.findAll('tr'):
    # search within the current row, not the whole document
    for td in tr.findAll('td'):
        if 'bgcolor' in td.attrs:
            jump.append(td)
            print(td.attrs['bgcolor'])
print(jump)
You can use has_attr to check whether an element has a certain attribute:
if td.has_attr('bgcolor'):
    jump.append(td)
If I misread your question and you only want to find tds of a certain color, use find_all:
tr.find_all("td", {"bgcolor": "#55aa2a"}) # returns list of matches
PS: if someone has a better docs snippet for has_attr, please edit this answer.
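For comparison, here is a sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It records the bgcolor of every tr or td that sets one. The class name BgcolorCollector and the shortened sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class BgcolorCollector(HTMLParser):
    """Record (tag, bgcolor) for every <tr> or <td> that sets bgcolor."""

    def __init__(self):
        super().__init__()
        self.colors = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with lowercased names
        attr_map = dict(attrs)
        if tag in ("tr", "td") and "bgcolor" in attr_map:
            self.colors.append((tag, attr_map["bgcolor"]))


# Shortened version of the row shown in the question.
sample = (
    '<table><tr bgcolor="#f4f4f4">'
    '<td height="25" nowrap="NOWRAP"> CME_ES </td>'
    '<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>'
    '</tr></table>'
)

collector = BgcolorCollector()
collector.feed(sample)
print(collector.colors)
```

The attribute value includes the leading "#", so any comparison against a specific color needs it as well.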

python xpath : extract only few items from tables

I want to extract only a few items from an HTML table.
<table cellspacing="0" cellpadding="2" width="100%" border="0" class="TableBorderBottom">
<tr>
<td class="tblBursaSummHeader">No.</td>
<td class="tblBursaSummHeader">Name</td>
<td class="tblBursaSummHeader">Stock<br>Code</td>
<td class="tblBursaSummHeader">Rem</td>
<td class="tblBursaSummHeader">Last<br>Done</td>
<td class="tblBursaSummHeader" width="55">Chg</td>
<td class="tblBursaSummHeader">% Chg</td>
<td class="tblBursaSummHeader">Vol<br>('00)</td>
<td class="tblBursaSummHeader">Buy Vol<br>('00)</td>
<td class="tblBursaSummHeader">Buy</td>
<td class="tblBursaSummHeader">Sell</td>
<td class="tblBursaSummHeader">Sell Vol<br>('00)</td>
<td class="tblBursaSummHeader">High</td>
<td class="tblBursaSummHeaderRect">Low</td>
</tr>
<tr>
<td class="tblBursaSEvenRow">1</td>
<td class="tblBursaSEvenRow">LBI CAPITAL BHD-WARRANT A 08/8 (LBICAP-WA)</td>
<td class="tblBursaSEvenRow Right">8494WA</td>
<td class="tblBursaSEvenRow Right">s</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right"><img src="/images/upArrow.gif" border=0> <span class=tblUp>+0.120</span></td>
<td class="tblBursaSEvenRow Right">300.0</td>
<td class="tblBursaSEvenRow Right">341,238</td>
<td class="tblBursaSEvenRow Right">745</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right">1,049</td>
<td class="tblBursaSEvenRow Right">0.185</td>
<td class="tblBursaSEvenRowRight Right">0.040</td>
</tr>
<tr>
<td class="tblBursaSOddRow">2</td>
<td class="tblBursaSOddRow">UNIMECH GROUP BHD-WA13/18 (UNIMECH-WA)</td>
<td class="tblBursaSOddRow Right">7091WA</td>
<td class="tblBursaSOddRow Right">s</td>
<td class="tblBursaSOddRow Right">0.070</td>
<td class="tblBursaSOddRow Right"><img src="/images/upArrow.gif" border=0> <span class=tblUp>+0.040</span></td>
<td class="tblBursaSOddRow Right">133.3</td>
<td class="tblBursaSOddRow Right">261,521</td>
<td class="tblBursaSOddRow Right">8,468</td>
<td class="tblBursaSOddRow Right">0.065</td>
<td class="tblBursaSOddRow Right">0.070</td>
<td class="tblBursaSOddRow Right">5,008</td>
<td class="tblBursaSOddRow Right">0.080</td>
<td class="tblBursaSOddRowRight Right">0.040</td>
</tr>
<tr>
My desired output is the Stock Code, Last Done and Chg columns. So the desired output is
8494WA
0.160
+0.120
7091WA
0.070
+0.040
I am able to extract the data, but it takes three separate queries; I would prefer a single expression that does the same work.
page_gain = requests.get('url')
gain = html.fromstring(page_gain.content)
stock = gain.xpath('//table[@class="TableBorderBottom"]/tr/td[3]/text()')
>>> ['Stock', 'Code', '8494WA', '7091WA']
gain.xpath('//table[@class="TableBorderBottom"]/tr/td[5]/text()')
>>>['Last', 'Done', '0.145', '0.075']
gain.xpath('//td/span/text()')
>>>['+0.120', '+0.070']
Note that I also want to eliminate the strings 'Stock', 'Code', 'Last' and 'Done' from the results.
You need to process each row in a loop and get the information you want from it:
data = []
# position() > 1 skips the header row
for data_row in gain.xpath('//table[@class="TableBorderBottom"]/tr[position() > 1]'):
    stock = data_row.xpath('./td[3]/text()')[0]
    last_done = data_row.xpath('./td[5]/text()')[0]
    change = data_row.xpath('./td[6]/span/text()')[0]
    data.append({ "Stock": stock, "Last Done": last_done, "Change": change })
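The row-by-row approach can be tried without lxml on a small, well-formed fragment. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, so the header row is skipped by slicing rather than with position(). The sample markup is a cut-down stand-in for the question's table, not the real page.

```python
import xml.etree.ElementTree as ET

# Well-formed, cut-down stand-in for the table in the question.
sample = """
<table class="TableBorderBottom">
  <tr><td>No.</td><td>Name</td><td>Stock Code</td><td>Rem</td><td>Last Done</td><td>Chg</td></tr>
  <tr><td>1</td><td>LBICAP-WA</td><td>8494WA</td><td>s</td><td>0.160</td><td><span>+0.120</span></td></tr>
  <tr><td>2</td><td>UNIMECH-WA</td><td>7091WA</td><td>s</td><td>0.070</td><td><span>+0.040</span></td></tr>
</table>
"""

table = ET.fromstring(sample.strip())
data = []
# Skip the header row by slicing, then read each cell relative to its row.
for row in table.findall("tr")[1:]:
    data.append({
        "Stock": row.find("td[3]").text,
        "Last Done": row.find("td[5]").text,
        "Change": row.find("td[6]/span").text,
    })

for record in data:
    print(record["Stock"], record["Last Done"], record["Change"])
```

Real-world HTML with unclosed tags such as <br> or unquoted attributes will not parse with ElementTree, which is why the lxml version above remains the practical choice for the actual page.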

Web Scraping tables from an HTML file

Hello all, I am hoping to get some help with taking the tables in my HTML file and importing them into a CSV file. I am very new to web scraping, so forgive me if my code is completely wrong. The HTML file holds three separate tables I am trying to extract: estimate, sampling error, and number of non-zero plots in estimate.
My code is shown below:
#import necessary libraries
import urllib2
import pandas as pd
#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"
#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)
#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')
#Print out the html code with the function prettify
print soup.prettify()
#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)
#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))
#Extract the info from the HTML code
soup.find('table').find_all('td', {'align': 'right'})
#Remove the tags and extract table info into CSV
???
Here is the HTML for the first table "Estimate":
Estimate:
</b>
</caption>
<tr>
<td>
</td>
<td align="center" colspan="5">
<b>
Ownership group
</b>
</td>
</tr>
<tr>
<th>
<b>
Forest type group
</b>
</th>
<td>
<b>
Total
</b>
</td>
<td>
<b>
National Forest
</b>
</td>
<td>
<b>
Other federal
</b>
</td>
<td>
<b>
State and local
</b>
</td>
<td>
<b>
Private
</b>
</td>
</tr>
<tr>
<td nowrap="">
<b>
Total
</b>
</td>
<td align="right">
4,875,993
</td>
<td align="right">
195,438
</td>
<td align="right">
169,500
</td>
<td align="right">
392,030
</td>
<td align="right">
4,119,025
</td>
</tr>
<tr>
<td nowrap="">
<b>
White / red / jack pine group
</b>
</td>
<td align="right">
40,492
</td>
<td align="right">
3,426
</td>
<td align="right">
-
</td>
<td align="right">
10,850
</td>
<td align="right">
26,217
</td>
</tr>
<tr>
<td nowrap="">
<b>
Loblolly / shortleaf pine group
</b>
</td>
<td align="right">
38,267
</td>
<td align="right">
11,262
</td>
<td align="right">
997
</td>
<td align="right">
4,015
</td>
<td align="right">
21,993
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other eastern softwoods group
</b>
</td>
<td align="right">
25,181
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
25,181
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic softwoods group
</b>
</td>
<td align="right">
5,868
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
662
</td>
<td align="right">
5,206
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / pine group
</b>
</td>
<td align="right">
144,238
</td>
<td align="right">
9,592
</td>
<td align="right">
-
</td>
<td align="right">
21,475
</td>
<td align="right">
113,171
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / hickory group
</b>
</td>
<td align="right">
3,480,272
</td>
<td align="right">
152,598
</td>
<td align="right">
123,900
</td>
<td align="right">
285,305
</td>
<td align="right">
2,918,470
</td>
</tr>
<tr>
<td nowrap="">
<b>
Oak / gum / cypress group
</b>
</td>
<td align="right">
76,302
</td>
<td align="right">
-
</td>
<td align="right">
12,209
</td>
<td align="right">
9,311
</td>
<td align="right">
54,782
</td>
</tr>
<tr>
<td nowrap="">
<b>
Elm / ash / cottonwood group
</b>
</td>
<td align="right">
652,001
</td>
<td align="right">
7,105
</td>
<td align="right">
25,431
</td>
<td align="right">
46,096
</td>
<td align="right">
573,369
</td>
</tr>
<tr>
<td nowrap="">
<b>
Maple / beech / birch group
</b>
</td>
<td align="right">
346,718
</td>
<td align="right">
10,871
</td>
<td align="right">
818
</td>
<td align="right">
12,748
</td>
<td align="right">
322,281
</td>
</tr>
<tr>
<td nowrap="">
<b>
Other hardwoods group
</b>
</td>
<td align="right">
21,238
</td>
<td align="right">
585
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
20,653
</td>
</tr>
<tr>
<td nowrap="">
<b>
Exotic hardwoods group
</b>
</td>
<td align="right">
2,441
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
2,441
</td>
</tr>
<tr>
<td nowrap="">
<b>
Nonstocked
</b>
</td>
<td align="right">
42,975
</td>
<td align="right">
-
</td>
<td align="right">
6,144
</td>
<td align="right">
1,570
</td>
<td align="right">
35,261
</td>
</tr>
</table>
<br/>
<table border="4" cellpadding="4" cellspacing="4">
<caption>
<b>
I made four tables almost identical to yours and put them into a fairly respectable page of HTML. Then I ran this code.
>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)
The results were four files like this, named table0.csv through table3.csv.
,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
Perhaps the main thing I should mention is that I skipped the same number of rows in each table that BeautifulSoup delivered. If the number of header lines in the tables varies then you will have to do something more clever or just discard lines in the output files and omit the skiprows parameter.
Unsure as to what the exact question is here but right off the bat I can see an error that will throw you off a bit.
new_table = pd.DataFrame(columns=range(0-4))
Needs to be
new_table = pd.DataFrame(columns=range(0,4))
The result of range(0-4) is actually range(-4) which evaluates to range(0,-4) whereas you want range(0,4). You can just pass range(4) as the parameter or range(0,4).
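The difference is easy to check interactively. This short sketch only uses the built-in range and prints what each spelling produces.

```python
# 0-4 is evaluated first, so range(0-4) is range(-4), which is empty.
wrong = list(range(0 - 4))
# Either of these gives the four column labels 0..3.
right = list(range(0, 4))
short = list(range(4))

print(wrong)
print(right)
print(short)
```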

How to parse an HTML table with rowspans in Python?

The problem
I'm trying to parse an HTML table that uses rowspans; specifically, I'm trying to parse my college schedule.
I'm running into a problem: when the previous row contains a cell with a rowspan, the next row is missing the TD in the position that the spanning cell now occupies.
I have no idea how to account for this, and I hope to be able to parse this schedule.
What I tried
Pretty much everything I can think of.
The result I get
[
{
'blok_eind': 4,
'blok_start': 3,
'dag': 4, # Should be 5
'leraar': 'DOODF000',
'lokaal': 'ALK C212',
'vak': 'PROJ-T',
},
]
As you can see, the entry with vak set to PROJ-T in the output snippet above has dag set to 4, while it is supposed to be 5 (a.k.a. Friday/Vrijdag).
The result I want
A Python dict() that looks like the one posted above, but with the right value
Where:
day/dag is an int from 1~5 representing Monday~Friday
block_start/blok_start is an int that represents when the course starts (Time block, left side of table)
block_end/blok_eind is an int that represent in what block the course ends
classroom/lokaal is the classroom's code the course is in
teacher/leraar is the teacher's ID
course/vak is the ID of the course
Basic HTML Structure for above data
<center>
<table>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<font>
TEACHER-ID
</font>
</td>
<td>
<font>
<b>
CLASSROOM ID
</b>
</font>
</td>
</tr>
<tr>
<td>
<font>
COURSE ID
</font>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</table>
</center>
The code
HTML
<CENTER><font size="3" face="Arial" color="#000000">
<BR></font>
<font size="6" face="Arial" color="#0000FF">
16AO4EIO1B
</font> <font size="4" face="Arial">
IO1B
</font>
<BR>
<TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
<TR>
<TD align="center">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial" color="#000000">
Maandag 29-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Dinsdag 30-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Woensdag 31-08
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Donderdag 01-09
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
Vrijdag 02-09
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>1</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
8:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>2</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
9:20
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:10
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
WEBD
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>3</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
10:25
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
DOODF000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK C212</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
PROJ-T
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>4</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
11:15
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
BLEEJ002
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B021B</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
MENT
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>5</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:05
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>6</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
12:55
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
JONGJ003
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
BURG
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>7</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
13:45
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:35
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
FLUIP000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B004</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
ICT algemeen Prakti
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>8</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
14:50
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=4 align="center" nowrap="1">
<TABLE>
<TR>
<TD width="50%" nowrap=1><font size="2" face="Arial">
KOOLE000
</font> </TD>
<TD width="50%" nowrap=1><font size="2" face="Arial">
<B>ALK B008</B>
</font> </TD>
</TR>
<TR>
<TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial">
NED
</font> </TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>9</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
15:40
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
<TR>
<TD rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial">
<B>10</B>
</font> </TD>
<TD align="center" nowrap=1><font size="2" face="Arial">
16:30
</font> </TD>
</TR>
<TR>
<TD align="center" nowrap=1><font size="2" face="Arial">
17:20
</font> </TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
<TD colspan=12 rowspan=2 align="center" nowrap="1">
<TABLE>
<TR>
<TD></TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
</TR>
</TABLE>
<TABLE cellspacing="1" cellpadding="1">
<TR>
<TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial">
Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r & P e t t e r s S o f t w a r e
</font></CENTER>
Python
from pprint import pprint
from bs4 import BeautifulSoup
import requests
r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
daytable = {
    1: "Maandag",
    2: "Dinsdag",
    3: "Woensdag",
    4: "Donderdag",
    5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}
page = BeautifulSoup(r.content, "lxml")
roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")
        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")
        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")
        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass
        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]
        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1
        if table or table_bold:
            # Store the data
            roster.append(dayroster)
# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)
You'll have to track the rowspans on previous rows, one per column.
You could do this simply by copying the integer value of a rowspan into a dictionary, and subsequent rows decrement the rowspan value until it drops to 1 (or we could store the integer value minus 1 and drop to 0 for ease of coding). Then you can adjust subsequent table counts based on preceding rowspans.
Your table complicates this a little by using a default span of size 2, incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.
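The bookkeeping described above can be sketched in isolation on a toy grid, without any HTML parsing. Each row lists only the cells that are present in the markup, as (label, rowspan) pairs, and a dictionary remembers how many further rows each column is still covered for. The function name place_cells and the toy data are hypothetical; colspans and the schedule's span-of-two convention are deliberately left out.

```python
def place_cells(rows, n_cols):
    """Return (row, column, label) for each cell, accounting for rowspans."""
    remaining = {}  # column -> number of further rows still covered from above
    placed = []
    for r, cells in enumerate(rows):
        pending = iter(cells)
        for col in range(n_cols):
            if remaining.get(col, 0):
                # this column is occupied by a cell spanning down from above
                remaining[col] -= 1
                continue
            cell = next(pending, None)
            if cell is None:
                continue  # the row has fewer cells than free columns
            label, rowspan = cell
            placed.append((r, col, label))
            if rowspan > 1:
                remaining[col] = rowspan - 1
    return placed


# "B" spans two rows, so the second row has only two cells in the markup.
grid = [
    [("A", 1), ("B", 2), ("C", 1)],
    [("D", 1), ("E", 1)],
    [("F", 1), ("G", 1), ("H", 1)],
]

for row, col, label in place_cells(grid, 3):
    print(row, col, label)
```

Without the remaining dictionary, "E" would be assigned to column 1 instead of column 2, which is the same off-by-one seen in the dag value in the question.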
Rather than use massive CSS selectors, select just the table rows and we'll iterate over those:
roster = []
rowspans = {}  # track rowspanning cells
# every second row in the table
rows = page.select('html > body > center > table > tr')[1:21:2]
for block, row in enumerate(rows, 1):
    # take direct child td cells, but skip the first cell:
    daycells = row.select('> td')[1:]
    rowspan_offset = 0
    for daynum, daycell in enumerate(daycells, 1):
        # rowspan handling; if there is a rowspan here, adjust to find correct position
        daynum += rowspan_offset
        while rowspans.get(daynum, 0):
            rowspan_offset += 1
            rowspans[daynum] -= 1
            daynum += 1
        # now we have a correct day number for this cell, adjusted for
        # rowspanning cells.
        # update the rowspan accounting for this cell
        rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
        if rowspan:
            rowspans[daynum] = rowspan
        texts = daycell.select("table > tr > td > font")
        if texts:
            # class info found
            teacher, classroom, course = (c.get_text(strip=True) for c in texts)
            roster.append({
                'blok_start': block,
                'blok_eind': block + rowspan,
                'dag': daynum,
                'leraar': teacher,
                'lokaal': classroom,
                'vak': course
            })
    # days that were skipped at the end due to a rowspan
    while daynum < 5:
        daynum += 1
        if rowspans.get(daynum, 0):
            rowspans[daynum] -= 1
This produces correct output:
[{'blok_eind': 2,
'blok_start': 1,
'dag': 5,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021',
'vak': u'WEBD'},
{'blok_eind': 3,
'blok_start': 2,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'WEBD'},
{'blok_eind': 4,
'blok_start': 3,
'dag': 5,
'leraar': u'DOODF000',
'lokaal': u'ALK C212',
'vak': u'PROJ-T'},
{'blok_eind': 5,
'blok_start': 4,
'dag': 3,
'leraar': u'BLEEJ002',
'lokaal': u'ALK B021B',
'vak': u'MENT'},
{'blok_eind': 7,
'blok_start': 6,
'dag': 5,
'leraar': u'JONGJ003',
'lokaal': u'ALK B008',
'vak': u'BURG'},
{'blok_eind': 8,
'blok_start': 7,
'dag': 3,
'leraar': u'FLUIP000',
'lokaal': u'ALK B004',
'vak': u'ICT algemeen Prakti'},
{'blok_eind': 9,
'blok_start': 8,
'dag': 5,
'leraar': u'KOOLE000',
'lokaal': u'ALK B008',
'vak': u'NED'}]
Moreover, this code will continue to work even if courses span more than 2 blocks, or just one block; any rowspan size is supported.
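To see the rowspan bookkeeping in isolation, here is a minimal sketch that leaves BeautifulSoup out entirely. It is not the code above, just the same idea on a simplified model: `place_cells` and the `(label, rowspan)` row format are hypothetical names made up for illustration, the rowspan is counted in plain rows (so no dividing by 2), and the code is written for Python 3. As in HTML, a cell covered by a span from an earlier row is simply absent from the later row.

```python
def place_cells(rows, ncols):
    """Return (row, col, label, rowspan) tuples with corrected columns.

    `rows` is a simplified, hypothetical table: each row is a list of
    (label, rowspan) pairs, and covered cells are left out of later rows.
    """
    remaining = {}  # column -> number of upcoming rows still covered
    placed = []
    for r, row in enumerate(rows, 1):
        # columns occupied in this row by a cell that started higher up
        blocked = {c for c, n in remaining.items() if n > 0}
        for c in blocked:
            remaining[c] -= 1
        # hand out the free columns, left to right, to the cells present
        free_cols = (c for c in range(1, ncols + 1) if c not in blocked)
        for (label, rowspan), col in zip(row, free_cols):
            placed.append((r, col, label, rowspan))
            if rowspan > 1:
                remaining[col] = rowspan - 1
    return placed


rows = [
    [("A", 1), ("B", 2), ("C", 1)],  # B covers column 2 in rows 1 and 2
    [("D", 1), ("E", 1)],            # only two cells: column 2 is taken
    [("F", 1), ("G", 1), ("H", 1)],
]
placed = place_cells(rows, 3)
for item in placed:
    print(item)
```

In the second row, E ends up in column 3 rather than column 2, which is exactly the correction the `rowspans` dictionary makes for the day numbers. Note that this sketch silently drops cells if a row holds more cells than there are free columns.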
It may be better to use bs4's built-in methods, such as findAll, to parse your table.
You could use the following code:
from pprint import pprint
from bs4 import BeautifulSoup
import requests
r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
trs=table.findAll("tr", {},recursive=False)
tr_count=0
trs.pop(0)
final_table={}
for tr in trs:
    tds=tr.findAll("td", {},recursive=False)
    if tds:
        td_count=0
        tds.pop(0)
        for td in tds:
            if td.has_attr('rowspan'):
                final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                if int(td.attrs['rowspan'])==4:
                    final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
            if final_table.has_key(str(tr_count)+"-"+str(td_count+1)):
                td_count=td_count+1
            td_count=td_count+1
        tr_count=tr_count+1
roster=[]
for i in range(0,10): #iterate over time
    for j in range(0,5): #iterate over day
        item=final_table[str(i)+"-"+str(j)]
        if len(item)!=0:
            block_eind=i+1
            try:
                if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                    block_eind=i+2
            except:
                pass
            try:
                lokaal=item.split('\r\n \n\n')[0]
                leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                vak=item.split('\n \n\r\n')[1]
            except:
                lokaal=leraar=vak="---"
            dayroster = {
                "dag": j+1,
                "blok_start": i+1,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }
            dayroster_double = {
                "dag": j+1,
                "blok_start": i,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }
            #use to prevent double dict for same event
            if dayroster_double not in roster:
                roster.append(dayroster)
print (roster)

Python and BeautifulSoup4 - Extract Text from TD tags

I am stuck after looking through many other questions. My code currently breaks the data into named rows, but it returns the entire line instead of just the text inside it. I am just looking for ASCO VALVE MFG., INC. from the following line, and I am not sure how to pull out just that text from the row:
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">****ASCO VALVE MFG., INC.****</font></td>
My input looks like:
Headers:
<tr>
<td align="center" id="ColHead_0"><font size="3" face="Arial,Helvetica,sans-serif"><b>WH</b></font></td>
<td align="center" id="ColHead_1"><font size="3" face="Arial,Helvetica,sans-serif"><b>OrderNo.</b></font></td>
<td align="center" id="ColHead_2"><font size="3" face="Arial,Helvetica,sans-serif"><b>Cust.</b></font></td>
<td align="left" id="ColHead_3"><font size="3" face="Arial,Helvetica,sans-serif"><b>Customer Name</b></font></td>
<td align="center" id="ColHead_4"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Number</b></font></td>
<td align="center" id="ColHead_5"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Description 1</b></font></td>
<td align="center" id="ColHead_6"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Description 2</b></font></td>
<td align="center" id="ColHead_7"><font size="3" face="Arial,Helvetica,sans-serif"><b>Qty</b></font></td>
<td align="center" id="ColHead_8"><font size="3" face="Arial,Helvetica,sans-serif"><b>S/N </b></font></td>
</tr>
Data rows are as below:
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">09</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">92427</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20668</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">ASCO VALVE MFG., INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77333</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">S/N 50742543</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">50742543</font></td>
</tr>
My code currently breaks the data into named rows, but returns the whole HTML line for each cell.
soup1 = BeautifulSoup(output, "html.parser")
find_string = soup1.body.find_all(text="-")
Customer_No = []
Serial_No = []
rows = soup1.find_all("tr")
title = rows[0]
headers = rows[1]
datarows = rows[2:]
for row in datarows :
    if len(row)> 7:
        WHID = row.contents[1]
        ORNO = row.contents[3]
        CSNO = row.contents[5]
        CSNM = row.contents[7]
        ITNO = row.contents[9]
        DESC = row.contents[11]
        DESC2 = row.contents[13]
        QTY = row.contents[15]
        SN = row.contents[17]
        print ITNO
    else:
        continue
What I am trying to end up with is, I guess, a dictionary of [text in CSNO] and [text in SN] pairs to match against a second CSV file. I hope that all makes sense.
You can extract the text for each element using the .text attribute. Something along the following lines should help you get the idea:
from bs4 import BeautifulSoup
content = '''
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">09</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">92427</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20668</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">ASCO VALVE MFG., INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77333</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">S/N 50742543</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">50742543</font></td>
</tr>'''
soup = BeautifulSoup(content, 'html')
rows = soup.find_all('tr')
for row in rows:
    # search within the current row, not the whole document
    td_cells = row.find_all('td')
    for td_cell in td_cells:
        print(td_cell.text)
Output
09
92427
20668
ASCO VALVE MFG., INC.
EQPRAN77333
RANPAK FILLPAK TT
S/N 50742543
1
50742543
To store the text, you could do the following:
soup = BeautifulSoup(content, 'html')
rows = soup.find_all('tr')
table_text = []
for row in rows:
    row_text = []
    # search within the current row, not the whole document
    td_cells = row.find_all('td')
    for td_cell in td_cells:
        row_text.append(td_cell.text)
    # one list of cell texts per table row
    table_text.append(row_text)
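Since the goal in the question is a set of customer number and serial number pairs, here is a rough sketch of that last step. It assumes `table_text` holds one list of cell texts per row, and that the columns sit where the header row puts them (Cust. at index 2, S/N at index 8). The helper name `pair_columns` and its parameters are made up for illustration, the sample data is typed in by hand so the snippet runs without BeautifulSoup, and it is written for Python 3.

```python
def pair_columns(table_text, key_index=2, value_index=8, min_cells=9):
    """Map the text in one column to the text in another, row by row."""
    pairs = {}
    for row_text in table_text:
        cells = [cell.strip() for cell in row_text]
        # skip rows that are too short or have no value in the key column
        if len(cells) < min_cells or not cells[key_index]:
            continue
        pairs[cells[key_index]] = cells[value_index]
    return pairs


# Hand-typed stand-in for the table_text list built above
table_text = [
    ["09", "92427", "20668", "ASCO VALVE MFG., INC.", "EQPRAN77333",
     "RANPAK FILLPAK TT", "S/N 50742543", "1", "50742543"],
    ["", "", "", "", "", "", "", "", ""],  # a blank row, skipped
]
customer_to_serial = pair_columns(table_text)
print(customer_to_serial)
```

Keep in mind that a plain dictionary keeps only the last serial number per customer. If one customer can appear on several rows, collecting the values in a list per key would be the safer choice.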
