I'm trying to parse an HTML email from python code to extract various details and would appreciate a regular expression or two to help achieve this as it is too complex for my limited regex understanding. e.g. look for 'Travel Date' and extract 'October 30 2018 (Tue)'.
In all cases there is a field name contained within <td> tags followed by the field value contained within another set of <td> tags. Sometimes the name and value are contained within the same row <tr> tags (Case 1) and other times they are in separate row tags (Case 2). Other items like <span> and <img> need to be skipped over as well.
Case 1
<tr>
<td colspan="2"> </td></tr>
<tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #777777;">Travel Date</td>
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #444444;">October 30 2018 (Tue)</td>
</tr>
Case 2
<tr><td style="vertical-align: top;">
<span style="font-size: 10px; font-family: Arial; color: #999999; font-weight: bold; line-height: 19px; text-transform: uppercase;">Drop-off to Address</span>
</td></tr>
<tr><td style="vertical-align: top;">
<span style="font-size: 13px; font-family: Arial; color: #444444;"><img style="vertical-align:text-bottom;" src="https://d1lk4k9zl9klra.cloudfront.net/Email/Common/address_icon.png" alt="" width="14" height="14" /> 200 George St, Sydney NSW 2000, Australia</span>
</td></tr>
Instead of using regex, I would use Beautiful Soup. It makes it easier to go through HTML elements and scrape what you need. If you know the relationship between the key and value, then you could use that to extract information. Here's an example for case 1:
In [8]: from bs4 import BeautifulSoup
In [9]: text = """
...: <tr>
...: <td colspan="2"> </td></tr>
...: <tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#777777;">Travel Date</td>
...: <td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#444444;">October 30 2018 (Tue)</td>
...: </tr>"""
In [11]: soup = BeautifulSoup(text, 'lxml')
In [13]: soup.find_all('td')
Out[13]:
[<td colspan="2"> </td>,
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#777777;">Travel Date</td>,
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color:
#444444;">October 30 2018 (Tue)</td>]
In [15]: for tag in soup.find_all('td'):
...: if tag.text == "Travel Date":
...: print tag.find_next().text
...:
October 30 2018 (Tue)
Beautiful Soup gives a lot of flexibility when scraping HTML from the web.
Related
I have an HTML tag like the following:
print(tag)
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
I can update the value from 10 to 15 in the tag with BeautifulSoup:
tag.p.contents[0].replaceWith(str(15))
However, I haven't figured out a way to update the values in the style tags; because they seem to be a part of their parent, or 'base' tags.
For example, how would I update the tag to the following?
print(tag) -->
<td style="background-color: #762157;">
<p style="
margin: 0;
font-size: 12px;
line-height: 17px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
I change the background-color to #762157 and line-height to 17px;
In the context of the html you posted, style is not a tag, but an attribute of the p tag. Here is one way to modify that p attribute (and you can apply the same for td style attribute):
from bs4 import BeautifulSoup as bs
html = '''
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
'''
soup = bs(html, 'html.parser')
print('OLD SOUP')
print(soup.prettify())
print('______________')
p_style_attribute = soup.select_one('p').get('style')
new_p_style_attr = '''
margin: 3;
font-size: 17px;
line-height: 26px;
font-family: Arial, sans-serif;
text-align: center;
'''
soup.select_one('p')['style'] = new_p_style_attr
print('NEW SOUP')
print(soup.prettify())
This will print out in terminal:
OLD SOUP
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">
10
</p>
</td>
______________
NEW SOUP
<td style="background-color: #e5e5e5;">
<p style="
margin: 3;
font-size: 17px;
line-height: 26px;
font-family: Arial, sans-serif;
text-align: center;
">
10
</p>
</td>
Here is the documentation for BeautifulSoup:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html
A regex approach. Collect the new information about the style into a dictionary, (tag_name, attr_value)-pairs and pass it to update_style function for an in-place modification of the soup.
The substitution is invariant wrt white spaces and ;. A **subs could also be possible.
from bs4 import BeautifulSoup as bs
import re
def update_style(soup, subs):
for tag_name, attr in subs.items():
tag = soup.find(tag_name)
k, v = attr.split(':')
tag['style'] = re.sub(rf'{k}:(.+);', f'{k}: {v.strip(" ;")};', tag['style'])
html = '''
<td style="background-color: #e5e5e5;">
<p style="
margin: 0;
font-size: 12px;
line-height: 16px;
font-family: Arial, sans-serif;
text-align: center;
">10</p>
</td>
'''
soup = bs(html, 'lxml')
subs = {'td': 'background-color: #762157;', 'p': 'line-height: 17px;'}
update_style(soup, subs)
print(soup.prettify())
Here is a HTML table:
<table width="100%" cellpadding="4" cellspacing="0" style="page-break-before: always">
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<col width="32*"/>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">A</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">B</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">C</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">D</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">E</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">F</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">G</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">H</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">I</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">J</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">K</font></font></font></p>
</td>
<td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">L</font></font></font></p>
</td>
</tr>
<tr valign="top">
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O2</font></font></font></p>
</td>
<td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P</font></font></font></p>
</td>
<td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
<font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P2</font></font></font></p>
</td>
</tr>
</table>
The last row here has 2x more columns than others. When I'm trying to read it into the Pandas dataframe I get this result:
table = pd.read_html('1111.html')
table[0]
0 1 2 3 4 5 6 7
0 A A B B C C D D
1 E E F F G G H H
2 I I J J K K L L
3 M M2 N N2 O O2 P P2
How to read it correctly, without dubbing? I don't need the last row.
You can use BeautifulSoup to parse the table and then convert the results to a dataframe:
import pandas as pd
from bs4 import BeautifulSoup as soup
df = pd.DataFrame([[k[1:-1] for i in b.find_all('td') if (k:=i.text) is not None] for b in soup(html, 'html.parser').table.find_all('tr')])
Output:
0 1 2 3 4 5 6 7
0 A B C D None None None None
1 E F G H None None None None
2 I J K L None None None None
3 M M2 N N2 O O2 P P2
Edit: solution without assignment expression:
df = pd.DataFrame([[i.text[1:-1] if i else i for i in b.find_all('td')] for b in soup(html, 'html.parser').table.find_all('tr')])
Output:
0 1 2 3 4 5 6 7
0 A B C D None None None None
1 E F G H None None None None
2 I J K L None None None None
3 M M2 N N2 O O2 P P2
I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this as it's giving me an AssertionError...trying to load 4 columns when there are only 3 columns in the data frame.
Here is the beginning of the html table showing the headers and the first row of data:
<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>
Below is the code that I wrote to read the data into a dataframe:
table = soup.find('table', id = 'tableid1')
table_rows = table.find_all('tr')
allData=[]
for tr in table_rows:
td = tr.find_all('td')
row = [i.text for i in td]
allData.append(row)
headers = allData.pop(0)
self.d1_bundle_df = pd.DataFrame(allData, columns = headers)
When the above code is running, it generates the following error:
AssertionError: 3 columns passed, passed data had 4 columns
What's the best way to handle these nested tables?
This is still relatively new to me, so any direction would be greatly appreciated.
Problem is you are searching in row for all <td>, but these <td> can contain other <td> in your case. One solution is use CSS selectors and search only for <td> which don't have other <td>:
data = '''<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('#tableid1 > tr'):
rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])
from pprint import pprint
pprint(rows)
Prints:
[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]
The CSS selector #tableid1 > tr will search for all <tr> that are directly under the tag with id=tableid1
The CSS selector td:not(:has(td)) will search for all <td> that don't contain other <td>.
Further reading:
CSS Selectors Reference
I am building HTML table from the list through lxml.builder and striving to make a link in one of the table cells
List is generated in a following way:
with open('some_file.html', 'r') as f:
table = etree.parse(f)
p_list = list()
rows = table.iter('div')
p_list.append([c.text for c in rows])
rows = table.xpath("body/table")[0].findall("tr")
for row in rows[2:]:
p_list.append([c.text for c in row.getchildren()])
HTML file which I parse is the same that is generated further by lxml, i.e. I set up some sort of recursion for testing purposes.
And here is how I build table
from lxml.builder import E
page = (
E.html(
E.head(
E.title("title")
),
E.body(
....
*[E.tr(
*[
E.td(E.a(E.img(src=str(col)))) if ind == 8 else
E.td(E.a(str(col), href=str(col))) if ind == 9 else
E.td(str(col)) for ind, col in enumerate(row)
]
) for row in p_list ]
When I specify link via literals all is going fine.
E.td(E.a("link", href="url_address"))
However, when I try to output list element value (which is https://blahblahblah.com) as a link
E.td(E.a(str(col), href=str(col)))
cell is empty, just nothing is showed in the cell.
If I specify link text as a literal and put str (col) into href, the link is showed normally, but instead of real href it contains the name of the generated html file.
If I output just that col value as a string
E.td(str(col))
it is showed normally, i.e. it is not empty. What is wrong with E.a and E.img elements?
Just noticed that this happens only if I build list from html file. When I build list manually, like this, all is output fine.
p_list = []
p_element = ['id']
p_element.append('value')
p_element.append('value2')
p_list.append(p_element)
Current output (pay attention to <a> and <href> tags)
<html>
<head>
<title>page</title>
</head>
<body>
<style type="text/css">
th {
background-color: DeepSkyBlue;
text-align: center;
vertical-align: bottom;
height: 150px;
padding-bottom: 3px;
padding-left: 5px;
padding-right: 5px;
}
.vertical {
text-align: center;
vertical-align: middle;
width: 20px;
margin: 0px;
padding: 0px;
padding-left: 3px;
padding-right: 3px;
padding-top: 10px;
white-space: nowrap;
-webkit-transform: rotate(-90deg);
-moz-transform: rotate(-90deg);
}</style>
<h1>title</h1>
<p>This is another paragraph, with a</p>
<table border="2">
<tr>
<th>
<div class="vertical">ID</div>
</th>
...
<th>
<div class="vertical">I blacklisted him</div>
</th>
</tr>
<tr>
<td>1020</td>
<td>ТаисияСтрахолет</td>
<td>No</td>
<td>Female</td>
<td>None</td>
<td>Санкт-Петербург</td>
<td>Росiя</td>
<td>None</td>
<td>
<a>
<img src="
"/>
</a>
</td>
<td>
<a href="
">
</a>
</td>
...
</tr>
</table>
</body>
</html>
Desired output
<html>
<head>
<title>page</title>
</head>
<body>
<style type="text/css">
th {
background-color: DeepSkyBlue;
text-align: center;
vertical-align: bottom;
height: 150px;
padding-bottom: 3px;
padding-left: 5px;
padding-right: 5px;
}
.vertical {
text-align: center;
vertical-align: middle;
width: 20px;
margin: 0px;
padding: 0px;
padding-left: 3px;
padding-right: 3px;
padding-top: 10px;
white-space: nowrap;
-webkit-transform: rotate(-90deg);
-moz-transform: rotate(-90deg);
}</style>
<h1>title</h1>
<p>This is another paragraph, with a</p>
<table border="2">
<tr>
<th>
<div class="vertical">ID</div>
</th>
...
<th>
<div class="vertical">I blacklisted him</div>
</th>
</tr>
<tr>
<td>1019</td>
<td>МихаилПавлов</td>
<td>No</td>
<td>Male</td>
<td>None</td>
<td>Санкт-Петербург</td>
<td>Росiя</td>
<td>C.-Петербург</td>
<td>
<a>
<img src="http://i.imgur.com/rejChZW.jpg"/>
</a>
</td>
<td>
link
</td>
...
</tr>
</table>
</body>
</html>
Got it myself. The problem was not in generating but in parsing HTML. Parsing function didn't fetch IMG and A tags nested in TD and these elements of the list were empty. Due to the rigorous logic of the program (fetching from file + fetching from site API) I wasn't able to detect the cause of the issue.
The correct parsing logic should be:
for row in rows[1:]:
data.append([
c.find("a").text if c.find("a") is not None else
c.find("img").attrib['src'] if c.find("img") is not None else
c.text
for c in row.getchildren()
])
Just a guess but it looks like 'str()' could be escaping it. Try E.td(E.a(col, href=col))
To start off here's my current code in its entirety:
import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re
page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'
sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)
tablelist = soup.findAll('table')
class MyParser(sgmllib.SGMLParser):
def parse(self, segment):
self.feed(segment)
self.close()
def __init__(self, verbose=0):
sgmllib.SGMLParser.__init__(self, verbose)
self.descriptions = []
self.inside_td_element = 0
self.starting_description = 0
def start_td(self, attributes):
for name, value in attributes:
if name == "valign":
self.inside_td_element = 1
self.starting_description = 1
else:
self.inside_td_element = 1
self.starting_description = 1
def end_td(self):
self.inside_td_element = 0
def handle_data(self, data):
if self.inside_td_element:
if self.starting_description:
self.descriptions.append(data)
self.starting_description = 0
else:
self.descriptions[-1] += data
def get_descriptions(self):
return self.descriptions
counter = 0
trlist = []
dtablelist = []
while counter < len(tablelist):
trsegment = tablelist[counter].findAll('tr')
trlist.append(trsegment)
strsegment = str(trsegment)
myparser = MyParser()
myparser.parse(strsegment)
sub = myparser.get_descriptions()
dtablelist.append(sub)
counter = counter + 1
ex = []
dtablelist = [s for s in dtablelist if s != ex]
So what I want to accomplish is take all the tables from an html document, then reprint them onto an Excel spreadsheet. So when I create trlist the output looks like this:
print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT- SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">< <font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
</tr>,...
As you can see each item in trlist is each individual row ( . . . ) of the table which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags I get this output:
print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']
As you can see, the output is each of the contents as their own individual string, instead of a list of the contents of each table row (). So essentially I want the output:
[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]
Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?
Using lxml.html:
>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]
And here is some more complete code. It stores the text in a list containing a list of tables, and each table has a list of tr's, and each tr has a list of all the text.
import urllib
import lxml.html
data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)
tables = []
for tbl in tree.iterfind('.//table'):
tele = []
tables.append(tele)
for tr in tbl.iterfind('.//tr'):
text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
tele.append(text)
print tables
Hope this helps, cheers!
If somebody is searching for a solution of the same problem but is using python 3:
You don't have to use an external library for parsing an HTML table even if you are using python 3. There the SGMLParser class was replaced by HTMLParser from html.parser. I've written code for a simple derived HTMLParser class. It is here in a github repo. It simply does remember the current scope of a <td>, <tr> or <table> tag. The advantages over using etree are that it runs correctly on non-xml-compliant html and that it doesn't use external libraries.
You can use that class (here named HTMLTableParser) the following way:
import urllib.request
from html_table_parser import HTMLTableParser
target = 'http://www.twitter.com'
# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')
# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
The output of this is a list of 2D-lists representing tables. It looks maybe like this:
[[[' ', ' Anmelden ']],
[['Land', 'Code', 'Für Kunden von'],
['Vereinigte Staaten', '40404', '(beliebig)'],
['Kanada', '21212', '(beliebig)'],
...
['3424486444', 'Vodafone'],
[' Zeige SMS-Kurzwahlen für andere Länder ']]]