HTML Parsing Table - BeautifulSoup

I am attempting to parse the second table shown below using BeautifulSoup. I am having trouble distinguishing the second table from the first because the tables' attributes are exactly the same. How do I access the information in the table that contains name = PATHWAY? What I have used so far to attempt to access the table is:
table = soup.find('table', {'name':'PATHWAY'})
I receive a response of "None" although I know the table is present. To me this means that my method to distinguish between the two is not working. Any suggestions?
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="REACTION TYPE">REACTION TYPE</td><td class=ue>ORGANISM</td><td class=ue>COMMENTARY</td><td class=ue>LITERATURE</td></tr>
<tr class=tr1>
<td class=g>condensation</td><td class=no>-</td><td class=no>-</td><td class=no>-</td></tr>
</table>
</td></tr></table>
<br>
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="PATHWAY">PATHWAY</td><td class=ue>KEGG Link</td><td class=ue>MetaCyc Link</td><td class=ue></td></tr>
<table>

Mu Mind has it right: find the "a" then traverse back up to the parent
soup.find(attrs={"name":"PATHWAY"}).findParent('table')
That's the Python way. There is also a single XPath expression that would do this, but working with XPath axes is more complicated and only worth the effort if it has some specific use (XSLT or JavaScript requirements, for example).

>>> soup.find(attrs={"name":"PATHWAY"})
<a name="PATHWAY">PATHWAY</a>

First:
table = soup.find('table' {'name':'PATHWAY'}
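is not valid Python: the comma between the arguments and the closing parenthesis are missing.
What should this match?
It will only match a table element that itself carries the attribute name="PATHWAY", i.e. <table name="PATHWAY">, and there is no such table in your document.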
Either you iterate through each single table and perform related check inside each table or you iterate over each single node of the tree until you find the related node and then walk up the node hierarchy (by following the parent nodes) until you find a table element. The recursiveChildGenerator() can be used to iterate over all nodes (like in a flat list).

You can use the function form of find:
soup.find(lambda tag: (tag.name=='table' and \
(tag.find('a', attrs={'name': 'PATHWAY'}) is not None)))
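
The two approaches above can be compared side by side. The following is a minimal sketch, assuming BeautifulSoup 4 (the `bs4` package, where `find_parent` is the current spelling of `findParent`) with the built-in `html.parser`, and a shortened, well-formed copy of the question's markup. It suggests that walking up from the anchor yields the inner table that holds the header cells, while the function form of find returns the first matching table in document order, which here is the outer wrapper table.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed copy of the markup from the question (illustrative only).
html = """
<table bgcolor="#DCDCDC"><tr><td>
<table cellspacing="1"><tr>
<td class="ue"><a name="REACTION TYPE">REACTION TYPE</a></td><td class="ue">ORGANISM</td>
</tr><tr><td class="g">condensation</td><td class="no">-</td></tr></table>
</td></tr></table>
<table bgcolor="#DCDCDC"><tr><td>
<table cellspacing="1"><tr>
<td class="ue"><a name="PATHWAY">PATHWAY</a></td><td class="ue">KEGG Link</td>
</tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: find the anchor by its name attribute, then walk up to the
# nearest enclosing table. "name" must go through attrs because find()
# already uses the keyword "name" for the tag name.
anchor = soup.find("a", attrs={"name": "PATHWAY"})
inner_table = anchor.find_parent("table")

# Approach 2: pass a function to find(). It returns the first table, in
# document order, that contains the anchor anywhere below it.
outer_table = soup.find(
    lambda tag: tag.name == "table"
    and tag.find("a", attrs={"name": "PATHWAY"}) is not None
)

# Header cells of the table that directly holds the anchor.
headers = [td.get_text(strip=True) for td in inner_table.find("tr").find_all("td")]
print(headers)
```

If the inner table is the one you need, the first approach gets there directly; with the second, you would still have to descend from the wrapper table.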

Related

Getting the data inside a td tag using Selenium Python

I am new to Selenium with Python. I am using Selenium to log into a website. So far I have successfully logged in and navigated to the page I want. On that page I have a table with id "Common". Inside the table body there are a number of table rows. I need to get a particular value, "234", from the table. Below is a rough outline of the HTML. I need the value "234" to be printed in the output window. I am using Python 2.7. Any help is much appreciated.
<div id="Common" class="x6w" theme="medium">
<div id="Common::content" class="x108" theme="medium">
<div>
<div id="pf12" class="x19" theme="medium">
<table cellpadding="0" cellspacing="0" border="0" summary="" role="presentation" style="width: auto">
<tbody>
<tr>
<td class="x4w" theme="medium" colspan="1">
<table cellpadding="0" cellspacing="0" border="0" width="100%" summary="" role="presentation">
<tbody>
<tr><td style="width: 150px"></td><td></td></tr>
<tr>....</tr>
<tr>....</tr>
<tr class="13" theme="medium" id="15"><td class="13" theme="medium"><label class="label-text" theme="medium">ID</label></td><td valign="top" style="padding-left:9px" class="xv" theme="medium">234</td></tr>

There is an authorisation error in the HTML code.
Please provide the link to the complete page.
If you want to know how to iterate through elements, this code snippet from GitHub will help you.

Your problem seems related to locating an element.
You can do research on relative xpath
Anyway, based on your HTML, here's a locator using XPath:
targetElem = driver.find_element_by_xpath("//div[contains(@id,'Common')]//tbody//*[text()='ID']/parent::td/following-sibling::td")
value = targetElem.text

For any web table you access, you should be able to extract the td cells; to do that you need a locator that identifies the cell uniquely. In the case above there is text between the tags, so you can use the XPath text() function, e.g. "//*[text()='234']".
Otherwise, the id attribute can be used to locate the element, and its text property gives you the text to print to the console.

Selenium table search not returning correct text

I'm currently learning how to use Selenium in Python. I have a table and want to retrieve an element from it, but I'm running into some trouble.
<table class="table" id="SearchTable">
<thead>..</thead>
<tfoot>..</tfoot>
<tbody>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here</b>
</a>
</td>
</tr>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here as well</b>
</a>
</td>
</tr>
</table>
Inside this table, I want to access the text in the bold tag but my program isn't returning the correct number of tr, in fact I'm not even sure if its searching the correct stuff.
I have backtracked my problem from the end text and found that the errors started appearing from the line with comment. (I think the code afterwards is wrong as well but I'm focusing on getting the correct table row first)
My code is:
search_table = driver.find_element_by_id("SearchTable")
search_table_body = search_table.find_element(By.TAG_NAME, "tbody")
trs = search_table_body.find_elements(By.TAG_NAME, "tr")
print(trs)  # this does not return the correct number of tr
for tr in trs:
    tds = tr.find_elements(By.TAG_NAME, "td")
    for td in tds:
        href = td.find_element_by_class_name("qtooltip")
        print(href.get_attribute("innerHtml"))
I'm supposed to get the correct number of tr count so I can return the text in the anchor tag but I am stuck. Any help is appreciated. Thanks!

You can get all <b> tags that are children of an <a> tag with the class attribute qtooltip inside a table using a single XPath selector:
//table/descendant::a[@class='qtooltip']/b
Example code:
elements = driver.find_elements_by_xpath("//table/descendant::a[@class='qtooltip']/b")
for element in elements:
    print(element.text)

Extract content of a HTML-file

I've got a HTML-file which looks like this (simplified):
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
What I'd like to extract is the content of <table class="main">; in other words, I'd like to write exactly what is shown above to a file. Note that the example is simplified; around the <table> tags there are many other tags.
I tried to extract the content using the following code:
import lxml.html

root = lxml.html.parse('www.test.xyz').getroot()
# drop empty <b> and <i> elements
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)
tables = root.cssselect('table.main')
The above code works. But the problem is that I got a part twice; see what I mean: The result of the code is:
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
So the problem is that the middle part appears one time too much at the end.
Why is this and how can this be omitted and fixed?
paul t., another Stack Overflow user, told me to use "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". That code prints out exactly the part that I get twice.
I hope the problem is described clearly enough...thanks for any help and any propositions :)

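cssselect('table.main') returns every matching table in the document, including the inner one, which is already contained in the serialized outer table; that is why it shows up twice. You want to select only the tables with class "main" that are not nested inside another table with class "main".
This seems to work fine:
root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')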
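
To see why the inner table is reported twice and what the ancestor test changes, here is a small sketch that uses only the standard library's xml.etree.ElementTree rather than lxml. ElementTree does not support the ancestor axis, so the check is emulated with a parent map; the helper names are made up for this illustration, and the markup is a trimmed version of the example in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the nested tables from the question.
doc = ET.fromstring(
    '<body>'
    '<table class="main">Here is some text.'
    '<table class="main">Here is another text which ends right here.</table>'
    'Here are also some words...'
    '</table>'
    '</body>'
)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def is_main_table(el):
    """True for <table class="main"> elements."""
    return el.tag == "table" and el.get("class") == "main"


def has_main_ancestor(el):
    """Emulates the XPath test ancestor::table[@class="main"]."""
    node = parent_of.get(el)
    while node is not None:
        if is_main_table(node):
            return True
        node = parent_of.get(node)
    return False


# A plain class match finds both tables, so serializing each one
# would print the inner table twice.
all_main = [el for el in doc.iter("table") if is_main_table(el)]

# Keeping only tables without a matching ancestor leaves the outermost one,
# whose serialization already includes the inner table.
outermost = [el for el in all_main if not has_main_ancestor(el)]

print(len(all_main), len(outermost))
```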

downloading zip files with python mechanize

I am using Python 2.7, mechanize, and BeautifulSoup; if it helps, I could also use urllib.
I am trying to download a couple of different zip files that are in different HTML tables. I know which tables the particular files are in (I know whether they are in the first, second, third ... table).
here is the second table in the html format from the webpage:
<table class="fe-form" cellpadding="0" cellspacing="0" border="0" width="50%">
<tr>
<td colspan="2"><h2>Eligibility List</h2></td>
</tr>
<tr>
<td><b>Eligibility File for Met-Ed</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=ME&ftype=1&fname=cmb_me_elig_lst_06_2013.zip">cmb_me_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penelec</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PN&ftype=1&fname=cmb_pn_elig_lst_06_2013.zip">cmb_pn_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PP&ftype=1&fname=cmb_pennelig_06_2013.zip">cmb_pennelig_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for West Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=WP&ftype=1&fname=cmb_wp_elig_lst_06_2013.zip">cmb_wp_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
I was going to use the following code just to get to the 2nd table:
from bs4 import BeautifulSoup
html= br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table", class=fe-form)
I guess that class="fe-form" is wrong because it will not work, but there are no other attributes of the table that differentiates it from the other tables. All tables have cellpadding="0" cellspacing="0" border="0" width="50%". I guess I can't use the find() function.
So I am trying to get to the second table and then download the files on this page. Could someone give me some info to push me in the right direction? I have worked with forms before, but not tables. I wish there was some way to find the particular titles of the zip files I am looking for and then download them, since I will always know their names.
Thanks for any help,
Tom

To select the table you want, simply do
table = soup.find('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
This assumes that there is only one table with class=fe-form and cellpadding=0 in your document. If there are more, this code will select only the first table. To be sure you are not overlooking anything on the page, you could do
tables = soup.findAll('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
table = tables[0]
And maybe assert that len(tables)==1 to be sure that there is only one table.
Now, to download the files, there is plenty you can do. Assuming from your code that you have loaded mechanize, you could do something like
a_tags = table.findAll('a')
for a in a_tags:
    if '.zip' in a.get('href', ''):
        br.retrieve(a.get('href'), a.text)
That would download all files to your current working directory and would name them according to their link text.
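
Since the links in the question are relative and carry the file name in an fname query parameter, it may help to resolve them and pick out the names before downloading. The following is a rough sketch using only the standard library; the page URL and the zip_downloads helper are hypothetical, and the code is written for Python 3 (in Python 2.7 the same functions live in the urlparse module).

```python
from urllib.parse import urljoin, urlparse, parse_qs

# Hypothetical page URL; the real one is not given in the question.
page_url = "https://www.example.com/supplierservices/eligibility_list.html"

# Two hrefs copied from the question's table plus one unrelated link.
hrefs = [
    "/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html"
    "?id=ME&ftype=1&fname=cmb_me_elig_lst_06_2013.zip",
    "/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html"
    "?id=PN&ftype=1&fname=cmb_pn_elig_lst_06_2013.zip",
    "/about.html",
]


def zip_downloads(page_url, hrefs):
    """Return (absolute_url, filename) pairs for links whose fname ends in .zip."""
    found = []
    for href in hrefs:
        # parse_qs maps each query parameter to a list of its values.
        names = parse_qs(urlparse(href).query).get("fname", [])
        if names and names[0].lower().endswith(".zip"):
            # urljoin resolves the relative href against the page URL.
            found.append((urljoin(page_url, href), names[0]))
    return found


downloads = zip_downloads(page_url, hrefs)
for url, filename in downloads:
    print(filename, url)
```

Each pair could then be handed to whatever download call you use, and because the file name is available separately, you can also filter for the specific names you already know.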

CSS Selectors, Choose by CHILD values

Let say I have an html structure like this:
<html><head></head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<td>Left</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Center</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Right</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I would like to construct CSS selectors to access the three sub tables, which are only distinguished by the contents of a table data item in their first row.
How can I do this?

I don't think CSS selectors offer any way to match on an element's inner text.
You can achieve that by using XPath or a jQuery selector.
XPath:
"//td[contains(text(),'Left')]"
or
"//td[text()='Right']"
jQuery selector:
jQuery("td:contains('Center')")
Using the logic below, you can execute jQuery selectors in WebDriver automation.
JavascriptExecutor js = (JavascriptExecutor) driver;
WebElement element=(WebElement)js.executeScript(locator);

The .text property of an element returns the element's text.
tables = page.find_elements_by_xpath('.//table')
contents = "Left Center Right".split()
results = []
for table in tables:
    if table.find_element_by_xpath('.//td').text in contents:  # find_element returns only the first match
        results.append(table)
You can narrow the search by setting 'page' to the first 'table' element and then running your search over that. There are all kinds of ways to improve performance like this. Note that this method will be fairly slow if there are a lot of extraneous tables present. Each webpage has its quirks in how it represents information; make sure you work around those to gain efficiency.
You can also use a list comprehension to return your results.
results = [t for t in tables if t.find_element_by_xpath('.//td').text in contents]
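
The same filtering idea can be tried offline without a browser. This is a sketch that substitutes the standard library's xml.etree.ElementTree for Selenium, using a compacted copy of the question's markup; first_cell_text is a hypothetical helper. One difference to keep in mind: ElementTree's .text holds only the text directly inside an element, whereas Selenium's .text also includes the text of descendants, so with Selenium the outer wrapper table may pass the check as well.

```python
import xml.etree.ElementTree as ET

# Compacted copy of the structure from the question.
page = ET.fromstring("""
<html><head></head>
<body>
<table>
<tr>
<td><table><tr><td>Left</td></tr></table></td>
<td><table><tr><td>Center</td></tr></table></td>
<td><table><tr><td>Right</td></tr></table></td>
</tr>
</table>
</body>
</html>
""".strip())


def first_cell_text(table):
    """Text directly inside the first <td> below the table, in document order."""
    cell = table.find(".//td")
    return (cell.text or "").strip() if cell is not None else ""


# Map each label to the sub-table whose first cell carries that label.
by_label = {}
for table in page.iter("table"):
    label = first_cell_text(table)
    if label in ("Left", "Center", "Right"):
        by_label[label] = table

print(sorted(by_label))
```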
