Extract content of a HTML-file

Extract content of a HTML-file - python

I've got a HTML-file which looks like this (simplified):
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
What I'd like to extract is the content of "table class="main"", so in explicit words, I'd like to extract the same as it is written above to a file. Consider: The example is simplified; around the -tags, there are many others...
I tried to extract the content using the following code:
root = lxml.html.parse('www.test.xyz').getroot()
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
empty.getparent().remove(empty)
tables = root.cssselect('table.main')
The above code works. But the problem is that I got a part twice; see what I mean: The result of the code is:
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
So the problem is that the middle part appears one time too much at the end.
Why is this and how can this be omitted and fixed?
paul t., also a stackoverflow-user, told me to use "root.xpath('//table[#class="main" and not(.//table[#class="main"])]')". This code prints out exactly the part I have twice.
I hope the problem is described clearly enough...thanks for any help and any propositions :)

You want to select all the tables with class "main" which are not already selected as descendants of the same elements.
This seems to work fine:
root.xpath('//table[#class="main" and not(ancestor::table[#class="main"])]')

Related

How do I fetch table with attributes using BeautifulSoup? [duplicate]

This question already has answers here:
How to find tags with only certain attributes - BeautifulSoup
(8 answers)
Closed 1 year ago.
How can I fetch table with other attributes using beautifulSoup?
I have this table :
I have tried this but didn't succeed :
Soup.find_all('table',{'style':"width=100%;align:center;border:0;cellpadding:0;cellspacing:0"})

You can do this:
from bs4 import BeautifulSoup
html_source = '''
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
'''
soup = BeautifulSoup(html_source, 'html.parser')
els = list(soup.find("table",attrs={"width":"100%","border":"0","align":"center"}))
print(' '.join(map(str, els)))
You can provide as many attributes as you want in that dictionary. You can get a bit more information from here.
Example: https://www.napkin.io/n/c174e5c8557d402d

Getting the data inside a td tag using Selenium Python

I am new to Selenium Python. I am using Selenium for logging into a website. So far I have successfully logged in and navigated to the page I want. In that page I have a table with id "Common". Inside the table body I have a no. of table rows. I need to get a particular the value "234" from the table. Below is the rough look of the HTML. I need the value "234" to be printed in the output window. I am using Python 2.7. Any help is much appreciated.
<div id="Common" class="x6w" theme="medium">
<div id="Common::content" class="x108" theme="medium">
<div>
<div id="pf12" class="x19" theme="medium">
<table cellpadding="0" cellspacing="0" border="0" summary="" role="presentation" style="width: auto">
<tbody>
<tr>
<td class="x4w" theme="medium" colspan="1">
<table cellpadding="0" cellspacing="0" border="0" width="100%" summary="" role="presentation">
<tbody>
<tr><td style="width: 150px"></td><td></td></tr>
<tr>....</tr>
<tr>....</tr>
<tr class="13" theme="medium" id="15"><td class="13" theme="medium"><label class="label-text" theme="medium">ID</label></td><td valign="top" style="padding-left:9px" class="xv" theme="medium">234</td></tr>

there is a authorisation error, in the html code
please provide the link for the page[Complte]
I you want to know how to iterate through elements this code snippet from Github will help you

Your problem seems related to locating an element.
You can do research on relative xpath
Anyways, base on your html, here's the locator using xpath
targetElem = driver.find_element_by_xpath("//div[contains(#id,'Common')]//tbody/*[text()='ID']/parent::td/following-sibling::td") value = targetElem.text

For any webtable that you are accessing you should be able to extract the the td columns and for that you need to identify that way to make the unique locator like in the case above, you have text in between tags so you can xpath using javascript function "//*[text()='234]"
otherwise the id attribute can be used to extract that element and using the "text" function you can get the text printed on console.

Click without id or name (Ghost)

I'm using Ghost for Python 2.7 and I'm trying to click in a link which is in a table. The problem is that I have no ID, name... This is the HTML code:
<table id="table_webbookmarkline_2" cellpadding="4" cellspacing="0" border="0" width="100%">
<tr valign="top">
<td>
<a href="/dana/home/launch.cgi?url=.ahuvs%3A%2F%2Fhq0l5458452ERA-w-Xz8G3LKe8JNM%2F.ISDXWXaWXUivecOc" target="_blank" onClick='javascript:openBookmark(
this.href, "yes", "yes");
return false;' ><img src="/dana-cached/imgs/icn18x18WebBookmarkPop.gif" alt="This will open in a new TAB" width="18" height="18" border="0" ></a>
</td>
<td width="100%" align="left">
<a href="/dana/home/launch.cgi?url=.ahuvs%3A%2F%2Fhq0l5458452ERA-w-Xz8G3LKe8JNM%2F.ISDXWXaWXUivecOc" target="_blank" onClick='JavaScript:openBookmark(
this.href, "yes", "yes");
return false;' ><b>**LINK WHERE I WANT TO CLICK**</b> </a><br><span class="cssSmall"></span>
</td>
</tr>
</table>
How can I click in this kind of link ?

Seems like Ghost's Session.click() takes a CSS selector. Here only the table has an ID, so a selector that takes the second td that is a descendant of that ID and finds the a element should work:
session.click('#table_webbookmarkline_2 td:nth-child(2) a')

Is there anyway to check if given XPath is valid in Python?

I have a python code that is extracting some information from a table. But the thing is sometimes the Xpath changes. Right now it only changes between two different XPath's that looks like this:
//*[#id='content-primary']/table[3]/tbody/tr[td[1]/span/span/
and the other alternative is a slight change in the table like this:
//*[#id='content-primary']/table[2]/tbody/tr[td[1]/span/span/
this is the code that i am using right now to get the information that i need:
rows_xpath = XPath("//*[#id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
So what i want to do is a check if the given XPath is valid. If it is not i just try the other XPath alternative.
Hope somebody can help me with this problem. Thank you all.
EDIT1
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tfoot>
<tr>
<td colspan="3">
<dl>
<dt class="clNotify">Röd text</dt>
<dd> = Ändrad matchtid </dd>
<dt><img src="http://svenskfotboll.se/i/u/alert.gif" alt="Röda utropstecknet" /></dt>
<dd> = Peka på utropstecknet så visas en notering </dd>
<dt><img src="http://svenskfotboll.se/i/widget.gif" alt="Widget" /></dt>
<dd>Hämta widget för kommande matcher</dd>
</dl>
</td>
</tr>
</tfoot>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span> //This is the date i am checking with first
</td>
<td>Götene IF - Vårgårda IK </td> // The other information that i need from the table later
<td>Sparbanksvallen Götene konstgräs </td>
</tr>

In my situation i did not need to specify which table to extract the information from. Since the information that i will get is specified with the date that only contains in that table i just used this code and it worked out fine for me:
**rows_xpath = XPath("//*[#id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))**
now it is just table which means it will go through both tables in the website. Its not maybe a clean solution but works for me..

HTML Parsing Table - BeautifulSoup

I am attempting to parse the second table seen below using BeautifulSoup. I am having trouble identifying the second table verses the first because the tables attributes are the exact same. How do I access the information in the table such as name = PATHWAY? What I have used so far to attempt to access the table is:
table = soup.find('table', {'name':'PATHWAY'})
I receive a response of "None" although I know the table is present. To me this means that my method to distinguish between the two is not working. Any suggestions?
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="REACTION TYPE">REACTION TYPE</td><td class=ue>ORGANISM</td><td class=ue>COMMENTARY</td><td class=ue>LITERATURE</td></tr>
<tr class=tr1>
<td class=g>condensation</td><td class=no>-</td><td class=no>-</td><td class=no>-</td></tr>
</table>
</td></tr></table>
<br>
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="PATHWAY">PATHWAY</td><td class=ue>KEGG Link</td><td class=ue>MetaCyc Link</td><td class=ue></td></tr>
<table>

Mu Mind has it right: find the "a" then traverse back up to the parent
soup.find(attrs={"name":"PATHWAY"}).findParent('table')
That's the python way....There is a single xpath command but operating with xpath on axis is more complicated and only worth the effort it it has some specific use (xslt or javascript requirements eg)

>>> soup.find(attrs={"name":"PATHWAY"})
<a name="PATHWAY">PATHWAY</a>

First:
table = soup.find('table' {'name':'PATHWAY'}
is no proper Python code.
What should this match?
This will match only.
Either you iterate through each single table and perform related check inside each table or you iterate over each single node of the tree until you find the related node and then walk up the node hierarchy (by following the parent nodes) until you find a table element. The recursiveChildGenerator() can be used to iterate over all nodes (like in a flat list).

You can use the function form of find:
soup.find(lambda tag: (tag.name=='table' and \
(tag.find('a', attrs={'name': 'PATHWAY'}) is not None)))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract content of a HTML-file - python

You want to select all the tables with class "main" which are not already selected as descendants of the same elements. This seems to work fine: root.xpath('//table[#class="main" and not(ancestor::table[#class="main"])]')

Related

How do I fetch table with attributes using BeautifulSoup? [duplicate]

Getting the data inside a td tag using Selenium Python

Click without id or name (Ghost)

Is there anyway to check if given XPath is valid in Python?

HTML Parsing Table - BeautifulSoup

Categories

Resources