How to make XPath select multiple table elements with identical id attributes? - python

I'm currently trying to extract information from a badly formatted web page. Specifically, the page has used the same id attribute for multiple table elements. The markup is equivalent to something like this:
<body>
  <div id="random_div">
    <p>Some content.</p>
    <table id="table_1">
      <tr>
        <td>Important text 1.</td>
      </tr>
    </table>
    <h4>Some heading in between</h4>
    <table id="table_1">
      <tr>
        <td>Important text 2.</td>
        <td>Important text 3.</td>
      </tr>
    </table>
    <p>How about some more text here.</p>
    <table id="table_1">
      <tr>
        <td>Important text 4.</td>
        <td>Important text 5.</td>
      </tr>
    </table>
  </div>
</body>
Clearly this is invalid HTML, due to the repeated use of the same id across multiple elements.
I'm using XPath to try and extract all the text in the various table elements, utilising the language through the Scrapy framework.
My call looks something like this:
hxs.select('//div[contains(@id, "random_div")]//table[@id="table_1"]//text()').extract()
Thus the XPath expression is:
//div[contains(@id, "random_id")]//table[@id="table_1"]//text()
This returns: [u'Important text 1.'], i.e., the contents of the first table that matches the id value "table_1". It seems to me that once it has come across an element with a certain id it ignores any future occurrences in the markup. Can anyone confirm this?
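For reference, here is a minimal sketch, using plain lxml directly on the markup above (saved locally as sample.html, a name chosen here), that can be used to check the behaviour outside Scrapy:
from lxml import html

# run the same XPath against the sample markup with lxml alone
tree = html.parse("sample.html")
texts = tree.xpath('//div[@id="random_div"]//table[@id="table_1"]//text()')
print([t for t in texts if t.strip()])  # all five 'Important text' strings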
UPDATE
Thanks for the fast responses below. I have tested my code on a page hosted locally, which has the same format as above, and the correct response is returned, i.e.,
`[u'Important text 1.', u'Important text 2.', ..., u'Important text 5.']`
There is therefore nothing wrong with either the XPath expression or the Python calls I'm making.
I guess this means that there is a problem with the webpage itself, which is tripping up either XPath or the HTML parser, which is libxml2.
Does anyone have any advice as to how I can dig into this a bit more?
UPDATE 2
I have successfully isolated the problem. It is actually with the underlying parsing library, which is lxml (which provides Python bindings for the libxml2 C library).
The problem is that the parser is unable to deal with vertical tabs. I have no idea who coded up the site I am dealing with, but it is full of vertical tabs. Web browsers seem to be able to ignore these, which is why running the XPath queries from Firebug on the site in question, for example, is successful.
Further, because the simplified example above doesn't contain vertical tabs, it works fine. For anyone who comes across this issue in Scrapy (or in Python generally), the following fix, which removes vertical tabs from the HTML responses, worked for me:
def parse_item(self, response):
    # remove all vertical tabs from the html response; use response.replace()
    # to build a cleaned copy rather than mutating the original response
    cleaned = response.replace(body=response.body.replace("\v", ""))
    hxs = HtmlXPathSelector(cleaned)
    items = hxs.select('//div[contains(@id, "random_div")]'
                       '//table[@id="table_1"]//text()').extract()
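A more reusable variant would be to strip the vertical tabs in a downloader middleware, so every response is cleaned before any spider sees it. A sketch (the class name here is my own invention; it would need to be enabled via DOWNLOADER_MIDDLEWARES in settings.py):
# hypothetical middleware sketch: strip vertical tabs from every response
class StripVerticalTabsMiddleware(object):
    def process_response(self, request, response, spider):
        if "\v" in response.body:
            return response.replace(body=response.body.replace("\v", ""))
        return response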

With Firebug, this expression:
//table[@id='table_1']//td/text()
gives me this:
[<TextNode textContent="Important text 1.">,
<TextNode textContent="Important text 2.">,
<TextNode textContent="Important text 3.">,
<TextNode textContent="Important text 4.">,
<TextNode textContent="Important text 5.">]
I included the td filtering to give a nicer result, since otherwise you would get the whitespace and newlines between the tags. But all in all, it seems to work.
What I noticed was that you query for //div[contains(@id, "random_id")], while your HTML snippet has a tag that reads <div id="random_div"> -- the _id and _div being different. I don't know Scrapy so I can't really say if that does something, but couldn't that be your issue as well?

count(//div[@id="random_div"]/table[@id="table_1"])
This XPath returns 3 for your sample input, so your problem is not with the XPath itself but with the functions you use to extract the nodes.
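For example, checking it with plain lxml (a quick sketch, with the sample markup saved as sample.html):
from lxml import html

# evaluate the count() expression directly against the sample markup
tree = html.parse("sample.html")
print(tree.xpath('count(//div[@id="random_div"]/table[@id="table_1"])'))  # 3.0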


Select an HTML a tag with specified display content

I'm new to scrapy and have been struggling with this problem for hours.
I need to scrape a page whose source looks something like this:
<tr class="odd">
<td class="pfama_PF02816">Pfam</td>
<td>Alpha_kinase</td>
<td>1389</td>
<td>1590</td>
<td class="sh" style="display: none">21.30</td>
</tr>
I need to get the information in the tr.odd tag, but only if the a tag has the value "Alpha_kinase".
I can get all of that content (including "Alpha_kinase", 1389, 1590 and many other values) and then process the output to keep "Alpha_kinase" only, but that approach is fragile and ugly. Currently I have to do it this way:
positions = response.css('tr.odd td:not([class^="sh"]) td a::text').extract()
then do a for-loop to check.
Is there any conditional expression (like the :not above) that I can put in response.css to solve my problem?
Thanks in advance. Any advice will be highly appreciated!
You can use another selector, response.xpath, to select elements from the HTML, and filter the text with XPath's contains function.
>>> response.xpath("//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]")
[<Selector xpath="//tr[@class='odd']/td/a[contains(text(),'Alpha_kinase')]" data='<a href="http://pfam.xfam.org/family/Alp'>]
I assume there are multiple such tr elements on the page. If so, I would probably do something like:
# get only rows containing 'Alpha_kinase' in link text
for row in response.xpath('//tr[@class="odd" and contains(./td/a/text(), "Alpha_kinase")]'):
    # extract all the information
    item['link'] = row.xpath('./td[2]/a/@href').extract_first()
    ...
    yield item
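If you'd rather keep response.css as the entry point: standard CSS has no pseudo-class for matching text content, but you can chain a CSS selector with an XPath text test on each row. A sketch:
# narrow the rows with CSS first, then test the link text with XPath
for row in response.css('tr.odd'):
    if row.xpath('./td/a[contains(text(), "Alpha_kinase")]'):
        item['link'] = row.xpath('./td[2]/a/@href').extract_first()
        yield item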

Want to pull a journal title from an RCSB Page using python & BeautifulSoup

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll, as these are the only two such spans in the document, but I used findAll so I could at least verify that I'm getting an empty list. I assumed that it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the document requests fetches does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests. There are some alternatives, one being dryscrape.
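Another route is to drive a real browser. A minimal sketch with selenium (assuming the Firefox driver is installed) that lets the browser execute the JavaScript before parsing:
# sketch: render the page in a real browser, then parse the rendered source
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://www.rcsb.org/pdb/explore.do?structureId=1K48')
doc = BeautifulSoup(browser.page_source)
journal = doc.find('span', class_="se_journal")
print(journal.get_text() if journal else "not found")
browser.quit()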
PS: Do not import libraries/modules within a function. Python does not recommend it, and PEP 8 says that:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
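A minimal usage sketch based on the functions and field names given above (assuming describe_pdb is importable from the top-level package; field availability varies by entry):
# sketch: pull citation metadata for the entry from the question
from pypdb import describe_pdb

my_desc = describe_pdb('1K48')
print(my_desc.get('title'))
print(my_desc.get('citation_authors'))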

python, collecting links / script values from page

I am trying to make a program to collect links and some values from a website. It works mostly well, but I have come across a page on which it does not work.
With Firebug I can see the HTML code of the elusive "link" (I can't find it when viewing the page's source, though):
<a class="visit" href="/tet?id=12&mv=13&san=221">
221
</a>
and this is the script:
<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>
I'm wondering how to get both the link ("/tet?id=12&mv=13&san=221") and the string "221", from either the script or the HTML, using selenium, mechanize or requests (or some other library).
I have made an unsuccessful attempt at getting it with mechanize using the br.links() function, which collected a number of links from the page, just not the one I am after.
Extra info (this might be important): to get to the page I have to click on a button with this code:
<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">
load2
</a>
after which a "new page" loads in a part of the window (but the url never changes)
I think you pasted the wrong script of yours ;)
I'm not sure what you need exactly - there are at least two different approaches:
1. Matching all hrefs using a regex
2. Matching specific tags and using get_attribute(...)
For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and use something like the following regex (you will have to escape either the single or the double quotes!):
<a.+?href=['"](.*?)['"].*?/?>
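For instance, a rough sketch of the regex route (regexes are fragile on real-world HTML, so treat this as a quick filter rather than a robust parser):
import re

# pull the href out of every anchor in the rendered source
html = webdriver.page_source
hrefs = re.findall(r'''<a.+?href=['"](.*?)['"].*?/?>''', html, re.DOTALL)
print(hrefs)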
If you need the hrefs of all matching links, you could use something like webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of web elements and iterate through them to get their attributes.
This could result in code like this:
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')
for element in elements:
    hrefs.append(element.get_attribute('href'))
Or a one-liner using a list comprehension:
hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]

Can't see the HTML in the element

I am able to log on and access my account page, here is a sample of the HTML (modified for brevity and to not exceed the URL limit):
<div class='table m_t_4'>
<table class='data' border=0 width=100% cellpadding=0 cellspacing=0>
<tr class='title'>
<td align='center' width='15'><a></a></td>
<td align='center' width='60'></td>
</tr>
<TR bgcolor=>
<td valign='top' align='center'>1</TD>
<td valign='top' align='left'><img src='/images/sale_small.png' alt='bogo sale' />Garden Escape Planters</TD>
<td valign='top' align='right'>13225</TD>
<td valign='top' align='center'>2012-01-17 11:34:32</TD>
<td valign='top' align='center'>FILLED</TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/orderdetail?order=7886745'>7886745</A></TD>
<td valign='top' align='center'><A HREF='https://www.daz3d.com/i/account/req_dlreset?oi=18087292'>Reset</A>
</TR>
Note that the only item I really need is the first HREF with the "order=7886745'>7886745<"...
And there are several of the TR blocks that I need to read.
I am using the following XPath code:
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history
product_block = account_history.find_element_by_xpath("//TR[contains(@bgcolor, '')]")
print product_block
product_link = product_block.find_element_by_xpath("//TR/td/A@HREF")
print product_link
I am using the Firefox version of the Python webdriver.
When I run this, the account_history and product_block XPaths seem to work fine (they print as "none" so I assume they worked), but I get a "the expression is not a legal expression" error on the product_link.
I have 2 questions:
1: Why doesn't the "//TR/td/A@HREF" xpath work? It is supposed to be using the product_block - which should be just the TR segment - so it should start with the TR, then look for the first td that has the HREF...correct?
I tried using the exact case used in the HTML, but I think it shouldn't matter...
2: What coding do I need to use to see the content (HTML/text) of the elements?
I need to be able to do this to get the URL I need for the next page to call.
I would also like to see for sure that the correct HTML is being read here...that should be a normal part of debugging, IMHO.
How is the element data stored? Is it in an array or table that I can read using Python? It has to be available somewhere, in order to be of any use in testing - doesn't it?
I apologize for being so confused, but I see a lot of info on this on the web, and yet much of it either doesn't do anything, or it causes an error.
There do not seem to be any "standard" coding rules available...and so I am a bit desperate here...
I really like what I have seen in Selenium up to this point, but I need to get past it in order to make this work!
Edited!
OK, after getting some sleep, the first answer provided the clue - find_elements_by_xpath creates a list...so I used that to find all of the xpath("//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]") elements in the entire history, then accessed the list it created...and wrote it to a file to be sure of what I was seeing.
The revised code:
links = open("listlinks.txt", "w")
browser.get('https://www.daz3d.com/i/account/orderitem_hist?')
account_history = browser.find_element_by_xpath("//div[@class='table m_t_4']")
print account_history.get_attribute("div")
product_links = account_history.find_elements_by_xpath(
    "//a[contains(@href,'https://www.daz3d.com/i/account/orderdetail?order=')]")
print str(len(product_links)) + ' elements'
for index, item in enumerate(product_links):
    link = item.get_attribute("href")
    links.write(str(index) + '\t' + str(link) + '\n')
And this gives me the file with the links I need...
0 https://www.daz3d.com/i/account/orderdetail?order=7905687
1 https://www.daz3d.com/i/account/orderdetail?order=7886745
2 https://www.daz3d.com/i/account/orderdetail?order=7854456
3 https://www.daz3d.com/i/account/orderdetail?order=7812189
So simple I couldn't see it for tripping over it...
Thanks!
1: Why doesn't the "//TR/td/A@HREF" xpath work? It is supposed to be
using the product_block - which should be just the TR segment - so
it should start with the TR, then look for the first td that has the
HREF...correct?
WebDriver only returns elements, not attributes of said elements, thus:
"//TR/td/A"
works, but
"//TR/td/A#HREF"
or
"//TR/td/A#ANYTHING"
does not.
2: What coding do I need to use to see the content (HTML/text) of the
elements?
To retrieve the innertext:
string innerValue = element.Text;
To retrieve the innerhtml:
This is a little harder: you would need to iterate through each of the child elements and reconstruct the HTML from that - or you could process the HTML with a scraping tool.
To retrieve an attribute:
string hrefValue = element.GetAttribute("href");
(C#, hopefully you can make the translation to Python)
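A rough Python translation of the snippets above (a sketch; python-selenium spells the accessor get_attribute, and asking it for innerHTML is a common shortcut for the inner markup, though support can vary by browser):
# Python equivalents, element being a selenium WebElement
inner_value = element.text                       # the innertext
href_value = element.get_attribute("href")       # a specific attribute
inner_html = element.get_attribute("innerHTML")  # inner markup shortcut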
There are also other ways to access an element than browser.find_element_by_xpath.
You can access it by e.g. id or class:
browser.find_element_by_id
browser.find_element_by_link_text
browser.find_element
browser.find_element_by_class_name
browser.find_element_by_css_selector
browser.find_element_by_name
browser.find_element_by_partial_link_text
browser.find_element_by_xpath
browser.find_element_by_tag_name
Each of the above has a similar function which returns a list (just replace element with elements).
Note: I have separated the top two rows as I think they might help you.
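For example, with the Reset links from the question's HTML (a sketch):
# the singular form returns the first match (or raises NoSuchElementException);
# the plural form returns a list, which is empty when nothing matches
first_reset = browser.find_element_by_link_text('Reset')
all_resets = browser.find_elements_by_link_text('Reset')
print(len(all_resets))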

Regex try and match until hitting end tag in python

I'm looking for a bit of help with a regex in Python, and Google is failing me. Basically I'm searching some HTML, and there is a certain type of table I'm searching for: specifically, any table that includes a background color attribute (i.e. BGCOLOR). Some tables have this attribute and some do not. Could someone help me out with how to write a regex that searches for the start of the table, then searches for the BGCOLOR, but stops and moves on if it hits the end of the table?
Here's a very simplified example that will serve the purpose:
<TABLE>
<B>Item 1.</B>
</TABLE>
<TABLE>
BGCOLOR
</TABLE>
<TABLE>
<B>Item 2.</B>
</TABLE>
So we have three tables, but I'm only interested in finding the middle one, which contains 'BGCOLOR'.
The problem with my regex at the moment is that it searches for the starting table tag then looks for 'BGCOLOR' and doesn't care if it passes the table end tag:
tables = re.findall('\<table.*?BGCOLOR=".*?".*?\<\/table\>', text, re.I|re.S)
So it would find the first two tables instead of just the second table. Let me know if anyone knows how to handle this situation.
Thanks,
Michael
Don't use a regular expression to parse HTML. Use lxml or BeautifulSoup.
Don't use regular expressions to parse HTML -- use an HTML parser, such as BeautifulSoup.
Specifically, your situation is basically one of having to deal with "nested parentheses" (where an open "parens" is an opening <table> tag and the corresponding closed parens is the matching </table>) -- exactly the kind of parsing tasks that regular expressions can't perform well. Lots of the work in parsing HTML is exactly connected with this "matched parentheses" issue, which makes regular expressions a perfectly horrible choice for the purpose.
You mention in a comment to another answer that you've had unspecified problems with BS -- I suspect you were trying the latest, 3.1 release (which has gone downhill) instead of the right one; try 3.0.8 instead, as BS's own docs recommend, and you could be better off.
If you've made some kind of pact with Evil never to use the right tool for the job, your task might not be totally impossible if you don't need to deal with nesting (just matching), i.e., there is never a table inside another table. In this case you can identify one table with r'<\s*TABLE(.*?)<\s*/\s*TABLE' (with suitable flags such as re.DOTALL and re.I); loop over all such matches with the finditer method of regular expressions; and in the loop's body check whether BGCOLOR (in a case-insensitive sense) happens to be inside the body of the current match. It's still going to be more fragile, and more work, than using an HTML parser, but while definitely an inferior choice it needs not be a desperate situation.
If you do have nested tables to contend with, then it is a desperate situation.
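A sketch of that regex fallback (assuming no nested tables, per the caveat above; text holds the HTML, as in the question's code):
import re

# loop over each <TABLE>...</TABLE> span and keep those mentioning BGCOLOR
table_re = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)
for match in table_re.finditer(text):
    if 'bgcolor' in match.group(1).lower():
        print(match.group(0))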
If your task is just this simple, here's a way: split on </TABLE>, then iterate over the pieces and look for the required pattern.
myhtml="""
<TABLE>
<B>Item 1.</B>
</TABLE>
some text1
some text2
some text3
<TABLE>
blah
BGCOLOR
blah
</TABLE>
some texet
<TABLE>
<B>Item 2.</B>
</TABLE>
"""
for tab in myhtml.split("</TABLE>"):
    if "<TABLE>" in tab and "BGCOLOR" in tab:
        print ''.join(tab.split("<TABLE>")[1:])
output
$ ./python.py
blah
BGCOLOR
blah
Here's the code that ended up working for me. It finds the correct table and wraps it in extra tags so that it can be picked out of the group by the open and close 'realTable' tags.
import re
# BeautifulSoup 3 exposes Tag and NavigableString from the top-level module
from BeautifulSoup import BeautifulSoup, Tag, NavigableString

soup = BeautifulSoup(''.join(text))
for p in soup.findAll('table'):
    pattern = '.*BGCOLOR.*'
    if re.match(pattern, str(p), re.S | re.I):
        tags = Tag(soup, "realTable")
        p.replaceWith(tags)
        text = NavigableString(str(p))
        tags.insert(0, text)
print soup
prints this out:
<table><b>Item 1.</b></table>
<realTable><table>blah BGCOLOR blah</table></realTable>
<table><b>Item 2.</b></table>
