Extracting p within h1 with Python/Scrapy

I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to the W3C; see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?") a p element within an h1 element. I need to extract the text within the p element nevertheless, and cannot figure out how.
I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" in order to recognize any XML tree, but for the life of me I cannot figure out how or where to do that in this instance.
For example, a website has the following HTML:
<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and Dvořák featuring pianist Emanuel Ax
</p>
</h1>
I have made an item called Concert() that has a value called 'title'. In my item loader, I use:
def parse_item(self, response):
    thisconcert = ItemLoader(item=Concert(), response=response)
    thisconcert.add_xpath('title', '//h1[@class="performance-title"]/p/text()')
    return thisconcert.load_item()
This returns, in item['title'], a unicode list that does not include the text inside the p element, such as:
['\n ', '\n ', '\n ']
I understand why, but I don't know how to get around it. I have also tried things like:
from scrapy import Selector
def parse_item(self, response):
    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')
What am I doing wrong here, and how can I parse HTML that contains this problem (p within h1)?
I have referenced the information concerning this specific issue at Behavior of the scrapy xpath selector on h1-h6 tags but it does not provide a complete solution that can be applied to a spider, only an example within a session using a given text string.

That was quite baffling. To be frank, I still do not understand why this is happening. It turns out that the <p> tag that should be contained within the <h1> tag is not actually inside it in the parsed response. A curl of the site shows markup of the form <h1><p> </p></h1>, whereas the response obtained from the site shows it as:
<h1 class="performance-title">\n</h1>
<p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>
As I mentioned, I have my suspicions but nothing concrete. Anyway, the XPath for getting the text inside the <p> tag is therefore:
response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()
This uses the <h1 class="performance-title"> as a landmark and selects its following sibling <p> tag.
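Applied inside the spider, the fix is just a different XPath in the item loader. What is most likely happening is that lxml's HTML parser repairs the invalid nesting by closing the <h1> before the <p>, so in the parsed tree the <p> ends up as a sibling of the heading. A minimal sketch, reusing the Concert item and ItemLoader setup from the question:

def parse_item(self, response):
    thisconcert = ItemLoader(item=Concert(), response=response)
    # In the parsed tree the <p> is a sibling of the <h1>, not a child,
    # so select it via following-sibling instead of a child step.
    thisconcert.add_xpath(
        'title',
        '//h1[@class="performance-title"]/following-sibling::p/text()')
    return thisconcert.load_item()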

//*[@id="content"]/section/article/section[2]/h1/p/text()

Related

How to find the list of all HTML tags which are active on a particular piece of data

I want to parse HTML to convert it to some other format while keeping some of the styles (bold, lists, etc.).
To better explain what I mean, consider the following code:
<html>
<body>
<h2>A Nested List</h2>
<p>List <b>can</b> be nested (lists inside lists):</p>
<ul>
<li>Coffee</li>
<li>Tea
<ul>
<li>Black tea</li>
<li>Green tea</li>
</ul>
</li>
<li>Milk</li>
</ul>
</body>
</html>
Now if I were to select the word "List" at the start of the paragraph, my output should be (html, body, p), since those are the tags active on the word "List".
Another example: if I were to select the words "Black tea", my output should be (html, body, ul, li, ul, li), since it's part of the nested list.
I've seen the Chrome inspector do this, but I'm not sure how I can do it in code using Python.
Here is an image of what the Chrome inspector shows:
[Chrome Inspector screenshot]
I've tried parsing the HTML using Beautiful Soup, and while it is amazing for getting at the data, I was unable to solve my problem with it.
Later I tried Python's html.parser for the same issue, trying to build a stack of all tags seen before a piece of data and popping them off as I encountered the corresponding end tags, but I couldn't get that to work either.
As you said in your comment, it may or may not get you what you want, but it may be a start. So I would try it anyway and see what happens:
from lxml import etree
snippet = """[your html above]"""
root = etree.fromstring(snippet)
tree = etree.ElementTree(root)
targets = ['List','nested','Black tea']
for e in root.iter():
    for target in targets:
        if (e.text and target in e.text) or (e.tail and target in e.tail):
            print(target, ' :', tree.getpath(e))
Output is
List : /html/body/h2
List : /html/body/p
nested : /html/body/p/b
Black tea : /html/body/ul/li[2]/ul/li[1]
As you can see, what this does is give you the xpath to the selected text targets. A couple of things to note: first, "List" appears twice because it occurs twice in the text. Second, the "Black tea" xpath contains positional values (for example, the [2] in /li[2]), which indicate that the target string appears in the second li element at that level, and so on. If you don't need that, you may have to strip that information from the output (or use another tool).
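If you only want the bare tag chain (html, body, ul, li, ...) rather than a full path with positions, one option is a small post-processing step on getpath's output; a sketch that simply strips the [n] parts with a regex:

import re

def tag_chain(xpath_str):
    # '/html/body/ul/li[2]/ul/li[1]' -> ('html', 'body', 'ul', 'li', 'ul', 'li')
    return tuple(re.sub(r'\[\d+\]', '', part)
                 for part in xpath_str.strip('/').split('/'))

print(tag_chain('/html/body/ul/li[2]/ul/li[1]'))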

Extracting text from hyperlink using XPath

I am using Python along with Xpath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from its front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from the <a> ABC </a> element? I want the string "ABC", but I cannot find it. I have tried the following expressions, but they do not seem to work.
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
None of them seem to work. When I use extract(), it gives me the whole element itself. For example, instead of giving me ABC, it gives me <a>ABC</a>.
How can I extract the text string?
If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print response.xpath('//p//a[@class="title"]/text()').extract()[0]
ABC
// is equivalent to using the descendant axis. //p[descendant::a[...]]/text() won't give you the result because the predicate only filters the <p>; you are still selecting text from the <p> itself, not from the descendant <a>.
I only tested it with an online XPath evaluator, but it should work when you adjust it to:
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element, not the <a>.
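In the Scrapy shell the whole front page can then be handled in one go; a sketch, with the caveat that the exact class names in Reddit's markup are an assumption and may differ:

# scrapy shell https://www.reddit.com/
titles = response.xpath('//p/descendant::a[contains(@class,"title")]/text()').extract()
for title in titles:
    print(title.strip())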

Want to pull a journal title from an RCSB Page using python & BeautifulSoup

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein.
To do this I am using the Python libraries requests and BeautifulSoup. To try and build the code, I went to the page for a particular protein, in this case 1K48, and also saved the HTML for the page (by hitting command+s and saving the HTML to my desktop).
First things to note:
1) The url for this page is: http://www.rcsb.org/pdb/explore.do?structureId=1K48
2) You can get to the page for any protein by replacing the last four characters with the appropriate PDBID.
3) I am going to want to perform this procedure on many PDBIDs, in order to sort a large list by the Journal they originally appeared in.
4) Searching through the HTML, one finds the journal title located inside a form here:
<form action="http://www.rcsb.org/pdb/search/smartSubquery.do" method="post" name="queryForm">
<p><span id="se_abstractTitle"><a onclick="c(0);">Refined</a> <a onclick="c(1);">structure</a> <a onclick="c(2);">and</a> <a onclick="c(3);">metal</a> <a onclick="c(4);">binding</a> <a onclick="c(5);">site</a> of the <a onclick="c(8);">kalata</a> <a onclick="c(9);">B1</a> <a onclick="c(10);">peptide.</a></span></p>
<p><a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Skjeldal, L.');">Skjeldal, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Gran, L.');">Gran, L.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Sletten, K.');">Sletten, K.</a>, <a class="sePrimarycitations se_searchLink" onclick="searchCitationAuthor('Volkman, B.F.');">Volkman, B.F.</a></p>
<p>
<b>Journal:</b>
(2002)
<span class="se_journal">Arch.Biochem.Biophys.</span>
<span class="se_journal"><b>399: </b>142-148</span>
</p>
A lot more is in the form but it is not relevant. What I do know is that my journal title, "Arch.Biochem.Biophys", is located within a span tag with class "se_journal".
And so I wrote the following code:
def JournalLookup():
    PDBID = '1K48'
    import requests
    from bs4 import BeautifulSoup
    session = requests.session()
    req = session.get('http://www.rcsb.org/pdb/explore.do?structureId=%s' % PDBID)
    doc = BeautifulSoup(req.content)
    Journal = doc.findAll('span', class_="se_journal")
Ideally I'd be able to use find instead of findAll, as these are the only two such spans in the document, but I used findAll to at least verify what was coming back. I assumed it would return a list containing the two span tags with class "se_journal", but it instead returns an empty list.
After spending several hours going through possible solutions, including a piece of code that printed every span in doc, I have concluded that the requests response does not include the lines I want at all.
Does anybody know why this is the case, and what I could possibly do to fix it?
Thanks.
The content you are interested in is generated by JavaScript. It's easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off.
This site will not function correctly without Javascript."
For JavaScript-driven pages, you cannot use Python Requests alone. There are some alternatives, one being dryscrape.
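A minimal sketch of that route, assuming dryscrape is installed, rendering the page first and then handing the result to BeautifulSoup as before (the URL is the one from the question):

import dryscrape
from bs4 import BeautifulSoup

session = dryscrape.Session()
session.visit('http://www.rcsb.org/pdb/explore.do?structureId=1K48')
# session.body() returns the HTML *after* the JavaScript has run,
# so the se_journal spans should now be present in the tree.
soup = BeautifulSoup(session.body())
journal_spans = soup.find_all('span', class_="se_journal")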
PS: Do not import libraries/modules within a function. It is not the recommended style, and PEP 8 says:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it's not the recommended way to do it.
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
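For the batch use case in the question (sorting many PDB IDs by journal), a rough sketch could loop over describe_pdb; which keys are actually present varies by entry, so the field names below are assumptions to check against your own output:

from pypdb import describe_pdb

pdb_ids = ['1K48', '4lza']
for pdb_id in pdb_ids:
    desc = describe_pdb(pdb_id)
    # .get() so that entries missing a field don't raise a KeyError
    print(pdb_id, desc.get('title'), desc.get('citation_authors'))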

python, collecting links / script values from page

I am trying to make a program to collect links and some values from a website. It works well for the most part, but I have come across a page on which it does not work.
With Firebug I can see that this is the HTML code of the elusive "link" (I can't find it when viewing the page's source though):
<a class="visit" href="/tet?id=12&mv=13&san=221">
221
</a>
and this is the script:
<td><a href=\"/tet?id=12&mv=13&san=221\" class=\"visit\">221<\/a><\/td><\/tr>
I'm wondering how to get the "link" ("/tet?id=12&mv=13&san=221") from the HTML code and the string "221" from either the script or the HTML, using Selenium, mechanize or requests (or some other library).
I have made an unsuccessful attempt at getting it with mechanize using the br.links() function, which collected a number of links from the site, just not the one I am after.
Extra info: this might be important. To get to the page I have to click on a button with this code:
<a id="f33" class="button-flat small selected-no" onclick="qc.pA('visitform', 'f33', 'QClickEvent', '', 'f52'); if ($j('#f44').length == 0) { $j('f44').style.display='inline'; }; $j('#f38').hide();qc.recordControlModification('f38', 'DisplayStyle', 'hide'); document.getElementById('forumpanel').className = 'section-3'; return false;" href="#">
load2
</a>
after which a "new page" loads in a part of the window (but the URL never changes).
I think you pasted the wrong script of yours ;)
I'm not sure what you need exactly - there are at least two different approaches:
Matching all hrefs using a regex
Matching specific tags and using get_attribute(...)
For the first one, you have to get the whole HTML source of the page with something like webdriver.page_source and use something like the following regex (you will have to escape either the single or the double quotes!):
<a.+?href=['"](.*?)['"].*?/?>
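A sketch of that first approach, assuming webdriver is an already initialised Selenium driver with the page loaded:

import re

html = webdriver.page_source
# Non-greedy capture of whatever sits between the href quotes
hrefs = re.findall(r'<a.+?href=[\'"](.*?)[\'"].*?/?>', html)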
If you need the hrefs of all matching links, you could use something similar to webdriver.find_elements_by_css_selector('.visit') (take care to choose find_elements_... instead of find_element_...!) to obtain a list of webelements and iterate through them to get their attributes.
This could result in code like this:
hrefs = []
elements = webdriver.find_elements_by_css_selector('.visit')
for element in elements:
    hrefs.append(element.get_attribute('href'))
Or a one liner using list comprehension:
hrefs = [element.get_attribute('href') for element
         in webdriver.find_elements_by_css_selector('.visit')]
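Since the question also asks for the visible string ("221"), the same loop can collect both the attribute and the element text; a sketch, assuming the .visit selector only matches the links of interest:

links = []
for element in webdriver.find_elements_by_css_selector('.visit'):
    # get_attribute('href') gives the link target, .text the visible string, e.g. "221"
    links.append((element.get_attribute('href'), element.text.strip()))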

Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

I need to automatically scan lots of html documents for ad banners that are surrounded by an anchor tag, e.g.:
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>
As a newbie with xpath, I can select such anchors via lxml like so:
import lxml.html

text = '''
<a href="http://ad_network.com/abc.html">
<img src="ad_banner.jpg">
</a>'''
root = lxml.html.fromstring(text)
print root.xpath('//a[contains(@href,("ad_network.")) or contains(@href,("other_ad_network."))][descendant::img]')
In the example I check two different domains: "ad_network." and "other_ad_network.". However, there are over 25 domains to check, and the XPath expression would get terribly long if I connected all those contains directives with "or". I also fear the expression would be pretty inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values?
I could also get the links concerned via a regex in a single line of code. Yet, although the HTML is normalized by lxml, regex never seems to be a good choice for that kind of work... Any help appreciated!
It might not be that bad just to do a bunch of 'or's. Build the xpath with python so that you don't get writer's cramp and then precompile it. The actual xpath code is in libxml and should be fast.
from lxml import etree

sites = ['aaa', 'bbb']
contains = ' or '.join('contains(@href,("%s"))' % site for site in sites)
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)
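Used on a parsed document, the compiled expression is simply callable; a short sketch, assuming sites was built from the real ad domains (e.g. 'ad_network.', 'other_ad_network.') rather than the placeholders above:

import lxml.html

root = lxml.html.fromstring(text)   # text = the HTML snippet from the question
for a in anchor_xpath(root):        # a compiled etree.XPath object is callable
    print(a.get('href'))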
