I am using Python along with XPath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from the front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from an <a>ABC</a> element? I want the string "ABC", but I cannot find a way to get it. I have tried the following expressions, but none of them seem to work.
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
None of them seem to work. When I use extract(), it gives me the whole element itself. For example, instead of giving me ABC, it will give me <a>ABC</a>.
How can I extract the text string?
If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print(response.xpath('//p//a[@class="title"]/text()').extract()[0])
ABC
// is shorthand for selecting descendants. p[descendant::a] won't give you the result because the predicate only filters the <p> elements: the expression still selects <p>, not its descendant <a>.
I only tested it with an online XPath evaluator, but it should work when you adjust it to
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element and not the <a>.
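To make the difference between filtering on a descendant and stepping into it concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own. ElementTree only supports a subset of XPath, so the child predicate [something] stands in for [descendant::a[...]], and the <root> wrapper is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

# Same nesting as the snippet above, wrapped in a hypothetical <root>.
doc = ET.fromstring(
    "<root><p><something><a class='title'>ABC</a></something></p></root>"
)

# Stepping into <a>: the selected nodes are the links themselves.
links = doc.findall(".//p//a[@class='title']")
titles = [a.text for a in links]

# Predicate on <p>: the selected nodes are still the <p> elements,
# so asking for their text does not reach the text of <a>.
paragraphs = doc.findall(".//p[something]")
selected_tags = [p.tag for p in paragraphs]
own_text = [p.text for p in paragraphs]

print(titles, selected_tags, own_text)
```

The same distinction applies to the Scrapy expressions: a predicate narrows which <p> elements match, while a path step moves the selection to the <a>.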
Related
I am trying to target the text inside a tag. Some tags have nested tags as well, and my XPath isn't targeting the text of those nested tags.
Link: https://help.lyft.com/hc/en-us/articles/115012925707-I-was-charged-incorrectly
Here is the XPath I am using: //article//p/text()
Of course, I can use //article//p//text() to target the text, but that also picks up the text of links I don't want to extract. I only want to get all the text inside a tag and, if there is any nested tag, take its text too.
How can I achieve such a result?
Thanks, everyone.
As most of the pink-colored links start with "Learn", I would probably go about it this way:
link_texts = response.xpath('//article//p//a//text()').extract()
# `in` on a list only matches whole items, so test each link text for the substring.
if not any("Learn" in t or "Back to top" in t for t in link_texts):
    print(response.xpath('//article//p/text()').extract())
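An alternative to filtering by link wording is to walk each paragraph and collect its text, including nested tags, while skipping <a> elements. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree rather than Scrapy; the helper name text_without_links and the sample paragraph are hypothetical and not taken from the Lyft page.

```python
import xml.etree.ElementTree as ET


def text_without_links(elem):
    """Return the text of elem and its nested tags, skipping <a> content."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "a":
            parts.append(text_without_links(child))
        # Text that follows a child tag belongs to the parent, so keep it.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical paragraph for illustration only.
p = ET.fromstring(
    "<p>You were <strong>charged</strong> twice. <a href='#'>Learn more</a></p>"
)
cleaned = " ".join(text_without_links(p).split())
print(cleaned)
```

This keeps the text of <strong> and similar nested tags but drops link text regardless of how the link is worded.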
I'm working with Python Selenium, and in the following HTML structure:
<div>
<h2>Welcome</h2>
<div>
<p>some text <strong>important</strong></p>
<a>link</a>
</div>
</div>
I'd like to get the text from each descendant (h2, div, p, strong, a) of the parent div, e.g. for the <p> tag I want some text.
I've been using the .text attribute and getting some text important instead. I'd like to use something similar to the BeautifulSoup attribute .string.
Edit: I need the code to work for any parent element containing descendants with more nested descendants - not just this particular HTML structure.
Thanks for your help.
Use the JavaScript executor to return the textContent of the element's first child node, which here is its leading text node.
print(driver.execute_script('return arguments[0].firstChild.textContent;', driver.find_element_by_xpath("//h2[contains(.,'Welcome')]/following::div/p")))
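For the "any parent element" case from the edit, the underlying idea is to take, for every descendant, only the text that sits directly inside it. The following is a rough offline sketch of that idea using the standard library's xml.etree.ElementTree on the HTML from the question; it is not Selenium code, and the helper name own_text is hypothetical. In Selenium itself the equivalent would still have to go through JavaScript, as in the answer above.

```python
import xml.etree.ElementTree as ET

html = (
    "<div><h2>Welcome</h2><div>"
    "<p>some text <strong>important</strong></p><a>link</a>"
    "</div></div>"
)


def own_text(elem):
    """Text directly inside elem, excluding the text of nested elements."""
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    return "".join(pieces).strip()


root = ET.fromstring(html)
# Skip the parent itself and visit every descendant in document order.
texts = [(el.tag, own_text(el)) for el in root.iter() if el is not root]
print(texts)
```

Unlike firstChild.textContent, this also picks up text that appears after a nested tag, which may or may not be what you want.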
I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with puts a p element within an h1 element, which is invalid according to the W3C (see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"). I need to extract the text within the p element nevertheless, and cannot figure out how.
I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" in order to recognize any XML tree, but for the life of me I cannot figure out how or where to do that in this instance.
For example, a website has the following HTML:
<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and Dvořák featuring pianist Emanuel Ax
</p>
</h1>
I have made an item called Concert() that has a value called 'title'. In my item loader, I use:
def parse_item(self, response):
    thisconcert = ItemLoader(item=Concert(), response=response)
    thisconcert.add_xpath('title', '//h1[@class="performance-title"]/p/text()')
    return thisconcert.load_item()
This returns, in item['title'], a unicode list that does not include the text inside the p element, such as:
['\n ', '\n ', '\n ']
I understand why, but I don't know how to get around it. I have also tried things like:
from scrapy import Selector
def parse_item(self, response):
    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')
What am I doing wrong here, and how can I parse HTML that contains this problem (p within h1)?
I have referenced the information concerning this specific issue at Behavior of the scrapy xpath selector on h1-h6 tags but it does not provide a complete solution that can be applied to a spider, only an example within a session using a given text string.
That was quite baffling. To be frank, I still do not understand why this happens. It turns out that the <p> tag, which should be contained within the <h1> tag, is not. Fetching the site with curl shows markup of the form <h1><p> </p></h1>, whereas the response obtained from the site shows it as:
<h1 class="performance-title">\n</h1>
<p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>
As I mentioned, I have my suspicions but nothing concrete. In any case, the XPath for getting the text inside the <p> tag is therefore:
response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()
This uses <h1 class="performance-title"> as a landmark and finds its sibling <p> tag.
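To illustrate what the following-sibling step does, here is a small sketch that emulates it by hand with the standard library's xml.etree.ElementTree, which has no following-sibling axis of its own. The function name first_p_after_heading and the shortened fragment are assumptions for the example; in a spider you would keep using the XPath above.

```python
import xml.etree.ElementTree as ET


def first_p_after_heading(root, heading_class):
    """Emulate //h1[@class=...]/following-sibling::p[1] with ElementTree."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "h1" and child.get("class") == heading_class:
                # Siblings that come after the heading under the same parent.
                for sibling in children[index + 1:]:
                    if sibling.tag == "p":
                        return " ".join((sibling.text or "").split())
    return None


# Hypothetical fragment shaped like the parsed response shown above.
parsed = ET.fromstring(
    '<section><h1 class="performance-title">\n</h1>'
    "<p>Bernard Haitink conducts Brahms\n</p></section>"
)
title = first_p_after_heading(parsed, "performance-title")
print(title)
```

The heading itself contributes only whitespace, which matches the list of '\n' strings seen in the question.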
//*[@id="content"]/section/article/section[2]/h1/p/text()
Here is the HTML I'm dealing with
<a class="_54nc" href="#" role="menuitem">
<span>
<span class="_54nh">Other...</span>
</span>
</a>
I can't seem to get my XPath structured correctly to find this element with the link. There are other elements on the page with the same attributes as <a class="_54nc"> so I thought I would start with the child and then go up to the parent.
I've tried a number of variations, but I would think something like this should work:
crawler.get_element_by_xpath('//span[@class="_54nh"][contains(text(), "Other")]/../..')
None of the things I've tried seem to be working. Any ideas would be much appreciated.
Or, a cleaner option is //*[.='Other...']/../.. Here . compares the element's string value with 'Other...', and /../.. then walks up two levels to the <a> element.
Alternatively, if you want to find the <a> tag directly, use the CSS selector [role='menuitem'], which is the better option if the role attribute is unique.
How about trying this (note that XPath text matching is case-sensitive, so the text must be "Other"):
crawler.get_element_by_xpath('//a[@class="_54nc"][./span/span[contains(text(), "Other")]]')
Try this:
crawler.get_element_by_xpath("//a[@class='_54nc']//span[.='Other...']")
This searches for an <a> element with class "_54nc" and selects the <span> inside it whose text is exactly "Other...". Furthermore, you can replace "Other..." with other text to find the respective element(s).
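The bottom-up and top-down approaches suggested in these answers can be compared side by side. The sketch below uses the standard library's xml.etree.ElementTree (Python 3.7 or later for the [.='text'] predicate) instead of the crawler object from the question; the surrounding <ul> and the distinct href values are made up so the two links can be told apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu with two links that share the same class.
menu = ET.fromstring(
    "<ul>"
    "<li><a class='_54nc' href='#settings' role='menuitem'>"
    "<span><span class='_54nh'>Settings</span></span></a></li>"
    "<li><a class='_54nc' href='#other' role='menuitem'>"
    "<span><span class='_54nh'>Other...</span></span></a></li>"
    "</ul>"
)

# Bottom-up: find the inner span by class and text, then climb two levels.
bottom_up = menu.findall(".//span[@class='_54nh'][.='Other...']/../..")

# Top-down: take every candidate link and keep those with the wanted text.
top_down = [
    a for a in menu.findall(".//a[@class='_54nc']")
    if "".join(a.itertext()) == "Other..."
]

print([a.get("href") for a in bottom_up])
```

Both routes end at the same <a> element; which one reads better depends on whether the class or the text is the more stable hook on the real page.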
This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3>The link I want to obtain</h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without an h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand this (be ambitious) and select for the h3 > a tags, then go to the div.cont ancestor (based off XPath query with descendant and descendant text() predicates):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip anymore, this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you need to then scan for the link anyway that doesn't actually buy you anything.
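The "grab the p text and the link" step left as a comment above could look roughly like this. The sketch uses the standard library's xml.etree.ElementTree so it runs without lxml; the same loop shape should carry over to lxml's xpath(). The <a href> inside <h3>, the second block without a link, and the exact class match (ElementTree has no contains()) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one block with an h3 > a link, one without.
page = ET.fromstring(
    "<body>"
    "<div class='cont'><h1>Random Text</h1><p>First paragraph</p>"
    "<h3><a href='/first'>First link</a></h3></div>"
    "<div class='cont'><p>No link in this block</p></div>"
    "</body>"
)

pairs = []
for block in page.findall(".//div[@class='cont']"):
    link = block.find(".//h3//a")
    if link is None:
        continue  # skip blocks without an h3 > a setup
    paragraph = block.find("p")
    text = paragraph.text if paragraph is not None else None
    # Both lookups are relative to the same block, so they stay matched.
    pairs.append((link.get("href"), text))

print(pairs)
```

Because the paragraph and the link are both looked up relative to the same div, each tuple pairs content from a single block.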