How to get WebElement text only for direct child? - python

I'm working with Python Selenium, and in the following HTML structure:
<div>
<h2>Welcome</h2>
<div>
<p>some text <strong>important</strong></p>
<a>link</a>
</div>
</div>
I'd like to get the text from each descendant (h2, div, p, strong, a) of the parent div, e.g. for the <p> tag I want some text.
I've been using the .text attribute and getting some text important instead. I'd like to use something similar to the BeautifulSoup attribute .string.
Edit: I need the code to work for any parent element containing descendants with more nested descendants - not just this particular HTML structure.
Thanks for your help.

Use the JavaScript executor to return the textContent of the element's first child node.
print(driver.execute_script('return arguments[0].firstChild.textContent;', driver.find_element_by_xpath("//h2[contains(.,'Welcome')]/following::div/p")))
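Note that firstChild.textContent reads only the first child node, so text that comes after a nested tag is missed. A minimal sketch that joins every direct text node instead; the helper name direct_text and the constant DIRECT_TEXT_JS are hypothetical, and only driver.execute_script is Selenium API:

```python
# JavaScript run in the browser: join the element's own text nodes,
# skipping any child elements (and therefore their text).
DIRECT_TEXT_JS = """
return Array.from(arguments[0].childNodes)
    .filter(function (node) { return node.nodeType === Node.TEXT_NODE; })
    .map(function (node) { return node.textContent; })
    .join('');
"""


def direct_text(driver, element):
    """Return the text that belongs to `element` itself, without descendants' text."""
    return driver.execute_script(DIRECT_TEXT_JS, element).strip()
```

For the <p> in the question, direct_text(driver, p_element) should give some text, whereas p_element.text gives some text important.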

Related

Python Selenium get element by CSS matching text

Hi, I can't figure out how to get an element by CSS selector while also matching on its text. I know it can be done with XPath, but I'd rather use CSS.
<div class="button-face">
<div class="button-face-caption"> Text I want to find 1</div>
</div>
<div class="button-face">
<div class="button-face-caption"> Text I want to find 2</div>
</div>
So by CSS it would be something like...
driver.find_element_by_css_selector('div.button-face-caption')
But how can I add the text matching to that? I tried contains and innerText, and neither seems to work.
As you said it's supported in xpath:
This would be a solution with an xpath using contains and text()
driver.find_element_by_xpath('//div[@class="button-face-caption" and contains(text(),"Text I want to find")]')
The xpath being:
//div[@class="button-face-caption" and contains(text(),"Text I want to find")]
For css, look here: https://sqa.stackexchange.com/q/362/34209 which should allow us to use:
div:contains('Text I want to find')
Which would lead us to
driver.find_element_by_css_selector("div:contains('Text I want to find')")
However this comes with a BIG caveat:
:contains() is not part of the current CSS3 specification so it will
not work on all browsers, only ones that implemented it before it was
pulled. (see w3.org/TR/css3-selectors)
As a workaround you can create your own function:
def find_by_css(selector, text=''):
    return [element for element in driver.find_elements_by_css_selector(selector) if text in element.text][0]
Then you can call it as
find_by_css('div.button-face-caption') # search only by CSS-selector
or
find_by_css('div.button-face-caption', 'Text I want to find 2') # search by CSS + text
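The list-indexing helper raises IndexError when nothing matches. A hedged variant that returns None instead; it assumes a Selenium version where find_elements takes a locator strategy and a value, and the name find_by_css_or_none is hypothetical:

```python
def find_by_css_or_none(driver, selector, text=""):
    """Return the first element matching `selector` whose text contains `text`, else None."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant.
    candidates = driver.find_elements("css selector", selector)
    return next((el for el in candidates if text in el.text), None)
```

Called as find_by_css_or_none(driver, 'div.button-face-caption', 'Text I want to find 2'), it lets the caller check for None rather than catch an exception.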
As per the following discussions:
CSS selector :contains doesn't work with Selenium
css pseudo-class :contains() no longer allows anchors
The :contains pseudo-class isn't in the CSS Spec and is not supported by either Firefox or Chrome (even outside WebDriver).
Solution
You need to consider the ancestor of the <div class="button-face"> element and traverse down. Let us assume that both <div class="button-face"> elements are within a parent <div class="class">:
<div class="class">
<div class="button-face">
<div class="button-face-caption"> Text I want to find 1</div>
</div>
<div class="button-face">
<div class="button-face-caption"> Text I want to find 2</div>
</div>
</div>
So to identify the element with text as:
Text I want to find 1:
div.class div:first-child > div.button-face-caption
Text I want to find 2:
div.class div:nth-child(2) > div.button-face-caption
References
You can find a couple of relevant detailed discussions in:
selenium.common.exceptions.InvalidSelectorException with “span:contains('string')”
Finding link using text in CSS Selector is not working

selenium webdriver: how to find the text in a paragraph which is nested in a div element?

<div class="loginbox">other code</div>
<div class="loginbox">
<p style="color: Red;">Test Extract</p>
</div>
Using Selenium WebDriver, I would like to extract the text Test Extract from the paragraph element nested within a div whose class name is shared with other divs. Python, please. (I found this question here already, but it asked for an answer in another programming language, so I rephrased it for Python.)
You can simply use:
text = browser.find_element_by_xpath("//p[text()='Test Extract']")
print(text.text)
EDIT:
If you're looking for an example in another language, here's one in Java:
WebElement text = browser.findElement(By.xpath("//p[text()='Test Extract']"));
System.out.println(text.getText());
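The XPath above requires the text to be known in advance. If the goal is to read whatever the paragraph contains, it can be selected by structure instead. A small offline sketch using the standard library's xml.etree.ElementTree (an assumption for illustration only; it supports just a subset of XPath and needs well-formed markup, so a wrapper <body> is added):

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<div class="loginbox">other code</div>
<div class="loginbox">
<p style="color: Red;">Test Extract</p>
</div>
</body>"""

root = ET.fromstring(PAGE)
# Every <p> that is a direct child of a div with class "loginbox".
paragraph_texts = [p.text for p in root.findall(".//div[@class='loginbox']/p")]
print(paragraph_texts)
```

In Selenium the corresponding locator would be //div[@class='loginbox']/p, followed by .text on the element that is found.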

Selenium-Python: Class containing link-text

I am using Python & Selenium to scrape the content of a certain webpage. Currently, I have the following problem: There are multiple div-classes with the same name, but each div-class has different content. I only need the information for one particular div-class. In the following example, I would need the information in the first "show_result"-class since there is the "Important-Element" within the link text:
<div class="show_result">
<a href="?submitaction=showMoreid=77" title="Go-here">
<span class="new">Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=78" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
<div class="show_result">
<a href="?submitaction=showMoreid=79" title="Go-here">
<span class="new">Not-Important-Element</span></a>
Other text, links, etc within the class...
</div>
With the following code I can get the "Important-Element" and its link:
driver.find_element_by_partial_link_text('Important-Element'). However, I also need the other information within the same div-class "show-result". How can I refer to the entire div-class that contains the Important-Element in the link text? driver.find_elements_by_class_name('show_result') does not work since I do not know in which of the div-classes the Important-Element is located.
Thanks,
Finn
Edit / Update: Oops, I found the solution on my own using xpath:
driver.find_element_by_xpath("//div[contains(@class, 'show_result') and contains(., 'Important-Element')]")
I know you've found an answer, but I believe it's wrong, since it would also select the other nodes: Important-Element is a substring of Not-Important-Element.
Maybe it works for your specific case since that's not really the text you're after. But here are a few more answers:
//div[@class='show_result' and starts-with(normalize-space(.),'Important-Element')]
//div[.//span[text()='Important-Element']]
//div[contains(.//span/text(),'Important-Element') and not(contains(.//span/text(),'Not'))]
There are more ways to write this...
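The substring pitfall can be checked offline. A minimal sketch with the standard library's xml.etree.ElementTree (used here only for illustration; the href values are shortened and the page is wrapped in a <body> so it parses as XML):

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<div class="show_result"><a href="?id=77"><span class="new">Important-Element</span></a> other text</div>
<div class="show_result"><a href="?id=78"><span class="new">Not-Important-Element</span></a> other text</div>
<div class="show_result"><a href="?id=79"><span class="new">Not-Important-Element</span></a> other text</div>
</body>"""

root = ET.fromstring(PAGE)
divs = root.findall(".//div[@class='show_result']")

# Substring test: "Important-Element" also occurs inside "Not-Important-Element".
substring_hits = [d for d in divs if "Important-Element" in "".join(d.itertext())]
# Exact test on the span text keeps only the intended div.
exact_hits = [d for d in divs if d.find(".//span").text == "Important-Element"]

print(len(substring_hits), len(exact_hits))  # 3 1
```

The substring check matches all three divs, while the exact comparison on the span text matches only the first.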

getting article text using xpath but omit some tags

I'm trying to parse (article) text only, using xpath.
I want to get all text that is a direct child of a node, plus the text of all nested descendants, except for the following nodes/tags: <script>, <ul class="pager pagenav">, <style>.
Example html to match using xpath:
<section class="entry-content">
want this article text
<script>dont want this</script>
more text i want
<p>want this text too</p>
<any>also this</any>
<style>dont want this either</style>
<ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>
Currently, I have something like:
result = tree.xpath('//section[@class="entry-content"]/*[not(descendant-or-self::script or self::ul[@class="pager pagenav"] or self::style)]/../descendant-or-self::text()')
...but it doesn't quite work.
Use child::node() to match both element children and text node children:
child::node() selects all the children of the context node, whatever their node type
self:: helps to filter out unwanted elements by name:
//section[@class="entry-content"]/child::node()[not(self::script or self::ul or self::style)]/descendant-or-self::text()
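The same filtering idea, sketched without XPath node tests: walk the tree with the standard library's xml.etree.ElementTree and skip the unwanted elements. This is an illustrative alternative, not the answer's method; the helper names is_skipped and wanted_text are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """<section class="entry-content">
want this article text
<script>dont want this</script>
more text i want
<p>want this text too</p>
<any>also this</any>
<style>dont want this either</style>
<ul class="pager pagenav">nope, dont want this <a>Prev Next</a></ul>
</section>"""


def is_skipped(el):
    """True for <script>, <style> and the pager <ul>."""
    if el.tag in ("script", "style"):
        return True
    return el.tag == "ul" and el.get("class") == "pager pagenav"


def wanted_text(el):
    """Collect non-blank text under `el`, leaving out skipped subtrees."""
    parts = []
    if el.text and el.text.strip():
        parts.append(el.text.strip())
    for child in el:
        if not is_skipped(child):
            parts.extend(wanted_text(child))
        # Text after a child's closing tag belongs to the parent, so keep it
        # even when the child itself was skipped.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


section = ET.fromstring(PAGE)
print(wanted_text(section))
```

Unlike the XPath above, this sketch skips only the <ul> with class "pager pagenav" rather than every <ul>.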

Extracting text from hyperlink using XPath

I am using Python along with Xpath to scrape Reddit. Currently I am working on the front page. I am trying to extract links from its front page and display their titles in the shell.
For this I am using the Scrapy framework. I am testing this in the Scrapy shell itself.
My question is this: how do I extract the text from the <a> ABC </a> element? I want the string "ABC", but I cannot get it. I have tried the following expressions:
response.xpath('//p[descendant::a[contains(@class,"title")]]/@value')
response.xpath('//p[descendant::a[contains(@class,"title")]]/@data')
response.xpath('//p[descendant::a[contains(@class,"title")]]').extract()
response.xpath('//p[descendant::a[contains(@class,"title")]]/text()')
None of them seem to work. When I use extract(), it gives me the whole element. For example, instead of giving me ABC, it will give me <a>ABC</a>.
How can I extract the text string?
If <p> and <a> are in this situation:
<p>
<something>
<a class="title">ABC</a>
</something>
</p>
This will give you "ABC":
>>> print(response.xpath('//p//a[@class="title"]/text()').extract()[0])
ABC
// is shorthand for selecting descendants. p[descendant::a] won't give you the result because the predicate only filters which <p> elements are selected; the node you get back is still the <p>, not the <a>.
Only tested it with an online XPath evaluator, but it should work when you adjust it to
response.xpath('//p/descendant::a[contains(@class,"title")]/text()')
If you're evaluating //p[descendant::a[contains(@class,"title")]]/text(), the <p> (with the descendant <a>) is the current element and not the <a>.
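The difference between filtering on a descendant and stepping down to it can be seen offline. A minimal sketch with the standard library's xml.etree.ElementTree instead of Scrapy (an assumption for illustration; it supports only a subset of XPath):

```python
import xml.etree.ElementTree as ET

PAGE = """<p>
<something>
<a class="title">ABC</a>
</something>
</p>"""

p = ET.fromstring(PAGE)
# The <p> itself only owns whitespace; the string lives in the nested <a>.
own_text = (p.text or "").strip()
# Step down to the <a> first, then read its text.
titles = [a.text for a in p.findall(".//a[@class='title']")]
print(repr(own_text), titles)
```

Here own_text is empty while titles holds the link text, which is why text() has to be applied to the <a> rather than to the <p>.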
