XPATH to check on a specific text within a node - python

I have this as a node to parse:
<h3 class="atag">
<a href="http://www.example.com">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">text to be checked</span>
</h3>
I need to extract "http://www.example.com" but not the text to be ignored. I also need to check whether ctag contains text to be checked.
I came up with this, but it doesn't seem to do the job:
response.xpath("//h3/a/@*[not(self::span)]").extract()
Any idea on this?

If you just need to select the href from the 'a' tag, use @href.
To also check whether the ctag span contains some text, I think you can use an expression like this:
'//h3[contains(span[@class="ctag"]/text(), "text to be checked")]/a/@href'
This checks whether there is a span with "text to be checked" inside the given h3 block. If the text exists, 'http://www.example.com' is returned; otherwise the result is empty.
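The logic of that expression can be tried offline with a minimal sketch. It assumes the standard-library xml.etree.ElementTree parser, which supports only a subset of XPath (no contains() and no attribute steps), so the text check and the href lookup are done in Python instead. The names SNIPPET and hrefs_if_checked are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The node from the question, used here as well-formed XML.
SNIPPET = """
<h3 class="atag">
<a href="http://www.example.com">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">text to be checked</span>
</h3>
"""


def hrefs_if_checked(markup, wanted):
    """Return the href of each h3/a whose h3 has a ctag span containing `wanted`."""
    root = ET.fromstring(markup.strip())
    hrefs = []
    # iter() also visits the root, which is the h3 itself in this snippet.
    for h3 in root.iter("h3"):
        ctag_spans = h3.findall("span[@class='ctag']")
        if any(wanted in (span.text or "") for span in ctag_spans):
            hrefs.extend(a.get("href") for a in h3.findall("a"))
    return hrefs


result = hrefs_if_checked(SNIPPET, "text to be checked")
missing = hrefs_if_checked(SNIPPET, "something else")
```

Here result holds the link and missing is an empty list, which mirrors the "empty result" behaviour described above. With a full XPath engine (for example the one behind response.xpath), the single expression in the answer does the same work.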

Do you mean something like this XPath?
//h3/a[following-sibling::span[@class='ctag' and .='text to be checked']]/@href
The XPath above selects an <a> tag that is followed by a sibling <span class="ctag"> whose value is "text to be checked", then returns the href attribute of that <a> tag.

Related

How to extract the required element using find_all()

I am trying to extract the authors' names from an Amazon page. The problem is that many tags share the same class and there are no other attributes to identify the exact element. I want to extract the author name, which is in the second span tag.
<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary"><a class="a-link-normal a-text-normal" href="/Arthur-Conan-Doyle/e/B000AQ43GQ/ref=sr_ntt_srch_lnk_2?qid=1510823399&sr=8-2">Arthur Conan Doyle</a></span></div>
As we can see, both span tags have the same class. I want the second span tag. Moreover, the a tag is not present in all blocks, so I have to use only the span tag to extract the author name. How can I get the author name?
I am using BeautifulSoup and Selenium. My code is:
soup=BeautifulSoup(self.driver.page_source,"html.parser")
titles=soup.find_all("h2",{"class":"a-size-medium s-inline s-access-title a-text-normal"})
authors=soup.find_all("span",{"class":"a-size-small a-color-secondary"})
for value in range(len(titles)):
    d={}
    d["Title"]=titles[value].text
    d["Author"]=authors[value+2].text
    title.append(d)
Find the parent "div" element of that "span", then extract the entire text of the div tag. As you can observe, there is a "by" substring in every block. Use it to split the text and copy the part after it into d["Author"]. If "by" is not present, the split returns a single item, so check with an if condition before copying; indexing the second item directly would raise an IndexError.
Here is the code:
temp = authors[value].text
temp1 = temp.split("by")
#print(temp[1])
if temp1[0]!=temp:
    d["Author"] = temp1[1]
else:
    d["Author"] = "None"
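One caveat with splitting on "by" is that the substring can also occur inside a name (for example "Abby"). A possible variation, sketched here under the assumption that the block text always starts with the word "by" followed by the author, strips only that leading prefix. The function name author_from_block is hypothetical.

```python
def author_from_block(text):
    """Return the author from text like 'by Arthur Conan Doyle', or None."""
    # Collapse newlines and repeated spaces that page markup tends to add.
    cleaned = " ".join(text.split())
    prefix = "by "
    if cleaned.startswith(prefix):
        author = cleaned[len(prefix):].strip()
        return author or None
    # No leading "by": treat the block as having no author.
    return None


doyle = author_from_block("by \nArthur Conan Doyle")
tricky = author_from_block("by Abby Brown")
no_author = author_from_block("Kindle Edition")
```

With this approach "by Abby Brown" keeps the full name, whereas split("by")[1] would stop at " Ab".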

xpath/python search then grab child nodes?

I am working on a scraper using python and selenium and I have an issue traversing xpath. I feel like this should be simple, but I'm obviously missing something.
I am able to navigate the site I am browsing fine, but I need to grab some SPAN text based on an XPATH search.
I am able to click the appropriate radio button(in this case the 1st one)
(driver.find_elements_by_name("start-date"))[0].click()
But I also need to capture the text next to the radio button, which is contained in the span tag.
<label>
<input type="radio" name="start-date" value="1" data-start-date="/Date(1507854300000)/" data-end-date="/Date(1508200200000)/" group="15" type-id="8">
<span class="start-date">
10/12/2017<br>Summary text
</span>
</label>
In the above example, I'm looking to capture "10/12/2017" and "Summary text" into 2 string variables based on the find_elements_by_name search I used to find the radio button.
I then have a second, similar, collection issue, where I need to capture the span tags after searching by class name. This finds the appropriate parent node on the page:
(driver.find_element_by_xpath("//div[@class=\"MyClass\"]"))
Based on the node returned by that search, I want to grab "Text 1" and "Text 2" from the span tags below it.
<div class="MyClass">
<span>
<span>Text 1</span>
</span>
<span class="bullet">
</span>
<span>
<span>Text 2</span>
</span>
</div>
I am new to xpath, but from what I can gather, the span nodes I am looking for should be children of the nodes I found with my searches, and I should be able to traverse down the hierarchy somehow to get the values, I'm just not sure how.
It's actually very simple: all WebElement objects have the same find_element_by_* methods that the WebDriver object has. The main difference is that the element methods change the search context to that element, so they only return descendants of the selected element.
With that in mind you should be able to do:
my_element = driver.find_element_by_class_name('MyClass')
my_spans = my_element.find_elements_by_css_selector('span>span')
What happens here is that we grab the first element with class MyClass, then from the context of that element we search for elements that are spans AND children of a span.
You can try the following XPath expressions:
//div[@class='MyClass']/span[1]/span ---- To get Text 1
//div[@class='MyClass']/span[3]/span ---- To get Text 2
or
(//div[@class='MyClass']/span/span)[1] ---- To get Text 1
(//div[@class='MyClass']/span/span)[2] ---- To get Text 2
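Both ideas above (searching from the context of a found node, and picking children by position) can be tried without a browser. This is only a sketch: it uses the standard-library xml.etree.ElementTree as a stand-in for Selenium, and the wrapping body element plus the extra span pair outside the div are added for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """
<body>
<div class="MyClass">
<span>
<span>Text 1</span>
</span>
<span class="bullet">
</span>
<span>
<span>Text 2</span>
</span>
</div>
<span><span>outside the div</span></span>
</body>
""".strip()

root = ET.fromstring(PAGE)

# Step 1: locate the parent node, as the class-name search does.
my_div = root.find(".//div[@class='MyClass']")

# Step 2: search relative to that node only, so the span pair
# outside the div is never visited.
nested_texts = [span.text for span in my_div.findall("span/span")]

# Positional form: the 1st and 3rd span children of the div.
text_1 = my_div.find("span[1]/span").text
text_2 = my_div.find("span[3]/span").text
```

In Selenium the same two-step pattern applies: call the find methods on the element returned by the first search rather than on the driver.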

How to extract text from HTML (after certain string)

I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">str1</b>
</span>
</li>
And I want to extract str1, which always appears after abc. I was able to do it using this XPath expression:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But the solution depends on it being the first appearance of this particular class, which appears more than once and isn't always in the same order. Is there a way to search the HTML for abc and pull the subsequent text?
Maybe find the element that contains abc, navigate to a child or parent if needed, and get the text.
Example selectors:
Find any element (* matches any tag) that contains the text abc and select any child:
//*[contains(text(), 'abc')]/*
Find any element (* matches any tag) that contains the text abc and select its b child:
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant containing the text abc and select the b element inside that li. Use // because b is not a direct child of li:
//li[.//*[contains(text(), 'abc')]]//b
If you know abc, start from there, see which element is returned, and navigate to the parent, ancestor, or child as needed.
For more about XPath, see the w3schools XPath selectors.
The following XPath should give the text you are searching for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes that the str1 you are looking for is always under an element with the attribute class="text-primary text-dark".
It also assumes that you want the first such occurrence (ignoring the other text-primary text-dark elements); that is why [1] is used.
This XPath ensures that the parent node contains the text abc before its children are searched for those classes.
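The "find the label first, then step to its child" idea can be sketched in plain Python. This example assumes the standard-library xml.etree.ElementTree, which has no contains() function, so the label test is done in Python. The two-item list and the function name value_after_label are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Two items sharing the same class on <b>, in an arbitrary order.
PAGE = """
<ul>
<li><span>xyz: <b class="text-primary text-dark">other</b></span></li>
<li><span>abc: <b class="text-primary text-dark">str1</b></span></li>
</ul>
""".strip()


def value_after_label(root, label):
    """Return the text of the <b> inside the span whose own text contains label."""
    for span in root.iter("span"):
        # span.text is the text before the first child element.
        if label in (span.text or ""):
            b = span.find("b")
            if b is not None:
                return (b.text or "").strip()
    return None


root = ET.fromstring(PAGE)
value = value_after_label(root, "abc")
```

Because the lookup is keyed on the label rather than on position, it keeps working when the items are reordered.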

How to find element on page with text which is dynamically changed using web driver?

I'm looking for a way to find an element that contains some exact text; the problem is that this text changes dynamically every time.
It looks like this:
<div class="some class" ng-class="{ 'ngSorted': !col.noSortVisible90 }">
<span ng-call-text class="ngbinding" style="cursor: defaulte;">some text and digits</span>
Where "some text and digits" is the element text that I need.
Could somebody help me with this?
UPD: I have a lot of elements with the same classes on the page, and I know the text phrase that should be found; I can provide this text to my code as a parameter.
You can use the id attribute
<span ng-call-text id="snarfblat" class="ngbinding" style="cursor: defaulte;">some text and digits</span>
so you can access it within JavaScript with
document.getElementById("snarfblat");
Why don't you use XPath or a CSS selector to reach your target element? Maybe one of its parents has a unique id or property; start from there and work down to your destination, i.e. the HTML tag with the dynamic text.
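Since the text is known at run time and can be passed as a parameter, one option is to build the XPath string from it. The sketch below is an assumption about how that could look, not something taken from the answers: the helper names xpath_literal and span_by_text_xpath are hypothetical, and the resulting string would be handed to whatever XPath lookup the driver provides. XPath 1.0 has no escape character inside string literals, so text containing both quote types is assembled with concat().

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both quote types present: join the single-quoted pieces with "'".
    pieces = ["'" + part + "'" for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


def span_by_text_xpath(text):
    """Build an XPath matching a span whose trimmed text equals `text`."""
    return "//span[normalize-space(.)=" + xpath_literal(text) + "]"


expression = span_by_text_xpath("some text and digits")
```

normalize-space(.) is used so that leading or trailing whitespace in the page markup does not break the exact match.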

Find a particular element of an html page using Selenium/Python

I have multiple levels of div elements, out of which I need to find one particular element, get its text value, and store it in a variable.
<div class="Serial">
<p> … </p>
<p>
<span>
<a href="mailto:xyz@xyz.com">
Mr. XYZ
</a>
</span>
</p>
<p> … </p>
<p> … </p>
So, we have 4 different paragraphs, out of which I need to read only the second paragraph and save the email ID to a variable. When I use the following code,
find_element_by_xpath("//div[@class='Serial']")
I get the information from all 4 paragraphs. Is there any way I can specify which paragraph to read within the div? I know for sure the order doesn't change and I only want to read the 2nd p element. I appreciate your help.
You could try accessing the <p> tag with an XPath such as find_element_by_xpath("//div[@class='Serial']/p[2]/span/a") to reach the email ID in the second paragraph.
I don't think it is entirely safe to rely on the order of the paragraphs: one day it may change, and those who come after you may be confused by p[2]. Since you need the text from the paragraph with the email link, I believe this XPath would do the trick:
//p[span/a[starts-with(@href, 'mailto:')]]
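To read the address itself rather than the paragraph, the href value can be taken from the link and the mailto: prefix stripped. A minimal offline sketch follows, assuming the standard-library xml.etree.ElementTree (which lacks starts-with(), so the prefix test is done in Python). The placeholder paragraph texts and the name email_addresses are invented for the example.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div class="Serial">
<p>first</p>
<p><span><a href="mailto:xyz@xyz.com">Mr. XYZ</a></span></p>
<p>third</p>
<p>fourth</p>
</div>
""".strip()

PREFIX = "mailto:"


def email_addresses(root):
    """Collect addresses from p/span/a links whose href starts with mailto:."""
    found = []
    for link in root.findall(".//p/span/a"):
        href = link.get("href", "")
        if href.startswith(PREFIX):
            found.append(href[len(PREFIX):])
    return found


root = ET.fromstring(PAGE)
emails = email_addresses(root)
```

This does not depend on the paragraph being second, which matches the point made in the last answer.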
