Python: XPATH searching within a node

What is a good way to select multiple nodes from within a node in HTML using XPath?
I have this markup (repeated 23 times in the actual page):
<li>
<a class="Title" href="http://www.google.com" >Google</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
I am trying to get both Title and Date, and I have two separate XPath queries:
//a[@class="Title"]/@href
//p[@class="Date"]
But when I run them I get two result lists with 23 and 22 values respectively, because in one place in the HTML the Date is missing. I would therefore like to stay inside each li and search for Title and Date within that li, so that I can check whether each value is present.
I changed my XPath to this:
//li
In the returned Element I can see two sub-elements, a and div, but I cannot figure out how to query what is inside the returned Element.

To search only within the current node, start the XPath expression with a dot. An expression that starts with // is evaluated from the document root, even when you call it on a sub-element.
For example:
.//a[@class="Title"]/@href
.//p[@class="Date"]
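As a rough sketch of how the dot-prefixed queries fit together, assuming the page is parsed with lxml (the question does not say which library it uses). The sample markup, the PAGE name and the extract_items helper are made up for illustration; the second li deliberately has no Date paragraph.

```python
from lxml import etree

# Hypothetical sample page: the second <li> has no Date paragraph.
PAGE = """
<ul>
  <li>
    <a class="Title" href="http://www.google.com">Google</a>
    <div class="Info">
      <p>text</p>
      <p class="Date">Status: Under development</p>
    </div>
  </li>
  <li>
    <a class="Title" href="http://www.example.com">Example</a>
    <div class="Info"><p>text</p></div>
  </li>
</ul>
"""


def extract_items(page):
    """Return one (href, date) pair per <li>, with None for a missing value."""
    root = etree.HTML(page)
    items = []
    for li in root.xpath("//li"):
        # The leading dot keeps each query inside the current <li>.
        hrefs = li.xpath('.//a[@class="Title"]/@href')
        dates = li.xpath('.//p[@class="Date"]/text()')
        href = hrefs[0] if hrefs else None
        date = dates[0].strip() if dates else None
        items.append((href, date))
    return items


items = extract_items(PAGE)
```

Because every li yields exactly one pair, the titles and dates stay aligned even when a Date is missing.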

Related

Selenium get only child elements from current parent

I'm trying to get data from an HTML like this:
<div>
<h4 id='id1'>...</h4>
<ul>
<li></li>
<li></li>
</ul>
</div>
<div>
<h4 id='id2'>...</h4>
<ul> ... </ul>
</div>
The goal is to get the <li> values that belong to each <h4>. To do this I've tried something like this:
divs = driver.find_elements_by_xpath("//div//h4[starts-with(@id,'id_')]")
for h4 in divs:
    title = h4.text
    # Get <li> from each div
    for value in h4._parent.find_elements_by_tag_name('li'):  # <-- It gives me all <li> in the page
        # TODO ...
Here I'm trying to get all the <h4> tags, then go to each one's parent (the <div>) and find the <li> tags that exist only in that parent. But I retrieve all the <li> tags.
I've searched the internet and found a couple of questions on StackOverflow, like "Get child element using xpath selenium python" or "selenium find child's child elements", which say to set the context, so I've tried this:
for value in h4._parent.find_elements_by_xpath('.//li'):
                                                ^
But it gives me the same number of elements.
So, am I misunderstanding something?
Thanks in advance.

//div[./h4[starts-with(@id,'id')]]//li
Try this to get all the li elements of every div that contains an h4 whose id starts with the given prefix.
//div[./h4] means a div that has an h4 as a direct child (one level deep).
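A small sketch of that predicate, run with lxml instead of Selenium so that it works without a browser (the XPath itself is the same). The sample markup and the PAGE name are invented for illustration; the third div is there to show that a non-matching h4 is skipped.

```python
from lxml import etree

# Hypothetical markup: two divs whose <h4> id starts with "id", one that does not.
PAGE = """
<html><body>
<div><h4 id="id1">First</h4><ul><li>a</li><li>b</li></ul></div>
<div><h4 id="id2">Second</h4><ul><li>c</li></ul></div>
<div><h4 id="other">Ignored</h4><ul><li>z</li></ul></div>
</body></html>
"""

root = etree.HTML(PAGE)

# Only <li> elements below a div that directly contains a matching <h4>.
all_items = root.xpath("//div[./h4[starts-with(@id,'id')]]//li/text()")
```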

Arendeep's answer is fine and it worked for me, but I've also found the cause of the problem I was having.
The _parent attribute seems to be the web page, not the parent element. That is why find_elements was returning every <li> tag on the page.
I can use the accepted answer, or alternatively:
parent = h4.find_element_by_xpath('..')
for value in parent.find_elements_by_tag_name('li'):
    # TODO
where the XPath '..' returns the parent element.
This approach gives me only the <li> elements under the current element's parent (perhaps a more precise answer), but the accepted answer also works in my scenario, where I want all the <li> tags that belong to each <h4>.
By the way, I haven't found any documentation for _parent.
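The same parent navigation can be tried offline. This is only a sketch using lxml rather than Selenium, and the markup, PAGE and groups names are made up for illustration. It also shows why the original attempt returned everything: a query starting with // ignores the element it is called on.

```python
from lxml import etree

# Hypothetical markup with two headed lists.
PAGE = """
<html><body>
<div><h4 id="id1">First</h4><ul><li>a</li><li>b</li></ul></div>
<div><h4 id="id2">Second</h4><ul><li>c</li></ul></div>
</body></html>
"""

root = etree.HTML(PAGE)

groups = {}
for h4 in root.xpath("//h4[starts-with(@id,'id')]"):
    parent = h4.xpath("..")[0]  # '..' selects the parent <div>
    # The leading dot restricts the search to this <div>.
    groups[h4.text] = [li.text for li in parent.xpath(".//li")]

# Without the dot the query runs against the whole document.
first_h4 = root.xpath("//h4")[0]
unscoped_count = len(first_h4.xpath("//li"))
```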

Scrapy selector: get nth-child text of an element

I am using a Scrapy selector to extract fields from HTML:
xpath = '/html/body/path/to/element/text()'
This is similar to the question "scrapy get nth-child text of same class". Following the documentation, we can use the .getall() method to get all the elements and then pick a specific one from the list.
selected_list = Selector(text=soup.prettify()).xpath(xpath).getall()
Is it possible to specify which nth element to select directly in the XPath itself?
Something like this:
xpath = '/html/body/path/to/element/text(2)'  # to select the 3rd child text
Example
<body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body>
The result of response.xpath('/body/div/text()').getall() consists of 2 elements:
'\n'
'text that needs to be'

You can use the following-sibling:: axis to select the siblings that come after the matched node. In this case you want the nearest text() node after the <i> tag, so you can do:
response.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()').get()
which gives you the nearest text() node following <i class="ent_sprite remind_icon">.
If you want the nth nearest following sibling of a node, the XPath is following-sibling::node()[n]; in our case that becomes:
'//i[@class="ent_sprite remind_icon"]/following-sibling::text()[n]'
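A minimal sketch of the positional predicate, using lxml as a stand-in for the Scrapy selector so that it runs without a Scrapy response (both evaluate XPath 1.0, where positions start at 1). The compact sample markup and the PAGE and BASE names are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical markup: two text nodes follow the <i> element inside the div.
PAGE = (
    '<html><body><div>'
    '<i class="ent_sprite remind_icon"></i>first<b>x</b>second'
    '</div></body></html>'
)

root = etree.HTML(PAGE)
BASE = '//i[@class="ent_sprite remind_icon"]/following-sibling::text()'

all_texts = [str(t) for t in root.xpath(BASE)]        # every following text node
nearest = [str(t) for t in root.xpath(BASE + "[1]")]  # nearest one only
second = [str(t) for t in root.xpath(BASE + "[2]")]   # second nearest

# The same index works directly on the child text nodes of the div.
second_child_text = [str(t) for t in root.xpath("//div/text()[2]")]
```

With a Scrapy response the same expressions would be passed to response.xpath(...) and read with .get() or .getall().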

xpath in lxml to find id number based on href

I'm trying to rewrite someone's library to parse some XML returned with requests. However, they use lxml in a way I'm not used to. I believe it's using regular expressions to find the data, and while most of the library works as provided, it doesn't work when the site being parsed has the file id in a list structure. Essentially I get a page back and I'm looking for an id that matches the href athlete number. So say I want to get just the ids for athlete 567377.
</div>
</a></div>
<ul class='list-entries'>
<li class='entity-details feed-entry' id='Activity-123120999590'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/567377' >
</a>
</div>
</li>
<li class='entity-details feed-entry' id='Activity-16784940202'>
<div class='avatar avatar-athlete avatar-default'>
<a class='avatar-content' href='/athletes/5252525'>
</a>
</div>
The code:
lst_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']")
returns all the list items perfectly, but for all activities. I want only the ones related to the right athlete. The library uses the following, with an @href, to select the right athlete:
lst_athlethe_act_in_group_activity = parser.xpath(".//li[substring(@id, 1, 8)='Activity']/*[@href='/athletes/"+athlethe_id+"']/..")
However, this never seems to work. It finds the activities but then throws them all away.
Is there a better way to get this working? Is there a tutorial that can point me in the right direction on how to correlate to the next element?

The element with the href attribute isn't an immediate child of your li element, so your XPath fails. You're matching:
.//li/*[@href="..."]
You want:
.//li/div/a[@href="..."]
(You could match * instead of a if you think another element might carry the href attribute, and you can match against .//li//a[@href="..."] if you think the path to the a element might not always be li/div/a.)
So to find the li element:
parser.xpath(".//li[substring(@id, 1, 8)='Activity']/div/a[@href='/athletes/%s']/../.." % '5252525')
But you can also write that without the ../..:
parser.xpath(".//li[substring(@id, 1, 8)='Activity' and div/a/@href='/athletes/%s']" % '5252525')
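To check both forms against the markup from the question, here is a sketch that parses a cleaned-up copy of it with lxml. The PAGE string and the activity_ids helper are hypothetical names introduced for the example; the library being rewritten may build its parser differently.

```python
from lxml import etree

# Cleaned-up copy of the list from the question (closing tags added).
PAGE = """
<ul class='list-entries'>
  <li class='entity-details feed-entry' id='Activity-123120999590'>
    <div class='avatar avatar-athlete avatar-default'>
      <a class='avatar-content' href='/athletes/567377'></a>
    </div>
  </li>
  <li class='entity-details feed-entry' id='Activity-16784940202'>
    <div class='avatar avatar-athlete avatar-default'>
      <a class='avatar-content' href='/athletes/5252525'></a>
    </div>
  </li>
</ul>
"""

parser = etree.HTML(PAGE)


def activity_ids(tree, athlete_id):
    """Return the id of every Activity <li> whose nested link points at athlete_id."""
    query = (
        ".//li[substring(@id, 1, 8)='Activity' "
        "and div/a/@href='/athletes/%s']" % athlete_id
    )
    return [li.get("id") for li in tree.xpath(query)]


# The original expression looks for href on a direct child of <li>, so it finds nothing.
direct_child = parser.xpath(
    ".//li[substring(@id, 1, 8)='Activity']/*[@href='/athletes/567377']/.."
)
```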

Returning a list of values from a list of dictionaries where keys equal a certain value

<div id="tabs" class="clearfix">
<ul id="remove">
<li class="btn_arrow_tab left inactive">
<a href="#" class="doubleText">Pay Monthly <small>View standard rates and Bolt Ons</small>
</a>
</li>
<li class="btn_arrow_tab right inactive">
<a href="#" class="doubleText">Pay & Go<small>View standard rates and Bolt Ons</small>
</a>
</li>
</ul>
</div>
I have no experience in web scraping and am trying to follow examples and the docs to click the button with the text 'Pay Monthly'. This button then dynamically displays some text which I need to copy. How do I go about clicking it for starters, and then reading the text that is displayed? I am trying it with Selenium; would BeautifulSoup be better? I have been trying this line of code but it isn't doing anything:
driver.find_element_by_xpath("//a[text()[contains(.,'Pay Monthly')]]").click()

It is always good practice to use a mixture of absolute and relative XPath to locate an element.
First, find a parent that has a unique identifier. The element you mentioned has two ancestors with a static id: the root div and the ul.
Now we can follow your approach and find the element by its text. Either of the following should work:
driver.find_element_by_xpath("//div[@id='tabs']//a[text()[contains(.,'Pay Monthly')]]").click()
driver.find_element_by_xpath("//ul[@id='remove']//a[text()[contains(.,'Pay Monthly')]]").click()
But if the item is a static element, and considering your goal here, I would suggest the following method: indexing your XPath when it returns multiple elements.
myElement = driver.find_element_by_xpath("//div[@id='tabs']//a[@href='#'][1]")
myElement.click()
And then you can capture the text. You can add a wait to make sure the text has changed.
myText = myElement.text
Let me know if this doesn't work.
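The XPath expressions can be checked without a browser. The sketch below uses lxml on a copy of the markup (the ampersand is escaped, and the PAGE name is made up) and only tests what each expression matches; it does not click anything. One detail worth knowing: a trailing [1] applies per parent, so it matches the first a inside each li, and it only works in the answer because find_element returns the first match. Wrapping the path in parentheses indexes the whole result set instead.

```python
from lxml import etree

# Copy of the tabs markup from the question, with '&' escaped.
PAGE = """
<div id="tabs" class="clearfix">
  <ul id="remove">
    <li class="btn_arrow_tab left inactive">
      <a href="#" class="doubleText">Pay Monthly <small>View standard rates and Bolt Ons</small></a>
    </li>
    <li class="btn_arrow_tab right inactive">
      <a href="#" class="doubleText">Pay &amp; Go<small>View standard rates and Bolt Ons</small></a>
    </li>
  </ul>
</div>
"""

root = etree.HTML(PAGE)

# Match the link by its own text, anchored on the div with a static id.
by_text = root.xpath("//div[@id='tabs']//a[text()[contains(.,'Pay Monthly')]]")

# [1] is evaluated per parent: each <li> contributes its first <a>.
per_parent = root.xpath("//div[@id='tabs']//a[@href='#'][1]")

# Parentheses make [1] pick the first node of the combined result.
whole_set = root.xpath("(//div[@id='tabs']//a[@href='#'])[1]")
first_link_text = whole_set[0].text.strip()
```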

How to extract text from HTML (after certain string)

I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">str1</b>
</span>
</li>
And I want to extract str1, which always comes after abc. I was able to do it using this XPath:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But that solution depends on it being the first element with this particular class, and the class appears more than once and isn't always in the same order. Is there a way to search the HTML for abc and pull the text that follows it?

Maybe find the element that contains abc, navigate to its child or parent if needed, and get the text.
Example selectors:
Find any element (* matches any tag) that contains the text abc and select any child:
//*[contains(text(), 'abc')]/*
Find any element that contains the text abc and select its b child:
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant containing the text abc and select the b element inside that li; use // since b is not a direct child of li:
//li[.//*[contains(text(), 'abc')]]//b
If you know abc, start from there, see which element is returned, and navigate to the parent, ancestor, or child as needed.
For more about XPath, see the w3schools XPath selectors reference.

The following XPath should give the text you are looking for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes that the str1 you are looking for is always inside an element with class="text-primary text-dark".
It also assumes that you want the first such occurrence (ignoring the other text-primary text-dark elements); that is why the [1] is there.
This XPath ensures that the parent of the node carrying those classes contains the text abc before the node is selected.
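A quick sketch of the label-based lookup with lxml, on made-up markup where an element with the same class appears earlier under a different label (the PAGE name and the xyz entry are assumptions for illustration). Note that in XPath 1.0 contains(text(), 'abc') tests only the first text node of each element, which is enough here because abc comes before the b element.

```python
from lxml import etree

# Hypothetical list: the same class appears first under another label.
PAGE = """
<ul>
  <li><span>xyz: <b class="text-primary text-dark">other</b></span></li>
  <li><span>abc: <b class="text-primary text-dark">str1</b></span></li>
</ul>
"""

root = etree.HTML(PAGE)

# Positional lookup from the question: depends on document order.
by_position = root.xpath('.//b[@class = "text-primary text-dark"]')[0].text

# Label-based lookup: start from the element whose text contains abc.
by_label = root.xpath(
    "//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()"
)
```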
