I am using a Scrapy Selector to extract fields from HTML:
xpath = /html/body/path/to/element/text()
This is similar to the question "scrapy get nth-child text of same class".
Following the documentation, we can use the .getall() method to get all matching elements and then pick a specific one from the list.
selected_list = Selector(text=soup.prettify()).xpath(xpath).getall()
Is it possible to specify which nth element to select directly in the XPath itself?
Something like the following:
xpath = /html/body/path/to/element/text(2)  # to select the 3rd child text node
Example
<body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body>
The result of response.xpath('/body/div/text()').getall() consists of 2 elements:
'\n'
'text that needs to be'
You can use the following-sibling:: axis to reach the siblings that come after a node. In this case you want the text() node nearest to the <i> tag, so you can do:
response.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()').get()
This gives you the first text() node that follows <i class="ent_sprite remind_icon">.
If you want the nth following sibling of a node, the XPath is following-sibling::node()[n], which in our case becomes:
'//i[@class="ent_sprite remind_icon"]/following-sibling::text()[n]'
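As a minimal sketch of the positional predicate, the snippet below runs the same expression against the sample markup from the question. It assumes lxml is installed and uses it in place of a Scrapy response so the example is standalone; the variable names are illustrative only.

```python
from lxml import html

# Sample markup from the question, wrapped in <html> so it parses as a document.
doc = html.fromstring("""
<html><body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body></html>
""")

icon = '//i[@class="ent_sprite remind_icon"]'

# All text nodes that are direct children of the <div>.
all_text = doc.xpath('//body/div/text()')

# [1] picks the nearest following text-node sibling of <i>.
# XPath positions are 1-based, so [n] selects the nth one.
nearest = doc.xpath(icon + '/following-sibling::text()[1]')

print([t.strip() for t in all_text if t.strip()])
print(nearest[0].strip())
```

With a Scrapy response the same expression string should work unchanged with response.xpath(...).get().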
There is an anchor tag (<a>) under a div, and under the <a> tag there is a <p> tag whose class matches 12 items. I am trying to get the text of all of those <p> tags using Python.
Here is my code.
First approach:
for ele in driver.find_element_by_xpath('//p[@class="BrandCard___StyledP-sc-1kq2v0k-1 bAWFRI text-sm font-semibold text-center cursor-pointer"]'):
    print(ele.text)
Second approach:
content_blocks=driver.find(By.CSS_SELECTOR, "div.CategoryBrand__GridCategory-sc-17tjxen-0.PYjFK.my-4")
for block in content_blocks:
    elements = block.find_elements_by_tag_name("a")
    for el in elements:
        list_of_hrefs.append(el.text)
but every time it gives me an error "WebElement is not iterable".
I have added a picture of the page element.
In your first approach you are missing the "s" in find_elements. The plural form returns a list of all matches, while the singular form returns only the first match, which is a single WebElement and cannot be iterated.
I also match the class with contains() on a substring instead of the full class string.
r_elems = driver.find_elements_by_xpath("//p[contains(@class, 'BrandCard')]")
[x.text for x in r_elems]
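To illustrate the substring match on its own, here is a small sketch that runs the same XPath with lxml instead of Selenium, so no browser is needed. The markup is hypothetical, since the real page is not shown in the question; only the class prefixes are taken from it.

```python
from lxml import html

# Hypothetical markup: the actual page source is not part of the question.
doc = html.fromstring("""
<html><body>
<div class="CategoryBrand__GridCategory-sc-17tjxen-0 PYjFK my-4">
  <a href="#"><p class="BrandCard___StyledP-sc-1kq2v0k-1 bAWFRI text-sm">Brand A</p></a>
  <a href="#"><p class="BrandCard___StyledP-sc-1kq2v0k-1 bAWFRI text-sm">Brand B</p></a>
  <p class="other">Not a brand</p>
</div>
</body></html>
""")

# contains(@class, ...) is a plain substring test on the attribute value,
# so it keeps working if the generated suffix of the class name changes.
matches = doc.xpath("//p[contains(@class, 'BrandCard')]")

# xpath() always returns a list here, so it is safe to iterate.
brand_names = [p.text for p in matches]
print(brand_names)
```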
I am trying to automate a process using selenium and a webdriver.
The html looks like this:
<span class="contract-item">
<span class="contract-label">
<span class="contract-name">Jimmy</span>
</span>
<div class="current-stats">
<span class="info"></span>
The issue I am facing is that there are many 'contract-item' and 'info' classes. I only want to find info for specific 'contract name's. However by finding the 'contract name' I have lost the info. How do I get the 'info' for a specific name?
I have this so far.
team_name = self.driver.find_elements_by_xpath("//*[contains(text(), {})]".format(jimmy))[1]
Many thanks!
Below is the required XPath:
//*[@class='contract-item' and
contains(.,'Jimmy')]/div/span[@class='info']
You just need to replace the contract name string (i.e. Jimmy) with the desired one and you will get the corresponding info.
You first need to get the element that contains the name, which you did correctly:
//*[contains(text(), "{}")]
Then you need to go up to the nearest common parent of the info element and the element you found; each /.. moves one element up the HTML tree.
//*[contains(text(), "{}")]/../..
Finally, find the correct element by filtering on class:
//*[contains(text(), "{}")]/../..//span[@class="info"]//text()
So your expression should be as follows (Selenium can only return elements, not text nodes, so leave off the trailing //text() and read .text from the element you get back; note also that contains() is case-sensitive):
team_name = self.driver.find_elements_by_xpath('//*[contains(text(), "{}")]/../..//span[@class="info"]'.format('Jimmy'))[1]
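The two approaches above can be compared side by side in a short sketch. It assumes lxml instead of Selenium so that it runs without a browser. The closing tags, the second contract ("Anna") and the info values are hypothetical additions, and info_for is an illustrative helper, not part of any library.

```python
from lxml import html

# Markup from the question, completed with hypothetical closing tags,
# a second hypothetical contract, and made-up info values.
doc = html.fromstring("""
<html><body>
<span class="contract-item">
  <span class="contract-label"><span class="contract-name">Jimmy</span></span>
  <div class="current-stats"><span class="info">3 wins</span></div>
</span>
<span class="contract-item">
  <span class="contract-label"><span class="contract-name">Anna</span></span>
  <div class="current-stats"><span class="info">5 wins</span></div>
</span>
</body></html>
""")


def info_for(name):
    """Return the info text for one contract name, found two ways."""
    # Approach 1: filter the container by its full string value, then descend.
    by_container = doc.xpath(
        "//*[@class='contract-item' and contains(., '{}')]"
        "/div/span[@class='info']".format(name)
    )
    # Approach 2: find the name element, climb two levels, then descend.
    by_climbing = doc.xpath(
        '//*[contains(text(), "{}")]/../..//span[@class="info"]'.format(name)
    )
    return by_container[0].text, by_climbing[0].text


print(info_for("Jimmy"))
print(info_for("Anna"))
```

Both expressions select the info element itself and the text is read from it afterwards, which is also the shape Selenium expects.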
What is a good way to select multiple nodes from within a node in HTML using XPath?
I have this code (actually this is repeated 23 times):
<li>
<a class="Title" href="http://www.google.com" >Google</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
I am trying to get both the Title and the Date, and I have two different XPath queries like this:
//a[@class="Title"]/@href
//p[@class="Date"]
But when I do this I get two result lists with 23 and 22 values respectively, because at one point in the HTML the Date is not present. I would therefore like to stay inside each li and search for Title and Date within that li, so I can check whether each value exists.
I changed my XPath to this:
//li
In the returned element I can see that there are two sub-elements, a and div, but I cannot figure out how I am supposed to handle what is inside the returned element.
When you want to search for elements within the current node, you need to start your XPath pattern with a dot.
For example:
.//a[@class="Title"]/@href
.//p[@class="Date"]
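A minimal sketch of the per-item loop, assuming lxml (a Scrapy selector should accept the same relative expressions). The second li, which has no Date, is a hypothetical addition to show how the missing value is handled.

```python
from lxml import html

# First <li> is from the question; the second one (without a Date)
# is a hypothetical item added to show the missing-value case.
doc = html.fromstring("""
<html><body><ul>
<li>
  <a class="Title" href="http://www.google.com">Google</a>
  <div class="Info">
    <p>text</p>
    <p class="Date">Status: Under development</p>
  </div>
</li>
<li>
  <a class="Title" href="http://www.example.com">Example</a>
  <div class="Info"><p>text</p></div>
</li>
</ul></body></html>
""")

rows = []
for li in doc.xpath('//li'):
    # The leading dot keeps each query inside the current <li>.
    hrefs = li.xpath('.//a[@class="Title"]/@href')
    dates = li.xpath('.//p[@class="Date"]/text()')
    # An empty result list means the field is absent for this item.
    rows.append((hrefs[0] if hrefs else None, dates[0] if dates else None))

print(rows)
```

Because every li yields exactly one row, the titles and dates stay aligned even when a Date is missing.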
I have a radio button with value as HTML as follows:
<div class='result'>
<span>
<input type='radio'/>
option1
</span>
<span>
<input type='radio'/>
option2
</span>
<span>
<input type='radio'/>
option3
</span>
</div>
I tried the following XPath, but it isn't working:
//span[contains(text(),'option1')]/input[@type='radio']
Please help me write XPath for this.
There are actually two text nodes in the target span: the first is just whitespace before <input>, and the second comes after <input> (the one that contains "option1").
Your XPath //span[contains(text(),'option1')] means: return the span whose first text node contains "option1".
You can use one of the expressions below to match the required input:
//span[normalize-space()="option1"]/input[@type="radio"]
//span[contains(text()[2],'option1')]/input[@type='radio']
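The difference between the failing expression and the two fixes can be checked with a short sketch. It assumes lxml as the XPath engine (the same one Scrapy builds on) and uses the markup from the question, trimmed to two options.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
<div class='result'>
<span>
<input type='radio'/>
option1
</span>
<span>
<input type='radio'/>
option2
</span>
</div>
</body></html>
""")

# Original attempt: contains() only sees the first text node (whitespace).
original = doc.xpath("//span[contains(text(),'option1')]/input[@type='radio']")

# Fix 1: compare the whitespace-normalized string value of the whole span.
by_string_value = doc.xpath('//span[normalize-space()="option1"]/input[@type="radio"]')

# Fix 2: explicitly test the second text node.
by_second_text = doc.xpath("//span[contains(text()[2],'option1')]/input[@type='radio']")

print(len(original), len(by_string_value), len(by_second_text))
```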
There are two text elements per span. One precedes the input element, and one follows it, but the first one is essentially empty.
In this code I find the input elements, then their parents, then the second text elements of those span parents.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.html').read())
>>> for item in selector.xpath('.//input[@type="radio"]/../text()[2]'):
... item.extract()
...
'\noption1\n'
'\noption2\n'
'\noption3\n'
Try this to select option 1:
//input[@type='radio']/preceding::span[1][contains(.,'option1')]
I guess you can't use text() here, because this function returns a sequence of the child text nodes of the current span element. There are 2 text nodes in your example:
<span>
<input type='radio'/>
option1
</span>
The 1st text node is between <span> and <input type='radio'/> and contains just a newline.
The 2nd text node is between <input type='radio'/> and </span> and contains the option1 text plus 2 newlines (one at the beginning and one at the end).
The contains() function expects a string argument rather than a sequence. In XPath 1.0, a node-set passed where a string is expected is converted to the string value of its first node, which here is just a newline.
If you need to select an input followed by some text node, you can use the following expression:
//input[@type='radio'][contains(following-sibling::text(), 'option1')]
If you need to select a span containing the text option1 and an input with @type='radio', you can try the following expression:
//span[contains(., 'option1') and input/@type='radio']
If you need to select the input instead of the span, use the following expression:
//span[contains(., 'option1')]/input[@type='radio']
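The string-conversion behavior and the three expressions above can be tried out with a small sketch. It assumes lxml, which implements XPath 1.0, and the variable names are illustrative.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
<div class='result'>
<span>
<input type='radio'/>
option1
</span>
<span>
<input type='radio'/>
option2
</span>
</div>
</body></html>
""")

# string() on a node-set uses only its first node: here, whitespace.
first_text = doc.xpath("string(//span[1]/text())")

# Input whose first following text sibling contains 'option1'.
input_by_sibling = doc.xpath(
    "//input[@type='radio'][contains(following-sibling::text(), 'option1')]"
)

# Span that contains 'option1' anywhere and has a radio input child.
span_with_input = doc.xpath("//span[contains(., 'option1') and input/@type='radio']")

# Same filter on the span, then step down to the input.
input_via_span = doc.xpath("//span[contains(., 'option1')]/input[@type='radio']")

print(repr(first_text))
print([e.tag for e in input_by_sibling])
print([e.tag for e in span_with_input])
print([e.tag for e in input_via_span])
```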
I can suggest the following resources for learning more about XPath. The W3C recommendations contain a full description of XPath. If you use XPath 2.0 then you can look at:
XML Path Language (XPath) 2.0
XQuery 1.0 and XPath 2.0 Functions and Operators
For XPath 3.0 look at:
XML Path Language (XPath) 3.0
XPath and XQuery Functions and Operators 3
These recommendations are long and hard to read, but in them you can find a list of all available axes including following-sibling::, a description of text(), a description of contains(), etc.
Also there are a lot of brief XPath tutorials. For example you can look at this one.
I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">st1</b>
</span>
</li>
I want to extract st1, which always appears after abc. I was able to do it with this XPath:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But that solution depends on it being the first appearance of this particular class, which appears more than once and isn't always in the same order. Is there a way to search the HTML for abc and pull the text that follows it?
You could find the element that contains abc, navigate to a child or parent if needed, and get its text.
Example selectors:
Find any element (* matches any tag) that contains the text abc and select any of its children.
//*[contains(text(), 'abc')]/*
Find any element (* matches any tag) that contains the text abc and select its b child.
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant containing the text abc and select the b element inside that li; use // since b is not a direct child of li.
//li[.//*[contains(text(), 'abc')]]//b
If you know abc, start from there, see what element is returned, and navigate to a parent, ancestor, or child as needed.
For more about xpath please see w3schools xpath selectors
The following XPath should give the text you are searching for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes that the st1 you are looking for is always under an element with the attribute class="text-primary text-dark".
It also assumes that you want the first such occurrence (ignoring the other text-primary text-dark elements), which is why [1] is used.
This XPath ensures that the parent of the node you are matching by class contains the text abc.
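As a rough check of the selectors from both answers, the sketch below runs them with lxml. The first li (labelled "xyz:") is a hypothetical item added to show that an earlier element with the same class is skipped; the ::before pseudo-element shown by the browser inspector is left out because it is not part of the HTML source.

```python
from lxml import html

# Second <li> is from the question; the first is a hypothetical decoy
# that uses the same class on its <b> element.
doc = html.fromstring("""
<html><body><ul>
<li class="group-ib medium-gap line-120 vertical-offset-10">
<span>
xyz:
<b class="text-primary text-dark">other</b>
</span>
</li>
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark"></i>
<span>
abc:
<b class="text-primary text-dark">st1</b>
</span>
</li>
</ul></body></html>
""")

# Element whose first text node contains 'abc', then its <b> child.
via_child = doc.xpath("//*[contains(text(), 'abc')]/b/text()")

# <li> with a descendant containing 'abc', then any <b> inside that <li>.
via_li = doc.xpath("//li[.//*[contains(text(), 'abc')]]//b/text()")

# Same anchor, but select the first child with the given class.
via_class = doc.xpath(
    "//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()"
)

print(via_child, via_li, via_class)
```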