lxml etree iteration inside xpath searches - python

I've been trying to retrieve information from tree nodes. I get the nodes from XPath searches, but when I run XPath searches again inside those nodes, the search goes back to the root. Is it possible to iterate over nodes with specific classes and retrieve different information inside them?
An example of the HTML I'm parsing:
<li>
<div class="product-preview">
<div class="product-image">
<div class="product-info">
</li>
I need to find those product-preview nodes, which I'm already getting using
tree.xpath('//div[contains(@class, "product-preview")]')
but when I try to get the different subnodes by iterating over the results of the previous code, the search always runs against the whole document instead.
What I am trying to do is
for element in tree.xpath('//div[contains(@class, "product-preview")]'):
    element.xpath("new search")
How should I iterate over the elements to make new searches inside them?
Thank you very much.

I've resolved the issue using the find_class method from lxml.html; it returns the node without its parent context. After that I can use xpath to search inside it. (Note that xpath returns a list, so you need to index into it before calling get.)
for element in tree.find_class(class_name):
    element.xpath(xpath)[0].get(attrib)
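For reference, the same result is possible without find_class by anchoring the inner XPath at the current element with a leading dot; a bare //div inside element.xpath() would search the whole document again, which is the behavior described in the question. A minimal sketch, assuming page_source holds the HTML text:

from lxml import html

tree = html.fromstring(page_source)  # page_source is assumed to be the HTML text
for element in tree.xpath('//div[contains(@class, "product-preview")]'):
    # The leading "." anchors the search at this element instead of the root.
    images = element.xpath('.//div[contains(@class, "product-image")]')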

Related

Issue in locating web element selenium

I am unable to locate a web element on a website. The web elements of the website are dynamic, and the elements I am trying to locate have very similar attributes to others, with just small differences like changes in integers. These integers also change when I refresh the page, so I am unable to locate the element. Can someone help?
I tried the following, but there may be mistakes in the syntax:
With contains text: WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.XPATH, "//div[contains(text(),'Type here...')]")))
Absolute XPath and relative XPath (these change)
Contains sibling, contains parent, but the parent and sibling elements also cannot be located
Here is the HTML of the element:
<div contenteditable="true" class="cke_textarea_inline cke_editable cke_editable_inline cke_contents_ltr placeholder cke_show_borders" tabindex="0" spellcheck="true" role="textbox" aria-label="Rich Text Editor, editor15" title="Rich Text Editor, editor15" aria-describedby="cke_849" style="position: relative;" xpath="1">enter question here</div>
There are spaces being added which sometimes cause issues; please try deleting that space. Also, if you can share the HTML then we might be able to help.
WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.XPATH, "//div[contains(text(),'Type here...')]")))
If removing the space doesn't work, then you can try the below XPath:
//div[text()='enter question here'][contains(@aria-label,'Rich Text Editor')]
If there are really no attributes you would consider reliable, you can access the element by the text inside of it. I'd recommend two things: don't do this unless you have no other choice, and don't just use //div; add as much XPath info to the path as you can.
driver.find_element_by_xpath('//div[text()="enter question here"]')
OR
driver.find_element_by_xpath('//div[contains(text(),"enter question")]')
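If you are on Selenium 4, the find_element_by_* helpers are removed, so a By-style locator is needed. A hedged sketch combining that with an explicit wait; the starts-with predicate on aria-label is an assumption based on the HTML above (the trailing integer in "editor15" changes between refreshes, so matching on the stable prefix avoids it):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)  # driver is assumed to be an open WebDriver
# Match on the stable "Rich Text Editor" prefix rather than the changing suffix.
editor = wait.until(EC.presence_of_element_located(
    (By.XPATH, "//div[starts-with(@aria-label, 'Rich Text Editor')]")))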

Selecting inner text when using find element by css selector python, selenium and create a loop

I'm trying to get all the link text from p tags with a specific class, then create a loop to find all other similar elements.
The value I want is in
<div class='some other class'>
<p class='machine-name install-info-entry x-hidden-focus'> text i want
</p>
So far I am using this:
installations = browser.find_elements_by_css_selector('p.machine-name.install-info-entry.x-hidden-focus')
Any help is appreciated. Thanks.
You can just use .text
installations = browser.find_elements_by_css_selector('p.machine-name.install-info-entry.x-hidden-focus')
for installation in installations:
    print(installation.text)
Note that installations is a list of web elements, whereas installation is just a web element from the list.
UPDATE1:
If you want to extract the attribute from a web element, then you can follow this code:
print(installation.get_attribute("attribute name"))
You should pass your desired attribute name in get_attribute method.
You can read the innerHTML attribute to get the source of the content of the element, or outerHTML for the source including the current element.
installation.get_attribute('innerHTML')
Hope this will be helpful.
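Putting the two answers together, a small sketch of the difference between the three accessors (output shapes are illustrative):

for installation in installations:
    print(installation.text)                        # rendered, visible text only
    print(installation.get_attribute('innerHTML'))  # markup inside the <p>
    print(installation.get_attribute('outerHTML'))  # markup including the <p> itself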

python selenium accessing html elements on more than two levels

I have a page that has to be scraped. I use the Python code
div = driver.find_element_by_class_name("parent")
data = div.find_elements_by_class_name("child1")
# I cannot access the web elements of data, e.g. data.find_elements_by...
for tag in data:
    # I cannot print the information of each div here
The HTML:
<div class="Parent">
<div class = child1 >
<div class = "heading">
data
</div>
</div>
<div class = child1 child2 >strong text
<div class = "heading">
<span>data</span>
</div>
</div>
</div>
Is there an easy way to access the data?
Well, you can access HTML tags or text in different ways: http://selenium-python.readthedocs.io/locating-elements.html
For multiple elements you can use :
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
There isn't a simple one-line solution as far as I'm aware without specifics about the information you're looking for.
For instance, let's say you're using XPath (my personal preference):
Absolute XPath:
/html/body/div[2]/div/div/footer/section[3]/div/ul/li[3]/a
The above XPath will technically work, but each of those nested relationships will need to be present 100% of the time, or the locator will not function. An XPath chosen this way is known as an absolute XPath, and there is a good chance it will vary in every release. It is always better to choose a relative XPath, as it helps reduce the chance of an element-not-found exception.
Relative XPath: //*[@id='social-media']/ul/li[3]/a
We can take different approaches to the data; by using the correct way to select the data we need, we can extract only the needed information. Look into each of these methods to understand them better, because you're asking for one line of code and each of them has pros and cons (times when they can be useful or not).
It seems you want to access the text which is inside the heading div; if so, then you can try the below code.
element=driver.find_element_by_class_name("heading")
data=element.text
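That only grabs the first match, though. To keep each lookup scoped to one child1 block, search from the child element rather than from the driver; a minimal sketch (note the HTML above uses class="Parent" with a capital P, unlike the question's code):

parent = driver.find_element_by_class_name("Parent")
for child in parent.find_elements_by_class_name("child1"):
    # Searching from child stays inside this block instead of the whole page.
    heading = child.find_element_by_class_name("heading")
    print(heading.text)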
Assuming you are asking for a way to loop through data where the information is located under different locators at different nesting levels:
There are multiple ways.
Look for various selectors that match your pattern and find an approach that fits your problem; refer to a CSS/XPath selector reference.
If there are many selectors (but they are used consistently), you can use the ByChained/ByAll selectors. The implementation lives in Java's Selenium support library and looks like this; you can mimic it (a Python approximation is sketched below):
selector1 = .heading .child1
selector2 = .heading .child2
selector3 = .heading .child3 span
ByAll(selector1, selector2, selector3)
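Python's Selenium bindings don't ship ByChained/ByAll, but a comma-separated CSS selector list matches elements that satisfy any one of the selectors, which approximates ByAll; a sketch with illustrative selector names mirroring the Java example above:

from selenium.webdriver.common.by import By

# The selectors below are placeholders, as in the Java example above.
selectors = ".heading .child1, .heading .child2, .heading .child3 span"
elements = driver.find_elements(By.CSS_SELECTOR, selectors)
for element in elements:
    print(element.text)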
If the parent is the only matching selector and there's no way to know about the child selectors, then another way is to read the innerText/textContent property from a common parent:
driver.find_element_by_css_selector('.child1').get_attribute('innerText')
If none of these solves your problem, and your application is dynamic enough to use different references and different nesting levels each time across the whole page, then it was meant not to be scraped, and you should look for other ways of scraping it.

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3><a href="...">The link I want to obtain</a></h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand this (be ambitious) and select for the h3 > a tags, then go to the div.cont ancestor (based off XPath query with descendant and descendant text() predicates):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip anymore, this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you need to then scan for the link anyway that doesn't actually buy you anything.
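To finish the pairing the answer describes, a minimal sketch that collects (href, paragraph text) tuples per div.cont; findtext with a default is standard lxml ElementTree API, and tree is assumed to be the parsed document:

results = []
for matchingdiv in tree.xpath('//div[contains(@class, "cont")]'):
    links = matchingdiv.xpath('.//h3//a')
    if not links:
        continue
    # Pair the first link's href with the paragraph text of the same div.cont.
    paragraph = matchingdiv.findtext('.//p', default='')
    results.append((links[0].get('href'), paragraph))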

lxml bug in .xpath?

After going through the XPath in lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. Firstly, lxml seems to return a list even when my XPath expression clearly selects only one element, and secondly, .xpath seems to return the elements' parent rather than the elements themselves when given a straightforward XPath search expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print doc.xpath("//html")
#[<Element html at 1f385e0>]
#(This makes sense - return a list of all possible matches for html)
print doc.xpath("//html[1]")
#[<Element html at 1f385e0>]
#(This doesn't make sense - why do I get a list when there
#can clearly only be 1 element returned?)
print doc.xpath("body")
#[<Element body at 1d003e8>]
#(This doesn't make sense - according to
#http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
#without any leading / I should get the *child* nodes of the named
#node, which in this case would mean I get a list of
#p tags [<Element p at ...>, <Element p at ...>]
It's because the context node of doc is the 'html' node. When you use doc.xpath('body'), it selects the child element 'body' of 'html'. This conforms to the XPath 1.0 standard.
To get all p tags, use doc.findall(".//p").
As per the guide, the expression nodename selects all child nodes of the named node.
Thus, when you use only nodename (without a leading /), the search starts from the currently selected node (to anchor the search at the current node explicitly, use a dot).
In fact, doc.xpath("//html[1]") can return more than one node with a different input document from your example. That path picks every html element that is first among its html siblings; if there are matching non-sibling elements, it will select the first sibling of each group.
XPath: (//html)[1] forces a different order of evaluation. It selects all of the matching elements in the document and then chooses the first.
But, in any case, it's a better API design to always return a list. Otherwise, code would always have to test for single or None values before processing the list.
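A small self-contained sketch of that evaluation-order difference, using p elements so the document can contain several matches:

from lxml import html

doc = html.fromstring(
    "<html><body><div><p>A</p><p>B</p></div><div><p>C</p></div></body></html>")
# //p[1]: every <p> that is first among its <p> siblings -> A and C
print([p.text for p in doc.xpath("//p[1]")])    # ['A', 'C']
# (//p)[1]: the single first <p> in document order -> A only
print([p.text for p in doc.xpath("(//p)[1]")])  # ['A']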
