scrapy get parent element - python

I'm new to scrapy and need to do one thing. I used to use lxml and did
elements = careers.xpath('//text()[contains(., "engineer")')
After that I was able to do
element = elements[0].getparent()
Unfortunately, I can't do the same with scrapy.
I try doing
response.xpath('//text()[contains(., "engineer")')
as well as calling .getparent() on any of these elements, but it raises an error saying that Selector objects have no attribute getparent. Is it possible to do the same with Scrapy?

To access the parent element, you can use the .. notation at the end of your XPath expression. Consider reading this other StackOverflow answer for more details.
Apart from that, you may want to add a closing ] at the end of your XPath to close the [ before contains.
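For example, the same `..` step works in lxml, and the identical XPath can be passed to Scrapy's response.xpath(). A runnable sketch with an invented HTML snippet:

```python
from lxml import etree

# invented snippet for illustration
doc = etree.fromstring('<div><p>We need an engineer</p><p>Other text</p></div>')

# select the matching text node, then step up to its parent element with ".."
parents = doc.xpath('//text()[contains(., "engineer")]/..')
print(parents[0].tag)  # p
```

In Scrapy the equivalent would be response.xpath('//text()[contains(., "engineer")]/..'), which returns Selector objects for the parent elements.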

Related

How to use Regex in CSS Selector scrapy

I need to get a ul tag by class name, but the class name has many different combinations; only two letters ever change. product-gallerytw__thumbs could be one and product-galleryfp__thumbs another. I need to know how to use a CSS selector with regex so that either of these (or any other combination) can be found.
I can't use Xpath as the location changes
img_ul = response.css('.product-gallerytw__thumbs')
print(img_ul)
This is what I am trying to do but have not found a way to add regex inside the .css()
You actually can use XPath:
img_ul = response.xpath("//*[contains(@class,'product-gallery')]")
or, if you really need to match everything except the two changing characters:
img_ul = response.xpath("//*[contains(@class,'product-gallery')][contains(@class,'__thumbs')]")
There is nothing a CSS selector can do that XPath can't; in these libraries, CSS selectors are translated into XPath expressions under the hood.
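A minimal, runnable illustration with lxml (Scrapy's response.xpath behaves the same way; the HTML is invented for the example):

```python
from lxml import html

doc = html.fromstring('''
<div>
  <ul class="product-gallerytw__thumbs"><li>tw</li></ul>
  <ul class="product-galleryfp__thumbs"><li>fp</li></ul>
  <ul class="unrelated"><li>no</li></ul>
</div>''')

# both contains() predicates must match, so only the gallery lists are selected
uls = doc.xpath("//ul[contains(@class,'product-gallery')][contains(@class,'__thumbs')]")
print([ul.get('class') for ul in uls])
# ['product-gallerytw__thumbs', 'product-galleryfp__thumbs']
```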

Reference to XPath - is it possible?

I am using Selenium to automate a browser task through a Python script.
There is a text-box in my browser that I need to fill with info, but the XPath is formatted as below:
//*[@id="NameInputId14514271346457986"]
The problem is that the number after the id prefix (14514271346457986) changes every time. Is there a way to write the XPath something like:
//*[@id.start-with="NameInputId"]
Sorry if it is a dumb question - I started using Selenium this week and I couldn't find this info in the documentation.
You can test whether the first 9 characters of the @id value equal "NameInput" with the XPath 1.0 substring() function:
//*[substring(@id, 1, 9) = "NameInput"]
More simply, the starts-with() function (also available in XPath 1.0) does the same job:
//*[starts-with(@id, "NameInput")]
Sure, you can use an XPath like //*[contains(@id,"NameInputId")], but I guess this will possibly not be a unique locator. In that case the XPath needs to be more specific, e.g. by including additional attributes or some parent element.
You can use an XPath like this:
//*[@id[contains(.,'NameInputId')]]
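Here is a quick lxml check of the starts-with() approach (the id value is taken from the question; the surrounding HTML is invented):

```python
from lxml import etree

# invented markup around the id from the question
doc = etree.fromstring('<form><input id="NameInputId14514271346457986"/>'
                       '<input id="OtherInput"/></form>')

# matches any element whose id begins with the fixed prefix
hits = doc.xpath('//*[starts-with(@id, "NameInputId")]')
print(len(hits))  # 1
```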

Parsing a wiki-styled web page, XPath error

I am new to XPath, and I totally fail to parse a simple wiki-styled web page with lxml.
I have a following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))
It works fine, but I need to exclude children whose class is "reference", and I get an lxml.etree.XPathEvalError with the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))
What is the right XPath expression? Thanks in advance :)
The error probably occurred because of .text() instead of /text().
If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis:
//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
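A self-contained lxml check of that expression, using an invented wiki-style snippet:

```python
from lxml import html

# invented fragment: a fact with a footnote marker to be excluded
doc = html.fromstring(
    '<div id="mw-content-text"><div><p>A fact'
    '<sup class="reference">[1]</sup> and more.</p></div></div>')

# text of the p itself is kept; text inside class="reference" is dropped
text = "".join(doc.xpath(
    '//*[@id="mw-content-text"]/div[1]/p'
    '/descendant-or-self::*[not(@class="reference")]/text()'))
print(text)  # A fact and more.
```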

Python CSSSelect long selectors not working?

I'm having some problems with cssselect.
Selectors like these don't work: #foo div.bar div div.baz p
But selecting #foo, then div.bar on the result of the first query and so on works.
I suppose I could split the selector at each space and implement it as a recursive function, but that would limit the available css selectors by removing (or forcing me to reimplement) a+b, a>b, a,b and so on.
I really want to use css selectors when parsing html (instead of xpath for example), since I (we) know them well already and because they can be reused in javascript/css.
Is there something I'm missing? Any better library?

Selenium / lxml : Get xpath

Is there a get_xpath method, or a way to accomplish something similar, in selenium or lxml.html? I have a feeling I've seen one somewhere but can't find anything like it in the docs.
Pseudocode to illustrate:
browser.find_element_by_name('search[1]').get_xpath()
>>> '//*[#id="langsAndSearch"]/div[1]/form/input[1]'
This trick works in lxml:
In [1]: el
Out[1]: <Element span at 0x109187f50>
In [2]: el.getroottree().getpath(el)
Out[2]: '/html/body/div/table[2]/tbody/tr[1]/td[3]/table[2]/tbody/tr/td[1]/p[4]/span'
See documentation of getpath.
As there is no one-to-one mapping between an element and an XPath expression, a fully general solution is not possible. But if you know something about your xml/html, it might be easy to write your own: start with your element, walk up the tree using the parent, and generate your expression as you go.
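Such a walk-up could look like this sketch with lxml (build_xpath is a made-up helper for illustration, not a library function):

```python
from lxml import etree

def build_xpath(el):
    """Build an absolute XPath by walking up from el to the root."""
    parts = []
    while el is not None:
        parent = el.getparent()
        if parent is None:
            # root element needs no positional index
            parts.append('/' + el.tag)
            break
        # 1-based position among siblings with the same tag
        same = [sib for sib in parent if sib.tag == el.tag]
        parts.append('/%s[%d]' % (el.tag, same.index(el) + 1))
        el = parent
    return ''.join(reversed(parts))

root = etree.fromstring('<html><body><div/><div><span/></div></body></html>')
span = root.find('.//span')
print(build_xpath(span))  # /html/body[1]/div[2]/span[1]
```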
Whatever search function you use, you can reformat your search using xpath to return your element. For instance,
driver.find_element_by_id('foo')
driver.find_element_by_xpath('//*[@id="foo"]')
will return exactly the same elements.
That being said, I would argue that extending selenium with this method would be possible but nearly pointless: you're already providing the module with all the information it needs to find the element, so why use XPath (which will almost certainly be harder to read) to do this at all?
In your example, browser.find_element_by_name('search[1]').get_xpath() would simply return '//*[@name="search[1]"]', since the assumption is that your original element search returned what you were looking for.
