Python CSSSelect long selectors not working? - python

I'm having some problems with cssselect.
Selectors like these don't work: #foo div.bar div div.baz p
But selecting #foo, then div.bar on the result of the first query and so on works.
I suppose I could split the selector at each space and implement it as a recursive function, but that would limit the available css selectors by removing (or forcing me to reimplement) a+b, a>b, a,b and so on.
I really want to use css selectors when parsing html (instead of xpath for example), since I (we) know them well already and because they can be reused in javascript/css.
Is there something I'm missing? Any better library?

Related

How to use Regex in CSS Selector scrapy

I need to get a ul tag by its class name, but the class name comes in many variants in which only two letters change. For example, it could be product-gallerytw__thumbs or product-galleryfp__thumbs. I need a CSS selector that uses a regex so that either of these (or any other combination) is found.
I can't use Xpath as the location changes
img_ul = response.css('.product-gallerytw__thumbs')
print(img_ul)
This is what I am trying to do but have not found a way to add regex inside the .css()
You actually can use xpath:
img_ul = response.xpath("//*[contains(@class,'product-gallery')]")
or if you really need to specify everything but the two characters:
img_ul = response.xpath("//*[contains(@class,'product-gallery')][contains(@class,'__thumbs')]")
There is nothing a CSS selector can do that XPath can't. In fact, Scrapy translates CSS selectors into XPath expressions under the hood.
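If a real regular expression is needed rather than contains(), one option is to select broadly and then filter in Python. The following is a minimal sketch using only the standard library (xml.etree.ElementTree and re) instead of Scrapy, so the sample markup and variable names are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample markup; only the two letters after "gallery" vary.
doc = ET.fromstring(
    "<div>"
    "<ul class='product-gallerytw__thumbs'/>"
    "<ul class='product-galleryfp__thumbs extra'/>"
    "<ul class='product-list'/>"
    "</div>"
)

# Exactly two lowercase letters between the fixed prefix and suffix.
pattern = re.compile(r"product-gallery[a-z]{2}__thumbs")

# Keep every <ul> that has at least one class token matching the pattern.
matches = [
    ul
    for ul in doc.iter("ul")
    if any(pattern.fullmatch(token) for token in ul.get("class", "").split())
]
matched_classes = [ul.get("class") for ul in matches]
```

Splitting the class attribute on whitespace keeps the match working when the element carries additional classes.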

scrapy get parent element

I'm new to scrapy and need to do one thing. I used to use lxml and did
elements = careers.xpath('//text()[contains(., "engineer")')
After that I was able to do
element = elements[0].getparent()
Unfortunately, I can't do the same with scrapy.
I try doing
response.xpath('//text()[contains(., "engineer")')
as well as .getparent() from any of these elements, but it says that Selectors have no attribute getparent. Is it possible to do the same with scrapy?
To access the parent element, you can use the .. notation at the end of your XPath expression. Consider reading this other StackOverflow answer for more details.
Apart from that, you may want to add a closing ] at the end of your XPath to close the [ before contains.
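To illustrate the .. step, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than Scrapy, so it runs without extra packages. The sample markup is made up. ElementTree's XPath subset has no contains() or text(), so the sketch matches on exact text (Python 3.7+); in Scrapy or lxml the same idea would be to append /.. to the corrected expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup for illustration.
doc = ET.fromstring(
    "<div>"
    "<ul id='jobs'><li>Software engineer</li><li>Designer</li></ul>"
    "</div>"
)

# Select the matching <li>, then step up to its parent with "..".
parents = doc.findall(".//li[.='Software engineer']/..")
parent_ids = [parent.get("id") for parent in parents]
```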

Does BeautifulSoup .select() method support use of regex?

Suppose I want to parse a html using BeautifulSoup and I wanted to use css selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" descendant tags under that tag, I could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose 'href' attribute has a value that ends in "xyz". I would want to use regex for that, and I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I can not seem to find anything that says BeautifulSoup's .select() method will support regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use such syntax to match attributes ending with text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MSDN.
You can always use the results of a CSS selector to continue the search:
for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example.com/\d+.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.

Extract text between tags with XPath including markup

I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags.
For this I use XPath:
/span[@class="st"]
This however will extract everything including the <span>.
and
/span[@class="st"]/text()
will return a list of two text elements. One containing "In Tim". The other ":Politieman". The <em>..</em> is not included and is handled like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT
Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text, including the <em>, with XPath only.
The /span[@class="st"]/node() solution creates a list containing the individual nodes, from which it is trivial in Python to create a string.
To get any child node you can use:
/span[@class="st"]/node()
This will return:
Two child text nodes
The full <em> node (element and contents).
If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:
/span[@class="st"]//text()
or
/span[@class="st"]/descendant::text()
This will return three text nodes, the text inside <em>, but not the <em> elements.
Sounds like you want the equivalent of the JavaScript DOM innerHTML property, but for XML. I don't think there's a way to do that in pure XPath.
XPath doesn't really operate on markup strings like "<em>" and "</em>" at all -- it works with a tree of Node objects (there might possibly be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the 4 characters "<em>" anywhere (except maybe kept around for printing error messages or something), and of course the DOM could have been built from scratch rather than from XML or other input in the first place. Likewise, XPath doesn't really figure on handing back marked-up strings, but lists of nodes.
In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.
-s
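Outside XPath, the host language can stitch the child nodes back into a marked-up string. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree (not the lxml node() approach mentioned in the question's edit); inner_markup is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def inner_markup(element):
    """Serialize everything between an element's start and end tags."""
    # Text that appears before the first child element, if any.
    parts = [element.text or ""]
    for child in element:
        # tostring() emits the child's markup followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

span = ET.fromstring('<span class="st">In Tim <em>Power</em>: Politieman...</span>')
result = inner_markup(span)
print(result)
```

For the sample span this prints In Tim <em>Power</em>: Politieman... without the enclosing <span> tags.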

Selenium / lxml : Get xpath

Is there a get_xpath method or a way to accomplish something similar in selenium or lxml.html. I have a feeling that I have seen somewhere but can't find anything like that in the docs.
Pseudocode to illustrate:
browser.find_element_by_name('search[1]').get_xpath()
>>> '//*[@id="langsAndSearch"]/div[1]/form/input[1]'
This trick works in lxml:
In [1]: el
Out[1]: <Element span at 0x109187f50>
In [2]: el.getroottree().getpath(el)
Out[2]: '/html/body/div/table[2]/tbody/tr[1]/td[3]/table[2]/tbody/tr/td[1]/p[4]/span'
See documentation of getpath.
As there is no unique mapping between an element and an xpath expression, a general solution is not possible. But if you know something about your xml/html, it might be easy to write it your own. Just start with your element, walk up the tree using the parent and generate your expression.
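The walk-up-the-tree idea can be sketched as follows. This is only one possible way to do it, written against the standard-library xml.etree.ElementTree (which has no getparent(), hence the parent map); build_xpath is a hypothetical helper, and the generated path is purely positional.

```python
import xml.etree.ElementTree as ET

def build_xpath(root, target):
    """Return a positional path from root to target, such as './body[1]/div[2]'."""
    # ElementTree elements do not know their parent, so map child -> parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]  # raises KeyError if target is not under root
        # XPath positions are 1-based and count only siblings with the same tag.
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        index = next(i for i, sib in enumerate(same_tag, 1) if sib is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "/".join(["."] + steps[::-1])

root = ET.fromstring(
    "<html><body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>"
)
target = root.findall(".//p")[2]  # the <p> containing "c"
path = build_xpath(root, target)
print(path)
```

The resulting path is relative to root, so root.find(path) returns the original element again.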
Whatever search function you use, you can reformat your search using xpath to return your element. For instance,
driver.find_element_by_id('foo')
driver.find_element_by_xpath('//*[@id="foo"]')
will return exactly the same elements.
That being said, I would argue that to extend selenium with this method would be possible, but nearly pointless-- you're already providing the module with all the information it needs to find the element, why use xpath (which will almost certainly be harder to read?) to do this at all?
In your example, browser.find_element_by_name('search[1]').get_xpath() would simply return '//*[@name="search[1]"]', since the assumption is that your original element search returned what you were looking for.
