How to use regex in a CSS selector in Scrapy (Python)

I need to get a ul tag by its class name, but the class name has many different combinations: it is always just two letters that change. product-gallerytw__thumbs could be one and product-galleryfp__thumbs could be another. I need to know how to use a CSS selector with regex so that either of these (or any other combination) can be found.
I can't use XPath because the element's location changes.
img_ul = response.css('.product-gallerytw__thumbs')
print(img_ul)
This is what I am trying to do, but I have not found a way to use regex inside .css().

You actually can use XPath:
img_ul = response.xpath("//*[contains(@class, 'product-gallery')]")
or, if you really need to pin down everything except the two changing characters:
img_ul = response.xpath("//*[contains(@class, 'product-gallery')][contains(@class, '__thumbs')]")
There is nothing a CSS selector can do that XPath can't; in fact, CSS selectors are simply an abstraction over XPath (Scrapy translates them into XPath expressions internally).
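If you would rather stay with CSS, attribute-substring selectors cover this case without regex; a minimal sketch, assuming the class names from the question and the same response object:
# matches any ul whose class contains both "product-gallery" and "__thumbs",
# whatever the two letters in between are
img_ul = response.css('ul[class*="product-gallery"][class*="__thumbs"]')
print(img_ul)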

Related

How can I find classes with underscores in them using Selenium?

As the title implies, I'm scraping a webpage that has a class name with underscores in it, and I'm unable to locate the element. The element is as follows:
<span class="s-item__time-left">30m</span>
I've tried finding it by class name:
time = driver.find_elements_by_class_name("s-item__time-left")
This just returns nothing, so I moved on to CSS selectors:
time = driver.find_element_by_css_selector("s-item__time-left")
I tried a variety of the above, with one "." replacing the two underscores and with two dots replacing the underscores. Both of these returned nothing as well.
There's no unique ID I can use (they vary constantly), and all of the parent classes also use multiple underscores, so I can't build a path to it by hopping through child elements.
I appreciate any suggestions!
Your CSS selector was wrong; it should be .classname, with a leading dot.
Try now.
time = driver.find_element_by_css_selector(".s-item__time-left")
print(time.text)
Try using XPath: driver.find_element_by_xpath('//span[@class="s-item__time-left"]')
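Note that in Selenium 4 the find_element_by_* helpers are deprecated (and removed in recent releases); a sketch of the equivalent calls, assuming the same driver and page as above:
from selenium.webdriver.common.by import By

# leading dot marks a class selector; the double underscores need no escaping
time_el = driver.find_element(By.CSS_SELECTOR, ".s-item__time-left")
print(time_el.text)

# or the XPath form
time_el = driver.find_element(By.XPATH, '//span[@class="s-item__time-left"]')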

White space and selectors

I tried to use a selector in the Scrapy shell to extract information from a web page and it didn't work properly. I believe this happened because there is white space in the class name. Any idea what's going wrong?
I tried different syntaxes like:
response.xpath('//p[@class="text-nnowrap hidden-xs"]').getall()
response.xpath('//p[@class="text-nnowrap hidden-xs"]/text()').get()
# what I type into my scrapy shell
response.css('div.offer-item-details').xpath('//p[@class="text-nowrap hidden-xs"]/text()').get()
# html code that I need to extract:
<p class="text-nowrap hidden-xs">Apartamento para arrendar: Olivais, Lisboa</p>
expected result: Apartamento para arrendar: Olivais, Lisboa
actual result: []
The whitespace in the class attribute means that there are multiple classes: the "text-nowrap" class and the "hidden-xs" class. (Note that your first two attempts also spell it "text-nnowrap", with a double n, which prevents a match.) To select by XPath on multiple classes, you can use the following format:
"//element[contains(@class, 'class1') and contains(@class, 'class2')]"
(grabbed this from How to get html elements with multiple css classes)
So in your example, I believe this would work:
response.xpath("//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]").getall()
For cases like this I prefer CSS selectors because of their minimal syntax:
response.css("p.text-nowrap.hidden-xs::text")
Also, the Google Chrome developer tools display CSS selectors when you inspect the HTML, which makes scraper development much easier.
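For completeness, a sketch of how that looks in the Scrapy shell, chained off the container div from the question (the expected output is the text shown above):
# CSS form
response.css("div.offer-item-details p.text-nowrap.hidden-xs::text").get()
# 'Apartamento para arrendar: Olivais, Lisboa'
# equivalent XPath form; the leading ".//" keeps the search relative to the div
response.css("div.offer-item-details").xpath(".//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]/text()").get()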

Does BeautifulSoup .select() method support use of regex?

Suppose I want to parse some HTML using BeautifulSoup, and I want to use CSS selectors to find specific tags. I would "soupify" it by doing
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose 'href' attribute ends in "xyz". I would want to use regex for that; I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I cannot seem to find anything that says BeautifulSoup's .select() method supports regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use CSS syntax to match attributes that end with a given string:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MDN.
You can always use the results of a CSS selector to continue the search:
import re
for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example.com/\d+.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.
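For a self-contained illustration of both approaches, here is a sketch against a made-up HTML snippet (the id and hrefs are only for demonstration):
import re
from bs4 import BeautifulSoup

html = '<div id="abc"><a href="/files/report-xyz">ends with xyz</a><a href="/files/other">no match</a></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS "ends with" attribute selector, no regex needed
print(soup.select('#abc a[href$="xyz"]'))

# for patterns CSS cannot express, fall back to find_all with a compiled regex
print(soup.find_all("a", href=re.compile(r"xyz$")))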

Python CSSSelect long selectors not working?

I'm having some problems with cssselect.
Selectors like these don't work: #foo div.bar div div.baz p
But selecting #foo, then div.bar on the result of the first query and so on works.
I suppose I could split the selector at each space and implement it as a recursive function, but that would limit the available CSS selectors by removing (or forcing me to reimplement) a+b, a>b, a, b and so on.
I really want to use CSS selectors when parsing HTML (instead of XPath, for example), since I (we) know them well already and because they can be reused in JavaScript/CSS.
Is there something I'm missing? Any better library?
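For reference, a minimal sketch of how a descendant selector like this can be checked in isolation with lxml and cssselect (the HTML below is made up purely to mirror the selector); if this works, the problem is more likely in the page's actual structure than in the library:
from lxml import html
from lxml.cssselect import CSSSelector

doc = html.fromstring('<div id="foo"><div class="bar"><div><div class="baz"><p>hello</p></div></div></div></div>')

sel = CSSSelector('#foo div.bar div div.baz p')
print(sel.path)                     # the XPath expression the selector is translated to
print([e.text for e in sel(doc)])   # ['hello']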

How to use a selector to select an HTML element's attribute?

Currently, I am using Scrapy.
The selector works fine when matching
<xxx> something to match </xxx>
But I want to match
<xxx name="something I want match"> xxx </xxx>
What I want to match is inside the element's opening tag.
I know regex is one solution. Is there an easier way of doing so?
I found 2 ways of doing so:
1. sel.xpath('//baseTag/@attrName')
2. sel.css('baseTag::attr(attrName)')
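Both forms can be tried against the snippet from the question using Scrapy's Selector class; a small sketch (the xxx tag and name attribute are taken from the example above):
from scrapy.selector import Selector

sel = Selector(text='<xxx name="something I want match"> xxx </xxx>')

print(sel.xpath('//xxx/@name').get())     # something I want match
print(sel.css('xxx::attr(name)').get())   # something I want match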