I tried to use a selector in the Scrapy shell to extract information from a web page, but it didn't work properly. I believe this happened because there is whitespace in the class name. Any idea what's going wrong?
I tried different syntaxes, such as:
response.xpath('//p[@class="text-nnowrap hidden-xs"]').getall()
response.xpath('//p[@class="text-nnowrap hidden-xs"]/text()').get()
# what I type into my scrapy shell
response.css('div.offer-item-details').xpath('//p[@class="text-nowrap hidden-xs"]/text()').get()
# html code that I need to extract:
<p class="text-nowrap hidden-xs">Apartamento para arrendar: Olivais, Lisboa</p>
expected result: Apartamento para arrendar: Olivais, Lisboa
actual result: []
The whitespace in the class attribute means that the element has multiple classes: the "text-nowrap" class and the "hidden-xs" class. To select by XPath on multiple classes, you can use the following format:
"//element[contains(@class, 'class1') and contains(@class, 'class2')]"
(grabbed this from How to get html elements with multiple css classes)
So in your example, I believe this would work:
response.xpath("//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]").getall()
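One caveat with contains(@class, ...) is that it is a substring test, so 'text-nowrap' would also match a class such as 'text-nowrap-lg'. A common XPath 1.0 idiom avoids this by padding the normalized class attribute with spaces and testing for ' name '. The sketch below only builds the expression string; xpath_with_classes is a hypothetical helper name, not part of Scrapy, and the commented usage line is an assumption about how you would call it in the shell.

```python
def xpath_with_classes(tag, *class_names):
    """Build an XPath matching <tag> elements that carry every given class.

    Hypothetical helper (not a Scrapy API). Each class is matched as a whole
    whitespace-separated token rather than as a substring.
    """
    if not class_names:
        return f"//{tag}"
    # Pad the normalized attribute with spaces so every token is space-delimited.
    padded = "concat(' ', normalize-space(@class), ' ')"
    tests = " and ".join(
        f"contains({padded}, ' {name} ')" for name in class_names
    )
    return f"//{tag}[{tests}]"


expr = xpath_with_classes("p", "text-nowrap", "hidden-xs")
print(expr)
# Assumed usage in the Scrapy shell:
#   response.xpath(expr + "/text()").get()
```

The resulting expression does not depend on the order of the classes in the attribute.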
For cases like this I prefer CSS selectors because of their minimal syntax:
response.css("p.text-nowrap.hidden-xs::text")
Also, Google Chrome's developer tools display CSS selectors while you inspect the HTML, which makes scraper development much easier.
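If you copy a class attribute straight from the page source, turning it into a chained class selector is mechanical: split on whitespace and prefix each name with a dot. A minimal sketch, where css_for_classes is a hypothetical helper and not a Scrapy function; note that ::text is a Scrapy/parsel extension rather than standard CSS.

```python
def css_for_classes(tag, class_attr, suffix=""):
    """Turn a raw class attribute value into a chained CSS class selector.

    Hypothetical helper for illustration only.
    """
    # split() with no argument copes with leading, trailing or repeated spaces.
    names = class_attr.split()
    return tag + "".join(f".{name}" for name in names) + suffix


selector = css_for_classes("p", "text-nowrap hidden-xs", "::text")
print(selector)  # p.text-nowrap.hidden-xs::text
```

The same string without the ::text suffix should also be usable wherever a plain CSS selector is accepted.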
Related
I'm trying to scrape text from the class .s-recipe-header__info-item, but there are three elements with the same class name and I would like to extract only the first one to get the text "Do hodiny". So far I have tried this code:
recipe_item["preparation_time"] = response.css(".s-recipe-header__info > .s-recipe-header__info-items > .s-recipe-header__info-item::text").extract_first()
I have also tried .get() instead of .extract_first(), but neither seems to work...
I am new to web scraping and have only basic HTML and CSS knowledge. Thank you in advance for your help.
I need to get a ul tag by its class name, but the class name comes in many different combinations in which only two letters change: product-gallerytw__thumbs could be one and product-galleryfp__thumbs could be another. I need to know how to write a CSS selector that uses a regex so that either of these (or any other combination) is found.
I can't use Xpath as the location changes
img_ul = response.css('.product-gallerytw__thumbs')
print(img_ul)
This is what I am trying to do but have not found a way to add regex inside the .css()
You actually can use XPath:
img_ul = response.xpath("//*[contains(@class,'product-gallery')]")
or, if you really need to specify everything but the two characters:
img_ul = response.xpath("//*[contains(@class,'product-gallery')][contains(@class,'__thumbs')]")
There is nothing a CSS selector can do that XPath can't. In fact, Scrapy translates CSS selectors into XPath expressions under the hood.
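Both contains() tests are substring checks, so they would also accept a class such as product-gallery__thumbs-wide. If exactly two letters vary, a stricter check can be done per class token with a regular expression. The sketch below uses only the standard library; the sample class strings are made up, and the two-lowercase-letter rule is an assumption based on the two examples in the question. In Scrapy you could first narrow the candidates with the standard CSS substring selector ul[class*="product-gallery"][class*="__thumbs"] and then apply a check like this to each element's class attribute.

```python
import re

# Assumption: exactly two lowercase letters sit between the fixed prefix
# and the fixed suffix, as in the question's two examples.
THUMBS_CLASS = re.compile(r"product-gallery[a-z]{2}__thumbs")


def has_thumbs_class(class_attr):
    """Return True if any whitespace-separated class token matches the pattern."""
    return any(THUMBS_CLASS.fullmatch(token) for token in class_attr.split())


# Made-up class attribute values for illustration.
samples = [
    "product-gallerytw__thumbs",
    "slider product-galleryfp__thumbs",
    "product-gallery__caption",
]
for value in samples:
    print(value, "->", has_thumbs_class(value))
```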
I am trying to print by ID using Selenium. As far as I can tell, "a" is the tag and "title" is the attribute. See HTML below.
When I run the following code:
print(driver.find_elements(By.TAG_NAME, "a")[0].get_attribute('title'))
I get the output:
Zero Tolerance
So I'm getting the first attribute correctly. When I increment the code above:
print(driver.find_elements(By.TAG_NAME, "a")[1].get_attribute('title'))
My expected output is:
Aaliyah Love
However, I'm just getting blank. No errors. What am I doing incorrectly? Pls don't suggest using xpath or css, I'm trying to learn Selenium tags.
HTML:
<a class=" Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4 Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4" href="/en/channel/ztfilms" title="Zero Tolerance" rel="">Zero Tolerance</a>
...
<a class=" Link ActorThumb-ActorImage-Link styles_3dXcTxVCON Link ActorThumb-ActorImage-Link styles_3dXcTxVCON" href="/[RETRACTED]/Aaliyah-Love/63565" title="Aaliyah Love"
Selenium locators are a toolbox and you're saying you only want to use a screwdriver (By.TAG_NAME) for all jobs. We aren't saying that you shouldn't use By.TAG_NAME, we're saying that you should use the right tool for the right job and sometimes (most times) By.TAG_NAME is not the right tool for the job. CSS selectors are WAY more powerful locators because they can search for not only tags but also classes, properties, etc.
It's hard to say for sure what's going on without access to the site/page. It could be that the entire page isn't loaded and you need to add a wait for the page to finish loading (maybe count links expected on the page?). It could be that your locator isn't specific enough and is catching other A tags that don't have a title attribute.
I would start by doing some debugging.
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute('title'))
What does this print?
If it prints some blank lines sprinkled throughout the actual titles, your locator is probably not specific enough. Try a CSS selector
links = driver.find_elements(By.CSS_SELECTOR, "a[title]")
for link in links:
    print(link.get_attribute('title'))
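The "locator not specific enough" case can be reproduced offline without a browser. The sketch below uses the standard library's html.parser instead of Selenium, on made-up markup in which the middle link has no title attribute; TitleCollector is a hypothetical name. It shows why indexing into the list of all A tags can land on a blank entry, and why filtering on the presence of title gives the expected order.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Record the title of every <a> tag, or '' when the attribute is missing."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.titles.append(dict(attrs).get("title") or "")


# Made-up markup: the second link has no title attribute.
sample = (
    '<a href="/en/channel/ztfilms" title="Zero Tolerance">Zero Tolerance</a>'
    '<a href="/login">Log in</a>'
    '<a href="/actor" title="Aaliyah Love">Aaliyah Love</a>'
)

collector = TitleCollector()
collector.feed(sample)

all_titles = collector.titles                 # every <a>, like By.TAG_NAME
with_title = [t for t in all_titles if t]     # only links that have a title

print(all_titles)
print(with_title)
```

Here index 1 of all_titles is the blank entry, while index 1 of with_title is the second titled link.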
If instead it returns some titles and then nothing but blank lines, the page is probably not fully loaded. Try something like
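from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

count = 20  # the number of expected links on the page
link_locator = (By.TAG_NAME, "a")
# find_elements takes (by, value) as two arguments, so unpack the locator tuple
WebDriverWait(driver, 10).until(lambda wd: len(wd.find_elements(*link_locator)) == count)
links = driver.find_elements(*link_locator)
for link in links:
    print(link.get_attribute('title'))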
I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not. Can someone explain why my first XPath selector failed? I have also provided a link to the full HTML of the craigslist page below, in case that is helpful or necessary. I am new to Scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class was result-title, the syntax is wrong, you should use:
'//a[@class="result-title"]/text()'
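The exact-match behaviour is easy to check offline. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath (no contains() or text()), but its [@attr='value'] predicate is the same whole-string comparison. The markup is a made-up, well-formed stand-in for one craigslist result, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up markup carrying the same class value as the craigslist link.
doc = ET.fromstring(
    '<ul><li><p>'
    '<a class="result-title hdrlnk" href="#">Full Stack Developer</a>'
    '</p></li></ul>'
)

# The predicate compares the whole attribute string, so one class is not enough.
exact_partial = doc.findall(".//a[@class='result-title']")
exact_full = doc.findall(".//a[@class='result-title hdrlnk']")

# Token-based check in plain Python, independent of class order.
by_token = [
    a for a in doc.iter("a")
    if "result-title" in a.get("class", "").split()
]

print(len(exact_partial))
print([a.text for a in exact_full])
print([a.text for a in by_token])
```

Only the full attribute string and the token-based check find the link; the single-class exact match finds nothing.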
Simply: '//a[@class="result-title hdrlnk"]/text()'
Two fixes were needed:
/text() goes outside of the [] predicate
"result-title hdrlnk", not only "result-title", in the attribute test, because XPath compares the attribute as a plain string and knows nothing about CSS classes, so the exact attribute content is needed to match.
When I go to a certain webpage I am trying to find a certain element and piece of text:
<span class="Bold Orange Large">0</span>
This didn't work: (It gave an error of compound class names or something...)
elem = browser.find_elements_by_class_name("Bold Orange Large")
So I tried this: (but I'm not sure it worked because I don't really understand the right way to do css selectors in selenium...)
elem = browser.find_elements_by_css_selector("span[class='Bold Orange Large']")
Once I find the span element, I want to find the number that is inside...
num = elem.(what to put here??)
Any help with css selectors, class names, and finding element text would be great!!
Thanks.
EDIT:
My other problem is that there are multiple of those exact span elements but with different numbers inside..how can I deal with that?
You're correct in your usage of CSS selectors! Your first attempt failed because the value contains spaces: "Bold Orange Large" is three separate class names, and Selenium's class-name locator accepts only a single one. I think that is bad development practice to begin with, so it's not your problem. Selenium itself does not include an HTML editor, because that has already been done before.
Try looking here: How to find/replace text in html while preserving html tags/structure.
Also this one is relevant and popular as well: RegEx match open tags except XHTML self-contained tags
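On the remaining two points of the question: in Selenium the visible text of an element is available as its .text attribute, and find_elements returns a list, so several matching spans can be handled with an ordinary loop. The selector span.Bold.Orange.Large also matches regardless of the order of the classes. The sketch below reproduces that logic without a browser, using the standard library's html.parser on made-up markup; SpanNumbers is a hypothetical name, and the code assumes the spans are not nested.

```python
from html.parser import HTMLParser


class SpanNumbers(HTMLParser):
    """Collect the integer text of every <span> that has all wanted classes.

    Sketch only: assumes matching spans are not nested inside each other.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.inside = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = set((dict(attrs).get("class") or "").split())
            # True only when every wanted class is present, in any order.
            self.inside = self.wanted <= classes

    def handle_data(self, data):
        if self.inside and data.strip().isdigit():
            self.numbers.append(int(data))

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


# Made-up markup: the second span lacks the "Orange" class.
sample = (
    '<span class="Bold Orange Large">0</span>'
    '<span class="Bold Large">7</span>'
    '<span class="Large Bold Orange">12</span>'
)

parser = SpanNumbers(["Bold", "Orange", "Large"])
parser.feed(sample)
numbers = parser.numbers
print(numbers)
```

With Selenium the equivalent loop would read int(el.text) for each element returned by the CSS selector.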