Scrapy CSS selector: get text from the first occurrence of a class - python

I'm trying to scrape text from the class .s-recipe-header__info-item, but as you can see in the picture, there are three elements with the same class, and I would like to extract only the first one to get the text "Do hodiny". See the image of the code here. So far I have tried this code:
recipe_item["preparation_time"] = response.css(".s-recipe-header__info > .s-recipe-header__info-items > .s-recipe-header__info-item::text").extract_first()
I have also tried to use .get() instead of .extract_first(), but neither seems to work.
I am new to web scraping and have only elementary HTML and CSS knowledge. Thank you in advance for your help.

Related

I am confused why this XPath selector does not work

I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not work. Can someone explain to me why my first XPath selector failed? I have also provided a link to the full HTML for the craigslist page below in case that is helpful or necessary. I am new to scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class were just result-title, the syntax is wrong, because /text() has to go outside the brackets. You should use:
'//a[@class="result-title"]/text()'
Simply '//a[@class="result-title hdrlnk"]/text()'
Two fixes were needed:
/text() goes outside of []
"result-title hdrlnk", not only "result-title", in the attribute test: XPath compares the whole attribute string rather than matching individual class names the way CSS does, so the exact attribute content is needed to match.
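The difference between the exact match and contains() can be checked offline. Below is a minimal sketch using lxml instead of a live Scrapy response; the markup is reconstructed from the class value quoted in the answers, and the href and the shortened link text are placeholders, not the real craigslist HTML. The last expression is a common idiom for matching a single class token, which is not from the answers above.

```python
from lxml import html

# Reconstructed snippet: only the class value comes from the thread,
# the href and the shortened text are placeholders.
doc = html.fromstring(
    '<ul><li><p>'
    '<a href="#" class="result-title hdrlnk">Full Stack .NET C# Developer</a>'
    '</p></li></ul>'
)

# Exact comparison against only part of the attribute value: no match.
exact_partial = doc.xpath('//a[@class="result-title"]/text()')

# Exact comparison against the whole attribute value: match.
exact_full = doc.xpath('//a[@class="result-title hdrlnk"]/text()')

# Substring test: match, but it would also match e.g. "result-title-small".
substring = doc.xpath('//a[contains(@class, "result-title")]/text()')

# Whole-token test: pad the normalized class list with spaces and look
# for the padded class name.
token = doc.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " result-title ")]/text()'
)

print(exact_partial)
print(exact_full, substring, token)
```

In Scrapy itself the same expressions can be passed to response.xpath(...).getall().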

How to get non element text adjacent to a tag using Scrapy?

I am trying to scrape a page using Scrapy Framework.
<div class="info"><span class="label">Establishment year</span> 2014</div>
The tag I want to deal with looks like the one above. I want to get the value 2014. I can't use the info or label class, as they are common throughout the page.
So I tried the XPath expressions below, but I am getting null:
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling").get()
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling::text()").get()
Any clue what can be the issue?
Since you are trying to extract the text inside a tag, you should put that tag at the end of the path. I don't know what website you are trying to scrape, but here is an example of me scraping the text inside the 'a' tag on this website: http://books.toscrape.com/ Here is the code I used for it:
response.xpath("(//h3)[1]/a/text()").extract_first()
In your second line of code, you did not use the text-extraction function correctly. The one you are using is for CSS selectors. For XPath it would be /text(), not ::text(). For your code, I think you should try one of these options. Let me know if it helps.
response.xpath("//span[contains(text(),'Establishment year')]/div/text()").get()
or
response.xpath("//span[contains(text(),'Establishment year')]/span/text()").get()
Extract direct text children (/text()) from the parent element:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="info"><span class="label">Establishment year</span> 2014</div>')
>>> selector.xpath('//*[@class="info"]/text()').get()
' 2014'
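For comparison, here is a minimal sketch that runs the asker's two expressions against the quoted snippet, using lxml rather than a Scrapy response. It suggests that, on this snippet, the step written without a node test selects nothing, while following-sibling::text() does reach the value; if the real page still returns nothing, its markup probably differs from the snippet (an assumption, since the page itself is not shown).

```python
from lxml import html

doc = html.fromstring(
    '<div class="info"><span class="label">Establishment year</span> 2014</div>'
)

label = "//span[contains(text(), 'Establishment year')]"

# Without "::node-test", "following-sibling" is read as a child element
# name, so this looks for <following-sibling> elements and finds none.
no_node_test = doc.xpath(label + "/following-sibling")

# With the axis and a text() node test, the text after the span is found.
via_sibling = doc.xpath(label + "/following-sibling::text()")

# Going through the parent gives the same text node.
via_parent = doc.xpath(label + "/parent::div/text()")

# Strip the leading space to get the bare value.
year = via_sibling[0].strip()
print(no_node_test, via_sibling, via_parent, year)
```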

Printing Text from 2nd Div in Class in Python + Selenium

newbie here trying to learn Selenium. I have the following HTML Code:
<div class="lower-text">
<div data-text="winScreen.yourCodeIs">Your Code Is:</div>
<div>OUTPUTCODE</div>
</div>
I am trying to only print the text OUTPUTCODE, however the following code only prints "Your Code Is:".
text = browser.find_elements_by_class_name('lower-text')
for test in text:
    print(test.text)
Any help would be appreciated. Thank you.
Try the XPath below.
//div[@class='lower-text']/div[last()]
Your code should be
print(driver.find_element_by_xpath("//div[@class='lower-text']/div[last()]").text)
Try the solutions below:
1. XPath:
//div[@class='gs_copied']
2. CSS selector
.lower-text > div:nth-child(2)
Your site is unstable and does not always generate a coupon code. Currently I am getting the error below (check screenshot), so I am not able to identify the elements I mentioned above.
You need to amend your logic based on the site's behavior: if the user is unlucky and does not get a coupon code, your script has to handle the other outcome on your site (e.g. Check out our Hot Deals Page).
Try the following approach:
text = driver.find_element_by_xpath("//div[text()='Your Code Is:']//following-sibling::div[text()]").get_attribute('innerHTML')
print(text)
I copied your HTML into a new text file and tried the following XPath, which works perfectly:
//div[@class='lower-text']/div[text()='Your Code Is:']/following-sibling::div
I am also attaching a screenshot link. Please have a look; hopefully it will solve your problem.
https://imgur.com/EujgZrI
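The XPath expressions suggested above can also be checked against the posted HTML without starting a browser. This is a minimal sketch with lxml, not Selenium; it only confirms that the expressions select the second div in the static snippet and says nothing about how the live page behaves.

```python
from lxml import html

# The HTML exactly as posted in the question.
doc = html.fromstring(
    '<div class="lower-text">'
    '<div data-text="winScreen.yourCodeIs">Your Code Is:</div>'
    '<div>OUTPUTCODE</div>'
    '</div>'
)

# Last child div of the container.
last_div = doc.xpath("//div[@class='lower-text']/div[last()]")

# The div that follows the one containing the label text.
after_label = doc.xpath(
    "//div[@class='lower-text']/div[text()='Your Code Is:']/following-sibling::div"
)

print(last_div[0].text_content())
print(after_label[0].text_content())
```

If both print the expected text here but Selenium still returns something else, the rendered page may differ from the posted snippet, for example when the code is filled in after load.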

Selecting with non-class tag in scrapy python

I am trying to scrape the title from a website, but the problem is that it has no class or id.
Usually I use this to get a title that has a class:
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
Now I am trying to extract the text shown in the screenshot. Can you please fix it? https://i.stack.imgur.com/k6aCN.png
You may locate a specific node by any attribute (not only class and id) or its relative position with some others.
A few examples for the text in your screenshot:
response.xpath('//div[@class="job-title-text"]/a/text()')
response.xpath('//a[contains(@onclick,"clickJObTitle")]/text()')
response.xpath('//a[contains(@href,"jobdetails")]/text()')
response.css('div.job-title-text a::text')
response.css('a[onclick*=clickJObTitle]::text')
response.css('a[href*=jobdetails]::text')
See also:
https://www.w3schools.com/xml/xpath_syntax.asp
https://www.w3schools.com/cssref/css_selectors.asp
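To try the attribute-based XPath expressions offline, here is a minimal sketch with lxml. The markup is hypothetical: the screenshot is not available, so the href, the onclick value and the link text are guesses shaped to fit the selectors above, not the real page.

```python
from lxml import html

# Hypothetical markup: attribute values and text are invented to fit
# the selectors in the answer.
doc = html.fromstring(
    '<div class="job-title-text">'
    '<a href="/jobdetails/123" onclick="clickJObTitle(123)">Data Engineer</a>'
    '</div>'
)

# Locate the link through its parent's class.
by_parent = doc.xpath('//div[@class="job-title-text"]/a/text()')

# Locate the link through a substring of its onclick attribute.
by_onclick = doc.xpath('//a[contains(@onclick, "clickJObTitle")]/text()')

# Locate the link through a substring of its href attribute.
by_href = doc.xpath('//a[contains(@href, "jobdetails")]/text()')

print(by_parent, by_onclick, by_href)
```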

Using Xpath to get the anchor text of a link in Python when the link has no class

(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the xpath wasn't specific enough) or a blank [ ].
A screenshot of what I'm trying to access is :
Tree is all the html from the page.
The code that returns a blank is:
cardInfo=tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo=tree.xpath('a[contains(@href, "domain_name")]/text()')
I tried going into Inspect in chrome and copying the xpath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well but does anyone have an idea of what I can write?
If you meant to find the text next to Set Name:
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] to find b tag with Set Name: text,
parent::td - parent td element of it,
following-sibling::td - following td element
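The same axis steps can be tried without fetching the page. This is a minimal sketch on a hypothetical table: the row layout and the second row are invented for illustration, and only the "Set Name:" label and the "Jungle" value come from the answer above. The last query illustrates one possible reason a path copied from Chrome returns nothing: browsers insert a tbody element into the DOM, while lxml keeps the HTML as written.

```python
from lxml import html

# Hypothetical table: the layout and the second row are invented.
doc = html.fromstring(
    '<table>'
    '<tr><td><b>Set Name:</b></td><td><a href="#">Jungle</a></td></tr>'
    '<tr><td><b>Rarity:</b></td><td>Rare</td></tr>'
    '</table>'
)

# Find the label, step up to its cell, move to the next cell, take the link text.
set_name = doc.xpath(
    ".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()"
)

# The source has no <tbody>, and lxml does not add one, so a path that
# goes through tbody matches nothing here.
rows_via_tbody = doc.xpath("//table/tbody/tr")
rows_direct = doc.xpath("//table/tr")

print(set_name, len(rows_via_tbody), len(rows_direct))
```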
