Scrapy doesn't grab span text within div class

I have a problem retrieving the text from a div on a website.
The structure of the page is shown below. I'm trying to retrieve only the text 'black' from <span class="product-details__toggler-selected" title="black">.
At the moment nothing is returned.
My XPath is this:
color = response.xpath("//div[@class='product-details__toggler-info-title']/p/span[@class='product-details__toggler-selected']/text()").extract()
Structure of the page:
<div class="product-details__toggler-info-title">
<span class="product-details__toggler-title">Culoare</span>
<span class="product-details__toggler-selected" title="black"><em class="s-color-bg" style="background-color: #000000">black</em><span class="s-color-name">black</span></span>
</div>

Try one of the XPath expressions below to get the required value:
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/span/text()
or
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/@title
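The expression in the question returns nothing because it steps through a p element, while in the posted markup the spans are direct children of the div. A minimal sketch of the corrected lookup follows. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption, made so the snippet runs without Scrapy installed). ElementTree supports only a subset of XPath, so the text and the attribute are read with .text and .get() rather than /text() and /@title.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it happens to be well-formed XML.
html = """<div class="product-details__toggler-info-title">
<span class="product-details__toggler-title">Culoare</span>
<span class="product-details__toggler-selected" title="black"><em class="s-color-bg" style="background-color: #000000">black</em><span class="s-color-name">black</span></span>
</div>"""

root = ET.fromstring(html)  # root is the outer <div>

# The question's path goes through a <p> that does not exist, so it finds nothing.
missing = root.find("./p/span[@class='product-details__toggler-selected']")

# Dropping the <p> step finds the span.
selected = root.find(".//span[@class='product-details__toggler-selected']")

# Stand-in for .../@title
color_from_title = selected.get("title")

# Stand-in for .../span/text()
color_from_text = selected.find("./span[@class='s-color-name']").text

print(missing, color_from_title, color_from_text)
```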

Related

Using Scrapy itemloaders and XPath/CSS selector to extract data

I'm using the Scrapy framework to crawl and scrape data from a URL, and I'm not sure whether it's better to extract the value attribute from a progress bar using CSS selectors or to extract the text from a nested div and span using XPath.
This is the URL
This is the HTML code
<div class="grid-x grid-margin-x">
<div class="cell small-1 medium-1 large-1">
<span class="vote-button-legend">12</span>
</div>
<div class="cell small-6 medium-6 large-6" style="display: inline-flex; align-items: center; justify-content: center;">
<progress class="alert" max="297" value="12" style="color: crimson; cursor: pointer;"></progress>
</div>
</div>
Using the inspector, I can verify with an XPath selector that I'm able to retrieve the text:
$x('//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()')
I can also use a CSS selector to find the progress bar before extracting its value attribute:
$('div progress.alert ')
From here I tried to use ItemLoaders to populate Items in Scrapy using the add_xpath and add_css methods:
l = ItemLoader(item=PropertiesItem(), response=response)
l.add_xpath('votes', '//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()')
l.add_css('votes', 'div.cell.small-1.medium-1.large-1 span.vote-button-legend ::attr[value]')
When I run my spider to check the resulting items, nothing appears in the console, and when I log the output it is an empty list:
self.log("voteCount: %s" % response.css('div.cell.small-1.medium-1.large-1 span.vote-button-legend ::attr[value]').extract())
self.log("value: %s" % response.xpath('//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()').extract())
Any help would be greatly appreciated.
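One detail worth checking in the snippets above: the CSS selector asks for a value attribute on span.vote-button-legend, but in the posted HTML that span carries only text, and the value attribute sits on the progress element. Scrapy's attribute pseudo-element is also written with parentheses, ::attr(value), not square brackets. The sketch below reads both candidates from the posted markup. It uses the standard library's ElementTree rather than Scrapy (an assumption, so that it runs standalone), with .text and .get() standing in for ::text and ::attr(value).

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it is well-formed, so ElementTree can parse it.
html = """<div class="grid-x grid-margin-x">
<div class="cell small-1 medium-1 large-1">
<span class="vote-button-legend">12</span>
</div>
<div class="cell small-6 medium-6 large-6" style="display: inline-flex; align-items: center; justify-content: center;">
<progress class="alert" max="297" value="12" style="color: crimson; cursor: pointer;"></progress>
</div>
</div>"""

root = ET.fromstring(html)
span = root.find(".//span[@class='vote-button-legend']")
progress = root.find(".//progress[@class='alert']")

votes_from_text = span.text                  # the span's text node
votes_from_span_attr = span.get("value")     # the span has no value attribute
votes_from_progress = progress.get("value")  # the attribute lives on <progress>

print(votes_from_text, votes_from_span_attr, votes_from_progress)
```

In Scrapy terms, the corresponding CSS selectors would be along the lines of 'span.vote-button-legend::text' and 'progress.alert::attr(value)'. If both still come back empty in the spider, it may be worth checking whether the markup is present in the raw response at all, since the browser inspector shows the page after JavaScript has run.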

Find tag <a> and tag <img> when using bs4

I have the following source code:
<div class='aaa'>
<div class='aaa-child'>
<a>
<img></img>
</a>
</div>
</div>
So the structure is an image inside a hyperlink.
I would like to check whether the tags "a" and "img" exist inside the above divs. Any ideas? I tried find_all, but I get too many results that don't match my expectations.

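Yes, use a descendant CSS selector with a class selector. Repeat the class prefix on each side of the comma, otherwise the bare img part matches images anywhere in the document:
soup.select('.aaa a, .aaa img')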
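A small sketch of the same check, scoped so that only an img nested inside an a within the inner div counts. It uses the standard library's ElementTree instead of bs4 (an assumption, to keep the snippet dependency-free). With bs4, a comparable selector would be along the lines of soup.select('.aaa a > img').

```python
import xml.etree.ElementTree as ET

# Markup copied from the question.
html = """<div class='aaa'>
<div class='aaa-child'>
<a>
<img></img>
</a>
</div>
</div>"""

root = ET.fromstring(html)  # root is the outer <div class='aaa'>

# Hyperlinks that are direct children of the inner div.
links = root.findall(".//div[@class='aaa-child']/a")

# Images that sit directly inside those hyperlinks.
images_in_links = root.findall(".//div[@class='aaa-child']/a/img")

has_linked_image = bool(links) and bool(images_in_links)
print(has_linked_image)
```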

Select a div for a specific class which has child elements that contains a certain text

Given this html sample :
<div class="measure-tab"> --- i want to select this one
<span class="color_title">someText</span>
</div>
<div class="measure-tab"> --- i dont want to select this one
<span class="color_title">someText</span>
<div>
<span class="qwery">jokerText</span>
</div>
</div>
<div class="measure-tab"> --- i want to select this one
<span class="color_title">someText</span>
</div>
I want to select the div that has @class='measure-tab', which has under it a span with a specific class and text = someText, and which does not contain a nested span with a specific class and text = 'jokerText', all in a single XPath.
What I've tried is:
//div[contains(@class, 'measure-tab') and //span[@class="color_title" and (contains(text(),'someText')) and //span[@class="color_title" and not(contains(text(),'jokerText'))]]
But this doesn't seem to work.
I also used This post as inspiration.
EDIT: Corrected a bad description of the goal of this question.
EDIT: I made a new attempt:
//div[contains(@class, 'measure-tab') and //span[contains(@class, 'color_title') and //span[not(contains(@class, 'qwery'))]]]
But this returns all the divs, instead of not matching it with --- i dont want to select this one
<span class="color_title">someText</span>
<div>
<span class="qwery">jokerText</span>
</div>
I feel so close and yet so far. It doesn't make sense to me why it still matches the div containing <span class="qwery">jokerText</span> when I wrote not(contains(...)) there.

I believe this is what you are looking for:
MyDivs = driver.find_elements_by_xpath("//div[@class='measure-tab' and not(descendant::*[text() = 'jokerText' and @class = 'qwery'])]")
This selects every measure-tab div that does not have a descendant with class qwery and text jokerText.

You can filter with not(following-sibling::div/span.....)
Try the following XPath:
//span[@class='color_title' and not(following-sibling::div/span[@class='qwery' and text()='jokerText'])]/parent::div[@class='measure-tab']
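The attempts in the question match every div because a path that starts with // inside a predicate is evaluated from the document root, not from the div being tested, so it is true whenever such a span exists anywhere on the page. Both answers avoid this by using paths relative to the current node. The sketch below applies the same relative check in plain Python with the standard library's ElementTree, which does not support not() in predicates, so the negation is done in Python. The root wrapper element and the is_wanted helper are hypothetical names added here only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a single root element so it parses.
html = """<root>
<div class="measure-tab">
<span class="color_title">someText</span>
</div>
<div class="measure-tab">
<span class="color_title">someText</span>
<div>
<span class="qwery">jokerText</span>
</div>
</div>
<div class="measure-tab">
<span class="color_title">someText</span>
</div>
</root>"""

root = ET.fromstring(html)


def is_wanted(div):
    """True if the div has the someText title and no jokerText span below it."""
    title = div.find("./span[@class='color_title']")
    has_title = title is not None and title.text == "someText"
    # Searched relative to this div only, never from the document root.
    joker = div.find(".//span[@class='qwery']")
    has_joker = joker is not None and joker.text == "jokerText"
    return has_title and not has_joker


wanted = [d for d in root.findall("./div[@class='measure-tab']") if is_wanted(d)]

# Positions of the selected divs among the top-level children (0-based).
wanted_positions = [list(root).index(d) for d in wanted]
print(wanted_positions)
```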

How to properly get an element with BeautifulSoup?

I'm new to Python and trying to parse a simple HTML. However, one thing stops me: for example, I have this html:
<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>
I need to extract text from second div (text). This way I get it:
print repr(link.find('div').findNextSibling())
However, this returns the whole div (with "div" word): <div class="text">Here's the desired text!</div>
And I don't know how to get text only.
Adding .text results in \u043a\u0430\u043a \u0440\u0430\u0437\u0440\u0430\u0431 strings\
Adding .strings returns "None"
Adding .string returns both "None" and \u042f\u0445\u0438\u043a\u043e - \u0435\u0441\u043b\u0438\
Maybe there's something wrong with repr
P.S. I need to save tags inside div too.

Why don't you simply search for the <div> element by its class attribute? Something like the following works for me:
from bs4 import BeautifulSoup
html = '''<div class = "quote">
<div class = "whatever">
some unnecessary text here
</div>
<div class = "text">
Here's the desired text!
</div>
</div>'''
# Parse with the built-in parser, then read the text of the div with class "text"
link = BeautifulSoup(html, 'html.parser')
print(link.find('div', class_="text").text.strip())
It yields:
Here's the desired text!

Scrapy scrape content having same class name

I am using Scrapy to crawl and scrape data from a particular website. The crawler works fine, but I'm having an issue when scraping content from divs that share the same class name. For example:
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
I want to retrieve only "this is the 1st div". The code I've used is:
desc = hxs.select('//div[@class = "same_name"]/text()').extract()
But it returns all of the contents. Any help would be appreciated!

OK, this one worked for me:
print desc[0]
It returned "this is the 1st div", which is what I wanted.

You can use BeautifulSoup. It's a great HTML parser.
from bs4 import BeautifulSoup
html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# find() returns only the first matching div
print(soup.find("div", class_="same_name").text.strip())
That should do the job.

Using XPath you will get all the divs with the same class. You can then loop over them to get the result (for Scrapy):
divs = response.xpath('//div[@class="full class name"]')
for div in divs:
    if div.css("div.class"):
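For reference, a minimal sketch of taking only the first match. It uses the standard library's ElementTree in place of Scrapy (an assumption, so that it runs on its own), and the root wrapper element is added here only to hold the three sample divs. In Scrapy itself, calling .get() (or the older .extract_first()) on the selector list instead of .extract() should return just the first result.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a single root element so it parses.
html = """<root>
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
</root>"""

root = ET.fromstring(html)

# All divs with the shared class, in document order, with whitespace trimmed.
texts = [d.text.strip() for d in root.findall(".//div[@class='same_name']")]

# Index 0 is the first match, like desc[0] in the accepted answer.
first = texts[0]
print(first)
```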

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.
