Using Scrapy itemloaders and XPath/CSS selector to extract data - python

I'm using the Scrapy framework to crawl and scrape data from a URL, and I'm not sure whether it's better to extract the value attribute from a progress bar using CSS selectors or to extract the text from a nested div and span using XPath.
This is the URL
This is the HTML code
<div class="grid-x grid-margin-x">
<div class="cell small-1 medium-1 large-1">
<span class="vote-button-legend">12</span>
</div>
<div class="cell small-6 medium-6 large-6" style="display: inline-flex; align-items: center; justify-content: center;">
<progress class="alert" max="297" value="12" style="color: crimson; cursor: pointer;"></progress>
</div>
</div>
Using the inspector I can verify with XPath selectors that I'm able to retrieve the text data:
$x('//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()')
I can also use CSS selectors to find the progress bar element before extracting its value attribute:
$('div progress.alert ')
From here I tried to use ItemLoaders to populate Items in Scrapy using the add_xpath and add_css methods:
l = ItemLoader(item=PropertiesItem(), response=response)
l.add_xpath('votes', '//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()')
l.add_css('votes', 'div.cell.small-1.medium-1.large-1 span.vote-button-legend ::attr[value]')
When I run my spider, the resulting items don't appear in the console, and when I log the selector output to check it, I get an empty list:
self.log("voteCount: %s" % response.css('div.cell.small-1.medium-1.large-1 span.vote-button-legend ::attr[value]').extract())
self.log("value: %s" % response.xpath('//div[@class="cell small-1 medium-1 large-1"]/span[@class="vote-button-legend"]/text()').extract())
Any help would be greatly appreciated

Related

Using BeautifulSoup to find text in HTML

This is my first time using BeautifulSoup as a scraping tool, and I'm working through each step slowly.
I've used soup.find_all("div", class_="product-box__inner") to find the list of elements I want, but this particular part isn't making sense to me right now. My question is below.
Here is the HTML; my target is "$0". I have tried
element.find("span", title=re.compile("$")), and I can't use element.select("dt > dd > span > span") because there are multiple elements with the same tag structure that I don't need at all. Is there a way to target span data-fees-annual-value="" so that .text works?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
You are close to your goal with CSS selectors. They can be made more specific by referencing the data-fees-annual-value attribute directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup=BeautifulSoup(html,"html.parser")
soup.select_one('span[data-fees-annual-value]').text
Output
$0
If you want to find the element by its text, use string instead of title, and escape the $ so it is matched literally rather than as an end-of-string anchor:
element.find("span", string=re.compile(r'\$'))
Output:
<span data-fees-annual-value="">$0</span>
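Since the question starts from soup.find_all("div", class_="product-box__inner"), the attribute selector can also be applied to each box in turn. This is a minimal sketch, assuming every box wraps its features the way the snippet shows. The two-box page, the "$95" value, and the "Rate" row are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two product boxes; only the inner markup
# structure comes from the question.
html = """
<div class="product-box__inner">
  <div class="product-box__features-item">
    <dt class="product-box__features-label">Annual fee</dt>
    <dd class="product-box__features-text">
      <span><span data-fees-annual-value="">$0</span></span>
    </dd>
  </div>
  <div class="product-box__features-item">
    <dt class="product-box__features-label">Rate</dt>
    <dd class="product-box__features-text">
      <span><span>1.5%</span></span>
    </dd>
  </div>
</div>
<div class="product-box__inner">
  <div class="product-box__features-item">
    <dt class="product-box__features-label">Annual fee</dt>
    <dd class="product-box__features-text">
      <span><span data-fees-annual-value="">$95</span></span>
    </dd>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

fees = []
for box in soup.find_all("div", class_="product-box__inner"):
    # Match on the attribute, so same-shaped spans (like the rate) are skipped.
    span = box.select_one("span[data-fees-annual-value]")
    fees.append(span.get_text(strip=True) if span else None)

print(fees)
```

Appending None when a box has no matching span keeps the list aligned with the boxes instead of raising an AttributeError.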

Is it possible to get data from div inside div with overflow:hidden with python's requests?

I'm scraping a website and got stuck trying to reach data stored in a div styled with the CSS property overflow: hidden.
I can reach the div with overflow: hidden (variants__container-items), but it comes back empty even though it has several divs inside.
BeautifulSoup should also be able to find those divs (e.g. variant__available-qty, variant__item) directly, but it doesn't.
<div class="variants__container-items">
<div id="variant__70259" class="variants__container-item">
<div class="variant__price">
<div class="price">69.24 zł</div>
</div>
<div class="variant__item">
<div class="variant__attributes">36-39 </div>
<div class="variant__sku">610306139887</div>
</div>
<div class="variant__qty">
<form method="post" action="https://some_link_here">
(...)
</form>
</div>
<div class="variant__available-qty">6</div>
</div>
(...)
BeautifulSoup can find the div variants__container-items, but it comes back empty (because its CSS property is overflow: hidden).
I would like to reach the data in div variant__70259 (especially in div variant__available-qty, which contains 6 in the given example).
I cannot share URL, because it needs login to see the data.
Can someone help me with it? 🙏
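CSS properties such as overflow: hidden only affect how a browser renders the page; an HTML parser never evaluates them. The sketch below parses the markup from the question (with an inline style added purely for illustration) and reads the nested quantity without trouble. If the container really is empty in the text returned by requests, a likely cause, offered here as an assumption, is that the child divs are added later by JavaScript or are only served to an authenticated session. Comparing response.text with the browser's "view source" would help confirm this.

```python
from bs4 import BeautifulSoup

# Markup from the question; the inline style attribute is an assumption
# added to show that the parser ignores CSS entirely.
html = """
<div class="variants__container-items" style="overflow: hidden">
  <div id="variant__70259" class="variants__container-item">
    <div class="variant__price">
      <div class="price">69.24 zł</div>
    </div>
    <div class="variant__item">
      <div class="variant__attributes">36-39 </div>
      <div class="variant__sku">610306139887</div>
    </div>
    <div class="variant__available-qty">6</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

container = soup.select_one("div.variants__container-items")
# Nested lookup works regardless of any CSS applied to the container.
qty = container.select_one("#variant__70259 .variant__available-qty").get_text(strip=True)
sku = container.select_one(".variant__sku").get_text(strip=True)

print(qty, sku)
```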

Scrapy doesn't grab span text within div class

I have a problem retrieving the text from a div on a website.
The structure of the page is shown below. I'm trying to retrieve only the text 'black' from <span class="product-details__toggler-selected" title="black">.
At the moment my selector returns nothing.
My XPath is this:
color = response.xpath("//div[@class='product-details__toggler-info-title']/p/span[@class='product-details__toggler-selected']/text()").extract()
Structure of page:
<div class="product-details__toggler-info-title">
<span class="product-details__toggler-title">Culoare</span>
<span class="product-details__toggler-selected" title="black"><em class="s-color-bg" style="background-color: #000000">black</em><span class="s-color-name">black</span></span>
</div>
Try the XPath below to get the required value:
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/span/text()
or
//div[@class='product-details__toggler-info-title']//span[@class='product-details__toggler-selected']/@title

Scrapy: How do I select the first a tag inside a div element using XPath

I am using Scrapy's SitemapSpider to pull all product links from their respective collections. The sites on my list are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem is that I get both links when using the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is a Shopify store, the source code for their collections pages isn't exactly the same. So the depth of the a tag under the div element is inconsistent, and I'm not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0]) # This is your first a Tag
Just use the extract_first() method to extract only the first matched element. The benefit of using it is that it avoids an IndexError and returns None when no element matches the selection.
So, it should be:
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'

Scrapy scrape content having same class name

I am using Scrapy to crawl and scrape data from a particular website. The crawler works fine, but I'm having an issue scraping content from divs that share the same class name. For example:
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
I want to retrieve only "this is the 1st div". The code I've used is:
desc = hxs.select('//div[@class = "same_name"]/text()').extract()
But it returns all of the contents. Any help would be appreciated!
OK, this one worked for me.
print(desc[0])
It returned "this is the 1st div", which is what I wanted.
You can use BeautifulSoup. It's a great HTML parser.
from bs4 import BeautifulSoup
html = """
<div class="same_name">
this is the 1st div
</div>
<div class="same_name">
this is the 2nd div
</div>
<div class="same_name">
this is the 3rd div
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# find() returns only the first matching div
print(soup.find("div", class_="same_name").get_text(strip=True))
That should do the job.
With XPath you get all the divs with the same class, and you can then loop over them to pick out the result (for Scrapy):
divs = response.xpath('//div[@class="full class name"]')
for div in divs:
    if div.css("div.class"):
