scrapy scrape html source code - python

I'm using Scrapy to crawl and scrape a website. I need the whole HTML rather than individual components. We can easily extract a component using XPath selectors, but is there a method to extract the whole HTML block for a given class? For example, in the HTML below, I need the exact source code of the entire prod-basic-info div block. Is there any way I can do this?
<div class="block prod-basic-info">
<h2>Product information</h2>
<p class="product-info-label">Category</p>
<p>
<a href="xyz.html"></a>
</p>
</div>

Just point your XPath expression or CSS selector at the element and extract() it:
response.xpath('//div[contains(@class, "prod-basic-info")]').extract()[0]
response.css('div.prod-basic-info').extract()[0]
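For reference, a quick standalone sketch (using Scrapy's Selector class directly on the snippet above; get() is the modern alias of extract_first()) showing that the selector returns the element's entire outer HTML, not just its text:

from scrapy.selector import Selector

html = '''
<div class="block prod-basic-info">
<h2>Product information</h2>
<p class="product-info-label">Category</p>
<p><a href="xyz.html"></a></p>
</div>
'''

sel = Selector(text=html)
# Prints the whole <div class="block prod-basic-info">...</div> block,
# tags included.
print(sel.css('div.prod-basic-info').get())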

Related

Find tag <a> and tag <img> when using bs4

I have the following source code:
<div class='aaa'>
<div class='aaa-child'>
<a>
<img></img>
</a>
</div>
</div>
So the structure is an image inside a hyperlink.
I would like to find out whether tags "a" and "img" exist inside the divs above. Any ideas? I tried find_all, but I get too many results that don't match my expectations.
Yeah, use a descendant CSS selector together with a class selector. Note that a comma (as in '.aaa a,img') would match every img on the page, not just those under .aaa, so the whole chain has to be one descendant selector:
soup.select('.aaa a img')
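A minimal, self-contained version of that check (assuming BeautifulSoup 4, whose select() supports CSS descendant selectors):

from bs4 import BeautifulSoup

html = '''
<div class='aaa'>
<div class='aaa-child'>
<a><img/></a>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# img elements that are descendants of an a element, inside the .aaa div
matches = soup.select('.aaa a img')
print(bool(matches))  # True -> both tags exist in the expected structure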

Elements Inside Opening Tag

I am writing a spider to download all images on the front page of a subreddit using Scrapy. To do so, I have to find the links to the images and extract them with a CSS or XPath selector.
Upon inspection, the links are provided but the HTML looks like this for all of them:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?
*Sorry, I'm not quite sure how to properly format the HTML code, but there really isn't much to format, as it is all one big tag anyway.
How to read the mangled attribute, data-cachedhtml
The HTML is a mess. Try the techniques listed in How to parse invalid (bad / not well-formed) XML? to get viable markup before using XPath. It may take three passes:
1. Clean up the markup mess.
2. Get the attribute value of data-cachedhtml.
3. Use XPath to extract the image links.
XPath part
For the de-mangled data-cachedhtml in this form:
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
<div class="media-preview-content">
<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
<img class="preview" src="https://i.redditmedia.com/elided"
width="861" height="638"/>
</a>
</div>
<span class="error">loading...</span>
</div>
This XPath will retrieve the preview image links:
//a/img/@src
(That is, all src attributes of img element children of a elements.)
or
This XPath will retrieve the click-through image links:
//a[img]/@href
(That is, all href attributes of the a elements that have an img child.)
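Tying the three passes together, here is a hedged sketch of a Scrapy callback (the expando class comes from the question's markup; the cleanup pass is assumed to have already happened, e.g. the raw inner quotes re-escaped as &quot; so the attribute survives parsing):

from scrapy.selector import Selector

def parse(self, response):
    # Pass 2: read the attribute value, which is itself HTML.
    cached = response.xpath(
        '//div[contains(@class, "expando")]/@data-cachedhtml').get()
    if cached:
        # Pass 3: parse the recovered markup and extract both kinds of links.
        inner = Selector(text=cached)
        yield {
            'preview_srcs': inner.xpath('//a/img/@src').getall(),
            'click_hrefs': inner.xpath('//a[img]/@href').getall(),
        }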

Scrapy: How do I select the first a tag inside a div element using XPath

I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I'm having is that both links get scraped when I use the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although every site is a Shopify store, their source code for the collections page isn't exactly the same, so the depth of the a tag under the div element is inconsistent and I'm not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0])  # this is your first a tag
The [1] doesn't deduplicate here because a positional predicate is evaluated relative to each context node, not against the full result set. Instead, just use the extract_first() method to extract only the first matched element. A further benefit is that it avoids an IndexError and returns None when it doesn't find any element matching the selection.
So, it should be :
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
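To see the difference in isolation, here is a small sketch against a trimmed stand-in for the grid markup quoted in the question:

from scrapy.selector import Selector

html = '''
<div class="grid-product__wrapper">
  <a class="grid-product__image-link"
     href="/collections/accessories/products/black-double-layer-braided-leather-bracelet"></a>
  <a class="grid-product__meta"
     href="/collections/accessories/products/black-double-layer-braided-leather-bracelet"></a>
</div>
'''

sel = Selector(text=html)
first = sel.xpath('//div//a[contains(@href, "collections") and '
                  'contains(@href, "products")]/@href').extract_first()
print(first)  # the duplicate second link is never touched

# On a page with no match, extract()[0] raises IndexError,
# while extract_first() quietly returns None:
print(sel.xpath('//a[contains(@href, "no-match")]/@href').extract_first())  # None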

scrapy scrape only after checking if a class exists

I've created a crawler to crawl a webpage and store items in a MySQL database. I'm facing a slight problem while scraping a fixed part of the webpage. I want to check whether a div with a certain class name exists inside a div, and if it does, I will store the root div.
<div class="page-col-1-2-right">
<div class="block">
<h2>Produktbewertung und Test</h2>
<div class="area spacing ingredient-rating"></div>
</div>
<div class="block">
<h2>Artikel zu Nasentropfen & Schnupfen</h2>
<div class="cell clickable teaser-large" data-id="62151"></div>
</div>
</div>
In the above code, I want the div block if and only if it has
<div class="area spacing ingredient-rating"></div>
inside it. Since some pages of the website I'm crawling might or might not have the required block, my code below didn't work:
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block")][2]').extract()[0]
Since you have a test to perform before extracting, you can express that test as a predicate inside the XPath itself. Note that writing 'xpath1' and 'xpath2' in Python just evaluates and between two strings and hands the second one to xpath(); the condition has to live inside the expression, where a predicate in square brackets keeps a node only when the inner expression matches something.
Applied to your code:
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block")][.//div[contains(@class, "ingredient-rating")]]').extract_first()
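As a sketch of how that fits into a spider callback (the selectors are the ones from the question; the item key is illustrative):

def parse(self, response):
    # Keep the "block" div only when it contains the ingredient-rating div.
    block = response.xpath(
        '//div[contains(@class, "page-col-1-2-right")]'
        '/div[contains(@class, "block")]'
        '[.//div[contains(@class, "ingredient-rating")]]'
    ).extract_first()
    if block is not None:
        yield {'rating_block': block}  # store the root div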

Using scrapy to get the "Next Page" data

I need to grab a commodity website's review data, but its user data is paginated: there are 10 comments per page and about 100 pages. How can I crawl all of them?
My intention is to use yield and Request to follow the "Next Page" link and then use XPath to extract the data, but I can't jump to the next page to extract the data.
Here is the Html code about the "Next Page" link:
<div class="xs-pagebar clearfix">
<div class="Pagecon">
<div class="Pagenum">
<a class="pre-page pre-disable">
<a class="pre-page pre-disable">
<span class="curpage">1</span>
<a href="#">2</a>
<a href="#">3</a>
<span class="elli">...</span>
<a href="#">Next Page</a>
<a href="#">Final Page</a>
</div>
</div>
</div>
What does href="#" exactly mean?
Unfortunately you will not be able to do this with Scrapy alone. href="#" is an anchor link that simply points nowhere (it is there to make the element look like a link). What really happens is that a JavaScript onclick handler is executed. You will need a way of executing the JavaScript for your use case. You might want to look into Splinter to do this, as in the sketch below.
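A rough sketch of the Splinter approach (the URL is a placeholder and the loop-termination check is approximate; Splinter drives a real browser, so the onclick handlers actually run):

from splinter import Browser

with Browser() as browser:
    browser.visit('https://example.com/product/reviews')  # placeholder URL
    while True:
        html = browser.html  # rendered source; hand this to your parser
        # ... extract the 10 comments on this page from `html` ...
        next_link = browser.find_by_text('Next Page')
        if not next_link:  # no more pages
            break
        next_link.first.click()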
