I need to crawl a website, fetch some of its pages, and store them together with all of their CSS files and images, exactly like saving the pages in a browser.
I have tried Selenium, but with Selenium I can only save the HTML, not the full page, so it is not possible to do this with Selenium alone.
I want to know: can I do this using Scrapy?
If it is not possible using Scrapy, what else can I use?
Yes, you should be able to do this in Scrapy.
Inside the <head> tag of the HTML you should see URLs to JavaScript files in <script> tags, and <link> tags that give you the URLs of the CSS files.
Once you have a URL, it's a simple matter to issue a request for it in Scrapy. The Scrapy tutorial shows this:
https://doc.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
The responses to those URLs contain the raw CSS or JavaScript, and you can either save them separately or inline them to construct a new single HTML document.
One thing to note is that a <script> tag may contain the full JavaScript inline rather than a URL reference. In that case you already have the data from the HTML response.
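As a rough sketch of the first step (collecting the asset URLs to feed into follow-up requests), here is the parsing part done with the standard library only, so it runs without Scrapy installed. Inside a real spider you would use `response.css('link[rel=stylesheet]::attr(href)')` and friends instead, and pass each URL to `scrapy.Request`; the sample HTML and base URL below are made up for illustration.

```python
# Sketch: collect asset URLs (<link rel="stylesheet">, <script src>, <img src>)
# from a page's HTML so each can be fetched with a follow-up scrapy.Request.
# Standard library only so the parsing step is easy to test in isolation.
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            # CSS file referenced from a <link> tag
            self.assets.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("script", "img") and "src" in attrs:
            # external JavaScript file or image
            self.assets.append(urljoin(self.base_url, attrs["src"]))

html = '<head><link rel="stylesheet" href="/style.css"><script src="app.js"></script></head>'
collector = AssetCollector("https://example.com/page")
collector.feed(html)
print(collector.assets)  # ['https://example.com/style.css', 'https://example.com/app.js']
```

Note that `urljoin` resolves relative references against the page URL, which matters because most sites use relative paths for their assets.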
I am trying to scrape some content from a website, but Selenium's driver.page_source does not contain all the content I need because the website is dynamically rendered. When opening DevTools in Chrome I am able to inspect all of the DOM elements, even those rendered dynamically. This made me believe that there must be a way to do this in Selenium as well.
Any suggestions?
Get the inner HTML of the html or body element:
driver.find_element(By.XPATH, "/html/body").get_attribute("innerHTML")
(In Selenium 4 the old find_element_by_xpath helpers were removed; import By from selenium.webdriver.common.by.)
If that does not get everything, please post the source HTML/website.
I am trying to get the tag text from an Instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
which should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried to include more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "view page source" in the browser: that is mostly the same response we get in Scrapy. (If we are being blocked, that is a totally different question.)
Viewing the source, we can see that this website loads all of its content dynamically using AJAX requests; nothing is present in the page source. You can try searching for your desired content in the network tab and then replicating the request that returns it. Alternatively, you can use Splash. Documentation for Splash can be found here.
If you look at the response body, you will see that the Instagram page is not fully loaded. The posted data is stored in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change the selector path and extract the information from the meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first inspect response.body and see if it is there.
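To illustrate the meta-tag approach without requiring Scrapy to be installed, here is the same extraction done with the standard library's HTML parser, fed the literal snippet from the page source. Inside the Scrapy shell the equivalent would be response.xpath('//meta[@property="instapp:hashtags"]/@content').getall().

```python
# Sketch: pull hashtag values out of <meta property="instapp:hashtags"> tags,
# mirroring the XPath suggested above, using only the standard library.
from html.parser import HTMLParser

class HashtagMeta(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hashtags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing <meta ... /> tags through here
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "instapp:hashtags":
            self.hashtags.append(attrs.get("content"))

parser = HashtagMeta()
parser.feed('<meta property="instapp:hashtags" content="cyberpunk" />')
print(parser.hashtags)  # ['cyberpunk']
```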
Take a look at this webpage:
https://www.michaelkors.com/large-crossgrain-leather-dome-crossbody-bag/_/R-US_32S9SF5C3L?color=2519
I want to get the text under the details section. When I look at the div, it has the class detail with text under it. This is the statement I am using:
details = response.xpath('.//div[@class="detail"]/text()').extract()
However, it returns nothing.
It looks like the div you're trying to parse does not exist when the page is first loaded.
The product data is stored as JSON inside a <script> tag, and the div is generated from it by JavaScript at runtime.
This leaves you with a couple of options:
Parse the JavaScript and extract the data yourself
Use a browser (e.g. scrapy-splash) to run the JavaScript, and parse the resulting HTML
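The first option can be sketched like this. The tag id and JSON shape below are made up for illustration; on the real page you would first check view-source to find the actual script blob and match on that.

```python
# Sketch: extract product data from an inline <script> JSON blob.
# The id "product-data" and the JSON fields are hypothetical examples.
import json
import re

html = '''<script type="application/json" id="product-data">
{"name": "Crossbody Bag", "details": ["Leather", "Adjustable strap"]}
</script>'''

# Grab the script body, then hand it to the JSON parser.
match = re.search(r'<script[^>]*id="product-data"[^>]*>(.*?)</script>', html, re.S)
product = json.loads(match.group(1))
print(product["details"])  # ['Leather', 'Adjustable strap']
```

Once the JSON is parsed you get the same data the generated div would have shown, without running any JavaScript.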
The class="detail" element is not found in the page source, which means it is not present in the response loaded by the Scrapy request.
Scrapy handles static requests: its response contains only the elements present in the page source.
If the request is dynamic (the elements you see in the inspector are loaded by JavaScript or AJAX-type requests), we should use another package alongside Scrapy to scrape that data.
Examples: Splash, Selenium, etc.
In your case you should handle it as a dynamic request.
Using the Scrapy shell on a specific URL, I am trying to work out how I can extract the author value or contributor value from this script within the page's source code. I have tried
response.xpath('//script').re(r'author":"([0-9.]+)"')
This is the script in the source code of the site:
<script charSet="UTF-8">...
"author":"3810161","contributor":{"id":"3810161"}},
</script>
Did you try printing all the <script> contents from Scrapy itself?
I guess you will not see the same content as in your browser's inspector, since these nodes appear to be JavaScript-rendered and Scrapy doesn't handle JavaScript.
If you just want to extract some content from these search results, you could just use the API (the same search parameters you posted, but it gives you a JSON response, which is really much easier to parse).
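For what it's worth, the regex from the question does match the snippet shown, provided the script text is actually present in the response. Demonstrated here with plain `re` on the literal string from the page source; in the shell, response.xpath('//script').re_first(r'"author":"([0-9.]+)"') would do the same.

```python
# Sketch: apply the question's regex to the script text shown above.
import re

script_text = '"author":"3810161","contributor":{"id":"3810161"}},'
author = re.search(r'"author":"([0-9.]+)"', script_text).group(1)
print(author)  # 3810161
```

If this returns nothing in the shell, the script block was rendered by JavaScript and never reached Scrapy, which is the point made above.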
I have a link I want to scrape the content from that looks like this:
https://www.whatever.com/getDescModuleAjax.htm?productId=32663684002&t=1478698394335
But when I try to open it with Selenium, it won't work. When I load it in a normal browser, it opens as plain text, with the HTML wrapped in a JavaScript string like this:
window.productDescription='<div style="clea....
#I want this
....n.jpg" width="950"/></p></div>'";
I was thinking I would download the source code as plain text and extract the content I need using bs4, but that can't be the best solution. Is there a way to ignore the wrapper and load the web page normally using Selenium and Python?
If all the source code is inside a JS variable:
window.variable="<div>...</div>"
then you probably can't use bs4 on it directly, since bs4 works on pure HTML DOM nodes.
"Is there a way to ignore the tags and load the web page normally using selenium and python?"
Most likely Selenium should be able to force the on-page JS to execute and load the variable's content into the page's DOM. Try to search for where the window.productDescription (or productDescription) expression is applied/used, i.e. in which loaded .js files.
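Alternatively, since the endpoint returns the HTML inside a single JS assignment, a small sketch like this strips the wrapper first; the resulting string is plain HTML that bs4 (or any HTML parser) can then handle. The sample string below is a simplified stand-in for the real response body.

```python
# Sketch: peel the window.productDescription='...' wrapper off the response
# body, leaving the HTML payload for a normal HTML parser.
import re

body = "window.productDescription='<div><p>Some description</p></div>';"
match = re.search(r"window\.productDescription='(.*)';", body, re.S)
html = match.group(1)
print(html)  # <div><p>Some description</p></div>
```

This avoids a browser entirely: a plain HTTP client can fetch the URL and the regex recovers the markup.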