Take a look at this webpage:
https://www.michaelkors.com/large-crossgrain-leather-dome-crossbody-bag/_/R-US_32S9SF5C3L?color=2519
I want to get text under details section. When I look at the div it has class detail and text under it. This is the statement I am using:
details = response.xpath('.//div[@class="detail"]/text()').extract()
However, it is returning nothing.
It looks like the div you're trying to parse doesn't exist when the page is first loaded.
The product data is stored as JSON inside a script tag, and the div is generated from it using JavaScript.
This leaves you with a couple of options:
Parse the JavaScript and extract the data yourself
Use a browser (e.g. scrapy-splash) to run the JavaScript, and parse the resulting HTML
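A sketch of the first option, assuming the product data sits in a <script type="application/ld+json"> block (the HTML below is a made-up stand-in for response.text; in Scrapy you would select the tag with response.xpath('//script[@type="application/ld+json"]/text()')):

```python
import json
import re

# Made-up stand-in for the raw HTML returned by the request.
html = '''<html><body>
<script type="application/ld+json">
{"name": "Large Crossgrain Leather Dome Crossbody Bag",
 "description": "Detail text rendered into the page by JavaScript."}
</script>
</body></html>'''

# Grab the script body and parse it as JSON.
match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
product = json.loads(match.group(1))
print(product["description"])
```

Once you have the dict, the "details" text is just a key lookup, no rendered DOM required.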
The element with class detail is not found in the page source, which means it is not present in the response loaded by the Scrapy request.
Scrapy deals with static requests: the response contains only the elements present in the page source.
If the request is a dynamic one (the elements appear in inspect element but are loaded by JavaScript or AJAX-type requests), we should pair Scrapy with some other package to scrape that data.
Examples: Splash, Selenium, etc.
In your case you should handle it as a dynamic request.
I am trying to get the tags text from an instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
Which should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open the page source in your browser (view-source): that is mostly the same response Scrapy gets. (If we are getting blocked, that is a totally different question.)
Viewing the source, we can see that this website loads all of its content dynamically using AJAX requests, as nothing is present in the page source. You can search for your desired content in the network tab and then try replicating the request that contains it. Or else you can use Splash. Documentation for Splash can be found here.
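A sketch of replicating such a request once you've found it in the Network tab (the endpoint URL and headers here are made-up placeholders; copy the real ones from what the tab shows):

```python
import urllib.request

# Made-up endpoint; substitute the real URL found in the Network tab.
req = urllib.request.Request(
    "https://www.instagram.com/api/v1/media/CHPoTitFdEz/info/",
    headers={
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",  # marks it as an AJAX call
    },
)
# urllib.request.urlopen(req) would then return the JSON payload,
# which json.loads() turns into a plain dict.
print(req.full_url)
```

In Scrapy itself you would do the same thing by yielding a scrapy.Request for that URL with the copied headers.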
If you look at the response body, you will see that the Instagram page is not fully loaded. The post's data is saved in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change the selector path and extract the information from the meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()
and you will get your content back. If you need other information, first inspect response.body and check whether it is there.
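To illustrate what that selector is picking out, here is the same extraction with only the standard library (the HTML is a cut-down, partly made-up stand-in for the raw Instagram response; the second hashtag is invented):

```python
from html.parser import HTMLParser

# Cut-down stand-in for the meta tags in the raw response; "scifi" is made up.
html = '''<head>
<meta property="instapp:hashtags" content="cyberpunk" />
<meta property="instapp:hashtags" content="scifi" />
</head>'''

class HashtagFinder(HTMLParser):
    """Collects the content= values of the instapp:hashtags meta tags."""
    def __init__(self):
        super().__init__()
        self.hashtags = []

    def handle_startendtag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "instapp:hashtags":
            self.hashtags.append(attrs["content"])

finder = HashtagFinder()
finder.feed(html)
print(finder.hashtags)  # ['cyberpunk', 'scifi']
```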
I write a scrapy spider to scrape data from a webpage which has several subpages. Each of them also has several subpages etc. I want to visit all of the sub-sub-...-pages and take specific information from it.
To go deeper and deeper I want to query subsequent subpages with xpath() to get links and enter them. But to use xpath I need an object of scrapy.http.response.html.HtmlResponse class. Therefore I write:
from scrapy.http import HtmlResponse
new_response = HtmlResponse(url=subpage_url)
But when I do an xpath query on such an object I don't get what I should, just an empty list. I suspect it's because I didn't specify the 'body' argument in HtmlResponse(), but the body is the HTML at subpage_url, which is exactly what I'm trying to fetch. Am I doing something incorrect, or is there a better way to get the HTML from a subpage with a known URL and xpath-query it?
That is how BeautifulSoup would work, not Scrapy. In Scrapy you never construct HtmlResponse objects yourself. Use a link extractor (or xpath) to get the subpage URLs, yield a new scrapy.Request for each one, and let Scrapy fetch it; the fully populated response arrives in the request's callback, where you run your xpath queries to extract the content you want.
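A minimal sketch of that flow using only the standard library (PAGES and the URLs are made up; in real Scrapy each "fetch" would be a yielded scrapy.Request(url, callback=self.parse), and link resolution would be response.urljoin()):

```python
from urllib.parse import urljoin

# Fake site: each URL maps to the relative links found on that page.
# In real Scrapy the framework downloads each page and passes the
# response to the callback named in scrapy.Request(url, callback=...).
PAGES = {
    "https://example.com/": ["sub/a", "sub/b"],
    "https://example.com/sub/a": ["a/leaf"],
    "https://example.com/sub/b": [],
    "https://example.com/sub/a/leaf": [],
}

def crawl(start_url):
    """Visit every page reachable from start_url, resolving relative
    links the same way response.urljoin() does."""
    order, stack = [], [start_url]
    while stack:
        url = stack.pop()
        order.append(url)
        stack.extend(urljoin(url, href) for href in PAGES[url])
    return order

order = crawl("https://example.com/")
print(order)
```

The point is that no page body is ever built by hand: each step only names the next URL, and the fetching machinery supplies the response.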
So I'm using scrapy to scrape a data from Amazon books section. But somehow I got to know that it has some dynamic data. I want to know how dynamic data can be extracted from the website. Here's something I've tried so far:
import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items = AmazonsItem()
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I was using SelectorGadget to pick the class I need to scrape, but for a dynamic website it doesn't work.
So how do I scrape a website which has dynamic content?
what exactly is the difference between dynamic and static content?
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
how would I know that data is dynamically created?
So how do I scrape a website which has dynamic content?
There are a few options:
Use Selenium, which allows you to simulate opening a browser, letting the page render, then pull the html source code
Sometimes you can look at the XHR and see if you can fetch the data directly (like from an API)
Sometimes the data is within the <script> tags of the html source. You could search through those and use json.loads() once you manipulate the text into a json format
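For the third option, a minimal sketch of "manipulating the text into a json format" (the variable assignment below is a made-up example of what such a script tag might contain):

```python
import json
import re

# Made-up example of inline script content that embeds page data as JSON.
script_text = 'var pageData = {"title": "And Then There Were None", "price": 299};'

# Strip the JavaScript assignment wrapper so only the JSON object remains.
match = re.search(r'\{.*\}', script_text, re.DOTALL)
data = json.loads(match.group(0))
print(data["title"])  # And Then There Were None
```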
what exactly is the difference between dynamic and static content?
Dynamic means the data is generated from a request after the initial page request. Static means all the data is there at the original call to the site
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
Refer to your first question
how would I know that data is dynamically created?
You'll know it's dynamically created if you see it in the dev tools' rendered source but not in the HTML page source you first request. You can also check whether the data comes from additional requests in the dev tools by looking at Network -> XHR
Lastly
Amazon does offer an API to access the data. Try looking into that as well
If you want to load dynamic content, you will need to simulate a web browser. When you make an HTTP request, you will only get the text returned by that request, and nothing more. To simulate a web browser, and interact with data on the browser, use the selenium package for Python:
https://selenium-python.readthedocs.io/
So how do I scrape a website which has dynamic content?
Websites that have dynamic content have their own APIs from which they pull data. That data is not even fixed; it will be different if you check it again after some time. But that does not mean you can't scrape a dynamic website: you can use automated testing frameworks like Selenium or Puppeteer.
what exactly is the difference between dynamic and static content?
As I explained in your first question, static data is fixed and will remain the same, while dynamic data is periodically updated or changes asynchronously.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
For that, you can use libraries like BeautifulSoup in Python and cheerio in Node.js. Their docs are quite easy to understand and I highly recommend you read them thoroughly.
You can also follow this tutorial
how would I know that data is dynamically created?
While reloading the page, open the network tab in Chrome dev tools. You will see a lot of API requests working in the background to provide the relevant data for the page you are trying to access. In that case, the website is dynamic.
So how do I scrape a website which has dynamic content?
To scrape the dynamic content from websites, we are required to let the web page load completely, so that the data can be injected into the page.
What exactly is the difference between dynamic and static content?
Content in static websites is fixed content that is not processed on the server and is returned directly from prebuilt source files.
Dynamic websites load the contents by processing them on the server side in runtime. These sites can have different data every time you load the page, or when the data is updated.
How would I know that data is dynamically created?
You can open the Dev Tools and open the Networks tab. Over there once you refresh the page, you can look out for the XHR requests or requests to the APIs. If some requests like those exist, then the site is dynamic, else it is static.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
To extract the dynamic content from the websites we can use Selenium (python - one of the best options) :
Selenium - an automated browser simulation framework
You can load the page, and use the CSS selector to match the data on the page. Following is an example of how you can use it.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6")
time.sleep(4)  # give the page time to render; WebDriverWait is more robust
# Selenium 4 replaced find_elements_by_css_selector with find_elements(By...)
titles = driver.find_elements(By.CSS_SELECTOR,
                              ".a-size-medium.a-color-base.a-text-normal")
print(titles[0].text)
In case you don't want to use Python, there are other open-source options like Puppeteer and Playwright, as well as complete scraping platforms such as Bright Data that have built-in capabilities to extract dynamic content automatically.
I need to crawl a website:
get some of its pages and store them with all of their CSS files and images, exactly like saving the pages in a browser.
I have tried Selenium, but with Selenium I can only save the HTML, not the full page, so it is not possible to do this with Selenium alone.
I want to know: can I do this using Scrapy?
If it is not possible using Scrapy, what else can I use?
Yes - you should be able to do this in scrapy
Inside the <head> tag in the HTML you should see URLs to JavaScript references in <script> tags, and you should see <link> tags that give you the URL of each CSS file
Once you get the url, it's a simple matter to do a request in scrapy. The scrapy tutorial shows this:
https://doc.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
These urls contain the raw css or javascript and you can either download that separately or construct a new single HTML document
One thing to note is that the <script> tags may contain the full javascript and not a url reference. In this case you'll get the data when you get the html portion
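A sketch of pulling those asset URLs out of the head with the standard library (the page URL and HTML snippet are made up; in Scrapy you could instead use response.css('link[rel=stylesheet]::attr(href)') and then yield one Request per URL):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up page head; in Scrapy this would be response.text.
html = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
<script>var inline = true;</script>
</head><body></body></html>
"""

class AssetFinder(HTMLParser):
    """Collects the URLs of external CSS and JS referenced by the page."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet":
            self.assets.append(attrs["href"])
        elif tag == "script" and "src" in attrs:
            self.assets.append(attrs["src"])

finder = AssetFinder()
finder.feed(html)
# Resolve relative paths against the page URL before requesting each asset.
urls = [urljoin("https://example.com/page", a) for a in finder.assets]
print(urls)
```

Note that the inline <script> with no src is skipped, matching the point above: its content already arrives with the HTML.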
I'm trying to scrape https://www.grailed.com/ using scrapy. I have been able to get the elements I want in each listing (price, item title, size). I currently trying to get the ahrefs for each listing at the home page.
When I try response.xpath('.//div[starts-with(@id, "product")]').extract
returns
<bound method SelectorList.extract of [<Selector xpath='.//div[starts-with(#id, "product")]'
data=u'<div id="products">\n<div id="loading">\n<'>]>
Based on the inspect element, shouldn't it be returning <div class="feed-wrapper">?
I'm just trying to get those links so scrapy knows to go into each listing. Thank you for any help.
When you do scraping, always check the source of the page (not in the inspector but view-source): that is the real data you operate on.
That div is added dynamically after the page loads; JS does that job.
When you send a request to the server you receive pure HTML; the JS will not be executed, so you see the real server response you are supposed to work with.
<div class="feed-wrapper">
only appears after the JS runs. What Scrapy received is the real server response, and that is what you must deal with.
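A quick way to confirm this from the shell is to check whether the class you saw in the inspector is actually present in the raw response (both snippets below are made-up stand-ins for response.text and the rendered DOM):

```python
# Made-up stand-in for the raw HTML Scrapy receives (view-source),
# versus what the inspector shows after JS has run.
raw_response = '<div id="products"><div id="loading"></div></div>'
rendered_dom = '<div id="products"><div class="feed-wrapper">...</div></div>'

# Present in the rendered DOM but absent from the raw response
# => the element is injected by JavaScript.
print("feed-wrapper" in raw_response)   # False
print("feed-wrapper" in rendered_dom)   # True
```

In the real shell, `"feed-wrapper" in response.text` is the one-line version of this check.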