Wait until the webpage loads in Scrapy - python

I am using a Scrapy script to load a URL using "yield":
MyUrl = "www.example.com"
request = Request(MyUrl, callback=self.mydetail)
yield request

def mydetail(self, response):
    item['Description'] = response.xpath(".//table[@class='list']//text()").extract()
    return item
The URL seems to take at least 5 seconds to load, so I want Scrapy to wait long enough for the entire text to end up in item['Description'].
I tried DOWNLOAD_DELAY in settings.py, but it didn't help.

Take a look with Firebug or another network-inspection tool to capture the responses of the Ajax requests made by the page's JavaScript code. You can then make a chain of requests to fetch the Ajax content that appears after the page has loaded. There are several related questions: parse ajax content,
retrieve final page,
parse dynamic content.
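A minimal sketch of that chaining approach, assuming you have found the JSON endpoint the page calls in your browser's network tab (the endpoint and field names below are hypothetical):

import json
import scrapy

class DetailSpider(scrapy.Spider):
    name = "detail"
    start_urls = ["https://www.example.com/listing"]

    def parse(self, response):
        # The initial response is only the "bare" page; instead of waiting,
        # request the Ajax endpoint its JavaScript would call. This endpoint
        # is a placeholder - use the one you see in the network tab.
        yield scrapy.Request("https://www.example.com/api/description",
                             callback=self.parse_description)

    def parse_description(self, response):
        # The Ajax response already contains the data the browser would
        # normally render into the table.
        data = json.loads(response.text)
        yield {"Description": data.get("description")}  # hypothetical field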

Related

For the BeautifulSoup specialists: How do I scrape a page with multiple panes?

Here is a link to the page that I'm trying to scrape:
https://www.simplyhired.ca/search?q=data+analyst&l=Vancouver%2C+BC&job=grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g%27
More specifically, I'm trying to scrape the 'Qualifications' element on the page.
When I print the soup object, I do not see the HTML code for the right pane.
Any thoughts on how I could access these elements?
Thanks in advance!
The DOM elements of the page you're trying to scrape are populated asynchronously using JavaScript. In other words, the information you're trying to scrape is not actually baked into the HTML at the time the server serves the page document to you, so BeautifulSoup can't see it - the document you get back is just a "bare bones" template which, when viewed in a browser like it's meant to be, gets populated via JavaScript, pulling the required information from various other places.
You can expect most modern, dynamic websites to be implemented in this way. BeautifulSoup will only work for pages whose content is baked into the HTML at the time it is served to you by the server.
The fact that some elements of the page take some time to load when viewed in a browser is an instant give-away - any time you see that, your first thought should be "the DOM is populated asynchronously using JavaScript; BeautifulSoup won't work for this". If it's a Single-Page Application, you can forget BeautifulSoup.
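One quick way to check this yourself (a rough sketch with placeholder values): fetch the page with plain requests and search the raw HTML for a piece of text you can see in the browser. If it is missing, the content is rendered by JavaScript and you should look for the underlying XHR/API call instead.

import requests

url = "https://www.example.com/some-listing"  # placeholder URL
needle = "Qualifications"                     # text you can see in the browser

raw_html = requests.get(url).text
if needle in raw_html:
    print("Content is baked into the HTML; BeautifulSoup can see it.")
else:
    print("Content is loaded by JavaScript; look for the underlying API call.")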
Upon visiting the page in my browser, I logged my network traffic and saw that it made multiple XHR (XmlHttpRequest) HTTP GET requests, one of which was to a REST API that serves JSON which contains all the job information you're looking for. All you need to do is imitate that HTTP GET request to that same API URL, with the same query-string parameters (the API doesn't seem to care about request headers, which is nice). No BeautifulSoup or Selenium required:
def main():
    import requests

    url = "https://www.simplyhired.ca/api/job"
    params = {
        "key": "grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g",
        "isp": "0",
        "al": "1",
        "ia": "0",
        "tk": "1f4aknr5vs7aq800",
        "tkt": "serp",
        "from": "manual",
        "jatk": "",
        "q": "data analyst"  # pass the raw value; requests URL-encodes params itself
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    print(response.json()["skillEntities"])
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
["Tableau", "SQL"]
For more information about logging your network traffic, finding the API URL and exploring the information available to you in the JSON response, take a look at one of my other answers where I go into more depth.

how to automatically retrieve the request URL (for python) from an XHR request of a page loaded via JavaScript

Here is the URL that I'm trying to scrape: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
I'm trying to scrape the webpage using Python, which means I will need the XHR request for this page, as it is loaded via JavaScript.
Upon inspection of the Network under Developer Tools, I can see the XHR request: a10-qq320196292019.htm which produces the request URL: https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm
My question is two-fold:
How can I automatically get this request URL if I am only starting from the URL given initially, and
How do I know that this is THE XHR request I need? This particular URL works for my needs, but I noticed that there were many other XHR requests as well. How does one differentiate?
In this case, I don't think you need to go that route. The link you're using is an ixbrl view of the actual html document. The url for the html doc is embedded in that first link. All you have to do is extract it:
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
html_url = url.replace('/ix?doc=','')
html_url
Output:
'https://www.sec.gov/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'
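If you would rather not rely on string replacement, here is a small alternative sketch that pulls the doc query parameter out of the viewer URL and joins it back onto the host (same result for this URL):

from urllib.parse import urlparse, parse_qs, urljoin

url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019319000076/a10-qq320196292019.htm'

# The ixbrl viewer URL carries the real document path in its "doc" parameter.
doc_path = parse_qs(urlparse(url).query)['doc'][0]
html_url = urljoin('https://www.sec.gov', doc_path)
print(html_url)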

How to wait for a page to load before getting data with requests.get in Python, without using an API

I am using Python and the requests library for web scraping. I have a problem with the loading of a page: I would like requests.get() to wait before returning the result.
I saw some people with the same problem who solved it using Selenium, but I don't want to use another library. I am wondering if it's possible using only urllib, urllib2 or requests.
I have tried putting time.sleep() around the get call; it didn't work.
It seems that I need to find where the website gets the data from before showing it, but I can't find it.
import requests

def search():
    url = 'https://academic.microsoft.com/search?q=machine%20learning'
    mySession = requests.Session()
    response = mySession.get(url)
    myResponse = response.text
The response is the HTML of the loading page (you can see it if you open the link in the code) with the loading blocks, but I need the actual search results.
requests cannot get elements that are loaded via Ajax. See this explanation from w3schools.com:
Read data from a web server - after a web page has loaded
The only thing requests does is download the HTML; it does not interpret the JavaScript code, so it cannot load the elements that are normally fetched via Ajax in a web browser (or with Selenium).
This site makes additional requests and uses JavaScript to render the results. You cannot execute JavaScript with requests; that's why some people use Selenium.
https://academic.microsoft.com/search?q=machine%20learning is not meant to be used without a browser.
If you want data specifically from academic.microsoft.com, use their API.
import requests

url = 'https://academic.microsoft.com/api/search'
data = {"query": "machine learning",
        "queryExpression": "",
        "filters": [],
        "orderBy": None,
        "skip": 0,
        "sortAscending": True,
        "take": 10}

r = requests.post(url=url, json=data)
result = r.json()
You will get the data in a nice format that is easy to use.
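If you need more than the first 10 results, the skip and take fields in the payload look like ordinary pagination controls, so a paging loop might look like the sketch below; the exact response structure is an assumption, so inspect result before relying on any particular key:

import requests

url = 'https://academic.microsoft.com/api/search'
page_size = 10

for page in range(3):  # first three pages, as an example
    data = {"query": "machine learning",
            "queryExpression": "",
            "filters": [],
            "orderBy": None,
            "skip": page * page_size,  # assumed pagination offset
            "sortAscending": True,
            "take": page_size}
    r = requests.post(url=url, json=data)
    r.raise_for_status()
    result = r.json()
    # Print the top-level keys first so you can see what the API actually
    # returns before assuming any particular structure.
    print(page, list(result.keys()))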

Detect page is fully loaded using requests library

I want to know whether requests.get(url) only returns a response once the page is fully loaded. I ran tests with around 200 refreshes of my page, and once or twice the page randomly does not load the footer.
A first requests GET will return you the entire page, but requests is not a browser; it does not parse the content.
When you load a page in a browser, the browser usually makes 10-50 requests, one for each resource, runs the JavaScript, and so on.
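requests cannot wait for anything to "finish loading", but if the server itself occasionally returns an incomplete document, you can check the raw HTML for a marker you expect near the end of the page and retry. A rough sketch, where the marker string and URL are placeholders:

import time
import requests

def get_full_page(url, marker="</footer>", retries=3, delay=2):
    """Fetch url and retry if the expected marker is missing from the HTML."""
    response = None
    for attempt in range(retries):
        response = requests.get(url)
        if marker in response.text:
            return response
        time.sleep(delay)  # give the server a moment before retrying
    return response  # last attempt, possibly still incomplete

page = get_full_page("https://www.example.com")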

Scrapy - how to manage cookies/sessions

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and subsequent child requests, needs to have its own session, so needs to have its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests.. otherwise one spider would be making multiple searches under the same session cookie, and future requests will only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.
I suppose that would mean disabling cookies.. and then grabbing the session cookie from the search response, and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):

    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies. You just need to enable it. It mimics how the cookiejar in a browser works.
When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar on the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read the Cookie header from the cookiejar, and that the Set-Cookie header from its response should not be merged back into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You have to control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
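Putting the two meta keys together, here is a minimal sketch of one named cookiejar per search, so each search keeps its own session cookie across its follow-up requests (the spider, URLs and selectors are placeholders):

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"

    def start_requests(self):
        # One named cookiejar per search query.
        for query in ["laptops", "phones"]:
            yield scrapy.FormRequest("https://www.example.com/search",
                                     formdata={"q": query},
                                     meta={"cookiejar": query},
                                     callback=self.parse_results)

    def parse_results(self, response):
        # Explicitly reattach the same cookiejar on every follow-up request.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page,
                                  meta={"cookiejar": response.meta["cookiejar"]},
                                  callback=self.parse_results)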
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
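A minimal sketch of that spider-argument pattern (the spider name, URL and form field are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, search_query=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each process crawls exactly one search query, so Scrapy's default
        # single cookiejar per spider is all the session handling you need.
        self.search_query = search_query

    def start_requests(self):
        yield scrapy.FormRequest("https://www.example.com/search",
                                 formdata={"q": self.search_query},
                                 callback=self.parse)

    def parse(self, response):
        # ... extract the results for this one search query ...
        pass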
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but lets you randomly pick a proxy and User-Agent and retry requests after any errors
