Detect page is fully loaded using requests library - python

I want to know whether the response from requests.get(url) only arrives once the page is fully loaded. I ran tests with around 200 refreshes of my page, and once or twice the page randomly fails to load the footer.

First, a requests GET will return you the entire page, but requests is not a browser: it does not parse the content or run JavaScript. When a browser loads a page, it usually makes 10-50 requests for the various resources, runs the JavaScript, and so on.
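For what it's worth, one workaround when a server-rendered part is sometimes missing is to look for a known marker string in the returned HTML and retry. A minimal sketch, assuming "<footer" is a string that only appears when the page rendered completely (adjust the marker to whatever your page actually contains):

import time
import requests

def get_fully_loaded(url, marker="<footer", retries=5):
    # Retry until the HTML contains the marker that signals a complete page.
    for _ in range(retries):
        response = requests.get(url)
        if marker in response.text:
            return response
        time.sleep(1)  # brief pause before trying again
    raise RuntimeError("page never contained the marker")

Note this only helps if the footer is rendered server-side; if it is filled in by JavaScript, requests alone will never see it.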

Related

How should I get another redirected page URL in Python?

When we open a URL in a normal browser, it may redirect to another website's URL - for example a shortened link: after you open it, it redirects you to the main URL.
How do I do this in Python? I mean, I need to open a URL in Python, let it redirect to the other website's page, and then copy that page's link.
That's all I want to know, thank you.
I tried it with the Python requests and urllib modules, like this:
import requests
a = requests.get("url", allow_redirects=True)
And
import urllib.request
a = urllib.request.urlopen("url")
But it's not working at all; I don't get the redirected page.
I know four types of redirection:
1. The server sends a response with status 3xx and the new address:
HTTP/1.1 302 Found
Location: https://new_domain.com/some/folder
Wikipedia: HTTP 301, HTTP 302, HTTP 303
2. The server sends a Refresh header with a time in seconds and the new address:
Refresh: 0; url=https://new_domain.com/some/folder
3. The server sends HTML with a meta tag that emulates the Refresh header:
<meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">
Wikipedia: meta refresh
4. JavaScript sets a new location:
location = url
location.href = url
location.replace(url)
location.assign(url)
The same goes for document.location and window.location. There can also be combinations with open(), document.open(), window.open().
requests automatically follows redirects of the first and (probably) second type. With urllib you would probably have to check the status, get the URL, and run the next request yourself - but this is easy, and you can even run it in a loop because some pages have many redirections. You can test it on httpbin.org (even multi-redirections). A sketch of that loop follows.
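For illustration, a minimal sketch of that manual loop with urllib; NoRedirect and follow_redirects are just names made up for this example:

import urllib.error
import urllib.parse
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes redirect responses surface as HTTPError,
    # so each hop can be inspected by hand.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

def follow_redirects(url, max_hops=10):
    for _ in range(max_hops):
        try:
            return opener.open(url)  # final (non-redirect) response
        except urllib.error.HTTPError as err:
            if err.code in (301, 302, 303, 307, 308):
                # Location may be relative, so resolve it against the current URL
                url = urllib.parse.urljoin(url, err.headers["Location"])
                continue
            raise
    raise RuntimeError("too many redirections")

final = follow_redirects("https://httpbin.org/redirect/3")
print(final.url)  # e.g. https://httpbin.org/get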
For the third type it is easy to check whether the HTML has the meta tag and run the next request with the new URL. Again, you can run it in a loop because some pages have many redirections; see the sketch below.
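For illustration, a minimal sketch of that meta-refresh loop; a regex is used here for brevity, though BeautifulSoup would work just as well:

import re
import requests

# matches e.g. <meta http-equiv="refresh" content="0; url=https://...">
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*?url=([^"'>]+)""",
    re.IGNORECASE)

def follow_meta_refresh(url, max_hops=10):
    for _ in range(max_hops):
        response = requests.get(url)  # also follows normal 3xx redirects
        match = META_REFRESH.search(response.text)
        if not match:
            return response  # no more meta redirects
        url = match.group(1).strip()
    raise RuntimeError("too many redirections")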
But the fourth type is a problem, because requests can't run JavaScript, and there are many different ways to assign a new location. They can also be hidden in the code - "obfuscation".
In requests you can check response.history to see the redirections that were executed.
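For example, a quick demonstration (httpbin.org/redirect/3 answers with three 302 responses in a row):

import requests

response = requests.get("https://httpbin.org/redirect/3")

# each intermediate 302 hop
for hop in response.history:
    print(hop.status_code, hop.url)

# the final destination
print(response.status_code, response.url)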

For the BeautifulSoup specialists: How do I scrape a page with multiple panes?

Here is a link to the page that I'm trying to scrape:
https://www.simplyhired.ca/search?q=data+analyst&l=Vancouver%2C+BC&job=grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g%27
More specifically, I'm trying to scrape the 'Qualifications' element on the page.
When I print the soup object, I do not see the HTML code for the right pane.
Any thoughts on how I could access these elements?
Thanks in advance!
The DOM elements of the page you're trying to scrape are populated asynchronously using JavaScript. In other words, the information you're trying to scrape is not actually baked into the HTML at the time the server serves the page document to you, so BeautifulSoup can't see it - the document you get back is just a "bare bones" template which, normally, when viewed in a browser like it's meant to be, will be populated via JavaScript, pulling the required information from various other places. You can expect most modern, dynamic websites to be implemented in this way.
BeautifulSoup will only work for pages whose content is baked into the HTML at the time it is served to you by the server. The fact that some elements of the page take some time to load when viewed in a browser is an instant give-away - any time you see that, your first thought should be "the DOM is populated asynchronously using JavaScript; BeautifulSoup won't work for this". If it's a Single-Page Application, you can forget BeautifulSoup.
Upon visiting the page in my browser, I logged my network traffic and saw that it made multiple XHR (XmlHttpRequest) HTTP GET requests, one of which was to a REST API that serves JSON which contains all the job information you're looking for. All you need to do is imitate that HTTP GET request to that same API URL, with the same query-string parameters (the API doesn't seem to care about request headers, which is nice). No BeautifulSoup or Selenium required:
def main():
    import requests

    # This is the JSON API the page itself calls via XHR.
    url = "https://www.simplyhired.ca/api/job"
    params = {
        "key": "grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g",
        "isp": "0",
        "al": "1",
        "ia": "0",
        "tk": "1f4aknr5vs7aq800",
        "tkt": "serp",
        "from": "manual",
        "jatk": "",
        "q": "data%20analyst"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    # The "Qualifications" pane corresponds to the skillEntities field.
    print(response.json()["skillEntities"])
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
["Tableau", "SQL"]
For more information about logging your network traffic, finding the API URL, and exploring all the information available to you in the JSON response, take a look at one of my other answers where I go into more depth.

How to find redirected url using requests and BeautifulSoup

I'm building a web scraper using Python and found a page which redirects to another page after displaying some text on screen. I'm using the requests library. How can I find the URL it redirects to?
It displays a message like this (page 1):
wait we will redirect you...
r.content shows the source code of page 1. How do I wait for the last page to load? It gets redirected through 2-3 pages.
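A minimal sketch of how one might surface the target URL with requests and BeautifulSoup, assuming the interstitial page uses a plain HTTP redirect or a meta-refresh tag (the short-link URL below is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/short-link")  # placeholder URL
print(response.url)  # final URL after any HTTP (3xx) redirects

soup = BeautifulSoup(response.text, "html.parser")
meta = soup.find("meta", attrs={"http-equiv": "refresh"})
if meta:
    # content looks like "5; url=https://target.example/page"
    print(meta["content"].split("url=", 1)[-1].strip())

If the hand-off is done in JavaScript instead, requests won't see it; see the discussion of redirect types above.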

Wait until the webpage loads in Scrapy

I am using a Scrapy script to load a URL using yield:
MyUrl = "www.example.com"
request = Request(MyUrl, callback=self.mydetail)
yield request

def mydetail(self, response):
    item['Description'] = response.xpath(".//table[@class='list']//text()").extract()
    return item
The URL seems to take at least 5 seconds to load, so I want Scrapy to wait until the entire text for item['Description'] has loaded.
I tried DOWNLOAD_DELAY in settings.py, but it was no use.
Take a brief look with Firebug or another tool at the responses to the Ajax requests made by the JavaScript code. You can then make a chain of requests to catch those Ajax requests that fire after the page has loaded; a sketch follows. There are several related questions: parse ajax content, retrieve final page, parse dynamic content.
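For illustration, a minimal sketch of that chaining idea in Scrapy; the Ajax endpoint URL here is hypothetical and would have to be read from the captured network traffic:

import scrapy

class DetailSpider(scrapy.Spider):
    name = "detail"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        # The HTML shell arrives first; the description is loaded by Ajax.
        # This endpoint is made up - find the real one in the network log.
        yield scrapy.Request("https://www.example.com/ajax/description",
                             callback=self.parse_description)

    def parse_description(self, response):
        # The Ajax response already contains the full text, so no waiting
        # is needed.
        yield {"Description": response.xpath("//table[@class='list']//text()").extract()}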

How to refresh the page in Mechanize?

I'm using Mechanize in Python to submit a form and view some info. The URL goes to a standard URL for the request, without the request parameters in it - something like xyzdomain.com/request.
In the browser, it normally shows a loading icon, then displays the data. There is no change at the top of the page (header), so the full page is never reloaded, but the URL does change from /index to /request.
About a third of the time I get an httplib.IncompleteRead exception; the partial HTML of the response says "If it takes longer than 25 seconds, refresh the page."
So if I grabbed the current URL of the Mechanize browser and used open() on it, would that have the same effect as a refresh (if Mechanize had one)?
Maybe this will help:
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
For more: Meta Refresh
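As for re-opening the current URL, a minimal sketch of that retry idea, following the question's Python 2 httplib (whether re-opening behaves exactly like a browser refresh depends on the server, so treat that as an assumption):

import httplib
import mechanize

br = mechanize.Browser()
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

def open_with_retry(url, attempts=3):
    for _ in range(attempts):
        try:
            return br.open(url)  # re-opening the same URL ~ a manual refresh
        except httplib.IncompleteRead:
            continue  # response was cut short - try again
    raise RuntimeError("page kept returning incomplete responses")

open_with_retry("https://xyzdomain.com/request")  # placeholder URL

Mechanize's Browser also has a reload() method that re-fetches the current page, which may be the more direct equivalent of a refresh.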
