I'm using Mechanize in Python to submit a form and view some info. The request goes to a fixed URL, without the request parameters in it, something like: xyzdomain.com/request
In the browser, it normally shows a loading icon, then displays the data. There is no change at the top of the page (the header), so the full page is never reloaded, but the URL does change from /index to /request.
About 1/3 of the time, I get an httplib.IncompleteRead exception. I checked the partial HTML of the response, and the page says "If it takes longer than 25 seconds, refresh the page."
So if I grabbed the current URL from the Mechanize browser and called open() on it, would that have the same effect as a refresh (if Mechanize had refresh)?
This might help:
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
For more: Meta Refresh
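An untested sketch of that retry idea, re-opening the browser's current URL when the read fails, which is effectively a refresh:
import httplib                       # Python 2; on Python 3 this is http.client
import mechanize

br = mechanize.Browser()
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# ... open the page, select and fill the form as usual ...

response = br.submit()
for attempt in range(3):             # retry a few times rather than forever
    try:
        html = response.read()
        break
    except httplib.IncompleteRead:
        # re-open the browser's current url, which should act like a refresh
        response = br.open(br.geturl())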
When you open a URL in a normal browser, it can redirect to another website's URL, for example a shortened link: after you open it, it redirects you to the main URL.
How do I do this in Python? I need to open a URL in Python, let it redirect to the other website's page, and then copy that page's link.
That's all I want to know, thank you.
I tried it with Python's requests and urllib modules, like this:
import requests
a = requests.get("url", allow_redirects = True)
And
import urllib.request
a = urllib.request.urlopen("url")
But it's not working at all; I don't get the redirected page.
I know of four types of redirection:

1. The server sends a response with status 3xx and the new address:

HTTP/1.1 302 Found
Location: https://new_domain.com/some/folder

Wikipedia: HTTP 301, HTTP 302, HTTP 303

2. The server sends a Refresh header with a time in seconds and the new address:

Refresh: 0; url=https://new_domain.com/some/folder

3. The server sends HTML with a meta tag that emulates the Refresh header:

<meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">

Wikipedia: meta refresh

4. JavaScript sets a new location:

location = url
location.href = url
location.replace(url)
location.assign(url)

The same goes for document.location and window.location, and there are also combinations with open(), document.open(), and window.open().
requests automatically follows redirects of the first and (probably) the second type. With urllib you would probably have to check the status, get the new URL, and run the next request yourself, but this is easy. You can even run it in a loop, because some pages go through several redirections. You can test it on httpbin.org (even multi-step redirections); see the sketch below.
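A minimal sketch of that manual loop, here using requests with allow_redirects=False (the same idea works with urllib): check the status, read the Location header, and repeat, with a cap so a redirect loop cannot run forever.
import requests
from urllib.parse import urljoin

url = "http://httpbin.org/redirect/3"
for _ in range(10):                                  # cap the hops to avoid a redirect loop
    r = requests.get(url, allow_redirects=False)
    if r.status_code in (301, 302, 303, 307, 308) and "Location" in r.headers:
        # Location may be relative, so resolve it against the current url
        url = urljoin(url, r.headers["Location"])
        print("redirect ->", url)
    else:
        break
print("final:", url)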
For the third type it is easy to check whether the HTML has the meta tag and run the next request with the new URL. Again, you can run it in a loop, because some pages go through several redirections; a sketch follows below.
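A sketch of that check, assuming BeautifulSoup is installed; the URL here is hypothetical:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("http://example.com/page-with-meta-refresh")        # hypothetical url
soup = BeautifulSoup(r.text, "html.parser")
tag = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "refresh"})
if tag:
    content = tag.get("content", "")      # e.g. "0; url=https://new_domain.com/some/folder"
    lower = content.lower()
    if "url=" in lower:
        new_url = content[lower.index("url=") + 4:].strip().strip("'\"")
        print("meta refresh ->", urljoin(r.url, new_url))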
But the fourth type is a problem, because requests cannot run JavaScript and there are many different ways to assign a new location. They can also be hidden in the code through obfuscation.
In requests you can check response.history to see the redirections that were executed.
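For example, against the httpbin.org test mentioned above:
import requests

r = requests.get("http://httpbin.org/redirect/3")
for step in r.history:                    # the intermediate 3xx responses
    print(step.status_code, step.url)
print(r.status_code, r.url)               # the final response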
I'm building a web scraper using Python and found a page which redirects to another page after displaying some text on screen. I'm using the requests library. How can I find the URL it redirects to?
Page 1 displays a message like:
wait, we will redirect you...
r.content shows the source code of page 1. How do I wait for the last page to load? It gets redirected through 2-3 pages.
Here is my scenario.
I have a lot of links. I want to know if any of them redirect to a different site (maybe a particular one), and I only want to get those redirect URLs (I want to preserve them for further scraping).
I don't want to get the contents of the web page; I only want the link it redirects to. If there are multiple redirects, I may want the URLs up to, say, the 3rd redirect, so that I don't end up in a redirect loop.
How do I achieve this?
Can I do this in requests?
requests seems to have r.status_code, but it only works after fetching the page.
You can use requests.head(url, allow_redirects=True), which will only fetch the headers. If the response has a Location header, it will follow the redirect and HEAD the next URL.
import requests

response = requests.head('http://httpbin.org/redirect/3', allow_redirects=True)
for redirect in response.history:
    print(redirect.url)
print(response.url)
Output:
http://httpbin.org/redirect/3
http://httpbin.org/relative-redirect/2
http://httpbin.org/relative-redirect/1
http://httpbin.org/get
I want to know whether the response from requests.get(url) is only returned once the page is fully loaded. I did tests with around 200 refreshes of my page, and it happens randomly, once or twice, that the page does not load the footer.
The first requests GET will return you the entire page, but requests is not a browser: it does not parse the content or run any JavaScript.
When you load a page with a browser, it usually makes 10-50 additional requests for the page's resources, runs the JavaScript, and so on.
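If the footer is part of the static HTML rather than injected by JavaScript, one option is to retry until the raw response actually contains it; a rough sketch with a hypothetical URL and marker:
import requests

html = None
for attempt in range(3):                              # retry a few times rather than forever
    r = requests.get("http://example.com/page")       # hypothetical url
    if "<footer" in r.text:                           # hypothetical marker expected in the static html
        html = r.text
        break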
I'm filling in a form on a web page using Python's requests module. I'm submitting the form as a POST request, which works fine, and I get the expected response from the POST. However, it's a multistep form: after the first submit, the site loads another form on the same page (using AJAX), and the POST response contains this HTML page. Now, how do I use this response to fill the form on the new page? Can I combine the requests module with Twill or Mechanize in some way?
Here's the code for the POST:
import requests
from requests.auth import HTTPProxyAuth
import formfill
from twill import get_browser
from twill.commands import *
import mechanize
from mechanize import ParseResponse, urlopen, urljoin

http_proxy = "some_Proxy"
https_proxy = "some_Proxy"

proxyDict = {
    "http": http_proxy,
    "https": https_proxy
}

auth = HTTPProxyAuth("user", "pass")

r = requests.post("site_url", data={'key': 'value'}, proxies=proxyDict, auth=auth)
The response r above contains the new HTML page that resulted from submitting that form. This HTML page also has a form which I have to fill in. Can I hand this r to Twill or Mechanize in some way and use Mechanize's form-filling API? Any ideas would be helpful.
The problem here is that you need to actually interact with the JavaScript on the page. requests, while being an excellent library, has no support for JavaScript interaction; it is just an HTTP library.
If you want to interact with JavaScript-rich web pages in a meaningful way, I would suggest Selenium. Selenium drives an actual, full web browser that can navigate exactly as a person would.
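A rough sketch of that with Selenium; the field names are hypothetical, and it assumes the second form appears in the DOM after the first submit:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()                      # or webdriver.Chrome()
driver.get("site_url")                            # placeholder from the question

# fill in and submit the first form (hypothetical field name)
driver.find_element(By.NAME, "key").send_keys("value")
driver.find_element(By.NAME, "key").submit()

# wait for the AJAX-loaded second form to appear, then fill it the same way
second = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "second_field")))   # hypothetical field name
second.send_keys("another value")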
The main issue is that you'll see your speed drop precipitously: rendering a web page takes a lot longer than a raw HTML request. If that's a real deal breaker for you, you've got two options:
Go headless: there are many options here, but I personally prefer CasperJS. You should see roughly a 3x speed-up in browsing times by going headless, but every site is different.
Find a way to do everything through HTTP: most non-visual site functions have equivalent HTTP functionality. Using the network tab of the browser's developer tools, you can dig into the requests that are actually being made and then replicate them in Python.
As far as the tools you mentioned go, neither Mechanize nor Twill will help. Since your main issue here is JavaScript interaction rather than cookie management, and neither of those frameworks supports JavaScript interaction, you would run into the same issue.
UPDATE: If the POST response is actually the new page, then you're not really interacting with AJAX at all. If that's the case and you actually have the raw HTML, you can simply mimic the HTTP request that the second form would send: the same approach you used on the first form will work on the second. You can either grab the information out of the HTML response or hard-code the successive requests.
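A rough sketch of that with requests, reusing the proxy settings from the question; the second form's action URL and field names are hypothetical:
import requests
from requests.auth import HTTPProxyAuth

proxyDict = {"http": "some_Proxy", "https": "some_Proxy"}
auth = HTTPProxyAuth("user", "pass")

session = requests.Session()

# step 1: submit the first form, exactly as in the question
r1 = session.post("site_url", data={'key': 'value'}, proxies=proxyDict, auth=auth)

# step 2: submit the second form directly, copying any hidden fields
# (e.g. a token) out of r1.text first if the site requires them
r2 = session.post("site_url_step2",                        # hypothetical action url
                  data={'second_key': 'second_value'},     # hypothetical fields
                  proxies=proxyDict, auth=auth)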
Using Mechanize:

import mechanize

br = mechanize.Browser()
br.open("site_url")                # the page that contains the form

# print the name of every form on the page
for form in br.forms():
    print "Form name:", form.name
    print form

# select the 1st form on the page - nr=1 for the next one, etc.
# OR select it by name: br.select_form(form.name)
br.select_form(nr=0)

# fill in the fields
br.form['form#'] = 'Test Name'

r = br.submit()                    # can always pass in additional params