Check for page loading (Python)

In Python, is there any way that I can find out if a browser window that I've opened has loaded completely or not, maybe using a package (for instance, webbrowser)? Once it's loaded completely I want to take a screenshot of it and save it.

You can do this using e.g. Selenium; I'm not sure if it's what you want, though. See this short guide.
#!/usr/bin/env python
# Legacy Selenium RC client: this talks to a Selenium RC server on port 4444
from selenium import selenium

sel = selenium('localhost', 4444, '*firefox', 'http://www.google.com/')
sel.start()
sel.open('/')
sel.wait_for_page_to_load(10000)  # timeout in milliseconds
sel.stop()
You could also use COM hooks to control IE (ugh):
import time
import win32com.client

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Navigate(<some URL>)
# poll the Busy flag until IE reports the page has finished loading
while ie.Busy:
    time.sleep(1)
There's a module that wraps all the COM functionality: IEC.py.
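If you're on a current Selenium release, here's a minimal sketch with the modern WebDriver API (not the legacy RC client above); it waits on document.readyState and then saves the screenshot the question asks for:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://www.google.com/")
# wait (up to 10 s) until the document reports it has finished loading
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
driver.save_screenshot("page.png")  # save the screenshot once loading is done
driver.quit()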

Strictly speaking, you can't do that with Python alone. However, using JavaScript you can wait for the onload event and send an Ajax request to your Python backend.
With jQuery it should be something like this:
// callback for the onload event
jQuery(document).ready(function() {
    $.ajax({
        type: "POST",
        url: "thePythonScript.py",
        data: "name=Daniel&location=Somewhere",
        success: function(msg) {
            alert("Data Saved: " + msg);
        }
    });
});
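On the Python side, a minimal sketch of such a backend, assuming Flask (the framework, route, and response are assumptions; the original answer only says "your python backend"):
from flask import Flask, request

app = Flask(__name__)

# hypothetical route matching the url used in the jQuery example above
@app.route("/thePythonScript.py", methods=["POST"])
def page_loaded():
    # this only runs after the browser's onload fired, i.e. the page is loaded
    name = request.form.get("name")
    return "Data saved for %s" % name

if __name__ == "__main__":
    app.run()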

Related

Scrapy-Splash Waiting for Page to Load

I'm new to scrapy and splash, and I need to scrape data from single-page and regular web apps.
A caveat, though: I'm mostly scraping data from internal tools and applications, so some require authentication and all of them take at least a couple of seconds before the page fully loads.
I naively tried a Python time.sleep(seconds) and it didn't work; it seems like SplashRequest and scrapy.Request both just run and yield results regardless. I then learned about Lua scripts as arguments to these requests, and attempted a Lua script with various forms of wait(), but it looks like the requests never actually run the Lua scripts: they finish right away and my HTML selectors don't find anything I'm looking for.
I'm following the directions from https://github.com/scrapy-plugins/scrapy-splash, have their Docker instance running on localhost:8050, and have created a settings.py.
Anyone with experience here know what I might be missing?
Thanks!
spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import logging
import base64
import time
# from selenium import webdriver

# lua_script = """
# function main(splash)
#     splash:set_user_agent(splash.args.ua)
#     assert(splash:go(splash.args.url))
#     splash:wait(5)
#     -- requires Splash 2.3
#     -- while not splash:select('#user-form') do
#     --     splash:wait(5)
#     -- end
#     repeat
#         splash:wait(5)
#     until( splash:select('#user-form') ~= nil )
#     return {html=splash:html()}
# end
# """

load_page_script = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    splash:wait(5)

    function wait_for(splash, condition)
        while not condition() do
            splash:wait(0.5)
        end
    end

    local result, error = splash:wait_for_resume([[
        function main(splash) {
            setTimeout(function () {
                splash.resume();
            }, 5000);
        }
    ]])

    wait_for(splash, function()
        return splash:evaljs("document.querySelector('#user-form') != null")
    end)

    -- repeat
    --     splash:wait(5)
    -- until( splash:select('#user-form') ~= nil )

    return {html=splash:html()}
end
"""

class HelpSpider(scrapy.Spider):
    name = "help"
    allowed_domains = ["secet_internal_url.com"]
    start_urls = ['https://secet_internal_url.com']
    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def start_requests(self):
        logger = logging.getLogger()
        login_page = 'https://secet_internal_url.com/#/auth'
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
            'lua_source': load_page_script
        }
        # splash_args = {
        #     'html': 1,
        #     'png': 1,
        #     'width': 600,
        #     'render_all': 1,
        #     'lua_source': lua_script
        # }
        yield SplashRequest(login_page, self.parse, endpoint='execute',
                            magic_response=True, args=splash_args)

    def parse(self, response):
        # time.sleep(10)
        logger = logging.getLogger()
        html = response._body.decode("utf-8")
        # Looking for a form with the ID 'user-form'
        form = response.css('#user-form')
        logger.info("####################")
        logger.info(form)
        logger.info("####################")
I figured it out!
Short Answer
My Spider class was configured incorrectly for using splash with scrapy.
Long Answer
Part of running Splash with scrapy is, in my case, running a local Docker instance that my requests are loaded into so it can run their Lua scripts. An important caveat to note is that the Splash settings described on the GitHub page must be properties of the spider class itself, so I added this code to my spider:
custom_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    # if installed Docker Toolbox:
    # 'SPLASH_URL': 'http://192.168.99.100:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}
Then I noticed my Lua code running, and the Docker container logs showing the interactions. After fixing the errors with splash:select(), my login script worked, as did my waits:
splash:wait( seconds_to_wait )
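For reference, a minimal sketch of the corrected wait pattern (this assumes Splash 2.3+ for splash:select; '#user-form' is the selector from my spider above):
function main(splash)
    assert(splash:go(splash.args.url))
    -- poll until the login form element exists, waiting half a second at a time
    while not splash:select('#user-form') do
        splash:wait(0.5)
    end
    return {html = splash:html()}
end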
Lastly, I created a Lua script to handle logging in, redirecting, and gathering links and text from pages. My application is an AngularJS app, so I can't gather links or visit them except by clicking. This script let me run through every link, click it, and gather content.
I suppose an alternative solution would have been to use end-to-end testing tools such as Selenium/WebDriver or Cypress, but I prefer to use scrapy to scrape and testing tools to test. To each their own (Python or NodeJS tools), I suppose.
Neat Trick
Another thing to mention that's really helpful for debugging: when the Docker instance for Scrapy-Splash is running, you can visit its URL in your browser, where an interactive "request tester" lets you try out Lua scripts and see the rendered HTML results (for example, to verify a login or page visit). For me, this URL was http://0.0.0.0:8050; it is set in your settings and should match your Docker container.
Cheers!

Executing Javascript on a webpage with python requests

I am not sure if this is possible but let me try to explain.
I am trying to post data from a form, but before my data gets posted the website encrypts some of it with a public key, which I am able to obtain from response.text.
I found the JavaScript that is used:
var myVal = 123;
n = (myVal, ClassName.create(publicKey));
n.encrypt(myVal);
The .encrypt call returns the string that is passed to the form. My question is: can I somehow bring that JavaScript into my script so I can execute that .encrypt method and pass the result properly to the form?

If the script is simple, I would use PyExecJS:
import execjs

js_cmd = '''
function add(x, y) {
    return x + y;
}
'''
cxt = execjs.compile(js_cmd)
print(cxt.eval("add(3, 4)"))
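Applied to the question, a hedged sketch would compile the site's own crypto script and evaluate the encrypt call; site_crypto.js, ClassName, and the key/value literals are placeholders for whatever the page actually loads:
import execjs

# the JS file the page itself loads; save it locally first (hypothetical name)
with open("site_crypto.js") as f:
    ctx = execjs.compile(f.read())

# mirror the snippet from the question: create with the public key, then encrypt
encrypted = ctx.eval("ClassName.create('PUBLIC_KEY_HERE').encrypt('123')")
print(encrypted)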

Python + webapp2 + Modify the URL without reloading the page [duplicate]

Is there a way I can modify the URL of the current page without reloading the page?
I would like to access the portion before the # hash if possible.
I only need to change the portion after the domain, so it's not like I'm violating cross-domain policies.
window.location.href = "www.mysite.com/page2.php"; // this reloads
This can now be done in Chrome, Safari, Firefox 4+, and Internet Explorer 10 (since Platform Preview 4)!
See this question's answer for more information:
Updating address bar with new URL without hash or reloading the page
Example:
function processAjaxData(response, urlPath){
    document.getElementById("content").innerHTML = response.html;
    document.title = response.pageTitle;
    window.history.pushState({"html": response.html, "pageTitle": response.pageTitle}, "", urlPath);
}
You can then use window.onpopstate to detect the back/forward button navigation:
window.onpopstate = function(e){
    if(e.state){
        document.getElementById("content").innerHTML = e.state.html;
        document.title = e.state.pageTitle;
    }
};
For a more in-depth look at manipulating browser history, see this MDN article.
HTML5 introduced the history.pushState() and history.replaceState() methods, which allow you to add and modify history entries, respectively.
window.history.pushState('page2', 'Title', '/page2.php');
Read more about this from here
You can also use HTML5 replaceState if you want to change the url but don't want to add the entry to the browser history:
if (window.history.replaceState) {
    // prevents the browser from storing history with each change:
    window.history.replaceState(statedata, title, url);
}
This would 'break' the back button functionality. This may be required in some instances such as an image gallery (where you want the back button to return back to the gallery index page instead of moving back through each and every image you viewed) whilst giving each image its own unique url.
Here is my solution (newUrl is your new URL which you want to replace with the current one):
history.pushState({}, null, newUrl);
NOTE: If you are working with an HTML5 browser then you should ignore this answer. This is now possible as can be seen in the other answers.
There is no way to modify the URL in the browser without reloading the page. The URL represents what the last loaded page was. If you change it (document.location) then it will reload the page.
One obvious reason being, you write a site on www.mysite.com that looks like a bank login page. Then you change the browser URL bar to say www.mybank.com. The user will be totally unaware that they are really looking at www.mysite.com.
The only thing you can change without a reload is the hash:
parent.location.hash = "hello";
In modern browsers and HTML5, there is a method called pushState on window history. That will change the URL and push it to the history without loading the page.
You can use it like this; it takes three parameters: a state object, a title, and a URL:
window.history.pushState({page: "another"}, "another page", "example.html");
This will change the URL, but not reload the page. Also, it doesn't check if the page exists, so if you do some JavaScript code that is reacting to the URL, you can work with them like this.
Also, there is history.replaceState() which does exactly the same thing, except it will modify the current history instead of creating a new one!
Also you can create a function to check if history.pushState exist, then carry on with the rest like this:
function goTo(page, title, url) {
    if ("undefined" !== typeof history.pushState) {
        history.pushState({page: page}, title, url);
    } else {
        window.location.assign(url);
    }
}

goTo("another page", "example", 'example.html');
Also, you can change the hash in pre-HTML5 browsers, which won't reload the page; that's the way Angular does SPA routing with hashtags.
Changing the hash is quite easy:
window.location.hash = "example";
And you can detect it like this:
window.onhashchange = function () {
    console.log("#changed", window.location.hash);
};
The HTML5 replaceState is the answer, as already mentioned by Vivart and geo1701. However it is not supported in all browsers/versions.
History.js wraps HTML5 state features and provides additional support for HTML4 browsers.
Before HTML5 we can use:
parent.location.hash = "hello";
and:
window.location.replace("http://www.example.com");
This method will reload your page, but HTML5 introduced the history.pushState(page, caption, replace_url) that should not reload your page.
If what you're trying to do is allow users to bookmark/share pages, you don't need the URL to be exactly right, and you're not using hash anchors for anything else, then you can do this in two parts: use location.hash as discussed above, then implement a check on the home page that looks for a URL with a hash anchor in it and redirects to the corresponding result.
For instance:
User is on www.site.com/section/page/4
User does some action which changes the URL to www.site.com/#/section/page/6 (with the hash). Say you've loaded the correct content for page 6 into the page, so apart from the hash the user is not too disturbed.
User passes this URL on to someone else, or bookmarks it
Someone else, or the same user at a later date, goes to www.site.com/#/section/page/6
Code on www.site.com/ redirects the user to www.site.com/section/page/6, using something like this:
if (window.location.hash.length > 0){
    window.location = window.location.hash.substring(1);
}
Hope that makes sense! It's a useful approach for some situations.
Below is a function to change the URL without reloading the page. It is only supported in HTML5 browsers.
function ChangeUrl(page, url) {
    if (typeof (history.pushState) != "undefined") {
        var obj = {Page: page, Url: url};
        history.pushState(obj, obj.Page, obj.Url);
    } else {
        window.location.href = "homePage";
        // alert("Browser does not support HTML5.");
    }
}

ChangeUrl('Page1', 'homePage');
You can use this beautiful and simple function to do so anywhere on your application.
function changeurl(url, title) {
    var new_url = '/' + url;
    window.history.pushState('data', title, new_url);
}
You can not only edit the URL but you can update the title along with it.
Any change of the location (either window.location or document.location) will cause a request to that new URL, if you're not just changing the URL fragment. If you change the URL, you change the URL.
Use server-side URL rewrite techniques like Apache’s mod_rewrite if you don’t like the URLs you are currently using.
You can add anchor tags. I use this on my site so that I can track with Google Analytics what people are visiting on the page.
I just add an anchor tag and then the part of the page I want to track:
var trackCode = "/#" + encodeURIComponent($("#myDiv").text());
window.location.href = "http://www.piano-chords.net" + trackCode;
pageTracker._trackPageview(trackCode);
As pointed out by Thomas Stjernegaard Jeppesen, you could use History.js to modify URL parameters whilst the user navigates through your Ajax links and apps.
Almost a year has passed since that answer, and History.js has grown and become more stable and cross-browser. Now it can be used to manage history states in HTML5-compliant as well as in many HTML4-only browsers. In this demo you can see an example of how it works (as well as being able to try its functionality and limits).
Should you need any help in how to use and implement this library, I suggest you take a look at the source code of the demo page: you will see it's very easy to do.
Finally, for a comprehensive explanation of what can be the issues about using hashes (and hashbangs), check out this link by Benjamin Lupton.
Use history.pushState() from the HTML5 History API. In a sense, calling pushState() is similar to setting window.location = "#foo", in that both will also create and activate another history entry associated with the current document, but pushState() has a few advantages.
Set your new URL and push it:
let newUrlIS = window.location.origin + '/user/profile/management';
history.pushState({}, null, newUrlIS);
For full details, check out the reference: https://developer.mozilla.org/en-US/docs/Web/API/History_API
This code works for me; I use it in my application with Ajax.
history.pushState({ foo: 'bar' }, '', '/bank');
Once a page is loaded into an element via Ajax, this changes the browser URL automatically without reloading the page.
Here is the Ajax function:
function showData(){
    $.ajax({
        type: "POST",
        url: "Bank.php",
        data: {},
        success: function(html){
            $("#viewpage").html(html).show();
            $("#viewpage").css("margin-left", "0px");
        }
    });
}
Example: from any page or controller (like "Dashboard"), when I click on "Bank", it loads the bank list using the Ajax code without reloading the page; at this point, the browser URL has not changed.
history.pushState({ foo: 'bar' }, '', '/bank');
But when I put this line inside the Ajax success callback, it changes the browser URL without reloading the page.
Here is the full Ajax code:
function showData(){
    $.ajax({
        type: "POST",
        url: "Bank.php",
        data: {},
        success: function(html){
            $("#viewpage").html(html).show();
            $("#viewpage").css("margin-left", "0px");
            history.pushState({ foo: 'bar' }, '', '/bank');
        }
    });
}
This is all you need to navigate without reloading:
// add #setting to the URL without reloading
location.hash = "setting";

// if the URL hash changes, do something
window.addEventListener('hashchange', () => {
    console.log('url hash changed!');
});

// if the URL changes, do something (does not detect hash changes)
// window.addEventListener('locationchange', function(){
//     console.log('url changed!');
// });

// remove #setting without reloading
history.back();
Simply use this; it will not reload the page, but just change the URL:
$('#form_name').attr('action', '/shop/index.htm').submit();

Can Splash/PhantomJS (for JavaScript rendering) work with Wget for downloading a webpage?

With wget, in many cases the site just returns "Turn on your JavaScript to continue".
I've found some articles saying Python's scrapy with Splash/PhantomJS can do the rendering, but I'm not familiar with programming, even with Python, so a solution that can integrate with wget would be perfect. Thanks.
You can't do that with wget only. But you can with a little PhantomJS script:
$ phantomjs dl_page.js http://stackoverflow.com/questions > stackoverflow.html
dl_page.js:
const system = require('system');
const page = require('webpage').create();

page.open(system.args[1], function() {
    console.log(page.content);
    phantom.exit();
});
You can use Splash's HTTP API.
To get the rendered HTML, use the /render.html endpoint, passing the URL as argument, and optionally with some wait parameter:
wget -qO- 'http://localhost:8050/render.html?url=http://www.example.com/&timeout=10&wait=0.5'
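Splash also exposes a /render.png endpoint, so if what you want is a screenshot of the rendered page rather than its HTML, the same wget approach should work:
wget -qO page.png 'http://localhost:8050/render.png?url=http://www.example.com/&timeout=10&wait=0.5'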

pycurl script can't login to website

I'm currently trying to get a grasp on pycurl. I'm attempting to login to a website. After logging into the site it should redirect to the main page. However when trying this script it just gets returned to the login page. What might I be doing wrong?
import pycurl
import urllib
import StringIO
pf = {'username' : 'user', 'password' : 'pass' }
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()
p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.URL, 'http://localhost')
p.perform()
pageContents.seek(0)
print pageContents.readlines()
EDIT: As pointed out by Peter, the URL should point to a login URL, but the site I'm trying to get this to work with doesn't show me what that URL would be. The form's action just points to the home page (/index.html).
As you're troubleshooting this problem, I suggest getting a browser plugin like FireBug or LiveHTTPHeaders (I suggest Firefox plugins, but there are similar plugins for other browsers as well). Then you can exercise a request to the site and see what action (URL), method, and form parameters are being passed to the target server. This will likely help elucidate the crux of the problem.
If that's no help, you may consider using a different tool for your mechanization. I've used ClientForm and BeautifulSoup to perform similar operations. Based on what I've read in the pycURL docs and your code above, ClientForm might be a better tool to use. ClientForm will parse your HTML page, locate the forms on it (including login forms), and construct the appropriate request for you based on the answers you supply to the form. You could even use ClientForm with pycURL... but at least ClientForm will provide you with the appropriate action to which to POST, and construct all of the appropriate parameters.
Be aware, though, that if there is JavaScript handling any necessary part of the login form, even ClientForm can't help you there. You will need something that interprets the JavaScript to effectively automate the login. In that case, I've used SeleniumRC to control a browser (and I let the browser handle the JavaScript).
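As a hedged modern alternative (not mentioned in the original answers): the requests library keeps cookies across a login POST and a follow-up GET via a Session, which removes most of the cookie plumbing. The URL and field names below are placeholders; use the ones you discover with FireBug/LiveHTTPHeaders:
import requests

session = requests.Session()  # carries cookies between requests
# POST to the form's real action URL with the real field names
session.post("http://localhost/index.html",
             data={"username": "user", "password": "pass"})
main_page = session.get("http://localhost/")  # should now be logged in
print(main_page.status_code)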
One golden rule: you need to "break the ice" by enabling debugging when trying to troubleshoot a pycurl example.
Note: don't forget to call p.close() after p.perform().
def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())

p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
Now you can see what your code is actually doing on the wire, because debugging is enabled:
import pycurl
import urllib
import StringIO

def test(debug_type, debug_msg):
    if len(debug_msg) < 300:
        print "debug(%d): %s" % (debug_type, debug_msg.strip())

pf = {'username': 'user', 'password': 'pass'}
fields = urllib.urlencode(pf)
pageContents = StringIO.StringIO()

p = pycurl.Curl()
p.setopt(pycurl.FOLLOWLOCATION, 1)
p.setopt(pycurl.COOKIEFILE, './cookie_test.txt')
p.setopt(pycurl.COOKIEJAR, './cookie_test.txt')
p.setopt(pycurl.POST, 1)
p.setopt(pycurl.POSTFIELDS, fields)
p.setopt(pycurl.WRITEFUNCTION, pageContents.write)
p.setopt(pycurl.VERBOSE, True)
p.setopt(pycurl.DEBUGFUNCTION, test)
p.setopt(pycurl.URL, 'http://localhost')

p.perform()
p.close()  # This is mandatory.

pageContents.seek(0)
print pageContents.readlines()
