How to *efficiently* monitor a webpage modified with javascript?


I'm trying to monitor an element on a website that's generated via JavaScript. The problem of downloading a JavaScript-modified page has been handled before; I borrowed the code below, which solves the problem with PyQt.
But when I set this code to run every 20 seconds, my network traffic averages 70 KB/s down and 5 KB/s up. The actual page saved is only 80 KB, but it is JavaScript-heavy.
6 GB a day is not reasonable; my ISP has data limits and I already toe the line.
Is there a way to modify this code so that, for example, it only executes the JavaScript that corresponds to a specific element on the page? If so, how would I go about figuring out what I need to execute? And would that have a significant effect on the network traffic I'm seeing?
Alternatively, how SHOULD I be doing this? I considered making a Chrome extension, as Chrome would already be handling the JavaScript for me, but then I would have to figure out how to integrate it with the rest of my project, and that's brand new territory for me. If there's a better way, I'd rather do that.
# Borrowed from http://stackoverflow.com/questions/19161737/cannot-add-custom-request-headers-in-pyqt4
# which is borrowed from http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
import sys, signal
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkRequest, QNetworkReply
cookie = ''  # snipped, the cookie I have to send is about as long as this bit of code...
class MyNetworkAccessManager(QNetworkAccessManager):
    def __init__(self, url):
        QNetworkAccessManager.__init__(self)
        request = QNetworkRequest(QUrl(url))
        self.reply = self.get(request)
    def createRequest(self, operation, request, data):
        request.setRawHeader('User-Agent', 'Mozilla/5.0')
        request.setRawHeader("Cookie", cookie)
        return QNetworkAccessManager.createRequest(self, operation, request, data)
class Crawler(QWebPage):
    def __init__(self, url, file):
        QWebPage.__init__(self)
        self._url = url
        self._file = file
        self.manager = MyNetworkAccessManager(url)
        self.setNetworkAccessManager(self.manager)
    def crawl(self):
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.loadFinished.connect(self._finished_loading)
        self.mainFrame().load(QUrl(self._url))
    def _finished_loading(self, result):
        file = open(self._file, 'w')
        file.write(self.mainFrame().toHtml())
        file.close()
        exit(0)
def main(url, file):
    app = QApplication([url, file])
    crawler = Crawler(url, file)
    crawler.crawl()
    sys.exit(app.exec_())

First, be clear about your motivation here. Something is changing every 20 seconds and you want to tell the server when one field changes. So:
1) Do you need to send the whole page, or just the contents of one field? I'm not clear what you are currently doing, but sending 80 KB every 20 seconds seems like overkill.
2) Does the server need to know immediately? What are the consequences of sending the state every minute rather than every 20 seconds? You would miss some changes, but does that matter?
You haven't really told us what you're doing or why, so we can't comment on that.
My first thought is that if the server just wants to know about one field, then make an ajax call with just the payload you care about. To me, it would be more efficient to send a summary every few minutes: one composite record is much cheaper than several small records.
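As a rough illustration of the "composite record" idea, here is a minimal sketch that keeps only the values that actually changed and hands them out as one batch. The class name `ChangeBatcher` and its methods are hypothetical and not from the question; how the field value is obtained, and where the batch is sent, are deliberately left out.

```python
class ChangeBatcher:
    """Collects only the values that changed and hands them out as one batch."""

    def __init__(self):
        self._last = None      # last value seen (None before the first one)
        self._pending = []     # changes recorded since the last flush

    def observe(self, timestamp, value):
        """Record (timestamp, value) if value differs from the last one seen."""
        if value != self._last:
            self._pending.append((timestamp, value))
            self._last = value

    def flush(self):
        """Return the accumulated changes as one composite record and reset."""
        batch, self._pending = self._pending, []
        return batch


# Poll every 20 "seconds", but report only once with the changes.
batcher = ChangeBatcher()
for t, v in [(0, "open"), (20, "open"), (40, "closed"), (60, "closed")]:
    batcher.observe(t, v)
summary = batcher.flush()
print(summary)  # [(0, 'open'), (40, 'closed')]
```

Four polls produce a single record with two entries, and a second `flush()` with no new changes returns an empty list, so nothing needs to be sent.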

Related

Dynamically add Items to a QListWidget

I have this function
def search(self):
    # creating a list, just ignore
    self.listWidget.clear()
    self.gogo.keywordSetter(self.lineEdit.text())
    self.gogo.linkSetter()
    self.gogo.resultsGetter()
    # iterating through the generated list
    for x in self.gogo.resultsContainer:
        self.listWidget.update()
        # print(x[2])
        # x[2] is the url to an image
        url = x[2]
        print(url)
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req).read()
        pixmap = QPixmap()
        pixmap.loadFromData(webpage)
        icon = QIcon(pixmap)
        # x[1] is a string I want to display, it's a title basically
        item = QListWidgetItem(icon, x[1])
        size = QSize()
        size.setHeight(100)
        size.setWidth(400)
        item.setSizeHint(size)
        # item.iconSize(QSize(100, 400))
        self.listWidget.addItem(item)
It works; my problem is that it displays everything only after it has iterated through every item.
What I mean is that I can see from the print statement that it IS going through the list and creating the items, but no items are displayed.
They all get displayed at once after it has completely iterated through the list.
It's very slow. I know part of it is due to the image download and I can do nothing about it. But adding the items dynamically would at least make it a little more bearable.
I tried to use update() and it didn't really work.
Another weird behaviour: despite clear() being the first instruction, it doesn't clear the listWidget as soon as the function is called. It looks like it's due to the same thing that leads to everything being displayed at once.

UI systems use an "event loop", which is responsible for "drawing" elements and allowing user interaction; that loop must be left free to do its job, and functions that operate within it must return as soon as possible to prevent "freezing" the UI: when that happens, the window is not refreshed or properly displayed, and it seems unresponsive to keyboard or mouse events.
Your function does exactly that: it blocks everything until it's finished.
Calling update() won't be enough, as it only schedules a repaint, which will happen only once the main loop is able to process events. That's why long or possibly infinite for/while loops should always be avoided, as should any blocking function like time.sleep, even for short amounts of time.
"I can do nothing about it."
Actually, not only can you, but you have to.
One possibility is to use threading, specifically QThread, which is an interface that allows executing calls in a separate thread while providing Qt's signal/slot mechanism, which can work asynchronously between threads. Using QThread and signals to communicate with the main thread is extremely important, as UI elements are not thread-safe and must never be accessed (or created) from external threads.
class Downloader(QThread):
    imageDownloaded = pyqtSignal(int, object)
    def __init__(self, parent, urlList):
        super().__init__(parent)
        self.urlList = urlList
    def run(self):
        for i, url in enumerate(self.urlList):
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            webpage = urlopen(req).read()
            self.imageDownloaded.emit(i, webpage)
class YourWidget(QWidget):
    # ...
    def search(self):
        self.listWidget.clear()
        self.gogo.keywordSetter(self.lineEdit.text())
        self.gogo.linkSetter()
        self.gogo.resultsGetter()
        # iterating through the generated list
        urlList = []
        for x in self.gogo.resultsContainer:
            url = x[2]
            urlList.append(url)
            item = QListWidgetItem(x[1])
            item.setSizeHint(QSize(400, 100))
            self.listWidget.addItem(item)
        # the parent argument is mandatory, otherwise there won't be any
        # persistent reference to the downloader, and it will be deleted
        # as soon as this function returns; connecting the finished signal
        # to deleteLater will delete the object once it has completed.
        downloadThread = Downloader(self, urlList)
        downloadThread.imageDownloaded.connect(self.updateImage)
        downloadThread.finished.connect(downloadThread.deleteLater)
        downloadThread.start()
    def updateImage(self, index, data):
        pixmap = QPixmap()
        if not pixmap.loadFromData(data) or index >= self.listWidget.count():
            return
        self.listWidget.item(index).setIcon(QIcon(pixmap))
Note: the code above is untested, as it's not clear what modules you're actually using for downloading, nor what self.gogo is.
A slightly different alternative to the above is to use a persistent "download manager" thread and queue requests using a Python Queue, which will be read in the run() implementation.
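The queue-based variant might look roughly like the sketch below. To keep it self-contained it uses plain `threading.Thread` rather than QThread, and the names `DownloadManager`, `fetch` and `on_result` are assumptions made for illustration; in a real Qt application the worker would be a QThread and would emit a signal instead of calling a callback, since widgets must not be touched from the worker thread.

```python
import queue
import threading


class DownloadManager(threading.Thread):
    """Persistent worker that processes queued requests one at a time."""

    def __init__(self, fetch, on_result):
        super().__init__(daemon=True)
        self._fetch = fetch          # hypothetical callable: url -> bytes
        self._on_result = on_result  # hypothetical callable: (index, data) -> None
        self._requests = queue.Queue()

    def request(self, index, url):
        """Queue a download; returns immediately."""
        self._requests.put((index, url))

    def stop(self):
        """Ask the worker to exit once the already queued requests are done."""
        self._requests.put(None)

    def run(self):
        while True:
            job = self._requests.get()  # blocks until something is queued
            if job is None:
                break
            index, url = job
            self._on_result(index, self._fetch(url))


# A fake fetch function stands in for the real network call.
results = []
manager = DownloadManager(fetch=lambda url: url.upper().encode(),
                          on_result=lambda i, data: results.append((i, data)))
manager.start()
manager.request(0, "a.png")
manager.request(1, "b.png")
manager.stop()
manager.join()
print(results)  # [(0, b'A.PNG'), (1, b'B.PNG')]
```

Because a single worker reads the queue in order, results arrive in the order they were requested; the index is still passed along so the receiver does not have to rely on that.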
Consider that Qt provides the QtNetwork module, which already works asynchronously and implements the same concept.
You have to create a QNetworkAccessManager instance (one is usually enough for the whole application), then create a QNetworkRequest for each url, and finally call the manager's get() with that request, which returns a QNetworkReply that will later be used to retrieve the downloaded data.
In the following example, I'm also setting a custom property on the reply, so that when the reply is received we know which index it corresponds to: this is even more important than in the code above, as QNetworkAccessManager can download in parallel, and downloads can complete in a different order than they were requested (for instance, if the images have different sizes, or are being downloaded from different servers).
Note that the index must be set as a Qt property, and cannot (or, better, should not) be set as a Python attribute, like reply.index = i. This is because the reply we're using in Python is just a wrapper around the actual Qt object, and unless we keep a persistent reference to that wrapper (for instance, by adding it to a list), that attribute would be lost.
from PyQt5.QtNetwork import *
class YourWidget(QWidget):
    def __init__(self):
        # ...
        self.downloadManager = QNetworkAccessManager()
        self.downloadManager.finished.connect(self.updateImage)
    # ...
    def search(self):
        self.listWidget.clear()
        self.gogo.keywordSetter(self.lineEdit.text())
        self.gogo.linkSetter()
        self.gogo.resultsGetter()
        for i, x in enumerate(self.gogo.resultsContainer):
            item = QListWidgetItem(x[1])
            item.setSizeHint(QSize(400, 100))
            self.listWidget.addItem(item)
            url = QUrl(x[2])
            request = QNetworkRequest(url)
            request.setHeader(request.UserAgentHeader, 'Mozilla/5.0')
            reply = self.downloadManager.get(request)
            reply.setProperty('index', i)
    def updateImage(self, reply):
        index = reply.property('index')
        if isinstance(index, int):
            pixmap = QPixmap()
            if pixmap.loadFromData(reply.readAll()):
                item = self.listWidget.item(index)
                if item is not None:
                    item.setIcon(QIcon(pixmap))
        reply.deleteLater()

Navigate to new page hosted on bokeh server from within bokeh app

So I'm writing an application running on the bokeh server, and having difficulty with navigating between different pages being hosted. Let's say I have this simple class that loads a single button:
class navigateWithButton(HBox):
    extra_generated_classes = [["navigateWithButton", "navigateWithButton", "HBox"]]
    myButton = Instance(Button)
    inputs = Instance(VBoxForm)
    @classmethod
    def create(cls):
        obj = cls()
        obj.myButton = Button(
            label="Go"
        )
        obj.inputs = VBoxForm(
            children=[
                obj.myButton
            ]
        )
        obj.children.append(obj.inputs)
        return obj
    def setup_events(self):
        super(navigateWithButton, self).setup_events()
        if not self.myButton:
            return
        self.myButton.on_click(self.navigate)
    def navigate(self, *args):
        ###################################################################
        ### want to redirect to 'http://localhost:5006/other_app' here! ###
        ###################################################################
        pass
And further down I have, as would be expected:
@bokeh_app.route("/navigate/")
@object_page("navigate")
def navigate_button_test():
    nav = navigateWithButton.create()
    return nav
Along with a route to an additional app I've created from within the same script:
@bokeh_app.route("/other_app/")
@object_page("other_app")
def some_other_app():
    app = otherApp.create()
    return app
Running this code, I can (obviously) easily navigate between the two applications just by typing in the address, and they both work beautifully, but I cannot for the life of me find an example of programmatically navigating between the two pages. I'm certain the answer is simple and I must be overlooking something very obvious, but if someone could tell me precisely where I am being ridiculous, or whether I'm barking waaay up the wrong tree, I'd be extremely appreciative!!
And please bear in mind: I'm certain there are better ways of doing this, but I'm tasked with finishing inherited code and I'd like to try to find a solution before having to rebuild from scratch.

Set session not to expire automatically

I am working on storing data in cache memory using CherryPy. I am using the code below to put data into the cache:
import cherrypy
import datetime
import sys
from cherrypy.lib.caching import MemoryCache
cache = MemoryCache()
def putDataIntoCache(self, *args, **kwargs):
    data = cache.get()
    if not data:
        obj = kwargs
        size = sys.getsizeof(obj)
        cache.put(obj, size)
        data = obj
    return 'obj: %s, id: %s' % (cache.get(), id(cache.get()))
But the problem is that the cached data is cleared automatically after 10 minutes.
I found that delay = 600 is set in cache.py; since the delay is given in seconds, this is why the data is cleared after 10 minutes.
I just want the cached data to be cleared when the CherryPy server is restarted.
How can I solve this issue?

I think that you shouldn't be using the cache directly, especially when you don't really want the benefit of finer control over cache invalidation.
You can do something like this:
import cherrypy as cp
from cherrypy.process.plugins import SimplePlugin
class CachePlugin(SimplePlugin):
    def start(self):
        self.bus.log('Initializing cache')
        cp.cache = {}
    def stop(self):
        self.bus.log('Clearing cache')
        cp.cache = {}
class Root:
    @cp.expose
    def default(self, user=None):
        if user is not None:
            cp.cache['user'] = user
        return cp.cache.get('user', 'No user')
CachePlugin(cp.engine).subscribe()
cp.quickstart(Root())
Try it with: /, /?user=test, / and /?user=new_user
You could keep the cache on the plugin and publish channels to modify it, but in general the idea is that you don't really need the cache tool; you just need to listen for when the server starts and stops (I'm assuming you are using the autoreloader).
This is simply a dictionary monkey-patched onto the cherrypy module. It is quick and simple. It's pretty late here in Mexico... I hope this helps.
More info about the plugins: http://cherrypy.readthedocs.org/en/latest/extend.html#server-wide-functions

Load a web page

I am trying to load a web page using PySide's QtWebKit module. According to the documentation (Elements of QWebView; QWebFrame::toHtml()), the following script should print the HTML of the Google Search Page:
from PySide import QtCore
from PySide import QtGui
from PySide import QtWebKit
# Needed if we want to display the webpage in a widget.
app = QtGui.QApplication([])
view = QtWebKit.QWebView(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
But alas it does not. All that is printed is the method's equivalent of a null response:
<html><head></head><body></body></html>
So I took a closer look at the setUrl documentation:
The view remains the same until enough data has arrived to display the new url.
This made me think that maybe I was calling the toHtml() method too soon, before a response has been received from the server. So I wrote a class that overrides the setUrl method, blocking until the loadFinished signal is triggered:
import time
class View(QtWebKit.QWebView):
    def __init__(self, *args, **kwargs):
        super(View, self).__init__(*args, **kwargs)
        self.completed = True
        self.loadFinished.connect(self.setCompleted)
    def setCompleted(self):
        self.completed = True
    def setUrl(self, url):
        self.completed = False
        super(View, self).setUrl(url)
        while not self.completed:
            time.sleep(0.2)
view = View(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
That made no difference at all. What am I missing here?
EDIT: Merely getting the HTML of a page is not my end game here. This is a simplified example of code that was not working the way I expected it to. Credit to Oleh for suggesting replacing time.sleep() with app.processEvents()

Copied from my other answer:
from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings
qapp = QApplication([])
def load_source(url):
    page = QWebPage()
    page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
    page.mainFrame().setUrl(QUrl(url))
    class State(QObject):
        src = None
        finished = False
        @Slot()
        def loaded(self, success=True):
            self.finished = True
            if self.src is None:
                self.src = page.mainFrame().toHtml()
    state = State()
    # Optional; reacts to DOM ready, which happens before a full load
    def js():
        page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
        page.mainFrame().evaluateJavaScript('''
            document.addEventListener('DOMContentLoaded', qstate$.loaded);
        ''')
    page.mainFrame().javaScriptWindowObjectCleared.connect(js)
    page.mainFrame().loadFinished.connect(state.loaded)
    while not state.finished:
        qapp.processEvents()
    return state.src
load_source downloads the data from a URL and returns the HTML after modification by WebKit. It wraps Qt's event loop with its asynchronous events, and is a blocking function.
But you really should think about what you're doing. Do you actually need to invoke the engine and get the modified HTML? If you just want to download the HTML of some webpage, there are much, much simpler ways to do this.
Now, the problem with the code in your answer is that you don't let Qt do anything. There is no magic happening, no code running in the background. Qt is based on an event loop, and you never let it enter that loop. This is usually achieved by calling QApplication.exec_, or with the processEvents workaround shown in my code. You can replace time.sleep(0.2) with app.processEvents() and it might just work.

HTML page vastly different when using a headless webkit implementation using PyQT

I was under the impression that using a headless WebKit browser via PyQt would automatically get me the HTML for each URL, even with heavy JS code in it. But I am only seeing it partially. I am comparing it with the page I get when I save the page from the Firefox window.
I am using the following code:
class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''
        if wait:
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            self.loadFinished.connect(app.quit)
        self.mainFrame().load(QUrl(url))
    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()
    def userAgentForUrl(self, url):
        return USER_AGENT
def get_page(url, wait=None):
    # here is the trick how to call it several times
    app = QApplication.instance()  # checks if QApplication already exists
    if not app:  # create QApplication if it doesn't exist
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html
Can someone see anything obviously wrong with the code?
After running the code through a few URLs, here is one I found that shows the problems I am running into quite clearly - http://www.chilis.com/EN/Pages/menu.aspx
Thanks for any pointers.

The page has ajax code; when it finishes loading, it still needs some time to update the page via ajax. But your code quits as soon as loading finishes.
You should add some code like this to wait for a while and process events in WebKit:
for i in range(200):  # wait about 2 seconds
    app.processEvents()
    time.sleep(0.01)
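A fixed two-second wait is either too long or too short depending on the page. A possible refinement, sketched here with only the standard library, is to keep pumping events until a condition holds or a deadline passes. The helper name `wait_until` and its parameters are hypothetical; in the PyQt code above, `pump` would presumably be `app.processEvents` and the predicate could check whether the expected element has appeared in the frame's HTML.

```python
import time


def wait_until(predicate, pump, timeout=2.0, step=0.01):
    """Call pump() repeatedly until predicate() is true or timeout elapses.

    Returns True if the predicate became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        pump()            # let the event loop do some work
        time.sleep(step)  # avoid spinning at 100% CPU
    return True


# Stand-in for an event loop: each pump "processes" one pending event.
pending = [1, 2, 3]
finished = wait_until(lambda: not pending, pending.pop, timeout=1.0)
print(finished)  # True
```

The return value tells the caller whether the content actually arrived, which the fixed loop cannot report.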
