HTML page vastly different when using a headless webkit implementation using PyQT - python

I was under the impression that using a headless browser implementation of webkit using PyQT will automatically get me the html code for each URL even with heavy JS code in it. But I am only seeing it partially. I am comparing with the page I get when I save the page from the firefox window.
I am using the following code -
class JabbaWebkit(QWebPage):
# 'html' is a class variable
def __init__(self, url, wait, app, parent=None):
super(JabbaWebkit, self).__init__(parent)
JabbaWebkit.html = ''
if wait:
QTimer.singleShot(wait * SEC, app.quit)
else:
self.loadFinished.connect(app.quit)
self.mainFrame().load(QUrl(url))
def save(self):
JabbaWebkit.html = self.mainFrame().toHtml()
def userAgentForUrl(self, url):
return USER_AGENT
def get_page(url, wait=None):
# here is the trick how to call it several times
app = QApplication.instance() # checks if QApplication already exists
if not app: # create QApplication if it doesnt exist
app = QApplication(sys.argv)
#
form = JabbaWebkit(url, wait, app)
app.aboutToQuit.connect(form.save)
app.exec_()
return JabbaWebkit.html
Can some one see anything obviously wrong with the code?
After running the code through a few URLs, here is one I found that shows the problems I am running into quite clearly - http://www.chilis.com/EN/Pages/menu.aspx
Thanks for any pointers.

The page have ajax code, when it finish load, it still need some time to update the page with ajax. But you code will quit when it finish load.
You should add some code like this to wait some time and process events in webkit:
for i in range(200): #wait 2 seconds
app.processEvents()
time.sleep(0.01)

Related

How to use one class to scrape two websites

I'm trying to render websites in PyQt that are written in java. The first site is rendered without problems and scraped for the information I need, but when I want to use the same class to render another site and retrieve the new data it tells me the frame that's defined in the Render class is not defined (which was defined for the first website, which worked perfectly fine in retrieving the data that I needed).
So, why is this happening? Am I missing something fundamental in Python? My understanding is that when the first site has been rendered, then the object will be garbage collected and the second one can be rendered. Below is the referred code:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']
for url in urls:
r = Render(url)
result = r.frame.toHtml()
#This step is important.Converting QString to Ascii for lxml to process
#QString should be converted to string before processed by lxml
formatted_result = str(result)
#Next build lxml tree from formatted_result
tree = html.fromstring(formatted_result)
#Now using correct Xpath we are fetching URL of archives
archive_links = tree.xpath('//div[#class="campaign"]/a/#href')[1:5]
print (archive_links)
The error message I'm getting:
File "javaweb2.py", line 24, in <module>
result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'
Any help would be much appreciated!
That's because the self.frame is only defined when self._loadFinished() is called, which only occurs when the QWebPage instance emits a signal. So barring several dubious practices I see in the code you posted, the following would solve the issue (not the line with **** is important):
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
self.frame = None # *****
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']
for url in urls:
r = Render(url)
# wait till frame arrives:
while r.frame is None:
# pass # option 1: works, but will cause 100% cpu
time.sleep(0.1) # option 2: much better
result = r.frame.toHtml()
...
So the "pass" would work but will consume 100% cpu as the loop is executed a million times per second. Using the timer only checks every 1/10th second and will be very low cpu consumption.
The best of all solutions of course is to put the logic that depends on the frame being available (i.e. code that is currently in the URL loop below r=Render(url)) in a function that will get called when the loadFinished signal is emitted. Since you can't control the order of signals, the best option is to move that code into the _loadfinished() method.

Load a web page

I am trying to load a web page using PySide's QtWebKit module. According to the documentation (Elements of QWebView; QWebFrame::toHtml()), the following script should print the HTML of the Google Search Page:
from PySide import QtCore
from PySide import QtGui
from PySide import QtWebKit
# Needed if we want to display the webpage in a widget.
app = QtGui.QApplication([])
view = QtWebKit.QWebView(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
But alas it does not. All that is printed is the method's equivalent of a null response:
<html><head></head><body></body></html>
So I took a closer look at the setUrl documentation:
The view remains the same until enough data has arrived to display the new url.
This made me think that maybe I was calling the toHtml() method too soon, before a response has been received from the server. So I wrote a class that overrides the setUrl method, blocking until the loadFinished signal is triggered:
import time
class View(QtWebKit.QWebView):
def __init__(self, *args, **kwargs):
super(View, self).__init__(*args, **kwargs)
self.completed = True
self.loadFinished.connect(self.setCompleted)
def setCompleted(self):
self.completed = True
def setUrl(self, url):
self.completed = False
super(View, self).setUrl(url)
while not self.completed:
time.sleep(0.2)
view = View(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
That made no difference at all. What am I missing here?
EDIT: Merely getting the HTML of a page is not my end game here. This is a simplified example of code that was not working the way I expected it to. Credit to Oleh for suggesting replacing time.sleep() with app.processEvents()
Copied from my other answer:
from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings
qapp = QApplication([])
def load_source(url):
page = QWebPage()
page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
page.mainFrame().setUrl(QUrl(url))
class State(QObject):
src = None
finished = False
#Slot()
def loaded(self, success=True):
self.finished = True
if self.src is None:
self.src = page.mainFrame().toHtml()
state = State()
# Optional; reacts to DOM ready, which happens before a full load
def js():
page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
page.mainFrame().evaluateJavaScript('''
document.addEventListener('DOMContentLoaded', qstate$.loaded);
''')
page.mainFrame().javaScriptWindowObjectCleared.connect(js)
page.mainFrame().loadFinished.connect(state.loaded)
while not state.finished:
qapp.processEvents()
return state.src
load_source downloads the data from an URL and returns the HTML after modification by WebKit. It wraps Qt's event loop with its asynchronous events, and is a blocking function.
But you really should think what you're doing. Do you actually need to invoke the engine and get the modified HTML? If you just want to download HTML of some webpage, there are much, much simpler ways to do this.
Now, the problem with the code in your answer is you don't let Qt do anything. There is no magic happening, no code running in background. Qt is based on an event loop, and you never let it enter that loop. This is usually achieved by calling QApplication.exec_ or with a workaround processEvents as shown in my code. You can replace time.sleep(0.2) with app.processEvents() and it might just work.

How to *efficiently* monitor a webpage modified with javascript?

I'm trying to monitor an element on a website that's generated via javascript. The problem of downloading a javascript modified page has been handled before, I borrowed the following code that solves the problem with PyQt.
But when I set this code to run every 20 seconds, my network traffic averages 70KB/s down 5 KB/s up. The actual page saved is only 80KB, but is javascript heavy.
6GB a day is not reasonable, my ISP has data limits and I already toe the line.
Is there a way to modify this code so that, for example, it only executes the javascript that corresponds to a specific element on the page? If so, how would I go about figuring out what I need to execute? And would that have a significant effect on the network traffic I'm seeing?
Alternately, how SHOULD I be doing this? I considered making a chrome extension, as Chrome would already be handling the javascript for me, but then I have to figure out how to integrate it with the rest of my project, and that's brand new territory for me. If there's a better way I'd rather do that.
#Borrowed from http://stackoverflow.com/questions/19161737/cannot-add-custom-request-headers-in-pyqt4
#which is borrowed from http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
import sys, signal
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkRequest, QNetworkReply
cookie = ''#snipped, the cookie I have to send is about as long as this bit of code...
class MyNetworkAccessManager(QNetworkAccessManager):
def __init__(self, url):
QNetworkAccessManager.__init__(self)
request = QNetworkRequest(QUrl(url))
self.reply = self.get(request)
def createRequest(self, operation, request, data):
request.setRawHeader('User-Agent', 'Mozilla/5.0')
request.setRawHeader("Cookie",cookie);
return QNetworkAccessManager.createRequest( self, operation, request, data )
class Crawler( QWebPage ):
def __init__(self, url, file):
QWebPage.__init__( self )
self._url = url
self._file = file
self.manager = MyNetworkAccessManager(url)
self.setNetworkAccessManager(self.manager)
def crawl( self ):
signal.signal( signal.SIGINT, signal.SIG_DFL )
self.loadFinished.connect(self._finished_loading)
self.mainFrame().load( QUrl( self._url ) )
def _finished_loading( self, result ):
file = open( self._file, 'w' )
file.write( self.mainFrame().toHtml() )
file.close()
exit(0)
def main(url,file):
app = QApplication([url,file])
crawler = Crawler(url, file)
crawler.crawl()
sys.exit( app.exec_() )
First, be clear about your motivation here. Something is changing every 20 seconds and you want to tell the server when one field changes. So
1). Do you need to send the whole page or just the contents of one field. I'm not clear what you are currently doing, but if you are sending 80k every 20 seconds this seems like overkill.
2). Does the server need to know immediately? What are the consequences of sending the state every minute rather than every 20 seconds. You miss some changes, but does that matter?
You haven't really told what you're doing or why, so we can't comment on that.
My first thought is that if the server just wants to know about one field then make an ajax call with just the payload you care about. To me more efficient send a summary every few minutes. One composite record is much cheaper than several small records.

PySide wait for signal from main thread in a worker thread

I decided to add a GUI to one of my scripts. The script is a simple web scraper. I decided to use a worker thread as downloading and parsing the data can take a while. I decided to use PySide, but my knowledge of Qt in general is quite limited.
As the script is supposed to wait for user input upon coming across a captcha I decided it should wait until a QLineEdit fires returnPressed and then send it's content to the worker thread so it can send it for validation. That should be better than busy-waiting for the return key to be pressed.
It seems that waiting for a signal isn't as straight forward as I thought it would be and after searching for a while I came across several solutions similar to this. Signaling across threads and a local event loop in the worker thread make my solution a bit more complicated though.
After tinkering with it for several hours it still won't work.
What is supposed to happen:
Download data until refered to captcha and enter a loop
Download captcha and display it to the user, start QEventLoop by calling self.loop.exec_()
Exit QEventLoop by calling loop.quit() in a worker threads slot which is connected via self.line_edit.returnPressed.connect(self.worker.stop_waiting) in the main_window class
Validate captcha and loop if validation fails, otherwise retry the last url which should be downloadable now, then move on with the next url
What happens:
...see above...
Exiting QEventLoop doesn't work. self.loop.isRunning() returns False after calling its exit(). self.isRunning returns True, as such the thread didn't seem to die under odd circumstances. Still the thread halts at the self.loop.exec_() line. As such the thread is stuck executing the event loop even though the event loop tells me it is not running anymore.
The GUI responds as do the slots of the worker thread class. I can see the text beeing send to the worker thread, the status of the event loop and the thread itself, but nothing after the above mentioned line gets executed.
The code is a bit convoluted, as such I add a bit of pseudo-code-python-mix leaving out the unimportant:
class MainWindow(...):
# couldn't find a way to send the text with the returnPressed signal, so I
# added a helper signal, seems to work though. Doesn't work in the
# constructor, might be a PySide bug?
helper_signal = PySide.QtCore.Signal(str)
def __init__(self):
# ...setup...
self.worker = WorkerThread()
self.line_edit.returnPressed.connect(self.helper_slot)
self.helper_signal.connect(self.worker.stop_waiting)
#PySide.QtCore.Slot()
def helper_slot(self):
self.helper_signal.emit(self.line_edit.text())
class WorkerThread(PySide.QtCore.QThread):
wait_for_input = PySide.QtCore.QEventLoop()
def run(self):
# ...download stuff...
for url in list_of_stuff:
self.results.append(get(url))
#PySide.QtCore.Slot(str)
def stop_waiting(self, text):
self.solution = text
# this definitely gets executed upon pressing return
self.wait_for_input.exit()
# a wrapper for requests.get to handle captcha
def get(self, *args, **kwargs):
result = requests.get(*args, **kwargs)
while result.history: # redirect means captcha
# ...parse and extract captcha...
# ...display captcha to user via not shown signals to main thread...
# wait until stop_waiting stops this event loop and as such the user
# has entered something as a solution
self.wait_for_input.exec_()
# ...this part never get's executed, unless I remove the event
# loop...
post = { # ...whatever data necessary plus solution... }
# send the solution
result = requests.post('http://foo.foo/captcha_url'), data=post)
# no captcha was there, return result
return result
frame = MainWindow()
frame.show()
frame.worker.start()
app.exec_()
What you are describing looks ideal for QWaitCondition.
Simple example:
import sys
from PySide import QtCore, QtGui
waitCondition = QtCore.QWaitCondition()
mutex = QtCore.QMutex()
class Main(QtGui.QMainWindow):
def __init__(self, parent=None):
super(Main, self).__init__()
self.text = QtGui.QLineEdit()
self.text.returnPressed.connect(self.wakeup)
self.worker = Worker(self)
self.worker.start()
self.setCentralWidget(self.text)
def wakeup(self):
waitCondition.wakeAll()
class Worker(QtCore.QThread):
def __init__(self, parent=None):
super(Worker, self).__init__(parent)
def run(self):
print "initial stuff"
mutex.lock()
waitCondition.wait(mutex)
mutex.unlock()
print "after returnPressed"
if __name__=="__main__":
app = QtGui.QApplication(sys.argv)
m = Main()
m.show()
sys.exit(app.exec_())
The slot is executed inside the thread which created the QThread, and not in the thread that the QThread controls.
You need to move a QObject to the thread and connect its slot to the signal, and that slot will be executed inside the thread:
class SignalReceiver(QtCore.QObject):
def __init__(self):
self.eventLoop = QEventLoop(self)
#PySide.QtCore.Slot(str)
def stop_waiting(self, text):
self.text = text
eventLoop.exit()
def wait_for_input(self):
eventLoop.exec()
return self.text
class MainWindow(...):
...
def __init__(self):
...
self.helper_signal.connect(self.worker.signalReceiver.stop_waiting)
class WorkerThread(PySide.QtCore.QThread):
def __init__(self):
self.signalReceiver = SignalReceiver()
# After the following call the slots will be executed in the thread
self.signalReceiver.moveToThread(self)
def get(self, *args, **kwargs):
result = requests.get(*args, **kwargs)
while result.history:
...
self.result = self.signalReceiver.wait_for_input()

pyQT QNetworkManager and ProgressBars

I'm trying to code something that downloads a file from a webserver and saves it, showing the download progress in a QProgressBar.
Now, there are ways to do this in regular Python and it's easy. Problem is that it locks the refresh of the progressBar. Solution is to use PyQT's QNetworkManager class. I can download stuff just fine with it, I just can't get the setup to show the progress on the progressBar. HereĀ“s an example:
class Form(QDialog):
def __init__(self,parent=None):
super(Form,self).__init__(parent)
self.progressBar = QProgressBar()
self.reply = None
layout = QHBoxLayout()
layout.addWidget(self.progressBar)
self.setLayout(layout)
self.manager = QNetworkAccessManager(self)
self.connect(self.manager,SIGNAL("finished(QNetworkReply*)"),self.replyFinished)
self.Down()
def Down(self):
address = QUrl("http://stackoverflow.com") #URL from the remote file.
self.manager.get(QNetworkRequest(address))
def replyFinished(self, reply):
self.connect(reply,SIGNAL("downloadProgress(int,int)"),self.progressBar, SLOT("setValue(int)"))
self.reply = reply
self.progressBar.setMaximum(reply.size())
alltext = self.reply.readAll()
#print alltext
#print alltext
def updateBar(self, read,total):
print "read", read
print "total",total
#self.progressBar.setMinimum(0)
#self.progressBar.setMask(total)
#self.progressBar.setValue(read)
In this case, my method "updateBar" is never called... any ideas?
Well you haven't connected any of the signals to your updateBar() method.
change
def replyFinished(self, reply):
self.connect(reply,SIGNAL("downloadProgress(int,int)"),self.progressBar, SLOT("setValue(int)"))
to
def replyFinished(self, reply):
self.connect(reply,SIGNAL("downloadProgress(int,int)"),self.updateBar)
Note that in Python you don't have to explicitly use the SLOT() syntax; you can just pass the reference to your method or function.
Update:
I just wanted to point out that if you want to use a Progress bar in any situation where your GUI locks up during processing, one solution is to run your processing code in another thread so your GUI receives repaint events. Consider reading about the QThread class, in case you come across another reason for a progress bar that does not have a pre-built solution for you.

Categories

Resources