How to use one class to scrape two websites

I'm trying to render websites in PyQt that are written in java. The first site is rendered without problems and scraped for the information I need, but when I want to use the same class to render another site and retrieve the new data it tells me the frame that's defined in the Render class is not defined (which was defined for the first website, which worked perfectly fine in retrieving the data that I needed).
So, why is this happening? Am I missing something fundamental in Python? My understanding is that when the first site has been rendered, then the object will be garbage collected and the second one can be rendered. Below is the referred code:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']
for url in urls:
    r = Render(url)
    result = r.frame.toHtml()
    # This step is important. Converting QString to ASCII for lxml to process.
    # QString should be converted to a string before being processed by lxml.
    formatted_result = str(result)
    # Next build the lxml tree from formatted_result
    tree = html.fromstring(formatted_result)
    # Now, using the correct XPath, we fetch the URLs of the archives
    archive_links = tree.xpath('//div[@class="campaign"]/a/@href')[1:5]
    print(archive_links)
The error message I'm getting:
File "javaweb2.py", line 24, in <module>
result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'
Any help would be much appreciated!

That's because self.frame is only defined when self._loadFinished() is called, which only happens once the QWebPage instance emits the loadFinished signal. Setting aside several dubious practices I see in the code you posted, the following would solve the issue (note that the line marked with ***** is the important one):
import time  # needed for time.sleep() below

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        self.frame = None  # *****
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']
for url in urls:
    r = Render(url)
    # wait till frame arrives:
    while r.frame is None:
        # pass  # option 1: works, but will cause 100% cpu
        time.sleep(0.1)  # option 2: much better
    result = r.frame.toHtml()
    ...
So the "pass" would work but would consume 100% CPU, because the loop spins as fast as the interpreter allows. With time.sleep(0.1) the check runs only every tenth of a second, so CPU consumption stays very low.
The best solution, of course, is to put the logic that depends on the frame being available (i.e. the code that is currently in the URL loop below r = Render(url)) in a function that gets called when the loadFinished signal is emitted. Since you can't control the order of signals, the best option is to move that code into the _loadFinished() method.
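The callback-driven approach described above can be sketched without any Qt dependency. This is only an illustrative sketch, not the answerer's code: SequentialLoader, start_load, on_page and on_done are hypothetical names. In the PyQt code, start_load would presumably wrap mainFrame().load(QUrl(url)), and handle_finished would be called from the slot connected to loadFinished.

```python
from collections import deque


class SequentialLoader:
    """Process pages one at a time from a "load finished" callback.

    Hypothetical helper: start_load(url) begins an asynchronous load,
    and the owner calls handle_finished(ok, html) when that load ends.
    """

    def __init__(self, urls, start_load, on_page, on_done):
        self._pending = deque(urls)
        self._start_load = start_load
        self._on_page = on_page
        self._on_done = on_done
        self._current = None

    def start(self):
        self._load_next()

    def _load_next(self):
        if not self._pending:
            self._current = None
            self._on_done()
            return
        self._current = self._pending.popleft()
        self._start_load(self._current)

    def handle_finished(self, ok, html):
        # All work that needs the page happens here, after the load ended.
        if ok:
            self._on_page(self._current, html)
        # Only now is the next URL requested.
        self._load_next()


if __name__ == "__main__":
    # Stand-in for a real page load: it reports completion immediately.
    pages = {}
    fake_site = {"a": "<p>A</p>", "b": "<p>B</p>"}

    def fake_load(url):
        loader.handle_finished(True, fake_site[url])

    loader = SequentialLoader(fake_site, fake_load, pages.__setitem__,
                              lambda: print("done", pages))
    loader.start()
```

The design point is that no code ever waits for a page: each page is handled inside the completion callback, and the next load starts only after the previous one has been processed.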

Related

Why does my for loop does not continue to execute with PyQt? [duplicate]

I am scraping a web page with PyQt. In order to facilitate this, I have constructed the following class:
class Client(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()
I then use this class to obtain data from multiple subpages that I loop over with a for loop:
from config import o_u_types, countries, leagues, limited_league_countries, bookmakers

def getodds(url):
    for i in o_u_types:
        ou_i_df = pd.DataFrame()
        url_appendix = "#over-under;2;{:.2f};0".format(i)
        o_u_type = i
        o_u_match_url = str(url) + str(url_appendix)
        print(o_u_match_url)
        ou_page = Client(o_u_match_url)
        # Here some actions are performed, after which the output is stored in a data frame (data_df) that is then appended to the master_df that stores the data across loops
        pd.concat([master_df, data_df])
        print(master_df)
    return master_df
After opening the client with the individual match URL, I perform some further actions that should not be relevant to the issue (but I'm happy to include them if needed).
This works fine for the first iteration, but execution always ends without an error message before moving on to the second iteration.
What might be the issue here?

Dynamically add Items to a QListWIdget

I have this function
def search(self):
    # creating a list, just ignore
    self.listWidget.clear()
    self.gogo.keywordSetter(self.lineEdit.text())
    self.gogo.linkSetter()
    self.gogo.resultsGetter()
    # iterating through the generated list
    for x in self.gogo.resultsContainer:
        self.listWidget.update()
        #print(x[2])
        # x[2] is the url to an image
        url = x[2]
        print(url)
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req).read()
        pixmap = QPixmap()
        pixmap.loadFromData(webpage)
        icon = QIcon(pixmap)
        # x[1] is a string I want to display, it's a title basically
        item = QListWidgetItem(icon, x[1])
        size = QSize()
        size.setHeight(100)
        size.setWidth(400)
        item.setSizeHint(size)
        #item.iconSize(QSize(100, 400))
        self.listWidget.addItem(item)
It works; my problem is that it displays everything only after it has iterated through every item.
What I mean is that I can see from the print statement that it IS going through the list and creating the items, but no items are displayed.
They all get displayed at once after it has completely iterated through the list.
It's very slow. I know part of it is due to the image download and I can do nothing about it. But adding the items dynamically would at least make it a little more bearable.
I tried to use update() and it didn't really work.
Another weird behaviour is that, despite clear() being the first instruction, the listWidget isn't cleared as soon as the function is called. It looks like this is due to the same thing that leads to everything being displayed at once.

UI systems use an "event loop", which is responsible for drawing elements and handling user interaction. That loop must be left free to do its job, and functions that run within it must return as soon as possible; otherwise the UI "freezes": the window is not refreshed or properly displayed, and it appears unresponsive to keyboard or mouse events.
Your function does exactly that: it blocks everything until it's finished.
Calling update() won't be enough, as it only schedules a repaint, which happens only once the main loop is able to process events. That's why long or possibly infinite for/while loops should always be avoided, as should any blocking function like time.sleep, even for short amounts of time.
"I can do nothing about it."
Actually, not only can you, but you have to.
One possibility is to use threading, specifically QThread, an interface that runs calls in a separate thread while providing Qt's signal/slot mechanism, which works asynchronously between threads. Using QThread and signals to communicate with the main thread is extremely important, as UI elements are not thread-safe and must never be accessed (or created) from external threads.
class Downloader(QThread):
    imageDownloaded = pyqtSignal(int, object)

    def __init__(self, parent, urlList):
        super().__init__(parent)
        self.urlList = urlList

    def run(self):
        for i, url in enumerate(self.urlList):
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            webpage = urlopen(req).read()
            self.imageDownloaded.emit(i, webpage)

class YourWidget(QWidget):
    # ...
    def search(self):
        self.listWidget.clear()
        self.gogo.keywordSetter(self.lineEdit.text())
        self.gogo.linkSetter()
        self.gogo.resultsGetter()
        # iterating through the generated list
        urlList = []
        for x in self.gogo.resultsContainer:
            url = x[2]
            urlList.append(url)
            item = QListWidgetItem(x[1])
            item.setSizeHint(QSize(400, 100))
            self.listWidget.addItem(item)
        # the parent argument is mandatory, otherwise there won't be any
        # persistent reference to the downloader, and it will be deleted
        # as soon as this function returns; connecting the finished signal
        # to deleteLater will delete the object when it will be completed.
        downloadThread = Downloader(self, urlList)
        downloadThread.imageDownloaded.connect(self.updateImage)
        downloadThread.finished.connect(downloadThread.deleteLater)
        downloadThread.start()

    def updateImage(self, index, data):
        pixmap = QPixmap()
        if not pixmap.loadFromData(data) or index >= self.listWidget.count():
            return
        self.listWidget.item(index).setIcon(QIcon(pixmap))
Note: the code above is untested, as it's not clear what modules you're actually using for downloading, nor what self.gogo is.
A slightly different alternative to the above is to use a persistent "download manager" thread and queue requests using a Python Queue, which is read in the run() implementation.
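The persistent download-manager idea can be sketched with the standard library alone. This is an illustrative sketch rather than the answer's own code: it uses threading.Thread instead of QThread, and DownloadWorker, fetch and on_result are hypothetical names. In a Qt application the worker would emit a signal instead of calling on_result directly, since UI objects must only be touched from the main thread.

```python
import queue
import threading


class DownloadWorker(threading.Thread):
    """One long-lived thread that serves download requests from a queue."""

    _STOP = object()  # sentinel that tells run() to exit

    def __init__(self, fetch, on_result):
        super().__init__(daemon=True)
        self._fetch = fetch          # assumed callable: url -> data
        self._on_result = on_result  # assumed callable: (index, data) -> None
        self._requests = queue.Queue()

    def request(self, index, url):
        # Safe to call from any thread; Queue does the locking.
        self._requests.put((index, url))

    def stop(self):
        self._requests.put(self._STOP)

    def run(self):
        while True:
            job = self._requests.get()  # blocks until a request arrives
            if job is self._STOP:
                break
            index, url = job
            try:
                data = self._fetch(url)
            except Exception:
                continue  # this sketch simply skips failed downloads
            self._on_result(index, data)


if __name__ == "__main__":
    results = {}
    # The lambda stands in for a real download function.
    worker = DownloadWorker(lambda url: url.encode(), results.__setitem__)
    worker.start()
    for i, url in enumerate(["img/a.png", "img/b.png"]):
        worker.request(i, url)
    worker.stop()
    worker.join()
    print(results)
```

Because requests are served in FIFO order by a single thread, results arrive in the order they were queued; the index is still passed along so the receiver does not have to rely on that.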
Consider that Qt provides the QtNetwork module that already works asynchronously, and implements the same concept.
You have to create a QNetworkAccessManager instance (one is usually enough for the whole application), then create a QNetworkRequest for each url, and finally call the manager's get() with that request, which returns a QNetworkReply that is later used to retrieve the downloaded data.
In the following example, I'm also setting a custom property on the reply, so that when the reply is received we know which index it corresponds to. This is even more important than in the threaded example, as QNetworkAccessManager can download in parallel, and downloads can complete in a different order than they were requested (for instance, if the images have different sizes, or are being downloaded from different servers).
Note that the index must be set as a Qt property, and cannot (or, better, should not) be set as a Python attribute, like reply.index = i. This is because the reply we're using in Python is just a wrapper around the actual Qt object, and unless we keep a persistent reference to that wrapper (for instance, by adding it to a list), that attribute would be lost.
from PyQt5.QtNetwork import *

class YourWidget(QWidget):
    def __init__(self):
        # ...
        self.downloadManager = QNetworkAccessManager()
        self.downloadManager.finished.connect(self.updateImage)

    # ...
    def search(self):
        self.listWidget.clear()
        self.gogo.keywordSetter(self.lineEdit.text())
        self.gogo.linkSetter()
        self.gogo.resultsGetter()
        for i, x in enumerate(self.gogo.resultsContainer):
            item = QListWidgetItem(x[1])
            item.setSizeHint(QSize(400, 100))
            self.listWidget.addItem(item)
            url = QUrl(x[2])
            request = QNetworkRequest(url)
            request.setHeader(request.UserAgentHeader, 'Mozilla/5.0')
            reply = self.downloadManager.get(request)
            reply.setProperty('index', i)

    def updateImage(self, reply):
        index = reply.property('index')
        if isinstance(index, int):
            pixmap = QPixmap()
            if pixmap.loadFromData(reply.readAll()):
                item = self.listWidget.item(index)
                if item is not None:
                    item.setIcon(QIcon(pixmap))
        reply.deleteLater()

Is there any way to call synchronously the method 'toHtml' which is QWebEnginePage's object?

I'm trying to get the HTML code from a QWebEnginePage object. According to the Qt reference, QWebEnginePage's toHtml is an asynchronous method, as described below.
Asynchronous method to retrieve the page's content as HTML, enclosed in HTML and BODY tags. Upon successful completion, resultCallback is called with the page's content.
So I tried to find out how to call this method synchronously.
The result I want to get is shown below.
class MainWindow(QWidget):
    html = None
    ...
    ...
    def store_html(self, data):
        self.html = data

    def get_html(self):
        current_page = self.web_view.page()
        current_page.toHtml(self.store_html)
        # I want to wait until the 'store_html' method is finished
        # but the 'toHtml' is called asynchronously, return None when try to return self.html value like below.
        return self.html
    ...
    ...

A simple way to get that behavior is to use QEventLoop(). An object of this class keeps the code that follows exec_() from running until the loop quits; this does not mean that the GUI stops working, since events are still processed in the meantime.
from PyQt5.QtCore import *
from PyQt5.QtWidgets import *
from PyQt5.QtWebEngineWidgets import *

class Widget(QWidget):
    toHtmlFinished = pyqtSignal()

    def __init__(self, *args, **kwargs):
        QWidget.__init__(self, *args, **kwargs)
        self.setLayout(QVBoxLayout())
        self.web_view = QWebEngineView(self)
        self.web_view.load(QUrl("http://doc.qt.io/qt-5/qeventloop.html"))
        btn = QPushButton("Get HTML", self)
        self.layout().addWidget(self.web_view)
        self.layout().addWidget(btn)
        btn.clicked.connect(self.get_html)
        self.html = ""

    def store_html(self, html):
        self.html = html
        self.toHtmlFinished.emit()

    def get_html(self):
        current_page = self.web_view.page()
        current_page.toHtml(self.store_html)
        loop = QEventLoop()
        self.toHtmlFinished.connect(loop.quit)
        loop.exec_()
        print(self.html)

if __name__ == '__main__':
    import sys
    app = QApplication(sys.argv)
    w = Widget()
    w.show()
    sys.exit(app.exec_())
Note: The same method works for PySide2.

Here's a different approach, which also behaves differently from the QEventLoop method.
You can subclass QWebEngineView, hook into the loadFinished signal, and add a custom read_html() method:
class MyWebView(QWebEngineView):
    def __init__(self, parent):
        super(MyWebView, self).__init__(parent)
        self.html = None

    def read_html(self, url):
        """
        Load url and read webpage content in html
        """
        def read_page():
            def process_html(html):
                self.html = html
            self.page().toHtml(process_html)

        # connect before loading so the signal cannot be missed
        self.loadFinished.connect(read_page)
        # load() expects a QUrl (from PyQt5.QtCore), so wrap the string
        self.load(QUrl(url))
This way the application won't halt while waiting for the page to finish loading; once the page has loaded, you can access the HTML content through the html attribute.
class MainWindow(QWidget):
    def __init__(self):
        ...
        self.web_view = MyWebView(self)
        self.web_view.read_html(r'https://www.xingyulei.com/')
        ...
        self.btn.clicked.connect(self.print_html)

    def print_html(self):
        print(self.web_view.html)

You could create a multiprocessing.Pipe and pass the send method of one of its multiprocessing.Connection ends as the callback, then call the recv method of the other end immediately afterwards. recv will block until the HTML is received, so keep that in mind.
Example:
from multiprocessing import Pipe

class MainWindow(QWidget):
    def __init__(...):
        ...
        self.from_loopback, self.to_loopback = Pipe(False)

    def get_html(self):
        current_page = self.web_view.page()
        current_page.toHtml(self.to_loopback.send)
        return self.from_loopback.recv()
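The blocking behaviour this answer relies on can be tried out with the standard library alone. In this sketch, fake_to_html is a hypothetical stand-in for toHtml, and it delivers its callback from a separate thread; that is an assumption of the sketch, not something the answer states. If a callback is only ever delivered by the event loop of the very thread that is blocked in recv(), then recv() can never return, so the sketch guards the wait with poll() and a timeout.

```python
import threading
from multiprocessing import Pipe


def fake_to_html(callback):
    """Hypothetical stand-in for an asynchronous toHtml(callback)."""
    # Deliver the result from another thread after a short delay.
    threading.Timer(0.05, callback, args=("<html></html>",)).start()


def get_html_blocking(to_html, timeout=5.0):
    # Pipe(False) returns (receive_end, send_end) of a one-way pipe.
    receive_end, send_end = Pipe(False)
    try:
        to_html(send_end.send)
        # poll() waits up to `timeout` seconds instead of blocking forever.
        if not receive_end.poll(timeout):
            raise TimeoutError("callback was never delivered")
        return receive_end.recv()
    finally:
        receive_end.close()
        send_end.close()


if __name__ == "__main__":
    print(get_html_blocking(fake_to_html))
```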

Load a web page

I am trying to load a web page using PySide's QtWebKit module. According to the documentation (Elements of QWebView; QWebFrame::toHtml()), the following script should print the HTML of the Google Search Page:
from PySide import QtCore
from PySide import QtGui
from PySide import QtWebKit
# Needed if we want to display the webpage in a widget.
app = QtGui.QApplication([])
view = QtWebKit.QWebView(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
But alas it does not. All that is printed is the method's equivalent of a null response:
<html><head></head><body></body></html>
So I took a closer look at the setUrl documentation:
The view remains the same until enough data has arrived to display the new url.
This made me think that maybe I was calling the toHtml() method too soon, before a response has been received from the server. So I wrote a class that overrides the setUrl method, blocking until the loadFinished signal is triggered:
import time

class View(QtWebKit.QWebView):
    def __init__(self, *args, **kwargs):
        super(View, self).__init__(*args, **kwargs)
        self.completed = True
        self.loadFinished.connect(self.setCompleted)

    def setCompleted(self):
        self.completed = True

    def setUrl(self, url):
        self.completed = False
        super(View, self).setUrl(url)
        while not self.completed:
            time.sleep(0.2)

view = View(None)
view.setUrl(QtCore.QUrl("http://www.google.com/"))
frame = view.page().mainFrame()
print(frame.toHtml())
That made no difference at all. What am I missing here?
EDIT: Merely getting the HTML of a page is not my end game here. This is a simplified example of code that was not working the way I expected it to. Credit to Oleh for suggesting replacing time.sleep() with app.processEvents()

Copied from my other answer:
from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings

qapp = QApplication([])

def load_source(url):
    page = QWebPage()
    page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
    page.mainFrame().setUrl(QUrl(url))

    class State(QObject):
        src = None
        finished = False

        @Slot()
        def loaded(self, success=True):
            self.finished = True
            if self.src is None:
                self.src = page.mainFrame().toHtml()

    state = State()

    # Optional; reacts to DOM ready, which happens before a full load
    def js():
        page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
        page.mainFrame().evaluateJavaScript('''
            document.addEventListener('DOMContentLoaded', qstate$.loaded);
        ''')

    page.mainFrame().javaScriptWindowObjectCleared.connect(js)
    page.mainFrame().loadFinished.connect(state.loaded)

    while not state.finished:
        qapp.processEvents()

    return state.src
load_source downloads the data from a URL and returns the HTML after modification by WebKit. It wraps Qt's event loop with its asynchronous events, and is a blocking function.
But you really should think about what you're doing. Do you actually need to invoke the engine and get the modified HTML? If you just want to download the HTML of some webpage, there are much, much simpler ways to do this.
Now, the problem with the code in your question is that you don't let Qt do anything. There is no magic happening, no code running in the background. Qt is based on an event loop, and you never let it enter that loop. This is usually achieved by calling QApplication.exec_, or with the processEvents workaround shown in my code. You can replace time.sleep(0.2) with app.processEvents() and it might just work.
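Why a loop around time.sleep() never sees the load finish can be illustrated with a toy event loop. This is a deliberately simplified model, not Qt's implementation: ToyEventLoop, post and process_events are made-up names that only mimic the idea that queued events run solely when the loop is given a chance to process them.

```python
import time
from collections import deque


class ToyEventLoop:
    """Queued callbacks run only when process_events() is called."""

    def __init__(self):
        self._events = deque()

    def post(self, callback):
        self._events.append(callback)

    def process_events(self):
        while self._events:
            self._events.popleft()()


def wait_for(state, loop, pump, attempts=5):
    """Poll state['finished'], optionally pumping the loop in between."""
    for _ in range(attempts):
        if state["finished"]:
            return True
        if pump:
            loop.process_events()  # lets queued callbacks run
        else:
            time.sleep(0.01)       # nothing can run while we sleep
    return state["finished"]


def demo(pump):
    loop = ToyEventLoop()
    state = {"finished": False}
    # The "load finished" notification is queued, not run immediately.
    loop.post(lambda: state.update(finished=True))
    return wait_for(state, loop, pump)


if __name__ == "__main__":
    print("sleeping:", demo(pump=False))
    print("pumping: ", demo(pump=True))
```

In the sleeping variant the flag stays unset no matter how long the wait is, because the only code that could set it is sitting in the queue; pumping the loop is what lets it run.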

HTML page vastly different when using a headless webkit implementation using PyQT

I was under the impression that using a headless browser implementation of webkit using PyQT will automatically get me the html code for each URL even with heavy JS code in it. But I am only seeing it partially. I am comparing with the page I get when I save the page from the firefox window.
I am using the following code -
class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''
        if wait:
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            self.loadFinished.connect(app.quit)
        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT

def get_page(url, wait=None):
    # here is the trick how to call it several times
    app = QApplication.instance()  # checks if QApplication already exists
    if not app:  # create QApplication if it doesnt exist
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html
Can someone see anything obviously wrong with the code?
After running the code through a few URLs, here is one I found that shows the problems I am running into quite clearly - http://www.chilis.com/EN/Pages/menu.aspx
Thanks for any pointers.

The page has AJAX code: after it finishes loading, it still needs some time to update the page via AJAX. But your code quits as soon as the load finishes.
You should add some code like this to wait a while and let WebKit process events:
for i in range(200):  # wait 2 seconds
    app.processEvents()
    time.sleep(0.01)
