Python : Use PyQT4 + Soup to scrape SEVERAL pages [duplicate]

Python : Use PyQT4 + Soup to scrape SEVERAL pages [duplicate] - python

This question already has an answer here:
PyQt Class not working for the second usage
(1 answer)
Closed 7 years ago.
I am trying to scrape several webpages using Python PyQT4 + Beautiful Soup.
Due to the nature of my overal program, I use a main script "program.py" calling functions from other scripts, doing different analyses with beautiful Soup.
Thus, the simplified architecture of my main program.py is as follows :
program.py :
import script1
import script2
script1.function1(urlA)
script2.function2(urlB)
With script1.py and script2.py as follows :
script1.py :
import requests
import re
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
def function1(url):
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
#Do many things with soup.
#Nothing related to PyQT4 further in this script
And my script 2 has exactly the same structure, but does other things on another url.
script2.py :
import requests
import re
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
def function2(url):
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
#Do many other things with soup
#Nothing related to PyQT4 further in this script
Everything works fine with script1.py. My function1 and analyses are run successfully.
But script2.py bugs, and I have the following error :
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
I spent time searching for this problem, and I found that PyQT4 could not load several pages in the same instance.
The problem is that I need PyQT4 to render Javascripts before loading the page content into Beautiful Soup.
So I think I need to put some kind of "self.app.quit()" at the end of my function1 in script1, so that function2 in script2 can render a page with PyQT4 too. But I was not able to make it work.

How about this
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
r.app.quit()

Related

loop over a list of urls using PyQt4

I am trying to loop over a list of URLs using PyQt4 and Beautifulsoup using the following code:
import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl, pyqtSignal
from PyQt4.QtWebKit import QWebPage
class Render(QWebPage):
def __init__(self, urls, cb):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.urls = urls
self.cb = cb
self.crawl()
self.app.exec_()
def crawl(self):
if self.urls:
url = self.urls.pop(0)
print ('Downloading', url)
self.mainFrame().load(QUrl(url))
else:
self.app.quit()
def _loadFinished(self, result):
frame = self.mainFrame()
url = str(frame.url().toString())
html = frame.toHtml()
self.cb(url, html)
self.crawl()
def scrape(url, html):
pass
soup = BeautifulSoup(unicode(html), "lxml")
t = soup.findAll("div", {"class": "detalhamento_label_valor hidden-print ng-binding"})[0].text
print t
urls = ["http://apps.mpf.mp.br/aptusmpf/index2#/detalhe/920000000000000000005?modulo=0&sistema=portal" ,
"http://apps.mpf.mp.br/aptusmpf/index2#/detalhe/920000000000000000005?modulo=0&sistema=portal" ,
"http://apps.mpf.mp.br/aptusmpf/index2#/detalhe/920000000000000000004?modulo=0&sistema=portal" ]
r = Render(urls, cb=scrape)
It seems to work well if the urls are the same [0,1], but it gets stuck once the url changes [2]. I am not really familiar with PyQt4, so I wonder if there is something trivial I might be missing.
EDIT
The program hangs while running the third item of the url list on this operation:
self.mainFrame().load(QUrl(url))
Other than that, the only warning I get is:
libpng warning: iCCP: known incorrect sRGB profile
Though I'm not sure what it means, it does not seem to be connected to the issue.

Python pyQt4 QObject::connect: Cannot connect (null)

I am trying to scrape several webpages and I know the URLs in advance. Using a fairly standard code found here this is what I have:
from lxml import html
import requests
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
Doing a for-loop with the various URLs works fine for the first two, then it crashes with the following:
QObject::connect: Cannot connect
(null)::configurationAdded(QNetworkConfiguration) to
QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect
(null)::configurationRemoved(QNetworkConfiguration) to
QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect
(null)::configurationChanged(QNetworkConfiguration) to
QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to
QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to
QNetworkConfigurationManager::updateCompleted()
My (vague) understanding of the issue after researching similar questions is that the QApplication might not be closed properly from one iteration to the next. I tried pausing the script on each iteration for 5-20 secs without any effect. Could not find any other applicable suggestions. Any help on the matter is appreciated.

How to make request through proxy in PyQt4? - I saw the answer but not sure where to put the code

I'm basically trying to scrap a site using PyQt to be able to load the Javascript and I'm trying to do the request through proxy in PyQt4. I saw it works for the guy who asked on this answered question: How to make request through proxy in PyQt4 but I cannot make it work, I'm not sure where to add the information suggested in the answer:
old_manager = self.page().networkAccessManager()
new_manager = MyNetworkAccessManager(old_manager)
self.page().setNetworkAccessManager(new_manager)
Any idea how to make this code complete?
I have tried something like that:
import sys
import socket
import requests
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4.QtNetwork import *
from bs4 import BeautifulSoup
class MyNetworkAccessManager(QNetworkAccessManager):
def __init__(self):
QNetworkAccessManager.__init__(self)
proxy = QNetworkProxy('HTTP','179.179.253.147', '8080')
self.setProxy(proxy)
class MySettings(QWebPage):
def __init__(self):
QWebPage.__init__(self)
self.settings().setAttribute(QWebSettings.AutoLoadImages, False)
class Browser(QWebView):
def __init__(self):
QWebView.__init__(self)
old_manager = self.page().networkAccessManager()
new_manager = MyNetworkAccessManager(old_manager)
self.page().setNetworkAccessManager(new_manager)
self.setPage(MySettings())
self.loadProgress.connect(self._progress)
self.loadFinished.connect(self._loadFinished)
self.doc = self.page().currentFrame()
def _progress(self, progress):
print progress
def _loadFinished(self):
soup = BeautifulSoup(unicode(self.doc.toHtml()), 'lxml')
print soup.prettify().encode('utf-8')
if __name__ == "__main__":
app = QApplication(sys.argv)
br = Browser()
url = QUrl('https://www.example.com')
br.load(url)
br.show()
app.exec_()
But is returns an error saying:
"__init__() takes exactly 1 argument (2 given)" on the line
new_manager = MyNetworkAccessManager(old_manager)

You should not pass the old manager as an argument, you must change:
old_manager = self.page().networkAccessManager()
new_manager = MyNetworkAccessManager(old_manager)
self.page().setNetworkAccessManager(new_manager)
to
new_manager = MyNetworkAccessManager()
self.page().setNetworkAccessManager(new_manager)
Also Change:
proxy = QNetworkProxy('HTTP','179.179.253.147', '8080')
to:
proxy = QNetworkProxy(QNetworkProxy.HttpProxy, QString("179.179.253.147"), 8080)

python qt4 : To referesh and scrape again

I have written a small python code to scrape a table in a web page. It uses qt4 to scrape. Now, The problem is I need to keep scraping the data every 5 mins. I am thinking of refreshing the page and scrape again. How can I refresh the webpage and scrape again every 5 mins?
Below is the code which I am using to scrape.
import sys
from BeautifulSoup import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
import redis
from time import sleep
class Scraper(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
#self.render = Scraper(url)
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
def close_app(self):
self.app.quit()
print "closed"
url = 'https://www.nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G'
r = Scraper(url)
result = r.frame.toHtml()
formatted_result = str(result.toAscii())
soup = BeautifulSoup(formatted_result)
table = soup.find(id="topGainers")
print table

Check this page out.
It provides a very light-weight library for scheduling tasks and should work fine within Qt. How do I get a Cron like scheduler in Python?
But if you are worried about your GUI freezing or just wanting to keep everything native within Qt, check this out : Background thread with QThread in PyQt.

You can use QtCore.QTimer.singleShot(5 * 60, func) function.
def __init__(self, url):
# ...
self.show_page()
def show_page(self)
# display page here
QtCore.QTimer.singleShot(5 * 60, self.show_page)

login.live.com with python and mechanize?

I need to automatically login with python and mechanize on login.live.com.
The problem is,that I can't find any browser.forms(), but there should be some, since I checked the HTML code:
My code:
import urllib2
import lxml
from mechanize import Browser
br=Browser()
#Simulate user
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'GoogleChrome')]
#open site
url = "https://login.live.com/"
rep = br.open(url)
for frm in br.forms():
print frm
There should be a form named 'f1' on 'login.live.com'. Is it possible, that this part is generated dynamically?
Nero

As sbarzowski pointed out you need to execute the javascript on the site.
But you don't need to leave python for that. In fact you could automate Qt webkit.
Example (python3, tested on linux):
#!/usr/bin/env python3
import sys
from urllib.request import urlopen
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
SHOWBROWSER = True
LOGIN = 'name#example.com'
PASS = 'foo'
class WebPage(QWebPage):
def __init__(self, parent=None):
super(WebPage, self).__init__(parent)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl('http://login.live.com'))
def javaScriptConsoleMessage(self, msg, lineNumber, sourceID):
print("JsConsole(%s:%d): %s" % (sourceID, lineNumber, msg))
def _loadFinished(self, result):
frame = self.mainFrame()
url = frame.requestedUrl().toString()
print(url)
if url == 'http://login.live.com/':
frame.evaluateJavaScript(self.get_jquery())
frame.evaluateJavaScript(
'''
$('input[name="login"]').val('{login}')
$('input[name="passwd"]').val('{password}')
$('input[type="submit"]').click()
'''.format(login=LOGIN, password=PASS)
)
if 'auth/complete-signin' in url:
print('finished login')
if not SHOWBROWSER:
QApplication.quit()
def get_jquery(self):
response = urlopen('http://code.jquery.com/jquery-2.1.3.js')
return response.read().decode('utf-8')
class Window(QWidget):
def __init__(self):
super(Window, self).__init__()
self.view = QWebView(self)
self.view.setPage(WebPage())
layout = QVBoxLayout(self)
layout.setMargin(0)
layout.addWidget(self.view)
def headless():
app = QApplication(sys.argv)
view = QWebView()
view.setPage(WebPage())
app.exec_()
def main():
app = QApplication(sys.argv)
window = Window()
window.show()
app.exec_()
if __name__ == "__main__":
if SHOWBROWSER:
main()
else:
headless()

The answer from https://login.live.com has empty body. Everything is done through javascript onload.
To see yourself you can (on Mac and Linux at least):
wget https://login.live.com/
Or in your code:
import urllib2
from mechanize import Browser
br=Browser()
#Simulate user
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'GoogleChrome')]
#open site
url = "https://login.live.com/"
rep = br.open(url)
print rep.read()
It may be hard/impossible to get these forms without executing javascript, but to do so I think you will have to leave python. EDIT: Or maybe you don't have to (see other answers).
If you have no need to actually analyze the site respones and just want to do some simple things there you can just make your requests without caring too much about responses (you still have http status codes which may be enough to see if your requests succeded).
I guess there is also actual API. I'm not familiar with MS products and don't know exactly what you are trying to do, so I cannot point to anything specific.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python : Use PyQT4 + Soup to scrape SEVERAL pages [duplicate] - python

How about this r = Render(url) soup = BeautifulSoup(unicode(r.frame.toHtml())) r.app.quit()

Related

loop over a list of urls using PyQt4

Python pyQt4 QObject::connect: Cannot connect (null)

How to make request through proxy in PyQt4? - I saw the answer but not sure where to put the code

python qt4 : To referesh and scrape again

login.live.com with python and mechanize?

Categories

Resources