Create restartable scrapy spider reactor in PyQt5 with qt5reactor - python

My GUI has an "Update Database" button, and every time the user presses it, I want to start a Scrapy spider that stores the scraped data in a SQLite3 database. I implemented qt5reactor, as this answer suggests, but now I get a ReactorNotRestartable error when I press the update button a second time. How can I get around this? I tried switching from CrawlerRunner to CrawlerProcess, but it still throws the same error (though maybe I'm doing it wrong). I also cannot use this answer, because q.get() locks the event loop, so the GUI freezes while the spider runs. I'm new to multiprocessing, so sorry if I'm missing something incredibly obvious.
In main.py
... # PyQt5 imports
import qt5reactor
from scrapy import crawler
from twisted.internet import reactor
from currency_scraper.currency_scraper.spiders.investor import InvestorSpider
class MyGUI(QMainWindow):

    def __init__(self):
        self.update_db_button.clicked.connect(self.on_clicked_update)
        ...

    def on_clicked_update(self):
        """Gives command to run scraper and fetch data from the website"""
        runner = crawler.CrawlerRunner(
            {
                "USER_AGENT": "currency scraper",
                "SCRAPY_SETTINGS_MODULE": "currency_scraper.currency_scraper.settings",
                "ITEM_PIPELINES": {
                    "currency_scraper.currency_scraper.pipelines.Sqlite3Pipeline": 300,
                }
            }
        )
        deferred = runner.crawl(InvestorSpider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()  # has to be run here or the crawling doesn't start
        update_notification()

    ... # other stuff

if __name__ == "__main__":
    open_window()
    qt5reactor.install()
    reactor.run()
Error log:
Traceback (most recent call last):
  File "c:/Users/Familia/Documents/Programação/Python/Projetos/Currency_converter/main.py", line 330, in on_clicked_update
    reactor.run()
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "c:\Users\Familia\Documents\Programação\Python\Projetos\Currency_converter\venv\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Related

How to properly use run_detached() and stop() in pystray?

I'm trying to use pystray without blocking the main thread. Based on the pystray docs, we can use the function run_detached() to start without blocking.
I'm using pystray on Windows, so apparently I don't need to pass any argument to run_detached() for it to work.
The first thing I tried was to run this code:
import pystray
from pystray import MenuItem as item
from PIL import Image, ImageTk

def show_window(icon):
    print('Test')

def quit_window(icon):
    icon.stop()

icon = 'icon.ico'
image = Image.open(icon)
menu = pystray.Menu(item('Show', show_window, default=True), item('Quit', quit_window))
icon = pystray.Icon("name", image, "My System Tray Icon", menu)
icon.run_detached()
But I received this error:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "...\lib\threading.py", line 973, in _bootstrap_inner
    self.run()
  File "...\lib\threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "...\lib\site-packages\pystray\_base.py", line 384, in <lambda>
    threading.Thread(target=lambda: self.run(setup)).start()
NameError: name 'setup' is not defined
So I tried to bypass this error by changing line 384 in _base.py, removing the setup variable:
#threading.Thread(target=lambda: self.run(setup)).start()
threading.Thread(target=lambda: self.run()).start()
The code worked as expected and created the tray icon, with the menu buttons working properly.
The problem is when I press "Quit": the stop() function does not work the way it does when I use icon.run().
The thread appears to keep running, the tray icon stays frozen, and the program doesn't end.
Is there another way to make this work properly?
EDIT:
I found this issue in the official git repository LINK, and it appears to be an already-reported bug. I want to know if a workaround is possible.
Modifying the stop() function further to exit from the thread using os._exit will work if you don't need the calling thread to remain available.
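The os._exit escape hatch can be sketched with a stdlib-only toy, no pystray involved (the worker loop below is a hypothetical stand-in for the stuck detached icon thread). It runs in a child process so the demo itself survives: a non-daemon thread would normally keep the interpreter alive forever, but os._exit tears the whole process down anyway.

```python
import subprocess
import sys
import textwrap

# The child starts a non-daemon thread that never finishes (simulating the
# frozen pystray thread), then hard-exits: os._exit skips thread joins and
# atexit hooks, so the process ends despite the live thread.
child = textwrap.dedent("""
    import threading, time, os

    def worker():
        while True:          # never returns, like the stuck detached loop
            time.sleep(0.1)

    threading.Thread(target=worker, daemon=False).start()
    os._exit(0)              # hard-exit the whole process
""")

result = subprocess.run([sys.executable, "-c", child])
print(result.returncode)  # 0: the process ended despite the live thread
```

The trade-off is exactly what the answer warns about: os._exit performs no cleanup, so use it only when nothing in the calling thread needs to keep running.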

I am having trouble to create a thread and running it through another definition in python

I've been trying to build an application which takes text as input and gives speech as output.
I referred to this site to learn about text-to-speech modules in Python:
https://pythonprogramminglanguage.com/text-to-speech/
When I ran the program it did the job perfectly, but I couldn't use other functions like pause or resume. So I tried to create a new thread for the speech function, so that I can alter the speech whenever I want.
Here is the program:
import threading
import win32com.client as wincl

speak = wincl.Dispatch("SAPI.SpVoice")
t = threading.Event()

def s():
    global t
    t.set()
    data = """This is a story of two tribal Armenian boys who belonged to the
Garoghlanian tribe. """
    s = speak.Speak(data)

t1 = threading.Thread(target=s)
t1.start()
However, I am trying to implement the program in a GUI using tkinter.
I want the application to read the text when the user clicks the button.
Since tkinter's button takes a function as its command, I made a function for initializing and starting the new thread, but it produces an error which I could not interpret or find a solution for.
Here is the program that's producing the error:
import threading
import win32com.client as wincl

speak = wincl.Dispatch("SAPI.SpVoice")
t = threading.Event()

def s():
    global t
    t.set()
    data = """This is a story of two tribal Armenian boys who belonged to the
Garoghlanian tribe. """
    s = speak.Speak(data)

def strt():
    t1 = threading.Thread(target=s)
    t1.start()
Here is the error:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Application\Python\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Application\Python\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\absan\Desktop\Python\Project-SpeakIt\SI-1.py", line 32, in speakITheart
    s=speak.Speak(data)
  File "C:\Users\absan\AppData\Local\Temp\gen_py\3.6\C866CA3A-32F7-11D2-9602-00C04F8EE628x0x5x4.py", line 2980, in Speak
    , 0)
pywintypes.com_error: (-2147352567, 'Exception occurred.', (0, None, None, None, 0, -2147221008), None)
EDIT:
Guys, I somehow found a way to fix it while I was writing this post. I just added these lines to the program:
import pyttsx3
engine = pyttsx3.init()
I really don't know how or why it fixed the error, but it works!!
So this post might be helpful for someone who is facing the same problem.
Cheers!!
I like the solution from the comments. Just include the following in the non-main thread so it can use COM (right before the Speak call):
import pythoncom

pythoncom.CoInitialize()
self.engine.Speak(self.msg)
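The reason those two lines work generalizes: COM keeps per-thread state, so every worker thread must initialize it before first use. Since pythoncom is Windows-only (pywin32), here is a stdlib-only sketch of the same shape, with a thread-local flag as my stand-in for the real CoInitialize call:

```python
import threading

# Thread-local storage: each thread sees its own copy of `_state`.
_state = threading.local()

def ensure_com_initialized():
    # On Windows, pythoncom.CoInitialize() would go here; the flag just
    # makes the once-per-thread requirement visible.
    if not getattr(_state, "initialized", False):
        _state.initialized = True
        return "initialized"
    return "already initialized"

seen = []

def worker():
    # Each worker thread initializes independently before "using COM".
    seen.append(ensure_com_initialized())

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)  # both threads report "initialized": neither inherits the other's state
```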

Scrapy - Reactor not Restartable [duplicate]

This question already has answers here:
ReactorNotRestartable error in while loop with scrapy
(10 answers)
Closed 3 years ago.
with:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
I've always run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
but since I've moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
and started calling the method using class instantiation, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I am getting the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
what is wrong?
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
configure_logging()
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
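The fork-per-run idea above can be reduced to a small stdlib-only helper (the names below are my own, not from Scrapy): each call runs the target in a brand-new process, so process-global state, like Twisted's one-shot reactor, starts clean every time, and an exception raised in the child is shipped back through the queue and re-raised in the parent, just as in the answer's `raise result`.

```python
from multiprocessing import Process, Queue

def _child(q, fn, args):
    # Runs inside the fresh process; report either the result or the error.
    try:
        q.put(("ok", fn(*args)))
    except Exception as e:
        q.put(("err", e))

def run_in_fresh_process(fn, *args):
    """Run fn(*args) in a new process and return its result."""
    q = Queue()
    p = Process(target=_child, args=(q, fn, args))
    p.start()
    status, payload = q.get()
    p.join()
    if status == "err":
        raise payload          # surface the child's exception in the parent
    return payload

def double(x):
    return 2 * x

if __name__ == "__main__":
    # Two runs, two fresh interpreters; no global state survives between them.
    print(run_in_fresh_process(double, 21))
    print(run_in_fresh_process(double, 4))
```

Note that under the `spawn` start method (the default on Windows and macOS), `fn` must be defined at module level so it can be pickled for the child.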
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.
0) pip install crochet
1) add from crochet import setup
2) call setup() at the top of the file
3) remove these 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error and spent 4+ hours solving it, reading all the questions here about it. I finally found that one - and I'm sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last 2 lines in this code:
# imports needed by the snippet below
from importlib import import_module
from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)           # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()                 # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   # from Scrapy docs
    crawler.crawl(spiderObj)                          # from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after scraping finishes, to select another spider and run it again.
Hope this helps somebody, as it helped me :)
As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
As some people pointed out already: You shouldn't need to restart the reactor.
Ideally if you want to chain your processes (crawl1 then crawl2 then crawl3) you simply add callbacks.
For example, I've been using this loop spider that follows this pattern:
1. Crawl A
2. Sleep N
3. goto 1
And this is how it looks in scrapy:
import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d

def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()

if __name__ == '__main__':
    loop_crawl()
To explain the process more: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (which schedules another crawl).
$ python endless_crawl.py
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
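The same callback-chaining shape (one job finishes, its callback schedules the next on the already-running loop) can be mirrored with a stdlib asyncio sketch; `crawl` here is a hypothetical stand-in for `runner.crawl`, and stopping after three rounds replaces the endless loop:

```python
import asyncio

finished = []

async def crawl(n):
    # stand-in for the real scraping work
    finished.append(n)
    print(f'crawl {n} done')

def schedule(loop, n, stop_at):
    # Mirror of the Twisted pattern: when one crawl's task resolves,
    # its done-callback schedules the next crawl on the same loop.
    task = loop.create_task(crawl(n))
    if n < stop_at:
        task.add_done_callback(lambda _: schedule(loop, n + 1, stop_at))
    else:
        task.add_done_callback(lambda _: loop.stop())

loop = asyncio.new_event_loop()
schedule(loop, 1, 3)
loop.run_forever()   # plays the role of reactor.run(): started once, never restarted
loop.close()
```

The key property is the same as in the Twisted version: the loop is started exactly once, and all subsequent work is scheduled onto it instead of restarting it.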
The mistake is in this code:
def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here
web_crawler() returns two results, and to obtain them this code tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Obtaining both results by running one single process, and storing them in a tuple, is the way to go here:
def __call__(self):
    result1, result2 = test.web_crawler()
This solved my problem. Put the code below after reactor.run() or process.start():
import os
import sys
import time

time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)
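To see what that restart trick actually does, here is a stdlib-only toy (run in a child process so the demo itself survives the exec): os.execl replaces the running interpreter with a brand-new one, which is why any already-started reactor is simply gone afterwards.

```python
import subprocess
import sys
import textwrap

# The child prints once, then replaces itself with a fresh interpreter via
# os.execl: the same mechanism as re-running sys.argv in the answer above.
child = textwrap.dedent("""
    import os, sys
    print("first run", flush=True)   # flush: exec discards unwritten buffers
    os.execl(sys.executable, sys.executable, "-c",
             "print('fresh interpreter after exec')")
""")

out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True)
print(out.stdout)
```

The caveat: exec never returns, so any code after the os.execl line (cleanup, return values) will not run, which makes this a last-resort workaround rather than a fix.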

Multithreading with matplotlib and wxpython

Brief description on what I'm trying to achieve:
I'm working on an analytics software built using Python, wxPython, and matplotlib. I'm trying to implement a function where the program can plot the results after performing some analytical calculations. At the moment, the program freezes while it's performing the calculations (and the calculation time can take up to 10 seconds depending on the amount of data), so I'm trying to use threading to create a non-blocking program and improve the user experience.
Problem I'm getting
I keep getting this error :
(PyAssertionError: C++ assertion "hdcDst && hdcSrc" failed at ......\src\msw\dc.cpp(2559) in AlphaBlt(): AlphaBlt(): invalid HDC)
and googling hasn't really help with identifying the cause.
I'll post the full traceback at the bottom of the post.
Here's my code:
import wx
import time
import matplotlib.pyplot as plt
from wx.lib.pubsub import Publisher as pub
from threading import Thread

def plotgraph(x, y, sleeptime):
    plt.plot(x, y)
    # Simulate long process using time.sleep
    time.sleep(sleeptime)
    # Send out a message once process is completed
    pub.sendMessage('PLOT', 'empty')

class listener():
    def __init__(self, name):
        self.name = name
        # Listens to message
        pub.subscribe(self.Plot, 'PLOT')

    def Plot(self, message):
        print self.name
        plt.show()
        print 'printed'

waiting = listener('Bob')
t1 = Thread(target=plotgraph, args=([1, 2, 3], [1, 2, 3], 5))
t1.start()
t2 = Thread(target=plotgraph, args=([1, 2, 3], [1, 2, 3], 3))
t2.start()
Basically, the user clicks an icon on the GUI, and that triggers a function to perform some analytical calculation, simulated by 'plotgraph()' here. At the moment, without threads, plotgraph() blocks my entire program, so I'm trying to use threads to perform the calculations and free up my GUI.
However, when I tried to plot within the thread, i.e. call plt.show() in plotgraph(), the plot appeared and then disappeared again. When I clicked the button on the GUI to spawn the thread a second time, I got the same error.
So I've tried to work around it by sending a message after the thread has ended, so that plt.show() happens outside the thread, but I'm still getting the same error.
I can't seem to be able to find a similar error online, except for one thread posted in 2008. If anyone could help that would be awesome!
In a nutshell
I need a way to implement sort of a callback function that allows me to perform the analytic calculation in a thread, then plot the graph once the calculations are completed to free up my GUI. It'd be great if someone could explain to me what's wrong here, or could suggest an alternative method to do it. Thanks very much!!
Here's the full traceback:
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\App\appdata\canopy-1.5.3123.win-x86\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\App\appdata\canopy-1.5.3123.win-x86\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-5-0cb01f87e97a>", line 13, in plotgraph
    pub.sendMessage('PLOT','empty')
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\wx\lib\pubsub.py", line 811, in sendMessage
    self.__topicTree.sendMessage(aTopic, message, onTopicNeverCreated)
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\wx\lib\pubsub.py", line 498, in sendMessage
    deliveryCount += node.sendMessage(message)
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\wx\lib\pubsub.py", line 336, in sendMessage
    listener(message)
  File "<ipython-input-5-0cb01f87e97a>", line 24, in Plot
    plt.show()
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\matplotlib\pyplot.py", line 155, in show
    return _show(*args, **kw)
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\matplotlib\backend_bases.py", line 154, in __call__
    manager.show()
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\matplotlib\backends\backend_wx.py", line 1414, in show
    self.canvas.draw()
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\matplotlib\backends\backend_wxagg.py", line 50, in draw
    self.gui_repaint(drawDC=drawDC)
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\matplotlib\backends\backend_wx.py", line 911, in gui_repaint
    drawDC.DrawBitmap(self.bitmap, 0, 0)
  File "C:\Users\chaishen\AppData\Local\Enthought\Canopy32\User\lib\site-packages\wx\_gdi.py", line 3460, in DrawBitmap
    return _gdi_.DC_DrawBitmap(*args, **kwargs)
PyAssertionError: C++ assertion "hdcDst && hdcSrc" failed at ..\..\src\msw\dc.cpp(2559) in AlphaBlt(): AlphaBlt(): invalid HDC
I think what you need is the wx.PyEventBinder.
It works like this:
import threading

import wx

anEVT_CALCULATED = wx.NewEventType()
EVT_CALCULATED = wx.PyEventBinder(anEVT_CALCULATED, 1)

def onCalculate(self, event):  # this is your click
    calc_thread = CalculatorThread(self, params)
    calc_thread.start()
    return

def onCalculated(self, event):
    ''' this is where your thread comes back '''
    self.doSomeThingLikePlotting(event.resultdata)

class CalcEvent(wx.PyCommandEvent):
    ''' Event to signal that the thread has calculated '''
    def __init__(self, etype, eid, resultdata):
        wx.PyCommandEvent.__init__(self, etype, eid)
        self.resultdata = resultdata

class CalculatorThread(threading.Thread):
    ''' This is the thread doing your calculation and handing it back '''
    def __init__(self, listener, params):
        threading.Thread.__init__(self)
        self.listener = listener
        self.params = params

    def run(self):
        resultdata = calculate(self.params)  # this is your calculation
        event = CalcEvent(anEVT_CALCULATED, -1, resultdata=resultdata)
        wx.PostEvent(self.listener, event)
        return

And of course you need to add one line to your __init__ to bind the event:
self.Bind(EVT_CALCULATED, self.onCalculated)
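The hand-back pattern works outside wx too. In this stdlib-only sketch, a queue.Queue plays the role of wx.PostEvent and the main thread draining it plays the role of the event loop delivering the calculated event (`calculate` is a placeholder for the real analysis):

```python
import queue
import threading

def calculate(params):
    return sum(params)              # placeholder for the slow analysis

def worker(params, result_queue):
    # The worker never touches the GUI; it only posts its finished result.
    result_queue.put(calculate(params))

results = queue.Queue()
t = threading.Thread(target=worker, args=([1, 2, 3], results))
t.start()
result = results.get()              # the "event" arriving back on the main thread
t.join()
print(result)
```

The design point is the same in both versions: all drawing happens on the main (GUI) thread, and the worker communicates only by handing data back.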

cannot import scrapy modules as library

I'm trying to run spiders from a Python script, following the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/practices.html
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent
But python just cannot import the module, the error looks like this:
Traceback (most recent call last):
...
  File "aappp/scrapy.py", line 1, in <module>
    from scrapy.crawler import Crawler
ImportError: No module named crawler
The issue is briefly mentioned in the FAQ of the Scrapy documentation, but it doesn't help me much.
Have you tried doing it this way?
from scrapy.project import crawler
(That's how it's done on http://doc.scrapy.org/en/latest/faq.html - looks like they already answered your question there.)
It also gives a more recent way of doing it and calls this previous method deprecated:
"This way to access the crawler object is deprecated, the code should be ported to use from_crawler class method, for example:
class SomeExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        o.crawler = crawler
        return o
"
