Twisted: Wait for a deferred to 'finish' - python

How can I 'throw' Deferreds into the reactor so they get handled somewhere down the road?
Situation
I have 2 programs running on localhost.
A twisted jsonrpc service (localhost:30301)
A twisted webservice (localhost:4000)
When someone connects to the webservice, it needs to send a query to the jsonrpc service, wait for it to come back with a result, then display the result in the web browser of the user (returning the value of the jsonrpc call).
I can't seem to figure out how to return the value of the deferred jsonrpc call. When I visit the webservice with my browser I get an HTML 500 error (did not return bytes) and Value: <Deferred at 0x3577b48>.
It returns the deferred object and not the actual value of the callback.
I've been looking around for a couple of hours and tried a lot of different variations before asking.
from txjsonrpc.web.jsonrpc import Proxy
from twisted.web import resource
from twisted.web.server import Site
from twisted.internet import reactor

class Rpc():
    def __init__(self, child):
        self._proxy = Proxy('http://127.0.0.1:30301/%s' % child)

    def execute(self, function):
        return self._proxy.callRemote(function)

class Server(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        rpc = Rpc('test').execute('test')

        def test(result):
            return '<h1>%s</h1>' % result

        rpc.addCallback(test)
        return rpc

site = Site(Server())
reactor.listenTCP(4000, site)
print 'Running'
reactor.run()

The problem you're having here is that web's IResource is a very old interface, predating even Deferred.
The quick solution to your problem is to use Klein, which provides a convenient high-level wrapper around twisted.web for writing web applications and, among other things, adds handling for Deferreds throughout the API.
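For instance, a minimal sketch of the Klein version (assuming the same Proxy from your code; untested against your service):

from klein import Klein
from txjsonrpc.web.jsonrpc import Proxy

app = Klein()

@app.route('/')
def home(request):
    # Klein waits for this Deferred and renders its eventual result.
    d = Proxy('http://127.0.0.1:30301/test').callRemote('test')
    d.addCallback(lambda result: '<h1>%s</h1>' % result)
    return d

app.run('localhost', 4000)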
The slightly more roundabout way to address it is to read the chapter of the Twisted documentation that is specifically about asynchronous responses in twisted.web.
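That chapter's pattern boils down to returning NOT_DONE_YET and finishing the request from a callback. A minimal sketch of how that could look applied to your Server class (an untested adaptation, not the documentation's exact code):

from twisted.web.server import NOT_DONE_YET

class Server(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        d = Rpc('test').execute('test')

        def on_result(result):
            request.write('<h1>%s</h1>' % result)
            request.finish()

        def on_error(failure):
            request.setResponseCode(500)
            request.write('<h1>%s</h1>' % failure.getErrorMessage())
            request.finish()

        d.addCallbacks(on_result, on_error)
        # Tell twisted.web the response will be produced later, from the callback.
        return NOT_DONE_YET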

Related

tornado one handler blocks for another

Using python/tornado I wanted to set up a little "trampoline" server that allows two devices to communicate with each other in a RESTish manner. There are probably vastly superior/simpler "off the shelf" ways to do this. I'd welcome those suggestions, but I still feel it would be educational to figure out how to do my own using tornado.
Basically, the idea was that I would have the device in the role of server doing a longpoll with a GET. The client device would POST to the server, at which point the POST body would be transferred as the response of the blocked GET. The POST would then block until the server side does a PUT with the response, which is transferred to the blocked POST and returned to the device. I thought maybe I could do this with tornado.queues. But that appears to not have worked out. My code:
import tornado
import tornado.web
import tornado.httpserver
import tornado.queues

ToServerQueue = tornado.queues.Queue()
ToClientQueue = tornado.queues.Queue()

class Query(tornado.web.RequestHandler):
    def get(self):
        toServer = ToServerQueue.get()
        self.write(toServer)

    def post(self):
        toServer = self.request.body
        ToServerQueue.put(toServer)
        toClient = ToClientQueue.get()
        self.write(toClient)

    def put(self):
        ToClientQueue.put(self.request.body)
        self.write(bytes())

services = tornado.web.Application([(r'/query', Query)], debug=True)
services.listen(49009)
tornado.ioloop.IOLoop.instance().start()
Unfortunately, the ToServerQueue.get() does not actually block until the queue has an item, but rather returns a tornado.concurrent.Future, which is not a legal value to pass to the self.write() call.
I guess my general question is twofold:
1) How can one HTTP verb invocation (e.g. get, put, post, etc) block and then be signaled by another HTTP verb invocation.
2) How can I share data from one invocation to another?
I've only really scratched the simple/straightforward use cases of making little REST servers with tornado. I wonder if the coroutine stuff is what I need, but haven't found a good tutorial/example of that to help me see the light, if that's indeed the way to go.
1) How can one HTTP verb invocation (e.g. get, put, post, etc) block and then be signaled by another HTTP verb invocation?
2) How can I share data from one invocation to another?
A new RequestHandler object is created for every request. So you need some coordinator, e.g. queues or locks with a state object (in your case it would be re-implementing a queue).
tornado.queues are queues for coroutines. Queue.get, Queue.put, and Queue.join return Future objects that need to be "resolved" - the scheduled task completes either with success or with an exception. To wait until a future is resolved you should yield it (just like in the doc examples of tornado.queues). The verb methods also need to be decorated with tornado.gen.coroutine.
import tornado.gen

class Query(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        toServer = yield ToServerQueue.get()
        self.write(toServer)

    @tornado.gen.coroutine
    def post(self):
        toServer = self.request.body
        yield ToServerQueue.put(toServer)
        toClient = yield ToClientQueue.get()
        self.write(toClient)

    @tornado.gen.coroutine
    def put(self):
        yield ToClientQueue.put(self.request.body)
        self.write(bytes())
The GET request will last (wait in a non-blocking manner) until something is available on the queue (or until a timeout, which can be passed as a Queue.get argument).
tornado.queues.Queue also provides get_nowait (there is put_nowait as well), which does not have to be yielded - it either returns an item from the queue immediately or throws an exception.
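For example, a rough sketch of both variants (the 30-second timeout is an arbitrary illustration):

import datetime
import tornado.gen
import tornado.queues

@tornado.gen.coroutine
def get_or_give_up():
    try:
        # Wait at most 30 seconds for an item before giving up.
        item = yield ToServerQueue.get(timeout=datetime.timedelta(seconds=30))
    except tornado.gen.TimeoutError:
        item = None
    raise tornado.gen.Return(item)

def get_if_available():
    try:
        # Returns an item immediately, or raises QueueEmpty if nothing is queued.
        return ToServerQueue.get_nowait()
    except tornado.queues.QueueEmpty:
        return None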

Building a RESTful Flask API for Scrapy

The API should allow arbitrary HTTP GET requests containing the URLs the user wants scraped, and then Flask should return the results of the scrape.
The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful scrapy API up on Heroku, and what I have so far is all I can think of.
Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?
from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)

def scrape_it(url):
    items = []

    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop())  # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0)  # the script will block here until the crawling is finished

    return items

@app.route('/scrape/<path:url>')
def scrape(url):
    ret = scrape_it(url)
    return json.dumps(ret, ensure_ascii=False, encoding='utf8')

if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080
    app.run(debug=True, host='0.0.0.0', port=int(PORT))
I think there is no good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started/stopped more than once in a single thread.
Let's assume there is no problem with the Twisted reactor and you can start and stop it. That won't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.
I think the way to go is to create an API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.
Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is also not hard because you can make it use the Twisted event loop.
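For illustration, a rough sketch (an assumption of how this could look, not a drop-in replacement for the code above) of a Klein endpoint around CrawlerRunner that collects items via the item_scraped signal:

import json
from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from n_grams.spiders.n_gram_spider import NGramsSpider

app = Klein()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []
    runner = CrawlerRunner()
    crawler = runner.create_crawler(NGramsSpider)
    # Collect items as the spider scrapes them.
    crawler.signals.connect(lambda item, **kwargs: items.append(dict(item)),
                            signal=signals.item_scraped)
    d = runner.crawl(crawler, [url])
    # The reactor keeps serving other requests while the crawl runs;
    # the response is sent once the crawl's Deferred fires.
    d.addCallback(lambda _: json.dumps(items, ensure_ascii=False))
    return d

app.run('0.0.0.0', 8080)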
There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
As an example of Scrapy-Tornado integration, check Arachnado - here is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.
If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of the Scrapy efficiency; if you go this route, you can use requests + BeautifulSoup as well.
I worked on a similar project last week; it's an SEO service API, and my workflow was like this:
The client sends a request to the Flask-based server with a URL to scrape, and a callback URL to notify the client when scraping is done (the client here is another web app).
Run Scrapy in the background using Celery. The spider will save the data to the database.
The background service will notify the client by calling the callback URL when the spider is done.
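A rough sketch of that workflow (the task module, broker URL, spider name, and callback payload below are illustrative assumptions, not the original project's code):

import subprocess
import requests
from celery import Celery
from flask import Flask, request, jsonify

celery_app = Celery('seo_api', broker='redis://localhost:6379/0')
app = Flask(__name__)

@celery_app.task
def crawl_and_notify(url, callback_url):
    # Run the spider in its own process so the Twisted reactor starts and
    # stops cleanly on every crawl; the spider's pipeline saves to the database.
    subprocess.check_call(['scrapy', 'crawl', 'my_spider', '-a', 'start_url=%s' % url])
    # Tell the client web app that scraping is finished.
    requests.post(callback_url, json={'url': url, 'status': 'done'})

@app.route('/scrape', methods=['POST'])
def scrape():
    payload = request.get_json()
    crawl_and_notify.delay(payload['url'], payload['callback_url'])
    return jsonify({'status': 'queued'})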

Twisted XmlStream: How to connect to events?

I would like to implement a Twisted server that expects XML requests and sends XML responses in return:
<request type='type 01'><content>some request content</content></request>
<response type='type 01'><content>some response content</content></response>
<request type='type 02'><content>other request content</content></request>
<response type='type 02'><content>other response content</content></response>
I have created a Twisted client & server before that exchanged simple strings and tried to extend that to using XML, but I can't seem to figure out how to set it all up correctly.
client.py:
#!/usr/bin/env python
# encoding: utf-8
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint, connectProtocol
from twisted.words.xish.domish import Element, IElement
from twisted.words.xish.xmlstream import XmlStream

class XMLClient(XmlStream):

    def sendObject(self, obj):
        if IElement.providedBy(obj):
            print "[TX]: %s" % obj.toXml()
        else:
            print "[TX]: %s" % obj
        self.send(obj)

def gotProtocol(p):
    request = Element((None, 'request'))
    request['type'] = 'type 01'
    request.addElement('content').addContent('some request content')
    p.sendObject(request)

    request = Element((None, 'request'))
    request['type'] = 'type 02'
    request.addElement('content').addContent('other request content')
    reactor.callLater(1, p.sendObject, request)

    reactor.callLater(2, p.transport.loseConnection)

endpoint = TCP4ClientEndpoint(reactor, '127.0.0.1', 12345)
d = connectProtocol(endpoint, XMLClient())
d.addCallback(gotProtocol)

from twisted.python import log
d.addErrback(log.err)

reactor.run()
As with the earlier string-based approach I mentioned, the client idles until CTRL+C. Once I have this going, it will draw a lot of inspiration from the Twisted XMPP example.
server.py:
#!/usr/bin/env python
# encoding: utf-8
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ServerEndpoint
from twisted.words.xish.xmlstream import XmlStream, XmlStreamFactory
from twisted.words.xish.xmlstream import STREAM_CONNECTED_EVENT, STREAM_START_EVENT, STREAM_END_EVENT

REQUEST_CONTENT_EVENT = intern("//request/content")

class XMLServer(XmlStream):

    def __init__(self):
        XmlStream.__init__(self)
        self.addObserver(STREAM_CONNECTED_EVENT, self.onConnected)
        self.addObserver(STREAM_START_EVENT, self.onRequest)
        self.addObserver(STREAM_END_EVENT, self.onDisconnected)
        self.addObserver(REQUEST_CONTENT_EVENT, self.onRequestContent)

    def onConnected(self, xs):
        print 'onConnected(...)'

    def onDisconnected(self, xs):
        print 'onDisconnected(...)'

    def onRequest(self, xs):
        print 'onRequest(...)'

    def onRequestContent(self, xs):
        print 'onRequestContent(...)'

class XMLServerFactory(XmlStreamFactory):
    protocol = XMLServer

endpoint = TCP4ServerEndpoint(reactor, 12345, interface='127.0.0.1')
endpoint.listen(XMLServerFactory())
reactor.run()
client.py output:
TX [127.0.0.1]: <request type='type 01'><content>some request content</content></request>
TX [127.0.0.1]: <request type='type 02'><content>other request content</content></request>
server.py output:
onConnected(...)
onRequest(...)
onDisconnected(...)
My questions:
How do I subscribe to an event fired when the server encounters a certain XML tag ? The //request/content XPath query seems ok to me, but onRequestContent(...) does not get called :-(
Is subclassing XmlStream and XmlStreamFactory a reasonable approach at all ? It feels weird because XMLServer subscribes to events sent by its own base class and is then passed itself (?) as xs parameter ?!? Should I rather make XMLServer an ordinary class and have an XmlStream object as class member ? Is there a canonical approach ?
How would I add an error handler to the server like addErrback(...) in the client ? I'm worried exceptions get swallowed (happened before), but I don't see where to get a Deferred from to attach it to...
Why does the server by default close the connection after the first request ? I see XmlStream.onDocumentEnd(...) calling loseConnection(); I could override that method, but I wonder if there's a reason for the closing I don't see. Is it not the 'normal' approach to leave the connection open until all communication necessary for the moment has been carried out ?
I hope this post isn't considered too specific; talking XML over the network is commonplace, but despite searching for a day and a half, I was unable to find any Twisted XML server examples. Maybe I can turn this into a jumpstart for anyone with similar questions in the future...
This is mostly a guess, but as far as I know you need to open the stream by sending a stanza without closing it.
In your example when you send <request type='type 01'><content>some request content</content></request> the server sees the <request> stanza as the start document but then you send </request> and the server will see that as the end document.
Basically, your server consumes <request> as the start document and that's also why your xpath, //request/content, will not match, because all that's left of the element is <content>...</content>.
Try sending something like <stream> from the client first, then the two requests and then </stream>.
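For example, a hedged tweak to the client's gotProtocol along those lines (untested; the <stream> wrapper element is just an illustration):

def gotProtocol(p):
    # Open the stream document first and keep it open.
    p.send('<stream>')

    request = Element((None, 'request'))
    request['type'] = 'type 01'
    request.addElement('content').addContent('some request content')
    p.sendObject(request)

    request = Element((None, 'request'))
    request['type'] = 'type 02'
    request.addElement('content').addContent('other request content')
    reactor.callLater(1, p.sendObject, request)

    # Close the document only when all requests have been sent.
    reactor.callLater(2, p.send, '</stream>')
    reactor.callLater(3, p.transport.loseConnection)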
Also, subclassing XmlStream is fine as long as you make sure you don't unintentionally override any of its methods.
The "only" relevant component of XmlStream is the SAX parser. Here's how I've implemented an asynchronous SAX parser using XmlStream and only the XML parsing functions:
server.py
from twisted.words.xish.domish import Element
from twisted.words.xish.xmlstream import XmlStream

class XmlServer(XmlStream):

    def __init__(self):
        XmlStream.__init__(self)    # possibly unnecessary

    def dataReceived(self, data):
        """ Overload this function to simply pass the incoming data into the XML parser """
        try:
            self.stream.parse(data)     # self.stream gets created after self._initializeStream() is called
        except Exception as e:
            self._initializeStream()    # reinit the DOM so other XML can be parsed

    def onDocumentStart(self, elementRoot):
        """ The root tag has been parsed """
        print('Root tag: {0}'.format(elementRoot.name))
        print('Attributes: {0}'.format(elementRoot.attributes))

    def onElement(self, element):
        """ Children/Body elements parsed """
        print('\nElement tag: {0}'.format(element.name))
        print('Element attributes: {0}'.format(element.attributes))
        print('Element content: {0}'.format(str(element)))

    def onDocumentEnd(self):
        """ Parsing has finished, you should send your response now """
        response = Element(('', 'response'))
        response['type'] = 'type 01'
        response.addElement('content', content='some response content')
        self.send(response.toXml())
Then you create a Factory class that will produce this Protocol (which you've demonstrated you're capable of). Basically, you will get all your information from the XML in the onDocumentStart and onElement functions, and when you've reached the end (i.e. onDocumentEnd) you will send a response based on the parsed information. Also, be sure you call self._initializeStream() after parsing each XML message or else you'll get an exception. That should serve as a good skeleton for you.
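A minimal sketch of that wiring, assuming the XmlServer protocol above:

from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ServerEndpoint
from twisted.internet.protocol import Factory

class XmlServerFactory(Factory):

    def buildProtocol(self, addr):
        # A fresh XmlServer (and hence a fresh parser) per connection.
        return XmlServer()

endpoint = TCP4ServerEndpoint(reactor, 12345, interface='127.0.0.1')
endpoint.listen(XmlServerFactory())
reactor.run()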
My answers to your questions:
Don't know :)
It's very reasonable. However I usually just subclass XmlStream (which simply inherits from Protocol) and then use a regular Factory object.
This is a good thing to worry about when using Twisted (+1 for you). Using the approach above, you could fire callbacks/errbacks as you parse and hit an element, or wait till you get to the end of the XML and then fire your callbacks to your heart's desire. I hope that makes sense :/
I've wondered this too actually. I think it has something to do with the applications and protocols that use the XmlStream object (such as Jabber and IRC). Just overload onDocumentEnd and make it do what you want it to do. That's the beauty of OOP.
Reference:
xml.sax
iterparse
twisted.web.sux: Twisted XML SAX parser. This is actually what XmlStream uses to parse XML.
xml.etree.cElementTree.iterparse: Here's another Stackoverflow question - ElementTree iterparse strategy
iterparse is throwing 'no element found: line 1, column 0' and I'm not sure why - I've asked a similar question :)
PS
Your problem is quite common and very simple to solve (at least in my opinion), so don't kill yourself trying to learn the Event Dispatcher model. Actually it seems you have a good handle on callbacks and errbacks (aka Deferreds), so I suggest you stick to those and avoid the dispatcher.

Crossbar.io: How to publish a message on a topic using a Django service?

I just started using Crossbar.io to implement a live stats page. I've looked at a lot of code examples, but I can't figure out how to do this:
I have a Django service (to avoid confusion, you can assume I'm talking about a function in views.py) and I'd like it to publish messages on a specific topic whenever it gets called. I've seen these approaches: (1) extending ApplicationSession and (2) using an Application instance that is run.
Neither of them works for me, because the Django service doesn't live inside a class and is not executed as a stand-alone Python file either, so I can't find a way to call the "publish" method (which is the only thing I want to do on the server side).
I tried to get an instance of StatsBackend, which extends ApplicationSession, and publish something... But StatsBackend._instance is always None (even when I execute 'crossbar start' and StatsBackend.__init__() is called).
StatsBackend.py:
from twisted.internet.defer import inlineCallbacks
from autobahn import wamp
from autobahn.twisted.wamp import ApplicationSession

class StatsBackend(ApplicationSession):
    _instance = None

    def __init__(self, config):
        ApplicationSession.__init__(self, config)
        StatsBackend._instance = self

    @classmethod
    def update_stats(cls, amount):
        if cls._instance:
            cls._instance.publish('com.xxx.statsupdate', {'amount': amount})

    @inlineCallbacks
    def onJoin(self, details):
        res = yield self.register(self)
        print("CampaignStatsBackend: {} procedures registered!".format(len(res)))
test.py:
import StatsBackend
StatsBackend.update_stats(100) #Doesn't do anything, StatsBackend._instance is None
Django is a blocking WSGI application, and that does not blend well with AutobahnPython, which is non-blocking (runs on top of Twisted or asyncio).
However, Crossbar.io has a built-in REST bridge, which includes an HTTP Pusher to which you can submit events via any HTTP/POST-capable client. Crossbar.io will forward those events to regular WAMP subscribers (e.g. via WebSocket in real time).
Crossbar.io also comes with a complete application template to demonstrate the above functionality. To try it:
cd ~/test1
crossbar init --template pusher
crossbar start
Open your browser at http://localhost:8080 (open the JS console) and, in a second terminal, run:
curl -H "Content-Type: application/json" \
-d '{"topic": "com.myapp.topic1", "args": ["Hello, world"]}' \
http://127.0.0.1:8080/push
You can then do the publish from within a blocking application like Django.
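For example, a hedged sketch of such a publish from Django view code (the /push URL and payload shape follow the curl example above; using the requests library here is an assumption):

import json
import requests

def publish_stats_update(amount):
    # Blocking HTTP POST; Crossbar.io forwards it as a WAMP publish
    # to subscribers of the topic.
    requests.post(
        'http://127.0.0.1:8080/push',
        data=json.dumps({
            'topic': 'com.xxx.statsupdate',
            'args': [{'amount': amount}],
        }),
        headers={'Content-Type': 'application/json'},
    )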
I found what I needed: it is possible to do an HTTP POST request to publish on a topic.
You can read the doc for more information: https://github.com/crossbario/crossbar/wiki/Using-the-REST-to-WebSocket-Pusher

Tornado celery integration hacks

Since nobody has provided a solution to this post, plus the fact that I desperately need a workaround, here is my situation and some abstract solutions/ideas for debate.
My stack:
Tornado
Celery
MongoDB
Redis
RabbitMQ
My problem: Find a way for Tornado to dispatch a celery task ( solved ) and then asynchronously gather the result ( any ideas? ).
Scenario 1: (request/response hack plus webhook)
Tornado receives a (user)request, then saves in local memory (or in Redis) a { jobID : (user)request} to remember where to propagate the response, and fires a celery task with jobID
When celery completes the task, it performs a webhook at some url and tells tornado that this jobID has finished ( plus the results )
Tornado retrieves the (user)request and forwards a response to the (user)
Can this happen? Does it have any logic?
Scenario 2: (tornado plus long-polling)
Tornado dispatches the celery task and returns some primary json data to the client (jQuery)
jQuery does some long-polling upon receipt of the primary json, say every x microseconds, and tornado replies according to some database flag. When the celery task completes, this database flag is set to True, and then the jQuery "loop" is finished.
Is this efficient?
Any other ideas/schemas?
My solution involves polling from tornado to celery:
class CeleryHandler(tornado.web.RequestHandler):

    @tornado.web.asynchronous
    def get(self):
        task = yourCeleryTask.delay(**kwargs)

        def check_celery_task():
            if task.ready():
                self.write({'success': True})
                self.set_header("Content-Type", "application/json")
                self.finish()
            else:
                tornado.ioloop.IOLoop.instance().add_timeout(
                    datetime.timedelta(0.00001), check_celery_task)

        tornado.ioloop.IOLoop.instance().add_timeout(
            datetime.timedelta(0.00001), check_celery_task)
Here is a post about it.
Here is our solution to the problem. Since we look for results in several handlers in our application, we made the celery lookup a mixin class.
This also makes the code more readable with the tornado.gen pattern.
from functools import partial

class CeleryResultMixin(object):
    """
    Adds a callback function which could wait for the result asynchronously
    """
    def wait_for_result(self, task, callback):
        if task.ready():
            callback(task.result)
        else:
            # TODO: Is this going to be too demanding on the result backend ?
            # Probably there should be a timeout before each add_callback
            tornado.ioloop.IOLoop.instance().add_callback(
                partial(self.wait_for_result, task, callback)
            )

class ARemoteTaskHandler(CeleryResultMixin, tornado.web.RequestHandler):
    """Execute a task asynchronously over a celery worker.
    Wait for the result without blocking
    When the result is available send it back
    """
    @tornado.web.asynchronous
    @tornado.web.authenticated
    @tornado.gen.engine
    def post(self):
        """Test the provided Magento connection
        """
        task = expensive_task.delay(
            self.get_argument('somearg'),
        )

        result = yield tornado.gen.Task(self.wait_for_result, task)

        self.write({
            'success': True,
            'result': result.some_value
        })
        self.finish()
I stumbled upon this question and hitting the results backend repeatedly did not look optimal to me. So I implemented a Mixin similar to your Scenario 1 using Unix Sockets.
It notifies Tornado as soon as the task finishes (to be accurate, as soon as the next task in the chain runs) and only hits the results backend once. Here is the link.
Now, https://github.com/mher/tornado-celery comes to the rescue...
class GenAsyncHandler(web.RequestHandler):

    @asynchronous
    @gen.coroutine
    def get(self):
        response = yield gen.Task(tasks.sleep.apply_async, args=[3])
        self.write(str(response.result))
        self.finish()
