Twisted multiple concurrent or async streams - python

I'm writing an application in Python using the twisted.web framework to stream video using HTML5.
The videos are being served via static.File('pathtovideo').render_GET().
The problem is that only one video can be streamed at a time, as it ties up the entire process.
Is there any way to make the streaming asynchronous or non-blocking, whichever term is appropriate here?
I tried using deferToThread but that still tied up the process.
This is the class I'm currently using, where Movie is an ORM table and mid is just an id to an arbitrary row.
from twisted.internet.threads import deferToThread
from twisted.web.resource import Resource, NoResource
from twisted.web.server import NOT_DONE_YET
from twisted.web.static import File

class MovieStream(Resource):
    isLeaf = True

    def __init__(self, mid):
        Resource.__init__(self)
        self.mid = mid

    def render_GET(self, request):
        movie = Movie.get(Movie.id == self.mid)
        if movie:
            deferred = deferToThread(self._start_stream, path=movie.source, request=request)
            deferred.addCallback(self._finish_stream, request)
            return NOT_DONE_YET
        else:
            return NoResource()

    def _start_stream(self, path, request):
        stream = File(path)
        return stream.render_GET(request)

    def _finish_stream(self, ret, request):
        request.finish()

The part of this code that looks like it blocks is actually the Movie.get call.
It is incorrect to call _start_stream with deferToThread, because _start_stream uses Twisted APIs (File, and whatever File.render_GET uses), and it is illegal to use Twisted APIs anywhere except in the reactor thread (in other words, it is illegal to use them in a function you call with deferToThread).
Fortunately, you can just delete the use of deferToThread to fix that bug.
To fix the problem that Movie.get blocks, you'll need to find a way to access your database asynchronously, perhaps using deferToThread(Movie.get, Movie.id == self.mid) - if the database library that implements Movie.get is thread-safe, that is.
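Putting those two fixes together, the corrected resource might look something like this (a minimal sketch, assuming Movie.get is thread-safe; note that static.File's render_GET takes care of finishing the request itself):

from twisted.internet.threads import deferToThread
from twisted.web.resource import Resource
from twisted.web.server import NOT_DONE_YET
from twisted.web.static import File

class MovieStream(Resource):
    isLeaf = True

    def __init__(self, mid):
        Resource.__init__(self)
        self.mid = mid

    def render_GET(self, request):
        # Only the (potentially blocking) database lookup runs in a thread.
        d = deferToThread(Movie.get, Movie.id == self.mid)
        # The callback runs back in the reactor thread, where Twisted APIs are legal.
        d.addCallback(self._serve, request)
        return NOT_DONE_YET

    def _serve(self, movie, request):
        # static.File streams the file and finishes the request when done.
        File(movie.source).render_GET(request)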
For what it's worth, you can also avoid the render_GET hijinks by moving your database lookup logic earlier in the resource traversal hierarchy.
For example, I imagine your URLs look something like /foo/bar/<movie id>. In this case, the resource at /foo/bar gets asked for <movie id> children. If you implement that lookup like this:
from twisted.internet.threads import deferToThread
from twisted.web.resource import Resource
from twisted.web.util import DeferredResource

class MovieContainer(Resource):
    def getChild(self, movieIdentifier, request):
        condition = (Movie.id == movieIdentifier)
        getting = deferToThread(Movie.get, condition)
        return DeferredResource(getting)
(assuming here that Movie.get is thread-safe) then you'll essentially be done.
Resource traversal will conclude with the object constructed by DeferredResource(getting) and when that object is rendered it will take care of waiting for getting to have a result (for the Deferred to "fire", in the lingo) and of calling the right method on it, eg render_GET, to produce a response for the request.
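For illustration, hooking such a container into a site might look something like this (the /foo/bar path and the port are made up for the example):

from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

root = Resource()
foo = Resource()
root.putChild(b"foo", foo)
foo.putChild(b"bar", MovieContainer())  # serves /foo/bar/<movie id>
reactor.listenTCP(8080, Site(root))
reactor.run()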

Related

Best approach to multiple websocket client connections in Python?

I appreciate that the question I am about to ask is rather broad but, as a newcomer to Python, I am struggling to find the [best] way of doing something which would be trivial in, say, Node.js, and pretty trivial in other environments such as C#.
Let's say that there is a warehouse full of stuff. And let's say that there is a websocket interface onto that warehouse with two characteristics: on client connection it pumps out a full list of the warehouse's current inventory, and it then follows that up with further streaming updates when the inventory changes.
The web is full of examples of how, in Python, you connect to the warehouse and respond to changes in its state. But...
What if I want to connect to two warehouses and do something based on the combined information retrieved separately from each one? And what if I want to do things based on factors such as time, rather than solely being driven by inventory changes and incoming websocket messages?
In all the examples I've seen - and it's beginning to feel like hundreds - there is, somewhere, in some form, a run() or a run_forever() or a run_until_complete() etc. In other words, the I/O may be asynchronous, but there is always a massive blocking operation in the code, and always two fundamental assumptions which don't fit my case: that there will only be one websocket connection, and that all processing will be driven by events sent out by the [single] websocket server.
It's very unclear to me whether the answer to my question is some sort of use of multiple event loops, or of multiple threads, or something else.
To date, experimenting with Python has felt rather like being on the penthouse floor, admiring the quirky but undeniably elegant decor. But then you get in the elevator, press the button marked "parallelism" or "concurrency", and the elevator goes into freefall, eventually depositing you in a basement filled with some pretty ugly and steaming pipes.
... Returning from flowery metaphors back to the technical, the key thing I'm struggling with is the Python equivalent of, say, Node.js code which could be as trivially simple as the following example [left inelegant for simplicity]:
var aggregateState = { ... some sort of representation of combined state ... };

var socket1 = new WebSocket("wss://warehouse1");
socket1.on("message", OnUpdateFromWarehouse);

var socket2 = new WebSocket("wss://warehouse2");
socket2.on("message", OnUpdateFromWarehouse);

function OnUpdateFromWarehouse(message)
{
    ... Take the information and use it to update aggregate state from both warehouses ...
}
Answering my own question, in the hope that it may help other Python newcomers... asyncio seems to be the way to go (though there are gotchas such as the alarming ease with which you can deadlock the event loop).
Assuming the use of an asyncio-friendly websocket module such as websockets, what seems to work is a framework along the following lines - shorn, for simplicity, of logic such as reconnects. (The premise remains a warehouse which sends an initial list of its full inventory, and then sends updates to that initial state.)
import asyncio
from websockets import connect  # assuming the asyncio-friendly "websockets" module

class Warehouse:
    def __init__(self, warehouse_url):
        self.warehouse_url = warehouse_url
        self.inventory = {}  # Some description of the warehouse's inventory

    async def destroy(self):
        if self.websocket.open:
            await self.websocket.close()  # Terminates any recv() in wait_for_incoming()
        await self.incoming_message_task  # keep asyncio happy by awaiting the "background" task

    async def start(self):
        try:
            # Connect to the warehouse
            self.websocket = await connect(self.warehouse_url)
            # Get its initial message which describes its full state
            initial_inventory = await self.websocket.recv()
            # Store the initial inventory
            self.process_initial_inventory(initial_inventory)
            # Set up a "background" task for further streaming reads of the web socket
            self.incoming_message_task = asyncio.create_task(self.wait_for_incoming())
            # Done
            return True
        except Exception:
            # Connection failed (or some unexpected error)
            return False

    async def wait_for_incoming(self):
        while self.websocket.open:
            try:
                update_message = await self.websocket.recv()
                asyncio.create_task(self.process_update_message(update_message))
            except Exception:
                # Presumably, socket closure
                pass

    def process_initial_inventory(self, initial_inventory_message):
        ... Process initial_inventory_message into self.inventory ...

    async def process_update_message(self, update_message):
        ... Merge update_message into self.inventory ...
        ... And fire some sort of event so that the object's creator can
        ... detect the change. There seems to be no consensus about what is a
        ... pythonic way of implementing events, so I'll declare that
        ... (potentially trivial) element as out-of-scope ...
After completing the initial connection logic, one key thing is setting up a "background" task which repeatedly reads further update messages coming in over the websocket. The code above doesn't include any firing of events, but there are all sorts of ways in which process_update_message() can do this (many of them trivially simple), allowing the object's creator to deal with notifications whenever and however it sees fit. The streaming messages will continue to be received, and any events will continue to be fired, for as long as the object's creator continues to play nicely with asyncio and to participate in co-operative multitasking.
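For instance, one trivially simple approach (shown only as a sketch; the on_change callback is an invented name) is to let the object's creator register a plain callable that process_update_message() invokes after each merge:

class Warehouse:
    def __init__(self, warehouse_url, on_change=None):
        self.warehouse_url = warehouse_url
        self.inventory = {}
        self.on_change = on_change  # plain callable supplied by the creator

    async def process_update_message(self, update_message):
        # ... merge update_message into self.inventory ...
        if self.on_change:
            self.on_change(self)  # notify the creator of the change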
With that in place, a connection can be established along the following lines:
async def main():
    warehouse1 = Warehouse("wss://warehouse1")
    if await warehouse1.start():
        ... Connection succeeded. Update messages will now be processed
        ... in the "background", provided that other users of the event
        ... loop yield in some way ...
    else:
        ... Connection failed ...

asyncio.run(main())
Multiple warehouses can be initiated in several ways, including doing a create_task(warehouse.start()) on each one and then doing a gather on the tasks to ensure/check that they're all okay.
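For example, a minimal sketch along those lines:

async def main():
    warehouses = [Warehouse("wss://warehouse1"), Warehouse("wss://warehouse2")]
    # start() returns True/False, so gather() yields a list of booleans
    start_tasks = [asyncio.create_task(w.start()) for w in warehouses]
    results = await asyncio.gather(*start_tasks)
    if not all(results):
        ... at least one connection failed ...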
When it's time to quit, it's necessary to call destroy() on each warehouse to keep asyncio happy, stop it complaining about orphaned tasks, and allow everything to shut down nicely.
But there's one common element which this doesn't cover. Extending the original premise above, let's say that the warehouse also accepts requests from our websocket client, such as "ship X to Y". The success/failure responses to these requests will come in alongside the general update messages; it generally won't be possible to guarantee that the first recv() after the send() of a request will be the response to that request. This complicates process_update_message().
The best answer I've found may or may not be considered "pythonic" because it uses a Future in a way which is strongly analogous to a TaskCompletionSource in .NET.
Let's invent a couple of implementation details; any real-world scenario is likely to look something like this:
We can supply a request_id when submitting an instruction to the warehouse
The success/failure response from the warehouse repeats the request_id back to us (thus also distinguishing command-response messages from inventory-update messages)
The first step is to have a dictionary which maps the ID of pending, in-progress requests to Future objects:
def __init__(self, warehouse_url):
    ...
    self.pending_requests = {}
The definition of a coroutine which sends a request then looks something like this:
async def send_request(self, some_request_definition):
    # Allocate a unique ID for the request
    request_id = <some unique request id>
    # Create a Future for the pending request
    request_future = asyncio.Future()
    # Store the map of the ID -> Future in the dictionary of pending requests
    self.pending_requests[request_id] = request_future
    # Build a request message to send to the server, somehow including the request_id
    request_msg = <some request definition, including the request_id>
    # Send the message
    await self.websocket.send(request_msg)
    # Wait for the future to complete - we're now asynchronously awaiting
    # activity in a separate function
    await asyncio.wait_for(request_future, timeout=None)
    # Return the result of the Future as the return value of send_request()
    return request_future.result()
A caller can create a request and wait for its asynchronous response using something like the following:
some_result = await warehouse.send_request(<some request def>)
The key to making this all work is then to modify and extend process_update_message() to do the following:
Distinguish between request responses versus inventory updates
For the former, extract the request ID (which our invented scenario says gets repeated back to us)
Look up the pending Future for the request
Do a set_result() on it (whose value can be anything depending on what the server's response says). This releases send_request() and causes the await from it to be resolved.
For example:
async def process_update_message(self, update_message):
    if <some test that update_message is a request response>:
        request_id = <extract the request ID repeated back in update_message>
        # Get the Future for this request ID (and drop it from the pending map)
        request_future = self.pending_requests.pop(request_id)
        # Create some sort of return value for send_request() based on the response
        return_value = <some result of the request>
        # Complete the Future, causing send_request() to return
        request_future.set_result(return_value)
    else:
        ... handle inventory updates as before ...
I've not used sockets with asyncio, but you're likely just looking for asyncio's open_connection
async def socket_activity(address, callback):
    reader, _ = await asyncio.open_connection(address)
    while True:
        message = await reader.read()
        if not message:  # empty bytes on EOF
            break  # connection was closed
        await callback(message)
Then add these to the event loop
tasks = [] # keeping a reference prevents these from being garbage collected
for address in ["wss://warehouse1", "wss://warehouse2"]:
tasks.append(asyncio.create_task(
socket_activity(address, callback)
))
# return tasks # or work with them
If you want to wait in a coroutine until N operations are complete, you can use .gather()
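For example, once the tasks above have been created:

# Inside a coroutine: suspends until every socket_activity() task has finished
await asyncio.gather(*tasks)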
Alternatively, you may find Tornado does everything you want and more (I based my answer on this one)
Tornado websocket client: how to async on_message? (coroutine was never awaited)

How to simulate two parallel requests in django testing framework

I have a Django application (using uWSGI and nginx, and atomic views) with a view that creates new items of a model in the DB (postgres). Before creating anything, the view checks whether the record already exists in the DB, something like:
...
try:
    newfile = DataFile.objects.get(md5=request.POST['md5'])
except DataFile.DoesNotExist:
    newfile = DataFile.objects.create(md5=request.POST['md5'], filename=request.POST['filename'])
return JsonResponse({'file_id': newfile.pk})
I noticed sometimes this doesn't work, and I get duplicates in the DB (which is easily solved with a unique constraint). I'm not sure why this happens, whether it's caching or a race condition, but I'd like to at least cover the behaviour with a test in the Django test framework. However, I do not know how to simulate two parallel requests. Is there a way to fire the next request without waiting for the first, built into the framework, or should one use multiprocessing or similar for this?
I suggest you use an async event loop to trigger two nearly simultaneous requests.
Example:
import asyncio

async def test_case(request):
    try:
        newfile = DataFile.objects.get(md5=request.POST['md5'])
    except DataFile.DoesNotExist:
        newfile = DataFile.objects.create(md5=request.POST['md5'], filename=request.POST['filename'])
    return JsonResponse({'file_id': newfile.pk})

async def simult(request):
    # gather() schedules both coroutines concurrently rather than one after the other
    t_case_0, t_case_1 = await asyncio.gather(test_case(request), test_case(request))

asyncio.run(simult(request))

tda-api assigning streaming endpoints to variables

I am new to Python APIs. I cannot get the script to return a value. Could anyone give me a direction, please? I cannot get the lambda function to work properly. I am trying to save the streamed data into variables to use with a set of operations.
from tda.auth import easy_client
from tda.client import Client
from tda.streaming import StreamClient
import asyncio
import json
import config
import pathlib
import math
import pandas as pd

client = easy_client(
    api_key=config.API_KEY,
    redirect_uri=config.REDIRECT_URI,
    token_path=config.TOKEN_PATH)

stream_client = StreamClient(client, account_id=config.ACCOUNT_ID)

async def read_stream():
    login = asyncio.create_task(stream_client.login())
    await login
    service = asyncio.create_task(stream_client.quality_of_service(StreamClient.QOSLevel.EXPRESS))
    await service

    book_snapshots = {}

    def my_nasdaq_book_handler(msg):
        book_snapshots.update(msg)

    stream_client.add_nasdaq_book_handler(my_nasdaq_book_handler)
    stream = stream_client.nasdaq_book_subs(['GOOG', 'AAPL', 'FB'])
    await stream

    while True:
        await stream_client.handle_message()
        print(book_snapshots)

asyncio.run(read_stream())
Callbacks
This (wrong) assumption
stream_client.add_nasdaq_book_handler() contains all the trade data.
shows difficulties in understanding the callback concept. Typically the naming pattern add ... handler indicates that this concept is being used. There is also this comment in the boilerplate code from the Streaming Client docs
# Always add handlers before subscribing because many streams start sending
# data immediately after success, and messages with no handlers are dropped.
which consistently talks about subscribing - that word, too, is a strong indicator.
The basic principle of a callback is that instead of pulling information from a service (and being blocked until it's available), you enable that service to push the information to you when it's available. You typically do this by first registering one (or more) interests with the service and then waiting for things to come.
In section Handling Messages they give an example for function (to provide by you) as follows:
def sample_handler(msg):
    print(json.dumps(msg, indent=4))
which takes a msg argument that is dumped in JSON format to the console. The lambda in your example does exactly the same.
Lambdas
it's not possible to return a value from a lambda function because it is anonymous
This is not correct. If lambda functions couldn't return values, they wouldn't play such an important role. See 4.7.6. Lambda Expressions in the Python 3 docs.
The problem in your case is that neither function does anything you want; both just print to the console. You now need to flesh out these functions to tell them what to do.
Control
Actually, your program runs within this loop
while True:
    await stream_client.handle_message()
each stream_client.handle_message() call ultimately causes a call to the function you registered via stream_client.add_nasdaq_book_handler. So that's the point: your script defines what to do with arriving messages before it starts waiting for them.
For example, your function could just collect the arriving messages:
book_snapshots = []

def my_nasdaq_book_handler(msg):
    book_snapshots.append(msg)
A global object book_snapshots is used in the implementation. You may expand/change this function at will (of course, translating the information into JSON format will help you access it in a structured way). This line will register your function:
stream_client.add_nasdaq_book_handler(my_nasdaq_book_handler)
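If you're unsure what the incoming messages look like, the pretty-printing trick from the docs' sample_handler can be combined with the collecting handler (purely for inspection; the exact message layout is defined by the tda-api docs):

import json

def my_nasdaq_book_handler(msg):
    book_snapshots.append(msg)
    # Pretty-print each message so you can see its structure
    print(json.dumps(msg, indent=4))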

Is get_result() a required call for put_async() in Google App Engine

With the new release of GAE 1.5.0, we now have an easy way to do async datastore calls. Are we required to call get_result() after calling put_async()?
For example, if I have a model called MyLogData, can I just call:
put_async(MyLogData(text="My Text"))
right before my handler returns without calling the matching get_result()?
Does GAE automatically block on any pending calls before sending the result to the client?
Note that I don't really care to handle error conditions. i.e. I don't mind if some of these puts fail.
I don't think there is any sure way to know if get_result() is required unless someone on the GAE team verifies this, but I think it's not needed. Here is how I tested it.
I wrote a simple handler:
class DB_TempTestModel(db.Model):
    data = db.BlobProperty()

class MyHandler(webapp.RequestHandler):
    def get(self):
        starttime = datetime.datetime.now()
        lots_of_data = ' ' * 500000
        if self.request.get('a') == '1':
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
        if self.request.get('a') == '2':
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
        self.response.out.write(str(datetime.datetime.now() - starttime))
I ran it a bunch of times on a High Replication Application.
The data was always there, making me believe that unless there is a failure in the datastore side of things (unlikely), it's gonna be written.
Here's the interesting part. When the data was written with put_async() (?a=2), the request was processed on average about 2 to 3 times as fast as with put() (?a=1) (not a very scientific test - just eyeballing it).
But the cpu_ms and api_cpu_ms were the same for both ?a=1 and ?a=2.
From the logs:
ms=440 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
vs
ms=149 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
On the client side, looking at the network latency of the requests, it showed the same results, i.e. ?a=2 requests were at least 2 times faster. Definitely a win on the client side... but it seems to have no gain on the server side.
Anyone on the GAE team care to comment?
db.put_async works fine without get_result when deployed (in fire-and-forget style), but locally it won't take action until get_result gets called (more context).
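So if you need the write to be visible locally too, keep the returned async object and resolve it before returning, e.g. (a minimal sketch):

future = db.put_async(MyLogData(text="My Text"))
# ... do other work while the RPC is in flight ...
key = future.get_result()  # blocks until the put completes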
I dunno, but this works:
import datetime
from google.appengine.api import urlfetch

def main():
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, "some://artificially/slow.url")
    print "Content-type: text/plain"
    print
    print str(datetime.datetime.now())

if __name__ == '__main__':
    main()
The remote URL sleeps 3 seconds and then sends me an email. The App Engine handler returns immediately, and the remote URL completes as expected. Since both services abstract the same underlying RPC framework, I would guess the datastore behaves similarly.
Good question, though. Perhaps Nick or another Googler can answer definitively.

python twisted - Timeouting on a sent message that did not get a response

I am creating a sort of a client-server implementation, and I'd like to make sure that every sent message gets a response. So I want to create a timeout mechanism, which doesn't check if the message itself is delivered, but rather checks if the delivered message gets a response.
I.e., for two computers 1 and 2:
1: send successfully: "hello"
2: <<nothing>>
...
1: Didn't get a response for my "hello" --> timeout
I thought of doing it by creating a big boolean array with an id for each message, which would hold an "in progress" flag to be updated when the message's response is received.
I was wondering whether there was a better way of doing that.
Thanks,
Ido.
There is a better way, which funnily enough I myself just implemented here. It uses the TimeoutMixin to achieve the timeout behaviour you need, and a DeferredLock to match up the correct replies with what was sent.
from twisted.internet import defer
from twisted.protocols.policies import TimeoutMixin
from twisted.protocols.basic import LineOnlyReceiver

class PingPongProtocol(LineOnlyReceiver, TimeoutMixin):
    DEFAULT_TIMEOUT = 30  # seconds; pick whatever suits your protocol

    def __init__(self):
        self.lock = defer.DeferredLock()
        self.deferred = None

    def sendMessage(self, msg):
        result = self.lock.run(self._doSend, msg)
        return result

    def _doSend(self, msg):
        assert self.deferred is None, "Already waiting for reply!"
        self.deferred = defer.Deferred()
        self.deferred.addBoth(self._cleanup)
        self.setTimeout(self.DEFAULT_TIMEOUT)
        self.sendLine(msg)
        return self.deferred

    def _cleanup(self, res):
        self.deferred = None
        return res

    def lineReceived(self, line):
        if self.deferred:
            self.setTimeout(None)
            self.deferred.callback(line)
        # If not, we've timed out or this is a spurious line

    def timeoutConnection(self):
        self.deferred.errback(
            defer.TimeoutError("Some informative message"))
I haven't tested this, it's more of a starting point. There are a few things you might want to change here to suit your purposes:
I use a LineOnlyReceiver — that's not relevant to the problem itself, and you'll need to replace sendLine/lineReceived with the appropriate API calls for your protocol.
This is for a serial connection, so I don't deal with connectionLost etc. You might need to.
I like to keep state directly in the instance. If you need extra state information, set it up in _doSend and clean it up in _cleanup. Some people don't like that — the alternative is to create nested functions inside _doSend that close over the state information that you need. You'll still need that self.deferred there though, otherwise lineReceived (or dataReceived) has no idea what to do.
How to use it
Like I said, I created this for serial communications, where I don't have to worry about factories, connectTCP, etc. If you're using TCP communications, you'll need to figure out the extra glue you need.
# Create the protocol somehow. Maybe this actually happens in a factory,
# in which case, the factory could have wrapper methods for this.
protocol = PingPongProtocol()
d = protocol.sendMessage("Hi there!")
d.addCallbacks(gotHiResponse, noHiResponse)
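For a TCP transport, the endpoint API is one way to provide that glue (the host and port here are invented for the sketch):

from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint, connectProtocol

endpoint = TCP4ClientEndpoint(reactor, "server.example.com", 1234)
d = connectProtocol(endpoint, PingPongProtocol())
# Once connected, send a message; the chained Deferred fires with the reply
d.addCallback(lambda protocol: protocol.sendMessage("Hi there!"))
d.addCallbacks(gotHiResponse, noHiResponse)
reactor.run()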
