I am trying to implement a server using python-twisted with potential C# and ObjC clients. I started with LineReceiver and that works well for basic messaging, but I can't figure out the best approach for something more robust. Any ideas for a simple solution for the following requirements?
Request and response
ex. send message to get a status, receive status back
Receive binary data transfer (non-trivial, but not massive - less than a few megs)
ex. bytes of a small png file
AMP seems like a feasible solution for the first scenario, but may not be able to handle the size for the data transfer scenario.
I've also looked at full-blown SOAP but haven't found a decent enough example to get me going.
I like AMP a lot. twisted.protocols.amp is moderately featureful and relatively easily testable (although documentation on how to test applications written with it is a little lacking).
The command/response abstraction AMP provides is comfortable and familiar (after all, we live in a world where HTTP won). AMP avoids the trap of excessive complexity (seemingly for the sake of complexity) that SOAP fell squarely into. But it's not so simple you won't be able to do the job with it (like LineReceiver most likely is).
There are intermediate steps - for example, twisted.protocols.basic.Int32StringReceiver gives you a more sophisticated framing mechanism (32-bit length prefixes instead of magic-bytes-terminated lines) - but in my opinion AMP is a really good first choice for a protocol. You may find you want to switch to something else later (one size really does not fit all), but AMP sits at a sweet spot between features and simplicity that makes it a good fit for a very broad range of applications.
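For that length-prefixed framing approach, a minimal sketch might look like the following (the port number, the handler body, and the raised MAX_LENGTH value are just illustrative choices, not anything prescribed by Twisted):

```python
from twisted.internet import protocol, reactor
from twisted.protocols.basic import Int32StringReceiver

class BlobProtocol(Int32StringReceiver):
    # Raise the default frame-size limit so multi-megabyte payloads fit.
    MAX_LENGTH = 16 * 1024 * 1024

    def stringReceived(self, data):
        # 'data' is one complete, length-prefixed frame (e.g. the bytes of a PNG).
        print("received %d bytes" % len(data))
        # Send a short acknowledgement back, framed the same way.
        self.sendString(b"ok")

factory = protocol.Factory()
factory.protocol = BlobProtocol
reactor.listenTCP(8750, factory)
reactor.run()
```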
It's true that there are some built-in length limits in AMP. This is a long standing sore spot that is just waiting for someone with a real-world need to address it. :) There is a fairly well thought-out design for lifting this limit (without breaking protocol compatibility!). If AMP seems otherwise appealing to you then I encourage you to engage the Twisted development community to find out how you can help make this a reality. ;)
There's also always the option of using AMP for messaging and setting up another channel (e.g., HTTP) for transferring your larger chunks of data.
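To make the command/response shape concrete, here is a rough sketch of a status request with twisted.protocols.amp (the command name, port, and status payload are invented for illustration):

```python
from twisted.internet import reactor
from twisted.internet.protocol import Factory
from twisted.protocols import amp

class GetStatus(amp.Command):
    # No arguments; the response carries a single 'status' byte string.
    arguments = []
    response = [(b'status', amp.String())]

class StatusProtocol(amp.AMP):
    @GetStatus.responder
    def get_status(self):
        return {'status': b'running'}

factory = Factory()
factory.protocol = StatusProtocol
reactor.listenTCP(8751, factory)
reactor.run()
```

A connected client would call callRemote(GetStatus) and receive the response dict in a Deferred. Note that individual AMP argument values are limited to roughly 64 kB, which is the length limit mentioned above and the reason a separate channel is suggested for large blobs.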
Related
I need to settle on a method of communication between multiple programs spread out over multiple machines. The data itself is fairly straightforward, comprising variable-length vectors with some metadata descriptors.
Each program "type" will send data to 1 other program type, and expects to see a reply from it.
The number of connections a given program has is variable over time, and programs can be added or removed at any moment.
Programs may be distributed across several processors, which may use different operating systems, or could be microprocessors.
I have not had to contend with such an inter-communication puzzle before and am unsure of what the cleanest approach would be. I've heard that ROS might be useful, but that it may not be suitable for Windows environments and has a steep learning curve. I can imagine creating a database between the nodes which would act as a kind of blackboard, but this feels like it may be very inefficient. If raw sockets would be more efficient, how would the connections be managed and maintained?
My main development environment at the moment is Julia, but I can port the data over to C++ or Python without too much discomfort.
I am open to ideas and learning new things, but I can't figure out which path to take. Any nuggets of wisdom would be greatly appreciated.
As requested in a comment, here is some additional information:
Data: The data is a vector of floating point numbers, which can vary in size anywhere from a single value to thousands of values (maybe more depending on other aspects of the project, but I'm capping it for prototyping purposes). The metadata is in JSON format and provides information such as the total length of the array and a few identifiers (all integers).
Reliability: I can handle a few missed messages here and there. I'd need the data in the order it was sent though. So if a packet of data arrives after a more recent one it will be ignored. (In case you were instead referring to data integrity, then the data must not be corrupted in transit).
Latency: As little as possible, as the program could start to introduce errors otherwise. That being said, for prototyping purposes I can live with tens of milliseconds on average and occasional spikes into the hundreds of milliseconds. I would very much like to keep this down though.
Standard IP Facilities: You may assume all devices have these, yes.
Perhaps taking a look at something well established (and therefore able to hook up to whichever language(s) you end up using), such as Redis or RabbitMQ, would be wise, seeing as you'll need to communicate with a number of different devices over a network. Either way, having some form of central hub that handles communication among the other devices would be advisable.
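As an illustration of the central-hub idea, here is a rough sketch of passing one vector plus its JSON metadata through Redis pub/sub with the redis-py client. The channel name, metadata fields, and sequence-number scheme are invented for the example; the ordering requirement is handled by dropping messages with stale sequence numbers on the receiving side:

```python
import json
import struct

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def publish_vector(channel, seq, identifiers, values):
    # Pack the floats as little-endian doubles and prepend JSON metadata,
    # separated by a newline, so one message carries both parts.
    meta = json.dumps({"seq": seq, "length": len(values), "ids": identifiers})
    payload = meta.encode("utf-8") + b"\n" + struct.pack("<%dd" % len(values), *values)
    r.publish(channel, payload)

def consume(channel):
    last_seq = -1
    pubsub = r.pubsub()
    pubsub.subscribe(channel)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        meta_raw, blob = message["data"].split(b"\n", 1)
        meta = json.loads(meta_raw)
        if meta["seq"] <= last_seq:
            continue  # ignore out-of-order data, per the reliability requirement
        last_seq = meta["seq"]
        values = struct.unpack("<%dd" % meta["length"], blob)
        print(meta, values[:3])
```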
I'm currently about to start designing a new application.
The application will allow a user to insert some data and will provide data analysis (with reports as well). I know it's not much help, but the data processing will be done in post-processing, so that's not really relevant to the front-end.
I'd like to start down the right path so that it's easier when the need arises to scale to handle more users.
I'm thinking about PostgreSQL to store the data, because I've already used it and I like it (even if a NoSQL store could be a good choice, since not all the data needs to be relational, I like the Postgres support and community and I feel better knowing there's a big community out there to help me). MySQL (InnoDB) is also a good choice; to be honest I don't have a real reason to choose it over PostgreSQL or vice versa (is MySQL perhaps easier to shard?).
I know several programming languages but my strengths are Python, C/C++, Javascript.
I'm not sure if I should choose a sync or async approach for this task (I could scale out by running more sync applications behind a load balancer).
I've already developed another big project that taught me a lot about concurrency, but there each choice was influenced by the skills of the rest of the team (mostly the sysadmin), so we used Python (Django) + uWSGI + nginx.
For this project (since it's totally different from the other one - that was an e-commerce site, this is more of a SaaS) I was also considering node.js; it would be a good opportunity to try it out in a serious project.
The heaviest data processing will be done in post-processing, so the front-end (user website) will be mostly I/O-bound (+1 for an async environment).
What would you suggest?
ps. I must also keep in mind that, first of all, the project has to get started, so I can't just think through every possible design; I should start writing code ASAP :-)
My current thoughts are:
- start with something you know
- keep it as simple as possible
- track everything to find bottlenecks
- scale out
So it wouldn't really matter whether I deploy sync or async, but I know async has much better performance, and anything that could help me get better results (and therefore lower costs) is worth considering as well.
I'm curious to know what are your experiences (also with other technologies)...
I'm becoming paranoid about scalability and I fear it could lead me to a wrong design (it's also the first time I'm designing alone for a commercial purpose = FUD).
If you need some more info please let me know and I'll try to give you an answer.
Thanks.
A good resource for all of this is http://highscalability.com/. Lots of interesting case studies about handling big web loads.
You didn't mention it but you might want to think about hosting it in the cloud (Azure, Amazon, etc). Makes scaling the hardware a little easier and it's especially nice if your demand fluctuates.
Here are some basic guidelines:
Use asynchronous processes as much as possible, or at least design things in such a way that they can be converted to async later.
Design processes so that they can be segregated onto different servers. This ties into the point above. Say you have a webapp with some intensive processing. If that processing is async, the main web server can queue the job and be done with it; a separate server can then pick up the job and process it (see the sketch after these guidelines). This way your main web servers are not affected. If you are resource constrained, you can still run the background process on the same server, and spin it off to a different server once you have enough clients.
Design for load balancing. If your app uses sessions, you should factor in whether and how you will replicate them. You don't have to - you could pin a user to a particular server and forward all subsequent requests there - but you still have to design for it.
Have the ability to route load to different servers based on some predefined criteria. For example, since your app is a SaaS app, you could decide that certain clients go to Environment1 and certain other clients go to Environment2. A lot of SaaS players do this, for example Salesforce.
You don't necessarily have to do this from the get-go, but having the ability will go a long way toward scaling your app when the time comes.
Also, remember that these approaches are not exclusive. You should design your app for all of them, but only implement each one when required.
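As a sketch of the queue-and-process-elsewhere idea from the second guideline, here is roughly what it could look like with Celery and a Redis broker (the task name, broker URL, and report-generation step are invented for the example):

```python
# tasks.py - runs on the worker server(s), started with: celery -A tasks worker
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def generate_report(dataset_id):
    # Heavy post-processing lives here, off the web servers.
    return {"dataset_id": dataset_id, "status": "done"}
```

```python
# In a web request handler: queue the job and return to the user immediately.
from tasks import generate_report

generate_report.delay(42)
```

The web tier only enqueues work, so it stays I/O-bound, and workers can be moved to separate machines without changing the application code.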
Take a look at the book The Art of Scalability
This book was written by guys that worked with eBay & Paypal.
Take a look at this excellent presentation on scalability patterns and approaches.
I have been playing around with the Twisted framework for about a week now (more out of curiosity than because I have to use it) and it's been a lot of fun doing event-driven asynchronous network programming.
However, there is something that I fail to understand. The twisted documentation starts off with
Twisted is a framework designed to be very flexible and let you write powerful servers.
My question is: why do we need such an event-driven library to write powerful servers when there are already very efficient implementations of various servers out there?
Surely, there must have been more than a couple of concrete implementations which the Twisted developers had in mind while writing this event-driven I/O library. What are those? Why exactly was Twisted made?
In a comment on another answer, you say "Every library is supposed to have ...". "Supposed" by whom? Having use-cases is certainly a nice way to nail down your requirements, but it's not the only way. It also doesn't make sense to talk about the use-cases for all of Twisted at once. There is no use case that justifies every single API in Twisted. There are hundreds or thousands of different use cases, each which justifies a lesser or greater subdivision of Twisted. These came and went over the years of Twisted's development, and no attempt has been made to keep a list of them. I can say that I worked on part of Twisted Names so that I would have a topic for a paper I was presenting at the time. I implemented the vt102 parser in Twisted Conch because I am obsessed with terminals and wanted a fun project involving them. And I implemented the IMAP4 support in Twisted Mail because I worked at a company developing a mail server which required tighter control over the mail store than any other IMAP4 server at the time offered.
So, as you can see, different parts of Twisted were written for widely differing reasons (and I've only given examples of my own reasons, not the reasons of any other developers).
The initial reason for a program being written often doesn't matter much in the long run though. Now the code is written: Twisted Names now runs the DNS for many domain names on the internet, the vt102 parser helped me get a job, and the company that drove the IMAP4 development is out of business. What really matters is what useful things you can do with the code now. As MattH points out, the resulting plethora of functionality has resulted in a library that (perhaps uniquely) addresses a wide array of interesting problems.
Why do we need such an event-driven library to write powerful servers when there are already very efficient implementations of various servers out there?
So, paraphrasing: you can't imagine why anyone would need a toolkit when die-cast products already exist?
I'm guessing you've never needed to knock up a protocol gateway, e.g.
- write a daemon that md5s local files on demand over a Unix socket
- interrogate a piece of software over UDP and expose statistics over HTTP.
I wrote a little proof-of-concept for the second example for a question here on SO in a handful of minutes. I couldn't do that without twisted.
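A rough sketch of that second kind of gateway (UDP in, HTTP out) might look like the following; the port numbers and the stats format are invented for illustration:

```python
import json

from twisted.internet import reactor
from twisted.internet.protocol import DatagramProtocol
from twisted.web import resource, server

stats = {"datagrams": 0, "bytes": 0}

class StatsCollector(DatagramProtocol):
    def datagramReceived(self, data, addr):
        # Count whatever the monitored software sends us over UDP.
        stats["datagrams"] += 1
        stats["bytes"] += len(data)

class StatsPage(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        # Expose the collected counters as JSON over HTTP.
        request.setHeader(b"content-type", b"application/json")
        return json.dumps(stats).encode("utf-8")

reactor.listenUDP(9999, StatsCollector())
reactor.listenTCP(8080, server.Site(StatsPage()))
reactor.run()
```

Both listeners share one reactor and one process, which is the kind of mixing that gets awkward quickly with blocking socket code.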
Have you looked at: ProjectsUsingTwisted?
More on 'why' (disclaimer: I'm not a developer of Twisted proper): it's necessary to consider Twisted's age (relative to Python's). When Twisted was written there was no sufficiently powerful non-blocking network/event-driven library built around the reactor pattern (almost everyone was using threads back then). Twisted's initial use case was a large multiplayer game, although the specifics of that game seem to be somewhat lost in time.
Since then, as MattH's link suggests, a very large number of network servers written in Python have been based on Twisted.
This PyCon talk by the creator of Twisted should give you answers.
It has changed my opinion of Twisted. Before, I viewed it as a massive piece of software full of interfaces and weird names - two things many developers dislike but that are really just superficial. Now that I've seen the history behind it and the amazing number of use cases, I respect it a lot. Life is short, you need Twisted :)
My question simply relates to the difference in performance between a socket in C and in Python. Since my Python build is CPython, I assume it's similar, but I'm curious whether someone has "real" benchmarks, or at least an evidence-based opinion.
My logic is as follows:
- C sockets much faster? Then write a C extension.
- Not/barely a difference? Keep writing in Python and figure out how to obtain packet-level control (scapy? dpkt?)
I'm sure someone will want to know for either context or curiosity. I plan to build a sort of proxy for myself (not for internet browsing, anonymity, etc.) and will bind the application I want to use to a specific port. Then all packets on that port will be queued, have their address headers modified, and then be sent on, etc.
Thanks in advance.
In general, sockets in Python perform just fine. For example, the reference implementation of the BitTorrent tracker server is written in Python.
When doing networking operations, the speed of the network is usually the limiting factor. That is, any possible tiny difference in speed between C and Python's socket code is completely overshadowed by the fact that you're doing networking of some kind.
However, your description of what you want to do indicates that you want to inspect and modify individual IP packets. This is beyond the capabilities of Python's standard networking libraries, and is in any case a very OS-dependent operation. Rather than asking "which is faster?" you will need to first ask "is this possible?"
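For what it's worth, if the per-port forwarding described in the question can live at the TCP level rather than at the raw IP-packet level, it is well within the standard library's reach. A minimal relay sketch (the listen port, target address, and pass-through rewrite hook are all invented for illustration) might look like this:

```python
import socket
import threading

LISTEN_PORT = 9000                 # port the local application connects to
TARGET = ("example.com", 80)       # where traffic is forwarded

def rewrite(data):
    # Hook for modifying the byte stream; a real proxy would edit headers here.
    return data

def pipe(src, dst):
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(rewrite(data))
    except OSError:
        pass  # the other direction already closed the sockets
    finally:
        src.close()
        dst.close()

def handle(client):
    upstream = socket.create_connection(TARGET)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", LISTEN_PORT))
server.listen(5)
while True:
    conn, _ = server.accept()
    handle(conn)
```

Rewriting actual IP headers, by contrast, needs raw sockets (root privileges, OS-specific behavior) or a library like scapy, which is why "is this possible?" comes before "which is faster?".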
I would think C would be faster, but Python would be a lot easier to manage and use.
The difference would be so small you wouldn't notice it unless you were trying to send massive amounts of data (something stupid like 1 million GB/second lol)
joe
I am considering programming the network related features of my application in Python instead of the C/C++ API. The intended use of networking is to pass text messages between two instances of my application, similar to a game passing player positions as often as possible over the network.
Although the python socket modules seems sufficient and mature, I want to check if there are limitations of the python module which can be a problem at a later stage of the development.
What do you think of the python socket module :
Is it reliable and fast enough for production quality software ?
Are there any known limitations which can be a problem if my app. needs more complex networking other than regular client-server messaging ?
Thanks in advance,
Paul
Check out Twisted, a Python networking engine. It has built-in support for TCP, UDP, SSL/TLS, multicast, Unix sockets, and a large number of protocols (including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others).
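For the kind of text-message passing you describe (e.g. player positions), a minimal Twisted sketch can be very small; the port number and the handler body here are invented for illustration:

```python
from twisted.internet import reactor
from twisted.internet.protocol import Factory
from twisted.protocols.basic import LineReceiver

class PositionProtocol(LineReceiver):
    def lineReceived(self, line):
        # Each line is one text message, e.g. a player position update.
        print("got message:", line)
        self.sendLine(b"ack " + line)

factory = Factory()
factory.protocol = PositionProtocol
reactor.listenTCP(4321, factory)
reactor.run()
```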
Python is a mature language that can do almost anything that you can do in C/C++ (even direct memory access if you really want to hurt yourself).
You'll find that you can write beautiful code in it in a very short time, that this code is readable from the start and that it will stay readable (you will still know what it does even after returning one year later).
The drawback of Python is that your code will be somewhat slow. "Somewhat" as in "might be too slow for certain cases". So the usual approach is to write as much as possible in Python because that keeps your app maintainable. Eventually you might run into speed issues; that would be the time to consider rewriting a part of your app in C.
The main advantages of this approach are:
You already have a running application. Translating the code from Python to C is much simpler than writing it from scratch.
You already have a running application. After translating a small part from Python to C, you only have to test that small part, and you can use the rest of the (unchanged) app to do it.
You don't pay a price upfront. If Python is fast enough for you, you'll never have to do the optional optimization.
Python is much, much more powerful than C. Every line of Python can do the same as 100 or even 1000 lines of C.
To answer #1, I know that among other things, EVE Online (the MMO) uses a variant of Python for their server code.
The Python that EVE Online uses is Stackless Python (http://www.stackless.com/), and as far as I understand they use it for how it implements threading through tasklets and the like. But since Python itself can handle something like an MMO with 40k people online, I think it can do anything.
This isn't really an answer to your question, rather an addition to the previous answer.
Alan.