Writing a Python data analysis server for a Java interface

Writing a Python data analysis server for a Java interface - python

I want to write data analysis plugins for a Java interface. This interface is potentially run on different computers. The interface will send commands and the Python program can return large data. The interface is distributed by a Java Webstart system. Both access the main data from a MySQL server.
What are the different ways and advantages to implement the communication? Of course, I've done some research on the internet. While there are many suggestions I still don't know what the differences are and how to decide for one. (I have no knowledge about them)
I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process for each connection (temporary data might be kept after one communication request for that particular client)?
I was thinking to learn how to use sockets and pass YAML strings.
Maybe my main question is: What is the relation to and advantage of systems like RabbitMQ, ZeroMQ, CORBA, SOAP, XMLRPC?
There were also suggestions to use pipes or shared memory. But that wouldn't fit to my requirements?
Does any of the methods have advantages for debugging or other pecularities?
I hope someone can help me understand the technology and help me decide on a solution, as it is hard to judge from technical descriptions.
(I do not consider solutions like Jython, JEPP, ...)

Offering an opinion on the merits you described, it sounds like you are dealing with potentially large data/queries that may take a lot of time to fetch and serialize, in which case you definitely want to go with something that can handle concurrent connections without stacking up threads. Thereby, in the Python domain, I can't recommend any networking library other than Twisted.
http://twistedmatrix.com/documents/current/core/examples/
Whether you decide to use vanilla HTTP or your own protocol, twisted is pretty much the one stop shop for concurrent networking. Sure, the name gets thrown around alot, and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that can handle your data in a reactor-based event loop and respond to deferred request when ready.
The serialization format really depends on the nature of the data. Will there be any binary in what is output as a response? Complex types? That rules out JSON if so, though that is becoming the most common serialization format. YAML sometimes seems to enjoy a position of privilege among the python community - I haven't used it extensively as most of the kind of work I've done with serials was data to be rendered in a frontend with javascript.
Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging response. They are commonly employed in web apps where the HTTP request should not hang until whatever complex processing needs to take place completes, so the UI can render early and count on an implicit "promise" the processing will take place. They have two important traits: they rely on eventual consistency, in that the process can finish long after the response in the protocol is sent, and they also have fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can and I trust you to get it done" problem domain.
If we are not talking about potentially HUGE response bodies, and relatively simple data types within the serialized output, there is nothing wrong with rolling a simple HTTP deferred server in Twisted.

Related

Concurrent server in Golang

First of all I have to admit that I am a beginner concerning concurrency in general, but reading a lot about it recently. Because I heard that Golang is strong on that area. I wanted to ask how (concurrent) servers are written in this language.
I mean, there are different ways in how to write a server that can handle multiple requests/connections concurrently. You can use threads, asynchronous programming (async/asyncio in Python for example), and in Golang there are goroutines, which is more or less a lightweight thread.
However, when using Python and async/asyncio you can have one single process and one thread and it's able to handle concurrency. However, the code is complicated (at least for me without any background).
My question:
What is the way to go to write a concurrent server in Golang? Just a new goroutine for every connection or are there any asynchronous ways? What's the "best practice"?
I mean is it not expensive to have LOTS of goroutines on a highly used server? How to do a well-written server in Golang?

For beginner the best way to start is just use https://golang.org/pkg/net/http/ and just write http handlers. You don't need to initialize Go routines - the http.Server will do it for you.
The code will be straight forward with blocking calls. You don't need to think about concurrency at this stage as Go will do it for you. For example when you do a call like
record, err := someDb.GetRecordByID(123)
actually it's an asynchronous call that blocks current flow but release thread to other Go routines. It will continue flow once data returned and a thread (may be different from previous) becomes available.
If you will need to do concurrent calls within 1 HTTP request you can start Go routines. But leave it for later stage and do the Go lang tour on concurrency first.
If you really need a high load solution for HTTP requests consider using https://github.com/valyala/fasthttp instead of standard http package.

For HTTP #icza's comments & Alexander's answer give a fair idea. Just to add Goroutines are not expensive because they are lighter than normal threads. They can have variable sized stack (probably start as low as 2k) & hence can scale up very well with less operating overhead.
Also on http, there are third party libraries like Gorilla mux which can make life better as also other frameworks like Buffalo which you can explore. While I haven't used the latter, I have heard it makes life easier.
Now if you are going to be writing your own custom server (something different from http) then again Go is a great choice for it. The program can start as simple as https://golang.org/pkg/net/#example_Listener (To try running this program, you can use netcat like this from another terminal)
$ nc localhost 2000
Hellow
Hellow
And finally channels in Go make sharing data & communication much easier and safer across routines taking care of the synchronization aspects. Hope this helps.

My question: What is the way to go to write a concurrent server in
Golang? Just a new goroutine for every connection or are there any
asynchronous ways? What's "best practice"?
Golang http package will do requests concurrency handling for you and I really like that code looks like synchronous and you don't need to add any async/await keywords. Here is how you start
func helloHandler(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello")
}
http.HandleFunc("/hello", helloHandler)
log.Fatal(http.ListenAndServe(":8080", nil))

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with Uwsgi. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them at once for the user. The problem is these requests are old web services technologies, and due to their response times, the time needed before the all rates from vendors are acquired (or it gives up), can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests. Especially when other optimizations need to be made to make the site baseline faster anyway (we're working on that).
We made a separate library (which should be mostly threadsafe, and if not, should be converted to it easily enough) for the rates requesting, and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?

If you want to use your code in-process with Django, you can simply call out to your Twisted by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at TReq to issue your service client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.

Is there a good way to split a python program into independent modules?

I'm trying to do some machinery automation with python, but I've run into a problem.
I have code that does the actual control, code that logs, code the provides a GUI, and some other modules all being called from a single script.
The issue is that an error in one module halts all the others. So, for instance a bug in the GUI will kill the control systems.
I want to be able to have the modules run independently, so one can crash, be restarted, be patched, etc without halting the others.
The only way I can find to make that work is to store the variables in an SQL database, or files or something.
Is there a way for one python script to sort of ..debug another? so that one script can read or change the variables in the other? I can't find a way to do that that also allows to scripts to be started and stopped independently.
Does anyone have any ideas or advice?

A fairly effective way to do this is to use message passing. Each of your modules are independent, but they can send and receive messages to each other. A very good reference on the many ways to achieve this in Python is the Python wiki page for parallel processing.
A generic strategy
Split your program into pieces where there are servers and clients. You could then use middleware such as 0MQ, Apache ActiveMQ or RabbitMQ to send data between different parts of the system.
In this case, your GUI could send a message to the log parser server telling it to begin work. Once it's done, the log parser will send a broadcast message to anyone interested telling the world the a reference to the results. The GUI could be a subscriber to the channel that the log parser subscribes to. Once it receives the message, it will open up the results file and display whatever the user is interested in.
Serialization and deserialization speed is important also. You want to minimise the overhead for communicating. Google Protocol Buffers and Apache Thrift are effective tools here.
You will also need some form of supervision strategy to prevent a failure in one of the servers from blocking everything. supervisord will restart things for you and is quite easy to configure. Again, it is only one of many options in this space.
Overkill much?
It sounds like you have created a simple utility. The multiprocessing module is an excellent way to have different bits of the program running fairly independently. You still apply the same strategy (message passing, no shared shared state, supervision), but with different tactics.

You want multiply independent processes, and you want them to talk to each other. Hence: read what methods of inter-process communication are available on your OS. I recommend sockets (generic, will work over a n/w and with diff OSs). You can easily invent a simple (maybe http-like) protocol on top of TCP, maybe with json for messages. There is a bunch of classes coming with Python distribution to make it easy (SocketServer.ThreadingMixIn, SocketServer.TCPServer, etc.).

How to create a polling script in Python?

I was trying to create a polling script in python that starts when another python script starts and then keeps supplying data back to this script.
I can obviously write an infinite loop but is that the right way to go about it? I might loose control over how the functions work and how many times a function should be called in an hour.
Edit:
What I am trying to accomplish is to poll the REST API of twitter and get new mentions and people who follow me. I obviously can't keep polling because I will run out of API requests per hour. Thus, the issue. This poller, will send the new mention and follower id/user to the main script that would be listening to any such update.

I highly suggest looking into Twisted, one of the most popular async frameworks using the reactor pattern.
The "infinite loop" you are looking for is really an application pattern that Twisted implements to respond to events asynchronously, and it almost never makes sense to roll your own.
Twisted is largely used for networking requirements, but the it has a LoopingCall interface to set up the kind of functionality you require. Using the core Twisted deferred as your request model allows you to set up a long-polling server that can perform the kind of conditional network test you need. It can intially be a little intimidating, but once you understand the core components (Factories, Reactors, Protocols etc) that you need to inherit it becomes much easier to visualize your problem.
This also might be a good tutorial to start looking at the basics of the "push" model:
http://carloscarrasco.com/simple-http-pubsub-server-with-twisted.html

Twisted or Celery? Which is right for my application with lots of SOAP calls?

I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to be able to have a pool of "workers" for each of these APIs so I can make more than 1 call at a time to each API. Nothing about this should be very cpu-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages, and then pushes them through the Process. The process will contain 4-5 steps that need to happen in order, but can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these Steps, the work involved centers around calling out to a third-party API typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?

Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate-limiting, re-tries, remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.

Celery allows you to use asynchronous behavior of various async library like gevent and eventlet. So you can have best of both world.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.