Google Cloud Python Flexible Environment Multithreaded database worker freezes - python

I run a flexible service for heavy load on Google App Engine Python Flexible Environment. I run PSQ workers to handle tasks through Pub/Sub.
This is all fine and dandy as long as I work with single-threaded workers. On a single-threaded worker, if I instantiate a datastore client like so:
from google.cloud import datastore
_client = datastore.Client(project='project-name-kept-private')
... and retrieve an entity:
entity = _client.get(_client.key('EntityKind', 1234))
... it works fine.
However, once I do this exact same thing in a multi-threaded worker, it freezes on the last line:
entity = _client.get(_client.key('EntityKind', 1234))
I know it fails exactly on this line because I use logging.error before and after that specific line, like so:
import logging
logging.error('entity test1')
entity = _client.get(_client.key('EntityKind', 1234))
logging.error('entity test2')
The lines entity test1 and entity test2 both appear in the logs on a single-threaded worker, but only entity test1 gets printed on a multi-threaded worker. It never finishes the task – it just gets stuck on that line.
Any advice or pointers in the right direction would be of great help. I've been struggling with this issue for quite some time now.

I figured out what the problem was: when the datastore client constructs its API client, it uses gRPC by default. Apparently this freezes if you use multi-threaded workers. By setting GOOGLE_CLOUD_DISABLE_GRPC to True in the environment variables you force it to use the HTTPDatastoreAPI. This 'fixes' my problem.
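For reference, a minimal sketch of setting this from code rather than app.yaml (any non-empty value should do, and it has to be set before the datastore client library is imported; the project name is the placeholder from the question):
import os

# Any non-empty value disables gRPC so the client falls back to the HTTP API.
# It must be set before google.cloud.datastore is imported.
os.environ['GOOGLE_CLOUD_DISABLE_GRPC'] = 'true'

from google.cloud import datastore

_client = datastore.Client(project='project-name-kept-private')
entity = _client.get(_client.key('EntityKind', 1234))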

Related

Ensuring at most a single instance of job executing on Kubernetes and writing into Postgresql

I have a Python program that I am running as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to Postgresql from within the job, the solution should somehow leverage these two. I thought a bit about it and came up with the following ideas:
Find a setting in Kubernetes that would set this limit; attempts to start a second instance would then fail. I was unable to find this setting.
Create a shared lock or mutex. The disadvantage is that if the job crashes, I may not unlock before quitting.
Kubernetes is running etcd; maybe I can use that.
Create a 'lock' table in Postgresql; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds, while the others quit. I have not yet thought this out, but it should work.
Query the Kubernetes API for a label I use on the job and see if there are any instances. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the jobserver code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes, while the other waits. You could use the language platform's locking features to achieve this, or use a message queue.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2, is that if you update the job server code, then you might have a situation where 2 job servers might be running at the same time. This would destroy the isolation property you want. Hence, database might work better, or be more idiomatic in the k8s sense (all servers are stateless so all the k8s goodies work; put any shared state in a database that can handle concurrency).
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start things with the same name (in the metadata of the spec). But anything else goes for a job, and k8s will start another job.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) A postgres lock value should work. Even in case of a job crash, you don't have to worry about the value of the lock remaining set (a sketch using a session-level advisory lock follows at the end of this answer).
Querying the k8s API server for things that should be atomic is not a good idea, like you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted; and again, if I want to push an update to the event-handler server, there might be 2 event handlers running at the same time.
I would recommend sticking with what you are best familiar with. In my case that would be implementing a job-server like k8s deployment that runs as a server and listens to events/API calls.
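For the Postgres option, a minimal sketch using a session-level advisory lock with psycopg2 might look like the following (the connection string, lock id, and run_job() are placeholders invented for the example). Because the lock belongs to the database session, it is released automatically if the job crashes and the connection drops:
import sys
import psycopg2

conn = psycopg2.connect("dbname=jobs user=worker")  # placeholder connection string
conn.autocommit = True
cur = conn.cursor()

# Session-level advisory lock: released automatically when this connection
# goes away, so a crashed job cannot leave the lock held.
cur.execute("SELECT pg_try_advisory_lock(%s)", (42,))  # 42 = arbitrary id for this job
(got_lock,) = cur.fetchone()

if not got_lock:
    print("another instance is already running, exiting")
    sys.exit(0)

run_job()  # placeholder for the actual work

cur.execute("SELECT pg_advisory_unlock(%s)", (42,))
conn.close()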

Building an HTTP API for continuously running python process

TL;DR: I have a beautifully crafted, continuously running piece of Python code controlling and reading out a physics experiment. Now I want to add an HTTP API.
I have written a module which controls the hardware using USB. I can script several types of autonomously operating experiments, but I'd like to control my running experiment over the internet. I like the idea of an HTTP API, and have implemented a proof-of-concept using Flask's development server.
The experiment runs as a single process claiming the USB connection and periodically (every 16 ms) all data is read out. This process can write hardware settings and commands, and reads data and command responses.
I have a few problems choosing the 'correct' way to communicate with this process:
It works if the HTTP server only has a single worker; then I can use Python's multiprocessing.Pipe for communication.
Using more-or-less low-level sockets (or things like zeromq) should work, even for request/response, but then I have to implement some sort of protocol: send {'cmd': 'set_voltage', 'value': 900} instead of calling hardware.set_voltage(800) (which I can use in the stand-alone scripts).
I could use some sort of RPC, but as far as I know they all (SimpleXMLRPCServer, Pyro) use some sort of event loop for the 'server', in this case the process running the experiment, to process requests. But I can't have an event loop waiting for incoming requests; it should be reading out my hardware!
I googled around quite a bit, but however I try to rephrase my question, I end up with Celery as the answer, which is mostly about firing off one job after another, not about communicating with a long-running process.
I'm confused. I can get this to work, but I fear I'll be reinventing a few wheels. I just want to launch my app in the terminal, open a web browser from anywhere, and monitor and control my experiment.
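As an illustration of the command-dict protocol mentioned above, a rough sketch with pyzmq might look like this (the port, command names, and reply format are invented for the example; the web process would create its own context and REQ socket):
import zmq

# --- experiment process: poll for commands without blocking the readout loop ---
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://127.0.0.1:5555")

def process_remote_requests():
    try:
        cmd = socket.recv_json(flags=zmq.NOBLOCK)  # raises zmq.Again if nothing is waiting
    except zmq.Again:
        return
    if cmd['cmd'] == 'set_voltage':
        muonlab.set_pmt1_voltage(cmd['value'])
        socket.send_json({'ok': True})
    else:
        socket.send_json({'error': 'unknown command'})

# --- web process (e.g. inside a Flask view), with its own context ---
req = zmq.Context().socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")
req.send_json({'cmd': 'set_voltage', 'value': 900})
reply = req.recv_json()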
Update: The following code is a basic example of using the module:
from pysparc.muonlab.muonlab_ii import MuonlabII
muonlab = MuonlabII()
muonlab.select_lifetime_measurement()
muonlab.set_pmt1_voltage(900)
muonlab.set_pmt1_threshold(500)
lifetimes = []
while True:
    data = muonlab.read_lifetime_data()
    if data:
        print "Muon decays detected with lifetimes", data
        lifetimes.extend(data)
The module lives at https://github.com/HiSPARC/pysparc/tree/master/pysparc/muonlab.
My current implementation of the HTTP API lives at https://github.com/HiSPARC/pysparc/blob/master/bin/muonlab_with_http_api.
I'm pretty happy with the module (with lots of tests), but the HTTP API runs using Flask's single-threaded development server (which the documentation and the internet tell me is a bad idea) and passes dictionaries through a Pipe as some sort of IPC. I'd love to be able to do something like this in the above script:
while True:
    data = muonlab.read_lifetime_data()
    if data:
        print "Muon decays detected with lifetimes", data
        lifetimes.extend(data)
    process_remote_requests()
where process_remote_requests is a fairly short function to call the muonlab instance or return data. Then, in my Flask views, I'd have something like:
muonlab = RemoteMuonlab()

@app.route('/pmt1_voltage', methods=['GET', 'PUT'])
def get_data():
    if request.method == 'PUT':
        voltage = request.form['voltage']
        muonlab.set_pmt1_voltage(voltage)
    else:
        voltage = muonlab.get_pmt1_voltage()
    return jsonify(voltage=voltage)
Getting the measurement data from the app is perhaps less of a problem, since I could store that in SQLite or something else that handles concurrent access.
But... you do have an IO loop; it runs every 16ms.
You can use BaseHTTPServer.HTTPServer in such a case; just set the timeout attribute to something small. Basically...
from time import sleep
from SimpleXMLRPCServer import SimpleXMLRPCServer

class XmlRPCApi:
    def do_something(self):
        print "doing something"

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_instance(XmlRPCApi())
server.timeout = 0  # handle_request() returns immediately when no request is pending

while True:
    sleep(0.016)
    do_normal_thing()        # your existing hardware readout goes here
    server.handle_request()  # serve at most one pending request, then return
Edit: Python has a built-in server, also built on BaseHTTPServer, capable of serving a Flask app. Since flask.Flask() happens to be a WSGI-compliant application, your process_remote_requests() could look like this:
import wsgiref.simple_server

remote_server = wsgiref.simple_server.make_server('localhost', 8000, app)
# app here is just your Flask() application!

# as before, set timeout to zero so that you can go right back
# to your event loop if there are no requests to handle
remote_server.timeout = 0

def process_remote_requests():
    remote_server.handle_request()
This works well enough if you have only short-running requests; but if you need to handle requests that may take longer than your event loop's normal polling interval, or more requests than you have polls per unit of time, then this approach won't work as-is.
You don't necessarily need to fork off another process, though. You can potentially get by with a pool of worker threads. Roughly:
import threading
import wsgiref.simple_server

remote_server = wsgiref.simple_server.make_server('localhost', 8000, app)

POOL_SIZE = 10  # or some other value
pool = [threading.Thread(target=remote_server.serve_forever) for dummy in xrange(POOL_SIZE)]
for thread in pool:
    thread.daemon = True
    thread.start()

while True:
    pass  # normal experiment processing here; don't handle requests in this thread
However, this approach has one major shortcoming: you now have to deal with concurrency! It's not safe to manipulate your program state as freely as you could with the above loop, since you might be concurrently manipulating that same state in the main thread (or another HTTP server thread). It's up to you to know when this is valid, wrapping each resource with some sort of mutex lock or whatever is appropriate (see the sketch below).
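A minimal sketch of that kind of locking, assuming the shared resource is the muonlab object from the question (the lock name and view function are invented for the example):
import threading

muonlab_lock = threading.Lock()

# In a server thread (e.g. a Flask view): serialize access to the shared hardware object.
def set_voltage(voltage):
    with muonlab_lock:
        muonlab.set_pmt1_voltage(voltage)

# In the main experiment loop: take the same lock around the readout.
with muonlab_lock:
    data = muonlab.read_lifetime_data()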

Google App Engine push tasks targeting backends not executing on production server

I have a really strange problem with the push task queue on backends on production server.
I have some simple tasks which write some log messages and a few NDB entities. When they run on backends, they just "go away" silently. There is only a single line in the backend log for each task saying its POST method has been called, but everything else that would indicate the task actually ran (e.g. log messages written by the task, datastore operations, etc.) is nowhere to be seen.
I tried to monitor the task queue from the console as the tasks run, and I see that almost all the tasks run through without retries. And yet they just don't look like they ever ran at all. If I move them off the backend, they all run normally (log entries, ndb), provided they can finish within the 10-minute deadline. However, on the development server, they all run normally regardless of whether they are on a backend or not. I'm really puzzled as to why these tasks would behave this way.
I have a task handler that does something like this:
class MyTask(webapp2.RequestHandler):
    def post(self):
        _do_some_computation_
        logging.info("some log message")
        some_ndb_entity = _some_thing_
        some_ndb_entity.put()
And depending on the estimated run time of each task of this type (MyTask), I decide whether to target it to a backend or not.
if big_task:
    taskqueue.add(url='/task/mytask',
                  queue_name='queue-for-backend',
                  params={key: value},
                  target='my-backend')
else:
    taskqueue.add(url='/task/mytask',
                  queue_name='queue-regular',
                  params={key: value})
I have configured the two queues with different parameters, with the backend one having a smaller max_concurrent_requests, so I don't really see "503 Instance Unavailable" errors for those tasks on backends. So my problem is: I see those tasks run through on backends without error, but I don't see the log messages they would have written, nor do I see the ndb entities they would have put into the datastore. And this behavior only happens on the production server.
I'm really clueless about this. I will greatly appreciate any help or advice, thanks in advance.

Do I need celery when I am using gevent?

I am working on a django web app that has functions (say, for example, sync_files()) that take a long time to return. When I use gevent, my app does not block when sync_files() runs and other clients can connect and interact with the webapp just fine.
My goal is to have the webapp responsive to other clients and not block. I do not expect a zillion users to connect to my webapp (perhaps max 20 connections), and I do not want to set this up to become the next Twitter. My app is running on a VPS, so I need something lightweight.
So in my case listed above, is it redundant to use celery when I am using gevent? Is there a specific advantage to using celery? I prefer not to use celery since it is yet another service that will be running on my machine.
Edit: I found out that Celery can run the worker pool on gevent. I think I am now a little more unsure about the relationship between gevent and Celery.
In short, you do need Celery.
Even if you use gevent and have concurrency, the problem becomes the request timeout. Let's say your task takes 10 minutes to run, but the typical request timeout is up to about a minute. So if you trigger the task directly within a view, the server will start processing it, but after a minute the client (browser) will probably drop the connection since it will think the server is offline. As a result, your data can become corrupt, since you cannot be sure what will happen when the connection closes. Celery solves this because it triggers a background process which processes the task independently of the view. So the user gets the view response right away, and at the same time the server starts processing the task. That is the correct pattern for any scenario which requires a lot of processing.
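A minimal sketch of that pattern, assuming a standard Celery setup in a Django project (the module names, view name, and response payload are illustrative; sync_files stands in for the long-running function from the question):
# tasks.py
from celery import shared_task

@shared_task
def sync_files():
    ...  # the long-running work from the question goes here

# views.py
from django.http import JsonResponse
from .tasks import sync_files

def start_sync(request):
    sync_files.delay()  # enqueue the task and return immediately
    return JsonResponse({'status': 'started'})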

Twisted: Creating a ThreadPool and then daemonizing leads to uninformative hangs

I am developing a networked application in Twisted, part of which consists of a web interface written in Django.
I wish to use Twisted's WSGI server to host the web interface, and I've written a working "tap" plugin to allow me to use twistd.
When running the server with the -n flag (don't daemonize) everything works fine, but when this flag is removed the server doesn't respond to requests at all, and there are no messages logged (though the server is still running).
There is a bug on Twisted's Trac which seems to describe the problem exactly, and my plugin happens to be based on the code referenced in the ticket.
Unfortunately, the issue hasn't been fixed - and it was raised almost a year ago.
I have attempted to create a ThreadPoolService class, which extends Service and starts a given ThreadPool when startService is called:
class ThreadPoolService(service.Service):
    def __init__(self, pool):
        self.pool = pool

    def startService(self):
        super(ThreadPoolService, self).startService()
        self.pool.start()

    def stopService(self):
        super(ThreadPoolService, self).stopService()
        self.pool.stop()
However, Twisted doesn't seem to be calling the startService method at all. I think the problem is that with a "tap" plugin, the ServiceMaker can only return one service to be started - and any others belonging to the same application aren't started. Obviously, I am returning a TCPServer service which contains the WSGI root.
At this point, I've hit a bit of a brick wall. Does anyone have any ideas as to how I can work around this issue?
Return a MultiService from your ServiceMaker; one that includes your ThreadPoolService as well as your main application service. The API for assembling such a thing is pretty straightforward:
multi = MultiService()
mine = TCPServer(...)                # your existing application service
threads = ThreadPoolService(pool)    # the ThreadPool your WSGI resource uses
mine.setServiceParent(multi)
threads.setServiceParent(multi)
return multi
Given that you've already found the ticket for dealing with this confusing issue within Twisted, I look forward to seeing your patch :).
