Using the Jaeger Python client together with Luigi

Using the Jaeger Python client together with Luigi - python

I'm just starting to use Jaeger for tracing and want to get the Python client to work with Luigi. The root of the problem is, that Luigi uses multiprocessing to fork worker processes. The docs mention that this can cause problems and recommend - in case of web apps - to defer the tracer initialization until the request handling process has been forked. This does not work for me, because I want to create traces in the main process and in the worker processes.
In the main process I initialize the tracer like described here:
from jaeger_client import Config
config = Config(...)
tracer = config.initialize_tracer()
This creates internally a Tornado io-loop that cannot be reused in the forked processes. So I try to re-initialize the Jaeger client in each Luigi worker process. This is possible by setting the (undocumented?) task_process_context of the worker-Section to a class implementing the context manager protocol. It looks like this:
class WorkerContext:
def __init__(self, process):
pass
def __enter__(self):
Config._initialized = False
Config._initialized_lock = threading.Lock()
config = Config(...)
self.tracer = config.initialize_tracer()
def __exit__(self, type, value, traceback):
self.tracer.close()
time.sleep(2)
Those two lines are of course very "hackish":
Config._initialized = False
Config._initialized_lock = threading.Lock()
The class variables are copied to the forked process and initialize_tracer would complain about already being initialized, if I would not reset the variables.
The details of how multiprocessing forks the new process and what this means for the Tornado loop are somewhat "mystic" to me. So my question is:
Is the above code safe or am I asking for trouble?
Of course I would get rid of accessing Configs internals. But if the solution could be considered safe, I would ask the maintainers for a reset method, that can be called in a forked process.

Related

How can I provide shared state to my Flask app with multiple workers without depending on additional software?

I want to provide shared state for a Flask app which runs with multiple workers, i. e. multiple processes.
To quote this answer from a similar question on this topic:
You can't use global variables to hold this sort of data. [...] Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs.
(Source: Are global variables thread safe in flask? How do I share data between requests?)
My question is on that last part regarding suggestions on how to provide the data "outside" of Flask. Currently, my web app is really small and I'd like to avoid requirements or dependencies on other programs. What options do I have if I don't want to run Redis or anything else in the background but provide everything with the Python code of the web app?

If your webserver's worker type is compatible with the multiprocessing module, you can use multiprocessing.managers.BaseManager to provide a shared state for Python objects. A simple wrapper could look like this:
from multiprocessing import Lock
from multiprocessing.managers import AcquirerProxy, BaseManager, DictProxy
def get_shared_state(host, port, key):
shared_dict = {}
shared_lock = Lock()
manager = BaseManager((host, port), key)
manager.register("get_dict", lambda: shared_dict, DictProxy)
manager.register("get_lock", lambda: shared_lock, AcquirerProxy)
try:
manager.get_server()
manager.start()
except OSError: # Address already in use
manager.connect()
return manager.get_dict(), manager.get_lock()
You can assign your data to the shared_dict to make it accessible across processes:
HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0
shared_dict["text"] = "Hello World"
shared_dict["array"] = numpy.array([1, 2, 3])
However, you should be aware of the following circumstances:
Use shared_lock to protect against race conditions when overwriting values in shared_dict. (See Flask example below.)
There is no data persistence. If you restart the app, or if the main (the first) BaseManager process dies, the shared state is gone.
With this simple implementation of BaseManager, you cannot directly edit nested values in shared_dict. For example, shared_dict["array"][1] = 0 has no effect. You will have to edit a copy and then reassign it to the dictionary key.
Flask example:
The following Flask app uses a global variable to store a counter number:
from flask import Flask
app = Flask(__name__)
number = 0
#app.route("/")
def counter():
global number
number += 1
return str(number)
This works when using only 1 worker gunicorn -w 1 server:app. When using multiple workers gunicorn -w 4 server:app it becomes apparent that number is not a shared state but individual for each worker process.
Instead, with shared_dict, the app looks like this:
from flask import Flask
app = Flask(__name__)
HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0
#app.route("/")
def counter():
with shared_lock:
shared_dict["number"] += 1
return str(shared_dict["number"])
This works with any number of workers, like gunicorn -w 4 server:app.

your example is a bit magic for me! I'd suggest reusing the magic already in the multiprocessing codebase in the form of a Namespace. I've attempted to make the following code compatible with spawn servers (i.e. MS Windows) but I only have access to Linux machines, so can't test there
start by pulling in dependencies and defining our custom Manager and registering a method to get out a Namespace singleton:
from multiprocessing.managers import BaseManager, Namespace, NamespaceProxy
class SharedState(BaseManager):
_shared_state = Namespace(number=0)
#classmethod
def _get_shared_state(cls):
return cls._shared_state
SharedState.register('state', SharedState._get_shared_state, NamespaceProxy)
this might need to be more complicated if creating the initial state is expensive and hence should only be done when it's needed. note that the OPs version of initialising state during process startup will cause everything to reset if gunicorn starts a new worker process later, e.g. after killing one due to a timeout
next I define a function to get access to this shared state, similar to how the OP does it:
def shared_state(address, authkey):
manager = SharedState(address, authkey)
try:
manager.get_server() # raises if another server started
manager.start()
except OSError:
manager.connect()
return manager.state()
though I'm not sure if I'd recommend doing things like this. when gunicorn starts it spawns lots of processes that all race to run this code and it wouldn't surprise me if this could go wrong sometimes. also if it happens to kill off the server process (because of e.g. a timeout) every other process will start to fail
that said, if we wanted to use this we would do something like:
ss = shared_state('server.sock', b'noauth')
ss.number += 1
this uses Unix domain sockets (passing a string rather than a tuple as an address) to lock this down a bit more.
also note this has the same race conditions as the OP's code: incrementing a number will cause the value to be transferred to the worker's process, which is then incremented, and sent back to the server. I'm not sure what the _lock is supposed to be protecting, but I don't think it'll do much

How to fork and join multiple subprocesses with a global timeout in Python?

I want to execute some tasks in parallel in multiple subprocesses and time out if the tasks were not completed within some delay.
A first approach consists in forking and joining the subprocesses individually with remaining timeouts computed with respect to the global timeout, like suggested in this answer. It works fine for me.
A second approach, which I want to use here, consists in creating a pool of subprocesses and waiting with the global timeout, like suggested in this answer.
However I have a problem with the second approach: after feeding the pool of subprocesses with tasks that have multiprocessing.Event() objects, waiting for their completion raises this exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
Here is the Python code snippet:
import multiprocessing.pool
import time
class Worker:
def __init__(self):
self.event = multiprocessing.Event() # commenting this removes the RuntimeError
def work(self, x):
time.sleep(1)
return x * 10
if __name__ == "__main__":
pool_size = 2
timeout = 5
with multiprocessing.pool.Pool(pool_size) as pool:
result = pool.map_async(Worker().work, [4, 5, 2, 7])
print(result.get(timeout)) # raises the RuntimeError

In the "Programming guidlines" section of the multiprocessing — Process-based parallelism documentation, there is this paragraph:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So multiprocessing.Event() caused a RuntimeError because it is not pickable, as demonstrated by the following Python code snippet:
import multiprocessing
import pickle
pickle.dumps(multiprocessing.Event())
which raises the same exception:
RuntimeError: Condition objects should only be shared between processes through inheritance
A solution is to use a proxy object:
A proxy is an object which refers to a shared object which lives (presumably) in a different process.
because:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
multiprocessing.Manager().Event() creates a shared threading.Event() object and returns a proxy for it, so replacing this line:
self.event = multiprocessing.Event()
by the following line in the Python code snippet of the question solves the problem:
self.event = multiprocessing.Manager().Event()

Multiprocessing errors in OS X with python2.7 on pre-El Capitan machines

The context for this is much, much too big for an SO question so the code below is a extremely simplified demonstration of the actual implementation.
Generally, I've written an extensive module for academic contexts that launches a subprocess at runtime to be used for event scheduling. When a script or program using this module closes on pre-El Capitan machines my efforts to join the child process fail, as do my last-ditch efforts to just kill the process; OS X gives a "Python unexpectedly quit" error and the the orphaned process persists. I am very much a nub to multiprocessing, without a CS background; diagnosing this is beyond me.
If I am just too ignorant, I'm more than willing to go RTFM; specific directions welcome.
I'm pretty sure this example is coherent & representative, but, know that the actual project works flawlessly on El Capitan, works during runtime on everything else, but consistently crashes as described when quitting. I've tested it with absurd time-out values (30 sec+); always the same result.
One last note: I started this with python's default multiprocessing libraries, then switched to billiard as a dev friend suggested it might run smoother. To date, I've not experienced any difference.
UPDATE:
Had omitted the function that gives the #threaded decorator purpose; now present in code.
Generally, we have:
shared_queue = billiard.Queue() # or multiprocessing, have used both
class MainInstanceParent(object):
def __init__(self):
# ..typically init stuff..
self.event_ob = EventClass(self) # gets a reference to parent
def quit():
try:
self.event_ob.send("kkbai")
started = time.time()
while time.time - started < 1: # or whatever
self.event_ob.recieve()
if self.event_ob.event_p.is_alive():
raise RuntimeError("Little bugger still kickin'")
except RuntimeError:
os.kill(self.event_on.event_p.pid, SIGKILL)
class EventClass(object):
def __init__(self, parent):
# moar init stuff
self.parent = parent
self.pipe, child = Pipe()
self.event_p = __event_process(child)
def receive():
self.pipe.poll()
t = self.pipe.recv()
if isinstance(t, Exception):
raise t
return t
def send(deets):
self.pipe.send(deets)
def threaded(func):
def threaded_func(*args, **kwargs):
p = billiard.Process(target=func, args=args, kwargs=kwargs)
p.start()
return p
return threaded_func
#threaded
def __event_process(pipe):
while True:
if pipe.poll():
inc = pipe.recv()
# do stuff conditionally on what comes through
if inc == "kkbai":
return
if inc == "meets complex condition to pass here":
shared_queue.put("stuff inferred from inc")

Before exiting the main program, call multiprocessing.active_children() to see how many child processes are still running. This will also join the processes that have already quit.
If you would need to signal the children that it's time to quit, create a multiprocessing.Event before starting the child processes. Give it a meaningful name like children_exit. The child processes should regularly call children_exit.is_set() to see if it is time for them to quit. In the main program you call children_exit.set() to signal the child processes.
Update:
Have a good look through the Programming guidelines in the multiprocessing documentation;
It is best to provide the abovementioned Event objects as argument to the target of the Process initializer for reasons mentioned in those guidelines.
If your code also needs to run on ms-windows, you have to jump through some extra hoop, since that OS doesn't do fork().
Update 2:
On your PyEval_SaveThread error; could you modify your question to show the complete trace or alternatively could you post it somewhere?
Since multiprocessing uses threads internally, this is probably the culprit, unless you are also using threads somewhere.
If you also use threads note that GUI toolkits in general and tkinter in particular are not thread safe. Tkinter calls should therefore only be made from one thread!
How much work would it be to port your code to Python 3? If it is a bug in Python 2.7, it might be already fixed in the current (as of now) Python 3.5.1.

Shared pool map between processes with object-oriented python

(python2.7)
I'm trying to do a kind of scanner, that has to walk through CFG nodes, and split in different processes on branching for parallelism purpose.
The scanner is represented by an object of class Scanner. This class has one method traverse that walks through the said graph and splits if necessary.
Here how it looks:
class Scanner(object):
def __init__(self, atrb1, ...):
self.attribute1 = atrb1
self.process_pool = Pool(processes=4)
def traverse(self, ...):
[...]
if branch:
self.process_pool.map(my_func, todo_list).
My problem is the following:
How do I create a instance of multiprocessing.Pool, that is shared between all of my processes ? I want it to be shared, because since a path can be splitted again, I do not want to end with a kind of fork bomb, and having the same Pool will help me to limit the number of processes running at the same time.
The above code does not work, since Pool can not be pickled. In consequence, I have tried that:
class Scanner(object):
def __getstate__(self):
self_dict = self.__dict__.copy()
def self_dict['process_pool']
return self_dict
[...]
But obviously, it results in having self.process_pool not defined in the created processes.
Then, I tried to create a Pool as a module attribute:
process_pool = Pool(processes=4)
def my_func(x):
[...]
class Scanner(object):
def __init__(self, atrb1, ...):
self.attribute1 = atrb1
def traverse(self, ...):
[...]
if branch:
process_pool.map(my_func, todo_list)
It does not work, and this answer explains why.
But here comes the thing, wherever I create my Pool, something is missing. If I create this Pool at the end of my file, it does not see self.attribute1, the same way it did not see answer and fails with an AttributeError.
I'm not even trying to share it yet, and I'm already stuck with Multiprocessing way of doing thing.
I don't know if I have not been thinking correctly the whole thing, but I can not believe it's so complicated to handle something as simple as "having a worker pool and giving them tasks".
Thank you,
EDIT:
I resolved my first problem (AttributeError), my class had a callback as its attribute, and this callback was defined in the main script file, after the import of the scanner module... But the concurrency and "do not fork bomb" thing is still a problem.

What you want to do can't be done safely. Think about if you somehow had a single shared Pool shared across parent and worker processes, with, say, two worker processes. The parent runs a map that tries to perform two tasks, and each task needs to map two more tasks. The two parent dispatched tasks go to each worker, and the parent blocks. Each worker sends two more tasks to the shared pool and blocks for them to complete. But now all workers are now occupied, waiting for a worker to become free; you've deadlocked.
A safer approach would be to have the workers return enough information to dispatch additional tasks in the parent. Then you could do something like:
class MoreWork(object):
def __init__(self, func, *args):
self.func = func
self.args = args
pool = multiprocessing.Pool()
try:
base_task = somefunc, someargs
outstanding = collections.deque([pool.apply_async(*base_task)])
while outstanding:
result = outstanding.popleft().get()
if isinstance(result, MoreWork):
outstanding.append(pool.apply_async(result.func, result.args))
else:
... do something with a "final" result, maybe breaking the loop ...
finally:
pool.terminate()
What the functions are is up to you, they'd just return information in a MoreWork when there was more to do, not launch a task directly. The point is to ensure that by having the parent be solely responsible for task dispatch, and the workers solely responsible for task completion, you can't deadlock due to all workers being blocked waiting for tasks that are in the queue, but not being processed.
This is also not at all optimized; ideally, you wouldn't block waiting on the first item in the queue if other items in the queue were complete; it's a lot easier to do this with the concurrent.futures module, specifically with concurrent.futures.wait to wait on the first available result from an arbitrary number of outstanding tasks, but you'd need a third party PyPI package to get concurrent.futures on Python 2.7.

Maintaining log file from multiple threads in Python

I have my python baseHTTPServer server, which handles post requests.
I used ThreadingMixIn and its now opens a thread for each connection.
I wish to do several multithreaded actions, such as:
1. Monitoring successful/failed connections activities, by adding 1 to a counter for each.
I need a lock for that. My counter is in global scope of the same file. How can I do that?
2. I wish to handle some sort of queue and write it to a file, where the content of the queue is a set of strings, written from my different threads, that simply sends some information for logging issues. How can it be done? I fail to accomplish that since my threading is done "behind the scenes", as each time Im in do_POST(..) method, Im already in a different thread.
Succcessful_Logins = 0
Failed_Logins = 0
LogsFile = open(logfile)
class httpHandler(BaseHTTPRequestHandler):
def do_POST(self):
..
class ThreadingHTTPServer(ThreadingMixIn, HTTPServer):
pass
server = ThreadingHTTPServer(('localhost', PORT_NUMBER), httpHandler)
server.serve_forever()
this is a small fragile of my server.
Another thing that bothers my is the face I want to first send the post response back to the client, and only then possibly get delayed due to locking mechanism or whatever.

From your code, it looks like a new httpHandler is constructed in each thread? If that's the case you can use a class variable for the count and a mutex to protect the count like:
class httpHandler(...):
# Note that these are class variables and are therefore accessable
# to all instances
numSuccess = 0
numSuccessLock = new threading.Lock()
def do_POST(self):
self.numSuccessLock.aquire()
self.numSuccess += 1
self.numSuccessLock.release()
As for writing to a file from different threads, there are a few options:
Use the logging module, "The logging module is intended to be thread-safe without any special work needing to be done by its clients." from http://docs.python.org/2/library/logging.html#thread-safety
Use a Lock object like above to serialize writes to the file
Use a thread safe queue to queue up writes and then read from the queue and write to the file from a separate thread. See http://docs.python.org/2/library/queue.html#module-Queue for examples.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.