I have a python program where I need to load and de-serialize a 1GB pickle file. It takes a good 20 seconds and I would like to have a mechanism whereby the content of the pickle is readily available for use. I've looked at shared_memory but all the examples of its use seem to involve numpy and my project doesn't use numpy. What is the easiest and cleanest way to achieve this using shared_memory or otherwise?
This is how I'm loading the data now (on every run):
def load_pickle(pickle_name):
return pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
I would like to be able to edit the simulation code in between runs without having to reload the pickle. I've been messing around with importlib.reload but it really doesn't seem to work well for a large Python program with many file:
def main():
data_manager.load_data()
run_simulation()
while True:
try:
importlib.reload(simulation)
run_simulation()
except:
print(traceback.format_exc())
print('Press enter to re-run main.py, CTRL-C to exit')
sys.stdin.readline()
This could be an XY problem, the source of which being the assumption that you must use pickles at all; they're just awful to deal with due to how they manage dependencies and are fundamentally a poor choice for any long-term data storage because of it
The source financial data is almost-certainly in some tabular form to begin with, so it may be possible to request it in a friendlier format
A simple middleware to deserialize and reserialize the pickles in the meantime will smooth the transition
input -> load pickle -> write -> output
Converting your workflow to use Parquet or Feather which are designed to be efficient to read and write will almost-certainly make a considerable difference to your load speed
Further relevant links
Answer to How to reversibly store and load a Pandas dataframe to/from disk
What are the pros and cons of parquet format compared to other formats?
You may also be able to achieve this with hickle, which will internally use a HDH5 format, ideally making it significantly faster than pickle, while still behaving like one
An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.
sudo mkdir mnt/pickle
mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl mnt/pickle/pickle.pkl
Then you can access the pickle at mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk read is not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try turning down the size of the ramdisk (I set it at 1536 mb, or 1.5gb)
You can use shareable list:
So you will have 1 python program running which will load the file and save it in memory and another python program which can take the file from memory. Your data, whatever is it you can load it in dictionary and then dump it as json and then reload json.
So
Program1
import pickle
import json
from multiprocessing.managers import SharedMemoryManager
YOUR_DATA=pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict={'DATA':YOUR_DATA}
data_dict_json=json.dumps(data_dict)
smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha','beta',data_dict_json])
print (sl)
#smm.shutdown() commenting shutdown now but you will need to do it eventually
The output will look like this
#OUTPUT
>>>ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')
Now in Program2:
from multiprocessing import shared_memory
load_from_mem=shared_memory.ShareableList(name='psm_12abcd')
load_from_mem[1]
#OUTPUT
'beta'
load_from_mem[2]
#OUTPUT
yourdataindictionaryformat
You can look for more over here
https://docs.python.org/3/library/multiprocessing.shared_memory.html
Adding another assumption-challenging answer, it could be where you're reading your files from that makes a big difference
1G is not a great amount of data with today's systems; at 20 seconds to load, that's only 50MB/s, which is a fraction of what even the slowest disks provide
You may find you actually have a slow disk or some type of network share as your real bottleneck and that changing to a faster storage medium or compressing the data (perhaps with gzip) makes a great difference to read and writing
Here are my assumptions while writing this answer:
Your Financial data is being produced after complex operations and you want the result to persist in memory
The code that consumes must be able to access that data fast
You wish to use shared memory
Here are the codes (self-explanatory, I believe)
Data structure
'''
Nested class definitions to simulate complex data
'''
class A:
def __init__(self, name, value):
self.name = name
self.value = value
def get_attr(self):
return self.name, self.value
def set_attr(self, n, v):
self.name = n
self.value = v
class B(A):
def __init__(self, name, value, status):
super(B, self).__init__(name, value)
self.status = status
def set_attr(self, n, v, s):
A.set_attr(self, n,v)
self.status = s
def get_attr(self):
print('\nName : {}\nValue : {}\nStatus : {}'.format(self.name, self.value, self.status))
Producer.py
from multiprocessing import shared_memory as sm
import time
import pickle as pkl
import pickletools as ptool
import sys
from class_defs import B
def main():
# Data Creation/Processing
obj1 = B('Sam Reagon', '2703', 'Active')
#print(sys.getsizeof(obj1))
obj1.set_attr('Ronald Reagon', '1023', 'INACTIVE')
obj1.get_attr()
###### real deal #########
# Create pickle string
byte_str = pkl.dumps(obj=obj1, protocol=pkl.HIGHEST_PROTOCOL, buffer_callback=None)
# compress the pickle
#byte_str_opt = ptool.optimize(byte_str)
byte_str_opt = bytearray(byte_str)
# place data on shared memory buffer
shm_a = sm.SharedMemory(name='datashare', create=True, size=len(byte_str_opt))#sys.getsizeof(obj1))
buffer = shm_a.buf
buffer[:] = byte_str_opt[:]
#print(shm_a.name) # the string to access the shared memory
#print(len(shm_a.buf[:]))
# Just an infinite loop to keep the producer running, like a server
# a better approach would be to explore use of shared memory manager
while(True):
time.sleep(60)
if __name__ == '__main__':
main()
Consumer.py
from multiprocessing import shared_memory as sm
import pickle as pkl
from class_defs import B # we need this so that while unpickling, the object structure is understood
def main():
shm_b = sm.SharedMemory(name='datashare')
byte_str = bytes(shm_b.buf[:]) # convert the shared_memory buffer to a bytes array
obj = pkl.loads(data=byte_str) # un-pickle the bytes array (as a data source)
print(obj.name, obj.value, obj.status) # get the values of the object attributes
if __name__ == '__main__':
main()
When the Producer.py is executed in one terminal, it will emit a string identifier (say, wnsm_86cd09d4) for the shared memory. Enter this string in the Consumer.py and execute it in another terminal.
Just run the Producer.py in one terminal and the Consumer.py on another terminal on the same machine.
I hope this is what you wanted!
You can take advantage of multiprocessing to run the simulations inside of subprocesses, and leverage the copy-on-write benefits of forking to unpickle/process the data only once at the start:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
def _run_simulation(_):
# Wrapper for `run_simulation` that takes one argument. The function passed
# into `multiprocessing.Pool.map` must take one argument.
run_simulation()
with mp.Pool() as pool:
pool.map(_run_simulation, range(num_simulations))
If you want to parameterize each simulation run, you can do so like so:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
with mp.Pool() as pool:
simulations = ('arg for simulation run', 'arg for another simulation run')
pool.map(run_simulation, simulations)
This way the run_simulation function will be passed in the values from the simulations tuple, which can allow for having each simulation run with different parameters, or even just assign each run a ID number of name for logging/saving purposes.
This whole approach relies on fork being available. For more information about using fork with Python's built-in multiprocessing library, see the docs about contexts and start methods. You may also want to consider using the forkserver multiprocessing context (by using mp = multiprocessing.get_context('fork')) for the reasons described in the docs.
If you don't want to run your simulations in parallel, this approach can be adapted for that. The key thing is that in order to only have to process the data once, you must call run_simulation within the process that processed the data, or one of its child processes.
If, for instance, you wanted to edit what run_simulation does, and then run it again at your command, you could do it with code resembling this:
main.py:
import multiprocessing
from multiprocessing.connection import Connection
import pickle
from data import load_data
# Load/process data in the parent process
load_data()
# Now child processes can access the data nearly instantaneously
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork') # Consider using 'forkserver' instead
# This is only ever run in child processes
def load_and_run_simulation(result_pipe: Connection) -> None:
# Import `run_simulation` here to allow it to change between runs
from simulation import run_simulation
# Ensure that simulation has not been imported in the parent process, as if
# so, it will be available in the child process just like the data!
try:
run_simulation()
except Exception as ex:
# Send the exception to the parent process
result_pipe.send(ex)
else:
# Send this because the parent is waiting for a response
result_pipe.send(None)
def run_simulation_in_child_process() -> None:
result_pipe_output, result_pipe_input = mp.Pipe(duplex=False)
proc = mp.Process(
target=load_and_run_simulation,
args=(result_pipe_input,)
)
print('Starting simulation')
proc.start()
try:
# The `recv` below will wait until the child process sends sometime, or
# will raise `EOFError` if the child process crashes suddenly without
# sending an exception (e.g. if a segfault occurs)
result = result_pipe_output.recv()
if isinstance(result, Exception):
raise result # raise exceptions from the child process
proc.join()
except KeyboardInterrupt:
print("Caught 'KeyboardInterrupt'; terminating simulation")
proc.terminate()
print('Simulation finished')
if __name__ == '__main__':
while True:
choice = input('\n'.join((
'What would you like to do?',
'1) Run simulation',
'2) Exit\n',
)))
if choice.strip() == '1':
run_simulation_in_child_process()
elif choice.strip() == '2':
exit()
else:
print(f'Invalid option: {choice!r}')
data.py:
from functools import lru_cache
# <obtain 'DATA_ROOT' and 'pickle_name' here>
#lru_cache
def load_data():
with open(DATA_ROOT + pickle_name, 'rb') as f:
return pickle.load(f)
simulation.py:
from data import load_data
# This call will complete almost instantaneously if `main.py` has been run
data = load_data()
def run_simulation():
# Run the simulation using the data, which will already be loaded if this
# is run from `main.py`.
# Anything printed here will appear in the output of the parent process.
# Exceptions raised here will be caught/handled by the parent process.
...
The three files detailed above should all be within the same directory, alongside an __init__.py file that can be empty. The main.py file can be renamed to whatever you'd like, and is the primary entry-point for this program. You can run simulation.py directly, but that will result in a long time spent loading/processing the data, which was the problem you ran into initially. While main.py is running, the file simulation.py can be edited, as it is reloaded every time you run the simulation from main.py.
For macOS users: forking on macOS can be a bit buggy, which is why Python defaults to using the spawn method for multiprocessing on macOS, but still supports fork and forkserver for it. If you're running into crashes or multiprocessing-related issues, try adding OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES to your environment. See https://stackoverflow.com/a/52230415/5946921 for more details.
As I understood:
something is needed to be loaded
it is needed to be loaded often, because file with code which uses this something is edited often
you don't want to wait until it will be loaded every time
Maybe such solution will be okay for you.
You can write script loader file in such way (tested on Python 3.8):
import importlib.util, traceback, sys, gc
# Example data
import pickle
something = pickle.loads(pickle.dumps([123]))
if __name__ == '__main__':
try:
mod_path = sys.argv[1]
except IndexError:
print('Usage: python3', sys.argv[0], 'PATH_TO_SCRIPT')
exit(1)
modules_before = list(sys.modules.keys())
argv = sys.argv[1:]
while True:
MOD_NAME = '__main__'
spec = importlib.util.spec_from_file_location(MOD_NAME, mod_path)
mod = importlib.util.module_from_spec(spec)
# Change to needed global name in the target module
mod.something = something
sys.modules[MOD_NAME] = mod
sys.argv = argv
try:
spec.loader.exec_module(mod)
except:
traceback.print_exc()
del mod, spec
modules_after = list(sys.modules.keys())
for k in modules_after:
if k not in modules_before:
del sys.modules[k]
gc.collect()
print('Press enter to re-run, CTRL-C to exit')
sys.stdin.readline()
Example of module:
# Change 1 to some different number when first script is running and press enter
something[0] += 1
print(something)
Should work. And should reduce the reload time of pickle close to zero 🌝
UPD
Add a possibility to accept script name with command line arguments
This is not exact answer to the question as the Q looks as pickle and SHM are required, but others went of the path, so I am going to share a trick of mine. It might help you. There are some fine solutions here using the pickle and SHM anyway. Regarding this I can offer only more of the same. Same pasta with slight sauce modifications.
Two tricks I employ when dealing with your situations are as follows.
First is to use sqlite3 instead of pickle. You can even easily develop a module for a drop-in replacement using sqlite. Nice thing is that data will be inserted and selected using native Python types, and you can define yourown with converter and adapter functions that would use serialization method of your choice to store complex objects. Can be a pickle or json or whatever.
What I do is to define a class with data passed in through *args and/or **kwargs of a constructor. It represents whatever obj model I need, then I pick-up rows from "select * from table;" of my database and let Python unwrap the data during the new object initialization. Loading big amount of data with datatype conversions, even the custom ones is suprisingly fast. sqlite will manage buffering and IO stuff for you and do it faster than pickle. The trick is construct your object to be filled and initiated as fast as possible. I either subclass dict() or use slots to speed up the thing.
sqlite3 comes with Python so that's a bonus too.
The other method of mine is to use a ZIP file and struct module.
You construct a ZIP file with multiple files within. E.g. for a pronunciation dictionary with more than 400000 words I'd like a dict() object. So I use one file, let say, lengths.dat in which I define a length of a key and a length of a value for each pair in binary format. Then I have a one file of words and one file of pronunciations all one after the other.
When I load from file, I read the lengths and use them to construct a dict() of words with their pronunciations from two other files. Indexing bytes() is fast, so, creating such a dictionary is very fast. You can even have it compressed if diskspace is a concern, but some speed loss is introduced then.
Both methods will take less place on a disk than the pickle would.
The second method will require you to read into RAM all the data you need, then you will be constructing the objects, which will take almost double of RAM that the data took, then you can discard the raw data, of course. But alltogether shouldn't require more than the pickle takes. As for RAM, the OS will manage almost anything using the virtual memory/SWAP if needed.
Oh, yeah, there is the third trick I use. When I have ZIP file constructed as mentioned above or anything else which requires additional deserialization while constructing an object, and number of such objects is great, then I introduce a lazy load. I.e. Let say we have a big file with serialized objects in it. You make the program load all the data and distribute it per object which you keep in list() or dict().
You write your classes in such a way that when the object is first asked for data it unpacks its raw data, deserializes and what not, removes the raw data from RAM then returns your result. So you will not be losing loading time until you actually need the data in question, which is much less noticeable for a user than 20 secs taking for a process to start.
I implemented the python-preloaded script, which can help you here. It will store the CPython state at an early stage after some modules are loaded, and then when you need it, you can restore from this state and load your normal Python script. Storing currently means that it will stay in memory, and restoring means that it does a fork on it, which is very fast. But these are implementation details of python-preloaded and should not matter to you.
So, to make it work for your use case:
Make a new module, data_preloaded.py or so, and in there, just this code:
preloaded_data = load_pickle(...)
Now run py-preloaded-bundle-fork-server.py data_preloaded -o python-data-preloaded.bin. This will create python-data-preloaded.bin, which can be used as a replacement for python.
I assume you have started python your_script.py before. So now run ./python-data-preloaded.bin your_script.py. Or also just python-data-preloaded.bin (no args). The first time, this will still be slow, i.e. take about 20 seconds. But now it is in memory.
Now run ./python-data-preloaded.bin your_script.py again. Now it should be extremely fast, i.e. a few milliseconds. And you can start it again and again and it will always be fast, until you restart your computer.
I'm trying to create a chess game using tkinter. I don't have a huge experience in python programming, but I kind of find weird the philosophy of tkinter : if my assumptions are correct, it seems to me that using tkinter means setting it as the base of the project, and everything has to work around it. And what I mean by that is that using whatever code that is not 'wrapped' in the tkinter framework is a pain to deal with (you have to use the event system, you have to use the after method if you want to perform an action after starting the main loop, etc.)
I have a rather different view on that, and in my chess project I simply consider the tkinter display as a part of my rendering system, and the event system provided by tkinter as a part of my input parser system. That being said, I want to be able to easily change the renderer or the input parser, which means that I could want to detect input from the terminal (for instance by writing D2 D3) instead of moving the objects on the screen. I could also want to print the chessboard on the terminal instead of having a GUI.
More to the point, because tkinter blocks the thread through the mainloop method instead of looping in another thread, I have to put my Tk object in a different thread, so that I can run the rest of my program in parallel. And I'm having a tough time doing it, because my Tk variable contained by my thread needs to be accessed by my program, to update it for instance.
After quite a bit of research, I found that queues in python were synchronized, which means that if I put my Tk object in a queue, I could access it without any problem from the main thread. I tried to see if the following code was working :
import threading, queue
class VariableContainer(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
self.queue = queue.Queue()
def run(self):
self.queue.put("test")
container = VariableContainer()
container.start()
print(container.queue.get(False))
and it does ! The output is test.
However, if I replace my test string by a Tk object, like below :
import threading, queue
import tkinter
class VariableContainer(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
self.queue = queue.Queue()
def run(self):
root = tkinter.Tk()
self.queue.put(root)
root.mainloop() # whether I call the mainloop or not doesn't change anything
container = VariableContainer()
container.start()
print(container.queue.get(False))
then the print throws an error, stating that the queue is empty.
(Note that the code above is not the code of my program, it is just an exemple since posting sample codes from my project might be less clear)
Why?
The answer to the trivial question you actually asked: you have a race condition because you call Queue.get(block=False). Tk taking a lot longer to initialize, the main thread almost always wins and finds the queue still empty.
The real question is “How do I isolate my logic from the structure of my interface library?”. (While I understand the desire for a simple branch point between “read from the keyboard” and “wait for a mouse event”, it is considered more composable, in the face of large numbers of event types, sources, and handlers, to have one event dispatcher provided by the implementation. It can sometimes be driven one event at a time, but that composes less well with other logic than one might think.)
The usual answer to that is to make your logic a state machine rather than an algorithm. Mechanically, this means replacing local variables with attributes on an object and dividing the code into methods on its class (e.g., one call per “read from keyboard” in a monolithic implementation). Sometimes language features like coroutines can be used to make this transformation mostly transparent/automatic, but they’re not always a good fit. For example:
def algorithm(n):
tot=0
for i in range(n):
s=int(input("#%s:"%i))
tot+=(i+1)*(n-i)*s
print(tot)
class FSM(object):
def __init__(self,n):
self.n=n
self.i=self.tot=0
def send(self,s):
self.tot+=(self.i+1)*(self.n-self.i)*s
self.i+=1
def count(self): return self.i
def done(self): return self.i>=self.n
def get(self):
return self.tot
def coroutine(n): # old-style, not "async def"
tot=0
for i in range(n):
s=(yield)
tot+=(i+1)*(n-i)*s
yield tot
Having done this, it’s trivial to layer the traditional stream-driven I/O back on top, or to connect it to an event-driven system (be it a GUI or asyncio). For example:
def fsmClient(n):
fsm=FSM(n)
while not fsm.done():
fsm.send(int(input("#%s:"%fsm.count())))
return fsm.get()
def coClient(n):
co=coroutine(n)
first=True
while True:
ret=co.send(None if first else
int(input("#%s:"%fsm.count())))
if ret is not None:
co.close()
return ret
first=False
These clients can work with any state machine/coroutine using the same interface; it should be obvious how to instead supply values from a Tkinter input box or so.
I am fairly new to Python, and my experience is specific to its use in Powerflow modelling through the API provided in Siemens PSS/e. I have a script that I have been using for several years that runs some simulation on a large data set.
In order to get to finish quickly, I usually split the inputs up into multiple parts, then run multiple instances of the script in IDLE. Ive recently added a GUI for the inputs, and have refined the code to be more object oriented, creating a class that the GUI passes the inputs to but then works as the original script did.
My question is how do I go about splitting the process from within the program itself rather than making multiple copies? I have read a bit about the mutliprocess module but I am not sure how to apply it to my situation. Essentially I want to be able to instantiate N number of the same object, each running in parallel.
The class I have now (called Bot) is passed a set of arguments and creates a csv output while it runs until it finishes. I have a separate block of code that puts the pieces together at the end but for now I just need to understand the best approach to kicking multiple Bot objects off once I hit run in my GUI. Ther are inputs in the GUI to specify the number of N segments to be used.
I apologize ahead of time if my question is rather vague. Thanks for any information at all as Im sort of stuck and dont know where to look for better answers.
Create a list of configurations:
configurations = [...]
Create a function which takes the relevant configuration, and makes use of your Bot:
def function(configuration):
bot = Bot(configuration)
bot.create_csv()
Create a Pool of workers with however many CPUs you want to use:
from multiprocessing import Pool
pool = Pool(3)
Call the function multiple times which each configuration in your list of configurations.
pool.map(function, configurations)
For example:
from multiprocessing import Pool
import os
class Bot:
def __init__(self, inputs):
self.inputs = inputs
def create_csv(self):
pid = os.getpid()
print('TODO: create csv in process {} using {}'
.format(pid, self.inputs))
def use_bot(inputs):
bot = Bot(inputs)
bot.create_csv()
def main():
configurations = [
['input1_1.txt', 'input1_2.txt'],
['input2_1.txt', 'input2_2.txt'],
['input3_1.txt', 'input3_2.txt']]
pool = Pool(2)
pool.map(use_bot, configurations)
if __name__ == '__main__':
main()
Output:
TODO: create csv in process 10964 using ['input2_1.txt', 'input2_2.txt']
TODO: create csv in process 8616 using ['input1_1.txt', 'input1_2.txt']
TODO: create csv in process 8616 using ['input3_1.txt', 'input3_2.txt']
If you'd like to make life a little less complicated, you can use multiprocess instead of multiprocessing, as there is better support for classes and also for working in the interpreter. You can see below, we can now work directly with a method on a class instance, which is not possible with multiprocessing.
>>> from multiprocess import Pool
>>> import os
>>>
>>> class Bot(object):
... def __init__(self, x):
... self.x = x
... def doit(self, y):
... pid = os.getpid()
... return (pid, self.x + y)
...
>>> p = Pool()
>>> b = Bot(5)
>>> results = p.imap(b.doit, range(4))
>>> print dict(results)
{46552: 7, 46553: 8, 46550: 5, 46551: 6}
>>> p.close()
>>> p.join()
Above, I'm using imap, to get an iterator on the results, which I'll just dump into a dict. Note that you should close your pools after you are done, to clean up. On Windows, you may also want to look at freeze_support, for cases where the code otherwise fails to run.
>>> import multiprocess as mp
>>> mp.freeze_support
I am facing an issue that should be easy to resolve, yet I do feel that I tap in the dark. I was writing a simple framework which consists of the following classes:
First there is an Algorithm class which simply contains numerical procedures:
class Algorithm(object):
.
.
.
#staticmethod
def calculate(parameters):
#do stuff
return result
Then, there is an item class which holds paths to files, utility and status information. A Worker class subclasses QRunnable:
class Worker(QRunnable):
def __init__(self,item,*args,**kwargs):
self.item = item
def run(self,*args,**kwargs):
result = Algorithms.calculate(items.parameter)
item.result = result
And in a Manager class the processes are started
class Manager(object):
def __init__(self,*args,**kwargs):
self.pool = QThreadPool()
self.pool.setMaxThreadCount(4)
self.items = [item1,item2,...]
def onEvent(self):
for i in self.items:
self.pooil.start(item.requestWoker()) #returns a worker
Now the problem: After doing this I notice 2 things:
The work is done not faster (even a bit slower) then doing it with 1 thread
The items get assigned the same results! For example result A, which is the correct result for item A, gets assigned to all items!
I could not find much docu on this, so where did I go wrong?
All the best
Twerp
One limitation of the most common implementation of Python (CPython) is that its parser uses a Global Interpreter Lock which means that only one thread can be parsing Python bytes at a time. It is possible for multiple Python threads to be executing C-based Python subroutines simultaneously, and for multiple Python threads to be waiting on I/O simultaneously, but not for them to be executing Python code simultaneously. Because of that, it is common not to see any speedup when multithreading a CPU-bound Python program. Common workaround are to spawn sub-processes instead of threads (since each sub-process will use its own copy of the Python interpreter, they won't interfere with each other), or to rewrite some or all of the Python program in another language that doesn't have this limitation.
I need to communicate update events to all running instances of my python script, and i would like to keep the code as simple as possible. I have zero experience with communicating between running processes. Up until now, i have been reading/writing configuration files, which each instance will read and/or update.
Here is some pseudo code i have written (sort of a simple template) to wrap my head around how to solve this problem. Please try to help me fill in the blanks. And remember, i have no experience with sockets, threads, etc...
import process # imaginary module
class AppA():
def __init__(self):
# Every instance that opens will need to attach
# itself to the "Instance Manager". If no manager
# exists, then we need to spawn it. Only one manager
# will ever exist no matter how many instances are
# running.
try:
hm = process.get_handle(AppA_InstanceManager)
except NoSuchProgError:
hm.spawn_instance(AppA_InstanceManager)
finally:
hm.register(self)
self.instance_manager = hm
def state_update(self):
# This method won't exist in the real code, however,
# it emulates internal state changes for the sake of
# explaination.
#
# When any internal state changes happen, we will then
# propagate the changes outward by calling the
# appropriate method of "self.instance_manager".
self.instance_manager.propagate_state()
def cb_state_update(self):
# Called from the "Instance Manager" only!
#
# This may be as simple as reading a known
# config file. Or could simply pass data
# to this method.
class AppA_InstanceManager():
def __init__(self):
self.instances = []
def register_instance(self, instance):
self.instances.append(instance)
def unregister_instance(self, instance):
# nieve example for now.
self.instances.remove(instance)
def propagate_state(self):
for instance in self.instances:
instance.cb_state_update(data)
if __name__ == '__main__':
app = AppA()
Any Suggestions?
There are a few options for this kind of design.
You could use a message queue, its made for this kind of stuff, e.g. AMQP or some ZeroMQ or something like it.
Or you could use something like Redis or some other (in-memory) database for synchronization.
If you don't want to use something like that, you could use the multiprocessing modules synchronization stuff.
Or use a platform specific IPC system, e.g. shared memory via mmap, sysv sockets, etc.
If you want to do things the way you explained, you could have a look at Twisteds perspective broker.