Hadoop streaming --- expensive shared resource (COOL)


I am looking for a nice pattern for Python Hadoop streaming that involves loading an expensive resource, for example a pickled Python object on the server. Here is what I came up with; I've tested it by piping input files and slow-running programs directly into the script in bash, but haven't yet run it on a Hadoop cluster. For you Hadoop wizards: am I handling I/O such that this will work as a Python streaming job? I guess I'll go spin up something on Amazon to test, but it would be nice if someone knew off the top of their head.
You can test it out via cat file.txt | the_script or ./a_streaming_program | the_script
#!/usr/bin/env python
import sys
import time
def resources_for_many_lines():
    # load slow, shared resources here
    # for example, a shared pickle file
    # in this example we use a 1 second sleep to simulate
    # a long data load
    time.sleep(1)
    # we will pretend the value zero is the product
    # of our long slow running import
    resource = 0
    return resource
def score_a_line(line, resources):
    # put fast code to score a single example line here
    # in this example we will return the value of resource + 1
    return resources + 1
def run():
    # here is the code that reads stdin and scores the model over a streaming data set
    resources = resources_for_many_lines()
    while 1:
        # reads a line of input
        line = sys.stdin.readline()
        # ends if pipe closes
        if line == '':
            break
        # scores a line (the parenthesized form works on Python 2 and 3)
        print(score_a_line(line, resources))
        # prints right away instead of waiting
        sys.stdout.flush()
if __name__ == "__main__":
    run()

This looks fine to me. I often load up yaml or sqlite resources in my mappers.
You typically won't be running that many mappers in your job so even if you spend a couple of seconds in loading something from disk it's usually not a huge problem.
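As a minimal sketch of that pattern, the mapper below loads a pickled object once per process and reuses it for every input line. The helper names, the file name mentioned afterwards, and the idea that the resource is a dict of per-token weights are all assumptions made for illustration; they are not from the question.

```python
#!/usr/bin/env python
import pickle
import sys


def load_resource(path):
    # Expensive step: runs once per mapper process, not once per line.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def score_line(line, weights):
    # Hypothetical scoring: sum the weights of the whitespace-separated tokens.
    return sum(weights.get(token, 0) for token in line.split())


def run(lines, weights, out=sys.stdout):
    # Emit one tab-separated "input<TAB>score" record per input line.
    for line in lines:
        out.write("%s\t%s\n" % (line.rstrip("\n"), score_line(line, weights)))
    out.flush()
```

In a streaming job the entry point would be something like run(sys.stdin, load_resource("model.pkl")), where model.pkl is an assumed file shipped alongside the script.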

Related

Python multiprocessing progress approach

I've been busy writing my first multiprocessing code and it works, yay.
However, now I would like some feedback of the progress and I'm not sure what the best approach would be.
What my code (see below) does in short:
A target directory is scanned for mp4 files
Each file is analysed by a separate process, the process saves a result (an image)
What I'm looking for could be:
Simple
Each time a process finishes a file it sends a 'finished' message
The main code keeps count of how many files have finished
Fancy
Core 0 processing file 20 of 317 ||||||____ 60% completed
Core 1 processing file 21 of 317 |||||||||_ 90% completed
...
Core 7 processing file 18 of 317 ||________ 20% completed
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Thanks in advance!
EDIT: Changed my code that starts the processes as suggested by gsb22
My code:
# file operations
import os
import glob
# Multiprocessing
from multiprocessing import Process
# Motion detection
import cv2
# >>> Enter directory to scan as target directory
# (raw string, so the backslashes are not treated as escape sequences)
targetDirectory = r"E:\Projects\Programming\Python\OpenCV\videofiles"
def get_videofiles(target_directory):
    # Find all video files in directory and subdirectories and put them in a list
    videofiles = glob.glob(target_directory + '/**/*.mp4', recursive=True)
    # Return the list
    return videofiles
def process_file(videofile):
    '''
    What happens inside this function:
    - The video is processed and analysed using openCV
    - The result (an image) is saved to the results folder
    - Once this function receives the videofile it completes
      without the need to return anything to the main program
    '''
    # The processing code is more complex than this code below, this is just a test
    cap = cv2.VideoCapture(videofile)
    for i in range(10):
        succes, frame = cap.read()
        # cv2.imwrite('{}/_Results/{}_result{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
        if succes:
            try:
                cv2.imwrite('{}/_Results/{}_result_{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
            except Exception:
                print('something went wrong')
if __name__ == "__main__":
    # Create directory to save results if it doesn't exist
    if not os.path.exists(targetDirectory + '/_Results'):
        os.makedirs(targetDirectory + '/_Results')
    # Get a list of all video files in the target directory
    all_files = get_videofiles(targetDirectory)
    print(f'{len(all_files)} video files found')
    # Create list of jobs (processes)
    jobs = []
    # Create and start processes
    for file in all_files:
        proc = Process(target=process_file, args=(file,))
        jobs.append(proc)
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    # TODO: Print some form of progress feedback
    print('Finished :)')

I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Here's a very simple way to get progress indication at minimal cost:
from multiprocessing.pool import Pool
from random import randint
from time import sleep
from tqdm import tqdm
def process(fn) -> bool:
    sleep(randint(1, 3))
    return randint(0, 100) < 70
# The guard is needed wherever worker processes are spawned rather than forked (e.g. Windows)
if __name__ == "__main__":
    files = [f"file-{i}.mp4" for i in range(20)]
    success = []
    failed = []
    NPROC = 5
    with Pool(NPROC) as pool:
        for status, fn in tqdm(zip(pool.imap(process, files), files), total=len(files)):
            if status:
                success.append(fn)
            else:
                failed.append(fn)
    print(f"{len(success)} succeeded and {len(failed)} failed")
Some comments:
tqdm is a 3rd-party library which implements progressbars extremely well. There are others. pip install tqdm.
we use a pool (there's almost never a reason to manage processes yourself for simple things like this) of NPROC processes. We let the pool handle iterating our process function over the input data.
we signal state by having the function return a boolean (in this example we choose randomly, weighting in favour of success). We don't return the filename, although we could, because it would have to be serialised and sent from the subprocess, and that's unnecessary overhead.
we use Pool.imap, which returns an iterator that keeps the same order as the iterable we pass in. So we can use zip to iterate files directly. Since we use an iterator with unknown size, tqdm needs to be told how long it is. (We could have used pool.map, but there's no need to commit the RAM, although for one bool it probably makes no difference.)
I've deliberately written this as a kind of recipe. You can do a lot with multiprocessing just by using the high-level drop in paradigms, and Pool.[i]map is one of the most useful.
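For the "Simple" variant from the question (a counter that ticks each time a file finishes), here is a sketch that needs no third-party library. It swaps in Pool.imap_unordered so results arrive in completion order; the function name run_with_progress and its pool_factory, nproc and report parameters are hypothetical, introduced only for this illustration.

```python
from multiprocessing.pool import Pool


def run_with_progress(func, items, pool_factory=Pool, nproc=4, report=print):
    """Apply func to every item in a worker pool, reporting after each result."""
    items = list(items)
    results = []
    with pool_factory(nproc) as pool:
        # imap_unordered yields each result as soon as its worker finishes,
        # so the counter advances even when an early item is slow.
        for done, result in enumerate(pool.imap_unordered(func, items), start=1):
            results.append(result)
            report(f"{done} of {len(items)} files finished")
    return results
```

Because the input order is lost, func should return whatever is needed to identify the item (for example the file name together with the status). With the default Pool, func must be a picklable module-level function and the call should sit under an if __name__ == "__main__": guard.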
References
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
https://tqdm.github.io/

Python: Pre-loading memory

I have a python program where I need to load and de-serialize a 1GB pickle file. It takes a good 20 seconds and I would like to have a mechanism whereby the content of the pickle is readily available for use. I've looked at shared_memory but all the examples of its use seem to involve numpy and my project doesn't use numpy. What is the easiest and cleanest way to achieve this using shared_memory or otherwise?
This is how I'm loading the data now (on every run):
def load_pickle(pickle_name):
    return pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
I would like to be able to edit the simulation code in between runs without having to reload the pickle. I've been messing around with importlib.reload but it really doesn't seem to work well for a large Python program with many files:
def main():
    data_manager.load_data()
    run_simulation()
    while True:
        try:
            importlib.reload(simulation)
            run_simulation()
        except:
            print(traceback.format_exc())
        print('Press enter to re-run main.py, CTRL-C to exit')
        sys.stdin.readline()

This could be an XY problem, the source of which is the assumption that you must use pickles at all; they're just awful to deal with due to how they manage dependencies, and are fundamentally a poor choice for any long-term data storage because of it.
The source financial data is almost-certainly in some tabular form to begin with, so it may be possible to request it in a friendlier format
A simple middleware to deserialize and reserialize the pickles in the meantime will smooth the transition
input -> load pickle -> write -> output
Converting your workflow to use Parquet or Feather which are designed to be efficient to read and write will almost-certainly make a considerable difference to your load speed
Further relevant links
Answer to How to reversibly store and load a Pandas dataframe to/from disk
What are the pros and cons of parquet format compared to other formats?
You may also be able to achieve this with hickle, which internally uses an HDF5 format, ideally making it significantly faster than pickle while still behaving like one

An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.
sudo mkdir -p /mnt/pickle
sudo mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl /mnt/pickle/pickle.pkl
Then you can access the pickle at /mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk read is not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try turning down the size of the ramdisk (I set it at 1536 MB, or 1.5 GB).

You can use shareable list:
You will have one Python program running that loads the file and keeps it in shared memory, and another Python program that reads it from there. Whatever your data is, you can put it in a dictionary, dump it as JSON (provided it is JSON-serializable), and then reload the JSON in the second program.
So
Program1
import pickle
import json
from multiprocessing.managers import SharedMemoryManager
YOUR_DATA=pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict={'DATA':YOUR_DATA}
data_dict_json=json.dumps(data_dict)
smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha','beta',data_dict_json])
print (sl)
#smm.shutdown() commenting shutdown now but you will need to do it eventually
The output will look like this
#OUTPUT
>>>ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')
Now in Program2:
from multiprocessing import shared_memory
load_from_mem=shared_memory.ShareableList(name='psm_12abcd')
load_from_mem[1]
#OUTPUT
'beta'
load_from_mem[2]
#OUTPUT
yourdataindictionaryformat
You can look for more over here
https://docs.python.org/3/library/multiprocessing.shared_memory.html

Adding another assumption-challenging answer: where you're reading your files from could make a big difference
1 GB is not a great amount of data with today's systems; at 20 seconds to load, that's only 50 MB/s, which is a fraction of what even the slowest disks provide
You may find you actually have a slow disk or some type of network share as your real bottleneck, and that changing to a faster storage medium or compressing the data (perhaps with gzip) makes a great difference to reading and writing
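One way to check which part is slow is to time the raw file read separately from the unpickling. This is only a sketch; the function name time_pickle_load is made up for illustration.

```python
import pickle
import time


def time_pickle_load(path):
    """Return (object, seconds spent reading, seconds spent unpickling)."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        raw = handle.read()          # pure I/O: disk or network share
    read_seconds = time.perf_counter() - start

    start = time.perf_counter()
    obj = pickle.loads(raw)          # pure CPU: rebuilding Python objects
    load_seconds = time.perf_counter() - start
    return obj, read_seconds, load_seconds
```

If most of the time is in the read, faster storage or a ramdisk should help; if most of it is in the unpickling, they will not. Keep in mind that a second run may read from the operating system's file cache and so look much faster than the first.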

Here are my assumptions while writing this answer:
Your Financial data is being produced after complex operations and you want the result to persist in memory
The code that consumes must be able to access that data fast
You wish to use shared memory
Here are the codes (self-explanatory, I believe)
Data structure (class_defs.py)
'''
Nested class definitions to simulate complex data
'''
class A:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def get_attr(self):
        return self.name, self.value
    def set_attr(self, n, v):
        self.name = n
        self.value = v
class B(A):
    def __init__(self, name, value, status):
        super(B, self).__init__(name, value)
        self.status = status
    def set_attr(self, n, v, s):
        A.set_attr(self, n,v)
        self.status = s
    def get_attr(self):
        print('\nName : {}\nValue : {}\nStatus : {}'.format(self.name, self.value, self.status))
Producer.py
from multiprocessing import shared_memory as sm
import time
import pickle as pkl
import pickletools as ptool
import sys
from class_defs import B
def main():
    # Data Creation/Processing
    obj1 = B('Sam Reagon', '2703', 'Active')
    #print(sys.getsizeof(obj1))
    obj1.set_attr('Ronald Reagon', '1023', 'INACTIVE')
    obj1.get_attr()
    ###### real deal #########
    # Create pickle string
    byte_str = pkl.dumps(obj=obj1, protocol=pkl.HIGHEST_PROTOCOL, buffer_callback=None)
    # compress the pickle
    #byte_str_opt = ptool.optimize(byte_str)
    byte_str_opt = bytearray(byte_str)
    # place data on shared memory buffer
    shm_a = sm.SharedMemory(name='datashare', create=True, size=len(byte_str_opt))#sys.getsizeof(obj1))
    buffer = shm_a.buf
    buffer[:] = byte_str_opt[:]
    #print(shm_a.name) # the string to access the shared memory
    #print(len(shm_a.buf[:]))
    # Just an infinite loop to keep the producer running, like a server
    # a better approach would be to explore use of shared memory manager
    while(True):
        time.sleep(60)
if __name__ == '__main__':
    main()
Consumer.py
from multiprocessing import shared_memory as sm
import pickle as pkl
from class_defs import B # we need this so that while unpickling, the object structure is understood
def main():
    shm_b = sm.SharedMemory(name='datashare')
    byte_str = bytes(shm_b.buf[:]) # convert the shared_memory buffer to a bytes array
    obj = pkl.loads(data=byte_str) # un-pickle the bytes array (as a data source)
    print(obj.name, obj.value, obj.status) # get the values of the object attributes
if __name__ == '__main__':
    main()
Run Producer.py in one terminal and Consumer.py in another terminal on the same machine. Because the producer creates the shared memory block under the fixed name 'datashare', the consumer can attach to it by that name; if you let Python generate the name instead (printing shm_a.name gives an identifier such as wnsm_86cd09d4), enter that string in Consumer.py.
I hope this is what you wanted!
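The producer above never releases the block. Below is a sketch of the same round trip with explicit cleanup; the helper names publish, fetch and retire are hypothetical and not part of the answer's code.

```python
import pickle
from multiprocessing import shared_memory


def publish(obj, name):
    """Pickle obj into a new named shared memory block and return the block."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    block = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    block.buf[:len(payload)] = payload
    return block


def fetch(name):
    """Attach to an existing block by name and unpickle its contents."""
    block = shared_memory.SharedMemory(name=name)
    try:
        # The block may be larger than the payload (sizes can be rounded up);
        # pickle stops at its own end marker and ignores the padding.
        return pickle.loads(bytes(block.buf))
    finally:
        block.close()


def retire(block):
    """Detach from the block and remove it from the system."""
    block.close()
    block.unlink()
```

The process that called publish would call retire when the data is no longer needed, for example from a signal handler or a finally clause around its wait loop. Note that fetch still pays the cost of unpickling on every call; shared memory only removes the disk read.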

You can take advantage of multiprocessing to run the simulations inside of subprocesses, and leverage the copy-on-write benefits of forking to unpickle/process the data only once at the start:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
def _run_simulation(_):
    # Wrapper for `run_simulation` that takes one argument. The function passed
    # into `multiprocessing.Pool.map` must take one argument.
    run_simulation()
with mp.Pool() as pool:
    pool.map(_run_simulation, range(num_simulations))
If you want to parameterize each simulation run, you can do so like so:
import multiprocessing
import pickle
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork')
# Load data once, in the parent process
data = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
with mp.Pool() as pool:
    simulations = ('arg for simulation run', 'arg for another simulation run')
    pool.map(run_simulation, simulations)
This way the run_simulation function will be passed the values from the simulations tuple, which allows each simulation to run with different parameters, or even just assigns each run an ID number or name for logging/saving purposes.
This whole approach relies on fork being available. For more information about using fork with Python's built-in multiprocessing library, see the docs about contexts and start methods. You may also want to consider using the forkserver multiprocessing context (by using mp = multiprocessing.get_context('forkserver')) for the reasons described in the docs.
If you don't want to run your simulations in parallel, this approach can be adapted for that. The key thing is that in order to only have to process the data once, you must call run_simulation within the process that processed the data, or one of its child processes.
If, for instance, you wanted to edit what run_simulation does, and then run it again at your command, you could do it with code resembling this:
main.py:
import multiprocessing
from multiprocessing.connection import Connection
import pickle
from data import load_data
# Load/process data in the parent process
load_data()
# Now child processes can access the data nearly instantaneously
# Need to use forking to get copy-on-write benefits!
mp = multiprocessing.get_context('fork') # Consider using 'forkserver' instead
# This is only ever run in child processes
def load_and_run_simulation(result_pipe: Connection) -> None:
    # Import `run_simulation` here to allow it to change between runs
    from simulation import run_simulation
    # Ensure that simulation has not been imported in the parent process, as if
    # so, it will be available in the child process just like the data!
    try:
        run_simulation()
    except Exception as ex:
        # Send the exception to the parent process
        result_pipe.send(ex)
    else:
        # Send this because the parent is waiting for a response
        result_pipe.send(None)
def run_simulation_in_child_process() -> None:
    result_pipe_output, result_pipe_input = mp.Pipe(duplex=False)
    proc = mp.Process(
        target=load_and_run_simulation,
        args=(result_pipe_input,)
    )
    print('Starting simulation')
    proc.start()
    try:
        # The `recv` below will wait until the child process sends something, or
        # will raise `EOFError` if the child process crashes suddenly without
        # sending an exception (e.g. if a segfault occurs)
        result = result_pipe_output.recv()
        if isinstance(result, Exception):
            raise result # raise exceptions from the child process
        proc.join()
    except KeyboardInterrupt:
        print("Caught 'KeyboardInterrupt'; terminating simulation")
        proc.terminate()
    print('Simulation finished')
if __name__ == '__main__':
    while True:
        choice = input('\n'.join((
            'What would you like to do?',
            '1) Run simulation',
            '2) Exit\n',
        )))
        if choice.strip() == '1':
            run_simulation_in_child_process()
        elif choice.strip() == '2':
            exit()
        else:
            print(f'Invalid option: {choice!r}')
data.py:
from functools import lru_cache
import pickle
# <obtain 'DATA_ROOT' and 'pickle_name' here>
# Cache the result so that the pickle is only read on the first call
@lru_cache(maxsize=None)
def load_data():
    with open(DATA_ROOT + pickle_name, 'rb') as f:
        return pickle.load(f)
simulation.py:
from data import load_data
# This call will complete almost instantaneously if `main.py` has been run
data = load_data()
def run_simulation():
    # Run the simulation using the data, which will already be loaded if this
    # is run from `main.py`.
    # Anything printed here will appear in the output of the parent process.
    # Exceptions raised here will be caught/handled by the parent process.
    ...
The three files detailed above should all be within the same directory, alongside an __init__.py file that can be empty. The main.py file can be renamed to whatever you'd like, and is the primary entry-point for this program. You can run simulation.py directly, but that will result in a long time spent loading/processing the data, which was the problem you ran into initially. While main.py is running, the file simulation.py can be edited, as it is reloaded every time you run the simulation from main.py.
For macOS users: forking on macOS can be a bit buggy, which is why Python defaults to using the spawn method for multiprocessing on macOS, but still supports fork and forkserver for it. If you're running into crashes or multiprocessing-related issues, try adding OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES to your environment. See https://stackoverflow.com/a/52230415/5946921 for more details.

As I understood:
something needs to be loaded
it needs to be loaded often, because the file with the code that uses this something is edited often
you don't want to wait for it to load every time
Maybe this solution will be okay for you.
You can write a script loader file like this (tested on Python 3.8):
import importlib.util, traceback, sys, gc
# Example data
import pickle
something = pickle.loads(pickle.dumps([123]))
if __name__ == '__main__':
    try:
        mod_path = sys.argv[1]
    except IndexError:
        print('Usage: python3', sys.argv[0], 'PATH_TO_SCRIPT')
        exit(1)
    modules_before = list(sys.modules.keys())
    argv = sys.argv[1:]
    while True:
        MOD_NAME = '__main__'
        spec = importlib.util.spec_from_file_location(MOD_NAME, mod_path)
        mod = importlib.util.module_from_spec(spec)
        # Change to needed global name in the target module
        mod.something = something
        sys.modules[MOD_NAME] = mod
        sys.argv = argv
        try:
            spec.loader.exec_module(mod)
        except:
            traceback.print_exc()
        del mod, spec
        modules_after = list(sys.modules.keys())
        for k in modules_after:
            if k not in modules_before:
                del sys.modules[k]
        gc.collect()
        print('Press enter to re-run, CTRL-C to exit')
        sys.stdin.readline()
Example of module:
# Change 1 to some different number when first script is running and press enter
something[0] += 1
print(something)
Should work. And should reduce the reload time of pickle close to zero 🌝
UPD
Add a possibility to accept script name with command line arguments

This is not an exact answer to the question, as the question reads as if pickle and SHM are required, but others went off that path, so I am going to share a trick of mine. It might help you. There are some fine solutions here using pickle and SHM anyway; on that front I could only offer more of the same. Same pasta with slight sauce modifications.
Two tricks I employ when dealing with your situations are as follows.
First is to use sqlite3 instead of pickle. You can even easily develop a module for a drop-in replacement using sqlite. The nice thing is that data will be inserted and selected using native Python types, and you can define your own with converter and adapter functions that use the serialization method of your choice to store complex objects. It can be pickle or JSON or whatever.
What I do is define a class with data passed in through *args and/or **kwargs of a constructor. It represents whatever object model I need; then I pick up rows from "select * from table;" of my database and let Python unwrap the data during the new object's initialization. Loading a big amount of data with datatype conversions, even custom ones, is surprisingly fast. sqlite will manage buffering and I/O for you and do it faster than pickle. The trick is to construct your object so that it is filled and initialized as fast as possible. I either subclass dict() or use slots to speed things up.
sqlite3 comes with Python so that's a bonus too.
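A minimal sketch of the adapter/converter idea, assuming a made-up Quote record with a list column stored as a pickled blob. The class, the table layout and the declared column type "pylist" are all hypothetical; only the sqlite3 calls themselves are standard library API.

```python
import pickle
import sqlite3


class Quote:
    """Hypothetical record type standing in for one piece of the real data."""
    __slots__ = ("symbol", "prices")

    def __init__(self, symbol, prices):
        self.symbol = symbol
        self.prices = prices


# Adapter: Python value -> something SQLite can store (here, pickled bytes).
# Note that this applies to every list passed as a query parameter.
sqlite3.register_adapter(list, pickle.dumps)
# Converter: bytes from a column declared as "pylist" -> Python value.
sqlite3.register_converter("pylist", pickle.loads)


def save_quotes(conn, quotes):
    conn.execute("CREATE TABLE IF NOT EXISTS quotes (symbol TEXT, prices pylist)")
    conn.executemany(
        "INSERT INTO quotes VALUES (?, ?)",
        [(q.symbol, q.prices) for q in quotes],
    )
    conn.commit()


def load_quotes(conn):
    # Each row unpacks straight into the constructor.
    return [Quote(*row) for row in conn.execute("SELECT symbol, prices FROM quotes")]
```

The connection has to be opened with detect_types=sqlite3.PARSE_DECLTYPES, for example sqlite3.connect("data.db", detect_types=sqlite3.PARSE_DECLTYPES), otherwise the converter is never applied and the column comes back as raw bytes.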
The other method of mine is to use a ZIP file and struct module.
You construct a ZIP file with multiple files within. E.g. for a pronunciation dictionary with more than 400000 words I'd like a dict() object. So I use one file, let say, lengths.dat in which I define a length of a key and a length of a value for each pair in binary format. Then I have a one file of words and one file of pronunciations all one after the other.
When I load from file, I read the lengths and use them to construct a dict() of words with their pronunciations from two other files. Indexing bytes() is fast, so, creating such a dictionary is very fast. You can even have it compressed if diskspace is a concern, but some speed loss is introduced then.
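Roughly, the ZIP layout described above could look like the sketch below. Only lengths.dat is named in the answer; the other member names, the function names, the UTF-8 encoding and the little-endian 32-bit length format are assumptions.

```python
import struct
import zipfile

PAIR = struct.Struct("<II")  # (key length, value length), little-endian uint32


def save_dict(path, mapping):
    lengths, keys, values = bytearray(), bytearray(), bytearray()
    for key, value in mapping.items():
        k, v = key.encode("utf-8"), value.encode("utf-8")
        lengths += PAIR.pack(len(k), len(v))
        keys += k
        values += v
    # ZIP_STORED keeps the members uncompressed; ZIP_DEFLATED trades speed for size.
    with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("lengths.dat", bytes(lengths))
        archive.writestr("words.dat", bytes(keys))
        archive.writestr("pronunciations.dat", bytes(values))


def load_dict(path):
    with zipfile.ZipFile(path) as archive:
        lengths = archive.read("lengths.dat")
        keys = archive.read("words.dat")
        values = archive.read("pronunciations.dat")
    result, k_pos, v_pos = {}, 0, 0
    # Walk the length pairs and slice each key and value out of the two blobs.
    for k_len, v_len in PAIR.iter_unpack(lengths):
        key = keys[k_pos:k_pos + k_len].decode("utf-8")
        result[key] = values[v_pos:v_pos + v_len].decode("utf-8")
        k_pos += k_len
        v_pos += v_len
    return result
```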
Both methods will take less place on a disk than the pickle would.
The second method requires you to read all the data you need into RAM and then construct the objects, which takes almost double the RAM that the data took; after that you can discard the raw data, of course. But altogether it shouldn't require more than the pickle takes. As for RAM, the OS will manage almost anything using virtual memory/swap if needed.
Oh, yeah, there is a third trick I use. When I have a ZIP file constructed as mentioned above, or anything else which requires additional deserialization while constructing an object, and the number of such objects is large, I introduce a lazy load. That is, let's say we have a big file with serialized objects in it. You make the program load all the data and distribute it per object, which you keep in a list() or dict().
You write your classes in such a way that when the object is first asked for data it unpacks its raw data, deserializes it and whatnot, removes the raw data from RAM, and then returns your result. So you do not pay the loading time until you actually need the data in question, which is much less noticeable for a user than a process taking 20 seconds to start.
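A sketch of such a lazily deserialized object, assuming the raw form is a pickle (any other serialization would work the same way). The class name LazyRecord is hypothetical.

```python
import pickle


class LazyRecord:
    """Holds serialized bytes and deserializes them on first access only."""
    __slots__ = ("_raw", "_value")

    def __init__(self, raw):
        self._raw = raw
        self._value = None

    @property
    def value(self):
        if self._raw is not None:
            self._value = pickle.loads(self._raw)  # pay the cost once
            self._raw = None                       # free the raw bytes
        return self._value
```

Creating thousands of these is cheap because the constructor only stores a reference to the bytes; the deserialization cost is spread over the first access to each record.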

I implemented the python-preloaded script, which can help you here. It will store the CPython state at an early stage after some modules are loaded, and then when you need it, you can restore from this state and load your normal Python script. Storing currently means that it will stay in memory, and restoring means that it does a fork on it, which is very fast. But these are implementation details of python-preloaded and should not matter to you.
So, to make it work for your use case:
Make a new module, data_preloaded.py or so, and in there, just this code:
preloaded_data = load_pickle(...)
Now run py-preloaded-bundle-fork-server.py data_preloaded -o python-data-preloaded.bin. This will create python-data-preloaded.bin, which can be used as a replacement for python.
I assume you have started python your_script.py before. So now run ./python-data-preloaded.bin your_script.py. Or also just python-data-preloaded.bin (no args). The first time, this will still be slow, i.e. take about 20 seconds. But now it is in memory.
Now run ./python-data-preloaded.bin your_script.py again. Now it should be extremely fast, i.e. a few milliseconds. And you can start it again and again and it will always be fast, until you restart your computer.

Python raspberry Pi timelapse memory leak?

I am using my raspberry Pi3 to create timelapse videos. I have a cron that runs a python script every minute that decides how many photos to take and then imports a function from another python script that takes the actual photos. The problem is that after running for about 4 hours the camera stops taking photos- if I try and take one manually it says it is out of memory, and top confirms this. If I watch top while the timelapse is running the memory usage steadily climbs.
I think I have narrowed the problem down to the python script that takes the photos. I can run this on its own, and if I start up the pi and run it a few times I see that the memory used climbs by about 10MB the first run and about 1MB every subsequent run (screenshot at the bottom of the post). This is the script
import time
import picamera
import os
def ShutterTS(dirname):
    with picamera.PiCamera() as cam:
        cam.resolution=(1920,1440)
        cam.rotation=180
        cam.hflip=True
        # camera warm up time
        time.sleep(2)
        FNfmt = "%4d%02d%02d_%02d:%02d:%02d.JPG"
        Fname = FNfmt % time.localtime()[0:6]
        framename = os.path.join(dirname, Fname)
        cam.capture(framename)
    return
def main():
    dirname = [insert path here, my path hidden]
    ShutterTS(dirname)
    return
if __name__ == '__main__':
    import sys
    sys.exit(main())
I'm not a good coder, I basically cobble stuff together from bits I find on the internet so I'm hoping this is something really simple that I've missed. The with is the raspberry pi recommended way of calling the camera. I know this should close the camera instance on exit but I'm guessing something is hanging around in memory? I tried adding close.cam() at the end of the function and it made no difference (didn't think it would). I've tried del on all the variables at the end of the function and it made no difference. I think the return at the end of the function is redundant but adding it made no difference.
This website https://www.linuxatemyram.com/ suggests that top showing the memory climbing is normal and free -m is a better gauge, and that shows plenty available- but the fact remains the camera stops working, saying it is out of memory. Any clues would be much appreciated!
This is the cron script (some other imports cropped)
from ShutterTimestamp import ShutterTS
from makedirectory import testmakedir
from SunTimesA import gettimes
def Timer(dirname,FRAMES_PER_MINUTE):
    # I take a picture first and then loop so the program isn't
    # sleeping pointlessly to the end of the minute
    start = time.time()
    ShutterTS(dirname)
    if FRAMES_PER_MINUTE>1:
        for frame in range(FRAMES_PER_MINUTE-1):
            time.sleep(int(60 / FRAMES_PER_MINUTE) - (time.time() - start))
            start = time.time()
            ShutterTS(dirname)
    return
def main():
    dirfmt = []
    dirname = dirfmt % time.localtime()[0:3]
    FPM=gettimes()
    if FPM > 0:
        testmakedir(dirname)
        Timer(dirname,FPM)
    return
if __name__ == '__main__':
    sys.exit(main())
Screenshot of memory use

I suppose you have a wrapping Python script which imports the script you provide in the question and calls ShutterTS in a loop. This function does not return any output to the main script (just return).
If you can observe a memory leak, it is probably located in the picamera module.
A workaround is to call this script as a subprocess, not as a function call in the main process. It can be done in a shell script or in the Python script using the subprocess module.
Thus the memory will be released after each capture.
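A sketch of the subprocess variant. It assumes the capture script is changed to read the target directory from sys.argv[1], which the script in the question does not do as posted; the function name capture_in_subprocess and the timeout value are also made up for illustration.

```python
import subprocess
import sys


def capture_in_subprocess(script_path, dirname, timeout=60):
    """Run the capture script in a fresh interpreter and wait for it to exit."""
    completed = subprocess.run(
        [sys.executable, script_path, dirname],
        timeout=timeout,
    )
    # Whatever the child allocated is returned to the OS when it exits.
    return completed.returncode
```

The timer loop would then call capture_in_subprocess(...) where it currently calls ShutterTS(dirname) directly. Each capture pays the start-up cost of a new interpreter, which should be small next to the two-second camera warm-up.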

Keeping Python Variables between Script Calls

I have a python script, that needs to load a large file from disk to a variable. This takes a while. The script will be called many times from another application (still unknown), with different options and the stdout will be used. Is there any possibility to avoid reading the large file for each single call of the script?
I guess i could have one large script running in the background that holds the variable. But then, how can I call the script with different options and read the stdout from another application?

Make it a (web) microservice: formalize all different CLI arguments as HTTP endpoints and send requests to it from main application.
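A rough sketch of what that could look like with only the standard library. The data is assumed to be a dict that has already been loaded, a single key query parameter stands in for the command-line options, and the function name make_server is hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def make_server(data, port=0):
    """Build an HTTP server that answers queries against already-loaded data."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Query parameters play the role of the command-line options.
            query = parse_qs(urlparse(self.path).query)
            key = query.get("key", [""])[0]
            body = json.dumps({"key": key, "value": data.get(key)}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the console quiet

    # Port 0 lets the OS pick a free port; see server.server_address.
    return HTTPServer(("127.0.0.1", port), Handler)
```

The long-running process loads the large file once, calls make_server(loaded_data, port=8000).serve_forever(), and the other application sends a GET request per call instead of starting the script and reading its stdout.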

(I misunderstood the original question, but the first answer I wrote has a different solution, which might be useful to someone fitting that scenario, so I am keeping that one as is and proposing a second solution.)
For a single machine, OS-provided pipes are the best solution for what you are looking for.
Essentially, you create a forever-running Python process which reads from a pipe, processes the commands entering the pipe, and then prints to stdout.
Reference: http://kblin.blogspot.com/2012/05/playing-with-posix-pipes-in-python.html
From above mentioned source
Workload
In order to simulate my workload, I came up with the following simple script called pipetest.py that takes an output file name and then writes some text into that file.
#!/usr/bin/env python
import sys
def main():
    pipename = sys.argv[1]
    with open(pipename, 'w') as p:
        p.write("Ceci n'est pas une pipe!\n")
if __name__ == "__main__":
    main()
The Code
In my test, this "file" will be a FIFO created by my wrapper code. The implementation of the wrapper code is as follows, I will go over the code in detail further down this post:
#!/usr/bin/env python
import tempfile
import os
from os import path
import shutil
import subprocess
class TemporaryPipe(object):
    def __init__(self, pipename="pipe"):
        self.pipename = pipename
        self.tempdir = None
    def __enter__(self):
        self.tempdir = tempfile.mkdtemp()
        pipe_path = path.join(self.tempdir, self.pipename)
        os.mkfifo(pipe_path)
        return pipe_path
    def __exit__(self, type, value, traceback):
        if self.tempdir is not None:
            shutil.rmtree(self.tempdir)
def call_helper():
    with TemporaryPipe() as p:
        script = "./pipetest.py"
        subprocess.Popen(script + " " + p, shell=True)
        with open(p, 'r') as r:
            text = r.read()
        return text.strip()
def main():
    call_helper()
if __name__ == "__main__":
    main()

Since you already can read the data into a variable, then you might consider memory mapping the file using mmap. This is safe if multiple processes are only reading it - to support a writer would require a locking protocol.
Assuming you are not familiar with memory mapped objects, I'll wager you use them every day - this is how the operating system loads and maintains executable files. Essentially your file becomes part of the paging system - although it does not have to be in any special format.
When you read a file into memory it is unlikely that all of it stays in RAM; it will be paged out when "real" RAM becomes over-subscribed. Often this paging is a considerable overhead. A memory mapped file is just your data "ready paged". There is no overhead in reading into memory (virtual memory, that is); it is there as soon as you map it.
When you try to access the data a page fault occurs and a subset (page) is loaded into RAM - all done by the operating system, the programmer is unaware of this.
While a file remains mapped it is connected to the paging system. Another process mapping the same file will access the same object, provided changes have not been made (See MAP_SHARED).
It needs a daemon to keep the memory mapped object current in kernel, but other than creating the object linked to the physical file, it does not need to do anything else - it can sleep or wait on a shutdown signal.
Other processes open the file (use os.open()) and map the object.
See the examples in the documentation, here and also Giving access to shared memory after child processes have already started
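As a rough illustration of the reader side described above, here is a minimal sketch that opens a file with os.open() and maps it read-only. The helper name read_slice_via_mmap and the demo file are assumptions made for the example. Passing a length of 0 to mmap.mmap maps the whole file.

```python
import mmap
import os
import tempfile

def read_slice_via_mmap(path, start, length):
    """Map `path` read-only and return `length` bytes starting at `start`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the entire file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing touches only the pages that back this range.
            return mapped[start:start + length]
    finally:
        os.close(fd)

# Demo file standing in for the expensive shared data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello shared world")
    demo_path = tmp.name

chunk = read_slice_via_mmap(demo_path, 6, 6)
os.remove(demo_path)
```

Each reader process would call something like read_slice_via_mmap on the same path; the operating system is then free to back all of those mappings with the same physical pages.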

You can store the processed values in a file, and then read the values from that file in another script.
>>> import pickle as p
>>> mystr="foobar"
>>> p.dump(mystr,open('/tmp/t.txt','wb'))
>>> mystr2=p.load(open('/tmp/t.txt','rb'))
>>> mystr2
'foobar'

python running coverage on never ending process

I have a multi-process web server whose processes never end, and I would like to check my code coverage on the whole project in a live environment (not only from tests).
The problem is that, since the processes never end, I don't have a good place to set the cov.start(), cov.stop() and cov.save() hooks.
Therefore, I thought about spawning a thread that, in an infinite loop, saves and combines the coverage data and then sleeps for some time. However, this approach doesn't work: the coverage report seems to be empty, except for the sleep line.
I would be happy to receive any ideas about how to get the coverage of my code,
or any advice about why my idea doesn't work. Here is a snippet of my code:
import coverage
cov = coverage.Coverage()
import time
import threading
import os
class CoverageThread(threading.Thread):
    _kill_now = False
    _sleep_time = 2
    @classmethod
    def exit_gracefully(cls):
        cls._kill_now = True
    def sleep_some_time(self):
        time.sleep(CoverageThread._sleep_time)
    def run(self):
        while True:
            cov.start()
            self.sleep_some_time()
            cov.stop()
            if os.path.exists('.coverage'):
                cov.combine()
            cov.save()
            if self._kill_now:
                break
        cov.stop()
        if os.path.exists('.coverage'):
            cov.combine()
        cov.save()
        cov.html_report(directory="coverage_report_data.html")
        print("End of the program. I was killed gracefully :)")

Apparently, it is not possible to control coverage very well with multiple threads.
Once different threads are started, stopping the Coverage object stops coverage in all threads, and start only restarts it in the thread that calls it.
So your code basically stops the coverage after 2 seconds for every thread other than the CoverageThread.
I played a bit with the API, and it is possible to access the measurements without stopping the Coverage object.
So you could launch a thread that saves the coverage data periodically, using the API.
A first implementation would be something like this:
# Note: CoverageDataFiles and cov.collector are coverage.py 4.x internals.
import os
import threading
from time import sleep
from coverage import Coverage
from coverage.data import CoverageData, CoverageDataFiles
from coverage.files import abs_file
cov = Coverage(config_file=True)
cov.start()
def get_data_dict(d):
    """Return a dict like d, but with keys modified by `abs_file` and
    remove the copied elements from d.
    """
    res = {}
    keys = list(d.keys())
    for k in keys:
        a = {}
        lines = list(d[k].keys())
        for l in lines:
            v = d[k].pop(l)
            a[l] = v
        res[abs_file(k)] = a
    return res
class CoverageLoggerThread(threading.Thread):
    _kill_now = False
    _delay = 2
    def __init__(self, main=True):
        self.main = main
        self._data = CoverageData()
        self._fname = cov.config.data_file
        self._suffix = None
        self._data_files = CoverageDataFiles(basename=self._fname,
                                             warn=cov._warn)
        self._pid = os.getpid()
        super(CoverageLoggerThread, self).__init__()
    def shutdown(self):
        self._kill_now = True
    def combine(self):
        aliases = None
        if cov.config.paths:
            from coverage.aliases import PathAliases
            aliases = PathAliases()
            # The thread has no config attribute of its own; use the global cov.
            for paths in cov.config.paths.values():
                result = paths[0]
                for pattern in paths[1:]:
                    aliases.add(pattern, result)
        self._data_files.combine_parallel_data(self._data, aliases=aliases)
    def export(self, new=True):
        cov_report = cov
        if new:
            cov_report = Coverage(config_file=True)
            cov_report.load()
        self.combine()
        self._data_files.write(self._data)
        cov_report.data.update(self._data)
        cov_report.html_report(directory="coverage_report_data.html")
        cov_report.report(show_missing=True)
    def _collect_and_export(self):
        # Drain what the collector has gathered so far without stopping it.
        new_data = get_data_dict(cov.collector.data)
        if cov.collector.branch:
            self._data.add_arcs(new_data)
        else:
            self._data.add_lines(new_data)
        self._data.add_file_tracers(get_data_dict(cov.collector.file_tracers))
        self._data_files.write(self._data, self._suffix)
        if self.main:
            self.export()
    def run(self):
        while True:
            sleep(CoverageLoggerThread._delay)
            if self._kill_now:
                break
            self._collect_and_export()
        cov.stop()
        if not self.main:
            self._collect_and_export()
            return
        self.export(new=False)
        print("End of the program. I was killed gracefully :)")
A more stable version can be found in this GIST.
This code basically grabs the info collected by the collector without stopping it.
The get_data_dict function takes the dictionary in Coverage.collector and pops the available data. This should be safe enough that you don't lose any measurements.
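To make the "pop the available data" idea concrete, here is a small stand-alone sketch using plain dictionaries. It leaves out the abs_file key rewriting, and the name drain_nested and the sample data are made up for illustration. Anything recorded after the drain stays in the live dictionary for the next pass.

```python
def drain_nested(d):
    """Move every inner entry of d into a new dict, leaving the inner dicts empty."""
    res = {}
    # Copy the key lists first so the dicts can change while we work on them.
    for key in list(d.keys()):
        inner = d[key]
        moved = {}
        for line in list(inner.keys()):
            # pop() removes the entry from the live dict as it is copied.
            moved[line] = inner.pop(line)
        res[key] = moved
    return res

# Stand-in for the collector's data: file name -> {line number: None}.
live = {"app.py": {1: None, 2: None}}
snapshot = drain_nested(live)
# A line recorded after the drain is kept for the next collection pass.
live["app.py"][3] = None
```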
The report files get updated every _delay seconds.
But if you have multiple processes running, you need to make an extra effort to ensure every process runs the CoverageLoggerThread. This is the job of the patch_multiprocessing function, monkey patched from the coverage monkey patch...
The code is in the GIST. It basically replaces the original Process with a custom process, which starts the CoverageLoggerThread just before running the run method and joins the thread at the end of the process.
The script main.py lets you launch different tests with threads and processes.
There are three drawbacks to this code that you need to be careful of:
It is a bad idea to use the combine function concurrently, as it performs concurrent read/write/delete access to the .coverage.* files. This means that the export function is not super safe. It should be all right, as the data is replicated multiple times, but I would do some testing before using it in production.
Once the data has been exported, it stays in memory. So if the code base is huge, it could eat some resources. It is possible to dump all the data and reload it, but I assumed that if you want to log every 2 seconds, you do not want to reload all the data every time. If you go with a delay in minutes, I would create a new _data every time, using CoverageData.read_file to reload the previous state of the coverage for this process.
The custom process will wait for _delay before finishing, as we join the CoverageLoggerThread at the end of the process. So if you have a lot of quick processes, you want to make the sleep more fine-grained so the end of the Process is detected more quickly. It just needs a custom sleep loop that breaks on _kill_now.
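One way to get a sleep that ends as soon as shutdown is requested is sketched below. It replaces the _kill_now flag with a threading.Event, which is a substitution on my part rather than what the gist does; the class name StoppableWorker and the ticks counter are hypothetical stand-ins for the logger thread and its periodic export.

```python
import threading

class StoppableWorker(threading.Thread):
    """Thread that does periodic work but wakes up immediately on shutdown."""

    def __init__(self, delay=2.0):
        super().__init__()
        self._delay = delay
        self._stop_event = threading.Event()
        self.ticks = 0

    def shutdown(self):
        # Wakes the thread even if it is in the middle of its wait.
        self._stop_event.set()

    def run(self):
        # Event.wait returns True once the event is set and False on timeout,
        # so the loop body runs once per elapsed delay until shutdown.
        while not self._stop_event.wait(self._delay):
            self.ticks += 1  # placeholder for the periodic collect/export step

worker = StoppableWorker(delay=60.0)
worker.start()
worker.shutdown()
worker.join(timeout=5.0)
```

Even with a 60 second delay, the join returns almost immediately, because the wait is interrupted by shutdown instead of running to completion.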
Let me know if this helps you in some way, or if it is possible to improve this gist.
EDIT:
It seems you do not need to monkey patch the multiprocessing module to start a logger automatically. Using a .pth file in your Python install, you can use an environment variable to start your logger automatically in new processes:
# Content of coverage.pth in your site-packages folder
import os
if "COVERAGE_LOGGER_START" in os.environ:
    import atexit
    from coverage_logger import CoverageLoggerThread
    thread_cov = CoverageLoggerThread(main=False)
    thread_cov.start()
    def close_cov():
        thread_cov.shutdown()
        thread_cov.join()
    atexit.register(close_cov)
You can then start your coverage logger with COVERAGE_LOGGER_START=1 python main.py
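One caveat worth checking: as far as I know, the site module only executes individual .pth lines that begin with import, so a multi-line block like the one above may need to live in a regular module that the .pth file pulls in with a single import line. Here is a minimal, self-contained sketch of that mechanism. The module name demo_cov_hook is hypothetical, and its STARTED flag stands in for actually starting the CoverageLoggerThread.

```python
import os
import site
import sys
import tempfile

# Hypothetical hook module: in real use it would start the logger thread
# instead of just recording whether the environment variable was set.
HOOK_SOURCE = (
    "import os\n"
    "STARTED = 'COVERAGE_LOGGER_START' in os.environ\n"
)

def install_demo_hook(directory):
    """Write the hook module plus a one-line .pth file that imports it."""
    with open(os.path.join(directory, "demo_cov_hook.py"), "w") as f:
        f.write(HOOK_SOURCE)
    with open(os.path.join(directory, "demo_cov_hook.pth"), "w") as f:
        f.write("import demo_cov_hook\n")

demo_dir = tempfile.mkdtemp()
install_demo_hook(demo_dir)
os.environ["COVERAGE_LOGGER_START"] = "1"
# addsitedir adds the directory to sys.path and processes its .pth files,
# much like site-packages is handled at interpreter startup.
site.addsitedir(demo_dir)
hook_started = sys.modules["demo_cov_hook"].STARTED
```

In a real install the two files would sit in site-packages, so the hook would run at the start of every new Python process, including multiprocessing children.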

Since you are willing to run your code differently for the test, why not add a way to end the process for the test? That seems like it will be simpler than trying to hack coverage.

You can use pyrasite directly, with the following two programs.
# start.py
import sys
import coverage
sys.cov = cov = coverage.coverage()
cov.start()
And this one
# stop.py
import sys
sys.cov.stop()
sys.cov.save()
sys.cov.html_report()
Another option would be to trace the program using lptrace; even though it only prints calls, it can be useful.
