matplotlib with multiprocessing *sometimes* changes figure format

I am using matplotlib to make many figures and save them. I have 5 or so functions that perform either simple or no computations with the data, plot the data, and then format the figure to fit a specific form (title, axes, paper size).
These 5 or so plotting functions get called one at a time from another function in between computations. (Some computations, plotting_function_1, some more computations, plotting_function_2, ...).
I start each plotting function in a new process via plotProcess1 = multiprocessing.Process(target=plot_data1, args=(data, save_directory)); plotProcess1.start() so that the computation can continue while the plotting functions are running.
When I check the figures after the program has finished, many of the figures have very strange formatting errors with the titles and paper size. The weird part is that the data is always plotted exactly as it should be (The scatter data may look like some is missing, but that is just part of the dataset). Take a look at the Good figures vs bad figures to see what I am talking about (Top left is expected output, others are the issue).
This only started when I started using multiprocessing to make the plots in the background. The weirdest part is that it does not always do it, and it seems to do it at random. Any thoughts as to why it might be doing this and how to fix it? I would really like to keep the computations running while I make the plots due to timing. With some datasets, a few hundred plots will be made with each plotting function and the entire program might take 10s of hours to complete.
Edit: My datasets are very large spatial datasets, so each one of my plotting functions creates and saves multiple plots (around 20, but could be less or more depending on the size of the dataset). I have figured out when the issue occurs now, but still not why. The strange behavior happens when two plotting functions are running at the same time.
A typical timeline where the strange behaviour happens is: plotting_function_1 is started --> some small computations happen --> plotting_function_2 is started --> plotting_function_1 finishes --> plotting_function_2 finishes.
This still doesn't make sense to me, because each plotting function runs in a separate process, does not change any data, and saves to a unique filename.
Edit 2: Here is a snippet of code that will create strangely formatted figures.
# Plot the raw data
if plot_TF is True:
    photon_data_copy = photon_data.copy()
    plot_segments_copy = plot_segments.copy()
    if isParallel is True:
        rawPlotProcess = multiprocessing.Process(target=plot_raw_data, args=(photon_data_copy, plot_segments_copy, run_name, plotdir))
        rawPlotProcess.start()
    else:
        plot_raw_data(photon_data_copy, plot_segments_copy, run_name, plotdir)
# Calculate signal slab and noise slab
start = time.time()
signal_mask, noise_mask = assign_slabs_by_histogram_max_bin(photon_data, pixel_dimensions, slab_thickness)
logger.info('Time elapsed assigning slabs: {}'.format(time.time() - start))
photon_signal = photon_data[signal_mask, :]
photon_noise = photon_data[noise_mask, :]
# Plot the Signal and Noise slabs
if plot_TF is True:
    photon_signal_copy = photon_signal.copy()
    photon_noise_copy = photon_noise.copy()
    if isParallel is True:
        slabPlotProcess = multiprocessing.Process(target=plot_slabs, args=(photon_signal_copy, photon_noise_copy, plot_segments_copy, run_name, plotdir))
        slabPlotProcess.start()
    else:
        plot_slabs(photon_signal_copy, photon_noise_copy, plot_segments_copy, run_name, plotdir)

Related

Dask how to scatter data when doing a reduction

I am using Dask for a complicated operation. First I do a reduction which produces a moderately sized df (a few MBs) which I then need to pass to each worker to calculate the final result so my code looks a bit like this
intermediate_result = ddf.reduction().compute()
final_result = ddf.reduction(
    chunk=function, chunk_kwargs={"intermediate_result": intermediate_result}
)
However I am getting the warning message that looks like this
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s)
I have tried doing this
intermediate_result = client.scatter(intermediate_result, broadcast=True)
But this isn't working as the function now sees this as a Future object and not the datatype it is supposed to be.
I can't seem to find any documentation on how to use scatter with reductions, does anyone know how to do this? Or should I just ignore the warning message and pass the moderately sized df as I am?

Actually, the best solution probably is not to scatter your materialised result, but to avoid computing it in the first place. You can simply remove the .compute(), which will mean all the calculation gets done in one stage, with the results automatically moved where you need them.
Alternatively, if you want to have a clear boundary between the stages, you can use
intermediate_result = ddf.reduction().persist()
which will kick off the reduction and store it on workers without pulling it to the client. You can choose to wait on this to finish before the next step or not.

python multiprocessing subprocess - high VIRT usage leads to memory error

I am using pool.map in multiprocessing to do my custom function,
def my_func(data): #This is just a dummy function.
    data = data.assign(new_col = data.apply(lambda x: f(x), axis = 1))
    return data
def main():
    mypool=pool.Pool(processes=16,maxtasksperchild=100)
    ret_list=mypool.map(my_func,(group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped data frame, so I am passing one data frame at a time to pool.map. I keep getting a memory error.
From what I can see, VIRT increases several-fold and leads to this error.
Two questions:
First, how do I solve this growing memory issue in VIRT? Maybe there is a way to play with the chunk size here?
Second, although it launches as many Python subprocesses as I specified in Pool(processes), the CPUs don't all reach 100%, so it doesn't seem to use all the processes. Are only one or two running at a time? Maybe that is because it applies the same chunk size to data frames of different sizes (some data frames will be small)? How do I utilise every CPU here?

For anyone looking for an answer in the future: I solved this by using imap instead of map. map converts the iterable into a list before dispatching any work, which is memory-intensive here because every group is materialized at once.
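To illustrate the difference, here is a minimal sketch, not the original poster's code. The names summarize, iter_groups and run, and the pool_cls parameter, are made up for this example; the groups are plain (name, values) tuples rather than pandas groups. The point is only that imap takes a generator without first turning it into a list and hands results back one at a time.

```python
from multiprocessing import Pool


def summarize(group):
    # Stand-in for the per-group work; `group` is a (name, values) pair here.
    name, values = group
    return name, sum(values)


def iter_groups(n_groups, size):
    # Generator: a group is only built when the pool pulls the next task.
    for g in range(n_groups):
        yield g, list(range(g, g + size))


def run(groups, pool_cls=Pool, processes=4):
    # pool_cls is a hypothetical knob so a thread-based pool can be swapped in.
    results = {}
    with pool_cls(processes) as pool:
        # imap does not call list() on `groups` and yields results in order,
        # one at a time, so they can be folded into `results` as they arrive.
        for name, total in pool.imap(summarize, groups, chunksize=1):
            results[name] = total
    return results
```

With process-based pools, call run(...) from under an if __name__ == "__main__": guard. If the order of results does not matter, imap_unordered is a drop-in alternative that yields whichever result finishes first.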

Interpreting cProfile results: total vs. cumulative time in small function

Most of my Python program is spent in a method called _build_userdbs. I'm using the awesome tool SnakeViz which helps interpreting the results. There's a screenshot below.
So right now, in that picture, I'm in _build_userdbs. The big green circle right outside of that is a method called _append_record and, as you can see, it takes up almost all of _build_userdbs. I understand this.
But here's the confusing part. The green circle outside of the inner green circle (which takes up the vast majority of the time) is the cumulative time of the _append_record minus the cumulative time of the functions called in _append_record.
Quantitatively, _append_record's cumulative time is 55970 seconds. That's the inner green circle. Its total time is 54210 seconds. That's the outer green circle.
_append_record, as you can see if you open that image in a new tab, calls a couple other functions. Those are:
json.dumps() (459 seconds cumulative)
_userdb_scratch_file_path() (161 seconds cumulative)
open (2160 seconds cumulative)
print (less than .1% of frame so not displayed)
Alright, this makes sense; because of the relatively small difference between _append_record's cumulative time and total time, it must be doing a lot of processing in its own stack frame, rather than delegating it to other functions. But here's the body of the function:
def _append_record(self, user, record):
    record = json.dumps(record)
    dest_path = self._userdb_scratch_file_path(user)
    with open(dest_path, 'at') as dest:
        print(record, file=dest)
So where is all of this processing going on? Is this function call overhead that accounts for the difference? Profiling overhead? Are these results just inaccurate? Why isn't the close() function called?
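For readers unsure about the two columns, here is a small sketch that makes the total vs. cumulative distinction concrete. The functions helper, outer and profile_times are invented for this example and have nothing to do with the program in the question. Total time (tottime) counts only time spent in a function's own frame; cumulative time (cumtime) also includes everything it calls.

```python
import cProfile
import pstats


def helper(n):
    # Work done here counts toward helper's tottime and outer's cumtime.
    return sum(i * i for i in range(n))


def outer(n):
    # This loop runs in outer's own frame, so it counts toward outer's tottime.
    total = 0
    for i in range(n):
        total += i
    # The call below is delegated work: it adds to outer's cumtime only.
    return total + helper(n)


def profile_times(func, *args):
    """Return {function name: (tottime, cumtime)} for one profiled call."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    return {key[2]: (val[2], val[3]) for key, val in stats.stats.items()}
```

Calling profile_times(outer, 200000) and comparing the two numbers for "outer" shows cumtime exceeding tottime by roughly the cumulative time of helper.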

Multiprocessing, pooling and randomness

I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running this simulation one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then arises one surprise: all the simulations done at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
class Economy(object):
    def __init__(self,i):
        self.run_number = i
        self.Statistics = Statistics()
        self.process()
def run_and_return(i):
    eco = Economy(i)
    return eco
collection = []
def get_result(x):
    collection.append(x)
if __name__ == '__main__':
    pool = Pool(processes=4)
    for i in range(NRUN):
        pool.apply_async(run_and_return, (i,), callback=get_result)
    pool.close()
    pool.join()
The process(i) is the function that goes through every step of the simulation, during i steps. Basically I simulate NRUN Economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output of this is exactly the same for the first 4 runs: during the same "wave" of simulations, I get the very same output. Once I get to the second wave, then I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post, do not hesitate to ask for more precisions!
All the best,

If you are on Linux then each pool process is made by forking the parent process. This means the process is literally duplicated - this includes the seed any random object may be using.
The random module selects the seed for its default functions on import. Meaning the seed has already been selected before you create the Pool.
To get around this you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique within a single run of your program, while the time ensures uniqueness across multiple runs in case the same process id is reused. Passing the process id and time as a string means that a digest of the string is used to seed the random number generator, so two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
def proc_init():
    random.seed(str(os.getpid()) + str(time.time()))
pool = Pool(num_procs, initializer=proc_init)
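A slightly fuller sketch of the same idea, with imports and the uuid variant mentioned above. The helper names make_seed, proc_init_uuid, draw and run are made up for illustration, and draw is only a stand-in for a simulation. As far as I know, Python 3.7 and later reseed the default random generator in forked children on their own, so treat the initializer as a way to make the seeding explicit rather than as something every setup strictly needs.

```python
import os
import random
import time
import uuid
from multiprocessing import Pool


def make_seed(pid, now):
    # String seeds are hashed by random.seed, so similar strings still
    # lead to very different generator states.
    return str(pid) + str(now)


def proc_init():
    # Runs once in each worker process, right after it starts.
    random.seed(make_seed(os.getpid(), time.time()))


def proc_init_uuid():
    # Alternative initializer: a random UUID as the seed string.
    random.seed(uuid.uuid4().hex)


def draw(_):
    # Stand-in for one simulation run that uses the random module.
    return random.random()


def run(n_runs, processes=4):
    # Each worker seeds itself before it receives any task.
    with Pool(processes, initializer=proc_init) as pool:
        return pool.map(draw, range(n_runs))
```

Call run(...) from under an if __name__ == '__main__': guard, as in the question's code. If runs need to be reproducible, a different choice is to derive the seed from the run number inside the task instead of from the pid and time.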

Reading and graphing data read from huge files

We have pretty large files, the order of 1-1.5 GB combined (mostly log files) with raw data that is easily parseable to a csv, which is subsequently supposed to be graphed to generate a set of graph images.
Currently, we are using bash scripts to turn the raw data into a csv file, with just the numbers that need to be graphed, and then feeding it into a gnuplot script. But this process is extremely slow. I tried to speed up the bash scripts by replacing some piped cuts, trs etc. with a single awk command. Although this improved the speed, the whole thing is still very slow.
So, I am starting to believe there are better tools for this process. I am currently looking to rewrite this process in python+numpy or R. A friend of mine suggested using the JVM, and if I am to do that, I will use clojure, but am not sure how the JVM will perform.
I don't have much experience in dealing with these kind of problems, so any advice on how to proceed would be great. Thanks.
Edit: Also, I will want to store (to disk) the generated intermediate data, i.e., the csv, so I don't have to re-generate it, should I choose I want a different looking graph.
Edit 2: The raw data files have one record per one line, whose fields are separated by a delimiter (|). Not all fields are numbers. Each field I need in the output csv is obtained by applying a certain formula on the input records, which may use multiple fields from the input data. The output csv will have 3-4 fields per line, and I need graphs that plot 1-2, 1-3, 1-4 fields in a (may be) bar chart. I hope that gives a better picture.
Edit 3: I have modified @adirau's script a little and it seems to be working pretty well. I have come far enough that I am reading data, sending it to a pool of processor threads (pseudo processing: append the thread name to the data), and aggregating it into an output file through another collector thread.
PS: I am not sure about the tagging of this question, feel free to correct it.

Python sounds like a good choice because it has a good threading API (the implementation is questionable, though), matplotlib, and pylab. I'm missing some details about your setup, but this may be a good starting point for you: matplotlib: async plotting with threads.
I would use a single thread for bulk disk I/O reads and a synchronized queue feeding a pool of threads for data processing. (If you have fixed record lengths, things may get faster by precomputing the read offsets and passing just the offsets to the thread pool.) In the disk I/O thread I would mmap the data source files, read a predefined number of bytes, and then do one more read to pick up the remaining bytes up to the end of the current input line. The number of bytes should be chosen somewhere near your average input line length. Next comes feeding the pool via the queue, and the data processing and plotting that takes place in the thread pool. I don't have a good picture of what exactly you are plotting, but I hope this helps.
EDIT: there's file.readlines([sizehint]) to grab multiple lines at once; it may not be that fast, though, because the docs say it uses readline() internally.
EDIT: a quick skeleton code
import mmap
import sys
import threading
from collections import deque
class processor(threading.Thread):
    """
    processor gets a batch of data at a time from the diskio thread
    """
    def __init__(self, q):
        threading.Thread.__init__(self, name="plotter")
        self._queue = q
    def run(self):
        # get batched data
        while True:
            # we wait for a batch, then plot each parsed record in it
            for data in self.feed(self._queue.get()):
                self.plot(data)
            # sanitizer exceptions following, maybe
    def parseline(self, line):
        """ return a data struct ready for plotting """
        raise NotImplementedError
    def feed(self, databuf):
        # we yield one datastruct at a time, ready to go for plotting
        for line in databuf:
            yield self.parseline(line)
    def plot(self, data):
        """integrate
        https://www.esclab.tw/wiki/index.php/Matplotlib#Asynchronous_plotting_with_threads
        maybe
        """
class sharedq(object):
    """i dont recall where i got this implementation from
    you may write a better one"""
    def __init__(self, maxsize=8192):
        self.queue = deque()
        self.barrier = threading.RLock()
        self.read_c = threading.Condition(self.barrier)
        self.write_c = threading.Condition(self.barrier)
        self.msz = maxsize
    def put(self, item):
        # block while the queue is full, then wake up one reader
        with self.barrier:
            while len(self.queue) >= self.msz:
                self.write_c.wait()
            self.queue.append(item)
            self.read_c.notify()
    def get(self):
        # block while the queue is empty, then wake up one writer
        with self.barrier:
            while not self.queue:
                self.read_c.wait()
            item = self.queue.popleft()
            self.write_c.notify()
            return item
q = sharedq()
# sizehint for reading lines
numbytes = 1024
for i in range(8):
    p = processor(q)
    p.start()
for fn in sys.argv[1:]:
    with open(fn, "r+b") as f:
        # you may want a better sizehint here
        mm = mmap.mmap(f.fileno(), 0)
        # mmap objects offer readline() but not readlines(), so batch by hand
        batch, size = [], 0
        for line in iter(mm.readline, b""):
            batch.append(line)
            size += len(line)
            if size >= numbytes:
                q.put(batch)
                batch, size = [], 0
        if batch:
            q.put(batch)
# some cleanup code may be desirable

I think Python + NumPy would be the most efficient option in terms of both speed and ease of implementation.
NumPy is highly optimized, so performance is decent, and Python makes the algorithm easier to implement.
This combination should work well for your case, provided you optimize how the file is loaded into memory. Try to find a block size that is not too large for memory but large enough to minimize the number of read and write cycles, because those are what will slow the program down.
If you feel this needs more speed (which I sincerely doubt), you could use Cython to speed up the sluggish parts.
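As a rough sketch of the block-wise approach, using only the standard library so it stays self-contained: the file is read in blocks of lines, each record is split on the | delimiter, and the derived fields are written to an intermediate csv. The function names, the default block size, and especially the formula in derive_row are assumptions for illustration; the real formula depends on your log format.

```python
import csv
from itertools import islice


def derive_row(fields):
    # Hypothetical formula: takes three numeric input fields and produces
    # three output columns. Replace with the real computation.
    x = float(fields[0])
    a = float(fields[2])
    b = float(fields[3])
    return [x, a + b, a * b]


def convert(src_path, dst_path, chunk_lines=100000):
    """Stream src_path in blocks of chunk_lines lines into a csv file."""
    rows_written = 0
    with open(src_path, "r") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        while True:
            # Read at most chunk_lines lines; an empty block means end of file.
            block = list(islice(src, chunk_lines))
            if not block:
                break
            rows = [derive_row(line.rstrip("\r\n").split("|")) for line in block]
            writer.writerows(rows)
            rows_written += len(rows)
    return rows_written
```

Tuning chunk_lines is the "middle point" mentioned above: larger blocks mean fewer read and write cycles, smaller blocks keep memory use down. The resulting csv can then be reused for different graphs without re-parsing the logs.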
