Dask how to scatter data when doing a reduction

Dask how to scatter data when doing a reduction - python

I am using Dask for a complicated operation. First I do a reduction which produces a moderately sized df (a few MBs) which I then need to pass to each worker to calculate the final result so my code looks a bit like this
intermediate_result = ddf.reduction().compute()
final_result = ddf.reduction(
chunk=function, chunk_kwargs={"intermediate_result": intermediate_result}
)
However I am getting the warning message that looks like this
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s)
I have tried doing this
intermediate_result = client.scatter(intermediate_result, broadcast=True)
But this isn't working as the function now sees this as a Future object and not the datatype it is supposed to be.
I can't seem to find any documentation on how to use scatter with reductions, does anyone know how to do this? Or should I just ignore the warning message and pass the moderately sized df as I am?

Actually, the best solution probably is not to scatter your materialised result, but to avoid computing it in the first place. You can simply remove the .compute(), which will mean all the calculation gets done in one stage, with the results automatically moved where you need them.
Alternatively, if you want to have a clear boundary between the stages, you can use
intermediate_result = ddf.reduction().persist()
which will kick off the reduction and store it on workers without pulling it to the client. You can choose to wait on this to finish before the next step or not.

Related

matplotlib with multiprocessing sometimes changes figure format

I am using matplotlib to make many figures and save them. I have 5 or so functions that perform either simple or no computations with the data, plot the data, and then format the figure to fit a specific form (title, axes, paper size).
These 5 or so plotting functions get called one at a time from another function in between computations. (Some computations, plotting_function_1, some more computations, plotting_function_2, ...).
I start each plotting function in a new process via plotProcess1 = multiprocessing.Process(target=plot_data1, args=(data, save_directory); myProcess.start() in order to continue with the computation while the plotting functions are running.
When I check the figures after the program has finished, many of the figures have very strange formatting errors with the titles and paper size. The weird part is that the data is always plotted exactly as it should be (The scatter data may look like some is missing, but that is just part of the dataset). Take a look at the Good figures vs bad figures to see what I am talking about (Top left is expected output, others are the issue).
This only started when I started using multiprocessing to make the plots in the background. The weirdest part is that it does not always do it, and it seems to do it at random. Any thoughts as to why it might be doing this and how to fix it? I would really like to keep the computations running while I make the plots due to timing. With some datasets, a few hundred plots will be made with each plotting function and the entire program might take 10s of hours to complete.
Edit: My datasets are very large spatial datasets, so each one of my plotting functions creates and saves multiple plots (around 20, but could be less or more depending on the size of the dataset). I have figured out when the issue occurs now, but still not why. The strange behavior happens when two plotting functions are running at the same time.
A typical timeline where the strange behaviour happens is: (plotting_function_1 has been started --> some small computations happen --> plotting_function_2 is started --> plotting_function_1 finishes --> Plotting_function_2 finishes)
This still doesn't make sense to me, because each plotting function runs in a separate process, does not change any data, and saves to a unique filename.
Edit 2: Here is a snippet of code that will create strangely formatted figures.
# Plot the raw data
if plot_TF is True:
photon_data_copy = photon_data.copy()
plot_segments_copy = plot_segments.copy()
if isParallel is True:
rawPlotProcess = multiprocessing.Process(target=plot_raw_data, args=(photon_data_copy, plot_segments_copy, run_name, plotdir))
rawPlotProcess.start()
else:
plot_raw_data(photon_data_copy, plot_segments_copy, run_name, plotdir)
# Calculate signal slab and noise slab
start = time.time()
signal_mask, noise_mask = assign_slabs_by_histogram_max_bin(photon_data, pixel_dimensions, slab_thickness)
logger.info('Time elapsed assigning slabs: {}'.format(time.time() - start))
photon_signal = photon_data[signal_mask, :]
photon_noise = photon_data[noise_mask, :]
# Plot the Signal and Noise slabs
if plot_TF is True:
photon_signal_copy = photon_signal.copy()
photon_noise_copy = photon_noise.copy()
if isParallel is True:
slabPlotProcess = multiprocessing.Process(target=plot_slabs, args=(photon_signal_copy, photon_noise_copy, plot_segments_copy, run_name, plotdir))
slabPlotProcess.start()
else:
plot_slabs(photon_signal_copy, photon_noise_copy, plot_segments_copy, run_name, plotdir)

How to have a multi-procsesing function return and store values in python?

I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
core.start()
for core in cores:
core.join()
Where the function 'processImages' returns a value. How do I save the returned value?

In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
# Renaming this variable just for clarity of the example here
work_queue = datasets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
print(processed_work)

You cannot return the variable from another process. The recommended way would be to create a Queue (multiprocessing.Queue), then have your subprocess put the results to that queue, and once it's done, you may read them back -- this works if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a simple variable for that, it has to be wrapped with above mentioned classes from multiprocessing lib.

If you want to use the result object returned by a multiprocessing, try this
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ... , fun_argumentn):
<blabla>
return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.

Prevent RAM space from filling from repeated Python function

I have a Python function example below which simply takes in a variable and performs a simple mathematical operation on it before returning.
If I parallelise this function, to better reflect the operation I would like to do in real life, and run the parallelised function 10 times, I notice on my IDE that the memory increases despite using the del results line.
import multiprocessing as mp
import numpy as np
from tqdm import tqdm
def function(x):
return x*2
test_array = np.arange(0,1e4,1)
for i in range(10):
pool = mp.Pool(processes=4)
results = list(tqdm(pool.imap(function,test_array),total=len(test_array)))
results = [x for x in results if str(x) != 'nan']
del results
I have a few questions I would be grateful to know the answers to:
Is there a way to prevent this memory increase?
Is this memory loading due to the parallelisation process?

I haven't tried this out, but i'm quite sure you don't need to define
pool= mp.Pool(processes=4)
Within the loop, you're starting up 10 instances of the pool for no reason. Maybe try moving that out and seeing if your memory usage decreases?
If that doesn't help, consider restructuring your code to utilize yield instead to prevent your memory from filling up.

Each new process that pool.imap creates needs to receive some information about the function and the element it applies the function too. This information is copies, and will therefore cause information to be copies.
If you want to reduce it, you might want to look at the chunksize argument of pool.imap.
An other way would be to just rely on functions from numpy. You might already now, but you could just do results = test_array * 2. I don't know how your real life example looks like, but you might not need to use Python's pool.
Also, if you intend to actually write fast code, don't use tqdm. It is nice and if you need it, you need it, but it will slow down your code.

python multiprocessing subprocess - high VIRT usage leads to memory error

I am using pool.map in multiprocessing to do my custom function,
def my_func(data): #This is just a dummy function.
data = data.assign(new_col = data.apply(lambda x: f(x), axis = 1))
return data
def main():
mypool=pool.Pool(processes=16,maxtasksperchild=100)
ret_list=mypool.map(my_func,(group for name, group in gpd))
mypool.close()
mypool.join()
result = pd.concat(ret_list, axis=0)
Here gpd is a grouped data frame and so I am passing one data frame at a time to the pool.map function here. I keep getting memory error here.
As I can see from here, VIRT increase to multiple fold and leads to this error.
Two questions,
How do I solve this key growing memory issue at VIRT? May be a way to play with chunk size here.?
Second thing, though its launching as many python subprocess as I mentioned in pool(processes), I can see all the CPU doesn't hit 100%CPU, seems it doesn't use all the processes. One or Two run at a time? May be due to its applying same chunk size on different data frame sizes I pass every time (some data frames will be small)? How do I utilise every CPU process here?

Just for someone looking for answer in future. I solved this by using imap instead of map. Because map will make a list of iterator which is intensive.

Reading and graphing data read from huge files

We have pretty large files, the order of 1-1.5 GB combined (mostly log files) with raw data that is easily parseable to a csv, which is subsequently supposed to be graphed to generate a set of graph images.
Currently, we are using bash scripts to turn the raw data into a csv file, with just the numbers that need to be graphed, and then feeding it into a gnuplot script. But this process is extremely slow. I tried to speed up the bash scripts by replacing some piped cuts, trs etc. with a single awk command, although this improved the speed, the whole thing is still very slow.
So, I am starting to believe there are better tools for this process. I am currently looking to rewrite this process in python+numpy or R. A friend of mine suggested using the JVM, and if I am to do that, I will use clojure, but am not sure how the JVM will perform.
I don't have much experience in dealing with these kind of problems, so any advice on how to proceed would be great. Thanks.
Edit: Also, I will want to store (to disk) the generated intermediate data, i.e., the csv, so I don't have to re-generate it, should I choose I want a different looking graph.
Edit 2: The raw data files have one record per one line, whose fields are separated by a delimiter (|). Not all fields are numbers. Each field I need in the output csv is obtained by applying a certain formula on the input records, which may use multiple fields from the input data. The output csv will have 3-4 fields per line, and I need graphs that plot 1-2, 1-3, 1-4 fields in a (may be) bar chart. I hope that gives a better picture.
Edit 3: I have modified #adirau's script a little and it seems to be working pretty well. I have come far enough that I am reading data, sending to a pool of processor threads (pseudo processing, append thread name to data), and aggregating it into an output file, through another collector thread.
PS: I am not sure about the tagging of this question, feel free to correct it.

python sounds to be a good choice because it has a good threading API (the implementation is questionable though), matplotlib and pylab. I miss some more specs from your end but maybe this could be a good starting point for you: matplotlib: async plotting with threads.
I would go for a single thread for handling bulk disk i/o reads and sync queueing to a pool of threads for data processing (if you have fixed record lengths things may get faster by precomputing reading offsets and passing just the offsets to the threadpool); with the diskio thread I would mmap the datasource files, read a predefined num bytes + one more read to eventually grab the last bytes to the end of the current datasource lineinput; the numbytes should be chosen somewhere near your average lineinput length; next is pool feeding via the queue and the data processing / plotting that takes place in the threadpool; I don't have a good picture here (of what are you plotting exactly) but I hope this helps.
EDIT: there's file.readlines([sizehint]) to grab multiple lines at once; well it may not be so speedy cuz the docs are saying its using readline() internally
EDIT: a quick skeleton code
import threading
from collections import deque
import sys
import mmap
class processor(Thread):
"""
processor gets a batch of data at time from the diskio thread
"""
def __init__(self,q):
Thread.__init__(self,name="plotter")
self._queue = q
def run(self):
#get batched data
while True:
#we wait for a batch
dataloop = self.feed(self._queue.get())
try:
while True:
self.plot(dataloop.next())
except StopIteration:
pass
#sanitizer exceptions following, maybe
def parseline(self,line):
""" return a data struct ready for plotting """
raise NotImplementedError
def feed(self,databuf):
#we yield one-at-time datastruct ready-to-go for plotting
for line in databuf:
yield self.parseline(line)
def plot(self,data):
"""integrate
https://www.esclab.tw/wiki/index.php/Matplotlib#Asynchronous_plotting_with_threads
maybe
"""
class sharedq(object):
"""i dont recall where i got this implementation from
you may write a better one"""
def __init__(self,maxsize=8192):
self.queue = deque()
self.barrier = threading.RLock()
self.read_c = threading.Condition(self.barrier)
self.write_c = threading.Condition(self.barrier)
self.msz = maxsize
def put(self,item):
self.barrier.acquire()
while len(self.queue) >= self.msz:
self.write_c.wait()
self.queue.append(item)
self.read_c.notify()
self.barrier.release()
def get(self):
self.barrier.acquire()
while not self.queue:
self.read_c.wait()
item = self.queue.popleft()
self.write_c.notify()
self.barrier.release()
return item
q = sharedq()
#sizehint for readine lines
numbytes=1024
for i in xrange(8):
p = processor(q)
p.start()
for fn in sys.argv[1:]
with open(fn, "r+b") as f:
#you may want a better sizehint here
map = mmap.mmap(f.fileno(), 0)
#insert a loop here, i forgot
q.put(map.readlines(numbytes))
#some cleanup code may be desirable

I think python+Numpy would be the most efficient way, regarding speed and ease of implementation.
Numpy is highly optimized so the performance is decent, and python would ease up the algorithm implementation part.
This combo should work well for your case, providing you optimize the loading of the file on memory, try to find the middle point between processing a data block that isn't too large but large enough to minimize the read and write cycles, because this is what will slow down the program
If you feel that this needs more speeding up (which i sincerely doubt), you could use Cython to speed up the sluggish parts.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.