I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. Copying the data (40MB when pickled) takes only a few seconds. However, it appears that as the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd
rc = Client()
view = rc[:] # use all engines
view.scatter('id', rc.ids, flatten=True) # So we can track which engine performed what task
def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks
if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)
    threads = [] # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))
    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)
    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print('Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id))
        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)
    print('Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time))
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started, I see the CPUs throttle up on all engines, but no progress print statements are seen for a long time (~40mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provided new responses every 4 secs - this may have been the time delay to write the output pickle files.
Lastly, host1 also runs the ipycontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engine doesn't really show much memory usage at all. What would cause this spike in memory?
I used this logic in some code a couple of years ago, and I got it working this way. My code was something like:
shared_dict = {
# big dict with ~10k keys, each with a list of dicts
}
balancer = engines.load_balanced_view()
with engines[:].sync_imports(): # your 'view' variable
    import pandas as pd
    import ujson as json
engines[:].push(shared_dict)
results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
If simulation counts are small (~50), then it takes a while to get
started, but I start to see progress print statements. Strangely,
multiple tasks will get assigned to the same engine and I don't see a
response until all of those assigned tasks are completed for that
engine. I would expect to see a response from enumerate(ar) every time
a single simulation task completes.
In my case, my_func() was a complex method that wrote lots of logging messages to a file, so that served as my print statements.
As for task assignment, since I used load_balanced_view(), I left it to the library to find its way, and it did great.
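One possible explanation for the blocky progress in the question, which I have not verified on that setup: a direct view (rc[:]) splits the task list into one contiguous chunk per engine up front, whereas a load-balanced view hands out tasks as engines become free. The plain-Python sketch below only mimics the contiguous split for the 120 tweak combinations and 12 engines mentioned in the question. contiguous_chunks is a hypothetical helper written for illustration, not an ipyparallel function.

```python
def contiguous_chunks(tasks, n_engines):
    """Split tasks into n_engines contiguous, nearly equal chunks.

    Hypothetical helper that imitates how a direct view divides work;
    it is not part of ipyparallel.
    """
    base, extra = divmod(len(tasks), n_engines)
    chunks, start = [], 0
    for engine in range(n_engines):
        # The first `extra` engines take one additional task each.
        stop = start + base + (1 if engine < extra else 0)
        chunks.append(tasks[start:stop])
        start = stop
    return chunks


# 4 * 5 * 6 = 120 tweak combinations, as in the question's nested loops.
tasks = [dict(i=i, j=j, k=k) for i in range(4) for j in range(5) for k in range(6)]
chunks = contiguous_chunks(tasks, n_engines=12)

# Engine 0 owns the first 10 tasks. If results are yielded in submission
# order, nothing can be reported until engine 0 has finished its chunk.
print([len(c) for c in chunks])
print(chunks[0][0], chunks[0][-1])
```

If this is what happens, it would also explain why results arrive in bursts: each engine reports its whole chunk at once.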
If simulation counts are large (~1000), it takes a long time to get
started, I see the CPUs throttle up on all engines, but no progress
print statements are seen for a long time (~40mins), and when I do
see progress, it appears a large block (>100) of tasks went to the same
engine, and awaited completion from that one engine before providing
some progress. When that one engine did complete, I saw the ar object
provided new responses every 4 secs - this may have been the time delay
to write the output pickle files.
As for the long startup time, I haven't experienced that, so I can't say anything about it.
I hope this sheds some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work.
Sometimes you need to scatter your data grouped by a category, so that you are sure that each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])
# Define the scatter_by function
def scatter_by(df,grouper,name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids),grouper,core))
        client[core].push({name:df[df[grouper].isin(ids)]})
# Scatter the dataframe df grouping by `year`
scatter_by(df,'year')
Notice that the function I'm suggesting makes sure each cluster will host a similar number of observations, which is usually a good idea.
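To make the balancing idea concrete without a running cluster, here is a minimal standard-library sketch of the same round-robin dealing that the sz[core::CORES] slice performs. assign_groups and the sample year values are hypothetical and exist only for this illustration.

```python
from collections import Counter


def assign_groups(keys, n_engines):
    """Deal group labels to engines round-robin, smallest group first.

    Returns one list of group labels per engine. Mirrors the
    sz[core::CORES] slicing used above, without needing a cluster.
    """
    sizes = Counter(keys)
    # Sort by (size, label) so the order is deterministic.
    ordered = [label for label, _ in sorted(sizes.items(), key=lambda kv: (kv[1], kv[0]))]
    return [ordered[core::n_engines] for core in range(n_engines)]


# Made-up 'year' column: 2019 has 1 row, 2020 has 2, 2021 has 3, 2022 has 4.
years = [2019] + [2020] * 2 + [2021] * 3 + [2022] * 4
plan = assign_groups(years, n_engines=2)
rows_per_engine = [sum(years.count(y) for y in group) for group in plan]
print(plan)
print(rows_per_engine)
```

Because groups are dealt out in order of size, each engine receives a mix of small and large groups rather than all the large ones landing in one place.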
I have the following scenario that I need to solve with Dask scheduler and workers:
Dask program has N functions called in a loop (N defined by the user)
Each function is started with delayed(func)(args) to run in parallel.
When each function from the previous point starts, it triggers W workers. This is how I invoke the workers:
futures = client.map(worker_func, worker_args)
worker_responses = client.gather(futures)
That means that I need N * W workers to run everything in parallel. The problem is that this is not optimal as it's too much resource allocation, I run it on the cloud and it's expensive. Also, N is defined by the user, so I don't know beforehand how much processing capability I need to have.
Is there a way to queue up the workers in such a way that if I define that Dask has X workers, when a worker ends then the next one starts?
First define the number of workers you need, treat them as ephemeral, but static for the entire duration of your processing
You can create them dynamically (when you start or later on), but probably want to have them all ready right at the beginning of your processing
From your point of view, the client is an executor (so when you refer to workers and running in parallel, you probably mean the same thing)
This class resembles executors in concurrent.futures but also allows Future objects within submit/map calls. When a Client is instantiated it takes over all dask.compute and dask.persist calls by default.
Once your workers are available, Dask will distribute work given to them via the scheduler
You should express dependencies between tasks by passing the result of the preceding function (which is a Future, and not yet the actual result) into the next dask.delayed() call
This Futures-as-arguments will allow Dask to build a task graph of your work
Example use https://examples.dask.org/delayed.html
Future reference https://docs.dask.org/en/latest/futures.html#distributed.Future
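The executor comparison can be illustrated with the standard library alone. The sketch below uses concurrent.futures.ThreadPoolExecutor rather than Dask, with a hypothetical worker_func and an assumed pool size of X = 3. Tasks beyond the pool size wait in the queue and start as workers free up, which is the behaviour the question asks for.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

X = 3  # fixed pool size (assumed for illustration)
lock = threading.Lock()
state = {"running": 0, "peak": 0}


def worker_func(arg):
    """Hypothetical unit of work that records how many tasks run at once."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)  # stand-in for real work
    with lock:
        state["running"] -= 1
    return arg * 2


worker_args = range(12)  # e.g. N * W = 12 tasks in total
with ThreadPoolExecutor(max_workers=X) as executor:
    # map() queues all 12 tasks; at most X of them run at any moment.
    worker_responses = list(executor.map(worker_func, worker_args))

print(worker_responses)
print("peak concurrency:", state["peak"])
```

With Dask the same idea applies: size the cluster to X workers once, submit everything, and let the scheduler feed tasks to workers as they become idle.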
Dependent Futures with dask.delayed
Here's a complete example from the Delayed docs (actually combines several successive examples to the same result)
import dask
from dask.distributed import Client
client = Client(...) # connect to distributed cluster
def inc(x):
    return x + 1
def double(x):
    return x * 2
def add(x, y):
    return x + y
data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b) # depends on a and b
    output.append(c)
total = dask.delayed(sum)(output) # depends on everything
total.compute() # 50
You can call total.visualize() to see the task graph
(image from Dask Delayed docs)
Collections of Futures
If you're already using .map(..) to map function and argument pairs, you can keep creating Futures and then .gather(..) them all at once, even if they're in a collection (which is convenient to you here)
The .gather()'ed results will be in the same arrangement as they were given (a list of lists)
[[fn1(args11), fn1(args12)], [fn2(args21)], [fn3(args31), fn3(args32), fn3(args33)]]
https://distributed.dask.org/en/latest/api.html#distributed.Client.gather
import dask
from dask.distributed import Client
client = Client(...) # connect to distributed cluster
collection_of_futures = []
for worker_func, worker_args in iterable_of_pairs_of_fn_args:
    futures = client.map(worker_func, worker_args)
    collection_of_futures.append(futures)
results = client.gather(collection_of_futures)
notes
worker_args must be some iterable to map to worker_func, which can be a source of error
.gather()ing will block until all the futures have completed or one of them raises
.as_completed()
If you need the results as quickly as possible, you could use .as_completed(..), but note that the results will arrive in a non-deterministic order, so I don't think this makes sense for your case. If you find it does, you'll need some extra guarantees:
include information about what to do with the result in the result
keep a reference to each and check them
only combine groups where it doesn't matter (ie. all the Futures have the same purpose)
also note that the yielded futures are complete, but are still a Future, so you still need to call .result() or .gather() them
https://distributed.dask.org/en/latest/api.html#distributed.as_completed
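As a rough illustration of the first guarantee (carrying the routing information inside the result), here is a sketch using concurrent.futures.as_completed from the standard library. It is not distributed.as_completed, which is a separate API, and worker_func, the group labels and the delays are all made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def worker_func(group, delay):
    """Hypothetical task: returns its own group label along with the value."""
    time.sleep(delay)
    return {"group": group, "value": delay}


jobs = [("fn1", 0.03), ("fn2", 0.01), ("fn3", 0.02)]
results_by_group = {}

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(worker_func, group, delay) for group, delay in jobs]
    for future in as_completed(futures):
        # The future is finished, but the payload still has to be unwrapped.
        payload = future.result()
        # Arrival order is arbitrary, so route by the label in the payload.
        results_by_group[payload["group"]] = payload["value"]

print(results_by_group)
```

Since every payload names its own group, the final mapping is the same whatever order the futures finish in.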
I have a python program with multiple modules. They go like this:
Job class that is the entry point and manages the overall flow of the program
Task class that is the base class for the tasks to be run on given data. Many SubTask classes, created specifically for different types of calculations on different columns of data, are derived from the Task class. Think of 10 columns in the data, each one having its own Task to do some processing; e.g. the 'price' column can be used by a CurrencyConverterTask to return local currency values, and so on.
Many other modules like a connector for getting data, utils module etc, which I don't think are relevant for this question.
The general flow of program: get data from the db continuously -> process the data -> write back the updated data to the db.
I decided to do it in multiprocessing because the tasks are relatively simple. Most of them do some basic arithmetic or logic operations and running it in one process takes a long time, especially getting data from a large db and processing in sequence is very slow.
So the multiprocessing (mp) code looks something like this (I cannot expose the entire file so I'm writing a simplified version; the parts not included are not relevant here. I've tested by commenting them out, so this is an accurate representation of the actual code):
import multiprocessing as mp

class Job():
    def __init__(self):
        self.block_size = 100 # process 100 rows at a time
        self.some_query = "SELECT * IF A > B" # some query to filter data from db
    def data_getter(self):
        # continuously get data from the db and put it into a queue in blocks
        cursor = Connector.get_data(self.some_query)
        block = []
        for item in cursor:
            block.append(item)
            if len(block) == self.block_size:
                self.data_queue.put(block)
                block = []
        self.data_queue.put(None) # this will indicate the worker processors when to stop
    def monitor(self):
        # continuously monitor the system stats
        timer = Timer()
        while True:
            if timer.time_taken >= 60: # log some stats every 60 seconds
                print(utils.system_stats())
                timer.reset()
    def task_runner(self):
        while True:
            # get data from the queue
            # if there's no data, break out of loop
            data = self.data_queue.get()
            if data is None:
                break
            # run task one by one
            for task in self.tasks:
                task.do_something(data)
    def run(self):
        # queue to put data for processing
        self.data_queue = mp.Queue()
        # start a process for reading data from db
        # (start() returns None, so keep the Process object separately)
        dg = mp.Process(target=self.data_getter)
        dg.start()
        # start a process for monitoring system stats
        mon = mp.Process(target=self.monitor)
        mon.start()
        # get a list of tasks to run
        self.tasks = [t for t in taskmodule.get_subtasks()]
        workers = []
        # start 4 processes to do the actual processing
        for _ in range(4):
            worker = mp.Process(target=self.task_runner)
            worker.start()
            workers.append(worker)
        for w in workers:
            w.join()
        mon.terminate() # terminate the monitor process
        dg.terminate() # end the data getting process

if __name__ == "__main__":
    job = Job()
    job.run()
The whole program is run like: python3 runjob.py
Expected behaviour: a continuous stream of data goes into the data_queue, and each worker process gets the data and processes it until there's no more data from the cursor, at which point the workers finish and the entire program finishes.
This is working as expected, but what is not expected is that the system memory usage keeps creeping up continuously until the system crashes. The data I'm getting here is not copied anywhere (at least not intentionally). I expect the memory usage to be steady throughout the program. The length of the data_queue rarely exceeds 1 or 2, since the processes are fast enough to get the data when it is available, so it's not the queue holding too much data.
My guess is that all the processes initiated here are long-running ones and that this has something to do with it. However, when I print the PIDs and follow them in top, the data_getter and monitor processes don't exceed 2% of memory usage. The 4 worker processes also don't use a lot of memory, and neither does the main process the whole thing runs in. There is an unaccounted-for process that takes up 20%+ of the RAM, and it bugs me that I can't figure out what it is.
I'm doing my final thesis and my topic is the creation of a software that will run and control an on-satellite experiment.
For that reason, I had to implement the reading of multiple sensors while the experiment is running. To do that, I wrote the code so that it will create a new thread for each sensor (multiprocessing might not work because I don't yet know which system the software will run on and therefore I can't say if there will be multiple processors available) and these threads run as daemons all the while the software does its thing. It works well, but now I need to test the whole thing and this is where it gets problematic:
To properly test each and every route the software could take, I have multiple variables that need to be set, so there will be a lot of test runs (I calculated around 17,000, but could be wrong). While the first few test runs go quickly, each run takes longer and longer. I have fiddled around with my code a little bit, and it turns out that without threading, each test takes about the same time. Unfortunately, I do not know why, and my knowledge of the matter is very limited. The code concerning the threading is as follows:
This sets up the creation of each thread (sensor_list will be populated with multiple sensors in non-test conditions)
sensor_list = [<a single sensor>]
for sensor in sensor_list:
    thread = threading.Thread(
        target=self.store_sensor_data,
        args=[sensor, query_frequency],
        daemon=True,
        name=f"Thread_{sensor}",
    )
    self.threads.append(thread)
    thread.start()
The function which actually deals with getting and writing the sensor data, self.store_sensor_data, looks like this:
def store_sensor_data(self, sensor, frequency):
    """Get the current reading and result from 'sensor' and store them.
    sensor (Sensor) - the sensor whose data shall be stored
    frequency (int) - the frequency (in 1/s) at which data shall be stored
    """
    value_id = 0
    while not self.HALT:
        value_id += 1
        sensor_reading = sensor.get_reading()
        sensor_result = sensor.get_result()
        try:
            # if there already is a list for that sensor, append the data to it
            self.experiment_report.sensor_data_raw[str(sensor)].append(
                (value_id, sensor_reading)
            )
        except KeyError:
            # if there is no list, create one containing the current sensor value
            self.experiment_report.sensor_data_raw[str(sensor)] = [
                (value_id, sensor_reading)
            ]
        # repeat the same for the 'result'
        try:
            self.experiment_report.sensor_data[str(sensor)].append(
                (value_id, sensor_result)
            )
        except KeyError:
            self.experiment_report.sensor_data[str(sensor)] = [
                (value_id, sensor_result)
            ]
        time.sleep(1 / frequency)
After the experiment is done, I stop the threads by calling:
def interrupt_sensor_data_recording(self):
    """Interrupt the storing of sensor data by ending all daemon threads.
    threads (list) - a list of currently running threads
    """
    if len(self.threads) > 0:
        self.HALT = True
        for thread in self.threads:
            if thread.is_alive():
                logger.debug(f"Stopping thread '{thread.getName()}'")
                thread.join()
            else:
                thread.join()
                logger.debug(f"Thread '{thread.getName()}' was already stopped")
Now I am unsure if how I stop the daemon threads is appropriate and this might be the source of my problems. But there also might be some implication that I don't know about yet and in both cases, it would be nice if someone with more knowledge than me could help me out here.
Thanks in advance!
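For comparison, one common pattern for stopping worker threads uses a threading.Event instead of a plain boolean flag, and clears both the flag and the thread list between runs. Whether this has anything to do with the slowdown cannot be determined from the excerpt, so treat the following as a sketch only. Recorder, the constant 0.0 reading and the sensor name are hypothetical stand-ins for the real experiment class and sensors.

```python
import threading
import time


class Recorder:
    """Hypothetical stand-in for the experiment class in the question."""

    def __init__(self):
        self.stop_event = threading.Event()
        self.threads = []
        self.sensor_data_raw = {}

    def store_sensor_data(self, sensor, frequency):
        value_id = 0
        while not self.stop_event.is_set():
            value_id += 1
            # setdefault replaces the try/except KeyError pattern.
            self.sensor_data_raw.setdefault(sensor, []).append((value_id, 0.0))
            # wait() pauses like time.sleep() but returns early on stop.
            self.stop_event.wait(1 / frequency)

    def start(self, sensors, frequency):
        self.stop_event.clear()  # fresh flag for every run
        for sensor in sensors:
            thread = threading.Thread(
                target=self.store_sensor_data,
                args=(sensor, frequency),
                daemon=True,
                name=f"Thread_{sensor}",
            )
            self.threads.append(thread)
            thread.start()

    def stop(self):
        self.stop_event.set()
        for thread in self.threads:
            thread.join()
        self.threads.clear()  # do not carry finished threads into the next run


recorder = Recorder()
recorder.start(["sensor_a"], frequency=100)
time.sleep(0.05)
recorder.stop()
print(len(recorder.sensor_data_raw["sensor_a"]) >= 1)
```

The points worth checking in the real code are the same ones this sketch makes explicit: that the stop flag is reset before each test run, and that the list of threads does not keep growing from one run to the next.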
I am learning how to use spark, but there are things I still don't understand. I have the following code
import urllib
import urllib.request
f = urllib.request.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
normal_raw_data
from time import time
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0
print("There are {} 'normal' interactions".format(normal_count))
print("Count completed in {} seconds".format(round(tt, 3)))
I have already created my RDD, but Spark supposedly works in parallel across multiple nodes. In the code, I don't see anywhere that specifies the number of nodes I want to split the work across, or the amount of memory I'm going to use.
As you can see, I want to measure the time it takes to process, to see the difference between working with Spark (and its parallel system) and working as I normally do with Pandas DataFrames.
In the code, I don't see anywhere that specifies the number of nodes I want to split the work across, or the amount of memory I'm going to use
Ideally, you wouldn't put that in the code; you would use the --num-executors and --executor-memory arguments of spark-submit
https://spark.apache.org/docs/latest/submitting-applications.html
You could update os.environ['PYSPARK_SUBMIT_ARGS'] if you wanted to insert that into the code, though.
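Note that if you run that code as-is on a cluster, the urlretrieve call runs only on the driver, so the file is downloaded to the driver's local disk rather than to a shared file location; executors on other nodes can only read it if the same path is available to them.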
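If you do go the environment-variable route, a minimal sketch could look like this. The executor count and memory size are made-up values, --num-executors only has an effect on cluster managers that support it (such as YARN), and as far as I know the string needs to end with pyspark-shell when Spark is launched from a plain Python process.

```python
import os

# Assumed values for illustration; tune them for your cluster.
num_executors = 4
executor_memory = "2g"

# Must be set before the SparkContext is created. The trailing
# "pyspark-shell" token is expected when launching from plain Python.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--num-executors {num_executors} "
    f"--executor-memory {executor_memory} "
    "pyspark-shell"
)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Setting the variable after the SparkContext already exists has no effect, so this belongs at the very top of the script or notebook.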
I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running this simulation one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then arises one surprise: all the simulations done at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
class Economy(object):
    def __init__(self,i):
        self.run_number = i
        self.Statistics = Statistics()
        self.process()
def run_and_return(i):
    eco = Economy(i)
    return eco
collection = []
def get_result(x):
    collection.append(x)
if __name__ == '__main__':
    pool = Pool(processes=4)
    for i in range(NRUN):
        pool.apply_async(run_and_return, (i,), callback=get_result)
    pool.close()
    pool.join()
The process(i) is the function that goes through every step of the simulation, during i steps. Basically I simulate NRUN Economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output of this is exactly the same for the first 4 runs: during the same "wave" of simulations, I get the very same output. Once I get to the second wave, then I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post; do not hesitate to ask for more details!
All the best,
If you are on Linux, each pool process is created by forking the parent process. This means the process is literally duplicated, including the seed state of any random object it may be using.
The random module selects the seed for its default functions on import, meaning the seed has already been selected before you create the Pool.
To get around this, you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique within a single run of your program, whilst the time ensures uniqueness over multiple runs in case the same process id is produced. Passing the process id and time as a string means that a digest of the string is also used to seed the random number generator, so two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
import os, random, time
from multiprocessing import Pool
def proc_init():
    random.seed(str(os.getpid()) + str(time.time()))
pool = Pool(num_procs, initializer=proc_init)
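To see why forked workers repeat each other, the effect can be reproduced in a single process by copying a generator's state, which is roughly what fork does to the module-level generator. This is only a demonstration of the mechanism: the seed 12345 and the "-a"/"-b" suffixes are invented so that the two copies get distinct seed strings without actually forking.

```python
import os
import random
import time

# Simulate what fork() does: each child starts with a copy of the
# parent's generator state.
parent = random.Random(12345)  # arbitrary seed for the demonstration
child_a = random.Random()
child_b = random.Random()
child_a.setstate(parent.getstate())
child_b.setstate(parent.getstate())

same_a = [child_a.random() for _ in range(3)]
same_b = [child_b.random() for _ in range(3)]
print(same_a == same_b)  # both copies produce the same stream

# Reseed each copy with a distinct string, as proc_init() does.
# The suffixes stand in for the differing process ids of real workers.
stamp = str(time.time())
child_a.seed(str(os.getpid()) + "-a" + stamp)
child_b.seed(str(os.getpid()) + "-b" + stamp)

diff_a = [child_a.random() for _ in range(3)]
diff_b = [child_b.random() for _ in range(3)]
print(diff_a != diff_b)  # the streams have diverged
```

This matches the symptom in the question: the first "wave" of four workers all start from the same copied state, and later tasks differ only because each worker's generator has already advanced.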