Warning, this is gonna be long since I want to be as specific as I can be.
Exact problem: This is a multi-processing problem. I have ensured that my classes all behave as built/expected in previous experiments.
edit: said threading beforehand.
When I run a toy example of my problem in a threaded environment, everything behaves; however, when I transition to my real problem, the code breaks. Specifically, I get a TypeError: can't pickle _thread.lock objects error. The full stack trace is at the bottom.
My threading needs here are a bit different than the example I adapted my code from -- https://github.com/CMA-ES/pycma/issues/31. In that example, there is one fitness function that can be independently called by each evaluation, and none of the function calls can interact with each other. However, in my real problem we are trying to optimize neural network weights using a genetic algorithm. The GA will suggest potential weights, and we need to evaluate these NN controller weights in our environment. In the single-threaded case, we can have just one environment where we evaluate the weights with a simple for-loop: [nn.evaluate(weights) for weights in potential_candidates], find the best-performing individual, and use those weights in the next mutation round. However, we cannot simply have one simulation in a threaded environment.
So, instead of passing in a single function to evaluate, I am passing in a list of functions (one per individual, where the environment is the same, but we have forked the processes so that the communication streams don't interact between individuals).
One further thing of immediate note:
I am using a built-for-parallel evaluation data structure from neat:
from neat.parallel import ParallelEvaluator # uses multiprocessing.Pool
Toy example code:
NPARAMS = nn.flat_init_weights.shape[0] # make this a 1000-dimensional problem.
NPOPULATION = 5 # use population size of 5.
MAX_ITERATION = 100 # run each solver for 100 function calls.
import time
from neat.parallel import ParallelEvaluator # uses multiprocessing.Pool
import cma
def fitness(x):
    time.sleep(0.1)
    return sum(x**2)
# # serial evaluation of all solutions
# def serial_evals(X, f=fitness, args=()):
# return [f(x, *args) for x in X]
# parallel evaluation of all solutions
def _evaluate2(self, weights, *args):
    """Redefine evaluate without the dependencies on neat-internal data structures."""
    jobs = []
    for i, w in enumerate(weights):
        jobs.append(self.pool.apply_async(self.eval_function[i], (w,) + args))
    return [job.get() for job in jobs]
ParallelEvaluator.evaluate2 = _evaluate2
parallel_eval = ParallelEvaluator(12, [fitness]*NPOPULATION)
# time both
for eval_all in [parallel_eval.evaluate2]:
    es = cma.CMAEvolutionStrategy(NPARAMS * [1], 1, {'maxiter': MAX_ITERATION,
                                                     'popsize': NPOPULATION})
    es.disp_annotation()
    while not es.stop():
        X = es.ask()
        es.tell(X, eval_all(X))
    es.disp()
Necessary background:
When I switch from the toy example to my real code, the above fails.
My classes are:
LevelGenerator (simple GA class that implements mutate, etc)
GridGame (OpenAI wrapper; launches a Java server in which to run the simulation;
handles all communication between the Agent and the environment)
Agent (neural-network class, has an evaluate fn which uses the NN to play a single rollout)
Objective (handles serializing/de-serializing weights: numpy <--> torch; launching the evaluate function)
# The classes get composed to get the necessary behavior:
env = GridGame(Generator)
agent = NNAgent(env)  # NNAgent is a subclass of (Random) Agent
obj = PyTorchObjective(agent)
# My code normally all interacts like this in the single-threaded case:
def test_solver(solver):  # Solver: CMA-ES, Differential Evolution, EvolutionStrategy, etc.
    history = []
    for j in range(MAX_ITERATION):
        solutions = solver.ask()  # 2d numpy array (POPSIZE x NPARAMS)
        fitness_list = np.zeros(solver.popsize)
        for i in range(solver.popsize):
            fitness_list[i] = obj.function(solutions[i], len(solutions[i]))
        solver.tell(fitness_list)
        result = solver.result()  # first element is the best solution, second element is the best fitness
        history.append(result[1])
        scores[j] = fitness_list
    return history, result
So, when I attempt to run:
NPARAMS = nn.flat_init_weights.shape[0]
NPOPULATION = 5
MAX_ITERATION = 100
_x = NNAgent(GridGame(Generator))
gyms = [_x.mutate(0.0) for _ in range(NPOPULATION)]
objs = [PyTorchObjective(a) for a in gyms]
def evaluate(objective, weights):
    return objective.fun(weights, len(weights))
import time
from neat.parallel import ParallelEvaluator # uses multiprocessing.Pool
import cma
def fitness(agent):
    return agent.evaluate()
# # serial evaluation of all solutions
# def serial_evals(X, f=fitness, args=()):
# return [f(x, *args) for x in X]
# parallel evaluation of all solutions
def _evaluate2(self, X, *args):
    """Redefine evaluate without the dependencies on neat-internal data structures."""
    jobs = []
    for i, x in enumerate(X):
        jobs.append(self.pool.apply_async(self.eval_function[i], (x,) + args))
    return [job.get() for job in jobs]
ParallelEvaluator.evaluate2 = _evaluate2
parallel_eval = ParallelEvaluator(12, [obj.fun for obj in objs])
# obj.fun takes in the candidate weights, loads them into the NN, and then evaluates the NN in the environment.
# time both
for eval_all in [parallel_eval.evaluate2]:
    es = cma.CMAEvolutionStrategy(NPARAMS * [1], 1, {'maxiter': MAX_ITERATION,
                                                     'popsize': NPOPULATION})
    es.disp_annotation()
    while not es.stop():
        X = es.ask()
        es.tell(X, eval_all(X, NPARAMS))
    es.disp()
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-57-3e6b7bf6f83a> in <module>
6 while not es.stop():
7 X = es.ask()
----> 8 es.tell(X, eval_all(X, NPARAMS))
9 es.disp()
<ipython-input-55-2182743d6306> in _evaluate2(self, X, *args)
14 jobs.append(self.pool.apply_async(self.eval_function[i], (x, ) + args))
15
---> 16 return [job.get() for job in jobs]
<ipython-input-55-2182743d6306> in <listcomp>(.0)
14 jobs.append(self.pool.apply_async(self.eval_function[i], (x, ) + args))
15
---> 16 return [job.get() for job in jobs]
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
429 break
430 try:
--> 431 put(task)
432 except Exception as e:
433 job, idx = task[:2]
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/connection.py in send(self, obj)
204 self._check_closed()
205 self._check_writable()
--> 206 self._send_bytes(_ForkingPickler.dumps(obj))
207
208 def recv_bytes(self, maxlength=None):
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/reduction.py in dumps(cls, obj, protocol)
49 def dumps(cls, obj, protocol=None):
50 buf = io.BytesIO()
---> 51 cls(buf, protocol).dump(obj)
52 return buf.getbuffer()
53
TypeError: can't pickle _thread.lock objects
I also read here that this might be caused by the fact that this is a class function -- TypeError: can't pickle _thread.lock objects -- so I created the globally scoped fitness function def fitness(agent): return agent.evaluate(), but that didn't work either.
I also thought this error might be coming from the fact that, originally, I had the evaluate function in the PyTorchObjective class as a lambda function, but when I changed that, it still broke.
Any insight would be greatly appreciated, and thanks for reading this giant wall of text.
You are not using multiple threads. You are using multiple processes.
All arguments that you pass to apply_async, including the function itself, are serialized (pickled) under the hood and passed to a worker process via an IPC channel (read up on the multiprocessing documentation for details). So you cannot pass any entities that are tied to things that are, by their nature, process-local. This includes most synchronization primitives, since they have to use locks to do atomic operations.
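To make that concrete, here is a tiny, self-contained reproduction (the Env class below is invented for illustration, not the question's code): a bound method drags its instance, and therefore the instance's lock, into the pickled task.

import multiprocessing
import threading

class Env:
    def __init__(self):
        self.lock = threading.Lock()  # process-local; cannot be pickled

    def evaluate(self, weights):
        return sum(weights)

if __name__ == '__main__':
    env = Env()
    with multiprocessing.Pool(2) as pool:
        job = pool.apply_async(env.evaluate, ([1, 2, 3],))
        # Pickling env.evaluate requires pickling env (and its lock), so this
        # raises the same "can't pickle _thread.lock objects" TypeError
        # (the exact message wording varies slightly across Python versions).
        job.get()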
Whenever this happens (as many other questions on this error message show), you are likely trying to be too smart and passing to a parallelization framework an object that already has parallelization logic built in.
If you want to create "multiple levels of parallelization" with such a "parallelized object", you'll be better off either:
using the parallelization mechanism of that object proper and not bothering with multiple levels: you can't do more stuff at a time than you have cores anyway; or
creating and using these "parallelized objects" inside worker processes (see the sketch after this list)
but you are likely to hit multiprocessing limitations here, since its worker processes are deliberately prohibited from spawning their own pools.
You can let workers add extra items to the work queue, but you may hit Queue limitations as well.
So for such a scenario, a more advanced third-party distributed work queue solution may be preferable.
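For the question's concrete case, the second option above usually looks like building the environment once per worker via the Pool initializer, so that only plain weight arrays are pickled. This is only a sketch: it assumes the question's GridGame/NNAgent/PyTorchObjective constructors can be called inside a fresh worker process, and the helper names are made up.

import multiprocessing as mp

_worker_obj = None  # one objective (and one Java-backed environment) per worker

def _init_worker():
    # Runs once inside each worker process: the heavy, lock-holding objects
    # are created here and never have to be pickled.
    global _worker_obj
    env = GridGame(Generator)
    _worker_obj = PyTorchObjective(NNAgent(env))

def _eval_weights(weights):
    # Only the plain numpy weight vector crosses the process boundary.
    return _worker_obj.fun(weights, len(weights))

def parallel_fitness(candidates, n_workers=12):
    with mp.Pool(n_workers, initializer=_init_worker) as pool:
        return pool.map(_eval_weights, candidates)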
Related
I'm trying to repeatedly run a function that requires a few positional arguments and involves random number generation (to generate many samples of a distribution). For an MWE, I think this captures everything:
import numpy as np
import multiprocessing as mup
from functools import partial

def rarr(xsize, ysize, k):
    return np.random.rand(xsize, ysize)

def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
    np.random.seed()
    if ncores is None:
        p = mup.Pool()
    else:
        p = mup.Pool(ncores)
    out = p.map_async(partial(rarr, xsize, ysize), range(nsamp))
    p.close()
    return np.array(out.get())
Note that the final positional argument for rarr() is just a dummy variable, since I am using map_async(), which requires an iterable. Now if I run %timeit clever_array(500, ncores = 1) I get 208 ms, whereas %timeit clever_array(500, ncores = 5) takes 149 ms. So there is definitely some kind of parallelism happening (the speedup isn't terribly impressive for this MWE but is decent in my real code).
However, I'm wondering a few things -- is there a more natural implementation other than the dummy variable for rarr() passed as an iterable to map_async to run this many times? Is there any obvious way to pass the xsize and ysize args to rarr() other than partial()? And is there any way to ensure different results from the different cores other than initializing a different random.seed() every time?
Thanks for any help!
Typically when we use multiprocessing, we expect different results from each invocation of a function, so it doesn't quite make sense to call the same function many times with identical random state. To ensure the randomness of the sampling output, it is best to separate the random state (seed) from the function itself. As recommended by the official NumPy documentation, the best approach is to use np.random.Generator objects, created via np.random.default_rng([seed]). With that, we can modify your code to:
import numpy as np
import multiprocessing as mup
from functools import partial

def rarr(xsize, ysize, rng):
    return rng.random((xsize, ysize))

def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
    if ncores is None:
        p = mup.Pool()
    else:
        p = mup.Pool(ncores)
    out = p.map_async(partial(rarr, xsize, ysize), map(np.random.default_rng, range(nsamp)))
    p.close()
    return np.array(out.get())
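If you also want the whole run to be reproducible from a single seed, a common variation (assuming NumPy >= 1.17) is to spawn independent child seeds from one SeedSequence instead of seeding the generators with 0 .. nsamp-1. A minimal sketch:

import numpy as np
from multiprocessing import Pool
from functools import partial

def rarr(xsize, ysize, rng):
    return rng.random((xsize, ysize))

def clever_array_spawned(nsamp, xsize=100, ysize=100, seed=None, ncores=None):
    # Spawn nsamp statistically independent child seeds from one parent seed,
    # then build one Generator per sample; each Generator is pickled to a worker.
    child_seeds = np.random.SeedSequence(seed).spawn(nsamp)
    rngs = [np.random.default_rng(s) for s in child_seeds]
    with Pool(ncores) as p:
        out = p.map(partial(rarr, xsize, ysize), rngs)
    return np.array(out)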
I was trying to implement a massively parallel differential equation solver (30k DEs) on TensorFlow CPU but was running out of memory (around 30 GB matrices). So I implemented a batch-based solver (solve for a short time span and save the data -> set the new initial state -> solve again). But the problem persisted. I learned that TensorFlow does not clear the memory until the Python interpreter is closed. So, based on info from GitHub issues, I tried implementing a multiprocessing solution using a pool, but I keep getting a "can't pickle _thread.RLock objects" error at the pooling step. Could someone please help!
import numpy as np
import tensorflow as tf
from multiprocessing import Pool

# n (state size) and t (time points) are defined elsewhere in the full script.

def dAdt(X, t):
    dX = ...  # vector of differentials
    return dX

global state_vector
global state

state_vector = [0] * n  # initial state

def tensor_process():
    with tf.Session() as sess:
        print("Session started...", end="")
        tf.global_variables_initializer().run()
        state = sess.run(tensor_state)
        sess.close()

n_batch = 3
t_batch = np.array_split(t, n_batch)

for n, i in enumerate(t_batch):
    print("Batch", (n + 1), "Running...", end="")
    if n > 0:
        i = np.append(i[0] - 0.01, i)
    print("Session started...", end="")
    init_state = tf.constant(state_vector, dtype=tf.float64)
    tensor_state = tf.contrib.odeint_fixed(dAdt, init_state, i)
    with Pool(1) as p:
        p.apply_async(tensor_process).get()
    state_vector = state[-1, :]
    np.save("state.batch" + str(n + 1), state)
    state = None
TensorFlow doesn't support multiprocessing for several reasons, such as not being able to fork the TensorFlow session itself. If you still want to use some kind of 'multi' stuff, try multiprocessing.pool.ThreadPool, which worked for me:
https://stackoverflow.com/a/46049195/5276428
Note: I did this by creating multiple sessions over threads and then calling each session's variables belonging to each thread sequentially. If your issue is memory, I think it can be solved by reducing the input batch size.
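As a rough sketch of that suggestion applied to the question's code (assuming tensor_process stays defined exactly as above), the only change is swapping the process pool for a thread pool, so nothing has to be pickled:

# Threads share the interpreter, so tensor_process and the TF session it
# creates never need to be pickled; the _thread.RLock error goes away.
from multiprocessing.pool import ThreadPool

with ThreadPool(1) as p:
    p.apply_async(tensor_process).get()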
Rather than using a Pool of N workers, try creating N distinct instances of multiprocessing.Process objects, passing your tensor_process() function as the target argument and each subset of data as the args arguments. Start the processes inside a for-loop, then join them beneath the loop. You can use a shared multiprocessing.Queue object to return results to the main process.
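A minimal sketch of that pattern, assuming tensor_process() is modified to take a batch of time points and a result queue (the original version takes no arguments):

# One Process per batch of time points; results come back through a Queue.
import multiprocessing as mp

def run_batches(t_batches):
    result_queue = mp.Queue()
    procs = [mp.Process(target=tensor_process, args=(batch, result_queue))
             for batch in t_batches]
    for p in procs:
        p.start()
    # Drain the queue before joining, so large results can't block the workers.
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results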
I have personally had success combining TensorFlow with Python's multiprocessing module by sub-classing Process and overriding its run() method.
def run(self):
    logging.info('started inference.')
    logging.debug('TF input frame shape == {}'.format(self.tensor_shape))
    count = 0
    with tf.device('/cpu:0') if self.device_type == 'cpu' else \
            tf.device(None):
        with tf.Session(config=self.session_config) as session:
            frame_dataset = tf.data.Dataset.from_generator(
                self.generate_frames, tf.uint8, tf.TensorShape(self.tensor_shape))
            frame_dataset = frame_dataset.map(self._preprocess_frames,
                                              self._get_num_parallel_calls())
            frame_dataset = frame_dataset.batch(self.batch_size)
            frame_dataset = frame_dataset.prefetch(self.batch_size)
            next_batch = frame_dataset.make_one_shot_iterator().get_next()

            while True:
                try:
                    frame_batch = session.run(next_batch)
                    probs = session.run(self.output_node,
                                        {self.input_node: frame_batch})
                    self.prob_array[count:count + probs.shape[0]] = probs
                    count += probs.shape[0]
                except tf.errors.OutOfRangeError:
                    logging.info('completed inference.')
                    break

    self.result_queue.put((count, self.prob_array, self.timestamp_array))
    self.result_queue.close()
I would write an example based on your code, but I don't quite understand it.
I am currently using a function that builds extremely large dictionaries (used to compare DNA strings), and sometimes I get a MemoryError.
Is there a way to allot more memory to Python so it can deal with more data at once?
Python doesn't limit memory usage of your program. It will allocate as much memory as your program needs until your computer is out of memory. The most you can do is reduce the limit to a fixed upper cap. That can be done with the resource module, but it isn't what you're looking for.
You'd need to look at making your code more memory/performance friendly.
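For completeness, this is roughly what the resource-module cap mentioned above looks like (Unix only); note it can only lower the ceiling, it cannot give your process more memory than the machine has:

import resource

# Read the current address-space limit and lower the soft limit to ~2 GiB.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024 ** 3, hard))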
Python raises MemoryError when you hit the limit of your system RAM, unless you've set a lower limit manually with the resource package.
Defining your class with __slots__ tells the Python interpreter that the attributes/members of your class are fixed, and it can lead to significant memory savings!
You can reduce per-instance dict creation by the interpreter by using __slots__. This tells the interpreter not to create an internal __dict__ for each instance and to reuse the same fixed storage.
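A minimal illustration (the class and attribute names are hypothetical):

class DnaRecord:
    # With __slots__, instances get fixed attribute storage and no per-instance
    # __dict__, which adds up to real savings when you create millions of them.
    __slots__ = ('sequence', 'score')

    def __init__(self, sequence, score):
        self.sequence = sequence
        self.score = score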
If the memory consumed by your Python processes continues to grow with time, this seems to be a combination of:
How the C memory allocator in Python works. This is essentially memory fragmentation: the allocator cannot call 'free' unless an entire memory chunk is unused, but chunk usage is usually not perfectly aligned to the objects you are creating and using.
Using a large number of small strings to compare data. Python interns small strings internally, but creating many of them puts load on the interpreter.
The best way is to create a worker thread or a single-threaded pool to do your work, then invalidate/kill the worker to free up the resources attached to or used by it.
The code below creates a single-threaded worker:
import threading
import logging
import concurrent.futures

logger = logging.getLogger(__name__)

__slots__ = ('dna1', 'dna2', 'lock', 'errorResultMap')
lock = threading.Lock()
errorResultMap = []

def process_dna_compare(dna1, dna2):
    # max_workers=1 creates a single-threaded pool
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        futures = {executor.submit(getDnaDict, lock, dna_key): dna_key for dna_key in dna1}
        dna_differences_map = {}
        count = 0
        dna_processed = False
        for future in concurrent.futures.as_completed(futures):
            result_dict = future.result()
            if result_dict:
                count += 1
                # Do your processing XYZ here
    logger.info('Total dna keys processed ' + str(count))

def getDnaDict(lock, dna_key):
    # process dna_key here and return the item
    try:
        dataItem = item[0]
        return dataItem
    except:
        lock.acquire()
        errorResultMap.append({'dna_key_1': '', 'dna_key_2': dna_key_2, 'dna_key_3': dna_key_3,
                               'dna_key_4': 'No data for dna found'})
        lock.release()
        logger.error('Error in processing dna :' + dna_key)

if __name__ == "__main__":
    dna1 = '''get data for dna1'''
    dna2 = '''get data for dna2'''
    process_dna_compare(dna1, dna2)
    if errorResultMap != []:
        pass  # print or write errorResultMap to a file
The code below will help you understand memory usage:
import objgraph
import random
import inspect

class Dna(object):
    def __init__(self):
        self.val = None

    def __str__(self):
        return "dna - val: {0}".format(self.val)

def f():
    l = []
    for i in range(3):
        dna = Dna()
        # print("id of dna: {0}".format(id(dna)))
        # print("dna is: {0}".format(dna))
        l.append(dna)
    return l

def main():
    d = {}
    l = f()
    d['k'] = l
    print("list l has {0} objects of type Dna()".format(len(l)))
    objgraph.show_most_common_types()
    objgraph.show_backrefs(random.choice(objgraph.by_type('Dna')),
                           filename="dna_refs.png")
    objgraph.show_refs(d, filename='myDna-image.png')

if __name__ == "__main__":
    main()
Output for memory usage :
list l has 3 objects of type Dna()
function 2021
wrapper_descriptor 1072
dict 998
method_descriptor 778
builtin_function_or_method 759
tuple 667
weakref 577
getset_descriptor 396
member_descriptor 296
type 180
For more reading on slots, please visit: https://elfsternberg.com/2009/07/06/python-what-the-hell-is-a-slot/
Try updating your Python from 32-bit to 64-bit.
Simply type python in the command line and you will see which one yours is. The memory available to 32-bit Python is very limited.
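If the interpreter banner is ambiguous, a quick check from inside Python (a pointer size of 8 bytes means a 64-bit build):

import struct
print(struct.calcsize("P") * 8, "bit")  # prints 64 on a 64-bit build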
I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run the function on each element of [data] in a different process (well, on a different core, but AFAIK the only way to run stuff on different cores in Python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method in multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing

pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This will ensure that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x * x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results.
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in ceil(5/2), i.e., in approximately 3 seconds.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python3's Pool class has a map() method and that's all you need to parallelize map:
from multiprocessing import Pool

with Pool() as P:
    xtransList = P.map(some_func, a_list)
Using with Pool() as P creates a process pool, and P.map will execute each item in the list in parallel. You can provide the number of cores:
with Pool(processes=4) as P:
For those who are looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool

num_cores = multiprocessing.cpu_count()

def parallelize_dataframe(df, func, U=None, V=None):
    # blockSize = 5000
    num_partitions = 5  # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)

    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))

    pool.close()
    pool.join()
    return df

def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum(np.multiply(U[int(row["obsI"]), :], V[int(row["obsJ"]), :]))

def mProd_func(data, U, V):
    data["predV"] = data.apply(lambda row: vecProd(row, U, V), axis=1)
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])

    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N * K).reshape([N, K])
    V = np.random.sample(D * K).reshape([D, K])
    return sparseY, U, V

def main():
    Y, U, V = generate_simulated_data()

    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI": int, "obsJ": int, "testV": float, "predV": float, "squareV": float}

    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))

    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)

    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5, :])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case: I wrote a tool to make this super, super easy, called parmapper (I actually call it parmap in my use, but the name was taken).
It handles a lot of the setup and deconstruction of processes and adds tons of features. In rough order of importance:
Can take lambdas and other unpicklable functions
Can apply starmap and other similar call methods to make it very easy to use directly
Can split work amongst both threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: like map in Python 3+, it returns an iterable, so if you expect all results to pass through immediately, use list().)