I was trying to implement a massively parallel differential equation solver (30k DEs) on TensorFlow CPU but was running out of memory (around 30 GB of matrices). So I implemented a batch-based solver (solve for a small time window and save the data -> set the new initial state -> solve again), but the problem persisted. I learnt that TensorFlow does not clear the memory until the Python interpreter is closed. So, based on info from GitHub issues, I tried implementing a multiprocessing solution using a pool, but I keep getting a "can't pickle _thread.RLock objects" error at the pooling step. Could someone please help?
def dAdt(X, t):
    dX = ...  # vector of differentials
    return dX

global state_vector
global state

state_vector = [0] * n  # initial state

def tensor_process():
    with tf.Session() as sess:
        print("Session started...", end="")
        tf.global_variables_initializer().run()
        state = sess.run(tensor_state)
        sess.close()

n_batch = 3
t_batch = np.array_split(t, n_batch)

for n, i in enumerate(t_batch):
    print("Batch", (n + 1), "Running...", end="")
    if n > 0:
        i = np.append(i[0] - 0.01, i)
    print("Session started...", end="")
    init_state = tf.constant(state_vector, dtype=tf.float64)
    tensor_state = tf.contrib.odeint_fixed(dAdt, init_state, i)
    with Pool(1) as p:
        p.apply_async(tensor_process).get()
    state_vector = state[-1, :]
    np.save("state.batch" + str(n + 1), state)
    state = None
TensorFlow doesn't support multiprocessing well for several reasons, one being that it is not able to fork the TensorFlow session itself. If you still want to use some kind of 'multi' approach, try multiprocessing.pool.ThreadPool, which worked for me:
https://stackoverflow.com/a/46049195/5276428
Note: I did this by creating multiple sessions across threads and then calling the session variables belonging to each thread sequentially. If your issue is memory, I think it can be solved by reducing the input batch size.
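For reference, a minimal sketch of that ThreadPool pattern, assuming the TF 1.x session API used in the question; solve_batch and the exp(-t) computation are placeholders, not the original solver:

from multiprocessing.pool import ThreadPool  # threads share the process, so nothing needs pickling
import numpy as np
import tensorflow as tf

def solve_batch(t_segment):
    # Placeholder worker: each thread builds its own graph and session,
    # runs a stand-in computation for the ODE batch, and returns a NumPy array.
    with tf.Graph().as_default():
        with tf.Session() as sess:
            t = tf.constant(t_segment, dtype=tf.float64)
            return sess.run(tf.exp(-t))

if __name__ == '__main__':
    t_batches = np.array_split(np.linspace(0.0, 1.0, 300), 3)
    with ThreadPool(3) as pool:
        results = pool.map(solve_batch, t_batches)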
Rather than using a Pool of N workers, try creating N distinct multiprocessing.Process instances, passing your tensor_process() function as the target argument and each subset of data as the args argument. Start the processes inside a for-loop, then join them beneath the loop. You can use a shared multiprocessing.Queue object to return results to the main process.
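A minimal sketch of that pattern; the worker body is a placeholder standing in for the real tensor_process, which would build its graph and run its session inside the child process:

import multiprocessing as mp
import numpy as np

def tensor_process(t_segment, result_queue):
    # Placeholder worker: the real version would run the TF session here
    # and put the final state of this batch on the queue.
    state = np.exp(-t_segment)
    result_queue.put(state)

if __name__ == '__main__':
    t_batches = np.array_split(np.linspace(0.0, 1.0, 300), 3)
    result_queue = mp.Queue()
    processes = []
    for t_segment in t_batches:
        p = mp.Process(target=tensor_process, args=(t_segment, result_queue))
        p.start()
        processes.append(p)
    # drain the queue before joining so the children can flush their buffers and exit
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()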
I have personally had success combining TensorFlow with Python's multiprocessing module by sub-classing Process and overriding its run() method.
def run(self):
    logging.info('started inference.')
    logging.debug('TF input frame shape == {}'.format(self.tensor_shape))

    count = 0

    with tf.device('/cpu:0') if self.device_type == 'cpu' else \
            tf.device(None):
        with tf.Session(config=self.session_config) as session:
            frame_dataset = tf.data.Dataset.from_generator(
                self.generate_frames, tf.uint8, tf.TensorShape(self.tensor_shape))
            frame_dataset = frame_dataset.map(self._preprocess_frames,
                                              self._get_num_parallel_calls())
            frame_dataset = frame_dataset.batch(self.batch_size)
            frame_dataset = frame_dataset.prefetch(self.batch_size)
            next_batch = frame_dataset.make_one_shot_iterator().get_next()

            while True:
                try:
                    frame_batch = session.run(next_batch)
                    probs = session.run(self.output_node,
                                        {self.input_node: frame_batch})
                    self.prob_array[count:count + probs.shape[0]] = probs
                    count += probs.shape[0]
                except tf.errors.OutOfRangeError:
                    logging.info('completed inference.')
                    break

    self.result_queue.put((count, self.prob_array, self.timestamp_array))
    self.result_queue.close()
I would write an example based on your code, but I don't quite understand it.
EDIT: The error does not lie with multiprocessing but rather with another library. I would close the question, but @Lukasz Tracewski's point of using joblib may help someone else.
I've got an issue with a task I'm trying to parallelise on windows.
It seems to work for a while, then suddenly halts on a particular instance (I can tell, as I get the code to print which iteration it's on). I've noticed in the task manager that the CPU usage, which should be about 50%, is minimal for the Python processes. When I then do a keyboard interrupt on the cmd prompt, this suddenly shoots forward a number of instances and the activity goes back up for a short while, only to go back down again.
I've included the bits of my code which I think are relevant. I know it can work (I don't think it's stuck on a problem) and there seems to be a degree of randomness as to when it does freeze.
I'm using the COBYLA solver with max_iter set. I'm not sure if it is relevant, but when I tried to use BFGS, I got a freezing problem.
def get_optimal_angles(G, p, coeff, objective_function, initial_gamma_range, initial_beta_range, seed):
    '''
    This performs the classical-quantum interchange, improving the values of beta and gamma by reducing the value of
    < beta, gamma | C | beta, gamma >. Returns the best angles found and the objective value this refers to.
    Bounds on the minimiser are set as the starting points.
    '''
    initial_starting_points = random_intial_values((np.array(initial_gamma_range)), (np.array(initial_beta_range)), p, seed)
    optimiser_function = minimize(objective_function, initial_starting_points, method='COBYLA', options={'maxiter': 1500})
    best_angles = optimiser_function.x
    objective_value = optimiser_function.fun
    return best_angles, objective_value

def find_best_angles_for_graph_6(graph_seed):
    print("6:On graph", graph_seed)
    #graph = gp.unweighted_erdos_graph(10,0.4,graph_seed)
    graph = gp.unweighted_erdos_graph(6, 0.4, graph_seed)
    graph_coefficients = quantum_operator_z_coefficients(graph, 'Yo')
    exact_energy = get_exact_energy(graph)
    angle_starting_seed = np.arange(1, number_of_angle_points, 1)
    objective_function = get_black_box_objective_sv(graph, p, graph_coefficients)
    list_of_results = []
    for angle_seed in angle_starting_seed:
        print("On Angle Seed", angle_seed)
        best_angles_for_seed, best_objective_value_for_seed = get_optimal_angles(graph, p, graph_coefficients, objective_function, [0, np.pi], [0, np.pi], angle_seed)
        success_prob = calculate_energy_success_prob(graph, p, exact_energy, graph_coefficients, best_angles_for_seed, angle_seed)
        angle_seed_data_list = [angle_seed, best_objective_value_for_seed, success_prob, best_angles_for_seed]
        list_of_results.append(angle_seed_data_list)
    list_of_best = get_best_results(list_of_results)
    full_results = [graph_seed, exact_energy, list_of_best, list_of_results]
    return full_results
import multiprocessing as mp

def main():
    physical_cores = 5
    pool = mp.Pool(physical_cores)

    list_of_results_every_graph_6 = []
    list_of_all_graph_results_6 = pool.map(find_best_angles_for_graph_6, range(1, number_of_graphs + 1))
    print(list_of_all_graph_results_6)
    file_name_6 = 'unweighted_erdos_graph_N_6_p_8.pkl'
    pickle_6 = open(file_name_6, 'wb')
    pickle.dump(list_of_all_graph_results_6, pickle_6)
    pickle_6.close()

    list_of_results_every_graph_10 = []
    list_of_all_graph_results_10 = pool.map(find_best_angles_for_graph_10, range(1, number_of_graphs + 1))
    print(list_of_all_graph_results_10)
    file_name_10 = 'unweighted_erdos_graph_N_9_p_8.pkl'
    pickle_10 = open(file_name_10, 'wb')
    pickle.dump(list_of_all_graph_results_10, pickle_10)
    pickle_10.close()

if __name__ == "__main__":
    main()
EDIT: Here is the full code as a Jupyter notebook. https://www.dropbox.com/sh/6xb7setjsn1c1o3/AAANfH7mEmhuuf9cxh5QWsRQa?dl=0
I am doing 100 iterations of the function model, so I tried using multiprocessing to distribute the tasks, and for getting the final output I tried using a queue, but it takes too much time, defeating the purpose of multiprocessing. How do I solve this problem?
def model(X, Y):
    ada_clf = {}
    pred1 = {}
    auc_final = []
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=1000, learning_rate=0.001)
        ada_clf[iteration].fit(X, Y)
        pred1[iteration] = ada_clf[iteration].predict(test1)

    individuallabelsfromada1 = []
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])

    final_labels_ada1 = []
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))

    final = pd.Series(final_labels_ada1)
    temp_arr = np.array(final)
    total_labels2 = pd.Series(temp_arr)

    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr, tpr))

    q.put(total_labels2)
    q1.put(auc_final)
    q2.put(ada_clf)
    print('done')

overall_labels = {}
final_auc = {}
final_ada_clf = {}
processes = []
q = Queue()
q1 = Queue()
q2 = Queue()

for iteration in range(100):
    if __name__ == '__main__':
        p = multiprocessing.Process(target=model, args=(x_train, y_labels, q, q1, q2,))
        overall_labels[iteration] = q.get()
        final_auc[iteration] = q1.get()
        final_ada_clf[iteration] = q2.get()
        p.start()
        processes.append(p)

for each in processes:
    each.join()
Below is my edited version, but it returns only a single output. I tried returning multiple outputs but could not get it to work, so I settled for only a single output, i.e. total_labels2:
## code before this is the same as before; the only thing changed is the signature of
## model, from def model(X, Y) to def model(repeat, X, Y)

    total_labels2 = pd.Series(temp_arr)
    return (repeat, total_labels2)

def get_result(total_labels2):
    global testover_forall
    testover_forall.append(total_labels2)

if __name__ == '__main__':
    import multiprocessing as mp
    testover_forall = []
    pool = mp.Pool(40)
    for repeat in range(100):
        pool.apply_async(bound_model, args=(repeat, x_train, y_train), callback=get_result)
    pool.close()
    pool.join()

    repetations_index = []
    for i in range(100):
        repetations_index.append(testover_forall[i][0])

    final_last_labels = {}
    for i in range(100):
        temp = str(i)
        final_last_labels[temp] = testover_forall[repetations_index[i]][1]

    totally_last_labels = []
    for each in final_last_labels:
        temp = np.array(final_last_labels[each])
        totally_last_labels.append(temp)
See my comments (actually questions) to your post.
You should be using a multiprocessing pool to limit the number of processes that you create to the number of CPU cores that you have. This will also make it easier to get return values back from your model function instead of writing results to 3 different queues (and you could have written a tuple of 3 values to just one queue). You will, of course, require other import statements and code. Given your use of numpy and other libraries, which may be implemented in C, you could also retry running this using threading to see if that helps or hurts performance. Do this by replacing ProcessPoolExecutor with ThreadPoolExecutor in the two places it is referenced.
Note
Any changes that model makes to passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments over and over, as it appears to be, it's not clear whether each call will return different values, especially if the calls are being done in parallel.
from concurrent.futures import ProcessPoolExecutor

def model(X, Y):
    ada_clf = {}
    pred1 = {}
    auc_final = []
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=1000, learning_rate=0.001)
        ada_clf[iteration].fit(X, Y)
        pred1[iteration] = ada_clf[iteration].predict(test1)

    individuallabelsfromada1 = []
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])

    final_labels_ada1 = []
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))

    final = pd.Series(final_labels_ada1)
    temp_arr = np.array(final)
    total_labels2 = pd.Series(temp_arr)

    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr, tpr))

    #q.put(total_labels2)
    #q1.put(auc_final)
    #q2.put(ada_clf)
    #print('done')
    return total_labels2, auc_final, ada_clf

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
        # simple lists will suffice:
        overall_labels = []
        final_auc = []
        final_ada_clf = []
        for future in futures:
            # get return value and store
            total_labels2, auc_final, ada_clf = future.result()
            overall_labels.append(total_labels2)
            final_auc.append(auc_final)
            final_ada_clf.append(ada_clf)
Update
It wasn't clear from the problem specification that the returned results depend on a random number generator. If successive calls to the worker function, model, do not share a single random number generator across all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results than the non-multiprocessing version. And it is not clear from the code provided where the random number generator is being used; it may be in library code that you have no access to. If that is the case, you have two options: (1) use multithreading instead by changing the import statement as I have indicated in the code below; you may still achieve performance benefits as I have already mentioned, or (2) update the signature of model as follows. You will be passed a new argument, random_generator, that currently supports two methods, randint (like random.randint) and random (like random.random), although it should be easy enough to modify the code if you need a different method from module random. You will use this random number generator in place of module random if you are able to. But note that this random generator will run much more slowly than the standard one; this is the price you pay.
Since we are also adding a repetition argument to model (it now has to be the final argument -- note the updated signature below), we can now use method map (no need to use a callback):
def model(X, Y, random_generator, repetition):
    ...
    etc.

from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager

class RandomGeneratorManager(BaseManager):
    pass

class RandomGenerator:
    def __init__(self):
        random.seed(0)

    def randint(self, a, b):
        return random.randint(a, b)

    def random(self):
        return random.random()

    # add other functions if needed

if __name__ == '__main__':
    RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
    with RandomGeneratorManager() as manager:
        random_generator = manager.RandomGenerator()
        # why 40? why not use the default, which is the number of cpu cores you have?:
        pool = Pool(40)
        worker = partial(model, x_train, y_labels, random_generator)
        results = pool.map(worker, range(100))
I have a global NumPy array ys_final and have defined a function that generates an array ys. The ys array will be generated based on an input parameter, and I want to add these ys arrays to the global array, i.e. ys_final = ys_final + ys.
The order of addition doesn't matter, so I want to use Pool.apply_async() from the multiprocessing library, but I can't write to the global array. The code for reference is:
import multiprocessing as mp

ys_final = np.zeros(len)

def ys_genrator(i):
    # code to generate ys array
    return ys

pool = mp.Pool(mp.cpu_count())
for i in range(3954):
    ys_final = ys_final + pool.apply_async(ys_genrator, args=(i)).get()
pool.close()
pool.join()
The above block of code keeps on running forever and nothing happens. I've also tried mp.Process and I still face the same problem. There I defined a target function that directly adds to the global array, but it is also not working, as the block keeps running forever. Reference:
def func(i):
    # code to generate ys
    global ys_final
    ys_final = ys_final + ys

for i in range(3954):
    p = mp.Process(target=func, args=(i,))
    p.start()
    p.join()
Any suggestions will be really helpful.
EDIT:
My ys_genrator is a function for linear interpolation. Based on the parameter i, which is an index for rows in a 2D image, the function creates an array of interpolated amplitudes that will be superimposed with all the interpolated amplitudes from the image, so ys needs to be added to ys_final.
The variable len is the length of the interpolated array, which is the same for all rows.
For reference, a simpler version of ys_genrator(i) is as follows:
def ys_genrator(i):
    ys = np.ones(10) * i
    return ys
A few points:
pool.apply_async(ys_genrator, args=(i)) needs to be pool.apply_async(ys_genrator, args=(i,)). Note the comma after the i.
pool.apply_async(ys_genrator, args=(i,)).get() is exactly equivalent to pool.apply(ys_genrator, args=(i,)). That is, you will block because of your immediate call to get, and you will have absolutely no parallelism. You would need to make all your calls to pool.apply_async first, save the returned AsyncResult instances, and only then call get on those instances.
If you are running under Windows, you will have a problem. The code that creates new processes must be within a block governed by if __name__ == '__main__':
If you are running under something like Jupyter Notebook or iPython you will have a problem. The worker function, ys_genrator, would need to be in an external file and imported.
Using apply_async for submitting a lot of tasks is inefficient. You are better off using imap or imap_unordered, where the tasks get submitted in "chunks" and you can process the results one by one as they become available. But you must choose a "suitable" chunksize argument.
Any code you have at the global level, such as ys_final = np.zeros(len) will be executed by every sub-process if you are running under Windows, and this can be wasteful if the subprocesses do not need to "see" this variable. If they do need to see this variable, be aware that each process in the pool will be working with its own copy of the variable so it better be a read-only usage. Even then, it can be very wasteful of storage if the variable is large. There are ways of sharing such a variable across the processes but it is not perfectly clear whether you need to (you haven't even defined variable len). So it is difficult to give you improved code. However, it appears that your worker function does not need to "see" ys_final, so I will take a shot at an improved solution.
But be aware that if your function ys_genrator is very trivial, nothing will be gained by using multiprocessing because there is overhead in both creating the processing pool and in passing arguments from one process to another. Also, if ys_genrator is using numpy, this can also be a source of problems since numpy uses multiprocessing for some of its own functions and you are better off not mixing numpy with your own multiprocessing.
import multiprocessing as mp
import numpy as np

SIZE = 3

def ys_genrator(i):
    # code to generate ys array
    # for this dummy example all SIZE entries will end up with the same result:
    ys = [i] * SIZE  # for example: [1, 1, 1]
    return ys

def compute_chunksize(poolsize, iterable_size):
    chunksize, remainder = divmod(iterable_size, 4 * poolsize)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':
    ys_final = np.zeros(SIZE)
    n_iterations = 3954
    poolsize = min(mp.cpu_count(), n_iterations)
    chunksize = compute_chunksize(poolsize, n_iterations)
    print('poolsize =', poolsize, 'chunksize =', chunksize)
    pool = mp.Pool(poolsize)
    for result in pool.imap_unordered(ys_genrator, range(n_iterations), chunksize):
        ys_final += result
    print(ys_final)
Prints:
poolsize = 8 chunksize = 124
[7815081. 7815081. 7815081.]
Update
You can also just use:
for result in pool.map(ys_genrator, range(n_iterations)):
    ys_final += result
The issue is that when you use method map, the method wants to compute an efficient chunksize argument based on the size of the iterable argument (see my compute_chunksize function above, which is essentially what pool.map will use). But to do this, it will first have to convert the iterable to a list to get its size. If n_iterations is very large, this is not very efficient, although it's probably not a major issue for a size of 3954. Still, you would be better off using my compute_chunksize function in this case, since you know the size of the iterable, and then passing the chunksize argument explicitly to map as I have done in the code using imap_unordered.
Warning, this is gonna be long since I want to be as specific as I can be.
Exact problem: This is a multi-processing problem. I have ensured that my classes all behave as built/expected in previous experiments.
edit: said threading beforehand.
When I run a toy example of my problem in a threaded environment, everything behaves; however, when I transition to my real problem, the code breaks. Specifically, I get a TypeError: can't pickle _thread.lock objects error. The full stack trace is at the bottom.
My threading needs here are a bit different from the example I adapted my code from -- https://github.com/CMA-ES/pycma/issues/31. In that example we have one fitness function that can be independently called by each evaluation, and none of the function calls can interact with each other. However, in my real problem we are trying to optimize neural network weights using a genetic algorithm. The GA will suggest potential weights, and we need to evaluate these NN controller-weights in our environment. In the single-threaded case, we can have just one environment where we evaluate the weights with a simple for-loop: [nn.evaluate(weights) for weights in potential_candidates], find the best-performing individual, and use those weights in the next mutation round. However, we cannot simply have one simulation in a threaded environment.
So, instead of passing in a single function to evaluate, I am passing in a list of functions (one for each individual), where the environment is the same, but we have forked the processes so that the communication streams don't interact between individuals.
One further thing of immediate note:
I am using a built-for-parallel evaluation data structure from neat:
from neat.parallel import ParallelEvaluator # uses multiprocessing.Pool
Toy example code:
NPARAMS = nn.flat_init_weights.shape[0]  # make this a 1000-dimensional problem.
NPOPULATION = 5                          # use population size of 5.
MAX_ITERATION = 100                      # run each solver for 100 function calls.

import time
from neat.parallel import ParallelEvaluator  # uses multiprocessing.Pool
import cma

def fitness(x):
    time.sleep(0.1)
    return sum(x**2)

# # serial evaluation of all solutions
# def serial_evals(X, f=fitness, args=()):
#     return [f(x, *args) for x in X]

# parallel evaluation of all solutions
def _evaluate2(self, weights, *args):
    """redefine evaluate without the dependencies on neat-internal data structures
    """
    jobs = []
    for i, w in enumerate(weights):
        jobs.append(self.pool.apply_async(self.eval_function[i], (w,) + args))
    return [job.get() for job in jobs]

ParallelEvaluator.evaluate2 = _evaluate2
parallel_eval = ParallelEvaluator(12, [fitness] * NPOPULATION)

# time both
for eval_all in [parallel_eval.evaluate2]:
    es = cma.CMAEvolutionStrategy(NPARAMS * [1], 1, {'maxiter': MAX_ITERATION,
                                                     'popsize': NPOPULATION})
    es.disp_annotation()
    while not es.stop():
        X = es.ask()
        es.tell(X, eval_all(X))
    es.disp()
Necessary background:
When I switch from the toy example to my real code, the above fails.
My classes are:
LevelGenerator (simple GA class that implements mutate, etc)
GridGame (OpenAI wrapper; launches a Java server in which to run the simulation;
handles all communication between the Agent and the environment)
Agent (neural-network class, has an evaluate fn which uses the NN to play a single rollout)
Objective (handles serializing/de-serializing weights: numpy <--> torch; launching the evaluate function)
# The classes get composed to get the necessary behavior:
env = GridGame(Generator)
agent = NNAgent(env)  # NNAgent is a subclass of (Random) Agent)
obj = PyTorchObjective(agent)

# My code normally all interacts like this in the single-threaded case:
def test_solver(solver):  # Solver: CMA-ES, Differential Evolution, EvolutionStrategy, etc.
    history = []
    for j in range(MAX_ITERATION):
        solutions = solver.ask()  # 2d-numpy array. (POPSIZE x NPARAMS)
        fitness_list = np.zeros(solver.popsize)
        for i in range(solver.popsize):
            fitness_list[i] = obj.function(solutions[i], len(solutions[i]))
        solver.tell(fitness_list)
        result = solver.result()  # first element is the best solution, second element is the best fitness
        history.append(result[1])
        scores[j] = fitness_list
    return history, result
So, when I attempt to run:
NPARAMS = nn.flat_init_weights.shape[0]
NPOPULATION = 5
MAX_ITERATION = 100

_x = NNAgent(GridGame(Generator))
gyms = [_x.mutate(0.0) for _ in range(NPOPULATION)]
objs = [PyTorchObjective(a) for a in gyms]

def evaluate(objective, weights):
    return objective.fun(weights, len(weights))

import time
from neat.parallel import ParallelEvaluator  # uses multiprocessing.Pool
import cma

def fitness(agent):
    return agent.evalute()

# # serial evaluation of all solutions
# def serial_evals(X, f=fitness, args=()):
#     return [f(x, *args) for x in X]

# parallel evaluation of all solutions
def _evaluate2(self, X, *args):
    """redefine evaluate without the dependencies on neat-internal data structures
    """
    jobs = []
    for i, x in enumerate(X):
        jobs.append(self.pool.apply_async(self.eval_function[i], (x,) + args))
    return [job.get() for job in jobs]

ParallelEvaluator.evaluate2 = _evaluate2
parallel_eval = ParallelEvaluator(12, [obj.fun for obj in objs])
# obj.fun takes in the candidate weights, loads them into the NN, and then evaluates the NN in the environment.

# time both
for eval_all in [parallel_eval.evaluate2]:
    es = cma.CMAEvolutionStrategy(NPARAMS * [1], 1, {'maxiter': MAX_ITERATION,
                                                     'popsize': NPOPULATION})
    es.disp_annotation()
    while not es.stop():
        X = es.ask()
        es.tell(X, eval_all(X, NPARAMS))
    es.disp()
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-57-3e6b7bf6f83a> in <module>
6 while not es.stop():
7 X = es.ask()
----> 8 es.tell(X, eval_all(X, NPARAMS))
9 es.disp()
<ipython-input-55-2182743d6306> in _evaluate2(self, X, *args)
14 jobs.append(self.pool.apply_async(self.eval_function[i], (x, ) + args))
15
---> 16 return [job.get() for job in jobs]
<ipython-input-55-2182743d6306> in <listcomp>(.0)
14 jobs.append(self.pool.apply_async(self.eval_function[i], (x, ) + args))
15
---> 16 return [job.get() for job in jobs]
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
429 break
430 try:
--> 431 put(task)
432 except Exception as e:
433 job, idx = task[:2]
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/connection.py in send(self, obj)
204 self._check_closed()
205 self._check_writable()
--> 206 self._send_bytes(_ForkingPickler.dumps(obj))
207
208 def recv_bytes(self, maxlength=None):
~/miniconda3/envs/thesis/lib/python3.7/multiprocessing/reduction.py in dumps(cls, obj, protocol)
49 def dumps(cls, obj, protocol=None):
50 buf = io.BytesIO()
---> 51 cls(buf, protocol).dump(obj)
52 return buf.getbuffer()
53
TypeError: can't pickle _thread.lock objects
I also read here that this might be caused by the fact that this is a class function -- TypeError: can't pickle _thread.lock objects -- so I created the globally scoped fitness function def fitness(agent): return agent.evalute(), but that didn't work either.
I thought this error might be coming from the fact that originally, I had the evaluate function in the PyTorchObjective class as a lambda function, but when I changed that it still broke.
Any insight would be greatly appreciated, and thanks for reading this giant wall of text.
You are not using multiple threads. You are using multiple processes.
All arguments that you pass to apply_async, including the function itself, are serialized (pickled) under the hood and passed to a worker process via an IPC channel (read up on the multiprocessing documentation for details). So you cannot pass any entities that are tied to things that are by their nature process-local. This includes most synchronization primitives, since they have to use locks to do atomic operations.
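To illustrate, here is a minimal sketch (the Holder class is hypothetical, not from the question): any object that holds a lock, directly or via an attribute, fails to pickle when handed to apply_async, and the error surfaces when get() is called on the result, just as in your traceback:

import threading
from multiprocessing import Pool

class Holder:
    def __init__(self):
        self.lock = threading.Lock()  # process-local synchronization primitive

    def work(self, x):
        return x * 2

def call(holder, x):
    return holder.work(x)

if __name__ == '__main__':
    holder = Holder()
    with Pool(2) as pool:
        result = pool.apply_async(call, (holder, 3))
        # raises a TypeError about pickling a _thread.lock object,
        # because `holder` cannot be serialized for the IPC channel
        print(result.get())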
Whenever this happens (as many other questions on this error message show), you are likely trying to be too smart and passing to a parallelization framework an object that already has parallelization logic built in.
If you want to create "multiple levels of parallelization" with such a "parallelized object", you'll be better off either:
using the parallelization mechanism of that object proper and not bothering about multiple levels: you can't do more stuff at a time than you have cores anyway; or
creating and using these "parallelized objects" inside worker processes
but you are likely to hit multiprocessing limitations here, since its worker processes are deliberately prohibited from spawning their own pools (a short demonstration follows below).
You can let workers add extra items to the work queue, but you may hit Queue limitations as well.
So for such a scenario, a more advanced 3rd-party distributed work queue solution may be preferable.
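As a small illustration of the nested-pool limitation mentioned above (a toy example, not the poster's code): a Pool's workers are daemonic processes, and daemonic processes are not allowed to have children, so constructing a second Pool inside a worker fails:

from multiprocessing import Pool

def inner(x):
    return x * x

def outer(n):
    # This runs inside a pool worker, which is a daemonic process;
    # starting another Pool here raises an error like
    # "daemonic processes are not allowed to have children",
    # and the exception propagates back through outer_pool.map().
    with Pool(2) as inner_pool:
        return inner_pool.map(inner, range(n))

if __name__ == '__main__':
    with Pool(2) as outer_pool:
        print(outer_pool.map(outer, [3, 4]))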
I need to use an atomic compare-and-set operation in my Python program, but I didn't find a reference on how to use it.
Does Python provide such an atomic function?
Thank you.
From the atomics library:
import atomics
a = atomics.atomic(width=4, atype=atomics.INT)
# set to 5 if a.load() compares == to 0
res = a.cmpxchg_strong(expected=0, desired=5)
print(res.success)
Note: I am the author of this library
Python atomic for shared data types.
https://sharedatomic.top
The module can be used for atomic operations under multiple-process and multiple-thread conditions. High-performance Python! High concurrency, high performance!
Atomic API example with multiprocessing and multiple threads:
You need the following steps to utilize the module:
Create the function used by the child processes, referring to UIntAPIs, IntAPIs, BytearrayAPIs, StringAPIs, SetAPIs, ListAPIs. In each process, you can create multiple threads.
def process_run(a):
    def subthread_run(a):
        a.array_sub_and_fetch(b'\x0F')

    threadlist = []
    for t in range(5000):
        threadlist.append(Thread(target=subthread_run, args=(a,)))

    for t in range(5000):
        threadlist[t].start()

    for t in range(5000):
        threadlist[t].join()
Create the shared bytearray:
a = atomic_bytearray(b'ab', length=7, paddingdirection='r', paddingbytes=b'012', mode='m')
Start processes / threads to utilize the shared bytearray:
processlist = []
for p in range(2):
    processlist.append(Process(target=process_run, args=(a,)))

for p in range(2):
    processlist[p].start()

for p in range(2):
    processlist[p].join()

assert a.value == int.to_bytes(27411031864108609, length=8, byteorder='big')