Does Python have an atomic CompareAndSet operation?

I need an atomic compare-and-set operation in my Python program, but I couldn't find any reference on how to use one.
Does Python provide such an atomic function?
Thank you.

From the atomics library:
import atomics
a = atomics.atomic(width=4, atype=atomics.INT)
# set to 5 if a.load() compares == to 0
res = a.cmpxchg_strong(expected=0, desired=5)
print(res.success)
Note: I am the author of this library
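In typical use, a compare-and-set call sits inside a retry loop: read the current value, compute the new one, and retry if another thread changed the value in between. A minimal sketch with this API (the atomic_add helper is hypothetical; it assumes, per the library's README, that a failed cmpxchg reports the value actually seen in res.expected):
import atomics

def atomic_add(a, amount):
    # classic CAS retry loop: re-read and retry until our update wins
    expected = a.load()
    while True:
        res = a.cmpxchg_weak(expected=expected, desired=expected + amount)
        if res.success:
            return
        expected = res.expected  # another thread got there first; retry

a = atomics.atomic(width=4, atype=atomics.INT)
atomic_add(a, 10)
print(a.load())  # 10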

Python atomics for shared data types:
https://sharedatomic.top
The module can be used for atomic operations across multiple processes and multiple threads, and targets high concurrency and high performance.
Example of the atomic API with multiprocessing and multiple threads. You need the following steps to use the module:
Create the function used by the child processes (refer to UIntAPIs, IntAPIs, BytearrayAPIs, StringAPIs, SetAPIs, ListAPIs). In each process you can create multiple threads.
from threading import Thread

def process_run(a):
    def subthread_run(a):
        a.array_sub_and_fetch(b'\x0F')

    threadlist = []
    for t in range(5000):
        threadlist.append(Thread(target=subthread_run, args=(a,)))
    for t in range(5000):
        threadlist[t].start()
    for t in range(5000):
        threadlist[t].join()
Create the shared bytearray:
a = atomic_bytearray(b'ab', length=7, paddingdirection='r', paddingbytes=b'012', mode='m')
Start the processes / threads that use the shared bytearray:
from multiprocessing import Process

processlist = []
for p in range(2):
    processlist.append(Process(target=process_run, args=(a,)))
for p in range(2):
    processlist[p].start()
for p in range(2):
    processlist[p].join()
assert a.value == int.to_bytes(27411031864108609, length=8, byteorder='big')

Related

setting cpu affinity using multiprocessing in python windows

I am trying to run 1200 iterations of a function with different values using multiprocessing.
Is there a way I can set the priority and affinity of each process within the function itself?
Here is an example of what I am doing:
with multiprocessing.Pool(processes=3) as pool:
    r = pool.map(func, (c for c in combinations))
I want each of the 3 processes to have high priority using psutil, and the cpu_affinity to be specified. While I can call psutil.Process().nice(psutil.HIGH_PRIORITY_CLASS) within func, how should I specify a different affinity for each of the three processes?
I would use the initializer function in mp.Pool:
# same prio for each child
def init_priority(prio_level):
    set_prio(prio_level)

if __name__ == "__main__":
    with Pool(nprocs, init_priority, (prio_level,)) as p:
        p.map(...)

# Different prio for each child (this may not be very useful,
# because you cannot choose which child will accept each "task"):
def init_priority(q):
    prio_level = q.get()
    set_prio(prio_level)

if __name__ == "__main__":
    q = mp.Queue()
    for _ in range(nprocs):  # put one prio_level for each process
        q.put(prio_level)
    with Pool(nprocs, init_priority, (q,)) as p:
        p.map(...)
If you need some high-priority and some low-priority child processes, and need to be able to discern easily between them, I would skip mp.Pool and just use your own Process instances.
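Combining the queue-based initializer with psutil, a per-child sketch might look like the following (psutil.HIGH_PRIORITY_CLASS exists only on Windows, and cpu_affinity() is not available on every platform; func and the CPU lists are stand-ins):
import multiprocessing as mp
import psutil

def func(c):
    return sum(c)  # stand-in for the real work

def init_affinity(q):
    # each child pops its own CPU list off the queue, first come first served
    cpus = q.get()
    proc = psutil.Process()
    proc.nice(psutil.HIGH_PRIORITY_CLASS)  # Windows-only priority constant
    proc.cpu_affinity(cpus)                # pin this worker to its CPUs

if __name__ == "__main__":
    combinations = [(1, 2), (3, 4), (5, 6)]
    q = mp.Queue()
    for cpus in ([0], [1], [2]):  # one CPU list per worker
        q.put(cpus)
    with mp.Pool(3, init_affinity, (q,)) as pool:
        print(pool.map(func, combinations))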

Why is the program execution time the same as before?

For some reason, the execution time is still the same as without threading.
But if I add something like time.sleep(secs), there clearly is threading at work inside the target function d.
def d(CurrentPos, polygon, angale, id):
    Returnvalue = 0
    lock = True
    steg = 0.0005
    distance = 0
    x = 0
    y = 0
    while lock == True:
        x = math.sin(math.radians(angale)) * distance + CurrentPos[0]
        y = math.cos(math.radians(angale)) * distance + CurrentPos[1]
        Localpoint = Point(x, y)
        inout = polygon.contains(Localpoint)
        distance = distance + steg
        if inout == False:
            lock = False
            l = LineString([[CurrentPos[0], CurrentPos[1]], [x, y]])
            Returnvalue = list(l.intersection(polygon).coords)[0]
            Returnvalue = calculateDistance(CurrentPos[0], CurrentPos[1],
                                            Returnvalue[0], Returnvalue[1])
            with Arraylock:
                ReturnArray.append(Returnvalue)
                ReturnArray.append(id)

def Main(CurrentPos, Map):
    threads = []
    for i in range(8):
        t = threading.Thread(target=d, name='thread{}'.format(i),
                             args=(CurrentPos, Map, angales[i], i))
        threads.append(t)
        t.start()
    for i in threads:
        i.join()
Welcome to the world of the Global Interpreter Lock, a.k.a. the GIL. Your function looks like CPU-bound code (some calculations, loops, ifs, memory access, etc.). You can't use threads to increase the performance of CPU-bound tasks, sorry. It is a Python limitation.
There are functions in Python that release the GIL, e.g. disk I/O, network I/O, and the one you've actually tried: sleep. And indeed, threads do increase the performance of I/O-bound tasks. But arithmetic and/or memory access won't run in parallel in Python.
The standard workaround is to use processes instead of threads, but this is often painful due to not-that-easy interprocess communication. You may also want to consider using low-level libraries like numpy that actually release the GIL in certain situations (that can only be done at the C level; the GIL is not accessible from Python itself), or using some other language without this limitation, e.g. C#, Java, C, C++ and so on.
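For illustration, a process-based rewrite of a fan-out like Main() could use a Pool and collect return values instead of appending to a shared, locked list (a sketch: the geometry work is replaced by a stand-in computation):
import math
import multiprocessing as mp

def ray(args):
    # stand-in for d(): pure CPU-bound arithmetic, one result per angle
    (cx, cy), angle, ray_id = args
    x = math.sin(math.radians(angle)) * 1000 + cx
    y = math.cos(math.radians(angle)) * 1000 + cy
    return ray_id, math.hypot(x - cx, y - cy)

if __name__ == "__main__":
    jobs = [((0.0, 0.0), 45 * i, i) for i in range(8)]
    with mp.Pool() as pool:
        # each worker is a separate interpreter, so the GIL no longer
        # serializes the arithmetic
        print(pool.map(ray, jobs))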

Tensorflow Error using Pool: can't pickle _thread.RLock objects

I was trying to implement a massively parallel differential equation solver (30k DEs) on TensorFlow CPU but was running out of memory (around 30GB matrices). So I implemented a batch-based solver (solve for a small time interval and save the data -> set new initial conditions -> solve again). But the problem persisted. I learnt that TensorFlow does not clear the memory until the Python interpreter is closed. So, based on info in GitHub issues, I tried implementing a multiprocessing solution using Pool, but I keep getting a "can't pickle _thread.RLock objects" error at the pooling step. Could someone please help!
def dAdt(X, t):
    dX = ...  # vector of differentials
    return dX

global state_vector
global state
state_vector = [0] * n  # initial state

def tensor_process():
    with tf.Session() as sess:
        print("Session started...", end="")
        tf.global_variables_initializer().run()
        state = sess.run(tensor_state)
        sess.close()

n_batch = 3
t_batch = np.array_split(t, n_batch)
for n, i in enumerate(t_batch):
    print("Batch", (n+1), "Running...", end="")
    if n > 0:
        i = np.append(i[0] - 0.01, i)
    print("Session started...", end="")
    init_state = tf.constant(state_vector, dtype=tf.float64)
    tensor_state = tf.contrib.odeint_fixed(dAdt, init_state, i)
    with Pool(1) as p:
        p.apply_async(tensor_process).get()
    state_vector = state[-1, :]
    np.save("state.batch" + str(n+1), state)
    state = None
TensorFlow doesn't support multiprocessing for several reasons, such as not being able to fork the TensorFlow session itself. If you still want to use some kind of 'multi' approach, try multiprocessing.pool.ThreadPool, which worked for me:
https://stackoverflow.com/a/46049195/5276428
Note: I did this by creating multiple sessions over threads and then calling the session variables belonging to each thread sequentially. If your issue is memory, I think it can be solved by reducing the input batch size.
Rather than use a Pool of N workers, try creating N distinct instances of multiprocessing.Process, passing your tensor_process() function as the target argument and each subset of data as the args argument. Start the processes inside a for-loop, then join them beneath the loop. You can use a shared multiprocessing.Queue object to return results to the main process.
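A bare-bones sketch of that Process-plus-Queue pattern (the worker body is a stand-in for the real per-batch TensorFlow work):
import multiprocessing as mp

def tensor_process(batch, result_queue):
    # stand-in for the per-batch TF session work
    result_queue.put(sum(batch))

if __name__ == "__main__":
    batches = [[1, 2], [3, 4], [5, 6]]
    results = mp.Queue()
    workers = [mp.Process(target=tensor_process, args=(b, results))
               for b in batches]
    for w in workers:
        w.start()
    # drain the queue before joining so a full queue can't deadlock the workers
    outputs = [results.get() for _ in workers]
    for w in workers:
        w.join()
    print(outputs)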
I have personally had success combining TensorFlow with Python's multiprocessing module by sub-classing Process and overriding its run() method.
def run(self):
    logging.info('started inference.')
    logging.debug('TF input frame shape == {}'.format(self.tensor_shape))
    count = 0
    with tf.device('/cpu:0') if self.device_type == 'cpu' else \
            tf.device(None):
        with tf.Session(config=self.session_config) as session:
            frame_dataset = tf.data.Dataset.from_generator(
                self.generate_frames, tf.uint8, tf.TensorShape(self.tensor_shape))
            frame_dataset = frame_dataset.map(self._preprocess_frames,
                                              self._get_num_parallel_calls())
            frame_dataset = frame_dataset.batch(self.batch_size)
            frame_dataset = frame_dataset.prefetch(self.batch_size)
            next_batch = frame_dataset.make_one_shot_iterator().get_next()

            while True:
                try:
                    frame_batch = session.run(next_batch)
                    probs = session.run(self.output_node,
                                        {self.input_node: frame_batch})
                    self.prob_array[count:count + probs.shape[0]] = probs
                    count += probs.shape[0]
                except tf.errors.OutOfRangeError:
                    logging.info('completed inference.')
                    break

    self.result_queue.put((count, self.prob_array, self.timestamp_array))
    self.result_queue.close()
I would write an example based on your code, but I don't quite understand it.

cexprtk with multiprocessing Python

I am using the cexprtk wrapper in Python to evaluate arithmetic expressions, as it offers very fast evaluation compared to the standard eval(). For a large list of expressions the initial overhead is cumbersome, as it has to compile all the terms, which can take a long time.
However, it offers a very nice feature whereby you only need to compile once and can then re-evaluate the expressions later using different values for the variables, which is what I want to do.
I was wondering if it is possible to apply Python multiprocessing to this compilation process. I would break the large list of arithmetic expressions into sub-lists and feed them separately into functions which apply the cexprtk compilation to the different lists. These could then be run in parallel.
I attempted to do this, but the output is nan whatever I try. Here is a very simple example showing working cexprtk code without multiprocessing:
import cexprtk
st = cexprtk.Symbol_Table({"W": 1, "X": 3, "Y": 1, "Z": 2}, add_constants=True)
L = ['W+X+Y+Z', 'Y^2*W+Z']
A = [cexprtk.Expression(x, st) for x in L]
print(A[0]())  ## This gives 7, which is correct
print(A[1]())  ## This gives 3, which is correct
Now here is the attempt at using multiprocessing with two lists and two queues:
from multiprocessing import Process, Queue
import cexprtk

st = cexprtk.Symbol_Table({"W": 1, "X": 3, "Y": 1, "Z": 2}, add_constants=True)

## Define two lists
L = ['W+X+Y+Z', 'Y^2*W+Z']
L2 = ['W^5+Z-Y', 'Y^7+20-X']

## Define functions that compile a list and put it on a queue
def myfunc1(que):
    lst1 = [cexprtk.Expression(x, st) for x in L]
    que.put(lst1)

def myfunc2(que):
    lst2 = [cexprtk.Expression(x, st) for x in L2]
    que.put(lst2)

queue1 = Queue()
queue2 = Queue()
p1 = Process(target=myfunc1, args=(queue1,))
p2 = Process(target=myfunc2, args=(queue2,))
p1.start()
p2.start()
ans1 = queue1.get()
ans2 = queue2.get()
print(ans1[0]())  # Gives nan
print(ans2[0]())  # Gives nan
I feel as though this falls into the category of "embarrassingly parallel" problems, as the lists are completely separate and no communication is needed between the processes. I have used this exact method of multiprocessing before with great success, but in this instance it is not giving an answer, and as there are no error messages I have no error feedback to work with.
If you use eval() instead it works without issue, so I assume it is the cexprtk wrapper. Is there a way to achieve what I am after? Or is the Python -> C++ -> Python round trip too much for multiprocessing?
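One plausible reading (an assumption; nothing above confirms cexprtk's internals) is that compiled Expression objects wrap native C++ state that does not survive being pickled through the Queue, so the rebuilt objects evaluate to nan. A sketch of a workaround under that assumption: evaluate inside the worker and send back plain floats, which pickle cleanly:
from multiprocessing import Process, Queue
import cexprtk

def evaluate(expressions, que):
    # compile and evaluate in the worker; only the float results cross
    # the process boundary
    st = cexprtk.Symbol_Table({"W": 1, "X": 3, "Y": 1, "Z": 2}, add_constants=True)
    que.put([cexprtk.Expression(x, st)() for x in expressions])

if __name__ == "__main__":
    q1, q2 = Queue(), Queue()
    p1 = Process(target=evaluate, args=(['W+X+Y+Z', 'Y^2*W+Z'], q1))
    p2 = Process(target=evaluate, args=(['W^5+Z-Y', 'Y^7+20-X'], q2))
    p1.start()
    p2.start()
    print(q1.get())  # [7.0, 3.0]
    print(q2.get())  # [2.0, 18.0]
    p1.join()
    p2.join()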

How to utilize all cores with python multiprocessing

I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp

cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
    DG.remove_node(exclusion)

# checks if 'memorizedPaths' exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
                  'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
    os.makedirs(fn)

manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
                    key=itemgetter(1),
                    reverse=True)
def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
    source = item[0]
    uniqueTreePaths = []
    if cutoff < 1:
        return
    visited = [source]
    stack = [iter(DG[source])]
    while stack:
        children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            visited.pop()
        elif child in memorizedPaths:
            for path in memorizedPaths[child]:
                newPath = (tuple(visited) + tuple(path))
                if ((len(newPath) <= cutoff) and
                        (len(set(visited) & set(path)) == 0)):
                    uniqueTreePaths.append(newPath)
            continue
        elif len(visited) < cutoff:
            if child not in visited:
                visited.append(child)
                stack.append(iter(DG[child]))
                if visited not in uniqueTreePaths:
                    uniqueTreePaths.append(tuple(visited))
        else:  # len(visited) == cutoff:
            if ((visited not in uniqueTreePaths) and
                    (child not in visited)):
                uniqueTreePaths.append(tuple(visited + [child]))
            stack.pop()
            visited.pop()
    # writes the absolute path of the node path file into the hash table
    filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
    with open(filepaths[source], "wb") as csvfile2:
        writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
        for path in uniqueTreePaths:
            writer.writerow(path)
    memorizedPaths[source] = uniqueTreePaths
############################################################################
if __name__ == '__main__':
    start = time.clock()
    for item in degreelist:
        test = mp.Process(target=_all_simple_paths_graph,
                          args=(DG, cutoff, item, memorizedPaths, filepaths))
        test.start()
        test.join()
    end = time.clock()
    print(end - start)
Currently - through luck and magic - it works (sort of). My problem is that I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture (Intel Xeon CPU E5-2640 @ 2.50GHz x18, running on Ubuntu 13.04 x64)?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
    p.apply_async(_all_simple_paths_graph,
                  args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
It works, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. Hopefully this helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
                      DG=DG,
                      cutoff=cutoff,
                      memorizedPaths=memorizedPaths,
                      filepaths=filepaths)
p = mp.Pool()
for item in processList:
    processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
It works, but it's slower than single-core. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processors.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
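For instance, a minimal sketch of that queue-plus-sentinel pattern (the squaring is a stand-in for the real per-node work):
import multiprocessing as mp

SENTINEL = None  # tells a worker there is no more work

def worker(work_q, result_q):
    while True:
        item = work_q.get()
        if item is SENTINEL:
            break                      # all real work is done; exit
        result_q.put(item * item)      # stand-in for the real per-node work

if __name__ == "__main__":
    nprocs = 4
    work_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(work_q, result_q))
             for _ in range(nprocs)]
    for p in procs:
        p.start()
    items = list(range(100))
    for item in items:
        work_q.put(item)
    for _ in range(nprocs):            # one sentinel per worker process
        work_q.put(SENTINEL)
    results = [result_q.get() for _ in items]  # drain before joining
    for p in procs:
        p.join()
    print(len(results))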
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per processor, not once for every item in your degreelist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp

def init_worker(mps, fps, cut):
    global memorizedPaths, filepaths, cutoff
    global DG
    print("process initializing", mp.current_process())
    memorizedPaths, filepaths, cutoff = mps, fps, cut
    DG = 1  ## nx.read_gml("KeggComplete.gml", relabel=True)

def work(item):
    _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)

def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
    pass  # print("doing " + str(item))

if __name__ == "__main__":
    m = mp.Manager()
    memorizedPaths = m.dict()
    filepaths = m.dict()
    cutoff = 1  ##
    # use all available CPUs
    p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
                                                   filepaths,
                                                   cutoff))
    degreelist = range(100000)  ##
    for _ in p.imap_unordered(work, degreelist, chunksize=500):
        pass
    p.close()
    p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit at a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)
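One such opportunity (a sketch, assuming the per-source results fit in the parent's memory): skip the shared dict entirely and let each worker return its (source, paths) pair; the parent assembles an ordinary local dict from whatever imap_unordered yields, so nothing has to be synchronized across processes:
import multiprocessing as mp

def work(item):
    # stand-in: compute this node's paths locally...
    paths = [(item,)]
    # ...and return them instead of writing to a Manager.dict()
    return item, paths

if __name__ == "__main__":
    with mp.Pool() as p:
        memorizedPaths = dict(p.imap_unordered(work, range(100000),
                                               chunksize=500))
    print(len(memorizedPaths))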
