How can I prevent values from overlapping in Python multiprocessing? - python

I'm trying out Python multiprocessing, and I want to use a Lock to keep the values of the variable 'es_id' from overlapping.
In theory, and in the examples I've seen, once a process acquires the lock another process can't touch 'es_id', so the values shouldn't overlap; but in practice the results show that es_id overlaps frequently.
How can I keep the id values from overlapping?
Part of my code is:
def saveDB(imgName, imgType, imgStar, imgPull, imgTag, lock):  # lock=Lock() in main
    imgName = NameFormat(imgName)  # name/subname > name:subname
    i = 0
    while i < len(imgName):
        lock.acquire()  # since global es_id
        global es_id
        print "getIMG.pt:save information about %s" % (imgName[i])
        cmd = "curl -XPUT http://localhost:9200/kimhk/imgName/"+str(es_id)+" -d '{" +\
              '"image_name":"'+imgName[i]+'", '+\
              '"image_type":"'+imgType[i]+'", '+\
              '"image_star":"'+imgStar[i]+'", '+\
              '"image_pull":"'+imgPull[i]+'", '+\
              '"image_Tag":"'+",".join(imgTag[i])+'"'+\
              "}'"
        try:
            subprocess.call(cmd, shell=True)
        except subprocess.CalledProcessError as e:
            print e.output
        i += 1
        es_id += 1
        lock.release()
    ...

#main
if __name__ == "__main__":
    lock = Lock()
    exPg, proc_num = option()
    procs = []
    pages = [[] for i in range(proc_num)]
    i = 1

    #Use Multiprocessing to get HTML data quickly
    if proc_num >= exPg:  # if page is less than proc_num, don't need to distribute the page to the process.
        while i <= exPg:
            page = i
            proc = Process(target=getExplore, args=(page, lock,))
            procs.append(proc)
            proc.start()
            i += 1
    else:
        while i <= exPg:  # distribute the page to the process
            page = i
            index = (i-1) % proc_num  # if proc_num=4 -> 0 1 2 3
            pages[index].append(page)
            i += 1
        i = 0
        while i < proc_num:
            proc = Process(target=getExplore, args=(pages[i], lock,))
            procs.append(proc)
            proc.start()
            i += 1

    for proc in procs:
        proc.join()
Execution result (screenshot omitted): the result shown is the output of subprocess.call(cmd, shell=True). I use XPUT to add data to Elasticsearch, and es_id is the id of each document. I want these ids to increase sequentially without overlapping, because when two processes use the same id one document overwrites the other.
I know XPOST doesn't need the lock code because it generates an id automatically, but I need to be able to read all the data back in order later (like reading a file line by line).
If you know how to access all the data sequentially after using XPOST, could you tell me?

It looks like you are trying to protect a global variable with a lock, but global variables are separate instances in each process. What you need is a shared memory value. Here's a working example; it has been tested on Python 2.7 and 3.6:
from __future__ import print_function
import multiprocessing as mp

def process(counter):
    # Increment the counter 3 times.
    # Hold the counter's lock for read/modify/write operations.
    # Keep holding it so the value doesn't change before printing,
    # and keep prints from multiple processes from trying to write
    # to a line at the same time.
    for _ in range(3):
        with counter.get_lock():
            counter.value += 1
            print(mp.current_process().name, counter.value)

def main():
    counter = mp.Value('i')  # shared integer
    processes = [mp.Process(target=process, args=(counter,)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
Output:
Process-2 1
Process-2 2
Process-1 3
Process-3 4
Process-2 5
Process-1 6
Process-3 7
Process-1 8
Process-3 9
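As a rough, untested sketch of how this shared counter could replace the lock/global pair in the question's saveDB (NameFormat comes from the question, the curl payload is shortened here, and this is not the answer author's code):

import subprocess
import multiprocessing as mp

def saveDB(imgName, imgType, imgStar, imgPull, imgTag, es_id):
    # es_id is an mp.Value('i') shared counter passed in from main,
    # replacing the per-process global used in the question.
    imgName = NameFormat(imgName)          # assumed helper from the question
    for i in range(len(imgName)):
        with es_id.get_lock():             # hold the lock only for read + increment
            doc_id = es_id.value
            es_id.value += 1
        cmd = ("curl -XPUT http://localhost:9200/kimhk/imgName/" + str(doc_id) +
               " -d '{\"image_name\":\"" + imgName[i] + "\"}'")   # payload shortened
        subprocess.call(cmd, shell=True)

Each process draws a unique doc_id under the counter's lock, so two processes can never PUT to the same id even if their curl calls run concurrently.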

You've only shown part of your code, so I can only see a potential problem. It does no good to lock-protect one access to es_id; you must lock-protect them all, everywhere they occur in the program. It is probably best to create an access function for this purpose, like:
def increment_es_id():
    global es_id
    lock.acquire()
    es_id += 1
    lock.release()
This can be called safely from any thread.
In your code, it's good practice to move the acquire/release calls as close together as you can make them. Here you only need to protect one variable, so you can move the acquire/release pair to just before and after the es_id += 1 statement.
Even better is to use the lock in a context manager (although in this simple case it won't make any difference):
def increment_es_id2():
    global es_id
    with lock:
        es_id += 1

Related

random.random() generates same number in multiprocessing

I'm working on an optimization problem, and you can see a simplified version of my code below (the original code is too complicated to post here, and I hope the simplified version reproduces it as closely as possible).
My purpose:
use the function foo inside the function optimization, but foo can take a very long time in some hard cases. So I use multiprocessing to set a time limit on the function call (proc.join(iter_time); the method is from an answer to this question: How to limit execution time of a function call?).
My problem:
In the while loop, every time the generated values for extra are the same.
The list lst's length is always 1, which means in every iteration in the while loop it starts from an empty list.
My guess: one possible reason is that each time I create a process the random seed starts counting from the beginning again, and each time a process is terminated some garbage collection mechanism cleans up the memory the process used, so the list is cleared.
My question
Does anyone know the real reason for these problems?
If not multiprocessing, is there any other way I can achieve my goal while still generating different random numbers? By the way, I have tried func_timeout, but it has other problems that I cannot handle...
random.seed(123)
lst = []  # a global list for logging data

def foo(epoch):
    ...
    extra = random.random()
    lst.append(epoch + extra)
    ...

def optimization(loop_time, iter_time):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = multiprocessing.Process(target=foo, args=(epoch,))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()

if __name__ == '__main__':
    optimization(300, 2)
You need to use shared memory if you want to share variables across processes, because child processes do not share their memory space with the parent. The simplest way to do that here is to use a managed list and to delete the line where you set the random seed. Setting the seed is what causes the same numbers to be generated, because every child process starts from that same seed. To get different random numbers, either don't set a seed or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process

def foo(epoch, lst):
    extra = random.random()
    lst.append(epoch + extra)

def optimization(loop_time, iter_time, lst):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, lst))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()
    print(lst)

if __name__ == '__main__':
    manager = Manager()
    lst = manager.list()
    optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect the performance of your code. Alternatively, you could use multiprocessing.Array, which is faster than a manager but less flexible in the data it can store, or a Queue.
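For comparison, here is a minimal sketch of the Queue-based alternative, reusing the structure of the example above (the epoch += 1 and the draining loop are additions made here for illustration, and q.empty() is only approximate on a multiprocessing.Queue):

import time, random
from multiprocessing import Process, Queue

def foo(epoch, q):
    extra = random.random()
    q.put(epoch + extra)          # child sends its result back to the parent

def optimization(loop_time, iter_time, q):
    results = []
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, q))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():       # not finished within the time limit
            print("Time out!")
            proc.terminate()
        while not q.empty():      # drain whatever the child managed to produce
            results.append(q.get())
        epoch += 1
    print(results)

if __name__ == '__main__':
    optimization(10, 2, Queue())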

Multiprocessing Running Slower than a Single Process

I'm attempting to use multiprocessing to run many simulations across multiple processes; however, the code I have written only uses 1 of the processes as far as I can tell.
Updated
I've gotten all the processes to work (I think) thanks to #PaulBecotte; however, the multiprocessing version runs significantly slower than its single-process counterpart.
For instance, not including the function and class declarations/implementations and imports, I have:
def monty_hall_sim(num_trial, player_type='AlwaysSwitchPlayer'):
    if player_type == 'NeverSwitchPlayer':
        player = NeverSwitchPlayer('Never Switch Player')
    else:
        player = AlwaysSwitchPlayer('Always Switch Player')
    return (MontyHallGame().play_game(player) for trial in xrange(num_trial))

def do_work(in_queue, out_queue):
    while True:
        try:
            f, args = in_queue.get()
            ret = f(*args)
            for result in ret:
                out_queue.put(result)
        except:
            break

def main():
    logging.getLogger().setLevel(logging.ERROR)
    always_switch_input_queue = multiprocessing.Queue()
    always_switch_output_queue = multiprocessing.Queue()
    total_sims = 20
    num_processes = 5
    process_sims = total_sims/num_processes

    with Timer(timer_name='Always Switch Timer'):
        for i in xrange(num_processes):
            always_switch_input_queue.put((monty_hall_sim, (process_sims, 'AlwaysSwitchPlayer')))

        procs = [multiprocessing.Process(target=do_work, args=(always_switch_input_queue, always_switch_output_queue)) for i in range(num_processes)]
        for proc in procs:
            proc.start()

        always_switch_res = []
        while len(always_switch_res) != total_sims:
            always_switch_res.append(always_switch_output_queue.get())

        always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))

    print '\tLength of Always Switch Result List: {alw_sw_len}'.format(alw_sw_len=len(always_switch_res))
    print '\tThe success average of switching doors was: {alw_sw_prob}'.format(alw_sw_prob=always_switch_success)
which yields:
Time Elapsed: 1.32399988174 seconds
Length: 20
The success average: 0.6
However, I am attempting to use this for total_sims = 10,000,000 over num_processes = 5, and doing so has taken significantly longer than using 1 process (1 process returned in ~3 minutes). The non-multiprocessing counterpart I'm comparing it to is:
def main():
    logging.getLogger().setLevel(logging.ERROR)

    with Timer(timer_name='Always Switch Monty Hall Timer'):
        always_switch_res = [MontyHallGame().play_game(AlwaysSwitchPlayer('Monty Hall')) for x in xrange(10000000)]
        always_switch_success = float(always_switch_res.count(True))/float(len(always_switch_res))

    print '\n\tThe success average of not switching doors was: {not_switching}' \
          '\n\tThe success average of switching doors was: {switching}'.format(not_switching=never_switch_success,
                                                                               switching=always_switch_success)
You could try importing "Process" under some if statements.
EDIT: you changed some stuff, so let me try to explain a bit better.
Each message you put into the input queue causes the monty_hall_sim function to get called and send num_trial messages to the output queue.
So your original implementation was right: to get 20 output messages, send in 5 input messages.
However, your function is slightly wrong.
for trial in xrange(num_trial):
    res = MontyHallGame().play_game(player)
    yield res
This turns the function into a generator that provides a new value on each next() call. Great! The problem is here:
while True:
    try:
        f, args = in_queue.get(timeout=1)
        ret = f(*args)
        out_queue.put(ret.next())
    except:
        break
Here, on each pass through the loop you create a NEW generator from a NEW message; the old one is thrown away. So each input message only adds a single output message to the queue before you throw it away and get another one. The correct way to write this is:
while True:
    try:
        f, args = in_queue.get(timeout=1)
        ret = f(*args)
        for result in ret:
            out_queue.put(result)
    except:
        break
Doing it this way will continue to pull output messages from the generator until it finishes (after yielding 4 messages in this case).
I was able to get my code to run significantly faster by changing monty_hall_sim's return to a list comprehension, having do_work add the whole lists to the output queue, and then extending the results list in main with the lists returned from the output queue. That made it run in about 13 seconds.
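A rough sketch of that batching approach (MontyHallGame, NeverSwitchPlayer and AlwaysSwitchPlayer are the question's own classes and are assumed to exist; Python 2, like the rest of the code):

def monty_hall_sim(num_trial, player_type='AlwaysSwitchPlayer'):
    if player_type == 'NeverSwitchPlayer':
        player = NeverSwitchPlayer('Never Switch Player')
    else:
        player = AlwaysSwitchPlayer('Always Switch Player')
    # Return one list per worker instead of a generator of single results.
    return [MontyHallGame().play_game(player) for trial in xrange(num_trial)]

def do_work(in_queue, out_queue):
    while True:
        try:
            f, args = in_queue.get(timeout=1)
            out_queue.put(f(*args))        # one queue message per whole batch
        except:
            break

# In main(), collect whole batches instead of single results:
#   always_switch_res = []
#   while len(always_switch_res) != total_sims:
#       always_switch_res.extend(always_switch_output_queue.get())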

running two interdependent while loops in python?

For a web-scraping analysis I need two loops that run permanently: one returns a list of websites, updated every x minutes, while the other analyses the sites (old and new ones) every y seconds. The code construction below illustrates what I am trying to do, but it doesn't work. (The code has been edited to incorporate the answers and my own research.)
from multiprocessing import Process
import time, random
from threading import Lock
from collections import deque

class MyQueue(object):
    def __init__(self):
        self.items = deque()
        self.lock = Lock()

    def put(self, item):
        with self.lock:
            self.items.append(item)

    # Example pointed at in [this][1] answer
    def get(self):
        with self.lock:
            return self.items.popleft()

def a(queue):
    while True:
        x = [random.randint(0,10), random.randint(0,10), random.randint(0,10)]
        print 'send', x
        queue.put(x)
        time.sleep(10)

def b(queue):
    try:
        while queue:
            x = queue.get()
            print 'receive', x
            for i in x:
                print i
                time.sleep(2)
    except IndexError:
        print queue.get()

if __name__ == '__main__':
    q = MyQueue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
So, this is my first Python project after an online introductory course, and I am struggling big time. I understand now that the functions don't truly run in parallel, as b does not start until a is finished (I used this answer and tinkered with the timer and while True). EDIT: Even after using the approach given in the answer, I think this is still the case, as queue.get() throws an IndexError saying the deque is empty. I can only explain that by process a not finishing, because when I print queue.get()
immediately after .put(x) it is not empty.
I eventually want an output like this:
send [3,4,6]
3
4
6
3
4
send [3,8,6,5] #the code above gives always 3 entries, but in my project
3 #the length varies
8
6
5
3
8
6
.
.
What do I need to get two truly parallel loops, where one returns an updated list every x minutes and the other loop uses that list as the basis for its analysis? Is Process really the right tool here?
And where can I find good information about designing such a program?
I did something a little like this a while ago. I think using the Process is the correct approach, but if you want to pass data between processes then you should probably use a Queue.
https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
Create the queue first and pass it into both processes. One can write to it, the other can read from it.
One issue I remember is that the reading process will block on the queue until something is pushed to it, so you may need to push a special 'terminate' message of some kind to the queue when process 1 is done so process 2 knows to stop.
EDIT: Simple example. This doesn't include a clean way to stop the processes (a sentinel-based shutdown is sketched after the example), but it shows how you can start two new processes and pass data from one to the other. Since the queue blocks on get(), function b will automatically wait for data from a before continuing.
from multiprocessing import Process, Queue
import time, random

def a(queue):
    while True:
        x = [random.randint(0,10), random.randint(0,10), random.randint(0,10)]
        print 'send', x
        queue.put(x)
        time.sleep(5)

def b(queue):
    x = []
    while True:
        time.sleep(1)
        try:
            x = queue.get(False)
            print 'receive', x
        except:
            pass
        for i in x:
            print i

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
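As a rough sketch of the sentinel idea mentioned above (the finite producer loop and the None marker are assumptions made here to keep the demo short; any unique, picklable value works as the sentinel):

from multiprocessing import Process, Queue
import time, random

SENTINEL = None                      # marker that tells the reader to stop

def a(queue, rounds=3):
    for _ in range(rounds):
        x = [random.randint(0, 10) for _ in range(3)]
        print 'send', x
        queue.put(x)
        time.sleep(1)
    queue.put(SENTINEL)              # producer is done

def b(queue):
    while True:
        x = queue.get()              # blocks until something arrives
        if x is SENTINEL:            # clean shutdown
            break
        for i in x:
            print i

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()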

Using Queues in Multi-Thread Python, Passing the queue to a subfunction as reference

I am about to start on an endeavour with Python. The goal is to multithread different tasks and use queues to communicate between them. For the sake of clarity I would like to be able to pass a queue to a sub-function, and send information to the queue from there. So, something like this:
from queue import Queue
from threading import Thread
import copy

# Object that signals shutdown
_sentinel = object()

# increment function
def increment(i, out_q):
    i += 1
    print(i)
    out_q.put(i)
    return

# A thread that produces data
def producer(out_q):
    i = 0
    while True:
        # Produce some data
        increment(i, out_q)
        if i > 5:
            out_q.put(_sentinel)
            break

# A thread that consumes data
def consumer(in_q):
    while True:
        # Get some data
        data = in_q.get()
        # Process the data
        # Check for termination
        if data is _sentinel:
            in_q.put(_sentinel)
            break

# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()

# Wait for all produced items to be consumed
q.join()
Currently the output is a row of 0's, where I would like it to be the numbers 1 to 6. I have read about the difficulty of passing references in Python, but would like to clarify whether this is just not possible in Python, or whether I am looking at the issue wrongly.
The problem has nothing to do with the way the queues are passed; you're doing that right. The issue is actually in how you're trying to increment i. Because variables in Python are passed by assignment, you have to return the incremented value of i back to the caller for the change made inside increment to have any effect. Otherwise, you just rebind the local variable i inside increment, and that binding is thrown away when increment completes.
You can also simplify your consumer a bit by using the iter built-in function, along with a for loop, to consume from the queue until _sentinel is reached, rather than a while True loop:
from queue import Queue
from threading import Thread
import copy

# Object that signals shutdown
_sentinel = object()

# increment function
def increment(i):
    i += 1
    return i

# A thread that produces data
def producer(out_q):
    i = 0
    while True:
        # Produce some data
        i = increment(i)
        print(i)
        out_q.put(i)
        if i > 5:
            out_q.put(_sentinel)
            break

# A thread that consumes data
def consumer(in_q):
    for data in iter(in_q.get, _sentinel):
        # Process the data
        pass

# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()
Output:
1
2
3
4
5
6

In this Semaphore example, is it necessary to lock for refill() and buy()?

In this Semaphore example, is it necessary to lock around refill() and buy()?
The book says:
The refill() function is performed when the owner of the fictitious vending machine comes to add one more item to inventory. The entire routine represents a critical section; this is why acquiring the lock is the only way to execute all lines.
But I think locking refill() and buy() is not necessary.
What is your opinion?
#!/usr/bin/env python
from atexit import register
from random import randrange
from threading import BoundedSemaphore, Lock, Thread
from time import sleep, ctime

lock = Lock()
MAX = 5
candytray = BoundedSemaphore(MAX)

def refill():
    # lock.acquire()
    try:
        candytray.release()
    except ValueError:
        pass
    # lock.release()

def buy():
    # lock.acquire()
    candytray.acquire(False)
    # lock.release()

def producer(loops):
    for i in range(loops):
        refill()
        sleep(randrange(3))

def consumer(loops):
    for i in range(loops):
        buy()
        sleep(randrange(3))

def _main():
    print('starting at:', ctime())
    nloops = randrange(2, 6)
    print('THE CANDY MACHINE (full with %d bars)!' % MAX)
    Thread(target=consumer, args=(randrange(nloops, nloops+MAX+2),)).start()  # buyer
    Thread(target=producer, args=(nloops,)).start()  # vendor

@register
def _atexit():
    print('all DONE at:', ctime())

if __name__ == '__main__':
    _main()
A lock is absolutely necessary. Perhaps it will help if you change the code a little to print the number of candies left after each producer/consumer call. I replaced the semaphore, because all it was doing was keeping a count.
I added
numcandies = 5
For refill:
def refill():
    global numcandies
    numcandies += 1
    print("Refill: %d left" % numcandies)
For buy:
def buy():
    global numcandies
    numcandies -= 1
    print("Buy: %d left" % numcandies)
Here's the output without locks (which shows the data race):
('starting at:', 'Tue Mar 26 23:09:41 2013')
THE CANDY MACHINE (full with 5 bars)!
Buy: 4 left
Refill: 5 left
Refill: 6 left
Buy: 5 left
Buy: 4 left
Buy: 3 left
Refill: 6 left
Refill: 7 left
Buy: 6 left
('all DONE at:', 'Tue Mar 26 23:09:43 2013')
Somewhere between the call to producer and the actual update of the numcandies counter, we had two successive calls to consumer.
Without locking, there is no control over the order in which the counter is actually modified. So in the above case, even though numcandies had been updated to 3 by the consumer, the producer still had a local copy of 5. After updating, it set the counter to 6, which is completely wrong.
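A minimal sketch of the locked version of these counter-based helpers (numcandies and lock are the names used above):

def refill():
    global numcandies
    with lock:                       # protect the whole read-modify-write-print
        numcandies += 1
        print("Refill: %d left" % numcandies)

def buy():
    global numcandies
    with lock:
        numcandies -= 1
        print("Buy: %d left" % numcandies)

With the lock held across the increment/decrement and the print, no thread can read or write a stale copy of the counter.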
Of course it's a critical section: you have to lock it. Uncomment those commented-out lines.
candytray is a resource the threads are competing for. There is a rule about the independence of tasks: two tasks are independent if they don't share a codomain, the first task's domain does not overlap the second's codomain, and the second's domain does not overlap the first's codomain. In other words, two tasks may only READ from the same memory / variable / etc.
In your case, if candytray were implemented as some kind of queue, you wouldn't need to lock, because the "writer" puts data on, say, the left side and the "reader" reads from the right side. Their domains and codomains don't overlap, so the tasks are independent.
But if it is not a queue, if it is, say, a shared heap, then the writer's codomain interferes with the reader's domain. The tasks are then dependent and you need to lock.
Edit: as you can see, I approached this from a theoretical angle rather than a Python-specific one, but I guess that is what you wanted.
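For illustration, a tiny sketch of the queue-shaped tray this answer describes (a hypothetical rewrite, not the book's code; a Queue does its own internal locking, so the calling code needs no explicit lock):

from Queue import Queue, Full, Empty   # Python 2; the module is named queue on Python 3

candytray = Queue(maxsize=5)           # the tray holds at most 5 bars
for _ in range(5):
    candytray.put('candy')             # start with a full tray

def refill():
    try:
        candytray.put_nowait('candy')  # the writer only appends
    except Full:
        pass                           # tray full, skip

def buy():
    try:
        candytray.get_nowait()         # the reader only removes
    except Empty:
        pass                           # tray empty, skip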
The original code, from Wesley Chun's book Core Python Applications Programming looked like this:
def refill():
    lock.acquire()
    print 'Refilling candy...',
    try:
        candytray.release()
    except ValueError:
        print 'full, skipping'
    else:
        print 'OK'
    lock.release()

def buy():
    lock.acquire()
    print 'Buying candy...',
    if candytray.acquire(False):
        print 'OK'
    else:
        print 'empty, skipping'
    lock.release()
Without the lock, the print statements can become interwoven into unintelligible output. For example, suppose the candy tray is full. Then suppose there is a call to refill followed by a call to buy such that the lines of code get executed in this order (without the lock):
print 'Refilling candy...',      # refill() starts
print 'Buying candy...',         # buy() starts on the other thread
try:                             # refill() continues
    candytray.release()
if candytray.acquire(False):     # buy() continues
    print 'OK'
except ValueError:               # refill()'s exception handler runs last
    print 'full, skipping'
The output would look like this:
# we start with 5 candy bars (full tray)
Refilling candy... # oops... tray is full
Buying candy...
OK # So now there are 4 candy bars
full, skipping # huh?
That would not make sense, so a lock is required.
