For a web-scraping analysis I need two loops that run permanently: one returns a list of websites, updated every x minutes, while the other analyses the sites (old and new ones) every y seconds. The code below illustrates what I am trying to do, but it doesn't work. (The code has been edited to incorporate answers and my own research.)
from multiprocessing import Process
import time, random
from threading import Lock
from collections import deque
class MyQueue(object):
def __init__(self):
self.items = deque()
self.lock = Lock()
def put(self, item):
with self.lock:
self.items.append(item)
# Example pointed to in a linked answer
def get(self):
with self.lock:
return self.items.popleft()
def a(queue):
while True:
x=[random.randint(0,10), random.randint(0,10), random.randint(0,10)]
print 'send', x
queue.put(x)
time.sleep(10)
def b(queue):
try:
while queue:
x = queue.get()
            print 'receive', x
for i in x:
print i
time.sleep(2)
except IndexError:
print queue.get()
if __name__ == '__main__':
q = MyQueue()
p1 = Process(target=a, args=(q,))
p2 = Process(target=b, args=(q,))
p1.start()
p2.start()
p1.join()
p2.join()
So, this is my first Python project after an online introduction course and I am struggling here big time. I understand now that the functions don't truly run in parallel, as b does not start until a is finished (I used this answer and tinkered with the timer and while True). EDIT: Even after using the approach given in the answer, I think this is still the case, as queue.get() throws an IndexError saying the deque is empty. I can only explain that by process a never finishing, because when I print queue.get() immediately after .put(x) the deque is not empty.
I eventually want an output like this:
send [3,4,6]
3
4
6
3
4
send [3,8,6,5] # the code above always gives 3 entries, but in my project
3             # the length varies
8
6
5
3
8
6
.
.
What do I need to get two truly parallel loops, where one returns an updated list every x minutes that the other loop needs as the basis for its analysis? Is Process really the right tool here?
And where can I find good information about designing my program?
I did something a little like this a while ago. I think using Process is the correct approach, but if you want to pass data between processes then you should probably use a Queue.
https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
Create the queue first and pass it into both processes. One can write to it, the other can read from it.
One issue I remember is that the reading process will block on the queue until something is pushed to it, so you may need to push a special 'terminate' message of some kind to the queue when process 1 is done so process 2 knows to stop.
EDIT: Simple example. This doesn't include a clean way to stop the processes, but it shows how you can start two new processes and pass data from one to the other. Here b polls the queue once a second with a non-blocking get() and keeps printing the most recent list it received from a; a plain blocking get() would instead make b wait for new data before continuing.
from multiprocessing import Process, Queue
import time, random
def a(queue):
while True:
x=[random.randint(0,10), random.randint(0,10), random.randint(0,10)]
print 'send', x
queue.put(x)
time.sleep(5)
def b(queue):
x = []
while True:
time.sleep(1)
try:
x = queue.get(False)
print 'receive', x
except:
pass
for i in x:
print i
if __name__ == '__main__':
q = Queue()
p1 = Process(target=a, args=(q,))
p2 = Process(target=b, args=(q,))
p1.start()
p2.start()
p1.join()
p2.join()
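As a minimal sketch of the 'terminate' message idea mentioned above (not part of the original example; the STOP name and the fixed number of rounds are just illustrative assumptions), the producer can put a sentinel on the queue when it is done and the consumer can stop when it sees it:
from multiprocessing import Process, Queue
import time, random

STOP = None  # sentinel; assumes None never appears as a normal work item

def a(queue, rounds=3):
    for _ in range(rounds):
        x = [random.randint(0, 10) for _ in range(3)]
        print('send %s' % x)
        queue.put(x)
        time.sleep(5)
    queue.put(STOP)  # tell the consumer there is nothing more to come

def b(queue):
    while True:
        x = queue.get()  # blocks until a list (or the sentinel) arrives
        if x is STOP:
            break
        print('receive %s' % x)
        for i in x:
            print(i)

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()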
Related
I have three processes running concurrently. These processes receive data at microsecond intervals with the help of a Queue. The processes are never joined and basically live in a while loop until the program is terminated. Calculations are made on the received data, and if certain conditions are met, a variable needs to be sent back to the GUI. All three processes use the same functions.
import multiprocessing
from multiprocessing import Process, Queue

class Calc:
def __init__(self):
self.x = [0,0,0,0,0]
self.y = [0,0,0,0,0]
def doCalculations(self, x, y, i):
#do calculation stuff.
#if certain conditions are met:
#self.returnFunction(i)
def returnFunction(self, i):
return (self.x[i] - self.y[i]), i # These are two integers
def startProcesses(q: multiprocessing.Queue):
calc = Calc()
while True:
x, y, i = q.get()
calc.doCalculations(x,y,i)
def main():
q1 = Queue()
q2 = Queue()
q3 = Queue()
p1 = Process(target=startProcesses, args=(q1,))
p2 = Process(target=startProcesses, args=(q2,))
p3 = Process(target=startProcesses, args=(q3,))
p1.start()
p2.start()
p3.start()
#run() which is the structure that feeds the data to the queues.
if __name__ == '__main__':
main()
Now I want to add the GUI, which is fine, but how do I send the data from the process back to the GUI? Do I make another Queue() in the process itself? This would add three more queues because the processes don't share memory. Or is there a more elegant/simpler way to do this?
You just need to pass one queue if all you want is the result. Create one queue and pass it to all three processes. They can then add results to that queue, and you can receive them in the parent process.
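A minimal sketch of that single-queue idea (the worker logic and names here are hypothetical, not taken from the question's code): each process puts a tagged result onto one shared queue, and the parent reads from it.
from multiprocessing import Process, Queue

def worker(name, in_q, result_q):
    # Hypothetical computation: read one (x, y) pair and report the difference.
    x, y = in_q.get()
    result_q.put((name, x - y))  # tag the result so the parent knows where it came from

if __name__ == '__main__':
    result_q = Queue()
    in_queues = [Queue() for _ in range(3)]
    procs = [Process(target=worker, args=('p%d' % (i + 1), q, result_q))
             for i, q in enumerate(in_queues)]
    for p in procs:
        p.start()
    for i, q in enumerate(in_queues):
        q.put((10 * (i + 1), i))   # feed each process some data
    for _ in procs:
        print(result_q.get())      # results arrive in completion order
    for p in procs:
        p.join()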
However, if you want to identify which results came from which process, and only get those specific results, then you can use a managed dictionary (reference)
For example, create a dictionary with keys p1, p2 and p3; this is where each process will store its results. Then pass this dictionary to each process. When a process wants to return something, make it edit the value under the relevant key (process 1 edits p1, etc.; you can pass an additional string to each process that tells it which key to edit). Since this dictionary is synchronized across processes, these values will be available to the parent process as well. This way, you do not have to create three separate structures.
To create a managed dictionary:
from multiprocessing import Manager
if __name__ == "__main__":
manager = Manager()
d = manager.dict({'p1': None, 'p2': None, 'p3': None})
Be sure to close the manager using manager.shutdown() so it gets garbage collected as well.
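A minimal sketch of the managed-dictionary approach (the worker computation here is hypothetical): each process writes its result under its own key, and the parent reads the dictionary after joining.
from multiprocessing import Manager, Process

def worker(key, shared):
    # Hypothetical computation; each process writes only to its own key.
    shared[key] = sum(range(10))

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict({'p1': None, 'p2': None, 'p3': None})
    procs = [Process(target=worker, args=(key, d)) for key in ('p1', 'p2', 'p3')]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(d))      # e.g. {'p1': 45, 'p2': 45, 'p3': 45}
    manager.shutdown()  # release the manager process when done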
import multiprocessing
import time
def WORK(x,q,it):
for i in range(it):
t = x + '---'+str(i)
q.put(t)
def cons(q,cp):
while not q.empty():
cp.append(q.get())
return q.put(cp)
if __name__ == '__main__':
cp = []
    it = 600  # iterations
start = time.perf_counter()
q = multiprocessing.Queue()
p1 = multiprocessing.Process(target = WORK, args = ('n',q,it))
p2 = multiprocessing.Process(target=WORK, args=('x',q,it))
p3 = multiprocessing.Process(target=cons, args=(q,cp,))
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
print(q.get())
end = time.perf_counter()
print(end - start)
I encountered a problem running this code in PyCharm and Colab. In Colab it works fine only with 1000 iterations or fewer in the WORK() process; with more, it freezes.
In PyCharm it works fine only with 500 iterations or fewer.
What is the problem? Are there any limitations?
A workaround I found, though not a very good one, is to remove the join calls or to move p3.join() after the q.get() call. That raised the limit: with the code below it started working with 1000 iterations in PyCharm, but with 10000 iterations it deadlocks again:
p1.join()
p2.join()
print(q.get())
p3.join()
end = time.perf_counter()
print(end - start)
A further change helped me increase the iteration limit to 10000: adding a maxsize to the queue:
q = multiprocessing.Queue(maxsize = 1000)
So what are the limitations and rules governing these queues?
And how do you manage an endless queue, for example one fed by websockets that send data continuously?
You have several issues with your code. First, according to the documentation on multiprocessing.Queue, the empty method is not reliable, so the statement while not q.empty(): in function cons is problematic. But even if Queue.empty were reliable, you would still have a race condition: you have started the WORK and cons processes in parallel, where the former writes elements to a queue and the latter reads until it finds the queue empty. If cons runs before WORK gets to write its first element, it will find the queue immediately empty, which is not your expected result. And, as I mentioned in my comment above, you must not try to join a process that is writing to a queue before you have retrieved all of the records that process has written.
Another problem you have is you are passing to cons an empty list cp to which you keep on appending. But cons is a function belonging to a process running in a different address space and consequently the cp list it is appending to is not the same cp list as in the main process. Just be aware of this.
Finally, cons is writing its result to the same queue that it is reading from and consequently the main process is reading this result from that same queue. So we have another race condition: Once the main process has been modified not to read from this queue until after it has joined all the processes, the main process and cons are now both reading from the same queue in parallel. We now need a separate input and output queue so that there is no conflict. That solves this race condition.
To solve the first race condition, the WORK process should write a special sentinel record that serves as an end-of-records indicator. It could be the value None if None is not a valid normal record, or it could be any special object that cannot be mistaken for an actual record. Since we have two processes writing records to the same input queue for cons to read, we will end up with two sentinel records, which cons will have to look for to know that there are truly no more records left.
import multiprocessing
import time
SENTINEL = 'SENTINEL' # or None
def WORK(x, q, it):
for i in range(it):
t = x + '---' + str(i)
q.put(t)
q.put(SENTINEL) # show end of records
def cons(q_in, q_out, cp):
# We now are looking for two end of record indicators:
for record in iter(q_in.get, SENTINEL):
cp.append(record)
for record in iter(q_in.get, SENTINEL):
cp.append(record)
q_out.put(cp)
if __name__ == '__main__':
    it = 600  # iterations
start = time.perf_counter()
q_in = multiprocessing.Queue()
q_out = multiprocessing.Queue()
p1 = multiprocessing.Process(target=WORK, args = ('n', q_in, it))
p2 = multiprocessing.Process(target=WORK, args=('x', q_in, it))
cp = []
p3 = multiprocessing.Process(target=cons, args=(q_in, q_out, cp))
p1.start()
p2.start()
p3.start()
cp = q_out.get()
print(len(cp))
p1.join()
p2.join()
p3.join()
end = time.perf_counter()
print(end - start)
Prints:
1200
0.1717168
Here is a simple example.
I am trying to find the maximum element in an array sorted in increasing order that contains only positive integers. I want to run two algorithms, find_max_1 and find_max_2, in parallel; the whole program should terminate when one algorithm returns a result.
def find_max_1(array):
# method 1, just return the last element
return array[len(array)-1]
def find_max_2(array):
# method 2
    solution = array[0]
    for n in array:
        solution = max(solution, n)
return solution
if __name__ == '__main__':
# Two algorithms run in parallel, when one returns a result, the whole program stop
pass
I hope I explained my question clearly and correctly. I found that I can use an Event and terminate() in multiprocessing: all processes are terminated when event.is_set() is true.
import multiprocessing
import sys
import time

def find_max_1(array, event):
# method 1, just return the last element
event.set()
return array[len(array)-1]
def find_max_2(array, event):
# method 2
    solution = array[0]
    for n in array:
        solution = max(solution, n)
event.set()
return solution
if __name__ == '__main__':
# Two algorithms run in parallel, when one returns a result, the whole program stop
event = multiprocessing.Event()
array = [1, 2, 3, 4, 5, 6, 7, 8, 9... 1000000007]
p1 = multiprocessing.Process(target=find_max_1, args=(array, event,))
p2 = multiprocessing.Process(target=find_max_2, args=(array, event,))
jobs = [p1, p2]
p1.start()
p2.start()
while True:
if event.is_set():
for p in jobs:
p.terminate()
sys.exit(1)
time.sleep(2)
But this is not efficient. Is there a faster implementation to solve it? Thank you very much!
Whatever you are doing, you are creating zombie processes. In Python, the multiprocessing library works a bit confusingly. If you want to terminate a process, make sure you join it afterwards. The Python multiprocessing programming guidelines state it clearly:
Joining zombie processes
On Unix when a process finishes but has not been joined it becomes a zombie. There should never be very many because each time a new process starts (or active_children() is called) all completed processes which have not yet been joined will be joined. Also calling a finished process’s Process.is_alive will join the process. Even so it is probably good practice to explicitly join all the processes that you start.
So consider calling join() on a process after you terminate() it.
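A minimal sketch of that advice (slow_worker is just a stand-in for the question's find_max_* functions): wait for the event, terminate whatever is still running, then join every process so none is left as a zombie. Using event.wait() instead of a polling loop also avoids the sleep-and-check overhead.
import multiprocessing
import time

def slow_worker(event):
    # Stand-in for one of the find_max_* functions.
    time.sleep(1)
    event.set()

if __name__ == '__main__':
    event = multiprocessing.Event()
    jobs = [multiprocessing.Process(target=slow_worker, args=(event,))
            for _ in range(2)]
    for p in jobs:
        p.start()
    event.wait()        # block until one worker signals completion
    for p in jobs:
        p.terminate()   # stop any worker that is still running
    for p in jobs:
        p.join()        # reap the processes so no zombies are left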
The following code starts three processes, they are in a pool to handle 20 worker calls:
import multiprocessing
def worker(nr):
print(nr)
numbers = [i for i in range(20)]
if __name__ == '__main__':
multiprocessing.freeze_support()
pool = multiprocessing.Pool(processes=3)
results = pool.map(worker, numbers)
pool.close()
pool.join()
Is there a way to start the processes in a sequence (as opposed to having them starting all at the same time), with a delay inserted between each process start?
If not using a Pool I would have used multiprocessing.Process(target=worker, args=(nr,)).start() in a loop, starting them one after the other and inserting the delay as needed. I find Pool to be extremely useful, though (together with the map call) so I would be glad to keep it if possible.
According to the documentation, no such control over pooled processes exists. You could, however, simulate it with a lock:
import multiprocessing
import time
lock = multiprocessing.Lock()
def worker(nr):
lock.acquire()
time.sleep(0.100)
lock.release()
print(nr)
numbers = [i for i in range(20)]
if __name__ == '__main__':
multiprocessing.freeze_support()
pool = multiprocessing.Pool(processes=3)
results = pool.map(worker, numbers)
pool.close()
pool.join()
Your 3 processes will still start simultaneously; that is, you don't have control over which process starts executing the callback first. But at least you get your delay: each worker effectively "starts" (really, continues) at a designated interval.
Amendment resulting from the discussion below:
Note that on Windows it's not possible to inherit a lock from a parent process. Instead, you can use multiprocessing.Manager().Lock() to communicate a global lock object between processes (with additional IPC overhead, of course). The global lock object needs to be initialized in each process, as well. This would look like:
from multiprocessing import Process, freeze_support
import multiprocessing
import time
from datetime import datetime as dt
def worker(nr):
glock.acquire()
print('started job: {} at {}'.format(nr, dt.now()))
time.sleep(1)
glock.release()
print('ended job: {} at {}'.format(nr, dt.now()))
numbers = [i for i in range(6)]
def init(lock):
global glock
glock = lock
if __name__ == '__main__':
multiprocessing.freeze_support()
lock = multiprocessing.Manager().Lock()
pool = multiprocessing.Pool(processes=3, initializer=init, initargs=(lock,))
results = pool.map(worker, numbers)
pool.close()
pool.join()
Couldn't you do something simple like this:
from multiprocessing import Process
from time import sleep
def f(n):
    print('started job: ' + str(n))
sleep(3)
    print('ended job: ' + str(n))
if __name__ == '__main__':
for i in range(0,100):
p = Process(target=f, args=(i,))
p.start()
sleep(1)
Result
started job: 0
started job: 1
started job: 2
ended job: 0
started job: 3
ended job: 1
started job: 4
ended job: 2
started job: 5
Could you try defining a function that yields your values slowly?
def get_numbers_on_delay(numbers, delay):
for i in numbers:
yield i
time.sleep(delay)
and then:
results = pool.map(worker, get_numbers_on_delay(numbers, 5))
I haven't tested it, so I'm not sure, but give it a shot.
I couldn't get the locking answer to work for some reason, so I implemented it this way.
I realize the question is old, but maybe someone else has the same problem.
It spawns all the processes like the locking solution does, but sleeps before doing its work, based on the number in the process name.
from multiprocessing import current_process
from re import search
from time import sleep
def worker():
    process_number = search(r'\d+', current_process().name).group()
time_between_workers = 5
sleep(time_between_workers * int(process_number))
#do your work here
Since the names given to the processes seem to be unique and incremental, this grabs the number of the process and sleeps based on that.
SpawnPoolWorker-1 sleeps 1 * 5 seconds, SpawnPoolWorker-2 sleeps 2 * 5 seconds etc.
I am about to start on an endeavour with Python. The goal is to multithread different tasks and use queues to communicate between them. For the sake of clarity, I would like to be able to pass a queue to a sub-function and send information to the queue from there. So, something like this:
from queue import Queue
from threading import Thread
import copy
# Object that signals shutdown
_sentinel = object()
# increment function
def increment(i, out_q):
i += 1
print(i)
out_q.put(i)
return
# A thread that produces data
def producer(out_q):
i = 0
while True:
# Produce some data
increment( i , out_q)
if i > 5:
out_q.put(_sentinel)
break
# A thread that consumes data
def consumer(in_q):
while True:
# Get some data
data = in_q.get()
# Process the data
# Check for termination
if data is _sentinel:
in_q.put(_sentinel)
break
# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()
# Wait for all produced items to be consumed
q.join()
Currently the output is a row of 0's, where I would like it to be the numbers 1 to 6. I have read about the difficulty of passing references in Python, but would like to know: is this just not possible in Python, or am I looking at this issue wrongly?
The problem has nothing to do with the way the queues are passed; you're doing that right. The issue is actually related to how you're trying to increment i. Because variables in Python are passed by assignment, you have to actually return the incremented value of i back to the caller for the change you made inside increment to have any effect. Otherwise, you just rebind the local variable i inside increment, and it gets thrown away when increment completes.
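As a tiny illustration of that rebinding (just a sketch, not from the original post):
def increment(i):
    i += 1   # rebinds the local name i; the caller's variable is untouched
    return i

i = 0
increment(i)       # return value ignored
print(i)           # still 0
i = increment(i)   # rebind the caller's i to the returned value
print(i)           # 1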
You can also simplify your consumer function a bit by using the iter built-in, along with a for loop, to consume from the queue until _sentinel is reached, rather than a while True loop:
from queue import Queue
from threading import Thread
import copy
# Object that signals shutdown
_sentinel = object()
# increment function
def increment(i):
i += 1
return i
# A thread that produces data
def producer(out_q):
i = 0
while True:
# Produce some data
i = increment( i )
print(i)
out_q.put(i)
if i > 5:
out_q.put(_sentinel)
break
# A thread that consumes data
def consumer(in_q):
for data in iter(in_q.get, _sentinel):
# Process the data
pass
# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()
Output:
1
2
3
4
5
6