I am trying to write values to a dictionary every 5 seconds for 1 minute. I then want to take those values, put them into a dataframe, write it to csv, clear the original dictionary, and keep going.
import time
import random
import pandas as pd
from multiprocessing import Process

a = {'value': [], 'timeStamp': []}

def func1():
    global a
    print "starting First Function"
    a['value'].append(random.randint(1,101))
    a['timeStamp'].append(time.time()*1000.0)
    time.sleep(5)
    return a
def func2():
    print "starting Second Function"
    time.sleep(60)
    d = pd.DataFrame(a)
    print d
    # here I would write out the df to csv and del d
    a.update({}.fromkeys(a, 0))
    print "cleared"

if __name__ == '__main__':
    while True:
        p1 = Process(target=func1)
        p1.start()
        p2 = Process(target=func2)
        p2.start()
        p1.join()
        p2.join()
        print "test"
        print a
This is where I'm at now, which may or may not be the correct way to do this. Regardless, this code is not giving me the correct results. I am trying to figure out the best way to get the dict into the df and clear it. Hopefully someone has done something similar?
Processes do not share memory - each function modifies a separate a. Therefore, changes are not seen across functions and the main process.
To share memory between your functions, use the threading module instead. You can test this in your example by replacing Process with Thread:
from threading import Thread as Process
This allows you to run your example unchanged otherwise.
Note that threading in Python is limited by the Global Interpreter Lock. Threads run concurrently, but not in parallel: only one thread executes Python bytecode at a time. Blocking calls such as time.sleep and extensions such as the data structures underlying pandas can sidestep this, however.
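As a quick check that sleeping threads really do overlap, here is a small sketch (independent of your code) that times two one-second sleeps:

import time
from threading import Thread

def nap():
    time.sleep(1)   # time.sleep releases the GIL while waiting

start = time.time()
threads = [Thread(target=nap) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(time.time() - start)   # roughly 1 second, not 2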
Your code has so many problems that it is hardly usable as it stands. You may start your research with something like this (Python 3, threads instead of processes):
import time
import random
import threading
def func1(a):
    print("starting First Function")
    for dummy in range(10):
        a['value'].append(random.randint(1, 101))
        a['timeStamp'].append(time.time() * 1000.0)
        time.sleep(1)
    print("stopping First Function")

def func2(a):
    print("starting Second Function")
    for dummy in range(2):
        time.sleep(5)
        print(a)
        a['value'] = list()
        a['timeStamp'] = list()
        print("cleared")
    print('stopping Second Function')

if __name__ == '__main__':
    a = {'value': list(), 'timeStamp': list()}
    t1 = threading.Thread(target=func1, args=(a,))
    t1.start()
    t2 = threading.Thread(target=func2, args=(a,))
    t2.start()
The output is:
starting First Function
starting Second Function
{'value': [32, 95, 2, 71, 65], 'timeStamp': [1536244351577.3914, 1536244352584.13, 1536244353586.6367, 1536244354589.3767, 1536244355591.9202]}
cleared
{'value': [43, 44, 28, 69, 25], 'timeStamp': [1536244356594.6294, 1536244357597.2498, 1536244358599.9812, 1536244359602.9592, 1536244360605.9316]}
cleared
stopping Second Function
stopping First Function
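If func2 should actually write each batch out instead of printing it, a minimal sketch of that part, assuming pandas is installed and using a made-up file name pattern, could look like this:

import pandas as pd

def func2(a):
    print("starting Second Function")
    for batch in range(2):
        time.sleep(5)
        d = pd.DataFrame(a)                              # snapshot the shared dict
        d.to_csv('batch_%d.csv' % batch, index=False)    # hypothetical file name
        a['value'] = list()
        a['timeStamp'] = list()
        print("cleared")
    print('stopping Second Function')

Note that func1 could append a value between the DataFrame snapshot and the clearing of the lists, and that value would be lost; guarding both steps with a threading.Lock would close that gap.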
Related
So I have a JSON user database. I want to check whether each user has a valid id and, if not, remove that user from the database. I am using threads for this, but each thread starts from the beginning of the database, which I don't want.
Example: if thread A starts from 10, then thread B should start from 20. Also, when thread A ends, I want it to continue from 30 instead of 20.
I am a beginner, so a detailed guide & explanation would be great!
Thanks for your help.
Here is an example:
import threading
import time
import typing

MAX_NUMBER = 57  # assumed to be inclusive
JOB_SIZE = 10

indexes = tuple(
    tuple(range(0, MAX_NUMBER + 1, JOB_SIZE)) + (MAX_NUMBER + 1,)
)
jobs_spans = tuple(zip(indexes, indexes[1:]))  # cf https://stackoverflow.com/a/21303286/11384184
print(jobs_spans)
# ((0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 58))

jobs_left = list(jobs_spans)  # is thread-safe thanks to the GIL

def process_user(user_id: int) -> None:
    sleep_duration = ((user_id // JOB_SIZE) % 3) * 0.4 + 0.1  # just to add some variance to each job
    time.sleep(sleep_duration)

def process_users() -> typing.NoReturn:
    while True:
        try:
            job = jobs_left.pop()
        except IndexError:
            break  # no job left
        else:
            print(f"{threading.current_thread().name!r} processing users from {job[0]} to {job[1]} (exclusive) ...")
            for user_id in range(job[0], job[1]):
                process_user(user_id)
                print(f"user {user_id} processed")
    print(f"{threading.current_thread().name!r} finished")

if __name__ == "__main__":
    thread1 = threading.Thread(target=process_users)
    thread2 = threading.Thread(target=process_users)
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
I started by computing the spans that the jobs will cover, using only the number of users and the size of each job.
I use it to define a queue of jobs left. It is actually a list that the threads pop jobs from.
I have two different functions:
one to process a user given its id, which has nothing to do with threading; I could use it in exactly the same way in a completely sequential program
one to handle the threading. It is the target of the threads, i.e. the code that each thread executes once it is started. It is a loop that keeps taking a new job until there are none left.
I join each thread to wait for its completion before the script exits.
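If you would rather not rely on list.pop() being made atomic by the GIL, the job list can be replaced by the standard queue.Queue, which is explicitly thread-safe. A small sketch of that variant, reusing jobs_spans, process_user and threading from the code above:

import queue

jobs_left = queue.Queue()
for span in jobs_spans:
    jobs_left.put(span)

def process_users() -> None:
    while True:
        try:
            job = jobs_left.get_nowait()   # non-blocking; raises queue.Empty once the queue is drained
        except queue.Empty:
            break
        print(f"{threading.current_thread().name!r} processing users from {job[0]} to {job[1]} (exclusive) ...")
        for user_id in range(job[0], job[1]):
            process_user(user_id)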
If you don't have time to understand the code in the original answer, you can use this. It's easy & small.
Original Source
from multiprocessing.dummy import Pool as ThreadPool
# Make the Pool of workers
pool = ThreadPool(4)
# Execute function in their own Threads
results = pool.map(func, arg)
# Close the pool and wait for the work to finish
pool.close()
pool.join()
func is the function you want to execute.
arg is the iterable of arguments; pool.map calls func once for each item in it.
Example:
names = ["John", "David", "Bob"]

def printNames(name):
    print(name)

results = pool.map(printNames, names)
It will print all the names from the names list using the printNames function.
Here the function arg is names.
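Put together, a complete runnable version of the name-printing example could look like this (it only combines the snippets above; nothing new is added):

from multiprocessing.dummy import Pool as ThreadPool

names = ["John", "David", "Bob"]

def printNames(name):
    print(name)

pool = ThreadPool(4)                   # pool of 4 worker threads
results = pool.map(printNames, names)  # each name is passed to printNames in a worker thread
pool.close()                           # no more work will be submitted
pool.join()                            # wait for all workers to finish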
Links
Multiprocessing (python.org, geeksforgeeks.org)
Functions (w3schools.com)
Here is a simple example.
I am trying to find the maximum element in an increasing array which only contains positive integers. I want to let two algorithms, find_max_1 and find_max_2, run in parallel, and the whole program should terminate as soon as one algorithm returns a result.
def find_max_1(array):
    # method 1, just return the last element
    return array[len(array)-1]

def find_max_2(array):
    # method 2, linear scan
    solution = array[0]
    for n in array:
        solution = max(solution, n)
    return solution

if __name__ == '__main__':
    # Two algorithms run in parallel; when one returns a result, the whole program stops
    pass
I hope I explained my question clearly and correctly. I found that I can use an Event and terminate in multiprocessing: all processes are terminated once event.is_set() is true.
import multiprocessing
import sys
import time

def find_max_1(array, event):
    # method 1, just return the last element
    event.set()
    return array[len(array)-1]

def find_max_2(array, event):
    # method 2, linear scan
    solution = array[0]
    for n in array:
        solution = max(solution, n)
    event.set()
    return solution

if __name__ == '__main__':
    # Two algorithms run in parallel; when one returns a result, the whole program stops
    event = multiprocessing.Event()
    array = [1, 2, 3, 4, 5, 6, 7, 8, 9... 1000000007]
    p1 = multiprocessing.Process(target=find_max_1, args=(array, event,))
    p2 = multiprocessing.Process(target=find_max_2, args=(array, event,))
    jobs = [p1, p2]
    p1.start()
    p2.start()
    while True:
        if event.is_set():
            for p in jobs:
                p.terminate()
            sys.exit(1)
        time.sleep(2)
But this is not efficient. Is there a faster implementation? Thank you very much!
Whatever you are doing, you are creating zombie processes. In Python, the multiprocessing library can be a bit confusing: if you want to terminate a process, make sure you join it afterwards. The Python multiprocessing programming guidelines state it clearly.
Joining zombie processes
On Unix when a process finishes but has not been joined it becomes a zombie. There should never be very many because each time a new process starts (or active_children() is called) all completed processes which have not yet been joined will be joined. Also calling a finished process’s Process.is_alive will join the process. Even so it is probably good practice to explicitly join all the processes that you start.
So consider calling join() after terminate().
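Applied to the loop in the question, that means joining each process right after terminating it, roughly like this (a sketch that reuses the question's event and jobs variables and exits the loop with break instead of sys.exit):

while True:
    if event.is_set():
        for p in jobs:
            p.terminate()
        for p in jobs:
            p.join()       # reap the terminated processes so they do not linger as zombies
        break
    time.sleep(2)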
I'm new to multiprocessing and want to collect data in one function and write data in another function simultaneously. Here is pseudocode of what I have.
def Read_Data():
    for num in range(0,5):
        ## Read 5 random values
        print('Reading ' + str(values[num]))
    return(values)

def Write_data(values):
    ## Arrange the random values in ascending order
    for num in range(0,5):
        print('Writing' + str(arranged_values[num]))

if __name__ == '__main__':
    values = Read_Data()
    Write_data(values)
I want the output to look like this.
reading 1, writing none
reading 3, writing none
reading 5, writing none
reading 4, writing none
reading 2, writing none
reading 7, writing 1
reading 8, writing 2
reading 10, writing 3
reading 9, writing 4
reading 6, writing 5
Now, the reason I want it to run in parallel is to make sure I'm collecting data all the time and not losing data while I'm modifying and printing.
How can I do it using multiprocessing?
This should illustrate a few concepts. The queue is used to pass objects between the processes.
The reader simply gets its value somewhere and places it on the queue.
The writer listens on the queue forever.
Adding a "TERMINATE" signal is a very simple way of telling the writer to stop listening forever (there are other, more effective ways using signals and events, but this just illustrates the concept).
At the end we "join" the two processes to make sure they exit before we exit the main process (otherwise they are left hanging in space and time).
from multiprocessing import Process, Queue
from time import sleep

def reader(q):
    for i in range(10):
        print("Reading", i)
        q.put(i)
        sleep(1)
    print("Reading TERMINATED")
    q.put("TERMINATE")

def writer(q):
    while True:
        i = q.get()
        if i == "TERMINATE":
            print("Writer TERMINATED")
            break
        print("Writing", i)

if __name__ == '__main__':
    q = Queue()
    pr = Process(target=reader, args=(q,))
    pw = Process(target=writer, args=(q,))
    pw.start()
    pr.start()
    pw.join()
    pr.join()
For a web-scraping analysis I need two loops that run permanently: one returns a list of websites updated every x minutes, while the other analyses the sites (old and new ones) every y seconds. This code construction exemplifies what I am trying to do, but it doesn't work. (The code has been edited to incorporate the answers and my own research.)
from multiprocessing import Process
import time, random
from threading import Lock
from collections import deque

class MyQueue(object):
    def __init__(self):
        self.items = deque()
        self.lock = Lock()

    def put(self, item):
        with self.lock:
            self.items.append(item)

    # Example pointed at in [this][1] answer
    def get(self):
        with self.lock:
            return self.items.popleft()

def a(queue):
    while True:
        x = [random.randint(0,10), random.randint(0,10), random.randint(0,10)]
        print 'send', x
        queue.put(x)
        time.sleep(10)

def b(queue):
    try:
        while queue:
            x = queue.get()
            print 'receive', x
            for i in x:
                print i
                time.sleep(2)
    except IndexError:
        print queue.get()

if __name__ == '__main__':
    q = MyQueue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
So, this is my first Python project after an online introduction course and I am struggling here big time. I understand now that the functions don't truly run in parallel, as b does not start until a is finished (I used this answer and tinkered with the timer and while True). EDIT: Even after using the approach given in the answer, I think this is still the case, as queue.get() throws an IndexError saying the deque is empty. I can only explain that by process a not finishing, because when I print queue.get()
immediately after .put(x) it is not empty.
I eventually want an output like this:
send [3,4,6]
3
4
6
3
4
send [3,8,6,5] #the code above always gives 3 entries, but in my project
3 #the length varies
8
6
5
3
8
6
.
.
What do I need in order to have two truly parallel loops, where one returns an updated list every x minutes which the other loop needs as the basis for its analysis? Is Process really the right tool here?
And where can I get good info about designing my program?
I did something a little like this a while ago. I think using the Process is the correct approach, but if you want to pass data between processes then you should probably use a Queue.
https://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
Create the queue first and pass it into both processes. One can write to it, the other can read from it.
One issue I remember is that the reading process will block on the queue until something is pushed to it, so you may need to push a special 'terminate' message of some kind to the queue when process 1 is done so process 2 knows to stop.
EDIT: Simple example. This doesn't include a clean way to stop the processes. But it shows how you can start 2 new processes and pass data from one to the other. Function b polls the queue with a non-blocking get(False) and keeps printing the last list it received from a until a new one arrives.
from multiprocessing import Process, Queue
import time, random

def a(queue):
    while True:
        x = [random.randint(0,10), random.randint(0,10), random.randint(0,10)]
        print 'send', x
        queue.put(x)
        time.sleep(5)

def b(queue):
    x = []
    while True:
        time.sleep(1)
        try:
            x = queue.get(False)
            print 'receive', x
        except:
            pass
        for i in x:
            print i

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=a, args=(q,))
    p2 = Process(target=b, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
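One way to add the clean shutdown mentioned above is the special 'terminate' message: after some number of updates, a puts a sentinel on the queue and b breaks out of its loop when it sees it. Here is a rough sketch of that change to a and b (the run count of 3 is arbitrary, and the "repeat the last list every second" behaviour of the version above is dropped to keep it short):

SENTINEL = 'STOP'

def a(queue):
    for _ in range(3):                     # arbitrary number of updates for this sketch
        x = [random.randint(0, 10) for _ in range(3)]
        print('send %s' % x)
        queue.put(x)
        time.sleep(5)
    queue.put(SENTINEL)                    # tell b there is nothing more to come

def b(queue):
    while True:
        x = queue.get()                    # blocks until a sends something
        if x == SENTINEL:
            break
        print('receive %s' % x)
        for i in x:
            print(i)

With this change both processes finish on their own, so the join() calls in __main__ return and the script exits cleanly.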
I am about to start on an endeavour with Python. The goal is to multithread different tasks and use queues to communicate between tasks. For the sake of clarity, I would like to be able to pass a queue to a sub-function and send information to the queue from there. Something like this:
from queue import Queue
from threading import Thread
import copy

# Object that signals shutdown
_sentinel = object()

# increment function
def increment(i, out_q):
    i += 1
    print(i)
    out_q.put(i)
    return

# A thread that produces data
def producer(out_q):
    i = 0
    while True:
        # Produce some data
        increment( i , out_q)
        if i > 5:
            out_q.put(_sentinel)
            break

# A thread that consumes data
def consumer(in_q):
    while True:
        # Get some data
        data = in_q.get()
        # Process the data
        # Check for termination
        if data is _sentinel:
            in_q.put(_sentinel)
            break

# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()

# Wait for all produced items to be consumed
q.join()
Currently the output is a row of 0's, where I would like it to be the numbers 1 to 6. I have read about the difficulty of passing references in Python, but I would like to clarify: is this just not possible in Python, or am I looking at this issue wrongly?
The problem has nothing to do with the way the queues are passed; you're doing that right. The issue is actually related to how you're trying to increment i. Because variables in Python are passed by assignment, you have to actually return the incremented value of i back to the caller for the change you made inside increment to have any effect. Otherwise, you just rebind the local variable i inside increment, and that binding is thrown away when increment completes.
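Here is a tiny demonstration of that rebinding, with no threading involved (bump and n are made-up names just for this illustration):

def bump(i):
    i += 1          # rebinds the local name i; the caller's variable is untouched
    return i

n = 0
bump(n)             # return value discarded, so n is still 0
n = bump(n)         # caller reassigns, so n is now 1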
You can also simplify your consumer a bit by using the iter built-in function, along with a for loop, to consume from the queue until _sentinel is reached, rather than a while True loop:
from queue import Queue
from threading import Thread
import copy

# Object that signals shutdown
_sentinel = object()

# increment function
def increment(i):
    i += 1
    return i

# A thread that produces data
def producer(out_q):
    i = 0
    while True:
        # Produce some data
        i = increment( i )
        print(i)
        out_q.put(i)
        if i > 5:
            out_q.put(_sentinel)
            break

# A thread that consumes data
def consumer(in_q):
    for data in iter(in_q.get, _sentinel):
        # Process the data
        pass

# Create the shared queue and launch both threads
q = Queue()
t1 = Thread(target=consumer, args=(q,))
t2 = Thread(target=producer, args=(q,))
t1.start()
t2.start()
Output:
1
2
3
4
5
6