I just started out with Python so please bear with me.
My code looks something like this right now (simplified)
lst = []

def func1():
    while True:
        **doing some stuff with selenium, performing some operations on lst**
        **I never break the loop**

def func2():
    while True:
        **doing some stuff with selenium, performing some operations on lst**
        **I never break the loop**
So far so good. However, I need both functions to run simultaneously while operating on the same list and exchanging data through it. For example, func1 might append something to lst, then func2 might remove something from lst, then func1 might remove something, etc. Both functions need to run indefinitely, so the infinite loops don't make it any easier.
I read a little about multithreading, but from my understanding multithreading doesn't really run in parallel, so my code would just get executed more slowly. That's simply not an option. I also read that multithreading and Selenium aren't exactly a match made in heaven.
So, how can I achieve this? I need both functions to be able to perform operations on my list while running simultaneously indefinitely.
I could also use some help on the Multiprocessing stuff. Mapping, pools, queues... I don't even know where to start.
I really need your help guys and I would very much appreciate it.
Additional information (I don't really know if it matters): all of this is being run on a Windows machine using Python 2.7 and Selenium and Chromedriver.
Use a shared list proxy and a lock to sync the lst between processes.
Pseudo code:
import multiprocessing as mp

def func1(lst, lock):
    while True:
        lock.acquire()
        # **doing some stuff with selenium, performing some operations on lst**
        lock.release()
        # **I never break the loop**

def func2(lst, lock):
    while True:
        lock.acquire()
        # **doing some stuff with selenium, performing some operations on lst**
        lock.release()
        # **I never break the loop**

if __name__ == '__main__':  # required on Windows when spawning processes
    lst = mp.Manager().list()
    lock = mp.Lock()

    p1 = mp.Process(target=func1, args=(lst, lock))
    p2 = mp.Process(target=func2, args=(lst, lock))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Note that the items in lst should ideally be scalars; the manager proxy only notices changes made through the proxy itself, not in-place mutations of nested objects (roughly a shallow copy for the sync between processes). If lst contains other types of elements such as lists, dicts, or objects, you have to reassign the modified element back to lst after every operation.
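For example, here is a minimal sketch of that reassignment pattern (the nested dict and the 'count' key are just illustrative, not from the original code):

import multiprocessing as mp

def worker(lst, lock):
    with lock:
        item = lst[0]        # indexing a manager list returns a copy, not a live proxy
        item['count'] += 1   # mutating only the copy would be lost
        lst[0] = item        # reassign so the manager propagates the change

if __name__ == '__main__':
    manager = mp.Manager()
    lst = manager.list([{'count': 0}])
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(lst, lock)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(lst[0])  # {'count': 4}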
Related
Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A

# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?
If you want concurrency, here's a very simple example:
from multiprocessing import Process

def loop_a():
    while 1:
        print("a")

def loop_b():
    while 1:
        print("b")

if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
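For example, a minimal sketch of sending results back through a multiprocessing.Queue (the worker and the payload here are only placeholders) could look like this:

from multiprocessing import Process, Queue

def loop_a(q):
    for i in range(5):
        q.put(('a', i))  # send a result back to the parent process
    q.put(None)          # sentinel: tell the parent this worker is done

if __name__ == '__main__':
    q = Queue()
    p = Process(target=loop_a, args=(q,))
    p.start()
    while True:
        item = q.get()
        if item is None:
            break
        print(item)
    p.join()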
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to instantiate, but they offer true parallelism.
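For comparison, a thread-based version of the same sketch (same placeholder loops) would only differ in the class used; just remember to call start(), not run():

from threading import Thread

def loop_a():
    while 1:
        print("a")

def loop_b():
    while 1:
        print("b")

if __name__ == '__main__':
    ta = Thread(target=loop_a)
    tb = Thread(target=loop_b)
    ta.start()  # start() runs the target in a new thread; run() would block the current one
    tb.start()
    ta.join()
    tb.join()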
There are many possible options for what you want:
use loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):
    # use xrange instead of range
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB can only run after taskA (or vice versa); they can't run simultaneously.
multiprocess
Another thought would be: run two processes at the same time. Python provides the multiprocessing library; the following is a simple example:
from multiprocessing import Process

p1 = Process(target=taskA, args=args, kwargs=kwargs)
p2 = Process(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Merits: tasks can run simultaneously in the background; you can control the tasks (end or stop them, etc.); tasks can exchange data and can be synchronized if they compete for the same resources; etc.
Drawbacks: too heavy! The OS will frequently switch between them, and each has its own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
threading
threading is like a process, just lightweight; check out this post. Their usage is quite similar:
import threading

p1 = threading.Thread(target=taskA, args=args, kwargs=kwargs)
p2 = threading.Thread(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
coroutines
Libraries like greenlet and gevent provide something called coroutines, which are supposed to be faster than threading; a minimal sketch follows after the merits and drawbacks below.
Merits: more flexible and lightweight.
Drawbacks: extra library needed, learning curve.
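For reference, a minimal cooperative sketch with gevent (assuming gevent is installed; the print loops are just placeholders) might look like this:

import gevent

def task_a():
    while True:
        print("a")
        gevent.sleep(0)  # yield control so the other greenlet can run

def task_b():
    while True:
        print("b")
        gevent.sleep(0)

# spawn both greenlets and wait for them (here they never finish)
gevent.joinall([gevent.spawn(task_a), gevent.spawn(task_b)])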
Why do you want to run the two processes at the same time? Is it because you think they will go faster (there is a good chance that they won't)? Why not run the tasks in the same loop, e.g.
for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads - see the python threading module. However threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the python multiprocessing module. If both tasks are CPU intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP etc., but without knowing more about Task A and Task B and why they need to be run at the same time it is impossible to give a more specific answer.
from threading import Thread

def loopA():
    for i in range(10000):
        pass  # Do task A

def loopB():
    for i in range(10000):
        pass  # Do task B

threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
threadA.start()  # use start(), not run(); run() would execute in the current thread
threadB.start()

# Do work independent of loopA and loopB

threadA.join()
threadB.join()
You could use threading or multiprocessing.
How about a single loop: for i in range(10000): do Task A, then do Task B? Without more information I don't have a better answer.
I find that using the "pool" submodule within "multiprocessing" works amazingly well for executing multiple processes at once within a Python script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool

def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array  # "for example"

result_array = np.zeros("some shape")  # This is normally populated by 1 loop, let's try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...)  # Arguments to be passed into desired function.
multiple_results = []
for i in range(processes):  # Submits each evaluation asynchronously w/ option (1-4 in this case).
    multiple_results.append(pool.apply_async(desired_function, (i+1,) + args))
results = np.array([res.get() for res in multiple_results])  # Retrieves results after every worker is finished!
for i in range(processes):
    result_array = result_array + results[i]  # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!
This might have been asked here already, but I couldn't come up with the right keywords to search for.
I have an array that I would like to split into chunks and hand to threads, so that each thread does some work on its slice and dumps the result.
However, I need to reassemble the results from each thread in order.
I tried passing a lock to each thread so it could lock and dump its result into another array, but the order is not correct. I assume that's because each thread completes at a different time.
What would be the best way to do this in Python 3?
import threading
import numpy as np
from queue import Queue

def add(lock, work):
    value = 0
    for v in work:
        pass  # Do some work!
    lock.acquire()
    result.append(value)
    lock.release()

a = np.arange(0, 100)
result = []
lock = threading.Lock()
q = Queue()
for i in range(0, a.shape[0], 10):
    work = a[i:i+10]
    t = threading.Thread(target=add, args=(lock, work))
    t.start()
    q.put(t)

while q.empty() == False:
    q.get().join()

value = 0
for v in result:
    pass  # Assemble
print(value)
You're getting your results in a mixed up order because append puts each result at the end of the list when it comes in, which may not be in the same order the threads were started. A better approach might be to pass each worker an index into a properly-sized list, and let it assign its results there whenever it finishes. Lists are sufficiently thread-safe that you shouldn't need a lock for this (your Queue is also completely unnecessary since only the main thread interacts with it).
def add(work, result_index):
    value = 0
    for v in work:
        pass  # Do some work!
    results[result_index] = value

a = np.arange(0, 100)
results = []
threads = []
for i in range(0, a.shape[0], 10):
    work = a[i:i+10]
    results.append(None)  # enlarge the results list, so we have room for this thread's result
    t = threading.Thread(target=add, args=(work, i//10))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
I would warn you that if your #Do some work! code is CPU limited, you're unlikely to get much benefit from using multiple threads. The CPython interpreter has a Global Interpreter Lock that prevents more than one thread from running Python code at the same time (so that interpreter state like reference counts on objects can remain consistent without each one needing its own lock). Threading is only really useful for IO-limited jobs (like fetching lots of documents from the internet).
For CPU limited work, you usually want to use multiprocessing instead. If that's what you need, look at multiprocessing.Pool.map, which can handle passing objects between processes and reassembling the results into an ordered list automatically.
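As a rough illustration under the same chunking scheme (the summing work is just a placeholder), Pool.map returns the results in submission order for you:

import numpy as np
from multiprocessing import Pool

def add(work):
    return int(np.sum(work))  # placeholder work: sum the chunk

if __name__ == '__main__':
    a = np.arange(0, 100)
    chunks = [a[i:i+10] for i in range(0, a.shape[0], 10)]
    with Pool() as pool:
        results = pool.map(add, chunks)  # results come back in the same order as chunks
    print(results)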
So I have been trying to figure out how to make my program thread-safe. The reason is that whenever I ran the program (which I created just for fun), the console got spammed so much that the output wasn't printed cleanly one line at a time.
Basically, I use a list that is nothing special, just a list of different fruits, let's say
list = ['apple','banana','kiwi'....]
and then I have something called data that basically gets printed out using the logger:
logger.log(data)
The full program would look something like this:
def sendData(list, data):
    logger.log(data)

def main():
    ...
    ...
    ...
    data_list.append((list[i], data))
    for index, data in data_list:
        threading.Thread(target=sendData, args=(list, data)).start()
So basically, as we can see, this would probably be a lot of threads running at the same time, and their output interleaves, which makes the console print a lot of garbled lines. So now the question is:
How can I make this thread-safe? Would some sort of sleep after each thread start be the magic fix?
You might want to look into threading.Lock(); it can be used to prevent multiple threads from doing output tasks at the same time and thus mixing the words in the console:
lock = threading.Lock()

def sendData(list, data):
    with lock:
        logger.log(data)

def main():
    ...
    ...
    ...
    data_list.append((list[i], data))
    for index, data in data_list:
        threading.Thread(target=sendData, args=(list, data)).start()
This will prevent multiple threads from running the code inside the "with" block at the same time.
When a thread X enters the "with" block, it claims the lock. If another thread tries to claim it (i.e. enter the "with" block), it will have to wait until the lock is released by thread X.
I have a code which is basically running an infinite loop, and in each iteration of the loop I run some instructions. Some of these instructions have to run in "parallel", which I do by using multiprocessing. Here is an example of my code structure:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def buy_fruit(fruit, number):
    print('I bought '+str(number)+' times the following fruit:'+fruit)
    return 'ok'

def func1(parameter1, parameter2):
    myParameters = (parameter1, parameter2)
    pool = ThreadPool(2)
    data = pool.starmap(func2, zip(myParameters))
    return 'ok'

def func2(parameter1):
    print(parameter1)
    return 'ok'

while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    pool = ThreadPool(2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
    func1('hello', 'hola')
I agree it's a bit messy, because I open pools both within the main loop and within functions.
So everything works well, until the loop runs a few minutes and I get an error:
"RuntimeError: can't start new thread"
I saw online that this is due to the fact that I have opened too many threads.
What is the simplest way to close all my Threads by the end of each loop iteration, so I can restart "fresh" at the start of the new loop iteration?
Thank you in advance for your time and help!
Best,
Julia
PS: The example code is just an example, my real function opens many threads in each loop and each function takes a few seconds to execute.
You are creating a new ThreadPool object inside the endless loop, which is a likely cause of your problem, because you are not terminating the threads at the end of the loop. Have you tried creating the object outside of the endless loop?
pool = ThreadPool(2)

while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
Alternatively, and to answer your question, if your use case for some reason requires creating a new ThreadPool object in each loop iteration, use a context manager (the with statement) to make sure all threads are closed upon leaving it.
while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    with ThreadPool(2) as pool:
        data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
Notice however the noticeable performance difference this has compared to the code above. Creating and terminating threads is expensive, which is why the first example, which reuses the pool, will run much faster and is probably what you'll want to use.
Regarding your edit involving "nested ThreadPools": I would suggest to maintain one single instance of your ThreadPool, and pass references to your nested functions as required.
def func1(pool, parameter1, parameter2):
    ...
    ...

pool = ThreadPool(2)

while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
    func1(pool, 'hello', 'hola')
I wrote a multiprocessing program in Python. It can be illustrated as follows:
import multiprocessing

nodes = multiprocessing.Manager().list()
lock = multiprocessing.Lock()

def get_elems(node):
    pass  # get elements by sending requests

def worker():
    lock.acquire()
    node = nodes.pop(0)
    lock.release()
    elems = get_elems(node)
    lock.acquire()
    for elem in elems:
        nodes.append(elem)
    lock.release()

if __name__ == "__main__":
    node = {"name": "name", "group": 0}
    nodes.append(node)
    processes = [None for i in xrange(10)]
    for i in xrange(10):
        processes[i] = multiprocessing.Process(target=worker)
        processes[i].start()
    for i in xrange(10):
        processes[i].join()
At the beginning of the program's run, everything seems okay. After running for a while, the program slows down. The phenomenon also exists when using multithreading, and since I saw there is a Global Interpreter Lock in Python, I changed to multiprocessing. But the phenomenon is still there. The complete code is here. I have tried Cython; the phenomenon persists. Is there something wrong in my code? Or is this an inherent limitation of Python when it comes to parallelism?
I'm not sure it's the actual cause, but you are popping from the beginning of an increasingly long list. That's expensive. Try to use a collections.deque.
Update: Read the linked code. You should use a Queue, as suggested in the comments to this post, and threads.
You do away with the locks by using the Queue. The workers are IO bound, so threads are appropriate.
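A minimal sketch of that rework (the worker count and node format follow the question; get_elems stays a placeholder) could look like this:

import threading
from queue import Queue  # on Python 2: from Queue import Queue

def get_elems(node):
    # placeholder: fetch child elements for a node by sending requests
    return []

def worker(q):
    while True:
        node = q.get()               # blocks until a node is available; no explicit lock needed
        for elem in get_elems(node):
            q.put(elem)              # feed newly discovered nodes back into the queue
        q.task_done()

q = Queue()
q.put({"name": "name", "group": 0})

for _ in range(10):
    t = threading.Thread(target=worker, args=(q,))
    t.daemon = True                  # let the program exit once the queue is drained
    t.start()

q.join()                             # wait until every queued node has been processed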