Python - How to make a thread thread-safe

So I have been trying to figure out how to make my threads thread-safe. The reason is that whenever I ran a program I created just for fun, the console got spammed so badly that the output wasn't printed one line at a time.
Basically, I use a list that is nothing special, just a list of different fruits, let's say
list = ['apple','banana','kiwi'....]
and then I have something called data that basically gets printed out using a logger:
logger.log(data)
The full program would look something like this:
def sendData(list, data):
    logger.log(data)

def main():
    ...
    ...
    ...
    data_list.append((list[i], data))
    for index, data in data_list:
        threading.Thread(target=sendData, args=(list, data)).start()
So basically, as we can see, this would start a lot of threads running at the same time, and the way they interact causes the console output to get mixed up. So now the question is:
How can I make this thread-safe? Would adding a sleep after each thread start be the magic fix?

You might want to look into threading.Lock(). It can be used to prevent multiple threads from doing output tasks at the same time and thus mixing up the words in the console:
def sendData(list, data):
    with lock:
        logger.log(data)

lock = threading.Lock()

def main():
    ...
    ...
    ...
    data_list.append((list[i], data))
    for index, data in data_list:
        threading.Thread(target=sendData, args=(list, data)).start()
This will prevent multiple threads from running the code inside the "with" block at the same time.
When a thread X enters the "with" block, it claims the lock. If another thread tries to claim it (i.e. enter the "with" block), it has to wait until the lock is released by thread X.
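For completeness, here is a minimal, self-contained sketch of that behaviour (not the OP's program; plain print() stands in for logger.log). Each thread's two lines always come out together because the lock serializes the block:

import threading
import time

lock = threading.Lock()

def worker(name):
    with lock:                # only one thread at a time gets past this line
        print("start", name)
        time.sleep(0.1)       # pretend to do some slow logging work
        print("end", name)    # always printed right after the matching "start"

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()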

Related

Multiprocessing two functions while sharing data

I just started out with Python, so please bear with me.
My code looks something like this right now (simplified):
lst = []

def func1():
    while True:
        **doing some stuff with selenium, performing some operations on lst**
        **I never break the loop**

def func2():
    while True:
        **doing some stuff with selenium, performing some operations on lst**
        **I never break the loop**
So far so good. However, I need both functions to run simultaneously while also operating on the same list and exchanging data through it. For example, func1 might append something to lst, and func2 might remove something from lst, then func1 might remove something, etc. Both functions need to run indefinitely, so the infinite loops don't make it any easier.
I read a little about multithreading, but from my understanding multithreading doesn't really run in parallel, so my code would execute more slowly. That's simply not an option. I also read that multithreading and Selenium aren't exactly a match made in heaven.
So, how can I achieve this? I need both functions to be able to perform operations on my list while running simultaneously indefinitely.
I could also use some help on the Multiprocessing stuff. Mapping, pools, queues... I don't even know where to start.
I really need your help guys and I would very much appreciate it.
Additional information (I don't really know if it matters): all of this is being run on a Windows machine using Python 2.7 and Selenium and Chromedriver.
Use a shared list proxy and a lock to synchronize lst between the processes.
Pseudo code:
import multiprocessing as mp

def func1(lst, lock):
    while True:
        lock.acquire()
        # **doing some stuff with selenium, performing some operations on lst**
        lock.release()
        # **I never break the loop**

def func2(lst, lock):
    while True:
        lock.acquire()
        # **doing some stuff with selenium, performing some operations on lst**
        lock.release()
        # **I never break the loop**

if __name__ == '__main__':  # guard needed on Windows so child processes can safely import this module
    lst = mp.Manager().list()
    lock = mp.Lock()
    p1 = mp.Process(target=func1, args=(lst, lock))
    p2 = mp.Process(target=func2, args=(lst, lock))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Note that the items in lst should ideally be scalars; Python uses a shallow copy to synchronize them between processes.
If lst contains other types of elements, such as a list, dict, or object, you have to reassign the element back to lst after every operation.
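A small illustration of that re-assignment point (a hypothetical example, not the OP's code): mutating a nested object fetched from the proxy list only changes a local copy, while assigning it back through the proxy is what actually propagates the change.

import multiprocessing as mp

if __name__ == '__main__':
    lst = mp.Manager().list()
    lst.append({'count': 0})

    item = lst[0]         # lst[0] returns a local copy of the dict
    item['count'] += 1    # mutating the copy does not reach other processes
    lst[0] = item         # re-assigning through the proxy propagates the change

    print(lst[0])         # {'count': 1}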

Does the "print" function run in a new subprocess or something similar?

I'm using praw to get new submissions from reddit:
for submission in submissions:
    print('Submission being evaluated:', submission.id)
    p = Process(target=evaluate, args=(submission.id, lock))
    p.start()
When using this code I sometimes get ids that link to older submissions.
So I changed my script to check if the submissions are new:
for submission in submissions:
    if (time.time() - submission.created) < 15:  # if submission is new
        lock.acquire()
        print('Submission being evaluated:', submission.id)
        lock.release()
        p = Process(target=evaluate, args=(submission.id, lock))
        p.start()
    else:
        lock.acquire()
        print("Submission " + submission.id + " was older than 15 seconds")
        lock.release()
But for an extended period of time the else branch never executed, even though I had gotten a fair number of old submission ids with the previous script.
So my question is: when I run print(submission.id), is it running in the background while the subprocess is created, maybe causing a problem and changing the value of submission.id, or is it just a coincidence that with the second script I got no old submissions?
Thanks in advance!
To answer the question in the title, no.
sys.stdout, the stream print writes to, is usually line buffered (though that shouldn't matter in this case, as print writes a trailing newline unless told not to), and it is shared between threads and subprocesses (unless explicitly unshared).
Without knowing more about the code around this, it's hard to say more. (Who knows, maybe you have a background thread somewhere in there that sneakily changes submission.id?)
EDIT:
The new information in the original post, namely that
print('Submission being evaluated:', submission.id)
is being printed, not
print(submission.id)
is critical.
Each argument of a print() call is printed atomically, but if two processes or threads are print()ing simultaneously, let's say print('a', 'b'), it's entirely possible that you get a a b b instead of a b a b.
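One common workaround (my addition, not part of the original answer) is to build the whole line as a single string first, so only one object is handed to print() and the per-argument interleaving described above cannot happen. This is not a hard guarantee across processes, but in practice a single short write usually comes out on one line:

submission_id = 'abc123'   # hypothetical id standing in for submission.id
msg = 'Submission being evaluated: ' + submission_id
print(msg)                 # one argument, one write, instead of two interleavable pieces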
Here is a function I like to use for safely printing to the console. That is the proper use of Lock(): use it around very simple operations. I actually use it in a class, so I don't pass around the lock object as in the example below, but it's the same principle.
Also, the answer is likely yes, but it's more uncertain than certain. Are you also using a lock every time you read and write submission.id? Generally, if you have an object shared by multiple processes, it's best to do this, and it's also best to use the Value class from the multiprocessing library, since Value objects are designed to be safely shared between processes. Below is a trivial but clear example without processes (that's your job!).
from multiprocessing import Process, Lock  # ...

myLock = Lock()
myID = ""

def SetID(lock):
    # return the new value; rebinding a str parameter inside the function
    # would never be visible to the caller
    with lock:
        return "set with lock"

def SafePrint(msg, lock):
    lock.acquire()
    print(msg)
    lock.release()
    return

myID = SetID(myLock)
SafePrint(myID, myLock)
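The answer mentions multiprocessing.Value but doesn't show it, so here is a hedged sketch of that suggestion (my addition, with hypothetical names): a Value carries its own lock, which makes it a convenient way to share a simple scalar such as a counter or a flag between processes.

from multiprocessing import Process, Value

def bump(counter):
    with counter.get_lock():     # Value objects carry their own lock
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)      # shared 32-bit signed int, initial value 0
    procs = [Process(target=bump, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)         # 4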

Can a python Multiprocessing queue be passed to the child process?

I have a big dataset in a data acquisition system I wrote in Python, and it takes an extremely long time to pass it over a queue from the child process to the parent. I wanted to save the acquired data at the end of the acquisition and tried doing this with a multiprocessing Queue. Instead of doing it this way, I would prefer to pass a message over the queue from the parent to the child telling it to save my data before I kill the child process. Is this possible? An example of what I thought it might look like is:
def acquireData(self, var1, queue):
    import h5py
    # Put my acquisition code here
    queue.get()
    if queue == True:
        f = h5py.File("FileName", "w")
        f.create_dataset('Data', data=data)
        f.close()

if __name__ == '__main__':
    from multiprocessing import Process, Queue
    queue = Queue()
    inter_thread = Process(target=acquireData, args=(var1, queue))
    queue.put(False)
    inter_thread.start()
    while True:
        if not args.automate:
            # Let c++ threads run for given amount of time
            # Wait for stop from OP GUI
        else:
            queue.put(True)
            break
    print("Acquisition finished, cleaning up...")
    sleep(2)
    inter_thread.terminate()
Is this allowed? If this type of interfacing between processes is allowed, do I have the right notation? For reference, I have on the order of 9e7 data points in the array I'm trying to save, and I have 7 such arrays, which are simply not being passed to my parent process in a timely manner by putting them on the queue. Thank you.
First, yes, passing a queue to a child is not only legal, but the main use case for queues. See the first example in the docs, which does exactly that.
However, you've got some problems with your code:
queue.get()
if queue == True:
First, your queue is never going to be the boolean value True, it's going to be a Queue. You almost never want to check if x == True: in Python; you want to check if x:. For example, if [1, 2]: will pass, while if [1, 2] == True: will not.
Second, your queue isn't even the thing you want to check in the first place. It isn't truthy or falsey (or it isn't relevant whether it is); it's the value the main process put on the queue and you pulled off that's either truthy or falsey. Which you discarded as soon as you retrieved it.
So, do this:
flag = queue.get()
if flag:
Or, more simply:
if queue.get():
I'm not sure whether this is exactly what you want or not. That queue.get() will block forever until the main process puts something there. Is that what you wanted? If so, great; you're done with this part of your code. If not, you need to think about what you wanted instead.
As designed, the parent will always wait 2 seconds, even if the child finished long before that. A better solution is to join the child with a timeout of 2 seconds. Then you can terminate it if it times out.
Plus, are you sure the termination behavior you've designed is what you want? You're doing a "soft kill request" with the queue, then waiting 2 seconds, then doing a "medium-hard kill request" with terminate, and never doing a "hard kill" with kill. That could be a perfectly reasonable design—but if it's not your design, you've implemented the wrong thing.
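A minimal sketch of the join-with-a-timeout suggestion (my wording, reusing the variable name inter_thread from the question): wait up to 2 seconds for the child to exit on its own, and only fall back to terminate() if it is still alive.

inter_thread.join(timeout=2)   # returns as soon as the child exits, or after 2 s
if inter_thread.is_alive():    # still running after the timeout?
    inter_thread.terminate()   # then fall back to the harder stop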

Multi-threaded web scraping in Python/PySide/PyQt

I'm building a web scraper of sorts. Basically, what the software would do is:
User (me) inputs some data (IDs) - IDs are complex, so not just numbers
Based on those IDs, the script visits http://localhost/ID
What is the best way to accomplish this? I'm looking at upwards of 20-30 concurrent connections to do it.
I was thinking, would a simple loop be the solution? That loop would start QThreads (it's a Qt app), so they would run concurrently.
The problem I see with the loop, however, is how to instruct it to use only IDs that haven't been used before, i.e. by the iteration/thread that ran just before it. Would I need some sort of "delegator" function that keeps track of which IDs have been used and delegates the unused ones to the QThreads?
Now I've written some code but I am not sure if it is correct:
class GUI(QObject):
    def __init__(self):
        print "GUI CLASS INITIALIZED!!!"
        self.worker = Worker()
        for i in xrange(300):
            QThreadPool().globalInstance().start(self.worker)

class Worker(QRunnable):
    def run(self):
        print "Hello world from thread", QThread.currentThread()
Now I'm not sure if this really achieves what I want. Is it actually running in separate threads? I'm asking because currentThread() is the same every time this is executed, so it doesn't look that way.
Basically, my question comes down to: how do I execute several identical QThreads concurrently?
Thanks in advance for the answer!
As Dikei says, Qt is a red herring here. Focus on just using Python threads, as it will keep your code much simpler.
In the code below we have a set, job_queue, containing the jobs to be executed. We also have a function, worker_thread, which takes a job from the passed-in queue and executes it. Here it just sleeps for a random period of time. The key thing is that set.pop is thread-safe.
We create a list of thread objects, workers, and call start on each as we create it. According to the Python documentation, threading.Thread.start runs the given callable in a separate thread of control. Lastly, we go through each worker thread and block until it has exited.
import threading
import random
import time

pool_size = 5
job_queue = set(range(100))

def worker_thread(queue):
    while True:
        try:
            job = queue.pop()
        except KeyError:
            break
        print "Processing %i..." % (job, )
        time.sleep(random.random())
    print "Thread exiting."

workers = []
for thread in range(pool_size):
    workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
    workers[-1].start()

for worker in workers:
    worker.join()

print "All threads exited"

Dumping a multiprocessing.Queue into a list

I wish to dump a multiprocessing.Queue into a list. For that task I've written the following function:
import Queue

def dump_queue(queue):
    """
    Empties all pending items in a queue and returns them in a list.
    """
    result = []

    # START DEBUG CODE
    initial_size = queue.qsize()
    print("Queue has %s items initially." % initial_size)
    # END DEBUG CODE

    while True:
        try:
            thing = queue.get(block=False)
            result.append(thing)
        except Queue.Empty:

            # START DEBUG CODE
            current_size = queue.qsize()
            total_size = current_size + len(result)
            print("Dumping complete:")
            if current_size == initial_size:
                print("No items were added to the queue.")
            else:
                print("%s items were added to the queue." % \
                    (total_size - initial_size))
            print("Extracted %s items from the queue, queue has %s items \
left" % (len(result), current_size))
            # END DEBUG CODE

            return result
But for some reason it doesn't work.
Observe the following shell session:
>>> import multiprocessing
>>> q = multiprocessing.Queue()
>>> for i in range(100):
...     q.put([range(200) for j in range(100)])
...
>>> q.qsize()
100
>>> l=dump_queue(q)
Queue has 100 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 99 items left
>>> l=dump_queue(q)
Queue has 99 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 3 items from the queue, queue has 96 items left
>>> l=dump_queue(q)
Queue has 96 items initially.
Dumping complete:
0 items were added to the queue.
Extracted 1 items from the queue, queue has 95 items left
>>>
What's happening here? Why aren't all the items being dumped?
Try this:
import Queue
import time

def dump_queue(queue):
    """
    Empties all pending items in a queue and returns them in a list.
    """
    result = []

    for i in iter(queue.get, 'STOP'):
        result.append(i)
    time.sleep(.1)
    return result

import multiprocessing
q = multiprocessing.Queue()
for i in range(100):
    q.put([range(200) for j in range(100)])
q.put('STOP')
l = dump_queue(q)
print len(l)
Multiprocessing queues have an internal buffer, and a feeder thread pulls work off that buffer and flushes it to the pipe. If not all of the objects have been flushed, I could see a case where Empty is raised prematurely. Using a sentinel to indicate the end of the queue is safe (and reliable). Also, the iter(get, sentinel) idiom is simply better than relying on Empty.
I don't like that it could raise Empty due to flushing timing (I added the time.sleep(.1) to allow a context switch to the feeder thread; you may not need it, and it works without it - it's a habit of mine to release the GIL).
# in theory:
def dump_queue(q):
    q.put(None)
    return list(iter(q.get, None))

# in practice this might be more resilient:
def dump_queue(q):
    q.put(None)
    return list(iter(lambda: q.get(timeout=0.00001), None))

# but neither case handles all the ways things can break;
# for that you need 'managers' and 'futures' ... see Commentary
I prefer None for sentinels, but I would tend to agree with jnoller that mp.Queue could use a safe and simple sentinel. His comments on the risk of getting Empty raised early are also valid; see below.
Commentary:
This is old and Python has changed, but this does come up as a hit if you're having issues with lists <-> queues in multiprocessing Python. So, let's look a little deeper:
First off, this is not a bug, it's a feature: https://bugs.python.org/issue20147. To save you some time reading that discussion and the further details in the documentation, here are some highlights (kind of philosophical, but I think they might help some who are starting out with MP/MT in Python):
MP queues are structures that can be communicated with from different threads, from different processes on the same system, and in fact from different (networked) computers
In general with parallel/distributed systems, strict synchronization is expensive, so every time you use part of the API for any MP/MT data structure, you need to look at the documentation to see what it promises to do, or not. Hint: if a function doesn't include the word "lock" or "semaphore" or "barrier" etc., then it will be some mixture of "asynchronous" and "best effort" (approximate), or what you might call "flaky."
Specific to this situation: Python is an interpreted language, with a famous single interpreter thread and its famous "Global Interpreter Lock" (GIL). If your entire program is single-process and single-threaded, then everything is hunky-dory. If not (and with MP it's egregiously not), you need to give the interpreter some breathing room. time.sleep() is your friend. In this case, timeouts.
In your solution you are only using flaky functions - get() and qsize(). And the code is in fact worse than you might think: dial up the size of the queue and the size of the objects and you're likely to break things.
Now, you can work with flaky routines, but you need to give them room to maneuver. In your example you're just hammering that queue. All you need to do is change the line thing = queue.get(block=False) to thing = queue.get(block=True, timeout=0.00001) and you should be fine.
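For concreteness (my addition), here is the question's dump_queue with just that one change applied and the debug code stripped out:

import Queue

def dump_queue(queue):
    """Empties all pending items in a queue and returns them in a list."""
    result = []
    while True:
        try:
            # block=True with a tiny timeout gives the feeder thread room to flush
            thing = queue.get(block=True, timeout=0.00001)
            result.append(thing)
        except Queue.Empty:
            return result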
The timeout 0.00001 is chosen carefully (10^-5); it's about the smallest you can safely make it (this is where art meets science).
Some comments on why you need the timeout: this relates to the internals of how MP queues work. When you 'put' something into an MP queue, it's not actually put into the queue, it's queued up to eventually be there. That's why qsize() happens to give you a correct result - that part of the code knows there's a pile of things "in" the queue. You just need to realize that an object "in" the queue is not the same thing as "I can now read it." Think of MP queues as sending a letter with USPS or FedEx - you might have a receipt and a tracking number showing that "it's in the mail," but the recipient can't open it yet. Now, to be even more specific, in your case you get '0' items accessible right away. That's because the single interpreter thread you're running hasn't had any chance to process stuff that's "queued up": your first loop just queues up a bunch of stuff for the queue, but you're immediately forcing your single thread to try to do a get() before it's even had a chance to line up even a single object for you.
One might argue that these timeouts slow the code down. Not really - MP queues are heavyweight constructs, and you should only be using them to pass pretty heavyweight "things" around, either big chunks of data or at least complex computation. What adding 10^-5 seconds actually does is give the interpreter a chance to do thread scheduling - at which point it will see your backed-up put() operations.
Caveat
The above is not completely correct, and this is (arguably) an issue with the design of the get() function. The semantics of setting timeout to non-zero is that the get() call will not block for longer than that before raising Empty. But the queue might not actually be empty (yet). So if you know your queue has a bunch of stuff to get, then the second solution above works better, possibly even with a longer timeout. Personally, I think they should have kept the timeout=0 behavior but added an actual built-in tolerance of 1e-5 seconds, because a lot of people get confused about what can happen around gets and puts to MP constructs.
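A tiny, timing-dependent illustration of that caveat (my addition, not from the original answer, Python 3): an item that was just put() may not have been flushed to the pipe yet, so a get() with a very short timeout can raise Empty even though the item is nominally "in" the queue.

import multiprocessing
import queue                        # for the Empty exception in Python 3

q = multiprocessing.Queue()
q.put('x')                          # queued up, but maybe not flushed by the feeder thread yet
try:
    print(q.get(timeout=0.00001))   # usually prints 'x', but...
except queue.Empty:
    print("looked empty even though an item was just put")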
In your example code, you're not actually spinning up parallel processes. If we were to do that, then you'd start getting some random results - sometimes only some of the queue objects will be removed, sometimes it will hang, sometimes it will crash, and sometimes more than one thing will happen (in one test run, one process crashed and the other hung).
The underlying problem is that when you insert the sentinel, you need to know that the queue is finished. That should be done as part of the logic around the queue: if, for example, you have a classical master-worker design, then the master would need to push a sentinel (end) when the last task has been added. Otherwise you end up with race conditions.
The "correct" (resilient) approach is to involve managers and futures:
import multiprocessing
import concurrent.futures

def fill_queue(q):
    for i in range(5000):
        q.put([range(200) for j in range(100)])

def dump_queue(q):
    q.put(None)
    return list(iter(q.get, None))

with multiprocessing.Manager() as manager:
    q = manager.Queue()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.submit(fill_queue, q)  # add stuff
        executor.submit(fill_queue, q)  # add more stuff
        executor.submit(fill_queue, q)  # ... and more
    # 'step out' of the executor
    l = dump_queue(q)
# 'step out' of the manager
print(f"Saw {len(l)} items")
Let the manager handle your MP constructs (queues, dictionaries, etc.), and within that let the futures handle your processes (and within that, if you want, let another future handle threads). This ensures that things are cleaned up as you 'unravel' the work.
