Killing all workers/threads ThreadPoolExecutor - python

I have been trying for the past day to come up with a fix to my current problem.
I have a Python script which is supposed to count up using threads and perform requests based on each thread.
Each thread goes through a function called doit(), which contains a while True loop. This loop only breaks if it meets a certain criterion, and when it breaks, the following thread breaks as well.
What I want to achieve is that once one of these threads/workers gets status code 200 from its request, all workers/threads should stop. My problem is that they won't stop even though the criterion is met.
Here is my code:
import threading
import requests
import sys
import urllib.parse
import concurrent.futures
import simplejson
from requests.auth import HTTPDigestAuth
from requests.packages import urllib3
from concurrent.futures import ThreadPoolExecutor
def doit(PINStart):
    PIN = PINStart
    while True:
        req1 = requests.post(url, data=json.dumps(data), headers=headers1, verify=False)
        if str(req1.status_code) == "200":
            print(str(PINs))
            c0 = req1.content
            j0 = simplejson.loads(c0)
            AuthUID = j0['UserId']
            print(UnAuthUID)
            AuthReqUser()
            # Kill all threads/workers if any of the threads get here.
            break
        elif (PIN > (PINStart + 99)):
            break
        else:
            PIN += 1

def main():
    threads = 100
    threads = int(threads)
    Calcu = 10000/threads
    NList = [0]
    for i in range(1, threads):
        ListAdd = i*Calcu
        if ListAdd == 10000:
            NList.append(int(ListAdd))
        else:
            NList.append(int(ListAdd)+1)
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
        for NLister in concurrent.futures.as_completed(tGen):
            PinS = tGen[NLister]

if __name__ == "__main__":
    main()
I understand why this is happening: I only break the while True loop in one of the threads, so the other 99 (I run the code with 100 threads by default) don't break until they finish their count (which means running through the loop 100 times or getting status code 200).
What I originally did was define a global variable at the top of the code and change the loop condition to while Counter < 10000, so every worker keeps looping until Counter exceeds 10000, and inside the loop each worker increments the global variable. That way, when a worker gets status code 200, I set Counter (my global variable) to, for example, 15000 (anything above 10000), so all the other workers stop instead of running the loop 100 times.
This did not work. When I add that to the code, all threads stop instantly and don't even run through the loop once.
Here is an example code of this solution:
import threading
import requests
import sys
import urllib.parse
import concurrent.futures
import simplejson
from requests.auth import HTTPDigestAuth
from requests.packages import urllib3
from concurrent.futures import ThreadPoolExecutor
global Counter
def doit(PINStart):
    PIN = PINStart
    while Counter < 10000:
        req1 = requests.post(url, data=json.dumps(data), headers=headers1, verify=False)
        if str(req1.status_code) == "200":
            print(str(PINs))
            c0 = req1.content
            j0 = simplejson.loads(c0)
            AuthUID = j0['UserId']
            print(UnAuthUID)
            AuthReqUser()
            # Kill all threads/workers if any of the threads get here.
            Counter = 15000
            break
        elif (PIN > (PINStart + 99)):
            Counter = Counter + 1
            break
        else:
            Counter = Counter + 1
            PIN += 1

def main():
    threads = 100
    threads = int(threads)
    Calcu = 10000/threads
    NList = [0]
    for i in range(1, threads):
        ListAdd = i*Calcu
        if ListAdd == 10000:
            NList.append(int(ListAdd))
        else:
            NList.append(int(ListAdd)+1)
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
        for NLister in concurrent.futures.as_completed(tGen):
            PinS = tGen[NLister]

if __name__ == "__main__":
    main()
Any idea how I can kill all workers once I get status code 200 from one of the requests I'm sending out?

The problem is that you're not using a global variable.
To use a global variable in a function, you have to put the global statement in that function, not at the top level. Because you didn't, the Counter inside doit is a local variable. Any variable that you assign to anywhere in a function is local, unless you have a global (or nonlocal) declaration.
And the first time you use that local Counter is right at the top of the while loop, before you've assigned anything to it. So, it's going to raise an UnboundLocalError immediately.
This exception will be propagated back to the main thread as the result of the future. Which you would have seen, except that you never actually evaluate your futures. You just do this:
tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
for NLister in concurrent.futures.as_completed(tGen):
    PinS = tGen[NLister]
So, you get the PinS corresponding to the function you ran, but you don't look at the result or exception; you just ignore it. Hence you don't see that you're getting back 100 exceptions, any of which would have told you what was actually wrong. This is equivalent to having a bare except: pass in non-threaded code. Even if you don't want to check the result of your futures in "production" for some reason, you definitely should do it when debugging a problem.
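For example, a minimal sketch of that debugging step, reusing the names from the question:

for NLister in concurrent.futures.as_completed(tGen):
    PinS = tGen[NLister]
    try:
        NLister.result()  # re-raises any exception that doit() raised in the worker
    except Exception as exc:
        print("worker starting at PIN", PinS, "raised:", exc)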
Anyway, just put the global in the right place, and your bug is fixed.
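For instance, a sketch of just the relevant lines (not the full script):

Counter = 0  # define the shared name at module level

def doit(PINStart):
    global Counter  # without this, assigning to Counter creates a local variable
    PIN = PINStart
    while Counter < 10000:
        ...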
However, you do have at least two other problems.
First, sharing globals between threads without synchronizing them is not safe. In CPython, thanks to the GIL, you never get a segfault because of it, and you often get away with it completely, but you often don't. You can miss counts because two threads tried to do Counter = Counter + 1 at the same time, so they both incremented it from 42 to 43. And you can get a stale value in the while Counter < 10000: check and go through the loop an extra time.
Second, you don't check the Counter until you've finished downloading and processing a complete request. This could take seconds, maybe even minutes depending on your timeout settings. And add that to the fact that you might go through the loop an extra time before knowing it's time to quit…
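One way to deal with both problems is to replace the counter with a threading.Event that every worker checks at the top of its loop and sets as soon as one worker succeeds. This is only a sketch of that alternative, not the code from the question; url, data and headers1 are assumed to exist as before:

import threading

stop_event = threading.Event()

def doit(PINStart):
    PIN = PINStart
    while not stop_event.is_set() and PIN <= PINStart + 99:
        req1 = requests.post(url, data=json.dumps(data), headers=headers1, verify=False)
        if req1.status_code == 200:
            stop_event.set()  # signal every other worker to stop
            break
        PIN += 1

A worker that is already blocked inside requests.post() will still finish that one request, but it will not start another.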

Related

Why is ThreadPoolExecutor freezing?

I am using ThreadPoolExecutor to get a lot of requests from websites quickly, but sometimes, maybe 1 in 5 times, ThreadPoolExecutor finishes running all of the thread functions and then just freezes without moving on to the rest of my code. I need this to be reliable for a project I'm working on.
from concurrent.futures import ThreadPoolExecutor
import ballotpedialinks as bl
data =[[link,0],[link,1],[link,2]...[link,500]]
def threadFunction(data):
    page = data[0]
    counter = data[1]
    a = bl.checkLink(page)
    print(a[0])
    if a[0] == '':
        links = bl.generateNewLinks(page, state)
        for link in links:
            a = bl.checkLink(link)
            if a[0] != '':
                print(f'{a[0]} is a fixed link')
                break

def quickRun(threads):
    with ThreadPoolExecutor(threads) as pool:
        pool.map(threadFunction, data[0:-1])

quickRun(32)
print('scraper complete')
This is basically what I'm doing, but the thread function is sending requests to websites. The executor finishes all the tasks I give it, but sometimes it just freezes once it's done. Is there anything I can do to keep the executor from freezing?

How do I add limited input time to my code?

I've been trying to find a good way to limit input time in Python scripts, and I finally got this code to work:
from threading import Timer
timeout = 5
t = Timer(timeout, print, ["Time's up!"])
t.start()
entry = input('> ')
t.cancel()
but, I need to be able to run a function when the timer ends.
Also, I want the function called from inside the timer code; otherwise, even if you type your entry before the timer runs out, the function will still be called no matter what.
Could anyone kindly edit this code I have to be able to run a function when the timer ends?
If it is fine to block the main thread while the user has not provided an answer, the code you have shared might work.
Otherwise, on Windows, you could use msvcrt along these lines:
import msvcrt
import sys
import time

class TimeoutExpired(Exception):
    pass

def input_with_timeout(prompt, timeout, timer=time.monotonic):
    sys.stdout.write(prompt)
    sys.stdout.flush()
    endtime = timer() + timeout
    result = []
    while timer() < endtime:
        if msvcrt.kbhit():
            result.append(msvcrt.getwche())  # XXX can it block on multibyte characters?
            if result[-1] in ('\r', '\n'):   # Enter is typically reported as '\r' on Windows
                return ''.join(result[:-1])
        time.sleep(0.04)  # just to yield to other processes/threads
    raise TimeoutExpired
The above code is written for Python 3 and you will need to test it yourself.
Reading the Python documentation at https://docs.python.org/3/library/threading.html#timer-objects, I have come up with the following snippet, which might work (try running it in your command prompt):
from threading import Timer

def input_with_timeout(x):
    def time_up():
        print('time up...')

    t = Timer(x, time_up)  # x is the amount of time in seconds
    t.start()
    try:
        answer = input("enter answer : ")
    except Exception:
        print('pass\n')
        answer = None
    if answer is not None:  # the variable has something, so the user answered in time
        t.cancel()          # time_up will not execute (so, no skip)

input_with_timeout(5)  # try this for five seconds

Python threading calls the method multiple times

I have to curl a website and display a message if the response status is not 200. The logic works fine, but I'm having trouble calling the method once.
The threading.Timer is supposed to call the method ONCE every 20 seconds, but apparently it calls it multiple times. Could someone please tell me how I can make it call the method once every 20 seconds?
import requests
import threading
def check_status(url):
    while True:
        status = requests.get(url)
        if status.status_code != 200:
            print('faulty')

def main():
    threading.Timer(2.0, check_status, args=('https://www.google.com',)).start()

if __name__ == "__main__":
    main()
Just create a new timer after you finish the old one.
import requests
import threading
def check_status(url):
    status = requests.get(url)
    if status.status_code != 200:
        print('faulty')
    threading.Timer(2.0, check_status, args=('https://www.google.com',)).start()

def main():
    threading.Timer(2.0, check_status, args=('https://www.google.com',)).start()

if __name__ == "__main__":
    main()
Every 20 seconds, you are creating a thread that enters an infinite loop which checks the HTTP status. Even if you did not use threading.Timer, you would still get multiple prints. Remove the while loop and it will work as you expect.
Update
My mistake, looking at the documentation: https://docs.python.org/2/library/threading.html#timer-objects
Timer will execute the function once after the interval has passed, and then it will exit. What you need to do is put time.sleep inside the while loop and call the function just once from your main.
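A minimal sketch of that suggestion (keeping the question's URL and status check, and using the 20-second interval described in the question):

import time
import requests

def check_status(url):
    while True:
        status = requests.get(url)
        if status.status_code != 200:
            print('faulty')
        time.sleep(20)  # sleep inside the loop instead of re-arming a Timer

def main():
    check_status('https://www.google.com')  # called once; the loop handles the repetition

if __name__ == "__main__":
    main()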

Issue with MultiProcessing in Python with BeautifulSoup 4

I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next one.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ,f represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page and that causes the script to stop; when dealing with multiple variables, I end up with a lot of try/except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
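A minimal sketch of that direction, assuming crawlTheHtml stays a top-level function as in the question and introducing a hypothetical helper crawl_file so that each process reads and parses one file:

from multiprocessing import Pool, cpu_count
import os

def crawl_file(filename):
    # hypothetical helper; it must be a top-level function so it can be pickled
    with open(filename, 'r') as myfile:
        return crawlTheHtml(myfile.read())

if __name__ == '__main__':
    files = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
    pool = Pool(cpu_count())
    pool.map(crawl_file, files)  # pass the function itself, not the result of calling it
    pool.close()
    pool.join()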
I had this problem. You need to do:
from multiprocessing import Pool
from multiprocessing import freeze_support
and at the end you need to do:
if __name__ == '__main__':
    freeze_support()
and then you can continue your script.
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10

class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the message queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If the queue is not empty, pop the next element and do the work.
            # If the queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random-length operations
            sleep(random())

# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)

Different way to implement this threading?

I'm trying out threads in Python. I want a spinning cursor to display while another method runs (for 5-10 minutes). I've written some code, but I'm wondering: is this how you would do it? I don't like to use globals, so I assume there is a better way?
c = True
def b():
    for j in itertools.cycle('/-\|'):
        if (c == True):
            sys.stdout.write(j)
            sys.stdout.flush()
            time.sleep(0.1)
            sys.stdout.write('\b')
        else:
            return

def a():
    global c
    # code does stuff here for 5-10 minutes
    # simulate with sleep
    time.sleep(2)
    c = False

Thread(target=a).start()
Thread(target=b).start()
EDIT:
Another issue now is that when the processing ends, the last element of the spinning cursor is still on screen, so something like \ is left printed.
You could use events:
http://docs.python.org/2/library/threading.html
I tested this and it works. It also keeps everything in sync. You should avoid changing/reading the same variables in different threads without synchronizing them.
#!/usr/bin/python
from threading import Thread
from threading import Event
import time
import itertools
import sys
def b(event):
    for j in itertools.cycle('/-\|'):
        if not event.is_set():
            sys.stdout.write(j)
            sys.stdout.flush()
            time.sleep(0.1)
            sys.stdout.write('\b')
        else:
            return

def a(event):
    # code does stuff here for 5-10 minutes
    # simulate with sleep
    time.sleep(2)
    event.set()

def main():
    c = Event()
    Thread(target=a, kwargs={'event': c}).start()
    Thread(target=b, kwargs={'event': c}).start()

if __name__ == "__main__":
    main()
Related to 'kwargs', from Python docs (URL in the beginning of the post):
class threading.Thread(group=None, target=None, name=None, args=(), kwargs={})
...
kwargs is a dictionary of keyword arguments for the target invocation. Defaults to {}.
You're mostly on the right track, except for the global variable. Normally you'd need to coordinate access to shared data like that with a lock or semaphore, but in this special case you can take a shortcut and just check whether the worker thread is still running instead. This is what I mean:
from threading import Thread
from threading import Event
import time
import itertools
import sys
def monitor_thread(watched_thread):
    chars = itertools.cycle('/-\|')
    while watched_thread.is_alive():
        sys.stdout.write(chars.next())
        sys.stdout.flush()
        time.sleep(0.1)
        sys.stdout.write('\b')

def worker_thread():
    # code does stuff here - simulated with sleep
    time.sleep(2)

if __name__ == "__main__":
    watched_thread = Thread(target=worker_thread)
    watched_thread.start()
    Thread(target=monitor_thread, args=(watched_thread,)).start()
This is not properly synchronized, but I will not try to explain it all to you right now because it's a whole lot of knowledge. Try reading this: http://effbot.org/zone/thread-synchronization.htm
But in your case it's not that bad that things aren't synchronized correctly. The only thing that could happen is that the spinning bar spins a few milliseconds longer than the background task actually needs.
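As for the edit about the stray spinner character: neither answer above clears it, but one small tweak (my own addition, not part of the original answers) is to overwrite it with a space once the monitoring loop ends. Using the monitor_thread version and the same imports as the snippet above:

def monitor_thread(watched_thread):
    chars = itertools.cycle('/-\|')
    while watched_thread.is_alive():
        sys.stdout.write(chars.next())
        sys.stdout.flush()
        time.sleep(0.1)
        sys.stdout.write('\b')
    sys.stdout.write(' \b')  # overwrite the leftover spinner character with a space
    sys.stdout.flush()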
