I am using ThreadPoolExecutor to send a lot of requests to websites quickly, but sometimes, maybe 1 in 5 runs, ThreadPoolExecutor finishes running all of the thread functions and then just freezes instead of moving on to the rest of my code. I need this to be reliable for a project I'm working on.
from concurrent.futures import ThreadPoolExecutor
import ballotpedialinks as bl

data = [[link, 0], [link, 1], [link, 2] ... [link, 500]]

def threadFunction(data):
    page = data[0]
    counter = data[1]
    a = bl.checkLink(page)
    print(a[0])
    if a[0] == '':
        links = bl.generateNewLinks(page, state)
        for link in links:
            a = bl.checkLink(link)
            if a[0] != '':
                print(f'{a[0]} is a fixed link')
                break

def quickRun(threads):
    with ThreadPoolExecutor(threads) as pool:
        pool.map(threadFunction, data[0:-1])

quickRun(32)
print('scraper complete')
This is basically what I'm doing, but the thread function is sending requests to websites. The executor finishes all the tasks I give it, but sometimes it just freezes once it's done. Is there anything I can do to make the executor not freeze?
Related
I am running a piece of Python code in which multiple threads are run through a ThreadPoolExecutor. Each thread is supposed to perform a task (fetch a webpage, for example). What I want to be able to do is to terminate all threads, even if one of the threads fails. For instance:
with ThreadPoolExecutor(self._num_threads) as executor:
    jobs = []
    for path in paths:
        kw = {"path": path}
        jobs.append(executor.submit(start, **kw))
    for job in futures.as_completed(jobs):
        result = job.result()
        print(result)
def start(*args, **kwargs):
    # fetch the page
    if(success):
        return True
    else:
        # Signal all threads to stop
Is it possible to do so? The results returned by threads are useless to me unless all of them are successful, so if even one of them fails, I would like to save some execution time of the rest of the threads and terminate them immediately. The actual code obviously is doing relatively lengthy tasks with a couple of failure points.
If you are done with threads and want to look into processes, then this piece of code looks very promising and simple, with almost the same syntax as threads, but using the multiprocessing module.
When the timeout expires, the process is terminated, which is very convenient.
import multiprocessing

def get_page(*args, **kwargs):
    # your web page downloading code goes here
    pass

def start_get_page(timeout, *args, **kwargs):
    p = multiprocessing.Process(target=get_page, args=args, kwargs=kwargs)
    p.start()
    p.join(timeout)
    if p.is_alive():
        # stop the downloading 'thread'
        p.terminate()
        # and then do any post-error processing here

if __name__ == "__main__":
    start_get_page(timeout, *args, **kwargs)
I have created an answer for a similar question I had, which I think will work for this question.
Terminate executor using ThreadPoolExecutor from concurrent.futures module
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep

NUM_REQUESTS = 100

def long_request(id):
    sleep(1)
    # Simulate bad response
    if id == 10:
        return {"data": {"valid": False}}
    else:
        return {"data": {"valid": True}}

def check_results(results):
    valid = True
    for result in results:
        valid = result["data"]["valid"]
    return valid

def main():
    futures = []
    responses = []
    num_requests = 0
    with ThreadPoolExecutor(max_workers=10) as executor:
        for request_index in range(NUM_REQUESTS):
            future = executor.submit(long_request, request_index)
            # Future list
            futures.append(future)
        for future in as_completed(futures):
            is_responses_valid = check_results(responses)
            # Cancel all future requests if one invalid
            if not is_responses_valid:
                executor.shutdown(wait=False)
            else:
                # Append valid responses
                num_requests += 1
                responses.append(future.result())
    return num_requests

if __name__ == "__main__":
    requests = main()
    print("Num Requests: ", requests)
In my code I used multiprocessing
import multiprocessing as mp

pool = mp.Pool()
for i in range(threadNumber):
    pool.apply_async(publishMessage, args=(map_metrics, connection_parameters...,))
pool.close()
pool.terminate()
This is how I would do it:
import concurrent.futures

def start(*args, **kwargs):
    # fetch the page
    if(success):
        return True
    else:
        return False

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = [executor.submit(start, {"path": path}) for path in paths]
    concurrent.futures.wait(results, timeout=10, return_when=concurrent.futures.FIRST_COMPLETED)
    for f in concurrent.futures.as_completed(results):
        f_success = f.result()
        if not f_success:
            executor.shutdown(wait=False, cancel_futures=True)  # shutdown if one fails
        else:
            pass  # do stuff here
If any result is not True, everything will be shut down immediately. (Note that the cancel_futures argument to shutdown() requires Python 3.9 or newer.)
You can try to use StoppableThread from func-timeout.
But terminating threads is strongly discouraged, and if you need to kill a thread, you probably have a design problem. Look at the alternatives: asyncio coroutines, or multiprocessing, which has legitimate cancel/terminate functionality.
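As a rough sketch of that cooperative alternative (my own illustration, not from the original answer), a shared threading.Event lets every worker stop itself, so nothing ever has to be killed; fetch_page and urls here are hypothetical placeholders:

from concurrent.futures import ThreadPoolExecutor
import threading

stop_event = threading.Event()

def worker(url):
    # Check the flag before doing any expensive work.
    if stop_event.is_set():
        return None
    result = fetch_page(url)   # hypothetical fetch function
    if result is None:         # treat None as a failure
        stop_event.set()       # ask every other worker to stop early
    return result

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, urls))  # urls is assumed to exist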
I'm trying to use concurrent futures with the example below, but my job never gets submitted. I don't see the print statement in load_url.
import sys
from concurrent import futures
import multiprocessing
import time
import queue

def load_url(url, q):
    # it will take 2 seconds to process a URL
    print('load_url')
    try:
        time.sleep(2)
        # put some dummy results in queue
        for x in range(5):
            print('put in queue')
            q.put(x)
    except Exception as e:
        print('exception')

def main():
    print('start')
    manager = multiprocessing.Manager()
    e = manager.Event()
    q = queue.Queue()
    with futures.ProcessPoolExecutor(max_workers=5) as executor:
        livefutures = {executor.submit(load_url, url, q): url
                       for url in ['a', 'b']}
        runningfutures = True
        print('check_futures')
        while runningfutures:
            print('here')
            runningfutures = [f for f in livefutures if f.running()]
            if not runningfutures:
                print('not running futures == ', q.empty())
                while not q.empty():
                    print('not running futures1')
                    yield q.get(False)

if __name__ == '__main__':
    for x in main():
        print('x=', x)
Probably a bit late, but I just ran into your post.
ProcessPoolExecutor is a bit picky: it requires the workers to execute simple functions, and it also sometimes behaves differently on Windows and Linux.
ThreadPoolExecutor is more permissive.
If you replace futures.ProcessPoolExecutor with futures.ThreadPoolExecutor, it seems to work.
You are passing Python's standard Queue to your asynchronous processes rather than a multiprocessing-safe Queue implementation. Therefore, your asynchronous job is failing with TypeError: cannot pickle '_thread.lock' object. However, because you are not calling .result on the future object, this exception is never raised in the main process.
Instantiate your queue with manager.Queue() and the code works.
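A minimal sketch of that fix, assuming the load_url function and imports from the question above: the queue comes from the manager, and each future's .result() is checked so worker exceptions surface in the main process.

def main():
    manager = multiprocessing.Manager()
    q = manager.Queue()  # proxy queue that can be pickled and shared with worker processes
    with futures.ProcessPoolExecutor(max_workers=5) as executor:
        livefutures = {executor.submit(load_url, url, q): url
                       for url in ['a', 'b']}
        for f in futures.as_completed(livefutures):
            f.result()  # re-raises any exception from the worker
    while not q.empty():
        yield q.get(False)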
I'm having issues using most or all of the cores to process the files faster; it can be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ,f represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page, and that would cause the script to stop. When dealing with multiple variables, I end up with a lot of try and except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
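As a rough sketch of that switch (my own illustration, not from the quoted answer): use the process-based Pool and map a top-level function over the file names, so that only picklable arguments cross the process boundary. processOneFile is a hypothetical stand-in for the per-file work, and crawlTheHtml is assumed to be the function defined in the question above.

from multiprocessing import Pool, cpu_count
import os

def processOneFile(filename):
    # Top-level functions (unlike lambdas or bound methods) can be pickled.
    with open(filename, 'r') as myfile:
        htmlsource = myfile.read()
    return crawlTheHtml(htmlsource)  # reuse the function defined above

if __name__ == '__main__':
    txt_files = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
    pool = Pool(cpu_count())
    results = pool.map(processOneFile, txt_files)
    pool.close()
    pool.join()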
I had this problem.
You need to do:
from multiprocessing import Pool
from multiprocessing import freeze_support
and you need to add at the end:
if __name__ == '__main__':
    freeze_support()
and then you can continue your script.
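A minimal sketch of how those pieces fit together (the worker function is just an illustrative placeholder):

from multiprocessing import Pool, freeze_support

def worker(x):
    # placeholder task
    return x * x

if __name__ == '__main__':
    freeze_support()  # needed for frozen Windows executables; harmless otherwise
    pool = Pool()
    print(pool.map(worker, range(10)))
    pool.close()
    pool.join()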
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10

class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the message queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written to the queue.
        Finishes when the parent process terminates.
        """
        print("Process {0} started".format(getpid()))
        while True:
            # If the queue is not empty, pop the next element and do the work.
            # If the queue is empty, wait indefinitely until an element gets into the queue.
            item = self.q.get(block=True, timeout=None)
            print("{0} retrieved: {1}".format(getpid(), item))
            # simulate some random-length operations
            sleep(random())

# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)
I have been trying for the past day to come up with a fix to my current problem.
I have a Python script which is supposed to count up using threads and perform requests based on each thread.
Each thread goes through a function called doit(), which contains a while True loop. This loop only breaks if it meets a certain criterion, and when it breaks, the following thread breaks as well.
What I want to achieve is that once one of these threads/workers gets status code 200 from its request, all workers/threads should stop. My problem is that they won't stop even though the criterion is met.
Here is my code:
import threading
import requests
import sys
import urllib.parse
import concurrent.futures
import simplejson
from requests.auth import HTTPDigestAuth
from requests.packages import urllib3
from concurrent.futures import ThreadPoolExecutor

def doit(PINStart):
    PIN = PINStart
    while True:
        req1 = requests.post(url, data=json.dumps(data), headers=headers1, verify=False)
        if str(req1.status_code) == "200":
            print(str(PINs))
            c0 = req1.content
            j0 = simplejson.loads(c0)
            AuthUID = j0['UserId']
            print(UnAuthUID)
            AuthReqUser()
            # Kill all threads/workers if any of the threads get here.
            break
        elif(PIN > (PINStart + 99)):
            break
        else:
            PIN += 1

def main():
    threads = 100
    threads = int(threads)
    Calcu = 10000 / threads
    NList = [0]
    for i in range(1, threads):
        ListAdd = i * Calcu
        if ListAdd == 10000:
            NList.append(int(ListAdd))
        else:
            NList.append(int(ListAdd) + 1)
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
        for NLister in concurrent.futures.as_completed(tGen):
            PinS = tGen[NLister]

if __name__ == "__main__":
    main()
I understand why this is happening. As I only break the while True loop in one of the threads, the other 99 (I run the code with 100 threads by default) don't break until they finish their count (which means running through the loop 100 times or getting status code 200).
What I originally did was define a global variable at the top of the code and change the loop to while Counter < 10000, meaning it runs the loop for all workers until Counter is greater than 10000, and inside the loop the global variable is incremented. This way, when a worker gets status code 200, I set Counter (my global variable) to, for example, 15000 (something above 10000), so all the other workers stop running the loop.
This did not work. When I add that to the code, all threads instantly stop and don't even run through the loop once.
Here is an example of this solution:
import threading
import requests
import sys
import urllib.parse
import concurrent.futures
import simplejson
from requests.auth import HTTPDigestAuth
from requests.packages import urllib3
from concurrent.futures import ThreadPoolExecutor

global Counter

def doit(PINStart):
    PIN = PINStart
    while Counter < 10000:
        req1 = requests.post(url, data=json.dumps(data), headers=headers1, verify=False)
        if str(req1.status_code) == "200":
            print(str(PINs))
            c0 = req1.content
            j0 = simplejson.loads(c0)
            AuthUID = j0['UserId']
            print(UnAuthUID)
            AuthReqUser()
            # Kill all threads/workers if any of the threads get here.
            Counter = 15000
            break
        elif(PIN > (PINStart + 99)):
            Counter = Counter + 1
            break
        else:
            Counter = Counter + 1
            PIN += 1

def main():
    threads = 100
    threads = int(threads)
    Calcu = 10000 / threads
    NList = [0]
    for i in range(1, threads):
        ListAdd = i * Calcu
        if ListAdd == 10000:
            NList.append(int(ListAdd))
        else:
            NList.append(int(ListAdd) + 1)
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
        for NLister in concurrent.futures.as_completed(tGen):
            PinS = tGen[NLister]

if __name__ == "__main__":
    main()
Any idea on how I can kill all workers once I get status code 200 from one of the requests I am sending out?
The problem is that you're not using a global variable.
To use a global variable in a function, you have to put the global statement in that function, not at the top level. Because you didn't, the Counter inside doit is a local variable. Any variable that you assign to anywhere in a function is local, unless you have a global (or nonlocal) declaration.
And the first time you use that local Counter is right at the top of the while loop, before you've assigned anything to it. So, it's going to raise an UnboundLocalError immediately.
This exception will be propagated back to the main thread as the result of the future. Which you would have seen, except that you never actually evaluate your futures. You just do this:
tGen = {executor.submit(doit, PinS): PinS for PinS in NList}
for NLister in concurrent.futures.as_completed(tGen):
PinS = tGen[NLister]
So, you get the PinS corresponding to the function you ran, but you don't look at the result or exception; you just ignore it. Hence you don't see that you're getting back 100 exceptions, any of which would have told you what was actually wrong. This is equivalent to having a bare except: pass in non-threaded code. Even if you don't want to check the result of your futures in "production" for some reason, you definitely should do it when debugging a problem.
Anyway, just put the global in the right place, and your bug is fixed.
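A minimal sketch of that placement (only the lines that change from the second example above): define Counter at module level and put the global declaration inside doit().

Counter = 0  # shared counter, defined at module level

def doit(PINStart):
    global Counter  # the declaration goes inside the function that assigns to Counter
    PIN = PINStart
    while Counter < 10000:
        # ... rest of the loop body unchanged ...
        Counter = Counter + 1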
However, you do have at least two other problems.
First, sharing globals between threads without synchronizing them is not safe. In CPython, thanks to the GIL, you never get a segfault because of it, and you often get away with it completely, but you often don't. You can miss counts because two threads tried to do Counter = Counter + 1 at the same time, so they both incremented it from 42 to 43. And you can get a stale value in the while Counter < 10000: check and go through the loop an extra time.
Second, you don't check the Counter until you've finished downloading and processing a complete request. This could take seconds, maybe even minutes, depending on your timeout settings. And add that to the fact that you might go through the loop an extra time before knowing it's time to quit…
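One way to address both of those issues (a sketch of my own, not part of the original answer) is to drop the shared counter and use a threading.Event that is checked before each request; do_request here is a hypothetical stand-in for the real POST call.

import threading
from concurrent.futures import ThreadPoolExecutor

found = threading.Event()  # set once any worker gets a 200

def doit(PINStart):
    PIN = PINStart
    while not found.is_set() and PIN <= PINStart + 99:
        status = do_request(PIN)  # hypothetical: returns the HTTP status code
        if status == 200:
            found.set()           # tell every other worker to stop
            return PIN
        PIN += 1
    return None

with ThreadPoolExecutor(max_workers=100) as executor:
    for result in executor.map(doit, range(0, 10000, 100)):
        if result is not None:
            print("Hit at PIN", result)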
In Python, here is my multiprocessing setup. I subclassed the Process class and gave it a queue and some other fields for pickling/data purposes.
This strategy works about 95% of the time; the other 5%, for an unknown reason, the queue just hangs and it never finishes (it's common that 3 of the 4 cores finish their jobs and the last one takes forever, so I have to just kill the job).
I am aware that queues have a fixed size in Python, or they will hang. My queue only stores one-character strings (the id of the processor), so it can't be that.
Here is the exact line where my code halts:
res = self._recv()
Does anyone have ideas? The full code is below.
Thank you.
from multiprocessing import Process, Queue
from multiprocessing import cpu_count as num_cores
import codecs, cPickle

class Processor(Process):
    def __init__(self, queue, elements, process_num):
        super(Processor, self).__init__()
        self.queue = queue
        self.elements = elements
        self.id = process_num

    def job(self):
        ddd = []
        for l in self.elements:
            obj = ... heavy computation ...
            dd = {}
            dd['data'] = obj.data
            dd['meta'] = obj.meta
            ddd.append(dd)
        cPickle.dump(ddd, codecs.open(
            urljoin(TOPDIR, self.id + '.txt'), 'w'))
        return self.id

    def run(self):
        self.queue.put(self.job())

if __name__ == '__main__':
    processes = []
    for i in range(0, num_cores()):
        q = Queue()
        p = Processor(q, divided_work(), process_num=str(i))
        processes.append((p, q))
        p.start()
    for val in processes:
        val[0].join()
        key = val[1].get()
        storage = urljoin(TOPDIR, key + '.txt')
        ddd = cPickle.load(codecs.open(storage, 'r'))
        .. unpack ddd process data ...
Do a time.sleep(0.001) at the beginning of your run() method.
From my experience, time.sleep(0.001) is by far not long enough.
I had a similar problem. It seems to happen if you call get() or put() on a queue "too early". I guess it somehow fails to initialize quickly enough. I'm not entirely sure, but I'm speculating that it might have something to do with the way a queue uses the underlying operating system to pass messages. It started happening to me after I started using BeautifulSoup and lxml, and it affected totally unrelated code.
My solution is a little bit ugly, but it's simple and it works:
import time

def run(self):
    error = True
    while error:
        try:
            self.queue.put(self.job())
            error = False
        except EOFError:
            print("EOFError. retrying...")
            time.sleep(1)
On my machine it usually retries twice during application start-up and never again afterwards. You need to do that inside the sender AND the receiver, since this error can occur on both sides.
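A sketch of the corresponding receiver-side retry (my own illustration of that advice), wrapping the queue.get() call the same way:

import time

def get_with_retry(queue):
    # Keep retrying the read until the queue stops raising EOFError.
    while True:
        try:
            return queue.get(block=True, timeout=None)
        except EOFError:
            print("EOFError on get(). retrying...")
            time.sleep(1)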