I am trying to make concurrent API calls with Python.
I based my code on the solution (first answer) presented in this thread: What is the fastest way to send 100,000 HTTP requests in Python?
Currently, my code is broken.
I have a main function which creates the queue, populates it, initiates the threads, starts them, and joins the queue.
I also have a target function which should make the get requests to the API.
The difficulty I am experiencing right now is that
the target function does not do the necessary work.
The target is called, but it behaves as if the queue were empty.
The first print ("inside scraper worker") is executed, while the second ("inside scraper worker, queue not empty") is not.
def main_scraper(flights):
print("main scraper was called, got: ")
print(flights)
data = []
q = Queue()
map(q.put, flights)
for i in range(0, 5):
t = Thread(target = scraper_worker, args = (q, data))
t.daemon = True
t.start()
q.join()
return data
def scraper_worker(q, data):
print("inside scraper worker")
while not q.empty():
print("inside scraper worker, queue not empty")
f = q.get()
url = kiwi_url(f)
response = requests.get(url)
response_data = response.json()
results = parseResults(response_data)
q.task_done()
print("task done. results:")
print(results)
#f._price = results[0]["price"]
#f._url = results[0]["deep_link"]
data.append(results)
return data
I hope this is enough information for you to help me out.
Otherwise, I will rewrite the code into a minimal example that anyone can run.
I would guess that the flights are never put on the queue. In Python 3, map() is lazy: it returns an iterator, and because that iterator is never consumed, map(q.put, flights) never actually calls q.put. I would just iterate.
def main_scraper(flights):
print("main scraper was called, got: ")
print(flights)
data = []
q = Queue()
for flight in flights:
q.put(flight)
for i in range(0, 5):
t = Thread(target = scraper_worker, args = (q, data))
t.daemon = True
t.start()
q.join()
return data
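To see the difference, here is a tiny standalone demo (not part of the original answer, with made-up flight strings): map() returns a lazy iterator and enqueues nothing, while the explicit loop actually fills the queue.
from queue import Queue

flights = ["JFK-LHR", "SFO-NRT", "BER-CDG"]  # placeholder values

q = Queue()
map(q.put, flights)   # returns a map object that is never consumed
print(q.qsize())      # 0 - nothing was enqueued

q = Queue()
for flight in flights:
    q.put(flight)     # the eager loop actually enqueues the items
print(q.qsize())      # 3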
I'm using multiprocessing to run workers on different files in parallel. The workers' results are put into a queue. A listener gets the results from the queue and writes them to a file.
Sometimes the listener runs into errors (of various origins). In that case the listener silently dies, but all other processes continue running (rather surprisingly, worker errors cause all processes to terminate).
I would like to stop all processes (workers, listener, etc.) when the listener catches an error. How can this be done?
The scheme of my code is as follows:
def worker(file_path, q):
## do something
q.put(1.)
return True
def listener(q):
while True:
m = q.get()
if m == 'kill':
break
else:
try:
                pass  # do something and write to file
except Exception as err:
# raise error
tb = sys.exc_info()[2]
raise err.with_traceback(tb)
def main():
manager = mp.Manager()
q = manager.Queue(maxsize=3)
with mp.Pool(5) as pool:
watcher = pool.apply_async(listener, (q,))
files = ['path_1','path_2','path_3']
jobs = [ pool.apply_async(worker, (p,q,)) for p in files ]
# fire off workers
for job in jobs:
job.get()
# kill the listener when done
q.put('kill')
# run
if __name__ == "__main__":
main()
I tried introducing event = manager.Event() and using it as a flag in main():
## inside the pool, after starting workers
while True:
if event.is_set():
for job in jobs:
job.terminate()
No success. Calling os._exit(1) in the listener's exception block raises a broken pipe error, but the processes are not killed.
I also tried setting daemon = True,
for job in jobs:
job.daemon = True
Did not help.
In fact, to handle listener exceptions, I'm passing a callable to apply_async (so that they are not entirely silenced). This complicates the situation, but not much.
Thank you in advance.
As always there are many ways to accomplish what you're after, but I would probably suggest using an Event to signal that the processes should quit. I also would not use a Pool in this instance, as it only really simplifies things for simple cases where you need something like map. More complicated use cases quickly make it easier to just build your own "pool" with the functionality you need.
from multiprocessing import Process, Queue, Event
from random import random
def might_fail(a):
assert(a > .001)
def worker(args_q: Queue, result_q: Queue, do_quit: Event):
try:
while not do_quit.is_set():
args = args_q.get()
if args is None:
break
else:
# do something
result_q.put(random())
finally: #signal that worker is exiting even if exception is raised
result_q.put(None) #signal listener that worker is exiting
def listener(result_q: Queue, do_quit: Event, n_workers: int):
n_completed = 0
while n_workers > 0:
res = result_q.get()
if res is None:
n_workers -= 1
else:
n_completed += 1
try:
might_fail(res)
except:
do_quit.set() #let main continue
print(n_completed)
raise #reraise error after we signal others to stop
do_quit.set() #let main continue
print(n_completed)
if __name__ == "__main__":
args_q = Queue()
result_q = Queue()
do_quit = Event()
n_workers = 4
listener_p = Process(target=listener, args=(result_q, do_quit, n_workers))
listener_p.start()
for _ in range(n_workers):
worker_p = Process(target=worker, args=(args_q, result_q, do_quit))
worker_p.start()
for _ in range(1000):
args_q.put("some/file.txt")
for _ in range(n_workers):
args_q.put(None)
do_quit.wait()
print('done')
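Not part of the answer above, but since the question mentions passing a callable to apply_async: a minimal sketch (my addition, with a simulated failure and stand-in worker/listener bodies) of how Pool's error_callback could be combined with pool.terminate() to stop everything once the listener dies.
import multiprocessing as mp
import threading

def worker(path, q):
    # stand-in for real work on the file
    q.put(1.0)
    return True

def listener(q):
    while True:
        m = q.get()
        if m == 'kill':
            break
        raise RuntimeError("simulated listener failure")  # pretend the write failed

def main():
    manager = mp.Manager()
    q = manager.Queue()
    failed = threading.Event()  # set in the parent process by the error callback
    with mp.Pool(5) as pool:
        watcher = pool.apply_async(listener, (q,),
                                   error_callback=lambda e: failed.set())
        jobs = [pool.apply_async(worker, (p, q)) for p in ['path_1', 'path_2', 'path_3']]
        for job in jobs:
            job.wait()
        q.put('kill')    # ask the listener to stop normally
        watcher.wait()   # returns once the listener has finished or died
        if failed.is_set():
            pool.terminate()  # kill any remaining pool processes from the main process

if __name__ == "__main__":
    main()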
I want to build a tool that scans a website for subdomains. I know how to do this, but my function is slow; I looked at the gobuster usage and saw that gobuster can use many concurrent threads. How can I implement this too?
I have asked Google many times, but I can't find anything about this; can someone give me an example?
gobuster usage: -t Number of concurrent threads (default 10)
My current program:
def subdomaines(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
for line in open(wordlist).readlines():
resp = requests.get(url + line) # resp
if resp.status_code in (301, 200):
print(f'Valid - {line}')
print(f'{count} / {num_lines}')
count += 1
Note: gobuster is a very fast tool for finding subdomains of websites.
If you're trying to use threading in Python you should start from the basics and learn what's available. But here's a simple example taken from https://pymotw.com/2/threading/
import threading
def worker():
"""thread worker function"""
    print('Worker')
return
threads = []
for i in range(5):
t = threading.Thread(target=worker)
threads.append(t)
t.start()
To apply this to your task, a simple approach would be to spawn a thread for each request, something like the code below. Note: if your wordlist is long this might be very expensive. Look into some of the thread pool libraries in Python for better thread management that you won't need to control explicitly yourself.
import threading
import requests
def subdomains(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
threads = []
for line in open(wordlist).readlines():
t = threading.Thread(target=checkUrl,args=(url,line))
threads.append(t)
t.start()
for thread in threads: #wait for all threads to complete
thread.join()
def checkUrl(url,line):
resp = requests.get(url + line)
if resp.status_code in (301, 200):
print(f'Valid - {line}')
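As a sketch of the thread-pool route mentioned above (my addition, using concurrent.futures from the standard library rather than any specific third-party library), assuming the same url + line scheme as checkUrl:
from concurrent.futures import ThreadPoolExecutor
import requests

def check_url(url, line):
    # hypothetical helper mirroring checkUrl above
    resp = requests.get(url + line)
    return line if resp.status_code in (301, 200) else None

def subdomains_pooled(url, wordlist, max_workers=10):
    with open(wordlist) as f:
        lines = f.read().splitlines()
    # the executor caps the number of in-flight requests at max_workers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(lambda line: check_url(url, line), lines):
            if result:
                print(f'Valid - {result}')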
To implement the counter you'll need to control shared access between threads to prevent race conditions (two threads accessing the variable at the same time resulting in... problems). A counter object with protected access is provided in the link above:
class Counter(object):
def __init__(self, start=0):
self.lock = threading.Lock()
self.value = start
def increment(self):
#Waiting for lock
self.lock.acquire()
try:
#Acquired lock
self.value = self.value + 1
finally:
#Release lock, so other threads can count
self.lock.release()
#usage:
#in subdomains()...
counter = Counter()
for ...
t = threading.Thread(target=checkUrl,args=(url,line,counter))
#in checkUrl(url, line, counter)...
counter.increment()
Final note: I have not compiled or tested any of this code.
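A quick self-contained check of the Counter (my addition, equally untested by the original author): 100 threads each increment once and no update is lost; it also shows the with-statement form of acquire/release.
import threading

class Counter(object):
    def __init__(self, start=0):
        self.lock = threading.Lock()
        self.value = start

    def increment(self):
        with self.lock:  # equivalent to acquire()/release() in a try/finally
            self.value += 1

counter = Counter()
threads = [threading.Thread(target=counter.increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 100 - no increments were lost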
Python has a threading module.
The simplest way to use a Thread is to instantiate it with a target function and call start() to let it begin working.
import threading
def subdomains(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
for line in open(wordlist).readlines():
resp = requests.get(url + line) # resp
if resp.status_code in (301, 200):
print(f'Valid - {line}')
print(f'{count} / {num_lines}')
count += 1
threads = []
for i in range(10):
    t = threading.Thread(target=subdomains, args=(url, wordlist))  # the target needs its arguments passed in
threads.append(t)
t.start()
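As written above, every thread would scan the full wordlist and repeat the same requests. A sketch (my addition, reusing the question's requests-based check) that splits the wordlist into slices so each thread handles distinct lines:
import threading
import requests

def scan_chunk(url, lines):
    # each thread checks only its own slice of the wordlist
    for line in lines:
        resp = requests.get(url + line)
        if resp.status_code in (301, 200):
            print(f'Valid - {line}')

def subdomains_threaded(url, wordlist, num_threads=10):
    with open(wordlist) as f:
        lines = f.read().splitlines()
    chunk_size = max(1, (len(lines) + num_threads - 1) // num_threads)  # ceiling division
    threads = []
    for i in range(0, len(lines), chunk_size):
        t = threading.Thread(target=scan_chunk, args=(url, lines[i:i + chunk_size]))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()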
I'm running a thread pool that is giving me a random bug. Sometimes it works, sometimes it gets stuck at the pool.join part of this code. I've been at this for several days, yet cannot find any difference between when it works and when it gets stuck. Please help...
Here's the code...
def run_thread_pool(functions_list):
# Make the Pool of workers
pool = ThreadPool() # left blank to default to machine number of cores
pool.map(run_function, functions_list)
# close the pool and wait for the work to finish
pool.close()
pool.join()
return
Similarly, this code is also randomly getting stuck at q.join():
def run_queue_block(methods_list, max_num_of_workers=20):
from views.console_output_handler import add_to_console_queue
'''
Runs methods on threads. Stores method returns in a list. Then outputs that list
after all methods in the list have been completed.
:param methods_list: example ((method name, args), (method_2, args), (method_3, args)
:param max_num_of_workers: The number of threads to use in the block.
:return: The full list of returns from each method.
'''
method_returns = []
log = StandardLogger(logger_name='run_queue_block')
# lock to serialize console output
lock = threading.Lock()
def _output(item):
# Make sure the whole print completes or threads can mix up output in one line.
with lock:
if item:
add_to_console_queue(item)
msg = threading.current_thread().name, item
log.log_debug(msg)
return
# The worker thread pulls an item from the queue and processes it
def _worker():
log = StandardLogger(logger_name='_worker')
while True:
try:
method, args = q.get() # Extract and unpack callable and arguments
except:
# we've hit a nonetype object.
break
if method is None:
break
item = method(*args) # Call callable with provided args and store result
method_returns.append(item)
_output(item)
q.task_done()
num_of_jobs = len(methods_list)
if num_of_jobs < max_num_of_workers:
max_num_of_workers = num_of_jobs
# Create the queue and thread pool.
q = Queue()
threads = []
# starts worker threads.
for i in range(max_num_of_workers):
t = threading.Thread(target=_worker)
t.daemon = True # thread dies when main thread (only non-daemon thread) exits.
t.start()
threads.append(t)
for method in methods_list:
q.put(method)
# block until all tasks are done
q.join()
# stop workers
for i in range(max_num_of_workers):
q.put(None)
for t in threads:
t.join()
return method_returns
I never know when it's going to work. It works most of the time, but most of the time is not good enough. What might possibly cause a bug like this?
You have to call shutdown on the concurrent.futures.ThreadPoolExecutor object. Then return the result of pool.map.
from concurrent.futures import ThreadPoolExecutor

def run_thread_pool(functions_list):
    # Make the pool of workers (max_workers defaults based on the CPU count)
    pool = ThreadPoolExecutor()
    result = pool.map(run_function, functions_list)
    # shut down the pool and wait for the work to finish
    pool.shutdown()
    return result
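As a usage note (my addition), concurrent.futures also supports the context-manager form, which calls shutdown(wait=True) automatically on exit; run_function and functions_list here are the ones from the question:
from concurrent.futures import ThreadPoolExecutor

def run_thread_pool(functions_list):
    # __exit__ calls shutdown(wait=True), so all work finishes before returning
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_function, functions_list))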
I've simplified your code without a Queue object or daemon threads. Check whether it fits your requirements.
def run_queue_block(methods_list):
from views.console_output_handler import add_to_console_queue
'''
Runs methods on threads. Stores method returns in a list. Then outputs that list
after all methods in the list have been completed.
:param methods_list: example ((method name, args), (method_2, args), (method_3, args)
:param max_num_of_workers: The number of threads to use in the block.
:return: The full list of returns from each method.
'''
method_returns = []
log = StandardLogger(logger_name='run_queue_block')
# lock to serialize console output
lock = threading.Lock()
def _output(item):
# Make sure the whole print completes or threads can mix up output in one line.
with lock:
if item:
add_to_console_queue(item)
msg = threading.current_thread().name, item
log.log_debug(msg)
return
# The worker thread pulls an item from the queue and processes it
def _worker(method, *args, **kwargs):
log = StandardLogger(logger_name='_worker')
item = method(*args, **kwargs) # Call callable with provided args and store result
with lock:
method_returns.append(item)
_output(item)
threads = []
# starts worker threads.
for method, args in methods_list:
        t = threading.Thread(target=_worker, args=(method, *args))  # unpack the stored argument tuple
t.start()
threads.append(t)
# stop workers
for t in threads:
t.join()
return method_returns
To allow your queue to join in your second example, you need to ensure that all tasks are removed from the queue.
So in your _worker function, mark tasks as done even if they could not be processed; otherwise the count of unfinished tasks never reaches zero and q.join() blocks forever.
def _worker():
log = StandardLogger(logger_name='_worker')
while True:
try:
method, args = q.get() # Extract and unpack callable and arguments
except:
# we've hit a nonetype object.
q.task_done()
break
if method is None:
q.task_done()
break
item = method(*args) # Call callable with provided args and store result
method_returns.append(item)
_output(item)
q.task_done()
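A minimal, self-contained illustration of that rule (my addition): every get() is matched by a task_done(), including the None sentinel, so q.join() always unblocks.
from queue import Queue
from threading import Thread

q = Queue()

def worker():
    while True:
        item = q.get()
        try:
            if item is None:   # sentinel: nothing left to process
                break
            print('processing', item)
        finally:
            q.task_done()      # always mark the item done, even the sentinel

t = Thread(target=worker)
t.start()
for item in [1, 2, 3, None]:
    q.put(item)
q.join()   # returns because every get() was matched by a task_done()
t.join()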
I have a master Python script that creates two objects:
obj1 = xmlobj()
list1, list2, list3 = obj1.parsexml("xmlfile")
# parsexml returns three lists
obj2 = htmlobj()
str1 = obj2.createhtmltable(list1, list2, list3)
# creates an HTML table
But when I run the script, the master script does not wait for obj1.parsexml to return its values; it immediately goes on to the next statement and creates obj2, so createhtmltable is effectively not executed because list1, list2, and list3 contain no return values. Run on its own, parsexml works fine without any error.
How can I fix this? Thank you.
Here is an example:
import Queue #python 3.x => import queue
import urllib2 #python 3.x => import urllib.request
import threading
urls = [
"http://www.apple.com",
"http://www.yahoo.com",
"http://www.google.com",
"http://www.espn.com",
]
def do_stuff():
while True:
url = task_q.get()
if url == "NO_MORE_DATA":
break
data = urllib2.urlopen(url).read()[:200]
results_q.put(data)
#Create the Queues:
task_q = Queue.Queue()
results_q = Queue.Queue()
num_worker_threads = 2
for i in range(num_worker_threads):
t = threading.Thread(target=do_stuff)
t.daemon = True #daemon threads are killed when the main thread finishes
t.start() #All threads block because the task_q is empty.
#Add the tasks/urls to the task_q:
for url in urls:
task_q.put(url)
#Like frantic dogs, all the threads are chomping on the task_q...
#Add a stop for each thread, so that after all the urls
#have been removed from the task_q, the threads will terminate:
for i in range(num_worker_threads):
task_q.put("NO_MORE_DATA")
results_count = len(urls) #Need to know how many results are expected in order to know when to stop reading from the results_q
while True:
result = results_q.get()
#process the result here
print result
print '-' * 10
results_count -= 1
if results_count == 0: #then all results have been processed, so you should stop trying to read from the results_q
break
Queue.join()
Blocks until all items in the queue have been gotten and
processed.
The count of unfinished tasks goes up whenever an item is added to the
queue. The count goes down whenever a consumer thread calls
task_done() to indicate that the item was retrieved and all work on it
is complete. When the count of unfinished tasks drops to zero, join()
unblocks.
https://docs.python.org/2.7/library/queue.html#module-Queue
Yeah, but who wants to wait until all the results have been added to the results queue before starting to process them? Why not start processing the first result as soon as it is available?
I don't know why I'm having such a problem with this. Basically, I want a "Worker" that runs constantly during the program, feeding a queue; then, every 10 seconds or so, another method called "Process" comes in and processes the data. Let's assume the following: data is captured every 10 seconds (0, 1, 2, 3, ..., n), the "Process" function receives it, processes the data, and ends, and then the "Worker" goes back to work and does its job until the program has ended.
I have the following code:
import multiprocessing as mp
import time
DELAY_SIZE = 10
def Worker(q):
print "I'm working..."
def Process(q):
print "I'm processing.."
queue = mp.Queue(maxsize=DELAY_SIZE)
p = mp.Process(target=Worker, args=(queue,))
p.start()
while True:
d = queue.get()
time.sleep(10)
Process()
In this example, it would look like the following:
I'm working...
I'm working...
I'm working...
...
...
...
I'm working...
I'm processing...
I'm processing...
I'm processing...
...
...
I'm working..
I'm working..
Any ideas?
Here is an alternative way using threads:
import threading
import Queue
import time
class Worker(threading.Thread):
def __init__(self, q):
threading.Thread.__init__(self)
self._q = q
def run(self):
# here, worker does its job
# results are pushed to the shared queue
while True:
print 'I am working'
time.sleep(1)
result = time.time() # just an example
self._q.put(result)
def process(q):
while True:
if q.empty():
time.sleep(10)
print 'I am processing'
worker_result = q.get()
# do whatever you want with the result...
print " ", worker_result
if __name__ == '__main__':
shared_queue = Queue.Queue()
worker = Worker(shared_queue)
worker.start()
process(shared_queue)
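For reference, a Python 3 sketch of the same shape (my addition) that is closer to the behaviour described in the question: the worker produces continuously, and every ten seconds the processor drains whatever has accumulated.
import queue
import threading
import time

def worker(q):
    # produce a result every second
    while True:
        print('I am working')
        q.put(time.time())
        time.sleep(1)

def process(q):
    # every 10 seconds, drain and process everything collected so far
    while True:
        time.sleep(10)
        while not q.empty():
            print('I am processing', q.get())

if __name__ == '__main__':
    shared_q = queue.Queue()
    threading.Thread(target=worker, args=(shared_q,), daemon=True).start()
    process(shared_q)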