I'm using ThreadPoolExecutor to run several queries of an API simultaneously. I'd like to return the results as they become available and perform some action on them. I found this post, which details the various methods for handling results returned from ThreadPoolExecutor, quite helpful. However, it seems like my code is still waiting for all of the API queries to finish before it performs the subsequent actions I'm requesting on the returns from the API. While my code works, I feel like I'm missing something. Is this the best way, or even the correct way, to structure this?
Here is an example of my code:
futures_lst = []

with ThreadPoolExecutor(max_workers=6) as executor:
    for i in df.index:
        future = executor.submit(api_get_function, single_api_input_variable)
        futures_lst.append(future)

for future in concurrent.futures.as_completed(futures_lst):
    # First run a function to format the results returned by the API
    result_df = format_columns_in(future.result())
    # Append the formatted results to a csv file as we go
    result_df.to_csv('result.csv', mode='a', index=True, header=True)
You are using a with statement; that's why you are blocked until all futures have completed. with is a special statement that works through __enter__ and __exit__ methods: __enter__ is called when you enter the with block, and __exit__ is called automatically (by the interpreter) at the end of the with block.
When you look at the source code of the ThreadPoolExecutor class, you will see this __exit__ method (inherited from the Executor class):
def __exit__(self, exc_type, exc_val, exc_tb):
    self.shutdown(wait=True)
    return False
And when we look at the description of the shutdown method, it says:
If wait is True then this method will not return until all the pending
futures are done executing and the resources associated with the
executor have been freed
That's why your code "is still waiting for all of the API queries to finish before it performs the subsequent actions".
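In other words, the with block in your code behaves roughly like this expanded sketch (simplified for illustration, not the exact library code):
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

futures_lst = []
executor = ThreadPoolExecutor(max_workers=6)
try:
    for i in df.index:
        future = executor.submit(api_get_function, single_api_input_variable)
        futures_lst.append(future)
finally:
    # __exit__ runs here, before the as_completed loop ever starts,
    # and blocks until every submitted future has finished
    executor.shutdown(wait=True)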
To solve this, you have some options. The first option is to remove the with statement. By doing this, you will not be blocked:
futures_lst = []
executor = ThreadPoolExecutor(max_workers=6)

for i in df.index:
    future = executor.submit(api_get_function, single_api_input_variable)
    futures_lst.append(future)

for future in concurrent.futures.as_completed(futures_lst):
    # First run a function to format the results returned by the API
    result_df = format_columns_in(future.result())
    # Append the formatted results to a csv file as we go
    result_df.to_csv('result.csv', mode='a', index=True, header=True)
But the option above does not process the results concurrently: you loop over the futures as they complete, but what if two of them complete at the same time?
If you want to do this post-processing concurrently (not in parallel), you can also use the option below:
def custom_callback(future):
    result_df = format_columns_in(future.result())
    # Append the formatted results to a csv file as we go
    result_df.to_csv('result.csv', mode='a', index=True, header=True)

with ThreadPoolExecutor(max_workers=6) as executor:
    for i in df.index:
        future = executor.submit(api_get_function, single_api_input_variable)
        future.add_done_callback(custom_callback)
This solution is concurrent: whenever a thread finishes its job, it calls the custom_callback function.
But this option is not thread-safe as written, because you are appending to a file from multiple callbacks, and I don't know what format_columns_in does. result_df.to_csv requires extra locking (format_columns_in may also require locking, depending on what it does). Appending to a file can also be made thread-safe; for more detail, please look at this. You can make this option thread-safe like so:
import threading

global_lock = threading.Lock()

def custom_callback(future):
    result_df = format_columns_in(future.result())
    # Append the formatted results to a csv file as we go,
    # holding the lock so only one callback writes at a time
    with global_lock:
        result_df.to_csv('result.csv', mode='a', index=True, header=True)

with ThreadPoolExecutor(max_workers=6) as executor:
    for i in df.index:
        future = executor.submit(api_get_function, single_api_input_variable)
        future.add_done_callback(custom_callback)
For more info about the add_done_callback() method, see the documentation.
I am trying to use ThreadPoolExecutor() in a method of a class to create a pool of threads that will execute another method within the same class. I have the with concurrent.futures.ThreadPoolExecutor()... block, however it does not seem to wait, and an error is thrown saying there was no key in the dictionary I query after the with... statement. I understand why the error is thrown: the dictionary has not been updated yet because the threads in the pool did not finish executing. I know the threads did not finish executing because I have a print("done") in the method that is called within the ThreadPoolExecutor, and "done" is not printed to the console.
I am new to threads, so any suggestions on how to do this better are appreciated!
def tokenizer(self):
    all_tokens = []
    self.token_q = Queue()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for num in range(5):
            executor.submit(self.get_tokens, num)
        executor.shutdown(wait=True)
        print("Hi")
    results = {}
    while not self.token_q.empty():
        temp_result = self.token_q.get()
        results[temp_result[1]] = temp_result[0]
        print(temp_result[1])
    for index in range(len(self.zettels)):
        for zettel in results[index]:
            all_tokens.append(zettel)
    return all_tokens
def get_tokens(self, thread_index):
    print("!!!!!!!")
    switch = {
        0: self.zettels[:(len(self.zettels)/5)],
        1: self.zettels[(len(self.zettels)/5): (len(self.zettels)/5)*2],
        2: self.zettels[(len(self.zettels)/5)*2: (len(self.zettels)/5)*3],
        3: self.zettels[(len(self.zettels)/5)*3: (len(self.zettels)/5)*4],
        4: self.zettels[(len(self.zettels)/5)*4: (len(self.zettels)/5)*5],
    }
    new_tokens = []
    for zettel in switch.get(thread_index):
        tokens = re.split('\W+', str(zettel))
        tokens = list(filter(None, tokens))
        new_tokens.append(tokens)
    print("done")
    self.token_q.put([new_tokens, thread_index])
Expected to see all print("!!!!!!") and print("done") statements before the print("Hi") statement.
Actually it shows the !!!!!!!, then the Hi, then a KeyError for the results dictionary.
As you have already found out, the pool is waiting; print('done') is never executed because, presumably, a TypeError is raised earlier (len(self.zettels)/5 produces a float, and slice indices must be integers).
The pool does not directly wait for the tasks to finish, it waits for its worker threads to join, which implicitly requires the execution of the tasks to complete, one way (success) or the other (exception).
The reason you do not see that exception being raised is that the task is wrapped in a Future. A Future
[...] encapsulates the asynchronous execution of a callable.
Future instances are returned by the executor's submit method, and they allow you to query the state of the execution and access whatever its outcome is.
That brings me to some remarks I wanted to make.
The Queue in self.token_q seems unnecessary
Judging by the code you shared, you only use this queue to pass the results of your tasks back to the tokenizer function. That's not needed; you can access them from the Future that the call to submit returns:
def tokenizer(self):
    all_tokens = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(self.get_tokens, num) for num in range(5)]
        # executor.shutdown(wait=True) here is redundant, it is called when exiting the context:
        # https://github.com/python/cpython/blob/3.7/Lib/concurrent/futures/_base.py#L623
        print("Hi")
        results = {}
        for fut in futures:
            try:
                res = fut.result()
                results[res[1]] = res[0]
            except Exception:
                continue
    [...]

def get_tokens(self, thread_index):
    [...]
    # instead of self.token_q.put([new_tokens, thread_index])
    return new_tokens, thread_index
It is likely that your program does not benefit from using threads
From the code you shared, it seems like the operations in get_tokens are CPU bound, rather than I/O bound. If you are running your program in CPython (or any other interpreter using a Global Interpreter Lock), there will be no benefit from using threads in that case.
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
That means for any Python process, only one thread can execute at any given time. This is not so much of an issue if your task at hand is I/O bound, i.e. frequently pauses to wait for I/O (e.g. for data on a socket). If your tasks need to constantly execute bytecode in a processor, there's no benefit for pausing one thread to let another execute some instructions. In fact, the resulting context switches might even prove detrimental.
You might want to go for parallelism instead of concurrency; take a look at ProcessPoolExecutor for this. However, I recommend benchmarking your code running sequentially, concurrently, and in parallel. Creating processes or threads comes at a cost and, depending on the task to complete, doing so might take longer than just executing one task after the other in a sequential manner.
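A rough benchmarking sketch could look like the following. Here tokenize_chunk is a hypothetical module-level stand-in for get_tokens (ProcessPoolExecutor needs a picklable callable); the numbers you get will depend entirely on your data.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def tokenize_chunk(num):
    # Hypothetical stand-in for get_tokens(num); replace with the real work
    return [str(num)] * 1000

def bench(label, runner):
    # Time a single run of the given callable
    start = time.perf_counter()
    runner()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

def sequential():
    for num in range(5):
        tokenize_chunk(num)

def threaded():
    with ThreadPoolExecutor(max_workers=5) as ex:
        list(ex.map(tokenize_chunk, range(5)))

def in_processes():
    with ProcessPoolExecutor(max_workers=5) as ex:
        list(ex.map(tokenize_chunk, range(5)))

if __name__ == '__main__':
    bench("sequential", sequential)
    bench("threads", threaded)
    bench("processes", in_processes)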
As an aside, this looks a bit suspicious:
for index in range(len(self.zettels)):
    for zettel in results[index]:
        all_tokens.append(zettel)
results seems to always have five items, because of for num in range(5). If the length of self.zettels is greater than five, I'd expect a KeyError to be raised here. If self.zettels is guaranteed to have a length of five, then I'd see potential for some code optimization here, as sketched below.
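For example, assuming results always holds exactly the five lists keyed 0 through 4, the nested loop above could be collapsed into a single flattening step:
from itertools import chain

# Flatten the five per-thread token lists in index order
all_tokens = list(chain.from_iterable(results[i] for i in range(5)))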
You need to loop over concurrent.futures.as_completed() as shown here. It will yield values as each thread completes.
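A sketch of what that could look like for the tokenizer above, assuming get_tokens is changed to return (new_tokens, thread_index) instead of putting the pair on the queue:
import concurrent.futures

def tokenizer(self):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(self.get_tokens, num) for num in range(5)]
        # as_completed yields each future as soon as its thread finishes
        for fut in concurrent.futures.as_completed(futures):
            new_tokens, thread_index = fut.result()
            results[thread_index] = new_tokens
    ...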
I have a job that uses the multiprocessing package and calls a function via
resultList = pool.map(myFunction, myListOfInputParameters).
Each entry of the list of input parameters is independent from others.
This job will run for a couple of hours. For safety reasons, I would like to store the intermediate results at regular time intervals, e.g. once an hour.
How can I do this, and how can I continue the processing from the last available backup when the job was aborted and I want to restart it?
Perhaps use pickle. Read more here:
https://docs.python.org/3/library/pickle.html
Based on aws_apprentice's comment I created a full multiprocessing example in case you weren't sure how to use intermediate results. The first time this is run it will print "None" as there are no intermediate results. Run it again to simulate restarting.
from multiprocessing import Process
import pickle

def proc(name):
    data = None

    # Load intermediate results if they exist
    try:
        f = open(name + '.pkl', 'rb')
        data = pickle.load(f)
        f.close()
    except:
        pass

    # Do something
    print(data)
    data = "intermediate result for " + name

    # Periodically save your intermediate results
    f = open(name + '.pkl', 'wb')
    pickle.dump(data, f, -1)
    f.close()

processes = []
for x in range(5):
    p = Process(target=proc, args=("proc" + str(x),))
    p.daemon = True
    p.start()
    processes.append(p)

for process in processes:
    process.join()

for process in processes:
    process.terminate()
You can also use json if it makes sense to output intermediate results in a human-readable format, or sqlite as a database if you need to push data into rows.
There are at least two possible options.
Have each call of myFunction save its output into a uniquely named file. The file name should be based on or linked to the input data. Use the parent program to gather the results. In this case myFunction should return an identifier of the item that is finished.
Use imap_unordered instead of map. This will start yielding results as soon as they are available, instead of returning only when all processing is finished. Have the parent program save the returned data and an indication of which items are finished.
In both cases, the program would have to examine the data saved from previous runs to adjust myListOfInputParameters when it is being re-started.
Which option is best depends to a large degree on the amount of data returned by myFunction. If this is a large amount, there is a significant overhead associated with transferring it back to the parent. In that case option 1 is probably best.
Since writing to disk is relatively slow, calculations will probably go faster with option 2, and it is easier for the parent program to track progress.
Note that you can also use imap_unordered with option 1.
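A rough sketch of option 2 with periodic checkpointing might look like this. The checkpoint file name, the one-hour interval, and the assumption that myFunction returns an (input, result) pair are illustrative choices, not taken from the question:
import pickle
import time
from multiprocessing import Pool

CHECKPOINT = 'results_checkpoint.pkl'   # assumed file name
SAVE_INTERVAL = 3600                    # save roughly once an hour

def run_with_checkpoints(myFunction, myListOfInputParameters):
    # Load whatever a previous, aborted run managed to save
    try:
        with open(CHECKPOINT, 'rb') as f:
            done = pickle.load(f)
    except (FileNotFoundError, EOFError):
        done = {}

    # Only process the inputs that are not finished yet (inputs must be hashable)
    todo = [p for p in myListOfInputParameters if p not in done]

    last_save = time.monotonic()
    with Pool() as pool:
        # imap_unordered yields results as soon as they are available;
        # here myFunction is assumed to return (input, result)
        for param, result in pool.imap_unordered(myFunction, todo):
            done[param] = result
            if time.monotonic() - last_save > SAVE_INTERVAL:
                with open(CHECKPOINT, 'wb') as f:
                    pickle.dump(done, f)
                last_save = time.monotonic()

    # Final save once everything has finished
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(done, f)
    return done
As with any multiprocessing code, this would be called from under an if __name__ == '__main__': guard.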
I need to parse around 1000 URLs. So far I have a function that returns a pandas dataframe after parsing a URL. How should I best structure the program so I can add all the dataframes together? I'm also unsure how to return values from the 'futures'. In the example below, how can I eventually merge all the temp dataframes into a single dataframe (i.e. finalDF = finalDF.append(temp))?
import concurrent.futures

def Parser(ptf):
    temp = pd.DataFrame()
    URL = "http://" + str(URL)
    # ..some complex operations, including a requests.get(URL), which eventually returns temp: a pandas dataframe
    return temp  # returns a pandas dataframe

def conc_caller(ptf):
    temp = Parser(ptf)
    # this won't work because finalDF is not defined, unclear how to handle this
    finalDF = finalDF.append(temp)
    return df

booklist = ['a', 'b', 'c']
finalDF = pd.DataFrame()

executor = concurrent.futures.ProcessPoolExecutor(3)
futures = [executor.submit(conc_caller, item) for item in booklist]
concurrent.futures.wait(futures)
Another problem is that I get the error message:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
Any suggestions on how to fix the code are appreciated.
You have to protect your launch code with if __name__ == '__main__': to prevent creating processes forever. The guard should wrap everything from creating the executor through the call to concurrent.futures.wait(futures).
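Applied to the code above, a guarded version could look roughly like this. Collecting each Parser result in the parent via future.result() is one way to build finalDF; the function and variable names come from the question, the rest is a sketch rather than a drop-in fix:
import concurrent.futures
import pandas as pd

def conc_caller(ptf):
    # Return the parsed frame to the parent instead of appending to a global
    return Parser(ptf)

if __name__ == '__main__':
    booklist = ['a', 'b', 'c']
    with concurrent.futures.ProcessPoolExecutor(3) as executor:
        futures = [executor.submit(conc_caller, item) for item in booklist]
        frames = [f.result() for f in concurrent.futures.as_completed(futures)]
    finalDF = pd.concat(frames, ignore_index=True)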
I'm doing some file parsing that is a CPU bound task. No matter how many files I throw at the process it uses no more than about 50MB of RAM.
The task is parallelisable, and I've set it up to use concurrent.futures below to parse each file in a separate process:
from concurrent import futures

with futures.ProcessPoolExecutor(max_workers=6) as executor:
    # A dictionary which will contain the future info in the key, and the filename in the value
    jobs = {}

    # Loop through the files, and run the parse function for each file, sending the file-name to it.
    # The results can come back in any order.
    for this_file in files_list:
        job = executor.submit(parse_function, this_file, **parser_variables)
        jobs[job] = this_file

    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        # Send the result of the file the job is based on (jobs[job]) and the job (job.result)
        results_list = job.result()
        this_file = jobs[job]
        # delete the result from the dict as we don't need to store it.
        del jobs[job]
        # post-processing (putting the results into a database)
        post_process(this_file, results_list)
The problem is that when I run this using futures, RAM usage rockets and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_processing, the application has no further need of them. As you can see, I'm trying del jobs[job] to clear items out of jobs, but this has made no difference, memory usage remains unchanged, and seems to increase at the same rate.
I've also confirmed it's not because it's waiting on the post_process function by only using a single process, plus throwing in a time.sleep(1).
There's nothing in the futures docs about memory management, and while a brief search indicates it has come up before in real-world applications of futures (Clear memory in python loop and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures) - the answers don't translate to my use-case (they're all concerned with timeouts and the like).
So, how do you use Concurrent futures without running out of RAM?
(Python 3.5)
I'll take a shot (Might be a wrong guess...)
You might need to submit your work bit by bit, since on each submit you're making a copy of parser_variables, which may end up chewing up your RAM.
Here is working code with "<----" on the interesting parts
with futures.ProcessPoolExecutor(max_workers=6) as executor:
    # A dictionary which will contain the future info in the key, and the filename in the value
    jobs = {}

    # Loop through the files, and run the parse function for each file, sending the file-name to it.
    # The results can come back in any order.
    files_left = len(files_list)  # <----
    files_iter = iter(files_list)  # <------

    while files_left:
        for this_file in files_iter:
            job = executor.submit(parse_function, this_file, **parser_variables)
            jobs[job] = this_file
            if len(jobs) > MAX_JOBS_IN_QUEUE:
                break  # limit the job submission for now

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):
            files_left -= 1  # one down - many to go... <---
            # Send the result of the file the job is based on (jobs[job]) and the job (job.result)
            results_list = job.result()
            this_file = jobs[job]
            # delete the result from the dict as we don't need to store it.
            del jobs[job]
            # post-processing (putting the results into a database)
            post_process(this_file, results_list)
            break  # give a chance to add more jobs <-----
Try adding del to your code like this:
for job in futures.as_completed(jobs):
    del jobs[job]  # or `val = jobs.pop(job)`
    # del job  # or `job._result = None`
Looking at the concurrent.futures.as_completed() function, I learned it is enough to ensure there is no longer any reference to the future. If you dispense this reference as soon as you've got the result, you'll minimise memory usage.
I use a generator expression for storing my Future instances, because everything I care about is already returned by the future in its result (basically, the status of the dispatched work). Other implementations use a dict, for example in your case, because you don't return the input filename as part of the thread worker's result.
Using a generator expression means once the result is yielded, there is no longer any reference to the Future. Internally, as_completed() has already taken care of removing its own reference, after it yielded the completed Future to you.
futures = (executor.submit(thread_worker, work) for work in workload)

for future in concurrent.futures.as_completed(futures):
    output = future.result()
    ...  # on next loop iteration, garbage will be collected for the result data, too
Edit: Simplified from using a set and removing entries, to simply using a generator expression.
Same problem for me.
In my case I need to start millions of threads. For Python 2, I would write a thread pool myself using a dict. But in Python 3 I encountered the following error when I deleted finished threads from the dict dynamically:
RuntimeError: dictionary changed size during iteration
So I have to use concurrent.futures; at first I coded it like this:
from concurrent.futures import ThreadPoolExecutor
......

if __name__ == '__main__':
    all_resouces = get_all_resouces()
    with ThreadPoolExecutor(max_workers=50) as pool:
        for r in all_resouces:
            pool.submit(handle_resource, *args)
But soon memory was exhausted, because memory is released only after all the threads have finished. I needed to delete finished threads before too many new threads were started. So I read the docs here: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
I found that Executor.shutdown(wait=True) might be what I need.
And this is my final solution:
from concurrent.futures import ThreadPoolExecutor
......

if __name__ == '__main__':
    all_resouces = get_all_resouces()
    i = 0
    while i < len(all_resouces):
        with ThreadPoolExecutor(max_workers=50) as pool:
            for r in all_resouces[i:i+1000]:
                pool.submit(handle_resource, *args)
        i += 1000
You can avoid having to call this method explicitly if you use the with statement, which will shutdown the Executor (waiting as if Executor.shutdown() were called with wait set to True)
I am a bit new to Python.
My current code downloads a csv file and imports it into Cassandra as a single thread. Is there a way to create 5 or 10 threads to split the csv file (by rows), read it in parallel and insert the rows into Cassandra, one row per thread? I am trying to build an equity trading database to store all tick data, so I am looking for ways to improve the performance of the code and methods. Please just ignore me if the question sounds a bit silly.
conn = requests.get(url, stream=True)
if conn.status_code == 200:
    zfile = zipfile.ZipFile(io.BytesIO(conn.content))
    zfile.extractall()

with open(csv_file) as csv_d:
    csv_content = csv.reader(csv_d)
    for row in csv_content:
        symbol = row[0]
        stype = row[1]
        openp = row[2]
        highp = row[3]
        lowp = row[4]
        closep = row[5]
        vol = row[8]
        dtime = row[10]
        cassa.main('load', symbol, dtime, stype, openp, highp, lowp, closep, vol)

csv_d.close()
os.remove(csv_file)
logging.info("csv file processed succesfully")
Thanks & Regards
If you happen to use the DataStax Python driver, it gives you an async API in addition to the sync API. Using the async API you can try out a series of different approaches:
batched futures: start a number of async queries in parallel and wait for them to complete; repeat
queued futures: add futures to a queue; each time you add a new future to the queue, wait for the oldest one to complete
You can find a couple more ideas on how to approach this in this doc.
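For example, the "batched futures" approach with the DataStax driver might be sketched like this. The contact point, keyspace, table and column layout are made up for illustration; session.execute_async returns a future whose result() blocks until that insert has completed:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])        # assumed contact point
session = cluster.connect('tickdata')   # assumed keyspace

insert = session.prepare(
    "INSERT INTO ticks (symbol, dtime, stype, openp, highp, lowp, closep, vol) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
)

BATCH_SIZE = 100
futures = []
for row in csv_content:
    futures.append(session.execute_async(
        insert, (row[0], row[10], row[1], row[2], row[3], row[4], row[5], row[8])
    ))
    if len(futures) >= BATCH_SIZE:
        for f in futures:     # wait for this batch before submitting the next one
            f.result()
        futures = []

for f in futures:             # drain whatever is left in the final, partial batch
    f.result()
Preparing the statement once and reusing it keeps the per-row overhead low.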
The way I would do this in java (and I think python would be similar) is to use a worker thread pool. You would read the csv file in a single thread as you are doing, but then in the for loop you would dispatch each row to a thread in the thread pool.
The worker threads would do a synchronous insert of their single row and return.
The size of the thread pool controls how many inserts you would have running in parallel. Up to a point, the bigger the worker pool, the faster the import of the whole file will happen (limited by the maximum throughput of the cluster).
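In Python, that worker-pool idea could be sketched with ThreadPoolExecutor, reusing cassa.main, csv_file and the column order from the question; the pool size of 10 is just an example:
import csv
from concurrent.futures import ThreadPoolExecutor

def insert_row(row):
    # Each worker performs one synchronous insert, just like the sequential version
    cassa.main('load', row[0], row[10], row[1], row[2], row[3], row[4], row[5], row[8])

with open(csv_file) as csv_d:
    csv_content = csv.reader(csv_d)
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(insert_row, row) for row in csv_content]

# Surface any insert errors after the pool has drained
for f in futures:
    f.result()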
Another way is to use a single thread and use the asynchronous mode to do the inserts. In java it's called executeAsync and this sends the CQL statement to Cassandra and returns immediately without blocking so that you get the same effect of lots of inserts running in parallel.
You could also look into using the "COPY ... FROM 'file.csv';" CQL command.