How to download multiple files simultaneously and join them in Python? - python

I have some split files on a remote server.
I have tried downloading them one by one and joining them, but it takes a lot of time. I googled and found that simultaneous downloads might speed things up. The script is in Python.
My pseudocode is like this:
url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all download to complete
join all data and save
Could anyone point me in a direction by which I can download all the files simultaneously and wait until they are done?
I have tried creating a class, but again I can't figure out how to wait until all downloads complete.
I am more interested in the Threading and Queue features, and I can import them on my platform.
I have tried Thread and Queue with an example found on this site. Here is the code: pastebin.com/KkiMLTqR . But it either does not wait or waits forever; I'm not sure.

There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:
Multiple threads
Or multiple processes, especially if the "things" take a lot of CPU power
Or coroutines or greenlets, especially if there are thousands of "things"
Or pools of one of the above
Event loops (either coded manually, or using a framework)
Or hybrid greenlet/event loop systems like gevent.
If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.
And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.
I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)
import concurrent.futures

urls = ['http://example.com/foo',
        'http://example.com/bar']

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    result = b''.join(executor.map(download, urls))

with open('output_file', 'wb') as f:
    f.write(result)
Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time.
For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):
import urllib.request

def download(url):
    with urllib.request.urlopen(url) as f:
        return f.read()
If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.
However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.
If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.
But here's the simplest joinable threadpool:
import queue
import threading

class ThreadPool(object):
    def __init__(self, max_workers):
        self.queue = queue.Queue()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(max_workers)]
    def start(self):
        for worker in self.workers:
            worker.start()
    def stop(self):
        # one sentinel per worker so every thread exits its loop
        for _ in self.workers:
            self.queue.put(None)
        for worker in self.workers:
            worker.join()
    def submit(self, job):
        self.queue.put(job)
    def _worker(self):
        while True:
            job = self.queue.get()
            if job is None:
                break
            job()
Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:
import functools
import threading
import urllib.request

urls = ['http://example.com/foo',
        'http://example.com/bar']
results = [None] * len(urls)
results_lock = threading.Lock()

def download(url, i):
    with urllib.request.urlopen(url) as f:
        result = f.read()
    with results_lock:
        results[i] = result

pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
    pool.submit(functools.partial(download, url, i))
pool.stop()

result = b''.join(results)
with open('output_file', 'wb') as f:
    f.write(result)

You can use an async framework like Twisted.
Alternatively, this is one thing that Python's threads are OK at, since you are mostly I/O bound.
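A minimal sketch with plain threads (assuming a download(url) helper, like the one in the answer above, that returns the file contents as bytes):
import threading

def fetch_all(urls):
    results = [None] * len(urls)

    def worker(i, url):
        results[i] = download(url)   # assumed helper returning bytes

    threads = [threading.Thread(target=worker, args=(i, url))
               for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # wait until every download is done
    return b''.join(results)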

Related

Threads are not executing in parallel in Python with ThreadPoolExecutor

I'm new to Python threading and I'm experimenting with this:
When I run something in threads (whenever I print outputs), it never seems to be running in parallel. Also, my functions take the same time as before using the concurrent.futures library (ThreadPoolExecutor).
I have to calculate the gains of some attributes over a dataset (I cannot use libraries). Since I have about 1024 attributes and the function was taking about a minute to execute (and I have to use it in a for iteration), I decided to split the array of attributes into 10 (just as an example) and run the separate function gain(attribute) separately for each sub-array. So I did the following (avoiding some extra unnecessary code):
def calculate_gains(self):
    splited_attributes = np.array_split(self.attributes, 10)
    result = {}
    for atts in splited_attributes:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(self.calculate_gains_helper, atts)
            return_value = future.result()
            self.gains = {**self.gains, **return_value}
Here's the calculate_gains_helper:
def calculate_gains_helper(self, attributes):
    inter_result = {}
    for attribute in attributes:
        inter_result[attribute] = self.gain(attribute)
    return inter_result
Am I doing something wrong? I read some other older posts but I couldn't get any info.
Thanks a lot for any help!
Python threads do not run in parallel (at least in the CPython implementation) because of the GIL. Use processes and ProcessPoolExecutor to get real parallelism:
with concurrent.futures.ProcessPoolExecutor() as executor:
    ...
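As a rough sketch, the question's code might look like this with processes instead of threads (this assumes self and its attributes can be pickled, which ProcessPoolExecutor requires; the method names are taken from the question):
import concurrent.futures
import numpy as np

def calculate_gains(self):
    splited_attributes = np.array_split(self.attributes, 10)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # submit all chunks first, then collect results, so the chunks
        # actually run concurrently instead of one after another
        futures = [executor.submit(self.calculate_gains_helper, atts)
                   for atts in splited_attributes]
        for future in futures:
            self.gains = {**self.gains, **future.result()}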
You submit and then wait for each work item serially, so all the threads do is slow everything down. I can't guarantee this will speed things up much, because you are still dealing with the Python GIL, which keeps Python-level code from running in parallel, but here goes.
I've created a thread pool and pushed everything possible into the worker, including the slicing of self.attributes.
def calculate_gains(self):
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        result_list = executor.map(
            self.calculate_gains_helper,
            ((i, i + 10) for i in range(0, len(self.attributes), 10)))
        for return_value in result_list:
            self.gains = {**self.gains, **return_value}

def calculate_gains_helper(self, start_end):
    start, end = start_end
    inter_result = {}
    for attribute in self.attributes[start:end]:
        inter_result[attribute] = self.gain(attribute)
    return inter_result

Multiprocessing Queue vs. Pool

I'm having the hardest time trying to figure out the difference in usage between multiprocessing.Pool and multiprocessing.Queue.
To help, this bit of code is a barebones example of what I'm trying to do.
def update():
    def _hold(url):
        soup = BeautifulSoup(url)
        return soup
    def _queue(url):
        soup = BeautifulSoup(url)
        li = [l for l in soup.find('li')]
        return True if li else False
    url = 'www.ur_url_here.org'
    _hold(url)
    _queue(url)
I'm trying to run _hold() and _queue() at the same time. I'm not trying to have them communicate with each other so there is no need for a Pipe. update() is called every 5 seconds.
I can't really wrap my head around the difference between creating a pool of workers and creating a queue of functions. Can anyone assist me?
The real _hold() and _queue() functions are much more elaborate than in the example, so concurrent execution actually is necessary; I just thought this example would suffice for asking the question.
The Pool and the Queue belong to two different levels of abstraction.
The Pool of Workers is a concurrent design paradigm which aims to abstract a lot of logic you would otherwise need to implement yourself when using processes and queues.
The multiprocessing.Pool actually uses a Queue internally for operating.
If your problem is simple enough, you can easily rely on a Pool. In more complex cases, you might need to deal with processes and queues yourself.
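For contrast, here is a rough sketch (with placeholder bodies) of what handling processes and a queue yourself could look like for two independent jobs like _hold() and _queue():
import multiprocessing

def hold_worker(url, result_queue):
    # placeholder for the real _hold() work
    result_queue.put(('hold', 'soup for ' + url))

def queue_worker(url, result_queue):
    # placeholder for the real _queue() work
    result_queue.put(('queue', True))

def update_manually(url):
    # (on Windows you would also want an if __name__ == '__main__' guard)
    result_queue = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=hold_worker, args=(url, result_queue)),
        multiprocessing.Process(target=queue_worker, args=(url, result_queue)),
    ]
    for w in workers:
        w.start()
    # collect one result per worker; order depends on which finishes first
    results = dict(result_queue.get() for _ in workers)
    for w in workers:
        w.join()
    return results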
For your specific example, the following code should do the trick.
import multiprocessing

def hold(url):
    ...
    return soup

def queue(url):
    ...
    return bool(li)

def update(url):
    with multiprocessing.Pool(2) as pool:
        hold_job = pool.apply_async(hold, args=[url])
        queue_job = pool.apply_async(queue, args=[url])
        # block until hold_job is done
        soup = hold_job.get()
        # block until queue_job is done
        li = queue_job.get()
I'd also recommend you take a look at the concurrent.futures module. As the name suggests, it is the future-proof implementation of the Pool of Workers paradigm in Python.
You can easily rewrite the example above with that library, as what really changes is just the API names; see the sketch below.
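A minimal sketch of the same update() with concurrent.futures, assuming the hold and queue functions defined above:
from concurrent.futures import ProcessPoolExecutor

def update(url):
    with ProcessPoolExecutor(max_workers=2) as executor:
        hold_job = executor.submit(hold, url)
        queue_job = executor.submit(queue, url)
        # block until hold_job is done
        soup = hold_job.result()
        # block until queue_job is done
        li = queue_job.result()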

How can I optimize processes with a Pool or Queue in large batch processing?

I'm trying to execute a function on every line of a CSV file as fast as possible. My code works, but I know it could be faster if I make better use of the multiprocessing library.
import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    ...

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row in r:
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()
I'm thinking I should put the tasks into a Queue and process them with a Pool but all the examples make it seem like Queue doesn't work the way I assume, and that I can't map a Pool to an ever expanding Queue.
I've done something similar using a Pool of workers.
import csv
from multiprocessing import Pool, cpu_count

def initializer(arg1, arg2):
    # Do something to initialize (if necessary)
    ...

def process_csv_data(data):
    # Do something with the data
    ...

pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))
with open("csv_data_file.csv", "rb") as f:
    csv_obj = csv.reader(f)
    for row in csv_obj:
        pool.apply_async(process_csv_data, (row,))
However, as pvg commented under your question, you might want to consider how to batch your data. Going row by row may not be the right level of granularity.
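For example, a rough sketch of batching rows before handing them to the pool (it reuses pool, csv_obj, and process_csv_data from the snippet above; the chunk size of 100 is an arbitrary value to tune):
from itertools import islice

def batches(iterable, size=100):
    # Yield lists of up to `size` rows from any iterator.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_csv_batch(rows):
    for row in rows:
        process_csv_data(row)   # per-row function from the snippet above

for chunk in batches(csv_obj, size=100):
    pool.apply_async(process_csv_batch, (chunk,))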
You might also want to profile/test to figure out the bottleneck. For example, if disk access is limiting you, you might not benefit from parallelizing.
multiprocessing.Queue is a means of exchanging objects among processes, so it's not something you'd put a task into.
To me it looks like you are actually trying to speed up
def check(row):
    # do the checking
    return (row, result_of_check)

with open('twentyThousandLines.csv', 'r') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)
which can be done with
import csv
#from multiprocessing import Pool  # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool  # if IO-bound

def check(row):
    # do the checking
    return (row, result_of_check)

if __name__ == "__main__":  # in case you are using processes on Windows
    with open('twentyThousandLines.csv', 'r') as file:
        r = csv.reader(file)
        with Pool() as p:  # before Python 3.3 you should call close() and join() explicitly
            # chunksize=10 is just a guess - you have to experiment a bit to find the best value
            for row, result in p.imap_unordered(check, r, chunksize=10):
                print(row, result)
Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not totally trivial - see the programming guidelines in the docs).

Multiprocessing with python3 only runs once

I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an sqlite database and passes them to an input_queue
2. Creates multiple processes that take items off the input_queue, run them through a function, and output the result to the output queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code. I used a list of numbers as a substitute for the database entries, as it still performs the same way. I have 300 items in the list and I would like it to process all 300, but at the moment it just processes 10 (the number of processes I have assigned).
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread

## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue
    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry * 2
        self.out_queue.put(result)

# Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()
    # Create the Data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()
    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)
    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)
    data_proc.join()
    output_queue.put(None)

if __name__ == '__main__':
    output_results = Queue()
    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)
    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    # Perform multi processing
    perform_distributed_processing(dbList, 10, Processor, output_results)
    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simply Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry * 2

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)

if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import Future, ProcessPoolExecutor, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because they weren't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to an input queue, you need to write a generator that yields input to the workers, as in the sketch below.
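A minimal sketch of that idea (the function and generator here are stand-ins for the real per-entry work and the database read):
import multiprocessing

def process_entry(entry):
    return entry * 2            # stand-in for the real per-item work

def read_entries():
    # stand-in for pulling rows out of the sqlite database
    yield from range(300)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8) as pool:
        # Pool.map accepts any iterable, so the generator can feed it directly
        for result in pool.map(process_entry, read_entries()):
            print(result)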

Python multi-threading file processing

I have a few files that reside on a server. I'm trying to implement multithreading to improve performance. I read a tutorial, but have a few questions about implementing it.
Here are the files,
filelistread = ['h:\\file1.txt',
                'h:\\file2.txt',
                'h:\\file3.txt',
                'h:\\file4.txt']
filelistwrte = ['h:\\file1-out.txt', 'h:\\file2-out.txt', 'h:\\file3-out.txt', 'h:\\file4-out.txt']

def workermethod(inpfile, outfile):
    f1 = open(inpfile, 'r')
    f2 = open(outfile, 'w')
    x = f1.readlines()
    for each in x:
        f2.write(each)
    f1.close()
    f2.close()
How do I implement this using the Thread class and a Queue?
I started with the class below, but I'm not sure how to pass the inpfile and outfile to the run method. Any inputs are appreciated.
class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            item = self.queue.get()
You're mixing up two different solutions.
If you want to create a dedicated worker thread for each file, you don't need a queue for anything. If you want to create a threadpool and a queue of files, you don't want to pass inpfile and outfile to the run method; you want to put them in each job on the queue.
How do you choose between the two? Well, the first is obviously simpler, but if you have, say, 1000 files to copy, you'll end up creating 1000 threads, which is more threads than you ever want to create, and far more threads than the number of parallel copies the OS will be able to handle. A thread pool lets you create, say, 8 threads, and put 1000 jobs on a queue, and they'll be distributed to the threads as appropriate, so 8 jobs are running at a time.
Let's start with solution 1, a dedicated worker thread for each file.
First, if you aren't married to subclassing Thread, there's really no reason to do so here. You can pass a target function and an args tuple to the default constructor, and then the run method will just do target(*args), exactly as you want. So:
t = threading.Thread(target=workermethod, args=(inpfile, outfile))
That's all you need. When each thread runs, it will call workermethod(inpfile, outfile) and then exit.
However, if you do want to subclass Thread for some reason, you can. You can pass the inpfile and outfile in at construction time, and your run method would just be that workermethod modified to use self.inpfile and self.outfile instead of taking parameters. Like this:
class ThreadUrl(threading.Thread):
    def __init__(self, inpfile, outfile):
        threading.Thread.__init__(self)
        self.inpfile, self.outfile = inpfile, outfile
    def run(self):
        f1 = open(self.inpfile, 'r')
        f2 = open(self.outfile, 'w')
        x = f1.readlines()
        for each in x:
            f2.write(each)
        f1.close()
        f2.close()
Either way, I'd suggest using with statements instead of explicit open and close, and getting rid of the readlines (which unnecessarily reads the entire file into memory), unless you need to deal with really old versions of Python:
def run(self):
    with open(self.inpfile, 'r') as f1, open(self.outfile, 'w') as f2:
        for line in f1:
            f2.write(line)
Now, on to solution 2: a threadpool and a queue.
Again, you don't need a subclass here; the differences between the two ways of doing things are the same as in solution 1. But sticking with the subclass design you've started, you want something like this:
class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            inpfile, outfile = self.queue.get()
            workermethod(inpfile, outfile)
Then you start your threads by passing a single queue to all of them:
q = queue.Queue()
threadpool = [ThreadUrl(q) for i in range(poolsize)]
And submit jobs like this:
q.put((inpfile, outfile))
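To tie that together, here is a rough sketch of the complete wiring, with None sentinels added so the workers can exit and be joined (the sentinel check is an addition to the run() loop above; workermethod and the file lists come from the question):
import queue
import threading

class ThreadUrl(threading.Thread):
    def __init__(self, job_queue):
        threading.Thread.__init__(self)
        self.queue = job_queue
    def run(self):
        while True:
            item = self.queue.get()
            if item is None:        # sentinel: no more jobs
                break
            inpfile, outfile = item
            workermethod(inpfile, outfile)

poolsize = 8
q = queue.Queue()
threadpool = [ThreadUrl(q) for _ in range(poolsize)]
for t in threadpool:
    t.start()
for job in zip(filelistread, filelistwrte):
    q.put(job)
for _ in threadpool:
    q.put(None)                     # one sentinel per worker
for t in threadpool:
    t.join()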
If you're going to be doing serious work with threadpools, you may want to look into using a robust, flexible, simple, and optimized implementation instead of coding something up yourself. For example, you might want to be able to cancel jobs, shutdown the queue nicely, join the whole pool instead of joining threads one by one, do batching or smart load balancing, etc.
If you're using Python 3, you should look at the standard-library ThreadPoolExecutor. If you're stuck with Python 2, or can't figure out Futures, you might want to look at the ThreadPool class hidden inside the multiprocessing module. Both of these have the advantage that switching from multithreading to multiprocessing (if, say, it turns out that you have some CPU-bound work that needs to be parallelized along with your IO) is trivial. You can also search PyPI and you'll find multiple other good implementations.
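For example, a rough sketch using the standard-library ThreadPoolExecutor with the question's workermethod and file lists:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    # one copy job per input/output pair; the with block waits for them all
    futures = [executor.submit(workermethod, inp, out)
               for inp, out in zip(filelistread, filelistwrte)]
    for future in futures:
        future.result()   # re-raises any exception from a worker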
As a side note, you don't want to call the queue queue, because that will shadow the module name. Also, it's a bit confusing to have something called workermethod that's actually a free function rather than a method.
Finally, if all you're doing is copying the files, you probably don't want to read in text mode, or go line by line. In fact, you probably don't want to implement it yourself at all; just use the appropriate copy function from shutil. You can do that with any of the above methods very easily. For example, instead of this:
t = threading.Thread(target=workermethod, args=(inpfile, outfile))
do this:
t = threading.Thread(target=shutil.copyfile, args=(inpfile, outfile))
In fact, it looks like your whole program can be replaced by:
threads = [threading.Thread(target=shutil.copyfile, args=(inpfile, outfile))
           for (inpfile, outfile) in zip(filelistread, filelistwrte)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
