I am working on an assignment in a class that I now realize may be a little out of my reach (this is the first semester I have done any programming).
The stipulation is that I use parallel programming with MPI.
I have to read a CSV file of up to a terabyte of tick data (one tick every microsecond) that may be locally out of order, run a process on the data to identify noise, and output a cleaned data file.
I have written a serial program using Pandas that takes the data, determines significant outliers, and writes them to a dataset labeled noise; it then creates the final dataset by subtracting the noise from the original based on the index (time).
I have no idea where to start parallelizing the program. I understand that because my computations are all local, I should import from CSV in parallel and run the process to identify noise.
I believe the best way to do this (and I may be completely wrong) is to scatter, run the computation, and gather using HDF5. But I do not know how to implement this.
I do not want someone to write the entire code, but maybe a specific example of importing in parallel from CSV and regathering the data, or a better approach to the problem.
If you can boil down your program to a function to run against a list of rows, then yes a simple multiprocessing approach would be easy and effective. For instance:
from multiprocessing import Pool
def clean_tickData(row):
    <your code>
if __name__ == '__main__':
    pool = Pool()
    # csv_rows: the list of rows read from your CSV file
    cleaned = pool.map(clean_tickData, csv_rows)
    pool.close()
    pool.join()
Pool.map runs in parallel. You can control the number of parallel processes, but the default, an empty Pool() call, starts as many processes as you have CPU cores. So, if you reduce your clean-up work to a function that can be run over the rows in your CSV, pool.map is an easy and fast implementation.
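Mapping over single rows pays the inter-process overhead once per row, so for a large file it is usually better to hand each worker a chunk of rows. Here is a minimal sketch of that idea using only the standard library (not MPI and not Pandas). Everything in it is an assumption for illustration: the two-column timestamp,price layout, the names read_chunks, clean_chunk and clean_parallel, and the median-based noise rule. Sorting inside a chunk only repairs disorder that does not cross a chunk boundary.

```python
import csv
import io
import statistics
from multiprocessing import Pool


def read_chunks(lines, chunk_size):
    """Yield lists of (timestamp, price) rows, chunk_size rows at a time."""
    chunk = []
    for timestamp, price in csv.reader(lines):
        chunk.append((int(timestamp), float(price)))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def clean_chunk(chunk, max_deviation=5.0):
    """Drop rows far from the chunk median (hypothetical noise rule), then sort by time."""
    median = statistics.median(price for _, price in chunk)
    kept = [row for row in chunk if abs(row[1] - median) <= max_deviation]
    return sorted(kept)


def clean_parallel(lines, chunk_size=1000, processes=None):
    """Clean the chunks in worker processes and yield the rows in chunk order."""
    with Pool(processes) as pool:
        # imap returns results in the same order as the input chunks.
        for cleaned in pool.imap(clean_chunk, read_chunks(lines, chunk_size)):
            yield from cleaned


if __name__ == "__main__":
    sample = io.StringIO("2,100.0\n1,100.5\n3,250.0\n4,101.0\n")
    for row in clean_parallel(sample, chunk_size=4, processes=2):
        print(row)
```

In a real run you would pass an open file object instead of the in-memory sample and write the cleaned rows out as they arrive.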
I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once: we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent, multiprocessing is fine also. It's a similar problem, it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out.
Maybe joblib's documentation is not explicit enough on this point, since it shows only examples with for loops. Since you want to use a generator, I should point out that it does work with generators. An example that would achieve what you want is the following:
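from joblib import Parallel, delayed
def my_long_running_job(x):
    # do something with x
    return x
# you can customize the number of jobs
# note: by default Parallel collects every result into a list, so the generator must eventually stop
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())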
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you will still benefit from true parallelism.
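If the slow calculation does release the GIL, the thread-based variant can be sketched with only the standard library. This is a minimal sketch, not joblib's API: prefetch and slow_values are hypothetical names. The bounded queue.Queue is what makes the producer wait until there is room before calculating more results, and because threads share memory, the large arrays are never pickled.

```python
import itertools
import queue
import threading


def prefetch(source, maxsize=4):
    """Iterate `source` in a background thread, buffering at most `maxsize` items."""
    buffer = queue.Queue(maxsize=maxsize)
    stop = threading.Event()
    done = object()  # sentinel: the source is exhausted (or failed)

    def put(item):
        # Wait while the buffer is full, but give up once the consumer has stopped.
        while not stop.is_set():
            try:
                buffer.put(item, timeout=0.1)
                return True
            except queue.Full:
                continue
        return False

    def producer():
        try:
            for item in source:
                if not put(item):
                    return
        finally:
            put(done)  # always unblock the consumer, even if the source raised

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            item = buffer.get()
            if item is done:
                return
            yield item
    finally:
        stop.set()  # runs when the consumer finishes or closes this generator


def slow_values():
    # Stand-in for the endless slow calculation in the question.
    for n in itertools.count():
        yield n * n


for value in itertools.islice(prefetch(slow_values(), maxsize=2), 5):
    print(value)
```

For CPU-bound pure-Python work the same shape would need a process and a multiprocessing.Queue instead, which brings the pickling cost back.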
A quick question about parallel processing in Python. Let's say I have a big shared data structure and want to apply many functions to it in parallel. These functions only read the data structure, but they mutate a result object:
def compute_heavy_task(self):
    big_shared_object = self.big_shared_object
    result_refs = self.result_refs
    for ref in result_refs:
        some_expensive_task(ref, big_shared_object)
How do I do these in parallel, say 5 at a time, or 10 at a time? How about as many at a time as there are processors?
You cannot usefully do this with threads in Python (at least not the CPython implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, you only get 90%.
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler in simple cases and often easier to compose; multiprocessing is more flexible and powerful in general. futures also only comes with Python 3.2 or later, but there's a backport for 2.5-3.1 at PyPI.
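For the simple case, a minimal concurrent.futures sketch could look like the following. It assumes each task can be handed everything it needs as an argument, so nothing is shared; some_expensive_task here is a stand-in that just burns CPU, and run_all is a hypothetical helper name. max_workers answers the "5 at a time" part of the question, and leaving it as None lets the executor default to the number of processors.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def some_expensive_task(ref):
    # Stand-in for the real work: sum of squares below `ref`.
    return sum(i * i for i in range(ref))


def run_all(result_refs, max_workers=None):
    """Run the tasks in worker processes, at most max_workers at a time."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map returns the results in the same order as result_refs.
        return list(executor.map(some_expensive_task, result_refs))


if __name__ == "__main__":
    refs = [10_000, 20_000, 30_000]
    print(run_all(refs, max_workers=5))               # 5 at a time
    print(run_all(refs, max_workers=os.cpu_count()))  # one per logical CPU
```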
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, this is pretty simple:
import multiprocessing
class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        self.big_shared_object = multiprocessing.Array('i', giant_iterator_of_ints)
    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        # Sketch only: Pool pickles the callable it is given, so in real code subtask must be a
        # module-level function, and the Array and Lock must be handed to the workers when the pool starts.
        def subtask(my_range):
            return some_expensive_task(self.big_shared_object, lock, my_range)
        pool = multiprocessing.Pool(5)
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map_async(subtask, my_ranges)
        pool.close()
        pool.join()
        return results.get()
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides being usually simple to describe, if you're careful with this, with the right kinds of algorithms, you can even avoid a lot of locking this way.)
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions, calling apply_async on each one. But you can also just turn it around: write a single applier function, as a closure around the data, that takes a function and applies it to the data. Then, just map that function over the collection of functions.
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures or vice-versa, and then use the multiprocessing.sharedctypes stuff.
I've also moved the result object into results that just get passed back. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks, making them more or less granular, etc. If you can get away with passing immutable data structures around, or work on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.
Despite the warnings and confused feelings I got from the ton of questions that have been asked on the subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returned the resulting NumPy array for each computation and updated a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.
Now, I probably didn't get 8x because of the lock required by each callback call, but what I got is encouraging.
I'm trying to find out if this can be improved upon, or if this is a good result. Questions:
I suppose the returned NumPy arrays got pickled?
Were the underlying NumPy buffers copied or just passed by reference?
How can I find out what the bottleneck is? Any particularly useful technique?
Can I improve on that or is such an improvement pretty common in such cases?
I've had great success sharing large NumPy arrays (by reference, of course) between multiple processes using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when passing around NumPy arrays. All you have to do is, instead of:
import numpy as np
foo = np.empty(1000000)
do this:
import sharedmem
foo = sharedmem.empty(1000000)
and off you go passing foo from one process to another, like:
q = multiprocessing.Queue()
...
q.put(foo)
Note however, that this module has a known possibility of a memory leak upon ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies.
Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock.)
Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.
How I did it
It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, so I'm doing this:
First, on Unix systems, any object you instantiate before spawning a process will be present (and copied) in the context of the process if needed. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.
Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.
Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).
Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.
(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)
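The shape of that pipeline can be sketched as follows. This is a simplified stand-in, not my actual code: load_item fakes reading an image inside the worker, the "images" are short lists rather than NumPy arrays, and the partial results come back through pool.map's return value instead of an explicit Queue. All function names are hypothetical.

```python
from multiprocessing import Pool


def load_item(name):
    # Hypothetical loader: stands in for reading one image from disk in the worker.
    return [len(name)] * 4


def map_and_reduce(names):
    """Load and process every item of one batch, reducing locally to one partial result."""
    partial = [0] * 4
    for name in names:
        pixels = load_item(name)
        partial = [a + b for a, b in zip(partial, pixels)]
    return partial


def final_reduce(partials):
    """Combine the per-worker partial results in the main process."""
    return [sum(column) for column in zip(*partials)]


def split(names, n_batches):
    """Deal the names out into n_batches roughly equal batches."""
    return [names[i::n_batches] for i in range(n_batches)]


if __name__ == "__main__":
    names = ["a.png", "bb.png", "ccc.png", "dddd.png"]
    with Pool(2) as pool:
        partials = pool.map(map_and_reduce, split(names, 2))
    print(final_reduce(partials))
```

Only the file names go to the workers and only one partial result per batch comes back, which is why the transfer cost stays small.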
Profiling
Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:
Compared to the built-in (at least in my shell) time command, the time executable (/usr/bin/time in Ubuntu) gives out much more information, including things such as average RSS, context switches, average %CPU, and so on. I run it like this to get everything I can:
$ /usr/bin/time -v python test.py
Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.
I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on and off; you never know when you might need it.
Last but not least, you can get some more or less useful information from a syscall profile (mostly to check that the OS isn't taking ages transferring heaps of data between the processes), by attaching to one of your running Python processes like this:
$ sudo strace -c -p <python-process-id>
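The per-process cProfile hook with a DEBUG_PROFILE switch can be sketched like this. It is one possible way to do it, not necessarily what the linked answer does: the profiled decorator, the profile_path helper and the file naming are my own assumptions, and work is a stand-in workload.

```python
import cProfile
import functools
import multiprocessing
import os

DEBUG_PROFILE = True  # hypothetical switch; set to False to skip profiling


def profile_path():
    """File that the current process writes its profile to."""
    return "profile-%d.prof" % os.getpid()


def profiled(func):
    """Run func under cProfile and dump the stats to a per-process file."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not DEBUG_PROFILE:
            return func(*args, **kwargs)
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(func, *args, **kwargs)
        finally:
            profiler.dump_stats(profile_path())
    return wrapper


@profiled
def work(n):
    # Stand-in for the real per-process workload.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    workers = [multiprocessing.Process(target=work, args=(200000,)) for _ in range(2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Inspect a dump afterwards with: python -m pstats profile-<pid>.prof
```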
I have a loop of intensive calculations. Since they are independent, I want to speed them up by running them all in parallel on a multicore processor. What is the easiest way to do that in Python?
Let's imagine that the results of those calculations have to be summed at the end. How can I easily add them to a list or a float variable?
Thanks for all your pedagogic answers and using python libraries ;o)
From my experience, multi-threading is probably not going to be a viable option for speeding things up (due to the Global Interpreter Lock).
A good alternative is the multiprocessing module. This may or may not work well, depending on how much data you end up having to pass around from one process to another.
Another good alternative would be to consider using numpy for your computations (if you aren't already). If you can vectorize your code, you should be able to achieve significant speedups even on a single core. Depending on what exactly you're doing and on your build of numpy, it might even be able to transparently distribute the computations across multiple cores.
Edit: here is a complete example of using the multiprocessing module to perform a simple computation. It uses four processes to compute the squares of the numbers from zero to nine.
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    inputs = range(10)
    result = pool.map(f, inputs)
    print(result)
This is meant as a simple illustration. Given the trivial nature of f(), this parallel version will almost certainly be slower than computing the same thing serially.
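To sum the results, and to give each task enough work to be worth shipping to another process, one option is to have every worker sum its own block of inputs and then add up the partial sums in the parent. A minimal sketch, where chunk_sum, make_bounds and parallel_total are hypothetical helper names and the block size is an arbitrary choice:

```python
from multiprocessing import Pool


def f(x):
    return x * x


def chunk_sum(bounds):
    """Sum f(x) over one block of inputs inside a worker."""
    start, stop = bounds
    return sum(f(x) for x in range(start, stop))


def make_bounds(n, chunk):
    """Split range(n) into (start, stop) blocks of at most `chunk` items."""
    return [(i, min(i + chunk, n)) for i in range(0, n, chunk)]


def parallel_total(n, chunk=250_000, processes=4):
    with Pool(processes=processes) as pool:
        # The order of the partial sums does not matter for a total.
        return sum(pool.imap_unordered(chunk_sum, make_bounds(n, chunk)))


if __name__ == "__main__":
    print(parallel_total(1_000_000))
```

If you need the individual results rather than the total, pool.map already returns them as a list.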
Multicore processing is a bit difficult to do in CPython (thanks to the GIL). However, there is the multiprocessing module, which lets you use subprocesses (not threads) to split your work across multiple cores.
The module is relatively straightforward to use as long as your code can really be split into multiple parts and doesn't depend on shared objects. The linked documentation should be a good starting point.
This time I'm facing a "design" problem. Using Python, I have to implement a mathematical algorithm which uses 5 parameters. To find the best combination of these 5 parameters, I used a 5-layer nested loop to enumerate all possible combinations in a given range. It takes far longer to finish than I expected. So I think it's time to use multithreading...
The task at the core of the nested loops is calculation and saving. In the current code, the result of every calculation is appended to a list, and the list is written to a file at the end of the program.
Since I don't have much experience with multithreading in any language, not to mention Python, I would like to ask for some hints on what the structure should be for this problem. Namely, how should the calculations be assigned to the threads dynamically, and how should the threads save results and later combine all results into one file? I hope the number of threads can be adjustable.
Any illustration with code will be very helpful.
Thank you very much for your time, I appreciate it.
Update, day 2:
Thanks for all the helpful answers; now I know that I need multiprocessing instead of multithreading. I always confuse these two concepts because I assumed that if a program is multithreaded, the OS will automatically use multiple processors to run it when available.
I will find time to have some hands-on with multiprocessing tonight.
You can try using jug, a library I wrote for very similar problems. Your code would then look something like
from jug import TaskGenerator
evaluate = TaskGenerator(evaluate)
results = []
for p0 in [1,2,3]:
    for p1 in range(10):
        for p2 in range(10,20):
            for p3 in [True, False]:
                for p4 in range(100):
                    results.append(evaluate(p0,p1,p2,p3,p4))
Now you could run as many processes as you'd like (even across a network if you have access to a computer cluster).
Multithreading in Python won't win you anything in this kind of problem, since Python doesn't execute threads in parallel (it uses them for I/O concurrency, mostly).
You want multiprocessing instead, or a friendly wrapper for it such as joblib:
from joblib import Parallel, delayed
# -1 == use all available processors
results = Parallel(n_jobs=-1)(delayed(evaluate)(x) for x in enum_combinations())
print(best_of(results))
Where enum_combinations would enumerate all combinations of your five parameters; you can likely implement it by putting a yield at the bottom of your nested loop.
joblib distributes the combinations over multiple worker processes, taking care of some load balancing.
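For illustration, here is a minimal standard-library sketch of the same structure, using concurrent.futures in place of joblib. The parameter ranges are borrowed from the jug example in the other answer; evaluate and best_of are hypothetical stand-ins for your scoring and selection code, and itertools.product plays the role of the five nested loops ending in a yield.

```python
import itertools
from concurrent.futures import ProcessPoolExecutor


def enum_combinations():
    """Return an iterator over every (p0, p1, p2, p3, p4) tuple."""
    return itertools.product([1, 2, 3], range(10), range(10, 20),
                             [True, False], range(100))


def evaluate(params):
    # Hypothetical scoring function; returns (score, params) so the best
    # parameters can be recovered from the best score.
    p0, p1, p2, p3, p4 = params
    score = p0 * p1 - abs(p2 - 15) + (p4 % 7) * (1 if p3 else -1)
    return score, params


def best_of(results):
    """Pick the (score, params) pair with the highest score."""
    return max(results)


if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        # chunksize batches many cheap evaluations into each inter-process message.
        results = executor.map(evaluate, enum_combinations(), chunksize=1000)
        print(best_of(results))
```

The number of worker processes is adjustable through ProcessPoolExecutor(max_workers=...), and the results can be written to one file from the main process once the map finishes.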
Assuming this is a calculation-heavy problem (and thus CPU-bound), multi-threading won't help you much in Python due to the GIL.
What you can, however, do is split the calculation across multiple processes to take advantage of extra CPU cores. The easiest way to do this is with the multiprocessing library.
There are a number of examples for how to use multiprocessing on the docs page for it.