"Global array" parallel programming on distributed memory clusters with python - python

I am looking for a python library which extends the functionality of numpy to operations on a distributed memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab, MIT's Lincoln Laboratory has created pMatlab, which allows one to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (Origin of the above quote.)
For disk-based storage, pyTables exists for Python. However, it does not optimise how calculations are distributed across a cluster, but rather how calculations are "distributed" with respect to large data on disk. Which is reasonably similar, but still missing a crucial aspect.
The aim is not to squeeze the last bit of performance from a cluster but to do scientific calculations (semi-interactively) that are too large for single machines.
Does something similar exist for python? My wishlist would be:
actively maintained
drop-in replacement for numpy (see the usage sketch after this list)
alternatively, similar usage to numexpr
high abstraction of the parallel programming part: i.e. no need for the user to explicitly use MPI
support for data-locality in distributed memory clusters
support for multi-core machines in the cluster
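To make the wishlist concrete, here is the kind of usage I have in mind; "gnumpy" is an invented name for illustration, not a real package:

    import gnumpy as gnp  # hypothetical drop-in replacement for numpy

    a = gnp.ones((100000, 100000))       # allocated across the memory of all nodes
    b = gnp.random.rand(100000, 100000)
    c = gnp.dot(a, b)                    # distributed matrix product, no explicit MPI
    print(c.sum())                       # reduction gathered transparently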
This is probably a bit like believing in the tooth-fairy but one never knows...
I have found so far:
There exists (or used to exist) a python interface for the Global Arrays toolkit by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit" (especially "GA_SciPy2011_Tutorial.pdf"). However, this seems to have disappeared again.
DistNumPy: described in more detail in this paper. However, the project appears to have been abandoned.
If you know of any such package, or have used either of the two above, please describe your experiences with them.

You should take a look at Blaze, although it may not be far enough along in development to suit your needs at the moment. From the linked page:
Blaze is an expressive, compact set of foundational abstractions for
composing computations over large amounts of semi-structured data, of
arbitrary formats and distributed across arbitrary networks.

Related

Beyond Numpy/Scipy/Pytorch: (Dense?) linear algebra libraries for python programmers

For research purposes, I often find myself using the very good dense linear algebra packages available in the Python ecosystem. Mostly numpy, scipy and pytorch, which are (if I understand correctly) heavily based on BLAS/Lapack.
Of course, these go a long way, being quite extensive and having in general quite good performance (in terms of both robustness and speed of execution).
However, I often have quite specific needs that are not currently covered by these libraries. For instance, I recently found myself starting to code structure-preserving linear algebra for symplectic and (skew-)Hamiltonian matrices in Cython (which I find to be a good compromise between speed of execution and ease of integration with the bulk of the python code). This process most often consists of rewriting algorithms from the 70s-80s based on dated and sometimes painful-to-decipher research papers, which I would not mind not having to go through.
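For concreteness, a minimal numpy check that a matrix preserves the standard symplectic form (an illustration of the structure in question, not the Cython code itself):

    import numpy as np

    def is_symplectic(M, tol=1e-10):
        """Check M^T J M = J for a 2n x 2n matrix M, with J the standard symplectic form."""
        n = M.shape[0] // 2
        J = np.block([[np.zeros((n, n)), np.eye(n)],
                      [-np.eye(n), np.zeros((n, n))]])
        return np.allclose(M.T @ J @ M, J, atol=tol)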
I'm not the only person doing this on the Stack Exchange family either! Some links below from people asking about exactly this:
Function to Convert Square Matrix to Upper Hessenberg with Similarity Transformations
It seems to me (from reading the literature of this community, not from first-hand acquaintance with the members / experience in the field) that a large part of these algorithms are tested using Matlab, which is prohibitively expensive for me to get my hands on, and which would probably not work well with the rest of my codebase.
Hence my question: where can I find open-source examples of implementations of "research level" dense linear algebra algorithms that might easily be used in python or copied?

Make python use more resources in calculations

Hi, I'm trying to do some matrix calculations using python. The problem is that there seems to be a limit to how much CPU the process will consume (about 13% on my Core i7).
Is there a way I can make it use more resources?
As people in the comments pointed out, you're running only on one of your 8 (4 physical, 4 virtual) cores.
If you aren't doing this as a programming exercise but are really interested in numerical programming or data analysis in python, you might want to take a close look at numpy. That package provides fast array/vector/matrix types and operations on them, and (supposedly) can do multi-threaded dot products (see multithreaded blas in python/numpy).
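A minimal experiment along those lines, assuming your NumPy is linked against a multithreaded BLAS such as OpenBLAS or MKL:

    import numpy as np

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)
    c = np.dot(a, b)  # dispatched to BLAS dgemm; CPU usage should exceed one core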

Object Tracking: MATLAB vs. Python Numpy

I will soon be starting a final year Engineering project, consisting of the real-time tracking of objects moving on a 2D-surface. The objects will be registered by my algorithm using feature extraction.
I am trying to do some research to decide whether I should use MATLAB or use Python Numpy (Numerical Python). Some of the factors I am taking into account:
1.) Experience
I have reasonable experience in both, but perhaps more experience in image processing using Numpy. However, I have always found MATLAB to be very intuitive and easy to pick up.
2.) Real-Time abilities
It is very important that my choice be able to support the real-time acquisition of video data from an external camera. I found this link for MATLAB showing how to do it. I am sure that the same would be possible for Python, perhaps using the OpenCV library? (A minimal capture sketch follows this list of factors.)
3.) Performance
I have heard, although never used it myself, that MATLAB can easily split independent calculations across multiple cores. I should think that this would be very useful, and I am not sure whether the same is equally simple for Numpy?
4.) Price
I know that there is a cost associated with MATLAB, but I will be working at a university and thus will have access to full MATLAB without any cost to myself, so price is not a factor.
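Regarding point 2: a minimal sketch of frame acquisition with OpenCV's Python bindings, assuming the camera is device 0 (the tracking step is only a placeholder here):

    import cv2

    cap = cv2.VideoCapture(0)          # first attached camera
    while True:
        ok, frame = cap.read()         # grab one frame
        if not ok:
            break
        # ... feature extraction / tracking would go here ...
        cv2.imshow('tracking', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()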
I would greatly appreciate any input from anyone who has done something similar, and what your experience was.
Thanks!
Python (with NumPy, SciPy and Matplotlib) is the new Matlab. So I strongly recommend Python over Matlab.
I made the change over a year ago and I am very happy with the results.
Here is a short pro/con list for Python and Matlab.
Python pros:
Object Oriented
Easy to write large and "real" programs
Open Source (so it's completely free to use)
Fast (most of the heavy computation algorithms have a python wrapper to connect with C libraries e.g. NumPy, SciPy, SciKits, libSVM, libLINEAR)
Comfortable environment, highly configurable (iPython, python module for VIM, ...)
Fast growing community of Python users. Tons of documentation and people willing to help
Python cons:
Could be a pain to install (especially some modules in OS X)
Plot manipulation is not as nice/easy as in Matlab, especially 3D plots or animations
It's still a script language, so only use it for (fast) prototyping
Python is not designed for multicore programming
Matlab pros:
Very easy to install
Powerful Toolboxes (e.g. SignalProcessing, Systems Biology)
Unified documentation, and personalized support as long as you buy the licence
Easy to have plot animations and interactive graphics (that I find really useful for running experiments)
Matlab cons:
Not free (and expensive)
Based on Java + X11, which looks extremely ugly (ok, I accept I'm completely biased here)
Difficult to write large and extensible programs
A lot of Matlab users are switching to Python :)
I would recommend python.
I switched from MATLAB to python about halfway through my PhD, and do not regret it. At its most simplistic, python is a much nicer language, has real objects, etc.
If you expect to be doing any parts of your code in C/C++, I would definitely recommend python. The mex interface works, but if your build gets complicated/big it starts to be a pain, and I never sorted out how to effectively debug it. I also had great difficulty getting mex code that allocates large blocks to cooperate with MATLAB's memory management (my inability to fix that issue is what drove me to switch).
As a side note/self promotion, I have Crocker-Grier in c++ (with swig wrappers) and pure python.
Taking your factors in order:
1.) If you're experienced with both languages, it's not really a decision criterion.
2.) Matlab has problems coping with real-time settings, especially since most computer vision algorithms are very costly. This is the advantage of using a tried and tested library such as OpenCV, where many of the algorithms you'll be using are efficiently implemented. Matlab offers the possibility of compiling code into Mex-files, but that is a lot of work.
3.) Matlab has parallel for loops (parfor), which make multicore processing easy (or at least easier). But the question is whether that will suffice to get real-time speeds.
4.) No comment.
The main advantage of Matlab is that you'll obtain a running program very quickly, due to its good documentation. But I found that code reusability is bad with Matlab unless you put a heavy emphasis on it.
I think the final decision has to be whether you have to/can run your algorithm in real time, which I doubt is feasible in Matlab, but that depends on which methods you're planning to use.
Others have made a lot of great comments (I've opined on this topic before in another answer: https://stackoverflow.com/a/5065585/392949), but I just wanted to point out that Python has a number of really excellent tools for parallel computing/splitting up work across multiple cores. Here's a short and by no means comprehensive list:
IPython Parallel toolkit: http://ipython.org/ipython-doc/dev/parallel/index.html
mpi4py: https://code.google.com/p/mpi4py
The multiprocessing module in the standard library (a short sketch follows this list): http://docs.python.org/library/multiprocessing.html
pyzmq: http://zeromq.github.com/pyzmq/ (what the IPython parallel toolkit is based on)
parallel python (pp): http://www.parallelpython.com/
Cython's wrapping of openmp: http://docs.cython.org/src/userguide/parallelism.html
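For a taste of the multiprocessing option above, a minimal sketch that splits work across one worker process per core:

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool()                       # defaults to one worker per core
        print(pool.map(square, range(10)))  # work is divided among the workers
        pool.close()
        pool.join()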
You will also probably find cython to be a vastly superior tool compared to what Matlab has to offer if you ever need to interface with external C libraries or write C extensions, and it has excellent numpy support built right in.
There is a list with a bunch of other options here:
http://wiki.python.org/moin/ParallelProcessing

What scalability issues are associated with NetworkX?

I'm interested in network analysis on large networks with millions of nodes and tens of millions of edges. I want to be able to do things like parse networks from many formats, find connected components, detect communities, and run centrality measures like PageRank.
I am attracted to NetworkX because it has a nice API, good documentation, and has been under active development for years. Plus, because it is in python, it should be quick to develop with.
In a recent presentation (the slides are available on github here), it was claimed that:
Unlike many other tools, NX is designed to handle data on a scale
relevant to modern problems...Most of the core algorithms in NX rely on extremely fast legacy code.
The presentation also states that the base algorithms of NetworkX are implemented in C/Fortran.
However, looking at the source code, it looks like NetworkX is mostly written in python. I am not too familiar with the source code, but I am aware of a couple of examples where NetworkX uses numpy to do heavy lifting (which in turn uses C/Fortran to do linear algebra). For example, the file networkx/networkx/algorithms/centrality/eigenvector.py uses numpy to calculate eigenvectors.
Does anyone know if this strategy of calling an optimized library like numpy is really prevalent throughout NetworkX, or if just a few algorithms do it? Also can anyone describe other scalability issues associated with NetworkX?
Reply from NetworkX Lead Programmer
I posed this question on the NetworkX mailing list, and Aric Hagberg replied:
The data structures used in NetworkX are appropriate for scaling to
large problems (e.g. the data structure is an adjacency list). The
algorithms have various scaling properties but some of the ones you
mention are usable (e.g. PageRank, connected components, are linear
complexity in the number of edges).
At this point NetworkX is pure Python code. The adjacency structure
is encoded with Python dictionaries which provides great flexibility
at the expense of memory and computational speed. Large graphs will
take a lot of memory and you will eventually run out.
NetworkX does use NumPy and SciPy for algorithms that are primarily
based on linear algebra. In that case the graph is represented
(copied) as an adjacency matrix using either NumPy matrices or SciPy
sparse matrices. Those algorithms can benefit from the legacy C and
FORTRAN code that is used under the hood in NumPy and SciPy.
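A small sketch of the two representations Aric describes: most algorithms walk the dict-of-dicts structure directly, while the linear-algebra ones copy the graph into a SciPy sparse matrix first.

    import networkx as nx

    G = nx.erdos_renyi_graph(1000, 0.01)  # random test graph
    pr = nx.pagerank(G)                   # iterates over the dict-of-dicts structure
    A = nx.adjacency_matrix(G)            # copies the graph to a SciPy sparse matrix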
This is an old question, but I think it is worth mentioning that graph-tool has a very similar functionality to NetworkX, but it is implemented in C++ with templates (using the Boost Graph Library), and hence is much faster (up to two orders of magnitude) and uses much less memory.
Disclaimer: I'm the author of graph-tool.
Your big issue will be memory. Python simply cannot handle tens of millions of objects without jumping through hoops in your class implementation: the per-object memory overhead is very high, you hit the 2GB limit, and 32-bit code won't work. There are ways around it - using __slots__, arrays, or NumPy. It should be OK, since networkx was written with performance in mind, but if a few things don't work, I would check your memory usage.
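A minimal illustration of the __slots__ workaround mentioned above; dropping the per-instance __dict__ cuts the per-object overhead considerably:

    class Edge(object):
        __slots__ = ('src', 'dst', 'weight')  # no per-instance __dict__

        def __init__(self, src, dst, weight):
            self.src = src
            self.dst = dst
            self.weight = weight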
As for scaling, algorithms are basically the only thing that matters with graphs. Graph algorithms tend to have really ugly scaling if they are done wrong, and they are just as likely to be done right in Python as any other language.
The fact that networkX is mostly written in python does not mean that it is not scalable, nor does it claim perfection. There is always a trade-off. If you throw more money at your "machines", you'll have as much scalability as you want, plus the benefits of using a pythonic graph library.
If not, there are other solutions (here and here) which may consume less memory (benchmark and see; I think igraph is fully C-backed, so it will), but you may miss the pythonic feel of NX.

Concepts and tools required to scale up algorithms

I'd like to begin thinking about how I can scale up the algorithms I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracotta, and Eucalyptus) are to make this happen, and how specifically these concepts and tools relate to each other. I have a rudimentary background in R, Python, and bash scripting and also C and Fortran programming, and I'm familiar with some basic functional programming concepts as well. Do I need to change the way that I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (RHIPE)... or write wrappers for Python to enable multi-threading or Hadoop access? I understand this might involve requirements for additional hardware, and I would like some basic idea of what the requirements/options available might be. My apologies for this rather large and yet vague question, but I'm just trying to get started - thanks in advance!
While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited -- you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).
Relational databases may be an exception to "technologies pale" -- they can really clamp you down when you're trying to scale up a few orders of magnitude. Is that your situation -- are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you -- in the latter, you're at the breaking point, or well past it, and must start thinking in terms of key/value stores.
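To make "horizontal sharding" concrete, a toy sketch (invented function name) that routes each record to one of N database shards by a stable hash of its key:

    import hashlib
    import struct

    def shard_for(key, n_shards=4):
        """Route a record to one of n_shards by a stable hash of its key."""
        digest = hashlib.md5(str(key).encode('utf-8')).digest()
        (value,) = struct.unpack('>I', digest[:4])
        return value % n_shards

    # e.g. shard_for('user:42') gives the same shard index on every run and machine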
Back to algorithms -- "data analysis" covers a wide range... most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can't possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most "data mining" applications fall into the second category -- you don't need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.
This is particularly crucial if you ever need to do "near-real-time" data analysis -- near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% (I don't know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.
The work I've been doing over the last few years has had a few requirements for "near-real-time" and many more requirements for off-line, "batch" analysis -- and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don't understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes' worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your neck of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv's, or whatever).
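For readers unfamiliar with the term, a toy sketch of stratified sampling (invented helper names; designing the strata is exactly where the domain expert comes in):

    import random
    from collections import defaultdict

    def stratified_sample(records, stratum_of, fraction=0.05, seed=0):
        """Sample a fixed fraction from each stratum, not from the whole pool."""
        rng = random.Random(seed)
        strata = defaultdict(list)
        for rec in records:
            strata[stratum_of(rec)].append(rec)
        sample = []
        for group in strata.values():
            k = max(1, int(len(group) * fraction))
            sample.extend(rng.sample(group, k))
        return sample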
Python, R, C++, whatever, only come into play after you've mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to "statistically survive" the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework...?), and the supporting design and storage-technology choices.
The main thing for scaling up to large data is to avoid situations where you're reading huge datasets into memory at once. In pythonic terms this generally means using iterators to consume the dataset in manageable pieces.
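A minimal sketch of that pattern, assuming a (hypothetical) text file with one number per line:

    def running_total(path):
        total = 0.0
        with open(path) as f:
            for line in f:          # lazy iteration: one line in memory at a time
                total += float(line)
        return total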
