I am developing a simulation and data-processing pipeline in Python. Currently I am still "making it work", but in a few months I will have to "make it fast". I am not very good at HPC in general and particularly not with Python.
What online (or Europe-based) courses are available that contain at least the following topics:
parallel computation in Python, and
interfacing Python and C?
Opinion-based part (sorry about this):
The only course I managed to find is Python Academy in Leipzig (http://www.python-academy.com/courses/python_hpc.html). Has anybody tried Python Academy? I can't find any independent reviews, and there is a significant cost, so I would not want to go in blind.
None of the items mentioned in the course description qualifies as HPC, which these days usually refers to GPU utilization for massive parallelisation.
High-performance computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly. The term applies especially to systems that function above a teraflop or 10^12 floating-point operations per second.
The course contents will enable you to program the current generation of Raspberry Pi, which has a quad-core processor and a dedicated graphics processor, but I would not call that HPC (High Performance Computing).
If you are interested in High Performance Computing, get an NVIDIA graphics card and try PyCUDA, or try PyOpenCL, which is a more open standard for hybrid computing.
You can find good videos on YouTube explaining both.
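To give a flavour of what PyCUDA usage looks like, here is a minimal sketch that copies an array to the GPU, performs an elementwise operation there, and copies the result back. It assumes an NVIDIA GPU with CUDA and PyCUDA installed; the array size is arbitrary.

```python
import numpy as np
import pycuda.autoinit               # creates a CUDA context on the default device
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024, 1024).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)           # copy the array to the GPU
b_gpu = 2 * a_gpu                    # elementwise operation executed on the GPU
b = b_gpu.get()                      # copy the result back to the host
print(b[:2, :2])
```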
Related
P.S.: I've mentioned possible solutions to my problem below, but I have many doubts about them, so please advise me on them. Also, if this question is not a good fit for this site, please point me to the correct site and I'll move the question there. Thanks in advance.
I need to perform some repetitive graph theory and complex network algorithms to analyze approx 2000 undirected simple graphs with no self-loops for some research work. Each graph has approx 40,000 nodes and approx 600,000 edges (essentially making them sparse graphs).
Currently, I am using NetworkX for my analysis, running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G) on 500 such graphs; the code has been running for 3 days and has only reached the halfway point. This makes me fear that the full analysis will take an unexpectedly huge amount of time.
Before elaborating on my problem and the probable solutions I've thought of, let me mention my computer's configuration as it may help you in suggesting the best approach. I am running Windows 10 on an Intel i7-9700K processor with 32GB RAM and one Zotac GeForce GTX 1050 Ti OC Edition ZT-P10510B-10L 4GB PCI Express Graphics Card.
Explaining my possible solutions and my confusions regarding them:
A) Using a GPU with an adjacency matrix as the graph data structure: I can put an adjacency matrix on the GPU and perform my analysis by hand-coding the algorithms with PyCUDA or Numba, using loops only, since recursion cannot be handled on the GPU. The closest thing I was able to find is this question on Stack Overflow, but it has no good solution.
My Expectations: I hope to speed up algorithms such as All-Pairs Shortest Path, All Possible Paths Between Two Nodes, Average Clustering, Average Shortest Path Length, Small-World Properties, etc. If this gives a significant speedup per graph, my results can be obtained very quickly.
My Confusions:
Can these graph algorithms be efficiently implemented on a GPU?
Which would be better to use: PyCUDA or Numba?
Is there another way to store graphs on the GPU that could be more efficient, given that my graphs are sparse?
I am an average Python programmer with no experience of GPU programming, so I will have to learn GPU programming with PyCUDA/Numba. Which one is easier to learn?
B) Parallelizing programs on the CPU itself: I can use Joblib or another library to run the program in parallel on my own CPU. I can also arrange 2-3 more computers on which I can run small independent portions of the program, or run 500 graphs per computer.
My Expectations: I hope to speed up the algorithms by parallelizing and dividing tasks among computers. If the GPU solution does not work, this method may still offer some hope.
My Confusions:
Which other libraries are available as good alternatives for Joblib?
Should I allot all CPU cores (8 cores in i7) for my programs or use fewer cores?
C) Apart from my probable solutions, do you have any other suggestions for me? If a better and faster solution is available in any language other than C/C++, you can suggest it as well, as I am already considering C++ as a fallback plan if nothing works.
Work In Progress Updates
Following suggestions from comments on this question and discussion in my community, these are the directions I have been advised to explore:
GraphBLAS
boost.graph + extensions with python-wrappers
graph-tool
Spark/Dask
PyCUDA/Numba
Linear algebra methods using PyTorch
I tried to run 100 graphs on my CPU using Joblib (with n_jobs=-1); the CPU was continuously hitting a temperature of 100°C and the processor tripped after running for 3 hours. As a solution, I am now using 75% of the available cores on multiple computers (so if 8 cores are available, I use 6), and the program is running fine; the speedup is also good.
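For reference, here is a minimal sketch of the setup described above. The file paths and loading function are illustrative rather than the exact code used; it assumes each graph is stored as an edge-list file readable by NetworkX.

```python
import glob
import os
import networkx as nx
from joblib import Parallel, delayed

def analyse(path):
    G = nx.read_edgelist(path)   # load one graph from an edge-list file (placeholder path)
    return nx.average_clustering(G), nx.average_shortest_path_length(G)

# Use roughly 75% of the available cores to keep the CPU from overheating
n_jobs = max(1, int(os.cpu_count() * 0.75))

# "graphs/*.edgelist" is a placeholder for wherever the graph files live
results = Parallel(n_jobs=n_jobs)(delayed(analyse)(p) for p in glob.glob("graphs/*.edgelist"))
```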
This is a broad but interesting question. Let me try to answer it.
2000 undirected simple graphs [...] Each graph has approx 40,000 nodes and approx 600,000 edges
Currently, I am using NetworkX for my analysis and currently running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G)
NetworkX uses plain Python implementations and is not optimized for performance. It's great for prototyping, but if you run into performance issues, it's best to rewrite your code using another library.
Other than NetworkX, the two most popular graph processing libraries are igraph and SNAP. Both are written in C/C++ and have Python APIs, so you get both good single-threaded performance and ease of use. Their parallelism is very limited, but this is not a problem in your use case: you have many graphs, which makes your problem embarrassingly parallel. Therefore, as you remarked in the updated question, you can run 6-8 jobs in parallel using e.g. Joblib or even xargs. If you need parallel processing, look into graph-tool, which also has a Python API.
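As a rough illustration, and assuming python-igraph is installed and a graph is available as an edge-list file (the file name is a placeholder), the igraph equivalents of the two NetworkX calls look roughly like this:

```python
import igraph as ig

# Load one undirected graph from an edge list (placeholder file name)
g = ig.Graph.Read_Edgelist("graph_000.edgelist", directed=False)

avg_clustering = g.transitivity_avglocal_undirected()  # average local clustering coefficient
avg_path_length = g.average_path_length()              # average shortest path length
print(avg_clustering, avg_path_length)
```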
Regarding your NetworkX algorithms, I'd expect average_shortest_path_length to be reasonably well optimized in all libraries. The average_clustering algorithm is tricky, as it relies on per-node triangle counting: a naive implementation takes O(|E|^2) time while an optimized implementation does it in O(|E|^1.5). Your graphs are large enough that this difference means finishing a graph in a few seconds versus running for hours.
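As a rough back-of-the-envelope check using the edge count from the question: with |E| ≈ 600,000, |E|^2 is about 3.6×10^11 elementary operations while |E|^1.5 is about 4.6×10^8, roughly a 770× gap even before constant factors.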
The "all-pairs shortest paths" (APSP) problem is very time-consuming, with most libraries using the Floyd–Warshall algorithm that has a runtime of O(|V|^3). I'm unsure what output you're looking for with the "All Possible Paths between two nodes" algorithm – enumerating all paths leads to an exponential amount of results and is unfeasible at this scale.
I would not start using the GPU for this task: an Intel i7-9700K should be up for this job. GPU-based graph processing libraries are challenging to set up and currently do not provide that significant of a speedup – the gains by using a GPU instead of a CPU are nowhere near as significant for graph processing as for machine learning algorithms. The only problem where you might be able to get a big speedup is APSP but it depends on which algorithms your chosen library uses.
If you are interested in GPU-based libraries, there are promising directions on the topic such as Gunrock, GraphBLAST, and a work-in-progress SuiteSparse:GraphBLAS extension that supports CUDA. However, my estimate is that you should be able to run most of your algorithms (barring APSP) in a few hours using a single computer and its CPU.
Hi, I'm trying to do some matrix calculations using Python. The problem is that there seems to be a limit on how much CPU the process will consume (about 13% of my Core i7).
Is there a way I can make it use more resources?
As people in the comments pointed out, you're running only on one of your 8 (4 physical, 4 virtual) cores.
If you aren't doing this as a programming exercise but are really interested in numerical programming or data analysis in Python, you might want to take a close look at NumPy. That package provides fast array/vector/matrix types and operations on them, and (supposedly) can do multithreaded dot products (see "multithreaded BLAS in Python/NumPy").
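As a quick check of whether your NumPy build links against a multithreaded BLAS, you can run a large matrix product and watch the CPU usage; the matrix size below is arbitrary:

```python
import numpy as np

np.show_config()        # shows which BLAS/LAPACK libraries NumPy was built against

n = 4000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# A large matrix product is dispatched to the BLAS library; if that library
# (e.g. OpenBLAS or MKL) is multithreaded, this should use well over one core.
c = a @ b
```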
I know there are various tools for working with general CFGs and PCFGs in Python; however, they all seem to differ in speed.
E.g.: NLTK, PyParsing.
Are there any recent benchmarks comparing various attributes related to speed and memory usage?
If you're looking for official benchmarking of NLP tools, of the kind that exists for other computing software, you may be deeply discouraged. Sadly, NLP researchers tend to prioritize pushing accuracy over making their systems run in real time. (In research, it's surely nicer to say "I score X% above the state of the art" than "I save Y hours/days training my model".)
Often there is just one sentence in a research paper that says how long it takes to train the system, e.g. "In average, the sampling program run on the Wikipedia dump consumed 20G memory, and each round took about one week on a single AMD Dual-Core 1000MHZ processor" (from www.aclweb.org/anthology/P10-1116.pdf).
Anyway, since you want some benchmarks, here's some homework you can do with a bit of googling =) www.aclweb.org/anthology/I11-1100. But once again, you'll realize that they are benchmarking accuracy, not speed =)
I am looking for a python library which extends the functionality of numpy to operations on a distributed memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab, MIT's Lincoln Lab has created pMatlab, which allows you to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (Origin of above quote.)
For disk-based storage, PyTables exists for Python. However, it does not optimise how calculations are distributed across a cluster, but rather how calculations are "distributed" with respect to large data on disk - which is reasonably similar, but still misses a crucial aspect.
The aim is not to squeeze the last bit of performance from a cluster but to do scientific calculations (semi-interactively) that are too large for single machines.
Does something similar exist for python? My wishlist would be:
actively maintained
drop in replacement for numpy
alternatively similar usage to numexpr
high abstraction of the parallel programming part: i.e. no need for the user to explicitly use MPI
support for data-locality in distributed memory clusters
support for multi-core machines in the cluster
This is probably a bit like believing in the tooth-fairy but one never knows...
I have found so far:
There exists (or used to exist) a Python interface for Global Arrays by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit" (especially "GA_SciPy2011_Tutorial.pdf"). However, this seems to have disappeared again.
DistNumPy: described in more detail in this paper. However, the project appears to have been abandoned.
If you know of any package or have used any of the two above, please describe your experiences with them.
You should take a look at Blaze, although it may not be far enough along in development to suit your needs at the moment. From the linked page:
Blaze is an expressive, compact set of foundational abstractions for composing computations over large amounts of semi-structured data, of arbitrary formats and distributed across arbitrary networks.
I will soon be starting a final year Engineering project, consisting of the real-time tracking of objects moving on a 2D-surface. The objects will be registered by my algorithm using feature extraction.
I am trying to do some research to decide whether I should use MATLAB or use Python Numpy (Numerical Python). Some of the factors I am taking into account:
1.) Experience
I have reasonable experience in both, but perhaps more experience in image processing using Numpy. However, I have always found MATLAB to be very intuitive and easy to pick up.
2.) Real-Time abilities
It is very important that my choice be able to support the real-time acquisition of video data from an external camera. I found this link for MATLAB showing how to do it. I am sure that the same would be possible for Python, perhaps using the OpenCV library?
3.) Performance
I have heard, although never used, that MATLAB can easily split independent calculations across multiple cores. I should think that this would be very useful, and I am not sure whether the same is equally simple for Numpy?
4.) Price
I know that there is a cost associated with MATLAB, but I will be working at a university and thus will have access to full MATLAB without any cost to myself, so price is not a factor.
I would greatly appreciate any input from anyone who has done something similar, and what your experience was.
Thanks!
Python (with NumPy, SciPy and Matplotlib) is the new Matlab, so I strongly recommend Python over Matlab.
I made the change over a year ago and I am very happy with the results.
Here is a short pro/con list for Python and Matlab:
Python pros:
Object Oriented
Easy to write large and "real" programs
Open Source (so it's completely free to use)
Fast (most of the heavy computation algorithms have a python wrapper to connect with C libraries e.g. NumPy, SciPy, SciKits, libSVM, libLINEAR)
Comfortable environment, highly configurable (iPython, python module for VIM, ...)
Fast growing community of Python users. Tons of documentation and people willing to help
Python cons:
Could be a pain to install (especially some modules in OS X)
Plot manipulation is not as nice/easy as in Matlab, especially 3D plots or animations
It's still a script language, so only use it for (fast) prototyping
Python is not designed for multicore programming
Matlab pros:
Very easy to install
Powerful Toolboxes (e.g. SignalProcessing, Systems Biology)
Unified documentation, and personalized support as long as you buy the licence
Easy to have plot animations and interactive graphics (that I find really useful for running experiments)
Matlab cons:
Not free (and expensive)
Based on Java + X11, which looks extremely ugly (ok, I accept I'm completely biased here)
Difficult to write large and extensible programs
A lot of Matlab users are switching to Python :)
I would recommend python.
I switched from MATLAB to Python about halfway through my PhD, and do not regret it. At the most simplistic level, Python is a much nicer language, has real objects, etc.
If you expect to write any parts of your code in C/C++, I would definitely recommend Python. The MEX interface works, but if your build gets complicated or big it starts to be a pain, and I never sorted out how to debug it effectively. I also had great difficulty with MEX and large allocations interacting with MATLAB's memory management (my inability to fix that issue is what drove me to switch).
As a side note/self-promotion, I have implemented Crocker-Grier in C++ (with SWIG wrappers) and in pure Python.
If you're experienced with both languages it's not really a decision criterion.
Matlab has problems coping with real-time settings, especially since most computer vision algorithms are very costly. This is the advantage of using a tried-and-tested library such as OpenCV, where many of the algorithms you'll be using are efficiently implemented. Matlab offers the possibility of compiling code into MEX files, but that is a lot of work.
Matlab has parallel for loops (parfor), which makes multicore processing easy, or at least easier. But the question is whether that will suffice to reach real-time speeds.
No comment.
The main advantage of Matlab is that you'll obtain a running program very quickly due to its good documentation. But I found that code reusability is bad with Matlab unless you put a heavy emphasis on it.
I think the final decision has to be whether you have to (and can) run your algorithm in real time, which I doubt is achievable in Matlab, but that depends on what methods you're planning to use.
Others have made a lot of great comments (I've opined on this topic before in another answer: https://stackoverflow.com/a/5065585/392949), but I just wanted to point out that Python has a number of really excellent tools for parallel computing and splitting work across multiple cores. Here's a short and by no means comprehensive list (a minimal example using the multiprocessing module is sketched after the links below):
IPython Parallel toolkit: http://ipython.org/ipython-doc/dev/parallel/index.html
mpi4py: https://code.google.com/p/mpi4py
The multiprocessing module in the standard library: http://docs.python.org/library/multiprocessing.html
pyzmq: http://zeromq.github.com/pyzmq/ (what the IPython parallel toolkit is based on)
parallel python (pp): http://www.parallelpython.com/
Cython's wrapping of openmp: http://docs.cython.org/src/userguide/parallelism.html
You will also probably find Cython to be a vastly superior tool compared to what Matlab has to offer if you ever need to interface with external C libraries or write C extensions, and it has excellent NumPy support built right in.
There is a list with a bunch of other options here:
http://wiki.python.org/moin/ParallelProcessing
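To illustrate the lowest-friction option from the list above, here is a minimal sketch using the standard-library multiprocessing module; the worker function is just a placeholder for a CPU-bound task:

```python
from multiprocessing import Pool

def work(x):
    # Placeholder for a CPU-bound computation
    return x * x

if __name__ == "__main__":
    # Pool() defaults to one worker process per CPU core
    with Pool() as pool:
        results = pool.map(work, range(100))
    print(results[:5])
```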