Hi, I'm trying to do some matrix calculations using Python. The problem is that there seems to be a limit on how much CPU the process will consume (about 13% of my Core i7).
Is there a way I can make it use more resources?
As people in the comments pointed out, you're running only on one of your 8 (4 physical, 4 virtual) cores.
If you aren't doing this for a programming exercise but instead you're really interested in numerical programming or data analysis in Python, you might want to take a close look at numpy. That package provides fast array/vector/matrix types and operations on them, and (supposedly) can do multi-threaded dot products (see multithreaded blas in python/numpy ).
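As a minimal sketch of what that looks like (the matrix sizes here are arbitrary, chosen only for illustration): the product below is handed to whatever BLAS library numpy was built against. Whether it actually runs on several cores depends on that library and its settings, commonly an environment variable such as OMP_NUM_THREADS, so treat the multi-core part as an assumption to check on your own machine.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 300))
b = rng.standard_normal((300, 200))

# The @ operator (same as np.matmul for 2-D arrays) delegates to the BLAS
# library numpy is linked against; threading is decided by that library.
c = a @ b

print(c.shape)
```

Watching the CPU monitor while running this with much larger matrices is a quick way to see whether your numpy build uses more than one core.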
Overview
I am working on re-writing and optimizing a set of MATLAB scripts for a computationally intensive scientific computing application into a set of Python scripts with the kernels written in C (run using ctypes). The Python wrapping is necessary for ease of end-user application and configuration. My current hardware is a 12-core Ryzen 7 3700X with 64 GB RAM, but it is also intended to be suitable for running on much larger and lower-clocked clusters.
Input/Output
The section of code this question concerns is highly parallelizable. The input is going to be something like 100-200 sets (serially ordered in working memory) of a few million uniformly organized floats (you can imagine them as 100-200 fairly high-resolution B/W images, all with the same proportions and similar structure). Each such "set" can be processed independently and uninterrupted, for the bulk of the process. There are many computationally (and possibly memory) intensive calculations performed on these - some of them suitable for implementation using BLAS, but also more complex upsampling, interpolation, filtering, back-projection and projection, and so forth. In the MATLAB implementation I am using as a basis, this is implemented through a parfor loop calling a few subroutines and using some MEX functions written in C. The output of each iteration is going to be, again, a few million floats. If I recall correctly (running a realistic trial of the scripts is very messy at the moment - another thing I'm tasked with fixing - so I can't easily check), the computations can be intensive enough that each iteration of the loop can be expected to take a few minutes.
The Conundrum
My most likely course of action will be to turn this entire section of the code into a big "kernel" written in C. I have subfunctions of it written in mixed C/Python already, and those already have way too much Python overhead compared to the time the actual computations need - so I want to replace all of that, and the remainder of all this easily parallelized code, with C. Thus, I have two methods I can use to parallelize the code:
I have Python create subprocesses, each of which triggers serial C code separately with its section of the data.
I have Python start a single C process to which I hand all the data, having the C process use OpenMP to create threads to parallelize the code.
I'm familiar with both Python and C multiprocessing, but I have never tried to run multiprocessing C scripts through Python. My question is then, which of these is preferable from a performance standpoint, and are there any aspects I should be considering which I haven't considered here?
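For comparison, here is a rough sketch of a variant of the first option that uses threads rather than subprocesses. It relies on the fact that ctypes releases the GIL while a foreign function loaded through `ctypes.CDLL` is running, so several serial kernel calls can run side by side without copying the data between processes. `process_set` and `process_all` are hypothetical names, and the NumPy expression only stands in for the real C kernel.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def process_set(data):
    """Hypothetical stand-in for one call into the serial C kernel."""
    return np.sqrt(data * data + 1.0)


def process_all(sets, workers=4):
    # One task per input set; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_set, sets))


# Eight small dummy "sets" instead of 100-200 sets of a few million floats.
sets = [np.full(1000, float(i)) for i in range(8)]
results = process_all(sets)
print(len(results), results[3][0])
```

With this layout the choice between Python-side and OpenMP-side parallelism mostly comes down to where the work is easier to balance; timing both on realistic data is the only reliable way to decide.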
Thank you in advance!
I have a 32 cores and 64 threads CPU for executing a scientific computation task. How many processes should I create?
Note that my program is computationally intensive and involves lots of matrix computations based on NumPy. Currently I use Python's default process pool to execute this task, which creates 64 processes. Will it perform better or worse than 32 processes?
I'm not really sure that Python is suited for multi-threading in computationally intensive scenarios, due to the Global Interpreter Lock (GIL). Basically, you should use multi-threading in Python only for IO-bound tasks. I'm not sure whether this applies to NumPy, since the heavy part, if I recall correctly, is written in C.
If you're looking for alternatives you could use the Apache Spark framework to distribute the work across multiple machines. I think that even if you run your code in local mode (i.e. on your machine) with 8/16 workers you could get some performance boost.
EDIT: I'm sorry, I just read on the GIL page that I linked that it doesn't apply to NumPy. I still think that this is not really the best tool you can use, since effective multi-threaded programming is quite hard to get right and there are some other nuances that you can read about in the link.
It's impossible to give you a definite answer, as it will depend on your exact problem and code, and potentially also on your hardware.
Basically, multiprocessing works by splitting the work into X parts, distributing them to the processes, letting each process work, and then merging the results.
Now you need to know whether you can effectively split the work into 64 parts while keeping each part at around the same amount of work (if one part takes 90% of the time and you can't split it, it's useless to have more than 2 processes, as you will always be waiting for that one).
If you can, and splitting the work and merging the results doesn't take too long (remember that this is additional work, so it takes extra time), then it can be worthwhile to use more processes.
It is also possible that you can speed up your code by using fewer processes if you spend too much time splitting the work and merging the results (sometimes the speed-up obtained by using more processes is negative).
Also remember that on some architectures the memory cache is shared among cores, which can badly affect multiprocessing performance.
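The split, distribute and merge steps described above can be sketched roughly as follows, with a small timing loop to compare worker counts. This is only an illustration: `work` and `split_run_merge` are made-up names, the workload is a trivial element-wise function, and a thread pool is used to keep the example short (the same pattern applies to a process pool, where the split and merge costs are higher because data is copied between processes).

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def work(chunk):
    """Made-up element-wise workload standing in for the real computation."""
    return np.sin(chunk) ** 2


def split_run_merge(data, workers):
    parts = np.array_split(data, workers)           # split into near-equal parts
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, parts))       # distribute and compute
    return np.concatenate(results)                  # merge, in original order


data = np.linspace(0.0, 10.0, 1_000_000)
timings = {}
for workers in (1, 2, 4):
    start = time.perf_counter()
    merged = split_run_merge(data, workers)
    timings[workers] = time.perf_counter() - start
    print(f"{workers} workers: {timings[workers]:.4f} s")
```

Running the same loop with 32 and 64 workers on the real task is the most direct way to answer the question, since the result depends on how evenly the work splits and how expensive the merge is.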
I have an application that has 3 main functionalities which are running sequentially at the moment:
1) Loading data into memory and performing preprocessing on it.
2) Perform some computations on the data using GPU with theano.
3) Monitor the state of the computations on GPU and print them to the screen.
These 3 functionalities are embarrassingly parallelizable by using multi-threading. But in python I perform all these three functionalities sequentially. Partly because in the past I had some bad luck with Python multi-threading and GIL issues.
In this case, I don't necessarily need to use the full capabilities of the multiple CPUs at hand. All I want to do is load and preprocess the data while the computations on the GPU are performed, and monitor the state of the computations at the same time. Currently the most time-consuming computations are performed in 2), so I'm bounded by the time spent on the operations in 2). Now my questions are:
* Can Python parallelize these 3 operations without creating new bottlenecks, e.g. due to GIL issues?
* Should I use multiprocessing instead of multithreading?
In a nutshell, how should I parallelize these three operations in Python, if I should at all?
It has been some time since I last wrote multi-threaded code for the CPU (especially in Python), so any guidance will be appreciated.
Edit: Typos.
The GIL is a bit of a nuisance sometimes...
A lot of it is going to revolve around how you can use the GPU. Does the API you're using allow you to set it running and then go off and do something else, occasionally polling to see if the GPU has finished? Or maybe it can raise an event, call a callback or something like that?
I'm sensing from your question that the answer is no... In which case I suspect your only choice (given that you're using Python) is multi processing. If the answer is yes then you can start off the GPU then get on with some preprocessing and plotting in the meantime and then check to see if the GPU has finished.
I don't know much about Python or how it does multiprocessing, but I suspect that it involves serialisation and copying of the data being sent between processes. If the quantity of data you're processing is large (I suggest getting worried at the 100s of megabytes mark, though that's just a hunch) then you may wish to consider how much time is lost in serialising and copying that data. If you don't like the answers to that analysis then you're probably out of luck so far as using Python is concerned.
You say that the most time consuming part is the GPU processing? Presumably the other two parts are reasonably lengthy otherwise there would be little point trying to parallelise them. For example if the GPU was 95% of the runtime then saving 5% by parallelising the rest hardly seems worth it.
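If the GPU call does let other Python threads run while it waits (an assumption I have not verified for theano), the overlap could look something like the sketch below. Everything here is a placeholder: `load_and_preprocess` fakes the data loading, the `sum` call stands in for the GPU step, and the `print` stands in for the monitoring.

```python
import queue
import threading


def load_and_preprocess(n_batches, out_q):
    """Background loader: produce batches, then a None sentinel."""
    for i in range(n_batches):
        batch = [x * 2 for x in range(i, i + 4)]  # pretend preprocessing
        out_q.put(batch)                          # blocks while the queue is full
    out_q.put(None)                               # tell the consumer we are done


def run_pipeline(n_batches):
    # Bounded queue: at most two finished batches wait for the consumer.
    q = queue.Queue(maxsize=2)
    loader = threading.Thread(target=load_and_preprocess, args=(n_batches, q))
    loader.start()

    results = []
    while True:
        batch = q.get()
        if batch is None:
            break
        results.append(sum(batch))                   # placeholder for the GPU step
        print(f"processed batch {len(results)}")     # placeholder for monitoring
    loader.join()
    return results


results = run_pipeline(3)
print(results)  # [12, 20, 28]
```

If the GPU call turns out to hold the GIL for its whole duration, the same queue-based structure carries over to `multiprocessing`, at the cost of the serialisation mentioned above.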
I am looking for a python library which extends the functionality of numpy to operations on a distributed memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab, MIT's Lincoln Lab has created pMatlab, which allows you to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (Origin of the above quote.)
For disk-based storage, pyTables exists for Python. However, it does not optimise how calculations are distributed in a cluster but rather how calculations are "distributed" with respect to large data on a disk, which is reasonably similar but still missing a crucial aspect.
The aim is not to squeeze the last bit of performance from a cluster but to do scientific calculations (semi-interactively) that are too large for single machines.
Does something similar exist for python? My wishlist would be:
actively maintained
drop in replacement for numpy
alternatively similar usage to numexpr
high abstraction of the parallel programming part: i.e. no need for the user to explicitly use MPI
support for data-locality in distributed memory clusters
support for multi-core machines in the cluster
This is probably a bit like believing in the tooth-fairy but one never knows...
I have found so far:
There (exists/used to exist) a python interface for Global Array by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit". (Especially "GA_SciPy2011_Tutorial.pdf".) However this seems to have disappeared again.
DistNumPy: described in more detail in this paper. However, the project appears to have been abandoned.
If you know of any package or have used any of the two above, please describe your experiences with them.
You should take a look at Blaze, although it may not be far enough along in development to suit your needs at the moment. From the linked page:
Blaze is an expressive, compact set of foundational abstractions for
composing computations over large amounts of semi-structured data, of
arbitrary formats and distributed across arbitrary networks.
I will soon be starting a final year Engineering project, consisting of the real-time tracking of objects moving on a 2D-surface. The objects will be registered by my algorithm using feature extraction.
I am trying to do some research to decide whether I should use MATLAB or use Python Numpy (Numerical Python). Some of the factors I am taking into account:
1.) Experience
I have reasonable experience in both, but perhaps more experience in image processing using Numpy. However, I have always found MATLAB to be very intuitive and easy to pick up.
2.) Real-Time abilities
It is very important that my choice be able to support the real-time acquisition of video data from an external camera. I found this link for MATLAB showing how to do it. I am sure that the same would be possible for Python, perhaps using the OpenCV library?
3.) Performance
I have heard, although never used, that MATLAB can easily split independent calculations across multiple cores. I should think that this would be very useful, and I am not sure whether the same is equally simple for Numpy?
4.) Price
I know that there is a cost associated with MATLAB, but I will be working at a university and thus will have access to full MATLAB without any cost to myself, so price is not a factor.
I would greatly appreciate any input from anyone who has done something similar, and what your experience was.
Thanks!
Python (with NumPy, SciPy and MatPlotLib) is the new Matlab. So I strongly recommend Python over Matlab.
I made the change over a year ago and I am very happy with the results.
Here is a short pro/con list for Python and Matlab:
Python pros:
Object Oriented
Easy to write large and "real" programs
Open Source (so it's completely free to use)
Fast (most of the heavy computation algorithms have a python wrapper to connect with C libraries e.g. NumPy, SciPy, SciKits, libSVM, libLINEAR)
Comfortable environment, highly configurable (iPython, python module for VIM, ...)
Fast growing community of Python users. Tons of documentation and people willing to help
Python cons:
Could be a pain to install (especially some modules in OS X)
Plot manipulation is not as nice/easy as in Matlab, especially 3D plots or animations
It's still a script language, so only use it for (fast) prototyping
Python is not designed for multicore programming
Matlab pros:
Very easy to install
Powerful Toolboxes (e.g. SignalProcessing, Systems Biology)
Unified documentation, and personalized support as long as you buy the licence
Easy to have plot animations and interactive graphics (that I find really useful for running experiments)
Matlab cons:
Not free (and expensive)
Based on Java + X11, which looks extremely ugly (ok, I accept I'm completely biased here)
Difficult to write large and extensible programs
A lot of Matlab users are switching to Python :)
I would recommend python.
I switched from MATLAB -> python about 1/2 way through my phd, and do not regret it. At the most simplistic, python is a much nicer language, has real objects, etc.
If you expect to be doing any parts of your code in c/c++ I would definitely recommend python. The mex interface works, but if your build gets complicated/big it starts to be a pain and I never sorted out how to effectively debug it. I also had great difficulty with mex+allocating large blocks interacting with matlab's memory management (my inability to fix that issue is what drove me to switch).
As a side note/self promotion, I have Crocker-Grier in c++ (with swig wrappers) and pure python.
1.) If you're experienced with both languages it's not really a decision criterion.
2.) Matlab has problems coping with real time settings especially since most computer vision algorithms are very costly. This is the advantage of using a tried and tested library such as OpenCV where many of the algorithms you'll be using are efficiently implemented. Matlab offers the possibility of compiling code into Mex-files but that is a lot of work.
3.) Matlab has parallel for loops (parfor), which make multicore processing easy (or at least easier). But the question is if that will suffice to get real-time speeds.
4.) No comment.
The main advantage of Matlab is that you'll obtain a running program very quickly due to its good documentation. But I found that code reusability is bad with Matlab unless you put a heavy emphasis on it.
I think the final decision has to be if you have to/can run your algorithm real-time which I doubt in Matlab, but that depends on what methods you're planning to use.
Others have made a lot of great comments (I've opined on this topic before in another answer https://stackoverflow.com/a/5065585/392949) , but I just wanted to point out that Python has a number of really excellent tools for parallel computing/splitting up work across multiple cores. Here's a short and by no means comprehensive list:
IPython Parallel toolkit: http://ipython.org/ipython-doc/dev/parallel/index.html
mpi4py: https://code.google.com/p/mpi4py
The multiprocessing module in the standard library: http://docs.python.org/library/multiprocessing.html
pyzmq: http://zeromq.github.com/pyzmq/ (what the IPython parallel toolkit is based on)
parallel python (pp): http://www.parallelpython.com/
Cython's wrapping of openmp: http://docs.cython.org/src/userguide/parallelism.html
You will also probably find cython to be a vastly superior tool compared to what Matlab has to offer if you ever need to interface with external C libraries or write C extensions, and it has excellent numpy support built right in.
There is a list with a bunch of other options here:
http://wiki.python.org/moin/ParallelProcessing