As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
For a paper I want to argue why I have used Python for the implementation of my algorithm. Besides the typical arguments that it is fast (using suitable libraries) and that the algorithm is easy to implement with it, I thought maybe there are some big HPC projects that use it.
Does anyone know a famous project that uses Python for large parallel calculations, maybe with a paper which I can cite?
To be honest, as great a language as Python is, it wouldn't be a suitable environment for scientific computing, and in particular high-performance computing, if those libraries weren't available. So you can see Python as one piece of a larger puzzle - much as MATLAB can be.
The two key reasons to use Python for scientific or high-performance computing, then, are the convenient interfaces to software packages written in other languages, and the need for a fast turnaround on a project. Commonly, both issues arise at the same time.
The classic example of this is the paper "Feeding a Large-scale Physics Application to Python" by David M. Beazley, which combines performance-intensive C++ with Python using SWIG.
If you're looking for something very current, there is a new paper, "A New Modelling System for Seasonal Streamflow Forecasting Service of the Bureau of Meteorology, Australia", by Daehyok Shin et al., that is due to be presented at MODSIM2011. I saw the first author speak at the Melbourne Python Users Group about how IPython was being used as a mechanism for bridging high-performance Fortran models and HDF5 data in such a way that even non-programmers could make effective contributions to a larger scientific program.
Check out the Python success stories page on Python.org.
Blender is scriptable in Python (its core is C/C++, but Python drives its add-ons and much of its interface), which is quite impressive for what it can do. If you're not impressed by testing it, you should watch some of the shorts people have made using it. Less impressively, Ubuntu Software Center and BitTorrent are written in Python, and Battlefield 2 uses a good chunk of Python.
Closed 10 years ago.
I'm going to develop a 3D game in which a player walks through a maze in a first-person perspective, collects things, and escapes from a monster. The game itself is very simple, but since it is intended for a biological experiment rather than for entertainment, it has some specific features:
We will project the graphics onto a spherical screen with 3 projectors, so the graphics should be rendered with a fisheye transformation and be easily further transformable (to deal with the blending between projectors).
There should be functionality to record data, such as the path of the player and the time points when the monster appears. All events should be recordable.
The game program should interact with external devices via USB. For example, whenever the player presses a certain key, the program should tell an Arduino board to do something.
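(For what it's worth, feature 2 is engine-independent; here is a minimal sketch of a timestamped event recorder, with hypothetical names and file layout, that could be called from any engine's input handlers or per-frame task:)

```python
import csv
import time

class EventRecorder:
    """Minimal event logger: appends timestamped rows to a CSV file."""

    def __init__(self, path):
        self.path = path
        self.t0 = time.monotonic()  # session start; all timestamps are relative to it
        with open(self.path, "w", newline="") as f:
            csv.writer(f).writerow(["t", "event", "x", "y", "z"])

    def log(self, event, x=0.0, y=0.0, z=0.0):
        t = time.monotonic() - self.t0
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([f"{t:.4f}", event, x, y, z])

# hypothetical usage: log player position each frame, plus discrete events
rec = EventRecorder("session.csv")
rec.log("player_pos", 1.0, 2.0, 0.0)
rec.log("monster_appears")
```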
From my investigation, I found three candidate tool chains for developing such a game:
Develop a MOD on the Quake3 engine + Fisheye Quake. The problem, I think, is that Quake3 runs mods in a virtual machine, so is it even possible to implement features 2 and 3 above?
Panda3D + FisheyeLens API
PyOpenGL. This is the most flexible way, but with the greatest workload I think.
I'm quite familiar with C/C++/Python, but this is my first time developing a 3D game. My question is: which tool chain best fits this project (or is there another good option)? What problems would I encounter?
As there's no answer yet, let me give my own thoughts here. I have no experience with 3D development and don't know if these technologies would work. For some side reasons, I prefer Panda3D.
Please note that I'm still open to accept other answers.
Why Panda3D?
It is well-documented. This is always an important factor when choosing a technology.
It's in Python, which is a plus for me.
It's more general-purpose than Quake3. Panda3D can also be used for robotics simulators etc., which would be useful for me.
Why not Quake3?
It runs mods in a virtual machine (QVM). I'm afraid it would be hard to record data or access external devices.
At first I thought of Quake3 because its source is at least a classic to read, but I have since realized that Quake3 cannot do much beyond FPS-like games. If it's worth reading, read it, but I don't have to build on it.
It's a game engine built for a specific purpose, not a general graphics library, so it's not flexible enough for transforming the rendering of the graphics.
Why not OpenGL?
At this moment I think Panda3D is low-level enough. If it proves not as flexible as I need, I'll turn to OpenGL then.
Closed 10 years ago.
I have a project written in IDL (Interactive Data Language) that is used to produce a near-real-time model of the ionosphere by assimilating a heap of different data inputs. IDL is not a great language for this, but the project is written in it mainly because of legacy code. It is written in OO style despite the relatively limited object environment in IDL.
The scope of the next generation of this project is much larger and will require much more computing grunt. IDL has limited support for multi-threading and no support for parallel running on distributed memory systems. The original plan was to write the next-generation code in C++ using MPI to parallelize; however, I have recently started learning Python and am very impressed with the ease of use and the ability to rapidly develop and maintain code. I am now considering writing the high-level parts of this project in Python and using C extensions when/if required to optimise the core number-crunching parts.
Since I'm new to Python, it won't be immediately obvious to me where Python is likely to be slow compared to a C version (and I'll also probably do things sub-optimally in Python until I learn its idiosyncrasies). So I'm thinking of planning out the whole project as if it were to be done entirely in Python, writing the code, profiling and optimising repeatedly until I can't make any more improvements, and then looking to replace the slowest parts with C extensions.
Is this a good approach? Does anyone have any tips for developing this kind of project? I'll be looking to utilise as many existing well optimised libraries as possible (such as scaLAPACK) which may also reduce the need to roll my own C based extensions for the number crunching.
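(The profile-then-replace workflow described above can be driven entirely by the standard library; a minimal sketch, where `assimilate` is just a hypothetical stand-in for a number-crunching hot spot:)

```python
import cProfile
import io
import pstats

def assimilate(n):
    # stand-in for a number-crunching hot spot
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

pr = cProfile.Profile()
pr.enable()
assimilate(100_000)
pr.disable()

# the top entries of this report point at candidates for C extensions
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```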
Python is especially slow when you do a lot of looping, especially nested loops:

for i in x:
    for j in y:
        ...
When it comes to computationally intensive problems, 99% of them can be solved by doing vectorized calculations with NumPy instead of looping, e.g.:

import numpy as np

x = np.arange(1000)        # numbers from 0 to 999
y = np.arange(1000, 2000)  # numbers from 1000 to 1999

# slow:
for i in range(len(x)):
    y[i] += x[i]

# fast:
y += x
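To see how large the difference is, here is a quick timing sketch (the exact numbers will vary by machine, but the vectorized version is typically orders of magnitude faster):

```python
import time
import numpy as np

x = np.arange(1_000_000)
y = np.arange(1_000_000)

t0 = time.perf_counter()
z_loop = np.empty_like(x)
for i in range(len(x)):        # slow: one interpreted iteration per element
    z_loop[i] = x[i] + y[i]
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
z_vec = x + y                  # fast: a single call into compiled code
t_vec = time.perf_counter() - t0

print(f"loop {t_loop:.3f}s vs vectorized {t_vec:.4f}s")
```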
For many scientific problems there are binary libraries, written in Fortran or C/C++, that are accessible from Python. This makes life really easy.
If you come to a point where this is not possible, I'd look to Cython to implement the core parts in C easily, without writing C by hand.
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I'm currently investigating topics for graduate studies in Computer Science and I've found a relatively large area of interest, Distributed Computing, that I'd like to get more information on. There are a handful of other questions [1,2,3] on StackOverflow that address similar matters, but not necessarily the question I'm going to ask, specifically related to the languages I'm looking for.
I've searched the web and found plenty of papers, articles, and even courses, such as this course from Rutgers, describing the theory and mechanics behind Distributed Computing. Unfortunately, most of these papers and courses I've found are fairly limited on describing the actual concepts of Distributed Computing in code. I'm looking for websites that can give me an introduction to the programming parts of Distributed Computing. (Preferably in C or Python.)
As a side note, I'd like to mention that this may even be more specifically towards how Parallel Computing fits into the field of Distributed Computing. (I haven't taken a course in either yet!)
Disclaimer: I am a developer of SCOOP.
It really depends on your personality. If you prefer getting theoretical information before moving forth, you should read some books or get along with the technologies first. A list of books covering a good part of the subject would be:
Parallel Programming for Multicore and Cluster Systems by Thomas Rauber and Gudula Rünger (Springer-Verlag)
Principles of Parallel Programming by Calvin Lin and Lawrence Snyder (Addison-Wesley)
Patterns for Parallel Programming by Timothy G. Mattson et al. (Addison-Wesley)
Data-based technologies you may want to get acquainted with are the MPI standard (for multi-computers) and OpenMP (for single computers), as well as the pretty good multiprocessing module, which is built into Python.
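For instance, the built-in multiprocessing module lets you fan work out across local cores in a few lines (a minimal sketch; real workloads would replace the toy function):

```python
from multiprocessing import Pool

def f(x):
    # stand-in for an expensive, independent unit of work
    return x * x

if __name__ == "__main__":
    # distribute the inputs across 4 worker processes
    with Pool(4) as pool:
        squares = pool.map(f, range(10))
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The `if __name__ == "__main__"` guard matters: on platforms that spawn fresh interpreters for workers, the module is re-imported in each child, and the guard keeps the pool from being created recursively.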
If you prefer getting your hands dirty first, you should begin with task-based frameworks, which provide simple, user-friendly usage. Both of these points were a primary focus while creating SCOOP. You can try it with pip install -U scoop. On Windows, you may wish to install PyZMQ first using their executable installers. You can check the provided examples and play with the various parameters to understand with ease what causes performance to degrade or improve. I encourage you to compare it to alternatives such as Celery for similar work, or Gevent for a coroutine framework. If you feel adventurous, don't be shy to test the built-in coroutine functionality of Python and plug it into various networking stacks.
Using a task-based framework will spare you the burden of implementation details such as load balancing and serialization, which are non-trivial and can take a long time to debug and get working, while still providing the desired level of understanding of distributed systems. A bonus of open source software: check the code to understand the under-the-hood mechanics.
I have had good experiences using Python's built-in packages on a single machine. My friend has had great success using IPython on a machine with 128 cores.
Now, there are different kinds of distributed computing: on clusters, on clouds, or across arbitrary machines on the internet, like Folding@home (including PS3s!). Don't forget about GPUs as well!
Some Python Links:
Various Python libraries
IPython
Python and Parallel Computing presentation
Closed 11 years ago.
I need to build a heavy-duty molecular dynamics simulator. I am wondering if Python + NumPy is a good choice. This will be used in production, so I wanted to start with a good language. I am wondering if I should rather start with a functional language, e.g. Scala. Do we have enough library support for scientific computation in Scala? Or is there any other language/paradigm combination you think is good, and why? If you have actually built something in the past and are talking from experience, please mention it, as it will help me with collecting data points.
thanks much!
The high-performing MD implementations tend to be decidedly imperative (as opposed to functional), with big arrays of data trumping object-oriented design. I've worked with LAMMPS, and while it has its warts, it does get the job done. A perhaps more appealing option is HOOMD, which has been optimized from the beginning for Nvidia GPUs with CUDA. HOOMD doesn't have all the features of LAMMPS, but the interface seems a bit nicer (it's scriptable from Python) and it's very high performance.
I've actually implemented my own MD code a couple times (Java and Scala) using a high level object oriented design, and have found disappointing performance compared to the popular MD implementations that are heavily tuned and use C++/CUDA. These days, it seems few scientists write their own MD implementations, but it is useful to be able to modify existing ones.
I believe that most highly performant MD codes are written in native languages like Fortran, C or C++. Modern GPU programming techniques are also finding favour more recently.
A language like Python would allow for much more rapid development than native code. The flip side is that the performance is typically worse than that of compiled native code.
A question for you. Why are you writing your own MD code? There are many many libraries out there. Can't you find one to suit your needs?
Why would you do this? There are many good, freely available molecular dynamics packages out there you could use: LAMMPS, Gromacs, NAMD, and HALMD all come immediately to mind (along with less freely available ones like CHARMM, AMBER, etc.). Modifying any of these to suit your purpose is going to be vastly easier than writing your own, and any of these packages, with thousands of users and dozens of contributors, is going to be better than whatever you'd write yourself.
Python + NumPy is going to be fine for prototyping, but it's going to be vastly slower (yes, even with NumPy linked against fast libraries) than C/C++/Fortran, which is what all the others use. Unless you're using a GPU, in which case all the hard work is done in kernels written in C/C++ anyway.
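For prototyping, though, NumPy takes you surprisingly far. A toy sketch of fully vectorized Lennard-Jones forces, O(N²) in memory and nothing like a production neighbour-list implementation, just to show the flavour:

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Net Lennard-Jones force on each particle, all pairs, vectorized."""
    rij = pos[:, None, :] - pos[None, :, :]   # (N, N, 3) displacement vectors
    r2 = (rij ** 2).sum(-1)                   # (N, N) squared distances
    np.fill_diagonal(r2, np.inf)              # exclude self-interaction
    inv6 = (sigma ** 2 / r2) ** 3             # (sigma/r)^6
    # scalar force magnitude / r: 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
    fmag = 24.0 * eps * (2.0 * inv6 ** 2 - inv6) / r2
    return (fmag[..., None] * rij).sum(axis=1)  # (N, 3) net force per particle

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(50, 3))
F = lj_forces(pos)
```

By Newton's third law the forces should sum to (numerically) zero, which makes a handy sanity check when prototyping.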
Another alternative if you want to use Python is to take a look at OpenMM:
https://simtk.org/home/openmm
It's a molecular dynamics API that has many of the basic elements you need (integrators, thermostats, barostats, etc.) and supports running on the CPU via OpenCL and on the GPU via CUDA and OpenCL. It has a Python wrapper that I've used before, which basically mimics the underlying C API calls. It's been incorporated into Gromacs and MDLab, so you have some examples of how to integrate it if you're really dead set on building something from (semi) scratch.
However as others have said, I highly recommend taking a look at NAMD, Gromacs, HOOMD, LAMMPS, DL_POLY, etc to see if it fits your needs before you embark on re-inventing the wheel.
Closed 10 years ago.
I'm looking for a way to learn to be comfortable with large data sets. I'm a university student, so everything I do is of "nice" size and complexity. I'm working on a research project with a professor this semester, and I've had to visualize relationships in a somewhat large (in my experience) data set: a 15 MB CSV file.
I wrote most of my data wrangling in Python, visualized using GNUPlot.
Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more "basic" visualization system than relying on gnuplot. Cairo or something, I suppose.
Looking for something that takes me from data mining, to processing, to visualization.
EDIT: I'm more looking for something that will teach me the "big ideas". I can write the code myself, but looking for techniques people use to deal with large data sets. I mean, my 15 MB is small enough where I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?
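One of the standard "big ideas" for data that won't fit in memory is a single streaming pass that keeps only running aggregates (and, for plotting, a downsampled subset). A minimal sketch, using a small in-memory CSV as a stand-in for a multi-gigabyte file on disk:

```python
import csv
import io

# toy stand-in for a file far too large to load at once
raw = "value\n" + "\n".join(str(i % 100) for i in range(10_000))

def streaming_stats(lines):
    """One pass over the rows, keeping only O(1) state: count, sum, min, max."""
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for row in csv.DictReader(lines):
        v = float(row["value"])
        count += 1
        total += v
        lo, hi = min(lo, v), max(hi, v)
    return count, total / count, lo, hi

count, mean, lo, hi = streaming_stats(io.StringIO(raw))
print(count, mean, lo, hi)  # 10000 49.5 0.0 99.0
```

The same pattern extends to histograms and reservoir sampling, which is usually what actually gets plotted for a 5 GB set rather than the raw points.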
I'd say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data and reducing its volume and dimensionality while keeping its integrity. The last thing you'd want to do is make something pretty that shows patterns or relationships which aren't really there.
Specialized math
To tackle some types of problems you'll need to learn some math to understand how particular algorithms work and what effect they'll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.
For an introductory overview of data mining techniques, Witten's Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it's not too expensive -- as you read more into the field you'll notice many of the books are quite expensive. The only drawback is a number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful as you're using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview, also reasonably priced, with a bit more math.
Tools
For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, NumPy, SciPy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.
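A minimal sketch of that stack in action, downsampling a "large" array before plotting (the file name and sizes here are arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display needed
import matplotlib.pyplot as plt

# fake "large" dataset: a million noisy samples of a sine wave
x = np.linspace(0, 10, 1_000_000)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

step = x.size // 2000          # draw ~2000 points instead of a million
fig, ax = plt.subplots()
ax.plot(x[::step], y[::step], lw=0.8)
ax.set(xlabel="x", ylabel="y", title="downsampled view")
fig.savefig("plot.png", dpi=100)
```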
When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don't want to write Java.
There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).
Presentation principles and techniques
Books by folks in the field like Edward Tufte, and references like Information Graphics, can help you get a good overview of the ways of creating visualizations and presenting them effectively.
Resources to find Viz examples
Websites like Flowing Data, Infosthetics, Visual Complexity, and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and navigate around; I'm sure you'll find a lot of useful sites and inspiring examples.
(This was originally going to be a comment, but grew too long)
Check out Information is Beautiful. It is not a technical book, but it might give you a couple of ideas for visualising data.
And maybe have a look at the first 3 chapters of Principles of Data Mining, which goes through some concepts of visualizing data in a data mining context; I found parts of it useful during university.
Hope this helps
If you are looking for visualization rather than data mining and analysis, The Visual Display of Quantitative Information by Edward Tufte is considered one of the best books in the field.
I like the book Data Analysis with Open Source Tools by Janert. It is a pretty broad survey of data analysis methods, focusing on how to understand the system that produced the data, rather than on sophisticated statistical methods. One caveat: while the mathematics used isn't especially advanced, I do think you will need to be comfortable with mathematical arguments to gain much from the book.