Where to begin with Distributed Computing / Parallel Processing? (Python / C) [closed]


It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I'm currently investigating topics for graduate studies in Computer Science and I've found a relatively large area of interest, Distributed Computing, that I'd like to get more information on. There are a handful of other questions [1,2,3] on StackOverflow that address similar matters, but not necessarily the question I'm going to ask, specifically related to the languages I'm looking for.
I've searched the web and found plenty of papers, articles, and even courses, such as this course from Rutgers, describing the theory and mechanics behind Distributed Computing. Unfortunately, most of these papers and courses I've found are fairly limited on describing the actual concepts of Distributed Computing in code. I'm looking for websites that can give me an introduction to the programming parts of Distributed Computing. (Preferably in C or Python.)
As a side note, I'd like to mention that this may even be more specifically towards how Parallel Computing fits into the field of Distributed Computing. (I haven't taken a course in either yet!)

Disclaimer: I am a developer of SCOOP.
It really depends on your personality. If you prefer getting theoretical information before moving forward, you should read some books or get acquainted with the technologies first. A list of books covering a good part of the subject would be:
Parallel Programming for Multicore and Cluster Systems by Thomas Rauber and Gudula Rünger (Springer-Verlag)
Principles of Parallel Programming by Calvin Lin and Lawrence Snyder (Addison-Wesley)
Patterns for Parallel Programming by Timothy G. Mattson et al. (Addison-Wesley)
Data-based technologies you may want to get acquainted with would be the MPI standard (for multi-computers) and OpenMP (for single-computer), as well as the pretty good multiprocessing module which is built into Python.
If you prefer getting your hands dirty first, you should begin with task-based frameworks, which are simple and user-friendly to use. Both qualities were a primary focus while creating SCOOP. You can try it with pip install -U scoop. On Windows, you may wish to install PyZMQ first using their executable installers. You can check the provided examples and play with the various parameters to understand what causes performance to degrade or improve. I encourage you to compare it to its alternatives, such as Celery for similar work or Gevent for a coroutine framework. If you feel adventurous, don't be shy to test the built-in coroutine functionalities of Python and plug them into various networking stacks.
Using a task-based framework will relieve you of the burden of implementation details such as load balancing, serialization and so on, which are non-trivial and can take a long time to debug and get working. It still provides the desired level of understanding of distributed systems. Bonus with open source software: check the code to understand the under-the-hood mechanical details.
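To make the multiprocessing suggestion above concrete, here is a minimal sketch of splitting a CPU-bound job across worker processes on one machine. The prime-counting workload and the helper names (is_prime, count_primes, split_range) are made up for illustration and are not taken from SCOOP or any other framework mentioned here.

```python
import math
from multiprocessing import Pool


def is_prime(n):
    """Trial division; deliberately CPU-bound so that parallelism can pay off."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def count_primes(bounds):
    """Count the primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def split_range(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous chunks (assumes hi > lo)."""
    step = -(-(hi - lo) // parts)  # ceiling division
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


if __name__ == "__main__":
    # Each chunk is an independent task; the pool hands them out to workers.
    chunks = split_range(0, 200_000, parts=8)
    with Pool(processes=4) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```

Using more chunks than workers is a crude form of load balancing: chunks covering larger numbers take longer, so a worker that finishes early can pick up another one. Task-based frameworks take care of this kind of detail for you.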

I have had good experiences using the built-in packages for Python on a single machine. My friend has had great success using IPython on a machine with 128 cores.
Now there are different kinds of distributed computing, such as clusters, clouds, or any machine on the internet, as with Folding@home (including PS3s!). Don't forget about GPUs as well!
Some Python Links:
Various Python libraries
IPython
Python and Parallel Computing presentation

Related

Best language for battery modelling? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I'm interested to learn if there's a general consensus around using one language or environment for building physics-based computational models of batteries?
The modelling typically involves mathematically representing electrochemical, mechanical and thermal phenomena, solving partial differential equations and outputting plots of different variables in two and three dimensions.
So far, I've seen various academic research groups using MATLAB, but from other questions here, I can see that Fortran and Python have been suggested for relatively generic physics modelling. (See here: https://goo.gl/3ACddi)
I have a preference for a free (as in beer & speech) environment, wherever possible, but I recognise that some proprietary environments may have built-in toolboxes that are useful. Additionally, I would like the environment to allow the code to be easily parallelized so that it can run across many cores.

This is a broad question, but I'll share what I've experienced so far. Maybe it's of some use. Keep in mind that this is all my personal opinion.
MATLAB: It's widely used in academic environments. One reason is that MathWorks is following a smart business strategy where educational licenses are very cheap compared to the retail price, so many students and professors get used to MATLAB, even if there might be something better for them out there.
MATLAB has the advantage of being very easy to code in. It will often take you only a short time to get the first prototype of your code running. This comes at the expense of performance (compared to C/C++ and Python, which are often a bit faster than MATLAB). One of the downsides is that MATLAB was not meant to compete with C/C++ and the like. You don't even have namespaces in MATLAB. Writing frameworks in MATLAB is therefore a whole lot more tiresome (if not impossible) and inefficient than writing one in C/C++. For instance, if you create a function in your workspace called max which does absolutely nothing, it shadows MATLAB's built-in max, and as long as yours is in the workspace you can only reach the original through a workaround such as builtin('max', ...).
C++: I'm studying engineering, and here C++ is the favourite choice when it comes to physical simulations. Compared to other languages it's really fast. And since the programmer is responsible for memory management, he or she can get the last 10% of performance by writing efficient and case-specific code for handling memory. There are also a ton of open source libraries out there, for example Eigen, which is a library for matrix and vector calculations.
C: Some people (hello Linus) are convinced that C++ is not a good language and prefer plain C, since it is a bit faster and the library "bloat" (in C++ coming from the standard library, Boost and the like) is smaller. More arguments against C++ are that it seduces the programmer into creating classes for every little thing and using polymorphism out of laziness. Both things can have a negative impact on performance, but whether that makes it worth refusing to work with C++ at all is up to you to decide. As a side note: the complete Linux kernel is written in C, not C++, and many tools like Git are also written in plain C.
Python: Another language suitable for rapid prototyping, since there is no separate compile step and the syntax is optimized to be easy and intuitive to use. Debuggers are not necessary since you can simply use the interpreter to check out different variables and their values, much like in MATLAB. But contrary to MATLAB, Python also allows you to create objects with methods and everything, like C++. (I know that MATLAB recently added classes, but I refuse to say it's equivalent to C++/Python.) Python is also widely used for academic purposes. There are open source libraries for machine learning, artificial intelligence and everything. There are also libraries which allow you to use fractions without approximations, i.e. 1/6 is stored as two numbers, numerator and denominator, and not as a double. In the open source community people are putting a great effort into copying many features MATLAB has over to Python, which is why you'll find many open source enthusiasts using it.
You see, some languages are good for rapid prototyping, meaning for scenarios where you want to get a proof of concept. MATLAB is useful since you don't have to compile anything and you can quickly visualize results. Python is also worth noting for rapid prototyping. But once you need to deploy the code on actual hardware or want to sell a finished product with user interface and everything, you'd probably go with something like C/C++ or Python, but not Matlab.
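The exact fractions mentioned in the Python paragraph are available through the fractions module in the standard library. A small sketch of the difference from floating point follows; the harmonic helper is just an illustrative name.

```python
from fractions import Fraction

# 1/6 is stored as an exact numerator/denominator pair, not as a double.
one_sixth = Fraction(1, 6)


def harmonic(n):
    """Exact partial sum 1 + 1/2 + ... + 1/n as a Fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


# Exact arithmetic: no rounding error accumulates.
exact_ok = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# The same sum in binary floating point is only approximately 0.3.
float_ok = 0.1 + 0.2 == 0.3
```

Here exact_ok is True while float_ok is False. The price is speed: Fraction arithmetic is far slower than doubles, so it suits checks and small symbolic-style calculations rather than the inner loop of a PDE solver.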

Machine learning development environment [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I use Python for doing prototypes in machine learning, but have often been frustrated with the slow interpreter. Is there a language which is good for prototyping (with enough libraries like sklearn, numpy, scipy) but at the same time is fast and powerful?
What I am looking for is something that I can prototype in and deploy in production as well. What do people commonly use?

As far as I know, Python is as good as it gets if you want a real language with lots of libraries.
MATLAB is probably the most popular commercial solution for prototyping. It has numerous built-ins and is easy to handle. In terms of performance, MATLAB is currently king in prototyping, second only to compiled languages for production (C, Fortran, C++, ...). It's not a proper language, though, so I guess this isn't what you are looking for.

Python is pretty much as good as it gets for the sort of prototyping you describe. However, I have to ask, if you're frustrated with its speed as a numeric language: how are you writing your code? The way to do this in Python is with Numpy, which is a package for numerical computing where the underlying operations on arrays (matrices) are performed using compiled C code. It does mean learning how to express your computations as matrix operations however, so if you're not used to linear algebra/matrix manipulation then it might require a bit of getting used to. It's basically a Matlab-like environment.
My experience: if you're writing your Python code using a lot of loops, element-wise operations, etc., it is slow and ugly. Once you learn the equivalent Numpy/Scipy way, the speed gains are phenomenal (and what you write is much closer to the mathematical expression too).
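As a rough illustration of the loop-versus-Numpy point, here is the same computation (all pairwise Euclidean distances between points) written both ways. The function names are made up for this sketch, and how much faster the array version runs depends on the problem size and your machine.

```python
import numpy as np


def pairwise_distances_loop(points):
    """Plain-Python version: explicit loops over every pair of points."""
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            out[i][j] = sq ** 0.5
    return out


def pairwise_distances_numpy(points):
    """Array version: broadcasting builds all differences in one expression."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]  # shape (n, n, dim)
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The array version reads almost like the formula for the distance, and the loops run in compiled code. The trade-off is memory: the intermediate diff array has n * n * dim elements, so for very large n you would process the points in blocks.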

You can use R within Python via RPy. This way you can use R functionality from within a Python program.
Depending on what you want to do, you can also have a look at OpenCV Python for lower-level machine learning tools (SVM ...).

High performance computing projects using Python [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
For a paper I want to argue why I have used Python for the implementation of my algorithm. Besides the typical arguments that it is fast (using suitable libraries) and that it is easy to implement the algorithm with it, I thought maybe there are some big HPC projects that are using it.
Does anyone know a famous project that uses Python for large parallel calculations, maybe with a paper which I can cite?

To be honest, as great a language as Python is, it wouldn't be a suitable environment for scientific computing, and in particular high performance computing, if those libraries weren't available. So you can see Python as one piece of a larger puzzle, much as MATLAB can be.
The two key reasons to use Python for scientific or high-performance computing can then be said to be the convenient interfaces to software packages written in other languages, and the need for fast turnaround on a project. Commonly, both issues arise at the same time.
The classic example of this is the paper "Feeding a Large-scale Physics Application to Python", by David M. Beazley, which combines performance-intensive C++ with Python using SWIG.
If you're looking for something very current, there is a new paper, "A New Modelling System for Seasonal Streamflow Forecasting Service of the Bureau of Meteorology, Australia", by Daehyok Shin et al., that is due to be presented at MODSIM2011. I saw the first author speak at the Melbourne Python Users Group about how IPython was being used as a mechanism for bridging high-performance Fortran models and HDF5 data in such a way that even non-programmers could make effective contributions to a larger scientific program.
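To show what "convenient interfaces to software written in other languages" can look like in practice, here is a minimal sketch using ctypes from the standard library rather than SWIG (which the Beazley paper uses). It assumes a Unix-like system where the C math library can be located; c_cos is an illustrative wrapper name.

```python
import ctypes
import ctypes.util

# Assumption: a Unix-like system on which the C math library can be found.
_libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(_libm_path)

# Declare the C signature double cos(double) so ctypes converts correctly.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double


def c_cos(x):
    """Call the compiled C cos() from Python."""
    return libm.cos(x)
```

Real projects wrap much larger C, C++ or Fortran codes this way, typically with generated wrappers rather than hand-written declarations, but the division of labour is the same: the heavy numerical work stays in compiled code and Python drives it.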

Check out the Python success stories page on Python.org.

Blender makes heavy use of Python (its core is written in C and C++, with Python used for scripting and add-ons), which is quite impressive for what it can do. If you're not impressed by testing it, you should watch some of the shorts people have made using it. Not as impressive, Ubuntu Software Center and the original BitTorrent client are written in Python. Battlefield 2 uses a good chunk of Python.

Best language for Molecular Dynamics Simulator, to be run in production. (Python+Numpy?) [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I need to build a heavy-duty molecular dynamics simulator. I am wondering if Python+numpy is a good choice. This will be used in production, so I wanted to start with a good language. I am wondering if I should rather start with a functional language, e.g. Scala. Do we have enough library support for scientific computation in Scala? Or is there any other language/paradigm combination you think is good, and why? If you have actually built something in the past and are talking from experience, please mention it, as it will help me with collecting data points.
Thanks much!

The high performing MD implementations tend to be decidedly imperative (as opposed to functional) with big arrays of data trumping object-oriented design. I've worked with LAMMPS, and while it has its warts, it does get the job done. A perhaps more appealing option is HOOMD, which has been optimized from the beginning for Nvidia GPUs with CUDA. HOOMD doesn't have all the features of LAMMPS, but the interface seems a bit nicer (it's scriptable from Python) and it's very high performance.
I've actually implemented my own MD code a couple times (Java and Scala) using a high level object oriented design, and have found disappointing performance compared to the popular MD implementations that are heavily tuned and use C++/CUDA. These days, it seems few scientists write their own MD implementations, but it is useful to be able to modify existing ones.

I believe that most highly performant MD codes are written in native languages like Fortran, C or C++. Modern GPU programming techniques are also finding favour more recently.
A language like Python would allow for much more rapid development than native code. The flip side of that is that the performance is typically worse than for compiled native code.
A question for you. Why are you writing your own MD code? There are many many libraries out there. Can't you find one to suit your needs?

Why would you do this? There are many good, freely available, molecular dynamics packages out there you could use: LAMMPS, Gromacs, NAMD, HALMD all come immediately to mind (along with less freely available ones like CHARMM, AMBER, etc.) Modifying any of these to suit your purpose is going to be vastly easier than writing your own, and any of these packages, with thousands of users and dozens of contributors, are going to be better than whatever you'd write yourself.
Python+numpy is going to be fine for prototyping, but it's going to be vastly slower (yes, even with numpy linked against fast libraries) than C/C++/Fortran, which is what all the others use. Unless you're using GPU, in which case all the hard work is done in kernels written in C/C++ anyway.
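For a sense of what "fine for prototyping" means, here is a minimal sketch of a velocity Verlet integrator in Python+numpy, checked on a harmonic oscillator. It is not taken from any of the packages named above; the function names and the toy force are assumptions chosen for illustration, and a real MD code would add neighbour lists, periodic boundaries, thermostats and so on.

```python
import numpy as np


def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance positions x and velocities v by `steps` velocity Verlet steps."""
    x = np.array(x, dtype=float)  # copies, so the caller's data is untouched
    v = np.array(v, dtype=float)
    f = force(x)
    for _ in range(steps):
        v += 0.5 * dt * f / mass  # half kick
        x += dt * v               # drift
        f = force(x)              # forces at the new positions
        v += 0.5 * dt * f / mass  # second half kick
    return x, v


def harmonic_force(x, k=1.0):
    """Toy force field: independent springs pulling each coordinate to 0."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy for the toy harmonic system."""
    return float(0.5 * mass * np.sum(v ** 2) + 0.5 * k * np.sum(x ** 2))
```

Starting from x = 1, v = 0 with dt = 0.01, the energy should stay very close to its initial value of 0.5, which is the usual first sanity check for an integrator. The Python-level loop over time steps is exactly the part that becomes the bottleneck once the force calculation is realistic.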

Another alternative if you want to use Python is to take a look at OpenMM:
https://simtk.org/home/openmm
It's a molecular dynamics API that has many of the basic elements that you need (integrators, thermostats, barostats, etc.) and supports running on the CPU via OpenCL and on the GPU via CUDA and OpenCL. It has a Python wrapper that I've used before, which basically mimics the underlying C API calls. It's been incorporated into Gromacs and MDLab, so you have some examples of how to integrate it if you're really dead set on building something from (semi) scratch.
However, as others have said, I highly recommend taking a look at NAMD, Gromacs, HOOMD, LAMMPS, DL_POLY, etc. to see if one of them fits your needs before you embark on re-inventing the wheel.

Acquiring basic skills working with visualizing/analyzing large data sets [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm looking for a way to learn to be comfortable with large data sets. I'm a university student, so everything I do is of "nice" size and complexity. I'm working on a research project with a professor this semester, and I've had to visualize relationships within a somewhat large (in my experience) data set. It was a 15 MB CSV file.
I wrote most of my data wrangling in Python, visualized using GNUPlot.
Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more "basic" visualization system than relying on gnuplot. Cairo or something, I suppose.
Looking for something that takes me from data mining, to processing, to visualization.
EDIT: I'm more looking for something that will teach me the "big ideas". I can write the code myself, but looking for techniques people use to deal with large data sets. I mean, my 15 MB is small enough where I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?

I'd say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data, and reducing its volume and dimensionality while keeping its integrity. The last thing you'd want to do is make something pretty that shows patterns or relationships which aren't really there.
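One simple way to reduce volume without distorting the data is uniform random sampling in a single pass, so the full file never has to fit in memory. The sketch below uses reservoir sampling, which is my choice of illustration rather than something prescribed above; reservoir_sample is a made-up helper name.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of any length.

    Only k rows are held in memory at once, so `rows` can be a generator
    over a file that is far larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)     # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row    # replace with probability k / (i + 1)
    return sample
```

To use it on a large CSV, you could pass a csv.reader over an open file as rows and plot only the resulting sample. Because every row has the same chance of being kept, the sample preserves the overall distribution, although rare outliers may be missed, so it complements rather than replaces aggregate statistics computed over the full data.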
Specialized math
To tackle some types of problems you'll need to learn some math to understand how particular algorithms work and what effect they'll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.
For an introductory overview of data mining techniques, Witten's Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it's not too expensive -- as you read more into the field you'll notice many of the books are quite expensive. The only drawback is a number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful as you're using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview; it is also reasonably priced, with a bit more math.
Tools
For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, Numpy, Scipy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.
When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don't want to write Java.
There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).
Presentation principles and techniques
Books by folks in the field like Edward Tufte and references like Information Graphics can help you get a good overview of the ways of creating visualizations and presenting them effectively.
Resources to find Viz examples
Websites like Flowing Data, Infosthetics, Visual Complexity and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and start navigating around; I'm sure you'll find a lot of useful sites and inspiring examples.
(This was originally going to be a comment, but grew too long)

Check out Information is beautiful. It is not a technical book but it might give you a couple of ideas for visualising data.
And maybe have a look at the first 3 chapters of Principles of Data Mining. It goes through some concepts of visualizing data in a data mining context, and I found some parts of it useful during university.
Hope this helps

If you are looking for visualization rather than data mining and analysis, The Visual Display of Quantitative Information by Edward Tufte is considered one of the best books in the field.

I like the book Data Analysis with Open Source Tools by Janert. It is a pretty broad survey of data analysis methods, focusing on how to understand the system that produced the data, rather than on sophisticated statistical methods. One caveat: while the mathematics used isn't especially advanced, I do think you will need to be comfortable with mathematical arguments to gain much from the book.
