Concepts and tools required to scale up algorithms

Concepts and tools required to scale up algorithms - python

I'd like to begin thinking about how I can scale up my algorithms that I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what are the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracota, and Eucalyptus) to make this happen, and how specifically these concepts and tools are related to each other. I have a rudimentary background in R, Python, and bash scripting and also C and Fortran programming, though I'm familiar with some basic functional programming concepts also. Do I need to change the way that I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (HRIPE)... or write wrappers for Python to enable multi-threading or Hadoop access? I understand this would might involve requirements for additional hardware and I would like some basic idea of what the requirements/options available might be. My apologies for this rather large and yet vague question, but just trying to get started - thanks in advance!

While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited -- you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).
Relational databases may be an exception to "technologies pale" -- they can really clamp you down when you're trying to scale up a few orders of magnitude. Is that your situation -- are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you -- in the latter, you're at the breaking point, or well past it, and must start thinking in terms of key/value stores.
Back to algorithms -- "data analysis" cover a wide range... most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can't possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most "data mining" applications fall into the second category -- you don't need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.
This is particularly crucial if you ever need to do "real-near-time" data analysis -- near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% (I don't know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.
The work I've been doing over the last few years has had a few requirements for "near-real-time" and many more requirements for off-line, "batch" analysis -- and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don't understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes' worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your part of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv's, or whatever).
Python, R, C++, whatever, only come into play after you've mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to "statistically survive" the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework...?), and the supporting design and storage-technology choices.

The main thing for scaling up to large data is to avoid situations where you're reading huge datasets into memory at once. In pythonic terms this generally means using iterators to consume the dataset in manageable pieces.

Related

Beyond Numpy/Scipy/Pytorch: (Dense?) linear algrebra libraries for python programmers

For research purposes, I often find myself using the very good dense linear algebra packages available in the Python ecosystem. Mostly numpy, scipy and pytorch, which are (if I understand correctly) heavily based on BLAS/Lapack.
Of course, these go a long way, being quite extensive and having in general quite good performance (in terms of both robustness and speed of execution).
However, I often have quite specific needs that are not currently covered by these libraries. For instance, I recently found myself starting to code structure-preserving linear algebra for symplectic and (skew) Hamilton matrices in Cython (which I find to be a good compromise between speed of execution and ease of integration with the bulk of the python code). This process most often consists in rewriting algorithms of the 70-80s based on dated and sometimes painful to decypher research papers, which I would not mind not having to go through.
I'm not the only person doing this on the stack exchange family either! Some links below of people who have questions doing exactly this:
Function to Convert Square Matrix to Upper Hessenberg with Similarity Transformations
It seems to me (from reading the literature from this community, not from first hand acquaintance with the members / experience in the field) that a large part of these algorithms are tested using Matlab, which is prohibitively expensive for me to get my hands on, and which would probably not work great with the rest of my codebase.
Hence my question: where can I find open source exemples of implementations of "research level" dense algebra algorithms that might easily be used in python or copied?

Determining "noise" in bandwidth data

I'm have bandwidth data which identifies protocol usage by tonnage and hour. Based on the protocols, you can tell when something is just connect vs actually being used (1000 bits compared to million or billions of bits) in that hour for that specific protocol. The problem is When looking at each protocol, they are all heavily right skewed. Where 80% of the records are the just connected or what I'm calling "noise.
The task I have is to separate out this noise and focus on only when the protocol is actually being used. My classmates are all just doing this manually and removing at a low threshold. I was hoping there was a way to automate this and using statistics instead of just picking a threshold that "looks good." We have something like 30 different protocols each with a different amount of bits which would represent "noise" i.e. a download prototypical might have 1000 bits where a messaging app might have 75 bits when they are connected but not in full use. Similarly they will have different means and gaps between i.e. download mean is 215,000,000 and messaging is 5,000,000. There isn't any set pattern between them.
Also this "noise" has many connections but only accounts for 1-3% of the total bandwidth being used, this is why we are tasked with identify actual usage vs passive usage.
I don't want any actual code, as I'd like to practice with the implementation and solution building myself. But the logic, process, or name of a statistical method would be very helpful.

Do you have labeled examples, and do you have other data besides the bandwidth? One way to do this would be to train some kind of ML classifier if you have a decent amount of data where you know it's either in use or not in use. If you have enough data you also might be able to do this unsupervised. For a start a simple Naive Bayes classifier works well for binary solutions. As you may be away, NB was the original bases for spam detection (is it spam or not). So your case of is it noise or not should also work, but you will get more robust results if you have other data in addition to the bandwidth to train on. Also, I am wondering if there isn't a way to improve the title of your post so that it communicates your question more quickly.

What is the optimal topic-modelling workflow with MALLET?

Introduction
I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested to learn about best practices of end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also) and topic-modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R, however I would like to switch to MALLET as it offers many benefits over topicmodels. It can handle a lot of data, it does not rely on specific text pre-processing tools and it appears to be widely used for this purpose. However some of the issues outline below are also relevant for topicmodels too. I'd like to know how others approach topic modelling and which of the below steps could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits, stop words, stemming words and other text-processing tasks. Many of these can be done either as part of term-document matrix decomposition through functions such as for example TermDocumentMatrix from R's package tm.
Problem: This however may need to be performed on the text strings directly, using functions such as gsub in order for MALLET to consume these strings. Performing in on the strings directly is not as efficient as it involves repetition (e.g. the same word would have to be stemmed several times)
2. Construct features
In this step we construct a term-document matrix (TDM), followed by the filtering of terms based on frequency, and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next go through the terms and identify what requires to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem-completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET does not work with TDM constructs and to make use of your TDM, you would need to find the difference between the original TDM -- with no features removed -- and the TDM that you are happy with. This difference would become stop words for MALLET. (2) On that note I'd also like to point out that feature selection does require a substantial amount of manual work and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: If you decide to stick with R alone, then I can recommend the quanteda package which has a function dfm that accepts a thesaurus as one of the parameters. This thesaurus allows to to capture patterns (usually regex) as opposed to words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: Log likelihood does help to understand how good is the fit, but (1) it often tends to suggest that I need more topics than it is practically sensible and (2) given how long it generally takes to fit a model, it is virtually impossible to find or test a grid of optimal values such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so as incrementing a range by 1 generally takes too long to compute.
4. Maintenance
It is easy to classify new data into a set existing topics. However if you are running it over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for as you are dealing with a problem that requires an unsupervised solution and yet for it to be tracked over time, you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on new data (3) monitor log likelihood values over time and devise a threshold when to switch from old to new; and (4) merge old and new solutions somehow so that the evolution of topics would be revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Optimal number of topics selection based on LL does not account for what is practically sensible
Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics)
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read that far, I'd like to thank you, this is a rather long post. If you are interested in the suggest, feel free to either add more questions in the comments that you think are relevant or offer your thoughts on how to overcome some of these problems.
Cheers

Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.

"Global array" parallel programming on distributed memory clusters with python

I am looking for a python library which extends the functionality of numpy to operations on a distributed memory cluster: i.e. "a parallel programming model in which the programmer views an array as a single global array rather than multiple, independent arrays located on different processors."
For Matlab MIT's Lincoln Lab has created pMatlab which allows to do matrix algebra on a cluster without worrying too much about the details of the parallel programming aspect. (Origin of above quote.)
For disk-based storage, pyTables exist for python. Though it does not optimise how calculations are distributed in a cluster but rather how calculations are "distributed" with respect to large data on a disk. - Which is reasonably similar but still missing a crucial aspect.
The aim is not to squeeze the last bit of performance from a cluster but to do scientific calculations (semi-interactively) that are too large for single machines.
Does something similar exist for python? My wishlist would be:
actively maintained
drop in replacement for numpy
alternatively similar usage to numexpr
high abstraction of the parallel programming part: i.e. no need for the user to explicitly use MPI
support for data-locality in distributed memory clusters
support for multi-core machines in the cluster
This is probably a bit like believing in the tooth-fairy but one never knows...
I have found so far:
There (exists/used to exist) a python interface for Global Array by the Pacific Northwest National Laboratory. See the links under the topic "High Performance Parallel Computing in Python using NumPy and the Global Arrays Toolkit". (Especially "GA_SciPy2011_Tutorial.pdf".) However this seems to have disappeared again.
DistNumPy: described more in detail in this paper. However the projects appears to have been abandoned.
If you know of any package or have used any of the two above, please describe your experiences with them.

You should take a look at Blaze, although it may not be far enough along in development to suit your needs at the moment. From the linked page:
Blaze is an expressive, compact set of foundational abstractions for
composing computations over large amounts of semi-structured data, of
arbitrary formats and distributed across arbitrary networks.

What is MATLAB good for? Why is it so used by universities? When is it better than Python? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I've been recently asked to learn some MATLAB basics for a class.
What does make it so cool for researchers and people that works in university?
I saw it's cool to work with matrices and plotting things... (things that can be done easily in Python using some libraries).
Writing a function or parsing a file is just painful. I'm still at the start, what am I missing?
In the "real" world, what should I think to use it for? When should it can do better than Python? For better I mean: easy way to write something performing.
UPDATE 1: One of the things I'd like to know the most is "Am I missing something?" :D
UPDATE 2: Thank you for your answers. My question is not about buy or not to buy MATLAB. The university has the possibility to give me a copy of an old version of MATLAB (MATLAB 5 I guess) for free, without breaking the license. I'm interested in its capabilities and if it deserves a deeper study (I won't need anything more than basic MATLAB in oder to pass the exam :P ) it will really be better than Python for a specific kind of task in the real world.

Adam is only partially right. Many, if not most, mathematicians will never touch it. If there is a computer tool used at all, it's going to be something like Mathematica or Maple. Engineering departments, on the other hand, often rely on it and there are definitely useful things for some applied mathematicians. It's also used heavily in industry in some areas.
Something you have to realize about MATLAB is that it started off as a wrapper on Fortran libraries for linear algebra. For a long time, it had an attitude that "all the world is an array of doubles (floats)". As a language, it has grown very organically, and there are some flaws that are very much baked in, if you look at it just as a programming language.
However, if you look at it as an environment for doing certain types of research in, it has some real strengths. It's about as good as it gets for doing floating point linear algebra. The notation is simple and powerful, the implementation fast and trusted. It is very good at generating plots and other interactive tasks. There are a large number of `toolboxes' with good code for particular tasks, that are affordable. There is a large community of users that share numerical codes (Python + NumPy has nothing in the same league, at least yet)
Python, warts and all, is a much better programming language (as are many others). However, it's a decade or so behind in terms of the tools.
The key point is that the majority of people who use MATLAB are not programmers really, and don't want to be.
It's a lousy choice for a general programming language; it's quirky, slow for many tasks (you need to vectorize things to get efficient codes), and not easy to integrate with the outside world. On the other hand, for the things it is good at, it is very very good. Very few things compare. There's a company with reasonable support and who knows how many man-years put into it. This can matter in industry.
Strictly looking at your Python vs. MATLAB comparison, they are mostly different tools for different jobs. In the areas where they do overlap a bit, it's hard to say what the better route to go is (depends a lot on what you're trying to do). But mostly Python isn't all that good at MATLAB's core strengths, and vice versa.

Most of answers do not get the point.
There is ONE reason matlab is so good and so widely used:
EXTREMELY FAST CODING
I am a computer vision phD student and have been using matlab for 4 years, before my phD I was using different languages including C++, java, php, python... Most of the computer vision researchers are using exclusively matlab.
1) Researchers need fast prototyping
In research environment, we have (hopefully) often new ideas, and we want to test them really quick to see if it's worth keeping on in that direction. And most often only a tiny sub-part of what we code will be useful.
Matlab is often slower at execution time, but we don't care much. Because we don't know in advance what method is going to be successful, we have to try many things, so our bottle neck is programming time, because our code will most often run a few times to get the results to publish, and that's all.
So let's see how matlab can help.
2) Everything I need is already there
Matlab has really a lot of functions that I need, so that I don't have to reinvent them all the time:
change the index of a matrix to 2d coordinate: ind2sub extract all patches of an image: im2col; compute a histogram of an image: hist(Im(:)); find the unique elements in a list unique(list); add a vector to all vectors of a matrix bsxfun(#plus,M,V); convolution on n-dimensional arrays convn(A); calculate the computation time of a sub part of the code: tic; %%code; toc; graphical interface to crop an image: imcrop(im);
The list could be very long...
And they are very easy to find by using the help.
The closest to that is python...But It's just a pain in python, I have to go to google each time to look for the name of the function I need, and then I need to add packages, and the packages are not compatible one with another, the format of the matrix change, the convolution function only handle doubles but does not make an error when I give it char, just give a wrong output... no
3) IDE
An example: I launch a script. It produces an error because of a matrix. I can still execute code with the command line. I visualize it doing: imagesc(matrix). I see that the last line of the matrix is weird. I fix the bug. All variables are still set. I select the remaining of the code, press F9 to execute the selection, and everything goes on. Debuging becomes fast, thanks to that.
Matlab underlines some of my errors before execution. So I can quickly see the problems. It proposes some way to make my code faster.
There is an awesome profiler included in the IDE. KCahcegrind is such a pain to use compared to that.
python's IDEs are awefull. python without ipython is not usable. I never manage to debug, using ipython.
+autocompletion, help for function arguments,...
4) Concise code
To normalize all the columns of a matrix ( which I need all the time), I do:
bsxfun(#times,A,1./sqrt(sum(A.^2)))
To remove from a matrix all colums with small sum:
A(:,sum(A)<e)=[]
To do the computation on the GPU:
gpuX = gpuarray(X);
%%% code normally and everything is done on GPU
To paralize my code:
parfor n=1:100
%%% code normally and everything is multi-threaded
What language can beat that?
And of course, I rarely need to make loops, everything is included in functions, which make the code way easier to read, and no headache with indices. So I can focus, on what I want to program, not how to program it.
5) Plotting tools
Matlab is famous for its plotting tools. They are very helpful.
Python's plotting tools have much less features. But there is one thing super annoying. You can plot figures only once per script??? if I have along script I cannot display stuffs at each step ---> useless.
6) Documentation
Everything is very quick to access, everything is crystal clear, function names are well chosen.
With python, I always need to google stuff, look in forums or stackoverflow.... complete time hog.
PS: Finally, what I hate with matlab: its price

I've been using matlab for many years in my research. It's great for linear algebra and has a large set of well-written toolboxes. The most recent versions are starting to push it into being closer to a general-purpose language (better optimizers, a much better object model, richer scoping rules, etc.).
This past summer, I had a job where I used Python + numpy instead of Matlab. I enjoyed the change of pace. It's a "real" language (and all that entails), and it has some great numeric features like broadcasting arrays. I also really like the ipython environment.
Here are some things that I prefer about Matlab:
consistency: MathWorks has spent a lot of effort making the toolboxes look and work like each other. They haven't done a perfect job, but it's one of the best I've seen for a codebase that's decades old.
documentation: I find it very frustrating to figure out some things in numpy and/or python because the documentation quality is spotty: some things are documented very well, some not at all. It's often most frustrating when I see things that appear to mimic Matlab, but don't quite work the same. Being able to grab the source is invaluable (to be fair, most of the Matlab toolboxes ship with source too)
compactness: for what I do, Matlab's syntax is often more compact (but not always)
momentum: I have too much Matlab code to change now
If I didn't have such a large existing codebase, I'd seriously consider switching to Python + numpy.

Hold everything. When's the last time you programed your calculator to play tetris? Did you actually think you could write anything you want in those 128k of RAM? Likely not. MATLAB is not for programming unless you're dealing with huge matrices. It's the graphing calculator you whip out when you've got Megabytes to Gigabytes of data to crunch and/or plot. Learn just basic stuff, but also don't kill yourself trying to make Python be a graphing calculator.
You'll quickly get a feel for when you want to crunch, plot or explore in MATLAB and when you want to have all that Python offers. Lots of engineers turn to pre and post processing in Python or Perl. Occasionally even just calling out to MATLAB for the hard bits.
They are such completely different tools that you should learn their basic strengths first without trying to replace one with the other. Granted for saving money I'd either use Octave or skimp on ease and learn to work with sparse matrices in Perl or Python.

MATLAB is great for doing array manipulation, doing specialized math functions, and for creating nice plots quick.
I'd probably only use it for large programs if I could use a lot of array/matrix manipulation.
You don't have to worry about the IDE as much as in more formal packages, so it's easier for students without a lot of programming experience to pick up.

MATLAB is a popular and widely adapted piece of a
sophisticated software package. It'd be a mistake to think
it's merely a math software since it has a wide range of
"toolboxes". I recently used Matplotlib to plot some data
from a database and it did the job without needing all the
bells and whistles of MATLAB. However, it may not be proper
to compare Python and MATLAB in every situation. As with
everything else the decision depends on what you need to do.
I used MATLAB in undergrad for control systems design and
simulation and also for image processing in grad school. For
these fields MATLAB makes the most sense because of the
powerful control and image processing toolboxes. As everyone
mentioned, array operations, which are used in every MATLAB
script you'd need to write, are very easy with MATLAB.
Another nice thing about MATLAB is that it's very easy and
fast to do prototyping and trying out ideas using the built
in toolbox functions. For instance, it takes no effort to
import an image and compute it's histogram or do some simple
processing on it. One disadvantage of MATLAB could be it's
speed because of its interpreted nature. However, if one
really needs speed than he can choose to implement the
tested logic in C/C++, etc.
For further comparison with Python, I can say that MATLAB
provides a full package for you to do your work without the
need of looking around for external libraries and
implementing extra functions.
One last point about MATLAB which I see is not mentioned in
the answers here is that it has a very powerful visual
modeling/simulation environment called Simulink. It's
easier to design and simulate larger systems with Simulink.
Finally, again, it all depends on the problem you need to
solve. If your problem domain can make use of one of
MATLAB's toolboxes and you have access to MATLAB then you
can be sure that you'll have the right tool to solve it.

MATLAB, as mentioned by others, is great at matrix manipulation, and was originally built as an extension of the well-known BLAS and LAPACK libraries used for linear algebra. It interfaces well with other languages like Java, and is well favored by engineering and scientific companies for its well developed and documented libraries. From what I know of Python and NumPy, while they share many of the fundamental capabilities of MATLAB, they don't have the full breadth and depth of capabilities with their libraries.
Personally, I use MATLAB because that's what I learned in my internship, that's what I used in grad school, and that's what I used in my first job. I don't have anything against Python (or any other language). It's just what I'm used too.
Also, there is another free version in addition to scilab mentioned by #Jim C from gnu called Octave.

Personally, I tend to think of Matlab as an interactive matrix calculator and plotting tool with a few scripting capabilities, rather than as a full-fledged programming language like Python or C. The reason for its success is that matrix stuff and plotting work out of the box, and you can do a few very specific things in it with virtually no actual programming knowledge. The language is, as you point out, extremely frustrating to use for more general-purpose tasks, such as even the simplest string processing. Its syntax is quirky, and it wasn't created with the abstractions necessary for projects of more than 100 lines or so in mind.
I think the reason why people try to use Matlab as a serious programming language is that most engineers (there are exceptions; my degree is in biomedical engineering and I like programming) are horrible programmers and hate to program. They're taught Matlab in college mostly for the matrix math, and they learn some rudimentary programming as part of learning Matlab, and just assume that Matlab is good enough. I can't think of anyone I know who knows any language besides Matlab, but still uses Matlab for anything other than a few pure number crunching applications.

The most likely reason that it's used so much in universities is that the mathematics faculty are used to it, understand it, and know how to incorporate it into their curriculum.

Between matplotlib+pylab and NumPy I don't think there's much actual difference between Matlab and python other than cultural inertia as suggested by #Adam Bellaire.

I believe you have a very good point and it's one that has been raised in the company where I work. The company is limited in it's ability to apply matlab because of the licensing costs involved. One developer proved that Python was a very suitable replacement but it fell on ignorant ears because to the owners of those ears...
No-one in the company knew Python although many of us wanted to use it.
MatLab has a name, a company, and task force behind it to solve any problems.
There were some (but not a lot) of legacy MatLab projects that would need to be re-written.
If it's worth £10,000 (??) it's gotta be worth it!!
I'm with you here. Python is a very good replacement for MatLab.
I should point out that I've been told the company uses maybe 5% to 10% of MatLabs capabilities and that is the basis for my agreement with the original poster

MATLAB is a fantastic tool for
prototyping
engineering simulation and
fast visualization of data
You can really play with, visualize and test your ideas on a data set very effectively. It should not be regarded as an alternative to other software languages used for product development. I highly recommend it for the above tasks, though it is expensive - free alternatives like Octave and Python are catching up.

Seems to be pure inertia. Where it is in use, everyone is too busy to learn IDL or numpy in sufficient detail to switch, and don't want to rewrite good working programs. Luckily that's not strictly true, but true enough in enough places that Matlab will be around a long time. Like Fortran (in active use where i work!)

The main reason it is useful in industry is the plug-ins built on top of the core functionality. Almost all active Matlab development for the last few years has focused on these.
Unfortunately, you won't have much opportunity to use these in an academic environment.

I know this question is old, and therefore may no longer be
watched, but I felt it was necessary to comment. As an
aerospace engineer at Georgia Tech, I can say, with no
qualms, that MATLAB is awesome. You can have it quickly
interface with your Excel spreadsheets to pull in data about
how high and fast rockets are flying, how the wind affects
those same rockets, and how different engines matter. Beyond
rocketry, similar concepts come into play for cars, trucks,
aircraft, spacecraft, and even athletics. You can pull in
large amounts of data, manipulate all of it, and make sure
your results are as they should be. In the event something is
off, you can add a line break where an error occurs to debug
your program without having to recompile every time you want
to run your program. Is it slower than some other programs?
Well, technically. I'm sure if you want to do the number
crunching it's great for on an NVIDIA graphics processor, it
would probably be faster, but it requires a lot more effort
with harder debugging.
As a general programming language, MATLAB is weak. It's not
meant to work against Python, Java, ActionScript, C/C++ or
any other general purpose language. It's meant for the
engineering and mathematics niche the name implies, and it
does so fantastically.

MATLAB WAS a wrapper around commonly available libraries.
And in many cases it still is. When you get to larger
datasets, it has many additional optimizations, including
examining and special casing common problems (reducing to
sparse matrices where useful, for example), and handling
edge cases. Often, you can submit a problem in a standard
form to a general function, and it will determine the best
underlying algorithm to use based on your data. For small
N, all algorithms are fast, but MATLAB makes determining the
optimal algorithm a non-issue.
This is written by someone who hates MATLAB, and has tried
to replace it due to integration issues. From your
question, you mention getting MATLAB 5 and using it for a
course. At that level, you might want to look at
Octave, an open source implementation with the same
syntax. I'm guessing it is up to MATLAB 5 levels by now (I
only play around with it). That should allow you to "pass
your exam". For bare MATLAB functionality it seems to be
close. It is lacking in the toolbox support (which, again,
mostly serves to reformulate the function calls to forms
familiar to engineers in the field and selects the right
underlying algorithm to use).

One reason MATLAB is popular with universities is the same reason a lot of things are popular with universities: there's a lot of professors familiar with it, and it's fairly robust.
I've spoken to a lot of folks who are especially interested in MATLAB's nascent ability to tap into the GPU instead of working serially. Having used Python in grad school, I kind of wish I had the licks to work with MATLAB in that case. It sure would make vector space calculations a breeze.

It's been some time since I've used Matlab, but from memory it does provide (albeit with extra plugins) the ability to generate source to allow you to realise your algorithm on a DSP.
Since python is a general purpose programming language there is no reason why you couldn't do everything in python that you can do in matlab. However, matlab does provide a number of other tools - eg. a very broad array of dsp features, a broad array of S and Z domain features.
All of these could be hand coded in python (since it's a general purpose language), but if all you're after is the results perhaps spending the money on Matlab is the cheaper option?
These features have also been tuned for performance. eg. The documentation for Numpy specifies that their Fourier transform is optimised for power of 2 point data sets. As I understand Matlab has been written to use the most efficient Fourier transform to suit the size of the data set, not just power of 2.
edit: Oh, and in Matlab you can produce some sensational looking plots very easily, which is important when you're presenting your data. Again, certainly not impossible using other tools.

I think you answered your own question when you noted that Matlab is "cool to work with matrixes and plotting things". Any application that requires a lot of matrix maths and visualisation will probably be easiest to do in Matlab.
That said, Matlab's syntax feels awkward and shows the language's age. In contrast, Python is a much nicer general purpose programming language and, with the right libraries can do much of what Matlab does. However, Matlab is always going to have a more concise syntax than Python for vector and matrix manipulation.
If much of your programming involves these sorts of manipulations, such as in signal processing and some statistical techniques, then Matlab will be a better choice.

First Mover Advantage. Matlab has been around since the late 1970s. Python came along more recently, and the libraries that make it suitable for Matlab type tasks came along even more recently. People are used to Matlab, so they use it.

Matlab is good at doing number crunching. Also Matrix and matrix manipulation. It has many helpful built in libraries(depends on the what version) I think it is easier to use than python if you are going to be calculating equations.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.