How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?
Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set -Xmx (large enough) equal to -Xms, and use the Epsilon GC (-XX:+UseEpsilonGC) if it is available.
Rule 8: Use a library for your benchmark, as it is probably more efficient and has already been debugged for this sole purpose. Examples are JMH, Caliper, or Bill and Paul's Excellent UCSD Benchmarks for Java.
I know this question has been marked as answered but I wanted to mention two libraries that help us to write micro benchmarks
Caliper from Google
Getting started tutorials
http://codingjunkie.net/micro-benchmarking-with-caliper/
http://vertexlabs.co.uk/blog/caliper
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH
Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you can't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but it's very likely that it really will garbage collect in my experience.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.
jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
The jmh is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targeting the JVM.
There are very interesting pieces of information buried in the comments of the sample tests.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.
Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.
Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.
If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order. i.e.:
for (int i = 0; i < n; i++)
    alg1();
for (int i = 0; i < n; i++)
    alg2();
for (int i = 0; i < n; i++)
    alg2();
for (int i = 0; i < n; i++)
    alg1();
I have found some noticeable differences (5-10% sometimes) in the runtime of the same algorithm in different passes.
Also, make sure that n is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.
There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to account for all sorts of events that take a more or less random amount of time: garbage collection, caching effects (of the OS for files and of the CPU for memory), IO, etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing. So different runs in the same JVM-instance will become faster and faster.
My recommendations: Make your benchmark run for some seconds; that is more reliable than a runtime measured in milliseconds. Warm up the JVM (that is, run the benchmark at least once without measuring, so that the JVM can perform its optimizations). Run your benchmark multiple times (maybe 5 times) and take the median value. Run every micro-benchmark in a new JVM instance (start a new java process for every benchmark); otherwise optimization effects of the JVM can influence later tests. Don't execute things in the measured phase that weren't executed in the warmup phase (as this could trigger class loading and recompilation).
It should also be noted that it might also be important to analyze the results of the micro benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster during most of the runs of the benchmark than implementation B. But A might also have a higher spread, so the measured performance benefit of A won't be of any significance when compared with B.
So it is important not only to write and run a micro benchmark correctly, but also to analyze it correctly.
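As a rough illustration of the significance-test idea, here is a minimal sketch that computes Welch's t statistic for two sets of timings. It is written in Python purely for brevity (the arithmetic is the same for timings collected from a Java benchmark). The function name welch_t and the sample numbers are made up for the example, and Welch's test is just one possible choice of test.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and the approximate degrees of freedom.

    a and b are sequences of run times (at least two values each).
    """
    # Squared standard error of each sample mean.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical run times in milliseconds for implementations A and B.
runs_a = [10.0, 12.0, 11.0, 13.0]
runs_b = [20.0, 22.0, 21.0, 23.0]
t, df = welch_t(runs_a, runs_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The statistic still has to be compared against a t distribution with df degrees of freedom (a statistics library can do that) before calling a difference significant. A large spread in either sample shrinks |t|, which is exactly the effect described above.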
To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. Intel Core i5 range with TurboBoost), the temperature (and number of cores currently being used, as well as their utilisation percent) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single and multi-threaded performance on some systems. Bear in mind that the temperature and voltages also affect how long Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, put the calls to the assignment in places that make sense to avoid measuring things which you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
Problem is you're not immediately getting the end time when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");
http://opt.sourceforge.net/ Java Micro Benchmark - control tasks required to determine the comparative performance characteristics of the computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.
P.S.: I've mentioned possible solutions to my problem but have many confusions with them, please provide me suggestions on them. Also if this question is not good for this site, please point me to the correct site and I'll move the question there. Thanks in advance.
I need to perform some repetitive graph theory and complex network algorithms to analyze approx 2000 undirected simple graphs with no self-loops for some research work. Each graph has approx 40,000 nodes and approx 600,000 edges (essentially making them sparse graphs).
Currently, I am using NetworkX for my analysis, running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G) for 500 such graphs; the code has been running for 3 days and has reached only the halfway point. This makes me fearful that my full analysis will take a huge and unexpected amount of time.
Before elaborating on my problem and the probable solutions I've thought of, let me mention my computer's configuration as it may help you in suggesting the best approach. I am running Windows 10 on an Intel i7-9700K processor with 32GB RAM and one Zotac GeForce GTX 1050 Ti OC Edition ZT-P10510B-10L 4GB PCI Express Graphics Card.
Explaining my possible solutions and my confusions regarding them:
A) Using GPU with Adjacency Matrix as Graph Data Structure: I can put an adjacency matrix on GPU and perform my analysis by manually coding them with PyCuda or Numba using loops only as recursion cannot be handled by GPU. The nearest I was able to search is this on stackoverflow but it has no good solution.
My Expectations: I hope to speedup algorithms such as All Pair Shortest Path, All Possible Paths between two nodes, Average Clustering, Average Shortest Path Length, and Small World Properties, etc. If it gives a significant speedup per graph, my results can be achieved very fast.
My Confusions:
Can these graph algorithms be efficiently coded on a GPU?
Which will be better to use? PyCuda or Numba?
Is there any other way to store graphs on the GPU that could be more efficient, as my graphs are sparse?
I am an average Python Programmer with no experience of GPU programming, so I will have to understand and learn GPU programming with PyCuda/ Numba. Which one is easier to learn?
B) Parallelizing Programs on CPU Itself: I can use Joblib or any other library to run the program in parallel on my CPU itself. I can arrange 2-3 more computers on which I can run small independent portions of the programs, or run 500 graphs per computer.
My Expectations: I hope to speedup algorithms by parallelizing and dividing tasks among computers. If the GPU solution does not work, I may still have some hope by this method.
My Confusions:
Which other libraries are available as good alternatives for Joblib?
Should I allot all CPU cores (8 cores in i7) for my programs or use fewer cores?
C) Apart from my probable solutions do you have any other suggestions for me? If a better and faster solution is available in any other language except C/C++, you can also suggest them as well, as I am already considering C++ as a fallback plan if nothing works.
Work In Progress Updates
Based on different suggestions from comments on this question and discussion in my community, these are the options I've been advised to explore.
GraphBLAS
boost.graph + extensions with python-wrappers
graph-tool
Spark/ Dask
PyCuda/ Numba
Linear algebra methods using PyTorch
I tried to run 100 graphs on my CPU using Joblib (with n_jobs=-1); the CPU was continuously hitting a temperature of 100°C, and the processor tripped after running for 3 hours. As a solution, I am using 75% of the available cores on multiple computers (so if 8 cores are available, I use 6), and the program is running fine. The speedup is also good.
This is a broad but interesting question. Let me try to answer it.
2000 undirected simple graphs [...] Each graph has approx 40,000 nodes and approx 600,000 edges
Currently, I am using NetworkX for my analysis and currently running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G)
NetworkX uses plain Python implementations and is not optimized for performance. It's great for prototyping but if you encounter performance issues, it's best to look to rewrite your code using another library.
Other than NetworkX, the two most popular graph processing libraries are igraph and SNAP. Both are implemented in compiled code (igraph in C, SNAP in C++) and have Python APIs, so you get both good single-threaded performance and ease of use. Their parallelism is very limited, but this is not a problem in your use case as you have many graphs, rendering your problem embarrassingly parallel. Therefore, as you remarked in the updated question, you can run 6-8 jobs in parallel using e.g. Joblib or even xargs. If you need parallel processing within a single graph, look into graph-tool, which also has a Python API.
Regarding your NetworkX algorithms, I'd expect the average_shortest_path_length to be reasonably well-optimized in all libraries. The average_clustering algorithm is tricky as it relies on node-wise triangle counting and a naive implementation takes O(|E|^2) time while an optimized implementation will do it in O(|E|^1.5). Your graphs are large enough so that the difference between these two costs is running the algorithm on a graph in a few seconds vs. running the algorithm for hours.
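To make the node-wise triangle counting concrete, here is a minimal pure-Python sketch of the average clustering coefficient. It assumes the graph is given as a dict mapping each node to the set of its neighbours (undirected, no self-loops). That representation and the function name are chosen for the example and are not any library's API. This is the straightforward neighbour-pair version, whose cost grows with the sum of squared degrees; it is not the O(|E|^1.5) variant mentioned above.

```python
from itertools import combinations

def average_clustering(adj):
    """Average local clustering coefficient of an undirected simple graph.

    adj maps each node to the set of its neighbours.
    Nodes with fewer than two neighbours contribute a coefficient of 0.
    """
    if not adj:
        return 0.0
    total = 0.0
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Each connected pair of neighbours closes one triangle through node.
        triangles = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
        total += 2.0 * triangles / (k * (k - 1))
    return total / len(adj)

# A triangle a-b-c with a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(average_clustering(graph))
```

High-degree nodes dominate the running time here, because every pair of their neighbours is tested. Optimized implementations avoid that by ordering nodes so that each triangle is only examined from its low-degree end.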
The "all-pairs shortest paths" (APSP) problem is very time-consuming, with most libraries using the Floyd–Warshall algorithm that has a runtime of O(|V|^3). I'm unsure what output you're looking for with the "All Possible Paths between two nodes" algorithm – enumerating all paths leads to an exponential amount of results and is unfeasible at this scale.
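For unweighted graphs there is an alternative to Floyd–Warshall: run one breadth-first search per source node, which costs O(|V|·(|V|+|E|)) overall and is much cheaper than O(|V|^3) when the graph is sparse. The sketch below computes the average shortest path length that way. It assumes a connected graph with at least two nodes, stored as a dict of neighbour sets; the function names are made up for the example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_shortest_path_length(adj):
    """Mean distance over all ordered pairs of distinct nodes.

    Assumes at least two nodes; raises ValueError if the graph is disconnected.
    """
    n = len(adj)
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) != n:
            raise ValueError("graph is not connected")
        total += sum(dist.values())
    return total / (n * (n - 1))

# Path graph 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(average_shortest_path_length(path))
```

Each source is independent of the others, so the outer loop can also be split across worker processes if a single graph is still too slow.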
I would not start using the GPU for this task: an Intel i7-9700K should be up for this job. GPU-based graph processing libraries are challenging to set up and currently do not provide a large speedup: the gains from using a GPU instead of a CPU are nowhere near as significant for graph processing as for machine learning algorithms. The only problem where you might be able to get a big speedup is APSP, but it depends on which algorithms your chosen library uses.
If you are interested in GPU-based libraries, there are promising directions on the topic such as Gunrock, GraphBLAST, and a work-in-progress SuiteSparse:GraphBLAS extension that supports CUDA. However, my estimate is that you should be able to run most of your algorithms (barring APSP) in a few hours using a single computer and its CPU.
I have written a script and am running it on different machines. The script looks like this:
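import time

def f(n):
    x = None
    while n:
        x = simple_math(n)  # simple_math: placeholder for some cheap arithmetic
        n -= 1
    return x

start = time.perf_counter()
f(BIGNUM)  # BIGNUM: placeholder for a large iteration count
print(time.perf_counter() - start)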
At the end of the script it prints how much time it took to finish. Is this good enough to compare different machines for practical CPU speed for simple Python scripts?
By simple I mean it does not use multiprocessing module or any other technique to take advantage of multi-core machines.
This question is not about
making python programs run faster
multiprocessing module
GIL, I/O efficiency etc.
non cPython programs
I just want to make sure that my approach to understanding CPU performance across machines is fairly correct.
What's wrong with all of the countless existing benchmarks? The more sophisticated ones are probably a bit more robust. The major problems with your naive approach that I can spot (and I'm not an expert on this topic, mind you) are:
Modern CPUs are highly complex and employ very clever optimizations. The speed of a purely CPU-bound program can vary widely depending on how often the cache can help, how often the program causes pipeline stalls, how often branch prediction is correct, and probably many, many more factors (these were just off the top of my head). Although many of these shouldn't make a difference when you use the same build of the same executable running the same script doing the same pure calculations, they can matter - to a degree none of us can predict - once you change any of these parameters (e.g. using a different build because of a different OS or architecture).
Multi-threading OSs will never let a program occupy the CPU exclusively. There will always be some other program running at the same time stealing time, and you can't really know how much of the x seconds were spent running your program and how many were spent on other programs. At the very least, you should run a program many times and take the minimum time as the time it takes with relatively little interference from other programs. And even then, you need to have about the same system load in both benchmarks to make the numbers somewhat meaningful.
At least CPython won't multi-thread, so you only get the speed of one core.
But since your requirements seem to be "very rough estimate of CPU speed only, in full awareness that these numbers can't be used for anything except putting CPU speed into orders of magnitude, must be taken with a grain of salt even then and don't tell anything about the actual performance of any real applications", it might be okay - just don't consider it anywhere close to accurate. Still, why not use a hardened benchmark suite that already put some effort into mitigating (not removing - nobody can do that) these problems?
Also note that the timeit stdlib module is both easier to use than manually wielding the stopwatch and tries (not too hard, but it's a start) to fix the second point by the method I mentioned.
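A minimal sketch of the timeit approach, taking the minimum over several repeats as suggested above. The body of simple_math and the iteration counts are stand-ins chosen for the example, since the question does not show them.

```python
import timeit

def simple_math(n):
    # Hypothetical stand-in for the question's simple_math().
    return n * n + 1

def f(n):
    x = None
    while n:
        x = simple_math(n)
        n -= 1
    return x

CALLS_PER_RUN = 10

# Five independent runs of ten calls each; the fastest run is the one
# least disturbed by other processes on the machine.
timings = timeit.repeat(lambda: f(10_000), repeat=5, number=CALLS_PER_RUN)
best_per_call = min(timings) / CALLS_PER_RUN
print(f"best: {best_per_call:.6f} s per call")
```

The minimum is used rather than the mean because background load can only make a run slower, never faster.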
You can get a rough idea by using this type of method, but it will not be an exact measurement. The execution time of the script will depend on many things other than CPU speed, such as the OS and interpreter version used, current system load, memory speed, etc. My suggestion is not to depend on this.
EDIT: Just a note. When it comes to performance, many people think only about CPU speed, but actually performance can be hampered by almost everything on the system. For example, if you have a high-speed CPU but poor RAM (in both size and speed), then you will get no performance boost from the CPU.
In essence: No.
Benchmarking is a very difficult problem which usually is not worth solving yourself. It all depends on why you care. Your method will surely give a very rough estimate on if System A is better than System B, but really only when the outcome is vastly different.
What you're trying to do is determine how Real World Application X will perform on different computers. Very rarely is a real world application approximated by a loop of simple math. Even when it is (scientific computing mostly) you're better off measuring times on the actual program.
Real world applications are usually non-linear, and difficult to measure and simulate. It's really one of those problems which has been solved by someone else much better than you could reasonably solve yourself.
If you want a very rough estimate of performance, sure do it your way. Just don't put too much faith in the results because they will be far from what you might call "scientific"
If I understand your intention correctly, why not check how it is done in timeit? (You could clarify your question a bit, though: what exactly are you trying to measure or estimate: processor speed, code speed, or something else, and for what purpose?)
Working on different projects I have the choice of selecting different programming languages, as long as the task is done.
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
The tasks to be done are pretty varied, e.g. sorting textfiles, disk access, network access, textfile parsing.
Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
And in your experience, given the power of current CPUs (i7), is it really a noticeable difference (consider that it's a program that doesn't bring the system to its knees)?
Use python until you have a performance problem. If you ever have one figure out what the problem is (often it isn't what you would have guessed up front). Then solve that specific performance problem which will likely be an algorithm or data structure change. In the rare case that your problem really needs C then you can write just that portion in C and use it from your python code.
C will absolutely crush Python in almost any performance category, but C is far more difficult to write and maintain and high performance isn't always worth the trade off of increased time and difficulty in development.
You say you're doing things like text file processing, but what you omit is how much text file processing you're doing. If you're processing 10 million files an hour, you might benefit from writing it in C. But if you're processing 100 files an hour, why not use python? Do you really need to be able to process a text file in 10ms vs 50ms? If you're planning for the future, ask yourself, "Is this something I can just throw more hardware at later?"
Writing solid code in C is hard. Be sure you can justify that investment in effort.
In general, IO-bound work will depend more on the algorithm than the language. In this case I would go with Python because it will have first-class strings and lots of easy-to-use libraries for manipulating files, etc.
Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?
Yes.
The noticeable differences are these
There's much less Python code.
The Python code is much easier to read.
Python supports really nice unit testing, so the Python code tends to be higher quality.
You can write the Python code more quickly, since there are fewer quirky language features. No preprocessor, for example, really saves a lot of hacking around. Super-experienced C programmers hardly notice it. But all that #include sandwich stuff and making the .h files correct is remarkably time-consuming.
Python can be easier to package and deploy, since you don't need a big fancy make script to do a build.
The first rule of computer performance questions: Your mileage will vary. If small performance differences are important to you, the only way you will get valid information is to test with your configuration, your data, and your benchmark. "Small" here is, say, a factor of two or so.
The second rule of computer performance questions: For most applications, performance doesn't matter -- the easiest way to write the app gives adequate performance, even when the problem scales. If that is the case (and it is usually the case) don't worry about performance.
That said:
C compiles down to a machine executable and thus has the potential to execute at least as fast as any other language
Python is generally interpreted and thus may take more CPU than a compiled language
Very few applications are "CPU bound." I/O (to disk, display, or memory) is not greatly affected by compiled vs interpreted considerations and frequently is a major part of computer time spent on an application
Python works at a higher level of abstraction than C, so your development and debugging time may be shorter
My advice: Develop in the language you find the easiest with which to work. Get your program working, then check for adequate performance. If, as usual, performance is adequate, you're done. If not, profile your specific app to find out what is taking longer than expected or tolerable. See if and how you can fix that part of the app, and repeat as necessary.
Yes, sometimes you might need to abandon work and start over to get the performance you need. But having a working (albeit slow) version of the app will be a big help in making progress. When you do reach and conquer that performance goal you'll be answering performance questions in SO rather than asking them.
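The "profile your specific app" step can be sketched with the standard library's cProfile module. The functions slow_part, fast_part and main below are invented placeholders for a real application.

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Placeholder for the hot spot of a real application.
    return sum(i * i for i in range(n))

def fast_part(n):
    return n * 2

def main():
    slow_part(200_000)
    fast_part(10)

# Collect timing data for one run of main().
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Show the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The cumulative column shows where the time goes, including time spent in callees, which is usually the quickest way to see which part of the program is worth optimizing or rewriting.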
If your text files that you are sorting and parsing are large, use C. If they aren't, it doesn't matter. You can write poor code in any language though. I have seen simple code in C for calculating areas of triangles run 10x slower than other C code, because of poor memory management, use of structures, pointers, etc.
Your I/O algorithm should be independent of your compute algorithm. If this is the case, then using C for the compute algorithm can be much faster.
(Assumption - The question implies that the author is familiar with C but not Python, therefore I will base my answer with that in mind.)
I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.
C will almost certainly be faster unless it is implemented poorly, but the real questions are:
What are the development implications (development time, maintenance, etc.) for either implementation?
Is the performance benefit significant?
Learning Python can take some time, but there are Python modules that can greatly speed development time. For example, the csv module in Python makes reading and writing csv easy. Also, Python strings, arrays, maps, and other objects make it more flexible than plain C and more elegant, in my opinion, than the equivalent C++. Some things like network access may be much quicker to develop in Python as well.
However, it may take time to learn how to program Python well enough to accomplish your task. Since you are concerned with performance, I suggest trying a simple task, such as sorting a text file, in both C and Python. That will give you a better baseline on both languages in terms of performance, development time, and possibly maintenance.
It really depends a lot on what you're doing and whether the algorithm in question is available in Python via a natively compiled library. If it is, then I believe you'll be looking at performance numbers close enough that Python is most likely your answer -- assuming it's your preferred language. If you must implement the algorithm yourself, depending on the amount of logic required and the size of your data set, C/C++ may be the better option. It's hard to provide a less nebulous answer without more information.
To get an idea of the raw difference in speed, check out the Computer Languages Benchmark Game.
Then you have to decide whether that difference matters to you.
Personally, I ended up deciding that it did, but most of the time instead of using C, I ended up using other higher-level languages. Personally I mostly use Scala, but Haskell and C# and Java each have their advantages also.
Across all programs, it isn't really possible to say whether things will be quicker or slower on average in Python or C.
For the programs that I've implemented in both languages, using similar algorithms, I've seen no improvement (and sometimes a performance degradation) for string- and IO-heavy code, when reimplementing python code in C. The execution time is dominated by allocation and manipulation of strings (which functionality python implements very efficiently) and waiting for IO operations (which incurs the same overhead in either language), so the extra overhead of python makes very little difference.
But for programs that do even simple operations on image files, say (images being large enough for processing time to be noticeable compared to IO), C is enormously quicker. For this sort of task the bulk of the time running the python code is spent doing Python Stuff, and this dwarfs the time spent on the underlying operations (multiply, add, compare, etc.). When reimplemented as C, the bureaucracy goes away, the computer spends its time doing real honest work, and for that reason the thing runs much quicker.
It's not uncommon for the python code to run in (say) 5 seconds where the C code runs in (say) 0.05. So that's a 100x increase -- but in absolute terms, this is not so big a deal. It takes so much less time to write python code than it does to write C code that your program would have to be run some huge number of times to turn a time profit. I often reimplement in C, for various reasons, but if you don't have this requirement then it's probably not worth bothering. You won't get that part of your life back, and next year computers will be quicker.
Actually you can solve most of your tasks efficiently with python.
You just need to know which tools to use. For text processing there is a brilliant package from the eGenix guys - http://www.egenix.com/products/python/mxBase/mxTextTools/. I was able to create very efficient parsers with it in python, since all the heavy lifting is done by native code.
Same approach goes for any other problem - if you have performance problems, get a C/C++ library with Python interface which implements whatever bottleneck you got efficiently.
C is definitely faster than Python because Python is written in C.
C is a middle-level language and hence faster, but there is not a great difference between C and Python in execution time.
However, it is much easier to write code in Python than in C, and it takes much less time to learn Python and to write code in it.
Because it's easy to write, it's also easy to test.
You will find C is much slower. Your developers will have to keep track of memory allocation, and use libraries (such as glib) to handle simple things such as dictionaries, or lists, which python has built-in.
Moreover, when an error occurs, your C program will typically just crash, which means you'll need to get the error to happen in a debugger. Python would give you a stack trace (typically).
Your code will be bigger, which means it will contain more bugs. So not only will it take longer to write, it will take longer to debug, and will ship with more bugs. This means that customers will notice the bugs more often.
So your developers will spend longer fixing old bugs and thus new features will get done more slowly.
In the meantime, your competitors will be using a sensible programming language and their products will be increasing in features and usability rapidly; yours will look bad. Your customers will leave and you'll go out of business.
The excess time to write the code in C compared to Python will be exponentially greater than the difference between C and Python execution speed.
I'd like to begin thinking about how I can scale up my algorithms that I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what are the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracotta, and Eucalyptus) to make this happen, and how specifically these concepts and tools are related to each other. I have a rudimentary background in R, Python, and bash scripting and also C and Fortran programming, though I'm familiar with some basic functional programming concepts also. Do I need to change the way that I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (RHIPE)... or write wrappers for Python to enable multi-threading or Hadoop access? I understand this might involve requirements for additional hardware and I would like some basic idea of what the requirements/options available might be. My apologies for this rather large and yet vague question, but just trying to get started - thanks in advance!
While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited -- you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).
Relational databases may be an exception to "technologies pale" -- they can really clamp you down when you're trying to scale up a few orders of magnitude. Is that your situation -- are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you -- in the latter, you're at the breaking point, or well past it, and must start thinking in terms of key/value stores.
Back to algorithms -- "data analysis" covers a wide range... most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can't possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most "data mining" applications fall into the second category -- you don't need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.
This is particularly crucial if you ever need to do "near-real-time" data analysis -- near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% (I don't know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.
The work I've been doing over the last few years has had a few requirements for "near-real-time" and many more requirements for off-line, "batch" analysis -- and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don't understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes' worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your part of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv's, or whatever).
Python, R, C++, whatever, only come into play after you've mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to "statistically survive" the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework...?), and the supporting design and storage-technology choices.
The main thing for scaling up to large data is to avoid situations where you're reading huge datasets into memory at once. In pythonic terms this generally means using iterators to consume the dataset in manageable pieces.
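A minimal sketch of that iterator style: a generator yields one value at a time and a running mean consumes it, so memory use stays constant no matter how large the input is. The in-memory io.StringIO object stands in for a large file, and the function names are made up for the example.

```python
import io

def read_values(lines):
    """Yield one float per non-empty line without loading the whole input."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)

def running_mean(values):
    """Incrementally updated mean; only the count and current mean are kept."""
    count, mean = 0, 0.0
    for x in values:
        count += 1
        mean += (x - mean) / count
    return mean

# Stand-in for open("huge_file.txt"); a real file object iterates the same way.
sample = io.StringIO("1.0\n2.0\n\n3.0\n6.0\n")
print(running_mean(read_values(sample)))  # prints 3.0
```

The same pattern works for any statistic that can be updated one item (or one chunk) at a time, which is also what makes a computation easy to split across processes later.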