Populate a matrix in parallel - python

I routinely need to populate matrices A[i,j] by evaluating a function between pairs of vectors. Since the computation of every (i, j) pair is independent of the others, I want to parallelize this:
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        A[i, j] = function(X[i], X[j])
How could this computation be elegantly parallelized via joblib or another widely used library?

Q : "How this computation could be elegantly parallelized via joblib or other widely used library?"
If you use joblib, the main Python interpreter will spawn other, GIL-independent copies of itself ( yes, a huge memory-I/O cost to copy the full interpreter state, including all data structures, on O/S Windows; the initial latency hit is somewhat less painful on Linux-type O/S ). Yet the worse is still to come - any "remote" modification of the spawned/distributed replicas of the original data has to somehow make it back to the main Python process ( yes, again huge memory-I/O plus cache-(de)coherency hardware workloads, with per-core L1-data cache efficiency almost certainly devastated ).
So this trick does not easily pay for its own add-on costs, unless the function() computation is indeed many times more expensive than process instantiation plus process-to-process data interchange: SER/DES on the way "there" ( think pickle.dumps() memory allocation plus pickling compression/decompression costs ) + SER/DES on the way "back" + the actual p2p-communication latencies ( costs ) of moving the pickled data elements.
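For reference, a minimal joblib sketch of the pattern the question asks about, assuming the question's X and function are defined and that function() is expensive enough to amortise the SER/DES and process costs just described:

import numpy as np
from joblib import Parallel, delayed

n = len(X)                                     # X: the question's sequence of vectors
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

results = Parallel(n_jobs=-1)(                 # n_jobs=-1 -> use all cores
    delayed(function)(X[i], X[j]) for i, j in pairs
)

A = np.zeros((n, n))
for (i, j), value in zip(pairs, results):
    A[i, j] = value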
One might like more reads on this here and here and here.
Is There Any Better Way Forwards?
We have all surely heard about numpy and smart numpy-vectorised processing. Many thousands of man*years of top-level HPC experience were put into numpy's smart, data-I/O vectorised processing.
So in most cases, if you redesign the function( scalarA, scalarB ), returning a single scalarResult to be stored into an externally 2D-looped A[i,j], into an in-place modifying function( vectorX_data, matrixA_results ), and let its inner code do both the i,j-looping over matrixA_results.shape[0] and the actual computing, the results may become astonishingly faster, provided the numpy code can harness the smart CPU-vector instructions: these pay less than 0.5 [ns] of L1-data access latency, compared to as much as 300 ~ 380 [ns] of RAM access latency ( if the memory-I/O channel were free and permitted unenqueued data transfer from the slow & far RAM ), not to mention the somewhat latency-masked 10,000,000+ [ns] access costs of a numpy.memmap()-file-based data proxy.
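For illustration, a hedged sketch of that redesign, assuming the pairwise function happens to be something vectorisable such as a squared Euclidean distance ( swap in your own kernel ):

import numpy as np

def fill_pairwise(X, A):
    # in-place, vectorised fill of the upper triangle of A,
    # shown here for a squared-Euclidean-distance stand-in kernel
    sq = np.einsum('ij,ij->i', X, X)                  # ||x_i||^2 for every row
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    iu = np.triu_indices(A.shape[0], k=1)
    A[iu] = D[iu]

X = np.random.rand(1000, 64)
A = np.zeros((X.shape[0], X.shape[0]))
fill_pairwise(X, A)                                   # no Python-level i,j loop at all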
If one has never visited the domain of numpy tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master in this domain, guru #Divakar - all respect to them!

Related

Transpose large numpy matrix on disk

I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far, my approach has been relatively trivial, with the following Rust snippet using memory-mapped slices of the source and destination matrices, where both original_matrix and target_matrix are mmapped PyArray2, with Rayon handling the parallelization.
Since the target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
let shared_target_matrix = std::cell::UnsafeCell::new(target_matrix);
original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)| {
    feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
        *(shared_target_matrix.uget_mut([i, j])) = feature_value;
    });
});
This approach transposes a matrix with shape (~1G, 100), ~120GB, in about 3 hours on an HDD. Transposing a (~1G, 1000), ~1200GB matrix does not scale linearly to 30 hours, as one might naively expect, but explodes to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Note on sequential and parallel approaches
While intuitively this sort of operation should likely be limited only by IO, and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix with shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.
While intuitively this sort of operation should likely be limited only by IO, and therefore not benefit from any parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload without prefetching you'll be alternating the device between working and idle. If you keep multiple operations in flight it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to achieve best utilization of a HDD because it'll likely cause more head-seeks than necessary.
We also experimented with using two devices, one providing the Fortran-style matrix and a second one where we write the target matrix. Both HDDs were connected through SATA cables directly to the computer motherboard. We expected at least a doubling of the performance, but they remained unchanged.
This may be due to the operating system's write cache which means it can batch writes very efficiently and you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects, such as the employed file system, the HDD fragmentation, and how MMAPed handles page loading, which my solution is currently ignoring.
Are there known, more holistic solutions that take into account these issues?
Yes, if the underlying filesystem supports it you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical layout. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are rust bindings for that ioctl so you can use it programmatically too.
Additionally you can use madvise(MADV_WILLNEED) to inform the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should be ideally done in batches worth a few megabytes at a time. And the next batch should be issued when you're half-way through the current one.
Issuing them in batches minimizes syscall overhead and starting the next one half-way through ensures there's enough time left to actually complete the IO before you reach the end of the current one.
And since you'll be manually issuing prefetches in physical instead of logical order you can also disable the default readahead heuristics (which would be getting in the way) via madvise(MADV_RANDOM)
If you have enough free diskspace you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure that there always are IO requests in flight.
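For illustration, a hedged Python sketch of the batched-prefetch idea (Python 3.8+ exposes mmap.madvise and the MADV_* constants on Linux; the file name and batch size are assumptions, and for brevity the next prefetch is issued at the start of each batch rather than half-way through):

import mmap
import os

PATH = "matrix.bin"                        # hypothetical source file
BATCH = 8 * 1024 * 1024                    # ~8 MB batches, a multiple of the page size

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)

mm.madvise(mmap.MADV_RANDOM)               # disable the default readahead heuristics
mm.madvise(mmap.MADV_WILLNEED, 0, min(BATCH, size))   # prefetch the first batch

offset = 0
while offset < size:
    nxt = offset + BATCH
    if nxt < size:                         # ask the kernel for the next batch early
        mm.madvise(mmap.MADV_WILLNEED, nxt, min(BATCH, size - nxt))
    chunk = mm[offset:min(nxt, size)]      # ... process this batch here ...
    offset = nxt

mm.close()
os.close(fd)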

How to properly implement python multiprocessing for expensive image/video tasks?

I'm running on a pretty basic quad-core machine where multiprocessing.cpu_count() = 8 with something like:
from itertools import repeat
from multiprocessing import Pool

def expensive_function(list_of_values, some_param, another_param):
    do_some_python_pillow_tasks()
    do_some_ffmpeg_tasks()

if __name__ == '__main__':
    values = [
        ['a', 'b', 'c'],
        ['x', 'y', 'z'],
        # ...
        # there can be MANY items in this list, let's say 1000
    ]
    pool = Pool(processes=len(values))
    pool.starmap(
        expensive_function,
        zip(values, repeat('yada yada yada'), repeat('hello world')),
    )
    pool.close()
None of the 1,000 tasks will have problems with each other, in theory they can all be run at the same time.
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Ultimately I want all (potentially 1000) tasks to complete as fast as possible. This may be a stupid question, but can you utilize the GPU to help speed up processing?
Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?
Pool creates many CPython processes, and processes is the number of workers to create. Creating about 1000 processes is really not a good idea, since creating a process is expensive. I advise you to leave the parameter at its default (or to check whether using 4 processes is better in your case).
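A minimal sketch of that advice, reusing the question's expensive_function and values (both assumed to be defined as above) and letting Pool pick os.cpu_count() workers by default:

from itertools import repeat
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as pool:                  # defaults to os.cpu_count() worker processes
        pool.starmap(
            expensive_function,
            zip(values, repeat('yada yada yada'), repeat('hello world')),
            chunksize=32,                 # hand each worker batches of tasks to cut IPC overhead
        )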
This may be a stupid question, but can you utilize the GPU to help speed up processing?
No. You cannot use it transparently. You need to rewrite your code to use it, and this is generally pretty hard. However, ffmpeg may use it already. If so, running these tasks in parallel should certainly not be much faster (it can actually even be slower), since the GPU is a shared resource and the multiple processes will compete for its use (since GPU tasks are always massively parallel in practice).
Q : " ... am I using multiprocessing to the best of it's ability ?"
A : Well, that actually does not matter here at all. Congratulations! You happen to enjoy such a rare use-case, where the so-called embarrassingly parallel process-orchestration may spare you most of the otherwise present problems.
Incidentally, nothing new here: this very same use-case and reasoning was successfully used by Peter Jackson's VFX team for his "Lord of the Rings" frame-by-frame video-rendering & video-postprocessing & final LASER deposition of each frame onto colour-film band, in his computer power-plant setup in New Zealand. Except that his factory was full of Silicon Graphics workstations (no Python is reported to have been there), yet the workflow-orchestration principle was the same ...
Python multithreading is irrelevant here, as it keeps all threads standing in a queue, waiting one after another for their turn to acquire the one-&-only-one central Python GIL-lock, so using it is rather an anti-pattern if you wish to gain processing speed here.
Python multiprocessing is inappropriate here, even for as small a number as 4 or 8 worker processes (the less so for the ~1k promoted above), as it
first spends (in the further context negligible) [TIME]- and [SPACE]-domain costs on each spawning of a new, independent Python-interpreter process, copied full-scale, i.e. with all its internal state & all its data structures (expect RAM-/SWAP-thrashing whenever your host's physical memory gets over-saturated with that many copies of the same things, and the virtual-memory management service of the O/S starts, concurrently with running your "useful" work, to orchestrate memory SWAP-ins / SWAP-outs whenever the just-scheduled process needs to fetch data that cannot fit/stay in-RAM: such data is then not N x 100 [ns] far from the CPU, but Q x 10,000,000 [ns] far, on-HDD - yes, you read correctly, suddenly many orders of magnitude slower just to re-read the "own" data accidentally swapped away, while the CPU also becomes less available for your processing, as it has to perform all the introduced SWAP-I/O work on top. Nasty, isn't it? Yet that is not all that hurts you ...)
next (and repeated for each of the 1,000 cases ...) you will have to pay (CPU-wise + MEM-I/O-wise + O/S-IPC-wise) another awful penalty, here for moving data (parameters) from the "main" Python-interpreter process to the "spawned" Python-interpreter process: DATA-SERialisation (at CPU + MEM-I/O add-on costs) + DATA-moving (O/S-IPC-service add-on costs - yes, DATA-size matters, again) + DATA-DESerialisation (again at CPU + MEM-I/O add-on costs), all of that just to make the DATA (parameters) somehow appear "inside" the other Python interpreter, whose GIL-lock will not compete with your central and other Python interpreters (which is fine, yet at this awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it?)
What can be done instead?
a) split the list ( values ) of independent values, as posted above, into say 4 parts ( a quad-core CPU, 2 hardware-threads each ), and b) let the embarrassingly parallel (independent) problem get solved in a pure-[SERIAL] fashion by 4 Python processes, each launched fully independently, on its respective quarter of the list( values )
There will be zero add-on cost for doing so, there will be zero add-on SER/DES penalty for the 1000+ tasks' data distribution and results' recollection, and there will be a reasonably distributed CPU-core workload ( thermal throttling will appear for all of them, as the CPU-core temperatures may and will grow - so no magic but sufficient CPU-cooling can save us here anyway )
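A hedged sketch of the a) + b) plan, reusing the question's names; note that a Pool-based variant like this one still pays one parameter transfer per chunk, just no longer one per task:

from itertools import repeat
from multiprocessing import Pool

N_WORKERS = 4                                   # quad-core example from above

def process_chunk(chunk, some_param, another_param):
    # pure-[SERIAL] work inside one independent worker process
    for list_of_values in chunk:
        expensive_function(list_of_values, some_param, another_param)

if __name__ == '__main__':
    chunks = [values[i::N_WORKERS] for i in range(N_WORKERS)]   # 4 roughly equal parts
    with Pool(processes=N_WORKERS) as pool:
        pool.starmap(process_chunk,
                     zip(chunks, repeat('yada yada yada'), repeat('hello world')))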
One may also test whether PIL.Image processing could get faster when using OpenCV with numpy.ndarray() smart-vectorised processing tricks, yet that is another Level-of-Detail of boosting performance, once we avoid the gigantic overhead costs noted above.
Short of using a magic wand, there is no other magic possible in the Python interpreter here.

A Python LDLT factorization for sparse Hermitian matrices, is there one?

I have some large (N-by-N for N=10,000 to N=36,000,000) sparse, complex, Hermitian matrices, typically nonsingular, for which I have a spectral slicing problem. Specifically I need to know the exact number of positive eigenvalues.
I require a sparse LDLT decomposition -- is there one? Ideally, it will be a multifrontal algorithm and so well parallelized, and have the option of computing only D and not the upper triangular or permutation matrices.
I currently use ldl() in Matlab. This only works for real matrices so I need to create a larger real matrix. Also, it always computes L as well as D. I need a better algorithm to fit with 64GB RAM. I am hoping Python will be more customizable. (If so, I will learn Python.) I should add: I can get 64GB RAM per node, and can get 6 nodes. Even with a single machine with 64GB RAM, I would like to stop wasting RAM storing L only to delete it.
Perhaps someone wrote a Python front-end for MUMPS (MUltifrontal Massively Parallel Solver)?
I would have use for a non-parallel Python version of LDLT, as a lot of my research involves many matrices of moderate size.
I need a better algorithm to fit with 64GB RAM. I am hoping Python will be more customizable. (If so, I will learn Python.)
If this were ever possible:
>>> ( 2. * 8 ) * 10E3 ** 2 / 1E12   # results in [TB]: a [10k,10k] complex ( 2 x 8 B ) matrix
0.0016                              #   needs ~1.6 GB, so it easily fits a 64 GB RAM
>>> ( 2. * 8 ) * 63E3 ** 2 / 1E12   # the largest square complex matrix that still
0.063504                            #   fits in 64 GB RAM is about [63k,63k]
>>> ( 2. * 8 ) * 36E6 ** 2 / 1E12   # but the [36M,36M] one needs
20736.0                             #   ~ 21 PB of data ... Houston, we have a problem
                                    #   ( and even a [2M7,2M7] case has already taken
                                    #     ~ a month on an HPC cluster )
Research needs are clear, yet there is no language ( be it Matlab, Python, assembler, Julia or LISP ) that can ever store 21 petabytes of data inside just 64 gigabytes of physical RAM, so as to make the eigenvalue computations of a complex matrix of such scale possible, let alone fast. By this I also mean that "off-loading" data from in-RAM computation into any form of out-of-RAM storage is so prohibitively expensive ( about 1E2 ~ 1E5 times slower ) that any such computational process would take ages just to "read through" the 21 PB of element data once.
If your research has funding or sponsoring for a rather specific computing-devices infrastructure, there might be some tricks to process such heaps of numbers, yet do not expect to receive 21 PB of worms ( data ) put into a 64 GB large ( well, rather small ) can "for free".
You may enjoy Python for many other reasons and/or motivations, but not for cheaper-yet-faster HPC-grade parallel computing, nor for easily processing 21 PB of data inside a 64 GB device, nor for annihilating the principal and immense [TIME]-domain add-on costs of sparse-matrix manipulations, which only become visible during their use in computations. Having made some xTB sparse-matrix processing feasible so as to yield results in less than 1E2 [min] instead of 2E3, I dare say I know how immensely hard it is to increase the [PSPACE]-data scaling and to shorten the often [EXPTIME] processing durations at the same time ... the true hell-corner of computational complexity ... where the sparse-matrix representation often creates even more headaches ( again, both in [SPACE] and, worse, in [TIME], as new types of penalties appear ) than it helps by delivering at least some potential [PSPACE]-savings.
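As a hedged back-of-envelope illustration of the [PSPACE] side (the nnz-per-row figure below is made up, and fill-in during any factorisation would inflate it further):

n = 36_000_000
nnz_per_row = 100                       # hypothetical sparsity
nnz = n * nnz_per_row

bytes_values  = 16 * nnz                # complex128 values
bytes_indices = 8 * nnz                 # int64 column indices (CSR)
bytes_indptr  = 8 * (n + 1)             # CSR row pointers

print((bytes_values + bytes_indices + bytes_indptr) / 1e9, "GB")   # ~86.7 GB > 64 GB RAM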
Given the scope of the parameters, I may safely bet that even the algorithmic part will not help, and that even the promises of Quantum-Computing devices will, within our lifetime expectancy, remain unable to augment such a vast parameter-space into the QC annealer-based, unstructured, quantum-driven minimiser ( processing ) for any reasonably short sequence of parameter-block translations into a ( physically size-limited ) qubit-field problem augmentation, as is currently in use by the QC community, thanks to LLNL et al. research innovations.
Sorry, no such magic seems to be anywhere near.
Using the available Python front-ends for MUMPS does not change the HPC side of the game, but if you wish to use one, yes, there are several available.
The efficient HPC-grade number-crunching at scale is still the root-cause of the problems with the product of [ processing-time ] x [ (whatever) data-representation's efficient storage and retrieval ].
I hope you will get and enjoy the right mix of comfort ( which pythonic users are keen to stay in ) and the HPC-grade performance ( of a backend of whatever type ) you wish to have.

How do I know if my Embarrassingly Parallel Task is Suitable for GPU?

Are we saying that a task that requires fairly light computation per row on a huge number of rows is fundamentally unsuited to a GPU?
I have some data processing to do on a table where the rows are independent, so it is embarrassingly parallel. I have a GPU, so.... match made in heaven? It is something quite similar to this example, which calculates a moving average for each entry per row (rows are independent):
import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], intp[:], float64[:])'], '(n),()->(n)')
def move_mean(a, window_arr, out):
    window_width = window_arr[0]
    asum = 0.0
    count = 0
    for i in range(window_width):
        asum += a[i]
        count += 1
        out[i] = asum / count
    for i in range(window_width, len(a)):
        asum += a[i] - a[i - window_width]
        out[i] = asum / count

arr = np.arange(2000000, dtype=np.float64).reshape(200000, 10)
print(arr)
print(move_mean(arr, 3))
Like this example, my processing for each row is not heavily mathematical. Rather it is looping across the row and doing some sums, assignments and other bits and pieces with some conditional logic thrown in.
I have tried using guvectorize in the Numba library to assign this to an Nvidia GPU. It works fine, but I'm not getting a speedup.
Is this type of task suited to a GPU in principle? I.e. if I go deeper into Numba and start tweaking the threads, blocks and memory management, or the algorithm implementation, should I, in theory, get a speed-up? Or is this kind of problem fundamentally unsuited to the architecture?
The answers below seem to suggest it is unsuited but I am not quite convinced yet.
numba - guvectorize barely faster than jit
And numba guvectorize target='parallel' slower than target='cpu'
Your task is obviously memory-bound, but that doesn't mean you cannot profit from a GPU; however, it is probably less straightforward than for a CPU-bound task.
Let's look at a common configuration and do some math:
CPU-RAM memory bandwidth of ca. 24GB/s
CPU-GPU transfer bandwidth of ca. 8GB/s
GPU-RAM memory bandwidth of ca. 180GB/s
Let's assume we need to transfer 24 GB of data to complete the task, so we will have the following optimal times (whether and how to achieve these times is another question!):
1. scenario: CPU only, time = 24GB / 24GB/s = 1 second.
2. scenario: data must be transferred from CPU to GPU (24GB / 8GB/s = 3 seconds) and processed there (24GB / 180GB/s = 0.13 seconds), which leads to 3.1 seconds.
3. scenario: data is already on the device, so only 24GB / 180GB/s = 0.13 seconds are needed.
As one can see, there is potential for a speed-up, but only in the 3rd scenario - when your data is already on the GPU device.
However, achieving the maximal bandwidth is a quite challenging enterprise.
For example, when processing the matrix row-wise, on CPU, you would like your data to be in the row-major-order (C-order) in order to get the most out of the L1-cache: while reading a double you actually get 8 doubles loaded into the cache and you don't want them to be evicted from the cache, before you could process the remaining 7.
On GPU, on the other hand, you want the memory accesses to be coalesced, e.g. thread 0 should access address 0, thread 1 - address 1 and so on. For this to work, the data must be in column-major-order (Fortran-order).
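A tiny NumPy illustration of the two layouts, for reference:

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)        # C-order (row-major) by default
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])  # True False

a_f = np.asfortranarray(a)          # column-major copy, the layout coalesced GPU access wants
print(a_f.flags['F_CONTIGUOUS'])    # True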
There is another thing to be considered: the way you test the performance. Your test array is only about 16 MB large (2,000,000 float64 values) and thus small enough for the L3 cache. The bandwidth of the L3 cache depends on the number of cores used for the calculation, but will be at least around 100GB/s - not much slower than the GPU, and probably much faster when parallelized on the CPU.
You need a bigger dataset to not get fooled by cache behavior.
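For example, a hedged timing sketch with an array far larger than a typical L3 cache (sizes are assumptions; move_mean is the question's guvectorize kernel defined above):

import time
import numpy as np

arr_big = np.random.rand(20_000_000, 10)     # ~1.6 GB, well beyond L3-cache sizes

t0 = time.perf_counter()
out = move_mean(arr_big, 3)                  # now the measurement reflects RAM/PCIe bandwidth
print(f"{time.perf_counter() - t0:.3f} s")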
A somewhat off-topic remark: your algorithm is not very robust from the numerical point of view.
Say the window width is 3, as in your example, but there are about 10**4 elements in a row. Then the value for the last element is the result of about 10**4 additions and subtractions, each of which adds a rounding error to the value - compared to only three additions if done "naively", that is quite a difference.
Of course, it might not be of significance (for 10 elements in a row, as in your example), but it might also bite you one day...

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive task for a computational linguistics project, but the problem I have is quite general, and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
1. Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line).
2. Process the data on each line.
3. From this processed data, construct a large number of vectors (the dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
4. These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and, by connection, 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have their basis weights updated.
If you think of them in terms of Python lists: each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index) - see the sketch after this list.
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
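For concreteness, a hedged sketch of that update step, with a made-up (vector_id, index, increment) shape for the per-line result:

from collections import Counter, defaultdict

# one sparse "vector" per entity: index -> accumulated basis weight
vectors = defaultdict(Counter)

def apply_line(updates):
    # updates: iterable of (vector_id, index, increment) triples (hypothetical shape)
    for vec_id, idx, inc in updates:
        vectors[vec_id][idx] += inc

apply_line([("dog", 12, 1.0), ("dog", 40003, 0.5), ("cat", 12, 1.0)])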
Attempted Solutions
There are three extrema when it comes to how to do this:
1. I could build all the vectors in memory, then write them to disk.
2. I could build all the vectors directly on disk, using shelve or pickle or some such library.
3. I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
1. Building as much as possible in memory.
2. Once memory is full, dump everything to disk.
3. If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
4. Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O operations per second and virtually no seek times. The size of your project is perfect for today's affordable SSD 'drives'.
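For example, a hedged PyTables sketch (file name, compression settings and dimensionality are assumptions) that appends finished vectors to an on-disk, chunked, compressed extendable array:

import numpy as np
import tables

DIM = 4_000_000                                   # dimensionality from the question

h5 = tables.open_file("vectors.h5", mode="w")
earr = h5.create_earray(h5.root, "vectors",
                        atom=tables.Float32Atom(),
                        shape=(0, DIM),           # extendable along the first axis
                        filters=tables.Filters(complevel=5, complib="blosc"))

vec = np.zeros((1, DIM), dtype=np.float32)        # one finished vector
earr.append(vec)                                  # spills to disk, chunked + compressed
h5.close()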
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then do another pass over all the data with the second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.
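A hedged sketch of the multi-pass idea; process_line and the (vector_id, index, increment) triples are hypothetical stand-ins for the question's per-line processing:

import numpy as np

DIM = 4_000_000

def run_pass(corpus_path, wanted_ids, out_path):
    # build only the vectors whose ids are in wanted_ids, streaming the corpus once
    wanted = set(wanted_ids)
    vectors = {}
    with open(corpus_path) as corpus:
        for line in corpus:
            for vec_id, idx, inc in process_line(line):   # hypothetical step-2 helper
                if vec_id in wanted:
                    vec = vectors.get(vec_id)
                    if vec is None:
                        vec = vectors[vec_id] = np.zeros(DIM, dtype=np.float32)
                    vec[idx] += inc
    np.savez(out_path, **{str(k): v for k, v in vectors.items()})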
Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone, and the tricks to tweak this process, should already be in place. There is a Python client as well.
Moreover this solution could scale vertically without much effort.
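For illustration, a hedged redis-py sketch (assuming a local Redis server; key names and the helper are made up) that keeps one Redis hash per vector and lets Redis do the accumulation server-side:

import redis

r = redis.Redis(host="localhost", port=6379)

def increment(vec_id, idx, inc):
    # HINCRBYFLOAT accumulates the weight for one (vector, index) pair on the server
    r.hincrbyfloat(f"vec:{vec_id}", str(idx), inc)

increment("dog", 12345, 0.5)
print(r.hgetall("vec:dog"))          # {b'12345': b'0.5'}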
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than the available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB. ( Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly ( or even better use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (numPy etc..)
Hard to say exactly, because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of RAM disk with a file system, or a database. No idea which one is best for you.
Have a smallish shell script monitor RAM usage, and spawn every second another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
import psutil

while someCondition:
    # note: the older psutil.phymem_usage() is exposed in modern psutil
    # as psutil.virtual_memory(); both report the percentage of RAM in use
    if psutil.virtual_memory().percent > 80.0:
        dumpQueue(myQueue, somefile)
    else:
        addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.
