Provided that we know the whole file will fit in memory and we can afford it, what are the drawbacks (if any) or limitations (if any) of loading an entire file (possibly a binary file) into a Python variable? If this is technically possible, should it be avoided, and why?
Regarding file size concerns, to what maximum size should this solution be limited? And why?
The actual loading code could be the one proposed in this Stack Overflow entry.
Sample code is:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

content = file_get_contents('/bin/kill')
... code manipulating 'content' ...
[EDIT]
Code manipulation that comes to mind (but may not be applicable) is standard list/string operators (square brackets, '+' signs) or some string methods ('len', the 'in' operator, 'count', 'endswith'/'startswith', 'split', 'translate', ...).
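For instance, with the sample above and Python 2 semantics (where read() returns a byte string), that kind of manipulation might look like this; the path is just the example from the sample code:

# assumes file_get_contents() as defined in the sample above
content = file_get_contents('/bin/kill')      # example path from the question
print(len(content))                           # 'len'
print('ELF' in content)                       # 'in' operator
print(content.count('\x00'))                  # 'count'
print(content.startswith('\x7fELF'))          # 'startswith' (ELF magic number)
first_kb = content[:1024]                     # slicing with square brackets
both = first_kb + content[-1024:]             # concatenation with '+'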
Yes, you can
The only drawbacks are memory usage, and possibly also speed if the file is big.
File size should be limited to how much space you have in memory.
In general, there are better ways to do it, but for one-off scripts where you know memory is not an issue, sure.
While you've gotten good responses, it seems nobody has answered this part of your question (as often happens when you ask many questions in one question ;-)):
Regarding file size concerns, to what maximum size should this solution be limited? And why?
The most important thing is, how much physical RAM can this specific Python process actually use (what's known as a "working set"), without unduly penalizing other aspects of the overall system's performance? If you exceed physical RAM for your "working set", you'll be paging and swapping to disk, and your performance can rapidly degrade (up to a state known as "thrashing", where basically all available cycles are going to the task of getting pages in and out, and negligible amounts of actual work get done).
Out of that total, a reasonably modest amount (say a few MB at most, in general) is probably going to be taken up by executable code (Python's own executable files, DLLs or .so's), bytecode, and general support data structures that are actively needed in memory; on a typical modern machine that's not doing other important or urgent tasks, you can almost ignore this overhead compared to the gigabytes of RAM that you have available overall (though the situation might be different on embedded systems, etc.).
All the rest is available for your data -- which includes this file you're reading into memory, as well as any other significant data structures. "Modifications" of the file's data can typically take (transiently) twice as much memory as the file's contents' size (if you're holding it in a string) -- more, of course, if you're keeping a copy of the old data as well as making new modified copies/versions.
So for "read-only" use on a typical modern 32-bit machine with, say, 2 GB of RAM overall, reading into memory (say) 1.5 GB should be no problem; but it will have to be substantially less than 1 GB if you're doing "modifications" (and even less if you have other significant data structures in memory!). Of course, on a dedicated server with a 64-bit build of Python, a 64-bit OS, and 16 GB of RAM, the practical limits are very different, roughly in proportion to the vastly different amount of available RAM in fact.
For example, the King James' Bible text as downloadable here (unzipped) is about 4.4 MB; so, in a machine with 2 GB of RAM, you could keep about 400 slightly modified copies of it in memory (if nothing else is requesting memory), but, in a machine with 16 (available and addressable) GB of RAM, you could keep well over 3000 such copies.
with open(filename) as f:
This only works on Python 2.x on Unix. It won't do what you expect on Python 3.x or on Windows, as these both draw a strong distinction between text and binary files. It's better to specify that the file is binary, like this:
with open(filename, 'rb') as f:
This will turn off the OS's CR/LF conversion on Windows, and will make Python 3.x return a bytes object rather than Unicode characters.
As for the rest of your question, I agree with Lennart Regebro's (unedited) answer.
The sole issue you can run into is memory consumption: Strings in Python are immutable. So when you need to change a byte, you need to copy the old string:
new = old[0:pos] + newByte + old[pos+1:]
This needs up to three times the memory of old.
Instead of a string, you can use a bytearray (or the array module). These offer much better performance if you need to modify the contents, and you can create them easily from a string.
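A minimal sketch with a bytearray, which modifies bytes in place without copying the whole buffer (the file names are illustrative):

with open('/bin/kill', 'rb') as f:
    data = bytearray(f.read())        # mutable copy of the file's bytes

data[0] = 0x7f                        # change a single byte in place, no full copy
data[10:14] = b'\x00\x00\x00\x00'     # replace a slice

with open('/tmp/kill.patched', 'wb') as out:
    out.write(data)                   # use bytes(data) if an immutable copy is needed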
You can also read the file as a list of lines and join them:
>>> ''.join(open('htdocs/config.php', 'r').readlines())
"This is the first line of the file.\nSecond line of the file"
Read more here http://docs.python.org/py3k/tutorial/inputoutput.html
Yes, you can (provided the file is small enough).
It is even quite Pythonic to further convert the result of read() into any container/iterable type, for instance with str.split(), and then use the associated functional-programming features to keep treating the file "at once".
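For example, assuming a text file with whitespace-separated tokens (the file name is illustrative):

with open('corpus.txt') as f:              # 'corpus.txt' is an illustrative name
    words = f.read().split()               # whole file -> list of tokens in one go

long_words = [w for w in words if len(w) > 10]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1       # simple frequency count over the in-memory list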
Related
I am trying to run a Python (2.7) script with PyPy but I have encountered the following error:
TypeError: sys.getsizeof() is not implemented on PyPy.
A memory profiler using this function is most likely to give results inconsistent with reality on PyPy. It would be possible to have sys.getsizeof() return a number (with enough work), but that may or may not represent how much memory the object uses. It doesn't even really make sense to ask how much *one* object uses, in isolation from the rest of the system. For example, instances have maps, which are often shared across many instances; in this case the maps would probably be ignored by an implementation of sys.getsizeof(), but their overhead is important in some cases if there are many instances with unique maps. Conversely, equal strings may share their internal string data even if they are different objects, or empty containers may share parts of their internals as long as they are empty. Even stranger, some lists create objects as you read them; if you try to estimate the size in memory of range(10**6) as the sum of all items' sizes, that operation will by itself create one million integer objects that never existed in the first place.
Now, I really need to check the size of one nested dict during the execution of the program. Is there any alternative to sys.getsizeof() I can use in PyPy? If not, how would I check the size of a nested object in PyPy?
Alternatively, you can gauge the memory usage of your process using:
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
As your program executes, getrusage gives the peak resident set size of the process (ru_maxrss), reported in kilobytes on Linux and in bytes on macOS. Using this information you can estimate the size of your data structures, and if you begin to use, say, 50% of your machine's total memory, then you can do something to handle it.
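A minimal sketch of that idea (the total-memory figure, the 50% threshold and the flush handler are assumptions to adapt):

import resource

TOTAL_RAM_KB = 16 * 1024 * 1024            # assumed machine with 16 GB of RAM

def memory_pressure_high(threshold=0.5):
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    used_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return used_kb > threshold * TOTAL_RAM_KB

# inside your processing loop:
# if memory_pressure_high():
#     flush_partial_results_to_disk()      # hypothetical handler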
I'm trying to optimize a binary reader for the Stata file type, and the current implementation is lazily evaluated for each record in the file. The reader loses speed very quickly as the size of the file increases.
When I asked the person who initially wrote it why he used a generator, he said it was to be careful with memory. The advice I've been given is to read and process larger chunks of the file at a time, and I would like to know how to tell what the largest chunk is that I can read without going into virtual memory.
A few side notes
Why is reading and processing large chunks faster than doing so with small chunks? Does the overhead of being called many times add up that quickly?
I'm interested in seeing if I can get even greater speed gains by trying my hand at Cython. Does anyone know of any modules with binary file readers I could take a look at (other than the scipy.stats matlab file reader)?
I would like to know how to tell what the largest chunk is that I can read without going into virtual memory.
I'm not sure what you mean by "without going into virtual memory", but this is highly dependent on details such as the file format, the storage medium and the filesystem/OS. It's best determined empirically. If you can, implement a parameter chunk_size (or n_records, or whatever) that determines how many records to read at a time.
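A rough sketch of such a chunk_size parameter (the fixed record size and the file name are assumptions; a real Stata reader would take the record length from the file header):

RECORD_SIZE = 128                              # assumed fixed record length in bytes

def read_records(path, chunk_size=10000):
    """Yield lists of raw records, chunk_size records at a time."""
    with open(path, 'rb') as f:
        while True:
            buf = f.read(RECORD_SIZE * chunk_size)
            if not buf:
                break
            yield [buf[i:i + RECORD_SIZE] for i in range(0, len(buf), RECORD_SIZE)]

# for chunk in read_records('data.dta', chunk_size=50000):
#     process(chunk)                           # 'process' is a placeholder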
Why is reading and processing large chunks faster than doing so with small chunks?
Depends on the code that's doing the reading. It might be due to system call overhead, or because Python code has to be executed in between reads.
Does anyone know of any modules with binary file readers I could take a look at?
I co-wrote a loader for the LibSVM/SVMlight file format, a simple text format for sparse matrices, in Cython. It's distributed as part of scikit-learn.
Background
I am working on a fairly computationally intensive computational linguistics project, but the problem I have is quite general, and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and, by extension, 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have their basis weights updated.
If you think of them in terms of Python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of those lists at one or more indices by a value (which may differ based on the index); see the sketch after this list.
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
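A minimal sketch of that update step, purely for illustration (parse_line, which turns a line into (vector_id, index, increment) triples, is a hypothetical placeholder):

from collections import defaultdict

vectors = defaultdict(dict)                 # vector_id -> {index: weight}, created lazily

def update_from_line(line):
    # parse_line is a hypothetical placeholder yielding (vector_id, index, increment)
    for vector_id, index, increment in parse_line(line):
        vec = vectors[vector_id]
        vec[index] = vec.get(index, 0) + increment

# for line in corpus:                       # 'corpus' is an iterable of lines
#     update_from_line(line)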
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on disk, using shelve or pickle or some such library.
I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from, or written to, the disk?
Many thanks for your attention. I look forward to seeing what the bright minds of Stack Overflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70 GB of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does Python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD 'drives'.
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then do another pass over all the data with the second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370 GB. If the multiple-passes method is viable and you've got a machine with 16 GB of RAM, then that is about ~25 passes; it should be easy to parallelize if you've got a cluster.
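A rough sketch of the multiple-passes idea (parse_line, the batch size and the int32 dtype are assumptions to adapt):

import numpy as np

VECTORS_PER_PASS = 1000000            # assumed batch size; tune to available RAM
DIM = 4000000                         # dimensionality from the question

def build_batch(corpus_path, first_id, last_id):
    """One pass over the corpus, building only vectors with ids in [first_id, last_id)."""
    batch = {}                        # vector_id -> np.ndarray, created lazily
    with open(corpus_path) as f:
        for line in f:
            # parse_line is a hypothetical parser yielding (vector_id, index, increment)
            for vector_id, index, increment in parse_line(line):
                if first_id <= vector_id < last_id:
                    if vector_id not in batch:
                        batch[vector_id] = np.zeros(DIM, dtype=np.int32)
                    batch[vector_id][index] += increment
    return batch                      # write this batch to disk, then start the next pass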
Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone, and the tricks to tweak this process, should already be solved there. There is a Python client as well.
Moreover, this solution could scale vertically without much effort.
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than the available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest doing it this way:
1) Construct the easy pipeline you mentioned.
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates).
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach, or a distributed task queue like Celery).
Plus all the tips mentioned before (NumPy etc.).
Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and seeing what happens. With plenty of cache, performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation that can be reused is calculated, you store it in a dictionary (or perhaps on disk, in memcached, etc.), and then look there first before calculating it again. The Django docs have a good introduction.
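A minimal in-process sketch of that look-up-before-recompute pattern (the key scheme and expensive_calculation are placeholders):

_cache = {}                             # in-process cache; swap for memcached/disk as needed

def cached_expensive(key, compute):
    """Look up 'key' first; only call 'compute' on a miss."""
    if key not in _cache:
        _cache[key] = compute()         # the expensive calculation happens only once per key
    return _cache[key]

# result = cached_expensive(('cooccurrence', word_a, word_b),
#                           lambda: expensive_calculation(word_a, word_b))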
From another comment I infer that your corpus fits into memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of RAM disk with a file system, or a database. No idea which one is best for you.
Have a smallish shell script monitor RAM usage, and every second spawn another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all the vectors, if needed (this would be the reduce part)
Split the corpus evenly in size between parallel jobs (one per core); process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try to capture those lines for processing.
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
while someCondition:
    # note: in newer psutil versions this is psutil.virtual_memory().percent
    if psutil.phymem_usage().percent > 80.0:
        dumpQueue(myQueue, somefile)
    else:
        addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS: Props to Sean for suggesting this module.
I wrote a program that calls a function with the following prototype:
def Process(n):
    # The function uses data that is stored as binary files on the hard drive and
    # -- based on the value of 'n' -- scans it using functions from numpy & cython.
    # The function creates new binary files and saves the results of the scan in them.
    #
    # I optimized the running time of the function as much as I could using numpy &
    # cython, and at present it takes about 4hrs to complete one function run on
    # a typical winXP desktop (three-year-old machine, 2GB memory, etc.).
My goal is to run this function exactly 10,000 times (for 10,000 different values of 'n') in the fastest and most economical way. Following these runs, I will have 10,000 different binary files with the results of all the individual scans. Note that every function 'run' is independent (meaning there is no dependency whatsoever between the individual runs).
So the question is this: having only one PC at home, it is obvious that it will take me around 4.5 years (10,000 runs x 4hrs per run = 40,000 hrs ~= 4.5 years) to complete all the runs at home. Yet, I would like to have all the runs completed within a week or two.
I know the solution would involve accessing many computing resources at once. What is the best (fastest / most affordable, as my budget is limited) way to do so? Must I buy a strong server (how much would it cost?), or can I have this run online? In that case, would my proprietary code get exposed by doing so?
In case it helps, every instance of 'Process()' only needs about 500MB of memory. Thanks.
Check out PiCloud: http://www.picloud.com/
import cloud
cloud.call(function)
Maybe it's an easy solution.
Does Process access the data in the binary files directly, or do you cache it in memory? Reducing the use of I/O operations should help.
Also, isn't it possible to break Process into separate functions running in parallel? How is the data dependency inside the function?
Finally, you could give some cloud computing service like Amazon EC2 a try (don't forget to read this for tools), but it won't be cheap (EC2 starts at $0.085 per hour); an alternative would be going to a university with a computer cluster (they are pretty common nowadays, but it will be easier if you know someone there).
Well, from your description, it sounds like things are IO bound... In which case parallelism (at least on one IO device) isn't going to help much.
Edit: I just realized that you were referring more to full cloud computing, rather than running multiple processes on one machine... My advice below still holds, though.... PyTables is quite nice for out-of-core calculations!
You mentioned that you're using numpy's mmap to access the data. Therefore, your execution time is likely to depend heavily on how your data is structured on the disc.
Memmapping can actually be quite slow in any situation where the physical hardware has to spend most of its time seeking (e.g. reading a slice along a plane of constant Z in a C-ordered 3D array). One way of mitigating this is to change the way your data is ordered to reduce the number of seeks required to access the parts you are most likely to need.
Another option that may help is compressing the data. If your process is extremely IO bound, you can actually get significant speedups by compressing the data on disk (and sometimes even in memory) and decompressing it on-the-fly before doing your calculation.
The good news is that there's a very flexible, numpy-oriented library that's already been put together to help you with both of these. Have a look at pytables.
I would be very surprised if tables.Expr doesn't significantly (~1 order of magnitude) outperform your out-of-core calculation using a memmapped array. See here for a nice (though canned) example.
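The canned example itself is not reproduced here; instead, a minimal sketch of the same idea (the array name, sizes and the blosc compressor are assumptions):

import numpy as np
import tables

# create a compressed, on-disk array and evaluate an expression on it out of core
with tables.open_file('example.h5', mode='w') as h5:
    filters = tables.Filters(complevel=5, complib='blosc')     # on-the-fly compression
    a = h5.create_carray(h5.root, 'a', tables.Float64Atom(),
                         shape=(10**7,), filters=filters)
    a[:] = np.random.rand(10**7)                               # fill with sample data

    expr = tables.Expr('2 * a + 1')     # evaluated chunk by chunk, never fully in RAM
    result = expr.eval()                # the final result is returned as a NumPy array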
This is a follow-up questions on a previous one.
Consider this code, which is less of a toy than the one in the previous question (but still much simpler than my real one):
import sys
data = []
for line in open(sys.argv[1]):
    data.append(line[-1])
print data[-1]
Now, I was expecting a longer run time (my benchmark file is 65150224 lines long), possibly much longer. This was not the case; it runs in ~2 minutes on the same hardware as before!
Is data.append() very lightweight? I don't believe so, so I wrote this fake code to test it:
data = []
counter = 0
string = "a\n"
for counter in xrange(65150224):
    data.append(string[-1])
print data[-1]
This runs in 1.5 to 3 minutes (there is strong variability among runs)
Why don't I get 3.5 to 5 minutes in the former program? Obviously data.append() is happening in parallel with the IO.
This is good news!
But how does it work? Is it a documented feature? Is there any requirement on my code that I should follow to make it work as much as possible (besides load-balancing IO and memory/CPU activities)? Or is it just plain buffering/caching in action?
Again, I tagged this question "linux", because I'm interested only in Linux-specific answers. Feel free to give OS-agnostic, or even other-OS, answers if you think it's worth doing.
Obviously data.append() is happening in parallel with the IO.
I'm afraid not. It is possible to parallelize IO and computation in Python, but it doesn't happen magically.
One thing you could do is use posix_fadvise(2) to give the OS a hint that you plan to read the file sequentially (POSIX_FADV_SEQUENTIAL).
In some rough tests doing "wc -l" on a 600 MB file (an ISO), performance increased by about 20%. Each test was done immediately after clearing the disk cache.
For a Python interface to fadvise see python-fadvise.
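On Python 3.3+ the same hint is also available in the standard library as os.posix_fadvise; a minimal sketch (the file name and chunk size are illustrative):

import os

# advise the kernel that we will read this file sequentially (POSIX systems only)
fd = os.open('bigfile.bin', os.O_RDONLY)               # 'bigfile.bin' is an illustrative name
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)   # offset=0, len=0 means "the whole file"
with os.fdopen(fd, 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):   # read in 1 MiB chunks
        pass                                           # ... process chunk ...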
How big are the lines in your file? If they're not very long (anything under about 1K probably qualifies) then you're likely seeing performance gains because of input buffering.
Why do you think list.append() would be a slower operation? It is extremely fast, considering that the internal pointer arrays used by lists to hold references to the objects in them are allocated in increasingly large blocks, so that not every append actually reallocates the array, and most appends can simply increment the length counter, set a pointer and incref.
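A quick way to watch that over-allocation in action on CPython (the exact growth pattern is an implementation detail and varies between versions):

import sys

# CPython only: the list's allocated size jumps in blocks rather than on every append
lst = []
prev = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev:
        print(len(lst), size)       # capacity grows in increasingly large steps
        prev = size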
I don't see any evidence that "data.append() is happening in parallel with the IO." Like Benji, I don't think this is automatic in the way you think. You showed that doing data.append(line[-1]) takes about the same amount of time as lc = lc + 1 (essentially no time at all, compared to the IO and line splitting). It's not really surprising that data.append(line[-1]) is very fast. One would expect the whole line to be in a fast cache, and as noted append prepares buffers ahead of time and only rarely has to reallocate. Moreover, line[-1] will always be '\n', except possibly for the last line of the file (no idea if Python optimizes for this).
The only part I'm a little surprised about is that the xrange is so variable. I would expect it to always be faster, since there's no IO, and you're not actually using the counter.
If your run times are varying by that amount for the second example, I'd suspect your method of timing or outside influences (other processes / system load) to be skewing the times to the point where they don't give any reliable information.