While reading an article on memory management in Python, I came across a few doubts:
import copy
import memory_profiler

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()
$ python -m memory_profiler memory_profiler_demo.py
Filename: memory_profiler_demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   30.074 MiB   30.074 MiB   @profile
     5                             def function():
     6   61.441 MiB   31.367 MiB       x = list(range(1000000))  # allocate a big list
     7  111.664 MiB   50.223 MiB       y = copy.deepcopy(x)  # doubt 1
     8  103.707 MiB   -7.957 MiB       del x  # doubt 2
     9  103.707 MiB    0.000 MiB       return y
So I have two doubts: on line 7, why does copying the list take more memory than building it in the first place, and on line 8, why does deleting x free only about 7 MiB?
First, let's start with why line 8 only frees 7MiB.
Once you allocate a bunch of memory, Python and your OS and/or malloc library both guess that you're likely to allocate a bunch of memory again. On modern platforms, it's a lot faster to reuse that memory in-process than to release it and reallocate it from scratch, while it costs very little to keep extra unused pages of memory in your process's space, so it's usually the right tradeoff. (But of course usually != always, and the blog you linked seems to be in large part about how to work out that you're building an application where it's not the right tradeoff and what to do about it.)
A default build of CPython on Linux virtually never releases any memory. On other POSIX (including Mac) it almost never releases any memory. On Windows, it does release memory more often—but there are still constraints. Basically, if a single allocation from Windows has any piece in use (or even in the middle of a freelist chain), that allocation can't be returned to Windows. So, if you're fragmenting memory (which you usually are), that memory can't be freed. The blog post you linked to explains this to some extent, and there are much better resources than an SO answer to explain further.
If you really do need to allocate a lot of memory for a short time, release it, and never use it again, without holding onto all those pages, there's a common Unix idiom for that—you fork, then do the short-term allocation in the child and exit after passing back the small results in some way. (In Python, that usually means using multiprocessing.Process instead of os.fork directly.)
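For example, here is a minimal sketch of that idiom with multiprocessing (the function name and the size of the list are just placeholders; the point is that only the small result crosses back to the parent):

from multiprocessing import Process, Queue

def crunch(q):
    # All of the short-lived, memory-hungry work happens in the child process.
    big = list(range(1_000_000))
    q.put(sum(big))                 # pass back only the small result

if __name__ == '__main__':
    q = Queue()
    p = Process(target=crunch, args=(q,))
    p.start()
    result = q.get()                # read the result before joining
    p.join()                        # child exits; all of its pages go back to the OS
    print(result)

When the child exits, the operating system reclaims its entire address space, so the parent's footprint is unaffected by whatever the child allocated.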
Now, why does your deepcopy take more memory than the initial construction?
I tested your code on my Mac laptop with python.org builds of 2.7, 3.5, and 3.6. What I found was that the list construction takes around 38MiB (similar to what you're seeing), while the copy takes 42MiB on 2.7, 31MiB on 3.5, and 7MB on 3.6.
Slightly oversimplified, here's the 2.7 behavior: The functions in copy just call the type's constructor on an iterable of the elements (for copy) or recursive copies of them (for deepcopy). For list, this means creating a list with a small starting capacity and then expanding it as it appends. That means you're not just creating a 1M-length array, you're also creating and throwing away arrays of 500K, 250K, etc. all the way down. The sum of all those lengths is equivalent to a 2M-length array. Of course you don't really need the sum of all of them—only the most recent array and the new one are ever live at the same time—but there's no guarantee the old arrays will be freed in a useful way that lets them get reused. (That might explain why I'm seeing about 1.5x the original construction while you're seeing about 2x, but I'd need a lot more investigation to bet anything on that part…)
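You can watch those intermediate arrays come and go by checking sys.getsizeof after each append; a rough illustration (exact sizes depend on the Python version and platform):

import sys

lst = []
last_size = 0
for i in range(1_000_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:        # a resize happened: a new, larger array was allocated
        print(f'len={len(lst):>9,}  array footprint={size:>12,} bytes')
        last_size = size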
In 3.5, I believe the biggest difference is that a number of improvements over the 5 years since 2.7 mean that most of those expansions now get done by realloc, even if there is free memory in the pool that could be used instead. That changes a tradeoff that favored 32-bit over 64-bit on modern platforms into one that works the other way round: on 64-bit Linux, Mac, and Windows, there are often going to be free pages that can be tossed onto the end of an existing large alloc without remapping its address, so most of those reallocs mean no waste.
In 3.6, the huge change is probably #26167. Oversimplifying again, the list type knows how to copy itself by allocating all in one go, and the copy methods now take advantage of that for list and a few other builtin types. Sometimes there's no reallocation at all, and even when there is, it's usually with the special-purpose LIST_APPEND code (which can be used when you can assume nobody outside the current function has access to the list yet) instead of the general-purpose list.append code (which can't).
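A rough way to see the difference between copying in one allocation and growing by appending (this only illustrates the allocation pattern, not the exact code path the copy module uses):

import sys

x = list(range(1_000_000))

y = x.copy()              # the new list's internal array is sized exactly, in one go
z = []
for item in x:            # append-built copy: grown geometrically, with over-allocation
    z.append(item)

print(sys.getsizeof(y))   # exact fit
print(sys.getsizeof(z))   # typically larger, on top of the discarded intermediate arrays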
Related
Without going into algorithmic details, let's just say that my code sequentially processes a list of inputs:
inputs = [2, 5, 6, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

for i in inputs:
    process_input(i)
For simplicity, let's consider process_input to be a state-less black box.
I know that this site is full of questions about finding memory leaks in Python code, but this is not what this question is about. Instead, I'm trying to understand the memory consumption of my code over time and whether it might suffer from leaking memory.
In particular, I'm trying to understand a discrepancy of two distinct indicators of memory usage:
The number of allocated objects (reported by gc.get_objects) and
the actually used amount of physical memory (read from VmRSS on a Linux system).
To study these two indicators, I expanded the original code from above as follows:
import time, gc

def get_current_memory_usage():
    with open('/proc/self/status') as f:
        memusage = f.read().split('VmRSS:')[1].split('\n')[0][:-3]
    return int(memusage.strip()) / (1024 ** 2)

inputs = [2, 5, 6, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

gc.collect()
last_object_count = len(gc.get_objects())

for i in inputs:
    print(f'\nProcessing input {i}...')
    process_input(i)

    gc.collect()
    time.sleep(1)

    memory_usage = get_current_memory_usage()
    object_count = len(gc.get_objects())
    print(f'Memory usage: {memory_usage:.2f} GiB')
    print(f'Object count: {object_count - last_object_count:+}')
    last_object_count = object_count
Note that process_input is state-less, i.e. the order of the inputs does not matter. Thus, we would expect both indicators to be about the same before running process_input and afterwards, right? Indeed, this is what I observe for the number of allocated objects. However, the consumption of memory grows steadily.
Now my core question: Do these observations indicate a memory leak? To my understanding, memory leaking in Python would be indicated by a growth of allocated objects, which we do not observe here. On the other hand, why does the memory consumption grow steadily?
For further investigation, I also ran a second test. For this test, I repeatedly invoked process_input(i) with a fixed input i (five times each) and recorded the memory consumption between the iterations:
For i=12, the memory consumption remained constant at 10.91 GiB.
For i=14, the memory consumption remained constant at 7.00 GiB.
I think these observations make the presence of a memory leak even more unlikely, right? But then, what could be a possible explanation for why the memory consumption does not fall between the iterations, given that process_input is state-less?
The system has 32 GiB RAM in total and is running Ubuntu 20.04. Python version is 3.6.10. The process_input function uses several third-party libraries.
In general RSS is not a particularly good indicator because it is "resident" set size and even a rather piggish process, in terms of committed memory, can have a modest RSS as memory can be swapped out. You can look at /proc/self/smaps and add up the size of the writable regions to get a much better benchmark.
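A minimal sketch of that measurement, assuming the /proc/self/smaps layout of current Linux kernels (sum the Size: of every writable mapping):

def writable_kib():
    """Sum the Size: of writable mappings in /proc/self/smaps (Linux only)."""
    total_kib = 0
    in_writable = False
    with open('/proc/self/smaps') as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if not parts[0].endswith(':'):
                # a mapping header line, e.g. "7f12...-7f34... rw-p 00000000 08:01 123 /lib/x.so"
                in_writable = len(parts) > 1 and 'w' in parts[1]
            elif parts[0] == 'Size:' and in_writable:
                total_kib += int(parts[1])       # reported in kB
    return total_kib

print(f'writable mappings: {writable_kib() / 1024:.1f} MiB')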
On the other hand, if there is actually growth, and you want to understand why, you need to look at the actual dynamically allocated memory. What I'd suggest for this is using https://github.com/vmware/chap
To do this, just make that 1 second sleep a bit longer, put a print just before the call to sleep, and use gcore from another session to gather a live core during a few of those sleeps.
So let's say you have cores gathered from when the input was 14 and when it was 21. Look at each of the cores using chap, for example, with the following commands:
count used
That will give you a good view of allocations that have been requested but not released. If the numbers are much larger for the later core, you probably have some kind of growth issue. If those numbers do differ by quite a lot, use
summarize used
If you have growth, it is possible that there is a leak (as opposed to some container simply expanding). To check this, you can try commands like
count leaked
show leaked
From there you should probably look at the documentation, depending on what you find.
OTOH if used allocations are not the issue, maybe try the following, to see memory for allocations that have been released but are part of larger regions of memory that cannot be given back to the operating system because parts of those regions are still in use:
count free
summarize free
If neither "used" allocations or "free" allocations are the issue, you might try:
summarize writable
That is a very high level view of all writable memory. For example, you can see things like stack usage...
I have a Python script which uses an open-source PyTorch model, and this code has a memory leak. I am running it with memory_profiler (mprof run --include-children python my_script.py) and get the following image:
I am trying to find the reason for the leak with the built-in Python module tracemalloc:
import tracemalloc

tracemalloc.start(25)

while True:
    ...
    snap = tracemalloc.take_snapshot()
    domain_filter = tracemalloc.DomainFilter(True, 0)
    snap = snap.filter_traces([domain_filter])
    stats = snap.statistics('lineno', True)
    for stat in stats[:10]:
        print(stat)
Looking only at the tracemalloc output, I am not able to identify the problem. I assume that the problem is in a C extension, but I would like to make sure that is true.
I tried changing the domain with DomainFilter, but I only get output in domain 0.
Also, I don't understand the meaning of the parameter that tracemalloc.start(frameno) takes; frameno is the number of the most recent frames to store, but nothing happens when I change it.
What can I do next to find the problematic place in the code which causes the memory leak?
Looking forward to your answer.
Given that your guess is that the problem is in a C extension, but that you want to make sure this is true, I would suggest that you do so using a tool that is less python-specific, like https://github.com/vmware/chap, at least if you are able to run your program on Linux.
What you will need to do is run your script (uninstrumented) and at some point gather a live core (for example, using "gcore pid-of-your-running-program").
Once you have that core, open that core in chap ("chap your-core-file-path") and try the following command from the chap prompt:
summarize writable
The output will be something like this, but your numbers will likely vary considerably:
chap> summarize writable
5 ranges take 0x2021000 bytes for use: stack
6 ranges take 0x180000 bytes for use: python arena
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
8 ranges take 0x80000 bytes for use: used by module
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
4 ranges take 0x30000 bytes for use: unknown
29 writable ranges use 0x23e7000 (37,646,336) bytes.
The lines in the summary are given in decreasing order of byte usage, so you can follow that order. So looking at the top one first we see that the use is "stack":
5 ranges take 0x2021000 bytes for use: stack
This particular core was for a very simple python program that starts 4 extra threads and has all 5 threads sleep. The reason large stack allocations can happen rather easily with a multi-threaded python program is that python uses pthreads to create additional threads and pthreads uses the ulimit value for stack size as a default. If your program has a similarly large value, you can change the stack size in one of several ways, including running "ulimit -s" in the parent process to change the default stack size. To see what values actually make sense you can use the following command from the chap prompt:
chap> describe stacks
Thread 1 uses stack block [0x7fffe22bc000, 7fffe22dd000)
current sp: 0x7fffe22daa00
Peak stack usage was 0x7798 bytes out of 0x21000 total.
Thread 2 uses stack block [0x7f51ec07c000, 7f51ec87c000)
current sp: 0x7f51ec87a750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 3 uses stack block [0x7f51e7800000, 7f51e8000000)
current sp: 0x7f51e7ffe750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 4 uses stack block [0x7f51e6fff000, 7f51e77ff000)
current sp: 0x7f51e77fd750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 5 uses stack block [0x7f51e67fe000, 7f51e6ffe000)
current sp: 0x7f51e6ffc750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
5 stacks use 0x2021000 (33,689,600) bytes.
So what you see above is that 4 of the stacks are 8MiB in size but could easily be well under 64KiB.
Your program may not have any issues with stack size, but if so, you can fix them as described above.
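If the oversized stacks come from threads your own Python code starts (rather than threads created inside a C library), one option, sketched here rather than taken from chap itself, is to cap the stack size from Python before creating them:

import threading

# Only affects threads created from Python via the threading module; threads
# started inside C extensions keep whatever stack size the library requests.
threading.stack_size(512 * 1024)        # 512 KiB per thread instead of the ulimit default

def worker(n):
    sum(range(n))                       # placeholder workload

threads = [threading.Thread(target=worker, args=(1_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()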
Continuing with checking for causes of growth, look at the next line from the summary:
6 ranges take 0x180000 bytes for use: python arena
So python arenas use the next most memory. These are used strictly for python-specific allocations. So if this value is large in your case it disproves your theory about C allocations being the culprit, but there is more you can do later to figure out how those python allocations are being used.
Looking at the remaining lines of the summary, we see a few with "libc" as part of the "use" description:
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
Note that libc is responsible for all that memory, but you can't know that the memory is used for non-python code because for allocations beyond a certain size threshold (well under 4K) python grabs memory via malloc rather than grabbing memory from one of the python arenas.
So let's assume that you have resolved any issues you might have had with stack usage and you have mainly "python arena" or "libc malloc" related usages. The next thing you want to understand is whether that memory is mostly "used" (meaning allocated but never freed) or "free" (meaning freed but not given back to the operating system). You can do that as shown:
chap> count used
15731 allocations use 0x239388 (2,331,528) bytes.
chap> count free
1563 allocations use 0xb84c8 (754,888) bytes.
So in the above case, used allocations dominate and what one should do is to try to understand those used allocations. The case where free allocations dominate is much more complex and is discussed a bit in the user guide but would take too much time to cover here.
So let's assume for now that used allocations are the main cause of growth in your case. Next we can find out why we have so many used allocations.
The first thing we might want to know is whether any allocations were actually "leaked" in the sense that they are no longer reachable. This excludes the case where the growth is due to container-based growth.
One does this as follows:
chap> summarize leaked
0 allocations use 0x0 (0) bytes.
So for this particular core, as is pretty common for python cores, nothing was leaked. Your number may be non-zero. If it is non-zero but still much lower than the totals associated with memory used for "python" or "libc" reported above, you might just make a note of the leaks but continue to look for the real cause of growth. The user guide has some information about investigating leaks but it is a bit sparse. If the leak count is actually big enough to explain your growth issue, you should investigate that next but if not, read on.
Now that you are assuming container-based growth the following commands are useful:
chap> redirect on
chap> summarize used
Wrote results to scratch/core.python_5_threads.summarize_used
chap> summarize used /sortby bytes
Wrote results to scratch/core.python_5_threads.summarize_used::sortby:bytes
The above created two text files, one which has a summary ordered in terms of object counts and another which has a summary in terms of the total bytes used directly by those objects.
At present chap has only very limited support for python: it finds python objects, in addition to any allocated by libc malloc, but for python objects the summary only breaks out limited categories in terms of patterns (for example, %SimplePythonObject matches things like "int", "str", ... that don't hold other python objects, and %ContainerPythonObject matches things like tuple, list, dict, ... that do hold references to other python objects). With that said, it should be pretty easy to tell from the summary whether the growth in used allocations is primarily due to objects allocated by python or objects allocated by native code.
So in this case, given that you specifically are trying to find out whether the growth is due to native code or not, look in the summary for counts like the following, all of which are python-related:
Pattern %SimplePythonObject has 7798 instances taking 0x9e9e8(649,704) bytes.
Pattern %ContainerPythonObject has 7244 instances taking 0xc51a8(807,336) bytes.
Pattern %PyDictKeysObject has 213 instances taking 0xb6730(747,312) bytes.
So in the core I have been using for an example, definitely python allocations dominate.
You will also see a line for the following, which is for allocations that chap does not yet recognize. You can't make assumptions about whether these are python-related or not.
Unrecognized allocations have 474 instances taking 0x1e9b8(125,368) bytes.
This will hopefully answer your basic question of what you can do next. At least at that point you will understand whether the growth is likely due to C code or python code and depending on what you find, the chap user guide should help you go further from there.
I am currently working in a Jupyter notebook on Kaggle. After performing the desired transformations on my numpy array, I pickled it so that it can be stored on disk. The reason I did that is so that I can free up the memory being consumed by the large array.
The memory consumed after pickling the array was about 8.7 gb.
I decided to run this code snippet provided by @jan-glx here, to find out which variables were consuming my memory:
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
After performing this step I noticed that the size of my array was 3.3 gb, and the size of all the other variables summed together was about 0.1 gb.
I decided to delete the array and see if that would fix the problem, by performing the following:
del my_array
gc.collect()
After doing this, the memory consumption decreased from 8.7 gb to 5.4 gb, which in theory makes sense, but still didn't explain what was consuming the rest of the memory.
I decided to continue anyway and reset all my variables to see whether this would free up the memory or not, with:
%reset
As expected it freed up the memory of the variables that were printed out in the function above, and I was still left with 5.3 gb of memory in use.
One thing to note is that I noticed a memory spike when pickling the file itself, so a summary of the process would be something like this:
performed operations on array -> memory consumption increased from about 1.9 gb to 5.6 gb
pickled file -> memory consumption increased from 5.6 gb to about 8.7 gb
Memory spikes suddenly while file is being pickled to 15.2 gb then drops back to 8.7 gb.
deleted array -> memory consumption decreased from 8.7 gb to 5.4 gb
performed reset -> memory consumption decreased from 5.4 gb to 5.3 gb
Please note that the above is loosely based on monitoring the memory on Kaggle and may be inaccurate.
I have also checked this question but it was not helpful for my case.
Would this be considered a memory leak? If so, what do I do in this case?
EDIT 1:
After some further digging, I noticed that others are facing this problem too. The problem stems from the pickling process: pickling creates a copy in memory but, for some reason, does not release it. Is there a way to release the memory after the pickling process is complete?
EDIT 2:
When I deleted the pickled file from disk, using:
!rm my_array
it ended up freeing the disk space and freeing up memory as well. I don't know whether the above tidbit is of use or not, but I decided to include it anyway, as every bit of info might help.
There is one basic drawback that you should be aware of: the CPython interpreter can actually barely free memory and return it to the OS. For most workloads, you can assume that memory is not freed during the lifetime of the interpreter's process. However, the interpreter can re-use the memory internally. So looking at the memory consumption of the CPython process from the operating system's perspective really does not help at all. A rather common work-around is to run memory-intensive jobs in a sub-process / worker process (via multiprocessing for instance) and "only" return the result to the main process. Once the worker dies, the memory is actually freed.
Second, using sys.getsizeof on ndarrays can be impressively misleading. Use the ndarray.nbytes property instead and be aware that this may also be misleading when dealing with views.
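A quick illustration of why sys.getsizeof is misleading here (exact numbers vary by platform and NumPy version):

import sys
import numpy as np

a = np.zeros((1000, 1000))      # owns its ~8 MB data buffer
v = a[:10]                      # a view: shares a's buffer, owns nothing

print(sys.getsizeof(a))         # ~8,000,112: object header plus the owned data
print(sys.getsizeof(v))         # ~112: header only, the shared data is not counted
print(v.nbytes)                 # 80,000: size of the data the view maps, owned or not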
Besides, I am not entirely sure why you "pickle" numpy arrays. There are better tools for this job. Just to name two: h5py (a classic, based on HDF5) and zarr. Both libraries allow you to work with ndarray-like objects directly on disk (with compression), essentially eliminating the pickling step. In addition, zarr also allows you to create compressed ndarray-compatible data structures in memory. Most ufuncs from numpy, scipy & friends will happily accept them as input parameters.
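For example, a minimal h5py sketch (the file name and dataset name are just placeholders):

import numpy as np
import h5py

arr = np.random.rand(1000, 1000)

# Write to disk with compression instead of pickling.
with h5py.File('my_array.h5', 'w') as f:
    f.create_dataset('data', data=arr, compression='gzip')

del arr                          # drop the in-memory copy

# Later: read it back, or slice it lazily without loading the whole array.
with h5py.File('my_array.h5', 'r') as f:
    first_rows = f['data'][:10]  # only these rows are read into memory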
I am trying to debug a memory problem with my large Python application. Most of the memory is in numpy arrays managed by Python classes, so Heapy etc. are useless, since they do not account for the memory in the numpy arrays. So I tried to manually track the memory usage using the MacOSX (10.7.5) Activity Monitor (or top if you will). I noticed the following weird behavior. On a normal python interpreter shell (2.7.3):
import numpy as np # 1.7.1
# Activity Monitor: 12.8 MB
a = np.zeros((1000, 1000, 17)) # a "large" array
# 142.5 MB
del a
# 12.8 MB (so far so good, the array got freed)
a = np.zeros((1000, 1000, 16)) # a "small" array
# 134.9 MB
del a
# 134.9 MB (the system didn't get back the memory)
import gc
gc.collect()
# 134.9 MB
No matter what I do, the memory footprint of the Python session will never go below 134.9 MB again. So my question is:
Why are the resources of arrays larger than 1000x1000x17x8 bytes (found empirically on my system) properly given back to the system, while the memory of smaller arrays appears to be stuck with the Python interpreter forever?
This does appear to ratchet up, since in my real-world applications, I end up with over 2 GB of memory I can never get back from the Python interpreter. Is this intended behavior that Python reserves more and more memory depending on usage history? If yes, then Activity Monitor is just as useless as Heapy for my case. Is there anything out there that is not useless?
Reading Numpy's policy for releasing memory, it seems that numpy does not have any special handling of memory allocation/deallocation. It simply calls free() when the reference count goes to zero. In fact, it's pretty easy to replicate the issue with any built-in python object. The problem lies at the OS level.
Nathaniel Smith has written an explanation of what is happening in one of his replies in the linked thread:
In general, processes can request memory from the OS, but they cannot give it back. At the C level, if you call free(), then what actually happens is that the memory management library in your process makes a note for itself that that memory is not used, and may return it from a future malloc(), but from the OS's point of view it is still "allocated". (And python uses another similar system on top for malloc()/free(), but this doesn't really change anything.) So the OS memory usage you see is generally a "high water mark", the maximum amount of memory that your process ever needed.

The exception is that for large single allocations (e.g. if you create a multi-megabyte array), a different mechanism is used. Such large memory allocations can be released back to the OS. So it might specifically be the non-numpy parts of your program that are producing the issues you see.
So, it seems there is no general solution to the problem. Allocating many small objects will lead to "high memory usage" as profiled by the tools, even though it will be reused when needed, while allocating big objects won't show big memory usage after deallocation, because that memory is reclaimed by the OS.
You can verify this allocating built-in python objects:
In [1]: a = [[0] * 100 for _ in range(1000000)]
In [2]: del a
After this code I can see that memory is not reclaimed, while doing:
In [1]: a = [[0] * 10000 for _ in range(10000)]
In [2]: del a
the memory is reclaimed.
To avoid memory problems you should either allocate big arrays and work with them (maybe use views to "simulate" small arrays?), or try to avoid having many small arrays alive at the same time. If you have some loop that creates small objects, you might explicitly deallocate objects that are not needed at every iteration, instead of doing this only at the end.
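A minimal sketch of the "views to simulate small arrays" idea (the sizes are arbitrary; the point is one large allocation versus many small ones):

import numpy as np

# Many tiny arrays: each allocation is small, and the freed memory tends to
# stay with the process after deletion.
small_arrays = [np.zeros(100) for _ in range(10_000)]

# Alternative: one big block, handing out views into it. The single large
# allocation can be returned to the OS when the block is deleted.
block = np.zeros(10_000 * 100)
views = [block[i * 100:(i + 1) * 100] for i in range(10_000)]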
I believe Python Memory Management gives good insights on how memory is managed in python. Note that, on top of the "OS problem", python adds another layer to manage memory arenas, which can contribute to high memory usage with small objects.
Provided that we know that the whole file will be loaded in memory and we can afford it, what are the drawbacks (if any) or limitations (if any) of loading an entire file (possibly a binary file) into a Python variable? If this is technically possible, should it be avoided, and why?
Regarding file size concerns, to what maximum size should this solution be limited? And why?
The actual loading code could be the one proposed in this stackoverflow entry.
Sample code is:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

content = file_get_contents('/bin/kill')
... code manipulating 'content' ...
[EDIT]
Code manipulation that comes to mind (but is maybe not applicable) is standard list/strings operators (square brackets, '+' signs) or some string operators ('len', 'in' operator, 'count', 'endswith'/'startswith', 'split', 'translation' ...).
Yes, you can
The only drawback is memory usage, and possibly also speed if the file is big.
File size should be limited to how much space you have in memory.
In general, there are better ways to do it, but for one-off scripts where you know memory is not an issue, sure.
While you've gotten good responses, it seems nobody has answered this part of your question (as often happens when you ask many questions in a question;-)...:
Regarding file size concerns, to what maximum size should this solution be limited? And why?
The most important thing is how much physical RAM this specific Python process can actually use (what's known as a "working set") without unduly penalizing other aspects of the overall system's performance. If you exceed physical RAM for your "working set", you'll be paginating and swapping in and out to disk, and your performance can rapidly degrade (up to a state known as "thrashing", where basically all available cycles are going to the tasks of getting pages in and out, and a negligible amount of actual work can get done).
Out of that total, a reasonably modest amount (say a few MB at most, in general) are probably going to be taken up by executable code (Python's own executable files, DLLs or .so's) and bytecode and general support datastructures that are actively needed in memory; on a typical modern machine that's not doing other important or urgent tasks, you can almost ignore this overhead compared to the gigabytes of RAM that you have available overall (though the situation might be different on embedded systems, etc).
All the rest is available for your data -- which includes this file you're reading into memory, as well as any other significant data structures. "Modifications" of the file's data can typically take (transiently) twice as much memory as the file's contents' size (if you're holding it in a string) -- more, of course, if you're keeping a copy of the old data as well as making new modified copies/versions.
So for "read-only" use on a typical modern 32-bit machine with, say, 2GB of RAM overall, reading into memory (say) 1.5 GB should be no problem; but it will have to be substantially less than 1 GB if you're doing "modifications" (and even less if you have other significant data structures in memory!). Of course, on a dedicated server with a 64-bit build of Python, a 64-bit OS, and 16 GB of RAM, the practical limits before very different -- roughly in proportion to the vastly different amount of available RAM in fact.
For example, the King James' Bible text as downloadable here (unzipped) is about 4.4 MB; so, in a machine with 2 GB of RAM, you could keep about 400 slightly modified copies of it in memory (if nothing else is requesting memory), but, in a machine with 16 (available and addressable) GB of RAM, you could keep well over 3000 such copies.
with open(filename) as f:
This only works on Python 2.x on Unix. It won't do what you expect on Python 3.x or on Windows, as these both draw a strong distinction between text and binary files. It's better to specify that the file is binary, like this:
with open(filename, 'rb') as f:
This will turn off the OS's CR/LF conversion on Windows, and will force Python 3.x to return a byte array rather than Unicode characters.
As for the rest of your question, I agree with Lennart Regebro's (unedited) answer.
The sole issue you can run into is memory consumption: Strings in Python are immutable. So when you need to change a byte, you need to copy the old string:
new = old[0:pos] + newByte + old[pos+1:]
This needs up to three times the memory of old.
Instead of a string, you can use an array. These offer much better performance if you need to modify the contents, and you can create them easily from a string.
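For example, a small sketch using bytearray (one possible mutable buffer; '/bin/kill' just mirrors the example above):

# A mutable buffer avoids the copy-per-modification cost of immutable strings.
with open('/bin/kill', 'rb') as f:
    data = bytearray(f.read())

data[0] = 0x7f           # modify a single byte in place, no copy of the buffer
chunk = bytes(data[:4])  # convert back to an immutable type only where needed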
You can also read the file's lines and join them back into a single string:
>>> ''.join(open('htdocs/config.php', 'r').readlines())
"This is the first line of the file.\nSecond line of the file"
Read more here http://docs.python.org/py3k/tutorial/inputoutput.html
Yes you can, provided the file is small enough.
It is even quite pythonic to further convert the return value of read() into any container/iterable type, as with, say, str.split(), along with associated functional-programming features, to continue treating the file "at once".
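For example, a small illustration of that style, reusing the file_get_contents helper above (the filename is just a placeholder):

content = file_get_contents('notes.txt')

words = content.split()                                          # whole file as a list of tokens
todo_lines = [line for line in content.splitlines() if 'TODO' in line]
has_header = content.startswith('#')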