Can Pickle handle files larger than the RAM installed on my machine?

Can Pickle handle files larger than the RAM installed on my machine? - python

I'm using pickle for saving on disk my NLP classifier built with the TextBlob library.
I'm using pickle after a lot of searches related to this question. At the moment I'm working locally and I have no problem loading the pickle file (which is 1.5Gb) with my i7 and 16gb RAM machine. But the idea is that my program, in the future, has to run on my server which only has 512Mb RAM installed.
Can pickle handle such a large file or will I face memory issues?
On my server I've got Python 3.5 installed and it is a Linux server (not sure which distribution).
I'm asking because at the moment I can't access my server, so I can't just try and find out what happens, but at the same time I'm doubtful if I can keep this approach or I have to find other solutions.

Unfortunately this is difficult to accurately answer without testing it on your machine.
Here are some initial thoughts:
There is no inherent size limit that the Pickle module enforces, but you're pushing the boundaries of its intended use. It's not designed for individual large objects. However, you since you're using Python 3.5, you will be able to take advantage of PEP 3154 which adds better support for large objects. You should specify pickle.HIGHEST_PROTOCOL when you dump your data.
You will likely have a large performance hit because you're trying to deal with an object that is 3x the size of your memory. Your system will probably start swapping, and possibly even thrashing. RAM is so cheap these days, bumping it up to at least 2GB should help significantly.
To handle the swapping, make sure you have enough swap space available (a large swap partition if you're on Linux, or enough space for the swap file on your primary partition on Windows).
As pal sch's comment shows, Pickle is not very friendly to RAM consumption during the pickling process, so you may have to deal with Python trying to get even more memory from the OS than the 1.5GB we may expect for your object.
Given these considerations, I don't expect it to work out very well for you. I'd strongly suggest upgrading the RAM on your target machine to make this work.

I don't see how you could load an object into RAM that exceeds the RAM. i.e. bytes(num_bytes_greater_than_ram) will always raise an MemoryError.

Related

Python: MemoryError (scripts runs sometimes)

I have a script which sometimes runs successfully, providing the desired output, but when rerun moments later it provides the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 70.8 MiB for an array with shape (4643100, 2) and data type float64
I realise this question has been answered several times (like here), but so far none of the solutions have worked for me. I was wondering if anyone has any idea how it's possible that sometimes the script runs fine and then moments later it provides an error?
I have lowered my computer's RAM usage, have increased the virtual memory, rebooted my laptop, none of which seemed to help (Windows 10, RAM 8.0GB, python 3.9.2 32 bit).
PS: Unfortunately not possible to share the script/create dummy.

Python is a garbage collected language. Garbage collection is non-deterministic. This means that peak memory usage may be different each time a program is run. So the first time you run the program, its peak memory usage is less than the available memory. But the next time you run the program, its peak memory usage is sufficient to consume all available memory. This assumes that the available memory on the host system is constant, which is an incorrect assumption. So the fluctuation in available memory, i.e. the memory not in use by the other running processes, is another reason that the program may raise a MemoryError one time, but terminate without error another time.
Sidenote: Increase virtual memory as a last resort. It isn't memory, it's disk that is used like memory, and it is much slower than memory.

Using more memory than available

I have written a program that expands a database of prime numbers. This program is written in python and runs on windows 10 (x64) with 8GB RAM.
The program stores all primes it has found in a list of integers for further calculations and uses approximately 6-7GB of RAM while running. During some runs however, this figure has dropped to below 100MB. The memory usage then stays low for the duration of the run, though increasing as expected as more numbers are added to the prime array. Note that not all runs result in a memory drop.
Memory usage measured with task manager
These, seemingly random, drops has led me the following theories:
There's a bug in my code, making it drop critical data and messing up the results (most likely but not supported by the results)
Python just happens to optimize my code extremely well after a while.
Python or Windows is compensating for my over-usage of the RAM by cleaning out portions of my prime-number array that aren't used that much. (eventually resulting in incorrect calculations)
Python or Windows is compensating for my over-usage of the RAM by allocating disk space instead of ram.
Questions
What could be the reason(s) for this memory drop?
How does python handle programs that use more than available RAM?
How does Windows handle programs that use more than available RAM?

1, 2, and 3 are incorrect theories.
4 is correct. Windows (not Python) is moving some of your process memory to swap space. This is almost totally transparent to your application - you don't need to do anything special to respond to or handle this situation. The only thing you will notice is your application may get slower as information is written to and read from disk. But it all happens transparently. See https://en.wikipedia.org/wiki/Virtual_memory for more information.

Have you heard of paging? Windows dumps some ram (that hasn't been used in a while) to your hard drive to keep your computer from running out or ram and ultimately crashing.
Only Windows deals with memory management. Although, if you use Windows 10, it will also compress your memory, somewhat like a zip file.

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

I'm trying to identify a memory leak in a Python program I'm working on. I'm current'y running Python 2.7.4 on Mac OS 64bit. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy usage shows during the program execution, the memory is roughly constant. Yet, my activity monitor shows rapidly increasing memory. Within 15 minutes, the process has consumed all my system memory (16gb), and I start seeing page outs. Any idea why heapy isn't tracking this properly?

Take a look at this fine article. You are, most likely, not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.

Huge memory usage of Python's json module?

When I load the file into json, pythons memory usage spikes to about 1.8GB and I can't seem to get that memory to be released. I put together a test case that's very simple:
with open("test_file.json", 'r') as f:
j = json.load(f)
I'm sorry that I can't provide a sample json file, my test file has a lot of sensitive information, but for context, I'm dealing with a file in the order of 240MB. After running the above 2 lines I have the previously mentioned 1.8GB of memory in use. If I then do del j memory usage doesn't drop at all. If I follow that with a gc.collect() it still doesn't drop. I even tried unloading the json module and running another gc.collect.
I'm trying to run some memory profiling but heapy has been churning 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu server 11.10.
I'm happy to load up any memory profiler and see if it does better then heapy and provide any diagnostics you might think are necessary. I'm hunting around for a large test json file that I can provide for anyone else to give it a go.

I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue and how memory works with python vs the operating system
See Why doesn't Python release the memory when I delete a large object? for why memory released from python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.

Why does running SQLite (through python) cause memory to "unofficially" fill up?

I'm dealing with some big (tens of millions of records, around 10gb) database files using SQLite. I'm doint this python's standard interface.
When I try to insert millions of records into the database, or create indices on some of the columns, my computer slowly runs out of memory. If I look at the normal system monitor, it looks like the majority of the system memory is free. However, when I use top, it looks like I have almost no system memory free. If I sort the processes by their memory consuption, then non of them uses more than a couple percent of my memory (including the python process that is running sqlite).
Where is all the memory going? Why do top and Ubuntu's system monitor disagree about how much system memory I have? Why does top tell me that I have very little memory free, and at the same time not show which process(es) is (are) using all the memory?
I'm running Ubuntu 11.04, sqlite3, python 2.7.

10 to 1 says you are confused by linux's filesystem buffer/cache
see
ofstream leaking memory
https://superuser.com/questions/295900/linux-sort-all-data-in-memory/295902#295902
Test it by doing (as root)
echo 3 > /proc/sys/vm/drop_caches

The memory may be not assigned to a process, but it can be e.g. a file on tmpfs filesystem (/dev/shm, /tmp sometimes). You should show us the output of top or free (please note those tools do not show a single 'memory usage' value) to let us tell something more about the memory usage.
In case of inserting records to a database it may be a temporary image created for the current transaction, before it is committed to the real database. Splitting the insertion into many separate transactions (if applicable) may help.
I am just guessing, not enough data.
P.S. It seems I mis-read the original question (I assumed the computer slows down) and there is no problem. sehe's answer is probably better.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can Pickle handle files larger than the RAM installed on my machine? - python

I don't see how you could load an object into RAM that exceeds the RAM. i.e. bytes(num_bytes_greater_than_ram) will always raise an MemoryError.

Related

Python: MemoryError (scripts runs sometimes)

Using more memory than available

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

Huge memory usage of Python's json module?

Why does running SQLite (through python) cause memory to "unofficially" fill up?

Categories

Resources