I think I'm seeing a memory leak when loading a .yml file with the PyYAML library.
I've followed these steps:
import yaml
d = yaml.load(open(filename, 'r'))
The memory used by the process (as reported by top or htop) has grown from 60K to 160M, while the size of the file is less than 1M.
Then I've run the following command:
sys.getsizeof(d)
And it has returned a value of less than 400K.
I've also tried to use the garbage collector with gc.collect(), but nothing has happened.
As you can see, it seems that there's a memory leak, but I don't know what is producing it, nor do I know how to free this memory.
Any idea?
Your approach doesn't show a memory leak; it just shows that PyYAML uses a lot of memory while processing a moderately sized YAML file.
If you were to do:
import yaml
X = 10
for x in range(X):
    d = yaml.safe_load(open(filename, 'r'))
and the memory used by the program changed depending on what you set X to, then there would be reason to assume there is a memory leak.
In the tests that I ran this is not the case. It is just that the default Loader and SafeLoader take about 330x the file size in memory (based on an arbitrary 1 MB simple YAML file, i.e. one without tags) and the CLoader about 145x the file size.
Loading the YAML data multiple times doesn't increase that, so load() gives back the memory it uses, which means there is no memory leak.
That is not to say that it isn't an enormous amount of overhead.
(I am using safe_load() as PyYAML's documentation indicates that load() is not safe on uncontrolled input files.)
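For reference, here is a minimal sketch of how that test can be instrumented with the standard library (assuming Linux, where resource.getrusage reports ru_maxrss in kilobytes; the file name is a placeholder). If there were a leak, the peak resident set size would keep climbing on every iteration instead of levelling off after the first load:

import resource
import yaml

def peak_rss_mb():
    # ru_maxrss is the peak resident set size; on Linux it is in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

filename = 'data.yml'  # placeholder for your ~1 MB YAML file
for x in range(10):
    with open(filename) as fp:
        d = yaml.safe_load(fp)
    print(x, peak_rss_mb())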
Related
Yes, this was asked seven years ago, but the 'answers' were not helpful in my opinion. So much open data uses JSON, so I'm asking this again to see if any better techniques are available. I'm loading a 28 MB JSON file (with 7,000 lines) and the memory used for json.loads is over 300 MB.
This statement is run repeatedly:
data_2_item = json.loads(data_1_item)
and is eating up the memory for the duration of the program. I've tried various other statements such as pd.read_json(in_file_name, lines=True)
with the same results. I've also tried the alternative packages simplejson and rapidjson.
As a commenter observed, json.loads is NOT the culprit. data_2_item can be very large -- sometimes 45K. Since it is appended to a list over 7,000 times, the list becomes huge (300 MB) and that memory is NEVER released. So, to me, the answer is: there is no solution with existing packages/loaders. The overall goal is to load a large JSON file into a pandas DataFrame without using 300 MB (or more) of memory for intermediate processing, and that memory does not shrink. See also https://github.com/pandas-dev/pandas/issues/17048
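One thing that may be worth trying, sketched below on the assumption that the file is line-delimited JSON (the file name and chunk size are placeholders): pandas can read such a file in chunks, so only one chunk's worth of intermediate objects is alive at a time instead of one huge accumulated list. The final DataFrame of course still has to fit in memory.

import pandas as pd

# Read the line-delimited JSON file 10,000 records at a time and concatenate
# the per-chunk DataFrames; intermediates from earlier chunks are released
# before the next chunk is parsed.
chunks = pd.read_json('big_file.json', lines=True, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)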
If, after loading your content, you will only be using part of it, then consider using ijson to load the JSON content in a streaming fashion, with low memory consumption, constructing only the data you need rather than the whole object.
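A rough sketch of what that can look like, assuming the document's top level is a JSON array (the file name, the 'item' prefix, and the 'price' field are placeholders to adapt to your structure):

import ijson

total = 0
with open('big_file.json', 'rb') as f:
    # Yields one element of the top-level array at a time instead of
    # building the whole document in memory.
    for record in ijson.items(f, 'item'):
        total += record.get('price', 0)  # keep only what you actually need
print(total)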
Each time I load an npy file, the memory usage increases. The following mini-example illustrates this problem.
import numpy as np
X = np.random.randn(10000,10000)
np.save('tmp.npy',X)
Now, if the following line is executed several times, the memory usage will increase each time:
y = np.load('tmp.npy')
I found a very similar problem with an npz file here, yet that solution relies on a function that is only applicable to npz files, not npy files.
Any idea?
The premise is flawed: memory usage does temporarily increase when loading the file, and may increase again the second time, and perhaps even the third, but eventually the garbage collector will run and the memory will be freed.
If you don't want to wait a nondeterministic amount of time for the memory to be reclaimed, you can explicitly force the garbage collector to run whenever you need to reclaim memory:
import gc
gc.collect()
You can also explicitly delete the array after loading it, if you no longer need the data:
del y
But if you do neither of these things, and simply load the same data over and over forever, memory usage will not grow forever--at some point the garbage collector will run and the memory usage will shrink. This happens automatically, and you usually do not need to worry about it.
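A small sketch combining both options (tmp.npy is the file from the question): each pass drops its reference to the previous array before loading the next one, so resident memory stays at roughly one array's worth.

import gc
import numpy as np

np.save('tmp.npy', np.random.randn(10000, 10000))  # ~800 MB of float64

for _ in range(5):
    y = np.load('tmp.npy')
    # ... work with y ...
    del y         # drop the only reference so the array can be freed at once
    gc.collect()  # optional: run the collector now instead of waiting for it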
So I'm using Google Cloud Datalab and I use the %%storage read command to read a large file (2,000,000 rows) into the text variable, and then I have to process it into a pandas DataFrame using BytesIO, e.g. df_new = pd.read_csv(BytesIO(text)).
Now I don't need the text variable or its contents around (all further processing is done on df_new). How can I delete it (text) and free up the memory? I sure don't need two copies of a 2-million-record dataset hanging around...
Use del followed by forced garbage collection.
import gc
# Remove text variable
del text
# Force gc collection - this is not strictly necessary, but may be useful.
gc.collect()
Note that you may not see the process size decrease and the memory return to the OS, depending on the memory allocator used (which depends on the OS, the core libraries used, and the Python compilation options).
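Another way to avoid ever holding two module-level copies, sketched here as a plain function wrapper (the function name is illustrative), is to let the intermediate objects go out of scope as soon as the DataFrame has been built:

from io import BytesIO
import pandas as pd

def bytes_to_dataframe(raw):
    # The BytesIO wrapper and any parsing intermediates live only inside
    # this call; they become unreachable as soon as the function returns.
    return pd.read_csv(BytesIO(raw))

df_new = bytes_to_dataframe(text)
del text  # drop the last reference to the raw bytes themselves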
When I load the file with json, Python's memory usage spikes to about 1.8 GB and I can't seem to get that memory released. I put together a very simple test case:
with open("test_file.json", 'r') as f:
j = json.load(f)
I'm sorry that I can't provide a sample JSON file, as my test file has a lot of sensitive information, but for context, I'm dealing with a file on the order of 240 MB. After running the above two lines, I have the previously mentioned 1.8 GB of memory in use. If I then do del j, memory usage doesn't drop at all. If I follow that with a gc.collect(), it still doesn't drop. I even tried unloading the json module and running another gc.collect().
I'm trying to run some memory profiling but heapy has been churning 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu server 11.10.
I'm happy to load up any memory profiler and see if it does better than heapy, and I can provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.
I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue, and about how memory works in Python vs. the operating system.
See Why doesn't Python release the memory when I delete a large object? for why memory released by Python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
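A minimal sketch of that subprocess approach, assuming the big parse can be reduced to some small result before the child exits (the file name and the summarize function are placeholders):

import json
import multiprocessing as mp

def summarize(path, queue):
    # All of the parsed objects live only in this child process; when it
    # exits, the operating system reclaims everything it allocated.
    with open(path, 'r') as f:
        data = json.load(f)
    queue.put(len(data))  # send back only the small result you need

if __name__ == '__main__':
    queue = mp.Queue()
    worker = mp.Process(target=summarize, args=('test_file.json', queue))
    worker.start()
    result = queue.get()  # read the result before join() to avoid blocking
    worker.join()
    print(result)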
I have to copy a file and do some simple processing on it. I cannot read the whole file into memory because it is too big. I came up with a piece of code that looks like this:
buffer = inFile.read(buffer_size)
while len(buffer) > 0:
    outFile.write(buffer)
    simpleCalculations(buffer)
    buffer = inFile.read(buffer_size)
The simpleCalculations procedure is irrelevant in this context, but I am worried about the repeated memory allocations for buffer. On some hardware configurations memory usage gets very high, and that apparently kills the machine. I would like to reuse buffer. Is this possible in Python 2.6?
I don't think there's any easy way around this. The file.read() method just returns a new string each time you call it. On the other hand, you don't really need to worry about running out of memory -- once you assign buffer to the newly-read string, the previously-read string no longer has any references to it, so its memory gets freed automatically (see here for more details).
Python being a strictly reference-counted environment, your buffer will be deallocated as soon as you no longer have any references to it.
If you're worried about physical RAM but have spare address space, you could mmap your file rather than reading it in a bit at a time.
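A rough sketch of what that could look like (the file names and chunk size are placeholders, and simpleCalculations stands in for the processing from the question):

import mmap

in_file = open('big_input.bin', 'rb')
out_file = open('copy_output.bin', 'wb')
# Pages of the mapping are backed by the file itself, so physical RAM is
# only needed for the chunk currently being processed and no long-lived
# buffers accumulate.
mapped = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
try:
    chunk_size = 1024 * 1024
    for start in range(0, len(mapped), chunk_size):
        chunk = mapped[start:start + chunk_size]
        out_file.write(chunk)
        # simpleCalculations(chunk)  # the per-chunk work from the question
finally:
    mapped.close()
    in_file.close()
    out_file.close()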