I am reading a 15 GB .csv file using the pandas read_csv() function with the iterator/chunksize functionality, because I only need a subset of about 20% of the file.
I am doing this in PyCharm, where I set the maximum heap size to 18 GB (although I have 16 GB of RAM) and the minimum allocated memory to half of that, 9 GB. Throughout this process PyCharm indicates I am using around 100 to 200 MB of RAM, while the Windows Task Manager indicates approximately 2.5 GB in use across the PyCharm and Python processes, with about 45% of my memory still free.
As far as I can see, nothing indicates that I am running out of memory. Still, while reading in this data I get a MemoryError which tells me:
MemoryError: Unable to allocate array with shape (4, 8193780) and data type float64
Can someone clarify this for me? I suspected that maybe the final dataframe would be larger than my RAM can handle. That would be:
(4 * 8193780 * 8 bytes per float64) / (1024**3) < 1 GB
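The same arithmetic in the interpreter:

>>> round((4 * 8193780 * 8) / 1024**3, 3)   # bytes of a (4, 8193780) float64 array, in GiB
0.244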
So the above does not seem to be the problem either, or am I missing something here?
I think you are using 15 GB of memory just to read your file, since I guess read_csv() still accesses the whole file even if you specified the chunk/iterator functionality to use only 20% of it. On top of that you are running Windows and PyCharm, which need at least 1 GB of memory, so adding everything up I guess you are out of memory.
But here are some ways to approach your problem.
Verify the dtype of your array and try to find the best one for your purpose. For example, you are using float64; consider whether float32 or even float16 might be appropriate (see the read_csv sketch after this list).
Consider whether your computation can be done on a subset of the data. This is called subsampling. Maybe with subsampling you get a good enough model (this may be the case for a clustering algorithm like k-means).
Look for out-of-core solutions. This may mean rethinking your algorithm (can you split the work?), or trying a solution that does it transparently.
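A hedged sketch of how the dtype and chunking suggestions could be combined in read_csv; the file name, column names, dtypes and the filter condition are placeholders, not taken from the question:

import pandas as pd

chunks = []
for chunk in pd.read_csv("big_file.csv",
                         usecols=["id", "value"],                     # read only the needed columns
                         dtype={"id": "int32", "value": "float32"},   # smaller dtypes than the defaults
                         chunksize=1_000_000):                        # stream the file in chunks
    chunks.append(chunk[chunk["value"] > 0])                          # keep only the ~20% subset per chunk

subset = pd.concat(chunks, ignore_index=True)

This way only the filtered subset of each chunk is ever kept in memory at once.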
Related
When reading in areas of a VRT with the Python GDAL library, RAM usage keeps increasing up to about 50% of the available memory. This is fine on a normal computer but becomes a problem when running on a computing cluster with huge amounts of RAM available.
Is there a way to limit how much RAM gdal uses?
Edit:
I am reading in blocks of 256x256 pixels at a time with vrt.ReadAsArray(...), which are used immediately and not needed afterwards. However, judging by the memory consumption, GDAL keeps the read tiles in memory in case they are needed again, until about 50% of the available memory is filled. Only then does it start deleting unused tiles from RAM. No matter what hardware I run the program on, memory consumption keeps increasing over time until it reaches that 50% mark.
I would like to limit this to something like 32 GB of RAM.
I have found a CACHE_MAX config option in GDAL. However, when checking the amount of cache used with gdal.GetCacheUsed(), it is apparently always 0. So while the option sounded promising, it does not seem to provide a solution.
I finally did some tests and found a solution in case anyone else comes across this problem.
Although gdal.GetCacheUsed() always returned 0, changing the CACHE_MAX config option solved the problem for me. This can be set in python like this:
from osgeo import gdal
gdal.SetCacheMax(134217728)  # 134217728 bytes = 128 MiB
While I couldn't figure out exactly how this limit applies, the cache size appears to be per band, per raster in the VRT, per VRT, per process. That is, memory usage will be higher for VRTs with many rasters and bands, etc.
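For completeness, a rough sketch of how the cache cap combines with the block-reading pattern from the question; the VRT path and the 512 MiB limit are placeholders, and SetCacheMax() takes the value in bytes:

from osgeo import gdal

gdal.SetCacheMax(512 * 1024 * 1024)    # cap GDAL's block cache (bytes); placeholder value
vrt = gdal.Open("mosaic.vrt")          # placeholder path

for y in range(0, vrt.RasterYSize, 256):
    for x in range(0, vrt.RasterXSize, 256):
        xsize = min(256, vrt.RasterXSize - x)
        ysize = min(256, vrt.RasterYSize - y)
        tile = vrt.ReadAsArray(x, y, xsize, ysize)
        # process tile; nothing is kept afterwards, the cache cap bounds what GDAL retains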
I would like to load as much data as is safe, so that the current process works fine, as well as other processes. I would prefer to use RAM only (not using swap), but any suggestions are welcome. Excess data can be discarded. What is the proper way of doing this? If I just wait for a MemoryError, the system becomes inoperable (when using a list).
data_storage = []
for data in read_next_data():
    data_storage.append(data)
The data is finally to be loaded into a numpy array.
psutil has a virtual_memory function that contains, among other things, an attribute representing the free memory:
>>> psutil.virtual_memory()
svmem(total=4170924032, available=1743937536, percent=58.2, used=2426986496, free=1743937536)
>>> psutil.virtual_memory().free
1743937536
That should be pretty accurate (but the function call is costly, i.e. slow, at least on Windows). The MemoryError doesn't take memory used by other processes into account; it is only raised if the memory of the array exceeds the total available (free or not) RAM.
You may have to guess at which point to stop accumulating, because the free memory can change (other processes also need some additional memory from time to time), and the conversion to a numpy.array might temporarily double your used memory, because at that moment both the list and the array must fit into your RAM.
However, you can also approach this in a different way (a sketch follows the steps below):
Read in the first dataset: read_next_data().
Calculate the free memory at that point: psutil.virtual_memory().free
Use the shape of the first dataset and its dtype to calculate the shape of an array that fits easily into RAM. Let's say it uses a factor (e.g. 75%) of the available free memory: rows = freeMemory * factor / (firstDataShape * memoryPerElement). That gives you the number of datasets that you read in at once.
Create an array of that shape: arr = np.empty((rows, *firstShape), dtype=firstDtype).
Load the next datasets, but store them directly into your array: arr[i] = next(read_next_data). That way you don't keep those lists around and you avoid the doubled memory.
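A minimal sketch of those steps, assuming read_next_data() (the name from the question) yields equally shaped numpy arrays:

import numpy as np
import psutil

reader = iter(read_next_data())            # hypothetical data source from the question
first = np.asarray(next(reader))           # step 1: read the first dataset

factor = 0.75                              # steps 2-3: budget 75% of the free memory
free = psutil.virtual_memory().free
rows = int(free * factor / (first.size * first.dtype.itemsize))

arr = np.empty((rows, *first.shape), dtype=first.dtype)   # step 4: preallocate once
arr[0] = first

filled = 1
for data in reader:                        # step 5: store directly into the array
    if filled >= rows:
        break                              # stop before exceeding the budget
    arr[filled] = data
    filled += 1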
I am using matrix = np.array(docTermMatrix) to make a DTM, but sometimes it runs into a memory error at this line. How can I prevent this from happening?
I assume you are using 32-bit Python. 32-bit Python limits your program's memory to 2 GB (all 32-bit programs have this as a hard limit); some of this is taken up by Python's own overhead, and more of it by your program. Normal Python objects do not need contiguous memory and can be mapped to disparate regions of memory.
numpy arrays, however, require a contiguous memory allocation, which is much harder to satisfy. Additionally, np.array(a) + 1 creates a second array and must again allocate a huge contiguous block (as do most operations).
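As a small illustration of that extra allocation (the array below is just an example, not the question's docTermMatrix), in-place operations reuse the existing buffer instead of allocating a second one:

import numpy as np

a = np.ones((10_000, 10_000))   # ~0.8 GB of float64
b = a + 1                       # allocates a second ~0.8 GB contiguous block
a += 1                          # in-place: modifies the existing buffer, no new allocation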
Some possible solutions that come to mind:
Use 64-bit Python; this will give you orders of magnitude more RAM to work with. You are unlikely to encounter a memory error unless you have a really, really big array (so big that numpy is probably not the right solution anyway). A quick check of your interpreter's bitness is sketched after this list.
Use multiprocessing to create a new process, with its own 2 GB limit, that just does the numpy processing.
Use a different solution than numpy (i.e. a database).
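A quick way to check which interpreter you are actually running (nothing here is specific to the question's code):

import struct
import sys

print(struct.calcsize("P") * 8)   # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)        # True on a 64-bit build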
I get a memory error when processing a very large (>50 GB) file (the problem: RAM gets full).
My solution is: I would like to read only 500 kilobytes of data at a time, process it (and delete it from memory, then go on to the next 500 kB). Is there any better solution? Or, if this solution seems better, how do I do it with a numpy array?
This is just a quarter of the code (just to give an idea):
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value
dec_array = data.flatten()
........
I get a memory error at this point itself, as it tries to load all the data into memory.
Quick answer
numpy.memmap allows presenting a large file on disk as a numpy array. I don't know whether it allows mapping files larger than RAM+swap, though. Worth a shot.
Presentation about out-of-memory work with Python: http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html
Longer answer
A key question is how much RAM you have (<10 GB or >10 GB) and what kind of processing you're doing (whether you need to look at each element in the dataset only once, or at the whole dataset at once).
If it's <10 GB and you only need to look once, then your approach seems like the most decent one. It's a standard way to deal with datasets which are larger than main memory. What I'd do is increase the size of a chunk from 500 kB to something closer to the amount of memory you have: perhaps half of physical RAM, but in any case something in the GB range, yet not so large that it causes swapping to disk and interferes with your algorithm.

A nice optimisation would be to hold two chunks in memory at a time: one is being processed while the other is being loaded from disk in parallel. This works because loading from disk is relatively expensive but doesn't require much CPU work; the CPU is basically waiting for data to load. It's harder to do in Python because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here.
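Applied to the h5py code in the question, chunked reading might look roughly like this; the file and dataset names are taken from the question, but the chunk size is an arbitrary assumption to tune:

import h5py
import numpy as np

with h5py.File("test.h5", "r") as f:
    dataset = f["IMG_Data_2"]
    chunk_rows = 1_000_000                         # tune so one chunk lands in the GB range
    for start in range(0, dataset.shape[0], chunk_rows):
        block = dataset[start:start + chunk_rows]  # only this slice is read into RAM
        dec_array = np.asarray(block).flatten()
        # ... process dec_array, then let it go out of scope before the next chunk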
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or doing random accesses into the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or implement some logic to make the numpy etc. algorithms work with data on disk directly rather than in RAM.
If you have >10 GB of RAM, you might let the operating system do the hard work for you and increase the swap size enough to hold the whole dataset. That way everything is loaded into virtual memory, but only a subset is in physical memory at a time, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase swap is OS-specific, though.
The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.
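A minimal numpy.memmap sketch; the file name, dtype, and shape are placeholders for a roughly 50 GB binary file:

import numpy as np

# Map an existing binary file of float64 values without loading it into RAM.
arr = np.memmap("huge_data.dat", dtype="float64", mode="r", shape=(50_000_000, 128))
chunk = arr[0:1_000_000]        # only this slice is actually read from disk
print(chunk.mean())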
When writing a large dataset to a file using parallel HDF5 via h5py and mpi4py (and quite possibly also when using HDF5 and MPI directly from C), I get the following error when using the mpio driver with a single process:
OSError: Can't prepare for writing data (Can't convert from size to size_i)
It seems that the limit on the allowed dataset is 4 GB, at least when the content is arrays of doubles. Larger datasets work fine if more processes share the workload, or if the writing is done on a single CPU without the mpio driver.
Why is this? Are size and size_i pointer types, and can the former not hold addresses larger than what corresponds to a 4 GB double[]? This error probably won't be a serious problem for me in the end, because I will generally use more than one process, but I would like my code to work even with just a single process.
I recently faced the same issue, and digging into it got me to this point:
https://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8.1/src/unpacked/src/H5FDmpio.c
There you will see the error being raised. Simply put, the error comes up when the size of the array in bytes is greater than 2 GB.
Digging further got me here:
https://www.hdfgroup.org/hdf5-quest.html#p2gb
where the problem and the workarounds are described. Please have a look.
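Given that the error appears once a single write exceeds 2 GB, one straightforward workaround is to split the write into pieces below that size. A rough, hedged sketch with h5py (the file name, dataset shape, and slab size are all illustrative, and this assumes an h5py build with parallel/MPI support), run here with a single process:

from mpi4py import MPI
import h5py
import numpy as np

n_rows, n_cols = 600_000, 1000     # ~4.8 GB of float64 in total
slab = 200_000                     # 200_000 * 1000 * 8 bytes = 1.6 GB per write, below 2 GB

with h5py.File("big.h5", "w", driver="mpio", comm=MPI.COMM_WORLD) as f:
    dset = f.create_dataset("x", shape=(n_rows, n_cols), dtype="f8")
    for start in range(0, n_rows, slab):
        stop = min(start + slab, n_rows)
        # write one slab at a time instead of the whole array in a single call
        dset[start:stop, :] = np.random.rand(stop - start, n_cols)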