Memory Errors in Python - python

I created a non-sparse matrix in Python with the shape 15,610,000 x 513 and therefore 8,007,930,000 entries. The entries are float32 numbers, so the array needs roughly 32 GB, and I obviously have trouble now using the data within my 8 GB RAM system.
What are common approaches to address this problem? I'm not sure how to use the data, e.g. chunk-wise, since all those NumPy and SciPy functions always seem to operate on the whole dataset. Sparse matrices won't work because of the data structure, and I also have trouble understanding whether HDF5 is really suited to my problem (which is to use a big dataset with standard NumPy functions without exhausting my memory).
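One common answer to this kind of question is to keep the matrix on disk in HDF5 and stream it through RAM in row blocks. Below is a minimal sketch of that pattern with h5py, assuming the matrix has already been written once to a file; the file name "big_matrix.h5", the dataset name "data", and the block size are placeholders, and the column-mean reduction is just one example of an operation that can be done block-wise.

    import h5py
    import numpy as np

    CHUNK_ROWS = 100_000   # tune so that one block of rows comfortably fits in RAM

    # Assumes the float32 matrix was written once to "big_matrix.h5" under the
    # name "data" (writing can itself be done block by block with the same slicing).
    with h5py.File("big_matrix.h5", "r") as f:
        dset = f["data"]                    # stays on disk; nothing is loaded yet
        n_rows, n_cols = dset.shape

        # Example out-of-core reduction: column means computed block by block.
        col_sums = np.zeros(n_cols, dtype=np.float64)
        for start in range(0, n_rows, CHUNK_ROWS):
            block = dset[start:start + CHUNK_ROWS, :]   # only this slice is read
            col_sums += block.sum(axis=0, dtype=np.float64)
        col_means = col_sums / n_rows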

Related

How to find dot product of two very large matrices to avoid memory error?

I am trying to learn ML using Kaggle datasets. In one of the problems (using logistic regression), the input and parameter matrices are of size (1110001, 8) and (2122640, 8) respectively.
I am getting a memory error while doing this in Python. I guess it would be the same in any language, since the data is simply too big. My question is: how do they multiply matrices in real-life ML implementations (since they would usually be this big)?
Things bugging me:
Some people on SO have suggested calculating the dot product in parts and then combining them. But even then the resulting matrix would still be far too big for RAM (about 9.42 TB in this case).
And if I write it to a file, wouldn't it be too slow for optimization algorithms to read from the file while minimizing the function?
Even if I do write it to a file, how would fmin_bfgs (or any optimization function) read from the file?
Also, the Kaggle notebook shows only 1 GB of storage available; I don't think anyone would allow TBs of storage space.
In my input matrix, many rows have similar values for some columns. Can I use this to my advantage to save space (the way a sparse matrix does for zeros)?
Can anyone point me to a real-life sample implementation of such cases? Thanks!
I have tried many things; I will mention them here in case anyone needs them in future:
I had already cleaned up the data, e.g. removing duplicates and irrelevant records depending on the given problem.
I stored the large matrices that hold mostly 0s as sparse matrices.
I implemented gradient descent using the mini-batch method instead of the plain old batch method (theta.T dot X).
Now everything is working fine.
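A minimal sketch of the mini-batch gradient descent mentioned in that update, assuming the design matrix X and labels y can be sliced row-block by row-block (e.g. a memory-mapped or sparse array); the function and parameter names here are placeholders for illustration, not a reference implementation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def minibatch_logistic_gd(X, y, lr=0.1, batch_size=10_000, epochs=5):
        # Only one (batch_size, n_features) slice of X is touched per step, so
        # the huge dot product over all rows at once is never materialized.
        n_samples, n_features = X.shape
        theta = np.zeros(n_features)
        for _ in range(epochs):
            for start in range(0, n_samples, batch_size):
                Xb = X[start:start + batch_size]
                yb = y[start:start + batch_size]
                grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(yb)
                theta -= lr * grad
        return theta

    # Toy usage with random data standing in for the real Kaggle matrices:
    X = np.random.rand(100_000, 8).astype(np.float32)
    y = (X.sum(axis=1) > 4).astype(np.float32)
    theta = minibatch_logistic_gd(X, y)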

load and operate on matrices bigger than RAM - python - numpy - pandas

my tasks:
load from the database matrices whose dimensions are bigger than my RAM, using pandas.read_sql(...) (the database is PostgreSQL)
operate on the numpy representation of such matrices (bigger than my RAM) using numpy
the problem: I get a memory error even when just loading the data from the database.
my temporary quick-and-dirty solution: loop over chunks of the aforementioned data (importing parts of the data at a time), thus letting RAM handle the workload. The issue at play here is speed: the running time becomes significantly higher, and before delving into Cython optimization and the like, I wanted to know whether there are solutions (either in the form of data structures, like the shelve module, or the HDF5 format) that would solve the issue.
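One way to make the chunked import a one-off cost is to stream the query result with read_sql's chunksize argument into an on-disk HDF5 table once, then pull back only the rows/columns a given computation needs. A hedged sketch, assuming SQLAlchemy and PyTables are installed; the connection string, table, key, and column names are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # placeholder DSN
    query = "SELECT * FROM big_table"                                         # placeholder query

    # Stream the result set in chunks instead of loading it all at once,
    # appending each chunk to an on-disk HDF5 table.
    with pd.HDFStore("big_table.h5", mode="w") as store:
        for chunk in pd.read_sql(query, engine, chunksize=100_000):
            store.append("data", chunk, index=False)

    # Later, read back only what a given step of the computation needs:
    with pd.HDFStore("big_table.h5", mode="r") as store:
        subset = store.select("data", columns=["col_a", "col_b"], start=0, stop=500_000)
    arr = subset.to_numpy()   # operate on this slice with numpy as usual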

How to tell pytables the amount of RAM to be used (using it for paging)?

I am storing several numpy arrays in a pytables file. Each individual array (size ~1MB - 100MB) fits into RAM but not all (N ~10 - 1000) arrays together fit.
In the application I operate repeatedly on these arrays, also changing their shapes etc. So I want to use pytables to swap currently unneeded arrays to disk and reload them when needed (paging). The swapping is supposed to work on a "Least Recently Used" basis.
How can I tell pytables how much RAM it can use?
I tried playing around with parameters.NODE_CACHE_SLOTS, but it had basically no effect. In a test script I stored ~200 random arrays of shape (~1000, ~1000) in a table. No matter what I chose for parameters like NODE_CACHE_SLOTS, the used RAM stayed the same -- about 80 MB, while several GB were available.
Especially in cases where all nodes fit into RAM, programs using pytables would not require any disk I/O and would hence, of course, be much faster. In general, one wants to exploit the available RAM.
[Of course it is also interesting if you know a better option than pytables for such paging purposes.]
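If I read the PyTables documentation correctly, the cache settings live in tables/parameters.py and can be overridden per file by passing them as keyword arguments to tables.open_file; for array data, the HDF5 chunk cache (CHUNK_CACHE_SIZE, in bytes) tends to matter more than the metadata node cache that NODE_CACHE_SLOTS controls, which may be why tweaking the latter alone showed no effect. A hedged sketch (file name and sizes are placeholders):

    import tables

    # Override cache-related parameters for this file only; NODE_CACHE_SLOTS governs
    # the metadata node cache, CHUNK_CACHE_SIZE the HDF5 chunk cache for array data.
    h5file = tables.open_file(
        "arrays.h5",
        mode="a",
        NODE_CACHE_SLOTS=1024,
        CHUNK_CACHE_SIZE=512 * 1024 * 1024,   # allow ~512 MB of chunks in RAM
    )
    try:
        # ... work with the arrays; recently used chunks stay cached in RAM ...
        pass
    finally:
        h5file.close()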

What are the options to compute statistical models on out-of-memory data sets in Python?

I'm referring to sub-Hadoop-size data, but bigger than RAM.
Must these be coded by hand?
I'd try pytables: it's based on HDF5 and numpy, so you can use the same good statistical packages in Python (which are mostly based on numpy in some way) while not having to put everything in memory.
http://www.pytables.org/moin/MainFeatures
* Unlimited datasets size
Allows working with tables and/or arrays with a very large number of rows (up to 2**63), i.e. that don't fit in memory.
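As a concrete illustration of that pattern, here is a sketch of a chunk-wise mean/variance over a large PyTables array; the file layout (a 2-D float array stored at /data) and block size are assumptions, and fancier statistics follow the same accumulate-per-block idea.

    import numpy as np
    import tables

    CHUNK_ROWS = 100_000   # placeholder block size

    with tables.open_file("dataset.h5", mode="r") as f:
        data = f.root.data                  # CArray/EArray; stays on disk
        n_rows, n_cols = data.shape

        # Accumulate sums and squared sums block by block to get per-column
        # mean and variance without holding the full dataset in memory.
        s = np.zeros(n_cols, dtype=np.float64)
        ss = np.zeros(n_cols, dtype=np.float64)
        for start in range(0, n_rows, CHUNK_ROWS):
            block = data[start:start + CHUNK_ROWS, :].astype(np.float64)
            s += block.sum(axis=0)
            ss += (block ** 2).sum(axis=0)
        mean = s / n_rows
        var = ss / n_rows - mean ** 2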

Python - Best data structure for incredibly large matrix

I need to create about 2 million vectors with 1000 slots each (each slot merely contains an integer).
What would be the best data structure for working with this amount of data? It could be that I'm over-estimating the amount of processing/memory involved.
I need to iterate over a collection of files (about 34.5GB in total) and update the vectors each time one of the 2 million items (each corresponding to a vector) is encountered on a line.
I could easily write code for this, but I know it wouldn't be efficient enough to handle the volume of data, which is why I'm asking you experts. :)
Best,
Georgina
You might be memory bound on your machine. Without cleaning up running programs:
a = numpy.zeros((1000000, 1000), dtype=int)
wouldn't fit into memory. But in general, if you can break the problem up so that you don't need the entire array in memory at once, or if you can use a sparse representation, I would go with numpy (and scipy for the sparse representation).
Also, you could think about storing the data on disk in HDF5 with h5py or pytables, or in netCDF4 with netcdf4-python, and then accessing only the portions you need.
Use a sparse matrix, assuming most entries are 0.
If you need to work in RAM, try the scipy.sparse matrix variants; the module includes algorithms to efficiently manipulate sparse matrices.
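To make the sparse suggestion concrete, here is one hedged sketch for the 2-million-vectors use case: accumulate counts in an ordinary dict while streaming the files (dict increments are cheap), then convert once to a scipy.sparse CSR matrix. The update() helper and the indices are placeholders for whatever maps items and slots to rows and columns; if most slots turn out non-zero, a dense on-disk array (numpy.memmap or the HDF5 options above) would be the alternative.

    from collections import defaultdict
    import numpy as np
    import scipy.sparse as sp

    counts = defaultdict(int)               # (row, col) -> count, built incrementally

    def update(item_id, slot, amount=1):
        # item_id / slot are placeholder indices for the item and the vector slot
        counts[(item_id, slot)] += amount

    # Toy calls standing in for the loop over the ~34.5 GB of files:
    update(0, 5)
    update(0, 5)
    update(1_999_999, 42)

    # One-off conversion to a compressed sparse row matrix for later arithmetic.
    rows, cols = zip(*counts.keys())
    vals = np.fromiter(counts.values(), dtype=np.int64, count=len(counts))
    matrix = sp.csr_matrix((vals, (rows, cols)), shape=(2_000_000, 1000))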
