How do I work with large, memory-hungry numpy arrays? - python

I have a program which creates an array:
List1 = zeros((x, y), dtype=complex_)
Currently I am using x = 500 and y = 1000000.
I initialize the first column of the array with some formula; each subsequent column is then computed from the preceding one.
After the array is completely filled, I display this multidimensional array using imshow().
The size of each value (item) in the array is 24 bytes.
A sample value from the code is: 4.63829355451e-32
When I run the code with y = 10000000, it takes up too much RAM and the system stops the run. How do I solve this problem? Is there a way to save RAM while still being able to process the array and display it with imshow()? Also, how large an array can imshow() display?

There's no way to solve this problem (in any general way).
Computers (as commonly understood) have a limited amount of RAM, and they require elements to be in RAM in order to operate on them.
A complex128 array of size 10000000x500 would require around 74 GiB to store. You'll need to somehow reduce the amount of data you're processing if you hope to use a regular computer to do it (as opposed to a supercomputer).
A common technique is partitioning your data and processing each partition individually (possibly on multiple computers). Depending on the problem you're trying to solve, there may be special data structures that you can use to reduce the amount of memory needed to represent the data - a good example is a sparse matrix.
It's very unusual to need this much memory - make sure to carefully consider whether it's actually needed before you delve into extremely complex workarounds.
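A rough sketch of the partitioning idea for this particular problem (the initial formula and the recurrence below are placeholders, not from the question): keep the full array on disk as a numpy memmap, hold only the column being computed in RAM, and downsample before handing anything to imshow().

import numpy as np
import matplotlib.pyplot as plt

x, y = 500, 10000000

# The memmap lives on disk (~75 GiB for complex128) instead of in RAM.
# Storing the array transposed keeps each logical column contiguous in the file.
arr = np.memmap('columns.dat', dtype=np.complex128, mode='w+', shape=(y, x))

arr[0] = np.linspace(0, 1, x)        # placeholder for the real initial formula
for j in range(1, y):                # in practice, fill blocks of rows at a time
    arr[j] = arr[j - 1] * 0.99       # placeholder for the real recurrence

# imshow() cannot usefully render 10 million columns anyway, so plot a
# strided subset; adjust the step to taste.
plt.imshow(np.abs(arr[::10000]).T, aspect='auto')
plt.show()

If single precision is acceptable, np.complex64 additionally halves both the file size and the memory touched.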

Related

program freezes on creating a large numpy array

I am making a Tkinter-based project where the array size can sometimes get as high as 10^9 or even more (although the chances of more are quite minimal).
Earlier I used a plain list filled with loops, but it took a lot of time for sizes of the order of 10^6 or more, so I decided to switch to NumPy. In most cases it gave far better results, but at the size mentioned above (>= 10^9) the program just freezes (sometimes even the computer freezes, leaving no option other than a forced restart), whereas the simple looping approach still gave a result for these larger sizes, although it took a very long time.
I looked at this, but it involved terminology such as heap memory and stack size, and I know little about these things.
I am not very familiar with this platform, so any advice would be appreciated.
Update: I am adding the chunk of code where I tried replacing the normal list with a NumPy array. The commented lines are the ones I used earlier with a plain list.
def generate(self):
    # t is the number of times we need to generate this list
    for i in range(self.t):
        self.n = randint(self.n_min, self.n_max)  # constraints
        # self.a = [0] * self.n
        self.a = np.random.randint(low=self.a_min, high=self.a_max, size=self.n)
        # for j in range(self.n):
        #     self.a[j] = randint(self.a_min, self.a_max)
These values are then inserted into the output screen of the Tkinter GUI. Here 'n', i.e. the size of the NumPy array, can sometimes take very high values.
I am dual-booting (Windows + Ubuntu) and the problem occurs on Ubuntu, to which I have allocated 500 GB of storage; my laptop has 8 GB of RAM.
You most likely ran out of memory: a NumPy array of 1e9 8-byte elements (int64 or float64) takes 8 GB on its own. Also, if you are naively looping over that array with something like:
for item in big_numpy_array:
    do_something(item)
That is going to take forever. Avoid doing that, and use numpy's vector operations when possible.
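As a hedged illustration of the vectorized approach (the array size and the operation are arbitrary), the commented loop below and the single expression after it compute the same result, but the vectorized version runs in compiled code:

import numpy as np

a = np.random.randint(0, 100, size=10**7)

# Slow: a Python-level loop touches every element one at a time.
# total = 0
# for item in a:
#     total += item * 2

# Fast: one vectorized expression does the same work in C.
total = (a * 2).sum()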

How to cluster big data using Python or R without memory error?

I am trying to cluster a data set with about 1,100,000 observations, each with three values.
The code is pretty simple in R:
df11.dist <- dist(df11cl), where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in this data frame are standardized.
The error I get is:
Error: cannot allocate vector of size 4439.0 Gb
Recommendations on similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and my virtual memory is 171 GB, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunking the data for hierarchical clustering yields different results, so using a sample of the data seems to be out of the question.
I have also found this solution, but the answers actually alter the question: they essentially advise k-means. K-means could work if one knows the number of clusters beforehand, and I do not. That said, I ran k-means with different numbers of clusters, but now I don't know how to justify choosing one over another. Is there any test that can help?
Can you recommend anything in either R or python?
For trivial reasons, the function dist needs quadratic memory.
So if you have 1 million (10^6) points, a quadratic matrix needs 10^12 entries. With double precision, you need 8 bytes for each entry. With symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes just to store this matrix. Even if you stored it on SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
And 1 million is still pretty small, isn't it?
Using dist on big data is impossible. End of story.
For larger data sets, you'll need to
use methods such as k-means that do not use pairwise distances
use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
subsample your data to make it smaller
In particular, that last option is a good idea if you don't have a working solution yet; there is no use in struggling with the scalability of a method that does not work.
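For the Python side, a minimal sketch (assuming scikit-learn is installed; the sample size and cluster count are arbitrary) following the advice above: subsample, fit MiniBatchKMeans, which never builds a pairwise distance matrix, and then assign every point.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(1100000, 3)      # stand-in for the standardized data frame

# Fit on a random subsample, then label the full data set.
idx = np.random.choice(len(X), 100000, replace=False)
model = MiniBatchKMeans(n_clusters=8, batch_size=10000, random_state=0)
model.fit(X[idx])
labels = model.predict(X)

If the number of clusters really is unknown, DBSCAN (also in scikit-learn) follows the second suggestion above, since it works from neighborhood queries rather than a full distance matrix.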

Python NUMPY HUGE Matrices multiplication

I need to multiply two big matrices and sort their columns.
import numpy
a = numpy.random.rand(1000000, 100)
b = numpy.random.rand(300000, 100)
c = numpy.dot(b, a.T)
sorted_cols = [numpy.argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to speed it up? If not, how can I calculate the RAM needed for this operation? I currently have an EC2 box with 4 GB of RAM and no swap.
I was wondering whether this operation can be serialized so that I don't have to store everything in memory.
One thing that you can do to speed things up is compile numpy with an optimized BLAS library such as ATLAS, GotoBLAS, or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are:
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size:
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.
The size of your original arrays in MiB will be approximately:
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perform the dot operation in pieces:
load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.
This will not make it faster, but it will make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.
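The piecewise approach could look roughly like the sketch below (the shapes are shrunk and the file name is illustrative; the real (1000000, 100) x (300000, 100) problem would produce the roughly 2.2 TiB file discussed above):

import numpy as np

a = np.random.rand(10000, 100)      # stand-in for the (1000000, 100) array
b = np.random.rand(3000, 100)       # stand-in for the (300000, 100) array
block = 500                         # rows of a processed per iteration

with open('product.bin', 'wb') as fh:
    for start in range(0, a.shape[0], block):
        chunk = np.dot(a[start:start + block], b.T)   # shape (block, len(b))
        chunk.tofile(fh)            # append the raw float64 bytes

# Read it back later without loading everything into RAM.
c_t = np.memmap('product.bin', dtype=np.float64, mode='r',
                shape=(a.shape[0], b.shape[0]))
top10 = [np.argsort(row)[:10] for row in c_t]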
What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
Having experience with problems like this, I would say it smells like a relatively common problem that is typically solved through some clever trick. In a field like machine learning, where these types of problems are classified as "simple", that is often the case.
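If the data really can be made sparse, here is a minimal sketch (assuming scipy is available; the density values are made up) of why the product then stays cheap: memory and work scale with the number of non-zero entries rather than with the full 300000 x 1000000 shape.

import scipy.sparse as sp

# Sparse stand-ins for the two input matrices.
a = sp.random(1000000, 100, density=0.0001, format='csr')
b = sp.random(300000, 100, density=0.0001, format='csr')

c = b.dot(a.T)                      # the result is itself a sparse matrix
print(c.shape, c.nnz)               # only the non-zeros are stored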
You have a problem in any case. As Roland Smith shows in his answer, the amount of data and the number of calculations are enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.
On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way; transposing the input arrays is cheap, transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you can access elements close to each other, expensive, if you access elements far away from each other.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code on your own machine but on a cloud platform, things change a lot. Usually cloud SSD storage is very fast for random access, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are ultimately only looking at the first few items of each column. If that is the case, then you should only calculate those first few items and not the full output matrix. That way you can do everything in memory.
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
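For example (a hedged sketch with an arbitrary block size): if only the ten smallest dot products per column are needed, as the argsort(j)[:10] in the question suggests, the full matrix never has to exist; each block of columns is computed, reduced to ten indices, and discarded.

import numpy as np

a = np.random.rand(1000000, 100)
b = np.random.rand(300000, 100)
block = 100                         # columns of c (rows of a) per iteration

top10 = np.empty((a.shape[0], 10), dtype=np.intp)
for start in range(0, a.shape[0], block):
    part = np.dot(a[start:start + block], b.T)        # shape (block, 300000)
    # np.argpartition would be faster if the order among the ten does not matter.
    top10[start:start + block] = np.argsort(part, axis=1)[:, :10]
    del part                        # only one block is ever held in memory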
Oh, one important thing: are your matrices dense or sparse? (Sparse means they mostly contain 0's.) If you expect your output matrix to be mostly zero, that may change the game completely.

pyfftw release references to arrays without destroying plan

I have a large set of large arrays that need to be Fourier transformed one after another, repeatedly, and they do not all fit in memory at the same time. The typical array size is (350, 250000), but it varies quite a bit. The general procedure is:
while True:
    for data in data_set:
        array = generate_array(data)
        fft(array, farray)
        do_something_with_farray()
        ifft(farray, array)
        do_something_with_array()
This needs to be fast, so ideally I would make plans for all the arrays beforehand and reuse them in the loop. This is especially important because even constructing a plan with FFTW_ESTIMATE is too slow to do inside the loop (10x+ slower than just executing the plan, when constructing it as pyfftw.FFTW(array, farray, flags=['FFTW_ESTIMATE', 'FFTW_DESTROY_INPUT'], threads=nthread, axes=[-1])). However, each plan holds a reference to the arrays that were used when constructing it, which means that keeping all the plans in memory also means keeping all the arrays in memory, which I can't afford.
Is it possible to make pyfftw release the references it holds to the arrays? After all, I am planning to repoint them to fully compatible new arrays inside the loop anyway. If not, is there some other way of getting around this problem? I guess I could make plans for single rows, or for chunks of rows, but that could easily lead to slowdowns.
PS. I use FFTW_ESTIMATE rather than FFTW_MEASURE despite planning to reuse the plan many times because FFTW_MEASURE takes forever for these array sizes, and when I specify a time limit, performance is no better than with FFTW_ESTIMATE.
Edit: Actually, the slowness of constructing the plan only happens the first time I construct a plan of that shape (due to wisdom, I guess), so the approach of not storing the plans works after all. Still, if it is possible to store plans without the array references, that would be nice to know about.
FFTW plans are by their nature tied to a piece of memory. However, there is nothing to stop you from using the same piece of memory for all your plans. So you could create a single array that is big enough for all your possible arrays and then create your FFTW objects on views into that array.
You can then execute the FFT using the FFTW.__call__() interface that allows the arrays to be updated prior to execution (with little overhead when they agree with the original array in strides and alignment).
Now, the FFTW object will have the new arrays as its internal arrays. If you want to revert back to the other memory, you can use FFTW.update_arrays().
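A rough sketch of that approach (the buffer size, dtype, and thread count are illustrative, not taken from the question): one pair of aligned buffers, sized for the largest array, backs every plan, so keeping many plans alive costs only one buffer's worth of memory.

import numpy as np
import pyfftw

# One pair of aligned buffers sized for the largest array; every plan's
# input and output are views into them.
max_elems = 350 * 250000
buf_in = pyfftw.empty_aligned(max_elems, dtype='complex128')
buf_out = pyfftw.empty_aligned(max_elems, dtype='complex128')

plans = {}

def get_plan(shape):
    # Build (or reuse) a plan whose arrays are views into the shared buffers.
    if shape not in plans:
        n = int(np.prod(shape))
        plans[shape] = pyfftw.FFTW(buf_in[:n].reshape(shape),
                                   buf_out[:n].reshape(shape),
                                   axes=[-1],
                                   flags=['FFTW_ESTIMATE', 'FFTW_DESTROY_INPUT'],
                                   threads=4)
    return plans[shape]

# Usage: copy each generated array into the plan's input view and execute.
plan = get_plan((350, 250000))
plan.input_array[:] = 1.0           # fill with the real data here
plan()                              # the result appears in plan.output_array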

Python - Best data structure for incredibly large matrix

I need to create about 2 million vectors w/ 1000 slots in each (each slot merely contains an integer).
What would be the best data structure for working with this amount of data? It could be that I'm over-estimating the amount of processing/memory involved.
I need to iterate over a collection of files (about 34.5 GB in total) and update the vectors each time one of the 2 million items (each corresponding to a vector) is encountered on a line.
I could easily write code for this, but I know it wouldn't be optimal enough to handle the volume of the data, which is why I'm asking you experts. :)
Best,
Georgina
You might be memory bound on your machine. Without cleaning up running programs:
a = numpy.zeros((1000000,1000),dtype=int)
wouldn't fit into memory. But in general if you could break the problem up such that you don't need the entire array in memory at once, or you can use a sparse representation, I would go with numpy (scipy for the sparse representation).
Also, you could think about storing the data on disk in HDF5 with h5py or PyTables, or in netCDF4 with netcdf4-python, and then accessing only the portions you need.
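For the on-disk route, a small sketch (assuming h5py is installed; the file name, chunk shape, and parsed values are made up) that keeps the whole matrix in an HDF5 file and touches only one row per update:

import h5py

with h5py.File('vectors.h5', 'w') as f:
    vecs = f.create_dataset('vectors', shape=(2000000, 1000),
                            dtype='int32', chunks=(1000, 1000))

    # While scanning the 34.5 GB of input files:
    item_id, slot = 42, 7           # hypothetical values parsed from a line
    row = vecs[item_id, :]          # read just this row into memory
    row[slot] += 1
    vecs[item_id, :] = row          # write it back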
Use a sparse matrix assuming most entries are 0.
If you need to work in RAM, try the scipy.sparse matrix variants; the module includes algorithms for manipulating sparse matrices efficiently.
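A minimal sketch of the sparse route (assuming scipy, and that most slots stay zero; the parsed values are hypothetical): the LIL format is cheap to update incrementally, and one conversion at the end gives a format that is fast to slice and do arithmetic on.

import scipy.sparse as sp

counts = sp.lil_matrix((2000000, 1000), dtype='int32')

# While scanning the input files:
item_id, slot = 42, 7               # hypothetical values parsed from a line
counts[item_id, slot] += 1

counts = counts.tocsr()             # convert once the pass is finished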
