theano CPU running out of memory: what is wrong? - python

I ran a simple network with Theano on a server and got an out-of-memory error, but I am not sure what the reason is. I am asking because it seems unlikely to be simply that I am using too much memory.
Here is my reasoning:
First, according to this post, the problems caused by the lack of virtual-memory support only arise when running on the GPU, but I am running on the CPU, so that should be fine.
Second, I built a network where the first layer is a 100k-by-10 matrix and the second layer is 10-by-1, so the model itself is only about 1M numbers. So far I have only tried 1000 data points at a time, so even if the machine loads all the data and initializes all the layers at once, that is at most around 110M floating-point numbers. I used float32 on a 64-bit machine. According to this post, each number takes at most 60 bytes, so the whole initialization should take on the order of 6-7 GB of memory. Even allowing for the various other things that take up memory, I don't understand why it cannot run on a server with 128 GB of RAM.
Can someone suggest what I should look into?
Just in case someone asks for code, here it is.

What size are your minibatches? You need to remember that the activations take space in memory too.
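To put rough numbers on both the question's estimate and this comment, here is a back-of-the-envelope sketch; the matrix sizes and the 60-bytes-per-number figure come from the question, while the batch size and the activation count are assumptions.

# Back-of-the-envelope memory estimate for the network in the question
# (a sketch, not a measurement; batch_size matches the 1000 data points).
n_features = 100_000          # first layer: 100k x 10
n_hidden = 10                 # second layer: 10 x 1
batch_size = 1_000            # data points loaded at once

bytes_float32 = 4             # float32 stored inside a NumPy/Theano array
bytes_pessimistic = 60        # per-number upper bound quoted in the question

n_weights = n_features * n_hidden + n_hidden * 1     # ~1M parameters
n_inputs = batch_size * n_features                   # 100M input values
n_activations = batch_size * (n_hidden + 1)          # hidden + output activations

total = n_weights + n_inputs + n_activations
print(f"{total / 1e6:.0f}M numbers in total")
print(f"as float32 arrays : {total * bytes_float32 / 1e9:.2f} GB")
print(f"at 60 bytes each  : {total * bytes_pessimistic / 1e9:.2f} GB")

Even under the pessimistic 60-byte figure this is a few gigabytes, which supports the question's point that the raw weights, inputs, and activations alone should not exhaust a 128 GB server.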

Transpose large numpy matrix on disk

I have a rather large rectangular (>1G rows, 1K columns) Fortran-style NumPy matrix, which I want to transpose to C-style.
So far my approach has been relatively straightforward: the following Rust snippet works on memory-mapped slices of the source and destination matrices, where both original_matrix and target_matrix are memory-mapped PyArray2s, with Rayon handling the parallelization.
Since target_matrix has to be modified by multiple threads, I wrap it in an UnsafeCell.
let shared_target_matrix = std::cell::UnsafeCell::new(target_matrix);

// Walk the source in its native order and scatter every value into its
// transposed position in the destination; the writes are disjoint, hence
// the unchecked access through the UnsafeCell.
original_matrix.as_ref().par_chunks(number_of_nodes).enumerate().for_each(|(j, feature)| {
    feature.iter().copied().enumerate().for_each(|(i, feature_value)| unsafe {
        *(*shared_target_matrix.get()).uget_mut([i, j]) = feature_value;
    });
});
This approach transposes a matrix of shape (~1G, 100), about 120 GB, in roughly 3 hours on an HDD. Transposing a (~1G, 1000), ~1200 GB matrix does not scale linearly to 30 hours, as one might naively expect, but explodes to several weeks. As it stands, I have managed to transpose roughly 100 features in 2 days, and it keeps slowing down.
There are several aspects my solution currently ignores, such as the file system in use, HDD fragmentation, and how mmap handles page loading.
Are there known, more holistic solutions that take these issues into account?
Note on sequential and parallel approaches
While one might intuitively expect this sort of operation to be limited only by IO and therefore not to benefit from parallelization, we have observed experimentally that the parallel approach is indeed around three times faster (on a machine with 12 cores and 24 threads) than a sequential approach when transposing a matrix of shape (1G, 100). We are not sure why this is the case.
Note on using two HDDs
We also experimented with using two devices, one holding the Fortran-style source matrix and a second one to which we write the target matrix. Both HDDs were connected directly to the motherboard via SATA. We expected at least a doubling of performance, but it remained unchanged.
While one might intuitively expect this sort of operation to be limited only by IO and therefore not to benefit from parallelization, we have observed experimentally that the parallel approach is indeed around three times faster
This may be due to poor IO queue utilization. With an entirely sequential workload and no prefetching, you'll be alternating the device between working and idle. If you keep multiple operations in flight, it'll be working all the time.
Check with iostat -x <interval>
But parallelism is a suboptimal way to achieve the best utilization of an HDD, because it'll likely cause more head seeks than necessary.
We also experimented with using two devices, one holding the Fortran-style source matrix and a second one to which we write the target matrix. Both HDDs were connected directly to the motherboard via SATA. We expected at least a doubling of performance, but it remained unchanged.
This may be due to the operating system's write cache, which means it can batch writes very efficiently, so you're mostly bottlenecked on reads. Again, check with iostat.
There are several aspects my solution currently ignores, such as the file system in use, HDD fragmentation, and how mmap handles page loading.
Are there known, more holistic solutions that take these issues into account?
Yes. If the underlying filesystem supports it, you can use FIEMAP to get the physical layout of the data on disk and then optimize your read order to follow the physical layout rather than the logical one. You can use the filefrag CLI tool to inspect the fragmentation data manually, but there are Rust bindings for that ioctl, so you can also use it programmatically.
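To make the idea concrete, here is a rough sketch in Python (for brevity; the same thing is doable through the Rust ioctl bindings). It shells out to filefrag -v and orders the file's extents by physical offset. The parsing is an assumption, since the exact filefrag -v output format differs between e2fsprogs versions, and original_matrix.npy is a placeholder path.

# Sketch: derive a physical-order read plan from `filefrag -v` output.
# Assumptions: Linux with e2fsprogs installed, and extent lines of the form
#   "   0:        0..      63:      34816..     34879:     64:"
# (logical blocks, physical blocks, length); adjust the parsing to your version.
import re
import subprocess

def physical_read_order(path):
    out = subprocess.run(["filefrag", "-v", path],
                         capture_output=True, text=True, check=True).stdout
    extents = []
    for line in out.splitlines():
        m = re.match(r"\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+)\.\.", line)
        if m:
            logical_start, logical_end, physical_start = map(int, m.groups())
            extents.append((physical_start, logical_start, logical_end))
    # Visit extents in the order they sit on the platter, not in file order.
    return [(lo, hi) for _, lo, hi in sorted(extents)]

if __name__ == "__main__":
    # "original_matrix.npy" is a placeholder for the memory-mapped source file.
    for logical_block_range in physical_read_order("original_matrix.npy"):
        print(logical_block_range)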
Additionally, you can use madvise(MADV_WILLNEED) to ask the kernel to prefetch data in the background for the next few loop iterations. For HDDs this should ideally be done in batches of a few megabytes at a time, and the next batch should be issued when you're halfway through the current one.
Issuing the prefetches in batches minimizes syscall overhead, and starting the next batch halfway through the current one leaves enough time for the IO to actually complete before you reach the end of the current batch.
And since you'll be issuing prefetches manually in physical instead of logical order, you can also disable the default readahead heuristics (which would only get in the way) via madvise(MADV_RANDOM).
If you have enough free disk space, you could also try a simpler approach: defragmenting the file before operating on it. But even then you should still use madvise to ensure there are always IO requests in flight.
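For the prefetching side, here is a minimal sketch of the batched MADV_WILLNEED pattern described above, using Python's mmap module (madvise is exposed there on Python 3.8+ on Linux). The file name and BATCH_BYTES are placeholders, and the per-page access merely stands in for the real transpose work.

# Sketch: batched MADV_WILLNEED prefetching over a memory-mapped file.
# Assumptions: Linux, Python 3.8+ (mmap.madvise); "original_matrix.bin" and
# BATCH_BYTES are placeholders, and the loop body stands in for the real work.
import mmap

BATCH_BYTES = 8 * 1024 * 1024            # prefetch a few megabytes at a time
PAGE = mmap.PAGESIZE

with open("original_matrix.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # We issue our own prefetches, so switch off the kernel's readahead heuristics.
    mm.madvise(mmap.MADV_RANDOM)

    size = mm.size()
    batch_starts = list(range(0, size, BATCH_BYTES))
    mm.madvise(mmap.MADV_WILLNEED, 0, min(BATCH_BYTES, size))   # warm the first batch

    for i, start in enumerate(batch_starts):
        length = min(BATCH_BYTES, size - start)
        halfway = start + length // 2
        next_issued = False
        for offset in range(start, start + length, PAGE):
            # Halfway through the current batch, ask the kernel to start
            # fetching the next batch in the background.
            if not next_issued and offset >= halfway and i + 1 < len(batch_starts):
                nxt = batch_starts[i + 1]
                mm.madvise(mmap.MADV_WILLNEED, nxt, min(BATCH_BYTES, size - nxt))
                next_issued = True
            _ = mm[offset]               # ... do the real per-page work here ...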

Python Memory Overflow

Hello my fellow programmers,
I am writing a program which reads 90 GB of images into a Python list, but my hardware only has 8 GB of RAM, so the program gets stuck. I was wondering whether a Python list can handle this problem itself by spilling to the hard disk or something like that. Otherwise, how could I solve this problem without upgrading the RAM to 128 GB?
EDIT: I need to have all images in one list at one time.
BACKGROUND INFORMATION: I am building a neural network which colors black-and-white images.
Is it absolutely necessary to have all of the images in memory at the same time? You could either process the images in batches or adjust the pipeline to use one image at a time.
You can also use a swap partition to supply additional memory for your process.
According to the documentation there's an argument batch_size which should help you:
batch_size (int, optional) – how many samples per batch to load
(default: 1).
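The quoted parameter is from torch.utils.data.DataLoader, so here is a minimal sketch of the usual pattern, assuming PyTorch, torchvision, and Pillow are available and that images/ is a placeholder folder: keep only file paths in the list and let the Dataset decode each image on demand, so only one batch of images is ever resident.

# Sketch: keep only file paths in memory and decode images on demand, so only
# `batch_size` images are resident at a time. Assumes PyTorch, torchvision and
# Pillow; "images/" and the 256x256 resize are placeholders.
from pathlib import Path

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class LazyImageDataset(Dataset):
    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))    # paths only, not pixels
        self.transform = transforms.Compose([
            transforms.Resize((256, 256)),               # so samples stack into a batch
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The image file is opened and decoded only when this sample is requested.
        return self.transform(Image.open(self.paths[idx]).convert("RGB"))

loader = DataLoader(LazyImageDataset("images/"), batch_size=32, num_workers=4)

for batch in loader:     # each batch holds just 32 decoded images
    ...                  # train the colorization network here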

Tensorflow - Profiling using timeline - Understand what is limiting the system

I am trying to understand why each training iteration takes approximately 1.5 seconds.
I used the tracing method described here. I am working on a Titan X Pascal GPU. My results look very strange: every operation seems relatively fast, and the system is idle most of the time between operations. How can I tell from this what is limiting the system?
It does seem, however, that when I drastically reduce the batch size the gaps close, as can be seen here.
Unfortunately the code is very complicated, and I can't post a small version of it that has the same problem.
Is there a way to understand from the profiler what is filling the gaps between operations?
Thanks!
EDIT:
On CPU only I do not see this behavior:
I am running a
Here are a few guesses, but it's hard to say without a self-contained reproduction that I can run and debug.
Is it possible you are running out of GPU memory? One signal of this is if you see log messages of the form Allocator ... ran out of memory during training. If you run out of GPU memory, then the allocator backs off and waits in the hope more becomes available. This might explain the large inter-operator gaps that go away if you reduce the batch size.
As Yaroslav suggests in a comment above, what happens if you run the model on CPU only? What does the timeline look like?
Is this a distributed training job or a single-machine job? If it's a distributed job, does a single-machine version show the same behavior?
Are you calling session.run() or eval() many times, or just once per training step? Every run() or eval() call will drain the GPU pipeline, so for efficiency you usually need to express your computation as one big graph with only a single run() call. (I doubt this is your problem, but I mention it for completeness.)
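For completeness, here is a minimal sketch of the timeline capture being discussed, using the TensorFlow 1.x graph-mode API with a toy graph standing in for the real model; it also shows the "single run() call per step" shape mentioned above.

# Sketch: capture a chrome-trace timeline for one training step (TensorFlow 1.x
# graph-mode API). The tiny graph below stands in for the real model.
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.Variable(1.0)
loss = tf.square(x - 3.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    # One run() call per step: the whole step executes as a single graph call.
    sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)

    # Write a trace that can be opened in chrome://tracing to inspect the gaps.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as out:
        out.write(trace.generate_chrome_trace_format())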

Python Process using only 1.6 GB RAM Ubuntu 32 bit in Numpy Array

I have a program for training an artificial neural network which takes a 2-D NumPy array as training data. The size of the data array I want to use is around 300,000 x 400 floats. I can't use chunking here because the library I am using (DeepLearningTutorials) takes a single NumPy array as training data.
The code raises MemoryError when the process's RAM usage is around 1.6 GB (I checked this in the system monitor), but I have 8 GB of RAM in total. Also, the system is Ubuntu 12.04, 32-bit.
I checked the answers to other similar questions, but some say there is no such thing as allocating more memory to your Python program, and others are unclear about how to increase the process's memory.
One interesting thing is that I run the same code on a different machine and it can handle a NumPy array of almost 1,500,000 x 400 floats without any problem. The basic configurations are similar, except that the other machine is 64-bit and this one is 32-bit.
Could someone give a theoretical explanation of why there is such a difference, and whether that is the only cause of my problem?
A 32-bit OS can only address around 4 GB of RAM, while a 64-bit OS can take advantage of much more (theoretically 16.8 million terabytes). Since your OS is 32-bit, it can only use 4 GB, so the other 4 GB of your RAM goes unused.
The other, 64-bit machine doesn't have the 4 GB limit, so it can use all of its installed RAM.
These limits come from the fact that a 32-bit machine can only store memory addresses (pointers) of 32 bits, so there are 2^32 different memory locations it can identify, i.e. 2^32 addressable bytes. Similarly, a 64-bit machine can identify 2^64 different locations and therefore address 2^64 bytes.
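A quick back-of-the-envelope sketch of the sizes involved (the dtype is an assumption, since the question only says "floats"):

# Back-of-the-envelope sizes (a sketch; NumPy defaults to float64 unless the
# training code explicitly casts to float32).
GiB = 1024 ** 3

print(f"32-bit virtual address space: {2 ** 32 / GiB:.0f} GiB in total")

for name, dtype_bytes in [("float64", 8), ("float32", 4)]:
    small = 300_000 * 400 * dtype_bytes / GiB       # the failing 32-bit machine
    large = 1_500_000 * 400 * dtype_bytes / GiB     # the working 64-bit machine
    print(f"{name}: 300k x 400 = {small:.2f} GiB, 1.5M x 400 = {large:.2f} GiB")

On top of the 4 GiB ceiling, a 32-bit Linux process usually gets only about 2-3 GiB of usable address space, and a NumPy array needs one contiguous block within it; once the interpreter and libraries have fragmented that space, an allocation of this size can plausibly fail around the 1.6 GB mark even though 8 GB of physical RAM is installed.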

Super Linear Speedup - Python - Cluster - Multiple Processes

I parallelized a program which uses fairly large matrices. The program simulates the Ising model from statistical mechanics. On my laptop everything works fine; even the visualization shows the behaviour I expect. Then I wanted to see how it scales across many CPUs, so I used a cluster computer I have at hand. Well, I get super-linear speedup. At first I thought this wasn't a big deal, since with multiple processes each process's share of the problem gets smaller and might fit into the cache, so the time-consuming copying between cache and RAM no longer slows things down. However, I even get super-linear speedup with one CPU, which I wouldn't expect. If the whole system (matrix) doesn't fit into the cache for the sequential version, then it also shouldn't fit into it for the parallel version with only one CPU, right?
I've done a check on my laptop: averaged over 5 runs, the parallel version using one CPU is a tiny bit slower than the sequential version. I guess this is okay, since the parallel version contains some statements the sequential one doesn't.
Any ideas what this could be about? Is the super-linear speedup reasonable?
Note: I'm programming in Python using NumPy, and for the parallel version, processes and shmarray.
