program freezes on creating a large numpy array - python

I am making a Tkinter-based project where the array size can sometimes get as high as 10^9 or even more (although the chances of going beyond that are quite minimal).
Earlier I used a plain Python list filled with loops, but it took a lot of time once the size reached the order of 10^6 or more, so I switched my approach to NumPy. In most cases that gave far better results, but under the condition mentioned above (size >= 10^9) the program just freezes (sometimes the computer freezes too, leaving no option other than a forced restart), unlike the simple looping approach, which produced a result even for larger sizes, although it took a very long time.
I looked at this question, but it involved terminology like heap memory and stack size, and I know little about those things.
I am not very used to the Stack Overflow platform, so any advice would be appreciated.
Update: I am adding the chunk of code where I tried replacing normal list with numpy one. Commented lines are the ones I used earlier with simple list.
# assumed imports elsewhere in the file:
# from random import randint
# import numpy as np

def generate(self):
    # t is the number of times we need to generate this list
    for i in range(self.t):
        self.n = randint(self.n_min, self.n_max)  # constraints
        # self.a = [0] * self.n
        self.a = np.random.randint(low=self.a_min, high=self.a_max, size=self.n)
        # for j in range(self.n):
        #     self.a[j] = randint(self.a_min, self.a_max)
All these values are then inserted into the output screen of the Tkinter GUI.
Here 'n', i.e. the size of the NumPy array, can sometimes take very high values.
I am dual booting (Windows + Ubuntu) and the issue occurs on Ubuntu, to which I have allocated 500 GB of storage; my laptop's RAM is 8 GB.
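For scale, a rough back-of-the-envelope sketch (not part of the original code) of how much memory such an array needs before it is allocated, assuming NumPy's default int64 dtype on a 64-bit platform:
import numpy as np

def estimated_bytes(n, dtype=np.int64):
    # rough raw-data footprint of a 1-D array of length n
    return n * np.dtype(dtype).itemsize

n = 10 ** 9
print(estimated_bytes(n) / 1024 ** 3, "GiB")  # ~7.45 GiB, essentially all of an 8 GB machine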

Most likely you ran out of memory: a 1e9-element float64 (or int64) array in NumPy is 8 GB on its own. Also, if you naively loop over that array with something like:
for item in big_numpy_array:
    do_something(item)
that is going to take forever. Avoid doing that, and use NumPy's vectorized operations when possible.
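For example, a quick sketch of the difference (the element-wise operation here is made up purely for illustration):
import numpy as np

a = np.random.randint(0, 100, size=10 ** 6)

# slow: a Python-level loop touching every element
squares_loop = np.empty_like(a)
for i, item in enumerate(a):
    squares_loop[i] = item * item

# fast: one vectorized operation executed in C
squares_vec = a * a

assert np.array_equal(squares_loop, squares_vec)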

Related

Why does naive string concatenation become quadratic above a certain length?

Building a string through repeated string concatenation is an anti-pattern, but I'm still curious why its performance switches from linear to quadratic after string length exceeds approximately 10 ** 6:
# this will take time linear in n with the optimization
# and quadratic time without the optimization
import time

n = 10 ** 6  # length being tested; the measurements below vary this

start = time.perf_counter()
s = ''
for i in range(n):
    s += 'a'
total_time = time.perf_counter() - start
time_per_iteration = total_time / n
For example, on my machine (Windows 10, python 3.6.1):
for 10 ** 4 < n < 10 ** 6, the time_per_iteration is almost perfectly constant at 170±10 µs
for 10 ** 6 < n, the time_per_iteration is almost perfectly linear, reaching 520 µs at n == 10 ** 7.
Linear growth in time_per_iteration is equivalent to quadratic growth in total_time.
The linear complexity results from the optimization in the more recent CPython versions (2.4+) that reuse the original storage if no references remain to the original object. But I expected the linear performance to continue indefinitely rather than switch to quadratic at some point.
My question is based on this comment. For some odd reason, running
python -m timeit -s"s=''" "for i in range(10**7):s+='a'"
takes an incredibly long time (much longer than quadratic would suggest), so I never got the actual timing results from timeit. So instead, I used a simple loop as above to obtain performance numbers.
Update:
My question might as well have been titled "How can a list-like append have O(1) performance without over-allocation?". From observing constant time_per_iteration on small strings, I assumed the string optimization must be over-allocating. But realloc is (unexpectedly to me) quite successful at avoiding memory copies when extending small memory blocks.
In the end, the platform C allocators (like malloc()) are the ultimate source of memory. When CPython tries to reallocate string space to extend its size, it's really the system C realloc() that determines the details of what happens. If the string is "short" to begin with, chances are decent the system allocator finds unused memory adjacent to it, so extending the size is just a matter of the C library's allocator updating some pointers. But after repeating this some number of times (depending again on details of the platform C allocator), it will run out of space. At that point, realloc() will need to copy the entire string so far to a brand new larger block of free memory. That's the source of quadratic-time behavior.
Note, e.g., that growing a Python list faces the same tradeoffs. However, lists are designed to be grown, so CPython deliberately asks for more memory than is actually needed at the time. The amount of this overallocation scales up as the list grows, enough to make it rare that realloc() needs to copy the whole list-so-far. But the string optimizations do not overallocate, making cases where realloc() needs to copy far more frequent.
[XXXXXXXXXXXXXXXXXX............]
 \________________/\__________/
     used space      reserved
                       space
When growing a contiguous array data structure (illustrated above) through appending to it, linear performance can be achieved if the extra space reserved while reallocating the array is proportional to the current size of the array. Obviously, for large strings this strategy is not followed, most probably with the purpose of not wasting too much memory. Instead a fixed amount of extra space is reserved during each reallocation, resulting in quadratic time complexity. To understand where the quadratic performance comes from in the latter case, imagine that no overallocation is performed at all (which is the boundary case of that strategy). Then at each iteration a reallocation (requiring linear time) must be performed, and the full runtime is quadratic.
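To make the difference concrete, here is a small, simplified simulation that only counts how many element copies a sequence of n appends would trigger under each reservation policy (real allocators are of course more complicated than this):
def count_copies(n, grow):
    # count element copies for n appends, given a capacity-growth policy
    capacity, size, copied = 0, 0, 0
    for _ in range(n):
        if size == capacity:
            copied += size              # reallocating copies the current contents
            capacity = grow(size)
        size += 1
    return copied

n = 10 ** 5
print(count_copies(n, lambda s: s + 64))         # fixed extra space: ~n**2/128 copies
print(count_copies(n, lambda s: max(1, 2 * s)))  # proportional extra space: O(n) copies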
TL;DR: Just because string concatenation is optimized under certain circumstances doesn't mean it's necessarily O(1); it's just not always O(n). What determines the performance is ultimately your system, and it could be smart (beware!). Lists that "guarantee" amortized O(1) append operations are still much faster and better at avoiding reallocations.
This is an extremely complicated problem, because it's hard to measure quantitatively. If you read the announcement:
String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.
If you take a closer look, you'll notice that it mentions "certain circumstances". The tricky thing is finding out what those certain circumstances are. One is immediately obvious:
If nothing else holds a reference to the original string.
Otherwise it wouldn't be safe to change s.
But another condition is:
If the system can do the reallocation in O(1) - that means without needing to copy the contents of the string to a new location.
That's where it gets tricky, because the system is responsible for doing the reallocation; that's nothing you can control from within Python. However, your system is smart. That means in many cases it can actually do the reallocation without needing to copy the contents. You might want to take a look at Tim Peters' answer, which explains some of this in more detail.
I'll approach this problem from an experimentalist's point of view.
You can easily check how many reallocations actually need a copy by checking how often the ID changes (because the id function in CPython returns the memory address):
changes = []
s = ''
changes.append((0, id(s)))
for i in range(10000):
    s += 'a'
    if id(s) != changes[-1][1]:
        changes.append((len(s), id(s)))
print(len(changes))
This gives a different number each run (or almost each run). It's somewhere around 500 on my computer. Even for range(10000000) it's just 5000 on my computer.
But if you think that's really good at "avoiding" copies you're wrong. If you compare it to the number of resizes a list needs (lists over-allocate intentionally so append is amortized O(1)):
import sys

changes = []
s = []
changes.append((0, sys.getsizeof(s)))
for i in range(10000000):
    s.append(1)
    if sys.getsizeof(s) != changes[-1][1]:
        changes.append((len(s), sys.getsizeof(s)))
len(changes)
That only needs 105 reallocations (always).
I mentioned that realloc could be smart, and I intentionally kept the sizes at which the reallocations happened in a list. Many C allocators try to avoid memory fragmentation, and at least on my computer the allocator does something different depending on the current size:
# changes is the one from the 10 million character run
# (the %matplotlib magic requires IPython / Jupyter)
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)
# ax.plot(sizes, num_changes, label='str')
ax.scatter(np.arange(len(changes) - 1),
           np.diff([i[0] for i in changes]),  # plotting the difference!
           s=5, c='red',
           label='measured')
ax.plot(np.arange(len(changes) - 1),
        [8] * (len(changes) - 1),
        ls='dashed', c='black',
        label='8 bytes')
ax.plot(np.arange(len(changes) - 1),
        [4096] * (len(changes) - 1),
        ls='dotted', c='black',
        label='4096 bytes')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('x-th copy')
ax.set_ylabel('characters added before a copy is needed')
ax.legend()
plt.tight_layout()
Note that the x-axis represents the number of "copies done" not the size of the string!
That graph was actually very interesting for me, because it shows clear patterns: for small strings (up to 465 characters) the steps are constant; it needs to reallocate for every 8 characters added. Then it needs to allocate a brand-new array for every single character added, and then at roughly 940 all bets are off until (roughly) one million characters. After that it seems to allocate in blocks of 4096 bytes.
My guess is that the C allocator uses different allocation schemes for differently sized objects. Small objects are allocated in blocks of 8 bytes; then for bigger-than-small-but-still-small strings it stops over-allocating; and then for medium-sized strings it probably places them wherever they "may fit". For huge (comparatively speaking) strings it allocates in blocks of 4096 bytes.
I guess the 8 bytes and 4096 bytes aren't random. 8 bytes is the size of an int64 (or float64, a.k.a. double) and I'm on a 64-bit computer with Python compiled for 64 bits. And 4096 bytes is the page size of my computer. I assume there are lots of "objects" that need to have these sizes, so it makes sense that the allocator uses them, because that could avoid memory fragmentation.
You probably know but just to make sure: For O(1) (amortized) append behaviour the overallocation must depend on the size. If the overallocation is constant it will be O(n**2) (the greater the overallocation the smaller the constant factor but it's still quadratic).
So on my computer the runtime behaviour will always be O(n**2), except for lengths of (roughly) 1,000 to 1,000,000 - there it really seems to be undefined. In my test run it was able to add many (tens of) thousands of characters without ever needing a copy, so it would probably "look like O(1)" when timed.
Note that this is just my system. It could look totally different on another computer or even with another compiler on my computer. Don't take these too seriously. I provided the code to do the plots yourself, so you can analyze your system yourself.
You also asked (in the comments) whether there would be downsides to over-allocating strings. That's really easy: strings are immutable, so any over-allocated byte would waste resources. There are only a few limited cases where a string really does grow in place, and those are considered implementation details. The developers probably won't throw away space just to make implementation details perform better, and some Python developers also think that adding this optimization was a bad idea.

How do I work with large, memory hungry numpy array?

I have a program which creates an array:
from numpy import zeros, complex_  # imports implied by the snippet

List1 = zeros((x, y), dtype=complex_)
Currently I am using x = 500 and y = 1000000.
I will initialize the first column of the list by some formula. Then the subsequent columns will calculate their own values based on the preceding column.
After the list is completely filled, I will then display this multidimensional array using imshow().
The size of each value (item) in the list is 24 bytes.
A sample value from the code is: 4.63829355451e-32
When I run the code with y = 10000000, it takes up too much RAM and the system stops the run. How do I solve this problem? Is there a way to save my RAM while still being able to process the list using imshow() easily? Also, how large a list can imshow() display?
There's no way to solve this problem (in any general way).
Computers (as commonly understood) have a limited amount of RAM, and they require elements to be in RAM in order to operate on them.
A complex128 array of size 10000000x500 would require around 74 GiB to store. You'll need to somehow reduce the amount of data you're processing if you hope to use a regular computer to do it (as opposed to a supercomputer).
A common technique is partitioning your data and processing each partition individually (possibly on multiple computers). Depending on the problem you're trying to solve, there may be special data structures that you can use to reduce the amount of memory needed to represent the data - a good example is a sparse matrix.
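As a rough illustration (assuming SciPy is available and that the data really is mostly zero; the non-zero pattern below is made up), a sparse matrix can be built without ever allocating the dense version:
import numpy as np
from scipy import sparse

# A 500 x 1,000,000 complex matrix with 10 non-zeros per row; the dense
# equivalent would need 500 * 1_000_000 * 16 bytes, i.e. about 8 GB.
rows = np.repeat(np.arange(500), 10)
cols = np.tile(np.arange(0, 1_000_000, 100_000), 500)
vals = np.full(rows.size, 1 + 1j, dtype=np.complex128)

sp = sparse.csr_matrix((vals, (rows, cols)), shape=(500, 1_000_000))
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # on the order of 100 KiB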
It's very unusual to need this much memory, so make sure to carefully consider whether it's actually needed before you delve into the extremely complex workarounds.

How can I rewrite this Python operation so it doesn't hang my system?

Beginner here, looked for an answer, but can't find one.
I know (or rather suspect) that part of the problem with the following code is how big the list of combinations gets.
(Maybe, too, the last line is part of the problem, in that if I just run 'print ...' rather than 'comb += ...' it runs quickly and quits. Would 'append' be more graceful?)
I'm not 100% sure if the system hang is due to disk I/O (swapping?), CPU use, or memory... running it under Windows seems to result in a rather large disk I/O by 'System', while under Linux, top was showing high CPU and memory use before it was killed. In both cases though, the rest of the system was unusable while this operation was going (tried it in the Python interpreter directly, as well as in PyCharm).
So two part question: 1) is there some 'safe' way to test code like this that won't affect the rest of the system negatively, and 2) for this specific example, how should I rewrite it?
Trying this code (which I do not recommend!):
from itertools import combinations_with_replacement as cwr

comb = []
iterable = [1, 2, 3, 4]
for x in xrange(4, 100):
    comb += cwr(iterable, x)
Thanks!
EDIT: Should have specified, but it is python2.7 code here as well (guess the xrange makes it obvious it's not 3 anyways). The Windows machine that's hanging has 4 GB of RAM, but it looks like the hang is on disk I/O. The original problem I was (and still am) working on was a question at codewars.com, about how many ways to make change given a list of possible coins and an amount to make. The solution I'd come up with worked for small amounts, and not big ones. Obviously, I need to come up with a better algorithm to solve that problem... so this is non-essential code, certainly. However, I would like to know if there's something I can do to set the programming environment so that bugs in my code don't propagate and choke my system this way.
FURTHER EDIT:
I was working on the problem again tonight, and realized that I didn't need to append to a master list (as some of you hinted to me in the comments), but just work on the subset that was collected. I hadn't really given enough of the code to make that obvious, but my key problem here was the line:
comb += cwr(iterable, x)
which should have been
comb = cwr(iterable, x)
Since you are trying to compute combinations with replacement, the number of orderings that must be considered will be 4^n (4 because your iterable has 4 items).
More generally speaking, the number of orderings to be computed is the number of elements that can occupy any spot in the list, raised to the power of the list's length.
You are trying to compute 4^n for n between 4 and 99. 4^99 is 4.01734511064748 * 10^59.
I'm afraid not even a quantum computer would be much help computing that.
This isn't a very powerful laptop (3.7 GiB RAM, Intel® Celeron® CPU N2820 @ 2.13GHz × 2, 64-bit Ubuntu), but it did it in 15 s or so (though it did slow noticeably; top showed 100% CPU (dual core) and 35% memory). It took about 15 s to release the memory when it finished.
len(comb) was 4,421,240
I had to change your code to
from itertools import combinations_with_replacement as cwr

comb = []
iterable = [1, 2, 3, 4]
for x in xrange(4, 100):
    comb.extend(list(cwr(iterable, x)))
EDIT - just re-tried as per your original and it does run OK; my mistake. It looks as though it is the memory requirement. If you really need to do this, you could write the results to a file.
Re-EDIT - Being curious about the back-of-an-envelope complexity calculation above not squaring with my experience, I tried plotting n (x axis) against the length of the list returned by combinations_with_replacement() (y axis) for iterable lengths i = 2, 3, 4, 5. The result seems to stay below n**(i-1), which ties in with the figure I got for (4, 99) above. It's actually (i+n-1)! / n! / (i-1)!, which approximates to n**(i-1) / (i-1)! for n much bigger than i.
Also, when running the plot I didn't keep the full comb list in memory, and that improved computer performance quite a bit, so maybe that's the relevant point: rather than producing a giant list and then working on it afterwards, do the calculations inside the loop, as in the sketch below.
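A minimal sketch of that idea (the do_something function is a hypothetical stand-in for whatever per-combination work is actually needed); the combinations are consumed lazily, one tuple at a time, instead of being accumulated:
from itertools import combinations_with_replacement as cwr

def do_something(combo):
    # placeholder for the real per-combination work
    return sum(combo)

iterable = [1, 2, 3, 4]
total = 0
for x in range(4, 100):             # xrange in the original Python 2 code
    for combo in cwr(iterable, x):  # lazy: only one tuple in memory at a time
        total += do_something(combo)
print(total)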

Python NUMPY HUGE Matrices multiplication

I need to multiply two big matrices and sort their columns.
import numpy

a = numpy.random.rand(1000000, 100)
b = numpy.random.rand(300000, 100)
c = numpy.dot(b, a.T)
# note: 'sorted' shadows the builtin, and argsort needs the numpy prefix here
sorted = [numpy.argsort(j)[:10] for j in c.T]
This process takes a lot of time and memory. Is there a way to speed it up? If not, how can I calculate the RAM needed for this operation? I currently have an EC2 box with 4 GB of RAM and no swap.
I was wondering whether this operation can be serialized so that I don't have to store everything in memory.
One thing that you can do to speed things up is compile NumPy with an optimized BLAS library such as ATLAS, GotoBLAS, or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are:
After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB
Calculating the size:
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well; the difference is internal overhead.
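As a quick cross-check, NumPy can also report the raw data size of an array directly (a small sketch, not part of the measurements above):
import numpy as np

a = np.random.rand(1000, 1000)
print(a.nbytes)           # 8000000 bytes: 1000 * 1000 * 8
print(a.dtype.itemsize)   # 8 bytes per float64 element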
The sizes of your original arrays in MiB will be approximately:
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perform the dot operation in pieces:
load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.
This will not make it faster, but it will make it use less memory!
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.
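A rough sketch of that piecewise scheme (array shapes shrunk so it actually runs on a small machine; the chunk size and file name are arbitrary):
import numpy as np

# shrunken stand-ins: the real arrays are (1000000, 100) and (300000, 100)
a = np.random.rand(2000, 100)
b = np.random.rand(1000, 100)

chunk = 500                          # rows of 'a' processed per step
with open("dot_columns.bin", "wb") as f:
    for start in range(0, a.shape[0], chunk):
        block = np.dot(a[start:start + chunk], b.T)  # shape (chunk, 1000)
        block.tofile(f)              # each stored row is one column of c = dot(b, a.T)

# later, one column of c at a time can be read back cheaply with numpy.memmap
cols = np.memmap("dot_columns.bin", dtype=np.float64,
                 mode="r", shape=(a.shape[0], b.shape[0]))
top10 = np.argsort(cols[0])[:10]     # indices of the 10 smallest values in column 0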
What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.
Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.
All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.
Having experience with problems like this, it smells like a relatively common problem that is typically solved through some clever trick. In a field like machine learning, where these types of problems are classified as "simple", that's often the case.
You have a problem in any case. As Roland Smith shows you in his answer, the amount of data and the number of calculations are enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.
Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.
On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).
The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.
If you were running it on your own computer, calculating the resulting matrix could be done in the following way:
create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)
This would result in a linear file write with a largish (2.4 TB) file.
This does not require too many lines of code. However, make sure everything is transposed in a suitable way: transposing the input arrays is cheap, but transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you access elements close to each other, and expensive if you access elements far away from each other.
Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.
Now that you are not running the code on your own machine but on a cloud platform, things change a lot. Usually cloud SSD storage is very fast with random accesses, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.
However, a lot depends on your sorting needs. Your code looks as if you are ultimately only looking at the first few items of each column. If that is the case, then you should only calculate those first few items and not the full output matrix. That way you can do everything in memory, as in the sketch below.
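A sketch of that shortcut (shapes shrunk and block size arbitrary): only the 10 smallest entries of each column are kept while the matrix is computed block by block, so the full result never exists in memory:
import numpy as np

# shrunken stand-ins: the real arrays are (1000000, 100) and (300000, 100)
a = np.random.rand(5000, 100)
b = np.random.rand(2000, 100)

chunk = 500
top10 = np.empty((a.shape[0], 10), dtype=np.intp)
for start in range(0, a.shape[0], chunk):
    block = a[start:start + chunk] @ b.T              # (chunk, 2000): these rows are columns of c
    # argpartition finds the 10 smallest per row without a full sort
    idx = np.argpartition(block, 10, axis=1)[:, :10]
    # order those 10 by value so the result matches argsort(...)[:10]
    order = np.argsort(np.take_along_axis(block, idx, axis=1), axis=1)
    top10[start:start + chunk] = np.take_along_axis(idx, order, axis=1)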
Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.
Oh, one important thing: are your matrices dense or sparse? (Sparse means they mostly contain zeros.) If you expect your output matrix to be mostly zero, that may change the game completely.

Python MemoryError when using long lists not occurring on Linux

I've come to work with a rather big simulation code which needs to store up to 189383040 floating point numbers. I know, this is large, but there isn't much that could be done to overcome this, like only looking at a portion of them or processing them one-by-one.
I've written a short script, which reproduces the error so I could quickly test it in different environments:
noSnapshots = 1830
noObjects = 14784

objectsDict = {}
for obj in range(0, noObjects):
    objectsDict[obj] = [[], [], []]
    for snapshot in range(0, noSnapshots):
        objectsDict[obj][0].append([1.232143454, 1.232143454, 1.232143454])
        objectsDict[obj][1].append([1.232143454, 1.232143454, 1.232143454])
        objectsDict[obj][2].append(1.232143454)
It represents the structure of the actual code where some parameters (2 lists of length 3 each and 1 float) have to be stored for each of the 14784 objects at 1830 distinct locations. Obviously the numbers would be different each time for a different object, but in my code I just went for some randomly-typed number.
The thing, which I find not very surprising, is that it fails on Windows 7 Enterprise and Home Premium with a MemoryError. Even if I run the code on a machine with 16 GB of RAM it still fails, even though there's still plenty of memory left on the machine. So the first question would be: why does this happen? I'd like to think that the more RAM I've got, the more things I can store in memory.
I ran the same code on my colleague's Ubuntu 12.04 machine (again with 16 GB of RAM) and it finished with no problem. So another thing I'd like to know is: is there anything I can do to make Windows happy with this code, i.e. to give my Python process more memory on the heap and stack?
Finally: Does anyone have any suggestions as to how to store plenty of data in memory in a manner similar to the one in the example code?
EDIT
After the answer I changed the code to:
import numpy

noSnapshots = 1830
noObjects = int(14784 * 1.3)

objectsDict = {}
for obj in range(0, noObjects):
    objectsDict[obj] = [[], [], []]
    objectsDict[obj][0].append(numpy.random.rand(noSnapshots, 3))
    objectsDict[obj][1].append(numpy.random.rand(noSnapshots, 3))
    objectsDict[obj][2].append(numpy.random.rand(noSnapshots, 1))
and it works despite the larger amount of data, which has to be stored.
In Python, every float is an object on the heap, with its own reference count, etc. For storing this many floats, you really ought to use a dense representation of lists of floats, such as numpy's ndarray.
Also, because you are reusing the same float objects, you are not estimating the memory use correctly. You have lists of references to the same single float object. In a real case (where the floats are different) your memory use would be much higher. You really ought to use ndarray.
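A small illustration of the per-object overhead (element count chosen arbitrarily), comparing a list of distinct Python floats with the equivalent ndarray via sys.getsizeof and ndarray.nbytes:
import sys
import numpy as np

n = 1000000
as_list = [float(i) for i in range(n)]       # n distinct Python float objects
as_array = np.arange(n, dtype=np.float64)    # one contiguous block of raw doubles

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(list_bytes)         # roughly 32 MB: ~24 bytes per float object plus the list's pointers
print(as_array.nbytes)    # exactly 8000000 bytes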
