I'm not sure if this is a duplicate of the other PyPy memory questions, but here I'll provide a concrete example.
from __future__ import division

def mul_inv(a, m):
    """Modular multiplicative inverse, a^-1 mod m. Credit: rosettacode.org"""
    m0 = m
    x0, x1 = 0, 1
    if m == 1: return 1
    while a > 1:
        assert m != 0, "a and m must be coprime"
        q = a // m
        a, m = m, a % m
        x0, x1 = x1 - q * x0, x0
    if x1 < 0: x1 += m0
    return x1

M = 1000000009
L = 10**8

bin2 = [0] * L
bin2[0] = 1
for n in range(L-1):
    bin2[n+1] = (bin2[n] * (4*n + 2) * mul_inv(n+1, M)) % M
    if n % 10**5 == 0: print(n, bin2[n])
print(bin2[:20])
With Python 3.6 the program uses 3-4 GB at most and runs to completion (Armin Rigo's list change doesn't alter this significantly). With PyPy 5.10.0 (Python 2.7.13), the program quickly reaches 8 GB (all the RAM I have) and freezes. Even with gc.collect() calls, it runs out of memory around n = 3.5 * 10^7.
Where is this memory usage coming from? The only large memory usage should be initializing bin2 as a 10^8 int list. Nothing else should be increasing the memory usage, under the assumption that all the local variables in mul_inv are garbage collected.
Oops, it's a bad case of the optimization for lists of integers. The problem is that this starts as a list of ints:
bin2 = [0] * L
This is internally stored as an array of machine-sized ints. It's usually much more compact, even though in this case it doesn't change anything: on CPython the original list is just L pointers to the same object 0.
But the problem is that pretty soon we store a long in the list. At that point, the whole list has to be converted to the generic kind that can store anything. The conversion sees 100 million zeroes, and so it creates 100 million 0 objects. That instantly adds about 3 GB of memory pressure for nothing, on top of the 800 MB for the list itself.
We can check that the problem doesn't occur if we initialize the list like this, so that it really contains 100 million times the same object:
bin2 = [0L] * L # Python 2.x
bin2[0] = 1
That said, in your example you don't need the list to contain 100 million elements in the first place. You can initialize it as:
bin2 = [1]
and use bin2.append(). This lets the program start much more quickly and without any large memory usage near the beginning.
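For concreteness, the loop from the question rewritten that way would look roughly like this (same M, L and mul_inv as above):

bin2 = [1]
for n in range(L - 1):
    # append the next value instead of writing into a preallocated slot
    bin2.append((bin2[n] * (4 * n + 2) * mul_inv(n + 1, M)) % M)
    if n % 10**5 == 0:
        print(n, bin2[n])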
Note that PyPy3 still uses more memory than CPython3.
AFAICT the issue here is that you're assigning longs to the array, and despite your modulo, PyPy doesn't seem to notice that the number still fits into a machine word.
I can think of two ways to fix this:
Pass the value assigned to bin2[n+1] through int().
Use array.array().
The former only affects PyPy2, and results in what appears to be a stable memory footprint of ~800MB on my Mac, whereas the latter appears to stabilise at ~1.4GB regardless of whether I run it in PyPy2 or PyPy3.
I haven't run the program fully to completion, though, so YMMV…
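To make that concrete, here is roughly what both fixes look like (an untested sketch reusing M, L and mul_inv from the question; the 'l' typecode is an assumption, which works because every stored value is below M and so fits a signed 32-bit slot):

import array

# Fix 1: coerce the stored value back to a machine-sized int (helps on PyPy2)
bin2 = [0] * L
bin2[0] = 1
for n in range(L - 1):
    bin2[n + 1] = int((bin2[n] * (4 * n + 2) * mul_inv(n + 1, M)) % M)

# Fix 2: a typed array never stores boxed integer objects at all
bin2 = array.array('l', [0]) * L
bin2[0] = 1
for n in range(L - 1):
    bin2[n + 1] = (bin2[n] * (4 * n + 2) * mul_inv(n + 1, M)) % M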
This is similar to "Why is using True slower than using 1 in Python 3?", but I'm using PyPy3 and not using the sum function.
def sieve_num(n):
    nums = [0] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1
    return [i for i in range(2, n) if nums[i] == 0]

def sieve_bool(n):
    nums = [False] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == False:
            for j in range(i*i, n, i):
                nums[j] = True
    return [i for i in range(2, n) if nums[i] == False]
sieve_num(10**8) takes 2.55 s, but sieve_bool(10**8) takes 4.45 s, which is a noticeable difference.
My suspicion was that [0]*n is somehow smaller than [False]*n and fits into cache better, but sys.getsizeof and vmprof line profiling are unsupported for PyPy. The only info I could get is that <listcomp> for sieve_num took 116 ms (19% of total execution time) while <listcomp> for sieve_bool took 450 ms (40% of total execution time).
Using PyPy 7.3.1 implementing Python 3.6.9 on Intel i7-7700HQ with 24 GB RAM on Ubuntu 20.04. With Python 3.8.10 sieve_bool is only slightly slower.
The reason is that PyPy uses a special implementation for "list of ints that fit in 64 bits". It has got a few other special cases, like "list of floats", "list of strings that contain only ascii", etc. The goal is primarily to save memory: a list of 64-bit integers is stored just like an array.array('l') and not a list of pointers to actual integer objects. You save memory not in the size of the list itself (which doesn't change) but in the fact that you don't need a very large number of small additional integer objects all existing at once.
There is no special case for "list of boolean", because there are only ever two boolean objects in the first place. So there would be no memory-saving benefit in using a strategy like "list of 64-bit ints" in this case. Of course, we could do better and store that list with only one bit per entry, but it is not a really common pattern in Python; we just never got around to implementing that.
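If you want to see which strategy a given list is using, PyPy ships an introspection helper in its PyPy-only __pypy__ module; the exact function and strategy names below are from memory, so treat them as an assumption:

# PyPy only -- the __pypy__ module does not exist on CPython
import __pypy__

print(__pypy__.strategy([1, 2, 3]))        # e.g. 'IntegerListStrategy'
print(__pypy__.strategy([1.0, 2.0]))       # e.g. 'FloatListStrategy'
print(__pypy__.strategy([True, False]))    # falls back to the generic object strategy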
So why is it slower, anyway? The reason is that in the "list of general objects" case, the JIT compiler needs to produce extra code to check the type of objects every time it reads an item from the list, and extra GC logic every time it puts an item into the list. This is not a lot of code, but in your case, I guess it doubles the length of the (extremely short) generated assembly for the inner loop doing nums[j] = 1.
Right now, both in PyPy and CPython(*), the fastest is probably to use array.array('B') instead of a list, which both avoids that PyPy-specific issue and also uses substantially less memory (always a performance win if your data structures contain 10**8 elements).
EDIT: (*) no, turns out that CPython is probably too slow for the memory bandwidth to be a limit. On my machine, PyPy is maybe 30-35% faster when using bytes. See also comments for a hack that speeds up CPython from 9x to 3x slower than PyPy, but which as usual slows down PyPy.
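For reference, the array-based variant I mean is just a mechanical rewrite of sieve_num (a sketch only; I haven't benchmarked this exact snippet):

import array

def sieve_arr(n):
    # one unsigned byte per entry instead of one list slot per entry
    nums = array.array('B', [0]) * n
    for i in range(2, n):
        if i * i >= n:
            break
        if nums[i] == 0:
            for j in range(i * i, n, i):
                nums[j] = 1
    return [i for i in range(2, n) if nums[i] == 0]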
I wanted to code a prime number generator in Python; I've only done this in C and Java before. I did the following, using an integer bitmap as an array. The algorithm's cost should grow as n log(log(n)), but I am seeing an exponential increase in cost/time as the problem size n increases. Is there something obvious I am not seeing, or something I don't know about Python once integers grow larger than is practical? I am using Python 3.8.3.
def countPrimes(n):
    if n < 3:
        return []
    arr = (1 << (n-1)) - 2
    for i in range(2, n):
        selector = 1 << (i - 1)
        if (selector & arr) == 0:
            continue
        # We have a prime
        composite = selector
        while (composite := composite << i) < arr:
            arr = arr & (~composite)
    primes = []
    for i in range(n):
        if (arr >> i) & 1 == 1:
            primes.append(i+1)
    return primes
Some analysis of my runtime: I plotted y = n log(log(n)) (red line, steeper) against y = x (blue line, less steep). [Plot not reproduced here.]
I'd normally not use integers larger than uint64, but since Python allows unlimited-size integers and I'm just testing, I used the above approach. As I said, I am trying to understand why the runtime increases exponentially with the problem size n.
I used an integer bitmap as an array
That's extremely expensive. Python ints are immutable. Every time you want to toggle a bit, you're building a whole new gigantic int.
You also need to build other giant ints just to access single bits you're interested in - for example, composite and ~composite are huge in arr = arr & (~composite), even though you're only interested in 1 bit.
Use an actual mutable sequence type. Maybe a list, maybe a NumPy array, maybe some bitvector type off of PyPI, but don't use an int.
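For example, a minimal sketch of the same idea on top of a mutable bytearray (this is a standard sieve of Eratosthenes, not a line-for-line translation of the code above):

def primes_upto(n):
    # one byte per candidate; mutating an entry is O(1), unlike rebuilding a huge int
    if n < 2:
        return []
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # clear all multiples of i starting at i*i
            is_prime[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if is_prime[i]]

print(primes_upto(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]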
At first, I wanted to compare the memory usage of a generator and a list comprehension. The book gives a little benchmark snippet, and when I ran it on my PC (Python 3.6, Windows) I found something unexpected.
The book says that, because a list comprehension has to create a real list and allocate memory for it, iterating over a list comprehension must be slower than iterating over a generator.
Of course, the list comprehension uses more memory than the generator.
The following is my code, which does not bear out that claim (at least for the case inside the sum function).
import tracemalloc
from time import time

def timeIt(func):
    start = time()
    func()
    print('%s use time' % func.__name__, time() - start)
    return func

tracemalloc.start()

numbers = range(1, 1000000)

@timeIt
def lStyle():
    return sum([i for i in numbers if i % 3 == 0])

@timeIt
def gStyle():
    return sum((i for i in numbers if i % 3 == 0))

lStyle()
gStyle()

shouldSize = [i for i in numbers if i % 3 == 0]

snapshotL = tracemalloc.take_snapshot()
top_stats = snapshotL.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
The output:
lStyle use time 0.4460000991821289
gStyle use time 0.6190001964569092
[ Top 10 ]
F:/py3proj/play.py:31: size=11.5 MiB, count=333250, average=36 B
F:/py3proj/play.py:33: size=448 B, count=1, average=448 B
F:/py3proj/play.py:22: size=136 B, count=1, average=136 B
F:/py3proj/play.py:17: size=136 B, count=1, average=136 B
F:/py3proj/play.py:14: size=76 B, count=2, average=38 B
F:/py3proj/play.py:8: size=34 B, count=1, average=34 B
Two points:
The generator uses more time but the same amount of memory.
The list comprehension inside the sum call does not seem to create the full list.
I think maybe the sum function does something I don't know about. Can anyone explain this?
The book is High Performance Python, chapter 5, but I did something a bit different from the book to check its claim in another context. His code is here: book_code; he didn't put the list comprehension inside the sum function.
When it comes to timing, I rely on the timeit module, because it automatically executes multiple runs of the code.
On my system, timeit gives the following results (I strongly reduced the sizes because of the numerous runs):
>>> timeit.timeit("sum([i for i in numbers if i % 3 == 0])", "numbers = range(1, 1000)")
59.54427594248068
>>> timeit.timeit("sum((i for i in numbers if i % 3 == 0))", "numbers = range(1, 1000)")
64.36398425334801
So the generator is slower by about 8%. This is not really a surprise, because the generator has to execute some code on the fly to get each next value, while the precomputed list only advances its current index.
Memory evaluation is IMHO more complex, so I have used the "Compute Memory footprint of an object and its contents (Python recipe)" from ActiveState:
>>> numbers = range(1, 100)
>>> numbers = range(1, 100000)
>>> l = [i for i in numbers if i % 3 == 0]
>>> g = (i for i in numbers if i % 3 == 0)
>>> total_size(l)
1218708
>>> total_size(g)
88
>>> total_size(numbers)
48
My interpretation is that the list uses memory for all of its items (no surprise), while the generator only needs a few values and some code, so its memory footprint is much smaller.
I strongly think that you have used tracemalloc for something it is not intended for. It is aimed at finding possible memory leaks (large blocks of memory that are never deallocated), not at measuring the memory used by individual objects.
BEWARE: I could only test for small sizes. But for very large sizes, the list could exhaust the available memory and the machine will use virtual memory from swap. In that case, the list version will become much slower. More details there
I am trying to create a generator that returns numbers in a given range that pass a particular test given by a function foo. However I would like the numbers to be tested in a random order. The following code will achieve this:
from random import shuffle

def MyGenerator(foo, num):
    order = list(range(num))
    shuffle(order)
    for i in order:
        if foo(i):
            yield i
The Problem
The problem with this solution is that sometimes the range will be quite large (num might be on the order of 10**8 and upwards). Holding such a large list in memory makes this function slow. I have tried to avoid this problem with the following code:
from random import randint

def MyGenerator(foo, num):
    tried = set()
    while len(tried) <= num - 1:
        i = randint(0, num-1)
        if i in tried:
            continue
        tried.add(i)
        if foo(i):
            yield i
This works well most of the time, since in most cases num will be quite large, foo will pass a reasonable number of numbers, and the total number of times the __next__ method will be called will be relatively small (say, a maximum of 200, often much smaller). Therefore it's reasonably likely that we stumble upon a value that passes the foo test, and the size of tried never gets large. (Even if foo only passes 10% of the time, we wouldn't expect tried to get larger than roughly 2000.)
However, when num is small (close to the number of times that the __next__ method is called), or when foo fails most of the time, the above solution becomes very inefficient: it randomly guesses numbers until it finds one that isn't in tried.
My attempted solution...
I was hoping to use some kind of function that maps the numbers 0, 1, 2, ..., n onto themselves in a roughly random way. (This isn't being used for any security purposes, so it doesn't matter if it isn't the most 'random' function in the world.) The function here (Create a random bijective function which has same domain and range) maps signed 32-bit integers onto themselves, but I am not sure how to adapt the mapping to a smaller range. Given num, I don't even need a bijection on 0, 1, ..., num; just a bijection on 0, ..., n-1 for some n larger than and 'close' to num (using whatever definition of close you see fit). Then I can do the following:
def mix_function_factory(num):
    # something here???
    def foo(index):
        # something else here??
    return foo

def MyGenerator(foo, num):
    mix_function = mix_function_factory(num)
    for i in range(num):
        index = mix_function(i)
        if index <= num:
            if foo(index):
                yield index
(So long as the bijection isn't on a set of numbers massively larger than num, the number of times the index <= num test fails will be small.)
My Question
Can you think of one of the following:
A potential solution for mix_function_factory or even a few other potential functions for mix_function that I could attempt to generalise for different values of num?
A better way of solving the original problem?
Many thanks in advance....
The problem is basically generating a random permutation of the integers in the range 0..n-1.
Luckily for us, these numbers have a very useful property: they all have a distinct value modulo n. If we can apply some mathematical operations to these numbers while taking care to keep each number distinct modulo n, it's easy to generate a permutation that appears random. And the best part is that we don't need any memory to keep track of numbers we've already generated, because each number is calculated with a simple formula.
Examples of operations we can perform on every number x in the range include:
Addition: We can add any integer c to x.
Multiplication: We can multiply x with any number m that shares no prime factors with n.
Applying just these two operations on the range 0..n-1 already gives quite satisfactory results:
>>> n = 7
>>> c = 1
>>> m = 3
>>> [((x+c) * m) % n for x in range(n)]
[3, 6, 2, 5, 1, 4, 0]
Looks random, doesn't it?
If we generate c and m from a random number, it'll actually be random, too. But keep in mind that there is no guarantee that this algorithm will generate all possible permutations, or that each permutation has the same probability of being generated.
Implementation
The difficult part about the implementation is really just generating a suitable random m. I used the prime factorization code from this answer to do so.
import random

# credit for prime factorization code goes
# to https://stackoverflow.com/a/17000452/1222951
def prime_factors(n):
    gaps = [1,2,2,4,2,4,2,4,6,2,6]
    length, cycle = 11, 3
    f, fs, next_ = 2, [], 0
    while f * f <= n:
        while n % f == 0:
            fs.append(f)
            n //= f  # integer division keeps the factors as ints
        f += gaps[next_]
        next_ += 1
        if next_ == length:
            next_ = cycle
    if n > 1: fs.append(n)
    return fs
def generate_c_and_m(n, seed=None):
    # we need to know n's prime factors to find a suitable multiplier m
    p_factors = set(prime_factors(n))

    def is_valid_multiplier(m):
        # m must not share any prime factors with n
        factors = prime_factors(m)
        return not p_factors.intersection(factors)

    # if no seed was given, generate random values for c and m
    if seed is None:
        c = random.randint(0, n - 1)   # randint needs both bounds
        m = random.randint(1, 2*n)
    else:
        c = seed
        m = seed

    # make sure m is valid
    while not is_valid_multiplier(m):
        m += 1

    return c, m
Now that we can generate suitable values for c and m, creating the permutation is trivial:
def random_range(n, seed=None):
    c, m = generate_c_and_m(n, seed)
    for x in range(n):
        yield ((x + c) * m) % n
And your generator function can be implemented as
def MyGenerator(foo, num):
    for x in random_range(num):
        if foo(x):
            yield x
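As a quick sanity check, a fixed seed makes the output deterministic (computed by hand from the formula above, so treat the exact numbers as illustrative):

>>> list(random_range(10, seed=3))
[9, 2, 5, 8, 1, 4, 7, 0, 3, 6]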
This may be a case where the best algorithm depends on the value of num, so why not use two selectable algorithms wrapped in one generator?
You could mix your shuffle and set solutions with a threshold on the value of num. That's basically assembling your first two solutions in one generator:
from random import shuffle, randint

def MyGenerator(foo, num):
    if num < 100000:  # threshold has to be adjusted by experiments
        order = list(range(num))
        shuffle(order)
        for i in order:
            if foo(i):
                yield i
    else:  # big values, few collisions with random generator
        tried = set()
        while len(tried) < num:
            i = randint(0, num-1)
            if i in tried:
                continue
            tried.add(i)
            if foo(i):
                yield i
The randint solution (for big values of num) works well because there aren't so many repeats in the random generator.
Getting the best performance in Python is much trickier than in lower-level languages. For example, in C, you can often save a little bit in hot inner loops by replacing a multiplication with a shift. The overhead of Python's bytecode orientation erases this. Of course, this changes again when you consider which variant of "Python" you're targeting (PyPy? NumPy? Cython?); you really have to write your code based on which one you're using.
But even more important is arranging operations to avoid serialized dependencies, since all CPUs are superscalar these days. Of course, real compilers know about this, but it still matters when choosing an algorithm.
One of the easiest ways to gain a little bit over the existing answers would be by generating numbers in chunks using numpy.arange() and applying ((x + c) * m) % n to the numpy ndarray directly. Every Python-level loop that can be avoided helps.
If foo itself can be applied directly to numpy ndarrays, that would be even better. Of course, a sufficiently small function in Python will be dominated by function-call overhead anyway.
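For instance, a rough sketch of the chunked idea (assuming numpy is installed, and reusing the ((x + c) * m) % n mapping from the accepted answer; the chunk size is an arbitrary choice):

import numpy as np

def random_range_chunked(n, c, m, chunk=1 << 16):
    # generate the permutation ((x + c) * m) % n block by block, so the
    # per-element arithmetic happens inside numpy instead of a Python loop
    for start in range(0, n, chunk):
        x = np.arange(start, min(start + chunk, n), dtype=np.int64)
        # note: (x + c) * m must stay within int64, which holds for the
        # c < n, m < 2*n ranges used earlier as long as n is below ~2**30
        for index in ((x + c) * m) % n:
            yield int(index)

Each yielded value still crosses back into Python, but the arithmetic for a whole chunk is done in one vectorized pass.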
The best fast random-number-generator today is PCG. I wrote a pure-python port here but concentrated on flexibility and ease-of-understanding rather than speed.
Xoroshiro128+ is second-best-quality and faster, but less informative to study.
Python's (and many others') default choice of Mersenne Twister is among the worst.
(there's also something called splitmix64 which I don't know enough about to place - some people say it's better than xoroshiro128+, but it has a period problem - of course, you might want that here)
Both default-PCG and xoroshiro128+ use a 2N-bit state to generate N-bit numbers. This is generally desirable, but means numbers will be repeated. PCG has alternate modes that avoid this, however.
Of course, much of this depends on whether num is (close to) a power of 2. In theory, PCG variants can be created for any bit width, but currently only various word sizes are implemented since you'd need explicit masking. I'm not sure exactly how to generate the parameters for new bit sizes (perhaps it's in the paper?), but they can be tested simply by doing a period/2 jump and verifying that the value is different.
Of course, if you're only making 200 calls to the RNG, you probably don't actually need to avoid duplicates on the math side.
Alternatively, you could use an LFSR, which does exist for every bit size (though note that it never generates the all-zeros value (or equivalently, the all-ones value)). LFSRs are serial and (AFAIK) not jumpable, and thus can't be easily split across multiple tasks. Edit: I figured out that this is untrue, simply represent the advance step as a matrix, and exponentiate it to jump.
Note that LFSRs do have the same obvious biases as simply generating numbers in sequential order based on a random start point - for example, if rng_outputs[a:b] all fail your foo function, then rng_outputs[b] will be much more likely as a first output regardless of starting point. PCG's "stream" parameter avoids this by not generating numbers in the same order.
Edit2: I have completed what I thought was a "brief project" implementing LFSRs in python, including jumping, fully tested.
I would like someone to explain why the following code has so much additional overhead. At 100k iterations, the speed is the same for either case (2.2 s). When increasing to 1E6 iterations, case "A" never finishes, while case "B" takes only 29 seconds.
Case "A"
while n is not 1:
foo
Case "B"
while n > 1:
foo
Complete code, if it's of any help:
def coll(n):
    count = 0
    # while n is not 1:
    while n > 1:
        count += 1
        if not n % 2:
            n /= 2
        else:
            n = 3*n + 1
    return count

for x in range(1,100000):
    count = coll(x)
First of all, you should use n > 1 or n != 1, not n is not 1. The fact that the latter works is an implementation detail, and obviously it's not working for you.
The reason it's not working is because there are values of x in your code that cause the Collatz sequence to go over the value of sys.maxint, which turns n into a long. Then, even when it ends up going back down to 1, it's actually 1L; a long, not an int.
Try using while n is not 1 and repr(n) != '1L':, and it'll work as you expect. But don't do that; just use n > 1 or n != 1.
Generally in Python, is is extremely fast to check, as it uses referential equality, so all that needs to be checked is that two objects have the same memory location. Note that the only reason your code works is that most implementations of Python generally maintain a pool of the smaller integers, so that every 1 always refers to the same 1, for example, where there may be multiple objects representing 1000. In those cases, n is 1000 would fail, where n == 1000 would work. Your code is relying on this integer pool, which is risky.
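To see the integer pool in action (the exact behaviour is a CPython implementation detail, so the second result is not guaranteed on other interpreters):

a = int("1")
b = int("1")
print(a is b)        # True on CPython: small ints (-5..256) are cached and reused

c = int("1000")
d = int("1000")
print(c is d)        # typically False: each 1000 is a separate object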
> involves a function call, which is fairly slow in Python: n > 1 translates to n.__gt__(1).