When will __length_hint__ be inaccurate?

I'm aware that, technically, you cannot know the length of a Python iterator without actually iterating through it.
The __length_hint__ method, i.e. it.__length_hint__(), returns an estimate of len(list(it)). There's even a wrapper around this method in the operator module (operator.length_hint), whose documentation says that the method "may over- or under-estimate by an arbitrary amount."
For finite iterators, what are the cases where __length_hint__ will be inaccurate? If this can't be known, why not?
I don't see any reference to this in PEP 424.
>>> obja = iter(range(98345984))
>>> obja.__length_hint__()
98345984
>>> import numpy as np
>>> objb = iter(np.arange(817483))
>>> objb.__length_hint__()
817483
I know it's not a great idea to rely on an implementation detail. But this is a detail that is already explicitly exposed by a top-level function of the operator module. Are there, for instance, specific data structures whose iterators are guaranteed not to give inaccurate hints?

Basically, anything that is iterating over something that is generated dynamically, rather than iterating over a completed sequence.
Consider a simple iterator that flips a coin, with a head worth 1 point and a tail worth 2 points. It continues to flip the coin until you reach 4 points.
import random

def coinflip():
    s = 0
    while s < 4:
        x = random.choice([1, 2])
        s += x
        yield ("H" if x == 1 else "T")
How long will the sequence be? It could be as short as 2: TT. It could be as long as 4: either HHHH or HHHT. However, in the majority of cases it will be 3: HHT, HTH, HTT, THT or THH. In this case, 3 would be the "safest" guess, but that could be higher or lower.
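As a quick illustration (using operator.length_hint, which falls back to a supplied default when no hint is available): a generator like coinflip provides no __length_hint__ at all, while a list iterator's hint is exact and shrinks as you consume items:
import operator

it = iter([10, 20, 30])
print(operator.length_hint(it))       # 3 -- exact for a list iterator
next(it)
print(operator.length_hint(it))       # 2 -- updates as items are consumed

gen = coinflip()                      # the generator defined above
print(operator.length_hint(gen, -1))  # -1 -- no hint available, so the default is returned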

Related

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
This is a costly operation because it recreates a list every time (and copies the references). A nasty side effect is that it rebinds q to a new list, although you can use q[:] = q[-2:] to avoid that.
The deque object just moves its internal start pointer forward and "forgets" the oldest item. So it's faster, and this is one of the use cases it was designed for.
Of course, for 2 values, there isn't much difference, but for a bigger number there is.
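A minimal sketch of the maxlen behaviour (just an illustration, not the questioner's actual loop):
from collections import deque

q = deque(maxlen=2)
for item in range(5):
    q.append(item)   # once full, the oldest entry is discarded automatically
print(q)             # deque([3, 4], maxlen=2)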
If I interpret your question correctly, you have a function, that calculates a value, and you want to do another calculation with this and the previous value. The best way is to use two variables:
previous_item = None  # needs an initial value before the first iteration
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

Python For Loop Using Math Operators

Ok, I'm in the process of learning Python, and had a quick question about for loops. I was wondering if you could use math operators in them, like JavaScript. For example, could I do:
for i = 0, i < 5, i++:
    # code here
Now, I'm quite aware that Python doesn't support i++, and I think it doesn't support the commas either. So if I can do it that way, could you provide a sample.
Thanks
You would use a range loop:
for i in range(5):
    # code here
If you want to increment in a loop you would use a while loop:
i = 0
while i < 5:
    i += 1
To decrement you would use i -= 1.
Just because a loop is introduced by for does not imply the same behaviour across different languages.
Python's for loop iterates over objects. Something like the C for loop does not exist.
The C for loop (for ( <init> ; <cond> ; <update> ) <statement>), however, is essentially equivalent to this C code:
<init>;
while ( <cond> ) {
    <statement>
    <update>
}
So, with the additional information that Python does have a while loop which behaves like the C-while loop, you should now be able to implement something like the C for loop in Python. I'll leave that as an exercise:-)
Note: as generating an evenly spaced sequence of integer values is a common case, Python provides the range() (Python 3) or xrange() (Python 2) function. This creates a range object which (basically) yields the successive values of a sequence given by start, stop and step arguments.
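For example (Python 3 syntax, just to illustrate the start, stop and step arguments):
print(list(range(5)))         # [0, 1, 2, 3, 4]
print(list(range(2, 10, 3)))  # [2, 5, 8] -- start 2, stop 10, step 3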
Quick answer
You may use:
for i in range(5):
    # code here
or
i = 0
while i < 5:
    i = i + 1  # or i += 1
Boring/pedantic answer
When I was learning Python I disliked the syntax; why should a simple for loop require a second construct, range? The answer, I believe, is due to the fundamental role of the list in Python's prescriptive syntax. Repeated annoyances by range made me think about how the data were described (or not) before the loop, which in turn led me to think more Pythonically about the design of the data.
Let's say you want to populate a list with the first five perfect squares. You could:
squares = []
for i in range(5):
    squares.append(i**2)
Alternatively, you could use comprehension:
initial_values = range(5) # we've declared the initial values
squares = [i**2 for i in initial_values]
Or more compactly:
squares = [i**2 for i in range(5)]
I routinely encounter problems where there's no Pythonic way to write the code, and I end up writing C-like Python (as in the Quick answer above). But just as often I find there's a more elegant and readable way to do things, and usually this indicates some imperfections in the antecedent data design.

Is Haskell's laziness an elegant alternative to Python's generators?

In a programming exercise, it was first asked to program the factorial function and then calculate the sum 1! + 2! + 3! + ... + n! in O(n) multiplications (so we can't use the factorial directly). I am not looking for the solution to this specific (trivial) problem; I'm trying to explore Haskell's abilities, and this problem is a toy I would like to play with.
I thought Python's generators could be a nice solution to this problem. For example :
from itertools import islice
def ifact():
    i, f = 1, 1
    yield 1
    while True:
        f *= i
        i += 1
        yield f

def sum_fact(n):
    return sum(islice(ifact(), n))
Then I tried to figure out whether there was something in Haskell with behaviour similar to this generator, and I thought that laziness does all the work without any additional concept.
For example, we could replace my Python ifact with
fact = scanl1 (*) [1..]
And then solve the exercise with the following :
sum n = foldl1 (+) (take n fact)
I wonder if this solution is really "equivalent" to the Python one regarding time complexity and memory usage. I would say that Haskell's solution never stores the whole list fact, since its elements are used only once.
Am I right or totally wrong ?
EDIT:
I should have checked more precisely:
Prelude> foldl1 (+) (take 4 fact)
33
Prelude> :sprint fact
fact = 1 : 2 : 6 : 24 : _
So (my implementation of) Haskell stores the result, even though it's no longer used.
Indeed, lazy lists can be used this way. There are some subtle differences though:
Lists are data structures. So you can keep them after evaluating them, which can be both good and bad (you can avoid recomputation of values and do recursive tricks as @ChrisDrost described, at the cost of keeping memory unreleased).
Lists are pure. In generators you can have computations with side effects; you can't do that with lists (which is often desirable).
Since Haskell is a lazy language, laziness is everywhere, and if you just convert a program from an imperative language to Haskell, the memory requirements can change considerably (as @RomanL describes in his answer).
But Haskell offers more advanced tools to accomplish the generator/consumer pattern. Currently there are three libraries that focus on this problem: pipes, conduit and iteratees. My favorite is conduit, it's easy to use and the complexity of its types is kept low.
They have several advantages, in particular that you can create complex pipelines and you can base them on a chosen monad, which allows you to say what side effects are allowed in a pipeline.
Using conduit, your example could be expressed as follows:
import Data.Functor.Identity
import Data.Conduit
import qualified Data.Conduit.List as C

ifactC :: (Num a, Monad m) => Producer m a
ifactC = loop 1 1
  where
    loop r n = let r' = r * n
               in yield r' >> loop r' (n + 1)

sumC :: (Num a, Monad m) => Consumer a m a
sumC = C.fold (+) 0

main :: IO ()
main = (print . runIdentity) (ifactC $= C.isolate 5 $$ sumC)
-- alternatively running the pipeline in IO monad directly:
-- main = (ifactC $= C.isolate 5 $$ sumC) >>= print
Here we create a Producer (a conduit that consumes no input) that yields factorials indefinitely. Then we compose it with isolate, which ensures that no more than a given number of values are propagated through it, and then we compose it with a Consumer that just sums values and returns the result.
Your examples are not equivalent in memory usage. It is easy to see if you replace * with + (so that the numbers don't get big too quickly) and then run both examples on a big n such as 10^7. Your Haskell version will consume a lot of memory, while Python will keep it low.
A Python generator will not generate a list of values and then sum it up. Instead, the sum function gets values one by one from the generator and accumulates them. Thus, the memory usage remains constant.
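A quick way to see the Python side of this (a sketch of the additive variant described above, not the original factorial code):
from itertools import count, islice

def running_sums():
    # additive analogue of ifact: yields 1, 1+2, 1+2+3, ...
    s = 0
    for i in count(1):
        s += i
        yield s

# Memory stays flat even for 10**7 terms, because nothing is ever materialised.
print(sum(islice(running_sums(), 10**7)))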
Haskell will evaluate functions lazily, but in order to calculate say foldl1 (+) (take n fact) it will have to evaluate the complete expression. For large n this will unfold into a huge expression the same way as (foldl (+) 0 [0..n]) does. For more details on evaluation and reduction have a look here: https://www.haskell.org/haskellwiki/Foldr_Foldl_Foldl%27.
You can fix your sum n by using foldl1' instead of foldl1 as described in the link above. As @user2407038 explained in his comment, you'd also need to keep fact local. The following works in GHC with constant memory use:
let notfact = scanl1 (+) [1..]
let n = 20000000
let res = foldl' (+) 0 (take n notfact)
Note that in the case of the actual factorial in place of notfact, memory considerations are less of a concern: the numbers get big quickly, arbitrary-precision arithmetic slows things down, and so you won't be able to get to values of n large enough to actually see the difference.
Basically, yes: Haskell's lazy lists are a lot like Python's generators, if those generators were effortlessly cloneable, cacheable, and composable. Instead of raising StopIteration you return [] from your recursive function, which can thread state into the generator.
They do some cooler stuff due to self-recursion. For example, your factorial generator is more idiomatically generated like:
facts = 1 : zipWith (*) facts [1..]
or the Fibonaccis as:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
In general, any iterative loop can be converted to a recursive algorithm by promoting the loop state to arguments of a function and then calling it recursively to get the next loop cycle. Generators are just like that, but we prepend some elements on each iteration of the recursive function: go ____ = (stuff) : go ____.
The perfect equivalent is therefore:
ifact :: [Integer]
ifact = go 1 1
  where go f i = f : go (f * i) (i + 1)

sum_fact n = sum (take n ifact)
In terms of what's fastest, the absolute fastest in Haskell will probably be the "for loop":
sum_fact n = go 1 1 1
  where go acc fact i
          | i <= n    = go (acc + fact) (fact * i) (i + 1)
          | otherwise = acc
The fact that this is "tail-recursive" (a call of go does not pipe any sub-calls to go to another function like (+) or (*)) means that the compiler can package it into a really tight loop, and that's why I'm comparing it with "for loops" even though that's not really a native idea to Haskell.
The above sum_fact n = sum (take n ifact) is a little slower than this but faster than sum (take n facts) where facts is defined with zipWith. The speed differences are not very large and I think mostly just come down to memory allocations that don't get used again.

Why am I exceeding max recursion depth?

Let me stop you right there, I already know you can adjust the maximum allowed depth.
But I would think this function, designed to calculate the nth Fibonacci number, would not exceed it, owing to the attempted memoization.
What am I missing here?
def fib(x, cache={1:0,2:1}):
    if x is not 1 and x is not 2 and x not in cache: cache[x] = fib(x-1) + fib(x-2)
    return cache[x]
The problem here is the one that tdelaney pointed out in a comment.
You are filling the cache backward, starting from x and recursing down to 2.
That is sufficient to ensure that you only perform a linear number of recursive calls. The first call to fib(4000) only makes 3998 recursive calls.
But 3998 > sys.getrecursionlimit(), so that doesn't help.
Your code works, just set the recursion limit (default is 1000):
>>> def fib(x, cache={1:0,2:1}):
...     if x is not 1 and x is not 2 and x not in cache: cache[x] = fib(x-1) + fib(x-2)
...     return cache[x]
...
>>> from sys import setrecursionlimit
>>> setrecursionlimit(4001)
>>> fib(4000)
24665411055943750739295700920408683043621329657331084855778701271654158540392715
48090034103786310930146677221724629877922534738171673991711165681180811514457211
13771400656054018493704811431159158792987298892998378107544456316501964164304630
21568595514449785504918067352892206292173283858530346012173429628868997174476215
95754737778371797011268738657294932351901755682732067943003555687894170965511472
22394287423465133129791428666544293424932758353804445807459873383767095726534051
03186366562265469193320676382408395686924657068094675464095820220760924728356005
27753139995364477320639625889904027436038223654786222515006804845418392308019640
53848249082837958012652040193422565794818023898141209364892225521425081077545093
40549694342959926058170589410813569880167004050051440392247460055993434072332526
101572422443738016276258104875526626L
>>>
The reason is that, if you imagine a large tree, your root node is 4000, which connects to 3999 and 3998. You go all the way down one branch of the tree until you hit a base case, then come back up building the cache from the bottom. The tree is over 1000 levels deep, which is why you hit the limit.
To add to the discussion in the question comments, I wanted to summarize:
You're adding to the cache after the recursive step -- thus your cache isn't doing much.
You're also referring to the same cache value in all the calls. Not sure if that's what you want, but that's the behavior.
This style of recursion isn't idiomatic Python. However, what is idiomatic Python is to use something like a memoization decorator. For an example, look here: https://wiki.python.org/moin/PythonDecoratorLibrary#Memoize (With your exact example)
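For reference, a minimal sketch of that idea using the standard-library functools.lru_cache rather than the wiki recipe (keeping the question's offset, where fib(1) == 0 and fib(2) == 1):
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(x):
    if x in (1, 2):
        return x - 1   # fib(1) == 0, fib(2) == 1, as in the question
    return fib(x - 1) + fib(x - 2)

# Still recursive, so a very deep first call needs either a raised recursion
# limit or a bottom-up warm-up loop like the one shown in the next answer:
for i in range(3, 4001):
    fib(i)
print(fib(4000))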
Maybe this helps to visualise what is going wrong:
def fib(x, cache={0:..., 1:0, 2:1}):
    if x not in cache: cache[x] = fib(x-1) + fib(x-2)
    return cache[x]

for n in range(4000): fib(n)
print(fib(4000))
Works perfectly, as you explicitly build the cache bottom-up. (It is a good thing that default arguments are evaluated only once, at function definition time, so the cache persists between calls.)
Btw: your initial dictionary is wrong. fib(1) is 1, not 0. I kept this numbering offset in my approach, though.
The trick to making memoization work well for a problem like this is to start at the first value you don't yet know and work up towards the value you need to return. This means avoiding top-down recursion. It's easy to iteratively compute Fibonacci values. Here's a really compact version with a memo list:
def fib(n, memo=[0,1]):
    while len(memo) < n+1:
        memo.append(memo[-2]+memo[-1])
    return memo[n]
Here's a quick demo run (which goes very fast):
>>> for i in range(90, 101):
        print(fib(i))
2880067194370816120
4660046610375530309
7540113804746346429
12200160415121876738
19740274219868223167
31940434634990099905
51680708854858323072
83621143489848422977
135301852344706746049
218922995834555169026
354224848179261915075
>>> fib(4000)
39909473435004422792081248094960912600792570982820257852628876326523051818641373433549136769424132442293969306537520118273879628025443235370362250955435654171592897966790864814458223141914272590897468472180370639695334449662650312874735560926298246249404168309064214351044459077749425236777660809226095151852052781352975449482565838369809183771787439660825140502824343131911711296392457138867486593923544177893735428602238212249156564631452507658603400012003685322984838488962351492632577755354452904049241294565662519417235020049873873878602731379207893212335423484873469083054556329894167262818692599815209582517277965059068235543139459375028276851221435815957374273143824422909416395375178739268544368126894240979135322176080374780998010657710775625856041594078495411724236560242597759185543824798332467919613598667003025993715274875

Counting collisions in a Python dictionary

My first time posting here, so I hope I've asked my question in the right sort of way.
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
import random

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement @Winston Ewert's solution:
n = 1500
global collision_count
collision_count = 0

class Foo():
    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        # return id(self)  # @John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]
d = {}
for o in objects:
    d[o] = 1
print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as @John Machin put it)?
Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
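A tiny sketch of that last suggestion (the ids here are made up; it just shows one flat tuple-keyed dict instead of a dict of dicts):
obj_a_id, obj_b_id = 101, 202                  # hypothetical object ids
distances = {}

distances[(obj_a_id, obj_b_id)] = 1.5          # set an entry
print(distances.get((obj_a_id, obj_b_id)))     # read without raising KeyError
print((obj_a_id, obj_b_id) in distances)       # membership test instead of has_key()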
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.
