Python range for multiple iteration

I was told that if you'd like to iterate over the list multiple times, it's probably better to use range. This is because xrange has to generate an integer object every time you access an index, whereas range is a static list and the integers are already "there" to use.
So I deduce that the list created by range() remains in memory long enough to be re-iterated. When will that list be destroyed?

In Python 2, range(n) returns a list of numbers from 0 up to n-1.
If you code
for i in range(1000000):
then your program creates a list of a million integers, and throws it away when the for statement ends. That list has to be constructed in memory, and that can be expensive, if your code does it often.
If you code
for i in xrange(1000000):
then your program doesn't create a list of a million integers: instead you get a sort of iterator. But it will still be destroyed at the end of the for statement. The xrange() function dates from Python 1, before the language had iterators as first-class constructs, but the idea is much the same. It is not a data structure but a chunk of code that returns the next number when you ask for it, instead of generating them all beforehand.
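A quick way to see the difference for yourself, as a sketch (Python 2 only, since xrange is gone in Python 3; the exact byte counts vary by platform):
import sys

numbers = range(1000000)      # a real list: about a million pointers plus the int objects
lazy = xrange(1000000)        # a small object that produces each number on demand

print sys.getsizeof(numbers)  # millions of bytes on a 64-bit build
print sys.getsizeof(lazy)     # a few dozen bytes, regardless of the range size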
So your deduction "that the list created by the range method remains in memory enough to be re-iterated" is incorrect. (Though I think you meant xrange not range. Either way, it's wrong.)
If you want a range()-generated list to persist in memory, do this:
myrange = range(1000000)
Then you can do
for i in myrange:
as many times as you like and you will pay the overhead of creating a list of a million integers only once.
If your range is of the order of dozens or hundreds, not millions, and you need to ask this question, then you should not be fretting about efficiency.

Random item from iterator?

I have the following code
import itertools

number_list = (i for i in range(5))
permutations = (num for num in itertools.product(number_list, repeat=9))
This is generating an iterator called permutations which will hold all permutations of 9 characters within the number_list if I'm not mistaken. This can get pretty large for a big number_list.
I can iterate through permutations with next(permutations), but the problem is that it's sequential. I would like to be able to draw a random item from any part of the iterator. If it were a list, I could simply use random.choice(), but for a big number_list I don't have nearly enough memory or time for that.
I could also just use next() and store a list of X items and then randomize them, but that won't work either, because the full sequence can get so incredibly big that the outputs would all be so similar it wouldn't really be "random".
I was wondering: if it isn't possible to draw a random item from the iterator, is there an algorithm which allows me to create an iterator that outputs a random item with each next(), but that, when it ends, will have gone through the entire set of permutations without repeating?
The final idea would be having an iterator that would spit a random permutation of n characters out of a list of i elements, being able to get both n and i to arbitrarily large numbers without memory constraints, making sure that when the whole iterator ends up finishing (doesn't matter when, even if it finished after years in theory), all possible permutations would be exhausted without repetitions.
Firstly, your code does not generate permutations; it draws with replacement. Secondly, iterators (as the name suggests) are meant to ITERATE through some collection, not to jump to random places in it (of course, you can write your own __next__ method which does whatever you want; whether you want to call the resulting object an iterator is a philosophical question). Thirdly, producing random samples with replacement is a much-studied and much-implemented problem. See for example: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
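For instance, a sketch using the numpy function linked above (the 5 and the size=9 mirror the question's number_list and repeat; choice() samples with replacement by default):
import numpy as np

# One uniformly random element of itertools.product(range(5), repeat=9),
# drawn without materializing anything.
sample = tuple(np.random.choice(5, size=9))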

3 questions about generators and iterators in Python

Everyone says you lose the benefit of generators if you put the result into a list.
But you need a list, or a sequence, to even have a generator to begin with, right? So, for example, if I need to go through the files in a directory, don't I have to make them into a list first, like with os.listdir()? If so, then how is that more efficient? (I am always working with strings and files, so I really hate that all the examples use range and integers, but I digress)
Taking it a step further, the mere presence of the yield keyword is supposed to make a generator. So if I do:
for x in os.listdir():
    yield x
Is a list still being created? Or is os.listdir() itself now also magically a generator? Is it possible that, os.listdir() not having been called yet, there really isn't a list here yet?
Finally, we are told that iterators need __iter__() and __next__() methods. But doesn't that also mean they need an index? If not, what is next() operating on? How does it know what is next without an index? Before 3.6, dict keys had no order, so how did that iteration work?
No.
See, there's no list here:
def one():
    while True:
        yield 1
Index and next() are two independent tools to perform an iteration. Again, if you have an object such that its iterator's next() always returns 1, you don't need any indices.
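Consuming it looks like this (just a usage sketch; the next() built-in works the same way in Python 2.6+ and 3):
g = one()
print(next(g))  # 1
print(next(g))  # 1 again: no list and no index anywhere, just code resuming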
In deeper detail...
See, technically you can always associate a list and an index with any generator or iterator: simply write down all its returned values, and you'll get an at most countable set of values a₀, a₁, ... But those are merely a mathematical formalism that need not have anything in common with how a real generator works. For instance, take the generator above that always yields one. You can count how many ones you have gotten from it so far, and call that an index. You can write down all those ones, comma-separated, and call that a list. Do those two objects correctly describe the generator's output so far? Apparently so. Are they in the least bit important for the generator itself? Not really.
Of course, a real generator will probably have some state. You can call it an index, provided you don't insist that an index must be a non-negative integral scalar; if the generator works deterministically, you can write down all its states, number them, and call the current state's number an index. Generators will also always have a source for their states and returned values. So indices and lists can be regarded as abstractions that describe an object's behaviour, but they are not necessarily the concrete implementation details that are actually used.
Consider an unbuffered file reader. It retrieves a single byte from the disk and immediately yields it. There's no real list in memory, only the file contents on the disk (and there may not even be a file, if our reader is connected to a network socket instead of a real disk drive, with the Oracle of Delphi at the connection's other end). You can call the file position an index, until you read from stdin, which is only forward-traversable, so indexing it makes no real physical sense; the same goes for network connections over unreliable protocols, by the way.
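A minimal sketch of such a reader as a generator (the name bytes_from is mine):
def bytes_from(path):
    # Yield one byte at a time; nothing is ever accumulated in memory.
    with open(path, 'rb') as f:
        while True:
            b = f.read(1)
            if not b:        # empty read means end of file
                return
            yield b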
Something like this.
1) This is wrong; a list is merely the easiest example for explaining what a generator does. Think of the eight queens problem, where you yield each solution as soon as the program finds it: I can't recognize a result list anywhere. Note that iterator alternatives are often offered even by the Python standard library (islice() vs. slice()), and an easy example of something not representable by a list is itertools.cycle().
In consequence, 2) and 3) are also wrong.
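For instance, the cycle() case (a small sketch):
import itertools

colors = itertools.cycle(['red', 'green', 'blue'])
# next(colors) yields 'red', 'green', 'blue', 'red', ... forever;
# no finite list could stand in for this sequence.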

Is there a form of 'for' that evaluates the "expression list" every time?

lt = 1000  # list primes to ...
remaining = list(range(2, lt + 1))  # remaining primes
for c in remaining:  # current "prime" being tested
    for t in remaining[0:remaining.index(c)]:  # test divisor
        if c % t == 0 and c != t:
            if c in remaining:
                remaining.remove(c)
If you don't need context:
How can I either re-run the same target-list value, or use something other than for that reads the expression list every iteration?
If you need context:
I am currently creating a program that lists primes from 2 to a given value (lt). I have a list 'remaining' that starts as all integers from 2 to the given value. One at a time, it takes a value 'c' from the list and tests it for divisibility, one by one, by every smaller number 't' on the list. If 'c' is divisible by 't', it removes it from the list. By the end of the program, in theory, only primes should remain, but I have run into a problem: because I am removing items from the list, and for reads remaining only once, for skips values in remaining and thus leaves composites in the list.
What you're trying to do is almost never the right answer (and it's definitely not the right answer here, for reasons I'll get to later), which is why Python doesn't give you a way to do it automatically. In fact, it's illegal to delete from or insert into a list while you're iterating over it, even though CPython and other Python implementations usually don't check for that error.
But there is a way you can simulate what you want, with a little verbosity:
for i in range(remaining.index(c)):
    if i >= remaining.index(c): break
    t = remaining[i]
Now we're not iterating over remaining, we're iterating over its indices. So, if we remove values, we'll be iterating over the indices of the modified list. (Of course we're not really relying on the range there, since the if…break tests the same thing; if you prefer for i in itertools.count():, that will work too.)
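Spelled out, the itertools.count() variant mentioned above would look like this (same behaviour, just without the redundant range bound):
import itertools

for i in itertools.count():
    if i >= remaining.index(c): break
    t = remaining[i]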
And, depending on what you want to do, you can expand it in different ways, such as:
end = remaining.index(c)
for i in range(end):
    if i >= end: break
    t = remaining[i]
    # possibly subtract from end within the loop
    # so we don't have to recalculate remaining.index(c)
… and so on.
However, as I mentioned at the top, this is really not what you want to be doing. If you look at your code, it's not only looping over all the primes less than c, it's calling a bunch of functions inside that loop that also loop over either all the primes less than c or your entire list (that's how index, remove, and in work for lists), meaning you're turning linear work into quadratic work.
The simplest way around this is to stop trying to mutate the original list to remove composite numbers, and instead build a set of primes as you go along. You can search, add, and remove from a set in constant time. And you can just iterate your list in the obvious way because you're no longer mutating it.
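As a minimal sketch of that no-mutation approach (trial division against the primes found so far; this is not the answerer's actual code, and a parallel set can be kept if you also need fast membership tests):
lt = 1000
primes = []                     # built up as we go; the range we iterate is never mutated
for c in range(2, lt + 1):
    if all(c % p != 0 for p in primes):
        primes.append(c)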
Finally, this isn't actually implementing a proper prime sieve, but a much less efficient algorithm that for some reason everyone has been teaching as a Scheme example for decades and more recently translating into other languages. See The Genuine Sieve of Eratosthenes for details, or this project for sample code in Python and Ruby that shows how to implement a proper sieve and a bit of commentary on performance tradeoffs.
(In the following, I ignore the XY problem of finding primes using a "mutable for".)
It's not entirely trivial to design an iteration over a sequence with well-defined (and efficient) behavior when the sequence is modified. In your case, where the sequence is merely being depleted, one reasonable thing to do is to use a list but "delete" elements by replacing them with a special value. (This makes it easy to preserve the current iteration position and avoids the cost of shifting the subsequent elements.)
To make it efficient to skip the deleted elements (both for the outer iteration and any inner iterations like in your example), the special value should be (or contain) a count of any following deleted elements. Note that there is a special case of deleting the current element, where for maximum efficiency you must move the cursor while you still know how far to move.
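A minimal sketch of that idea, using None as the special value and omitting the skip-count optimization the answer describes:
data = list(range(2, 20))

def soft_delete(lst, i):
    lst[i] = None              # O(1) "deletion": later elements never shift

for i, c in enumerate(data):
    if c is None:              # skip slots that were already deleted
        continue
    for t in data[:i]:
        if t is not None and c % t == 0:
            soft_delete(data, i)
            break

primes = [v for v in data if v is not None]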

Finding intersections of huge sets with huge dicts

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist, you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
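To convince yourself the two expressions agree, a tiny self-contained check with made-up data:
kmerCounts = {'AAA': 3, 'TTT': 5}
kmers = {'AAA', 'CCC'}
assert (sum(kmerCounts.get(x, 0) for x in kmers)
        == sum(kmerCounts[x] for x in kmers.intersection(kmerCounts)))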
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.

Why is l.insert(0, i) slower than l.append(i) in Python?

I tested two different ways to reverse a list in python.
import timeit

value = [i for i in range(100)]

def rev1():
    v = []
    for i in value:
        v.append(i)
    v.reverse()

def rev2():
    v = []
    for i in value:
        v.insert(0, i)

print timeit.timeit(rev1)
print timeit.timeit(rev2)
Interestingly, the second method, which inserts each value at the front, is much slower than the first one.
20.4851300716
73.5116429329
Why is this? In terms of operations, inserting an element at the head doesn't seem that expensive.
insert is an O(n) operation, as it requires all elements at or after the insert position to be shifted along by one. append, on the other hand, is amortized O(1): an occasional append is O(n), when more space must be allocated, but most appends write into already-reserved space. This explains the substantial time difference.
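To see the asymmetry directly, here is a small hypothetical micro-benchmark (timings will vary by machine):
import timeit

# The cost of a single insert(0, ...) grows with the list's length,
# while append stays flat.
for n in (10**3, 10**4, 10**5):
    lst = list(range(n))
    t_insert = timeit.timeit(lambda: lst.insert(0, 0), number=100)
    t_append = timeit.timeit(lambda: lst.append(0), number=100)
    print("n=%d insert(0): %.5fs append: %.5fs" % (n, t_insert, t_append))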
The time complexities of these methods are thoroughly documented here.
I quote:
Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move).
Now, going back to your code, we can see that rev1() is an O(n) implementation whereas rev2() is in fact O(n²), so it makes sense that rev2() will be much slower.
In Python, lists are implemented as arrays. If you append one element to an array, the reserved space for an array is simply expanded. If you prepend an element, all elements are shifted by 1 and that is very expensive.
You can confirm this by reading about Python lists online. Python implements a list as an array, where the size of the array is typically larger than the size of your current list. The unused elements are at the end of the array and represent new elements that can be added to the END of the list, not the beginning. Python uses a classical amortized-cost approach, so that on average appending to the end of the list takes O(1) time across a run of appends, although occasionally a single append will find the array full, so a new larger array must be created and all the data copied into it. On the other hand, if you always insert at the front of the list, then in the underlying array all elements need to move over one index to make room for the new element at the beginning. So, to summarize: if you create a list by doing N insertions, the total running time will be O(N) if you always append new items to the end of the list, and O(N^2) if you always insert at the front.
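If you want to watch the over-allocation happen, here is a quick sketch (a CPython implementation detail; the exact sizes vary by version and platform):
import sys

v = []
last = sys.getsizeof(v)
for i in range(64):
    v.append(i)
    size = sys.getsizeof(v)
    if size != last:  # the underlying array was just reallocated to a larger size
        print("reallocated at length %d: %d bytes" % (len(v), size))
        last = size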
