What is the inverse function of itertools.izip in Python?

I saw this question and this question, and I'd like to have the same effect, only done efficiently with itertools.izip.
From itertools.izip's documentation:
Like zip() except that it returns an iterator instead of a list
I need an iterator because I can't fit all the values in memory, so I'm using a generator and iterating over the values.
More specifically, I have a generator that generates three-value tuples. Instead of iterating over it, I'd like to feed three lists of values to three functions, where each list represents a single position in the tuple.
Of those three tuple positions, only one holds big items (memory-consumption-wise; let's call it data), while the other two hold values that need only a small amount of memory. So iterating over the data values' "list of values" first should work for me: consume the data values one by one and cache the small ones.
I can't think of a smart way to generate one "list of values" at a time, because I might occasionally decide to remove an instance of a three-value tuple, depending on the big value of the tuple.
Using the widely suggested zip solution, similar to:
>>> zip(*[('a', 1), ('b', 2), ('c', 3), ('d', 4)])
[('a', 'b', 'c', 'd'), (1, 2, 3, 4)]
Results in the argument-unpacking part (*[...]) triggering a full iteration over the entire iterator and (I assume) caching all the results in memory, which, as I said, is an issue for me.
I can build a mask list (True/False for the small values to keep), but I'm looking for a cleaner, more Pythonic way. If all else fails, I'll do that.

What's wrong with a traditional loop?
>>> def gen():
... yield 'first', 0, 1
... yield 'second', 2, 3
... yield 'third', 4, 5
...
>>> numbers = []
>>> for data, num1, num2 in gen():
... print data
... numbers.append((num1, num2))
...
first
second
third
>>> numbers
[(0, 1), (2, 3), (4, 5)]
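Applied to the question's scenario, here's a minimal sketch along the same lines (the function and argument names are hypothetical): stream the big data values through their consumer one at a time, and cache only the two small values for the other two functions. It assumes consume_data fully exhausts the iterator it is given.
def split_stream(triples, consume_data, consume_small1, consume_small2):
    small1, small2 = [], []

    def data_values():
        # yield the big values one by one; cache the small ones as a side effect
        for data, s1, s2 in triples:
            small1.append(s1)
            small2.append(s2)
            yield data

    consume_data(data_values())   # only one big item is alive at a time
    consume_small1(small1)        # by now the small values are fully cached
    consume_small2(small2)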

Related

Recovering a linked list

I have a linked list that has been stored out of order in an array and the information about the original order is preserved by storing with each element the index of the next element.
For example,
[c;3][b;0][a;1][d;4]
Here [c;3] means that c is followed by d (stored at 3); [b;0] means that b is followed by c (stored at 0), and so on. The out of bound index 4 in [d;4] means that d is the last element.
I am looking for an algorithm to extract the original linked list order, abcd in the example, from such an array.
The last element always comes last (already at the correct place) and the algorithm may use this fact.
To clarify the question, let me reformulate it in terms of Python data structures.
I have a list of 2-tuples where the second element in each tuple is an integer that defines the traversal order through the list. The value of the integer is the index of the next tuple to be traversed. For example, given a list
[('c', 3), ('b', 0), ('a', 1), ('d', 4)]
the traversal order is
[2, 1, 0, 3]
or ('a', 1) -> ('b', 0) -> ('c', 3) -> ('d', 4). How can I write a function that, given a list of 2-tuples as described above, finds the traversal order?
Here is a possible Python solution:
def order(x):
    nexts = [n for _, n in x]
    # prevs[p] holds the index of the element that points at position p
    prevs = [-1] * (len(x) + 1)
    for i, n in enumerate(nexts):
        prevs[n] = i
    # walk backwards from the tail, then reverse
    trav = []
    i = prevs[-1]
    for _ in x:
        trav.append(i)
        i = prevs[i]
    return trav[::-1]
Given the example data
>>> a = [('c', 3), ('b', 0), ('a', 1), ('d', 4)]
this function produces the expected result
>>> order(a)
[2, 1, 0, 3]
>>> [a[i] for i in order(a)]
[('a', 1), ('b', 0), ('c', 3), ('d', 4)]
Is there a better solution?
The head is the element that has nothing pointing to it. You can walk through once and keep track of what has nothing pointing to it, then walk through again to print them out in linear time.
What about sorting the array itself?
c is followed by [3], so swap c with the element at [2] (the position just before [3]), and continue the same for each element. But you'll have to keep track of each element's original index and its swapped/latest index.
Find the index of the head of the linked list in O(n) time and space by looking for whatever element isn't pointed at.
Alternatively, sort by index pointed at in ascending order such that NULL is greater than any index. This can definitely be done in O(n log n) time and O(1) space; you can also use a linear sorting algorithm since the indices are bounded fairly well. There may be in-place variants of linear sorts; something to check.
So you have this [c;3][b;0][a;1][d;NULL] and you want to rearrange to [a][b][c][d].
It's given that the last item in the array contains the tail of the list. So you could build the list backward using an O(n^2) algorithm that's similar to selection sort.
The general idea, in pseudo code, is:
new_list = new array[a.length];
int pos = a.length - 1;                   // original index of the element just placed
new_list[a.length - 1] = a[a.length - 1]; // the tail is already in the right place
for (int k = a.length - 2; k >= 0; k--)
{
    // find the predecessor: the element whose index points at position 'pos'
    for (j = 0; j < a.length - 1; j++)
    {
        if (a[j].index == pos)
        {
            new_list[k] = a[j];
            pos = j;                      // its original position is the next target
            break;
        }
    }
}
That should work, although it's not very efficient. But it'd be fine for small lists if you don't call it very often.
There's a faster way that uses a dictionary and is just slightly more difficult to code. The idea is that you store the item's position as the key, and the position of its predecessor as the value. If you don't know the item's predecessor yet, you store -1. So for this example, you look at the first item, [c;3]. You don't know its predecessor, so you store {0: -1} in the dictionary. Then you look in the dictionary to see if you already have an entry for 3. You don't, so you skip forward to the item at index 3 in the array, which is [d;NULL]. Its predecessor is 0, so you add {3: 0} to the dictionary.
At this point you don't have a successor to follow, so you go back to where you were in your sequential scan and move to the next item: [b;0]. You don't know its predecessor, so you store {1: -1} in the dictionary. This item's successor is 0, which you already have in the dictionary, so you update that entry to {0: 1} and proceed with the forward scan. You don't know the predecessor of [a;1], so you add {2: -1} to your dictionary. You already have an entry for 1, so you update it to {1: 2}. You move forward to the last entry, see that 3 already has an entry in your dictionary, and you're done.
Your dictionary contains:
{0: 1}, {3: 0}, {1: 2}, {2: -1}
Since you know that 3 is the end of the list, you start there in the dictionary and build the ordered sequence backwards by following the predecessor links. You know you're done when you reach the entry whose predecessor is -1.
Worst case, this looks at each item three times: twice during the scan of the array, and once more when following the links in the dictionary.
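A minimal Python sketch of that idea, using the list length as the NULL index (the function name is mine):
def order_via_predecessors(items):
    null = len(items)            # the out-of-bound "NULL" index
    prev = {}                    # prev[p] = position of the element pointing at p
    for i, (_, nxt) in enumerate(items):
        prev.setdefault(i, -1)   # predecessor unknown so far
        prev[nxt] = i            # i precedes position nxt
    # build the order backwards from the tail, then reverse
    trav = []
    i = prev[null]
    while i != -1:
        trav.append(i)
        i = prev[i]
    return trav[::-1]

>>> order_via_predecessors([('c', 3), ('b', 0), ('a', 1), ('d', 4)])
[2, 1, 0, 3]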
If the letters themselves happen to encode the order (as they do in the example), you can sort the letters first and then find the numbers that go with them:
def order(tuples, index_of_value):
    aim = []
    # Find the nth item of each tuple
    l = [t[index_of_value] for t in tuples]
    # Loop through the sorted items
    for i in sorted(l):
        # Find the tuple that goes with each item and append it
        correct = [t for t in tuples if t[index_of_value] == i]
        aim.append(correct[0])
    return aim
a = [('c', 3), ('b', 0), ('a', 1), ('d', 4)]
print(order(a, 0))
>>> [('a', 1), ('b', 0), ('c', 3), ('d', 4)]
In order to have a linked list, you must have a pointer to its head (beginning). To find the head in this situation, iterate over the entire array once until you know which node is not pointed to by any other node. Then all you need to do is follow the pointer at each node until you reach NULL.
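A minimal Python sketch of that approach (the function name is mine):
def order_from_head(items):
    null = len(items)                     # the out-of-bound "NULL" index
    pointed_to = set(nxt for _, nxt in items)
    # the head is the only position that no node points at
    head = next(i for i in range(null) if i not in pointed_to)
    trav = []
    i = head
    while i != null:                      # follow each pointer until NULL
        trav.append(i)
        i = items[i][1]
    return trav

>>> order_from_head([('c', 3), ('b', 0), ('a', 1), ('d', 4)])
[2, 1, 0, 3]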

Is there a point to using nested iterators?

I was reading through some older code of mine and came across this line
itertools.starmap(lambda x, y: x + (y,),
                  itertools.izip(itertools.repeat(some_tuple, len(list_of_tuples)),
                                 itertools.imap(lambda x: x[0], list_of_tuples)))
To be clear, I have some list_of_tuples from which I want to get the first item out of each tuple (the itertools.imap), I have another tuple that I want to repeat (itertools.repeat) such that there is a copy for each tuple in list_of_tuples, and then I want to get new, longer tuples based on the items from list_of_tuples (itertools.starmap).
For example, suppose some_tuple = (1, 2, 3) and list_of_tuples = [(1, other_info), (5, other), (8, 12)]. I want something like [(1, 2, 3, 1), (1, 2, 3, 5), (1, 2, 3, 8)]. This isn't the exact IO (it uses some pretty irrelevant and complex classes) and my actual lists and tuples are very big.
Is there a point to nesting the iterators like this? It seems to me like each function from itertools would have to iterate over the iterator I gave it and store the information from it somewhere, meaning that there is no benefit to putting the other iterators inside of starmap. Am I just completely wrong? How does this work?
There is no reason to nest iterators. Using variables won't have a noticeable impact on performance/memory:
first_items = itertools.imap(lambda x: x[0], list_of_tuples)
repeated_tuple = itertools.repeat(some_tuple, len(list_of_tuples))
items = itertools.izip(repeated_tuple, first_items)
result = itertools.starmap(lambda x,y: x + (y,), items)
The iterator objects used and returned by itertools do not store all the items in memory, but simply calculate the next item when it is needed. You can read more about how they work here.
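For instance, you can watch that laziness at the interpreter (Python 2):
>>> import itertools
>>> it = itertools.imap(lambda x: x[0], [(1, 'a'), (2, 'b')])
>>> it.next()   # each item is computed only when requested
1
>>> it.next()
2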
I don't believe the combobulation above is necessary in this case. It appears to be equivalent to this generator expression:
(some_tuple + (y[0],) for y in list_of_tuples)
However, itertools can occasionally have a performance advantage, especially in CPython.
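With the sample data from the question (quoting the placeholder names):
>>> some_tuple = (1, 2, 3)
>>> list_of_tuples = [(1, 'other_info'), (5, 'other'), (8, 12)]
>>> list(some_tuple + (y[0],) for y in list_of_tuples)
[(1, 2, 3, 1), (1, 2, 3, 5), (1, 2, 3, 8)]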

What is an O(n) algorithm to pair two equal-length lists in order, in place?

Suppose I have two unordered lists of equal length in Python:
a = [5, 2, 3, 1, 4]
b = ['d', 'b', 'a', 'c', 'e']
Is there an O(n), in-place algorithm to obtain the following result?
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
You're looking for the zip and sorted built-in functions.
r = zip(sorted(a), sorted(b))
zip takes two iterables and pairs them together in sequence (so if the lists were unsorted, you'd get (5, 'd') as your first tuple), and any excess values are truncated/ignored (since they can't be paired).
sorted, the last time I looked into the code base, switches sorting strategies depending on the size of the list you give it; it performs at about O(n log n). There isn't a practical comparison sort out there that gives you O(n) performance, since you still have to compare each individual value with the rest of the values at some point.
If you want an in-place sort, you can use the list.sort() function, which does perform an in-place sort. This changes the syntax to the following:
a.sort()
b.sort()
r = zip(a, b)
I don't think there is.
sort() is considered to take O(n log n), and your requirement is something more than sort (though only a little bit more). If there were some kind of O(n) algorithm for this, we could also use it to replace sort(), which has been studied for a long time and is not likely to have an O(n) comparison-based algorithm.
zip will give you a linear-time (but not in-place) pairing of elements. izip from itertools has a constant memory footprint, but you'd need to do linear-time scans each time you access an element out of order, and then reset your generator.
If you can afford an O(n log(n)) in place sorting algorithm, there's a great question and answer about the default implementation of sort() here.
I think the best approach for most applications where the lists are large enough for memory and computation time to matter would be to call sort on each array, and then use the itertools.izip method to create a generator on the results. This approach has constant memory overhead, and is as good as you can get for asymptotic computation time on a generic array.
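A minimal sketch of that approach (Python 2, where itertools.izip exists; in Python 3 the built-in zip is already lazy):
import itertools

a = [5, 2, 3, 1, 4]
b = ['d', 'b', 'a', 'c', 'e']
a.sort()   # in-place, O(n log n)
b.sort()
for pair in itertools.izip(a, b):   # lazy pairing, constant extra memory
    print pair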
Linear-time sorting can be done with radix sort, or some variation; however, this is not in-place and makes some assumptions about your datatypes (i.e., arrays of ints or chars work, but floats and BigInts get messy).
Side bar: the bucket sort article on wikipedia needs some attention if anyone in this community has some free time.
Yes, there is a way to get O(n) when sorting non-negative integers less than N, where N is the length of the list.
The way to do it is to use buckets.
Here is an implementation:
def _sort(_list):
    # counting sort: tally how many times each value 0 <= i < N appears
    buckets = [0] * len(_list)
    for i in _list:
        i = int(i)
        assert 0 <= i < len(_list)
        buckets[i] += 1
    # emit each value as many times as it was counted
    result = []
    for num, count in enumerate(buckets):
        result.extend([num] * count)
    return result

# shift the letters into the 0..N-1 range, sort, then shift back
alp = map(ord, "dabce")
m = min(alp)
alp = [i - m for i in alp]
alp = _sort(alp)
alp = [i + m for i in alp]
alp = map(chr, alp)
print zip(_sort([1, 3, 2, 0, 4]), alp)
#[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]

Binning into timeslots - Is there a better way than using list comp?

I have a dataset of events (tweets to be specific) that I am trying to bin / discretize. The following code seems to work fine so far (assuming 100 bins):
import datetime

HOUR = datetime.timedelta(hours=1)
start = datetime.datetime(2009, 1, 1)
z = [start + x*HOUR for x in xrange(1, 100)]
But then, I came across this fateful line in the Python docs: 'This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n)'. The zip idiom does indeed work - but I can't understand how (what is the * operator for, for instance?). How could I use it to make my code prettier? I'm guessing this means I should make a generator/iterable for time that yields the time in graduations of an HOUR?
I will try to explain zip(*[iter(s)]*n) in terms of a simpler example:
imagine you have the list s = [1, 2, 3, 4, 5, 6]
iter(s) gives you a listiterator object that will yield the next number from s each time you ask for an element.
[iter(s)] * n gives you the list with iter(s) in it n times e.g. [iter(s)] * 2 = [<listiterator object>, <listiterator object>] - the key here is that these are 2 references to the same iterator object, not 2 distinct iterator objects.
zip takes a number of sequences and returns a list of tuples where each tuple contains the ith element from each of the sequences. e.g. zip([1,2], [3,4], [5,6]) = [(1, 3, 5), (2, 4, 6)] where (1, 3, 5) are the first elements from the parameters passed to zip and (2, 4, 6) are the second elements from the parameters passed to zip.
The * in front of *[iter(s)]*n converts the [iter(s)]*n from being a list into being multiple parameters being passed to zip. so if n is 2 we get zip(<listiterator object>, <listiterator object>)
zip will request the next element from each of its parameters, but because these are both references to the same iterator, this results in (1, 2); it does the same again, resulting in (3, 4), and again, resulting in (5, 6), and then there are no more elements, so it stops. Hence the result [(1, 2), (3, 4), (5, 6)]. This is the clustering of a data series into n-length groups mentioned in the docs.
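Putting it together at the interpreter:
>>> s = [1, 2, 3, 4, 5, 6]
>>> zip(*[iter(s)] * 2)
[(1, 2), (3, 4), (5, 6)]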
The expression from the docs looks like this:
zip(*[iter(s)]*n)
This is equivalent to:
it = iter(s)
zip(*[it, it, ..., it]) # n times
The [...]*n repeats the list n times, and this results in a list that contains n references to the same iterator.
This is again equal to:
it = iter(s)
zip(it, it, ..., it) # turning a list into positional parameters
The * before the list turns the list elements into positional parameters of the function call.
Now, when zip is called, it starts from left to right to call the iterators to obtain elements that should be grouped together. Since all parameters refer to the same iterator, this yields the first n elements of the initial sequence. Then that process continues for the second group in the resulting list, and so on.
The result is the same as if you had constructed the list like this (evaluated from left to right):
it = iter(s)
[(it.next(), it.next(), ..., it.next()), (it.next(), it.next(), ..., it.next()), ...]

A list vs. tuple situation in Python

Is there a situation where the use of a list leads to an error, and you must use a tuple instead?
I know something about the properties of both tuples and lists, but not enough to work out the answer to this question. If the question were the other way around, the answer would be that lists can be modified but tuples can't.
You can use tuples as dictionary keys, because they are immutable, but you can't use lists. E.g.:
d = {(1, 2): 'a', (3, 8, 1): 'b'} # Valid.
d = {[1, 2]: 'a', [3, 8, 1]: 'b'} # Error.
Because of their immutable nature, tuples (unlike lists) are hashable. This is what allows tuples to be keys in dictionaries and also members of sets. Strictly speaking it is their hashability, not their immutability that counts.
So in addition to the dictionary key answer already given, a couple of other things that will work for tuples but not lists are:
>>> hash((1, 2))
3713081631934410656
>>> set([(1, 2), (2, 3, 4), (1, 2)])
set([(1, 2), (2, 3, 4)])
In string formatting, tuples are mandatory:
"You have %s new %s" % ('5', 'mails') # must be a tuple, not a list!
Using a list in that example produces the error "not enough arguments for format string", because a list counts as a single argument. Weird but true.
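You can see this at the interpreter (Python 2):
>>> "You have %s new %s" % ['5', 'mails']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string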
