Selecting random elements in a list conditional on attribute - python

import random
class Agent:
    def __init__(self, state):
        self.state = state
# initialize values
state_0_agents = 10
state_1_agents = 10
numberofselections = 2  # number of agents who can choose to transition to the higher plane
# list of agents
agents = [Agent(0) for i in range(state_0_agents)]
agents.extend(Agent(1) for i in range(state_1_agents))
random.choice(agents)
I want to randomly select a couple of agents from this Agents list whose state I will end up changing to 1. Unfortunately the random.choice function selects among all the elements. However I want to randomly select only among those whose state is 0.
I would prefer if this could occur without creating a new list.

I see 3 options here:
Create a list anyway, you can do so with a list comprehension:
random.choice([a for a in agents if a.state == 0])
Put the random.choice() call in a loop, keep trying until you get one that matches the criteria:
while True:
    agent = random.choice(agents)
    if agent.state == 0:
        break
Index your agents list, then pick from that index; these are really just lists still:
agent_states_index = {}
for index, agent in enumerate(agents):
    agent_states_index.setdefault(agent.state, []).append(index)
agent_index = random.choice(agent_states_index[0])
agent = agents[agent_index]

There are four algorithms I know of for this.
The first is detailed in this answer. Iterate through the array; each time you come across an element that satisfies the condition, replace your current pick with it if a random number drawn uniformly from [0, 1) is less than 1/k, where k is the number of satisfying elements you have passed so far (including this one).
The second is to iterate through your array, adding to a new array elements that fulfill the condition, then randomly pick one out of that list.
Both of these algorithms run in O(n) time, where n is the size of the array. They are guaranteed to find an element if it is there and satisfies the condition.
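A minimal sketch of the first algorithm (single-pass selection without building a second list), assuming the condition is passed in as a function. The names random_matching, predicate and rng are made up for this illustration:

```python
import random

def random_matching(items, predicate, rng=random):
    """Return a uniformly random item satisfying predicate, or None."""
    chosen = None
    seen = 0  # number of matching items encountered so far
    for item in items:
        if predicate(item):
            seen += 1
            # Replace the current pick with probability 1/seen.
            # The first match is always taken, since random() < 1.0.
            if rng.random() < 1.0 / seen:
                chosen = item
    return chosen
```

With the agents list from the question this would be called as random_matching(agents, lambda a: a.state == 0).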
There are another two algorithms that are much faster. They both run in O(1) time but have some major weaknesses.
The first is to keep picking indexes randomly until you hit on one that satisfies the condition. Its worst-case running time is unbounded, but if a fraction p of the elements satisfy the condition, the expected number of tries is 1/p, so it is O(1) in practice as long as p is not tiny. (If very few elements satisfy the condition and the array is very large, something like 1 in 10000 elements, this becomes slower.) It also cannot tell you that no matching element exists: in that case you either have an infinite loop, or you have to limit the algorithm to a finite number of guesses, and then you might miss an element even if it is there.
The second is to pick a random index, then keep incrementing it until you find an index that satisfies the condition. It is guaranteed to either find an acceptable index or look through all of the indexes without entering into an infinite loop. It has the downside of not being completely random. Obviously, if you increment the index by 1 every time, it will be really, really nonrandom (if there are clumps of acceptable indexes in the array). However, if you choose the increment randomly from one of a handful of numbers that are coprime to the number of elements of the array, then it's still not fair and random, but is fairly fair and random, and guaranteed to succeed.
Again, these last 2 algorithms are very fast but are either not guaranteed to work or not guaranteed to be completely random. I don't know of an algorithm that is both fast, guaranteed to work, and completely fair and random.
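The second of the fast approaches could look roughly like the sketch below. It assumes the stride is drawn at random until it happens to be coprime to the list length (one possible reading of "a handful of numbers that are coprime"). The function name probe_for_match is hypothetical:

```python
import math
import random

def probe_for_match(items, predicate, rng=random):
    """Probe from a random start with a random stride coprime to len(items).

    Every index is visited exactly once, so the search always terminates,
    but the result is not uniformly distributed over the matching items.
    """
    n = len(items)
    if n == 0:
        return None
    start = rng.randrange(n)
    # Draw strides until one is coprime to n; stride 1 always qualifies,
    # so this loop ends with probability 1.
    while True:
        stride = rng.randrange(1, n + 1)
        if math.gcd(stride, n) == 1:
            break
    for k in range(n):
        candidate = items[(start + k * stride) % n]
        if predicate(candidate):
            return candidate
    return None
```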

Use numpy.where:
import numpy as np
class Agent:
    def __init__(self, state):
        self.state = state
#initialize values
state_0_agents = 10
state_1_agents = 10
#list of agents
agents = [0]*state_0_agents
agents += [1]*state_1_agents
# np.where returns a tuple of index arrays, so take element [0] before choosing
selected_agent_idx = np.random.choice(np.where(np.array(agents) == 0)[0])

You can also use numpy's nonzero function, which returns the indices where an array is nonzero. Combine it with np.random.choice to set a randomly chosen state-0 element of that list to 1:
import numpy as np
agents = np.array(agents)  # the comparison below needs an array, not a plain list
index_agent0 = np.nonzero(agents == 0)[0]
agents[np.random.choice(index_agent0)] = 1

Related

enumerate in a dictionary loop takes a long time; how to improve the speed

I am using python-3.x and I would like to speed up my code. In every loop I create new values and check whether they already exist in the dictionary; if a value exists, I keep the index where it was found. I am using enumerate for this, which is very clear but takes a long time. Is there a faster way to do this, or is enumerate the only option in my case? I am not sure whether using numpy would be better.
Here is my code:
# import numpy
import numpy as np
# my first array
my_array_1 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
my_array_1 = np.array(my_array_1)
# here I want to find the unique values from my_array_1
indx = np.unique(my_array_1, return_index=True, return_counts=True, axis=0)
# then save the result to a dictionary
dic_t = {"my_array_uniq": indx[0],  # unique values in my_array_1
         "counts": indx[2]}  # how many times each unique element appears in my_array_1
# here I want to create a random array 100 times
for i in range(100):
    print(i)
    # my 2nd array
    my_array_2 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
    my_array_2 = np.array(my_array_2)
    # I would like to check if the values in my_array_2 exist or not in the dictionary ("my_array_uniq": indx[0])
    # if a value exists then I want to hold its index number in the dictionary and
    # add 1 to dic_t["counts"], which means this value appeared again, and count how many times.
    # if it does not exist, then add this value to the dictionary ("my_array_uniq": indx[0])
    # and also add 1 to dic_t["counts"]
    for i, a in enumerate(my_array_2):
        ix = [k for k, j in enumerate(dic_t["my_array_uniq"]) if (a == j).all()]
        if ix:
            print(50*"*", i, "Yes", "at", ix[0])
            dic_t["counts"][ix[0]] += 1
        else:
            # print(50*"*", i, "No")
            dic_t["counts"] = np.hstack((dic_t["counts"], 1))
            dic_t["my_array_uniq"] = np.vstack((dic_t["my_array_uniq"], my_array_2[i]))
explanation:
1- I will create an initial array.
2- then I want to find the unique values, index and count from an initial array by using (np.unique).
3- saved the result to the dictionary (dic_t)
4- Then I want to start the loop by creating random values 100 times.
5- I would like to check if this random values in my_array_2 exist or not in the dictionary (my_array_uniq":indx[0])
6- if one of them exists then I want to hold the index number of that value in the dictionary.
7 - add 1 to the dic_t["counts"], which mean this value appears again and count how many.
8- if not exists, then add this value to the dic as new unique value (my_array_uniq":indx[0])
9 - also add 1 to the dic_t["counts"]
So from what I can see you are
Creating 256 evenly spaced numbers between -1000 and 1000 (np.linspace) and drawing from them at random
Generating 100 triplets from those (it could be fewer than 100 due to unique but with overwhelming probability it will be exactly 100)
Then doing pretty much the same thing 100 times and each time checking for each of the triplets in the new list whether they exist in the old list.
You're then trying to get a count of how often each element occurs.
I'm wondering why you're trying to do this, because it doesn't make much sense to me, but I'll give a few pointers:
There's no reason to make a dictionary dic_t if you're only going to hold two objects in it, just use two variables my_array_uniq and counts
You're dealing with triplets of floating point numbers. In the given range, that should give you about 10^48 different possible triplets (I may be wrong on the exact number but it's an absurdly large number either way). The way you're generating them does reduce the total phase-space a fair bit: with 256 possible values per coordinate there are 256^3 (about 1.7 × 10^7) possible triplets. That is still far more than the roughly 10^4 triplets you generate, so the probability of finding identical ones is low.
If you have a set of objects (in this case number triplets) and you want to determine whether you have seen a given one before, you want to use sets. Set elements must be hashable, so you want to turn your triplets into tuples. Determining whether a given triplet is already contained in your set is then an O(1) operation on average.
For counting the number of occurrences of something, collections.Counter is the natural data structure to use.
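As a rough sketch of the tuple-plus-Counter idea, assuming each batch is a 2-D array or list of rows. The helper name count_triplets is invented for this example:

```python
from collections import Counter

def count_triplets(batches):
    """Count how often each row occurs across an iterable of 2-D batches."""
    counts = Counter()
    for batch in batches:
        # Rows (lists or NumPy rows) are unhashable; tuples can be dict keys.
        counts.update(tuple(row) for row in batch)
    return counts
```

In the question's code this would replace the inner enumerate loop, for example by calling count_triplets on my_array_1 followed by each freshly generated my_array_2. Each membership test then costs O(1) on average instead of a scan over all unique rows.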

Best way to hash ordered permutation of [1,9]

I'm trying to implement a method to keep the visited states of the 8 puzzle from being generated again.
My initial approach was to save each visited pattern in a list and do a linear check each time the algorithm wants to generate a child.
Now I want to do this in O(1) time through list access. Each pattern in the 8 puzzle is an ordered permutation of the numbers 1 to 9 (9 being the blank block), for example 125346987 is:
1 2 5
3 4 6
_ 8 7
The number of all possible permutations of this kind is 9! = 362,880. What is the best way to hash these numbers to indexes of a list of that size?
You can map a permutation of N items to its index in the list of all permutations of N items (ordered lexicographically).
Here's some code that does this, and a demonstration that it produces indexes 0 to 23 once each for all permutations of a 4-letter sequence.
import itertools
def fact(n):
    r = 1
    for i in range(n):
        r *= i + 1
    return r
def I(perm):
    if len(perm) == 1:
        return 0
    return sum(p < perm[0] for p in perm) * fact(len(perm) - 1) + I(perm[1:])
for p in itertools.permutations('abcd'):
    print(p, I(p))
The best way to understand the code is to prove its correctness. For an array of length n, there's (n-1)! permutations with the smallest element of the array appearing first, (n-1)! permutations with the second smallest element appearing first, and so on.
So, to find the index of a given permutation, count how many items are smaller than the first thing in the permutation and multiply that by (n-1)!. Then recursively add the index of the remainder of the permutation, considered as a permutation of (n-1) elements. The base case is when you have a permutation of length 1. Obviously there's only one such permutation, so its index is 0.
A worked example: [1324].
[1324]: 1 appears first, and that's the smallest element in the array, so that gives 0 * (3!)
Removing 1 gives us [324]. The first element is 3. There's one element that's smaller, so that gives us 1 * (2!).
Removing 3 gives us [24]. The first element is 2. That's the smallest element remaining, so that gives us 0 * (1!).
Removing 2 gives us [4]. There's only one element, so we use the base case and get 0.
Adding up, we get 0*3! + 1*2! + 0*1! + 0 = 1*2! = 2. So [1324] is at index 2 in the sorted list of permutations of 4 items. That's correct, because at index 0 is [1234], index 1 is [1243], and the lexicographically next permutation is our [1324].
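The same counting argument can also be written without recursion. The following is only a sketch of that variant (the name permutation_rank is made up here). It accepts any sequence of distinct, comparable items, such as a tuple of digits or a string:

```python
from math import factorial

def permutation_rank(perm):
    """Lexicographic index of perm among all permutations of its elements."""
    perm = list(perm)
    n = len(perm)
    rank = 0
    for i, head in enumerate(perm):
        # Count the later elements that are smaller than perm[i]; each of
        # them would start a block of (n - 1 - i)! earlier permutations.
        smaller = sum(1 for x in perm[i + 1:] if x < head)
        rank += smaller * factorial(n - 1 - i)
    return rank
```

For the worked example, permutation_rank("1324") gives 2, and an 8-puzzle state such as (1, 2, 5, 3, 4, 6, 9, 8, 7) maps to an index between 0 and 9! - 1.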
I believe you're asking for a function to map permutations to array indices. This dictionary maps all permutations of numbers 1-9 to values from 0 to 9!-1.
import itertools
index = itertools.count(0)
permutations = itertools.permutations(range(1, 10))
hashes = {h:next(index) for h in permutations}
For example, hashes[(1,2,5,3,4,6,9,8,7)] gives a value of 1445.
If you need them in strings instead of tuples, use:
permutations = [''.join(x) for x in itertools.permutations('123456789')]
or as integers:
permutations = [int(''.join(x)) for x in itertools.permutations('123456789')]
It looks like you are only interested in whether or not you have already visited the permutation.
You should use a set. It grants the O(1) look-up you are interested in.
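A small sketch of that idea, assuming states are stored as strings such as '125346987' (tuples would work equally well). The names visited and first_visit are illustrative only:

```python
visited = set()

def first_visit(state):
    """Return True the first time a state is seen, False afterwards."""
    if state in visited:
        return False
    visited.add(state)
    return True
```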
A structure that is efficient in both space and lookup time for this problem is a trie, since permutations that share a prefix also share the storage for that prefix.
i.e. the space used for "123" in 1234, and in 1235 will be the same.
Let's assume 0 as replacement for '_' in your example for simplicity.
Storing
Your trie will be a tree of booleans. The root node is an empty node, and each node contains 9 children, one for each of the digits 0 to 8 (0 standing for '_'), each with a boolean flag set to false.
You can create the trie on the go, as you encounter a permutation, and store the encountered digits as boolean in the trie by setting the bool as true.
Lookup
The trie is traversed from root to children based on the digits of the permutation, and if the nodes have been marked as true, that means the permutation has occurred before. The complexity of lookup is just 9 node hops.
Here is how the trie would look for a 4 digit example:
[Figure: Python trie]
This trie can be easily stored in a list of booleans, say myList.
Where myList[0] is the root, as explained in the concept here :
https://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The final trie in a list would be around 9+9^2+9^3....9^8 bits i.e. less than 10 MB for all lookups.
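For illustration, here is a sketch of the same idea that builds the trie on the go. Note that it swaps the flat list of booleans for nested dictionaries, which is simpler to write but uses more memory. The function name trie_insert and the "visited" key are assumptions of this sketch:

```python
def trie_insert(root, pattern):
    """Mark pattern as visited; return True if it had been seen before."""
    node = root
    for digit in pattern:
        # Create the child for this digit the first time it is needed.
        node = node.setdefault(digit, {})
    seen = node.get("visited", False)
    node["visited"] = True
    return seen
```

Starting from root = {}, the first call with a given pattern returns False and every later call with the same pattern returns True.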
Use
I've developed a heuristic function for this specific case. It is not a perfect hashing, as the mapping is not between [0,9!-1] but between [1,767359], but it is O(1).
Let's assume we already have a file / reserved memory / whatever with 767359 bits set to 0 (e.g., mem = [False] * 767359). Let a 8puzzle pattern be mapped to a python string (e.g., '125346987'). Then, the hash function is determined by:
def getPosition( input_str ):
    data = []
    opts = list(range(1,10))  # list() so that pop() also works in Python 3
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind)+str(n)
    output = int(output_str)
    return output
I.e., in order to check if a 8puzzle pattern = 125346987 has already been used, we need to:
pattern = '125346987'
pos = getPosition(pattern)
used = mem[pos-1] #mem starts in 0, getPosition in 1.
With a perfect hashing we would have needed 9! bits to store the booleans. In this case we need 2x more (767359/9! = 2.11), but recall that it is not even 1Mb (barely 100KB).
Note that the function is easily invertible.
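To illustrate the invertibility claim, here is one way the inverse might be written. get_position is a Python 3 port of the function above, while get_pattern is a hypothetical helper that is not part of the original answer:

```python
def get_position(pattern):
    """Python 3 port of getPosition: map a 9-digit pattern to its position."""
    opts = list(range(1, 10))
    n = int(pattern[0])
    opts.remove(n)
    data = []
    for c in pattern[1:-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = (data[3] << 14 | data[5] << 12 | data[2] << 9 | data[1] << 6
           | data[0] << 3 | data[4] << 1 | data[6])
    return int(str(ind) + str(n))

def get_pattern(position):
    """Hypothetical inverse: rebuild the pattern string from its position."""
    # The last decimal digit is the first puzzle digit; the rest is ind.
    ind, n = divmod(position, 10)
    data = [
        (ind >> 3) & 0b111,   # data[0]
        (ind >> 6) & 0b111,   # data[1]
        (ind >> 9) & 0b111,   # data[2]
        (ind >> 14) & 0b111,  # data[3]
        (ind >> 1) & 0b11,    # data[4]
        (ind >> 12) & 0b11,   # data[5]
        ind & 0b1,            # data[6]
    ]
    opts = list(range(1, 10))
    opts.remove(n)
    digits = [n]
    for k in data:
        digits.append(opts.pop(k))
    digits.append(opts[0])  # the single digit left over
    return "".join(map(str, digits))
```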
Check
I could prove you mathematically why this works and why there won't be any collision, but since this is a programming forum let's just run it for every possible permutation and check that all the hash values (positions) are indeed different:
def getPosition( input_str ):
    data = []
    opts = list(range(1,10))  # list() so that pop() also works in Python 3
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind)+str(n)
    output = int(output_str)
    return output
#CHECKING PURPOSES
def addperm(x,l):
    return [ l[0:i] + [x] + l[i:] for i in range(len(l)+1) ]
def perm(l):
    if len(l) == 0:
        return [[]]
    return [x for y in perm(l[1:]) for x in addperm(l[0],y) ]
#We generate all the permutations
all_perms = perm([ i for i in range(1,10)])
print("Number of all possible perms.: "+str(len(all_perms))) #indeed 9! = 362880
#We execute our hash function over all the perms and store the output.
all_positions = []
for permutation in all_perms:
    perm_string = ''.join(map(str,permutation))
    all_positions.append(getPosition(perm_string))
#We want to check if there has been any collision, i.e., if there
#is one position that is repeated at least twice.
print("Number of different hashes: "+str(len(set(all_positions))))
#also 9!, so the hash works properly.
How does it work?
The idea behind this has to do with a tree: at the beginning it has 9 branches going to 9 nodes, each corresponding to a digit. From each of these nodes we have 8 branches going to 8 nodes, each corresponding to a digit except its parent, then 7, and so on.
We first store the first digit of our input string in a separate variable and pop it out from our 'node' list, because we have already taken the branch corresponding to the first digit.
Then we have 8 branches, we choose the one corresponding with our second digit. Note that, since there are 8 branches, we need 3 bits to store the index of our chosen branch and the maximum value it can take is 111 for the 8th branch (we map branch 1-8 to binary 000-111). Once we have chosen and store the branch index, we pop that value out, so that the next node list doesn't include again this digit.
We proceed in the same way for branches 7, 6 and 5. Note that when we have 7 branches we still need 3 bits, though the maximum value will be 110. When we have 5 branches, the index will be at most binary 100.
Then we get to 4 branches and we notice that this can be stored just with 2 bits, same for 3 branches. For 2 branches we will just need 1bit, and for the last branch we don't need any bit: there will be just one branch pointing to the last digit, which will be the remaining from our 1-9 original list.
So, what we have so far: the first digit stored in a separated variable and a list of 7 indexes representing branches. The first 4 indexes can be represented with 3bits, the following 2 indexes can be represented with 2bits and the last index with 1bit.
The idea is to concatenate all these indexes in their bit form to create a larger number. Since we have 17 bits, this number will be at most 2^17 - 1 = 131071. Now we just append the first digit we had stored to the end of that number (at most this digit will be 9), so the biggest number we can create is 1310719.
But we can do better: recall that when we had 5 branches we needed 3 bits, though the maximum value was binary 100. What if we arrange our bits so that those with more 0s come first? If so, in the worst case scenario our final bit number will be the concatenation of:
100 10 101 110 111 11 1
Which in decimal is 76735. Then we proceed as before (adding the 9 at the end) and we get that our biggest possible generated number is 767359, which is the amount of bits we need and corresponds to input string 987654321, while the lowest possible number is 1, which corresponds to input string 123456789.
Just to finish: one might wonder why we have stored the first digit in a separate variable and added it at the end. The reason is that if we had kept it then the number of branches at the beginning would have been 9, so for storing the first index (1-9) we would have needed 4 bits (0000 to 1000), which would have made our mapping much less efficient, as in that case the biggest possible number (and therefore the amount of memory needed) would have been
1000 100 10 101 110 111 11 1
which is 1125311 in decimal (1.13Mb vs 768Kb). It is quite interesting to see that the ratio 1.13M/0.768M = 1.47 has something to do with the ratio of the four bits compared to just adding a decimal value (2^4/10 = 1.6), which makes a lot of sense (the difference is due to the fact that with the first approach we are not fully using the 4 bits).
First. There is nothing faster than a list of booleans. There's a total of 9! == 362880 possible permutations for your task, which is a reasonably small amount of data to store in memory:
visited_states = [False] * math.factorial(9)
Alternatively, you can use an array of bytes, which is slightly slower (not by much though) and has a much lower memory footprint (by an order of magnitude at least). However, any memory savings from using an array will probably be of little value considering the next step.
Second. You need to convert your specific permutation to its index. There are algorithms which do this; one of the best StackOverflow questions on this topic is probably this one:
Finding the index of a given permutation
You have fixed permutation size n == 9, so whatever complexity an algorithm has, it will be equivalent to O(1) in your situation.
However to produce even faster results, you can pre-populate a mapping dictionary which will give you an O(1) lookup:
all_permutations = map(lambda p: ''.join(p), itertools.permutations('123456789'))
permutation_index = dict((perm, index) for index, perm in enumerate(all_permutations))
This dictionary will consume about 50 Mb of memory, which is... not that much actually. Especially since you only need to create it once.
After all this is done, checking your specific combination is done with:
visited = visited_states[permutation_index['168249357']]
Marking it to visited is done in the same manner:
visited_states[permutation_index['168249357']] = True
Note that using any of the permutation index algorithms will be much slower than the mapping dictionary. Most of those algorithms are of O(n^2) complexity, which in your case means roughly 81 times as much work per lookup, even discounting the extra Python code itself. So unless you have heavy memory constraints, using the mapping dictionary is probably the best solution speed-wise.
Addendum. As has been pointed out by Palec, visited_states list is actually not needed at all - it's perfectly possible to store True/False values directly in the permutation_index dictionary, which saves some memory and an extra list lookup.
Notice if you type hash(125346987) it returns 125346987. That is for a reason, because there is no point in hashing an integer to anything other than an integer.
What you should do is, when you find a pattern, add it to a dictionary rather than a list. This will provide the fast lookup you need rather than traversing the list like you are doing now.
So say you find the pattern 125346987 you can do:
foundPatterns = {}
#some code to find the pattern
foundPatterns[125346987] = True  # store the pattern as the key: key lookups are O(1) on average
#more code
#test if there?
125346987 in foundPatterns
True
If you must always have O(1), then it seems like a bit array would do the job. You'd only need to store 363,000 elements, which seems doable. Though note that in practice it's not always faster. The simplest implementation looks like:
Create data structure
visited_bitset = [False for _ in range(363000)]
Test current state and add if not visited yet
if not visited_bitset[current_state]:
    visited_bitset[current_state] = True
Paul's answer might work.
Elisha's answer is a perfectly valid hash function that would guarantee that no collisions happen. 9! would be the bare minimum for a guaranteed no-collision hash function, but (unless someone corrects me, Paul probably has) I don't believe there exists a function to map each board to a value in the domain [0, 9!], let alone a hash function that is nothing more than O(1).
If you have 1GB of memory to support a Boolean array of 864197532 (aka 987654321-123456789) indices, you guarantee (computationally) the O(1) requirement.
Practically speaking (meaning when you run on a real system) this isn't going to be cache friendly, but on paper this solution will definitely work. Even if a perfect function did exist, I doubt it would be cache friendly either.
Using prebuilts like set or hashmap (sorry I haven't programmed Python in a while, so don't remember the datatype) must have an amortized O(1). But using one of these with a suboptimal hash function like n % RANDOM_PRIME_NUM_GREATER_THAN_100000 might give the best solution.

Given a list L labeled 1 to N, and a process that "removes" a random element from consideration, how can one efficiently keep track of min(L)?

The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item will take O(lg n) (instead of O(n) for finding/removing from a simple list). A smallest item will always appear in the root of the tree.
This question may be useful. It provides links to several implementation of various balanced binary search trees. The advice to use a hash table does not apply well to your case, since it does not address maintaining a minimum item.
Here's a solution that needs O(N lg N) preprocessing time, O(lg N) time per delete and O(lg N * lg N) time per findMin.
Preprocessing:
step 1: sort the L
step 2: for each item L[i], map L[i]->i
step 3: Build a Binary Indexed Tree or segment tree where for every 1<=i<=length of L, BIT[i]=1 and keep the sum of the ranges.
Query type delete:
Step 1: if an item x is said to be removed, with a binary search on array L (where L is sorted) or from the mapping find its index. set BIT[index[x]] = 0 and update all the ranges. Runtime: O(lg N)
Query type findMin:
Step 1: do a binary search over array L. For every mid, find the sum on BIT from 1 to mid. If that sum is > 0 then we know some value at a position <= mid is still alive, so we remember mid and set hi=mid-1; otherwise we set low=mid+1. Runtime: O(lg^2 N)
Same can be done with Segment tree.
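A compact sketch of the Binary Indexed Tree variant, assuming positions 1..n refer to the sorted list (for L = [1, ..., N] a position equals its value). The class and method names are invented for this example:

```python
class AliveTracker:
    """Fenwick tree over positions 1..n; position i holds 1 while alive."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)
        for i in range(1, n + 1):
            self._add(i, 1)

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix_sum(self, i):
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def delete(self, i):
        """Remove position i (assumed to be alive) in O(log n)."""
        self._add(i, -1)

    def find_min(self):
        """Smallest alive position, or None; binary search on prefix sums."""
        lo, hi, answer = 1, self.n, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if self._prefix_sum(mid) > 0:
                answer, hi = mid, mid - 1
            else:
                lo = mid + 1
        return answer
```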
Edit: If I'm not wrong, each query can be processed in O(1) with a linked list.
If sorting isn't in your best interest, I would suggest only do comparisons where you need to do them. If you remove elements that are not the old minimum, and you aren't inserting any new elements, there isn't a re-scan necessary for a minimum value.
Can you give us some more information about the processing going on that you are trying to do?
Comment answer: You don't have to compute min(L). Just keep track of its index and then only re-run the scan for min(L) when you remove at(or below) the old index (and make sure you track it accordingly).
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).
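The rescan-only-when-needed approach could be sketched as follows, assuming removals are given by index and items are flagged as removed rather than physically deleted. MinTracker is a made-up name:

```python
class MinTracker:
    """Track the minimum of a fixed list under removals, rescanning lazily."""

    def __init__(self, values):
        self.values = list(values)
        self.alive = [True] * len(self.values)
        self.min_index = self._scan()

    def _scan(self):
        # Full O(n) pass over the items that have not been removed.
        best = None
        for i, v in enumerate(self.values):
            if self.alive[i] and (best is None or v < self.values[best]):
                best = i
        return best

    def remove(self, index):
        """Flag values[index] as removed (assumed to be alive)."""
        self.alive[index] = False
        # Only removing the current minimum forces a rescan.
        if index == self.min_index:
            self.min_index = self._scan()

    def minimum(self):
        return None if self.min_index is None else self.values[self.min_index]
```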

Retrieve List Position Of Highest Value?

There is a list with float values, which can differ or not. How can I find the randomly chosen list-index of one of the highest values in this list?
If the context is interesting to you:
I am trying to write a solver for the pen & paper game Battleship. I attempt to calculate probabilities for a hit on each of the fields and then want the solver to shoot at one of the most likely spots, which means retrieving the index of the highest likelihood in my likelihood list and then telling the game engine this index as my choice. Already the first move shows that it can happen that there are a lot of fields with the same likelihood. In this case it makes sense to choose one of them at random (and not just always take the first or anything like that).
Find all positions of the maximum (see How to find all positions of the maximum value in a list?), then pick one of them at random using random.choice.
>>> m = max(a)
>>> max_pos = [i for i, j in enumerate(a) if j == m]
>>> random.choice(max_pos)

"sorted 1-d iterator" based on "2-d iterator" (Cartesian product of iterators)

I'm looking for a clean way to do this in Python:
Let's say I have two iterators "iter1" and "iter2": perhaps a prime number generator, and itertools.count(). I know a priori that both are infinite and monotonically increasing. Now I want to take some simple operation of two args "op" (perhaps operator.add or operator.mul), and calculate every element of the first iterator with every element of the next, using said operation, then yield them one at a time, sorted. Obviously, this is an infinite sequence itself. (As mentioned in a comment by @RyanThompson: this would be called the Cartesian Product of these sequences... or, more exactly, the 1d-sort of that product.)
What is the best way to:
wrap-up "iter1", "iter2", and "op" in an iterable that itself yields the values in monotonically increasing output.
Allowable simplifying assumptions:
If it helps, we can assume op(a,b) >= a and op(a,b) >= b.
If it helps, we can assume op(a,b) > op(a,c) for all b > c.
Also allowable:
Also acceptable would be an iterator that yields values in "generally increasing" order... by which I mean the iterable could occasionally give me a number less than a previous one, but it would somehow make "side information" available (as via a method on the object) that would say "I'm not guaranteeing the next value I give you will be greater than the one I just gave you, but I AM SURE that all future values will at least be greater than N.".... and "N" itself is monotonically increasing.
The only way I can think to do this is a sort of "diagonalization" process, where I keep an increasing number of partially processed iterables around, and "look ahead" for the minimum of all the possible next() values, and yield that. But this weird agglomeration of a heapq and a bunch of deques just seems outlandish, even before I start to code it.
Please: do not base your answer on the fact that my examples mentioned primes or count().... I have several uses for this very concept that are NOT related to primes and count().
UPDATE: OMG! What a great discussion! And some great answers with really thorough explanations. Thanks so much. StackOverflow rocks; you guys rock.
I'm going to delve in to each answer more thoroughly soon, and give the sample code a kick in the tires. From what I've read so far, my original suspicions are confirmed that there is no "simple Python idiom" to do this. Rather, by one way or another, I can't escape keeping all yielded values of iter1 and iter2 around indefinitely.
FWIW: here's an official "test case" if you want to try your solutions.
import operator
def powers_of_ten():
    n = 0
    while True:
        yield 10**n
        n += 1
def series_of_nines():
    yield 1
    n = 1
    while True:
        yield int("9"*n)
        n += 1
op = operator.mul
iter1 = powers_of_ten()
iter2 = series_of_nines()
# given (iter1, iter2, op), create an iterator that yields:
# [1, 9, 10, 90, 99, 100, 900, 990, 999, 1000, 9000, 9900, 9990, 9999, 10000, ...]
import heapq
import itertools
import operator
def increasing(fn, left, right):
    """
    Given two never decreasing iterators produce another iterator
    resulting from passing the value from left and right to fn.
    This iterator should also be never decreasing.
    """
    # Imagine an infinite 2D-grid.
    # Each column corresponds to an entry from right
    # Each row corresponds to an entry from left
    # Each cell corresponds to applying fn to those two values
    # If the number of columns were finite, then we could easily solve
    # this problem by keeping track of our current position in each column
    # in each iteration, we'd take the smallest value, report it, and then
    # move down in that column. This works because the values must increase
    # as we move down the column. That means the current set of values
    # under consideration must include the lowest value not yet reported
    # To extend this to infinite columns, at any point we always track a finite
    # number of columns. The last column currently tracked is always in the top row
    # if it moves down from the top row, we add a new column which starts at the top row
    # because the values are increasing as we move to the right, we know that
    # this last column is always lower than any columns that come after it
    # Due to infinities, we need to keep track of all
    # items we've ever seen. So we put them in this list
    # The list contains the first part of the incoming iterators that
    # we have explored
    left_items = [next(left)]
    right_items = [next(right)]
    # we use a heap data structure, it allows us to efficiently
    # find the lowest of all values under consideration
    heap = []
    def add_value(left_index, right_index):
        """
        Add the value resulting from combining the indexed items
        from the two iterators. Assumes that the values have already
        been copied into the lists
        """
        value = fn( left_items[left_index], right_items[right_index] )
        # the value on the heap has the index and value.
        # since the value is first, low values will be "first" on the heap
        heapq.heappush( heap, (value, left_index, right_index) )
    # we know that every other value must be larger than
    # this one.
    add_value(0,0)
    # I assume the incoming iterators are infinite
    while True:
        # fetch the lowest of all values under consideration
        value, left_index, right_index = heapq.heappop(heap)
        # produce it
        yield value
        # add moving down the column
        if left_index + 1 == len(left_items):
            left_items.append(next(left))
        add_value(left_index+1, right_index)
        # if this was the first row in this column, add another column
        if left_index == 0:
            right_items.append( next(right) )
            add_value(0, right_index+1)
def fib():
    a = 1
    b = 1
    while True:
        yield a
        a,b = b,a+b
r = increasing(operator.add, fib(), itertools.count() )
for x in range(100):
    print(next(r))
Define the sequences as:
a1 <= a2 <= a3 ...
b1 <= b2 <= b3 ...
Let a1b1 mean op(a1,b1) for short.
Based on your allowable assumptions (very important) you know the following:
max(a1, b1) <= a1b1 <= a1b2 <= a1b3 ...
<=
max(a2, b1) <= a2b1 <= a2b2 <= a2b3 ...
<=
max(a3, b1) <= a3b1 <= a3b2 <= a3b3 ...
. .
. .
. .
You'll have to do something like:
Generate a1b1. You know that if you continue increasing the b variables, you will only get higher values. The question now is: is there a smaller value to be found by increasing the a variables? Your lower bound for axb1 is max(ax, b1), so you will have to increase the a values until max(ax, b1) >= a1b1. Once you hit that point, you can find the smallest value from anb1 where 1 <= n <= x and yield that safely.
You then will have multiple horizontal chains that you'll have to keep track of. Every time you have a value that goes past max(ax, b1), you'll have to increase x (adding more chains) until max(ax, b1) is larger than it before safely emitting it.
Just a starting point... I don't have time to code it at the moment.
EDIT: Oh heh that's exactly what you already had. Well, without more info, this is all you can do, as I'm pretty sure that mathematically, that's what is necessary.
EDIT2: As for your 'acceptable' solution: you can just yield a1bn in increasing order of n, returning min(a1,b1) as N =P. You'll need to be more specific. You speak as if you have a heuristic of what you generally want to see, the general way you want to progress through both iterables, but without telling us what it is I don't know how one could do better.
UPDATE: Winston's is good but makes an assumption that the poster didn't mention: that op(b,c) > op(a,c) if b>a. However, we do know that op(a,b)>=a and op(a,b)>=b.
Here is my solution which takes that second assumption but not the one Winston took. Props to him for the code structure, though:
def increasing(fn, left, right):
    left_items = [next(left)]
    right_items = [next(right)]
    #columns are (column value, right index)
    columns = [(fn(left_items[0],right_items[0]),0)]
    while True:
        #find the current smallest value
        min_col_index = min(range(len(columns)), key=lambda i:columns[i][0])
        #generate columns until it's impossible to get a smaller value
        while right_items[0] <= columns[min_col_index][0] and \
              left_items[-1] <= columns[min_col_index][0]:
            next_left = next(left)
            left_items.append(next_left)
            columns.append((fn(next_left, right_items[0]),0))
            if columns[-1][0] < columns[min_col_index][0]:
                min_col_index = len(columns)-1
        #yield the smallest value
        yield columns[min_col_index][0]
        #move down that column
        val, right_index = columns[min_col_index]
        #make sure that right value is generated:
        while right_index+1 >= len(right_items):
            right_items.append(next(right))
        columns[min_col_index] = (fn(left_items[min_col_index],right_items[right_index+1]),
                                  right_index+1)
        #repeat
For a (pathological) input that demonstrates the difference, consider:
def pathological_one():
    cur = 0
    while True:
        yield cur
        cur += 100

def pathological_two():
    cur = 0
    while True:
        yield cur
        cur += 100

lookup = [
    [1, 666, 500],
    [666, 666, 666],
    [666, 666, 666],
    [666, 666, 666]]

def pathological_op(a, b):
    if a >= 300 or b >= 400: return 1005
    # integer division, so the result is a valid list index
    return lookup[b//100][a//100]

r = increasing(pathological_op, pathological_one(), pathological_two())
for x in range(15):
    print(next(r))
Winston's answer gives:
>>>
1
666
666
666
666
500
666
666
666
666
666
666
1005
1005
1005
While mine gives:
>>>
1
500
666
666
666
666
666
666
666
666
666
666
1005
1005
1005
Let me start with an example of how I would solve this intuitively.
Because reading code inline is a little tedious, I'll introduce some notation:
Notation
i1 will represent iter1, and i1x will represent the element of iter1 at index x, so i10 (i1 followed by index 0) is its first element. The same goes for iter2.
※ will represent the op operator
Intuitive solution
By using simplifying assumption 2, we know that i10 ※ i20 is the smallest element that will ever be yielded from your final iterator. The next element would be the smaller of i10 ※ i21 and i11 ※ i20.
Assuming i10 ※ i21 is smaller, you would yield that element. Next, you would yield the smallest of i11 ※ i20, i10 ※ i22, and i11 ※ i21.
Expression as traversal of a DAG
What you have here is a graph traversal problem. First, think of the problem as a tree. The root of the tree is i10 ※ i20. This node, and each node below it, has two children. The two children of i1x ※ i2y are the following: one child is i1(x+1) ※ i2y, and the other child is i1x ※ i2(y+1). Based on your second assumption, we know that i1x ※ i2y is less than both of its children.
(In fact, as Ryan mentions in a comment, this is a directed acyclic graph, or DAG. Some "parents" share "children" with other "parent" nodes.)
Now, we need to keep a frontier - a collection of nodes that could be next to be returned. After returning a node, we add both its children to the frontier. To select the next node to visit (and return from your new iterator), we compare the values of all the nodes in the frontier. We take the node with the smallest value and we return it. Then, we again add both of its child nodes to the frontier. If the child is already in the frontier (added as the child of some other parent), just ignore it.
Storing the frontier
Because you are primarily interested in the value of the nodes, it makes sense to store these nodes indexed by value. As such, it may be in your interest to use a dict. Keys in this dict should be the values of nodes. Values in this dict should be lists containing individual nodes. Because the only identifying information in a node is the pair of operands, you can store individual nodes as a two-tuple of operands.
In practice, after a few iterations, your frontier may look like the following:
>>> frontier
{1: [(2, 3), (2, 4)], 2: [(3, 5), (5, 4)], 3: [(1, 6)], 4: [(6, 3)]}
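As a rough illustration of the frontier idea, here is a minimal sketch. It swaps the dict described above for a heap (`heapq`), which keeps the smallest node at the front, and it stores index pairs rather than operand pairs so the cached inputs can be looked up. The names `sorted_combinations`, `cache1`, `cache2`, and `seen` are made up for this example, and it assumes both inputs are infinite, non-decreasing, and that the operator never decreases when either operand grows.

```python
import heapq
import itertools
import operator

def sorted_combinations(op, iter1, iter2):
    """Yield op(a, b) for all pairs in non-decreasing order.

    Assumes infinite, non-decreasing inputs and an op that is
    non-decreasing in each argument.
    """
    cache1 = [next(iter1)]  # every value read so far from iter1
    cache2 = [next(iter2)]  # every value read so far from iter2
    frontier = [(op(cache1[0], cache2[0]), 0, 0)]  # heap of (value, x, y)
    seen = {(0, 0)}  # nodes already pushed, so shared children are added once

    while frontier:
        value, x, y = heapq.heappop(frontier)
        yield value
        for cx, cy in ((x + 1, y), (x, y + 1)):
            if (cx, cy) in seen:
                continue
            # Read one more input value if the child needs an unseen index.
            if cx == len(cache1):
                cache1.append(next(iter1))
            if cy == len(cache2):
                cache2.append(next(iter2))
            seen.add((cx, cy))
            heapq.heappush(frontier, (op(cache1[cx], cache2[cy]), cx, cy))

# Sorted multiplication table of the positive integers, duplicates included.
gen = sorted_combinations(operator.mul, itertools.count(1), itertools.count(1))
print(list(itertools.islice(gen, 10)))  # [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
```

Note that `seen`, the heap, and both caches all keep growing in this sketch, which is the unbounded storage the other answers warn about.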
Other implementation notes
Because iterators don't support random access, you'll need to hang on to values that are produced by your first two iterators until they are no longer needed. You'll know that a value is still needed if it is referenced by any value in your frontier. You'll know that a value is no longer needed once all nodes in the frontier reference values later/greater than one you've stored. For example, i1(20) is no longer needed when nodes in your frontier reference only i1(21), i1(25), i1(33), ...
As mentioned by Ryan, each value from each iterator will be used an infinite number of times. Thus, every value produced will need to be saved.
Not practical
Unfortunately, in order to ensure that elements are returned only in increasing order, the frontier will grow without bound. Your memoized values will also take a significant amount of space and grow without bound. This may be something you can address by making your problem less general, but this should be a good starting point.
So you basically want to take two monotonically increasing sequences, and then (lazily) compute the multiplication (or addition, or another operation) table between them, which is a 2-D array. Then you want to put the elements of that 2-D array in sorted order and iterate through them.
In general, this is impossible. However, if your sequences and operation are such that you can make certain guarantees about the rows and columns of the table, then you can make some progress. For example, let's assume that your sequences are monotonically increasing sequences of positive integers only, and that the operation is multiplication (as in your example). Then every row and column of the array is a monotonically increasing sequence, so you do not need to compute the entire array, only parts of it. Specifically, you must keep track of the following:
How many rows you have ever used
The number of elements you have taken from each row that you have used
Every element from either input sequence that you have ever used, plus one more from each
To compute the next element in your iterator, you must do the following:
For each row that you have ever used, compute the "next" value in that row. For example, if you have used 5 values from row 1, then compute the 6th value (i=1, j=6) by taking the 1st value from the first sequence and the 6th value from the second sequence (both of which you have cached) and applying the operation (multiplication) to them. Also, compute the first value in the first unused row.
Take the minimum of all the values you computed. Yield this value as the next element in your iterator
Increment the counter for the row from which you sampled the element in the previous step. If you took the element from a new, unused row, you must increment the count of the number of rows you have used, and you must create a new counter for that row initialized to 1. If necessary, you must also compute more values of one or both input sequences.
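The steps above could be sketched roughly as follows. This is an illustration under the stated assumptions (every row and column of the table is non-decreasing), not a tuned implementation: the names `sorted_table`, `taken`, and `get` are invented here, and each step rescans every used row instead of using a smarter structure.

```python
import itertools
import operator

def sorted_table(op, seq1, seq2):
    """Yield the entries of the (lazy) op table in non-decreasing order.

    Row i uses the i-th value of seq1, column j the j-th value of seq2.
    Assumes every row and every column of the table is non-decreasing.
    """
    cache1, cache2 = [], []  # every input value read so far
    taken = []               # taken[i] = how many elements row i has yielded

    def get(cache, it, k):
        # Read from the iterator until index k is cached, then return it.
        while len(cache) <= k:
            cache.append(next(it))
        return cache[k]

    while True:
        # The "next" value of every used row ...
        candidates = [
            (op(get(cache1, seq1, i), get(cache2, seq2, j)), i)
            for i, j in enumerate(taken)
        ]
        # ... plus the first value of the first unused row.
        new_row = len(taken)
        candidates.append((op(get(cache1, seq1, new_row), get(cache2, seq2, 0)), new_row))

        value, row = min(candidates)
        if row == new_row:
            taken.append(1)   # start a counter for the newly used row
        else:
            taken[row] += 1
        yield value

gen = sorted_table(operator.mul, itertools.count(1), itertools.count(1))
print(list(itertools.islice(gen, 10)))  # [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
```

The state kept here is exactly the three items listed earlier: the number of used rows (`len(taken)`), the per-row counters, and the cached input values.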
This process is kind of complex, and in particular notice that to compute N values, you must in the worst case save an amount of state proportional to the square root of N. (Edit: sqrt(N) is actually the best case.) This is in stark contrast to a typical generator, which only requires constant space to iterate through its elements regardless of length.
In summary, you can do this under certain assumptions, and you can provide a generator-like interface to it, but it cannot be done in a "streaming" fashion, because you need to save a lot of state in order to iterate through the elements in the correct order.
Use generators, which are just iterators written as functions that yield results. In this case you can write generators for iter1 and iter2 and another generator to wrap them and yield their results (or do calculations with them, or the history of their results) as you go.
From my reading of the question, you want something like this: a way to combine every element of the first iterator with every element of the next, using said operation. You also state that you want some way to wrap up "iter1", "iter2", and "op" in an iterable that itself yields values in monotonically increasing order. I propose that generators offer a simple solution to this problem.
import itertools
import operator

def prime_gen():
    D, q = {}, 2
    while True:
        if q not in D:
            yield q
            D[q * q] = [q]
        else:
            for p in D[q]:
                D.setdefault(p + q, []).append(p)
            del D[q]
        q += 1

def infinite_gen(op, iter1, iter2):
    while True:
        yield op(next(iter1), next(iter2))

>>> gen = infinite_gen(operator.mul, prime_gen(), itertools.count())
>>> next(gen)
<<< 0
>>> next(gen)
<<< 3
>>> next(gen)
<<< 10
Generators offer a lot of flexibility, so it should be fairly easy to write iter1 and iter2 as generators that return values you want in the order you want. You could also consider using coroutines, which let you send values into a generator.
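For anyone unfamiliar with the coroutine style mentioned above, here is a tiny sketch of sending values into a generator with `send()`. The `running_max` name and its behavior are made up purely for illustration and are not part of a solution to the ordering problem.

```python
def running_max():
    """Coroutine: receives values via send() and yields the largest seen so far."""
    largest = None
    while True:
        value = yield largest
        if largest is None or value > largest:
            largest = value

tracker = running_max()
next(tracker)            # prime the coroutine: run it up to the first yield
print(tracker.send(3))   # 3
print(tracker.send(1))   # 3
print(tracker.send(7))   # 7
```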
Discussion in other answers observes that there is potentially infinite storage required no matter what the algorithm, since every a[n] must remain available for each new b[n]. If you remove the restriction that the input be two iterators and instead only require that they be sequences (indexable or merely something that can be regenerated repeatedly) then I believe all of your state suddenly collapses to one number: The last value you returned. Knowing the last result value you can search the output space looking for the next one. (If you want to emit duplicates properly then you may need to also track the number of times the result has been returned)
With a pair of sequences you have a simple recurrence relation:
result(n) = f(seq1, seq2, result(n-1))
where f(seq1, seq2, p) searches the output space for the minimum value q such that q > p. In practical terms you'd probably make the sequences memoized functions and choose your search algorithm to avoid thrashing the pool of memoized items.
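A minimal sketch of that recurrence, under assumptions that go beyond what is stated above: the sequences are given as functions of an index (`seq1(i)`, `seq2(j)`), both are non-decreasing and unbounded, and the operator is non-decreasing in each argument so that each row's first entry bounds the rest of the table below it. The names `next_result` and `distinct_sorted` are hypothetical, and duplicates are skipped rather than counted.

```python
def next_result(op, seq1, seq2, previous):
    """Return the smallest op(seq1(i), seq2(j)) that is strictly greater than previous."""
    best = None
    i = 0
    while True:
        a = seq1(i)
        # Every entry in this row and in all later rows is at least the
        # row's first entry, so once that reaches best we can stop.
        if best is not None and op(a, seq2(0)) >= best:
            return best
        # Scan along the row to its first entry above previous.
        j = 0
        while op(a, seq2(j)) <= previous:
            j += 1
        candidate = op(a, seq2(j))
        if best is None or candidate < best:
            best = candidate
        i += 1

def distinct_sorted(op, seq1, seq2):
    """Yield the distinct results in increasing order, keeping only the last value as state."""
    value = op(seq1(0), seq2(0))
    while True:
        yield value
        value = next_result(op, seq1, seq2, value)
```

With the question's test case, `seq1` could be `lambda n: 10 ** n` and `seq2` could be `lambda n: 1 if n == 0 else int("9" * n)`. The sketch recomputes sequence values on every search, which is where the memoization suggested above would come in.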
