Function takes a long time - python

I'm currently trying to count the unique paths of maximum length from node 1 to node N in a weighted directed acyclic graph. I have worked out how to get the maximum length, but I am stuck on getting the NUMBER of paths of that length.
Data is inputted like this:
91 120 # Number of nodes, number of edges
1 2 34
1 3 15
2 4 10
....
The line 1 2 34 means an edge from node 1 to node 2 with a weight of 34.
I store the data in a dictionary, so my dict looks like:
_distance = {}
_distance = {1: [(2, 34), (3, 15)], 2: [(4, 10)], 3: [(4, 17)], 4: [(5, 36), (6, 22)], 5: [(7, 8)],...ect
I have worked out how to get the length of the longest path using this.
First I make a list of vertices:
class Vertice:
    def __init__(self, name, weight=0, visted=False):
        self._n = name
        self._w = weight
        self._visited = visted
        self.pathsTo = 0  # number of longest paths reaching this vertex

_nodes = []
for i in range(numberOfNodes):  # List of vertices (0..n-1)
    _V = Vertice(i)
    _nodes.append(_V)
Next I iterate through my dictionary, setting each node to the maximum weight it can have:
for vert, neighbors in _distance.iteritems():
    _vert = _nodes[vert-1]  # Current vertice; array starts at 0, so n-1
    for x,y in neighbors:  # x = neighbor, y = weight of the edge to it
        _v = _nodes[x-1]  # Node #1 will be array[0]
        if _v._visited == True:
            if _v._w > _vert._w+y:
                _v._w = _v._w
            else:
                _v._w = y + _vert._w
        else:
            _v._w = y + _vert._w
            _v._visited = True
With this done, the last node holds the maximum weight, so I can just call
max = _nodes[-1]._w
to get the max weight. This runs fast and has no trouble finding the maximum path length, even on the bigger data set. I then pass my max value into this function:
# Start from first node in dictionary, distances is our dict{}
# Target is the last node in the list of nodes, or the total number of nodes.
numLongestPaths(currentLocation=1,target=_numNodes,distances=_distance,maxlength=max)

def numLongestPaths(currentLocation,maxlength, target, sum=0, distances={}):
    _count = 0
    if currentLocation == target:
        if sum == maxlength:
            _count += 1
    else:
        for vert, weight in distances[currentLocation]:
            newSum = sum + weight
            currentLocation = vert
            _count += numLongestPaths(currentLocation,maxlength,target,newSum,distances)
    return _count
I simply check, once we have hit the end node, whether our current sum is the max; if it is, I add one to the count, and if not, I pass.
This works instantly for inputs such as 8 nodes with a longest path of 20 (finding 3 paths), and for inputs such as 100 nodes with a longest path of 149 and only 1 unique path of that length. But when I try a data set with 91 nodes, where the longest path is 1338 and there are 32 unique paths of that length, the function takes extremely long. It works, but it is very slow.
Can someone give me some tips on what is wrong with my function that makes it take so long to find the number of paths of length X from 1..N? I'm assuming it has exponential run time, but I'm unsure how to fix it.
Thank you for your help!
EDIT: Okay, I was overthinking this and going about it the wrong way. I restructured my approach and my code is now as follows:
# BEGIN SEARCH.
for vert, neighbors in _distance.iteritems():
    _vert = _nodes[vert-1]  # Current vertice; array starts at 0, so n-1
    for x,y in neighbors:  # neighbors
        _v = _nodes[x-1]  # Node #1 will be array[0]
        if _v._visited == True:
            if _v._w > _vert._w+y:
                _v._w = _v._w
            elif _v._w == _vert._w+y:
                _v.pathsTo += _vert.pathsTo
            else:
                _v.pathsTo = _vert.pathsTo
                _v._w = y + _vert._w
        else:
            _v._w = y + _vert._w
            _v.pathsTo = max(_vert.pathsTo, _v.pathsTo + 1)
            _v._visited = True
I added a pathsTo variable to my Vertice class, which holds the number of unique paths of MAX length.

Your numLongestPaths is slow because you're recursively trying every possible path, and there can be exponentially many of those. Find a way to avoid computing numLongestPaths for any node more than once.
Also, your original _w computation is broken, because when it computes a node's _w value, it does nothing to ensure the other _w values it's relying on have themselves been computed. You will need to avoid using uninitialized values; a topological sort may be useful, although it sounds like the vertex labels may have already been assigned in topological sort order.
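The two points above can be combined into one pass. The following is only a sketch under assumptions, not the asker's code: the function name count_longest_paths, the dictionaries best and ways, and the use of Kahn's algorithm for the topological sort are all choices made for illustration. Each node is processed once, in topological order, so every value it relies on is already final.

```python
from collections import deque

def count_longest_paths(num_nodes, edges, source, target):
    """Return (longest path weight, number of paths of that weight) in a DAG.

    edges maps a node to a list of (neighbor, weight) pairs; nodes are 1..num_nodes.
    """
    # Kahn's algorithm: build a topological order from the in-degrees.
    indegree = {v: 0 for v in range(1, num_nodes + 1)}
    for u in edges:
        for v, _ in edges[u]:
            indegree[v] += 1
    queue = deque(v for v in indegree if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    # best[v]: longest distance from source to v (None while unreachable)
    # ways[v]: number of source-to-v paths that achieve best[v]
    best = {v: None for v in indegree}
    ways = {v: 0 for v in indegree}
    best[source], ways[source] = 0, 1
    for u in order:
        if best[u] is None:
            continue  # not reachable from source
        for v, w in edges.get(u, []):
            candidate = best[u] + w
            if best[v] is None or candidate > best[v]:
                best[v], ways[v] = candidate, ways[u]
            elif candidate == best[v]:
                ways[v] += ways[u]
    return best[target], ways[target]

# Hypothetical 4-node input: both 1->2->4 and 1->3->4 have weight 44.
example = {1: [(2, 34), (3, 15)], 2: [(4, 10)], 3: [(4, 29)]}
print(count_longest_paths(4, example, 1, 4))  # (44, 2)
```

This runs in time linear in the number of nodes plus edges, instead of enumerating every path.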

In addition to @user2357112's answer, here are two additional recommendations.
Language
If you want this code to be as efficient as possible, I recommend using C. Python is a great scripting language, but it is really slow compared to compiled alternatives.
Data-structure
Nodes are named in an ordered fashion, so you can optimize your code considerably by using a list instead of a dictionary, i.e.
_distance = [[] for i in range(_length)]


Fastest way to sample most numbers with minimum difference larger than a value from a Python list

Given a list of 20 float numbers, I want to find a largest subset in which any two elements differ from each other by more than mindiff = 1. Right now I am using a brute-force method that searches from the largest to the smallest subsets using itertools.combinations. As shown below, the code finds a subset after 4 s for a list of 20 numbers.
from itertools import combinations
import random
from time import time
mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]
t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # check subsets with size -1
    n -= 1
print(sample)
print(time()-t0)
Output:
{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524
However, in reality I have a list of 200 numbers, which is infeasible for brute-force enumeration. I want a fast algorithm to sample just one random largest subset with a minimum difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
My previous answer assumed you simply wanted a single optimal solution, not a uniform random sample of all solutions. This answer assumes you want one that samples uniformly from all such optimal solutions.
1. Construct a directed acyclic graph G with one node for each point, where nodes a and b are connected (a -> b) when b - a > mindist. Also add two virtual nodes, s and t, where s -> x for all x and x -> t for all x.
2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k], initially filling P[x][0] = 0 except P[t][0] = 1, and then P[x][k] = sum(P[y][k-1] for y in neighbors(x)). Keep doing this until you reach the maximum k for which P[s][k] > 0; you now know the size of the optimal subset.
3. Uniformly sample a path of length k from s to t, using P to weight your choices. Start at s, look at each neighbor y of s, and choose one randomly with weight P[y][k-1], the number of ways to finish the path from y. This gives the first element of the optimal set. Then repeat this step: at node x after i steps, look at the neighbors y of x and pick one randomly with weight P[y][k-i-1].
4. Use the nodes you sampled in step 3 as your random subset.
An implementation of the above in pure Python:
import random
def sample_mindist_subset(xs, mindist):
    # Construct directed graph G.
    n = len(xs)
    s = n; t = n + 1  # Two virtual nodes, source and sink.
    neighbors = {
        i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
        for i in range(n)}
    neighbors[s] = [t] + list(range(n))
    neighbors[t] = []

    # Compute number of paths P[x][k] from x to t of length k.
    P = [[0 for _ in range(n+2)] for _ in range(n+2)]
    P[t][0] = 1
    for k in range(1, n+2):
        for x in range(n+2):
            P[x][k] = sum(P[y][k-1] for y in neighbors[x])

    # Sample maximum length path uniformly at random.
    maxk = max(k for k in range(n+2) if P[s][k] > 0)
    path = [s]
    while path[-1] != t:
        candidates = neighbors[path[-1]]
        weights = [P[cn][maxk-len(path)] for cn in candidates]
        path.append(random.choices(candidates, weights)[0])
    return [xs[i] for i in path[1:-1]]
Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
I probably don't fully understand the question, because right now the solution is quite trivial. EDIT: yes, I misunderstood after all, the OP does not just want an optimal solution, but wishes to randomly sample from the set of optimal solutions. This answer is not incorrect but it also is an answer to a different question than what OP is interested in.
Simply sort the numbers and greedily construct the subset:
def mindist_subset(xs, mindist):
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result
Sketch of proof of correctness.
Suppose we have a solution S given input array A that is of optimal size. If it does not contain min(A) note that we could remove min(S) from S and add min(A) since this would only increase the distance between min(S) and the second smallest number in S. Conclusion: we can without loss of generality assume that min(A) is part of an optimal solution.
Now we can apply this argument recursively. We add min(A) to a solution and remove all elements too close to min(A), giving remaining elements A'. Then we're left with a subproblem where exactly the same argument applies, we can choose min(A') as our next element of the solution, etc.
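The proof sketch can be sanity-checked empirically on small inputs. The snippet below is a sketch under assumptions: brute_force_max_size is a hypothetical helper written for this check, and the random inputs (8 values in [1, 10), fixed seed) are arbitrary. It compares the size of the greedy result with the size found by exhaustive search.

```python
from itertools import combinations
import random

def mindist_subset(xs, mindist):
    # Greedy construction from the answer above.
    result = []
    for x in sorted(xs):
        if not result or x - result[-1] > mindist:
            result.append(x)
    return result

def brute_force_max_size(xs, mindist):
    """Size of the largest subset whose sorted neighbors differ by more than mindist."""
    ordered = sorted(xs)
    for size in range(len(ordered), 0, -1):
        # combinations of a sorted list are themselves sorted tuples
        for subset in combinations(ordered, size):
            if all(b - a > mindist for a, b in zip(subset, subset[1:])):
                return size
    return 0

rng = random.Random(0)
for _ in range(50):
    xs = [rng.uniform(1.0, 10.0) for _ in range(8)]
    assert len(mindist_subset(xs, 1.0)) == brute_force_max_size(xs, 1.0)
```

Agreement on random cases is evidence rather than proof, but it is a cheap way to catch a mistake in the argument or the implementation.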

Find the indices of the two largest and two smallest values in a matrix in python

I am attempting to find the indices of the two smallest and the two largest values in python:
I have
import numpy as np
from sklearn.metrics import pairwise_distances
euclidean_matrix = pairwise_distances(H10.T, metric='euclidean')
max_index = np.where(euclidean_matrix == np.max(euclidean_matrix[np.nonzero(euclidean_matrix)]))
min_index = np.where(euclidean_matrix == np.min(euclidean_matrix[np.nonzero(euclidean_matrix)]))
min_index
max_index
I get the following output
(array([158, 272]), array([272, 158]))
(array([ 31, 150]), array([150, 31]))
The above code only returns the indices of the absolute smallest and absolute largest values of the matrix. I would also like to find the indices of the next smallest and next largest values. Ideally I would like to get the indices of the two largest values and the indices of the two smallest values of the matrix. How can I do this?
I can think of a couple of ways of doing this. Which is best depends on how much data you need to search through.
A couple of caveats: you will have to decide what to do when there are only 1, 2, or 3 elements. If all values are the same, do you want min, max, etc. to be identical? And if several items tie for max, min, min2, or max2, which should be selected?
Run min, then remove that element and run min on the rest. Run max, then remove that element and run max on the rest (note that this is on the original and not the one with min removed). This is the least efficient method since it requires searching 4 times and copying twice. (Actually 8 times, because we find the min/max and then find the index.) Something like the pseudo code below.
PSEUDO CODE:
max_index = np.where(euclidean_matrix == np.max(euclidean_matrix[np.nonzero(euclidean_matrix)]))
tmp_euclidean_matrix = euclidean_matrix.copy()  # a real copy, not a view
tmp_euclidean_matrix[max_index] = 0  # zeros are ignored below, so this removes the maximum
max_index2 = np.where(tmp_euclidean_matrix == np.max(tmp_euclidean_matrix[np.nonzero(tmp_euclidean_matrix)]))
min_index = np.where(euclidean_matrix == np.min(euclidean_matrix[np.nonzero(euclidean_matrix)]))
tmp_euclidean_matrix = euclidean_matrix.copy()  # start again from the original
tmp_euclidean_matrix[min_index] = 0  # zeros are ignored below, so this removes the minimum
min_index2 = np.where(tmp_euclidean_matrix == np.min(tmp_euclidean_matrix[np.nonzero(tmp_euclidean_matrix)]))
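Sort the data (a good option if you need it sorted anyway), then just grab the two smallest and two largest. Otherwise this isn't great, because sorting costs many copies and comparisons.
PSEUDO CODE:
# note: this does not skip zeros or the mirrored duplicates of a symmetric matrix
order = np.argsort(euclidean_matrix, axis=None)  # flat indices, ascending by value
min_index = np.unravel_index(order[0], euclidean_matrix.shape)
min_index2 = np.unravel_index(order[1], euclidean_matrix.shape)
max_index = np.unravel_index(order[-1], euclidean_matrix.shape)
max_index2 = np.unravel_index(order[-2], euclidean_matrix.shape)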
Best option would be to roll your own search function to run on the data, this would be most efficient because you would go through the data only once to collect them.
This is just a simple iterative approach, other algorithms may be more efficient. You will want to validate this works though.
PSEUDO CODE:
def minmax2(array):
    """ returns indices of (minimum, second minimum, second maximum, maximum)
    """
    if len(array) == 0:
        raise ValueError('Empty List')
    if len(array) == 1:
        # special case: only 1 element, need at least 2 to have different indices
        return (0, 0, 0, 0)
    # seed both pairs from the first two elements, in the right order
    if array[0] <= array[1]:
        minimum, minimum2 = 0, 1
    else:
        minimum, minimum2 = 1, 0
    maximum, maximum2 = minimum2, minimum
    for i in range(2, len(array)):
        # min and max are updated independently, so use two separate chains
        if array[i] <= array[minimum]:
            # a new minimum (or tie) will shift the other minimum
            minimum2 = minimum
            minimum = i
        elif array[i] < array[minimum2]:
            minimum2 = i
        if array[i] >= array[maximum]:
            # a new maximum (or tie) will shift the second maximum
            maximum2 = maximum
            maximum = i
        elif array[i] > array[maximum2]:
            maximum2 = i
    return (minimum, minimum2, maximum2, maximum)
edit: Added pseudo code
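For a symmetric distance matrix like the one in the question, each pair appears twice and the diagonal is zero. The following is a sketch under assumptions, not part of the answer above: the function name two_smallest_and_largest is hypothetical, the matrix is taken to be a square list of lists, and only the pairs with i < j are scanned so the diagonal and the mirrored entries are skipped. It uses the standard library's heapq.nsmallest and heapq.nlargest.

```python
import heapq

def two_smallest_and_largest(matrix):
    """Return (two smallest pairs, two largest pairs) as lists of (i, j) indices.

    Assumes a square, symmetric matrix; only entries above the diagonal are
    considered, so the zero diagonal and mirrored duplicates are ignored.
    """
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

    def value(pair):
        i, j = pair
        return matrix[i][j]

    smallest = heapq.nsmallest(2, pairs, key=value)
    largest = heapq.nlargest(2, pairs, key=value)
    return smallest, largest

# Hypothetical 4x4 distance matrix for illustration.
D = [[0, 1, 4, 9],
     [1, 0, 2, 7],
     [4, 2, 0, 3],
     [9, 7, 3, 0]]
print(two_smallest_and_largest(D))  # ([(0, 1), (1, 2)], [(0, 3), (1, 3)])
```

Unlike the question's code, this does not exclude genuine zero distances between two different points; filter the pairs first if that matters.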

Shuffling a list with maximum distance travelled [duplicate]

I have tried to ask this question before, but have never been able to word it correctly. I hope I have it right this time:
I have a list of unique elements. I want to shuffle this list to produce a new list. However, I would like to constrain the shuffle, such that each element's new position is at most d away from its original position in the list.
So for example:
L = [1,2,3,4]
d = 2
answer = magicFunction(L, d)
Now, one possible outcome could be:
>>> print(answer)
[3,1,2,4]
Notice that 3 has moved two indices, 1 and 2 have moved one index, and 4 has not moved at all. Thus, this is a valid shuffle, per my previous definition. The following snippet of code can be used to validate this:
old = {e:i for i,e in enumerate(L)}
new = {e:i for i,e in enumerate(answer)}
valid = all(abs(i-new[e])<=d for e,i in old.items())
Now, I could easily just generate all possible permutations of L, filter for the valid ones, and pick one at random. But that doesn't seem very elegant. Does anyone have any other ideas about how to accomplish this?
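For reference, the brute-force baseline described above can be sketched as follows. The names valid_shuffles and brute_force_shuffle are hypothetical, chosen for illustration. It is only practical for very short lists, but it is uniform by construction and useful for checking faster methods.

```python
from itertools import permutations
import random

def valid_shuffles(L, d):
    """All orderings of L in which no element moves more than d positions."""
    n = len(L)
    result = []
    for perm in permutations(range(n)):
        # perm[new_pos] is the original index of the element placed at new_pos
        if all(abs(new_pos - old_pos) <= d for new_pos, old_pos in enumerate(perm)):
            result.append([L[old_pos] for old_pos in perm])
    return result

def brute_force_shuffle(L, d):
    # Uniform over all valid shuffles, at factorial cost.
    return random.choice(valid_shuffles(L, d))

print(len(valid_shuffles([1, 2, 3, 4], 2)))  # 14
```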
This is going to be long and dry.
I have a solution that produces a uniform distribution. It requires O(len(L) * d**d) time and space for precomputation, then performs shuffles in O(len(L)*d) time [1]. If a uniform distribution is not required, the precomputation is unnecessary, and the shuffle time can be reduced to O(len(L)) due to faster random choices; I have not implemented the non-uniform distribution. Both steps of this algorithm are substantially faster than brute force, but they're still not as good as I'd like them to be. Also, while the concept should work, I have not tested my implementation as thoroughly as I'd like.
Suppose we iterate over L from the front, choosing a position for each element as we come to it. Define the lag as the distance between the next element to place and the first unfilled position. Every time we place an element, the lag grows by at most one, since the index of the next element is now one higher, but the index of the first unfilled position cannot become lower.
Whenever the lag is d, we are forced to place the next element in the first unfilled position, even though there may be other empty spots within a distance of d. If we do so, the lag cannot grow beyond d, we will always have a spot to put each element, and we will generate a valid shuffle of the list. Thus, we have a general idea of how to generate shuffles; however, if we make our choices uniformly at random, the overall distribution will not be uniform. For example, with len(L) == 3 and d == 1, there are 3 possible shuffles (one for each position of the middle element), but if we choose the position of the first element uniformly, one shuffle becomes twice as likely as either of the others.
If we want a uniform distribution over valid shuffles, we need to make a weighted random choice for the position of each element, where the weight of a position is based on the number of possible shuffles if we choose that position. Done naively, this would require us to generate all possible shuffles to count them, which would take O(d**len(L)) time. However, the number of possible shuffles remaining after any step of the algorithm depends only on which spots we've filled, not what order they were filled in. For any pattern of filled or unfilled spots, the number of possible shuffles is the sum of the number of possible shuffles for each possible placement of the next element. At any step, there are at most d possible positions to place the next element, and there are O(d**d) possible patterns of unfilled spots (since any spot further than d behind the current element must be full, and any spot d or further ahead must be empty). We can use this to generate a Markov chain of size O(len(L) * d**d), taking O(len(L) * d**d) time to do so, and then use this Markov chain to perform shuffles in O(len(L)*d) time.
Example code (currently not quite O(len(L)*d) due to inefficient Markov chain representation):
import random
# states are (k, filled_spots) tuples, where k is the index of the next
# element to place, and filled_spots is a tuple of booleans
# of length 2*d, representing whether each index from k-d to
# k+d-1 has an element in it. We pretend indices outside the array are
# full, for ease of representation.
def _successors(n, d, state):
    '''Yield all legal next filled_spots and the move that takes you there.
    Doesn't handle k=n.'''
    k, filled_spots = state
    next_k = k+1
    # If k+d is a valid index, this represents the empty spot there.
    possible_next_spot = (False,) if k + d < n else (True,)
    if not filled_spots[0]:
        # Must use that position.
        yield k-d, filled_spots[1:] + possible_next_spot
    else:
        # Can fill any empty spot within a distance d.
        shifted_filled_spots = list(filled_spots[1:] + possible_next_spot)
        for i, filled in enumerate(shifted_filled_spots):
            if not filled:
                successor_state = shifted_filled_spots[:]
                successor_state[i] = True
                yield next_k-d+i, tuple(successor_state)
                # next_k instead of k in that index computation, because
                # i is indexing relative to shifted_filled_spots instead
                # of filled_spots
def _markov_chain(n, d):
    '''Precompute a table of weights for generating shuffles.
    _markov_chain(n, d) produces a table that can be fed to
    _distance_limited_shuffle to permute lists of length n in such a way that
    no list element moves a distance of more than d from its initial spot,
    and all permutations satisfying this condition are equally likely.
    This is expensive.
    '''
    if d >= n - 1:
        # We don't need the table, and generating a table for d >= n
        # complicates the indexing a bit. It's too complicated already.
        return None
    table = {}
    termination_state = (n, (d*2 * (True,)))
    table[termination_state] = 1
    def possible_shuffles(state):
        try:
            return table[state]
        except KeyError:
            k, _ = state
            count = table[state] = sum(
                possible_shuffles((k+1, next_filled_spots))
                for (_, next_filled_spots) in _successors(n, d, state)
            )
            return count
    initial_state = (0, (d*(True,) + d*(False,)))
    possible_shuffles(initial_state)
    return table
def _distance_limited_shuffle(l, d, table):
    # Generate an index into the set of all permutations, then use the
    # markov chain to efficiently find which permutation we picked.
    n = len(l)
    if d >= n - 1:
        # The constraint never binds: return an ordinary shuffle of a copy,
        # so that both branches return a new list.
        permutation = list(l)
        random.shuffle(permutation)
        return permutation
    permutation = [None]*n
    state = (0, (d*(True,) + d*(False,)))
    permutations_to_skip = random.randrange(table[state])
    for i, item in enumerate(l):
        for placement_index, new_filled_spots in _successors(n, d, state):
            new_state = (i+1, new_filled_spots)
            if table[new_state] <= permutations_to_skip:
                permutations_to_skip -= table[new_state]
            else:
                state = new_state
                permutation[placement_index] = item
                break
    return permutation
class Shuffler(object):
    def __init__(self, n, d):
        self.n = n
        self.d = d
        self.table = _markov_chain(n, d)
    def shuffled(self, l):
        if len(l) != self.n:
            raise ValueError('Wrong input size')
        return _distance_limited_shuffle(l, self.d, self.table)
    __call__ = shuffled
[1] We could use a tree-based weighted random choice algorithm to improve the shuffle time to O(len(L)*log(d)), but since the table becomes so huge for even moderately large d, this doesn't seem worthwhile. Also, the factors of d**d in the bounds are overestimates, but the actual factors are still at least exponential in d.
In short, the list that should be shuffled gets ordered by the sum of index and a random number.
import random
xs = range(20) # list that should be shuffled
d = 5 # distance
[x for i,x in sorted(enumerate(xs), key=lambda pair: pair[0]+(d+1)*random.random())]
Out:
[1, 4, 3, 0, 2, 6, 7, 5, 8, 9, 10, 11, 12, 14, 13, 15, 19, 16, 18, 17]
That's basically it. But this may look a little overwhelming, therefore...
The algorithm in more detail
To understand this better, consider this alternative implementation of an ordinary, random shuffle:
import random
sorted(range(10), key = lambda x: random.random())
Out:
[2, 6, 5, 0, 9, 1, 3, 8, 7, 4]
In order to constrain the distance, we have to implement an alternative sort key function that depends on the index of an element. The function sort_criterion is responsible for that.
import random
def exclusive_uniform(a, b):
    "returns a random value in the interval [a, b)"
    return a+(b-a)*random.random()

def distance_constrained_shuffle(sequence, distance,
                                 randmoveforward = exclusive_uniform):
    def sort_criterion(enumerate_tuple):
        """
        returns the index plus a random offset,
        such that the result can overtake at most 'distance' elements
        """
        indx, value = enumerate_tuple
        return indx + randmoveforward(0, distance+1)
    # get enumerated, shuffled list
    enumerated_result = sorted(enumerate(sequence), key = sort_criterion)
    # remove enumeration
    result = [x for i, x in enumerated_result]
    return result
With the argument randmoveforward you can pass a random number generator with a different probability density function (pdf) to modify the distance distribution.
The remainder is testing and evaluation of the distance distribution.
Test function
Here is an implementation of the test function. The validate function is actually taken from the OP, but I removed the creation of one of the dictionaries for performance reasons.
def test(num_cases = 10, distance = 3, sequence = range(1000)):
    def validate(d, lst, answer):
        #old = {e:i for i,e in enumerate(lst)}
        new = {e:i for i,e in enumerate(answer)}
        return all(abs(i-new[e])<=d for i,e in enumerate(lst))
        #return all(abs(i-new[e])<=d for e,i in old.iteritems())
    for _ in range(num_cases):
        result = distance_constrained_shuffle(sequence, distance)
        if not validate(distance, sequence, result):
            print("Constraint violated. ", result)
            break
    else:
        print("No constraint violations")
test()
Out:
No constraint violations
Distance distribution
I am not sure whether there is a way to make the distance uniformly distributed, but here is a function to check the distribution.
def distance_distribution(maxdistance = 3, sequence = range(3000)):
    from collections import Counter
    def count_distances(lst, answer):
        new = {e:i for i,e in enumerate(answer)}
        return Counter(i-new[e] for i,e in enumerate(lst))
    answer = distance_constrained_shuffle(sequence, maxdistance)
    counter = count_distances(sequence, answer)
    sequence_length = float(len(sequence))
    distances = range(-maxdistance, maxdistance+1)
    return distances, [counter[d]/sequence_length for d in distances]
distance_distribution()
Out:
([-3, -2, -1, 0, 1, 2, 3],
[0.01,
0.076,
0.22166666666666668,
0.379,
0.22933333333333333,
0.07766666666666666,
0.006333333333333333])
Or for a case with greater maximum distance:
distance_distribution(maxdistance=9, sequence=range(100*1000))
This is a very difficult problem, but it turns out there is a solution in the academic literature, in an influential paper by Mark Jerrum, Alistair Sinclair, and Eric Vigoda, A Polynomial-Time Approximation Algorithm for the Permanent of a Matrix with Nonnegative Entries, Journal of the ACM, Vol. 51, No. 4, July 2004, pp. 671–697. http://www.cc.gatech.edu/~vigoda/Permanent.pdf.
Here is the general idea: first write down two copies of the numbers in the array that you want to permute. Say
1 1
2 2
3 3
4 4
Now connect a node on the left to a node on the right if mapping from the number on the left to the position on the right is allowed by the restrictions in place. So if d=1 then 1 on the left connects to 1 and 2 on the right, 2 on the left connects to 1, 2, 3 on the right, 3 on the left connects to 2, 3, 4 on the right, and 4 on the left connects to 3, 4 on the right.
1 - 1
X
2 - 2
X
3 - 3
X
4 - 4
The resulting graph is bipartite. A valid permutation corresponds to a perfect matching in the bipartite graph. A perfect matching, if it exists, can be found in O(VE) time (or somewhat better, for more advanced algorithms).
Now the problem becomes one of generating a uniformly distributed random perfect matching. I believe that can be done, approximately anyway. Uniformity of the distribution is the really hard part.
What does this have to do with permanents? Consider a matrix representation of our bipartite graph, where a 1 means an edge and a 0 means no edge:
1 1 0 0
1 1 1 0
0 1 1 1
0 0 1 1
The permanent of the matrix is like the determinant, except there are no negative signs in the definition. So we take exactly one element from each row and column, multiply them together, and add up over all choices of row and column. The terms of the permanent correspond to permutations; the term is 0 if any factor is 0, in other words if the permutation is not valid according to the matrix/bipartite graph representation; the term is 1 if all factors are 1, in other words if the permutation is valid according to the restrictions. In summary, the permanent of the matrix counts all permutations satisfying the restriction represented by the matrix/bipartite graph.
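To make the definition concrete, here is a brute-force sketch that evaluates the permanent of the 4x4 matrix above directly from the definition. The function name permanent is chosen for illustration, and the factorial running time makes this usable only for tiny matrices; it is not the approximation algorithm from the paper.

```python
from itertools import permutations

def permanent(matrix):
    """Permanent by direct expansion: sum over all permutations of the
    product of one entry per row and column, with no sign changes."""
    n = len(matrix)
    total = 0
    for cols in permutations(range(n)):
        term = 1
        for row, col in enumerate(cols):
            term *= matrix[row][col]
        total += term
    return total

# The d=1 matrix for a list of 4 elements, as written out above.
band = [[1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 1],
        [0, 0, 1, 1]]
print(permanent(band))  # 5
```

The result 5 matches a hand count of the valid permutations for d=1: the identity, the three single swaps of adjacent elements, and the double swap of (1,2) together with (3,4).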
It turns out that unlike calculating determinants, which can be accomplished in O(n^3) time, calculating permanents is #P-complete, so finding an exact answer is not feasible in general. However, if we can estimate the number of valid permutations, we can estimate the permanent. Jerrum et al. approached the problem of counting valid permutations by generating valid permutations uniformly (within a certain error, which can be controlled); an estimate of the value of the permanent can be obtained by a fairly elaborate procedure (section 5 of the paper referenced), but we don't need that to answer the question at hand.
The running time of Jerrum's algorithm to calculate the permanent is O(n^11) (ignoring logarithmic factors). I can't immediately tell from the paper the running time of the part of the algorithm that uniformly generates bipartite matchings, but it appears to be over O(n^9). However, another paper reduces the running time for the permanent to O(n^7): http://www.cc.gatech.edu/fac/vigoda/FasterPermanent_SODA.pdf; in that paper they claim that it is now possible to get a good estimate of a permanent of a 100x100 0-1 matrix. So it should be possible to (almost) uniformly generate restricted permutations for lists of 100 elements.
There may be further improvements, but I got tired of looking.
If you want an implementation, I would start with the O(n^11) version in Jerrum's paper, and then take a look at the improvements if the original algorithm is not fast enough.
There is pseudo-code in Jerrum's paper, but I haven't tried it so I can't say how far the pseudo-code is from an actual implementation. My feeling is it isn't too far. Maybe I'll give it a try if there's interest.
I am not sure how good it is, but maybe something like:
create a list of the same length as the initial list L; each element of this list should be a list of the initial indices allowed to be moved here; for instance [[0,1,2],[0,1,2,3],[0,1,2,3],[1,2,3]] if I understand your example correctly;
take the smallest sublist (or any of the smallest sublists if several share the same length);
pick a random element in it with random.choice; this element is the index of the element in the initial list to be mapped to the current location (use another list for building your new list);
remove the randomly chosen element from all sublists.
For instance:
L = [ "A", "B", "C", "D" ]
i = [[0,1,2],[0,1,2,3],[0,1,2,3],[1,2,3]]
# I take [0,1,2] and pick randomly 1 inside
# I remove the value '1' from all sublists and since
# the first sublist has already been handled I set it to None
# (and my result will look as [ "B", None, None, None ]
i = [None,[0,2,3],[0,2,3],[2,3]]
# I take the last sublist and pick randomly 3 inside
# result will be ["B", None, None, "D" ]
i = [None,[0,2], [0,2], None]
etc.
I haven't tried it however. Regards.
My idea is to generate permutations by moving at most d steps by generating d random permutations which move at most 1 step and chaining them together.
We can generate permutations which move at most 1 step quickly by the following recursive procedure: consider a permutation of {1,2,3,...,n}. The last item, n, can move either 0 or 1 place. If it moves 0 places, n is fixed, and we have reduced the problem to generating a permutation of {1,2,...,n-1} in which every item moves at most one place.
On the other hand, if n moves 1 place, it must occupy position n-1. Then n-1 must occupy position n (if any smaller number occupies position n, it will have moved by more than 1 place). In other words, we must have a swap of n and n-1, and after swapping we have reduced the problem to finding such a permutation of the remainder of the array {1,...,n-2}.
Such permutations can be constructed in O(n) time, clearly.
Those two choices should be selected with weighted probabilities. Since I don't know the weights (though I have a theory, see below) maybe the choice should be 50-50 ... but see below.
A more accurate estimate of the weights might be as follows: note that the number of such permutations follows a recursion that is the same as the Fibonacci sequence: f(n) = f(n-1) + f(n-2). We have f(1) = 1 and f(2) = 2 ({1,2} goes to {1,2} or {2,1}), so the numbers really are the Fibonacci numbers. So my guess for the probability of choosing n fixed vs. swapping n and n-1 would be f(n-1)/f(n) vs. f(n-2)/f(n). Since the ratio of consecutive Fibonacci numbers quickly approaches the Golden Ratio, a reasonable approximation to the probabilities is to leave n fixed 61% of the time and swap n and n-1 39% of the time.
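For a single pass (d = 1), the exact ratios f(n-1)/f(n) and f(n-2)/f(n) can be used instead of the 61%/39% approximation. The sketch below does that; the function name random_adjacent_shuffle and the rng parameter are assumptions made for illustration. With exact weights, each of the f(n) permutations that move items at most one place is produced with probability 1/f(n), because the chosen ratios telescope. This says nothing about uniformity after chaining d passes.

```python
import random

def random_adjacent_shuffle(items, rng=random):
    """Uniformly sample a permutation in which every item moves at most one place."""
    n = len(items)
    # f[m] = number of such permutations of m items:
    # f[0] = f[1] = 1 and f[m] = f[m-1] + f[m-2], so f[2] = 2 as in the text.
    f = [1, 1]
    for m in range(2, n + 1):
        f.append(f[m - 1] + f[m - 2])
    result = list(items)
    j = n - 1  # index of the last item whose fate is undecided
    while j > 0:
        remaining = j + 1  # items 0..j are still undecided
        # Leave item j fixed with probability f[remaining-1] / f[remaining].
        if rng.randrange(f[remaining]) < f[remaining - 1]:
            j -= 1
        else:
            # Otherwise swap items j and j-1, which settles both of them.
            result[j], result[j - 1] = result[j - 1], result[j]
            j -= 2
    return result

print(random_adjacent_shuffle([1, 2, 3, 4, 5]))
```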
To construct permutations where items move at most d places, we just repeat the process d times. The running time is O(nd).
Here is an outline of an algorithm.
arr = {1,2,...,n};
for (i = 0; i < d; i++) {
    j = n-1;
    while (j > 0) {
        u = random uniform in interval (0,1)
        if (u < 0.61) { // related to golden ratio phi; more decimals may help
            j -= 1;
        } else {
            swap items at positions j and j-1 of arr // 0-based indexing
            j -= 2;
        }
    }
}
Since each pass moves items at most 1 place from their start, d passes will move items at most d places. The only question is the uniform distribution of the permutations. It would probably be a long proof, if it's even true, so I suggest assembling empirical evidence for various n's and d's. Probably to prove the statement, we would have to switch from using the golden ratio approximation to the exact ratio f(n-1)/f(n) in place of 0.61.
There might even be some weird reason why some permutations might be missed by this procedure, but I'm pretty sure that doesn't happen. Just in case, though, it would be helpful to have a complete inventory of such permutations for some values of n and d to check the correctness of my proposed algorithm.
Update
I found an off-by-one error in my "pseudocode", and I corrected it. Then I implemented in Java to get a sense of the distribution. Code is below. The distribution is far from uniform, I think because there are many ways of getting restricted permutations with short max distances (move forward, move back vs. move back, move forward, for example) but few ways of getting long distances (move forward, move forward). I can't think of a way to fix the uniformity issue with this method.
import java.util.Random;
import java.util.Map;
import java.util.TreeMap;
class RestrictedPermutations {
    private static Random rng = new Random();
    public static void rPermute(Integer[] a, int d) {
        for (int i = 0; i < d; i++) {
            int j = a.length-1;
            while (j > 0) {
                double u = rng.nextDouble();
                if (u < 0.61) { // related to golden ratio phi; more decimals may help
                    j -= 1;
                } else {
                    int t = a[j];
                    a[j] = a[j-1];
                    a[j-1] = t;
                    j -= 2;
                }
            }
        }
    }
    public static void main(String[] args) {
        int numTests = Integer.parseInt(args[0]);
        int d = 2;
        Map<String,Integer> count = new TreeMap<String,Integer>();
        for (int t = 0; t < numTests; t++) {
            Integer[] a = {1,2,3,4,5};
            rPermute(a,d);
            // convert a to String for storage in Map
            String s = "(";
            for (int i = 0; i < a.length-1; i++) {
                s += a[i] + ",";
            }
            s += a[a.length-1] + ")";
            int c = count.containsKey(s) ? count.get(s) : 0;
            count.put(s,c+1);
        }
        for (String k : count.keySet()) {
            System.out.println(k + ": " + count.get(k));
        }
    }
}
Here are two sketches in Python; one swap-based, the other non-swap-based. In the first, the idea is to keep track of where the indexes have moved and test if the next swap would be valid. An additional variable is added for the number of swaps to make.
from random import randint
def swap(a,b,L):
    L[a], L[b] = L[b], L[a]
def magicFunction(L,d,numSwaps):
    n = len(L)
    new = list(range(0,n))
    for i in xrange(0,numSwaps):
        x = randint(0,n-1)
        y = randint(max(0,x - d),min(n - 1,x + d))
        while abs(new[x] - y) > d or abs(new[y] - x) > d:
            y = randint(max(0,x - d),min(n - 1,x + d))
        swap(x,y,new)
        swap(x,y,L)
    return L
print(magicFunction([1,2,3,4],2,3)) # e.g. [2, 1, 4, 3]
print(magicFunction([1,2,3,4,5,6,7,8,9],2,4)) # e.g. [2, 3, 1, 5, 4, 6, 8, 7, 9]
Using print(collections.Counter(tuple(magicFunction([0, 1, 2], 1, 1)) for i in xrange(1000))) we find that the identity permutation comes up heavy with this code (the reason why is left as an exercise for the reader).
Alternatively, we can think about it as looking for a permutation matrix with interval restrictions, where M(i,j) can equal 1 only if abs(i - j) <= d. We can construct a one-off random path by picking a random j for each row from those still available. In the following example, x's represent matrix cells that would invalidate the solution (the northwest-to-southeast diagonal would represent the identity permutation), and restrictions records how many i's are still available for each j. (Adapted from my previous version to choose both the next i and the next j randomly, inspired by user2357112's answer):
n = 5, d = 2
Start:
0 0 0 x x
0 0 0 0 x
0 0 0 0 0
x 0 0 0 0
x x 0 0 0
restrictions = [3,4,5,4,3] # how many i's are still available for each j
1.
0 0 1 x x # random choice
0 0 0 0 x
0 0 0 0 0
x 0 0 0 0
x x 0 0 0
restrictions = [2,3,0,4,3] # update restrictions in the neighborhood of (i ± d)
2.
0 0 1 x x
0 0 0 0 x
0 0 0 0 0
x 0 0 0 0
x x 0 1 0 # random choice
restrictions = [2,3,0,0,2] # update restrictions in the neighborhood of (i ± d)
3.
0 0 1 x x
0 0 0 0 x
0 1 0 0 0 # random choice
x 0 0 0 0
x x 0 1 0
restrictions = [1,0,0,0,1] # update restrictions in the neighborhood of (i ± d)
only one choice remains for j = 0 and for j = 4, so those must be chosen
4.
0 0 1 x x
1 0 0 0 x # dictated choice
0 1 0 0 0
x 0 0 0 0
x x 0 1 0
restrictions = [0,0,0,0,1] # update restrictions in the neighborhood of (i ± d)
Solution:
0 0 1 x x
1 0 0 0 x
0 1 0 0 0
x 0 0 0 1 # dictated choice
x x 0 1 0
[2,0,1,4,3]
Python code (adapted from my previous version to choose both the next i and the next j randomly, inspired by user2357112's answer):
from random import randint,choice
import collections
def magicFunction(L,d):
    n = len(L)
    restrictions = [None] * n
    restrict = -1
    solution = [None] * n
    for i in xrange(0,n):
        restrictions[i] = abs(max(0,i - d) - min(n - 1,i + d)) + 1
    while True:
        availableIs = filter(lambda x: solution[x] is None,[i for i in xrange(n)]) if restrict == -1 else filter(lambda x: solution[x] is None,[j for j in xrange(max(0,restrict - d),min(n,restrict + d + 1))])
        if not availableIs:
            L = [L[i] for i in solution]
            return L
        i = choice(availableIs)
        availableJs = filter(lambda x: restrictions[x] != 0,[j for j in xrange(max(0,i - d),min(n,i + d + 1))])
        nextJ = restrict if restrict != -1 else choice(availableJs)
        restrict = -1
        solution[i] = nextJ
        restrictions[ nextJ ] = 0
        for j in xrange(max(0,i - d),min(n,i + d + 1)):
            if j == nextJ or restrictions[j] == 0:
                continue
            restrictions[j] = restrictions[j] - 1
            if restrictions[j] == 1:
                restrict = j
print(collections.Counter(tuple(magicFunction([0, 1, 2], 1)) for i in xrange(1000)))
Using print(collections.Counter(tuple(magicFunction([0, 1, 2], 1)) for i in xrange(1000))) we find that the identity permutation comes up light with this code (why is left as an exercise for the reader).
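To put numbers on "heavy" or "light", it can help to tabulate a sampler's output against the count a uniform sampler would be expected to give. The sketch below is a hedged illustration, not part of either answer: `valid_permutations`, `rejection_sampler` and `frequency_report` are hypothetical helper names, and the rejection sampler (shuffle until the result is valid) is a different technique, used here only as a uniform baseline for small n. Any function taking (list, d) can be passed in its place.

```python
import random
from collections import Counter
from itertools import permutations

def valid_permutations(n, d):
    """Brute-force list of permutations of range(n) with abs(p[i] - i) <= d."""
    return [p for p in permutations(range(n))
            if all(abs(v - i) <= d for i, v in enumerate(p))]

def rejection_sampler(L, d):
    """Baseline: shuffle until the result is valid (uniform, but slow for large n)."""
    n = len(L)
    while True:
        p = random.sample(range(n), n)
        if all(abs(v - i) <= d for i, v in enumerate(p)):
            return [L[v] for v in p]

def frequency_report(sampler, n, d, trials=10000):
    """Map each valid permutation to (observed count, count expected if uniform)."""
    counts = Counter(tuple(sampler(list(range(n)), d)) for _ in range(trials))
    valid = valid_permutations(n, d)
    expected = trials / len(valid)
    return {p: (counts.get(p, 0), expected) for p in valid}

for perm, (seen, expected) in sorted(frequency_report(rejection_sampler, 3, 1).items()):
    print(perm, seen, round(expected, 1))
```

A permutation whose observed count sits far from the expected column is over- or under-represented by the sampler being tested.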
Here's an adaptation of @גלעד ברקן's code that takes only one pass through the list (in random order) and swaps only once (using a random choice of possible positions):
from random import choice, shuffle
def magicFunction(L, d):
    n = len(L)
    swapped = [0] * n # 0: position not swapped, 1: position was swapped
    positions = list(xrange(0,n)) # list of positions: 0..n-1
    shuffle(positions) # randomize positions
    for x in positions:
        if swapped[x]: # only swap an item once
            continue
        # find all possible positions to swap
        possible = [i for i in xrange(max(0, x - d), min(n, x + d)) if not swapped[i]]
        if not possible:
            continue
        y = choice(possible) # choose another possible position at random
        if x != y:
            L[y], L[x] = L[x], L[y] # swap with that position
            swapped[x] = swapped[y] = 1 # mark both positions as swapped
    return L
Here is a refinement of the above code that simply finds all possible adjacent positions and chooses one:
from random import choice
def magicFunction(L, d):
    n = len(L)
    positions = list(xrange(0, n)) # list of positions: 0..n-1
    for x in xrange(0, n):
        # find all possible positions to swap
        possible = [i for i in xrange(max(0, x - d), min(n, x + d)) if abs(positions[i] - x) <= d]
        if not possible:
            continue
        y = choice(possible) # choose another possible position at random
        if x != y:
            L[y], L[x] = L[x], L[y] # swap with that position
            positions[x] = y
            positions[y] = x
    return L

Iterate over two lists, execute function and return values

I am trying to iterate over two lists of the same length and, for the pair of entries at each index, execute a function. The function aims to cluster the entries according to some requirement X on the value the function returns.
The lists in question are:
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
and the function takes 4 entries: one (e, p) pair from each of two indices, for every combination of indices in e_list and p_list.
The function is defined as
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
I have so far been able to loop over the lists simultaneously as:
for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2: continue # to avoid same indices
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2) , deltaR(eta, phi, eta2, phi2)
This loop executes the function on every combination except those with equal indices, i.e. 0,0 or 1,1 etc.
The output of the code returns:
(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432
I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:
output = [No. matched entries]
i.e.
output = [2, 3]
2 coming from the fact that indices 0 and 1 are matched
3 coming from the fact that indices 2, 3, and 4 are all matched
A possible way I have thought of is to append to a list, all the indices used such that I return
output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]
Then, I use defaultdict to count the occurrences:
from collections import defaultdict
hits = defaultdict(int)
for index in output_list:
    hits[index] += 1
From the dict I can manipulate it to return [2,3] but is there a more pythonic way of achieving this?
This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.
The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connect them with an edge if they pass your distance test.
From there, you only need to determine the connected components of an undirected graph, which is covered in many places. Here is a basic depth first search on this site: Find connected components in a graph
You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
------------------------ graph theory ----------------------
You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. see: https://en.wikipedia.org/wiki/Graph_theory
You have data points. Imagine drawing them in eta phi space as rectangular coordinates, and then draw lines between the points that are close to each other. You now have a "graph" with vertices and edges.
Determining which of these dots are joined together by lines is finding connected components. Obviously it's easy to see by eye, but if you have thousands of points, and you want a computer to find the connected components quickly, you use graph theory.
Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.
Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.
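The checkmark procedure can be sketched in a few lines with a plain dictionary as the adjacency list and an explicit stack instead of recursion. This is only an illustration: `component_sizes` is a hypothetical name, and the example adjacency list is written out by hand from the index pairs printed in the question.

```python
def component_sizes(adjacency):
    """Return the size of each connected component of an undirected graph.

    adjacency maps each vertex to an iterable of its neighbours.
    """
    visited = set()  # the "checkmarks"
    sizes = []
    for start in adjacency:
        if start in visited:
            continue
        visited.add(start)
        stack = [start]
        size = 0
        while stack:
            v = stack.pop()
            size += 1
            # Follow every line out of v to vertices without a checkmark.
            for w in adjacency[v]:
                if w not in visited:
                    visited.add(w)
                    stack.append(w)
        sizes.append(size)
    return sizes

# Assumed adjacency list: indices 0-1 are linked, and 2, 3, 4 are all linked.
adjacency = {0: [1], 1: [0], 2: [3, 4], 3: [2, 4], 4: [2, 3]}
print(component_sizes(adjacency))  # [2, 3]
```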
As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Google "python graph module". One example mentioned in comments is "pip install networkx". If you build the graph in networkx, you can iterate over the connected components (each one comes back as a set of nodes), then take the len of each to get the format you want: [len(_) for _ in nx.connected_components(G)]
---------------- code -------------------
But if you don't understand the math, then you might not understand a module for graphs, nor a base python implementation, but it's pretty easy if you just look at some of those links. Basically dots and lines, but pretty useful when you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.
My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
X = 1 # you never actually said, but this works
def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X
class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False
class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)
    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)
        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a,b):
                b.connected.append(n)
                a.connected.append(i)
        self.vertices.append(a)
    def reset_visited(self):
        # the vertex list is self.vertices (there is no self.nodes attribute)
        for v in self.vertices:
            v.visited = False
    def DFS(self, n):
        #perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count
    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))
        return components
g = Graph(e_list, p_list)
print(g.connected_components()) # [2, 3]

Cycle detection in a 2-tuple python list

Given a list of edges as 2-tuples, (source, destination), is there an efficient way to determine whether a cycle exists? E.g., in the example below, a cycle exists because 1 -> 3 -> 6 -> 4 -> 1. One idea is to count the occurrences of each integer in the list (again, is there an efficient way to do this?). Is there a better way? I am dealing with a problem with 10,000 2-tuple edges.
a = [(1,3), (4,6), (3,6), (1,4)]
I'm assuming you want to find a cycle in the undirected graph represented by your edge list and you don't want to count "trivial" cycles of size 1 or 2.
You can still use a standard depth-first search, but you need to be a bit careful about the node coloring (a simple flag to signal which nodes you have already visited is not sufficient):
from collections import defaultdict
edges = [(1,3), (4,6), (3,6), (1,4)]
adj = defaultdict(set)
for x, y in edges:
    adj[x].add(y)
    adj[y].add(x)
col = defaultdict(int)
def dfs(x, parent=None):
    if col[x] == 1: return True
    if col[x] == 2: return False
    col[x] = 1
    res = False
    for y in adj[x]:
        if y == parent: continue
        if dfs(y, x): res = True
    col[x] = 2
    return res
for x in adj:
    if dfs(x):
        print("There's a cycle reachable from %d!" % x)
This will detect if there is a back edge in the depth-first forest that spans at least 2 levels. This is exactly the case if there is a simple cycle of size >= 3. By storing parent pointers you can actually print the cycle as well once you have found it.
For large graphs you might want to use an explicit stack instead of recursion, as illustrated on Wikipedia.
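Both suggestions (parent pointers and an explicit stack) can be combined in one sketch. This is an assumption-laden illustration rather than the code above rewritten: `find_cycle`, `_chain` and `_close_cycle` are hypothetical names, node labels are assumed never to be None, and nodes are marked when pushed, so the visiting order is not a strict depth-first order. For an undirected graph that is still enough, because any edge to an already-visited node other than the parent closes a cycle through the two parent chains.

```python
from collections import defaultdict

def _chain(parent, v):
    """Follow parent pointers from v up to the root of its tree."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def _close_cycle(parent, x, y):
    """Build the cycle closed by the non-tree edge x-y."""
    up_x = _chain(parent, x)
    up_y = _chain(parent, y)
    on_x = set(up_x)
    meet = next(v for v in up_y if v in on_x)  # lowest common ancestor
    left = up_x[:up_x.index(meet) + 1]         # x ... meet
    right = up_y[:up_y.index(meet)]            # y ... child of meet
    return left + right[::-1]

def find_cycle(edges):
    """Return one simple cycle of length >= 3 as a list of nodes, or None."""
    adj = defaultdict(set)
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    parent = {}
    for root in list(adj):
        if root in parent:
            continue
        parent[root] = None
        stack = [root]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y == x or y == parent[x]:
                    continue  # ignore self-loops and the edge we arrived by
                if y in parent:
                    return _close_cycle(parent, x, y)
                parent[y] = x
                stack.append(y)
    return None

print(find_cycle([(1, 3), (4, 6), (3, 6), (1, 4)]))
```

The returned list visits the cycle in order, with the last node adjacent to the first.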
