I have a set of points and need to select an optimal subset of 3 of them, where the criterion is a linear sum of some properties of the points, and some properties of pairs of the points.
In Python, this is quite easy using itertools.combinations:
from itertools import combinations

all_points = list(combinations(points, 3))  # materialise so it can be indexed below
costs = []
for i, (p1, p2, p3) in enumerate(all_points):
    costs.append((p1.weight + p2.weight + p3.weight
                  + pair_weight(p1, p2) + pair_weight(p1, p3) + pair_weight(p2, p3),
                  i))
costs.sort()
best = all_points[costs[0][1]]
The problem is that this is a brute-force solution: it enumerates all possible combinations of 3 points, which is O(n^3) in the number of points and therefore quickly becomes a very large number of evaluations. I have been trying to research whether there is a more efficient way to do this, perhaps one that takes advantage of the linearity of the cost function.
I have tried turning this into a networkx graph featuring node and edge weights. However, I have not yet found an algorithm in that toolkit that can calculate the "shortest triangle", particularly one that considers both edge and node weights. (Shortest path algorithms tend to only consider edge weights for example.)
There are functions to enumerate all cliques, and then I can select 3-cliques, and calculate the cost, but this is also brute force and therefore not better than doing it with combinations as above.
Are there any other algorithms I can look at?
By the way, if I do not have the edge weights, it is easy to just sort the nodes by their node weight and choose the first three, so it is really the paired costs that add complexity to this problem. I am wondering whether I could somehow list all pairs and find the top-k of those that form triangles, or something better? At least if I could efficiently enumerate top candidates and stop the enumeration on some heuristic, it might be better than the brute-force approach.
From now on, I will use n as the number of nodes and m as the number of edges. If your graph is fully connected, then m is just n choose 2. I'll also disregard node weights, because as the comments to your initial post have noted, the node weights can be absorbed into the edges they're connected to.
Your algorithm is O(n^3); it's hopefully not too hard to see why: You iterate over every possible triplet of nodes. However, it is possible to iterate over every triangle in a graph in O(m sqrt(m)):
for every node u:
    for every node v adjacent to u:
        if degree(u) < degree(v): continue;
        for every node w adjacent to v:
            if degree(v) < degree(w): continue;
            if u is not connected to w: continue;
            // <u,v,w> is a triangle!
The proof for this algorithm's runtime of O(m sqrt(m)) is nontrivial, so I'll direct you here: https://cs.stanford.edu/~rishig/courses/ref/l1.pdf
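If it helps to see that loop in Python, here is a rough sketch on top of networkx (which you already mention); the tie-breaking by node id is an assumption I've added so that each triangle is reported exactly once, and it presumes the node labels are comparable:

import networkx as nx

def triangles(G):
    # rank nodes by (degree, node id); the id breaks ties so the order is strict
    rank = {v: (d, v) for v, d in G.degree()}
    for u in G.nodes():
        for v in G.neighbors(u):
            if rank[v] >= rank[u]:
                continue  # only descend in degree order
            for w in G.neighbors(v):
                if rank[w] >= rank[v]:
                    continue
                if G.has_edge(u, w):
                    yield (u, v, w)  # <u, v, w> is a triangle

With the node weights folded into the incident edge weights, the best triple is then just the minimum-cost triangle over this enumeration.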
If your graph is fully connected, then you've gotta stick with the O(n^3), I think. There might be some early-pruning ideas you can try, but they won't lead to a significant speedup, probably 2x at the very best.
I'm struggling to find a solution to this problem with time complexity O(m log n + n).
Assume you have a directed acyclic graph with n nodes and m requests, where each node has at most one parent. At time 0 the graph has no edges. Requests come in two types: add an edge (u, v), or find the root of the subgraph containing vertex u. You should add an edge only if it doesn't break any property of the graph (it must remain acyclic and each node must still have at most one incoming edge).
There are multiple solutions I could think of, but none of them has the required complexity. Here I describe my best solution (complexity-wise). Create a vector root (root[u] is the root of the subgraph containing vertex u) and a vector of vectors children (children[u] holds the descendants of vertex u). After an edge (u, v) is added, I update the vectors the following way:
for child in children[v]:
    root[child] = u
    children[u].append(child)
children[v] = []
This way, checking whether adding an edge breaks the property, or returning a root, takes O(1) time. However, the updating procedure has total complexity O(n^2) (there can be at most n - 1 edges in such a graph, and children[u] has size at most n - 1 for every u). The total complexity is O(m + n^2).
Could you please suggest any ideas on how to solve it? I assume that there must be an O(m log^2 n + n) solution.
This can be done by union find with path compression, but without union by rank since it's important to be able to control which node remains the root of its component. The time complexity is as desired, O(m log n + n).
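A minimal sketch of that idea, assuming the vertices are labelled 0..n-1 (the class and method names are illustrative):

class Forest:
    # Incremental forest: each node has at most one parent and no cycles.
    # Union-find with path compression only; the representative of every
    # component is always its real tree root, which is why union by rank
    # is deliberately left out.

    def __init__(self, n):
        self.parent = list(range(n))  # parent[v] == v  <=>  v is still a root

    def find(self, v):
        # root lookup with path compression
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        # attach v under u only if v has no parent yet and u's root isn't v
        if self.parent[v] != v or self.find(u) == v:
            return False  # would give v a second parent, or create a cycle
        self.parent[v] = u
        return True

Because parent[v] is only ever changed away from v, the check parent[v] != v doubles as "v already has an incoming edge", and path compression alone keeps find at O(log n) amortized, which is where the m log n term comes from.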
I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of the strings, store them in a dictionary, and then check whether a variation was seen more than once. This worked pretty fast (20 minutes) for a hamming distance of 1, but it is very memory intensive and would not be viable for a hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs; it returns all the pairs with high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, with N << 5^24, and a maximum hamming distance d, return all pairs with hamming distance at most d.
The way you'll solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^{24}, where K is a parameter you'll have to experiment with. For every data point v, you define the hash of v as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0,1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the datapoints in the sequences, you only check data points in the bin of "close" hashes.
Note that this is very heuristic-based, and you will need to experiment with K and with how "close" you search around each hash. As K increases, the number of bins grows exponentially, but points that share a bin are more likely to be genuinely similar.
Judging by what you said, it looks like you have a gigantic dataset, so I thought I would throw this out for you to consider.
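To make the bucketing concrete, here is a rough sketch that uses random position sampling (the standard LSH family for Hamming distance) rather than random hyperplanes; num_tables and positions_per_table play the role of K and are knobs you would have to tune:

import random
from collections import defaultdict

def candidate_pairs(sequences, num_tables=20, positions_per_table=8, seed=0):
    # sequences that agree on a random subset of positions share a bucket;
    # pairs that never collide in any table are (probably) far apart and
    # are never compared at all
    rng = random.Random(seed)
    length = len(sequences[0])
    pairs = set()
    for _ in range(num_tables):
        picks = rng.sample(range(length), positions_per_table)
        buckets = defaultdict(list)
        for idx, seq in enumerate(sequences):
            buckets[tuple(seq[p] for p in picks)].append(idx)
        for bucket in buckets.values():
            # large buckets are still quadratic to expand, so the parameters matter
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    pairs.add((bucket[i], bucket[j]))
    return pairs

Every candidate pair still needs an exact Hamming-distance check afterwards, and some genuine pairs can be missed entirely, which is the trade-off mentioned above.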
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1 GB of RAM. I modified the demo.c file to work for my use case.
I'm trying to calculate a kind of fuzzy Jaccard index between two sets with the following rationale: as with the Jaccard index, I want to calculate the ratio between the number of items that are common to both sets and the total number of different items in both sets. The problem is that I want to use a similarity function with a threshold to determine what counts as the "same" item being in both sets, so that similar items:
Aren't counted twice in the union
Are counted in the intersection.
I have a working implementation here (in python):
import itertools

def fuzzy_jaccard(set1, set2, similarity, threshold):
    intersection_size = union_size = len(set1 & set2)
    shorter_difference, longer_difference = sorted([set2 - set1, set1 - set2], key=len)
    while len(shorter_difference) > 0:
        item1, item2 = max(
            itertools.product(longer_difference, shorter_difference),
            key=lambda pair: similarity(*pair)
        )
        longer_difference.remove(item1)
        shorter_difference.remove(item2)
        if similarity(item1, item2) > threshold:
            union_size += 1
            intersection_size += 1
        else:
            union_size += 2
    union_size = union_size + len(longer_difference)
    return intersection_size / union_size
The problem here is that this is quadratic in the size of the sets, because in itertools.product I iterate over all possible pairs of items, one from each set. Now, I think I must do this because I want to match each item a from set1 with the best possible candidate b from set2 that isn't more similar to another item a' from set1.
I have a feeling that there should be an O(n) way of doing this that I'm not grasping. Do you have any suggestions?
There are other issues too, like recalculating the similarity for each pair once I get the best match, but I don't care too much about them.
I doubt there's any way that would be O(n) in the general case, but you can probably do a lot better than O(n^2) at least for most cases.
Does your similarity satisfy the triangle inequality? By this I mean: can you assume that distance(a, c) <= distance(a, b) + distance(b, c)? If not, this answer probably won't help. (I'm treating similarities like distances.)
Try clumping the data:
Pick a radius r. Based on intuition, I suggest setting r to one-third of the average of the first 5 similarities you calculate, or something.
The first point you pick in set1 becomes the centre of your first clump. Classify the points in set2 as being in the clump (similarity to the centre point <= r) or outside the clump. Also keep track of points that are within 2r of the clump centre.
You can require that clump centre points be at least a distance of 2r from each other; in that case some points may not be in any clump. I suggest making them at least r from each other. (Maybe less if you're dealing with a large number of dimensions.) You could treat every point as a clump centre but then you wouldn't save any processing time.
When you pick a new point, first compare it with the clump centre points (even though they're in the same set). Either it's in an already existing clump, or it becomes a new clump centre, (or perhaps neither if it's between r and 2r of a clump centre). If it's within r of a clump centre, then compare it with all points in the other set that are within 2r of that clump centre. You may be able to ignore points further than 2r from the clump centre. If you don't find a similar point within the clump (perhaps because the clump has no points left), then you may have to scan all the rest of the points for that case. Hopefully this would mostly happen only when there aren't many points left in the set. If this works well, then in most cases you'd find the most similar point within the clump and would know that it's the most similar point.
This idea may require some tweaking.
If there are a large number of dimensions involved, then you might find that, for a given radius r, frustratingly many points are within 2r of each other while few are within r of each other.
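A rough sketch of the clumping bookkeeping described above, treating distance as the dissimilarity and keeping centres at least r apart; all names are illustrative, and a point that falls in no clump simply falls back to a full scan:

def build_clumps(set1, set2, distance, r):
    # centres are drawn from set1 and kept at least r apart
    centres = []
    for p in set1:
        if all(distance(p, c) > r for c in centres):
            centres.append(p)
    # for each centre, remember the points of set2 within 2r of it
    near = {c: [q for q in set2 if distance(c, q) <= 2 * r] for c in centres}
    return centres, near

def candidates_for(p, centres, near, set2, distance, r):
    # if p is within r of a centre, any point within r of p is within 2r
    # of that centre (triangle inequality), so that shortlist is enough
    for c in centres:
        if distance(p, c) <= r:
            return near[c]
    return list(set2)  # no clump claims p: scan everything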
Here's another algorithm. The more time-consuming it is to calculate your similarity function (as compared to the time it takes to maintain sorted lists of points) the more index points you might want to have. If you know the number of dimensions, it might make sense to use that number of index points. You might reject a point as a candidate index point if it's too similar to another index point.
For the first point you use, and any others you decide to use as index points, generate a list of all the remaining points in the other set, sorted in order of distance from that index point.
When you're comparing a point P1 to points in the other set, I think you can skip over points for two possible reasons. Consider the most similar point P2 you've found to P1 so far. If P2 is similar to an index point, then you can skip all points that are sufficiently dissimilar from that index point. If P2 is dissimilar to an index point, then you can skip over all points that are sufficiently similar to that index point. I think in some cases you can skip over some of both types of point for the same index point.
I have a slight variant on the "find k nearest neighbours" algorithm which involves rejecting those that don't satisfy a certain condition and I can't think of how to do it efficiently.
What I'm after is to find the k nearest neighbours that are in the current line of sight. Unfortunately scipy.spatial.cKDTree doesn't provide an option for searching with a filter to conditionally reject points.
The best algorithm I can come up with is to query for n nearest neighbours and if there aren't k that are in the line of sight then query it again for 2n nearest neighbours and repeat. Unfortunately this would mean recomputing the n nearest neighbours repeatedly in the worst cases. The performance hit gets worse the more times I have to repeat this query. On the other hand setting n too high is potentially wasteful if most of the points returned aren't needed.
The line of sight changes frequently so I can't recompute the cKDTree each time either. Any suggestions?
If you are looking for the neighbours within a line of sight, couldn't you use a method like
cKDTree.query_ball_point(self, x, r, p, eps)
which allows you to query the KDTree for neighbours that are inside a radius of size r around the x array points.
Unless I misunderstood your question, it seems that the line of sight is known and is equivalent to this r value.
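For example, something along these lines, where the in_line_of_sight predicate and the radius r stand in for whatever your visibility test and sight range actually are (this assumes the radius usually contains at least k visible points):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 3)   # stand-in data; the tree is built once
tree = cKDTree(points)

def k_nearest_in_sight(x, k, r, in_line_of_sight):
    # all candidates within radius r, filtered by the visibility predicate,
    # then sorted by distance so the k closest survivors can be returned
    idx = tree.query_ball_point(x, r)
    idx = [i for i in idx if in_line_of_sight(points[i])]
    idx.sort(key=lambda i: np.linalg.norm(points[i] - x))
    return idx[:k]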
I have the following problem:
Consider a weighted directed graph.
Each node has a rating, and the weighted edges represent
the "influence" of a node on its neighbors.
When a node's rating changes, its neighbors see their own ratings modified (positively or negatively).
How do I propagate a new rating from one node?
I think this should be a standard algorithm but which one?
This is a general question but in practice I am using Python ;)
Thanks
[EDIT]
The rating is a simple float value between 0 and 1: [0.0, 1.0].
There is certainly a convergence issue; I just want to limit the propagation to a few iterations...
There is an easy standard way to do it as follows:
let G=(V,E) be the graph
let w:E->R be a weight function such that w(e) = weight of edge e
let A be an array such that A[v] = rating(v)
let n be the required number of iterations

for i from 1 to n (inclusive) do:
    for each vertex v in V:
        A'[v] = calculateNewRating(v,A,w)  # use the array A for the old values and w
    A <- A'  # assign A the new values which are stored in A'
return A
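In Python, with the update rule left as a parameter of the problem, the loop above might look roughly like this; the weighted-average rule shown for calculateNewRating is just one possible choice, with the result clipped back into [0.0, 1.0] as in the question:

def propagate_ratings(in_neighbors, w, rating, iterations):
    # in_neighbors[v] -- list of vertices u with an edge (u, v)
    # w[(u, v)]       -- weight ("influence") of the edge (u, v)
    # rating[v]       -- current rating of v, a float in [0.0, 1.0]
    for _ in range(iterations):
        new_rating = {}
        for v, preds in in_neighbors.items():
            total = sum(w[(u, v)] * rating[u] for u in preds)
            norm = sum(abs(w[(u, v)]) for u in preds)
            if norm > 0:
                # weighted average of the predecessors' old ratings,
                # clipped back into [0.0, 1.0]
                new_rating[v] = min(1.0, max(0.0, total / norm))
            else:
                new_rating[v] = rating[v]  # no incoming influence: unchanged
        rating = new_rating  # A <- A'
    return rating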
However, for some cases you might have better algorithms, based on the features of the graph and on how the rating for each node is recalculated. For example:
Assume rating'(v) = sum(rating(u) * w(u,v)) over all (u,v) in E, and you get a variation of PageRank, which is guaranteed to converge to the principal eigenvector if the graph is strongly connected (Perron-Frobenius theorem), so calculating the final value is simple.
Assume rating'(v) = max{ rating(u) | for each (u,v) in E}, then it is also guaranteed to converge and can be solved linearly using strongly connected components. This thread discusses this case.