Performance of a "fuzzy" Jaccard index implementation

Performance of a "fuzzy" Jaccard index implementation - python

I'm a trying to calculate a kind of fuzzy Jaccard index between two sets with the following rationale: as the Jaccard index, I want to calculate the ratio between the number of items that are common to both sets and the total number of different items in both sets. The problem is that I want to use a similarity function with a threshold to determine what what counts as the "same" item being in both sets, so that items that are similar:
Aren't counted twice in the union
Are counted in the intersection.
I have a working implementation here (in python):
def fuzzy_jaccard(set1, set2, similarity, threshold):
intersection_size = union_size = len(set1 & set2)
shorter_difference, longer_difference = sorted([set2 - set1, set1 - set2], key=len)
while len(shorter_difference) > 0:
item1, item2 = max(
itertools.product(longer_difference, shorter_difference),
key=lambda (a, b): similarity(a, b)
)
longer_difference.remove(item1)
shorter_difference.remove(item2)
if similarity(item1, item2) > threshold:
union_size += 1
intersection_size += 1
else:
union_size += 2
union_size = union_size + len(longer_difference)
return intersection_size / union_size
The problem here is the this is quadratic in the size of the sets, because in itertools.product I iterate in all possible pairs of items taken one from each set(*). Now, I think I must do this because I want to match each item a from set1 with the best possible candidate b from set2 that isn't more similar to another item a' from set1.
I have a feeling that there should be a O(n) way of doing that I'm not grasping. Do you have any suggestions?
There are other issues two, like recalculating the similarity for each pair once I get the best match, but I don't care to much about them.

I doubt there's any way that would be O(n) in the general case, but you can probably do a lot better than O(n^2) at least for most cases.
Is similarity transitive? By this I mean: can you assume that distance(a, c) <= distance(a, b) + distance(b, c)? If not, this answer probably won't help. I'm treating similarities like distances.
Try clumping the data:
Pick a radius r. Based on intuition, I suggest setting r to one-third of the average of the first 5 similarities you calculate, or something.
The first point you pick in set1 becomes the centre of your first clump. Classify the points in set2 as being in the clump (similarity to the centre point <= r) or outside the clump. Also keep track of points that are within 2r of the clump centre.
You can require that clump centre points be at least a distance of 2r from each other; in that case some points may not be in any clump. I suggest making them at least r from each other. (Maybe less if you're dealing with a large number of dimensions.) You could treat every point as a clump centre but then you wouldn't save any processing time.
When you pick a new point, first compare it with the clump centre points (even though they're in the same set). Either it's in an already existing clump, or it becomes a new clump centre, (or perhaps neither if it's between r and 2r of a clump centre). If it's within r of a clump centre, then compare it with all points in the other set that are within 2r of that clump centre. You may be able to ignore points further than 2r from the clump centre. If you don't find a similar point within the clump (perhaps because the clump has no points left), then you may have to scan all the rest of the points for that case. Hopefully this would mostly happen only when there aren't many points left in the set. If this works well, then in most cases you'd find the most similar point within the clump and would know that it's the most similar point.
This idea may require some tweaking.
If there are a large number of dimenstions involved, then you might find that for a given radius r, frustratingly many points are within 2r of each other while few are within r of each other.
Here's another algorithm. The more time-consuming it is to calculate your similarity function (as compared to the time it takes to maintain sorted lists of points) the more index points you might want to have. If you know the number of dimensions, it might make sense to use that number of index points. You might reject a point as a candidate index point if it's too similar to another index point.
For each of the first point you use and any others you decide to use as index points, generate a list of all the remaining points in the other set, sorted in order of distance from the index point,
When you're comparing a point P1 to points in the other set, I think you can skip over sets for two possible reasons. Consider the most similar point P2 you've found to P1. If P2 is similar to an index point then you can skip all points which are sufficiently dissimilar from that index point. If P2 is dissimilar to an index point then you can skip over all points which are sufficiently similar to that index point. I think in some cases you can skip over some of both types of point for the same index point.

Related

Selecting an item (from a set of items) based on distance and frequency of occurence

There exists a set of points (or items, it doesn't matter). Each point a is at a specific distance from other points in the set. The distance can be retrieved via the function retrieve_dist(a, b).
This question is about programming (in Python) an algorithm to pick a point, with replacement, from this set of points. The picked point:
i) has to be at the maximum possible distance from all already-selected points, while adhering to the requirement in (ii)
ii) the number of times an already-selected point occurs in the sample must carry weight in this calculation. I.e. more frequently-selected points should be weighed more heavily.
E.g. imagine a and b have already been selected (100 and 10 times respectively). Then when the next point is to be selected, it's distance from a matters more than its distance from b, in line with the frequency of occurrence of a in the already-selected sample.
What I can try:
This would have been easy to accomplish if weights/frequencies weren't in play. I could do:
distances = defaultdict(int)
for new_point in set_of_points:
for already_selected_point in selected_points:
distances[new_point] += retrieve_dist(new_point, already_selected_point)
Then I'd sort distances.items() by the second entry in each tuple, and would get the desired item to select.
However, when frequencies of already-selected points come into play, I just can't seem to wrap my head around this problem.
Can an expert help out? Thanks in advance.

A solution to your problem would be to make selected_points a list rather than a set. In this case, each new point is compared to a and b (and all other points) as many times as they have already been found.
If each point is typically found many times, it might be possible to improve perfomance using a dict instead, with the key being the points, and the value being the number of times each point is selected. In that case I think your algorithm would be
distances = defaultdict(int)
for new_point in set_of_points:
for already_selected_point, occurances in selected_points.items():
distances[new_point] += occurances * retrieve_dist(new_point, already_selected_point)

Fast way find strings within hamming distance x of each other in a large array of random fixed length strings

I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the hamming distance between every string but this would take way to long.
My second approach used a masking method to create all possible variations of the strings and store them in a dictionary and then check if this variation was found more then 1 time. This worked pretty fast(20 min) for a hamming distance of 1 but is very memory intensive and would not be viable to use for a hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []
masks = {}
for sequence in sequences:
for i in range(len(sequence)):
try:
masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
except KeyError:
masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]
matches = {}
for mask in masks:
if len(masks[mask]) > 1:
matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.

One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs, it returns all the pairs with a high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, N<<5^24 and a maximum hamming distance d, return pairs which have a hamming distance atmost d.
The way you'll solve this is to randomly generates K planes {P_1,P_2,...,P_K} in R^{24}; Where K is a parameter you'll have to experiment with. For every data point v, you'll define a hash of v as the tuple Hash(v)=(a_1,a_2,...,a_K) where a_i∈{0,1} denotes if v is above this plane or below it. You can prove (I'll omit the proof) that if the hamming distance between two vectors is small then the probability that their hash is close is high.
So, for any given data point, rather than checking all the datapoints in the sequences, you only check data points in the bin of "close" hashes.
Note that these are very heuristic based and will need you to experiment with K and how "close" you want to search from each hash. As K increases, your number of bins increase exponentially with it, but the likelihood of similarity increases.
Judging by what you said, it looks like you have a gigantic dataset so I thought I would throw this for you to consider.

Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1GB of ram. I modified the demo.c file to work for my use case.

Finding range overlap between at least three ranges Python

I have a list of tuples, which I am using to mark the lower and upper bounds of ranges. For example:
[(3,10), (4,11), (2,6), (8,11), (9,11)] # 5 separate ranges.
I want to find the ranges where three or more of the input ranges overlap. For instance the tuples listed above would return:
[(4,6), (8,11)]
I tried the method provided by #WolframH in answer to this post
But I don't know what I can do to:
Give me more than one output range
Set a threshold of at least three range overlaps to qualify an output

You first have to find all combinations of ranges. Then you can transform them to sets and calculate the intersection:
import itertools
limits = [(3,10), (4,11), (2,6), (8,11), (9,11)]
ranges = [range(*lim) for lim in limits]
results = []
for comb in itertools.combinations(ranges,3):
intersection = set(comb[0]).intersection(comb[1])
intersection = intersection.intersection(comb[2])
if intersection and intersection not in results and\
not any(map(intersection.issubset, results)):
results = filter(lambda res: not intersection.issuperset(res),results)
results.append(intersection)
result_limits = [(res[0], res[-1]+1) for res in map(list,results)]
It should give you all 3-wise intersections

You can, of course, solve this by brute-force checking all the combinations if you want. If you need this algorithm to scale, though, you can do it in (pseudo) nlogn. You can technically come up with a degenerate worst-case that's O(n**2), but whatchagonnado.
Basically, you sort the ranges, then for a given range you look to its immediate left to see that the bounds overlap, and if so you then look right to mark overlapping intervals as results. Pseudocode (which is actually almost valid python, look at that):
ranges.sort()
for left_range, current_range, right_range in sliding_window(ranges, 3):
if left_range.right < current_range.left:
continue
while right_range.left < min(left_range.right, current_range.right):
results.append(overlap(left_range, right_range))
right_range = right_range.next
#Before moving on to the next node, extend the current_range's right bound
#to be the longer of (left_range.right, current_range.right)
#This makes sense if you think about it.
current_range.right = max(left_range.right, current_range.right)
merge_overlapping(results)
(you also need to merge some possibly-overlapping ranges at the end, this is another nlogn operation - though n will usually be much smaller there. I won't discuss the code for that, but it's similar in approach to the above, involving a sort-then-merge. See here for an example.)

Getting Keys Within Range/Finding Nearest Neighbor From Dictionary Keys Stored As Tuples

I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more the one point being used.Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9
#example dictionary + input
dictionary[(0,0,0)]=[]
dictionary[(0,0,1)]=[]
keyToAdd = [0,1,1]
closestMatch = 2**1000
tooClose = False
for keys in dictionary:
#calculate distance to new point
originalCoordinates = str(split( dictionary[keys], "," ) ).replace("(","").replace(")","")
for i in range(dimensions):
distanceToPoint = #do pythagors with originalCoordinates and keyToAdd
#if you want the overall closest match
if distanceToPoint < closestMatch:
closestMatch = distanceToPoint
#if you want to just check it's not within that radius
if distanceToPoint < minimumDistance:
tooClose = True
break
However, performing calculations this way may still run very slow (it must do this to millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.

You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, add the component offsets), since the Euclidean distance will never be less than this. This would avoid some of the more expensive calculations (though I don't think you need and trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.

Filter list to remove similar, but not identical, entries

I have a long list containing several thousand names that are all unique strings, but I would like to filter them to produce a shorter list so that if there are similar names only one is retained. For example, the original list could contain:
Mickey Mouse
Mickey M Mouse
Mickey M. Mouse
The new list would contain just one of them - it doesn't really matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the text being compared), so providing I pick an appropriate ratio it I have a way of making a include/exclude decision.
difflib.SequenceMatcher(None, a, b).ratio()
What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it baffling my newbie brain.
I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list.
for p in ppl1:
for pp in ppl2:
if difflib.SequenceMater(None, p, pp).ratio() <=0.9:
ppl2.append(p)
In fact, even if that did populate the list, it'd still be wrong. I guess it'd need to compare the name from the first list to all the names in the second list, keep track of the highest ratio scored, and then only add it if the highest ratio was less that the cutoff criteria.
Any guidance gratefully received!

I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.
What you're trying to do is a variant of agglomerative clustering. A union-find algorithm can be used to solve this efficiently. From all pairs of distinct strings a and b, which can be generated using
def pairs(l):
for i, a in enumerate(l):
for j in range(i + 1, len(l)):
yield (a, l[j])
you filter the pairs that have a similarity ratio <= .9:
similar = ((a, b) for a, b in pairs
if difflib.SequenceMatcher(None, p, pp).ratio() <= .9)
then union those in a disjoint-set forest. After that, you loop over the sets to get their representatives.

Firstly, you shouldn't modify a list while you're iterating over it.
One strategy would be to go through all pairs of names and, if a certain pair is too similar to each other, only keep one, and then iterate this until no two pairs are too similar. Of course, the result would now depend on the initial order of the list, but if your data is sufficiently clustered and your similarity score metric sufficiently nice, it should produce what you're looking for.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.