I am using Python (and have access to pandas, numpy, scipy).
I have two sets of strings, set A and set B. Each of A and B contains c. 2000 elements (each element being a string). The strings are around 50-100 characters long, comprising up to c. 20 words (these sets may get much larger).
I wish to check whether a member of set A is also a member of set B.
Now I am thinking a naive implementation could be visualised as a matrix where the members of A and B are compared to one another (e.g. A1 == B1, A1 == B2, A1 == B3 and so on...) and the booleans (0, 1) from the comparisons form the elements of the matrix.
What is the best way to implement this efficiently?
Two further elaborations:
(i) I am also thinking that for larger sets I may use a Bloom filter (e.g. using PyBloom or pybloomfilter) to hash each string (i.e. I don't mind false positives so much...). Is this a good approach, or are there other strategies I should consider?
(ii) I am thinking of including a Levenshtein distance match between strings (which I know can be slow) as I may need fuzzy matches - is there a way of combining this with the approach in (i), or otherwise making it more efficient?
Thanks in advance for any help!
Firstly, 2000 * 100 chars isn't that big; you could use a set directly.
Secondly, if your strings are sorted, there is a quick way (which I found here) to compare them, as follows:
def compare(E1, E2):
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
It is certainly slower than using a set, but it doesn't need the strings to be held in memory (only two are needed at a time).
For the Levenshtein part, there is a C module on PyPI which is quite fast.
As mentioned in the comments:
def compare(A, B):
    return list(set(A).intersection(B))
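To illustrate how the original questions (i) and (ii) could fit together, here is a minimal sketch, my own addition: exact matches come from the set intersection above, and only the leftover strings from A are checked fuzzily against B. It assumes the python-Levenshtein package (imported as Levenshtein) and a hypothetical max_distance threshold:
import Levenshtein  # C extension from PyPI (package name: python-Levenshtein)

def exact_and_fuzzy(A, B, max_distance=2):
    """Exact matches via set intersection, then fuzzy matches for the rest."""
    set_b = set(B)
    exact = set(A) & set_b
    fuzzy = {}
    for a in set(A) - exact:          # only strings without an exact match
        for b in set_b:
            if Levenshtein.distance(a, b) <= max_distance:
                fuzzy[a] = b          # record one close match and move on
                break
    return exact, fuzzy
The fuzzy pass is still O(|A| * |B|) in the worst case; for larger sets you would want some blocking (e.g. by string length or shared prefix) before computing distances.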
This is a modified version of the function that @michaelmeyer presented here: https://stackoverflow.com/a/17264117/362951 - in his answer to the question at the top of this page.
The modified version below also works on unsorted data, because the function now includes the sorting.
This should not be a performance or resource problem in many cases, because Python sorting is very efficient. And presorting also helps.
Please note that the 'output' is now in sorted order too. This will differ from the original order of the first parameter, if it was unsorted.
Otherwise the sorting won't hurt much, even if both data sets are already sorted.
But if you want to suppress the sorting, in case both data sets are known to be sorted in ascending order already, call it like this:
compare(my_data1,my_data2,data_is_sorted=True)
Otherwise:
compare(my_data1,my_data2)
and the function accepts unordered data.
This is the modified version. Only the optional third parameter and the first three lines of the function body (the conditional sorting) were added:
def compare(E1, E2, data_is_sorted=False):
    if not data_is_sorted:
        E1 = sorted(E1)
        E2 = sorted(E2)
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
I'm trying to build a heuristic for the simplest feasible Golomb Ruler possible. From 0 to n, find n numbers such that all the differences between them are different. The heuristic consists of incrementing the ruler by 1 each time. If a difference already exists in the list, jump to the next integer. So the ruler starts with [0,1] and the list of differences = [1]. Then we try to add 2 to the ruler, [0,1,2], but it's not feasible, since the difference (2-1 = 1) already exists in the list of differences. Then we try to add 3 to the ruler, [0,1,3], and it is feasible, and thus the list of differences becomes [1,2,3], and so on. Here's what I've come to so far:
n = 5
positions = list(range(1, n + 1))
Pos = []
Dist = []
difs = []
i = 0
while i < len(positions):
    if len(Pos) == 0:
        Pos.append(0)
        Dist.append(0)
    elif len(Pos) == 1:
        Pos.append(1)
        Dist.append(1)
    else:
        postest = Pos + [i]  # check feasibility to enter the ruler
        difs = [a - b for a in postest for b in postest if a > b]
        if any(d in difs for d in Dist) == True:
            pass
        else:
            for d in difs:
                Dist.append(d)
            Pos.append(i)
    i += 1
However I can't make the differences check to work. Any suggestions?
For efficiency I would tend to use a set to store the differences, because they are good for inclusion testing, and you don't care about the ordering (possibly until you actually print them out, at which point you can use sorted).
You can use a temporary set to store the differences between the number that you are testing and the numbers you currently have, and then either add it to the existing set, or discard it if you find any matches. (Note the else block on the for loop, which executes if break was not encountered.)
n = 5
i = 0
vals = []
diffs = set()
while len(vals) < n:
    diffs1 = set()
    for j in reversed(vals):
        diff = i - j
        if diff in diffs:
            break
        diffs1.add(diff)
    else:
        vals.append(i)
        diffs.update(diffs1)
    i += 1
print(vals, sorted(diffs))
The explicit loop over values (rather than the use of any) is to avoid unnecessarily calculating the differences between the candidate number and all the existing values, when most candidate numbers are not successful and the loop can be aborted early after finding the first match.
It would also work for vals to be a set, using add instead of append (although similarly, you would probably want to use sorted when printing it). Here a list is used, and although in principle it does not matter in which order you iterate over it, this code iterates in reverse order to test the smaller differences first, because unusable candidates are likely to be rejected more quickly that way. Testing with n=200, the code ran in about 0.2 seconds with reversed and about 2.1 seconds without; the effect becomes progressively more noticeable as n increases. With n=400, it took 1.7 seconds with reversed versus 27 seconds without.
I have a list with repeating elements, i.e. orig = [1,1,1,2,2,3].
I want to create a derangement b = f(orig) such that for every position, the value in b is different from the value in orig:
b[i] != orig[i], for all i
I know a solution for when all elements in orig are unique, but this is a harder case.
Developing a solution in python, but any language will do.
The not-so-efficient solution is clearly:
import itertools
set([s for s in itertools.permutations(orig) if not any([a == b for a, b in zip(s, orig)])])
A second method and first improvement is using this perm_unique:
[s for s in perm_unique(orig) if not any([a == b for a, b in zip(s, orig)])]
A third method is to use this super quick unique_permutations algorithm.
import copy
[copy.copy(s) for s in unique_permutations(orig) if not any([a == b for a, b in zip(s, orig)])]
In my notebook with %%timeit the initial method takes 841 µs, and we improve to 266 µs and then to 137 µs.
Edit
Couldn't stop thinking about it, so I made a small edit to the second method. I didn't have time to dive into the last method. For an explanation, first see the original post (link above). Then I only added the condition el != elements[depth] to the existing check, which enforces the derangement requirement. With this we arrive at a running time of 50 µs.
from collections import Counter

def derangement_unique(elements):
    list_unique = Counter(elements)
    length_list = len(elements)  # will become depth in the next function
    placeholder = [0] * length_list  # will contain the result
    return derangement_unique_helper(elements, list_unique, placeholder, length_list - 1)

def derangement_unique_helper(elements, list_unique, result_list, depth):
    if depth < 0:  # arrived at a solution
        yield tuple(result_list)
    else:
        # consider all elements and how many times they should still occur
        for el, count in list_unique.items():
            # ... still required and not breaking the derangement requirement
            if count > 0 and el != elements[depth]:
                result_list[depth] = el  # assign element
                list_unique[el] -= 1  # subtract number still needed
                # loop over all possible continuations
                for g in derangement_unique_helper(elements, list_unique, result_list, depth - 1):
                    yield g
                list_unique[el] += 1

list(derangement_unique(orig))
If your list contains a significant share of duplicates, it might be hard to find a derangement quickly.
In this case you can try a graph approach.
Turn the initial list into a graph where every item is connected to the non-equal elements (easy for a sorted list).
Then build a perfect matching (if the number of elements is even) or a near-perfect matching (for an odd count, you'll need to find some suitable pair and join the single node to it).
The edges of the matching indicate the swaps that make the derangement.
The Python library networkx should contain the needed methods; a rough sketch follows.
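Here is a minimal sketch of that idea, my own addition rather than part of the original answer. It assumes a perfect matching exists (no value occupies more than half of the positions) and uses networkx's max_weight_matching with maxcardinality=True as the matching routine:
import networkx as nx

orig = [1, 1, 1, 2, 2, 3]

# Nodes are positions; connect two positions iff their values differ.
G = nx.Graph()
G.add_nodes_from(range(len(orig)))
G.add_edges_from((i, j)
                 for i in range(len(orig))
                 for j in range(i + 1, len(orig))
                 if orig[i] != orig[j])

# Maximum-cardinality matching; perfect here because no value
# fills more than half of the positions.
matching = nx.max_weight_matching(G, maxcardinality=True)

b = orig[:]
for i, j in matching:
    b[i], b[j] = b[j], b[i]  # swapping a matched pair deranges both positions

print(orig, b)
For an odd-length list you would, as described above, attach the leftover position to a suitable matched pair and rotate those three values instead of swapping two.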
I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in Python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now either use brute force or get more complicated than I would like (e.g. using Biopython and sequence aligners). I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!
Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.
My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag) - 1):
    bestMatch = [2, '', '']  # overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0, len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr) - x > bestMatch[0]:
                    bestMatch = [len(otherStr) - x, strFrag.index(otherStr), otherStr[:x] + strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0] + otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]] + strFrag[bestMatch[1] + 1:]

print(strFrag)
print(strFrag[0])
Basically the code compares every string/fragment to the first in the list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. The code assumes that there are no unfillable gaps between strings/fragments (otherwise the answer may not be the longest possible assembly; this can be addressed by randomizing the starting string/fragment). It also assumes that the reverse complement is not present (a poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirement (by changing the bestMatch[0] value) to prevent false matches. The last assumption is that all matches are exact; allowing mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.
To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]

print(res)  # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient: the overlap check tries every prefix length and each check costs up to that length, so the whole thing is roughly O(n*k^2), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
    # i == 0 keeps the generator non-empty when there is no overlap at all
    return max(i for i in range(len(b) + 1)
               if i == 0 or (b[i-1] == a[-1] and a.endswith(b[:i])))
Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle

def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """
    overlaps = []
    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))
    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst)  # to verify order doesn't matter

overlaps = defaultdict(list)
while len(lst) > 1:
    overlaps.clear()
    for a in lst:
        for b in lst:
            if a == b:
                continue
            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))
    maximum = max(overlaps)
    if maximum == 0:
        break
    start, a, b = choice(overlaps[maximum])  # pick one among equals
    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)

print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.
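One possible direction for such an improvement, sketched here as my own suggestion rather than a drop-in replacement (it returns only the overlap length, not the (amount, start) pair used above): scan candidate start positions in a with str.find and accept the first, and therefore longest, suffix of a that is a prefix of b:
def overlap_length(a, b):
    """Largest k such that a ends with b[:k]; 0 if there is no overlap."""
    if not b:
        return 0
    start = max(len(a) - len(b), 0)    # leftmost position that could still fit inside b
    while True:
        start = a.find(b[0], start)    # jump to the next occurrence of b's first character
        if start == -1:
            return 0
        if b.startswith(a[start:]):    # the whole tail of a must be a prefix of b
            return len(a) - start
        start += 1
Because the scan begins at the leftmost feasible position, the first hit already corresponds to the longest overlap, so no candidate lengths need to be compared afterwards.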
Once the peptides grow to around 20 amino acids, cdlane's code chokes and spams multiple incorrect answers with various amino acid lengths.
Try adding and using the AA sequence 'VPSGALWDVPS', with or without the 'D', and the code starts to fail at its task because the N- and C-termini grow and no longer reflect what Adam Price is asking for. The output is 'SGALWDVPSGALWDVPSPV', and thus 100% incorrect despite the effort.
Tbh, imo there is only one 100% answer, and that is to use BLAST and its protein search page, or BLAST in the Biopython package. Or adapt cdlane's code to handle AA gaps, substitutions and additions.
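For the Biopython route, a minimal sketch (my own illustration, not part of the original answer) using Bio.Align.PairwiseAligner in local mode, which tolerates gaps and substitutions when scoring how well two fragments overlap; the scoring parameters below are arbitrary placeholders:
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"            # local alignment finds the overlapping region
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

a, b = "SGALWDV", "GALWDVP"
alignments = aligner.align(a, b)
best = alignments[0]
print(best.score)
print(best)  # shows which region of a aligns with which region of b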
Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order and each is offset from the previous one by the same amount (in this case 1), the following fairly simple concatenation works, though it might not be the world's most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
I built a function which finds the longest common substring of two text files, in ascending order of length, based on the Rabin–Karp algorithm.
The main function is "find_longest" and the inner functions are "make_hashtable", "extend_fingerprints" and "has_match".
I'm having trouble analyzing the average-case complexity of has_match.
Denote by n1, n2 the lengths of text1, text2 and by l the size of the current "window".
The fingers are the hash fingerprints of the substrings (windows).
def has_match(text1, text2, fingers1, fingers2, l, r):
    h = make_hashtable(fingers2, r)
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l]
    return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
    hash_table = [[] for i in range(table_size)]
    count = 0
    for f in fingers:
        hash_table[f].append(count)
        count += 1
    return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1, text2, basis=2**8, r=2**17-1):
    match = ''
    l = 0  # initial "window" size
    # fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0] * (len(text1) + 1)
    fingers2 = [0] * (len(text2) + 1)
    while match != None:  # there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1, text2, fingers1, fingers2, l, r)
        print(match)
    return l - 1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count = 0
    for f in fingers:
        if count == len(fingers) - 1:
            fingers.pop(len(fingers) - 1)
            break
        fingers[count] = (f * basis + ord(text[length - 1 + count])) % r
        count += 1
I'm having doubts between these two options:
1. O(n_2 - l + 1) + O(n_1 - l + 1) * O(l)
Treat r as a constant while n1, n2 are very large, so a lot of collisions occur in the hash table (say O(1) items in every cell, yet always some "false positives").
2. O(n_2 - l + 1) + O(n_1 - l + 1) + O(l)
Treat r as optimal for a decent hash function, so there are almost no collisions, which means that if two substrings land in the same cell of the hash table we may assume they are actually the same text.
Personally I lean towards the bold statement.
Thanks.
I think the answer is O((n_2 - l) + l * (n_1 - l)).
(n_2 - l) represents the complexity of make_hashtable for the second text.
l * (n_1 - l) represents the two nested loops, which go through every item in the fingerprints of the first text and perform a comparison of an l-length slice for the (constant number of) items that share the same cell in the hash table.
I have written a few lines of code to solve this problem, but the profiler says it is very time-consuming (I'm using the kernprof line-by-line profiler).
Here is the code:
comp = [1, 2, 3]  # comp is a list that always has 3 elements; the values 1, 2, 3 are just for illustration
m = max(comp)
max_where = [i for i, j in enumerate(comp) if j == m]
if 0 in max_where:
    ...  # some action1
if 1 in max_where:
    ...  # some action2
if 2 in max_where:
    ...  # some action3
The profiler says that most of the time is consumed in the max_where calculation. I have also tried to split this calculation into an if-tree to avoid some unnecessary operations, but the results were not satisfactory.
Please, am I doing it wrong or is it just Python?
If it's always three elements, why not simply do:
comp = [1, 2, 3]
m = max(comp)
if comp[0] == m:
    ...  # some action
if comp[1] == m:
    ...  # some action
if comp[2] == m:
    ...  # some action
If you're doing this many times, and if you have all the lists available at the same time, then you could make use of numpy.argmax to get the indices for all the lists.
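As a minimal sketch of that idea (my own illustration, assuming the three-element lists can be stacked into a 2-D array): numpy.argmax gives the index of the first maximum in every row, and comparing against the row-wise maximum gives all maximum positions at once:
import numpy as np

# A hypothetical batch of three-element lists, stacked into one 2-D array.
rows = np.array([[1, 2, 3],
                 [5, 2, 5],
                 [7, 1, 0]])

# Index of the first maximum in every row, computed in a single call.
first_max_idx = rows.argmax(axis=1)
print(first_max_idx)   # [2 0 0]

# If ties matter, compare against the row-wise maximum to get all positions.
all_max_mask = rows == rows.max(axis=1, keepdims=True)
print(all_max_mask)    # True wherever a row attains its maximum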
You say that this is a time-consuming operation, but I sincerely doubt this actually affects your program. Have you actually found that this is causing some problem due to slow execution in your code? If not, there is no point optimizing.
This said, there is a small optimization I can think of - which is to use a set rather than a list comprehension for max_where. This will make your three membership tests faster.
max_where = {i for i, j in enumerate(comp) if j == m}
That said, with only three items/checks, the construction of the set may well take more time than it saves.
In general, with a list of three items, this operation is going to take negligible amounts of time. On my system, it takes half a microsecond to perform this operation.
In short: Don't bother. Unless this is a proven bottleneck in your program that needs to be sped up, your current code is fine.
Expanding upon Tobias' answer, using a for loop:
comp = [1, 2, 3]
m = max(comp)
for index in range(len(comp)):
    if comp[index] == m:
        ...  # some action
Since indexing starts at 0, you do not need to do len(comp) + 1.
I prefer using the index in a for loop instead of the actual element, because it speeds things up considerably.
Sometimes in a process you may need the index of a specific element. Then using l.index(obj) wastes time (even if only insignificant amounts; for longer processes this adds up).
This also assumes that every process (for comp[index]) is very similar: the same process but with different variables. This wouldn't work if you have significantly different processes for each index.
However, by using for index in range(len(l)), you already have the index, and the item can easily be accessed with l[index] (along with the index, which is given by the loop).
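For completeness, a small sketch (my own addition) of the same loop written with enumerate, which yields the index and the element together without a separate lookup:
comp = [1, 2, 3]
m = max(comp)

# enumerate yields (index, element) pairs, so both are available in the body.
for index, value in enumerate(comp):
    if value == m:
        ...  # some action using index and/or value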
Oddly, it seems that Tobias' implementation is faster (I thought otherwise):
comp = [1, 2, 3]
m = max(comp)

from timeit import timeit

def test1():
    if comp[0] == m: return m
    if comp[1] == m: return m
    if comp[2] == m: return m

def test2():
    for index in range(len(comp)):
        if comp[index] == m: return m

print('test1:', timeit(test1, number=1000))
print('test2:', timeit(test2, number=1000))
Returns:
test1: 0.00121262329299
test2: 0.00469034990534
My implementation may be faster for longer lists (not sure, though). However, writing the explicit version for a long list is tedious (it needs a repeated if comp[n] == m for every index).
How about this:
sample = [3, 1, 2]
dic = {0: func_a, 1: func_b, 2: func_c}
x = max(sample)
y = sample.index(x)
dic[y]()  # call the function registered for the index of the maximum
As mentioned (and rightfully downvoted), this does not work when multiple indices hold the maximum.
However, this does:
sample = [3, 1, 3]
dic = {0: "func_a", 1: "func_b", 2: "func_c"}
max_val = max(sample)
max_indices = [index for index, elem in enumerate(sample) if elem == max_val]
for key in max_indices:
    dic[key]  # look up (or call) whatever is registered for each maximal index
This is quite similar to other solutions above. I know some time has passed, but it wasn't right the way it was. :)
Cheers!