Increasing speed of Fuzz Ratio - python

I am new to python and am struggling to speed up this piece of code.
I have about 1 million strings in zz1 and 250,000 strings in a3. The strings in zz1 contain errors, and I would like to match each of them to the string in a3 with the highest fuzzy-match ratio.
I am happy with the results, but it takes about 9 seconds to process one row of zz1. Is there a way to speed this up?
import csv
import string
import time
import sys
from fuzzywuzzy import fuzz

op = []
for i in range(1, len(zz1)):
    num = zz1[i][1]
    start2 = time.clock()
    forecast = []
    if not num in zz4 and zz5:
        for j in range(1, len(a3)):
            forecast.append(fuzz.ratio(num, a3[j][1]))
        if max(forecast) > 60:
            index = forecast.index(max(forecast))
            op.append(a3[index][1])
    elapsed2 = (time.clock() - start2)
    print(i, elapsed2)

Since the data represents abbreviated strings rather than typos or other errors, you should be able to perform this operation by eliminating the easy choices.
Consider these two examples (I don't know how GRAND CANYON would be abbreviated, so this is just a guess to illustrate the point):
GRAND CANYON -> GR CANYON
MELBOURNE REEF -> MELBRNE REEF
There is no way GR CANYON could possibly relate to MELBOURNE REEF, so there is no need to test that. Group by left prefix of the strings to codify this test, and you should see a very high hit rate since an abbreviation will normally preserve the first letter.
So, you only compare abbreviations that start with a "G" with the full names that also start with a "G", and "M" with "M", etc.
Note also that this doesn't have to provide a perfect match; you will maintain lists of unmatched strings and remove matches as you find them, so after you have compared the grouped strings and removed the matches, your final list of unmatched items will be a very small subset of your original data that can be compared relatively quickly using your original brute-force approach.
With just the first character, each list is split into roughly 26 groups and every string is compared only against its matching group, so the total number of comparisons drops by roughly a factor of 26 (assuming the prefixes are reasonably evenly distributed): from about 250 billion down to about 10 billion.
Since perfect accuracy isn't necessary in the initial elimination steps, you could extend this grouping to more than just the first character for the first iteration, reducing the number of characters for each additional iteration until you reach zero characters. With 2-character grouping, the number of comparisons drops by roughly 26*26, to around 370 million; with 3-character grouping, by roughly 26*26*26, to around 14 million.
By using this iterative approach, you will perform dramatically fewer fuzz operations without losing accuracy.
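A minimal sketch of one grouping pass, assuming (as in the question) that zz1 and a3 are lists of rows whose second element holds the string:
from collections import defaultdict
from fuzzywuzzy import fuzz

def group_by_prefix(rows, length):
    """Bucket rows by the first `length` characters of the string in column 1."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1][:length].upper()].append(row)
    return groups

def match_within_groups(zz1_rows, a3_rows, prefix_len, threshold=60):
    """Compare only strings whose prefixes agree; return matches and leftovers."""
    a3_groups = group_by_prefix(a3_rows, prefix_len)
    matches, unmatched = {}, []
    for row in zz1_rows:
        candidates = a3_groups.get(row[1][:prefix_len].upper(), [])
        best_score, best_name = 0, None
        for cand in candidates:
            score = fuzz.ratio(row[1], cand[1])
            if score > best_score:
                best_score, best_name = score, cand[1]
        if best_score > threshold:
            matches[row[1]] = best_name
        else:
            unmatched.append(row)      # keep for the next, shorter-prefix pass
    return matches, unmatched
Run it first with prefix_len=3, feed the unmatched leftovers back in with prefix_len=2, then 1, and finally 0 for the brute-force pass over whatever little remains.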

Not sure how much time this will save but this bit seems unnecessarily complicated:
for j in range(1, len(a3)):
    forecast.append(fuzz.ratio(num, a3[j][1]))
if max(forecast) > 60:
    index = forecast.index(max(forecast))
    op.append(a3[index][1])
Every time, you're making a new list with 250,000 elements in it, then you're checking that new list for the max element, then you're finding the index of that max element to retrieve the associated value from the original list. It probably makes more sense to do something more like:
max_ratio = 0
match = ''
for a in a3:
    fr = fuzz.ratio(num, a[1])
    if fr > max_ratio:
        max_ratio = fr
        match = a[1]
if max_ratio > 60:
    op.append(match)
This way you're not creating and storing a big new list every time, you're just keeping the best match you find, replacing it only when you find a better match.
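If you'd rather lean on the library, fuzzywuzzy's process module already implements this best-match loop (and runs noticeably faster when python-Levenshtein is installed). A sketch of the inner step, assuming the candidate strings are pulled out of a3 once, before the loop:
from fuzzywuzzy import fuzz, process

choices = [row[1] for row in a3[1:]]       # build once, outside the loop over zz1
best = process.extractOne(num, choices, scorer=fuzz.ratio, score_cutoff=60)
if best is not None:                       # extractOne returns (match, score), or None below the cutoff
    op.append(best[0])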

Related

loop on long sequences in python with minimum time complexity

I have two DNA sequences (long sequences) that I want to iterate over.
I tried a traditional for loop, but it takes linear time, which is too slow for sequences this long; I want to find a way to loop over these two sequences and compare their characters without linear time. Is there any way to do that?
Update: I want to compare each character of the two sequences.
Update 2: this is the code I tried
for i in range(0, len(seqA)):
    if seqA[i] == seqB[i]:
        print("similar")
    else:
        print("not similar")
and this is a sample of the DNA sequences I want to compare:
seqA = "AT-AC-TCT-GG--TTTCT--CT----TCA---G-A-T--C-G--C--AT-A----AATC-T-T--T-CG-CCTT-T-T---A---C--TA-A--A---G-ATTTCCGT-GGAG-AGG-A-AC---AACTCT-G-AG-T--CT---TA--AC-CCA---ATT-----T--T-TTG-AG--CCTTGCCTT-GGCAA-GGCT--A---"
seqB = "ATCGCTTCTCGGCCTTT-TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAAT-ATCTGATACGTCC-TCTATCCGAGGACAATATATTAAATGGATTT---TTGGAGCAGGGAGA-TGGAA---TAGGAGCTTGCTCCGT-CCACTCCACGCA-TCGACCTGGTATTGCAGTACC-T-CC--AGG-AACGG-TGCACCC"
If you want a sequence that tells you, for each position, whether the two are equal at that point (assuming they are equally long):
result = "".join(
    "1" if a == seqB[i] else "0"
    for (i, a) in enumerate(seqA)
)
Not sure how useful result is. What if both sequences are similar, but one of them has deleted a letter? By this measure the sequences will then be very dissimilar.
The Levenshtein edit distance captures the intuition of similarity better, but is much more costly to compute.
https://en.wikipedia.org/wiki/Levenshtein_distance
You could chop up the sequences in chunks of 100 letters, and perform Levenshtein on each pair of chunks.
This library has worked well for me, working on ancient texts, not DNA (I did billions of pairs of chunks in a few hours):
https://github.com/ztane/python-Levenshtein
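A minimal sketch of that chunked comparison, assuming the python-Levenshtein package is installed (it exposes a C-implemented Levenshtein.distance):
import Levenshtein

def chunked_distances(seq_a, seq_b, size=100):
    """Edit distance per chunk of `size` letters; lower means more similar."""
    return [
        Levenshtein.distance(seq_a[i:i + size], seq_b[i:i + size])
        for i in range(0, min(len(seq_a), len(seq_b)), size)
    ]

print(chunked_distances(seqA, seqB))       # one distance per chunk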

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter whose text has been digitized via OCR.
There is an array of addresses whose elements are dictionaries.
Each element, which is unique, contains ID, Name, Street, ZIP Code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidate strings from the list that contains the addresses.
The text is 750 words long. We reduce the number of words with an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words shorter than 5 characters and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (Name, Street, Zip code and City) and the filtered letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800,000 times.
I have used Python with moderate success. Matches have correctly been found. However, the algorithm takes a long time to process a lot of letters (up to 50 hrs per 1500 letters). List comprehensions have been applied. Is there a way to correctly (and not unnecessarily) implement multithreading? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks, but I do not know how much time it will take to process a lot of documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
       """compare the similarity of each word in the letter with the addressee"""
       ratios = []
       for l in letter:
           for a in addressee.items():
               ratios.append(int(100 * SequenceMatcher(None, a, l).ratio()))  # using ints for faster arithmetic
       return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5])  # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressees with the max matching ratio
>> # if only one is greater than the others -> Done
>> # if more than one, but fewer than 3, are equal -> Interactive Prompt -> Done
>> # else -> mark as not sortable -> Done.
I expected faster processing for each document (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes with quick_ratio() or real_quick_ratio() instead of ratio().
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.items():
    s = SequenceMatcher()
    s.set_seq2(a)
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
But a better solution would be something like what @J_H describes.
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version within a limited edit distance appears. Sort the input word's letters and probe the map for the first entry greater than or equal to that key. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
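A rough sketch of that lookup scheme, using only the standard library (bisect for the ordered probe, difflib for scoring the short candidate list); the three dictionary words here are just placeholders:
import bisect
from collections import defaultdict
from difflib import SequenceMatcher

# Pre-process: map each dictionary word to its sorted-letter key.
key_to_words = defaultdict(list)
for word in ["Stockholm", "Boulevard", "Taxation"]:   # your real address words go here
    key_to_words["".join(sorted(word.lower()))].append(word)
sorted_keys = sorted(key_to_words)

def candidates_for(raw):
    """Probe the ordered keys around the input's own key and collect nearby words."""
    probe = "".join(sorted(raw.lower()))
    i = bisect.bisect_left(sorted_keys, probe)
    found = []
    for j in (i - 1, i, i + 1):                       # preceding entry plus 1st and 2nd entries >= probe
        if 0 <= j < len(sorted_keys):
            found.extend(key_to_words[sorted_keys[j]])
    return found

def best_match(raw):
    """Score the (very short) candidate list and keep the best one."""
    cands = candidates_for(raw)
    if not cands:
        return None
    return max(cands, key=lambda w: SequenceMatcher(None, w.lower(), raw.lower()).ratio())

print(best_match("Stokcholm"))   # a transposition typo sorts to the exact same key -> "Stockholm"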
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.

Optimizing random sampling for scaling permutations

I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time of my algorithm.
2) My next problem, which is that one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
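For reference, that escalation might look roughly like this as a sketch; condition_met and sample_once are hypothetical stand-ins for the real pipeline steps described below:
def permute_with_escalation(variant_list, options_tables, variant_to_bins, use_num):
    """Start at 1,000 nulls per variant and scale up tenfold until the condition holds."""
    for num_permutations in (1000, 10000, 100000, 1000000):
        names, p_values = sample_once(variant_list, options_tables,
                                      variant_to_bins, num_permutations, use_num)
        if condition_met(p_values):           # hypothetical stopping test
            break
    return names, p_values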
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[
    random_sampling_for_variant(variant, options_tables, variant_to_bins,
                                num_permutations, use_num)
    for variant in variant_list
]))
Basically all this is doing is calling my random_sampling_for_variant function on each variant in my variant list, and throwing the two outputs from that function into a list of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations > number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return (temp_names_df, temp_p_values_df)
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows in there determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is about three-fifths that of the loop. Any feedback is appreciated.

Find K-most longest common suffix in a set of strings

I want to find the most common long suffixes in a set of strings, to detect potentially important morphemes for my natural language processing project.
Given a frequency K >= 2, find the K-most common longest suffixes in a list of strings S1, S2, S3 ... SN.
To simplify the problem, here are some examples:
Input1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
Output1:
["man","eman","woman"]
Explanation1:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
It would be appreciated if the output also tracked the frequency of each common suffix
{"man":4,"eman":2,"woman":2}
It is not guaranteed that every word shares a common suffix of length at least one; see the example below.
Input2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
Output2:
["man","eman","woman","ple","apple"]
Explanation2:
"man" occur 4 times, "eman" occur 2 times,"woman" occur 2 times
"ple" occur 3 times, "apple" occur 2 times
Is there any efficient algorithm to solve this problem?
I have looked at the Suffix Tree and Generalised Suffix Tree algorithms, but they don't seem to fit this problem well.
(BTW, I am working on a Chinese NLP project, which means there are far more characters than English has.)
If you want to be really efficient this paper could help. The algorithm gives you the most frequent (longest) substrings from a set of strings in O(n log n). The basic idea is the following:
Concatenate all strings with whitespace between them (these are used as separators later) in order to form a single long string
Build a suffix array SA over the long string
For each pair of adjacent entries in the suffix array, check how many of the starting characters of the two suffixes the entries refer to are equal (you should stop when you reach a whitespace). Put the count of equal characters in an additional array (called LCP). Each entry LCP[i] refers to the count of equal characters of the strings at positions SA[i-1] and SA[i] in the long string.
Go over the LCP array and find successive entries with a count larger than K.
Example from the paper for 3 strings:
s1 = bbabab
s2 = abacac
s3 = bbaaa
One idea would be to reverse the strings and then build a trie over them.
A simple trick from the book.
Once the common prefixes are found in the trie, it is just a matter of reversing them again.
E.g. the result from the trie would be
[nam, name, ...]
Reversing these gives you the final answer:
[man, eman, ...]
Time complexity increases by O(n) for the reversals.
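A small sketch of the reverse-and-count idea; rather than an explicit trie node structure, it uses a flat Counter of reversed prefixes, which is equivalent to counting every suffix (note that a plain count like this also treats "man" as a suffix of "woman", so the totals can differ slightly from the counts in the question):
from collections import Counter

def common_suffixes(words, k):
    """Count every suffix of every word and keep those occurring at least k times."""
    counts = Counter()
    for word in words:
        reversed_word = word[::-1]
        for i in range(1, len(reversed_word) + 1):
            counts[reversed_word[:i]] += 1          # a reversed prefix is a suffix
    return {prefix[::-1]: c for prefix, c in counts.items() if c >= k}

S = ["fireman", "woman", "businessman", "policeman", "businesswoman"]
print(common_suffixes(S, 2))   # maps each suffix seen at least twice to its count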

Python collections Counter with large data

I have two text files both consisting of approximately 700,000 lines.
The second file consists of responses to the statements in the first file, on corresponding lines.
I need to calculate Fisher's Exact Score for each word pair that appears on matching lines.
For example, if the nth lines in the files are
how are you
and
fine thanx
then I need to calculate Fisher's score for (how,fine), (how,thanx), (are,fine), (are,thanx), (you,fine), (you,thanx).
In order to calculate Fisher's Exact Score, I used the collections module's Counter to count the number of appearances of each word, and their co-appearances throughout the two files, as in:
with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
for line1, line2 in itertools.izip(f1, f2):
words1 = list(set(list(find_words(line1))))
words2 = list(set(list(find_words(line2))))
counts1.update(words1)
counts2.update(words2)
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
then I calculate Fisher's exact score for each pair using the scipy module:
from scipy import stats

def calculateFisher(s, t):
    sa = counts1[s]
    ta = counts2[t]
    st = counts_pair[s, t]
    snt = sa - st
    nst = ta - st
    nsnt = n - sa - ta + st
    oddsratio, pvalue = stats.fisher_exact([[st, snt], [nst, nsnt]])
    return pvalue
This works fast and fine for small text files, but since my files contain 700,000 lines each, I think the Counter gets too large to retrieve values quickly, and this becomes very, very slow.
(Assuming 10 words per sentence, counts_pair would have (10^2) * 700,000 = 70,000,000 entries.)
It would take tens of days to finish the computation for all word pairs in the files.
What would be the smart workaround for this?
I would greatly appreciate your help.
How exactly are you calling the calculateFisher function? Your counts_pair will not have 70 million entries: a lot of word pairs will be seen more than once, so seventy million is the sum of their counts, not the number of keys. You should be only calculating the exact test for pairs that do co-occur, and the best place to find those is in counts_pair. But that means that you can just iterate over it; and if you do, you never have to look anything up in counts_pair:
for (s, t), count in counts_pair.iteritems():
    sa = counts1[s]
    ta = counts2[t]
    st = count
    # Continue with Fisher's exact calculation
I've factored out the calculate_fisher function for clarity; I hope you get the idea. So if dictionary look-ups were what's slowing you down, this will save you a whole lot of them. If not, ... do some profiling and let us know what's really going on.
But note that simply looking up keys in a huge dictionary shouldn't slow things down too much. However, "retrieving values quickly" will be difficult if your program must swap most of its data to disk. Do you have enough memory in your computer to hold the three counters simultaneously? Does the first loop complete in a reasonable amount of time? Find the bottleneck and you'll know more about what needs fixing.
Edit: From your comment it sounds like you are calculating Fisher's exact score over and over during a subsequent step of text processing. Why do that? Break up your program into two steps: First, calculate all word pair scores as I describe. Write each pair and score out into a file as you calculate it. When that's done, use a separate script to read them back in (now the memory contains nothing else but this one large dictionary of pairs and Fisher's exact scores), and write away. You should do that anyway: if it takes you ten days just to get the scores (and you still haven't given us any details on what's slow, and why), get started and in ten days you'll have them forever, to use whenever you wish.
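A minimal sketch of that first step, reusing the iteration from the snippet above and writing each pair and its p-value out as it is computed (the file name is just a placeholder):
import csv
from scipy import stats

with open("fisher_scores.csv", "wb") as out:      # binary mode for Python 2's csv module
    writer = csv.writer(out)
    for (s, t), st in counts_pair.iteritems():
        sa, ta = counts1[s], counts2[t]
        oddsratio, pvalue = stats.fisher_exact(
            [[st, sa - st], [ta - st, n - sa - ta + st]])
        writer.writerow([s, t, pvalue])
A separate script can then read the finished file back in whenever the scores are needed, without recomputing anything.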
I did a quick experiment, and a python process with a list of a million ((word, word), count) tuples takes just 300MB (on OS X, but the data structures should be about the same size on Windows). If you have 10 million distinct word pairs, you can expect it to take about 2.5 GB of RAM. I doubt you'll have even this many word pairs (but check!). So if you've got 4GB of RAM and you're not doing anything wrong that you haven't told us about, you should be all right. Otherwise, YMMV.
I think that your bottleneck is in how you manipulate the data structures other than the counters.
words1 = list(set(list(find_words(line1)))) creates a list from a set from a list from the result of find_words. Each of these operations requires allocating memory to hold all of your objects, and copying. Worse still, if the type returned by find_words does not include a __len__ method, the resulting list will have to grow and be recopied as it iterates.
I'm assuming that all you need is an iterable of unique words in order to update your counters, for which set will be perfectly sufficient.
for line1, line2 in itertools.izip(f1, f2):
    words1 = set(find_words(line1))  # words1 now holds the unique words from line1
    words2 = set(find_words(line2))  # words2 now holds the unique words from line2
    counts1.update(words1)           # counts1 increments words from line1 (once per word)
    counts2.update(words2)           # counts2 increments words from line2 (once per word)
    counts_pair.update(itertools.product(words1, words2))
Note that you don't need to change the output of itertools.product that is passed to counts_pair as there are no repeated elements in words1 or words2, so the Cartesian product will not have any repeated elements.
Sounds like you need to generate the cross-products lazily - a Counter with 70 million elements will take a lot of RAM and suffer from cache misses on virtually every access.
So how about instead save a dict mapping a "file 1" word to a list of sets of corresponding "file 2" words?
Initial:
word_to_sets = collections.defaultdict(list)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_sets[w1].append(words2)
Then in your Fisher function, replace this:
st = counts_pair[s, t]
with:
st = sum(t in w2set for w2set in word_to_sets.get(s, []))
That's as lazy as I can get - the cross-products are never computed at all ;-)
EDIT Or map a "list 1" word to its own Counter:
Initial:
word_to_counter = collections.defaultdict(collections.Counter)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_counter[w1].update(words2)
In Fisher function:
st = word_to_counter[s][t]
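Putting the EDIT variant together, a sketch of how the reading loop and the Fisher function end up looking (find_words, counts1, counts2 and n as in the question):
import collections
import itertools
from scipy import stats

word_to_counter = collections.defaultdict(collections.Counter)

with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = set(find_words(line1))
        words2 = set(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        for w1 in words1:
            word_to_counter[w1].update(words2)    # one small Counter per "file 1" word

def calculateFisher(s, t):
    sa = counts1[s]
    ta = counts2[t]
    st = word_to_counter[s][t]                    # replaces counts_pair[s, t]
    oddsratio, pvalue = stats.fisher_exact(
        [[st, sa - st], [ta - st, n - sa - ta + st]])
    return pvalue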
