I have a DataFrame (derived from a CSV file with various columns) with 172033 rows. I've created a custom indexing function that blocks pairs of records that don't have similar 'name' attributes. The problem is the efficiency of the algorithm: just getting to the 10th iteration takes about a minute, so indexing the whole dataset would take far too long. How can I make my algorithm more efficient?
class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1 = []
        indici2 = []
        for i in range(0, 173033):
            if i % 2 == 0:
                print(i)  # keeps track of the iteration
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
                    indici1.append(i)
                    indici2.append(j)
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
I want to obtain a MultiIndex object, which is essentially an array of tuples containing the indexes of the pairs of records that are similar enough not to be blocked.
[MultiIndex([( 0, 0),
( 0, 22159),
( 0, 67902),
( 0, 67903),
( 1, 1),
( 1, 1473),
( 1, 5980),
( 1, 123347),
( 2, 2),
...
Here's the code for the similarity function:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Here's an example of the dataframe I have as input:
name
0 Amazon
1 Walmart
2 Apple
3 Amazon.com
4 Walmart Inc.
I would like the resulting MultiIndex to contain tuple links between 0 and 3, 1 and 4 and all the repetitions (0 and 0, 1 and 1 etc.)
You are using the .append method of list; according to the PythonWiki that method is O(1), but "individual actions may take surprisingly long, depending on the history of the container". You might use collections.deque, which does not have such quirks: just add import collections and do
indici1=collections.deque()
indici2=collections.deque()
...
indici = [list(indici1), list(indici2)]
If that does not help enough, you would need to share the similar function so it can be examined for possible improvements.
As others have pointed out, the solution to your problem requires O(N^2) running time, which means it won't scale well for very large datasets. Nonetheless, I think there's still a lot of room for improvement.
Here are some strategies you can use to speed up your code:
If your dataset contains many duplicate name values, you can use "memoization" to avoid re-computing the similar score for duplicate name pairs. Of course, caching all 172k^2 pairs would be devastatingly expensive, but if the data is pre-sorted by name, then lru_cache with 172k items should work just fine.
Looking at the difflib documentation, it appears that you have the option of quickly filtering out "obvious" mismatches. If you expect most pairs to be "easy" to eliminate from consideration, then it makes sense to first call SequenceMatcher.quick_ratio() (or even real_quick_ratio()), followed by ratio() only if necessary.
There will be some overhead in the ordinary control flow.
Calling df.loc many times in a for-loop might be a bit slow in comparison to simple iteration.
You can use itertools.combinations to avoid writing a nested for-loop yourself.
BTW, tqdm provides a convenient progress bar, which will give a better indication of true progress than the print statements in your code.
Lastly, I saw no need for the df_b parameter in your function above, so I didn't include it in the code below. Here's the full solution:
import math

import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm

@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() <= 0.5:
        return False
    return matcher.ratio() > 0.5

def link_index(df):
    # We initialize the index result pairs with [(0,0), (1,1), (2,2), ...]
    # because they are trivially "linked" and your problem statement
    # says you want them in the results.
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()

    # Sort the names so that our lru_cache is effective,
    # even though it is limited to 173k entries.
    name_items = df['name'].sort_values().items()
    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(df), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)

    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
    return links.sortlevel([0, 1])[0]
Quick Test:
names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
df = pd.DataFrame({'name': names})
link_index(df)
Output:
(MultiIndex([(0, 0),
(0, 3),
(1, 1),
(1, 4),
(2, 2),
(3, 3),
(4, 4)],
names=['first', 'second']),
array([0, 5, 1, 6, 2, 3, 4]))
Let me know if that speeds things up on your actual data!
Let's set some realistic expectations.
Your original estimate was ~1 minute for 10 "iterations". That implies the total time would have been ~6 days:
print(math.comb(172033, 2) / (10*172033) / 60 / 24)
On the other hand, merely iterating through the full set of i,j combinations and doing absolutely nothing with them would take ~45 minutes on my machine. See for yourself:
sum(1 for _ in tqdm(combinations(np.arange(172033), 2), total=math.comb(172033, 2)))
So the real solution will take longer than that. Now you've got some bounds on what the optimal solution will require: Somewhere between ~1 hour and ~6 days. Hopefully it's closer to the former!
We asked you to reveal the similar() function but you've declined to do so.
You wrote
for i in range(0, 173033):
    ...
    for j in range(i, 173033):
        if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
So you plan to call that mysterious similar 29_940_419_089 (30 billion) times.
I'm guessing that's going to take some little while, maybe two weeks?
We describe this as O(n^2) quadratic performance.
Here is the way out.
First, sort your dataset. It will only cost O(n log n), cheap!
Next, find something in the similar loss function that allows you to
simplify the problem, or design a new related loss function which allows that.
Spoiler alert -- we can only do comparisons against local neighbors,
not against far flung entries more than 100,000 positions away.
I will focus on the five example words.
It seems likely that your mysterious function reports
a large value for similar("Walmart", "Walmart Inc.")
and a small value when comparing those strings against "Apple".
So let's adopt a new cost function rule.
If the initial characters of the two strings are not identical,
the similarity is immediately reported as 0.
So now we're faced with comparing Apple against two Amazons,
and Walmart with itself.
We have partitioned the problem into "A" and "W" subsets.
The quadratic monster is starting to die.
Minor side note: Since your similarity function likely is
symmetric, it suffices to examine just half the square,
where i < j.
A practical implementation for a dataset of this size
is going to need to be more aggressive, perhaps insisting
that initial 2 or 3 letters be identical.
Perhaps you can run a hyphenation algorithm and look
at identical syllables.
Or use Metaphone
or Soundex.
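To make the blocking idea concrete, here is a minimal sketch (mine, not part of this answer's plan) that groups names by their first two characters and only compares pairs inside each block; the 2-character prefix, the 0.5 threshold and the name prefix_block_links are assumptions for illustration:

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import pandas as pd

def prefix_block_links(df, prefix_len=2, threshold=0.5):
    # hypothetical helper: block on the first `prefix_len` characters of 'name'
    blocks = defaultdict(list)
    for idx, name in df['name'].items():
        blocks[name[:prefix_len].lower()].append((idx, name))
    first, second = list(df.index), list(df.index)   # keep the trivial (i, i) pairs
    for members in blocks.values():
        for (i, a), (j, b) in combinations(members, 2):
            if SequenceMatcher(None, a, b).ratio() > threshold:
                first.append(i)
                second.append(j)
    return pd.MultiIndex.from_arrays([first, second], names=('first', 'second'))

With the five example names, "Amazon" and "Amazon.com" fall into the "am" block and "Walmart" and "Walmart Inc." into the "wa" block, so only those pairs ever reach the similarity function.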
The simplest possible thing you could do is define a
window W of maybe 10. When examining the sorted dataset,
we arbitrarily declare that similarity between i, j entries
shall be zero if abs(i - j) > W.
This may impact accuracy, but it's great for performance:
it lets you prune the search space before you even call the function.
We went from O(n^2) to O(n) linear.
Perfect accuracy is hardly relevant if you never wait
long enough for the code to produce an answer.
Use these ideas to code up a solution, and
let us know
how you resolved the details.
EDIT
@Pierre D. advocates the use of
locality sensitive hashing
to avoid lost revenue due to {prefix1}{entity} and {prefix2}{entity}
being far apart in a sorted list yet close together IRL and in hash space.
Metaphone is one technique for inducing LSH collisions,
but it seldom has any effect on initial letter / sort order.
Das, et al.,
describe a MinHash technique in
"Google News Personalization: Scalable Online Collaborative Filtering".
Number of syllables in the business names you're examining
will be significantly less than size of click-sets seen by Google.
Hashes give binary results of hit or miss,
rather than a smooth ranking or distance metric.
The need to promote neighbor collisions,
while steering clear of the birthday paradox,
makes it difficult to recommend tuning parameters
absent a snapshot of the production dataset.
Still, it is a technique you should keep in mind.
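If you want to experiment with MinHash, here is a rough, from-scratch sketch over character 3-grams; every parameter here (3-grams, 32 salted hashes, 8 bands) is an illustrative guess rather than a tuned value:

import hashlib
from collections import defaultdict

def shingles(name, k=3):
    s = name.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash_signature(name, num_perm=32):
    # one "permutation" per salt: keep the minimum hash of any shingle
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest(), 16)
            for sh in shingles(name)))
    return sig

def candidate_pairs(names, num_perm=32, bands=8):
    # names that agree on every hash inside at least one band become candidates
    rows = num_perm // bands
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash_signature(name, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        for i in members:
            for j in members:
                if i < j:
                    pairs.add((i, j))
    return pairs

Only the candidate pairs would then be passed to the expensive similarity function, which is the whole point of the LSH step.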
Asking spacy to do a
named entity recognition
pre-processing pass over your data might also be
worth your while, in the hopes of normalizing
or discarding any troublesome "noise" prefixes.
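If you try the spaCy route, a minimal sketch might look like the following; it assumes the en_core_web_sm model is installed and that ORG entities are what you want to keep, both of which you would need to verify on your data:

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def normalize_name(raw):
    # keep the first ORG entity spaCy finds, otherwise fall back to the raw string
    doc = nlp(raw)
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    return orgs[0] if orgs else raw

# df['name'] = df['name'].map(normalize_name)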
In the end, this is what I came up with. After sorting the dataset alphabetically by the 'name' attribute, I use this custom index: every row in the dataset is compared to its neighbours within a range of 100 rows.
class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        t0 = time.time()
        indici1 = []
        indici2 = []
        for i in range(1, 173034):
            if i % 500 == 0:
                print(i)
            if i < 100:
                n = i
                for j in range((i - (n - 1)), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            elif i > 172932:
                n = 173033 - i
                for j in range((i - 100), (i + n)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            else:
                for j in range((i - 100), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
        indici = [indici1, indici2]
        t1 = time.time()
        print(t1 - t0)
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
Indexing the dataset takes about 20 minutes on my machine which is great compared to before. Thanks everyone for the help!
I'll be looking into locality sensitive hashing suggested by Pierre D.
Related
I have about 95,000,000 permutations to check.
I have 8 lists of varying length, each string identifies properties (a-k) defined in an excel sheet.
e.g
bcdgj
has properties b, c, d, g and j
I need to find just one permutation that contains at least 3 of every property and then match those properties to the data in the spreadsheet
I have made this script (my first attempt at using python)
import numpy
import itertools
for x in itertools.product(['abfhj','bcdgj','fghij','abcj','bdgk','abgi','cdei','cdgi','dgik','aghi','abgh','bfhk'],['cdei','bcdgj','abcgi','abcj','abfj','bdfj','cdgi','bhjk','bdgk','dgik'],['afhk','cdgik','cegik','bdgi','cgij','cdei','bcgi','abgh'],['fhjk','bdgij','cgij','abk','ajk','bdk','cik','cdk','cei','fgj'],['abe','abcf','afh','cdi','afj','cdg','abi','cei','cgk','ceg','cgi'],['cdgi','bcgj','bcgi','bcdg','abfh','bdhi','bdgi','bdk','fhk','bei','beg','fgi','abf','abc','egi'],['bcdgik','cegik','chik','afhj','abcj','abfj'],['ceg','bcfg','cgi','bdg','afj','cgj','fhk','cfk','dgk','bcj']):
    gear = ''.join(x)
    count_a = gear.count('a')
    count_b = gear.count('b')
    count_c = gear.count('c')
    count_d = gear.count('d')
    count_e = gear.count('e')
    count_f = gear.count('f')
    count_g = gear.count('g')
    count_h = gear.count('h')
    count_i = gear.count('i')
    count_j = gear.count('j')
    count_k = gear.count('k')
    score_a = numpy.clip(count_a, 0, 3)
    score_b = numpy.clip(count_b, 0, 3)
    score_c = numpy.clip(count_c, 0, 3)
    score_d = numpy.clip(count_d, 0, 3)
    score_e = numpy.clip(count_e, 0, 3)
    score_f = numpy.clip(count_f, 0, 3)
    score_g = numpy.clip(count_g, 0, 3)
    score_h = numpy.clip(count_h, 0, 3)
    score_i = numpy.clip(count_i, 0, 3)
    score_j = numpy.clip(count_j, 0, 3)
    score_k = numpy.clip(count_k, 0, 3)
    rating = score_a + score_b + score_c + score_d + score_e + score_f + score_g + score_h + score_i + score_j + score_k
    if rating == 33:
        print(x)
        print(rating)
I've adjusted the rating requirement to test that it's working; it is, but it's going to take a while to crunch through 95,000,000 permutations. Anyone have any advice for getting it to run faster?
I think I've already reduced the number of values in each list as much as I can, the excel sheet the data comes from has several hundred entries per list and I've managed to reduce it to 6-12 per list.
Python is not designed for writing computationally intensive pure-Python code; it is meant to be used as glue code. The intensive part should be vectorized, that is, optimized in compiled native languages like C. This is especially true with the default implementation, CPython, which is an interpreter: any function call is fairly expensive (on the order of 100 ns per gear.count call).
Still, there are many sources of slowdown that can be avoided. First of all, strings are Unicode-based, and Unicode strings are slow to process (because of the non-trivial encoding and the variable size). Your code creates a new string object in every iteration of the loop, and creating new objects is expensive. The problem is that Python strings are immutable, so there is no way to avoid creating new strings short of not using strings at all. The same applies to the NumPy operations: NumPy is not designed to operate on very small arrays, so it introduces significant overhead. The string is rebuilt from scratch with ''.join(...) even though only the last part changes between iterations, and the same is true for counting characters: you only need to recompute the part that changes from one iteration to the next. Moreover, there is no need for numpy.clip since the counts cannot be negative: you can replace it with score_xxx = min(count_xxx, 3). Note that this work can be executed in parallel (using multiprocessing in pure Python). That being said, rewriting this in C should be many orders of magnitude faster if one pays attention to the aforementioned points.
If you are bound to Python, you can use a just-in-time compiler like Numba to do that. However, Numba does not support strings well. That is not much of a problem, since we should not be using strings here for performance reasons anyway: the strings can be translated to ASCII-based integer arrays, and the itertools generator can be replaced with basic loops.
One way to do this efficiently in Numba is to: 1. split the Cartesian product into two parts and compute two fairly large arrays with the counts (using basic loops); 2. then compute the Cartesian product of the two groups (using two nested loops).
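As a rough illustration of that split (using plain NumPy instead of Numba, with hypothetical names half_counts and find_matches), you could precompute the letter-count vectors for each half of the product and then combine them with a vectorized clip-and-sum:

import numpy as np
from itertools import product

LETTERS = "abcdefghijk"

def count_vector(s):
    # 11-dim vector of letter counts for one property string
    return np.array([s.count(ch) for ch in LETTERS], dtype=np.int64)

def half_counts(lists):
    # every combination of one string per list, with its summed count vector
    choices = list(product(*lists))
    counts = np.array([sum(count_vector(s) for s in choice) for choice in choices])
    return choices, counts

def find_matches(data, target=3):
    # `data` is the 8-tuple of lists from the question; split it 4 + 4
    left_choices, left_counts = half_counts(data[:4])    # roughly 9,600 rows
    right_choices, right_counts = half_counts(data[4:])  # roughly 9,900 rows
    best = target * len(LETTERS)                         # 33
    for i, row in enumerate(left_counts):
        # clip each letter count at `target` and sum; 33 means every letter
        # appears at least 3 times in the combined selection
        ratings = np.minimum(row + right_counts, target).sum(axis=1)
        for j in np.flatnonzero(ratings == best):
            yield left_choices[i] + right_choices[j]

# for combo in find_matches(data):   # `data` as defined in the answer below
#     print(combo)
#     break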
from itertools import groupby, product
data = (
['abfhj','bcdgj','fghij','abcj','bdgk','abgi','cdei','cdgi','dgik','aghi','abgh','bfhk'],
['cdei','bcdgj','abcgi','abcj','abfj','bdfj','cdgi','bhjk','bdgk','dgik'],
['afhk','cdgik','cegik','bdgi','cgij','cdei','bcgi','abgh'],
['fhjk','bdgij','cgij','abk','ajk','bdk','cik','cdk','cei','fgj'],
['abe','abcf','afh','cdi','afj','cdg','abi','cei','cgk','ceg','cgi'],
['cdgi','bcgj','bcgi','bcdg','abfh','bdhi','bdgi','bdk','fhk','bei','beg','fgi','abf','abc','egi'],
['bcdgik','cegik','chik','afhj','abcj','abfj'],
['ceg','bcfg','cgi','bdg','afj','cgj','fhk','cfk','dgk','bcj'],
)
REQ_PROPS = set("abcdefghijk")
for x in product(*data):
    permu = ''.join(x)

    # if the permutation does not contain all letters from a-k, skip it.
    if REQ_PROPS.difference(permu):
        continue

    prop_map = dict.fromkeys(permu)
    for prop, group in groupby(sorted(permu)):
        group_rating = len(tuple(group))
        # don't bother searching more props of this permutation if the current
        # property has a rating less than 3
        if group_rating < 3:
            break
        prop_map[prop] = group_rating

    # check if this permutation satisfies the requirement and exit if it does.
    if all(v is not None for v in prop_map.values()):
        print(x)
        print(prop_map)  # total rating of each property
        break
Not sure if this is what you are looking for, but the most obvious optimization would be to "short-circuit" the search by stopping as soon as a permutation does not have all the required properties, or as soon as one of a permutation's properties fails to have at least 3 instances. This is achieved by first sorting the permutation and then grouping it by property. After grouping, you check that each property's frequency is not less than 3. If the check does not fail for any property then you have your answer; otherwise, move on to the next permutation.
I ran the program using the example data provided, and it appears that there is no solution that contains all a-k properties. Maybe you need a bigger dataset.
I have tried to summarize the problem statement like this:
Given n, k and an array (a list) arr, where n = len(arr) and k is an integer in the range (1, n), inclusive.
For an array (or list) myList, the unfairness sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if mylist = [1, 2, 5, 5, 6], the unfairness sum (here labelled MUS) is computed as below. Please note that elements are considered unique by their index in the list, not by their values.
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
If you actually need to look at the problem statement, It's HERE
My Objective
Given n, k, arr (as described above), find the minimum unfairness sum out of all the unfairness sums of the possible sub-arrays, with the constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My first approach was this, where I used itertools.combinations to get all the possible combinations and statistics.variance to check the spread of the data (yeah, I know I'm a mess).
Before you see the code below: do you think variance and unfairness sum are perfectly related (I know they are strongly related), i.e. does the sub-array with minimum variance have to be the sub-array with the MUS?
You only have to check the LetMeDoIt(n, k, arr) function. If you need an MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1

    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0] - r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In Python chat, Kevin suggested I try a generator, which is memory efficient (it really is), but since a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n=50, k=8.
I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which takes me to my second approach - a better one(i should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb

def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr) / l
    return sum((i - m) ** 2 for i in arr) / l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i + k]
        var = myvar(sub)
        if variance is None or var < variance:
            variance = var
            min_variance_sub = sub

    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)

if __name__ == '__main__':
    MainApp()
This code works perfectly for n up to 1000 (maybe more), but terminates due to a timeout (5 seconds is the limit on the online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to take care of all the test cases in given time limits (5 sec) ? (problem was listed under algorithm & dynamic programming)
(for your reference you can have a look at
successful submissions (py3, py2, C++, java) on this problem by other candidates, so that you can
explain that approach for me and future visitors)
an editorial by the problem setter explaining how to approach the question
a solution code by problem setter himself (py2, C++).
Input data (test cases) and expected output
Edit 1:
For future visitors of this question, the conclusions I have so far are:
that variance and unfairness sum are not perfectly related (they are strongly related), which implies that among a lot of lists of integers, the list with minimum variance doesn't always have to be the list with the minimum unfairness sum. If you want to know why, I actually asked that as a separate question on Math Stack Exchange HERE, where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected)
As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because any sublist that includes at least one element that is not consecutive will necessarily have a higher unfairness sum.
For example, if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7], [3,7,10], [7,10,20], etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum because 10>7, and therefore all of its differences with the rest of the elements will be larger than those of 7.
The same goes for [1,7,10] (non-consecutive on the left side), as 1<3.
Given that, you only have to check the consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
import itertools

def myvar(array):  # despite the name, this computes the unfairness sum of the array
    return sum(abs(i[0] - i[1]) for i in itertools.combinations(array, 2))

def minsum(n, k, arr):
    arr = sorted(arr)                  # the list must be sorted first
    res = myvar(arr[:k])               # start from the first subarray instead of a huge constant
    for i in range(1, n - k + 1):      # every consecutive subarray of length k
        res = min(res, myvar(arr[i:i + k]))
    return res
I see this question still has no complete answer, so I will sketch a correct algorithm that will pass the judge. I will not write the code, in order to respect the purpose of the Hackerrank challenge and since we already have working solutions.
The original array must be sorted. This has a complexity of O(N log N).
At this point you only need to check consecutive sub-arrays, as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer.
The last passage, finding the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long sub-array. The mistake is recalculating this from scratch at every step, which takes O(k) and brings the complexity of this passage to O(k*N). Instead it can be done in O(1) per step, as the editorial you posted shows, including the mathematical formulae. It requires a prior initialization of a cumulative array after step 1 (done in O(N), with space complexity O(N) too).
It works but terminates due to time out for n<=10000.
(from comments on archer's answer)
To explain step 3, think about k = 100. You are scrolling through the N-long array, and on the first iteration you must calculate the US for the sub-array from element 0 to 99 as usual, requiring 100 passes. The next step only asks you to calculate the same for a sub-array that differs from the previous one by a single element, 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scrolling. Just figure the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.
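For future readers, here is a minimal sketch of that sliding-window idea; the rank-weighted initialization and the function name are mine, not taken from the editorial, and the array is sorted first as described in step 1:

def min_unfairness_sum(n, k, arr):
    a = sorted(arr)                                   # step 1: O(N log N)
    # Unfairness sum of the first window, computed once in O(k): in a sorted
    # window, the element of rank j contributes (2j - k + 1) * a[j].
    us = sum((2 * j - k + 1) * a[j] for j in range(k))
    window_sum = sum(a[:k])
    best = us
    for i in range(1, n - k + 1):
        old, new = a[i - 1], a[i + k - 1]
        kept = window_sum - old                       # sum of the k-1 elements kept
        # drop the old minimum and add the new maximum, each in O(1)
        us += ((k - 1) * new - kept) - (kept - (k - 1) * old)
        window_sum = kept + new
        best = min(best, us)
    return best

# min_unfairness_sum(5, 3, [1, 3, 7, 10, 20])  # -> 12, from the window [1, 3, 7]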
I am trying to find the best way to compare large sets of numerical sequences to other large sets, in order to rank them against each other. Maybe the following toy example clarifies the issue, where lists a, b, and c represent shingles of size 3 in a time series.
a = [(1,2,3),(2,3,4),(3,4,5)]
b = [(1,2,3),(2,3,4),(3,4,7),(4,7,8)]
c = [(1,2,3),(2,3,5)]
set_a, set_b, set_c = set(a), set(b), set(c)
jaccard_ab = float(len(set_a.intersection(set_b)))/float(len(set_a.union(set_b)))
jaccard_bc = float(len(set_b.intersection(set_c)))/float(len(set_b.union(set_c)))
jaccard_ac = float(len(set_a.intersection(set_c)))/float(len(set_a.union(set_c)))
The similarity among these sets is:
jaccard_ab, jaccard_bc, jaccard_ac
(0.4, 0.2, 0.25)
So in this example, we can see that set a and b are the most similar with a score of 0.4.
I am having a design problem:
1) Since each set will be composed of ~1000 shingles, do I gain speed by transforming every shingle into a unique hash and then comparing hashes?
2) Initially, I have over 10,000 sets to compare so I think I am much better off storing the shingles (or hashes, depending on answer to 1) in a database or pickling. Is this a good approach?
3) As a new set is added to my workflow, I need to rank it against all existing sets and display, let's say, the top 10 most similar. Is there a better approach than the one in the toy example?
1) Members of a set have to be hashable, so python is already computing hashes. Storing sets of hashes of items would be duplicated effort, so there's no need to do that.
2) The complexity of the set intersection and union is approximately linear. The Jaccard index isn't computationally expensive, and 10,000 sets isn't that many (about 50 million[1] computations). It will probably take an hour to compute your initial results, but it won't take days.
3) Once you have all of your combinations, ranking another set against your existing results means doing only 10,000 more comparisons. I can't think of a simpler way than that.
I'd say just do it.
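To make point 3 concrete, here is a small sketch of that ranking step; heapq.nlargest and the dict layout are my choices, not something from the question:

import heapq

def jaccard(A, B):
    return len(A & B) / len(A | B)

def top_matches(new_set, existing, n=10):
    """existing: hypothetical dict mapping a set id to its set of shingles."""
    scored = ((jaccard(new_set, s), set_id) for set_id, s in existing.items())
    return heapq.nlargest(n, scored)

# top_matches(set_c, {'a': set_a, 'b': set_b}) -> [(0.25, 'a'), (0.2, 'b')]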
If you want to go faster, then you should be able to use a multiprocessing approach fairly easily with this dataset. (Each computation is independent of the other ones, so they can all run in parallel).
Here's an example adapted from the concurrent.futures examples (Python3).
import concurrent.futures
from itertools import combinations

data = [
    {(1, 2, 3), (2, 3, 4), (3, 4, 5), ...},
    {(12, 13, 14), (15, 16, 17), ...},
    ...
]

def jaccard(A, B):
    return len(A & B) / len(A | B)

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(jaccard, *sets): sets
               for sets in combinations(data, 2)}

    for future in concurrent.futures.as_completed(futures):
        jaccard_index = future.result()
        print(jaccard_index)  # write output to a file or database here
[1]:
>>> from itertools import combinations
>>> print(sum(1 for i in combinations(range(10000), 2)))
49995000
1) This is done internally anyway when constructing the set().
2) I'm not sure you'll be able to stay with Python for your size of data set, so I'd suggest using some simple (text) format so it can be easily loaded, e.g. in C/C++. Do you need to store the shingles at all? What about generating them on the fly?
3) If you need all to all comparison for your initial data set, something like google-all-pairs or ppjoin will surely help. It works by reducing the candidate set for each comparison using predefined similarity threshold. You can modify the code to keep the index for further searches.
You should definitely consider utilizing multiple cores, as this problem is very well suited to that. You might consider PyPy, as I see a 2-3X speedup compared to Python 3 for large set comparisons. Then you might check out part 1: resemblance with the jaccard coefficient for a magic C++ implementation to get further speed-ups. This C++ / OpenMP solution is the fastest I have tested yet.
I am working with DNA sequence alignment, and I have a performance issue.
I need to create a dict that maps a word (a sequence of a set length) to a list of all words that are similar as decided by a separate function.
Right now, I am doing the following:
all_words_rdd = sc.parallelize([''.join(word) for word in itertools.product(all_letters, repeat=WORD_SIZE)], PARALLELISM)
all_similar_word_pairs_map = (all_words_rdd.cartesian(all_words_rdd)
                              .filter(lambda (word1, word2), scoring_matrix=scoring_matrix, threshold_value=threshold_value: areWordsSimilar((word1, word2), scoring_matrix, threshold_value))
                              .groupByKey()
                              .mapValues(set)
                              .collectAsMap())
Where areWordsSimilar obviously calculates whether the words reach a set similarity threshold.
However, this is horribly slow. It works fine with words of length 3, but once I go any higher it slows down exponentially (as you might expect). It also starts complaining about the task size being too big (again, not surprising).
I know the cartesian join is a really inefficient way to do this, but I'm not sure how to approach it otherwise.
I was thinking of starting with something like this:
all_words_rdd = (sc.parallelize(xrange(0, len(all_letters) ** WORD_SIZE))
                 .repartition(PARALLELISM)
                 ...
                 )
This would let me split the calculation across multiple nodes. However, how do I calculate this? I was thinking about doing something with bases and inferring the letter using the modulo operator (i.e. in base of len(all_letters), num % 2 = all_letters[0], num % 3 = all_letters[1], etc).
However, this sounds horribly complicated, so I was wondering if anybody had a better way.
Thanks in advance.
EDIT
I understand that I cannot reduce the exponential complexity of the problem, that is not my goal. My goal is to break up the complexity across multiple nodes of execution by having each node perform part of the calculation. However, to do this I need to be able to derive a DNA word from a number using some process.
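To make the EDIT's idea concrete, here is a minimal sketch that derives a DNA word from an integer by treating it as a base-len(alphabet) number; the function name is hypothetical:

def index_to_word(index, word_size, alphabet="ATGC"):
    # interpret `index` as a base-len(alphabet) number with `word_size` digits
    letters = []
    for _ in range(word_size):
        index, digit = divmod(index, len(alphabet))
        letters.append(alphabet[digit])
    return "".join(reversed(letters))

# index_to_word(0, 3) == "AAA"; index_to_word(63, 3) == "CCC"
# e.g. sc.parallelize(xrange(0, 4 ** WORD_SIZE)).map(lambda n: index_to_word(n, WORD_SIZE))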
Generally speaking, even without the driver-side code, this looks like a hopeless task: the size of the sequence set grows exponentially and you simply cannot win against that. Depending on how you plan to use this data, there is most likely a better approach out there.
If you still want to go with this, you can start by splitting the kmer generation between the driver and the workers:
from itertools import product

def extend_kmer(n, kmer="", alphabet="ATGC"):
    """
    >>> list(extend_kmer(2))[:4]
    ['AA', 'AT', 'AG', 'AC']
    """
    tails = product(alphabet, repeat=n)
    for tail in tails:
        yield kmer + "".join(tail)

def generate_kmers(k, seed_size, alphabet="ATGC"):
    """
    >>> kmers = generate_kmers(6, 3, "ATGC").collect()
    >>> len(kmers)
    4096
    >>> sorted(kmers)[0]
    'AAAAAA'
    """
    seed = sc.parallelize([x for x in extend_kmer(seed_size, "", alphabet)])
    return seed.flatMap(lambda kmer: extend_kmer(k - seed_size, kmer, alphabet))
k = ... # Integer
seed_size = ... # Integer <= k
kmers = generate_kmers(k, seed_size) # RDD kmers
The simplest optimization you can do when it comes to searching is to drop the cartesian join and use local generation:
from difflib import SequenceMatcher

def is_similar(x, y):
    """Dummy similarity check
    >>> is_similar("AAAAA", "AAAAT")
    True
    >>> is_similar("AAAAA", "TTTTTT")
    False
    """
    return SequenceMatcher(None, x, y).ratio() > 0.75

def find_similar(kmer, f=is_similar, alphabet="ATGC"):
    """
    >>> kmer, similar = find_similar("AAAAAA")
    >>> sorted(similar)[:5]
    ['AAAAAA', 'AAAAAC', 'AAAAAG', 'AAAAAT', 'AAAACA']
    """
    candidates = product(alphabet, repeat=len(kmer))
    return (kmer, {"".join(x) for x in candidates if f(kmer, x)})

similar_map = kmers.map(find_similar)
It is still an extremely naive approach, but it doesn't require expensive data shuffling.
The next thing you can try is to improve the search strategy. It can be done either locally, like above, or globally using joins.
In both cases you need a smarter approach than checking all possible kmers. The first thing that comes to mind is to use seed kmers taken from a given word. In local mode these can be used as a starting point for candidate generation; in global mode, as a join key (optionally combined with hashing).
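As a rough sketch of the global (join-based) variant, assuming the kmers RDD and is_similar from above and using a seed_size-character prefix as the join key (this is illustrative, not a tested implementation):

def similar_by_seed(kmers, seed_size, f=is_similar):
    # key every kmer by its seed prefix so only kmers sharing a seed are compared
    keyed = kmers.keyBy(lambda kmer: kmer[:seed_size])
    return (keyed.join(keyed)                             # (seed, (kmer1, kmer2))
                 .values()
                 .filter(lambda pair: f(pair[0], pair[1]))
                 .groupByKey()                            # group candidates per kmer
                 .mapValues(set))

Note that, by construction, this only finds similar pairs that share the same seed prefix, which is exactly the trade-off described above.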
I have the following need (in python):
generate all possible tuples of length 12 (could be more) containing either 0, 1 or 2 (basically, a ternary number with 12 digits)
filter these tuples according to specific criteria, culling those that are not good and keeping the ones I need.
As I had to deal with small lengths until now, the functional approach was neat and simple: a recursive function generates all possible tuples, then I cull them with a filter function. Now that I have a larger set, the generation step is taking too much time, much longer than needed as most of the paths in the solution tree will be culled later on, so I could skip their creation.
I have two solutions to solve this:
derecurse the generation into a loop, and apply the filter criteria on each new 12-digits entity
integrate the filtering in the recursive algorithm, so to prevent it stepping into paths that are already doomed.
My preference goes to 1 (seems easier) but I would like to hear your opinion, in particular with an eye towards how a functional programming style deals with such cases.
How about
import itertools
results = []
for x in itertools.product(range(3), repeat=12):
    if myfilter(x):
        results.append(x)
where myfilter does the selection. Here, for example, only allowing results with 10 or more 1's:
def myfilter(x):  # example filter, only take lists with 10 or more 1s
    return x.count(1) >= 10
That is, my suggestion is your option 1. For some cases it may be slower, because (depending on your criteria) you may generate many lists that you don't need, but it's much more general and very easy to code.
Edit: This approach also has a one-liner form, as suggested in the comments by hughdbrown:
results = [x for x in itertools.product(range(3), repeat=12) if myfilter(x)]
itertools has functionality for dealing with this. However, here is a (hardcoded) way of handling it with a generator:
T = (0,1,2)
GEN = ((a,b,c,d,e,f,g,h,i,j,k,l) for a in T for b in T for c in T for d in T for e in T for f in T for g in T for h in T for i in T for j in T for k in T for l in T)
for VAL in GEN:
    # Filter VAL
    print VAL
I'd implement an iterative binary adder or hamming code and run that way.
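One possible reading of that last suggestion, as a hedged sketch: step a 12-digit base-3 "odometer" iteratively instead of recursing, filtering each state as it is produced (the function name is mine):

def ternary_counter(length=12):
    digits = [0] * length
    while True:
        yield tuple(digits)
        pos = length - 1
        while pos >= 0 and digits[pos] == 2:   # roll 2s over to 0 and carry left
            digits[pos] = 0
            pos -= 1
        if pos < 0:                            # counter wrapped around: done
            return
        digits[pos] += 1

# results = [x for x in ternary_counter() if myfilter(x)]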