Optimized searching in Python against a list

Problem:
Given a list of n objects (n is on the order of 10^5), search for a given item very fast with a minimal space-time tradeoff. The current, unoptimized, prototype-quality solution takes too long and consumes too much RAM (so the optimization is not premature).
There is no primary key to sort against in the objects, but they can be sorted to a certain degree, as in the following example, where the first column is sorted.
o1 => f, g, h
o2 => f, g, i
o3 => f, j, k
o4 => k, j, m
To date, the solution has been nested filters:
filter(test1, filter(test2, filter(test3, the_list)))
But that has been slow, since it involves n * (n - 1) * (n - 2) operations, which approximates O(n^3) speed, and creates at least two extra lists of n references each.
As a note, it would be vastly preferable to have an in-place search.
I haven't found a standard library for handling this. What is the typical solution to this problem?

filter(test1, filter(test2, filter(test3, the_list)))
Firstly, this is O(n) time, not O(n^3) time. The times add, they don't multiply. The only way this could be worse is if test1/test2/test3 are doing something odd, in which case we should look at those.
If we assume each test takes 10 ms, then we have 10 ms * 3 * 10^5 = 50 minutes. If it really were n^3, then 10 ms * (10^5)^3 operations would work out to roughly 300,000 years. I'm pretty sure you are only at linear time; you just have a ton of data.
Replace filter with itertools.ifilter; it will avoid generating the intermediate lists. Instead, Python will pull one item out of the list at a time, pass it through the three tests, and give it to you if and only if it passes all of them. This avoids the memory requirement and will probably be faster as well.
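As a rough sketch of that lazy pipeline, with hypothetical stand-ins for test1/test2/test3 (in Python 3 the built-in filter is already lazy, so ifilter is unnecessary there):
from itertools import ifilter   # Python 2; in Python 3, filter is already lazy

# Hypothetical predicates standing in for the real test1/test2/test3.
def test1(o): return o[0] == 'f'
def test2(o): return o[1] == 'g'
def test3(o): return o[2] == 'h'

the_list = [('f', 'g', 'h'), ('f', 'g', 'i'), ('f', 'j', 'k'), ('k', 'j', 'm')]

# Each item is pulled through all three tests one at a time;
# no intermediate lists are built.
matches = ifilter(test1, ifilter(test2, ifilter(test3, the_list)))
results = list(matches)         # [('f', 'g', 'h')]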
You aren't going to be able to improve on O(n) time unless you use some indexing techniques. However, the applicability of indexing techniques depends on what you are doing inside the test1/test2/test3 functions. If you want help with that, show an example of those functions.
As others have noted, databases were designed to solve these problems. You can only make this faster by reimplementing, badly, what databases already do for you.

Concatenate the attribute values for each object to make unique keys. You may have to pad the attributes out to the same length to guarantee uniqueness. Construct a hash table to return the object that matches a key.
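A minimal sketch of that idea, assuming each object exposes its attribute values as a tuple of strings (the separator is arbitrary):
# Build a dict keyed on the joined attribute values.
# This assumes the combined key is unique, as described above.
index = {}
for obj in the_list:
    key = "|".join(obj)      # e.g. ('f', 'g', 'h') -> "f|g|h"
    index[key] = obj

# Constant-time lookup instead of scanning the whole list:
hit = index.get("f|g|h")     # the matching object, or None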

10^5 is not really that big a number of objects, even in-memory. littletable is a little module I wrote as an experiment for simulating queries, pivots, etc. using just Python dicts. One nice thing about littletable queries is that the result of any query or join is itself a new littletable Table. Indexes are kept as dicts of keys->table objects, and index keys can be defined to be unique or not.
I created a table of 140K objects with 3 single-letter keys, and then queried for a specific key. The time to build the table itself was the longest part; the indexing and querying were pretty fast.
from itertools import product
from littletable import Table, DataObject
import time

objects = Table()
alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphas += alphas.lower()

print "building table", time.time()
objects.insert_many(
    DataObject(k1=k1, k2=k2, k3=k3, created=time.time())
    for k1, k2, k3 in product(alphas.upper(), alphas, alphas)
)
print "table complete", time.time()
print len(objects)
print "indexing table", time.time()
for k in "k1 k2 k3".split():
    objects.create_index(k)
print "index complete", time.time()
print "get specific row", time.time()
matches = objects.query(k1="X", k2="k", k3="W")
for o in matches:
    print o
print time.time()
Prints:
building table 1309377011.63
table complete 1309377012.52
140608
indexing table 1309377012.52
index complete 1309377012.98
get specific row 1309377012.98
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377011.9960001}
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377012.4260001}
1309377013.0

It seems to me one typical solution would be to use a database query. Either SQL (raw or with some kind of ORM), or some kind of object database, maybe MongoDB?
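As a rough sketch of the SQL route using only the standard library (the table layout and column names are made up to match the example in the question):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (k1 TEXT, k2 TEXT, k3 TEXT)")
conn.execute("CREATE INDEX idx_objects_keys ON objects (k1, k2, k3)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [("f", "g", "h"), ("f", "g", "i"), ("f", "j", "k"), ("k", "j", "m")],
)

# The composite index makes this lookup fast even with 10^5 rows.
rows = conn.execute(
    "SELECT * FROM objects WHERE k1 = ? AND k2 = ? AND k3 = ?",
    ("f", "g", "h"),
).fetchall()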

If your data is in a CSV file, you could try sql2csv: https://sourceforge.net/projects/sql2csv/.
EDIT: Pardon my early-onset senility, I meant this project: https://github.com/ccoffey/sql4csv/wiki/Examples.

Related

How can I make this Indexing algorithm more efficient?

I've got a DataFrame (derived from a CSV file with various columns) with 172033 rows. I've created a custom indexing function that blocks pairs of records that don't have similar 'name' attributes. The problem is the efficiency of the algorithm: just getting to the 10th iteration takes about a minute, so indexing the whole dataset would take far too long. How can I make my algorithm more efficient?
class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1 = []
        indici2 = []
        for i in range(0, 173033):
            if i % 2 == 0:
                print(i)  # keeps track of the iteration
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
                    indici1.append(i)
                    indici2.append(j)
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
I want to obtain a MultiIndex object, which would be an array of tuples containing the indexes of the pairs of records that are similar enough not to be blocked.
[MultiIndex([( 0, 0),
( 0, 22159),
( 0, 67902),
( 0, 67903),
( 1, 1),
( 1, 1473),
( 1, 5980),
( 1, 123347),
( 2, 2),
...
Here's the code for the similarity function:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Here's an example of the dataframe I have as input:
name
0 Amazon
1 Walmart
2 Apple
3 Amazon.com
4 Walmart Inc.
I would like the resulting MultiIndex to contain tuple links between 0 and 3, 1 and 4 and all the repetitions (0 and 0, 1 and 1 etc.)
You are using the .append method of list; according to the Python Wiki that method is O(1), but "individual actions may take surprisingly long, depending on the history of the container". You might use collections.deque, which does not have such quirks; just add import collections and do
indici1=collections.deque()
indici2=collections.deque()
...
indici = [list(indici1), list(indici2)]
If that does not help enough, you would need to look at the similar function for possible improvements.
As others have pointed out, the solution to your problem requires O(N^2) running time, which means it won't scale well for very large datasets. Nonetheless, I think there's still a lot of room for improvement.
Here are some strategies you can use to speed up your code:
If your dataset contains many duplicate name values, you can use "memoization" to avoid re-computing the similar score for duplicate name pairs. Of course, caching all 172k^2 pairs would be devastatingly expensive, but if the data is pre-sorted by name, then lru_cache with 172k items should work just fine.
Looking at the difflib documentation, it appears that you have the option of quickly filtering out "obvious" mismatches. If you expect most pairs to be "easy" to eliminate from consideration, then it makes sense to first call SequenceMatcher.quick_ratio() (or even real_quick_ratio()), followed by ratio() only if necessary.
There will be some overhead in the ordinary control flow; in particular, calling df.loc many times in a for-loop is slow compared to simply iterating over the values.
You can use itertools.combinations to avoid writing a nested for-loop yourself.
BTW, tqdm provides a convenient progress bar, which will give a better indication of true progress than the print statements in your code.
Lastly, I saw no need for the df_b parameter in your function above, so I didn't include it in the code below. Here's the full solution:
import math
import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm

@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() <= 0.5:
        return False
    return matcher.ratio() > 0.5

def link_index(df):
    # We initialize the index result pairs with [(0,0), (1,1), (2,2), ...]
    # because they are trivially "linked" and your problem statement
    # says you want them in the results.
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()
    # Sort the names so that our lru_cache is effective,
    # even though it is limited to 173k entries.
    name_items = df['name'].sort_values().items()
    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(df), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)
    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
    return links.sortlevel([0, 1])[0]
Quick Test:
names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
df = pd.DataFrame({'name': names})
link_index(df)
Output:
(MultiIndex([(0, 0),
(0, 3),
(1, 1),
(1, 4),
(2, 2),
(3, 3),
(4, 4)],
names=['first', 'second']),
array([0, 5, 1, 6, 2, 3, 4]))
Let me know if that speeds things up on your actual data!
Let's set some realistic expectations.
Your original estimate was ~1 minute for 10 "iterations". That implies the total time would have been ~6 days:
print(math.comb(172033, 2) / (10*172033) / 60 / 24)
On the other hand, merely iterating through the full set of i,j combinations and doing absolutely nothing with them would take ~45 minutes on my machine. See for yourself:
sum(1 for _ in tqdm(combinations(np.arange(172033), 2), total=math.comb(172033, 2)))
So the real solution will take longer than that. Now you've got some bounds on what the optimal solution will require: Somewhere between ~1 hour and ~6 days. Hopefully it's closer to the former!
We asked you to reveal the similar() function but you've declined to do so.
You wrote
for i in range(0, 173033):
    ...
    for j in range(i, 173033):
        if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
So you plan to call that mysterious similar 29_940_419_089 (30 billion) times.
I'm guessing that's going to take some little while, maybe two weeks?
We describe this as O(n^2) quadratic performance.
Here is the way out. First, sort your dataset. It will only cost O(n log n), which is cheap. Next, find something in the similar loss function that allows you to simplify the problem, or design a new, related loss function which allows that. Spoiler alert -- we can only do comparisons against local neighbors, not against far-flung entries more than 100,000 positions away.
I will focus on the five example words. It seems likely that your mysterious function reports a large value for similar("Walmart", "Walmart Inc.") and a small value when comparing those strings against "Apple". So let's adopt a new cost function rule: if the initial characters of the two strings are not identical, the similarity is immediately reported as 0. Now we're faced with comparing Apple against two Amazons, and Walmart with itself. We have partitioned the problem into "A" and "W" subsets. The quadratic monster is starting to die.
Minor side note: since your similarity function is likely symmetric, it suffices to examine just half the square, where i < j.
A practical implementation for a dataset of this size is going to need to be more aggressive, perhaps insisting that the initial 2 or 3 letters be identical. Perhaps you can run a hyphenation algorithm and look at identical syllables. Or use Metaphone or Soundex.
The simplest possible thing you could do is define a window W of maybe 10. When examining the sorted dataset, we arbitrarily declare that the similarity between the i and j entries shall be zero if abs(i - j) > W. This may impact accuracy, but it's great for performance: it lets you prune the search space before you even call the function. We went from O(n^2) quadratic to O(n) linear. Perfect accuracy is hardly relevant if you never wait long enough for the code to produce an answer.
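A minimal sketch of that windowed scheme, assuming the names are already sorted and reusing the difflib ratio from the question (the window size and threshold are illustrative):
from difflib import SequenceMatcher

def windowed_pairs(sorted_names, W=10, threshold=0.5):
    pairs = []
    for i in range(len(sorted_names)):
        # compare only against neighbors at most W positions to the right
        for j in range(i, min(i + W + 1, len(sorted_names))):
            if SequenceMatcher(None, sorted_names[i], sorted_names[j]).ratio() > threshold:
                pairs.append((i, j))
    return pairs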
Use these ideas to code up a solution, and let us know how you resolved the details.
EDIT
@Pierre D. advocates the use of locality sensitive hashing to avoid lost revenue due to {prefix1}{entity} and {prefix2}{entity} being far apart in a sorted list yet close together IRL and in hash space. Metaphone is one technique for inducing LSH collisions, but it seldom has any effect on the initial letter or sort order. Das, et al. describe a MinHash technique in "Google News Personalization: Scalable Online Collaborative Filtering". The number of syllables in the business names you're examining will be significantly smaller than the size of the click-sets seen by Google.
Hashes give binary results of hit or miss, rather than a smooth ranking or distance metric. The need to promote neighbor collisions, while steering clear of the birthday paradox, makes it difficult to recommend tuning parameters absent a snapshot of the production dataset. Still, it is a technique you should keep in mind.
Asking spacy to do a named entity recognition pre-processing pass over your data might also be worth your while, in the hopes of normalizing or discarding any troublesome "noise" prefixes.
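For what it's worth, here is a rough, untuned sketch of the MinHash banding idea applied to character 3-grams; every parameter here (shingle size, number of hash functions, band layout, even the use of md5) is an illustrative assumption rather than a recommendation:
import hashlib
from collections import defaultdict

def shingles(name, k=3):
    # character k-grams of the lowercased name
    name = name.lower()
    return {name[i:i + k] for i in range(max(1, len(name) - k + 1))}

def minhash_signature(name, num_hashes=24):
    # one "hash function" per seed; each signature entry is the minimum hash of any shingle
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(("%d|%s" % (seed, s)).encode()).hexdigest(), 16)
            for s in shingles(name)))
    return sig

def candidate_pairs(names, bands=6, rows=4):
    # names whose signatures agree on at least one band land in the same bucket
    buckets = defaultdict(list)
    for idx, name in enumerate(names):
        sig = minhash_signature(name, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for bucket in buckets.values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                pairs.add((bucket[i], bucket[j]))
    return pairs   # only these pairs need the expensive similar() call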
In the end, this is what I came up with. After sorting the dataset alphabetically by the 'name' attribute, I use this custom index: every row in the dataset is compared to its neighbours within a range of 100 rows.
class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        t0 = time.time()
        indici1 = []
        indici2 = []
        for i in range(1, 173034):
            if i % 500 == 0:
                print(i)
            if i < 100:
                n = i
                for j in range((i - (n - 1)), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            elif i > 172932:
                n = 173033 - i
                for j in range((i - 100), (i + n)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            else:
                for j in range((i - 100), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
        indici = [indici1, indici2]
        t1 = time.time()
        print(t1 - t0)
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
Indexing the dataset takes about 20 minutes on my machine which is great compared to before. Thanks everyone for the help!
I'll be looking into the locality-sensitive hashing suggested by Pierre D.

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real functions here; in my actual program they are simply two calculations. I'm doing these calculations at every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
Now this is a costly operation because it recreates a list every time (and copies the references). A nasty side effect is that it also rebinds the name q, though you can use q[:] = q[-2:] to avoid that.
The deque object just moves its start pointer and "forgets" the oldest item, so it's faster, and this is one of the uses it was designed for.
Of course, for 2 values, there isn't much difference, but for a bigger number there is.
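A tiny, self-contained demonstration of that behaviour via maxlen:
from collections import deque

q = deque(maxlen=2)
for item in (1, 2, 3, 4):
    q.append(item)    # once full, each append silently drops the oldest item

print(list(q))        # [3, 4]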
If I interpret your question correctly, you have a function that calculates a value, and you want to do another calculation with this value and the previous one. The best way is to use two variables:
previous_item = None   # or whatever the first "previous" value should be
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

Python Deque - 10 minutes worth of data

I'm trying to write a script that, when executed, appends a new available piece of information and removes data that's over 10 minutes old.
I'm wondering what's the most efficient way, performance-wise, of keeping track of the specific time of each information element while also removing the data that is over 10 minutes old.
My novice thought would be to append the information with a timestamp - [info, time] - to the deque and, in a while loop, continuously evaluate the end of the deque to remove anything older than 10 minutes... I doubt this is the best way.
Can someone provide an example? Thanks.
One way to do this is to use a sorted tree structure, keyed on the timestamps. Then you can find the first element >= 10 minutes ago, and remove everything before that.
Using the bintrees library as an example (because its key-slicing syntax makes this very easy to read and write…):
import datetime
import bintrees

q = bintrees.FastRBTree()
now = datetime.datetime.now()
q[now] = 'a'
q[now - datetime.timedelta(seconds=5)] = 'b'
q[now - datetime.timedelta(seconds=10)] = 'c'
q[now - datetime.timedelta(seconds=15)] = 'd'

now = datetime.datetime.now()
del q[:now - datetime.timedelta(seconds=10)]
That will remove everything up to, but not including, now-10s, which should be both c and d.
This way, finding the first element to remove takes log N time, and removing N elements below that should be average case amortized log N but worst case N. So, your overall worst case time complexity doesn't improve, but your average case does.
Of course the overhead of managing a tree instead of a deque is pretty high, and could easily be higher than the savings of N/log N steps if you're dealing with a pretty small queue.
There are other logarithmic data structures that may be more appropriate, like a pqueue/heapqueue (as implemented by heapq in the stdlib), or a clock ring; I just chose a red-black tree because (with a PyPI module) it was the easiest one to demonstrate.
If you're only ever appending to the end, and the values are always inherently in sorted order, you don't actually need a logarithmic data structure like a tree or heap at all; you can do a logarithmic search within any sorted random-access structure like a list or collections.deque.
The problem is that deleting everything up to an arbitrary point in a list or deque takes O(N) time. There's no reason that it should; you should be able to drop N elements off a deque in amortized constant time (with del q[:pos] or q.popleft(pos)), it's just that collections.deque doesn't do that. If you find or write a deque class that does have that feature, you could just write this:
import bisect
import datetime
from collections import deque

q = deque()
now = datetime.datetime.now()
# entries are appended oldest-first, so the deque stays sorted by timestamp
q.append((now - datetime.timedelta(seconds=15), 'd'))
q.append((now - datetime.timedelta(seconds=10), 'c'))
q.append((now - datetime.timedelta(seconds=5), 'b'))
q.append((now, 'a'))

now = datetime.datetime.now()
pos = bisect.bisect_left(q, (now - datetime.timedelta(seconds=10),))
del q[:pos]
I'm not sure whether a deque like this exists on PyPI, but the C source to collections.deque is available to fork, or the Python source from PyPy, or you could wrap a C or C++ deque type, or write one from scratch…
Or, if you're expecting that the "current" values in the deque will always be a small subset of the total length, you can do it in O(M) time just by not using the deque destructively:
q = q[pos:]
In fact, in that case, you might as well just use a list; it has O(1) append on the right, and slicing the last M items off a list is about as low-overhead a way to copy M items as you're going to find.
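A minimal sketch of that last variant, keeping (timestamp, value) tuples in a plain list and trimming it non-destructively (the 600-second cutoff and function names are illustrative):
import bisect
import time

q = []   # list of (timestamp, value) tuples, appended in time order

def add(value):
    q.append((time.time(), value))

def prune():
    global q
    cutoff = time.time() - 600              # 10 minutes
    pos = bisect.bisect_left(q, (cutoff,))  # logarithmic search for the cutoff
    q = q[pos:]                             # keep only the recent entries (copies M items)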
Yet another answer, with even more constraints:
If you can bucket things with, e.g., one-minute precision, all you need is 10 lists.
Unlike my other constrained answer, this doesn't require that you only ever append on the right; you can append in the middle (although you'll come after any other values for the same minute).
The down side is that you can't actually remove everything more than 10 minutes old; you can only remove everything in the 10th bucket, which could be off by up to 1 minute. You can choose what this means by choosing how to round:
Truncate one way, and nothing ever gets dropped too early, but everything is dropped late, an average of 30 seconds and at worst 60.
Truncate the other way, and nothing ever gets dropped late, but everything is dropped early, an average of 30 seconds and at worst 60.
Round at half, and things get dropped both early and late, but with an average of 0 seconds and a worst case of 30.
And you can of course use smaller buckets, like 100 buckets of 6-second intervals instead of 10 buckets of 1-minute intervals, to cut the error down as far as you like. Push that too far and you'll ruin the efficiency; a list of 600000 buckets of 1ms intervals is nearly as slow as a list of 1M entries.* But if you need 1 second or even 50ms, that's probably fine.
Here's a simple example:
def prune_queue(self):
    now = int(time.time() // 60)
    age = now - self.last_bucket_time
    if age:
        self.buckets = self.buckets[-(10 - age):] + [[] for _ in range(age)]
        self.last_bucket_time = now

def enqueue(self, thing):
    self.prune_queue()
    self.buckets[-1].append(thing)
* Of course you could combine this with the logarithmic data structure—a red-black tree of 600000 buckets is fine.

Efficient strings containing each other

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a, b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b, checking whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections

prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in xrange(len(b) - 9):
        prefix_table[b[i:i + 10]].add(k)

for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in B[k]
            # (a matching 10-mer is necessary, but not sufficient)
            if a in B[k]:
                print (a, B[k])
    else:
        for k in xrange(len(B)):
            # a is too short to use the table; check if
            # a is in any b
            if a in B[k]:
                print (a, B[k])
Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):
Initialize a hashtable or trie.
For each string b in B:
    For every substring of that string:
        Add the substring to the hashtable/trie (this is no worse than 55*O(B) = O(B)), with metadata recording which string it belonged to.
For each string a in A:
    Do an O(1) query to your hashtable/trie to find all B-strings it is in; yield those.
(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite-automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on if you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
A very fast way to search for a lot of strings is to make use of a finite automaton (so your regexp guess was not that far off), namely the Aho-Corasick string matching machine, which is used in tools like grep, virus scanners and the like.
First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in the details). This automaton then reads the input string(s) and outputs all found search strings (you probably want to modify it a bit, so that it also outputs the string in which the search string was found).
This method has the advantage that it searches for all the search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!
There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about the performance, usability or correctness of those implementations.
EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:
import pyahocorasick

def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s, b) for b in B for (i, res) in t.iter(b) for s in res]
One thing I observed, though, was that when testing this implementation with @SvenMarnach's script it yielded slightly fewer results than the other methods, and I am not sure why. I wrote a mail to the creator; maybe he will figure it out.
There are specialized index structures for this, see for example
http://en.wikipedia.org/wiki/Suffix_tree
You'd build a suffix-tree or something similar for B, then use A to query it.

Recursive generation + filtering. Better non-recursive?

I have the following need (in python):
generate all possible tuples of length 12 (could be more) containing either 0, 1 or 2 (basically, a ternary number with 12 digits)
filter these tuples according to specific criteria, culling those not good, and keeping the ones I need.
As I only had to deal with small lengths until now, the functional approach was neat and simple: a recursive function generates all possible tuples, then I cull them with a filter function. Now that I have a larger set, the generation step is taking too much time, much longer than needed, as most of the paths in the solution tree will be culled later on, so I could skip their creation.
I have two candidate solutions:
derecurse the generation into a loop, and apply the filter criteria on each new 12-digit entity
integrate the filtering into the recursive algorithm, so as to prevent it from stepping into paths that are already doomed (a rough sketch of this is shown below)
My preference goes to 1 (it seems easier), but I would like to hear your opinion, in particular with an eye towards how a functional programming style deals with such cases.
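For concreteness, option 2 (pruning inside the recursion) could look roughly like this sketch; partial_ok is a hypothetical predicate, not something from the question, that decides whether a prefix can still lead to an acceptable tuple:
def generate(length, prefix=(), partial_ok=lambda p: True, digits=(0, 1, 2)):
    if not partial_ok(prefix):
        return                        # prune: nothing starting with this prefix can pass
    if len(prefix) == length:
        yield prefix
        return
    for d in digits:
        for t in generate(length, prefix + (d,), partial_ok, digits):
            yield t

# e.g. keep only tuples that never contain two consecutive 2s:
no_double_two = lambda p: not any(p[i] == p[i + 1] == 2 for i in range(len(p) - 1))
results = list(generate(12, partial_ok=no_double_two))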
How about
import itertools

results = []
for x in itertools.product(range(3), repeat=12):
    if myfilter(x):
        results.append(x)
where myfilter does the selection. Here, for example, only allowing result with 10 or more 1's,
def myfilter(x): # example filter, only take lists with 10 or more 1s
    return x.count(1) >= 10
That is, my suggestion is your option 1. For some cases it may be slower because (depending on your criteria) you may generate many tuples that you don't need, but it's much more general and very easy to code.
Edit: This approach also has a one-liner form, as suggested in the comments by hughdbrown:
results = [x for x in itertools.product(range(3), repeat=12) if myfilter(x)]
itertools has functionality for dealing with this. However, here is a (hardcoded) way of handling it with a generator:
T = (0, 1, 2)
GEN = ((a,b,c,d,e,f,g,h,i,j,k,l) for a in T for b in T for c in T for d in T
       for e in T for f in T for g in T for h in T for i in T for j in T
       for k in T for l in T)
for VAL in GEN:
    # Filter VAL
    print VAL
I'd implement an iterative binary adder or hamming code and run that way.
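One possible reading of that suggestion, sketched as an explicit base-3 counter (this is an illustration, not the answerer's code); myfilter is the example filter from the first answer:
def ternary_tuples(length=12):
    digits = [0] * length
    while True:
        yield tuple(digits)
        # increment the base-3 "adder", carrying to the left
        i = length - 1
        while i >= 0 and digits[i] == 2:
            digits[i] = 0
            i -= 1
        if i < 0:
            return        # wrapped around: all 3**length tuples have been produced
        digits[i] += 1

results = [t for t in ternary_tuples(12) if myfilter(t)]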
