Problem
I have a large (> 500e6 rows) dataset that I've put into a PyTables database.
Let's say the first column is ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique combination amongst the 500e6 rows that I'm trying to find.
As a starter I've done something like this:
index1 = db.cols.id.create_index()
index2 = db.cols.counts.create_index()

for row in db:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.readWhere(query)
    if len(result) > 1:
        print row
It's a brute-force method, I'll admit. Any suggestions on improvements?
update
current brute force runtime is 8421 minutes.
solution
Thanks for the input everyone. I managed to get the runtime down to 2364.7 seconds using the following method:
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)
ex.eval()
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))
I can generate a perfect hash because my maximum values are less than 2^16: I am effectively bit-packing the two columns into a single 32-bit int.
Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.
This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.
Two obvious techniques come to mind: hashing and sorting.
A) Define a hash function to combine ID and Counter into a single, compact value.
B) Count how often each hash code occurs.
C) Select from your data all rows that have hash collisions (this should be a much smaller data set).
D) Sort this data set to find the duplicates.
The hash function in A) needs to be chosen such that the counts fit into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 bits or so for this. You can afford to have 5-10% collisions; this should still reduce the data set enough to allow fast in-memory sorting afterwards.
This is essentially a Bloom filter.
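A pure-Python sketch of steps A)-D), using two plain bitsets rather than a full Bloom filter. The bitset size and the exact hash are illustrative, and the code assumes IDs and counters each fit in 16 bits:

```python
def find_duplicates(rows):
    """rows: a list of (id, counter) pairs, each value assumed < 2**16."""
    NBITS = 2 ** 24                    # bitset size; tune to available RAM
    seen = bytearray(NBITS // 8)       # bit set once a hash has been seen
    repeated = bytearray(NBITS // 8)   # bit set once a hash has been seen twice

    def test_and_set(bits, h):
        byte, mask = h >> 3, 1 << (h & 7)
        was_set = bool(bits[byte] & mask)
        bits[byte] |= mask
        return was_set

    def get(bits, h):
        return bool(bits[h >> 3] & (1 << (h & 7)))

    # Steps A + B: hash each pair and mark hash codes occurring more than once.
    for ident, counter in rows:
        h = ((ident << 16) | counter) % NBITS
        if test_and_set(seen, h):
            test_and_set(repeated, h)

    # Steps C + D: keep only rows whose hash collided, then sort the (much
    # smaller) candidate set and compare neighbours to find true duplicates.
    candidates = sorted(p for p in rows
                        if get(repeated, ((p[0] << 16) | p[1]) % NBITS))
    return [a for a, b in zip(candidates, candidates[1:]) if a == b]
```

With 16-bit values the combined hash is exact below NBITS, so here the second pass finds true duplicates; with a lossy hash it would only narrow down the candidates.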
The brute-force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is supposedly built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.
I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.
In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and you can decrease the entropy of your index to its minimum possible value (zero) either by choosing non-default values of optlevel=9 and kind='full', or equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you have to pay a little more upfront by taking a longer time to create a better optimized index to begin with, but then it pays you back later by saving you time on the series of queries that you have to repeat 500e6 times.
If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to just simply perform a massive sort on all of the rows, and then just search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
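For the external merge-sort option, a minimal standard-library sketch (heapq.merge does the lazy k-way merge; the chunk size is arbitrary and would be tuned to available RAM):

```python
import heapq
import itertools
import tempfile

def external_sort(numbers, chunk_size=100_000):
    """Sort an iterable of ints too large for RAM: sort fixed-size chunks,
    spill each sorted run to a temporary file, then lazily merge the runs."""
    runs = []
    it = iter(numbers)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode='w+')
        run.writelines(f"{x}\n" for x in chunk)
        run.seek(0)
        runs.append(run)
    # heapq.merge streams from the runs, so memory stays around one chunk.
    yield from (int(line) for line in heapq.merge(*runs, key=int))
```

Adjacent equal values in the merged output are exactly the duplicates, so a single pass comparing neighbours finishes the job.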
Related
I was looking for the solution to the "two number sum problem" and I saw everybody using two for loops.
Another way I saw was using a hash table:
def twoSumHashing(num_arr, pair_sum):
    hashTable = {}
    for i in range(len(num_arr)):
        complement = pair_sum - num_arr[i]
        if complement in hashTable:
            print("Pair with sum", pair_sum, "is: (", num_arr[i], ",", complement, ")")
        hashTable[num_arr[i]] = num_arr[i]
# Driver Code
num_arr = [4, 5, 1, 8]
pair_sum = 9
# Calling function
twoSumHashing(num_arr, pair_sum)
But why does nobody discuss this solution?
def two_num_sum(array, target):
    for num in array:
        match = target - num
        if match in array:
            return [match, num]
    return "no result found"
When using a hash table we have to store values into the hash table, but here there is no need for that.
1) Does that affect the time complexity of the solution?
2) Looking up a value in a hash table is easy compared to an array, but if the values are huge in number, does storing them in a hash table take more space?
First of all, the second function you provide as a solution is not correct: it does not return a complete list of answers.
Second, as a Pythonista, it's better to say "dictionary" instead of "hash table". A Python dictionary is one implementation of a hash table.
Anyhow, regarding the other questions that you asked:
Using two for-loops is a brute-force approach and is usually not optimal in practice. Dictionaries are much faster than lists for lookups in Python, so for the sake of time complexity, dictionaries are the winner.
From the point of view of space complexity, dictionaries certainly take more memory, but with current hardware that is not an essential issue even for billions of numbers. It depends on your situation whether speed or memory use is more crucial to you.
first function
uses O(n) time complexity, as you iterate over the n members of the array
uses O(n) space complexity: the matching pair could be the first and last elements, so in the worst case you store up to n-1 numbers in the dictionary
second function
uses O(n^2) time complexity: you iterate over the array, and for each element use in, which calls __contains__ on the list and is O(n) in the worst case.
So the second function is like doing two loops to brute-force the solution.
Another thing to point out in the second function is that you don't return all the pairs, just the first pair you find.
You could try to fix it by iterating from the index of num+1 onwards, but then you will have duplicates.
This all comes down to a preference of what's more important: time complexity or space complexity.
This is one of many interview / interview-preparation questions where you need to explain why you would use function two (if it worked properly) over function one and vice versa.
Answers for your questions
1. When using a hash table we have to store values into the hash table. But here there is no need for that. Does that affect the time complexity of the solution?
Yes: the time complexity is now O(n^2), which is worse.
2)looking up a value in hash table is easy compared to array, but if the values are huge in number, does storing them in a hash table take more space?
In computers, numbers are just representations of bits. Larger numbers can take up more space, as they need more bits to represent them, but storing them costs the same no matter where you store them.
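For completeness, the dictionary-style approach from the first function can be adapted to return every pair in a single O(n) pass (a sketch; the function name is illustrative):

```python
def two_sum_all_pairs(array, target):
    """Return all (a, b) pairs with a + b == target, in one O(n) pass."""
    seen = set()
    pairs = []
    for num in array:
        complement = target - num
        if complement in seen:
            # complement appeared earlier in the array, so (complement, num)
            # is a valid pair; recording it here avoids duplicates.
            pairs.append((complement, num))
        seen.add(num)
    return pairs
```

Each element is checked against the set of values already seen, so every pair is reported exactly once and membership tests stay O(1) on average.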
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2, 2, 4, 7, 7, 8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set: how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (This is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup of an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, so that's not your case). So if you use the same list both to keep unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average: one order better.
In most cases sets are faster than lists. One of these cases is when you look for an item using the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I also think your code is inefficient in that, for each item, it searches an ever-growing result list, while the second snippet looks the item up in a set. But the "smaller collection" argument is not correct all the time: for example, if the list is all unique items, the two collections end up the same size.
Hope it clarifies.
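To make the difference tangible, here is a small (machine-dependent) timing sketch comparing membership tests; the collection size and repeat count are arbitrary:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst-case-ish membership test: the probed element sits at the far end
# of the list, so the list scan must traverse almost every element, while
# the set resolves the lookup with a hash probe.
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=200)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=200)
print(f"list: {t_list:.5f}s  set: {t_set:.5f}s")
```

The exact numbers vary by machine, but the set lookup is typically orders of magnitude faster at this size.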
EDIT: Best solution thanks to Hakan--
queriedForms.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
I tried more of his suggestions but I can't avoid an INNER JOIN--this seems to be a stable solution that does get me small but predictable speed increases across the board. Look through his answer for more details!
I've been struggling with a problem I haven't seen an answer to online.
When chaining two filters in Django e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
masterQuery.count()
#returns 100,000 results in < 1 second
#test filter--all 100,000+ names have "test x" where x is 0-9
storedCount = masterQuery.filter(name__contains="9").count()
#returns ~50,000 results but takes 5-6 seconds
Trying a slightly different way:
masterQuery = masterQuery.filter(name__contains="9")
masterQuery.count()
#also returns ~50,000 results in 5-6 seconds
Performing an & merge seems to ever so slightly improve performance, e.g.:
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
(masterQuery & masterQuery.filter(name__contains="9")).count()
It seems as if count takes a significantly longer time beyond a single filter in a queryset.
I assume it may have something to do with MySQL, which apparently doesn't like nested statements--and I assume that chaining two filters is creating a nested query that slows MySQL down, regardless of the SELECT COUNT(*) Django uses.
So my question is: is there any way to speed this up? I'm getting ready to do a lot of regular nested querying using only queryset counts (I don't need the actual model values), without database hits to load the models. E.g. I don't need to load 100,000 models from the database; I just need to know there are 100,000 there. It's obviously much faster to do this through querysets than len(), but even at 5 secs per count, when I'm running 40 counts for an entire complex query, that's 3+ minutes--I'd prefer it be under a minute. Am I just fantasizing, or does someone have a suggestion as to how this could be accomplished, outside of increasing the server's processor speed?
EDIT: If it's helpful--the time.clock() speed is .3 secs for the chained filter() count--the actual time to console and django view output is 5-6s
EDIT2: To answer any questions about indexing, the filters use both an indexed and non indexed value for each link in the chain:
mainQuery = bigmodel.relatedmodel_set.all()
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Where "record_attribute_type" is another foreign key being used as a filter
mainQuery.count() #produces 100,000 results in < 1sec
mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="9", reverseforeignkeytestmodel__record_attribute_type__pk=5).count()
#produces ~50,000 results in 5-6 secs
So each filter in the chain is functionally similar, it is an AND filter(condition,condition) where one condition is indexed, and the other is not. I can't index both conditions.
Edit 3:
Similar queries that result in smaller results, e.g. < 10,000 are much faster, regardless of the nesting--e.g. the first filter in the chain produces 10,000 results in ~<1sec but the second filter in the chain will produce 5,000 results in ~<1sec
Edit 4:
Still not working based on #Hakan's solution
mainQuery = bigmodel.relatedmodel_set.all()
#Setup the first filter as normal
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Grab a values list for the second chained filter instead of chaining it
values = bigmodel.relatedmodel_set.all().filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=8).values_list('pk', flat=True)
#filter the first query based on the values_list rather than a second filter
mainQuery = mainQuery.filter(pk__in=values)
mainQuery.count()
#Still takes on average the same amount of time after enough test runs--seems to be slightly faster than average--similar to the (quersetA & querysetB) merge solution I tried.
It's possible I did this wrong--but the counts are consistent between the new values_list filter technique and the old one, i.e. I'm getting the same number of results. So it's definitely working--but seemingly taking the same amount of time.
EDIT 5:
Also based on #Hakan's solution with some slight tweaks
mainQuery.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
This seems to operate faster for larger results in a queryset, e.g. > 50,000, but is actually much slower on smaller queryset results, e.g. < 50,000--where they used to be <1sec--sometimes 2-3 running in 1 second for chain filtering, they now all take 1 second individually. Essentially the speed gains in the larger queryset have been nullified by the speed loss in the smaller querysets.
I'm still going to try and break up the queries as per his suggestion further--but I'm not sure I'm able to. I'll update again(possibly on Monday) when I figure that out and let everyone interested know the progress.
Not sure if this helps, since I don't have a mysql project to test with.
The QuerySet API reference contains a section about the performance of nested queries.
Performance considerations
Be cautious about using nested queries and understand your database
server’s performance characteristics (if in doubt, benchmark!). Some
database backends, most notably MySQL, don’t optimize nested queries
very well. It is more efficient, in those cases, to extract a list of
values and then pass that into the second query. That is, execute two
queries instead of one:
values = Blog.objects.filter(
    name__contains='Cheddar').values_list('pk', flat=True)
entries = Entry.objects.filter(blog__in=list(values))
Note the list() call around the Blog QuerySet to force execution of the first query.
Without it, a nested query would be executed, because QuerySets are
lazy.
So, maybe you can improve the performance by trying something like this:
masterQuery = bigmodel.relatedmodel_set.all()
pks = list(masterQuery.filter(name__contains="test").values_list('pk', flat=True))
count = masterQuery.filter(pk__in=pks, name__contains="9").count()
Since your initial MySQL performance is so slow, it might even be faster to do the second step in Python instead of in the database.
names = masterQuery.filter(name__contains='test').values_list('name', flat=True)
count = sum('9' in n for n in names)
Edit:
From your updates, I see that you are querying fields in related models, which result in multiple sql JOIN operations. That's likely a big reason why the query is slow.
To avoid joins, you could try something like this. The goal is to avoid doing deeply chained lookups across relations.
# Query only RelatedModel, avoiding a JOIN.
related_pks = RelatedModel.objects.filter(
    record_value__contains=constraint['TVAL'],
    record_attribute_type=rtypePK,
).values_list('pk', flat=True)

# list(queryset) forces a database query, resulting in a list of integers.
pks_list = list(related_pks)

# Use that result to filter your main model.
count = MainModel.objects.filter(
    formrecordattributevalue__in=pks_list
).count()
I'm assuming that the relation is defined as a foreign key from MainModel to RelatedModel.
My goal is to iterate through a set S of elements, given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt, elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can apply each generator to a given element.
The algorithm I use is:
def orbit(act, elt):
    new_elements = [elt]
    seen_elements = set([elt])
    yield elt
    while new_elements:
        elt = new_elements.pop()
        for f in act.gens():
            elt_new = f(elt)
            if elt_new not in seen_elements:
                new_elements.append(elt_new)
                seen_elements.add(elt_new)
                yield elt_new
This algorithm seems to be well suited and very generic. BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major: seen_elements collects all the elements and is thus too memory-consuming, given that I do not need the actual elements anymore.
How can I avoid storing all the elements in memory?
Very likely this depends on what the elements are. For me, they are short lists (< 10 entries) of ints (each < 10^3). So first, is there a fast way to associate a (with high probability) unique integer to such a list? Does that save much memory? If so, should I put those into a dict to check containment (in which case first a hash equality test and then an int equality test are done, right?), or how should I do that?
The minor: popping the element takes a lot of time, given that I don't quite need that list. Is there a better way of doing that?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries are all in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
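For instance (a sketch: it assumes each entry fits in 10 bits and that the list length is known when unpacking):

```python
def pack(elt):
    # each entry assumed to be in range(1024), i.e. at most 10 bits
    return sum(x << (i * 10) for i, x in enumerate(elt))

def unpack(n, length):
    # inverse of pack(): extract `length` 10-bit fields
    return [(n >> (i * 10)) & 0x3FF for i in range(length)]
```

Since pack() is injective under these assumptions, the resulting int can stand in for the list in a set or dict of seen elements.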
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
If you can tolerate errors, then a Bloom filter would be a possibility.
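A minimal sketch of such a Bloom filter; the bit count and number of hash positions are arbitrary, and it can return false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=2 ** 20, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, item):
        # derive nhashes bit positions from one SHA-256 digest of the item;
        # the item needs a stable repr (tuples of ints qualify)
        digest = hashlib.sha256(repr(item).encode()).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], 'big') % self.nbits

    def add(self, item):
        for h in self._positions(item):
            self.bits[h >> 3] |= 1 << (h & 7)

    def __contains__(self, item):
        # all positions set => "probably seen"; any clear => definitely not seen
        return all(self.bits[h >> 3] & (1 << (h & 7))
                   for h in self._positions(item))
```

In the orbit algorithm above, replacing seen_elements with such a filter would bound memory at nbits/8 bytes, at the cost of occasionally (and wrongly) skipping an element whose hash positions collide.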
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising. Have you measured?
I'm working on a script that takes the elements from companies and pairs them up with the elements of people. The goal is to optimize the pairings such that the sum of all pair values is maximized (the value of each individual pairing is precomputed and stored in the dictionary ctrPairs).
They're all paired in a 1:1, each company has only one person and each person belongs to only one company, and the number of companies is equal to the number of people. I used a top-down approach with a memoization table (memDict) to avoid recomputing areas that have already been solved.
I believe that I could vastly improve the speed of what's going on here but I'm not really sure how. Areas I'm worried about are marked with #slow?, any advice would be appreciated (the script works for inputs of lists n<15 but it gets incredibly slow for n > ~15)
def getMaxCTR(companies, people):
    if (companies, people) in memDict:
        return memDict[(companies, people)]  # return the memoized result if it exists
    if not len(companies) or not len(people):
        return 0
    maxCTR = None
    remainingCompanies = companies[1:len(companies)]  # slow?
    for p in people:
        remainingPeople = list(people)  # slow?
        remainingPeople.remove(p)  # slow?
        ctr = ctrPairs[(companies[0], p)] + getMaxCTR(remainingCompanies, tuple(remainingPeople))  # recurse
        if maxCTR is None or ctr > maxCTR:
            maxCTR = ctr
    memDict[(companies, people)] = maxCTR
    return maxCTR
To all those who wonder about the use of learning theory, this question is a good illustration. The right question is not about a "fast way to bounce between lists and tuples in python" — the reason for the slowness is something deeper.
What you're trying to solve here is known as the assignment problem: given two lists of n elements each and n×n values (the value of each pair), how to assign them so that the total "value" is maximized (or equivalently, minimized). There are several algorithms for this, such as the Hungarian algorithm (Python implementation), or you could solve it using more general min-cost flow algorithms, or even cast it as a linear program and use an LP solver. Most of these have a running time of O(n^3).
What your algorithm above does is try each possible way of pairing them. (The memoisation only helps to avoid recomputing answers for pairs of subsets, but you're still looking at all pairs of subsets.) This approach is at least Ω(n^2·2^(2n)). For n=16, n^3 is 4096 and n^2·2^(2n) is 1099511627776. There are constant factors in each algorithm of course, but see the difference? :-) (The approach in the question is still better than the naive O(n!), which would be much worse.) Use one of the O(n^3) algorithms, and I predict it should run in time for up to n=10000 or so, instead of just up to n=15.
"Premature optimization is the root of all evil", as Knuth said, but so is delayed/overdue optimization: you should first carefully consider an appropriate algorithm before implementing it, not pick a bad one and then wonder what parts of it are slow. :-) Even badly implementing a good algorithm in Python would be orders of magnitude faster than fixing all the "slow?" parts of the code above (e.g., by rewriting in C).
I see two issues here:
Efficiency: you're recreating the same remainingPeople sublists for each company. It would be better to create all the remainingPeople and all the remainingCompanies once and then do all the combinations.
Memoization: you're using tuples instead of lists so they can serve as dict keys for memoization, but tuple equality is order-sensitive, i.e. (1, 2) != (2, 1). You'd be better off using sets and frozensets for this: frozenset((1, 2)) == frozenset((2, 1)).
This line:
remainingCompanies = companies[1:len(companies)]
Can be replaced with this line:
remainingCompanies = companies[1:]
For a very slight speed increase. That's the only improvement I see.
If you want to get a copy of a tuple as a list you can do
mylist = list(mytuple)