Finding intersections of huge sets with huge dicts - python

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find, for each of these sets, its intersection with the dict's keys (and then sum the corresponding counts).
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes, I should probably do this in another language, but I also need it done fast and I am not very good at non-Python languages.)

There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist you can substitute 0, which has no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
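Applied to the loop from the question, the suggestion would look something like this sketch (it assumes self.blocks, get_kmers() and kmerCounts mean the same as above):
for block in self.blocks:
    # Missing k-mers contribute 0 to the sum, so no explicit intersection is needed.
    count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())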

Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.

Related

Compare whether the first two elements on a nested list are equal to a comparison list in python

In Python 2.7, I would like to verify whether a short list of elements is included in a longer nested list when comparing, say, only the first two elements.
Let's say we have a big list of nested elements (this big_list will have over 10k elements, so looping over it for every comparison is very inefficient and I'd like to avoid that). For this example, let's say big_list only has 4 nested entries:
big_list = ((2,3,5,6,7), (4,5,6,7,8), (6,7,8,8), (8,4,2,7))
If I have a single list, say (4,5,11,11,11), I am looking for an operation that returns True when compared against big_list, since the second entry in big_list starts with (4,5,...) and so matches the first two elements of my single_list. Essentially, I want to know whether the first two elements of a single list (e.g. (4,5,11,11,11)) appear at the start of some entry in my big list, regardless of the numbers that follow (e.g. 11,11, ...).
The operation should also return False if another single_list (e.g. (4,8,11,11,11)) does not match the first two elements of any entry in big_list.
I hope this is clearer. Any help?
Thanks in advance,
Since you have a huge list, iterating over the whole thing for every search is O(n); instead, you can do a constant-time lookup using a set.
tup_truth_set = set([tup[:2] for tup in big_list]) # set of the first two elements of each tuple
then you would simply do something like this to check in constant time:
tuple_of_interest[:2] in tup_truth_set
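Put together, a small self-contained sketch using the data from the question (the variable names are just illustrative):
big_list = ((2, 3, 5, 6, 7), (4, 5, 6, 7, 8), (6, 7, 8, 8), (8, 4, 2, 7))

# Build the set of leading pairs once; each later membership test is then O(1) on average.
tup_truth_set = set([tup[:2] for tup in big_list])

print((4, 5, 11, 11, 11)[:2] in tup_truth_set)  # True, because an entry starts with (4, 5, ...)
print((4, 8, 11, 11, 11)[:2] in tup_truth_set)  # False, no entry starts with (4, 8)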
I don't think you can avoid a loop over your list. Even if you don't write the loop yourself, any built-in function that could do what you're asking (and I am not aware of one) would still loop over the list in the background. So I suggest a single line of code that does it, loop included:
(4,5,11,11,11)[:2] in [i[:2] for i in big_list]

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (It is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup of an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, which is not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, the whole algorithm only takes O(N) time on average, which is an order of complexity better.
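To see the difference concretely, here is a small timing sketch (the exact numbers will vary from machine to machine; it only illustrates that membership tests on a set stay cheap as the collection grows):
import timeit

n = 100000
data_list = list(range(n))
data_set = set(data_list)

# Worst case for the list: the element we look for is at the very end.
list_time = timeit.timeit(lambda: (n - 1) in data_list, number=1000)
set_time = timeit.timeit(lambda: (n - 1) in data_set, number=1000)

print("list membership test:", list_time)
print("set membership test: ", set_time)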
In most cases sets are faster than lists. One of those cases is when you look for an item using the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I think your code is also inefficient in that, for each item in the list, you are searching a larger list, while the second snippet looks for the item in a smaller set. That is not always the case, though: if the list's items are all unique, the two collections end up the same size.
Hope that clarifies things.

Optimize algorithm to compute orbit under a given action in python

My goal is to iterate through a set S of elements given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt,elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can use that I can apply each generator to a given element.
The algorithm I use is:
def orbit(act, elt):
    new_elements = [elt]
    seen_elements = set([elt])
    yield elt
    while new_elements:
        elt = new_elements.pop()
        seen_elements.add(elt)
        for f in act.gens():
            elt_new = f(elt)
            if elt_new not in seen_elements:
                new_elements.append(elt_new)
                seen_elements.add(elt_new)
                yield elt_new
This algorithm seems to be well-suited and very generic. BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major: seen_elements collects all the elements, and is thus too memory consuming, given that I do not need the actual elements anymore.
How can I achieve to not have all the elements stored in memory?
Very likely, this depends on what the elements are. So for me, these are short lists (<10 entries) of ints (each < 10^3). So first, is there a fast way to associate a (with high probability) unique integer to such a list? Does that save much memory? If so, should I put those into a dict to check the containment (in this case, first the hash equality test, and then an int equality test are done, right?), or how should I do that?
The minor: popping the element takes a lot of time, given that I don't really need that list. Is there a better way of doing that?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries are all in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
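As a rough sanity check of both points, a small sketch (assuming all entries are in range(1, 1024), as above) that packs a list into a single integer and compares the container sizes:
import sys

def pack(elt):
    # Each entry gets its own 10-bit field, so the result is unique for entries in range(1, 1024).
    return sum(x << (i * 10) for i, x in enumerate(elt))

elt = [513, 7, 999, 42, 1]
key = pack(elt)

print(key)
# Compares only the container objects; the list additionally references a separate int object per entry.
print(sys.getsizeof(elt), sys.getsizeof(key))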
If you can tolerate errors, then a Bloom filter would be a possibility.
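A toy, hand-rolled sketch of that idea (the parameters m and k are arbitrary here; a real implementation would size them from the expected number of elements and the acceptable false-positive rate):
import hashlib

class BloomFilter(object):
    def __init__(self, m=8 * 1024 * 1024, k=7):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from sha256 of the item's repr.
        for i in range(self.k):
            h = hashlib.sha256(("%d:%r" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add([513, 7, 999])
print([513, 7, 999] in seen)   # True
print([1, 2, 3] in seen)       # False (with high probability)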
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising. Have you measured?

Is this the most efficient way to vertically slice a list of dictionaries for unique values?

I've got a list of dictionaries, and I'm looking for a unique list of values for one of the keys.
This is what I came up with, but can't help but wonder if its efficient, time and/or memory wise:
list(set([d['key'] for d in my_list]))
Is there a better way?
This:
list(set([d['key'] for d in my_list]))
… constructs a list of all values, then constructs a set of just the unique values, then constructs a list out of the set.
Let's say you had 10000 items, of which 1000 are unique. You've reduced final storage from 10000 items to 1000, which is great—but you've increased peak storage from 10000 to 11000 (because there clearly has to be a time when the entire list and almost the entire set are both in memory simultaneously).
There are two very simple ways to avoid this.
First (as long as you've got Python 2.4 or later) use a generator expression instead of a list comprehension. In most cases, including this one, that's just a matter of removing the square brackets or turning them into parentheses:
list(set(d['key'] for d in my_list))
Or, even more simply (with Python 2.7 or later), just construct the set directly by using a set comprehension instead of a list comprehension:
list({d['key'] for d in my_list})
If you're stuck with Python 2.3 or earlier, you'll have to write an explicit loop. And with 2.2 or earlier, there are no sets, so you'll have to fake it with a dict mapping each key to None or similar.
Beyond space, what about time? Well, clearly you have to traverse the entire list of 10000 dictionaries, and do an O(1) dict.get for each one.
The original version does a list.append (actually a slightly faster internal equivalent) for each of those steps, and then the set conversion is a traversal of a list of the same size with a set.add for each one, and then the list conversion is a traversal of a smaller set with a list.append for each one. So, it's O(N), which is clearly optimal algorithmically, and only worse by a smallish multiplier than just iterating the list and doing nothing.
The set version skips over the list.appends, and only iterates once instead of twice. So, it's also O(N), but with an even smaller multiplier. And the savings in memory management (if N is big enough to matter) may help as well.
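For completeness, a tiny usage sketch of the set-comprehension version (my_list and 'key' here are just placeholder data):
my_list = [{'key': 'a'}, {'key': 'b'}, {'key': 'a'}, {'key': 'c'}]

unique_values = list({d['key'] for d in my_list})
print(unique_values)  # e.g. ['a', 'c', 'b']; the order is arbitrary because sets are unordered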

Storing more than one key value in a tuple with python?

I'm new to Python and still learning. I was wondering if there was a standard 'best practice' for storing more than one key value in a tuple. Here's an example:
I have a value called 'red' which has a value of 3, and I need to divide a number (say 10) by it. I need to store 3 values: red (the name), 3 (the number of times it divides into 10) and 1 (the remainder). There are other values that are similar and will need to be included as well, so this is for red but the same applies for blue, green, etc. (the numbers are different for each label).
I read around and I think the way I found was to use nested lists, but I am doing this type of storage for a billion records (and I'll need to search through it, so I thought maybe nesting anything might slow me down).
I tried to create something like {'red':3:1,...} but that's not valid syntax, and I'm considering adding a delimiter to the value and then splitting on it, but I'm not sure that's efficient (such as {'red':3a1,..}, then parsing by the letter a).
I'm wondering if there's any better ways to store this or is nested tuples my only solution? I'm using Python 2.
The syntax for tuples is: (a,b,c).
If you want a dictionary with multiple values you can have a list as the value: {'red':[3,1]}.
You may want to also consider named tuples, or even classes. This will allow you to name the fields instead of accessing them by index, which will make the code more clear and structured.
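For example, a minimal namedtuple sketch (the field names are just illustrative, not anything prescribed):
from collections import namedtuple

# One record per label: the name, the quotient and the remainder.
Division = namedtuple('Division', ['name', 'quotient', 'remainder'])

results = {'red': Division('red', 3, 1), 'blue': Division('blue', 2, 2)}

print(results['red'].quotient)    # access by field name instead of index
print(results['blue'].remainder)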
I read around and I think the way I found was to use nested lists, but I am doing this type of storage for a billion records (and I'll need to search through it, so I thought maybe nesting anything might slow me down).
If you have a billion records you probably should be persisting the data (for example in a database). You will likely run out of memory if you try to keep all the data in memory at once.
Use tuple. For example:
`('red', 3, 1)`
Perhaps you mean dictionaries instead of tuples?
{'red': [3,1], 'blue': [2,2]}
If you are trying to store key/value pairs the best way would be to store them in a dictionary. And if you need more than one value to each key, just put those values in a list.
I don't think you would want to store such things in a tuple because tuples aren't mutable. So if you decide to change the order of the quotient and remainder, storing (1, 3) instead of (3, 1), you would need to create new tuples. Whereas with lists, you could simply rearrange the order.
