List vs dictionary to store zeroes in Python

I am solving a problem in which I need a list of zeroes, and after that I have to update some of the values in the list. I have two options in mind for how to do this: either I simply make a list of zeroes and then update the values, or I create a dictionary and then update the values.
List method:
l = [0] * n
Dictionary method:
d = {}
for i in range(n):
    d[i] = 0
Now, the complexity of building the dictionary is O(n), and updating a key is O(1). But I don't know how Python builds the list of zeroes with the method above.
Let's assume n is a large number. Which of the above methods is better for this task, and how is the list method implemented in Python? Also, why is the above list method faster than a list comprehension for creating a list of zeroes?

Access and update, once you have pre-allocated your sequence, will be roughly the same.
Pick a data structure that makes sense for your application. In this case I suggest a list, because it more naturally fits "a sequence indexed by integers".
The reason [0]*n is fast is that it can make a list of the correct size in one go, rather than constantly expanding the list as more elements are added.
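As a rough illustration of that point, here is a minimal sketch of the preallocate-then-update pattern (the value of n and the update positions are made up for the example):
n = 1_000_000

# Preallocate the whole list in one step; CPython sizes the
# underlying array for n elements up front.
counts = [0] * n

# Index assignment on an existing slot is O(1), the same cost
# class as updating a key in a dict.
counts[42] = 7
counts[99] += 1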

collections.defaultdict may be a better solution if you expect that a lot of elements will not change during your updates and will keep their initial value (and if you don't rely on KeyError somehow). Just:
import collections
d = collections.defaultdict(int)
assert d[42] == 0
d[43] = 1
# ...
Another thing to consider is array.array. You can use it if you want to store only elements (counts) of one type. It should be a little faster and more memory-efficient than a list:
import array
l = array.array('L', [0]) * n
# use as list
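For instance, a small usage sketch of array.array as a zero-filled counter array (the indices and values are invented for illustration):
import array

n = 10
counts = array.array('L', [0]) * n   # n unsigned longs, all zero

counts[3] = 5        # index assignment works just like a list
counts[3] += 1
print(list(counts))  # [0, 0, 0, 6, 0, 0, 0, 0, 0, 0]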

After running a test using timeit:
import timeit
timeit.repeat("[0]*1000", number=1000000)
#[4.489016328923801, 4.459866205812087, 4.477892545204176]
timeit.repeat("""d = {}
for i in range(1000):
    d[i] = 0""", number=1000000)
#[77.77789647192793, 77.88324065372811, 77.7300221235187]
timeit.repeat("""x = dict.fromkeys(range(1000), 0)""", number=1000000)
#[53.62738158027423, 53.87422525293914, 53.50821399216625]
As you can see, there is a HUGE difference between the first two methods, and the third one is better but still not as fast as the list! The reason is that creating a list with its size specified up front is far faster than creating a dictionary and expanding it iteration by iteration.

I think in this situation you should just use a list, unless you want to access some data without using an index.
A Python list is backed by an array. It starts with a specific allocated size; when it needs to store more items than that size can hold, it copies everything to a new, larger array, and the copying is O(k), where k is the current size of the list. This process can happen many times before the list reaches a size greater than or equal to n. However, [0]*n creates the array with the right size (which is n) straight away, so it is faster than growing the list step by step.
For creation by list comprehension, if you mean something like [0 for i in range(n)], I think it suffers from that same resizing, and it also runs a Python-level loop for every element, so it is slower.
A Python dictionary is an implementation of a hash table, and it uses a hash function to calculate the hash value of the key whenever you insert a new key-value pair. Executing the hash function is comparatively expensive, and the dictionary also has to deal with situations like collisions, which makes it even slower. Thus, creating the zeroes with a dictionary should be the slowest, in theory.
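If you want to observe the resizing behaviour described above, sys.getsizeof shows the allocated size of a list jumping in steps as it grows by appends, while [0]*n is sized once up front (a sketch; the exact byte counts depend on your CPython version):
import sys

grown = []
last = sys.getsizeof(grown)
for i in range(20):
    grown.append(0)
    size = sys.getsizeof(grown)
    if size != last:
        # Each jump corresponds to one reallocate-and-copy of the backing array.
        print(f"after {i + 1} appends: {size} bytes")
        last = size

# Built in one step: no intermediate reallocations.
print("preallocated:", sys.getsizeof([0] * 20), "bytes")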

Related

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws up an error: TypeError: set() missing 1 required positional argument: 'items' (This is from a website for Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Looking up an element in a list takes O(N) time (you can find an element in logarithmic time, but the list would have to be sorted, so that's not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so a lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm will only take O(N) time on average, which is an order better.
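A quick way to see this difference yourself is to time membership tests directly; a minimal sketch (the collection size and probe value are chosen arbitrarily):
import timeit

setup = "data = list(range(10_000)); as_list = data; as_set = set(data)"

# The probe value sits at the very end of the list, close to the worst
# case for a linear scan; for the set it makes no difference.
print(timeit.timeit("9_999 in as_list", setup=setup, number=10_000))
print(timeit.timeit("9_999 in as_set", setup=setup, number=10_000))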
In most cases sets are faster than lists. One of these cases is when you look for an item using the in keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I think your code is also inefficient in the sense that, for each item in the list, you are searching a list that keeps growing, while the second snippet looks the item up in a set. The size difference does not hold all the time, though: for example, if the input is all unique items, both collections end up the same size.
Hope that clarifies things.

Finding intersections of huge sets with huge dicts

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a Python object containing the set, in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist, you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
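As a tiny self-contained sketch of the .get(x, 0) approach above (the kmer strings and counts are invented for illustration, not taken from the question):
# Invented stand-ins for the real data structures in the question.
kmerCounts = {"ACGT": 12, "GGTA": 3, "TTAC": 7}
block_kmers = {"ACGT", "TTAC", "NNNN"}   # "NNNN" is not in the dict

# Missing kmers contribute 0, so no explicit intersection step is needed.
count = sum(kmerCounts.get(x, 0) for x in block_kmers)
print(count)  # 19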
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.

Is this the most efficient way to vertically slice a list of dictionaries for unique values?

I've got a list of dictionaries, and I'm looking for a unique list of values for one of the keys.
This is what I came up with, but can't help but wonder if its efficient, time and/or memory wise:
list(set([d['key'] for d in my_list]))
Is there a better way?
This:
list(set([d['key'] for d in my_list]))
… constructs a list of all values, then constructs a set of just the unique values, then constructs a list out of the set.
Let's say you had 10000 items, of which 1000 are unique. You've reduced final storage from 10000 items to 1000, which is great—but you've increased peak storage from 10000 to 11000 (because there clearly has to be a time when the entire list and almost the entire set are both in memory simultaneously).
There are two very simple ways to avoid this.
First (as long as you've got Python 2.4 or later) use a generator expression instead of a list comprehension. In most cases, including this one, that's just a matter of removing the square brackets or turning them into parentheses:
list(set(d['key'] for d in my_list))
Or, even more simply (with Python 2.7 or later), just construct the set directly by using a set comprehension instead of a list comprehension:
list({d['key'] for d in my_list})
If you're stuck with Python 2.3 or earlier, you'll have to write an explicit loop. And with 2.2 or earlier, there are no sets, so you'll have to fake it with a dict mapping each key to None or similar.
Beyond space, what about time? Well, clearly you have to traverse the entire list of 10000 dictionaries, and do an O(1) dict.get for each one.
The original version does a list.append (actually a slightly faster internal equivalent) for each of those steps, and then the set conversion is a traversal of a list of the same size with a set.add for each one, and then the list conversion is a traversal of a smaller set with a list.append for each one. So, it's O(N), which is clearly optimal algorithmically, and only worse by a smallish multiplier than just iterating the list and doing nothing.
The set version skips over the list.appends, and only iterates once instead of twice. So, it's also O(N), but with an even smaller multiplier. And the savings in memory management (if N is big enough to matter) may help as well.
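For reference, a small sketch of the three variants discussed above (my_list is made-up sample data):
my_list = [{"key": "a"}, {"key": "b"}, {"key": "a"}]

v1 = list(set([d["key"] for d in my_list]))  # list comp, then set, then list
v2 = list(set(d["key"] for d in my_list))    # generator expression, no temporary list
v3 = list({d["key"] for d in my_list})       # set comprehension, most direct

print(sorted(v1), sorted(v2), sorted(v3))    # all three give ['a', 'b']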

Why is l.insert(0, i) slower than l.append(i) in Python?

I tested two different ways to reverse a list in Python.
import timeit
value = [i for i in range(100)]
def rev1():
    v = []
    for i in value:
        v.append(i)
    v.reverse()

def rev2():
    v = []
    for i in value:
        v.insert(0, i)

print(timeit.timeit(rev1))
print(timeit.timeit(rev2))
Interestingly, the second method, which inserts the value at the first position, is much slower than the first one.
20.4851300716
73.5116429329
Why is this? In terms of operations, inserting an element at the head doesn't seem that expensive.
insert is an O(n) operation as it requires all elements at or after the insert position to be shifted up by one. append, on the other hand, is generally O(1) (and O(n) in the worst case, when more space must be allocated). This explains the substantial time difference.
The time complexities of these methods are thoroughly documented here.
I quote:
Internally, a list is represented as an array; the largest costs come from growing beyond the current allocation size (because everything must move), or from inserting or deleting somewhere near the beginning (because everything after that must move).
Now, going back to your code, we can see that rev1() is an O(n) implementation whereas rev2() is in fact O(n²), so it makes sense that rev2() will be much slower.
In Python, lists are implemented as arrays. If you append one element, it goes into the spare space the array keeps at its end (and the array is enlarged when that space runs out). If you prepend an element, all existing elements have to be shifted by one position, and that is very expensive.
You can confirm this by reading about Python lists online. Python implements a list as an array, where the size of the underlying array is typically larger than the size of your current list. The unused slots are at the end of the array and represent new elements that can be added to the END of the list, not the beginning.
Python uses a classical amortized-cost approach, so on average appending to the end of the list takes O(1) time over a run of appends, although occasionally a single append causes the array to become full, so a new, larger array must be created and all the data copied over. On the other hand, if you always insert at the front of the list, then every element in the underlying array has to be moved over by one index to make room for the new element at the beginning. So, to summarize: if you create a list by doing N insertions, the total running time will be O(N) if you always append new items to the end of the list, and O(N²) if you always insert at the front.
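If you genuinely need cheap insertion at the front, collections.deque offers O(1) appendleft; here is a sketch of the reversal example rewritten that way (an alternative, not something from the original question):
from collections import deque

value = list(range(100))

def rev3():
    # appendleft on a deque is O(1), unlike list.insert(0, i), which is O(n).
    v = deque()
    for i in value:
        v.appendleft(i)
    return list(v)

assert rev3() == list(reversed(value))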

Remove an item with a certain property value from a list

Given a list of objects where each has a property named x, I want to remove from the list all the objects whose x property has the value v.
One way to do it is with a list comprehension: [item for item in mylist if item.x != v], but my list is small (usually fewer than 10 items). Another way is to iterate through the list in a loop and check every single item.
Is there a third way that is equally fast or even faster?
You can also use a generator or the filter function. Choose what you find the most readable; efficiency doesn't really matter at this point (especially not if you're dealing with just a few elements).
Create a new list using list comprehension syntax. I don't think you can do anything faster than that. It doesn't matter that your list is small; that's even better.
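For completeness, a sketch of the list-comprehension and filter() alternatives mentioned above, using a hypothetical Item class with an x attribute:
from dataclasses import dataclass

@dataclass
class Item:
    x: int

mylist = [Item(1), Item(2), Item(1)]
v = 1

kept_comp = [item for item in mylist if item.x != v]          # list comprehension
kept_filter = list(filter(lambda item: item.x != v, mylist))  # filter()

assert kept_comp == kept_filter == [Item(2)]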
