How does heapq.nsmallest work - python

I'm trying to determine the fastest runtime for getting k (key,value) pairs based on the smallest k keys in a dictionary.
i.e.:
for
mynahs = {40:(1,3),5:(5,6),11:(9,2),2:(6,3),300:(4,4),15:(2,8)}
smallestK(mynahs,3)
would return:
[(2,(6,3)),(5,(5,6)),(11,(9,2))]
I've seen a few different ways to do this:
1.
mylist = list(mynahs.keys())
mylist.sort()
mylist = mylist[:k]
return [(k, mynahs[k]) for k in mylist]
but everyone seems to think heapq is the fastest
cheap = heapq.nsmallest(3, mynahs)
return [(k, mynahs[k]) for k in cheap]
How does heapq.nsmallest work and why is it fastest? I have seen this question and this one
I still don't understand. Is heapq using a minheap to get the nsmallest? How does that work? I've also heard about an algorithm called quickselect, is that what it's using?
What's the runtime of it? If the dictionary is constantly changing/updating, is calling heapq.nsmallest each time you need the nsmallest the fastest way to do that?

The code for heapq.py is available at https://svn.python.org/projects/python/trunk/Lib/heapq.py
nsmallest uses one of two algorithms. If the number of items to be returned is more than 10% of the total number of items in the heap, then it makes a copy of the list, sorts it, and returns the first k items.
If k is smaller than n/10, then it uses a heap selection algorithm:
make a copy of the first k items and sort it
for each remaining item in the original heap:
    if the item is smaller than the largest item in the new list:
        replace the largest item with the new item
        re-sort the new list
That whoever wrote this used such an inefficient algorithm is somewhat surprising. In theory at least, Quickselect, an O(n) algorithm, should be faster than sorting, and much faster than the "optimized" algorithm above when selecting more than n/10 items.
I'm not a Python guy, so I can't say for sure, but my experience with other languages indicates that the above should be true for Python as well.
Update
The implementation at https://github.com/python/cpython/blob/master/Lib/heapq.py#L395 works somewhat differently.
If the k is greater than or equal to the number of items in the list, then a sorted list containing all of the elements is returned. Otherwise, it uses a standard heap selection algorithm:
create a max-heap from the first k items
for each remaining item:
    if the item is smaller than the largest item on the heap:
        remove the largest item from the heap
        add the new item to the heap
sort the resulting heap and return it
The remove/add is combined into a single heap-replace call.
There's an optimization in there to use the standard comparator if the key is None, but it uses the same basic heap selection algorithm.
This implementation is much more efficient than the other one that I described, although I expect it to be slower than Quickselect in the general case.
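To make that concrete, here is a minimal sketch of the heap-selection strategy in plain Python. This is not the actual heapq source: it assumes numeric items (so a max-heap can be faked by negating values through the public min-heap API) and skips the key= handling and tie-breaking counter that the real implementation uses.

import heapq

def nsmallest_sketch(k, iterable):
    # Illustrative only: keep a max-heap of the k smallest items seen so far.
    it = iter(iterable)
    heap = [-x for _, x in zip(range(k), it)]  # negated copies of the first k items
    if not heap:
        return []
    heapq.heapify(heap)                        # -heap[0] is the largest candidate
    for x in it:
        if x < -heap[0]:                       # smaller than the worst candidate?
            heapq.heapreplace(heap, -x)        # evict the largest, keep the new item
    return sorted(-x for x in heap)            # the k smallest, in ascending order

print(nsmallest_sketch(3, [40, 5, 11, 2, 300, 15]))   # [2, 5, 11]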

heapq.nsmallest uses a max-heap internally ( _heapify_max )
Here is the implementation for heapq.nsmallest - https://github.com/python/cpython/blob/master/Lib/heapq.py#L395
Also look at:
http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest/
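As an aside (my own addition, not from the links above): for the dictionary in the original question you can hand the (key, value) pairs to nsmallest directly, since tuples compare by their first element and the dict keys are unique:

import heapq

mynahs = {40: (1, 3), 5: (5, 6), 11: (9, 2), 2: (6, 3), 300: (4, 4), 15: (2, 8)}

# items() yields (key, value) tuples, which sort by key first.
print(heapq.nsmallest(3, mynahs.items()))
# [(2, (6, 3)), (5, (5, 6)), (11, (9, 2))]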

Related

Is there a better way than a while loop to perform this function?

I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the i-th dish being of type D[i]. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes = 0
    pastDishes = []
    i = 0
    while i < N:
        if D[i] not in pastDishes:
            numDishes += 1
            pastDishes.append(D[i])
            if len(pastDishes) > K:
                pastDishes.pop(0)
        i += 1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was very neat and quick; however, I have now found a module with a tool that makes this much faster. It's from collections, just like deque, and it's called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes = lastMod = 0
    pastDishes = [0] * K
    for Dval in D:
        if Dval in pastDishes: continue
        pastDishes[lastMod] = Dval
        numDishes, lastMod = numDishes + 1, (lastMod + 1) % K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter
def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount = lastMod = 0
    pastDishes = [0] * K
    eatenCounts = Counter({0: K})
    for Dval in D:
        if Dval in eatenCounts: continue
        eatCount += 1
        eatenCounts[Dval] += 1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1:
            eatenCounts.pop(val)
        else:
            eatenCounts[val] -= 1
        pastDishes[lastMod] = Dval
        lastMod = (lastMod + 1) % K
    return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some explanation of what I am doing:
While loops are typically marginally faster than for loops, but since I need to access the value at an index multiple times here, I believe a for loop is actually better in this situation. You can also see that I initialised the list to the maximum size it needs to be and overwrite values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by @outis, using the modulo operator together with the index variable removes the need for an additional if statement. The Counter is essentially a special dict that maps a hashable key to an int count; I use the fact that lastMod is the index of what would normally be accessed through list.pop(0) to find the entry that needs to be removed or decremented in the Counter.
Note that it is not considered 'pythonic' to assign multiple variables on one line, but I believe it adds a slight performance boost, which is why I have done it. This can be argued, though; see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
This looks like an ordered set that has to be shrunk to a capacity restriction of K.
To meet that, whenever it grows too large (len(ordered_set) > K) we have to remove the first n items, where n = len(ordered_set) - K. Ideally the removal would run in O(1).
However, removal from a set happens in an unordered fashion, so we first transform it into a list: a list containing the unique elements in the order they appear in the original sequence.
From that ordered list we can then remove the first n elements.
For example, the function lru returns the least-recently-used items for a sequence seq, limited by a capacity limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but using the insertion order of dicts (an implementation detail in CPython 3.6, guaranteed since 3.7):
using OrderedDict explicitly
from collections import OrderedDict

def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i: 0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
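A minimal sketch of this second approach (my own illustration of the strategy described above, not a reference solution; it reuses the exercise's signature):

from collections import defaultdict
from typing import List

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # last_eaten[dish] = value of `eaten` when that dish type was last eaten;
    # -K-1 guarantees the first occurrence of any type is always eaten.
    last_eaten = defaultdict(lambda: -K - 1)
    eaten = 0
    for dish in D:
        if eaten - last_eaten[dish] > K:   # not among the last K dishes eaten
            last_eaten[dish] = eaten
            eaten += 1
    return eaten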
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()
    cnt = 0
    dish_counter = Counter()
    for d in D:
        if dish_counter[d] == 0:
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
        if len(q) == K + 1:
            remove = q.popleft()
            dish_counter[remove] -= 1
    return cnt

Python - Is there a "rank list" or should I implement one

I want to use a data structure that allows me to store up to X objects with their rankings, managing that structure with the best run-time.
Let's call it Rank_List(). Let's define X = 2, and the following should happen:
ranked_list = Rank_List()
ranked_list.add((obj1, 0.5))
print ranked_list -> [(obj1, 0.5)]
ranked_list.add((obj2, 0.75))
print ranked_list -> [(obj2, 0.75), (obj1, 0.5)]
So we can see it keeps the rank in check (0.75 is in the first place, and 0.5 is in the 2nd)
ranked_list.add((obj3, 0.7))
print ranked_list -> [(obj2, 0.75), (obj3, 0.7)]
After adding another object that is ranked higher than obj1, obj1 is cast out of the list (X = 2, so only up to 2 objects may be stored in the list).
Is there a data structure like that that already exists in python?
If not, which way should I implement it to get the best run-time results?
From the comments I understand you are looking to extract the top K elements from a sequence. In that case you don't need to keep your list sorted at all. You can use a heap queue.
A heap is a binary tree where any parent has a value smaller than any of its children (or larger, if you flip the comparison). That means the smallest (or largest) item is always at the root, and pushing or popping an item only takes O(logN) time. If you keep the heap trimmed to size K as described below, processing N incoming items takes O(NlogK) time in total, and reading the K top items back out in order takes O(KlogK).
The Python standard library includes the heapq module to do this for you.
You can either keep a heap yourself or use the heapq.nlargest function to build a heap from an iterable for you and then directly return the top K items.
To keep the K largest items in a manually kept heap, build a list of K elements first (as (priority, elem) tuples), call heapify() once it reaches that size, and from there on out use heappushpop() (or compare the new element against heap[0] and call heapreplace()) to push the next element and drop the smallest, so the heap always holds the K largest items seen so far. In the end, use sorted(heap, reverse=True) to give you those largest items in reverse sorted order (largest to smallest).
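A rough sketch of that bounded-heap idea (my own illustration, assuming the items are (obj, score) pairs as in the question and that scores rarely tie, since ties fall back to comparing the objects themselves):

import heapq

class RankList:
    """Keep only the `capacity` highest-scoring (obj, score) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, obj); the worst kept score sits at the root

    def add(self, item):
        obj, score = item
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, obj))
        else:
            # Push the new pair and drop the smallest; if the new score is lower
            # than everything already kept, it falls straight back out.
            heapq.heappushpop(self._heap, (score, obj))

    def as_list(self):
        # Highest score first, matching the question's expected output.
        return [(obj, score) for score, obj in sorted(self._heap, reverse=True)]

ranked_list = RankList(2)
ranked_list.add(("obj1", 0.5))
ranked_list.add(("obj2", 0.75))
ranked_list.add(("obj3", 0.7))
print(ranked_list.as_list())   # [('obj2', 0.75), ('obj3', 0.7)]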

How to compute word frequencies for a large word list faster in Python, as a dictionary

There is a very long word list; its length is about 360000. I want to get the frequency of each word, as a dictionary.
For example:
{'I': 50, 'good': 30,.......}
Since the word list is large, I found it takes a lot of time to compute. Do you have a faster method to accomplish this?
My code, so far, is the following:
dict_pronoun = dict([(i, lst_all_tweet_noun.count(i)) for i in lst_all_tweet_noun])
sorted(dict_pronoun)
You are doing several things wrong here:
You are building a huge list first, then turn that list object into a dictionary. There is no need to use the [..] list comprehension; just dropping the [ and ] would turn it into a much more memory-efficient generator expression.
You are using dict() with a loop instead of a {keyexpr: valueexpr for ... in ...} dictionary comprehension; this would avoid a generator expression altogether and go straight to building a dictionary.
You are using list.count(); this does a full scan of the list for every element, turning a linear counting pass over N items into an O(N**2) quadratic problem. You could simply keep a dictionary of counts and increment the entry for each word as you see it, starting it at 1 the first time it appears (see the sketch after this list), but there are better options (see below).
The sorted() call is busy-work; it returns a sorted list of the keys, which is then discarded again. You cannot sort a dictionary itself and get a dictionary back at any rate.
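For reference, a minimal sketch of that manual counting loop (variable names are illustrative; the Counter approach below is still the better option):

counts = {}
for word in lst_all_tweet_noun:
    # dict.get supplies 0 for words we have not seen yet.
    counts[word] = counts.get(word, 0) + 1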
Use a collections.Counter() object here to do your counting; it uses a linear scan:
from collections import Counter
dict_pronoun = Counter(lst_all_tweet_noun)
A Counter has a Counter.most_common() method which will efficiently give you output sorted by counts, which is what I suspect you wanted to achieve with the sorted() call.
For example, to get the top K elements (where K is smaller than N, the size of the dictionary), a heap is used internally to get you those elements in O(N log K) time, avoiding a full O(N log N) sort.
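For example (the top-10 cutoff below is just an illustration; any K works):

from collections import Counter

word_counts = Counter(lst_all_tweet_noun)
print(word_counts.most_common(10))   # the 10 most frequent words with their counts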

Python dictionary vs list, which is faster?

I was coding a Euler problem, and I ran into a question that sparked my curiosity. I have two snippets of code: one uses lists, the other uses dictionaries.
using lists:
n = 100000
num = []
suma = 0
for i in range(n, 1, -1):
    tmp = tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num.append(tmp)
        suma += i
using dictionaries:
n = 100000
num = {}
suma = 0
for i in range(n, 1, -1):
    tmp = tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num[tmp] = i
        suma += i
I am only concerned about performance. Why does the second example, which uses a dictionary, run so much faster than the first example with lists?
I tested these two snippets with n=1000000: the first one ran in 1032 seconds and the second one in just 3.3 seconds. Amazing!
In Python, the average time complexity of a dictionary key lookup is O(1), since they are implemented as hash tables. The time complexity of lookup in a list is O(n) on average. In your code, this makes a difference in the line if tmp not in num:, since in the list case, Python needs to search through the whole list to detect membership, whereas in the dict case it does not except for the absolute worst case.
For more details, check out TimeComplexity.
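A quick toy benchmark (my own addition; absolute numbers depend on your machine) that shows the membership-test difference directly:

import timeit

setup = "data = list(range(100000)); as_list = data; as_dict = dict.fromkeys(data)"
# Membership test for an element near the end: linear scan vs. hash lookup.
print(timeit.timeit("99999 in as_list", setup=setup, number=1000))
print(timeit.timeit("99999 in as_dict", setup=setup, number=1000))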
If it's about speed, you should not create any lists:
n = 100000
# use a new name so the lazy generator doesn't shadow the factors() function
pairs = ((frozenset(factors(i)), i) for i in range(2, n + 1))
num = {k: v for k, v in pairs if len(k) == 2}
suma = sum(num.values())
I am almost positive that the "magic sauce" of using a dictionary lies in the fact that it is made up of key->value pairs stored in a hash table.
With a list you're dealing with an array, which means the membership test has to start at index 0 and walk through every record.
The dictionary just has to hash the key and jump straight to the matching entry, hence the speed.
Basically, testing for membership in a set of key->value pairs is a lot quicker than searching an entire list for a value; the larger your list gets, the slower it will be. That isn't always the case, and there are scenarios where a list will be faster, but I believe this is the answer you're looking for.
In a list, the check if tmp not in num: is O(n), while for a dict it is O(1) on average.
Edit: the dict is based on hashing, so it is much quicker than a linear list search.
Thanks @user2357112 for pointing this out.

How to keep a list of lists sorted as it is created

I'm reading in a file and pulling in data that includes some strings and some numbers, in Python. I'm storing this information as lists of lists, like this:
dataList = [
    ['blah', 2, 3, 4],
    ['blahs', 6, 7, 8],
    ['blaher', 10, 11, 12],
]
I want to keep dataList sorted by the second element of each sub-list, i.e. dataList[i][1].
I thought I could use insort or bisect_right when I want to add items, but I cannot figure out how to make it look at the second element of the sub-list.
Any thoughts here? I was just appending data to the end and then doing a linear search to find things back later on. But throw a few tens of thousands of sub-lists in here, then search for 100k items, and it takes a while.
dataList.sort(key=lambda x: x[1])
This sorts the list in place, by the second element in each item.
As has been pointed out in the comments, it is much more efficient to sort just once (at the end). Python's built-in sort method has been heavily optimised to work fast. After testing it looks like the built-in sort is consistently around 3.7 times faster than using the heap method suggested in the other answer, over various size lists (I tested sizes of up to 600000).
Depends on a few things, but the first thing that comes to mind is using the heapq module:
import heapq

heap = []
for row in rows:
    heapq.heappush(heap, (row[1], row))
This would create a heap full of tuples, where the first element is the element you want to sort by, and the second element is the row.
The simplest method to read them back from the heap would be to copy it then pop items:
new_heap = list(heap)
while new_heap:
    _, row = heapq.heappop(new_heap)
    print(row)
The runtime of inserting each item into the heap is O(lg N), so creating the heap will require O(N lg N) time, and popping items from the heap also requires O(lg N) time, so O(N lg N) time will be required to traverse it.
If these tradeoffs are not ideal, you could use a binary search tree (none exist in the standard library, but they are easy to find), or as other commenters have suggested, sort the rows after reading them: rows.sort(key=lambda row: row[1]).
Now, in practice, unless you're dealing with a very large number of rows, it will almost certainly be faster to sort the list in-place after loading it (ie, using the .sort() method)… So try a few things out and see what works best.
Finally, bisect is a poor idea here: finding the insertion point is only O(lg N), but inserting into a Python list requires shifting the elements after it, which is O(N) per insert, so building the list this way takes O(N**2) time in total.
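That said, if you really do want to insert in sorted order, bisect.insort accepts a key function on Python 3.10+ (older versions need a parallel list of keys); the O(N) shifting cost per insert described above still applies:

import bisect

dataList = []
for row in [['blahs', 6, 7, 8], ['blah', 2, 3, 4], ['blaher', 10, 11, 12]]:
    # Keep dataList ordered by the second element of each row (requires Python 3.10+).
    bisect.insort(dataList, row, key=lambda r: r[1])

print(dataList)  # [['blah', 2, 3, 4], ['blahs', 6, 7, 8], ['blaher', 10, 11, 12]]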
