Retain top-N elements as we loop across all elements - python

Here is what I am trying to do. The output of a calculation on a dataframe gives a number. I use that number to rank the different dataframes and I need to retain the top-N (in the example below, the top 10 is chosen). The ranking is achieved by comparing the number to the last number of a reverse sorted list. If the current number is larger, the list is popped and the new entry added to the list followed by reverse sorting again. The following is structurally identical to what I have and it works, albeit slowly. I would appreciate any suggestions to improve its speed, efficiency or Pythonicness.
import random
import pandas as pd

def gen_df():
    return random.uniform(0.0, 1.0), pd.DataFrame()

if __name__ == '__main__':
    mylist = []
    for i in range(1000):
        val, df = gen_df()
        if len(mylist) < 10:
            mylist.append((val, df))
        else:
            mylist.sort(reverse=True)
            if mylist[-1][0] < val:
                mylist.pop()
                mylist.append((val, df))
EDIT: Reduced one sort after suggestion by zondo.

The way to speed it up is to replace your list with a min-heap of size 10. Put the first 10 frames into the heap. Then, for each item, if it's larger than the smallest item on the heap, pop the smallest item and push the new item.
I'm not a Python programmer, so I'll present the pseudocode.
heap = new min-heap
for each item
    if (heap.length < 10)
        heap.push(item)
    else if (item > heap.peek())
        heap.pop()       // remove smallest item
        heap.push(item)  // add new item
This assumes, of course, that there's a min-heap implementation that you can use. I suspect heapq will do the trick.
That's going to be significantly faster than sorting the list every time you insert a new item.
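In Python terms, here is a minimal sketch of that idea using the standard-library heapq module (the top_n helper and the tie-breaking counter are illustrative additions, not part of the original code):
import heapq
import itertools

def top_n(pairs, n=10):
    # pairs yields (val, df) tuples; keep a min-heap of the n largest vals seen so far.
    # The running counter breaks ties so two DataFrames are never compared directly.
    heap = []
    tiebreak = itertools.count()
    for val, df in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (val, next(tiebreak), df))
        elif val > heap[0][0]:
            heapq.heapreplace(heap, (val, next(tiebreak), df))  # pop smallest, push new
    return [(val, df) for val, _, df in sorted(heap, reverse=True)]
Called as top_n(gen_df() for _ in range(1000)), this does at most one O(log n) heap operation per candidate instead of a full sort.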

Remember that, in Python, lists really just hold pointers to the things they contain, so certain list operations can be quite fast even when the list contains some pretty heavy data structures (i.e. the DataFrames in your example). Your approach involves making a small list (10 items long) and constantly modifying it to stay "correct" as more DataFrames are "considered" for the top 10. That feels a bit unnecessary to me. I would just make one big list of all the candidates, sort it once, and take the first 10. Also, repeated appends are slower than filling a preallocated list by index, so it's better to allocate the memory all at once.
My guess is that for big data sets, the approach I lay out below will be a bit faster. But regardless, I find it a bit more readable.
def get_top_10_so():
    mylist = []
    for i in range(1000):
        val, df = gen_df()
        if len(mylist) < 10:
            mylist.append((val, df))
        else:
            mylist.sort(reverse=True)
            if mylist[-1][0] < val:
                mylist.pop()
                mylist.append((val, df))
    return mylist

def get_top_10_mine():
    mylist = [None] * 1000
    for i in range(1000):
        mylist[i] = gen_df()
    mylist.sort(key=lambda tup: tup[0], reverse=True)
    return mylist[:10]

Find all combinations of positive integers in increasing order that add up to a given positive number n

How to write a function that takes n (where n > 0) and returns the list of all combinations of positive integers that sum to n?
This is a common question on the web, and there are different answers provided, such as 1, 2 and 3. However, the answers provided all use two functions to solve the problem. I want to do it with a single function. Therefore, I coded it as follows:
def all_combinations_sum_to_n(n):
    from itertools import combinations_with_replacement
    combinations_list = []
    if n < 1:
        return combinations_list
    l = [i for i in range(1, n + 1)]
    for i in range(1, n + 1):
        combinations_list = combinations_list + (list(combinations_with_replacement(l, i)))
    result = [list(i) for i in combinations_list if sum(i) == n]
    result.sort()
    return result
If I pass 20 to my function which is all_combinations_sum_to_n(20), the OS of my machine kills the process as it is very costly. I think the space complexity of my function is O(n*n!). How do I modify my code so that I don't have to create any other function and yet my single function has an improved time or space complexity? I don't think it is possible by using itertools.combinations_with_replacement.
UPDATE
All answers provided by Barmar, ShadowRanger and pts are great. As I was looking for an answer efficient in terms of both memory and runtime, I used https://perfpy.com with Python 3.8 to compare them. I used six different values of n, and in all cases ShadowRanger's solution had the highest score, so I chose ShadowRanger's answer as the best one.
You've got two main problems, one causing your current problem (out of memory) and one that will continue the problem even if you solve that one:
You're accumulating all combinations before filtering, so your memory requirements are immense. You don't even need a single list if your function can be a generator (that is iterated to produce a value at a time) rather than returning a fully realized list, and even if you must return a list, you needn't generate such huge intermediate lists. You might think you need at least one list for sorting purposes, but combinations_with_replacement is already guaranteed to produce a predictable order based on the input ordering, and since range is ordered, the values produced will be ordered as well.
Even if you solve the memory problem, the computational cost of just generating that many combinations is prohibitive, due to poor scaling; for the memory, but not CPU, optimized version of the code below, it handles an input of 11 in 0.2 seconds, 12 in ~2.6 seconds, and 13 in ~11 seconds; at that scaling rate, 20 is going to approach heat death of the universe timeframes.
Barmar's answer is one solution to both problems, but it's still doing work eagerly and storing the complete results when the complete work might not even be needed, and it involves sorting and deduplication, which aren't strictly necessary if you're sufficiently careful about how you generate the results.
This answer will fix both problems, first the memory issue, then the speed issue, without ever needing memory requirements above linear in n.
Solving the memory issue alone actually makes for simpler code that still uses your same basic strategy, but without consuming all your RAM needlessly. The trick is to write a generator function that avoids storing more than one result at a time (the caller can listify it if they know the output is small enough and they actually need it all at once, but typically, just looping over the generator is better):
from collections import deque  # Cheap way to just print the last few elements
from itertools import combinations_with_replacement  # Imports should be at top of file, not repeated per call

def all_combinations_sum_to_n(n):
    for i in range(1, n + 1):  # For each possible length of combinations...
        # For each combination of that length...
        for comb in combinations_with_replacement(range(1, n + 1), i):
            if sum(comb) == n:  # If the sum matches...
                yield list(comb)  # yield the combination

# 13 is the largest input that will complete in some vaguely reasonable time, ~10 seconds on TIO
print(*deque(all_combinations_sum_to_n(13), maxlen=10), sep='\n')
Try it online!
Again, to be clear, this will not complete in any reasonable amount of time for an input of 20; there's just too much redundant work being done, and the growth pattern for combinations scales with the factorial of the input; you must be more clever algorithmically. But for less intense problems, this pattern is simpler, faster, and dramatically more memory-efficient than a solution that builds up enormous lists and concatenates them.
To solve in a reasonable period of time, using the same generator-based approach (but without itertools, which isn't practical here because you can't tell it to skip over combinations when you know they're useless), here's an adaptation of Barmar's answer that requires no sorting, produces no duplicates, and as a result, can produce the solution set in less than a 20th of a second, even for n = 20:
def all_combinations_sum_to_n(n, *, max_seen=1):
    for i in range(max_seen, n // 2 + 1):
        for subtup in all_combinations_sum_to_n(n - i, max_seen=i):
            yield (i,) + subtup
    yield (n,)

for x in all_combinations_sum_to_n(20):
    print(x)
Try it online!
That not only produces the individual tuples with internally sorted order (1 is always before 2), but produces the sequence of tuples in sorted order (so looping over sorted(all_combinations_sum_to_n(20)) is equivalent to looping over all_combinations_sum_to_n(20) directly, the latter just avoids the temporary list and a no-op sorting pass).
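A quick way to check that claim (assuming the generator defined just above is in scope):
results = list(all_combinations_sum_to_n(10))
assert results == sorted(results)  # already emitted in sorted order; no extra sort needed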
Use recursion instead of generating all combinations and then filtering them.
def all_combinations_sum_to_n(n):
    combinations_set = set()
    for i in range(1, n):
        for sublist in all_combinations_sum_to_n(n - i):
            combinations_set.add(tuple(sorted((i,) + sublist)))
    combinations_set.add((n,))
    return sorted(combinations_set)
I had a simpler solution that didn't use sorted() and put the results in a list, but it would produce duplicates that just differed in order, e.g. [1, 1, 2] and [1, 2, 1] when n == 4. I added those to get rid of duplicates.
On my MacBook M1 all_combinations_sum_to_n(20) completes in about 0.5 seconds.
Here is a fast iterative solution:
def csum(n):
    s = [[None] * (k + 1) for k in range(n + 1)]
    for k in range(1, n + 1):
        for m in range(k, 0, -1):
            s[k][m] = [[f] + terms
                       for f in range(m, (k >> 1) + 1) for terms in s[k - f][f]]
            s[k][m].append([k])
    return s[n][1]

import sys
n = 5
if len(sys.argv) > 1:
    n = int(sys.argv[1])
for terms in csum(n):
    print(' '.join(map(str, terms)))
Explanation:
Let's define terms as a non-empty, increasing (it can contain the same value multiple times) list of positive integers.
The solution for n is a list of all terms in increasing lexicographical order, where the sum of each terms is n.
s[k][m] is a list of all terms in increasing lexicographical order, where the sum of each terms is k, and the first (smallest) integer in each terms is at least m.
The solution is s[n][1]. Before returning this solution, the csum function populates the s array using iterative dynamic programming.
In the inner loop, the following recursion is used: each term in s[k][m] either has at least 2 elements (f and the rest) or it has 1 element (k). In the former case, the rest is a terms, where the sum is k - f and the smallest integer is f, thus it is from s[k - f][f].
This solution is a lot faster than Barmar's solution if n is at least 20. For example, on my Linux laptop for n = 25, it's about 400 times faster, and for n = 28, it's about 3700 times faster. For larger values of n, it gets much faster really quickly.
This solution uses more memory than ShadowRanger's solution, because this solution creates lots of temporary lists, and it uses all of them until the very end.
How to come up with such a fast solution?
Try to find a recursive formula. (Don't write code yet!)
Sometimes recursion works only with recursive functions with multiple variables. (In our case, s[k][m] is the recursive function, k is the obvious variable implied by the problem, and m is the extra variable we had to invent.) So try to add a few variables to the recursive formula: add the minimum number of variables to make it work.
Write your code so that it computes each recursive function value exactly once (not more). For that, you may add a cache (memoization) to your recursive function, or you may use dynamic programming, i.e. populating a (multidimensional) array in the correct order, so that what is needed is already populated.
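As an illustration of that last point (my own sketch, not code from the answers above): cache the fully realized tuples keyed on (k, m) with functools.lru_cache, so each subproblem is computed exactly once.
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(k, m=1):
    # All tuples of integers >= m, in non-decreasing order, summing to k.
    # Returns tuples rather than a generator so cached results can be reused.
    result = []
    for f in range(m, k // 2 + 1):
        result.extend((f,) + rest for rest in partitions(k - f, f))
    result.append((k,))
    return tuple(result)

print(len(partitions(20)))  # 627 partitions of 20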

How to find all unique combinations of k-sized tuples using a single element from each of n lists

Given a list containing N sublists of various lengths, find all unique combinations of size k, selecting at most one element from any given sublist.
The order of the elements in the combination is not relevant: (a, b) = (b, a)
sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]
expected_output = [
    ('B1', 'T1'), ('B1', 'T2'), ('B1', 'L1'), ('B1', 'L2'), ('B1', 'L3'), ('B1', 'L4'),
    ('B2', 'T1'), ('B2', 'T2'), ('B2', 'L1'), ('B2', 'L2'), ('B2', 'L3'), ('B2', 'L4'),
    ('B3', 'T1'), ('B3', 'T2'), ('B3', 'L1'), ('B3', 'L2'), ('B3', 'L3'), ('B3', 'L4'),
    ('T1', 'L1'), ('T1', 'L2'), ('T1', 'L3'), ('T1', 'L4'),
    ('T2', 'L1'), ('T2', 'L2'), ('T2', 'L3'), ('T2', 'L4'),
]
Extra points for a pythonic way of doing it
Speed/efficiency matters; the idea is to use this on a list with hundreds of sublists ranging from 5 to 50 in length.
What I have been able to accomplish so far:
Using for and while loops to move pointers and build the answer; however, I am having a hard time figuring out how to include the k parameter to set the size of the tuple combinations dynamically. (Not really happy about it.)
def build_combinations(lst):
    result = []
    count_of_lst = len(lst)
    for i, sublist in enumerate(lst):
        if i == count_of_lst - 1:
            continue
        else:
            for item in sublist:
                j = 0
                while i < len(lst) - 1:
                    while j <= len(lst[i+1]) - 1:
                        comb = (item, lst[i+1][j])
                        result.append(comb)
                        j = j + 1
                    i = i + 1
                    j = 0
                i = 0
    return result
I've seen many similar questions on Stack Overflow, but none of them addressed the parameters the way I am trying to (one item from each list, and the size of the combinations being a parameter of the function).
I tried using itertools combinations, product and permutations, and flipping them around, without success. Whenever I use itertools I either have a hard time using only one item from each list, or cannot set the size of the tuples I need.
I tried NumPy using arrays and a more math/matrix approach, but didn't get too far. There's definitely a way of solving this with NumPy, which is why I tagged numpy as well.
You need to combine two itertools helpers, combinations to select the two unique ordered lists to use, then product to combine the elements of the two:
from itertools import combinations, product

sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]

expected_output = [pair
                   for lists in combinations(sample_list, sample_k)
                   for pair in product(*lists)]
print(expected_output)
Try it online!
If you want to get really fancy/clever/ugly, you can push all the work down to the C layer with:
from itertools import combinations, product, starmap, chain

sample_k = 2
sample_list = [['B1','B2','B3'], ['T1','T2'], ['L1','L2','L3','L4']]

expected_output = list(chain.from_iterable(starmap(product, combinations(sample_list, sample_k))))
print(expected_output)
That will almost certainly run meaningfully faster for huge inputs (especially if you can loop the results from chain.from_iterable directly rather than realizing them as a list), but it's probably not worth the ugliness unless you're really tight for cycles (I wouldn't expect much more than a 10% speed-up, but you'd need to benchmark to be sure).
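If you do want to measure it, a rough harness along these lines could be used (the input size, helper names and timings here are illustrative, not from the answer above):
from itertools import combinations, product, starmap, chain
from timeit import timeit

sample_k = 2
# A larger input than the example above, purely to make the timings measurable.
big_list = [[f'{c}{i}' for i in range(20)] for c in 'ABCDEFGHIJ']

def listcomp():
    return [pair
            for lists in combinations(big_list, sample_k)
            for pair in product(*lists)]

def c_layer():
    return list(chain.from_iterable(starmap(product, combinations(big_list, sample_k))))

print('list comprehension:', timeit(listcomp, number=200))
print('chain/starmap     :', timeit(c_layer, number=200))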

Merging n/k sorted lists in nlog(n/k) - python

I have round(n/k) sorted sublists, meaning that the length of each sublist is k (plus possibly a single sublist of length less than k). I need to merge them into a single n-length sorted list using the classic O(m+n) merge function, so the whole thing should take O(n*log(n/k)).
I had two implementations: one with recursion (which seems right, but wouldn't work unless I changed the recursion limit, which I am not allowed to do, and I don't actually understand why that's needed, since the input list has no more than 10 sublists, each of length k=3):
def merge_sorted_blocks(lst):
    i = 0
    pairs_lst = []
    n = len(lst)
    while i < n - 1:
        pairs_lst.append(merge(lst[i], lst[i+1]))
        i += 2
    if n % 2 > 0:
        pairs_lst.append(lst[n-1])
    if type(pairs_lst[0]) != list:
        return pairs_lst
    return merge_sorted_blocks(pairs_lst)
and one that consecutively merges the output list with the next sublist:
def merge_sorted_blocks(lst):
    pairs_lst = []
    for i in lst:
        pairs_lst = merge(pairs_lst, i)
    return pairs_lst
but I don't think it has the desired complexity; it looks more like O(k + 2k + 3k + ... + n) = O(n^2/k), i.e. quadratic in n.
I found this thread which suggests it does but I don't understand how:
https://math.stackexchange.com/questions/881599/on-log-k-for-merging-of-k-lists-with-total-of-n-elements
Is there something I'm missing, regarding each of these solutions?
For the second algorithm, your computation has a flaw. Moreover, the thread you mention differs in some details from your question.
Say you have k sublists, each of size n/k (note that this swaps the roles of k and n/k relative to your notation). Since the complexity of the merge function for two lists of sizes n1 and n2 is O(n1 + n2), the first merge of two sublists costs O(2 * n/k), merging the result with the third sublist costs O(3 * n/k), and so on. Hence, the complexity of the second algorithm is O(2*(n/k) + 3*(n/k) + ... + k*(n/k)) = O(nk).
For the first implementation, some details are missing. For example, if only one list remains (as in the last step), the loop fails.
In addition, the complexity analysis for the first algorithm is not accurate. If you implement the referenced pairwise merging correctly, the complexity is O(n log k) for k sublists of size n/k (equivalently, O(n log(n/k)) in your original notation).
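For reference, here is a sketch of that pairwise (divide-and-conquer) strategy, using the standard library's heapq.merge in place of your own merge function (the function body is mine, not from the question):
from heapq import merge  # lazily merges already-sorted iterables

def merge_sorted_blocks(blocks):
    # Merge blocks pairwise; each pass halves the number of blocks, so with b
    # blocks holding n elements in total the cost is O(n log b).
    blocks = list(blocks)
    while len(blocks) > 1:
        merged = [list(merge(blocks[i], blocks[i + 1]))
                  for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:
            merged.append(blocks[-1])
        blocks = merged
    return blocks[0] if blocks else []

print(merge_sorted_blocks([[3, 7], [1, 9], [2, 5], [4, 6], [8]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]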

Python quicksort - List comprehension vs Recursion (partition routine)

I watched the talk Three Beautiful Quicksorts and was messing around with quicksort. My implementation in Python was very similar to C (select a pivot, partition around it, and recurse over the smaller and larger partitions), which I thought wasn't Pythonic.
So this is the implementation using list comprehension in python.
def qsort(list):
    if list == []:
        return []
    pivot = list[0]
    l = qsort([x for x in list[1:] if x < pivot])
    u = qsort([x for x in list[1:] if x >= pivot])
    return l + [pivot] + u
Let's call the recursion method qsortR. Now I noticed that qsortR runs much slower than qsort for large(r) lists. In fact, I get "maximum recursion depth exceeded in cmp" even for 1000 elements with the recursion method, so I raised the limit with sys.setrecursionlimit.
Some numbers:
list-compr 1000 elems 0.491770029068
recursion 1000 elems 2.24620914459
list-compr 2000 elems 0.992327928543
recursion 2000 elems 7.72630095482
All the code is here.
I have a couple of questions:
Why is list comprehension so much faster?
Some enlightenment on the limit on recursion in Python. I first set it to 100000; in what cases should I be careful?
(What exactly is meant by 'optimizing tail recursion', how is it done?)
Trying to sort 1000000 elements hogged memory of my laptop (with the recursion method). What should I do if I want to sort so many elements? What kind of optimizations are possible?
Why is list comprehension so much faster?
Because a list comprehension runs its loop in C, which is much faster than Python's general-purpose for loop.
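A rough way to see that difference for yourself (my illustration; exact numbers depend on the machine and Python version):
from timeit import timeit

setup = "data = list(range(10000))"

listcomp = "[x * 2 for x in data]"
forloop = """
result = []
for x in data:
    result.append(x * 2)
"""

print('list comprehension:', timeit(listcomp, setup=setup, number=1000))
print('explicit for loop :', timeit(forloop, setup=setup, number=1000))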
Some enlightenment on the limit on recursion in Python. I first set it to 100000; in what cases should I be careful?
In case you run out of memory.
Trying to sort 1000000 elements hogged memory of my laptop (with the recursion method). What should I do if I want to sort so many elements? What kind of optimizations are possible?
Python's recursion carries such an overhead because every function call allocates a new stack frame.
In general, iteration is the answer (it will give better performance in the vast majority of use cases).
As for memory structures: if you have simple data, like chars, integers or floats, use the built-in array.array, which is much more memory-efficient than a list.
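For example (my illustration of the array.array point; the exact numbers vary by platform):
import array
import sys

nums = list(range(1000000))
arr = array.array('l', nums)  # 'l' = signed long, stored as raw machine values

# sys.getsizeof reports only the container; the list additionally keeps a
# separate boxed int object per element, while the array stores values inline.
print('list container :', sys.getsizeof(nums), 'bytes')
print('array container:', sys.getsizeof(arr), 'bytes')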
Have you tried writing a non-recursive implementation of partition? I suspect that the performance difference is purely the partition implementation. You are recursing for each element in your implementation.
Update
Here's a quick implementation. It is still not super fast or even efficient, but it is much better than your original recursive one.
>>> def partition(data):
...     pivot = data[0]
...     less, equal, greater = [], [], []
...     for elm in data:
...         if elm < pivot:
...             less.append(elm)
...         elif elm > pivot:
...             greater.append(elm)
...         else:
...             equal.append(elm)
...     return less, equal, greater
...
>>> def qsort2(data):
...     if data:
...         less, equal, greater = partition(data)
...         return qsort2(less) + equal + qsort2(greater)
...     return data
...
I also think that there are a larger number of temporary lists generated in the "traditional" version.
Try comparing the list comprehension to an in-place algorithm when the memory gets really big. The code below gets comparable execution times when sorting 100K integers, but you will probably get stuck with the list-comprehension solution when sorting 1M integers. I made the tests on a 4 GB machine. The full code: http://snipt.org/Aaaje2
class QSort:
    def __init__(self, lst):
        self.lst = lst

    def sorted(self):
        self.qsort_swap(0, len(self.lst))
        return self.lst

    def qsort_swap(self, begin, end):
        if (end - begin) > 1:
            pivot = self.lst[begin]
            l = begin + 1
            r = end
            while l < r:
                if self.lst[l] <= pivot:
                    l += 1
                else:
                    r -= 1
                    self.lst[l], self.lst[r] = self.lst[r], self.lst[l]
            l -= 1
            self.lst[begin], self.lst[l] = self.lst[l], self.lst[begin]
            # print begin, end, self.lst
            self.qsort_swap(begin, l)
            self.qsort_swap(r, end)

How can I merge two lists and sort them working in 'linear' time?

I have this, and it works:
# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.
def linear_merge(list1, list2):
    finalList = []
    for item in list1:
        finalList.append(item)
    for item in list2:
        finalList.append(item)
    finalList.sort()
    return finalList
    # +++your code here+++
    return
But, I'd really like to learn this stuff well. :) What does 'linear' time mean?
Linear means O(n) in big-O notation, while your code uses a sort(), which is most likely O(n log n).
The question is asking for the standard merge algorithm. A simple Python implementation would be:
def merge(l, m):
    result = []
    i = j = 0
    total = len(l) + len(m)
    while len(result) != total:
        if len(l) == i:
            result += m[j:]
            break
        elif len(m) == j:
            result += l[i:]
            break
        elif l[i] < m[j]:
            result.append(l[i])
            i += 1
        else:
            result.append(m[j])
            j += 1
    return result
>>> merge([1,2,6,7], [1,3,5,9])
[1, 1, 2, 3, 5, 6, 7, 9]
Linear time means that the time taken is bounded by some unspecified constant times (in this context) the number of items in the two lists you want to merge. Your approach doesn't achieve this - it takes O(n log n) time.
When specifying how long an algorithm takes in terms of the problem size, we ignore details like how fast the machine is, which basically means we ignore all the constant terms. We use "asymptotic notation" for that. These basically describe the shape of the curve you would plot in a graph of problem size in x against time taken in y. The logic is that a bad curve (one that gets steeper quickly) will always lead to a slower execution time if the problem is big enough. It may be faster on a very small problem (depending on the constants, which probably depends on the machine) but for small problems the execution time isn't generally a big issue anyway.
The "big O" specifies an upper bound on execution time. There are related notations for average execution time and lower bounds, but "big O" is the one that gets all the attention.
O(1) is constant time - the problem size doesn't matter.
O(log n) is a quite shallow curve - the time increases a bit as the problem gets bigger.
O(n) is linear time - each unit increase means it takes a roughly constant amount of extra time. The graph is (roughly) a straight line.
O(n log n) curves upwards more steeply as the problem gets more complex, but not by very much. This is the best that a general-purpose sorting algorithm can do.
O(n squared) curves upwards a lot more steeply as the problem gets more complex. This is typical for slower sorting algorithms like bubble sort.
The nastiest algorithms are classified as "np-hard" or "np-complete" where the "np" means "non-polynomial" - the curve gets steeper quicker than any polynomial. Exponential time is bad, but some are even worse. These kinds of things are still done, but only for very small problems.
EDIT the last paragraph is wrong, as indicated by the comment. I do have some holes in my algorithm theory, and clearly it's time I checked the things I thought I had figured out. In the mean time, I'm not quite sure how to correct that paragraph, so just be warned.
For your merging problem, consider that your two input lists are already sorted. The smallest item from your output must be the smallest item from one of your inputs. Get the first item from both and compare the two, and put the smallest in your output. Put the largest back where it came from. You have done a constant amount of work and you have handled one item. Repeat until both lists are exhausted.
Some details... First, putting the item back in the list just to pull it back out again is obviously silly, but it makes the explanation easier. Next - one input list will be exhausted before the other, so you need to cope with that (basically just empty out the rest of the other list and add it to the output). Finally - you don't actually have to remove items from the input lists - again, that's just the explanation. You can just step through them.
Linear time means that the runtime of the program is proportional to the length of the input. In this case the input consists of two lists. If the lists are twice as long, then the program will run approximately twice as long. Technically, we say that the algorithm should be O(n), where n is the size of the input (in this case the length of the two input lists combined).
This appears to be homework, so I will not supply you with an answer. Even if this is not homework, I am of the opinion that you will be best served by taking a pen and a piece of paper, constructing two smallish example lists which are sorted, and figuring out how you would merge those two lists by hand. Once you have figured that out, implementing the algorithm is a piece of cake.
(If all goes well, you will notice that you need to iterate over each list only once, in a single direction. That means that the algorithm is indeed linear. Good luck!)
If you build the result in reverse sorted order, you can use pop() and still be O(N)
pop() from the right end of the list does not require shifting the elements, so is O(1)
Reversing the list before we return it is O(N)
>>> def merge(l, r):
...     result = []
...     while l and r:
...         if l[-1] > r[-1]:
...             result.append(l.pop())
...         else:
...             result.append(r.pop())
...     result += (l + r)[::-1]
...     result.reverse()
...     return result
...
>>> merge([1,2,6,7], [1,3,5,9])
[1, 1, 2, 3, 5, 6, 7, 9]
This thread contains various implementations of a linear-time merge algorithm. Note that for practical purposes, you would use heapq.merge.
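For example, heapq.merge lazily yields the merged output of already-sorted inputs without building intermediate lists:
from heapq import merge

print(list(merge([1, 2, 6, 7], [1, 3, 5, 9])))
# [1, 1, 2, 3, 5, 6, 7, 9]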
Linear time means O(n) complexity. You can read about algorithm complexity and big-O notation here: http://en.wikipedia.org/wiki/Big_O_notation .
You shouldn't combine the lists and then sort the result; try to merge them gradually - adding an element, keeping the result sorted, then adding the next element... that should give you some ideas.
A simpler version which will require equal sized lists:
def merge_sort(L1, L2):
    res = []
    for i in range(len(L1)):
        if L1[i] < L2[i]:
            first = L1[i]
            second = L2[i]
        else:
            first = L2[i]
            second = L1[i]
        res.extend([first, second])
    return res
itertoolz provides an efficient implementation to merge two sorted lists
https://toolz.readthedocs.io/en/latest/_modules/toolz/itertoolz.html#merge_sorted
'Linear time' means that time is an O(n) function, where n is the number of input items (items in the lists).
f(n) = O(n) means that there exist constants x and y such that x * n <= f(n) <= y * n.
def linear_merge(list1, list2):
    finalList = []
    i = 0
    j = 0
    while i < len(list1):
        if j < len(list2):
            if list1[i] < list2[j]:
                finalList.append(list1[i])
                i += 1
            else:
                finalList.append(list2[j])
                j += 1
        else:
            finalList.append(list1[i])
            i += 1
    while j < len(list2):
        finalList.append(list2[j])
        j += 1
    return finalList
