compute suffix maximums using itertools.accumulate - python

What's the recommended way to compute the suffix maximums of a sequence of integers?
Following is the brute-force approach (O(n**2) time), based on the problem definition:
>>> A
[9, 9, 4, 3, 6]
>>> [max(A[i:]) for i in range(len(A))]
[9, 9, 6, 6, 6]
One O(n) approach using itertools.accumulate() is the following, which uses two list constructors:
>>> A
[9, 9, 4, 3, 6]
>>> import itertools
>>> list(reversed(list(itertools.accumulate(reversed(A), max))))
[9, 9, 6, 6, 6]
Is there a more pythonic way to do this?

Slice-reversal makes things more concise and less nested:
list(itertools.accumulate(A[::-1], max))[::-1]
It's still something you'd want to bundle up into a function, though:
from itertools import accumulate
def suffix_maximums(l):
    return list(accumulate(l[::-1], max))[::-1]
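For example, with the same list as in the question:
>>> suffix_maximums([9, 9, 4, 3, 6])
[9, 9, 6, 6, 6]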
If you're using NumPy, you'd want numpy.maximum.accumulate:
import numpy
def numpy_suffix_maximums(array):
    return numpy.maximum.accumulate(array[::-1])[::-1]
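And the same check for the NumPy version (note that it returns a NumPy array):
>>> numpy_suffix_maximums(numpy.array([9, 9, 4, 3, 6]))
array([9, 9, 6, 6, 6])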

Personally when I think "Pythonic" I think "simple and easy-to-read", so here's my Pythonic version:
def suffix_max(a_list):
    last_max = a_list[-1]
    maxes = []
    for n in reversed(a_list):
        last_max = max(n, last_max)
        maxes.append(last_max)
    return list(reversed(maxes))
For what it's worth, this looks to be about 50% slower than the itertools.accumulate approach, but we're talking 25ms vs 17ms for a list of 100,000 ints, so it may not matter much.
If speed is the utmost concern and the range of numbers you expect to see is significantly smaller than the length of the lists you're working with, it might be worth using run-length encoding (RLE):
def suffix_max_rle(a_list):
    last_max = a_list[-1]
    count = 1
    max_counts = []
    for n in a_list[-2::-1]:
        if n <= last_max:
            count += 1
        else:
            max_counts.append([last_max, count])
            last_max = n
            count = 1
    max_counts.append([last_max, count])  # flush the final run
    return list(reversed(max_counts))
This is about 4 times faster than the above, and about 2.5 times faster than the itertools approach, for a list of 100,000 ints in the range 0-10000. Provided, again, that your range of numbers is significantly smaller than the length of your lists, it will take less memory, too.
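To make the run-length encoded output concrete, here is a small decoding helper (illustrative, not part of the original approach) that expands it back into the plain suffix-maximum list:
def expand_rle(max_counts):
    # turn [[value, run_length], ...] back into the flat suffix-maximum list
    out = []
    for value, count in max_counts:
        out.extend([value] * count)
    return out

print(suffix_max_rle([9, 9, 4, 3, 6]))              # [[9, 2], [6, 3]]
print(expand_rle(suffix_max_rle([9, 9, 4, 3, 6])))  # [9, 9, 6, 6, 6]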

Related

Best way to directly generate a sublist from a python iterator

It is easy to convert an entire iterator sequence into a list using list(iterator), but what is the best/fastest way to create a sublist directly from an iterator, i.e. how to best get the equivalent of list(iterator)[m:n] without first creating the entire list?
It seems obvious that it should not* (at least not always) be possible to do so directly for m > 0, but it should be for n less than the length of the sequence. [p for i,p in zip(range(n), iterator)] comes to mind, but is that the best way?
The context is simple: Creating the entire list would cause a RAM overflow, so it needs to be broken down. So how do you do this efficiently and/or python-ic-ly?
*The list comprehension I mentioned could obviously be used for m > 0 by calling next(iterator) m times prior to execution, but I don't enjoy the lack of python-ness here.
itertools.islice:
from itertools import islice
itr = (i for i in range(10))
m, n = 3, 8
result = list(islice(itr, m, n))
print(result)
# [3, 4, 5, 6, 7]
In addition, you can add an argument as the step if you wanted:
itr = (i for i in range(10))
m, n, step = 3, 8, 2
result = list(islice(itr, m, n, step))
print(result)
# [3, 5, 7]
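Since the original motivation was avoiding a RAM overflow, it is worth noting that islice advances the iterator lazily and never builds the full sequence. A small illustration (the generator here is just made up for the example):
from itertools import islice

huge = (i * i for i in range(10**12))            # never materialized as a list
chunk = list(islice(huge, 1_000_000, 1_000_003))
print(chunk)
# [1000000000000, 1000002000001, 1000004000004]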

Iterating over an infinite container with a stop point in python

Suppose I want to generate a list of the set of positive integers whose square is less than 100 in python.
My initial thinking was to do something like this
from itertools import count
numbers = [x for x in count() if x**2<100]
However this code won't complete, as python goes through infinitely many numbers.
There are two solutions to this as far as I can tell:
Put in a bound on the range of integers to go over, so use range(1000) instead of count() above.
Use a while loop, incrementing x and stopping the loop when x squared is greater than or equal to 100.
Neither of these solutions is elegant or (as far as I can tell) Pythonic. Is there a good way to handle cases like this, when iterating over an infinite container where you know the loop stops at some point?
Thanks!
This would be a use case for itertools.takewhile:
from itertools import count, islice, takewhile
numbers = list(takewhile(lambda x: x**2 < 100, count()))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
For a slice of known bounds from an (infinite) generator (not all of them have such simple finite equivalents as range), use itertools.islice:
numbers = list(islice(count(), 4, 10, 2))
# [4, 6, 8]
That being said, and with a loop-based approach being perfectly legit as well, there is a hacky way of achieving it in one line without any library tools, by causing an "artificial" StopIteration:
numbers = list(x if x**2<100 else next(iter([])) for x in count())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Don't do this in serious code though - not just because it will stop working in Python 3.7 ;-)
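If you want the same short-circuiting behavior without library tools or the StopIteration hack, a plain generator function works too (a sketch, not from the answer above):
from itertools import count

def squares_below(limit):
    # explicit-generator equivalent of takewhile(lambda x: x**2 < limit, count())
    for x in count():
        if x**2 >= limit:
            return
        yield x

print(list(squares_below(100)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]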
I'd use takewhile:
from itertools import count, takewhile
nums = takewhile(lambda x: x**2 < 100, count())
Literally, take while the square is less than 100
If you were to use range, I would write it like this.
import math
[x for x in range(int(math.sqrt(100))+1) if (x**2) < 100]
You can change 100 to whatever you like, or make it a variable, but you know the list of numbers will end at the square root of your target number.
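Wrapped as a small function, an illustrative variant of the snippet above looks like this (for very large limits, math.isqrt would give an exact bound):
import math

def squares_below(limit):
    # int(math.sqrt(limit)) + 1 is a sufficient upper bound, so no infinite iterator is needed
    return [x for x in range(int(math.sqrt(limit)) + 1) if x**2 < limit]

print(squares_below(100))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]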

Python heapq: Split and merge into a ordered heapq

I want to split two heapqs (used as priority queues), then add them together and have the resulting heapq ordered with respect to both of the previous heapqs.
Is this possible in python?
My current code:
population = []
for i in range(0, 6):
    heappush(population, i)

new_population = []
for i in range(4, 9):
    heappush(new_population, i)

split_index = len(population) // 2
temp_population = population[:split_index]
population = new_population[:split_index] + temp_population
print(population)
print(heappop(population))
Output:
[4, 5, 6, 0, 1, 2]
4
Wanted output:
[0, 1, 2, 4, 5, 6]
0
Use nlargest instead of slicing, then reheapify the combined lists.
from heapq import nlargest, heapify, heappop

n = len(population) // 2
population = nlargest(n, population) + nlargest(n, new_population)
heapify(population)  # heapify works in place and returns None
print(heappop(population))
You may want to benchmark, though, whether sorting the two original lists and then merging the results is faster. Python's sort routine is fast for nearly sorted lists, and this may impose a lot less overhead than the heapq functions. The last heapify step may not be necessary if you don't actually need a priority queue (since you are sorting them anyway).
from itertools import islice
from heapq import merge, heapify

n = len(population)  # == len(new_population), presumably
population = list(islice(merge(sorted(population), sorted(new_population)), n))
heapify(population)  # optional, if you still need a heap rather than a sorted list
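As a quick sanity check (mine, not from the original answer), the merge-based version applied to the question's data gives 0 back as the first popped element:
from itertools import islice
from heapq import merge, heapify, heappop

population = [0, 1, 2, 3, 4, 5]   # the heap built by the question's first loop
new_population = [4, 5, 6, 7, 8]  # the heap built by the second loop

n = len(population)
merged = list(islice(merge(sorted(population), sorted(new_population)), n))
heapify(merged)
print(merged)           # [0, 1, 2, 3, 4, 4]
print(heappop(merged))  # 0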

Return tuple of any two items in list, which if summed equal given int

For example, assume a given list of ints:
int_list = list(range(-10,10))
[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
What is the most efficient way to find if any given two values in int_list sum to equal a given int, say 2?
I was asked this in a technical phone interview this morning on how to efficiently handle this scenario with an int_list of say, 100 million items (I rambled and had no good answer :/).
My first idea was:
from itertools import combinations
int_list = list(range(-10,10))
combo_list = list(combinations(int_list, 2))
desired_int = 4
filtered_tuples = list(filter(lambda x: sum(x) == desired_int, combo_list))
filtered_tuples
[(-5, 9), (-4, 8), (-3, 7), (-2, 6), (-1, 5), (0, 4), (1, 3)]
Which doesn't even work with a range of only range(-10000, 10000)
Also, does anyone know of a good online Python performance testing tool?
For any integer A there is at most one integer B such that A + B equals a given integer N. It seems easier to go through the list, do the arithmetic, and do a membership test to see if B is in the set.
int_list = set(range(-500000, 500000))
TARGET_NUM = 2

def filter_tuples(int_list, target):
    for int_ in int_list:
        other_num = target - int_
        if other_num in int_list:
            yield (int_, other_num)

filtered_tuples = filter_tuples(int_list, TARGET_NUM)
Note that this will duplicate results. E.g. (-2, 4) is a separate response from (4, -2). You can fix this by changing your function:
def filter_tuples(int_list, target):
    candidates = set(int_list)  # work on a copy so removals don't break iteration
    for int_ in int_list:
        if int_ not in candidates:
            continue
        other_num = target - int_
        if other_num in candidates:
            candidates.remove(int_)
            candidates.discard(other_num)  # discard in case int_ == other_num
            yield (int_, other_num)
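A quick usage check of the adjusted generator (the orientation of each pair depends on set iteration order, so only membership is guaranteed):
pairs = list(filter_tuples(set(range(-10, 10)), 2))
# each complementary pair such as (-4, 6) / (6, -4) now shows up only once;
# note (1, 1) still slips through even though 1 appears only once in the set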
EDIT: See my other answer for an even better approach (with caveats).
What is the most efficient way to find if any given two values in int_list sum to equal a given int, say 2?
My first inclination was to do it with the itertools module's combinations and the short-circuiting power of any, but it could be quite a bit slower than Adam's approach:
>>> import itertools
>>> int_list = list(range(-10,10))
>>> any(i + j == 2 for i, j in itertools.combinations(int_list, 2))
True
Seems to be fairly responsive for larger ranges:
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-10000,10000), 2))
True
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-1000000,1000000), 2))
True
Takes about 10 seconds on my machine:
>>> any(i + j == 2 for i, j in itertools.combinations(xrange(-10000000,10000000), 2))
True
A more literal approach using math:
Assume a given list of ints:
int_list = list(range(-10,10)) ... [-10, -9, -8, -7, -6, -5, -4, -3, -2,
-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
What is the most efficient way to find if any given two values in
int_list sum to equal a given int, say 2? ... how to efficiently
handle this scenario with an int_list of say, 100 million items.
From this we can deduce that the requirement takes a single parameter, n, for the range of integers of the form range(-n, n), meaning every integer from negative n up to but not including positive n. From there the requirement is simply whether some number, x, is the sum of two different integers in that range.
Any such range trivially contains pairs summing to every integer from -2n+1 up to 2n-3, so it's a waste of computing power to actually search for one; a direct bounds check answers the question:
def x_is_sum_of_2_diff_numbers_in_range(x, n):
    if isinstance(x, int) and isinstance(n, int):
        return -(n*2) < x < (n - 1)*2
    else:
        raise ValueError('args x and n must be ints')
Computes nearly instantly:
>>> x_is_sum_of_2_diff_numbers_in_range(2, 1000000000000000000000000000)
True
Testing the edge-cases:
def main():
    print(x_is_sum_of_2_diff_numbers_in_range(x=5, n=4))   # True
    print(x_is_sum_of_2_diff_numbers_in_range(x=6, n=4))   # False
    print(x_is_sum_of_2_diff_numbers_in_range(x=-7, n=4))  # True
    print(x_is_sum_of_2_diff_numbers_in_range(x=-8, n=4))  # False
EDIT:
Since a more generalized version of this problem (where the list could contain arbitrary numbers) is a common one, I can see why some people have a preconceived approach to it, but I still stand by my interpretation of this question's requirements, and I consider this answer the best approach for this more specific case.
I would have thought that any solution that depends on a doubly nested iteration over the list (albeit having the inner loop concealed by a nifty Python function) is O(n^2).
It is worth considering sorting the input. For any reasonable comparison-based sort, this will be O(n log n), which is already better than O(n^2). You might do better with a radix sort or pre-sort (making something like a bucket sort) depending on the range of the input list.
Having sorted the input, it is an O(n) operation to find a pair of numbers that sum to any given number, so your overall complexity is O(n log n).
In practice, it's an open question whether, for the stipulated “large number” of elements, a brute-force O(n^2) algorithm with nice cache behavior (zipping through arrays in order) would outperform the asymptotically better algorithm that moves a lot of data around, but eventually the one with the lower asymptotic complexity will win.
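For reference, the O(n) pass over the sorted data is the classic two-pointer scan; a minimal sketch (names are illustrative, not from the answer):
def has_pair_with_sum(sorted_values, target):
    # walk inward from both ends of a sorted list: O(n) after the O(n log n) sort
    lo, hi = 0, len(sorted_values) - 1
    while lo < hi:
        s = sorted_values[lo] + sorted_values[hi]
        if s == target:
            return True
        if s < target:
            lo += 1
        else:
            hi -= 1
    return False

print(has_pair_with_sum(sorted(range(-10, 10)), 2))  # True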

How do I return a list of the 3 lowest values in another list

How do I return a list of the 3 lowest values in another list? For example, I want to get the 3 lowest values of this list:
in_list = [1, 2, 3, 4, 5, 6]
input: function(in_list, 3)
output: [1, 2, 3]
You can use heapq.nsmallest:
>>> from heapq import nsmallest
>>> in_list = [1, 2, 3, 4, 5, 6]
>>> nsmallest(3, in_list)
[1, 2, 3]
If you can sort, you can get the first 3 elements as below:
alist=[6, 4, 3, 2, 5, 1]
sorted(alist)[:3]
Output:
[1,2,3]
Even simpler, without any modules imported:
l =[3,8,9,10,2,4,1]
l1 = sorted(l)[:3]
Hope this helps.
Sorting is the reasonable approach. If you care about asymptotic complexity though, you'd want to do this in time O(n) and space O(1).
def k_min(values, k):
    return sorted(values)[:k]
Calling sorted() gives you O(n log n) time and O(n) space, so to achieve O(n) time and O(1) space a different approach is needed.
For that, you'd iterate through the list (that's where the O(n) comes from) and keep track of the k minimal elements seen so far, which can be done in constant time (since k is here a constant).
For keeping track of the minimal elements, you can use a heap (module heapq) or a list. In the case of a list the time is O(n*k), and in the case of a heap it is O(n log k). In any case, because k is for you a constant, the whole thing ends up linear in n.
Using a list (which is a bit simpler than the heap, though of course bad if k is large), it may look like this:
def k_min(values, k):
    minima = []  # len(minima) <= k + 1
    for v in values:
        minima.append(v)
        if len(minima) > k:
            minima.remove(max(minima))  # O(k) == O(1)
    return minima
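The heap-based variant mentioned above (O(n log k)) could look like this; a sketch, using the usual negation trick to turn heapq's min-heap into a max-heap of the k smallest values:
import heapq

def k_min_heap(values, k):
    # keep a max-heap (via negation) holding the k smallest values seen so far
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif -heap[0] > v:  # the largest of the current minima exceeds v
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

print(k_min_heap([3, 8, 9, 10, 2, 4, 1], 3))  # [1, 2, 3]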
If your lists are long, the most efficient way of doing this is via numpy.partition:
>>> import numpy
>>> def lowest(a, n): return numpy.partition(a, n-1)[:n]
>>> in_list = [6, 4, 3, 2, 5, 1]
>>> lowest(in_list, 3)
array([1, 2, 3])
This executes in O(N) time, unlike a full sort, which would take O(N log N) time. The time savings come from not performing a full sort, but only the minimal amount of work needed to ensure that the n lowest elements come first. Hence the output is not necessarily sorted.
If you need them to be sorted, you can do that afterwards (numpy.sort(lowest(in_list,3)) => array([1,2,3])). For a large array this will still be faster than sorting the whole thing first.
Edit: Here is a comparison of the speed of numpy.partition, heapq.nsmallest and sorted:
>>> a = numpy.random.permutation(numpy.arange(1000000))
>>> timeit numpy.partition(a, 2)[:3]
100 loops, best of 3: 3.32 ms per loop
>>> timeit heapq.nsmallest(3,a)
1 loops, best of 3: 220 ms per loop
>>> timeit sorted(a)[:3]
1 loops, best of 3: 1.18 s per loop
So numpy.partition is 66 times faster than heapq.nsmallest for an array with a million elements, and it is 355 times faster than sorted. This doesn't mean that you should never use heapq.nsmallest (which is very flexible), but it shows how much it can matter to avoid plain lists when speed is critical.
