Algos - Delete Extremes From A List of Integers in Python? - python

I want to eliminate extremes from a list of integers in Python. I'd say that my problem is one of design. Here's what I cooked up so far:
listToTest = [120,130,140,160,200]
def function(l):
length = len(l)
for x in xrange(0,length - 1):
if l[x] < (l[x+1] - l[x]) * 4:
l.remove(l[x+1])
return l
print function(listToTest)
So the output of this should be: 120,130,140,160 without 200, since that's way too far ahead from the others.
And this works, given 200 is the last one or there's only one extreme. Though, it gets problematic with a list like this:
listToTest = [120,200,130,140,160,200]
Or
listToTest = [120,130,140,160,200,140,130,120,200]
So, the output for the last list should be: 120,130,140,160,140,130,120. 200 should be gone, since it's a lot bigger than the "usual", which revolved around ~130-140.
To illustrate it, here's an image:
Obviously, my method doesn't work. Some thoughts:
- I need to somehow do a comparison between x and x+1, see if the next two pairs have a bigger difference than the last pair, then if it does, the pair that has a bigger difference should have one element eliminated (the biggest one), then, recursively do this again. I think I should also have an "acceptable difference", so it knows when the difference is acceptable and not break the recursivity so I end up with only 2 values.
I tried writting it, but no luck so far.

You can use statistics here, eliminating values that fall beyond n standard deviations from the mean:
import numpy as np
test = [120,130,140,160,200,140,130,120,200]
n = 1
output = [x for x in test if abs(x - np.mean(test)) < np.std(test) * n]
# output is [120, 130, 140, 160, 140, 130, 120]

Your problem statement is not clear. If you simply want to remove the max and min then that is a simple
O(N) with 2 extra memory- which is O(1)
operation. This is achieved by retaining the current min/max value and comparing it to each entry in the list in turn.
If you want the min/max K items it is still
O(N + KlogK) with O(k) extra memory
operation. This is achieved by two priorityqueue's of size K: one for the mins, one for the max's.
Or did you intend a different output/outcome from your algorithm?
UPDATE the OP has updated the question: it appears they want a moving (/windowed) average and to delete outliers.
The following is an online algorithm -i.e. it can handle streaming data http://en.wikipedia.org/wiki/Online_algorithm
We can retain a moving average: let's say you keep K entries for the average.
Then create a linked list of size K and a pointer to the head and tail. Now: handling items within the first K entries needs to be thought out separately. After the first K retained items the algo can proceed as follows:
check the next item in the input list against the running k-average. If the value exceeds the acceptable ratio threshold then put its list index into a separate "deletion queue" list. Otherwise: update the running windowed sum as follows:
(a) remove the head entry from the linked list and subtract its value from the running sum
(b) add the latest list entry as the tail of the linked list and add its value to the running sum
(c) recalculate the running average as the running sum /K
Now: how to handle the first K entries? - i.e. before we have a properly initialized running sum?
You will need to make some hard-coded decisions here. A possibility:
run through all first K+2D (D << K) entries.
Keep d max/min values
Remove the d (<< K) max/min values from that list

Related

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement something like this::
Given n, k and an array(a list) arr where n = len(arr) and k is an integer in set (1, n) inclusive.
For an array (or list) myList, The Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if mylist = [1, 2, 5, 5, 6] then Minimum unfairness sum or MUS. Please note that elements are considered unique by their index in list not their values
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
If you actually need to look at the problem statement, It's HERE
My Objective
given n, k, arr(as described above), find the Minimum Unfairness Sum out of all of the unfairness sums of sub arrays possible with a constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My First approach was this where i used itertools.combinations to get all the possible combinations and statistics.variance to check its spread of data (yeah, I know I'm a mess).
Before you see the code below, Do you think these variance and unfairness sum are perfectly related (i know they are strongly related) i.e. the sub array with minimum variance has to be the sub array with MUS??
You only have to check the LetMeDoIt(n, k, arr) function. If you need MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn
def LetMeDoIt(n, k, arr):
v = []
s = []
subs = [list(x) for x in list(cmb(arr, k))] # getting all sub arrays from arr in a list
i = 0
for sub in subs:
if i != 0:
var = varn(sub) # the variance thingy
if float(var) < float(min(v)):
v.remove(v[0])
v.append(var)
s.remove(s[0])
s.append(sub)
else:
pass
elif i == 0:
var = varn(sub)
v.append(var)
s.append(sub)
i = 1
final = []
f = list(cmb(s[0], 2)) # getting list of all pairs (after determining sub array with least MUS)
for r in f:
final.append(abs(r[0]-r[1])) # calculating the MUS in my messy way
return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In Python chat, Kevin suggested me to try generator which is memory efficient (it really is), but as generator also generates those combination on the fly as we iterate over them, it was supposed to take over 140 hours (:/) for n=50, k=8 as estimated.
I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which takes me to my second approach - a better one(i should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb
def myvar(arr): # a function to calculate variance
l = len(arr)
m = sum(arr)/l
return sum((i-m)**2 for i in arr)/l
def LetMeDoIt(n, k, arr):
sorted_list = sorted(arr) # i think sorting the array makes it easy to get the sub array with MUS quickly
variance = None
min_variance_sub = None
for i in range(n - k + 1):
sub = sorted_list[i:i+k]
var = myvar(sub)
if variance is None or var<variance:
variance = var
min_variance_sub=sub
final = []
f = list(cmb(min_variance_sub, 2)) # again getting all possible pairs in my messy way
for r in f:
final.append(abs(r[0] - r[1]))
return sum(final)
def MainApp():
n = int(input())
k = int(input())
arr = list(int(input()) for _ in range(n))
result = LetMeDoIt(n, k, arr)
print(result)
if __name__ == '__main__':
MainApp()
This code works perfect for n up to 1000 (maybe more), but terminates due to time out (5 seconds is the limit on online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to take care of all the test cases in given time limits (5 sec) ? (problem was listed under algorithm & dynamic programming)
(for your references you can have a look on
successful submissions(py3, py2, C++, java) on this problem by other candidates - so that you can
explain that approach for me and future visitors)
an editorial by the problem setter explaining how to approach the question
a solution code by problem setter himself (py2, C++).
Input data (test cases) and expected output
Edit1 ::
For future visitors of this question, the conclusions I have till now are,
that variance and unfairness sum are not perfectly related (they are strongly related) which implies that among a lots of lists of integers, a list with minimum variance doesn't always have to be the list with minimum unfairness sum. If you want to know why, I actually asked that as a separate question on math stack exchange HERE where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected)
As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because BY DEFAULT, any sublist that includes at least one element that is not consecutive, will have higher unfairness sum.
For example if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check for sublists with length 3, then solution must be one of [1,3,7] [3,7,10] [7,10,20] etc
Any other sublist eg [1,3,10] will have higher unfairness sum because 10>7 therefore all its differences with rest of elements will be larger than 7
The same for [1,7,10] (non consecutive on the left side) as 1<3
Given that, you only have to check for consecutive sublists of length k which reduces the execution time significantly
Regarding coding, something like this should work:
def myvar(array):
return sum([abs(i[0]-i[1]) for i in itertools.combinations(array,2)])
def minsum(n, k, arr):
res=1000000000000000000000 #alternatively make it equal with first subarray
for i in range(n-k):
res=min(res, myvar(l[i:i+k]))
return res
I see this question still has no complete answer. I will write a track of a correct algorithm which will pass the judge. I will not write the code in order to respect the purpose of the Hackerrank challenge. Since we have working solutions.
The original array must be sorted. This has a complexity of O(NlogN)
At this point you can check consecutive sub arrays as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer
The last check passage, to find the minimum "unfairness sum" can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating this for every step, done in O(k), which brings the complexity of this passage to O(k*N). It can be done in O(1) as the editorial you posted shows, including mathematic formulae. It requires a previous initialization of a cumulative array after step 1 (done in O(N) with space complexity O(N) too).
It works but terminates due to time out for n<=10000.
(from comments on archer's question)
To explain step 3, think about k = 100. You are scrolling the N-long array and the first iteration, you must calculate the US for the sub array from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a sub array that only differs from the previous by 1 element 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scrolling. Just figure the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.

Python: Unsorted/Sorted Lists Return Different Values?

Working on some stuff and I came into an odd issue that I am having trouble figuring out. I am sorting a list with 10,000 values in it two ways, one with the usage of quick select, and another with the usgae of insertion sort. The goal of this is to find a median and then using said median I have to find the total distance between the median value and all of the values. The median calculations work perfectly fine, but the total calculations, for reasons I cannot understand return different values. The inputs into the function that calculates the total are the list and the median. The median stays the same between both programs and the values of the list do as well however one of the lists is sorted and the other list is not sorted.
Here is what I am using to calculate the total (formatting on this is fine it just copies into here weird)...
def ttlDist(lst, med):
total = 0
for i in lst:
total = abs(med - i)
print(total)
With one list being sorted and another not being sorted why would I be getting drastically different values? For refrence the distance I get when using insertion sort is 49846.0 but the distance I get when using quickselect is 29982
You're not accumulating anything; you're replacing total with a new value each time through the loop. So, at the end of the loop, total is the value from the last element of lst. A sorted list and an unsorted list will generally have different last elements.
What you probably wanted is:
total += abs(med - i)
Or, more simply, replace the whole function with:
total = sum(abs(med-i) for i in lst)

Given a list L labeled 1 to N, and a process that "removes" a random element from consideration, how can one efficiently keep track of min(L)?

The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item will take O(lg n) (instead of O(n) for finding/removing from a simple list). A smallest item will always appear in the root of the tree.
This question may be useful. It provides links to several implementation of various balanced binary search trees. The advice to use a hash table does not apply well to your case, since it does not address maintaining a minimum item.
Here's a solution that need O(N lg N) preprocessing time + O(lg N) update time and O(lg(n)*lg(n)) delete time.
Preprocessing:
step 1: sort the L
step 2: for each item L[i], map L[i]->i
step 3: Build a Binary Indexed Tree or segment tree where for every 1<=i<=length of L, BIT[i]=1 and keep the sum of the ranges.
Query type delete:
Step 1: if an item x is said to be removed, with a binary search on array L (where L is sorted) or from the mapping find its index. set BIT[index[x]] = 0 and update all the ranges. Runtime: O(lg N)
Query type findMin:
Step 1: do a binary search over array L. for every mid, find the sum on BIT from 1-mid. if BIT[mid]>0 then we know some value<=mid is still alive. So we set hi=mid-1. otherwise we set low=mid+1. Runtime: O(lg**2N)
Same can be done with Segment tree.
Edit: If I'm not wrong per query can be processed in O(1) with Linked List
If sorting isn't in your best interest, I would suggest only do comparisons where you need to do them. If you remove elements that are not the old minimum, and you aren't inserting any new elements, there isn't a re-scan necessary for a minimum value.
Can you give us some more information about the processing going on that you are trying to do?
Comment answer: You don't have to compute min(L). Just keep track of its index and then only re-run the scan for min(L) when you remove at(or below) the old index (and make sure you track it accordingly).
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).

"sorted 1-d iterator" based on "2-d iterator" (Cartesian product of iterators)

I'm looking for a clean way to do this in Python:
Let's say I have two iterators "iter1" and "iter2": perhaps a prime number generator, and itertools.count(). I know a priori that both are are infinite and monotonically increasing. Now I want to take some simple operation of two args "op" (perhaps operator.add or operator.mul), and calculate every element of the first iterator with every element of the next, using said operation, then yield them one at a time, sorted. Obviously, this is an infinite sequence itself. (As mentioned in comment by #RyanThompson: this would be called the Cartesian Product of these sequences... or, more exactly, the 1d-sort of that product.)
What is the best way to:
wrap-up "iter1", "iter2", and "op" in an iterable that itself yields the values in monotonically increasing output.
Allowable simplifying assumptions:
If it helps, we can assume op(a,b) >= a and op(a,b) >= b.
If it helps, we can assume op(a,b) > op(a,c) for all b > c.
Also allowable:
Also acceptable would be an iterator that yields values in "generally increasing" order... by which I mean the iterable could occasionally give me a number less than a previous one, but it would somehow make "side information" available (as via a method on the object) that would say "I'm not guarenteeing the next value I give you will be greater than the one I just gave you, but I AM SURE that all future values will at least be greater than N.".... and "N" itself is monotonically increasing.
The only way I can think to do this is a sort of "diagonalization" process, where I keep an increasing number of partially processed iterables around, and "look ahead" for the minimum of all the possible next() values, and yield that. But this weird agglomeration of a heapq and a bunch of deques just seems outlandish, even before I start to code it.
Please: do not base your answer on the fact that my examples mentioned primes or count().... I have several uses for this very concept that are NOT related to primes and count().
UPDATE: OMG! What a great discussion! And some great answers with really thorough explanations. Thanks so much. StackOverflow rocks; you guys rock.
I'm going to delve in to each answer more thoroughly soon, and give the sample code a kick in the tires. From what I've read so far, my original suspicions are confirmed that there is no "simple Python idiom" to do this. Rather, by one way or another, I can't escape keeping all yielded values of iter1 and iter2 around indefinitely.
FWIW: here's an official "test case" if you want to try your solutions.
import operator
def powers_of_ten():
n = 0
while True:
yield 10**n
n += 1
def series_of_nines():
yield 1
n = 1
while True:
yield int("9"*n)
n += 1
op = operator.mul
iter1 = powers_of_ten()
iter2 = series_of_nines()
# given (iter1, iter2, op), create an iterator that yields:
# [1, 9, 10, 90, 99, 100, 900, 990, 999, 1000, 9000, 9900, 9990, 9999, 10000, ...]
import heapq
import itertools
import operator
def increasing(fn, left, right):
"""
Given two never decreasing iterators produce another iterator
resulting from passing the value from left and right to fn.
This iterator should also be never decreasing.
"""
# Imagine an infinite 2D-grid.
# Each column corresponds to an entry from right
# Each row corresponds to an entry from left
# Each cell correspond to apply fn to those two values
# If the number of columns were finite, then we could easily solve
# this problem by keeping track of our current position in each column
# in each iteration, we'd take the smallest value report it, and then
# move down in that column. This works because the values must increase
# as we move down the column. That means the current set of values
# under consideration must include the lowest value not yet reported
# To extend this to infinite columns, at any point we always track a finite
# number of columns. The last column current tracked is always in the top row
# if it moves down from the top row, we add a new column which starts at the top row
# because the values are increasing as we move to the right, we know that
# this last column is always lower then any columns that come after it
# Due to infinities, we need to keep track of all
# items we've ever seen. So we put them in this list
# The list contains the first part of the incoming iterators that
# we have explored
left_items = [next(left)]
right_items = [next(right)]
# we use a heap data structure, it allows us to efficiently
# find the lowest of all value under consideration
heap = []
def add_value(left_index, right_index):
"""
Add the value result from combining the indexed attributes
from the two iterators. Assumes that the values have already
been copied into the lists
"""
value = fn( left_items[left_index], right_items[right_index] )
# the value on the heap has the index and value.
# since the value is first, low values will be "first" on the heap
heapq.heappush( heap, (value, left_index, right_index) )
# we know that every other value must be larger then
# this one.
add_value(0,0)
# I assume the incoming iterators are infinite
while True:
# fetch the lowest of all values under consideration
value, left_index, right_index = heapq.heappop(heap)
# produce it
yield value
# add moving down the column
if left_index + 1 == len(left_items):
left_items.append(next(left))
add_value(left_index+1, right_index)
# if this was the first row in this column, add another column
if left_index == 0:
right_items.append( next(right) )
add_value(0, right_index+1)
def fib():
a = 1
b = 1
while True:
yield a
a,b = b,a+b
r = increasing(operator.add, fib(), itertools.count() )
for x in range(100):
print next(r)
Define the sequences as:
a1 <= a2 <= a3 ...
b1 <= b2 <= b3 ...
Let a1b1 mean op(a1,b1) for short.
Based on your allowable assumptions (very important) you know the following:
max(a1, b1) <= a1b1 <= a1b2 <= a1b3 ...
<=
max(a2, b1) <= a2b1 <= a2b2 <= a2b3 ...
<=
max(a3, b1) <= a3b1 <= a3b2 <= a3b3 ...
. .
. .
. .
You'll have to do something like:
Generate a1b1. You know that if you continue increasing the b variables, you will only get higher values. The question now is: is there a smaller value by increasing the a variables? Your lower bound is min(a1, b1), so you will have to increase the a values until min(ax,b1) >= a1b1. Once you hit that point, you can find the smallest value from anb1 where 1 <= n <= x and yield that safely.
You then will have multiple horizontal chains that you'll have to keep track of. Every time you have a value that goes past min(ax,b1), you'll have to increase x (adding more chains) until min(ax,b1) is larger than it before safely emitting it.
Just a starting point... I don't have time to code it at the moment.
EDIT: Oh heh that's exactly what you already had. Well, without more info, this is all you can do, as I'm pretty sure that mathematically, that's what is necessary.
EDIT2: As for your 'acceptable' solution: you can just yield a1bn in increasing order of n, returning min(a1,b1) as N =P. You'll need to be more specific. You speak as if you have a heuristic of what you generally want to see, the general way you want to progress through both iterables, but without telling us what it is I don't know how one could do better.
UPDATE: Winston's is good but makes an assumption that the poster didn't mention: that op(a,c) > op(b,c) if b>a. However, we do know that op(a,b)>=a and op(a,b)>=b.
Here is my solution which takes that second assumption but not the one Winston took. Props to him for the code structure, though:
def increasing(fn, left, right):
left_items = [next(left)]
right_items = [next(right)]
#columns are (column value, right index)
columns = [(fn(left_items[0],right_items[0]),0)]
while True:
#find the current smallest value
min_col_index = min(xrange(len(columns)), key=lambda i:columns[i][0])
#generate columns until it's impossible to get a smaller value
while right_items[0] <= columns[min_col_index][0] and \
left_items[-1] <= columns[min_col_index][0]:
next_left = next(left)
left_items.append(next_left)
columns.append((fn(next_left, right_items[0]),0))
if columns[-1][0] < columns[min_col_index][0]:
min_col_index = len(columns)-1
#yield the smallest value
yield columns[min_col_index][0]
#move down that column
val, right_index = columns[min_col_index]
#make sure that right value is generated:
while right_index+1 >= len(right_items):
right_items.append(next(right))
columns[min_col_index] = (fn(left_items[min_col_index],right_items[right_index+1]),
right_index+1)
#repeat
For a (pathological) input that demonstrates the difference, consider:
def pathological_one():
cur = 0
while True:
yield cur
cur += 100
def pathological_two():
cur = 0
while True:
yield cur
cur += 100
lookup = [
[1, 666, 500],
[666, 666, 666],
[666, 666, 666],
[666, 666, 666]]
def pathological_op(a, b):
if a >= 300 or b >= 400: return 1005
return lookup[b/100][a/100]
r = increasing(pathological_op, pathological_one(), pathological_two())
for x in range(15):
print next(r)
Winston's answer gives:
>>>
1
666
666
666
666
500
666
666
666
666
666
666
1005
1005
1005
While mine gives:
>>>
1
500
666
666
666
666
666
666
666
666
666
666
1005
1005
1005
Let me start with an example of how I would solve this intuitively.
Because reading code inline is a little tedious, I'll introduce some notation:
Notation
i1 will represent iter1. i10 will represent the first element of iter1. Same for iter2.
※ will represent the op operator
Intuitive solution
By using simplifying assumption 2, we know that i10 ※ i20 is the smallest element that will ever be yielded from your final iterator. The next element would the smaller of i10 ※ i21 and i11 ※ i20.
Assuming i10 ※ i21 is smaller, you would yield that element. Next, you would yield the smaller of i11 ※ i20, i11 ※ i20, and i11 ※ i21.
Expression as traversal of a DAG
What you have here is a graph traversal problem. First, think of the problem as a tree. The root of the tree is i10 ※ i20. This node, and each node below it, has two children. The two children of i1x ※ i2y are the following: One child is i1x+1 ※ i2y, and the other child is i1x ※ i2y+1. Based on your second assumption, we know that i1x ※ i2y is less than both of its children.
(In fact, as Ryan mentions in a comment, this is a directed acyclic graph, or DAG. Some "parents" share "children" with other "parent" nodes.)
Now, we need to keep a frontier - a collection of nodes that could be next to be returned. After returning a node, we add both its children to the frontier. To select the next node to visit (and return from your new iterator), we compare the values of all the nodes in the frontier. We take the node with the smallest value and we return it. Then, we again add both of its child nodes to the frontier. If the child is already in the frontier (added as the child of some other parent), just ignore it.
Storing the frontier
Because you are primarily interested in the value of the nodes, it makes sense to store these nodes indexed by value. As such, it may be in your interest to use a dict. Keys in this dict should be the values of nodes. Values in this dict should be lists containing individual nodes. Because the only identifying information in a node is the pair of operands, you can store individual nodes as a two-tuple of operands.
In practice, after a few iterations, your frontier may look like the following:
>>> frontier
{1: [(2, 3), (2, 4)], 2: [(3, 5), (5, 4)], 3: [(1, 6)], 4: [(6, 3)]}
Other implementation notes
Because iterators don't support random access, you'll need to hang on to values that are produced by your first two iterators until they are no longer needed. You'll know that a value is still needed if it is referenced by any value in your frontier. You'll know that a value is no longer needed once all nodes in the frontier reference values later/greater than one you've stored. For example, i120 is no longer needed when nodes in your frontier reference only i121, i125, i133, ...
As mentioned by Ryan, each value from each iterator will be used an infinite number of times. Thus, every value produced will need to be saved.
Not practical
Unfortunately, in order to assure that elements are returned only in increasing order, the frontier will grow without bound. Your memoized values will probably also take a significant amount of space also grow without bound. This may be something you can address by making your problem less general, but this should be a good starting point.
So you basically want to take two monotonically increasing sequences, and then (lazily) compute the multiplication (or addition, or another operation) table between them, which is a 2-D array. Then you want to put the elements of that 2-D array in sorted order and iterate through them.
In general, this is impossible. However, if your sequences and operation are such that you can make certain guarantees about the rows and columns of the table, then you can make some progress. For example, let's assume that your sequences are monitonically-increasing sequences of positive integers only, and that the operation is multiplication (as in your example). In this case, we know that every row and column of the array is a monotonically-increasing sequence. In this case, you do not need to compute the entire array, but rather only parts of it. Specifically, you must keep track of the following:
How many rows you have ever used
The number of elements you have taken from each row that you have used
Every element from either input sequence that you have ever used, plus one more from each
To compute the next element in your iterator, you must do the following:
For each row that you have ever used, compute the "next" value in that row. For example, if you have used 5 values from row 1, then compute the 6th value (i=1, j=6) by taking the 1st value from the first sequence and the 6th value from the second sequence (both of which you have cached) and applying the operation (multiplication) to them. Also, compute the first value in the first unused row.
Take the minimum of all the values you computed. Yield this value as the next element in your iterator
Increment the counter for the row from which you sampled the element in the previous step. If you took the element from a new, unused row, you must increment the count of the number of rows you have used, and you must create a new counter for that row initialized to 1. If necessary, you must also compute more values of one or both input sequences.
This process is kind of complex, and in particular notice that to compute N values, you must in the worst case save an amount of state proportional to the square root of N. (Edit: sqrt(N) is actually the best case.) This is in stark contrast to a typical generator, which only requires constant space to iterate through its elements regardless of length.
In summary, you can do this under certain assumptions, and you can provide a generator-like interface to it, but it cannot be done in a "streaming" fashion, because you need to save a lot of state in order to iterate through the elements in the correct order.
Use generators, which are just iterators written as functions that yield results. In this case you can write generators for iter1 and iter2 and another generator to wrap them and yield their results (or do calculations with them, or the history of their results) as you go.
From my reading of the question you want something like this, which will calculate every element of the first iterator with every element of the next, using said operation, you also state you want some way to wrap-up "iter1", "iter2", and "op" in an iterable that itself yields the values in monotonically increasing output. I propose generators offer a simple solution to such problem.
import itertools
def prime_gen():
D, q = {}, 2
while True:
if q not in D:
yield q
D[q * q] = [q]
else:
for p in D[q]:
D.setdefault(p + q, []).append(p)
del D[q]
q += 1
def infinite_gen(op, iter1, iter2):
while True:
yield op(iter1.next(), iter2.next())
>>> gen = infinite_gen(operator.mul, prime_gen(), itertools.count())
>>> gen.next()
<<< 0
>>> gen.next()
<<< 3
>>> gen.next()
<<< 10
Generators offer a lot of flexibility, so it should be fairly easy to write iter1 and iter2 as generators that return values you want in the order you want. You could also consider using coroutines, which let you send values into a generator.
Discussion in other answers observes that there is potentially infinite storage required no matter what the algorithm, since every a[n] must remain available for each new b[n]. If you remove the restriction that the input be two iterators and instead only require that they be sequences (indexable or merely something that can be regenerated repeatedly) then I believe all of your state suddenly collapses to one number: The last value you returned. Knowing the last result value you can search the output space looking for the next one. (If you want to emit duplicates properly then you may need to also track the number of times the result has been returned)
With a pair of sequences you have a simple recurrence relation:
result(n) = f(seq1, seq1, result(n-1))
where f(seq1, seq1, p) searches for the minimum value in the output space q such that q > p. In practical terms you'd probably make the sequences memoized functions and choose your search algorithm to avoid thrashing the pool of memoized items.

Finding Nth item of unsorted list without sorting the list

Hey. I have a very large array and I want to find the Nth largest value. Trivially I can sort the array and then take the Nth element but I'm only interested in one element so there's probably a better way than sorting the entire array...
A heap is the best data structure for this operation and Python has an excellent built-in library to do just this, called heapq.
import heapq
def nth_largest(n, iter):
return heapq.nlargest(n, iter)[-1]
Example Usage:
>>> import random
>>> iter = [random.randint(0,1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, iter)
920
Confirm result by sorting:
>>> list(sorted(iter))[-10]
920
Sorting would require O(nlogn) runtime at minimum - There are very efficient selection algorithms which can solve your problem in linear time.
Partition-based selection (sometimes Quick select), which is based on the idea of quicksort (recursive partitioning), is a good solution (see link for pseudocode + Another example).
A simple modified quicksort works very well in practice. It has average running time proportional to N (though worst case bad luck running time is O(N^2)).
Proceed like a quicksort. Pick a pivot value randomly, then stream through your values and see if they are above or below that pivot value and put them into two bins based on that comparison.
In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to sort ONE of the bins.. the population of each bin tells you which bin holds your n-th highest value. So for example if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.
You can iterate the entire sequence maintaining a list of the 5 largest values you find (this will be O(n)). That being said I think it would just be simpler to sort the list.
You could try the Median of Medians method - it's speed is O(N).
Use heapsort. It only partially orders the list until you draw the elements out.
You essentially want to produce a "top-N" list and select the one at the end of that list.
So you can scan the array once and insert into an empty list when the largeArray item is greater than the last item of your top-N list, then drop the last item.
After you finish scanning, pick the last item in your top-N list.
An example for ints and N = 5:
int[] top5 = new int[5]();
top5[0] = top5[1] = top5[2] = top5[3] = top5[4] = 0x80000000; // or your min value
for(int i = 0; i < largeArray.length; i++) {
if(largeArray[i] > top5[4]) {
// insert into top5:
top5[4] = largeArray[i];
// resort:
quickSort(top5);
}
}
As people have said, you can walk the list once keeping track of K largest values. If K is large this algorithm will be close to O(n2).
However, you can store your Kth largest values as a binary tree and the operation becomes O(n log k).
According to Wikipedia, this is the best selection algorithm:
function findFirstK(list, left, right, k)
if right > left
select pivotIndex between left and right
pivotNewIndex := partition(list, left, right, pivotIndex)
if pivotNewIndex > k // new condition
findFirstK(list, left, pivotNewIndex-1, k)
if pivotNewIndex < k
findFirstK(list, pivotNewIndex+1, right, k)
Its complexity is O(n)
One thing you should do if this is in production code is test with samples of your data.
For example, you might consider 1000 or 10000 elements 'large' arrays, and code up a quickselect method from a recipe.
The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a python written quickselect method on small to medium sized datasets (< 1,000,000 elements). Also, you might find as you increase the size of the array beyond that amount, memory is more efficiently handled in native code, and the benefit continues.
So, even if quickselect is O(n) vs sorted's O(nlogn), that doesn't take into account how many actual machine code instructions processing each n elements will take, any impacts on pipelining, uses of processor caches and other things the creators and maintainers of sorted will bake into the python code.
You can keep two different counts for each element -- the number of elements bigger than the element, and the number of elements lesser than the element.
Then do a if check N == number of elements bigger than each element
-- the element satisfies this above condition is your output
check below solution
def NthHighest(l,n):
if len(l) <n:
return 0
for i in range(len(l)):
low_count = 0
up_count = 0
for j in range(len(l)):
if l[j] > l[i]:
up_count = up_count + 1
else:
low_count = low_count + 1
# print(l[i],low_count, up_count)
if up_count == n-1:
#print(l[i])
return l[i]
# # find the 4th largest number
l = [1,3,4,9,5,15,5,13,19,27,22]
print(NthHighest(l,4))
-- using the above solution you can find both - Nth highest as well as Nth Lowest
If you do not mind using pandas then:
import pandas as pd
N = 10
column_name = 0
pd.DataFrame(your_array).nlargest(N, column_name)
The above code will show you the N largest values along with the index position of each value.
Hope it helps. :-)
Pandas Nlargest Documentation

Categories

Resources