Reducing the number of comparisons to find min/max of an array - python

I was looking into something simple: finding the max and min elements of an array (no specific language). I know there are built-in functions in many languages, and you can also write your own, such as:
Initialize max = min = firstElement in Array
Loop through each element, and check if it's less than min or more than max
Update accordingly
Return
This, of course, results in roughly 2k comparisons (two per element) for an array of size k. Is there a way to reduce the number of comparisons we would have to do, assuming only that the array is unsorted? I've tagged this Python because the basic algorithm I coded was in Python.
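For reference, a minimal Python sketch of the approach described above (not necessarily the asker's exact code) might look like this:

def find_min_max(arr):
    # Initialize both extremes to the first element, then scan the rest,
    # spending up to two comparisons on each remaining element.
    smallest = largest = arr[0]
    for value in arr[1:]:
        if value < smallest:
            smallest = value
        if value > largest:
            largest = value
    return smallest, largest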

Related

How to generate unique(!) arrays/lists/sequences of uniformly distributed random numbers

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations no array is equal to another?
Within one array the elements may be duplicates; each array just has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm that works differently, by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of an RNG is that it can repeat sequences at random.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Whether this is usable depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see limits as highlighted by #Prune.
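As an illustration of that final conversion step, a small sketch might look like this (the integer passed in stands for the output of whatever format-preserving cipher you choose; the function name is made up):

def int_to_pack(value, pack_length=10, max_value=9):
    # Turn a non-negative integer into pack_length digits in base max_value + 1.
    base = max_value + 1
    digits = []
    for _ in range(pack_length):
        digits.append(value % base)
        value //= base
    return digits[::-1]  # most significant digit first

# e.g. int_to_pack(1234567890) -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]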
Note that as the number of requested packs approaches the number of unique packs, it takes longer and longer to find a new pack. I also put in a safety limit so that after a certain number of tries it just gives up.
Feel free to adjust:
import random
## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    # Remember the hash of every pack handed out so far.
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        # Keep regenerating until we get a pack we have not seen before.
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------

for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable way of computing log(1/(1 - p))
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack, everything should be fine. For example, two or more Python floats are OK (2 * 53 bits), as are 10 integers with >= 1000 distinct values each (10 * log2(1000) ≈ 100 bits).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.

What is the fastest way of computing powerset of an array in Python?

Given a list of numbers, e.g. x = [1,2,3,4,5], I need to compute its powerset (the set of all subsets of that list). Right now I am using the following code to compute the powerset; however, when I have many such lists (e.g. 40K of them), it is extremely slow. So I am wondering if there is any way to speed this up.
superset = [sorted(x[:i]+x[i+s:]) for i in range(len(x)) for s in range(len(x))]
I also tried the following code, however it is much slower than the code above.
from itertools import chain, combinations
def powerset(x):
    xx = list(x)
    return chain.from_iterable(combinations(xx, i) for i in range(len(xx) + 1))
You can represent a powerset more efficiently by having all subsets reference the original set as a list and having each subset carry a number whose bits indicate inclusion in the subset. Thus you can enumerate the power set by computing the number of elements and then iterating through the integers with that many bits. However, as has been noted in the comments, the power set grows extremely fast, so if you can avoid having to compute or iterate through the power set, you should do so if at all possible.
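For illustration, a minimal sketch of that bit-mask enumeration (the function name is made up) could look like:

def powerset_bitmask(x):
    # Each integer mask from 0 to 2**n - 1 encodes one subset:
    # bit i set means x[i] is included.
    n = len(x)
    for mask in range(1 << n):
        yield [x[i] for i in range(n) if (mask >> i) & 1]

# Usage: list(powerset_bitmask([1, 2, 3])) yields all 8 subsets.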

Given a list L labeled 1 to N, and a process that "removes" a random element from consideration, how can one efficiently keep track of min(L)?

The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
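As an illustration of that idea, here is a rough sketch in Python (the helper names are made up; the point is swap-with-last removal plus a sift to restore the heap property):

import heapq
import random

def _sift_down(heap, i):
    n = len(heap)
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

def _sift_up(heap, i):
    while i > 0 and heap[i] < heap[(i - 1) // 2]:
        parent = (i - 1) // 2
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent

def remove_random_item(heap):
    # Overwrite a random slot with the last element, shrink the array,
    # then restore the heap property (only one sift will actually move it).
    i = random.randrange(len(heap))
    removed = heap[i]
    heap[i] = heap[-1]
    heap.pop()
    if i < len(heap):
        _sift_down(heap, i)
        _sift_up(heap, i)
    return removed

L = [1, 2, 3, 4, 5]
heapq.heapify(L)
remove_random_item(L)
print(L[0])  # the current minimum, read in O(1)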
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item takes O(lg n) (instead of O(n) for finding/removing from a simple list). The smallest item is always found at the leftmost node of the tree.
This question may be useful. It provides links to several implementations of various balanced binary search trees. The advice to use a hash table does not apply well to your case, since it does not address maintaining a minimum item.
Here's a solution that needs O(N lg N) preprocessing time, O(lg N) time per deletion, and O(lg N * lg N) time per findMin query.
Preprocessing:
step 1: sort L.
step 2: for each item L[i], record the mapping L[i] -> i.
step 3: build a Binary Indexed Tree (or segment tree) with BIT[i] = 1 for every 1 <= i <= length of L, and keep the sums of the ranges.
Query type delete:
step 1: when an item x is to be removed, find its index either with a binary search on the sorted L or from the mapping, set BIT[index[x]] = 0 and update the affected ranges. Runtime: O(lg N).
Query type findMin:
step 1: binary search over the sorted L. For every mid, compute the prefix sum on the BIT over [1, mid]. If the sum is > 0, some value <= L[mid] is still alive, so remember mid and set hi = mid - 1; otherwise set low = mid + 1. Runtime: O(lg N * lg N).
The same can be done with a segment tree.
Edit: if I'm not mistaken, each query could be processed in O(1) with a linked list.
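A rough, self-contained sketch of that BIT approach, assuming the values in L are distinct (the class and method names are hypothetical):

class FenwickMinTracker:
    def __init__(self, values):
        self.sorted_vals = sorted(values)
        self.index = {v: i + 1 for i, v in enumerate(self.sorted_vals)}  # 1-based
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i in range(1, self.n + 1):
            self._update(i, 1)  # every value starts out "alive"

    def _update(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix_sum(self, i):
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def remove(self, value):  # O(log N)
        self._update(self.index[value], -1)

    def find_min(self):  # O(log^2 N): binary search over prefix sums
        lo, hi, answer = 1, self.n, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if self._prefix_sum(mid) > 0:
                answer = mid
                hi = mid - 1
            else:
                lo = mid + 1
        return None if answer is None else self.sorted_vals[answer - 1]

tracker = FenwickMinTracker([1, 2, 3, 4, 5])
tracker.remove(4)
tracker.remove(2)
tracker.remove(1)
print(tracker.find_min())  # 3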
If sorting isn't in your best interest, I would suggest only doing comparisons where you need to do them. If you remove elements that are not the old minimum, and you aren't inserting any new elements, no re-scan for the minimum value is necessary.
Can you give us some more information about the processing you are trying to do?
Comment answer: you don't have to recompute min(L). Just keep track of its index, and only re-run the scan for min(L) when you remove the element at (or below) the old index (and make sure you track the index accordingly).
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).

Calculating how often an element of an array falls within a certain range

Basically I'm attempting a question given to me by one of my peers to help me with python. I've got to calculate how often a given value falls within a certain distance from the 'edge' of the array.
I've generated a 100x100 array filled with random variables by using rand(100,100). However from there I'm pretty stumped.
As far as I can work out I've got to declare the range and then use counters to count the elements within that range, but I honestly don't have a clue.
I'm not 100% clear on what's meant by "within a certain distance from the 'edge' of the array", but am assuming that you have a numpy array and that you are trying to count the number of occurrences within an upper and lower bound, in which case you can use:
((lowerBound < numpyArray) & (numpyArray < upperBound)).sum()
where:
lowerBound and upperBound are float variables between 0 and 1
numpyArray is the array that you have generated
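As a concrete, illustrative example (the bounds 0.25 and 0.75 are arbitrary choices, not from the question):

import numpy as np

numpyArray = np.random.rand(100, 100)
lowerBound, upperBound = 0.25, 0.75
count = ((lowerBound < numpyArray) & (numpyArray < upperBound)).sum()
print(count)  # on average about half of the 10000 entries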

Efficiently find the range of an array in python?

Is there an accepted, efficient way to find the range (i.e. max value - min value) of a list of numbers in Python? I have tried using a loop, and I know I can use the min and max functions with subtraction. I am just wondering if there is some kind of built-in that is faster.
If you really need high performance, try Numpy. The function numpy.ptp computes the range of values (i.e. max - min) across an array.
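For example, assuming the data can be converted to a NumPy array:

import numpy as np

values = [3, 7, 1, 9, 4]
print(np.ptp(np.asarray(values)))  # 8, i.e. 9 - 1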
You're unlikely to find anything faster than the min and max functions.
You could possibly code up a minmax function which does a single pass to calculate the two values rather than two passes, but you should benchmark this to ensure it's faster. It may not be if it's written in Python itself, though a C routine added to Python may manage it. Something like:

def minmax(arr):
    if not arr:
        return (None, None)
    themin = arr[0]
    themax = arr[0]
    for value in arr[1:]:
        if value < themin:
            themin = value
        elif value > themax:
            themax = value
    return (themin, themax)
Another possibility is to interpose your own class around the array (this may not be possible if you want to work on real arrays directly). This would basically perform the following steps:
mark the initial empty array clean.
if adding the first element to an array, set themin and themax to that value.
if adding element to a non-empty array, set themin and themax depending on how the new value compares to them.
if deleting an element that is equal to themin or themax, mark the array dirty.
if requesting min and max from a clean array, return themin and themax.
if requesting min and max from a dirty array, calculate themin and themax using loop in above pseudo-code, then set array to be clean.
What this does is to cache the minimum and maximum values so that, at worst, you only need to do the big calculation infrequently (after deletion of an element which was either the minimum or maximum). All other requests use cached information.
In addition, the adding of elements keep themin and themax up to date without a big calculation.
And, possibly even better, you could maintain a separate dirty flag for each of themin and themax, so that dirtying one would still allow you to use the cached value of the other.
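A rough sketch of such a wrapper (hypothetical class name, simplified to a single dirty flag) might look like:

class MinMaxList:
    def __init__(self):
        self._items = []
        self._min = None
        self._max = None
        self._dirty = False

    def add(self, value):
        self._items.append(value)
        if self._min is None or value < self._min:
            self._min = value
        if self._max is None or value > self._max:
            self._max = value

    def remove(self, value):
        self._items.remove(value)
        if value == self._min or value == self._max:
            self._dirty = True  # the cached extremes may now be stale

    def minmax(self):
        if self._dirty or not self._items:
            # Recompute only when a cached extreme was deleted (or the list is empty).
            self._min = min(self._items) if self._items else None
            self._max = max(self._items) if self._items else None
            self._dirty = False
        return self._min, self._max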
If you use NumPy and you have a 1-D array (or can create one quickly from a list), then there's the function numpy.ptp():
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ptp.html
