I want to group an arbitrary-sized array of random values into n groups, such that the sums of the groups/bins are as equal as possible.
So for values [1, 2, 4, 5] and n = 2, the output buckets should be [5, 1] and [4, 2], each summing to 6.
Some possibilities that occur to me:
Full exhaustive breadth first search
Random processes with stopping conditions hard coded
Start from one end of the sorted array, grouping until the sum is equal to the global average, and move to the next group until n is reached
Seems like the optimal solution (where the sum of the contents of the bins are as equal as possible given the input array) is probably non-trivial; so at the moment I'm leaning towards the last option, but have the feeling I am possibly missing more elegant solutions?
This is an NP-hard problem. In other words, you can't expect to find an optimal solution without essentially exploring all combinations, and the number of combinations is n^M (where M is the size of your array and n is the number of bins). It's a problem very similar to clustering, which is also NP-hard.
If your data set is small enough to deal with, a brute force algorithm is best (explore all combinations).
However, if your data set is big, you'll want a polynomial-time algorithm that won't get you the optimal solution, but a good approximation. In that case, I suggest you use something similar to K-Means...
Step 1. Calculate the expected sum per bin. Let A be your array, then the expected sum per bin is SumBin = SUM(A) / n (the sum of all elements in your array over the number of bins).
Step 2. Put all elements of your array in some collection (e.g. another array) that we'll call The Bag (this is just conceptual, to help follow the next steps).
Step 3. Partition The Bag into n groups (preferably randomly, so that each element ends up in some bin i with probability 1/n). At this point, your bins have all the elements, and The Bag is empty.
Step 4. Calculate the sum for each bin. If the result is the same as in the last iteration, exit. (This is the expectation step of K-Means.)
Step 5. For each bin i, if its sum is greater than SumBin, pick the first element greater than SumBin and put it back in The Bag; if its sum is less than SumBin, pick the first element less than SumBin and put it back in The Bag. This is the gradient descent step (aka maximization step) of K-Means.
Step 6. Go to step 3.
This algorithm is just an approximation, but it's fast and guaranteed to converge.
If you are skeptical about a randomized algorithm like the above, after the first iteration when you are back at step 3, instead of assigning elements randomly, you can do so optimally by running the Hungarian algorithm, but I am not sure that will guarantee better overall results.
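Here is a minimal Python sketch of that heuristic, assuming the per-element random assignment from step 3; the function name balance_bins and the max_iters safety cap are my own additions, not part of the steps above.

import random

def balance_bins(values, n, max_iters=1000):
    """Bag/bin re-balancing heuristic sketched above (approximate, not optimal)."""
    target = sum(values) / n                 # Step 1: expected sum per bin
    bins = [[] for _ in range(n)]
    bag = list(values)                       # Step 2: everything starts in The Bag
    prev_sums = None
    for _ in range(max_iters):
        for v in bag:                        # Step 3: each Bag item goes to a random bin
            bins[random.randrange(n)].append(v)
        bag = []
        sums = [sum(b) for b in bins]        # Step 4: stop once the sums stabilise
        if sums == prev_sums:
            break
        prev_sums = sums
        for b in bins:                       # Step 5: move one offending item per bin back
            s = sum(b)
            if s > target:
                big = next((v for v in b if v > target), None)
                if big is not None:
                    b.remove(big)
                    bag.append(big)
            elif s < target:
                small = next((v for v in b if v < target), None)
                if small is not None:
                    b.remove(small)
                    bag.append(small)
    return bins

The convergence test compares the bin sums between successive iterations, as in step 4; the iteration cap just guards against endless oscillation under the random reassignment.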
I'm given a two-dimensional array T of size NxN, filled with various natural numbers (they do not have to be sorted in any way, as in the example below). My task is to write a program that transforms the array so that all elements lying above the main diagonal are larger than each element on the diagonal, and all elements lying below the main diagonal are smaller than each element on the diagonal.
For example:
T looks like this:
[2, 3, 5]
[7, 11, 13]
[17, 19, 23]
and one of the possible solutions is:
[13, 19, 23]
[3, 7, 17]
[5, 2, 11]
I have no clue how to do this. Would anyone have an idea what algorithm should be used here?
Let's say the matrix is NxN.
Put all N² values inside an array.
Sort the array with whatever method you prefer (ascending order).
In your final array, the (N²-N)/2 first values go below the diagonal, the following N values go to the diagonal, and the final (N²-N)/2 values go above the diagonal.
The following pseudo-code should do the job:
mat <- array[N][N] // Already holds the input values; mat[i][j] is row i, column j.
vec <- array[N*N]
// Flatten the matrix into a vector.
for i : 0 to (N-1)
    for j : 0 to (N-1)
        vec[i*N+j] <- mat[i][j]
    next j
next i
sort(vec) // ascending
p_below <- 0           // start of the "below diagonal" values (the smallest ones)
p_diag  <- (N*N-N)/2   // start of the "diagonal" values
p_above <- (N*N+N)/2   // start of the "above diagonal" values (the largest ones)
// Refill: small values below the diagonal (i>j), middle values on it (i=j),
// large values above it (i<j).
for i : 0 to (N-1)
    for j : 0 to (N-1)
        if (i>j)
            mat[i][j] <- vec[p_below]
            p_below <- p_below + 1
        endif
        if (i<j)
            mat[i][j] <- vec[p_above]
            p_above <- p_above + 1
        endif
        if (i=j)
            mat[i][j] <- vec[p_diag]
            p_diag <- p_diag + 1
        endif
    next j
next i
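For reference, here is a minimal Python translation of the same idea (flatten, sort, refill); the function name rearrange is my own:

def rearrange(mat):
    """Flatten, sort, and refill so that below-diagonal < diagonal < above-diagonal."""
    N = len(mat)
    vec = sorted(v for row in mat for v in row)
    p_below, p_diag, p_above = 0, (N * N - N) // 2, (N * N + N) // 2
    out = [[0] * N for _ in range(N)]
    for i in range(N):          # i = row
        for j in range(N):      # j = column
            if i > j:           # below the diagonal: smallest values
                out[i][j] = vec[p_below]
                p_below += 1
            elif i < j:         # above the diagonal: largest values
                out[i][j] = vec[p_above]
                p_above += 1
            else:               # on the diagonal: middle values
                out[i][j] = vec[p_diag]
                p_diag += 1
    return out

On the example from the question, rearrange([[2, 3, 5], [7, 11, 13], [17, 19, 23]]) returns [[7, 17, 19], [2, 11, 23], [3, 5, 13]], which is another valid solution.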
Code can be heavily optimized by sorting the matrix directly, using a (quite complex) custom sort operator, so it can be sorted "in place". Technically, you'd build a bijection between the matrix indices and a partitioned set of indices representing the "below diagonal", "diagonal" and "above diagonal" positions.
But I'm unsure that it can be considered an algorithm in itself, because it will be highly dependent on the language used AND on how you store your matrix internally (and on how iterators/indices are used). I could write one in C++, but I lack the knowledge to give you such an operator in Python.
Obviously, if you can't use a standard sorting function (because it can't work on anything else but an array), then you can write your own, with the tricky comparison built into the algorithm.
For such small matrices, even a bubble sort can work properly, but obviously implementing at least a quicksort would be better.
Elements about optimizing:
First, we speak about the trivial bijection from matrix coordinates [x][y] to a linear index [i]: i = x + y*N. The inverse is obviously x = i mod N and y = floor(i/N). Then, you can traverse the matrix as a vector.
This is already what I do in the first part when initializing vec, BTW.
With matrix coordinates, it's easy:
Diagonal is all cells where x=y.
The "below" partition is everywhere x<y.
The "above" partition is everywhere x>y.
Look at the coordinates in the 3x3 matrix below; it's quite evident once you know it.
0,0 1,0 2,0
0,1 1,1 2,1
0,2 1,2 2,2
We already know that the ordered vector will be composed of three parts: first the "below" partition, then the "diagonal" partition, then the "above" partition.
The next bijection is way trickier, since it requires either a piecewise linear function OR a look-up table. The first requires no additional memory but uses more CPU power; the second uses as much memory as the matrix but requires less CPU power.
As always, optimizing for speed often costs memory. If memory is scarce because you use huge matrices, then you'll prefer the function.
To keep this short, I'll explain only the "below" partition. In the vector, the first (N-1) elements belong to the first column. Then we have (N-2) elements for the 2nd column, (N-3) for the third, until we have only 1 element for the (N-1)th column. You see the scheme: the sum of the number of elements and the column's zero-based index is always (N-1).
I won't write the function, because it's quite complex and, honestly, it won't help much with understanding. Simply know that converting from matrix indices to the vector index is "quite easy".
The opposite direction is trickier and more CPU-intensive, and it SHOULD use an (N-1)-element vector storing where each column starts within the vector, to GREATLY speed up the process. Conveniently, this vector can also be used (from end to beginning) for the "above" partition, so it won't burn too much memory.
Now you can sort your "vector" normally, simply by chaining the two bijections with the vector index, and you'll get a matrix cell instead. As long as the sorting algorithm is stable (that's usually the case), it will work and will sort your matrix "in place", at the expense of a lot of index arithmetic to "route" the linear indices to matrix indices.
Please note that, although we speak about bijections, we need ONLY the "vector to matrix" formulas. The "matrix to vector" formulas are important - it MUST be a bijection! - but you won't use them, since you'll sort the (virtual) vector directly from 0 to N²-1.
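To make the "vector index to matrix cell" direction concrete, here is a hypothetical Python sketch. It walks the columns with a simple loop per lookup instead of the closed-form piecewise function or the look-up table mentioned above, so it's for illustration only.

def vec_to_mat(k, N):
    """Map a virtual sorted-vector index k (0..N*N-1) to matrix coordinates (x, y) = (column, row).
    Layout: the "below" cells column by column, then the diagonal, then the "above" cells."""
    n_below = N * (N - 1) // 2
    if k < n_below:              # below the diagonal: x < y
        x = 0
        while k >= N - 1 - x:    # column x holds N-1-x below-diagonal cells
            k -= N - 1 - x
            x += 1
        return x, x + 1 + k
    k -= n_below
    if k < N:                    # on the diagonal: x == y
        return k, k
    k -= N                       # above the diagonal: x > y
    x = 1
    while k >= x:                # column x holds x above-diagonal cells
        k -= x
        x += 1
    return x, k

# e.g. for N = 3: vec_to_mat(0, 3) -> (0, 1) (below), vec_to_mat(4, 3) -> (1, 1) (diagonal),
# vec_to_mat(8, 3) -> (2, 1) (above).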
I've asked a similar question before, but this time it's different.
Since our array contains only two kinds of values, we might as well call them 1 and -1, where the 1s are on the left side of the array and the -1s on the right:
[1,...,1,1,-1,-1,...,-1]
Both 1 and -1 are present, and the numbers of 1s and -1s are not necessarily the same. Also, the numbers of 1s and -1s are both very large.
Then, define the boundary between the 1s and -1s as the index of the -1 closest to the 1s. For example, for the following array:
[1,1,1,-1,-1,-1,-1]
Its boundary is 3.
Now, every number in the array is covered by a device that you have to unlock to see the number underneath.
I want to unlock as few devices covering a 1 as possible, because it takes much longer to see a '1' than to see a '-1'. And I also want to reduce my total time cost as much as possible.
How can I search so as to find the boundary as quickly as possible?
The problem is very like the "egg dropping" problem, but where a wrong guess has a large fixed cost (100), and a good guess has a small cost (1).
Let E(n) be the (optimal) expected cost of finding the index of the right-most 1 in an array (or finding that the array is all -1), assuming each possible position of the boundary is equally likely. Define the index of the right-most 1 to be -1 if the array is all -1.
If you choose to look at the array element at index i, then it's -1 with probability (i+1)/(n+1), and 1 with probability (n-i)/(n+1).
So if you look at array element i, your expected cost of finding the boundary is (1+E(i)) * (i+1)/(n+1) + (100+E(n-i-1)) * (n-i)/(n+1).
Thus E(n) = min over i=0..n-1 of (1+E(i)) * (i+1)/(n+1) + (100+E(n-i-1)) * (n-i)/(n+1), with E(0) = 0.
For each n, the i that minimizes the equation is the optimal array element to look at for an array of that length.
I don't think you can solve these equations analytically, but you can solve them with dynamic programming in O(n^2) time.
The solution is going to look like a very skewed binary search for large n. For smaller n, it'll be skewed so much that it will be a traversal from the right.
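A hypothetical Python sketch of that dynamic program (the cost values 1 and 100 and the uniform prior over the n+1 boundary positions follow the discussion above; the names are mine):

def boundary_search_dp(n_max, cost_neg=1, cost_pos=100):
    """E[n]: optimal expected cost for an array of length n; probe[n]: index attaining it."""
    E = [0.0] * (n_max + 1)            # E[0] = 0: nothing left to examine
    probe = [None] * (n_max + 1)
    for n in range(1, n_max + 1):
        best_cost, best_i = float("inf"), None
        for i in range(n):
            # element i is -1 with probability (i+1)/(n+1) and 1 with probability (n-i)/(n+1)
            c = ((cost_neg + E[i]) * (i + 1) + (cost_pos + E[n - i - 1]) * (n - i)) / (n + 1)
            if c < best_cost:
                best_cost, best_i = c, i
        E[n], probe[n] = best_cost, best_i
    return E, probe

Tabulating probe[] for increasing n shows the skew described above: the optimal first probe sits well to the right of the middle.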
If I am right, a strategy to minimize the expected cost is to probe at a fraction of the interval that favors the -1 outcome, in inverse proportion to the cost ratio. So instead of picking the middle index, pick the one at the right-most centile (since a wrong guess costs about 100 times more).
But this still corresponds to a logarithmic asymptotic complexity.
There is probably nothing that you can do regarding the worst case.
I was given the following assignment by my Algorithms professor:
Write a Python program that implements Euclid’s extended algorithm. Then perform the following experiment: run it on a random selection of inputs of a given size, for sizes bounded by some parameter N; compute the average number of steps of the algorithm for each input size n ≤ N, and use gnuplot to plot the result. What does f(n) which is the “average number of steps” of Euclid’s extended algorithm on input size n look like? Note that size is not the same as value; inputs of size n are inputs with a binary representation of n bits.
The programming of the algorithm was the easy part, but I just want to make sure that I understand where to go from here. I can fix N to be some arbitrary value. I generate a set of random values of a and b to feed into the algorithm, whose lengths in binary (n) are bounded above by N. While the algorithm is running, I keep a counter that tracks the number of steps (ignoring trivial linear operations) taken for that particular a and b.
At the end of this, I sum the lengths of the binary representations of the individual inputs a and b, and that gives a single x value on the graph. The corresponding y value would be the counter variable for that particular a and b. Is this a correct way to think about it?
As a follow-up question, I also know that the best case for this algorithm is Θ(1) and the worst case is O(log n), so my "average" graph should lie between those two. How would I manually calculate the average running time to verify that my end graph is correct?
Thanks.
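A minimal sketch of one way to run that experiment, assuming both a and b are drawn with exactly n bits and the step counter is incremented once per division step (the function names and the choice of 1000 trials per size are mine):

import random
from statistics import mean

def extended_gcd(a, b):
    """Extended Euclid: returns (g, x, y) with a*x + b*y == g, plus the step count."""
    steps = 0
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b != 0:
        steps += 1
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0, steps

def average_steps(n_bits, trials=1000):
    """Average step count over random n-bit inputs a and b."""
    counts = []
    for _ in range(trials):
        a = random.getrandbits(n_bits) | (1 << (n_bits - 1))   # force exactly n_bits bits
        b = random.getrandbits(n_bits) | (1 << (n_bits - 1))
        counts.append(extended_gcd(a, b)[3])
    return mean(counts)

# One (n, f(n)) pair per line, ready for gnuplot:
# for n in range(2, 65):
#     print(n, average_steps(n))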
The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
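A self-contained sketch of such an array-backed min heap (the class and helper names are my own; Python's heapq module doesn't expose delete-by-position directly, so the sifts are written out):

class MinHeap:
    """Array-backed min heap: O(1) min, O(log n) deletion of the item at any position."""

    def __init__(self, items):
        self.a = list(items)
        for i in reversed(range(len(self.a) // 2)):   # O(n) heapify
            self._sift_down(i)

    def min(self):
        return self.a[0]

    def delete_at(self, i):
        """Remove the element at array position i (e.g. a random one) in O(log n)."""
        a = self.a
        removed = a[i]
        a[i] = a[-1]            # move the last element into the hole
        a.pop()
        if i < len(a):          # restore the heap property around position i
            self._sift_down(i)
            self._sift_up(i)
        return removed

    def _sift_down(self, i):
        a, n = self.a, len(self.a)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and a[left] < a[smallest]:
                smallest = left
            if right < n and a[right] < a[smallest]:
                smallest = right
            if smallest == i:
                return
            a[i], a[smallest] = a[smallest], a[i]
            i = smallest

    def _sift_up(self, i):
        a = self.a
        while i > 0 and a[i] < a[(i - 1) // 2]:
            parent = (i - 1) // 2
            a[i], a[parent] = a[parent], a[i]
            i = parent

# Usage: h = MinHeap([1, 2, 3, 4, 5]); h.delete_at(2); h.min()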
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item takes O(lg n) (instead of O(n) for finding/removing from a plain list). The smallest item always sits in the leftmost node of the tree.
This question may be useful. It provides links to several implementations of various balanced binary search trees. The advice there to use a hash table does not apply well to your case, since a hash table does not help with maintaining a minimum item.
Here's a solution that needs O(N lg N) preprocessing time, O(lg N) time per deletion, and O(lg N * lg N) time per find-min query.
Preprocessing:
step 1: sort L
step 2: for each item L[i], map L[i] -> i
step 3: build a Binary Indexed Tree (or segment tree) where, for every 1 <= i <= length of L, BIT[i] = 1, and maintain the range sums.
Query type delete:
Step 1: if an item x is to be removed, find its index with a binary search on the sorted array L (or from the mapping). Set BIT[index[x]] = 0 and update the range sums. Runtime: O(lg N)
Query type findMin:
Step 1: do a binary search over array L. For every mid, compute the prefix sum on the BIT from 1 to mid. If that sum > 0, then some value <= L[mid] is still alive, so set hi = mid - 1; otherwise set lo = mid + 1. Runtime: O(lg N * lg N)
The same can be done with a segment tree.
Edit: if I'm not wrong, each query can be processed in O(1) with a linked list.
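A sketch of the BIT variant, assuming the values are distinct (the class and method names are my own):

from bisect import bisect_left

class MinTracker:
    """Sort once, flag alive positions in a Fenwick tree:
    delete in O(lg n), find the current minimum in O(lg^2 n)."""

    def __init__(self, values):
        self.sorted_vals = sorted(values)
        self.n = len(self.sorted_vals)
        self.bit = [0] * (self.n + 1)
        for i in range(1, self.n + 1):       # every position starts out alive
            self._add(i, 1)

    def _add(self, i, delta):                # Fenwick point update
        while i <= self.n:
            self.bit[i] += delta
            i += i & (-i)

    def _prefix(self, i):                    # Fenwick prefix sum over positions 1..i
        s = 0
        while i > 0:
            s += self.bit[i]
            i -= i & (-i)
        return s

    def delete(self, x):
        """Remove value x (assumed present and not yet deleted)."""
        idx = bisect_left(self.sorted_vals, x) + 1   # 1-based position in sorted order
        self._add(idx, -1)

    def find_min(self):
        """Binary search for the smallest alive position."""
        lo, hi, ans = 1, self.n, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if self._prefix(mid) > 0:        # something <= sorted_vals[mid-1] is still alive
                ans, hi = mid, mid - 1
            else:
                lo = mid + 1
        return None if ans is None else self.sorted_vals[ans - 1]

# Usage: t = MinTracker([1, 2, 3, 4, 5]); t.delete(4); t.delete(1); t.find_min()  # -> 2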
If sorting isn't in your best interest, I would suggest only doing comparisons where you need to do them. If you remove elements that are not the old minimum, and you aren't inserting any new elements, no re-scan for the minimum value is necessary.
Can you give us some more information about the processing you are trying to do?
Comment answer: You don't have to recompute min(L). Just keep track of its index, and only re-run the scan for min(L) when you remove at (or below) the old index (and adjust the tracked index accordingly), as in the sketch below.
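A tiny sketch of that bookkeeping (the function name is mine):

def remove_and_track_min(L, min_idx, remove_idx):
    """Remove L[remove_idx] and return the new index of the minimum, rescanning only when needed."""
    del L[remove_idx]
    if remove_idx < min_idx:
        return min_idx - 1       # the minimum just shifted one slot to the left
    if remove_idx == min_idx:    # the minimum itself was removed: rescan
        return min(range(len(L)), key=L.__getitem__) if L else None
    return min_idx               # removal was to the right; nothing changes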
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).
I have a very large list comprising about 10,000 elements, and each element is an integer as big as 5 billion. I would like to find the sum of the maximal elements of every possible subset of size 'k' (given by the user) of this array. The only solution that comes to my head is to generate each of the subsets (using itertools) and find its maximum element. But this would take an insane amount of time! What would be a pythonic way to solve this?
Don't use python, use mathematics first. This is a combinatorial problem: If you have an array S of n numbers (n large), and generate all possible subsets of size k, you want to calculate the sum of the maximal elements of the subsets.
Assuming the numbers are all distinct (though it also works if they are not), you can calculate exactly how often each one will be the maximal element of a subset, and go on from there without ever actually constructing a subset. You should have taken it over to math.stackexchange.com, they'd have sorted you out in a jiffy. Here it is, but without the nice math notation:
Sort your array in increasing order and let S_1 be the smallest (first) number, S_2 the next smallest, and so on (note: indexing from 1).

S_n, the largest element, is obviously the maximal element of any subset it is part of, and there are exactly (n-1 choose k-1) such subsets. Of the subsets that don't contain S_n, there are (n-2 choose k-1) subsets that contain S_{n-1}, in which it is the largest element. Continue this until you come down to S_k, the k-th smallest number (counting from the smallest), which is the maximum of exactly one subset: (k-1 choose k-1) = 1. Smaller numbers (S_1 to S_{k-1}) can never be maximal: every set of k elements will contain something larger.

Sum the above (n-k+1 terms), and there's your answer:

S_n*(n-1 choose k-1) + S_{n-1}*(n-2 choose k-1) + ... + S_k*(k-1 choose k-1)

Writing the terms from smallest to largest, this is just the sum

Sum(i=k..n) S_i * (i-1 choose k-1)
If we were on math.stackexchange you'd get it in the proper mathematical notation, but you get the idea.
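For completeness, the closed form is a one-liner in Python (the function name is mine); the brute-force check just confirms it on a small example:

from itertools import combinations
from math import comb

def sum_of_subset_maxima(values, k):
    """Sum over i = k..n of S_i * C(i-1, k-1), with S_1 <= ... <= S_n the sorted input."""
    s = sorted(values)
    n = len(s)
    return sum(s[i - 1] * comb(i - 1, k - 1) for i in range(k, n + 1))

# Sanity check against brute force on a small example:
vals, k = [3, 1, 4, 1, 5], 3
assert sum_of_subset_maxima(vals, k) == sum(max(c) for c in combinations(vals, k))  # both give 45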