Plotting average number of steps for Euclid's extended algorithm - python

I was given the following assignment by my Algorithms professor:
Write a Python program that implements Euclid’s extended algorithm. Then perform the following experiment: run it on a random selection of inputs of a given size, for sizes bounded by some parameter N; compute the average number of steps of the algorithm for each input size n ≤ N, and use gnuplot to plot the result. What does f(n) which is the “average number of steps” of Euclid’s extended algorithm on input size n look like? Note that size is not the same as value; inputs of size n are inputs with a binary representation of n bits.
The programming of the algorithm was the easy part, but I just want to make sure that I understand where to go from here. I can fix N to be some arbitrary value. I generate a set of random values of a and b to feed into the algorithm, whose lengths in binary (n) are bounded above by N. While the algorithm is running, a counter keeps track of the number of steps (ignoring trivial linear operations) taken for that particular a and b.
At the end of this, I sum the lengths of the binary representations of the individual inputs a and b, and that sum is a single x value on the graph. My single y value would be the counter variable for that particular a and b. Is this a correct way to think about it?
As a follow-up question, I also know that the best case for this algorithm is Θ(1) and the worst case is O(log v) in the input value v (which is linear in the bit size n), so my "average" graph should lie between those two. How would I manually calculate the average running time to verify that my final graph is correct?
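Concretely, here is a minimal sketch of the experiment as I picture it (the iterative extended-Euclid routine, the step counter counting division steps, and the bucketing of inputs by bit size n are my interpretation; the trial count is arbitrary):

```python
import random

def ext_gcd_steps(a, b):
    # Iterative extended Euclid; returns (g, x, y, steps) with g = a*x + b*y.
    x0, y0, x1, y1, steps = 1, 0, 0, 1, 0
    while b:
        q = a // b
        a, b = b, a - q * b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
        steps += 1
    return a, x0, y0, steps

def average_steps(N, trials=100, seed=0):
    # For each bit size n <= N, average the step count over random n-bit pairs.
    rng = random.Random(seed)
    averages = {}
    for n in range(2, N + 1):
        lo, hi = 1 << (n - 1), (1 << n) - 1   # values with exactly n bits
        total = 0
        for _ in range(trials):
            a, b = rng.randint(lo, hi), rng.randint(lo, hi)
            total += ext_gcd_steps(a, b)[3]
        averages[n] = total / trials
    return averages  # write these out as "n  avg" pairs for gnuplot
```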
Thanks.

Related

Suppose an array contains only two kinds of elements, how to quickly find their boundaries?

I've asked a similar question before, but this time it's different.
Since our array contains only two elements, we might as well set it to 1 and -1, where 1 is on the left side of the array and -1 is on the right side of the array:
[1,...,1,1,-1,-1,...,-1]
Both 1 and -1 exist at the same time and the number of 1 and -1 is not necessarily the same. Also, the numbers of 1 and -1 are both very large.
Then, define the boundary between 1 and -1 as the index of the -1 closest to 1. For example, for the following array:
[1,1,1,-1,-1,-1,-1]
Its boundary is 3.
Now, for each number in the array, I cover it with a device that you have to unlock to see the number in it.
I want to try to unlock as few devices as possible that cover 1, because it takes much longer to see a '1' than it takes to see a '-1'. And I also want to reduce my time cost as much as possible.
How can I search to get the boundary as quickly as possible?
The problem is very like the "egg dropping" problem, but where a wrong guess has a large fixed cost (100), and a good guess has a small cost (1).
Let E(n) be the (optimal) expected cost of finding the index of the right-most 1 in an array (or finding that the array is all -1), assuming each possible position of the boundary is equally likely. Define the index of the right-most 1 to be -1 if the array is all -1.
If you choose to look at the array element at index i, then it's -1 with probability i/(n+1), and 1 with probability (n-i+1)/(n+1).
So if you look at array element i, your expected cost for finding the boundary is (1+E(i)) * i/(n+1) + (100+E(n-i-1)) * (n-i+1)/(n+1).
Thus E(n) = min((1+E(i)) * i/(n+1) + (100+E(n-i-1)) * (n-i+1)/(n+1), i=0..n-1)
For each n, the i that minimizes the equation is the optimal array element to look at for an array of that length.
I don't think you can solve these equations analytically, but you can solve them with dynamic programming in O(n^2) time.
The solution is going to look like a very skewed binary search for large n. For smaller n, it'll be skewed so much that it will be a traversal from the right.
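The recurrence above can be solved numerically; a sketch of the O(n^2) dynamic program, with the costs from the question (1 for reading a -1, 100 for reading a 1) as defaults:

```python
def expected_costs(N, cost_miss=1.0, cost_hit=100.0):
    # E[n] = optimal expected cost for an array of length n;
    # best_i[n] = the index to probe first in an array of length n.
    E = [0.0] * (N + 1)          # E[0] = 0: nothing left to search
    best_i = [0] * (N + 1)
    for n in range(1, N + 1):
        best = float("inf")
        for i in range(n):
            c = ((cost_miss + E[i]) * i
                 + (cost_hit + E[n - i - 1]) * (n - i + 1)) / (n + 1)
            if c < best:
                best, best_i[n] = c, i
        E[n] = best
    return E, best_i
```

For small n the optimal probe stays at the right end (best_i[n] = n - 1), matching the right-to-left traversal described above.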
If I am right, a strategy to minimize the expected cost is to probe at a fraction of the interval that favors the -1 outcome, in inverse proportion to the costs. So instead of picking the middle index, take an index near the right end (roughly the right centile, given the 1:100 cost ratio).
But this still corresponds to a logarithmic asymptotic complexity.
There is probably nothing that you can do regarding the worst case.

How to generate unique(!) arrays/lists/sequences of uniformly distributed random numbers

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array equal to another?
Within one array, the elements can be duplicates. The array just has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check whether each new one was generated already, but the I/O operations on a steadily growing file take far too much time.
This is a difficult request, since one of the properties of an RNG is that it is free to repeat sequences.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Whether this works depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits highlighted by @Prune.
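The standard library has no format-preserving cipher, but the counter-encryption idea can be sketched with a toy numeric Feistel network, a keyed bijection on [0, 10^10) built here from SHA-256. The key, round count, and digit range are arbitrary choices, and this toy construction is not cryptographically vetted; it only demonstrates the "encrypt 0, 1, 2, ... to get distinct packs" scheme:

```python
import hashlib

def feistel_permute(x, key=b"demo-key", rounds=4, half=10**5):
    # Keyed bijection on [0, half**2): a balanced numeric Feistel network.
    L, R = divmod(x, half)
    for r in range(rounds):
        h = hashlib.sha256(key + bytes([r]) + str(R).encode()).digest()
        F = int.from_bytes(h, "big") % half
        L, R = R, (L + F) % half
    return L * half + R

def nth_unique_pack(n, digits=10, base=10):
    # The n-th unique pack: encrypt the counter n, then write the result
    # as `digits` digits in base `base` (requires base**digits == half**2).
    x = feistel_permute(n)
    pack = []
    for _ in range(digits):
        x, d = divmod(x, base)
        pack.append(d)
    return pack[::-1]
```

Because the permutation is a bijection, distinct counters always give distinct packs, with no lookup table required.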
Note that as the number of requested packs approaches the number of possible unique packs, this takes longer and longer to find a new pack. I also put in a safety limit so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    # Store the packs themselves rather than hash() values, so two
    # distinct packs can never be mistaken for duplicates.
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        attempts = 1
        while pack in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            attempts += 1
        existing_packs.add(pack)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable way to compute -log(1 - p)
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.

Disorderly escape-Google Foobar 2020 not passing test cases [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 years ago.
The code runs fine in an online Python compiler, but it fails all the test cases in Google Foobar:
from math import factorial
from collections import Counter
from fractions import gcd

def cycle_count(c, n):
    cc = factorial(n)
    for a, b in Counter(c).items():
        cc //= (a**b) * factorial(b)
    return cc

def cycle_partitions(n, i=1):
    yield [n]
    for i in range(i, n//2 + 1):
        for p in cycle_partitions(n - i, i):
            yield [i] + p

def solution(w, h, s):
    grid = 0
    for cpw in cycle_partitions(w):
        for cph in cycle_partitions(h):
            m = cycle_count(cpw, w) * cycle_count(cph, h)
            grid += m * (s**sum([sum([gcd(i, j) for i in cpw]) for j in cph]))
    return grid // (factorial(w) * factorial(h))
Check out this code, which is what gets executed. Would love suggestions!
This is a gorgeous problem, both mathematically and from the algorithmic point of view.
Let me try to explain each part.
The mathematics
This part is better read with nicely typeset formulas. See a concise explanation here where links to further reading are given.
Let me add a reference directly here: For example Harary and Palmer's Graphical enumeration, Chapter 2.
In short, there is a set (the whole set of h x w-matrices, where the entries can take any of s different values) and a group of permutations that transforms some matrices in others. In the problem the group consists of all permutations of rows and/or columns of the matrices.
The set of matrices gets divided into classes of matrices that can be transformed into one another. The goal of the problem is to count the number of these classes. In technical terminology the set of classes is called the quotient of the set by the action of the group, or orbit space.
The good thing is that there is a powerful theorem (with many generalizations and versions) that does exactly that: Polya's enumeration theorem. The theorem expresses the number of elements of the orbit space in terms of the value of a polynomial known in the area as the Cycle Index. Now, in this problem the group is a direct product of two special groups: the groups of all permutations of h elements and of w elements, respectively. The Cycle Index polynomials for these groups are known, and so are formulas for computing the Cycle Index polynomial of a product of groups in terms of the Cycle Index polynomials of the factors.
Maybe a comment worth making that motivates the name of the polynomial is the following:
Every permutation of elements can be seen as cycling disjoint subsets of those elements. For example, a permutation of (1,2,3,4,5) can be (2,3,1,5,4), where we mean that 2 moved to the position of 1, 3 moved to the position of 2, 1 to the position of 3, 5 to the position of 4, and 4 to the position of 5. The effect of this permutation is the same as cycling 1 -> 3 -> 2 and 2 back to 1, and cycling 4 -> 5 and 5 back to 4. Similar to how natural numbers can be factored into a product of primes, each permutation can be factored into disjoint cycles, and this factorization is essentially unique. The Cycle Index polynomial is computed in terms of the number of cycles of each length for each permutation in the group.
Putting all these together we get that the total count is given by the last formula in the link.
Implementation
As seen in the final formula, we need to compute:
Partitions of a number
Greatest common divisors (gcd) of many numbers.
Factorials of many numbers.
For these, we can do:
To compute all partitions one can use the iterative algorithms here. Already written in Python here.
An efficient way to compute a single gcd is the Euclidean algorithm. However, since we are going to need the gcd of all pairs of numbers in a range, each one many times, it is better to pre-compute the full table of gcds all at once by dynamic programming. If a > b, then gcd(a,b) = gcd(a-b,b). This recurrence allows one to compute the gcd of larger pairs in terms of that of smaller pairs. The table starts from the initial values gcd(1,a) = gcd(a,1) = 1 and gcd(a,a) = a, for all a.
The same happens for factorials. The formula will require the factorials of all numbers in a range many times each. So, it is better to compute them all from the bottom up using that n! = n(n-1)! and 0!=1!=1.
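A sketch of both pre-computations (the bound N and the table layout are my choices):

```python
def precompute(N):
    # gcd table via gcd(a, b) = gcd(a - b, b) for a > b,
    # seeded with gcd(1, a) = gcd(a, 1) = 1 and gcd(a, a) = a.
    g = [[0] * (N + 1) for _ in range(N + 1)]
    for a in range(1, N + 1):
        g[a][1] = g[1][a] = 1
        g[a][a] = a
    for a in range(2, N + 1):
        for b in range(1, a):
            g[a][b] = g[b][a] = g[a - b][b]
    # factorials bottom-up via n! = n * (n-1)!
    fact = [1] * (N + 1)
    for n in range(2, N + 1):
        fact[n] = n * fact[n - 1]
    return g, fact
```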
An implementation in Python could look like this. Feel free to improve it.
I know this is copied code, but you have to:
1) write your own factorial function
2) write your own gcd function
3) cast to string before returning final value.

Fast way find strings within hamming distance x of each other in a large array of random fixed length strings

I have a large array with millions of DNA sequences which are all 24 characters long. The DNA sequences should be random and can only contain A,T,G,C,N. I am trying to find strings that are within a certain hamming distance of each other.
My first approach was calculating the hamming distance between every pair of strings, but this would take way too long.
My second approach used a masking method to create all possible variations of the strings, store them in a dictionary, and then check whether a variation was found more than once. This worked pretty fast (20 min) for a hamming distance of 1, but it is very memory-intensive and would not be viable for a hamming distance of 2 or 3.
Python 2.7 implementation of my second approach.
sequences = []
masks = {}
for sequence in sequences:
    for i in range(len(sequence)):
        try:
            masks[sequence[:i] + '?' + sequence[i + 1:]].append(sequence[i])
        except KeyError:
            masks[sequence[:i] + '?' + sequence[i + 1:]] = [sequence[i], ]

matches = {}
for mask in masks:
    if len(masks[mask]) > 1:
        matches[mask] = masks[mask]
I am looking for a more efficient method. I came across Trie-trees, KD-trees, n-grams and indexing but I am lost as to what will be the best approach to this problem.
One approach is Locality Sensitive Hashing
First, you should note that this method does not necessarily return all the pairs, it returns all the pairs with a high probability (or most pairs).
Locality Sensitive Hashing can be summarised as: data points that are located close to each other are mapped to similar hashes (in the same bucket with a high probability). Check this link for more details.
Your problem can be recast mathematically as:
Given N vectors v ∈ R^{24}, N << 5^24, and a maximum hamming distance d, return the pairs which have a hamming distance at most d.
The way you'll solve this is to randomly generate K planes {P_1, P_2, ..., P_K} in R^{24}, where K is a parameter you'll have to experiment with. For every data point v, you'll define the hash of v as the tuple Hash(v) = (a_1, a_2, ..., a_K), where a_i ∈ {0,1} denotes whether v is above or below plane P_i. You can prove (I'll omit the proof) that if the hamming distance between two vectors is small, then the probability that their hashes are close is high.
So, for any given data point, rather than checking all the data points in the sequences, you only check data points in the bins of "close" hashes.
Note that this is very heuristic and will need you to experiment with K and with how "close" to each hash you want to search. As K increases, the number of bins grows exponentially, but points that do share a bin are increasingly likely to be similar.
Judging by what you said, it looks like you have a gigantic dataset, so I thought I would throw this out there for you to consider.
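For strings under Hamming distance specifically, the simplest LSH family samples character positions rather than hyperplanes; here is a sketch of that variant (k, the table count, and the seed are knobs I chose arbitrarily, and every candidate pair still needs an exact Hamming-distance check afterwards):

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(seqs, k=8, tables=10, seed=0):
    # Each table hashes every string by the characters at k random
    # positions; strings sharing a bucket become candidate pairs.
    rng = random.Random(seed)
    length = len(seqs[0])
    candidates = set()
    for _ in range(tables):
        positions = rng.sample(range(length), k)
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets[tuple(s[p] for p in positions)].append(idx)
        for bucket in buckets.values():
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    candidates.add((bucket[i], bucket[j]))
    return candidates  # verify each candidate with an exact Hamming check
```

Larger k makes each table more selective (fewer false candidates); more tables raise the chance that a truly close pair collides in at least one of them.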
Found my solution here: http://www.cs.princeton.edu/~rs/strings/
This uses ternary search trees and took only a couple of minutes and ~1GB of ram. I modified the demo.c file to work for my use case.

Grouping arbitrary arrays of data into N bins

I want to group an arbitrary-sized array of random values into n groups, such that the sum of values in any one group/bin is as equal as possible.
So for values [1, 2, 4, 5] and n = 2, the output buckets should be [5+1, 4+2], i.e. two groups each summing to 6.
Some possibilities that occur to me:
Full exhaustive breadth first search
Random processes with stopping conditions hard coded
Start from one end of the sorted array, grouping until the sum reaches the global average, then move on to the next group until n groups are filled
It seems the optimal solution (where the sums of the bins' contents are as equal as possible given the input array) is probably non-trivial, so at the moment I'm leaning towards the last option, but I have the feeling I may be missing more elegant solutions.
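A minimal sketch of that last option (sort descending, then fill each bin until it reaches the global average):

```python
def greedy_bins(values, n):
    # Fill bin i until its sum reaches the global average, then move on.
    target = sum(values) / n
    bins = [[] for _ in range(n)]
    i = 0
    for v in sorted(values, reverse=True):
        if sum(bins[i]) >= target and i < n - 1:
            i += 1
        bins[i].append(v)
    return bins
```

On the example above this gives [[5, 4], [2, 1]] (sums 9 and 3), which illustrates why this greedy pass can be far from optimal.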
This is an NP-hard problem. In other words, no known polynomial-time algorithm is guaranteed to find an optimal solution, and exploring all combinations takes n^M time (where M is the size of your array, and n the number of bins). It's a problem very similar to clustering, which is also NP-hard.
If your data set is small enough to deal with, a brute force algorithm is best (explore all combinations).
However, if your data set is big, you'll want a polynomial-time algorithm that won't get you the optimal solution, but a good approximation. In that case, I suggest you use something similar to K-Means...
Step 1. Calculate the expected sum per bin. Let A be your array, then the expected sum per bin is SumBin = SUM(A) / n (the sum of all elements in your array over the number of bins).
Step 2. Put all elements of your array in some collection (e.g. another array) that we'll call The Bag (this is just a conceptual, so you understand the next steps).
Step 3. Partition The Bag into n groups (preferably randomly, so that each element ends up in some bin i with probability 1/n). At this point, your bins have all the elements, and The Bag is empty.
Step 4. Calculate the sum for each bin. If result is the same as last iteration, exit. (this is the expectation step of K-Means)
Step 5. For each bin i, if its sum is greater than SumBin, pick the first element greater than SumBin and put it back in The Bag; if its sum is less than SumBin, pick the first element less than SumBin and put it back in The Bag. This is the gradient descent step (aka maximization step) of K-Means.
Step 6. Go to step 3.
This algorithm is just an approximation, but it's fast and guaranteed to converge.
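The steps above can be sketched as follows. The description leaves step 5 slightly open (a bin may contain no element larger than SumBin), so this version moves the largest element out of an overfull bin and the smallest out of an underfull one; the max_iters safety bound is also my addition:

```python
import random

def balance_bins(values, n, max_iters=100, seed=0):
    rng = random.Random(seed)
    target = sum(values) / n              # step 1: expected sum per bin
    bins = [[] for _ in range(n)]
    bag = list(values)                    # step 2: everything starts in The Bag
    prev = None
    for _ in range(max_iters):
        for v in bag:                     # step 3: distribute The Bag randomly
            bins[rng.randrange(n)].append(v)
        bag = []
        sums = [sum(b) for b in bins]     # step 4: stop once sums stabilize
        if sums == prev:
            break
        prev = sums
        for b, s in zip(bins, sums):      # step 5: move offenders to The Bag
            if len(b) > 1 and s > target:
                bag.append(b.pop(b.index(max(b))))
            elif len(b) > 1 and s < target:
                bag.append(b.pop(b.index(min(b))))
    for v in bag:                         # redistribute anything left over
        bins[rng.randrange(n)].append(v)
    return bins
```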
If you are skeptical about a randomized algorithm like the above, after the first iteration when you are back to step 3, instead of assigning elements randomly, you can do so optimally by running the Hungarian algorithm, but I am not sure that will guarantee better over-all results.
