I implemented the shuffling algorithm as:
import random
a = list(range(1, n+1))  # a contains the elements 1 to n
for i in range(n):
    j = random.randint(0, n-1)
    a[i], a[j] = a[j], a[i]
This algorithm is biased. I just wanted to know, for any n (n ≤ 17), whether it is possible to find which permutation has the highest probability of occurring and which has the lowest probability out of all n! possible permutations. If yes, what are those permutations?
For example n=3:
a = [1,2,3]
There are 3^3 = 27 possible shuffles
Number of occurrences of each permutation:
1 2 3 = 4
3 1 2 = 4
3 2 1 = 4
1 3 2 = 5
2 1 3 = 5
2 3 1 = 5
P.S. I am not so good with maths.
This is not a proof by any means, but you can quickly come up with the distribution of placement probabilities by running the biased algorithm a million times. For a 7-element array it will look like the picture from Wikipedia: an unbiased shuffle would put 1/7 ≈ 14.3% in every field.
To get the most likely distribution, I think it's safe to just pick the highest percentage for each index. This means it's most likely that the entire array is moved down by one and the first element will become the last.
Edit: I ran some simulations and this result is most likely wrong. I'll leave this answer up until I can come up with something better.
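For small n you do not even have to simulate: every run of the biased shuffle corresponds to one of the n^n equally likely sequences of j values, so you can enumerate them all and count exactly which permutation each run produces. This is just a sketch and only feasible for small n (n^n explodes long before n = 17), but it reproduces the 4/5 counts from the question for n = 3:
from collections import Counter
from itertools import product

def exact_distribution(n):
    """Count how often each permutation results, over all n**n runs of the biased shuffle."""
    counts = Counter()
    for choices in product(range(n), repeat=n):   # one tuple of j values per possible run
        a = list(range(1, n + 1))
        for i, j in enumerate(choices):
            a[i], a[j] = a[j], a[i]
        counts[tuple(a)] += 1
    return counts

for perm, count in sorted(exact_distribution(3).items(), key=lambda kv: kv[1]):
    print(perm, count)   # three permutations occur 4 times, three occur 5 times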
I am trying to solve the first question from Google Kick Start Round B 2020, but I am getting an RE (Runtime Error). Can anyone tell me what I am doing wrong?
Problem
Li has planned a bike tour through the mountains of Switzerland. His tour consists of N checkpoints, numbered from 1 to N in the order he will visit them. The i-th checkpoint has a height of Hi.
A checkpoint is a peak if:
It is not the 1st checkpoint or the N-th checkpoint, and
The height of the checkpoint is strictly greater than the checkpoint immediately before it and the checkpoint immediately after it.
Please help Li find out the number of peaks.
Input
The first line of the input gives the number of test cases, T. T test cases follow. Each test case begins with a line containing the integer N. The second line contains N integers. The i-th integer is Hi.
Output
For each test case, output one line containing Case #x: y, where x is the test case number (starting from 1) and y is the number of peaks in Li's bike tour.
Limits
Time limit: 10 seconds per test set.
Memory limit: 1GB.
1 ≤ T ≤ 100.
1 ≤ Hi ≤ 100.
Test set 1
3 ≤ N ≤ 5.
Test set 2
3 ≤ N ≤ 100.
Sample
Input

4
3
10 20 14
4
7 7 7 7
5
10 90 20 90 10
3
10 3 10

Output

Case #1: 1
Case #2: 0
Case #3: 2
Case #4: 0
In sample case #1, the 2nd checkpoint is a peak.
In sample case #2, there are no peaks.
In sample case #3, the 2nd and 4th checkpoint are peaks.
In sample case #4, there are no peaks.
My code for that question:
for z in range(int(input())):
    no_range,s = input(),list(map(int,input().split()))
    if s[0] and s[-1] < max(s[1:-1]):
        print(f'Case #{z+1}:',s.count(max(s[1:-1])))
    else:
        print(f'Case #{z+1}:',0)
1. Wrong if-conditional:
if s[0] and s[-1] < max(s[1:-1]):
treats s[0] as a bare truth value: the left operand is True as soon as s[0] is different from 0, so only s[-1] is actually compared with the maximum. See How to test multiple variables against a value?:
You could fix the comparison to be
if s[0] < max(s[1:-1]) and s[-1] < max(s[1:-1]):
but that makes no sense with respect to what your task is: you do not have to check whether the first and last numbers are less than the maximal height of the numbers in between.
2. Wrong logic:
Given data of
20, 10, 15, 10, 50, 70
your code may report the correct count, but for the wrong reason: the peak that counts here is the 15 in 10,15,10. The 50 (the max value of s[1:-1]) is not a peak at all, because it is followed by 70!
3. Wrong counting:
Furthermore, s.count(max(s[1:-1])) does not check triplets at all; it only counts how often the maximum value occurs in your list.
Fix:
for z in range(int(input())):
    _, s = input(), list(map(int, input().split()))
    # create triplets of values - this will automagically get rid of the corner cases
    r = zip(s, s[1:], s[2:])
    # count how often the middle value is greater than the other 2 values
    print(f'Case #{z+1}: {sum(a < b > c for a, b, c in r)}')
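As a quick check (my addition, not part of the original answer), the triplet approach gives the expected result on sample case #3, which has two peaks:
s = [10, 90, 20, 90, 10]
print(sum(a < b > c for a, b, c in zip(s, s[1:], s[2:])))   # -> 2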
I am writing this algorithm for a sort, and I fail to see how it is different from insertion sort. I was wondering if someone could help me understand the difference. The current sort is written as insertion sort because I don't see the difference yet. This is homework, so I don't want an answer; I want to understand the difference. The algorithm is here
def file_open(perkList,fileName):
    with open(fileName, 'r') as f:
        for line in f.readlines():
            perkList.append(int(line))

def perkSort(perkList):
    for marker in range(len(perkList)):
        save = perkList[marker]
        i = marker
        while i < len(perkList) and perkList[i+1] > save:
            perkList[i] = perkList[i-1]
            i = i - 1
        perkList[i] = save
    print("New list",perkList)

def main():
    perkList = []
    file_open(perkList,'integers')
    file_open(perkList,'integers2')
    print("initial list",perkList)
    perkSort(perkList)

main()
Apologies that this question is not that clean. Edits are appreciated.
The PerkSort algorithm mentioned in your homework is essentially the Bubble sort algorithm. What you have implemented is the Insertion sort algorithm. The difference is as follows:
Insertion Sort
It works by inserting an element from the input list into the correct position within the part of the list that is already sorted. That is, it builds the sorted array one item at a time.
## Unsorted List ##
7 6 1 3 2
# First Pass
7 6 1 3 2
# Second Pass
6 7 1 3 2
# Third Pass
1 6 7 3 2
# Fourth Pass
1 3 6 7 2
# Fifth Pass
1 2 3 6 7
Note that after i iterations the first i elements are ordered.
The inner loop runs at most i times on the i-th step.
Pseudocode:
for i ← 1 to length(A) - 1
j ← i
while j > 0 and A[j-1] > A[j]
swap A[j] and A[j-1]
j ← j - 1
This is what you did in your python implementation.
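If it helps to see the pseudocode as runnable code, here is a minimal Python rendering of it (my own sketch, not your assignment's code):
def insertion_sort(a):
    for i in range(1, len(a)):
        j = i
        # swap the new element backwards until the prefix a[:i+1] is ordered
        while j > 0 and a[j-1] > a[j]:
            a[j-1], a[j] = a[j], a[j-1]
            j -= 1
    return a

print(insertion_sort([7, 6, 1, 3, 2]))   # [1, 2, 3, 6, 7]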
Some complexity analysis:
Worst case performance: O(n²)
Best case performance: O(n)
Average case performance: O(n²)
Worst case space complexity: O(n) total, O(1) auxiliary
Bubble Sort
This is the PerkSort algorithm you were given to implement in your homework.
It works by repeatedly scanning through the list to be sorted, comparing pairs of adjacent elements and swapping them if they are in the wrong order.
## Unsorted List ##
7 6 1 3 2
# First Pass
6 1 3 2 7
# Second Pass
1 3 2 6 7
# Third Pass
1 2 3 6 7
# Fourth Pass
1 2 3 6 7
# No Fifth Pass as there were no swaps in Fourth Pass
Note that after i iterations the last i elements are the biggest, and ordered.
The inner loop runs at most n-i-1 times on the i-th step.
I am not giving pseudocode here as this is your homework assignment.
Hint: You will move forward from marker, so that the larger elements are shifted towards the end, just like bubbling.
Some complexity analysis:
Worst case performance: O(n²)
Best case performance: O(n)
Average case performance: O(n²)
Worst case space complexity: O(n) total, O(1) auxiliary
Similarities
Both have the same worst case, average case and best case time complexities
Both have the same space complexities
Both are in-place algorithms (i.e. they change the original data)
Both are comparison sorts
Differences ( Apart from algorithm, of course )
Even though both algorithms have the same time and space complexities on average, in practice Insertion sort is better than Bubble sort. This is because, on average, Bubble sort needs more swaps than Insertion sort. Insertion sort also performs well on a list with a small number of inversions.
The program that you have written does implement insertion sort.
Let's take an example and see what your program would do. For the input 5 8 2 7:
After first iteration
5 8 2 7
After second iteration
2 5 8 7
After third iteration
2 5 7 8
But the algorithm that is given in your link works differently. It takes the largest element and puts it at the end. For our example:
After first iteration
5 2 7 8
After second iteration
2 5 7 8
Hi, I'm trying to figure out a function where, given the length n of a list [x1, x2, ..., xn], it returns how many digits a base 2 number system would need to assign a unique code to each value in the list.
For example, one digit can hold two unique values:
x1 0
x2 1
two digits can hold four:
x1 00
x2 01
x3 10
x4 11
etc. I'm trying to write a python function calcBitDigits(myListLength) that takes this list length and returns the number of digits needed. calcBitDigits(2) = 1, calcBitDigits(4) = 2, calcBitDigits(3) = 2, etc.
>>> for i in range(10):
... print i, i.bit_length()
0 0
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 4
I'm not clear on exactly what it is you want, but it appears you want to subtract 1 from what bit_length() returns - or maybe not ;-)
On third thought ;-), maybe you really want this:
def calcBitDigits(n):
return (n-1).bit_length()
At least that gives the result you said you wanted in each of the examples you gave.
Note: for an integer n > 0, n.bit_length() is the number of bits needed to represent n in binary. (n-1).bit_length() is really a faster way of computing int(math.ceil(math.log(n, 2))).
Clarification: I understand the original question now ;-) Here's how to think about the answer: if you have n items, then you can assign them unique codes using the n integers in 0 through n-1 inclusive. How many bits does that take? The number of bits needed to express n-1 (the largest of the codes) in binary. I hope that makes the answer obvious instead of mysterious ;-)
As comments pointed out, the argument gets strained for n=1. It's a quirk then that (0).bit_length() == 0. So watch out for that one!
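A small illustration of that reasoning (my own example, not from the original answer): with n items the codes are the integers 0 through n-1, so the widest code is n-1 written in binary.
n = 5
codes = [format(i, 'b') for i in range(n)]    # ['0', '1', '10', '11', '100']
print((n - 1).bit_length())                   # 3 - bits needed
print(max(len(c) for c in codes))             # 3 - width of the widest code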
Use the following -
import math
int(math.ceil(math.log(x,2)))
where x is the list length.
Edit:
For x = 1, we need to have a separate case that would return 1. Thanks @thefourtheye for pointing this out.
I am not comfortable with the other answers, since most of them fail at the corner case (when n == 1). So, I wrote this based on Tim's answer.
def calcBitDigits(n):
    if n <= 0: return 0
    elif n <= 2: return 1
    return (n-1).bit_length()

for i in range(10):
    print i, calcBitDigits(i)
Output
0 0
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 3
9 4
from math import log
x = int(log(n, 2)) + 1
x will be the number of bits required to store the integer value n.
If for some reason you don't want to use .bit_length, here's another way to find it.
from itertools import count
def calcBitDigits(n):
    return next(i for i in count() if 1 << i >= n)
Building off an earlier question: Computing stats on generators in single pass. Python
As I mentioned before, computing statistics from a generator in a single pass is extremely fast and memory efficient. Complex statistics and rank attributes like the 90th percentile and the nth smallest often need more complex work than standard deviation and averages (solved in the above). These approaches become very important when working with map/reduce jobs and large datasets where putting the data into a list or making multiple passes becomes very slow.
The following is a quickselect-style algorithm (O(n) on average) for looking up data based on rank order. It is useful for finding medians, percentiles, quartiles, and deciles, and is equivalent to data[n] when the data is already sorted. But it needs all the data in a list that can be split/pivoted.
How can you compute medians, percentiles, quartiles, and deciles with a generator on a single pass?
The Quicksort style algorithm that needs a complete list
import random
def select(data, n):
    "Find the nth rank ordered element (the least value has rank 0)."
    data = list(data)
    if not 0 <= n < len(data):
        raise ValueError('not enough elements for the given rank')
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [], []
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        if n < len(under):
            data = under
        elif n < len(under) + pcount:
            return pivot
        else:
            data = over
            n -= len(under) + pcount
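For example (my addition), the median of an odd-length list is just the element of rank len(data) // 2:
data = [7, 1, 5, 3, 9, 2, 8, 6, 4]
print(select(data, len(data) // 2))   # -> 5, the median of 1..9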
You will need to store large parts of the data. Up to the point where it may just pay off to store it completely. Unless you are willing to accept an approximate algorithm (which may be very reasonable when you know your data is independent).
Consider you need to find the median of the following data set:
0 1 2 3 4 5 6 7 8 9 -1 -2 -3 -4 -5 -6 -7 -8 -9
The median is obviously 0. However, if you have seen only the first 10 elements, it is your worst guess at that time! So in order to find the median of an n element stream, you need to keep at least n/2 candidate elements in memory. And if you do not know the total size n, you need to keep all!
Here are the medians for every odd-sized situation:
0 _ 1 _ 2 _ 3 _ 4 _ 4 _ 3 _ 2 _ 1 _ 0
While they were never median candidates above, you also need to remember the elements 5 to 9, because the continued stream
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
yields the median 9. For every element in a series of size n I can find a continued series of size O(2*n) that has this element as median. But obviously, these series are not random / independent.
See "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis? for an overview of related methods.
How can I optimize this edit distance code, i.e. finding the number of bits changed between 2 values? e.g. word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'
When I try to run it on Hadoop, it takes ages to complete.
How can I reduce the for loops and comparisons?
#!/usr/bin/python
import os, re, string, sys
from numpy import zeros
def calculateDistance(word1, word2):
    x = zeros((len(word1)+1, len(word2)+1))
    for i in range(0, len(word1)+1):
        x[i, 0] = i
    for i in range(0, len(word2)+1):
        x[0, i] = i
    for j in range(1, len(word2)+1):
        for i in range(1, len(word1)+1):
            if word1[i-1] == word2[j-1]:
                x[i, j] = x[i-1, j-1]
            else:
                minimum = x[i-1, j] + 1
                if minimum > x[i, j-1] + 1:
                    minimum = x[i, j-1] + 1
                if minimum > x[i-1, j-1] + 1:
                    minimum = x[i-1, j-1] + 1
                x[i, j] = minimum
    return x[len(word1), len(word2)]
I looked for a bit counting algorithm online, and I found this page, which has several good algorithms. My favorite there is a one-line function which claims to work for Python 2.6 / 3.0:
return sum( b == '1' for b in bin(word1 ^ word2)[2:] )
I don't have Python, so I can't test, but if this one doesn't work, try one of the others. The key is to count the number of 1's in the bitwise XOR of your two words, because there will be a 1 for each difference.
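Since word1 and word2 in the question are bit strings rather than integers, they would have to be converted first; a rough sketch of that idea (my addition):
word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'

# XOR the integer values and count the 1 bits: one 1 per differing position.
diff = int(word1, 2) ^ int(word2, 2)
print(bin(diff).count('1'))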
You are calculating the Hamming distance, right?
EDIT: I'm trying to understand your algorithm, and the way you're manipulating the inputs, it looks like they are actually arrays, and not just binary numbers. So I would expect that your code should look more like:
return sum( a != b for a, b in zip(word1, word2) )
EDIT2: I've figured out what your code does, and it's not the Hamming distance at all! It's actually the Levenshtein distance, which counts how many additions, deletions, or substitutions are needed to turn one string into another (the Hamming distance only counts substitutions, and so is only suitable for equal length strings of digits). Looking at the Wikipedia page, your algorithm is more or less a straight port of the pseudocode they have there.
As they point out, the time and space complexity of a comparison of strings of length m and n is O(mn), which is pretty bad. They have a few suggestions of optimizations depending on your needs, but I don't know what you use this function for, so I can't say what would be best for you. If the Hamming distance is good enough for you, the code above should suffice (time complexity O(n)), but it gives different results on some sets of strings, even if they are of equal length, like '0101010101' and '1010101010', which have Hamming distance 10 (flip all bits) and Levenshtein distance 2 (remove the first 0 and add it at the end).
Since you haven't specified what edit distance you're using yet, I'm going to go out on a limb and assume it's the Levenshtein distance. In which case, you can shave off some operations here and there:
def levenshtein(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space.
        # Not really important to the algorithm anyway.
        a, b = b, a
        n, m = m, n
    current = range(n+1)
    for i in range(1, m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
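A quick usage check (my addition), reusing the strings from the previous answer to show how the two distances differ:
a, b = '0101010101', '1010101010'
print(levenshtein(a, b))                    # 2  - drop the leading 0, append a 0
print(sum(x != y for x, y in zip(a, b)))    # 10 - Hamming: every position differs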
Edit: also, you make no mention of your dataset. Depending on its characteristics, the implementation might be changed to take advantage of it.
Your algorithm seems to do a lot of work. It compares every bit to all bits in the opposite bit vector, meaning you get an algorithmic complexity of O(m*n). That is unnecessary if you are computing Hamming distance, so I assume you're not.
Your loop builds an x[i,j] matrix looking like this:
0 1 0 0 0 0 0 0 1 0 0 ... (word1)
0 0 1 0 0 0 0 0 0 1
1 1 0 1 1 1 1 1 1 0
0 0 1 0 1 1 1 1 1 1
0 0 1 1 0 1 1 1 1 2
0 0 1 1 1 0 1 1 1 2
0 0 1 1 1 1 0 1 1 2
1
1
...
(example word2)
This may be useful for detecting certain types of edits, but without knowing what edit distance algorithm you are trying to implement, I really can't tell you how to optimize it.