Why does this not seem to be random? - python

I was running a simulation of one of those games where people try to guess a number between 0 and 100, with 100 people guessing. I then averaged how many different guesses there were.
import random
def averager(times):
    tests=[]
    for i in range(times):
        l=[]
        for i in range(0,100):
            l.append(random.randint(0,100))
        tests.append(len(set(l)))
    return (sum(tests))/len(tests)
print(averager(1000))
For some reason, the number of different guesses averages out to 63.6.
Why is this? Is it due to a flaw in the Python random library?
In a scenario where people were guessing a number between 1 and 10
The first person has a 100% chance to guess a previously unguessed number
The second person has a 90% chance to guess a previously unguessed number
The third person has a 80% chance to guess a previously unguessed number
and so on...
The average chance of guessing a new number(by my reasoning) is 55%.
But the data doesn't reflect this.

Your code finds the average number of unique guesses made by 100 people, each guessing a number from 0 to 100 (random.randint includes both endpoints, so there are 101 possible values).
As for why it converges to a number around 63... you should post your question to the math Stack Exchange.

If this was a completely flat distribution, you would expect the average to come out as 100, meaning everybody's guess was different. However, you know that such a scenario is much less random than a scenario where you have duplication. The fact that you get repeated numbers during a random sequence should be comforting.
All you are doing here is measuring some kind of uniqueness within very small sets, i.e. 1000 repeats of an experiment involving 100 random values. You might get a better appreciation of this if you use some sort of bootstrapping algorithm to sample from.
Also, if you scale up the number of repeats to millions, and perhaps measure the sample distribution (not just the mean), you'll have a little more confidence in the results you're getting.
It may be that the pseudo-random generator has a characteristic which yields approximately 60-70% non-repeated values inside a sequence the same length as the range. However, you would need to experiment with far more samples, as well as different random seeds. Otherwise your results are meaningless.
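The suggestion to look at the whole sample distribution under several seeds can be tried with a short sketch like the one below. The function name unique_counts, the seeds, and the number of repeats are illustrative assumptions, not part of the original code; the guessing range matches the question's random.randint(0, 100).

```python
import random
import statistics

def unique_counts(repeats, people=100, low=0, high=100, seed=None):
    """Return the number of distinct guesses in each of `repeats` rounds."""
    rng = random.Random(seed)  # separate generator so the seed is explicit
    return [len({rng.randint(low, high) for _ in range(people)})
            for _ in range(repeats)]

for seed in (1, 2, 3):  # arbitrary seeds chosen for illustration
    counts = unique_counts(2000, seed=seed)
    # Report the spread of the results, not just the mean.
    print(seed, statistics.mean(counts), statistics.stdev(counts),
          min(counts), max(counts))
```

If the mean and spread stay similar from seed to seed, that points toward a property of the experiment itself and away from a quirk of one particular random sequence.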

I modified your code so it would take an already generated sequence as input, rather than calculating random numbers:
def averager(seqs):
    tests = []
    for s in seqs:
        tests.append(len(set(s)))
    return float(sum(tests))/len(tests)
Then I made a function to return all possible choices for any given number of people and guess range:
import itertools
def combos(n, limit):
    return itertools.product(*((range(limit),) * n))
(One of the things I love about Python is that it's so easy to break apart a function into trivial pieces.)
Then I started testing with increasing numbers:
for n in range(2,100):
    x = averager(combos(n, n))
    print n, x, x/n
2 1.5 0.75
3 2.11111111111 0.703703703704
4 2.734375 0.68359375
5 3.3616 0.67232
6 3.99061213992 0.66510202332
7 4.62058326038 0.660083322911
8 5.25112867355 0.656391084194
This algorithm has a horrible complexity, so at this point I got a MemoryError. As you can see, the percentage of unique results keeps dropping as the number of people and the guess range keep increasing.
Repeating the test with random numbers:
def rands(repeats, n, limit):
    for i in range(repeats):
        yield [random.randint(0, limit) for j in range(n)]
for n in range(10, 101, 10):
    x = averager(rands(10000, n, n))
    print n, x, x/n
10 6.7752 0.67752
20 13.0751 0.653755
30 19.4131 0.647103333333
40 25.7309 0.6432725
50 32.0471 0.640942
60 38.3333 0.638888333333
70 44.6882 0.638402857143
80 50.948 0.63685
90 57.3525 0.63725
100 63.6322 0.636322
As you can see the results are consistent with what we saw earlier and with your own observation. I'm sure a bit of combinatorial math could explain it all.
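The combinatorial explanation can be sketched with linearity of expectation: a given value is missed by all of the guesses with probability (1 - 1/values)^people, so the expected number of distinct values is values * (1 - (1 - 1/values)^people). The function name expected_unique below is an illustrative assumption, not something from the original answers.

```python
def expected_unique(people, values):
    """Expected number of distinct values when `people` independent,
    uniform guesses are drawn from `values` equally likely options."""
    # One guess misses a given value with probability 1 - 1/values;
    # all guesses miss it with probability (1 - 1/values) ** people.
    return values * (1 - (1 - 1.0 / values) ** people)

# Exhaustive cases from the table above: n people, n possible values.
for n in (2, 3, 4):
    print(n, expected_unique(n, n))

# The question's setup: 100 people, randint(0, 100) has 101 possible values.
print(expected_unique(100, 101))
```

The first rows reproduce the exhaustive results (1.5 for n = 2, about 2.111 for n = 3), and the last line comes out near 63.66, in line with the simulated averages. As n grows, the ratio tends to 1 - 1/e, roughly 0.632.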

Related

Been using rand.int for a while and seeing unexpected results

I've been running some code for an hour or so that uses random.randint to model rolling a ten-sided die. The die has to be rolled six times in a row and land on the same number every time, and the code tracks how many tries it takes for this to happen.
success = 0
times = 0
count = 0
total = 0
for h in range(0,100):
    for i in range(0,10):
        times = 0
        while success == 0:
            numbers = [0,0,0,0,0,0,0,0,0,0]
            for j in range(0,6):
                x = int(random.randint(0,9))
                numbers[x] = 1
            count = numbers.count(1)
            if count == 1:
                success = 1
            else:
                times += 1
        print(i)
        total += times
        success = 0
    randtst = open("RandomTesting.txt", "a" )
    randtst.write(str(total / 10)+"\n")
    randtst.close()
And running this code, this has been going into a file, the contents of which is below
https://pastebin.com/7kRK1Z5f
And taking the average of these numbers using
newtotal = 0
totalamounts = 0
with open ('RandomTesting.txt', 'rt') as rndtxt:
    for myline in rndtxt:
        newtotal += float(myline)
        totalamounts += 1
print(newtotal / totalamounts)
This returns 742073.7449342106. I think this number is incorrect, as it is not near 10^6. I tried clearing the file and running everything again, but to no avail: the number is still nowhere near 10^6. Can anyone see a problem with this?
Note: I am not asking for fixes to the code; I am asking whether something has gone wrong to produce the above number rather than 100,000.
There are several issues working against you here. Bottom line up front:
your code doesn't do what you described as your intent;
you currently have no yardstick for measuring whether your results agree with the theoretical answer; and
your expectations regarding the correct answer are incorrect.
I felt that your code was overly complex for the task you were describing, so I wrote my own version from scratch. I factored out the basic experiment of rolling six 10-sided dice and checking to see if the outcomes were all equal by creating a list of length 6 comprised of 10-sided die rolls. Borrowing shamelessly from BoarGules' comment, I threw the results into a set—which only stores unique elements—and counted the size of the set. The dice are all the same value if and only if the size of the set is 1. I kept repeating this while the number of distinct elements was greater than 1, maintaining a tally of how many trials that required, and returned the number of trials once identical die rolls were obtained.
That basic experiment is then run for any desired number of replications, with the results placed in a numpy array. The resulting data was processed by numpy and scipy to yield the average number of trials and a 95% confidence interval for the mean. The confidence interval uses the estimated variability of the results to construct a lower and an upper bound for the mean. The bounds produced this way should contain the true mean for 95% of estimates generated in this way if the underlying assumptions are met, and address the second point in my BLUF.
Here's the code:
import random
import scipy.stats as st
import numpy as np

NUM_DIGITS = 6
SAMPLE_SIZE = 1000

def expt():
    # Count how many sets of NUM_DIGITS rolls are needed until all rolls match.
    num_trials = 1
    while(len(set([random.randrange(10) for _ in range(NUM_DIGITS)])) > 1):
        num_trials += 1
    return num_trials

data = np.array([expt() for _ in range(SAMPLE_SIZE)])
mu_hat = np.mean(data)
# Confidence level is passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`.
ci = st.t.interval(0.95, df=SAMPLE_SIZE-1, loc=mu_hat, scale=st.sem(data))
print(mu_hat, ci)
The probability of producing 6 identical results of a particular value from a 10-sided die is 10^-6, but there are 10 possible particular values so the overall probability of producing all duplicates is 10 * 10^-6, or 10^-5. Consequently, the expected number of trials until you obtain a set of duplicates is 10^5. The code above took a little over 5 minutes to run on my computer, and produced 102493.559 (96461.16185897154, 108525.95614102845) as the output. Rounding to integers, this means that the average number of trials was 102493 and we're 95% confident that the true mean lies somewhere between 96461 and 108526. This particular range contains 10^5, i.e., it is consistent with the expected value. Rerunning the program will yield different numbers, but 95% of such runs should also contain the expected value, and the handful that don't should still be close.
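The same reasoning can be checked quickly on a smaller case. The sketch below uses 3 dice instead of 6 so that it finishes in well under a second; the function names and the seed are assumptions made for illustration. The number of trials until the first success follows a geometric distribution, whose mean is 1/p.

```python
import random

def trials_until_all_equal(num_dice, sides, rng):
    """Roll `num_dice` dice repeatedly; return how many sets were rolled
    until every die in a set showed the same face."""
    trials = 1
    while len({rng.randrange(sides) for _ in range(num_dice)}) > 1:
        trials += 1
    return trials

def expected_trials(num_dice, sides):
    """Mean of the geometric distribution with success probability
    sides * (1/sides) ** num_dice."""
    p_success = sides * (1.0 / sides) ** num_dice
    return 1.0 / p_success

rng = random.Random(12345)  # fixed seed so the run is repeatable
sample = [trials_until_all_equal(3, 10, rng) for _ in range(5000)]
print(sum(sample) / float(len(sample)), expected_trials(3, 10))
print(expected_trials(6, 10))  # the six-dice case discussed above
```

With three dice the theoretical mean is 100, and the simulated average should land close to it; the six-dice case gives the 10^5 figure quoted above.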
Might I suggest that, if you're working with whole numbers, you should be getting a whole number back instead of a floating-point value (if I'm understanding what you're trying to do).
##randtst.write(str(total / 10)+"\n") Original
##randtst.write(str(total // 10)+"\n")
Using floor division instead of true division rounds the result down to a whole number, which is more ideal for what you're trying to do.
If you ARE using floating-point numbers, perhaps use % instead. This divides your number but returns ONLY the remainder.
% is modulo in Python
// is floor division in Python
Those operators will keep your numbers stable and easier to work with if your total comes back as a floating-point value.
If this isn't the case, you will have to account for every digit to the right of the decimal point.
And if this IS the case, your result will never reach 10^6 because the line for totalling your value is stuck in a loop.
I hope this helps you in some way, and if not, please let me know, as I'm also learning Python.

How to solve this problem without using a structure? (Getting Time Limit Exceeded)

Recently I took part in a competition for middle school girls. I ran across this problem and I have been working on it for a few weeks. Here is the problem:
I. Ventilator Shipments
At the local hospital, Gabriela keeps track of all the ventilator shipments. Recently, a new factory has been established to produce ventilators. She knows that the new factory is almost extraordinary in its production, as on a certain day Di, it produces the same amount of ventilators as the product of the previous K days' production. However, the hospital's computer can only handle non-negative numbers less than P, a prime number. Gabriela knows the production value, Di, for each of the first K days. Accordingly, Gabriela wants to know how many ventilators are produced after N days. If this number is greater than or equal to P, the computer displays the remainder of the number of ventilators produced divided by P.
Input
Line 1: Three space-separated integers N, K, P
Lines 2...K+1: A single integer Di
Output
Line 1: Number of ventilators produced after N days as displayed by the computer
Example Input:
5 2 7
1
3
Output:
6
Note:
2 ≤ N ≤ 1000000
1 ≤ K ≤ N
2 ≤ P ≤ 1000003 (where P is guaranteed to be prime)
1 ≤ Di ≤ P−1
The time limit for this problem has been extended to 2000 ms.
I have tried 3 different methods
Here is the first:
import math
import sys
string=sys.stdin.readline()
string=string.rstrip()
arr=[0]*3
arr=string.split(' ')
n=int(arr[0])
k=int(arr[1])
p=int(arr[2])
mylist=[0]*k
for i in range (k):
    a=int(sys.stdin.readline())
    mylist[i]=a%p
product=math.prod(mylist)
for start in range (n-k):
    smallest=mylist[start%k]
    mylist[start%k]=(product%p)
    product=product*(product%p)
    product=product//smallest
sys.stdout.write (str(mylist[start%k]))
In another method I used a queue:
import math
from collections import deque
import sys
string=sys.stdin.readline()
string=string.rstrip()
arr=[0]*3
arr=string.split(' ')
n=int(arr[0])
k=int(arr[1])
p=int(arr[2])
q=deque()
for i in range (k):
    a=int(sys.stdin.readline())
    q.append(a%p)
product=math.prod(q)
for i in range (n-k):
    q.append(product%p)
    product=product*(product%p)
    smallest=q.popleft()
    product=product//smallest
sys.stdout.write (str(q.pop())+'\n')
However, I'm still getting time limit exceeded on test case 8. Given the time and space constraints, I don't think I can use any kind of structure (list, queue, etc.) to solve this problem. Can someone give me an idea of how to solve this problem?
The problem is not with your data structures so much as your algorithmic overhead. Your first attempt includes a multiplication and five divisions in each loop, plus two list accesses and four assignments. Your second attempt has three divisions, three assignments, and two list-changing operations.
You might want to experiment a little to determine roughly how many operations you can perform in 2 seconds. How long does it take you to run 10^6 loop iterations with a trivial body? I suspect that you're not going to be able to carry out an iterative solution.
Instead of carrying out each iterative computation individually, try focusing on the problem as given. You do not need each day's output; you need only to compute the final day's output, modulo p. That production is a high-order product of the input production sequence (the "seed" days of production). How many times does each of those days appear in that final product? For large n, what is the cycle of values produced? Most importantly, what exponent gives you a modular residue of 1? (By Fermat's little theorem, it is p-1 for any factor that is not a multiple of p.)
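The point about residues of 1 can be illustrated with Python's three-argument pow. Because a^(p-1) leaves remainder 1 modulo a prime p whenever a is not a multiple of p, an exponent can be reduced modulo p-1 before exponentiating. The helper name reduce_exponent and the sample values below are arbitrary choices for illustration.

```python
def reduce_exponent(a, e, p):
    """Compute a**e mod p for prime p (a not a multiple of p) after first
    shrinking the exponent modulo p - 1, as Fermat's little theorem allows."""
    return pow(a, e % (p - 1), p)

p = 1000003      # the largest prime P allowed by the problem statement
a = 123456       # arbitrary base with 1 <= a <= p - 1
e = 10**18 + 7   # arbitrary huge exponent

print(pow(a, p - 1, p))                            # residue of a**(p-1)
print(pow(a, e, p) == reduce_exponent(a, e, p))    # both routes agree
```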
Compute how many times each factor appears in the final product; call it use. Reduce that mod p-1. Now you have an expression such as
product = k[0] ** (use[0] % (p-1) ) *
          k[1] ** (use[1] % (p-1) ) *
          ...
print(product % p)
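A different technique from the exponent counting described above, offered here only as a sketch, is to keep the sliding-window product reduced modulo p at every step and to replace the exact integer division with a modular inverse. This relies on the problem's guarantees that P is prime and 1 ≤ Di ≤ P−1, so every value in the window is invertible. The function name nth_day_production is hypothetical, and whether this fits the judge's time limit has not been verified.

```python
def nth_day_production(n, k, p, seeds):
    """Return day n's production modulo the prime p, given the first k days.

    Assumes every seed is in 1..p-1, so each day's value is non-zero
    modulo p and has a modular inverse.
    """
    days = [d % p for d in seeds]
    if n <= k:
        return days[n - 1]

    # Product of the current window of k days, always kept below p.
    window = 1
    for d in days:
        window = window * d % p

    for t in range(k, n):  # t is the 0-based index of the day being produced
        days.append(window)
        # Slide the window: multiply in the new day, divide out the oldest.
        # Division modulo a prime is multiplication by x**(p-2) mod p.
        oldest_inverse = pow(days[t - k], p - 2, p)
        window = window * window % p * oldest_inverse % p
    return days[n - 1]

print(nth_day_production(5, 2, 7, [1, 3]))  # sample input from the problem
```

On the sample input this follows the sequence 1, 3, 3, 2, 6 and prints 6, matching the expected output. The key difference from the attempts in the question is that product never grows beyond p, so each step works on small integers.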

Rolling dice with random bits: Is my methodology flawed or am I overthinking things?

So I am simply playing around with trying to make a "dice roller" using random.getrandbits() and the "wasteful" methodology stated here: How to generate an un-biased random number within an arbitrary range using the fewest bits
My code seems to be working fine; however, when I roll D6s the Max/Min ratio is in the 1.004... range, but with D100s it's in the 1.05... range. Considering my dataset is only about a million rolls, is this OK, or is the pRNG nature of random affecting the results? Or am I just being an idiot and overthinking it, and it's due to D100s simply having a larger range of values than a D6?
Edit: Max/Min ratio is the frequency of the most common result divided by the frequency of the least common result. For a perfectly fair die this should be 1.
from math import ceil, log2
from random import getrandbits

def wasteful_die(dice_size: int):
    #Draw ceil(log2(dice_size)) random bits, the fewest that can cover dice_size values
    bits = getrandbits(ceil(log2(dice_size)))
    #If bits is a valid number (i.e. it is less than dice_size), yield it
    if bits < dice_size:
        yield 1 + bits

def generate_rolls(dice_size: int, number_of_rolls: int) -> list:
    #Store the results
    list_of_numbers = []
    #Command line activity indicator
    print('Rolling '+f'{number_of_rolls:,}'+' D'+str(dice_size)+'s',end='',flush=True)
    activityIndicator = 0
    #As this is a wasteful algorithm, keep rolling until you have the desired number of valid rolls.
    while len(list_of_numbers) < number_of_rolls:
        #Print a period every 1000 attempts
        if activityIndicator % 1000 == 0:
            print('.',end='',flush=True)
        #Build up the list of rolls with valid rolls.
        for value in wasteful_die(dice_size):
            list_of_numbers.append(value)
        activityIndicator+=1
    print(' ',flush=True)
    #Use list slice just in case something went wrong.
    return list_of_numbers[0:number_of_rolls]

#Rolls one million, forty eight thousand, five hundred and seventy six D6s
print(generate_rolls(6, 1048576), file=open("RollsD6.txt", "w"))
#Rolls one million, forty eight thousand, five hundred and seventy six D100s
print(generate_rolls(100, 1048576), file=open("RollsD100.txt", "w"))
Your final statement is incorrect: for a perfectly fair douse (never say die :-) ), the ratio should tend to 1.0, but should rarely land directly on that value for large numbers of rolls. To hit 1.0 regularly requires the die to know the history of previous rolls, which violates the fairness principles.
A variation of 0.4% for a D6 is reasonable over 10^6 rolls, as is 5% for a D100. As you surmised, this is because the D100 has many more "buckets" (different values).
The D6 will average 10^6/6, or nearly 170K expected instances per "bucket". A D100 has only 10K expected instances per bucket: somewhat less room for the law of large numbers to influence the numbers. Having a 50:4 difference in a single test run is well within expectations.
I suggest that you try running a chi-squared test, rather than a simple max/min metric.
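As a minimal sketch of that suggestion, the Pearson chi-squared statistic can be computed with the standard library alone. The function name chi_squared_statistic, the seed, and the use of random.randrange as a stand-in for the question's roller are all assumptions made for illustration; scipy.stats.chisquare can be used instead when a p-value is wanted.

```python
import random
from collections import Counter

def chi_squared_statistic(rolls, sides):
    """Pearson chi-squared statistic for rolls of a die with faces 1..sides,
    tested against a uniform distribution."""
    counts = Counter(rolls)
    expected = len(rolls) / float(sides)  # same expected count for every face
    return sum((counts[face] - expected) ** 2 / expected
               for face in range(1, sides + 1))

rng = random.Random(2024)  # fixed seed so the run is repeatable
rolls = [rng.randrange(1, 7) for _ in range(60000)]
stat = chi_squared_statistic(rolls, 6)
# For a D6 there are 5 degrees of freedom; the 5% critical value is about 11.07.
print(stat, "consistent with fair" if stat < 11.07 else "suspicious")
```

Unlike the max/min ratio, this statistic accounts for the number of buckets through its degrees of freedom, so a D6 and a D100 can be judged on the same footing.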

Pseudorandom Algorithm for VERY Large (10^1.2mil) Numbers?

I'm looking for a pseudo-random number generator (an algorithm where you input a seed number and it outputs a different 'random-looking' number, and the same seed will always generate the same output) for numbers between 1 and 95^1,312,000.
I would use the Linear Feedback Shift Register (LFSR) PRNG, but if I did, I would have to convert the seed number (which could be up to 1.2 million digits long in base-10) into a binary number, which would be so massive that I think it would take too long to compute.
In response to a similar question, the Feistel cipher was recommended, but I didn't understand the vocabulary of the wiki page for that method (I'm going into 10th grade so I don't have a degree in encryption), so if you could use layman's terms, I would strongly appreciate it.
Is there an efficient way of doing this which won't take until the end of time, or is this problem impossible?
Edit: I forgot to mention that the prng sequence needs to have a full period. My mistake.
A simple way to do this is to use a linear congruential generator with modulus m = 95^1312000.
The formula for the generator is x_(n+1) = a*x_n + c (mod m). By the Hull-Dobell Theorem, it will have full period if and only if gcd(m,c) = 1 and 95 divides a-1. Furthermore, if you want good second values (right after the seed) even for very small seeds, a and c should be fairly large. Also, your code can't store these values as literals (they would be much too big). Instead, you need to be able to reliably produce them on the fly. After a bit of trial and error to make sure gcd(m,c) = 1, I hit upon:
import random
def get_book(n):
    random.seed(1941) #Borges' Library of Babel was published in 1941
    m = 95**1312000
    a = 1 + 95 * random.randint(1, m//100)
    c = random.randint(1, m - 1) #math.gcd(c,m) = 1
    return (a*n + c) % m
For example:
>>> book = get_book(42)
>>> book % 10**100
4779746919502753142323572698478137996323206967194197332998517828771427155582287891935067701239737874
shows the last 100 digits of "book" number 42. Given Python's built-in support for large integers, the code runs surprisingly fast (it takes less than 1 second to grab a book on my machine)
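The full-period condition can be checked by brute force on a scaled-down modulus. The sketch below uses m = 95^2 instead of 95^1312000 so that walking the whole cycle is feasible; the multiplier, increment, and the function name lcg_period are arbitrary illustrative choices, not the values used in get_book.

```python
from math import gcd

def lcg_period(a, c, m, seed=0):
    """Count steps until x -> (a*x + c) % m returns to `seed`.

    Assumes gcd(a, m) == 1, so the map is a permutation of 0..m-1 and the
    walk is guaranteed to come back to its starting point.
    """
    x = seed
    steps = 0
    while True:
        x = (a * x + c) % m
        steps += 1
        if x == seed:
            return steps

m = 95 ** 2          # small stand-in for 95**1312000
a = 1 + 95 * 7       # a - 1 is divisible by 95, as the theorem requires
c = 1234             # chosen so that gcd(c, m) == 1
assert gcd(c, m) == 1 and gcd(a, m) == 1

print(lcg_period(a, c, m), m)   # a full period visits all m values
print(lcg_period(2, 1, m), m)   # a - 1 = 1 breaks the condition
```

The first line should report a period equal to m, while the second, which violates the divisibility condition, reports a shorter cycle.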
If you have a method that can produce a pseudo-random digit, then you can concatenate as many together as you want. It will be just as repeatable as the underlying prng.
However, you'll probably run out of memory scaling that up to millions of digits and attempting to do arithmetic. Normally stuff on that scale isn't done on "numbers". It's done on byte vectors, or something similar.

Probability exercise returning different result than expected

As an exercise I'm writing a program to calculate the odds of rolling 5 dice with the same number. The idea is to get the result via simulation as opposed to simple math, though. My program is this:
# rollFive.py
from random import *
def main():
    n = input("Please enter the number of sims to run: ")
    hits = simNRolls(n)
    hits = float(hits)
    n = float(n)
    prob = hits/n
    print "The odds of rolling 5 of the same number are", prob
def simNRolls(n):
    hits = 0
    for i in range(n):
        hits = hits + diceRoll()
    return hits
def diceRoll():
    firstDie = randrange(1,7,1)
    for i in range(4):
        nextDie = randrange(1,7,1)
        if nextDie!=firstDie:
            success = 0
            break
        else:
            success = 1
    return success
The problem is that running this program with a value for n of 1 000 000 gives me a probability usually between 0.0006 and 0.0008 while my math makes me believe I should be getting an answer closer to 0.0001286 (aka (1/6)^5).
Is there something wrong with my program? Or am I making some basic mistake with the math here? Or would I find my result revert closer to the right answer if I were able to run the program over larger iterations?
The probability of getting a particular number five times is (1/6)^5, but the probability of getting any five numbers the same is (1/6)^4.
There are two ways to see this.
First, the probability of getting all 1's, for example, is (1/6)^5: each die has a one-in-six chance of showing a 1, and multiplying that across five dice gives (1/6)^5. But since there are six possible numbers that could all match, there are six ways to succeed, which gives 6((1/6)^5) or (1/6)^4.
Looked at another way, it doesn't matter what the first roll gives, so we exclude it. Then we have to match that number with the four remaining rolls, the probability of which is (1/6)^4.
Your math is wrong. The probability of getting five dice with the same number is 6*(1/6)^5 = 0.0007716.
Very simply, there are 6 ** 5 possible outcomes from rolling 5 dice, and only 6 of those outcomes are successful, so the answer is 6.0 / 6 ** 5
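The counting argument is small enough to verify by listing every outcome. The sketch below is written for Python 3 (unlike the Python 2 program in the question) and uses only the standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every possible result of rolling five six-sided dice.
outcomes = list(product(range(1, 7), repeat=5))
# A roll succeeds when all five dice show the same face.
matches = sum(1 for roll in outcomes if len(set(roll)) == 1)

probability = Fraction(matches, len(outcomes))
print(matches, len(outcomes), probability, float(probability))
```

This counts 6 successful outcomes out of 7776, which reduces to 1/1296, the same value as (1/6)^4.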
I think your expected probability is wrong, as you've stated the problem. (1/6)^5 is the probability of rolling some specific number 5 times in a row; (1/6)^4 is the probability of rolling any number 5 times in a row (because the first roll is always "successful" -- that is, the first roll will always result in some number).
>>> (1.0/6.0)**4
0.00077160493827160479
Compare to running your program with 1 million iterations:
[me#host:~] python roll5.py
Please enter the number of sims to run: 1000000
The odds of rolling 5 of the same number are 0.000755
