Been using rand.int for a while and seeing unexpected results - python

I've been running some code for an hour or so using a rand.int function. The code models rolling a ten-faced die six times in a row, where every roll has to come up with the same number, and it tracks how many tries it takes for this to happen.
import random

success = 0
times = 0
count = 0
total = 0
for h in range(0,100):
    for i in range(0,10):
        times = 0
        while success == 0:
            numbers = [0,0,0,0,0,0,0,0,0,0]
            for j in range(0,6):
                x = int(random.randint(0,9))
                numbers[x] = 1
            count = numbers.count(1)
            if count == 1:
                success = 1
            else:
                times += 1
        print(i)
        total += times
        success = 0
    randtst = open("RandomTesting.txt", "a")
    randtst.write(str(total / 10)+"\n")
    randtst.close()
Running this code, the results have been going into a file, the contents of which are below
https://pastebin.com/7kRK1Z5f
And taking the average of these numbers using
newtotal = 0
totalamounts = 0
with open('RandomTesting.txt', 'rt') as rndtxt:
    for myline in rndtxt:
        newtotal += float(myline)
        totalamounts += 1
print(newtotal / totalamounts)
Which returns 742073.7449342106. This number seems incorrect to me, as it is not near 10^6. I tried clearing the file's contents and running it again, but to no avail; the number is nowhere near 10^6. Can anyone see a problem with this?
Note: I am not asking for fixes to the code or anything; I am asking whether something has gone wrong to produce the above number rather than 100,000.

There are several issues working against you here. Bottom line up front:
your code doesn't do what you described as your intent;
you currently have no yardstick for measuring whether your results agree with the theoretical answer; and
your expectations regarding the correct answer are incorrect.
I felt that your code was overly complex for the task you were describing, so I wrote my own version from scratch. I factored out the basic experiment of rolling six 10-sided dice and checking whether the outcomes were all equal, by creating a list of six 10-sided die rolls. Borrowing shamelessly from BoarGules' comment, I threw the results into a set, which only stores unique elements, and counted the size of the set. The dice all show the same value if and only if the size of the set is 1. I kept repeating this while the number of distinct elements was greater than 1, maintaining a tally of how many trials that required, and returned the number of trials once identical die rolls were obtained.
That basic experiment is then run for any desired number of replications, with the results placed in a numpy array. The resulting data is processed with numpy and scipy to yield the average number of trials and a 95% confidence interval for the mean. The confidence interval uses the estimated variability of the results to construct a lower and an upper bound for the mean. If the underlying assumptions are met, bounds produced this way will contain the true mean in 95% of the runs that generate them; this addresses the second point in my BLUF.
Here's the code:
import random
import scipy.stats as st
import numpy as np

NUM_DIGITS = 6
SAMPLE_SIZE = 1000

def expt():
    # Roll NUM_DIGITS ten-sided dice until they all show the same value,
    # returning how many attempts that took.
    num_trials = 1
    while len(set([random.randrange(10) for _ in range(NUM_DIGITS)])) > 1:
        num_trials += 1
    return num_trials

# Replicate the experiment, then report the sample mean and a 95% CI.
data = np.array([expt() for _ in range(SAMPLE_SIZE)])
mu_hat = np.mean(data)
ci = st.t.interval(alpha=0.95, df=SAMPLE_SIZE-1, loc=mu_hat, scale=st.sem(data))
print(mu_hat, ci)
The probability of producing 6 identical results of a particular value from a 10-sided die is 10^-6, but there are 10 possible particular values, so the overall probability of producing all duplicates is 10 * 10^-6, or 10^-5. Consequently, the expected number of trials until you obtain a set of duplicates is 10^5. The code above took a little over 5 minutes to run on my computer, and produced 102493.559 (96461.16185897154, 108525.95614102845) as the output. Rounding to integers, this means that the average number of trials was 102493 and we're 95% confident that the true mean lies somewhere between 96461 and 108526. This particular range contains 10^5, i.e., it is consistent with the expected value. Rerunning the program will yield different numbers, but 95% of such runs should also contain the expected value, and the handful that don't should still be close.

Might I suggest that if you're working with whole integers, you should be getting a whole number back instead of a floating point value (if I'm understanding what you're trying to do).
##randtst.write(str(total / 10)+"\n") Original
##randtst.write(str(total // 10)+"\n")
Using floor division instead of the division sign will round the number down to a whole number, which is more ideal for what you're trying to do.
If you ARE using floating point numbers, perhaps use % instead. This performs the division but returns ONLY the remainder.
% is modulo in Python
// is floor division in Python
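For a quick illustration (the value 7427 below is just a made-up total, not a number from the question's data):
total = 7427
print(total / 10)    # 742.7  (true division always gives a float)
print(total // 10)   # 742    (floor division rounds down to an int)
print(total % 10)    # 7      (modulo gives only the remainder)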
Those operators will keep your numbers stable and easier to work with if your total comes back as a floating point value.
If this isn't the case, you will have to account for every digit to the right of the decimal point.
And if this IS the case, your result will never reach 10^6, because the line totalling your value is stuck in a loop.
I hope this helps you in any way, and if not, please let me know, as I'm also learning Python.

Related

What is the expected value of a coin-toss that doubles in value if heads and why is it different in practice? [closed]

Here's the thought experiment: say I have a coin that is worth $1. Every time I toss it, if it lands on heads, it will double in value. If it lands on tails, it will be stuck at its latest value forever. What is the expected final value of the coin?
Here is how I am thinking about it:
ExpectedValue = 1 * 0.5 + (1 * 2) * (0.5 * 0.5) + (1 * 2 * 2) * (0.5 * 0.5 * 0.5) + ...
= 0.5 + 0.5 + 0.5 + ...
= Infinity
Assuming my Math is correct, the expected value should be infinity. However, when I do try to simulate it out on code, the expected value comes out very different. Here's the code below:
import random

def test(iterations):
    total = 0
    max = 0
    for i in range(iterations):
        coin = False
        val = 1
        while coin == False:
            coin = random.choice([True, False])
            val *= 2
        total += val
        if val > max:
            max = val
    ave = total/iterations
    print(ave)

test(10000000) # returns 38.736616
I assume that the sample size of 10000000 should be statistically significant enough. However, the final expected value returned is 38.736616, which is nowhere near Infinity. Either my Math is wrong or my code is wrong. Which is it?
The average value of the process over infinitely many trials is infinite. However, you did not perform infinitely many trials; you only performed 10,000,000, which falls short of infinity by approximately infinity.
Suppose we have a fair coin. In four flips, the average number of heads that come up is two. So, I do 100 trials: 100 times, I flip the coin four times, and I count the heads. I got 2.11, not two. Why?
My 100 trials are only 100 samples from the population. They are not distributed the same way as the population. Your 10,000,000 trials are only 10,000,000 samples from an infinite population. None of your samples happened to include the streak of a hundred heads in a row, which would have made the value for that sample 2^99, and would have made the average for your 10,000,000 trials more than 2^99/10,000,000 = 6.338•10^22, which is a huge number (but still not infinity).
If you increase the number of trials, you will tend to see increasing averages, because increasing the number of trials tends to move your samples toward the full population distribution. For the process you describe, you need to move infinitely far to get to the full distribution. So you need infinitely many trials.
(Also, there is a bug in your code. If the trial starts with False, representing tails, it still doubles the value. This means the values for 0, 1, 2, 3… heads are taken as 2, 4, 8, 16,… The process you describe in the question would have them as 1, 2, 4, 8,…)
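For reference, here is a minimal sketch of a trial without that bug, following the process as described in the question (start at $1, double only on heads); this is my own illustration, not the original poster's code:
import random

def trial():
    val = 1
    while random.choice([True, False]):  # True stands for heads
        val *= 2                         # double only when a head comes up
    return val

def test(iterations):
    return sum(trial() for _ in range(iterations)) / iterations

print(test(1000000))  # still a small finite number; it grows slowly as you add trials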
Another way of looking at this is to conduct just one trial. The average value for just one trial is infinite. However, half the time you start one trial, you stop after one coin flip, since you got a tail immediately. One quarter of the time, you stop after two flips. One eighth of the time, you stop after three flips. Most of the time, you will get a small number as the answer. In every trial, you will get a finite number as the answer; every trial will always end with getting a tail, and the value at that point will be finite. It is impossible to do a finite number of trials and ever end up with an infinite value. Yet there is no finite number that is greater than the expected value: if you do more and more trials, the average will tend to grow and grow, and, eventually, it will exceed any finite target.

More efficient simulation of 2 dice rolls - Python

I wrote a program that records how many times 2 fair dice need to be rolled before the observed frequencies match the probabilities we should expect for each outcome.
I think it works, but I'm wondering if there's a more resource-friendly way to solve this problem.
import random

expected = [0.0, 0.0, 0.028, 0.056, 0.083,
            0.111, 0.139, 0.167, 0.139, 0.111,
            0.083, 0.056, 0.028]

results = [0.0] * 13     # store our empirical results here
emp_percent = [0.0] * 13 # results / by count
count = 0.0              # how many times have we rolled the dice?

while True:
    r = random.randrange(1,7) + random.randrange(1,7) # roll our die
    count += 1
    results[r] += 1

    emp_percent = results[:]
    for i in range(len(emp_percent)):
        emp_percent[i] /= count
        emp_percent[i] = round(emp_percent[i], 3)

    if emp_percent == expected:
        break

print(count)
print(emp_percent)
There are several problems here.
Firstly, there is no guarantee that this will ever terminate, nor is it particularly likely to terminate in a reasonable amount of time. Ignoring floating point arithmetic issues, this should only terminate when your numbers are distributed exactly right. But the law of large numbers does not guarantee this will ever happen. The law of large numbers works like this:
Your initial results are (by random chance) almost certainly biased one way or another.
Eventually, the trials not yet performed will greatly outnumber your initial trials, and the lack of bias in those later trials will outweigh your initial bias.
Notice that the initial bias is never counterbalanced. Rather, it is dwarfed by the rest of the results. This means the bias tends to zero, but it does not guarantee the bias actually vanishes in a finite number of trials. Indeed, it specifically predicts that progressively smaller amounts of bias will continue to exist indefinitely. So it would be entirely possible that this algorithm never terminates, because there's always that tiny bit of bias still hanging around, statistically insignificant, but still very much there.
That's bad enough, but you're also working with floating point, which has its own issues; in particular, floating point arithmetic violates lots of conventional rules of math because the computer keeps doing intermediate rounding to ensure the numbers continue to fit into memory, even if they are repeating (in base 2) or irrational. The fact that you are rounding the empirical percents to three decimal places doesn't actually fix this, because not all terminating decimals (base 10) are terminating binary values (base 2), so you may still find mismatches between your empirical and expected values. Instead of doing this:
if emp_percent == expected:
    break
...you might try this (in Python 3.5+ only):
if all(map(math.isclose, emp_percent, expected)):
    break
This solves both problems at once. By default, math.isclose() requires the values to agree to within a relative tolerance of 1e-9 (roughly nine significant digits), so it gives this algorithm the slack it needs to actually have a chance of terminating. Note that it does require special handling for comparisons involving zero, so you may need to tweak this code for your use case, like this:
is_close = functools.partial(math.isclose, abs_tol=1e-9)
if all(map(is_close, emp_percent, expected)):
    break
math.isclose() also removes the need to round your empiricals, since it can do this approximation for you:
is_close = functools.partial(math.isclose, rel_tol=1e-3, abs_tol=1e-5)
if all(map(is_close, emp_percent, expected)):
    break
If you really don't want these approximations, you will have to give up floating point and work with fractions exclusively. They produce exact results when divided by one another. However, you still have the problem that your algorithm is unlikely to terminate quickly (or perhaps at all), for the reasons discussed above.
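Here is a rough sketch of that exact-arithmetic variant using the fractions module (my own illustration, not the answerer's code; note that termination is still just as unlikely as before):
import random
from fractions import Fraction

# exact probabilities for sums 0..12 (sums 0 and 1 are impossible)
expected = [Fraction(0), Fraction(0)] + [Fraction(k, 36)
            for k in (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]
results = [0] * 13
count = 0
while True:
    results[random.randrange(1, 7) + random.randrange(1, 7)] += 1
    count += 1
    if [Fraction(r, count) for r in results] == expected:
        break
print(count)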
Rather than trying to match floating point numbers, you could try to match the expected counts for each possible sum. This is equivalent to what you are trying to do, since (observed number)/(number of trials) == (theoretical probability) if and only if the observed number equals the expected number. The expected counts are all integers exactly when the number of rolls is a multiple of 36; hence, if the number of rolls is not a multiple of 36, it is impossible for your observations to equal the expectations exactly.
To get the expected values, note that the numerators that appear in the exact probabilities of the various sums (1,2,3,4,5,6,5,4,3,2,1 for the sums 2,3,..., 12 respectively) are the expected values for the sums if the dice are rolled 36 times. If the dice are rolled 36i times then multiply these numerators by i to get the expected values of the sums. The following code simulates repeatedly rolling a pair of fair dice 36 times, accumulating the total counts and then comparing them with the expected counts. If there is a perfect match, the number of trials (where a trial is 36 rolls) needed to get the match is returned. If this doesn't happen by max_trials, a vector showing the discrepancy between the final counts and final expected value is given:
import random

def roll36(counts):
    for i in range(36):
        r1 = random.randint(1,6)
        r2 = random.randint(1,6)
        counts[r1+r2 - 2] += 1

def match_expected(max_trials):
    counts = [0]*11
    numerators = [1,2,3,4,5,6,5,4,3,2,1]
    for i in range(1, max_trials+1):
        roll36(counts)
        expected = [i*j for j in numerators]
        if counts == expected:
            return i
    # else: no exact match within max_trials, so report the discrepancies
    return [c-e for c,e in zip(counts,expected)]
Here is some typical output:
>>> match_expected(1000000)
[-750, 84, 705, -286, 5783, -3504, -1208, 1460, 543, -1646, -1181]
Not only have the exact expected values never been observed in 36 million simulated rolls of a pair of fair dice, in the final state the discrepancies between observations and expectations have become quite large (in absolute value -- the relative discrepancies are approaching zero, as the law of large numbers predicts). This approach is unlikely to ever yield a perfect match. A variation that would work (while still focusing on expected numbers) would be to iterate until the observations pass a chi-squared goodness of fit test when compared with the theoretical distribution. In that case there would no longer be any reason to focus on multiples of 36.
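A rough sketch of that chi-squared variant might look like the following. The 0.05 threshold, the batch size of 36 rolls, and the use of scipy.stats.chisquare are my own choices, not part of the answer above; note that with fair dice this criterion tends to be satisfied almost immediately, so in practice you might also require a minimum number of rolls first.
import random
from scipy.stats import chisquare

probs = [k / 36 for k in (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]  # P(sum = 2..12)
counts = [0] * 11
rolls = 0
while True:
    for _ in range(36):  # roll in batches so the expected counts stay simple
        counts[random.randint(1, 6) + random.randint(1, 6) - 2] += 1
    rolls += 36
    stat, pvalue = chisquare(counts, [p * rolls for p in probs])
    if pvalue > 0.05:    # observations are consistent with the theoretical distribution
        break
print(rolls, pvalue)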

Solving recursive sequence

Lately I've been solving some challenges from Google Foobar for fun, and now I've been stuck on one of them for more than 4 days. It is about a recursive function defined as follows:
R(0) = 1
R(1) = 1
R(2) = 2
R(2n) = R(n) + R(n + 1) + n (for n > 1)
R(2n + 1) = R(n - 1) + R(n) + 1 (for n >= 1)
The challenge is writing a function answer(str_S) where str_S is a base-10 string representation of an integer S, which returns the largest n such that R(n) = S. If there is no such n, return "None". Also, S will be a positive integer no greater than 10^25.
I have investigated a lot about recursive functions and about solving recurrence relations, but with no luck. I printed out the first 500 numbers and found no pattern whatsoever. I used the following code, which uses recursion, so it gets really slow once the numbers start getting big.
def getNumberOfZombits(time):
    if time == 0 or time == 1:
        return 1
    elif time == 2:
        return 2
    else:
        if time % 2 == 0:
            newTime = time/2
            return getNumberOfZombits(newTime) + getNumberOfZombits(newTime+1) + newTime
        else:
            newTime = time/2 # integer, so rounds down
            return getNumberOfZombits(newTime-1) + getNumberOfZombits(newTime) + 1
The challenge also included some test cases so, here they are:
Test cases
==========
Inputs:
(string) str_S = "7"
Output:
(string) "4"
Inputs:
(string) str_S = "100"
Output:
(string) "None"
I don't know if I need to reduce the recurrence relation to something simpler, but since there is one case for even and one for odd numbers, I find it really hard to do (I haven't learned about this in school yet, so everything I know about the subject is from internet articles).
So, any help at all guiding me to finish this challenge will be welcome :)
Instead of trying to simplify this function mathematically, I simplified the algorithm in Python. As suggested by @LambdaFairy, I implemented memoization in the getNumberOfZombits(time) function. This optimization sped up the function a lot.
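For illustration, a memoized version might look something like this (a sketch using functools.lru_cache; the original poster may well have used an explicit dictionary cache instead):
from functools import lru_cache

@lru_cache(maxsize=None)
def getNumberOfZombits(time):
    if time <= 1:
        return 1
    if time == 2:
        return 2
    half = time // 2
    if time % 2 == 0:   # R(2n) = R(n) + R(n+1) + n
        return getNumberOfZombits(half) + getNumberOfZombits(half + 1) + half
    # R(2n+1) = R(n-1) + R(n) + 1
    return getNumberOfZombits(half - 1) + getNumberOfZombits(half) + 1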
Then I moved on to the next step: working out which input produced that number of rabbits. I had analyzed the function before by looking at its plot, and I knew the even inputs reached higher outputs first and only after some time did the odd inputs get to the same level. As we want the highest input for a given output, I first needed to search the even numbers and then the odd numbers.
As the plot showed, the odd numbers always take longer than the even ones to reach the same output.
The problem is that we could not search the numbers by incrementing 1 each time (it was too slow). What I did to solve that was implement a binary-search-like algorithm. First, I would search the even numbers with it until I found an answer or had no more numbers to search. Then I did the same with the odd numbers and, if an answer was found, I replaced whatever I had before with it (as it was necessarily bigger than the previous answer).
I have the source code I used to solve this, so if anyone needs it I don't mind sharing it :)
The key to solving this puzzle was using a binary search.
As you can see from the sequence generators, they rely on a roughly n/2 recursion, so calculating R(N) takes about 2*log2(N) recursive calls; and of course you need to do it for both the odd and the even.
That's not too bad, but you need to figure out where to search for the N which will give you the input. To do this, I first implemented a search for upper and lower bounds for N: I walked N up by powers of 2 until I had an N and 2N that formed the lower and upper bounds respectively for each sequence (odd and even).
With these bounds, I could then do a binary search between them to quickly find the value of N, or its non-existence.
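As a sketch of that idea (my own reconstruction, not the answerer's code): it assumes R grows monotonically along each parity, reuses the memoized getNumberOfZombits from above, and ignores the tiny base cases n <= 2, which would need a separate check.
def find_n(S, parity):
    # Search indices of the form n = 2*k + parity.
    k_lo, k_hi = 1, 2
    while getNumberOfZombits(2 * k_hi + parity) < S:   # double the bound until we pass S
        k_lo, k_hi = k_hi, 2 * k_hi
    while k_lo <= k_hi:                                # binary search between the bounds
        mid = (k_lo + k_hi) // 2
        val = getNumberOfZombits(2 * mid + parity)
        if val == S:
            return 2 * mid + parity
        if val < S:
            k_lo = mid + 1
        else:
            k_hi = mid - 1
    return None

def answer(str_S):
    S = int(str_S)
    candidates = [n for n in (find_n(S, 0), find_n(S, 1)) if n is not None]
    return str(max(candidates)) if candidates else "None"
This reproduces the two test cases above: answer("7") gives "4" and answer("100") gives "None".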

Why does this not seem to be random?

I was running a procedure to be like one of those games where people try to guess a number between 0 and 100, with 100 people guessing. I then averaged how many different guesses there were.
import random

def averager(times):
    tests=[]
    for i in range(times):
        l=[]
        for i in range(0,100):
            l.append(random.randint(0,100))
        tests.append(len(set(l)))
    return (sum(tests))/len(tests)

print(averager(1000))
For some reason, the number of different guesses averages out to 63.6.
Why is this? Is it due to a flaw in the Python random library?
In a scenario where people were guessing a number between 1 and 10:
The first person has a 100% chance to guess a previously unguessed number.
The second person has a 90% chance to guess a previously unguessed number.
The third person has an 80% chance to guess a previously unguessed number.
and so on...
The average chance of guessing a new number (by my reasoning) is 55%.
But the data doesn't reflect this.
Your code is for finding the average number of unique guesses made by 100 people, each guessing a number from 0 to 100.
As for why it converges to a number around 63... you should post your question to the math Stack Exchange.
If this was a completely flat distribution, you would expect the average to come out as 100, meaning everybody's guess was different. However, you know that such a scenario is much less random than a scenario where you have duplication. The fact that you get repeated numbers during a random sequence should be comforting.
All you are doing here is measuring some kind of uniqueness within very small sets: i.e., 1000 repeats of an experiment involving 100 random values. You might get a better appreciation of this if you use some sort of bootstrapping algorithm to sample from.
Also, if you scale up the number of repeats to millions, and perhaps measure the sample distribution (not just the mean), you'll have a little more confidence in the results you're getting.
It may be that the pseudo-random generator has a characteristic which yields approximately 60-70% non-repeated values inside a sequence the same length as the range. However, you would need to experiment with far more samples, as well as different random seeds. Otherwise your results are meaningless.
I modified your code so it would take an already generated sequence as input, rather than calculating random numbers:
def averager(seqs):
    tests = []
    for s in seqs:
        tests.append(len(set(s)))
    return float(sum(tests))/len(tests)
Then I made a function to return all possible choices for any given number of people and guess range:
import itertools

def combos(n, limit):
    return itertools.product(*((range(limit),) * n))
(One of the things I love about Python is that it's so easy to break apart a function into trivial pieces.)
Then I started testing with increasing numbers:
for n in range(2,100):
    x = averager(combos(n, n))
    print n, x, x/n
2 1.5 0.75
3 2.11111111111 0.703703703704
4 2.734375 0.68359375
5 3.3616 0.67232
6 3.99061213992 0.66510202332
7 4.62058326038 0.660083322911
8 5.25112867355 0.656391084194
This algorithm has a horrible complexity, so at this point I got a MemoryError. As you can see, the percentage of unique results keeps dropping as the number of people and guess range keeps increasing.
Repeating the test with random numbers:
def rands(repeats, n, limit):
    for i in range(repeats):
        yield [random.randint(0, limit) for j in range(n)]

for n in range(10, 101, 10):
    x = averager(rands(10000, n, n))
    print n, x, x/n
10 6.7752 0.67752
20 13.0751 0.653755
30 19.4131 0.647103333333
40 25.7309 0.6432725
50 32.0471 0.640942
60 38.3333 0.638888333333
70 44.6882 0.638402857143
80 50.948 0.63685
90 57.3525 0.63725
100 63.6322 0.636322
As you can see the results are consistent with what we saw earlier and with your own observation. I'm sure a bit of combinatorial math could explain it all.
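Indeed, a quick back-of-the-envelope check (my own addition, treating randint(0, 100) as 101 equally likely values): any given value is missed by a single guess with probability 100/101, so the expected number of distinct values among 100 guesses is m * (1 - ((m-1)/m)^n) with n = 100 and m = 101.
n, m = 100, 101                       # 100 guesses, 101 possible values
print(m * (1 - ((m - 1) / m) ** n))   # ~63.66, close to the simulated 63.6322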

Probability exercise returning different result that expected

As an exercise I'm writing a program to calculate the odds of rolling 5 dice with the same number. The idea is to get the result via simulation as opposed to simple math, though. My program is this:
# rollFive.py
from random import *

def main():
    n = input("Please enter the number of sims to run: ")
    hits = simNRolls(n)
    hits = float(hits)
    n = float(n)
    prob = hits/n
    print "The odds of rolling 5 of the same number are", prob

def simNRolls(n):
    hits = 0
    for i in range(n):
        hits = hits + diceRoll()
    return hits

def diceRoll():
    firstDie = randrange(1,7,1)
    for i in range(4):
        nextDie = randrange(1,7,1)
        if nextDie != firstDie:
            success = 0
            break
        else:
            success = 1
    return success
The problem is that running this program with a value for n of 1,000,000 gives me a probability usually between 0.0006 and 0.0008, while my math makes me believe I should be getting an answer closer to 0.0001286 (i.e., (1/6)^5).
Is there something wrong with my program? Or am I making some basic mistake with the math here? Or would I find my result revert closer to the right answer if I were able to run the program over larger iterations?
The probability of getting a particular number five times is (1/6)^5, but the probability of getting any five numbers the same is (1/6)^4.
There are two ways to see this.
First, the probability of getting all 1's, for example, is (1/6)^5, since each of the five dice independently shows a 1 with probability 1/6. But since any of the six possible numbers could be the one that comes up five times, there are six ways to succeed, which is 6((1/6)^5), or (1/6)^4.
Looked at another way, it doesn't matter what the first roll gives, so we exclude it. Then we have to match that number with the four remaining rolls, the probability of which is (1/6)^4.
Your math is wrong. The probability of getting five dice with the same number is 6*(1/6)^5 = 0.0007716.
Very simply, there are 6 ** 5 possible outcomes from rolling 5 dice, and only 6 of those outcomes are successful, so the answer is 6.0 / 6 ** 5
I think your expected probability is wrong, as you've stated the problem. (1/6)^5 is the probability of rolling some specific number 5 times in a row; (1/6)^4 is the probability of rolling any number 5 times in a row (because the first roll is always "successful" -- that is, the first roll will always result in some number).
>>> (1.0/6.0)**4
0.00077160493827160479
Compare to running your program with 1 million iterations:
[me#host:~] python roll5.py
Please enter the number of sims to run: 1000000
The odds of rolling 5 of the same number are 0.000755
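For what it's worth, a compact Python 3 re-run of the same experiment (a sketch of mine, not the original script) lands in the same ballpark:
import random

def all_same():
    rolls = [random.randrange(1, 7) for _ in range(5)]  # five d6 rolls
    return len(set(rolls)) == 1

sims = 1000000
prob = sum(all_same() for _ in range(sims)) / sims
print(prob)   # typically close to 6 / 6**5 = 0.0007716...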
