Sampling Without Replacement Probabilities

I am using np.random.choice to do sampling without replacement.
I would like the following code to choose 0 50% of the time, 1 30% of the time, and 2 20% of the time.
import numpy as np
draws = []
for _ in range(10000):
    draw = np.random.choice(3, size=2, replace=False, p=[0.5, 0.3, 0.2])
    draws.append(draw)
result = np.r_[draws]
How can I correctly choose the parameters for np.random.choice to give me the result that I want?
The numbers I want represent the probability of the events being drawn in either 1st or 2nd position exclusively.
print(np.any(result==0, axis=1).mean()) # 0.83, want 0.8
print(np.any(result==1, axis=1).mean()) # 0.68, want 0.7
print(np.any(result==2, axis=1).mean()) # 0.47, want 0.5

I'm giving two interpretations of the problem: one I prefer ("Timeless") and one I consider technically valid but inferior ("Naive").
Timeless:
Given probabilities x, y, z this approach computes x', y', z' such that if we draw twice independently and discard all equal pairs the frequencies of 0, 1, 2 are x, y, z.
This gives the right total frequencies over both trials and has the added benefit of being simple and being timeless in the sense that first and second trial are equivalent.
For this to hold we must have
(x'y' + x'z') / [2 (x'y' + x'z' + y'z')] = x
(x'y' + y'z') / [2 (x'y' + x'z' + y'z')] = y (1)
(y'z' + x'z') / [2 (x'y' + x'z' + y'z')] = z
If we add two of those and subtract the third we get
x'y' / (x'y' + x'z' + y'z') = x + y - z = 1 - 2 z
x'z' / (x'y' + x'z' + y'z') = x - y + z = 1 - 2 y (2)
y'z' / (x'y' + x'z' + y'z') = -x + y + z = 1 - 2 x
Multiplying 2 of those and dividing by the third
x'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 y) / (1 - 2 x)
y'^2 / (x'y' + x'z' + y'z') = (1 - 2 z) (1 - 2 x) / (1 - 2 y) (3)
z'^2 / (x'y' + x'z' + y'z') = (1 - 2 x) (1 - 2 y) / (1 - 2 z)
Therefore up to a constant factor
x' ~ sqrt[(1 - 2 z) (1 - 2 y) / (1 - 2 x)]
y' ~ sqrt[(1 - 2 z) (1 - 2 x) / (1 - 2 y)] (4)
z' ~ sqrt[(1 - 2 x) (1 - 2 y) / (1 - 2 z)]
Since we know that x', y', z' must sum to one this is enough to solve.
But: we needn't actually completely solve for x', y', z'. Since we are only interested in unequal pairs, all we need are the conditional probabilities x'y' / (x'y' + x'z' + y'z'), x'z' / (x'y' + x'z' + y'z') and y'z' / (x'y' + x'z' + y'z'). These we can compute using equation (2).
We then halve each of them to get the probabilities for ordered pairs and draw from the six legal pairs with these probabilities.
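The pair probabilities from equation (2) can be checked exactly. The following is a minimal sketch, not part of the implementation further down; the helper names ordered_pair_probs and marginal are made up for illustration, and it assumes exactly three values 0, 1, 2 with every target probability at most 0.5.

```python
from fractions import Fraction
from itertools import permutations

def ordered_pair_probs(p):
    """Probability of each ordered pair (a, b), a != b, following equation (2)."""
    probs = {}
    for a, b in permutations(range(3), 2):
        left_out = 3 - a - b  # the one value that does not appear in the pair
        # equation (2): the unordered pair has probability 1 - 2 * p[left_out];
        # halving it gives the probability of each of the two orderings
        probs[(a, b)] = (1 - 2 * p[left_out]) / 2
    return probs

def marginal(probs, position):
    """Distribution of the value found at the given position (0 or 1) of the pair."""
    return [sum(q for pair, q in probs.items() if pair[position] == v)
            for v in range(3)]

p = (Fraction(2, 5), Fraction(7, 20), Fraction(1, 4))  # 0.4, 0.35, 0.25
probs = ordered_pair_probs(p)
print(sum(probs.values()))   # total probability of the six legal pairs
print(marginal(probs, 0))    # distribution of the first draw
print(marginal(probs, 1))    # distribution of the second draw
```

Using Fraction keeps the arithmetic exact, so the first and second position can be compared with the targets without rounding noise.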
Naive:
This is based on the (arbitrary, in my opinion) postulate that after the first draw with probabilities x', y', z', the second draw must have conditional probabilities 0, y' / (y'+z'), z' / (y'+z') if the first was 0; x' / (x'+z'), 0, z' / (x'+z') if the first was 1; and x' / (x'+y'), y' / (x'+y'), 0 if the first was 2.
This has the disadvantage that as far as I can tell there is no simple, closed-form solution and the second and first draws are quite different.
The advantage is that one can use it directly with np.random.choice; this is, however, so slow that in the implementation below I give a workaround that avoids this function.
After some algebra one finds:
1/x' - x' = c (1 - 2x)
1/y' - y' = c (1 - 2y)
1/z' - z' = c (1 - 2z)
where c = 1/x' + 1/y' + 1/z' - 1. This I only managed to solve numerically.
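As a rough illustration of the numerical solution, here is a pure-Python sketch that replaces the root finder with plain bisection. It relies on the fact that each equation 1/x' - x' = c (1 - 2x) is a quadratic in x' whose positive root is sqrt(1 + (c (1 - 2x) / 2)^2) - c (1 - 2x) / 2, and searches for the c at which the three roots sum to one. The function names are hypothetical, and the sketch assumes every target probability is strictly below 0.5.

```python
import math

def first_draw_probs(p, iterations=200):
    """Solve 1/q - q = c * (1 - 2 * p_i) for all i such that the q sum to 1."""
    weights = [1 - 2 * t for t in p]  # assumed strictly positive

    def roots(c):
        # positive root of q**2 + c * w * q - 1 = 0 for each weight w
        return [math.sqrt(1 + (c * w / 2) ** 2) - c * w / 2 for w in weights]

    lo, hi = 0.0, 1.0
    while sum(roots(hi)) > 1:      # the sum decreases in c, so grow hi until it brackets
        hi *= 2
    for _ in range(iterations):    # plain bisection on c
        mid = (lo + hi) / 2
        if sum(roots(mid)) > 1:
            lo = mid
        else:
            hi = mid
    return roots((lo + hi) / 2)

def inclusion_probs(q):
    """Probability that each value appears in the 1st or 2nd position."""
    return [q[v] + sum(q[u] * q[v] / (1 - q[u]) for u in range(3) if u != v)
            for v in range(3)]

p = (0.4, 0.35, 0.25)
q = first_draw_probs(p)
print(q)                    # first-draw probabilities x', y', z'
print(inclusion_probs(q))   # should be close to 2 * p
```

The inclusion probabilities are computed by exact enumeration of the two-step draw rather than by simulation, so they can be compared directly with twice the targets.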
Implementation and results:
And here is the implementation.
import numpy as np
from scipy import optimize

def f_pairs(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    assert np.all(p <= 0.5)
    pp = 1 - 2*p
    # the following two lines show how to compute x', y', z'
    # pp = np.sqrt(pp.prod()) / pp
    # pp /= pp.sum()
    # now pp contains x', y', z'
    i, j = np.triu_indices(3, 1)
    i, j = i[::-1], j[::-1]
    pairs = np.c_[np.r_[i, j], np.r_[j, i]]
    pp6 = np.r_[pp/2, pp/2]
    return pairs[np.random.choice(6, size=(n,), replace=True, p=pp6)]

def f_opt(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    pp = 1 - 2*p
    def target(l):
        lp2 = l*pp/2
        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1
    l = optimize.root(target, 8).x
    lp2 = l*pp/2
    pp = np.sqrt(1 + lp2**2) - lp2
    fst = np.random.choice(3, size=(n,), replace=True, p=pp)
    snd = (
        (np.random.random((n,)) < (1 / (1 + (pp[(fst+1)%3] / pp[(fst-1)%3]))))
        + fst + 1) % 3
    return np.c_[fst, snd]

def f_naive(n, p):
    p = np.asanyarray(p)
    p /= p.sum()
    pp = 1 - 2*p
    def target(l):
        lp2 = l*pp/2
        return (np.sqrt(1 + lp2**2) - lp2).sum() - 1
    l = optimize.root(target, 8).x
    lp2 = l*pp/2
    pp = np.sqrt(1 + lp2**2) - lp2
    return np.array([np.random.choice(3, (2,), replace=False, p=pp)
                     for _ in range(n)])

def check_sol(p, sol):
    N = len(sol)
    print("Frequencies [value: observed, desired]")
    c1 = np.bincount(sol[:, 0], minlength=3) / N
    print(f"1st column: 0: {c1[0]:8.6f} {p[0]:8.6f} 1: {c1[1]:8.6f} {p[1]:8.6f} 2: {c1[2]:8.6f} {p[2]:8.6f}")
    c2 = np.bincount(sol[:, 1], minlength=3) / N
    print(f"2nd column: 0: {c2[0]:8.6f} {p[0]:8.6f} 1: {c2[1]:8.6f} {p[1]:8.6f} 2: {c2[2]:8.6f} {p[2]:8.6f}")
    c = c1 + c2
    print(f"1st or 2nd: 0: {c[0]:8.6f} {2*p[0]:8.6f} 1: {c[1]:8.6f} {2*p[1]:8.6f} 2: {c[2]:8.6f} {2*p[2]:8.6f}")
    print()
    print("2nd column conditioned on 1st column [value 1st: val / prob 2nd]")
    for i in range(3):
        idx = np.flatnonzero(sol[:, 0]==i)
        c = np.bincount(sol[idx, 1], minlength=3) / len(idx)
        print(f"{i}: 0 / {c[0]:8.6f} 1 / {c[1]:8.6f} 2 / {c[2]:8.6f}")
    print()

# demo
p = 0.4, 0.35, 0.25
n = 1000000
print("Method: Naive")
check_sol(p, f_naive(n//10, p))
print("Method: naive, optimized")
check_sol(p, f_opt(n, p))
print("Method: Timeless")
check_sol(p, f_pairs(n, p))
Sample output:
Method: Naive
Frequencies [value: observed, desired]
1st column: 0: 0.449330 0.400000 1: 0.334180 0.350000 2: 0.216490 0.250000
2nd column: 0: 0.349050 0.400000 1: 0.366640 0.350000 2: 0.284310 0.250000
1st or 2nd: 0: 0.798380 0.800000 1: 0.700820 0.700000 2: 0.500800 0.500000
2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.608128 2 / 0.391872
1: 0 / 0.676133 1 / 0.000000 2 / 0.323867
2: 0 / 0.568617 1 / 0.431383 2 / 0.000000
Method: naive, optimized
Frequencies [value: observed, desired]
1st column: 0: 0.450606 0.400000 1: 0.334881 0.350000 2: 0.214513 0.250000
2nd column: 0: 0.349624 0.400000 1: 0.365469 0.350000 2: 0.284907 0.250000
1st or 2nd: 0: 0.800230 0.800000 1: 0.700350 0.700000 2: 0.499420 0.500000
2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.608132 2 / 0.391868
1: 0 / 0.676515 1 / 0.000000 2 / 0.323485
2: 0 / 0.573727 1 / 0.426273 2 / 0.000000
Method: Timeless
Frequencies [value: observed, desired]
1st column: 0: 0.400756 0.400000 1: 0.349099 0.350000 2: 0.250145 0.250000
2nd column: 0: 0.399128 0.400000 1: 0.351298 0.350000 2: 0.249574 0.250000
1st or 2nd: 0: 0.799884 0.800000 1: 0.700397 0.700000 2: 0.499719 0.500000
2nd column conditioned on 1st column [value 1st: val / prob 2nd]
0: 0 / 0.000000 1 / 0.625747 2 / 0.374253
1: 0 / 0.714723 1 / 0.000000 2 / 0.285277
2: 0 / 0.598129 1 / 0.401871 2 / 0.000000

Related

Is there an error with pandas.Dataframe.ewm calculation or I am wrong?

I choose the recursive option in order to calculate weighted moving average starting from the latest calculated value.
According to Documentation :
When adjust=False, the exponentially weighted function is calculated
recursively:
y0 = x0
y(t) = (1-alpha) * y(t-1) + alpha * x(t)
So I have the following code :
import pandas as pd
df = pd.DataFrame({'col1':[1, 1, 2, 3, 3, 5, 8, 9],
})
alpha=0.5
df['ewm'] = df['col1'].ewm(alpha, adjust=False).mean()
which gives :
>>> df
col1 ewm
0 1 1.000000
1 1 1.000000
2 2 1.666667
3 3 2.555556
4 3 2.851852
5 5 4.283951
6 8 6.761317
7 9 8.253772
The problem is that this does not match the following manual calculation:
y0 = x0 = 1
y1 = (1-0.5) * y0 + 0.5 * x1 = 0.5 + 0.5 = 1
y2 = (1-0.5) * y1 + 0.5 * x2 = 0.5 + 0.5 * 2 = 1.5
y3 = (1-0.5) * y2 + 0.5 * x3 = 0.5 * 1.5 + 0.5 * 3 = 0.75 + 1.5 = 2.25
...
The values are not the same. What's wrong?

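As pointed out in the comments, the parameters should be passed by name. The first positional parameter of ewm is com, not alpha, so ewm(alpha, adjust=False) with alpha = 0.5 is interpreted as com=0.5, which corresponds to alpha = 1 / (1 + com) = 2/3. Writing ewm(alpha=alpha, adjust=False) gives the expected values.
The documentation does not state this clearly.
One must be careful, because no exception is raised when the argument is passed positionally, but the results are not the intended ones.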
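To see where the unexpected numbers come from, the recursion quoted from the documentation can be replayed by hand. This is a pure-Python sketch, not pandas itself; ewm_recursive is a made-up helper name. It uses the fact that the first positional parameter of ewm is com, and that com corresponds to alpha = 1 / (1 + com).

```python
def ewm_recursive(values, alpha):
    """y0 = x0, y(t) = (1 - alpha) * y(t-1) + alpha * x(t), as for adjust=False."""
    result = [float(values[0])]
    for x in values[1:]:
        result.append((1 - alpha) * result[-1] + alpha * x)
    return result

col1 = [1, 1, 2, 3, 3, 5, 8, 9]

# what the question intended: alpha = 0.5
intended = ewm_recursive(col1, 0.5)

# what ewm(0.5, adjust=False) computes: 0.5 is taken as com, so alpha = 1 / (1 + 0.5)
observed = ewm_recursive(col1, 1 / (1 + 0.5))

for x, a, b in zip(col1, intended, observed):
    print(f"{x}  {a:.6f}  {b:.6f}")
```

The last column reproduces the ewm column shown in the question, while the middle column continues the manual calculation 1, 1, 1.5, 2.25, ...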

Print spiral square matrix in python

Here's some Python code that prints the square matrix from the interior outwards. How can I reverse it so that it prints from the outside to the interior, clockwise?
# Function to print an N x N spiral matrix without using any extra space
# The matrix contains numbers from 1 to N x N
def printSpiralMatrix(N):
    for i in range(N):
        for j in range(N):
            # x stores the layer in which (i, j)'th element lies
            # find minimum of four inputs
            x = min(min(i, j), min(N - 1 - i, N - 1 - j))
            # print upper right half
            if i <= j:
                print((N - 2 * x) * (N - 2 * x) - (i - x) - (j - x), end='')
            # print lower left half
            else:
                print((N - 2 * x - 2) * (N - 2 * x - 2) + (i - x) + (j - x), end='')
            print('\t', end='')
        print()

if __name__ == '__main__':
    N = 4
    printSpiralMatrix(N)
The output should look like this:
1 2 3 4
12 13 14 5
11 16 15 6
10 9 8 7

Try with this:
def printSpiralMatrix(N):
    for i in range(N):
        for j in range(N):
            # x stores the layer in which (i, j)'th element lies
            # find minimum of four inputs
            x = min(min(i, j), min(N - 1 - i, N - 1 - j))
            # print upper right half
            if i <= j:
                print(abs((N - 2 * x) * (N - 2 * x) - (i - x) - (j - x) - (N**2 + 1)), end='')
            # print lower left half
            else:
                print(abs((N - 2 * x - 2) * (N - 2 * x - 2) + (i - x) + (j - x) - (N**2 + 1)), end='')
            print('\t', end='')
        print()

printSpiralMatrix(4)
1 2 3 4
12 13 14 5
11 16 15 6
10 9 8 7
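Why this works: the original formula numbers the cells from N*N at the top-left corner down to 1 at the centre, so subtracting each value from N*N + 1 reverses the numbering. The sketch below restates that idea with the value returned instead of printed; spiral_value and spiral_matrix are illustrative names, not part of either snippet above.

```python
def spiral_value(n, i, j):
    """Value at row i, column j of the clockwise, outside-in n x n spiral."""
    layer = min(i, j, n - 1 - i, n - 1 - j)
    if i <= j:
        # upper right half of the layer, inside-out numbering
        inside_out = (n - 2 * layer) ** 2 - (i - layer) - (j - layer)
    else:
        # lower left half of the layer, inside-out numbering
        inside_out = (n - 2 * layer - 2) ** 2 + (i - layer) + (j - layer)
    # inside_out runs from n*n (corner) down to 1 (centre); flip it
    return n * n + 1 - inside_out

def spiral_matrix(n):
    return [[spiral_value(n, i, j) for j in range(n)] for i in range(n)]

for row in spiral_matrix(4):
    print("\t".join(str(v) for v in row))
```

Because the inside-out value always lies between 1 and N*N, the difference is never negative and abs is not needed in this form.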

def generateMatrix(n):
    if n<=0:
        return []
    matrix=[row[:] for row in [[0]*n]*n]
    row_st=0
    row_ed=n-1
    col_st=0
    col_ed=n-1
    current=1
    while (True):
        if current>n*n:
            break
        for c in range (col_st, col_ed+1):
            matrix[row_st][c]=current
            current+=1
        row_st+=1
        for r in range (row_st, row_ed+1):
            matrix[r][col_ed]=current
            current+=1
        col_ed-=1
        for c in range (col_ed, col_st-1, -1):
            matrix[row_ed][c]=current
            current+=1
        row_ed-=1
        for r in range (row_ed, row_st-1, -1):
            matrix[r][col_st]=current
            current+=1
        col_st+=1
    return matrix

print(list(generateMatrix(3)))
Output
[[1, 2, 3], [8, 9, 4], [7, 6, 5]]

how to understand this recursion algorithm

def sum(a):
    if a==1:
        s=1
    else:
        s=1+2*sum(a-1)
    return s
Purpose: calculate the sum of a geometric sequence whose common ratio is 2, whose first term is 1 and whose last term is 2^(a-1).
Why does it use s=1+2*sum(a-1) to implement this?

def sum1(a):
    if a==1:
        s=1
    else:
        s=1+2*sum1(a-1)
    return s
To see what this function does, take a=4.
(1) s = 1 + 2*sum1(4-1) = 1 + 2*sum1(3) = 1 + 2*s2
(2) s2 = 1 + 2*sum1(3-1) = 1 + 2*sum1(2) = 1 + 2*s3
(3) s3 = 1 + 2*sum1(2-1) = 1 + 2*sum1(1) = 1 + 2*1 = 3
Going backward: (3 * 2 + 1) * 2 + 1 = 7 * 2 + 1 = 15
Do it for bigger numbers and you will notice that the result is always 2^a - 1.
Print 2^a and 2^a - 1
Difference: (4, 3)
Difference: (8, 7)
Difference: (16, 15)
Difference: (32, 31)
Difference: (64, 63)
Difference: (128, 127)
Difference: (256, 255)
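A quick way to check the pattern is to compare the recursive function with the explicit sum of powers and with 2^a - 1. This is only a small check under the assumption that a is a positive integer.

```python
def sum1(a):
    if a == 1:
        return 1
    return 1 + 2 * sum1(a - 1)

for a in range(1, 9):
    explicit = sum(2 ** k for k in range(a))  # 1 + 2 + 4 + ... + 2^(a-1)
    print(a, sum1(a), explicit, 2 ** a - 1)
```

All three columns agree for every a in the loop.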

@Zhongyi I understand this question as asking "how does recursion work?"
Recursion exists in Python, but I find it a little difficult to explain how it works in Python. Instead I'll show how this works in Racket (a Lisp dialect).
First, I'll rewrite the yoursum provided above into mysum:
def yoursum(a):
    if a==1:
        s=1
    else:
        s=1+2*yoursum(a-1)
    return s

def mysum(a):
    if a == 1:
        return 1
    return 1 + (2 * mysum(a - 1))

for i in range(1, 11):
    print(mysum(i), yoursum(i))
They are functionally the same:
# Output
1 1
3 3
7 7
15 15
31 31
63 63
127 127
255 255
511 511
1023 1023
In Racket, mysum looks like this:
#lang racket
(define (mysum a)
  (cond
    ((eqv? a 1) 1)
    (else (+ 1 (* 2 (mysum (sub1 a)))))))
But we can use language features called quasiquoting and unquoting to show what the recursion is doing:
#lang racket
(define (mysum a)
  (cond
    ((eqv? a 1) 1)
    (else `(+ 1 (* 2 ,(mysum (sub1 a))))))) ; Notice the ` and ,
Here is what this does for several outputs:
> (mysum 1)
1
> (mysum 2)
'(+ 1 (* 2 1))
> (mysum 3)
'(+ 1 (* 2 (+ 1 (* 2 1))))
> (mysum 4)
'(+ 1 (* 2 (+ 1 (* 2 (+ 1 (* 2 1))))))
> (mysum 5)
'(+ 1 (* 2 (+ 1 (* 2 (+ 1 (* 2 (+ 1 (* 2 1))))))))
You can view the recursive step as substituting the expression (1 + 2 * rest-of-the-computation) for each recursive call.
Please comment or ask for clarification if there are parts that still do not make sense.
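The same expansion can be imitated in Python by building the unevaluated expression as a string instead of computing it. This is only a sketch of the idea; expand is a made-up name and plays the role of the quasiquoted mysum.

```python
def expand(a):
    """Return the arithmetic expression that mysum(a) would evaluate, as text."""
    if a == 1:
        return "1"
    # substitute the expression for the recursive call instead of its value
    return f"(1 + 2 * {expand(a - 1)})"

for a in range(1, 6):
    expression = expand(a)
    # eval is acceptable here because the string is built by expand itself
    print(a, expression, "=", eval(expression))
```

Each step wraps the previous expression in one more (1 + 2 * ...), which mirrors the nested lists printed by the Racket version.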

Formula Explanation
1+2*(sumOf(n-1))
This is not a generic formula for a geometric progression.
It is only valid when the common ratio is 2 and the first term is 1.
So how does it work?
The geometric progression with first term 1 and r = 2 is
1, 2, 4, 8, 16, 32, 64
FACT 1
Here you can see that the nth term always equals the sum of the first n-1 terms plus 1.
Write term(n) for the nth term and state fact 1 as an equation:
term(n) = sumOf(n-1) + 1 =======> Eq1.
Let us put the equation to the test.
Put n = 2 in Eq1:
term(2) = sumOf(2-1) + 1
We know that term(2) is 2 and sumOf(1) is 1, so
2 = 1 + 1 ==> verified
FACT 2
The sum of n terms is the nth term plus the sum of the first n-1 terms.
State fact 2 as an equation:
sumOf(n) = sumOf(n-1) + term(n) ==> eq2
Substitute eq1 into eq2, i.e. term(n) = sumOf(n-1) + 1:
sumOf(n) = sumOf(n-1) + sumOf(n-1) + 1 ==> eq3
Simplifying
sumOf(n) = 2 * sumOf(n-1) + 1
Rearranging
sumOf(n) = 1 + 2 * sumOf(n-1) ==> final Equation
Now code this equation.
We know the sum of the first term alone is always 1, so this is our base case.
def sumOf(a):
    if a==1:
        return 1
The sum of the first n terms is then 1 + 2 * sumOf(n-1), from the final equation.
Put this in the else part:
def sumOf(a):
    if a==1:
        return 1
    else:
        return 1 + 2 * sumOf(a-1)

Solve equations with combinations and for loops

I have equations with multiple unknowns, and a number range:
eq1 = (x + 5 + y) #
ans = 15
no_range = [1..5]
I know that I can solve the equation by checking all possible combinations:
solved = False
for i in range(1, 5+1): # for x
    for j in range(1, 5+1): # for y
        if i + 5 + j == ans:
            solved = True
So, the problem is that I want a function to deal with unknown_count amount of unknowns. So that both the following equations, or any, can be solved in the same manner above:
eq1 = (x + 5 + y)
ans = 15
eq2 = (x + 5 + y + z * a + 5 * b / c)
ans = 20
I just cannot think of a way, since for each unknown you need a for loop.

You could use itertools.product to generate the Cartesian product for an
arbitrary number of variables:
In [4]: import itertools
In [5]: list(itertools.product(range(1, 5+1), repeat=2))
Out[5]:
[(1, 1),
(1, 2),
(1, 3),
...
(5, 3),
(5, 4),
(5, 5)]
So you could modify your code like this:
import itertools as IT
unknown_count = 6
ans = 20
solved = False
def func(*args):
    x, y, z, a, b, c = args
    return x + 5 + y + z * a + 5 * b / c
for args in IT.product(range(1, 5+1), repeat=unknown_count):
    if func(*args) == ans:
        solved = True
        print('{} + 5 + {} + {} * {} + 5 * {} / {} = {}'.format(*(args+(ans,))))
which yields a lot of solutions, such as
1 + 5 + 1 + 1 * 3 + 5 * 2 / 1 = 20
1 + 5 + 1 + 1 * 3 + 5 * 4 / 2 = 20
1 + 5 + 1 + 2 * 4 + 5 * 1 / 1 = 20
...
5 + 5 + 5 + 2 * 2 + 5 * 1 / 5 = 20
5 + 5 + 5 + 3 * 1 + 5 * 2 / 5 = 20
5 + 5 + 5 + 4 * 1 + 5 * 1 / 5 = 20
The * unpacking operator was used
to create a function, func, which accepts an arbitrary number of arguments (i.e. def func(*args)), and also to
pass an arbitrary number of arguments to func (i.e. func(*args)).

How to convert algorithm to python

I'm new to Python and I'm having trouble converting a script to a more efficient algorithm I was given.
Here's the python code:
#!/usr/bin/env python
import itertools
target_sum = 10
a = 1
b = 2
c = 4
a_range = range(0, target_sum + 1, a)
b_range = range(0, target_sum + 1, b)
c_range = range(0, target_sum + 1, c)
for i, j, k in itertools.product(a_range, b_range, c_range):
    if i + j + k == target_sum:
        print(a, ':', i // a, ',', b, ':', j // b, ',', c, ':', k // c)
(it only does 3 variables just for example, but I want to use it on thousands of variables in the end).
Here's the result I am looking for(all the combo's that make it result to 10):
1 : 0 , 2 : 1 , 4 : 2
1 : 0 , 2 : 3 , 4 : 1
1 : 0 , 2 : 5 , 4 : 0
1 : 2 , 2 : 0 , 4 : 2
1 : 2 , 2 : 2 , 4 : 1
1 : 2 , 2 : 4 , 4 : 0
1 : 4 , 2 : 1 , 4 : 1
1 : 4 , 2 : 3 , 4 : 0
1 : 6 , 2 : 0 , 4 : 1
1 : 6 , 2 : 2 , 4 : 0
1 : 8 , 2 : 1 , 4 : 0
1 : 10 , 2 : 0 , 4 : 0
In question Can brute force algorithms scale? a better algorithm was suggested but I'm having a hard time implementing the logic within python. The new test code:
# logic to convert
#for i = 1 to k
#    for z = 0 to sum:
#        for c = 1 to z / x_i:
#            if T[z - c * x_i][i - 1] is true: #having trouble creating the table...not sure if thats a matrix
#                set T[z][i] to true
#set the variables
sum = 10
data = [1, 2, 4]
# trying to find all the different ways to combine the data to equal the sum
for i in range(len(data)):
    print(i)
    if i == 0:
        continue
    for z in range(sum):
        for c in range(z/i):
            print("*" * 15)
            print('z is equal to: ', z)
            print('c is equal to: ', c)
            print('i is equal to: ', i)
            print(z - c * i)
            print('i - 1: ', (i - 1))
            if (z - c * i) == (i - 1):
                print("(z - c * i) * (i - 1)) match!")
                print(z,i)
Sorry, it's obviously pretty messy. I have no idea how to generate a table for the section that has:
if T[z - c * x_i][i - 1] is true:
    set T[z][i] to true
In other places while converting the algorithm I had more problems, because converting lines like 'for i = 1 to k' to Python gives me an error saying "TypeError: 'int' object is not iterable".

You can get that block which creates the table for dynamic programming with this:
from collections import defaultdict
# T[s, i] is True if 's' can be written as a
# non-negative integer combination of data[:i]
# (data and sum are the variables from the question)
T = defaultdict(bool) # all values are False by default
T[0, 0] = True # base case
for i, x in enumerate(data): # i is index, x is data[i]
    for s in range(sum + 1):
        for c in range(s // x + 1): # integer division, so range() gets an int
            if T[s - c * x, i]:
                T[s, i + 1] = True
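The table only says whether a sum is reachable. To list the actual combinations the question asks for, the table can be walked backwards. The following is one possible sketch, not the method from the linked question; reachable_table and count_vectors are made-up names, the variable target replaces sum to avoid shadowing the built-in, and all entries of data are assumed to be positive integers.

```python
def reachable_table(data, target):
    """table[i][s] is True when s is a non-negative integer combination of data[:i]."""
    table = [[False] * (target + 1) for _ in range(len(data) + 1)]
    table[0][0] = True  # base case: the empty sum
    for i, x in enumerate(data):
        for s in range(target + 1):
            table[i + 1][s] = any(table[i][s - c * x] for c in range(s // x + 1))
    return table

def count_vectors(data, target):
    """Yield tuples (c_0, ..., c_k) with sum(c_i * data[i]) == target."""
    table = reachable_table(data, target)

    def walk(i, s):
        # all count tuples for data[:i] that add up to s (s is known to be reachable)
        if i == 0:
            yield ()
            return
        x = data[i - 1]
        for c in range(s // x + 1):
            if table[i - 1][s - c * x]:  # only follow branches that can succeed
                for rest in walk(i - 1, s - c * x):
                    yield rest + (c,)

    if table[len(data)][target]:
        yield from walk(len(data), target)

data = [1, 2, 4]
for counts in sorted(count_vectors(data, 10)):
    print(" , ".join(f"{x} : {c}" for x, c in zip(data, counts)))
```

Checking the table before recursing means the walk never explores a dead end, which is where the brute-force product loses most of its time.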

You can create the table you need with a list of lists:
t = [[False for i in range(len(data))] for z in range(sum)] #Creates table filled with 'False'
for i in range(len(data)):
    print(i)
    if i == 0:
        continue
    for z in range(sum):
        for c in range(int(z/i)):
            print("*" * 15)
            print('z is equal to: ', z)
            print('c is equal to: ', c)
            print('i is equal to: ', i)
            print(z - c * i)
            print('i - 1: ', (i - 1))
            if (z - c * i) == (i - 1):
                print("(z - c * i) * (i - 1)) match!")
                t[z][i] = True # Sets element in table to 'True'
As for your TypeError, you can't write for i = 1 to k in Python; you need to write for i in range(1, k+1): (assuming you want k to be included).
Another tip is that you shouldn't use sum as a variable name, because it is already a built-in Python function. Try putting print(sum([10, 4])) in your program somewhere: it fails, because sum now refers to the integer 10.
