G-test in R and Python (Two Sample Test of Proportions) - python

I am conducting the G test in both R and Python and I am getting different results, the results I am getting in Python being wrong. Somehow I am misapplying the formula.
The data are:
prfs
Sex F M
Pref
B 29 17
A 2 12
The R Code is :
library(RVAideMemoire)
G.test(prfs)
G-test
data: prfs
G = 11.025, df = 1, p-value = 0.0008989
The Python code is :
stats.power_divergence(prfs, lambda_ = 'log-likelihood')
Power_divergenceResult(statistic=array([28.14366538, 0.86639163]), pvalue=array([1.12635722e-07, 3.51956200e-01]))
stats.power_divergence(prfs, lambda_ = 'log-likelihood', axis = None, ddof = 2)
Power_divergenceResult(statistic=29.07673602201342, pvalue=6.956736686069527e-08)

Its an old question but following answer may help:
obs = np.array([[29,17], [2,12]])
# G test with scipy:
from scipy.stats import *
g, p, dof, expctd = chi2_contingency(obs, lambda_="log-likelihood")
print("G={}; df={}; P={}".format(g, dof, p))
Output:
G=8.859368223179882; df=1; P=0.0029158847773319975
The values are similar to those obtained by R method.
Reference for above method is here.

Related

Estimating the p value of the difference between two proportions using statsmodels and PyMC3 (MCMC simulation) in Python

In Probabilistic-Programming-and-Bayesian-Methods-for-Hackers, a method is proposed to compute the p value that two proportions are different.
(You can find the jupyter notebook here containing the entire chapter
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC2.ipynb)
The code is the following:
import pymc3 as pm
figsize(12, 4)
#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04
N_A = 1700
N_B = 1700
#generate some observations
observations_A = bernoulli.rvs(true_p_A, size=N_A)
observations_B = bernoulli.rvs(true_p_B, size=N_B)
print(np.mean(observations_A))
print(np.mean(observations_B))
0.04058823529411765
0.03411764705882353
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
p_A = pm.Uniform("p_A", 0, 1)
p_B = pm.Uniform("p_B", 0, 1)
# Define the deterministic delta function. This is our unknown of interest.
delta = pm.Deterministic("delta", p_A - p_B)
# Set of observations, in this case we have two observation datasets.
obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)
# To be explained in chapter 3.
step = pm.Metropolis()
trace = pm.sample(20000, step=step)
burned_trace=trace[1000:]
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, represent the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % \
np.mean(delta_samples > 0))
Probability site A is WORSE than site B: 0.167
Probability site A is BETTER than site B: 0.833
However, if we compute the p value using statsmodels, we get a very different result:
from scipy.stats import norm, chi2_contingency
import statsmodels.api as sm
s1 = int(1700 * 0.04058823529411765)
n1 = 1700
s2 = int(1700 * 0.03411764705882353)
n2 = 1700
p1 = s1/n1
p2 = s2/n2
p = (s1 + s2)/(n1+n2)
z = (p2-p1)/ ((p*(1-p)*((1/n1)+(1/n2)))**0.5)
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2])
print('z1 is {0} and p is {1}'.format(z1, p))
z1 is 0.9948492584166934 and p is 0.03735294117647059
With MCMC, the p value seems to be 0.167, but using statsmodels, we get a p value 0.037.
How can I understand this?
Looks like you printed the wrong value. Try this instead:
print('z1 is {0} and p is {1}'.format(z1, p_value1))
Also, if you want to test the hypothesis p_A > p_B then you should set the alternative parameter in the function call to larger like so:
z1, p_value1 = sm.stats.proportions_ztest([s1, s2], [n1, n2], alternative='larger')
The docs have more examples on how to use it.

Using Truncated Poisson in Pandas data frame

Withe help of # Welcome to Stack Overflow, I managed to truncate the Poisson distribution using upper limit.When I used the function so called truncated Poisson, which is user defined function, it worked with single value entry, which I have shown in the code below:
import scipy.stats as sct
import pandas as pd
def truncated_Poisson(mu, max_value, size):
temp_size = size
while True:
temp_size *= 2
temp = sct.poisson.rvs(mu, size=temp_size)
truncated = temp[temp <= max_value]
if len(truncated) >= size:
return truncated[:size]
mu = 2.5
max_value = 10
print(truncated_Poisson(mu, max_value, 1))
Unfortunately, I threw me an error when I applied it in the data frame as follow:
data = pd.DataFrame()
data['Name'] = ['A','B','C','D','E']
data ['mu'] = [0.5,1.2,2,2.5,2.8]
max_value = 5
size = 1
data ['Pos'] = truncated_Poisson(data.mu,max_value,size = 1)
The error statement is
ValueError: size does not match the broadcast shape of the parameters.
Can anyone advise me how to use that function in dataframe?
Thanks
Zep.
As I understand, you want to call truncated_Poisson with the same parameter and each of the mu from the data. You can do this for example by using .apply:
data['Pos'] = data.mu.apply(lambda mu: truncated_Poisson(mu, max_value, size=1))
>>> data
Name mu Pos
0 A 0.5 [0]
1 B 1.2 [0]
2 C 2.0 [3]
3 D 2.5 [4]
4 E 2.8 [3]

Python rpy2 - nls regression RRuntimeError

I am trying to do some nls regression using R within Python. I am getting stuck with a RRuntimeError and am getting to a point where I am way outside my expertise and have struggled for a few days to get it to work so would appreciate some help.
This is my csv of data:
http://www.sharecsv.com/s/4cdd4f832b606d6616260f9dc0eedf38/ratedata.csv
This is my code:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
dfData = pd.read_csv('C:\\Users\\nick\\Desktop\\ratedata.csv')
rdf = pandas2ri.py2ri(dfData)
a = 0.5
b = 1.1
count = rdf.rx(True, 'Trials')
rates = rdf.rx(True, 'Successes')
base = importr('base', robject_translations={'with': '_with'})
stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
my_formula = stats.as_formula('rates ~ 1-(1/(10^(a * count ^ (b-1))))')
d = ro.ListVector({'a': a, 'b': b})
fit = stats.nls(my_formula, weights=count, start=d)
Everything is compiling apart from:
fit = stats.nls(my_formula, weights=count, start=d)
I am getting the following traceback:
---------------------------------------------------------------------------
RRuntimeError Traceback (most recent call last)
<ipython-input-12-3f7fcd7d7851> in <module>()
6 d = ro.ListVector({'a': a, 'b': b})
7
----> 8 fit = stats.nls(my_formula, weights=count, start=d)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2\robjects\functions.py in __call__(self, *args, **kwargs)
176 v = kwargs.pop(k)
177 kwargs[r_k] = v
--> 178 return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
179
180 pattern_link = re.compile(r'\\link\{(.+?)\}')
~\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2\robjects\functions.py in __call__(self, *args, **kwargs)
104 for k, v in kwargs.items():
105 new_kwargs[k] = conversion.py2ri(v)
--> 106 res = super(Function, self).__call__(*new_args, **new_kwargs)
107 res = conversion.ri2ro(res)
108 return res
RRuntimeError: Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
parameters without starting value in 'data': rates, count
I would be eternally thankful if anyone can see where I am going wrong, or can offer advice. All I want is the two numbers from that formula back in Python so I can use those to construct some confidence intervals.
Thank you
Consider incorporating all your formula variables into a single dataframe and use the data argument. The as_formula call looks in the R environment but rates and count are in the Python scope. Hence, contain all items in same object. Then run your nls with either the Pandas dataframe or R dataframe:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
base = importr('base', robject_translations={'with': '_with'})
stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
a = 0.05
b = 1.1
d = ro.ListVector({'a': a, 'b': b})
dfData = pd.read_csv('Input.csv')
dfData['count'] = dfData['Trials'].astype('float')
dfData['rates'] = dfData['Successes'] / dfData['Trials']
dfData['a'] = a
dfData['b'] = b
pandas2ri.activate()
rdf = pandas2ri.py2ri(dfData)
my_formula = stats.as_formula('rates ~ 1-(1/(10^(a * count ^ (b-1))))')
# WITH PANDAS DATAFRAME
fit = stats.nls(formula=my_formula, data=dfData, weights=dfData['count'], start=d)
print(fit)
# WITH R DATAFRAME
fit = stats.nls(formula=my_formula, data=rdf, weights=rdf.rx(True, 'count'), start=d)
print(fit)
Alternatively, you can use robjects.globalenv and not use data argument:
ro.globalenv['rates'] = dfData['rates']
ro.globalenv['count'] = dfData['count']
ro.globalenv['a'] = dfData['a']
ro.globalenv['b'] = dfData['b']
fit = stats.nls(formula=my_formula, weights=dfData['count'], start=d)
print(fit)
# Nonlinear regression model
# model: rates ~ 1 - (1/(10^(a * count^(b - 1))))
# data: parent.frame()
# a b
# 0.01043 1.24943
# weighted residual sum-of-squares: 14.37
# Number of iterations to convergence: 6
# Achieved convergence tolerance: 9.793e-07
# To return parameters
num = fit.rx('m')[0].names.index('getPars')
obj = fit.rx('m')[0][num]()
print(obj[0])
# 0.010425686223717435
print(obj[1])
# 1.2494303314553932
Equivalently in R:
dfData <- read.csv('Input.csv')
a <- .05
b <- 1.1
d <- list(a=a, b=b)
dfData$count <- dfData$Trials
dfData$rates <- dfData$Successes / dfData$Trials
dfData$a <- a
dfData$b <- b
my_formula <- stats::as.formula("rates ~ 1-(1/(10^(a * count ^ (b-1))))")
fit <- stats::nls(my_formula, data=dfData, weights=dfData$count, start=d)
print(fit)
# Nonlinear regression model
# model: rates ~ 1 - (1/(10^(a * count^(b - 1))))
# data: dfData
# a b
# 0.01043 1.24943
# weighted residual sum-of-squares: 14.37
# Number of iterations to convergence: 6
# Achieved convergence tolerance: 9.793e-07
# To return parameters
fit$m$getPars()['a']
# 0.01042569
fit$m$getPars()['b']
# 1.24943

Python's fsolve not working

I'm currently trying to find the intercept of 2 equations from my code (pasted below). I'm using fsolve and have used it successfully in one part but I can't get it to work for the second.
Confusingly it's not showing up an error, if you paste this code into your notebook and run it you'll see 2 grphs, on the first graph there's a line at an angle which should be stopping at the eqm line.
The section which wont work is def q_eqm(x_q). Thank you for your help
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
AC_LK = np.array([4.02232,1206.53,220.291])
AC_HK = np.array([4.0854,1348.77,219.976])
P_Tot = 1 # Bara
N_Size = 11 # 1001 = 0.1% accuracy for xA
xf = 0.7
q = 0.7
xA = np.linspace(0,1,N_Size)
yA = np.linspace(0.00,0.00,N_Size)
T = np.linspace(0.00,0.00,N_Size)
x = np.array([xA[0:N_Size],yA[0:N_Size],T[0:N_Size]]) # x[xA,yA,T]
F = np.empty((1))
def xA_T(N):
xA_Ant = x[0,N]
def P_Ant(T):
PA = pow(10,AC_LK[0]-(AC_LK[1]/(T+AC_LK[2])))*xA_Ant
PB = pow(10,AC_HK[0]-(AC_HK[1]/(T+AC_HK[2])))*(1-xA_Ant)
F[0] = P_Tot - (PA + PB)
return F
return x
TGuess = [100]
T = opt.fsolve(P_Ant,TGuess)
x[2,N] = T
return x
for N in range(0,len(xA)):
xA_T(N)
x[1,N] = pow(10,AC_LK[0]-(AC_LK[1]/(x[2,N]+AC_LK[2])))*x[0,N]/P_Tot
q_int = ((-q*0)/(1-q)) + (xf/(1-q))
Eqm_Poly = np.polyfit(x[0,0:N_Size], x[1,0:N_Size], 6)
q_Poly = np.polyfit([xf,0], [xf,q_int], 1)
F = np.empty((1))
def q_Eqm(x_q):
y_q = q_Poly[0]*x_q + q_Poly[1]
eqm_y = (Eqm_Poly[0]*pow(x_q,6)+Eqm_Poly[1]*pow(x_q,5)+Eqm_Poly[2]*pow(x_q,4)+Eqm_Poly[3]*pow(x_q,3)+Eqm_Poly[4]*pow(x_q,2)+Eqm_Poly[5]*pow(x_q,1)+Eqm_Poly[6]*pow(x_q,0))
F[0] = y_q - eqm_y
return F
x_qGuess = [0]
x_q = opt.fsolve(q_Eqm,x_qGuess)
print(x,Eqm_Poly,x_q,q_int)
plt.plot(x[0,0:N_Size],x[1,0:N_Size],'k-',linewidth=1)
plt.plot([xf,xf],[0,xf],'b-',linewidth=1)
plt.plot([xf,x_q],[xf,(q_Poly[0]*x_q + q_Poly[1])],'r-',linewidth=1)
plt.legend(['Eqm','Feed'])
plt.xlabel('xA')
plt.ylabel('yA')
plt.xlim([0.00, 1])
plt.ylim([0.00, 1])
plt.savefig('x.png')
plt.savefig('x.eps')
plt.show()
plt.plot(x[0,0:N_Size],x[2,0:N_Size],'r--',linewidth=3)
plt.plot(x[1,0:N_Size],x[2,0:N_Size],'b--',linewidth=3)
plt.legend(['xA','yA'])
plt.xlabel('Mol Frac')
plt.ylabel('Temp degC')
plt.xlim([0, 1])
plt.savefig('Txy.png')
plt.savefig('Txy.eps')
plt.show()
The answer turns out to be relatively simple:
#F = np.empty((1)) # remove this
def q_Eqm(x_q):
y_q = q_Poly[0]*x_q + q_Poly[1]
eqm_y = (Eqm_Poly[0]*pow(x_q,6)+Eqm_Poly[1]*pow(x_q,5)+Eqm_Poly[2]*pow(x_q,4)+Eqm_Poly[3]*pow(x_q,3)+Eqm_Poly[4]*pow(x_q,2)+Eqm_Poly[5]*pow(x_q,1)+Eqm_Poly[6]*pow(x_q,0))
return y_q - eqm_y
The original code defines a global F, which is modified in the function and then returned. So in each iteration the function returns different values but they are the same object. This seems to confuse fsolve (I guess it internally stores references to the results rather than values). Removing this F and simply returning the result of the subtraction resolves the problem.

Is a Fuzzy C-Means algorithm available for Python?

I have some dots in a 3 dimensional space and would like to cluster them. I know Pythons module "cluster", but it has only K-Means. Do you know a module which has FCM (Fuzzy C-Means)?
(If you know some other python modules which are related to clustering you could name them as a bonus. But the important question is the one for a FCM-algorithm in python.)
Matlab
It seems to be quite easy to use FCM in Matlab (example). Isn't something like this available for Python?
NumPy, SciPy and Sage
I didn't find FCM in NumPy, SciPy or Sage. I've downloaded the documentation and searched for it. No results
Python-cluster
It seems like the cluster module will add fuzzy C-Means with the next version (see Roadmap). But I need it now
PEACH will provide some Fuzzy C-Means functionality:
http://code.google.com/p/peach/
However there doesn't seem to be any usable documentation as the wiki is empty. An example for using FCM with PEACH can be found on its website.
Have a look at scikit-fuzzy package. It has the very basic fuzzy logic functionality, including fuzzy c-means clustering.
Python
There is a fuzzy-c-means package in the PyPI. Check out the link : fuzzy-c-means Python
This is the simplest way to use FCM in python. Hope it helps.
I have done it from scratch, using K++ initialization (with fixed seeds and 5 centroids. It should't be too difficult to addapt it to your desired number of centroids):
# K++ initialization Algorithm:
import random
def initialize(X, K):
C = [X[0]]
for k in range(1, K):
D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
probs = D2/D2.sum()
cumprobs = probs.cumsum()
np.random.seed(20) # fixxing seeds
#random.seed(0) # fixxing seeds
r = scipy.rand()
for j,p in enumerate(cumprobs):
if r < p:
i = j
break
C.append(X[i])
return C
a = initialize(data2,5) # "a" is the centroids initial array... I used 5 centroids
# Now the Fuzzy c means algorithm:
m = 1.5 # Fuzzy parameter (it can be tuned)
r = (2/(m-1))
# Initial centroids:
c1,c2,c3,c4,c5 = a[0],a[1],a[2],a[3],a[4]
# prepare empty lists to add the final centroids:
cc1,cc2,cc3,cc4,cc5 = [],[],[],[],[]
n_iterations = 10000
for j in range(n_iterations):
u1,u2,u3,u4,u5 = [],[],[],[],[]
for i in range(len(data2)):
# Distances (of every point to each centroid):
a = LA.norm(data2[i]-c1)
b = LA.norm(data2[i]-c2)
c = LA.norm(data2[i]-c3)
d = LA.norm(data2[i]-c4)
e = LA.norm(data2[i]-c5)
# Pertenence matrix vectors:
U1 = 1/(1 + (a/b)**r + (a/c)**r + (a/d)**r + (a/e)**r)
U2 = 1/((b/a)**r + 1 + (b/c)**r + (b/d)**r + (b/e)**r)
U3 = 1/((c/a)**r + (c/b)**r + 1 + (c/d)**r + (c/e)**r)
U4 = 1/((d/a)**r + (d/b)**r + (d/c)**r + 1 + (d/e)**r)
U5 = 1/((e/a)**r + (e/b)**r + (e/c)**r + (e/d)**r + 1)
# We will get an array of n row points x K centroids, with their degree of pertenence
u1.append(U1)
u2.append(U2)
u3.append(U3)
u4.append(U4)
u5.append(U5)
# now we calculate new centers:
c1 = (np.array(u1)**2).dot(data2) / np.sum(np.array(u1)**2)
c2 = (np.array(u2)**2).dot(data2) / np.sum(np.array(u2)**2)
c3 = (np.array(u3)**2).dot(data2) / np.sum(np.array(u3)**2)
c4 = (np.array(u4)**2).dot(data2) / np.sum(np.array(u4)**2)
c5 = (np.array(u5)**2).dot(data2) / np.sum(np.array(u5)**2)
cc1.append(c1)
cc2.append(c2)
cc3.append(c3)
cc4.append(c4)
cc5.append(c5)
if (j>5):
change_rate1 = np.sum(3*cc1[j] - cc1[j-1] - cc1[j-2] - cc1[j-3])/3
change_rate2 = np.sum(3*cc2[j] - cc2[j-1] - cc2[j-2] - cc2[j-3])/3
change_rate3 = np.sum(3*cc3[j] - cc3[j-1] - cc3[j-2] - cc3[j-3])/3
change_rate4 = np.sum(3*cc4[j] - cc4[j-1] - cc4[j-2] - cc4[j-3])/3
change_rate5 = np.sum(3*cc5[j] - cc5[j-1] - cc5[j-2] - cc5[j-3])/3
change_rate = np.array([change_rate1,change_rate2,change_rate3,change_rate4,change_rate5])
changed = np.sum(change_rate>0.0000001)
if changed == 0:
break
print(c1) # to check a centroid coordinates c1 - c5 ... they are the last centroids calculated, so supposedly they converged.
print(U) # this is the degree of pertenence to each centroid (so n row points x K centroids columns).
I know it is not very pythonic, but I hope it can be a starting point for your complete fuzzy C means algorithm. I think that "soft clustering" is the way to go when data is not easily separable (for example, when "t-SNE visualization" show all data together instead of showing groups clearly separated. In this case, forcing data to pertain strictly to only one clustering can be dangerous). I would give a try with m = 1.1, to m = 2.0, so you can see how the fuzzy parameter affects to the pertenence matrix.

Categories

Resources