I have performed a hypergeometric analysis (using a python script) to investigate enrichment of GO-terms in a subset of genes. An example of my output is as follows:
GO00001 1500 300 200 150 5.39198144708e-77
GO00002 1500 500 400 350 1.18917839281e-160
GO00003 1500 400 350 320 9.48402847878e-209
GO00004 1500 100 100 75 3.82935778527e-82
GO00005 1500 100 80 80 2.67977253966e-114
where
Column1 = GO ID
Column2 = total number of terms in the original dataset
Column3 = number of occurrences of the Column1 GO ID in the original dataset
Column4 = total number of terms in the subset
Column5 = number of occurrences of the Column1 GO ID in the subset
Column6 = p-value derived from the hypergeometric test
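For reference, the Column6 p-value comes from something along these lines (a minimal sketch using scipy.stats.hypergeom; my actual script differs, and the variable names below are only illustrative):
import scipy.stats as st

M = 1500   # Column2: total terms in the original dataset
n = 300    # Column3: occurrences of this GO ID in the original dataset
N = 200    # Column4: total terms in the subset
k = 150    # Column5: occurrences of this GO ID in the subset

# P(X >= k) for X ~ Hypergeometric(M, n, N), i.e. the enrichment p-value
pval = st.hypergeom.sf(k - 1, M, n, N)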
I know that I must multiply the p-value by the number of tests, but I'm not sure how to do this with the data I have. Am I calculating from the subset, or from a combination of the original dataset and the subset? For example, would it be:
Column2 * Column5 * pvalue
Column3 * Column5 * pvalue
Column4 * Column5 * pvalue
I apologise if this seems like a stupid question but I just can't seem to get my head around it. Many thanks in advance!
from statsmodels.sandbox.stats.multicomp import multipletests
p_adjusted = multipletests(Column6, method='bonferroni')
Or am I missing something?
We can implement the Bonferroni correction for multiple testing ourselves, as follows:
import numpy as np

np.random.seed(123)
alpha = 0.05 # level of significance / type-I error rate
m = 100 # number of tests
raw_pvals = np.random.beta(1, 10, m) # some raw p-values, e.g., from hypergeometric analysis
significant = np.sum(raw_pvals < alpha)
significant
# 46
alpha_corrected = alpha / m
significant_bonferroni = np.sum(raw_pvals < alpha_corrected)
alpha_corrected
# 0.0005
significant_bonferroni
# 2
Or we can use multipletests from statsmodels.stats.multitest:
from statsmodels.stats.multitest import multipletests
rejected, p_adjusted, _, alpha_corrected = multipletests(raw_pvals, alpha=alpha,
method='bonferroni', is_sorted=False, returnsorted=False)
np.sum(rejected)
# 2
alpha_corrected
# 0.0005
We can plot the distribution of raw vs adjusted p-values:
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(raw_pvals, color="red", shade=True, label='raw')
ax = sns.kdeplot(p_adjusted, color="green", shade=True, label='adjusted')
ax.set(xlim=(0, 1))
plt.title('distribution of p-values')
plt.legend()
Note that, as expected, Bonferroni is very conservative: it allows rejection of only a couple of the null hypotheses.
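To connect this back to the original GO output: assuming each row of the table is one test, m is simply the number of rows, and the correction touches Column6 only (none of the other columns enter into it). A minimal sketch, assuming the output sits in a whitespace-separated file whose name (go_results.txt) is hypothetical:
import pandas as pd
from statsmodels.stats.multitest import multipletests

cols = ['go_id', 'total_terms', 'go_in_total', 'subset_terms', 'go_in_subset', 'pval']
go = pd.read_csv('go_results.txt', sep=r'\s+', names=cols)

# Bonferroni: each raw p-value is multiplied by the number of tests (rows), capped at 1
rejected, p_adj, _, _ = multipletests(go['pval'], alpha=0.05, method='bonferroni')
go['p_bonferroni'] = p_adj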
I have two very large dataframes, and I'd like to calculate (1 minus the Spearman correlation) between every column of one and every column of the other.
I want to store the results in a new table whose index is the column names of the first dataframe and whose columns are the column names of the second dataframe.
I found a post where a similar question was asked, which I've attempted, but it has been running for over an hour now without success.
So for example: Given the following two data frames:
import numpy as np
import pandas as pd

X = pd.DataFrame({"A": np.random.random_sample(size=10000), "B": np.random.random_sample(size=10000), "C": np.random.random_sample(size=10000), "D": np.random.random_sample(size=10000), "E": np.random.random_sample(size=10000)})
Y = pd.DataFrame({"AA": np.random.random_sample(size=10000), "BB": np.random.random_sample(size=10000), "CC": np.random.random_sample(size=10000), "DD": np.random.random_sample(size=10000), "EE": np.random.random_sample(size=10000)})
I'd like to calculate (1 - Spearman correlation) between each pair of columns across the two dataframes, where each column is a 1 x 10000 vector, such that the end result looks like this:
        AA                    BB   CC   DD   EE
A       (1 - spearman corr)   x    x    x    x
B       x                     x    x    x    x
C       x                     x    x    x    x
D       x                     x    x    x    x
E       x                     x    x    x    x
This is the first part of what I've done. It has been running for hours with no result.
test = pd.concat([X,Y], axis=0).corr(method="spearman")
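For what it's worth, one vectorised way to get the full table of (1 - Spearman correlation) values is to rank each column and take Pearson correlations of the ranks. A minimal sketch (not the original poster's code; it assumes X and Y as defined above):
import numpy as np
import pandas as pd

# Spearman correlation = Pearson correlation of the ranks
xr = X.rank().to_numpy()
yr = Y.rank().to_numpy()
xr = (xr - xr.mean(axis=0)) / xr.std(axis=0)
yr = (yr - yr.mean(axis=0)) / yr.std(axis=0)

# Dividing the cross-product of standardised ranks by the number of rows gives
# the Spearman correlation between every column of X and every column of Y
dist = pd.DataFrame(1 - (xr.T @ yr) / len(X), index=X.columns, columns=Y.columns)
print(dist)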
I have this data:
import numpy as np
import pandas as pd

puf = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                    'val': [850, 1889, 3289, 6083, 10349, 17860, 28180, 41236]})
The data seems to follow an exponential curve. Let's see the plot:
puf.plot('id','val')
I want to fit an exponential curve, $$ y = Ae^{Bx} $$, and add the fitted values as a column in the DataFrame. First I tried taking the log of the values:
puf['log_val'] = np.log(puf['val'])
And then to use Numpy to fit the equation:
puf['fit'] = np.polyfit(puf['id'],puf['log_val'],1)
But I get an error:
ValueError: Length of values (2) does not match length of index (8)
My expected result is the fitted values as a new column in Pandas. I attach an image with the column fitted values I want (in orange):
I'm stuck in this code. I'm not sure what I am doing wrong. How can I create a new column with my fitted values?
Note that you asked for an exponential model, yet what you have are the results of a log-linear model.
Check out the work below:
For the log-linear model we are fitting E(log(Y)), i.e. minimizing the residuals log(y) - (log(b[0]) + b[1]*x):
from scipy.optimize import least_squares
least_squares(lambda b: np.log(puf['val']) -(np.log(b[0]) + b[1] * puf['id']),
[1,1])['x']
array([5.99531305e+02, 5.51106793e-01])
These are the values that Excel gives.
On the other hand, to fit an exponential curve the randomness is on Y and not on its logarithm, i.e. E(Y) = b[0]*exp(b[1]*x). Hence we have:
least_squares(lambda b: puf['val'] - b[0]*np.exp(b[1] * puf['id']), [0,1])['x']
array([1.08047304e+03, 4.58116127e-01]) # correct results for exponential fit
Depending on your model choice, the values are a little different.
Better model? Since you have the same number of parameters, consider the one that gives you the lower deviance or the better out-of-sample prediction.
Note that the exponential model is E(Y) = A'B'^X, which for comparison can be written as log(E(Y)) = A + XB, while the log-linear model is E(log(Y)) = A + XB. Note the difference in where the expectation is taken.
From the two models we have:
Notice how at higher values the log-linear model overestimates, while at lower values the exponential model overestimates.
Code for image:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import least_squares

log_lin = least_squares(lambda b: np.log(puf['val']) - (np.log(b[0]) + b[1] * puf['id']),
                        [1, 1])['x']
expo = least_squares(lambda b: puf['val'] - b[0] * np.exp(b[1] * puf['id']), [0, 1])['x']

exp_fun = lambda x: expo[0] * np.exp(expo[1] * x)
log_lin_fun = lambda x: log_lin[0] * np.exp(log_lin[1] * x)
plt.plot(puf.id, puf.val, label = 'original')
plt.plot(puf.id, exp_fun(puf.id), label='exponential')
plt.plot(puf.id, log_lin_fun(puf.id), label='log-linear')
plt.legend()
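As a rough numerical comparison of the two fits (a sketch reusing exp_fun and log_lin_fun from above; this is just the residual sum of squares on the original scale, not a formal deviance):
# Residual sum of squares on the original val scale
sse_exp = np.sum((puf['val'] - exp_fun(puf['id']))**2)
sse_log_lin = np.sum((puf['val'] - log_lin_fun(puf['id']))**2)
print(sse_exp, sse_log_lin)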
You're getting that error because np.polyfit(puf['id'], puf['log_val'], 1) returns just two values, array([0.55110679, 6.39614819]), which doesn't match the length of your dataframe.
This is what you want:
y = a*exp(b*x)  ->  ln(y) = ln(a) + b*x
f = np.polyfit(puf['id'], np.log(puf['val']), 1)
where (np.polyfit returns the highest-degree coefficient first, so f[0] is the slope and f[1] the intercept)
a = np.exp(f[1]) -> 599.5313046712091
b = f[0] -> 0.5511067934637022
Giving
puf['fit'] = a * np.exp(b * puf['id'])
   id    val           fit
0   1    850   1040.290193
1   2   1889   1805.082864
2   3   3289   3132.130026
3   4   6083   5434.785677
4   5  10349   9430.290286
5   6  17860  16363.179739
6   7  28180  28392.938399
7   8  41236  49266.644002
I have a dataframe like this:
Category Frequency
1 30000
2 45000
3 32400
4 42200
5 56300
6 98200
How do I calculate the mean, median and skewness of the Categories?
I have tried the following:
df['cum_freq'] = [df["Category"]]*df["Frequnecy"]
mean = df['cum_freq'].mean()
median = df['cum_freq'].median()
skew = df['cum_freq'].skew()
If the total frequency is small enough to fit into memory, use repeat to generate the data and then you can easily call those methods.
s = df['Category'].repeat(df['Frequency']).reset_index(drop=True)
print(s.mean(), s.var(ddof=1), s.skew(), s.kurtosis())
# 4.13252219664584 3.045585008424625 -0.4512924988072343 -1.1923306818513022
Otherwise, you will need more complicated algebra to calculate the moments, which can be done with the k-statistics. Some of the lower moments can be computed with other libraries like numpy or statsmodels, but things like skewness and kurtosis have to be computed manually from the sums of powers of the de-meaned values (calculated from the counts). Since these sums can overflow numpy's fixed-width types, we use plain Python.
def moments_from_counts(vals, weights):
    """
    Return the tuple (mean, N-1 variance, skewness, kurtosis) from count data.
    """
    vals = [float(x) for x in vals]
    weights = [float(x) for x in weights]
    n = sum(weights)
    mu = sum([x*y for x, y in zip(vals, weights)])/n
    S1 = sum([(x-mu)**1*y for x, y in zip(vals, weights)])
    S2 = sum([(x-mu)**2*y for x, y in zip(vals, weights)])
    S3 = sum([(x-mu)**3*y for x, y in zip(vals, weights)])
    S4 = sum([(x-mu)**4*y for x, y in zip(vals, weights)])
    k1 = S1/n
    k2 = (n*S2 - S1**2)/(n*(n-1))
    k3 = (2*S1**3 - 3*n*S1*S2 + n**2*S3)/(n*(n-1)*(n-2))
    k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 - 4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
    return mu, k2, k3/k2**1.5, k4/k2**2
moments_from_counts(df['Category'], df['Frequency'])
#(4.13252219664584, 3.045585008418879, -0.4512924988072345, -1.1923306818513018)
statsmodels has a nice class that takes care of lower moments, as well as the quantiles.
from statsmodels.stats.weightstats import DescrStatsW
d = DescrStatsW(df['Category'], weights=df['Frequency'])
d.mean
#4.13252219664584
d.var_ddof(1)
#3.045585008418879
The DescrStatsW class also gives you access to the underlying (expanded) data as an array if you call d.asrepeats().
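Since the question also asked for the median, here is a small sketch of two ways to get it from the same object (the second call assumes a statsmodels version that provides DescrStatsW.quantile):
import numpy as np

# Weighted median via the expanded data
np.median(d.asrepeats())

# Or directly from the weighted statistics object
d.quantile([0.5])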
Edited to include VBA code for comparison
Also, we know the analytical value (8.021) towards which the Monte Carlo should converge, which makes the comparison easier.
Excel VBA gives 8.067 based on averaging 5 Monte-Carlo simulations (7.989, 8.187, 8.045, 8.034, 8.075)
Python gives 7.973 based on 5 MCs (7.913, 7.915, 8.203, 7.739, 8.095) and a larger Variance!
The VBA code is not even "that good", using a rather bad way to produce samples from Standard Normal!
I am running a super simple piece of code in Python to price a European call option via Monte Carlo, and I am surprised at how "bad" the convergence is with 10,000 simulated paths. Usually, when running a Monte Carlo for this simple problem in C++ or even VBA, I get better convergence.
I show the code below (it is taken from the textbook "Python for Finance" and I run it in Visual Studio Code under Python 3.7.7, 64-bit). I get the following results, as an example: Run 1 = 7.913, Run 2 = 7.915, Run 3 = 8.203, Run 4 = 7.739, Run 5 = 8.095.
Results such as the above, which differ by so much, would be unacceptable. How can the convergence be improved? (Obviously by running more paths, but as I said: for 10,000 paths, the result should already have converged much better.)
#MonteCarlo valuation of European Call Option
import math
import numpy as np
#Parameter Values
S_0 = 100. # initial value
K = 105. # strike
T = 1.0 # time to maturity
r = 0.05 # short rate (constant)
sigma = 0.2 # vol
nr_simulations = 10000
#Valuation Algo:
# Notice the vectorization below, instead of a loop
z = np.random.standard_normal(nr_simulations)
# Notice that the S_T below is a VECTOR!
S_T = S_0 * np.exp((r - 0.5 * sigma**2) * T + math.sqrt(T) * sigma * z)
#Call option pay-off at maturity (Vector!)
C_T = np.maximum((S_T-K),0)
# C_0 is a scalar
C_0 = math.exp(-r*T)*np.average(C_T)
print('Value of the European Call is: ', C_0)
I also include VBA code, which produces slightly better results (in my opinion): with the VBA code below, I get 7.989, 8.187, 8.045, 8.034, 8.075.
Option Explicit
Sub monteCarlo()
' variable declaration
' stock initial & final values, option pay-off at maturity
Dim stockInitial, stockFinal, optionFinal As Double
' r = rate, sigma = volatility, strike = strike price
Dim r, sigma, strike As Double
'maturity of the option
Dim maturity As Double
' instatiate variables
stockInitial = 100#
r = 0.05
maturity = 1#
sigma = 0.2
strike = 105#
' normal is Standard Normal
Dim normal As Double
' randomNr is randomly generated nr via "rnd()" function, between 0 & 1
Dim randomNr As Double
' variable for storing the final result value
Dim result As Double
Dim i, j As Long, monteCarlo As Long
monteCarlo = 10000
For j = 1 To 5
result = 0#
For i = 1 To monteCarlo
' get random nr between 0 and 1
randomNr = Rnd()
'max(Rnd(), 0.000000001)
' standard Normal
normal = Application.WorksheetFunction.Norm_S_Inv(randomNr)
stockFinal = stockInitial * Exp((r - (0.5 * (sigma ^ 2))) + (sigma * Sqr(maturity) * normal))
optionFinal = max((stockFinal - strike), 0)
result = result + optionFinal
Next i
result = result / monteCarlo
result = result * Exp(-r * maturity)
Worksheets("sheet1").Cells(j, 1) = result
Next j
MsgBox "Done"
End Sub
Function max(ByVal number1 As Double, ByVal number2 As Double)
If number1 > number2 Then
max = number1
Else
max = number2
End If
End Function
I don't think there is anything wrong with Python or numpy internals; the convergence should be the same no matter what tool you're using. I ran a few simulations with different sample sizes and different sigma values. No surprise: it turns out the speed of convergence is heavily controlled by the sigma value, see the plot below. Note that the x axis is on a log scale! After the bigger oscillations fade away there are smaller waves before it stabilizes. This is easiest to see at sigma=0.5.
I'm definitely not an expert, but I think the most obvious solution is to increase sample size, as you mentioned. It would be nice to see results and code from C++ or VBA, because I don't know how familiar you are with numpy and python functions. Maybe something is not doing what you think it's doing.
Code to generate the plot (let's not talk about efficiency, it's horrible):
import numpy as np
import matplotlib.pyplot as plt
S_0 = 100. # initial value
K = 105. # strike
T = 1.0 # time to maturity
r = 0.05 # short rate (constant)
fig = plt.figure()
ax = fig.add_subplot()
plt.xscale('log')
samplesize = np.geomspace(1000, 20000000, 64)
sigmas = np.arange(0, 0.7, 0.1)
for s in sigmas:
    arr = []
    for n in samplesize:
        n = n.astype(int)
        z = np.random.standard_normal(n)
        S_T = S_0 * np.exp((r - 0.5 * s**2) * T + np.sqrt(T) * s * z)
        C_T = np.maximum((S_T - K), 0)
        C_0 = np.exp(-r * T) * np.average(C_T)
        arr.append(C_0)
    ax.scatter(samplesize, arr, label=f'sigma={s:.2f}')
plt.tight_layout()
plt.xlabel('Sample size')
plt.ylabel('Value')
plt.grid()
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles[::-1], labels[::-1], loc='upper left')
plt.show()
Addition:
This time you got results closer to the real value using VBA, but sometimes you won't; the effect of randomness is too big here. The truth is that averaging only 5 results from a low-sample simulation is not meaningful. For example, averaging 50 different simulations in Python (with only n=10000, even though you shouldn't do that if you want the right answer) yields 8.025167 (± 0.039717 at the 95% confidence level), which is very close to the real solution.
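A small sketch of how to quantify that run-to-run spread directly (it reuses the parameters from the question; the interval below is the usual normal approximation, so treat it as an estimate):
import math
import numpy as np

S_0, K, T, r, sigma, n = 100., 105., 1.0, 0.05, 0.2, 10000

z = np.random.standard_normal(n)
S_T = S_0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
payoff = math.exp(-r * T) * np.maximum(S_T - K, 0)

est = payoff.mean()
se = payoff.std(ddof=1) / math.sqrt(n)            # Monte Carlo standard error
print(est, (est - 1.96 * se, est + 1.96 * se))    # point estimate and ~95% interval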
I have a dataframe of weights, in which I want to constrain the maximum weight for any one element to 30%. However in doing this, the sum of the weights becomes less than 1, so the weights of all other elements should be uniformly increased, and then repetitively capped at 30% until the sum of all weights is 1.
For example:
If my data is in a pandas data frame, how can I do this efficiently?
Note: in reality I have around 20 elements which I want to cap at 10%, so there is much more processing involved. I also intend to run this step thousands of times.
@jpp
The following is a rough approach, modified from your answer, to iteratively solve and re-cap. It doesn't produce a perfect answer though, and having a while loop makes it inefficient. Any ideas how this could be improved?
import pandas as pd
import numpy as np
cap = 0.1
df = pd.DataFrame({'Elements': list('ABCDEFGHIJKLMNO'),
'Values': [17,11,7,5,4,4,3,2,1.5,1,1,1,0.8,0.6,0.5]})
df['Uncon'] = df['Values']/df['Values'].sum()
df['Con'] = np.minimum(cap, df['Uncon'])
while df['Con'].sum() < 1 or len(df['Con'][df['Con'] > cap]) >= 1:
    df['Con'] = np.minimum(cap, df['Con'])
    nonmax = df['Con'].ne(cap)
    adj = (1 - df['Con'].sum()) * df['Con'].loc[nonmax] / df['Uncon'].loc[nonmax].sum()
    df['Con'] = df['Con'].mask(nonmax, df['Con'] + adj)
print(df)
print(df['Con'].sum())
Here's one vectorised solution. The idea is to calculate an adjustment and distribute it proportionately among the non-capped values.
df = pd.DataFrame({'Elements': list('ABCDE'),
'Uncon': [0.53, 0.34, 0.06, 0.03, 0.03]})
df['Con'] = np.minimum(0.30, df['Uncon'])
nonmax = df['Con'].ne(0.30)
adj = (1 - df['Con'].sum()) * df['Uncon'].loc[nonmax] / df['Uncon'].loc[nonmax].sum()
df['Con'] = df['Con'].mask(nonmax, df['Uncon'] + adj)
print(df)
  Elements  Uncon  Con
0        A   0.53  0.3
1        B   0.34  0.3
2        C   0.06  0.2
3        D   0.03  0.1
4        E   0.03  0.1
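Regarding the follow-up about the while loop: one way to make the iteration well-behaved is to cap the offending elements, then redistribute the remaining mass among the uncapped elements in proportion to their original unconstrained weights, repeating until nothing new exceeds the cap. A rough sketch (my own helper, not the answer above; it assumes cap times the number of elements is at least 1, so a valid solution exists) that terminates after at most one pass per element:
import numpy as np
import pandas as pd

def cap_and_redistribute(values, cap):
    """Scale values to weights summing to 1, with no single weight above cap."""
    uncon = values / values.sum()
    w = uncon.copy()
    capped = pd.Series(False, index=uncon.index)
    for _ in range(len(w)):
        over = w > cap + 1e-12
        if not over.any():
            break
        capped |= over
        free = ~capped
        remaining = 1 - cap * capped.sum()   # mass left for the uncapped elements
        w[capped] = cap
        w[free] = uncon[free] / uncon[free].sum() * remaining
    return w

df = pd.DataFrame({'Elements': list('ABCDEFGHIJKLMNO'),
                   'Values': [17, 11, 7, 5, 4, 4, 3, 2, 1.5, 1, 1, 1, 0.8, 0.6, 0.5]})
df['Con'] = cap_and_redistribute(df['Values'], 0.1)
print(df)
print(df['Con'].sum())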