Based on H2O's documentation, it would seem as though relevel('most_frequency_category') and relevel_by_frequency() should accomplish the same thing. However, the coefficient estimates differ depending on which method is used to set the reference level for a factor column.
Using an open-source dataset from sklearn demonstrates how the GLM coefficients become misaligned when the base level is set using the two releveling methods. Why do the coefficient estimates vary when the base level is the same between the two models?
import pandas as pd
from sklearn.datasets import fetch_openml
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init(max_mem_size=8)
def load_mtpl2(n_samples=100000):
"""
Fetch the French Motor Third-Party Liability Claims dataset.
https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html
Parameters
----------
n_samples: int, default=100000
number of samples to select (for faster run time). Full dataset has
678013 samples.
"""
# freMTPL2freq dataset from https://www.openml.org/d/41214
df_freq = fetch_openml(data_id=41214, as_frame=True)["data"]
df_freq["IDpol"] = df_freq["IDpol"].astype(int)
df_freq.set_index("IDpol", inplace=True)
# freMTPL2sev dataset from https://www.openml.org/d/41215
df_sev = fetch_openml(data_id=41215, as_frame=True)["data"]
# sum ClaimAmount over identical IDs
df_sev = df_sev.groupby("IDpol").sum()
df = df_freq.join(df_sev, how="left")
df["ClaimAmount"].fillna(0, inplace=True)
# unquote string fields
for column_name in df.columns[df.dtypes.values == object]:
df[column_name] = df[column_name].str.strip("'")
return df.iloc[:n_samples]
df = load_mtpl2()
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=100000)
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]
X_freq = h2o.H2OFrame(df)
X_freq["VehBrand"] = X_freq["VehBrand"].asfactor()
X_freq["VehBrand"] = X_freq["VehBrand"].relevel_by_frequency()
X_relevel = h2o.H2OFrame(df)
X_relevel["VehBrand"] = X_relevel["VehBrand"].asfactor()
X_relevel["VehBrand"] = X_relevel["VehBrand"].relevel("B1") # most frequent category
response_col = "PurePremium"
weight_col = "Exposure"
predictors = "VehBrand"
glm_freq = H2OGeneralizedLinearEstimator(family="tweedie",
solver='IRLSM',
tweedie_variance_power=1.5,
tweedie_link_power=0,
lambda_=0,
compute_p_values=True,
remove_collinear_columns=True,
seed=1)
glm_relevel = H2OGeneralizedLinearEstimator(family="tweedie",
solver='IRLSM',
tweedie_variance_power=1.5,
tweedie_link_power=0,
lambda_=0,
compute_p_values=True,
remove_collinear_columns=True,
seed=1)
glm_freq.train(x=predictors, y=response_col, training_frame=X_freq, weights_column=weight_col)
glm_relevel.train(x=predictors, y=response_col, training_frame=X_relevel, weights_column=weight_col)
print('GLM with the reference level set using relevel_by_frequency()')
print(glm_freq._model_json['output']['coefficients_table'])
print('\n')
print('GLM with the reference level manually set using relevel()')
print(glm_relevel._model_json['output']['coefficients_table'])
Output
GLM with the reference level set using relevel_by_frequency()
Coefficients: glm coefficients
names coefficients std_error z_value p_value standardized_coefficients
------------ -------------- ----------- ---------- ----------- ---------------------------
Intercept 5.40413 1.24082 4.35531 1.33012e-05 5.40413
VehBrand.B2 -0.398721 1.2599 -0.316472 0.751645 -0.398721
VehBrand.B12 -0.061573 1.46541 -0.0420176 0.966485 -0.061573
VehBrand.B3 -0.393908 1.30712 -0.301356 0.763144 -0.393908
VehBrand.B5 -0.282484 1.31929 -0.214118 0.830455 -0.282484
VehBrand.B6 -0.387747 1.25943 -0.307876 0.758177 -0.387747
VehBrand.B4 0.391771 1.45615 0.269047 0.787894 0.391771
VehBrand.B10 -0.0542706 1.35049 -0.040186 0.967945 -0.0542706
VehBrand.B13 -0.306381 1.4628 -0.209449 0.834098 -0.306381
VehBrand.B11 -0.435297 1.29155 -0.337035 0.736091 -0.435297
VehBrand.B14 -0.304243 1.34781 -0.225732 0.821411 -0.304243
GLM with the reference level manually set using relevel()
Coefficients: glm coefficients
names coefficients std_error z_value p_value standardized_coefficients
------------ -------------- ----------- ---------- ---------- ---------------------------
Intercept 5.01639 0.215713 23.2549 2.635e-119 5.01639
VehBrand.B10 0.081366 0.804165 0.101181 0.919407 0.081366
VehBrand.B11 0.779518 0.792003 0.984237 0.325001 0.779518
VehBrand.B12 -0.0475497 0.41834 -0.113663 0.909505 -0.0475497
VehBrand.B13 0.326174 0.80891 0.403227 0.686782 0.326174
VehBrand.B14 0.387747 1.25943 0.307876 0.758177 0.387747
VehBrand.B2 -0.010974 0.306996 -0.0357465 0.971485 -0.010974
VehBrand.B3 -0.00616108 0.464188 -0.0132728 0.98941 -0.00616108
VehBrand.B4 0.333477 0.575082 0.579877 0.561999 0.333477
VehBrand.B5 0.105263 0.497431 0.211613 0.832409 0.105263
VehBrand.B6 0.0835042 0.568769 0.146816 0.883278 0.0835042
The two datasets are almost the same, except in one place:
In the first dataset, the number of rows for VehBrand B1 is 72.
In the second dataset, the number of rows for VehBrand B14 is 721.
If you look at and compare the two datasets, you can map the equivalent names to the number of rows in the two datasets as follows:
Freq B2 == Relevel B2 with 26500 rows
Freq B12 == Relevel B13 with 1883 rows
Freq B3 == Relevel B3 with 8260 rows
Freq B5 == Relevel B5 with 6053 rows
Freq B6 == Relevel B1 with 27240 rows
Freq B4 == Relevel B11 with 1774 rows
Freq B10 == Relevel B4 with 3968 rows
Freq B13 == Relevel B10 with 2268 rows
Freq B11 == Relevel B12 with 16619 rows
Freq B14 == Relevel B6 with 4714 rows.
Since you are training the two GLM models with different datasets, you will get different coefficients and different prediction results.
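A quick way to verify this is to compare the per-level counts and level order in the two frames (a small sketch reusing the X_freq and X_relevel frames from above):
# Frequency table and level order of VehBrand in each frame; the label-to-count
# mapping differs between the two frames, which is why the coefficient tables disagree.
print(X_freq["VehBrand"].table())
print(X_relevel["VehBrand"].table())
print(X_freq["VehBrand"].levels())
print(X_relevel["VehBrand"].levels())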
I'm intrigued as to why I'm unable to arrive at the same values the model is predicting.
Consider the model below. I'm trying to understand the relationship between insurance charges, age, and whether or not a client is a smoker.
Notice the age variable has been pre-processed (mean-centered).
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['I(age - np.mean(age))'], params['I(age - np.mean(age)):smoker[T.yes]']
x1 = (insurance['age'] - np.mean(insurance['age']))
# two lines with diff intercept and slopes
y_hat_non = b0 + b1 * x1
y_hat_smok = (b0 + b2) + (b1 + b3) * x1
Now, when I generate new data and apply the predict method, I arrive at different values than when I compute them manually.
Take, for example, index 0 and index 2: I would have expected the prediction values to be similar to the output below, but they are really far off.
Am I missing something regarding the feature transformation done when fitting the model?
new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43},
'smoker': {0: 'yes', 1: 'no', 2: 'no'}})
idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4
fit1.predict(new_data)
0 27581.276650
1 10168.273779
2 10702.771604
I suppose you want to center the age variable. This I(age - np.mean(age)) works, but when you try to predict, the formula will re-evaluate the mean of age according to your prediction data frame.
Also, when you multiply by the coefficients, you have to multiply by the centered value (i.e. age - mean(age)), not the raw values.
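For example, using the training-set mean directly makes the manual calculation consistent (a small sketch reusing b0–b3 and insurance from above; train_mean is a name introduced here):
import numpy as np
# Center the new ages with the *training* mean before applying the coefficients.
train_mean = insurance['age'].mean()
idx_0 = (b0 + b2) + (b1 + b3) * (19 - train_mean)  # smoker, age 19
idx_2 = b0 + b1 * (43 - train_mean)                # non-smoker, age 43
print(idx_0, idx_2)  # these line up with the scaler-based predictions shown further below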
It doesn't hurt to create another column with the centered age:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
from sklearn.preprocessing import StandardScaler
sc = StandardScaler(with_std=False)
insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance['age_c'] = sc.fit_transform(insurance[['age']])
model1 = smf.ols('charges~age_c * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['age_c'], params['age_c:smoker[T.yes]']
And you can predict by applying the scaler fitted earlier to the age column:
new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43},
'smoker': {0: 'yes', 1: 'no', 2: 'no'}})
new_data['age_c'] = sc.transform(new_data[['age']])
new_data
age smoker age_c
0 19 yes -20.207025
1 41 no 1.792975
2 43 no 3.792975
Check:
idx_0 = (b0+b2) + (b1+b3) * -20.207025
# 26093.64269247414
idx_2 = b0 + b1 * 3.792975
# 9400.282805032146
fit1.predict(new_data)
Out[13]:
0 26093.642567
1 8865.784870
2 9400.282695
I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided any sample data, here's a function that returns a dataframe of a desired size filled with random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
np.random.seed(rSeed)
date = pd.to_datetime("1st of Dec, 2018")
cols = OrderedDict()
for col in colNames:
cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
df = pd.DataFrame(cols, index = dates)
return(df)
Output (from, e.g., df = sample(rSeed=123, periodLength=50, colNames=['X1', 'X2'])):
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input, so these parameters are simply ignored. I therefore suggest that you take a look at the function below (or at the RollingOLS sketch right after this paragraph); it's something I've put together to perform rolling regression estimates.
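For completeness, newer versions of statsmodels ship a dedicated rolling estimator; a minimal sketch, assuming statsmodels >= 0.11 and the df built with sample() above:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
# One row of estimates per 30-observation window, indexed by the window's last date.
exog = sm.add_constant(df[['X1']])
rolling_fit = RollingOLS(df['X2'], exog, window=30).fit()
print(rolling_fit.params.tail())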
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
"""
RegressionRoll takes a dataframe, makes a subset of the data if you like,
and runs a series of regressions with a specified window length, and
returns a dataframe with BETA or R^2 for each window split of the data.
Parameters:
===========
df: pandas dataframe
subset: integer - has to be smaller than the size of the df
dependent: string that specifies name of dependent variable
independent: LIST of strings that specifies names of independent variables
const: boolean - whether or not to include a constant term
win: integer - window length of each model
parameters: string that specifies which model parameters to return:
BETA or R^2
Example:
========
RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
const = True, parameters = 'beta', win = 30)
"""
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
df_rolling = RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'], const = True, parameters = 'beta',
win = 30)
Output: a dataframe with beta estimates from the OLS of X1 on X2 for each 30-period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
I am trying to implement a 3-layer neural network with feedforward and backpropagation.
I have tested my cost function and it is working fine, and my gradient function also seems OK.
But when I try to optimize the variables using fmin_cg from scipy, I get this warning:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 4.643489
Iterations: 1
Function evaluations: 123
Gradient evaluations: 110
I searched for this warning and someone said the problem is with the gradient. This is my code for the gradient:
theta_flatten = theta_flatten.reshape(1,-1)
# retrieve theta values from flattened theta
theta_hidden = theta_flatten[0,0:((input_layer_size+1)*hidden_layer_size)]
theta_hidden = theta_hidden.reshape((input_layer_size+1),hidden_layer_size)
theta_output = theta_flatten[0,((input_layer_size+1)*hidden_layer_size):]
theta_output = theta_output.reshape(hidden_layer_size+1,num_labels)
# start of section 1
a1 = x # 5000x401
z2 = np.dot(a1,theta_hidden) # 5000x25
a2 = sigmoid(z2)
a2 = np.append(np.ones(shape=(a1.shape[0],1)),a2,axis = 1) # 5000x26 # adding column of 1's to a2
z3 = np.dot(a2,theta_output) # 5000x10
a3 = sigmoid(z3) # a3 = h(x) w.r.t theta
a3 = rotate_column(a3) # mapping 0 to "0" instead of 0 to "10"
# end of section 1
# start of section 2
delta3 = a3 - y # 5000x10
# end of section 2
# start of section 3
delta2 = (np.dot(delta3,theta_output.transpose()))[:,1:] # 5000x25 # drop delta2(0)
delta2 = delta2*sigmoid_gradient(z2)
# end of section 3
# start of section 4
DELTA2 = np.dot(a2.transpose(),delta3) # 26x10
DELTA1 = np.dot(a1.transpose(),delta2) # 401x25
# end of section 4
# start of section 5
theta_hidden_ = np.append(np.ones(shape=(theta_hidden.shape[0],1)),theta_hidden[:,1:],axis = 1) # regularization
theta_output_ = np.append(np.ones(shape=(theta_output.shape[0],1)),theta_output[:,1:],axis = 1) # regularization
D1 = DELTA1/a1.shape[0] + (theta_hidden_*lambda_)/a1.shape[0]
D2 = DELTA2/a1.shape[0] + (theta_output_*lambda_)/a1.shape[0]
# end of section 5
Dvec = np.append(D1,D2)
return Dvec
I looked at other people's implementations on GitHub, but nothing works, and they implemented it the same way I did.
Some comments:
Section one: feedforward implementation
Sections two to four: backpropagation from the output layer to the input layer
Section five: aggregating the gradients
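One way to confirm whether the gradient really is the culprit is scipy's numerical gradient check; a rough sketch, assuming cost and gradient stand for the cost/gradient functions over the flattened theta described above, and theta0 is some initial flattened parameter vector:
from scipy.optimize import check_grad
# Compares the analytic gradient against finite differences at theta0;
# a small value (e.g. < 1e-4) suggests the gradient is consistent with the cost.
err = check_grad(cost, gradient, theta0)
print(err)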
Please help
Thank you
I have the following sioma_df data frame:
These are the sioma_df shape and column index. It has 13807 rows and 37 columns:
sioma_df.shape
(13807, 37)
sioma_df.columns
Index(['Luz (lux)', 'Precipitación (ml)', 'Temperatura (°C)',
'Velocidad del Viento (km/h)', 'E', 'N', 'NE', 'NO', 'O', 'S', 'SE',
'SO', 'PORVL2N1', 'PORVL2N2', 'PORVL4N1', 'PORVL5N1', 'PORVL6N1',
'PORVL7N1', 'PORVL8N1', 'PORVL9N1', 'PORVL10N1', 'PORVL13N1',
'PORVL14N1', 'PORVL15N1', 'PORVL16N1', 'PORVL16N2', 'PORVL18N1',
'PORVL18N2', 'PORVL18N3', 'PORVL18N4', 'PORVL21N1', 'PORVL21N2',
'PORVL21N3', 'PORVL21N4', 'PORVL21N5', 'PORVL24N1', 'PORVL24N2'],
dtype='object')
I want to apply the k-means algorithm, and I've decided that in the random initialization phase I will have k=9 centroids:
# Turn the dataframe to numpy array
sioma_numpy = sioma_df.get_values()
k=9
# Create a dictionary with the centroids coordinates
centroids = {
i + 1: [np.random.randint(0, np.max(sioma_numpy)), np.random.randint(0, np.max(sioma_numpy))]
for i in range(k)
}
I plot my data before applying clustering:
# I get each column individually into an array
c1 = sioma_df['Luz (lux)'].values
c2 = sioma_df['Precipitación (ml)'].values
c3 = sioma_df['Temperatura (°C)'].values
c4 = sioma_df['Velocidad del Viento (km/h)'].values
c5 = sioma_df['PORVL2N1'].values
c6 = sioma_df['PORVL2N2'].values
c7 = sioma_df['PORVL4N1'].values
c8 = sioma_df['PORVL5N1'].values
c9 = sioma_df['PORVL6N1'].values
c10 = sioma_df['PORVL7N1'].values
c11 = sioma_df['PORVL8N1'].values
c12 = sioma_df['PORVL9N1'].values
c13 = sioma_df['PORVL10N1'].values
c14 = sioma_df['PORVL13N1'].values
c15 = sioma_df['PORVL14N1'].values
c16 = sioma_df['PORVL15N1'].values
c17 = sioma_df['PORVL16N1'].values
c18 = sioma_df['PORVL16N2'].values
c19 = sioma_df['PORVL18N1'].values
c20 = sioma_df['PORVL18N2'].values
c21 = sioma_df['PORVL18N3'].values
c22 = sioma_df['PORVL18N4'].values
c23 = sioma_df['PORVL18N4'].values
c24 = sioma_df['PORVL21N1'].values
c25 = sioma_df['PORVL21N2'].values
c26 = sioma_df['PORVL21N3'].values
c27 = sioma_df['PORVL21N4'].values
c28 = sioma_df['PORVL21N5'].values
c29 = sioma_df['PORVL24N1'].values
c30 = sioma_df['E'].values
c31 = sioma_df['N'].values
c32 = sioma_df['NE'].values
c33 = sioma_df['NO'].values
c34 = sioma_df['O'].values
c35 = sioma_df['S'].values
c36 = sioma_df['SE'].values
c37 = sioma_df['S'].values
""" I generate the X and Y coordinates points of previous c1 to c36
variables above. With zip I've associate between each Ci and store in
a list to will represent array X and array Y
"""
X = np.array(list(zip(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18)))
print( " ARRAY X" +'\n', X, '\n' )
Y = np.array(list(zip(c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,)))
print( " ARRAY Y" +'\n', Y, '\n' )
Then, I've generated the (x, y) centroid coordinate pairs.
I want to start with the assignment stage, where I assign data points to the closest centroid. I have the following:
def assignment(df, centroids):
# We take the k=9 centroids keys to iterations based
for i in centroids.keys():
# sqrt((x1 - x2)^2 + (y1 - y2)^2)
# I want to create a new column in the sioma_df dataframe named
# distance_from_i
sioma_df['distance_from_{}'.format(i)] = (
# We calculate the distances between each data point and
# each one of the 9 centroids
# The distance_from_i column will have the distance value
# of each data point with reference to each centroid (Are 9 in total)
np.sqrt(
(X - centroids[i][0]) ** 2
+ (Y - centroids[i][1]) ** 2
)
)
# We iterate over the distance of each data point to each centroid j
# to find which centroid is closest
centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
# We create the closest column in the sioma_df dataframe,
# selecting the minimum value along axis=1:
sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
return df
# We execute the assignment function, which computes which centroid each data point is closest to
df = assignment(sioma_df, centroids)
print(df.head)
But when I execute my code I get the following error:
KeyError: 'distance_from_1'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-160-b96e0351c13d> in <module>()
24
25 #
---> 26 df = assignment(sioma_df, centroids)
27 print(df.head)
<ipython-input-160-b96e0351c13d> in assignment(df, centroids)
11 np.sqrt(
12 (X - centroids[i][0]) ** 2
---> 13 + (Y - centroids[i][1]) ** 2
14 )
15 )
ValueError: Wrong number of items passed 18, placement implies 1
This suggests that you are attempting to put too much data into too few positions; in this case, the value on the right-hand side of the assignment in
sioma_df['distance_from'] = np.sqrt((X - centroids[i][0]) ** 2 + (Y - centroids[i][1]) ** 2)
I don't really understand how to resolve this so that the assignment is done correctly, which is making it difficult for me to troubleshoot.
Any support that points me in the right direction will be highly appreciated.
My issue is that the np.sqrt(…) statement does not return a 1-dimensional array.
Each (row, col) position is expecting 1 value, but it is receiving an array that is 18 elements long, due to the length of the X and Y numpy arrays.
Operations on numpy arrays are element-wise, and therefore do not change the shape of the array being operated on.
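A quick shape check makes this concrete (a toy sketch; X_toy and Y_toy just stand in for the real X and Y arrays):
import numpy as np
X_toy = np.zeros((5, 18))  # same second dimension as X above
Y_toy = np.zeros((5, 18))
d = np.sqrt((X_toy - 1.0) ** 2 + (Y_toy - 2.0) ** 2)
print(d.shape)  # (5, 18) -- still 18 values per row, not a single value per row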
Then, when I try to create the new distance_from_i column like this:
sioma_df['distance_from_{}'.format(i)] = (
np.sqrt(
(X - centroids[i][0]) ** 2
+ (Y - centroids[i][1]) ** 2
)
)
I am not assigning a 1-dimensional array to the distance_from_i column, which is what a single column can accept; instead, each (row, col) position of the distance_from_i column receives an array 18 elements long, and this is the reason for the error
ValueError: Wrong number of items passed 18, placement implies 1
I then initialized my new distance_from_i column to NaN values before assigning it the result of the np.sqrt(…) statement, and it works. My assignment function now works OK and ended up like this:
def assignment(df, centroids):
# We take the k=9 centroids keys to iterations based
for i in centroids.keys():
# sqrt((x1 - x2)^2 + (y1 - y2)^2)
# We calculate the distances between each data point and
# each one of the 9 centroids
# The distance_from_i column will have the distance value
# of each data point with reference to each centroid (Are 9 in total)
n = np.sqrt(
(X - centroids[i][0]) ** 2
+ (Y - centroids[i][1]) ** 2
)
# I want to create a new column in the sioma_df dataframe named
# distance_from_i
sioma_df['distance_from_{}'.format(i)] = np.nan
sioma_df['distance_from_{}'.format(i)] = n
# We iterate over the distance of each data point to each centroid j
# to find which centroid is closest
centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
# We create the closest column in the sioma_df dataframe,
# selecting the minimum value along axis=1
sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
return df
# We execute the assignment function, which computes which centroid each data point is closest to
df = assignment(sioma_df, centroids)
print(df.head)
I have performed a hypergeometric analysis (using a python script) to investigate enrichment of GO-terms in a subset of genes. An example of my output is as follows:
GO00001 1500 300 200 150 5.39198144708e-77
GO00002 1500 500 400 350 1.18917839281e-160
GO00003 1500 400 350 320 9.48402847878e-209
GO00004 1500 100 100 75 3.82935778527e-82
GO00005 1500 100 80 80 2.67977253966e-114
where
Column1 = GO ID
Column2 = Total sum of all terms in the original dataset
Column3 = Total sum of [Column 1] IDs in the original dataset
Column4 = Sum of all terms in the subset
Column5 = Sum of [Column 1] IDs in subset
Column6 = pvalue derived from hypergeometric test
I know that I must multiply the number of experiments by the pvalue but I'm not sure how to do this with the data I have. Am I calculating from the subset or a combination of the original dataset and the subset? For example, would it be:
Column2 * Column5 * pvalue
Column3 * Column5 * pvalue
Column4 * Column5 * pvalue
I apologise if this seems like a stupid question but I just can't seem to get my head around it. Many thanks in advance!
from statsmodels.sandbox.stats.multicomp import multipletests
p_adjusted = multipletests(Column6, method='bonferroni')
Or am I missing something?..
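For instance, if the output shown in the question is read into pandas (the file name and column labels here are made up for illustration), the correction can be applied to the whole p-value column at once; the Bonferroni multiplier is simply the number of GO terms tested, i.e. the number of rows:
import pandas as pd
from statsmodels.stats.multitest import multipletests
# Hypothetical file name and column labels for the GO output above.
go = pd.read_csv("go_enrichment_results.tsv", sep="\t", header=None,
                 names=["go_id", "total_terms", "total_go_terms",
                        "subset_terms", "subset_go_terms", "pvalue"])
rejected, p_adjusted, _, _ = multipletests(go["pvalue"], alpha=0.05, method="bonferroni")
go["p_bonferroni"] = p_adjusted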
We can implement the Bonferroni correction for multiple testing ourselves, as follows:
import numpy as np
np.random.seed(123)
alpha = 0.05 # level of significance / type-I error rate
m = 100 # number of tests
raw_pvals = np.random.beta(1, 10, m) # some raw p-values, e.g., from hypergeometric analysis
significant = np.sum(raw_pvals < alpha)
significant
# 46
alpha_corrected = alpha / m
significant_bonferroni = np.sum(raw_pvals < alpha_corrected)
alpha_corrected
# 0.0005
significant_bonferroni
# 2
Or we can use multipletests from statsmodels.stats.multitest:
from statsmodels.stats.multitest import multipletests
rejected, p_adjusted, _, alpha_corrected = multipletests(raw_pvals, alpha=alpha,
method='bonferroni', is_sorted=False, returnsorted=False)
np.sum(rejected)
# 2
alpha_corrected
# 0.0005
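For Bonferroni, the adjusted p-values returned by multipletests are just the raw p-values multiplied by m and capped at 1, which we can verify:
# Manual Bonferroni adjustment should match statsmodels' output.
manual_adjusted = np.minimum(raw_pvals * m, 1.0)
np.allclose(manual_adjusted, p_adjusted)
# True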
We can plot the distribution of raw vs. adjusted p-values:
import matplotlib.pyplot as plt
import seaborn as sns
sns.kdeplot(raw_pvals, color="red", shade=True, label='raw')
ax = sns.kdeplot(p_adjusted, color="green", shade=True, label='adjusted')
ax.set(xlim=(0, 1))
plt.title('distribution of p-values')
plt.legend()
Note that, as expected, Bonferroni is very conservative in the sense that it allowed rejection of only a couple of the null hypotheses.