How to use the Mann-Whitney U test for feature selection in Python

I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:
X = 0 0 1 0 1    Y = 1
    0 1 0 0 0        1
    1 1 1 0 1        0
I want to use the Mann-Whitney U test to compute feature importance (feature selection):
import numpy as np
from scipy.stats import mannwhitneyu

results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob
I'm not sure whether this is correct or not. I obtained large values for a large table, e.g. u = 990 for some columns.

I don't think that the Mann-Whitney U test is a good way to do feature selection. Mann-Whitney tests whether the distributions of two variables are the same; it tells you nothing about how correlated the variables are. For example:
>>> import numpy as np
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a, b)
array([[ 1.        , -0.07155116],
       [-0.07155116,  1.        ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112)  # result for almost not correlated
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112)  # result for perfectly correlated
Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical, no matter how the two variables are actually related. And since in feature selection you are trying to find features that best explain Y, the Mann-Whitney U test does not help you with that.
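If you nevertheless want a rank-based per-feature screen, the usual way to apply Mann-Whitney is to compare each feature's values between the two classes, not to compare a feature column against Y itself. A minimal sketch of that idea, assuming X and Y are NumPy arrays and Y holds class labels 0 and 1:
import numpy as np
from scipy.stats import mannwhitneyu

# X: (n_samples, n_features) matrix, Y: binary class labels, as in the question
results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    # compare the distribution of feature i within class 0 vs within class 1
    u, prob = mannwhitneyu(X[Y == 0, i], X[Y == 1, i])
    results[i, :] = u, prob
# small p-values point to features whose distributions differ between the classes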


Exponential fit in pandas

I have this data:
puf = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                    'val': [850, 1889, 3289, 6083, 10349, 17860, 28180, 41236]})
The data seems to follow an exponential curve. Let's see the plot:
puf.plot('id','val')
I want to fit an exponential curve ($$ y = Ae^{Bx} $$, i.e. A times e to the power B·x) and add it as a column in Pandas. Firstly I tried to log the values:
puf['log_val'] = np.log(puf['val'])
And then to use Numpy to fit the equation:
puf['fit'] = np.polyfit(puf['id'],puf['log_val'],1)
But I get an error:
ValueError: Length of values (2) does not match length of index (8)
My expected result is the fitted values as a new column in Pandas. I attach an image with the column fitted values I want (in orange):
I'm stuck in this code. I'm not sure what I am doing wrong. How can I create a new column with my fitted values?
Note that you asked for an exponential model yet you have the results for a log-linear model.
Check out the work below:
For the log-linear model we are fitting E(log(Y)), i.e. the residuals are log(y) - (log(b[0]) + b[1]*x):
from scipy.optimize import least_squares
least_squares(lambda b: np.log(puf['val']) - (np.log(b[0]) + b[1] * puf['id']),
              [1, 1])['x']
array([5.99531305e+02, 5.51106793e-01])
These are the values that Excel gives.
On the other hand, to fit an exponential curve the randomness is on Y and not on its logarithm: E(Y) = b[0]*exp(b[1]*x). Hence we have:
least_squares(lambda b: puf['val'] - b[0] * np.exp(b[1] * puf['id']), [0, 1])['x']
array([1.08047304e+03, 4.58116127e-01])  # correct results for exponential fit
Depending on your model choice, the values are a little different.
Better model? Since both have the same number of parameters, consider the one that gives you the lower deviance or the better out-of-sample prediction (a quick check is sketched after the plotting code below).
Note that the ideal exponential model is E(Y) = A'·B'^X, which for comparison can be written as log(E(Y)) = A + XB, while the log-linear model is E(log(Y)) = A + XB. Note the difference in where the expectation is taken.
Comparing the two fitted models: notice how the log-linear model overestimates at the higher values, while the exponential model overestimates at the lower values.
Code for the image:
from scipy.optimize import least_squares
import numpy as np
import matplotlib.pyplot as plt

log_lin = least_squares(lambda b: np.log(puf['val']) - (np.log(b[0]) + b[1] * puf['id']),
                        [1, 1])['x']
expo = least_squares(lambda b: puf['val'] - b[0] * np.exp(b[1] * puf['id']), [0, 1])['x']

exp_fun = lambda x: expo[0] * np.exp(expo[1] * x)
log_lin_fun = lambda x: log_lin[0] * np.exp(log_lin[1] * x)

plt.plot(puf.id, puf.val, label='original')
plt.plot(puf.id, exp_fun(puf.id), label='exponential')
plt.plot(puf.id, log_lin_fun(puf.id), label='log-linear')
plt.legend()
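To make the deviance comparison mentioned above concrete, here is a rough sketch (reusing the exp_fun and log_lin_fun fits just defined) that compares the residual sum of squares of the two models on the original scale:
# residual sum of squares on the original (untransformed) scale
rss_exp = np.sum((puf['val'] - exp_fun(puf['id']))**2)
rss_loglin = np.sum((puf['val'] - log_lin_fun(puf['id']))**2)
print(rss_exp, rss_loglin)  # the smaller value indicates the better fit on this scale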
You're getting that error because np.polyfit(puf['id'], puf['log_val'], 1) returns just the two coefficients, array([0.55110679, 6.39614819]), which doesn't match the length of your dataframe.
This is what you want:
y = a * exp(b*x)  ->  ln(y) = ln(a) + b*x
f = np.polyfit(puf['id'], np.log(puf['val']), 1)
where
a = np.exp(f[1])  # -> 599.5313046712091
b = f[0]          # -> 0.5511067934637022
Giving
puf['fit'] = a * np.exp(b * puf['id'])
id val fit
0 1 850 1040.290193
1 2 1889 1805.082864
2 3 3289 3132.130026
3 4 6083 5434.785677
4 5 10349 9430.290286
5 6 17860 16363.179739
6 7 28180 28392.938399
7 8 41236 49266.644002

How to compute the Topological Overlap Measure [TOM] for a weighted adjacency matrix in Python?

I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy. The R function that does the correct implementation is from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4 which I believe is correctly reproduced below.
Does anyone know how to implement this correctly so it reflects the WGCNA version?
Yes, I know about rpy2 but I'm trying to go lightweight on this if possible.
For starters, my diagonal is not 1 and the values have no consistent error from the original (e.g. not all off by x).
When I computed this in R, I used the following:
> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC
I'm not sure where my code is incorrect. The source code for the R version is here, but it uses C backend scripts, which are very difficult for me to interpret.
Here is my implementation in Python:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

def get_iris_data():
    iris = load_iris()
    # Iris dataset
    X = pd.DataFrame(iris.data,
                     index=[*map(lambda x: f"iris_{x}", range(150))],
                     columns=[*map(lambda x: x.split(" (cm)")[0].replace(" ", "_"), iris.feature_names)])
    y = pd.Series(iris.target,
                  index=X.index,
                  name="Species")
    return X, y

# Get data
X, y = get_iris_data()

# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values

# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)

# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]

# np.fill_diagonal(A, 0)
# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
    a_iu = A[i]
    k_i = a_iu.sum()
    for j in range(i + 1, d):
        a_ju = A[:, j]
        k_j = a_ju.sum()
        l_ij = np.dot(a_iu, a_ju)
        a_ij = A[i, j]
        numerator = l_ij + a_ij
        denominator = min(k_i, k_j) + 1 - a_ij
        w_ij = numerator / denominator
        A_tom[i, j] = w_ij

A_tom = (A_tom + A_tom.T)
There is a package called GTOM (https://github.com/benmaier/gtom), but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (a much more sophisticated/efficient NumPy implementation, but it's still not producing the expected results).
Does anyone know how to reproduce the WGCNA implementation?
EDIT: 2019.06.20
I've adapted some of the code from @scleronomic and @benmaier with credits in the docstring. The function is available in soothsayer from v2016.06 onward. Hopefully this will make it easier for people to use topological overlap in Python instead of only being able to use R.
https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py
import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True
First let's look at the parts of equation (4), w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), for the case of a binary adjacency matrix a_ij:
a_ij: indicates if node i is connected to node j
k_i: count of the neighbors of node i (connectivity)
l_ij: count of the common neighbors of node i and node j
so w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (i.e. w_ij measures "their relative inter-connectedness").
My guess is that they define the diagonal of A to be zero instead of one.
With this assumption I can reproduce the values of WGCNA.
A[range(d), range(d)] = 0  # Assumption
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)

A_tom = np.zeros_like(A)
for i in range(d):
    for j in range(i + 1, d):
        numerator = L[i, j] + A[i, j]
        denominator = min(K[i], K[j]) + 1 - A[i, j]
        A_tom[i, j] = numerator / denominator

A_tom += A_tom.T
A_tom[range(d), range(d)] = 1  # Set diagonal to 1 by default

A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC",
                                    sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))
An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:
Graph Case Zero Case One
B A B C D A B C D
/ \ A 0 1 1 1 A 1 1 1 1
A-----D B 1 0 0 1 B 1 1 0 1
\ / C 1 0 0 1 C 1 0 1 1
C D 1 1 1 0 D 1 1 1 1
The given description of equation 4 explains:
Note that w_ij = 1 if the node with fewer connections satisfies two conditions:
(a) all of its neighbors are also neighbors of the other node and
(b) it is connected to the other node.
In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.
So the connection between A-D should fulfill this criterion and be w_14=1.
Case Zero Diagonal and Case One Diagonal: the resulting TOM matrices for both cases are shown as images in the original answer.
What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value different than one (or zero, depending on definition) doesn't make sense to me.
Neither Case Zero nor Case One result in w_ii=1 in the simple example.
In Case Zero it would be necessary that k_i+1 == l_ii, and in Case One it would be necessary that k_i == l_ii+1, which both seems wrong to me.
So to summarize I would set the diagonal of the adjacency matrix to zero, use the given equation and set the diagonal of the result to one by default.
Given the adjacency matrix A, it's possible to calculate the TOM matrix W without for loops, which speeds up the process tremendously:
L = np.dot(A, A)                    # L[i, j] = l_ij, the shared-neighbour term
k = np.sum(A, axis=0)               # k_i, the connectivity of each node
d = len(k)
tile = np.tile(k, (d, 1))
K = np.min(np.stack((tile, tile.T), axis=2), axis=2)   # K[i, j] = min(k_i, k_j)
W = (L + A) / (K + 1 - A)
np.fill_diagonal(W, 1)
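Putting the pieces together, here is a sketch that applies this vectorized version to the adjacency matrix from the question, using the zero-diagonal convention from the earlier answer (per that answer, this should reproduce the WGCNA output):
import numpy as np
import pandas as pd

df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A = df_adj.values.copy()
np.fill_diagonal(A, 0)                    # zero the diagonal before applying the formula

L = np.dot(A, A)
k = np.sum(A, axis=0)
K = np.minimum.outer(k, k)                # K[i, j] = min(k_i, k_j)
W = (L + A) / (K + 1 - A)
np.fill_diagonal(W, 1)

df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)
print(np.allclose(W, df_tom__wgcna.values))  # expected True, per the earlier answer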

To calculate the "se.fit" and "resdiual.scale" that are the outputs from lm() function in R using python code

I have a code snippet in R with a simple lm() fit with one dependent and one independent variable, which is as follows.
X = ([149876.9876, 157853.421, 147822.3803, 147904.6639, 152625.6781, 147229.8083, 181202.081, 164499.6566, 171461.6586, 164309.3919])
Y = ([26212109.07, 28376408.76, 30559566.77, 26765176.65, 28206749.66, 27560521.33, 32713878.83, 31263763.7, 30812063.54, 30225631.6])
lmfit <- lm(formula = Data_df$Y ~ Data_df$X, data=Data_df)
lmpred <- predict(lmfit, newdata=Data_df, se.fit=TRUE, interval = "prediction")
print(lmpred) #prints out fit, se.fit, df, residual.scale
The output of the above code has four components:
1.) fit
2.) se.fit
3.) df
4.) residual.scale
Please help me find a way to calculate se.fit and residual.scale in Python. I am using statsmodels' ols to build the linear regression model. Below is the Python code that I am using to build the linear regression.
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
ols_result = smf.ols(formula='Y ~ X', data=DATA_X_Y_OLS).fit()
ols_result.predict(data_x_values)
R output
$fit
fit lwr upr
1 27594475 23262089 31926862
2 28768803 24486082 33051524
3 27291987 22943619 31640354
4 27304101 22956398 31651804
5 27999150 23686118 32312183
6 27204745 22851531 31557960
7 32206302 27951767 36460836
8 29747293 25490577 34004009
9 30772271 26527501 35017042
10 29719281 25462018 33976544
$se.fit
1 2 3 4 5 6 7 8 9 10
578003.4 483363.7 605520.6 604399.0 542961.1 613642.7 420890.0 426036.9 397072.7 427318.3
$df
[1] 24
$residual.scale
[1] 2017981
To find the fit, se.fit, df, and residual.scale that are output by predict() on an lm() fit in R, below is the Python code to calculate these four values.
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

ols_result = smf.ols(formula='Y ~ X', data=DATA).fit()
fit = ols_result.predict(X_new)                     # predicted values, i.e. fit from lm()

# Standard error of the fitted values, i.e. se.fit from lm()
covariance_matrix = ols_result.cov_params()
x = DATA['X'].values
x0 = pd.DataFrame({"Constant": np.ones(len(x))}).join(pd.DataFrame(x)).values
x1 = np.dot(x0, covariance_matrix)
se_fit = np.sqrt(np.sum(x1 * x0, axis=1))

df = ols_result.df_resid                            # degrees of freedom, i.e. df from lm()

# Residual standard deviation, i.e. residual.scale from lm(): sqrt(RSS / df)
residuals = ols_result.resid
residual_scale = np.sqrt(np.dot(residuals, residuals) / df)
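Note that a reasonably recent statsmodels can also produce these quantities directly; assuming a version that provides get_prediction(), something like this avoids the manual matrix algebra:
pred = ols_result.get_prediction(DATA)        # or a new dataframe with an 'X' column
frame = pred.summary_frame(alpha=0.05)        # columns: mean, mean_se, obs_ci_lower, obs_ci_upper, ...
fit = frame['mean']                           # ~ fit from lm()
se_fit = frame['mean_se']                     # ~ se.fit from lm()
df = ols_result.df_resid                      # ~ df
residual_scale = np.sqrt(ols_result.scale)    # ~ residual.scale (ols_result.scale is RSS / df)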

PolynomialFeatures sklearn

Here is my code:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X_arr = []
Y_arr = []
with open('input.txt') as fp:
    for line in fp:
        b = line.split("|")
        x, y = b
        X_arr.append(int(x))
        Y_arr.append(int(y))

X = np.array([X_arr]).T
print(X)
y = np.array(Y_arr)
print(y)

model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression(fit_intercept=False))
model.fit(X, y)

X_predict = np.array([[3]])
print(model.predict(X_predict))
I have a question about this line:
model = make_pipeline(PolynomialFeatures(degree=2),
How can I choose this value (2, 3, 4, etc.)? Is there a method to set this value dynamically?
For example, I have this test file:
1 1
2 4
4 16
5 75
For the first three lines the model is y = a*x*x + b*x + c (with b = c = 0); for the last line, the model is y = a*x*x*x + b*x + c (with b = c = 0).
This is by no means a foolproof way to approach your problem, but I think I understand what you want. Perhaps something like this:
import math

epsilon = 1e-2

# Do your error checking on size of array
...
# Warning: This only works for positive x; the logarithm of a negative number is not proper.
# If you really want to, do an `abs(X_arr[0])` and check the degree is even.
deg = math.log(Y_arr[0], X_arr[0])
assert deg % 1 < epsilon

for x, y in zip(X_arr[1:], Y_arr[1:]):
    if x == y == 1:
        continue  # All x^n fit this and it would cause a divide by zero
    assert abs(math.log(y, x) - deg) < epsilon
...
PolynomialFeatures(degree=int(deg))
This checks to see if the degree is an integer value, and that all other data points fit the same polynomial.
This is purely a heuristic. If you have a bunch of data points of (1,1), there's no way you can decide what the actual degree is. Without any assumptions of the data, you cannot determine the degree of the polynomial x^n.
This is just an example of how you'd implement such a heuristic, and please don't use this in production.
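A more standard way to set the degree dynamically is to treat it as a hyperparameter and pick it by cross-validation. A sketch using GridSearchCV over the pipeline from the question (assuming there are enough data points for the chosen number of folds):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(PolynomialFeatures(), LinearRegression(fit_intercept=False))
param_grid = {'polynomialfeatures__degree': [1, 2, 3]}

# cv must not exceed the number of samples; cv=2 here because the example file is tiny
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
print(search.predict(np.array([[3]])))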

Updated: Apply (vectorized) function on each cell to interpolate grid

I've got a question. I've used these SO threads (this, this and that) to get where I am now.
I've got a DEM file and coordinates + data from weather-stations. Now, I would like to interpolate the air temperature data following the GIDS model (Model 12 in this article) using my DEM. For selection of stations I want to use the 8 nearest neighbors using KDTree.
In short, (I think) I want to evaluate a function at every cell using the coordinates and elevation of my DEM.
I've developed a working function which uses x, y as input to evaluate each value of my grid. See the details in my IPython Notebook.
But now for a whole numpy array. As far as I understand, I have to vectorize my function so that I can apply it to a NumPy array instead of using a double loop. See my simplified code below, which evaluates my function on the array using both a for-loop and a trial with a vectorized function on a NumPy meshgrid. Is this the way forward?
>>> data = [[0.8,0.7,5,25],[2.1,0.71,6,35],[0.75,2.2,8,20],[2.2,2.1,4,18]]
>>> columns = ['Long', 'Lat', 'H', 'T']
>>> df = pd.DataFrame(data, columns=columns)
>>> tree = KDTree(zip(df.ix[:,0],df.ix[:,1]), leafsize=10)
>>> dem = np.array([[5,7,6],[7,9,7],[8,7,4]])
>>> print 'Ground points\n', df
Ground points
Long Lat H T
0 0.80 0.70 5 25
1 2.10 0.71 6 35
2 0.75 2.20 8 20
3 2.20 2.10 4 18
>>> print 'Grid to evaluate\n', dem
Grid to evaluate
[[5 7 6]
[7 9 7]
[8 7 4]]
>>> def f(x,y):
... [see IPython Notebook for details]
... return m( sum((p((d(1,di[:,0])),2)))**-1 ,
... sum(m(tp+(m(b1,(s(pix.ix[0,0],longp))) + m(b2,(s(pix.ix[0,1],latp))) + m(b3,(s(pix.ix[0,2],hp)))), (p((d(1,di[:,0])),2)))) )
...
>>> #Double for-loop
...
>>> tp = np.zeros([dem.shape[0],dem.shape[1]])
>>> for x in range(dem.shape[0]):
... for y in range(dem.shape[1]):
... tp[x][y] = f(x,y)
...
>>> print 'T predicted\n', tp
T predicted
[[ 24.0015287 18.54595636 19.60427132]
[ 28.90354881 20.72871172 17.35098489]
[ 54.69499782 43.79200925 15.33702417]]
>>> # Evaluation of vectorized function using meshgrid
...
>>> x = np.arange(0,3,1)
>>> y = np.arange(0,3,1)
>>> xx, yy = np.meshgrid(x,y, sparse=True)
>>> f_vec = np.vectorize(f) # vectorization of function f
>>> tp_vec = f_vec(xx,yy).T
>>> print 'meshgrid\nx\n', xx,'\ny\n',yy
meshgrid
x
[[0 1 2]]
y
[[0]
[1]
[2]]
>>> print 'T predicted using vectorized function\n', tp_vec
T predicted using vectorized function
[[ 24.0015287 18.54595636 19.60427132]
[ 28.90354881 20.72871172 17.35098489]
[ 54.69499782 43.79200925 15.33702417]]
EDIT
I used %%timeit to check on real data with a grid of size 100x100, and the results were as follows:
# double loop
for x in range(100):
    for y in range(100):
        tp[x][y] = f(x, y)
# 1 loops, best of 3: 29.6 s per loop

# vectorized
tp_vec = f_vec(xx, yy).T
# 1 loops, best of 3: 29.5 s per loop
Both are not so great.
If you are using a vectorized function on a grid, try building a meshgrid with the shape of the dependent array, then use the components derived from the meshgrid to evaluate each grid cell with the vectorized function. Something like this:
def f(x, y):
    '...some code...'
    single_value = array[x, y]  # array = dependent array (e.g. DEM)
    '...some code...'
    return z

x = np.arange(array.shape[0])
y = np.arange(array.shape[1])
xx, yy = np.meshgrid(x, y, sparse=True)
f_vec = np.vectorize(f)  # vectorization of function f
tp_vec = f_vec(xx, yy).T
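If the goal is speed, note that np.vectorize is essentially a Python-level loop, so it will not by itself be faster than the double loop. One option is to pull the nearest-neighbour lookup out of the per-cell function and query the KDTree for all grid cells at once. A rough sketch of that idea — grid_lon and grid_lat are placeholders for whatever mapping turns DEM indices into Long/Lat coordinates, and the GIDS regression itself still has to be filled in from the notebook:
import numpy as np
from scipy.spatial import cKDTree

# Stations (Long, Lat) from the dataframe; build the tree once
tree = cKDTree(np.column_stack([df['Long'].values, df['Lat'].values]))

# Coordinates of every grid cell, flattened to an (n_cells, 2) array
cell_xy = np.column_stack([grid_lon.ravel(), grid_lat.ravel()])

# One query returns the distances and station indices of the 8 nearest
# stations for all cells at once, instead of one lookup per cell
dist, idx = tree.query(cell_xy, k=8)      # shapes: (n_cells, 8)

# The GIDS weights and regression terms can then be computed with array
# operations on dist and idx, e.g. the inverse-squared-distance weights:
weights = 1.0 / dist**2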
