I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both "pandas.DataFrame.corr" and "scipy.stats.spearmanr". Each function returns very different correlation coefficients, and now I am not sure which one is "correct", or whether my dataset is more suited to a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily share all the same points; there are NaNs in some columns and not in others.
import pandas as pd
from scipy.stats import spearmanr

df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle NaN, and its behavior with NaN values is undefined. [Update: scipy.stats.spearmanr now has a nan_policy argument.]
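For example, a minimal sketch of the nan_policy argument on a pair of vectors (the exact behaviour on a whole 2-D matrix may still differ from pandas' pairwise NaN handling):
import numpy as np
from scipy.stats import spearmanr

a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
b = np.array([5.0, 6.0, 7.0, 8.0, 7.0])

# 'propagate' (the default) returns nan, 'raise' throws, 'omit' ignores the NaN pairs
rho, p = spearmanr(a, b, nan_policy='omit')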
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])
What is the most { (1) memory-efficient, (2) time-efficient, (3) easy-to-access* } way to store the upper/lower half of a correlation matrix to a file in Python?
(*By "easy-to-access" I mean being able to read the matrix back from the file and plot it using matplotlib/seaborn.)
For example, for the correlation matrix below:
C1 C2 C3 C4
C1 1.0 0.6 0.7 0.5
C2 0.6 1.0 0.4 0.9
C3 0.7 0.4 1.0 0.3
C4 0.5 0.9 0.3 1.0
I want to store the numbers below in a file.
C2 C3 C4
C1 0.6 0.7 0.5
C2 0.4 0.9
C3 0.3
OR
C1 C2 C3
C2 0.6
C3 0.7 0.4
C4 0.5 0.9 0.3
(I thought of storing it as a csv/tsv file, but it would still waste space on the blank fields standing in for the other half of the matrix.)
You need something like this:
import numpy as np

matrix = np.array([[1, 0.6, 0.7, 0.5],
                   [0.6, 1, 0.4, 0.9],
                   [0.7, 0.4, 1, 0.3],
                   [0.5, 0.9, 0.3, 1]])

# keep only the strict upper/lower triangles and mark everything else as NaN
ut = np.triu(matrix, k=1)
lt = np.tril(matrix, k=-1)
ut = np.where(ut == 0, np.nan, ut)
lt = np.where(lt == 0, np.nan, lt)

np.savetxt("upper.csv", ut, delimiter=",")
np.savetxt("lower.csv", lt, delimiter=",")
Use the second representation. It's just the transpose of the first, and you don't need to store any blank characters for the other half. If blank characters are your concern, write a custom file writer/reader for your matrix.
Example:
mat = []
mat.append(["C1", "C2", "C3"])
mat.append(["C2", 0.6])
mat.append(["C3", 0.7, 0.4])
mat.append(["C4", 0.5, 0.9, 0.3])
print(mat)
with open("correlation.txt", "w") as _file:
for row in mat:
_file.write("\t".join(str(val) for val in row))
_file.write("\n") # you will not have blank characters
with open("correlation.txt", "r") as _file:
for line in _file.readlines():
print(len(line.split()))
Result:
[['C1', 'C2', 'C3'], ['C2', 0.6], ['C3', 0.7, 0.4], ['C4', 0.5, 0.9, 0.3]]
3
2
3
4
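If binary output is acceptable, another option (my sketch, not part of the answers above) is to keep only the strict upper-triangle values with np.triu_indices, which avoids writing any placeholder characters at all, and to rebuild the square matrix when you want to plot it:
import numpy as np

matrix = np.array([[1.0, 0.6, 0.7, 0.5],
                   [0.6, 1.0, 0.4, 0.9],
                   [0.7, 0.4, 1.0, 0.3],
                   [0.5, 0.9, 0.3, 1.0]])

n = matrix.shape[0]
iu = np.triu_indices(n, k=1)               # indices of the strict upper triangle
np.save("upper_triangle.npy", matrix[iu])  # stores only n*(n-1)/2 values

# reconstruct a full symmetric matrix for matplotlib/seaborn
flat = np.load("upper_triangle.npy")
full = np.eye(n)
full[iu] = flat
full[iu[1], iu[0]] = flat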
I have "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v < x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
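Continuing the setup above (not part of the original answer): if v is large, sorting it once and using np.searchsorted gives the same counts of elements strictly below each point without building the full len(x)-by-len(v) comparison matrix:
v_sorted = np.sort(v)
np.searchsorted(v_sorted, x, side='left') / len(v)   # same values as the broadcast version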
I think pd.cut can do that
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
You can use quantile:
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
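The answer gives no code, so here is a minimal sketch of that idea; note that rank(pct=True) ranks the Series it is called on, so by itself it gives the percentile rank of each reference value, and scoring new points still needs one of the lookup approaches above:
import numpy as np
import pandas as pd

np.random.seed(123)
v = pd.Series(np.random.rand(100))

# percentile rank of every value within the reference population itself
v.rank(pct=True).head()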
I am trying to convert a multi-index pandas DataFrame into a numpy.ndarray. The DataFrame is below:
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
I would like the resulting numpy.ndarray to be the following with np.shape() = (2,2,4):
[[[ 0.0 0.0 0.8 0.2 ]
[ 0.1 0.0 0.9 0.0 ]]
[[ 0.0 0.0 0.9 0.1 ]
[ 0.0 0.0 1.0 0.0]]]
I have tried df.as_matrix() but this returns:
[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]
[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]
How do I return a list of lists for the first level, with each list representing an Action's records?
You could use the following:
dim = len(df.index.get_level_values(0).unique())
result = df.values.reshape((dim, dim, df.shape[1]))
print(result)
[[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]]
[[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]]
The first line just finds the number of groups that you want to group by.
Why this (or groupby) is needed: as soon as you use .values, you lose the dimensionality of the MultiIndex from pandas. So you need to re-pass that dimensionality to NumPy in some way.
One way
In [151]: df.groupby(level=0).apply(lambda x: x.values.tolist()).values
Out[151]:
array([[[0.0, 0.0, 0.8, 0.2],
[0.1, 0.0, 0.9, 0.0]],
[[0.0, 0.0, 0.9, 0.1],
[0.0, 0.0, 1.0, 0.0]]], dtype=object)
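If the object dtype from apply is a problem, a small variation (my sketch, not part of the original answer) stacks the per-group blocks into a regular float array:
import numpy as np

arr = np.stack([g.values for _, g in df.groupby(level=0)])
arr.shape   # (2, 2, 4), dtype float64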
Using Divakar's suggestion, np.reshape() worked:
>>> print(P)
s1 s2 s3 s4
Action State
1 s1 0.0 0 0.8 0.2
s2 0.1 0 0.9 0.0
2 s1 0.0 0 0.9 0.1
s2 0.0 0 1.0 0.0
>>> np.reshape(P,(2,2,-1))
[[[ 0. 0. 0.8 0.2]
[ 0.1 0. 0.9 0. ]]
[[ 0. 0. 0.9 0.1]
[ 0. 0. 1. 0. ]]]
>>> np.shape(np.reshape(P, (2, 2, -1)))
(2, 2, 4)
Elaborating on Brad Solomon's answer, to get a slightly more generic solution - index levels of different sizes and an unfixed number of levels - one could do something like this:
def df_to_numpy(df):
    try:
        shape = [len(level) for level in df.index.levels]
    except AttributeError:
        shape = [len(df.index)]
    ncol = df.shape[-1]
    if ncol > 1:
        shape.append(ncol)
    return df.to_numpy().reshape(shape)
If df has missing sub-indexes, reshape will not work. One way to add them would be (maybe there are better solutions):
def enforce_df_shape(df):
    try:
        ind = pd.MultiIndex.from_product([level.values for level in df.index.levels])
    except AttributeError:
        return df
    fulldf = pd.DataFrame(-1, columns=df.columns, index=ind)  # remove -1 to fill fulldf with nan
    fulldf.update(df)
    return fulldf
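Combining the two, a usage sketch under the same assumptions:
arr = df_to_numpy(enforce_df_shape(df))
print(arr.shape)   # (2, 2, 4) for the example frame above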
If you are just trying to pull out one column, say s1, and get an array with shape (2, 2), you can use .index.levshape like this:
x = df.s1.to_numpy().reshape(df.index.levshape)
This will give you a (2,2) containing the value of s1.
Consider the following MultiIndex pandas Series:
import pandas as pd
import numpy as np
val = np.array([ 0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0), (-500, 1921.6),
(-500, 1922.3), (-500, 1923.0), (-400, 1921.6), (-400, 1922.3),
(-400, 1923.0)]
names = ['pp_delay', 'wavenumber']
example = pd.Series(val)
example.index = pd.MultiIndex.from_tuples(inds, names=names)
example should now look like
pp_delay wavenumber
-1000 1921.6 0.4
1922.3 -0.6
1923.0 0.6
-500 1921.6 0.5
1922.3 -0.4
1923.0 0.2
-400 1921.6 0.6
1922.3 1.2
1923.0 -0.4
dtype: float64
I want to group example by pp_delay and select a range within each group using the wavenumber index and perform an operation on that subgroup. To clarify what I mean, I have a few examples.
Here is a position-based solution.
example.groupby(level="pp_delay").nth(list(range(1,3))).groupby(level="pp_delay").sum()
this gives
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
Now the last two elements of each pp_delay group have been summed.
An alternative and more straightforward solution is to loop over the groups:
delays = example.index.levels[0]
res = np.zeros(delays.shape)
roi = slice(1922, 1924)
for i in range(3):
    res[i] = example[delays[i]][roi].sum()
res
gives
array([ 0. , -0.2, 0.8])
Anyhow, I don't like it much either because it doesn't fit well with the usual pandas style.
What I would ideally want is something like:
example.groupby(level="pp_delay").loc[1922:1924].sum()
or maybe even something like
example[:, 1922:1924].sum()
But apparently pandas indexing doesn't work that way. Anybody got a better way?
Cheers
I'd skip the groupby
example.unstack(0).ix[1922:1924].sum()
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
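On newer pandas versions .ix is gone; the same idea with label-based .loc (my adaptation, not the original answer) is:
example.unstack(0).loc[1922:1924].sum()

# or, keeping the MultiIndex and grouping explicitly
idx = pd.IndexSlice
example.loc[idx[:, 1922:1924]].groupby(level="pp_delay").sum()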
I'm looking for a good way to store and use conditional probabilities in python.
I'm thinking of using a pandas DataFrame. If the conditional probabilities of some X are P(X=A|P1=1, P2=1) = 0.2, P(X=A|P1=2, P2=1) = 0.9, etc., I would use the DataFrame
A B
P1 P2
1 1 0.2 0.8
2 0.5 0.5
2 1 0.9 0.1
2 0.9 0.1
and given the marginal probabilities of P1 and P2 as Series
1 0.4
2 0.6
Name: P1
1 0.7
2 0.3
Name: P2
I would like to obtain the Series of marginal probabilities of X, i.e. the series
A 0.656
B 0.344
Name: X
I can get what I want by
X = sum(
    sum(
        X.xs(i, level="P1") * P1[i]
        for i in P1.index
    ).xs(j) * P2[j]
    for j in P2.index
)
X.name = "X"
but this is not easily generalizable to more dependencies, the asymmetry between the first xs (with level) and the second one (without) looks weird, and, as usual when working with pandas, I'm quite sure there is a better solution using its tricks and methods.
Is pandas a good tool for this, should I represent my data in another way, and what is the best way to do this calculation, which is essentially an indexed tensor product, in pandas?
One way to vectorize this is to access the values in Series P1 and P2 by indexing with an array of labels.
In [20]: df = X.reset_index()
In [21]: mP1 = P1[df.P1].values
In [22]: mP2 = P2[df.P2].values
In [23]: mP1
Out[23]: array([ 0.4, 0.4, 0.6, 0.6])
In [24]: mP2
Out[24]: array([ 0.7, 0.3, 0.7, 0.3])
In [25]: mp = mP1 * mP2
In [26]: mp
Out[26]: array([ 0.28, 0.12, 0.42, 0.18])
In [27]: X.mul(mp, axis=0)
Out[27]:
A B
P1 P2
1 1 0.056 0.224
2 0.060 0.060
2 1 0.378 0.042
2 0.162 0.018
In [28]: X.mul(mp, axis=0).sum()
Out[28]:
A 0.656
B 0.344
In [29]: sum(
sum(
X.xs(i, level="P1")*P1[i]
for i in P1.index
).xs(j)*P2[j]
for j in P2.index
)
Out[29]:
A 0.656
B 0.344
(Alternatively, access the values of a MultiIndex without resetting the index, as follows.)
In [38]: P1[X.index.get_level_values("P1")].values
Out[38]: array([ 0.4, 0.4, 0.6, 0.6])
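To address the "not easily generalizable" concern, the same weighting idea can be wrapped in a small helper for any number of independent parents (a hypothetical helper of mine, not part of the original answer):
import numpy as np

def marginalize(cond, parents):
    # cond: DataFrame of P(X | parents), indexed by a MultiIndex over the parent variables
    # parents: dict mapping level name -> Series of that parent's marginal probabilities
    w = np.ones(len(cond))
    for name, marg in parents.items():
        w *= marg[cond.index.get_level_values(name)].values
    return cond.mul(w, axis=0).sum()

marginalize(X, {"P1": P1, "P2": P2})   # -> A 0.656, B 0.344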