I'm looking for a good way to store and use conditional probabilities in python.
I'm thinking of using a pandas DataFrame. If the conditional probabilities of some X are P(X=A|P1=1, P2=1) = 0.2, P(X=A|P1=2, P2=1) = 0.9, etc., I would use the DataFrame
         A    B
P1 P2
1  1   0.2  0.8
   2   0.5  0.5
2  1   0.9  0.1
   2   0.9  0.1
and given the marginal probabilities of P1 and P2 as Series
1 0.4
2 0.6
Name: P1
1 0.7
2 0.3
Name: P2
I would like to obtain the Series of marginal probabilities of X, i.e. the series
A 0.656
B 0.344
Name: X
I can get what I want by
X = sum(
    sum(
        X.xs(i, level="P1")*P1[i]
        for i in P1.index
    ).xs(j)*P2[j]
    for j in P2.index
)
X.name="X"
but this does not generalize easily to more dependencies, and the asymmetry between the first xs (with level) and the second one (without) looks odd. As usual when working with pandas, I suspect there is a better solution using its tricks and methods.
Is pandas a good tool for this, or should I represent my data in another way? And what is the best way to do this calculation, which is essentially an indexed tensor product, in pandas?
One way to vectorize is to access the values in the Series P1 and P2 by indexing with an array of labels.
In [20]: df = X.reset_index()
In [21]: mP1 = P1[df.P1].values
In [22]: mP2 = P2[df.P2].values
In [23]: mP1
Out[23]: array([ 0.4, 0.4, 0.6, 0.6])
In [24]: mP2
Out[24]: array([ 0.7, 0.3, 0.7, 0.3])
In [25]: mp = mP1 * mP2
In [26]: mp
Out[26]: array([ 0.28, 0.12, 0.42, 0.18])
In [27]: X.mul(mp, axis=0)
Out[27]:
           A      B
P1 P2
1  1   0.056  0.224
   2   0.060  0.060
2  1   0.378  0.042
   2   0.162  0.018
In [28]: X.mul(mp, axis=0).sum()
Out[28]:
A 0.656
B 0.344
In [29]: sum(
    sum(
        X.xs(i, level="P1")*P1[i]
        for i in P1.index
    ).xs(j)*P2[j]
    for j in P2.index
)
Out[29]:
A 0.656
B 0.344
(Alternatively, access the values of a MultiIndex level without resetting the index, as follows.)
In [38]: P1[X.index.get_level_values("P1")].values
Out[38]: array([ 0.4, 0.4, 0.6, 0.6])
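The same idea extends to any number of conditioning variables by looping over the index levels. The following is only a minimal sketch under the assumption that each marginal is a Series whose name matches a level of the MultiIndex and that the parents are independent; the helper name marginalize is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# Conditional probability table from the question, indexed by (P1, P2).
index = pd.MultiIndex.from_product([[1, 2], [1, 2]], names=["P1", "P2"])
X = pd.DataFrame(
    {"A": [0.2, 0.5, 0.9, 0.9], "B": [0.8, 0.5, 0.1, 0.1]}, index=index
)
P1 = pd.Series([0.4, 0.6], index=[1, 2], name="P1")
P2 = pd.Series([0.7, 0.3], index=[1, 2], name="P2")


def marginalize(cond, parents):
    """Hypothetical helper: weight each row by the product of its parents' marginals."""
    weights = np.ones(len(cond))
    for p in parents:
        # Look up the marginal probability for the label each row has on this level.
        labels = cond.index.get_level_values(p.name)
        weights = weights * p.loc[labels].to_numpy()
    # Scale rows by their joint weight and sum over all parent combinations.
    return cond.mul(weights, axis=0).sum()


marginal = marginalize(X, [P1, P2]).rename("X")
print(marginal)
```

Adding a third dependency would then only require another index level on X and another Series in the list.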
I have a "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set of values (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v<x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
1. I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
2. I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
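Broadcasting builds a len(x) by len(v) boolean array, which can get large. As a hedged alternative (not part of the answer above), sorting the reference population once and using np.searchsorted should give the same strict "less than" counts with far less memory. The seeded generator below is only an assumption to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
v = rng.random(100)             # reference population
x = np.array([0.3, 0.5, 0.7])   # values to rank

# side="left" returns, for each x, the number of elements strictly less than x.
v_sorted = np.sort(v)
ranks = np.searchsorted(v_sorted, x, side="left") / len(v)

# Cross-check against the broadcasting version.
broadcast = (v < x[:, None]).mean(axis=1)
print(ranks, broadcast)
```

The sort costs O(n log n) once, after which each lookup is a binary search.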
I think pd.cut can do that
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
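You can use quantile, but note that it goes in the opposite direction: it returns the value in v at each given quantile rather than the percentile rank of a given value: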
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
I have a pandas DataFrame that contains the results of a computation and need to:
1. take the maximum value of a column and, among the rows with that value, find the maximum value of another column
2. take the minimum value of a column and, among the rows with that value, find the maximum value of another column
Is there a more efficient way to do it?
Setup
from collections import namedtuple
import pandas as pd
metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 =[metrictuple(0.1, 0.4, 0.04),metrictuple(0.2, 0.4, 0.04),metrictuple(0.4, 0.4, 0.1),metrictuple(0.7, 0.2, 0.3),metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)
# df
# prob m1 m2
#0 0.1 0.4 0.04
#1 0.2 0.4 0.04
#2 0.4 0.4 0.10
#3 0.7 0.2 0.30
#4 1.0 0.1 0.50
tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.4, 0.4)
tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.2, 0.04)
pandas is not ideal for this kind of tight numerical computation, because slicing and selecting data (df.loc in this example) carries significant overhead.
The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays.
Below I've defined some helper functions which make the code more readable. Note that numpy slicing is performed via row and column numbers starting from 0.
arr = df.values
def arr_max(x, col):
    return x[x[:,col]==x[:,col].max()]
def arr_min(x, col):
    return x[x[:,col]==x[:,col].min()]
res1 = arr_max(arr_max(arr, 1), 0)[:,:2] # array([[ 0.4, 0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:,[0,2]] # array([[ 0.2 , 0.04]])
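For comparison, the same two lookups can be written in pandas with a multi-column sort. This is only a sketch of an alternative, not a claim that it is faster: sorting is O(n log n), while the mask-based helpers above are linear, so it is worth timing on real data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "prob": [0.1, 0.2, 0.4, 0.7, 1.0],
        "m1": [0.4, 0.4, 0.4, 0.2, 0.1],
        "m2": [0.04, 0.04, 0.10, 0.30, 0.50],
    }
)

# Largest m1 first, ties broken by largest prob: the first row is the answer.
res1 = df.sort_values(["m1", "prob"], ascending=[False, False]).iloc[0][["prob", "m1"]]

# Smallest m2 first, ties broken by largest prob.
res2 = df.sort_values(["m2", "prob"], ascending=[True, False]).iloc[0][["prob", "m2"]]

print(tuple(res1), tuple(res2))
```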
Consider a large dataframe of scores S containing entries like the following. Each row represents a contest between a subset of the participants A, B, C and D.
  A    B    C    D
0.1  0.3  0.8    1
  1  0.2  NaN  NaN
0.7  NaN    2  0.5
NaN    4  0.6  0.8
The way to read the matrix above is: looking at the first row, the participant A scored 0.1 in that round, B scored 0.3, and so forth.
I need to build a triangular matrix C where C[X,Y] stores how much better participant X was than participant Y. More specifically, C[X,Y] would hold the mean % difference in score between X and Y.
From the example above:
C[A,B] = 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) = 333%
My matrix S is huge, so I am hoping to take advantage of JIT (Numba?) or built-in methods in numpy or pandas. I certainly want to avoid having a nested loop, since S has millions of rows.
Does an efficient algorithm for the above have a name?
Let's look at a NumPy based solution and thus let's assume that the input data is in an array named a. Now, the number of pairwise combinations for 4 such variables would be 4*3/2 = 6. We can generate the IDs corresponding to such combinations with np.triu_indices(). Then, we index into the columns of a with those indices. We perform the subtractions and divisions and simply add the columns ignoring the NaN affected results with np.nansum() for the desired output.
Thus, we would have an implementation like so -
R,C = np.triu_indices(a.shape[1],1)
out = 100*np.nansum((a[:,R] - a[:,C])/a[:,C],0)
Sample run -
In [121]: a
Out[121]:
array([[ 0.1, 0.3, 0.8, 1. ],
[ 1. , 0.2, nan, nan],
[ 0.7, nan, 2. , 0.5],
[ nan, 4. , 0.6, 0.8]])
In [122]: out
Out[122]:
array([ 333.33333333, -152.5 , -50. , 504.16666667,
330. , 255. ])
In [123]: 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) # Sample's first o/p elem
Out[123]: 333.33333333333337
If you need the output as (4,4) array, we can use Scipy's squareform -
In [124]: from scipy.spatial.distance import squareform
In [125]: out2D = squareform(out)
Let's convert to a pandas dataframe for a good visual feedback -
In [126]: pd.DataFrame(out2D,index=list('ABCD'),columns=list('ABCD'))
Out[126]:
            A           B           C    D
A    0.000000  333.333333 -152.500000  -50
B  333.333333    0.000000  504.166667  330
C -152.500000  504.166667    0.000000  255
D  -50.000000  330.000000  255.000000    0
Let's compute [B,C] manually and check back -
In [127]: 100 * ((0.3 - 0.8)/0.8 + (4 - 0.6)/0.6)
Out[127]: 504.1666666666667
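Note that np.nansum gives the summed percentage difference, matching the formula in the question, while the question's wording asks for a mean. If a mean over the rounds both participants took part in is what is wanted, np.nanmean can be swapped in. The sketch below assumes that reading, and fills only the upper triangle rather than using squareform, which mirrors the values symmetrically.

```python
import numpy as np

a = np.array(
    [
        [0.1, 0.3, 0.8, 1.0],
        [1.0, 0.2, np.nan, np.nan],
        [0.7, np.nan, 2.0, 0.5],
        [np.nan, 4.0, 0.6, 0.8],
    ]
)

# Column index pairs (X, Y) with X < Y.
R, C = np.triu_indices(a.shape[1], 1)
rel = (a[:, R] - a[:, C]) / a[:, C]  # NaN wherever either participant is missing

sum_pct = 100 * np.nansum(rel, axis=0)    # as in the answer above
mean_pct = 100 * np.nanmean(rel, axis=0)  # averaged over shared rounds only

# Place the means into an upper-triangular matrix C[X, Y].
n = a.shape[1]
out2d = np.zeros((n, n))
out2d[R, C] = mean_pct
print(out2d)
```

np.nanmean emits a warning and returns NaN for a pair that never met in any round, which does not happen with this sample.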
Consider the following MultiIndex pandas Series:
import pandas as pd
import numpy as np
val = np.array([ 0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0), (-500, 1921.6),
(-500, 1922.3), (-500, 1923.0), (-400, 1921.6), (-400, 1922.3),
(-400, 1923.0)]
names = ['pp_delay', 'wavenumber']
example = pd.Series(val)
example.index = pd.MultiIndex.from_tuples(inds, names=names)
example should now look like
pp_delay  wavenumber
-1000     1921.6        0.4
          1922.3       -0.6
          1923.0        0.6
-500      1921.6        0.5
          1922.3       -0.4
          1923.0        0.2
-400      1921.6        0.6
          1922.3        1.2
          1923.0       -0.4
dtype: float64
I want to group example by pp_delay and select a range within each group using the wavenumber index and perform an operation on that subgroup. To clarify what I mean, I have a few examples.
Here is a position based solution.
example.groupby(level="pp_delay").nth(list(range(1,3))).groupby(level="pp_delay").sum()
this gives
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
Now the last two elements of each pp_delay group have been summed.
An alternative and more straightforward solution is to loop over the groups:
delays = example.index.levels[0]
res = np.zeros(delays.shape)
roi = slice(1922, 1924)
for i in range(3):
    res[i] = example[delays[i]][roi].sum()
res
gives
array([ 0. , -0.2, 0.8])
Anyhow, I don't like it much either, because it doesn't fit well with the usual pandas style.
What I would ideally want is something like:
example.groupby(level="pp_delay").loc[1922:1924].sum()
or maybe even something like
example[:, 1922:1924].sum()
But apparently pandas indexing doesn't work that way. Anybody got a better way?
Cheers
I'd skip the groupby
example.unstack(0).loc[1922:1924].sum()
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
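If you would rather keep the Series in its stacked form, a boolean mask built from the wavenumber level is one more option. The sketch below rebuilds the example Series and compares the mask approach with the unstack approach; it assumes both ends of the range 1922 to 1924 should be inclusive, as with label slicing.

```python
import numpy as np
import pandas as pd

val = np.array([0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0),
        (-500, 1921.6), (-500, 1922.3), (-500, 1923.0),
        (-400, 1921.6), (-400, 1922.3), (-400, 1923.0)]
example = pd.Series(
    val, index=pd.MultiIndex.from_tuples(inds, names=["pp_delay", "wavenumber"])
)

# Select rows whose wavenumber lies in the region of interest, then sum per delay.
wn = example.index.get_level_values("wavenumber")
mask = (wn >= 1922) & (wn <= 1924)
by_mask = example[mask].groupby(level="pp_delay").sum()

# Same result via unstacking pp_delay into columns and slicing rows by label.
by_unstack = example.unstack("pp_delay").loc[1922:1924].sum()

print(by_mask)
```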
I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both pandas.DataFrame.corr and scipy.stats.spearmanr. The two functions return very different correlation coefficients, and now I am not sure which one is correct, or whether my dataset is better suited to a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily share the same points; there are NaN values in some columns and not in others.
from scipy.stats import spearmanr
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
     0    1    2    3    4
0  1.0 -0.8  0.8  0.7  0.1
1 -0.8  1.0 -0.7 -0.7 -0.1
2  0.8 -0.7  1.0  0.8 -0.1
3  0.7 -0.7  0.8  1.0  0.5
4  0.1 -0.1 -0.1  0.5  1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])
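With NaNs present, DataFrame.corr only uses the observations where both variables are available for each pair. To get a comparable number from SciPy, one option is to drop the incomplete pairs yourself before calling spearmanr. The following is a small sketch on made-up data (the values are illustrative assumptions, not taken from the question).

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Two made-up variables with NaNs in different positions.
x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 3.5])
y = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0, 4.0])

# pandas: correlation over pairwise-complete observations.
rho_pandas = pd.DataFrame({"x": x, "y": y}).corr(method="spearman").loc["x", "y"]

# SciPy: keep only the rows where both values are present.
mask = x.notna() & y.notna()
rho_scipy, _ = spearmanr(x[mask], y[mask])

print(rho_pandas, rho_scipy)
```

For a full matrix this masking has to be done per pair of rows, since each pair can have a different set of valid columns.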