I have a "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set of values (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v<x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
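The broadcasting expression above materializes a len(x) by len(v) boolean array. As a sketch of an alternative, assuming it is acceptable to sort the reference population once, np.searchsorted with side="left" counts the reference values strictly below each query, which is the same quantity percentile_rank computes. The seed and the names v_sorted, ranks and broadcast are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the original post
v = rng.random(100)             # reference population
x = np.array([0.3, 0.4, 0.7])   # values to rank against the reference

# Sort the reference once; each lookup is then a binary search.
v_sorted = np.sort(v)

# side="left" gives, for each x, the number of reference values strictly below it.
ranks = np.searchsorted(v_sorted, x, side="left") / len(v)

# Cross-check against the broadcasting version.
broadcast = (v < x[:, None]).mean(axis=1)
print(ranks, np.allclose(ranks, broadcast))
```

This avoids building the full comparison matrix, which may matter when both the reference population and the query set are large.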
I think pd.cut can do that
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
You can use quantile:
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
I have several pairs of arrays of measurements and the times at which the measurements were taken that I want to average. Unfortunately the times at which these measurements were taken aren't regular or the same for each pair.
My idea for averaging them is to create a new array with the value at each second then average these. It works but it seems a bit clumsy and means I have to create many unnecessarily long arrays.
Example Inputs
m1 = [0.4, 0.6, 0.2]
t1 = [0.0, 2.4, 5.2]
m2 = [1.0, 1.4, 1.0]
t2 = [0.0, 3.6, 4.8]
Generated Regular Arrays for values at each second
r1 = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.2]
r2 = [1.0, 1.0, 1.0, 1.0, 1.4, 1.0]
Average values up to length of shortest array
a = [0.7, 0.7, 0.7, 0.8, 1.0, 0.8]
My attempt, given a list of measurement arrays (measurements) and the respective list of time arrays (times):
def granulate(values, times):
    count = 0
    regular_values = []
    for index, x in enumerate(times):
        while count <= x:
            regular_values.append(values[index])
            count += 1
    return np.array(regular_values)
processed_measurements = [granulate(m, t) for m, t in zip(measurements, times)]
min_length = min(len(m) for m in processed_measurements )
processed_measurements = [m[:min_length] for m in processed_measurements]
average_measurement = np.mean(processed_measurements, axis=0)
Is there a better way to do it, ideally using numpy functions?
This will average to closest second:
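# the fancy indexing below needs NumPy arrays, not plain lists
m1, t1, m2, t2 = (np.asarray(a) for a in (m1, t1, m2, t2))
time_series = np.arange(np.stack((t1, t2)).max())
np.mean([m1[abs(t1-time_series[:,None]).argmin(axis=1)], m2[abs(t2-time_series[:,None]).argmin(axis=1)]], axis=0)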
If you want to floor times to each second (with possibility of generalizing to more arrays):
# convert to NumPy arrays so that [:,None] and fancy indexing work
m = [np.asarray(m1), np.asarray(m2)]
t = [np.asarray(t1), np.asarray(t2)]
m_t=[]
time_series = np.arange(np.stack(t).max())
for i in range(len(t)):
    time_diff = time_series-t[i][:,None]
    m_t.append(m[i][np.where(time_diff > 0, time_diff, np.inf).argmin(axis=0)])
average = np.mean(m_t, axis=0)
output:
[0.7 0.7 0.7 0.8 1. 0.8]
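As a further sketch of the same "hold the last measurement" idea, np.searchsorted can look up, for each whole second, the index of the latest measurement taken at or before it. The helper name hold_last and the way the grid length is chosen (to mirror the shortest regular array in the question) are assumptions made for illustration.

```python
import numpy as np

m1 = np.array([0.4, 0.6, 0.2]); t1 = np.array([0.0, 2.4, 5.2])
m2 = np.array([1.0, 1.4, 1.0]); t2 = np.array([0.0, 3.6, 4.8])

def hold_last(values, times, grid):
    """For each grid time, return the latest measurement taken at or before it.

    Assumes `times` is sorted and that times[0] <= grid[0]; otherwise the
    index -1 would silently wrap around to the last measurement.
    """
    idx = np.searchsorted(times, grid, side="right") - 1
    return values[idx]

# Grid of whole seconds, as long as the shortest regular array in the question.
n = int(min(np.ceil(t1[-1]), np.ceil(t2[-1]))) + 1
grid = np.arange(n)

resampled = np.stack([hold_last(m, t, grid) for m, t in [(m1, t1), (m2, t2)]])
average = resampled.mean(axis=0)
print(average)  # approximately [0.7 0.7 0.7 0.8 1.0 0.8]
```

Each series is resampled with one binary search per grid point, so no intermediate (measurements x seconds) difference matrix is built.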
You can do (a bit more numpy-ish solution):
import numpy as np
# oddly enough - numpy doesn't have its own ffill function:
def np_ffill(arr):
    mask = np.arange(len(arr))
    mask[np.isnan(arr)]=0
    np.maximum.accumulate(mask, axis=0, out=mask)
    return arr[mask]
t1=np.ceil(t1).astype("int")
t2=np.ceil(t2).astype("int")
r1=np.empty(max(t1)+1)
r2=np.empty(max(t2)+1)
r1[:]=np.nan
r2[:]=np.nan
r1[t1]=m1
r2[t2]=m2
r1=np_ffill(r1)
r2=np_ffill(r2)
>>> print(r1,r2)
[0.4 0.4 0.4 0.6 0.6 0.6 0.2] [1. 1. 1. 1. 1.4 1. ]
#in order to get avg:
r3=np.vstack([r1[:len(r2)],r2[:len(r1)]]).mean(axis=0)
>>> print(r3)
[0.7 0.7 0.7 0.8 1. 0.8]
I see two possible solutions:
Create a 'bucket' for each time step, let's say 1 second, and insert all measurements that were taken within +/- 1 second of that time step into the bucket. Average all values in the bucket.
Interpolate every measurement row so that they all have equal time steps, then average all measurements at every time step.
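The interpolation option could be sketched with np.interp, as below. Note that this is linear interpolation, so the resampled values differ from the step-wise values in the question; the choice of a one-second grid restricted to the span covered by both series is an assumption for illustration.

```python
import numpy as np

m1 = np.array([0.4, 0.6, 0.2]); t1 = np.array([0.0, 2.4, 5.2])
m2 = np.array([1.0, 1.4, 1.0]); t2 = np.array([0.0, 3.6, 4.8])

# Common one-second grid limited to the time span covered by both series.
end = min(t1[-1], t2[-1])
grid = np.arange(0.0, np.floor(end) + 1)

# Linearly interpolate each series onto the common grid.
r1 = np.interp(grid, t1, m1)
r2 = np.interp(grid, t2, m2)

# Average the aligned series point by point.
average = np.mean([r1, r2], axis=0)
print(grid, average)
```

Restricting the grid to the overlapping span avoids np.interp's default behaviour of repeating the end values outside the measured range.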
I am using a numpy arange.
[In] test = np.arange(0.01, 0.2, 0.02)
[In] test
[Out] array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19])
But then, if I iterate over this array, it iterates over slightly smaller values.
[In] for t in test:
.... print(t)
[Out]
0.01
0.03
0.049999999999999996
0.06999999999999999
0.08999999999999998
0.10999999999999997
0.12999999999999998
0.15
0.16999999999999998
0.18999999999999997
Why is this happening?
To avoid this problem, I have been rounding the values, but is this the best way to solve this problem?
for t in test:
    print(round(t, 2))
I think the nature of floating-point numbers, as mentioned in the comments, is the issue.
If you would still rather not leave it that way, I suggest multiplying your numbers by 100 and working with integers:
test = np.arange(1, 20, 2)
print(test)
for t in test:
    print(t / 100)
This gives me the following output:
[ 1 3 5 7 9 11 13 15 17 19]
0.01
0.03
0.05
0.07
0.09
0.11
0.13
0.15
0.17
0.19
Alternatively you can also try the following:
test = np.arange(1, 20, 2) / 100
Did you try:
test = np.arange(0.01, 0.2, 0.02, dtype=np.float32)
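To illustrate the floating-point explanation given in this thread: the array display rounds each element to NumPy's default print precision, while printing elements one at a time shows more digits of the same stored values. The sketch below, which assumes the goal is only clean display or tolerant comparison, raises the print precision to expose the stored values, compares with a tolerance, and formats for output without changing the data. The name target is illustrative.

```python
import numpy as np

test = np.arange(0.01, 0.2, 0.02)

# Raising the print precision shows the digits the default array display hides.
with np.printoptions(precision=17):
    print(test)

# Compare with a tolerance rather than with ==.
target = np.arange(1, 20, 2) / 100
print(np.allclose(test, target))  # True

# Format for display only; the stored values are left untouched.
for t in test:
    print(f"{t:.2f}")
```

Rounding or formatting at output time keeps full precision in any later arithmetic on the array.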
Consider the following MultiIndex pandas Series:
import pandas as pd
import numpy as np
val = np.array([ 0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0), (-500, 1921.6),
(-500, 1922.3), (-500, 1923.0), (-400, 1921.6), (-400, 1922.3),
(-400, 1923.0)]
names = ['pp_delay', 'wavenumber']
example = pd.Series(val)
example.index = pd.MultiIndex.from_tuples(inds, names=names)
example should now look like
pp_delay wavenumber
-1000 1921.6 0.4
1922.3 -0.6
1923.0 0.6
-500 1921.6 0.5
1922.3 -0.4
1923.0 0.2
-400 1921.6 0.6
1922.3 1.2
1923.0 -0.4
dtype: float64
I want to group example by pp_delay and select a range within each group using the wavenumber index and perform an operation on that subgroup. To clarify what I mean, I have a few examples.
Here is a position based solution.
example.groupby(level="pp_delay").nth(list(range(1,3))).groupby(level="pp_delay").sum()
this gives
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
Now the last two elements of each pp_delay group have been summed.
An alternative and more straightforward solution is to loop over the groups:
delays = example.index.levels[0]
res = np.zeros(delays.shape)
roi = slice(1922, 1924)
for i in range(3):
    res[i] = example[delays[i]][roi].sum()
res
gives
array([ 0. , -0.2, 0.8])
Anyhow, I don't like it much either because it doesn't fit well with the usual pandas style.
What I would ideally want is something like:
example.groupby(level="pp_delay").loc[1922:1924].sum()
or maybe even something like
example[:, 1922:1924].sum()
But apparently pandas indexing doesn't work that way. Anybody got a better way?
Cheers
I'd skip the groupby
example.unstack(0).loc[1922:1924].sum()
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
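If keeping the groupby is preferred, one possible sketch is to filter on the wavenumber level with a boolean mask first and group afterwards. The names wn, mask and result are illustrative; the data is the example Series from the question.

```python
import numpy as np
import pandas as pd

val = np.array([0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0),
        (-500, 1921.6), (-500, 1922.3), (-500, 1923.0),
        (-400, 1921.6), (-400, 1922.3), (-400, 1923.0)]
example = pd.Series(val, index=pd.MultiIndex.from_tuples(
    inds, names=["pp_delay", "wavenumber"]))

# Select rows by label range on the inner level, independent of position.
wn = example.index.get_level_values("wavenumber")
mask = (wn >= 1922) & (wn <= 1924)

# Group the filtered rows by the outer level and reduce.
result = example[mask].groupby(level="pp_delay").sum()
print(result)
```

Because the selection is label-based, it does not depend on every group having the same number of wavenumber entries.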
I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both "pandas.DataFrame.corr" and "scipy.stats.spearmanr". The two functions return very different correlation coefficients, and now I am not sure which is "correct", or whether my dataset is more suitable for a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily share all the same points; there are NaNs in some columns and not in others.
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])
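For data that does contain NaNs, pandas excludes missing values pair by pair when computing a correlation. A minimal sketch of reproducing that with SciPy, assuming pairwise deletion is what is wanted, is to drop the incomplete pairs by hand before calling spearmanr. The two short Series are made-up illustration data, not taken from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Made-up vectors with NaNs in different positions.
a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
b = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0, 5.0])

# pandas drops pairs where either value is missing.
rho_pandas = a.corr(b, method="spearman")

# Do the same pairwise deletion explicitly, then hand clean data to SciPy.
valid = a.notna() & b.notna()
rho_scipy, p_value = spearmanr(a[valid], b[valid])

print(rho_pandas, rho_scipy)
```

For a large matrix, this pairwise masking would have to be applied per pair of rows, which is what makes the pandas result differ from a single spearmanr call on the whole array.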
I'm looking for a good way to store and use conditional probabilities in python.
I'm thinking of using a pandas dataframe. If the conditional probabilities of some X are P(X=A|P1=1, P2=1) = 0.2, P(X=A|P1=2, P2=1) = 0.9 etc., I would use the dataframe
A B
P1 P2
1 1 0.2 0.8
2 0.5 0.5
2 1 0.9 0.1
2 0.9 0.1
and given the marginal probabilities of P1 and P2 as Series
1 0.4
2 0.6
Name: P1
1 0.7
2 0.3
Name: P2
I would like to obtain the Series of marginal probabilities of X, i.e. the series
A 0.656
B 0.344
Name: X
I can get what I want by
X = sum(
    sum(
        X.xs(i, level="P1")*P1[i]
        for i in P1.index
    ).xs(j)*P2[j]
    for j in P2.index
)
X.name="X"
but this is not easily generalizable to more dependencies, and the asymmetry between the first xs (with level) and the second one (without) looks weird. As usual when working with pandas, I'm fairly sure there is a better solution using its tricks and methods.
Is pandas a good tool for this, should I represent my data in another way, and what is the best way to do this calculation, which is essentially an indexed tensor product, in pandas?
One way to vectorize is to access the values in Series P1 and P2 by indexing with an array of labels.
In [20]: df = X.reset_index()
In [21]: mP1 = P1[df.P1].values
In [22]: mP2 = P2[df.P2].values
In [23]: mP1
Out[23]: array([ 0.4, 0.4, 0.6, 0.6])
In [24]: mP2
Out[24]: array([ 0.7, 0.3, 0.7, 0.3])
In [25]: mp = mP1 * mP2
In [26]: mp
Out[26]: array([ 0.28, 0.12, 0.42, 0.18])
In [27]: X.mul(mp, axis=0)
Out[27]:
A B
P1 P2
1 1 0.056 0.224
2 0.060 0.060
2 1 0.378 0.042
2 0.162 0.018
In [28]: X.mul(mp, axis=0).sum()
Out[28]:
A 0.656
B 0.344
In [29]: sum(
    sum(
        X.xs(i, level="P1")*P1[i]
        for i in P1.index
    ).xs(j)*P2[j]
    for j in P2.index
)
Out[29]:
A 0.656
B 0.344
(Alternatively, access the values of a MultiIndex level without resetting the index, as follows.)
In [38]: P1[X.index.get_level_values("P1")].values
Out[38]: array([ 0.4, 0.4, 0.6, 0.6])
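Building on get_level_values, the calculation might be generalized to any number of parent variables along these lines. This is only a sketch: the function name marginalize is hypothetical, and it assumes each parent Series is named after the index level it belongs to and that the parents are independent, as in the question.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], [1, 2]], names=["P1", "P2"])
X = pd.DataFrame({"A": [0.2, 0.5, 0.9, 0.9],
                  "B": [0.8, 0.5, 0.1, 0.1]}, index=index)
P1 = pd.Series({1: 0.4, 2: 0.6}, name="P1")
P2 = pd.Series({1: 0.7, 2: 0.3}, name="P2")

def marginalize(cond, parents):
    """Weight each row of `cond` by the product of its parents' probabilities
    and sum over all rows (hypothetical helper; assumes independent parents)."""
    weights = np.ones(len(cond))
    for parent in parents:
        # Look up each row's parent probability via the matching index level.
        labels = cond.index.get_level_values(parent.name)
        weights = weights * parent.loc[labels].to_numpy()
    return cond.mul(weights, axis=0).sum()

marginal = marginalize(X, [P1, P2])
marginal.name = "X"
print(marginal)
```

Adding a third dependency then only requires another index level on X and another Series in the parents list.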