I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values in series less than i + 0.5 * # of other values equal to i) / total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5 there are 6 values less than 5 and no other values = to 5. The rank would be (6+0x0.5)/13 or 6/13.
For the fourth element (1) it would be (0+ 2x0.5)/13 or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy array and build a column view of it. Then use broadcasting to count the number of items less than i for each i. Then count the number of items equal to i for each i (note that we need to subtract 1, since i is equal to itself). Finally, add them and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]
less_than_i_count = (s_col>tmp).sum(axis=1)
eq_to_i_count = ((s_col==tmp).sum(axis=1) - 1) * 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
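The same numbers can also be obtained from pandas' built-in average ranking. The average rank of a value is (number of smaller values) + (number of equal values, itself included, + 1) / 2, so subtracting 1 leaves exactly the numerator from the question. A minimal sketch, assuming the series contains no NaN values; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(data=[5, 3, 8, 1, 9, 4, 14, 12, 6, 1, 1, 4, 15])

# (average rank - 1) = (# smaller) + 0.5 * (# other values equal to it)
ranks_via_rank = (s.rank(method="average") - 1) / len(s)

# Broadcasting version from the answer above, kept here for comparison
tmp = s.to_numpy()
less = (tmp[:, None] > tmp).sum(axis=1)
equal_others = (tmp[:, None] == tmp).sum(axis=1) - 1
ranks_via_broadcast = pd.Series((less + 0.5 * equal_others) / len(s), index=s.index)

print(np.allclose(ranks_via_rank, ranks_via_broadcast))
```

Unlike the pairwise comparison, this does not build an n-by-n intermediate array, which may matter for long series.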
I have a big dataset of values (a text version is shown below).
The column "bigger" should hold the index of the first row whose "bsl" is bigger than the "mb" of the current row. I need to do this without a loop because it has to run in less than a second; with a loop it takes over a minute.
For example, for the first row (index 74729), "bigger" is 74731. I know it can be done with LINQ in C#, but I'm almost new to Python.
Here is the text version:
index bsl mb bigger
74729 47091.89 47160.00 74731.0
74730 47159.00 47201.00 74735.0
74731 47196.50 47201.50 74735.0
74732 47186.50 47198.02 74735.0
74733 47191.50 47191.50 74735.0
74734 47162.50 47254.00 74736.0
74735 47252.50 47411.50 74736.0
74736 47414.50 47421.00 74747.0
74737 47368.50 47403.00 74742.0
74738 47305.00 47310.00 74742.0
74739 47292.00 47320.00 74742.0
74740 47302.00 47374.00 74742.0
74741 47291.47 47442.50 74899.0
74742 47403.50 47416.50 74746.0
74743 47354.34 47362.50 74746.0
I'm not sure how many rows you have, but if the number is reasonable, you can perform a pairwise comparison:
# get data as arrays
a = df['bsl'].to_numpy()
b = df['mb'].to_numpy()
idx = df.index.to_numpy()
# compare values and mask lower triangle
# to ensure comparing only the greater indices
out = np.triu(a>b[:,None]).argmax(1)
# reindex to original indices (as float so that NaN can be stored)
idx = idx[out].astype(float)
# mask invalid indices
idx[out<np.arange(len(out))] = np.nan
df['bigger'] = idx
Output:
bsl mb bigger
0 1 2 2.0
1 2 4 6.0
2 3 3 5.0
3 2 1 3.0
4 3 5 NaN
5 4 2 5.0
6 5 1 6.0
7 1 0 7.0
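For reference, here is a self-contained sketch of the pairwise comparison on the small example frame. It is a variation, not the exact code above: rows without any match are detected with any() rather than by comparing positions, which also covers the first row. The function name first_bigger is made up for this illustration, and the n-by-n boolean matrix means memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

def first_bigger(df):
    """Index of the first row at or after each row whose bsl exceeds that row's mb."""
    a = df["bsl"].to_numpy()
    b = df["mb"].to_numpy()
    # m[i, j] is True when j >= i and bsl[j] > mb[i]
    m = np.triu(a > b[:, None])
    pos = m.argmax(axis=1)  # position of the first True in each row (0 if none)
    result = df.index.to_numpy()[pos].astype(float)
    result[~m.any(axis=1)] = np.nan  # rows with no match at all
    return result

df = pd.DataFrame({"bsl": [1, 2, 3, 2, 3, 4, 5, 1],
                   "mb": [2, 4, 3, 1, 5, 2, 1, 0]})
df["bigger"] = first_bigger(df)
print(df)
```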
I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to calculate, for all the elements of df[0], a function of the element itself and the df[1] column:
def custom_fct_2(x,y):
    res=stats.percentileofscore(y.values,x.iloc[-1])
    return res
I get the following error: TypeError:
("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x,y):
    res=stats.percentileofscore(y.values,x.iloc[-1])
    return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me on that? ( I am new in Python)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that the rolling().apply() function cannot give you a segment of 3 rows across all the columns. Instead, it gives you a series for column 0 first, then one for column 1.
Maybe there are better solutions, but here is one that at least works.
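import pandas as pd
from scipy import stats

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
def custom_fct_2(s):
    # s holds 3 rolling values from column 1 and keeps their original index
    score = df[0][s.index.values[1]]  # you may use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)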
I'm using the same data you provided. But I modified your custom_fct_2() function. Here we get the s which is a series of 3 rolling values from the column 1. Fortunately, we have indexes in this series, so we can get the score from the column 0 via the "middle" index of the series. BTW, in Python [-1] means the last element of a collection, but from your explanation, I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value align to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift function is optional. It depends on your requirements: whether prec should be aligned with the row of column 0 that supplies the (middle) score, or with the last row of the rolling window over column 1. I assume you need the former.
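The examples in the question ("the percentile ranking of [10] in [12,6,5]") suggest the score may be meant to come from the last row of each window rather than the middle one. A minimal sketch of that variant, assuming a pandas version whose rolling().apply() accepts raw=False and passes each window as a Series with its original index; the name pct_of_last is hypothetical:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame([[1, 2], [4, 5], [3, 6], [10, 12], [1, 2], [4, 5], [3, 6], [10, 12]])

def pct_of_last(window):
    # window: 3 consecutive values of column 1, indexed by their row labels
    score = df.loc[window.index[-1], 0]  # column 0 value on the window's last row
    return stats.percentileofscore(window.to_numpy(), score)

# No shift is needed here: the result already sits on the row the score comes from
df["perc"] = df[1].rolling(3).apply(pct_of_last, raw=False)
print(df)
```

The first two rows stay NaN because their windows are incomplete.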
Looking to get a continuous rolling mean of a dataframe.
df looks like this
index price
0 4
1 6
2 10
3 12
I'm looking to get a continuous rolling mean of price.
The goal is a moving mean of all the prices seen so far, like this:
index price mean
0 4 4
1 6 5
2 10 6.67
3 12 8
thank you in advance!
you can use expanding:
df['mean'] = df.price.expanding().mean()
df
index price mean
0 4 4.000000
1 6 5.000000
2 10 6.666667
3 12 8.000000
Welcome to SO: Hopefully people will soon remember you from prior SO posts, such as this one.
From your example, it seems that @Allen has given you code that produces the answer in your table. That said, this isn't exactly the same as a "rolling" mean. The expanding() function Allen uses takes the first row and divides by n (which is 1), then adds rows 1 and 2 and divides by n (which is now 2), and so on, so that the last row is (4+6+10+12)/4 = 8.
This last number could be the answer if the window you want for the rolling mean is 4, since that would indicate that you want a mean of 4 observations. However, if you keep moving forward with a window size 4, and start including rows 5, 6, 7... then the answer from expanding() might differ from what you want. In effect, expanding() is recording the mean of the entire series (price in this case) as though it were receiving a new piece of data at each row. "Rolling", on the other hand, gives you a result from an aggregation of some window size.
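To make the description of expanding() concrete, the following sketch compares it with a running total divided by the number of rows seen so far. The name manual_mean is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [4, 6, 10, 12]})

expanding_mean = df["price"].expanding().mean()

# Running total divided by the count of rows seen so far (1, 2, 3, ...)
manual_mean = df["price"].cumsum() / np.arange(1, len(df) + 1)

print(pd.DataFrame({"expanding": expanding_mean, "manual": manual_mean}))
```

Both columns agree on every row, which is what "the mean of the entire series so far" amounts to.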
Here's another option for doing rolling calculations: the rolling() method of a pandas.DataFrame.
In your case, you would do:
df['rolling_mean'] = df.price.rolling(4).mean()
df
index price rolling_mean
0 4 nan
1 6 nan
2 10 nan
3 12 8.000000
Those nans are a result of the windowing: until there are enough rows to calculate the mean, the result is nan. You could set a smaller window:
df['rolling_mean'] = df.price.rolling(2).mean()
df
index price rolling_mean
0 4 nan
1 6 5.000000
2 10 8.000000
3 12 11.000000
This shows the reduction in the nan entries as well as the rolling function: it's only averaging within the size-two window you provided. That results in a different df['rolling_mean'] value than when using df.price.expanding().
Note: you can get rid of the nan by using .rolling(2, min_periods = 1), which tells the function the minimum number of defined values within a window that have to be present to calculate a result.
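A short side-by-side sketch of the three variants discussed here, using the same four prices; the column names are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({"price": [4, 6, 10, 12]})

df["expanding"] = df["price"].expanding().mean()                   # all rows so far
df["rolling_2"] = df["price"].rolling(2).mean()                    # NaN until 2 rows exist
df["rolling_2_min1"] = df["price"].rolling(2, min_periods=1).mean()  # partial first window allowed

print(df)
```

With min_periods=1 the first row is averaged over a single value, so it simply repeats the price.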
I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w is of smaller dimension, I use a for loop to calculate the weighted average by row, of the leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0,len(df)):
    df.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print dot.loc[4] #2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop (with some minor tweaking):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[max(0,(i)):i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop.
import numpy as np
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w))
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
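Because the weights are fixed, the same zero-padded weighted sum can also be written as a convolution, which avoids both the explicit padding and the Python-level lambda. This is a sketch of an alternative technique, not what the answer above uses; the variable names x and weights are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8], index=range(0, 9))
w = pd.DataFrame([0.1, 0.25, 0.5], index=range(0, 3))

x = df[0].to_numpy(dtype=float)
weights = w[0].to_numpy()

# np.convolve flips its second argument, so the weights are passed reversed:
# weights[-1] then lines up with the current row, weights[0] with two rows back.
# The first len(x) entries of the full convolution treat missing leading rows as 0.
dot = pd.DataFrame({"dot": np.convolve(x, weights[::-1])[: len(x)]}, index=df.index)
print(dot)
```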
I want to implement a calculation for a simple scenario:
a value computed as the sum of the daily data during the previous N days (N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to do calculating like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
So far I have done this with a for loop:
df['new'] = [0] * len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        # rows loc-N .. loc-1, i.e. the previous N days
        tmp = df.iloc[loc - N:loc]
        total = tmp['value'].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is very long or N is very big, this calculation becomes very slow. How can I implement this faster, using a function or some other way?
Besides, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use rolling_apply over a window of four and sum up all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
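The module-level rolling_apply function belongs to older pandas releases and is no longer available in recent ones. A minimal sketch of the same idea with the .rolling() method, assuming the sample data from the question and an integer date index (the real index type is not shown in the question):

```python
import pandas as pd

N = 3
df = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5, 6, 7]},
    index=pd.Index(
        [20140718, 20140721, 20140722, 20140723, 20140724, 20140725, 20140728],
        name="date",
    ),
)

# Sum over a window of N rows, then shift down one row so that the
# current day is excluded; rows without N previous days become 0.
df["new"] = df["value"].rolling(N).sum().shift(1).fillna(0)
print(df)
```

The built-in sum() on the rolling object is generally faster than apply() with a Python function, which matters when the frame or N is large.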