above25percentile = df.loc[df["order_amount"] > np.percentile(df["order_amount"], 25)]
below75percentile = df.loc[df["order_amount"] < np.percentile(df["order_amount"], 75)]
interquartile = above25percentile & below75percentile
print(interquartile.mean())
I can't seem to get the mean here. Any thoughts?
You attempt to compute interquartile with the & operator as if its operands were boolean masks, but they are not: df.loc[...] has already filtered the frame, so both operands contain actual order amounts. & will not give you an intersection of their indices. And even if they were boolean masks, your subsequent .mean() would average a bunch of zeros and ones, giving roughly 0.5 (in fact, the fraction of the data that falls within the IQR) rather than the mean of the values.
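To illustrate the second point, the mean of a boolean Series is simply the fraction of True values (a hypothetical mask for demonstration):

import pandas as pd

mask = pd.Series([True, False, True, False])
print(mask.mean())  # 0.5 -- the share of True values, not the mean of the data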
First, compute interquartile as a proper mask. Pandas has its own quantile method, which, like np.percentile and siblings, accepts multiple percentiles simultaneously. You can combine that with between to get your mask more efficiently:
interquartile = df['order_amount'].between(*df['order_amount'].quantile([0.25, 0.75]))
You can apply the mask to the column and take the mean like this:
df.loc[interquartile, 'order_amount'].mean()
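As a quick sanity check on a toy frame (hypothetical data; with these quantiles, the default inclusive bounds make no difference for integer values):

import pandas as pd

df = pd.DataFrame({'order_amount': range(10)})
interquartile = df['order_amount'].between(*df['order_amount'].quantile([0.25, 0.75]))
print(df.loc[interquartile, 'order_amount'].mean())  # 4.5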
Try:
above25percentile = df["order_amount"]>np.percentile(df['order_amount'],25)
below75percentile = df['order_amount']<np.percentile(df['order_amount'],75)
print(df.loc[above25percentile & below75percentile, 'order_amount'].mean())
Or you can use between:
df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
                                  np.percentile(df['order_amount'], 75),
                                  inclusive='neither'), 'order_amount'].mean()
Suppose the following dataframe:
df = pd.DataFrame({'order_amount': range(0, 10)})
print(df)
# Output
   order_amount
0             0   # Excluded
1             1   # "
2             2   # "
3             3
4             4   # mean <- (3 + 4 + 5 + 6) / 4 = 4.5
5             5
6             6
7             7   # Excluded
8             8   # "
9             9   # "
Output:
>>> df.loc[df['order_amount'].between(np.percentile(df['order_amount'], 25),
...                                   np.percentile(df['order_amount'], 75),
...                                   inclusive='neither'), 'order_amount'].mean()
4.5
Related
I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values less than i + 0.5 × # of other values equal to i) / total # of values
So if I had the following series:
s = pd.Series(data=[5, 3, 8, 1, 9, 4, 14, 12, 6, 1, 1, 4, 15])
For the first element (5), there are 6 values less than 5 and no other values equal to 5, so the rank would be (6 + 0 × 0.5)/13, or 6/13.
For the fourth element (1), it would be (0 + 2 × 0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume it takes some combination of s.apply and/or s.where(), but I can't figure it out and have tried searching. I am looking to apply this to the entire series at once, with the result being a series of percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use broadcasting to count, for each value, the number of items less than it, and the number of items equal to it (note that we need to subtract 1, since every value is equal to itself). Finally add the two counts and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]                                    # column vector for broadcasting
less_than_i_count = (s_col > tmp).sum(axis=1)           # values strictly less than each element
eq_to_i_count = ((s_col == tmp).sum(axis=1) - 1) * 0.5  # equal values, excluding the element itself
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
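As an aside, the same numbers can be reproduced with pandas' built-in rank: with method='average', a value's rank equals (count of smaller values) + (count of equal values + 1) / 2, so subtracting 1 and dividing by the length matches the formula above. A one-line sketch (not part of the original answer):

ranks_alt = (s.rank(method='average') - 1) / len(s)
print((ranks_alt == ranks).all())  # True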
I have a dataframe (df) with 2 columns:
Out[2]:
    0   1
0   1   2
1   4   5
2   3   6
3  10  12
4   1   2
5   4   5
6   3   6
7  10  12
I would like to calculate, for every element of df[0], a function of that element and the df[1] column:
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
I get the following error:
TypeError: ("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats

def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc'] = df.rolling(3).apply(custom_fct_2(df[0], df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
    0   1
...
5   4   5
6   3   6
7  10  12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that rolling().apply() cannot give you a window of 3 rows across all columns at once. Instead, it passes a Series for column 0 first, then one for column 1.
There may be better solutions, but here is one that at least works.
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    score = df[0][s.index.values[1]]  # you may use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here s is a Series of 3 rolling values from column 1. Fortunately, this Series keeps its index, so we can fetch the score from column 0 via the "middle" index of the window. By the way, in Python [-1] means the last element of a collection, but from your explanation I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value align to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift is optional. Whether you need it depends on whether prec should be aligned with column 0 (where the middle score comes from) or with the last row of the rolling window over column 1. I assume you want the former.
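Putting the pieces together as a runnable sketch (this assumes a pandas version where rolling().apply passes a Series when raw=False, so the window keeps its index):

import pandas as pd
from scipy import stats

df = pd.DataFrame([[1, 2], [4, 5], [3, 6], [10, 12],
                   [1, 2], [4, 5], [3, 6], [10, 12]])

def custom_fct_2(s):
    # the score comes from column 0 at the middle index of the window
    score = df[0][s.index.values[1]]
    return stats.percentileofscore(s.values, score)

df['prec'] = df[1].rolling(3).apply(custom_fct_2, raw=False).shift(periods=-1)
print(df)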
I have a pandas dataframe that looks like this:
import pandas as pd
dt = pd.DataFrame({'idx':[1,2,3,4,5,1,2,3,4,5], 'id':[1,1,1,1,1,2,2,2,2,2], 'value':[5,10,15,20,25, 55,65,75,85,97]})
I have another that looks like this:
dt_idx = pd.DataFrame({'cutoff':[1,1,1,3,3,3,3,3,5,5,5,5,2,2,2,2,2,2,2,4,4]})
For the 3 "most common" cutoffs from the dt_idx (in this toy example it is 3,5 and 2), I would like to obtain the mean and the std of the value column of the dt dataframe, for the following 2 groups:
idx <= cutoff and
idx > cutoff
Is there a pythonic way to do that?
A simple loop here is a good option. Get the cutoffs you care about using value_counts and then loop over those cutoffs. You can use groupby to get both the <= and > at the same time. Store everything in a dict, keyed by the cutoffs, and then you can concat to get a DataFrame with a MultiIndex.
d = {}
for cutoff in dt_idx.cutoff.value_counts().head(3).index:
    d[cutoff] = dt.groupby(dt.idx.gt(cutoff))['value'].agg(['mean', 'std'])

pd.concat(d, names=['cutoff', 'greater_than_cutoff'])
                                 mean        std
cutoff greater_than_cutoff
2      False                33.750000  30.652624
       True                 52.833333  36.771819
3      False                37.500000  30.943497
       True                 56.750000  39.903007
5      False                45.200000  34.080949
If you want to use those cutoffs as ranges instead, we can build the bin edges, adding np.inf at the end, and use a single groupby with pd.cut to form the groups.
bins = dt_idx.cutoff.value_counts().head(3).index.sort_values().tolist() + [np.inf]
# [2, 3, 5, inf]

dt.groupby(pd.cut(dt.idx, bins, right=False))['value'].agg(['mean', 'std'])
#              mean        std
# idx
# [2.0, 3.0)  37.50  38.890873
# [3.0, 5.0)  48.75  36.371921
# [5.0, inf)  61.00  50.911688
First we get the 3 most common values, then we use GroupBy.agg for each of these values.
import numpy as np

n = 3
l = dt_idx['cutoff'].value_counts()[:n].index
new_df = pd.concat({val: dt.groupby(np.where(dt['idx'].le(val),
                                             'less than or equal',
                                             'higher'))['value']
                           .agg(['mean', 'std'])
                    for val in l}, axis=1)
print(new_df)
                            2                     3                    5
                         mean        std      mean        std     mean        std
higher              52.833333  36.771819     56.75  39.903007      NaN        NaN
less than or equal  33.750000  30.652624     37.50  30.943497     45.2  34.080949

# new_df.stack(0).swaplevel().sort_index()
#                            mean        std
# 2 higher              52.833333  36.771819
#   less than or equal  33.750000  30.652624
# 3 higher              56.750000  39.903007
#   less than or equal  37.500000  30.943497
# 5 less than or equal  45.200000  34.080949
I have a pandas DataFrame which is a 50x50 correlation matrix.
What I would like to do, if it's possible of course, is to make a new DataFrame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not equal to 1, to exclude the diagonal of self-correlations.
I don't think what I'm asking is exactly possible, because of course variable x0 won't have the same strong relationships that x1 has, etc., so the new DataFrame won't line up neatly.
But is there any way to scan quickly through this DataFrame, find the values I mentioned, and at least collect them into an array?
Any insight would be helpful. Thanks
You can't really keep the matrix shape if you want to drop correlation pairs that are too weak. One thing you can do is stack the frame and keep only the relevant correlation pairs.
Given (randomly generated as an example):
          0         1         2         3         4
0  0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1  0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866  0.077580
3  0.158409  0.827085  0.018871 -0.478428  0.129545
4  0.825489 -0.000416  0.682744  0.794137  0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
0  1   -0.881054
   2   -0.718265
   4   -0.587288
1  0    0.587694
   2   -0.529463
   3   -0.508112
2  0   -0.528640
   2   -0.679416
3  1    0.827085
4  0    0.825489
   2    0.682744
   3    0.794137
   4    0.694887
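One caveat: a symmetric correlation matrix reports every pair twice ((i, j) and (j, i)). If you want each pair only once, you can mask out the lower triangle and the diagonal before stacking. A sketch with hypothetical data standing in for the 50 variables:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((100, 5)),
                    columns=[f'x{i}' for i in range(5)])
corr = data.corr()

# keep only the upper triangle; k=1 also drops the diagonal of 1s
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strong = corr.where(mask).stack().dropna()  # drop the masked-out cells
strong = strong[strong.abs() > 0.5]         # keep only |r| > 0.5
print(strong)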
I want to implement a calculation for a simple scenario:
the value is computed as the sum of the daily data over the previous N days (set N = 3 in the following example).
DataFrame df (df.index is 'date'):
date      value
20140718      1
20140721      2
20140722      3
20140723      4
20140724      5
20140725      6
20140728      7
......
I want to compute something like:
date      value  new
20140718      1    0
20140721      2    0
20140722      3    0
20140723      4    6   (3+2+1)
20140724      5    9   (4+3+2)
20140725      6   12   (5+4+3)
20140728      7   15   (6+5+4)
......
Now I do this with a for loop, like:
N = 3
df['new'] = [0] * len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        tmp = df.iloc[loc - N:loc]  # the previous N rows
        total = tmp['value'].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the DataFrame is long or N is big, this calculation becomes very slow... How can I implement it faster, with a built-in function or some other approach?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use a rolling window of four and sum up all but the last value.
new = df.rolling(4, min_periods=4).apply(lambda x: x[:-1].sum())
This is the same as using a window of three and shifting afterwards:
new = df.rolling(3, min_periods=3).sum().shift()
Then
df["new"] = new["value"].fillna(0)