Optimal way to acquire percentiles of DataFrame rows - python

Problem
I have a pandas DataFrame df:
year val0 val1 val2 ... val98 val99
1983 -42.187 15.213 -32.185 12.887 -33.821
1984 39.213 -142.344 23.221 0.230 1.000
1985 -31.204 0.539 2.000 -1.000 3.442
...
2007 4.239 5.648 -15.483 3.794 -25.459
2008 6.431 0.831 -34.210 0.000 24.527
2009 -0.160 2.639 -2.196 52.628 71.291
My desired output, i.e. new_df, should contain the nine percentiles (including the median) and have the following format:
year percentile_10 percentile_20 percentile_30 percentile_40 median percentile_60 percentile_70 percentile_80 percentile_90
1983 -40.382 -33.182 -25.483 -21.582 -14.424 -9.852 -3.852 6.247 10.528
...
2009 -3.248 0.412 6.672 10.536 12.428 20.582 46.248 52.837 78.991
Attempt
The following was my initial attempt:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30),
                                 percentile(40), np.median, percentile(60),
                                 percentile(70), percentile(80), percentile(90)]).reset_index()
However, instead of returning the percentiles across all the val columns of each row, it calculated the percentiles within each val column separately and therefore returned about 1000 columns. And since each year appears only once in df, every percentile of a single value is just that value, so all the percentile columns contained the same numbers.
I still managed to run the desired task by trying the following:
list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []
for i in range(len(df)):
    list_1.append(np.percentile(df.iloc[i, 1:], 10))
    list_2.append(np.percentile(df.iloc[i, 1:], 20))
    list_3.append(np.percentile(df.iloc[i, 1:], 30))
    list_4.append(np.percentile(df.iloc[i, 1:], 40))
    mlist.append(np.median(df.iloc[i, 1:]))
    list_6.append(np.percentile(df.iloc[i, 1:], 60))
    list_7.append(np.percentile(df.iloc[i, 1:], 70))
    list_8.append(np.percentile(df.iloc[i, 1:], 80))
    list_9.append(np.percentile(df.iloc[i, 1:], 90))
df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9
new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]
But this is clearly a laborious, manual, and inflexible way to achieve the task. What is the optimal way to find the percentiles of each row across multiple columns?

You can use the .describe() function like this:
# Create DataFrame
df = pd.DataFrame(np.random.randn(5, 3))
# .apply() the .describe() function with axis=1 (rows)
df.apply(pd.DataFrame.describe, axis=1)
output:
count mean std min 25% 50% 75% max
0 3.0 0.422915 1.440097 -0.940519 -0.330152 0.280215 1.104632 1.929049
1 3.0 1.615037 0.766079 0.799817 1.262538 1.725259 2.022647 2.320036
2 3.0 0.221560 0.700770 -0.585020 -0.008149 0.568721 0.624849 0.680978
3 3.0 -0.119638 0.182402 -0.274168 -0.220240 -0.166312 -0.042373 0.081565
4 3.0 -0.569942 0.807865 -1.085838 -1.035455 -0.985072 -0.311994 0.361084
If you want percentiles other than the default 0.25, 0.5, 0.75, pass them explicitly, e.g. .describe(percentiles=[0.1, 0.2, ..., 0.9]).
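For example, a minimal sketch requesting the nine cut points from the question (describe always includes 0.5, the median):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))
pcts = df.apply(lambda row: row.describe(percentiles=[.1, .2, .3, .4, .6, .7, .8, .9]),
                axis=1)
print(pcts.filter(like='%'))  # keep only the 10% ... 90% columns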

Use DataFrame.quantile: set year as the index, compute the quantiles along axis=1, then transpose and rename the columns with a custom lambda function:
a = np.arange(1, 10) / 10
f = lambda x: f'percentile_{int(x * 100)}' if x != 0.5 else 'median'
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f)
print (new_df)
percentile_10 percentile_20 percentile_30 percentile_40 median \
year
1983 -38.8406 -35.4942 -33.4938 -32.8394 -32.185
1984 -85.3144 -28.2848 0.3840 0.6920 1.000
1985 -19.1224 -7.0408 -0.6922 -0.0766 0.539
2007 -21.4686 -17.4782 -11.6276 -3.9168 3.794
2008 -20.5260 -6.8420 0.1662 0.4986 0.831
2009 -1.3816 -0.5672 0.3998 1.5194 2.639
percentile_60 percentile_70 percentile_80 percentile_90
year
1983 -14.1562 3.8726 13.3522 14.2826
1984 9.8884 18.7768 26.4194 32.8162
1985 1.1234 1.7078 2.2884 2.8652
2007 3.9720 4.1500 4.5208 5.0844
2008 3.0710 5.3110 10.0502 17.2886
2009 22.6346 42.6302 56.3606 63.8258
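Note that this leaves year as the index; if, as in the desired output, you want it back as an ordinary column, add a final reset_index():
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f).reset_index()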

Related

How to do math operations on a dataframe with an undefined number of columns?

I have a data frame in which there is an indefinite number of columns, to be defined later.
Like this:
index   GDP   2004  2005  ...
brasil  1000  0.10  0.10  ...
china   1000  0.15  0.10  ...
india   1000  0.05  0.10  ...
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})
The GDP column is the initial GDP, and the columns from 2004 onwards are floats representing the percentage growth of GDP in each year.
Using these percentages to get the absolute GDP in each year, based on the initial GDP, I need a dataframe like this:
index   GDP   2004  2005
brasil  1000  1100  1210
china   1000  1150  1265
india   1000  1050  1155
I tried to use itertuples, df.columns and for loops, but I'm probably missing something.
Remember that there is an indefinite number of columns.
Thank you very much in advance!
My answer is a combination of Wardy and user19*.
Starting with...
df = pd.DataFrame(data={'GDP': [1000, 1000, 1000],
                        '2004': [0.10, 0.15, 0.5],
                        '2005': [0.10, 0.10, 0.10],
                        'index': ['brasil', 'china', 'india']})
Find the percentage columns and make sure they are in the right order.
columns_of_interest = sorted(c for c in df.columns if c not in ['GDP', 'index'])
Now we calculate...
running_GDP = df['GDP'].copy()  # starting value (copied so the GDP column itself is not modified in place)
for column in columns_of_interest:
    running_GDP *= 1.0 + df[column]
    df[column] = running_GDP
This results in
GDP 2004 2005 index
0 1000 1100.0 1210.0 brasil
1 1000 1150.0 1265.0 china
2 1000 1500.0 1650.0 india
A simple way is to count the columns and loop over:
num = df.shape[1]
start = 2
for idx in range(start, num):
    df.iloc[:, idx] = df.iloc[:, idx - 1] * (1 + df.iloc[:, idx])
print(df)
which gives
index GDP 2004 2005
0 brasil 1000 1100.0 1210.0
1 china 1000 1150.0 1265.0
2 india 1000 1050.0 1155.0
You can use df.columns to access a list of the dataframe's columns.
Then you can do a loop over all of these column names. Here is an example of your data frame where I multiplied every value by 2. If you want to do different operations to different columns you can add conditions into the loop.
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.5],
                   '2005': [0.10, 0.10, 0.10]})

for colName in df.columns:
    df[colName] *= 2
print(df)
this returns...
index GDP 2004 2005
0 brasilbrasil 2000 0.2 0.2
1 chinachina 2000 0.3 0.2
2 indiaindia 2000 1.0 0.2
Hope this helps!
Add one to the percentages; calculate the cumulative product;
q = (df.iloc[:,2:] + 1).cumprod(axis=1)
multiply by the beginning gdp.
q = q.mul(df['GDP'],axis='index')
If you are trying to change the original DataFrame, assign the result.
df.iloc[:,2:] = q
If you want to make a new DataFrame, concatenate the result with the first columns of the original.
new = pd.concat([df.iloc[:,:2],q],axis=1)
You can put those first two lines together if you want.
q = (df.iloc[:,2:] + 1).cumprod(axis=1).mul(df.GDP,axis='index')
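Putting it together with the question's data, a minimal end-to-end sketch of this cumprod approach:
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})

# Cumulative growth factors, scaled by the initial GDP.
q = (df.iloc[:, 2:] + 1).cumprod(axis=1).mul(df['GDP'], axis='index')
new = pd.concat([df.iloc[:, :2], q], axis=1)
print(new)
#     index   GDP    2004    2005
# 0  brasil  1000  1100.0  1210.0
# 1   china  1000  1150.0  1265.0
# 2   india  1000  1050.0  1155.0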

T Test on Multiple Columns in Dataframe

Dataframe looks something like:
decade rain snow
1910 0.2 0.2
1910 0.3 0.4
2000 0.4 0.5
2010 0.1 0.1
I'd love some help with a function in Python to run a t-test comparing decade combinations for a given column. This function works great, except that it does not take an input column such as rain or snow.
from itertools import combinations

def ttest_run(c1, c2):
    # cat1 and cat2 are hard-coded here; this is the part I want to drive
    # from a column name such as 'rain' or 'snow'
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                      index=[0])
    return df
df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]
final_df = pd.concat(df_list, ignore_index = True)
I think you want something like this:
import pandas as pd
from itertools import combinations
from scipy import stats as st
d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def all_pairwise(df, compare_col='decade'):
    decade_pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to the function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats
df_stats = all_pairwise(df)
df_stats
Now if you execute that code you'll get runtime warnings, caused by division-by-zero errors when there are too few data points for a t-statistic; these are what produce the NaNs in the output:
>>> df_stats
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
snow 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
snow 1910 2010 NaN NaN
rain 1910 1990 0.000000 1.000000
snow 1910 1990 0.436436 0.685044
rain 2000 2010 NaN NaN
...
If you don't want all columns but only some specified set, change the function signature/definition line to read:
def all_pairwise(df, cols, compare_col='decade'):
where cols should be an iterable of string column names (a list will work fine). You'll also need to remove the two lines:
cols = list(df.columns)
cols.remove(compare_col)
from the function body; otherwise it works the same.
You'll always get the runtime warnings unless you filter out decades with too few records before passing to the function.
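For reference, a minimal sketch of that modified version (the same body as above, only with cols passed in):
from itertools import combinations
import pandas as pd
from scipy import stats as st

def all_pairwise(df, cols, compare_col='decade'):
    # cols: an iterable of column names to test, e.g. ['rain'] or ['rain', 'snow']
    pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    list_of_dfs = []
    for pair in pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            list_of_dfs.append(pd.DataFrame({'dec1': pair[0],
                                             'dec2': pair[1],
                                             'tstat': results.statistic,
                                             'pvalue': results.pvalue}, index=[col]))
    return pd.concat(list_of_dfs)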
Here is an example call from the version that accepts a list of columns as arguments and shows the runtime warning.
>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
rain 1910 1990 0.0 1.0
rain 2000 2010 NaN NaN
rain 2000 1990 NaN NaN
rain 2010 1990 NaN NaN
>>>

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() in interval in pandas dataframe. Here's my code
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem with the kstable['counts'] syntax, but how do I solve it?
Use named aggregation to simplify your code; for counts, GroupBy.size is applied to the bucket column to create the new counts column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
In your solution it should work with DataFrame.assign:
kstable = kstable.assign(counts = prob['bucket'].value_counts())

Express pandas operations as pipeline

df = df.loc[:, dict_lup.values()].rename(columns={v: k for k, v in dict_lup.items()})
df['cover'] = df.loc[:, 'cover'] * 100.
df['id'] = df['condition'].map(constants.dict_c)
df['temperature'] = (df['min_t'] + df['max_t']) / 2.
Is there a way to express the code above as a pandas pipeline? I am stuck at the first step where I rename some columns in the dataframe and select a subset of the columns.
-- EDIT:
Data is here:
max_t col_a min_t cover condition pressure
0 38.02 1523106000 19.62 0.48 269.76 1006.64
1 39.02 1523196000 20.07 0.29 266.77 1008.03
2 39 1523282400 19.48 0.78 264.29 1008.29
3 39.11 1523368800 20.01 0.7 263.68 1008.29
4 38.59 1523455200 20.88 0.83 262.35 1007.36
5 39.33 1523541600 22 0.65 261.87 1006.82
6 38.96 1523628000 24.05 0.57 259.27 1006.96
7 39.09 1523714400 22.53 0.88 256.49 1007.94
I think you need assign:
df = (df.loc[:, dict_lup.values()]
        .rename(columns={v: k for k, v in dict_lup.items()})
        .assign(cover=df['cover'] * 100.,
                id=df['condition'].map(constants.dict_c),
                temperature=(df['min_t'] + df['max_t']) / 2.))
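If you would rather not reference the outer df name inside assign (it refers to the frame before the rename), callables work too. A self-contained sketch, where dict_lup and dict_c are placeholders standing in for the question's dict_lup and constants.dict_c:
import pandas as pd

# First three rows of the question's data, for illustration.
df = pd.DataFrame({'max_t': [38.02, 39.02, 39.00],
                   'col_a': [1523106000, 1523196000, 1523282400],
                   'min_t': [19.62, 20.07, 19.48],
                   'cover': [0.48, 0.29, 0.78],
                   'condition': [269.76, 266.77, 264.29],
                   'pressure': [1006.64, 1008.03, 1008.29]})

dict_lup = {'cover': 'cover', 'condition': 'condition',   # hypothetical mapping
            'min_t': 'min_t', 'max_t': 'max_t'}
dict_c = {269.76: 1, 266.77: 2, 264.29: 3}                # hypothetical lookup

out = (df.loc[:, dict_lup.values()]
         .rename(columns={v: k for k, v in dict_lup.items()})
         .assign(cover=lambda d: d['cover'] * 100.,
                 id=lambda d: d['condition'].map(dict_c),
                 temperature=lambda d: (d['min_t'] + d['max_t']) / 2.))
print(out)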

Pandas - get last n values from a group with an offset.

I have a data frame (pandas, Python 3.5) with date as the index.
The electricity_use is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
My goal is to create, for each row, the last 3 electricity_use values within its ('City', 'Country') group, with a gap of 5 days (i.e. take the 3 most recent values from 5 days back or earlier). The dates can be non-consecutive, but they are ordered.
For example, to the two last rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and those are the 3 last values up to that date for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop that goes over each group, but I have a feeling it could be done in a much more efficient way.
A naive approach to do this would be to reindex your dataframe and iteratively merge n times:
from datetime import datetime, timedelta

# make sure the index is in datetime format
df['date'] = df.index
df1 = df.copy()
for i in range(3):
    df1['date'] = df['date'] - timedelta(5 + i)
    df = df1.merge(df, left_on=['City', 'Country', 'date'],
                   right_on=['City', 'Country', 'date'],
                   how='left', suffixes=('', '_' + str(i)))
A faster approach would be to use shift and then remove the bogus values:
df['date'] = df.index
df.sort_values(by=['City', 'Country', 'date'], inplace=True)
# pick the oldest date of every City + Country group
temp = df[['City', 'Country', 'date']].groupby(['City', 'Country']).first()
df = df.merge(temp, left_on=['City', 'Country'], right_index=True, suffixes=('', '_first'))
df['diff_date'] = df['date'] - df['date_first']
df['diff_date'] = [int(i.days) for i in df['diff_date']]
# do a shift by 5, 6 and 7
for i in range(5, 8):
    df['days_prior_' + str(i)] = df['electricity_use'].shift(i)
    # the top i values for every City + Country group would be bogus values,
    # as they would come from the group prior to it
    df.loc[df['diff_date'] < i, 'days_prior_' + str(i)] = 0
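A per-group variant (sketch), assuming one row per (City, Country) per consecutive day; if the dates have gaps, as the question allows, a date-based lookup is needed instead. The prev_1 .. prev_3 column names follow the question's desired output:
import pandas as pd

# Rebuild the question's frame with a DatetimeIndex.
dates = pd.to_datetime(['7/%d/2014' % d for d in range(1, 11)]).repeat(2)
df = pd.DataFrame({'City': ['X', 'Y'] * 10,
                   'Country': ['A'] * 20,
                   'electricity_use': [1.02, 0.25, 1.21, 0.27, 1.25, 0.20, 0.97, 0.43,
                                       0.54, 0.45, 1.33, 0.55, 2.01, 0.21, 1.11, 0.34,
                                       1.35, 0.18, 1.22, 0.27]},
                  index=dates)

# prev_1 is the value 5 days back, prev_2 six days back, prev_3 seven days back,
# taken within each (City, Country) group.
df = df.sort_index()
grp = df.groupby(['City', 'Country'])['electricity_use']
for k, offset in enumerate(range(5, 8), start=1):
    df['prev_' + str(k)] = grp.shift(offset)

print(df.tail(2))  # the 7/10/2014 rows match the desired output in the question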
