Group By Median, Percentile and Percent of Total

Group By Median, Percentile and Percent of Total - python

I have a dataframe that looks like this...
ID Acuity TOTAL_ED_LOS
1 2 423
2 5 52
3 5 535
4 1 87
...
I would like to produce a table that looks like this:
Acuity Count Median Percentile_25 Percentile_75 % of total
1 234 ... 31%
2 65 ... 8%
3 56 ... 7%
4 345 ... 47%
5 35 ... 5%
I already have code that will give me everything I need except for the % of total column
def percentile(n):
def percentile_(x):
return np.percentile(x, n)
percentile_.__name__ = 'percentile_%s' % n
return percentile_
df_grp = df_merged_v1.groupby(['Acuity'])
df_grp['TOTAL_ED_LOS'].agg(['count','median',
percentile(25), percentile(75)]).reset_index()
Is there an efficient way I can add the percent of total column? The link below contain code on how to obtain the percent of total but I'm unsure how to apply it to my code. I know that I could create two tables and then merge them but am curious if there is a cleaner way.
How to calculate count and percentage in groupby in Python

Here's a one way to do it using some pandas builtin tools:
# Set random number seeed and create a dummy datafame with two columns
np.random.seed(123)
df = pd.DataFrame({'activity':np.random.choice([*'ABCDE'], 40),
'TOTAL_ED_LDS':np.random.randint(50, 500, 40)})
# Reshape dataframe to get activit per column
# then use the output from describe and transpose
df_out = df.set_index([df.groupby('activity').cumcount(),'activity'])['TOTAL_ED_LDS']\
.unstack().describe().T
#Calculate percent count of total count
df_out['% of Total'] = df_out['count'] / df_out['count'].sum() * 100.
df_out
Output:
count mean std min 25% 50% 75% max % of Total
activity
A 8.0 213.125000 106.810162 93.0 159.50 200.0 231.75 421.0 20.0
B 10.0 308.200000 116.105125 68.0 240.75 324.5 376.25 461.0 25.0
C 6.0 277.666667 117.188168 114.0 193.25 311.5 352.50 409.0 15.0
D 7.0 370.285714 124.724649 120.0 337.50 407.0 456.00 478.0 17.5
E 9.0 297.000000 160.812002 51.0 233.00 294.0 415.00 488.0 22.5

Related

Is there a way to do rolling rank in Pandas?

I am trying to rank some values in one column over a rolling period of N days instead of having the ranking done over the entire set. I have seen several methods here using rolling_apply but I have read that this is no longer in python. For example, in the following table;
A
01-01-2013
100
02-01-2013
85
03-01-2013
110
04-01-2013
60
05-01-2013
20
06-01-2013
40
For the column A above, how can I have the rank as below for N = 3;
A
Ranked_A
01-01-2013
100
NaN
02-01-2013
85
Nan
03-01-2013
110
1
04-01-2013
60
3
05-01-2013
20
3
06-01-2013
40
2

Yes we have some work around, still with rolling but need apply
df.A.rolling(3).apply(lambda x: pd.Series(x).rank(ascending=False)[-1])
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1.0
04-01-2013 3.0
05-01-2013 3.0
06-01-2013 2.0
Name: A, dtype: float64

Panda(Python): add a new column in a data frame which depends on its row value and aggregated value from another data frame

I am new to python and pandas, so my doubt can be silly also.
Problem:
So I have two data frames let's say df1 and df2 where
df1 is like
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
lift_mean_percentage = (mean(treatment1)/mean(treatment2) -1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the data frame.
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
def lift_mean_percentage(df, treatment1, treatment2):
treatment1_data = df[df[group_type_col] == treatment1]
treatment2_data = df[df[group_type_col] == treatment2]
mean1 = treatment1_data['metric'].mean()
mean2 = treatment2_data['metric'].mean()
return (mean1/mean2 -1) * 100
But I am getting this error Can only compare identically-labeled Series objects for line
treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong is there any alternative to this.

For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
treatment1= df['metric'][df['group_type']==T1].mean()
treatment2= df['metric'][df['group_type']==T2].mean()
return (treatment1/treatment2 -1) * 100
runing:
lift_mean_percentage(df2,'Treatment2','Control')
the result:
260.8695652173913

Select value from dataframe based on other dataframe

i try to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas. One for the measurement data and one for the position. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position of the first dataframe with the data from secound dataframe
In Excel I solved the problem by using an array formular but now I have to use Python/Pandas and I cant find a way how to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> way" and not "force <-> time"
Thank you in andvance
==========================================================================
Update:
In the meantime I could almost solve my issue. Now my Data look like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF1)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So i want to check the time(abs_t) from DF1 and search for the corract 'a' in DF2
So somthing like this (pseudo code):
if (DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
DF1['a'] = DF2['a']
I could make two for loops but it looks like the wrong way and is very very slow.
I hope you understand my problem; to provide a running sample is very hard.
In Excel I did like this:

I found a very slow solution but atleast its working :(
df1['a'] = 0
for index, row in df2.iterrows():
start = row['t-start']
end = row ['t-end']
a = row ['a']
df1.loc[(df1['tabs']>start)&(df1['tabs']<end), 'a'] = a

Filter one DataFrame by unique values in another DataFrame

I have 2 Python Dataframes:
The first Dataframe contains all data imported to the DataFrame, which consists of "prodcode", "sentiment", "summaryText", "reviewText",etc. of all initial Review Data.
DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]
which produces:
prodcode summaryText reviewText overall reviewerID ... helpful reviewTime unixReviewTime sentiment textLength
0 B00002243X Work Well - Should Have Bought Longer Ones I needed a set of jumper cables for my new car... 5.0 A3F73SC1LY51OO ... [4, 4] 08 17, 2011 1313539200 2 516
1 B00002243X Okay long cables These long cables work fine for my truck, but ... 4.0 A20S66SKYXULG2 ... [1, 1] 09 4, 2011 1315094400 2 265
2 B00002243X Looks and feels heavy Duty Can't comment much on these since they have no... 5.0 A2I8LFSN2IS5EO ... [0, 0] 07 25, 2013 1374710400 2 1142
3 B00002243X Excellent choice for Jumper Cables!!! I absolutley love Amazon!!! For the price of ... 5.0 A3GT2EWQSO45ZG ... [19, 19] 12 21, 2010 1292889600 2 4739
4 B00002243X Excellent, High Quality Starter Cables I purchased the 12' feet long cable set and th... 5.0 A3ESWJPAVRPWB4 ... [0, 0] 07 4, 2012 1341360000 2 415
The second Dataframe is a grouping of all prodcodes and the ratio of sentiment score / all reviews made for that product. It is the ratio for that review score over all reviews scores made, for that particular product.
df1 = (
DFF.groupby(["prodcode", "sentiment"]).count()
.join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]
df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
df1 = df1.pivot("prodcode", 'sentiment', 'result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')
sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)
which produces the following DF:
sentiment 0 1 2
prodcode
B0024E6QOO 80.0 0.0 20.0
B000GPV2QA 67.0 17.0 17.0
B0067DNSUI 67.0 0.0 33.0
B00192JH4S 62.0 12.0 25.0
B0087FSA0C 60.0 20.0 20.0
B0002KM5L0 60.0 0.0 40.0
B000DZBP60 60.0 0.0 40.0
B000PJCBOE 60.0 0.0 40.0
B0033A5PPO 57.0 29.0 14.0
B003POL69C 57.0 14.0 29.0
B0002Z9L8K 56.0 31.0 12.0
What I am now trying to do filter my first dataframe in two ways. The first, by the results of the second dataframe. By that, I mean I want the first dataframe to be filtered by the prodcode's from the second dataframe where df1.sentiment['0'] > 40. From that list, I want to filter the first dataframe by those rows where 'sentiment' from the first dataframe = 0.
At a high level, I am trying to obtain the prodcode, summaryText and reviewText in the first dataframe for Products that had high ratios in lower sentiment scores, and whose sentiment is 0.

Something like this :
assuming all the data you need is in df1 and no merges are needed.
m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0] # create a list matching your criteria
df.loc[(df['0'] > 40) & (df['sentiment'].isin(m)] # filter according to your conditions

I figured it out:
DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))

Back Filling Dataframe

I have a dataframe with 3 columns. Something like this:
Data Initial_Amount Current
31-01-2018
28-02-2018
31-03-2018
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
I would like to populate the prior rows with the Initial Amount as such:
Data Initial_Amount Current
31-01-2018 100 100
28-02-2018 100 100
31-03-2018 100 100
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
So find the:
First non_empty row with Initial Amount populated
use that to backfill the initial Amounts to the starting date
If it is the first row and current is empty then copy Initial_Amount, else copy prior balance.
Regards,

Pandas fillna with fill method 'bfill' (uses next valid observation to fill gap) should do what you're looking for:
In [13]: df.fillna(method='bfill')
Out[13]:
Data Initial_Amount Current
0 31-01-2018 100.0 100.0
1 28-02-2018 100.0 100.0
2 31-03-2018 100.0 100.0
3 30-04-2018 100.0 100.0
4 31-05-2018 100.0 90.0
5 30-06-2018 100.0 80.0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group By Median, Percentile and Percent of Total - python

Related

Is there a way to do rolling rank in Pandas?

Panda(Python): add a new column in a data frame which depends on its row value and aggregated value from another data frame

Select value from dataframe based on other dataframe

Filter one DataFrame by unique values in another DataFrame

Back Filling Dataframe

Categories

Resources