Django weighted percentage - python

I'm trying to calculate a weighted percentage in a Django query.
This is an example of what my data looks like:
id start_date agency_id area_id housetype_id no_of_changed price_change_percentage total
6716 2017-08-26 11 1 1 16 -0.09 35
6717 2017-08-26 11 1 3 44 -0.11 73
6718 2017-08-26 11 1 4 7 -0.1 12
6719 2017-08-26 11 1 5 0 0 4
6720 2017-08-26 11 1 6 0 0 1
6721 2017-08-26 21 1 1 0 0 1
6722 2017-08-26 34 1 1 0 0 1
6723 2017-08-26 35 1 1 0 0 1
6724 2017-08-26 38 1 1 0 0 1
and this is my current code:
from django.db.models import F, FloatField, ExpressionWrapper
from app.models import PriceChange

def weighted_percentage(area_id, date_range, agency_id, housetype):
    data = PriceChange.objects.filter(area_id=area_id,
                                      start_date__range=date_range,
                                      agency_id=agency_id,
                                      )
    if housetype:
        data = data.filter(housetype=housetype) \
            .values('start_date') \
            .annotate(price_change_total=ExpressionWrapper((F('price_change_percentage') * F('no_of_changed')) / F('total'), output_field=FloatField())) \
            .order_by('start_date')
    else:
        # what to do?
        pass
    x = [x['start_date'] for x in data]
    y = [y['price_change_total'] for y in data]
    return x, y
I figured out how to do the calculation when housetype is defined and I only need data from one row. I can't figure out how to do it when I need to calculate across multiple rows with the same start_date. I don't want a value for each row, but one for each start_date.
As an example (two rows with same start_date, area_id, agency_id but different housetype_ids):
no_of_changed price_change_percentage total
16 -0.09 35
44 -0.11 73
The calculation is in pseudocode:
((no_of_changed[0] * price_change_percentage[0]) + (no_of_changed[1] * price_change_percentage[1])) / (total[0] + total[1]) = price_change_total
((16 * -0.09) + (44 * -0.11)) / (35 + 73) = -0.0581481
I'm using Django 1.11 and Python 3.6.

You need to wrap the expression in a Sum expression.
Add the following import:
from django.db.models import Sum
Then use the following query for the else branch:
else:
    data = data.values('start_date') \
        .annotate(
            price_change_total=ExpressionWrapper(
                Sum(F('price_change_percentage') * F('no_of_changed')) / Sum(F('total')),
                output_field=FloatField()
            )
        ) \
        .order_by('start_date')
What is happening here is that when you use an aggregate expression such as Sum inside an annotate() call, Django translates it into a GROUP BY in the generated SQL. All columns listed in the preceding values() call are used as the grouping key, so you get one aggregated row per start_date.
See this blog post for further explanation and a breakdown of the resulting SQL query.
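For completeness, here is a minimal sketch of how the whole function could look with the aggregate applied in both branches (it assumes the PriceChange model and fields from the question, and is only one possible way to arrange it; the aggregate also works when the housetype filter leaves a single row per start_date, since Sum over one row simply returns that row's values):
import pandas  # not needed; see Django imports below
from django.db.models import ExpressionWrapper, F, FloatField, Sum
from app.models import PriceChange

def weighted_percentage(area_id, date_range, agency_id, housetype=None):
    data = PriceChange.objects.filter(
        area_id=area_id,
        start_date__range=date_range,
        agency_id=agency_id,
    )
    if housetype:
        data = data.filter(housetype=housetype)
    # One aggregated row per start_date: summed weighted changes divided by summed totals
    data = data.values('start_date').annotate(
        price_change_total=ExpressionWrapper(
            Sum(F('price_change_percentage') * F('no_of_changed')) / Sum(F('total')),
            output_field=FloatField(),
        )
    ).order_by('start_date')
    x = [row['start_date'] for row in data]
    y = [row['price_change_total'] for row in data]
    return x, y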

Related

Calculating Time Weighted Rate of Return in Python

I'm trying to calculate daily returns using the time weighted rate of return formula:
(Ending Value - (Beginning Value + Net Additions)) / (Beginning Value + Net Additions)
My DF looks like:
Account # Date Balance Net Additions
1 9/1/2022 100 0
1 9/2/2022 115 10
1 9/3/2022 117 0
2 9/1/2022 50 0
2 9/2/2022 52 0
2 9/3/2022 40 -15
It should look like:
Account # Date Balance Net Additions Daily TWRR
1 9/1/2022 100 0
1 9/2/2022 115 10 0.04545
1 9/3/2022 117 0 0.01739
2 9/1/2022 50 0
2 9/2/2022 52 0 0.04
2 9/3/2022 40 -15 0.08108
After calculating the daily returns for each account, I want to link all the returns throughout the month to get the monthly return:
((1 + return) * (1 + return)) - 1
The final result should look like:
Account # Monthly Return
1 0.063636
2 0.12432
Through research (and trial and error), I was able to get the output I am looking for but as a new python user, I'm sure there is an easier/better way to accomplish this.
DF["Numerator"] = DF.groupby("Account #")[Balance].diff() - DF["Net Additions"]
DF["Denominator"] = ((DF["Numerator"] + DF["Net Additions"] - DF["Balance"]) * -1) + DF["Net Additions"]
DF["Daily Returns"] = (DF["Numerator"] / DF["Denominator"]) + 1
DF = DF.groupby("Account #")["Daily Returns"].prod() - 1
Any help is appreciated!
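For reference, a possible more compact sketch of the same calculation (assuming the column names shown above; the geometric linking step is unchanged):
import pandas as pd

df = pd.DataFrame({
    "Account #": [1, 1, 1, 2, 2, 2],
    "Date": ["9/1/2022", "9/2/2022", "9/3/2022", "9/1/2022", "9/2/2022", "9/3/2022"],
    "Balance": [100, 115, 117, 50, 52, 40],
    "Net Additions": [0, 10, 0, 0, 0, -15],
})

# Beginning value = previous day's balance within each account
beginning = df.groupby("Account #")["Balance"].shift()
denominator = beginning + df["Net Additions"]
df["Daily TWRR"] = (df["Balance"] - denominator) / denominator

# Link the daily returns geometrically per account: prod(1 + r) - 1
monthly = df.groupby("Account #")["Daily TWRR"].apply(lambda r: (1 + r.dropna()).prod() - 1)
print(monthly)  # account 1 -> ~0.063636, account 2 -> ~0.12432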

Calculating current, min, max, mean monthly growth from pandas dataframe

I have a dataset similar to the one below:
product_ID month amount_sold
1 1 23
1 2 34
1 3 85
2 1 47
2 2 28
2 3 9
3 1 73
3 2 84
3 3 12
I want the output to be like this:
For example, for product 1:
-avg_monthly_growth is calculated by ((85-34)/34*100 + (34-23)/23*100)/2 = 98.91%
-lowest_monthly_growth is (34-23)/23*100 = 47.83%
-highest_monthly_growth is (85-34)/34*100 = 150%
-current_monthly_growth is the growth between the latest two months (in this case, the growth from month 2 to month 3, as the months range from 1 to 3 for each product)
product_ID avg_monthly_growth lowest_monthly_growth highest_monthly_growth current_monthly_growth
1 98.91% 47.83% 150% 150%
2 ... ... ... ...
3 ... ... ... ...
I've tried df.loc[df.groupby('product_ID')['amount_sold'].idxmax(), :].reset_index() which gets me the max (and similarly the min), but I'm not too sure how to get the percentage growths.
You can use a pivot_table with pct_change() on axis=1, then create a dictionary with the desired series and build a df:
m=df.pivot_table(index='product_ID',columns='month',values='amount_sold').pct_change(axis=1)
d={'avg_monthly_growth':m.mean(axis=1)*100,'lowest_monthly_growth':m.min(1)*100,
'highest_monthly_growth':m.max(1)*100,'current_monthly_growth':m.iloc[:,-1]*100}
final=pd.DataFrame(d)
print(final)
avg_monthly_growth lowest_monthly_growth highest_monthly_growth \
product_ID
1 98.913043 47.826087 150.000000
2 -54.141337 -67.857143 -40.425532
3 -35.322896 -85.714286 15.068493
current_monthly_growth
product_ID
1 150.000000
2 -67.857143
3 -85.714286
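A possible equivalent sketch without the pivot, using a plain groupby (this assumes `import pandas as pd` and the df from the question):
# Month-over-month growth per product, as a percentage
s = df.sort_values(['product_ID', 'month']).set_index('product_ID')['amount_sold']
growth = s.groupby(level=0).pct_change() * 100
final = pd.DataFrame({
    'avg_monthly_growth': growth.groupby(level=0).mean(),
    'lowest_monthly_growth': growth.groupby(level=0).min(),
    'highest_monthly_growth': growth.groupby(level=0).max(),
    'current_monthly_growth': growth.groupby(level=0).last(),
})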

Pandas: duplicating dataframe entries while column higher or equal to 0

I have a dataframe containing clinical readings of hospital patients; for example, a similar dataframe could look like this:
heartrate pid time
0 67 151 0.0
1 75 151 1.2
2 78 151 2.5
3 99 186 0.0
In reality there are many more columns, but I will just keep those 3 to make the example more concise.
I would like to "expand" the dataset. In short, I would like to be able to give an argument n_times_back and another argument interval.
For each iteration i, which corresponds to for i in range(n_times_back + 1), we do the following:
- Create a new, unique pid [OLD ID | i] (as long as the new pid is unique for each duplicated entry, the exact name isn't really important to me, so feel free to change this if it makes it easier).
- For every patient (pid), remove the rows whose time column is greater than the final time of that patient - i * interval. For example, if i * interval = 2.0 and the times associated with one pid are [0, 0.5, 1.5, 2.8], the new times will be [0, 0.5], as final time - 2.0 = 0.8.
- Iterate.
Since I realize that explaining this textually is a bit messy, here is an example.
With the dataset above, if we let n_times_back = 1 and interval=1 then we get
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 99 18600 0.0
For n_times_back = 2, the result would be
heartrate pid time
0 67 15100 0.0
1 75 15100 1.2
2 78 15100 2.5
3 67 15101 0.0
4 75 15101 1.2
5 67 15102 0.0
6 99 18600 0.0
n_times_back = 3 and above would lead to the same result as n_times_back = 2, as no patient data goes below that point in time
I have written code for this.
def expand_df(df, n_times_back, interval):
    for curr_patient in df['pid'].unique():
        patient_data = df[df['pid'] == curr_patient]
        final_time = patient_data['time'].max()
        for i in range(n_times_back + 1):
            new_data = patient_data[patient_data['time'] <= final_time - i * interval]
            new_data['pid'] = patient_data['pid'].astype(str) + str(i).zfill(2)
            new_data['pid'] = new_data['pid'].astype(int)
            # check if there is any time index left, if not don't add useless entry to dataframe
            if new_data['time'].count() > 0:
                df = df.append(new_data)
        df = df[df['pid'] != curr_patient]  # remove original patient data, now duplicate
    df.reset_index(inplace=True, drop=True)
    return df
As far as functionality goes, this code works as intended. However, it is very slow. I am working with a dataframe of 30'000 patients and the code has been running for over 2 hours now.
Is there a way to use pandas operations to speed this up? I have looked around but so far I haven't managed to reproduce this functionality with high level pandas functions
I ended up using a groupby function and breaking when no more times were available, as well as creating an "iteration" column that I merge with the "pid" column at the end.
def expand_df(group, n_times, interval):
    df = pd.DataFrame()
    final_time = group['time'].max()
    for i in range(n_times + 1):
        new_data = group[group['time'] <= final_time - i * interval]
        new_data['iteration'] = str(i).zfill(2)
        # check if there is any time index left, if not don't add useless entry to dataframe
        if new_data['time'].count() > 0:
            df = df.append(new_data)
        else:
            break
    return df

new_df = df.groupby('pid').apply(lambda x: expand_df(x, n_times_back, interval))
new_df = new_df.reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']
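Note that DataFrame.append was removed in pandas 2.0; a roughly equivalent sketch of the same idea (assuming the same df, n_times_back and interval as above) collects the slices in a list and concatenates once:
import pandas as pd

def expand_group(group, n_times, interval):
    final_time = group['time'].max()
    frames = []
    for i in range(n_times + 1):
        # keep only rows up to final_time - i * interval; stop once nothing is left
        new_data = group[group['time'] <= final_time - i * interval].copy()
        if new_data.empty:
            break
        new_data['iteration'] = str(i).zfill(2)
        frames.append(new_data)
    return pd.concat(frames)

new_df = df.groupby('pid', group_keys=False) \
    .apply(lambda g: expand_group(g, n_times_back, interval)) \
    .reset_index(drop=True)
new_df['pid'] = new_df['pid'].map(str) + new_df['iteration']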

difference between two rows pandas

I have a dataframe such as:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract each row from another for the 'amount' column:
the output should be like:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used the code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained the output as:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0
IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
Because if you want to subtract, your original solution should work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0
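As a side note, adding each row to the previous row is the same as a rolling sum over a window of 2, so an equivalent one-liner sketch (assuming the same sorted df) is:
df['amount_diff'] = df['amount'].rolling(2).sum().fillna(0)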

after groupby and sum, how to get the max value rows in `pandas.DataFrame`?

Here is the df (I updated it with real data):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>and the relationships between the columns are:
>OLTPSOURCE--RNR:1>n
>RNR--RQDRECORD:1>N
and my requirements are:
- sum the RQDRECORD by RNR;
- get the max sum result for every OLTPSOURCE;
- finally, draw a graph showing, over time, the largest summed result of each OLTPSOURCE.
Thanks everyone; to explain my problem further:
- if OLTPSOURCE:RNR:RQDRECORD = 1:1:1, just sum RQDRECORD and return OLTPSOURCE and the sum result;
- if OLTPSOURCE:RNR:RQDRECORD = 1:1:N, just sum RQDRECORD and return OLTPSOURCE and the sum result;
- if OLTPSOURCE:RNR:RQDRECORD = 1:N:(N or 1), sum RQDRECORD by RNR group first, then find the max result per OLTPSOURCE and return every OLTPSOURCE with its max RQDRECORD.
So for the above sample data, I eventually want the result as follows
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments and got the results below. Because the amount of data is too big, I added a filter RQDRECORD > 100000; in fact I would like to sort and then take the top 100, but without success.
Result: http://i.imgur.com/FgfZaDY.jpg
You can take the groupby result and call max on it, passing level=0 (or level='clsa' if you prefer); this returns the max count for that level. However, this loses the 'clsb' column, so merge the result back into your grouped data after calling reset_index on the grouped object. You can then reorder the resulting df columns using fancy indexing:
In [149]:
gp = df.groupby(['clsa','clsb']).sum()
result = gp.max(level=0).reset_index().merge(gp.reset_index())
result = result.ix[:,['clsa','clsb','count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
import pandas as pd
import matplotlib.pyplot as plt

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False)['RQDRECORD'].sum()
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']],
                 df_gb.groupby(['OLTPSOURCE'], as_index=False).first(),
                 on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print(final)
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2
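For the "max summed RQDRECORD per OLTPSOURCE" step itself, a possible alternative sketch (assuming the df from the question) computes the per-RNR sums and then picks the largest one in each OLTPSOURCE with idxmax:
# per-RNR sums, then the row with the largest sum within each OLTPSOURCE
sums = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False)['RQDRECORD'].sum()
top = sums.loc[sums.groupby('OLTPSOURCE')['RQDRECORD'].idxmax()]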
