Python: cohort analysis - calculating ARPU

I have a dataframe with one column, revenue_sum:
revenue_sum
10000.0
12324.0
15534.0
26435.0
45623.0
56736.0
56353.0
And I want to write a function that creates all the new columns at once, each showing a forward-looking sum of revenue_sum.
For example, the first row of 'revenue_1' should show the sum of the first two floats in revenue_sum;
the second row of 'revenue_1' should show the sum of the 2nd and 3rd floats in revenue_sum.
The first row of 'revenue_2' should show the sum of the first 3 floats in revenue_sum:
revenue_sum revenue_1 revenue_2
10000.0 22324.0 37858.0
12324.0 27858.0 54293.0
15534.0 41969.0 87592.0
26435.0 72058.0 128794.0
45623.0 102359.0 158712.0
56736.0 113089.0 NaN
56353.0 NaN NaN
Here is my code:
```python
df_revenue_sum1 = df_revenue_sum1.iloc[::-1]
len_sum1 = len(df_revenue_sum1) + 1

def func(df_revenue_sum1):
    for i in range(1, len_sum1):
        df_revenue_sum1['revenue_' + 'i'] = df_revenue_sum1['revenue_sum'].rolling(i + 1).sum()
    return df_revenue_sum1

df_revenue_sum1 = df_revenue_sum1.applymap(func)
```
And it shows the error:
"'float' object is not subscriptable", 'occurred at index revenue_sum'

I think there might be an easier way to do this without a for loop. The pandas method rolling (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html) might do what you need. It sums along a sliding window controlled by the window and min_periods parameters: window is the maximum number of values summed at a time, and min_periods is the minimum number of values required before a sum is produced (fewer than that yields NaN). Applying this works as follows:
import pandas as pd
# The dataframe provided
d = {
'revenue_sum': [
10000.0,
12324.0,
15534.0,
26435.0,
45623.0,
56736.0,
56353.0
]
}
# Reverse the dataframe because rolling only looks backwards and
# we want to make a rolling window forward
d1 = pd.DataFrame(data=d)
df = d1[::-1].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
# apply rolling summing 2 at a time
df['revenue_1'] = df['revenue_sum'].rolling(min_periods=2, window=2).sum()
# apply rolling window 3 at a time
df['revenue_2'] = df['revenue_sum'].rolling(min_periods=3, window=3).sum()
print(df[::-1])
This gave me the following dataframe:
revenue_sum revenue_1 revenue_2
0 10000.0 22324.0 37858.0
1 12324.0 27858.0 54293.0
2 15534.0 41969.0 87592.0
3 26435.0 72058.0 128794.0
4 45623.0 102359.0 158712.0
5 56736.0 113089.0 NaN
6 56353.0 NaN NaN
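Since the question asks for all of the new columns at once, the same idea extends with a short loop over window sizes. A minimal sketch, reusing the d1 frame from above (the column-naming scheme follows the question):
# Build revenue_1, revenue_2, ... in one pass; window i+1 sums i+1 consecutive values.
df_all = d1[::-1].copy()
for i in range(1, len(df_all)):
    df_all['revenue_' + str(i)] = df_all['revenue_sum'].rolling(min_periods=i + 1, window=i + 1).sum()
print(df_all[::-1])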

Related

How can I calculate the Average True Range in pandas

How can I calculate the Average True Range in a dataframe?
I have tried using np.where() and it is not working.
I have all these values below:
Current High - Current Low
abs(Current High - Previous Close)
abs(Current Low - Previous Close)
but I don't know how to put the highest of the three values into the pandas dataframe.
It looks like you might be trying to do the following:
import pandas as pd
from numpy.random import rand
# Note: columns must be a list, not a set; a set gives an arbitrary column order.
df = pd.DataFrame(rand(10, 5), columns=['High-Low', 'High-close', 'Low-close', 'A', 'B'])
cols = ['High-Low', 'High-close', 'Low-close']
df['true_range'] = df[cols].max(axis=1)
print(df)
The output will look like this (the values are random, so yours will differ):
   High-Low  High-close  Low-close         A         B  true_range
0  0.916121    0.605287   0.026572  0.672000  0.082619    0.916121
1  0.622589    0.262275   0.944646  0.905139  0.638486    0.944646
2  0.611374    0.614956   0.756191  0.828205  0.829803    0.756191
3  0.810638    0.283825   0.501693  0.069532  0.504800    0.810638
4  0.984463    0.518056   0.900823  0.905273  0.434061    0.984463
5  0.377742    0.819448   0.480266  0.383831  0.018676    0.819448
6  0.473753    0.396969   0.652077  0.305507  0.730400    0.652077
7  0.427047    0.719194   0.733135  0.542852  0.526076    0.733135
8  0.911629    0.327233   0.633997  0.020811  0.101848    0.911629
9  0.244624    0.678280   0.893365  0.354696  0.278941    0.893365
If this isn't what you had in mind, it would be helpful to clarify your question by providing a small example where you clearly identify the columns and the index in your DataFrame and what you mean by "true range".
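For reference, if the goal is the classic Average True Range computed from High/Low/Close columns, a minimal sketch might look like the following (the column names, sample data, and the ATR window are assumptions, since the question does not show the actual frame):
import pandas as pd

# Hypothetical OHLC data; the column names 'High', 'Low', 'Close' are assumed.
ohlc = pd.DataFrame({'High':  [10.2, 10.8, 11.1, 10.9, 11.5],
                     'Low':   [ 9.8, 10.1, 10.6, 10.4, 10.9],
                     'Close': [10.0, 10.5, 11.0, 10.7, 11.3]})

prev_close = ohlc['Close'].shift(1)
# True range: the largest of the three candidate ranges, row by row.
tr = pd.concat([ohlc['High'] - ohlc['Low'],
                (ohlc['High'] - prev_close).abs(),
                (ohlc['Low'] - prev_close).abs()], axis=1).max(axis=1)
# ATR is usually a 14-period rolling mean; 3 is used here to fit the tiny sample.
ohlc['ATR'] = tr.rolling(window=3, min_periods=1).mean()
print(ohlc)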

pandas dataframe and external list interaction

I have a pandas dataframe df which looks like this
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
calculate a new vector, which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate, for each row, the sum of the row values multiplied by the psik vector (just like Excel's SUMPRODUCT function), giving psig
invert the log for each psig value, i.e. dg = 2**psig (this is what the expected dg column below works out to)
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows, like this:
psig, dg = [], []
for r in rows:
    psig_i = sum(psik[i] * ri for i, ri in enumerate(r))
    psig.append(psig_i)
    dg.append(np.exp2(psig_i))  # 2**psig_i, matching the expected dg above
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed for how simple it is
df['psig']=df.mul(psik).sum(axis=1)
df['dg'] = df['psig'].apply(np.exp2)  # 2**psig gives the expected dg column
EDIT2: Now my df has more entries, so I have to filter it with a regex to find only the columns whose name starts with "betasub".
I have my array psik and a new column psig in the df. I would like to calculate for each row (i.e. each value of psig):
sum(((psik-psig)**2)*betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i-psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig'+str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2,columns=psikmpsigname)
siggAL = np.power(2,(np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]',axis=1).values*dfPsimPsig2.values).sum(axis=1),0.5)))
df['siggAL'] = siggAL
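One possibly simpler route is plain numpy broadcasting, which avoids the intermediate frame; a sketch under the assumption that psik has one entry per betasub column:
import numpy as np

betas = df.filter(regex=r'^betasub[0-9]', axis=1).to_numpy()               # (n_rows, n_cols)
diff2 = (np.asarray(psik)[None, :] - df['psig'].to_numpy()[:, None]) ** 2  # (n_rows, n_cols)
df['siggAL'] = np.exp2(np.sqrt((betas * diff2).sum(axis=1)))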

How can I get the normalized matrix out of this function?

I am given a dataset called stocks_df. Each column has stock prices for a different stock on each day. I am trying to normalize it and return it as a matrix, so each column will hold the normalized prices of one stock for each day.
I wrote up this function:
def normalized_prices(stocks_df):
    normalized = np.zeros((stocks_df.shape[0], len(stocks_df.columns[1:])))
    for i in range(1, len(stocks_df.columns[1:]) + 1):
        for j in range(0, stocks_df.shape[0] + 1):
            normalized[i, j] = stocks_df[i][j] / stocks_df[0][i]
    return normalized
And then tried to call the function-
normalized_prices(stocks_df)
But I'm getting an error.
What can be done to fix this?
From your code, it looks like you want to divide everything by the first column, so you can simply do:
import numpy as np
import pandas as pd
np.random.seed(123)
stocks_df = pd.DataFrame(np.random.uniform(0,1,(20,10)))
stocks_df.div(stocks_df[0],axis=0)
0 1 2 3 4 5 6 7 8 9
0 1.0 0.410843 0.325716 0.791585 1.033023 0.607502 1.408195 0.983288 0.690529 0.563008
1 1.0 2.124407 1.277973 0.173898 1.159877 2.150474 0.531770 0.511256 1.548909 1.549713
2 1.0 1.338951 1.141952 0.963150 1.138780 0.509077 0.570284 0.359809 0.462979 0.994601
3 1.0 4.708772 4.677955 5.360028 4.623317 3.390277 4.628973 9.699688 10.250916 5.448532
4 1.0 0.185300 0.508509 0.664836 1.388421 0.401401 0.774152 1.579542 0.832571 0.982277
This gives you every column divided by the first. Now you just need to subset this output:
stocks_df.div(stocks_df[0],axis=0).iloc[:,1:]
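And since the goal was a matrix rather than a DataFrame, the same result can be handed back as a plain numpy array:
normalized = stocks_df.div(stocks_df[0], axis=0).iloc[:, 1:].to_numpy()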

Rolling sum and mean in a dataframe in python

I have this input df
import pandas as pd
df = pd.DataFrame([[0,'B','A',1,0], [1,'B','C',0,0], [2,'A','B',3,2],[3,'A','B',5,2],[4,'A','C',2,1],[5,'B','A',0,1],[6,'C','B',5,5]], columns=['events','Runner 1','Runner 2','dist_R1','dist_R2'])
print(df)
and I'd like to add 4 more rolling calculated columns, as below:
import pandas as pd
df = pd.DataFrame([[0,'B','A',1,0,0,0,0,0], [1,'B','C',0,0,1,0,1,0], [2,'A','B',3,2,0,1,0,0.5],[3,'A','B',5,2,3,3,2,1],[4,'A','C',2,1,8,0,2.67,0],[5,'B','A',0,1,5,10,1.25,2.5],[6,'C','B',5,5,1,5,0.5,1]], columns=['events','Runner 1','Runner 2','dist_R1','dist_R2','sum_dist_last_2_by_R1','sum dist last 2 by R2','mean dist last 2 by R1','mean dist last 2 by R2'])
print(df)
(Sorry, I'm still learning how to format a dataframe on StackOverflow.)
I want to calculate the last 4 columns.
In detail, at the start of event n I need to know the sum and the mean of the distances that Runner 1 and Runner 2 completed during the last two events each of them joined, among events 0 to n-1.
I think this is quite challenging.
Any help is welcome.
Thanks in advance,
M
You wrote "rolling", but as a matter of fact it is a "very special type"
of rolling calculation (including only the rows involving the runners from the
current row), so you cannot use "pandasonic" rolling functions.
Instead you should compute the result another way.
Start with some preparatory computation:
Generate 2 auxiliary DataFrames - results for runner 1 and runner 2:
wrk1 = df[['events', 'Runner 1', 'dist_R1']]
wrk1.columns = ['events', 'Runner', 'dist']
wrk2 = df[['events', 'Runner 2', 'dist_R2']]
wrk2.columns = ['events', 'Runner', 'dist']
Concatenate them, getting wrk DataFrame and delete 2 previous DataFrames:
wrk = pd.concat([wrk1, wrk2]).sort_values('events')
del wrk1, wrk2
Then define the 2 following functions:
Get statistics (sum and mean) for the given runner (rnr),
from the 2 last events before the given event (ev):
def getStat(rnr, ev):
    res = wrk.query('Runner == @rnr and events < @ev').dist.iloc[-2:]
    return res.sum(), res.mean()
Get additional columns for the current row:
def getAddCols(row):
    td_r1, md_r1 = getStat(row['Runner 1'], row.events)
    td_r2, md_r2 = getStat(row['Runner 2'], row.events)
    return pd.Series([td_r1, td_r2, md_r1, md_r2],
                     index=['tot dist_R1', 'tot dist_R2', 'mean dist_R1', 'mean dist_R2'])
And to get the result, run:
df.join(df.apply(getAddCols, axis=1).fillna(0))\
.astype({'tot dist_R1': int, 'tot dist_R2': int})
Note that a Series returned by getAddCols contains some float values,
so all 4 new columns are coerced to float.
To convert both total columns back to int, the last step (astype)
is needed.
The detailed results differ a bit from what you wrote in your post,
but I assume your hand computation slipped in some cases.

combine pandas apply results as multiple columns in a single dataframe

Summary
Suppose that you apply a function to a groupby object, so that apply on every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, but with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the event for every note, and the sampling is done at times as given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is a number of dataframes for each note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: Sorry, I didn't take into account below that you rescale your time column, and I can't present a whole solution now because I have to leave. But I think you could do the rescaling by using pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and then apply the code below to the merged dataframe. I hope this is what you wanted.
import pandas as pd
import io
sio= io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df = pd.read_csv(sio, sep='\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row in each time-note group via agg({'event': 'first'}), then unstack the note index level so that the note values become columns, and finally fill every cell for which no data point was found with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is not unambiguous for a combination of time/note (if there are several rows for the same time/note where some have on and some have off) and you prefer one of these values (say, if there is one on, then no matter how many offs there are, you want an on, etc.). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
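A sketch of the merge_asof rescaling mentioned above, assuming event_df and t_df as given in the question (direction='forward' snaps each event time to the next sampling point, mirroring the t_arr >= time lookup):
import numpy as np
import pandas as pd

# Reconstruction of the question's frames (an assumption about their exact shape).
event_df = pd.DataFrame({'event': ['on', 'on', 'off'],
                         'note':  ['C', 'D', 'C'],
                         'time':  [0.5, 0.75, 1.0]})
t_df = pd.DataFrame({'t': np.arange(0, 10, 0.5)})

# Both frames must be sorted on the join keys for merge_asof.
merged = pd.merge_asof(event_df.sort_values('time'), t_df,
                       left_on='time', right_on='t', direction='forward')
out = merged.groupby(['t', 'note'])['event'].first().unstack(fill_value='off')
print(out)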
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    # print(group_with_t)  # unnecessary!
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
event
note C D
t
0 off off
0.5 on off
1.0 off on
....
