How to do weighted average on a timeseries groupby bucket - python

My timeseries dataframe looks like below:
ts_ms           a   b   c  flow  latency  duration
1614772770705  10  10   4     1        2         3
1614772770800  10  10   2     1        2         4
1614772770750  10   5   4     1        2         3
I need to create 5-minute buckets, then group by a, b, c such that latency is summed and duration is weight-averaged by flow.
What I have so far is
wm = lambda x: (x * df.loc[x.index, "flow"]).sum() / df.flow.sum()
def agg_func(df):
    df.groupby(pd.Grouper(freq='5Min')).agg(latency_sum=("latency", "sum"), duration_weighted=("duration", wm))
#convert to datetimes
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df.set_index('ts_date', inplace=True)
df1 = df.groupby(["a", "b", "c"]).apply(agg_func)
That does not work; I basically get an empty dataframe as df1.
What am I missing? Please suggest.
EDIT
For clarity, the expected output dataframe should have the columns below, with some values ...
ts_date  a  b  c  latency_sum  duration_weighted
But I get an empty dataframe
df1.to_dict('records')
[]

You also have to return the result from agg_func:
wm = lambda x: (x * df.loc[x.index, "flow"]).sum() / df.flow.sum()
def agg_func(df):
    return df.groupby(pd.Grouper(freq='5Min')).agg(latency_sum=("latency", "sum"), duration_weighted=("duration", wm))
#convert to datetimes
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df.set_index('ts_date', inplace=True)
df1 = df.groupby(["a", "b", "c"]).apply(agg_func)
print(df1)
Output:
                             latency_sum  duration_weighted
a  b  c ts_date
10 5  4 1970-01-01 00:25:00            2           1.000000
   10 2 1970-01-01 00:25:00            2           1.333333
      4 1970-01-01 00:25:00            2           1.000000
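As a side note, the timestamps above land in 1970 because pd.to_datetime treats the raw integers as nanoseconds; if ts_ms is a Unix epoch in milliseconds, pass unit='ms'. Also, the wm lambda divides by the flow sum of the whole frame rather than of each 5-minute bucket; a sketch of a fully per-bucket weighted average (assuming ts_date is the DatetimeIndex and every non-empty bucket has non-zero total flow) could look like this:
import pandas as pd

# Parse the epoch as milliseconds so the index falls in 2021, not 1970.
df['ts_date'] = pd.to_datetime(df['ts_ms'], unit='ms')
df.set_index('ts_date', inplace=True)

def agg_func(g):
    buckets = g.groupby(pd.Grouper(freq='5Min'))
    out = buckets.agg(latency_sum=('latency', 'sum'))
    # Weighted average of duration by flow, computed within each bucket.
    out['duration_weighted'] = (
        buckets.apply(lambda b: (b['duration'] * b['flow']).sum())
        / buckets['flow'].sum()
    )
    return out.dropna(subset=['duration_weighted'])  # drop empty buckets

df1 = df.groupby(['a', 'b', 'c']).apply(agg_func)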


Identify customer segments based on transactions that they have made in a specific period using Python

For customer segmentation purposes, I want to analyse how many transactions each customer made in the prior 10 days and 20 days, based on a given table of transaction records with dates.
In this table, the last 2 columns were added using the following code.
But I'm not satisfied with this code; please suggest improvements.
import pandas as pd
df4 = pd.read_excel(path)

# Since there are two customers, A and B, two separate dataframes are created
df4A = df4[df4['Customer_ID'] == 'A']
df4B = df4[df4['Customer_ID'] == 'B']

from datetime import date
from dateutil.relativedelta import relativedelta

txn_prior_10days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_10days_date = current_date - relativedelta(days=10)
    if df4.iloc[i, 1] == 'A':
        No_of_txn = ((df4A['Transaction_Date'] >= prior_10days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)
    elif df4.iloc[i, 1] == 'B':
        No_of_txn = ((df4B['Transaction_Date'] >= prior_10days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)

txn_prior_20days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_20days_date = current_date - relativedelta(days=20)
    if df4.iloc[i, 1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_20days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)
    elif df4.iloc[i, 1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_20days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)

df4['txn_prior_10days'] = txn_prior_10days
df4['txn_prior_20days'] = txn_prior_20days
df4
Your code would be very difficult to write if you had
e.g. 10 different Customer_IDs.
Fortunately, there is a much shorter solution:
When you read your file, convert Transaction_Date to datetime,
e.g. passing parse_dates=['Transaction_Date'] to read_excel.
Define a function counting how many dates in the group (gr) fall
between tDtl (a Timedelta) before the current date (dd) and 1 day
before it:
def cntPrevTr(dd, gr, tDtl):
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()
It will be applied twice to each member of the current group
by Customer_ID (actually to the Transaction_Date column only),
once with tDtl == 10 days and a second time with tDtl == 20 days.
Define a function computing both columns containing the number of previous
transactions, for the current group of transaction dates:
def priorTx(td):
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})
Generate the result:
df[['txn_prior_10days', 'txn_prior_20days']] = df.groupby('Customer_ID')\
.Transaction_Date.apply(priorTx)
The code above groups df by Customer_ID, takes from the current group
only the Transaction_Date column, applies the priorTx function to it,
and saves the result in the 2 target columns.
The result, for a bit shortened Transaction_ID, is:
Transaction_ID Customer_ID Transaction_Date txn_prior_10days txn_prior_20days
0 912410 A 2019-01-01 0 0
1 912341 A 2019-01-03 1 1
2 312415 A 2019-01-09 2 2
3 432513 A 2019-01-12 2 3
4 357912 A 2019-01-19 2 4
5 912411 B 2019-01-06 0 0
6 912342 B 2019-01-11 1 1
7 312416 B 2019-01-13 2 2
8 432514 B 2019-01-20 2 3
9 357913 B 2019-01-21 3 4
You cannot use a rolling computation here, because: the rolling window
extends forward from the current row, but you want to count previous
transactions; and rolling calculations include the current row, whereas
you want to exclude it.
This is why I came up with the above solution (just 8 lines of code).
Details how my solution works
To see all details, create the test DataFrame the following way:
import io
txt = '''
Transaction_ID Customer_ID Transaction_Date
912410 A 2019-01-01
912341 A 2019-01-03
312415 A 2019-01-09
432513 A 2019-01-12
357912 A 2019-01-19
912411 B 2019-01-06
912342 B 2019-01-11
312416 B 2019-01-13
432514 B 2019-01-20
357913 B 2019-01-21'''
df = pd.read_fwf(io.StringIO(txt), skiprows=1,
                 widths=[15, 12, 16], parse_dates=[2])
Perform the groupby, but for now retrieve only the group with key 'A':
gr = df.groupby('Customer_ID')
grp = gr.get_group('A')
It contains:
Transaction_ID Customer_ID Transaction_Date
0 912410 A 2019-01-01
1 912341 A 2019-01-03
2 312415 A 2019-01-09
3 432513 A 2019-01-12
4 357912 A 2019-01-19
Let's start from the most detailed issue: how cntPrevTr works.
Retrieve one of the dates from grp:
dd = grp.iloc[2,2]
It contains Timestamp('2019-01-09 00:00:00').
To test an example invocation of cntPrevTr for this date, run:
cntPrevTr(dd, grp.Transaction_Date, pd.Timedelta(10, 'D'))
i.e. you want to check how many prior transactions this customer performed
before this date, but not earlier than 10 days back.
The result is 2.
To see how the whole first column is computed, run:
td = grp.Transaction_Date
td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D')))
The result is:
0 0
1 1
2 2
3 2
4 2
Name: Transaction_Date, dtype: int64
The left column is the index and the right one holds the values returned
by the cntPrevTr call for each date.
And the last thing to show is how the result for the whole group
is generated. Run:
priorTx(grp.Transaction_Date)
The result (a DataFrame) is:
tx10 tx20
0 0 0
1 1 1
2 2 2
3 2 3
4 2 4
The same procedure takes place for all the other groups; then all
partial results are concatenated (vertically), and the last step is
to save both columns of the resulting DataFrame in the respective
columns of df.
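For illustration, the groupby/apply step above is equivalent to the following manual loop (a sketch): apply priorTx to each customer's transaction dates, concatenate the partial results, and assign them back by the original row index.
parts = [priorTx(dates) for _, dates in df.groupby('Customer_ID').Transaction_Date]
stacked = pd.concat(parts)  # partial results stacked vertically, original row index kept
df[['txn_prior_10days', 'txn_prior_20days']] = stacked[['tx10', 'tx20']]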

Put an equation to calculate the value for a new column

I want to create a new column in my table by implementing an equation, but there are 2 possible equations for the new column.
(1) frequency = (total x 100) / hour
(2) frequency = (total x 1000000) / km_length
the table is similar to this:
type hour km_length total
A 1 - 1
B - 2 1
The calculation for the "frequency" column would depend on which of the columns hour and km_length has a value.
Then, I expect the table to be like this:
type hour km_length total frequency
A 1 - 1 100
B - 2 1 500000
I have tried using np.nan_to_num before, but it did not produce the expected table.
Is there any way I can do this in Python? Looking forward to any help,
thank you.
We can use np.where for assigning values based on a condition:
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")
df["frequency"] = np.where(
df["km_length"].isna(),
df["total"] * 100 / df["hour"],
df["total"] * 1_000_000 / df["km_length"]
)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
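If more condition/equation pairs were ever added, np.select generalizes the same idea; a sketch with the two existing conditions (and the same to_numeric conversion as above):
import numpy as np

# Each condition picks the equation to use for that row; rows matching
# neither condition fall back to NaN.
conditions = [df["hour"].notna(), df["km_length"].notna()]
choices = [df["total"] * 100 / df["hour"], df["total"] * 1_000_000 / df["km_length"]]
df["frequency"] = np.select(conditions, choices, default=np.nan)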
Make your values numeric, then compute. Because a missing value indicates which equation to use, and because division involving NaN results in NaN, do both calculations and use .fillna to pick the correct resulting value.
df[['hour', 'km_length']] = df[['hour', 'km_length']].apply(pd.to_numeric, errors='coerce')
s1 = df['total'].divide(df['hour']).multiply(100)
s2 = df['total'].divide(df['km_length']).multiply(10**6)
df['frequency'] = s1.fillna(s2)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
You can store the data in a NumPy array.
import numpy as np

# Each row holds [hour, km_length, total, frequency]; np.nan marks a missing
# value and frequency starts at 0.
table = np.array([
    [1.0,    np.nan, 1.0, 0.0],   # type A
    [np.nan, 2.0,    1.0, 0.0],   # type B
])
for row in table:
    if np.isnan(row[1]):                       # km_length missing -> use hour
        row[3] = (row[2] * 100) / row[0]
    else:                                      # hour missing -> use km_length
        row[3] = (row[2] * 1_000_000) / row[1]
print(table)
This prints the desired frequency values (100.0 and 500000.0) in the last column.

Divide columns in a DataFrame by a Series (result is only NaNs?)

I'm trying to do a similar thing to what is posted in this question: Python Pandas - n X m DataFrame multiplied by 1 X m Dataframe
I have an n x m DataFrame of all non-zero float values and a single column of n non-zero float values, and I'm trying to divide each column in the n x m dataframe element-wise by the values in that column.
So I've got:
a b c
1 2 3
4 5 6
7 8 9
and
x
11
12
13
and I'm looking to return:
a b c
1/11 2/11 3/11
4/12 5/12 6/12
7/13 8/13 9/13
I've tried a multiplication operation first, to see if I can make it work, so I tried applying the two solutions given in the answer to the question above.
df_prod = pd.DataFrame({c:df[c]* df_1[c].ix[0] for c in df.columns})
This produces a "KeyError: 0".
And using the other solution:
df.mul(df_1.iloc[0])
This just gives me all NaN, although in the right shape.
The NaNs are caused by misalignment of your indexes. To get around this, you can either divide by NumPy arrays,
# <=0.23
df.values / df2[['x']].values  # or df2.values, assuming there's only 1 column

# 0.24+
df.to_numpy() / df2[['x']].to_numpy()
array([[0.09090909, 0.18181818, 0.27272727],
[0.33333333, 0.41666667, 0.5 ],
[0.53846154, 0.61538462, 0.69230769]])
Or perform an axis-aligned division using .div:
df.div(df2['x'], axis=0)
a b c
0 0.090909 0.181818 0.272727
1 0.333333 0.416667 0.500000
2 0.538462 0.615385 0.692308
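To see the misalignment concretely, here is a small sketch, assuming the frames are built as below (the single-column frame is called df2, as in the code above): multiplying by a Series aligns the Series' index with the DataFrame's columns, nothing matches, and every cell becomes NaN.
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
df2 = pd.DataFrame({'x': [11, 12, 13]})

# df2['x'] has index 0, 1, 2; aligning it against df's columns a, b, c
# matches nothing, so the result is all NaN (with extra columns 0, 1, 2).
print(df.mul(df2['x']))

# Aligning on the row index instead gives the intended element-wise division.
print(df.div(df2['x'], axis=0))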

Multiply many columns pandas

I have a data frame like this, but with many more columns. I would like to multiply each pair of adjacent columns, put their product in a new column beside them called Sub_pro, and at the end have the total sum of all Sub_pro columns in a column called F_Pro, with the precision reduced to 3 decimal places. I don't know how to get the Sub_pro columns. Below is my code.
import pandas as pd
df = pd.read_excel("C:dummy")
df['F_Pro'] = ("Result" * "Attribute").sum(axis=1)
df.round(decimals=3)
print (df)
Input
Id Result Attribute Result1 Attribute1
1 0.5621 0.56 536 0.005642
2 0.5221 0.5677 2.15 93
3 0.024564 5.23 6.489 8
4 11.564256 4.005 0.45556 5.25
5 0.6123 0.4798 0.6667 5.10
Desired output
id Result Attribute Sub_Pro Result1 Attribute1 Sub_pro1 F_Pro
1 0.5621 0.56 0.314776 536 0.005642 3.024112 3.338888
2 0.5221 0.5677 0.29639617 2.15 93 199.95 200.2463962
3 0.024564 5.23 0.12846972 6.489 8 51.912 52.04046972
4 11.564256 4.005 46.31484528 0.45556 5.25 2.39169 48.70653528
5 0.6123 0.4798 0.29378154 0.6667 5.1 3.40017 3.69395154
Because you have several columns with similar names, here is one way using filter. To see how it works on your df, run df.filter(like='Result') and you get the columns whose name contains Result:
Result Result1
0 0.562100 536.00000
1 0.522100 2.15000
2 0.024564 6.48900
3 11.564256 0.45556
4 0.612300 0.66670
You can create an array containing the columns 'Sub_Pro':
import numpy as np
arr_sub_pro = np.round(df.filter(like='Result').values* df.filter(like='Attribute').values,3)
and arr_sub_pro holds the values of the sub_pro columns:
array([[3.1500e-01, 3.0240e+00],
[2.9600e-01, 1.9995e+02],
[1.2800e-01, 5.1912e+01],
[4.6315e+01, 2.3920e+00],
[2.9400e-01, 3.4000e+00]])
Now you need to add them at the right positions in the dataframe; I think a for loop is necessary:
for nb, col in zip(range(arr_sub_pro.shape[1]), df.filter(like='Attribute').columns):
    df.insert(df.columns.get_loc(col) + 1, 'Sub_pro{}'.format(nb), arr_sub_pro[:, nb])
Here I get the location of the column Attribute(nb) and insert column nb of arr_sub_pro at the next position.
To add the column 'F_Pro', you can do:
df.insert(len(df.columns), 'F_Pro', arr_sub_pro.sum(axis=1))
the final df looks like:
Id Result Attribute Sub_pro0 Result1 Attribute1 Sub_pro1 \
0 1 0.562100 0.5600 0.315 536.00000 0.005642 3.024
1 2 0.522100 0.5677 0.296 2.15000 93.000000 199.950
2 3 0.024564 5.2300 0.128 6.48900 8.000000 51.912
3 4 11.564256 4.0050 46.315 0.45556 5.250000 2.392
4 5 0.612300 0.4798 0.294 0.66670 5.100000 3.400
F_Pro
0 3.339
1 200.246
2 52.040
3 48.707
4 3.694
import pandas as pd

src = "/opt/repos/pareto/test/stack/data.csv"
df = pd.read_csv(src)

def multiply(x):
    res = x.copy()
    keys_len = len(x)
    idx = 1                        # skip the Id column at position 0
    while idx + 1 < keys_len:
        left = x.iloc[idx]
        right = x.iloc[idx + 1]
        new_key = "sub_prod_{}".format(idx)
        # Multiply the adjacent pair and round to three decimal places.
        res[new_key] = round(left * right, 3)
        idx = idx + 2              # step to the next (Result, Attribute) pair
    return res

res_df = df.apply(lambda x: multiply(x), axis=1)
This solves the problem, but you then need to reorder the columns; you could also iterate over the keys instead of making a deep copy of the full row (a sketch of that variant follows below). I hope the code helps you.
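A sketch of that "iterate over the keys" variant, assuming (as above) an Id column first followed by alternating Result/Attribute columns; it builds only the new columns plus F_Pro, so the new columns land at the end rather than interleaved:
def sub_products(row):
    out = {}
    cols = row.index[1:]                      # skip the Id column
    for i in range(0, len(cols) - 1, 2):      # adjacent (Result, Attribute) pairs
        out['Sub_pro{}'.format(i // 2)] = round(row[cols[i]] * row[cols[i + 1]], 3)
    out['F_Pro'] = round(sum(out.values()), 3)
    return pd.Series(out)

res_df = df.join(df.apply(sub_products, axis=1))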
Here's one way using NumPy and a dictionary comprehension:
# extract NumPy array for relevant columns
A = df.iloc[:, 1:].values
n = int(A.shape[1] / 2)
# calculate products and feed to pd.DataFrame
prods = pd.DataFrame({'Sub_Pro_' + str(i): np.prod(A[:, 2*i: 2*(i+1)], axis=1)
                      for i in range(n)})
# calculate sum of product rows
prods['F_Pro'] = prods.sum(axis=1)
# join to original dataframe
df = df.join(prods)
print(df)
Id Result Attribute Result1 Attribute1 Sub_Pro_0 Sub_Pro_1 \
0 1 0.562100 0.5600 536.00000 0.005642 0.314776 3.024112
1 2 0.522100 0.5677 2.15000 93.000000 0.296396 199.950000
2 3 0.024564 5.2300 6.48900 8.000000 0.128470 51.912000
3 4 11.564256 4.0050 0.45556 5.250000 46.314845 2.391690
4 5 0.612300 0.4798 0.66670 5.100000 0.293782 3.400170
F_Pro
0 3.338888
1 200.246396
2 52.040470
3 48.706535
4 3.693952
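One small follow-up: the question also asks to reduce the precision to 3 decimal places, which a final round (as attempted in the question's own code) provides:
# Round all numeric columns to 3 decimal places, as requested in the question.
df = df.round(3)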

Use Column and Row Multi-Index values in Pandas Groupby without unstacking

I have a multi-index hierarchy set up as follows:
import numpy as np
import pandas as pd

sectors = ['A', 'B', 'C', 'D']
ports = ['pf', 'bm']
dates = list(range(1, 11)) * 2
wts, pchg = zip(*np.random.randn(20, 2))

df = pd.DataFrame(dict(dates=dates, port=sorted(ports * 10),
                       sector=np.random.choice(sectors, 20), wts=wts,
                       pchg=pchg))
df = df.set_index(['port', 'sector', 'dates'])
df = df.unstack('port')
df = df.fillna(0)
I'd like to group by dates and port, and sum pchg * wts.
I've been through the docs but I'm struggling to figure this out.
Any help greatly appreciated. Thanks
You indeed do not need to unstack to get what you want; the product method does the multiplication. Step by step:
Starting from this dataframe:
In [50]: df.head()
Out[50]:
pchg wts
port bm pf bm pf
sector dates
A 1 0.138996 0.451688 0.763287 -1.863401
3 1.081863 0.000000 0.956807 0.000000
4 0.207065 0.000000 -0.663175 0.000000
5 0.258293 -0.868822 0.109336 -0.784900
6 -1.016700 0.900241 -0.054077 -1.253191
We can first do the pchg * wts part with the product method, multiplying over axis 1, but only for the second level:
In [51]: df.product(axis=1, level=1).head()
Out[51]:
port bm pf
sector dates
A 1 0.106094 -0.841675
3 1.035134 0.000000
4 -0.137320 0.000000
5 0.028241 0.681938
6 0.054980 -1.128174
And then we can just group by dates (no grouping by port is needed anymore) and take the sum:
In [52]: df.product(axis=1, level=1).groupby(level='dates').sum()
Out[52]:
port bm pf
dates
1 0.106094 -0.841675
2 0.024968 1.357746
3 1.035134 1.776464
4 -0.137320 0.392312
5 0.028241 0.681938
6 0.054980 -1.128174
7 0.140183 -0.338828
8 1.296028 -1.526065
9 -0.213989 0.469104
10 0.058369 -0.006564
This gives the same output as
df.stack('port').groupby(level=[1,2]).apply(lambda x: (x['wts']*x["pchg"]).sum()).unstack('port')
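On newer pandas versions, the level= keyword of reductions such as product is deprecated; an equivalent that avoids it (a sketch) is to select the top level of the column MultiIndex and multiply, letting the port level align automatically:
# Same result without product(level=...): df['pchg'] and df['wts'] both have
# columns bm and pf, so the multiplication aligns on port.
weighted = df['pchg'] * df['wts']
result = weighted.groupby(level='dates').sum()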
