I want to calculate the excess amount remaining in an ATM from the given dataset of transactions and replenishments.
I can do it by looping over the data and subtracting the transactions from the current amount, but I need to do this without using a loop.
# R: Replenishment amount
# T: Transaction Amount
'''
R T
100 50
0 30
0 10
200 110
0 30
60 20
'''
import pandas as pd
import numpy as np

data = {'Date': pd.date_range('2011-05-03', '2011-05-08').tolist(),
        'R': [100, 0, 0, 200, 0, 60], 'T': [50, 30, 10, 110, 30, 20]}
df = pd.DataFrame(data)
# Calculate a temporary amount and shift it so future
# transactions are subtracted from it
df['temp'] = (df['R'] - df['T']).shift(1).bfill()
# Boolean indicating whether the ATM was replenished or not
# 1: Replenished, 0: Not Replenished
df['replenished'] = (df['R'] > 0).astype(int)
# If replenished, subtract the transaction amount from the replenishment amount;
# otherwise subtract it from the temp amount
df['replenished'] * df['R'] + (np.logical_not(df['replenished']).astype(int)) * df['temp'] - df['T']
Expected Results:
0 50.0
1 20.0
2 10.0
3 90.0
4 60.0
5 40.0
dtype: float64
Actual Results:
0 50.0
1 20.0
2 -40.0
3 90.0
4 60.0
5 40.0
dtype: float64
First of all, we compute a boolean column to know if it was replenished, as you do.
df['replenished'] = df['R'] > 0
We also compute the increment in money, which will be useful to perform the rest of the operations.
df['increment'] = df['R'] - df['T']
We also create the column which will eventually hold the desired values; I called it reserve. To begin, we take the cumulative sum of the increments, which is the desired value from the first replenishment day until the next one.
df['reserve'] = df['increment'].cumsum()
Now, we are going to create an auxiliary alias of our dataframe, which will be useful to do the operations without losing the original data. Remember that this variable is not a copy; it points to the same data as the original: a change in df_aux will change the original variable df.
df_aux = df
Then we can proceed to the loop that will take care of the problem.
while not df_aux.empty:
    df_aux = df_aux.loc[df_aux.loc[df_aux['replenished']].index[0]:]
    k = df_aux.at[df_aux.index[0], 'reserve']
    l = df_aux.at[df_aux.index[0], 'increment']
    df_aux['reserve'] = df_aux['reserve'] - k + l
    if len(df_aux) > 1:
        df_aux = df_aux.loc[df_aux.index[1]:]
    else:
        break
First, we take the part of the dataframe starting from the next replenishment day. From this day until the next replenishment day, the cumulative sum gives us the desired outcome provided the initial value equals the increment, so we adjust the cumsum so that its first value satisfies this condition.
Then, if this was the last row of the dataframe, our work is done and we get out of the loop. If it wasn't, we drop the replenishment day we just calculated and go on to the next days.
After all these operations, the result (df) is this:
Date R T increment replenished reserve
0 2011-05-03 100 50 50 True 50
1 2011-05-04 0 30 -30 False 20
2 2011-05-05 0 10 -10 False 10
3 2011-05-06 200 110 90 True 90
4 2011-05-07 0 30 -30 False 60
5 2011-05-08 60 20 40 True 40
I'm not experienced with measuring computation time, so I'm not sure whether this solution is faster than looping through all rows.
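That said, a fully vectorized alternative is possible. The sketch below assumes (as the expected output suggests) that the reserve resets to R on each replenishment day: every replenishment starts a new refill period, and within each period the reserve is just the running sum of R - T.

import pandas as pd

data = {'Date': pd.date_range('2011-05-03', '2011-05-08').tolist(),
        'R': [100, 0, 0, 200, 0, 60], 'T': [50, 30, 10, 110, 30, 20]}
df = pd.DataFrame(data)

# Every replenishment day starts a new refill period
period = (df['R'] > 0).cumsum()
# Within each period the reserve is the cumulative sum of R - T
df['reserve'] = (df['R'] - df['T']).groupby(period).cumsum()
# df['reserve'] -> 50, 20, 10, 90, 60, 40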
I'm looking to have a two-level index, of which one level is of type datetime and the other one is int. I'd like to resample the time level to 1min, and bin the int level into intervals of 5.
Currently I've only done the first part, but I've left the second level untouched:
x = w.groupby([pd.Grouper(level='time', freq='1min'), pd.Grouper(level=1)]).sum()
The problem is that it's not good to use bins generated from the entire range of data for pd.cut(), because most of them will be empty. I want to limit the bins to the context of each 5-second interval.
In other words, I want to replace the second argument (pd.Grouper(level=1)) with pd.cut(rows_from_level0, my_bins), where my_bins is an array built from the respective 5-second group in steps of 5 (e.g. for [34,54,29,31] -> [30, 35, 40, 45, 50, 55]).
How my_bins is computed can be seen below:
import numpy as np

def roundTo(num, base=5):
    return base * round(num / base)

arr_min = roundTo(min(arr))
arr_max = roundTo(max(arr))
dif = arr_max - arr_min
my_bins = np.linspace(arr_min, arr_max, dif // 5 + 1)
Basically I'm not sure how to make the second level pd.cut aware of the rows from the first level index in order to produce the bins.
One way to go is to extract the level values, do some math, then groupby on that:
N = 5
df.groupby([pd.Grouper(level='datetime', freq='1min'),
            df.index.get_level_values(level=1) // N * N]
           ).sum()
You would get something similar to this:
data
datetime lvl1
2021-01-01 00:00:00 5 9
15 1
25 4
60 9
2021-01-01 00:01:00 5 8
25 7
85 2
90 6
2021-01-01 00:02:00 0 9
70 8
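To try this out end to end, one can build a small frame with such a two-level index; the index names ('datetime', 'lvl1'), the sample timestamps and the 'data' column below are made up for illustration, not taken from the original data.

import pandas as pd

# Hypothetical two-level index: a datetime level and an int level
idx = pd.MultiIndex.from_arrays(
    [pd.to_datetime(['2021-01-01 00:00:05', '2021-01-01 00:00:17',
                     '2021-01-01 00:00:28', '2021-01-01 00:01:07']),
     [5, 18, 27, 86]],
    names=['datetime', 'lvl1'])
df = pd.DataFrame({'data': [9, 1, 4, 8]}, index=idx)

N = 5
# Resample the datetime level to 1 min and bin the int level to multiples of 5
out = df.groupby([pd.Grouper(level='datetime', freq='1min'),
                  df.index.get_level_values(level=1) // N * N]).sum()
print(out)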
For customer segmentation purposes, I want to analyse how many transactions each customer made in the prior 10 days and 20 days, based on the given table of transaction records with dates.
In this table, the last 2 columns were added using the following code.
I'm not satisfied with this code; please suggest an improvement.
import pandas as pd
df4 = pd.read_excel(path)
# Since there are two customers, A and B, two separate dataframes are created
df4A = df4[df4['Customer_ID'] == 'A']
df4B = df4[df4['Customer_ID'] == 'B']

from datetime import date
from dateutil.relativedelta import relativedelta

txn_prior_10days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_10days_date = current_date - relativedelta(days=10)
    if df4.iloc[i, 1] == 'A':
        No_of_txn = ((df4A['Transaction_Date'] >= prior_10days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)
    elif df4.iloc[i, 1] == 'B':
        No_of_txn = ((df4B['Transaction_Date'] >= prior_10days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)

txn_prior_20days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_20days_date = current_date - relativedelta(days=20)
    if df4.iloc[i, 1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_20days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)
    elif df4.iloc[i, 1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_20days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)

df4['txn_prior_10days'] = txn_prior_10days
df4['txn_prior_20days'] = txn_prior_20days
df4
Your code would be very difficult to write if you had
e.g. 10 different Customer_IDs.
Fortunately, there is a much shorter solution:
When you read your file, convert Transaction_Date to datetime,
e.g. passing parse_dates=['Transaction_Date'] to read_excel.
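For example (path being the same file path variable as in the question):

import pandas as pd

# Read the Excel file, parsing Transaction_Date as datetime up front
df = pd.read_excel(path, parse_dates=['Transaction_Date'])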
Define a function counting how many dates in a group (gr) fall
between tDtl (a Timedelta) before the current date (dd) and
1 day before it:
def cntPrevTr(dd, gr, tDtl):
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()
It will be applied twice to each member of the current group
by Customer_ID (actually to the Transaction_Date column only),
once with tDtl == 10 days and a second time with tDtl == 20 days.
Define a function counting both columns containing the number of previous
transactions, for the current group of transaction dates:
def priorTx(td):
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})
Generate the result:
df[['txn_prior_10days', 'txn_prior_20days']] = df.groupby('Customer_ID')\
.Transaction_Date.apply(priorTx)
The code above:
groups df by Customer_ID,
takes from the current group only Transaction_Date column,
applies priorTx function to it,
saves the result in 2 target columns.
The result, with Transaction_ID shortened a bit, is:
Transaction_ID Customer_ID Transaction_Date txn_prior_10days txn_prior_20days
0 912410 A 2019-01-01 0 0
1 912341 A 2019-01-03 1 1
2 312415 A 2019-01-09 2 2
3 432513 A 2019-01-12 2 3
4 357912 A 2019-01-19 2 4
5 912411 B 2019-01-06 0 0
6 912342 B 2019-01-11 1 1
7 312416 B 2019-01-13 2 2
8 432514 B 2019-01-20 2 3
9 357913 B 2019-01-21 3 4
You cannot use rolling computation, because:
the rolling window extends forward from the current row, but you
want to count previous transactions,
rolling calculations include the current row, whereas
you want to exclude it.
This is why I came up with the above solution (just 8 lines of code).
Details of how my solution works
To see all details, create the test DataFrame the following way:
import io
txt = '''
Transaction_ID Customer_ID Transaction_Date
912410 A 2019-01-01
912341 A 2019-01-03
312415 A 2019-01-09
432513 A 2019-01-12
357912 A 2019-01-19
912411 B 2019-01-06
912342 B 2019-01-11
312416 B 2019-01-13
432514 B 2019-01-20
357913 B 2019-01-21'''
df = pd.read_fwf(io.StringIO(txt), skiprows=1,
                 widths=[15, 12, 16], parse_dates=[2])
Perform groupby, but for now retrieve only group with key 'A':
gr = df.groupby('Customer_ID')
grp = gr.get_group('A')
It contains:
Transaction_ID Customer_ID Transaction_Date
0 912410 A 2019-01-01
1 912341 A 2019-01-03
2 312415 A 2019-01-09
3 432513 A 2019-01-12
4 357912 A 2019-01-19
Let's start with the most detailed issue: how cntPrevTr works.
Retrieve one of dates from grp:
dd = grp.iloc[2,2]
It contains Timestamp('2019-01-09 00:00:00').
To test example invocation of cntPrevTr for this date, run:
cntPrevTr(dd, grp.Transaction_Date, pd.Timedelta(10, 'D'))
i.e. you want to check how many prior transactions this customer performed
before this date, but not earlier than 10 days back.
The result is 2.
To see how the whole first column is computed, run:
td = grp.Transaction_Date
td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D')))
The result is:
0 0
1 1
2 2
3 2
4 2
Name: Transaction_Date, dtype: int64
The left column is the index and the right one holds the values returned
by cntPrevTr for each date.
And the last thing is to show how the result for the whole group
is generated. Run:
priorTx(grp.Transaction_Date)
The result (a DataFrame) is:
tx10 tx20
0 0 0
1 1 1
2 2 2
3 2 3
4 2 4
The same procedure takes place for all other groups, then
all partial results are concatenated (vertically) and the last
step is to save both columns of the whole DataFrame in
respective columns of df.
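As a cross-check, the same two columns can also be assembled by applying priorTx to each customer's dates explicitly and concatenating the pieces. This is just a sketch; it assumes df still has its default 0..n-1 index, as it does after the read_fwf call above.

import pandas as pd

# Apply priorTx per customer and stack the per-group results vertically
counts = pd.concat([priorTx(dates)
                    for _, dates in df.groupby('Customer_ID')['Transaction_Date']])

# counts is indexed by the original row positions, so after sorting it lines
# up row for row with df (which has a default RangeIndex here)
df[['txn_prior_10days', 'txn_prior_20days']] = counts.sort_index().to_numpy()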
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
import pandas as pd

dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding the pears_sales value of 40, backfilling it and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output still works.
For the second set of operations, I need to add 5 rows (new_purchases) to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each
fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop Fruit
column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
       .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
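If you want to see what reformat produces for a single fruit, you can (as a quick, hypothetical check) call it on one group directly, before the final groupby line overwrites df and drops the helper columns:

# Run this before the final groupby assignment, while df still has
# the Fruit / Descr helper columns
print(reformat(df[df['Fruit'] == 'pears']))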
Edit
I'm in doubt whether Description should also be replicated to the new
rows from the "st_dev" row. If you want some other content there, set it
in the reformat function, after wrk2 is created.
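For example, to give the new rows their own label instead of repeating the "..._st_dev" description, a line like the following could be added inside reformat, right after wrk2 is created (the '_new_purchase' suffix is just a hypothetical choice):

# Hypothetical relabelling of the added rows (Fruit is still a column of wrk2)
wrk2['Description'] = wrk2['Fruit'] + '_new_purchase'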
Here is a part of df:
NUMBER MONEY
12345 20
12345 -20
12345 20
12345 20
123456 10
678910 7.6
123457 3
678910 -7.6
I want to drop rows which have the same NUMBER but opposite money.
The ideal outcome would like below:
NUMBER MONEY
12345 20
12345 20
123456 10
123457 3
note: these entries do not pair up one-to-one (I mean the total number of entries for a NUMBER can be odd).
For example, there are four entries with [NUMBER] 12345:
three of them have [MONEY] 20, and one has [MONEY] -20.
I just want to delete the two whose [MONEY] values are opposite, and keep the other two whose money is 20.
Here is a solution using groupby and apply with a custom function to match and delete pairs.
def remove_pairs(x):
    positive = x.loc[x['MONEY'] > 0].index.values
    negative = x.loc[x['MONEY'] < 0].index.values
    for i, j in zip(positive, negative):
        x = x.drop([i, j])
    return x
df['absvalues'] = df['MONEY'].abs()
dd = df.groupby(['NUMBER', 'absvalues']).apply(remove_pairs)
dd.reset_index(drop=True, inplace=True)
dd.drop('absvalues', axis=1, inplace=True)
The 'absvalues' column with the absolute values of 'MONEY' is added so that groupby can use both NUMBER and the absolute amount as keys; the custom function then drops rows in pairs, selecting positive and negative numbers.
The last two lines just do some cleaning. Using your sample dataframe, the final result dd is:
NUMBER MONEY
0 12345 20.0
1 12345 20.0
2 123456 10.0
3 123457 3.0
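If you want to reproduce this end to end, the sample frame can be rebuilt from the table in the question (MONEY comes out as float because of the 7.6 entries); running the snippet above on it then gives the dd shown:

import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'NUMBER': [12345, 12345, 12345, 12345, 123456, 678910, 123457, 678910],
    'MONEY': [20, -20, 20, 20, 10, 7.6, 3, -7.6]})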
Recently I asked how one could count the number of registers by the interval as answered in Count number of registers in interval.
The solution works great, but I had to adapt it to also take into account some localization key.
I did it through the following code:
import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open in the same GRA at this moment (hour)?
    """
    from datetime import date
    assert np.issubdtype(df[time_key], np.datetime64)
    output = pd.DataFrame()
    grouped = df.groupby(location_key)
    for name, group in grouped:
        # initialize times: registers open as 1, close as -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1
        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)
        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1
        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times)//2]
        output = output.append(all_times, ignore_index=True)
    return output
Output:
time loc1 loc2
0 2013-01-01 12:56:00 1 "a"
1 2013-01-01 12:00:12 1 "b"
2 2013-01-01 10:34:28 2 "c"
3 2013-01-01 09:34:54 2 "c"
4 2013-01-01 08:34:55 3 "d"
5 2013-01-01 08:34:55 5 "d"
6 2013-01-01 16:35:19 4 "e"
7 2013-01-01 16:35:30 4 "e"
time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')
This works great for small data, but for longer data (I am using it with a file of 1 million rows) it takes "forever" to run. I wonder if I could optimize this computation somehow.
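For reference, a frame matching the table above can be rebuilt like this and fed to the function. This is only a sketch: it assumes a pandas version in which DataFrame.append still exists, since time_features relies on it.

import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'time': pd.to_datetime(['2013-01-01 12:56:00', '2013-01-01 12:00:12',
                            '2013-01-01 10:34:28', '2013-01-01 09:34:54',
                            '2013-01-01 08:34:55', '2013-01-01 08:34:55',
                            '2013-01-01 16:35:19', '2013-01-01 16:35:30']),
    'loc1': [1, 1, 2, 2, 3, 5, 4, 4],
    'loc2': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'e']})

out = time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')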