I have a list of transactions for a business.
Example dataframe:
userid date amt start_of_day_balance
123 2017-01-04 10 100.0
123 2017-01-05 20 NaN
123 2017-01-02 30 NaN
123 2017-01-04 40 100.0
The start of day balance is not always retrieved (in that case we receive a NaN), but from the moment we know the start of day balance for any day, we can accurately estimate the balance after every subsequent transaction.
In this example the new column should look as follows:
userid date amt start_of_day_balance calculated_balance
123 2017-01-04 10 100.0 110
123 2017-01-05 20 NaN 170
123 2017-01-02 30 NaN NaN
123 2017-01-04 40 100.0 150
Note that there is no way to tell the exact order of the transactions that occurred on the same day - I'm happy to overlook that in this case.
My question is how to create this new column. Something like:
df['calculated_balance'] = df.sort_values(['date']).groupby(['userid'])\
    ['amt'].cumsum() + df['start_of_day_balance'].min()
wouldn't work because of the NaNs.
I also don't want to filter out any transactions that happened before the first recorded start of day balance.
I came up with a solution that seems to work, but I'm not sure how elegant it is:
def calc_estimated_balance(g):
    # find the first date which has a start of day balance
    first_date_with_bal = g.loc[g['start_of_day_balance'].first_valid_index(), 'date']
    # only calculate the balance for dates greater than or equal to the date of the first balance
    g['calculated_balance'] = g[g['date'] >= first_date_with_bal]['amt'].cumsum().add(g['start_of_day_balance'].min())
    return g
df = df.sort_values(['date']).groupby(['userid']).apply(calc_estimated_balance)
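For reference, a rough vectorized sketch that avoids the per-group apply (same column names assumed, 'date' converted to a real datetime; not claiming it is any more elegant):

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['userid', 'date'])

# first known start-of-day balance per user, and the date it was observed on
first_bal = df.groupby('userid')['start_of_day_balance'].transform('first')
first_bal_date = (df['date'].where(df['start_of_day_balance'].notna())
                            .groupby(df['userid']).transform('min'))

# running total of amounts from that date onwards; earlier rows stay NaN
mask = df['date'] >= first_bal_date
df['calculated_balance'] = (df['amt'].where(mask)
                              .groupby(df['userid']).cumsum() + first_bal)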
I have a dataframe called data that looks like this:
org_id  commit_date  commit_amt
123     2020-06-01   50000
123     2020-06-01   50000
123     2021-06-01   60000
234     2019-07-01   30000
234     2020-07-01   40000
234     2021-07-01   50000
I want the dataframe to look like this:
org_id  date_1      date_2      date_3      amt_1  amt_2  amt_3
123     2020-06-01  2021-06-01  2022-06-01  50000  50000  60000
234     2019-07-01  2020-07-01  2021-07-01  30000  40000  50000
I've gotten the date columns and org_id column by:
dates = data.groupby('org_id').apply(lambda x: x['commit_date'].unique())  # get all unique commit_dates for each org_id
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column, NaN if the org_id doesn't have enough commit_dates
c_dates = pd.DataFrame()  # create an empty dataframe
# I had to specify each column because the dates df was too hard to work with
c_dates['org_id'] = dates.index
c_dates['date_1'] = dates[0].values.tolist()
c_dates['date_2'] = dates[1].values.tolist()
c_dates['date_3'] = dates[2].values.tolist()
I cannot figure out how to get the amt_1, amt_2, and amt_3 columns. I can't just repeat the date-column code because it would miss the repeated 50000 for org_id 123. And because the c_dates dataframe does not match the length of the original data dataframe, I can't simply compare c_dates to data.
EXCITING UPDATE!
I haven't totally solved my problem yet, but I have made a bit of progress:
dates = data.groupby(['org_id', 'commit_amt']).apply(lambda x: x['commit_date'].unique())  # get all unique commit_dates for each org_id/commit_amt pair
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column, NaN if there aren't enough commit_dates
gives me the data I want; however, it is not formatted how I want. It gives results that look like this:
org_id  commit_amt           0           1
123     50000       2020-06-01  2021-06-01
123     60000       2022-06-01         NaN
234     30000       2019-07-01         NaN
234     40000       2020-07-01         NaN
234     50000       2021-07-01         NaN
I would appreciate any help in getting me to the format I want. I ultimately want to be able to take the difference between amt_1 and amt_2, etc.
Hope this makes sense.
P.S. Thanks to the hero who edited this thereby teaching me how to make tables!
EXCITINGER NEWS!! I HAVE SOLVED MY PROBLEM!!!
Long story short, the function I needed was unstack. I am tired now but tomorrow, I will edit this with the solution! w00t!
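Until that edit lands, here is a minimal sketch of what a cumcount + unstack approach could look like; the helper column 'n' and the date_/amt_ renaming are my own assumptions, not the poster's actual code:

import pandas as pd

data = pd.DataFrame({
    'org_id':      [123, 123, 123, 234, 234, 234],
    'commit_date': ['2020-06-01', '2020-06-01', '2021-06-01',
                    '2019-07-01', '2020-07-01', '2021-07-01'],
    'commit_amt':  [50000, 50000, 60000, 30000, 40000, 50000],
})

# number the commits within each org_id, then unstack that counter into columns
data['n'] = data.groupby('org_id').cumcount() + 1
wide = data.set_index(['org_id', 'n'])[['commit_date', 'commit_amt']].unstack('n')

# flatten the (value, n) MultiIndex columns into date_1/amt_1, date_2/amt_2, ...
wide.columns = [('date_' if col == 'commit_date' else 'amt_') + str(n)
                for col, n in wide.columns]
wide = wide.reset_index()

# the repeated 50000 for org_id 123 is preserved, and differences are now easy
wide['amt_diff_1_2'] = wide['amt_2'] - wide['amt_1']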
I think you can use pandas.pivot() to reshape your data, but the catch with pivot() is that the index/column pairs must not contain duplicate values.
So first drop the duplicated rows, then pivot:
data = data.drop_duplicates()
data.pivot(index='org_id', columns=['commit_amt'], values=['commit_date'])
What the original df looks like:
df.sample(n=3)
customer_code subscribe_date trx_date trx_amount
122706 ERESMLVQ 2020-09-16 2020-02-05 500
116048 DYD9NYC2 2020-12-07 2020-06-12 430
228329 H53A46HC 2020-05-11 2020-04-17 630
Aggregation level: 1 row equals 1 transaction date.
I would like to calculate how many transactions (frequency) and how much money (monetary) each customer makes in the last 7 days before their subscribe date.
df contains data as far back as 2017.
The code below:
import numpy as np
import pandas as pd

# make both columns datetimes so their difference supports .dt.days
df['subscribe_date'] = pd.to_datetime(df['subscribe_date'])
df['trx_date'] = pd.to_datetime(df['trx_date'])

d = {'count': 'frequency', 'sum': 'monetary'}
diff_ = df['subscribe_date'].sub(df['trx_date']).dt.days
out = (df.assign(Before=np.select([diff_ > 0], ["Before"], "paid_date"))
         .groupby(['customer_code', 'Before'])['trx_amount'].agg(['count', 'sum'])
         .rename(columns=d)).unstack().swaplevel(axis=1)
final_dict = {i: out.loc[:, i] for i in out.columns.levels[0]}
print(final_dict['Before'])
runs without raising an error, but it gives the wrong numbers because nothing limits the window to the 7 days before the subscribe date when grouping for the count and sum.
As each customer's transaction value and frequency of purchases are different, the code should return a different result for each row.
Below is the expected output
frequency monetary
cust_code
ERESMLVQ 30.0 1500
DYD9NYC2 50.0 4200
H53A46HC 45.0 5148.03
Zoom into one customer (screenshot in the original post, not reproduced here).
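A minimal sketch of one way to impose the 7-day window, assuming df has exactly the columns shown in the sample above (it does not try to reproduce the Before/paid_date split):

import pandas as pd

df['subscribe_date'] = pd.to_datetime(df['subscribe_date'])
df['trx_date'] = pd.to_datetime(df['trx_date'])

# keep only transactions that fall in the 7 days strictly before the subscribe date
days_before = (df['subscribe_date'] - df['trx_date']).dt.days
last_7_days = df[(days_before > 0) & (days_before <= 7)]

# one row per customer: transaction count and total amount in that window
out = (last_7_days.groupby('customer_code')['trx_amount']
                  .agg(frequency='count', monetary='sum'))
print(out)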
I have a CSV file that contains daily data in the following format for the last 30 days. However, if a particular ID was added recently, it is expected to have fewer rows (see ID=2, which only has data for 2 days):
my.csv
date ID Name Value1 Value2 Value3
07-09-2020 1 ACME 111 3000 123
08-09-2020 1 ACME 222 2500 345
09-09-2020 1 ACME 333 4500 456
10-09-2020 1 ACME 444 1000 567
11-09-2020 1 ACME 555 9000 678
12-09-2020 1 ACME 666 400 789
13-09-2020 1 ACME 666 450 789
14-09-2020 1 ACME 666 444 789
12-09-2020 2 EMCA 111 999 123
13-09-2020 2 EMCA 222 888 345
#...
I'm looking for a solution that will:
take the data for the latest full calendar week for each ID (for now I should disregard 14-09-2020 and any dates before 07-09-2020, but on each run I should check for the latest fully available calendar week, as the dates in the file change constantly)
create new data frame with calculated % differences between each day of this full calendar week for the values in column Value2
calculate average of % differences for the whole week
save dataframe to new CSV file
Desired output for each ID:
ID Name 07-09-2020 % Difference 08-09-2020 % Difference 09-09-2020 % Difference 10-09-2020 % Difference 11-09-2020 % Difference 12-09-2020 % Difference 13-09-2020 Weekly % Difference Average
1 ACME 3000 -0.166667 2500 0.8 4500 -0.777778 1000 8.0 9000 -0.955556 400 0.125000 450 1.170833
2 EMCA N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A 999 -0.111111 888 -0.111111
my code so far:
import datetime
from datetime import timedelta

import pandas as pd

data = pd.read_csv("path/to/my.csv", quotechar='"')

# generate the dates of the latest full calendar week
today = datetime.date.today()
weekday = today.weekday()
start_delta = datetime.timedelta(days=weekday, weeks=1)
start_of_week = today - start_delta  # Monday of the previous week
week_dates = []
for day in range(7):
    week_dates.append(start_of_week + datetime.timedelta(days=day))

# check if the latest full calendar week dates are available in my.csv;
# if any day of that week is not present, select the dates of the week before
last_week_dates = []
for i in week_dates:
    last_week_dates.append(i.strftime("%d-%m-%Y"))

checkDates = data['date'].isin(last_week_dates)
if not checkDates.all():
    for i in range(7, 14):
        print(today - timedelta(days=i))
    # get values from the column 'Value2' for the previous week (if last week dates are not in the file)
    # save values as columns in a new dataframe
    # calculate % difference and weekly avg
else:
    # get values from the column 'Value2' for the last week
    # save values as columns in a new dataframe
    # calculate % difference and weekly avg
    pass

# finalData would be built in the branches above
finalData.to_csv("path/to/output.csv", index=False)
Would someone be able to help with this? Thank you in advance!
Comments inline
# ensure 'date' is of <type datetime>
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

# select last full calendar week
end = pd.Timestamp.today().normalize()
if end.weekday() != 6:
    end -= pd.Timedelta(days=end.weekday() + 1)

out = data.loc[
    data['date'].between(end - pd.Timedelta(days=6), end)
]

# cast back to string, to control the way it is printed
out['date'] = out['date'].dt.strftime('%d-%m-%Y')

# calculate and reshape
out = out.set_index(['date', 'ID', 'Name'])['Value2'].to_frame()
out['Difference'] = (
    out.groupby('ID').transform('pct_change')
)
out = out.unstack('date')
out.sort_index(axis=1, level='date', kind='mergesort', inplace=True)
out.dropna(axis=1, how='all', inplace=True)
out = out.swaplevel(0, 1, axis=1)
out['Weekly Difference Average'] = (
    out.loc[:, (slice(None), 'Difference')]
    .mean(axis=1)
)
Output
date 07-09-2020 08-09-2020 09-09-2020 10-09-2020 \
Value2 Difference Value2 Difference Value2 Difference Value2
ID Name
1 ACME 3000.0 -0.166667 2500.0 0.8 4500.0 -0.777778 1000.0
2 EMCA NaN NaN NaN NaN NaN NaN NaN
date 11-09-2020 12-09-2020 13-09-2020 \
Difference Value2 Difference Value2 Difference Value2
ID Name
1 ACME 8.0 9000.0 -0.955556 400.0 0.125000 450.0
2 EMCA NaN NaN NaN 999.0 -0.111111 888.0
date Weekly Difference Average
ID Name
1 ACME 1.170833
2 EMCA -0.111111
Then you can use df.to_csv().
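For example, to write the reshaped result out (the path is only a placeholder):

out.to_csv('path/to/output.csv')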
I have a data frame with the following columns:
timestamp,stockname,total volume traded
There are multiple stock names at each timestamp:
11:00,A,100
11:00,B,500
11:01,A,150
11:01,B,600
11:02,A,200
11:02,B,650
I want to create a ChangeInVol column such that each stock carries its own difference, like:
timestamp,stockname,total volume traded,change in volume
11:00,A,100,NaN
11:00,B,500,NaN
11:01,A,150,50
11:01,B,600,100
11:02,A,200,50
11:02,B,650,50
If it were a single stock, I could have done
df['ChangeVol'] = df['TotalVol'] - df['TotalVol'].shift(1)
but there are multiple stocks
Need sort_values + DataFrameGroupBy.diff:
#if columns not sorted
df = df.sort_values(['timestamp','stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print (df)
timestamp stockname total volume traded change in volume
0 11:00 A 100 NaN
1 11:00 B 500 NaN
2 11:01 A 150 50.0
3 11:01 B 600 100.0
4 11:02 A 200 50.0
5 11:02 B 650 50.0
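If you prefer to stay close to the single-stock shift() idea from the question, the grouped equivalent (a sketch that gives the same column as diff) would be:

df['change in volume'] = (df['total volume traded']
                          - df.groupby('stockname')['total volume traded'].shift(1))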
I have the following problem with imputing missing or zero values in a table. It seems like it's more of an algorithm problem; I wanted to know if someone could help me figure this out in Python or R.
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 0 01/24/2017 00:00:00
A 0 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 0 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 0 01/12/2017 00:00:00
B 0 01/11/2017 00:00:00
B 0 01/10/2017 00:00:00
B 0 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
For each asset (A, B, etc.), traverse the records chronologically by date and fill each run of zeros from the two non-zero readings that surround it:
step = (next non-zero mileage - previous non-zero mileage) / (number of records from the previous reading to the next reading)
and each zero becomes the previous (already filled) value plus that step.
For instance, for the above table the data will look like this after it is fixed:
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 40,974 01/24/2017 00:00:00
A 40,919 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 39,800 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 4,000 01/12/2017 00:00:00
B 3,500 01/11/2017 00:00:00
B 3,000 01/10/2017 00:00:00
B 2,500 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
In the above case, for instance, the calculation for one of the records is:
(41,084 - 40,864) / 4 (the number of records from 40,864 to 41,084) = 55, and 55 + the previous value (40,864) = 40,919.
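A minimal sketch in pandas of the rule described above, assuming the table is in a DataFrame df with columns Asset, Mileage (already numeric, no thousands separators) and Date; zeros that are not bracketed by two non-zero readings are left untouched here:

import numpy as np
import pandas as pd

def fill_zeros(group):
    # work in chronological order within one asset
    g = group.sort_values('Date').copy()
    m = g['Mileage'].to_numpy(dtype=float)
    known = np.flatnonzero(m != 0)  # positions of the real readings
    for lo, hi in zip(known[:-1], known[1:]):
        # divide by the number of records in the stretch, as in the example above
        step = (m[hi] - m[lo]) / (hi - lo + 1)
        for k in range(lo + 1, hi):  # fill the zeros in between
            m[k] = m[k - 1] + step
    g['Mileage'] = m
    return g

df = df.groupby('Asset', group_keys=False).apply(fill_zeros)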
It seems like you want an approach that uses some sort of "by" operation to iterate over your data frame and compute averages; you could consider by() and apply(). The specific iterative changes are harder without adding an ordering variable (i.e., right now your rows are implicitly numbered, but they should be numbered by date within each asset).
Steps to solving this yourself:
Create an ordered variable that provides a number from mileage (0) to mileage (X).
Use either by() or dplyr::group_by() to create averages within each asset. You might want to merge() or dplyr::inner_join() that to the original dataset, or use a lookup.
Use ifelse() to add that average to rows where mileage is 0, multiplying it by the ordered variable.