New pandas columns based on date condition - python

I have a dataframe with a series of floats called balance and a series of timestamps called due_date. I'd like to create three new columns:
current: displays the balance if the due_date is >= today (all else "")
1-30 Days: displays the balance if the due_date is 1 to 30 days ago (all else "")
>30 Days: displays the balance if the due_date is more than 30 days ago (all else "")
Here are some example rows:
balance due_date
0 250.00 2017-10-22
1 400.00 2017-10-04
2 3000.00 2017-09-08
3 3000.00 2017-09-08
4 250.00 2017-08-05
Any help would be greatly appreciated.

Using pd.cut and pd.crosstab
df['diff'] = (pd.to_datetime('today') - df.due_date).dt.days
df['New'] = pd.cut(df['diff'], bins=[0, 1, 30, 99999], labels=["current", "1-30", "more than 30"])
pd.concat([df, pd.crosstab(df.index.get_level_values(0), df.New).apply(lambda x: x.mul(df.balance))], axis=1)
Out[928]:
balance due_date diff New more than 30
row_0
0 250.0 2017-01-22 261 more than 30 250.0
1 400.0 2017-02-04 248 more than 30 400.0
2 3000.0 2017-02-08 244 more than 30 3000.0
3 3000.0 2017-02-08 244 more than 30 3000.0
4 250.0 2017-02-05 247 more than 30 250.0
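A minimal alternative sketch that builds the three requested columns directly with boolean masks (assuming due_date is already a datetime column; the sample data is recreated here so the snippet runs on its own):
import pandas as pd

df = pd.DataFrame({
    'balance': [250.00, 400.00, 3000.00, 3000.00, 250.00],
    'due_date': pd.to_datetime(['2017-10-22', '2017-10-04', '2017-09-08',
                                '2017-09-08', '2017-08-05']),
})

today = pd.Timestamp('today').normalize()
days_past_due = (today - df['due_date']).dt.days

# Keep the balance where the row falls into the bucket, blank it out otherwise
df['current'] = df['balance'].where(days_past_due <= 0, '')
df['1-30 Days'] = df['balance'].where(days_past_due.between(1, 30), '')
df['>30 Days'] = df['balance'].where(days_past_due > 30, '')
print(df)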


How to add a new date and value for only one of the columns of an existing dataframe

I have a data-frame with date as the index and columns for a few companies' stock prices. I want to append a few new dates at the bottom, but with only one of the companies' stock prices filled in. How can I do that? I have tried append and concat but am running into issues because I am updating only one company's stock price.
Let's say this is the dataframe:
Co1 Co2 Co3 Co4
Date 1 100 200 300 400
Date 2 105 210 290 350
Date 3 102 205 325 380
I want to add Co3's prices for three new dates from a data series, let's say:
Date
Date 4 300
Date 5 310
Date 6 305
Name:Co3, Length:3, dtype:float64
Would appreciate it if anyone can guide!
Thanks
Let's try pandas.concat
print(Co3)
Date
Date 4 300
Date 5 310
Date 6 305
Name: Co3, dtype: int64
pd.concat([df1, pd.DataFrame(Co3)]).fillna(0.0)
Co1 Co2 Co3 Co4
Date
Date 1 100.0 200.0 300 400.0
Date 2 105.0 210.0 290 350.0
Date 3 102.0 205.0 325 380.0
Date 4 0.0 0.0 300 0.0
Date 5 0.0 0.0 310 0.0
Date 6 0.0 0.0 305 0.0
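If the other companies' prices should stay missing (NaN) rather than become 0.0, a sketch of the same idea without the fillna step (the example data is recreated here so the snippet runs on its own):
import pandas as pd

df1 = pd.DataFrame({'Co1': [100, 105, 102], 'Co2': [200, 210, 205],
                    'Co3': [300, 290, 325], 'Co4': [400, 350, 380]},
                   index=['Date 1', 'Date 2', 'Date 3'])
Co3 = pd.Series([300, 310, 305], index=['Date 4', 'Date 5', 'Date 6'], name='Co3')

# Concatenating the Series as a one-column frame leaves the other columns as NaN
out = pd.concat([df1, Co3.to_frame()])
print(out)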

Group by month from a particular date python

I have a DataFrame of an account statement that contains date, debit and credit.
Let's just say salary gets deposited every 20th of the month.
I want to group by the date column from the 20th of each month to the 20th of the next to find the sum of debits and credits, e.g. 20th Jan to 20th Feb and so on.
date_parsed Debit Credit
0 2020-05-02 775.0 0.0
1 2020-04-30 209.0 0.0
2 2020-04-24 5000.0 0.0
3 2020-04-24 25000.0 0.0
... ... ... ...
79 2020-04-20 750.0 0.0
80 2020-04-15 5000.0 0.0
81 2020-04-13 0.0 2283.0
82 2020-04-09 0.0 6468.0
83 2020-04-03 0.0 1000.0
I am not sure, but perhaps pd.offsets can be used with groupby.
You could add an extra month column which rounds up or down based on the day of the month. Then it's just groupby and sum. E.g. month 2020-06 would include dates between 2020-05-20 and 2020-06-19.
import pandas as pd
import numpy as np
df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'], 'Credit': [1,2,3,4], 'Debit': [5,6,7,8]})
df['date'] = pd.to_datetime(df.date_parsed)
df['month'] = np.where(df.date.dt.day < 20, df.date.dt.to_period('M'), (df.date + pd.DateOffset(months=1)).dt.to_period('M'))
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())
Input:
date_parsed Credit Debit
0 2020-05-02 1 5
1 2020-05-03 2 6
2 2020-05-20 3 7
3 2020-05-22 4 8
Result:
month Credit Debit
0 2020-05 3 11
1 2020-06 7 15
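An equivalent variation on the same idea (a sketch, not part of the original answer): shift each date back 19 days so the 20th lands on the 1st, then take the calendar month of the shifted date plus one as the salary month. Reusing the df built above:
df['month'] = (df.date - pd.DateOffset(days=19)).dt.to_period('M') + 1
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())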

Create Bi-weekly Variable Based On A Custom Start-date

I've done some light searching using "get biweekly variable in python" but haven't been able to find many useful posts, so I thought to post my question here.
I have a dataframe with tens of thousands of records. The dataframe contains records for the entire fiscal year. Each record has a datetime variable CHECKIN_DATE_TIME. I would like to create a biweekly variable beginning with the date June 30 2019.
ID CHECKIN_DATE_TIME
1 2019-06-30 13:36:00
2 2019-06-30 14:26:00
3 2019-06-30 20:10:00
4 2019-06-30 21:27:00
....
51 2019-07-10 13:36:00
52 2019-07-10 10:26:00
53 2019-07-10 10:10:00
54 2019-07-10 23:27:00
....
I would like a new dataframe to look like this, where 6/30/2019 - 7/13/2019 would be week 1, 7/14/2019 to 7/27/2019 would be week 2, and so on until the end date of 6/28/2020. Thus there will be 26 weeks within the Week variable and each week represents a 2-week time frame.
EDIT: and to have the last day in the week range assigned alongside the week number.
ID CHECKIN_DATE_TIME Week Date
1 2019-06-30 13:36:00 1 7/13/2019
2 2019-06-30 14:26:00 1 7/13/2019
3 2019-06-30 20:10:00 1 7/13/2019
4 2019-06-30 21:27:00 1 7/13/2019
....
51 2019-07-20 13:36:00 2 7/27/2019
52 2019-07-20 10:26:00 2 7/27/2019
53 2019-07-20 10:10:00 2 7/27/2019
54 2019-07-20 23:27:00 2
....
You can do so by determining the number of days between the check-in date and 2019-06-30 and then doing a floor division by 14.
df['CHECKIN_DATE_TIME'] = pd.to_datetime(df.CHECKIN_DATE_TIME)
start = pd.Timestamp(2019, 6, 30)
days_since_start = (df.CHECKIN_DATE_TIME - start).dt.days
df['week'] = days_since_start // 14 + 1
df['last_week_day'] = (df.CHECKIN_DATE_TIME + pd.to_timedelta(13 - days_since_start % 14, 'd')).dt.date
# note I've created my own test set.
ID CHECKIN_DATE_TIME week last_week_day
0 1 2019-06-30 13:36:00 1 2019-07-13
1 2 2019-07-10 10:36:00 1 2019-07-13
2 3 2019-07-12 02:36:00 1 2019-07-13
3 4 2019-07-18 18:36:00 2 2019-07-27
4 5 2019-07-30 11:36:00 3 2019-08-10
5 6 2019-08-01 20:36:00 3 2019-08-10
Edit: added last_week_day as per the request in the comments. This is done by calculating the required number of days to add to the CHECKIN_DATE_TIME column using the modulo operator %.
Using the pandas date_range function is a very easy and efficient way to generate a list of dates at a weekly, bi-weekly or monthly frequency:
import pandas as pd
from datetime import date, datetime, timedelta
date_rng = pd.date_range(start=date.today() - timedelta(weeks=53), end=date.today(), freq="2W-SAT")
for i in date_rng:
    print(i)
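To connect this to the question, one possible sketch (an assumption, not part of the original answer) is to use such a range as bin edges for pd.cut and label each check-in with its bi-weekly bucket:
import pandas as pd

df = pd.DataFrame({'CHECKIN_DATE_TIME': pd.to_datetime(
    ['2019-06-30 13:36:00', '2019-07-10 10:26:00', '2019-07-20 13:36:00'])})

# 27 edges -> 26 bi-weekly bins, starting 2019-06-30
edges = pd.date_range(start='2019-06-30', periods=27, freq='14D')

# right=False makes each bin [start, start + 14 days), i.e. 6/30-7/13 is week 1
df['Week'] = pd.cut(df['CHECKIN_DATE_TIME'], bins=edges, right=False,
                    labels=list(range(1, len(edges))))
print(df)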

Calculate mean based on time elapsed in Pandas

I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3) and so on for Time Elapsed=15, 20, 25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get the group key 'Time Elapsed', then just group by it to get the mean.
df['Time Elapsed'] = df.DateTime - df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
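If the group key should be minutes as floats (0.0, 5.0, 10.0), as in the desired output, the timedelta key could be converted first, e.g. (a small sketch continuing from the code above):
minutes = df['Time Elapsed'].dt.total_seconds() / 60
df.groupby(minutes).Value.mean()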
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First, get the year, month and day for each DateTime, since they are all changing in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post):
The counter is computed within (1) each year, (2) then each month and then (3) each day.
Since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers).
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
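A variation on the same counter idea (a sketch, not from the original answer): group by ID instead of the calendar fields and start the counter at zero, which reproduces the desired 0/5/10 labels directly, assuming the readings are evenly spaced 5 minutes apart within each ID:
df['Time Elapsed'] = df.groupby('ID').cumcount() * 5
print(df.groupby('Time Elapsed')['Value'].mean())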

Add different missing dates for groups of rows

Let's suppose that I have a dataset which consists of the following columns:
Stock_id: the id of a stock
Date: a date of 2018 e.g. 25/03/2018
Stock_value: the value of the stock at this specific date
I have some dates, different for each stock, which are entirely missing from the dataset and I would like to fill them in.
By missing dates, I mean that there is not even a row for each of these dates; not that the rows exist in the dataset with the Stock_value simply being NA, etc.
A limitation is that some stocks were introduced to the stock market at some point during 2018, so I do not want to fill in dates for these stocks from before they existed.
By this I mean that if a stock was introduced to the stock market on 21/05/2018, then I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018, but not dates before 21/05/2018.
What is the most efficient way to do this?
I have seen some posts on StackOverflow (post_1, post_2 etc) but I think that my case is a more special one so I would like to see an efficient way to do this.
Let me provide an example. Let's limit this only to two stocks and only to the week from 01/01/2018 to the 07/01/2018 otherwise it won't fit in here.
Let's suppose that I initially have the following:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 06/01/2018 150
2 07/01/2018 147
Thus for Stock_id = 1 the date 04/01/2018 is missing.
For Stock_id = 2 the date 05/01/2018 is missing, and since the dates for this stock start at 03/01/2018, the dates before that should not be filled in (because the stock was introduced to the stock market on 03/01/2018).
Hence, I would like to have the following as output:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 04/01/2018 NA
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 05/01/2018 NA
2 06/01/2018 150
2 07/01/2018 147
Use asfreq per group, but with large data the performance could be problematic:
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.asfreq('D'))
        .reset_index()
      )
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 124.0
1 1 2018-01-02 130.0
2 1 2018-01-03 136.0
3 1 2018-01-04 NaN
4 1 2018-01-05 129.0
5 1 2018-01-06 131.0
6 1 2018-01-07 133.0
7 2 2018-01-03 144.0
8 2 2018-01-04 148.0
9 2 2018-01-05 NaN
10 2 2018-01-06 150.0
11 2 2018-01-07 147.0
EDIT:
If you want to extend each group from its minimal datetime up to some fixed maximum datetime, use reindex with date_range:
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))
        .reset_index()
      )
df.set_index(['Date', 'Stock_id']).unstack().fillna(method='ffill').stack().reset_index()
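For the example in the question, here is a self-contained sketch of the reindex idea, filling each stock from its own first date up to the overall last date (the date column is named via date_range's name argument so reset_index restores it as 'Date'):
import pandas as pd

df = pd.DataFrame({
    'Stock_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'Date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-05',
                            '2018-01-06', '2018-01-07', '2018-01-03', '2018-01-04',
                            '2018-01-06', '2018-01-07']),
    'Stock_value': [124, 130, 136, 129, 131, 133, 144, 148, 150, 147],
})

# Reindex each stock from its own first date to the overall last date
filled = (df.set_index('Date')
            .groupby('Stock_id')['Stock_value']
            .apply(lambda s: s.reindex(pd.date_range(s.index.min(), df['Date'].max(), name='Date')))
            .reset_index())
print(filled)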
