I wish to add a new row at the top of each group. My raw dataframe is:
import pandas as pd

df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'Max', 'Max', 'Max', 'Max', 'Park', 'Tom', 'Tom', 'Tom', 'Tom', 'Wong'],
    'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
    'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person ('ID'), I wish to insert a duplicate of the group's first row at the top of that group. The 'ID', 'From_num' and 'To_num' values of the created row should be the same as the original first row, but its 'Date' should be the original first row's Date plus one day. For James, for example, the newly created row would be 'James', 78, 96, '2020-05-13'. The same applies to the rest of the data, so my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I wrote a loop-based solution, but it is quite slow. If you have any better ideas, please help. Thanks a lot.
Let's try groupby.apply here. We'll prepend a row to each group, like this:
def augment_group(group):
    # Copy the group's first row, bump its Date by one day, and put it back on top.
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return pd.concat([first_row, group])  # pd.concat replaces the now-removed DataFrame.append

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

(df.groupby('ID', as_index=False, group_keys=False)
   .apply(augment_group)
   .reset_index(drop=True))
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
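If apply turns out to be slow on your real data, here is a concat-based sketch of the same idea that avoids the per-group Python call (assuming 'Date' has already been converted with pd.to_datetime):
# Take each group's first row, bump its Date by one day, then interleave the
# copies back in front of their groups. mergesort is a stable sort, so each copy
# lands just before the original row that shares its index.
first_rows = df.groupby('ID').head(1).copy()
first_rows['Date'] += pd.Timedelta(days=1)

out = (pd.concat([first_rows, df])
         .sort_index(kind='mergesort')
         .reset_index(drop=True))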
That said, I agree with @Joran Beasley in the comments that this feels somewhat like an XY problem. Perhaps try clarifying the problem you're actually trying to solve, instead of asking how to implement what you think the solution is.
My dataframe (df) contains two input regressors, UnitShrtDescr and SchShrtDescr.
For a particular UnitShrtDescr and SchShrtDescr it must predict the next value. But my data contains lots of missing dates (the output for the in-between dates is 0).
During prediction, Prophet keeps predicting a value for each and every day without treating the in-between dates as zero. How can I resolve this?
df  # (main dataframe)
UnitShrtDescr SchShrtDescr y ds id
8110 50 93 1 2011-12-01 243
3437 29 87 1 2011-12-21 133
6867 43 75 1 2011-12-23 204
1102 8 23 1 2011-12-28 36
5271 36 14 1 2011-12-28 166
... ... ... ... ... ...
13138 83 0 1 2018-05-18 390
14424 92 3 1 2018-05-18 432
11556 69 0 1 2018-05-18 334
11767 69 5 1 2018-05-18 338
4458 30 102 1 2018-05-18 141
15950 rows × 5 columns
Code:
from prophet import Prophet  # or: from fbprophet import Prophet, on older installs

model = Prophet(daily_seasonality=True)
model.add_regressor("UnitShrtDescr")
model.add_regressor("SchShrtDescr")
model.fit(df)
The regressor values I want to predict for are UnitShrtDescr=40 and SchShrtDescr=93, so I made a future dataframe with make_future_dataframe:
future = model.make_future_dataframe(periods=100, include_history=False)
future["UnitShrtDescr"]=40
future["SchShrtDescr"]=93
The previous values for UnitShrtDescr=40 and SchShrtDescr=93 were:
dfx[(dfx['UnitShrtDescr']==40) & (dfx['SchShrtDescr']==93)].tail(10)
UnitShrtDescr SchShrtDescr y ds id
6293 40 93 1 2018-02-27 189
6294 40 93 3 2018-02-28 189
6295 40 93 1 2018-03-17 189
6296 40 93 1 2018-03-29 189
6297 40 93 1 2018-03-30 189
6298 40 93 4 2018-03-31 189
6299 40 93 1 2018-04-26 189
6300 40 93 1 2018-04-27 189
6301 40 93 4 2018-04-30 189
6302 40 93 1 2018-05-16 189
Please note that the gaps between dates are large, which means y is 0 for the in-between dates.
So when I make a prediction, it should predict those in-between dates as 0 as well.
But in this case it predicts y continuously, without treating the in-between days as 0:
output = model.predict(future)
output[['ds','yhat']].head(10)
ds yhat
0 2018-05-19 2.959505
1 2018-05-20 2.631181
2 2018-05-21 2.418850
3 2018-05-22 2.411914
4 2018-05-23 2.386383
5 2018-05-24 2.444841
6 2018-05-25 2.409294
7 2018-05-26 2.937428
8 2018-05-27 2.588136
9 2018-05-28 2.358953
Please suggest changes or a better alternative for my case.
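One thing that might help (a sketch, not something from the original thread): Prophet only learns from the rows it is given, so if the zero-demand days never appear in the training frame it will happily interpolate over them. You could densify each (UnitShrtDescr, SchShrtDescr) series to daily frequency and fill the missing days with y=0 before fitting; the column names below come from your frame, everything else is an assumption.
import pandas as pd

# Fill calendar gaps with y=0 so Prophet actually sees the zero days.
df['ds'] = pd.to_datetime(df['ds'])

def fill_missing_days(group):
    # Assumes at most one row per day within a group; aggregate first if not.
    full_range = pd.date_range(group['ds'].min(), group['ds'].max(), freq='D')
    group = (group.set_index('ds')
                  .reindex(full_range)
                  .rename_axis('ds')
                  .reset_index())
    group['y'] = group['y'].fillna(0)                        # missing days -> zero demand
    group[['UnitShrtDescr', 'SchShrtDescr']] = (
        group[['UnitShrtDescr', 'SchShrtDescr']].ffill())    # regressors are constant per group
    return group

df_filled = (df.groupby(['UnitShrtDescr', 'SchShrtDescr'], group_keys=False)
               .apply(fill_missing_days))
model.fit(df_filled)  # then build the future frame and predict as before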
Hi, I have the following DataFrame that contains send and open totals, df_send_open:
date user_id name send open
0 2022-03-31 35 sally 50 20
1 2022-03-31 47 bob 100 55
2 2022-03-31 01 john 500 102
3 2022-03-31 45 greg 47 20
4 2022-03-30 232 william 60 57
5 2022-03-30 147 mary 555 401
6 2022-03-30 35 sally 20 5
7 2022-03-29 41 keith 65 55
8 2022-03-29 147 mary 100 92
My other DataFrame contains call and cancel totals, df_call_cancel:
date user_id name call cancel
0 2022-03-31 21 percy 54 21
1 2022-03-31 47 bob 150 21
2 2022-03-31 01 john 100 97
3 2022-03-31 45 greg 101 13
4 2022-03-30 232 william 61 55
5 2022-03-30 147 mary 5 3
6 2022-03-30 35 sally 13 5
7 2022-03-29 41 keith 14 7
8 2022-03-29 147 mary 102 90
Like a VLOOKUP in Excel, I want to add the extra columns from df_call_cancel to df_send_open. However, I need to do it on the unique combination of BOTH date and user_id, and this is where I'm tripping up.
I have two desired DataFrame outcomes (not sure which to go forward with, so I thought I'd ask for both solutions):
Desired Dataframe 1:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-30 232 william 60 57 61 55
5 2022-03-30 147 mary 555 401 5 3
6 2022-03-30 35 sally 20 5 13 5
7 2022-03-29 41 keith 65 55 14 7
8 2022-03-29 147 mary 100 92 102 90
Dataframe 1 only joins the call and cancel columns if the combination of date and user_id exists in df_send_open as this is the primary dataframe.
Desired Dataframe 2:
date user_id name send open call cancel
0 2022-03-31 35 sally 50 20 0 0
1 2022-03-31 47 bob 100 55 150 21
2 2022-03-31 01 john 500 102 100 97
3 2022-03-31 45 greg 47 20 101 13
4 2022-03-31 21 percy 0 0 54 21
5 2022-03-30 232 william 60 57 61 55
6 2022-03-30 147 mary 555 401 5 3
7 2022-03-30 35 sally 20 5 13 5
8 2022-03-29 41 keith 65 55 14 7
9 2022-03-29 147 mary 100 92 102 90
DataFrame 2 does the same as DataFrame 1 but also adds any date and user_id combinations in df_call_cancel that aren't in df_send_open (see percy).
Many thanks.
merged_df1 = df_send_open.merge(df_call_cancel, how='left', on=['date', 'user_id', 'name']).fillna(0)
merged_df2 = df_send_open.merge(df_call_cancel, how='outer', on=['date', 'user_id', 'name']).fillna(0)
This should work for your two cases: one left join and one outer join. Including 'name' in the join keys keeps a single name column (otherwise you end up with name_x/name_y), and fillna(0) turns the unmatched call/cancel (or send/open) values into the zeros shown in your expected outputs.
I'm not sure whether this is a general coding question, but I hope this is the correct forum. Consider the following reduced example DataFrame df:
Department CustomerID Date Price MenswearDemand HomeDemand
0 Menswear 418089 2019-04-18 199 199 0
1 Menswear 613573 2019-04-24 199 199 0
2 Menswear 161840 2019-04-25 199 199 0
3 Menswear 2134926 2019-04-29 199 199 0
4 Menswear 984801 2019-04-30 19 19 0
5 Home 398555 2019-01-27 52 0 52
6 Menswear 682906 2019-02-03 97 97 0
7 Menswear 682906 2019-02-03 97 97 0
8 Menswear 923491 2019-02-09 80 80 0
9 Menswear 1098782 2019-02-25 258 258 0
10 Menswear 721696 2019-03-25 12 12 0
11 Menswear 695706 2019-04-10 129 129 0
12 Underwear 637026 2019-01-18 349 0 0
13 Underwear 205997 2019-01-25 279 0 0
14 Underwear 787984 2019-02-01 27 0 0
15 Underwear 318256 2019-02-01 279 0 0
16 Underwear 570454 2019-02-14 262 0 0
17 Underwear 1239118 2019-02-28 279 0 0
18 Home 1680791 2019-04-04 1398 0 1398
I want to group this data based on 'CustomerID' and then:
Turn the purchase date 'Date' into the number of days until a cutoff date, which is '2021-01-01'. This is just the time from the customer's most recent purchase until '2021-01-01'.
Sum over all the remaining Demand-columns, in this example only 'MenswearDemand' and 'HomeDemand'.
The result I should get is this:
Date MenswearDemand HomeDemand
CustomerID
161840 6 199 0
205997 96 0 0
318256 89 0 0
398555 94 0 52
418089 13 199 0
570454 76 0 0
613573 7 199 0
637026 103 0 0
682906 87 194 0
695706 21 129 0
721696 37 12 0
787984 89 0 0
923491 81 80 0
984801 1 19 0
1098782 65 258 0
1239118 62 0 0
1680791 27 0 1398
2134926 2 199 0
This is how I managed to solve this:
import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})
However, in reality I have about 15 million rows and 30 demand columns. I really don't want to write 'DemandColumn': lambda x: x.sum() for every one of them in my aggregate function, since they should all be summed. Is there a better way of doing this, like passing in an array of the subset of columns that a particular operation should be applied to?
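One way to avoid spelling out every column (a sketch, assuming every demand column name ends in 'Demand'): build the aggregation dict programmatically and let pandas' built-in 'sum' do the work, which is also faster than a Python lambda.
import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

# Collect the demand columns by name instead of listing all 30 by hand.
demand_cols = [c for c in df.columns if c.endswith('Demand')]

agg_map = {'Date': lambda x: (cutoffDate - x.max()).days}
agg_map.update({col: 'sum' for col in demand_cols})

newdf = df.groupby('CustomerID').agg(agg_map)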
I am trying to use get_loc to find the current date and then return the 10 rows above it in the data, but I keep getting a KeyError.
Here is my DataFrame, posting_df5:
Posting_date rooms Origin Rooms booked ADR Revenue
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
10 2019-04-10 31 31 31 184.138208 5708.284456
I tried doing the following:
idx = posting_df7.index.get_loc('2019-04-05')
posting_df7 = posting_df5.iloc[idx - 5 : idx + 5]
But I received the following error:
indexer = self._get_level_indexer(key, level=level)
File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2939, in _get_level_indexer
code = level_index.get_loc(key)
File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2019-04-05'
So I then tried to set Posting_date as the index before using get_loc, but that didn't work either:
rooms Origin Rooms booked ADR Revenue
Posting_date
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
Then I used the same get_loc call, but the same error appeared. How can I select the rows based on the date required?
Thanks
Here is a different approach...
Because iloc and get_loc can be tricky, this solution uses boolean masking to return the rows relative to a given date, then uses the head() function to return the number of rows you require.
import pandas as pd

PATH = '/home/user/Desktop/so/room_rev.csv'

# Read in data from a CSV.
df = pd.read_csv(PATH)

# Convert the date column to a `datetime` format.
df['Posting_date'] = pd.to_datetime(df['Posting_date'], format='%Y-%m-%d')

# Sort based on date (assign back; sort_values is not in-place by default).
df = df.sort_values('Posting_date')
Original Dataset:
Posting_date rooms Origin Rooms booked ADR Revenue
0 2019-03-31 1 1 1 156.000000 156.000000
1 2019-04-01 13 13 13 160.720577 2089.367500
2 2019-04-02 15 15 15 167.409167 2511.137500
3 2019-04-03 21 21 21 166.967405 3506.315500
4 2019-04-04 37 37 37 162.384909 6008.241643
5 2019-04-05 52 52 52 202.150721 10511.837476
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
9 2019-04-09 37 37 37 177.654422 6573.213623
10 2019-04-10 31 31 31 184.138208 5708.284456
Solution:
Replace the value in the head() function with the number of rows you want to return. Note: There is also a tail() function for the inverse.
df[df['Posting_date'] > '2019-04-05'].head(3)
Output:
Posting_date rooms Origin Rooms booked ADR Revenue
6 2019-04-06 49 49 49 199.611887 9780.982476
7 2019-04-07 44 44 44 182.233171 8018.259527
8 2019-04-08 50 50 50 187.228192 9361.409623
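If you specifically want the rows before a given date (as in the original get_loc attempt), the same boolean mask works with tail() instead; for example, the ten most recent rows strictly before 2019-04-05:
# Last 10 rows dated before 2019-04-05.
df[df['Posting_date'] < '2019-04-05'].tail(10)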
I have this sample table:
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 111 2016-06-01 30 20
6 111 2016-07-01 31 20
7 111 2016-08-01 31 15
8 111 2016-09-01 29 15
9 111 2016-10-01 31 10
10 111 2016-11-01 29 5
11 111 2016-12-01 27 0
0 112 2016-01-01 31 55
1 112 2016-02-01 26 45
2 112 2016-03-01 31 40
3 112 2016-04-01 30 35
4 112 2016-04-01 31 30
5 112 2016-05-01 30 25
6 112 2016-06-01 31 25
7 112 2016-07-01 31 20
8 112 2016-08-01 30 20
9 112 2016-09-01 31 15
10 112 2016-11-01 29 10
11 112 2016-12-01 31 0
I'm trying to make my final table look like the one below after grouping by ID and Date.
ID Date CumDays Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 45 40
2 111 2016-03-01 76 35
3 111 2016-04-01 106 30
4 111 2016-05-01 137 25
5 111 2016-06-01 167 20
6 111 2016-07-01 198 20
7 111 2016-08-01 229 15
8 111 2016-09-01 258 15
9 111 2016-10-01 289 10
10 111 2016-11-01 318 5
11 111 2016-12-01 345 0
0 112 2016-01-01 31 55
1 112 2016-02-01 57 45
2 112 2016-03-01 88 40
3 112 2016-04-01 118 35
4 112 2016-05-01 149 30
5 112 2016-06-01 179 25
6 112 2016-07-01 210 25
7 112 2016-08-01 241 20
8 112 2016-09-01 271 20
9 112 2016-10-01 302 15
10 112 2016-11-01 331 10
11 112 2016-12-01 362 0
Next, I want to be able to extract the first value of Volume/Day per ID, all the CumDays values, and all the Volume/Day values per ID and Date, so I can use them for further computation and for plotting Volume/Day vs CumDays. For example, for ID 111 the first value of Volume/Day is 50 and for ID 112 it is 55; the CumDays values for ID 111 are 20, 45, ... and for ID 112 they are 31, 57, ...; the Volume/Day values for ID 111 are 50, 40, ... and for ID 112 they are 55, 45, ...
My solution:
def get_time_rate(grp_df):
    t = grp_df['Days'].cumsum()
    r = grp_df['Volume/Day']
    return t, r

vals = df.groupby(['ID','Date']).apply(get_time_rate)
vals
Doing this, the cumulative calculation doesn't take effect at all; it returns the original Days values. This didn't let me move further with extracting the first value of Volume/Day, all the CumDays values, and all the Volume/Day values I need. Any advice or help on how to go about it would be appreciated. Thanks
Get a groupby object.
g = df.groupby('ID')
Compute columns with transform:
df['CumDays'] = g.Days.transform('cumsum')
df['First Volume/Day'] = g['Volume/Day'].transform('first')
df
ID Date Days Volume/Day CumDays First Volume/Day
0 111 2016-01-01 20 50 20 50
1 111 2016-02-01 25 40 45 50
2 111 2016-03-01 31 35 76 50
3 111 2016-04-01 30 30 106 50
4 111 2016-05-01 31 25 137 50
5 111 2016-06-01 30 20 167 50
6 111 2016-07-01 31 20 198 50
7 111 2016-08-01 31 15 229 50
8 111 2016-09-01 29 15 258 50
9 111 2016-10-01 31 10 289 50
10 111 2016-11-01 29 5 318 50
11 111 2016-12-01 27 0 345 50
0 112 2016-01-01 31 55 31 55
1 112 2016-01-02 26 45 57 55
2 112 2016-01-03 31 40 88 55
3 112 2016-01-04 30 35 118 55
4 112 2016-01-05 31 30 149 55
5 112 2016-01-06 30 25 179 55
6 112 2016-01-07 31 25 210 55
7 112 2016-01-08 31 20 241 55
8 112 2016-01-09 30 20 271 55
9 112 2016-01-10 31 15 302 55
10 112 2016-01-11 29 10 331 55
11 112 2016-01-12 31 0 362 55
If you want grouped plots, you can iterate over the groups after grouping by ID and plot each one onto a shared axis:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
for i, g in df.groupby('ID'):
    g.plot(x='CumDays', y='Volume/Day', ax=ax, label=str(i))
plt.show()
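And if you want the raw per-ID arrays for further computation rather than plotting, a small sketch (the variable names are just illustrative):
# First Volume/Day per ID (111 -> 50, 112 -> 55) and the full CumDays / Volume/Day arrays.
first_volume = df.groupby('ID')['Volume/Day'].first()
series_by_id = {
    i: (g['CumDays'].to_numpy(), g['Volume/Day'].to_numpy())
    for i, g in df.groupby('ID')
}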