Split rows based on different rows and columns - python

I would really appreciate your help on this.
I have a table with products, dates, and amounts. This is what the initial table looks like:
Product ID  goliveyear  endyear  Revenue
1           2020-10     2022-02  90
1           2020-10     2022-02  140
1           2020-10     2022-02  60
The goal is to split each row into one row per month of the period it covers:
If it is the first (go-live) year, split from the go-live month through the end of that year.
If it is the end year, split from January through the end month.
All years in between are split into 12 rows, one per month.
In every case the revenue must be divided evenly across the resulting rows, since the revenue in the input table covers the whole period.
Product ID  goliveyear  endyear  Year  Month  Revenue
1           2020-10     2022-02  2020  10     90/3 = 30
1           2020-10     2022-02  2020  11     30
1           2020-10     2022-02  2020  12     30
1           2020-10     2022-02  2021  01     140/12 = 11.67
1           2020-10     2022-02  2021  02     11.67
1           2020-10     2022-02  2021  03     11.67
1           2020-10     2022-02  2021  04     11.67
...         ...         ...      ...   ...    ...
1           2020-10     2022-02  2022  01     60/2 = 30
1           2020-10     2022-02  2022  02     30
Thank you so much, everyone.

Quite a few steps.
Start by setting up the df:
from io import StringIO
import pandas as pd
from datetime import datetime,timedelta
df = pd.read_csv(StringIO(
"""
Product_ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
"""), delim_whitespace=True)
df['goliveyear'] = pd.to_datetime(df['goliveyear'])
df['endyear'] = pd.to_datetime(df['endyear'])
df
Then add year_start, year_end, period_start, period_end columns. This relies on each product having exactly one input row per calendar year of the period, in order: cumcount numbers those rows 0, 1, 2, which maps them to 2020, 2021 and 2022.
df['ys'] = df['goliveyear'].dt.year + df.groupby('Product_ID').cumcount()
df['ye'] = df['ys'] + 1
df['ys'] = pd.to_datetime(df['ys'], format = '%Y')
df['ye'] = pd.to_datetime(df['ye'], format = '%Y')+ timedelta(days=-1)
df['ps'] = df[['goliveyear','ys']].max(axis=1)
df['pe'] = df[['endyear','ye']].min(axis=1)
produces
Product_ID goliveyear endyear Revenue ys ye ps pe
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- -------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00
Then add the months, as lists at first:
df['months'] = df.apply(lambda r: [d.month for d in pd.date_range(r['ps'], r['pe'], freq='MS')], axis=1)
output:
Product_ID goliveyear endyear Revenue ys ye ps pe months
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- ------------------- ---------------------------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00 [10, 11, 12]
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00 [1, 2]
Then we explode months, do the required revenue calculation, and drop the unneeded columns:
df = df.explode('months')
df['Revenue'] = df['Revenue'] / df.groupby(['Product_ID','ys'])['months'].transform('count')
df = df.drop(columns = ['goliveyear','endyear','ye','ps','pe'])
df['ys'] = df['ys'].dt.year
to get
Product_ID Revenue ys months
-- ------------ --------- ---- --------
0 1 30 2020 10
0 1 30 2020 11
0 1 30 2020 12
1 1 11.6667 2021 1
1 1 11.6667 2021 2
1 1 11.6667 2021 3
1 1 11.6667 2021 4
1 1 11.6667 2021 5
1 1 11.6667 2021 6
1 1 11.6667 2021 7
1 1 11.6667 2021 8
1 1 11.6667 2021 9
1 1 11.6667 2021 10
1 1 11.6667 2021 11
1 1 11.6667 2021 12
2 1 30 2022 1
2 1 30 2022 2
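A portability note, assuming you are on pandas 2.2 or later: the delim_whitespace= argument of read_csv is deprecated there, so the setup step reads more portably as:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
Product_ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
"""), sep=r'\s+')  # equivalent to delim_whitespace=True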

Try this:
import pandas as pd
from io import StringIO
s = """
Product ID,goliveyear,endyear,Revenue
1,2020-10,2022-02,90
1,2020-10,2022-02,140
1,2020-10,2022-02,60"""
df = pd.read_csv(StringIO(s))
# generate the list of months between these two dates
df['rng'] = df.apply(lambda x: pd.date_range(x['goliveyear'], x['endyear'],
                                             freq='MS'), axis=1)
# explode the dataframe by months list
df_exploded = df.explode('rng')
df_exploded['Year'] = df_exploded['rng'].dt.year
df_exploded['Month'] = df_exploded['rng'].dt.month
# the (index, year) pairs to filter rows
filter_year = list(zip(df.index, df_exploded.Year.unique()))
# used columns
use_cols = ['Product ID', 'goliveyear', 'endyear', 'Revenue', 'Month']
# filter rows
df_filter = (df_exploded.set_index([df_exploded.index, df_exploded.Year])
                        .loc[filter_year, use_cols]
                        .reset_index()
                        .drop(columns='level_0'))
# spread each year's Revenue evenly across its months
result = df_filter.set_index(['Year', 'Month']).assign(
    Revenue=(df_filter.groupby(['Year', 'Month'])['Revenue'].sum() /
             df_filter.groupby('Year')['Month'].count())
).reset_index()
result
Output
Year Month Product ID goliveyear endyear Revenue
0 2020 10 1 2020-10 2022-02 30.000000
1 2020 11 1 2020-10 2022-02 30.000000
2 2020 12 1 2020-10 2022-02 30.000000
3 2021 1 1 2020-10 2022-02 11.666667
4 2021 2 1 2020-10 2022-02 11.666667
5 2021 3 1 2020-10 2022-02 11.666667
6 2021 4 1 2020-10 2022-02 11.666667
7 2021 5 1 2020-10 2022-02 11.666667
8 2021 6 1 2020-10 2022-02 11.666667
9 2021 7 1 2020-10 2022-02 11.666667
10 2021 8 1 2020-10 2022-02 11.666667
11 2021 9 1 2020-10 2022-02 11.666667
12 2021 10 1 2020-10 2022-02 11.666667
13 2021 11 1 2020-10 2022-02 11.666667
14 2021 12 1 2020-10 2022-02 11.666667
15 2022 1 1 2020-10 2022-02 30.000000
16 2022 2 1 2020-10 2022-02 30.000000
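If you also want the columns in the same order as the desired output in the question, one extra line does it:
result = result[['Product ID', 'goliveyear', 'endyear', 'Year', 'Month', 'Revenue']]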

Related

Group by month from a particular date python

I have a DataFrame of an account statement that contains date, debit, and credit.
Let's say salary gets deposited every 20th of the month.
I want to group the date column from every 20th of each month and find the sum of debits and credits, e.g., 20th Jan to 20th Feb, and so on.
date_parsed Debit Credit
0 2020-05-02 775.0 0.0
1 2020-04-30 209.0 0.0
2 2020-04-24 5000.0 0.0
3 2020-04-24 25000.0 0.0
... ... ... ...
79 2020-04-20 750.0 0.0
80 2020-04-15 5000.0 0.0
81 2020-04-13 0.0 2283.0
82 2020-04-09 0.0 6468.0
83 2020-04-03 0.0 1000.0
I am not sure, but perhaps pd.offsets can be used with groupby.
You could add an extra month column which truncates up or down based on the day of the month. Then it's just a groupby and sum. E.g. month 2020-06 would include dates between 2020-05-20 and 2020-06-19.
import pandas as pd
import numpy as np
df = pd.DataFrame({'date_parsed': ['2020-05-02', '2020-05-03', '2020-05-20', '2020-05-22'],
                   'Credit': [1, 2, 3, 4],
                   'Debit': [5, 6, 7, 8]})
df['date'] = pd.to_datetime(df.date_parsed)
df['month'] = np.where(df.date.dt.day < 20,
                       df.date.dt.to_period('M'),
                       (df.date + pd.DateOffset(months=1)).dt.to_period('M'))
print(df[['month', 'Credit', 'Debit']].groupby('month').sum().reset_index())
Input:
date_parsed Credit Debit
0 2020-05-02 1 5
1 2020-05-03 2 6
2 2020-05-20 3 7
3 2020-05-22 4 8
Result:
month Credit Debit
0 2020-05 3 11
1 2020-06 7 15

How to add 0 in front of week column in dataframe pandas if len of column == 1

Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
How can I add a 0 in front of Week if its length is 1? I need to merge Year and Week together as 201702.
Try this:
df["Week"] = df.Week.astype('str').str.zfill(2)

Pandas count monthly rainy vs not rainy days starting from hourly data

I have a large dataset (here a link to a subset https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) with the tstamp index (2010-01-01 00:00:00) and the mm of rain. Measurements are taken every 5 minutes for many years:
mm
tstamp
2010-01-01 00:00:00 0.0
2010-01-01 00:05:00 0.0
2010-01-01 00:10:00 0.0
2010-01-01 00:15:00 0.0
2010-01-01 00:20:00 0.0
........
What I want to get is the count of rainy days for each month of each year, so ideally a dataframe like the following:
tstamp rainy not rainy
2010-01 11 20
2010-02 20 8
......
2012-10 15 16
2012-11 30 0
What I'm able to obtain is a nested dict object like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, made with this small code snippet:
from collections import defaultdict
d = defaultdict(lambda: defaultdict(dict))
for year in df.index.year.unique():
    try:
        for month in df.index.month.unique():
            a = df['{}-{}'.format(year, month)].resample('D').sum()
            d[year][month]['rainy'] = a[a['mm'] != 0].count()
            d[year][month]['not_rainy'] = a[a['mm'] == 0].count()
    except:
        pass
But I think I'm missing an easier and more straightforward solution. Any suggestion?
One way is to do two groupby operations:
daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
monthly = (daily.groupby(daily.index.to_period('M'))
                .value_counts()
                .unstack())
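The unstacked columns are the booleans False/True; to label them as in the desired output (the names below are my choice) and fill months where every day fell on one side:
monthly = (monthly.rename(columns={False: 'not rainy', True: 'rainy'})
                  .fillna(0)
                  .astype(int))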
You can do this; note that I don't see any non-rainy months:
df = pd.read_csv('rain.csv')
df['tstamp'] = pd.to_datetime(df['tstamp'])
df['month'] = df['tstamp'].dt.month
df['year'] = df['tstamp'].dt.year
df = df.groupby(by=['year', 'month'], as_index=False).sum()
print(df)
Output:
year month mm
0 2010 1 1.0
1 2010 2 15.4
2 2010 3 21.8
3 2010 4 9.6
4 2010 5 118.4
5 2010 6 82.8
6 2010 7 96.0
7 2010 8 161.6
8 2010 9 109.2
9 2010 10 51.2
10 2010 11 52.4
11 2010 12 39.6
12 2011 1 5.6
13 2011 2 0.8
14 2011 3 13.4
15 2011 4 1.8
16 2011 5 97.6
17 2011 6 167.8
18 2011 7 128.8
19 2011 8 67.6
20 2011 9 155.8
21 2011 10 71.6
22 2011 11 0.4
23 2011 12 29.4
24 2012 1 17.6
25 2012 2 2.2
26 2012 3 13.0
27 2012 4 55.8
28 2012 5 36.8
29 2012 6 108.4
30 2012 7 182.4
31 2012 8 191.8
32 2012 9 89.0
33 2012 10 93.6
34 2012 11 161.2
35 2012 12 26.4

Calculate mean based on time elapsed in Pandas

I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to get the group key 'Time Elapsed', then just groupby that key to get the mean.
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
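The resulting index is a TimedeltaIndex; if you want it as elapsed minutes (0.0, 5.0, 10.0) to match the desired output, a small conversion, assuming the same df as above:
out = df.groupby('Time Elapsed').Value.mean()
out.index = out.index.total_seconds() / 60  # timedeltas -> float minutes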
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First get the year, month, and day for each DateTime, since they all change in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post):
The counter is computed within (1) each year, (2) then each month, and then (3) each day.
Since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers).
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
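If you want the elapsed time to start at 0 minutes rather than 5 (matching the desired output exactly), drop the + 1 and fold the scaling into one step:
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() * 5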

Aggregate 15min-based timestamps to hours and find sum, avg and max for multiple columns in pandas

I have a dataframe with period_start_time every 15 minutes, and I need to aggregate to 1 hour and calculate the sum, avg, and max for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And the same for val2 and every other column in the dataframe.
I have no idea how to group by period start time for every hour (not for the whole day), and no idea how to start.
I believe you need Series.dt.floor for hours and then to aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
# flatten the MultiIndex columns
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly, you can convert PERIOD_START_TIME to a pandas Period:
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
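On pandas 0.25+, named aggregation gets you flat, readable column names in one step; a sketch for a couple of the columns (the output names are my choice):
out = (df.groupby([df['PERIOD_START_TIME'].dt.floor('H'), 'ID'])
         .agg(val1_mean=('val1', 'mean'),
              val1_sum=('val1', 'sum'),
              val1_max=('val1', 'max'),
              val2_mean=('val2', 'mean'))
         .reset_index())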
