How do I calculate week over week changes in Pandas? - python

I have the following df of values for various slices across time:
         date  A   B   C
0  2016-01-01  5   7   2
1  2016-01-02  6  12  15
...
2  2016-01-08  9   5  16
...
3  2016-12-24  5  11  13
4  2016-12-31  3  52  22
I would like to create a new dataframe that calculates the week-over-week change in each slice, by date. For example, I want the new table to be blank for all slices from Jan 1 - Jan 7. I want the value for Jan 8 to be the Jan 8 value for the given slice minus the Jan 1 value of that slice. I then want the value for Jan 9 to be the Jan 9 value for the given slice minus the Jan 2 value of that slice. So on and so forth, all the way down.
The example table would look like this:
         date  A   B   C
0  2016-01-01  0   0   0
1  2016-01-02  0   0   0
...
2  2016-01-08  4  -2  14
...
3  2016-12-24  4  12   2
4  2016-12-31 -2  41   9
You may assume the offset is ALWAYS 7. In other words, there are no missing dates.

@Unatiel's answer is correct in this case, where there are no missing dates, and should be accepted.
But I wanted to post a modification here for cases with missing dates, for anyone interested. From the docs:
The shift method accepts a freq argument which can accept a DateOffset class or other timedelta-like object or also an offset alias
from pandas.tseries.offsets import Week

# Shifting with freq=Week() moves the DatetimeIndex forward 7 calendar days
# instead of shifting rows; reindexing back onto the original index leaves
# NaN wherever the date one week earlier is missing, which fillna turns into 0.
res = ((df - df.shift(1, freq=Week()).reindex(df.index))
       .fillna(value=0)
       .astype(int))
print(res)
A B
date
2016-01-01 0 0
2016-01-02 0 0
2016-01-03 0 0
2016-01-04 0 0
2016-01-05 0 0
2016-01-06 0 0
2016-01-07 0 0
2016-01-08 31 46
2016-01-09 4 20
2016-01-10 -51 -65
2016-01-11 56 5
2016-01-12 -51 24
.. ..
2016-01-20 34 -30
2016-01-21 -28 19
2016-01-22 24 8
2016-01-23 -28 -46
2016-01-24 -11 -60
2016-01-25 -34 -7
2016-01-26 -12 -28
2016-01-27 -41 42
2016-01-28 -2 48
2016-01-29 35 -51
2016-01-30 -8 62
2016-01-31 -6 -9
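To see why the freq-based shift matters when dates are missing, here is a small illustrative sketch of my own (a deliberately dropped date and placeholder values, not the data above):
import numpy as np
import pandas as pd

# Two weeks of daily data with 2016-01-03 removed to simulate a gap.
idx = pd.date_range('2016-01-01', periods=14, freq='D').drop(pd.Timestamp('2016-01-03'))
df = pd.DataFrame({'A': np.arange(len(idx)), 'B': np.arange(len(idx)) * 2}, index=idx)

# shift(7) moves by 7 rows, so the gap misaligns every later week, while
# shift(1, freq='7D') moves the index by 7 calendar days, so each date is
# always compared with exactly one week earlier (NaN if that date is missing).
by_rows = df - df.shift(7)
by_calendar = df - df.shift(1, freq='7D').reindex(df.index)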

If we know the offset is always 7 then use shift(); here is a quick example showing how it works:
import pandas
df = pandas.DataFrame({'x': range(30)})
df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 0.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
...
So with this you can do:
df - df.shift(7)
x
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 7.0
8 7.0
...
In your case, don't forget to set_index('date') first.
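As a usage sketch against data shaped like the question's (the values below are random placeholders, not the question's numbers, and daily data with no gaps is assumed):
import numpy as np
import pandas as pd

# Four weeks of daily data in the same layout as the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({'date': pd.date_range('2016-01-01', periods=28, freq='D'),
                   'A': rng.integers(0, 20, 28),
                   'B': rng.integers(0, 20, 28),
                   'C': rng.integers(0, 20, 28)})

# Index by date, subtract the row 7 positions back (exactly one week
# earlier when no dates are missing), then zero out the first week.
indexed = df.set_index('date')
wow = (indexed - indexed.shift(7)).fillna(0).astype(int)
print(wow.head(10))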

Related

Add column in dataframe from another dataframe matching the id and based on condition in date columns pandas

My problem is a very complex and confusing one; I haven't been able to find the answer anywhere.
I basically have 2 dataframes: one is the price history of certain products and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add a price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates so that updated date <= transaction date for that particular record, i.e. finding the most recent price update on or before the transaction (the MAX updated date that is <= transaction date)
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but need to incorporate the date condition now.
Expected result for sample data: (shown as an image in the original post)
Any guidance in the right direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
    invoice,
    price_history,
    left_on="transaction_date",
    right_on="updated",
    by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
merge_asof with the direction argument:
merged = pd.merge_asof(
    left=invoice,
    right=price_history,
    left_on="transaction_date",
    right_on="updated",
    by="product_id",
    direction="backward",
    suffixes=("", "_y"),
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7
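One caveat worth adding to both answers: merge_asof requires each frame to be sorted by its key column, and the keys need to be datetime-like rather than strings. A minimal preparation sketch, assuming the column names from the question:
import pandas as pd

# Parse the date columns and sort by the merge keys, since merge_asof
# raises if either side is unsorted on its key.
invoice["transaction_date"] = pd.to_datetime(invoice["transaction_date"])
price_history["updated"] = pd.to_datetime(price_history["updated"])
invoice = invoice.sort_values("transaction_date")
price_history = price_history.sort_values("updated")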

Pandas Dataframe days difference with zeros and null values: TypeError: unsupported operand type(s) for +: 'Timedelta' and 'int'

I have a DataFrame with dates. I simply want to fill a new column with the difference of the maximum end and minimum start dates, i.e. the length in days.
My calculation works, but if either column contains zeros or NaN values it gives me this error.
Can anyone look at the code and give a suggestion?
Thanks in advance.
# here is the Dataframe
end_d start_d
0 2021-09-11 00:00:00 2021-08-01 00:00:00
1 2021-08-29 00:00:00 2021-05-23 00:00:00
2 2021-09-04 00:00:00 2021-06-13 00:00:00
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 NaN NaN
9 NaN NaN
10 2021-09-04 00:00:00 2021-06-13 00:00:00
11
12
13
# When I use the code below and there aren't any zeros or NaN values, it works fine.
dsx['length'] = (dsx['end_d'] - dsx['start_d'] + pd.Timedelta(1, unit=freq)).max()
# I want something like the DataFrame below. Any suggestions?
end_d start_d length
0 2021-09-11 00:00:00 2021-08-01 00:00:00 99 days
1 2021-08-29 00:00:00 2021-05-23 00:00:00 99 days
2 2021-09-04 00:00:00 2021-06-13 00:00:00 99 days
3 0 0 99 days
4 0 0 99 days
5 0 0 99 days
6 0 0 99 days
7 0 0 99 days
8 NaN NaN 99 days
9 NaN NaN 99 days
10 2021-09-04 00:00:00 2021-06-13 00:00:00 99 days
You can filter the dataframe down to non-N/A values with dropna.
pandas.DataFrame.dropna
filtered_df = df.dropna()
# freq is whatever unit you were already using elsewhere, e.g. 'd'
df['length'] = (filtered_df['end_d'] - filtered_df['start_d'] + pd.Timedelta(1, unit=freq)).max()
That takes care of your N/A problem, but you still have an issue where your columns hold mixed data types (int and datetime). Not sure what's going on there, but you need to fix that too.
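For the mixed-type issue, one possible fix (a sketch of my own, assuming the 0s and blanks should simply be treated as missing dates, and using unit='d' in place of the undefined freq) is to coerce both columns to datetime first:
import pandas as pd

# Coerce anything that is not a real date (the 0s and empty cells) to NaT
# so the subtraction happens on proper datetime columns.
end = pd.to_datetime(dsx['end_d'], errors='coerce')
start = pd.to_datetime(dsx['start_d'], errors='coerce')

# Maximum span in days over the valid rows, broadcast to every row
# as in the desired output (NaT rows are skipped by max()).
dsx['length'] = (end - start + pd.Timedelta(1, unit='d')).max()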

How to join a table with each group of a dataframe in pandas

I have a dataframe like below. Each date is Monday of each week.
df = pd.DataFrame({'date': ['2020-04-20', '2020-05-11', '2020-05-18',
                            '2020-04-20', '2020-04-27', '2020-05-04', '2020-05-18'],
                   'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range. Both of them have some missing dates, which means the count for that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the missing name and count values with 0's. Does anyone know how to achieve that? How can I join another table with a grouped table?
I tried and no luck...
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can reindex after creating a MultiIndex from the product of all dates and names:
idx = pd.MultiIndex.from_product([df_dates.date, df.name.unique()], names=['date', 'name'])
s = df.set_index(['date', 'name']).reindex(idx, fill_value=0).reset_index().sort_values('name')
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or
s = df.pivot(index='date', columns='name', values='count').reindex(df_dates.date).fillna(0).reset_index().melt('date')
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
If you are just looking to fill in the union of dates already present in df, you can do:
(df.set_index(['date', 'name'])
   .unstack('date', fill_value=0)
   .stack()
   .reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122
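Since the question asks specifically about a join, here is an alternative sketch of my own using a cross join of the dates with the names followed by a left merge (how='cross' requires pandas >= 1.2):
# Build every (date, name) combination, left-join the real counts,
# and fill the weeks that had no rows with 0.
full = (df_dates.merge(pd.DataFrame({'name': df['name'].unique()}), how='cross')
                .merge(df, on=['date', 'name'], how='left')
                .fillna({'count': 0})
                .astype({'count': int})
                .sort_values(['name', 'date'])
                .reset_index(drop=True))
print(full)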

Upsample each pandas dateindex row including previous dates within group

My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe in such a way that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 are your original index, you should reset the index, set Date as the index, resample, and then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, you should convert it to datetime before resampling on that field. You can use the following for that:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
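Note that the resample/ffill above only fills calendar gaps; it does not by itself produce the cumulative DateGroup expansion shown in the question. One possible sketch of that expansion (my own approach, assuming ID1 and ID2 are regular columns rather than the index; it loops over the dates within each group rather than over individual rows, so memory may still be a concern at ~20 million rows):
import pandas as pd

def expand_group(g):
    # For each date in the group, take every row up to and including
    # that date and label the chunk with it as 'DateGroup'.
    g = g.sort_values('Date')
    chunks = []
    for i, d in enumerate(g['Date']):
        chunk = g.iloc[:i + 1].copy()
        chunk['DateGroup'] = d
        chunks.append(chunk)
    return pd.concat(chunks)

out = (df.groupby(['ID1', 'ID2'], group_keys=False)
         .apply(expand_group)
         .reset_index(drop=True)
         [['ID1', 'ID2', 'DateGroup', 'Date', 'Values']])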

Aggregate to 15min based timestamp to hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with PERIOD_START_TIME at every 15 minutes, and I need to aggregate to 1 hour and calculate the sum, avg and max for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
The same goes for column val2 and every other column in the dataframe.
I have no idea how to group by period start time for every hour rather than for the whole day, and no idea how to start.
I believe you need Series.dt.floor to truncate to hours and then aggregate with agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
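Both approaches assume PERIOD_START_TIME is already a datetime column. If it is still a string in the 'MM.DD.YYYY HH:MM:SS' form shown in the question, convert it first; a small sketch (the format string is inferred from the sample):
import pandas as pd

# Parse the 'MM.DD.YYYY HH:MM:SS' strings into timestamps so that
# .dt.floor('H') / .dt.to_period('H') work.
df['PERIOD_START_TIME'] = pd.to_datetime(df['PERIOD_START_TIME'],
                                         format='%m.%d.%Y %H:%M:%S')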
