Pandas resample time series into 24 hours - python

I have the data like this:
OwnerUserId Score
CreationDate
2015-01-01 00:16:46.963 1491895.0 0.0
2015-01-01 00:23:35.983 1491895.0 1.0
2015-01-01 00:30:55.683 1491895.0 1.0
2015-01-01 01:10:43.830 2141635.0 0.0
2015-01-01 01:11:08.927 1491895.0 1.0
2015-01-01 01:12:34.273 3297613.0 1.0
..........
This is a whole year of data with different users' scores. I hope to get data like:
OwnerUserId 1491895.0 1491895.0 1491895.0 2141635.0 1491895.0
00:00 0.0 3.0 0.0 3.0 5.8
00:01 5.0 3.0 0.0 3.0 5.8
00:02 3.0 33.0 20.0 3.0 5.8
......
23:40 12.0 33.0 10.0 3.0 5.8
23:41 32.0 33.0 20.0 3.0 5.8
23:42 12.0 13.0 10.0 3.0 5.8
Each element of the dataframe is the score (mean or sum).
I have tried the following:
pd.pivot_table(data_series.reset_index(), index=['CreationDate'], columns=['OwnerUserId'],
               fill_value=0).resample('W').sum()['Score']
and get a result like the one in the image.

I think you need:
# remove the `[]` and pass values= to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
                    index='CreationDate',
                    columns='OwnerUserId',
                    values='Score',
                    fill_value=0)
# truncate the seconds and convert to a TimedeltaIndex
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))
# or round to whole minutes instead
#df.index = pd.to_timedelta(df.index.round('T').strftime('%H:%M:%S'))
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:16:00 0 0 0
00:23:00 1 0 0
00:30:00 1 0 0
01:10:00 0 0 0
01:11:00 1 0 0
01:12:00 0 0 1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
# resample by minute, aggregate with sum; use reindex to add the missing rows
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:00:00 0.0 0.0 0.0
00:01:00 0.0 0.0 0.0
00:02:00 0.0 0.0 0.0
00:03:00 0.0 0.0 0.0
00:04:00 0.0 0.0 0.0
00:05:00 0.0 0.0 0.0
00:06:00 0.0 0.0 0.0
...
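For reference, here is the whole pipeline as one self-contained sketch, assuming a small sample shaped like the question's data (the sample frame and its values are illustrative):
import pandas as pd

# sample data shaped like the question's: Score per OwnerUserId, indexed by CreationDate
data_series = pd.DataFrame(
    {'OwnerUserId': [1491895.0, 1491895.0, 2141635.0, 3297613.0],
     'Score': [0.0, 1.0, 0.0, 1.0]},
    index=pd.DatetimeIndex(['2015-01-01 00:16:46.963', '2015-01-01 00:23:35.983',
                            '2015-01-01 01:10:43.830', '2015-01-01 01:12:34.273'],
                           name='CreationDate'))

# users as columns, scores as values
df = pd.pivot_table(data_series.reset_index(), index='CreationDate',
                    columns='OwnerUserId', values='Score', fill_value=0)

# keep only the time of day, truncated to whole minutes
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))

# one row per minute of the day; empty minutes become 0
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print(df.shape)   # (1440, 3): 24 h * 60 min rows, one column per user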

Related

How can I fill in missing dates in a dataframe?

I have a dataframe df_counts that contains the number of events that happen on a given day.
My goal is to fill in all the missing dates, and assign them a count of 0.
date count
0 2012-03-14 8
1 2012-03-19 1
2 2012-04-07 3
3 2012-04-10 1
4 2012-04-19 5
Desired output:
date count
0 2012-03-14 8
1 2012-03-15 0
2 2012-03-16 0
3 2012-03-17 0
4 2012-03-18 0
5 2012-03-19 0
6 2012-03-20 0
7 2012-03-21 0
8 2012-03-22 0
9 2012-03-23 0
...
Links I've read through:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
https://datatofish.com/pandas-dataframe-to-series/
Add missing dates to pandas dataframe
What I've tried:
idx = pd.date_range('2012-03-06', '2022-12-05')
s = df_counts.sequeeze
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value = 0)
s.head()
Output of what I've tried:
AttributeError: 'DataFrame' object has no attribute 'sequeeze'
You can achieve this using the asfreq() function.
# convert date column to datetime
df.date = pd.to_datetime(df.date)
# make date the index (drop=True removes the original date column)
df = df.set_index("date", drop=True)
# daily freq
df = df.asfreq("D", fill_value=0)
Some notes on your code:
squeeze() is a method, so it has to be called with parentheses, and note the spelling: squeeze, not sequeeze. That is why you get the AttributeError.
You can use the date column directly as the index with the set_index() function.
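For completeness, your own reindex attempt also works once squeeze is spelled and called correctly; a minimal sketch, assuming the small sample above and an illustrative date range:
import pandas as pd

df_counts = pd.DataFrame({'date': ['2012-03-14', '2012-03-19', '2012-04-07',
                                   '2012-04-10', '2012-04-19'],
                          'count': [8, 1, 3, 1, 5]})

idx = pd.date_range('2012-03-14', '2012-04-19')     # full daily range
s = df_counts.set_index('date').squeeze()           # squeeze(), not .sequeeze
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)                    # missing days become 0
print(s.head())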
You can use resample:
new_df = df.set_index('date').resample('1D').agg('first').fillna(0)
Output:
'''
date count
2012-03-14 8.0
2012-03-15 0.0
2012-03-16 0.0
2012-03-17 0.0
2012-03-18 0.0
2012-03-19 1.0
2012-03-20 0.0
2012-03-21 0.0
2012-03-22 0.0
2012-03-23 0.0
2012-03-24 0.0
2012-03-25 0.0
2012-03-26 0.0
2012-03-27 0.0
2012-03-28 0.0
2012-03-29 0.0
2012-03-30 0.0
2012-03-31 0.0
2012-04-01 0.0
2012-04-02 0.0
2012-04-03 0.0
2012-04-04 0.0
2012-04-05 0.0
2012-04-06 0.0
2012-04-07 3.0
2012-04-08 0.0
2012-04-09 0.0
2012-04-10 1.0
2012-04-11 0.0
2012-04-12 0.0
2012-04-13 0.0
2012-04-14 0.0
2012-04-15 0.0
2012-04-16 0.0
2012-04-17 0.0
2012-04-18 0.0
2012-04-19 5.0
'''
My favorite is fillna; you can use it on the whole df or on a single column (remember to assign the result back):
df["column"] = df["column"].fillna(0)

Convert summary data to daily time series data (pandas)

I have a dataset which is a time series. It has several regions at once, here is a small example:
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR
71541 2022-07-19 19170.0 555.0 18484.0 YEV
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB
For three of these columns, I want three new columns showing how many new cases were added each day:
date confirmed deaths recovered region_code daily_confirmed daily_deaths daily_recovered
0 2020-03-27 3.0 0.0 0.0 ARK 3.0 0.0 0.0
1 2020-03-27 4.0 0.0 0.0 BA 4.0 0.0 0.0
2 2020-03-27 1.0 0.0 0.0 BEL 1.0 0.0 0.0
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR 32.0 16.0 8.0
71541 2022-07-19 19170.0 555.0 18484.0 YEV 6.0 1.0 1.0
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB 1.0 8.0 9.0
That is, for each region I need the difference between the current day and the previous day, to see how many new cases occurred.
The problem is that I don't know how to do this correctly. Since there are no missing dates in the data, something like df['daily_cases'] = df['confirmed'] - df['confirmed'].shift(fill_value=0) would work, but there are many different regions here, so everything first needs to be grouped correctly somehow ... Any ideas how to do this?
Use DataFrameGroupBy.diff, replace the first missing values with the original columns, add a prefix to the column names, and cast to integers if necessary:
print (df)
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
3 2020-03-28 4.0 0.0 4.0 ARK
4 2020-03-28 6.0 0.0 0.0 BA
5 2020-03-28 1.0 0.0 0.0 BEL
6 2020-03-29 6.0 0.0 10.0 ARK
7 2020-03-29 8.0 0.0 0.0 BA
8 2020-03-29 5.0 0.0 0.0 BEL
cols = ['confirmed', 'deaths', 'recovered']
df1 = (df.groupby('region_code')[cols]
         .diff()
         .fillna(df[cols])
         .add_prefix('daily_')
         .astype(int))
print (df1)
daily_confirmed daily_deaths daily_recovered
0 3 0 0
1 4 0 0
2 1 0 0
3 1 0 4
4 2 0 0
5 0 0 0
6 2 0 6
7 2 0 0
8 4 0 0
Finally, join back to the original:
df = df.join(df1)
print (df)
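For copy-paste convenience, the same steps as one self-contained script on the small sample printed above:
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-03-27'] * 3 + ['2020-03-28'] * 3 + ['2020-03-29'] * 3,
    'confirmed': [3.0, 4.0, 1.0, 4.0, 6.0, 1.0, 6.0, 8.0, 5.0],
    'deaths': [0.0] * 9,
    'recovered': [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 10.0, 0.0, 0.0],
    'region_code': ['ARK', 'BA', 'BEL'] * 3,
})

cols = ['confirmed', 'deaths', 'recovered']
# per-region day-over-day difference; each region's first row keeps its own value
df1 = (df.groupby('region_code')[cols]
         .diff()
         .fillna(df[cols])
         .add_prefix('daily_')
         .astype(int))
print(df.join(df1))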

resampling a pandas dataframe from almost-weekly to daily

What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
a
2020-12-08 0
2020-12-20 12
2020-12-27 19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
a
2020-12-08 0
2020-12-09 1
...
2020-12-19 11
2020-12-20 12
2020-12-21 13
...
2020-12-27 19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to give clarity of the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
a
2020-12-08 NaN
2020-12-09 1.0
2020-12-10 1.0
...
2020-12-19 1.0
2020-12-20 1.0
2020-12-21 1.0
...
2020-12-27 1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
first_dose second_dose
date areaCode
2020-12-08 E92000001 0.0 0.0
N92000002 0.0 0.0
S92000003 0.0 0.0
W92000004 0.0 0.0
2020-12-20 E92000001 574829.0 0.0
N92000002 16068.0 0.0
S92000003 60333.0 0.0
W92000004 24056.0 0.0
2020-12-27 E92000001 267809.0 0.0
N92000002 14948.0 0.0
S92000003 34535.0 0.0
W92000004 12495.0 0.0
2021-01-03 E92000001 330037.0 20660.0
N92000002 9669.0 1271.0
S92000003 21446.0 44.0
W92000004 14205.0 27.0
I think you need:
df = (df.reset_index('areaCode')
        .groupby('areaCode')[['first_dose', 'second_dose']]
        .resample('D')
        .interpolate())
print (df)
first_dose second_dose
areaCode date
E92000001 2020-12-08 0.000000 0.000000
2020-12-09 47902.416667 0.000000
2020-12-10 95804.833333 0.000000
2020-12-11 143707.250000 0.000000
2020-12-12 191609.666667 0.000000
... ...
W92000004 2020-12-30 13227.857143 11.571429
2020-12-31 13472.142857 15.428571
2021-01-01 13716.428571 19.285714
2021-01-02 13960.714286 23.142857
2021-01-03 14205.000000 27.000000
[108 rows x 2 columns]
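For the simple single-column uneven frame at the top of the question, the same idea reduces to a one-liner; a quick sketch:
import pandas as pd

uneven = pd.DataFrame({'a': [0, 12, 19]},
                      index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))

daily = uneven.resample('D').interpolate()    # upsample to days, fill linearly
print(daily.diff()['a'].dropna().unique())    # [1.] -> constant +1 per day, as required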

using date_range to convert month-end data to weekly data in pandas

I have a dataframe that looks like the following. It is month end data.
date , value , expectation
31/01/2020, 34, 40
28/02/2020, 35, 38
31/03/2020, 40, 44
What I need:
date , value , expectation
07/01/2020, 0, 0
14/01/2020, 0, 0
21/01/2020, 0, 0
28/01/2020, 0, 0
04/02/2020, 34, 40
11/02/2020, 0, 0
18/02/2020, 0, 0
25/02/2020, 0, 0
04/03/2020, 35, 38
Basically, I am trying to convert month-end data to weekly data. The twist is that the exact month-end date may not match the weekly date range, so its value should land on the following end-of-week date (e.g., 04/02/2020 for 31/01/2020). The other end-of-week dates are filled with 0. It sounds messy, but this is what I have tried.
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W')
empty = pd.DataFrame(index=dtr)
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
The code works but I do not get the exact expected output. Any help is appreciated.
Use merge_asof:
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W')
empty = pd.DataFrame(index=dtr)
df = pd.merge_asof(empty, df,
                   left_index=True,
                   right_index=True,
                   tolerance=pd.Timedelta(7, 'd')).fillna(0)
print (df)
value expectation
2020-01-05 0.0 0.0
2020-01-12 0.0 0.0
2020-01-19 0.0 0.0
2020-01-26 0.0 0.0
2020-02-02 34.0 40.0
2020-02-09 0.0 0.0
2020-02-16 0.0 0.0
2020-02-23 0.0 0.0
2020-03-01 35.0 38.0
2020-03-08 0.0 0.0
2020-03-15 0.0 0.0
2020-03-22 0.0 0.0
2020-03-29 0.0 0.0
If you also need to change the start of the weeks, e.g. to Tuesdays, change freq in date_range:
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W-TUE')
empty = pd.DataFrame(index=dtr)
df = pd.merge_asof(empty, df,
                   left_index=True,
                   right_index=True,
                   tolerance=pd.Timedelta(7, 'd')).fillna(0)
print (df)
value expectation
2020-01-07 0.0 0.0
2020-01-14 0.0 0.0
2020-01-21 0.0 0.0
2020-01-28 0.0 0.0
2020-02-04 34.0 40.0
2020-02-11 0.0 0.0
2020-02-18 0.0 0.0
2020-02-25 0.0 0.0
2020-03-03 35.0 38.0
2020-03-10 0.0 0.0
2020-03-17 0.0 0.0
2020-03-24 0.0 0.0
2020-03-31 40.0 44.0
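A self-contained version of the merge_asof approach, assuming the three-row sample from the question:
import pandas as pd
from io import StringIO

csv = '''date,value,expectation
31/01/2020,34,40
28/02/2020,35,38
31/03/2020,40,44'''

df = pd.read_csv(StringIO(csv), index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')

dtr = pd.date_range('01.01.2020', '31.03.2020', freq='W-TUE')
empty = pd.DataFrame(index=dtr)

# for each Tuesday take the most recent month-end within the past 7 days, else 0
df = pd.merge_asof(empty, df, left_index=True, right_index=True,
                   tolerance=pd.Timedelta(7, 'd')).fillna(0)
print(df)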
The piece of code given below sketches the same idea; as written it only computes and prints the shifted weekly dates, so you would still need to place the values yourself (it also assumes date is a column, not the index):
for end_date in df["date"]:
    # offset between the month-end and the nearest preceding week boundary
    days_diff = end_date - pd.date_range(end=end_date, freq='W', periods=5)[-1]
    print(pd.date_range(end='2020-03-31', freq='W', periods=5) + days_diff)

Add missing datetime columns to grouped dataframe

Is it possible to add the missing date columns from the created date_range to the grouped dataframe df without a for loop, filling the missing values with zeros?
date_range has 7 dates, while df has only 4 date columns. So how do I add the 3 missing columns to df?
import pandas as pd
from datetime import datetime
start = datetime(2018, 6, 4)
end = datetime(2018, 6, 10)
date_range = pd.date_range(start=start, end=end, freq='D')
DatetimeIndex(['2018-06-04', '2018-06-05', '2018-06-06', '2018-06-07',
'2018-06-08', '2018-06-09', '2018-06-10'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({
    'date': ['2018-06-07', '2018-06-10', '2018-06-09', '2018-06-09',
             '2018-06-08', '2018-06-09', '2018-06-08', '2018-06-10',
             '2018-06-10', '2018-06-10'],
    'name': ['sogan', 'lyam', 'alex', 'alex',
             'kovar', 'kovar', 'kovar', 'yamo', 'yamo', 'yamo'],
})
df['date'] = pd.to_datetime(df['date'])
# note: ['date',] indexes with a tuple and fails in modern pandas; use [['date']]
df = (df
      .groupby(['name', 'date'])[['date']]
      .count()
      .unstack(fill_value=0)
      )
df
date date date date
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex 0 0 2 0
kovar 0 2 1 0
lyam 0 0 0 1
sogan 1 0 0 0
yamo 0 0 0 3
I would pivot the table to turn the date columns into rows, then use pandas' .asfreq function, as below:
DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
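A sketch of what that suggests, assuming the raw df from the question (before it is reassigned by the groupby): pivot so the dates become the index, then asfreq. Note that asfreq only fills gaps between the first and last existing dates, so it cannot extend the index back to 2018-06-04 on its own, which is why the answer below switches to reindex:
# dates as rows, one column per name
counts = df.groupby(['date', 'name']).size().unstack(fill_value=0)

# fill any missing days *inside* the existing range with zeros;
# asfreq never adds dates before 2018-06-07 or after 2018-06-10
counts = counts.asfreq('D', fill_value=0)
print(counts.T)   # back to dates-as-columns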
Thanks to Sina Shabani for the clue about turning the date columns into rows. In this situation, setting the date as the index and using .reindex turned out to be more suitable:
df = (df.groupby(['date', 'name'])['name']
        .size()
        .reset_index(name='count')
        .pivot(index='date', columns='name', values='count')
        .fillna(0))
df
name alex kovar lyam sogan yamo
date
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.index = pd.DatetimeIndex(df.index)
df = (df.reindex(pd.date_range(start, freq='D', periods=7), fill_value=0)
        .sort_index())
df
name alex kovar lyam sogan yamo
2018-06-04 0.0 0.0 0.0 0.0 0.0
2018-06-05 0.0 0.0 0.0 0.0 0.0
2018-06-06 0.0 0.0 0.0 0.0 0.0
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.T
date   2018-06-04 00:00:00  2018-06-05 00:00:00  2018-06-06 00:00:00  2018-06-07 00:00:00  2018-06-08 00:00:00  2018-06-09 00:00:00  2018-06-10 00:00:00
name
alex                   0.0                  0.0                  0.0                  0.0                  0.0                  2.0                  0.0
kovar                  0.0                  0.0                  0.0                  0.0                  2.0                  1.0                  0.0
lyam                   0.0                  0.0                  0.0                  0.0                  0.0                  0.0                  1.0
sogan                  0.0                  0.0                  0.0                  1.0                  0.0                  0.0                  0.0
yamo                   0.0                  0.0                  0.0                  0.0                  0.0                  0.0                  3.0
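An equivalent, slightly shorter sketch that skips the double transpose: keep the names-by-dates layout from the start and reindex the columns directly against date_range (again assuming the raw df and the date_range from the top of the question):
# names as rows, dates as columns in one step
counts = df.groupby(['name', 'date']).size().unstack(fill_value=0)

# add the three missing date columns as zeros
counts = counts.reindex(columns=date_range, fill_value=0)
print(counts)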
