resampling a pandas dataframe from almost-weekly to daily - python

What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
             a
2020-12-08   0
2020-12-20  12
2020-12-27  19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
             a
2020-12-08   0
2020-12-09   1
...
2020-12-19  11
2020-12-20  12
2020-12-21  13
...
2020-12-27  19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to clarify the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
              a
2020-12-08  NaN
2020-12-09  1.0
2020-12-10  1.0
...
2020-12-19  1.0
2020-12-20  1.0
2020-12-21  1.0
...
2020-12-27  1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
                      first_dose  second_dose
date       areaCode
2020-12-08 E92000001         0.0          0.0
           N92000002         0.0          0.0
           S92000003         0.0          0.0
           W92000004         0.0          0.0
2020-12-20 E92000001    574829.0          0.0
           N92000002     16068.0          0.0
           S92000003     60333.0          0.0
           W92000004     24056.0          0.0
2020-12-27 E92000001    267809.0          0.0
           N92000002     14948.0          0.0
           S92000003     34535.0          0.0
           W92000004     12495.0          0.0
2021-01-03 E92000001    330037.0      20660.0
           N92000002      9669.0       1271.0
           S92000003     21446.0         44.0
           W92000004     14205.0         27.0

I think you need:
df = (df.reset_index('areaCode')
        .groupby('areaCode')[['first_dose', 'second_dose']]
        .resample('D')
        .interpolate())
print(df)
                         first_dose  second_dose
areaCode  date
E92000001 2020-12-08       0.000000     0.000000
          2020-12-09   47902.416667     0.000000
          2020-12-10   95804.833333     0.000000
          2020-12-11  143707.250000     0.000000
          2020-12-12  191609.666667     0.000000
...                             ...          ...
W92000004 2020-12-30   13227.857143    11.571429
          2020-12-31   13472.142857    15.428571
          2021-01-01   13716.428571    19.285714
          2021-01-02   13960.714286    23.142857
          2021-01-03   14205.000000    27.000000

[108 rows x 2 columns]
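For the single-column frame at the top of the question, the same idea reduces to a one-liner; a minimal sketch (interpolate defaults to linear, which reproduces the daily +1 steps):

import pandas as pd

uneven = pd.DataFrame({'a': [0, 12, 19]},
                      index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))

# Upsample to a daily grid, then linearly interpolate the gaps;
# values come back as floats (0.0, 1.0, ..., 19.0).
daily = uneven.resample('D').interpolate()
print(daily.head())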

Related

Changing year of DatetimeIndex in Pandas

I have a timeseries with data related to the irradiance of the sun. I have data for every hour during a year, but every month has data from a different year. For example, the data taken in March can be from 2012 and the data taken in January can be from 2014.
                       T2m     RH  G(h)  Gb(n)  Gd(h)   IR(h)  WS10m  WD10m        SP  Hour  Month
time(UTC)
2012-01-01 00:00:00  16.00  81.66   0.0   -0.0    0.0  310.15   2.56  284.0  102252.0     0      1
2012-01-01 01:00:00  15.97  82.42   0.0   -0.0    0.0  310.61   2.49  281.0  102228.0     1      1
2012-01-01 02:00:00  15.93  83.18   0.0   -0.0    0.0  311.06   2.41  278.0  102205.0     2      1
2012-01-01 03:00:00  15.89  83.94   0.0   -0.0    0.0  311.52   2.34  281.0  102218.0     3      1
2012-01-01 04:00:00  15.85  84.70   0.0   -0.0    0.0  311.97   2.26  284.0  102232.0     4      1
...                    ...    ...   ...    ...    ...     ...    ...    ...       ...   ...    ...
2011-12-31 19:00:00  16.19  77.86   0.0   -0.0    0.0  307.88   2.94  301.0  102278.0    19     12
2011-12-31 20:00:00  16.15  78.62   0.0   -0.0    0.0  308.33   2.86  302.0  102295.0    20     12
2011-12-31 21:00:00  16.11  79.38   0.0   -0.0    0.0  308.79   2.79  297.0  102288.0    21     12
2011-12-31 22:00:00  16.08  80.14   0.0   -0.0    0.0  309.24   2.71  292.0  102282.0    22     12
2011-12-31 23:00:00  16.04  80.90   0.0   -0.0    0.0  309.70   2.64  287.0  102275.0    23     12
My question is: is there a way I can set all the data to a certain year?
For example, set all data to 2014:
                       T2m     RH  G(h)  Gb(n)  Gd(h)   IR(h)  WS10m  WD10m        SP  Hour  Month
time(UTC)
2014-01-01 00:00:00  16.00  81.66   0.0   -0.0    0.0  310.15   2.56  284.0  102252.0     0      1
2014-01-01 01:00:00  15.97  82.42   0.0   -0.0    0.0  310.61   2.49  281.0  102228.0     1      1
2014-01-01 02:00:00  15.93  83.18   0.0   -0.0    0.0  311.06   2.41  278.0  102205.0     2      1
2014-01-01 03:00:00  15.89  83.94   0.0   -0.0    0.0  311.52   2.34  281.0  102218.0     3      1
2014-01-01 04:00:00  15.85  84.70   0.0   -0.0    0.0  311.97   2.26  284.0  102232.0     4      1
...                    ...    ...   ...    ...    ...     ...    ...    ...       ...   ...    ...
2014-12-31 19:00:00  16.19  77.86   0.0   -0.0    0.0  307.88   2.94  301.0  102278.0    19     12
2014-12-31 20:00:00  16.15  78.62   0.0   -0.0    0.0  308.33   2.86  302.0  102295.0    20     12
2014-12-31 21:00:00  16.11  79.38   0.0   -0.0    0.0  308.79   2.79  297.0  102288.0    21     12
2014-12-31 22:00:00  16.08  80.14   0.0   -0.0    0.0  309.24   2.71  292.0  102282.0    22     12
2014-12-31 23:00:00  16.04  80.90   0.0   -0.0    0.0  309.70   2.64  287.0  102275.0    23     12
Thanks in advance.
Use offsets.DateOffset with year (without an s) to set the same year on every element of the DatetimeIndex:
rng = pd.date_range('2009-04-03', periods=10, freq='350D')
df = pd.DataFrame({ 'a': range(10)}, rng)
print(df)
            a
2009-04-03  0
2010-03-19  1
2011-03-04  2
2012-02-17  3
2013-02-01  4
2014-01-17  5
2015-01-02  6
2015-12-18  7
2016-12-02  8
2017-11-17  9
df.index += pd.offsets.DateOffset(year=2014)
print(df)
            a
2014-04-03  0
2014-03-19  1
2014-03-04  2
2014-02-17  3
2014-02-01  4
2014-01-17  5
2014-01-02  6
2014-12-18  7
2014-12-02  8
2014-11-17  9
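To make the "without an s" remark concrete, here is a small sketch of the difference: year= replaces the year component outright, while years= shifts by that many years:

idx = pd.DatetimeIndex(['2009-04-03', '2012-02-17'])

# year= (no "s") overwrites the year component of each timestamp...
print(idx + pd.offsets.DateOffset(year=2014))  # 2014-04-03, 2014-02-17

# ...whereas years= (with "s") adds that many years to each timestamp.
print(idx + pd.offsets.DateOffset(years=2))    # 2011-04-03, 2014-02-17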
Another idea is Index.map with replace:
df.index = df.index.map(lambda x: x.replace(year=2014))

pandas broadcast function to ensure monotonically decreasing series

I'm wondering if there's a succinct, broadcast way to take a series/frame such as this:
>>> print(pd.DataFrame({'a': [5, 10, 9, 11, 13, 14, 12]}, index=pd.date_range('2020-12-01', periods=7)))
             a
2020-12-01   5
2020-12-02  10
2020-12-03   9
2020-12-04  11
2020-12-05  13
2020-12-06  14
2020-12-07  12
...and turn it into:
>>> print(pd.DataFrame({'a': [5, 9, 9, 11, 12, 12, 12]}, index=pd.date_range('2020-12-01', periods=7)))
             a
2020-12-01   5
2020-12-02   9
2020-12-03   9
2020-12-04  11
2020-12-05  12
2020-12-06  12
2020-12-07  12
NB: The important part is that the most recent value is kept, and any earlier value that exceeds a later one is replaced as you work backwards in time, resulting in a monotonically non-decreasing series in which no modification increases a value.
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
                            any      full
date       areaCode
2020-12-08 E92000001        0.0       0.0
           N92000002        0.0       0.0
           S92000003        0.0       0.0
           W92000004        0.0       0.0
2020-12-09 E92000001    11115.2       0.0
           N92000002      724.6       0.0
           S92000003     3801.8       0.0
           W92000004     1651.4       0.0
...
2021-01-24 E92000001  5727693.0  441684.0
           N92000002   159642.0   22713.0
           S92000003   415402.0    5538.0
           W92000004   270833.0     543.0
2021-01-25 E92000001  5962544.0  443010.0
Here's another way:
df.sort_index(ascending=False).cummin().sort_index()
             a
2020-12-01   5
2020-12-02   9
2020-12-03   9
2020-12-04  11
2020-12-05  12
2020-12-06  12
2020-12-07  12
For the MultiIndex, this becomes:
df.sort_index(ascending=False).groupby('areaCode').cummin().sort_index()
Flipping the time axis and doing a rolling min does the trick:
df.sort_index(ascending=False).rolling(window=1000, min_periods=1).min().sort_index()
produces
               a
2020-12-01   5.0
2020-12-02   9.0
2020-12-03   9.0
2020-12-04  11.0
2020-12-05  12.0
2020-12-06  12.0
2020-12-07  12.0
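An equivalent loop-free alternative, as a sketch: NumPy's minimum.accumulate on the reversed values does the same job as the reverse-sort-then-cummin trick:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 10, 9, 11, 13, 14, 12]},
                  index=pd.date_range('2020-12-01', periods=7))

# Reverse, take the running minimum, reverse back: each value is clipped
# to the smallest value that occurs at or after it.
vals = np.minimum.accumulate(df['a'].to_numpy()[::-1])[::-1]
out = pd.DataFrame({'a': vals}, index=df.index)
print(out)  # 5, 9, 9, 11, 12, 12, 12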

replacing np.float64 nan in pandas

I have a pandas dataframe as below:
>>> df.head()
             timestamp  count_200  count_201  count_503  count_504   mean_200    mean_201  mean_503  mean_504  count_500
0  2020-09-18 09:00:00     4932.0       51.0        NaN        NaN  59.501014   73.941176       0.0       0.0          0
1  2020-09-18 10:00:00     1697.0        9.0        NaN        NaN  57.807896   69.111111       0.0       0.0          0
2  2020-09-18 11:00:00     6895.0        6.0        2.0        1.0  54.037273   98.333333      33.0    1511.0          0
3  2020-09-18 12:00:00     2943.0       97.0        NaN        NaN  74.334353   74.268041       0.0       0.0          0
4  2020-09-18 13:00:00     2299.0       43.0        NaN        NaN  70.539800  102.302326       0.0       0.0          0
fillna does not replace the NaN:
>>> df.fillna(0)
             timestamp  count_200  count_201  count_503  count_504   mean_200    mean_201   mean_503  mean_504  count_500
0  2020-09-18 09:00:00     4932.0       51.0        NaN        NaN  59.501014   73.941176   0.000000     0.000          0
1  2020-09-18 10:00:00     1697.0        9.0        NaN        NaN  57.807896   69.111111   0.000000     0.000          0
2  2020-09-18 11:00:00     6895.0        6.0        2.0        1.0  54.037273   98.333333  33.000000  1511.000          0
3  2020-09-18 12:00:00     2943.0       97.0        NaN        NaN  74.334353   74.268041   0.000000     0.000          0
4  2020-09-18 13:00:00     2299.0       43.0        NaN        NaN  70.539800  102.302326   0.000000     0.000          0
But if we access just one row, fillna on the resulting series works as expected:
>>> df.iloc[0]
timestamp    2020-09-18 09:00:00
count_200                   4932
count_201                     51
count_503                    NaN
count_504                    NaN
mean_200                  59.501
mean_201                 73.9412
mean_503                       0
mean_504                       0
count_500                      0
Name: 0, dtype: object
>>> df.iloc[0].fillna(0)
timestamp    2020-09-18 09:00:00
count_200                   4932
count_201                     51
count_503                      0
count_504                      0
mean_200                  59.501
mean_201                 73.9412
mean_503                       0
mean_504                       0
count_500                      0
Name: 0, dtype: object
What is going on here?
>>> df.iloc[0,3]
nan
>>> type(df.iloc[0,3])
<class 'numpy.float64'>
Pandas recognises it as NA:
>>> df.isna()
   timestamp  count_200  count_201  count_503  count_504  mean_200  mean_201  mean_503  mean_504  count_500
0      False      False      False       True       True     False     False     False     False      False
1      False      False      False       True       True     False     False     False     False      False
2      False      False      False      False      False     False     False     False     False      False
3      False      False      False       True       True     False     False     False     False      False
4      False      False      False       True       True     False     False     False     False      False
But the NaN can be replaced in pandas using NumPy's built-in function:
>>> df.head().apply(np.nan_to_num)
             timestamp  count_200  count_201  count_503  count_504   mean_200    mean_201  mean_503  mean_504  count_500
0  2020-09-18 09:00:00     4932.0       51.0        0.0        0.0  59.501014   73.941176       0.0       0.0          0
1  2020-09-18 10:00:00     1697.0        9.0        0.0        0.0  57.807896   69.111111       0.0       0.0          0
2  2020-09-18 11:00:00     6895.0        6.0        2.0        1.0  54.037273   98.333333      33.0    1511.0          0
3  2020-09-18 12:00:00     2943.0       97.0        0.0        0.0  74.334353   74.268041       0.0       0.0          0
4  2020-09-18 13:00:00     2299.0       43.0        0.0        0.0  70.539800  102.302326       0.0       0.0          0
Is this expected? I can't find it documented anywhere. What am I missing? Is this a bug?
df.head()
             timestamp  count_200  count_201  count_503  count_504   mean_200    mean_201  mean_503  mean_504  count_500
0  2020-09-18 09:00:00     4932.0       51.0        NaN        NaN  59.501014   73.941176       0.0       0.0          0
1  2020-09-18 10:00:00     1697.0        9.0        NaN        NaN  57.807896   69.111111       0.0       0.0          0
2  2020-09-18 11:00:00     6895.0        6.0        2.0        1.0  54.037273   98.333333      33.0    1511.0          0
3  2020-09-18 12:00:00     2943.0       97.0        NaN        NaN  74.334353   74.268041       0.0       0.0          0
4  2020-09-18 13:00:00     2299.0       43.0        NaN        NaN  70.539800  102.302326       0.0       0.0          0
Replacing NaN with 0
df.fillna(0)
             timestamp  count_200  count_201  count_503  count_504   mean_200    mean_201  mean_503  mean_504  count_500
0  2020-09-18 09:00:00     4932.0       51.0        0.0        0.0  59.501014   73.941176       0.0       0.0          0
1  2020-09-18 10:00:00     1697.0        9.0        0.0        0.0  57.807896   69.111111       0.0       0.0          0
2  2020-09-18 11:00:00     6895.0        6.0        2.0        1.0  54.037273   98.333333      33.0    1511.0          0
3  2020-09-18 12:00:00     2943.0       97.0        0.0        0.0  74.334353   74.268041       0.0       0.0          0
4  2020-09-18 13:00:00     2299.0       43.0        0.0        0.0  70.539800  102.302326       0.0       0.0          0
It is working fine for me.
Use inplace=True to apply the changes to the dataframe:
df.fillna(0, inplace=True)
The pandas version I'm using is:
print(pd.__version__)
0.23.0
Please restart your IDE/Python kernel, and check and update your pandas version (if required).
df[df.isna().any()] = 0
You can use this. The pandas library can be confusing, as for one piece of functionality there are many things you can try; I generally try them all rather than getting stuck on one. Tell me if this works, or at least what it does.
I can't seem to recreate the error: if I copy your provided df and use pd.read_clipboard() to turn it into a df, then df.fillna(0) gives the expected results for me.
When you show the return of df.fillna(0), is that the actual return, or are you printing the df? If the latter, remember to use the inplace=True parameter.
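For what it's worth, the usual cause of "fillna does nothing" is simply the default copy semantics; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'count_503': [np.nan, 2.0, np.nan]})

filled = df.fillna(0)                    # returns a NEW DataFrame by default
print(df['count_503'].isna().sum())      # 2 -- the original is unchanged
print(filled['count_503'].isna().sum())  # 0

df.fillna(0, inplace=True)               # or assign back: df = df.fillna(0)
print(df['count_503'].isna().sum())      # 0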

How to create rolling window variables without skipping months when using a multiIndex?

Currently I have a df with a location_key & year_month multiIndex. I want to create a sum using a rolling window for 3 months.
(pd.DataFrame(df.groupby(['LOCATION_KEY', 'YEAR_MONTH'])['SALES'].count())
   .sort_index()
   .groupby(level=0)
   .apply(lambda x: x.rolling(window=3).sum()))
The window itself works; the issue is that in months where there were no sales, instead of counting an empty month the window counts another month.
E.g. in the data below, the 2016-03 sales figure is the sum of 2016-03, 2016-01 and 2015-12, as opposed to what I would like: 2016-03, 2016-02 and 2016-01.
LOCATION_KE  YEAR_MONTH  SALES
A            2015-10       NaN
             2015-11       NaN
             2015-12       200
             2016-01       220
             2016-03       180
B            2015-04       NaN
             2015-05       NaN
             2015-06       119
             2015-07       120
Basically you have to get your index set up how you want so the rolling window has zeros to process.
df
  LOCATION_KE YEAR_MONTH  SALES
0           A 2015-10-01    NaN
1           A 2015-11-01    NaN
2           A 2015-12-01  200.0
3           A 2016-01-01  220.0
4           A 2016-03-01  180.0
5           B 2015-04-01    NaN
6           B 2015-05-01    NaN
7           B 2015-06-01  119.0
8           B 2015-07-01  120.0
df['SALES'] = df['SALES'].fillna(0)
df.index = [df["LOCATION_KE"], df["YEAR_MONTH"]]
df
                       LOCATION_KE YEAR_MONTH  SALES
LOCATION_KE YEAR_MONTH
A           2015-10-01           A 2015-10-01    0.0
            2015-11-01           A 2015-11-01    0.0
            2015-12-01           A 2015-12-01  200.0
            2016-01-01           A 2016-01-01  220.0
            2016-03-01           A 2016-03-01  180.0
B           2015-04-01           B 2015-04-01    0.0
            2015-05-01           B 2015-05-01    0.0
            2015-06-01           B 2015-06-01  119.0
            2015-07-01           B 2015-07-01  120.0
df = df.reindex(pd.MultiIndex.from_product([df['LOCATION_KE'],
                                            pd.date_range("20150101", periods=24, freq='MS')],
                                           names=['location', 'month']))
df['SALES'].fillna(0).reset_index(level=0).groupby('location').rolling(3).sum().fillna(0)
                    location  SALES
location month
A        2015-01-01        A    0.0
         2015-02-01        A    0.0
         2015-03-01        A    0.0
         2015-04-01        A    0.0
         2015-05-01        A    0.0
         2015-06-01        A    0.0
         2015-07-01        A    0.0
         2015-08-01        A    0.0
         2015-09-01        A    0.0
         2015-10-01        A    0.0
         2015-11-01        A    0.0
         2015-12-01        A  200.0
         2016-01-01        A  420.0
         2016-02-01        A  420.0
         2016-03-01        A  400.0
         2016-04-01        A  180.0
         2016-05-01        A  180.0
         2016-06-01        A    0.0
         2016-07-01        A    0.0
         2016-08-01        A    0.0
         2016-09-01        A    0.0
         2016-10-01        A    0.0
         2016-11-01        A    0.0
         2016-12-01        A    0.0
         2015-01-01        A    0.0
         2015-02-01        A    0.0
         2015-03-01        A    0.0
         2015-04-01        A    0.0
         2015-05-01        A    0.0
         2015-06-01        A    0.0
...                      ...    ...
B        2016-07-01        B    0.0
         2016-08-01        B    0.0
         2016-09-01        B    0.0
         2016-10-01        B    0.0
         2016-11-01        B    0.0
         2016-12-01        B    0.0
         2015-01-01        B    0.0
         2015-02-01        B    0.0
         2015-03-01        B    0.0
         2015-04-01        B    0.0
         2015-05-01        B    0.0
         2015-06-01        B  119.0
         2015-07-01        B  239.0
         2015-08-01        B  239.0
         2015-09-01        B  120.0
         2015-10-01        B    0.0
         2015-11-01        B    0.0
         2015-12-01        B    0.0
         2016-01-01        B    0.0
         2016-02-01        B    0.0
         2016-03-01        B    0.0
         2016-04-01        B    0.0
         2016-05-01        B    0.0
         2016-06-01        B    0.0
         2016-07-01        B    0.0
         2016-08-01        B    0.0
         2016-09-01        B    0.0
         2016-10-01        B    0.0
         2016-11-01        B    0.0
         2016-12-01        B    0.0
I think if you have an up-to-date pandas you can leave out the reset_index.
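As a variation on the same idea, you could fill each group to a continuous monthly frequency with asfreq instead of building the full product index by hand. A sketch, using a small frame shaped like the question's data (min_periods=1 is my choice here; drop it to match the rolling(3).sum().fillna(0) above):

import pandas as pd

df = pd.DataFrame({
    'LOCATION_KE': ['A', 'A', 'A', 'B', 'B'],
    'YEAR_MONTH': pd.to_datetime(['2015-12-01', '2016-01-01', '2016-03-01',
                                  '2015-06-01', '2015-07-01']),
    'SALES': [200.0, 220.0, 180.0, 119.0, 120.0],
}).set_index(['LOCATION_KE', 'YEAR_MONTH'])

# Per location: drop the group level, insert the missing months as 0,
# then take the 3-month rolling sum.
rolled = (df.groupby(level=0)['SALES']
            .apply(lambda s: s.droplevel(0)
                              .asfreq('MS', fill_value=0)
                              .rolling(3, min_periods=1).sum()))
print(rolled)
# e.g. ('A', 2016-03-01) -> 400.0 (220 + 0 + 180), not a sum over skipped months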

Pandas resample timeseries into 24 hours

I have data like this:
                         OwnerUserId  Score
CreationDate
2015-01-01 00:16:46.963    1491895.0    0.0
2015-01-01 00:23:35.983    1491895.0    1.0
2015-01-01 00:30:55.683    1491895.0    1.0
2015-01-01 01:10:43.830    2141635.0    0.0
2015-01-01 01:11:08.927    1491895.0    1.0
2015-01-01 01:12:34.273    3297613.0    1.0
...
This is a whole year of data with different users' scores. I hope to get data like:
OwnerUserId  1491895.0  1491895.0  1491895.0  2141635.0  1491895.0
00:00              0.0        3.0        0.0        3.0        5.8
00:01              5.0        3.0        0.0        3.0        5.8
00:02              3.0       33.0       20.0        3.0        5.8
...
23:40             12.0       33.0       10.0        3.0        5.8
23:41             32.0       33.0       20.0        3.0        5.8
23:42             12.0       13.0       10.0        3.0        5.8
Each element of the dataframe is the score (mean or sum).
I have been trying the following:
pd.pivot_table(data_series.reset_index(), index=['CreationDate'],
               columns=['OwnerUserId'],
               fill_value=0).resample('W').sum()['Score']
I get a result like the image.
I think you need:
# remove [] and pass values to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
                    index='CreationDate',
                    columns='OwnerUserId',
                    values='Score',
                    fill_value=0)
# truncate seconds and convert to a TimedeltaIndex
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))
# or round to minutes instead
# df.index = pd.to_timedelta(df.index.round('T').strftime('%H:%M:%S'))
print(df)
OwnerUserId  1491895.0  2141635.0  3297613.0
00:16:00             0          0          0
00:23:00             1          0          0
00:30:00             1          0          0
01:10:00             0          0          0
01:11:00             1          0          0
01:12:00             0          0          1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
# resample by minute and aggregate sum; reindex adds any missing rows
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print(df)
OwnerUserId  1491895.0  2141635.0  3297613.0
00:00:00           0.0        0.0        0.0
00:01:00           0.0        0.0        0.0
00:02:00           0.0        0.0        0.0
00:03:00           0.0        0.0        0.0
00:04:00           0.0        0.0        0.0
00:05:00           0.0        0.0        0.0
00:06:00           0.0        0.0        0.0
...
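If the goal is one row per minute of the day aggregated over the whole year (rather than the per-week sums of the original attempt), a sketch of an alternative is to pivot on the time-of-day component directly; aggfunc='mean' is my assumption here, matching the question's "mean or sum":

# Group the whole year down to one row per minute-of-day.
per_minute = (data_series.reset_index()
                .assign(tod=lambda d: d['CreationDate'].dt.floor('T').dt.time)
                .pivot_table(index='tod', columns='OwnerUserId',
                             values='Score', aggfunc='mean', fill_value=0))

Minutes in which nobody ever posted still need the timedelta_range reindex shown above if you want a full 1440-row grid.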
