Pandas Multiindex reindex on levels - python

I have several time series stored in a pandas DataFrame with a two-level MultiIndex. I want to know how to reindex a MultiIndex DataFrame so that I get index entries for every hour between the first and last existing index values.
So this is an example of my dataframe:
A B C D
tick act
2019-01-10 2019-01-09 20:00:00 5.0 5.0 5.0 5.0
2019-01-10 00:00:00 52.0 34.0 1.0 9.0
2019-01-10 01:00:00 75.0 52.0 61.0 1.0
2019-01-10 02:00:00 28.0 29.0 46.0 61.0
2019-01-16 2019-01-09 22:00:00 91.0 42.0 3.0 34.0
2019-01-10 02:00:00 2.0 22.0 41.0 59.0
2019-01-10 03:00:00 16.0 9.0 92.0 53.0
And this is what I want to get:
tick act
2019-01-10 2019-01-09 20:00:00 5.0 5.0 5.0 5.0
2019-01-09 21:00:00 NaT NaN NaN NaN NaN
2019-01-09 22:00:00 NaT NaN NaN NaN NaN
2019-01-09 23:00:00 NaT NaN NaN NaN NaN
2019-01-10 00:00:00 52.0 34.0 1.0 9.0
2019-01-10 01:00:00 75.0 52.0 61.0 1.0
2019-01-10 02:00:00 28.0 29.0 46.0 61.0
2019-01-16 2019-01-09 22:00:00 91.0 42.0 3.0 34.0
2019-01-09 23:00:00 NaT NaN NaN NaN NaN
2019-01-10 00:00:00 NaT NaN NaN NaN NaN
2019-01-10 01:00:00 NaT NaN NaN NaN NaN
2019-01-10 02:00:00 2.0 22.0 41.0 59.0
2019-01-10 03:00:00 16.0 9.0 92.0 53.0
The important thing to remember is that the 'act' index level doesn't have the same date range for every tick (for example, for 2019-01-10 it starts at 2019-01-09 20:00:00 and ends at 2019-01-10 02:00:00, while for 2019-01-16 it starts at 2019-01-09 22:00:00 and ends at 2019-01-10 03:00:00).
I am mainly interested if there exists a solution using pandas methods without unnecessary external loops.

First, reset the index of your data.
d = df.reset_index()
d
tick act A B C D
0 2019-01-10 2019-01-09 20:00:00 5.0 5.0 5.0 5.0
1 2019-01-10 2019-01-10 00:00:00 52.0 34.0 1.0 9.0
2 2019-01-10 2019-01-10 01:00:00 75.0 52.0 61.0 1.0
3 2019-01-10 2019-01-10 02:00:00 28.0 29.0 46.0 61.0
4 2019-01-16 2019-01-09 22:00:00 91.0 42.0 3.0 34.0
5 2019-01-16 2019-01-10 02:00:00 2.0 22.0 41.0 59.0
6 2019-01-16 2019-01-10 03:00:00 16.0 9.0 92.0 53.0
Group your data by tick and apply the interpolate function to each group.
def interpolate(df):
    # generate the new hourly index
    new_index = pd.date_range(df.act.min(), df.act.max(), freq="h")
    # set `act` as the index and upsample it to hours
    return df.set_index("act").reindex(new_index)
d.groupby("tick").apply(interpolate)
It gives:
tick A B C D
tick
2019-01-10 2019-01-09 20:00:00 2019-01-10 5.0 5.0 5.0 5.0
2019-01-09 21:00:00 NaN NaN NaN NaN NaN
2019-01-09 22:00:00 NaN NaN NaN NaN NaN
2019-01-09 23:00:00 NaN NaN NaN NaN NaN
2019-01-10 00:00:00 2019-01-10 52.0 34.0 1.0 9.0
2019-01-10 01:00:00 2019-01-10 75.0 52.0 61.0 1.0
2019-01-10 02:00:00 2019-01-10 28.0 29.0 46.0 61.0
2019-01-16 2019-01-09 22:00:00 2019-01-16 91.0 42.0 3.0 34.0
2019-01-09 23:00:00 NaN NaN NaN NaN NaN
2019-01-10 00:00:00 NaN NaN NaN NaN NaN
2019-01-10 01:00:00 NaN NaN NaN NaN NaN
2019-01-10 02:00:00 2019-01-16 2.0 22.0 41.0 59.0
2019-01-10 03:00:00 2019-01-16 16.0 9.0 92.0 53.0
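A self-contained sketch of the same per-group reindex, working directly on the MultiIndex instead of on the reset_index output. The sample values and the helper name fill_hours are made up for illustration. The groups are stitched back together with pd.concat rather than groupby(...).apply, on the assumption that this behaves the same across pandas versions:

```python
import pandas as pd

# Small frame shaped like the one in the question (values are made up).
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2019-01-10"), pd.Timestamp("2019-01-09 20:00")),
        (pd.Timestamp("2019-01-10"), pd.Timestamp("2019-01-09 23:00")),
        (pd.Timestamp("2019-01-16"), pd.Timestamp("2019-01-09 22:00")),
        (pd.Timestamp("2019-01-16"), pd.Timestamp("2019-01-10 00:00")),
    ],
    names=["tick", "act"],
)
df = pd.DataFrame({"A": [5.0, 52.0, 91.0, 2.0]}, index=index)


def fill_hours(group):
    # `group` holds the rows of one tick, indexed by `act` only.
    hours = pd.date_range(group.index.min(), group.index.max(), freq="h")
    return group.reindex(hours)


# Reindex each tick over its own hourly range, then rebuild the MultiIndex.
pieces = {
    tick: fill_hours(group.droplevel("tick"))
    for tick, group in df.groupby(level="tick")
}
result = pd.concat(pieces, names=["tick", "act"])
print(result)
```

Each tick keeps its own start and end, so no hours outside a tick's existing range are created.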

Related

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
With fillna you probably want to specify a fill method, such as filling with the previous/next value or with the column mean. For example:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill() # forward fill, used here as an example
nulls_after_fill = df[nulls_index]
Take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to tell pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
from datetime import date, time
import numpy as np
import pandas as pd
raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[1.,np.nan],rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
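A minimal sketch of why the question's fillna left gaps, assuming the cause is mismatched row labels: fillna with a DataFrame matches rows by index label, not by position. The column names and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"hour": [0, 1, 2], "rain": [0.0, np.nan, np.nan]},
    index=[10, 11, 12],  # labels left over from an earlier slice
)
patch = pd.DataFrame(
    {"hour": [0, 1, 2], "rain": [0.1, 0.2, 0.3]},
    index=[0, 1, 2],  # fresh default labels
)

# The two frames share no row labels, so nothing gets filled.
unaligned = raw.fillna(patch)

# Index both frames by the key that identifies a row, then fill.
aligned = raw.set_index("hour").fillna(patch.set_index("hour"))

print(unaligned)
print(aligned)
```

Only the NaN cells are taken from patch; values already present in raw are kept.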

Using the last valid data index in one Dataframe to select data in another Dataframe

I want to find the last valid index of the first Dataframe, and use it to index the second Dataframe.
So, suppose I have the following Dataframe (df1):
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Now I can use "last_valid_index()" to find the last valid index of each column:
lvi = df.apply(lambda series: series.last_valid_index())
Which yields:
Site 1 2017-01-01
Site 2 2013-01-01
Site 3 2019-01-01
Site 4 2020-01-01
Site 5 2006-01-01
Site 6 2017-01-01
How do I use this index to slice the time series in another Dataframe? An example of such a Dataframe could be created with:
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01',freq ='AS')
df2 = df2.set_index(idx)
How do I use that "lvi" variable to index into df2?
To do this manually I could just use:
df_s1 = df['Site 1'].loc['2000-01-01':'2017-01-01']
To get something like:
2000-01-01 13.0
2001-01-01 77.0
2002-01-01 50.0
2003-01-01 7.0
2004-01-01 11.0
2005-01-01 50.0
2006-01-01 30.0
2007-01-01 50.0
2008-01-01 24.0
2009-01-01 56.0
2010-01-01 87.0
2011-01-01 22.0
2012-01-01 12.0
2013-01-01 1.0
2014-01-01 94.0
2015-01-01 2.0
2016-01-01 81.0
2017-01-01 59.0
Is there a better way to approach this? Also, will each column have to essentially be its own dataframe to work? Any help is greatly appreciated!
This might be a bit more idiomatic:
df2[df.notna()]
or even
df2.where(df.notna())
Note that in these cases (and df1*0 + df2), the operations are done for matching index values of df and df2. For example, df2[df.reset_index(drop=True).notna()] will return all nan because there are no common index values.
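A small sketch of the where approach on made-up data (the frame names df1 and df2 follow the question, the values do not):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=4, freq="D")
df1 = pd.DataFrame(
    {"Site 1": [1.0, 2.0, np.nan, np.nan], "Site 2": [1.0, 2.0, 3.0, np.nan]},
    index=dates,
)
df2 = pd.DataFrame(
    {"Site 1": [10.0, 20.0, 30.0, 40.0], "Site 2": [11.0, 21.0, 31.0, 41.0]},
    index=dates,
)

# Keep df2's values where df1 has data; everything else becomes NaN.
masked = df2.where(df1.notna())
print(masked)
```

Because the mask is aligned on both index and columns, each column is cut off at its own last valid date without splitting the frame into separate Series.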
This seems to work just fine:
In [34]: d
Out[34]:
x y
Date
2020-01-01 1.0 2.0
2020-01-02 1.0 2.0
2020-01-03 1.0 2.0
2020-01-04 1.0 2.0
2020-01-05 1.0 2.0
2020-01-06 1.0 NaN
2020-01-07 1.0 NaN
2020-01-08 1.0 NaN
2020-01-09 1.0 NaN
2020-01-10 1.0 NaN
2020-01-11 NaN NaN
2020-01-12 NaN NaN
2020-01-13 NaN NaN
2020-01-14 NaN NaN
2020-01-15 NaN NaN
2020-01-16 NaN NaN
2020-01-17 NaN NaN
2020-01-18 NaN NaN
2020-01-19 NaN NaN
2020-01-20 NaN NaN
In [35]: d.apply(lambda col: col.last_valid_index())
Out[35]:
x 2020-01-10
y 2020-01-05
dtype: datetime64[ns]
And then:
In [15]: d.apply(lambda col: col.last_valid_index()).apply(lambda date: df2.loc[date])
Out[15]:
z
x 0.940396
y 0.564007
Alright, so after thinking about this for a while and trying to come up with a detailed procedure that involved a for loop etc., I came to the conclusion that this simple math operation will do the trick. Basically I am taking advantage of how math is done between Dataframes in pandas.
output = df1*0 + df2
This gives the output on df2 that will take on the NaN values from df1 and look like this:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 0.690597 0.443933 0.787931 0.659639 0.363606 0.922373
2001-01-01 0.388669 0.577734 0.450225 0.021592 0.554249 0.305546
2002-01-01 0.578212 0.927848 0.361426 0.840541 0.626881 0.545491
2003-01-01 0.431668 0.128282 0.893351 0.783488 0.122182 0.666194
2004-01-01 0.151491 0.928584 0.834474 0.945401 0.590830 0.802648
2005-01-01 0.113477 0.398326 0.649955 0.202538 0.485927 0.127925
2006-01-01 0.521906 0.458672 0.923632 0.948696 0.638754 0.552753
2007-01-01 0.266599 0.839047 0.099069 0.000928 NaN 0.018146
2008-01-01 0.819810 0.809779 0.706223 0.247780 NaN 0.759691
2009-01-01 0.441574 0.020291 0.702551 0.468862 NaN 0.341191
2010-01-01 0.277030 0.130573 0.906697 0.589474 NaN 0.819986
2011-01-01 0.795344 0.103121 0.846405 0.589916 NaN 0.564411
2012-01-01 0.697255 0.599767 0.206482 0.718980 NaN 0.731366
2013-01-01 0.891771 0.001944 0.703132 0.751986 NaN 0.845933
2014-01-01 0.672579 NaN 0.466981 0.466770 NaN 0.618069
2015-01-01 0.767219 NaN 0.702156 0.370905 NaN 0.481971
2016-01-01 0.315264 NaN 0.793531 0.754920 NaN 0.091432
2017-01-01 0.431651 NaN 0.974520 0.708074 NaN 0.870077
2018-01-01 NaN NaN 0.408743 0.430576 NaN NaN
2019-01-01 NaN NaN 0.751509 0.755521 NaN NaN
2020-01-01 NaN NaN NaN 0.518533 NaN NaN
I was basically wanting to imprint the NaN values from one Dataframe onto another. I cannot believe how difficult I was making this. As long as my Dataframes are the same size this should work fine for my needs.
Now I should be able to take it from here to calculate the percent change from each last valid datapoint. Thank you everyone for the input!
EDIT:
Just to show everyone what I was ultimately trying to accomplish, here is the final code I produced with everyone's help and suggestions!
The original df looked like:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Then I came up with a second full dataframe (df2) with:
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01',freq ='AS')
df2 = df2.set_index(idx)
Now I replace the nan values in df2 with the nan values from df:
dfr = df2[df.notna()]
Then I invert the dataframe:
dfr = dfr[::-1]
valid_first = dfr.apply(lambda col: col.first_valid_index())
valid_last = dfr.apply(lambda col: col.last_valid_index())
Now I want to calculate the percent change from my last valid data point, which is fixed for each column. This gives me the % change from the present to the past, with respect to the most recent (or last valid) data point.
new = []
for j in dfr:
    m = dfr[j].loc[valid_first[j]:valid_last[j]]
    pc = m / m.iloc[0] - 1
    new.append(pc)
final = pd.concat(new, axis=1)
print(final)
Which gave me:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
2000-01-01 0.270209 -0.728445 -0.636105 0.380330 41.339081 -0.462147
2001-01-01 0.854952 -0.827804 -0.703568 -0.787391 40.588791 -0.884806
2002-01-01 -0.677757 -0.120482 -0.208255 -0.982097 54.348094 -0.483415
2003-01-01 -0.322010 -0.061277 -0.382602 1.025088 5.440808 -0.602661
2004-01-01 1.574451 -0.768251 -0.543260 1.210434 50.494788 -0.859331
2005-01-01 -0.412226 -0.866441 -0.055027 -0.168267 1.346869 -0.385080
2006-01-01 1.280867 -0.640899 0.354513 1.086703 0.000000 0.108504
2007-01-01 1.121585 -0.741675 -0.735990 -0.768578 NaN -0.119436
2008-01-01 -0.210467 -0.376884 -0.575106 -0.779147 NaN 0.055949
2009-01-01 1.864107 -0.966827 0.566590 1.003121 NaN -0.214482
2010-01-01 0.571762 -0.311459 -0.518113 1.036950 NaN -0.513911
2011-01-01 -0.122525 -0.178137 -0.641642 0.197481 NaN 0.033141
2012-01-01 0.403578 -0.829402 0.161753 -0.438578 NaN -0.996595
2013-01-01 0.383481 0.000000 -0.305824 0.602079 NaN -0.057711
2014-01-01 -0.699708 NaN -0.515074 -0.277157 NaN -0.840873
2015-01-01 0.422364 NaN -0.759708 1.230037 NaN -0.663253
2016-01-01 -0.418945 NaN 0.197396 -0.445260 NaN -0.299741
2017-01-01 0.000000 NaN -0.897428 0.669791 NaN 0.000000
2018-01-01 NaN NaN 0.138997 0.486961 NaN NaN
2019-01-01 NaN NaN 0.000000 0.200771 NaN NaN
2020-01-01 NaN NaN NaN 0.000000 NaN NaN
I know often times these questions don't have context, so here is the final output achieved thanks to your input. Again, thank you to everyone for the help!
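For comparison, the same percent change can be sketched without the loop or the reversal, assuming the NaN values only trail at the end of each column as in the data above. The frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=4, freq="D")
masked = pd.DataFrame(
    {"Site 1": [2.0, 4.0, 8.0, np.nan], "Site 2": [5.0, 10.0, np.nan, np.nan]},
    index=dates,
)

# After a forward fill, the last row holds each column's last valid value.
last_valid = masked.ffill().iloc[-1]

# Dividing a DataFrame by a Series aligns the Series on the column labels.
change = masked / last_valid - 1
print(change)
```

The result stays in chronological order, and rows past a column's last valid point remain NaN.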

Using fixed-interval (3-hour) data to generate continuous (1-hour) data

This is part of my data:
Day_Data Hour_Data WIN_D WIN_S TEM RHU PRE_1h
1 0 58 1 22 78 0
1 3 32 1.9 24.6 65 0
1 6 41 3.2 25.6 59 0
1 9 20 0.8 24.8 64 0
1 12 44 1.7 22.7 76 0
1 15 118 0.7 20.2 92 0
1 18 70 2.6 20.2 94 0
1 21 76 3.4 19.9 66 0
2 0 76 3.8 19.4 58 0
2 3 75 5.8 19.4 47 0
2 6 81 5.1 19.5 42 0
2 9 61 3.6 17.4 48 0
2 12 50 0.9 15.8 46 0
2 15 348 1.1 14.5 52 0
2 18 357 1.9 13.5 60 0
2 21 333 1.2 12.4 74 0
I want to generate extra hourly rows between these, where each filled value is the mean of the previous value and the next value.
How can I do that?
Thank you!
And, #jdy thanks for reminder, this is what I have done:
data['time']='2017'+'-'+'10'+'-'+data['Day_Data'].map(int).map(str)+' '+data['Hour_Data'].map(int).map(str)+':'+'00'+':'+'00'
from datetime import datetime
data.loc[:,'Date']=pd.to_datetime(data['time'])
data=data.drop(['Day_Data','Hour_Data','time'],axis=1)
index = data.set_index(data['Date'])
data=index.resample('1h').mean()
Output:
2017-10-01 00:00:00 58.0 1.0 22.0 78.0 0.0
2017-10-01 01:00:00 NaN NaN NaN NaN NaN
2017-10-01 02:00:00 NaN NaN NaN NaN NaN
2017-10-01 03:00:00 32.0 1.9 24.6 65.0 0.0
2017-10-01 04:00:00 NaN NaN NaN NaN NaN
2017-10-01 05:00:00 NaN NaN NaN NaN NaN
2017-10-01 06:00:00 41.0 3.2 25.6 59.0 0.0
2017-10-01 07:00:00 NaN NaN NaN NaN NaN
2017-10-01 08:00:00 NaN NaN NaN NaN NaN
2017-10-01 09:00:00 20.0 0.8 24.8 64.0 0.0
2017-10-01 10:00:00 NaN NaN NaN NaN NaN
2017-10-01 11:00:00 NaN NaN NaN NaN NaN
2017-10-01 12:00:00 44.0 1.7 22.7 76.0 0.0
2017-10-01 13:00:00 NaN NaN NaN NaN NaN
2017-10-01 14:00:00 NaN NaN NaN NaN NaN
2017-10-01 15:00:00 118.0 0.7 20.2 92.0 0.0
2017-10-01 16:00:00 NaN NaN NaN NaN NaN
2017-10-01 17:00:00 NaN NaN NaN NaN NaN
2017-10-01 18:00:00 70.0 2.6 20.2 94.0 0.0
2017-10-01 19:00:00 NaN NaN NaN NaN NaN
2017-10-01 20:00:00 NaN NaN NaN NaN NaN
2017-10-01 21:00:00 76.0 3.4 19.9 66.0 0.0
2017-10-01 22:00:00 NaN NaN NaN NaN NaN
2017-10-01 23:00:00 NaN NaN NaN NaN NaN
2017-10-02 00:00:00 76.0 3.8 19.4 58.0 0.0
2017-10-02 01:00:00 NaN NaN NaN NaN NaN
2017-10-02 02:00:00 NaN NaN NaN NaN NaN
2017-10-02 03:00:00 75.0 5.8 19.4 47.0 0.0
2017-10-02 04:00:00 NaN NaN NaN NaN NaN
2017-10-02 05:00:00 NaN NaN NaN NaN NaN
2017-10-02 06:00:00 81.0 5.1 19.5 42.0 0.0
But I have no idea how to fill the NaN values with the mean of the previous value and the next value.
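One possible sketch, not taken from the thread: forward-fill and back-fill the resampled frame and average the two, so that every gap gets the mean of its neighbouring known values. The single TEM column and its three readings are copied from the sample above; the variable names are made up:

```python
import pandas as pd

# Three 3-hourly readings, as in the first rows of the sample data.
idx = pd.date_range("2017-10-01 00:00", periods=3, freq="3h")
data = pd.DataFrame({"TEM": [22.0, 24.6, 25.6]}, index=idx)

# Upsample to hourly; the new rows are NaN.
hourly = data.resample("1h").mean()

# ffill gives the previous known value, bfill the next known value.
# Rows that already had data are unchanged, since both fills agree there.
filled = (hourly.ffill() + hourly.bfill()) / 2
print(filled)
```

Both gap rows between two readings get the same value this way. If a gradual transition is wanted instead, hourly.interpolate() performs linear interpolation.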

How can I replace specific values in a time-series DataFrame in pandas?

I have the dataframe below (date/time is a MultiIndex) and I want to replace the column values in the range 00:00:00~07:00:00 with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
<class 'datetime.time'>
How can I do this?
You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or if second levels are times:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, it is necessary to use numpy.tile to repeat it once for each unique value of the first level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution, with the array generated from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr =np.tile(np.arange(1, 10),len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
The problem was finally found: my solution works with a one-column DataFrame, but when working with a Series you need to remove the trailing : from the indexer:
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
---^^^
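A compact sketch of the Series case, with integer minute labels standing in for the datetime.time values so that the example stays small. The data and names are made up:

```python
import numpy as np
import pandas as pd

# Two dates, four quarter-hour slots each (minutes used as plain integers).
index = pd.MultiIndex.from_product(
    [["2018-01-26", "2018-01-27"], [0, 15, 30, 45]], names=["date", "minute"]
)
s = pd.Series(np.nan, index=index, name="temp")

idx = pd.IndexSlice
values = np.array([1.0, 2.0, 3.0])  # one value per selected slot
n_dates = len(s.index.levels[0])

# A Series has a single axis, so the indexer has no trailing ",:".
# Label slices are inclusive: 0:30 selects minutes 0, 15 and 30.
s.loc[idx[:, 0:30]] = np.tile(values, n_dates)
print(s)
```

np.tile repeats the per-day values once per date, so the array length matches the number of selected rows.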

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occurs in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried giving a multi-index with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
