I have a dataframe with one column and a timestamp index spanning anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values per day) fills a row, with the next row holding the next day's values, and so on (so five days of forecasted data yields five rows of 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occur in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried building a MultiIndex with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
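If the downstream analysis still needs the vertical format afterwards, the pivot is easy to reverse; a minimal sketch using the wide frame above (stack() drops the NaN slots that pivot created for the incomplete last day):
wide = df.pivot(index="date", columns="hour", values="kWh")
tall = wide.stack().rename("kWh")  # back to one value per (date, hour)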
Not sure why your MultiIndex code doesn't work. I'm assuming it was something along the following lines, which gives the same output as the pivot:
In []:
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
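Note that unstack leaves "kWh" as an extra outer level on the columns; selecting it away leaves plain hour columns, matching the pivot output exactly. A small sketch:
wide = df.unstack()
wide = wide["kWh"]  # drop the outer column level; columns become hours 0..23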
Let's say I have a panel dataframe with lots of NaNs inside as follow:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = np.nan
df
Out:
A B C
2013-02-26 NaN NaN NaN
2013-02-27 NaN NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 NaN 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 NaN NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
I need to apply df.fillna(method='bfill') or df.fillna(method='ffill') to the dataframe, but only within the same year and month:
For example, if I apply df.fillna(method='bfill'), the expected result will like this:
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
How could I do that in Pandas? Thanks.
You could resample by M (month) and transform bfill:
>>> df.resample("M").transform('bfill')
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
For specific columns:
>>> df[['A', 'B']] = df.resample("M")[['A', 'B']].transform('bfill')
>>> df
A B C
2013-02-26 62.0 NaN NaN
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 10.0 NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
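An equivalent spelling groups by year and month explicitly; a sketch, relying on GroupBy.bfill filling only within each group so values never bleed across month boundaries:
>>> df.groupby([df.index.year, df.index.month]).bfill()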
I have a randomly generated 10x10 dataset and I need to randomly replace 10% of the data with NaN.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method; I used it when setting up a hackathon and needed to inject missing data for the competition.
You can use np.random.choice to create a boolean mask of the same shape as the dataframe; set the probabilities p for True and False, where True marks the values that will be replaced by NaNs.
Then simply apply the mask using df.mask.
import pandas as pd
import numpy as np
p = 0.1  # fraction of missing data required
df = pd.DataFrame(np.random.randint(0,100,size=(10,10)))
mask = np.random.choice([True, False], size=df.shape, p=[p,1-p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
EDIT:
Updated my answer to produce exactly the 10% NaN count you are looking for. It uses itertools.product and random.sample to draw a set of indexes to mask, then sets those positions to NaN. The count should be exact, as you expected.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0]*df.shape[1]*p) #Calculate count of nans
#Sample exactly n indexes
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float) #Get data as numpy
data[idx, idy]=np.nan #Update numpy view with np.nan
#Assign to new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
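For reference, the same exact-count injection works with a flat permutation instead of itertools; a sketch under the same p = 0.1 assumption:
data = df.to_numpy().astype(float)
flat = data.ravel()                                 # view over the same buffer
n = int(flat.size * p)                              # exact number of NaNs
flat[np.random.permutation(flat.size)[:n]] = np.nan
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)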
I want to find the last valid index of the first Dataframe, and use it to index the second Dataframe.
So, suppose I have the following Dataframe (df1):
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Now I can use "first_valid_index()" to find the last valid index of each column:
lvi = df.apply(lambda series: series.last_valid_index())
Which yields:
Site 1 2017-01-01
Site 2 2013-01-01
Site 3 2019-01-01
Site 4 2020-01-01
Site 5 2006-01-01
Site 6 2017-01-01
How do I use this index to slice the timeseries of another Dataframe? Another example Dataframe could be created with:
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
df2 = pd.DataFrame({
    "Site 1": np.random.rand(21),
    "Site 2": np.random.rand(21),
    "Site 3": np.random.rand(21),
    "Site 4": np.random.rand(21),
    "Site 5": np.random.rand(21),
    "Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='AS')
df2 = df2.set_index(idx)
How do I use that "lvi" variable to index into df2?
To do this manually I could just use:
df_s1 = df['Site 1'].loc['2000-01-01':'2017-01-01']
To get something like:
2000-01-01 13.0
2001-01-01 77.0
2002-01-01 50.0
2003-01-01 7.0
2004-01-01 11.0
2005-01-01 50.0
2006-01-01 30.0
2007-01-01 50.0
2008-01-01 24.0
2009-01-01 56.0
2010-01-01 87.0
2011-01-01 22.0
2012-01-01 12.0
2013-01-01 1.0
2014-01-01 94.0
2015-01-01 2.0
2016-01-01 81.0
2017-01-01 59.0
Is there a better way to approach this? Also, will each column have to essentially be its own dataframe to work? Any help is greatly appreciated!
This might be a bit more idiomatic:
df2[df.notna()]
or even
df2.where(df.notna())
Note that in these cases (and df1*0 + df2), the operations are done for matching index values of df and df2. For example, df2[df.reset_index(drop=True).notna()] will return all nan because there are no common index values.
This seems to work just fine:
In [34]: d
Out[34]:
x y
Date
2020-01-01 1.0 2.0
2020-01-02 1.0 2.0
2020-01-03 1.0 2.0
2020-01-04 1.0 2.0
2020-01-05 1.0 2.0
2020-01-06 1.0 NaN
2020-01-07 1.0 NaN
2020-01-08 1.0 NaN
2020-01-09 1.0 NaN
2020-01-10 1.0 NaN
2020-01-11 NaN NaN
2020-01-12 NaN NaN
2020-01-13 NaN NaN
2020-01-14 NaN NaN
2020-01-15 NaN NaN
2020-01-16 NaN NaN
2020-01-17 NaN NaN
2020-01-18 NaN NaN
2020-01-19 NaN NaN
2020-01-20 NaN NaN
In [35]: d.apply(lambda col: col.last_valid_index())
Out[35]:
x 2020-01-10
y 2020-01-05
dtype: datetime64[ns]
And then:
In [15]: d.apply(lambda col: col.last_valid_index()).apply(lambda date: df2.loc[date])
Out[15]:
          z
x  0.940396
y  0.564007
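To slice a range per column rather than look up a single row, the same dates can bound .loc; a sketch using the lvi Series from the question (each column ends on a different date, so the result is one Series per column rather than a rectangular frame):
sliced = {col: df2.loc[:lvi[col], col] for col in df2.columns}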
Alright, so after thinking about this for a while and trying to come up with a detailed procedure involving a for loop etc., I came to the conclusion that this simple math operation will do the trick. Basically I am taking advantage of how math is done between Dataframes in pandas.
output = df1*0 + df2
This gives an output based on df2 that takes on the NaN values from df1 and looks like this:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 0.690597 0.443933 0.787931 0.659639 0.363606 0.922373
2001-01-01 0.388669 0.577734 0.450225 0.021592 0.554249 0.305546
2002-01-01 0.578212 0.927848 0.361426 0.840541 0.626881 0.545491
2003-01-01 0.431668 0.128282 0.893351 0.783488 0.122182 0.666194
2004-01-01 0.151491 0.928584 0.834474 0.945401 0.590830 0.802648
2005-01-01 0.113477 0.398326 0.649955 0.202538 0.485927 0.127925
2006-01-01 0.521906 0.458672 0.923632 0.948696 0.638754 0.552753
2007-01-01 0.266599 0.839047 0.099069 0.000928 NaN 0.018146
2008-01-01 0.819810 0.809779 0.706223 0.247780 NaN 0.759691
2009-01-01 0.441574 0.020291 0.702551 0.468862 NaN 0.341191
2010-01-01 0.277030 0.130573 0.906697 0.589474 NaN 0.819986
2011-01-01 0.795344 0.103121 0.846405 0.589916 NaN 0.564411
2012-01-01 0.697255 0.599767 0.206482 0.718980 NaN 0.731366
2013-01-01 0.891771 0.001944 0.703132 0.751986 NaN 0.845933
2014-01-01 0.672579 NaN 0.466981 0.466770 NaN 0.618069
2015-01-01 0.767219 NaN 0.702156 0.370905 NaN 0.481971
2016-01-01 0.315264 NaN 0.793531 0.754920 NaN 0.091432
2017-01-01 0.431651 NaN 0.974520 0.708074 NaN 0.870077
2018-01-01 NaN NaN 0.408743 0.430576 NaN NaN
2019-01-01 NaN NaN 0.751509 0.755521 NaN NaN
2020-01-01 NaN NaN NaN 0.518533 NaN NaN
I was basically wanting to imprint the NaN values from one Dataframe onto another. I cannot believe how difficult I was making this. As long as my Dataframes are the same size this should work fine for my needs.
Now I should be able to take it from here to calculate the percent change from each last valid datapoint. Thank you everyone for the input!
EDIT:
Just to show everyone what I was ultimately trying to accomplish, here is the final code I produced with everyone's help and suggestions!
The original df looked like:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Then I came up with a second full dataframe (df2) with:
df2 = pd.DataFrame({
    "Site 1": np.random.rand(21),
    "Site 2": np.random.rand(21),
    "Site 3": np.random.rand(21),
    "Site 4": np.random.rand(21),
    "Site 5": np.random.rand(21),
    "Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='AS')
df2 = df2.set_index(idx)
Now I imprint the NaN positions from df onto df2:
dfr = df2[df.notna()]
Then I reverse the dataframe so that the most recent date comes first:
dfr = dfr[::-1]
valid_first = dfr.apply(lambda col: col.first_valid_index())
valid_last = dfr.apply(lambda col: col.last_valid_index())
Now I want to calculate the percent change from my last valid data point, which is fixed for each column. This gives me the % change from the present to the past, with respect to the most recent (or last valid) data point.
new = []
for j in dfr:
    m = dfr[j].loc[valid_first[j]:valid_last[j]]
    pc = m / m.iloc[0] - 1
    new.append(pc)
final = pd.concat(new, axis=1)
print(final)
Which gave me:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
2000-01-01 0.270209 -0.728445 -0.636105 0.380330 41.339081 -0.462147
2001-01-01 0.854952 -0.827804 -0.703568 -0.787391 40.588791 -0.884806
2002-01-01 -0.677757 -0.120482 -0.208255 -0.982097 54.348094 -0.483415
2003-01-01 -0.322010 -0.061277 -0.382602 1.025088 5.440808 -0.602661
2004-01-01 1.574451 -0.768251 -0.543260 1.210434 50.494788 -0.859331
2005-01-01 -0.412226 -0.866441 -0.055027 -0.168267 1.346869 -0.385080
2006-01-01 1.280867 -0.640899 0.354513 1.086703 0.000000 0.108504
2007-01-01 1.121585 -0.741675 -0.735990 -0.768578 NaN -0.119436
2008-01-01 -0.210467 -0.376884 -0.575106 -0.779147 NaN 0.055949
2009-01-01 1.864107 -0.966827 0.566590 1.003121 NaN -0.214482
2010-01-01 0.571762 -0.311459 -0.518113 1.036950 NaN -0.513911
2011-01-01 -0.122525 -0.178137 -0.641642 0.197481 NaN 0.033141
2012-01-01 0.403578 -0.829402 0.161753 -0.438578 NaN -0.996595
2013-01-01 0.383481 0.000000 -0.305824 0.602079 NaN -0.057711
2014-01-01 -0.699708 NaN -0.515074 -0.277157 NaN -0.840873
2015-01-01 0.422364 NaN -0.759708 1.230037 NaN -0.663253
2016-01-01 -0.418945 NaN 0.197396 -0.445260 NaN -0.299741
2017-01-01 0.000000 NaN -0.897428 0.669791 NaN 0.000000
2018-01-01 NaN NaN 0.138997 0.486961 NaN NaN
2019-01-01 NaN NaN 0.000000 0.200771 NaN NaN
2020-01-01 NaN NaN NaN 0.000000 NaN NaN
I know these questions often lack context, so here is the final output achieved thanks to your input. Again, thank you to everyone for the help!
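As an aside, the loop above can be collapsed into a vectorized expression; a sketch that divides each column by its first valid value in the reversed frame (it assumes every column has at least one valid point):
base = dfr.apply(lambda col: col.loc[col.first_valid_index()])
final = dfr.div(base, axis=1) - 1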
This is my dataframe. How do I add max_value, min_value, mean_value and median_value rows so that my index values will be like
0
1
2
3
4
max_value
min_value
mean_value
median_value
Could anyone help me in solving this?
If you want to add rows, append the result of DataFrame.agg:
df1 = df.append(df.agg(['max','min','mean','median']))
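On pandas >= 2.0, where DataFrame.append has been removed, the same row-append can be written with pd.concat:
df1 = pd.concat([df, df.agg(['max', 'min', 'mean', 'median'])])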
If you want to add columns, use assign with min, max, mean and median:
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
One way is as follows. Thanks to @jezrael for the help.
df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
df1=df.copy()
# column-wise stats, appended as rows
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
# row-wise stats, appended as columns
df['max']=df1.max(axis=1)
df['min']=df1.min(axis=1)
df['mean']=df1.mean(axis=1)
df['median']=df1.median(axis=1)
Output:
A B C D max min mean median
0 49.0 91.0 16.0 17.0 91.0 16.0 43.25 33.0
1 20.0 42.0 86.0 60.0 86.0 20.0 52.00 51.0
2 32.0 25.0 94.0 13.0 94.0 13.0 41.00 28.5
3 40.0 1.0 66.0 31.0 66.0 1.0 34.50 35.5
4 18.0 30.0 67.0 31.0 67.0 18.0 36.50 30.5
max 49.0 91.0 94.0 60.0 NaN NaN NaN NaN
min 18.0 1.0 16.0 13.0 NaN NaN NaN NaN
mean 31.8 37.8 65.8 30.4 NaN NaN NaN NaN
median 32.0 30.0 67.0 31.0 NaN NaN NaN NaN
This worked well and fine:
df1 = df.copy()
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()
The copy is important: each df.loc[...] assignment appends a row to df, so computing the later statistics from the untouched df1 keeps the rows already appended (e.g. 'max') from skewing 'mean' and 'median'.
I have two Pandas series (d1 and d2) indexed by datetime, each containing float and NaN values. Both indices are at one-day intervals, although the time entries are inconsistent, with many periods of missing days. d1 ranges from 1974-12-16 to 2002-01-30. d2 ranges from 1997-12-19 to 2017-07-06. The period from 1997-12-19 to 2002-01-30 contains many duplicate indices between the two series. The data at duplicated indices is sometimes the same value, sometimes different values, and sometimes one value and NaN.
I would like to combine these two series into one, prioritizing the data from d2 anytime there are duplicate indices (that is, replace all d1 data with d2 data anytime there is a duplicated index). What is the most efficient way to do this among the many Pandas tools available (merge, join, concatenate etc.)?
Here is an example of my data:
In [7]: print d1
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2002-01-01 18.0
2002-01-02 16.0
2002-01-03 NaN
2002-01-04 24.0
2002-01-05 23.0
2002-01-06 15.0
2002-01-07 22.0
2002-01-08 34.0
2002-01-09 35.0
2002-01-10 29.0
2002-01-11 21.0
2002-01-12 24.0
2002-01-13 NaN
2002-01-14 18.0
2002-01-15 14.0
2002-01-16 10.0
2002-01-17 5.0
2002-01-18 7.0
2002-01-19 7.0
2002-01-20 7.0
2002-01-21 11.0
2002-01-22 NaN
2002-01-23 9.0
2002-01-24 8.0
2002-01-25 15.0
2002-01-26 NaN
2002-01-27 NaN
2002-01-28 18.0
2002-01-29 13.0
2002-01-30 13.0
Name: MaxTempMid, dtype: float64
In [8]: print d2
fldDate
1997-12-19 22.0
1997-12-20 14.0
1997-12-21 18.0
1997-12-22 16.0
1997-12-23 16.0
1997-12-24 10.0
1997-12-25 12.0
1997-12-26 12.0
1997-12-27 9.0
1997-12-28 12.0
1997-12-29 18.0
1997-12-30 23.0
1997-12-31 28.0
1998-01-01 26.0
1998-01-02 29.0
1998-01-03 27.0
1998-01-04 22.0
1998-01-05 19.0
1998-01-06 17.0
1998-01-07 14.0
1998-01-08 14.0
1998-01-09 14.0
1998-01-10 16.0
1998-01-11 20.0
1998-01-12 21.0
1998-01-13 19.0
1998-01-14 20.0
1998-01-15 16.0
1998-01-16 17.0
1998-01-17 20.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
Name: MaxTempMid, dtype: float64
Let's use combine_first:
d2.combine_first(d1)
Output:
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
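One caveat: combine_first prefers d2 only where d2 is not NaN, so a NaN in d2 at a duplicated date gets filled from d1. If d2 should win outright on every duplicated index, NaNs included, dropping the overlap from d1 before concatenating is a closer match; a sketch:
combined = pd.concat([d1[~d1.index.isin(d2.index)], d2]).sort_index()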