How can I replace specific values in a time-series DataFrame in pandas?

I have the DataFrame below (date/time is a MultiIndex), and I want to replace the column values from 00:00:00 to 07:00:00 with this numpy array:
[[ 21.63920663 21.62012822 20.9900515 21.23217008 21.19482458
21.10839656 20.89631935 20.79977166 20.99176729 20.91567565
20.87258765 20.76210464 20.50357827 20.55897631 20.38005033
20.38227309 20.54460993 20.37707293 20.08279925 20.09955877
20.02559575 20.12390737 20.2917257 20.20056711 20.1589065
20.41302289 20.48000767 20.55604102 20.70255192]]
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
03:15:00 NaN
03:30:00 NaN
03:45:00 NaN
04:00:00 NaN
04:15:00 NaN
04:30:00 NaN
04:45:00 NaN
05:00:00 NaN
05:15:00 NaN
05:30:00 NaN
05:45:00 NaN
06:00:00 NaN
06:15:00 NaN
06:30:00 NaN
06:45:00 NaN
07:00:00 NaN
07:15:00 NaN
07:30:00 NaN
07:45:00 NaN
08:00:00 NaN
08:15:00 NaN
08:30:00 NaN
08:45:00 NaN
09:00:00 NaN
09:15:00 NaN
09:30:00 NaN
09:45:00 NaN
10:00:00 NaN
10:15:00 NaN
10:30:00 NaN
10:45:00 NaN
11:00:00 NaN
Name: temp, dtype: float64
(The time level holds datetime.time objects, i.e. <class 'datetime.time'>.)
How can I do this?

You can use slicers:
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
Or, if the second index level holds datetime.time objects:
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = 1
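For reference, a df1 like the one in the samples below can be built like this (a sketch; the column name aaa and the 15-minute grid are taken from the printed output, with the second level as strings so the string slicers work; for the datetime.time variant, build that level from datetime.time objects instead):
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2018-01-26', '2018-01-27'])
times = [f'{h:02d}:{m:02d}:00' for h in range(4) for m in (0, 15, 30, 45)][:13]  # 00:00:00 .. 03:00:00
mi = pd.MultiIndex.from_product([dates, times], names=['date', 'time'])
df1 = pd.DataFrame({'aaa': np.nan}, index=mi).sort_index()
df1.loc[(dates[0], '00:00:00'), 'aaa'] = 21.65
df1.loc[(dates[1], '00:00:00'), 'aaa'] = 2.0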
Sample:
print (df1)
aaa
date time
2018-01-26 00:00:00 21.65
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 2.00
00:15:00 NaN
00:30:00 NaN
00:45:00 NaN
01:00:00 NaN
01:15:00 NaN
01:30:00 NaN
01:45:00 NaN
02:00:00 NaN
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
idx = pd.IndexSlice
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = 1
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 1.0
00:30:00 1.0
00:45:00 1.0
01:00:00 1.0
01:15:00 1.0
01:30:00 1.0
01:45:00 1.0
02:00:00 1.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
To assign an array, use numpy.tile to repeat it by the number of unique values in the first index level:
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, 10),len(df1.index.levels[0]))
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
A more general solution generates the array from the length of the slice:
idx = pd.IndexSlice
len0 = df1.loc[idx[df1.index.levels[0][0], '00:00:00':'02:00:00'],:].shape[0]
len1 = len(df1.index.levels[0])
df1.loc[idx[:, '00:00:00':'02:00:00'],:] = np.tile(np.arange(1, len0 + 1), len1)
Tested with times:
import datetime
idx = pd.IndexSlice
arr = np.tile(np.arange(1, 10), len(df1.index.levels[0]))
df1.loc[idx[:, datetime.time(0, 0, 0):datetime.time(2, 0, 0)],:] = arr
print (df1)
aaa
date time
2018-01-26 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
2018-01-27 00:00:00 1.0
00:15:00 2.0
00:30:00 3.0
00:45:00 4.0
01:00:00 5.0
01:15:00 6.0
01:30:00 7.0
01:45:00 8.0
02:00:00 9.0
02:15:00 NaN
02:30:00 NaN
02:45:00 NaN
03:00:00 NaN
EDIT:
One last problem was found: the solution above works with a one-column DataFrame, but when working with a Series you need to drop the trailing ,: (the column slicer):
arr = np.array([[ 21.63920663, 21.62012822, 20.9900515, 21.23217008, 21.19482458, 21.10839656,
20.89631935, 20.79977166, 20.99176729, 20.91567565, 20.87258765, 20.76210464,
20.50357827, 20.55897631, 20.38005033, 20.38227309, 20.54460993, 20.37707293,
20.08279925, 20.09955877, 20.02559575, 20.12390737, 20.2917257, 20.20056711,
20.1589065, 20.41302289, 20.48000767, 20.55604102, 20.70255192]])
import datetime
idx = pd.IndexSlice
df1.loc[idx[:, datetime.time(0, 0, 0): datetime.time(7, 0, 0)]] = arr[0]
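A self-contained sketch of the Series case, reusing arr from above (the temp name and the 15-minute grid come from the question; for several days you would tile arr[0] as shown earlier):
import datetime
import numpy as np
import pandas as pd

times = [(datetime.datetime.min + datetime.timedelta(minutes=15 * i)).time()
         for i in range(45)]  # 00:00:00 .. 11:00:00
mi = pd.MultiIndex.from_product([[datetime.date(2018, 1, 26)], times],
                                names=['date', 'time'])
s = pd.Series(np.nan, index=mi, name='temp')
idx = pd.IndexSlice
s.loc[idx[:, datetime.time(0, 0): datetime.time(7, 0)]] = arr[0]  # fills the 29 slots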

Related

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is replace a NaN value with 0 if it sits between other NaN values, or with the interpolated result if it sits between numeric values. Any idea how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, then replace the remaining missing values:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
You could reindex your dataframe
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'].fillna(0, inplace=True)
or you can replace it:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)
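For completeness, a self-contained version of the interpolate-then-fill approach, using the sample data from the answer above (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'PV_Power': [np.nan, 4.0, np.nan, np.nan, 5.0, np.nan, np.nan]},
                  index=pd.date_range('2017-01-01', periods=7, freq='H'))
df = df.interpolate(limit_area='inside').fillna(0)  # inner gaps interpolated, edge NaNs become 0
print(df)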

Pandas fillna() method not filling all missing values

I have rain and temp data sourced from Environment Canada but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is the dataframe I want to use to fill the NaN values:
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When using fillna, you probably want a method, like filling with the previous/next value or the column mean. We can do something like this:
nulls_index = df['rain_gauge_value'].isnull()
df = df.fillna(method='ffill') # use ffill as example
nulls_after_fill = df[nulls_index]
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
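As a side note, on recent pandas the method= argument of fillna is deprecated; DataFrame.ffill does the same thing:
df = df.ffill()  # equivalent to df.fillna(method='ffill')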
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
import numpy as np
import pandas as pd
from datetime import date, time

raw = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)], temp=[1.,np.nan], rain=[4.,np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015,12,28), date(2015,12,28)], time=[time(0,0,0),time(0,0,1)],temp=[5.,5.],rain=[10.,10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
you need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, you want to patch based on date and time)
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
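combine_first expresses the same patching when the indexes line up: it keeps raw's values and takes patch's only where raw is NaN (an alternative worth naming, not from the answer; note it returns the union of both indexes):
patched = raw.set_index(['date', 'time']).combine_first(patch.set_index(['date', 'time']))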

Finding maximum null values in stretch and generating flag

I have a dataframe with a datetime index and two columns. I have to find the maximum stretch of null values on a particular date for column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'Flag' that carries 1 for every zero imputation in the other two columns and 0 otherwise. In the example below, on January 1st the maximum null stretch is 3 rows, so I have to replace those with zero. Similarly, I have to replicate the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question; here is the link: Python - Find maximum null values in stretch and replacing with 0
First, number the consecutive NaN groups in each column with unique values (the Y ids are multiplied by the frame length so they don't collide with the X ids):
df1 = df.isna()
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get the group id with the maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a mask of the matching rows to set them to 0, and for the Flag column cast the mask to integer (True/False to 1/0):
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
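The whole recipe can be packaged as a function for reuse (a sketch with the same logic as above; the function name is made up and the Y column is hardcoded, as in the answer):
def flag_longest_nan_stretch(df):
    is_na = df.isna()
    # number consecutive NaN runs per day, uniquely per column
    grp = is_na.ne(is_na.groupby(is_na.index.date).shift()).cumsum().where(is_na)
    grp['Y'] *= len(grp)                       # keep X and Y run ids disjoint
    top = grp.stack().value_counts().index[0]  # id of the most frequent, i.e. longest, run
    mask = grp.eq(top).any(axis=1)
    out = df.copy()
    out.loc[mask, :] = 0
    out['Flag'] = mask.astype(int)
    return out

df = flag_longest_nan_stretch(df)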
EDIT:
Added a new condition to match only dates from a list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
# also works when the filtered rows contain no NaNs (prevents IndexError: index 0 is out of bounds):
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0

Pandas: Find first occurrence - on daily basis in a timeseries

I'm struggling with this, so any input is appreciated. I want to iterate over the values in a dataframe column and return the first instance a value is seen each day. Groupby looked like a good option, but when using df.groupby(grouper).first() with the grouper set to daily frequency, the following output is seen.
In [95]:
df.groupby(grouper).first()
Out[95]:
test_1
2014-03-04 1.0
2014-03-05 1.0
This only gives the day the value was seen in test_1 and does not reset first() on a daily basis, which is what I need (see the desired output below).
I want to preserve the time this value was seen in the following format:
This is the input dataframe:
test_1
2014-03-04 09:00:00 NaN
2014-03-04 10:00:00 NaN
2014-03-04 11:00:00 NaN
2014-03-04 12:00:00 NaN
2014-03-04 13:00:00 NaN
2014-03-04 14:00:00 1.0
2014-03-04 15:00:00 NaN
2014-03-04 16:00:00 1.0
2014-03-05 09:00:00 1.0
This is the desired output:
test_1 test_output
2014-03-04 09:00:00 NaN NaN
2014-03-04 10:00:00 NaN NaN
2014-03-04 11:00:00 NaN NaN
2014-03-04 12:00:00 NaN NaN
2014-03-04 13:00:00 NaN NaN
2014-03-04 14:00:00 1.0 1.0
2014-03-04 15:00:00 NaN NaN
2014-03-04 16:00:00 1.0 NaN
2014-03-05 09:00:00 1.0 NaN
I just want to mark the time when an event first occurs in a new column named test_output.
Admins: please note this question is different from the one marked as a duplicate, as this requires a rolling one-day first occurrence.
Try this, using this data:
rng = pd.DataFrame( {'test_1': [None, None,None, None, 1,1, 1 , None, None, None,1 , None, None, None,]}, index = pd.date_range('4/2/2014', periods=14, freq='BH'))
rng
test_1
2014-04-02 09:00:00 NaN
2014-04-02 10:00:00 NaN
2014-04-02 11:00:00 NaN
2014-04-02 12:00:00 NaN
2014-04-02 13:00:00 1.0
2014-04-02 14:00:00 1.0
2014-04-02 15:00:00 1.0
2014-04-02 16:00:00 NaN
2014-04-03 09:00:00 NaN
2014-04-03 10:00:00 NaN
2014-04-03 11:00:00 1.0
2014-04-03 12:00:00 NaN
2014-04-03 13:00:00 NaN
2014-04-03 14:00:00 NaN
The output is this:
rng['test_output'] = rng['test_1'].loc[rng.groupby(pd.Grouper(freq='D'))['test_1'].idxmin()]  # pd.Grouper replaces the removed pd.TimeGrouper
test_1 test_output
2014-04-02 09:00:00 NaN NaN
2014-04-02 10:00:00 NaN NaN
2014-04-02 11:00:00 NaN NaN
2014-04-02 12:00:00 NaN NaN
2014-04-02 13:00:00 1.0 1.0
2014-04-02 14:00:00 1.0 NaN
2014-04-02 15:00:00 1.0 NaN
2014-04-02 16:00:00 NaN NaN
2014-04-03 09:00:00 NaN NaN
2014-04-03 10:00:00 NaN NaN
2014-04-03 11:00:00 1.0 1.0
2014-04-03 12:00:00 NaN NaN
2014-04-03 13:00:00 NaN NaN
2014-04-03 14:00:00 NaN NaN
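As a variant, first_valid_index states the intent directly and avoids relying on idxmin, which can choke when a whole day is NaN (an alternative sketch, not from the answer):
import pandas as pd

first_idx = (rng['test_1']
             .groupby(pd.Grouper(freq='D'))
             .apply(lambda s: s.first_valid_index())
             .dropna())
rng['test_output'] = rng['test_1'].where(rng.index.isin(first_idx))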

Comparing dates of two pandas Dataframes and adding values if dates are similar?

I have two dataframes with different numbers of rows; one has daily values and the second has hourly values. I want to compare them and, where the dates match, add the daily values alongside the hourly values of the same day. The dataframes are:
import pandas as pd
df1 = pd.read_csv(r'C:\Users\ABC.csv')  # raw strings avoid \U escape errors in Windows paths
df2 = pd.read_csv(r'C:\Users\DEF.csv')
df1['Datetime'] = pd.to_datetime(df1['Datetime'])  # parse the column in place instead of overwriting the frame
df2['Datetime'] = pd.to_datetime(df2['Datetime'])
df1.head()
Out [3]
Datetime Value
0 2016-02-02 21:00:00 0.6
1 2016-02-02 22:00:00 0.4
2 2016-02-02 23:00:00 0.4
3 2016-03-02 00:00:00 0.3
4 2016-03-02 01:00:00 0.2
df2.head()
Out [4] Datetime No of people
0 2016-02-02 56
1 2016-03-02 60
2 2016-04-02 91
3 2016-05-02 87
4 2016-06-02 90
What I would like to have is something like this;
Datetime Value No of People
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Any idea how to do this in Python using pandas? Please note there may be some dates missing.
You can set the index of df1 to df1.Datetime.dt.date and then join it with df2:
In [46]: df1.set_index(df1.Datetime.dt.date).join(df2.set_index('Datetime')).reset_index(drop=True)
Out[46]:
Datetime Value No_of_people
0 2016-02-02 21:00:00 0.6 56
1 2016-02-02 22:00:00 0.4 56
2 2016-02-02 23:00:00 0.4 56
3 2016-03-02 00:00:00 0.3 60
4 2016-03-02 01:00:00 0.2 60
Optionally, you may want to pass the how='left' parameter when calling the join() function.
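The same join can also be written with merge on a normalized date column (a sketch, assuming both Datetime columns were parsed with pd.to_datetime; the helper column day is made up, and how='left' keeps every hourly row):
out = (df1.assign(day=df1['Datetime'].dt.date)
          .merge(df2.assign(day=df2['Datetime'].dt.date).drop(columns='Datetime'),
                 on='day', how='left')
          .drop(columns='day'))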
You can just use pd.concat and fillna(method='ffill'), because each daily timestamp matches the first hourly timestamp of its day:
import numpy as np
import pandas as pd
from datetime import date

df1 = pd.DataFrame(data={'day': np.random.randint(low=50, high=100, size=10), 'date': pd.date_range(date(2016,1,1), freq='D', periods=10)})
date day
0 2016-01-01 55
1 2016-01-02 51
2 2016-01-03 92
3 2016-01-04 78
4 2016-01-05 72
df2 = pd.DataFrame(data={'hour': np.random.randint(low=1, high=10, size=100), 'datetime': pd.date_range(date(2016,1,1), freq='H', periods=100)})
datetime hour
0 2016-01-01 00:00:00 5
1 2016-01-01 01:00:00 1
2 2016-01-01 02:00:00 4
3 2016-01-01 03:00:00 5
4 2016-01-01 04:00:00 2
like so:
pd.concat([df2.set_index('datetime'), df1.set_index('date')], axis=1).fillna(method='ffill')
to get:
hour day
2016-01-01 00:00:00 5.0 55.0
2016-01-01 01:00:00 1.0 55.0
2016-01-01 02:00:00 4.0 55.0
2016-01-01 03:00:00 5.0 55.0
2016-01-01 04:00:00 2.0 55.0
2016-01-01 05:00:00 3.0 55.0
2016-01-01 06:00:00 5.0 55.0
2016-01-01 07:00:00 6.0 55.0
2016-01-01 08:00:00 6.0 55.0
2016-01-01 09:00:00 8.0 55.0
2016-01-01 10:00:00 3.0 55.0
2016-01-01 11:00:00 5.0 55.0
2016-01-01 12:00:00 7.0 55.0
2016-01-01 13:00:00 7.0 55.0
2016-01-01 14:00:00 4.0 55.0
2016-01-01 15:00:00 5.0 55.0
2016-01-01 16:00:00 7.0 55.0
2016-01-01 17:00:00 4.0 55.0
2016-01-01 18:00:00 6.0 55.0
2016-01-01 19:00:00 1.0 55.0
2016-01-01 20:00:00 8.0 55.0
2016-01-01 21:00:00 8.0 55.0
2016-01-01 22:00:00 2.0 55.0
2016-01-01 23:00:00 3.0 55.0
2016-01-02 00:00:00 7.0 51.0
2016-01-02 01:00:00 6.0 51.0
2016-01-02 02:00:00 8.0 51.0
2016-01-02 03:00:00 6.0 51.0
2016-01-02 04:00:00 1.0 51.0
2016-01-02 05:00:00 5.0 51.0
... ... ...
2016-01-04 03:00:00 6.0 78.0
2016-01-04 04:00:00 9.0 78.0
2016-01-04 05:00:00 1.0 78.0
2016-01-04 06:00:00 6.0 78.0
2016-01-04 07:00:00 3.0 78.0
2016-01-04 08:00:00 9.0 78.0
2016-01-04 09:00:00 5.0 78.0
2016-01-04 10:00:00 3.0 78.0
2016-01-04 11:00:00 6.0 78.0
2016-01-04 12:00:00 4.0 78.0
2016-01-04 13:00:00 2.0 78.0
2016-01-04 14:00:00 4.0 78.0
2016-01-04 15:00:00 3.0 78.0
2016-01-04 16:00:00 4.0 78.0
2016-01-04 17:00:00 9.0 78.0
2016-01-04 18:00:00 8.0 78.0
2016-01-04 19:00:00 4.0 78.0
2016-01-04 20:00:00 7.0 78.0
2016-01-04 21:00:00 1.0 78.0
2016-01-04 22:00:00 6.0 78.0
2016-01-04 23:00:00 1.0 78.0
2016-01-05 00:00:00 5.0 72.0
2016-01-05 01:00:00 8.0 72.0
2016-01-05 02:00:00 6.0 72.0
2016-01-05 03:00:00 3.0 72.0
2016-01-06 00:00:00 3.0 87.0
2016-01-07 00:00:00 3.0 50.0
2016-01-08 00:00:00 3.0 65.0
2016-01-09 00:00:00 3.0 81.0
2016-01-10 00:00:00 3.0 65.0
