I'm trying to upsample my data from daily to hourly frequency and forward fill missing data.
I start with the following code:
df1 = pd.read_csv("DATA.csv")
df1.head(5)
I then used the following to convert to a datetime string and set the date/time as an index:
df1['DT'] = pd.to_datetime(df1['DT']).dt.strftime('%Y-%m-%d %H:%M:%S')
df1.set_index('DT')
I try to resample hourly as follows:
df1['DT'] = df1.resample('H').ffill()
But I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or
PeriodIndex, but got an instance of 'RangeIndex'
I thought my dtype was already datetime because of the pd.to_datetime call above. Nothing I try seems to be working. Can anyone please help me?
My expected output is as follows:
DT VALUE
2016-08-01 00:00:00 0.000000
2016-08-01 01:00:00 0.000000
2016-08-01 02:00:00 0.000000
etc.
The file itself has approximately 1000 rows. The first 50 rows or so are zero, so to show where there's actual data:
DT VALUE
2018-12-13 00:00:00 24000.000000
2018-12-13 01:00:00 24000.000000
2018-12-13 02:00:00 24000.000000
...
2018-12-13 23:00:00 24000.000000
2018-12-14 00:00:00 26000.000000
2018-12-14 01:00:00 26000.000000
etc.
Try assigning it back:
df1=df1.set_index('DT')
Or
df1.set_index('DT',inplace=True)
I am assuming the initial rows of your dataset look like this, as you mentioned:
DT VALUE
0 2016-08-01 0
1 2016-08-02 0
2 2016-08-03 0
3 2016-08-04 0
4 2016-08-05 0
5 2016-08-06 0
6 2016-08-07 0
7 2016-08-08 0
8 2016-08-09 0
Then set DT as the index like this:
df = df.set_index('DT')
df
Output:
VALUE
DT
2016-08-01 0
2016-08-02 0
2016-08-03 0
2016-08-04 0
2016-08-05 0
2016-08-06 0
2016-08-07 0
2016-08-08 0
2016-08-09 0
Now resample your dataframe:
df = df.resample('H').ffill()
df
Output (showing the first few values):
VALUE
DT
2016-08-01 00:00:00 0
2016-08-01 01:00:00 0
2016-08-01 02:00:00 0
2016-08-01 03:00:00 0
2016-08-01 04:00:00 0
2016-08-01 05:00:00 0
2016-08-01 06:00:00 0
2016-08-01 07:00:00 0
2016-08-01 08:00:00 0
2016-08-01 09:00:00 0
2016-08-01 10:00:00 0
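Putting the pieces together, here is a minimal end-to-end sketch of the fix, assuming DATA.csv has the DT and VALUE columns described in the question:
import pandas as pd

df1 = pd.read_csv("DATA.csv")
df1['DT'] = pd.to_datetime(df1['DT'])   # keep real datetimes; no strftime() needed
df1 = df1.set_index('DT')               # assign the result back
df1 = df1.resample('H').ffill()         # works now that the index is a DatetimeIndex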
You could convert the index to a pd.DatetimeIndex and then resample that. I also don't think you need (or want) the strftime() call:
df1 = pd.read_csv("DATA.csv")
df1['DT'] = pd.to_datetime(df1['DT'])
df1 = df1.set_index('DT')
df1.index = pd.DatetimeIndex(df1.index)
df1 = df1.resample('H').ffill()
NOTE: You could probably combine a bunch of this and it would still be quite clear, like:
df1 = pd.read_csv("DATA.csv")
df1.index = pd.DatetimeIndex(pd.to_datetime(df1['DT']))
df1 = df1.resample('H').ffill()
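As a quick sanity check, here is a self-contained sketch of the same approach with a small inline frame of made-up daily values standing in for DATA.csv:
import pandas as pd

# Made-up daily data standing in for DATA.csv
df1 = pd.DataFrame({'DT': ['2016-08-01', '2016-08-02', '2016-08-03'],
                    'VALUE': [0.0, 0.0, 0.0]})
df1.index = pd.DatetimeIndex(pd.to_datetime(df1['DT']))
df1 = df1.drop(columns='DT').resample('H').ffill()
print(df1.head())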
So I have a dataset with a specific date attached to every data point. I want to fill these values, according to their specific dates, into an Excel sheet that contains the date range of the whole year. The dates start at 01-01-2020 00:00:00 and end at 31-12-2020 23:45:00 with a frequency of 15 minutes, so there will be a total of 35040 date-time values in Excel.
my data is like:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see, these values are not continuous, but they each have a specific date. I want to use these date values as the index, place each value at its particular date in the Excel date column, and put zero for the missing values. Can someone please help?
Use DataFrame.reindex with date_range, so 0 is added for all datetimes that do not exist:
rng = pd.date_range('2020-01-01','2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
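For completeness, here is a runnable sketch that builds the small frame from the question and applies the same reindex; the only addition is dayfirst=True, since the question's dates appear to be day-first:
import pandas as pd

df = pd.DataFrame({'load': [12, 21, 23, 54, 23],
                   'date': ['01-02-2020 06:30:00', '29-04-2020 03:45:00',
                            '02-07-2020 12:15:00', '07-08-2020 16:00:00',
                            '22-09-2020 16:30:00']})

rng = pd.date_range('2020-01-01', '2020-12-31 23:45:00', freq='15Min')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # day-first dates, as in the question
df = df.set_index('date').reindex(rng, fill_value=0)
print(df.loc['2020-02-01 06:00:00':'2020-02-01 07:00:00'])  # shows the 12 placed at 06:30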
I have a dataframe including a datetime column for date and a column for hour.
like this:
min hour date
0 0 2020-12-01
1 5 2020-12-02
2 6 2020-12-01
I need a datetime column including both date and hour.
like this :
min hour date datetime
0 0 2020-12-01 2020-12-01 00:00:00
0 5 2020-12-02 2020-12-02 05:00:00
0 6 2020-12-01 2020-12-01 06:00:00
How can I do it?
Use pd.to_datetime and pd.to_timedelta:
In [393]: df['date'] = pd.to_datetime(df['date'])
In [396]: df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
In [405]: df
Out[405]:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
You could also try using apply and np.timedelta64:
df['datetime'] = df['date'] + df['hour'].apply(lambda x: np.timedelta64(x, 'h'))
print(df)
Output:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
In the original question the data types of the columns are not clear, so I assumed they are Python date objects (not pandas Timestamps) and that the datetime version is wanted.
If that is the case, the solution is similar to the previous one, but uses a different constructor.
from datetime import datetime
df['datetime'] = df.apply(lambda x: datetime(x.date.year, x.date.month, x.date.day, int(x['hour']), int(x['min'])), axis=1)
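For completeness, here is a self-contained version of the pd.to_timedelta approach, using the sample rows from the question:
import pandas as pd

df = pd.DataFrame({'min': [0, 1, 2],
                   'hour': [0, 5, 6],
                   'date': ['2020-12-01', '2020-12-02', '2020-12-01']})

df['date'] = pd.to_datetime(df['date'])
df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
print(df)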
I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am. I'd like to add 24 hrs to the times after midnight, so the times read 08:00:00 to 27:00:00. The problem is the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via:
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though, specifically whether I can differentiate between 8:00:00 am and the times after midnight (1 am to 3 am).
Add a day offset for times that fall after midnight and before when a new "day" is supposed to begin (pick some time after 3 am and before 7 am), then sort the values:
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string-formatting with the appropriate timedelta components
Ex:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
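Putting it together, here is a runnable sketch of this approach using the question's data; the only extra step (not in the answer above) is stripping the redundant am/pm suffixes so that pd.to_timedelta parses the strings cleanly:
import pandas as pd

d = {'time': ['08:00:00 am', '12:00:00 pm', '16:00:00 pm', '20:00:00 pm',
              '2:00:00 am', '13:00:00 pm', '3:00:00 am']}
df = pd.DataFrame(data=d)

# Strip ' am'/' pm' (the values are already in 24-hour form) and parse as timedeltas
df['time'] = pd.to_timedelta(df['time'].str.replace(' am| pm', '', regex=True))

cutoff, day = pd.to_timedelta(['3.5H', '24H'])
ordered = df['time'].apply(lambda x: x if x > cutoff else x + day).sort_values()
print(ordered.reset_index(drop=True))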
Sorry I am new to asking questions on stackoverflow so I don't understand how to format properly.
So I'm given a Pandas dataframe that contains a datetime column (holding both the date and the time) and an associated column that contains some sort of value. The given dates and times are incremented by the hour. I would like to manipulate the dataframe so that they increment every 15 minutes but retain the same value. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min', method='ffill')
But I get a error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the initial code I tried above, df = df.asfreq('15Min', method='ffill'). I was messing around with other DataFrames and seemed to be having trouble with some null values, so I took care of that with fillna statements and everything worked.
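For reference, here is a minimal sketch of that asfreq call, assuming the datetime column has first been parsed and set as the index (the 2018-01-01 date is borrowed from the answer below):
import pandas as pd

df = pd.DataFrame({'datetime': ['2018-01-01 00:00:00', '2018-01-01 01:00:00'],
                   'value': [1, 2]})
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')

df = df.asfreq('15Min', method='ffill')
print(df)
Note that, like resample, asfreq stops at the last existing timestamp (01:00:00 here), which is why the answer below appends an extra value before reindexing or resampling in order to also produce the 01:15 to 01:45 rows.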
You can use a TimedeltaIndex, but it is necessary to manually add the last value for a correct reindex:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample, with the same caveat: a new value needs to be appended so that the last values are filled correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
Or with resample:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15T')
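That call only builds the index itself; here is a sketch of how it might be combined with reindex to produce the 15-minute rows (using explicit dates so the example is unambiguous):
import pandas as pd

df = pd.DataFrame({'value': [1, 2]},
                  index=pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00']))

rng = pd.date_range('2018-01-01 00:00:00', '2018-01-01 01:45:00', freq='15T')
df = df.reindex(rng, method='ffill')
print(df)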
If df is a DataFrame indexed by datetime objects, the following code splits it into the list groups_list, where each element contains all the data in df that belongs to a given day:
groupby_clause = [df.index.year,df.index.month,df.index.day]
groups_list = [group[1] for group in df.groupby(groupby_clause)]
I am having trouble, though, understanding how the grouping is actually made, since I don't need to label the elements of groupby_clause as year, month, and day for the grouping to be performed on the datetime index.
Maybe I'm missing something obvious, but I don't get it: how does pandas know that it should associate groupby_clause[0] to year, groupby_clause[1] to month, and groupby_clause[2] to day in order to group the dataframe indexes that have DateTime type?
Suppose you have a DataFrame like this:
0
2011-01-01 00:00:00 -0.324398
2011-01-01 01:00:00 -0.761585
2011-01-01 02:00:00 0.057204
2011-01-01 03:00:00 -1.162510
2011-01-01 04:00:00 -0.680896
2011-01-01 05:00:00 -0.701835
2011-01-01 06:00:00 -0.431338
2011-01-01 07:00:00 0.306935
2011-01-01 08:00:00 -0.503177
2011-01-01 09:00:00 -0.507444
2011-01-01 10:00:00 0.230590
2011-01-01 11:00:00 -2.326702
2011-01-01 12:00:00 -0.034664
2011-01-01 13:00:00 0.224373
2011-01-01 14:00:00 -0.242884
If you want the index to be year, month, and day, then just set_index it:
df.set_index([df.index.year, df.index.month, df.index.day])
Output
0
2011 1 1 -0.324398
1 -0.761585
1 0.057204
1 -1.162510
1 -0.680896
1 -0.701835
1 -0.431338
1 0.306935
1 -0.503177
1 -0.507444
1 0.230590
1 -2.326702
1 -0.034664
1 0.224373
1 -0.242884
1 -0.134757
1 -1.177362
1 0.931335
1 0.904084
1 -0.757860
1 0.406597
1 -0.664150
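To see the positional grouping in action, here is a minimal sketch, assuming an hourly DatetimeIndex like the one above (the random values are just placeholders). pandas does not attach any year/month/day meaning to the entries of groupby_clause; each element is simply an array aligned with the rows of df, and rows are grouped by the tuple of values at each position:
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', periods=48, freq='H')   # two days, hourly
df = pd.DataFrame({0: np.random.randn(48)}, index=idx)

# Group by row-aligned arrays: one group per unique (year, month, day) tuple
groupby_clause = [df.index.year, df.index.month, df.index.day]
groups_list = [group for _, group in df.groupby(groupby_clause)]

print(len(groups_list))                                    # 2 groups, one per calendar day
print(groups_list[0].index.min(), groups_list[0].index.max())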