I have a dataset where every value comes with a specific date. I want to place these values at their matching dates in an Excel sheet that holds the date range of the whole year: the dates run from 01-01-2020 00:00:00 to 31-12-2020 23:45:00 at a 15-minute frequency, so there are 35,136 date-time values in total (2020 is a leap year).
my data is like:
load date
12 01-02-2020 06:30:00
21 29-04-2020 03:45:00
23 02-07-2020 12:15:00
54 07-08-2020 16:00:00
23 22-09-2020 16:30:00
As you can see, these values are not continuous, but each comes with a specific date. I want to use these dates as the index, put each value at its particular date in the Excel date column, and put zero at the missing dates. Can someone please help?
Use DataFrame.reindex with date_range, which inserts 0 for every datetime that does not exist in the data:
rng = pd.date_range('2020-01-01', '2020-12-31 23:45:00', freq='15min')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # the sample dates are day-first (DD-MM-YYYY)
df = df.set_index('date').reindex(rng, fill_value=0)
print (df)
load
2020-01-01 00:00:00 0
2020-01-01 00:15:00 0
2020-01-01 00:30:00 0
2020-01-01 00:45:00 0
2020-01-01 01:00:00 0
...
2020-12-31 22:45:00 0
2020-12-31 23:00:00 0
2020-12-31 23:15:00 0
2020-12-31 23:30:00 0
2020-12-31 23:45:00 0
[35136 rows x 1 columns]
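For reference, an end-to-end sketch of this answer that builds the sample frame from the question first (note the dates are day-first, so an explicit format is safest):

```python
import pandas as pd

# Sample data from the question; 01-02-2020 means 1 February.
df = pd.DataFrame({
    'load': [12, 21, 23, 54, 23],
    'date': ['01-02-2020 06:30:00', '29-04-2020 03:45:00',
             '02-07-2020 12:15:00', '07-08-2020 16:00:00',
             '22-09-2020 16:30:00'],
})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M:%S')

# Full-year 15-minute grid; 2020 is a leap year, so 366 * 96 = 35136 rows.
rng = pd.date_range('2020-01-01', '2020-12-31 23:45:00', freq='15min')
out = df.set_index('date').reindex(rng, fill_value=0)
```

Every timestamp present in the data keeps its load; every other slot on the grid gets 0.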
Related
I have a dataframe including a datetime column for date and a column for hour.
like this:
min hour date
0 0 2020-12-01
1 5 2020-12-02
2 6 2020-12-01
I need a datetime column including both date and hour.
like this :
min hour date datetime
0 0 2020-12-01 2020-12-01 00:00:00
1 5 2020-12-02 2020-12-02 05:00:00
2 6 2020-12-01 2020-12-01 06:00:00
How can I do it?
Use pd.to_datetime and pd.to_timedelta:
df['date'] = pd.to_datetime(df['date'])
df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
print(df)
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
You could also try using apply and np.timedelta64:
import numpy as np

df['datetime'] = df['date'] + df['hour'].apply(lambda x: np.timedelta64(x, 'h'))
print(df)
Output:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
The question does not make the column data types clear, so I assumed they are plain Python date objects (not pandas timestamps) and that the asker wants the datetime version. If that is the case, the solution is similar to the previous one, but uses a different constructor:
from datetime import datetime
df['datetime'] = df.apply(lambda x: datetime(x.date.year, x.date.month, x.date.day, int(x['hour']), int(x['min'])), axis=1)
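If the min column should contribute minutes as well, the vectorized timedelta approach extends naturally. A sketch, assuming integer hour and min columns:

```python
import pandas as pd

df = pd.DataFrame({'min': [0, 1, 2],
                   'hour': [0, 5, 6],
                   'date': ['2020-12-01', '2020-12-02', '2020-12-01']})

# Add both offsets as timedeltas; no row-wise apply needed.
df['datetime'] = (pd.to_datetime(df['date'])
                  + pd.to_timedelta(df['hour'], unit='h')
                  + pd.to_timedelta(df['min'], unit='min'))
```

This stays vectorized, so it scales better than the apply-based variants on large frames.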
I have two dataframes with timestamps. I want to select the timestamps from df1 that equal the 'start_show' timestamps of df2, but also keep all df1 timestamps from 2 hours before to 2 hours after each match.
df1:
van_timestamp weekdag
2880 2016-11-19 00:00:00 6
2881 2016-11-19 00:15:00 6
2882 2016-11-19 00:30:00 6
... ... ...
822349 2019-11-06 22:45:00 3
822350 2019-11-06 23:00:00 3
822351 2019-11-06 23:15:00 3
df2:
einde_show start_show
255 2016-01-16 22:00:00 2016-01-16 20:00:00
256 2016-01-23 21:30:00 2016-01-23 19:45:00
257 2016-01-26 21:30:00 2016-01-26 19:45:00
... ... ...
1111 2019-12-29 18:30:00 2019-12-29 17:00:00
1112 2019-12-30 15:00:00 2019-12-30 13:30:00
1113 2019-12-30 18:30:00 2019-12-30 17:00:00
df1 contains a timestamp every 15 minutes of every day whereas df2['start_show'] contains just a single timestamp per day.
So ultimately what I want to achieve is that for every timestamp of df2 I have the corresponding timestamp of df1 +- 2 hours.
So far I've tried:
df1['van_timestamp'][df1['van_timestamp'].isin(df2['start_show'])]
This selects the right timestamps. Now I want to select everything from df1 within ± pd.Timedelta(2, unit='h') of those timestamps.
But I'm not sure how to go about this. Help would be much appreciated!
Thanks!
I got it working (ugly fix). I created a datetime range covering ±2 hours around each show start:
dates = [pd.date_range(start=df2['start_show'].iloc[i] - pd.Timedelta(2, unit='h'), end=df2['start_show'].iloc[i] + pd.Timedelta(2, unit='h'), freq='15T') for i in range(len(df2))]
Which I then flattened:
dates = [i for sublist in dates for i in sublist]
Afterwards I compared the dataframe with this list.
relevant_timestamps = df1[df1['van_timestamp'].isin(dates)]
If anyone else has a better solution, please let me know!
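One simpler alternative sketch: instead of materializing and flattening all the ranges, OR together a between mask per show. The tiny frames below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: one day of 15-minute timestamps and a single show.
df1 = pd.DataFrame({'van_timestamp': pd.date_range('2016-11-19', periods=96, freq='15min')})
df2 = pd.DataFrame({'start_show': pd.to_datetime(['2016-11-19 12:00:00'])})

window = pd.Timedelta(hours=2)
mask = pd.Series(False, index=df1.index)
for t in df2['start_show']:
    # Keep every df1 row within +/- 2 hours of this show start.
    mask |= df1['van_timestamp'].between(t - window, t + window)

relevant = df1[mask]
```

This also works when df1 timestamps are not perfectly aligned to the 15-minute grid, since it tests membership in the interval rather than equality against generated range values.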
I have the following dataframe:
;h0;h1;h2;h3;h4;h5;h6;h7;h8;h9;h10;h11;h12;h13;h14;h15;h16;h17;h18;h19;h20;h21;h22;h23
2017-01-01;52.72248155184351;49.2949899678983;46.57492391198069;44.087373768731766;44.14801243124734;42.17606224526609;43.18529986793594;39.58391124876044;41.63499969987035;41.40594457169249;47.58107920806581;46.56963630932529;47.377935483897694;37.99479190229543;38.53347417483357;40.62674178535282;45.81503347748674;49.0590694393733;52.73183568074295;54.37213882189341;54.737087166843295;50.224872755157314;47.874441844531056;47.8848916244788
2017-01-02;49.08874087825248;44.998912615866075;45.92457207636786;42.38001388673675;41.66922093408655;43.02027406525752;49.82151473221541;53.23401784350719;58.33805556091773;56.197239473200206;55.7686948361035;57.03099874898539;55.445563603040405;54.929102019056195;55.85170734639889;57.98929007227575;56.65821961018764;61.01309728212006;63.63384537162659;61.730431501017684;54.40180394585544;50.27375006416599;51.229656340500156;51.22066846069472
2017-01-03;50.07885876956572;47.00180020415448;44.47243045246001;42.62192562660052;40.15465704760352;43.48422695796396;50.01631022884173;54.8674584250141;60.434849010428685;61.47694796693493;60.766557330286844;59.12019178422993;53.97447369962696;51.85242030255539;53.604945764469065;56.48188852869667;59.12301823257856;72.05688032286155;74.61342126987793;70.76845988290785;64.13311592022278;58.7237387203283;55.2422389373486;52.63648285910918
As you can see, the rows are days and the columns are hours.
I would like to create a new dataframe with only two columns:
the first with the day (including the hour) and the second with the value. Something like the following:
2017-01-01 00:00:00 ; 52.72248
2017-01-01 01:00:00 ; 49.2949899678983
...
I could create a new dataframe and use a loop to fill it. This is what I do now:
icount = 0
for idd in range(0, 365):
    for ih in range(0, 24):
        df.loc[df.index.values[icount]] = ecodf.iloc[idd, ih]
        icount = icount + 1
What do you think?
Thanks
Turn the column names into a new column, convert them to hours, and use pd.to_datetime:
s = df.stack()
pd.concat([
    pd.to_datetime(
        s.reset_index()
         .replace({'level_1': r'h(\d+)'}, {'level_1': '\\1:00'}, regex=True)
         [['level_0', 'level_1']]
         .apply(' '.join, axis=1)),
    s.reset_index(drop=True)],
    axis=1, sort=False)
0 1
0 2017-01-01 00:00:00 52.722482
1 2017-01-01 01:00:00 49.294990
2 2017-01-01 02:00:00 46.574924
3 2017-01-01 03:00:00 44.087374
4 2017-01-01 04:00:00 44.148012
.. ... ...
67 2017-01-03 19:00:00 70.768460
68 2017-01-03 20:00:00 64.133116
69 2017-01-03 21:00:00 58.723739
70 2017-01-03 22:00:00 55.242239
71 2017-01-03 23:00:00 52.636483
[72 rows x 2 columns]
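The same stack-based reshaping can be sketched a bit more directly by stripping the h prefix and adding the hours as a timedelta. A sketch on a tiny made-up version of the frame:

```python
import pandas as pd

# Assumed tiny version of the frame: index = date strings, columns = h0..h23.
df = pd.DataFrame(
    [[52.7, 49.3], [49.1, 45.0]],
    index=['2017-01-01', '2017-01-02'],
    columns=['h0', 'h1'],
)

s = df.stack()  # MultiIndex of (date string, hour label) -> value
hours = s.index.get_level_values(1).str.lstrip('h').astype(int)
out = pd.DataFrame({
    'datetime': pd.to_datetime(s.index.get_level_values(0)) + pd.to_timedelta(hours, unit='h'),
    'value': s.to_numpy(),
})
```

This avoids the regex replace and the string join, operating on the index levels directly.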
I'm trying to upsample my data from daily to hourly frequency and forward fill missing data.
I start with the following code:
df1 = pd.read_csv("DATA.csv")
df1.head(5)
I then used the following to convert to a datetime string and set the date/time as an index:
df1['DT'] = pd.to_datetime(df1['DT']).dt.strftime('%Y-%m-%d %H:%M:%S')
df1.set_index('DT')
I try to resample hourly as follows:
df1['DT'] = df1.resample('H').ffill()
But I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or
PeriodIndex, but got an instance of 'RangeIndex'
I thought my dtype was already datetime, as set by the pd.to_datetime call above. Nothing I try seems to work. Can anyone please help me?
My expected output is as follows:
DT VALUE
2016-08-01 00:00:00 0.000000
2016-08-01 01:00:00 0.000000
2016-08-01 02:00:00 0.000000
etc.
The file itself has approximately 1000 rows. The first 50 rows or so are zero so to clarify where there's actual data:
DT VALUE
2018-12-13 00:00:00 24000.000000
2018-12-13 01:00:00 24000.000000
2018-12-13 02:00:00 24000.000000
...
2018-12-13 23:00:00 24000.000000
2018-12-14 00:00:00 26000.000000
2018-12-14 01:00:00 26000.000000
etc.
Try assigning it back:
df1=df1.set_index('DT')
Or
df1.set_index('DT',inplace=True)
Assume the initial rows of your dataset look like this, as you mentioned:
DT VALUE
0 2016-08-01 0
1 2016-08-02 0
2 2016-08-03 0
3 2016-08-04 0
4 2016-08-05 0
5 2016-08-06 0
6 2016-08-07 0
7 2016-08-08 0
8 2016-08-09 0
Then set DT as the index:
df = df.set_index('DT')
df
Output:
VALUE
DT
2016-08-01 0
2016-08-02 0
2016-08-03 0
2016-08-04 0
2016-08-05 0
2016-08-06 0
2016-08-07 0
2016-08-08 0
2016-08-09 0
Now, resample your dataframe,
df = df.resample('H').ffill()
df
Output (first few rows):
VALUE
DT
2016-08-01 00:00:00 0
2016-08-01 01:00:00 0
2016-08-01 02:00:00 0
2016-08-01 03:00:00 0
2016-08-01 04:00:00 0
2016-08-01 05:00:00 0
2016-08-01 06:00:00 0
2016-08-01 07:00:00 0
2016-08-01 08:00:00 0
2016-08-01 09:00:00 0
2016-08-01 10:00:00 0
You could convert the index to a pd.DatetimeIndex and then resample that. I also don't think you need (or want) the strftime() call:
df1 = pd.read_csv("DATA.csv")
df1['DT'] = pd.to_datetime(df1['DT'])
df1 = df1.set_index('DT')
df1.index = pd.DatetimeIndex(df1.index)
df1 = df1.resample('H').ffill()
NOTE: You could probably combine a bunch of this and it would still be quite clear, like:
df1 = pd.read_csv("DATA.csv")
df1.index = pd.DatetimeIndex(pd.to_datetime(df1['DT']))
df1 = df1.resample('H').ffill()
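End to end, the fixed pipeline can be sketched with made-up two-row data in place of the CSV:

```python
import pandas as pd

# Stand-in for pd.read_csv("DATA.csv"): two daily rows.
df1 = pd.DataFrame({'DT': ['2016-08-01', '2016-08-02'],
                    'VALUE': [0.0, 24000.0]})

df1['DT'] = pd.to_datetime(df1['DT'])   # keep real datetimes; no strftime
df1 = df1.set_index('DT')               # assign the result back
df1 = df1.resample('H').ffill()         # upsample daily -> hourly, forward-fill
```

The key points are that the index must actually be a DatetimeIndex before resampling (hence assigning set_index back), and that resample returns a new frame rather than a column.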
I have a dataframe of Ids and dates.
id date
1 2010-03-09 00:00:00
1 2010-05-28 00:00:00
1 2010-10-12 00:00:00
1 2010-12-10 00:00:00
1 2011-07-11 00:00:00
I'd like to reshape the dataframe so that I have one date in one column, and the next date adjacent in another column. See below
id date date2
1 2010-03-09 00:00:00 2010-05-28 00:00:00
1 2010-05-28 00:00:00 2010-10-12 00:00:00
1 2010-10-12 00:00:00 2010-12-10 00:00:00
1 2010-12-10 00:00:00 2011-07-11 00:00:00
How can I achieve this?
df['date2'] = df.date.shift(-1)  # shift the date column up one row and assign it back as a new column
df = df.dropna()                 # the last row will have NaN for date2; drop it if you don't need it
# id date date2
#0 1 2010-03-09 00:00:00 2010-05-28 00:00:00
#1 1 2010-05-28 00:00:00 2010-10-12 00:00:00
#2 1 2010-10-12 00:00:00 2010-12-10 00:00:00
#3 1 2010-12-10 00:00:00 2011-07-11 00:00:00
Looks like Psidom has a swaggy answer already ... but since I was already at it:
df_new = df.iloc[:-1].copy()  # copy to avoid a SettingWithCopyWarning
df_new['date2'] = df.date.values[1:]
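If the frame held several ids rather than a single one, the same shift would need to happen per group so a date never leaks from one id into another. A sketch with hypothetical two-id data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'date': pd.to_datetime(['2010-03-09', '2010-05-28',
                                           '2010-10-12', '2010-12-10'])})

# Shift within each id; the last row of each group gets NaT, not the next id's date.
df['date2'] = df.groupby('id')['date'].shift(-1)
df = df.dropna(subset=['date2'])
```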