I have data in a table as presented below:
YEAR DOY Hour
2015 1 0
2015 1 1
2015 1 2
2015 1 3
2015 1 4
2015 1 5
This is how I'm reading the file:
df = pd.read_table('data2015.lst', sep=r'\s+')
lines = len(df)
To convert it to a datetime object I do:
dates = []
for l in range(0, lines):
    # .ix was removed in pandas 1.0; .iloc does the same positional lookup here
    date = str(df.iloc[l,0])[:-2] +' '+ str(df.iloc[l,1])[:-2] +' '+ str(df.iloc[l,2])[:-2]
    d = pd.to_datetime(date, format='%Y %j %H')
    dates.append(d)
But this is taking a lot of time.
Is there some way to do it (more directly) without the loop?
You can do it in a single call while reading the file:
from datetime import datetime  # pd.datetime is not available in recent pandas
df = pd.read_csv('file.txt', sep=r'\s+', index_col='Timestamp',
                 parse_dates={'Timestamp': [0, 1, 2]},
                 date_parser=lambda x: datetime.strptime(x, '%Y %j %H'))
Timestamp
2015-01-01 00:00:00
2015-01-01 01:00:00
2015-01-01 02:00:00
2015-01-01 03:00:00
2015-01-01 04:00:00
2015-01-01 05:00:00
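If the parser arguments above are not available in your pandas version, the same timestamps can be built after reading with plain column arithmetic. This is only a sketch: the inline sample data and the names raw and timestamps are made up for illustration, and it assumes DOY is 1-based and Hour runs from 0 to 23.

```python
import io
import pandas as pd

# Small inline sample standing in for the real file (assumption for the demo)
raw = '''YEAR DOY Hour
2015 1 0
2015 1 1
2015 32 5'''
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')

# January 1st of each year, plus (DOY - 1) days, plus Hour hours
timestamps = (pd.to_datetime(df['YEAR'].astype(int).astype(str), format='%Y')
              + pd.to_timedelta(df['DOY'] - 1, unit='D')
              + pd.to_timedelta(df['Hour'], unit='h'))
df.index = pd.DatetimeIndex(timestamps, name='Timestamp')
print(df)
```

Everything here operates on whole columns, so there is no Python-level loop over the rows.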
I have a dataframe with a datetime column for the date and a column for the hour, like this:
min hour date
0 0 2020-12-01
1 5 2020-12-02
2 6 2020-12-01
I need a datetime column that includes both the date and the hour, like this:
min hour date datetime
0 0 2020-12-01 2020-12-01 00:00:00
0 5 2020-12-02 2020-12-02 05:00:00
0 6 2020-12-01 2020-12-01 06:00:00
How can I do it?
Use pd.to_datetime and pd.to_timedelta:
In [393]: df['date'] = pd.to_datetime(df['date'])
In [396]: df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
In [405]: df
Out[405]:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
You could also try using apply and np.timedelta64:
df['datetime'] = df['date'] + df['hour'].apply(lambda x: np.timedelta64(x, 'h'))
print(df)
Output:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
The question does not make the data types of the columns clear, so I assumed they hold plain Python date objects (not pandas timestamps) and that the asker wants a datetime version.
If that is the case, the solution is similar to the previous one but uses a different constructor.
from datetime import datetime
df['datetime'] = df.apply(lambda x: datetime(x.date.year, x.date.month, x.date.day, int(x['hour']), int(x['min'])), axis=1)
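For comparison, here is a vectorized sketch of the same idea that also adds the minutes. It assumes the date column can be parsed by pd.to_datetime and that min and hour are integers; the sample frame just mirrors the one in the question.

```python
import pandas as pd

# Sample frame modelled on the question (values are illustrative)
df = pd.DataFrame({'min': [0, 1, 2],
                   'hour': [0, 5, 6],
                   'date': ['2020-12-01', '2020-12-02', '2020-12-01']})

# Date at midnight, plus the hour and minute offsets as timedeltas
df['datetime'] = (pd.to_datetime(df['date'])
                  + pd.to_timedelta(df['hour'], unit='h')
                  + pd.to_timedelta(df['min'], unit='min'))
print(df)
```

On large frames this avoids calling a Python function once per row, which is what apply with axis=1 does.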
I have columnar data of dates of the form mm-dd as shown. I need to add the correct year (dates October to December are 2017 and dates after 1-1 are 2018) and make a datetime object. The code below works, but it's ugly. Is there a more Pythonic way to accomplish this?
import pandas as pd
from datetime import datetime
import io
data = '''Date
1-3
1-2
1-1
12-21
12-20
12-19
12-18'''
df = pd.read_csv(io.StringIO(data))
for i,s in enumerate(df.Date):
    s = s.split('-')
    if int(s[0]) >= 10:
        s = s[0]+'-'+s[1]+'-17'
    else:
        s = s[0]+'-'+s[1]+'-18'
    df.Date[i] = pd.to_datetime(s)
    print(df.Date[i])
Prints:
2018-01-03 00:00:00
2018-01-02 00:00:00
2018-01-01 00:00:00
2017-12-21 00:00:00
2017-12-20 00:00:00
2017-12-19 00:00:00
2017-12-18 00:00:00
You can convert the dates to pandas datetime objects and then modify their year with datetime.replace. See the docs for more information.
You can use the below code:
df['Date'] = pd.to_datetime(df['Date'], format="%m-%d")
df['Date'] = df['Date'].apply(lambda x: x.replace(year=2017) if x.month in(range(10,13)) else x.replace(year=2018))
Output:
Date
0 2018-01-03
1 2018-01-02
2 2018-01-01
3 2017-12-21
4 2017-12-20
5 2017-12-19
6 2017-12-18
This is one way using pandas vectorised functionality:
import numpy as np
df['Date'] = pd.to_datetime(df['Date'] +
                            np.where(df['Date'].str.split('-').str[0].astype(int).between(10, 12),
                                     '-2017', '-2018'))
print(df)
Date
0 2018-01-03
1 2018-01-02
2 2018-01-01
3 2017-12-21
4 2017-12-20
5 2017-12-19
6 2017-12-18
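A variant of the vectorized approach, sketched with an explicit format string so that month and day cannot be swapped during parsing. The names month and year are illustrative, and the cutoff (months 10 to 12 belong to 2017) is taken from the question.

```python
import io
import pandas as pd

data = '''Date
1-3
1-2
1-1
12-21
12-20
12-19
12-18'''
df = pd.read_csv(io.StringIO(data))

# Month number from the text before the dash
month = df['Date'].str.split('-').str[0].astype(int)
# October to December belong to 2017, everything else to 2018
year = month.ge(10).map({True: '2017', False: '2018'})
df['Date'] = pd.to_datetime(df['Date'] + '-' + year, format='%m-%d-%Y')
print(df)
```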
Sorry I am new to asking questions on stackoverflow so I don't understand how to format properly.
I'm given a pandas DataFrame with a datetime column (date and time) and an associated column that holds some value. The datetimes increment by the hour. I would like to manipulate the DataFrame so that they increment every 15 minutes while retaining the same value. How would I do that? Thanks!
I have tried:
df = df.asfreq('15Min', method='ffill')
But I get an error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the code I originally tried:
df = df.asfreq('15Min', method='ffill')
I was experimenting with other DataFrames and some null values seemed to be causing trouble; once I handled those with a fillna statement, everything worked.
You can use a TimedeltaIndex, but you have to extend the range manually past the last value for the reindex to fill the final hour:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
                        df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample and has the same problem: a new row has to be appended so that the last hour gets filled, and it is dropped again afterwards:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if values are datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
                   df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
Or with resample, starting again from the original df:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
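The reindex idea can be wrapped in a small helper so the 45-minute padding is not hard-coded. This is a sketch only: upsample_hold and its parameters step and span are hypothetical names, and it assumes a sorted DatetimeIndex whose rows are span apart.

```python
import pandas as pd

def upsample_hold(frame, step='15min', span='1h'):
    """Repeat each row's values at every `step` within its `span`."""
    # Last target timestamp: one step before the end of the final span
    last = frame.index.max() + pd.Timedelta(span) - pd.Timedelta(step)
    target = pd.date_range(frame.index.min(), last, freq=step)
    # Forward-fill each original value onto the finer grid
    return frame.reindex(target, method='ffill')

df = pd.DataFrame({'value': [1, 2]},
                  index=pd.to_datetime(['2018-01-01 00:00:00',
                                        '2018-01-01 01:00:00']))
out = upsample_hold(df)
print(out)
```

Computing the padding from span and step means the same function should also cover, say, daily data expanded to hourly rows.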
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15min')
I have a dataframe that has a date column and an hour column.
DATE HOUR
2015-1-1 1
2015-1-1 2
. .
. .
. .
2015-1-1 24
I want to convert these columns into a datetime format something like:
2015-12-26 01:00:00
You could first convert df.DATE to a datetime column and then add df.HOUR as a timedelta via timedelta64[h]:
In [10]: df
Out[10]:
DATE HOUR
0 2015-1-1 1
1 2015-1-1 2
2 2015-1-1 24
In [11]: pd.to_datetime(df.DATE) + df.HOUR.astype('timedelta64[h]')
Out[11]:
0 2015-01-01 01:00:00
1 2015-01-01 02:00:00
2 2015-01-02 00:00:00
dtype: datetime64[ns]
Or use pd.to_timedelta, which also avoids the timedelta64[h] cast (recent pandas versions may reject that resolution):
In [12]: pd.to_datetime(df.DATE) + pd.to_timedelta(df.HOUR, unit='h')
Out[12]:
0 2015-01-01 01:00:00
1 2015-01-01 02:00:00
2 2015-01-02 00:00:00
dtype: datetime64[ns]
I would like to restore a datetime index after a groupby.
The question is how to create a DatetimeIndex when year, month, day and hour are held as separate levels of a MultiIndex.
Given a DataFrame as an example:
import pandas as pd
import numpy as np
index=pd.date_range('2011-1-1 00:00:00', '2011-1-31 23:50:00', freq='10min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
Then, get the mean over each hour using groupby:
day_h = df.groupby([lambda x: x.year, lambda x: x.month, lambda x: x.day,lambda x: x.hour]).mean()
This creates an Index, where year, month, day and hour are in separate columns.
A B
2011 1 1 0 0.209908 1.196164
2011 1 1 1 0.692531 0.518185
2011 1 1 2 1.674748 0.013136
2011 1 1 3 1.674748 0.013136
2011 1 1 4 1.674748 0.013136
2011 1 1 5 1.674748 0.013136
The desired output would be to have DateTime index:
A B
2011-1-1 00:00 0.209908 1.196164
2011-1-1 01:00 0.692531 0.518185
2011-1-1 03:00 1.674748 0.013136
2011-1-1 04:00 1.674748 0.013136
2011-1-1 05:00 1.674748 0.013136
In my files there are some missing rows, so I can't create a new index with 1h timestep.
Someone else on SO had a similar question, but their solution was to use resample. You can avoid resampling by mapping the tuples in the multi-index to create a new index. This will handle missing rows just fine.
import datetime
day_h['new_index'] = day_h.index.map(lambda x: datetime.datetime(x[0], x[1], x[2], x[3]))
day_h.set_index('new_index')
Output:
A B
new_index
2011-01-01 00:00:00 -1.095114 1.995776
2011-01-01 01:00:00 -2.411459 4.508794
2011-01-01 02:00:00 -1.261747 4.953709
2011-01-01 03:00:00 -0.311934 5.454112
2011-01-01 04:00:00 2.095718 6.854375
2011-01-01 05:00:00 1.696756 3.518919
2011-01-01 06:00:00 0.623589 1.740478
2011-01-01 07:00:00 0.544426 0.916016
2011-01-01 08:00:00 2.331326 0.891177
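Another way to rebuild the index, sketched here, is to hand the index levels to pd.to_datetime as a frame of year, month, day and hour columns. The sample data is deterministic rather than random, one hour is dropped on purpose to mimic the missing rows, and the name parts is illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2011-01-01 00:00:00', '2011-01-01 05:50:00', freq='10min')
df = pd.DataFrame(np.arange(len(index) * 2, dtype=float).reshape(-1, 2),
                  columns=['A', 'B'], index=index)
# Drop one full hour to mimic missing rows in the real data
df = df.drop(df.index[df.index.hour == 2])

day_h = df.groupby([df.index.year, df.index.month,
                    df.index.day, df.index.hour]).mean()

# pd.to_datetime assembles timestamps from columns named year/month/day/hour
parts = pd.DataFrame({'year': day_h.index.get_level_values(0),
                      'month': day_h.index.get_level_values(1),
                      'day': day_h.index.get_level_values(2),
                      'hour': day_h.index.get_level_values(3)})
day_h.index = pd.DatetimeIndex(pd.to_datetime(parts))
print(day_h)
```

Because the new index is built only from the groups that exist, the missing hour simply stays absent instead of being filled in.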