I am trying to assign a column to an existing df. Specifically, certain timestamps get sorted, but the current output is a separate series. I'd like to append this back to the df.
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
Output:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I've altered
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
But this produces
time code
0 08:00:00 A
1 12:00:00 B
2 13:00:00 C
3 16:00:00 A
4 20:00:00 B
5 26:00:00 C
6 27:00:00 A
As you can see, the timestamps aren't aligned with their respective code values after sorting.
The intended output is:
time code
0 08:00:00 A
1 12:00:00 B
2 13:00:00 C
3 16:00:00 C
4 20:00:00 A
5 26:00:00 B
6 27:00:00 A
Removing reset_index(drop=True) from your code and sorting later may work for you.
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
x = df.time.apply(lambda x: x if x > cutoff else x + day).dt.components
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
df = df.sort_values('time')
print(df)
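An alternative sketch that avoids apply and the am/pm strings entirely (the plain HH:MM:SS inputs here are an assumption for illustration): shift the early-morning times with Series.where, then sort the whole frame so time and code move together.

```python
import pandas as pd

d = {'time': ['08:00:00', '12:00:00', '16:00:00', '20:00:00',
              '02:00:00', '13:00:00', '03:00:00'],
     'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A']}
df = pd.DataFrame(d)

t = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5h', '24h'])

# keep times above the cutoff; push times at/below it into the next day
df['time'] = t.where(t > cutoff, t + day)

# sorting the whole frame keeps 'time' and 'code' aligned
df = df.sort_values('time').reset_index(drop=True)
```

Because the entire DataFrame is sorted here (not just the one column), reset_index(drop=True) is safe: both columns are reordered together before the index is dropped.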
Pandas aligns data via the index. reset_index(drop=True) discarded the original index, so the sorted time column was assigned back positionally rather than by label. This is probably why you didn't get what you wanted.
Original time column:
0 08:00:00
1 12:00:00
2 16:00:00
3 20:00:00
4 02:00:00
5 13:00:00
6 03:00:00
After sort_values():
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
After reset_index(drop=True):
0 02:00:00
1 03:00:00
2 08:00:00
3 12:00:00
4 13:00:00
5 16:00:00
6 20:00:00
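The alignment behavior above can be sketched with a tiny toy frame (hypothetical data): assigning a Series with a shuffled index matches rows by label, not by position.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})     # index 0, 1, 2
s = pd.Series([1, 2, 3], index=[2, 0, 1])  # shuffled index

df['b'] = s  # pandas aligns on index labels, not positions
# row 0 gets s[0] == 2, row 1 gets s[1] == 3, row 2 gets s[2] == 1
```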
For comparison, here is your original code with reset_index left in; the second print shows the misalignment:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
'code' : ['A','B','C','A','B','C','A'],
})
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
print(df)
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
df['time'] = x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
print(df)
Related
This is my dataframe:
It's got 455 rows covering a period of days, in 4-hour steps per row.
I need to replace each 'demand' value with 0 if the timestamp's hour is 23,
so I wrote this:
datadf['value']=datadf['timestamp'].apply(lambda x, y=datadf['value']: 0 if x.hour==23 else y)
I know the y value is wrong, but I couldn't find a way to refer to the same row's 'demand' value inside the lambda.
How can I refer to that demand value? Is there an alternative where my else does nothing?
import pandas as pd
import numpy as np
#data preparation
df = pd.DataFrame()
df['date'] = pd.date_range(start='2022-06-01',periods=7,freq='4h') + pd.Timedelta('3H')
df['val'] = np.random.rand(7)
print(df)
>>
date val
0 2022-06-01 03:00:00 0.601889
1 2022-06-01 07:00:00 0.017787
2 2022-06-01 11:00:00 0.290662
3 2022-06-01 15:00:00 0.179150
4 2022-06-01 19:00:00 0.763534
5 2022-06-01 23:00:00 0.680892
6 2022-06-02 03:00:00 0.585380
# if your dates are not in datetime format, you must convert them first
df['date'] = pd.to_datetime(df['date'])
df.loc[df['date'].dt.hour == 23, 'val'] = 0
# if you don't want to change the data in the "demand" column, you can copy it first
#df['val_2'] = df['val']
#df.loc[df['date'].dt.hour == 23, 'val_2'] = 0
print(df)
>>
date val
0 2022-06-01 03:00:00 0.601889
1 2022-06-01 07:00:00 0.017787
2 2022-06-01 11:00:00 0.290662
3 2022-06-01 15:00:00 0.179150
4 2022-06-01 19:00:00 0.763534
5 2022-06-01 23:00:00 0.000000
6 2022-06-02 03:00:00 0.585380
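An equivalent sketch using Series.mask instead of a .loc assignment (toy data; the column names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-06-01 03:00', periods=7, freq='4h'),
    'val': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# mask replaces values where the condition is True, leaving the rest untouched
df['val'] = df['val'].mask(df['date'].dt.hour == 23, 0)
```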
I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am. I'd like to add 24 hrs to times after midnight, so the times read from 08:00:00 to 27:00:00. The problem is that the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done, though. Specifically, whether I can differentiate between 8:00:00 am and the times after midnight (1 am-3 am).
Add a day offset for times after midnight but before when a new "day" is supposed to begin (pick some time after 3 am and before 7 am), then sort the values:
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string formatting with the appropriate timedelta components:
Ex:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
Sorry, I am new to asking questions on Stack Overflow, so I don't understand how to format properly.
I'm given a pandas dataframe with a datetime column containing the date and the time, and an associated column that contains some sort of value. The given dates and times are incremented by the hour. I would like to manipulate the dataframe so they increment every 15 minutes but retain the same value. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min', method='ffill')
But I get an error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the initial code I tried above,
df = df.asfreq('15Min', method='ffill'). I was messing around with other dataframes and seemed to be having trouble with some null values; I took care of that with fillna statements and everything worked.
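For reference, asfreq needs a proper DatetimeIndex (or TimedeltaIndex); a plain object/integer index triggers comparison errors like the one above. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2]},
                  index=pd.to_datetime(['2018-01-01 00:00', '2018-01-01 01:00']))

# forward-fill onto a 15-minute grid
out = df.asfreq('15min', method='ffill')
# rows at 00:00, 00:15, 00:30, 00:45, 01:00 -- note it stops at the last
# index value, so the trailing 01:15-01:45 rows are not produced
```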
You can use a TimedeltaIndex, but it is necessary to manually add the last value for a correct reindex:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample, with the same problem: you need to append a new value so the last values are filled correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
# or, with the resample variant:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15min')
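That one-liner only builds the index; a sketch of wiring a generated range into reindex (using timedelta_range, consistent with the time-of-day strings in this question):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2]},
                  index=pd.to_timedelta(['00:00:00', '01:00:00']))

# extend the range past the last point so 01:15-01:45 are covered too
idx = pd.timedelta_range('00:00:00', '01:45:00', freq='15min')
out = df.reindex(idx, method='ffill')
```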
I am trying to resample a datetime index into hourly data. I also want the resampling until the end of the month.
So given the following df:
data = np.arange(6).reshape(3,2)
rng = ['Jan-2016', 'Feb-2016', 'Mar-2016']
df = pd.DataFrame(data, index=rng)
df.index = pd.to_datetime(df.index)
0 1
2016-01-01 0 1
2016-02-01 2 3
2016-03-01 4 5
I know I can resample this into an hourly index with df = df.resample('H').ffill(). However, the result gets cut off at 2016-03-01. Essentially, I want the index to run from 1/1/2016 to 3/31/2016 with hourly granularity.
How can I extend this to the end of the month, 2016-03-31, given that the last index is the beginning of the month?
UPDATE:
In [37]: (df.set_index(df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)]))
....: .resample('H')
....: .ffill()
....: .head()
....: )
Out[37]:
0 1
2016-01-01 00:00:00 0 1
2016-01-01 01:00:00 0 1
2016-01-01 02:00:00 0 1
2016-01-01 03:00:00 0 1
2016-01-01 04:00:00 0 1
In [38]: (df.set_index(df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)]))
....: .resample('H')
....: .ffill()
....: .tail()
....: )
Out[38]:
0 1
2016-03-30 20:00:00 2 3
2016-03-30 21:00:00 2 3
2016-03-30 22:00:00 2 3
2016-03-30 23:00:00 2 3
2016-03-31 00:00:00 4 5
Explanation:
In [40]: df.index[-1] + pd.offsets.MonthEnd(0)
Out[40]: Timestamp('2016-03-31 00:00:00')
In [41]: df.index[:-1].union([df.index[-1] + pd.offsets.MonthEnd(0)])
Out[41]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-31'], dtype='datetime64[ns]', freq=None)
Old incorrect answer:
In [77]: df.resample('M').ffill().resample('H').ffill().tail()
Out[77]:
0 1
2016-03-30 20:00:00 2 3
2016-03-30 21:00:00 2 3
2016-03-30 22:00:00 2 3
2016-03-30 23:00:00 2 3
2016-03-31 00:00:00 4 5
Maybe it's late for this, but I think this way is easier:
import pandas as pd
import numpy as np
data = np.arange(6).reshape(3,2)
rng = ['Jan-2016', 'Feb-2016', 'Mar-2016']
df = pd.DataFrame(data, index=rng)
df.index = pd.to_datetime(df.index)
# Create the desired time range (ending at the last hour of March)
t_index = pd.date_range(start='2016-01-01', end='2016-03-31 23:00:00', freq='h')
# Resample
df_rsmpld = df.reindex(t_index, method='ffill')
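A compact variant that derives the end point with MonthEnd instead of hard-coding it (a sketch; the index values are taken from the question):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01']))

# MonthEnd(0) rolls a date forward to the end of its month
end = df.index[-1] + pd.offsets.MonthEnd(0)   # 2016-03-31
idx = pd.date_range(df.index[0], end, freq='h')
out = df.reindex(idx, method='ffill')
```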
I have a sample_data.txt with the following structure:
Precision= Waterdrops
2009-11-17 14:00:00,4.9,
2009-11-17 14:30:00,6.1,
2009-11-17 15:00:00,5.3,
2009-11-17 15:30:00,3.3,
2009-11-17 16:00:00,4.9,
I need to separate my data by the values bigger than zero and identify a change (event) as a time gap bigger than 2 h. So far I have written:
file_path = 'sample_data.txt'
df = pd.read_csv(file_path, skiprows = [num for (num,line) in enumerate(open(file_path),2) if 'Precision=' in line][0],
parse_dates = True,index_col = 0,header= None, sep =',',
names = ['meteo', 'empty'])
df['date'] = df.index
df = df.drop(['empty'], axis=1)
df = df[df.meteo>20]
df['diff'] = df.date-df.date.shift(1)
df['sections'] = (df['diff'] > np.timedelta64(2, "h")).astype(int).cumsum()
From the above code i get:
meteo date diff sections
2009-12-15 12:00:00 23.8 2009-12-15 12:00:00 NaT 0
2009-12-15 13:00:00 23.0 2009-12-15 13:00:00 01:00:00 0
If I use:
df.date.iloc[[0, -1]].reset_index(drop=True)
I get:
0 2009-12-15 12:00:00
1 2012-12-05 16:00:00
Name: date, dtype: datetime64[ns]
That is, the start date and finish date of my sample_data.txt.
How can I get .iloc[[0, -1]].reset_index(drop=True) for each df['sections'] category?
I tried with .apply:
def f(s):
return s.iloc[[0, -1]].reset_index(drop=True)
df.groupby(df['sections']).apply(f)
and I get: IndexError: positional indexers are out-of-bounds
I don't know why you use the reset_index(drop=True) shenanigans. A somewhat more straightforward process would be, starting with
df
sections meteo date diff
0 0 2009-12-15 12:00:00 NaT
1 0 2009-12-15 13:00:00 01:00:00
0 1 2009-12-15 12:00:00 NaT
1 1 2009-12-15 13:00:00 01:00:00
to do the following (after you ensure with sort_values(['sections', 'date']) that iloc[[0, -1]] actually gives the start and end; otherwise just use min() and max()):
def f(s):
return s.iloc[[0, -1]]['date']
df.groupby('sections').apply(f)
date 0 1
sections
0 12:00:00 13:00:00
1 12:00:00 13:00:00
Or, as a more streamlined approach
df.groupby('sections')['date'].agg([np.max, np.min])
amax amin
sections
0 13:00:00 12:00:00
1 13:00:00 12:00:00
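The streamlined groupby can also use named aggregation for clearer column labels than amax/amin (a sketch with toy section data):

```python
import pandas as pd

df = pd.DataFrame({
    'sections': [0, 0, 1, 1],
    'date': pd.to_datetime(['2009-12-15 12:00', '2009-12-15 13:00',
                            '2009-12-16 09:00', '2009-12-16 12:00']),
})

# named aggregation yields readable 'start'/'end' columns per section
out = df.groupby('sections')['date'].agg(start='min', end='max')
```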