I have a sample_data.txt with the following structure:
Precision= Waterdrops
2009-11-17 14:00:00,4.9,
2009-11-17 14:30:00,6.1,
2009-11-17 15:00:00,5.3,
2009-11-17 15:30:00,3.3,
2009-11-17 16:00:00,4.9,
I need to select the values bigger than zero and identify each change (event) separated by a timespan bigger than 2 h. So far I have written:
import numpy as np
import pandas as pd
file_path = 'sample_data.txt'
df = pd.read_csv(file_path, skiprows = [num for (num,line) in enumerate(open(file_path),2) if 'Precision=' in line][0],
parse_dates = True,index_col = 0,header= None, sep =',',
names = ['meteo', 'empty'])
df['date'] = df.index
df = df.drop(['empty'], axis=1)
df = df[df.meteo>20]
df['diff'] = df.date-df.date.shift(1)
df['sections'] = (df['diff'] > np.timedelta64(2, "h")).astype(int).cumsum()
From the above code I get:
meteo date diff sections
2009-12-15 12:00:00 23.8 2009-12-15 12:00:00 NaT 0
2009-12-15 13:00:00 23.0 2009-12-15 13:00:00 01:00:00 0
If I use:
df.date.iloc[[0, -1]].reset_index(drop=True)
I get:
0 2009-12-15 12:00:00
1 2012-12-05 16:00:00
Name: date, dtype: datetime64[ns]
This is the start date and end date of my sample_data.txt.
How can I get .iloc[[0, -1]].reset_index(drop=True) for each df['sections'] category?
I tried with .apply:
def f(s):
    return s.iloc[[0, -1]].reset_index(drop=True)
df.groupby(df['sections']).apply(f)
and I get: IndexError: positional indexers are out-of-bounds
I don't know why you use the reset_index(drop=True) shenanigans. A somewhat more straightforward process would be, starting with
df
sections meteo date diff
0 0 2009-12-15 12:00:00 NaT
1 0 2009-12-15 13:00:00 01:00:00
0 1 2009-12-15 12:00:00 NaT
1 1 2009-12-15 13:00:00 01:00:00
to do the following (after you ensure with sort_values(['sections', 'date']) that iloc[[0, -1]] actually is start and end; otherwise just use min() and max()):
def f(s):
    return s.iloc[[0, -1]]['date']
df.groupby('sections').apply(f)
date 0 1
sections
0 12:00:00 13:00:00
1 12:00:00 13:00:00
Or, as a more streamlined approach
df.groupby('sections')['date'].agg(['max', 'min'])
max min
sections
0 13:00:00 12:00:00
1 13:00:00 12:00:00
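A minimal, self-contained sketch of the same idea, assuming made-up readings (the sample values and the name `events` are illustrative and not taken from the question): a new section starts whenever the gap to the previous row exceeds 2 hours, and the start and end of each section are then read off with a grouped aggregation.

```python
import pandas as pd

# Hypothetical readings: two runs separated by a gap longer than 2 hours.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-12-15 12:00", "2009-12-15 13:00", "2009-12-15 13:30",
        "2009-12-15 18:00", "2009-12-15 18:30",
    ]),
    "meteo": [23.8, 23.0, 21.5, 22.1, 24.0],
})

# A gap of more than 2 hours to the previous row starts a new section.
gap = df["date"].diff()
df["sections"] = (gap > pd.Timedelta(hours=2)).cumsum()

# Start and end timestamp of every section (named aggregation).
events = df.groupby("sections")["date"].agg(start="min", end="max")
print(events)
```

Using min and max instead of iloc[[0, -1]] avoids depending on the row order inside each group.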
I have the following pandas DataFrame df:
date time val1
2018-12-31 09:00:00 15
2018-12-31 10:00:00 22
2018-12-31 11:00:00 19
2018-12-31 11:30:00 10
2018-12-31 11:45:00 5
2018-12-31 12:00:00 1
2018-12-31 12:05:00 6
I want to find how many minutes lie between the val1 value that is greater than 20 and the val1 value that is lower than or equal to 5.
In this example, the answer is 1 hour and 45 minutes = 105 minutes.
I know how to check the difference between two datetime values:
(df.from_datetime-df.to_datetime).astype('timedelta64[m]')
But how to slice it over the DataFrame, detecting the proper rows?
UPDATE: Taking into consideration that date might be different
Convert the date column to datetime and the time column to timedelta, then add them to get a combined datetime column:
df.time = pd.to_timedelta(df.time)
df.date = pd.to_datetime(df.date)
df['date_time'] = df['date'] + df['time']
df
date time val1 date_time
0 2018-12-31 09:00:00 15 2018-12-31 09:00:00
1 2018-12-31 10:00:00 22 2018-12-31 10:00:00
2 2018-12-31 11:00:00 19 2018-12-31 11:00:00
3 2018-12-31 11:30:00 10 2018-12-31 11:30:00
4 2018-12-31 11:45:00 5 2018-12-31 11:45:00
5 2018-12-31 12:00:00 1 2018-12-31 12:00:00
6 2018-12-31 12:05:00 6 2018-12-31 12:05:00
Now you could use one of these two methods.
1) I love lambdas, and this works with Series objects.
subtr = lambda d1, d2: abs(d1 - d2)/np.timedelta64(1, 'm')
d20 = df[df.val1 > 20].date_time.iloc[0]
d5 = df[df.val1 <= 5].date_time.iloc[0]
subtr(d20, d5)
105.0
2) Needs a DataFrame object instead of a Series object, which offends my sense of aesthetics.
d20 = df[df.val1 > 20][['date_time']].iloc[0]
d5 = df[df.val1 <= 5][['date_time']].iloc[0]
(abs(d20 - d5) / np.timedelta64(1, 'm')).iloc[0]
105.0
So this is my approach:
1) Filter out every row whose val1 is neither >= 20 nor <= 5:
df = pd.DataFrame({'date':['2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31','2018-12-31'],
'time':['09:00:00', '10:00:00', '11:00:00', '11:30:00', '11:45:00', '12:00:00', '12:05:00'],
'val1': [15,22,19,10,5,1,6]})
df2 = df[(df['val1'] >= 20)|(df['val1'] <= 5)].copy()
Then we compute the time difference:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(-1) >= 15,
df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]'),
np.nan)
Let me go through this.
np.where works like an if statement: where the first argument (the condition) is true it returns the second argument, otherwise the third.
df2['val1'] - df2['val1'].shift(-1) >= 15 Since we filtered the df, a row with val1 >= 20 followed by a row with val1 <= 5 must differ by at least 15.
If it is true:
df2['time'].astype('datetime64[ns]').shift(-1) - df2['time'].astype('datetime64[ns]') We take the later time and subtract the beginning time from it.
If not true, we just return np.nan.
We get a df that looks like the following:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 01:45:00
4 2018-12-31 11:45:00 5 NaT
5 2018-12-31 12:00:00 1 NaT
If you want to put the TimeDiff on the end time you can do the following:
df2['TimeDiff'] = np.where(df2['val1'] - df2['val1'].shift(1) <= -15,
df2['time'].astype('datetime64[ns]') - df2['time'].astype('datetime64[ns]').shift(),
np.nan)
and you will get:
date time val1 TimeDiff
1 2018-12-31 10:00:00 22 NaT
4 2018-12-31 11:45:00 5 01:45:00
5 2018-12-31 12:00:00 1 NaT
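As a hedged alternative sketch of the same filter-and-shift approach: the version below swaps np.where for Series.where and works on a combined date_time column, so the result stays a timedelta column and also holds when the interval crosses midnight. The names is_start, elapsed and minutes are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2018-12-31"] * 7,
    "time": ["09:00:00", "10:00:00", "11:00:00", "11:30:00",
             "11:45:00", "12:00:00", "12:05:00"],
    "val1": [15, 22, 19, 10, 5, 1, 6],
})

# Combine date and time into one datetime column.
df["date_time"] = pd.to_datetime(df["date"]) + pd.to_timedelta(df["time"])

# Keep only the rows that can start (>= 20) or end (<= 5) an interval.
df2 = df[(df["val1"] >= 20) | (df["val1"] <= 5)].copy()

# A drop of at least 15 to the next kept row marks a start/end pair.
is_start = (df2["val1"] - df2["val1"].shift(-1)) >= 15
elapsed = df2["date_time"].shift(-1) - df2["date_time"]

# Series.where keeps values where the mask is True and puts NaT elsewhere.
df2["TimeDiff"] = elapsed.where(is_start)
minutes = df2["TimeDiff"].dt.total_seconds() / 60
print(minutes.dropna())
```

For the sample data this leaves a single non-missing value, 105 minutes, on the 10:00:00 row.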
I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am the next day. I'd like to add 24 hours to the times after midnight, so that the times read 08:00:00 to 27:00:00. The problem is that the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though. Specifically, if I can differentiate between 8:00:00 am and the times after midnight (1am-3am).
Add a day offset to the times that fall after midnight but before the new "day" is supposed to begin (pick some cutoff after 3 am and before 7 am), then sort the values:
cutoff, day = pd.to_timedelta(['3.5h', '24h'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string formatting with the appropriate timedelta components, for example:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
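The two steps above can be wrapped into small helpers. This is only a sketch: the function names order_overnight and as_hhmmss are hypothetical, and it assumes the am/pm suffixes have already been stripped so that pd.to_timedelta can parse the strings.

```python
import pandas as pd

def order_overnight(times, cutoff=pd.Timedelta(hours=3.5)):
    """Sort the times of one overnight session.

    Times at or before `cutoff` are treated as belonging to the next day.
    """
    td = pd.to_timedelta(pd.Series(times))
    # Keep times after the cutoff, add one day to the rest.
    shifted = td.where(td > cutoff, td + pd.Timedelta(days=1))
    return shifted.sort_values().reset_index(drop=True)

def as_hhmmss(td):
    """Format a Timedelta as HH:MM:SS with hours running past 24."""
    total = int(td.total_seconds())
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"

times = ["08:00:00", "12:00:00", "16:00:00", "20:00:00",
         "02:00:00", "13:00:00", "03:00:00"]
ordered = order_overnight(times)
labels = ordered.map(as_hhmmss)
print(labels.tolist())
```

The cutoff is a judgment call: any value between the last after-midnight time and the first morning time gives the same ordering.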
Sorry I am new to asking questions on stackoverflow so I don't understand how to format properly.
So I'm given a pandas DataFrame with a datetime column (date and time) and an associated column that contains some sort of value. The given dates and times are incremented by the hour. I would like to manipulate the DataFrame so that they increment every 15 minutes while retaining the same value. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min', method='ffill')
But I get an error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the initial code I tried above, df = df.asfreq('15Min', method='ffill').
I was messing around with other DataFrames and seemed to be having trouble with some null values, so I took care of that with a fillna statement and everything worked.
You can use a TimedeltaIndex, but it is necessary to manually extend the range past the last value for a correct reindex:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample and has the same problem: you need to append a new value so that the last values are filled correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are datetimes:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
df.index.max() + pd.Timedelta(45*60, unit='s'), freq='15Min')
df = df.reindex(tr, method='ffill')
Or, with resample, starting again from the original df:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15min')
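A compact sketch of the reindex route for datetime data, assuming a two-row frame like the one in the question (the names target and out are illustrative). The target range is extended by 45 minutes so the last hourly value is also repeated.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.to_datetime(["2018-01-01 00:00:00", "2018-01-01 01:00:00"]),
)

# Extend the target index by 45 minutes so the last hour is filled too.
target = pd.date_range(
    df.index.min(),
    df.index.max() + pd.Timedelta(minutes=45),
    freq="15min",
)

# Forward-fill each hourly value onto the quarter-hour grid.
out = df.reindex(target, method="ffill")
print(out)
```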
I have a dataframe indexed by datetime. I want to filter out rows based on the difference between their index and the index of the previous row.
So, if my criterion is "remove all rows that are more than one hour later than the previous row", the second row in the example below should be removed:
2005-07-15 17:00:00
2005-07-17 18:00:00
While in the following case, both rows stay:
2005-07-17 23:00:00
2005-07-18 00:00:00
It seems you need boolean indexing, using diff to get the differences and comparing them with a 1 hour Timedelta:
dates=['2005-07-15 17:00:00','2005-07-17 18:00:00', '2005-07-17 19:00:00',
'2005-07-17 23:00:00', '2005-07-18 00:00:00']
df = pd.DataFrame({'a':range(5)}, index=pd.to_datetime(dates))
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 18:00:00 1
2005-07-17 19:00:00 2
2005-07-17 23:00:00 3
2005-07-18 00:00:00 4
diff = df.index.to_series().diff().fillna(pd.Timedelta(0))
print (diff)
2005-07-15 17:00:00 0 days 00:00:00
2005-07-17 18:00:00 2 days 01:00:00
2005-07-17 19:00:00 0 days 01:00:00
2005-07-17 23:00:00 0 days 04:00:00
2005-07-18 00:00:00 0 days 01:00:00
dtype: timedelta64[ns]
mask = diff <= pd.Timedelta(1, unit='h')
print (mask)
2005-07-15 17:00:00 True
2005-07-17 18:00:00 False
2005-07-17 19:00:00 True
2005-07-17 23:00:00 False
2005-07-18 00:00:00 True
dtype: bool
df = df[mask]
print (df)
a
2005-07-15 17:00:00 0
2005-07-17 19:00:00 2
2005-07-18 00:00:00 4
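The same filter can be sketched as a reusable helper. The function name drop_late_rows and its max_gap parameter are hypothetical; instead of filling the first difference, this variant keeps the first row explicitly because it has no predecessor.

```python
import pandas as pd

def drop_late_rows(frame, max_gap=pd.Timedelta(hours=1)):
    """Keep rows whose index is at most `max_gap` after the previous row's index."""
    gap = frame.index.to_series().diff()
    # The first row has no predecessor (gap is NaT), so keep it explicitly.
    return frame[(gap <= max_gap) | gap.isna()]

dates = pd.to_datetime([
    "2005-07-15 17:00:00", "2005-07-17 18:00:00", "2005-07-17 19:00:00",
    "2005-07-17 23:00:00", "2005-07-18 00:00:00",
])
df = pd.DataFrame({"a": range(5)}, index=dates)
kept = drop_late_rows(df)
print(kept)
```

Note that each row is compared with its predecessor in the original frame, not with the previous row that survived the filter.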
I have a DataFrame df like the following (excerpt, 'Timestamp' are the index):
Timestamp Value
2012-06-01 00:00:00 100
2012-06-01 00:15:00 150
2012-06-01 00:30:00 120
2012-06-01 01:00:00 220
2012-06-01 01:15:00 80
...and so on.
I need a new column df['weekday'] with the respective weekday/day-of-week of the timestamps.
How can I get this?
Use the new dt.dayofweek property:
In [2]:
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[2]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
In the situation where the Timestamp is your index you need to reset the index and then call the dt.dayofweek property:
In [14]:
df = df.reset_index()
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[14]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
Strangely, if you try to create a Series from the index in order to avoid resetting the index, you get NaN values, as you do when you call dt.dayofweek on the result of reset_index without assigning that result back to the original df. In both cases the new Series has a default integer index, so it cannot be aligned with the datetime index of df on assignment:
In [16]:
df['weekday'] = pd.Series(df.index).dt.dayofweek
df
Out[16]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
In [17]:
df['weekday'] = df.reset_index()['Timestamp'].dt.dayofweek
df
Out[17]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
EDIT
As pointed out to me by user @joris, you can just access the weekday attribute of the index, so the following will work and is more compact:
df['Weekday'] = df.index.weekday
If the Timestamp column is a datetime value, then you can just use:
df['weekday'] = df['Timestamp'].apply(lambda x: x.weekday())
or
df['weekday'] = pd.to_datetime(df['Timestamp']).apply(lambda x: x.weekday())
You can get the day name this way:
df['weekday'] = df.index.day_name()
In case somebody else has the same issue with a multi-indexed DataFrame, here is what solved it for me, based on @joris's solution:
df['Weekday'] = df.index.get_level_values(1).weekday
For me the date was at get_level_values(1) rather than get_level_values(0), which would work for the outer index level.
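Note that dt.dayofweek is not deprecated: pandas 1.1.0 deprecated dt.week and dt.weekofyear in favour of dt.isocalendar().week. The same dt.isocalendar() accessor also provides the ISO weekday, so as an alternative to:
df['weekday'] = df['Timestamp'].dt.dayofweek
from @EdChum and @Artyom Krivolapov
you can also use the following, which numbers the days 1 (Monday) to 7 (Sunday) instead of 0 to 6:
df['weekday'] = df['Timestamp'].dt.isocalendar().day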
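To compare the variants side by side, here is a small sketch for a frame whose index holds the timestamps. The sample dates and the column names iso_weekday and weekday_name are made up for illustration; isocalendar() requires pandas 1.1.0 or later.

```python
import pandas as pd

idx = pd.to_datetime([
    "2012-06-01 00:00:00",  # a Friday
    "2012-06-03 00:15:00",  # a Sunday
    "2012-06-04 00:30:00",  # a Monday
])
df = pd.DataFrame({"Value": [100, 150, 120]}, index=idx)
df.index.name = "Timestamp"

# Monday=0 ... Sunday=6
df["weekday"] = df.index.dayofweek
# ISO numbering: Monday=1 ... Sunday=7 (aligned on the shared datetime index)
df["iso_weekday"] = df.index.isocalendar().day
# English day names
df["weekday_name"] = df.index.day_name()
print(df)
```

All three assignments work directly on the index, so no reset_index is needed and no NaN values appear.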