I have a dataframe containing data measured at two-hour intervals each day; some time intervals, however, are missing. My dataset looks like this:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I'm trying to insert the missing time intervals and fill their values with NaN:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I would appreciate any help on how to achieve this in Python, as I'm a newbie just starting out.
Create a DatetimeIndex and use DataFrame.asfreq:
print(df)
date val
0 2020-12-01 08:00:00 145.9
1 2020-12-01 10:00:00 100.0
2 2020-12-01 16:00:00 99.3
3 2020-12-01 18:00:00 91.0
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('2H')
print(df)
val
date
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
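If you want date back as a regular column afterwards, or prefer to fill the inserted gaps instead of keeping NaN, a small follow-up sketch (interpolation is just one possible fill strategy) could be:
df['val'] = df['val'].interpolate()  # linear fill of the inserted NaNs
df = df.reset_index()                # 'date' becomes a regular column again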
Assuming your df looks like this:
datetime value
0 2020-12-01T08:00:00 145.9
1 2020-12-01T10:00:00 100.0
2 2020-12-01T16:00:00 99.3
3 2020-12-01T18:00:00 91.0
make sure the datetime column has dtype datetime:
df['datetime'] = pd.to_datetime(df['datetime'])
so that you can now resample to 2-hourly frequency:
df.resample('2H', on='datetime').mean()
value
datetime
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
Note that you don't need to set the on= keyword if your df already has a datetime index. The df resulting from resampling will have a datetime index.
Also note that I'm using .mean() as the aggregation function, meaning that if you have multiple values within a two-hour interval, you'll get their mean.
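For example, here is a sketch with a hypothetical extra reading at 08:30 that lands in the 08:00 bin:
# hypothetical extra reading inside the 08:00-10:00 bin (illustration only)
extra = pd.DataFrame({'datetime': [pd.Timestamp('2020-12-01 08:30:00')],
                      'value': [150.0]})
df2 = pd.concat([df, extra], ignore_index=True)
df2.resample('2H', on='datetime').mean()
# the 08:00 row now shows (145.9 + 150.0) / 2 = 147.95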
You can try the following; I used datetime and timedelta for this:
from datetime import datetime, timedelta
# Assuming the data is given like below.
data = ['2020-12-01 08:00:00 145.9',
        '2020-12-01 10:00:00 100.0',
        '2020-12-01 16:00:00 99.3',
        '2020-12-01 18:00:00 91.0']

# Initialize the start time using data[0].
date = data[0].split()[0].split('-')
time = data[0].split()[1].split(':')
start = datetime(int(date[0]), int(date[1]), int(date[2]),
                 int(time[0]), int(time[1]), int(time[2]))

newdata = [data[0]]
i = 1
while i < len(data):
    nxt = start + timedelta(hours=2)
    # If the expected next timestamp is missing from the data,
    # insert a NaN row; otherwise keep the original row and advance.
    if str(nxt) != (data[i].split()[0] + ' ' + data[i].split()[1]):
        newdata.append(str(nxt) + ' NaN')
    else:
        newdata.append(data[i])
        i += 1
    start = nxt
newdata
NOTE: timedelta(hours=2) adds 2 hours to the existing time.
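To check the result you can print newdata; for the sample input above it should contain:
for row in newdata:
    print(row)
# 2020-12-01 08:00:00 145.9
# 2020-12-01 10:00:00 100.0
# 2020-12-01 12:00:00 NaN
# 2020-12-01 14:00:00 NaN
# 2020-12-01 16:00:00 99.3
# 2020-12-01 18:00:00 91.0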
Related
I have a dataframe where the index is a timestamp.
DATE VALOR
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
... ...
2021-04-30 19:00:00 0.77059
2021-04-30 20:00:00 0.49285
2021-04-30 21:00:00 0.49057
2021-04-30 22:00:00 0.50339
2021-04-30 23:00:00 0.48792
I'm searching for a specific date:
drop.loc['2020-12-01 04:00:00']
VALOR 0.0108
Name: 2020-12-01 04:00:00, dtype: float64
I want to get the row position of the result above; in this case it is row 5. After that I want to use this value to slice the dataframe:
drop[:5]
Thanks!
It looks like you want to subset drop up to index '2020-12-01 04:00:00'.
Then simply do this: drop.loc[:'2020-12-01 04:00:00']
No need to manually get the line number.
output:
VALOR
DATE
2020-12-01 00:00:00 0.00635
2020-12-01 01:00:00 0.00941
2020-12-01 02:00:00 0.01151
2020-12-01 03:00:00 0.00281
2020-12-01 04:00:00 0.01080
If you really want to get the position:
pos = drop.index.get_loc(key='2020-12-01 04:00:00') ## returns: 4
drop[:pos+1]
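Since pos is a plain integer position, the explicit positional equivalent of that slice is .iloc:
drop.iloc[:pos + 1]  # same rows, selected by integer position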
I have a time series with breaks (times w/o recordings) in between. A simplified example would be:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(13), columns=["values"],
    index=pd.date_range(start='1/1/2020 11:00:00', end='1/1/2020 23:00:00', freq='H'))
df.iloc[4:7] = np.nan
df.dropna(inplace=True)
df
values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166
Now I would like to split it into intervals that are separated by at least a certain time span (e.g. 2 h). In the example above this would be:
( values
2020-01-01 11:00:00 0.100339
2020-01-01 12:00:00 0.054668
2020-01-01 13:00:00 0.209965
2020-01-01 14:00:00 0.551023,
values
2020-01-01 18:00:00 0.495879
2020-01-01 19:00:00 0.479905
2020-01-01 20:00:00 0.250568
2020-01-01 21:00:00 0.904743
2020-01-01 22:00:00 0.686085
2020-01-01 23:00:00 0.188166)
I was a bit surprised that I didn't find anything on this, since I thought it was a common problem. My current solution to get the start and end index of each interval is:
from datetime import timedelta

def intervals(data: pd.DataFrame, delta_t: timedelta = timedelta(hours=2)):
    # sort by timestamp and mark positions where the gap exceeds delta_t
    data = data.sort_values(by=['event_timestamp'], ignore_index=True)
    breaks = (data['event_timestamp'].diff() > delta_t).astype(bool).values
    ranges = []
    start = 0
    end = start
    for i, e in enumerate(breaks):
        if not e:
            end = i
            if i == len(breaks) - 1:
                ranges.append((start, end))
                start = i
                end = start
        elif i != 0:
            ranges.append((start, end))
            start = i
            end = start
    return ranges
Any suggestions on how I could do this in a smarter way? I suspect this should somehow be possible using groupby.
Yes, you can use the very convenient np.split:
dt = pd.Timedelta('2H')
parts = np.split(df, np.where(np.diff(df.index) > dt)[0] + 1)
Here np.diff(df.index) gives the gap between consecutive timestamps, np.where(...)[0] + 1 marks the positions where a new segment starts, and np.split cuts the frame at those positions. For your example this gives:
>>> parts
[ values
2020-01-01 11:00:00 0.557374
2020-01-01 12:00:00 0.942296
2020-01-01 13:00:00 0.181189
2020-01-01 14:00:00 0.758822,
values
2020-01-01 18:00:00 0.682125
2020-01-01 19:00:00 0.818187
2020-01-01 20:00:00 0.053515
2020-01-01 21:00:00 0.572342
2020-01-01 22:00:00 0.423129
2020-01-01 23:00:00 0.882215]
@Pierre thanks for your input. I have now arrived at a solution that is convenient for me:
df['diff'] = df.index.to_series().diff()
max_gap = timedelta(hours=2)
df['gapId'] = 0
df.loc[df['diff'] >= max_gap, ['gapId']] = 1
df['gapId'] = df['gapId'].cumsum()
list(df.groupby('gapId'))
gives:
[(0,
values date diff gapId
0 1.0 2020-01-01 11:00:00 NaT 0
1 1.0 2020-01-01 12:00:00 0 days 01:00:00 0
2 1.0 2020-01-01 13:00:00 0 days 01:00:00 0
3 1.0 2020-01-01 14:00:00 0 days 01:00:00 0),
(1,
values date diff gapId
7 1.0 2020-01-01 18:00:00 0 days 04:00:00 1
8 1.0 2020-01-01 19:00:00 0 days 01:00:00 1
9 1.0 2020-01-01 20:00:00 0 days 01:00:00 1
10 1.0 2020-01-01 21:00:00 0 days 01:00:00 1
11 1.0 2020-01-01 22:00:00 0 days 01:00:00 1
12 1.0 2020-01-01 23:00:00 0 days 01:00:00 1)]
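If you want a list of plain sub-frames like in the np.split answer, a small follow-up sketch is to drop the helper columns while collecting the groups:
# sketch: collect the groups and strip the helper columns again
parts = [g.drop(columns=['diff', 'gapId']) for _, g in df.groupby('gapId')]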
I have a dataframe with a datetime column for the date and a column for the hour, like this:
min hour date
0 0 2020-12-01
1 5 2020-12-02
2 6 2020-12-01
I need a datetime column combining both date and hour, like this:
min hour date datetime
0 0 2020-12-01 2020-12-01 00:00:00
0 5 2020-12-02 2020-12-02 05:00:00
0 6 2020-12-01 2020-12-01 06:00:00
How can I do it?
Use pd.to_datetime and pd.to_timedelta:
In [393]: df['date'] = pd.to_datetime(df['date'])
In [396]: df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
In [405]: df
Out[405]:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
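Since your frame also has a min column, the same idea extends to minutes; a sketch, assuming min holds minutes:
df['datetime'] = (df['date']
                  + pd.to_timedelta(df['hour'], unit='h')
                  + pd.to_timedelta(df['min'], unit='m'))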
You could also try using apply and np.timedelta64:
import numpy as np

df['datetime'] = df['date'] + df['hour'].apply(lambda x: np.timedelta64(x, 'h'))
print(df)
Output:
min hour date datetime
0 0 0 2020-12-01 2020-12-01 00:00:00
1 1 5 2020-12-02 2020-12-02 05:00:00
2 2 6 2020-12-01 2020-12-01 06:00:00
In the original question the data types of the columns are not clear, so I assumed they are date objects (not pandas timestamps) and the OP wants a datetime version.
If that is the case, the solution is similar to the previous one, but uses a different constructor:
from datetime import datetime
df['datetime'] = df.apply(lambda x: datetime(x.date.year, x.date.month, x.date.day, int(x['hour']), int(x['min'])), axis=1)
I've looked around (e.g. Python - Locating the closest timestamp) but can't find anything on this.
I have a list of datetimes, and a dataframe containing 10k + rows, of start and end times (formatted as datetimes).
The dataframe is effectively listing parameters for runs of an instrument.
The list describes times from an alarm event.
The datetime list items all fall within a row (i.e. between a start and end time) of the dataframe. Is there an easy way to locate the rows whose timeframe contains a given alarm time?
e.g.
for i in alarms:
df.loc[(df.start_time < i) & (df.end_time > i), 'Flag'] = 'Alarm'
(this didn't work but shows my approach)
Example datasets
# making list of datetimes for the alarms
df = pd.DataFrame({'Alarms':["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n=33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({ 'start_date': rng1, 'end_Date': rng2})
Here a flag would go against rows (indices) 4, 13 and 21.
You can use pandas.IntervalIndex here:
# Create and set IntervalIndex
intervals = pd.IntervalIndex.from_arrays(df.start_date, df.end_Date)
df = df.set_index(intervals)
# Update using loc
df.loc[alarms, 'flag'] = 'alarm'
# Finally, reset_index
df = df.reset_index(drop=True)
[out]
start_date end_Date flag
0 2019-07-18 00:00:00 2019-07-18 03:00:00 NaN
1 2019-07-18 03:00:00 2019-07-18 06:00:00 NaN
2 2019-07-18 06:00:00 2019-07-18 09:00:00 NaN
3 2019-07-18 09:00:00 2019-07-18 12:00:00 NaN
4 2019-07-18 12:00:00 2019-07-18 15:00:00 alarm
5 2019-07-18 15:00:00 2019-07-18 18:00:00 NaN
6 2019-07-18 18:00:00 2019-07-18 21:00:00 NaN
7 2019-07-18 21:00:00 2019-07-19 00:00:00 NaN
8 2019-07-19 00:00:00 2019-07-19 03:00:00 NaN
9 2019-07-19 03:00:00 2019-07-19 06:00:00 NaN
10 2019-07-19 06:00:00 2019-07-19 09:00:00 NaN
11 2019-07-19 09:00:00 2019-07-19 12:00:00 NaN
12 2019-07-19 12:00:00 2019-07-19 15:00:00 NaN
13 2019-07-19 15:00:00 2019-07-19 18:00:00 alarm
14 2019-07-19 18:00:00 2019-07-19 21:00:00 NaN
15 2019-07-19 21:00:00 2019-07-20 00:00:00 NaN
16 2019-07-20 00:00:00 2019-07-20 03:00:00 NaN
17 2019-07-20 03:00:00 2019-07-20 06:00:00 NaN
18 2019-07-20 06:00:00 2019-07-20 09:00:00 NaN
19 2019-07-20 09:00:00 2019-07-20 12:00:00 NaN
20 2019-07-20 12:00:00 2019-07-20 15:00:00 NaN
21 2019-07-20 15:00:00 2019-07-20 18:00:00 alarm
22 2019-07-20 18:00:00 2019-07-20 21:00:00 NaN
23 2019-07-20 21:00:00 2019-07-21 00:00:00 NaN
24 2019-07-21 00:00:00 2019-07-21 03:00:00 NaN
25 2019-07-21 03:00:00 2019-07-21 06:00:00 NaN
26 2019-07-21 06:00:00 2019-07-21 09:00:00 NaN
27 2019-07-21 09:00:00 2019-07-21 12:00:00 NaN
28 2019-07-21 12:00:00 2019-07-21 15:00:00 NaN
29 2019-07-21 15:00:00 2019-07-21 18:00:00 NaN
30 2019-07-21 18:00:00 2019-07-21 21:00:00 NaN
31 2019-07-21 21:00:00 2019-07-22 00:00:00 NaN
32 2019-07-22 00:00:00 2019-07-22 03:00:00 NaN
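If you only need the row positions rather than a flag column, here is a sketch (assuming the intervals are non-overlapping):
# positions of the intervals containing each alarm;
# -1 would mean "no containing interval"
pos = intervals.get_indexer(pd.Index(alarms))
print(pos)  # array([ 4, 13, 21])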
You named your columns start_date and end_Date, but in your for loop you used start_time and end_time.
Try this:
import pandas as pd
df = pd.DataFrame({'Alarms': ["18/07/19 14:56:21", "19/07/19 15:05:15", "20/07/19 15:46:00"]})
df['Alarms'] = pd.to_datetime(df['Alarms'])
alarms = list(df.Alarms.unique())
# dataframe of runs containing start and end times
n = 33
rng1 = pd.date_range('2019-07-18', '2019-07-22', periods=n)
rng2 = pd.date_range('2019-07-18 03:00:00', '2019-07-22 03:00:00', periods=n)
df = pd.DataFrame({'start_date': rng1, 'end_Date': rng2})
for i in alarms:
df.loc[(df.start_date < i) & (df.end_Date > i), 'Flag'] = 'Alarm'
print(df[df['Flag']=='Alarm']['Flag'])
Output:
4 Alarm
13 Alarm
21 Alarm
Name: Flag, dtype: object
I have a dataframe as shown below:
index value
2003-01-01 00:00:00 14.5
2003-01-01 01:00:00 15.8
2003-01-01 02:00:00 0
2003-01-01 03:00:00 0
2003-01-01 04:00:00 13.6
2003-01-01 05:00:00 4.3
2003-01-01 06:00:00 13.7
2003-01-01 07:00:00 14.4
2003-01-01 08:00:00 0
2003-01-01 09:00:00 0
2003-01-01 10:00:00 0
2003-01-01 11:00:00 17.2
2003-01-01 12:00:00 0
2003-01-01 13:00:00 5.3
2003-01-01 14:00:00 0
2003-01-01 15:00:00 2.0
2003-01-01 16:00:00 4.0
2003-01-01 17:00:00 0
2003-01-01 18:00:00 0
2003-01-01 19:00:00 3.9
2003-01-01 20:00:00 7.2
2003-01-01 21:00:00 1.0
2003-01-01 22:00:00 1.0
2003-01-01 23:00:00 10.0
The index is a datetime and the column records the rainfall value (unit: mm) in each hour. I would like to calculate the "average wet spell duration", which means the
average number of continuous hours with nonzero values in a day, so the calculation is
(2 + 4 + 1 + 1 + 2 + 5) / 6 (events) = 2.5 (hr)
and the "average wet spell amount", which means the average of the sums of the values over those continuous hours in a day:
{ (14.5 + 15.8) + (13.6 + 4.3 + 13.7 + 14.4) + (17.2) + (5.3) + (2 + 4) + (3.9 + 7.2 + 1 + 1 + 10) } / 6 (events) = 21.32 (mm)
The dataframe above is just an example; the one I actually have covers a much longer time series (more than one year). How can I write a function that calculates the two values above in a better way? Thanks in advance!
P.S. The values may be NaN, and I would like to just ignore them.
I believe this is what you are looking for. I have added explanations to the code for each step.
# create helper columns defining contiguous blocks and day
# (this assumes the datetime index has been reset into a column named 'index')
df['block'] = (df['value'].astype(bool).shift() != df['value'].astype(bool)).cumsum()
df['day'] = df['index'].dt.normalize()
# group by day to get unique block count and value count
session_map = df[df['value'].astype(bool)].groupby('day')['block'].nunique()
hour_map = df[df['value'].astype(bool)].groupby('day')['value'].count()
# map to original dataframe
df['sessions'] = df['day'].map(session_map)
df['hours'] = df['day'].map(hour_map)
# calculate result
res = df.groupby(['day', 'hours', 'sessions'], as_index=False)['value'].sum()
res['duration'] = res['hours'] / res['sessions']
res['amount'] = res['value'] / res['sessions']
Result
          day  hours  sessions  value  duration     amount
0  2003-01-01     15         6  127.9       2.5  21.316667
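Regarding the P.S.: astype(bool) maps NaN to True, so NaN hours would be counted as wet here. A pre-processing sketch, assuming you want NaN hours treated as dry (dropping those rows entirely is another reading of "ignore"), would be:
df['value'] = df['value'].fillna(0)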
I am not exactly sure what you are asking for, but I think what you are looking for is resample(). If I misunderstood your question, please correct me.
From Creating pandas dataframe with datetime index and random values in column, I have created a random time series dataframe.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(1), freq='H')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'Day': days, 'Value': data})
df = df.set_index('Day')
View the dataframe
Day Value
2018-03-18 20:18:08.205546 29
2018-03-18 21:18:08.205546 56
2018-03-18 22:18:08.205546 82
2018-03-18 23:18:08.205546 13
2018-03-19 00:18:08.205546 35
2018-03-19 01:18:08.205546 53
2018-03-19 02:18:08.205546 25
2018-03-19 03:18:08.205546 23
2018-03-19 04:18:08.205546 21
2018-03-19 05:18:08.205546 12
2018-03-19 06:18:08.205546 15
2018-03-19 07:18:08.205546 9
2018-03-19 08:18:08.205546 13
2018-03-19 09:18:08.205546 87
2018-03-19 10:18:08.205546 9
2018-03-19 11:18:08.205546 63
2018-03-19 12:18:08.205546 62
2018-03-19 13:18:08.205546 52
2018-03-19 14:18:08.205546 43
2018-03-19 15:18:08.205546 77
2018-03-19 16:18:08.205546 95
2018-03-19 17:18:08.205546 79
2018-03-19 18:18:08.205546 77
2018-03-19 19:18:08.205546 5
2018-03-19 20:18:08.205546 78
Now, resample your dataframe:
# resample into 2 hours and drop the NaNs
df.resample('2H').mean().dropna()
It gives you:
Day Value
2018-03-18 20:00:00 42.5
2018-03-18 22:00:00 47.5
2018-03-19 00:00:00 44.0
2018-03-19 02:00:00 24.0
2018-03-19 04:00:00 16.5
2018-03-19 06:00:00 12.0
2018-03-19 08:00:00 50.0
2018-03-19 10:00:00 36.0
2018-03-19 12:00:00 57.0
2018-03-19 14:00:00 60.0
2018-03-19 16:00:00 87.0
2018-03-19 18:00:00 41.0
2018-03-19 20:00:00 78.0
Similarly, you can resample into days, hours, minutes, etc., which I leave up to you. You might want to take a look at:
Where is the documentation on Pandas 'Freq' tags?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html