I have a dataframe as shown below:
index value
2003-01-01 00:00:00 14.5
2003-01-01 01:00:00 15.8
2003-01-01 02:00:00 0
2003-01-01 03:00:00 0
2003-01-01 04:00:00 13.6
2003-01-01 05:00:00 4.3
2003-01-01 06:00:00 13.7
2003-01-01 07:00:00 14.4
2003-01-01 08:00:00 0
2003-01-01 09:00:00 0
2003-01-01 10:00:00 0
2003-01-01 11:00:00 17.2
2003-01-01 12:00:00 0
2003-01-01 13:00:00 5.3
2003-01-01 14:00:00 0
2003-01-01 15:00:00 2.0
2003-01-01 16:00:00 4.0
2003-01-01 17:00:00 0
2003-01-01 18:00:00 0
2003-01-01 19:00:00 3.9
2003-01-01 20:00:00 7.2
2003-01-01 21:00:00 1.0
2003-01-01 22:00:00 1.0
2003-01-01 23:00:00 10.0
The index is a datetime index, and the column records the rainfall amount (unit: mm) for each hour. I would like to calculate the "average wet spell duration", i.e. the average number of continuous hours with non-zero values in a day, so the calculation is
(2 + 4 + 1 + 1 + 2 + 5) / 6 (events) = 2.5 (hr)
and the "average wet spell amount", i.e. the average of the sums of the values over those continuous hours in a day:
{ (14.5 + 15.8) + (13.6 + 4.3 + 13.7 + 14.4) + (17.2) + (5.3) + (2 + 4) + (3.9 + 7.2 + 1 + 1 + 10) } / 6 (events) = 21.32 (mm)
The dataframe above is just an example; the one I actually have covers a much longer time series (more than one year, for example). How can I write a function that calculates these two values in a better way? Thanks in advance!
P.S. The values may be NaN, and I would like to simply ignore those.
I believe this is what you are looking for. I have added explanations to the code for each step.
# create helper columns defining contiguous blocks and day
df['block'] = (df['value'].astype(bool).shift() != df['value'].astype(bool)).cumsum()
df['day'] = df.index.normalize()
# group by day to get unique block count and value count
session_map = df[df['value'].astype(bool)].groupby('day')['block'].nunique()
hour_map = df[df['value'].astype(bool)].groupby('day')['value'].count()
# map to original dataframe
df['sessions'] = df['day'].map(session_map)
df['hours'] = df['day'].map(hour_map)
# calculate result
res = df.groupby(['day', 'hours', 'sessions'], as_index=False)['value'].sum()
res['duration'] = res['hours'] / res['sessions']
res['amount'] = res['value'] / res['sessions']
Result
day sessions duration value amount
0 2003-01-01 6 2.5 127.9 21.316667
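Since you also ask for a function and want NaN treated as dry hours, here is a minimal sketch along the same lines (the function and variable names are my own, so treat it as an illustration rather than tested production code):
import pandas as pd

def wet_spell_stats(rain):
    """Average wet spell duration (hours) and amount (mm) for an hourly
    rainfall Series with a DatetimeIndex. NaN is treated as dry."""
    wet = rain.fillna(0) > 0                   # NaN counts as dry, per the P.S.
    spell_id = (wet != wet.shift()).cumsum()   # label contiguous wet/dry runs
    spells = rain[wet].groupby(spell_id[wet])  # one group per wet spell
    return spells.size().mean(), spells.sum().mean()

# usage on the example frame:
# avg_duration, avg_amount = wet_spell_stats(df['value'])   # -> (2.5, 21.3166...)
If you want the statistics per day rather than over the whole series, you could apply it day by day, e.g. df['value'].groupby(df.index.normalize()).apply(wet_spell_stats).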
I am not exactly sure what you are asking for, but I think what you want is resample(). If I misunderstood your question, please correct me.
From Creating pandas dataframe with datetime index and random values in column, I have created a random time series dataframe.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(1), freq='H')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'Day': days, 'Value': data})
df = df.set_index('Day')
View the dataframe
Day Value
2018-03-18 20:18:08.205546 29
2018-03-18 21:18:08.205546 56
2018-03-18 22:18:08.205546 82
2018-03-18 23:18:08.205546 13
2018-03-19 00:18:08.205546 35
2018-03-19 01:18:08.205546 53
2018-03-19 02:18:08.205546 25
2018-03-19 03:18:08.205546 23
2018-03-19 04:18:08.205546 21
2018-03-19 05:18:08.205546 12
2018-03-19 06:18:08.205546 15
2018-03-19 07:18:08.205546 9
2018-03-19 08:18:08.205546 13
2018-03-19 09:18:08.205546 87
2018-03-19 10:18:08.205546 9
2018-03-19 11:18:08.205546 63
2018-03-19 12:18:08.205546 62
2018-03-19 13:18:08.205546 52
2018-03-19 14:18:08.205546 43
2018-03-19 15:18:08.205546 77
2018-03-19 16:18:08.205546 95
2018-03-19 17:18:08.205546 79
2018-03-19 18:18:08.205546 77
2018-03-19 19:18:08.205546 5
2018-03-19 20:18:08.205546 78
Now, resample your dataframe:
# resample into 2 hours and drop the NaNs
df.resample('2H').mean().dropna()
It gives you,
Day Value
2018-03-18 20:00:00 42.5
2018-03-18 22:00:00 47.5
2018-03-19 00:00:00 44.0
2018-03-19 02:00:00 24.0
2018-03-19 04:00:00 16.5
2018-03-19 06:00:00 12.0
2018-03-19 08:00:00 50.0
2018-03-19 10:00:00 36.0
2018-03-19 12:00:00 57.0
2018-03-19 14:00:00 60.0
2018-03-19 16:00:00 87.0
2018-03-19 18:00:00 41.0
2018-03-19 20:00:00 78.0
Similarly, you can resample into days, hours, minutes, etc., which I leave up to you (a daily example is sketched below). You might want to take a look at
Where is the documentation on Pandas 'Freq' tags?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
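For instance, a hedged sketch of resampling the same df to daily frequency, with the same mean aggregation as above:
# resample into days and drop the NaNs
df.resample('D').mean().dropna()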
Related
I have a time series data for air pollution with several missing gaps like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of single NAs or runs of consecutive NAs, and there are some helpful statistics produced by R, like:
Length of time series: 87648
Number of Missing Values:746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NAs): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1), gaps with more than one missing value get (partially) interpolated as well. So I guess a better way to interpolate only the gaps with a single missing value is to get each gap's id.
To do so, I grouped the different gap sizes and used the following:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
But when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the datetime index was lost.
Are there better solutions for this? Or how can I interpolate case by case?
Can anyone help me?
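A possible sketch of one way to get each gap's first timestamp and its size, building on the isna/cumsum idea above (the small frame below is only illustrative; the real df from the question has a DatetimeIndex and a PM2.5 column):
import pandas as pd
import numpy as np

# illustrative data standing in for the real hourly series
idx = pd.date_range('2019-01-09 10:00', periods=8, freq='H')
df = pd.DataFrame({'PM2.5': [10.0, 12.0, np.nan, 14.0, np.nan, np.nan, 9.0, 8.0]}, index=idx)

na = df['PM2.5'].isna()
# a new gap starts wherever a NaN follows a non-NaN value (or at the very first row)
gap_id = (na & ~na.shift(fill_value=False)).cumsum()

gaps = (df[na]
        .groupby(gap_id[na])['PM2.5']
        .agg(first_missing=lambda s: s.index[0], gap_size='size')
        .set_index('first_missing'))
print(gaps)   # one row per gap: its first timestamp and its length
Gaps of a single missing value are then gaps[gaps['gap_size'] == 1], and their index tells you where interpolation is safe to apply.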
The following code generates a sample DataFrame with a multilevel index. The first level is a string, the second level is a datetime.
Script
import pandas as pd
from datetime import datetime
import random
networks = ['ALPHA', 'BETA', 'GAMMA']
times = pd.date_range(datetime.strptime('2021-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'),
                      datetime.strptime('2021-01-01 12:00:00', '%Y-%m-%d %H:%M:%S'),
                      periods=7).tolist()
rows = []
for n in networks:
    for t in times:
        rows.append({'network': n, 'time': t,
                     'active_clients': random.randint(10, 30),
                     'throughput': random.randint(1500, 5000),
                     'speed': random.randint(10000, 12000)})
df = pd.DataFrame(rows).set_index(['network', 'time'])
print(df.to_string())
Output
active_clients throughput speed
network time
ALPHA 2021-01-01 00:00:00 16 4044 11023
2021-01-01 02:00:00 17 2966 10933
2021-01-01 04:00:00 10 4649 11981
2021-01-01 06:00:00 23 3629 10113
2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 00:00:00 17 3073 11798
2021-01-01 02:00:00 20 1941 10640
2021-01-01 04:00:00 17 1980 11869
2021-01-01 06:00:00 23 3346 10002
2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 00:00:00 21 4366 11587
2021-01-01 02:00:00 22 3404 11669
2021-01-01 04:00:00 20 1608 10344
2021-01-01 06:00:00 28 1849 10278
2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
For each item in the first level, I want to select the last three records in the second level. The catch is that I don't know the datetime values, so I need to select by integer position instead. What's the most efficient way of slicing the DataFrame to achieve the following?
Desired output
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 30 2520 11159
2021-01-01 10:00:00 10 4200 11309
2021-01-01 12:00:00 16 3878 11366
BETA 2021-01-01 08:00:00 10 1952 10063
2021-01-01 10:00:00 28 3788 11047
2021-01-01 12:00:00 24 4993 10487
GAMMA 2021-01-01 08:00:00 14 3229 11925
2021-01-01 10:00:00 21 3408 10411
2021-01-01 12:00:00 12 1799 10492
My attempts
Returns the full dataframe:
df_sel = df.iloc[:,-3:]
Raises an error because loc doesn't support using integer values on datetime objects:
df_sel = df.loc[:,-3:]
Returns the last three entries in the second level, but only for the last entry in the first level:
df_sel = df.loc[:].iloc[-3:]
I have 2 methods to solve this problem:
Method 1:
As mentioned in the first comment by Quang Hoang, you can use groupby to do this, which I believe is the shortest code:
df.groupby(level=0).tail(3)
Method 2:
You can also slice each network separately and then concat the pieces:
pd.concat([df.loc[[i]][-3:] for i in networks])
Both methods produce the result you want.
Another method is to do some reshaping:
df.unstack(0).iloc[-3:].stack().swaplevel(0,1).sort_index()
Output:
active_clients throughput speed
network time
ALPHA 2021-01-01 08:00:00 26 4081 11325
2021-01-01 10:00:00 13 3370 10716
2021-01-01 12:00:00 13 3691 10737
BETA 2021-01-01 08:00:00 28 2105 10465
2021-01-01 10:00:00 21 2444 10158
2021-01-01 12:00:00 24 1947 11226
GAMMA 2021-01-01 08:00:00 13 1850 10288
2021-01-01 10:00:00 23 2241 11521
2021-01-01 12:00:00 30 3515 11138
Details:
unstack the outermost index level (level=0)
use iloc to select the last three records of the reshaped frame
stack that level back into the index, then swaplevel and sort_index
I have contiguous periods of NaN values per CODE. For each CODE, I want to count the NaN values in each contiguous period, and I also want the start and end date of each contiguous period of NaN values.
df :
CODE TMIN
1998-01-01 00:00:00 12 2.5
1999-01-01 00:00:00 12 NaN
2000-01-01 00:00:00 12 NaN
2001-01-01 00:00:00 12 2.2
2002-01-01 00:00:00 12 NaN
1998-01-01 00:00:00 41 NaN
1999-01-01 00:00:00 41 NaN
2000-01-01 00:00:00 41 5.0
2001-01-01 00:00:00 41 9.0
2002-01-01 00:00:00 41 8.0
1998-01-01 00:00:00 52 2.0
1999-01-01 00:00:00 52 NaN
2000-01-01 00:00:00 52 NaN
2001-01-01 00:00:00 52 NaN
2002-01-01 00:00:00 52 1.0
1998-01-01 00:00:00 91 NaN
Expected results:
Start_Date           End_Date             CODE  number of contiguous missing values
1999-01-01 00:00:00 2000-01-01 00:00:00 12 2
2002-01-01 00:00:00 2002-01-01 00:00:00 12 1
1998-01-01 00:00:00 1999-01-01 00:00:00 41 2
1999-01-01 00:00:00 2001-01-01 00:00:00 52 3
1998-01-01 00:00:00 1998-01-01 00:00:00 91 1
How can I solve this? Thanks!
You can try grouping by the cumsum of the non-null values:
df['group'] = df.TMIN.notna().cumsum()
(df[df.TMIN.isna()]
   .groupby(['group', 'CODE'])
   .agg(Start_Date=('group', lambda x: x.index.min()),
        End_Date=('group', lambda x: x.index.max()),
        cont_missing=('TMIN', 'size')
        )
)
Output:
Start_Date End_Date cont_missing
group CODE
1 12 1999-01-01 00:00:00 2000-01-01 00:00:00 2
2 12 2002-01-01 00:00:00 2002-01-01 00:00:00 1
41 1998-01-01 00:00:00 1999-01-01 00:00:00 2
6 52 1999-01-01 00:00:00 2001-01-01 00:00:00 3
7 91 1998-01-01 00:00:00 1998-01-01 00:00:00 1
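If you want it shaped exactly like the expected results (CODE as a regular column and no helper group level), a possible follow-up, assuming the aggregation above is stored in a variable, say res (the name is illustrative):
res = res.droplevel('group').reset_index()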
I have a dataframe containing data measured at two-hour intervals each day; however, some time intervals are missing. My dataset looks like this:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I'm trying to insert the missing time intervals and fill their values with NaN:
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
I would appreciate any help on how to achieve this in Python, as I'm a newbie just starting out.
Create a DatetimeIndex and use DataFrame.asfreq:
print (df)
date val
0 2020-12-01 08:00:00 145.9
1 2020-12-01 10:00:00 100.0
2 2020-12-01 16:00:00 99.3
3 2020-12-01 18:00:00 91.0
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('2H')
print (df)
val
date
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
Assuming your df looks like
datetime value
0 2020-12-01T08:00:00 145.9
1 2020-12-01T10:00:00 100.0
2 2020-12-01T16:00:00 99.3
3 2020-12-01T18:00:00 91.0
make sure the datetime column has datetime dtype:
df['datetime'] = pd.to_datetime(df['datetime'])
so that you can now resample to 2-hourly frequency:
df.resample('2H', on='datetime').mean()
value
datetime
2020-12-01 08:00:00 145.9
2020-12-01 10:00:00 100.0
2020-12-01 12:00:00 NaN
2020-12-01 14:00:00 NaN
2020-12-01 16:00:00 99.3
2020-12-01 18:00:00 91.0
Note that you don't need to set the on= keyword if your df already has a datetime index. The df resulting from resampling will have a datetime index.
Also note that I'm using .mean() as the aggregation function, meaning that if you have multiple values within a two-hour interval, you'll get their mean.
You can try the following. I have used datetime and timedelta for this:
from datetime import datetime, timedelta

# Assuming that the data is given like below.
data = ['2020-12-01 08:00:00 145.9',
        '2020-12-01 10:00:00 100.0',
        '2020-12-01 16:00:00 99.3',
        '2020-12-01 18:00:00 91.0']

# initialize the start time using data[0]
date = data[0].split()[0].split('-')
time = data[0].split()[1].split(':')
start = datetime(int(date[0]), int(date[1]), int(date[2]),
                 int(time[0]), int(time[1]), int(time[2]))

newdata = [data[0]]
i = 1
while i < len(data):
    nxt = start + timedelta(hours=2)          # the next expected timestamp
    if str(nxt) != (data[i].split()[0] + ' ' + data[i].split()[1]):
        newdata.append(str(nxt) + ' NaN')     # missing interval -> fill with NaN
    else:
        newdata.append(data[i])               # timestamp present -> keep the record
        i += 1
    start = nxt
newdata
NOTE: timedelta(hours=2) adds 2 hours to the existing time.
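For the sample data above, tracing the loop by hand, newdata should end up as:
['2020-12-01 08:00:00 145.9',
 '2020-12-01 10:00:00 100.0',
 '2020-12-01 12:00:00 NaN',
 '2020-12-01 14:00:00 NaN',
 '2020-12-01 16:00:00 99.3',
 '2020-12-01 18:00:00 91.0']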
I am new to pandas but have been working with Python for a few years now.
I have a large data set of hourly data with multiple columns. I need to group the data by day and then count, for each day and each column, how many times the value is above 85.
example data:
date KMRY KSNS PCEC1 KFAT
2014-06-06 13:00:00 56.000000 63.0 17 11
2014-06-06 14:00:00 58.000000 61.0 17 11
2014-06-06 15:00:00 63.000000 63.0 16 10
2014-06-06 16:00:00 67.000000 65.0 12 11
2014-06-06 17:00:00 67.000000 67.0 10 13
2014-06-06 18:00:00 72.000000 75.0 9 14
2014-06-06 19:00:00 77.000000 79.0 9 15
2014-06-06 20:00:00 84.000000 81.0 9 23
2014-06-06 21:00:00 81.000000 86.0 12 31
2014-06-06 22:00:00 84.000000 84.0 13 28
2014-06-06 23:00:00 83.000000 86.0 15 34
2014-06-07 00:00:00 84.000000 86.0 16 36
2014-06-07 01:00:00 86.000000 89.0 17 43
2014-06-07 02:00:00 86.000000 89.0 20 44
2014-06-07 03:00:00 89.000000 89.0 22 49
2014-06-07 04:00:00 86.000000 86.0 22 51
2014-06-07 05:00:00 86.000000 89.0 21 53
From the sample above my results should look like the following:
date KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
Any help would be greatly appreciated.
(D_RH > 85).sum()
The above code gets me close, but I also need a daily breakdown, not just the overall column counts.
One way would be to make date a DatetimeIndex and then groupby the result of the comparison to 85. For example:
>>> df["date"] = pd.to_datetime(df["date"]) # only if it isn't already
>>> df = df.set_index("date")
>>> (df > 85).groupby(df.index.date).sum()
KMRY KSNS PCEC1 KFAT
2014-06-06 0 2 0 0
2014-06-07 5 6 0 0
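A hedged equivalent using resample on the same DatetimeIndex (daily bins rather than grouping on the date, so the result keeps a datetime index):
>>> (df > 85).resample('D').sum()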