Python datetime resample suddenly results in NaN values

I have tried to resample my values to hourly frequency. However, I had to change the date format in the csv file because months and days with low numbers were being swapped automatically (2003-04-01 suddenly became 2003-01-04). Now the date format looks fine when I inspect the csv file in Python, but after resampling, the values appear as NaN.
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
hour_avg = df_2.resample('H').mean()
Sample of my data:
[image: raw data with time as index]
Afterwards, even though the time column is a datetime, 99% of the data shows as NaN (only one value per day survives):
[image: data with NaN values after resampling per hour]
When I resample to daily values, all values come back, so the problem seems to be with the time component.
When I use the original format, the error "The format doesn't fit" comes up.
I tried a different way before (I'm not sure what was different) and resampling per hour worked.
What do I need to change to be able to resample per hour again?

Can you share a sample of your data? Assuming that your data consists of a datetime feature (i.e. yyyy-mm-dd hh:mm:ss) and some other features that you are trying to resample by hour, NaN values can occur for two reasons: incorrect parsing by pandas or missing hour values in the data.
(1) It is possible that pandas is not reading your dates correctly. Once you read the file, make sure the date column is in the right format (i.e. yyyy-mm-dd).
df = pd.read_csv(r'C:\Users\water_level.csv',parse_dates=[0],index_col=0,decimal=",", delimiter=';')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
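As a quick check, here is a sketch (the path is the one from the question; dayfirst only matters if your file stores day-first dates, which the swapping you describe suggests):
import pandas as pd

# dayfirst=True tells the parser that 01-04-2003 means 1 April 2003,
# which addresses the day/month swapping described above
df = pd.read_csv(r'C:\Users\water_level.csv', parse_dates=[0], index_col=0,
                 decimal=",", delimiter=';', dayfirst=True)

print(df.index.dtype)                   # should be datetime64[ns], not object
print(df.index.min(), df.index.max())   # sanity-check the parsed range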
(2) If there are any gaps in your data, NaN values will pop up. For instance, assume the data is of this form:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:06:00 1
If you resample it at its own one-minute granularity, e.g. minute_avg = df_2.resample('T').mean() (the same thing happens with 'H' when whole hours are missing), your output will look like:
2000-01-01 00:00:00 1
2000-01-01 00:01:00 1
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 1
2000-01-01 00:04:00 1
2000-01-01 00:05:00 NaN
2000-01-01 00:06:00 1
I suspect the problem is the latter. If it is, you can simply drop the NaN rows with .dropna(). Otherwise, if you do need the hourly bins regardless of missing data, you can avoid the NaN values by padding (forward-filling) the missing bins first and then taking the mean:
hour_pad = df_2.resample('H').pad()
hour_avg = hour_pad.resample('H').mean()
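A minimal, self-contained sketch of this gap behaviour (toy data; note that in recent pandas versions .pad() is spelled .ffill()):
import pandas as pd

idx = pd.to_datetime(['2000-01-01 00:10', '2000-01-01 01:20',
                      '2000-01-01 03:40'])   # the 02:00 hour is missing
df_2 = pd.DataFrame({'level': [1.0, 2.0, 4.0]}, index=idx)

print(df_2.resample('H').mean())    # the 02:00 bin shows NaN
print(df_2.resample('H').ffill())   # forward-fills the empty bin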

Related

Create Multiple DataFrames using Rolling Window from DataFrame Timestamps

I have one year's worth of data at four-minute time series intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data in the range of 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame(index=myIndex).reset_index().rename(columns={'index': 'Timestamp'})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it's not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
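If you only need the function at the eight-hour steps described in the question, one variation (a sketch reusing the same s and myfunc) is to apply it over an eight-hour anchor grid instead of at every four-minute stamp:
anchors = pd.Series(pd.date_range(start, end, freq='8H'))
results = anchors.apply(myfunc)   # one 24-hour window per 8-hour step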

Pandas: How to group by a datetime column, using only the time and discarding the date

I have a dataframe with a datetime column. I want to group by the time component only and aggregate, e.g. by taking the mean.
I know that I can use pd.Grouper to group by date AND time, but it doesn't work on time only.
Say we have the following dataframe:
import numpy as np
import pandas as pd
drange = pd.date_range('2019-08-01 00:00', '2019-08-12 12:00', freq='1T')
time = drange.time
c0 = np.random.rand(len(drange))
c1 = np.random.rand(len(drange))
df = pd.DataFrame(dict(drange=drange, time=time, c0=c0, c1=c1))
print(df.head())
drange time c0 c1
0 2019-08-01 00:00:00 00:00:00 0.031946 0.159739
1 2019-08-01 00:01:00 00:01:00 0.809171 0.681942
2 2019-08-01 00:02:00 00:02:00 0.036720 0.133443
3 2019-08-01 00:03:00 00:03:00 0.650522 0.409797
4 2019-08-01 00:04:00 00:04:00 0.239262 0.814565
In this case, the following throws a TypeError:
grouper = pd.Grouper(key='time', freq='5T')
grouped = df.groupby(grouper).mean()
I could set key=drange to group by date and time and then:
- reset the index,
- transform the new column to float,
- bin with pd.cut,
- cast back to time,
- and finally group by and aggregate.
...but I wonder whether there is a cleaner way to achieve the same results.
Series.dt.time/DatetimeIndex.time returns the time as datetime.time objects. This isn't great because pandas works best with timedelta64, so your 'time' column is cast to object, losing all datetime functionality.
You can subtract off the normalized date (normalize() keeps the date but resets the time to midnight) to obtain the time of day as a timedelta, so you can keep using pandas' datetime tools. You can floor the result to group:
s = (df.drange - df.drange.dt.normalize()).dt.floor('5T')
df.groupby(s).mean()
c0 c1
drange
00:00:00 0.436971 0.530201
00:05:00 0.441387 0.518831
00:10:00 0.465008 0.478130
... ... ...
23:45:00 0.523233 0.515991
23:50:00 0.468695 0.434240
23:55:00 0.569989 0.510291
Alternatively, if you're unsure about floor, this gets identical output up to the index name:
df['time'] = (df.drange - df.drange.dt.normalize()) # timedelta64[ns]
df.groupby(pd.Grouper(key='time', freq='5T')).mean()
When you use DataFrame.groupby, you can pass a Series as an argument. Moreover, if your Series holds datetimes, you can use the .dt accessor to reach the date components. In your case, df['drange'].dt.hour or df['drange'].dt.time should do it.
# df['drange']=pd.to_datetime(df['drange'])
df.groupby(df['drange'].dt.hour).agg(...)
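For instance, a minimal sketch using the df built in the question, averaging c0 and c1 by hour of day:
hourly = df.groupby(df['drange'].dt.hour)[['c0', 'c1']].mean()
print(hourly)   # 24 rows, one per hour of the day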

Resample DataFrame with DatetimeIndex and keep date range

My problem might sound trivial, but I haven't found any solution for it:
when I resample a DataFrame with a DatetimeIndex, e.g. into three-monthly values, I want the resampled data to remain within the same date range as the original data.
Minimal example:
import numpy as np
import pandas as pd
# data from 2014 to 2016
dim = 8760 * 3 + 24
idx = pd.date_range('1/1/2014 00:00:00', freq='h', periods=dim)
df = pd.DataFrame(np.random.randn(dim, 2), index=idx)
# resample to three months
df = df.resample('3M').sum()
print(df)
yielding
0 1
2014-01-31 24.546928 -16.082389
2014-04-30 -52.966507 -40.255773
2014-07-31 -32.580114 47.096810
2014-10-31 -9.501333 12.872683
2015-01-31 -106.504047 45.082733
2015-04-30 -34.230358 70.508420
2015-07-31 -35.916497 104.930101
2015-10-31 -16.780425 17.411410
2016-01-31 68.512994 -43.772082
2016-04-30 -0.349917 27.794895
2016-07-31 -30.408862 -18.182486
2016-10-31 -97.355730 -105.961101
2017-01-31 -7.221361 40.037358
Why does the resampling exceed the date range, e.g. create an entry for 2017-01-31, and how can I prevent this and instead remain within the original range, e.g. between 2014-01-01 and 2016-12-31? Shouldn't January-March, April-June, ..., October-December be the expected standard behaviour anyway?
Thanks in advance!
There are 36 months in your DataFrame.
When you resample every 3 months, the first row contains everything up to the end of your first month, the next row contains the following three months, and so on. Your last row therefore covers everything from 2016-10-31 until 3 months after that, which is 2017-01-31.
If you want, you could change it to
df.resample('3M', closed='left', label='left').sum()
, giving you
2013-10-31 3.705955 25.394287
2014-01-31 38.778872 -12.655323
2014-04-30 10.382832 -64.649173
2014-07-31 66.939190 31.966008
2014-10-31 -39.453572 27.431183
2015-01-31 66.436348 29.585436
2015-04-30 78.731608 -25.150526
2015-07-31 14.493226 -5.842421
2015-10-31 -2.394419 58.017105
2016-01-31 -36.295499 -14.542251
2016-04-30 69.794101 62.572736
2016-07-31 76.600558 -17.706111
2016-10-31 -68.842328 -32.723581
, but then the first row would be 'outside your range'.
If you resample every 3 months, then either your first row is going to be outside your range, or your last one is.
EDIT
If you want the bins to be 'first three months', 'next three months', and so on, you could write
df.resample('3MS').sum()
, as this will anchor the bins at the beginning of each month rather than its end (see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases).
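A quick sketch on the same data (assuming df is the original hourly frame from the question, before the '3M' resample) to confirm the labels stay inside the range:
df_q = df.resample('3MS').sum()
print(df_q.index.min(), df_q.index.max())   # 2014-01-01 ... 2016-10-01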

shifting timezone for reshaped pandas dataframe

I am using Pandas dataframes with a DatetimeIndex to manipulate timeseries data. The data is stored at UTC time and I usually keep it that way (with a naive DatetimeIndex), only using timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00', '2017-12-31 23:30', freq='30Min')
data = np.random.rand(17520, 1)
df = pd.DataFrame(data, index=ts, columns=['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year, so I reshape the dataframe to have time along the index and dates as columns:
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones, I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
but I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight savings time. However, the times and dates are no longer Timestamp objects, so I can't use pandas' timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually (given DST starts 1/10 at 2am and finishes 1/4 at 2am), so I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with that.)
Use tz_localize + tz_convert to convert the dataframe's dates to a particular timezone:
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
Be a little careful when creating the MultiIndex - as you observed, it creates rows with duplicate timestamps, so if that's the case, get rid of them with duplicated:
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()
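Putting it together, a minimal end-to-end sketch (assuming the df built in the question; the duplicated local timestamp comes from the April DST fall-back):
import matplotlib.pyplot as plt

df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
df = df[~df.index.duplicated()]   # drop the repeated local timestamp
df_new = df['data'].unstack()

fig, ax = plt.subplots()
ax.plot(df_new.index, df_new)
plt.show()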

Time differentiation in Pandas

Say I have a dataframe with several timestamps and values. I would like to measure Δ values / Δt every 2.5 seconds. Does Pandas provide any utilities for time differentiation?
time_stamp values
19492 2014-10-06 17:59:40.016000-04:00 1832128
167106 2014-10-06 17:59:41.771000-04:00 2671048
202511 2014-10-06 17:59:43.001000-04:00 2019434
161457 2014-10-06 17:59:44.792000-04:00 1294051
203944 2014-10-06 17:59:48.741000-04:00 867856
It most certainly does. First, you'll need to convert your timestamps into a pandas DatetimeIndex, and then you can use the offset-based functionality available to series/dataframes indexed that way. The pandas resampling documentation and the list of offset aliases are both helpful here.
This code should resample your data to 2.5-second intervals:
# df is your dataframe
index = pd.DatetimeIndex(df['time_stamp'])
values = pd.Series(df['values'].values, index=index)
# see the offset aliases mentioned above: S = seconds
resampled_values = values.resample('2.5S').mean()
resampled_values.diff()   # compute the difference between each point!
That should do it.
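A self-contained sketch of the same idea, using the question's first few rows as toy data ('2.5S' resolves to 2500-millisecond bins):
ts = pd.to_datetime(['2014-10-06 17:59:40.016', '2014-10-06 17:59:41.771',
                     '2014-10-06 17:59:43.001', '2014-10-06 17:59:44.792'])
values = pd.Series([1832128, 2671048, 2019434, 1294051], index=ts)
print(values.resample('2.5S').mean().diff())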
If you really want the time derivative, then you also need to divide by the time difference (delta time, dt) since the last sample.
An example:
dti = pd.DatetimeIndex(['2018-01-01 00:00:00',
                        '2018-01-01 00:00:02',
                        '2018-01-01 00:00:03'])
X = pd.DataFrame({'data': [1, 3, 4]}, index=dti)
X.head()
data
2018-01-01 00:00:00 1
2018-01-01 00:00:02 3
2018-01-01 00:00:03 4
You can find the time deltas by calling diff() on the DatetimeIndex. This gives you a Series of timedeltas; you only need the values in seconds, though:
dt = pd.Series(X.index).diff().dt.seconds.values
dXdt = X.diff().div(dt, axis=0)
dXdt.head()
data
2018-01-01 00:00:00 NaN
2018-01-01 00:00:02 1.0
2018-01-01 00:00:03 1.0
As you can see, this approach takes into account that there are two seconds between the first two values and only one between the last two. :)
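One caveat: .dt.seconds returns only the seconds component of each timedelta, so for gaps of a day or more (or sub-second sampling) .dt.total_seconds() is the safer choice:
dt = pd.Series(X.index).diff().dt.total_seconds().values
dXdt = X.diff().div(dt, axis=0)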
