I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should include a row with a value for every date in the year). As the head and tail output shows, certain dates and their corresponding values are missing (30th Jan & 29th Dec). There are many more such gaps in the data frame, sometimes covering more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the data frame)? Appreciate any inputs.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If you need all dates of the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
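A minimal runnable sketch of the first variant, using a small made-up frame (the dates and values are illustrative). Note that chaining .rolling().mean() directly, as above, replaces every value with its rolling mean; this sketch instead fills only the inserted gaps via fillna:

```python
import pandas as pd

# Hypothetical frame with a gap: 2020-01-30 is missing.
df = pd.DataFrame({
    "date": ["2020-01-28", "2020-01-29", "2020-01-31", "2020-02-01"],
    "value": [25, 32, 45, 40],
})
df["date"] = pd.to_datetime(df["date"])

s = df.set_index("date")["value"].asfreq("D")   # inserts NaN rows for missing dates
# Fill only the gaps with a trailing 7-day rolling mean of the known values.
filled = s.fillna(s.rolling("7D", min_periods=1).mean())
```

min_periods=1 ensures a mean is produced even when fewer than seven days of history exist; here the gap at 2020-01-30 is filled with the mean of 25 and 32.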
Situation
I have the following pandas time series data:
date          predicted1
2001-03-13    0.994756
2005-08-22    0.551661
2000-05-07    0.001396
I need to handle the case of resampling into an interval bigger than 5 years, e.g. 10 years:
sample = data.set_index(pd.DatetimeIndex(data['date'])).drop('date', axis=1)['predicted1']
sample.resample('10Y').sum()
I get the following:
date
2000-12-31    0.001396
2010-12-31    1.546418
So the resampling function groups the data for the first year separately from the other years.
Question
How can I group all the data into a single 10-year interval? I want to get something like this:
date
2000-12-31    1.5478132011506138
You can change the origin (reference point), the closed side and the label in resample:
sample.resample('10Y', origin=sample.index.min(), closed='left', label='left').sum()
Output:
date
1999-12-31 1.547813
Freq: 10A-DEC, Name: predicted1, dtype: float64
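A self-contained sketch reproducing this with the three sample rows from the question (the origin= argument requires pandas >= 1.1):

```python
import pandas as pd

# The three predictions from the question, as a Series with a DatetimeIndex.
s = pd.Series(
    [0.994756, 0.551661, 0.001396],
    index=pd.to_datetime(["2001-03-13", "2005-08-22", "2000-05-07"]),
    name="predicted1",
).sort_index()

# Anchor the 10-year bins at the earliest observation so all rows
# fall into a single left-labeled bucket.
out = s.resample("10Y", origin=s.index.min(), closed="left", label="left").sum()
```

With closed='left' and label='left', the single bucket covers all three dates and sums to 1.547813.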
I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
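As an aside, the original attempt was close: with the timestamp as a sorted DatetimeIndex, .loc slicing with square brackets (not the parentheses that caused the SyntaxError) does the same job. A small sketch with made-up minute-interval data:

```python
import pandas as pd

# Minute-interval sample spanning a few hours around the target time.
idx = pd.date_range("2011-07-14 05:00:00", "2011-07-14 08:00:00", freq="1min")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

date = pd.Timestamp("2011-07-14 06:15:00")
hour = pd.DateOffset(hours=1)

# .loc slicing on a sorted DatetimeIndex is inclusive on both ends.
window = df.loc[date - hour : date + hour]
```

This returns every row from 05:15:00 through 07:15:00 inclusive, i.e. 121 one-minute rows.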
Given a dataframe df with only one column (consisting of datetime values that can be repeated). e.g:
date
2017-09-17
2017-09-17
2017-09-22
2017-11-04
2017-11-15
and df.dtypes is date datetime64[ns].
How can I create a new dataframe, derived from the existing one, so that for every month of a particular year there is a second column with the number of observations in that month?
The result for the above example would be something like:
date       observations
2017-09    3
2017-11    2
You can do:
(df['date'].dt.to_period('M') # change date to Month
.value_counts() # count the Month
.reset_index(name='observations') # make dataframe
)
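A runnable sketch with the sample dates from the question (the rename_axis and sort_values calls are added here for a tidy, chronologically ordered result):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2017-09-17", "2017-09-17", "2017-09-22", "2017-11-04", "2017-11-15"])})

out = (df["date"].dt.to_period("M")        # truncate each date to its month
       .value_counts()                     # count rows per month
       .rename_axis("date")               # name the index column explicitly
       .reset_index(name="observations")  # turn the counts into a DataFrame
       .sort_values("date", ignore_index=True))
```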
I have a dataframe with two columns: timeStamp and eventMessage (string).
timeStamp                            eventMessage
2020-10-19T10:07:56.7450775+02:00    transaction successful
2020-10-19T10:08:13.025169+02:00     transaction successful
I want to end up with a dataframe that has two columns : hour and numberOfEvents per that hour.
hour    numberOfEvents
1       41
2       0
...     ...
24      32
I've tried df.resample('H', on='timeStamp', how='count'), but I think the how= argument has been deprecated and removed?
Is there a new quick pandas way to do it?
UPDATE: thanks to Ami Tavory's tip the df now looks like this:
timeStamp
10 792
11 792
14 594
15 198
16 198
I'm not actually sure if it's a dataframe with one column or some other type completely. And how do I fill in the hours that had zero events?
Miniupdate: It's pandas.core.series.Series
Converted it to df with:
series = df.message.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
df2 = pd.DataFrame({'hour': series.index, 'counted': series.values})
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
Regarding your new question (after the edit).
Converted it to df with:
You can more easily convert it with
df = series.to_frame().
Now I just need to figure out how to add and fill in other hours from 1 to 24 that had no events with a zero.
new_index = pd.Index(range(24), name="hour")  # dt.hour runs 0-23, so 24 slots
df2.set_index("hour").reindex(new_index).fillna(0)
Group by the hour, and count:
df.eventMessage.groupby(pd.to_datetime(df.timeStamp).dt.hour).count()
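Putting the two steps together, a runnable sketch (with a small made-up frame) that counts events per hour and zero-fills the empty hours via reindex:

```python
import pandas as pd

# Illustrative events: two in hour 10, one in hour 14.
df = pd.DataFrame({
    "timeStamp": ["2020-10-19T10:07:56.7450775+02:00",
                  "2020-10-19T10:08:13.025169+02:00",
                  "2020-10-19T14:30:00.000000+02:00"],
    "eventMessage": ["transaction successful"] * 3,
})

counts = (df.eventMessage
          .groupby(pd.to_datetime(df.timeStamp).dt.hour)  # group by hour 0-23
          .count()
          .reindex(range(24), fill_value=0))              # zero-fill empty hours
```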
I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here is that this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean; for example, with 10-day intervals, the mean between July 1 and July 10 would become the point at July 5.
A long way would be: get the date range, split it into N ranges with start and end dates, filter the data by date, calculate the mean, and put the results in another DataFrame.
Is there a short way to do that?
PS: Ignore the peaks
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And having taking the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer.
The problem with that answer is that the rolling function there counts rows, not dates; with some transformations, rolling can read a Timestamp index and use time as the window [pandas.rolling]
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(df.date).drop('date', axis=1)
df.sort_index(inplace=True)
df.rolling('10d').mean().plot( ylim=(30,100) , figsize=(16,5),grid='true')
Final results
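For the fixed-interval idea from the original question (one mean per 10-day bucket rather than a sliding window), resample can also be used. A sketch with synthetic daily data standing in for numpy_data:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the question's data: one value per day of 2018.
dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(60, 10, len(dates))}, index=dates)

# Fixed 10-day buckets, one mean per bucket (coarser than a rolling mean).
smoothed = df["value"].resample("10D").mean()
```

Each bucket's label is its start date; 365 days yield 37 buckets (36 full plus one partial).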