Pandas resample up to certain date - filling missing timeseries - python

I am trying to resample my time series to get a consistent dataframe shape across many iterations.
Sometimes when I pull my data there are no results, so I am trying to resample my dataframe to include a fill for every time this has happened; however, I want to force the resample to run up to a certain date.
My current attempt is:
df.set_index(df.date, inplace=True)
resampled = df.resample('D').sum()
But I am unsure how to force the resampler to continue to the latest date.
I have also tried:
df.index = pd.period_range(min(older_df.date), max(older_df.date))
but then there is a length mismatch.

Chain the resample with reindex:
min_date = df.index.min()
max_date = '2020-01-01' # change your date here
daterange = pd.date_range(min_date, max_date)
df.resample('D').sum().reindex(daterange)
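If you want the padded rows to hold zeros rather than NaN, reindex accepts a fill_value argument; a minimal sketch built on the names from the answer above (min_date, max_date, daterange):
daterange = pd.date_range(min_date, max_date, freq='D')
resampled = df.resample('D').sum().reindex(daterange, fill_value=0)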

Related

Python pandas.datetimeindex piecewise dataframe slicing

I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for printing a piecewise graph with matplotlib). In other words, I need a new DF which would be a subset of the first one.
More precisely, I need to take all rows that are between 9 and 16 o'clock, but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f') # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')
df.set_index('time', inplace=True)
new_df = df[startTime:endTime] # startTime and endTime are strings
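If your index is a full DatetimeIndex carrying both date and time, an alternative (a sketch under that assumption, with placeholder date bounds) is to slice the date range with .loc and then keep only the 09:00-16:00 rows with between_time:
# keep only rows inside the date range, then only those between 9 and 16 o'clock
subset = df.loc['2017-01-02':'2017-01-06']        # hypothetical date bounds
subset = subset.between_time('09:00', '16:00')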

.fillna breaking .dt.normalize()

I am trying to clean up some data by formatting my floats to show no decimal points and my date/time to show only the date. After that, I want to fill in my NaNs with an empty string, but when I do that, my dates go back to showing both date and time. Any idea why, or how to fix it?
This is my code before I run the fillna() method, together with a picture of what my data looks like:
#Creating DataFrame from path variable
daily_production_df = pd.read_excel(path)
#Reformatted the Date series to show only the date (time excluded)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
#daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[screenshot: output with NaNs]
This is when I run the fillna() method:
daily_production_df = pd.read_excel(path)
#Reformatted the Date series to show only the date (time excluded)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[screenshot: output showing date and time]
Using normalize() does not change the dtype of the column; pandas just stops displaying the time portion when printing because every value shares the same midnight time. Once you call fillna('') the column is upcast to object dtype, so the full Timestamp, including the time, is shown again.
I would recommend converting the column to actual datetime.date objects instead of using normalize():
df['date'] = pd.to_datetime(df['date']).dt.date
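Applied to the code from the question, that change would look roughly like this (a sketch; path, daily_production_df and the 'Date' column are taken from the question):
daily_production_df = pd.read_excel(path)
# store plain datetime.date objects, so fillna('') cannot bring the time portion back
daily_production_df['Date'] = pd.to_datetime(daily_production_df['Date']).dt.date
pd.options.display.float_format = '{:,.0f}'.format
daily_production_df = daily_production_df.fillna('')
daily_production_df.head(16)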

Convert string time into DatetimeIndex and then resample

Two of the columns in my dataset are hour and mins as integers. Here's a snippet of the dataset.
I'm creating a timestamp through the following code:
TIME = pd.to_timedelta(df["hour"], unit='h') + pd.to_timedelta(df["mins"], unit='m')
#df['TIME'] = TIME
df['TIME'] = TIME.astype(str)
I convert TIME to string format because I'm exporting the dataframe to MS Excel, which doesn't support the timedelta format.
Now I want timestamps for every minute.
For that, I want to fill in the missing minutes and add a zero to the TOTAL_TRADE_RATE against them, for which I first have to set the TIME column as the index. I'm applying this:
df = df.set_index('TIME')
df.index = pd.DatetimeIndex(df.index)
df.resample('60s').sum().reset_index()
but it's giving the following error:
Unknown string format: 0 days 09:33:00.000000000
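The strings produced by TIME.astype(str) are timedelta reprs such as '0 days 09:33:00', which pd.DatetimeIndex cannot parse; hence the error. One way around it, sketched below under the assumption that the string conversion is only needed for the final Excel export, is to resample on a TimedeltaIndex first and stringify afterwards:
# build a TimedeltaIndex, resample to one-minute bins, then stringify for Excel
td = pd.to_timedelta(df['hour'], unit='h') + pd.to_timedelta(df['mins'], unit='m')
resampled = df.set_index(td).drop(columns=['hour', 'mins']).resample('60s').sum()
resampled.index = resampled.index.astype(str)  # e.g. '0 days 09:33:00'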

Check Time Series Data for Missing Values

I would like to analyse time series data where I have several million entries.
The data has a granularity of one entry per minute.
By definition, no data exists during the weekend, nor for one hour during each weekday.
I want to check for missing data during the week (so: whether one or more minutes are missing).
How would I do this with high performance in Python (e.g. with a pandas DataFrame)?
Probably the easiest would be to compare your DatetimeIndex with missing values to a reference DatetimeIndex covering the same range with all values.
Here's an example where I create an arbitrary DatetimeIndex and include some dummy values in a DataFrame.
import pandas as pd
import numpy as np
#dummy data
date_range = pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1Min')
df = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 1)))
df.index = date_range # set index
df_missing = df.drop(df.between_time('00:12', '00:14').index)
#check for missing datetimeindex values based on reference index (with all values)
missing_dates = df.index[~df.index.isin(df_missing.index)]
print(missing_dates)
Which will return:
DatetimeIndex(['2017-01-01 00:12:00', '2017-01-01 00:13:00',
               '2017-01-01 00:14:00'],
              dtype='datetime64[ns]', freq='T')
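For the original question (minute granularity, no weekend data), the reference index can be a full minute-frequency range restricted to weekdays, and Index.difference then returns the gaps; a minimal sketch, assuming df is indexed by timestamp:
# reference index: every minute between the first and last timestamp, weekdays only
full_range = pd.date_range(df.index.min(), df.index.max(), freq='1Min')
weekday_range = full_range[full_range.weekday < 5]   # weekday 5/6 = Saturday/Sunday
missing = weekday_range.difference(df.index)
The known one-hour weekday gap will still show up in missing and can be filtered out afterwards.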

How to resample time series with pandas and pad dates within a range

I have a pandas Series indexed by timestamp. I group by the value in the series, so I end up with a number of groups, each with its own timestamp-indexed Series. I then want to resample() each group's series to weekly, but want to align the first date across all groups.
My code looks something like this:
grp = df.groupby(df)
for userid, user_df in grp:
    resample = user_df.resample('1W').apply(__some_fun)
The only way I have found to make sure alignment happens on the left-hand side of the date range is to fake-pad each group with one value, e.g.:
grp = df.groupby(df)
for userid, user_df in grp:
    user_df = pandas.concat([user_df, pandas.Series([0], index=[pandas.to_datetime('2013-09-02')])])
    resample = user_df.resample('1W').apply(__some_fun)
It seems to me that this must be a common workflow; is there any pandas insight?
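One way to get that alignment without appending a fake row (a sketch, not an answer taken from the thread; df and __some_fun are the names from the question) is to take the weekly bin labels from resampling the whole series once and reindex each group's result against them:
# weekly bin labels covering the full span of the data
all_weeks = df.resample('1W').size().index
aligned = {}
for userid, user_df in df.groupby(df):
    res = user_df.resample('1W').apply(__some_fun)
    aligned[userid] = res.reindex(all_weeks, fill_value=0)
Because every group's range lies inside the full series' range, each group's weekly labels are a subset of all_weeks, so the reindex pads every group out to the same first and last week.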
