I would like to analyse time series data with some millions of entries.
The data has a granularity of one data entry per minute.
By definition, no data exists during the weekend, nor for one hour during each weekday.
I want to check for missing data during the week (so: whether one or more minutes are missing).
How would I do this with high performance in Python (e.g. with a Pandas DataFrame)?
Probably the easiest approach is to compare your DatetimeIndex (with missing values) against a reference DatetimeIndex covering the same range with all values.
Here's an example where I create an arbitrary DatetimeIndex and fill a DataFrame with some dummy values.
import pandas as pd
import numpy as np
# dummy data
date_range = pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1Min')
df = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 1)))
df.index = date_range  # set index
# drop three minutes to simulate missing data
df_missing = df.drop(df.between_time('00:12', '00:14').index)
# check for missing DatetimeIndex values based on the reference index (with all values)
missing_dates = df.index[~df.index.isin(df_missing.index)]
print(missing_dates)
Which will return:
DatetimeIndex(['2017-01-01 00:12:00', '2017-01-01 00:13:00',
               '2017-01-01 00:14:00'],
              dtype='datetime64[ns]', freq='T')
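To also handle the question's weekend and gap-hour exclusions, you can filter the reference index before comparing. A sketch, with the no-data hour hypothetically set to 12:00-12:59 (Index.difference stays fast even with millions of rows):
import pandas as pd
import numpy as np
# expected index: every minute of one full week (Mon 2017-01-02 .. Sun 2017-01-08),
# minus weekends and the hypothetical 12:00-12:59 no-data hour
full = pd.date_range('2017-01-02 00:00', '2017-01-08 23:59', freq='1Min')
expected = full[(full.dayofweek < 5) & (full.hour != 12)]
# dummy data with ten minutes dropped to simulate a gap
df = pd.DataFrame(np.random.randint(1, 20, (expected.shape[0], 1)), index=expected)
df = df.drop(df.index[1000:1010])
missing_dates = expected.difference(df.index)
print(missing_dates)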
I have a DataFrame with a pandas DatetimeIndex. I need to take many slices from it (for printing a piecewise graph with matplotlib). In other words, I need a new DataFrame that is a subset of the first one.
More precisely, I need all rows that fall between 9 and 16 o'clock, but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f') # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')
df.set_index('time', inplace=True)
new_df = df[startTime:endTime] # startTime and endTime are strings
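An alternative sketch (with made-up dates): if you keep a true DatetimeIndex instead of converting back to strings, you can slice the date range with .loc and filter the hours with between_time.
import pandas as pd
import numpy as np
# dummy data: one row per hour over two weeks, indexed by a DatetimeIndex
idx = pd.date_range('2017-01-01', '2017-01-14 23:00', freq='H')
df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)
# slice the date range first, then keep only rows between 09:00 and 16:00
subset = df.loc['2017-01-03':'2017-01-10'].between_time('09:00', '16:00')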
I am trying to resample my time series to gain a consistent dataframe shape across many iterations.
Sometimes when I pull my data there is no result, so I am trying to resample my dataframe to include a fill for every time this has happened; however, I want to force the resample to run up to a certain date.
My current efforts include:
df.set_index(df.date, inplace=True)
resampled = df.resample('D').sum()
But I am unsure how to force the resampler to continue to the latest date.
I have also tried:
df.index = pd.period_range(min(older_df.date), max(older_df.date))
but then there is a length mismatch.
Chain the resample with reindex:
min_date = df.index.min()
max_date = '2020-01-01' # change your date here
daterange = pd.date_range(min_date, max_date)
df.resample('D').sum().reindex(daterange)
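Note that reindex pads the added dates with NaN; if you want an actual fill (e.g. zeros) for those days, reindex accepts a fill_value argument:
df.resample('D').sum().reindex(daterange, fill_value=0)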
A DataFrame has Date as its index. I need to add a column whose value should be days_since_epoch. For a single date, this value can be calculated with:
(date_value - datetime.datetime(1970,1,1)).days
How can this value be calculated for all rows in the DataFrame?
The following code demonstrates the operation with a sample DataFrame; is there a better way of doing this?
import pandas as pd
date_range = pd.date_range(start='1/1/1970', end='12/31/2018', freq='D')
df = pd.DataFrame(date_range, columns=['date'])
df['days_since_epoch'] = range(len(df))
df = df.set_index('date')
Note: this is an example; the dates in the DataFrame need not start from 1 Jan 1970.
Subtract the scalar from the DatetimeIndex and then call TimedeltaIndex.days:
df['days_since_epoch1'] = (df.index - pd.Timestamp('1970-01-01')).days
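A quick check on a small frame (the dates here happen to start at the epoch, matching the example above, but the subtraction works for any start date):
import pandas as pd
date_range = pd.date_range(start='1/1/1970', periods=5, freq='D')
df = pd.DataFrame(index=date_range)
df['days_since_epoch1'] = (df.index - pd.Timestamp('1970-01-01')).days
print(df['days_since_epoch1'].tolist())  # [0, 1, 2, 3, 4]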
I have a CSV file that has time represented in a format I'm not familiar with (an ISO-style string, with a T between the date and the time and a -4:00 timezone offset).
I am trying to compute the average time in all of those rows (efforts shown below).
Any sort of feedback will be appreciated.
import pandas as pd
import numpy as np
from datetime import datetime
flyer = pd.read_csv("./myfile.csv", parse_dates=['timestamp'])
flyer.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
pd.set_option('display.max_rows', 20)
flyer['timestamp'] = pd.to_datetime(flyer['timestamp'],
infer_datetime_format=True)
p = flyer.loc[:,'timestamp'].mean()
print(flyer['timestamp'].mean())
The above is correct, but if you're new it might not be clear what that answer is feeding you.
import pandas as pd
# turn your csv into a pandas dataframe
df = pd.read_csv('your/file/location.csv')
The timestamp column might be interpreted as a bunch of strings, and you won't be able to do the math you want on strings.
# this forces the column's data into timestamp variables
df['timestamp'] = pd.to_datetime(df['timestamp'], infer_datetime_format=True)
# now for your answer, get the average of the timestamp column
print(df['timestamp'].mean())
When you read the csv with pandas, add parse_dates = ['timestamp'] to the pd.read_csv() function call and it will read in that column correctly. The T in the timestamp field is a common way to separate the date and the time.
The -4:00 indicates time zone information, which in this case means -4:00 hours in comparison to UTC time.
As for calculating the mean time, that can get a bit tricky, but here's one solution for after you've imported the csv.
from datetime import datetime
mean_ts = df['timestamp'].mean()                       # a pandas Timestamp
mean_dt = datetime.fromtimestamp(mean_ts.timestamp())  # via seconds since the epoch
This converts the field to datetime objects in order to calculate the mean, then takes the seconds since the epoch and uses them to convert back into a plain datetime value.
I have a pandas Series indexed by timestamp. I group by the value in the Series, so I end up with a number of groups, each with its own timestamp-indexed Series. I then want to resample() each group's Series to weekly, but want to align the first date across all groups.
My code looks something like this:
grp = df.groupby(df)
for userid, user_df in grp:
    resample = user_df.resample('1W', __some_fun)
The only way I have found to make sure alignment happens on the left hand side of the date is to fake pad each group with one value, e.g.:
grp = df.groupby(df)
for userid, user_df in grp:
    user_df = user_df.append(pandas.Series([0], index=[pandas.to_datetime('2013-09-02')]))
    resample = user_df.resample('1W', __some_fun)
It seems to me that this must be a common workflow; any pandas insight?
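One way to avoid the padding hack, sketched on made-up data: resample each group independently, then reindex every result onto one shared weekly index (the weekly labels all fall on the same weekday, so they line up).
import pandas as pd
# made-up data: a Series of user ids indexed by timestamp
idx = pd.to_datetime(['2013-09-03', '2013-09-20', '2013-10-01', '2013-10-15'])
s = pd.Series(['a', 'a', 'b', 'b'], index=idx)
# one shared weekly index covering the whole span
# (pad the end so the last partial week is included)
common_weeks = pd.date_range(s.index.min(), s.index.max() + pd.Timedelta(days=6), freq='W')
aligned = {}
for userid, user_s in s.groupby(s):
    # weekly counts per group, aligned onto the shared index
    aligned[userid] = user_s.resample('W').count().reindex(common_weeks, fill_value=0)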