I am trying to obtain daily averages from an irregular time series stored in a CSV file.
The data start at 13:00 on 20 September 2013 and run until 10:57 on 14 January 2014:
Time Values
20/09/2013 13:00 5.133540
20/09/2013 13:01 5.144993
20/09/2013 13:02 5.158208
20/09/2013 13:03 5.170542
20/09/2013 13:04 5.167899
20/09/2013 13:25 5.168780
20/09/2013 13:26 5.179351
...
I import them with:
import pandas as pd
data = pd.read_csv('<file name>', parse_dates={'Timestamp': ['Time']}, index_col='Timestamp')
This results in
Values
Timestamp
2013-09-20 13:00:00 5.133540
2013-09-20 13:01:00 5.144993
2013-09-20 13:02:00 5.158208
2013-09-20 13:03:00 5.170542
2013-09-20 13:04:00 5.167899
2013-09-20 13:25:00 5.168780
2013-09-20 13:26:00 5.179351
...
And then I do
dataDailyAv = data.resample('D').mean()
This results in
Values
Timestamp
2013-01-10 8.623744
2013-01-11 NaN
2013-01-12 NaN
2013-01-13 NaN
2013-01-14 NaN
...
In other words, the result contains dates that do not appear in the original data, and for some of these dates (e.g. 10 January 2013), there even appears a value.
Any ideas about what is going wrong?
Thanks.
Edit: apparently something goes wrong with the parsing of the dates: 01/10/2013 is interpreted as 10 January 2013 instead of 1 October 2013. This can be solved by editing the date format in the CSV file, but is there a way to specify the date format in read_csv?
You want dayfirst=True, one of the many tweaks listed in the read_csv docs.
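For example, a minimal sketch reusing the placeholder file name from the question:
import pandas as pd

# dayfirst=True makes 01/10/2013 parse as 1 October 2013, not 10 January 2013
data = pd.read_csv('<file name>', parse_dates={'Timestamp': ['Time']},
                   index_col='Timestamp', dayfirst=True)
dataDailyAv = data.resample('D').mean()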
I have the pandas DataFrame below with a datetime index. The DataFrame shows data for the months of April and May. (The original DataFrame has many more columns.)
I want to remove all the rows for the month of May, i.e. starting from index 2022-05-01 00:00:00 and ending at 2022-05-31 23:45:00. Currently I am doing it by explicitly mentioning the index labels, but I am sure there must be a more sophisticated way to do it without having to mention the index labels, so that if the data changes and I want to remove the next month, I don't have to hard-code it. I would appreciate help with this.
Current Code:
start_remove = pd.to_datetime('2022-05-01 00:00:00')
end_remove = pd.to_datetime('2022-05-31 23:45:00')
df = df.loc[(df.index < start_remove) | (df.index > end_remove)]
Sample Dataset:
date Open Close High Low
...
2022-04-30 23:30:00 10 11.4 10.2 10.7
2022-04-30 23:45:00 18 17.2 17.2 15.8
2022-05-01 00:00:00 24 24 24.8 24.8
2022-05-01 00:15:00 59 58 60 60.3
2022-05-01 00:30:00 43.7 43.9 48 48
...
...
2022-05-31 23:45:00 41.7 53.9 51 50
You may want to include the year when selecting the month, to avoid deleting the same month from another year:
# assumption: date field is an index
# and is already converted to datetime using pd.to_datetime
df.drop(df.loc[df.index.strftime('%Y%m') == '202205'].index)
Converting the index to datetime:
df.index = pd.to_datetime(df.index)
df
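If you would rather not hard-code the label at all, a hedged sketch (assuming the index is already a DatetimeIndex) that compares monthly periods instead:
import pandas as pd

month_to_drop = pd.Period('2022-05', freq='M')  # change this to drop a different month
df = df[df.index.to_period('M') != month_to_drop]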
I have to extract all the available dates from a PDF and then check which of those dates is the contract date.
For that, I first want to extract all the dates from the text I have extracted from the PDF. The dates can be in various formats; I have tried to include all flavours of dates in the example below.
I tried using the datefinder Python module to extract all the dates. It comes close, but it throws a few garbage dates at the start and also doesn't match the first date correctly.
import datefinder
dateContent = """ Test
I want to apply for leaves August, 11, 2017 I want to apply for leaves Aug, 23, 2017 I want to apply for leaves Aug, 21, 17
I want to apply for leaves August 20 2017
I want to apply for leaves August 30th, 2017 I want to apply for leaves August 31st 17
I want to apply for leaves 8/26/2017 I want to apply for leaves 8/27/17
I want to apply for leaves 28/8/2017 I want to apply for leaves 29/8/17 I want to apply for leaves 30/08/17
I want to apply for leaves 15 Jan 17 I want to apply for leaves 14 January 17
I want to apply for leaves 13 Jan 2017
I want to apply for leaves Jan 10 17 I want to apply for leaves Jan 11 2017 I want to apply for leaves January 12 2017
"""
matches = datefinder.find_dates(dateContent)
for match in matches:
print(match)
Response:
2019-08-05 00:00:00
2019-06-11 00:00:00
2017-06-05 00:00:00
2017-08-23 00:00:00
2017-08-21 00:00:00
2017-08-20 00:00:00
2017-08-30 00:00:00
2017-08-31 00:00:00
2017-08-26 00:00:00
2017-08-27 00:00:00
2017-08-28 00:00:00
2017-08-29 00:00:00
2017-08-30 00:00:00
2017-01-15 00:00:00
2017-01-14 00:00:00
2017-01-13 00:00:00
2017-01-10 00:00:00
2017-01-11 00:00:00
2017-01-12 00:00:00
As you can see, there are 17 such dates, but I am getting 19 matches. Counting from the bottom, the last 16 match correctly; then there are those initial garbage dates.
Once I get these dates correctly, I can move forward with some kind of n-gram model to check which date's context relates to contractual information.
Any help in resolving the issue would be great.
I resolved the issue.
There was an encoding issue in my text content.
dateContent = dateContent.replace(u'\u200b', '')
Replacing the zero-width space \u200b with an empty string fixed the issue.
The datefinder module does the rest of the work of finding all the different date formats.
This is corpus research. You have to check your data for alternations in date-time strings and try to figure out your own customized regular expression for them. If you are working with a natural-language resource, and not some system-generated text with distinct patterns for realising dates, you will never get 100 percent recall and precision. It is always a trade-off.
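As a starting point, a minimal, hedged sketch of a hand-rolled pattern covering the formats shown in the question (the names MONTH and DATE_RE are mine; a real corpus will need more alternations than this):
import re

MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
         r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
DATE_RE = re.compile(
    rf'\b(?:{MONTH}[,.]?\s+\d{{1,2}}(?:st|nd|rd|th)?[,.]?\s+\d{{2,4}}'  # August 30th, 2017
    rf'|\d{{1,2}}\s+{MONTH}\s+\d{{2,4}}'                                # 15 Jan 17
    r'|\d{1,2}/\d{1,2}/\d{2,4})\b'                                      # 8/26/2017 or 28/8/17
)

for match in DATE_RE.finditer(dateContent):
    print(match.group())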
I have a DataFrame df with sporadic daily business-day rows (i.e., there is not always a row for every business day).
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?) One partial option is sketched after these two points.
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
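For what it's worth, later pandas versions (1.3 and up) added an origin argument to resample that can address the first point; a hedged sketch:
dfm = df.resample('30D', origin='end').mean()  # bins anchored to the last timestamp
This pegs the 30-day bins to the end of the data rather than the start, though it does not fix the calendar drift in the second point.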
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate this so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex.
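For what it's worth, a minimal sketch of the per-row window mean described above, using df and the MonthPrior column from the snippet (quadratic in the number of rows, but it keeps the half-open window [MonthPrior, index) explicit):
df['PreviousMonthMean'] = [
    df.loc[(df.index >= prior) & (df.index < ts), 'x'].mean()
    for ts, prior in zip(df.index, df['MonthPrior'])
]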
I use the following code to extract the datetimes from a .csv file:
import pandas
house_data = 'test_1house_EV.csv'
house1 = pandas.read_csv(house_data)
time = pandas.to_datetime(house1["localminute"])
The datetime data to be extracted are the 1440 minutes of September 1, 2017.
However, after using to_datetime the times are shifted forward by five hours, so the minutes between 00:00 and 05:00 end up on September 2.
e.g. the original data looks like this:
28 2017-09-01 00:28:00-05
29 2017-09-01 00:29:00-05
...
1411 2017-09-01 23:31:00-05
1412 2017-09-01 23:32:00-05
but the datetime data looks like this:
28 2017-09-01 05:28:00
29 2017-09-01 05:29:00
...
1410 2017-09-02 04:30:00
1411 2017-09-02 04:31:00
Does anyone know how to fix this?
Use this, as per #James' suggestion:
pd.to_datetime(house1["localminute"], format='%Y-%m-%d %H:%M:%S-%f')
You can slice off the last three characters of the date string before converting.
pd.to_datetime(house1.localminute.str[:-3])
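If the -05 offset should be honoured rather than discarded, a hedged alternative is to parse the strings as timezone-aware UTC and convert back to the fixed offset (note the inverted POSIX sign: 'Etc/GMT+5' means UTC-05):
time = pd.to_datetime(house1["localminute"], utc=True).dt.tz_convert('Etc/GMT+5')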
I have a DataFrame of project costs from an irregularly spaced time series that I would like to try to fit with the statsmodels AR model.
This is a sample of the data in its DataFrame:
cost
date
2015-07-16 35.98
2015-08-11 25.00
2015-08-11 43.94
2015-08-13 26.25
2015-08-18 15.38
2015-08-24 77.72
2015-09-09 40.00
2015-09-09 20.00
2015-09-09 65.00
2015-09-23 70.50
2015-09-29 59.00
2015-11-03 19.25
2015-11-04 19.97
2015-11-10 26.25
2015-11-12 19.97
2015-11-12 23.97
2015-11-12 21.88
2015-11-23 23.50
2015-11-23 33.75
2015-11-23 22.70
2015-11-23 33.75
2015-11-24 27.95
2015-11-24 27.95
2015-11-24 27.95
...
2017-03-31 21.93
2017-04-06 22.45
2017-04-06 26.85
2017-04-12 60.40
2017-04-12 37.00
2017-04-12 20.00
2017-04-12 66.00
2017-04-12 60.00
2017-04-13 41.95
2017-04-13 25.97
2017-04-13 29.48
2017-04-19 41.00
2017-04-19 58.00
2017-04-19 78.00
2017-04-19 12.00
2017-04-24 51.05
2017-04-26 21.88
2017-04-26 50.05
2017-04-28 21.00
2017-04-28 30.00
I am having a hard time understanding how to use start and end in the predict function.
According to the docs:
start : int, str, or datetime
Zero-indexed observation number at which to start forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
end : int, str, or datetime
Zero-indexed observation number at which to end forecasting, i.e., the last forecast is end. Can also be a date string to parse or a datetime type.
I create a dataframe that has an empty daily time series, add my irregularly spaced time series data to it, and then try to apply the model.
from datetime import datetime
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('data.csv', index_col=1, parse_dates=True)
df = pd.DataFrame(index=pd.date_range(start=datetime(2015, 1, 1), end=datetime(2017, 12, 31), freq='d'))
df = df.join(data)
df.cost.interpolate(inplace=True)
ar_model = sm.tsa.AR(df, missing='drop', freq='D')
ar_res = ar_model.fit(maxlag=9, method='mle', disp=-1)
pred = ar_res.predict(start='2016', end='2016')
The predict function results in an error: pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 605-12-31 00:00:00.
If I try to use a more specific date, I get the same type of error:
pred = ar_res.predict(start='2016-01-01', end='2016-06-01')
If I try to use integers, I get a different error:
pred = ar_res.predict(start=0, end=len(data))
Wrong number of items passed 202, placement implies 197
If I actually use a datetime, I get an error that reads no rule for interpreting end.
I am hitting a wall so hard here I am thinking there must be something I am missing.
Ultimately, I would like to use the model to get out-of-sample predictions (such as a prediction for next quarter).
This works if you pass a datetime (rather than a date):
from datetime import datetime
...
pred = ar_res.predict(start=datetime(2015, 1, 1), end=datetime(2017,12,31))
In [21]: pred.head(2) # my dummy numbers from data
Out[21]:
2015-01-01 35
2015-01-02 23
Freq: D, dtype: float64
In [22]: pred.tail(2)
Out[22]:
2017-12-30 44
2017-12-31 44
Freq: D, dtype: float64
So I was creating a daily index to account for the equally spaced time series requirement, but it still remained non-unique (comment by #user333700).
I added a groupby function to sum duplicate dates together, and could then run the predict function using datetime objects (answer by #andy-hayden).
df = df.groupby(pd.TimeGrouper(freq='D')).sum()
...
ar_res.predict(start=min(df.index), end=datetime(2018,12,31))
With the predict function providing a result, I am now able to analyze the results and tweak the params to get something useful.
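For reference, pd.TimeGrouper was later removed from pandas in favour of pd.Grouper, so on current versions the aggregation above becomes:
df = df.groupby(pd.Grouper(freq='D')).sum()
# or, equivalently for a DatetimeIndex: df = df.resample('D').sum()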