Changing timeseries column into a date - python

I have a time series with 2 columns, the first being hours since 1 Jan 1970. In this column, a year has only 360 days: 12 months of 30 days each. I need to convert this column into a usable date so that I can analyse the other column by month, year, etc. (e.g. 1997-Jan-1-1 being year-month-day-hour).
I think I need to use modulo arithmetic to convert each row of the hours column into hour_of_day, day_of_month, year, etc., so that the column becomes a year, month, day and hour instead. But I don't know how to do this. I appreciate it might be confusing. Any help on doing this would be very welcome.
Input: 233280.5 (in hours)
Output: 1997-01-01-01 (year-month-day-hour)

You can calculate the number of years and add it to the reference date, e.g.:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
refdate = pd.Timestamp('1970-01-01')
df = pd.DataFrame({'360d_year_hours': [233280.5]})
# we calculate the number of years and fractional years as helper Series
y_frac, y = np.modf(df['360d_year_hours'] / (24*360))
# now we can calculate the new date's year:
df['datetime'] = pd.Series([refdate + DateOffset(years=int(i)) for i in y])
# we need the days in the given year to be able to use y_frac
daysinyear = np.where(df['datetime'].dt.is_leap_year, 366, 365)
# ...so we can update the datetime and round to the hour:
df['datetime'] = (df['datetime'] + pd.to_timedelta(y_frac*daysinyear, unit='d')).dt.round('h')
# df['datetime']
# 0 1997-01-01 01:00:00
# Name: datetime, dtype: datetime64[ns]
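
If the goal is the calendar fields of the 360-day calendar itself, rather than a real Gregorian timestamp, the modulo approach the question asks about can be sketched with integer division. This is only a minimal sketch: it assumes that hour 0 is 1970-01-01 00:00 and that fractional hours are truncated, and the output column names (year, month, day, hour) are made up for illustration. Because this calendar contains dates such as 30 February, the result is kept as separate integer columns instead of being converted to a pandas datetime.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'360d_year_hours': [233280.5, 233304.0, 234005.0]})

# Assumption: truncate fractional hours; hour 0 is 1970-01-01 00:00.
total_hours = np.floor(df['360d_year_hours']).astype('int64')

days = total_hours // 24              # whole days since the reference date
df['hour'] = total_hours % 24         # hour of day, 0..23

df['year'] = 1970 + days // 360       # every year has exactly 360 days
day_of_year = days % 360              # 0..359
df['month'] = day_of_year // 30 + 1   # every month has exactly 30 days
df['day'] = day_of_year % 30 + 1      # day of month, 1..30

print(df[['year', 'month', 'day', 'hour']])
```

For the input 233280.5 this gives year 1997, month 1, day 1, hour 0, since the half hour is truncated rather than rounded.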

Related

Combining Year and DayOfYear, H:M:S columns into date time object

I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Before I had detached XXX into a new column but this was making it more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting a ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
Works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note: since %j expects a zero-padded day of year, you might need to zero-fill the time string; see the first row in the example above.

Select nearest date first day of month in a python dataframe

I have this kind of dataframe
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following month) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the one nearest to the first day of the month AND before the 15th day of the month (because a later day could be the end-of-month measurement). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate a single consumption value per month.
One solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to the month end with MonthEnd, then drop duplicates based on that column, keeping the first row per month after sorting.
from pandas.tseries.offsets import MonthEnd
# assumes the dates are in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = abs((df['Index'] - df['New']).dt.days)
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Another approach: define the dataframe, convert the index to datetime, define helper columns,
use them with the shift method to conditionally remove rows, and finally remove the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame(
    [[1254], [1265], [1277], [1301], [1345], [1541]],
    columns=["Value"],
    index=[dt.strptime("05-10-19", '%d-%m-%y'),
           dt.strptime("29-10-19", '%d-%m-%y'),
           dt.strptime("30-10-19", '%d-%m-%y'),
           dt.strptime("04-11-19", '%d-%m-%y'),
           dt.strptime("30-11-19", '%d-%m-%y'),
           dt.strptime("03-02-20", '%d-%m-%y')],
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
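
A different reading of the question, sketched here under stated assumptions, is to assign each reading to the month start it is closest to (the same month if the day is before the 15th, otherwise the following month) and keep the closest reading per month start. The helper columns target and dist are made-up names, the sample data is taken from the answer above, and the counter-reset rule from the question is deliberately not handled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime(["2019-10-05", "2019-10-29", "2019-10-30",
                          "2019-11-04", "2019-11-30", "2020-02-03"]),
)

# First day of each reading's own month and of the following month.
month_start = df.index.to_period("M").to_timestamp()
next_month_start = month_start + pd.offsets.MonthBegin(1)

# Readings before the 15th belong to their own month start,
# later readings to the start of the next month.
target = pd.DatetimeIndex(
    np.where(df.index.day < 15, month_start, next_month_start)
)

helper = df.assign(target=target, dist=np.abs((df.index - target).days))

# Keep the reading closest to each month start.
nearest = (helper.sort_values(["target", "dist"])
                 .drop_duplicates("target")[["Value"]])
print(nearest)
```

With this sample data the sketch keeps the readings from 2019-10-05, 2019-10-30, 2019-11-30 and 2020-02-03, which differs from the output above because the 2019-11-04 reading is further from 1 November than the 2019-10-30 one.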

Pandas date_range - subtracting numpy timedelta gives odd result, time becomes not 0:00:00

I am trying to generate a set of dates with pandas date_range functionality. Then I want to iterate over this range and subtract several months from each of the dates (exact number of month is determined in loop) to get a new date.
I get some very odd results when I do this.
MVP:
#get date range
dates = pd.date_range(start = '1/1/2013', end='1/1/2018', freq=str(test_size)+'MS', closed='left', normalize=True)
#take first date as example
date = dates[0]
date
Timestamp('2013-01-01 00:00:00', freq='3MS')
So far so good.
Now let's say I want to go just one month back from this date. I define numpy timedelta (it supports months for definition, while pandas' timedelta doesn't):
#get timedelta of 1 month
deltaGap = np.timedelta64(1,'M')
#subtract one month from date
date - deltaGap
Timestamp('2012-12-01 13:30:54', freq='3MS')
Why is that? Why do I get 13:30:54 in the time component instead of midnight?
Moreover, if I subtract more than 1 month, the shift becomes so large that I lose a whole day:
#let's say I want to subtract both 2 years and then 1 month
deltaTrain = np.timedelta64(2,'Y')
#subtract 2 years and then subtract 1 month
date - deltaTrain - deltaGap
Timestamp('2010-12-02 01:52:30', freq='3MS')
I've had similar issues with timedelta, and the solution I've ended up using was using relativedelta from dateutil, which is specifically built for this kind of application (taking into account all the calendar weirdness like leap years, weekdays, etc...). For example given:
from dateutil.relativedelta import relativedelta
date = dates[0]
>>> date
Timestamp('2013-01-01 00:00:00', freq='10MS')
deltaGap = relativedelta(months=1)
>>> date-deltaGap
Timestamp('2012-12-01 00:00:00', freq='10MS')
deltaGap = relativedelta(years=2, months=1)
>>> date-deltaGap
Timestamp('2010-12-01 00:00:00', freq='10MS')
Check out the documentation for more info on relativedelta
The issues with numpy.timedelta64
I think that the problem with np.timedelta is revealed in these 2 parts of the docs:
There are two Timedelta units (‘Y’, years and ‘M’, months) which are treated specially, because how much time they represent changes depending on when they are used. While a timedelta day unit is equivalent to 24 hours, there is no way to convert a month unit into days, because different months have different numbers of days.
and
The length of the span is the range of a 64-bit integer times the length of the date or unit. For example, the time span for ‘W’ (week) is exactly 7 times longer than the time span for ‘D’ (day), and the time span for ‘D’ (day) is exactly 24 times longer than the time span for ‘h’ (hour).
So the timedeltas are fine for hours, days and weeks, because these are fixed-length timespans. However, months and years are variable in length (think leap years), and to take this into account numpy uses an average. One numpy "year" is 365 days, 5 hours, 49 minutes and 12 seconds (this matches the mean Gregorian year of 365.2425 days), while one numpy "month" is one twelfth of that: 30 days, 10 hours, 29 minutes and 6 seconds.
# Adding one numpy month adds 30 days + 10:29:06:
deltaGap = np.timedelta64(1,'M')
date+deltaGap
# Timestamp('2013-01-31 10:29:06', freq='10MS')
# Adding one numpy year adds 1 year + 05:49:12:
deltaGap = np.timedelta64(1,'Y')
date+deltaGap
# Timestamp('2014-01-01 05:49:12', freq='10MS')
This is not so easy to work with, which is why I would just go to relativedelta, which is much more intuitive (to me).
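
The arithmetic behind those odd times can be checked without relying on numpy's month and year units at all. The sketch below assumes the average year is the mean Gregorian year (146097 days per 400 years) and rebuilds the same shifts with fixed-length timedeltas; the variable names are only illustrative.

```python
from datetime import timedelta

import pandas as pd

SECONDS_PER_DAY = 86_400

# Mean Gregorian year: 146097 days per 400 years = 365.2425 days.
avg_year = timedelta(seconds=146_097 * SECONDS_PER_DAY // 400)
avg_month = avg_year / 12
print(avg_year)    # 365 days, 5:49:12
print(avg_month)   # 30 days, 10:29:06

date = pd.Timestamp('2013-01-01')

# Fixed-length "average month" versus a calendar-aware month.
fixed = date - pd.Timedelta(avg_month)
calendar = date - pd.DateOffset(months=1)
print(fixed)       # 2012-12-01 13:30:54
print(calendar)    # 2012-12-01 00:00:00

# Two average years plus one average month, as in the question.
fixed_long = date - pd.Timedelta(2 * avg_year + avg_month)
print(fixed_long)  # 2010-12-02 01:52:30
```

The fixed-length results reproduce the timestamps reported in the question, while the calendar-aware offset lands on midnight of the first of the month.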
You can try pd.DateOffset, which is mainly used for applying calendar-aware offset logic (month, year, hour) to dates.
# get a range of hourly dates
dates = pd.date_range(start='1/1/2013', freq='H', periods=100, normalize=True)
# take first date as example
date = dates[0]
# subtract a month
dates[0] - pd.DateOffset(months=1)
Timestamp('2012-12-01 00:00:00')
# to apply this on all dates
new_dates = list(map(lambda x: x - pd.DateOffset(months=1), dates))

Group DataFrame by Business Day of Month

I am trying to group a Pandas DataFrame that is indexed by date by the business day of the month, approx. 22 per month.
I would like to return a result that contains 22 rows with the mean of some value in the DataFrame.
I can group by day of month but can't seem to figure out how to group by business day.
Is there a function that will return the business day of the month for a date?
If someone could provide a simple example, that would be most appreciated.
Assuming your dates are in the index (if not, use set_index):
df.groupby(pd.Grouper(freq='B'))  # pd.TimeGrouper('B') in old pandas versions
See time series functionality.
I think what the question is asking is how to group by business day of the month. The other answer just seems to resample the data to the nearest business day (at least for me).
This code returns the mean for each business day of the month (numbered from 0, so about 22 rows):
from datetime import date
import pandas as pd
import numpy as np
d = pd.Series(np.random.randn(1000), index=pd.bdate_range(start='01 Jan 2018', periods=1000))
def to_bday_of_month(dt):
    month_start = date(dt.year, dt.month, 1)
    return np.busday_count(month_start, dt)
day_of_month = [to_bday_of_month(dt) for dt in d.index.date]
d.groupby(day_of_month).mean()
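
The same idea can be sketched without a Python-level loop, since np.busday_count also accepts arrays of dates. This is a minimal sketch with deterministic sample data; it numbers business days from 1 instead of 0, and like the code above it ignores public holidays. The names month_starts and bday_of_month are only illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.bdate_range(start="2018-01-01", periods=60)
s = pd.Series(np.arange(60, dtype=float), index=idx)

# Truncate each date to day precision and to the first day of its month.
days = idx.values.astype("datetime64[D]")
month_starts = idx.values.astype("datetime64[M]").astype("datetime64[D]")

# busday_count excludes the end date, so add 1 to make the first
# business day of a month number 1.
bday_of_month = np.busday_count(month_starts, days) + 1

means = s.groupby(bday_of_month).mean()
print(means)
```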

How to find the number of the day in a year based on the actual dates using Pandas?

My data frame data has a date variable dateOpen with the following format date_format = "%Y-%m-%d %H:%M:%S.%f" and I would like to have a new column called openDay which is the day number based on 365 days a year. I tried applying the following
data['dateOpen'] = [datetime.strptime(dt, date_format) for dt in data['dateOpen']]
data['openDay'] = [dt.day for dt in data['dateOpen']]
however, I get the day in the month. For example if the date was 2013-02-21 10:12:14.3 then the above formula would return 21. However, I want it to return 52 which is 31 days from January plus the 21 days from February.
Is there a simple way to do this in Pandas?
On latest pandas you can use date-time properties:
>>> ts = pd.Series(pd.to_datetime(['2013-02-21 10:12:14.3']))
>>> ts
0 2013-02-21 10:12:14.300000
dtype: datetime64[ns]
>>> ts.dt.dayofyear
0 52
dtype: int64
On older versions, you may be able to convert to a DatetimeIndex and then use .dayofyear property:
>>> pd.Index(ts).dayofyear # may work
array([52], dtype=int32)
Not sure if there's a pandas builtin, but in Python you can get the "Julian" day (day of year), e.g.:
data['openDay'] = [int(format(dt, '%j')) for dt in data['dateOpen']]
Example:
>>> from datetime import datetime
>>> int(format(datetime(2013,2,21), '%j'))
52
# To find the number of days in this year so far
from datetime import datetime
from datetime import date
today = date.today()
print("Today's date:", today)
print(int(format(today, '%j')))
Today's date: 2020-03-26
86
