Using PeriodIndex vs DatetimeIndex in pandas? - python

I am working with some financial data that is organized as a df with a MultiIndex containing the ticker and the date, and a column containing the return. I am wondering whether one should convert the index to a PeriodIndex instead of a DatetimeIndex, since returns are really over a period rather than at an instant in time. Besides the philosophical argument, what practical functionality does PeriodIndex provide in this particular use case vs DatetimeIndex?

There are some attributes available on DatetimeIndex (such as is_month_start, is_quarter_end) which are not available on PeriodIndex. I use PeriodIndex when it is not possible to get the format I need with DatetimeIndex. For example, if I need a monthly frequency in the format yyyy-mm, I use a PeriodIndex.
Example:
Assume that df has an index like this:
df.index
DatetimeIndex(['2020-02-26 13:50:00', '2020-02-27 14:20:00',
               '2020-02-28 11:10:00', '2020-02-29 13:50:00'],
              dtype='datetime64[ns]', name='peak_time', length=1025, freq=None)
The monthly minimum can be obtained via the following code:
dfg = df.groupby([df.index.year, df.index.month]).min()
whose index is a MultiIndex
dfg.index
MultiIndex([(2017, 1),
            ...
            (2020, 1),
            (2020, 2)],
           names=['peak_time', 'peak_time'])
Now I convert it to a PeriodIndex:
dfg["date"] = pd.PeriodIndex(dfg.index.map(lambda x: "{0}{1:02d}".format(*x)), freq="M")

For me, the advantage is that a PeriodIndex is automatically displayed as the corresponding month, quarter, or year when downsampling.
import pandas as pd
# https://github.com/jiahe224/bug_report/blob/main/resample_test.csv
temp = pd.read_csv('resample_test.csv', dtype={'stockcode': str, 'A股代码': str})
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.set_index(['date'])
result = temp['北向占自由流通比'].resample('Q', closed='left').first()
result
result = temp['北向占自由流通比'].resample('Q', closed='left').first().to_period()
result
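Since the CSV may not be to hand, here is a self-contained sketch with made-up data (the values and dates are invented) showing the same display difference:
import pandas as pd

# made-up monthly series standing in for the CSV column
s = pd.Series(range(6), index=pd.date_range('2021-01-31', periods=6, freq='M'))

s.resample('Q', closed='left').first()              # quarter-end timestamps, e.g. 2021-03-31
s.resample('Q', closed='left').first().to_period()  # quarters, e.g. 2021Q1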
As an aside, there is a problem with resample that has not been fixed yet; see the bug report at https://github.com/pandas-dev/pandas/issues/45869

Behavior on partial periods:
date_range returns an empty index, while period_range returns an index of length 1, when the specified start and end do not cover a whole period.
(Also, the timezone information is lost for periods of months.)
date_range:
dates = pd.date_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", inclusive="both", freq="1M")
dates
DatetimeIndex([], dtype='datetime64[ns, UTC]', freq='M')
period_range:
periods = pd.period_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", freq="1M")
periods
PeriodIndex(['2022-12'], dtype='period[M]')
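The timezone remark is easy to check directly; a tiny sketch:
# the UTC offset does not survive conversion to a monthly period
p = pd.Timestamp("2022-12-01 00:00:00+00:00").to_period("M")
p  # Period('2022-12', 'M')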

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
And I understand that the .dt.time accessor is needed to prevent the year and month from being added, but I believe this is causing the dtype to revert to object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do this bit: df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time, you're converting the 'Time' column to dtype object... and that "object" is the Python type datetime.time.
The pandas datetime64 dtype is a different type than Python's datetime.datetime objects, and it does not support time-only objects (i.e. you can't have pandas consider the column a datetime without providing the year). This is why the dtype is changing to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True), something slightly different is happening. In this case you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs; basically, the time values in your df are being converted to pandas Timestamp objects on the 1st of January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, then it's okay to use the datetime.time objects in the column, but to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, just keep the default date of 1900-01-01; then you can add/subtract the datetime columns and get the speed advantage of pandas, and simply strip off the date when you're done with it.
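To illustrate that last suggestion, a minimal sketch (values invented) that keeps the default 1900-01-01 date for fast vectorized arithmetic and strips it off only at the end:
import pandas as pd

s = pd.Series(['12:32:25', '17:55:22', '14:28:20'])

# vectorized parse; pandas fills in the default date 1900-01-01
t = pd.to_datetime(s, format='%H:%M:%S', errors='coerce')
t.dtype                 # datetime64[ns], so arithmetic stays fast

elapsed = t - t.min()   # fast timedelta arithmetic on the whole column

times = t.dt.time       # strip the date at the end (dtype becomes object)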

Convert a pandas column of string dates to compare with datetime.date

I have a column of string values in pandas as follows:
2022-07-01 00:00:00+00:00
I want to compare it to a couple of dates as follows:
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month, calendar.monthrange(start_year, start_month)[1])
df = df[(df[date] >= month_start_date) and (df[date] <= month_end_date)]
How do I convert the string values to datetime.date?
I have tried to use pd.to_datetime(df['date']); it says can't compare datetime to date.
Tried to use pd.to_datetime(df['date']).dt.date; it says dt can only be used for datetime-like values, did you mean 'at'.
Also tried to normalize it, but that brings more errors with timezones (aware vs. naive).
Also tried .astype('datetime64[ns]').
None of it is working.
UPDATE
Turns out none of the above are working because half the data is in this format: 2022-07-01 00:00:00+00:00
And the rest is in this format: 2022-07-01
Here is how I am getting around this issue:
for index, row in df_uscis.iterrows():
    df_uscis.loc[index, 'date'] = datetime.datetime.strptime(row['date'].split(' ')[0], "%Y-%m-%d").date()
Is there a simpler and faster way of doing this? I tried to make a new column with the date values only, but I am not sure how to do that.
From your update, if you only need to turn the values from string to date objects, you can try:
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0])
df['date'] = df['date'].dt.date
Also, try to avoid using iterrows, as it is really slow and usually there's a better way to achieve what you're trying to accomplish, but if you really need to iterate through a DataFrame, try using the df.itertuples() method.
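As a quick check (with invented sample values) that the converted column then compares cleanly against datetime.date objects; note the bitwise & rather than the plain `and` from the question:
import calendar
import datetime
import pandas as pd

# invented sample with both formats from the question
df = pd.DataFrame({'date': ['2022-07-01 00:00:00+00:00', '2022-07-15']})
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0]).dt.date

month_start_date = datetime.date(2022, 7, 1)
month_end_date = datetime.date(2022, 7, calendar.monthrange(2022, 7)[1])

# element-wise comparison needs & rather than `and`
df[(df['date'] >= month_start_date) & (df['date'] <= month_end_date)]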

What is the equivalent function of ts from R language in python? [duplicate]

I have a dataframe with various attributes, including one datetime column. I want to extract one of the attribute columns as a time series indexed by the datetime column. This seemed pretty straightforward, and I can construct time series with random values, as all the pandas docs show, but when I do so from a dataframe, my attribute values all convert to NaN.
Here's an analogous example.
df = pd.DataFrame({'a': [0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02')]})
s = pd.Series(df.a, index=df.date)
In this case, the series will have correct time series index, but all the values will be NaN.
I can do the series in two steps, as below, but I don't understand why this should be required.
s = pd.Series(df.a)
s.index = df.date
What am I missing? I assume it has to do with series references, but don't understand at all why the values would go to NaN.
I am also able to get it to work by copying the index column.
s = pd.Series(df.a, df.date.copy())
The problem is that pd.Series() is trying to use the values specified in index to select values from the Series you pass in, but the date values are not present in df.a's (integer) index, so every lookup yields NaN.
You can set the index to the date column and then select the one data column you want. This will return a Series with the dates as the index:
import pandas as pd
df = pd.DataFrame({'a': [0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02')]})
s = df.set_index('date')['a']
Examining s gives:
In [1]: s
Out[1]:
date
2017-04-01    0
2017-04-02    1
Name: a, dtype: int64
And you can confirm that s is a Series:
In [2]: isinstance(s, pd.Series)
Out[2]: True
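Alternatively, a one-line sketch (using the same df as above) that avoids the alignment entirely by detaching the values from df.a:
# passing a bare array instead of a Series skips the index alignment
s = pd.Series(df.a.to_numpy(), index=df.date)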

Slicing data frame with datetime columns (Python - Pandas)

Through the loc and iloc methods, Pandas allows us to slice dataframes. Still, I am having trouble doing this when the columns are datetime objects.
For instance, suppose the data frame generated by the following code:
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
Let us try to slice the first two columns of the dataframe through df.loc:
df.loc[0,'01-01-2001':'02-02-2002']
We get the following TypeError: '<' not supported between instances of 'datetime.date' and 'str'
How could this be solved?
df.iloc[0, [0, 1]]
Use iloc or loc, but in the second parameter give the columns by their integer positions; you were passing strings, so just give the positions instead.
To piggyback off of #Ch3steR's comment above, the following line should work:
dates = pd.to_datetime(dates)
At that point the conversion lets you index the columns that fall in a given range based on the date, as shown below. Just make sure the end date is a little beyond the last date you're trying to capture.
# Return all rows in columns between date range 1/1/2001 and 2/3/2002
df.loc[:, '1/1/2001':'2/3/2002']
   2001-01-01  2002-02-02
0           1           2
You can also call the dates from the list you created earlier, which doesn't give an error:
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
df.loc[0,dates[0]:dates[1]]
The two different formats are shown below. It's just important that you stick to one format. Calling from the list works because it guarantees that the format is the same. But as you said, you need to be able to use arbitrary dates, so the second one is better for you.
>>> dates = pd.to_datetime(dates).date
>>> print("With .date")
With .date
>>> print(dates)
[datetime.date(2001, 1, 1) datetime.date(2002, 2, 2)
 datetime.date(2003, 3, 3)]
>>> dates = pd.to_datetime(dates)
>>> print("Without .date")
Without .date
>>> print(dates)
DatetimeIndex(['2001-01-01', '2002-02-02', '2003-03-03'], dtype='datetime64[ns]', freq=None)
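A sketch of the other direction: keep the columns as a DatetimeIndex (drop the .date call) and string slicing works directly:
import pandas as pd

d = {'col1': [1], 'col2': [2], 'col3': [3]}
df = pd.DataFrame(data=d)
df.columns = pd.to_datetime(['01-01-2001', '02-02-2002', '03-03-2003'])  # no .date

# partial string indexing now works on the DatetimeIndex columns
df.loc[0, '01-01-2001':'02-02-2002']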

Python PANDAS: New Column, Apply Unique Value To All Rows

Just looking for the best approach, as someone who spends more time in data-analysis land than programming proper (hat tip to you all). This is a pretty straightforward, large ETL project, but I'm hand-coding it in Python, which is a first. The fixed-width file is being read successfully into an initial PANDAS df.
I am trying to add a new column with a static, end-of-month date value (2014-01-31, for example) indicating the "Data Month" for further EDW processing. Ultimately, I am going to use datetime/timedelta functionality to generate this value automatically when I cron the job on the utility server.
My confusion is about which function to utilize (apply, applymap, etc.), whether I need to reference an index value in the original df to apply a completely unrelated value to it, and the most optimized, pythonic way to accomplish this.
Currently referencing: "Python for Data Analysis", PANDAS Docs. Thanks!
EDIT
Here is a small example of some fixed-width data:
5151022314
5113 22204
111 20018
Here is some code for reading it into a PANDAS df:
import pandas as pd
import numpy as np
path = r'C:\Users\Office\Desktop\example data.txt'
widths = [2, 3, 5]
names = ['STATE_CD', 'CNTY_CD', 'ZIP_CD']
df = pd.read_fwf(path, names=names, widths=widths, header=0)
This should return something like the following df for the example data above:
STATE_CD,CNTY_CD,ZIP_CD
51,510,22314
51,1 ,22204
11,3 ,20018
What I am trying to do is add a column "DATA_MM" like this for all rows:
STATE_CD,CNTY_CD,ZIP_CD, DATA_MM
51,510,22314,2014-01-31
51,1 ,22204,2014-01-31
11,3 ,20018,2014-01-31
Ultimately, I am hoping to utilize something like this to generate the value that is applied automatically when this monthly job initiates:
import datetime
today = datetime.date.today()
first = datetime.date(day=1, month=today.month, year=today.year)
lastMonth = first - datetime.timedelta(days=1)
print(lastMonth.strftime("%Y-%m-%d"))
If you want to fill a column with a new value that doesn't depend on your original DataFrame, you don't need to make reference to the original indices. You can fill the new column by simply assigning the new value to it:
df["DATA_MM"] = date
You can get the last day of the month by using datetime and calendar:
import datetime
import calendar
today = datetime.date.today()
y = today.year
m = today.month
eom = datetime.date(y, m, calendar.monthrange(y, m)[1])
df["DATA_MM"] = eom
monthrange returns a tuple of (weekday of the first day of the month, number of days in the month), so [1] gives the last day of the month. You can also use #Alexander's method for finding the date of the last day, and assign it directly to the column instead of applying it.
Let's say your DataFrame is named df and it has a date column of Timestamps for which you would like to get end-of-month (EOM) values:
df['EOM date'] = df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
You are coercing the objects to Pandas Period objects and then back to end of month timestamps, so it may not be the most efficient method.
Here is an alternative implementation with some performance stats:
dates = pd.date_range('2000-1-1', '2015-1-1')
df = pd.DataFrame(dates, columns=['date'])
%%timeit
df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
10 loops, best of 3: 161 ms per loop
%%timeit
df.date.apply(lambda x: x + pd.datetools.MonthEnd())
1 loops, best of 3: 177 ms per loop
Just getting a datetime.date (per the request below) for the end-of-month date from the current date can be achieved as follows:
pd.Timestamp(dt.datetime.now()).to_period('M').to_timestamp('M').date()
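Note that pd.datetools was removed in later pandas versions; an equivalent sketch with the current offsets API (assuming the same snap-to-month-end intent):
import pandas as pd

dates = pd.to_datetime(['2014-01-15', '2014-02-03'])
dates + pd.offsets.MonthEnd(0)  # snaps each date to its own month end
# DatetimeIndex(['2014-01-31', '2014-02-28'], dtype='datetime64[ns]', freq=None)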
