I have a column in a pandas DataFrame in timestamp format and want to extract the unique dates (without the time) into a list. I tried the following, but none of these work:
1. dates = datetime.datetime(df['EventTime'].tolist()).date()
2. dates = pd.to_datetime(df['EventTime']).date().tolist()
3. dates = pd.to_datetime(df['EventTime']).tolist().date()
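Can anyone help?
You can use the .dt accessor to reach the datetime properties of a Series. Try this:
df = pd.DataFrame({"EventTime": ["2014-01-01", "2014-01-01", "2014-01-02 10:12:00", "2014-01-02 09:12:00"]})
# Note: pandas 2.x infers a single format from the first value, so mixed date and
# datetime strings like these may need pd.to_datetime(..., format='ISO8601').
pd.to_datetime(df['EventTime']).dt.date.unique().tolist()
# [datetime.date(2014, 1, 1), datetime.date(2014, 1, 2)]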
My DataFrame's index has a datetime type. All the dates are future dates, already sorted starting from the date nearest to today. I need to filter the DataFrame down to the nearest date.
I tried the method below, but for today's date I get KeyError: datetime.date(2023, 1, 4):
fno_symbolList['expiry'] = pd.to_datetime(fno_symbolList['expiry'])
fno_symbolList = fno_symbolList.set_index(fno_symbolList['expiry'])
today = datetime.date.today()
fno_symbolList = fno_symbolList[fno_symbolList.index.get_loc(today, method='nearest')]
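One possible way around the KeyError, as a minimal sketch: the index holds pandas Timestamps while `today` is a plain `datetime.date`, which is likely why the lookup fails, and the `method` argument of `Index.get_loc` was removed in pandas 2.0. The sample data, the `symbol` column, and the fixed `today` value below are assumptions made for illustration; only the `expiry` column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for fno_symbolList.
df = pd.DataFrame({
    "expiry": ["2023-01-26", "2023-01-05", "2023-01-05", "2023-02-23"],
    "symbol": ["A", "B", "C", "D"],
})
df["expiry"] = pd.to_datetime(df["expiry"])
df = df.set_index("expiry")

# Use a Timestamp rather than datetime.date so it compares cleanly with the index.
# In real use this could be pd.Timestamp.today().normalize().
today = pd.Timestamp("2023-01-04")

# Absolute distance of every index entry from today, then the closest one.
distance = abs(df.index - today)
nearest = df.index[distance.argmin()]

# Keep every row that shares the nearest expiry date.
nearest_rows = df[df.index == nearest]
print(nearest_rows)
```

Comparing against the index with a boolean mask keeps all rows for that date, which matters when several rows share the same expiry.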
I am new to Python, so I'm sorry if this sounds silly. I have a date column in a DataFrame. I need to check whether each value in the date column is the end of the month. If it is, add one day and put the result in a new date column; if not, replace the day with the first of that month.
For example, if the date is 2000/3/31, then the output date will be 2000/4/01,
and if the date is 2000/3/30, then the output date will be 2000/3/1.
I could iterate over the column row by row, but I was wondering whether there is a more Pythonic way to do it.
Let's say my date column is called "Date", the new column I want to create is "Date_new", and my DataFrame is df. I am trying to code it like this, but it gives me an error:
if(df['Date'].dt.is_month_end == 'True'):
    df['Date_new'] = df['Date'] + timedelta(days = 1)
else:
    df['Date_new'] =df['Date'].replace(day=1)
I turned your if statement into a function and modified it a bit so that it works on one row at a time. I used the DataFrame .apply method with axis=1 so that the function is called once per row.
import pandas as pd
import datetime

df = pd.DataFrame({'Date': [datetime.datetime(2022, 1, 31), datetime.datetime(2022, 1, 20)]})
print(df)

def my_func(row):
    # row['Date'] is a single Timestamp, so is_month_end is a plain bool here
    if row['Date'].is_month_end:
        return row['Date'] + datetime.timedelta(days=1)
    else:
        return row['Date'].replace(day=1)

df['Date_new'] = df.apply(my_func, axis=1)
print(df)
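If the row-wise apply becomes slow on a large frame, the same rule can probably be expressed without a Python-level loop. This is only a sketch of one vectorized alternative, not the answer's method; the intermediate names `is_month_end`, `next_day` and `first_of_month` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2022-01-31", "2022-01-20"])})

# Boolean mask: True where the date is the last day of its month.
is_month_end = df["Date"].dt.is_month_end

# Candidate results for both branches, computed for the whole column at once.
next_day = df["Date"] + pd.Timedelta(days=1)
first_of_month = df["Date"] - pd.to_timedelta(df["Date"].dt.day - 1, unit="D")

# Keep next_day where the mask is True, otherwise fall back to first_of_month.
df["Date_new"] = next_day.where(is_month_end, first_of_month)
print(df)
```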
Pandas lets us slice DataFrames through the loc and iloc methods. Still, I am having trouble doing this when the column labels are datetime objects.
For instance, suppose the data frame generated by the following code:
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
Let us try to slice the first two columns of the DataFrame through df.loc:
df.loc[0,'01-01-2001':'02-02-2002']
We get the following TypeError: '<' not supported between instances of 'datetime.date' and 'str'
How could this be solved?
df.iloc[0,[0,1]]
You are passing strings, but the column labels are datetime.date objects, so the labels do not match. With iloc you can pass the integer positions of the columns as the second argument instead.
To piggyback off @Ch3steR's comment above, this line should work:
dates = pd.to_datetime(dates)
Without .date, the columns become a DatetimeIndex, which lets you select the columns that fall within a date range using date strings, as shown below. If in doubt, set the end date slightly past the last date you want to capture.
# Return all rows in columns between date range 1/1/2001 and 2/3/2002
df.loc[:, '1/1/2001':'2/3/2002']
   2001-01-01  2002-02-02
0           1           2
You can call the dates from the list you created earlier and it doesn't give an error.
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
df.loc[0,dates[0]:dates[1]]
The two formats are shown below. What matters is that you stick to one of them. Indexing with values from the list works because it guarantees that the format is the same. But as you said, you need to be able to use any dates, so the second one is better for you.
>>> dates = pd.to_datetime(dates).date
>>> print("With .date")
With .date
>>> print(dates)
[datetime.date(2001, 1, 1) datetime.date(2002, 2, 2)
 datetime.date(2003, 3, 3)]
>>> dates = pd.to_datetime(dates)
>>> print("Without .date")
Without .date
>>> print(dates)
DatetimeIndex(['2001-01-01', '2002-02-02', '2003-03-03'], dtype='datetime64[ns]', freq=None)
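To see both options side by side, here is a small self-contained sketch. The names `df_dates` and `df_ts` are hypothetical, introduced only to keep the two variants apart; the data is the same as in the question.

```python
import datetime
import pandas as pd

raw = ['01-01-2001', '02-02-2002', '03-03-2003']
df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3]})

# Variant 1: column labels are datetime.date objects,
# so the slice bounds must be datetime.date objects too.
df_dates = df.copy()
df_dates.columns = pd.to_datetime(raw).date
first_two = df_dates.loc[0, datetime.date(2001, 1, 1):datetime.date(2002, 2, 2)]

# Variant 2: column labels form a DatetimeIndex,
# so date strings are accepted as slice bounds.
df_ts = df.copy()
df_ts.columns = pd.to_datetime(raw)
first_two_ts = df_ts.loc[0, '2001-01-01':'2002-02-02']

print(first_two.tolist(), first_two_ts.tolist())
```

In both variants the slice bounds have to be comparable with the column labels, which is what the original TypeError was complaining about.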
I have a pandas DataFrame. How do I create a new column that acts as a row count? I cannot use the index for this because I already made the index a datetime.
For example, the following code is reproducible on your local PC:
import datetime
import numpy as np
import pandas as pd

dates = [
    datetime.date(2019, 1, 13),
    datetime.date(2020, 5, 11),
    datetime.date(2018, 7, 24),
    datetime.date(2019, 3, 23),
    datetime.date(2020, 2, 16)
]
data = {
    "a": [13.3,12.3,np.nan,10.3,np.nan],
    "b": [1,0,0,1,1],
    "c": ["no","yes","no","","yes"]
}
df = pd.DataFrame(index=dates,data=data)
Right now, I would like to add a new column as a count. Something like 1,2,3,4,5 until the end of the data
df['count'] = range(1, len(df) + 1)
len(df) returns the number of rows in the DataFrame, so you can call the builtin range function to create a range from 1 to the number of rows in the DataFrame, and then assign it to a new column. When assigning a range to a column, it is automatically converted to a pandas Series.
You can build a Series using df.index and apply some processing to it before assigning it to a column of the dataframe.
Here, we could use:
df['count'] = pd.Series(1, index=df.index).cumsum()
Here it would be far less efficient (by more than an order of magnitude) than df['count'] = np.arange(1, 1 + len(df)), which directly builds a NumPy array with the expected values, but it can be useful in more complex use cases.
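For reference, a short sketch that puts the approaches from these answers next to each other on the question's dates. The variable names `by_range`, `by_arange` and `by_cumsum` are made up for the comparison, and only column "b" of the sample data is kept.

```python
import datetime
import numpy as np
import pandas as pd

dates = [
    datetime.date(2019, 1, 13),
    datetime.date(2020, 5, 11),
    datetime.date(2018, 7, 24),
    datetime.date(2019, 3, 23),
    datetime.date(2020, 2, 16),
]
df = pd.DataFrame(index=dates, data={"b": [1, 0, 0, 1, 1]})

# Built-in range: simple and readable.
by_range = list(range(1, len(df) + 1))

# NumPy arange: builds the array of counts directly.
by_arange = np.arange(1, len(df) + 1)

# Series of ones aligned on the index, then a running total.
by_cumsum = pd.Series(1, index=df.index).cumsum()

df["count"] = by_arange
print(df)
```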
I am working with some financial data organized as a DataFrame with a MultiIndex that contains the ticker and the date, and a column that contains the return. I am wondering whether one should convert the date level to a PeriodIndex instead of a DatetimeIndex, since returns are really measured over a period rather than at an instant in time. Besides the philosophical argument, what practical functionality does PeriodIndex provide that may be useful in this particular use case compared with DatetimeIndex?
Some attributes available on DatetimeIndex (such as is_month_start and is_quarter_end) are not available on PeriodIndex. I use PeriodIndex when it is not possible to get the format I need with DatetimeIndex. For example, if I need a monthly frequency in the format yyyy-mm, I use PeriodIndex.
Example:
Assume that df has an index as
df.index
'2020-02-26 13:50:00', '2020-02-27 14:20:00',
'2020-02-28 11:10:00', '2020-02-29 13:50:00'],
dtype='datetime64[ns]', name='peak_time', length=1025, freq=None)
The monthly minimum can be obtained with the following code
dfg = df.groupby([df.index.year, df.index.month]).min()
whose index is a MultiIndex
dfg.index
MultiIndex([(2017, 1),
...
(2020, 1),
(2020, 2)],
names=['peak_time', 'peak_time'])
Now I convert it to a PeriodIndex:
dfg["date"] = pd.PeriodIndex(dfg.index.map(lambda x: "{0}{1:02d}".format(*x)), freq="M")
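As a possible shortcut, grouping directly on `df.index.to_period("M")` should give a monthly PeriodIndex without going through the (year, month) MultiIndex. The sketch below uses a small made-up frame with a hypothetical `value` column, since the original data is not shown.

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex similar to the one above.
idx = pd.to_datetime([
    "2020-01-03 10:00:00", "2020-01-20 12:30:00",
    "2020-02-26 13:50:00", "2020-02-29 13:50:00",
])
df = pd.DataFrame({"value": [5.0, 3.0, 7.0, 4.0]}, index=idx)

# Convert each timestamp to its month and group on that;
# the result is indexed by a PeriodIndex with monthly frequency.
dfg = df.groupby(df.index.to_period("M")).min()
print(dfg)
print(dfg.index)
```

The periods display as yyyy-mm, which is the format mentioned in the answer.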
In my experience, a PeriodIndex is displayed automatically as the corresponding month, quarter, or year after downsampling.
import pandas as pd
# https://github.com/jiahe224/bug_report/blob/main/resample_test.csv
temp = pd.read_csv('resample_test.csv',dtype={'stockcode':str, 'A股代码':str})
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.set_index(['date'])
result = temp['北向占自由流通比'].resample('Q',closed='left').first()
result
result = temp['北向占自由流通比'].resample('Q',closed='left').first().to_period()
result
Off topic: there is a problem with resample that has not been fixed yet; see the bug report at https://github.com/pandas-dev/pandas/issues/45869
Behavior on partial periods:
When the specified start and end do not cover a whole period, date_range returns an empty index, while period_range returns an index of length 1.
(Also, the timezone information is lost for monthly periods.)
date_range:
dates = pd.date_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", inclusive="both", freq="1M")
dates
DatetimeIndex([], dtype='datetime64[ns, UTC]', freq='M')
period_range:
periods = pd.period_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", freq="1M")
periods
PeriodIndex(['2022-12'], dtype='period[M]')