Through the loc and iloc methods, Pandas allows us to slice dataframes. Still, I am having trouble doing this when the column labels are datetime objects.
For instance, consider the data frame generated by the following code:
import pandas as pd
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
Let us try to slice the first two columns of the dataframe through df.loc:
df.loc[0,'01-01-2001':'02-02-2002']
We get the following TypeError: '<' not supported between instances of 'datetime.date' and 'str'
How could this be solved?
df.iloc[0,[0,1]]
Use iloc or loc, but identify the columns in the second argument by their index. You are passing strings; just give the integer positions instead.
To piggyback off of @Ch3steR's comment from above, this line should work:
dates = pd.to_datetime(dates)
At that point the date conversion should allow you to index the columns that fall in that range based on the date as listed below. Just make sure the end date is a little beyond the end date that you're trying to capture.
# Return all rows in columns between date range 1/1/2001 and 2/3/2002
df.loc[:, '1/1/2001':'2/3/2002']
2001-01-01 2002-02-02
0 1 2
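As a minimal sketch of the approach above, assuming the original date strings are in month-day-year order (the explicit format argument and the name subset are my additions, not part of the answer): once the columns are a DatetimeIndex rather than datetime.date objects, string labels can be used in a loc slice.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

# Keep the columns as a DatetimeIndex (no .date), so string labels can be parsed.
# The explicit format is an assumption about how the strings should be read.
df.columns = pd.to_datetime(
    ["01-01-2001", "02-02-2002", "03-03-2003"], format="%m-%d-%Y"
)

# Label slicing with loc includes both endpoints.
subset = df.loc[0, "2001-01-01":"2002-02-02"]
print(subset)
```

In this sketch the end label matches a column exactly, so that column is included in the result.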
You can call the dates from the list you created earlier and it doesn't give an error.
d = {'col1': [1], 'col2': [2],'col3': [3]}
df = pd.DataFrame(data=d)
dates = ['01-01-2001','02-02-2002','03-03-2003']
dates = pd.to_datetime(dates).date
df.columns= dates
df.loc[0,dates[0]:dates[1]]
The two different formats are here. It's just important that you stick to the one format. Calling from the list works because it guarantees that the format is the same. But as you said, you need to be able to use any dates so the second one is better for you.
>>>dates = pd.to_datetime(dates).date
>>>print("With .date")
With .date
>>>print(dates)
[datetime.date(2001, 1, 1) datetime.date(2002, 2, 2)
datetime.date(2003, 3, 3)]
>>>dates = pd.to_datetime(dates)
>>>print("Without .date")
Without .date
>>>print(dates)
DatetimeIndex(['2001-01-01', '2002-02-02', '2003-03-03'], dtype='datetime64[ns]', freq=None)
Related
I have the following dataset. This is a small sample; the actual dataset is much larger.
What is the fastest way to:
iterate through days = (1,2,3,4,5,6)
calculate [...rolling(day, min_periods=day).mean()]
add it as column name df[f'sma_{day}']
The method I have casts it to a dict of {ticker: price_df} and loops through it, as shown below.
I have thought of methods like groupby and stack/unstack, but got stuck and need help with appending the columns because the index is a MultiIndex.
I am favouring the method with the fastest %%timeit.
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-09-13").loc[:,['Close']].stack().swaplevel().sort_index()
df.index.set_names(['Ticker','Date'], inplace=True)
df
Here is a sample dictionary method I have..
df = df.reset_index()
df = dict(tuple(df.groupby(['Ticker'])))
## Iterate through days and keys
days = (1, 2, 3, 4, 5, 6)
for key in df.keys():
    for day in days:
        df[key][f'sma_{day}'] = df[key].Close.sort_index(ascending=True).rolling(day, min_periods=day).mean()
## Flatten dictionary
pd.concat(df.values()).set_index(['Ticker','Date']).sort_index()
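One possible alternative to the dictionary loop is a groupby on the Ticker index level combined with transform, which keeps the MultiIndex intact. This is only a sketch: the prices frame below is synthetic stand-in data (not yfinance output), and I have not timed it against the method above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded data: one Close column,
# indexed by (Ticker, Date). Values are made up for illustration.
dates = pd.date_range("2022-09-13", periods=8, freq="D")
tickers = ["AAPL", "AMZN", "MSFT"]
index = pd.MultiIndex.from_product([tickers, dates], names=["Ticker", "Date"])
prices = pd.DataFrame({"Close": np.arange(len(index), dtype=float)}, index=index)

days = (1, 2, 3, 4, 5, 6)
grouped = prices.groupby(level="Ticker")["Close"]
for day in days:
    # transform returns a result aligned to the original index,
    # so the rolling mean never crosses from one ticker into the next.
    prices[f"sma_{day}"] = grouped.transform(
        lambda s: s.rolling(day, min_periods=day).mean()
    )

print(prices.head(10))
```

Because transform preserves the original index, each new column can be assigned directly without flattening and re-concatenating.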
I have a dataframe with various attributes, including one datetime column. I want to extract one of the attribute columns as a time series indexed by the datetime column. This seemed pretty straightforward, and I can construct time series with random values, as all the pandas docs show. But when I do so from a dataframe, my attribute values all convert to NaN.
Here's an analogous example.
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02')]})
s = pd.Series(df.a, index=df.date)
In this case, the series will have correct time series index, but all the values will be NaN.
I can do the series in two steps, as below, but I don't understand why this should be required.
s = pd.Series(df.a)
s.index = df.date
What am I missing? I assume it has to do with series references, but don't understand at all why the values would go to NaN.
I am also able to get it to work by copying the index column.
s = pd.Series(df.a, df.date.copy())
The problem is that pd.Series() uses the values passed as index to select values from df.a by label, but the date values are not present in df.a's own index (0 and 1), so every lookup comes back as NaN.
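To illustrate the alignment behaviour, here is a small sketch (the names aligned and positional are mine). Passing a Series as data makes pandas reindex it to the new labels, while passing a plain array assigns values by position.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 1],
    "date": pd.to_datetime(["2017-04-01", "2017-04-02"]),
})

# Series as data: df["a"] (labels 0 and 1) is reindexed to the date labels,
# none of which exist in it, so every value becomes NaN.
aligned = pd.Series(df["a"], index=pd.Index(df["date"]))

# Plain NumPy array as data: no labels to align, values are placed by position.
positional = pd.Series(df["a"].to_numpy(), index=pd.Index(df["date"]), name="a")

print(aligned)
print(positional)
```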
You can set the index to the date column and then select the one data column you want. This will return a series with the dates as the index:
import pandas as pd
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02')]})
s = df.set_index('date')['a']
Examining s gives:
In [1]: s
Out[1]:
date
2017-04-01 0
2017-04-02 1
Name: a, dtype: int64
And you can confirm that s is a Series:
In [2]: isinstance(s, pd.Series)
Out[2]: True
I'm trying to make a list of the index of my dataframe so I can use it as the X values in a plot.
I'm also trying to make a list of the rainfall so I can use it as the Y values in a plot. The dataframe is df and the index column is date.
df=pd.read_csv(data_source, sep=',', comment='#', header=None, names=['station', 'date', 'T_gem', 'T_min', 'T_max', 'rainfall'], parse_dates=[1])
df = df.set_index(['date'])
january = df.loc['2021-01-01':'2021-01-31']
I've tried using january = df.loc['2021-01-01':'2021-01-31', 'date'] but that raises a KeyError because I think it cannot find the column date as it is an index.
This should work:
january_df = df['2021-01-01':'2021-01-31']
After this, you can use the proposed solution.
january_df['rainfall'].plot()
You don't have to reset the index and create a list.
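If plain lists are still wanted for the X and Y values, a sketch along these lines may help. The frame below is synthetic (only a date index and a rainfall column, with made-up values), and x_values and y_values are hypothetical names.

```python
import pandas as pd

# Synthetic stand-in for the CSV data: daily rows with a rainfall column.
df = pd.DataFrame({
    "date": pd.date_range("2020-12-30", periods=40, freq="D"),
    "rainfall": range(40),
}).set_index("date")

january = df.loc["2021-01-01":"2021-01-31"]

# The dates live in the index, not in a column, so read them from .index.
x_values = january.index.tolist()
y_values = january["rainfall"].tolist()

print(x_values[:3], y_values[:3])
```

This also shows why selecting a 'date' column raises a KeyError: after set_index, date is no longer a column.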
My df column names are dates in this format: dd-mm-yy. When I use sort_index(axis=1), it sorts by the first two digits (which specify the day), so the result doesn't make sense chronologically. How can I sort the columns automatically so that the months are also taken into account?
My df headers:
submitted_at 06-05-18 13-05-18 29-04-18
I expected the output of:
submitted_at 29-04-18 06-05-18 13-05-18
Convert the columns to datetime and use argsort to find the correct ordering. This will put all non-dates to the left in the order they occur, followed by the sorted dates.
import pandas as pd
df = pd.DataFrame(columns=['submitted_at', '06-05-18', '13-05-18', '29-04-18'])
idx = pd.to_datetime(df.columns, errors='coerce', format='%d-%m-%y').argsort()
df.iloc[:, idx]
Empty DataFrame
Columns: [submitted_at, 29-04-18, 06-05-18, 13-05-18]
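A variation on the same idea, sketched under the assumption that every date-like column name follows dd-mm-yy: split the columns into non-date and date groups explicitly, so the final order does not depend on where missing values end up after sorting. The names parsed, is_date, non_date_cols and date_cols are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(columns=["submitted_at", "06-05-18", "13-05-18", "29-04-18"])

# Names that do not match the format become NaT.
parsed = pd.to_datetime(df.columns, format="%d-%m-%y", errors="coerce")
is_date = ~parsed.isna()

# Keep non-date columns first, in their original order.
non_date_cols = list(df.columns[~is_date])
# Sort the date columns chronologically by their parsed value.
date_cols = sorted(
    df.columns[is_date],
    key=lambda c: pd.to_datetime(c, format="%d-%m-%y"),
)

df = df[non_date_cols + date_cols]
print(list(df.columns))
```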
Convert the strings to datetime for comparison and sort the column names with something like this:
from datetime import datetime
# Assumes every column name is a dd-mm-yy date string
sorted_cols = sorted(df.columns, key=lambda x: datetime.strptime(x, '%d-%m-%y'))
df = df[sorted_cols]
Just convert your column to datetime:
df['newdate']=pd.to_datetime(df.date,format='%d-%m-%y')
and then sort it using sort_values:
df.sort_values(by='newdate')
I am working with some financial data that is organized as a df with a MultiIndex that contains the ticker and the date, and a column that contains the return. I am wondering whether one should convert the index to a PeriodIndex instead of a DatetimeIndex, since returns are really over a period rather than an instant in time. Besides the philosophical argument, what practical functionality does PeriodIndex provide that may be useful in this particular use case vs DatetimeIndex?
There are some attributes available in DatetimeIndex (such as is_month_start, is_quarter_end) which are not available in PeriodIndex. I use PeriodIndex when it is not possible to get the format I need with DatetimeIndex. For example, if I need a monthly frequency in the format yyyy-mm, I use the PeriodIndex.
Example:
Assume that df has an index as
df.index
'2020-02-26 13:50:00', '2020-02-27 14:20:00',
'2020-02-28 11:10:00', '2020-02-29 13:50:00'],
dtype='datetime64[ns]', name='peak_time', length=1025, freq=None)
The minimum monthly data can be obtained via the following code
dfg = df.groupby([df.index.year, df.index.month]).min()
whose index is a MultiIndex
dfg.index
MultiIndex([(2017, 1),
...
(2020, 1),
(2020, 2)],
names=['peak_time', 'peak_time'])
Now I convert it to a PeriodIndex:
dfg["date"] = pd.PeriodIndex(dfg.index.map(lambda x: "{0}{1:02d}".format(*x)),freq="M")
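As a possible shortcut, the grouping key itself can be a monthly PeriodIndex, so the result is indexed by period directly. This is a sketch on a tiny synthetic frame (the value column and its numbers are made up), not the data from the answer.

```python
import pandas as pd

# Small synthetic frame with a DatetimeIndex named like the one above.
idx = pd.to_datetime([
    "2020-01-15 10:00", "2020-01-20 12:30",
    "2020-02-26 13:50", "2020-02-27 14:20",
])
df = pd.DataFrame({"value": [5, 3, 8, 7]}, index=idx)
df.index.name = "peak_time"

# Group by the month each timestamp falls in; the result has a PeriodIndex.
dfg = df.groupby(df.index.to_period("M")).min()
print(dfg)
```

With this form there is no intermediate (year, month) MultiIndex to convert afterwards.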
For me, the advantage is that a PeriodIndex is automatically displayed as the corresponding month, quarter or year when downsampling.
import pandas as pd
# https://github.com/jiahe224/bug_report/blob/main/resample_test.csv
temp = pd.read_csv('resample_test.csv',dtype={'stockcode':str, 'A股代码':str})
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.set_index(['date'])
result = temp['北向占自由流通比'].resample('Q',closed='left').first()
result
result = temp['北向占自由流通比'].resample('Q',closed='left').first().to_period()
result
Off topic: there is a problem with resample that has not been fixed yet. The bug report is at https://github.com/pandas-dev/pandas/issues/45869
Behavior on partial periods.
When the specified start and end do not cover a whole period, date_range returns an empty index, while period_range returns an index of length 1.
(Also, the timezone information is lost for monthly periods.)
date_range:
dates = pd.date_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", inclusive="both", freq="1M")
dates
DatetimeIndex([], dtype='datetime64[ns, UTC]', freq='M')
period_range:
periods = pd.period_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", freq="1M")
periods
PeriodIndex(['2022-12'], dtype='period[M]')