I have a dataframe (df) whose index is a datetime index in "%Y-%m-%d %H:%M:%S.%f" format, e.g. 2012-06-16 15:53:42.457000.
I am trying to create groups of 1 second using the groupby method, i.e.
x = df.groupby(pd.TimeGrouper('1S'))
time = x.first().index[1]
print(time)
The problem is that with the groupby method I only get the timestamp to whole seconds, i.e. "2012-06-16 15:53:42"; the milliseconds are excluded. Is there a way to get the complete timestamp?
Thank you.
I think this is a problem of formatting.
>>> df.index[0].strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
'2016-01-01 00:00:00.000'
See the strftime() documentation.
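For example (a quick check with an illustrative timestamp; only the string formatting truncates the milliseconds, the Timestamp itself keeps them):

import pandas as pd

ts = pd.Timestamp('2012-06-16 15:53:42.457')
print(ts)                                        # 2012-06-16 15:53:42.457000
print(ts.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])  # 2012-06-16 15:53:42.457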
After spending a few hours, I solved it. Functions such as groupby, rolling, etc. all showed the same problem: they use a generated datetime index for the groups, which is mathematically a multiple of the defined frequency. You can even get a KeyError if that index is used to look up data in the dataframe.
To get the complete datetime index (with milliseconds), access the datetime column of the members of each individual group created by the groupby method, i.e.
time = x2.first()['timeindex'][0]
where 'timeindex' is the datetime column in my dataframe. [0] is the first member of the group; it can be incremented to get the second, third, and so on members of each group.
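A minimal sketch of the idea (assuming, as above, a 'timeindex' column that mirrors the index; pd.Grouper is the modern replacement for pd.TimeGrouper):

import pandas as pd

# Illustrative frame: millisecond timestamps both as the index and as a column.
idx = pd.to_datetime(['2012-06-16 15:53:42.457',
                      '2012-06-16 15:53:42.983',
                      '2012-06-16 15:53:43.112'])
df = pd.DataFrame({'timeindex': idx, 'value': [1, 2, 3]}, index=idx)

x2 = df.groupby(pd.Grouper(freq='1S'))

# The group labels are truncated to whole seconds...
print(x2.first().index[0])              # 2012-06-16 15:53:42
# ...but the 'timeindex' column keeps the original milliseconds.
print(x2.first()['timeindex'].iloc[0])  # 2012-06-16 15:53:42.457000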
I have 1-minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
I am trying to resample it to 5-minute data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it also resamples time ranges that are not in my data. The market data is only available until 03:30 PM, yet when I look at the resampled dataframe I find it has resampled the entire 24 hours.
How do I stop the resampling at 03:30 PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values because of this. Any suggestions are welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first value in each interval and 'last' to the last, and that you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
Resample your data:
Note: your result is the same as when you resample with n.resample('5T').first(), but that means every value in the dataframe equals the first of the five values in its 5-minute interval. A more logical resampling method is to use mean() or sum(), as shown below.
If this is data on stock prices it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled hours that fall outside working stock hours, you have two options.
Option 1: drop NaN values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum() since the result won't contain missing values but zeros.
Option 2: filter based on start and end times
Get minimum and maximum time of day where data is available as datetime.time object:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
The agg() is there to apply an individual aggregation method to each column; I used it so that I can see the 'candlestick' formation, as it is called in stock technical analysis.
I was able to fix the issue by dropping the NaN values.
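For reference, a sketch combining both points (assuming the single-level yfinance column layout Open/High/Low/Close/Adj Close/Volume from the question):

import yfinance as yf

n = yf.download('^nsei', period='5d', interval='1m')

# Explicit candlestick (OHLCV) aggregation per 5-minute bucket.
ohlc = {'Open': 'first', 'High': 'max', 'Low': 'min',
        'Close': 'last', 'Adj Close': 'last', 'Volume': 'sum'}

# Empty buckets outside trading hours get NaN prices (and a 0 volume from
# 'sum'), so the default dropna() removes them.
resampled = n.resample('5T').agg(ohlc).dropna()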
I'm scratching my head over a pandas slicing problem.
I have a dataframe with a datetime column and a datetime index (every 15 minutes). I want to create a new column with the same frequency (15 min) that contains the max value of another column over the previous day (it will be the same value for each row within the same day).
My dataframe is called klines. I know that I can get the date of each row with
klines['date'] = klines['timedate'].dt.date
I know I can create a timedelta with the timedelta function, but I can't figure out how to use the resulting date objects for slicing.
I was hoping that something like
klines['max_prev_day'] = klines[klines['Close time'].dt.date + dt.timedelta(days=-1):klines['Close time'].dt.date]['value_to_look'].max()
would work, but I'm getting a loc error:
raise InvalidIndexError(key)
Any clever input is welcome!
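One way to get what the question describes (a sketch only, with the column names assumed from the question): compute the max per calendar day, shift that series forward one day, and map it back onto the rows.

import pandas as pd

# Daily maximum of the column of interest, keyed by calendar date.
daily_max = klines.groupby(klines['Close time'].dt.date)['value_to_look'].max()

# Shift the dates forward one day so each day looks up the previous day's max.
daily_max.index = pd.to_datetime(daily_max.index) + pd.Timedelta(days=1)

# Every 15-minute row within a day gets the same previous-day value.
klines['max_prev_day'] = pd.to_datetime(klines['Close time'].dt.date).map(daily_max)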
As you can see in the image of my dataframe, I have time points where midnight is a day behind what it should be, which affects my time series graphs.
I tried df.replace() where I passed in lists a and b:
df.replace(to_replace=a,value=b,inplace=True)
This just replaced every value in a with the same single value from b, instead of the corresponding values in the list.
I also tried passing in a dictionary but received:
ValueError: Replacement not allowed with overlapping keys and values
Is there any way I can change the dates in either the date column or the date_time column to day+1 for instances where the time is 00:00:00?
Maybe using pandas map() method with strftime format?
Maybe you can do something along these lines:
df.loc[df['time'] == datetime.time(0, 0), 'date'] += datetime.timedelta(days=1)
It selects the rows where the time is 00:00; only on those rows, the date column is increased by one day.
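A small self-contained sketch of that idea (with made-up sample data, assuming separate 'date' and 'time' columns as in the question):

import datetime
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-01', '2021-03-01', '2021-03-01']),
    'time': [datetime.time(23, 45), datetime.time(0, 0), datetime.time(0, 15)],
})

# Rows stamped 00:00:00 belong to the next day, so advance their date by one.
df.loc[df['time'] == datetime.time(0, 0), 'date'] += datetime.timedelta(days=1)
print(df)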
I am using pandas to structure and process data. This is my DataFrame:
I grouped many datetimes by minute and did an aggregation to get the sum of the 'bitrate' scores per minute.
This was my code to produce this DataFrame:
import datetime
import numpy as np

def aggregate_data(data):
    def delete_seconds(time):
        # Parse the timestamp string and truncate it to the minute.
        return datetime.datetime.strptime(time, '%Y-%m-%d %H:%M:%S').replace(second=0)

    data['new_time'] = data['beginning_time'].apply(delete_seconds)
    df = data[['new_time', 'bitrate']].groupby(['new_time']).aggregate(np.sum)
    return df
Now I want to do a similar thing with 5-minute buckets: group my datetimes by 5 minutes and take the mean.
Something like this (this doesn't work, of course!):
df.groupby([df.index.map(lambda t: t.5minute)]).aggregate(np.mean)
Ideas? Thanks!
Use resample:
df.resample('5Min').sum()
This assumes your index is properly set as a DateTimeIndex.
You can also use TimeGrouper, as resampling is a groupby operation on time buckets:
df.groupby(pd.TimeGrouper('5Min')).sum()
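A quick demonstration on synthetic data (note that pd.TimeGrouper has since been removed; pd.Grouper(freq='5Min') is the modern spelling):

import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-01 00:00', periods=20, freq='1Min')
df = pd.DataFrame({'bitrate': np.random.randint(100, 200, size=20)}, index=idx)

print(df.resample('5Min').mean())                  # mean per 5-minute bucket
print(df.groupby(pd.Grouper(freq='5Min')).mean())  # equivalent groupby form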
Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') and .to_timestamp('M') don't work, since they give the last day of the month rather than the last value of the variable within each month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed, except that since my time index stretched over more than a year I needed to group by both year and month and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()  # .order() in older pandas
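On current pandas versions the same result can be obtained more directly by grouping on a monthly PeriodIndex; a sketch (tempTradeDays is still the initial DatetimeIndex):

import pandas as pd

s = tempTradeDays.to_series()
# Group the dates by year-month period and keep the latest date per period.
last_per_month = s.groupby(s.index.to_period('M')).max()
dateRange = pd.DatetimeIndex(last_per_month.values)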
Suppose your data frame looks like this:
[image: original dataframe]
Then the following Code will give you the last day of each month.
df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
[image: transformed dataframe]
This one-line code does the job :)
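For example, on illustrative data (the 'index' label passed to set_index comes from reset_index() materialising the unnamed DatetimeIndex as a column named 'index'):

import pandas as pd

idx = pd.to_datetime(['2015-05-28', '2015-05-29', '2015-06-01', '2015-06-30'])
df = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]}, index=idx)

df_monthly = df.reset_index().groupby([df.index.year, df.index.month], as_index=False).last().set_index('index')
print(df_monthly)  # keeps the rows for 2015-05-29 and 2015-06-30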
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_dates_of_the_month" contains all occurring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever else you want to do with it).
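For instance, turning the list back into a sorted DatetimeIndex could look like:

import pandas as pd

last_dates_index = pd.DatetimeIndex(sorted(last_dates_of_the_month))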
This is an old question, but all existing answers here aren't perfect. This is the solution I came up with (assuming that date is a sorted index), which can be even written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
A few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM').last()
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found what may be a simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this to create a new 'diff' column, where the value 1 marks the change from one month to the next:
import numpy as np
df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
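Since that flag marks the first row of each new month, the last row of every month can then be picked out by looking one row ahead; a sketch, assuming df is sorted by 'Date':

# A row is a month's last row when the next row opens a new month.
# shift(-1) leaves the final row NaN; filling with 1 keeps it as a month end.
month_ends = df[df['diff'].shift(-1).fillna(1) == 1]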