I am using pandas to structure and process data. This is my DataFrame:
I grouped many datetimes by minute and did an aggregation to get the sum of the 'bitrate' scores per minute.
This was my code to produce this DataFrame:
import datetime
import numpy as np

def aggregate_data(data):
    def delete_seconds(time):
        # Truncate the timestamp string to the minute
        return datetime.datetime.strptime(time, '%Y-%m-%d %H:%M:%S').replace(second=0)

    data['new_time'] = data['beginning_time'].apply(delete_seconds)
    df = data[['new_time', 'bitrate']].groupby(['new_time']).aggregate(np.sum)
    return df
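For concreteness, a minimal sketch of the input this function expects (column names taken from the code above; the sample values are made up):

import pandas as pd

data = pd.DataFrame({
    'beginning_time': ['2016-01-01 10:00:05', '2016-01-01 10:00:40',
                       '2016-01-01 10:01:30'],
    'bitrate': [10, 20, 30],
})
print(aggregate_data(data))  # one summed bitrate row per minute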
Now I want to do a similar thing with 5-minute buckets: group my datetimes by 5 minutes and take the mean.
Something like this (this doesn't work, of course!):
df.groupby([df.index.map(lambda t: t.5minute)]).aggregate(np.mean)
Any ideas? Thanks!
Use resample:
df.resample('5Min').sum()
This assumes your index is properly set as a DatetimeIndex.
You can also use pd.Grouper, as resampling is a groupby operation on time buckets (pd.TimeGrouper is the older, deprecated name):
df.groupby(pd.Grouper(freq='5Min')).sum()
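Since the question asks for a mean rather than a sum, a minimal end-to-end sketch (with made-up per-minute bitrate values) showing both spellings:

import numpy as np
import pandas as pd

# Made-up per-minute bitrate sums, standing in for the output of aggregate_data
index = pd.date_range('2016-01-01 10:00', periods=10, freq='1Min')
df = pd.DataFrame({'bitrate': np.arange(10)}, index=index)

print(df.resample('5Min').mean())                  # mean per 5-minute bucket
print(df.groupby(pd.Grouper(freq='5Min')).mean())  # equivalent groupby form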
Related
I have 1-minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period='5d', interval='1m')
I am trying to resample it to '5m' data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it also resamples times that are not in my data. The market data is only available until 03:30 PM, but when I look at the resampled dataframe I find it has resampled the entire 24 hours.
How do I stop the resampling at 03:30 PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values because of this. Any suggestions will be welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile, and that you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period='5d', interval='1m')
Resample your data:
Note: your result is the same as resampling with n.resample('5T').first(), but that means every value in the dataframe equals the first value of its 5-minute interval (which consists of 5 one-minute values). A more logical resampling method is to use mean() or sum(), as shown below.
If this is data on stock prices it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled hours that are outside of the market's trading hours, you have two options.
Option 1: drop NaN values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum(), since the result will contain zeros instead of missing values.
Option 2: filter based on start and end times
Get the minimum and maximum time of day for which data is available, as datetime.time objects:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
Output:
The agg() is there to apply an individual aggregation method to each column; I used it so that I can see the 'candlestick' formation, as it is called in stock technical analysis.
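For reference, a minimal sketch of that candlestick-style aggregation combined with the dropna() fix above (assuming the usual flat Open/High/Low/Close/Adj Close/Volume columns from yf.download):

import yfinance as yf

n = yf.download('^nsei', period='5d', interval='1m')

# Candlestick (OHLC) resample: first Open, max High, min Low, last Close,
# last Adj Close, summed Volume per 5-minute bucket
ohlc = dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum']))
candles = n.resample('5T').agg(ohlc).dropna()  # dropna() removes after-hours buckets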
I was able to fix the issue by dropping the NaN values.
I have a dataframe that contains GPS locations of vehicles received at various times in a day. For each vehicle, I want to resample hourly data such that I have the median report (according to the timestamp) for each hour of the day. For hours with no corresponding rows, I want a blank row.
I am using the following code:
for i, j in enumerate(list(df.id.unique())):
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23
Since my dataframe has millions of ids, it takes a lot of time to iterate through all of them. Is there an efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
Tested in the latest version of pandas: convert the hour column to a categorical with all possible categories, then aggregate without a loop:
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour']).median().reset_index()
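To see why the categorical trick avoids the reindex, a small sketch with toy data (names id, imo, and timestamp taken from the question; the values are made up):

import pandas as pd

# One vehicle that only reports in hours 3 and 5
df = pd.DataFrame({
    'id': [1, 1, 1],
    'imo': [9, 9, 9],
    'timestamp': pd.to_datetime(['2020-01-01 03:10', '2020-01-01 03:40',
                                 '2020-01-01 05:20']),
    'value': [10.0, 20.0, 30.0],
})
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
# observed=False keeps unobserved hour categories, so every hour gets a row
df1 = df.groupby(['id', 'imo', 'hour'], observed=False)['value'].median().reset_index()
print(df1)  # 24 rows; value is NaN for hours with no reports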
I have a long time series, e.g.:
import pandas as pd
index = pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df = pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most efficient way to do this, avoiding a check like index.date == given_date, which is very slow? Also, the user does not know a priori the range of days in the frame.
Any hint on doing this with an iterator?
My current solution is below, but it is not very elegant and has the two issues described after it:
import numpy as np

time_zone = 'Europe/Berlin'

# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()

results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i + 1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?

# last day
results.append(df[day_next:])
This approach has the following problems:
a = np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes day_next, but I need to exclude it from the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
Note that df.index.day uses the day of the month (i.e., 1 to 31) for grouping, which could result in undesirable behavior if the input dataframe's dates extend across multiple months.
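As a usage sketch, the per-day groups can also be collected into a dict keyed by the date (using the df built in the question):

# One sub-DataFrame per calendar day, keyed by datetime.date objects
per_day = {day: sub for day, sub in df.groupby(df.index.date)}
for day in sorted(per_day):
    print(day, len(per_day[day]))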
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
This should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!
I currently have a dataframe that looks like:
I am trying to figure out how to do the following, and just don't know how to start....
1. For each day, cumsum the volume.
2. After this, group the data by time of day (i.e., 10-minute intervals). If a day doesn't have a given interval (there are sometimes gaps), it should just be treated as 0.
Any help would really be appreciated!
For number 1:
Let's use resample with D:
df.resample('D')['volume'].cumsum()
For Number 2:
Let's use resample('10T') with asfreq and replace:
df.resample('10T').asfreq().replace(np.nan,0)
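Putting both steps together on toy data (a sketch; the column name volume is taken from the question, and fillna(0) is equivalent to replace(np.nan, 0)):

import pandas as pd

# Toy minute-level volume data with gaps
index = pd.to_datetime(['2017-01-01 09:00', '2017-01-01 09:05',
                        '2017-01-01 09:21', '2017-01-02 09:00'])
df = pd.DataFrame({'volume': [1, 2, 3, 4]}, index=index)

# 1. cumsum the volume within each day (a groupby formulation of the same idea)
df['cum_volume'] = df.groupby(df.index.date)['volume'].cumsum()

# 2. 10-minute buckets; buckets with no rows become 0
buckets = df.resample('10T').asfreq().fillna(0)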
I have a dataframe (df) whose index is a datetime index in '%Y-%m-%d %H:%M:%S.%f' format, i.e. 2012-06-16 15:53:42.457000.
I am trying to create groups of 1 second using the groupby method, i.e.
x = df.groupby(pd.TimeGrouper('1S'))
time = x.first().index[1]
print(time)
The problem is that with the groupby method I am only getting the timestamp in whole seconds, i.e. "2012-06-16 15:53:42"; the milliseconds are excluded. Is there a way to get the complete timestamp?
Thank you!
I think this is a problem of formatting.
> df.index[0].strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
'2016-01-01 00:00:00.000'
After spending a few hours on it, I solved it. Functions such as groupby, rolling, etc. were giving the same problem. These functions use a generated datetime index for grouping, which is mathematically just the set of multiples of the defined frequency. Using that index to look up rows of the dataframe can raise a KeyError.
To get the complete datetime index (with milliseconds), access the datetime index of the members of each individual group created by the groupby method, i.e.
time = x2.first()['timeindex'][0]
where timeindex is the datetime column in my dataframe and [0] is the first member of the group; the index can be incremented to get the datetime of the second, third, and so on, members of each group.
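A minimal sketch of that idea (toy data; the timestamps are kept in a regular timeindex column so the millisecond values survive the grouping, and pd.Grouper is the modern name for TimeGrouper):

import pandas as pd

times = pd.to_datetime(['2012-06-16 15:53:42.457', '2012-06-16 15:53:42.900',
                        '2012-06-16 15:53:43.120'])
df = pd.DataFrame({'timeindex': times, 'value': [1, 2, 3]}, index=times)

x2 = df.groupby(pd.Grouper(freq='1S'))
# The group labels are whole seconds, but the stored column keeps milliseconds
print(x2.first()['timeindex'].iloc[0])  # 2012-06-16 15:53:42.457000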