I currently have a dataframe that looks like:
I am trying to figure out how to do the following, and just don't know how to start....
1. For each day, cumsum the volume.
2. After this, group the data by time of day (i.e., 10-minute intervals). If a day doesn't have that interval
(there are sometimes gaps), it should just be treated as 0.
Any help would really be appreciated!
For number 1:
Let's use resample with D:
df.resample('D')['volume'].cumsum()
For Number 2:
Let's use resample('10T') with asfreq and fillna:
df.resample('10T').asfreq().fillna(0)
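Putting both steps together, a minimal sketch (the small frame with a 'volume' column and a DatetimeIndex is invented for illustration; the groupby-on-date form of the cumsum is used because it behaves the same across pandas versions):

```python
import pandas as pd

# Invented minute-level data for two days, with gaps.
idx = pd.to_datetime([
    "2021-01-04 09:31", "2021-01-04 09:32", "2021-01-04 10:01",
    "2021-01-05 09:31", "2021-01-05 09:32",
])
df = pd.DataFrame({"volume": [100, 50, 25, 200, 10]}, index=idx)

# 1. Cumulative volume within each day.
df["cum_volume"] = df.groupby(df.index.date)["volume"].cumsum()

# 2. Total volume per 10-minute bin; bins with no rows come out as 0.
#    (resample(...).asfreq().fillna(0) instead picks the single value
#    sitting exactly on each 10-minute mark.)
by_10min = df["volume"].resample("10min").sum()
```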
Related
I have data of 7 months on an hourly and minutes basis. I want to drop night time data (7:30pm to 5:10am) from everyday.
If you are using a DatetimeIndex, you don't need the dt accessor. Also, index.hour returns an integer, so compare it against integers rather than string expressions. You can do it like this:
df2=df.loc[(df.index.hour >= 5) & (df.index.hour <= 19)]
But there is a simpler way: use between_time():
df=df.between_time('05:10:00', '19:30:00')
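A runnable sketch of both approaches (the two-day minute-frequency frame is invented for illustration):

```python
import pandas as pd

# Invented minute-frequency frame covering two full days.
idx = pd.date_range("2021-03-01", periods=2 * 24 * 60, freq="min")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Integer comparison on the index hours (keeps whole hours 5..19).
df2 = df.loc[(df.index.hour >= 5) & (df.index.hour <= 19)]

# between_time keeps exactly 05:10 through 19:30 (inclusive) each day,
# i.e. it drops the 19:31-05:09 night-time window.
day = df.between_time("05:10", "19:30")
```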
I have a long time series, eg.
import pandas as pd
index=pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most efficient way to do this while avoiding a check like index.date == given_date, which is very slow? Also, the user does not know a priori the range of days in the frame.
Any hint on how to do this with an iterator?
My current solution is this, but it is not so elegant and has two issues defined below:
import numpy as np
import pandas as pd

time_zone = 'Europe/Berlin'
# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()
results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i + 1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?
# last day
results.append(df[day_next:])
This approach has the following problems:
a=np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes the day_next, but I need to exclude it in the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
Note that df.index.day groups by the day of the month (i.e. 1 to 31), which can produce undesirable behavior if the dataframe's dates span multiple months.
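A minimal sketch of the groupby-on-date approach, using a smaller hourly version of the frame from the question:

```python
import pandas as pd

# Hourly (rather than per-second) version of the example frame.
idx = pd.date_range(start="2012-11-05", end="2012-11-10", freq="60min")
df = pd.DataFrame({"Number": range(len(idx))}, index=idx)

# One sub-DataFrame per calendar date; no slicing is involved,
# so the next day is never accidentally included.
per_day = [group for _, group in df.groupby(df.index.date)]
```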
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
Should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!
I am using Pandas to structure and process Data. This is my DataFrame:
I grouped many datetimes by minute and I did an aggregation in order to have the sum of 'bitrate' scores by minute.
This was my code to have this Dataframe:
import datetime
import numpy as np

def aggregate_data(data):
    def delete_seconds(time):
        return datetime.datetime.strptime(time, '%Y-%m-%d %H:%M:%S').replace(second=0)
    data['new_time'] = data['beginning_time'].apply(delete_seconds)
    df = data[['new_time', 'bitrate']].groupby(['new_time']).aggregate(np.sum)
    return df
Now I want to do a similar thing with 5-minute buckets: group my datetimes by 5 minutes and take the mean.
Something like this (this doesn't work, of course!):
df.groupby([df.index.map(lambda t: t.5minute)]).aggregate(np.mean)
Ideas ? Thx !
use resample.
df.resample('5Min').sum()
This assumes your index is properly set as a DateTimeIndex.
you can also use pd.Grouper, as resampling is a groupby operation on time buckets (pd.TimeGrouper is the old name for this, removed in pandas 1.0):
df.groupby(pd.Grouper(freq='5Min')).sum()
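Since the question asks for a mean over 5-minute buckets, here is a sketch with invented per-minute bitrate values; both lines give the same result:

```python
import pandas as pd

# Invented per-minute bitrate sums, as produced by the minute-level step above.
idx = pd.date_range("2021-06-01 00:00", periods=10, freq="min")
df = pd.DataFrame(
    {"bitrate": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}, index=idx
)

# Mean bitrate per 5-minute bucket.
five_min = df.resample("5Min").mean()

# The same result via an explicit groupby on time buckets.
five_min_gb = df.groupby(pd.Grouper(freq="5Min")).mean()
```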
Currently I'm generating a DateTimeIndex using a certain function, zipline.utils.tradingcalendar.get_trading_days. The time series is roughly daily but with some gaps.
My goal is to get the last date in the DateTimeIndex for each month.
.to_period('M') & .to_timestamp('M') don't work since they give the last day of the month rather than the last value of the variable in each month.
As an example, if this is my time series I would want to select '2015-05-29' while the last day of the month is '2015-05-31'.
['2015-05-18', '2015-05-19', '2015-05-20', '2015-05-21',
'2015-05-22', '2015-05-26', '2015-05-27', '2015-05-28',
'2015-05-29', '2015-06-01']
Condla's answer came closest to what I needed except that since my time index stretched for more than a year I needed to groupby by both month and year and then select the maximum date. Below is the code I ended up with.
# tempTradeDays is the initial DatetimeIndex
dateRange = []
tempYear = None
dictYears = tempTradeDays.groupby(tempTradeDays.year)
for yr in dictYears.keys():
    tempYear = pd.DatetimeIndex(dictYears[yr]).groupby(pd.DatetimeIndex(dictYears[yr]).month)
    for m in tempYear.keys():
        dateRange.append(max(tempYear[m]))
dateRange = pd.DatetimeIndex(dateRange).sort_values()  # .order() was removed; use sort_values()
Suppose your data frame looks like this
original dataframe
Then the following Code will give you the last day of each month.
df_monthly = df.reset_index().groupby([df.index.year,df.index.month],as_index=False).last().set_index('index')
transformed_dataframe
This one-liner does the job :)
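A sketch of that one-liner on an invented frame whose May ends mid-week:

```python
import pandas as pd

# Invented business-day index; 2015-05-29 is the last trading day of May.
idx = pd.to_datetime(
    ["2015-05-28", "2015-05-29", "2015-06-01", "2015-06-02", "2015-06-30"]
)
df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Last row of each (year, month) group; the original dates survive the
# reset_index/set_index round-trip as the 'index' column.
df_monthly = (
    df.reset_index()
      .groupby([df.index.year, df.index.month], as_index=False)
      .last()
      .set_index("index")
)
```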
My strategy would be to group by month and then select the "maximum" of each group:
If "dt" is your DatetimeIndex object:
last_dates_of_the_month = []
dt_month_group_dict = dt.groupby(dt.month)
for month in dt_month_group_dict:
    last_date = max(dt_month_group_dict[month])
    last_dates_of_the_month.append(last_date)
The list "last_dates_of_the_month" contains all occurring last dates of each month in your dataset. You can use this list to create a DatetimeIndex in pandas again (or whatever else you want to do with it).
This is an old question, but all existing answers here aren't perfect. This is the solution I came up with (assuming that date is a sorted index), which can be even written in one line, but I split it for readability:
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
apple[mask.values].head(10)
Few notes here:
Shifting a datetime series requires another pd.Series instance (see here)
Boolean mask indexing requires .values (see here)
By the way, when the dates are business days, it'd be easier to use resampling: apple.resample('BM').last()
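A sketch of the shift-based mask on an invented daily frame spanning a month boundary (note the final row is always selected, since the shifted month is NaN there):

```python
import pandas as pd

# Invented daily closes around two month boundaries.
idx = pd.to_datetime([
    "2015-05-28", "2015-05-29", "2015-06-01",
    "2015-06-29", "2015-06-30", "2015-07-01",
])
apple = pd.DataFrame({"close": [1, 2, 3, 4, 5, 6]}, index=idx)

# A row is the last of its month when the next row's month differs.
month1 = pd.Series(apple.index.month)
month2 = pd.Series(apple.index.month).shift(-1)
mask = (month1 != month2)
last_of_month = apple[mask.values]
```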
Maybe the answer is not needed anymore, but while searching for an answer to the same question I found a possibly simpler solution:
import pandas as pd
sample_dates = pd.date_range(start='2010-01-01', periods=100, freq='B')
month_end_dates = sample_dates[sample_dates.is_month_end]
Try this to create a new diff column, where the value 1 marks a change from one month to the next:
df['diff'] = np.where(df['Date'].dt.month.diff() != 0, 1, 0)
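A sketch on an invented frame with a Date column (note the first row also gets a 1, since diff() yields NaN there):

```python
import numpy as np
import pandas as pd

# Invented frame with a Date column (not an index).
df = pd.DataFrame({"Date": pd.to_datetime(
    ["2015-05-29", "2015-06-01", "2015-06-02", "2015-07-01"]
)})

# 1 wherever the month changes relative to the previous row.
df["diff"] = np.where(df["Date"].dt.month.diff() != 0, 1, 0)
```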
Assume my data is daily counts and has as its index a DateTimeIndex column. Is there a way to get the average of the past n weekdays? For instance, if the date is Sunday August 15th, I'd like to get mean of counts on (sunday august 8th, sunday august 1st, ...).
I started using pandas yesterday, so here's what I've brute forced.
# df is a dataframe with a DateTimeIndex
# brute force for count of last n weekdays, where lnwd = last n weekdays
def lnwd(n=1):
    lnwd, tmp = df.shift(7), df.shift(7)  # count last weekday
    for i in range(n - 1):  # xrange in the original (Python 2)
        tmp = tmp.shift(7)
        lnwd += tmp
    lnwd = lnwd / n  # average
    return lnwd
There has to be a one-liner. Is there a way to use apply() (without passing a function that has a for loop, since n is variable) or some form of groupby? For instance, the way to find the mean of all data on each weekday is:
df.groupby(lambda x: x.dayofweek).mean() # mean of each MTWHFSS
I think you are looking for a rolling apply (a rolling mean in this case)? See the docs: http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments. But then applied to each weekday separately; this can be achieved by combining rolling_mean with a groupby on the weekday.
This should give something like (with a series s):
s.groupby(s.index.weekday).transform(lambda x: pd.rolling_mean(x, window=n))
Using pandas 1.4.1, the solution provided by joris seems outdated ("module 'pandas' has no attribute 'rolling_mean'"). The same can be achieved using
s.groupby(s.index.weekday).transform(lambda x: x.rolling(window=n).mean())
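A runnable sketch on invented daily counts. The second variant shifts within each weekday group so the current day is excluded from the mean, matching the shift(7) brute force in the question:

```python
import pandas as pd

# Four weeks of invented daily counts, starting on a Monday (2021-08-02).
idx = pd.date_range("2021-08-02", periods=28, freq="D")
s = pd.Series(range(28), index=idx, dtype=float)
n = 2

# Mean of this row and the previous (n - 1) same weekdays.
incl = s.groupby(s.index.weekday).transform(
    lambda x: x.rolling(window=n).mean()
)

# Mean of the previous n same weekdays, excluding the current row.
excl = s.groupby(s.index.weekday).transform(
    lambda x: x.shift().rolling(window=n).mean()
)
```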