Python - Pandas: How to aggregate by months inside a date interval efficiently

I am trying to compute aggregation metrics with pandas on a dataset where each row has a start and end date marking a month interval. I need to do this efficiently because my dataset can have millions of rows.
My dataset looks like this:
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame([["2020-01-01", "2020-05-01", 200],
                   ["2020-02-01", "2020-03-01", 100],
                   ["2020-03-01", "2020-04-01", 350],
                   ["2020-02-01", "2020-05-01", 500]], columns=["start", "end", "value"])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
And I want to get something like this:
I've tried two approaches. The first builds a monthly date range between each row's start and end dates, explodes it, and then groups by month:
df["months"] = df.apply(lambda x: pd.date_range(x["start"], x["end"], freq="MS"), axis=1)
df_explode = df.explode("months")
df_explode.groupby("months")["value"].agg(["mean", "sum", "std"])
The second iterates month by month, selects the rows whose interval contains that month, and then aggregates them:
rows = []
for m in pd.date_range(df.start.min(), df.end.max(), freq="MS"):
    rows.append(df[(df.start <= m) & (m <= df.end)]["value"].agg(["mean", "sum", "std"]))
pd.DataFrame(rows, index=pd.date_range(df.start.min(), df.end.max(), freq="MS"))
The first approach is faster on smaller datasets and the second on bigger ones, but I'd like to know if there is a better and faster way to do this.
Thank you very much

This is similar to your second approach, but vectorized. It assumes your start and end dates are month starts.
import numpy as np

# month starts covered by the data; drop the last one, which equals the latest
# (exclusive) end date
month_starts = pd.date_range(df.start.min(), df.end.max(), freq="MS")[:-1].to_numpy()
# contained[i, j] is True when month i falls inside row j's [start, end) interval
contained = np.logical_and(
    np.greater_equal.outer(month_starts, df["start"].to_numpy()),
    np.less.outer(month_starts, df["end"].to_numpy()),
)
# keep each row's value in the months its interval covers, NaN elsewhere
masked = np.where(contained, np.broadcast_to(df[["value"]].transpose(), contained.shape), np.nan)
pd.DataFrame(masked, index=month_starts).agg(["mean", "sum", "std"], axis=1)
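One thing to keep in mind: contained and masked are dense arrays of shape (number of months, number of rows), so with millions of rows the intermediate matrices can get large. If memory becomes a problem, the same idea can be applied in chunks of months; a rough sketch (the chunk_size value is purely illustrative):
chunk_size = 12  # months per chunk, purely illustrative
parts = []
for i in range(0, len(month_starts), chunk_size):
    chunk = month_starts[i:i + chunk_size]
    chunk_contained = np.logical_and(
        np.greater_equal.outer(chunk, df["start"].to_numpy()),
        np.less.outer(chunk, df["end"].to_numpy()),
    )
    chunk_masked = np.where(chunk_contained, np.broadcast_to(df[["value"]].transpose(), chunk_contained.shape), np.nan)
    parts.append(pd.DataFrame(chunk_masked, index=chunk).agg(["mean", "sum", "std"], axis=1))
result = pd.concat(parts)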

Related

pandas groupby column with rolling mean, limited between datetimes, without iterating over each row

I have data in a dataframe as follows:
import numpy
import pandas

ROWS = 1000
df = pandas.DataFrame()
df['DaT'] = pandas.date_range('2000-1-1', periods=ROWS, freq='H')
df['cat'] = numpy.random.choice(['a', 'b', 'c'], size=ROWS)
df['val'] = numpy.random.randint(2, size=ROWS)
df['r10'] = df.groupby(['cat'])['val'].transform(lambda x: x.rolling(10).mean())
I need to calculate a column that is grouped by category 'cat' and is a rolling (10-period) mean of the 'val' column, but the rolling mean for a given row cannot include values from the day that row occurs on.
The desired result ('wanted') can be generated as follows:
df['wanted'] = numpy.nan
for idx, row in df.iterrows():
    Rdate = row['DaT'].normalize()
    Rcat = row['cat']
    try: df.loc[idx, 'wanted'] = df[(df['DaT'] < Rdate) & (df['cat'] == Rcat)]['val'].rolling(10).mean().iloc[-1]
    except: df.loc[idx, 'wanted'] = numpy.nan
The above is an awful solution, but it gets the result. It is very slow for the 100,000+ rows it needs to go through. Is there a more elegant solution?
I have tried using combinations of shift and even quantize to get a more efficient solution, but with no success yet.
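A possible vectorized sketch of the same logic (assuming the frame is sorted by 'DaT'; the day column and the wanted_fast name are placeholders introduced here): within each category, take the rolling mean as of the last row of each day, shift it one day forward within the category, and map it back onto the rows, so each row only sees values from strictly earlier days.
# rolling 10-period mean within each category, aligned to every row
roll = df.groupby('cat')['val'].transform(lambda s: s.rolling(10).mean())
# rolling mean as of the last row of each (category, day) pair
df['day'] = df['DaT'].dt.normalize()
last_per_day = roll.groupby([df['cat'], df['day']]).last()
# shift by one day within each category so a row only sees earlier days
prev_day = last_per_day.groupby(level='cat').shift(1).rename('wanted_fast')
# map the shifted values back onto the original rows
df = df.merge(prev_day.reset_index(), on=['cat', 'day'], how='left')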

Filter each column by having the same value three times or more

I have a dataset with dates as the index, and each column is the name of an item with a count as the value. I'm trying to figure out how to filter each column for stretches of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python; so far I have tried for loops, but did not get them to work:
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i+1, 'name'] == df.loc[i+2, 'name']:
        print(a.loc[i, "name"])
This raises: "Cannot add integral value to Timestamp without freq."
It would be better if you included a sample dataframe and desired output in your question; please do so next time. As it is, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume that might not be the case and make it so that every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day (pick any date inside the generated range)
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether a value in the dataframe is equal to 0 and then take a rolling sum with a window of 4 (> 3). This way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If there is a 0 for more than window consecutive rows, it will show up as multiple rows whose dates are just one day apart. I hope that makes sense.
# custom function as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan
# apply the function to every value in the dataframe
# for each row, calculate the sum of four subsequent rows (>3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan in the sum, the sum is np.nan, so it can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)
# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
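If the goal is simply to know which columns ever hit 4 or more consecutive zero days, a shorter sketch along the same lines (using the reindexed df from above, before the applymap step; cols_with_runs is just an illustrative name):
# 1 where the count is zero, 0 elsewhere (NaN compares as not-zero here)
is_zero = df.eq(0).astype(int)
# per column: is there any window of 4 consecutive rows that is all zeros?
has_run = is_zero.rolling(window=4).sum().eq(4).any()
cols_with_runs = has_run[has_run].index.tolist()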

Aggregate dataframe columns by hourly index

So I have a pandas dataframe that is taking in/out interface traffic every 10 minutes. I want to aggregate the two time series into hourly buckets for analysis. What seems simple has actually turned out to be quite challenging for me to figure out! I just need to bucket the data into hourly bins.
times = list()
ins = list()
outs = list()
for row in results['results']:
    times.append(row['DateTime'])
    ins.append(row['Intraffic'])
    outs.append(row['Outtraffic'])
df = pd.DataFrame()
df['datetime'] = times
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df['ins'] = ins
df['outs'] = outs
I have tried using
df.resample('H').mean()
and I have also tried pandas groupby, but I was having trouble with the two columns and getting the means over the hourly buckets.
I believe this should do what you want:
df = pd.DataFrame()
df['datetime'] = times
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime',inplace=True) # This won't try to remap your rows
new_df = df.groupby(pd.Grouper(freq='H')).mean()
That last line groups your data by timestamp into hourly chunks, based on the index, and then spits out a new DataFrame with the mean of each column.
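As a side note, once the datetime column has been moved into the index with set_index (so it is no longer also a data column), the resample you originally tried should, as far as I can tell, give the same hourly means as the Grouper version:
new_df = df.resample('H').mean()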

Extract data between two dates each year

I have a time series of daily data from 2000 to 2015. What I want is another single time series which only contains data from each year between April 15 and June 15 (because that is the period relevant for my analysis).
I have already written code to do this myself, which is given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is quite inefficient, as it has to operate on the dataframe 3 times to get the desired subset.
I would like to know the following two things:
Is there an inbuilt pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the data frame look like this:
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year set to the same value:
x = df.Date.apply(lambda d: pd.Timestamp(2000, d.month, d.day))
Filter using this series to select from the dataframe:
df[(x >= pd.Timestamp(2000, 4, 15)) & (x <= pd.Timestamp(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
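The same month * 100 + day trick can be applied directly to the dataframe from the question (a sketch, assuming the parsed 'Date' column from the read_table call above):
mask = (df['Date'].dt.month * 100 + df['Date'].dt.day).between(415, 615)
dff = df[mask]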

counting records on per hour, per day and create multindex DataFrame as output

Sample DataFrame :
process_id | app_path | start_time
The desired output data frame should be multi-indexed based on the date and time values in the start_time column, with unique dates as the first level of the index and one-hour ranges as the second level. The count of records in each time slot should be calculated.
def activity(self):
    # find unique dates from the db file
    columns = self.df['start_time'].map(lambda x: x.date()).unique()
    result = pandas.DataFrame(np.zeros((1, len(columns))), columns=columns)
    for i in range(len(self.df)):
        col = self.df.iloc[i]['start_time'].date()
        result.loc[0, col] += 1
    return result
I have tried the above code, which gives output like this:
15-07-2014    16-07-2014    17-07-2014    18-07-2014
      3217          2114          1027          3016
I want to count records on a per-hour basis as well.
It would be helpful to start your question with some sample data. Since you didn't, I assumed the following is representative of your data (looks like app_path was not being used):
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=10000, freq='1Min')
df = pd.DataFrame(np.random.randint(size=len(rng), low=100, high=500), index=rng)
df.columns = ['process_id']
It looks like you could benefit from exploring the groupby method in Pandas data frames. Using groupby, your example above becomes a simple one-liner:
df.groupby( [df.index.year, df.index.month, df.index.day] ).count()
and grouping by hour means simply adding hour to the group:
df.groupby( [df.index.year, df.index.month, df.index.day, df.index.hour] ).count()
Don't reinvent the wheel in Pandas; use the methods provided for much more readable, as well as faster, code.
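If you specifically want the two-level index described in the question (unique dates as the first level, hours as the second), one possible sketch on top of the same sample data:
counts = df.groupby([df.index.normalize(), df.index.hour]).count()
counts.index.names = ['date', 'hour']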
