Plot groupby data by month - python

I'm trying to plot a DataFrame grouped by month, where the index is a DatetimeIndex.
My goal is to plot each month on a separate plot.
import numpy as np
import pandas as pd
index = pd.date_range('2011-9-1 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
df2.plot()
This gives me 14 plots (not 12). The first two are empty, with years from 2000 to 2010 on the x-axis. After those, the first two plots are both January.
I'd appreciate any advice on how to handle this.

What are you trying to achieve? When grouping data you usually aggregate it in some way if you want to plot it. For example:
import numpy as np
import pandas as pd
index = pd.date_range('2011-1-1 00:00:03', '2011-12-31 23:50:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
# One plot per month group
for key, group in df2:
    group.plot()
Update: A fix for groups that span more than one year (here September 2011 and September 2012 land in the same group). This is probably not the best solution, but it's the first one that came to mind.
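import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
num_ticks = 15
index = pd.date_range('2011-9-1 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)
df2 = df.groupby(lambda x: x.month)
for key, group in df2:
    # Integer step between tick labels (at least 1 so the slice is valid)
    step = max(len(group) // num_ticks, 1)
    # Plot against a plain integer index so the gap between years disappears
    reset = group.reset_index()
    reset.drop(columns='index').plot()
    plt.xticks(reset.index[::step],
               reset['index'][::step].apply(
                   lambda x: x.strftime('%Y-%m-%d')).values,
               rotation=70)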
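Another option, sketched here as an assumption rather than as part of the answer above, is to group on year and month together so that September 2011 and September 2012 never share a group. The snippet uses `DatetimeIndex.to_period('M')` as the grouping key and compares the number of groups with the month-number grouping from the question.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2011-09-01 00:00:03', '2012-09-01 00:00:03', freq='10min')
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)

# Grouping on the month number merges September 2011 with September 2012.
by_month_number = df.groupby(lambda x: x.month)

# Grouping on a monthly Period keeps every calendar month separate.
by_calendar_month = df.groupby(df.index.to_period('M'))

n_month_number = by_month_number.ngroups
n_calendar_month = by_calendar_month.ngroups

# First and last timestamp of each calendar-month group.
spans = {str(period): (group.index.min(), group.index.max())
         for period, group in by_calendar_month}
```

Inside a loop such as `for period, group in by_calendar_month:`, calling `group.plot(title=str(period))` should then give one plot per calendar month with a continuous time axis.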

Related

Python - Pandas, How to aggregate by months inside a date interval efficiently

I am trying to compute aggregation metrics with pandas on a dataset where each row has a start and an end date delimiting an interval of months. I need to do this efficiently because my dataset can have millions of rows.
My dataset looks like this:
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame([["2020-01-01", "2020-05-01", 200],
["2020-02-01", "2020-03-01", 100],
["2020-03-01", "2020-04-01", 350],
["2020-02-01", "2020-05-01", 500]], columns=["start", "end", "value"])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
And I want to have something like this:
I've tried two approaches, making a month timerange with the start and end dates and exploding them, then grouping by month:
df["months"] = df.apply(lambda x: pd.date_range(x["start"], x["end"], freq="MS"), axis=1)
df_explode = df.explode("months")
df_explode.groupby("months")["value"].agg(["mean", "sum", "std"])
The other one iterates month by month, checks which rows contain that month, then aggregates them:
rows = []
for m in pd.date_range(df.start.min(), df.end.max(), freq="MS"):
    rows.append(df[(df.start <= m) & (m <= df.end)]["value"].agg(["mean", "sum", "std"]))
pd.DataFrame(rows, index=pd.date_range(df.start.min(), df.end.max(), freq="MS"))
The first approach is faster on smaller datasets and the second is better on bigger ones, but I'd like to know if there is a faster approach.
Thank you very much
This is similar to your second approach, but vectorized. It assumes your start and end dates are month starts.
import numpy as np
month_starts = pd.date_range(df.start.min(), df.end.max(), freq="MS")[:-1].to_numpy()
contained = np.logical_and(
np.greater_equal.outer(month_starts, df["start"].to_numpy()),
np.less.outer(month_starts, df["end"].to_numpy()),
)
masked = np.where(contained, np.broadcast_to(df[["value"]].transpose(),contained.shape), np.nan)
pd.DataFrame(masked, index=month_starts).agg(["mean", "sum", "std"], axis=1)
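As a worked check, here is a self-contained sketch of the same masking idea applied to the sample data from the question. It uses plain NumPy broadcasting instead of the `outer` ufunc calls, which is a variation of mine and not the answer's exact code. Note that a month counts as contained when `start <= month < end`, so the end month itself is excluded, unlike the closed interval in the question's loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["2020-01-01", "2020-05-01", 200],
     ["2020-02-01", "2020-03-01", 100],
     ["2020-03-01", "2020-04-01", 350],
     ["2020-02-01", "2020-05-01", 500]],
    columns=["start", "end", "value"],
)
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# Month starts covered by the data, dropping the final end month.
month_starts = pd.date_range(df["start"].min(), df["end"].max(), freq="MS")[:-1]

# Boolean mask of shape (n_months, n_rows): row j covers month i
# when start_j <= month_i < end_j.
m = month_starts.to_numpy()[:, None]
contained = (m >= df["start"].to_numpy()) & (m < df["end"].to_numpy())

# Keep each value only in the months its row covers; NaN elsewhere.
masked = np.where(contained, df["value"].to_numpy(dtype=float), np.nan)
per_month = pd.DataFrame(masked, index=month_starts)

# NaN entries are skipped by default in these row-wise reductions.
stats = pd.DataFrame({
    "mean": per_month.mean(axis=1),
    "sum": per_month.sum(axis=1),
    "std": per_month.std(axis=1),
})
```

The mask holds one entry per (month, row) pair, so with millions of rows and many months it may be worth processing the rows in chunks.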

How to extract year, month, date from a date column?

I'm trying to extract date information from a date column, and append the new columns to the original dataframe. However, I kept getting this message saying I cannot use .dt with this column. Not sure what I did wrong here, any help will be appreciated.
Error message that I got in python:
First do df.datecolumn = pd.to_datetime(df.datecolumn), then live happily ever after.
This will give you year, month and day in that month. You can also easily get week of the year and day of the week.
import pandas as pd
df = pd.DataFrame(data=[['1920-01-01'], ['2008-12-06']], columns=['Date'])
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].apply(lambda x : x.year)
df['Month'] = df['Date'].apply(lambda x : x.month)
df['Day'] = df['Date'].apply(lambda x : x.day)
print(df)
In your Time list you have a typo: Dayorweek should be dayofweek.
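For reference, a minimal sketch of the same extraction using the `.dt` accessor, which only works once the column has a datetime dtype (hence the `pd.to_datetime` call first). The column names `DayOfWeek` and `WeekOfYear` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1920-01-01', '2008-12-06']})

# Without this conversion the column holds strings and .dt raises an error.
df['Date'] = pd.to_datetime(df['Date'])

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek            # Monday is 0
df['WeekOfYear'] = df['Date'].dt.isocalendar().week  # ISO week number
```

The accessor works on the whole column at once, so it avoids calling a Python lambda per row.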

Selecting on multiple criteria

I made this dataframe, which contains dates as datetime64 values.
What I want to do is a bit of a stupid example, but it illustrates my point of selecting on multiple criteria.
I want to:
For the year 2018: plot a bar chart grouped per month, of the different values. So I want to create one graph for 2018, showing on the x-axis 12 times 3 bars.
I hope someone has some idea how this works.
Thank you in advance
import pandas as pd
import numpy as np
import random
date_expected = np.arange('2006-01', '2008-06', dtype= 'datetime64[D]')
cat = ['True','False', 'Maybe']
value = [random.choice(cat) for i in range(len(date_expected))]
data = {'Date_expected': date_expected, 'Value': value }
df = pd.DataFrame(data)
print(df)
First, create a column with the month. Then, group by month and value and get the count.
You need to unstack to get the one column of count per value so that you can plot the bar chart.
df['month'] = df['Date_expected'].apply(lambda x: x.month)
df.groupby(['month', 'Value']).count().unstack().plot(kind='bar')
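The snippet above counts over all years at once. To restrict the chart to a single year, as the question asks, one possible sketch is to filter on the year first and then count per month and value. The year 2007 is an assumption chosen because the sample data only spans 2006 to mid-2008, and the seeded NumPy generator replaces `random.choice` only to make the example repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2006-01-01', '2008-05-31', freq='D')
df = pd.DataFrame({
    'Date_expected': dates,
    'Value': rng.choice(['True', 'False', 'Maybe'], size=len(dates)),
})

year = 2007  # assumed year for illustration; the sample data has no 2018

# First criterion: keep only rows from the chosen year.
in_year = df[df['Date_expected'].dt.year == year]

# Second step: count rows per (month, Value), one column per Value.
month = in_year['Date_expected'].dt.month.rename('month')
counts = in_year.groupby([month, 'Value']).size().unstack(fill_value=0)
```

Calling `counts.plot(kind='bar')` on the result should then draw 12 groups of 3 bars, one group per month.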

Extract data between two dates each year

I have a time series of daily data from 2000 to 2015. What I want is another single time series which only contains data from each year between April 15 to June 15 (because that is the period relevant for my analysis).
I have already written code to do this myself, which is given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is too inefficient, as it has to operate on the dataframe 3 times to get the desired subset.
I would like to know the following two things:
Is there an inbuilt pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the data frame look like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year set to the same value (2000 is a leap year, so February 29 stays valid):
x = df.Date.apply(lambda d: pd.Timestamp(2000, d.month, d.day))
Filter using this series to select from the dataframe:
df[(x >= pd.Timestamp(2000, 4, 15)) & (x <= pd.Timestamp(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
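The trick above encodes each date as the integer `month * 100 + day`, so April 15 becomes 415 and June 15 becomes 615, and a single `between` call selects the window in every year. A minimal sketch applying it to a DataFrame with a `Date` column, using made-up daily data for 2000 to 2015:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2000-01-01', '2015-12-31', freq='D')})
df['Value'] = np.arange(len(df))

# Encode month and day as one integer, e.g. April 15 -> 415.
month_day = df['Date'].dt.month * 100 + df['Date'].dt.day

# between() is inclusive on both ends by default.
subset = df[month_day.between(415, 615)]
```

Each year contributes 62 days (16 in April, 31 in May, 15 in June), and the dataframe is filtered in one pass instead of three.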

How can I bin by date ranges and categories in pandas?

I have a data frame with a date, a category and a value. I'd like to plot the sum-aggregated values per category. For example I want to sum values which happen in 3 day periods, but for each category individually.
An attempt which seems too complicated is
import random
import datetime as dt
import pandas as pd
random.seed(0)
df=pd.DataFrame([[dt.datetime(2000,1,random.randint(1,31)), random.choice("abc"), random.randint(1,3)] for _ in range(100)], columns=["date", "cat", "value"])
df.set_index("date", inplace=True)
result = df.groupby("cat")["value"].resample("3D").sum().unstack("cat").fillna(0)
result.plot()
This is basically the right logic, but the resampling doesn't have a fixed start, so the date ranges for the 3-day periods don't align between categories (and I get NaN/0 values).
What is a better way to achieve this plot?
I think you should group by cat and date:
df = pd.DataFrame([[dt.datetime(2000,1,random.randint(1,31)), random.choice("abc"), random.randint(1,3)] for _ in range(100)], columns=["date", "cat", "value"])
df.groupby(["cat", pd.Grouper(freq='3d',key='date')]).sum().unstack(0).fillna(0).plot()
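To see why this helps, here is a small self-contained sketch along the same lines. Because `pd.Grouper` bins the whole `date` column at once, every category shares the same 3-day bin edges. The variable names are my own, and the plotting call is left out so that the snippet only builds the table.

```python
import datetime as dt
import random

import pandas as pd

random.seed(0)
df = pd.DataFrame(
    [[dt.datetime(2000, 1, random.randint(1, 31)),
      random.choice("abc"),
      random.randint(1, 3)] for _ in range(100)],
    columns=["date", "cat", "value"],
)

# Group by category and by shared 3-day bins of the date column.
grouped = df.groupby(["cat", pd.Grouper(freq="3D", key="date")])["value"].sum()

# One column per category; bins with no rows for a category become 0.
result = grouped.unstack(0).fillna(0)

# Gaps between consecutive bin labels, in days.
gaps = result.index.to_series().diff().dropna().dt.days
```

`result.plot()` should then draw one line per category on a common set of 3-day bins.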
