How to aggregate a metric and plot groups separately - python

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message: KeyError: ('date_m', 'campaign'). Any help will be greatly appreciated.

Plotting typically depends on the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
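As a minimal sketch of the two shapes described above, the question's data can be aggregated both ways and compared. The names long_df, wide_df and wide_from_long are illustrative and not part of the original post.

```python
import pandas as pd

# the question's data, rebuilt compactly
df = pd.DataFrame({
    'year': [2011] * 12,
    'month': [1, 2, 3, 4, 5, 6] * 2,
    'campaign': [0] * 6 + [1] * 6,
    'sales': [10000, 11000, 12000, 10500, 10000, 9500,
              7000, 8000, 5000, 6000, 6000, 7000],
})
# assemble a date from the year and month columns, using the first day of each month
df['date_m'] = pd.to_datetime(df[['year', 'month']].assign(day=1))

# long format: one row per (date_m, campaign) pair
long_df = df.groupby(['date_m', 'campaign'])['sales'].mean().reset_index()

# wide format: one row per date_m, one column per campaign
wide_df = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')

# unstacking the grouped result is another route to the same wide shape
wide_from_long = df.groupby(['date_m', 'campaign'])['sales'].mean().unstack('campaign')

print(long_df.shape)  # (12, 3)
print(wide_df.shape)  # (6, 2)
```

The long frame keeps 'campaign' as a column that a hue-style parameter can use, while the wide frame has one column per campaign, which is what a column-per-line plot expects.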
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
      date_m  campaign  sales
0 2011-01-01         0  10000
1 2011-01-01         1   7000
2 2011-02-01         0  11000
3 2011-02-01         1   8000
4 2011-03-01         0  12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
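The KeyError from the question can be reproduced on a small frame. The sketch below uses a four-row, made-up subset of the data and shows two ways around the error: selecting with a list of labels, and passing the key Series themselves to Series.groupby. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date_m': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-02-01', '2011-02-01']),
    'campaign': [0, 1, 0, 1],
    'sales': [10000, 7000, 11000, 8000],
})

# df['date_m', 'campaign'] asks for ONE column whose label is the tuple ('date_m', 'campaign')
try:
    df['date_m', 'campaign']
    raised = False
except KeyError:
    raised = True

# a list of labels selects two columns
two_cols = df[['date_m', 'campaign']]

# Series.groupby works when given a list of the key Series themselves
by_series = df['sales'].groupby([df['date_m'], df['campaign']]).mean()

# grouping the DataFrame by column names gives the same values
by_names = df.groupby(['date_m', 'campaign'])['sales'].mean()

print(raised)  # True
```

Grouping the DataFrame by column names is usually the shorter form, since the key columns do not have to be passed in separately.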
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
campaign        0     1
date_m
2011-01-01  10000  7000
2011-02-01  11000  8000
2011-03-01  12000  5000
2011-04-01  10500  6000
2011-05-01  10000  6000
# plot the dataframe
dfp.plot()
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Related

Seaborn barplot: isna is not defined for MultiIndex

I would like to use seaborn barplot() to create a bar chart from a multi-indexed Series. I have grouped my dataset by two variables:
module_7_a_df = module_7_df.groupby(by=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])['SENTENCE CAP "SENSPCAP"'].count()
Grouping the dataframe creates a Series. This is what the resulting Series looks like:
When I try to create a barplot, I keep getting an error stating 'isna is not defined for MultiIndex.' The code for the barplot is:
sns.barplot(x=module_7_a_df.values, y=module_7_a_df.index)
This code works for Series created where the data has only been grouped by one column.
Can someone explain how to deal with this error?
Remove all nan values from the columns you groupby before you group them.
module_7_df = module_7_df.dropna(subset=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])
When you have a multi-index, you need to reset_index and then use hue= to enable the grouping. Using an example dataset:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("tips")
counts = df.groupby(['time','day']).size()
counts
time    day
Lunch   Thur    61
        Fri      7
        Sat      0
        Sun      0
Dinner  Thur     1
        Fri     12
        Sat     87
        Sun     76
dtype: int64
Then with the following:
counts = counts.to_frame('counts').reset_index()
sns.barplot(data = counts, x = "time",y="counts",hue="day")
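The reshaping step can be checked without downloading the tips dataset. The sketch below uses a few made-up rows and shows two ways to get rid of the MultiIndex: flattening it into ordinary columns (the shape used above), or moving one level into the columns, which is the shape pandas' own bar plot works with. The unstack route is an alternative that the answer above does not mention.

```python
import pandas as pd

# made-up observations standing in for the grouped columns
df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'day': ['Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
})

# grouping by two columns gives a Series with a two-level MultiIndex
counts = df.groupby(['time', 'day']).size()

# option 1: flatten to a long DataFrame with ordinary columns (for data=, x=, y=, hue=)
flat = counts.to_frame('counts').reset_index()

# option 2: move one level into the columns (wide shape, e.g. for wide.plot.bar())
wide = counts.unstack('day', fill_value=0)

print(list(flat.columns))  # ['time', 'day', 'counts']
```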

Pandas/NumPy -- Plotting Dates as X axis

My goal is just to plot this simple data as a graph, with dates on the x-axis and price on the y-axis. As I understand it, the dtype of the NumPy record array for the date field is datetime64[D], which means it is a 64-bit np.datetime64 in 'day' units. While this format is more portable, Matplotlib cannot plot it natively yet. The data can be plotted by changing the dates to datetime.date instances instead, which can be achieved by converting to an object array, which I did below via astype('O'). But I am still getting this error:
view limit minimum -36838.00750000001 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-DateTime value to an axis that has DateTime units
code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'avocado.csv')
df2 = df[['Date','AveragePrice','region']]
df2 = (df2.loc[df2['region'] == 'Albany'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Date'] = df2.Date.astype('O')
plt.style.use('ggplot')
ax = df2[['Date','AveragePrice']].plot(kind='line', title ="Price Change",figsize=(15,10),legend=True, fontsize=12)
ax.set_xlabel("Period",fontsize=12)
ax.set_ylabel("Price",fontsize=12)
plt.show()
df.head(3)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
df2 = df[['Date', 'AveragePrice', 'region']]
df2 = (df2.loc[df2['region'] == 'Albany'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2[['Date', 'AveragePrice']]
df2 = df2.sort_values(['Date'])
df2 = df2.set_index('Date')
print(df2)
ax = df2.plot(kind='line', title="Price Change")
ax.set_xlabel("Period", fontsize=12)
ax.set_ylabel("Price", fontsize=12)
plt.show()
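The answer above comes without explanation, so here is a hedged sketch of what its preparation steps do, using three made-up rows in place of avocado.csv. The likely key point is that DataFrame.plot takes the x-axis from the index when no x is given, so the parsed dates are sorted and moved into the index before plotting.

```python
import pandas as pd

# made-up rows in the same layout as the question's columns
df = pd.DataFrame({
    'Date': ['2015-12-27', '2015-12-13', '2015-12-20'],
    'AveragePrice': [1.33, 0.93, 1.35],
    'region': ['Albany', 'Albany', 'Albany'],
})

# .copy() makes the filtered frame independent, so assigning to it is unambiguous
df2 = df.loc[df['region'] == 'Albany', ['Date', 'AveragePrice']].copy()
df2['Date'] = pd.to_datetime(df2['Date'])  # strings -> datetime64
df2 = df2.sort_values('Date').set_index('Date')

# the index is now a sorted DatetimeIndex, which df2.plot() would use for the x-axis
print(type(df2.index).__name__)  # DatetimeIndex
```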

Pandas dataframe groupby plot

I have a dataframe which is structured as:
Date ticker adj_close
0 2016-11-21 AAPL 111.730
1 2016-11-22 AAPL 111.800
2 2016-11-23 AAPL 111.230
3 2016-11-25 AAPL 111.790
4 2016-11-28 AAPL 111.570
...
8 2016-11-21 ACN 119.680
9 2016-11-22 ACN 119.480
10 2016-11-23 ACN 119.820
11 2016-11-25 ACN 120.740
...
How can I plot based on the ticker the adj_close versus Date?
For a simple plot, you can use:
df.plot(x='Date',y='adj_close')
Or you can set the index to be Date beforehand, then it's easy to plot the column you want:
df.set_index('Date', inplace=True)
df['adj_close'].plot()
If you want a chart with one series per ticker on it, you need to groupby first:
df.set_index('Date', inplace=True)
df.groupby('ticker')['adj_close'].plot(legend=True)
If you want a chart with individual subplots:
import numpy as np
import matplotlib.pyplot as plt
grouped = df.groupby('ticker')
ncols = 2
nrows = int(np.ceil(grouped.ngroups/ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12,4), sharey=True)
# one subplot per ticker
for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    grouped.get_group(key).plot(ax=ax)
ax.legend()
plt.show()
Similar to Julien's answer above, I had success with the following:
fig, ax = plt.subplots(figsize=(10,4))
# group by the column name (not a list) so each key is a plain string
for key, grp in df.groupby('ticker'):
    ax.plot(grp['Date'], grp['adj_close'], label=key)
ax.legend()
plt.show()
This solution might be more relevant if you want more control in matplotlib.
Solution inspired by: https://stackoverflow.com/a/52526454/10521959
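For anyone who wants to run the loop approach without the original data, here is a self-contained sketch with made-up prices in the question's long layout. The Agg backend line is an assumption so that it runs without a display; drop it and add plt.show() for interactive use.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# made-up prices in the question's long layout
df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-11-21', '2016-11-22', '2016-11-23'] * 2),
    'ticker': ['AAPL'] * 3 + ['ACN'] * 3,
    'adj_close': [111.73, 111.80, 111.23, 119.68, 119.48, 119.82],
})

fig, ax = plt.subplots(figsize=(10, 4))
# grouping by a single column name yields plain string keys for the legend
for key, grp in df.groupby('ticker'):
    ax.plot(grp['Date'], grp['adj_close'], label=key)
ax.legend(title='ticker')
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```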
The question is How can I plot based on the ticker the adj_close versus Date?
This can be accomplished by reshaping the dataframe to a wide format with .pivot or .groupby, or by plotting the existing long form dataframe directly with seaborn.
In the following sample data, the 'Date' column has a datetime64[ns] Dtype.
Convert the Dtype with pandas.to_datetime if needed.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
Imports and Sample Data
import pandas as pd
import pandas_datareader as web # for sample data; this can be installed with conda if using Anaconda, otherwise pip
import seaborn as sns
import matplotlib.pyplot as plt
# sample stock data, where .iloc[:, [5, 6]] selects only the 'Adj Close' and 'ticker' columns
tickers = ['aapl', 'acn']
df = pd.concat((web.DataReader(ticker, data_source='yahoo', start='2020-01-01', end='2022-06-21')
                .assign(ticker=ticker) for ticker in tickers)).iloc[:, [5, 6]]
# display(df.head())
        Date  Adj Close ticker
0 2020-01-02  73.785904   aapl
1 2020-01-03  73.068573   aapl
2 2020-01-06  73.650795   aapl
3 2020-01-07  73.304420   aapl
4 2020-01-08  74.483604   aapl
# display(df.tail())
           Date   Adj Close ticker
1239 2022-06-14  275.119995    acn
1240 2022-06-15  281.190002    acn
1241 2022-06-16  270.899994    acn
1242 2022-06-17  275.380005    acn
1243 2022-06-21  282.730011    acn
pandas.DataFrame.pivot & pandas.DataFrame.plot
pandas plots with matplotlib as the default backend.
Reshaping the dataframe with pandas.DataFrame.pivot converts from long to wide form, and puts the dataframe into the correct format to plot.
.pivot does not aggregate data, so if there is more than 1 observation per index, per ticker, then use .pivot_table
Adding subplots=True will produce a figure with two subplots.
# reshape the long form data into a wide form
dfp = df.pivot(index='Date', columns='ticker', values='Adj Close')
# display(dfp.head())
ticker           aapl         acn
Date
2020-01-02  73.785904  203.171112
2020-01-03  73.068573  202.832764
2020-01-06  73.650795  201.508224
2020-01-07  73.304420  197.157654
2020-01-08  74.483604  197.544434
# plot
ax = dfp.plot(figsize=(11, 6))
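The difference between .pivot and .pivot_table mentioned above only shows up when an index/column pair occurs more than once. A minimal sketch with made-up prices, where 'aapl' has two observations on the same date:

```python
import pandas as pd

# made-up data with two observations for aapl on 2020-01-02
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-02', '2020-01-02', '2020-01-02', '2020-01-03']),
    'ticker': ['aapl', 'aapl', 'acn', 'aapl'],
    'Adj Close': [70.0, 74.0, 200.0, 73.0],
})

# .pivot cannot place two values in one cell, so it raises ValueError
try:
    df.pivot(index='Date', columns='ticker', values='Adj Close')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# .pivot_table aggregates the duplicates instead
dfp = df.pivot_table(index='Date', columns='ticker', values='Adj Close', aggfunc='mean')

print(pivot_failed)  # True
print(dfp.loc[pd.Timestamp('2020-01-02'), 'aapl'])  # 72.0
```

Cells with no observation, such as 'acn' on 2020-01-03 here, come out as NaN in the wide frame.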
Use seaborn, which accepts long form data, so reshaping the dataframe to a wide form isn't necessary.
seaborn is a high-level api for matplotlib
sns.lineplot: axes-level plot
fig, ax = plt.subplots(figsize=(11, 6))
sns.lineplot(data=df, x='Date', y='Adj Close', hue='ticker', ax=ax)
sns.relplot: figure-level plot
Adding row='ticker', or col='ticker', will generate a figure with two subplots.
g = sns.relplot(kind='line', data=df, x='Date', y='Adj Close', hue='ticker', aspect=1.75)

Pandas: bar plot with multiIndex dataframe

I have a pandas DataFrame with a TIMESTAMP column (not the index), and the timestamp format is as follows:
2015-03-31 22:56:45.510
I also have columns called CLASS and AXLES. I would like to compute the count of records for each month separately for each unique value of AXLES (AXLES can take an integer value between 3-12).
I came up with a combination of resample and groupby:
resamp = dfWIM.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
This seems to give me a multiIndex dataframe object, as shown below.
In [72]: resamp
Out [72]:
AXLES  TIMESTAMP
3      2014-07-31     5517
       2014-08-31    31553
       2014-09-30    42816
       2014-10-31    49308
       2014-11-30    44168
       2014-12-31    45518
       2015-01-31    54782
       2015-02-28    52166
       2015-03-31    47929
4      2014-07-31     3147
       2014-08-31    24810
       2014-09-30    39075
       2014-10-31    46857
       2014-11-30    42651
       2014-12-31    48282
       2015-01-31    42708
       2015-02-28    43904
       2015-03-31    50033
From here, how can I access different components of this multiIndex object to create a bar plot for the following conditions?
show data when AXLES = 3
show x ticks in the Month - Year format (no days, hours, minutes etc.)
Thanks!
EDIT: Following code gives me the plot, but I could not change the xtick formatting to MM-YY.
resamp[3].plot(kind='bar')
EDIT 2 below is a code snippet that generates a small sample of the data similar to what I have:
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
resamp[3].plot(kind='bar')
EDIT 3:
Here below is the solution:
A. Plot the whole resampled dataframe (based on @Ako's suggestion):
df = resamp.unstack(0)
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
B. Plot an individual index from the resampled dataframe (based on @Alexander's suggestion):
df = resamp[3]
df.index = [ts.strftime('%b 20%y') for ts in df.index]
df.plot(kind='bar', rot=0)
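Note that the how= argument to resample used in this thread is not available in recent pandas releases. As a hedged alternative, and not the method used in the thread, the monthly counts can be computed by grouping on a monthly period, which also makes the month-year labels easy to build. The sample rows are made up. Unlike resample, this sketch omits months that have no records.

```python
import pandas as pd

# made-up sample: three records with 3 axles, one with 4
df = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2014-08-15', '2014-08-31', '2014-09-30', '2014-10-31']),
    'AXLES': [3, 3, 3, 4],
    'CLASS': [5, 6, 7, 8],
})

# count records per axle count and calendar month
months = df['TIMESTAMP'].dt.to_period('M')
counts = df.groupby(['AXLES', months])['CLASS'].count()

# select AXLES == 3 and build month-year tick labels
axle3 = counts.xs(3, level='AXLES')
labels = axle3.index.strftime('%m-%Y').tolist()
print(labels)  # ['08-2014', '09-2014']

# relabelled copy, ready for axle3_labeled.plot(kind='bar', rot=0)
axle3_labeled = axle3.set_axis(labels)
```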
You could generate and set the labels explicitly using ax.xaxis.set_major_formatter with a ticker.FuncFormatter. This will allow you to keep your DataFrame's MultiIndex with timestamp values, while displaying the timestamps in the desired %m-%Y format:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.ticker as ticker
dftest = {'TIMESTAMP':['2014-08-31','2014-09-30','2014-10-31'], 'AXLES':[3, 3, 3], 'CLASS':[5,6,7]}
dfTest = pd.DataFrame(dftest)
dfTest.TIMESTAMP = pd.to_datetime(pd.Series(dfTest.TIMESTAMP))
resamp = dfTest.set_index('TIMESTAMP').groupby('AXLES').resample('M', how='count').CLASS
ax = resamp[3].plot(kind='bar')
ticklabels = [timestamp.strftime('%m-%Y') for axle, timestamp in resamp.index]
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: ticklabels[int(x)]))
plt.gcf().autofmt_xdate()
plt.show()
The following should work, but it is difficult to test without some data.
Start by resetting your index to get access to the TIMESTAMP column. Then use strftime to format it to your desired text representation (e.g. mm-yy). Finally, reset the index back to AXLES and TIMESTAMP.
df = resamp.reset_index()
df['TIMESTAMP'] = [ts.strftime('%m-%y') for ts in df.TIMESTAMP]
df.set_index(['AXLES', 'TIMESTAMP'], inplace=True)
df.xs(3, level=0).plot(kind='bar')

When plotting datetime index data, put markers in the plot on specific days (e.g. weekend)

I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
            A weekend
2014-10-13  7      no
2014-10-14  6      no
2014-10-15  7      no
2014-10-16  9      no
2014-10-17  4      no
2014-10-18  6     yes
2014-10-19  4     yes
2014-10-20  7      no
2014-10-21  8      no
2014-10-22  8      no
2014-10-23  1      no
2014-10-24  4      no
2014-10-25  3     yes
2014-10-26  8     yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
In the meantime, I found out that it is as simple as using boolean indexing in pandas, and doing the plot directly with pyplot instead of pandas' own plot wrapper (which is more convenient for me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now the red dots mark all weekend days, which are the rows where df.weekend == 'yes'.
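The same idea also works without the helper column, by building the boolean mask straight from the DatetimeIndex. This is a sketch under a few assumptions: a fixed start date (2014-10-13, a Monday) and a seeded random generator are used so the result is reproducible, and the Agg backend is selected so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

index = pd.date_range('2014-10-13', periods=14, freq='D')  # starts on a Monday
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(1, 10, size=14)}, index=index)

# boolean mask: weekday 5 is Saturday, 6 is Sunday
is_weekend = df.index.weekday >= 5

fig, ax = plt.subplots()
ax.plot(df.index, df['A'])
# red circles only where the mask is True
ax.plot(df.index[is_weekend], df.loc[is_weekend, 'A'], 'ro', label='weekend')
ax.legend()

print(int(is_weekend.sum()))  # 4
```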
