I have time series data of retail beef ad counts, and I want to make a stacked line chart that shows, on a three-week average basis, the average number of ads that grocers posted per store last week. I managed to aggregate the data for plotting and tried to make the line chart I want, but the result is not informative or easy to read. How can I achieve this in matplotlib? Can anyone suggest what I should change in my current attempt?
reproducible data and current attempt
Here is minimal reproducible data that I used in my current attempt:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import timedelta, datetime
url = 'https://gist.githubusercontent.com/adamFlyn/96e68902d8f71ad62a4d3cda135507ad/raw/4761264cbd55c81cf003a4219fea6a24740d7ce9/df.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_grp["percentage"] = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.reset_index(level=[0,1])
for item in df_grp['retail_item'].unique():
    dd = df_grp[df_grp['retail_item'] == item].groupby(['date', 'percentage'])[['number_of_ads']].sum().reset_index(level=[0,1])
    dd['weakly_change'] = dd[['percentage']].rolling(7).mean()
    fig, ax = plt.subplots(figsize=(8, 6), dpi=144)
    sns.lineplot(dd.index, 'weakly_change', data=dd, ax=ax)
    ax.set_xlim(dd.index.min(), dd.index.max())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
    plt.gcf().autofmt_xdate()
    plt.style.use('ggplot')
    plt.xticks(rotation=90)
    plt.show()
Current Result
But I couldn't get the line chart that I expected. I want to reproduce the plot from this site. Is that doable? Any ideas?
desired plot
Here is an example of the desired plot that I want to make from this minimal reproducible data:
I don't know what changes to make to my current attempt to get the desired plot above. Does anyone know a possible way of doing this in matplotlib? Any help would be appreciated. Thanks
Also see How to create a min-max plot by month with fill_between?
See in-line comments for details
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
#################################################################
# setup from question
url = 'https://gist.githubusercontent.com/adamFlyn/96e68902d8f71ad62a4d3cda135507ad/raw/4761264cbd55c81cf003a4219fea6a24740d7ce9/df.csv'
df = pd.read_csv(url, parse_dates=['date'])
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_grp["percentage"] = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.reset_index(level=[0,1])
#################################################################
# create a month map from long to abbreviated calendar names
month_map = dict(zip(calendar.month_name[1:], calendar.month_abbr[1:]))
# update the month column name
df_grp['month'] = df_grp.date.dt.month_name().map(month_map)
# set month as categorical so they are plotted in the correct order
df_grp.month = pd.Categorical(df_grp.month, categories=month_map.values(), ordered=True)
# use groupby to aggregate min mean and max
dfmm = df_grp.groupby(['retail_item', 'month'])['percentage'].agg(['max', 'min', 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
# create a palette map for line colors
cmap = {'min': 'k', 'max': 'k', 'mean': 'b'}
# iterate through each retail item and plot the corresponding data
for g, d in dfmm.groupby('retail_item'):
    plt.figure(figsize=(7, 4))
    sns.lineplot(x='month', y='vals', hue='mm', data=d, palette=cmap)
    # select only min or max data for fill_between
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.month, y1=y1.vals, y2=y2.vals, color='gainsboro')
    # add lines for specific years
    for year in [2016, 2018, 2020]:
        data = df_grp[(df_grp.date.dt.year == year) & (df_grp.retail_item == g)]
        sns.lineplot(x='month', y='percentage', ci=None, data=data, label=year)
    plt.ylim(0, 100)
    plt.margins(0, 0)
    plt.legend(bbox_to_anchor=(1., 1), loc='upper left')
    plt.ylabel('Percentage of Ads')
    plt.title(g)
    plt.show()
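The question also asks for a three-week average, which the monthly min/max/mean plot above does not compute. A minimal sketch of that step, assuming one row per week and item: the frame `demo`, the item names and all numbers are invented for illustration, and only the column names `date`, `retail_item` and `number_of_ads` come from the question.

```python
import pandas as pd

# Hypothetical weekly data; column names follow the question, values are invented
dates = pd.date_range("2020-01-03", periods=6, freq="7D")
demo = pd.DataFrame({
    "date": list(dates) * 2,
    "retail_item": ["Chuck Roast"] * 6 + ["Ribeye"] * 6,
    "number_of_ads": [10, 20, 30, 40, 50, 60, 5, 5, 5, 5, 5, 5],
})

# One column per retail item, one row per week
wide = demo.pivot_table(index="date", columns="retail_item",
                        values="number_of_ads", aggfunc="sum")

# Three-week rolling mean; the first two weeks are NaN because the window is incomplete
smoothed = wide.rolling(window=3).mean()
print(smoothed)

# smoothed.plot() would then draw one line per retail item
```

Pivoting to wide form first keeps each item's rolling window separate, so one item's weeks never leak into another's average.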
Related
I have a dataset containing various user fields, such as dates and like counts. I am trying to plot a histogram that shows the like count by date. How should I do that?
The dataset:
Assuming you want to plot number of public likes by date, you could do something like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('analysis.csv')
# convert text column to date time and keep only the date part
df['created_at'] = pd.to_datetime(df['created_at'])
df['created_at'] = df['created_at'].dt.date
# group by date taking the sum of public_metrics.like_count
df1 = df.groupby(['created_at'])['public_metrics.like_count'].sum().reset_index()
df1 = df1.set_index('created_at')
# plot and show
df1.plot()
plt.show()
And this is the output you will get
Just to add something to the first answer: you could visualize the like counts of one specific month with a bar plot. That may give you a plot that is closer to the histogram you have in mind. For example, I did it for January:
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.dates as mdates
# Read and clean data
df = pd.read_csv('tweets_data.txt')
df['created_at'] = df['created_at'].str.replace(".000Z", "")
df.created_at
# Create a new dataframe with only two columns: data and number of likes
histogram_data = pd.concat([df[['created_at']],df[['public_metrics.like_count']]],axis=1)
January_values = histogram_data[histogram_data['created_at'].astype(str).str.contains('2018-01')] #histogram_data['created_at'].astype(str)
January_values
January_values.shape
dictionary = {}
for date, n_likes in January_values.itertuples(index=False):
    dictionary[date] = n_likes
print(dictionary)
# Create figure and plot space
fig, ax = plt.subplots(figsize=(12, 12))
# Add x-axis and y-axis
ax.bar(dictionary.keys(),
       dictionary.values(),
       color='purple')
# Set title and labels for axes
ax.set_xlabel('Date', fontsize = 20)
ax.set_ylabel('Counts', fontsize = 20)
ax.set_title('Tweets likes counts in January 2018', fontsize = 15, weight = "bold")
# Ensure a major tick for each week using (interval=1)
ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
ax.tick_params(axis='x', which='major', labelsize=15, width=2)
plt.setp( ax.xaxis.get_majorticklabels(), rotation=-45, ha="left", weight="bold")
plt.show()
The output is:
Of course, if you use all of your data (more than 3000 dates), the bars in the plot will be very thin...
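One way around that, sketched here on invented data, is to aggregate to one bar per month before plotting. The column names `created_at` and `public_metrics.like_count` are taken from the answers above; the four rows are made up.

```python
import pandas as pd

# Invented sample; only the column names come from the answers above
df = pd.DataFrame({
    "created_at": ["2018-01-03T10:00:00.000Z", "2018-01-20T12:30:00.000Z",
                   "2018-02-01T08:15:00.000Z", "2018-02-14T19:45:00.000Z"],
    "public_metrics.like_count": [3, 7, 1, 4],
})

# Parse the ISO timestamps, drop the timezone, and bucket by calendar month
stamps = pd.to_datetime(df["created_at"], utc=True).dt.tz_localize(None)
monthly = df.groupby(stamps.dt.to_period("M"))["public_metrics.like_count"].sum()
print(monthly)

# monthly.plot(kind="bar") would then draw one bar per month
```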
Data I'm working with: https://drive.google.com/file/d/1xb7icmocz-SD2Rkq4ykTZowxW0uFFhBl/view?usp=sharing
Hey everyone,
I am a bit stuck with editing a plot.
Basically, I would like my x value to display the months in the year, but it doesn't seem to work because of the data type (?). Do you have any idea how I could get my plot to have months in the x axis?
If you need more context about the data, please let me know!!!
Thank you!
Here's my code for the plot and the initial data modifications:
import matplotlib.pyplot as plt
import mplleaflet
import pandas as pd
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import numpy as np
df = pd.read_csv("data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv")
df['degrees']=df['Data_Value']/10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date']<'2015-01-01']
df3 = df[df['Date']>='2015-01-01']
max_temp = df2.groupby([(df2.Date.dt.month),(df2.Date.dt.day)])['degrees'].max()
min_temp = df2.groupby([(df2.Date.dt.month),(df2.Date.dt.day)])['degrees'].min()
max_temp2 = df3.groupby([(df3.Date.dt.month),(df3.Date.dt.day)])['degrees'].max()
min_temp2 = df3.groupby([(df3.Date.dt.month),(df3.Date.dt.day)])['degrees'].min()
max_temp.plot(x ='Date', y='degrees', kind = 'line')
min_temp.plot(x ='Date',y='degrees', kind= 'line')
plt.fill_between(range(len(min_temp)),min_temp, max_temp, color='C0', alpha=0.2)
ax = plt.gca()
ax.set(xlabel="Date",
       ylabel="Temperature",
       title="Extreme Weather in 2015")
plt.legend()
plt.tight_layout()
x = plt.gca().xaxis
for item in x.get_ticklabels():
    item.set_rotation(45)
plt.show()
Plot I'm getting:
Option 1 (Most Similar Approach)
Change the index based on month abbreviations using Index.map and calendar
This is just for df2:
import calendar
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
max_temp = df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees'].max()
min_temp = df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees'].min()
# Update the index to be the desired display format for x-axis
max_temp.index = max_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
min_temp.index = min_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
max_temp.plot(x='Date', y='degrees', kind='line')
min_temp.plot(x='Date', y='degrees', kind='line')
plt.fill_between(range(len(min_temp)), min_temp, max_temp,
                 color='C0', alpha=0.2)
ax = plt.gca()
ax.set(xlabel="Date", ylabel="Temperature", title="Extreme Weather 2005-2014")
x = plt.gca().xaxis
for item in x.get_ticklabels():
    item.set_rotation(45)
plt.margins(x=0)
plt.legend()
plt.tight_layout()
plt.show()
As an aside: the title "Extreme Weather in 2015" is incorrect because this data includes all years before 2015. This is "Extreme Weather 2005-2014"
The year range can be checked with min and max as well:
print(df2.Date.dt.year.min(), '-', df2.Date.dt.year.max())
# 2005 - 2014
The title could be programmatically generated with:
title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
Option 2 (Simplifying groupby step)
Simplify the code using groupby aggregate to create a single DataFrame then convert the index in the same way as above:
import calendar
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
# Get Max and Min Degrees in Single Groupby
df2_temp = (
    df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees']
       .agg(['max', 'min'])
)
# Convert Index to whatever display format is desired:
df2_temp.index = df2_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
# Plot
ax = df2_temp.plot(
    kind='line', rot=45,
    xlabel="Date", ylabel="Temperature",
    title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
)
# Fill between
plt.fill_between(range(len(df2_temp)), df2_temp['min'], df2_temp['max'],
                 color='C0', alpha=0.2)
plt.margins(x=0)
plt.tight_layout()
plt.show()
Option 3 (Best overall functionality)
Convert the index to a datetime using pd.to_datetime. Choose any leap year to map all the data onto a single year (it must be a leap year so Feb-29 does not raise an error). Then call set_major_formatter with the format string %b to show the month abbreviation:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
# Get Max and Min Degrees in Single Groupby
df2_temp = (
    df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees']
       .agg(['max', 'min'])
)
# Convert to DateTime of Same Year
# (Must be a leap year so Feb-29 doesn't raise an error)
df2_temp.index = pd.to_datetime(
    '2000-' + df2_temp.index.map(lambda s: '-'.join(map(str, s)))
)
# Plot
ax = df2_temp.plot(
    kind='line', rot=45,
    xlabel="Date", ylabel="Temperature",
    title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
)
# Fill between
plt.fill_between(df2_temp.index, df2_temp['min'], df2_temp['max'],
                 color='C0', alpha=0.2)
# Set xaxis formatter to month abbr with the %b format string
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.tight_layout()
plt.show()
The benefit of this approach is that the index is a datetime and therefore will format better than the string representations of options 1 and 2.
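A small check of the leap-year point from Option 3, as a sketch: 29 February only parses when the placeholder year is a leap year. `errors="coerce"` is used here so that the failing case returns `NaT` instead of raising; the two year values are just examples.

```python
import pandas as pd

# 2000 is a leap year, so Feb 29 is a valid date
ok = pd.to_datetime("2000-02-29")

# 2001 is not; with errors="coerce" the invalid date becomes NaT
bad = pd.to_datetime("2001-02-29", errors="coerce")

print(ok, bad)
```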
I am trying to make a line chart that visualizes a product's export and sales activity from weekly data. Basically, I want to see how the export numbers of different commodities change week by week. I was able to aggregate the data for a line chart of export trends of different commodities for the top-5 countries, but the resulting plot is not what I expected. Can anyone point out how to make this right? Is there a better way to make a product export trend line chart using matplotlib or seaborn in Python?
my current attempt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['weekEndingDate','country', 'commodity'])['weeklyExports'].sum().unstack().reset_index()
df_grp = df_grp .fillna(0)
for c in df_grp[['FCF_Beef', 'FCF_Pork']]:
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    df_grp_new = df_grp .groupby(['country', 'weekEndingDate'])[c].sum().unstack().fillna(0)
    df_grp_new = df_grp_new .T
    df_grp_new.drop([col for col, val in df_grp_new .sum().iteritems() if val < 1000], axis=1, inplace=True)
    for col in df_grp_new.columns:
        sns.lineplot(x='WeekEndingDate', y='weekly export', ci=None, data=df_grp_new, label=col)
    ax.relim()
    ax.autoscale_view()
    ax.xaxis.label.set_visible(False)
    plt.legend(bbox_to_anchor=(1., 1), loc='upper left')
    plt.ylabel('weekly export')
    plt.margins(x=0)
    plt.title(c)
    plt.tight_layout()
    plt.grid(True)
    plt.show()
    plt.close()
But this attempt didn't produce my expected output. Essentially, I want to see the weekly exports of different commodities, such as beef and pork, for different countries as a weekly time series. Can anyone suggest what went wrong in my code? How can I get a proper line chart from the above data?
desired output
here is the example desired plots (just style) that I want to make in my attempt:
Plenty of ways to do it. If you make your time column into datetime seaborn will handle formatting the axis for you.
You could use a facetgrid to split by commodity, or if you want finer control over the individual charts plot them using lineplot, filtering the df by the commodity prior.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df['weekEndingDate'] = pd.to_datetime(df['weekEndingDate'])
# sns.set(rc={'figure.figsize':(11.7,8.27)})
g = sns.FacetGrid(df, col='commodity', height=8, sharex=False, sharey=False, legend_out=True)
g.map_dataframe(sns.lineplot, x='weekEndingDate',y='weeklyExports', hue='country', ci=None)
g.add_legend()
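The second suggestion (filter by commodity first, then draw each chart yourself) might look like the sketch below. It uses plain matplotlib rather than seaborn, and the numbers and country names are invented; only the column names `weekEndingDate`, `country`, `commodity` and `weeklyExports` and the two commodity labels come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample with the question's column names
weeks = pd.date_range("2020-01-02", periods=4, freq="7D")
rows = []
for commodity, base in [("FCF_Beef", 100), ("FCF_Pork", 50)]:
    for country, scale in [("Japan", 2), ("Mexico", 1)]:
        for i, week in enumerate(weeks):
            rows.append({"weekEndingDate": week, "country": country,
                         "commodity": commodity,
                         "weeklyExports": base * scale + 10 * i})
df = pd.DataFrame(rows)

figures = {}
for commodity, sub in df.groupby("commodity"):
    # One column per country, indexed by week
    wide = sub.pivot_table(index="weekEndingDate", columns="country",
                           values="weeklyExports", aggfunc="sum")
    fig, ax = plt.subplots(figsize=(7, 4))
    for country in wide.columns:
        ax.plot(wide.index, wide[country], label=country)
    ax.set_title(commodity)
    ax.set_ylabel("weekly export")
    ax.legend(bbox_to_anchor=(1.0, 1.0), loc="upper left")
    fig.autofmt_xdate()
    fig.tight_layout()
    figures[commodity] = fig
```

With an interactive backend, `plt.show()` displays the figures; otherwise each one can be written out with `fig.savefig(...)`.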
Thanks for reading.
I have a plot and would like to make the latest year in my dataset stand out. My data is just one long time series, so I want to plot YoY comparisons, so I pivot it, then plot it.
The first block of code runs and gives me roughly what I am after (without the latest year standing out), then in the second block of code I try to make my latest stand out (which technically works) but the colour is different, doesn't match the legend and can even be the same colour as another year.
I can see the old series in the background. I think I am creating another plot and putting this on top, but how can I select the original line for the latest year (in this case 2018) and just make that stand out?
Or is there a better way to do this whole process?
Any tips on code, formatting or anything would be much appreciated, I am very new to this!
Thanks so much!
13sen1
FIRST BLOCK
# import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create fake time series dataframe
index = pd.date_range(start='01-Jan-2012', end='01-01-2019', freq='M')
data = np.random.randn(len(index))
df = pd.DataFrame(data, index, columns=['Data'])
# pivot to get by month in rows, then year in columns
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='Data')
# plot
df_pivot.plot(title='Data by Year', figsize=(6,4))
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
firstblockresult
SECOND BLOCK
# import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create fake time series dataframe
index = pd.date_range(start='01-Jan-2012', end='01-01-2019', freq='M')
data = np.random.randn(len(index))
df = pd.DataFrame(data, index, columns=['Data'])
# pivot to get by month in rows, then year in columns
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='Data')
# plot
df_pivot.plot(title='Data by Year', figsize=(6,4))
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
# plot the thicker last line
# **************** ERROR HERE *************************
plt.plot(df_pivot.iloc[:, -1:], lw=4, ls='--')
# **************** ERROR HERE *************************
plt.show()
secondblockresult
You can make the line of the last year thicker. Because columns are sorted, it will be the last line in the axes (index -1).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create fake time series dataframe
index = pd.date_range(start='01-Jan-2012', end='01-01-2019', freq='M')
data = np.random.randn(len(index))
df = pd.DataFrame(data, index, columns=['Data'])
# pivot to get by month in rows, then year in columns
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values='Data')
# plot
ax = df_pivot.plot(title='Data by Year', figsize=(6,4))
ax.get_lines()[-1].set_linewidth(5)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.figure.tight_layout()
plt.show()
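If relying on line order feels fragile, an alternative sketch is to draw each year yourself and set the width at plot time, so the legend always matches the line. This uses plain `ax.plot` instead of `DataFrame.plot`; the fixed seed and the `period_range` construction of the monthly index are assumptions added to make the example reproducible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fake monthly series, 2012 through 2018
rng = np.random.default_rng(0)
index = pd.period_range("2012-01", "2018-12", freq="M").to_timestamp()
df = pd.DataFrame({"Data": rng.standard_normal(len(index))}, index=index)

# Months in rows, years in columns
df_pivot = pd.pivot_table(df, index=df.index.month, columns=df.index.year, values="Data")

latest = df_pivot.columns.max()
fig, ax = plt.subplots(figsize=(6, 4))
for year in df_pivot.columns:
    # Thicker line for the latest year only
    ax.plot(df_pivot.index, df_pivot[year], label=str(year),
            linewidth=4 if year == latest else 1)
ax.set_title("Data by Year")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
fig.tight_layout()
```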
I have a pandas dataframe with 5 years of daily time series data. I want to make a monthly plot from the whole dataset that shows the variation (std or something else) within the monthly data. I tried to create a figure similar to this one but did not find a way to do it:
For example, I have pseudo daily precipitation data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'pre':ppt},index=dates)
Manually I can do it like:
one = df['pre']['1999-12-01':'2000-11-29'].values
two = df['pre']['2000-12-01':'2001-11-30'].values
three = df['pre']['2001-12-01':'2002-11-30'].values
four = df['pre']['2002-12-01':'2003-11-30'].values
five = df['pre']['2003-12-01':'2004-11-29'].values
df = pd.DataFrame({'2000':one,'2001':two,'2002':three,'2003':four,'2004':five})
std = df.std(axis=1)
lw = df.mean(axis=1)-std
up = df.mean(axis=1)+std
plt.fill_between(np.arange(365), up, lw, alpha=.4)
I am looking for a more pythonic way to do this instead of doing it manually!
Any help will be highly appreciated.
If I'm understanding you correctly, you'd like to plot your daily observations against a monthly periodic mean +/- 1 standard deviation. And that's what you get in my screenshot below. Never mind the lackluster design and color choice. We'll get to that if this is something you can use. And please notice that I've replaced your ppt = np.random.rand(1900) with ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum() just to make the data look a bit more like your screenshot.
Here I've aggregated the daily data by month, and retrieved mean and standard deviation for each month. Then I've merged that data with the original dataframe so that you're able to plot both the source and the grouped data like this:
# imports
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
import numpy as np
# Data that matches your setup, but with a random
# seed to make it reproducible
np.random.seed(42)
date = pd.to_datetime("1st of Dec, 1999")
dates = date+pd.to_timedelta(np.arange(1900), 'D')
#ppt = np.random.rand(1900)
ppt = np.random.normal(loc=0.0, scale=1.0, size=1900).cumsum()
df = pd.DataFrame({'ppt':ppt},index=dates)
# A subset
df = df.tail(200)
# Add a yearmonth column
df['YearMonth'] = df.index.map(lambda x: 100*x.year + x.month)
# Create aggregated dataframe
df2 = df.groupby('YearMonth').agg(['mean', 'std']).reset_index()
df2.columns = ['YearMonth', 'mean', 'std']
# Merge original data and aggregated data
df3 = pd.merge(df,df2,how='left',on=['YearMonth'])
df3 = df3.set_index(df.index)
df3 = df3[['ppt', 'mean', 'std']]
# Function to make your plot
def monthplot():
    # Define upper and lower bounds for shaded variation
    lower_bound = df3['mean'] + df3['std']*-1
    upper_bound = df3['mean'] + df3['std']
    # Create the figure once (a second plt.subplots call would leave an empty extra figure)
    fig, ax = plt.subplots(1)
    ax.set_facecolor('white')
    # Source data and mean
    ax.plot(df3.index,df3['mean'], lw=0.5, color = 'red')
    ax.plot(df3.index, df3['ppt'], lw=0.1, color = 'blue')
    # Variation and shaded area
    ax.fill_between(df3.index, lower_bound, upper_bound, facecolor='grey', alpha=0.5)
    fig = ax.get_figure()
    # Assign months to X axis
    locator = mdates.MonthLocator() # every month
    # Specify the format - %b gives us Jan, Feb...
    fmt = mdates.DateFormatter('%b')
    X = plt.gca().xaxis
    X.set_major_locator(locator)
    X.set_major_formatter(fmt)
    fig.show()
monthplot()
Check out this post for more on axis formatting and this post on how to add a YearMonth column.
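As a possible shortcut for the aggregate-then-merge step, `groupby(...).transform` returns the monthly statistic already aligned to the daily index, so no merge is needed. A sketch on the same kind of synthetic data; grouping by `to_period("M")` is an assumption of this sketch, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Synthetic daily series, seeded for reproducibility
np.random.seed(42)
dates = pd.to_datetime("1999-12-01") + pd.to_timedelta(np.arange(200), "D")
df = pd.DataFrame({"ppt": np.random.normal(size=200).cumsum()}, index=dates)

# Group daily rows by calendar month; transform keeps the original daily index
by_month = df.groupby(df.index.to_period("M"))["ppt"]
df["mean"] = by_month.transform("mean")
df["std"] = by_month.transform("std")

# Bounds for the shaded band, one value per day
lower_bound = df["mean"] - df["std"]
upper_bound = df["mean"] + df["std"]
print(df.head())
```

The resulting `df` has the same `ppt`, `mean` and `std` columns that `df3` has above, so it could be passed to the same plotting code.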
Your example has a few mistakes, but I don't think they matter here.
Do you want all years to be on the same graphic (like in your example)? If you do, this may help you:
df['month'] = df.index.strftime("%m-%d")
df['year'] = df.index.year
df.set_index(['month']).drop(columns=['year']).plot()
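For the across-years band the question asks about, one possible sketch groups by calendar month and day and takes the mean and standard deviation over the years, which avoids slicing each year by hand. The data is the question's synthetic series with a fixed seed added. Using (month, day) as the key is an assumption of this sketch, and it orders the rows January to December rather than December to November as in the manual version.

```python
import numpy as np
import pandas as pd

# The question's synthetic series, seeded for reproducibility
np.random.seed(0)
dates = pd.to_datetime("1999-12-01") + pd.to_timedelta(np.arange(1900), "D")
df = pd.DataFrame({"pre": np.random.normal(size=1900).cumsum()}, index=dates)

# Mean and std of each calendar day across all years in the data
stats = (
    df.groupby([df.index.month, df.index.day])["pre"]
      .agg(["mean", "std"])
      .rename_axis(["month", "day"])
)
lw = stats["mean"] - stats["std"]
up = stats["mean"] + stats["std"]
print(stats.head())

# plt.fill_between(np.arange(len(stats)), up, lw, alpha=.4) would draw the band
```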