I'm having a difficult time trying to create a bar plot with a DataFrame grouped by year and month. With the following code I'm trying to plot the data on the figure I created, but the plot ends up on a second figure instead. I also tried to move the legend to the right and change its entries to the corresponding month names.
I'm starting to get a feel for the DataFrames returned by groupby, but I'm not getting the result I expected, so I'm asking here.
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
df = pd.read_csv('fcc-forum-pageviews.csv', index_col='date')
line_plot = df.value[(df.value > df.value.quantile(0.025)) & (df.value < df.value.quantile(0.975))]
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = line_plot.groupby([line_plot.index.year, line_plot.index.month]).mean().unstack()
bar_plot.plot(kind='bar')
ax.set_xlabel('Years')
ax.set_ylabel('Average Page Views')
plt.show()
This is the format of the data that I am analyzing.
date,value
2016-05-09,1201
2016-05-10,2329
2016-05-11,1716
2016-05-12,10539
2016-05-13,6933
Add a sorted categorical 'month' column with pd.Categorical
Transform the dataframe to a wide format with pd.pivot_table where aggfunc='mean' is the default.
Wide format is typically best for plotting grouped bars.
pandas.DataFrame.plot returns matplotlib.axes.Axes, so there's no need to use fig, ax = plt.subplots(figsize=(10,10)).
The pandas .dt accessor is used to extract various components of 'date', which must be a datetime dtype
If 'date' is not a datetime dtype, then transform it with df.date = pd.to_datetime(df.date).
Tested with python 3.8.11, pandas 1.3.1, and matplotlib 3.4.2
Imports and Test Data
import pandas as pd
from calendar import month_name # conveniently supplies a list of sorted month names or you can type them out manually
import numpy as np # for test data
# test data and dataframe
np.random.seed(365)
rows = 365 * 3
data = {'date': pd.bdate_range('2021-01-01', freq='D', periods=rows), 'value': np.random.randint(100, 1001, size=(rows))}
df = pd.DataFrame(data)
# select data within specified quantiles
df = df[df.value.gt(df.value.quantile(0.025)) & df.value.lt(df.value.quantile(0.975))]
# display(df.head())
date value
0 2021-01-01 694
1 2021-01-02 792
2 2021-01-03 901
3 2021-01-04 959
4 2021-01-05 528
Transform and Plot
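If 'date' has been set to the index, as stated in the comments, use the following in place of the df['months'] line in the block below (months must be defined first, as it is in that block):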
df['months'] = pd.Categorical(df.index.strftime('%B'), categories=months, ordered=True)
# create the month column
months = month_name[1:]
df['months'] = pd.Categorical(df.date.dt.strftime('%B'), categories=months, ordered=True)
# pivot the dataframe into the correct shape
dfp = pd.pivot_table(data=df, index=df.date.dt.year, columns='months', values='value')
# display(dfp.head())
months January February March April May June July August September October November December
date
2021 637.9 595.7 569.8 508.3 589.4 557.7 508.2 545.7 560.3 526.2 577.1 546.8
2022 567.9 521.5 625.5 469.8 582.6 627.3 630.4 474.0 544.1 609.6 526.6 572.1
2023 521.1 548.5 484.0 528.2 473.3 547.7 525.3 522.4 424.7 561.3 513.9 602.3
# plot
ax = dfp.plot(kind='bar', figsize=(12, 4), ylabel='Mean Page Views', xlabel='Year', rot=0)
_ = ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
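For comparison with the question's groupby/unstack approach, here is a minimal sketch suggesting that both routes lead to the same wide table. The data and the helper columns 'year' and 'month' are made up for illustration; numeric months are used instead of the categorical month names to keep the check simple.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning two full years
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
df = pd.DataFrame({"date": dates, "value": np.arange(len(dates), dtype=float)})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Route 1: pivot_table (aggfunc defaults to the mean)
wide_pivot = pd.pivot_table(df, index="year", columns="month", values="value")

# Route 2: groupby + unstack, as in the question
wide_groupby = df.groupby(["year", "month"])["value"].mean().unstack()

# Both are years x months tables holding the same means
same = np.allclose(wide_pivot.to_numpy(), wide_groupby.to_numpy())
print(wide_pivot.shape, same)
```

Either table can be passed to .plot(kind='bar'); each row becomes a group of bars and each column a legend entry.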
Just pass the ax you defined to pandas:
bar_plot.plot(ax = ax, kind='bar')
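To see why the second image appears, here is a small sketch with made-up data (df_demo is a hypothetical frame). It counts open figures with and without the ax argument; the Agg backend is only selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

plt.close("all")  # start from a clean slate
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

fig, ax = plt.subplots()
returned = df_demo.plot(ax=ax, kind="bar")  # draws on the existing Axes
n_figs_with_ax = len(plt.get_fignums())

other = df_demo.plot(kind="bar")  # no ax given: pandas opens a new figure
n_figs_without_ax = len(plt.get_fignums())

print(returned is ax, n_figs_with_ax, n_figs_without_ax)
```

With ax passed, the returned Axes is the one created by plt.subplots, so later calls such as ax.set_xlabel act on the plotted figure.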
If you also want to replace months numbers with names, you have to get those labels, replace numbers with names and re-define the legend by passing to it the new labels:
handles, labels = ax.get_legend_handles_labels()
new_labels = [datetime.date(1900, int(monthinteger), 1).strftime('%B') for monthinteger in labels]
ax.legend(handles = handles, labels = new_labels, loc = 'upper left', bbox_to_anchor = (1.02, 1))
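As an alternative sketch for the label conversion, calendar.month_name from the standard library can be indexed directly by month number. The labels list below is a stand-in for the legend labels '1' to '12' and is an assumption about what get_legend_handles_labels returns for this data.

```python
import calendar
import datetime

# Stand-in for the legend labels returned by ax.get_legend_handles_labels()
labels = [str(m) for m in range(1, 13)]

# Conversion used in this answer
via_datetime = [datetime.date(1900, int(m), 1).strftime("%B") for m in labels]

# Equivalent lookup in calendar.month_name (index 0 is an empty string)
via_calendar = [calendar.month_name[int(m)] for m in labels]

print(via_calendar[:3])
```

Both lists follow the active locale, so they should agree with each other on any given system.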
Complete Code
import pandas as pd
from matplotlib import pyplot as plt
import datetime
df = pd.read_csv('fcc-forum-pageviews.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
line_plot = df.value[(df.value > df.value.quantile(0.025)) & (df.value < df.value.quantile(0.975))]
fig, ax = plt.subplots(figsize=(10,10))
bar_plot = line_plot.groupby([line_plot.index.year, line_plot.index.month]).mean().unstack()
bar_plot.plot(ax = ax, kind='bar')
ax.set_xlabel('Years')
ax.set_ylabel('Average Page Views')
handles, labels = ax.get_legend_handles_labels()
new_labels = [datetime.date(1900, int(monthinteger), 1).strftime('%B') for monthinteger in labels]
ax.legend(handles = handles, labels = new_labels, loc = 'upper left', bbox_to_anchor = (1.02, 1))
plt.show()
(plot generated with fake data)
Related
I'm attempting to plot a pandas stacked bar plot with the x axis showing Months on the major ticks, or years on Jan 1, ideally with small ticks identifying the weeks but with no label.
I have a dataset with a datetime index that was then grouped by week and then I plot that dataset. If I don't attempt to control the settings the dates show up but are vertical and don't fit. So I used the set formatter to fix that but then the axes changed to 1970 as if following an index number instead of date. If I replace the pandas plotting with a regular bar chart, the "ConciseDateFormatter" works as desired/expected. But I wanted to use stacked with pandas as creating a regular stacked bar chart is a pain. I don't understand why I can't control pandas axes like I can a regular plot.
One thing I notice is that the index is shown as an object. If I convert it to to_datetime() it then adds 00:00 for times that I don't want on the axes or my data.
My data is a simple set of weekly random data:
date A B C D
3/20/2022 1.540765154 0.504616419 1.543679189 2.952934623
3/27/2022 1.781135128 4.594966635 4.799026389 3.499803401
4/3/2022 0.254059207 0.69835265 0.323039575 1.628138491
4/10/2022 3.112760301 0.287056897 4.372938373 0.130817579
4/17/2022 0.497273044 0.913246096 1.296612207 1.250610278
4/24/2022 1.370087689 3.124985109 4.322253295 4.49571603
5/1/2022 3.952629538 3.976896924 1.679311114 1.265443147
5/8/2022 3.470328161 1.266161308 3.990502436 1.364929959
5/15/2022 2.296588269 4.639761391 0.04685036 1.438471692
5/22/2022 3.443458637 2.66592719 0.968656871 2.349325343
5/29/2022 1.820278464 4.794211675 2.435710815 2.156110694
6/5/2022 4.328825266 0.049132356 1.842839099 3.665701299
6/12/2022 0.184631564 0.412976815 4.787477069 4.80052839
6/19/2022 4.846734385 3.471474741 1.808871854 2.440013553
6/26/2022 1.612870444 0.70191857 3.55713114 1.438699834
7/3/2022 2.896859156 4.025996887 0.209608767 4.174881655
Code:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
maxval = 200
values = ['A','B','C','D']
cum = [v + '_CUM' for v in values]
df = pd.read_csv('test_data.csv', index_col='date', parse_dates=True,
infer_datetime_format=True)
#df.index = pd.to_datetime(df.index.date).strftime("%b %d")
df = df.join(df.cumsum(), rsuffix="_CUM")
df = df.join(df[cum]/maxval * 100, rsuffix="_LIFE")
fig, axs = plt.subplots(nrows=2, ncols=1, sharex=False, squeeze=False,
facecolor='white')
axs = axs.flatten()
ax = axs[0]
df[values].plot.bar(ax=ax, grid=True, stacked=True, legend=True)
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter
(ax.xaxis.get_major_locator()))
# ax.xaxis.set_tick_params(rotation = 0)
plt.show(block=False)
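One likely explanation for the 1970 labels, sketched with made-up weekly data: a pandas bar plot places the bars at integer positions 0, 1, 2, ... rather than at date values, so a date locator or formatter reads those integers as days after the 1970 epoch. The sketch below checks the tick positions and then sets month labels by hand; df_week and the labeling rule are assumptions for illustration, not a confirmed fix for the original data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical weekly data shaped like the sample (16 Sundays from 2022-03-20)
idx = pd.date_range("2022-03-20", periods=16, freq="7D")
rng = np.random.default_rng(0)
df_week = pd.DataFrame(rng.uniform(0, 5, size=(16, 4)), index=idx, columns=list("ABCD"))

fig, ax = plt.subplots()
df_week.plot.bar(ax=ax, stacked=True)

# The bars sit at integer positions, not at date values
positions = ax.get_xticks()

# Label the first bar and the first bar of each new month; leave the rest blank
month_changed = [True] + list(idx.month[1:] != idx.month[:-1])
labels = [d.strftime("%b") if changed else "" for d, changed in zip(idx, month_changed)]
ax.set_xticklabels(labels, rotation=0)

print(list(positions))
```

Every bar keeps its tick mark, which gives unlabeled weekly ticks with a month name only where the month changes.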
I would like to improve my bitcoin dataset but I found that the date is not sorted in the right way and want to show only the month and year. How can I do it?
data = Bitcoin_Historical['Price']
Date1 = Bitcoin_Historical['Date']
train1 = Bitcoin_Historical[['Date','Price']]
#Setting the Date as Index
train2 = train1.set_index('Date')
train2.sort_index(inplace=True)
cols = ['Price']
train2 = train2[cols].apply(lambda x: pd.to_numeric(x.astype(str)
.str.replace(',',''), errors='coerce'))
print (type(train2))
print (train2.head())
plt.figure(figsize=(15, 5))
plt.plot(train2)
plt.xlabel('Date', fontsize=12)
plt.xlim(0,20)
plt.ylabel('Price', fontsize=12)
plt.title("Closing price distribution of bitcoin", fontsize=15)
plt.gcf().autofmt_xdate()
plt.show()
The result shows picture below:
It's not ordered and shows all dates. I would like to order by month+year and show only the month name+year. How can that be done?
Example of Data:
Thank you
I've made the following edits to your code:
converted the Date column to a datetime type
cleaned up the Price column and converted it to float
removed the line plt.xlim(0,20), which was causing the output to display 1970
used an alternative way to plot, so that the x-axis can be formatted with monthly tick marks
Please try the code below:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
Bitcoin_Historical = pd.read_csv('data.csv')
# work on an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
train1 = Bitcoin_Historical[['Date','Price']].copy()
train1['Date'] = pd.to_datetime(train1['Date'], infer_datetime_format=True, errors='coerce')
train1['Price'] = train1['Price'].str.replace(',','').str.replace(' ','').astype(float)
train2 = train1.set_index('Date') #Setting the Date as Index
train2.sort_index(inplace=True)
print (type(train2))
print (train2.head())
ax = train2.plot(figsize=(15, 5))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
plt.xlabel('Date', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.title("Closing price distribution of bitcoin", fontsize=15)
plt.show()
Output
Try to cast your "Date" column into datetime, check if it does the trick:
train1.Date = pd.to_datetime(train1.Date)
train2 = train1.set_index('Date')
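A minimal sketch of why the cast matters, using three made-up rows. The date format and the prices are assumptions, since the original data is not shown. Sorted as text, the dates come out alphabetically; sorted as datetimes, they come out chronologically and can be formatted as month plus year.

```python
import pandas as pd

# Hypothetical rows; the real file's date format may differ
raw = pd.DataFrame({
    "Date": ["Feb 01, 2021", "Jan 15, 2021", "Dec 31, 2020"],
    "Price": ["33,500.0", "36,800.5", "28,900.1"],
})

# As strings, the index sorts alphabetically: Dec, Feb, Jan
as_text = raw.set_index("Date").sort_index()

# Parse the dates and strip thousands separators before sorting
clean = raw.copy()
clean["Date"] = pd.to_datetime(clean["Date"], format="%b %d, %Y")
clean["Price"] = clean["Price"].str.replace(",", "", regex=False).astype(float)
as_dates = clean.set_index("Date").sort_index()

# Month name plus year, in chronological order
month_year = as_dates.index.strftime("%b %Y").tolist()
print(month_year)
```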
Data I'm working with: https://drive.google.com/file/d/1xb7icmocz-SD2Rkq4ykTZowxW0uFFhBl/view?usp=sharing
Hey everyone,
I am a bit stuck with editing a plot.
Basically, I would like my x value to display the months in the year, but it doesn't seem to work because of the data type (?). Do you have any idea how I could get my plot to have months in the x axis?
If you need more context about the data, please let me know!!!
Thank you!
Here's my code for the plot and the initial data modifications:
import matplotlib.pyplot as plt
import mplleaflet
import pandas as pd
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import numpy as np
df = pd.read_csv("data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv")
df['degrees']=df['Data_Value']/10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date']<'2015-01-01']
df3 = df[df['Date']>='2015-01-01']
max_temp = df2.groupby([(df2.Date.dt.month),(df2.Date.dt.day)])['degrees'].max()
min_temp = df2.groupby([(df2.Date.dt.month),(df2.Date.dt.day)])['degrees'].min()
max_temp2 = df3.groupby([(df3.Date.dt.month),(df3.Date.dt.day)])['degrees'].max()
min_temp2 = df3.groupby([(df3.Date.dt.month),(df3.Date.dt.day)])['degrees'].min()
max_temp.plot(x ='Date', y='degrees', kind = 'line')
min_temp.plot(x ='Date',y='degrees', kind= 'line')
plt.fill_between(range(len(min_temp)),min_temp, max_temp, color='C0', alpha=0.2)
ax = plt.gca()
ax.set(xlabel="Date",
ylabel="Temperature",
title="Extreme Weather in 2015")
plt.legend()
plt.tight_layout()
x = plt.gca().xaxis
for item in x.get_ticklabels():
item.set_rotation(45)
plt.show()
Plot I'm getting:
Option 1 (Most Similar Approach)
Change the index based on month abbreviations using Index.map and calendar
This is just for df2:
import calendar
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
max_temp = df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees'].max()
min_temp = df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees'].min()
# Update the index to be the desired display format for x-axis
max_temp.index = max_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
min_temp.index = min_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
max_temp.plot(x='Date', y='degrees', kind='line')
min_temp.plot(x='Date', y='degrees', kind='line')
plt.fill_between(range(len(min_temp)), min_temp, max_temp,
color='C0', alpha=0.2)
ax = plt.gca()
ax.set(xlabel="Date", ylabel="Temperature", title="Extreme Weather 2005-2014")
x = plt.gca().xaxis
for item in x.get_ticklabels():
item.set_rotation(45)
plt.margins(x=0)
plt.legend()
plt.tight_layout()
plt.show()
As an aside: the title "Extreme Weather in 2015" is incorrect because this data includes all years before 2015. This is "Extreme Weather 2005-2014"
The year range can be checked with min and max as well:
print(df2.Date.dt.year.min(), '-', df2.Date.dt.year.max())
# 2005 - 2014
The title could be programmatically generated with:
title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
Option 2 (Simplifying groupby step)
Simplify the code using groupby aggregate to create a single DataFrame then convert the index in the same way as above:
import calendar
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
# Get Max and Min Degrees in Single Groupby
df2_temp = (
df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees']
.agg(['max', 'min'])
)
# Convert Index to whatever display format is desired:
df2_temp.index = df2_temp.index.map(lambda x: f'{calendar.month_abbr[x[0]]}')
# Plot
ax = df2_temp.plot(
kind='line', rot=45,
xlabel="Date", ylabel="Temperature",
title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
)
# Fill between
plt.fill_between(range(len(df2_temp)), df2_temp['min'], df2_temp['max'],
color='C0', alpha=0.2)
plt.margins(x=0)
plt.tight_layout()
plt.show()
Option 3 (Best overall functionality)
Convert the index to a datetime using pd.to_datetime. Choose any leap year so that all dates share a single year (it must be a leap year so Feb-29 does not raise an error). Then call set_major_formatter with the format string %b to show the month abbreviation:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("...")
df['degrees'] = df['Data_Value'] / 10
df['Date'] = pd.to_datetime(df['Date'])
df2 = df[df['Date'] < '2015-01-01']
# Get Max and Min Degrees in Single Groupby
df2_temp = (
df2.groupby([df2.Date.dt.month, df2.Date.dt.day])['degrees']
.agg(['max', 'min'])
)
# Convert to DateTime of Same Year
# (Must be a leap year so Feb-29 doesn't raise an error)
df2_temp.index = pd.to_datetime(
'2000-' + df2_temp.index.map(lambda s: '-'.join(map(str, s)))
)
# Plot
ax = df2_temp.plot(
kind='line', rot=45,
xlabel="Date", ylabel="Temperature",
title=f"Extreme Weather {df2.Date.dt.year.min()}-{df2.Date.dt.year.max()}"
)
# Fill between
plt.fill_between(df2_temp.index, df2_temp['min'], df2_temp['max'],
color='C0', alpha=0.2)
# Set xaxis formatter to month abbr with the %b format string
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.tight_layout()
plt.show()
The benefit of this approach is that the index is a datetime and therefore will format better than the string representations of options 1 and 2.
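The leap-year requirement in Option 3 can be checked with a small sketch. The three (month, day) pairs are made up to mimic the grouped index, and errors='coerce' is used here only so that the invalid date shows up as NaT instead of raising.

```python
import pandas as pd

# (month, day) pairs like the grouped index, including Feb 29
month_day = pd.MultiIndex.from_tuples([(2, 28), (2, 29), (3, 1)])
md_text = month_day.map(lambda md: f"{md[0]:02d}-{md[1]:02d}")

# 2000 is a leap year, so every pair is a valid date
in_leap_year = pd.to_datetime("2000-" + md_text, format="%Y-%m-%d", errors="coerce")

# 2001 is not, so Feb 29 cannot be parsed and becomes NaT
in_common_year = pd.to_datetime("2001-" + md_text, format="%Y-%m-%d", errors="coerce")

print(in_leap_year.isna().tolist(), in_common_year.isna().tolist())
```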
This is the code for showing the 'Close' prices for Amazon, and visualizing the data using matplotlib.pyplot in Python.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('amzn_close.csv')
df = df.set_index(pd.DatetimeIndex(df['Date'].values))
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close')
plt.title('Close Price')
plt.xlabel('Date')
plt.ylabel('Price USD')
plt.show()
Unfortunately, it came out like this, with squiggly lines:
Can someone help me to present and visualize the data correctly?
amzn_close.csv
Date,Close
2021-03-05,3000.46
2021-04-01,3161.0
2021-03-17,3135.73
2021-02-23,3194.5
2021-03-10,3057.64
2021-03-16,3091.86
2021-03-18,3027.99
2021-02-25,3057.16
2021-04-15,3379.09
2021-03-22,3110.87
2021-04-14,3333.0
2021-03-25,3046.26
2021-03-24,3087.07
2021-02-26,3092.93
2021-04-20,3334.69
2021-04-19,3372.01
2021-04-16,3399.44
2021-04-08,3299.3
2021-03-08,2951.95
2021-03-30,3055.29
2021-03-02,3094.53
2021-03-09,3062.85
2021-02-24,3159.53
2021-02-22,3180.74
2021-04-22,3309.04
2021-03-01,3146.14
2021-03-15,3081.68
2021-03-26,3052.03
2021-04-05,3226.73
2021-03-31,3094.08
2021-03-03,3005.0
2021-04-23,3340.88
2021-04-26,3409.0
2021-03-19,3074.96
2021-03-23,3137.5
2021-04-21,3362.02
2021-03-29,3075.73
2021-04-12,3379.39
2021-04-07,3279.39
2021-04-13,3400.0
2021-04-27,3417.43
2021-04-06,3223.82
2021-03-12,3089.49
2021-03-11,3113.59
2021-03-04,2977.57
2021-04-09,3372.2
Two things to always verify are:
The 'Date' should be set as a datetime64[ns] or DatetimeIndex dtype.
pd.to_datetime()
pd.DatetimeIndex()
Parse dates when importing data
The value column, 'Close', should be a numeric dtype
Check the dtypes with df.info()
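The two checks above can also be done programmatically. This is a minimal sketch that reads three rows of the sample data from a string instead of amzn_close.csv; df_check is a name chosen for illustration.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Three rows of the sample data, read from a string instead of a file
csv_text = "Date,Close\n2021-03-05,3000.46\n2021-04-01,3161.0\n2021-03-17,3135.73\n"
df_check = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

date_ok = is_datetime64_any_dtype(df_check.index)  # index holds datetimes
close_ok = is_numeric_dtype(df_check["Close"])     # values are numeric
print(date_ok, close_ok)
```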
Use matplotlib directly
In this case, only the index needs to be sorted, because the y-axis is already numeric and the x-axis is already a datetime dtype.
The csv can also be read in with:
df = pd.read_csv('amzn_close.csv', parse_dates=['Date'], index_col=['Date'])
# load and format the data
df = pd.read_csv('amzn_close.csv')
df = df.set_index(pd.DatetimeIndex(df['Date'].values))
# sort the index
df.sort_index(inplace=True)
# plot
plt.figure(figsize=(16, 7))
plt.plot(df['Close'], label='Close')
plt.title('Close Price')
plt.xlabel('Date')
plt.ylabel('Price USD')
plt.show()
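A quick way to confirm that ordering is the cause of the squiggly line is to test whether the index is monotonic before and after sorting. The sketch uses three rows from the sample data; matplotlib connects points in the order given, so an unsorted index makes the line double back on itself.

```python
import pandas as pd

# Three rows from the sample, in file order
idx = pd.to_datetime(["2021-03-05", "2021-04-01", "2021-03-17"])
close = pd.Series([3000.46, 3161.0, 3135.73], index=idx, name="Close")

before = close.index.is_monotonic_increasing  # dates jump back and forth
close_sorted = close.sort_index()
after = close_sorted.index.is_monotonic_increasing

print(before, after)
```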
Use pandas.DataFrame.plot
Doesn't require sorting the index
# load the data
df = pd.read_csv('amzn_close.csv', parse_dates=['Date'], index_col=['Date'])
# plot the data
df.plot(figsize=(16, 7), title='Close Price', ylabel='Price USD', rot=0, legend=False)
Use seaborn.lineplot
Doesn't require sorting the index
# load the data
df = pd.read_csv('amzn_close.csv', parse_dates=['Date'], index_col=['Date'])
# plot
fig, ax = plt.subplots(figsize=(16, 7))
sns.lineplot(data=df, ax=ax, legend=False)
ax.set(title='Close Price', xlabel='Date', ylabel='Price USD')
Output for all implementations
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4
I have a data frame containing several columns for which I have continuous (annual) data since 1971 up to 2012. After that I have some say "predicted" values for 2020, 2025, 2030 and 2035. The index to the data frame is in integer format (each date), and I've tried converting it to a date time format using the appropriate module, but this still doesn't correctly space out the dates on the x-axis (to show the actual time-gaps) Here's the code I've been experimenting with:
fig, ax = plt.subplots()
# Set title
ttl = "India's fuel mix (1971-2012)"
# Set color transparency (0: transparent; 1: solid)
a = 0.7
# Convert the index integer dates into actual date objects
new_fmt.index = [datetime.datetime(year=date, month=1, day=1) for date in new_fmt.index]
new_fmt.loc[:,['Coal', 'Oil', 'Gas', 'Biofuels', 'Nuclear', 'Hydro','Wind']].plot(ax=ax,kind='bar', stacked=True, title = ttl)
ax.grid(False)
xlab = 'Date (Fiscal Year)'
ylab = 'Electricity Generation (GWh)'
ax.set_title(ax.get_title(), fontsize=20, alpha=a)
ax.set_xlabel(xlab, fontsize=16, alpha=a)
ax.set_ylabel(ylab, fontsize=16, alpha=a)
# Tell matplotlib to interpret the x-axis values as dates
ax.xaxis_date()
# Make space for and rotate the x-axis tick labels
fig.autofmt_xdate()
I tried to figure it out:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
# create data frame with random data (3 rows, 2 columns)
df = pd.DataFrame(np.random.randn(3,2))
# time index with missing years
t = [datetime.date(year=1971, month=12, day=31), datetime.date(year=1972, month=12, day=31), datetime.date(year=1980, month=12, day=31)]
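# convert to a DatetimeIndex so the labels match the date_range used for reindexing
df.index = pd.to_datetime(t)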
# time index with all the years:
tnew = pd.date_range(datetime.date(year=1971, month=1, day=1),datetime.date(year=1981, month=1, day=1),freq="A")
# reindex data frame (missing years will be filled with NaN)
df2 = df.reindex(tnew)
# replace NaN with 0
df2_zeros = df2.fillna(0)
# or interpolate
df2_interp = df2.interpolate()
# and plot
df2_interp.columns = ["coal","wind"]
df2_interp.plot(kind='bar', stacked=True)
plt.show()
Hope this helps.
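The same reindexing idea also works if the index is kept as plain integer years, as in the question. This is a minimal sketch with a made-up 'Coal' column and a shortened year range; the values are placeholders, not real data.

```python
import numpy as np
import pandas as pd

# Hypothetical subset: continuous years, then two isolated projection years
years = list(range(2010, 2013)) + [2020, 2025]
df_years = pd.DataFrame({"Coal": np.arange(5, dtype=float)}, index=years)

# One row per calendar year; years without data become NaN
full = df_years.reindex(list(range(2010, 2026)))
n_missing = int(full["Coal"].isna().sum())

print(len(full), n_missing)
```

Plotting full with kind='bar' leaves empty slots for the missing years, so the gaps before 2020 and 2025 show up on the x-axis.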