Bar Chart with Line Chart - Using non numeric index - python

I'd like to show on the same graph a bar chart of a dataframe, and a line chart that represents the sum.
I can do that for a frame for which the index is numeric or text. But it doesn't work for a datetime index.
Here is the code I use:
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(1234)
data = np.random.randn(10, 2)
date = dt.datetime.today()
index_nums = range(10)
index_text = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k']
index_date = pd.date_range(date + dt.timedelta(days=-9), date)
a_nums = pd.DataFrame(columns=['a', 'b'], index=index_nums, data=data)
a_text = pd.DataFrame(columns=['a', 'b'], index=index_text, data=data)
a_date = pd.DataFrame(columns=['a', 'b'], index=index_date, data=data)
fig, ax = plt.subplots(3, 1)
ax = ax.ravel()
for i, a in enumerate([a_nums, a_text, a_date]):
    a.plot.bar(stacked=True, ax=ax[i])
    (a.sum(axis=1)).plot(c='k', ax=ax[i])
As you can see, the last chart shows only the line, together with the bar chart's legend, and the dates are missing.
Also if I replace the last line with
ax[i].plot(a.sum(axis=1), c='k')
Then:
The chart with index_nums is the same
The chart with index_text raises an error
The chart with index_date shows the bar chart but not the line chart.
I'm using Python 3.6.2, pandas 0.20.3 and matplotlib 2.0.2.

Plotting a bar plot and a line plot to the same axes is often problematic, because a bar plot puts the bars at integer positions (0, 1, 2, ..., N-1) while a line plot uses the numeric data to determine the x positions.
In the case from the question, using range(10) as index for both bar and line plot works fine, since those are exactly the numbers a bar plot would use anyways. Using text also works fine, since this needs to be replaced by numbers in order to show it and of course the first N integers are used for that.
The bar plot for a datetime index also uses the first N integers, while the line plot will plot on the dates. Hence depending on which one comes first, you only see the line or bar plot (you would actually see the other by changing the xlimits accordingly).
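The mismatch described above can be checked directly. This is a minimal sketch, assuming a small made-up frame (the date range and the random data are illustrative, not the question's exact data). It draws the pandas bar plot and a Matplotlib line plot on separate axes and compares the x positions each one uses.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data: 10 days, two columns
idx = pd.date_range("2020-01-01", periods=10, freq="D")
df = pd.DataFrame(np.random.default_rng(1234).normal(size=(10, 2)),
                  columns=["a", "b"], index=idx)

fig, (ax_bar, ax_line) = plt.subplots(2, 1)
df.plot.bar(stacked=True, ax=ax_bar)            # pandas bar plot
ax_line.plot(df.index, df.sum(axis=1), c="k")   # line plotted on the dates

bar_ticks = ax_bar.get_xticks()   # integer bar positions
line_xlim = ax_line.get_xlim()    # date values converted to floats
print(bar_ticks)
print(line_xlim)
```

The bar ticks are small integers, while the limits of the date axis are large floats, so the two plots end up far apart when drawn on the same axes.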
An easy solution is to plot the bar plot first and reset the index to a numeric one on the dataframe for the line plot.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(1234)
import datetime as dt
data = np.random.randn(10, 2)
date = dt.datetime.today()
index_date = pd.date_range(date + dt.timedelta(days=-9), date)
df = pd.DataFrame(columns=['a', 'b'], index=index_date, data=data)
fig, ax = plt.subplots(1, 1)
df.plot.bar(stacked=True, ax=ax)
# drop the dates so the line is drawn at the integer positions 0..N-1
df.sum(axis=1).reset_index(drop=True).plot(ax=ax)
fig.autofmt_xdate()
plt.show()
Alternatively you can plot the lineplot as usual and use a matplotlib bar plot, which accepts numeric positions. See this answer: Python making combined bar and line plot with secondary y-axis
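As a rough sketch of that alternative, the example below draws both the bars and the line with Matplotlib itself, using the dates as x values for both. The data is made up and deliberately non-negative, so that stacking with a simple `bottom=` is valid; this is an assumption of the sketch, not of the question's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative, non-negative data
idx = pd.date_range("2020-01-01", periods=10, freq="D")
df = pd.DataFrame(np.random.default_rng(1234).uniform(0, 1, size=(10, 2)),
                  columns=["a", "b"], index=idx)

fig, ax = plt.subplots()
ax.bar(df.index, df["a"], width=0.8, label="a")
ax.bar(df.index, df["b"], width=0.8, bottom=df["a"], label="b")  # stack b on a
ax.plot(df.index, df.sum(axis=1), c="k", label="sum")            # same x values
ax.legend()
fig.autofmt_xdate()
```

Because both artists receive the same dates, they share one x scale and neither is pushed out of view.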

Related

How to change the legend font size of pd.DataFrame.plot() when `secondary_y` is used?

Question
I have used the secondary_y argument in pd.DataFrame.plot().
While trying to change the fontsize of legends by .legend(fontsize=20), I ended up having only 1 column name in the legend when I actually have 2 columns to be printed on the legend.
This problem (having only 1 column name in the legend) does not take place when I did not use secondary_y argument.
I want all the column names in my dataframe to be printed in the legend, and change the fontsize of the legend even when I use secondary_y while plotting dataframe.
Example
The following example with secondary_y shows only 1 column name A, when I have actually 2 columns, which are A and B.
The fontsize of the legend is changed, but only for 1 column name.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
                  index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(secondary_y = ["B"], figsize=(12,5)).legend(fontsize=20, loc="upper right")
When I do not use secondary_y, then legend shows both of the 2 columns A and B.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
                  index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(figsize=(12,5)).legend(fontsize=20, loc="upper right")
To customize the legend, you can create the graph with Matplotlib's subplots function and build the legend yourself:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
                  index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
#define colors to use
col1 = 'steelblue'
col2 = 'red'
#define subplots
fig,ax = plt.subplots()
#add first line to plot
lns1=ax.plot(df.index,df['A'], color=col1)
#add x-axis label
ax.set_xlabel('dates', fontsize=14)
#add y-axis label
ax.set_ylabel('A', color=col1, fontsize=16)
#define second y-axis that shares x-axis with current plot
ax2 = ax.twinx()
#add second line to plot
lns2=ax2.plot(df.index,df['B'], color=col2)
#add second y-axis label
ax2.set_ylabel('B', color=col2, fontsize=16)
#legend
ax.legend(lns1+lns2,['A','B'],loc="upper right",fontsize=20)
#another solution is to create legend for fig,:
#fig.legend(['A','B'],loc="upper right")
plt.show()
result:
This is a somewhat late response, but what worked for me was simply calling plt.legend(fontsize=wanted_fontsize) after the plot call.
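Another option, different from the answers above, is to keep df.plot(secondary_y=...) and rebuild the legend from the handles of both axes. The sketch below assumes that pandas exposes the twin axes as ax.right_ax on the returned axes, which holds when at least one column stays on the primary axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(24 * 3, 2),
                  index=pd.date_range("1/1/2019", periods=24 * 3, freq="h"),
                  columns=["A", "B"])

ax = df.plot(secondary_y=["B"], figsize=(12, 5))

# "A" lives on ax, "B" lives on the twin axes ax.right_ax
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax.right_ax.get_legend_handles_labels()

# One legend containing the lines of both axes, with the wanted font size
legend = ax.legend(h1 + h2, l1 + l2, fontsize=20, loc="upper right")
```

Calling .legend(fontsize=20) on the returned axes alone only collects the lines drawn on that axes, which is why column B disappears in the question.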

Plotting a vertical line for df.plot.bar works, but it doesn't for a lineplot

I followed all the steps from my earlier question: Pandas Dataframe : How to add a vertical line with label to a bar plot when your data is time-series?
That was supposed to solve my problem, but when I change the kind of plot to line, the vertical line does not appear. I copied the same code and changed the plot type from bar to line.
As you can see, with bar the vertical line (in red) appears.
# function to plot a bar
def dessine_line3(madataframe, debut_date, mes_colonnes):
    madataframe.index = pd.to_datetime(madataframe.index, format='%m/%d/%y')
    df = madataframe.loc[debut_date:, mes_colonnes].copy()
    filt = (df[df.index == '4/20/20']).index
    df.index.searchsorted(value=filt)
    fig, ax = plt.subplots()
    df.plot.bar(figsize=(17,8), grid=True, ax=ax)
    ax.axvline(df.index.searchsorted(filt), color="red", linestyle="--", lw=2, label="lancement")
    plt.tight_layout()
out :
But when I change the plot type to line, there is no vertical line and the x-axis (dates) also changes.
So I wrote another piece of code just to draw a line plot with a vertical line:
ax = madagascar_maurice_case_df[["Madagascar Covid-19 Ratio","Maurice Covid-19 Ratio"]].loc['3/17/20':].plot.line(figsize=(17,7),grid=True)
filt = (df[df.index=='4/20/20']).index
ax.axvline(df.index.searchsorted(filt),color="red",linestyle="--",lw=2 ,label="lancement")
plt.show()
But the result is the same.
Following the comment below, here is my final code:
def dessine_line5(madataframe, debut_date, mes_colonnes):
    plt.figure(figsize=(17,8))
    # pass the on/off flag positionally; the b= keyword was removed in newer Matplotlib
    plt.grid(True, which='major', axis='y')
    df = madataframe.loc[debut_date:, mes_colonnes]
    sns.lineplot(data=df)
    lt = datetime.toordinal(pd.to_datetime('4/20/20'))
    plt.axvline(lt, color="red", linestyle="--", lw=2, label="lancement")
    plt.show()
and the result is :
Plot tick locs
The issue is that the tick locations use a different scale depending on the plot kind and API:
df.plot vs. plt.plot vs. sns.lineplot
If you place ticks, labels = plt.xticks() after df.plot.bar(figsize=(17,8),grid=True,ax=ax) and print ticks, you get array([0, 1, 2,..., len(df.index) - 1]), which is why df.index.searchsorted(filt) works: it produces an integer location.
df.plot() has tick locs like array([13136, 13152, 13174, 13175], dtype=int64), for my sample date range. I don't actually know how those numbers are derived, so I don't know how to convert the date to that format.
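sns.lineplot and plt.plot have tick locs that are the ordinal representation of the datetime, array([737553., 737560., 737567., 737577., 737584., 737591., 737598., 737607.]). This holds for Matplotlib versions before 3.3; from 3.3 on, the default date epoch is 1970-01-01, so the date floats no longer match datetime.toordinal.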
For a lineplot with your example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
sns.lineplot(data=df)
lt = datetime.toordinal(pd.to_datetime('2020/04/20'))
plt.axvline(lt, color="red", linestyle="--", lw=2, label="lancement")
plt.show()
For my example data:
import numpy as np
data = {'a': [np.random.randint(10) for _ in range(40)],
        'b': [np.random.randint(10) for _ in range(40)],
        'date': pd.bdate_range(datetime.today(), periods=40).tolist()}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
sns.lineplot(data=df)
ticks, labels = plt.xticks()
lt = datetime.toordinal(pd.to_datetime('2020-05-19'))
plt.axvline(lt, color="red", linestyle="--", lw=2, label="lancement")
plt.show()
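For lines drawn with Matplotlib's own ax.plot on a datetime x-axis, a version-independent option is to pass the date itself to axvline and let Matplotlib's unit conversion compute the position. This is a minimal sketch with made-up data; the launch date and column names are illustrative, and it does not cover the df.plot case discussed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data on a daily datetime index
idx = pd.date_range("2020-05-01", periods=40, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 10, 40),
                   "b": rng.integers(0, 10, 40)}, index=idx)

fig, ax = plt.subplots()
ax.plot(df.index, df["a"], label="a")
ax.plot(df.index, df["b"], label="b")

# Hand axvline a datetime; the axis converts it to its own float scale
launch = pd.Timestamp("2020-05-19").to_pydatetime()
vline = ax.axvline(launch, color="red", linestyle="--", lw=2, label="lancement")
ax.legend()

x_lo, x_hi = ax.get_xlim()
x_launch = vline.get_xdata(orig=False)[0]  # position after unit conversion
```

This avoids computing the float by hand, so it does not depend on which date epoch the installed Matplotlib uses.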

How to edit x-axis length but also maintain plot dates?

Below I have my code to plot my graph.
#can change the 'iloc[x:y]' component to plot sections of chart
#ax = df['Data'].iloc[300:].plot(color = 'black', title = 'Past vs. Expected Future Path')
ax = df.plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
df.loc[df.index >= idx, 'up2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'down2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'Data'].plot(color = 'b', ax = ax)
plt.show()
#resize the plot
plt.rcParams["figure.figsize"] = [10,6]
plt.show()
Lines 2 (commented out) and 3 both work to plot all of the lines together as seen. However, I wish to have the dates on the x-axis and also be able to plot sections of the graph (defined by the x-axis, i.e. date1 to date2).
Using line 3 I can plot with dates on the x-axis, however using ".iloc[300:]" like in line 2 does not appear to work as the 3 coloured lines disconnect from the main line as seen below:
ax = df.iloc[300:].plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
Using line 2, I can edit the x-axis' length, however it doesn't have dates on the x-axis.
Does anyone have any advice on how to both have dates and be able to edit the x-axis periods?
For this to work as desired, you need to set the 'date' column as index of the dataframe. Otherwise, df.plot has no way to know what needs to be used as x-axis. With the date set as index, pandas accepts expressions such as df.loc[df.index >= '20180101', 'data2'] to select a time range and a specific column.
Here is some example code to demonstrate the concept.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
dates = pd.date_range('20160101', '20191231', freq='D')
data1 = np.random.normal(-0.5, 0.2, len(dates))
data2 = np.random.normal(-0.7, 0.2, len(dates))
df = pd.DataFrame({'date': dates, 'data1':data1, 'data2':data2})
df.set_index('date', inplace=True)
df['data1'].iloc[300:].plot(color='crimson')
df.loc[df.index >= '20180101', 'data2'].plot(color='dodgerblue')
plt.tight_layout()
plt.show()
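To plot a section bounded on both sides (date1 to date2), a label slice on the datetime index can be used. The sketch below uses made-up data; the names date1, date2 and window are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", "20191231", freq="D")
df = pd.DataFrame({"data1": np.random.normal(-0.5, 0.2, len(dates))},
                  index=dates)

# Hypothetical window; label slices on a DatetimeIndex include both end points
date1, date2 = "2018-01-01", "2018-06-30"
window = df.loc[date1:date2, "data1"]

ax = window.plot(color="crimson")  # the x-axis keeps the dates
```

Unlike .iloc, the slice is expressed in dates, so it stays valid if rows are added or removed before the window.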

How to annotate end of lines using python and matplotlib?

With a dataframe and basic plot such as this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123456)
rows = 75
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 3)), columns=['A', 'B', 'C'])
datelist = pd.date_range('2017-01-01', periods=rows).tolist()  # pd.datetime no longer exists in current pandas
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
df.plot()
What is the best way of annotating the last points on the lines so that you get the result below?
In order to annotate a point use ax.annotate(). In this case it makes sense to specify the coordinates to annotate separately. I.e. the y coordinate is the data coordinate of the last point of the line (which you can get from line.get_ydata()[-1]) while the x coordinate is independent of the data and should be the right hand side of the axes (i.e. 1 in axes coordinates). You may then also want to offset the text a bit such that it does not overlap with the axes.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
rows = 75
df = pd.DataFrame(np.random.randint(-4,5,size=(rows, 3)), columns=['A', 'B', 'C'])
datelist = pd.date_range('2017-01-01', periods=rows).tolist()  # pd.datetime no longer exists in current pandas
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df = df.cumsum()
ax = df.plot()
for line, name in zip(ax.lines, df.columns):
    y = line.get_ydata()[-1]
    ax.annotate(name, xy=(1,y), xytext=(6,0), color=line.get_color(),
                xycoords = ax.get_yaxis_transform(), textcoords="offset points",
                size=14, va="center")
plt.show()
Method 1
Here is one way, or at least a method, which you can adapt to aesthetically fit in whatever way you want, using the plt.annotate method:
[EDIT]: If you're going to use a method like this first one, the method outlined in ImportanceOfBeingErnest's answer is better than what I've proposed.
df.plot()
for col in df.columns:
    plt.annotate(col,xy=(plt.xticks()[0][-1]+0.7, df[col].iloc[-1]))
plt.show()
For the xy argument, which is the x and y coordinates of the text, I chose the last x coordinate in plt.xticks() and added 0.7 so that it is outside of your x axis, but you can choose to make it closer or further as you see fit.
Method 2
You could also just use the right y axis, and label it with your 3 lines. For example:
fig, ax = plt.subplots()
df.plot(ax=ax)
ax2 = ax.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.set_yticks([df[col].iloc[-1] for col in df.columns])
ax2.set_yticklabels(df.columns)
plt.show()
This gives you the following plot:
I've got some tips from the other answers and believe this is the easiest solution.
Here is a generic function to improve the labels of a line chart. Its advantages are:
you don't need to mess with the original DataFrame since it works over a line chart,
it will use the already set legend label,
removes the frame,
just copy'n paste it to improve your chart :-)
You can just call it after creating any line chart:
def improve_legend(ax=None):
    if ax is None:
        ax = plt.gca()
    for spine in ax.spines:
        ax.spines[spine].set_visible(False)
    for line in ax.lines:
        data_x, data_y = line.get_data()
        right_most_x = data_x[-1]
        right_most_y = data_y[-1]
        ax.annotate(
            line.get_label(),
            xy=(right_most_x, right_most_y),
            xytext=(5, 0),
            textcoords="offset points",
            va="center",
            color=line.get_color(),
        )
    ax.legend().set_visible(False)
This is the original chart:
Now you just need to call the function to improve your plot:
ax = df.plot()
improve_legend(ax)
The new chart:
Beware, it will probably not work well if a line has null values at the end.
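One way to handle that case is to annotate the last point whose y value is finite instead of the last point of the line. The helper below is a hypothetical sketch (last_valid_point is not part of Matplotlib or of the answer above), shown on a tiny made-up line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np


def last_valid_point(line):
    """Return (x, y) of the last point with a finite y value, or None."""
    x, y = line.get_data()
    y = np.asarray(y, dtype=float)
    valid = np.flatnonzero(np.isfinite(y))  # indices of non-NaN, non-inf values
    if valid.size == 0:
        return None
    return x[valid[-1]], y[valid[-1]]


fig, ax = plt.subplots()
# Made-up line that ends with two missing values
(line,) = ax.plot([0, 1, 2, 3], [1.0, 2.0, np.nan, np.nan], label="A")

point = last_valid_point(line)
if point is not None:
    ax.annotate(line.get_label(), xy=point, xytext=(5, 0),
                textcoords="offset points", va="center",
                color=line.get_color())
```

In improve_legend, the pair (right_most_x, right_most_y) could be replaced by the result of such a helper.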

How do I use matplotlib autopct?

I'd like to create a matplotlib pie chart which has the value of each wedge written on top of the wedge.
The documentation suggests I should use autopct to do this.
autopct: [ None | format string | format function ]
If not None, is a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be fmt%pct. If it is a function, it will be called.
Unfortunately, I'm unsure what this format string or format function is supposed to be.
Using this basic example below, how can I display each numerical value on top of its wedge?
plt.figure()
values = [3, 12, 5, 8]
labels = ['a', 'b', 'c', 'd']
plt.pie(values, labels=labels) #autopct??
plt.show()
autopct enables you to display the percent value using Python string formatting. For example, if autopct='%.2f', then for each pie wedge, the format string is '%.2f' and the numerical percent value for that wedge is pct, so the wedge label is set to the string '%.2f'%pct.
import matplotlib.pyplot as plt
plt.figure()
values = [3, 12, 5, 8]
labels = ['a', 'b', 'c', 'd']
plt.pie(values, labels=labels, autopct='%.2f')
plt.show()
yields
You can do fancier things by supplying a callable to autopct. To display both the percent value and the original value, you could do this:
import matplotlib.pyplot as plt
# make the pie circular by setting the aspect ratio to 1
plt.figure(figsize=plt.figaspect(1))
values = [3, 12, 5, 8]
labels = ['a', 'b', 'c', 'd']
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}% ({v:d})'.format(p=pct,v=val)
    return my_autopct
plt.pie(values, labels=labels, autopct=make_autopct(values))
plt.show()
Again, for each pie wedge, matplotlib supplies the percent value pct as the argument, though this time it is sent as the argument to the function my_autopct. The wedge label is set to my_autopct(pct).
You can do:
plt.pie(values, labels=labels, autopct=lambda p : '{:.2f}% ({:,.0f})'.format(p,p * sum(values)/100))
Using lambda and format may be better
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
path = r"C:\Users\byqpz\Desktop\DATA\raw\tips.csv"
df = pd.read_csv(path, engine='python', encoding='utf_8_sig')
days = df.groupby('day').size()
sns.set()
days.plot(kind='pie', title='Number of parties on different days', figsize=[8,8],
          autopct=lambda p: '{:.2f}%({:.0f})'.format(p,(p/100)*days.sum()))
plt.show()
Since autopct can be a function used to label the wedges with their numeric value, it can return any label you need, for example the percentage, the item count, or both. The easiest approach for me to show a percentage label is a lambda:
autopct = lambda p:f'{p:.2f}%'
or for some cases you can label data as
autopct = lambda p:'any text you want'
and for your code, to show percentage you can use:
plt.figure()
values = [3, 12, 5, 8]
labels = ['a', 'b', 'c', 'd']
plt.pie(values, labels=labels, autopct=lambda p:f'{p:.2f}%, {p*sum(values)/100 :.0f} items')
plt.show()
and result will be like:
autopct enables you to display the percentage value of each slice using Python string formatting.
For example,
autopct = '%.1f' # display the percentage value to 1 decimal place
autopct = '%.2f' # display the percentage value to 2 decimal places
If you want to show the % symbol on the pie chart, you have to write/add:
autopct = '%.1f%%'
autopct = '%.2f%%'
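To see what such a format string produces, the text objects returned by pie can be inspected. This is a small sketch using the values from the question; when autopct is set, pie returns the percentage labels as a third list.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import matplotlib.pyplot as plt

values = [3, 12, 5, 8]
labels = ["a", "b", "c", "d"]

fig, ax = plt.subplots()
# With autopct set, pie returns wedges, label texts and percentage texts
wedges, texts, autotexts = ax.pie(values, labels=labels, autopct="%.1f%%")

shown = [t.get_text() for t in autotexts]  # strings drawn inside the wedges
print(shown)
```

Each string is the wedge's share of the total, formatted with one decimal place and followed by a literal % sign.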
val=int(pct*total/100.0)
should be
val=int((pct*total/100.0)+0.5)
to prevent rounding errors.
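The effect can be explored with a short sketch that records the percentages Matplotlib passes to autopct and converts them back to counts both ways. The helper collect is hypothetical and only exists to capture the values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop for interactive use
import matplotlib.pyplot as plt

values = [3, 12, 5, 8]
total = sum(values)
seen = []


def collect(pct):
    """Record the percentage passed in by pie and draw no label."""
    seen.append(pct)
    return ""


fig, ax = plt.subplots()
ax.pie(values, autopct=collect)

# Converting back: truncation can drop a unit if pct is slightly too small,
# while adding 0.5 before truncating rounds to the nearest integer.
truncated = [int(p * total / 100.0) for p in seen]
rounded = [int(p * total / 100.0 + 0.5) for p in seen]
print(truncated, rounded)
```

The rounded list reproduces the original values, while the truncated list depends on tiny floating-point errors in the percentages.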
With the help of the matplotlib gallery and hints from StackOverflow users, I came up with the following pie chart.
The autopct shows the amounts and kinds of ingredients.
import matplotlib.pyplot as plt
# %matplotlib inline  (only needed in a Jupyter notebook)
recipe = ["480g Flour", "50g Eggs", "90g Sugar"]
amt = [int(x.split('g ')[0]) for x in recipe]
ing = [x.split()[-1] for x in recipe]
fig, ax = plt.subplots(figsize=(5,5), subplot_kw=dict(aspect='equal'))
wedges, text, autotext = ax.pie(amt, labels=ing, startangle=90,
                                autopct=lambda p: "{:.0f}g\n({:.1f}%)".format(p*sum(amt)/100, p),
                                textprops=dict(color='k', weight='bold', fontsize=8))
ax.legend(wedges, ing, title='Ingredients', loc='best', bbox_to_anchor=(0.35,0.85,0,0))
Pie chart showing the amount and percentage of each ingredient in a sample recipe
Pie chart showing the salary and percent of programming Language users
