How to force pandas and native matplotlib to share an axis

Hi folks,
Consider the following example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(range(len(x)), np.linspace(-1,1,100), y.T)
plt.show()
At this point, I would like both axes (ax1, ax2) to share the x-axis, i.e. to display proper pandas dates on the second axis. sharex=True does not seem to work. How can I achieve that? I tried several approaches, none of which worked.
Edit: Since the pandas date formatting is superior to the native matplotlib formatting, please provide a solution where pandas date formatting is used (for instance, zooming in an interactive environment works much better with pandas date formatting). Thank you!

One way to do it is to do all the plotting with matplotlib. This way, there are no problems with different time formats being used:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex='col')
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
#x.plot(ax=ax1)
ax1.plot(x.index, x.values)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(x.index, np.linspace(-1,1,100), y.T)
fig.tight_layout()
plt.show()
This gives the following plot:

What seems to work fine is to first plot the same line into the axes that should host the image, then plot the image, then remove the line again. What this does is that it tells pandas to apply its locators and formatters to that axes; they will stay after removing the line.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex=True)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
x.plot(ax=ax2, legend=False)
ax2.pcolormesh(dates, np.linspace(-1,1,100), y.T)
ax2.lines[0].remove()
plt.show()
Note that there may be caveats of this solution when zooming or panning. Consider it more like a hack and use it as long as it works, but don't blame anyone once it doesn't.
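
If the remove-the-line hack proves fragile, a possible middle ground is to stay with the all-matplotlib approach and attach matplotlib's own ConciseDateFormatter to the shared x-axis. This is only a sketch under assumptions: it relies on matplotlib's date formatting rather than pandas', so it will not reproduce the pandas tick labels exactly, and the reduced array size (20 rows instead of 100) is chosen purely for illustration.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range("2018-01-01", "2019-01-01", freq="1D")
x = pd.DataFrame(index=dates, data=np.linspace(0, 1, len(dates)))
y = np.random.random([len(dates), 20]) * x.values

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)

# Plain datetime objects, so both axes use matplotlib's date units
t = dates.to_pydatetime()
ax1.plot(t, x.values)
ax2.pcolormesh(t, np.linspace(-1, 1, 20), y.T, shading="nearest")

# Compact date labels that adapt to the zoom level
locator = mdates.AutoDateLocator()
ax2.xaxis.set_major_locator(locator)
ax2.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.tight_layout()
```

Call plt.show() afterwards to display the figure interactively.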

Related

Transposing x and y axes with matplotlib and pandas

I'm trying to use a bar chart to visualize my csv data. The data looks like this:
question,count_1,count_2,count_3,count_4,count_5
Q1,0,0,6,0,0
Q2,6,0,0,0,0
Q3,3,2,1,0,0
Q4,0,0,6,0,0
Q5,6,0,0,0,0
Q6,0,6,0,0,0
Q7,6,0,0,0,0
Q8,0,0,0,5,1
Q9,1,4,0,0,1
Q10,0,0,1,5,0
Here is my code
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.bar(stacked=True)
ax.legend(loc='best')
plt.show()
Which gives me:
What I'm trying to do is flip the x and y axes. I want the bars to be horizontal and y axis to be the questions. I tried to transpose my data frame using:
ax = df.set_index(['question']).T.plot.bar(stacked=True)
but that gives me:
which is not what I want. Can anyone help?

To get horizontal bars (i.e. to flip the x and y axes), use barh (horizontal bar) instead of bar. The code would then be:
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.barh(stacked=True)
ax.legend(loc='best')
plt.show()
Output plot
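
For readers without example.csv at hand, the same idea can be tried with the first rows of the data inlined. This is a minimal sketch, not the original poster's exact setup: the csv_text variable and the optional invert_yaxis() call (which puts Q1 at the top) are additions for illustration.

```python
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd

# First three rows of the question's data, inlined for illustration
csv_text = """question,count_1,count_2,count_3,count_4,count_5
Q1,0,0,6,0,0
Q2,6,0,0,0,0
Q3,3,2,1,0,0
"""
df = pd.read_csv(StringIO(csv_text))

# barh draws one horizontal stacked bar per question
ax = df.set_index("question").plot.barh(stacked=True)
ax.invert_yaxis()  # optional: list the questions from top to bottom
ax.legend(loc="best")
```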

Add annotation to specific cells in heatmap

I am plotting a seaborn heatmap and would like to annotate only the specific cells with custom text.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
labels = StringIO(u'''7,8,4,,1
5,2,,2,8
1,,6,,7
3,1,,4,7''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
labels = pd.read_csv(labels, header=None)
#labels = np.ma.masked_invalid(labels)
fig, ax = plt.subplots()
sns.heatmap(data, annot=labels, ax=ax, vmin=0, vmax=100)
plt.show()
The above code generates the following heatmap:
and the commented line generates the following heatmap:
I would like to show only the non-nan (or non-zero) text on the cells. How can that be achieved?

Use a string array for annot instead of a masked array:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
labels = StringIO(u'''7,8,4,,1
5,2,,2,8
1,,6,,7
3,1,,4,7''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
labels = pd.read_csv(labels, header=None)
#labels = np.ma.masked_invalid(labels)
# Convert everything to strings:
annotations = labels.astype(str)
annotations[np.isnan(labels)] = ""
fig, ax = plt.subplots()
sns.heatmap(data, annot=annotations, fmt="s", ax=ax, vmin=0, vmax=100)
plt.show()
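
One detail to watch with this approach: columns of labels that contain a missing value are read as floats, so astype(str) turns 8 into "8.0". A possible way around that, sketched here under the assumption that all labels are whole numbers, is to build the strings cell by cell. The helper name to_label is made up for this example.

```python
from io import StringIO

import pandas as pd

labels = pd.read_csv(
    StringIO("7,8,4,,1\n5,2,,2,8\n1,,6,,7\n3,1,,4,7"), header=None
)

def to_label(value):
    """Return '' for missing cells, otherwise the integer as text."""
    return "" if pd.isna(value) else str(int(value))

# Apply the helper to every cell, column by column
annotations = labels.apply(lambda col: col.map(to_label))
print(annotations)
```

The resulting frame can be passed to sns.heatmap as annot=annotations together with fmt="".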

To complement the answer by @mrzo, you can use na_filter=False in read_csv() to store NaNs as empty strings and use pandas.DataFrame.astype() to convert everything to strings:
# ...
labels = pd.read_csv(labels, header=None, na_filter=False).astype(str)
sns.heatmap(data, annot=labels, fmt='s', ax=ax, vmin=0, vmax=100)

Just going to add this as it has taken me some time to work out how to do something similar programmatically for a slightly different application: I wanted to suppress 0-values from the annotation, but because the values were arising as the result of a crosstab operation I couldn't use William Miller's nice approach without writing the crosstab out and then reading it back in which seemed... inelegant.
There may be a yet more elegant way to do this, but for me running it through numpy was ridiculously fast and quite easy.
import numpy as np
import pandas as pd
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
# For more complex functions you could write a def instead
# of using this simple lambda function
an = np.vectorize(lambda x: '' if x<50 else str(round(x,-1)))(data.to_numpy())
sns.heatmap(
data=data.to_numpy(), # Note this is now numpy too
cmap='BuPu',
annot=an, # The matching ndarray of annotations
fmt = '', # Formats annotations as strings (i.e. no formatting)
cbar=False, # Seems overkill if you've got annotations
vmin=0,
vmax=data.max().max()
)
This can make life a little more difficult in terms of labelling axes, though it's straightforward enough: ax.set_xticklabels(df.columns.values). And if you had axislabels in, say, the first column then you'd need to use iloc (data.iloc[:,1:]) in your to_numpy call, but combined with a custom colormap (e.g. 0==white) you can create heatmaps that are a lot easier to look at.
Obviously the crude rounding is confusing (why does 80 have different shades?) but you get the point:

Format y axis as percent

I have an existing plot that was created with pandas like this:
df['myvar'].plot(kind='bar')
The y axis is formatted as float and I want to change it to percentages. All of the solutions I found use ax.xyz syntax, and I can only place code below the line above that creates the plot (I cannot add ax=ax to the line above).
How can I format the y axis as percentages without changing the line above?
Here is the solution I found but requires that I redefine the plot:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as mtick
data = [8,12,15,17,18,18.5]
perc = np.linspace(0,100,len(data))
fig = plt.figure(1, (7,4))
ax = fig.add_subplot(1,1,1)
ax.plot(perc, data)
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
xticks = mtick.FormatStrFormatter(fmt)
ax.xaxis.set_major_formatter(xticks)
plt.show()
Link to the above solution: Pyplot: using percentage on x axis

This is a few months late, but I have created PR#6251 with matplotlib to add a new PercentFormatter class. With this class you just need one line to reformat your axis (two if you count the import of matplotlib.ticker):
import ...
import matplotlib.ticker as mtick
ax = df['myvar'].plot(kind='bar')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
PercentFormatter() accepts three arguments, xmax, decimals, symbol. xmax allows you to set the value that corresponds to 100% on the axis. This is nice if you have data from 0.0 to 1.0 and you want to display it from 0% to 100%. Just do PercentFormatter(1.0).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%', respectively. decimals=None will automatically set the number of decimal points based on how much of the axes you are showing.
Update
PercentFormatter was introduced into Matplotlib proper in version 2.1.0.
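
To make the xmax and decimals parameters concrete, here is a small sketch with made-up fractions between 0 and 1. The column name myvar follows the question; the data values are invented for illustration.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd

# Made-up fractions; 1.0 should be displayed as 100%
df = pd.DataFrame({"myvar": [0.10, 0.25, 0.40, 0.55]})
ax = df["myvar"].plot(kind="bar")

# xmax=1.0 maps 1.0 to 100%; decimals=0 drops the fractional digits
formatter = mtick.PercentFormatter(xmax=1.0, decimals=0)
ax.yaxis.set_major_formatter(formatter)

# Render once so the tick label texts are populated, then inspect them
ax.figure.canvas.draw()
tick_labels = [t.get_text() for t in ax.get_yticklabels()]
print(tick_labels)
```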

The pandas DataFrame plot method returns the axes for you, and you can then manipulate the axes however you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,5))
# you get ax from here
ax = df.plot()
type(ax) # matplotlib.axes._subplots.AxesSubplot
# manipulate
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])

Jianxun's solution did the job for me but broke the y value indicator at the bottom left of the window.
I ended up using FuncFormatter instead (and also stripped the unnecessary trailing zeroes as suggested here):
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame(np.random.randn(100,5))
ax = df.plot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
Generally speaking I'd recommend using FuncFormatter for label formatting: it's reliable, and versatile.
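
The reason this survives zooming is that the formatter function is called again for every tick each time the axis is redrawn, whereas set_yticklabels freezes a fixed list of strings. A tiny sketch of what the formatter does on its own (the function name as_percent is made up for this example):

```python
from matplotlib.ticker import FuncFormatter

def as_percent(value, position):
    """value is the tick location; position is the tick index (unused here)."""
    return f"{value:.0%}"

formatter = FuncFormatter(as_percent)

# The formatter can be called directly to see what a tick would display
print(formatter(0.5))   # 50%
print(formatter(1.25))  # 125%
```

It is attached to a plot the same way as above, with ax.yaxis.set_major_formatter(formatter).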

For those who are looking for a quick one-liner:
plt.gca().set_yticklabels([f'{x:.0%}' for x in plt.gca().get_yticks()])
This assumes the following:
the import from matplotlib import pyplot as plt
Python >= 3.6 for f-string formatting. For older versions, replace f'{x:.0%}' with '{:.0%}'.format(x)

I'm late to the game, but I just realized this: ax can be replaced with plt.gca() for those who are not working with an explicit axes object.
Echoing @Mad Physicist's answer, using the PercentFormatter class it would be:
import matplotlib.ticker as mtick
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
#if you already have ticks in the 0 to 1 range. Otherwise see their answer

I propose an alternative method using seaborn
Working code:
import numpy as np
import pandas as pd
import seaborn as sns
data=np.random.rand(10,2)*100
df = pd.DataFrame(data, columns=['A', 'B'])
ax= sns.lineplot(data=df, markers= True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='title')
# changing the y-axis tick labels
y_value=['{:,.2f}'.format(x) + '%' for x in ax.get_yticks()]
ax.set_yticklabels(y_value)

You can do this in one line without importing anything:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{}%'.format))
If you want integer percentages, you can do:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{:.0f}%'.format))
You can use either ax.yaxis or plt.gca().yaxis. FuncFormatter is still part of matplotlib.ticker, but you can also do plt.FuncFormatter as a shortcut.

Based on the answer of @erwanp, you can use the formatted string literals of Python 3,
x = '2'
percentage = f'{x}%' # 2%
inside the FuncFormatter() and combined with a lambda expression.
All wrapped:
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: f'{y}%'))

Another one line solution if the yticks are between 0 and 1:
plt.yticks(plt.yticks()[0], ['{:,.0%}'.format(x) for x in plt.yticks()[0]])

Add one line of code (this assumes import matplotlib.ticker as ticker):
ax.yaxis.set_major_formatter(ticker.PercentFormatter())

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1, as suggested in "pandas multiple plots not working as hists", nor does the example in "Overlaying multiple histograms using pandas" do what I need. When I run the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:

As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:

In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order you call .hist() matters (the first one will be at the back)
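
With fully opaque bars the histogram drawn last hides the one behind it. A possible refinement, sketched here with the same random data, is to draw both on one axes with shared bin edges and some transparency, as the question's own code attempts with alpha=0.5. The choice of 20 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(37, 2)), columns=["A", "B"])

# Common bin edges computed from both columns so the bars line up
bins = np.histogram_bin_edges(df.to_numpy().ravel(), bins=20)

fig, ax = plt.subplots()
df["A"].hist(ax=ax, bins=bins, alpha=0.5, label="A")
df["B"].hist(ax=ax, bins=bins, alpha=0.5, label="B")
ax.legend()
```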

A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
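
As a rough illustration of the reshaping, here is a sketch with a tiny made-up frame: melt() stacks the columns into a long table with one row per original cell, where the column name goes into 'variable' and the cell content into 'value'.

```python
import pandas as pd

# Tiny made-up frame in wide format: one column per series
df = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]})

# Long format: rows (A, 0.1), (A, 0.2), (B, 0.3), (B, 0.4)
long_df = df.melt()
print(long_df)
```

seaborn can then use the 'variable' column for hue and the 'value' column for x.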

From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)

You make two dataframes and one matplotlib axis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)

Here is the snippet. In my case I have explicitly specified the bins and range, because I did not remove outliers as the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib example on multiple histograms with different sample sizes.

This can be done concisely:
plt.hist([First, Other], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increase, it may become a visual burden.

You could also check out the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Overlapping bars can be hard to tell apart, but it may still help:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html

Add date tickers to a matplotlib/python chart

I have a question that sounds simple, but it has been driving me mad for days. I have a historical time series stored in two lists: the first contains prices, say P = [1, 1.5, 1.3 ...], and the second contains the corresponding dates, say D = [01/01/2010, 02/01/2010...]. I would like to show only SOME of these dates as tick labels (the "best" result I have so far shows all of them, which creates a black cloud of unreadable text on the x-axis), with more detail appearing as I zoom in. This picture currently uses the automatic numeric range chosen by Matplotlib:
Instead of 0, 200, 400 etc., I would like to see the dates that correspond to the plotted data points. Moreover, when I zoom in I get the following:
Just as I get finer detail between 0 and 200 (20, 40 etc.), I would like to get the corresponding dates from the list.
I'm sure this is a simple problem to solve, but I'm new to Matplotlib as well as to Python, and any hint would be appreciated. Thanks in advance.

Matplotlib has sophisticated support for plotting dates. I'd recommend the use of AutoDateFormatter and AutoDateLocator. They are even locale-specific, so they choose month-names according to your locale.
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateFormatter, AutoDateLocator
xtick_locator = AutoDateLocator()
xtick_formatter = AutoDateFormatter(xtick_locator)
ax = plt.axes()
ax.xaxis.set_major_locator(xtick_locator)
ax.xaxis.set_major_formatter(xtick_formatter)
EDIT
For use with multiple subplots, use multiple locator/formatter pairs:
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateFormatter, AutoDateLocator, date2num
x = [datetime.datetime.now() + datetime.timedelta(days=30*i) for i in range(20)]
y = np.random.random((20))
for i in range(4):
    ax = plt.subplot(2,2,i+1)
    # create a separate locator/formatter pair for each subplot
    xtick_locator = AutoDateLocator()
    xtick_formatter = AutoDateFormatter(xtick_locator)
    ax.xaxis.set_major_locator(xtick_locator)
    ax.xaxis.set_major_formatter(xtick_formatter)
    ax.plot(date2num(x),y)
plt.show()
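
Applied to the two lists from the question, a minimal sketch could look as follows. It assumes the dates are strings in day/month/year order (the question does not say which order is meant) and uses made-up prices. Once the x values are datetime objects, matplotlib picks a date locator and formatter on its own, so the explicit setup above is only needed for finer control.

```python
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Made-up sample data in the shape described in the question
P = [1, 1.5, 1.3, 1.7]
D = ["01/01/2010", "02/01/2010", "03/01/2010", "04/01/2010"]

# Assumption: day/month/year; use "%m/%d/%Y" for month/day/year
dates = [datetime.strptime(d, "%d/%m/%Y") for d in D]

fig, ax = plt.subplots()
ax.plot(dates, P)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```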

You can also make a time series plot with pandas.
For details, refer to http://pandas.pydata.org/pandas-docs/dev/timeseries.html and
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.plot.html
import pandas as pd
DateStrList = ['01/01/2010','02/01/2010']
P = [2,3]
D = pd.Series([pd.to_datetime(date) for date in DateStrList])
series =pd.Series(P, index=D)
pd.Series.plot(series)

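import matplotlib.pyplot as plt
import pandas
# pandas.TimeSeries no longer exists in current pandas; a Series with a
# datetime index does the same job (D is assumed to hold datetime values)
pandas.Series(P, index=D).plot()
plt.show()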
