I am making a series of bar plots of data with two categorical variables and one numeric variable. What I have is below, but what I would love to do is facet by one of the categorical variables, as with facet_wrap in ggplot. I have a somewhat working example, but I get the wrong plot type (lines, not bars) and I subset the data inside a loop, which can't be the best way.
## first try--plain vanilla
import pandas as pd
import numpy as np
N = 100
## generate toy data
ind = np.random.choice(['a','b','c'], N)
cty = np.random.choice(['x','y','z'], N)
jobs = np.random.randint(low=1,high=250,size=N)
## prep data frame
df_city = pd.DataFrame({'industry':ind,'city':cty,'jobs':jobs})
df_city_grouped = df_city.groupby(['city','industry']).jobs.sum().unstack()
df_city_grouped.plot(kind='bar',stacked=True,figsize=(9, 6))
This gives something like this:
city industry jobs
0 z b 180
1 z c 121
2 x a 33
3 z a 121
4 z c 236
However, what I would like to see is something like this:
## R code
library(plyr)
df_city<-read.csv('/home/aksel/Downloads/mockcity.csv',sep='\t')
## summarize
df_city_grouped <- ddply(df_city, .(city,industry), summarise, jobstot = sum(jobs))
## plot
ggplot(df_city_grouped, aes(x=industry, y=jobstot)) +
geom_bar(stat='identity') +
facet_wrap(~city)
The closest I get with matplotlib is something like this:
import matplotlib.pyplot as plt
cols = df_city.city.value_counts().shape[0]
fig, axes = plt.subplots(1, cols, figsize=(8, 8))
for x, city in enumerate(df_city.city.value_counts().index.values):
    # subset to one city, then total the jobs per industry
    data = df_city[(df_city['city'] == city)]
    data = data.groupby(['industry']).jobs.sum()
    axes[x].plot(data)
So two questions:
Can I do bar plots (they plot lines as shown here) using the AxesSubplot object and end up with something along the lines of the facet_wrap example from ggplot example;
In loops generating charts such as this attempt, I subset the data in each. I can't imagine that is the 'proper' way to do this type of faceting?
Second example here: http://pandas-docs.github.io/pandas-docs-travis/visualization.html#bar-plots
Anyway, you can always do that by hand, as you did yourself.
EDIT:
BTW, you can always use rpy2 in python, so you can do all the same things as in R.
Also, have a look at this: https://pandas.pydata.org/pandas-docs/version/0.14.1/rplot.html
I am not sure, but it should be helpful for creating plots over many panels, though might require further reading.
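One way to get facet-like panels from pandas alone, without subsetting in a loop, is to pivot so that each city becomes a column and then ask `DataFrame.plot` for one subplot per column. The sketch below uses a small hand-made table in place of the random data (an assumption, so the totals are predictable); `subplots`, `layout` and `sharey` are regular `DataFrame.plot` arguments.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Small deterministic stand-in for the random toy data (illustrative values)
df_city = pd.DataFrame({
    "city":     ["x", "x", "x", "y", "y", "y", "z", "z", "z", "z"],
    "industry": ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"],
    "jobs":     [10, 20, 30, 40, 50, 60, 70, 80, 90, 5],
})

# Rows = industry (x axis of each panel), columns = city (one panel each)
totals = df_city.groupby(["industry", "city"]).jobs.sum().unstack("city")

axes = totals.plot(kind="bar", subplots=True, layout=(1, 3),
                   sharey=True, legend=False, figsize=(9, 3))
panels = np.ravel(axes)  # flatten whatever array shape pandas returns
fig = panels[0].get_figure()
fig.suptitle("Employment By Industry By City")
```

Each panel is titled with its city name, and the shared y axis keeps the bar heights comparable across panels.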
@tcasell suggested the bar call in the loop. Here is a working, if not elegant, example.
## second try--facet by county
N = 100
industry = ['a','b','c']
city = ['x','y','z']
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
jobs = np.random.randint(low=1,high=250,size=N)
df_city =pd.DataFrame({'industry':ind,'city':cty,'jobs':jobs})
## how many panels do we need?
import matplotlib.pyplot as plt
cols = df_city.city.value_counts().shape[0]
fig, axes = plt.subplots(1, cols, figsize=(8, 8))
for x, city in enumerate(df_city.city.value_counts().index.values):
    data = df_city[(df_city['city'] == city)]
    data = data.groupby(['industry']).jobs.sum()
    print(data)
    print(type(data.index))
    left = [k[0] for k in enumerate(data)]   # bar positions 0, 1, 2, ...
    right = [k[1] for k in enumerate(data)]  # bar heights (total jobs)
    axes[x].bar(left, right, label="%s" % (city))
    axes[x].set_xticks(left, minor=False)
    axes[x].set_xticklabels(data.index.values)
    axes[x].legend(loc='best')
    axes[x].grid(True)
fig.suptitle('Employment By Industry By City', fontsize=20)
The Seaborn library, which is built on Matplotlib and could be considered a superset of it, has flexible and powerful plotting options for facet plots--they even use similar terminology to R. Scroll down on this page for multiple examples.
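As a rough illustration of the Seaborn route, the sketch below uses `catplot` with `col="city"`, which plays the role of `facet_wrap(~city)`. The table is a hand-made stand-in for the toy data (an assumption), and it is aggregated first so each bar is backed by exactly one row.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Hand-made stand-in for the toy data (illustrative values, not from the post)
df_city = pd.DataFrame({
    "city":     ["x", "x", "x", "y", "y", "y", "z", "z", "z"],
    "industry": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "jobs":     [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# Aggregate first so that every bar is backed by exactly one row
totals = df_city.groupby(["city", "industry"], as_index=False)["jobs"].sum()

# One panel per city, bars per industry
g = sns.catplot(data=totals, x="industry", y="jobs", col="city",
                kind="bar", color="C0", height=3, aspect=0.8)
g.set_axis_labels("industry", "total jobs")
```

With many cities, passing `col_wrap` to `catplot` wraps the panels onto several rows, much as `facet_wrap` does.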
In the code below I'd like to loop through all categorical variables in "variables", and show separate boxplots of "fare" for all of them in a single plotting window. How do I do that? Thanks.
import seaborn as sns
sns.set(style="ticks")
titanic = sns.load_dataset("titanic")
variables = list(titanic.select_dtypes(include="object").columns) # list of categorical variables
# single boxplot of fare vs passenger sex
g = sns.catplot(x="sex", y="fare", kind="box", data=titanic.query("fare>0"))
g.set(yscale="log")
Update: The following looping code seems to work, but I'd like some help with cleaning up the plot (attached below) if possible, namely removing the empty subplot window and interior axes ticks/labels. Thanks again.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=2, ncols=3)
i = j = 0
for variable in variables:
    g = sns.boxplot(x=variable, y="fare", data=titanic.query("fare>0"), ax=axs[i][j])
    g.set(yscale="log")
    j += 1
    if j > 2:
        i += 1
        j = 0
Update #2: YOLO's code below does the job. Thanks!
Here's a way to do it:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15,10))
for i, c in enumerate(variables, 1):
    plt.subplot(2,3,i)
    g = sns.boxplot(x=c, y="fare",data=titanic.query("fare>0"))
    g.set(yscale="log")
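For the cleanup asked about in the update (the empty subplot and the interior tick labels), one possible approach is to share the y axis across the grid and remove the axes that were never drawn on. The sketch below is only an assumption-laden illustration: it uses a synthetic frame with made-up column names (`v1` to `v5`) instead of the titanic data, and plain `Axes.boxplot` instead of `sns.boxplot`, so that it runs without downloading anything.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the titanic frame: five categorical columns plus "fare"
rng = np.random.default_rng(0)
n = 200
variables = ["v1", "v2", "v3", "v4", "v5"]
data = pd.DataFrame({name: rng.choice(["p", "q", "r"], n) for name in variables})
data["fare"] = rng.lognormal(mean=3.0, sigma=1.0, size=n)

# sharey=True hides the y tick labels of the interior columns
fig, axs = plt.subplots(nrows=2, ncols=3, sharey=True, figsize=(12, 7))
for ax, variable in zip(axs.flat, variables):
    groups = data.groupby(variable)["fare"]
    labels = list(groups.groups)
    ax.boxplot([groups.get_group(k).to_numpy() for k in labels])
    ax.set_xticklabels(labels)
    ax.set_xlabel(variable)
    ax.set_yscale("log")

# Put a y label on the first column only
for ax in axs[:, 0]:
    ax.set_ylabel("fare")

# Delete the grid cells that received no plot
for ax in axs.flat[len(variables):]:
    ax.remove()
```

Inside the loop, `sns.boxplot(x=variable, y="fare", data=..., ax=ax)` could be swapped in for the `ax.boxplot` call; the `zip(axs.flat, variables)` pattern also replaces the manual `i`/`j` bookkeeping.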
My aim is to show a bar chart of 3-dimensional data: x is categorical, and y1 and y2 are continuous series; the bars should take their heights from y1 and their color from y2.
This does not seem to be particularly obscure to me, but I didn't find a simple / built-in way to use a bar chart to visualise three dimensions -- I'm thinking mostly for exploratory purposes, before investigating relationships more formally.
Am I missing a type of plot in the libraries? Is there a good alternative to showing 3d data?
Anyway here are some things that I've tried that aren't particularly satisfying:
Some data for these attempts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Example data with explicit (-ve) correlation in the two series
n = 10; sd = 2.5
fruits = [ 'Lemon', 'Cantaloupe', 'Redcurrant', 'Raspberry', 'Papaya',
'Apricot', 'Cherry', 'Durian', 'Guava', 'Jujube']
np.random.seed(101)
cost = np.random.uniform(3, 15, n)
harvest = 50 - (np.random.randn(n) * sd + cost)
df = pd.DataFrame(data={'fruit':fruits, 'cost':cost, 'harvest':harvest})
df.sort_values(by="cost", inplace=True) # preferable to sort during plot only
# set up several subplots to show progress.
n_colors = 5; cmap_base = "coolwarm" # a diverging map
fig, axs = plt.subplots(3,2)
ax = axs.flat
Attempt 1 uses hue for the 3rd dim data in barplot. However, this produces a single color for each value in the series, and also seems to do odd things with the bar width & spacing.
import seaborn as sns
sns.barplot(ax=ax[0], x='fruit', y='cost', hue='harvest',
data=df, palette=cmap_base)
# fix the sns barplot label orientation
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90)
Attempt 2 uses the pandas DataFrame.plot.bar with a continuous color range, then adds a colorbar (which needs a scalar mappable). I borrowed some techniques from a Medium post, among others.
import matplotlib as mpl
norm = mpl.colors.Normalize(vmin=min(df.harvest), vmax=max(df.harvest), clip=True)
mapper1 = mpl.cm.ScalarMappable(norm=norm, cmap=cmap_base)
colors1 = [mapper1.to_rgba(x) for x in df.harvest]
df.plot.bar(ax=ax[1], x='fruit', y='cost', color=colors1, legend=False)
mapper1._A = []
plt.colorbar(mapper1, ax=ax[1], label='harvest')
Attempt 3 builds on this, borrowing from https://gist.github.com/jakevdp/91077b0cae40f8f8244a to facilitate a discrete colormap.
def discrete_cmap(N, base_cmap=None):
    """Create an N-bin discrete colormap from the specified input map"""
    # from https://gist.github.com/jakevdp/91077b0cae40f8f8244a
    base = plt.get_cmap(base_cmap)  # plt.cm.get_cmap is gone in recent Matplotlib releases
    color_list = base(np.linspace(0, 1, N))
    cmap_name = base.name + str(N)
    return base.from_list(cmap_name, color_list, N)
cmap_disc = discrete_cmap(n_colors, cmap_base)
mapper2 = mpl.cm.ScalarMappable(norm=norm, cmap=cmap_disc)
colors2 = [mapper2.to_rgba(x) for x in df.harvest]
df.plot.bar(ax=ax[2], x='fruit', y='cost', color=colors2, legend=False)
mapper2._A = []
cb = plt.colorbar(mapper2, ax=ax[2], label='harvest')
cb.set_ticks(np.linspace(*mapper2.get_clim(), num=n_colors+1)) # indicate color boundaries
cb.set_ticklabels(["{:.0f}".format(t) for t in cb.get_ticks()]) # without too much precision
Finally, attempt 4 gives up on showing all three dimensions in one plot and presents them in two parts.
sns.barplot(ax=ax[4], x='fruit', y='cost', data=df, color='C0')
ax[4].set_xticklabels(ax[4].get_xticklabels(), rotation=90)
sns.regplot(x='harvest', y='cost', data=df, ax=ax[5])
(1) is unusable; I'm clearly not using hue as intended. (2) is OK with 10 series, but with more series it is harder to tell whether a given sample is above or below average, for instance. (3) is quite nice and scales to 50 bars OK, but it is far from "out-of-the-box" and too involved for a quick analysis. Moreover, setting mapper._A = [] seems like a hack, but the code fails without it. Perhaps the couple of lines in (4) are a better way to go.
To come back to the question again: is it possible to easily produce a bar chart that displays 3d data? I've focused on using a small number of colors for the 3rd dimension for easier identification of trends, but I'm open to other suggestions.
I've posted a solution as well, which uses a lot of custom code to achieve what I can't really believe is not built in some graphing library of python.
edit:
the following code, using R's ggplot gives a reasonable approximation to (2) with built-in commands.
ggplot(data = df, aes(x =reorder(fruit, +cost), y = cost, fill=harvest)) +
geom_bar(data=df, aes(fill=harvest), stat='identity') +
scale_fill_gradientn(colours=rev(brewer.pal(7,"RdBu")))
The first 2 lines are more or less the minimal code for barplot, and the third changes the color palette.
So if this ease were available in python I'd love to know about it!
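For comparison, a fairly short Matplotlib-only version of the continuous-color idea in (2) might look like the sketch below. The fruit names and numbers are illustrative values, not the random data generated earlier, and `mapper.set_array([])` is included only as a harmless fallback for older Matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Illustrative values only (not the random data generated earlier)
fruits = ["Lemon", "Papaya", "Cherry", "Guava"]
cost = np.array([4.0, 7.5, 9.0, 12.0])        # bar height
harvest = np.array([45.0, 41.0, 43.0, 36.0])  # mapped to bar colour

cmap = plt.get_cmap("coolwarm")
norm = Normalize(vmin=harvest.min(), vmax=harvest.max())

fig, ax = plt.subplots()
bars = ax.bar(fruits, cost, color=cmap(norm(harvest)))
ax.set_ylabel("cost")

# The mappable only carries norm + cmap so the colorbar knows the scale
mapper = ScalarMappable(norm=norm, cmap=cmap)
mapper.set_array([])  # harmless; older Matplotlib versions required an array
fig.colorbar(mapper, ax=ax, label="harvest")
```

The colors are computed directly as `cmap(norm(values))`, so the mappable is needed only for the colorbar.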
I'm posting an answer that meets my aims: it is simple at the point of use, remains useful with ~100 bars, and, by leveraging the Fisher-Jenks 1d classifier from PySAL, handles outliers quite well (see the post about d3 coloring). Overall, though, it is quite involved (50+ lines in the BinnedColorScaler class, posted at the bottom).
# set up the color binner
quantizer = BinnedColorScaler(df.harvest, k=5, cmap='coolwarm' )
# and plot dataframe with it.
df.plot.bar(ax=ax, x='fruit', y='cost',
color=df.harvest.map(quantizer.map_by_class))
quantizer.add_legend(ax, title='harvest') # show meaning of bins in legend
This relies on the following class, which uses a nice 1d classifier from PySAL and borrows ideas from the geoplot/geopandas libraries.
from pysal.esda.mapclassify import Fisher_Jenks
class BinnedColorScaler(object):
    '''
    give this an array-like data set, a bin count, and a colormap name, and it
    - quantizes the data
    - provides a bin lookup and a color mapper that can be used by pandas for selecting artist colors
    - provides a method for a legend to display the colors and bin ranges
    '''
    def __init__(self, values, k=5, cmap='coolwarm'):
        self.base_cmap = plt.get_cmap(cmap) # can be None, text, or a cmap instance
        self.bin_colors = self.base_cmap(np.linspace(0, 1, k)) # evenly-spaced colors
        # produce bins - see _discrete_colorize in geoplot.geoplot.py:2372
        self.binning = Fisher_Jenks(np.array(values), k)
        self.bin_edges = np.array([self.binning.yb.min()] + self.binning.bins.tolist())
        # some text for the legend (as per geopandas approx)
        self.categories = [
            '{0:.2f} - {1:.2f}'.format(self.bin_edges[i], self.bin_edges[i + 1])
            for i in range(len(self.bin_edges) - 1)]

    def map_by_class(self, val):
        ''' return a color for a given data value '''
        #bin_id = self.binning.find_bin(val)
        bin_id = self.find_bin(val)
        return self.bin_colors[bin_id]

    def find_bin(self, x):
        ''' unfortunately the pysal implementation seems to fail on bin edge
        cases :(. So reimplement with the way we expect here.
        '''
        # wow, subtle. just <= instead of < in the uptos
        x = np.asarray(x).flatten()
        uptos = [np.where(value <= self.binning.bins)[0] for value in x]
        # bail upwards: values above every edge go into the last bin
        bins = [v.min() if v.size > 0 else len(self.binning.bins)-1 for v in uptos]
        bins = np.asarray(bins)
        if len(bins) == 1:
            return bins[0]
        else:
            return bins

    def add_legend(self, ax, title=None, **kwargs):
        ''' add legend showing the discrete colors and the corresponding data range '''
        # following the geoplot._paint_hue_legend functionality, approx.
        # generate a patch for each color in the set
        artists, labels = [], []
        for i in range(len(self.bin_colors)):
            labels.append(self.categories[i])
            artists.append(mpl.lines.Line2D(
                (0,0), (1,0), mfc='none', marker='None', ls='-', lw=10,
                color=self.bin_colors[i]))
        return ax.legend(artists, labels, fancybox=True, title=title, **kwargs)
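If the PySAL dependency is not wanted, the same binned-color idea can be approximated in a few lines. The sketch below is a simplification, not the class above: it swaps in quantile edges for the Fisher-Jenks breaks (an assumption that changes how outliers are treated), and the data are illustrative values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch

values = np.arange(1.0, 11.0)   # illustrative third-dimension data: 1..10
heights = values[::-1]          # illustrative bar heights
k = 5

# Quantile edges stand in for the Fisher-Jenks breaks (assumption)
edges = np.quantile(values, np.linspace(0, 1, k + 1))
# right=True puts a value equal to an edge into the lower bin (the <= rule)
bin_ids = np.digitize(values, edges[1:-1], right=True)

# k evenly spaced colors from the base colormap, one per bin
bin_colors = plt.get_cmap("coolwarm")(np.linspace(0, 1, k))
fig, ax = plt.subplots()
ax.bar(np.arange(len(values)), heights, color=bin_colors[bin_ids])

# Legend entries showing the value range covered by each color
handles = [Patch(color=bin_colors[i],
                 label="{:.2f} - {:.2f}".format(edges[i], edges[i + 1]))
           for i in range(k)]
ax.legend(handles=handles, title="harvest")
```

Quantile bins always hold roughly equal counts, so they will not isolate outliers the way a natural-breaks classifier can.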
I have:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
# Generate random data
set1 = np.random.randint(0, 40, 24)
set2 = np.random.randint(0, 100, 24)
# Put into dataframe and plot
df = pd.DataFrame({'set1': set1, 'set2': set2})
data = pd.melt(df)
sb.swarmplot(data=data, x='variable', y='value')
The two random distributions plotted with seaborn's swarmplot function:
I want the individual plots of both distributions to be connected with a colored line such that the first data point of set 1 in the dataframe is connected with the first data point of set 2.
I realize that this would probably be relatively simple without seaborn but I want to keep the feature that the individual data points do not overlap.
Is there any way to access the individual plot coordinates in the seaborn swarmplot function?
EDIT: Thanks to @Mead, who pointed out a bug in my post prior to 2021-08-23 (I forgot to sort the locations in the prior version).
I gave the nice answer by Paul Brodersen a try, and despite him saying that
Madness lies this way
... I actually think it's pretty straightforward and yields nice results:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Generate random data
rng = np.random.default_rng(42)
set1 = rng.integers(0, 40, 5)
set2 = rng.integers(0, 100, 5)
# Put into dataframe
df = pd.DataFrame({"set1": set1, "set2": set2})
print(df)
data = pd.melt(df)
# Plot
fig, ax = plt.subplots()
sns.swarmplot(data=data, x="variable", y="value", ax=ax)
# Now connect the dots
# Find idx0 and idx1 by inspecting the elements returned from ax.get_children()
# ... or find a way to automate it
idx0 = 0
idx1 = 1
locs1 = ax.get_children()[idx0].get_offsets()
locs2 = ax.get_children()[idx1].get_offsets()
# before plotting, we need to sort so that the data points
# correspond to each other as they did in "set1" and "set2"
sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)
# revert "ascending sort" through sort_idxs2.argsort(),
# and then sort into order corresponding with set1
locs2_sorted = locs2[sort_idxs2.argsort()][sort_idxs1]
for i in range(locs1.shape[0]):
    x = [locs1[i, 0], locs2_sorted[i, 0]]
    y = [locs1[i, 1], locs2_sorted[i, 1]]
    ax.plot(x, y, color="black", alpha=0.1)
It prints:
set1 set2
0 3 85
1 30 8
2 26 69
3 17 20
4 17 9
And you can see that the data is linked correspondingly in the plot.
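The argsort bookkeeping can be checked without drawing anything. The sketch below assumes, as the code above does, that the offsets of each swarm come back in ascending order of value; the `locs1` and `locs2` arrays are stand-ins built by hand rather than read from a plot, using the values printed above.

```python
import numpy as np

set1 = np.array([3, 30, 26, 17, 17])
set2 = np.array([85, 8, 69, 20, 9])

# Stand-ins for get_offsets(): x is the category position, y the sorted values
locs1 = np.column_stack([np.zeros(5), np.sort(set1)])
locs2 = np.column_stack([np.ones(5), np.sort(set2)])

sort_idxs1 = np.argsort(set1, kind="stable")
sort_idxs2 = np.argsort(set2, kind="stable")
# undo the ascending sort of set2, then reorder to follow sorted set1
locs2_sorted = locs2[sort_idxs2.argsort()][sort_idxs1]

# y values that would be joined by each line
pairs = [(int(a), int(b)) for a, b in zip(locs1[:, 1], locs2_sorted[:, 1])]
print(pairs)
```

Every pair is a row of the original dataframe, which is what the connecting lines are meant to show.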
Sure, it's possible (but you really don't want to).
seaborn.swarmplot returns the axis instance (here: ax). You can call ax.get_children() to get all plot elements. You will see that for each set of points there is an element of type PathCollection. You can determine the x, y coordinates by using the PathCollection.get_offsets() method.
I do not suggest you do this! Madness lies this way.
I suggest you have a look at the source code (found here), and derive your own _PairedSwarmPlotter from _SwarmPlotter and change the draw_swarmplot method to your needs.
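For reference, the `get_children` / `get_offsets` access described above can be sketched as follows. Two plain `scatter` calls stand in for the two swarms here (an assumption made so the example does not depend on seaborn); filtering by type avoids hard-coding child indices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PathCollection

fig, ax = plt.subplots()
# Two scatter calls stand in for the two swarms (assumption)
ax.scatter([0, 0, 0], [3, 17, 30])
ax.scatter([1, 1, 1], [8, 20, 85])

# Keep only the point collections among all children of the axes
point_sets = [child for child in ax.get_children()
              if isinstance(child, PathCollection)]
# One (n, 2) array of x, y coordinates per point set
offsets = [np.asarray(c.get_offsets()) for c in point_sets]
```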
I am trying to plot several different things in scatter plots by having several subplots and iterating over the remaining categories, but the plots only display the first iteration without throwing any error. To clarify, here is an example of what the data actually look like:
a kind state property T
0 0.905618 I dry prop1 10
1 0.050311 I wet prop1 20
2 0.933696 II dry prop1 30
3 0.114824 III wet prop1 40
4 0.942719 IV dry prop1 50
5 0.276627 II wet prop2 10
6 0.612303 III dry prop2 20
7 0.803451 IV wet prop2 30
8 0.257816 II dry prop2 40
9 0.122468 IV wet prop2 50
And this is how I generated the example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
kinds = ['I','II','III','IV']
states = ['dry','wet']
props = ['prop1','prop2']
T = [10,20,30,40,50]
a = np.random.rand(10)
k = ['I','I','II','III','IV','II','III','IV','II','IV']
s = ['dry','wet','dry','wet','dry','wet','dry','wet','dry','wet']
p = ['prop1','prop1','prop1','prop1','prop1','prop2','prop2','prop2','prop2','prop2']
t = [10,20,30,40,50,10,20,30,40,50]
df = pd.DataFrame(index=range(10),columns=['a','kind','state','property','T'])
df['a']=a
df['kind']=k
df['state']=s
df['property']=p
df['T']=t
print(df)
Next, I am going to generate 2 rows and 2 columns of subplots, to display variabilities in property1 and property2 in wet and dry states. So I basically slice my dataframe into several smaller ones like this:
first = df[(df['state']=='dry')&(df['property']=='prop1')]
second = df[(df['state']=='wet')&(df['property']=='prop1')]
third = df[(df['state']=='dry')&(df['property']=='prop2')]
fourth = df[(df['state']=='wet')&(df['property']=='prop2')]
dfs = [first,second,third,fourth]
In each of these subplots, which correspond to certain laboratory conditions, I want to plot the values of a versus T for different kinds of samples. To distinguish between the kinds of samples, I assign different colours and markers to them. So here is my plotting script:
fig = plt.figure(figsize=(8,8.5))
gs = gridspec.GridSpec(2,2, hspace=0.4, wspace=0.3)
colours = ['r','b','g','gold']
symbols = ['v','v','^','^']
titles=['dry 1','wet 1','dry 2','wet 2']
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df['T'],df['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
plt.show()
But the result only plots the first iteration, in this case kind I in red triangles. If I remove this first item from the iterating lists, it only plots the first variable (kind II in blue triangles).
What am I doing wrong?
The figure looks like this, but I would like to have each subplot accordingly populated with red and blue and green and gold markers.
(Please note this happens with my real data as well, so the problem should not be in the way I generate the example.)
Your problem lies within the inner for loop. By writing df = df[df['kind']==r], you replace the original df with the version filtered for I. Then, in the next iteration of the loop, where you would filter for II, no further data is found. Therefore you also get no error message, as the code is otherwise 'correct'. If you rewrite the relevant piece of code like this:
for no, df in enumerate(dfs):
    ax = fig.add_subplot(gs[no])
    for i, r in enumerate(kinds):
        #print i, r
        df2 = df[df['kind']==r]
        c = colours[i]
        m = symbols[i]
        plt.scatter(df2['T'],df2['a'],c=c,s=50.0, marker=m, edgecolor='k')
    ax = plt.xlabel('T')
    ax = plt.xticks(T)
    ax = plt.ylabel('A')
    ax = plt.title(titles[no],fontsize=12,alpha=0.75)
It should work just fine. Tested on Python 3.5.
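A variant that sidesteps the overwriting problem altogether is to let `groupby` hand out a separate sub-frame per kind, so the panel's frame is never reassigned. This is only a sketch: the `style` dictionary and the `panels` list are names introduced here for illustration, and the data are the ten rows shown in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "a": [0.905618, 0.050311, 0.933696, 0.114824, 0.942719,
          0.276627, 0.612303, 0.803451, 0.257816, 0.122468],
    "kind": ["I", "I", "II", "III", "IV", "II", "III", "IV", "II", "IV"],
    "state": ["dry", "wet"] * 5,
    "property": ["prop1"] * 5 + ["prop2"] * 5,
    "T": [10, 20, 30, 40, 50] * 2,
})
# colour and marker per kind (hypothetical helper mapping)
style = {"I": ("r", "v"), "II": ("b", "v"), "III": ("g", "^"), "IV": ("gold", "^")}

fig, axes = plt.subplots(2, 2, figsize=(8, 8.5))
panels = [("dry", "prop1"), ("wet", "prop1"), ("dry", "prop2"), ("wet", "prop2")]
for ax, (state, prop) in zip(axes.flat, panels):
    panel = df[(df["state"] == state) & (df["property"] == prop)]
    # groupby yields a fresh sub-frame per kind; `panel` is never overwritten
    for kind, sub in panel.groupby("kind"):
        colour, marker = style[kind]
        ax.scatter(sub["T"], sub["a"], c=colour, marker=marker,
                   s=50, edgecolor="k", label=kind)
    ax.set(xlabel="T", ylabel="A", title=f"{state} {prop}")
```

Calling the methods on `ax` rather than through `plt` also makes it explicit which panel each scatter lands in.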
I have the following dataframe:
Av_Temp Tot_Precip
278.001 0
274 0.0751864
270.294 0.631634
271.526 0.229285
272.246 0.0652201
273 0.0840059
270.463 0.0602944
269.983 0.103563
268.774 0.0694555
269.529 0.010908
270.062 0.043915
271.982 0.0295718
and want to plot a boxplot where the x-axis is 'Av_Temp' divided into equal-sized bins (say 2 in this case), and the y-axis shows the corresponding range of values for Tot_Precip. I have the following code (thanks to "Find pandas quartiles based on another column"); however, when I plot the boxplots, they are drawn one on top of another. Any suggestions?
import numpy
import pandas
import matplotlib.pyplot as plt
from matplotlib import cbook
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
grp_df = df.groupby(expl_var+'_Deciles').apply(lambda x: numpy.array(x[cname]))
fig, ax = plt.subplots()
for i in range(len(grp_df)):
    box_arr = grp_df[i]
    box_arr = box_arr[~numpy.isnan(box_arr)]
    stats = cbook.boxplot_stats(box_arr, labels = str(i))
    ax.bxp(stats)
ax.set_yscale('log')
plt.show()
Since you're using pandas already, why not use the boxplot method on dataframes?
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
ax = df.boxplot(by='Av_Temp_Deciles', column='Tot_Precip')
ax.set_yscale('log')
That produces this: http://i.stack.imgur.com/20KPx.png
If you don't like the labels, throw in a
plt.xlabel('');plt.suptitle('');plt.title('')
If you want a standard boxplot, the above should be fine. My understanding of the separation of boxplot into boxplot_stats and bxp is to allow you to modify or replace the stats generated and fed to the plotting routine. See https://github.com/matplotlib/matplotlib/pull/2643 for some details.
If you need to draw a boxplot with non-standard stats, you can use boxplot_stats on 2D numpy arrays, so you only need to call it once. No loops required.
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
# I moved your nan check into the df apply function
grp_df = df.groupby('Av_Temp_Deciles').apply(lambda x: numpy.array(x[cname][~numpy.isnan(x[cname])]))
# boxplot_stats can take a 2D numpy array of data, and a 1D array of labels
# stats is now a list of dictionaries of stats, one dictionary per quantile
stats = cbook.boxplot_stats(grp_df.values, labels=grp_df.index)
# now it's a one-shot plot, no loops
fig, ax = plt.subplots()
ax.bxp(stats)
ax.set_yscale('log')
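If the per-bin loop is kept, the overlap most likely comes from calling `ax.bxp` once per bin: each call receives a single box and draws it at the default position 1. A minimal sketch of one way around that is to collect the stats dictionaries in the loop and call `bxp` once. It uses the twelve rows from the question and leaves out the log scale for simplicity; `observed=True` is passed only to keep the grouping to the bins that actually occur.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cbook

df = pd.DataFrame({
    "Av_Temp": [278.001, 274, 270.294, 271.526, 272.246, 273,
                270.463, 269.983, 268.774, 269.529, 270.062, 271.982],
    "Tot_Precip": [0, 0.0751864, 0.631634, 0.229285, 0.0652201, 0.0840059,
                   0.0602944, 0.103563, 0.0694555, 0.010908, 0.043915, 0.0295718],
})
df["Av_Temp_Deciles"] = pd.qcut(df["Av_Temp"], 2)

stats = []
for interval, group in df.groupby("Av_Temp_Deciles", observed=True):
    values = group["Tot_Precip"].dropna().to_numpy()
    # boxplot_stats returns a list with one dict per data set
    stats.extend(cbook.boxplot_stats(values, labels=[str(interval)]))

fig, ax = plt.subplots()
ax.bxp(stats)  # one call: boxes land at positions 1, 2, ...
```

Alternatively, each call inside the loop could be given its own location through the `positions` argument of `bxp`.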