I have a set of data that comes from two different sources, and I have multiple sets graphed together. So essentially 6 scatterplots with error bars (all different colors), and each scatterplot has two sources.
Basically, I want the blue scatterplot to have two different markers, 'o' and 's'. I currently do this by plotting each point individually in a loop and checking whether the source is 1 or 2. If it is 1, it plots an 's'; if it is 2, it plots an 'o'.
However, this method does not really allow for a legend (Data1, Data2, ..., Data6).
Is there a better way of doing this?
EDIT:
I want a cleaner method for this, something along the lines of
x=[1,2,3]
y=[4,5,6]
m=['o','s','^']
plt.scatter(x,y,marker=m)
But this returns the error: Unrecognized marker style
A more pythonic way (but still a loop) might be something like
import matplotlib.pyplot as plt
x=[1,2,3]
y=[4,5,6]
l=['data1','data2','data3']
m=['ob','sb','^b']
f,a = plt.subplots(1,1)
# one plot call per point, each with its own format string and label
for xi,yi,fmt,lab in zip(x,y,m,l):
    a.plot(xi,yi,fmt,label=lab)
plt.legend(loc='lower right')
plt.xlim(0,4)
plt.ylim(3,7)
But I guess this is not the most efficient way if you have lots of datapoints.
If you want to use scatter try something like
m=['o','s','^']
f,a = plt.subplots(1,1)
# x, y and l are the lists defined above
for xi,yi,m1,l1 in zip(x,y,m,l):
    a.scatter(xi,yi,marker=m1,label=l1)
I'm pretty sure, there is also a possibility to apply ** and dicts here.
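As a rough sketch of the "** and dicts" idea: each point gets its own dict of keyword arguments, which is unpacked into scatter(). The dict contents, the fixed blue colour and the Agg backend line are assumptions made for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

# One dict of keyword arguments per point; the keys are ordinary scatter() kwargs.
styles = [
    {"marker": "o", "label": "data1"},
    {"marker": "s", "label": "data2"},
    {"marker": "^", "label": "data3"},
]

fig, ax = plt.subplots()
for xi, yi, style in zip(x, y, styles):
    # ** unpacks the dict into marker=..., label=...
    ax.scatter(xi, yi, color="b", **style)
ax.legend(loc="lower right")
```

This is still one scatter call per point, so it has the same efficiency caveat as the loops above.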
UPDATE:
Instead of looping over the plot command, you can use the ability of matplotlib's plot function to read an arbitrary number of x,y,fmt groups; see the docs.
x=np.random.random((3,6))
y=np.random.random((3,6))
l=['data1','data2','data3']
m=['ob','sb','^b']
plt.plot(*[i[j] for i in zip(x,y,m) for j in range(3)])
plt.legend(l,loc='lower right')
Calling plot in a loop is fine. You just need to keep the list of lines returned by plot and use fig.legend to create a legend for the whole figure. See http://matplotlib.org/examples/pylab_examples/figlegend_demo.html
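A minimal sketch of that suggestion, assuming made-up data and dataset names: the Line2D objects returned by plot are collected in a list and then handed to fig.legend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

x = [1, 2, 3]
# Made-up values, only for illustration
datasets = {"data1": [4, 5, 6], "data2": [5, 6, 7], "data3": [6, 7, 8]}

fig, ax = plt.subplots()
lines = []
for name, y in datasets.items():
    # plot() returns a list of Line2D objects; keep the single line it created
    (line,) = ax.plot(x, y, "o")
    lines.append(line)

# One figure-level legend built from the collected handles
fig.legend(lines, list(datasets), loc="lower right")
```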
Seconding @tcaswell's comments: .scatter() returns a collections.PathCollection, which provides a fast way of plotting a large number of identically shaped objects. You can use a loop to plot the data as many scatter plots (and many different datasets), but in my opinion that loses all the speed benefit provided by .scatter().
With these being said, it is however not true that the dots have to be identical in a scatter plot. You can have different linewidth, edgecolor and many other things. But the dots have to be the same shape. See this example, assigning different colors (and only plot one dataset):
>>> sc=plt.scatter(x, y, label='test')
>>> sc.set_color(['r','g','b'])
>>> plt.legend()
See details in http://matplotlib.org/api/collections_api.html.
These were all alright, but not really what I was looking for. The problem was how I parsed my data and how I could add a legend in a way that wouldn't mess that up. Since I used a for-loop and plotted each point individually, depending on whether it was measured at observation location 1 or 2, any legend I made had over 50 entries. So I plotted my data as full sets (invisibly and with no change in symbols), then again in color with the varying symbols. This worked better. Thanks, though.
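For comparison, here is a different approach from the one just described, as a sketch only: each dataset is split by source with a boolean mask, so there are two errorbar calls per dataset instead of one call per point, and only one of them carries a legend label. The data, colours and the source-to-marker mapping are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
markers = {1: "s", 2: "o"}   # source -> marker, as in the question
colors = ["b", "g", "r"]     # one colour per dataset (assumed)

fig, ax = plt.subplots()
for i, color in enumerate(colors):
    # Made-up data: 10 points with errors and a source flag of 1 or 2
    x = np.arange(10)
    y = rng.random(10) + i
    err = np.full(10, 0.1)
    source = np.tile([1, 2], 5)

    for src, marker in markers.items():
        mask = source == src
        # Only the first group of each dataset gets a label;
        # labels starting with "_" are skipped by legend()
        label = f"Data{i + 1}" if src == 1 else "_nolegend_"
        ax.errorbar(x[mask], y[mask], yerr=err[mask],
                    fmt=marker, color=color, label=label)

ax.legend()
```

The legend then has one entry per dataset, shown with the marker of source 1.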
Related
I have two sets of x-y data, that I'd like to plot as a scatterplot, using sns.scatterplot. I want to highlight two different things:
the difference between different types of data
the difference between the first and the second set of x-y data
For the first, I'm using the inbuilt hue and style, for the second, I'd like to have filled vs. unfilled markers, but I'm wondering how to do so, without doing it all by hand with plt.scatter, where I would have to implement all the magic of sns.scatterplot by hand.
long version, with MWE:
I have X and Y data, and also have some type info for each point of data. I.e. I have a sample 1 which is of type A and yields X=11, Y=21 at the first sampling and X=10, Y=21 at the second sampling. And the same deal for sample 2 of type A, sample 3 of type B and so on (see example file at the end).
So I want to visualize the differences between the two samplings, like so:
data = pd.read_csv('testdata.csv', sep=';', index_col=0, header=0)
# data for the csv at the end of the question
sns.scatterplot(x=data['x1'], y=data['y1'])
sns.scatterplot(x=data['x2'], y=data['y2'])
Nice, I can easily see that the first sampling seems to show a linear relationship between X and Y, whereas the second one shows some differences. Now what interests me, is which type of data is affected the most by these differences and that's why I'm using seaborn, instead of pure matplotlib: sns.scatterplot has a lot of nice stuff built in, e.g. hue (and style, to get symbols for printing in b&w):
sizes = (200, 200) # to make stuff more visible
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes)
OK, so I can easily distinguish my data types, but I lost all information about which sampling is which. The obvious solution, to me, seems to be to use filled markers for one and unfilled ones for the other.
However, I can't seem to do that.
I'm aware of this question/answer, using fc='none' which is not documented in the sns.scatterplot documentation but this fails, when also using hue:
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes, fc='none')
As you can see, the second set of markers simply vanishes (there are some artifacts in the B data, where hints of a white cross are visible).
I can kinda fix that by setting ec=...:
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
                size=data['type'], sizes=sizes, fc='none',
                ec=('b','b','y','y','y', 'y', 'g', 'g', 'g','r'))
# I would have to define the proper colors, but for this example, they're close enough
but that obviously has a few issues:
the markers in the legend aren't fitting anymore, neither color nor fill
and I'm already halfway in doing-it-all-by-hand territory anyways, e.g. my ec= would fail when I want to plot a new dataset with sample_no 11.
How can I do that with seaborn? Filled vs. unfilled seems quite an obvious flag for scatterplots, but I can't seem to find it.
data for testdata.csv:
sample_no;type;x1;y1;x2;y2
1;A;11;21;10;21
2;A;12;22;12;21
3;B;13;23;13.2;22.8
4;B;14;24;13.8;24
5;B;15;25;14.8;25.2
6;B;16;26;16.3;25.9
7;C;17;27;18;28
8;C;18;28;20;26
9;C;19;29;20;30
10;D;20;30;19;28
I would like to create a Seaborn scatter-plot, using the following dataframe:
df = pd.DataFrame({'A':[1,2,3,4],'B':[2,4,6,8],'C':['y','y','n','n'],'D':[1,1,2,2]})
In my graph, A should be the x-variable and B the y-variable. Furthermore, I would like to color based on column D. Finally, when C='y' the marker should be open-faced (no facecolor), and when C='n' the marker should be filled. My original idea was to use the hue and style parameters:
sns.scatterplot(x='A', y='B',
                data=df, hue='D', style='C')
However, I did not manage to obtain the graph I am looking for. Could somebody help me with this? Thank you in advance.
One cannot specify entire marker styles (so the 'marker' and 'fillstyle' keys in your case) for matplotlib yet. Have a look at the answer to this post.
So the only thing left for you is to use different markers right away and specify them (as a list or dictionary):
sns.scatterplot(data=df, x='A', y='B', hue='D', style='C', markers=['o', 's'])
plt.show()
Apparently, it is very hard to even create non-filled markers in seaborn, as this post explains. The only option is to do some matplotlib-seaborn-hybrid thing... So if you accept to plot things twice onto the same axis (one for a filled marker and one for the unfilled markers), you still have to dig yourself into the quirks of seaborn...
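If falling back to plain matplotlib is acceptable, a rough sketch could look like the following. It bypasses seaborn entirely, uses plain lists in place of the DataFrame columns, and the palette dict is an assumed colour choice; no legend is built here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Same values as the DataFrame columns A, B, C, D in the question
A = [1, 2, 3, 4]
B = [2, 4, 6, 8]
C = ["y", "y", "n", "n"]
D = [1, 1, 2, 2]

palette = {1: "tab:blue", 2: "tab:orange"}  # assumed colours for the values of D

fig, ax = plt.subplots()
for a, b, c, d in zip(A, B, C, D):
    color = palette[d]
    # C == 'y' -> open marker (edge only); C == 'n' -> filled marker
    face = "none" if c == "y" else color
    ax.scatter(a, b, s=80, marker="o", facecolors=face, edgecolors=color)
```

With many points it would be better to group by (C, D) and call scatter once per group, but the per-point loop keeps the sketch short.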
I need to do what has been explained for MATLAB here:
How to show legend for only a specific subset of curves in the plotting?
But using Python instead of MATLAB.
Brief summary of my goal: when plotting for example three curves in the following way
from matplotlib import pyplot as plt
a=[1,2,3]
b=[4,5,6]
c=[7,8,9]
# these are the curves
plt.plot(a)
plt.plot(b)
plt.plot(c)
plt.legend(['a','nothing','c'])
plt.show()
Instead of the word "nothing", I would like not to have anything there.
Using '_' will suppress the legend for a particular entry, as follows (continue reading for how to show an underscore _ as a legend label). This solution is motivated by a recent post by @ImportanceOfBeingEarnest here.
plt.legend(['a','_','c'])
I would also avoid the way you are setting legends right now, because with it you have to make sure that the plot commands are in the same order as the legend entries. Rather, put the label in the respective plot commands to avoid errors.
That being said, the straightforward and easiest solution (in my opinion) is to do the following
plt.plot(a, label='a')
plt.plot(b)
plt.plot(c, label='c')
plt.legend()
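To check which entries end up in the legend, here is a small sketch using get_legend_handles_labels on an explicit Axes (the fig/ax names and the Agg backend line are additions for illustration). The unlabelled curve gets an internal label starting with an underscore and is left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]

fig, ax = plt.subplots()
ax.plot(a, label="a")
ax.plot(b)              # no label given, so this curve is skipped by the legend
ax.plot(c, label="c")

# Only artists with a label that does not start with "_" are returned
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
print(labels)  # ['a', 'c']
```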
As @Lucas pointed out in a comment, you might want to show an actual underscore _ as the label for plot b. You can do that with an escaped underscore in math text (a raw string avoids Python's invalid-escape warning):
plt.legend(['a',r'$\_$','c'])
I used a code like:
g = sns.pairplot(df.loc[:,['column1','column2','column3','column4','column5']])
g.map_offdiag(plt.hexbin, gridsize=(20,20))
This gives me a pairplot, and I expect the upper- and lower-triangle plots to be mirrored. The plots look like this:
I thought the histograms might be the problem, so I tried to tighten the axes using plt.axis('tight') and plt.autoscale(enable=True, axis='y', tight=True), but nothing changed. I also got rid of the diagonal plots (made them invisible), but the triangle plots are still not mirrored. Why, and how can I fix it?
Although I still do not understand why pairplot behaves this way here, I found a workaround: I access each plot within the pairplot individually and set the limits manually.
g.axes[I,J].set_ylim(df.column3.min(),df.column3.max())
In this case, I had to repeat this piece of code 5 times, where I = 2 and J = 0,1,2,3,4.
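The repetition can be folded into a loop. The sketch below uses a plain 5x5 grid from plt.subplots as a stand-in for g.axes, with random data and the column names from the question; with a real pairplot the same indexing would apply to g.axes. It also extends the idea to every row rather than only I = 2, which is an assumption about what is wanted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
columns = ["column1", "column2", "column3", "column4", "column5"]
data = {name: rng.normal(size=100) for name in columns}  # made-up data

# Stand-in for g.axes: a 5x5 array of Axes, like the one a pairplot exposes
fig, axes = plt.subplots(len(columns), len(columns))

for i, row_name in enumerate(columns):
    lo, hi = data[row_name].min(), data[row_name].max()
    for j in range(len(columns)):
        if i != j:  # leave the diagonal panels alone
            axes[i, j].set_ylim(lo, hi)
```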
In the graphic below, I want to put in a legend for the calendar plot. The calendar plot was made using ax.plot(...,label='a') and drawing rectangles in a 52x7 grid (52 weeks, 7 days per week).
The legend is currently made using:
plt.gca().legend(loc="upper right")
How do I correct this legend to something more like a colorbar? Also, the colorbar should be placed at the bottom of the plot.
EDIT:
Uploaded code and data for reproducing this here:
https://www.dropbox.com/sh/8xgyxybev3441go/AACKDiNFBqpsP1ZttsZLqIC4a?dl=0
Aside - existing bugs
The code you put on the dropbox doesn't work "out of the box". In particular - you're trying to divide a datetime.timedelta by a numpy.timedelta64 in two places and that fails.
You do your own normalisation and colour mapping (calling into color_list based on an int() conversion of your normalised value). You subtract 1 from this and you don't need to - you already floor the value by using int(). The result of doing this is that you can get an index of -1 which means your very smallest values are incorrectly mapped to the colour for the maximum value. This is most obvious if you plot column 'BIOM'.
I've hacked this by adding a tiny value (0.00001) to the total range of the values that you divide by. It's a hack - I'm not sure that this method of mapping is at all the best use of matplotlib, but that's a different question entirely.
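To illustrate the index problem without the original script, here is a small sketch. The function name color_index and its arguments are hypothetical, not taken from the code on Dropbox, and it clamps the index rather than padding the range as the hack above does.

```python
def color_index(value, min_val, max_val, n_colors):
    """Map a value in [min_val, max_val] to an index in 0..n_colors-1."""
    if max_val == min_val:
        return 0
    fraction = (value - min_val) / (max_val - min_val)
    # int() already floors non-negative input, so no "- 1" is needed;
    # clamping handles value == max_val, which would otherwise give n_colors
    return min(int(fraction * n_colors), n_colors - 1)


# The buggy variant: the smallest value maps to -1, i.e. the last colour
buggy_index = int(0.0 * 9) - 1

lowest = color_index(0, 0, 10, 9)
highest = color_index(10, 0, 10, 9)
```

Here lowest is 0 and highest is 8, so both ends of the range stay inside a 9-colour list.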
Solution adapting your code
With those bugs fixed, and after adding a last subplot below all the existing ones (i.e. replacing 3 with 4 in all your calls to subplot2grid()), you can do the following:
Replace your
plt.gca().legend(loc="upper right")
with
# plot an overall colorbar type legend
# Grab the new axes object to plot the colorbar on
ax_colorbar = plt.subplot2grid((4,num_yrs), (3,0),rowspan=1,colspan=num_yrs)
mappableObject = matplotlib.cm.ScalarMappable(cmap = palettable.colorbrewer.sequential.BuPu_9.mpl_colormap)
mappableObject.set_array(numpy.array(df[col_name]))
col_bar = fig.colorbar(mappableObject, cax = ax_colorbar, orientation = 'horizontal', boundaries = numpy.arange(min_val,max_val,(max_val-min_val)/10))
# You can change the boundaries kwarg to either make the scale look less boxy (increase 10)
# or to get different values on the tick marks, or even omit it altogether to let
# matplotlib work the boundaries out for itself
col_bar.set_label(col_name)
ax_colorbar.set_title(col_name + ' color mapping')
I tested this with two of your columns ('NMN' and 'BIOM') and on Python 2.7 (I assume you're using Python 2.x given the print statement syntax)
The finalised code that works directly with your data file is in a gist here
You get
How does it work?
It creates a ScalarMappable object that matplotlib can use to map values to colors. It sets the array to base this map on to all the values in the column you are dealing with. It then uses Figure.colorbar() to add the colorbar, passing in the mappable object so that the labels are correct. I've added boundaries so that the minimum value is shown explicitly; you can omit that if you want matplotlib to sort that out for itself.
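A stripped-down version of the same idea, independent of the original script: it uses matplotlib's built-in "BuPu" colormap instead of palettable, an explicit Normalize, and made-up values, all of which are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

values = np.array([3.0, 7.5, 12.0, 20.0])  # made-up stand-in for df[col_name]

# A tall main panel and a thin axes underneath for the colorbar
fig, (ax_main, ax_colorbar) = plt.subplots(
    2, 1, gridspec_kw={"height_ratios": [10, 1]}
)

# The mappable ties a value range (norm) to a colormap
norm = Normalize(vmin=values.min(), vmax=values.max())
mappable = ScalarMappable(norm=norm, cmap="BuPu")
mappable.set_array(values)

col_bar = fig.colorbar(mappable, cax=ax_colorbar, orientation="horizontal")
col_bar.set_label("value")
```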
P.S. I've set the colormap to palettable.colorbrewer.sequential.BuPu_9.mpl_colormap, matching your get_colors() function, which gets these colours as a 9-member list. I strongly recommend importing the colormap you want to use under a nice name to make the use of mpl_colors and mpl_colormap easier to understand, e.g.
from palettable.colorbrewer.sequential import BuPu_9 as color_scale
Then access it as
color_scale.mpl_colormap
That way, you can keep your code DRY and change the colors with only one change.
Layout (in response to comments)
The colorbar may be a little big (certainly too tall) to look ideal. There are a few possible ways to fix that. I'll point you to two:
The "right" way to do it is probably to use a Gridspec
You could use your existing approach, but increase the number of rows and have the colorbar still in one row, while the other elements span more rows than they do currently.
I've implemented the second option with 9 rows, an extra column (so that the month labels don't get lost) and the colorbar on the bottom row, spanning two fewer columns than the main figure. I've also used tight_layout with w_pad=0.0 to avoid label clashes. You can play with this to get your exact preferred size. New code here.
This gives:
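For the GridSpec option mentioned in the layout notes, a minimal sketch might look like this. The number of rows, the height ratios and the figure size are made up; the point is only that height_ratios lets the colorbar row be much thinner than the main panels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(8, 5))
# Three equal rows for the main panels plus one thin row for the colorbar
gs = GridSpec(4, 1, figure=fig, height_ratios=[4, 4, 4, 1])

main_axes = [fig.add_subplot(gs[row, 0]) for row in range(3)]
ax_colorbar = fig.add_subplot(gs[3, 0])
```

ax_colorbar can then be passed as cax to fig.colorbar().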
There are functions to do this in matplotlib.colorbar. With some specific code from your example, I could give you a better answer, but you'll use something like:
myColorbar = matplotlib.colorbar.ColorbarBase(myAxes, cmap=myColorMap,
                                              norm=myNorm,
                                              orientation='vertical')