I have a DataFrame (data) with a simple integer index and 5 columns. The columns are Date, Country, AgeGroup, Gender, Stat. (Names changed to protect the innocent.) I would like to produce a FacetGrid where the Country defines the row, AgeGroup defines the column, and Gender defines the hue. For each of those particulars, I would like to produce a time series graph. I.e. I should get an array of graphs each of which has 2 time series on it (1 male, 1 female). I can get very close with:
g = sns.FacetGrid(data, row='Country', col='AgeGroup', hue='Gender')
g.map(plt.plot, 'Stat')
However, this just gives me the sample number on the x-axis rather than the dates. Is there a quick fix in this context?
More generally, I understand that the approach with FacetGrid is to make the grid and then map a plotting function to it. If I wanted to roll my own plotting function, what are the conventions it needs to follow? In particular, how can I write my own plotting function (to pass to map for FacetGrid) that accepts multiple columns worth of data from my dataset?
I'll answer your more general question first. The rules for functions that you can pass to FacetGrid.map are:
They must take array-like inputs as positional arguments, with the first argument corresponding to the x axis and the second argument corresponding to the y axis (though more on the second condition shortly).
They must also accept two keyword arguments: color and label. If you want to use a hue variable, then these should get passed to the underlying plotting function, though you can just catch **kwargs and not do anything with them if it's not relevant to the specific plot you're making.
When called, they must draw a plot on the "currently active" matplotlib Axes.
There may be cases where your function draws a plot that looks correct without taking x, y positional inputs. I think that's basically what's going on here with the way you're using plt.plot. It can be easier then to just call, e.g., g.set_axis_labels("Date", "Stat") after you use map, which will rename your axes properly. You may also want to do g.set(xticklabels=dates) to get more meaningful ticks.
There is also a more general function, FacetGrid.map_dataframe. The rules here are similar, but the function you pass must accept a dataframe input in a parameter called data, and instead of taking array-like positional inputs it takes strings that correspond to variables in that dataframe. On each iteration through the facets, the function will be called with the input dataframe masked to just the values for that combination of row, col, and hue levels.
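A minimal sketch of a function that follows the map_dataframe conventions described above. The name plot_stat_by_date and the toy data are made up for illustration; the function is called directly here, in the way FacetGrid.map_dataframe would call it once per facet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd


def plot_stat_by_date(x, y, data=None, color=None, label=None, **kwargs):
    """Draw data[y] against data[x] on the currently active Axes.

    x and y are column names (strings); data is the per-facet DataFrame.
    """
    ax = plt.gca()
    subset = data.sort_values(x)  # make sure the line runs left to right
    ax.plot(subset[x], subset[y], color=color, label=label, **kwargs)


# Toy stand-in for one facet's worth of rows (hypothetical values).
facet = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-02"]),
    "Stat": [3.0, 1.0, 2.0],
})

fig, ax = plt.subplots()
plot_stat_by_date("Date", "Stat", data=facet, color="C1", label="Female")
```

With a real grid, the equivalent call would be g.map_dataframe(plot_stat_by_date, "Date", "Stat").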
So in your specific case, you'll need to write a function that we can call plot_by_date that should look something like this:
def plot_by_date(x, y, color=None, label=None):
...
(I'd be more helpful on the body, but I don't actually know how to do much with dates and matplotlib). The end result is that when you call this function it should plot on the currently-active Axes. Then do
g.map(plot_by_date, "Date", "Stat")
And it should work, I think.
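For what it's worth, one possible body for plot_by_date is sketched below. It assumes the Date column holds values that pandas.to_datetime can parse; the extra **kwargs and the tick rotation are additions for illustration, not part of the rules above. The function is exercised directly on a plain Axes with made-up data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd


def plot_by_date(x, y, color=None, label=None, **kwargs):
    """Plot y against x on the currently active Axes, treating x as dates."""
    ax = plt.gca()
    dates = pd.to_datetime(x)  # assumption: x is parseable as dates
    ax.plot(dates, y, color=color, label=label, **kwargs)
    ax.tick_params(axis="x", rotation=45)  # keep date labels readable


# Made-up data standing in for one facet's Date and Stat values.
dates = pd.date_range("2020-01-01", periods=5, freq="D").strftime("%Y-%m-%d")
stat = [1, 2, 3, 4, 5]

fig, ax = plt.subplots()
plot_by_date(dates, stat, color="C0", label="Male")
```

Passed to the grid as g.map(plot_by_date, "Date", "Stat"), the function would be called once per combination of row, col and hue.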
I'm new to Python so I hope you'll forgive my silly questions. I have read a dataset from Excel with pandas. The dataset is composed of 3 functions (U22, U35, U55) that share the same index (called y/75). [image]
Now I would like to "turn" the graph so that the index "y/75" goes on the y-axis instead of the x-axis, keeping all the functions in the same graph. The result I want to obtain is like the following picture: [image]
the code I've used is
var = pd.read_excel('path.xlsx','SummarySheet', index_col=0)
norm_vel=var[['U22',"U35","U55"]]
norm_vel.plot(figsize=(10,10), grid='true')
But with this code I couldn't find a way to swap the axes. Then I tried a different approach, which turned the graph, but I couldn't put all the functions in the same graph, only one at a time:
var = pd.read_excel('path.xlsx','SummarySheet', index_col=False)
norm_vel2=var[['y/75','U22',"U35","U55"]]
norm_vel2.plot( x='U22', y='y/75', figsize=(10,10), grid='true' )
plt.title("Velocity profiles")
plt.xlabel("Normalized velocity")
plt.ylabel("y/75")
obtaining this: [image]
I am not very familiar with DataFrame plotting. To be honest, I've been watching this question expecting that someone would give an obvious answer. But since no one has (for a question that is an hour old, it is already late for obvious answers), I can at least tell you how I would do it without the plot method of the DataFrame:
plt.figure(figsize=(10,10))
plt.grid(True)
plt.plot(var[['U22',"U35","U55"]], var['y/75'])
plt.title("Velocity profiles")
plt.xlabel("Normalized velocity")
plt.ylabel("y/75")
When you are used to matplotlib, where you can pass multiple series as both x and y, instinct says that the pandas plotting wrappers (which are just convenience functions that call matplotlib with the right parameters) should make it possible to just call
var.plot(x=['U22', 'U35', 'U55'], y='y/75')
Since after all,
var.plot(x='y/75', y=['U22', 'U35', 'U55'])
works as expected (3 lines: U22 vs y/75, U35 vs y/75, U55 vs y/75). So the first one should also have worked (3 lines: y/75 vs U22, y/75 vs U35, y/75 vs U55). But it doesn't. That is probably why the pandas documentation itself says that these matplotlib wrappers are still a work in progress.
So all you have to do is call the matplotlib function yourself. After all, pandas is not doing much more than that when you call the .plot method anyway.
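A self-contained sketch of the matplotlib approach, using made-up profile data in place of the Excel sheet (the column names follow the question; the values are invented). Passing a two-dimensional x and a one-dimensional y gives one line per x column, all sharing the same y values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented stand-in for the spreadsheet: three profiles over a shared y/75.
y = np.linspace(0.0, 1.0, 20)
var = pd.DataFrame({"y/75": y, "U22": y**0.5, "U35": y**0.4, "U55": y**0.3})

columns = ["U22", "U35", "U55"]
fig, ax = plt.subplots(figsize=(10, 10))
ax.grid(True)

# x has shape (20, 3), y has shape (20,): matplotlib draws three lines.
lines = ax.plot(var[columns].to_numpy(), var["y/75"].to_numpy())
for line, name in zip(lines, columns):
    line.set_label(name)  # label each line so the legend is meaningful

ax.set_title("Velocity profiles")
ax.set_xlabel("Normalized velocity")
ax.set_ylabel("y/75")
ax.legend()
```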
I am studying an example of calling function patsy.dmatrices().
The input argument formula_like contains the terms C(sales) and C(salary), and the function translates these terms into dummy variables, one per discrete value found in the input data. For example, C(salary) gets indicator columns C(salary)[T.low], C(salary)[T.medium], etc.
So, I wonder:
What is the terminology of C()? Should we call it a function or something? I didn't find a clear description on the official document webpage, but I could have missed something.
What is the purpose of wrapping the column name with C()? I tried to remove it, e.g. changing the item from C(salary) to salary plainly, and the function still translates the column into dummy variables.
I am new to this area, and I highly appreciate any hints or suggestions.
y, X = dmatrices(
formula_like=
'left~satisfaction_level+last_evaluation+number_project+average_montly_hours'
'+time_spend_company+Work_accident+promotion_last_5years+C(sales)+C(salary)',
data=data,
return_type='dataframe')
X.head()
I have two sets of x-y data, that I'd like to plot as a scatterplot, using sns.scatterplot. I want to highlight two different things:
the difference between different types of data
the difference between the first and the second set of x-y data
For the first, I'm using the inbuilt hue and style, for the second, I'd like to have filled vs. unfilled markers, but I'm wondering how to do so, without doing it all by hand with plt.scatter, where I would have to implement all the magic of sns.scatterplot by hand.
long version, with MWE:
I have X and Y data, and also have some type info for each point of data. I.e. I have a sample 1 which is of type A and yields X=11, Y=21 at the first sampling and X=10, Y=21 at the second sampling. And the same deal for sample 2 of type A, sample 3 of type B and so on (see example file at the end).
So I want to visualize the differences between two samplings, like so:
data = pd.read_csv('testdata.csv', sep=';', index_col=0, header=0)
# data for the csv at the end of the question
sns.scatterplot(x=data['x1'], y=data['y1'])
sns.scatterplot(x=data['x2'], y=data['y2'])
Nice, I can easily see that the first sampling seems to show a linear relationship between X and Y, whereas the second one shows some differences. Now what interests me, is which type of data is affected the most by these differences and that's why I'm using seaborn, instead of pure matplotlib: sns.scatterplot has a lot of nice stuff built in, e.g. hue (and style, to get symbols for printing in b&w):
sizes = (200, 200) # to make stuff more visible
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes)
OK, so I can easily distinguish my data types, but I lost all information about which sample is which. The obvious solution to me seems to be to use filled markers for one, and unfilled ones for the other.
However, I can't seem to do that.
I'm aware of this question/answer, using fc='none' which is not documented in the sns.scatterplot documentation but this fails, when also using hue:
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes, fc='none')
As you can see, the second set of markers simply vanishes (there are some artifacts in the B data, where hints of a white cross are visible).
I can kinda fix that by setting ec=...:
sns.scatterplot(x=data['x1'], y=data['y1'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes)
sns.scatterplot(x=data['x2'], y=data['y2'], hue=data['type'], style=data['type'],
size=data['type'], sizes=sizes, fc='none',
ec=('b','b','y','y','y', 'y', 'g', 'g', 'g','r'))
# I would have to define the proper colors, but for this example, they're close enough
but that obviously has a few issues:
the markers in the legend aren't fitting anymore, neither color nor fill
and I'm already halfway in doing-it-all-by-hand territory anyways, e.g. my ec= would fail when I want to plot a new dataset with sample_no 11.
How can I do that with seaborn? Filled vs. unfilled seems quite an obvious flag for scatterplots, but I can't seem to find it.
data for testdata.csv:
sample_no;type;x1;y1;x2;y2
1;A;11;21;10;21
2;A;12;22;12;21
3;B;13;23;13.2;22.8
4;B;14;24;13.8;24
5;B;15;25;14.8;25.2
6;B;16;26;16.3;25.9
7;C;17;27;18;28
8;C;18;28;20;26
9;C;19;29;20;30
10;D;20;30;19;28
I need to specify the color and marker for a series of plots on the same axis. In Python, I would simply create an iterator for each and use next() to get them out in order one at a time. I cannot find an equivalent in MATLAB; all the examples I have found involve explicitly calling the list holding the colors and markers by index, but this precludes using them in loops that don't use a matching iterator. Is there a more appropriate substitution for the iterator concept?
Alternately, is there a more appropriate way to accomplish this in MATLAB?
You can use the ColorOrder and LineStyleOrder properties of the axes; see the axes properties documentation for the complete description.
The ColorOrder property is a three-column matrix of RGB triplets and the LineStyleOrder is a cell array of line specifiers or, alternatively, a string of specifiers separated by |.
This figure has been created using the code below. Of course, you can also generate the ColorOrder matrix using one of the built-in colormaps or even a custom one.
figure;
set(gca, 'ColorOrder', hsv(5));
set(gca, 'LineStyleOrder', '-|--|:');
hold on;
t = 0:pi/20:2*pi;
for i = 1:15
plot(t, sin(t-i/5));
end
Anyway, as far as I know MATLAB has no concept of an iterator, at least not in the Python sense, but this solution should address your problem without explicitly indexing into the list of colors and/or markers.
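For comparison, the Python idiom the question alludes to might look like the sketch below. The particular colors and markers are arbitrary examples; itertools.cycle makes each sequence wrap around, much as ColorOrder and LineStyleOrder do.

```python
from itertools import cycle

# Endless iterators over the available colors and markers.
colors = cycle(["r", "g", "b"])
markers = cycle(["o", "s", "^", "x"])

# Each plot takes the next color and marker, independent of any loop index.
styles = [(next(colors), next(markers)) for _ in range(5)]
```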
You can define the look (such as color and marker) of a plot in the plot command. E.g. plot(1:5,'-go') will produce a green line with o markers.
Alternatively, you can indeed iterate over the plots in an axis. If you do all the plots in one command, like
h = plot(1:5,[1:5;2:2:10]);
then h will be a vector of chart line objects, and you can then iterate over these objects using
for i=1:length(h)
h(i).<some_modifications>
end
and set properties like this:
h(i).LineWidth = 2;
h(i).Marker = '*';
or in MATLAB versions before R2014b:
set(h(i),'LineWidth',2)
set(h(i),'Marker','*')
If you do the plots in separate commands, you can manually collect the returned chart line objects in a vector and do the same thing (or of course modify them directly). You can find some properties you can use here.
Is this what you were looking for?
In the graphic below, I want to put in a legend for the calendar plot. The calendar plot was made using ax.plot(...,label='a') and drawing rectangles in a 52x7 grid (52 weeks, 7 days per week).
The legend is currently made using:
plt.gca().legend(loc="upper right")
How do I correct this legend to something more like a colorbar? Also, the colorbar should be placed at the bottom of the plot.
EDIT:
Uploaded code and data for reproducing this here:
https://www.dropbox.com/sh/8xgyxybev3441go/AACKDiNFBqpsP1ZttsZLqIC4a?dl=0
Aside - existing bugs
The code you put on the dropbox doesn't work "out of the box". In particular - you're trying to divide a datetime.timedelta by a numpy.timedelta64 in two places and that fails.
You do your own normalisation and colour mapping (calling into color_list based on an int() conversion of your normalised value). You subtract 1 from this and you don't need to - you already floor the value by using int(). The result of doing this is that you can get an index of -1 which means your very smallest values are incorrectly mapped to the colour for the maximum value. This is most obvious if you plot column 'BIOM'.
I've hacked this by adding a tiny value (0.00001) to the total range of the values that you divide by. It's a hack - I'm not sure that this method of mapping is at all the best use of matplotlib, but that's a different question entirely.
Solution adapting your code
With those bugs fixed, and adding a last subplot below all the existing ones (i.e. replacing 3 with 4 in all your calls to subplot2grid()), you can do the following:
Replace your
plt.gca().legend(loc="upper right")
with
# plot an overall colorbar type legend
# Grab the new axes object to plot the colorbar on
ax_colorbar = plt.subplot2grid((4,num_yrs), (3,0),rowspan=1,colspan=num_yrs)
mappableObject = matplotlib.cm.ScalarMappable(cmap = palettable.colorbrewer.sequential.BuPu_9.mpl_colormap)
mappableObject.set_array(numpy.array(df[col_name]))
col_bar = fig.colorbar(mappableObject, cax = ax_colorbar, orientation = 'horizontal', boundaries = numpy.arange(min_val,max_val,(max_val-min_val)/10))
# You can change the boundaries kwarg to either make the scale look less boxy (increase 10)
# or to get different values on the tick marks, or even omit it altogether to let
# matplotlib work out the boundaries for itself
col_bar.set_label(col_name)
ax_colorbar.set_title(col_name + ' color mapping')
I tested this with two of your columns ('NMN' and 'BIOM') and on Python 2.7 (I assume you're using Python 2.x given the print statement syntax)
The finalised code that works directly with your data file is in a gist here
You get
How does it work?
It creates a ScalarMappable object that matplotlib can use to map values to colors. It sets the array on which this mapping is based to all the values in the column you are dealing with. It then uses Figure.colorbar() to add the colorbar, passing in the mappable object so that the labels are correct. I've added boundaries so that the minimum value is shown explicitly; you can omit that if you want matplotlib to sort that out for itself.
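A stripped-down sketch of the same mechanism, independent of your data file. It assumes matplotlib's built-in "BuPu" colormap in place of the palettable one, and the values, the label "Stat" and the 4x1 grid are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

values = np.linspace(0.0, 50.0, 200)  # invented stand-in for df[col_name]

fig = plt.figure(figsize=(8, 4))
# Main plot in the top three rows, colorbar axes in the bottom row.
ax_main = plt.subplot2grid((4, 1), (0, 0), rowspan=3)
ax_main.scatter(np.arange(values.size), values, c=values, cmap="BuPu")
ax_colorbar = plt.subplot2grid((4, 1), (3, 0))

# The mappable ties the value range to the colormap.
mappable = ScalarMappable(norm=Normalize(values.min(), values.max()), cmap="BuPu")
mappable.set_array(values)

# Draw the colorbar into the dedicated axes, laid out horizontally.
col_bar = fig.colorbar(mappable, cax=ax_colorbar, orientation="horizontal")
col_bar.set_label("Stat")
```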
P.S. I've set the colormap to palettable.colorbrewer.sequential.BuPu_9.mpl_colormap, matching your get_colors() function which gets these colours as a 9 member list. I strongly recommend importing the colormap you want to use under a descriptive name to make the use of mpl_colors and mpl_colormap easier to understand, e.g.
from palettable.colorbrewer.sequential import BuPu_9 as color_scale
Then access it as
color_scale.mpl_colormap
That way, you can keep your code DRY and change the colors with only one change.
Layout (in response to comments)
The colorbar may be a little bigger (certainly taller) than is aesthetically ideal. There are a few possible ways to fix that. I'll point you to two:
The "right" way to do it is probably to use a GridSpec.
You could use your existing approach, but increase the number of rows and keep the colorbar in one row, while the other elements span more rows than they do currently.
I've implemented the second option with 9 rows, an extra column (so that the month labels don't get lost) and the colorbar on the bottom row, spanning two fewer columns than the main figure. I've also used tight_layout with w_pad=0.0 to avoid label clashes. You can play with this to get your exact preferred size. New code here.
This gives:
There are functions to do this in matplotlib.colorbar. With some specific code from your example, I could give you a better answer, but you'll use something like:
myColorbar = matplotlib.colorbar.ColorbarBase(myAxes, cmap=myColorMap,
norm=myNorm,
orientation='vertical')