The following is my code.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')
titanic = sns.load_dataset('titanic')
g = sns.FacetGrid(titanic, col="sex")
g = g.map(plt.hist, "age")
The resulting histograms look like this.
Now I have a question about the col parameter. I see two histograms arranged in a row, even though I specified col="sex". What is the purpose of the col parameter, and why are the histograms arranged row-wise?
Specifying the col parameter splits the data frame into groups according to the values of the named variable, and each group is plotted in its own column of the grid. In your case, the variable sex has two groups: male and female. Each group is drawn in a separate column, which is why your plot has two columns and one row. Two panels side by side in a single row is exactly what col produces; to stack the panels vertically instead, pass row="sex".
From the FacetGrid docstring:
row, col, hue : strings
Variables that define subsets of the data, which will be drawn on separate facets in the grid. See the *_order parameters to control the order of levels of this variable.
I have a dataframe where one of the columns is a 16 element vector (stored as a list).
In the past, I have found seaborn's lineplot highly useful for regression analysis on a scalar column. The vector column has me in a bind.
Consider a seaborn sample program:
import seaborn as sns
sns.set_theme(style="darkgrid")
# Load an example dataset with long-form data
fmri = sns.load_dataset("fmri")
# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
hue="region", style="event",
data=fmri)
It yields a figure such as this:
If I add another column signal2 to fmri:
fmri['signal2'] = '[1,2,3,4,5,6]'
(this is for representational purposes only)
In the dataset I have, there is a list of 16 floats in a column of the dataset.
What I want to do is look at lineplot for:
sns.lineplot(x="<length of vector>", y="signal2",
hue="region", style="event",
data=fmri)
Basically, look at variations in the vector for different subsections of the dataset.
I assume that you want one line for each index in the list; for example, the values at index 0 across all rows form a single line. To do this, we first need to explode the lists while keeping track of the index of each list value.
First, I create an example column signal2 with a list of length 6 for each row:
import numpy as np
fmri['signal2'] = list(np.random.random((len(fmri), 6)))
Note that if the list is actually a string (as in the question), we need to convert it to an actual list first. If the column already contains lists, this is not needed.
fmri['signal2'] = fmri['signal2'].str[1:-1].str.split(',')
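If the strings are Python-style list literals, ast.literal_eval may be a convenient alternative, since it parses the brackets and converts the elements in one step. Note that this is a swapped-in technique, not the str.split approach above; the Series below is made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical column of stringified lists, as in the question
s = pd.Series(["[1,2,3,4,5,6]", "[7,8,9,10,11,12]"])

# literal_eval safely evaluates each string as a Python literal
parsed = s.apply(ast.literal_eval)

first = parsed.iloc[0]
print(first)
```

With this route the elements are already numbers, so the later astype(float) step is only needed if you want floats rather than ints.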
Then, we explode the list and add the list indices using cumcount:
fmri = fmri.explode('signal2')
fmri['signal2'] = fmri['signal2'].astype(float) # Needed if the elements are strings.
fmri['x'] = fmri.groupby(fmri.index).cumcount()
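The explode-and-index step is easy to check on a tiny frame. This is a minimal sketch with a made-up two-row DataFrame (the column values are hypothetical); it applies the same three operations as above.

```python
import pandas as pd

# Hypothetical frame: each row holds a list of three floats
df = pd.DataFrame({
    "region": ["a", "b"],
    "signal2": [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]],
})

# One output row per list element; the original index label is repeated
long_df = df.explode("signal2")
long_df["signal2"] = long_df["signal2"].astype(float)

# Rows that came from the same list share an index label, so cumcount
# numbers them 0, 1, 2 within each original row
long_df["x"] = long_df.groupby(long_df.index).cumcount()

print(long_df)
```

Each original row becomes three rows, and x records the position of each value inside its list.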
To plot the data as separate lines, set the hue parameter to be the list index column while keeping the x-axis as timepoint:
sns.lineplot(x="timepoint", y="signal2", hue="x", data=fmri)
Resulting plot:
I have two pandas dataframes named train_df and test_df. They share the same column names, except that test_df lacks one column that train_df has. I want to draw violin plots showing the distribution of each column (like a box plot), one column per row of the figure, with the two sides of each violin showing the same column from the two dataframes, so that I can compare the distributions. How do I do this (preferably in matplotlib or seaborn)?
EDIT 1:
Something like the plot below, except that each violin should show the distribution of one column, with its two sides showing that column from each dataframe. The image shows only two columns and uses a third column for colour.
You will have to combine your two dataframes into one, adding a column that records the origin of each row:
import pandas as pd
import seaborn as sns
# create fake data
tips = sns.load_dataset('tips')
train_df = tips.loc[tips['smoker']=='Yes']
test_df = tips.loc[tips['smoker']=='No']
# concatenate both dataframes, tagging each row with its origin
df = pd.concat([train_df.assign(orig='train'), test_df.assign(orig='test')], axis=0)
# plot one split violin per day, with one half per origin
ax = sns.violinplot(x="day", y="total_bill", hue="orig",
                    data=df, split=True)
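To compare every shared column rather than a single one, one possible extension is to melt the combined frame into long form so that the column name becomes the x variable. This is only a sketch under assumptions: the random data, the column names a, b and target, and the Agg backend are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical data: test_df lacks the "target" column
train_df = pd.DataFrame({
    "a": rng.normal(0, 1, 50),
    "b": rng.normal(5, 2, 50),
    "target": rng.integers(0, 2, 50),
})
test_df = pd.DataFrame({
    "a": rng.normal(0.5, 1, 40),
    "b": rng.normal(4, 2, 40),
})

# Keep only the columns present in both frames
shared = [c for c in train_df.columns if c in test_df.columns]

df = pd.concat(
    [train_df[shared].assign(orig="train"), test_df[shared].assign(orig="test")],
    axis=0,
)

# Long form: one row per (origin, column name, value)
long_df = df.melt(id_vars="orig", var_name="column", value_name="value")

# One split violin per column, one half per dataframe
ax = sns.violinplot(x="column", y="value", hue="orig", data=long_df, split=True)
```

If the columns live on very different scales, a shared y-axis will squash some violins; drawing each column in its own subplot is one way around that.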
I have a pandas DataFrame with several numeric columns; the first column holds categorical data. Naturally, there are NaN values and zeros scattered across different rows and columns (though no row is entirely blank).
Those rows still contain valuable data in their other columns, and those columns still contain valuable data in their other rows.
The problem is that sns.pairplot does not ignore NaN values when computing correlations and raises errors (such as division by zero, string-to-float conversion, etc.).
I have seen some people suggest the fillna() method, but I am hoping someone knows a more elegant way to do this, one that doesn't leave me spending hours fixing the plot, axes, filters, etc. afterwards. I didn't like that workaround.
It is similar to what this person has reported:
https://github.com/mwaskom/seaborn/issues/1699
ZeroDivisionError: 0.0 cannot be raised to a negative power
Here is the sample dataset:
Seaborn's PairGrid class will allow you to create your desired plot. PairGrid is much more flexible than sns.pairplot. Every PairGrid has three sections: the upper triangle, the lower triangle, and the diagonal.
For each part, you can define a customized plotting function. The upper and lower triangle sections can take any plotting function that accepts two arrays of features (such as plt.scatter) as well as any associated keywords (e.g. marker). The diagonal section accepts a plotting function that has a single feature array as input (such as plt.hist) in addition to the relevant keywords.
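As a minimal sketch of the three sections, the snippet below maps a different function to each one. The random three-column DataFrame is made up, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Hypothetical data: three numeric features
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

grid = sns.PairGrid(df)
# Upper triangle: a two-feature function plus its keywords
grid.map_upper(plt.scatter, marker="+")
# Diagonal: a single-feature function plus its keywords
grid.map_diag(plt.hist, bins=8)
# Lower triangle: another two-feature function
grid.map_lower(sns.scatterplot)

n_rows, n_cols = grid.axes.shape
print(n_rows, n_cols)
```

With three variables the grid is 3 by 3: three diagonal panels, three upper panels, and three lower panels.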
For your purpose, you can filter out the NaNs in your customized function(s):
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)
# break iris dataset to create NaNs
iris.iat[1, 0] = np.nan
iris.iat[4, 0] = np.nan
iris.iat[4, 2] = np.nan
iris.iat[5, 2] = np.nan
# create customized scatterplot that first filters out NaNs in feature pair
def scatterFilter(x, y, **kwargs):
    interimDf = pd.concat([x, y], axis=1)
    interimDf.columns = ['x', 'y']
    # keep only the rows where both features are present
    interimDf = interimDf[(~ pd.isnull(interimDf.x)) & (~ pd.isnull(interimDf.y))]
    plt.plot(interimDf.x.values, interimDf.y.values, 'o', **kwargs)
# Create an instance of the PairGrid class (height was called size in older seaborn releases).
grid = sns.PairGrid(data=iris, vars=list(iris.columns), height=4)
# Map a scatter plot to the upper triangle
grid = grid.map_upper(scatterFilter, color='darkred')
# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins=10, edgecolor='k', color='darkred')
# Map the same filtered scatter plot to the lower triangle
grid = grid.map_lower(scatterFilter, color='darkred')
This will yield the following plot:
PairGrid also allows you to draw contour plots, annotate the panels with descriptive statistics, etc. For more details, see here.
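The pairwise NaN filtering that scatterFilter performs can be checked in isolation. This is a minimal sketch with two made-up Series; it uses dropna, which should be equivalent to the explicit isnull mask for this two-column case.

```python
import numpy as np
import pandas as pd

# Hypothetical feature pair with NaNs in different rows
x = pd.Series([1.0, np.nan, 3.0, 4.0], name="x")
y = pd.Series([10.0, 20.0, np.nan, 40.0], name="y")

# A row is dropped only if either feature of this pair is missing
pair = pd.concat([x, y], axis=1).dropna()

print(pair)
```

Because the filtering is done per pair of features, a row that is missing one feature is still used in every panel that does not involve that feature.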
I'm looking to plot two columns of a time series based on a groupby of a third column. It works as intended more or less, but I can't tell which subgroup is being plotted in the output as it is not included in the legend or anywhere else in the graphs outputted.
Is there a way to include the subgroup name in the graphs outputted?
This is what I've attempted on the dataframe:
awareness.groupby('campaign_name')['sum_purchases_value','sum_ad_spend'].plot(figsize=(20,8), legend=True);
Try this:
grouped = awareness.groupby('campaign_name')
titles = [name for name, data in grouped]
# select the two columns with a list; recent pandas versions reject the bare tuple form
plots = grouped[['sum_purchases_value', 'sum_ad_spend']].plot(figsize=(20,8), legend=True)
for plot, label in zip(plots, titles):
    plot.set(title=label)
The pandas plot function returns a Series of matplotlib subplot objects, so using the for loop you can customize whatever you like (x labels, y labels, font size, etc.)
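Here is a self-contained sketch of the same idea on made-up data. The awareness frame below and its values are hypothetical, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import pandas as pd

# Hypothetical stand-in for the awareness dataframe
awareness = pd.DataFrame({
    "campaign_name": ["spring", "spring", "summer", "summer"],
    "sum_purchases_value": [10.0, 12.0, 7.0, 9.0],
    "sum_ad_spend": [3.0, 4.0, 2.0, 2.5],
})

grouped = awareness.groupby("campaign_name")
group_names = [name for name, _ in grouped]

# One figure per group; the result holds one Axes per group
plots = grouped[["sum_purchases_value", "sum_ad_spend"]].plot(legend=True)

# Label each Axes with the name of its group
for ax, name in zip(plots, group_names):
    ax.set_title(name)

titles = [ax.get_title() for ax in plots]
print(titles)
```

The same loop can set axis labels or any other per-plot property.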
Given some data:
pt = pd.DataFrame({'alrmV':[000,000,000,101,101,111,111],
                   'he':['e','e','e','e','h','e','e'],
                   'inc':[0,0,0,0,0,1,1]})
I would like to create a bar plot separated on row and col.
g = sns.FacetGrid(pt, row='inc', col='he', margin_titles=True)
g.map( sns.barplot(pt['alrmV']), color='steelblue')
This works, but how do I also add:
an ordered x-axis
only display the top-two-by-count alrmV types
To get an ordered x-axis that displays the top two types by count, I played around with this grouping, but was unable to get it into a FacetGrid:
grouped = pt.groupby( ['he','inc'] )
grw= grouped['alrmV'].value_counts().fillna(0.) #.unstack().fillna(0.)
grw[:2].plot(kind='bar')
Using FacetGrid, slicing limits the total count displayed
g.map(sns.barplot(pt['alrmV'][:10]), color='steelblue')
So how can I get a bar graph, that is separated on row and col, and is ordered and displays only top 2 counts?
I couldn't get the example to work with the data you provided, so I'll use one of the example datasets to demonstrate:
import seaborn as sns
tips = sns.load_dataset("tips")
We'll make a plot with sex in the columns, smoker in the rows, using day as the x variable for the barplot. To get the top two days in order, we could do
top_two_ordered = list(tips.day.value_counts().sort_values().index[-2:])
Then you can pass this list to the order argument of the plotting function (x_order in older seaborn releases).
Although you can use FacetGrid directly here, it's probably easier to use the catplot function with kind="count" (in older seaborn releases this was factorplot, with size in place of height):
g = sns.catplot(x="day", col="sex", row="smoker",
                data=tips, kind="count", margin_titles=True, height=3,
                order=top_two_ordered)
Which draws:
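The ordering step is easy to check on its own. A minimal sketch on made-up day labels (the days Series is hypothetical, not the tips data):

```python
import pandas as pd

# Hypothetical labels: Sat appears 3 times, Sun twice, the rest once
days = pd.Series(["Sat", "Sun", "Sat", "Thur", "Sun", "Sat", "Fri"])

# Count each label, sort ascending by count, keep the last two labels
top_two_ordered = list(days.value_counts().sort_values().index[-2:])

print(top_two_ordered)
```

The result lists the two most frequent labels in ascending order of count, which is the order the bars will appear in.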
While I wouldn't recommend doing exactly what you proposed (plotting bars for different x values in each facet), it could be accomplished by doing something like
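g = sns.FacetGrid(tips, col="sex", row="smoker", sharex=False)
def ordered_barplot(data, **kws):
    # top two days by count within this facet, in ascending order
    x_order = list(data.day.value_counts().sort_values().index[-2:])
    # countplot draws per-category counts (older seaborn did this via barplot with x_order)
    sns.countplot(x="day", data=data, order=x_order, color=kws.get("color"))
g.map_dataframe(ordered_barplot)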
to make