I'm making a simple pairplot with Seaborn in Python that shows different levels of a categorical variable by the color of plot elements across variables in a Pandas DataFrame. Although the plot comes out exactly as I want it, the categorical variable is binary, which makes the legend quite meaningless to an audience not familiar with the data (categories are naturally labeled as 0 & 1).
An example of my code:
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
Is there a way to change legend label text with pairplot? Or should I use PairGrid, and if so how would I approach this?
Found it! It was answered here: Edit seaborn legend
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
g._legend.set_title(new_title)
Since you don't provide a full code example or mock data, I will use my own code to answer.
First solution
The easiest approach is probably to keep your binary labels for analysis and create a separate column with proper names for plotting. Here is a sample from my own code that should convey the idea:
def transconum(morph):
    # Map the morphology code 'S' to 1.0 and everything else to 0.0
    if morph == 'S':
        return 1.0
    else:
        return 0.0
CompactGroups['MorphNum'] = CompactGroups['MorphGal'].apply(transconum)
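The sample above converts a text label into a number. For the case in the question, the same idea runs in the other direction: keep the 0/1 column and derive a readable column from it. A minimal sketch, assuming pandas; the column name categorical_label and the names "Control" and "Treated" are hypothetical.

```python
import pandas as pd

# Hypothetical data with a binary category coded as 0/1
df = pd.DataFrame({
    "x": [0.1, 0.4, 0.3, 0.8],
    "categorical_var": [0, 1, 1, 0],
})

# Keep the numeric column for analysis and add a readable one for plotting
label_names = {0: "Control", 1: "Treated"}  # hypothetical names
df["categorical_label"] = df["categorical_var"].map(label_names)
```

Passing hue='categorical_label' to sns.pairplot should then put the readable names in the legend. The vars argument of pairplot can be used to keep the numeric code column out of the grid.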
Second solution
Another way would be to overwrite the labels on the fly. Here is a sample from my own code that works for me:
grid = sns.jointplot(x="MorphNum", y="PropS", data=CompactGroups, kind="reg")
grid.set_axis_labels("Central type", "Spiral proportion among satellites")
grid.ax_joint.set_xticks([0, 1, 1])
plt.xticks(range(2), ('$Red$', '$S$'))
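The same relabeling can be sketched with plain Matplotlib on any Axes object, which may be easier to adapt. The data below is made up; the tick labels "Red" and "S" are taken from the snippet above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Hypothetical points: x is the binary code, y is some proportion
ax.scatter([0, 0, 1, 1], [0.2, 0.4, 0.6, 0.9])

# Put one tick on each binary value and give it a readable label
ax.set_xticks([0, 1])
ax.set_xticklabels(["Red", "S"])
ax.set_xlim(-0.5, 1.5)

fig.canvas.draw()
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```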
I'm trying to generate a histogram in Altair, but I'm having trouble controlling the tick count for the axis corresponding to the binned variable (x-axis). I'm new to Altair, so apologies if I'm missing something obvious here. I tried to look for whether others had faced this kind of issue but didn't find an exact match.
The code to generate the histogram is
alt.Chart(df_test).mark_bar().encode(
x=alt.X('x:Q', bin=alt.Bin(step=0.1), scale=alt.Scale(domain=[8.9, 11.6])),
y=alt.Y('count(y):Q', title='Count(Y)')
).configure_axis(labelLimit=0, tickCount=3)
df_test is a Pandas dataframe - the data for which is available here.
The above code generates the following histogram. Changing tickCount changes the y-axis tick counts, but not the x-axis.
Any guidance is appreciated.
There might be a more convenient way to do this using bin=, but one approach is to use transform_bin with mark_rect, since this does not change the axis into a binned axis (which are more difficult to customize):
import altair as alt
from vega_datasets import data
source = data.movies.url
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(tickCount=3)),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
You might notice that you don't get the exact number of ticks. This is because tick positions are rounded to "nice" values, such as multiples of 5. I couldn't turn this off even when setting nice=False on the scale, so another approach in those cases is to pass the exact tick values via values=:
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(values=[0, 3, 6, 9])),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
Be careful with decimal values: these are automatically displayed as integers (even with tickRound=False), but in the wrong position. This seems like a bug to me, so if you investigate it further you might want to report it on the Vega-Lite issue tracker.
I would like to create an Altair map having n distinct categories (each with a specific color) while having a second variable that controls the alpha/shading/color of these categories.
Now, I am able to produce a map colored by category and using whatever custom color of my choice and I am able to produce a map with a continuous variable and using whatever colormap of my choice.
I am not sure, however, how to proceed to obtain what I am looking for.
I thought I could possibly add some extra color with something like this:
.encode(alt.Color('properties.Cat2:O', scale=alt.Scale(domain=domain, range=range_),alt.Color('properties.colmap2:Q', scale=alt.Scale(domain=domain, range=cm_range_))
but I feel like I am trying random stuff and not getting any closer.
EDIT
Following @jakevdp's comments, I am trying to include an opacity argument. However, I am unsure about the proper syntax.
chart_json = json.loads(gdf.to_json())
chart_data= alt.Data(values=chart_json ['features'])
data_1km_geojson = alt.InlineData(values=val_1km, format=alt.DataFormat(property='features',type='json'))
domain=['Label1','Label2']
range_=['#b0d247','#007bd1']
chart_layer1 = alt.Chart(chart_data).mark_geoshape().encode(
alt.Color('properties.Cat2:O', scale=alt.Scale(domain=domain, range=range_),title = "sometitle"),
opacity=alt.Opacity('properties:OpacityVar:Q', bin=True),
).properties(
width=1100,
height=800
)
#Visualize the result
(background+chart_layer1).configure_view(stroke='white')
Additionally, the variable I am trying to use for the opacity argument has a very broad range (from 10,000 to 100 billion). Should I do a min-max normalization first?
Solved in the comments by splitting up the variables on two different encodings (opacity and color) instead of having them both on color.
Thanks to this kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem - necessary for understanding the upcoming
After that, I wanted to visualize the distribution of the classes and of the NaN values in the features, so I plotted it in a bar chart. With a few classes this is pretty handy.
The problem is that I have about 120 different classes and 50,000 data objects in total, and the plots are not readable with this amount of data.
Therefore I wanted to split the visualization.
For each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0
C 2 3 1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried df.groupby('Class').plot(kind="barh", subplots=True), but it completely destroyed the layout and plotted per feature rather than per class.
I also tried this approach, but when I store my grouped DataFrame in the variable 'grouped', I can print it in a perfect format with all the information, yet I cannot access it the way the solution does. I always get the error: 'string indices must be integers'
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The Data I use is completely categorical - no numerical values - i want to display the amount of nan-values in the different features of the classes(labels) of my dataset.
A solution using seaborn
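import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Reshape to long format: one row per (CLASS, feature) pair
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# One bar-chart facet per class, stacked vertically
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()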
Grouping is the way to go, just set the labels
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
With the solution provided by @meW I was able to achieve a result that is near my goal.
I had to take two steps to actually use his solution.
Cast the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
The groupby query turned the Class column into the index, so I had to transform it back via grouped['class'] = grouped.index
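The two steps above can be sketched on a small made-up frame. The expression computes group size minus non-null count, which is the number of NaN values per class and feature. The column names and values are hypothetical, and reset_index() is used here in place of assigning the index to a new column.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing values
df = pd.DataFrame({
    "Class": ["X", "X", "B", "C", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan, "c"],
    "FEATURE2": [np.nan, np.nan, "b", "c", np.nan, "d"],
})

by_class = df.groupby("Class")

# Step 1: NaN count = group size - non-null count, per class and feature
nan_counts = by_class.count().rsub(by_class.size(), axis=0)

# Step 2: turn the Class index back into a regular column
nan_table = nan_counts.reset_index()

# Cross-check with a more direct formulation of the same count
direct = df.drop(columns="Class").isna().groupby(df["Class"]).sum()
```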
Two questions arise from this solution. First, is it possible to fit the ticks to the different numbers of NaN values? Some classes have only 5-10 NaN values in their features, while others have over 1000 (see pictures).
Second, the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
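One possible way to address both points is to create the subplots explicitly with independent axes, so that each class gets its own y-scale and its own feature labels. This is a sketch under assumptions, not a tested answer for the full 120-class data set: it uses the small example table from the question as if it already held NaN counts, and the figure size is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Example table from the question, treated here as NaN counts per class
counts = pd.DataFrame({
    "CLASS": ["X", "B", "C"],
    "FEATURE1": [1, 0, 2],
    "FEATURE2": [1, 0, 3],
    "FEATURE3": [2, 0, 1],
}).set_index("CLASS")

# sharex=False and sharey=False keep the axes independent, so every
# subplot has its own tick range and its own x tick labels
fig, axes = plt.subplots(nrows=len(counts), ncols=1,
                         figsize=(6, 2.5 * len(counts)),
                         sharex=False, sharey=False)

for ax, (name, row) in zip(axes, counts.iterrows()):
    row.plot.bar(ax=ax)            # one bar per feature
    ax.set_title(f"CLASS {name}")
    ax.set_ylabel("NaN count")

fig.tight_layout()
first_labels = [t.get_text() for t in axes[0].get_xticklabels()]
```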
I'm trying to plot a Pandas Series with lots of samples:
In [1]: vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
In [2]: len(vp_series)
Out[2]: 17499650
In [3]: vp_series.index[-1]
Out[3]: 559888625359
When I try to plot this series, the produced plot looks like this:
In [4]: vp_series.plot()
Clearly not all data points are plotted, and max value on the x axis is only about 1.75e7 instead of 5.59e11.
However, when I try to plot the same data in Julia (using Plots and the PyPlot backend) it produces the correct figure:
What should I do here to make the plot contain all the data points? I tried to search in the doc of matplotlib and Pandas.Series but found nothing.
I found that the reason is that the way I created the pandas.Series was wrong. Instead of
vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
I should be using
vp_series = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)
The first way causes my series to contain a lot of missing values (NaN), which are not plotted. The reason is explained well here.
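The effect can be reproduced on a tiny made-up frame. When data is itself a Series, pandas aligns it to the new index by label instead of by position, so any timestamp that is not also a row label of raw_df ends up as NaN. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for raw_df, with the default 0..n-1 row index
raw_df = pd.DataFrame({"Timestamp": [100, 200, 300], "Count": [5, 7, 9]})

# Label-aligned: Count has labels 0, 1, 2; none match 100, 200, 300
aligned = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)

# Position-based: .values drops the old index, so values pair up by order
positional = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)

# An equivalent alternative that avoids the constructor altogether
via_set_index = raw_df.set_index("Timestamp")["Count"]
```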
I know I didn't ask my question properly and I appreciate all the comments.
I would like to create a figure that shows how much money people earned in a game (continuous variable) as a function of the categorical values of three other variables. The first variable is whether people were included or excluded prior to the Money game, the second is whether people knew their decision-making partner, and the last is the round of the game (participants played 5 rounds with a known co-player and 5 rounds with an unknown co-player). I know how to draw violin plots as a function of the values of two categorical variables using FacetGrid (see below), but I did not manage to add another layer to it.
g= sns.FacetGrid(df_long, col = 'XP_Social_Condition', size=5, aspect=1)
g.map(sns.boxplot, 'DM partner', 'Money', palette = col_talk)
I have created two dataframe versions: my initial one and a melted one (see image below). I have also tried to create two plots together using f, (ax_l, ax_r) = but this does not seem to take FacetGrid plots as plots within the plot... You can see below links to see the data and the kind of plot I would like to use as a subplot - one showing a known player and one showing the unknown player. I am happy to share the data if it would help.
I have now tried the solution proposed
grid = sns.FacetGrid(melted_df, hue='DM partner', col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money')
But it still does not work. It produces the plot shown below, in which the third (hue) variable does not separate the different conditions well.
here is the new figure I get - almost there
data - original and melted
Thank you very much for your help.
OK, so you want to create one plot of continuous data depending on three different categorical variables?
I think what you're looking for is:
grid = sns.FacetGrid(melted_df, col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money', 'DM partner').add_legend()
The col argument results in two plots, one for each value of XP_Social_Condition. The three values passed to grid.map split the data so that 'Round' becomes the x-axis, 'Money' the y-axis and 'DM partner' the color. You can play around and swap 'DM partner', 'XP_Social_Condition' and 'Round'.
The result should now look something like this or this ('Round' and 'DM Partner' swapped).
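As an alternative to calling FacetGrid.map directly, the same three-way split can be sketched with sns.catplot, which takes the variables as keyword arguments. The data below is randomly generated for illustration only; the column names follow the question, and the condition and partner values are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for melted_df: 15 made-up earnings per cell
rng = np.random.default_rng(0)
rows = []
for condition in ["Included", "Excluded"]:
    for partner in ["Known", "Unknown"]:
        for round_no in range(1, 6):
            for money in rng.normal(loc=10, scale=2, size=15):
                rows.append({
                    "XP_Social_Condition": condition,
                    "DM partner": partner,
                    "Round": round_no,
                    "Money": money,
                })
melted_df = pd.DataFrame(rows)

# col splits into panels, x is the round, hue colors by partner type
g = sns.catplot(data=melted_df, x="Round", y="Money", hue="DM partner",
                col="XP_Social_Condition", kind="violin")
```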