Seaborn showing values in legend not present in Pandas column - python

I'm generating a scatterplot for a Pandas DataFrame data, containing amongst others the numeric column 'year' with the unique values
array([2010., 2011., 2012., 2013., 2014., 2015., 2016., 2017., 2018.])
as shown with data.year.unique().
Displaying the plot like this:
ax = sns.scatterplot(x='x', y='y', hue='name', size='year', data=data, palette=sns.color_palette('deep', 7))
generates a legend with the groupings for year listed as
This is misleading, as the plot only contains data from 2010 to 2018.
I've tried passing a tuple (min, max) to the sns.scatterplot function as described in the documentation, to no avail.
Changing the data type of the column 'year' to categorical does print the range of the years correctly in the legend, but yields a legend entry for every single year, which is unnecessary and takes up a lot of space.
I've also tried the solution from this related thread, but it doesn't change the range of the legend entries.
How can I force seaborn to show the actual range of values in the legend? Alternatively, if it only works by using a categorical column, how can I only show every second entry in the legend?

Related

Python line plotting 2 columns in same DataFrame using Index and count [duplicate]

This question already has answers here:
How to plot multiple pandas columns
(3 answers)
Plot multiple columns of pandas DataFrame using Seaborn
(2 answers)
How do I create a multiline plot using seaborn?
(3 answers)
Closed 26 days ago.
Newbie to Python so am unsure whether this can be done in one graph or not. I have one DataFrame containing Year, Number of Accidents and Number of Fatalities:
I am trying to generate a line plot that shows x axis = Year, y axis = number of instances per year, and 2 lines showing number of each individual column. Using Seaborn, I can only see a way to map 2 columns and hue. Can anyone please provide any advice on whether this is achievable in either Matplotlib or Seaborn.
Tried using Seaborn but cannot work out how to set up x and y axis as required and show 2 individual columns within that:
sns.lineplot(x=f1_safety['NumberOfFatalities'],y=f1_safety['NumberOfAccidents'].count(), hue = f1_safety['year'].count())
plt.show()
There are at least two ways to accomplish what you want to do here.
The simpler one uses pandas' built-in plotting API. You can plot dataframes directly when they are already in the correct form. In your case, you need to set the year as the index, and can then plot right away:
f1_safety.set_index("year").plot()
If you want to use seaborn, you first need to transform the data into the correct format. seaborn takes a single x and a single y, so you cannot specify different y columns directly (like y1, y2 and so on). Instead, you need to reshape the data into "long format": a table with one id column (here the year), one value column, and one column describing which original column each value came from. This works like this:
f1_safety = pd.melt(f1_safety, id_vars="year", value_vars=["NumberOfAccidents", "NumberOfFatalities"])
sns.lineplot(data=f1_safety, x="year", y="value", hue="variable")
The plot in both cases looks quite the same:
There are other ways. In particular, in Jupyter you can execute two plot statements in the same cell, and matplotlib will put the plots into the same figure, even cycling through the colors as necessary.
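To make the "long format" step concrete, here is a minimal sketch of what pd.melt produces. The numbers are invented for illustration and the name f1_long is hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for f1_safety; the values are made up for illustration.
f1_safety = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "NumberOfAccidents": [5, 3, 4],
    "NumberOfFatalities": [1, 0, 2],
})

# Wide -> long: one row per (year, original column) pair.
# By default pandas names the new columns "variable" and "value".
f1_long = pd.melt(
    f1_safety,
    id_vars="year",
    value_vars=["NumberOfAccidents", "NumberOfFatalities"],
)

print(f1_long)
```

Each original column name ends up as a level of the "variable" column, which is what hue="variable" picks up in the lineplot call.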

Creating a single tidy seaborn plot in a 'for' loop

I'm trying to generate a plot in seaborn using a for loop to plot the contents of each dataframe column on its own row.
The number of columns that need plotting can vary between 1 and 30. However, the loop creates multiple individual plots, each with their own x-axis, which are not aligned and with a lot of wasted space between the plots. I'd like to have all the plots together with a shared x-axis without any vertical spacing between each plot that I can then save as a single image.
The code I have been using so far is below.
comp_relflux = measurements.filter(like='rel_flux_C', axis=1) # Extracts relevant columns from larger dataframe
comp_relflux=comp_relflux.reindex(comp_relflux.mean().sort_values().index, axis=1) # Sorts into order based on column mean.
plt.rcParams["figure.figsize"] = [12.00, 1.00]
for column in comp_relflux.columns:
    plt.figure()
    sns.scatterplot((bjd)%1, comp_relflux[column], color='b', marker='.')
This is a screenshot of the resultant plots.
I have also tried using FacetGrid, but this just seems to plot the last column's data.
p = sns.FacetGrid(comp_relflux, height=2, aspect=6, despine=False)
p.map(sns.scatterplot, x=(bjd)%1, y=comp_relflux[column])
To have a single set of x-axis labels instead of one per row, you can use sharex. Also, by calling plt.subplots() with as many rows as you have columns, you get just one figure with all the subplots in it. As no data was provided, I used random numbers below to demonstrate. There are 4 columns of data in my df, but I have kept as much of your code and naming convention as possible. Hope this is what you are looking for...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
comp_relflux = pd.DataFrame(np.random.rand(100, 4))  # Random data - 4 columns
bjd = np.linspace(0, 1, 100)  # Series of 100 points - 0 to 1
rows = len(comp_relflux.columns)  # Number of columns = number of subplot rows
# The subplots... sharex is assigned here, and the size moves here from your rcParams
fig, ax = plt.subplots(rows, 1, sharex=True, figsize=(12, 6))
for i, column in enumerate(comp_relflux.columns):
    # x and y passed by keyword: recent seaborn versions no longer accept them positionally
    sns.scatterplot(x=bjd % 1, y=comp_relflux[column], color='b', marker='.', ax=ax[i])
1 output plot with 4 subplots
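The question also asked for no vertical spacing between the rows and for a column count that can be as low as 1. A minimal sketch of one way to handle both, assuming random stand-in data again and using matplotlib's own Axes.scatter in place of sns.scatterplot (that swap is my choice, not part of the answer above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; 4 columns here, but any count from 1 upwards works.
rng = np.random.default_rng(0)
comp_relflux = pd.DataFrame(rng.random((100, 4)))
bjd = np.linspace(0, 1, 100)

rows = len(comp_relflux.columns)
# squeeze=False keeps `axes` two-dimensional even when there is only one column,
# and hspace=0 removes the vertical gap between the stacked panels.
fig, axes = plt.subplots(
    rows, 1, sharex=True, squeeze=False,
    figsize=(12, 1.5 * rows), gridspec_kw={"hspace": 0},
)

for ax, column in zip(axes[:, 0], comp_relflux.columns):
    ax.scatter(bjd % 1, comp_relflux[column], color="b", marker=".")
```

Calling fig.savefig with a file name of your choice afterwards writes all panels to a single image.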

How to use column values for x axis labels in matplotlib

I have a basic DataFrame in pandas and am using matplotlib to create a chart.
I have followed advice found on SO and also on the docs for labelling the values on the x axis but they won't change from the indices.
I have this,
Presc_df_asc = Presc_df.sort_values('Total Items',ascending=True)
Presc_df_asc['Total Items'].plot.bar(x="Practice", ylim=[Presc_df_asc['Total Items'].min(), Presc_df_asc['Total Items'].max()])
plt.xlabel('Practice')
plt.ylabel('Total Items')
plt.title('practice total items')
plt.legend(('Items',),loc='upper center')
From what I have found, plot.bar(x="Practice") should set the x-axis to show the values in the Practice column under each bar.
But no matter what I try I get the x-axis labelled as indices with just the main label saying Practices.
In order for the plotting command to be able to access the "Practice" column, you need to apply the plot function to the entire dataframe (or a sub_dataframe that contains at least these two columns). The code below uses the corresponding labels below each bar. The rot=0 argument prevents the labels from being rotated by 90°.
Presc_df_asc.plot.bar(x="Practice", y="Total Items",
                      ylim=[Presc_df_asc['Total Items'].min(),
                            Presc_df_asc['Total Items'].max()], rot=0)
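A small self-contained sketch of the same call, to check that the tick labels really come from the "Practice" column. The data values are hypothetical; only the column names are taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; only the column names come from the question.
Presc_df = pd.DataFrame({
    "Practice": ["A", "B", "C"],
    "Total Items": [120, 80, 150],
})
Presc_df_asc = Presc_df.sort_values("Total Items", ascending=True)

# Plotting from the DataFrame lets pandas look up both columns by name.
ax = Presc_df_asc.plot.bar(x="Practice", y="Total Items", rot=0)

# Collect the tick label text to confirm it matches the sorted "Practice" values.
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```

Calling .plot.bar on the 'Total Items' Series alone, as in the question, leaves pandas with only the index to use for the labels, which is why the indices appeared.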

Groupby Plot - Include Subgroup Name

I'm looking to plot two columns of a time series based on a groupby of a third column. It works as intended more or less, but I can't tell which subgroup is being plotted in the output as it is not included in the legend or anywhere else in the graphs outputted.
Is there a way to include the subgroup name in the graphs outputted?
This is what I've attempted on the dataframe as follows:
awareness.groupby('campaign_name')['sum_purchases_value','sum_ad_spend'].plot(figsize=(20,8), legend=True);
Try this:
grouped = awareness.groupby('campaign_name')
titles = [name for name, data in grouped]
plots = grouped[['sum_purchases_value',
                 'sum_ad_spend']].plot(figsize=(20,8), legend=True)
for plot, label in zip(plots, titles):
    plot.set(title=label)
The pandas plot function returns a Series of matplotlib subplot objects, so using the for loop you can customize whatever you like (x labels, y labels, font size, etc.)
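An alternative sketch that loops over the groups explicitly, so each plot is titled at the moment it is created. The data values and the axes_by_campaign name are hypothetical; only the column names come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; only the column names come from the question.
awareness = pd.DataFrame({
    "campaign_name": ["spring", "spring", "summer", "summer"],
    "sum_purchases_value": [10.0, 12.0, 7.0, 9.0],
    "sum_ad_spend": [3.0, 4.0, 2.0, 2.5],
})

axes_by_campaign = {}
for name, group in awareness.groupby("campaign_name"):
    # Each DataFrame.plot call without an `ax` argument opens its own figure.
    ax = group[["sum_purchases_value", "sum_ad_spend"]].plot(legend=True)
    ax.set_title(name)  # the group key becomes the plot title
    axes_by_campaign[name] = ax
```

Keeping the axes in a dict keyed by campaign name makes it easy to adjust one specific plot later.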

pandas color scheme not working properly with my data (python) [duplicate]

This question already has answers here:
Pandas DataFrame Bar Plot - Plot Bars Different Colors From Specific Colormap
(3 answers)
Closed 4 years ago.
I would like to change the default color scheme of my pandas plot. I tried different color schemes through the cmap parameter of the pandas plot function, but when I change it, all bars of my barplot get the same color.
The code I tried is the following one:
yearlySalesGenre = df1.groupby('Genre').Global_Sales.sum().sort_values()
fig = plt.figure()
ax2 = plt.subplot()
yearlySalesGenre.plot(kind='bar',ax=ax2, sort_columns=True, cmap='tab20')
plt.show(fig)
And the data that I plot (yearlySalesGenre) is a pandas Series type:
Genre
Strategy 174.50
Adventure 237.69
Puzzle 243.02
Simulation 390.42
Fighting 447.48
Racing 728.90
Misc 803.18
Platform 828.08
Role-Playing 934.40
Shooter 1052.94
Sports 1249.47
Action 1745.27
Using tab20 cmap I get the following plot:
All bars get the first color of the tab20 scheme. What am I doing wrong?
Note that if I use the default color scheme of pandas plot, it properly displays all bars with different colors, but the thing is that I want to use a particular color scheme.
As noted, this is a duplicate question. Just in case, the answer is that pandas assigns colors from the colormap per column, not per row, so a single Series gets a single color. To use different colors you can transpose the data plus some other steps (see the duplicate link), or use matplotlib.pyplot directly, which allows more flexibility (what I did in my case):
plt.bar(range(len(df)), df, color=plt.cm.tab20(np.arange(len(df))))
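A self-contained sketch of the matplotlib approach, using a few of the values from the question's Series. Passing integers to the colormap picks its discrete colors one by one, which gives one color per bar.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A few of the values from the question's Series.
yearlySalesGenre = pd.Series(
    {"Strategy": 174.50, "Adventure": 237.69, "Puzzle": 243.02, "Simulation": 390.42},
    name="Global_Sales",
)

# Integer inputs index the colormap's color table, giving one RGBA row per bar.
colors = plt.cm.tab20(np.arange(len(yearlySalesGenre)))

fig, ax = plt.subplots()
ax.bar(yearlySalesGenre.index, yearlySalesGenre.values, color=colors)
```

tab20 has 20 colors, so this gives distinct colors for the 12 genres in the question as well.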
Maybe this is what you want:
df2.T.plot( kind='bar', sort_columns=True, cmap='tab20')
I think the problem you have is that you only have one series. Pandas' bar plot draws each series (column) in its own color, and separates the bars based on the index.
By using .T, the series in your data become multiple columns but within only one index. I am sure you can play with the legend to get a better display.
