I have an issue with axis labels when using groupby and trying to plot with seaborn. Here is my problem:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'user': ['Bob', 'Jane','Alice','Bob','Jane','Alice'],
'income': [40000, 50000, 42000,47000,53000,46000]})
df_group_user = df.groupby(['user']).sum().reset_index()
I then plot a horizontal bar plot using seaborn:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
#Prettify the plot
bar.set_yticklabels( bar.get_yticks(), size = 10)
bar.set_xticklabels( bar.get_xticks(), size = 10)
bar.set_ylabel("User", fontsize = 20)
bar.set_xlabel("Income ($)", fontsize = 20)
bar.set_title("Total income per user", fontsize = 20)
sns.set_theme(style="whitegrid")
sns.set_color_codes("muted")
Unfortunately, when I run the code this way, the y-axis ticks are labelled 0, 1, 2 instead of Bob, Jane, Alice as I'd like them to be.
I can get around the issue if I use matplotlib in the following manner:
df_group_user = df.groupby(['user']).sum()
df_group_user['income'].plot(kind="barh")
plt.title("Total income per user")
plt.ylabel("User")
plt.xlabel("Income ($)")
Ideally, I'd like to use seaborn for plotting, but if I don't use reset_index() as above, calling sns.barplot raises an error:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
ValueError: Could not interpret input 'user'
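A minimal pandas-only sketch of where this error likely comes from, assuming the same df as above: after groupby(...).sum() the grouping key sits in the index rather than in a column, so the seaborn version that produced the error cannot find a column named 'user'. Newer seaborn releases may resolve index names differently, so treat this as an illustration rather than a guarantee. The names grouped_index, grouped_reset and grouped_flat are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000, 47000, 53000, 46000]})

# Without reset_index(), 'user' becomes the index, not a column
grouped_index = df.groupby('user').sum()
user_is_column = 'user' in grouped_index.columns

# Either of these keeps 'user' as a regular column that can be looked up by name
grouped_reset = df.groupby('user').sum().reset_index()
grouped_flat = df.groupby('user', as_index=False).sum()
```

Both grouped_reset and grouped_flat can be passed as data= together with y="user".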
Just try swapping the positions of the x and y arguments.
I'm using a different DataFrame to show a similar situation.
gp = df.groupby("Gender")['Salary'].sum().reset_index()
gp
Output:
   Gender  Salary
0  Female    8870
1    Male   23667
Now, when plotting the bar chart, pass the x axis first, then the y axis, and check the result:
bar = sns.barplot(x = 'Salary', y = "Gender", data = gp);
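One likely cause of the 0, 1, 2 labels in the question is the call bar.set_yticklabels(bar.get_yticks(), size=10), which replaces the category names with the numeric tick positions. The sketch below, assuming the question's df, changes only the font size through tick_params and leaves the label text alone. The Agg backend line is an assumption for running outside a notebook and can be dropped otherwise.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000, 47000, 53000, 46000]})
df_group_user = df.groupby('user', as_index=False)['income'].sum()

fig, ax = plt.subplots()
sns.barplot(x="income", y="user", data=df_group_user, color="b", ax=ax)

# Resize the tick labels without replacing their text
ax.tick_params(axis="both", labelsize=10)
ax.set_ylabel("User", fontsize=20)
ax.set_xlabel("Income ($)", fontsize=20)
ax.set_title("Total income per user", fontsize=20)

# Draw once so the tick label texts are populated, then read them back
fig.canvas.draw()
y_labels = [t.get_text() for t in ax.get_yticklabels()]
```

After drawing, y_labels holds the user names rather than tick positions.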
Related
I have plotted a heatmap, displayed below. The x-axis shows the time of day and the y-axis shows the date. I want a tick label on the x-axis at every hour instead of the arbitrary labels it displays here.
I tried the following code, but in the resulting heatmap all the x labels are drawn on top of each other:
t = pd.date_range(start='00:00:00', end='23:59:59', freq='60T').time
df = pd.DataFrame(index=t)
df.reset_index(inplace=True)
df['index'] = df['index'].astype('str')
sns_hm = sns.heatmap(data=mat, cbar=True, lw=0,cmap=colormap,xticklabels=df['index'])
The following code assumes mat is a DataFrame with one row per day and one column per timestamp; the same timestamps must recur on every day.
After drawing the heatmap, the left and right limits of the x-axis are retrieved. Assuming these span 0 to 24 hours, the range can be divided into 25 evenly spaced positions, one for each full hour.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
from matplotlib.colors import ListedColormap, to_hex
# first, create some test data
df = pd.DataFrame()
df["date"] = pd.date_range('20220304', periods=19000, freq=DateOffset(seconds=54))
df["val"] = (((np.random.rand(len(df)) ** 100).cumsum() / 2).astype(int) % 2) * 100
df['day'] = df['date'].dt.strftime('%d-%m-%Y')
df['time'] = df['date'].dt.strftime('%H:%M:%S')
mat = df.pivot(index='day', columns='time', values='val')
colors = list(plt.cm.Greens(np.linspace(0.2, 0.9, 10)))
ax = sns.heatmap(mat, cmap=colors, cbar_kws={'ticks': range(0, 101, 10)})
xmin, xmax = ax.get_xlim()
tick_pos = np.linspace(xmin, xmax, 25)
tick_labels = [f'{h:02d}:00:00' for h in range(len(tick_pos))]
ax.set_xticks(tick_pos)
ax.set_xticklabels(tick_labels, rotation=90)
ax.set(xlabel='', ylabel='')
plt.tight_layout()
plt.show()
The left plot shows the default tick labels, the right plot the customized labels.
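The tick arithmetic can be checked without drawing anything. In the sketch below, n_columns = 1600 is an assumption derived from the test data above (86400 seconds per day divided by the 54-second sampling step), and the x limits are taken to run from 0 to the number of columns, which is what a heatmap axis of that width would normally report.

```python
import numpy as np

n_columns = 1600            # assumption: 86400 s per day / 54 s per sample
xmin, xmax = 0, n_columns   # stand-in for ax.get_xlim() on such a heatmap

# 25 evenly spaced positions: one per full hour, both ends included
tick_pos = np.linspace(xmin, xmax, 25)
tick_labels = [f'{h:02d}:00:00' for h in range(len(tick_pos))]

# Distance between neighbouring ticks, measured in heatmap columns
columns_per_hour = tick_pos[1] - tick_pos[0]
```

With other sampling steps only n_columns changes; the 25 positions still map to the hours 00 through 24.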
I would like to:
Store in a directors Series all the directors present in the director column of df.
Display in a horizontal bar graph the 10 most frequent directors in the catalogue.
Do I need to call value_counts first, to get the top 10 before creating the plt.bar?
# split the director names into one entry per director
df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
You can create a countplot and use the order= parameter to select the 10 highest counts:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# directors = df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
np.random.seed(123456)
directors = pd.Series(np.random.choice(
['Allen', 'Almodóvar', 'Bergman', 'Buñuel', 'Chaplin', 'Eastwood', 'Fassbinder', 'Fellini', 'Hitchcock', 'Keaton',
'Kubrick', 'Polanski', 'Renoir', 'Scorsese', 'Spielberg', 'Welles', 'Wenders', 'Wilder'], 200), name='Director')
ax = sns.countplot(y=directors, order=directors.value_counts().iloc[:10].index, palette='rocket')
ax.tick_params(axis='y', length=0)
plt.tight_layout()
plt.show()
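To answer the value_counts part of the question directly, here is a pandas-only sketch of the counting step followed by a horizontal bar chart. The small df is a hypothetical stand-in for the real catalogue, and the str.strip() call is there on the assumption that names are separated by a comma plus a space. The Agg backend line is only for running outside a notebook.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook

# Hypothetical stand-in for the real catalogue
df = pd.DataFrame({'director': ['Kubrick', 'Fellini, Bergman', 'Kubrick, Wilder',
                                None, 'Bergman', 'Kubrick']})

# One row per director: drop missing values, split on commas, trim whitespace
directors = (df['director']
             .dropna()
             .str.split(',')
             .explode()
             .str.strip()
             .reset_index(drop=True))

# value_counts() sorts by frequency, so head(10) is the top 10
top10 = directors.value_counts().head(10)

# barh draws from bottom to top, so sort ascending to put the largest bar on top
ax = top10.sort_values().plot.barh()
```

So yes, counting first and slicing the top 10 before plotting is a reasonable route.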
# c. Top 10 recovered countries (Bar plot)
# `data` is assumed to be a DataFrame with 'Country' and 'Recovered' columns
import plotly.express as px
top10_recovered = pd.DataFrame(data.groupby('Country')['Recovered'].sum().nlargest(10).sort_values(ascending = False))
fig3 = px.bar(top10_recovered, x = top10_recovered.index, y = 'Recovered', height = 600, color = 'Recovered',
              title = 'Top 10 Recovered Cases Countries', color_continuous_scale = px.colors.sequential.Viridis)
fig3.show()
I want to plot two bar graphs side by side using matplotlib/seaborn to compare the COVID-19 confirmed cases of two countries, Italy and India. However, after trying many methods I couldn't get it to work. The confirmed cases of the two countries come from two different data frames.
Data source
I want to plot 'Dates' column on x-axis and 'Confirmed cases count' on y-axis.
Attaching images of my code for reference.
P.S: I am new to data visualization and pandas too.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv',
                 parse_dates = ['Date'])
df.head(5)
ind_cnfd = df[['Date', 'Country', 'Confirmed']]
ind_cnfd = ind_cnfd[ind_cnfd['Country']=='India']
italy_cnfd = df[['Date', 'Country', 'Confirmed']]
italy_cnfd = italy_cnfd[italy_cnfd['Country'] == 'Italy']
The expected output is something like this:
with dates on the x-axis and confirmed cases on the y-axis.
Here's an example of what you can put together using matplotlib through the pandas plotting interface; I didn't use seaborn. Feel free to play around with the axes settings, spacing, and so on by looking through the matplotlib documentation. Besides pandas and numpy, the only plotting import needed to run this code from your notebook is import matplotlib.pyplot as plt.
You can optionally display the confirmed cases on a log-based y scale with the line: plt.yscale('log')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv',
parse_dates = ['Date'])
# select the Date, Country, Confirmed features from country, with reset of index
ind_cnfd = df[df.Country == 'India']
ind_cnfd = ind_cnfd[['Date', 'Confirmed']].reset_index(drop = True)
ind_cnfd = ind_cnfd.rename(columns = {'Confirmed': 'Confirmed Cases in India'})
italy_cnfd = df[df.Country == 'Italy']
italy_cnfd = italy_cnfd[['Date', 'Confirmed']].reset_index(drop = True)
italy_cnfd = italy_cnfd.rename(columns = {'Confirmed': 'Confirmed Cases in Italy'})
# combine dataframes together, turn the date column into the index
df_cnfd = pd.concat([ind_cnfd.drop(columns = 'Date'), italy_cnfd], axis = 1)
df_cnfd['Date'] = df_cnfd['Date'].dt.date
df_cnfd.set_index('Date', inplace=True)
# make a grouped bar plot time series
ax = df_cnfd.plot.bar()
# hide every other tick label so the remaining ones are easier to read
for label in ax.xaxis.get_ticklabels()[::2]:
    label.set_visible(False)
# add titles, axis labels
plt.suptitle("Confirmed COVID-19 Cases over Time", fontsize = 15)
plt.xlabel("Dates")
plt.ylabel("Number of Confirmed Cases")
plt.tight_layout()
# plt.yscale('log')
plt.show()
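As an alternative to concatenating two separately filtered frames, the long table can be pivoted so that each country becomes a column; this aligns the two series by date rather than by row position. The sketch below uses a tiny synthetic frame with made-up values in place of countries-aggregated.csv, and the name wide is chosen just for this example.

```python
import pandas as pd

# Tiny synthetic stand-in for countries-aggregated.csv (values are made up)
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'Country': ['India', 'Italy', 'India', 'Italy'],
    'Confirmed': [3, 1694, 5, 2036],
})

# Keep the two countries, then turn each country into its own column
wide = (df[df['Country'].isin(['India', 'Italy'])]
        .pivot(index='Date', columns='Country', values='Confirmed'))

# wide.plot.bar() would then draw one pair of bars per date (requires matplotlib)
```

The resulting frame has the same shape as df_cnfd above, with Date as the index and one column per country.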
In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import pandas as pd
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
    df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach is to use the seaborn module. It plots the two density estimates on the same axes without needing a variable to hold the axes (using some of the data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
   duration  waiting   kind
0     3.600       79   long
1     1.800       54  short
2     3.333       74   long
3     2.283       62  short
4     4.533       85   long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
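To see why the .pivot route ends up on a single set of axes, it may help to look at the reshaped frame itself. The sketch below rebuilds only the five sample rows shown above (not the full geyser dataset) and skips the plotting step; the name wide is made up for this example.

```python
import pandas as pd

# The five sample rows shown above, not the full geyser dataset
df = pd.DataFrame({'duration': [3.600, 1.800, 3.333, 2.283, 4.533],
                   'waiting': [79, 54, 74, 62, 85],
                   'kind': ['long', 'short', 'long', 'short', 'long']})

# One column per group; rows belonging to the other group are filled with NaN
wide = df.pivot(columns='kind', values='duration')
```

Each column of wide is then drawn as its own curve by a single .plot call, with missing values skipped.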
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
# `class` is a Python keyword, so use bracket access rather than df.class
classes = list(df['class'].unique())
for c in classes:
    df2 = df.loc[df['class'] == c]
    df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()
I'm working on a school project and I'm stuck in making a grouped bar chart. I found this article online with an explanation: https://www.pythoncharts.com/2019/03/26/grouped-bar-charts-matplotlib/
I have a dataset with an Age column and a Sex column. The Age column holds the client's age in years, and the Sex column holds 0 for female and 1 for male. I want to plot the age difference between males and females. I tried the following code, adapted from the example:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, data.loc[data['Sex'] == '0', 'count'], width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, data.loc[data['Sex'] == '1', 'count'], width=bar_width)
This raised the following error: KeyError: 'count'
Then I tried to change the code a bit and got another error:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, (data.loc[data['Sex'] == '0'].count()), width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, (data.loc[data['Sex'] == '1'].count()), width=bar_width)
This raised the error: ValueError: shape mismatch: objects cannot be broadcast to a single shape
How do I count the results so that I can make this grouped bar chart?
It seems like the article goes through too much trouble just to plot a grouped bar chart:
import numpy as np
import pandas as pd
np.random.seed(1)
data = pd.DataFrame({'Sex':np.random.randint(0,2,1000),
'Age':np.random.randint(20,50,1000)})
(data.groupby('Age')['Sex'].value_counts() # count the Sex values for each Age
.unstack('Sex') # turn Sex into columns
.plot.bar(figsize=(12,6)) # plot grouped bar
)
Or even simpler with seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=data, x='Age', hue='Sex', ax=ax)
Output:
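The manual ax.bar approach from the question can also be made to work once the counts are computed first. The KeyError: 'count' arises because the data has no precomputed 'count' column, and the shape mismatch comes from passing per-column counts where one value per age is expected. The sketch below is one way to do it, assuming Sex is stored as the integers 0 and 1 (the question compares against the strings '0' and '1', which would match nothing in that case). The data is randomly generated, and the Agg backend line is only for running outside a notebook.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt

# Randomly generated stand-in for the real dataset
rng = np.random.default_rng(1)
data = pd.DataFrame({'Sex': rng.integers(0, 2, 1000),
                     'Age': rng.integers(20, 50, 1000)})

# One row per age, one column per sex; missing combinations become 0
counts = pd.crosstab(data['Age'], data['Sex']).reindex(columns=[0, 1], fill_value=0)

x = np.arange(len(counts))
bar_width = 0.4

fig, ax = plt.subplots(figsize=(12, 6))
b1 = ax.bar(x, counts[0], width=bar_width, label='female (0)')
# Same thing, but offset the x positions by the width of one bar
b2 = ax.bar(x + bar_width, counts[1], width=bar_width, label='male (1)')

# Centre each tick between its pair of bars and label it with the age
ax.set_xticks(x + bar_width / 2)
ax.set_xticklabels(counts.index)
ax.legend()
```

Because counts has exactly one row per age, x and both bar heights always have matching lengths.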