In a seaborn horizontal barplot that has two sets of bars, one drawn on top of the other, how can the axes of each set be controlled independently? I want to adjust the thickness of the bars based on how frequently some entity occurs in each one.
Currently both barplots are plotted with their axes stored in ax1 and ax2. I am able to adjust the bar thickness only for ax1 (light blue), but not for ax2 (dark blue; all of its bars have uniform thickness). I cannot figure out how ax2 needs to be assigned so that the bar thickness can be adjusted for the second set of bars as well.
How can bars of varying thickness be obtained for both barplots?
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(15, 45))
crashes = plotie.groupby('target_wcount').mean()
sns.set_color_codes("pastel")
ax1 = sns.barplot(x="uno", y="indie", orient='h', data=crashes,
                  label="uno", color="b")
sns.set_color_codes("muted")
ax2 = sns.barplot(x="miss", y="indie", orient='h', data=crashes,
                  label="miss", color="b")
for bar, newwidth in zip(ax1.patches, summa):
    bar.set_height(3*newwidth)
for bar, newwidth in zip(ax2.patches, summa):
    bar.set_height(3*newwidth)
sns.despine(left=True, bottom=True)
f.savefig('filea')
Sample data
output_wcount missing_count match_count uni indie uno miss
target_wcount
49 49.0 39.440000 9.560000 1.0 49 1.0 0.804898
48 48.0 36.730000 11.270000 1.0 48 1.0 0.765208
46 46.0 34.400000 11.600000 1.0 46 1.0 0.747826
45 45.0 33.940000 11.060000 1.0 45 1.0 0.754222
44 44.0 34.630000 9.370000 1.0 44 1.0 0.787045
43 43.0 31.420000 11.580000 1.0 43 1.0 0.730698
42 42.0 31.455000 10.545000 1.0 42 1.0 0.748929
41 41.0 29.630000 11.370000 1.0 41 1.0 0.722683
40 40.0 28.430000 11.570000 1.0 40 1.0 0.710750
39 39.0 27.935556 11.064444 1.0 39 1.0 0.716296
This can be solved by creating a second axes with twinx() and drawing the second barplot on it:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(15, 45))
crashes = plotie.groupby('target_wcount').mean()
sns.set_color_codes("pastel")
ax1 = sns.barplot(x="uno", y="indie", orient='h', data=crashes,
                  label="uno", color="b")
sns.set_color_codes("muted")
ax2 = ax.twinx()
sns.barplot(x="miss", y="indie", orient='h', data=crashes,
            label="miss", color="b", ax=ax2)
for bar, newwidth in zip(ax1.patches, summa):
    bar.set_height(3*newwidth)
for bar, newwidth in zip(ax2.patches, summa):
    bar.set_height(3*newwidth)
sns.despine(left=True, bottom=True)
f.savefig('filea')
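For comparison, the same effect can be sketched with plain matplotlib, which accepts a per-bar thickness directly through the height argument of Axes.barh. This is a minimal sketch and not the seaborn approach above: the values in uno, miss and freq are made up for illustration, and the 0.9 scaling factor is an assumption chosen so that neighbouring bars do not overlap.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: bar lengths for the two series and one frequency per row
positions = np.arange(5)
uno = np.ones(5)
miss = np.array([0.80, 0.77, 0.75, 0.75, 0.79])
freq = np.array([5, 20, 10, 40, 25])

# Scale the frequencies to thicknesses in (0, 0.9]; 0.9 is an arbitrary choice
thickness = 0.9 * freq / freq.max()

fig, ax = plt.subplots(figsize=(6, 4))
# Both sets of bars share one axes; each bar gets its own thickness via height=
light = ax.barh(positions, uno, height=thickness, color="lightblue", label="uno")
dark = ax.barh(positions, miss, height=thickness, color="navy", label="miss")
ax.set_yticks(positions)
ax.legend()
fig.canvas.draw()  # render once; use fig.savefig(...) to write a file instead
```

Because barh centres each bar on its y position, the bars stay aligned with their ticks. Rectangle.set_height, by contrast, keeps one edge of the bar fixed, so a resized bar is no longer centred on its tick.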
Related
I am trying to plot a scatter plot on top of a bar plot using sns.scatterplot() and df.plot(kind='bar'); The figure turns out to be fine, but it would be even nicer if I can align each of the scatter points to its corresponding bar with an identical label.
I have read in the matplotlib documentation that Rectangle (matplotlib.patches.Rectangle) has a get_x() method that can "Return the left coordinate of the rectangle";
I wonder if there is a way for me to assign these coordinates to the scatter points that'd be plotted by seaborn?
Code
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(9, 9)
fig.set_dpi(300)
bar_df.plot(kind='bar', ax=ax)
ax2 = ax.twinx()
sns.scatterplot(data=line_df, ax=ax2)
Dataframes
bar_df
year  apple  banana  citrus  ...
2020     12      34      56   78
2025     12      34      56   78
2030     12      34      56   78
2035     12      34      56   78
line_df
year  apple  banana  citrus  ...
2020     23      45      67   89
2025     23      45      67   89
2030     23      45      67   89
2035     23      45      67   89
It would be really nice if each point sat on the same vertical line as the bar with the same header.
sns.scatterplot interprets the x-axis as numeric. As such, it doesn't align well with a bar plot, nor does it have a dodge= parameter.
You can use sns.stripplot instead.
Seaborn works easiest with its data in "long form", which can be achieved via pandas pd.melt.
Here is some example code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
bar_df, line_df = pd.read_html('https://stackoverflow.com/questions/73191315')
bar_df_long = bar_df.melt(id_vars='year', var_name='fruit', value_name='bar_value')
line_df_long = line_df.melt(id_vars='year', var_name='fruit', value_name='line_value')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6,6), dpi=300)
sns.barplot(data=bar_df_long, x='year', y='bar_value', hue='fruit', dodge=True, ax=ax)
ax2 = ax.twinx()
sns.stripplot(data=line_df_long, x='year', y='line_value', hue='fruit', dodge=True, jitter=False,
              edgecolor='black', linewidth=1, ax=ax2)
ax2.legend_.remove() # remove the second legend
plt.tight_layout()
plt.show()
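The get_x() idea from the question can also be sketched directly: each bar is a Rectangle, so its horizontal centre is get_x() + get_width() / 2, and those centres can be passed to a matplotlib scatter call on the twin axes. This is only an illustrative alternative to stripplot. It assumes year is the index of both frames, uses shortened made-up copies of bar_df and line_df, and draws the points with Axes.scatter instead of seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Shortened, hypothetical versions of the two frames, indexed by year
bar_df = pd.DataFrame({"apple": [12, 12], "banana": [34, 34], "citrus": [56, 56]},
                      index=[2020, 2025])
line_df = pd.DataFrame({"apple": [23, 23], "banana": [45, 45], "citrus": [67, 67]},
                       index=[2020, 2025])

fig, ax = plt.subplots(figsize=(6, 6))
bar_df.plot(kind="bar", ax=ax)
ax2 = ax.twinx()

# ax.containers holds one BarContainer per plotted column, in column order
centres_by_column = {}
for container, column in zip(ax.containers, bar_df.columns):
    # Horizontal centre of every bar belonging to this column
    centres = [p.get_x() + p.get_width() / 2 for p in container.patches]
    centres_by_column[column] = centres
    ax2.scatter(centres, line_df[column], edgecolor="black", zorder=3)

fig.canvas.draw()
```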
#dataframe
a=
timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54
I applied this code:
a.plot(kind="density")
This is not what I want.
I want count on the y-axis and timestamp on the x-axis in a density plot,
just as I can do with plt.bar(a['timestamp'], a['count']).
Or is this not possible with density plotting?
The following code creates a density histogram. The total area sums to 1, supposing each of the timestamps counts as 1 unit. To get the timestamps as x-axis, they are set as the index. To get the total area to sum to 1, all count values are divided by their total sum.
A KDE is calculated from the same data.
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
from io import StringIO
a_str = '''timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54'''
a = pd.read_csv(StringIO(a_str), sep=r'\s+')  # whitespace-separated; delim_whitespace is deprecated in recent pandas
ax = (a.set_index('timestamp') / a['count'].sum()).plot.bar(width=0.9, rot=0, figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kde')
ax.set_xlim(xs[0], xs[-1])
ax.legend(labels=['kde', 'density histogram'])
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
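The normalisation described above can be checked numerically. The sketch below reuses the counts from the question, treats each timestamp as one unit wide, and confirms that both the density histogram and the weighted KDE have a total area of 1. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Counts from the question, one per timestamp
counts = np.array([20, 60, 35, 1, 0, 1, 50, 36, 68, 125, 54])
positions = np.arange(len(counts))

# Density histogram: with bars one unit wide, the area equals the sum of the heights
density = counts / counts.sum()
bar_area = density.sum()

# Weighted KDE over the bar positions; gaussian_kde normalises the weights itself
kde = gaussian_kde(positions, bw_method=0.2, weights=counts)
kde_area = kde.integrate_box_1d(-np.inf, np.inf)

print(f"histogram area: {bar_area:.6f}")
print(f"kde area:       {kde_area:.6f}")
```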
If you just want to plot the kde curve, you can leave out the histogram. Optionally you can fill the area under the curve.
fig, ax = plt.subplots(figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
# plot the kde curve
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kernel density estimation')
# optionally fill the area below the curve
ax.fill_between(xs, kde(xs), color='crimson', alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
To plot multiple similar curves, for example using more count columns, you can use a loop. A list of colors that go well together could be obtained from the Set2 colormap:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
a = pd.DataFrame({'timestamp': ['2021-08-16', '2021-08-17', '2021-08-18', '2021-08-19', '2021-08-20', '2021-08-21',
                                '2021-08-22', '2021-08-23', '2021-08-24', '2021-08-25', '2021-08-26']})
for i in range(1, 5):
    a[f'count{i}'] = (np.random.uniform(0, 12, len(a)) ** 2).astype(int)
xs = np.linspace(-1, len(a), 200)
fig, ax = plt.subplots(figsize=(12, 4))
for column, color in zip(a.columns[1:], plt.cm.Set2.colors):
    kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a[column])
    ax.plot(xs, kde(xs), lw=2, color=color, label=f"kde of '{column}'")
    ax.fill_between(xs, kde(xs), color=color, alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('Date')
ax.set_ylabel('Density of Counts')
plt.tight_layout()
plt.show()
I am trying to plot two displots side by side with this code
fig,(ax1,ax2) = plt.subplots(1,2)
sns.displot(x =X_train['Age'], hue=y_train, ax=ax1)
sns.displot(x =X_train['Fare'], hue=y_train, ax=ax2)
It returns the following result: two empty subplots, followed by the two displots on separate lines.
If I try the same code with violinplot, it returns result as expected
fig,(ax1,ax2) = plt.subplots(1,2)
sns.violinplot(y_train, X_train['Age'], ax=ax1)
sns.violinplot(y_train, X_train['Fare'], ax=ax2)
Why is displot returning a different kind of output and what can I do to output two plots on the same line?
seaborn.distplot has been DEPRECATED in seaborn 0.11 and is replaced with the following:
displot(), a figure-level function with a similar flexibility over the kind of plot to draw. This is a FacetGrid, and does not have the ax parameter, so it will not work with matplotlib.pyplot.subplots.
histplot(), an axes-level function for plotting histograms, including with kernel density smoothing. This does have the ax parameter, so it will work with matplotlib.pyplot.subplots.
This applies to every seaborn figure-level (FacetGrid) plot: none of them has an ax parameter. Use the equivalent axes-level plot instead.
Look at the documentation for the figure-level plot to find the appropriate axes-level plot function for your needs.
See Figure-level vs. axes-level functions
Because the histogram of two different columns is desired, it's easier to use histplot.
See How to plot in multiple subplots for a number of different ways to plot into matplotlib.pyplot.subplots
Also review seaborn histplot and displot output doesn't match
Tested in seaborn 0.11.1 & matplotlib 3.4.2
fig, (ax1, ax2) = plt.subplots(1, 2)
sns.histplot(x=X_train['Age'], hue=y_train, ax=ax1)
sns.histplot(x=X_train['Fare'], hue=y_train, ax=ax2)
Imports and DataFrame Sample
import seaborn as sns
import matplotlib.pyplot as plt
# load data
penguins = sns.load_dataset("penguins", cache=False)
# display(penguins.head())
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE
Axes Level Plot
With the data in a wide format, use sns.histplot
# select the columns to be plotted
cols = ['bill_length_mm', 'bill_depth_mm']
# create the figure and axes
fig, axes = plt.subplots(1, 2)
axes = axes.ravel() # flattening the array makes indexing easier
for col, ax in zip(cols, axes):
    sns.histplot(data=penguins[col], kde=True, stat='density', ax=ax)
fig.tight_layout()
plt.show()
Figure Level Plot
With the dataframe in a long format, use displot
# create a long dataframe
dfl = penguins.melt(id_vars='species', value_vars=['bill_length_mm', 'bill_depth_mm'], var_name='bill_size', value_name='vals')
# display(dfl.head())
species bill_size vals
0 Adelie bill_length_mm 39.1
1 Adelie bill_depth_mm 18.7
2 Adelie bill_length_mm 39.5
3 Adelie bill_depth_mm 17.4
4 Adelie bill_length_mm 40.3
# plot
sns.displot(data=dfl, x='vals', col='bill_size', kde=True, stat='density', common_bins=False, common_norm=False, height=4, facet_kws={'sharey': False, 'sharex': False})
Multiple DataFrames
If there are multiple dataframes, combine them with pd.concat, using .assign to create an identifying 'source' column that can then be used for row=, col=, or hue=
# list of dataframe
lod = [df1, df2, df3]
# create one dataframe with a new 'source' column to use for row, col, or hue
df = pd.concat((d.assign(source=f'df{i}') for i, d in enumerate(lod, 1)), ignore_index=True)
See Import multiple csv files into pandas and concatenate into one DataFrame to read multiple files into a single dataframe with an identifying column.
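As a self-contained illustration of the concat-and-assign pattern, the sketch below builds three small frames and combines them. The contents of df1, df2 and df3 and the 'vals' column name are hypothetical stand-ins, since the original frames are not shown.

```python
import pandas as pd

# Hypothetical frames standing in for df1, df2 and df3
df1 = pd.DataFrame({"vals": [1.0, 2.0]})
df2 = pd.DataFrame({"vals": [3.0]})
df3 = pd.DataFrame({"vals": [4.0, 5.0, 6.0]})
lod = [df1, df2, df3]

# Tag each frame with its origin, then stack them into one long frame
df = pd.concat((d.assign(source=f"df{i}") for i, d in enumerate(lod, 1)),
               ignore_index=True)
print(df)
```

The resulting 'source' column can then be passed as hue=, row= or col= to a figure-level function such as displot.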
Here is the sample data:
Datetime Price Data1 Data2 ShiftedPrice
0 2017-11-05 09:20:01.134 2123.0 12.23 34.12 300.0
1 2017-11-05 09:20:01.789 2133.0 32.43 45.62 330.0
2 2017-11-05 09:20:02.238 2423.0 35.43 55.62 NaN
3 2017-11-05 09:20:02.567 3423.0 65.43 56.62 NaN
4 2017-11-05 09:20:02.948 2463.0 45.43 58.62 NaN
I am trying to plot the ShiftedPrice column against the Datetime column and add horizontal lines for the mean and the confidence intervals of ShiftedPrice.
Have a look at the code below:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
df1 = df.dropna(subset=['ShiftedPrice'])
df1
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(121)
ax = df1.plot(x='Datetime',y='ShiftedPrice')
# Plotting the mean
ax.axhline(y=df1['ShiftedPrice'].mean(), color='r', linestyle='--', lw=2)
plt.show()
# Plotting Confidence Intervals
ax.axhline(y=df1['ShiftedPrice'].mean() + 1.96*np.std(df1['ShiftedPrice'],ddof=1), color='g', linestyle=':', lw=2)
ax.axhline(y=df1['ShiftedPrice'].mean() - 1.96*np.std(df1['ShiftedPrice'],ddof=1), color='g', linestyle=':', lw=2)
plt.show()
My problem is that the horizontal lines do not appear. Instead, I get the following output:
ax.axhline(y=df1['ShiftedPrice'].mean(), color='r', linestyle='--', lw=2)
Out[22]: <matplotlib.lines.Line2D at 0xccc5c18>
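One likely cause, assuming the code was run piece by piece in a notebook or console, is that the axhline calls were executed after the figure had already been displayed, and that df1.plot without ax= drew on a new figure instead of the subplot created by add_subplot. A minimal sketch under those assumptions is to create the axes once, pass it to DataFrame.plot with ax=, and add every line before showing the figure. The frame below is rebuilt from the sample rows in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows from the question (only the columns needed here)
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-11-05 09:20:01.134", "2017-11-05 09:20:01.789",
        "2017-11-05 09:20:02.238", "2017-11-05 09:20:02.567",
        "2017-11-05 09:20:02.948",
    ]),
    "ShiftedPrice": [300.0, 330.0, None, None, None],
})
df1 = df.dropna(subset=["ShiftedPrice"])

# Create the axes once and hand it to pandas, so everything lands on the same axes
fig, ax = plt.subplots(figsize=(10, 5))
df1.plot(x="Datetime", y="ShiftedPrice", ax=ax)

mean = df1["ShiftedPrice"].mean()
spread = 1.96 * df1["ShiftedPrice"].std()  # Series.std uses ddof=1 by default

# Add all horizontal lines before the figure is shown
ax.axhline(y=mean, color="r", linestyle="--", lw=2)
ax.axhline(y=mean + spread, color="g", linestyle=":", lw=2)
ax.axhline(y=mean - spread, color="g", linestyle=":", lw=2)

fig.canvas.draw()  # in a notebook, a single plt.show() at this point is enough
```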
I have a plot where the 5th bar is incorrectly placed right next to the 4th bar. What should I change?
My small_ax_0 pandas dataframe looks like this:
INDEX 0
0 1 5.0
1 10001 4.0
2 20001 5.0
3 30001 5.0
4 40001 5.0
5 50001 4.0
6 60001 1.0
7 70001 4.0
8 80001 0.0
9 90001 4.0
Here is my code:
plt.hist(small_ax_0[0])
plt.tick_params(axis='both', which='major', labelsize=100)
plt.tick_params(axis='both', which='minor', labelsize=100)
plt.xlabel('Position', fontsize=100)
plt.ylabel('Frequency', fontsize=100)
plt.title('My Plot', fontsize = 150) ##
plt.grid(True, which='major', color='grey', linestyle='dotted')
plt.xticks( rotation = 45)
plt.show()
pandas visualization
df['0'].value_counts().sort_index().plot(kind='bar')
By default, hist uses 10 bins, equally spaced along the range of your data. In this case the data ranges from 0 to 5, so each bin is 0.5 wide, and the values 4 and 5 fall into the last two bins, which sit right next to each other. If you just want to plot the number of occurrences of each number, I suggest using np.unique() together with a bar plot:
import numpy as np
nums, freq = np.unique(small_ax_0[0], return_counts=True)
plt.bar(nums, freq)
This gives a figure in which each bar is centered on its number.
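The binning can be inspected without drawing anything. The sketch below applies np.histogram with its default of 10 bins to the values from column 0 of small_ax_0 and compares the result with the np.unique counts. The array is typed in by hand from the question's table.

```python
import numpy as np

# Values of column 0 in small_ax_0, copied from the question
values = np.array([5, 4, 5, 5, 5, 4, 1, 4, 0, 4], dtype=float)

# Default binning: 10 equal-width bins spanning the data range
counts, edges = np.histogram(values)
print("bin edges: ", edges)
print("bin counts:", counts)

# Counting each distinct value instead of binning
nums, freq = np.unique(values, return_counts=True)
print("values:", nums)
print("counts:", freq)
```

The two non-empty bins at the right end of the histogram are adjacent, which is why the bar for 5 appears directly beside the bar for 4.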