I have a boxplot:
fig, ax = plt.subplots(1,1)
bp = df.boxplot(column='transaction_value',
by='store_type', grid=True,
ax=ax, showfliers=True)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
ax.set_ylim([0, 800])
ax.set_ylabel('transaction_value')
plt.show()
I have a seaborn stripplot:
bplot=sns.stripplot(y='transaction_value', x='store_type',
data=df,
jitter=True,
marker='o',
alpha=0.1,
color='black')
When I try to overlay the stripplot on the boxplot, it deletes the first boxplot (on the very far left).
fig, ax = plt.subplots(1,1)
bp = df.boxplot(column='transaction_value',
by='store_type', grid=True,
ax=ax, showfliers=True)
bplot=sns.stripplot(y='transaction_value', x='store_type',
data=df,
jitter=True,
marker='o',
alpha=0.1,
color='black')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
ax.set_ylim([0, 500])
ax.set_ylabel('transaction_value')
plt.show()
How can I stop this from happening?
Added data example:
a
transaction_value store_type
0 30.927648 express
1 20.356693 extra
2 48.201950 metro
3 77.213957 metro
4 15.482211 superstore
5 85.794876 superstore
6 16.199844 extra
7 0.007816 superstore
8 50.925737 metro
9 81.393811 metro
10 7.616312 superstore
11 82.172441 metro
12 49.608503 extra
13 71.907878 metro
14 85.833738 superstore
15 88.131029 express
16 11.541427 extra
17 89.759724 metro
18 96.435902 superstore
19 91.984656 superstore
20 67.795293 metro
21 39.806654 superstore
22 39.565823 metro
23 37.507718 superstore
24 37.918300 metro
25 18.599158 metro
26 3.815219 extra
27 83.210068 express
28 3.988503 extra
29 94.298953 superstore
a = pd.read_clipboard()
fig, ax = plt.subplots(1,1)
bp = a.boxplot(column='transaction_value',
by='store_type', grid=True,
ax=ax, showfliers=True)
bplot=sns.stripplot(y='transaction_value', x='store_type',
data=a,
jitter=True,
marker='o',
alpha=0.1,
color='black')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
ax.set_ylim([0, 500])
ax.set_ylabel('transaction_value')
plt.show()
@ImportanceOfBeingErnest provided a solution in the comments while I was typing, but I was going to suggest something else:
For better consistency, I would recommend using seaborn for the boxplots as well. This should ensure that both plots are laid out the same way:
fig, ax = plt.subplots(1,1)
sns.boxplot(y='transaction_value', x='store_type', data=df, ax=ax,
color='w')
sns.stripplot(y='transaction_value', x='store_type', data=df, ax=ax,
jitter=True,
marker='o',
alpha=0.1,
color='black')
ax.set_ylabel('transaction_value')
plt.show()
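One likely reason the two libraries disagree (my reading, not something stated in the thread) is that matplotlib's boxplot, which pandas calls, places boxes at positions 1..N by default, while seaborn places categories at 0..N-1. A minimal sketch that compares the tick positions, using made-up data in place of the question's DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
stores = ["express", "extra", "metro", "superstore"]
df = pd.DataFrame({
    "store_type": np.repeat(stores, 25),
    "transaction_value": rng.uniform(0, 100, size=100),
})

# pandas boxplot on its own axes
fig1, ax_pandas = plt.subplots()
df.boxplot(column="transaction_value", by="store_type", ax=ax_pandas)

# seaborn stripplot on its own axes
fig2, ax_seaborn = plt.subplots()
sns.stripplot(x="store_type", y="transaction_value", data=df, ax=ax_seaborn)

# Compare where each library put the category ticks.
pandas_ticks = [float(t) for t in ax_pandas.get_xticks()]
seaborn_ticks = [float(t) for t in ax_seaborn.get_xticks()]
print("pandas boxplot ticks:", pandas_ticks)
print("seaborn stripplot ticks:", seaborn_ticks)
```

If the positions differ by one, the points land one slot to the left of their boxes when both are drawn on the same axes, which would be consistent with the leftmost category appearing to have no box.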
I am trying to plot a scatter plot on top of a bar plot using sns.scatterplot() and df.plot(kind='bar'). The figure turns out fine, but it would be even nicer if I could align each scatter point with the bar that has the same label.
I have read in the matplotlib documentation that Rectangle has a get_x() method that can "Return the left coordinate of the rectangle".
I wonder if there is a way to assign these coordinates to the scatter points plotted by seaborn?
Code
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(9, 9)
fig.set_dpi(300)
bar_df.plot(kind='bar', ax=ax)
ax2 = ax.twinx()
sns.scatterplot(data=line_df, ax=ax2)
Dataframes
bar_df
year  apple  banana  citrus  ...
2020     12      34      56   78
2025     12      34      56   78
2030     12      34      56   78
2035     12      34      56   78
line_df
year  apple  banana  citrus  ...
2020     23      45      67   89
2025     23      45      67   89
2030     23      45      67   89
2035     23      45      67   89
It'd be really nice if I could put each point on the same vertical line as the bar with the same header.
sns.scatterplot interprets the x-axis as numeric. As such, it doesn't align well with a bar plot, nor does it have a dodge= parameter.
You can use sns.stripplot instead.
Seaborn works easiest with its data in "long form", which can be achieved via pandas pd.melt.
Here is some example code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
bar_df, line_df = pd.read_html('https://stackoverflow.com/questions/73191315')
bar_df_long = bar_df.melt(id_vars='year', var_name='fruit', value_name='bar_value')
line_df_long = line_df.melt(id_vars='year', var_name='fruit', value_name='line_value')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6,6), dpi=300)
sns.barplot(data=bar_df_long, x='year', y='bar_value', hue='fruit', dodge=True, ax=ax)
ax2 = ax.twinx()
sns.stripplot(data=line_df_long, x='year', y='line_value', hue='fruit', dodge=True, jitter=False,
edgecolor='black', linewidth=1, ax=ax2)
ax2.legend_.remove() # remove the second legend
plt.tight_layout()
plt.show()
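To make the wide-to-long step above concrete, here is a small sketch of what melt produces. The DataFrame is a hypothetical cut-down version of bar_df with only three fruit columns and two years, since the question elides the remaining columns:

```python
import pandas as pd

# Hypothetical cut-down version of the question's bar_df (wide form).
bar_df = pd.DataFrame({
    "year": [2020, 2025],
    "apple": [12, 12],
    "banana": [34, 34],
    "citrus": [56, 56],
})

# Long form: one row per (year, fruit) pair, which is what seaborn's
# x=, y= and hue= arguments expect.
bar_df_long = bar_df.melt(id_vars="year", var_name="fruit", value_name="bar_value")
print(bar_df_long)
```

Each fruit column becomes a value of the new fruit column, so hue='fruit' can then separate the series and dodge=True can place them side by side.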
#dataframe
a=
timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54
I applied this code
a.plot(kind="density")
but this is not what I want.
I want count on the y-axis and timestamp on the x-axis, drawn as a density plot,
just like I can do with plt.bar(a['timestamp'],a['count']).
Or is this not possible with density plotting?
The following code creates a density histogram. The total area sums to 1, supposing each of the timestamps counts as 1 unit. To get the timestamps as x-axis, they are set as the index. To get the total area to sum to 1, all count values are divided by their total sum.
A KDE is calculated from the same data.
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
from io import StringIO
a_str = '''timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54'''
a = pd.read_csv(StringIO(a_str), sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
ax = (a.set_index('timestamp') / a['count'].sum()).plot.bar(width=0.9, rot=0, figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kde')
ax.set_xlim(xs[0], xs[-1])
ax.legend(labels=['kde', 'density histogram'])
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
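As a quick check of the normalization described above, the following sketch confirms that the bar heights sum to 1 (taking each timestamp as one unit of width) and that the weighted KDE integrates to 1. It reuses the counts from the question; the variable names are only illustrative:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Counts from the question, one per consecutive day.
counts = np.array([20, 60, 35, 1, 0, 1, 50, 36, 68, 125, 54])
positions = np.arange(len(counts))

# Density histogram: each bar has width 1, so the area is the sum of heights.
bar_heights = counts / counts.sum()
bar_area = bar_heights.sum()

# Weighted KDE over the day positions, as in the answer's code.
kde = gaussian_kde(positions, bw_method=0.2, weights=counts)
kde_area = kde.integrate_box_1d(-np.inf, np.inf)

print(bar_area, kde_area)
```

Because both areas are normalized the same way, the curve and the bars can share one y-axis labelled as density.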
If you just want to plot the kde curve, you can leave out the histogram. Optionally you can fill the area under the curve.
fig, ax = plt.subplots(figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
# plot the kde curve
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kernel density estimation')
# optionally fill the area below the curve
ax.fill_between(xs, kde(xs), color='crimson', alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
To plot multiple similar curves, for example using more count columns, you can use a loop. A list of colors that go well together could be obtained from the Set2 colormap:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
a = pd.DataFrame({'timestamp': ['2021-08-16', '2021-08-17', '2021-08-18', '2021-08-19', '2021-08-20', '2021-08-21',
'2021-08-22', '2021-08-23', '2021-08-24', '2021-08-25', '2021-08-26']})
for i in range(1, 5):
    a[f'count{i}'] = (np.random.uniform(0, 12, len(a)) ** 2).astype(int)
xs = np.linspace(-1, len(a), 200)
fig, ax = plt.subplots(figsize=(12, 4))
for column, color in zip(a.columns[1:], plt.cm.Set2.colors):
    kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a[column])
    ax.plot(xs, kde(xs), lw=2, color=color, label=f"kde of '{column}'")
    ax.fill_between(xs, kde(xs), color=color, alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('Date')
ax.set_ylabel('Density of Counts')
plt.tight_layout()
plt.show()
So I've been trying to plot some data, but the x-axis is limited to two years. My question is pretty simple: can someone explain why the x-axis is limited to the range 2015Q1 - 2017Q1 when the available data covers 2015Q1 - 2020Q1? Is there something missing or incorrect in my code?
dd2
qtr median count
0 2015Q1 1290000.0 27
1 2015Q2 1330000.0 43
2 2015Q3 1570000.0 21
3 2015Q4 1371000.0 20
4 2016Q1 1386500.0 20
5 2016Q2 1767500.0 22
6 2016Q3 1427500.0 32
7 2016Q4 1501000.0 31
8 2017Q1 1700000.0 29
9 2017Q2 1630000.0 15
10 2017Q3 1687500.0 24
11 2017Q4 1450000.0 15
12 2018Q1 1505000.0 13
13 2018Q2 1494000.0 14
14 2018Q3 1415000.0 21
15 2018Q4 1150000.0 15
16 2019Q1 1228000.0 15
17 2019Q2 1352500.0 12
18 2019Q3 1237500.0 12
19 2019Q4 1455000.0 26
20 2020Q1 1468000.0 9
code
x = dd2['qtr']
y1 = dd2['count']
y2 = dd2['median']
fig, ax = plt.subplots(figsize=(40,10))
ax = plt.subplot(111)
ax2 = ax.twinx()
y1_plot = y1.plot(ax=ax2, color='green', legend=True, marker='*', label="median")
y2_plot = y2.plot(ax=ax, color='red', legend=True, linestyle='--', marker='x', label="count")
plt.title('Price trend analysis')
ax.set_xticklabels(x, rotation='vertical',color='k', size=20)
ax.set_xlabel('year')
ax.set_ylabel('sold price')
ax2.set_ylabel('number of sales')
y1_patch = mpatches.Patch(color='red', label='median sold price')
y2_patch = mpatches.Patch(color='green', label='count')
plt.legend(handles=[y2_patch,y1_patch],loc='upper right')
plt.savefig('chart.png', dpi=300,bbox_inches ='tight')
plt.show()
Use matplotlib.ticker (imported as mtick) to put a tick at every x-axis data point:
import matplotlib.ticker as mtick
ax.xaxis.set_major_locator(mtick.IndexLocator(base=1, offset=0))
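A possible explanation for the truncated axis (my reading, not something stated in the thread): Series.plot uses the integer index 0..20 for x, matplotlib picks only a handful of default ticks, and set_xticklabels then hands the first few quarter labels to those ticks. The sketch below uses a different technique from the locator above: it sets one tick per row explicitly with set_xticks before labelling. The count values are placeholders, not the question's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Quarter labels 2015Q1 .. 2020Q1, as in the question; counts are placeholders.
quarters = [f"{year}Q{q}" for year in range(2015, 2020) for q in range(1, 5)] + ["2020Q1"]
dd2 = pd.DataFrame({"qtr": quarters, "count": range(len(quarters))})

fig, ax = plt.subplots()
dd2["count"].plot(ax=ax, marker="*")  # x values are the integer index 0..20

# One tick per row, then one label per tick, so nothing is dropped.
ax.set_xticks(range(len(dd2)))
ax.set_xticklabels(dd2["qtr"], rotation="vertical")

tick_labels = [t.get_text() for t in ax.get_xticklabels()]
print(tick_labels)
```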
Instead of going through Pandas' Series plotting methods, I'd use pyplot to plot your x and y data together, like this:
# everything is the same up to 'ax2 = ax.twinx()'
# plot on your axes, save a reference to the line
line1 = ax.plot(x, y2, color="red", label="median sold price", linestyle='--', marker='x')
line2 = ax2.plot(x, y1, color="green", label="count", marker='*')
# no need for messing with patches
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax.legend(lines, labels, loc='upper right')
# this is the same as before again
plt.title('Price trend analysis')
ax.xaxis.set_tick_params(rotation=90, color='k', size=20)
ax.set_xlabel('year')
ax.set_ylabel('sold price')
ax2.set_ylabel('number of sales')
plt.savefig('chart.png', dpi=300,bbox_inches ='tight')
plt.show()
When comparing two different Y variables, there is no real way of knowing which chart type belongs to which Y-Axis. I need a legend that says which chart type belongs to which data set.
With help from this site I've been able to plot different categorized factors using different chart types, but as you can see there is no way to tell which chart type belongs to which factor/variable.
This is the data table (tm_daily_df) and the current code:
report_date shift UTL_R Head_Count
0 2019-03-17 A 0.669107 39
1 2019-03-18 A 0.602197 69
2 2019-03-19 A 0.568741 72
3 2019-03-20 A 0.552013 78
4 2019-03-21 A 0.585469 57
5 2019-03-22 A 0.635652 61
6 2019-03-23 A 0.602197 51
7 2019-03-17 1 0.828020 16
8 2019-03-17 2 0.585469 8
9 2019-03-17 3 0.526922 15
10 2019-03-18 1 0.618924 30
11 2019-03-18 2 0.610560 20
12 2019-03-18 3 0.577105 19
13 2019-03-19 1 0.610560 28
14 2019-03-19 2 0.602197 26
15 2019-03-19 3 0.468375 18
16 2019-03-20 1 0.543650 33
17 2019-03-20 2 0.552013 26
18 2019-03-20 3 0.552013 19
19 2019-03-21 1 0.577105 22
20 2019-03-21 2 0.585469 19
21 2019-03-21 3 0.602197 16
22 2019-03-22 1 0.593833 26
23 2019-03-22 2 0.685835 20
24 2019-03-22 3 0.635652 15
25 2019-03-23 1 0.577105 23
26 2019-03-23 2 0.627288 16
27 2019-03-23 3 0.602197 12
fig, ax = plt.subplots(figsize=(15,6))
g = sns.lineplot(x='report_date', y='UTL_R', data=tm_daily_df, ax=ax, hue = 'shift', legend = None,
marker='o', markersize=10)
ax2 = ax.twinx()
g = sns.barplot(x='report_date', y='Head_Count', data=tm_daily_df, ax=ax2, hue='shift',alpha=.5)
ax.set_title('Utilization Ratio vs HeadCount')
plt.show()
I want to have a legend that says which chart type belongs to which data set. In this case, there would be a secondary legend that shows a line and the word "UTL_R" and a square(or whatever would represent a bar graph) next to the word "Head_Count" .
I'm also open to any other ideas that can define the applied chart types. Keep in mind this graph is one of many from a huge set of variables, it's not a single instance.
If this is not possible, is there maybe a way I can just put an image or small table into the figure?
tl;dr at the bottom
I recently needed to implement two legends on a project as well. My code is something like:
def plot_my_data(ax, local_zerog, local_oneg, local_maxg):
    # local_zerog list looks like: [local_zerog_dcmdcl_names, local_zerog_dcmdcl_values, local_zerog_time2double_names, local_zerog_time2double_values]
    # the others are structured the same way as well
    mpl.rcParams["lines.markersize"] = 7
    dcmdcl = ax.scatter(local_zerog[0], local_zerog[1], label='Zero G', facecolors='none', edgecolors='b') #dcmdcl
    ax.scatter(local_oneg[0], local_oneg[1], label="One G", facecolors='none', edgecolors='g')
    ax.scatter(local_maxg[0], local_maxg[1], label="Max G", facecolors='none', edgecolors='r')
    ax.tick_params(axis="x", direction="in", top=False, labeltop=False, labelbottom=True)
    ax.tick_params(axis="y", direction="in", right=True)
    labels = ax.get_xticklabels()
    plt.setp(labels, rotation=90, horizontalalignment='center')
    legend1 = ax.legend(loc=1)
    time2double = ax.scatter(local_zerog[2], local_zerog[3], label='Zero G', marker='s', color='b') #time2double
    ax.scatter(local_oneg[2], local_oneg[3], label="One G", marker='s', color='g')
    ax.scatter(local_maxg[2], local_maxg[3], label="Max G", marker='s', color='r')
    ax.plot(local_oneg[0], [0 for _ in local_oneg[0]], color='k') # line at 0
    ax.plot(local_oneg[2], [0 for _ in local_oneg[2]], color='k')
    ax.legend([dcmdcl, time2double], ["dcmdcl [%]", "time2double [s]"], loc=2)
    plt.gca().add_artist(legend1)
Where I had basically 6 sets of data: 3 for dcmdcl and 3 for time2double. Each has a different color/shape so basically I plotted all of one shape in the lines
dcmdcl = ax.scatter(local_zerog[0], local_zerog[1], label='Zero G', facecolors='none', edgecolors='b') #dcmdcl
ax.scatter(local_oneg[0], local_oneg[1], label="One G", facecolors='none', edgecolors='g')
ax.scatter(local_maxg[0], local_maxg[1], label="Max G", facecolors='none', edgecolors='r')
ax.tick_params(axis="x", direction="in", top=False, labeltop=False, labelbottom=True)
ax.tick_params(axis="y", direction="in", right=True)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=90, horizontalalignment='center')
legend1 = ax.legend(loc=1)
where the last line generates a legend based off the various labels I've assigned. Now to differentiate between the shapes I took one dcmdcl and one time2double and made another legend. The relevant code is:
dcmdcl = ax.scatter(local_zerog[0], local_zerog[1], label='Zero G', facecolors='none', edgecolors='b') #dcmdcl
time2double = ax.scatter(local_zerog[2], local_zerog[3], label='Zero G', marker='s', color='b') #time2double
ax.legend([dcmdcl, time2double], ["dcmdcl [%]", "time2double [s]"], loc=2)
where I basically feed it two specific instances and tell it to create another legend from this information and place it at another location.
tl;dr
It looks like you already have the legend you want for one of the data sets, so now you basically need to run:
legend1 = ax.legend(['put a series of items you want to describe here'], ['put how you would like to title them (needs to be in same order as previous list)'], loc=2)
plt.gca().add_artist(legend1)
I think the order might be important here (I don't remember from when I made it), but if you'll notice my order is:
plot some stuff
legend1 = ax.legend(loc=1) to make a legend (not plotted yet, just a variable)
plot more stuff
ax.legend([dcmdcl, time2double], ["dcmdcl [%]", "time2double [s]"], loc=2) (note this is not assigned to a variable this time)
plt.gca().add_artist(legend1) now I use the variable made earlier to plot it via add_artist()
My code to generate each ax that is passed into my function above:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 3, 1)
zerog, oneg, maxg = build_plot_data(lower_mach)
plot_my_data(ax, zerog, oneg, maxg)
ax.set_title("Mach < .7")
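Applied to the line-plus-bar case from the question, the same two-legend pattern might look like the sketch below. It uses plain matplotlib rather than seaborn, only the three first shift A rows from the question's table, and proxy handles (a Line2D and a Patch) that I introduce for illustration to stand for the chart types:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.legend import Legend
from matplotlib.lines import Line2D
from matplotlib.patches import Patch

days = [0, 1, 2]
utl_r = [0.669107, 0.602197, 0.568741]   # shift A, first three dates
head_count = [39, 69, 72]

fig, ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(days, utl_r, marker="o", color="C0")
ax2.bar(days, head_count, alpha=0.5, color="C0", label="A")

# First legend: which colour is which shift. Keep a reference to it.
series_legend = ax2.legend(title="shift", loc="upper right")

# Second legend: which chart type shows which variable, built from proxy handles.
type_handles = [
    Line2D([], [], color="gray", marker="o"),
    Patch(facecolor="gray", alpha=0.5),
]
ax2.legend(type_handles, ["UTL_R", "Head_Count"], loc="upper left")

# The second call replaced the first legend, so add it back as a plain artist.
ax2.add_artist(series_legend)

legends = [child for child in ax2.get_children() if isinstance(child, Legend)]
type_labels = [text.get_text() for text in ax2.get_legend().get_texts()]
```

Both legends are attached to ax2 here because the twin axes is drawn on top; attaching them to ax instead should also work but may leave the legends partly hidden behind the bars.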
Here is the sample data:
Datetime Price Data1 Data2 ShiftedPrice
0 2017-11-05 09:20:01.134 2123.0 12.23 34.12 300.0
1 2017-11-05 09:20:01.789 2133.0 32.43 45.62 330.0
2 2017-11-05 09:20:02.238 2423.0 35.43 55.62 NaN
3 2017-11-05 09:20:02.567 3423.0 65.43 56.62 NaN
4 2017-11-05 09:20:02.948 2463.0 45.43 58.62 NaN
I am trying to plot ShiftedPrice against Datetime, with horizontal lines for the mean and the confidence interval bounds of the ShiftedPrice column.
Have a look at the code below:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
df1 = df.dropna(subset=['ShiftedPrice'])
df1
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(121)
ax = df1.plot(x='Datetime',y='ShiftedPrice')
# Plotting the mean
ax.axhline(y=df1['ShiftedPrice'].mean(), color='r', linestyle='--', lw=2)
plt.show()
# Plotting Confidence Intervals
ax.axhline(y=df1['ShiftedPrice'].mean() + 1.96*np.std(df1['ShiftedPrice'],ddof=1), color='g', linestyle=':', lw=2)
ax.axhline(y=df1['ShiftedPrice'].mean() - 1.96*np.std(df1['ShiftedPrice'],ddof=1), color='g', linestyle=':', lw=2)
plt.show()
My problem is that the horizontal lines do not appear. Instead, I get the following output:
ax.axhline(y=df1['ShiftedPrice'].mean(), color='r', linestyle='--', lw=2)
Out[22]: <matplotlib.lines.Line2D at 0xccc5c18>