Plotting: qcut then groupby two variables - python

I have the following dataset:
import numpy as np
import pandas as pd
df = pd.DataFrame({'cls': [1,2,2,1,2,1,2,1,2,1,2],
                   'x': [10,11,21,21,8,1,4,3,5,6,2],
                   'y': [10,1,2,2,5,2,4,3,8,6,5]})
df['bin'] = pd.qcut(np.array(df['x']), 4)
a = df.groupby(['bin', 'cls'])['y'].mean()
a
This gives me
bin           cls
(0.999, 3.5]  1       2.5
              2       5.0
(3.5, 6.0]    1       6.0
              2       6.0
(6.0, 10.5]   1      10.0
              2       5.0
(10.5, 21.0]  1       2.0
              2       1.5
Name: y, dtype: float64
I want to plot the right-most column (that is, the average of y per cls per bin) per bin per class. That is, for each bin we have two values of y that I would like to plot as points/scatters. Is that possible using matplotlib or seaborn?

You can indeed use seaborn for what you're asking. Does this work?
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
# set up some plotting options
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(1,1,1)
# we reset index to avoid having to do multi-indexing
a = a.reset_index()
# use seaborn with argument 'hue' to do the grouping
sns.barplot(x="bin", y="y", hue="cls", data=a, ax=ax)
plt.show()
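EDIT: I've just noticed that you wanted to plot "points". I wouldn't advise it for this dataset, but you can do that by replacing barplot with stripplot (or pointplot); both accept the same x, y, hue, data and ax arguments. Note that catplot is a figure-level function and does not take an ax argument.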
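For the "points" version, here is a minimal sketch that uses plain matplotlib instead of seaborn, which is an assumption on my part since the question allows either library. It rebuilds the grouped means from the question and draws one scatter series per class; the `observed=True` argument and the conversion of the bins to strings are my additions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'cls': [1,2,2,1,2,1,2,1,2,1,2],
                   'x': [10,11,21,21,8,1,4,3,5,6,2],
                   'y': [10,1,2,2,5,2,4,3,8,6,5]})
df['bin'] = pd.qcut(df['x'], 4)

# mean of y per (bin, cls), flattened back into ordinary columns
a = df.groupby(['bin', 'cls'], observed=True)['y'].mean().reset_index()
# interval labels as plain strings so matplotlib treats them as categories
a['bin'] = a['bin'].astype(str)

fig, ax = plt.subplots(figsize=(5, 5))
# one scatter series per class, so each bin shows two points
for cls, grp in a.groupby('cls'):
    ax.scatter(grp['bin'], grp['y'], label=f'cls {cls}')
ax.set_xlabel('bin')
ax.set_ylabel('mean of y')
ax.legend()
# call plt.show() to display the figure
```

Each bin ends up with two markers, one per class, which is what the question describes.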

Related

Annotate a normalized barchart with original data

I have a dataframe consisting of:
home away type
0 0.0 0.0 reds
1 5.0 1.0 yellows
2 7.0 5.0 corners
3 4.0 10.0 PPDA
4 5.0 1.0 shots off
5 7.0 5.0 shots on
6 1.0 1.0 goals
7 66.0 34.0 possession
To get the stacked bar chart I wanted, I normalized the data using
stackeddf1 = df1.iloc[:,0:2].apply(lambda x: x*100/sum(x),axis=1)
and then I create my barchart using
ax = stackeddf1.iloc[1:, 0:2].plot.barh(align='center', stacked=True, figsize=(20, 20),legend=None)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2,
            y+height/2,
            '{:.0f}'.format(width),
            horizontalalignment='center',
            verticalalignment='center')
This, though, annotates the bar chart with the normalized values. If possible, I'd like to annotate it with my original data instead.
You can use matplotlib's bar_label function (added in matplotlib 3.4) together with the values of the original dataframe:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import pandas as pd
import numpy as np
df = pd.DataFrame({'home': np.random.randint(1, 10, 10),
                   'away': np.random.randint(1, 10, 10),
                   'type': [*'abcdefghij']})
df_normed = df.set_index('type')
df_normed = df_normed.div(df_normed.sum(axis=1), axis=0).multiply(100)
ax = df_normed.plot.barh(stacked=True, width=0.9, cmap='turbo')
for bars, col in zip(ax.containers, df.columns):
    ax.bar_label(bars, labels=df[col], label_type='center', fontsize=15, color='yellow')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1))
for sp in ['top', 'right']:
    ax.spines[sp].set_visible(False)
ax.xaxis.set_major_formatter(PercentFormatter())
ax.margins(x=0)
plt.tight_layout()
plt.show()
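The same idea applied to the dataframe from the question might look like the sketch below. It assumes matplotlib 3.4 or newer (for `bar_label`) and drops the first row, as the question does with `iloc[1:]`, because 0 versus 0 cannot be normalized. The names `orig`, `normed` and `texts` are mine.

```python
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({
    'home': [0.0, 5.0, 7.0, 4.0, 5.0, 7.0, 1.0, 66.0],
    'away': [0.0, 1.0, 5.0, 10.0, 1.0, 5.0, 1.0, 34.0],
    'type': ['reds', 'yellows', 'corners', 'PPDA', 'shots off',
             'shots on', 'goals', 'possession'],
})

# skip the all-zero first row, then normalize each row to 100 %
orig = df1.iloc[1:].set_index('type')
normed = orig.div(orig.sum(axis=1), axis=0) * 100

ax = normed.plot.barh(stacked=True, legend=False)
texts = {}
# one bar container per column; label its bars with the original values
for bars, col in zip(ax.containers, orig.columns):
    labels = [f'{v:g}' for v in orig[col]]
    texts[col] = ax.bar_label(bars, labels=labels, label_type='center')
# call plt.show() to display the figure
```

The bar lengths come from `normed`, while the text on each segment comes from `orig`, so the chart stays normalized but shows the raw counts.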

Set the y-axis to scale in a Seaborn heat map

I currently have a dataframe, df:
In [1]: df
Out [1]:
one two
1.5 11.22
2 15.36
2.5 11
3.3 12.5
3.5 14.78
5 9
6.2 26.14
I used this code to get a heat map:
In [2]:
plt.figure(figsize=(30, 7))
plt.title('Test')
ax = sns.heatmap(data=df, annot=True,)
plt.xlabel('Test')
ax.invert_yaxis()
value = 6
index = np.abs(df.index - value).argmin()
ax.axhline(index + .5, ls='--')
print(index)
Out [2]:
I am looking for the y-axis, instead, to automatically scale and plot the df[2] values in their respective positions on the full axis. For example, there should be a clear empty space between 3.5 and 5.0 as there aren’t any values - I want the values in between on the y-axis with 0 value against them.
This can be easily achieved with a bar plot instead:
plt.bar(df['one'], df['two'], color=list('rgb'), width=0.2, alpha=0.4)
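A self-contained sketch of that bar plot, assuming (as the line above does) that `one` and `two` are ordinary columns. Because the x-axis is numeric, the gap between 3.5 and 5.0 appears on its own, and the reference value 6 from the question can be marked directly with `axvline` rather than by looking up the nearest row.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'one': [1.5, 2, 2.5, 3.3, 3.5, 5, 6.2],
                   'two': [11.22, 15.36, 11, 12.5, 14.78, 9, 26.14]})

fig, ax = plt.subplots(figsize=(10, 4))
# bars are centred on the actual values of 'one', so empty ranges stay empty
bars = ax.bar(df['one'], df['two'], width=0.2, alpha=0.4)
# mark the reference value on the numeric axis
value = 6
ax.axvline(value, ls='--')
ax.set_xlabel('one')
ax.set_ylabel('two')
# call plt.show() to display the figure
```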

Pandas: Finding maxima of 2d data (integers) in Dataframe

I have a 2d data set of (x,y). x and y are integer values.
How can I use only Pandas code to find all x values where y reaches its maximum values (there are multiple and same absolute maxima)?
I also want to plot (with pandas.DataFrame.plot) x vs. y and mark the maxima positions.
Example code:
import numpy as np
import pandas as pd
np.random.seed(10)
x = np.arange(100)*0.2
y = np.random.randint(0, 20, size=100)
data = np.vstack((x, y)).T
df = pd.DataFrame(data, columns=['x', 'y'])
ymax = df['y'].max()
df_ymax = df[df['y'] == ymax]
print(df_ymax)
# x y
# 13 2.6 19.0
# 24 4.8 19.0
# 28 5.6 19.0
# 86 17.2 19.0
# 88 17.6 19.0
df.plot(x='x', y='y', figsize=(8, 4),
        ylabel='y', legend=False, style=['b-'])
I have no idea how to mark the maxima values (df_ymax) in the same plot, e.g. using circles. How can that be solved?
The final plot should look like this (here I programmed everything with numpy and matplotlib):
Get the Axes returned by df.plot and reuse it to plot the maxima values:
ax = df.plot(x='x', y='y', figsize=(8, 4), ylabel='y', legend=False, style=['b-'])
df_ymax.plot.scatter(x='x', y='y', color='r', ax=ax)
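If open circles are preferred over filled scatter dots, a possible variant is sketched below. It keeps everything in pandas, builds the frame directly from a dict instead of `np.vstack`, and passes `markerfacecolor` and `markersize` through to matplotlib; those styling choices are my assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({'x': np.arange(100) * 0.2,
                   'y': np.random.randint(0, 20, size=100)})

# all rows where y equals its global maximum
df_ymax = df[df['y'] == df['y'].max()]

ax = df.plot(x='x', y='y', figsize=(8, 4), ylabel='y', legend=False, style=['b-'])
# draw the maxima on the same Axes as red, unfilled circles
df_ymax.plot(x='x', y='y', ax=ax, legend=False, style=['ro'],
             markerfacecolor='none', markersize=10)
```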

Matplotlib DataFrame boxplot with given max, min and quartiles

I want to plot a box plot with my DataFrame:
A B C
max 10 11 14
min 3 4 10
q1 5 6 12
q3 9 7 13
How can I plot a box plot with these fixed values?
You can use the Axes.bxp method in matplotlib, based on this helpful answer. The input is a list of dictionaries containing the relevant values, but the median is a required key in these dictionaries. Since the data you provided does not include medians, I have made up medians in the code below (but you will need to calculate them from your actual data).
import matplotlib.pyplot as plt
import pandas as pd
# reproducing your data
df = pd.DataFrame({'A':[10,3,5,9],'B':[11,4,6,7],'C':[14,10,12,13]},
                  index=['max','min','q1','q3'])
# add a row for median, you need median values!
sample_medians = pd.DataFrame({'A':7.0, 'B':6.5, 'C':12.5}, index=['med'])
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df = pd.concat([df, sample_medians])
Here is the modified df with medians included:
>>> df
A B C
max 10.0 11.0 14.0
min 3.0 4.0 10.0
q1 5.0 6.0 12.0
q3 9.0 7.0 13.0
med 7.0 6.5 12.5
Now we transform the df into a list of dictionaries:
labels = list(df.columns)
# create dictionaries for each column as items of a list
bxp_stats = df.apply(lambda x: {'med':x.med, 'q1':x.q1, 'q3':x.q3, 'whislo':x['min'], 'whishi':x['max']}, axis=0).tolist()
# add the column names as labels to each dictionary entry
for index, item in enumerate(bxp_stats):
    item.update({'label':labels[index]})
_, ax = plt.subplots()
ax.bxp(bxp_stats, showfliers=False);
plt.show()
Unfortunately, the median is a required entry for every box, so it must always be specified. If you do not want it shown, make the median line as thin as possible and give it the same color as the box so that it is virtually invisible.
If you want each box to be drawn with different specifications, they will have to be in different subplots. I understand if this looks kind of ugly, so you can play around with the spacing between subplots or consider removing some of the y-axes.
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True)
# specify list of background colors, median line colors same as background with as thin of a width as possible
colors = ['LightCoral', '#FEF1B5', '#EEAEEE']
medianprops = [dict(linewidth = 0.1, color='LightCoral'), dict(linewidth = 0.1, color='#FEF1B5'), dict(linewidth = 0.1, color='#EEAEEE')]
# create a list of boxplots of length 3
bplots = [axes[i].bxp([bxp_stats[i]], medianprops=medianprops[i], patch_artist=True, showfliers=False) for i in range(len(df.columns))]
# set each boxplot a different color
for i, bplot in enumerate(bplots):
    for patch in bplot['boxes']:
        patch.set_facecolor(colors[i])
plt.show()
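If the raw observations are still available, the medians do not have to be made up. The sketch below assumes a hypothetical frame `raw` of measurements (random numbers here, purely for illustration) and derives all five statistics with `DataFrame.quantile` before handing them to `Axes.bxp`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical raw observations; replace with the real measurements
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(loc=[7, 7, 12], scale=2, size=(50, 3)),
                   columns=['A', 'B', 'C'])

# rows: min, q1, median, q3, max for every column
q = raw.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# one dictionary per box, using the keys that Axes.bxp expects
bxp_stats = [{'label': col,
              'whislo': q.loc[0.0, col], 'q1': q.loc[0.25, col],
              'med': q.loc[0.5, col], 'q3': q.loc[0.75, col],
              'whishi': q.loc[1.0, col], 'fliers': []}
             for col in raw.columns]

_, ax = plt.subplots()
ax.bxp(bxp_stats, showfliers=False)
# call plt.show() to display the figure
```

Here the whiskers run to the minimum and maximum, matching the layout of the table in the question rather than the usual 1.5 IQR rule.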

How can I create a boxplot on only positive values in Seaborn?

I want to create a boxplot of about 10 variables where only the positive values of each variable are considered. Which rows are positive changes from variable to variable, so something that is 0 in one category might be positive in another.
This is what I have so far for one variable:
ax=sns.boxplot(data=[df['Category_1_value'][df['Category_1_value'] > 0]])
I could do the above 10 times but hoped there was an easier way.
Is there a simple option to just ignore the 0 values within each category?
Consider replacing all non-positive values with np.nan before plotting (use df < 0 instead if zeros should be kept):
df[df <= 0] = np.nan
fig, ax = plt.subplots(figsize=(10,4))
sns.boxplot(data=df, ax=ax)
plt.show()
plt.clf()
plt.close()
To demonstrate with random, seeded data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(102918)
df = pd.DataFrame(np.random.randn(100, 5))
df.columns = ['Category_'+ str(i) +'_value' for i in range(1, 6)]
print(df.head(5))
# Category_1_value Category_2_value Category_3_value Category_4_value Category_5_value
# 0 -0.911648 -0.453908 -0.495518 0.733304 0.569576
# 1 0.780117 -0.079954 0.134944 -1.764539 -0.267812
# 2 -0.256881 0.470838 0.437137 1.295758 0.385070
# 3 -1.665858 -1.001672 -0.444930 0.758346 0.132343
# 4 -0.167982 1.033756 1.636315 0.458918 0.022343
df[df <= 0] = np.nan
print(df.head(5))
# Category_1_value Category_2_value Category_3_value Category_4_value Category_5_value
# 0 NaN NaN NaN 0.733304 0.569576
# 1 0.780117 NaN 0.134944 NaN NaN
# 2 NaN 0.470838 0.437137 1.295758 0.385070
# 3 NaN NaN NaN 0.758346 0.132343
# 4 NaN 1.033756 1.636315 0.458918 0.022343
Plot
fig, ax = plt.subplots(figsize=(10,4))
sns.boxplot(data=df, ax=ax)
plt.show()
plt.clf()
plt.close()
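A non-mutating variant is sketched below: `DataFrame.where` keeps the strictly positive entries and leaves the original frame untouched. For the sake of a dependency-light example it draws the boxes with pandas' own `plot.box`, which is my substitution; the masked frame can be passed to `sns.boxplot(data=positive, ax=ax)` in the same way as above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(102918)
df = pd.DataFrame(np.random.randn(100, 5),
                  columns=[f'Category_{i}_value' for i in range(1, 6)])

# keep strictly positive entries, everything else becomes NaN; df is unchanged
positive = df.where(df > 0)

fig, ax = plt.subplots(figsize=(10, 4))
# NaN entries are ignored per column when the boxes are computed
positive.plot.box(ax=ax)
# call plt.show() to display the figure
```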
