I have a dataset that looks like this:
import pandas as pd
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
Vintage Model Count Case
0 2016Q1 A 1 0
1 2016Q1 A 1 1
2 2016Q2 A 1 1
3 2016Q3 A 1 0
4 2016Q4 A 1 1
5 2016Q1 B 1 1
6 2016Q2 B 1 0
7 2016Q2 B 1 0
8 2016Q2 B 1 1
9 2016Q3 B 1 1
10 2016Q4 B 1 0
What I need to do is:
1. Plot a grouped bar chart, where Vintage is the group and Model is the hue/color.
2. Add two line plots to the same chart that show the percentage of Case over Count, i.e. Case divided by Count for each Model and Vintage.
I figured out how to do the first task with a pivot table but haven't been able to add the percentage from the same pivot.
This is the solution for point 1:
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
I tried dividing one column by another in the pivot table, but the result isn't in the right format to plot.
How can I do the percentage calculation and the line plots without creating a different table?
Could the whole task be done with groupby instead? (I find it easier to use in general.)
Here's a solution using the seaborn plotting library; I'm not sure whether it's OK for you to use it for your problem:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
agg_df = df.groupby(['Vintage','Model']).sum().reset_index()
agg_df['Fraction'] = agg_df['Case']/agg_df['Count']
sns.barplot(
x = 'Vintage',
y = 'Count',
hue = 'Model',
alpha = 0.5,
data = agg_df,
)
sns.lineplot(
x = 'Vintage',
y = 'Fraction',
hue = 'Model',
marker = 'o',
legend = False,
data = agg_df,
)
plt.show()
plt.close()
IIUC you want the lines to be drawn on the same plot. I'd recommend creating a new y-axis after computing the division from the original df. Then you can plot the lines with seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
ax1 = dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
dfd = df.groupby(["Vintage", "Model"]).sum() \
.assign(div_pct=lambda x:100*x["Case"]/x["Count"]) \
.reset_index()
ax2 = ax1.twinx() # creating a second y axis
sns.lineplot(data=dfd, x="Vintage", y="div_pct", hue="Model", style="Model", ax=ax2, markers=True, dashes=False)
plt.show()
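On the groupby part of the question: here is a minimal sketch of a groupby-only variant, assuming plain matplotlib is acceptable for the lines. The names `agg`, `wide` and `pct` are my own and do not come from the posts above. The lines are drawn against integer positions because a pandas bar plot places its categories at 0..n-1.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Vintage': ['2016Q1', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2016Q1',
                '2016Q2', '2016Q2', '2016Q2', '2016Q3', '2016Q4'],
    'Model': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Case': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# A single groupby yields both sums; the percentage is derived from them
agg = df.groupby(['Vintage', 'Model'])[['Count', 'Case']].sum()
agg['pct'] = 100 * agg['Case'] / agg['Count']

# Move Model into the columns so each model becomes one bar series / one line
wide = agg.unstack('Model')

ax1 = wide['Count'].plot(kind='bar', figsize=(8, 4), rot=45,
                         ylabel='Frequency', title='Vintages')

# Second y-axis for the percentage lines
ax2 = ax1.twinx()
positions = range(len(wide))  # bar categories sit at 0..n-1
for model in wide['pct'].columns:
    ax2.plot(positions, wide['pct'][model], marker='o', label=f'{model} %')
ax2.set_ylabel('Case / Count (%)')
ax2.set_ylim(0, 105)
ax2.legend(loc='upper right')
# call plt.show() to display the figure
```

The intermediate `agg` frame is still a second table, but it comes from one groupby call and feeds both the bars and the lines.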
I didn't find a complete answer for what I want to do:
I have a dataframe. I want to group by user and their answers to a survey, divide the sum of each user's good answers by the total number of their answers, display it as a percentage, and plot the result.
I have an answer column which contains 1, 0 or -1. I want to filter it in order to exclude -1.
Here is what I did so far:
df_sample.groupby('user').filter(lambda x : x['answer'].mean() >-1)
or:
a = df_sample.loc[df_sample['answer']!=-1,['user','answer']]
b = a.groupby(['user','answer']).agg({'answer' : 'sum'})
As you can see, it's incomplete. Thank you for any suggestions you may have.
Let's try with some sample data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
df.head():
user answer
0 D 1
1 C 0
2 D -1
3 B 1
4 C 1
Option 1
Filter out the -1 values and use named aggregation to get the "good answers" and "total answers":
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df:
good_answer total_answer
user
A 9 15
B 11 20
C 15 19
D 7 14
Use division and multiplication to get percentage:
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
plot_df:
good_answer total_answer pct
user
A 9 15 60.000000
B 11 20 55.000000
C 15 19 78.947368
D 7 14 50.000000
Then this can be plotted with DataFrame.plot:
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
Option 2
If only the percentage is needed, groupby mean can be used to get to the resulting plot directly after filtering out the -1s:
plot_df = df[df['answer'].ne(-1)].groupby('user')['answer'].mean().mul(100)
ax = plot_df.plot(
kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
plot_df:
answer
user
A 60.000000
B 55.000000
C 78.947368
D 50.000000
Both options produce the same bar chart.
All together:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
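As a side note on the first attempt in the question: `groupby(...).filter(...)` keeps or drops whole groups, so it cannot remove individual -1 rows. The small sketch below contrasts it with a row-wise boolean mask. The six-row `df_sample` is made-up illustration data, not the asker's dataframe.

```python
import pandas as pd

# Made-up data for illustration only
df_sample = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'C', 'C'],
    'answer': [1, -1, 0, 1, -1, -1],
})

# groupby.filter decides per group: user C (mean == -1) is dropped entirely,
# but user A keeps its -1 row because A's mean is above -1
by_group = df_sample.groupby('user').filter(lambda g: g['answer'].mean() > -1)

# A boolean mask decides per row: every -1 answer is removed
by_row = df_sample[df_sample['answer'].ne(-1)]

# Per-user percentage of good answers on the row-filtered data
pct = by_row.groupby('user')['answer'].mean().mul(100)

print(by_group)
print(by_row)
print(pct)
```

Note that a user whose answers are all -1 disappears from `pct`, since no rows are left for that user after masking.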
Here is a sample solution, assuming you want to calculate the percentage based on the filtered dataframe:
import pandas as pd
import numpy as np
df_sample = pd.DataFrame(np.random.randint(-1,2,size=(10, 1)), columns=['answer'])
df_sample['user'] = [i for i in 'a b c d e f a b c d'.split(' ')]
df_filtered = df_sample[df_sample.answer>-1]
print(df_filtered.groupby('user').agg({'answer' : lambda x: x.sum()/len(df_filtered)*100}))
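One thing to be aware of in the snippet above: it divides each user's sum by `len(df_filtered)`, the number of rows left in the whole filtered frame, so the result is each user's share of all remaining answers rather than a per-user rate. The sketch below shows both denominators side by side, using fixed answer values that I made up so the numbers are reproducible.

```python
import pandas as pd

# Fixed, made-up answers so the result is reproducible
df_sample = pd.DataFrame({
    'user': 'a b c d e f a b c d'.split(),
    'answer': [1, 0, -1, 1, 1, 0, 1, 1, 0, -1],
})
df_filtered = df_sample[df_sample['answer'] > -1]

good = df_filtered.groupby('user')['answer'].sum()

# Denominator 1: all filtered rows (what the snippet above computes)
share_of_all = good / len(df_filtered) * 100

# Denominator 2: each user's own number of filtered answers
per_user = good / df_filtered.groupby('user')['answer'].size() * 100

print(pd.DataFrame({'share_of_all': share_of_all, 'per_user': per_user}))
```

Which one is wanted depends on the question being asked; the per-user version matches the "good answers / total of their answers" wording in the question.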
I have the following dataset:
import numpy as np
import pandas as pd
df = pd.DataFrame({'cls': [1,2,2,1,2,1,2,1,2,1,2],
'x': [10,11,21,21,8,1,4,3,5,6,2],
'y': [10,1,2,2,5,2,4,3,8,6,5]})
df['bin'] = pd.qcut(np.array(df['x']), 4)
a = df.groupby(['bin', 'cls'])['y'].mean()
a
This gives me
bin cls
(0.999, 3.5] 1 2.5
2 5.0
(3.5, 6.0] 1 6.0
2 6.0
(6.0, 10.5] 1 10.0
2 5.0
(10.5, 21.0] 1 2.0
2 1.5
Name: y, dtype: float64
I want to plot the right-most column (that is, the average of y per cls per bin) per bin per class. That is, for each bin we have two values of y that I would like to plot as points/scatters. Is that possible using matplotlib or seaborn?
You can indeed use seaborn for what you're asking. Does this work?
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
# set up some plotting options
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(1,1,1)
# we reset index to avoid having to do multi-indexing
a = a.reset_index()
# use seaborn with argument 'hue' to do the grouping
sns.barplot(x="bin", y="y", hue="cls", data=a, ax=ax)
plt.show()
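EDIT: I've just noticed that you wanted to plot "points". I wouldn't advise it for this dataset, but you can do that by replacing barplot with an axes-level categorical point plot such as stripplot or pointplot (catplot is figure-level and does not accept the ax argument).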
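Since the question also mentions matplotlib, here is a rough sketch of the point version using matplotlib only. It assumes the interval bins are converted to strings for the x-axis; the `bin_label` column is my own addition.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'cls': [1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                   'x': [10, 11, 21, 21, 8, 1, 4, 3, 5, 6, 2],
                   'y': [10, 1, 2, 2, 5, 2, 4, 3, 8, 6, 5]})
df['bin'] = pd.qcut(df['x'], 4)

# Mean of y per bin and class, flattened to ordinary columns
a = df.groupby(['bin', 'cls'], observed=True)['y'].mean().reset_index()

# String labels let matplotlib treat the bins as categories
a['bin_label'] = a['bin'].astype(str)

fig, ax = plt.subplots(figsize=(5, 5))
for cls_value, group in a.groupby('cls'):
    # One scatter call per class gives each class its own color and legend entry
    ax.scatter(group['bin_label'].tolist(), group['y'].tolist(),
               s=80, label=f'cls={cls_value}')
ax.set_xlabel('bin')
ax.set_ylabel('mean of y')
ax.legend()
# call plt.show() to display the figure
```

In the second bin both classes have the same mean (6.0), so the two points overlap there; that is one reason bars can be easier to read for this dataset.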
I want to create a boxplot of about 10 variables where only positive values are considered within each variable. This changes from variable to variable, so something that is 0 in one category might be positive in another.
For one variable, it looks like this so far:
ax=sns.boxplot(data=[df['Category_1_value'][df['Category_1_value'] > 0]])
I could do the above 10 times, but I hoped there was an easier way.
Is there a simple option to just ignore the 0 values within each category?
Consider replacing all negative values with np.nan before plotting:
df[df < 0] = np.nan
fig, ax = plt.subplots(figsize=(10,4))
sns.boxplot(data=df, ax=ax)
plt.show()
plt.clf()
plt.close()
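If zeros should be excluded as well (the question asks for positive values only), or if the original frame should stay untouched, a non-mutating variant with `DataFrame.where` may be an option. This is a small sketch on made-up numbers; `positive_only` is a name I picked.

```python
import pandas as pd

# Made-up values for illustration, including zeros and negatives
df = pd.DataFrame({
    'Category_1_value': [0.0, 1.5, -2.0, 3.0],
    'Category_2_value': [2.0, 0.0, 0.5, -1.0],
})

# where() keeps values that satisfy the condition and puts NaN elsewhere;
# it returns a new frame, so df itself is not modified
positive_only = df.where(df > 0)

print(positive_only)
print(positive_only.count())  # number of values per column that survive
```

`positive_only` can then be passed to `sns.boxplot(data=positive_only, ax=ax)` in the same way as `df` above.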
To demonstrate with random, seeded data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(102918)
df = pd.DataFrame(np.random.randn(100, 5))
df.columns = ['Category_'+ str(i) +'_value' for i in range(1, 6)]
print(df.head(5))
# Category_1_value Category_2_value Category_3_value Category_4_value Category_5_value
# 0 -0.911648 -0.453908 -0.495518 0.733304 0.569576
# 1 0.780117 -0.079954 0.134944 -1.764539 -0.267812
# 2 -0.256881 0.470838 0.437137 1.295758 0.385070
# 3 -1.665858 -1.001672 -0.444930 0.758346 0.132343
# 4 -0.167982 1.033756 1.636315 0.458918 0.022343
df[df < 0] = np.nan
print(df.head(5))
# Category_1_value Category_2_value Category_3_value Category_4_value Category_5_value
# 0 NaN NaN NaN 0.733304 0.569576
# 1 0.780117 NaN 0.134944 NaN NaN
# 2 NaN 0.470838 0.437137 1.295758 0.385070
# 3 NaN NaN NaN 0.758346 0.132343
# 4 NaN 1.033756 1.636315 0.458918 0.022343
Plot
fig, ax = plt.subplots(figsize=(10,4))
sns.boxplot(data=df, ax=ax)
plt.show()
plt.clf()
plt.close()
This is my pandas DataFrame.
value action
0 1 0
1 2 1
2 3 1
3 4 1
4 3 0
5 2 1
6 5 0
What I want to do is mark each value with o if action=0 and with x if action=1.
So, the plot markers should look like this:
o x x x o x o
But I have no idea how to do this.
Any help is appreciated. Thanks.
Consider the following approach:
plt.plot(df.index, df.value, '-X', markevery=df.index[df.action==1].tolist())
plt.plot(df.index, df.value, '-o', markevery=df.index[df.action==0].tolist())
Alternative solution:
plt.plot(df.index, df.value, '-')
plt.scatter(df.index[df.action==0], df.loc[df.action==0, 'value'],
marker='o', s=100, c='green')
plt.scatter(df.index[df.action==1], df.loc[df.action==1, 'value'],
marker='x', s=100, c='red')
You can plot the filtered dataframe. I.e., you can create two dataframes, one for action 0 and one for action 1. Then plot each individually.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"value":[1,2,3,4,3,2,5], "action":[0,1,1,1,0,1,0]})
df0 = df[df["action"] == 0]
df1 = df[df["action"] == 1]
# draw the full line first, then overlay one scatter per action on the same axes
ax = df.reset_index().plot(x = "index", y="value", legend=False, color="gray", zorder=0)
df0.reset_index().plot(x = "index", y="value",kind="scatter", marker="o", ax=ax)
df1.reset_index().plot(x = "index", y="value",kind="scatter", marker="x", ax=ax)
plt.show()
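If there are more than two action values, the same idea can be written as a loop over a marker mapping. This is only a sketch: the `markers` dictionary is a hypothetical mapping that would need one entry per action value.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"value": [1, 2, 3, 4, 3, 2, 5],
                   "action": [0, 1, 1, 1, 0, 1, 0]})

# Hypothetical mapping from action value to matplotlib marker
markers = {0: "o", 1: "x"}

fig, ax = plt.subplots()
# Connecting line through all points, drawn underneath the markers
ax.plot(df.index, df["value"], color="gray", zorder=0)

# One scatter call per action value, each with its own marker
for action, group in df.groupby("action"):
    ax.scatter(group.index, group["value"], marker=markers[action],
               s=100, label=f"action={action}")
ax.legend()
# call plt.show() to display the figure
```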