create a heatmap of two categorical variables - python

I have a dataset with three variables:
df['Score'] a float dummy (1 or 0)
df['Province'] an object column where each row is a region
df['Product type'] an object column indicating the industry.
I would like to create a jointplot with the different industries on the x axis, the different provinces on the y axis, and the relative frequency of the score as the colour.
Something like this.
https://seaborn.pydata.org/examples/hexbin_marginals.html
For the time being, I could only do the following
mean = df.groupby(['Province', 'Product type'])['score'].mean()
But I am not sure how to plot it.
Thanks!

If you are looking for a heatmap, you can use seaborn's heatmap function. However, you need to pivot your table first.
Just creating a small example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
score = [1, 1, 1, 0, 1, 0, 0, 0]
provinces = ['Place1' ,'Place2' ,'Place2', 'Place3','Place1', 'Place2','Place3','Place1']
products = ['Product1' ,'Product3' ,'Product2', 'Product2','Product1', 'Product2','Product1','Product1']
df = pd.DataFrame({'Province': provinces,
'Product type': products,
'score': score
})
My df looks like:
  Province Product type  score
0   Place1     Product1      1
1   Place2     Product3      1
2   Place2     Product2      1
3   Place3     Product2      0
4   Place1     Product1      1
5   Place2     Product2      0
6   Place3     Product1      0
7   Place1     Product1      0
Then:
df_heatmap = df.pivot_table(values='score', index='Province', columns='Product type', aggfunc='mean')
sns.heatmap(df_heatmap,annot=True)
plt.show()
The result is:
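Since the question already computes a groupby mean, here is a minimal sketch of how that same result could be reshaped into the matrix a heatmap needs. It reuses the sample data from this answer; the variable name `matrix` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Province': ['Place1', 'Place2', 'Place2', 'Place3',
                 'Place1', 'Place2', 'Place3', 'Place1'],
    'Product type': ['Product1', 'Product3', 'Product2', 'Product2',
                     'Product1', 'Product2', 'Product1', 'Product1'],
    'score': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean score per (Province, Product type) pair, as in the question.
mean = df.groupby(['Province', 'Product type'])['score'].mean()

# Move the inner index level to the columns: one row per province,
# one column per product type, NaN where a combination never occurs.
matrix = mean.unstack('Product type')
print(matrix)
```

The resulting `matrix` should be usable in place of `df_heatmap` in the `sns.heatmap` call above.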

Related

Divide two columns in pivot table and plot grouped bar chart with pandas

I have a dataset that looks like this:
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
Vintage Model Count Case
0 2016Q1 A 1 0
1 2016Q1 A 1 1
2 2016Q2 A 1 1
3 2016Q3 A 1 0
4 2016Q4 A 1 1
5 2016Q1 B 1 1
6 2016Q2 B 1 0
7 2016Q2 B 1 0
8 2016Q2 B 1 1
9 2016Q3 B 1 1
10 2016Q4 B 1 0
What I need to do is:
Plot grouped bar chart, where vintage is the groups and model is the hue/color
Two line plots in the same chart that show the percentage of case over count, aka plot the division of case over count for each model and vintage.
I figured out how to do the first task with a pivot table but haven't been able to add the percentage from the same pivot.
This is the solution for point 1:
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
I tried dividing between columns in the pivot table but it's not the right format to plot.
How can I do the percentage calculation and the line plots without creating a different table?
Could the whole task be done with groupby instead? (as I find it easier to use in general)
Here's a solution using the seaborn plotting library; I'm not sure whether it's OK for you to use it for your problem.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
agg_df = df.groupby(['Vintage','Model']).sum().reset_index()
agg_df['Fraction'] = agg_df['Case']/agg_df['Count']
sns.barplot(
x = 'Vintage',
y = 'Count',
hue = 'Model',
alpha = 0.5,
data = agg_df,
)
sns.lineplot(
x = 'Vintage',
y = 'Fraction',
hue = 'Model',
marker = 'o',
legend = False,
data = agg_df,
)
plt.show()
plt.close()
IIUC you want the lines to be drawn on the same plot. I'd recommend creating a new y-axis after computing the division from the original df. Then you can plot the lines with seaborn:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({
'Vintage': ['2016Q1','2016Q1', '2016Q2','2016Q3','2016Q4','2016Q1', '2016Q2','2016Q2','2016Q2','2016Q3','2016Q4'],
'Model': ['A','A','A','A','A','B','B','B','B','B','B',],
'Count': [1,1,1,1,1,1,1,1,1,1,1],
'Case':[0,1,1,0,1,1,0,0,1,1,0],
})
dfp = df.pivot_table(index='Vintage', columns='Model', values='Count', aggfunc='sum')
ax1 = dfp.plot(kind='bar', figsize=(8, 4), rot=45, ylabel='Frequency', title="Vintages")
dfd = df.groupby(["Vintage", "Model"]).sum() \
.assign(div_pct=lambda x:100*x["Case"]/x["Count"]) \
.reset_index()
ax2 = ax1.twinx() # creating a second y axis
sns.lineplot(data=dfd, x="Vintage", y="div_pct", hue="Model", style="Model", ax=ax2, markers=True, dashes=False)
plt.show()
Output:
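On the question of whether the whole task can be done with groupby: a rough sketch, assuming the same sample data, is to aggregate once and then unstack each quantity into a wide table. The names `agg`, `counts` and `pct` are illustrative and not taken from either answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Vintage': ['2016Q1', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2016Q1',
                '2016Q2', '2016Q2', '2016Q2', '2016Q3', '2016Q4'],
    'Model': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Case': [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
})

# One aggregation gives both totals per (Vintage, Model) pair.
agg = df.groupby(['Vintage', 'Model'])[['Count', 'Case']].sum()

# Percentage of cases over count for each pair.
agg['pct'] = 100 * agg['Case'] / agg['Count']

# Wide tables: rows are vintages, columns are models.
counts = agg['Count'].unstack('Model')
pct = agg['pct'].unstack('Model')
print(counts)
print(pct)
```

Here `counts` has the same layout as the pivot table `dfp`, so it could feed the bar chart, while `pct` holds the values for the lines on the second y axis.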

pie chart drawing for a specific column in pandas python

I have a dataframe df, which has many columns. In df["house_electricity"], there are values like 1,0 or blank/NA. I want to plot the column in terms of a pie chart, where percentage of only 1 and 0 will be shown. Similarly I want to plot another pie chart where percentage of 1,0 and blank/N.A all will be there.
customer_id  house_electricity  house_refrigerator
cid01        0                  0
cid02        1                  na
cid03                           1
cid04        1
cid05        na                 0
# I wrote the following, but it didn't give me my expected result
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("my_file.csv")
df_col=df.columns
df["house_electricity"].plot(kind="pie")
For a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,0,np.nan,1,1,1,'',0,0,np.nan]})
df
a
0 1
1 0
2 NaN
3 1
4 1
5 1
6
7 0
8 0
9 NaN
The code below gives a pie chart with one slice per distinct value, counting NaN and the empty string separately:
df["a"].value_counts(dropna=False).plot(kind="pie")
If you want to combine NA and empty values, replace the empty values with np.nan before plotting:
df["a"].replace("", np.nan).value_counts(dropna=False).plot(kind="pie")
To get a pie chart with three slices (0, 1 and NaN), you can try this code:
import pandas as pd
import matplotlib.pyplot as plt
data = {'customer_id': ['cid01', 'cid02', 'cid03', 'cid04', 'cid05'],
'house_electricity': [0, 1, None, 1, None],
'house_refrigerator': [0, None, 1, None, 0]}
df = pd.DataFrame(data)
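counts = df['house_electricity'].value_counts(dropna=False)
# value_counts sorts by frequency, so build the labels from the index instead of hard-coding their order
labels = ['NaN' if pd.isna(v) else str(int(v)) for v in counts.index]
counts.plot.pie(autopct='%1.1f%%', labels=labels, shadow=True)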
plt.title('Percentage distribution of house_electricity column')
plt.axis('equal')
plt.show()
Result:
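The question asks for two charts, one with only 1 and 0 and one that also includes blank/NA. A minimal sketch of the two sets of percentages follows, assuming the column looks like the sample data in the answer above; `only_01` and `with_na` are illustrative names.

```python
import pandas as pd

# Sample values assumed from the answer above, not from the real file.
col = pd.Series([0, 1, None, 1, None], name='house_electricity')

# Percentages over the non-missing values only (dropna=True is the default).
only_01 = col.value_counts(normalize=True).mul(100)

# Percentages over all rows, with missing values as their own category.
with_na = col.value_counts(normalize=True, dropna=False).mul(100)
with_na.index = ['NaN' if pd.isna(v) else str(int(v)) for v in with_na.index]

print(only_01)
print(with_na)
```

Either Series could then be passed to `.plot.pie(autopct='%1.1f%%')` to draw the corresponding chart.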

Plot groupby percentage dataframe

I didn't find a complete answer to what I want to do:
I have a dataframe. I want to group by user and their answers to a survey, sum all of their good answers/total of their answers, display it in % and plot the result.
I have an answer column which contains : 1,0 or -1. I want to filter it in order to exclude -1.
Here is what i did so far :
df_sample.groupby('user').filter(lambda x : x['answer'].mean() >-1)
or :
a = df_sample.loc[df_sample['answer']!=-1,['user','answer']]
b = a.groupby(['user','answer']).agg({'answer' : 'sum'})
As you can see, it's incomplete. Thank you for any suggestions you may have.
Let's try with some sample data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
df.head():
user answer
0 D 1
1 C 0
2 D -1
3 B 1
4 C 1
Option 1
Filter out the -1 values and use named aggregation to get the "good answers" and "total answers":
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df:
good_answer total_answer
user
A 9 15
B 11 20
C 15 19
D 7 14
Use division and multiplication to get percentage:
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
plot_df:
good_answer total_answer pct
user
A 9 15 60.000000
B 11 20 55.000000
C 15 19 78.947368
D 7 14 50.000000
Then this can be plotted with DataFrame.plot:
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
Option 2
If just the percentage is needed groupby mean can be used to get to the resulting plot directly after filtering out the -1s:
plot_df = df[df['answer'].ne(-1)].groupby('user')['answer'].mean().mul(100)
ax = plot_df.plot(
kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
plot_df:
answer
user
A 60.000000
B 55.000000
C 78.947368
D 50.000000
Both options Produce:
All together:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f%%')
plt.show()
Here is a sample solution assuming you want to calculate percentage based on filtered dataframe.
import pandas as pd
import numpy as np
df_sample = pd.DataFrame(np.random.randint(-1,2,size=(10, 1)), columns=['answer'])
df_sample['user'] = [i for i in 'a b c d e f a b c d'.split(' ')]
df_filtered = df_sample[df_sample.answer>-1]
print(df_filtered.groupby('user').agg({'answer' : lambda x: x.sum()/len(df_filtered)*100}))
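Note that dividing by `len(df_filtered)` gives each user's share of all filtered rows, which is not the same as each user's own rate of good answers. A small sketch with made-up, deterministic data shows the difference; `share_of_all` and `per_user` are illustrative names.

```python
import pandas as pd

# Hypothetical sample data, chosen so the results are easy to check by hand.
df_sample = pd.DataFrame({
    'user': list('aabbbc'),
    'answer': [1, 0, 1, 1, -1, 0],
})

# Drop the -1 answers first.
df_filtered = df_sample[df_sample['answer'] > -1]

# Good answers per user as a share of ALL filtered rows (the approach above).
share_of_all = df_filtered.groupby('user')['answer'].sum() / len(df_filtered) * 100

# Good answers per user as a share of THAT user's filtered rows.
per_user = df_filtered.groupby('user')['answer'].mean() * 100

print(share_of_all)
print(per_user)
```

Which of the two is wanted depends on how "percentage" is meant in the question; the first sums to the overall rate of good answers, the second is a per-user rate.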

Plot made of array from a pandas dataset in Python

The problem is: I have a SQLAlchemy database called NumFav with arrays of favourite numbers of some people, which uses such a structure:
id name numbers
0 Vladislav [2, 3, 5]
1 Michael [4, 6, 7, 9]
numbers is postgresql.ARRAY(Integer)
I want to make a plot which demonstrates id of people on X and numbers dots on Y in order to show which numbers have been chosen like this:
I extract data using
df = pd.read_sql(Session.query(NumFav).statement, engine)
How can I create a plot with such data?
You can explode the number lists into "long form":
df = df.explode('numbers')
df['color'] = df.id.map({0: 'red', 1: 'blue'})
# id name numbers color
# 0 Vladislav 2 red
# 0 Vladislav 3 red
# 0 Vladislav 5 red
# 1 Michael 4 blue
# 1 Michael 6 blue
# 1 Michael 7 blue
# 1 Michael 9 blue
Then you can directly plot.scatter:
df.plot.scatter(x='name', y='numbers', c='color')
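One detail that may matter here: after `explode`, the `numbers` column has object dtype, so casting it to an integer type before plotting is probably safer. A minimal sketch, with a small hand-built frame standing in for the SQL query result (an assumption, since the real table is not available):

```python
import pandas as pd

# Stand-in for the result of pd.read_sql(...); values taken from the question.
df = pd.DataFrame({
    'id': [0, 1],
    'name': ['Vladislav', 'Michael'],
    'numbers': [[2, 3, 5], [4, 6, 7, 9]],
})

# One row per (person, number) pair; the other columns are repeated.
long_df = df.explode('numbers')

# explode leaves the column as object dtype, so convert it to integers.
long_df['numbers'] = long_df['numbers'].astype(int)

print(long_df)
print(long_df.dtypes)
```

With a numeric `numbers` column, the frame can be passed to `plot.scatter` with `x='id'` if the ids rather than the names should be on the x axis.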
Like this:
import matplotlib.pyplot as plt
for idx, row in df.iterrows():
    plt.plot(row['numbers'])
plt.legend(df['name'])
plt.show()

Filtering outliers within each category of categorical data in pandas

I'm new to pandas/seaborn/etc and attempting to graph a subset of my data in a different style (using seaborn), using something like the example here https://seaborn.pydata.org/generated/seaborn.stripplot.html :
>>> ax = sns.stripplot(x="day", y="total_bill", hue="smoker",
... data=tips, jitter=True,
... palette="Set2", dodge=True)
My goal is to plot only the outliers within each x/hue dimension, i.e. for the example shown I'd be using 8 different percentile cutoffs for the 8 different columns of points displayed.
I have a dataframe like:
Cat RPS latency_ns
0 X 100 909423.0
1 X 100 14747385.0
2 X 1000 14425058.0
3 Y 100 7107907.0
4 Y 1000 21466101.0
... ... ... ...
And I want to filter this data, leaving only the upper 99.9th percentile outliers.
I've found I can do:
df.groupby([dim1_label, dim2_label]).quantile(0.999)
To get something like:
latency_ns
Cat RPS
X 10 RPS 6.463337e+07
100 RPS 4.400980e+07
1000 RPS 6.075070e+07
Y 100 RPS 3.958944e+07
Z 10 RPS 5.621427e+07
100 RPS 4.436208e+07
1000 RPS 6.658783e+07
But I'm not sure where to go from here with a merge/filter operation.
Here is a small example I created to guide you. I hope it is helpful.
Code
import numpy as np
import pandas as pd
import seaborn as sns
#create a sample data frame
n = 1000
prng = np.random.RandomState(123)
x = prng.uniform(low=1, high=5, size=(n,)).astype('int')
#print(x[:10])
#[3 2 1 3 3 2 4 3 2 2]
y = prng.normal(size=(n,))
#print(y[:10])
#[ 1.32327371 -0.00315484 -0.43065984 -0.14641577 1.16017595 -0.64151234
#-0.3002324 -0.63226078 -0.20431653 0.2136956 ]
z = prng.binomial(n=1,p=2/3,size=(n,))
#print(z[:10])
#[1 0 1 1 1 1 0 1 1 1]
#analogously to the smoking example, in my df x maps to day,
#y maps to total bill, and z maps to is smoker (or not)
df = pd.DataFrame(data={'x':x,'y':y,'z':z})
#df.head()
df_filtered = pd.DataFrame()
#groupby(...).quantile(0.9) returns a single cutoff value per group, which is only enough if you want to plot one point per group
#if you want to plot the values that lie between a lower and an upper bound, some
#conditional filtering is required, as in the loop below
for i,j in df.groupby([x, z]):
    b = j.quantile([0,0.9]) #use [0.999,1] in your case
    lb = b['y'].iloc[0]
    ub = b['y'].iloc[1]
    df_temp = j[(j['y']>=lb)&(j['y']<=ub)]
    df_filtered = pd.concat([df_filtered,df_temp])
#print(df_filtered.count())
#x 897
#y 897
#z 897
#dtype: int64
Output
import matplotlib.pyplot as plt
ax = sns.stripplot(x='x', y='y', hue='z',
data=df_filtered, jitter=True,
palette="Set2", dodge=True)
plt.show()
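As an alternative to the loop, the per-group cutoff can be broadcast back onto the rows with `groupby(...).transform`, which avoids the merge step the question asks about. This is only a sketch on random sample data similar to the above; `q`, `cutoff` and `outliers` are illustrative names, and `q` would be 0.999 for the latency data.

```python
import numpy as np
import pandas as pd

# Random sample data: x and z are the two categorical dimensions, y the value.
prng = np.random.RandomState(123)
n = 1000
df = pd.DataFrame({
    'x': prng.randint(1, 5, size=n),
    'y': prng.normal(size=n),
    'z': prng.binomial(n=1, p=2/3, size=n),
})

q = 0.9  # use 0.999 for the 99.9th percentile

# For every row, the q-quantile of y within that row's (x, z) group.
cutoff = df.groupby(['x', 'z'])['y'].transform(lambda s: s.quantile(q))

# Keep only the rows above their own group's cutoff.
outliers = df[df['y'] > cutoff]
print(len(outliers))
```

The `outliers` frame can then be passed as `data` to `sns.stripplot` in the same way as `df_filtered`.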
