Add density curve on the histogram - python

I can draw a histogram in Python, but I am unable to add a density curve to it. I have seen many snippets that add a density curve to a histogram in different ways, but I am not sure how to apply them to my code.
I have set density=True, but no density curve appears on the histogram.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
hist, bins = np.histogram(X, bins=10, density=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width)
plt.show()

Here is an approach using seaborn's distplot function, which was also mentioned in the comments:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.distplot(X, kde=True, bins=20, hist=True)
plt.show()
However, distplot is deprecated and will be removed in a future version of seaborn. The alternatives are histplot and displot.
sns.histplot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.histplot(X, kde=True, bins=20)
plt.show()
sns.displot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.displot(X, kde=True, bins=20)
plt.show()

Pandas also has a kde plot, which can be drawn on top of the bars:
hist, bins = np.histogram(X, bins=10,density=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width, zorder=1)
# density plot
df['A'].plot.kde(zorder=2, color='C1')
plt.show()
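For context on why density=True alone does not draw a curve: it only rescales the bar heights so that the histogram integrates to 1, and the curve has to be estimated separately. A minimal sketch of that idea, assuming SciPy is installed and using scipy.stats.gaussian_kde (the names xs, kde and area are illustrative and not from the original code):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.standard_normal(100)

# density=True normalizes the bar heights so that the total bar area is 1
hist, bins = np.histogram(X, bins=10, density=True)
area = np.sum(hist * np.diff(bins))

# the curve is a separate estimate, evaluated on a grid over the data range
kde = gaussian_kde(X)
xs = np.linspace(bins[0], bins[-1], 200)

fig, ax = plt.subplots()
ax.bar(bins[:-1], hist, width=np.diff(bins), align='edge', alpha=0.5)
ax.plot(xs, kde(xs), color='C1')
```

Because both the bars and the curve are on the density scale, they can share one y-axis without any further rescaling.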

Related

Lower triangle mask with seaborn clustermap

How can I mask the lower triangle while hierarchical clustering with seaborn's clustermap?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#pearson coefficients
corr = np.corrcoef(np.random.randn(10, 200))
#lower triangle
mask = np.tril(np.ones_like(corr))
fig, ax = plt.subplots(figsize=(6,6))
#heatmap works as expected
sns.heatmap(corr, cmap="Blues", mask=mask, cbar=False)
#clustermap not so much
sns.clustermap(corr, cmap="Blues", mask=mask, figsize=(6,6))
plt.show()
Well, the clustermap clusters the values according to similarity. This changes the order of the rows and the columns.
You could create a regular clustermap, and in a second step apply the mask:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
corr = np.corrcoef(np.random.randn(10, 200))
g = sns.clustermap(corr, cmap="Blues", figsize=(6, 6))
mask = np.tril(np.ones_like(corr))
values = g.ax_heatmap.collections[0].get_array().reshape(corr.shape)
new_values = np.ma.array(values, mask=mask)
g.ax_heatmap.collections[0].set_array(new_values)
plt.show()
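To see the reordering that makes the original mask miss its target, the new row and column order can be read from the returned grid. This is only a sketch with random data; it assumes the dendrogram_row and dendrogram_col attributes of the grid, and the names row_order, col_order and reordered are made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import seaborn as sns

rng = np.random.default_rng(0)
corr = np.corrcoef(rng.standard_normal((10, 200)))
g = sns.clustermap(corr, cmap="Blues", figsize=(6, 6))

# original row/column indices, listed in the order in which they are drawn
row_order = g.dendrogram_row.reordered_ind
col_order = g.dendrogram_col.reordered_ind

# the heatmap shows the matrix in this new order, not in the original one
reordered = corr[np.ix_(row_order, col_order)]
print(row_order)
```

A lower-triangle mask built for the original order is permuted in the same way, which is why it ends up scattered over the clustered heatmap.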

How to plot histogram and distribution from frequency table?

I have a frequency table
[image: frequency table]
and I have been trying to plot this data as something like this:
[image: histogram with distribution curve]
So I tried this:
to_plot = compare_df[['counts', 'theoritical counts']]
bins=[0,2500,5000,7500,10000,12500,15000,17500,20000]
sns.displot(to_plot,bins=bins)
but it turned out like this:
[image: plot]
Any idea what I did wrong? Please help.
First off, note that you lose important information when creating a kde plot only from frequencies.
sns.histplot() has a parameter weights= which can handle the frequencies. I didn't see a way to get this to work using a long dataframe and hue, but you can call histplot separately for each column. Here is an example starting from generated data:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set()
bins = np.array([0, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000])
df = pd.DataFrame({'counts': np.random.randint(2, 30, 8),
                   'theoretical counts': np.random.randint(2, 30, 8)},
                  index=pd.interval_range(0, 20000, freq=2500))
df['theoretical counts'] = (3 * df['counts'] + df['theoretical counts']) // 4
fig, ax = plt.subplots()
for column, color in zip(['counts', 'theoretical counts'], ['cornflowerblue', 'crimson']):
    sns.histplot(x=(bins[:-1] + bins[1:]) / 2, weights=df[column], bins=8, binrange=(0, 20000),
                 kde=True, kde_kws={'cut': .3},
                 color=color, alpha=0.5, label=column, ax=ax)
ax.legend()
ax.set_xticks(range(0, 20001, 2500))
plt.show()
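The weights= idea can be checked without any plotting: placing one value at each bin center and weighting it by that bin's frequency reproduces the frequency table. A small sketch with made-up counts, using np.histogram in place of sns.histplot (so this swaps in NumPy for illustration only):

```python
import numpy as np

bins = np.array([0, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000])
counts = np.array([4, 9, 15, 22, 18, 11, 6, 2])  # made-up frequencies, one per bin

# one representative x-value per bin: the bin center
centers = (bins[:-1] + bins[1:]) / 2

# weighting each center by its frequency gives back the table as a histogram
hist, edges = np.histogram(centers, bins=bins, weights=counts)
print(hist)
```

This also shows what is lost: within each bin only a single x-value remains, so any kde drawn from it is an approximation.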
With widely varying bin widths, there isn't enough information for a suitable kde curve. Also, a bar plot seems more appropriate than a histogram. Here is an example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set()
bins = [0, 250, 500, 1000, 1500, 2500, 5000, 10000, 50000, np.inf]
bin_labels = [f'{b0}-{b1}' for b0, b1 in zip(bins[:-1], bins[1:])]
df = pd.DataFrame({'counts': np.random.randint(2, 30, 9),
                   'theoretical counts': np.random.randint(2, 30, 9)})
df['theoretical counts'] = (3 * df['counts'] + df['theoretical counts']) // 4
fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=df.melt(), x=np.tile(bin_labels, 2), y='value',
            hue='variable', palette=['cornflowerblue', 'crimson'], ax=ax)
plt.tight_layout()
plt.show()
sns.barplot() has some options, for example dodge=False, alpha=0.5 to draw the bars at the same spot.
A couple of things:
When you provide a DataFrame to sns.displot, you also need to specify which column to use for the distribution via the x kwarg.
This leads into the second issue: I don't know of a way to get multiple distributions using sns.displot, but you can use sns.histplot in approximately this way:
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset('titanic')
ax = sns.histplot(data=titanic, x='age', bins=30, color='r', alpha=.25,
                  label='age')
sns.histplot(data=titanic, x='fare', ax=ax, bins=30, color='b', alpha=.25,
             label='fare')
ax.legend()
plt.show()
Please note that I just used an example dataset to get you a rough image as quickly as possible.

How to draw multiple lines with Seaborn?

I am trying to draw a plot with two lines, each with a different color and a different label. This is what I have come up with.
This is the code that I have written.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data1 = pd.read_csv("/content/drive/MyDrive/Summer-2020/URMC/training_x_total_data_ones.csv", header=None)
data2 = pd.read_csv("/content/drive/MyDrive/Summer-2020/URMC/training_x_total_data_zeroes.csv", header=None)
sns.lineplot(data=data1, color="red")
sns.lineplot(data=data2)
What am I doing wrong?
Edit
This is what my dataset looks like
So, I just added another color in the second line and that seemed to work.
import numpy as np
import seaborn as sns
mu, sigma = 0, 0.1
s = np.random.normal(mu, sigma, 100)
mu1, sigma1 = 0.5, 1
t = np.random.normal(mu1, sigma1, 100)
sns.lineplot(data=s, color="red")
sns.lineplot(data=t, color="blue")
Try specifying x and y in the call to sns.lineplot:
import pandas as pd
import numpy as np
import seaborn as sns
x = np.arange(10)
df1 = pd.DataFrame({'x': x,
                    'y': np.sin(x)})
df2 = pd.DataFrame({'x': x,
                    'y': x**2})
sns.lineplot(data=df1, x='x', y='y', color="red")
sns.lineplot(data=df2, x='x', y='y')
Without doing so, I get a plot similar to yours.

How to color swarmplot dots depending on quartile?

I would like to create a plot where dots are overlaid depending on whether or not they are within the 1st-3rd quartiles in seaborn. What function to use?
Something similar to the figure:
The following code creates a Seaborn swarmplot and then recolors the dots depending on their quartile. Looping through the collections created by the swarmplot, the y-data are retrieved. np.percentile calculates the borders of the quartiles and np.digitize calculates the corresponding quartiles. These quartiles can be used to define the color.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
# cmap = plt.get_cmap('tab10')
cmap = ListedColormap(['gold', 'crimson', 'teal', 'orange'])
ax = sns.swarmplot(x="day", y="total_bill", data=tips)
for col in ax.collections:
    y = col.get_offsets()[:, 1]
    perc = np.percentile(y, [25, 50, 75])
    col.set_cmap(cmap)
    col.set_array(np.digitize(y, perc))
plt.show()
The same approach can be used for a stripplot (optionally without jitter) to create a plot similar to the one in the question.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
N = 200
x = np.repeat(list('abcdefg'), N)
y = np.random.normal(np.repeat(np.random.uniform(11, 15, 7), N), 1)
cmap = ListedColormap(['grey', 'turquoise', 'grey'])
ax = sns.stripplot(x=x, y=y, jitter=False, alpha=0.2)
for col in ax.collections:
    y = col.get_offsets()[:, 1]
    perc = np.percentile(y, [25, 75])
    col.set_cmap(cmap)
    col.set_array(np.digitize(y, perc))
plt.show()
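The quartile binning used in both snippets can be checked in isolation on a small array. This sketch uses made-up values and the illustrative name quartile:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# borders between the four quartiles
perc = np.percentile(y, [25, 50, 75])

# index of the quartile each value falls into: 0, 1, 2 or 3
quartile = np.digitize(y, perc)

print(perc.tolist())      # [2.75, 4.5, 6.25]
print(quartile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

These integer indices are what set_array passes to the colormap, so a colormap with four colors covers the four quartiles. With only the borders [25, 75], the indices are 0, 1 and 2, which matches the three-color colormap of the stripplot example.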

How do I format my y-axis to show a comma separator in my Seaborn graph? [duplicate]

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 6)
g = sns.scatterplot(ax=ax, x="Area", y="Rent/Sqft", hue="Region", marker='o', data=df, s=100, palette= palette)
g.legend(bbox_to_anchor=(1, 1), ncol=1)
g.set(xlim = (50000,250000))
How can I change the axis format from a plain number to a custom format? For example, 125000 to 125.00K.
IIUC you can format the xticks and set these:
In[60]:
# generate some pseudo data
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': [50000, 75000, 100000, 125000], 'Rent/Sqft': np.random.randn(4), 'Region': list('abcd')})
df
Out[60]:
      num  Rent/Sqft Region
0   50000   0.109196      a
1   75000   0.566553      b
2  100000  -0.274064      c
3  125000  -0.636492      d
In[61]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 4)
g = sns.scatterplot(ax=ax, x="num", y="Rent/Sqft", hue="Region", marker='o', data=df, s=100, palette= palette)
g.legend(bbox_to_anchor=(1, 1), ncol=1)
g.set(xlim = (50000,250000))
xlabels = ['{:,.2f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
The key bit here is this line:
xlabels = ['{:,.2f}'.format(x) + 'K' for x in g.get_xticks()/1000]
g.set_xticklabels(xlabels)
This divides all the tick values by 1000, formats them, and sets the results as the xtick labels.
UPDATE
Thanks to @ScottBoston, who has suggested a better method:
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.2f}'.format(x/1000) + 'K'))
see the docs
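One reason a formatter can be preferable to fixed label strings is that it is applied every time the ticks are recomputed, so the labels follow later changes to the axis limits. A small sketch of that behaviour; the helper name thousands and the plotted values are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def thousands(x, pos):
    """Format a tick value in thousands, e.g. 125000 -> '125.00K'."""
    return '{:,.2f}'.format(x / 1000) + 'K'

fig, ax = plt.subplots()
ax.plot([50000, 250000], [0, 1])
formatter = ticker.FuncFormatter(thousands)
ax.xaxis.set_major_formatter(formatter)

# changing the limits afterwards still yields correctly formatted labels
ax.set_xlim(1_000_000, 2_000_000)
fig.canvas.draw()
labels = [t.get_text() for t in ax.get_xticklabels()]
print(labels)
```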
The canonical way of formatting the tick labels in the standard units is to use an EngFormatter. There is also an example in the matplotlib docs.
Also see Tick locating and formatting
Here it might look as follows.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import pandas as pd
df = pd.DataFrame({"xaxs": np.random.randint(50000, 250000, size=20),
                   "yaxs": np.random.randint(7, 15, size=20),
                   "col": np.random.choice(list("ABC"), size=20)})
fig, ax = plt.subplots(figsize=(8, 5))
palette = sns.color_palette("bright", 6)
sns.scatterplot(ax=ax, x="xaxs", y="yaxs", hue="col", data=df,
                marker='o', s=100, palette="magma")
ax.legend(bbox_to_anchor=(1, 1), ncol=1)
ax.set(xlim = (50000,250000))
ax.xaxis.set_major_formatter(ticker.EngFormatter())
plt.show()
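An EngFormatter instance can also be called directly on a value, which is a quick way to preview the labels it would produce. The sketch below assumes the places and sep parameters of EngFormatter to get closer to the two-decimal format asked for in the question; note that the prefix for thousands is a lowercase k:

```python
import matplotlib.ticker as ticker

default_fmt = ticker.EngFormatter()
# two decimals and no space between the number and the prefix
two_places = ticker.EngFormatter(places=2, sep='')

for value in (50000, 125000, 250000):
    print(default_fmt(value), '|', two_places(value))
```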
Using Seaborn without importing matplotlib:
import seaborn as sns
sns.set()
chart = sns.relplot(x="x_val", y="y_val", kind="line", data=my_data)
ticks = chart.axes[0][0].get_xticks()
xlabels = ['$' + '{:,.0f}'.format(x) for x in ticks]
chart.set_xticklabels(xlabels)
chart.fig
Thank you to EdChum's answer above for getting me 90% there.
Here's how I'm solving this: (similar to ScottBoston)
from matplotlib.ticker import FuncFormatter
f = lambda x, pos: f'{x/10**3:,.0f}K'
ax.xaxis.set_major_formatter(FuncFormatter(f))
We could also use the APIs ax.get_xticklabels(), get_text() and ax.set_xticklabels() to do it.
e.g.,
xlabels = ['{:.2f}k'.format(float(x.get_text().replace('−', '-')) / 1000) for x in g.get_xticklabels()]
g.set_xticklabels(xlabels)
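One caveat with reading labels back through get_text(): depending on the matplotlib version, the label texts may still be empty until the figure has been drawn. A hedged sketch that draws first and pins the tick positions before replacing the labels; the plotted values are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([-100000, 250000], [0, 1])

ticks = ax.get_xticks()
ax.set_xticks(ticks)   # pin tick positions so the fixed labels below stay aligned
fig.canvas.draw()      # make sure the tick label texts are populated

texts = [t.get_text() for t in ax.get_xticklabels()]
# matplotlib renders negative numbers with a Unicode minus, which float() rejects
values = [float(s.replace('−', '-')) for s in texts]
ax.set_xticklabels(['{:.2f}k'.format(v / 1000) for v in values])
```

Labels set this way are fixed strings, so they do not update if the axis limits change afterwards.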
