How to plot histogram and distribution from frequency table?

I have a frequency table:
[image: frequency table]
and I have been trying to plot this data as something like this:
[image: histogram with distribution curve]
So I tried this:
to_plot = compare_df[['counts', 'theoritical counts']]
bins=[0,2500,5000,7500,10000,12500,15000,17500,20000]
sns.displot(to_plot,bins=bins)
but it turned out like this:
[image: plot]
Any idea what I did wrong? Please help.

First off, note that you lose important information when creating a kde plot only from frequencies.
sns.histplot() has a parameter weights= which can handle the frequencies. I didn't see a way to get this to work using a long dataframe and hue, but you can call histplot separately for each column. Here is an example starting from generated data:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set()
bins = np.array([0, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000])
# generated example frequencies, one row per bin
df = pd.DataFrame({'counts': np.random.randint(2, 30, 8),
                   'theoretical counts': np.random.randint(2, 30, 8)},
                  index=pd.interval_range(0, 20000, freq=2500))
df['theoretical counts'] = (3 * df['counts'] + df['theoretical counts']) // 4
fig, ax = plt.subplots()
for column, color in zip(['counts', 'theoretical counts'], ['cornflowerblue', 'crimson']):
    # use the bin midpoints as x values and the frequencies as weights
    sns.histplot(x=(bins[:-1] + bins[1:]) / 2, weights=df[column], bins=8, binrange=(0, 20000),
                 kde=True, kde_kws={'cut': .3},
                 color=color, alpha=0.5, label=column, ax=ax)
ax.legend()
ax.set_xticks(range(0, 20001, 2500))
plt.show()
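To see why passing the bin midpoints together with weights= reproduces the frequency table, here is a minimal NumPy-only sketch. The counts are made up for illustration; they are not the asker's data.

```python
import numpy as np

# Hypothetical frequency table: 8 equal-width bins from 0 to 20000.
edges = np.arange(0, 20001, 2500)
counts = np.array([5, 12, 20, 17, 9, 6, 3, 1])

# One representative x value per bin: the midpoint.
midpoints = (edges[:-1] + edges[1:]) / 2

# Weighting each midpoint by its count gives back the original table.
hist_weighted, _ = np.histogram(midpoints, bins=edges, weights=counts)

# Equivalent view: repeat each midpoint `count` times and histogram that.
samples = np.repeat(midpoints, counts)
hist_repeated, _ = np.histogram(samples, bins=edges)

print(hist_weighted)
print(hist_repeated)
```

Both approaches place every observation at the middle of its bin, which is exactly the information a kde loses when it only sees frequencies.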
With widely varying bin widths, there isn't enough information for a suitable kde curve. Also, a bar plot seems more appropriate than a histogram. Here is an example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set()
bins = [0, 250, 500, 1000, 1500, 2500, 5000, 10000, 50000, np.inf]
bin_labels = [f'{b0}-{b1}' for b0, b1 in zip(bins[:-1], bins[1:])]
df = pd.DataFrame({'counts': np.random.randint(2, 30, 9),
                   'theoretical counts': np.random.randint(2, 30, 9)})
df['theoretical counts'] = (3 * df['counts'] + df['theoretical counts']) // 4
fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=df.melt(), x=np.tile(bin_labels, 2), y='value',
            hue='variable', palette=['cornflowerblue', 'crimson'], ax=ax)
plt.tight_layout()
plt.show()
sns.barplot() has some options, for example dodge=False, alpha=0.5 to draw the bars at the same spot.
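If a true histogram is still wanted with unequal bin widths, one common convention is to draw densities (count divided by total count and bin width) instead of raw counts, so that bar areas stay comparable. The sketch below assumes finite bin edges and made-up counts; it is an illustration, not part of the original answer.

```python
import numpy as np

# Hypothetical unequal-width bins (finite edges only) and made-up counts.
edges = np.array([0, 250, 500, 1000, 1500, 2500, 5000, 10000, 50000])
counts = np.array([4, 9, 15, 12, 20, 18, 7, 3])

widths = np.diff(edges)

# Density per bin: fraction of observations divided by the bin width.
density = counts / (counts.sum() * widths)

# The bar areas (height * width) then sum to 1.
area = np.sum(density * widths)
print(density)
print(area)
```

These heights could be drawn with `ax.bar(edges[:-1], density, width=widths, align='edge')`.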

A couple of things:
When you pass a DataFrame to sns.displot, you also need to specify which column to use for the distribution via the x keyword argument.
This leads to the second issue: I don't know of a way to get multiple distributions using sns.displot, but you can use sns.histplot in approximately this way:
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset('titanic')
ax = sns.histplot(data=titanic, x='age', bins=30, color='r', alpha=.25,
                  label='age')
sns.histplot(data=titanic, x='fare', ax=ax, bins=30, color='b', alpha=.25,
             label='fare')
ax.legend()
plt.show()
Please note that I just used an example dataset to get you a rough image as quickly as possible.
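Another option that may be worth trying is to reshape a wide table into long form, so that a single plotting call can separate the series by a grouping column. The sketch below uses made-up columns named 'age' and 'fare' (assumptions for illustration) and only shows the reshaping step with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-form table: one column per series, one row per observation.
rng = np.random.default_rng(0)
wide = pd.DataFrame({'age': rng.normal(30, 10, 50),
                     'fare': rng.normal(35, 15, 50)})

# melt() stacks the columns: 'variable' holds the old column name,
# 'value' holds the number that was in that column.
long_df = wide.melt(var_name='variable', value_name='value')

print(long_df.head())
```

The long frame could then be passed on as `sns.histplot(data=long_df, x='value', hue='variable', bins=30)`, which draws both distributions on shared bins.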

Related

Add density curve on the histogram

I am able to make a histogram in Python, but I am unable to add a density curve. I have seen many snippets that add a density curve to a histogram in different ways, but I am not sure how to apply them to my code.
I have added density=True, but I still do not get a density curve on the histogram.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
hist, bins = np.histogram(X, bins=10, density=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width)
plt.show()

Here is an approach using seaborn's distplot function, which was also mentioned in the comments:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.distplot(X, kde=True, bins=20, hist=True)
plt.show()
However, distplot is deprecated and will be removed in a future version of seaborn, so the alternatives are histplot and displot.
sns.histplot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.histplot(X, kde=True, bins=20)
plt.show()
sns.displot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X = df['A']
sns.displot(X, kde=True, bins=20)
plt.show()
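As background on why density=True alone does not produce a curve: it only rescales the bar heights so the bars integrate to 1, while the curve is a separate kernel density estimate. Below is a minimal hand-rolled Gaussian KDE in NumPy, assuming Scott's rule for the bandwidth. It is a sketch of the idea, not the exact routine that seaborn or pandas run internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)  # made-up sample

# density=True only rescales the bar heights so that the bars integrate to 1.
hist, edges = np.histogram(x, bins=10, density=True)
bar_area = np.sum(hist * np.diff(edges))

# Gaussian KDE: average of one normal bump per data point.
# Bandwidth from Scott's rule (an assumption): std * n**(-1/5).
bandwidth = x.std(ddof=1) * len(x) ** (-1 / 5)
grid = np.linspace(x.min() - 3 * bandwidth, x.max() + 3 * bandwidth, 400)
z = (grid[:, None] - x[None, :]) / bandwidth
kde = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * bandwidth * np.sqrt(2 * np.pi))

# Trapezoidal rule: the curve also integrates to roughly 1.
curve_area = np.sum((kde[1:] + kde[:-1]) / 2 * np.diff(grid))
print(bar_area, curve_area)
```

Plotting `grid` against `kde` on top of the density-scaled bars gives the familiar histogram-plus-curve picture.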

Pandas also has a KDE plot (it requires SciPy). Continuing with df and X from the question:
hist, bins = np.histogram(X, bins=10,density=True)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.bar(center, hist, align='center', width=width, zorder=1)
# density plot
df['A'].plot.kde(zorder=2, color='C1')
plt.show()

How to color swarmplot dots depending on quartile?

I would like to create a plot where dots are overlaid depending on whether or not they are within the 1st-3rd quartiles in seaborn. What function to use?
Something similar to the figure:

The following code creates a Seaborn swarmplot and then recolors the dots depending on their quartile. Looping through the collections created by the swarmplot, the y-data are retrieved. np.percentile calculates the borders of the quartiles and np.digitize calculates the corresponding quartiles. These quartiles can be used to define the color.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
# cmap = plt.get_cmap('tab10')
cmap = ListedColormap(['gold', 'crimson', 'teal', 'orange'])
ax = sns.swarmplot(x="day", y="total_bill", data=tips)
for col in ax.collections:
    y = col.get_offsets()[:, 1]
    perc = np.percentile(y, [25, 50, 75])
    col.set_cmap(cmap)
    col.set_array(np.digitize(y, perc))
plt.show()
The same approach can be used for a stripplot (optionally without jitter) to create a plot similar to the one in the question.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
N = 200
x = np.repeat(list('abcdefg'), N)
y = np.random.normal(np.repeat(np.random.uniform(11, 15, 7), N), 1)
cmap = ListedColormap(['grey', 'turquoise', 'grey'])
ax = sns.stripplot(x=x, y=y, jitter=False, alpha=0.2)
for col in ax.collections:
    y = col.get_offsets()[:, 1]
    perc = np.percentile(y, [25, 75])
    col.set_cmap(cmap)
    col.set_array(np.digitize(y, perc))
plt.show()
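The quartile assignment at the heart of both snippets can be checked in isolation. This small sketch uses made-up values to show what np.percentile and np.digitize return.

```python
import numpy as np

# Made-up y values standing in for one column of dots.
y = np.array([3.0, 7.0, 1.0, 9.0, 5.0, 8.0, 2.0, 6.0])

# Borders between the quartiles (linear interpolation by default).
borders = np.percentile(y, [25, 50, 75])

# digitize returns, for each value, how many borders lie at or below it:
# 0 = first quartile, 1 = second, 2 = third, 3 = fourth.
quartile = np.digitize(y, borders)

print(borders)   # [2.75 5.5  7.25]
print(quartile)  # [1 2 0 3 1 3 0 2]
```

Passing this integer array to set_array lets the colormap pick one color per quartile.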

Use Seaborn to plot 1D time series as a line with marginal histogram along y-axis

I'm trying to recreate the broad features of the following figure:
(from E.M. Ozbudak, M. Thattai, I. Kurtser, A.D. Grossman, and A. van Oudenaarden, Nat Genet 31, 69 (2002))
seaborn.jointplot does most of what I need, but it seemingly can't use a line plot, and there's no obvious way to hide the histogram along the x-axis. Is there a way to get jointplot to do what I need? Barring that, is there some other reasonably simple way to create this kind of plot using Seaborn?

Here is a way to create roughly the same plot as shown in the question. You can share the axes between the two subplots and make the width-ratio asymmetric.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)
x = np.linspace(0, 8, 300)
y = np.tanh(x) + np.random.randn(len(x))*0.08
fig, (ax, axhist) = plt.subplots(ncols=2, sharey=True,
                                 gridspec_kw={"width_ratios": [3, 1], "wspace": 0})
ax.plot(x, y, color="k")
ax.plot(x, np.tanh(x), color="k")
axhist.hist(y, bins=32, ec="k", fc="none", orientation="horizontal")
axhist.tick_params(axis="y", left=False)
plt.show()

It turns out that you can produce a modified jointplot with the needed characteristics by working directly with the underlying JointGrid object:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
x = np.linspace(0, 8, 300)
y = (1 - np.exp(-x*5))*.5
ynoise = y + np.random.randn(len(x))*0.08
# recent seaborn versions require x and y as keyword arguments
grid = sns.JointGrid(x=x, y=ynoise, ratio=3)
grid.plot_joint(plt.plot)
grid.ax_joint.plot(x, y, c='C0')
# distplot is deprecated; histplot with y= draws the horizontal marginal histogram
sns.histplot(y=ynoise, ax=grid.ax_marg_y)
# override a bunch of the default JointGrid style options
grid.fig.set_size_inches(10, 6)
grid.ax_marg_x.remove()
grid.ax_joint.spines['top'].set_visible(True)
plt.show()

You can use ax_marg_x.patches to affect the outcome.
Here, I use it to turn the x-axis plot white so that it cannot be seen (although the margin for it remains):
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white", color_codes=True)
x, y = np.random.multivariate_normal([2, 3], [[0.3, 0], [0, 0.5]], 1000).T
# stat_func was dropped here: it is no longer accepted by recent seaborn versions
g = sns.jointplot(x=x, y=y, kind="hex", marginal_kws={'color': 'green'})
# paint the bars of the x-axis marginal white so they are invisible
plt.setp(g.ax_marg_x.patches, color="w")
plt.show()

Convert a Histogram which has two variables plotted on it into a smooth Curve

Here is the code for generating the histogram. For the full code you can refer to this iPython Notebook
# Split the dataset into malignant and benign.
dataMalignant = datas[datas['diagnosis'] == 1]
dataBenign = datas[datas['diagnosis'] == 0]
# Plot these features as histograms
fig, axes = plt.subplots(nrows=10, ncols=1, figsize=(15, 60))
for idx, ax in enumerate(axes):
    ax.figure
    binwidth = (max(datas[features_mean[idx]]) - min(datas[features_mean[idx]]))/250
    # `normed` has been removed from recent Matplotlib versions; `density` replaces it
    ax.hist([dataMalignant[features_mean[idx]], dataBenign[features_mean[idx]]],
            bins=np.arange(min(datas[features_mean[idx]]), max(datas[features_mean[idx]]) + binwidth, binwidth),
            alpha=0.5, stacked=True, density=True, label=['M', 'B'], color=['r', 'g'])
    ax.legend(loc='upper right')
    ax.set_title(features_mean[idx])
plt.show()
How do I convert this histogram into a smooth curve with the area under the curve shaded/highlighted?

Here is a simple example that might help you:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(123)
datas = pd.DataFrame(np.random.randint(0, 2, size=(100, 1)), columns=['diagnosis'])
datas['data'] = np.random.randint(0, 100,size=(100, 1))
I used NumPy's histogram function, but you could also use ax.hist with the same arguments instead.
benign_hist=np.histogram(datas[datas['diagnosis']==0]['data'],bins=np.arange(0, 100, 10))
malignant_hist=np.histogram(datas[datas['diagnosis']==1]['data'],bins=np.arange(0, 100, 10))
fig,ax=plt.subplots(1,1)
ax.fill_between(malignant_hist[1][1:], malignant_hist[0], color='r', alpha=0.5)
ax.fill_between(benign_hist[1][1:], benign_hist[0], color='b', alpha=0.5)
In the above example, for plotting convenience, I used the 9 right-hand bin edges as x values instead of the bin midpoints.
In the OP's code one could assign hist_data = ax.hist(...).
hist_data[0] contains the histogram values and hist_data[1] contains the bin edges. To fill in the areas, use something like:
fig, ax=plt.subplots(1,1)
ax.fill_between(hist_data[1][1:],hist_data[0][0],color='g',alpha=0.5)
ax.fill_between(hist_data[1][1:],hist_data[0][1],color='r',alpha=0.5)
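If bin midpoints are preferred over the right-hand edges as x values, they can be computed from the edge array as in this short sketch. The data is made up and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(123)
data = rng.integers(0, 100, size=100)  # made-up sample values in [0, 100)

# 11 edges give 10 bins of width 10.
counts, edges = np.histogram(data, bins=np.arange(0, 101, 10))

# Midpoint of each bin: average of its left and right edge.
centers = (edges[:-1] + edges[1:]) / 2

print(centers)
print(counts)
```

`centers` has the same length as `counts`, so it can be used directly as `ax.fill_between(centers, counts, alpha=0.5)`.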

pandas histogram: plot histogram for each column as subplot of a big figure

I am using the following code to try to plot the histogram of every column of my pandas DataFrame df_in as a subplot of one big figure.
%matplotlib notebook
from itertools import combinations
import matplotlib.pyplot as plt
fig, axes = plt.subplots(len(df_in.columns) // 3, 3, figsize=(12, 48))
for x in df_in.columns:
    df_in.hist(column = x, bins = 100)
fig.tight_layout()
However, the histograms didn't show up in the subplots. Does anyone know what I missed? Thanks!

I can't comment on burhan's answer because I don't have enough reputation points. The problem with his answer is that axes isn't one-dimensional; it contains rows of three axes each, so it needs to be unrolled:
%matplotlib notebook
from itertools import combinations
import matplotlib.pyplot as plt
fig, axes = plt.subplots(len(df_in.columns)//3, 3, figsize=(12, 48))
i = 0
for triaxis in axes:
    for axis in triaxis:
        df_in.hist(column = df_in.columns[i], bins = 100, ax=axis)
        i = i+1
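The same unrolling can be written with the ravel method of the axes array, which also makes it easy to handle a column count that is not a multiple of 3. This is a sketch under assumptions: df_in is replaced by a made-up frame with 7 columns, and the Agg backend is selected only so the example runs without a display (in a notebook that line can be dropped).

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_in with 7 numeric columns.
df_in = pd.DataFrame(np.random.default_rng(0).normal(size=(200, 7)),
                     columns=list("ABCDEFG"))

ncols = 3
nrows = math.ceil(len(df_in.columns) / ncols)  # round up so every column fits
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 4 * nrows), squeeze=False)

# ravel() turns the 2D grid of axes into a flat sequence.
flat_axes = axes.ravel()
for col, axis in zip(df_in.columns, flat_axes):
    axis.hist(df_in[col], bins=100)
    axis.set_title(col)

# Hide any leftover axes in the last row.
for axis in flat_axes[len(df_in.columns):]:
    axis.set_visible(False)

fig.tight_layout()
```

Rounding the row count up, rather than using floor division, avoids silently dropping the last one or two columns.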

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fig, axis = plt.subplots(2,3,figsize=(8, 8))
df_in.hist(ax=axis)
The above will plot a 2*3 grid (6 histograms in total) for your DataFrame. Adjust the rows and columns to match the number of columns in your data.
My TA, Benjamin, once told me that with a DataFrame you usually do not have to use a for loop.

You need to specify which axis you are plotting to. This should work:
fig, axes = plt.subplots(len(df_in.columns)//3, 3, figsize=(12, 48))
for col, axis in zip(df_in.columns, axes):
    df_in.hist(column = col, bins = 100, ax=axis)
