Histogram color by class - python

I have a pandas dataframe with 2 columns "height" and "class, class is a column with 3 values 1,2 and 5.
Now i want to make a histogram of the height data and color by class. plot19_s["vegetation height"].plot.hist(bins = 10)
this is my histogram
but now I want to see the different classes by a change in color in the histogram.

Since I'm not sure if the potential duplicate actually answers the question here, this is a way to produce a stacked histogram using numpy.histogram and matplotlib bar plot.
import pandas as pd
import numpy as np;np.random.seed(1)
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.exponential(size=100),
"class" : np.random.choice([1,2,5],100)})
_, edges = np.histogram(df["x"], bins=10)
histdata = []; labels=[]
for n, group in df.groupby("class"):
histdata.append(np.histogram(group["x"], bins=edges)[0])
labels.append(n)
hist = np.array(histdata)
histcum = np.cumsum(hist,axis=0)
plt.bar(edges[:-1],hist[0,:], width=np.diff(edges)[0],
label=labels[0], align="edge")
for i in range(1,len(hist)):
plt.bar(edges[:-1],hist[i,:], width=np.diff(edges)[0],
bottom=histcum[i-1,:],label=labels[i], align="edge")
plt.legend(title="class")
plt.show()

Related

How to draw histogram + QQ plots together for each column?

I have a dataset with lots of numerical columns. I want to draw histogram for each column but also add extra QQ plot just to check more thoroughly if data follow normal distribution. So I would like to have histogram and QQ plot under histogram for each column. Something like that:
I tried to do this using following code but both plots overlap each other:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
num_cols = df.select_dtypes(include=np.number)
cols = num_cols.columns.tolist()
df_sample = df.sample(n=5000)
fig, axes = plt.subplots(4, 5, figsize=(15,12), layout = 'constrained')
for col, axs in zip(cols, axes.flat):
sns.histplot(data = df_sample[col], kde = True, stat = 'density', ax = axs, alpha = .4)
sm.qqplot(df_sample[col], line='45', ax = axs)
plt.show()
How can I generate hist and QQ plots one under another for each column?
Another issue is that my QQ plots look strange, I'm wondering if I need to standarize all my columns before making QQ plot.

Convert a histogram plot from a Pandas dataframe to a scatter plot

I have a histogram:
# Lets load a dataset of house prices in Boston.
from sklearn.datasets import load_diabetes
#sklearn gives you the data as a dictionary, so
diabetes = load_diabetes(as_frame=True)
data = diabetes['frame']
import matplotlib.pyplot as plt
%matplotlib inline
bmi_hist = plt.hist(data['bmi'], density=False)
bmi_hist = plt.ylabel("Frequency")
bmi_hist = plt.xlabel("Normalized BMI")
bp_hist = plt.hist(data['bp'], density=False)
bp_hist = plt.ylabel("Frequency")
bp_hist = plt.xlabel("Normalized BP")
This is a histogram for two of the columns in the frame above.
I want to compare these two in a scatter graph. My attempts haven't been quite successful as I know I need an X and a Y to plot.
I thought I would use the same axis as the histogram:
y_bmi = data['bmi'].value_counts() # frequency
x_bmi = data['bmi'] # normalized value
ax1 = df.plot.scatter(x = x_bmi, y= y_bmi, c='DarkBlue')
But this can only be used on the 'dataframe' so do I have to repeat the values of bmi column into a new dataframe? or is there a simpler method?
Any help would be greatly appreciated.
Many Thanks.

can you highlight specific observations in categorical scatter plot in seaborn?

I have 8 categories and I already plotted categorical scatter plot with sns.catplot. Is there a way to highlight a specific observation(s) in each category to compare the positions with respect to other observations?
You could use text annotations, using the annotate method on the ax (matplotlib.axes.Axes) attribute of the FaceGrid object returned by seaborn.catplot. For example, the code below annotates the observations that are greater than .5 on a normal sample:
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.DataFrame(data={'x': range(10), 'y':np.random.normal(0,1,size=10)})
df['odd'] = df.x.apply(lambda x: x % 2)
g = sns.catplot(data=df, x='x', y='y', hue='odd')
df[df.y > .5].apply(lambda p: g.ax.annotate(f'({p.x}, {round(p.y, 2)})', (p.x, p.y)), axis=1)
You can see more details on the annotate method here.

Plotting data with categorical x and y axes in python

I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dotplot/scatterplot in the way that both of the x and y axis to be categorical and presence/absence to be coded by different shapes. Something like following:
Patient| x x -
Control| - x -
__________________
GeneA GeneB GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to #DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
df_gene = df[[gene, 'level']]
cList = ['blue' if x == True else 'red' for x in df[gene]]
for inr_idx, lv in enumerate(df['level']):
ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
look at https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work. Someting like df.astype(int)

Plot a pandas dataframe with vertical lines

I want to plot a dataframe where each data point is not represented as a point but a vertical line from the zero axis like :
df['A'].plot(style='xxx')
where xxx is the style I need.
Also ideally i would like to be able to color each bar based on the values in another column in my dataframe.
I precise that my x axis values are numbers and are not equally spaced.
The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.ranf(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.

Categories

Resources