Here is a table:
dict1 = {'left':[7,3,5,10,9],
'right':[2,17,0,8,1]}
table = pd.DataFrame(dict1)
I've created a regression scatter plot (scatterplot with best fit line):
sns.regplot(x=table['right'], y=table['left'], data=table)
I would like to add labels to the data points in the plot where the value is >= 10 in either column. I'm not sure how to do this.
You can iterate over the rows and, if either value is >= 10, add a text label at those coordinates, with an offset of +/- .5 so it doesn't land on the dot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dict1 = {'left':[7,3,5,10,9],
'right':[2,17,0,8,1]}
table = pd.DataFrame(dict1)
ax = sns.regplot(x=table['right'], y=table['left'], data=table)
for x in table.values:
    if any([n>=10 for n in x]):
        ax.text(x=x[1]+.5, y=x[0]-.5, s=','.join(map(str,reversed(x))))
Related
I have a dataset with lots of numerical columns. I want to draw a histogram for each column and also add a QQ plot to check more thoroughly whether the data follow a normal distribution. So I would like to have a histogram, with a QQ plot under it, for each column. Something like this:
I tried to do this using the following code, but the two plots overlap each other:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
num_cols = df.select_dtypes(include=np.number)
cols = num_cols.columns.tolist()
df_sample = df.sample(n=5000)
fig, axes = plt.subplots(4, 5, figsize=(15,12), layout = 'constrained')
for col, axs in zip(cols, axes.flat):
    sns.histplot(data = df_sample[col], kde = True, stat = 'density', ax = axs, alpha = .4)
    sm.qqplot(df_sample[col], line='45', ax = axs)
plt.show()
How can I generate hist and QQ plots one under another for each column?
Another issue is that my QQ plots look strange. I'm wondering whether I need to standardize all my columns before making the QQ plots.
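One possible layout, as a minimal sketch rather than a definitive answer: give the grid two rows, draw each histogram in the top row, and put the matching QQ plot directly underneath. The data here are synthetic stand-ins for the numeric columns, and the QQ plot is computed by hand from z-scores (instead of with sm.qqplot) to make the standardization step explicit.

```python
from statistics import NormalDist

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's df_sample (three numeric columns).
rng = np.random.default_rng(0)
df_sample = pd.DataFrame({
    "a": rng.normal(50, 10, 500),
    "b": rng.exponential(2.0, 500),
    "c": rng.normal(-3, 0.5, 500),
})
cols = df_sample.columns.tolist()

# Row 0 holds the histograms, row 1 the QQ plots, one grid column per data column.
fig, axes = plt.subplots(2, len(cols), figsize=(4 * len(cols), 7),
                         constrained_layout=True, squeeze=False)

for j, col in enumerate(cols):
    values = df_sample[col].to_numpy()
    axes[0, j].hist(values, bins=30, density=True, alpha=0.4)
    axes[0, j].set_title(col)

    # Standardize and sort the sample, then pair it with standard normal quantiles.
    z = np.sort((values - values.mean()) / values.std(ddof=1))
    probs = (np.arange(1, len(z) + 1) - 0.5) / len(z)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    axes[1, j].scatter(theoretical, z, s=5)
    axes[1, j].plot([-3, 3], [-3, 3], color="red")  # 45-degree reference line
    axes[1, j].set_xlabel("Theoretical quantiles")
    axes[1, j].set_ylabel("Sample quantiles")
```

Call plt.show() afterwards to display the figure. A 45-degree reference line only makes sense when the sample is on the same scale as the standard normal, which is why unstandardized columns look strange against it. If you stay with sm.qqplot, passing fit=True should have statsmodels standardize the sample for you.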
I have a histogram:
# Let's load the diabetes dataset that ships with scikit-learn.
from sklearn.datasets import load_diabetes
# sklearn returns a dict-like object, so pull the DataFrame out of it
diabetes = load_diabetes(as_frame=True)
data = diabetes['frame']
import matplotlib.pyplot as plt
%matplotlib inline
bmi_hist = plt.hist(data['bmi'], density=False)
bmi_hist = plt.ylabel("Frequency")
bmi_hist = plt.xlabel("Normalized BMI")
bp_hist = plt.hist(data['bp'], density=False)
bp_hist = plt.ylabel("Frequency")
bp_hist = plt.xlabel("Normalized BP")
This is a histogram for two of the columns in the frame above.
I want to compare these two in a scatter graph. My attempts haven't been quite successful as I know I need an X and a Y to plot.
I thought I would use the same axis as the histogram:
y_bmi = data['bmi'].value_counts() # frequency
x_bmi = data['bmi'] # normalized value
ax1 = df.plot.scatter(x = x_bmi, y= y_bmi, c='DarkBlue')
But this can only be used on a DataFrame, so do I have to copy the values of the bmi column into a new DataFrame, or is there a simpler method?
Any help would be greatly appreciated.
Many Thanks.
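A scatter plot needs one x value and one y value per observation, and each row of the frame already supplies such a pair, so no frequencies are required. A minimal sketch, using a synthetic frame as a stand-in for diabetes['frame'] (only the column names 'bmi' and 'bp' are taken from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes['frame']; the values are random.
rng = np.random.default_rng(0)
data = pd.DataFrame({"bmi": rng.normal(0, 0.05, 100),
                     "bp": rng.normal(0, 0.05, 100)})

# Pass column names, not Series: each row becomes one (bmi, bp) point.
ax = data.plot.scatter(x="bmi", y="bp", c="DarkBlue")
ax.set_xlabel("Normalized BMI")
ax.set_ylabel("Normalized BP")
```

With the real data, the same call on the existing frame should work as is, since both columns already live in one DataFrame.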
I am a beginner in Python. I am trying to plot a CSV file in the form of a facet grid using the seaborn library.
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(df, col="Gamma1",col_wrap=6,sharex=False)
g = (g.map(plt.scatter, "ARMSE", "Frobenius_norm_correlation").add_legend())
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Friedman_chain')
For each of the scatter plots in the facet grid, I want to state the coordinates of the data point with the minimum value of ARMSE and mark this point with a different color from the other data points in that scatter plot. Can you suggest how to do it?
The dataframe df contains the columns ARMSE, Gamma1 and Frobenius_norm_correlation.
I am attaching the image of the current plot below:
You can create a column identifying the minimum data point as part of pre-processing and pass this column's name to seaborn.
For example, taking a sample dataset:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "group": list("ABCDEFGHIJ") * 10,
    "y": np.random.normal(loc=1, scale=1, size=100),
    "x": np.array([[x] * 10 for x in range(10)]).flatten()
})
# new column flagging the row with the minimum y in each group
df["min"] = df["y"] == df.groupby("group")["y"].transform("min")
g = sns.FacetGrid(df, col="group", hue="min", col_wrap=5, sharex=True)
g = (g.map(plt.scatter, "x", "y").add_legend())
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Min value detection')
I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dot plot/scatter plot in which both the x and y axes are categorical and presence/absence is coded by different shapes. Something like the following:
Patient|   x      x      -
Control|   -      x      -
       ____________________
         GeneA  GeneB  GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    cList = ['blue' if x else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
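The question asked for presence and absence to be coded by shape rather than color. As a hedged sketch of one way to get that, the version below makes one scatter call per state so that each state can have its own marker; the marker choices ('x' and '_') and the legend labels are assumptions picked to mimic the ASCII diagram in the question.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Patient': [True, True, False],
                   'Control': [False, True, False]}).transpose()
df.columns = ['GeneA', 'GeneB', 'GeneC']

fig, ax = plt.subplots()
# A single scatter call takes only one marker, so draw each state separately.
for present, marker, label in [(True, 'x', 'Present'), (False, '_', 'Absent')]:
    rows, cols = (df.to_numpy() == present).nonzero()
    ax.scatter(cols, rows, marker=marker, s=200, color='black', label=label)

# Replace the numeric positions with the categorical names.
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(len(df.index)))
ax.set_yticklabels(df.index)
ax.invert_yaxis()   # put the first row (Patient) at the top
ax.margins(0.3)     # keep the markers away from the axes frame
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
```

Call plt.show() to display it. The columns become x positions and the index becomes y positions, so the same code should work for more samples or genes without changes.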
Something like this might work
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
Look at https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work, with something like df.astype(int).
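For reference, a small sketch of that conversion applied to the boolean frame from the question. The name df_int is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Patient': [True, True, False],
                   'Control': [False, True, False]}).transpose()
df.columns = ['GeneA', 'GeneB', 'GeneC']

# True becomes 1 and False becomes 0; the labels are unchanged.
df_int = df.astype(int)
print(df_int)
```

The result holds the same 0/1 values as the integer frame built by hand in the snippet above.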
I have a pandas dataframe with 2 columns, "height" and "class". The class column takes 3 values: 1, 2 and 5.
Now I want to make a histogram of the height data and color it by class:
plot19_s["vegetation height"].plot.hist(bins = 10)
This is my histogram.
But now I want to see the different classes as different colors in the histogram.
Since I'm not sure if the potential duplicate actually answers the question here, this is a way to produce a stacked histogram using numpy.histogram and matplotlib bar plot.
import pandas as pd
import numpy as np;np.random.seed(1)
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.exponential(size=100),
                   "class" : np.random.choice([1,2,5],100)})
_, edges = np.histogram(df["x"], bins=10)
histdata = []; labels=[]
for n, group in df.groupby("class"):
    histdata.append(np.histogram(group["x"], bins=edges)[0])
    labels.append(n)
hist = np.array(histdata)
histcum = np.cumsum(hist,axis=0)
plt.bar(edges[:-1],hist[0,:], width=np.diff(edges)[0],
        label=labels[0], align="edge")
for i in range(1,len(hist)):
    plt.bar(edges[:-1],hist[i,:], width=np.diff(edges)[0],
            bottom=histcum[i-1,:],label=labels[i], align="edge")
plt.legend(title="class")
plt.show()
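As a shorter alternative sketch, not the approach used above: matplotlib's hist accepts a list of arrays and can stack them itself with stacked=True. It uses the same synthetic data as the code above, and the labels are converted to strings for the legend.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"x": np.random.exponential(size=100),
                   "class": np.random.choice([1, 2, 5], 100)})

# One array of x values per class, in the order groupby yields them.
labels = []
groups = []
for name, group in df.groupby("class"):
    labels.append(str(name))
    groups.append(group["x"].to_numpy())

fig, ax = plt.subplots()
# With several datasets, the 10 bins span the combined range of all of them.
counts, edges, _ = ax.hist(groups, bins=10, stacked=True, label=labels)
ax.legend(title="class")
```

With stacked=True the returned counts are cumulative across the groups, so the last row gives the total height of each bar. Call plt.show() to display the figure.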