How to visualize variations between columns through plot pandas

I have a DataFrame (shown in the figure) in which some rows contain mismatched values. I want to plot the first column against all the other columns to show where they differ. How can I do that?

You could use a 2D heatmap to show the differences.
For example, if you had a DataFrame like this:
Dataframe:
   U  V  W  X  Y  Z
0  M  M  M  M  M  M
1  K  K  R  K  K  K
2  A  A  A  A  B  A
3  I  I  I  I  I  I
4  L  L  L  L  L  L
You could use the following code to mark every cell that differs from the value in column U on the same row:
Code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({"U":['M','K','A','I','L'], "V":['M','K','A','I','L'], "W":['M','R','A','I','L'], "X":['M','K','A','I','L'], "Y":['M','K','B','I','L'], "Z":['M','K','A','I','L']})
# Create a new dataframe with the differences between the values in each column and the values in the first column
diff_df = df.apply(lambda x: x != df['U'])
# Convert the dataframe to a numpy array
diff_arr = diff_df.values
cmap = plt.cm.RdYlGn_r
fig, ax = plt.subplots()
im = ax.imshow(diff_arr, cmap=cmap)
cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel('Differences', rotation=-90, va="bottom")
ax.set_xticks(np.arange(len(df.columns)))
ax.set_yticks(np.arange(len(df)))
ax.set_xticklabels(df.columns)
ax.set_yticklabels(df.index)
for i in range(len(df)):
    for j in range(len(df.columns)):
        text = ax.text(j, i, df.iloc[i, j], ha="center", va="center", color="w")
ax.set_title("Differences")
fig.tight_layout()
plt.show()
Output: (figure not shown)
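If a summary is more useful than a cell-by-cell picture, the same comparison can be reduced to one mismatch count per column. The following is a minimal sketch, assuming the example DataFrame from this answer; the names `mask` and `counts` are introduced here for illustration and are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"U": ['M', 'K', 'A', 'I', 'L'],
                   "V": ['M', 'K', 'A', 'I', 'L'],
                   "W": ['M', 'R', 'A', 'I', 'L'],
                   "X": ['M', 'K', 'A', 'I', 'L'],
                   "Y": ['M', 'K', 'B', 'I', 'L'],
                   "Z": ['M', 'K', 'A', 'I', 'L']})

# True wherever a cell differs from the first column on the same row.
# axis=0 aligns the reference column with the row index.
mask = df.ne(df.iloc[:, 0], axis=0)

# Number of mismatching rows in each column.
counts = mask.sum()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax)
ax.set_xlabel("Column")
ax.set_ylabel("Rows differing from column U")
fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to see the chart.
```

The boolean `mask` can also be passed to `ax.imshow` in place of `diff_arr` if the heatmap view is preferred.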

Related

How to remove NaN values from matshow 2D plot and format labels as percentages

I am plotting some 2D array data as a map with a color scale and I have some NaN values that I don't want to show up. I need these values to show up as plain white squares and for the formatting to be in percentage style. Here is what I have so far:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
filename = "data.csv"
df = pd.read_csv(filename, header=None)
fig, ax = plt.subplots()
ax.matshow(df, cmap='bwr')
for (i, j), z in np.ndenumerate(df):
    ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center',
            bbox=dict(boxstyle='round', facecolor='white', edgecolor='0.3'))
plt.show()
The data and the current output were posted as images, which are not reproduced here. The current output is almost exactly what I want, except for the NaN values and the percentage formatting.
Any help would be very much appreciated.

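Format the text string with an if-else block.
NaN never compares equal to anything, including itself, so a test such as z == np.nan is always False; check for it with np.isnan (pd.isna also works).
See "Fixed digits after decimal with f-strings" for additional information about f-string formatting.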
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# sample data
np.random.seed(365)
df = pd.DataFrame(np.random.random((4, 4)))
df.iloc[2, -1] = np.nan
df.iloc[3, 2:] = np.nan
# plot
fig, ax = plt.subplots()
ax.matshow(df, cmap='bwr')
for (i, j), z in np.ndenumerate(df):
    if np.isnan(z):
        z = ''  # if z is nan make it an empty string
    else:
        z = f'{z*100:0.1f}%'  # otherwise multiply by 100 and string format
    ax.text(j, i, z, ha='center', va='center',
            bbox=dict(boxstyle='round', facecolor='white', edgecolor='0.3'))
plt.show()
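The question also asks for the NaN cells to appear as plain white squares. One way to make that explicit, offered as a sketch rather than as part of the original answer, is to copy the colormap and set its "bad" color, which Matplotlib uses for masked or NaN cells. The helper name `format_cell` is hypothetical and only wraps the if-else logic shown in the answer.

```python
import copy

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def format_cell(z):
    """Return '' for NaN, otherwise z as a percentage with one decimal."""
    return '' if np.isnan(z) else f'{z * 100:0.1f}%'


# Same sample data as in the answer.
np.random.seed(365)
df = pd.DataFrame(np.random.random((4, 4)))
df.iloc[2, -1] = np.nan
df.iloc[3, 2:] = np.nan

# Copy the colormap so the shared 'bwr' instance is not modified,
# then draw NaN cells in white.
cmap = copy.copy(plt.cm.bwr)
cmap.set_bad('white')

fig, ax = plt.subplots()
ax.matshow(df, cmap=cmap)
for (i, j), z in np.ndenumerate(df):
    label = format_cell(z)
    if label:  # nothing to label in NaN cells
        ax.text(j, i, label, ha='center', va='center',
                bbox=dict(boxstyle='round', facecolor='white', edgecolor='0.3'))
# Call plt.show() or fig.savefig(...) here to see the plot.
```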

two DataFrame plots

I have a plot similar to the one answered in the question linked below:
two DataFrame plot in a single plot matplotlip
I modified the code block that plots the df2 columns, because I think that is the part that has to change, but I could not get the output I want.
A sample of the plot I want is this:
This is how I modified it:
f, axes = plt.subplots(nrows=len(signals.columns)+1, sharex=True, )
i = 0
for col in df2.columns:
    fig, axs = plt.subplots()
    sns.regplot(x='', y='', data=df2, ax=axs[0])
    df2[col].plot(ax=axes[i], color='grey')
    axes[i].set_ylabel(col)
    i+=1
I can see that this is wrong.
I then tried the following, which seems like some headway.
How do I modify it to get what I want?
f, axes = plt.subplots(nrows=len(signals.columns)+1, sharex=True, )
# plots for df2 columns
i = 0
for col in df2.columns:
    lw=1
    df2[col].plot(ax=axes[i], color='grey')
    axes[i].set_ylim(0, 1)
    axes[i].set_ylabel(col)
    sns.rugplot(df2["P1"])

You have several options for making this graph. df1 and df2 are as defined in your previous question.
The version with matplotlib.pyplot.scatter is faster to draw but less faithful to the example. The version with seaborn.rugplot looks identical to the example but takes longer to draw. I marked the important part of the code between the ####### comment lines.
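The previous question that defines df1 and df2 is not reproduced on this page. To try either version, a stand-in pair of frames can be used. The sketch below is purely hypothetical: the column names `type`, `strand`, `start`, `stop` and `P1` are taken from the code and question here, while the other column names, the sizes and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_positions = 200  # hypothetical length of the genomic region

# df2: one row per position, one column per signal, values in [0, 1]
# (the range assumed by the grey-scale mapping in both versions).
df2 = pd.DataFrame(rng.random((n_positions, 3)), columns=["P1", "P2", "P3"])

# df1: one row per annotated feature, with the columns the annotation loop reads.
df1 = pd.DataFrame({
    "type":   ["gene", "exon", "exon", "gene", "exon"],
    "strand": ["+", "+", "+", "-", "-"],
    "start":  [10, 20, 60, 110, 130],
    "stop":   [90, 40, 80, 190, 170],
})
```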
using matplotlib.pyplot.scatter
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
f, axes = plt.subplots(nrows=len(df2.columns)+1, sharex=True,
                       gridspec_kw={'height_ratios':np.append(np.repeat(1, len(df2.columns)), 3)})
####### variable part below #######
# plots for df2 columns
i = 0
for col in df2.columns:
    axes[i].scatter(x=df2.index, y=np.repeat(0, len(df2)), c=df2[col], marker='|', cmap='Greys')
    axes[i].set_ylim(-0.5, 0.5)
    axes[i].set_yticks([0])
    axes[i].set_yticklabels([col])
    i+=1
###################################
## code to plot annotations
axes[-1].set_xlabel('Genomic position')
axes[-1].set_ylabel('annotations')
axes[-1].set_ylim(-0.5, 1.5)
axes[-1].set_yticks([0, 1])
axes[-1].set_yticklabels(['−', '+'])
for _, r in df1.iterrows():
    marker = '|'
    lw=1
    if r['type'] == 'exon':
        marker=None
        lw=8
    y = 1 if r['strand'] == '+' else 0
    axes[-1].plot((r['start'], r['stop']), (y, y),
                  marker=marker, lw=lw,
                  solid_capstyle='butt',
                  color='#505050')
# remove space between plots
plt.subplots_adjust(hspace=0)
axes[-1].set_xlim(0, len(df2))
f.set_size_inches(6, 2)
using seaborn.rugplot
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
f, axes = plt.subplots(nrows=len(df2.columns)+1, sharex=True,
                       gridspec_kw={'height_ratios':np.append(np.repeat(1, len(df2.columns)), 3)})
####### variable part below #######
import matplotlib
import matplotlib.cm as cm
norm = matplotlib.colors.Normalize(vmin=0, vmax=1, clip=True)
mapper = cm.ScalarMappable(norm=norm, cmap=cm.Greys)
# plots for df2 columns
i = 0
for col in df2.columns:
    sns.rugplot(x=df2.index, color=list(map(mapper.to_rgba, df2[col])), height=1, ax=axes[i])
    axes[i].set_yticks([0])
    axes[i].set_yticklabels([col])
    i+=1
###################################
## code to plot annotations
axes[-1].set_xlabel('Genomic position')
axes[-1].set_ylabel('annotations')
axes[-1].set_ylim(-0.5, 1.5)
axes[-1].set_yticks([0, 1])
axes[-1].set_yticklabels(['−', '+'])
for _, r in df1.iterrows():
    marker = '|'
    lw=1
    if r['type'] == 'exon':
        marker=None
        lw=8
    y = 1 if r['strand'] == '+' else 0
    axes[-1].plot((r['start'], r['stop']), (y, y),
                  marker=marker, lw=lw,
                  solid_capstyle='butt',
                  color='#505050')
# remove space between plots
plt.subplots_adjust(hspace=0)
axes[-1].set_xlim(0, len(df2))
f.set_size_inches(6, 2)

How should a nested loop statement to create vertical lines in a for loop statement that creates histograms work?

I am trying to use a for loop to create a histogram for each field in a DataFrame, here called df4.
There are 3 fields/columns.
Then I want to draw vertical lines at the quantiles of each column, as defined in the following series: p, exp, eng.
My code below only draws the vertical lines on the histogram of the last field/column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df4 = pd.read_csv("xyz.csv", index_col = "abc_id" )
# dataframe
# x coordinates for the lines
p = df4['abc'].quantile([0.25,0.5,0.75,0.9,0.95])
exp = df4['efg'].quantile([0.25,0.5,0.75,0.9,0.95])
eng = df4['xyz'].quantile([0.25,0.5,0.75,0.9,0.95])
# colors for the lines
colors = ['r','k','b','g','y']
bins = [0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000]
fig, axs = plt.subplots(len(df4.columns), figsize=(10, 25))
for n, col in enumerate(df4.columns):
    if (n==0):
        for xc,c in zip(exp,colors):
            plt.axvline(x=xc, label='line at x = {}'.format(xc), c=c)
    if (n==1):
        for xc,c in zip(eng,colors):
            plt.axvline(x=xc, label='line at x = {}'.format(xc), c=c)
    if (n==2):
        for xc,c in zip(p,colors):
            plt.axvline(x=xc, label='line at x = {}'.format(xc), c=c)
    df4[col].hist(ax=axs[n],bins=50)
plt.legend()
plt.show()
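A likely cause, offered as a sketch rather than a confirmed diagnosis: plt.axvline draws on the current axes, and after plt.subplots the current axes is the last one created, so every line ends up on the last histogram. Calling axvline on the specific Axes object avoids this. The example below uses made-up data in place of xyz.csv (the column names abc, efg and xyz come from the question; the values are random) and computes the quantiles inside the loop, so the three separate series are not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the contents of xyz.csv.
rng = np.random.default_rng(0)
df4 = pd.DataFrame({
    "abc": rng.gamma(2.0, 200.0, 500),
    "efg": rng.gamma(3.0, 150.0, 500),
    "xyz": rng.gamma(1.5, 300.0, 500),
})

quantiles = [0.25, 0.5, 0.75, 0.9, 0.95]
colors = ['r', 'k', 'b', 'g', 'y']

fig, axs = plt.subplots(len(df4.columns), figsize=(10, 12))
for ax, col in zip(axs, df4.columns):
    df4[col].hist(ax=ax, bins=50)
    # Draw each quantile line on this column's own Axes, not on the current axes.
    for q, xc, c in zip(quantiles, df4[col].quantile(quantiles), colors):
        ax.axvline(x=xc, c=c, label=f'{q:.0%} quantile = {xc:.1f}')
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to see the figure.
```

Putting the legend call inside the loop gives each histogram its own legend; a single plt.legend() after the loop would again apply only to the last axes.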

Visualize 3 columns as a heatmap in seaborn / pandas [duplicate]

I need to create a Matplotlib heatmap (pcolormesh) using a pandas DataFrame time-series column (df_all.ts) as my X-axis.
How can I convert the pandas time-series column into something that can be used as the X-axis in np.meshgrid(x, y) to create the heatmap? My workaround is to build a Matplotlib drange with the same parameters as the pandas column, but is there a simpler way?
x = pd.date_range(df_all.ts.min(),df_all.ts.max(),freq='H')
xt = mdates.drange(df_all.ts.min(), df_all.ts.max(), dt.timedelta(hours=1))
y = np.arange(ylen)
X,Y = np.meshgrid(xt, y)

I do not know what you mean by a heat map for a time series, but for a DataFrame you may do as below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
from string import ascii_uppercase
from matplotlib import patheffects
m, n = 4, 7 # 4 rows, 7 columns
df = pd.DataFrame(np.random.randn(m, n),
                  columns=list(ascii_uppercase[:n]),
                  index=list(ascii_uppercase[-m:]))
ax = plt.imshow(df, interpolation='nearest', cmap='Oranges').axes
_ = ax.set_xticks(np.linspace(0, n-1, n))
_ = ax.set_xticklabels(df.columns)
_ = ax.set_yticks(np.linspace(0, m-1, m))
_ = ax.set_yticklabels(df.index)
ax.grid(False)  # hide the grid; current Matplotlib does not treat the string 'off' as False
ax.xaxis.tick_top()
Optionally, to print the actual values in the middle of each square, with some shadow for readability, you may do:
path_effects = [patheffects.withSimplePatchShadow(shadow_rgbFace=(1,1,1))]
for i, j in product(range(m), range(n)):
    _ = ax.text(j, i, '{0:.2f}'.format(df.iloc[i, j]),
                size='medium', ha='center', va='center',
                path_effects=path_effects)
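For the time-axis part of the question, one possible approach is to convert the timestamps with matplotlib.dates.date2num, which returns the floating-point day numbers Matplotlib uses internally, and pass those to np.meshgrid. This is a minimal sketch under stated assumptions: `ts` is a hypothetical stand-in for df_all.ts, and `ylen` and the random `Z` values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop this line for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_all.ts: 48 hourly timestamps.
ts = pd.Series(pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1)))
ylen = 5

# Convert the datetime64 values to Matplotlib date numbers (floats, in days).
x = mdates.date2num(ts.to_numpy())
y = np.arange(ylen)
X, Y = np.meshgrid(x, y)

# Made-up heatmap values, one per (y, time) cell.
Z = np.random.default_rng(0).random(X.shape)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(X, Y, Z, shading="auto")
ax.xaxis_date()       # interpret the x values as dates when labelling ticks
fig.autofmt_xdate()   # rotate the date labels so they do not overlap
fig.colorbar(mesh, ax=ax)
# Call plt.show() or fig.savefig(...) here to see the heatmap.
```

Because the conversion works directly on the column's values, no separate drange with matching parameters has to be built.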
