I tried to follow the Altair example of "Text over a Heatmap" but came across some problems.
My dataset consists of two indexes (N, Z) and a value column color.
I would like to set the origin at the bottom-left side.
How can I display the heatmap with text from a script?
Is it possible to put the labels (N, Z) in each pixel?
The relevant part of the code is attached below.
def chart_altair(self):
    import altair as alt
    data = self.df.dropna().reset_index(name='color')
    # Configure common options
    base = alt.Chart(data).encode(
        alt.X('N:O'),
        alt.Y('Z:O'),
    )
    # Configure heatmap
    heatmap = base.mark_rect().encode(
        color=alt.Color(
            'color:Q',
            scale=alt.Scale(scheme='viridis'),
            legend=alt.Legend()
        )
    )
    text = base.mark_text(baseline='middle').encode(
        text='color:Q'
    )
I have pasted below a few rows of my dataset, which consists of two indexes, Z and N, and a value column color (it actually represents an atomic mass table). The "heatmap" should be similar to the chart of nuclei, with the neutron number increasing to the right along the x-axis and the proton number increasing upwards along the y-axis. However, the proton number currently increases in the opposite direction (down the y-axis).
Z N color
0 1 8.07
1 0 7.29
1 1 13.14
1 2 14.95
2 1 14.93
1 3 24.62
2 2 2.42
3 3 14.09
4 2 18.38
2 5 26.07
3 4 14.91
4 3 15.77
5 2 27.68
Here is an example where I concatenate the two labels using transform_calculate. You could also do this by creating the label column in pandas instead.
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()})
heatmap = alt.Chart(source).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
)
# Build the label as a string such as "-5, 3"; the separator keeps x and y distinguishable
heatmap + heatmap.mark_text().transform_calculate(
    label='datum.x + ", " + datum.y'
).encode(
    text='label:N',
    color=alt.value('black')
)
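The pandas alternative mentioned above might look like the following minimal sketch. It rebuilds the same `source` grid as the example; the column name `label` and the ", " separator are illustrative choices, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

# Same x^2 + y^2 grid as in the example above
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': (x ** 2 + y ** 2).ravel()})

# Build the text label in pandas instead of with transform_calculate
source['label'] = source['x'].astype(str) + ', ' + source['y'].astype(str)
print(source.head())
```

With that column in place, the text layer can encode `text='label:N'` directly and the `transform_calculate` step is no longer needed.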
I have a data set with 8 columns and several rows. The columns contain measurements for different variables (6 in total) under 2 different conditions, each consisting of 4 columns that contain repeated measurements for a particular condition.
Using Seaborn, I would like to generate a bar chart displaying the mean and standard deviation of every 4 columns, grouped by index key (i.e. measured variable). The dataframe structure is as follows:
np.random.seed(10)
df = pd.DataFrame({
'S1_1':np.random.randn(6),
'S1_2':np.random.randn(6),
'S1_3':np.random.randn(6),
'S1_4':np.random.randn(6),
'S2_1':np.random.randn(6),
'S2_2':np.random.randn(6),
'S2_3':np.random.randn(6),
'S2_4':np.random.randn(6),
},index= ['var1','var2','var3','var4','var5','var6'])
How do I tell seaborn that I would like only 2 bars, one for the first 4 columns and one for the second 4, with each bar displaying the mean (and standard deviation or some other measure of dispersion) across its 4 columns?
I was thinking of using multi-indexing, adding a second column level to group the columns into 2 conditions,
df.columns = pd.MultiIndex.from_arrays([['Condition 1'] * 4 + ['Condition 2'] * 4,df.columns])
but I can't figure out what I should pass to Seaborn to generate the plot I want.
If anyone could point me in the right direction, that would be a great help!
Update Based on Comment
Plotting is all about reshaping the dataframe for the plot API
# still create the groups
l = df.columns
n = 4
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)
# stack each group and add an id column
data_list = list()
for group in groups:
    id_ = group[0][1]
    data = df[group].copy().T
    data['id_'] = id_
    data_list.append(data)
df2 = pd.concat(data_list, axis=0).reset_index()
df2.rename({'index': 'sample'}, axis=1, inplace=True)
# melt df2 into a long form
dfm = df2.melt(id_vars=['sample', 'id_'])
# plot (ci='sd' draws the standard deviation; seaborn 0.12+ deprecates ci in favor of errorbar='sd')
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', hue='id_', ci='sd', aspect=3)
df2.head()
sample YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C id_
0 S2_1 -13.062716 -8.084685 2.360795 -0.740357 3.086768 -0.117259 -5.678183 2.527573 -17.326287 -1.319402 2
1 S2_2 -5.431474 -12.676807 0.070569 -4.214761 -4.318011 -4.489010 -10.268632 0.691448 -24.189106 -2.343884 2
2 S2_3 -9.365509 -12.281169 0.497772 -3.228236 0.212941 -2.287206 -10.250004 1.111842 -27.811564 -4.329987 2
3 S2_4 -7.582111 -15.587219 -1.286167 -4.531494 -3.090265 -4.718281 -8.933496 2.079757 -21.580854 -2.834441 2
4 S3_1 -12.618254 -20.010779 -2.530541 -3.203072 -2.436503 -2.922565 -15.972632 3.551605 -35.618485 -4.925495 3
dfm.head()
sample id_ variable value
0 S2_1 2 YAL001C -13.062716
1 S2_2 2 YAL001C -5.431474
2 S2_3 2 YAL001C -9.365509
3 S2_4 2 YAL001C -7.582111
4 S3_1 3 YAL001C -12.618254
Plot Result
kind='box'
A box plot might be a better way to convey the distribution
p = sns.catplot(kind='box', data=dfm, y='variable', x='value', hue='id_', height=12)
Original Answer
Use a list comprehension to chunk the columns into groups of 4
This uses the original, more comprehensive data that was posted. It can be found in revision 4
Create a figure with subplots and zip each group to an ax from axes
Use each group to select data from df and transpose the data with .T.
Using sns.barplot the default estimator is mean, so the length of the bar is the mean, and set ci='sd' so the confidence interval is the standard deviation.
sns.barplot(data=data, ci='sd', ax=ax) can easily be replaced with sns.boxplot(data=data, ax=ax)
import matplotlib.pyplot as plt
import seaborn as sns
# using the first comma separated data that was posted, create groups of 4
l = df.columns
n = 4 # chunk size for groups
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)
# plot
fig, axes = plt.subplots(num_gps, 1, figsize=(12, 6*num_gps))
for ax, group in zip(axes, groups):
    data = df[group].T
    sns.barplot(data=data, ci='sd', ax=ax)
    ax.set_title(f'{group.to_list()}')
fig.tight_layout()
fig.savefig('test.png')
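The chunking step can be checked on its own with a plain list. This is a minimal sketch: the `chunk` helper and the sample column names are illustrative and not part of the answer above.

```python
def chunk(items, n):
    """Split items into consecutive groups of at most n elements."""
    return [items[i:i + n] for i in range(0, len(items), n)]

# Hypothetical column names following the S<condition>_<replicate> pattern
columns = ['S1_1', 'S1_2', 'S1_3', 'S1_4', 'S2_1', 'S2_2', 'S2_3', 'S2_4']
groups = chunk(columns, 4)
for group in groups:
    print(group)
```

If the number of columns is not a multiple of `n`, the last group is simply shorter, which is worth keeping in mind when sizing the subplot grid.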
Example of data
The bar is the mean of each column, and the line is the standard deviation
YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C
S8_1 -1.731388 -17.215712 -3.518643 -2.358103 0.418170 -1.529747 -12.630343 2.435674 -27.471971 -4.021264
S8_2 -1.325524 -24.056632 -0.984390 -2.119338 -1.770665 -1.447103 -10.618954 2.156420 -30.362998 -4.735058
S8_3 -2.024020 -29.094027 -6.146880 -2.101090 -0.732322 -2.773949 -12.642857 -0.009749 -28.486835 -4.783863
S8_4 2.541671 -13.599049 -2.688125 -2.329332 -0.694555 -2.820627 -8.498677 3.321018 -31.741916 -2.104281
Plot Result
My experience with Python is pretty basic. I have written Python code to import data from an external file and perform a calculation. My result looks something like this (except much larger in reality).
1 1
1 1957
1 0.15
2 346
2 0.90
2 100
3 1920
3 100
3 40
What I want to do is plot these two columns as a single series, but then distinguish each data point according to a certain pattern. I know this sounds unnecessarily complicated, but it's something I need to do to help out the people who will use my code. Unfortunately, my Python skills fail me here. More specifically:
1. The first column has "1," "2," or "3." So first I want to make all the "1" data points circles (for example), all the "2" data points some other symbol, and likewise for the "3" data points.
2. Next. There are three rows for each distinct number. So for "1," the "0.15" in the second column is the average value, the "1957" is the maximum value, the "1" is the minimum value. I want to make the data point associated with each number's average value (the top row for each number) green (for example). I want the maximum and minimum values to have their own colors too.
So I will end up with a plot that shows one series only, but where each data point looks distinct. If anyone could please point me in the right direction, I would be very grateful. If I have not said this clearly, please let me know and I'll try again!
For different marker styles you currently need to create different plot instances (see this github issue). Using different colors can be done by passing an array as the color argument. So for example:
import matplotlib.pyplot as plt
import numpy as np
data = np.array([
[1, 0.15],
[1, 1957],
[1, 1],
[2, 346],
[2, 0.90],
[2, 100],
[3, 1920],
[3, 100],
[3, 40],
])
x, y = np.transpose(data)
symbols = ['o', 's', 'D']
colors = ['blue', 'orange', 'green']
for value, marker in zip(np.unique(x), symbols):
    mask = (x == value)
    plt.scatter(x[mask], y[mask], marker=marker, color=colors)
plt.show()
What I would do is to separate the data into three different columns so you have a few series. Then I'd use the plt.scatter with different markers to get the desired effect.
code
import matplotlib.pyplot as plt
import numpy as np
# Fixing random state for reproducibility
np.random.seed(19680801)
N = 100
r0 = 0.6
x = 0.9 * np.random.rand(N)
y = 0.9 * np.random.rand(N)
area = (20 * np.random.rand(N))**2 # 0 to 10 point radii
c = np.sqrt(area)
r = np.sqrt(x ** 2 + y ** 2)
area1 = np.ma.masked_where(r < r0, area)
area2 = np.ma.masked_where(r >= r0, area)
plt.scatter(x, y, s=area1, marker='^', c=c)
plt.scatter(x, y, s=area2, marker='o', c=c)
# Show the boundary between the regions:
theta = np.arange(0, np.pi / 2, 0.01)
plt.plot(r0 * np.cos(theta), r0 * np.sin(theta))
plt.show()
source: https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/scatter_masked.html#sphx-glr-gallery-lines-bars-and-markers-scatter-masked-py
I am trying to plot column data vs the row labels of a data frame. When I do so, the plot looks good, but the Y axis becomes illegible as the number of rows increases. What I don't get is why the automatic spacing works fine for the X axis but not for the Y axis.
x1 = M.iloc[:,1]
plt.plot(x1,x)
Where the variable "x" represents Column 0 values of dataframe "M" below
The "M" dataframe:
0.0 0.5 1.0
0 300 300.000000 1550
1.00e-01 s 300 300.769527 1550
2.00e-01 s 300 301.538106 1550
3.00e-01 s 300 302.305739 1550
.
.
.
2.80e+00 s 300 321.192396 1550
2.90e+00 s 300 321.935830 1550
Edit
So it seems that the scientific-notation formatting of the first column is what is messing things up, though I'm still not sure why.
x = [0]
i = 1
while i < 30:
    q = i*0.1
    xx = str('{:.2e}'.format(q)) + ' s'
    x.append(xx)
    i = i + 1
# columns must be list-like; these are the column labels shown in the table above
M = pd.DataFrame(index=x, columns=[0.0, 0.5, 1.0])
So in the code above, it is the line xx = str('{:.2e}'.format(q)) + ' s' that is making the Y-labels go crazy. I unfortunately can't take it out as I need them to be in scientific notation.
You can try tick spacing if it is okay to drop a few tick labels. Other options are to increase your plot size or decrease the font size of the y labels.
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
x1 = M.iloc[:,1]
tick_spacing = 2 # or whatever label gap you want to use.
fig, ax = plt.subplots(1,1)
ax.plot(x1,x)
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.show()
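To see which labels a spacing of 2 would keep, the label list can be built and thinned in plain Python. This is only a sketch: it rebuilds the index labels from the question's loop, and `kept` is a hypothetical name for the labels at positions 0, 2, 4, and so on (the positions `MultipleLocator(2)` would land on, assuming the string labels sit at consecutive integer positions on the axis).

```python
# Build the same scientific-notation labels as in the question
labels = [0] + ['{:.2e} s'.format(i * 0.1) for i in range(1, 30)]

tick_spacing = 2  # illustrative value: keep every 2nd label
kept = labels[::tick_spacing]

print(len(labels), len(kept))
print(kept[:3])
```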
I am using Python with matplotlib and need to visualize the percentage distribution of sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
-
--- group2 (25%)
-
--- group3 (35%)
group1 --- A (25%)
-
--- B (25%)
-
--- c (50%)
And it can go on: each group can have several sub-groups, and the same goes for each sub-group.
How can I plot a proper chart for this info?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]: group subgroup
1 A 17
C 16
B 5
2 A 23
C 10
B 7
3 C 8
A 7
B 7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch
def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.
    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.
    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
    # (plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9)
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        # total count of each higher-level group, repeated on every row of that group
        counts['c_' + o] = counts.groupby(ordering[:i])['c_' + ordering[-1]].transform('sum')
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]
    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1  # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique())*comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
    # One legend entry per label, built from the color mapping so names and colors match
    ax.legend(handles=[Patch(color=c, label=l) for l, c in colors.items()])
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
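The percentages annotated on the bars are each level's share of its parent. Below is a plain-Python sketch of that arithmetic, using the tree from the question as hypothetical counts out of 100 rows; the `rows` list and the variable names are illustrative only and are not taken from `plot_tree`.

```python
from collections import Counter

# Hypothetical rows: 40 in group1 (10 A, 10 B, 20 C), 25 in group2, 35 in group3
rows = ([('group1', 'A')] * 10 + [('group1', 'B')] * 10 + [('group1', 'C')] * 20
        + [('group2', 'A')] * 25 + [('group3', 'A')] * 35)

group_counts = Counter(group for group, _ in rows)
pair_counts = Counter(rows)

# Share of each group in the whole data set
group_share = {g: n / len(rows) for g, n in group_counts.items()}
# Share of each subgroup within its parent group
sub_share = {(g, s): n / group_counts[g] for (g, s), n in pair_counts.items()}

print('{:.0%}'.format(group_share['group1']))
print('{:.0%}'.format(sub_share[('group1', 'C')]))
```

Dividing by the parent count rather than the grand total is what makes the subgroup labels within each group sum to 100%.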
Have you tried a stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
I am iteratively plotting the np.exp results of 12 rows of data from a 2D array (12,5000), out_array. All data share the same x values, (x_d). I want the first 4 iterations to all plot as the same color, the next 4 to be a different color, and next 4 a different color...such that I have 3 different colors each corresponding to the 1st-4th, 5th-8th, and 9th-12th iterations respectively. In the end, it would also be nice to define these sets with their corresponding colors in a legend.
I have researched cycler (https://matplotlib.org/examples/color/color_cycle_demo.html), but I can't figure out how to assign colors into sets of iterations > 1. (i.e. 4 in my case). As you can see in my code example, I can have all 12 lines plotted with different (default) colors -or- I know how to make them all the same color (i.e. ...,color = 'r',...)
plt.figure()
for i in range(out_array.shape[0]):
    plt.plot(x_d, np.exp(out_array[i]),linewidth = 1, alpha = 0.6)
plt.xlim(-2,3)
I expect a plot like this, only with a total of 3 different colors, each corresponding to the chunks of iterations described above.
Another solution
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
color = ['r', 'g', 'b']
for i in range(12):
    # integer division maps iterations 0-3, 4-7 and 8-11 to one color each
    plt.plot(x, i*x, color=color[i//4])
plt.show()
plt.figure()
n = 0
color = ['r','g','b']
for i in range(out_array.shape[0]):
    n = n+1
    if n/4 <= 1:
        c = 1
    elif n/4 >1 and n/4 <= 2:
        c = 2
    elif n/4 >2:
        c = 3
    else:
        print(n)
    plt.plot(x_d, np.exp(out_array[i]),color = color[c-1])
plt.show()
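For 12 rows, the if/elif chain above gives the same result as a single integer division, and the same index can drive the legend the question asks for by labelling only the first line of each set. A minimal pure-Python sketch; `group_color`, `group_label` and the label text are hypothetical names, not from the answers above.

```python
colors = ['r', 'g', 'b']
group_size = 4

def group_color(i):
    """Color for the i-th (0-based) iteration: 0-3 -> 'r', 4-7 -> 'g', 8-11 -> 'b'."""
    return colors[i // group_size]

def group_label(i):
    """Legend label for the first line of each set; '_nolegend_' hides the rest."""
    if i % group_size == 0:
        return 'rows {}-{}'.format(i + 1, i + group_size)
    return '_nolegend_'

for i in range(12):
    print(i, group_color(i), group_label(i))
```

In the plotting loop these could be passed as `plt.plot(x_d, np.exp(out_array[i]), color=group_color(i), label=group_label(i))`, followed by `plt.legend()` after the loop.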