Plotting the mean of multiple columns including standard deviation - python

I have a data set with 8 columns and several rows. The columns contain measurements for different variables (6 in total) under 2 different conditions, with each condition consisting of 4 columns of repeated measurements.
Using Seaborn, I would like to generate a bar chart displaying the mean and standard deviation of every 4 columns, grouped by index key (i.e. the measured variable). The dataframe structure is as follows:
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({
    'S1_1': np.random.randn(6),
    'S1_2': np.random.randn(6),
    'S1_3': np.random.randn(6),
    'S1_4': np.random.randn(6),
    'S2_1': np.random.randn(6),
    'S2_2': np.random.randn(6),
    'S2_3': np.random.randn(6),
    'S2_4': np.random.randn(6),
}, index=['var1', 'var2', 'var3', 'var4', 'var5', 'var6'])
How do I tell seaborn that I want only 2 bars, 1 for the first 4 columns and 1 for the second 4, with each bar displaying the mean (and standard deviation or some other measure of dispersion) across those 4 columns?
I was thinking of using multi-indexing, adding a second column level to group the columns into 2 conditions,
df.columns = pd.MultiIndex.from_arrays([['Condition 1'] * 4 + ['Condition 2'] * 4,df.columns])
but I can't figure out what I should pass to Seaborn to generate the plot I want.
If anyone could point me in the right direction, that would be a great help!
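(For what it's worth, once that second column level is in place, one way to get a long-form frame that seaborn can aggregate is to stack both column levels and plot with hue set to the condition. This is only a sketch, not taken from the answers below, and assumes the MultiIndex columns from the snippet above:)
import seaborn as sns

# rows become (variable, condition, replicate); values go into a single column
dfm = (df.stack([0, 1])
         .rename_axis(['variable', 'condition', 'replicate'])
         .reset_index(name='value'))

# one bar per condition within each variable; ci='sd' draws the standard deviation
sns.barplot(data=dfm, x='variable', y='value', hue='condition', ci='sd')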

Update Based on Comment
Plotting is all about reshaping the dataframe for the plot API
# still create the groups
l = df.columns
n = 4
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# stack each group and add an id column
data_list = list()
for group in groups:
    id_ = group[0][1]  # e.g. 'S2_1' -> '2'
    data = df[group].copy().T
    data['id_'] = id_
    data_list.append(data)

df2 = pd.concat(data_list, axis=0).reset_index()
df2.rename({'index': 'sample'}, axis=1, inplace=True)

# melt df2 into a long form
dfm = df2.melt(id_vars=['sample', 'id_'])

# plot
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', hue='id_', ci='sd', aspect=3)
df2.head()
sample YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C id_
0 S2_1 -13.062716 -8.084685 2.360795 -0.740357 3.086768 -0.117259 -5.678183 2.527573 -17.326287 -1.319402 2
1 S2_2 -5.431474 -12.676807 0.070569 -4.214761 -4.318011 -4.489010 -10.268632 0.691448 -24.189106 -2.343884 2
2 S2_3 -9.365509 -12.281169 0.497772 -3.228236 0.212941 -2.287206 -10.250004 1.111842 -27.811564 -4.329987 2
3 S2_4 -7.582111 -15.587219 -1.286167 -4.531494 -3.090265 -4.718281 -8.933496 2.079757 -21.580854 -2.834441 2
4 S3_1 -12.618254 -20.010779 -2.530541 -3.203072 -2.436503 -2.922565 -15.972632 3.551605 -35.618485 -4.925495 3
dfm.head()
sample id_ variable value
0 S2_1 2 YAL001C -13.062716
1 S2_2 2 YAL001C -5.431474
2 S2_3 2 YAL001C -9.365509
3 S2_4 2 YAL001C -7.582111
4 S3_1 3 YAL001C -12.618254
Plot Result
kind='box'
A box plot might be a better way to convey the distribution:
p = sns.catplot(kind='box', data=dfm, y='variable', x='value', hue='id_', height=12)
Original Answer
Use a list comprehension to chunk the columns into groups of 4
This uses the original, more comprehensive data that was posted. It can be found in revision 4
Create a figure with subplots and zip each group to an ax from axes
Use each group to select data from df and transpose the data with .T.
With sns.barplot the default estimator is the mean, so the length of each bar is the mean; setting ci='sd' makes the error bar show the standard deviation instead of a confidence interval.
sns.barplot(data=data, ci='sd', ax=ax) can easily be replaced with sns.boxplot(data=data, ax=ax)
import matplotlib.pyplot as plt
import seaborn as sns

# using the first comma separated data that was posted, create groups of 4
l = df.columns
n = 4  # chunk size for groups
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# plot
fig, axes = plt.subplots(num_gps, 1, figsize=(12, 6*num_gps))
for ax, group in zip(axes, groups):
    data = df[group].T
    sns.barplot(data=data, ci='sd', ax=ax)
    ax.set_title(f'{group.to_list()}')
fig.tight_layout()
fig.savefig('test.png')
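Note that on seaborn 0.12 and newer the ci parameter is deprecated in favour of errorbar, so the equivalent call there would be something like:
sns.barplot(data=data, errorbar='sd', ax=ax)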
Example of data
The bar is the mean of each column, and the line is the standard deviation
YAL001C YAL002W YAL004W YAL005C YAL007C YAL008W YAL011W YAL012W YAL013W YAL014C
S8_1 -1.731388 -17.215712 -3.518643 -2.358103 0.418170 -1.529747 -12.630343 2.435674 -27.471971 -4.021264
S8_2 -1.325524 -24.056632 -0.984390 -2.119338 -1.770665 -1.447103 -10.618954 2.156420 -30.362998 -4.735058
S8_3 -2.024020 -29.094027 -6.146880 -2.101090 -0.732322 -2.773949 -12.642857 -0.009749 -28.486835 -4.783863
S8_4 2.541671 -13.599049 -2.688125 -2.329332 -0.694555 -2.820627 -8.498677 3.321018 -31.741916 -2.104281
Plot Result

Related

SeaBorn - choosing hue based on column

I have a DataFrame with a predicted column and an actual column. I would like to plot an overlaid histogram that has the same look as when using the hue property, but I couldn't find a way of doing so without reconstructing my original data.
Here is an example of what I'm trying to do:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'A': np.random.uniform(low=0.0, high=9.0, size=(150,)),
                   'P': np.random.uniform(low=0.0, high=9.0, size=(150,))})
actual = df.A.to_frame()
predicted = df.P.to_frame()
print(df.head())
actual.columns = ['value']
actual['t'] = 'A'
predicted.columns = ['value']
predicted['t'] = 'P'
tmp = pd.concat([actual, predicted])
print(tmp.head())
sns.histplot(data=tmp, x='value', hue="t")
Output:
Original DF
A P
0 2.546046 2.503833
1 4.797077 2.306720
2 1.358222 4.839675
3 7.063206 8.828486
4 3.010978 7.406337
Manipulated DF
value t
0 2.546046 A
1 4.797077 A
2 1.358222 A
3 7.063206 A
4 3.010978 A
<matplotlib.axes._subplots.AxesSubplot at 0x7fd657112fd0>
Q: How can I get similar results without manipulating my original DF?
Unless I am misunderstanding what you're asking, you shouldn't need to manipulate the DataFrame at all.
From the docs:
If neither x nor y is assigned, the dataset is treated as wide-form, and a histogram is drawn for each numeric column
df = pd.DataFrame(
    {
        "A": np.random.uniform(low=0.0, high=9.0, size=(150,)),
        "P": np.random.uniform(low=0.0, high=9.0, size=(150,)),
    }
)
ax = sns.histplot(df)

altair: Text over a Heatmap in a script

I tried to follow the Altair example of "Text over a Heatmap" but came across some problems.
My dataset consists of two indexes (N, Z) and a value column color.
I would like to set the origin at the bottom-left side.
How to display the heatmap with text in a script?
Is it possible to put labels (N, Z) in each pixel?
The relevant part of the code is attached below.
def chart_altair(self):
    import altair as alt
    data = self.df.dropna().reset_index(name='color')
    # Configure common options
    base = alt.Chart(data).encode(
        alt.X('N:O'),
        alt.Y('Z:O'),
    )
    # Configure heatmap
    heatmap = base.mark_rect().encode(
        color=alt.Color(
            'color:Q',
            scale=alt.Scale(scheme='viridis'),
            legend=alt.Legend()
        )
    )
    text = base.mark_text(baseline='middle').encode(
        text='color:Q'
    )
I pasted below a few rows of my dataset, which consists of two indexes Z, N, and a value column color (it actually represents an atomic mass table). The heatmap should be similar to the chart of nuclides, with the neutron number increasing to the right along the x-axis and the proton number increasing upwards along the y-axis. However, the proton number currently increases in the opposite direction (down the y-axis).
Z N color
0 1 8.07
1 0 7.29
1 1 13.14
1 2 14.95
2 1 14.93
1 3 24.62
2 2 2.42
3 3 14.09
4 2 18.38
2 5 26.07
3 4 14.91
4 3 15.77
5 2 27.68
Here is an example where I concatenate the two labels using transform_calculate. You could also do this by creating the label column in pandas instead.
import altair as alt
import numpy as np
import pandas as pd

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})

heatmap = alt.Chart(source).mark_rect().encode(
    x='x:O',
    y='y:O',
    color='z:Q'
)

heatmap + heatmap.mark_text().transform_calculate(label='"" + datum.x + datum.y').encode(
    text='label:N',
    color=alt.value('black'))
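The same label could instead be built in pandas, and the y-axis can be flipped so the origin sits at the bottom-left by sorting the ordinal y encoding in descending order; for the question's data the same idea would apply to the Z encoding. A sketch, assuming the source frame from the example above (the comma separator in the label is my own addition):
source['label'] = source['x'].astype(str) + ',' + source['y'].astype(str)

heatmap = alt.Chart(source).mark_rect().encode(
    x='x:O',
    y=alt.Y('y:O', sort='descending'),  # largest value at the top, so y increases upwards
    color='z:Q'
)
heatmap + heatmap.mark_text().encode(text='label:N', color=alt.value('black'))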

Convert/reshape a dataset from 'wide to long' format and convert the time column into time format for time-series analysis

I have a dataset with 7 columns: level, Time_30, Time_60, Time_90, Time_120, Time_150 and Time_180.
My main goal is to do time-series anomaly detection using the cell count at 30-minute intervals.
I want to do the following data preparation steps:
(I) melt/reshape the df into the appropriate time-series format (from wide to long): consolidate the columns Time_30, Time_60, ..., Time_180 into one column time with 6 levels (30, 60, ..., 180)
(II) since the result from (I) comes out as 30, 60, ..., 180, set the time column to an appropriate time or date format for time-series work (something like '%H:%M:%S')
(III) use a for-loop to plot the time-series for each level (A, B, ..., F) for comparison purposes.
(IV) Anomaly detection
# generate/import dataset
import pandas as pd
df = pd.DataFrame({'level': [A, B, C, D, E, F],
                   'Time_30': [1993.05, 1999.45, 2001.11, 2007.39, 2219.77],
                   'Time_60': [2123.15, 2299.59, 2339.19, 2443.37, 2553.15],
                   'Time_90': [2323.56, 2495.99, 2499.13, 2548.71, 2656.0],
                   'Time_120': [2355.52, 2491.19, 2519.92, 2611.81, 2753.11],
                   'Time_150': [2425.31, 2599.51, 2539.9, 2713.77, 2893.58],
                   'Time_180': [2443.35, 2609.92, 2632.49, 2774.03, 2901.25]})
Desired outcome
# first series
level, time, count
A, 30, 1993.05
B, 60, 2123.15
C, 90, 2323.56
D, 120, 2355.52
E, 150, 2425.31
F, 180, 2443.35
# 2nd series
level,time,count
A,30,1999.45
B,60,2299.59
C,90,2495.99
D,120,2491.19
E,150,2599.51
F,180,2609.92
.
.
.
.
# up until the last series
See below for my attempt
# (I)
df1 = pd.melt(df, id_vars=['level'], var_name='time', value_name='count')
# (II)
df1['time'] = pd.to_datetime(df1['time'],format= '%H:%M:%S' ).dt.time
OR
df1['time'] = pd.to_timedelta(df1['time'], unit='m')
# (III)
plt.figure(figsize=(10,5))
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.3)
# Perform STL Decomp
stl = STL(df1)
result = stl.fit()
seasonal, trend, resid = result.seasonal, result.trend, result.resid
plt.figure(figsize=(8,6))
plt.subplot(4,1,1)
plt.plot(df1)
plt.title('Original Series', fontsize=16)
plt.subplot(4,1,2)
plt.plot(trend)
plt.title('Trend', fontsize=16)
plt.subplot(4,1,3)
plt.plot(seasonal)
plt.title('Seasonal', fontsize=16)
plt.subplot(4,1,4)
plt.plot(resid)
plt.title('Residual', fontsize=16)
plt.tight_layout()
estimated = trend + seasonal
plt.figure(figsize=(12,4))
plt.plot(df1)
plt.plot(estimated)
plt.figure(figsize=(10,4))
plt.plot(resid)
# Anomaly detection
resid_mu = resid.mean()
resid_dev = resid.std()
lower = resid_mu - 3*resid_dev
upper = resid_mu + 3*resid_dev
anomalies = df1[(resid < lower) | (resid > upper)] # returns the datapoints with the anomalies
anomalies
plt.plot(df1)
for timex in range(30,180):
    plt.axvline(datetime(timex,1,1), color='k', linestyle='--', alpha=0.6)
plt.scatter(anomalies.index, anomalies.count, color='r', marker='D')
Please note: if you can only attempt I and/or II that would be much appreciated.
I made a few small edits to your sample dataframe based on my comment above:
import pandas as pd
df = pd.DataFrame({'level': ['A', 'B', 'C', 'D', 'E'],
                   'Time_30': [1993.05, 1999.45, 2001.11, 2007.39, 2219.77],
                   'Time_60': [2123.15, 2299.59, 2339.19, 2443.37, 2553.15],
                   'Time_90': [2323.56, 2495.99, 2499.13, 2548.71, 2656.0],
                   'Time_120': [2355.52, 2491.19, 2519.92, 2611.81, 2753.11],
                   'Time_150': [2425.31, 2599.51, 2539.9, 2713.77, 2893.58],
                   'Time_180': [2443.35, 2609.92, 2632.49, 2774.03, 2901.25]})
First, manipulate the Time_* column names to be integer values:
timecols = [int(c.replace("Time_","")) for c in df.columns if c != 'level']
df.columns = ['level'] + timecols
After that you can pd.melt() like you were thinking, yielding a dataframe with all those "series" you mentioned above concatenated together:
df1 = (df.melt(id_vars=['level'], value_vars=timecols, var_name='time', value_name='count')
         .sort_values(['level', 'time'])
         .reset_index(drop=True))
print(df1.head(10))
level time count
0 A 30 1993.05
1 A 60 2123.15
2 A 90 2323.56
3 A 120 2355.52
4 A 150 2425.31
5 A 180 2443.35
6 B 30 1999.45
7 B 60 2299.59
8 B 90 2495.99
9 B 120 2491.19
If you want to loop over the levels, select them with:
for level in df1['level'].unique():
    tmp = df1[df1['level']==level]
or
for level in df1['level'].unique():
    tmp = df1[df1['level']==level].copy()
...if you intend to modify/add data to the tmp dataframe.
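For step (III), a minimal sketch of extending that loop to draw one line per level on a single figure (assuming matplotlib is available and df1 is the melted frame from above, with time still numeric):
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
for level in df1['level'].unique():
    tmp = df1[df1['level'] == level]
    ax.plot(tmp['time'], tmp['count'], marker='o', label=level)
ax.set_xlabel('time (minutes)')
ax.set_ylabel('count')
ax.legend(title='level')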
As for making timestamps, you could do:
df1['time'] = pd.to_timedelta(df1['time'], unit='min')
...like you were attempting, but it depends on how you're using it. If you just want strings that look like "00:30:00", etc, you can try something like:
df1['time'] = pd.to_timedelta(df1['time'], unit='min').apply(lambda x:str(x)[-8:])
Anyway, hope that gets you on track for what you need.

How to draw proper chart of distributional tree?

I am using Python with matplotlib and need to visualize the distribution percentages of the sub-groups of a data set.
Imagine this tree:
Data --- group1 (40%)
     --- group2 (25%)
     --- group3 (35%)

group1 --- A (25%)
       --- B (25%)
       --- c (50%)
It can go on: each group can have several sub-groups, and the same for each sub-group.
How can I plot a proper chart for this info?
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]: group subgroup
1 A 17
C 16
B 5
2 A 23
C 10
B 7
3 C 8
A 7
B 7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import matplotlib.cm

def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        plt.axis('off')
    counts = [data.shape[0]]
    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
    cmap = matplotlib.cm.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        if ordering[:i]:
            counts['c_' + o] = counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]] / data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o] / counts['c_' + ordering[i-1]]
    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    comb = 1  # keeps track of the number of possible combinations at each level
    for bar, col in enumerate(ordering):
        labels = sorted(data[col].unique()) * comb
        comb *= len(data[col].unique())
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
    ax.legend(colors)
    return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
Have you tried a stacked bar graph?
https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
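For reference, a minimal sketch of that idea using pandas' built-in plotting on the group/subgroup counts from the sample data above (not the linked matplotlib recipe):
import matplotlib.pyplot as plt

# one bar per group, one stacked segment per subgroup
counts = data.groupby(['group'])['subgroup'].value_counts().unstack(fill_value=0)
counts.plot(kind='bar', stacked=True)
plt.ylabel('count')
plt.show()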

Python Pandas plot multiindex specify x and y

Below is an example DataFrame.
joaquin manolo
xx 0 0.000000e+00 44.000000
1 1.570796e+00 52.250000
2 3.141593e+00 60.500000
3 4.712389e+00 68.750000
4 6.283185e+00 77.000000
yy 0 0.000000e+00 37.841896
1 2.078796e+00 39.560399
2 5.292179e-17 41.026434
3 -8.983291e-02 42.304767
4 -4.573916e-18 43.438054
As you can see, the row index has two levels, ['xx', 'yy'] and [0, 1, 2, 3, 4]. I want to call DataFrame.plot() in such a way that it will produce two subplots, one for joaquin and one for manolo, and where I can specify to use data.loc["xx", :] for the domain data and to use data.loc["yy", :] for the ordinate data. In addition, I want the option to supply the subplots on which the plots should be drawn, in a list (or array) of matplotlib.axes._subplots.AxesSubplot instances, such as those that can be returned by the DataFrame.hist() method. How can this be done?
Generating the data above
Just in case you're wondering, below is the code I used to generate the data. If there is an easier way to generate this data, I'd be very interested to know as a side-note.
import numpy
import pandas

joaquin_dict = {}
xx_joaquin = numpy.linspace(0, 2*numpy.pi, 5)
yy_joaquin = 10 * numpy.sin(xx_joaquin) * numpy.exp(-xx_joaquin)
for i in range(len(xx_joaquin)):
    joaquin_dict[("xx", i)] = xx_joaquin[i]
    joaquin_dict[("yy", i)] = yy_joaquin[i]

manolo_dict = {}
xx_manolo = numpy.linspace(44, 77, 5)
yy_manolo = 10 * numpy.log(xx_manolo)
for i in range(len(xx_manolo)):
    manolo_dict[("xx", i)] = xx_manolo[i]
    manolo_dict[("yy", i)] = yy_manolo[i]

data_dict = {"joaquin": joaquin_dict, "manolo": manolo_dict}
data = pandas.DataFrame.from_dict(data_dict)
Just use a for loop:
import matplotlib.pyplot as pl

fig, axes = pl.subplots(1, 2)
for ax, col in zip(axes, data.columns):
    data[col].unstack(0).plot(x="xx", y="yy", ax=ax, title=col)
output:

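As a side note on the question about an easier way to generate the data: one option (a sketch, not part of the original answer) is to build the two-level index up front with pd.MultiIndex.from_product:
import numpy as np
import pandas as pd

xx_joaquin = np.linspace(0, 2 * np.pi, 5)
xx_manolo = np.linspace(44, 77, 5)

# rows are ("xx", 0..4) followed by ("yy", 0..4), matching the original layout
data = pd.DataFrame(
    {"joaquin": np.concatenate([xx_joaquin, 10 * np.sin(xx_joaquin) * np.exp(-xx_joaquin)]),
     "manolo": np.concatenate([xx_manolo, 10 * np.log(xx_manolo)])},
    index=pd.MultiIndex.from_product([["xx", "yy"], range(5)]),
)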