How to make a horizontal stacked histplot based on counts? - python

I have a df which represents three states (S1, S2, S3) at 3 timepoints (1hr, 2hr and 3hr). I would like to show a stacked bar plot of the states, but the stacks are discontinuous, or at least not cumulative. How can I fix this in Seaborn? It is important that time is on the y-axis and the state counts are on the x-axis.
Below is some code.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = [[3, 2, 18], [4, 13, 6], [1, 2, 20]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3'])
df = df.reset_index().rename(columns={'index': 'Time'})
melt = pd.melt(df, id_vars='Time')
plt.figure()
sns.histplot(data=melt, x='value', y='Time', bins=3, hue='variable', multiple="stack")
EDIT:
This is somewhat what I am looking for, I hope this gives you an idea. Please ignore the difference in the scales between boxes...

If I understand correctly, I think you want to use value as a weight:
sns.histplot(
    data=melt, y='Time', hue='variable', weights='value',
    multiple='stack', shrink=0.8, discrete=True,
)
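One way to sanity-check the weighted plot, sketched here with plain pandas: with weights='value', each row adds its value (rather than 1) to the bar for its Time, so each stacked bar should end at the row total of the original table. The names totals and segments are only illustrative; the rest follows the question's code.

```python
import pandas as pd

data = [[3, 2, 18], [4, 13, 6], [1, 2, 20]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3'])
df = df.reset_index().rename(columns={'index': 'Time'})
melt = pd.melt(df, id_vars='Time')

# Sum of the weights per time point: the expected full length of each stacked bar
totals = melt.groupby('Time')['value'].sum()

# Per-state segment lengths within each bar (one row per time point)
segments = melt.pivot(index='Time', columns='variable', values='value')

print(totals)
print(segments)
```

If the bars in the plot do not end at these totals, the weights are probably not being applied.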

This is pretty tough in seaborn as it doesn't natively support stacked bars. You can use either the builtin plot from pandas, or try plotly express.
import pandas as pd
data = [[3, 2, 18], [4, 13, 6], [1, 2, 20]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3'])
df = df.reset_index().rename(columns={'index': 'Time'})
# so your y starts at 1
df['Time'] += 1
melt = pd.melt(df, id_vars='Time')
# so y isn't treated as continuous
melt['Time'] = melt['Time'].astype(str)
Pandas can do it, but getting the labels in there is a bit of a pain. Check around to figure out how to do it.
df.set_index('Time').plot(kind='barh', stacked=True)
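For the labels, one option (a sketch, assuming matplotlib 3.4 or newer, where Axes.bar_label is available) is to label each bar container that pandas creates. The axis label text is a made-up choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [[3, 2, 18], [4, 13, 6], [1, 2, 20]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3'])
df.index = pd.Index(range(1, len(df) + 1), name='Time')  # time points 1, 2, 3

ax = df.plot(kind='barh', stacked=True)

# pandas draws one bar container per column (state); label every segment
# with its value, centered inside the segment
labels = [ax.bar_label(container, label_type='center')
          for container in ax.containers]
ax.set_xlabel('count')  # hypothetical axis label
```

Call plt.show() afterwards to display the figure.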
Plotly makes it easier:
import plotly.express as px
px.bar(melt, x='value', y='Time', color='variable', orientation='h', text='value')

Related

Connecting pairs of dots on the scatterplot with jitter with lines - Python

I have two groups of points, but they also overlap, so I need to add jitter if I plot them with a scatterplot. I also want to connect matching points from each group (they all have a pair).
There are many questions that suggest:
data = [['abc', 'pre', 10], ['abc', 'post', 5], ['bce', 'pre', 10], ['bce', 'post', 5], ['cef', 'pre', 8], ['cef', 'post', 5]]
df = pd.DataFrame(data, columns=['ID', 'time', 'value'])
grouped = df.groupby('ID')
for name, group in grouped:
    sns.scatterplot(x='time', y='value', data=group, color='#3C74BC')
    sns.lineplot(x='time', y='value', data=group, color='#3C74BC')
plt.show()
It works ok, but it doesn't have jitter. If I add jitter via sns.stripplot(), the lines no longer connect the dots and start from arbitrary places.
The approach below makes the following changes:
Convert the time to numeric (0 for 'pre' and 1 for 'post') via (df['time'] != 'pre').astype(float)
Add a random jitter to these values: + np.random.uniform(-0.1, 0.1, len(df)). Depending on how many values you have, you might change 0.1 to a larger value.
Use sns.lineplot with a marker to avoid the need for scatterplot.
Use hue='ID' to draw everything in one go.
As hue ignores color=, use palette= with as many colors as there are distinct hue values.
Suppress the legend, as all hue values have the same color.
Assign tick labels to 0 and 1.
Set xlim so the tick labels are at equal distances from the respective borders.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
data = [['abc', 'pre', 10], ['abc', 'post', 5], ['bce', 'pre', 10], ['bce', 'post', 5], ['cef', 'pre', 8], ['cef', 'post', 5]]
df = pd.DataFrame(data, columns=['ID', 'time', 'value'])
df['time'] = (df['time'] != 'pre').astype(float) + np.random.uniform(-0.1, 0.1, len(df))
ax = sns.lineplot(x='time', y='value', data=df, hue='ID', marker='o',
                  palette=['#3C74BC'] * len(df['ID'].unique()), legend=False)
ax.set_xticks([0, 1], ['pre', 'post'])
ax.set_xlim(-0.2, 1.2)
plt.show()
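Because the jitter is random, the plot changes on every run. A small sketch of a reproducible variant of the jitter step, assuming a seeded NumPy Generator is acceptable; the seed value and the column name time_jittered are arbitrary choices, not part of the answer above:

```python
import numpy as np
import pandas as pd

data = [['abc', 'pre', 10], ['abc', 'post', 5], ['bce', 'pre', 10],
        ['bce', 'post', 5], ['cef', 'pre', 8], ['cef', 'post', 5]]
df = pd.DataFrame(data, columns=['ID', 'time', 'value'])

rng = np.random.default_rng(seed=0)  # arbitrary seed, fixed for repeatability

# 0.0 for 'pre', 1.0 for 'post'
base = (df['time'] != 'pre').astype(float)

# Store the jittered position in a new column so the original labels survive
df['time_jittered'] = base + rng.uniform(-0.1, 0.1, len(df))
```

Passing x='time_jittered' to sns.lineplot then gives the same layout on every run, while the 'time' column stays available for grouping or labeling.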

Pandas Plot floating bar chart

I am trying to create a bar chart where the upper and lower bound of each bar could be above or below zero. Hence the boxes should "float" depending on the data. I'm also trying to use pandas.plot function as it makes my life way easier in the real application.
The solution I've devised is a horrible kludge and only partially works. Basically I'm running two different bar charts that overlap, with one of the bars being white to "hide" the main bar if necessary. I'm using a mask to mark which bars should be which color. As you can see, this works OK in the "London" and "Paris" example below, but in the "Tokyo" it isn't working because the green bar is "in front" of the white bar.
I could manually fix this a few ways that I can think of, but it would make an already kludgy solution even worse. I'm sure there's a better way that I'm just not smart enough to think of!
Here's the plot, and full code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_data = {'Category': ['London', 'Paris', 'New York', 'Tokyo'],
           'Upper': [10, 5, 0, -5],
           'Lower': [5, -5, -10, -10]}
df = pd.DataFrame(data = df_data)
#Color corrector
u_mask = df['Upper'] < 0
d_mask = df['Lower'] < 0
n = len(df)
uca = ['darkgreen' for i in range(n)]
uca = np.array(uca)
uc = uca.copy()
uc[u_mask] = 'white'
dca = ['white' for i in range(n)]
dca = np.array(dca, dtype=uca.dtype)
dc = dca.copy()
dc[d_mask] = 'darkgreen'
(df.plot(kind='bar', y='Upper', x='Category',
         color=uc, legend=False))
ax = plt.gca()
(df.plot(kind='bar', y='Lower', x='Category',
         color=dc, legend=False, ax=ax))
plt.axhline(0, color='black')
x_axis = ax.xaxis
x_axis.label.set_visible(False)
plt.subplots_adjust(left=0.1,right=0.90,bottom=0.2,top=0.90)
plt.show()
To create the plot via pandas, you could add an extra column with the height and then call df.plot(..., y='Height', bottom=df['Lower']):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_data = {'Category': ['London', 'Paris', 'New York', 'Tokyo'],
           'Upper': [10, 5, 0, -5],
           'Lower': [5, -5, -10, -10]}
df = pd.DataFrame(data=df_data)
df['Height'] = df['Upper'] - df['Lower']
ax = df.plot(kind='bar', y='Height', x='Category', bottom=df['Lower'],
             color='darkgreen', legend=False)
ax.axhline(0, color='black')
plt.tight_layout()
plt.show()
PS: Note that pandas barplot forces the lower ylim to be "sticky". This is a desired behavior when all values are positive and the bars stand firmly on y=0. However, this behavior is distracting when both positive and negative values are involved.
To remove the stickiness:
ax.use_sticky_edges = False # df.plot() makes the lower ylim sticky
ax.autoscale(enable=True, axis='y')
plt.bar has a bottom parameter. You just need to calculate the heights. Here is a very simple example:
import matplotlib.pyplot as plt
upper = [10, 5, 0, -5]
lower = [5, -5, -10, -10]
# bar height is the distance between the upper and lower bound
height = [u - l for u, l in zip(upper, lower)]
plt.bar(range(len(lower)), height, bottom=lower)
plt.show()

Why my Seaborn line plot x-axis shifts one unit?

I am trying to compare two small, summarized pandas dataframes with a line plot from the Seaborn library, but one of the lines is shifted one unit along the x-axis. What's wrong with it?
The dataframes are:
Here is my code:
df = pd.read_csv('/home/gazelle/Documents/m3inference/m3_result.csv',index_col='id')
df = df.drop("Unnamed: 0",axis=1)
for i, v in df.iterrows():
    if str(i) not in result:
        df.drop(i, inplace=True)
    else:
        df.loc[i, 'estimated'] = result[str(i)]
m3 = pd.read_csv('plot_result.csv').set_index('id')
ids = list(m3.index.values)
m3 = m3['age'].value_counts().to_frame().reset_index().sort_values('index')
m3 = m3.rename(columns={m3.columns[0]:'bucket', m3.columns[1]:'age'})
df_estimated = df[df.index.isin(ids)]['estimated'].value_counts().to_frame().reset_index().sort_values('index')
df_estimated = df_estimated.rename(columns={df_estimated.columns[0]:'bucket', df_estimated.columns[1]:'age'})
sns.lineplot(x='bucket', y='age', data=m3)
sns.lineplot(x='bucket', y='age', data=df_estimated)
And the result is:
As has been pointed out in the comments, the data and code you provide appear to produce the correct result:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set()
m3 = pd.DataFrame({"index": [2, 3, 4, 1], "age": [123, 116, 66, 33]})
df_estimated = pd.DataFrame({"index": [3, 2, 4, 1], "estimated": [200, 100, 37, 1]})
sns.lineplot(x="index", y="age", data=m3)
sns.lineplot(x="index", y="estimated", data=df_estimated)
plt.show()
This gives a plot which is different from the one you posted above:
From your screenshots it looks like you are working in a Jupyter notebook. You are probably suffering from the issue that at the time you plot, the dataframe m3 no longer has the values you printed above, but has been modified.
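One way to rule out stale notebook state, sketched under the assumption that the counts are built from a Series of bucket labels: build each summary table with a function that returns a new DataFrame instead of modifying m3 or df in place. The helper name bucket_counts and the sample values are hypothetical. The sketch also avoids relying on the 'index' column name, which value_counts().reset_index() only produces in older pandas versions.

```python
import pandas as pd

def bucket_counts(values: pd.Series) -> pd.DataFrame:
    """Return counts per bucket, sorted by bucket, without modifying the input."""
    return (
        values.value_counts()
        .sort_index()              # order the buckets before plotting
        .rename_axis('bucket')     # name the index explicitly
        .reset_index(name='age')   # name the count column explicitly
    )

# Made-up sample data standing in for m3['age'] or df['estimated']
ages = pd.Series([2, 3, 2, 1, 4, 3, 2], name='age')
counts = bucket_counts(ages)
print(counts)
```

Re-running a cell that calls bucket_counts always starts from the same input, so both lines in the plot are built from the values you last printed.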

How to iteratively plot different data as boxplots in seaborn (without them overlapping)?

Is there a way to iteratively plot data using seaborn's sns.boxplot() without having the boxplots overlap? (without combining datasets into a single pd.DataFrame())
Background
When comparing datasets that differ (e.g. in size or shape), it is often useful to bin them by a shared variable (via pd.cut() and df.groupby(), as shown below).
Previously, I have iteratively plotted these "binned" data as boxplots on the same axis by looping over separate DataFrames using matplotlib's ax.boxplot() (providing y-axis location values via the positions argument to ensure the boxplots don't overlap).
Example
Below is a simplified example that shows the overlapping plots when using sns.boxplot():
import seaborn as sns
import random
import pandas as pd
import matplotlib.pyplot as plt
# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
variable_to_bin_by = 'tip'
binned_variable = 'total_bill'
df = tips[[binned_variable, variable_to_bin_by] ]
# Create a second dataframe with different values and shape
df2 = pd.concat( [ df.copy() ] *5 )
# Use pseudo-random numbers to convey that df2 is different to df
scale = [ random.uniform(0,2) for i in range(len(df2[binned_variable])) ]
df2[ binned_variable ] = df2[binned_variable].values * scale * 5
dfs = [ df, df2 ]
# Group the data by a list of bins
bins = [0, 1, 2, 3, 4]
for n, df in enumerate( dfs ):
    gdf = df.groupby( pd.cut(df[variable_to_bin_by].values, bins ) )
    data = [ i[1][binned_variable].values for i in gdf]
    dfs[n] = pd.DataFrame( data, index = bins[:-1])
# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()
# Loop the DataFrames and plot
colors = ['red', 'black']
for n in range(2):
    ax = sns.boxplot( data=dfs[n].T, ax=ax, width=0.2, orient='h',
                      color=colors[n] )
plt.ylabel( variable_to_bin_by )
plt.xlabel( binned_variable )
plt.show()
More detail
I realise the simplified example above could be resolved by combining the DataFrames and providing the hue argument to sns.boxplot().
Updating the index of the DataFrames provided also doesn't help, as the y values from the last DataFrame provided are then used.
Providing the kwargs argument (e.g. kwargs={'positions': dfs[n].T.index}) won't work as this raises a TypeError:
TypeError: boxplot() got multiple values for keyword argument 'positions'
Setting sns.boxplot()'s dodge argument to True doesn't solve this either.
Funnily enough, the "hack" that I proposed earlier today in this answer could be applied here.
It complicates the code a bit because seaborn expects a long-form dataframe instead of a wide-form to use hue-nesting.
# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
df = tips[['total_bill', 'tip'] ]
# Group the data by tip bins
bins = [0, 1, 2, 3, 4]
gdf = df.groupby( pd.cut(df['tip'].values, bins ) )
data = [ i[1]['total_bill'].values for i in gdf]
df = pd.DataFrame( data , index = bins[:-1]).T
dfm = df.melt() # create a long-form dataframe
dfm.loc[:,'dummy'] = 'dummy'
# Create a second, slightly different, DataFrame
dfm2 = dfm.copy()
dfm2.value = dfm.value*2
dfs = [ dfm, dfm2 ]
colors = ['red', 'black']
hue_orders = [['dummy','other'], ['other','dummy']]
# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()
# Loop the DataFrames and plot
for n in range(2):
    ax = sns.boxplot( data=dfs[n], x='value', y='variable', hue='dummy', hue_order=hue_orders[n], ax=ax, width=0.2, orient='h',
                      color=colors[n] )
ax.legend_.remove()
plt.show()
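For comparison, the plain-matplotlib route mentioned in the question (passing explicit positions to ax.boxplot()) can be sketched as follows. This is a minimal sketch with made-up random data, not the tips dataset; the names groups_a, groups_b and offset are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
bins = [0, 1, 2, 3]  # lower bin edges, reused as y positions

# Two made-up datasets of different size: one array of values per bin
groups_a = [rng.normal(10 + b, 2, size=30) for b in bins]
groups_b = [rng.normal(20 + b, 4, size=150) for b in bins]

fig, ax = plt.subplots()
offset = 0.15  # shift each dataset away from the bin position
drawn = []
for groups, shift, color in [(groups_a, -offset, 'red'),
                             (groups_b, offset, 'black')]:
    drawn.append(ax.boxplot(
        groups,
        positions=[b + shift for b in bins],  # side by side, no overlap
        widths=0.2,
        vert=False,  # horizontal boxes; newer matplotlib also offers orientation=
        boxprops={'color': color},
        medianprops={'color': color},
    ))

# One tick per bin, centered between the two shifted boxes
ax.set_yticks(bins)
ax.set_yticklabels([f'({b}, {b + 1}]' for b in bins])
```

Each call draws one dataset, so the DataFrames never need to be combined. The trade-off is that colors and tick labels have to be managed by hand.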

Python pandas summary table plot

Really can't get to grips with how to plot a summary table of a pandas df. I'm sure this is not a case for a pivot table, or maybe a transposed method of displaying the data. Best I could find was: Plot table and display Pandas Dataframe
My code attempts are just not getting there:
dc = pd.DataFrame({'A' : [1, 2, 3, 4],'B' : [4, 3, 2, 1],'C' : [4, 3, 2, 1]})
data = dc['A'],dc['B'],dc['C']
ax = plt.subplot(111, frame_on=False)
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
cols=["A", "B", "C"]
row_labels=[0]
the_table = plt.table(cellText=data,colWidths = [0.5]*len(cols),rowLabels=row_labels, colLabels=cols,cellLoc = 'center', rowLoc = 'center')
plt.show()
All I would like to do is produce a table plot, with A B C in the first column, and the total and mean in the columns next to them (see below). Any help or guidance would be great...feeling really stupid... (excuse the code example, it doesn't have the total and mean included yet...)
    Total  Mean
A   x      x
B   x      x
C   x      x
import pandas as pd
import matplotlib.pyplot as plt
dc = pd.DataFrame({'A' : [1, 2, 3, 4],'B' : [4, 3, 2, 1],'C' : [3, 4, 2, 2]})
plt.plot(dc)
plt.legend(dc.columns)
dcsummary = pd.DataFrame([dc.mean(), dc.sum()],index=['Mean','Total'])
plt.table(cellText=dcsummary.values, colWidths=[0.25]*len(dc.columns),
          rowLabels=dcsummary.index,
          colLabels=dcsummary.columns,
          cellLoc='center', rowLoc='center',
          loc='top')
fig = plt.gcf()
plt.show()
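The table above has Mean and Total as rows. A sketch of the layout the question asks for (A, B, C as rows, Total and Mean as columns), assuming the same dc as in the answer above; the name summary is only illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

dc = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1], 'C': [3, 4, 2, 2]})

# Aggregate per column, then transpose so A, B, C become the rows
summary = (dc.agg(['sum', 'mean']).T
             .rename(columns={'sum': 'Total', 'mean': 'Mean'}))

fig, ax = plt.subplots()
ax.axis('off')  # show only the table, no plot frame
table = ax.table(
    cellText=[[f'{v:g}' for v in row] for row in summary.values],
    rowLabels=list(summary.index),
    colLabels=list(summary.columns),
    cellLoc='center', rowLoc='center', loc='center',
)
```

Call plt.show() to display it, or drop ax.axis('off') and use loc='top' to keep the table above a plot as in the answer above.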
Does the DataFrame.describe() function help you?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
Sorry, not enough points for comments.
