Plot grid of histograms based on group variable using plotly - python

I have a data frame that contains multiple variables where each variable is logically connected to a factor level of an additional group variable. I would like to plot a histogram of each variable in such a way that it is possible to show a grid of multiple histograms 'group-wise'.
Here's an example data frame df_melt (the variables var_1,var_2,var_3,var_4 are logically connected to the factor level 'foo', the variables var_5,var_6,var_7 belong to factor level 'bar'):
import numpy as np
import pandas as pd
# simulate data and create plot-ready dataframe
np.random.seed(42)
var_values = np.random.randint(low=1,high=100,size=(100,7))
var_names = ['var_1','var_2','var_3','var_4','var_5','var_6','var_7']
group_names = ['foo','foo','foo','foo','bar','bar','bar']
df = pd.DataFrame(var_values,columns=var_names)
multi_index = pd.MultiIndex.from_arrays([df.columns,group_names],names=['variable','group'])
df.columns = multi_index
df_melt = pd.melt(df)
The output should look like this:
These stackoverflow posts might help to provide an answer, but I was not able to come up with a solution on my own:
Plotting a grouped pandas data in plotly
Plotly equivalent for pd.DataFrame.hist

The best I came up with is the following. Sadly, it produces one figure per group rather than the single, nicely laid out grid you wanted, but I hope you can use it as a starting point.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# simulate data and create plot-ready dataframe
np.random.seed(42)
var_values = np.random.randint(low=1,high=100,size=(100,7))
var_names = ['var_1','var_2','var_3','var_4','var_5','var_6','var_7']
group_names = ['foo','foo','foo','foo','bar','bar','bar']
df = pd.DataFrame(var_values,columns=var_names)
multi_index = pd.MultiIndex.from_arrays([df.columns,group_names],names=['variable','group'])
df.columns = multi_index
df_melt = pd.melt(df)
uniq_cols = set(group_names)
# build one figure per group, with one histogram column per variable
for col in uniq_cols:
    rows = df_melt[df_melt['group']==col]['variable'].unique()
    # print(list(rows))
    num_vars = len(rows)
    fig = make_subplots(rows=1, cols=num_vars, column_titles=list(rows))
    for i, row in enumerate(rows):
        fig.add_trace(go.Histogram(x=df_melt[(df_melt['group']==col) & (df_melt['variable']==row)]['value']),
                      row=1, col=i+1)
    fig.show()

Related

Plotly Distplot subplots

I am trying to write a for loop that builds distplot subplots.
I have a dataframe with many columns of different lengths (not counting the NaN values).
fig = make_subplots(
    rows=len(assets), cols=1,
    y_title = 'Hourly Price Distribution')
i=1
for col in df_all.columns:
    fig = ff.create_distplot([[df_all[[col]].dropna()]], col)
    fig.append()
    i+=1
fig.show()
I am trying to run a for loop for subplots for distplots and get the following error:
PlotlyError: Oops! Your data lists or ndarrays should be the same length.
UPDATE:
This is an example below:
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
fig = ff.create_distplot([df[c].dropna() for c in df.columns],
                         df.columns,show_hist=False,show_rug=False)
fig.show()
I would like to plot each distribution in a different subplot.
Thank you.
Update: Distribution plots
Calculating the correct values is probably both quicker and more elegant using numpy. But I often build parts of my graphs using one plotly approach (figure factory, plotly express) and then combine them with other elements of the plotly library (plotly.graph_objects) to get what I want. The complete snippet below shows how to do just that in order to build a go-based subplot with elements from ff.create_distplot. I'd be happy to give further explanations if the following suggestion suits your needs.
Plot
Complete code
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
# assign through .loc to avoid chained assignment
df.loc[0, '2012'] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
cols = dfm.year.unique()
nrows = len(cols)
fig = make_subplots(rows=nrows, cols=1)
for r, col in enumerate(cols, 1):
    dfs = dfm[dfm['year']==col]
    # create_distplot returns histogram, KDE curve and rug traces; data[1] is the KDE curve
    fx1 = ff.create_distplot([dfs['value'].values], ['distplot'],curve_type='kde')
    fig.add_trace(go.Scatter(
        x= fx1.data[1]['x'],
        y =fx1.data[1]['y'],
        ), row = r, col = 1)
fig.show()
First suggestion
You should:
1. Restructure your data with pd.melt(df, id_vars=['index'], value_vars=df.columns[1:]),
2. and then use the resulting column 'variable' to build subplots for each year through the facet_row argument to get this:
In the complete snippet below you'll see that I've changed 'variable' to 'year' in order to make the plot more intuitive. There's one particularly convenient side-effect with this approach, namely that running dfm.dropna() will remove the na value for 2012 only. If you were to do the same thing on your original dataframe, the corresponding value in the same row for 2013 would also be removed.
import numpy as np
import pandas as pd
import plotly.express as px
df = pd.DataFrame({'2012': np.random.randn(20),
                   '2013': np.random.randn(20)+1})
df.loc[0, '2012'] = np.nan
df = df.reset_index()
dfm = pd.melt(df, id_vars=['index'], value_vars=df.columns[1:])
dfm = dfm.dropna()
dfm.rename(columns={'variable':'year'}, inplace = True)
fig = px.histogram(dfm, x="value",
                   facet_row = 'year')
fig.show()
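The side effect of dropna() described above can be checked with pandas alone. This is only a small illustration; the names wide, rows_wide, long_df and counts are made up for it.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({'2012': np.random.randn(20),
                     '2013': np.random.randn(20) + 1})
wide.loc[0, '2012'] = np.nan

# dropna on the wide frame removes the whole first row, including the valid 2013 value
rows_wide = wide.dropna()

# dropna on the melted frame removes only the single missing 2012 observation
long_df = wide.reset_index().melt(id_vars=['index'], var_name='year').dropna()
counts = long_df.groupby('year')['value'].count()
```

Here rows_wide keeps 19 rows for both years, while counts reports 19 values for 2012 and all 20 for 2013.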

Create a stacked graph or bar graph using plotly in python

I have data like this :
[ ('2018-04-09', '10:18:11',['s1',10],['s2',15],['s3',5]),
  ('2018-04-09', '10:20:11',['s4',8],['s2',20],['s1',10]),
  ('2018-04-10', '10:30:11',['s4',10],['s5',6],['s6',3]) ]
I want to plot this data, preferably as a stacked graph.
The x-axis will be time, and it should look like this:
I created this image in Paint just to illustrate.
The x-axis should show time the way a normal graph does (10:00, April 3, 2018).
I am stuck because the string values (like 's1' or 's2') change from one bar to the next.
Just to hard-code and verify, I tried this:
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import matplotlib
plotly.offline.init_notebook_mode()
def createPage():
    graph_data = []
    l1=[('com.p1',1),('com.p2',2)('com.p3',3)]
    l2=[('com.p1',1),('com.p4',2)('com.p5',3)]
    l3=[('com.p2',8),('com.p3',2)('com.p6',30)]
    trace_temp = go.Bar(
        x='2018-04-09 10:18:11',
        y=l1[0],
        name = 'top',
    )
    graph_data.append(trace_temp)
    plotly.offline.plot(graph_data, filename='basic-scatter3.html')
createPage()
The error I am getting is: 'tuple' object is not callable.
Can someone please suggest some code for plotting such data?
If needed, I can store the data in some other form that is easier to plot.
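As a side note on that error: it most likely comes from the missing comma between the second and third tuple in each list, because Python reads ('com.p2',2)('com.p3',3) as an attempt to call the first tuple. A small sketch of what happens (the variable names here are only for illustration):

```python
second = ('com.p2', 2)

# without the comma, Python tries to call the tuple with the next tuple's items
try:
    second('com.p3', 3)
except TypeError as exc:
    message = str(exc)

# with the comma in place the list is built as intended
fixed = [('com.p1', 1), ('com.p2', 2), ('com.p3', 3)]
```

After this runs, message holds the same "not callable" text as the error reported above.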
Edit :
I used the approach suggested in the accepted answer and succeeded in plotting with plotly like this:
fig=df.iplot(kind='bar',barmode='stack',asFigure=True)
plotly.offline.plot(fig,filename="stack1.html")
However, I ran into one problem: when the time intervals are very close together, the data overlaps on the graph.
Is there a way to overcome it?
You could use a pandas stacked bar plot. The advantage is that pandas makes it easy to build the table of category/value pairs, which you have to generate anyhow.
from matplotlib import pyplot as plt
import pandas as pd
all_data = [('2018-04-09', '10:18:11', ['s1',10],['s2',15],['s3',5]),
            ('2018-04-09', '10:20:11', ['s4',8], ['s2',20],['s1',10]),
            ('2018-04-10', '10:30:11', ['s4',10],['s5',6], ['s6',3]) ]
#load data into dataframe
df = pd.DataFrame(all_data, columns = list("ABCDE"))
#combine the two descriptors
df["day/time"] = df["A"] + "\n" + df["B"]
#assign each list to a new row with the appropriate day/time label
df = df.melt(id_vars = ["day/time"], value_vars = ["C", "D", "E"])
#split each list into category and value
df[["category", "val"]] = pd.DataFrame(df.value.values.tolist(), index = df.index)
#create a table with category-value pairs from all lists, missing values are set to NaN
df = df.pivot(index = "day/time", columns = "category", values = "val")
#plot a stacked bar chart
df.plot(kind = "bar", stacked = True)
#give tick labels the right orientation
plt.xticks(rotation = 0)
plt.show()
Output:

output table to left of horizontal bar chart with pandas

I am trying to get an output from a dataframe that shows a stacked horizontal bar chart with a table to the left of it. The relevant data is as follows:
import pandas as pd
import matplotlib.pyplot as plt
cols = ['metric','target','daily_avg','days_green','days_yellow','days_red']
vals = ['Volume',338.65,106.81,63,2,1]
OutDict = dict(zip(cols,vals))
# build a one-row DataFrame (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame([OutDict], columns = cols)
I'd like to get something similar to what's in the following: Python Matplotlib how to get table only. I can get the stacked bar chart:
df[['days_green','days_yellow','days_red']].plot.barh(stacked=True)
Adding the keyword argument table=True puts a table below the chart. How do I get the axis to either display the df as a table or add one next to the chart? Also, the DataFrame will eventually have more than one row, but if I can get it to work for one row then I should be able to get it to work for n rows.
Thanks in advance.
Unfortunately, you won't be able to do this with the pandas plot method alone. The docs for the table parameter state:
If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.
So you will have to use matplotlib directly to get this done. One option is to create 2 subplots; one for your table and one for your chart. Then you can add the table and modify it as you see fit.
import matplotlib.pyplot as plt
import pandas as pd
cols = ['metric','target','daily_avg','days_green','days_yellow','days_red']
vals = ['Volume',338.65,106.81,63,2,1]
OutDict = dict(zip(cols,vals))
# build a one-row DataFrame (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame([OutDict], columns = cols)
# left axis holds the table, right axis holds the stacked bar chart
fig, (ax1, ax2) = plt.subplots(1, 2)
df[['days_green','days_yellow','days_red']].plot.barh(stacked=True, ax=ax2)
ax1.table(cellText=df[['days_green','days_yellow','days_red']].values, colLabels=['days_green', 'days_yellow', 'days_red'], loc='center')
ax1.axis('off')
plt.show()

Boxplots by group for multivariate two-factorial designs using matplotlib + pandas

I am analysing a two-factorial (M)ANOVA; the sampling design consists of two categorical variables with two and three levels respectively and a response of dimension 4. Having done all the data parsing in python, I would like to continue plotting the data within python, too. (Rather than switch to R for the plotting.) My code, though, is not only very verbose, but the whole thing looks and feels like a really bad hack, too. My question: What is the pandas-matplotlib-way of producing the following plot? Out of interest: I would also be happy to see a solution that is not using seaborn.
The solution in R (the plotting is 2 lines of code):
# Data management
library(reshape2)
# Plotting
library(ggplot2)
# Creating sample data
set.seed(12345)
dat = data.frame(matrix(rnorm(42*4, mean=c(10,3,5,1)), ncol=4, byrow=T))
names(dat) = c('Base', 'State23', 'State42', 'End')
gen = factor(sample(2, size=42, replace=T), labels=c('WT', 'HET'))
env = factor(sample(3, size=42, replace=T), labels=c('heavySmoker', 'casualSmoker', 'nonSmoker'))
dat$genotype = gen
dat$environment = env
# Plotting the data
dam = melt(dat, measure.vars=c('Base', 'State23', 'State42', 'End'))
p = ggplot(dam, aes(genotype, value, fill=environment)) + geom_boxplot() + facet_wrap(~variable, nrow=1)
ggsave('boxplot-r.png', plot=p)
This will produce the following plot:
My current solution in python:
# Numerics
import numpy as np
from numpy.random import randint
# Data management
import pandas as pd
from pandas import DataFrame
from pandas import Series
# Plotting
import matplotlib
matplotlib.use('Qt4Agg')
import matplotlib.pyplot as pt
import seaborn as sns
# Creating sample data
np.random.seed(12345)
index = pd.Index(np.arange(42))
frame = DataFrame(np.random.randn(42,4) + np.array([10,3,5,1]), columns=['Base', 'State23', 'State42', 'End'], index=index)
genotype = Series(['WT', 'HET'], name='genotype', dtype='category')
environment = Series(['heavySmoker', 'casualSmoker', 'nonSmoker'], name='environment', dtype='category')
gen = genotype[np.random.randint(2, size=42)]
env = environment[np.random.randint(3, size=42)]
gen.index = frame.index
env.index = frame.index
frame['genotype'] = gen
frame['environment'] = env
# Plotting the data
response = ['Base', 'State23', 'State42', 'End']
fig, ax = pt.subplots(1, len(response), sharex=True, sharey=True)
for i, r in enumerate(response):
    sns.boxplot(data=frame, x='genotype', y=r, hue='environment', ax=ax[i])
    ax[i].set_ylabel('')
    ax[i].set_title(r)
fig.subplots_adjust(wspace=0)
fig.savefig('boxplot-python.png')
This will produce the following plot:
As you probably agree, the code is not only verbose, but it also does not really do what I want. For example, I have no idea how to remove the multiple appearance of the legend, and the labelling on the x-axis is odd.
Edited to use factorplot instead of Facetgrid as suggested by mwaskom in the comments.
If you melt the dataframe, then you can take advantage of Seaborn's factorplot:
df = pd.melt(frame, id_vars=['genotype', 'environment'])
sns.factorplot(data=df, x='genotype', y='value',
               hue='environment', col='variable',
               kind='box', legend=True)
You can rename "value" and "variable" as you wish in the melt function.
Here is the resulting chart:
Previous answer with FacetGrid:
g = sns.FacetGrid(df, col="variable", height=4, aspect=.7)
g.map(sns.boxplot, "genotype", "value", "environment").add_legend(title="environment")
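Note for newer seaborn releases: factorplot was renamed to catplot in seaborn 0.9, and the size argument became height. A minimal sketch of the same idea with catplot follows; the sample data is regenerated here with numpy's default_rng for brevity, so it will not match the numbers in the question, and the names responses, long_df and grid are my own.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(12345)
responses = ['Base', 'State23', 'State42', 'End']
frame = pd.DataFrame(rng.standard_normal((42, 4)) + np.array([10, 3, 5, 1]),
                     columns=responses)
frame['genotype'] = rng.choice(['WT', 'HET'], size=42)
frame['environment'] = rng.choice(['heavySmoker', 'casualSmoker', 'nonSmoker'], size=42)

# long format: one row per (observation, response variable)
long_df = pd.melt(frame, id_vars=['genotype', 'environment'])

# one box-plot panel per response variable, a single shared legend
grid = sns.catplot(data=long_df, x='genotype', y='value',
                   hue='environment', col='variable',
                   kind='box', height=4, aspect=0.7)
```

grid.savefig('boxplot-python.png') would then write the figure to disk, as in the question.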

pandas boxplot: swap box placement for comparison

tmpdf.boxplot(['original','new'], by = 'by column', ax = ax, sym = '')
gets me a plot like this
I want to compare "original" with "new", how can I arrange to put the two "0" boxes in one panel and the two "1" boxes in another panel? And of course swap the labelling with that.
Thanks
Here is a sample dataset to demonstrate.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# simulate some artificial data
# ==========================================
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10,2), columns=['original', 'new'] )
df['by column'] = pd.Series([0,0,0,0,1,1,1,1,1,1])
# your original plot
ax = df.boxplot(['original', 'new'], by='by column', figsize=(12,6))
To get the desired output, call groupby explicitly instead of passing by to boxplot, so that we iterate over the subgroups and draw one panel per subgroup, each comparing 'original' with 'new'.
ax = df[['original', 'new']].groupby(df['by column']).boxplot(figsize=(12,6))
