Grouped Bar graph Pandas - python

I have a table in a pandas DataFrame named df:
+-----------+------------+-----------+------------+-----------+------------+
| avg_views | avg_orders | max_views | max_orders | min_views | min_orders |
+-----------+------------+-----------+------------+-----------+------------+
| 23        | 123        | 135       | 500        | 3         | 1          |
+-----------+------------+-----------+------------+-----------+------------+
What I am looking for now is to plot a grouped bar graph which shows me
(avg, max, min) of views and orders in one single bar chart.
That is, on the x axis there would be views and orders, separated by a gap,
with 3 bars (avg, max, min) for views and likewise 3 bars for orders.
I have attached a sample bar graph image to show how the bar graph should look.
Green should be for avg, yellow for max and pink for min.
I took the following code from "Setting spacing between grouped bar plots in matplotlib", but it is not working for me:
plt.figure(figsize=(13, 7), dpi=300)
groups = [[23, 135, 3], [123, 500, 1]]
group_labels = ['views', 'orders']
num_items = len(group_labels)
ind = np.arange(num_items)
margin = 0.05
width = (1. - 2. * margin) / num_items
s = plt.subplot(1, 1, 1)
for num, vals in enumerate(groups):
    print('plotting:', vals)
    # The position of the xdata must be calculated for each of the two data
    # series.
    xdata = ind + margin + (num * width)
    # Removing the "align=center" feature will left align graphs, which is
    # what this method of calculating positions assumes.
    gene_rects = plt.bar(xdata, vals, width)
s.set_xticks(ind + 0.5)
s.set_xticklabels(group_labels)
plotting: [23, 135, 3]
...
ValueError: shape mismatch: objects cannot be broadcast to a single shape

Using pandas:
import pandas as pd
groups = [[23,135,3], [123,500,1]]
group_labels = ['views', 'orders']
# Convert data to pandas DataFrame.
df = pd.DataFrame(groups, index=group_labels).T
# Plot.
pd.concat(
[df.mean().rename('average'), df.min().rename('min'),
df.max().rename('max')],
axis=1).plot.bar()
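The numbers in the question are already the avg, max and min, so another option is to plot them directly instead of aggregating again. The following is a minimal sketch, assuming the one-row df from the question and that every column name follows the stat_metric pattern (for example avg_views); the variable names row and grouped are made up for illustration.

```python
import pandas as pd

# One-row frame as described in the question.
df = pd.DataFrame({"avg_views": [23], "avg_orders": [123],
                   "max_views": [135], "max_orders": [500],
                   "min_views": [3], "min_orders": [1]})

# Take the single row and split each column name into (stat, metric).
row = df.iloc[0]
row.index = pd.MultiIndex.from_tuples(
    [tuple(name.split("_")) for name in row.index], names=["stat", "metric"])

# Rows become the x-axis groups (views, orders); columns become the bars.
grouped = row.unstack("stat").loc[["views", "orders"], ["avg", "max", "min"]]

# One color per column: avg, max, min.
ax = grouped.plot.bar(color=["green", "yellow", "pink"], rot=0)
# Call matplotlib.pyplot.show() to display the figure interactively.
```

Each row of grouped is one cluster on the x axis, so adding another metric only requires more columns that follow the same naming pattern.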

You should not have to modify your dataframe just to plot it in a certain way, right?
Use seaborn!
import seaborn as sns
sns.catplot(x="x",       # x variable name
            y="y",       # y variable name
            hue="type",  # group variable name
            data=df,     # dataframe to plot
            kind="bar")
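The catplot call above expects a long-format frame with columns named x, y and type, which the question's df does not have as posted. A hedged sketch of one way to get there, assuming the column names follow the stat_metric pattern; the names wide and long_df are illustrative only.

```python
import pandas as pd
import seaborn as sns

# One-row frame as described in the question.
wide = pd.DataFrame({"avg_views": [23], "avg_orders": [123],
                     "max_views": [135], "max_orders": [500],
                     "min_views": [3], "min_orders": [1]})

# Melt to one row per bar, then split "avg_views" into type="avg", x="views".
long_df = wide.melt(var_name="column", value_name="y")
long_df[["type", "x"]] = long_df["column"].str.split("_", expand=True)

g = sns.catplot(x="x", y="y", hue="type", data=long_df, kind="bar",
                order=["views", "orders"],
                hue_order=["avg", "max", "min"],
                palette=["green", "yellow", "pink"])
```

The order, hue_order and palette arguments pin the bar order and colors requested in the question.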

Related

seaborn jointplot prints partial legend

I'm getting something weird with the legend in a seaborn jointplot. I want to plot some quantity y as function of a quantity x for 8 different datasets. These datasets have only two columns for x and y and a different number of rows. First of all I concatenate all rows of all datasets using numpy
y = np.concatenate(((data1[:,1]), (data2[:,1]), (data3[:,1]), (data4[:,1]),(data5[:,1]), (data6[:,1]), (data7[:,1]), (data8[:,1])), axis=0)
x = np.concatenate(((data1[:,0]), (data2[:,0]), (data3[:,0]), (data4[:,0]), (data5[:,0]), (data6[:,0]), (data7[:,0]), (data8[:,0])), axis=0)
Then I create the array of values that I will use for the "hue" parameter of the jointplot, which distinguishes the datasets in the legend and colors. I do this by assigning each dataset a number from 1 to 8, which is repeated for every row of that dataset:
indexes = np.concatenate((np.ones(len(data1[:,0])), 2*np.ones(len(data2[:,0])), 3*np.ones(len(data3[:,0])), 4*np.ones(len(data4[:,0])), 5*np.ones(len(data5[:,0])), 6*np.ones(len(data6[:,0])), 7*np.ones(len(data7[:,0])), 8*np.ones(len(data8[:,0]))), axis=0)
Then I create the dataset:
all_together = np.column_stack((x, y, indexes))
df = pd.DataFrame(all_together, columns = ['x','y','Dataset'])
So now I can create the jointplot. This is simply done by:
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo')
handles, labels = g.ax_joint.get_legend_handles_labels()
g.ax_joint.legend(handles=handles, labels=['data1', 'data2', 'data3', 'data4', 'data5', 'data6', 'data7', 'data8'], fontsize=10)
At this point, the problem is: all points are getting plotted (at least I think), but the legend only shows: data1, data2, data3, data4 and data5. I don't understand why it is not showing also the other three labels, and in this way the plot is difficult to read. I have checked and the cumulative dataset df has the correct shape. Any ideas?
You can add legend='full' to obtain a full legend. By default, sns.jointplot uses sns.scatterplot for the central plot. The keyword parameters which aren't used by jointplot are sent to scatterplot. The legend parameter can be "auto", "brief", "full", or False.
From the docs:
If “brief”, numeric hue and size variables will be represented with a sample of evenly spaced values. If “full”, every group will get an entry in the legend. If “auto”, choose between brief or full representation based on number of levels. If False, no legend data is added and no legend is drawn.
The following code is tested with seaborn 0.11.2:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
N = 200
k = np.repeat(np.arange(1, 9), N // 8)
df = pd.DataFrame({'x': 5 * np.cos(2 * k * np.pi / 8) + np.random.randn(N),
'y': 5 * np.sin(2 * k * np.pi / 8) + np.random.randn(N),
'Dataset': k})
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo', legend='full')
plt.show()

How to plot a histogram for all unique combinations of data?

Is there a way in Python to get a size-frequency histogram for a population under different scenarios for specific days,
showing means with error bars?
My data are in a format similar to this table:
SCENARIO RUN MEAN DAY
A 1 25 10
A 1 15 30
A 2 20 10
A 2 27 30
B 1 45 10
B 1 50 30
B 2 43 10
B 2 35 30
results_data.groupby(['Scenario', 'Run']).mean() does not give me the days I want to visualize the data by;
it returns the mean over the days in each run.
Use seaborn.FacetGrid
FacetGrid is a multi-plot grid for plotting conditional relationships.
Map seaborn.distplot onto the FacetGrid and use hue='DAY'.
Setup Data and DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random # just for test data
import numpy as np # just for test data
# data
random.seed(365)
np.random.seed(365)
data = {'MEAN': [np.random.randint(20, 51) for _ in range(500)],
'SCENARIO': [random.choice(['A', 'B']) for _ in range(500)],
'DAY': [random.choice([10, 30]) for _ in range(500)],
'RUN': [random.choice([1, 2]) for _ in range(500)]}
# create dataframe
df = pd.DataFrame(data)
Plot with kde=False
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5)
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=False, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plot with kde=True
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5, palette='GnBu')
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=True, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plots with error bars
Based on "How to add error bars to histogram diagram in python".
Using df from above.
Use matplotlib.pyplot.errorbar to plot the error bars on the histogram.
from itertools import product
# create unique combinations for filtering df
scenarios = df.SCENARIO.unique()
runs = df.RUN.unique()
days = df.DAY.unique()
combo_list = [scenarios, runs, days]
results = list(product(*combo_list))
# plot
for i, result in enumerate(results, 1):  # iterate through each set of combinations
    s, r, d = result
    data = df[(df.SCENARIO == s) & (df.RUN == r) & (df.DAY == d)]  # filter dataframe
    # add subplot rows, columns; needs to equal the number of combinations in results
    plt.subplot(2, 4, i)
    # plot hist and unpack values
    n, bins, _ = plt.hist(x='MEAN', bins=range(20, 51, 5), data=data, color='g')
    # calculate bin centers
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    # draw error bars using the sqrt(n) error; you can use whatever you want there
    # Poissonian 1 sigma intervals would make more sense
    plt.errorbar(bin_centers, n, yerr=np.sqrt(n), fmt='k.')
    plt.title(f'Scenario: {s} | Run: {r} | Day: {d}')
plt.tight_layout()
plt.show()
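If the goal is the mean per scenario and day with an error bar across runs, one possible reading of the question, then grouping by SCENARIO and DAY (rather than SCENARIO and RUN) keeps the days. The sketch below uses the eight rows from the question's table; using the standard error across runs as the error bar is my assumption, not something the question specifies.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The table from the question.
results_data = pd.DataFrame({
    "SCENARIO": list("AAAABBBB"),
    "RUN": [1, 1, 2, 2, 1, 1, 2, 2],
    "MEAN": [25, 15, 20, 27, 45, 50, 43, 35],
    "DAY": [10, 30, 10, 30, 10, 30, 10, 30],
})

# Mean and standard error of MEAN across runs, per scenario and day.
stats = results_data.groupby(["SCENARIO", "DAY"])["MEAN"].agg(["mean", "sem"])

fig, ax = plt.subplots()
for scenario, grp in stats.groupby(level="SCENARIO"):
    days = grp.index.get_level_values("DAY")
    ax.errorbar(days, grp["mean"], yerr=grp["sem"],
                marker="o", capsize=3, label=f"Scenario {scenario}")
ax.set_xlabel("DAY")
ax.set_ylabel("mean of MEAN across runs")
ax.legend()
```

With only two runs per combination the standard error is a crude estimate; with more runs the same code applies unchanged.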

How to set size of AxesSubplot in relatively simple Python program?

Python 3.7 environment
I want to create a stacked bar plot with a label on each subcategory segment of a bar. The data comes from a CSV file, and some of the labels are rather long, so they are wider than the bar. The problem could easily be solved by scaling the whole figure so that the bars become large enough for the labels, but I fail to resize the plot as a whole. Here is the code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
dataset = 'Number'
dataFrame: pd.DataFrame = pd.read_csv('my_csv_file_with_data.csv', sep=',', header=2)
dataFrame['FaultDuration [h]'] = dataFrame['DurationH']
# ***********************************************************
# Data gymnastics to transform data in desired format
# determine the main categories
mainCategories: pd.Series = dataFrame['MainCategory']
mainCategories = mainCategories.drop_duplicates()
mainCategories = mainCategories.sort_values()
print('Main Categories: '+ mainCategories)
# subcategories
subCategories: pd.Series = pd.Series(data=dataFrame['SubCategorie'].drop_duplicates().sort_values().values)
subCategories = subCategories.sort_values()
print('Sub Categories: '+ subCategories)
# Build new frame with subcategories as headers
columnNames = pd.Series(data=['SubCategory2'])
columnNames = columnNames.append(mainCategories)
rearrangedData: pd.DataFrame = pd.DataFrame(columns=columnNames.values)
for subCategory in subCategories:
    subset: pd.DataFrame = dataFrame.loc[dataFrame['SubCategorie'] == subCategory]
    rearrangedRow = pd.DataFrame(columns=mainCategories.values)
    rearrangedRow = rearrangedRow.append(pd.Series(), ignore_index=True)
    rearrangedRow['SubCategory2'] = subCategory
    for mainCategory in mainCategories:
        rowData: pd.DataFrame = subset.loc[subset['MainCategorie'] == mainCategory]
        if (rowData is not None and rowData.size > 0):
            rearrangedRow[mainCategory] = float(rowData[dataset].values)
        else:
            rearrangedRow[mainCategory] = 0.0
    rearrangedData = rearrangedData.append(rearrangedRow, ignore_index=True)
# *********************************************************************
# here the plot is created:
thePlot = rearrangedData.set_index('SubCategory2').T.plot.bar(stacked=True, width=1, cmap='rainbow')
thePlot.get_legend().remove()
labels = []
# *************************************************************
# creation of bar patches and labels in bar chart
rowIndex = 0
for item in rearrangedData['SubCategory2']:
    colIndex = 0
    for colHead in rearrangedData.columns:
        if colHead != 'SubCategory2':
            if rearrangedData.iloc[rowIndex, colIndex] > 0.0:
                label = item + '\n' + str(rearrangedData.iloc[rowIndex, colIndex])
                labels.append(item)
            else:
                labels.append('')
        colIndex = colIndex + 1
    rowIndex = rowIndex + 1
patches = thePlot.patches
for label, rect in zip(labels, patches):
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        thePlot.text(x + width/2., y + height/2., label, ha='center', va='center', size = 7 )
# Up to here things work like expected...
# *******************************************************
# now I want to produce output in the desired format/size
# things I tried:
# 1) thePlot.figure(figsize=(40,10))  <---- Fails with error 'Figure' object is not callable
# 2) plt.figure(figsize=(40,10))  <---- Creates a second, empty plot of the right size, but bar chart remains unchanged
# 3) plt.figure(num=1, figsize=(40,10))  <---- leaves chart plot unchanged
plt.tight_layout()
plt.show()
The object "thePlot" is an AxesSubplot. How do I get to a properly scaled chart?
set_size_inches is a method of the Figure, not of the Axes, so call it on the figure that owns the plot:
thePlot.figure.set_size_inches(18.5, 10.5, forward=True)
For an example, see:
How do you change the size of figures drawn with matplotlib?
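A small sketch of both routes, using a made-up two-column frame in place of the CSV data: the size can be passed as figsize when pandas creates the plot, or changed afterwards through the Axes' figure attribute.

```python
import pandas as pd

# Hypothetical stand-in for the rearranged data in the question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Option 1: set the size when the plot is created.
ax = df.plot.bar(stacked=True, width=1, figsize=(12, 4))
initial_size = tuple(ax.figure.get_size_inches())

# Option 2: resize the existing figure through the Axes.
ax.figure.set_size_inches(18.5, 10.5, forward=True)
final_size = tuple(ax.figure.get_size_inches())
```

Calling plt.figure(figsize=...) after the plot exists opens a new figure, which matches the empty second window described in the question.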

Plotting a histogram gives height error msg

I'm trying to plot a histogram based on percentages, I keep getting the below error:
ValueError: incompatible sizes: argument 'height' must be length 6 or scalar
It's something to do with this line but I'm not sure what's wrong with the height argument.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
xaxis=['epic1', 'epic2', 'epic3', 'epic4', 'epic5', 'epic6']
n=len(xaxis)
names = ('epic1', 'epic2', 'epic3', 'epic4', 'epic5', 'epic6')
data = {'done': [57,53,49,65,78,56,89],
'progress': [23,12,34,11,34,12,12],
'todo' :[11,5,6,7,8,4,6]}
df = pd.DataFrame(data)
df['total'] = df['done'] + df['progress'] + df['todo']
df['done_per'] = df['done'] / df['total'] * 100
df['progress_per'] = df['progress'] / df['total'] * 100
df['todo_per'] = df['todo'] / df['total'] * 100
barWidth = 0.25
# Create green Bars
plt.bar(xaxis, done_per, color='#b5ffb9', edgecolor='green', width=barWidth)
# Create orange Bars
plt.bar(xaxis, progress_per, bottom=done_per, color='#f9bc86',
edgecolor='orange', width=barWidth)
# Create blue Bars
plt.bar(xaxis, todo_per, bottom=[i+j for i,j in zip(done_per, progress_per)],
color='blue', edgecolor='blue', width=barWidth)
plt.xticks(xaxis, names)
plt.xlabel("epics")
plt.show()
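There are 7 items in each *_per column (the bar heights) but only 6 items in xaxis, so the lengths do not match.
If you want 7 bars, add 'epic7' to xaxis (and to names): xaxis.append('epic7'). If you want 6, drop the extra value from each list in data.
I think you also missed a few lines in your code: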
done_per = df['done_per']
progress_per = df['progress_per']
todo_per = df['todo_per']
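Putting both fixes together, a minimal runnable sketch could look like this. It assumes the seventh value in each list is intended, so a seventh label is added; if not, drop the last value from each list instead. The stacking is delegated to pandas rather than done with bottom= by hand.

```python
import pandas as pd

# Seven labels to match the seven values per status.
epics = [f"epic{i}" for i in range(1, 8)]
df = pd.DataFrame({"done": [57, 53, 49, 65, 78, 56, 89],
                   "progress": [23, 12, 34, 11, 34, 12, 12],
                   "todo": [11, 5, 6, 7, 8, 4, 6]}, index=epics)

# Convert each row to percentages of its own total.
pct = df.div(df.sum(axis=1), axis=0) * 100

# stacked=True takes care of the bottom offsets for each segment.
ax = pct.plot.bar(stacked=True, width=0.25, rot=0,
                  color=["#b5ffb9", "#f9bc86", "blue"])
ax.set_xlabel("epics")
ax.set_ylabel("percent")
```

Every bar then reaches 100, since each row of pct sums to 100 by construction.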

How to make multiline graph with matplotlib subplots and pandas?

I'm fairly new at coding (completely self-taught), and have started using it at my job as a research assistant in a cancer lab. I need some help setting up a few line graphs in matplotlib.
I have a dataset that includes next-gen sequencing data for about 80 patients. For each patient, we have different timepoints of analysis, different genes detected (out of 40), and the associated % mutation for each gene.
My goal is to write two scripts. The first will generate a "by patient" plot: a line graph with y = % mutation and x = time of measurement, with a different colored line for each of the patient's associated genes. The second will generate a "by gene" plot, where one plot contains a different colored line for each patient's x/y values for that specific gene.
Here is an example dataframe for the above script:
gene yaxis xaxis pt# gene#
ASXL1-3 34 1 3 1
ASXL1-3 0 98 3 1
IDH1-3 24 1 3 11
IDH1-3 0 98 3 11
RUNX1-3 38 1 3 21
RUNX1-3 0 98 3 21
U2AF1-3 33 1 3 26
U2AF1-3 0 98 3 26
I have setup a groupby script that when I iterate over it, gives me a dataframe for every gene-timepoint for each patient.
grouped = df.groupby('pt #')
for groupObject in grouped:
    group = groupObject[1]
For patient 1, this gives the following output:
y x gene patientnumber patientgene genenumber dxtotransplant \
0 40.0 1712 ASXL1 1 ASXL1-1 1 1857
1 26.0 1835 ASXL1 1 ASXL1-1 1 1857
302 7.0 1835 RUNX1 1 RUNX1-1 21 1857
I need help writing a script that will create either of the plots described above. using the bypatient example, my general idea is that I need to create a different subplot for every gene a patient has, where each subplot is the line graph represented by that one gene.
Using matplotlib this is about as far as I have gotten:
plt.figure()
grouped = df.groupby('patient number')
for groupObject in grouped:
    group = groupObject[1]
    df = group #may need to remove this
    for element in range(len(group)):
        xs = np.array(df[df.columns[1]]) #"x" column
        ys= np.array(df[df.columns[0]]) #"y" column
        gene = np.array(df[df.columns[2]])[element] #"gene" column
        plt.subplot(1,1,1)
        plt.scatter(xs,ys, label=gene)
        plt.plot(xs,ys, label=gene)
        plt.legend()
    plt.show()
This produces the following output:
In this output, the circled line is not supposed to be connected to the other 2 points. In this case, this is patient 1, who has the following datapoint:
x y gene
1712 40 ASXL1
1835 26 ASXL1
1835 7 RUNX1
Using seaborn I have gotten close to my desired graph using this code:
grouped = df.groupby(['patientnumber'])
for groupObject in grouped:
    group = groupObject[1]
    g = sns.FacetGrid(group, col="patientgene", col_wrap=4, size=4, ylim=(0,100))
    g = g.map(plt.scatter, "x", "y", alpha=0.5)
    g = g.map(plt.plot, "x", "y", alpha=0.5)
    plt.title= "gene:%s"%element
Using this code I get the following:
If I adjust the line:
g = sns.FacetGrid(group, col="patientnumber", col_wrap=4, size=4, ylim=(0,100))
I get the following result:
As you can see in the second example, the plot treats every point as if it belonged to the same line (but there are actually 4 separate lines).
How I can tweak my iterations so that each patient-gene is treated as a separate line on the same graph?
I wrote a subplot function that may give you a hand. I modified the data a tad to help illustrate the plotting functionality.
gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,3,98,3,1
IDH1-3,24,1,3,11
IDH1-3,7,98,3,11
RUNX1-3,38,1,3,21
RUNX1-3,2,98,3,21
U2AF1-3,33,1,3,26
U2AF1-3,0,98,3,26
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
IDH1-3,27,1,4,11
IDH1-3,12,62,4,11
IDH1-3,1,119,4,11
RUNX1-3,42,1,4,21
RUNX1-3,3,62,4,21
RUNX1-3,1,119,4,21
U2AF1-3,16,1,4,26
U2AF1-3,1,62,4,26
U2AF1-3,0,119,4,26
This is the subplotting function...with some extra bells and whistles :)
def plotByGroup(df, group, xCol, yCol, title = "", xLabel = "", yLabel = "", lineColors = ["red", "orange", "yellow", "green", "blue", "purple"], lineWidth = 2, lineOpacity = 0.7, plotStyle = 'ggplot', showLegend = False):
    """
    Plot multiple lines from a Pandas Data Frame for each group using DataFrame.groupby() and MatPlotLib PyPlot.
    #params
    df - Required - Data Frame - Pandas Data Frame
    group - Required - String - Column name to group on
    xCol - Required - String - Column name for X axis data
    yCol - Required - String - Column name for y axis data
    title - Optional - String - Plot Title
    xLabel - Optional - String - X axis label
    yLabel - Optional - String - Y axis label
    lineColors - Optional - List - Colors to plot multiple lines
    lineWidth - Optional - Integer - Width of lines to plot
    lineOpacity - Optional - Float - Alpha of lines to plot
    plotStyle - Optional - String - MatPlotLib plot style
    showLegend - Optional - Boolean - Show legend
    #return
    MatPlotLib Plot Object
    """
    # Import MatPlotLib Plotting Function & Set Style
    from matplotlib import pyplot as plt
    plt.style.use(plotStyle)
    figure = plt.figure() # Initialize Figure
    grouped = df.groupby(group) # Set Group
    i = 0 # Set iteration to determine line color indexing
    for idx, grp in grouped:
        colorIndex = i % len(lineColors) # Define line color index
        lineLabel = grp[group].values[0] # Get a group label from first position
        xValues = grp[xCol] # Get x vector
        yValues = grp[yCol] # Get y vector
        plt.subplot(1,1,1) # Initialize subplot and plot (on next line)
        plt.plot(xValues, yValues, label = lineLabel, color = lineColors[colorIndex], lw = lineWidth, alpha = lineOpacity)
        # Plot legend
        if showLegend:
            plt.legend()
        i += 1
    # Set title & Labels
    axis = plt.gca() # Reuse the axes the lines were drawn on
    axis.set_title(title)
    axis.set_xlabel(xLabel)
    axis.set_ylabel(yLabel)
    # Return plot for saving, showing, etc.
    return plt
And to use it...
import pandas
# Load the Data into Pandas
df = pandas.read_csv('data.csv')
#
# Plotting - by Patient
#
# Create Patient Grouping
patientGroup = df.groupby('pt #')
# Iterate Over Groups
for idx, patientDf in patientGroup:
    # Let's give them specific titles
    plotTitle = "Gene Frequency over Time by Gene (Patient %s)" % str(patientDf['pt #'].values[0])
    # Call the subplot function
    plot = plotByGroup(patientDf, 'gene', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Gene Frequency")
    # Add Vertical Lines at Assay Timepoints
    timepoints = set(patientDf.xaxis.values)
    for timepoint in timepoints:
        plot.axvline(x = timepoint, linewidth = 1, linestyle = "dashed", color='gray', alpha = 0.4)
    # Let's see it
    plot.show()
And of course, we can do the same by gene.
#
# Plotting - by Gene
#
# Create Gene Grouping
geneGroup = df.groupby('gene')
# Generate Plots for Groups
for idx, geneDf in geneGroup:
    plotTitle = "%s Gene Frequency over Time by Patient" % str(geneDf['gene'].values[0])
    plot = plotByGroup(geneDf, 'pt #', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Frequency")
    plot.show()
If this isn't what you're looking for, provide a clarification and I'll take another crack at it.
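On the stray connecting line in the question's matplotlib attempt: my reading is that xs and ys are taken from all of the patient's rows, so points from different genes are joined into one line. A stripped-down sketch of the per-group idea, using the three patient 1 rows quoted in the question and made-up variable names, draws one line per patientgene value:

```python
import pandas as pd
import matplotlib.pyplot as plt

# The three data points for patient 1 quoted in the question.
df = pd.DataFrame({"x": [1712, 1835, 1835],
                   "y": [40, 26, 7],
                   "patientgene": ["ASXL1-1", "ASXL1-1", "RUNX1-1"]})

fig, ax = plt.subplots()
# Each group holds the rows of exactly one patient-gene pair,
# so each call to ax.plot only connects points of that gene.
for name, grp in df.groupby("patientgene"):
    ax.plot(grp["x"], grp["y"], marker="o", label=name)
ax.set_ylim(0, 100)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Here the single RUNX1-1 point shows up as an isolated marker instead of being joined to the ASXL1-1 line.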
