Passing column data from pandas df to bokeh plotting function - python

I'm working on automating plotting functions for metabolomics data with bokeh. Currently, I'm trying to read in my dataframe from CSV and iterate through the columns generating box plots for each metabolite (column).
I have an example df that looks like this:
Sample  Group  AMP     ADP     ATP
1A      A      239847  239084  987374
1B      A      245098  241210  988950
2A      B      238759  200554  921032
2B      B      230029  215408  89980
Here is what my code looks like:
import pandas
from bokeh.plotting import figure, output_file, show, save
from bokeh.charts import BoxPlot
df = pandas.read_csv("testdata_2.csv")
for colname, col in df.iteritems():
    p = BoxPlot(df, values=df[colname], label='Group', xlabel='Group', ylabel='Peak Area',
                title=colname)
    output_file("boxplot.html")
    show(p)
This generates an error:
raise ValueError("expected an element of either %s, got %r" % (nice_join(self.type_params), value))
ValueError: expected an element of either Column Name or Column String or List(Column Name or Column String
It seems that setting values=df[colname] is the issue. If I replace it with values=df['colname'], I get a KeyError for colname. I can plot just fine if I specify a given column, such as values='ATP', but I need to be able to loop through all columns.
Any guidance? Is this even the best approach?
Thanks in advance.

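As the error message says, values expects a column name (a string), not the column data, so pass colname itself rather than df[colname]. If you want to organize the plots horizontally, you can create one plot per column and then combine them, for instance with hplot from bokeh.io, as follows: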
import pandas
from bokeh.plotting import figure, output_file, show, save
from bokeh.charts import BoxPlot
from bokeh.io import hplot
df = pandas.read_csv("testdata_2.csv")
p = []
for colname in ['AMP','ADP','ATP']:
    p.append(BoxPlot(df, values=colname, label='Group', xlabel='Group',
                     ylabel='Peak Area', title=colname, width=250, height=250))
output_file("boxplot.html")
show(hplot(*p))
For your particular example, this gives the three box plots in a single row (image not reproduced here).
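The answer above hard-codes the column list. A minimal sketch of building that list from the dataframe instead, assuming the file is comma-separated and that every numeric column is a metabolite (neither is confirmed by the question). It uses only pandas, and the per-group median stands in for the plotting call:

```python
import io

import pandas as pd

# Sample data copied from the question; a real script would call
# pd.read_csv("testdata_2.csv") instead. Comma separation is an assumption.
csv_text = """Sample,Group,AMP,ADP,ATP
1A,A,239847,239084,987374
1B,A,245098,241210,988950
2A,B,238759,200554,921032
2B,B,230029,215408,89980
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the numeric columns, which drops the 'Sample' and 'Group' labels.
metabolites = df.select_dtypes(include="number").columns.tolist()
print(metabolites)

for colname in metabolites:
    # colname is a plain string such as 'AMP', which is what values= expects.
    medians = df.groupby("Group")[colname].median()
    print(colname, medians.to_dict())
```

Inside the loop, colname can be passed straight to values= in the BoxPlot call shown above.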

Related

Unable to use Bokeh and pandas to read a CSV and plot it

I'm trying to plot a line graph from a simple CSV file with two columns, using Bokeh for data visualisation and pandas to read the CSV and handle the data. However, I can't seem to pass the data I've imported using pandas to Bokeh to plot my line graph.
This is running locally on my computer. I've tried and debugged each section of the code and the sole problem seems to occur when I pass the data from pandas to bokeh.
I've tried printing the columns I've selected from my csv to check that the entire column has been selected too.
#Requirements for App
from bokeh.plotting import figure, output_file, show
import pandas as pd
from bokeh.models import ColumnDataSource
#Import data-->Weight measurements over a period of time [ STUB ]
weight = pd.read_csv("weight.csv")
#Define parameters
x=weight["Date"]
y=weight["Weight"]
#Take data and present in a graph
output_file("test.html")
p = figure(plot_width=400, plot_height=400)
p.line(x,y,line_width=2)
show(p)
I expect to get a line graph that plots each weight entry each day but I get a blank plot.
This should work. pandas reads the Date column as plain strings, so you have to convert it with pd.to_datetime() and tell Bokeh to use a datetime x-axis (x_axis_type='datetime').
#!/usr/bin/python3
from bokeh.plotting import figure, output_file, show
import pandas as pd
from bokeh.models import DatetimeTickFormatter, ColumnDataSource
#Import data-->Weight measurements over a period of time [ STUB ]
weight = pd.read_csv("weight.csv")
#Define parameters
weight["Date"] = pd.to_datetime(weight['Date'])
weight["Weight"] = pd.to_numeric(weight['Weight'])
source = ColumnDataSource(weight)
#Take data and present in a graph
output_file("test.html")
p = figure(plot_width=400, plot_height=400, x_axis_type='datetime')
p.line(x='Date',y='Weight',line_width=2, source=source)
p.xaxis.formatter=DatetimeTickFormatter(
    minutes=["%M"],
    hours=["%H:%M"],
    days=["%d/%m/%Y"],
    months=["%m/%Y"],
    years=["%Y"]
)
show(p)
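To see the conversion on its own, here is a small sketch that leaves Bokeh out. The CSV contents are made up for illustration, since the question does not show weight.csv. It also shows the parse_dates option of read_csv as an alternative to calling pd.to_datetime() afterwards:

```python
import io

import pandas as pd

# Made-up stand-in for weight.csv; the real file is not shown in the question.
csv_text = """Date,Weight
2019-01-01,80.5
2019-01-02,80.1
2019-01-03,79.8
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw["Date"].dtype)  # the dates are still plain strings at this point

# Option 1: convert after loading, as in the answer above.
raw["Date"] = pd.to_datetime(raw["Date"])

# Option 2: ask read_csv to parse the column while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(raw["Date"].dtype, parsed["Date"].dtype)
```

Either way, the column ends up with a datetime dtype, which is what a figure created with x_axis_type='datetime' expects.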

Python plotting dictionary

I am VERY new to the world of python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and each sheet has columns called HMB and DV. I want to plot 17 data sets on one box and whisker plot for HMB and another 17 data sets on the DV plot. Below is what I have so far.
I can open the file, and get all the sheets into list_dfs, but then don't know where to go from there. I was going to try and manually slice each set (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
A sample output looks like below, for 17 list entries, but with different number of rows for each.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!
I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, it should look something like the code below. Depending on the version of Python you're using, the dictionary may be unordered (insertion order is only guaranteed from Python 3.7 on), so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead.
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,  # size of the marker
               c="r",  # colour, could be from colours
               alpha=0.35,  # opacity, 1 being solid
               marker="^",  # or ref. to markers, e.g. markers[idx]
               edgecolors="none"  # removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (depends on how many rows each has whether this is practical).
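A rough sketch of the "new DataFrame" route, assuming d_psppm maps a name to a single-column DataFrame as in the question. The two small frames below are hypothetical stand-ins that reuse a few of the sample values:

```python
import pandas as pd

# Hypothetical stand-in for d_psppm: one single-column DataFrame per sheet,
# with a different number of rows in each.
d_psppm = {
    "PSPPM0": pd.DataFrame({"PSPPM": [0.246769, 0.599589, 0.082420, 0.250000]}),
    "PSPPM1": pd.DataFrame({"PSPPM": [0.500887, 0.475255, 0.472711]}),
}

# Put each sheet's values side by side; shorter sheets are padded with NaN.
wide = pd.concat(
    {name: frame["PSPPM"] for name, frame in d_psppm.items()},
    axis=1,
)
print(wide)
```

With the data in this shape, wide.boxplot() (pandas' matplotlib-backed method) should draw one box per sheet in a single plot, with the dictionary keys as labels.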

Verbatim labels in legend in bokeh plots

I'm trying to use bokeh in python for interactive analysis of my plots.
My data are stored in a pandas.DataFrame. I'd like to have a legend with column names as labels. However, bokeh extracts the values from the respective column instead.
import numpy as np
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
output_notebook()
BokehJS 0.12.13 successfully loaded.
df = pd.DataFrame({'accuracy': np.random.random(10)}, index=pd.Index(np.arange(10), name='iteration'))
df
output:
accuracy
iteration
0 0.977427
1 0.057319
2 0.307741
3 0.127390
4 0.662976
5 0.313618
6 0.214040
7 0.214274
8 0.864432
9 0.800101
Now plot:
p = figure(width=900, y_axis_type="log")
source = ColumnDataSource(df)
p.line(x='iteration', y='accuracy', source=source, legend='accuracy')
show(p)
Result:
Desired output, obtained with adding space: legend='accuracy'+' ':
Although I've reached my goal, I'm not satisfied with this method. I think there should be a more elegant, official way to distinguish a column name from a legend label.
There is. Bokeh tries to "do the right thing" in most situations, but doing that makes for a few corner cases where the behavior is less desirable, and this is one of them. However, specifically in this instance, you can always be explicit about whether the string is to be interpreted as a value or as a field:
from bokeh.core.properties import value
p.line(x='iteration', y='accuracy', source=source, legend=value('accuracy'))

Python / Pandas / Bokeh: plotting multiple lines with legends from dataframe

I have data in a Pandas dataframe that I am trying to plot to a time series line graph.
When plotting one single line, I have been able to do this quite successfully using the p.line function, ensuring I make the x_axis_type 'datetime'.
To plot multiple lines, I have tried using p.multi_line, which worked well but I also need a legend and, according to this post, it's not possible to add a legend to a multiline: Bokeh how to add legend to figure created by multi_line method?
Leo's answer to the question in the link above looks promising, but I can't seem to work out how to apply this when the data is sourced from a dataframe.
Does anyone have any tips?
OK, this seems to work:
from bokeh.plotting import figure, output_file, save
from bokeh.models import ColumnDataSource
import pandas as pd
from pandas import HDFStore
from bokeh.palettes import Spectral11
# imports data to dataframe from our storage hdf5 file
# our index column has no name, so this is assigned a name so it can be
# referenced to for plotting
store = pd.HDFStore('<file location>')
df = pd.DataFrame(store['d1'])
df = df.rename_axis('Time')
# remove unwanted columns
col_list = ['Column A', 'Column B']
df = df[col_list]
# make a list of our columns
col = list(df.columns)
# the number of columns is the number of lines that we will make
numlines = len(col)
# import color palette
mypalette = Spectral11[0:numlines]
# make the figure,
p = figure(x_axis_type="datetime", title="<title>", width = 800, height = 450)
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = '<units>'
# loop through our columns and colours
for (columnnames, colore) in zip(col, mypalette):
    p.line(df.index, df[columnnames], legend = columnnames, color = colore )
# creates an output file
output_file('<output location>')
#save the plot
save(p)

Iterating over columns with for loops in pandas dataframe

I am trying to take a dataframe read in from a CSV file and generate a scatter plot for each column within the dataframe. For example, I have read in the following with df = pandas.read_csv():
Sample  AMP     ADP     ATP
1A      239847  239084  987374
1B      245098  241210  988950
2A      238759  200554  921032
2B      230029  215408  899804
I would like to generate a scatter plot for each column, using Sample as the x values and that column's peak areas as the y values.
I am using the following code with bokeh.plotting to plot each column manually
import pandas
from bokeh.plotting import figure, show
df = pandas.read_csv("data.csv")
p = figure(x_axis_label='Sample', y_axis_label='Peak Area', x_range=sorted(set(df['Sample'])))
p.scatter(df['Sample'], df['AMP'])
show(p)
This generates scatter plots successfully, but I would like to create a loop to generate a scatter plot for each column. In my full dataset, I have over 500 columns I would like to plot.
I have followed references for using df.iteritems and df.itertuples for iterating through dataframes, but I'm not sure how to get the output I want.
I have tried the following:
for index, row in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[row])
    show(p)
I hit an error right away:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['1A' '1B' '2A' '2B'] not in index"
Any guidance? Thanks in advance.
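iteritems iterates over columns, not rows, yielding (column name, column data) pairs. Your real problem is that you index with df[row], where row is the column's data, instead of df[index], where index is the column's name. I'd rename the loop variables to reflect that they are columns and do this: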
for colname, col in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[colname])
    show(p)
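Note that DataFrame.iteritems was removed in pandas 2.0; DataFrame.items is the equivalent call. A minimal sketch of the same loop with items(), assuming a comma-separated file, and with print standing in for the figure and scatter calls. It also skips the Sample column, which the loop above would otherwise plot against itself:

```python
import io

import pandas as pd

# Sample data copied from the question; comma separation is an assumption.
csv_text = """Sample,AMP,ADP,ATP
1A,239847,239084,987374
1B,245098,241210,988950
2A,238759,200554,921032
2B,230029,215408,899804
"""
df = pd.read_csv(io.StringIO(csv_text))

plotted = []
for colname, col in df.items():
    if colname == "Sample":
        continue  # this column holds the x values, not something to plot
    # colname is the label (a string); col is that column's data (a Series).
    plotted.append(colname)
    print(colname, col.tolist())
```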
