I am trying to take a dataframe read in from a CSV file and generate a scatter plot for each column within the dataframe. For example, I have read in the following with df = pandas.read_csv():
Sample AMP ADP ATP
1A 239847 239084 987374
1B 245098 241210 988950
2A 238759 200554 921032
2B 230029 215408 899804
I would like to generate a scatter plot using Sample as the x values and the peak areas in each column as the y values.
I am using the following code with bokeh.plotting to plot each column manually:
import pandas
from bokeh.plotting import figure, show
df = pandas.read_csv("data.csv")
p = figure(x_axis_label='Sample', y_axis_label='Peak Area', x_range=sorted(set(df['Sample'])))
p.scatter(df['Sample'], df['AMP'])
show(p)
This generates scatter plots successfully, but I would like to create a loop to generate a scatter plot for each column. In my full dataset, I have over 500 columns I would like to plot.
I have followed references for using df.iteritems and df.itertuples for iterating through dataframes, but I'm not sure how to get the output I want.
I have tried the following:
for index, row in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[row])
    show(p)
I hit an error right away:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['1A' '1B' '2A' '2B'] not in index"
Any guidance? Thanks in advance.
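iteritems iterates over columns, not rows: each iteration yields a (column name, column values) pair. Your real problem is that you index with df[row], which passes the column's values as the key, instead of df[index], which passes the column name. I'd rename the loop variables to reflect that they are columns and do this: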
for colname, col in df.iteritems():
    p = figure()
    p.scatter(df['Sample'], df[colname])
    show(p)
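On recent pandas versions the loop needs one adjustment: DataFrame.iteritems was removed in pandas 2.0, and DataFrame.items is the equivalent. Here is a minimal sketch, assuming the sample table from the question (inlined here instead of data.csv). The helper value_columns is hypothetical, not part of pandas or bokeh. It skips the x column so that Sample is not plotted against itself.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without data.csv.
CSV_TEXT = """Sample,AMP,ADP,ATP
1A,239847,239084,987374
1B,245098,241210,988950
2A,238759,200554,921032
2B,230029,215408,899804
"""


def value_columns(df, x_col="Sample"):
    """Hypothetical helper: yield (column name, Series) for every column except x_col."""
    for colname, col in df.items():  # items() replaces iteritems() in pandas >= 2.0
        if colname != x_col:
            yield colname, col


df = pd.read_csv(io.StringIO(CSV_TEXT))

plotted = []
for colname, col in value_columns(df):
    # In the real script, figure() and p.scatter(df["Sample"], col) would go here.
    plotted.append(colname)

print(plotted)
```

With 500 columns, calling show(p) inside the loop opens one browser tab per column, so it may be more practical to collect the figures in a list and lay them out or save them afterwards.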
Related
I have a dataframe of different cereals and want to plot their calories as a barplot. Now I also want to plot the mean value of the calorie values as a lineplot into the same figure as the barplot. I had the idea to put the mean value into a 1x1 dataframe by its own but I got the error
"None of [Index(['mean'], dtype='object')] are in the [columns]"
But I'm not committed to that approach.
I was unable to find a solution myself. Is there one?
My code, including the calculation of the mean value but without showing it in the figure:
import pandas as pd
df = pd.read_csv("cereal.csv")
mn = df["calories"].mean()
df.plot.bar(x="name", y="calories")
If I understand correctly, you have a few bars and would like a single horizontal line for the mean. You can try:
import pandas as pd
df = pd.read_csv("cereal.csv")
mn = df["calories"].mean()
ax = df.plot.bar(x="name", y="calories")
ax.axhline(mn, ls=':')
I am using Python pandas read_excel to create a histogram or line plot. I would like to read in the entire file. It is a large file and I only want to plot certain values from it. I know how to use skiprows and parse_cols in read_excel, but if I do this, it does not read a part of the file that I need to use for the axis labels. I also do not know how to tell it which values to plot as x-values and which as y-values. Here's what I have:
df=pd.read_excel('JanRain.xlsx',parse_cols="C:BD")
years=df[0]
precip=df[31:32]
df.plot.bar()
I want the x-axis to be row 1 of the Excel file (years), and I want each bar in the bar graph to be one of the values in row 31 of the Excel file. I'm not sure how to isolate this. Would it be easier to read with pandas and then plot with matplotlib?
Here is a sample of the Excel file. The first row is years and the second column is days of the month (this file is only for 1 month):
Here's how I would plot the data in row 31 of a large dataframe, setting row 0 as the x-axis. (updated answer)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
create a random array with 32 rows, and 10 columns
df = pd.DataFrame(np.random.rand(320).reshape(32,10), columns=range(64,74), index=range(1,33))
df.to_excel(r"D:\data\data.xlsx")
Read only the columns and rows that you want using "usecols" (called "parse_cols" in pandas before 0.21) and "skiprows". The first column in this example is the dataframe index.
# load desired columns and rows into a dataframe
# in this method, I first make a list of all skipped rows
desired_cols = [0] + list(range(2,9))
skipped_rows = list(range(1,33))
skipped_rows.remove(31)
df = pd.read_excel(r"D:\data\data.xlsx", index_col=0, usecols=desired_cols, skiprows=skipped_rows)
Currently this yields a dataframe with only one row.
          65        66       67        68        69        70        71
31  0.310933  0.606858  0.12442  0.988441  0.821966  0.213625  0.254897
Isolate only the row that you want to plot, giving a pandas.Series with the original column headers as the index.
ser = df.loc[31, :]
Plot the series.
fig, ax = plt.subplots()
ser.plot(ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
fig, ax = plt.subplots()
ser.plot(kind="bar", ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
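If the file is small enough to read in full, an alternative is to load everything and select the row and columns afterwards with .loc. This is a sketch under that assumption. It uses an in-memory DataFrame as a stand-in for the Excel sheet, with the same shape and labels as the random example above.

```python
import numpy as np
import pandas as pd

# Stand-in for the Excel sheet: 32 day rows (1..32) and 10 year columns (64..73).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((32, 10)), columns=range(64, 74), index=range(1, 33))

# Select row 31 and columns 65..71 by label; .loc slices include both endpoints.
ser = df.loc[31, 65:71]

# The Series index holds the column labels, so it becomes the x-axis when plotted.
print(ser.index.tolist())
```

Selecting after loading keeps the header row available for axis labels, at the cost of reading cells that are never plotted.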
I have a DataFrame like this:
I tried these two instructions one after another:
sns.boxplot([dataFrame.mean_qscore_template,dataFrame.mean_qscore_complement,dataFrame.mean_qscore_2d])
sns.boxplot(x = "mean_qscore_template", y= "mean_qscore_complement", hue = "mean_qscore_2d" data = tips)
I want to get mean_qscore_template, mean_qscore_complement and mean_qscore_2d on the x-axis with the measure on y-axis but it doesn't work.
In the documentation they give an example with tips, but my dataFrame is not organized the same way.
sns.boxplot(data = dataFrame) will make boxplots for each numeric column of your dataframe.
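That call works because seaborn treats a wide-form dataframe as one box per numeric column. The x=/y= form used in the tips example expects long-form data instead, which pandas can produce with melt. Here is a minimal sketch. The scores are made up for illustration, and the names score_type and qscore are arbitrary choices, not from the question.

```python
import pandas as pd

# Made-up scores; only the three column names come from the question.
dataFrame = pd.DataFrame({
    "mean_qscore_template": [9.1, 8.7, 9.4],
    "mean_qscore_complement": [8.2, 8.0, 8.9],
    "mean_qscore_2d": [11.3, 10.8, 11.9],
})

# Wide -> long: one row per (original row, score type) pair.
long_df = dataFrame.melt(var_name="score_type", value_name="qscore")

print(long_df.shape)

# With seaborn installed, the long form could then be plotted as:
# sns.boxplot(x="score_type", y="qscore", data=long_df)
```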
I'm working on automating plotting functions for metabolomics data with bokeh. Currently, I'm trying to read in my dataframe from CSV and iterate through the columns generating box plots for each metabolite (column).
I have an example df that looks like this:
Sample Group AMP ADP ATP
1A A 239847 239084 987374
1B A 245098 241210 988950
2A B 238759 200554 921032
2B B 230029 215408 89980
Here is what my code looks like:
import pandas
from bokeh.plotting import figure, output_file, show, save
from bokeh.charts import BoxPlot
df = pandas.read_csv("testdata_2.csv")
for colname, col in df.iteritems():
    p = BoxPlot(df, values=df[colname], label='Group', xlabel='Group', ylabel='Peak Area',
                title=colname)
    output_file("boxplot.html")
    show(p)
This generates an error:
raise ValueError("expected an element of either %s, got %r" % (nice_join(self.type_params), value))
ValueError: expected an element of either Column Name or Column String or List(Column Name or Column String
It seems that setting values=df[colname] is the issue. If I replace it with values='colname', it gives me a key error for colname. I can plot just fine if I specify a given column such as values='ATP', but I need to be able to loop through all columns.
Any guidance? Is this even the best approach?
Thanks in advance.
If you want to organize them horizontally, you can create different graphs, and then you could use for instance hplot from bokeh.io as follows:
import pandas
from bokeh.plotting import figure, output_file, show, save
from bokeh.charts import BoxPlot
from bokeh.io import hplot
df = pandas.read_csv("testdata_2.csv")
p = []
for colname in ['AMP','ADP','ATP']:
    p += [BoxPlot(df, values=colname, label='Group', xlabel='Group',
                  ylabel='Peak Area',title=colname, width=250,height=250)]
output_file("boxplot.html")
show(hplot(*p))
For your particular example I get:
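The key change from the question's code is that values= receives the column name as a string (colname) rather than the column itself (df[colname]). This matches the error message, which asks for a "Column Name or Column String". Hard-coding the list of names will not scale to hundreds of metabolites, so the names can be taken from the dataframe instead. This is a sketch using pandas only. It assumes every numeric column is a metabolite and that Sample and Group are the only text columns.

```python
import io

import pandas as pd

# Example data from the question, inlined so the sketch runs without testdata_2.csv.
CSV_TEXT = """Sample,Group,AMP,ADP,ATP
1A,A,239847,239084,987374
1B,A,245098,241210,988950
2A,B,238759,200554,921032
2B,B,230029,215408,89980
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Assumption: every numeric column is a metabolite; Sample and Group are text.
metabolites = df.select_dtypes(include="number").columns.tolist()

for colname in metabolites:
    # colname is a plain string, which is what values= expects in the answer above.
    print(colname, df.groupby("Group")[colname].median().to_dict())
```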
I am new to python and pandas, and have the following DataFrame.
How can I plot the DataFrame where each ModelID is a separate plot, saledate is the x-axis and MeanToDate is the y-axis?
Attempt
data[40:76].groupby('ModelID').plot()
DataFrame
You can make the plots by looping over the groups from groupby:
import matplotlib.pyplot as plt
for title, group in df.groupby('ModelID'):
    group.plot(x='saleDate', y='MeanToDate', title=title)
See for more information on plotting with pandas dataframes:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
and for looping over a groupby-object:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups
Example with aggregation:
I wanted to do something like the following, if pandas had a colour aesthetic like ggplot:
aggregated = df.groupby(['model', 'training_examples']).aggregate(np.mean)
aggregated.plot(x='training_examples', y='accuracy', label='model')
(columns: model is a string, training_examples is an integer, accuracy is a decimal)
But that just produces a mess.
Thanks to joris's answer, I ended up with:
for index, group in df.groupby(['model']):
    group_agg = group.groupby(['training_examples']).aggregate(np.mean)
    group_agg.plot(y='accuracy', label=index)
I found that title= was just replacing the single title of the plot on each loop iteration, but label= does what you'd expect -- after running plt.legend(), of course.
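Here is a sketch of the same loop with every group drawn on one explicitly shared Axes. It rests on several assumptions:

- The data is made up; only the column names come from the post.
- The Agg backend is selected so that it runs without a display.
- It groups by the string 'model' rather than the list ['model'], since recent pandas versions yield tuple keys for a list.
- It selects the accuracy column before taking the mean, so the string column is not averaged.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up results; only the column names come from the post.
df = pd.DataFrame({
    "model": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "training_examples": [100, 100, 200, 200, 100, 100, 200, 200],
    "accuracy": [0.60, 0.64, 0.70, 0.74, 0.50, 0.54, 0.65, 0.67],
})

fig, ax = plt.subplots()
for name, group in df.groupby("model"):
    # Mean accuracy per training-set size for this model (a Series).
    group_agg = group.groupby("training_examples")["accuracy"].mean()
    # ax= draws every model on the same Axes; label= names the line in the legend.
    group_agg.plot(ax=ax, label=name)

ax.set_xlabel("training_examples")
ax.set_ylabel("accuracy")
ax.legend()
```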