Good morning,
I'm trying to draw box plots of Column1 to Column6 against the 'ElapsedTime(hours)' column using the plotly library.
Here is my dataframe :
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| Date | Time | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | ElapsedTime(hours) |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:48:37 | 0,011535 | 8,4021 | 0,00027 | 0,027806 | 8,431 | 0,000362 | 0 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:59:37 | 0,013458 | 8,4421 | 0,000314 | 0,032214 | 8,4738 | 0,000416 | 0,183333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 07:14:37 | 0,017793 | 8,4993 | 0,000384 | 0,038288 | 8,5372 | 0,000486 | 0,433333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:12:50 | 0,018808 | 8,545 | 0,000414 | 0,042341 | 8,5891 | 0,000539 | 24,9702778 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:42:50 | 0,025931 | 8,3627 | 0,000534 | 0,032379 | 8,3556 | 0,000557 | 25,9036111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:57:50 | 0,025164 | 8,5518 | 0,000505 | 0,041134 | 8,6516 | 0,000254 | 26,1536111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:45:28 | 0,026561 | 8,6266 | 0,000533 | 0,050387 | 8,6718 | 0,00065 | 46,9475 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:55:28 | 0,027744 | 8,6455 | 0,000543 | 0,051511 | 8,6916 | 0,000664 | 47,11416667 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 06:05:28 | 0,028854 | 8,485 | 0,000342 | 0,05693 | 8,6934 | 0,000695 | 47,28083333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
For now, I only know how to draw a box plot of each column on its own, using these lines of code:
import warnings
import pandas as pd
import plotly.graph_objects as go
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category= UserWarning)
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
fig = go.Figure()
fig.add_trace(
go.Box(y=da['Column1'], name='Column1'))
fig.add_trace(
go.Box(y=da['Column2'], name='Column2'))
fig.add_trace(
go.Box(y=da['Column3'], name='Column3'))
fig.add_trace(
go.Box(y=da['Column4'], name='Column4'))
fig.add_trace(
go.Box(y=da['Column5'], name='Column5'))
fig.add_trace(
go.Box(y=da['Column6'], name='Column6'))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
from plotly import offline
offline.plot(fig)
Output :
I can choose to show one column :
What I want (if possible): plot my columns 1 to 6 against ranges of ElapsedTime(hours). For example, if I choose a range of 10 hours, each box should take that range into account and contain all the values that fall inside it.
PS: if I add x=da['ElapsedTime(hours)'] inside go.Box(), each value of columns 1 to 6 is plotted against a single value of the ElapsedTime column, and I don't want that. I want one box per range of ElapsedTime.
Extra: if possible, I want columns 1 to 6 in a dropdown button, so that I can click and choose which column I want to see for the ElapsedTime range I chose.
Thank you for your time, and have a great day !
EDIT:
I tried the lines below. The problem is that I get an error saying the DataFrame has no 'name' attribute (name=data.name), and if I drop name=data.name, the resulting plot is not a box plot. Do you have any idea how to overcome this problem?
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time)
columns = [c for c in da.columns if c.startswith("Column")]
da.set_index("DateTime")[columns].resample("1D")
fig = go.Figure()
for start_datetime, data in da.set_index("DateTime")[columns].resample("1D"):
    fig.add_trace(
        go.Box(x=data.index, y=data.values, name=data.name))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
fig.update_layout(boxmode='group')
from plotly import offline
offline.plot(fig)
Here are some suggestions.
Merge the Date and Time columns into a DateTime column:
import pandas as pd
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
df = da # df is more natural for me ;)
# dayfirst=True because the dates are written as dd/mm/yyyy
df["DateTime"] = pd.to_datetime(df.Date + " " + df.Time, dayfirst=True)
I use the following to mark the interesting columns:
columns = [c for c in df.columns if c.startswith("Column")]
Use the resample method to group the data over some time range.
For example, to group by one day, use
df.set_index("DateTime")[columns].resample("1D")
The result is a resampler object that you can run aggregations on, e.g. compute the mean of each sample:
df.set_index("DateTime")[columns].resample("1D").mean()
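The question asks for ranges of elapsed hours rather than calendar days. Here is a minimal sketch of that variant, using floor division instead of resample. The 10-hour width comes from the question's example; the names bin_width, bin and bin_label are mine, and only Column1 is included to keep it short.

```python
import pandas as pd

# Stand-in for the question's data (Column1 and the elapsed time only).
df = pd.DataFrame({
    "Column1": [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434,
                0.025163517, 0.026561283, 0.027743659, 0.028854],
    "ElapsedTime(hours)": [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111,
                           26.1536111, 46.9475, 47.11416667, 47.28083333],
})

bin_width = 10  # hours; assumption taken from the question's example

# Integer bin number: 0 for [0, 10), 1 for [10, 20), and so on.
df["bin"] = (df["ElapsedTime(hours)"] // bin_width).astype(int)

# Readable label for each bin, e.g. "0-10 h".
df["bin_label"] = df["bin"].map(lambda b: f"{b * bin_width}-{(b + 1) * bin_width} h")

# Number of Column1 values that fall into each bin.
counts = df.groupby("bin_label", sort=False)["Column1"].count()
print(counts)
```

With this data the nine rows land in three bins of three values each. The bin_label column can then serve as the grouping key for any aggregation or plot.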
If you want to leverage plotly's functionality to create the boxplot, I would use a loop though:
for start_datetime, data in df.set_index("DateTime")[columns].resample("1D"):
    print(start_datetime)
    print(data)
    print()
Instead of the print functions, use the plotly commands to create a box in the boxplot.
I've been racking my brain to try to figure out how to plot a pandas DataFrame the way I want but to no avail.
The DataFrame has a MultiIndex and it looks like this:
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| | | | | | run_001 | run_002 | run_003 | run_004 | run_005 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| file_type | server_count | file_count | thread_count | cacheclear_type | | | | | |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| gor | 01servers | 05files | 20threads | ccALWAYS | 15.918 | 16.275 | 15.807 | 17.781 | 16.233 |
| gor | 01servers | 10files | 20threads | ccALWAYS | 17.322 | 17.636 | 16.096 | 16.484 | 16.715 |
| gor | 01servers | 15files | 20threads | ccALWAYS | 19.265 | 17.128 | 17.630 | 18.739 | 16.833 |
| gor | 01servers | 20files | 20threads | ccALWAYS | 23.744 | 20.539 | 21.416 | 22.921 | 22.794 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
What I want to do is plot a line graph where the x values are the 'file_count' value, and the y value for each is the average of all the run_xxx values for the corresponding line in the DataFrame.
If possible I would like to add error bars and even the data points themselves so that I can see the distribution of the data behind that average.
Here's a (crappy) mockup of roughly what I'm talking about:
I've been able to create a boxplot using the boxplot() function built into pandas' DataFrame by doing:
df.transpose().boxplot()
This looks almost okay but a little bit cluttered and doesn't have the actual data points plotted.
A beeswarm plot works very nicely in this situation, especially when you have a lot of dots and want to show their distribution. You do, however, need to supply the positions parameter to beeswarm, because by default it starts at 0. The boxplot method of a pandas DataFrame, on the other hand, plots boxes at x = 1, 2, ...
It comes down to just these:
import numpy as np
from beeswarm import *
D1 = beeswarm(df.values, positions = np.arange(len(df.values))+1)
D2 = df.transpose().boxplot(ax=D1[1])
For completeness I'll include the way I finally managed to do this here:
import numpy as np
import matplotlib.pyplot as plt
import random
# sortlevel() was removed from pandas; sort_index(level=...) replaces it
dft = df.sort_index(level=2).transpose()
fig, ax = plt.subplots()
x = []
y = []
y_err = []
scatterx = []
scattery = []
for n, col in enumerate(dft.columns):
    x.append(n)
    y.append(np.mean(dft[col]))
    y_err.append(np.std(dft[col]))
    for v in dft[col]:
        scattery.append(v)
        # small random horizontal jitter so overlapping points stay visible
        scatterx.append(n + ((random.random()-0.5)*0.05))
label = "mean of runs"  # placeholder; the snippet as posted left label undefined
p = plt.plot(x, y, label=label)
color=p[0].get_color()
plt.errorbar(x, y, yerr=y_err, fmt=color)
plt.scatter(scatterx, scattery, alpha=0.3, color=color)
plt.legend(loc=2)
ax.set_xticks(range(len(dft.columns)))
# level 2 of each column tuple is the file_count value
ax.set_xticklabels([idx[2] for idx in dft.columns])
plt.show()
This will show a line chart with error bars and data points. There may be some errors in the above code. I copied it and simplified a bit before pasting here.
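For comparison, here is a shorter sketch of the same idea that lets pandas compute the per-row mean and standard deviation. It makes some assumptions: the frame is rebuilt from the four rows shown in the question with only file_count kept as the index (the real frame has a five-level MultiIndex), and the names runs, mean, std and rng are mine.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The four rows from the question, indexed by file_count only (simplification).
runs = pd.DataFrame(
    [[15.918, 16.275, 15.807, 17.781, 16.233],
     [17.322, 17.636, 16.096, 16.484, 16.715],
     [19.265, 17.128, 17.630, 18.739, 16.833],
     [23.744, 20.539, 21.416, 22.921, 22.794]],
    index=pd.Index(["05files", "10files", "15files", "20files"], name="file_count"),
    columns=["run_001", "run_002", "run_003", "run_004", "run_005"],
)

mean = runs.mean(axis=1)        # average over the runs, one value per row
std = runs.std(axis=1, ddof=0)  # population std, matching np.std's default
x = np.arange(len(runs))

fig, ax = plt.subplots()
# Line through the means with error bars of one standard deviation.
ax.errorbar(x, mean, yerr=std, marker="o", capsize=3, label="mean of runs")

# Overlay the raw data points with a small horizontal jitter.
rng = np.random.default_rng(0)
for col in runs.columns:
    jitter = rng.uniform(-0.025, 0.025, size=len(x))
    ax.scatter(x + jitter, runs[col], alpha=0.3, color="C0")

ax.set_xticks(x)
ax.set_xticklabels(runs.index)
ax.legend(loc=2)
```

Call plt.show() to display the figure. With the full MultiIndex frame, the tick labels would come from the file_count level, e.g. via df.index.get_level_values("file_count").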