pandas plot columns from two dataframes in one figure - python

I have a dataframe consisting of mean and std-dev of distributions
df.head()
+---+---------+----------------+-------------+---------------+------------+
| | user_id | session_id | sample_mean | sample_median | sample_std |
+---+---------+----------------+-------------+---------------+------------+
| 0 | 1 | 20081023025304 | 4.972789 | 5 | 0.308456 |
| 1 | 1 | 20081023025305 | 5.000000 | 5 | 1.468418 |
| 2 | 1 | 20081023025306 | 5.274419 | 5 | 4.518189 |
| 3 | 1 | 20081024020959 | 4.634855 | 5 | 1.387244 |
| 4 | 1 | 20081026134407 | 5.088195 | 5 | 2.452059 |
+---+---------+----------------+-------------+---------------+------------+
From this, I plot a histogram of the distribution
plt.hist(df['sample_mean'],bins=50)
plt.xlabel('sampling rate (sec)')
plt.ylabel('Frequency')
plt.title('Histogram of trips mean sampling rate')
plt.show()
I then write a function to compute pdf and cdf, passing dataframe and column name:
def compute_distrib(df, col):
    stats_df = df.groupby(col)[col].agg('count').pipe(pd.DataFrame).rename(columns = {col: 'frequency'})
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    return stats_df
So for example:
stats_df = compute_distrib(df, 'sample_mean')
stats_df.head(2)
+---+---------------+-----------+----------+----------+
| | sample_median | frequency | pdf | cdf |
+---+---------------+-----------+----------+----------+
| 0 | 1 | 4317 | 0.143575 | 0.143575 |
| 1 | 2 | 10169 | 0.338200 | 0.481775 |
+---+---------------+-----------+----------+----------+
Then I plot the cdf distribution this way:
ax1 = stats_df.plot(x = 'sample_mean', y = ['cdf'], grid = True)
ax1.legend(loc='best')
Goal:
I would like to plot these figures in one figure side-by-side instead of plotting separately and somehow putting them together in my slides.

You can use matplotlib.pyplot.subplots to draw multiple plots next to each other:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=1, ncols=2)
# Pass the data you wish to plot.
# With nrows=1, axs is a 1-D array of Axes, so index it once.
axs[0].hist(...)
axs[1].plot(...)
plt.show()
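As a rough, more complete sketch of the same idea, the histogram and the CDF from the question can be drawn on the two axes by passing ax= to the pandas plot call. The data below is randomly generated for illustration only (it is not the asker's data), and plot_side_by_side is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def compute_distrib(df, col):
    """Return frequency, pdf and cdf for each distinct value of df[col]."""
    stats_df = (
        df[col].value_counts().sort_index()
        .rename("frequency").rename_axis(col).reset_index()
    )
    stats_df["pdf"] = stats_df["frequency"] / stats_df["frequency"].sum()
    stats_df["cdf"] = stats_df["pdf"].cumsum()
    return stats_df


def plot_side_by_side(df, col):
    """Histogram on the left axes, CDF on the right axes, in one figure."""
    stats_df = compute_distrib(df, col)
    fig, (ax_hist, ax_cdf) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

    # Left panel: plain matplotlib histogram of the raw column.
    ax_hist.hist(df[col], bins=50)
    ax_hist.set_xlabel("sampling rate (sec)")
    ax_hist.set_ylabel("Frequency")
    ax_hist.set_title("Histogram of trips mean sampling rate")

    # Right panel: pandas draws onto the existing axes via ax=.
    stats_df.plot(x=col, y="cdf", grid=True, ax=ax_cdf)
    ax_cdf.legend(loc="best")

    fig.tight_layout()
    return fig, stats_df


# Made-up sample data standing in for the real dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({"sample_mean": rng.normal(5, 1, 1000).round(1)})
fig, stats_df = plot_side_by_side(df, "sample_mean")
```

Calling plt.show() afterwards displays the figure. Because each panel only needs an Axes object, the two panels can just as well be fed from two different dataframes.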

Related

How do I cluster two stacked bars using matplotlib/python?

I am trying to create a plot that has two stacked bars, side by side, for each FiscalYear.
using matplotlib / python, and I can't see how to "group" the "stacked bars".
This post How to have clusters of stacked bars with python (Pandas) is very close to what I'm trying to do, but I've not had any success finding the solution.
I can create the stacked bars, but not break them down into clusters or groups.
How do I turn this data
+----+--------------+---------------+-----------+----------+------------+
| | FiscalYear | Unallocated | Planned | Actual | Forecast |
|----+--------------+---------------+-----------+----------+------------|
| 0 | 2022 | 744765 | 685998 | 516718 | 442575 |
| 1 | 2023 | 51459 | 323787 | 372689 | 9759 |
| 2 | 2024 | 976143 | 560108 | 255508 | 36041 |
| 3 | 2025 | 695902 | 471972 | 464622 | 332749 |
| 4 | 2026 | 165179 | 345003 | 416089 | 729036 |
+----+--------------+---------------+-----------+----------+------------+
into this picture?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tabulate import tabulate
df = pd.DataFrame(np.random.randint(1000, 1000000, size=(5, 4)),
                  columns=['Unallocated', 'Planned', 'Actual', 'Forecast'])
df.insert(loc=0,
          column='FiscalYear',
          value=[2022, 2023, 2024, 2025, 2026])
print(tabulate(df, headers='keys', tablefmt='psql'))
labels = df['FiscalYear']
u = df['Unallocated']
p = df['Planned']
a = df['Actual']
f = df['Forecast']
width = 0.35 # the width of the bars: can also be len(x) sequence
fig, ax = plt.subplots()
ax.bar(labels, u, width, label='o')
ax.bar(labels, p, width, bottom=u, label='p')
ax.bar(labels, a, width, label='a')
ax.bar(labels, f, width, bottom=a, label='f')
ax.legend()
plt.show()
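One possible way to get two stacks per year, sketched here under the assumption that Unallocated + Planned form one stack and Actual + Forecast the other (as the bottom= arguments in the attempt above suggest), is to place the bars on numeric positions and shift each stack by half a bar width. The values are the ones from the table in the question:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "FiscalYear": [2022, 2023, 2024, 2025, 2026],
    "Unallocated": [744765, 51459, 976143, 695902, 165179],
    "Planned": [685998, 323787, 560108, 471972, 345003],
    "Actual": [516718, 372689, 255508, 464622, 416089],
    "Forecast": [442575, 9759, 36041, 332749, 729036],
})

x = np.arange(len(df))  # one numeric slot per fiscal year
width = 0.35            # width of a single stack

fig, ax = plt.subplots()

# Left stack of each cluster: Unallocated with Planned on top.
ax.bar(x - width / 2, df["Unallocated"], width, label="Unallocated")
ax.bar(x - width / 2, df["Planned"], width,
       bottom=df["Unallocated"], label="Planned")

# Right stack of each cluster: Actual with Forecast on top.
ax.bar(x + width / 2, df["Actual"], width, label="Actual")
ax.bar(x + width / 2, df["Forecast"], width,
       bottom=df["Actual"], label="Forecast")

# Put the year labels under the middle of each cluster.
ax.set_xticks(x)
ax.set_xticklabels(df["FiscalYear"])
ax.set_xlabel("FiscalYear")
ax.legend()
```

In the original attempt all four bar calls share the same x positions, so the second stack is drawn over the first; offsetting by plus or minus width / 2 is what separates them. Call plt.show() to display the result.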

How to boxplot different columns from a dataframe (y axis) vs groupby a range of hours (x axis) using plotly

Good morning,
I'm trying to boxplot the 'columns' from 1 to 6 vs the 'ElapsedTime(hours)' column with the use of plotly library.
Here is my dataframe :
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| Date | Time | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | ElapsedTime(hours) |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:48:37 | 0,011535 | 8,4021 | 0,00027 | 0,027806 | 8,431 | 0,000362 | 0 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:59:37 | 0,013458 | 8,4421 | 0,000314 | 0,032214 | 8,4738 | 0,000416 | 0,183333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 07:14:37 | 0,017793 | 8,4993 | 0,000384 | 0,038288 | 8,5372 | 0,000486 | 0,433333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:12:50 | 0,018808 | 8,545 | 0,000414 | 0,042341 | 8,5891 | 0,000539 | 24,9702778 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:42:50 | 0,025931 | 8,3627 | 0,000534 | 0,032379 | 8,3556 | 0,000557 | 25,9036111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:57:50 | 0,025164 | 8,5518 | 0,000505 | 0,041134 | 8,6516 | 0,000254 | 26,1536111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:45:28 | 0,026561 | 8,6266 | 0,000533 | 0,050387 | 8,6718 | 0,00065 | 46,9475 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:55:28 | 0,027744 | 8,6455 | 0,000543 | 0,051511 | 8,6916 | 0,000664 | 47,11416667 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 06:05:28 | 0,028854 | 8,485 | 0,000342 | 0,05693 | 8,6934 | 0,000695 | 47,28083333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
For now, I only know how to boxplot each column on its own, using these lines of code:
import warnings
import pandas as pd
import plotly.graph_objects as go
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category= UserWarning)
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
fig = go.Figure()
fig.add_trace(
go.Box(y=da['Column1'], name='Column1'))
fig.add_trace(
go.Box(y=da['Column2'], name='Column2'))
fig.add_trace(
go.Box(y=da['Column3'], name='Column3'))
fig.add_trace(
go.Box(y=da['Column4'], name='Column4'))
fig.add_trace(
go.Box(y=da['Column5'], name='Column5'))
fig.add_trace(
go.Box(y=da['Column6'], name='Column6'))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
from plotly import offline
offline.plot(fig)
Output :
I can choose to show one column :
What I want (if possible): plot my columns 1 to 6 against ranges of ElapsedTime(hours). For example, if I choose a range of 10 hours, each box should collect all the values that fall within one 10-hour range.
PS: if I add x=da['ElapsedTime(hours)'] inside go.Box(), each value of columns 1 to 6 is plotted against a single value of the ElapsedTime column, and I don't want that. I want one box per range of ElapsedTime.
Extra: if possible, I want columns 1 to 6 in a dropdown button so that I can click and choose which column to show for the ElapsedTime range I chose.
Thank you for your time, and have a great day !
EDIT:
I tried these lines. The problem is that I get an error saying the DataFrame doesn't have a name attribute (name=data.name), and if I drop name=data.name, I get a plot that is not a box plot. Do you have any idea how to overcome this problem?
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time)
columns = [c for c in da.columns if c.startswith("Column")]
da.set_index("DateTime")[columns].resample("1D")
fig = go.Figure()
for start_datetime, data in da.set_index("DateTime")[columns].resample("1D"):
    fig.add_trace(
        go.Box(x=data.index, y=data.values, name=data.name))
fig.update_layout(legend=dict(
yanchor="top",
y=1.24,
xanchor="left",
x=0.15
))
fig.update_layout(boxmode='group')
from plotly import offline
offline.plot(fig)
Here are some suggestions.
Merge the Date and Time columns into a DateTime column:
import pandas as pd
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
# Dates are day-first (29/07/2021), so say so explicitly.
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time, dayfirst=True)
df = da # df is more natural for me ;)
I use the following to mark the interesting columns:
columns = [c for c in df.columns if c.startswith("Column")]
Use the resample method to group data over some time range.
For example to aggregate over one day, use
df.set_index("DateTime")[columns].resample("1D")
The result is a resampler object on which you can run aggregations, e.g. compute the mean of each sample:
df.set_index("DateTime")[columns].resample("1D").mean()
If you want to leverage plotly's functionality to create the boxplot, I would use a loop though:
for start_datetime, data in df.set_index("DateTime")[columns].resample("1D"):
    print(start_datetime)
    print(data)
    print()
Instead of the print functions, use the plotly commands to create a box in the boxplot.

how to differentiate with color values from two columns both on x-axis with matplotlib python?

I am trying to do a plot that has on x axis dates and on y some values. But I have two columns as dates. I would like to highlight the date of the second column with a dot of another color. Is it possible?
|---------------------|------------------|------------------|------------------|
| ID | Date1 | Date2 | value |
|---------------------|------------------|------------------|------------------|
| 1 | 2008-05-14 | 2010-03-28 | 5 |
|---------------------|------------------|------------------|------------------|
| 1 | 2005-12-07 | 2010-03-28 | 3 |
|---------------------|------------------|------------------|------------------|
| 1 | 2008-10-27 | 2010-03-28 | 6 |
df1 = df[df['ID']== 1]
df1= df1.sort_values(by='Date1')
date = df1['Date1']
res = df1['value']
fig, ax = plt.subplots()
ax.plot(date, res, 'o-')

Barplot comparing two columns

I would like to draw a barplot graph that would compare the evolution of 2 variables of revenues on a monthly time-axis (12 months of invoices).
I wanted to use sns.barplot, but can't use "hue" (cause the 2 variables aren't subcategories?). Is there another way, as simple as with hue? Can I "create" a hue?
Here is a small sample of my data:
(I did transform my table into a pivot table)
[In]
data_pivot['Revenue-Small-Seller-in'] = data_pivot["Small-Seller"] + data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot['Revenue-Not-Small-Seller-in'] = data_pivot["Best-Seller"] + data_pivot["Medium-Seller"]
data_pivot
[Out]
InvoiceNo Month Year Revenue-Small-Seller-in Revenue-Not-Small-Seller-in
536365 12 2010 139.12 139.12
536366 12 2010 22.20 11.10
536367 12 2010 278.73 246.93
(sorry for the ugly presentation of my data, see the picture to see the complete table (as there are multiple columns))
You can do:
render_df = data_pivot[data_pivot.columns[-2:]]
fig, ax = plt.subplots(1,1)
render_df.plot(kind='bar', ax=ax)
ax.legend()
plt.show()
Output:
Or sns style like you requested
render_df = data_pivot[data_pivot.columns[-2:]].stack().reset_index()
sns.barplot(x='level_0', y=0, hue='level_1',
            data=render_df)
here render_df after stack() is:
+---+---------+-----------------------------+--------+
| | level_0 | level_1 | 0 |
+---+---------+-----------------------------+--------+
| 0 | 0 | Revenue-Small-Seller-in | 139.12 |
| 1 | 0 | Revenue-Not-Small-Seller-in | 139.12 |
| 2 | 1 | Revenue-Small-Seller-in | 22.20 |
| 3 | 1 | Revenue-Not-Small-Seller-in | 11.10 |
| 4 | 2 | Revenue-Small-Seller-in | 278.73 |
| 5 | 2 | Revenue-Not-Small-Seller-in | 246.93 |
+---+---------+-----------------------------+--------+
and output:
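For readers who prefer named columns over level_0 / level_1 / 0, here is a small sketch of the same wide-to-long reshaping using melt. It uses only the three sample rows from the question, and the names long_df, RevenueType and Revenue are made up for this example:

```python
import pandas as pd

# The three sample rows shown in the question.
data_pivot = pd.DataFrame({
    "InvoiceNo": [536365, 536366, 536367],
    "Revenue-Small-Seller-in": [139.12, 22.20, 278.73],
    "Revenue-Not-Small-Seller-in": [139.12, 11.10, 246.93],
})

# Wide -> long: one row per (invoice, revenue type) pair.
long_df = data_pivot.melt(
    id_vars="InvoiceNo",
    value_vars=["Revenue-Small-Seller-in", "Revenue-Not-Small-Seller-in"],
    var_name="RevenueType",   # becomes the hue
    value_name="Revenue",     # becomes the bar height
)
print(long_df)
```

The long frame can then be passed to seaborn with keyword arguments, e.g. sns.barplot(data=long_df, x="InvoiceNo", y="Revenue", hue="RevenueType"). This is effectively how a hue gets "created": the two column names become the values of a single categorical column.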

Plotting a line plot with error bars and datapoints from a pandas DataFrame

I've been racking my brain to try to figure out how to plot a pandas DataFrame the way I want but to no avail.
The DataFrame has a MultiIndex and it looks like this:
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| | | | | | run_001 | run_002 | run_003 | run_004 | run_005 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| file_type | server_count | file_count | thread_count | cacheclear_type | | | | | |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| gor | 01servers | 05files | 20threads | ccALWAYS | 15.918 | 16.275 | 15.807 | 17.781 | 16.233 |
| gor | 01servers | 10files | 20threads | ccALWAYS | 17.322 | 17.636 | 16.096 | 16.484 | 16.715 |
| gor | 01servers | 15files | 20threads | ccALWAYS | 19.265 | 17.128 | 17.630 | 18.739 | 16.833 |
| gor | 01servers | 20files | 20threads | ccALWAYS | 23.744 | 20.539 | 21.416 | 22.921 | 22.794 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
What I want to do is plot a line graph where the x values are the 'file_count' value, and the y value for each is the average of all the run_xxx values for the corresponding line in the DataFrame.
If possible I would like to add error bars and even the data points themselves so that I can see the distribution of the data behind that average.
Here's a (crappy) mockup of roughly what I'm talking about:
I've been able to create a boxplot using the boxplot() function built into pandas' DataFrame by doing:
df.transpose().boxplot()
This looks almost okay but a little bit cluttered and doesn't have the actual data points plotted.
A beeswarm plot works very nicely in this situation, especially when you have a lot of dots and want to show their distribution. You do, however, need to supply the positions parameter to beeswarm, because by default it starts at 0, whereas the boxplot method of a pandas DataFrame plots boxes at x = 1, 2, ...
It comes down to just these:
from beeswarm import *
D1 = beeswarm(df.values, positions = np.arange(len(df.values))+1)
D2 = df.transpose().boxplot(ax=D1[1])
For completeness I'll include the way I finally managed to do this here:
import numpy as np
import matplotlib.pyplot as plt
import random
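# sortlevel() no longer exists in current pandas; sort_index(level=...) replaces it
dft = df.sort_index(level=2).transpose()
fig, ax = plt.subplots()
x = []
y = []
y_err = []
scatterx = []
scattery = []
for n, col in enumerate(dft.columns):
    x.append(n)
    y.append(np.mean(dft[col]))
    y_err.append(np.std(dft[col]))
    for v in dft[col]:
        scattery.append(v)
        # small random horizontal jitter so overlapping points stay visible
        scatterx.append(n + ((random.random()-0.5)*0.05))
# the label variable was not defined in the pasted snippet; a fixed string is used here
p = plt.plot(x, y, label='mean')
color = p[0].get_color()
plt.errorbar(x, y, yerr=y_err, fmt=color)
plt.scatter(scatterx, scattery, alpha=0.3, color=color)
plt.legend(loc=2)
ax.set_xticks(range(len(dft.columns)))
# element 2 of each column tuple is the file_count level
ax.set_xticklabels([c[2] for c in dft.columns])
plt.show()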
This will show a line chart with error bars and data points. There may be some errors in the above code. I copied it and simplified a bit before pasting here.
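A shorter, self-contained variant of the same idea, sketched with the four rows from the question. As a simplification it keeps only file_count as the index (the other MultiIndex levels are constant in the sample), and it lets pandas compute the row-wise mean and standard deviation instead of looping:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The run_xxx values from the question, indexed by file_count only.
runs = pd.DataFrame(
    {
        "run_001": [15.918, 17.322, 19.265, 23.744],
        "run_002": [16.275, 17.636, 17.128, 20.539],
        "run_003": [15.807, 16.096, 17.630, 21.416],
        "run_004": [17.781, 16.484, 18.739, 22.921],
        "run_005": [16.233, 16.715, 16.833, 22.794],
    },
    index=pd.Index(["05files", "10files", "15files", "20files"],
                   name="file_count"),
)

means = runs.mean(axis=1)  # average over the runs, one value per row
stds = runs.std(axis=1)    # sample std (ddof=1); np.std defaults to ddof=0

x = np.arange(len(runs))
fig, ax = plt.subplots()

# Line through the means, with error bars of one standard deviation.
ax.errorbar(x, means, yerr=stds, marker="o", capsize=3, label="mean ± std")

# Individual runs as faint points behind the line.
for pos, (_, row) in zip(x, runs.iterrows()):
    ax.scatter([pos] * len(row), row.values, alpha=0.3, color="gray")

ax.set_xticks(x)
ax.set_xticklabels(runs.index)
ax.set_xlabel("file_count")
ax.legend(loc=2)
```

Call plt.show() to display it. Note that the pandas std here differs slightly from the np.std used above, because the two default to different ddof values.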
