Boxplot and Scatterplot python - python

I have a time series data on which I would like to build a overlayed scatterplot and boxplot. The data is as so:
TokenUsed date
0 8 2020-01-05
1 8 2020-01-05
2 8 2020-01-05
3 8 2020-01-05
4 8 2020-01-05
... ... ...
51040 7 2020-02-23
51041 7 2020-02-23
51042 7 2020-02-23
51043 7 2020-02-23
51044 7 2020-02-23
This time series can be neatly shown as a boxplot (I've had trouble with the x-axis being a date, but solved it converting it to string). Now I would like to show only the data on which sum is superior to a threshold (>81) in my case. The code and the resulting image are below:
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax= ax, whis=[0,100])
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()
When I add a scatter plot, I get image (2) and by filtering only those >81 I get image(3). What I don't understand is why it can't seem to match the x-axis between the two graphs!
Code:
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax= ax, whis=[0,100])
# Without filter
ax = sns.scatterplot(x="date", y="TokenUsed", data=df, ax= ax,color=".25")
# Filter
ax = sns.scatterplot(x="date", y="TokenUsed", data=df[df["TokenUsed"]>81], ax= ax,color=".25")
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()

Answer:
Try editing your filtering such that no rows of df are actually removed. That is, apply a mask specifically on the TokenUsed column, such that values are replaced with NaN (rather than the whole row being removed). Here's how I would implement this:
#make a new copy df, use that to plot
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] < 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax= ax,color=".25")
Explanation
Caveat: this is really my understanding of what is going on from my own observations; I am not actually aware of the implementation behind the scenes
seaborn is less aware of the dates then you are expecting. When creating the boxplot and using the date column for the x-axis, seaborn groups the data by each unique value in the date column. It orders these strings and then creates an integer position for each of them (starting from 0). The y-data are then plotted against these integer values, and the x-tick-labels are replaced with the corresponding string value. So in your case, there are 8 unique date strings, and they are plotted at x-positions from 0 to 7. Also, it doesn't actually matter that they look like dates. You could add more string values to the date column; their position relative to prior data would depend on their alphabetical order (e.g. I would guess the string '00-00-0000' would appear first and the string '999' would appear last).
The filter df[df["TokenUsed"]>81] removes any rows where the TokenUsed value is below 81. This means that the filtered DataFrame will not have as many string date values as the original data. This then creates the unexpected result when plotting. In your filtered data, the first date with values above 81 is 2020-02-09. So in the scatterplot call, those values get plotted at x=0, which is confusing because the values from 2020-01-05 were plotted at x=0 in the call to boxplot.
The fix is to make sure all the original dates are still present in the filtered data, but to replace the filtered out values with NaN so nothing gets plotted.
Here is the example I used to test this:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# fake data, only one date has values over 80
dr = ['01-05-2020'] * 100 + ['01-12-2020'] * 100 + ['01-19-2020'] * 100
data = list(np.random.randint(0,80,200)) + list(np.random.randint(50,150,100))
df = pd.DataFrame({'date':dr, 'TokenUsed':data})
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0,100])
df2 = df.copy()
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] < 81)
# the fix
df2 = df.copy()
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] < 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax= ax,color=".25")
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()
If I use the same filter that you applied, I get the same issue.

Related

plotting area plot as a subplot [duplicate]

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. what I really want is to have them all in the same plot as subplots, but I'm unfortunately failing to come up with a solution to how and would highly appreciate some help.
You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2x2):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2)
df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...
Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
You can see e.gs. in the documentation demonstrating joris answer. Also from the documentation, you could also set subplots=True and layout=(,) within the pandas plot function:
df.plot(subplots=True, layout=(1,2))
You could also use fig.add_subplot() which takes subplot grid parameters such as 221, 222, 223, 224, etc. as described in the post here. Nice examples of plot on pandas data frame, including subplots, can be seen in this ipython notebook.
You can plot multiple subplots of multiple pandas data frames using matplotlib with a simple trick of making a list of all data frame. Then using the for loop for plotting subplots.
Working code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count=0
for r in range(nrow):
for c in range(ncol):
df_list[count].plot(ax=axes[r,c])
count+=1
Using this code you can plot subplots in any configuration. You need to define the number of rows nrow and the number of columns ncol. Also, you need to make list of data frames df_list which you wanted to plot.
You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:
plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())
etc...
You can use this:
fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)
ax = fig.add_subplot(222)
plt.plot(x,z)
...
plt.show()
You may not need to use Pandas at all. Here's a matplotlib plot of cat frequencies:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
f, axes = plt.subplots(2, 1)
for c, i in enumerate(axes):
axes[c].plot(x, y)
axes[c].set_title('cats')
plt.tight_layout()
Option 1: Create subplots from a dictionary of dataframes with long (tidy) data
Assumptions:
There is a dictionary of multiple dataframes of tidy data that are either:
Created by reading in from files
Created by separating a single dataframe into multiple dataframes
The categories, cat, may be overlapping, but all dataframes don't necessarily contain all values of cat
hue='cat'
This example uses a dict of dataframes, but a list of dataframes would be similar.
If the dataframes are wide, use pandas.DataFrame.melt to convert them to long form.
Because dataframes are being iterated through, there's no guarantee that colors will be mapped the same for each plot
A custom color map needs to be created from the unique 'cat' values for all the dataframes
Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
Imports and Test Data
import pandas as pd
import numpy as np # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch # for custom legend - square patches
from matplotlib.lines import Line2D # for custom legend - round markers
import seaborn as sns
import math import ceil # determine correct number of subplot
# synthetic data
df_dict = dict()
for i in range(1, 7):
np.random.seed(i) # for repeatable sample data
data_length = 100
data = {'cat': np.random.choice(['A', 'B', 'C'], size=data_length),
'x': np.random.rand(data_length), 'y': np.random.rand(data_length)}
df_dict[i] = pd.DataFrame(data)
# display(df_dict[1].head())
cat x y
0 B 0.944595 0.606329
1 A 0.586555 0.568851
2 A 0.903402 0.317362
3 B 0.137475 0.988616
4 B 0.139276 0.579745
# display(df_dict[6].tail())
cat x y
95 B 0.881222 0.263168
96 A 0.193668 0.636758
97 A 0.824001 0.638832
98 C 0.323998 0.505060
99 C 0.693124 0.737582
Create color mappings and plot
# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()} # get unique cats
colors = sns.color_palette('tab10', n_colors=len(unique_cat)) # get a number of colors
cmap = dict(zip(unique_cat, colors)) # zip values to colors
col_nums = 3 # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums) # how many rows of plots
# create the figue and axes
fig, axes = plt.subplots(row_nums, col_nums, figsize=(9, 6), sharex=True, sharey=True)
# convert to 1D array for easy iteration
axes = axes.flat
# iterate through dictionary and plot
for ax, (k, v) in zip(axes, df_dict.items()):
sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap, ax=ax)
sns.despine(top=True, right=True)
ax.legend_.remove() # remove the individual plot legends
ax.set_title(f'dataset = {k}', fontsize=11)
fig.tight_layout()
# create legend from cmap
# patches = [Patch(color=v, label=k) for k, v in cmap.items()] # square patches
patches = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in cmap.items()] # round markers
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(title='cat', handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0, frameon=False)
plt.show()
Option 2: Create subplots from a single dataframe with multiple separate datasets
The dataframes must be in a long form with the same column names.
This option uses pd.concat to combine multiple dataframes into a single dataframe, and .assign to add a new column.
See Import multiple csv files into pandas and concatenate into one DataFrame for creating a single dataframes from a list of files.
This option is easier because it doesn't require manually mapping colors to 'cat'
Combine DataFrames
# using df_dict, with dataframes as values, from the top
# combine all the dataframes in df_dict to a single dataframe with an identifier column
df = pd.concat((v.assign(dataset=k) for k, v in df_dict.items()), ignore_index=True)
# display(df.head())
cat x y dataset
0 B 0.944595 0.606329 1
1 A 0.586555 0.568851 1
2 A 0.903402 0.317362 1
3 B 0.137475 0.988616 1
4 B 0.139276 0.579745 1
# display(df.tail())
cat x y dataset
595 B 0.881222 0.263168 6
596 A 0.193668 0.636758 6
597 A 0.824001 0.638832 6
598 C 0.323998 0.505060 6
599 C 0.693124 0.737582 6
Plot a FacetGrid with seaborn.relplot
sns.relplot(kind='scatter', data=df, x='x', y='y', hue='cat', col='dataset', col_wrap=3, height=3)
Both options create the same result, however, it's less complicated to combine all the dataframes, and plot a figure-level plot with sns.relplot.
Building on #joris response above, if you have already established a reference to the subplot, you can use the reference as well. For example,
ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...
df.plot.barh(ax=ax1, stacked=True)
Here is a working pandas subplot example, where modes is the column names of the dataframe.
dpi=200
figure_size=(20, 10)
fig, ax = plt.subplots(len(modes), 1, sharex="all", sharey="all", dpi=dpi)
for i in range(len(modes)):
ax[i] = pivot_df.loc[:, modes[i]].plot.bar(figsize=(figure_size[0], figure_size[1]*len(modes)),
ax=ax[i], title=modes[i], color=my_colors[i])
ax[i].legend()
fig.suptitle(name)
import numpy as np
import pandas as pd
imoprt matplotlib.pyplot as plt
fig, ax = plt.subplots(2,2)
df = pd.DataFrame({'A':np.random.randint(1,100,10),
'B': np.random.randint(100,1000,10),
'C':np.random.randint(100,200,10)})
for ax in ax.flatten():
df.plot(ax =ax)

How to do kde plot in pyplot.subplots context? [duplicate]

I have a few Pandas DataFrames sharing the same value scale, but having different columns and indices. When invoking df.plot(), I get separate plot images. what I really want is to have them all in the same plot as subplots, but I'm unfortunately failing to come up with a solution to how and would highly appreciate some help.
You can manually create the subplots with matplotlib, and then plot the dataframes on a specific subplot using the ax keyword. For example for 4 subplots (2x2):
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2)
df1.plot(ax=axes[0,0])
df2.plot(ax=axes[0,1])
...
Here axes is an array which holds the different subplot axes, and you can access one just by indexing axes.
If you want a shared x-axis, then you can provide sharex=True to plt.subplots.
You can see e.gs. in the documentation demonstrating joris answer. Also from the documentation, you could also set subplots=True and layout=(,) within the pandas plot function:
df.plot(subplots=True, layout=(1,2))
You could also use fig.add_subplot() which takes subplot grid parameters such as 221, 222, 223, 224, etc. as described in the post here. Nice examples of plot on pandas data frame, including subplots, can be seen in this ipython notebook.
You can plot multiple subplots of multiple pandas data frames using matplotlib with a simple trick of making a list of all data frame. Then using the for loop for plotting subplots.
Working code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# dataframe sample data
df1 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df2 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df3 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df4 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df5 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
df6 = pd.DataFrame(np.random.rand(10,2)*100, columns=['A', 'B'])
#define number of rows and columns for subplots
nrow=3
ncol=2
# make a list of all dataframes
df_list = [df1 ,df2, df3, df4, df5, df6]
fig, axes = plt.subplots(nrow, ncol)
# plot counter
count=0
for r in range(nrow):
for c in range(ncol):
df_list[count].plot(ax=axes[r,c])
count+=1
Using this code you can plot subplots in any configuration. You need to define the number of rows nrow and the number of columns ncol. Also, you need to make list of data frames df_list which you wanted to plot.
You can use the familiar Matplotlib style calling a figure and subplot, but you simply need to specify the current axis using plt.gca(). An example:
plt.figure(1)
plt.subplot(2,2,1)
df.A.plot() #no need to specify for first axis
plt.subplot(2,2,2)
df.B.plot(ax=plt.gca())
plt.subplot(2,2,3)
df.C.plot(ax=plt.gca())
etc...
You can use this:
fig = plt.figure()
ax = fig.add_subplot(221)
plt.plot(x,y)
ax = fig.add_subplot(222)
plt.plot(x,z)
...
plt.show()
You may not need to use Pandas at all. Here's a matplotlib plot of cat frequencies:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
f, axes = plt.subplots(2, 1)
for c, i in enumerate(axes):
axes[c].plot(x, y)
axes[c].set_title('cats')
plt.tight_layout()
Option 1: Create subplots from a dictionary of dataframes with long (tidy) data
Assumptions:
There is a dictionary of multiple dataframes of tidy data that are either:
Created by reading in from files
Created by separating a single dataframe into multiple dataframes
The categories, cat, may be overlapping, but all dataframes don't necessarily contain all values of cat
hue='cat'
This example uses a dict of dataframes, but a list of dataframes would be similar.
If the dataframes are wide, use pandas.DataFrame.melt to convert them to long form.
Because dataframes are being iterated through, there's no guarantee that colors will be mapped the same for each plot
A custom color map needs to be created from the unique 'cat' values for all the dataframes
Since the colors will be the same, place one legend to the side of the plots, instead of a legend in every plot
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
Imports and Test Data
import pandas as pd
import numpy as np # used for random data
import matplotlib.pyplot as plt
from matplotlib.patches import Patch # for custom legend - square patches
from matplotlib.lines import Line2D # for custom legend - round markers
import seaborn as sns
import math import ceil # determine correct number of subplot
# synthetic data
df_dict = dict()
for i in range(1, 7):
np.random.seed(i) # for repeatable sample data
data_length = 100
data = {'cat': np.random.choice(['A', 'B', 'C'], size=data_length),
'x': np.random.rand(data_length), 'y': np.random.rand(data_length)}
df_dict[i] = pd.DataFrame(data)
# display(df_dict[1].head())
cat x y
0 B 0.944595 0.606329
1 A 0.586555 0.568851
2 A 0.903402 0.317362
3 B 0.137475 0.988616
4 B 0.139276 0.579745
# display(df_dict[6].tail())
cat x y
95 B 0.881222 0.263168
96 A 0.193668 0.636758
97 A 0.824001 0.638832
98 C 0.323998 0.505060
99 C 0.693124 0.737582
Create color mappings and plot
# create color mapping based on all unique values of cat
unique_cat = {cat for v in df_dict.values() for cat in v.cat.unique()} # get unique cats
colors = sns.color_palette('tab10', n_colors=len(unique_cat)) # get a number of colors
cmap = dict(zip(unique_cat, colors)) # zip values to colors
col_nums = 3 # how many plots per row
row_nums = math.ceil(len(df_dict) / col_nums) # how many rows of plots
# create the figue and axes
fig, axes = plt.subplots(row_nums, col_nums, figsize=(9, 6), sharex=True, sharey=True)
# convert to 1D array for easy iteration
axes = axes.flat
# iterate through dictionary and plot
for ax, (k, v) in zip(axes, df_dict.items()):
sns.scatterplot(data=v, x='x', y='y', hue='cat', palette=cmap, ax=ax)
sns.despine(top=True, right=True)
ax.legend_.remove() # remove the individual plot legends
ax.set_title(f'dataset = {k}', fontsize=11)
fig.tight_layout()
# create legend from cmap
# patches = [Patch(color=v, label=k) for k, v in cmap.items()] # square patches
patches = [Line2D([0], [0], marker='o', color='w', markerfacecolor=v, label=k, markersize=8) for k, v in cmap.items()] # round markers
# place legend outside of plot; change the right bbox value to move the legend up or down
plt.legend(title='cat', handles=patches, bbox_to_anchor=(1.06, 1.2), loc='center left', borderaxespad=0, frameon=False)
plt.show()
Option 2: Create subplots from a single dataframe with multiple separate datasets
The dataframes must be in a long form with the same column names.
This option uses pd.concat to combine multiple dataframes into a single dataframe, and .assign to add a new column.
See Import multiple csv files into pandas and concatenate into one DataFrame for creating a single dataframes from a list of files.
This option is easier because it doesn't require manually mapping colors to 'cat'
Combine DataFrames
# using df_dict, with dataframes as values, from the top
# combine all the dataframes in df_dict to a single dataframe with an identifier column
df = pd.concat((v.assign(dataset=k) for k, v in df_dict.items()), ignore_index=True)
# display(df.head())
cat x y dataset
0 B 0.944595 0.606329 1
1 A 0.586555 0.568851 1
2 A 0.903402 0.317362 1
3 B 0.137475 0.988616 1
4 B 0.139276 0.579745 1
# display(df.tail())
cat x y dataset
595 B 0.881222 0.263168 6
596 A 0.193668 0.636758 6
597 A 0.824001 0.638832 6
598 C 0.323998 0.505060 6
599 C 0.693124 0.737582 6
Plot a FacetGrid with seaborn.relplot
sns.relplot(kind='scatter', data=df, x='x', y='y', hue='cat', col='dataset', col_wrap=3, height=3)
Both options create the same result, however, it's less complicated to combine all the dataframes, and plot a figure-level plot with sns.relplot.
Building on #joris response above, if you have already established a reference to the subplot, you can use the reference as well. For example,
ax1 = plt.subplot2grid((50,100), (0, 0), colspan=20, rowspan=10)
...
df.plot.barh(ax=ax1, stacked=True)
Here is a working pandas subplot example, where modes is the column names of the dataframe.
dpi=200
figure_size=(20, 10)
fig, ax = plt.subplots(len(modes), 1, sharex="all", sharey="all", dpi=dpi)
for i in range(len(modes)):
ax[i] = pivot_df.loc[:, modes[i]].plot.bar(figsize=(figure_size[0], figure_size[1]*len(modes)),
ax=ax[i], title=modes[i], color=my_colors[i])
ax[i].legend()
fig.suptitle(name)
import numpy as np
import pandas as pd
imoprt matplotlib.pyplot as plt
fig, ax = plt.subplots(2,2)
df = pd.DataFrame({'A':np.random.randint(1,100,10),
'B': np.random.randint(100,1000,10),
'C':np.random.randint(100,200,10)})
for ax in ax.flatten():
df.plot(ax =ax)

Increase the size of plots in pandas python

result
year Month Min_days Avg_days Median_days Count MonthName-Year
2015 1 9 12.56 10 4 2015-Jan
2015 2 10 13.67 9 3 2015-Feb
........................................................
2016 12 12 15.788 19 2 2016-Dec
and so on...
I wish to create a line plot plotting min_days, avg_days, median_days, count according to month and year say. Code used for that(which works perfectly):
import matplotlib.pyplot as plt
result=freq_start_month_year_to_date_1(df,'Jan','2015','Dec','2019')
#Visualisations
fig, ax = plt.subplots()
for col in ["Min_days", "Median_days", "Count",'Target_days_before_customer_dead']:
ax.plot(result["Month Name-Year"], result[col], label=col)
ax.legend(loc="best")
ax.tick_params(axis="x", rotation=30)
I am getting a plot . The only issue is that the x axis is too crowded and all the values 2015-Jan, 2015-Feb etc are overlapping so nothing is readable in the x axis, it looks like black scrabbling...I am unable to increase the size of the plot.
I tried below code but that too did not work
fig, ax = plt.subplots(2,2, figsize=(20,20))
Using the above code I got 4 empty sub plots
The problem is you preformatted your x-axis as string and thus robbed matplotlib of the chance to apply its own formatter. matplotlib tried to cram all the strings into the axis so you can never make it wide enough to hold all the labels.
Create a new date column and use it to form your x axis:
from matplotlib import dates as mdates
# The new column to be used as x axis
result['Date'] = pd.to_datetime(result[['Year', 'Month']].assign(Day=1))
# Plot the data
fig, ax = plt.subplots(figsize=(10, 2))
for col in ['Min_days', 'Median_days', 'Count', 'Target_days_before_customer_dead']:
ax.plot(result['Date'], result[col], label=col)
years = mdates.YearLocator() # only print label for the years
months = mdates.MonthLocator() # mark months as ticks
years_fmt = mdates.DateFormatter('%Y')
ax.xaxis.set_major_locator(years)
ax.xaxis.set_minor_locator(months)
ax.xaxis.set_major_formatter(years_fmt)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Result (with random data):

Plotting multiple lines grouped by one column dataframe, with date time as x axis [duplicate]

In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach would be using seaborn module. This would plot the two density estimates on the same axes without specifying a variable to hold the axes as follows (using some data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "indices":idx, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df.class.unique())
for c in classes:
df2 = data.loc[data['class'] == c]
df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()

Overlay two Data Frames in one graph [duplicate]

This question already has answers here:
How to plot different groups of data from a dataframe into a single figure
(5 answers)
Closed 4 years ago.
I have two different dataframes both are the same size and have the same values for the Date column.
Date Open High Low Close Volume Market Cap
0 2018-04-16 8337.57 8371.15 7925.73 8058.67 5.631310e+09 1.415710e+11
1 2018-04-15 7999.33 8338.42 7999.33 8329.11 5.244480e+09 1.358120e+11
2 2018-04-14 7874.67 8140.71 7846.00 7986.24 5.191430e+09 1.336820e+11
3 2018-04-13 7901.09 8183.96 7758.93 7895.96 7.764460e+09 1.341140e+11
4 2018-04-12 6955.38 7899.23 6806.51 7889.25 8.906250e+09 1.180480e+11
Both data frames have the same structure. same column indexes and same number of rows.
print(df.dtypes)
print(df2.dtypes)
Both give
Date datetime64[ns]
Open float64
High float64
Low float64
Close float64
Volume float64
Market Cap float64
dtype: object
I would like to plot Date on the x axis and Market Cap on the y axis. From my research I believe it is not possible to merge the plot function of a dataframe ie.
df.plot(x='Date', y = 'Market Cap', kind = 'line')
df2.plot(x='Date', y = 'Market Cap', kind = 'line')
My question is what is a good method to obtain the graph that I am trying to create? Should I use groupby, keys, or pyplot
to be clear i would like an overlay of the separate two graphs:
So far I have tried the code:
import matplotlib.pyplot as plt
# Init subplots
fig, axes = plt.subplots(1,1);
# Init ax with the first plot.
ax = df['Market Cap'].plot()
# Plot second df using ax from the first plot.
_ = df2['Market Cap'].plot(ax = ax)
plt.show()
Which results in an improper overlay:
ax = df.plot(x='Date', y = 'Market Cap', kind = 'line')
df2.plot(x='Date', y = 'Market Cap', kind = 'line', ax=ax)
results Courtesy of #scottBoston
Try using subplots.
import matplotlib.pyplot as plt
# Init subplots
fig, axes = plt.subplots(1,1);
# Init ax with the first plot.
ax = df['Market Cap'].plot()
# Plot second df using ax from the first plot.
_ = df2['Market Cap'].plot(ax = ax)
plt.show()

Categories

Resources