scatter plot by category in pandas - python

This has been troubling me for the past 30 minutes. What I'd like to do is to scatter plot by category. I took a look at the documentation, but I haven't been able to find the answer there. I looked here, but when I ran that in iPython Notebook, I don't get anything.
Here's my data frame:
time cpu wait category
8 1 0.5 a
9 2 0.2 a
2 3 0.1 b
10 4 0.7 c
3 5 0.2 c
5 6 0.8 b
Ideally, I'd like to have a scatter plot that shows CPU on the x axis, wait on the y axis, and each point on the graph is distinguished by category. So for example, if a=red, b=blue, and c=green then point (1, 0.5) and (2, 0.2) should be red, (3, 0.1) and (6, 0.8) should be blue, etc.
How would I do this with pandas or matplotlib? Either is fine, whichever does the job.

This is essentially the same answer as @JoeCondron's, but as a two-liner:
cmap = {'a': 'red', 'b': 'blue', 'c': 'yellow'}
df.plot(x='cpu', y='wait', kind='scatter',
        colors=[cmap.get(c, 'black') for c in df.category])
If no color is mapped for the category, it defaults to black.
EDIT:
The above works for Pandas 0.14.1. For 0.16.2, 'colors' needs to be changed to 'c':
df.plot(x='cpu', y='wait', kind='scatter',
        c=[cmap.get(c, 'black') for c in df.category])

I'd create a column with your colors based on category, then do the following, where ax is a matplotlib ax and df is your dataframe:
ax.scatter(df['cpu'], df['wait'], marker = '.', c = df['colors'], s = 100)
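The answer above does not show how the colors column is built. A minimal sketch, assuming the question's data and a hypothetical palette dict (the name and the color choices are illustrative, not taken from the answer):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "time": [8, 9, 2, 10, 3, 5],
    "cpu": [1, 2, 3, 4, 5, 6],
    "wait": [0.5, 0.2, 0.1, 0.7, 0.2, 0.8],
    "category": ["a", "a", "b", "c", "c", "b"],
})

# Hypothetical palette; categories missing from it fall back to black
palette = {"a": "red", "b": "blue", "c": "green"}
df["colors"] = df["category"].map(palette).fillna("black")

fig, ax = plt.subplots()
# One color name per row, in the same order as the x and y values
points = ax.scatter(df["cpu"], df["wait"], marker=".",
                    c=df["colors"].tolist(), s=100)
```

Keeping the colors in a column means the mapping is computed once and can be reused across plots.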

You could do
color_map = {'a': 'r', 'b': 'b', 'c': 'y'}
ax = plt.subplot()
x, y = df.cpu, df.wait
colors = df.category.map(color_map)
ax.scatter(x, y, color=colors)
This will give you red for category a, blue for b, yellow for c.
So you can pass a list of color aliases of the same length as the arrays.
You can check out the myriad available colours here: http://matplotlib.org/api/colors_api.html.
I don't think the plot method is very useful for scatter plots.
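If a legend is also wanted, one possible variation (not part of the answers above) is to call scatter once per category, so that each category gets its own legend entry. This is only a sketch; it reuses the color_map from the answer and assumes the question's data:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "cpu": [1, 2, 3, 4, 5, 6],
    "wait": [0.5, 0.2, 0.1, 0.7, 0.2, 0.8],
    "category": ["a", "a", "b", "c", "c", "b"],
})
color_map = {"a": "r", "b": "b", "c": "y"}

fig, ax = plt.subplots()
# groupby yields (category, sub-frame) pairs in sorted key order
for name, group in df.groupby("category"):
    ax.scatter(group["cpu"], group["wait"], color=color_map[name], label=name)

ax.set_xlabel("cpu")
ax.set_ylabel("wait")
legend = ax.legend(title="category")
```

The trade-off is one scatter call per category instead of a single call with a list of colors.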

Related

How to make a stacked bar chart from a dataframe in Python

I have the following dataframe:
Color Level Proportion
-------------------------------------
0 Blue 1 0.1
1 Blue 2 0.3
2 Blue 3 0.6
3 Red 1 0.2
4 Red 2 0.5
5 Red 3 0.3
Here I have 2 color categories, where each color category has 3 levels, and each entry has a proportion; the proportions sum to 1 within each color category. I want to make a stacked bar chart from this dataframe that has 2 stacked bars, one for each color category. Within each of those stacked bars will be the proportion for each level, all summing to 1. So while the bars will be "stacked" differently, the complete bars will all have the same length of 1.
I have tried this:
df.plot(kind='bar', stacked=True)
I then get this stacked bar chart, which is not what I want:
I want 2 stacked bars, and so a stacked bar for "Blue" and a stacked bar for "Red", where these bars are "stacked" by the proportions, with the colors of these stacks corresponding to each level. And so both of these bars would be of length 1 along the x-axis, which would be labelled "proportion". How can I fix my code to create this stacked bar chart?
Make a pivot and then plot it (pivot returns a new dataframe, so the result has to be assigned before plotting):
df = df.pivot(index = 'Color', columns = 'Level', values = 'Proportion')
df.plot(kind = 'bar', stacked = True)
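A self-contained sketch of the pivot step, using the question's data. The variable name wide is illustrative. Reshaping to one row per Color and one column per Level is what lets stacked=True stack the levels within each bar:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({
    "Color": ["Blue"] * 3 + ["Red"] * 3,
    "Level": [1, 2, 3] * 2,
    "Proportion": [0.1, 0.3, 0.6, 0.2, 0.5, 0.3],
})

# Long to wide: one row per Color, one column per Level
wide = df.pivot(index="Color", columns="Level", values="Proportion")
print(wide)

# Each row becomes one bar; each column becomes one stacked segment
ax = wide.plot(kind="bar", stacked=True)
ax.set_ylabel("Proportion")
```

If the bars should run along the x-axis, as the question describes, kind="barh" is the horizontal variant.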
Edit: Cleaner legend
You could create a Seaborn sns.histplot using the proportion as weights and the level as hue:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'Color': ['Blue'] * 3 + ['Red'] * 3,
'Level': [1, 2, 3] * 2,
'Proportion': [.1, .3, .6, .2, .5, .3]})
sns.set_style('white')
ax = sns.histplot(data=df, x='Color', weights='Proportion', hue='Level', multiple='stack', palette='flare', shrink=0.75)
ax.set_ylabel('Proportion')
for bars in ax.containers:
    ax.bar_label(bars, label_type='center', fmt='%.2f')
sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1, 0.97))
sns.despine()
plt.tight_layout()
plt.show()

Weird shifting of boxplot in pandas boxplot combining it with seaborn pointplot - what is going on?

Imagine I have the following dataframes
import pandas as pd
import seaborn as sns
import numpy as np
d = {'val': [1, 2,3,4], 'a': [1, 1, 2, 2]}
d2 = {'val': [1, 2], 'a': [1, 2]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
This will give me two dataframes that look the following:
df =
val a
0 1 1
1 2 1
2 3 2
3 4 2
and
df2 =
val a
0 1 1
1 2 2
Now I want to create a boxplot of val in df, grouped by the values of a. For a=1 there are two values of val, 1 and 2, so there should be a box at x=1 based on the values {1, 2}. For a=2 the values of val are {3, 4}, so there should be a box at x=2 based on the values {3, 4}.
Then I want to simply draw a line based on df2, where a is again my x-axis and val my y-axis. The way I did that is the following:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
sns.pointplot(x='a', y='val', data=df2, ax=ax)
The problem is that the box for a=1 is shifted to a=2 and the box for a=2 has disappeared. I am not sure whether there is an error in my code or whether this is a bug.
If I just draw the boxplot, everything is fine, so if I do:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
the boxes are at the right position, but as soon as I add the pointplot, things no longer work.
Does anyone have an idea what to do?
The problem is that you are plotting categories on the x-axis. Pointplot plots the first item at position 0 while boxplot starts at 1, hence the shift. One possibility is to use a twinned axis:
ax = df.boxplot(column=['val'], by = ['a'])
ax2 = ax.twiny()
sns.pointplot(x='a', y='val', data=df2, ax=ax2)
ax2.xaxis.set_visible(False)
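Another possible workaround, not part of the answer above, is to skip pointplot and draw the line with plain matplotlib at the same positions the boxes use (1, 2, ...). The sketch below assumes the question's df and df2; the helper names keys and position_of are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4], "a": [1, 1, 2, 2]})
df2 = pd.DataFrame({"val": [1, 2], "a": [1, 2]})

# One array of values per category of a, in sorted order
keys = sorted(df["a"].unique().tolist())
groups = [df.loc[df["a"] == k, "val"].to_numpy() for k in keys]

# Boxes are placed at positions 1, 2, ... (matplotlib's default for boxplot)
positions = list(range(1, len(keys) + 1))
position_of = dict(zip(keys, positions))

fig, ax = plt.subplots()
ax.boxplot(groups, positions=positions)
ax.set_xticks(positions)
ax.set_xticklabels([str(k) for k in keys])

# Draw the df2 line at the box positions rather than at 0, 1, ...
line_x = [position_of[a] for a in df2["a"].tolist()]
ax.plot(line_x, df2["val"], marker="o")
```

Because both layers share explicit positions, no second axis is needed.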

Boxplot by two groups in pandas

I have the following dataset:
df_plots = pd.DataFrame({'Group':['A','A','A','A','A','A','B','B','B','B','B','B'],
'Type':['X','X','X','Y','Y','Y','X','X','X','Y','Y','Y'],
'Value':[1,1.2,1.4,1.3,1.8,1.5,15,19,18,17,12,13]})
df_plots
Group Type Value
0 A X 1.0
1 A X 1.2
2 A X 1.4
3 A Y 1.3
4 A Y 1.8
5 A Y 1.5
6 B X 15.0
7 B X 19.0
8 B X 18.0
9 B Y 17.0
10 B Y 12.0
11 B Y 13.0
And I want to create boxplots per Group (there are two in the example) and in each plot to show by type. I have tried this:
fig, axs = plt.subplots(1,2,figsize=(8,6), sharey=False)
axs = axs.flatten()
for i, g in enumerate(df_plots[['Group','Type','Value']].groupby(['Group','Type'])):
    g[1].boxplot(ax=axs[i])
Results in an IndexError, because the loop tries to create 4 plots.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-12-8e1150950024> in <module>
3
4 for i, g in enumerate(df[['Group','Type','Value']].groupby(['Group','Type'])):
----> 5 g[1].boxplot(ax=axs[i])
IndexError: index 2 is out of bounds for axis 0 with size 2
Then I tried this:
fig, axs = plt.subplots(1,2,figsize=(8,6), sharey=False)
axs = axs.flatten()
for i, g in enumerate(df_plots[['Group','Type','Value']].groupby(['Group','Type'])):
    g[1].boxplot(ax=axs[i], by=['Group','Type'])
But no, I have the same problem. The expected result should have only two plots, and each plot have a box-and-whisker per Type. This is a sketch of this idea:
Please, any help will be greatly appreciated, with this code I can control some aspects of the data that I can't with seaborn.
Use seaborn.catplot:
import seaborn as sns
sns.catplot(data=df_plots, kind='box', col='Group', x='Type', y='Value', hue='Type', sharey=False, height=4)
We can use groupby boxplot to create subplots per Group and then separate each boxplot by Type:
fig, axes = plt.subplots(1, 2, figsize=(8, 6), sharey=False)
df_plots.groupby('Group').boxplot(by='Type', ax=axes)
plt.show()
Or without subplots by passing parameters directly through the function call:
axes = df_plots.groupby('Group').boxplot(by='Type', figsize=(8, 6),
layout=(1, 2), sharey=False)
plt.show()
Data and imports:
import pandas as pd
from matplotlib import pyplot as plt
df_plots = pd.DataFrame({
'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'Type': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'Value': [1, 1.2, 1.4, 1.3, 1.8, 1.5, 15, 19, 18, 17, 12, 13]
})
As @Prune mentioned, the immediate issue is that your groupby() returns four groups (AX, AY, BX, BY), so first fix the indexing and then clean up a couple more issues:
Change axs[i] to axs[i//2] to put groups 0 and 1 on axs[0] and groups 2 and 3 on axs[1].
Add positions=[i] to place the boxplots side by side rather than stacked.
Set the title and xticklabels after plotting (I'm not aware of how to do this in the main loop).
for i, g in enumerate(df_plots.groupby(['Group', 'Type'])):
    g[1].boxplot(ax=axs[i//2], positions=[i])
for i, ax in enumerate(axs):
    ax.set_title('Group: ' + df_plots['Group'].unique()[i])
    ax.set_xticklabels(['Type: X', 'Type: Y'])
Note that mileage may vary depending on version:
                        matplotlib.__version__   pd.__version__
confirmed working       3.4.2                    1.3.1
confirmed not working   3.0.1                    1.2.4
The immediate problem is that your groupby operation returns four elements (AX, AY, BX, BY), which you're trying to plot individually. You try to use ax=axs[i] ... but i runs 0-3, while you have only the two elements in your flattened structure. There is no axs[2] or axs[3], which raises the given run-time exception.
You need to resolve your referencing one way or the other.
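One way to resolve the mismatch, sketched here as a suggestion rather than as the answerer's own code, is to group by Group only, so that there is exactly one group per axis, and let boxplot split each group by Type:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df_plots = pd.DataFrame({
    "Group": ["A"] * 6 + ["B"] * 6,
    "Type": ["X", "X", "X", "Y", "Y", "Y"] * 2,
    "Value": [1, 1.2, 1.4, 1.3, 1.8, 1.5, 15, 19, 18, 17, 12, 13],
})

# Grouping by both columns yields four groups, which is why two axes are not enough
n_pairs = df_plots.groupby(["Group", "Type"]).ngroups

fig, axs = plt.subplots(1, 2, figsize=(8, 6), sharey=False)

# Grouping by 'Group' alone yields two groups, one per axis
for ax, (name, sub) in zip(axs, df_plots.groupby("Group")):
    sub.boxplot(column="Value", by="Type", ax=ax)
    ax.set_title(f"Group: {name}")

fig.suptitle("")  # clear the automatic "Boxplot grouped by ..." title
```

Pairing axes and groups with zip avoids index arithmetic such as axs[i//2].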

Seaborn custom range heatmap

I need to build custom seaborn heatmap-like plot according to these requirements:
import pandas as pd
df = pd.DataFrame({"A": [0.3, 0.8, 1.3],
"B": [4, 9, 15],
"C": [650, 780, 900]})
df_info = pd.DataFrame({"id": ["min", "max"],
"A": [0.5, 0.9],
"B": [6, 10],
"C": [850, 880]})
df_info = df_info.set_index('id')
df
A B C
0 0.3 4 650
1 0.8 9 780
2 1.3 15 900
df_info
id A B C
min 0.5 6 850
max 0.9 10 880
Each value within df is supposed to be within a range defined in df_info.
For example the values for the column A are considered normal if they are within 0.5 and 0.9. Values that are outside the range should be colorized using a custom heatmap.
In particular:
Values that fall within the range defined for each column should not be colorized, plain black text on white background cell.
Values lower than min for that column should be colorized, for example in blue. The lower their values from the min the darker the shade of blue.
Values higher than max for that column should be colorized, for example in red. The higher their values from the max the darker the shade of red.
Q: I wouldn't know how to approach this with a standard heatmap, I'm not even sure I can accomplish this with a heatmap plot. Any suggestion?
As far as I know, a heatmap can only have one scale of values. I would suggest normalizing the data you have in the df dataframe so the values in every column follow:
between 0 and 1 if the value is between df_info's min max
below 0 if the value is below df_info's min
above 1 if the value is above df_info's max
To normalize your dataframe use :
for col in df:
    df[col] = (df[col] - df_info[col]['min']) / (df_info[col]['max'] - df_info[col]['min'])
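As a worked check of the normalization, here is a vectorized sketch on the question's data. The names lo, hi, normalized and in_range are illustrative. Subtracting a row of df_info from df aligns on column names, so no loop is needed:

```python
import pandas as pd

df = pd.DataFrame({"A": [0.3, 0.8, 1.3],
                   "B": [4, 9, 15],
                   "C": [650, 780, 900]})
df_info = pd.DataFrame({"id": ["min", "max"],
                        "A": [0.5, 0.9],
                        "B": [6, 10],
                        "C": [850, 880]}).set_index("id")

lo = df_info.loc["min"]   # Series indexed by column name
hi = df_info.loc["max"]

# 0 maps to the column's min and 1 to the column's max
normalized = (df - lo) / (hi - lo)
print(normalized.round(2))

# Cells inside [0, 1] are the "normal" ones that should stay white
in_range = (normalized >= 0) & (normalized <= 1)
```

For column A, for instance, 0.3 lies half a range width below the minimum and so becomes roughly -0.5, while 0.8 becomes roughly 0.75.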
Finally, to create the color-coded heatmap use :
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
vmin = df.min().min()
vmax = df.max().max()
colors = [[0, 'darkblue'],
          [- vmin / (vmax - vmin), 'white'],
          [(1 - vmin)/ (vmax - vmin), 'white'],
          [1, 'darkred']]
cmap = LinearSegmentedColormap.from_list('', colors)
sns.heatmap(df, cmap=cmap, vmin=vmin, vmax=vmax)
The additional calculations with vmin and vmax allow a dynamic scaling of the colormap depending on the differences with the minimums and maximums.
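To make the breakpoint arithmetic concrete, here is a small sketch with hypothetical extremes vmin = -2 and vmax = 3 (these numbers are assumptions, not the question's data). The formulas only produce valid breakpoints when vmin is below 0 and vmax is above 1:

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, Normalize

# Hypothetical extremes of the normalized data (assumes vmin < 0 and vmax > 1)
vmin, vmax = -2.0, 3.0

# Where the normalized values 0 and 1 land on the colormap's own 0..1 axis
white_start = -vmin / (vmax - vmin)      # 2 / 5 = 0.4
white_end = (1 - vmin) / (vmax - vmin)   # 3 / 5 = 0.6

cmap = LinearSegmentedColormap.from_list(
    "",
    [(0, "darkblue"), (white_start, "white"), (white_end, "white"), (1, "darkred")],
)
norm = Normalize(vmin=vmin, vmax=vmax)

in_range = cmap(norm(0.5))   # a value inside [0, 1] falls in the white band
lowest = cmap(norm(vmin))    # darkest blue
highest = cmap(norm(vmax))   # darkest red
```

Everything between the two white breakpoints is drawn white, and the shade deepens linearly toward each extreme.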
Using your input dataframe we have the following heatmap:

Matplotlib - set Alpha and color dynamically

I'm plotting a scatter plot from a Pandas dataframe in Matplotlib. Here is what the dataframe looks like:
X Y R
0 1 945 1236.334519
0 1 950 212.809352
0 1 950 290.663847
0 1 961 158.156856
And here is how I'm plotting the dataframe:
ax1.scatter(myDF.X, myDF.Y, s=20, c='red', marker='s', alpha=0.5)
My problem is that I want to change how the marker is plotted according to how high or low the value of R is.
Example: if R is higher than 1000 (as it is in the first row of my example), the color should be yellow instead of red and alpha should be 0.8 instead of 0.5. If R is lower than 1000, the color should be blue and alpha should be 0.4, and so on.
Is there any way to do that, or can I only use different dataframes with different data? Thanks in advance!
You can do a custom RGBA color array:
colors = [(1,1,0,0.8) if x>1000 else (0,0,1,0.4) for x in df.R]  # yellow above 1000, blue otherwise
plt.scatter(df.X,df.Y, c=colors)
Output:
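The RGBA tuples can also be built from color names with matplotlib.colors.to_rgba, which may be easier to read than raw numbers. A sketch assuming the question's data and thresholds; the helper style_for is hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_rgba

myDF = pd.DataFrame({
    "X": [1, 1, 1, 1],
    "Y": [945, 950, 950, 961],
    "R": [1236.334519, 212.809352, 290.663847, 158.156856],
})

def style_for(r):
    """Return an RGBA tuple for one R value (hypothetical helper)."""
    # Thresholds and colors follow the question: yellow/0.8 above 1000, else blue/0.4
    return to_rgba("yellow", 0.8) if r > 1000 else to_rgba("blue", 0.4)

colors = [style_for(r) for r in myDF["R"]]

fig, ax = plt.subplots()
# alpha is carried per point in the fourth channel, so no alpha= argument is passed
ax.scatter(myDF["X"], myDF["Y"], s=20, c=colors, marker="s")
```

Further R ranges can be handled by adding branches to the helper.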
