I want a boxplot with jittered outliers, where only the outliers are drawn as points and the non-outliers are not.
Searching the web you often find a workaround combining sns.boxplot() and sns.swarmplot().
The problem with that figure is that the outliers are drawn twice. I don't need the red ones; I only need the jittered (green) ones.
The non-outliers are drawn as well, and I don't need those either.
I also have a feature request open upstream about this. As far as my research goes, there is no built-in Seaborn solution for it.
This is an MWE reproducing the boxplot shown.
#!/usr/bin/env python3
import random
import pandas
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
random.seed(0)
df = pandas.DataFrame({
    'Vals': random.choices(range(200), k=200)})
df_outliers = pandas.DataFrame({
    'Vals': random.choices(range(400, 700), k=20)})
df = pandas.concat([df, df_outliers], axis=0)
flierprops = {
    'marker': 'o',
    'markeredgecolor': 'red',
    'markerfacecolor': 'none'
}
# Usual boxplot
ax = sns.boxplot(y='Vals', data=df, flierprops=flierprops)
# Add jitter with the swarmplot function
ax = sns.swarmplot(y='Vals', data=df, linewidth=.75, color='none', edgecolor='green')
plt.show()
Here is an approach to have jittered outliers. The jitter is similar to sns.stripplot(), not to sns.swarmplot() which uses a rather elaborate spreading algorithm. Basically, all the "line" objects of the subplot are checked whether they have a marker. The x-positions of the "lines" with a marker are moved a bit to create jitter. You might want to vary the amount of jitter, e.g. when you are working with hue.
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
random.seed(0)
df = pd.DataFrame({
    'Vals': random.choices(range(200), k=200)})
df_outliers = pd.DataFrame({
    'Vals': random.choices(range(400, 700), k=20)})
df = pd.concat([df, df_outliers], axis=0)
flierprops = {
    'marker': 'o',
    'markeredgecolor': 'red',
    'markerfacecolor': 'none'
}
# Usual boxplot
ax = sns.boxplot(y='Vals', data=df, flierprops=flierprops)
# Only the outlier "lines" carry a marker; shift their x-positions randomly
for line in ax.lines:
    if line.get_marker() != '':
        xs = line.get_xdata()
        xs = xs + np.random.uniform(-0.2, 0.2, len(xs))
        line.set_xdata(xs)
plt.tight_layout()
plt.show()
An alternative approach could be to filter out the outliers, and then call sns.swarmplot() or sns.stripplot() only with those points. As seaborn doesn't return the values calculated to position the whiskers, you might need to calculate those again via scipy, taking into account seaborn's filtering on x and on hue.
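A minimal sketch of that alternative, under a few assumptions: there is a single group (no x or hue), the outlier rule is the usual 1.5 × IQR fence with linearly interpolated quartiles, and pandas is used instead of scipy to compute the quartiles. The helper names `split_outliers` and `boxplot_with_jittered_outliers` are made up for this example and are not part of seaborn.

```python
import pandas as pd
import seaborn as sns


def split_outliers(values, whis=1.5):
    """Split values into (inliers, outliers) using the whis * IQR fences."""
    values = pd.Series(values, dtype=float)
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    # A point is an outlier only if it lies strictly beyond a fence
    is_outlier = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return values[~is_outlier], values[is_outlier]


def boxplot_with_jittered_outliers(df, col, whis=1.5, jitter=0.2):
    """Draw a boxplot without fliers and overlay only the outliers, jittered."""
    # showfliers=False is forwarded to matplotlib and hides the default outlier markers
    ax = sns.boxplot(y=col, data=df, whis=whis, showfliers=False)
    _, outliers = split_outliers(df[col], whis=whis)
    # stripplot applies uniform horizontal jitter, similar to the loop-based approach
    sns.stripplot(y=outliers, jitter=jitter, color='green', ax=ax)
    return ax
```

With the `df` from the MWE, calling `boxplot_with_jittered_outliers(df, 'Vals')` followed by `plt.show()` should draw each outlier exactly once. With x or hue groups, the fences would have to be computed per group, for example via `groupby`.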
I am having trouble plotting a time series with seaborn. Adding hue to the plot breaks the line. I can't figure out why the line breaks in between, or how I can stop the dates from overlapping.
The code to replicate the issue is below:
test_data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-06', '2021-01-08', '2021-01-09'],
             'Price': [20, 10, 30, 40, 25, 23, 56],
             'Kind': ['Pre', 'Pre', 'Pre', 'Pre', 'Current', 'Post', 'Post']}
test_df = pd.DataFrame(test_data)
test_df['Date'] = pd.to_datetime(test_df['Date'])
sns.lineplot(data=test_df, x="Date", y="Price", hue='Kind')
How can I fix the line break and dates overlapping?
Try adding the style and markers arguments to handle the isolated point with "Kind" == "Current":
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure()
sns.lineplot(data=test_df, x="Date", y="Price", style='Kind', markers=['o', 'o', 'o'], hue='Kind')
fig.autofmt_xdate()
which displays the plot:
I have made a pie chart from an Excel sheet, but it comes out incomplete. I am not sure why. Here is the code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Employee=pd.read_excel("C:\\Users\\Jon\\Desktop\\data science\\Employee.xlsx")
Employee
colors = ["#1f77b4", "#ff7f0e"]
group_by_departments=Employee.groupby("Department").count().reset_index()
sizes = group_by_departments['Gender']
labels = group_by_departments['Department']
plt.pie(sizes, labels=labels, colors = colors,autopct='%.2f %%')
plt.show()
You can use .size() to get the count for each group. You'll need to group by Department and Gender simultaneously to obtain the individual counts of all the subgroups.
Here is some example code:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
N = 100
Employee = pd.DataFrame({'Gender': np.random.choice(['Male', 'Female'], N),
                         'Department': np.random.choice(['IT', 'Sales', 'HR', 'Finance'], N),
                         'Age': np.random.randint(20, 65, N),
                         'Salary': np.random.randint(20, 100, N) * 1000})
colors = ["turquoise", "tomato"]
group_by_departments_and_gender = Employee.groupby(["Department", "Gender"]).size().reset_index(name='Counts')
sizes = group_by_departments_and_gender['Counts']
labels = [f'{dept}\n {gender}' for dept, gender in group_by_departments_and_gender[['Department', 'Gender']].values]
plt.pie(sizes, labels=labels, colors=colors, autopct='%.2f %%')
plt.tight_layout()
plt.show()
PS: You could assign a color per gender via:
colors = ["magenta" if gender=="Male" else "deepskyblue" for gender in group_by_departments_and_gender["Gender"]]
This helps especially when one of the genders is not present in one of the departments, because the alternating two-color list would then fall out of step.
I have a simple pandas DataFrame as shown below. I want to create a scatter plot of value on the y-axis, date on the x-axis, and color the points by category. However, coloring the points isn't working.
# Create dataframe
df = pd.DataFrame({
    'date': ['2016-01-01', '2016-02-01', '2016-03-01', '2016-01-01', '2016-02-01', '2016-03-01'],
    'category': ['Wholesale', 'Wholesale', 'Wholesale', 'Retail', 'Retail', 'Retail'],
    'value': [50, 60, 65, 55, 62, 70]
})
df['date'] = pd.to_datetime(df['date'])
# Try to plot
df.plot.scatter(x='date', y='value', c='category')
ValueError: 'c' argument must be a mpl color, a sequence of mpl colors or a sequence of numbers, not ['Wholesale' 'Wholesale' 'Wholesale' 'Retail' 'Retail' 'Retail'].
Why am I getting the error? The pandas scatter plot documentation says the argument c can be "A column name or position whose values will be used to color the marker points according to a colormap."
df.plot.scatter(x='date', y='value', c=df['category'].map({'Wholesale':'red','Retail':'blue'}))
I think you are looking for seaborn:
import seaborn as sns
sns.scatterplot(data=df, x='date', y='value', hue='category')
Output:
Or you can loop through df.groupby:
fig, ax = plt.subplots()
# One scatter call per category, so each gets its own color and legend entry
for cat, d in df.groupby('category'):
    ax.scatter(x=d['date'], y=d['value'], label=cat)
ax.legend()
Output:
One of my favorite aspects of using the ggplot2 library in R is the ability to easily specify aesthetics. I can quickly make a scatterplot and apply color associated with a specific column, and I would love to be able to do this with python/pandas/matplotlib. I'm wondering if there are any convenience functions that people use to map colors to values using pandas dataframes and Matplotlib?
##ggplot scatterplot example with R dataframe, `df`, colored by col3
ggplot(data = df, aes(x=col1, y=col2, color=col3)) + geom_point()
##ideal situation with pandas dataframe, 'df', where colors are chosen by col3
df.plot(x=col1,y=col2,color=col3)
EDIT:
Thank you for your responses but I want to include a sample dataframe to clarify what I am asking. Two columns contain numerical data and the third is a categorical variable. The script I am thinking of will assign colors based on this value.
np.random.seed(250)
df = pd.DataFrame({'Height': np.append(np.random.normal(6, 0.25, size=5), np.random.normal(5.4, 0.25, size=5)),
                   'Weight': np.append(np.random.normal(180, 20, size=5), np.random.normal(140, 20, size=5)),
                   'Gender': ["Male","Male","Male","Male","Male",
                              "Female","Female","Female","Female","Female"]})
     Height      Weight  Gender
0  5.824970  159.210508    Male
1  5.780403  180.294943    Male
2  6.318295  199.142201    Male
3  5.617211  157.813278    Male
4  6.340892  191.849944    Male
5  5.625131  139.588467  Female
6  4.950479  146.711220  Female
7  5.617245  121.571890  Female
8  5.556821  141.536028  Female
9  5.714171  134.396203  Female
Imports and Data
import numpy
import pandas
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks')
numpy.random.seed(0)
N = 37
_genders= ['Female', 'Male', 'Non-binary', 'No Response']
df = pandas.DataFrame({
    'Height (cm)': numpy.random.uniform(low=130, high=200, size=N),
    'Weight (kg)': numpy.random.uniform(low=30, high=100, size=N),
    'Gender': numpy.random.choice(_genders, size=N)
})
Update August 2021
As of seaborn 0.11.0, it's recommended to use the newer figure-level functions like seaborn.relplot rather than using FacetGrid directly.
sns.relplot(data=df, x='Weight (kg)', y='Height (cm)', hue='Gender', hue_order=_genders, aspect=1.61)
plt.show()
Update October 2015
Seaborn handles this use-case splendidly:
Map matplotlib.pyplot.scatter onto a seaborn.FacetGrid
fg = sns.FacetGrid(data=df, hue='Gender', hue_order=_genders, aspect=1.61)
fg.map(plt.scatter, 'Weight (kg)', 'Height (cm)').add_legend()
Which immediately outputs:
Old Answer
In this case, I would use matplotlib directly.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def dfScatter(df, xcol='Height', ycol='Weight', catcol='Gender'):
    fig, ax = plt.subplots()
    categories = np.unique(df[catcol])
    colors = np.linspace(0, 1, len(categories))
    colordict = dict(zip(categories, colors))
    df["Color"] = df[catcol].apply(lambda x: colordict[x])
    ax.scatter(df[xcol], df[ycol], c=df.Color)
    return fig
if 1:
    df = pd.DataFrame({'Height':np.random.normal(size=10),
                       'Weight':np.random.normal(size=10),
                       'Gender': ["Male","Male","Unknown","Male","Male",
                                  "Female","Did not respond","Unknown","Female","Female"]})
    fig = dfScatter(df)
    fig.savefig('fig1.png')
And that gives me:
As far as I know, that color column can be any matplotlib-compatible color (RGBA tuples, HTML names, hex values, etc.).
I'm having trouble getting anything but numerical values to work with the colormaps.
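One way around the numeric-only limitation is to look up an explicit color per category and draw each category separately, which also yields a legend. The following is only a sketch: the helper name `scatter_by_category` and the choice of the `tab10` colormap are assumptions for illustration, not part of the answer above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_category(df, xcol, ycol, catcol, cmap_name='tab10'):
    """Scatter xcol against ycol with one color and legend entry per category."""
    fig, ax = plt.subplots()
    cmap = plt.get_cmap(cmap_name)
    categories = sorted(df[catcol].unique())
    # Pick one RGBA tuple per category, wrapping around if the colormap is too short
    color_of = {cat: cmap(i % cmap.N) for i, cat in enumerate(categories)}
    for cat, group in df.groupby(catcol):
        ax.scatter(group[xcol], group[ycol], color=color_of[cat], label=cat)
    ax.set(xlabel=xcol, ylabel=ycol)
    ax.legend(title=catcol)
    return fig, color_of
```

Unlike `dfScatter`, this version does not add a "Color" column to the caller's DataFrame, and string categories never have to pass through the colormap's numeric normalization.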
Actually you could use ggplot for python:
from ggplot import *
import numpy as np
import pandas as pd
df = pd.DataFrame({'Height':np.random.randn(10),
                   'Weight':np.random.randn(10),
                   'Gender': ["Male","Male","Male","Male","Male",
                              "Female","Female","Female","Female","Female"]})
ggplot(aes(x='Height', y='Weight', color='Gender'), data=df) + geom_point()
https://seaborn.pydata.org/generated/seaborn.scatterplot.html
import numpy
import pandas
import seaborn as sns
numpy.random.seed(0)
N = 37
_genders= ['Female', 'Male', 'Non-binary', 'No Response']
df = pandas.DataFrame({
    'Height (cm)': numpy.random.uniform(low=130, high=200, size=N),
    'Weight (kg)': numpy.random.uniform(low=30, high=100, size=N),
    'Gender': numpy.random.choice(_genders, size=N)
})
sns.scatterplot(data=df, x='Height (cm)', y='Weight (kg)', hue='Gender')
You can use the color parameter to the plot method to define the colors you want for each column. For example:
from pandas import DataFrame
data = DataFrame({'a':range(5),'b':range(1,6),'c':range(2,7)})
colors = ['yellowgreen','cyan','magenta']
data.plot(color=colors)
You can use color names or hex color codes, such as '#000000' for black. You can find all the defined color names in matplotlib's colors.py file. Below is the link to the colors.py file in matplotlib's GitHub repo.
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/colors.py
The OP is coloring by a categorical column, but this answer is for coloring by a column that is numeric, or can be interpreted as numeric, such as a datetime dtype.
pandas.DataFrame.plot and matplotlib.pyplot.scatter can take a c or color parameter, which must be a color, a sequence of colors, or a sequence of numbers.
Tested in python 3.8, pandas 1.3.1, and matplotlib 3.4.2
See Choosing Colormaps in Matplotlib for other valid cmap options.
Imports and Test Data
'Date' is already a datetime64[ns] dtype from DataReader
conda install -c anaconda pandas-datareader or pip install pandas-datareader depending on your environment.
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as web # for data; not part of pandas
ticker = 'amzn'
df = web.DataReader(ticker, data_source='yahoo', start='2018-01-01', end='2021-01-01').reset_index()
df['ticker'] = ticker
        Date        High          Low         Open        Close   Volume    Adj Close ticker
0 2018-01-02  1190.00000  1170.510010  1172.000000  1189.010010  2694500  1189.010010   amzn
1 2018-01-03  1205.48999  1188.300049  1188.300049  1204.199951  3108800  1204.199951   amzn
c as a number
pandas.DataFrame.plot
df.Date.dt.month creates a pandas.Series of month numbers
ax = df.plot(kind='scatter', x='Date', y='High', c=df.Date.dt.month, cmap='Set3', figsize=(11, 4), title='c parameter as a month number')
plt.show()
matplotlib.pyplot.scatter
fig, ax = plt.subplots(figsize=(11, 4))
ax.scatter(data=df, x='Date', y='High', c=df.Date.dt.month, cmap='Set3')
ax.set(title='c parameter as a month number', xlabel='Date', ylabel='High')
plt.show()
c as a datetime dtype
pandas.DataFrame.plot
ax = df.plot(kind='scatter', x='Date', y='High', c='Date', cmap='winter', figsize=(11, 4), title='c parameter as a datetime dtype')
plt.show()
matplotlib.pyplot.scatter
fig, ax = plt.subplots(figsize=(11, 4))
ax.scatter(data=df, x='Date', y='High', c='Date', cmap='winter')
ax.set(title='c parameter as a datetime dtype', xlabel='Date', ylabel='High')
plt.show()
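The same "c as a sequence of numbers" idea can be stretched to a categorical column, as in the OP's data, by first converting the categories to integer codes. This is a sketch under that assumption, using `pandas.factorize` for the codes and the scatter collection's `legend_elements` for the legend; the `Set1` colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01',
                            '2016-01-01', '2016-02-01', '2016-03-01']),
    'category': ['Wholesale', 'Wholesale', 'Wholesale', 'Retail', 'Retail', 'Retail'],
    'value': [50, 60, 65, 55, 62, 70]})

# factorize returns integer codes plus the unique labels in order of first appearance
codes, labels = pd.factorize(df['category'])

fig, ax = plt.subplots()
points = ax.scatter(df['date'], df['value'], c=codes, cmap='Set1')
# legend_elements builds one handle per distinct code, in ascending code order,
# which matches the order of `labels`
handles, _ = points.legend_elements()
ax.legend(handles, list(labels), title='category')
fig.autofmt_xdate()
```

Because the codes are only stand-ins for labels, a qualitative colormap is a better fit here than a sequential one.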
Though not matplotlib, you can achieve this using plotly express:
import numpy as np
import pandas as pd
import plotly.express as px
df = pd.DataFrame({
    'Height': np.random.normal(size=10),
    'Weight': np.random.normal(size=10),
    'Size': 1,  # Constant marker size for every point
    'Gender': ["Male","Male","Male","Male","Male","Female","Female","Female","Female","Female"]})
# Create your plot
px.scatter(df, x='Weight', y='Height', size='Size', color='Gender')
If creating in a notebook, you'll get an interactive output like the following: