I am trying to plot a grouped pandas DataFrame and would like to order the bars by a categorical variable.
Sample code of what I am doing:
import pandas as pd
df = {"month":["Jan", "Jan", "Jan","Feb", "Feb", "Mar"],
"cat":["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(df)
df.groupby("month")["cat"].value_counts().unstack(0).plot.bar()
This plots a grouped bar chart, but within each category I would like the bars ordered Jan, Feb, Mar.
Any help on how to achieve this would be appreciated.
Kind regards.
You could make the month column categorical to fix an order:
import pandas as pd
df = {"month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Mar"],
"cat": ["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(df)
df["month"] = pd.Categorical(df["month"], ["Jan", "Feb", "Mar"])
df.groupby("month")["cat"].value_counts().unstack(0).plot.bar(rot=0)
An alternative would be to select the column order after the call to unstack(0):
df.groupby("month")["cat"].value_counts().unstack(0)[["Jan", "Feb", "Mar"]].plot.bar(rot=0)
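The ordering can also be checked on the counts table itself before anything is drawn. The following is a minimal sketch rather than part of the answers above: it swaps in pd.crosstab for the groupby/value_counts/unstack chain and uses reindex to fix both orders. The category order Low, Medium, High is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Mar"],
    "cat": ["High", "High", "Low", "Medium", "Low", "High"],
})

month_order = ["Jan", "Feb", "Mar"]
cat_order = ["Low", "Medium", "High"]  # assumed order, not stated in the question

# Rows are categories, columns are months; reindex pins both orders and
# fills month/category combinations that never occur with 0.
counts = (
    pd.crosstab(df["cat"], df["month"])
    .reindex(index=cat_order, columns=month_order, fill_value=0)
)
print(counts)

# counts.plot.bar(rot=0) would then draw the bars in exactly this order.
```

Because the order lives in the table, the same frame can be reused for other chart types without repeating the ordering logic.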
I recommend using the seaborn package for plotting data from DataFrames. It makes it simple to organize and order each element when plotting.
First let's add a column with the counts of each existing month/cat combination:
import pandas as pd
data = {"month":["Jan", "Jan", "Jan","Feb", "Feb", "Mar"],
"cat":["High", "High", "Low", "Medium", "Low", "High"]}
df = pd.DataFrame(data)
df = df.value_counts().reset_index(name='count')  # name the counts column explicitly; works across pandas versions
print(df)
# output:
#
# month cat count
# 0 Jan High 2
# 1 Mar High 1
# 2 Jan Low 1
# 3 Feb Medium 1
# 4 Feb Low 1
Plotting with seaborn then becomes as simple as:
import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(
data=df,
x='cat',
y='count',
hue='month',
order=['Low', 'Medium', 'High'], # Order of elements in the X-axis
hue_order=['Jan', 'Feb', 'Mar'], # Order of colored bars at each X position
)
plt.show()
Output image:
I am trying to visualize different type of "purchases" over a quarterly period for selected customers. To generate this visual, I am using a catplot functionality in seaborn but am unable to add a horizontal line that connects each of the purchased fruits. Each line should start at the first dot for each fruit and end at the last dot for the same fruit. Any ideas on how to do this programmatically?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dta = pd.DataFrame(columns=["Date", "Fruit", "type"], data=[['2017-01-01','Orange',
'FP'], ['2017-04-01','Orange', 'CP'], ['2017-07-01','Orange', 'CP'],
['2017-10-08','Orange', 'CP'],['2017-01-01','Apple', 'NP'], ['2017-04-01','Apple', 'CP'],
['2017-07-01','Banana', 'NP'], ['2017-10-08','Orange', 'CP']
])
dta['quarter'] = pd.PeriodIndex(dta.Date, freq='Q')
sns.catplot(x="quarter", y="Fruit", hue="type", kind="swarm", data=dta)
plt.show()
This is the result:
How can I add individual horizontal lines that each connect the dots for purchases of orange and apple?
Each line should start at the first dot for each fruit and end at the last dot for the same fruit.
Use groupby.ngroup to map the quarters to xtick positions
Use groupby.agg to find each fruit's min and max xtick endpoints
Use ax.hlines to plot horizontal lines from each fruit's min to max
df = pd.DataFrame([['2017-01-01', 'Orange', 'FP'], ['2017-04-01', 'Orange', 'CP'], ['2017-07-01', 'Orange', 'CP'], ['2017-10-08', 'Orange', 'CP'], ['2017-01-01', 'Apple', 'NP'], ['2017-04-01', 'Apple', 'CP'], ['2017-07-01', 'Banana', 'NP'], ['2017-10-08', 'Orange', 'CP']], columns=['Date', 'Fruit', 'type'])
df['quarter'] = pd.PeriodIndex(df['Date'], freq='Q')
df = df.sort_values('quarter') # sort dataframe by quarter
df['xticks'] = df.groupby('quarter').ngroup() # map quarter to xtick position
ends = df.groupby('Fruit')['xticks'].agg(['min', 'max']) # find min and max xtick per fruit
g = sns.catplot(x='quarter', y='Fruit', hue='type', kind='swarm', s=8, data=df)
g.axes[0, 0].hlines(ends.index, ends['min'], ends['max']) # plot horizontal lines from each fruit's min to max
Detailed breakdown:
catplot plots the xticks in the order they appear in the dataframe. The sample dataframe is already sorted by quarter, but the real dataframe should be sorted explicitly:
df = df.sort_values('quarter')
Map the quarters to their xtick positions using groupby.ngroup:
df['xticks'] = df.groupby('quarter').ngroup()
# Date Fruit type quarter xticks
# 0 2017-01-01 Orange FP 2017Q1 0
# 1 2017-04-01 Orange CP 2017Q2 1
# 2 2017-07-01 Orange CP 2017Q3 2
# 3 2017-10-08 Orange CP 2017Q4 3
# 4 2017-01-01 Apple NP 2017Q1 0
# 5 2017-04-01 Apple CP 2017Q2 1
# 6 2017-07-01 Banana NP 2017Q3 2
# 7 2017-10-08 Orange CP 2017Q4 3
Find the min and max xticks to get the endpoints per Fruit using groupby.agg:
ends = df.groupby('Fruit')['xticks'].agg(['min', 'max'])
# min max
# Fruit
# Apple 0 1
# Banana 2 2
# Orange 0 3
Use ax.hlines to plot a horizontal line per Fruit from min-endpoint to max-endpoint:
g = sns.catplot(x='quarter', y='Fruit', hue='type', kind='swarm', s=8, data=df)
ax = g.axes[0, 0]
ax.hlines(ends.index, ends['min'], ends['max'])
You just need to enable the horizontal grid for the chart as follows:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dta = pd.DataFrame(
columns=["Date", "Fruit", "type"],
data=[
["2017-01-01", "Orange", "FP"],
["2017-04-01", "Orange", "CP"],
["2017-07-01", "Orange", "CP"],
["2017-10-08", "Orange", "CP"],
["2017-01-01", "Apple", "NP"],
["2017-04-01", "Apple", "CP"],
["2017-07-01", "Banana", "NP"],
["2017-10-08", "Orange", "CP"],
],
)
dta["quarter"] = pd.PeriodIndex(dta.Date, freq="Q")
sns.catplot(x="quarter", y="Fruit", hue="type", kind="swarm", data=dta)
plt.grid(axis='y')
plt.show()
I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
df['date_m'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
And I want to make a line plot grouped by month and campaign, so I have tried this code:
df['sales'].groupby(df['date_m','campaign']).mean().plot.line()
But I get this error message KeyError: ('date_m', 'campaign'). Please, any help will be greatly appreciated.
Plotting is typically dependent upon the shape of the DataFrame.
.groupby creates a long format DataFrame, which is great for seaborn
.pivot_table creates a wide format DataFrame, which easily works with pandas.DataFrame.plot
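To make the long versus wide distinction concrete, here is a small sketch built on the question's data. The variable names long and wide are illustrative only, and the date column is assembled with day=1 as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2011] * 12,
    "month": [1, 2, 3, 4, 5, 6] * 2,
    "campaign": [0] * 6 + [1] * 6,
    "sales": [10000, 11000, 12000, 10500, 10000, 9500,
              7000, 8000, 5000, 6000, 6000, 7000],
})
# Assemble a real datetime column from year/month (day=1 is an assumption).
df["date_m"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

# Long format: one row per (date_m, campaign) pair, values in a single column.
long = df.groupby(["date_m", "campaign"], as_index=False)["sales"].mean()

# Wide format: one row per date, one column per campaign.
wide = long.pivot(index="date_m", columns="campaign", values="sales")

print(long.shape)  # rows = dates x campaigns
print(wide.shape)  # rows = dates, columns = campaigns
```

seaborn functions generally take the long frame plus a hue column, while DataFrame.plot draws one line per column of the wide frame.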
.groupby the DataFrame
df['sales'].groupby(...) is incorrect, because df['sales'] selects one column of the dataframe; none of the other columns are available
.groupby converts the DataFrame into a long format, which is great for plotting with seaborn.lineplot.
Specify the hue parameter to separate by 'campaign'.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# perform groupby and reset the index
dfg = df.groupby(['date_m','campaign'])['sales'].mean().reset_index()
# display(dfg.head())
#       date_m  campaign  sales
# 0 2011-01-01         0  10000
# 1 2011-01-01         1   7000
# 2 2011-02-01         0  11000
# 3 2011-02-01         1   8000
# 4 2011-03-01         0  12000
# plot with seaborn
sns.lineplot(data=dfg, x='date_m', y='sales', hue='campaign')
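The KeyError from the question can be reproduced on a few rows. This is a hedged sketch using a cut-down copy of the data; the variable names raised, by_series and by_names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date_m": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-02-01", "2011-02-01"]),
    "campaign": [0, 1, 0, 1],
    "sales": [10000, 7000, 11000, 8000],
})

# df['date_m', 'campaign'] looks up ONE column whose label is the tuple
# ('date_m', 'campaign'); no such column exists, hence the KeyError.
try:
    df["date_m", "campaign"]
    raised = False
except KeyError:
    raised = True

# Grouping a single Series still works if the keys are passed as a list of Series.
by_series = df["sales"].groupby([df["date_m"], df["campaign"]]).mean()

# Same result, usually easier to read: group the DataFrame by column names.
by_names = df.groupby(["date_m", "campaign"])["sales"].mean()

print(raised)
print(by_names)
```

Both groupby forms give the same means; the second keeps the other columns available, which is why it is the one used above.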
.pivot_table the DataFrame
.pivot_table shapes the DataFrame correctly for plotting with pandas.DataFrame.plot, and it has an aggregation parameter.
The DataFrame is shaped into a wide format.
# pivot the dataframe into the correct shape for plotting
dfp = df.pivot_table(index='date_m', columns='campaign', values='sales', aggfunc='mean')
# display(dfp.head())
# campaign        0     1
# date_m
# 2011-01-01  10000  7000
# 2011-02-01  11000  8000
# 2011-03-01  12000  5000
# 2011-04-01  10500  6000
# 2011-05-01  10000  6000
# plot the dataframe
dfp.plot()
Plotting with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 6))
for v in df.campaign.unique():
    # select the data based on the campaign
    data = df[df.campaign.eq(v)]
    # this is only necessary if there is more than one value per date
    data = data.groupby(['date_m','campaign'])['sales'].mean().reset_index()
    ax.plot('date_m', 'sales', data=data, label=f'{v}')
plt.legend(title='campaign')
plt.show()
Notes
Package versions:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4
I am trying to make a bar chart using Python and seaborn, but I am getting an error:
ValueError: Could not interpret input 'total'.
This is what I am trying to transform in a bar chart:
level_1 1900 2014 2015 2016 2017 2018
total 0.0 154.4 490.9 628.4 715.2 601.5
This is an image of the same dataframe:
I also want to delete the column 1900, but when I try to do it by deleting the index, the column 2014 is deleted instead.
This is what I have so far:
valor_ano = sns.barplot(
data= valor_ano,
x= ['2014', '2015', '2016', '2017', '2018'],
y= 'total')
Any suggestions?
Do something like the following:
import seaborn as sns
import pandas as pd
valor_ano = pd.DataFrame({'level_1':[1900, 2014, 2015, 2016, 2017, 2018],
'total':[0.0, 154.4, 490.9, 628.4,715.2,601.5]})
valor_ano.drop(0, axis=0, inplace=True)
valor_plot = sns.barplot(
data= valor_ano,
x= 'level_1',
y= 'total')
This produces the following plot:
EDIT: If you want to do it without the dataframe and just pass in the raw data you can do it with the following code. You can also just use a variable containing a list instead of hard-coding the list:
valor_graph = sns.barplot(
x= [2014, 2015, 2016, 2017, 2018],
y= [154.4, 490.9, 628.4,715.2,601.5])
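If the data already exists in the wide layout shown in the question, it can be reshaped rather than retyped. The sketch below assumes that layout (one row labelled total, one column per year, integer column labels); the names wide and long are illustrative.

```python
import pandas as pd

# Assumed layout, mirroring the question: a single row labelled "total"
# and one column per year (column labels assumed to be integers).
wide = pd.DataFrame(
    [[0.0, 154.4, 490.9, 628.4, 715.2, 601.5]],
    index=["total"],
    columns=pd.Index([1900, 2014, 2015, 2016, 2017, 2018], name="level_1"),
)

# Drop the unwanted year by its column label, then transpose so that
# each year becomes a row with its value in a "total" column.
long = wide.drop(columns=1900).T.reset_index()
print(long)

# sns.barplot(data=long, x="level_1", y="total") can then find both columns.
```

Dropping by label with drop(columns=...) avoids the positional mix-up described in the question, where removing by index deleted 2014 instead.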
I have a DataFrame with month, year and Value columns, and I want to make a time series plot.
Sample:
month year Value
12 2016 0.006437804129357764
1 2017 0.013850880792606646
2 2017 0.013330349031207292
3 2017 0.07663058273768052
4 2017 0.7822831457266424
5 2017 0.8089573099244689
6 2017 1.1634845000200715
I'm trying to plot Value against year and month, with year and month on the x-axis and Value on the y-axis.
One way is this:
import pandas as pd
import matplotlib.pyplot as plt
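# build a "month-year" string and parse it into real datetime values
df['date'] = df['month'].map(str) + '-' + df['year'].map(str)
df['date'] = pd.to_datetime(df['date'], format='%m-%Y')
fig, ax = plt.subplots()
# plot_date is deprecated in recent Matplotlib releases; ax.plot handles datetime values directly
ax.plot(df['date'], df['Value'], marker='o')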
plt.show()
You need to set a DatetimeIndex for pandas to plot the axis properly. A one-line modification of your dataframe (assuming you no longer need year and month as columns, and that the first day of each month is acceptable) would do:
df.set_index(pd.to_datetime({
'day': 1,
'month': df.pop('month'),
'year': df.pop('year')
}), inplace=True)
df.Value.plot()
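For a version that can be run on its own, here is a sketch using the first four sample rows from the question. It keeps the year and month columns instead of popping them, which is a small departure from the snippet above; day=1 is the same assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [12, 1, 2, 3],
    "year": [2016, 2017, 2017, 2017],
    "Value": [0.006437804129357764, 0.013850880792606646,
              0.013330349031207292, 0.07663058273768052],
})

# pd.to_datetime assembles dates from columns named year, month and day;
# day=1 is an assumption (first day of each month).
dates = pd.to_datetime(df[["year", "month"]].assign(day=1))
df = df.set_index(dates)

print(df.index)

# df["Value"].plot() would now place the points on a real date axis.
```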
My goal is simply to plot this data as a line graph, with the dates on the x-axis and the price on the y-axis. As I understand it, the dtype of the NumPy record array for the date field is datetime64[D], which means it is a 64-bit np.datetime64 in 'day' units. While this format is more portable, Matplotlib cannot plot it natively yet. The data can be plotted by converting the dates to datetime.date instances instead, which is achieved by converting to an object array; I did this below via astype('O'). But I am still getting this error:
view limit minimum -36838.00750000001 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-DateTime value to an axis that has DateTime units
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'avocado.csv')
df2 = df[['Date','AveragePrice','region']]
df2 = (df2.loc[df2['region'] == 'Albany'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Date'] = df2.Date.astype('O')
plt.style.use('ggplot')
ax = df2[['Date','AveragePrice']].plot(kind='line', title ="Price Change",figsize=(15,10),legend=True, fontsize=12)
ax.set_xlabel("Period",fontsize=12)
ax.set_ylabel("Price",fontsize=12)
plt.show()
df.head(3)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
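# keep only the Albany rows; .copy() avoids a SettingWithCopyWarning on the assignment below
df2 = df.loc[df['region'] == 'Albany', ['Date', 'AveragePrice']].copy()
# parse the dates so they are real datetime values rather than strings
df2['Date'] = pd.to_datetime(df2['Date'])
# sort chronologically and use the dates as the index, which pandas plots on the x-axis
df2 = df2.sort_values('Date').set_index('Date')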
print(df2)
ax = df2.plot(kind='line', title="Price Change")
ax.set_xlabel("Period", fontsize=12)
ax.set_ylabel("Price", fontsize=12)
plt.show()
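Since avocado.csv is not included here, the same steps can be tried on a stand-in frame. This sketch copies three rows from the df.head(3) output above; the name prices and the chained style are illustrative choices, not the answer's own code.

```python
import pandas as pd

# A few rows copied from the df.head(3) output, standing in for avocado.csv.
df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
    "region": ["Albany", "Albany", "Albany"],
})

prices = (
    df.loc[df["region"] == "Albany", ["Date", "AveragePrice"]]
    .assign(Date=lambda d: pd.to_datetime(d["Date"]))  # strings -> datetimes
    .sort_values("Date")                               # chronological order
    .set_index("Date")                                 # dates become the x-axis
)
print(prices)

# prices.plot(kind="line", title="Price Change") would then draw against the dates.
```

DataFrame.plot uses the index for the x-axis by default, so moving the dates into the index is what gives the chart a proper date axis.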