Box plot using pandas - python

I am trying to draw a box plot for a pandas DataFrame, but the column names on the x-axis are not legible.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# this option was deprecated and later removed from pandas; drop the line on current versions
pd.set_option('display.mpl_style', 'default')
fig, ax1 = plt.subplots()
df.boxplot(column = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver','avg_surge','surge_pct','trips_in_first_30_days','weekday_pct'])
Below is the output:
How can I fix this so that the column names on the x-axis are legible?

I think you need the rot parameter, which sets the rotation angle of the tick labels in degrees:
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df.boxplot(column=cols, rot=90)
Sample:
import numpy as np
import pandas as pd
np.random.seed(100)
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
        'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)
df.boxplot(column=cols, rot=90)
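If fully vertical labels are hard to read, a gentler tilt may work too. The following is only a sketch, reusing the sample data above; the 45 degree angle, the right alignment and the figure size are my own choices, not something from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(100)
cols = ['avg_dist', 'avg_rating_by_driver', 'avg_rating_of_driver',
        'avg_surge', 'surge_pct', 'trips_in_first_30_days', 'weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)

fig, ax = plt.subplots(figsize=(8, 4))  # figure size is an arbitrary choice
df.boxplot(column=cols, ax=ax)

# Tilt the tick labels by 45 degrees and anchor them at their right edge
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
fig.tight_layout()  # leave room for the long labels
# call plt.show() here when running as a script
```

Right-aligning the labels keeps the end of each name under its own box, which tends to matter once the names are this long.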

Another option is to make the orientation of your boxes horizontal.
np.random.seed(100)
cols = ['avg_dist','avg_rating_by_driver','avg_rating_of_driver',
'avg_surge','surge_pct','trips_in_first_30_days','weekday_pct']
df = pd.DataFrame(np.random.rand(10, 7), columns=cols)
df.boxplot(column=cols, vert=False)

Related

How to set ordering of categories in Pandas stacked bar chart

I am trying to make a Pandas bar plot with custom-ordered categories (elements shown with different colors). Here's my code, where I expect the ordering of the categories from bottom to top to follow "catorder":
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df2 = pd.DataFrame({"series":["ser1","ser1","ser1", "ser2", "ser2","ser2"],
"cate":["aatu","boiler","heat pump","aatu","boiler","heat pump"],
"val": [6,15,24,7,15, 21] })
ac2 = pd.pivot_table(df2, values = "val", index = "series", columns = "cate")
catorder= ["heat pump","aatu","boiler"]
ac2.columns = pd.CategoricalIndex(ac2.columns.values,
ordered=True,
categories=catorder)
ac2.sort_index(axis = 1)
fig = plt.figure(figsize=(6,3.5))
ax1 = fig.add_subplot(111)
ac2.plot.bar(stacked=True, ax = ax1)
plt.show()
The problem is that it doesn't work. Categories are still in alphabetical order. Any ideas how to accomplish this common task?
You need to sort the data before plotting:
ac2.sort_index(axis=1).plot.bar(stacked=True, ax = ax1)
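The reason the original attempt had no effect is that sort_index returns a new DataFrame rather than modifying ac2, so the sorted result has to be used. As an alternative sketch (not the approach from the answer above), you could skip the CategoricalIndex entirely and select the columns in the order you want; the name ordered is just an illustrative variable.

```python
import pandas as pd
import matplotlib.pyplot as plt

df2 = pd.DataFrame({"series": ["ser1", "ser1", "ser1", "ser2", "ser2", "ser2"],
                    "cate": ["aatu", "boiler", "heat pump", "aatu", "boiler", "heat pump"],
                    "val": [6, 15, 24, 7, 15, 21]})
ac2 = pd.pivot_table(df2, values="val", index="series", columns="cate")

catorder = ["heat pump", "aatu", "boiler"]
# Selecting with a list of labels returns the columns in exactly that order
ordered = ac2[catorder]

fig, ax1 = plt.subplots(figsize=(6, 3.5))
ordered.plot.bar(stacked=True, ax=ax1)  # stacked bottom to top in catorder
# call plt.show() here when running as a script
```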

How to plot a bar chart without aggregation in Seaborn?

How do you plot a bar chart without aggregation? I have two columns, one contains values and the other is categorical, but I want to plot each row individually, without aggregation.
By default, sns.barplot(x = "col1", y = "col2", data = df) will aggregate by taking the mean of the values for each category in col1.
How do I simply just plot a bar for each row in my dataframe with no aggregation?
In case 'col1' only contains unique labels, you immediately get your result with sns.barplot(x='col1', y='col2', data=df). In case there are repeated labels, you can use the index as x and afterwards change the ticks:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'col1': list('ababab'), 'col2': np.random.randint(10, 20, 6)})
ax = sns.barplot(x=df.index, y='col2', data=df)
ax.set_xticklabels(df['col1'])
ax.set_xlabel('col1')
plt.show()
PS: Similarly, a horizontal bar chart could be created as:
df = pd.DataFrame({'col1': list('ababab'), 'col2': np.random.randint(10, 20, 6)})
ax = sns.barplot(x='col2', y=df.index, data=df, orient='h')
ax.set_yticklabels(df['col1'])
ax.set_ylabel('col1')
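If you do not need seaborn's styling, plain matplotlib never aggregates in the first place. Here is a rough sketch of the same idea, assuming the same made-up col1 and col2 columns as above: one bar per row, positioned by row number, with the repeated labels written onto the ticks afterwards.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame({'col1': list('ababab'), 'col2': np.random.randint(10, 20, 6)})

fig, ax = plt.subplots()
positions = np.arange(len(df))        # one x position per row
bars = ax.bar(positions, df['col2'])  # no grouping, no mean
ax.set_xticks(positions)
ax.set_xticklabels(df['col1'])        # repeated labels are fine here
ax.set_xlabel('col1')
ax.set_ylabel('col2')
# call plt.show() here when running as a script
```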

Add text annotation to plot from a pandas dataframe

My Code:
import matplotlib.pyplot as plt
import pandas as pd
import os, glob
path = r'C:/Users/New folder'
all_files = glob.glob(os.path.join(path, "*.txt"))
frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_,sep=',', parse_dates=[0], infer_datetime_format=True,header=None, usecols=[0,1,2,3,4,5,6], names=['Date','Time','open', 'high', 'low', 'close','volume','tradingsymbol'])
    frames.append(file_df)
# Combine the per-file frames into one DataFrame
df = pd.concat(frames, ignore_index=True)
df = df[['Date','Time','close','volume','tradingsymbol']]
df["Time"] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
print(df)
fig, axes = plt.subplots(nrows=2, ncols=1)
################### Volume ###########################
df.groupby('tradingsymbol')['volume'].plot(legend=True, rot=0, grid=True, ax=axes[0])
################### PRICE ###########################
df.groupby('tradingsymbol')['close'].plot(legend=True, rot=0, grid=True, ax=axes[1])
plt.show()
My current output looks like this:
I need to add text annotations to the matplotlib plot. My desired output is similar to the image below:
It's hard to answer this question without access to your dataset, or a simpler example. However, I'll try my best.
Let's begin by setting up a dataframe which may or may not resemble your data:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 3)),
columns=['a', 'b', 'c'])
With the dataset we'll now proceed to plot it with
fig, ax = plt.subplots(1, 1)
df.plot(legend=True, ax=ax)
Finally, we'll loop over the columns and annotate each data point:
for col in df.columns:
    for idx, val in enumerate(df[col]):
        ax.text(idx, val, str(val))
This gave me the following plot, which resembles your desired figure.
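One thing you may run into with ax.text is that the label sits directly on top of the marker. A possible variation, sketched below under the same toy-data assumption, uses ax.annotate with an offset given in points so the text floats slightly above each point. The 5 point offset and the 'o' marker are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1)
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 3)),
                  columns=['a', 'b', 'c'])

fig, ax = plt.subplots()
df.plot(legend=True, ax=ax, marker='o')

for col in df.columns:
    # items() yields (index value, cell value) pairs for the column
    for x, y in df[col].items():
        ax.annotate(str(y), xy=(x, y),
                    xytext=(0, 5), textcoords='offset points',  # 5 pt above the point
                    ha='center')
# call plt.show() here when running as a script
```

Because the offset is in points rather than data units, the gap stays the same regardless of the scale of the y-axis.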

100% area plot of a pandas DataFrame

In pandas' documentation you can find a discussion on area plots, and in particular stacking them. Is there an easy and straightforward way to get a 100% area stack plot like the one from this post?
The method is basically the same as in the other SO answer; divide each row by the sum of the row:
df = df.divide(df.sum(axis=1), axis=0)
Then you can call df.plot(kind='area', stacked=True, ...) as usual.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(2015)
y = np.random.randint(5, 50, (10,3))
x = np.arange(10)
df = pd.DataFrame(y, index=x)
df = df.divide(df.sum(axis=1), axis=0)
ax = df.plot(kind='area', stacked=True, title='100 % stacked area chart')
ax.set_ylabel('Percent (%)')
ax.margins(0, 0) # Set margins to avoid "whitespace"
plt.show()
yields
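Note that after the division the values are fractions between 0 and 1, even though the axis is labelled in percent. If you want the tick labels themselves to read as percentages, one option (a sketch only, assuming matplotlib's PercentFormatter suits your case) is shown below; the name shares and the axis label are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

np.random.seed(2015)
df = pd.DataFrame(np.random.randint(5, 50, (10, 3)), index=np.arange(10))

# Divide each row by its own total so every row sums to 1.0
shares = df.divide(df.sum(axis=1), axis=0)

ax = shares.plot(kind='area', stacked=True, title='100 % stacked area chart')
# xmax=1.0 tells the formatter that a value of 1.0 corresponds to 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0))
ax.set_ylabel('Share of row total')
ax.margins(0, 0)  # no padding around the data
# call plt.show() here when running as a script
```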

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order in which you call .hist() matters (the first one will be at the back).
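When overlaying like this, each call picks its own bin edges by default, so the bars of the two series may not line up, and the front histogram can hide the back one. A small sketch of one way around both, assuming you are happy with 10 shared bins and 50% transparency (both are my choices):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(37, 2)), columns=['A', 'B'])

# Compute one set of bin edges over both columns so the bars line up
bins = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
df['A'].hist(ax=ax, bins=bins, alpha=0.5, label='A')  # drawn first, at the back
df['B'].hist(ax=ax, bins=bins, alpha=0.5, label='B')  # drawn second, in front
ax.legend()
# call plt.show() here when running as a script
```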
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
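In case the picture does not come through, here is a tiny sketch you can run to see the reshaping yourself. It uses a 3-row frame instead of the 200 rows above, purely to keep the output short.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(3, 2)), columns=['A', 'B'])
melted = df.melt()

print(df.shape)                     # (3, 2): one column per variable
print(melted.shape)                 # (6, 2): one row per original cell
print(list(melted.columns))         # ['variable', 'value']
print(melted['variable'].tolist())  # ['A', 'A', 'A', 'B', 'B', 'B']
```

So melt() stacks the columns on top of each other and records in 'variable' which column each value came from, which is exactly what hue='variable' needs.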
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
Create two DataFrames and draw both on a single matplotlib axis:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I have explicitly specified the bins and range, because I did not remove outliers as the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib "multihist plot with different sizes" example.
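For anyone without the nsfg module, the same call can be tried on synthetic data. This is only a sketch: first_len and other_len are hypothetical stand-ins for first.prglngth and others.prglngth, and the normal distribution around 39 weeks is an assumption made so the example runs, not real survey data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical stand-ins for first.prglngth and others.prglngth
first_len = rng.normal(39, 2, size=400)
other_len = rng.normal(39, 2, size=500)

fig, ax = plt.subplots()
# Passing a list of datasets draws their bars side by side in every bin
counts, edges, _ = ax.hist([first_len, other_len], bins=10, range=(27, 50),
                           histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.set_xlabel("weeks")
ax.legend()
# call plt.show() here when running as a script
```

The two datasets may have different lengths; matplotlib bins each one separately against the same edges.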
This could be done more concisely:
plt.hist([First, Other], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also look at the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the DataFrame in the same figure.
Readability is limited, though, so check whether it helps in your case:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
