Table
Hi, I'm trying to make a histogram from the table above; my code is below.
import numpy as np

def histograms(t):
    salaries = t.column('Salary')
    # bin edges every 1000, from the lowest salary to just past the highest
    salary_bins = np.arange(min(salaries), max(salaries)+1000, 1000)
    t.hist('Salary', bins=salary_bins, unit='$')

histograms(full_data)
But it's not showing properly. Can you help me?
Histogram
The bins argument of a histogram can be an integer, in which case the data range is divided into that many equal-width bins, or a sequence of explicit bin edges.
Let's say you have a sample dataframe of salaries like this:
import pandas as pd
sample_dataframe = pd.DataFrame({'name':['joe','jill','martin','emily','frank','john','sue','sally','sam'],
'salary':[105324,65002,98314,24480,55000,62000,75000,79000,32000]})
#output:
name salary
0 joe 105324
1 jill 65002
2 martin 98314
3 emily 24480
4 frank 55000
5 john 62000
6 sue 75000
7 sally 79000
8 sam 32000
If you want to plot a histogram where the salaries will be distributed in 10 bins and you want to stick with your function, you can do:
import matplotlib.pyplot as plt
def histograms(t):
    plt.hist(t.salary, bins = 10, color = 'orange', edgecolor = 'black')
    plt.xlabel('Salary')
    plt.ylabel('Count')
    plt.show()

histograms(sample_dataframe)
If you want the x-axis ticks to reflect the boundaries of the 10 bins, add the import below and put the plt.xticks call inside the function, before plt.show():
import numpy as np
plt.xticks(np.linspace(min(t.salary), max(t.salary), 11), rotation = 45)
Finally, to show the y-ticks as integers, add these lines:
from matplotlib.ticker import MaxNLocator
plt.gca().yaxis.set_major_locator(MaxNLocator(integer=True))
The final function looks like this:
def histograms(t):
    plt.hist(t.salary, bins = 10, color = 'orange', edgecolor = 'black')
    plt.xlabel('Salary')
    plt.ylabel('Count')
    plt.gca().yaxis.set_major_locator(MaxNLocator(integer=True))
    plt.xticks(np.linspace(min(t.salary), max(t.salary), 11), rotation = 45)
    plt.show()
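To check that the ticks from np.linspace really sit on the bin boundaries, here is a minimal sketch using np.histogram, which plt.hist relies on for binning. It reuses the sample salaries from above; the variable names are only illustrative.

```python
import numpy as np

# the sample salaries from the dataframe above
salaries = np.array([105324, 65002, 98314, 24480, 55000,
                     62000, 75000, 79000, 32000])

# bins given as an integer: 10 equal-width bins between min and max
counts, edges = np.histogram(salaries, bins=10)

# the tick positions used in the final function: 11 evenly spaced values
manual_edges = np.linspace(salaries.min(), salaries.max(), 11)

# 10 bins always have 11 edges, and they coincide with the tick positions
print(len(edges))
print(np.allclose(edges, manual_edges))
```

Every salary lands in exactly one bin, so counts.sum() equals the number of rows.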
Is this what you are looking for?
import matplotlib.pyplot as plt

def histograms(t):
    _min = min(t['salary'])
    _max = max(t['salary'])
    # number of bins so that each bin covers roughly 1000 of the salary range
    bins = int((_max - _min) / 1000)
    plt.hist(t['salary'], bins = bins)
    plt.show()

histograms(df)
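Note that an integer bin count only gives bins that are approximately 1000 wide, because the range is rarely an exact multiple of 1000. If the bins should be exactly 1000 wide and start on round numbers, one option is to build the edges explicitly. This is a sketch under that assumption, with made-up variable names and the sample salaries from the earlier answer:

```python
import numpy as np

salaries = np.array([105324, 65002, 98314, 24480, 55000,
                     62000, 75000, 79000, 32000])
width = 1000

# round the lowest salary down and the highest salary up to a multiple of width
start = np.floor(salaries.min() / width) * width
stop = np.ceil(salaries.max() / width) * width

# explicit bin edges: start, start + 1000, ..., stop
edges = np.arange(start, stop + width, width)

counts, _ = np.histogram(salaries, bins=edges)
print(edges[0], edges[-1])
```

The same edges array can be passed as bins=edges to plt.hist.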
Related
I have an issue with axis labels when using groupby and trying to plot with seaborn. Here is my problem:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'user': ['Bob', 'Jane','Alice','Bob','Jane','Alice'],
'income': [40000, 50000, 42000,47000,53000,46000]})
df_group_user = df.groupby(['user']).sum().reset_index()
I then plot a horizontal bar plot using seaborn:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
#Prettify the plot
bar.set_yticklabels( bar.get_yticks(), size = 10)
bar.set_xticklabels( bar.get_xticks(), size = 10)
bar.set_ylabel("User", fontsize = 20)
bar.set_xlabel("Income ($)", fontsize = 20)
bar.set_title("Total income per user", fontsize = 20)
sns.set_theme(style="whitegrid")
sns.set_color_codes("muted")
Unfortunately, when I run the code this way, the y-axis ticks are labelled 0, 1, 2 instead of Bob, Jane, Alice as I'd like them to be.
I can get around the issue if I use matplotlib in the following manner:
df_group_user = df.groupby(['user']).sum()
df_group_user['income'].plot(kind="barh")
plt.title("Total income per user")
plt.ylabel("User")
plt.xlabel("Income ($)")
Ideally, I'd like to use seaborn for plotting, but if I don't use reset_index() as above, calling sns.barplot raises an error:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
ValueError: Could not interpret input 'user'
Try passing the x argument first, followed by y.
I'm using a different dataframe to demonstrate a similar situation.
gp = df.groupby("Gender")['Salary'].sum().reset_index()
gp
Output:
Gender Salary
0 Female 8870
1 Male 23667
Now, when plotting the bar chart, specify the x axis first and then the y axis:
bar = sns.barplot(x = 'Salary', y = "Gender", data = gp);
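For the data in the question, the 0, 1, 2 labels most likely come from bar.set_yticklabels(bar.get_yticks(), size = 10), which replaces the category names with the numeric tick positions. A minimal sketch that keeps 'user' as a regular column and changes only the label size (ax.tick_params is used here instead of set_yticklabels; that substitution is my suggestion, not part of the original code):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000, 47000, 53000, 46000]})

# as_index=False keeps 'user' as a column, like reset_index() would
df_group_user = df.groupby('user', as_index=False)['income'].sum()

fig, ax = plt.subplots()
sns.barplot(x='income', y='user', data=df_group_user, color='b', ax=ax)

# change the tick label size without overwriting the label text
ax.tick_params(axis='both', labelsize=10)
ax.set_ylabel('User', fontsize=20)
ax.set_xlabel('Income ($)', fontsize=20)
ax.set_title('Total income per user', fontsize=20)
```

Call plt.show() afterwards to display the figure when not working in a notebook.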
I would like to:
Store in a director series all the directors present in the director column of df.
Display in a horizontal bar graph the 10 most present directors in the catalogue.
Do I need to call value_counts() first, to find the top 10 before creating the bar plot?
# split the director names
df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
You can create a countplot and use the order= parameter to select the 10 highest counts:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# directors = df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
np.random.seed(123456)
directors = pd.Series(np.random.choice(
    ['Allen', 'Almodóvar', 'Bergman', 'Buñuel', 'Chaplin', 'Eastwood', 'Fassbinder', 'Fellini', 'Hitchcock', 'Keaton',
     'Kubrick', 'Polanski', 'Renoir', 'Scorsese', 'Spielberg', 'Welles', 'Wenders', 'Wilder'], 200), name='Director')
ax = sns.countplot(y=directors, order=directors.value_counts().iloc[:10].index, palette='rocket')
ax.tick_params(axis='y', length=0)
plt.tight_layout()
plt.show()
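To answer the value_counts part directly: yes, counting first and taking the ten largest works as well. Below is a small sketch with pandas only. The sample df is hypothetical, and the str.strip() call assumes the names are separated by a comma followed by a space, which would otherwise leave leading blanks after splitting.

```python
import pandas as pd

# hypothetical sample; the real df comes from the catalogue
df = pd.DataFrame({'director': ['Hitchcock', 'Kubrick, Spielberg',
                                'Hitchcock, Kubrick', None,
                                'Wilder', 'Hitchcock']})

# one row per director: split on commas, explode the lists, trim whitespace
directors = (df['director']
             .dropna()
             .str.split(',')
             .explode()
             .str.strip()
             .rename('Director')
             .reset_index(drop=True))

# the (up to) 10 most frequent directors, most frequent first
top10 = directors.value_counts().head(10)
print(top10)
```

From there, top10.sort_values().plot.barh() draws the horizontal bars with the most frequent director at the top.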
# c. Top 10 recovered countries (bar plot)
import plotly.express as px
# data is assumed to be a DataFrame with 'Country' and 'Recovered' columns
top10_recovered = pd.DataFrame(data.groupby('Country')['Recovered'].sum().nlargest(10).sort_values(ascending = False))
fig3 = px.bar(top10_recovered, x = top10_recovered.index, y = 'Recovered', height = 600, color = 'Recovered',
              title = 'Top 10 Recovered Cases Countries', color_continuous_scale = px.colors.sequential.Viridis)
fig3.show()
I have a dataframe:
df.head()[['price', 'volumen']]
price volumen
48 45 3
49 48 1
50 100 2
51 45 1
52 56 1
It represents the number of objects with a particular price.
I created a histogram based on the volume column:
I would like to add information about the price distribution of each bin. My idea is to use heatmaps instead of single-color columns. E.g. a color red will show a high price, and yellow a low price.
Here is an example plot to illustrate the general idea:
The following example uses seaborn's tips dataset. A histogram is created by grouping the total_bill into bins. And then the bars are colored depending on the tips in each group.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
sns.set_theme(style='white')
tips = sns.load_dataset('tips')
tips['bin'] = pd.cut(tips['total_bill'], 10) # histogram bin
grouped = tips.groupby('bin')
min_tip = tips['tip'].min()
max_tip = tips['tip'].max()
cmap = 'RdYlGn_r'
fig, ax = plt.subplots(figsize=(12, 4))
for bin, binned_df in grouped:
    bin_height = len(binned_df)
    binned_tips = np.sort(binned_df['tip']).reshape(-1, 1)
    ax.imshow(binned_tips, cmap=cmap, vmin=min_tip, vmax=max_tip, extent=[bin.left, bin.right, 0, bin_height],
              origin='lower', aspect='auto')
    ax.add_patch(mpatches.Rectangle((bin.left, 0), bin.length, bin_height, fc='none', ec='k', lw=1))
ax.autoscale()
ax.set_ylim(0, 1.05 * ax.get_ylim()[1])
ax.set_xlabel('total bill')
ax.set_ylabel('frequency')
plt.colorbar(ax.images[0], ax=ax, label='tip')
plt.tight_layout()
plt.show()
Here is how it looks with a banded colormap (cmap = plt.get_cmap('Spectral', 9)):
Here is another example using the 'mpg' dataset, with a histogram over car weight and coloring via mile-per-gallon.
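If a gradient inside each bar is more detail than needed, a simpler variant is to give each bar a single color based on the mean price in that bin. The sketch below is a simplification of the approach above, not the same plot; it uses random stand-in data with the question's column names (price, volumen), and the choice of the 'YlOrRd' colormap is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors

# random stand-in data with the column names from the question
rng = np.random.default_rng(0)
df = pd.DataFrame({'price': rng.integers(30, 120, 200),
                   'volumen': rng.integers(1, 6, 200)})

# one bin per volume value 1..5
edges = np.arange(0.5, 6.5, 1.0)
counts, _ = np.histogram(df['volumen'], bins=edges)

# mean price of the rows that fall into each bin
bin_index = np.digitize(df['volumen'], edges) - 1
mean_price = df.groupby(bin_index)['price'].mean().reindex(range(len(counts)))

# map the mean prices to colors: yellow = low price, red = high price
norm = colors.Normalize(df['price'].min(), df['price'].max())
cmap = plt.get_cmap('YlOrRd')

fig, ax = plt.subplots()
bars = ax.bar(edges[:-1], counts, width=np.diff(edges), align='edge',
              color=cmap(norm(mean_price.to_numpy())), edgecolor='k')
fig.colorbar(plt.cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label='mean price')
ax.set_xlabel('volumen')
ax.set_ylabel('frequency')
```

This loses the within-bin price distribution that the imshow version shows, so it only fits when the average per bin is enough.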
You can generate a heat map using seaborn. Bin and reshape the dataframe first. This example uses random data, so the heat map is not very interesting.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
s = 50
df = pd.DataFrame({"price":np.random.randint(30,120, s),"volume":np.random.randint(1,5, s)})
fig, ax = plt.subplots(2, figsize=[10,6])
df.loc[:,"volume"].plot(ax=ax[0], kind="hist", bins=3)
# reshape for a heatmap... put price into bins and make 2D
dfh = df.assign(pbin=pd.qcut(df.price,5)).groupby(["pbin","volume"]).mean().unstack(1).droplevel(0,axis=1)
axh = sns.heatmap(dfh, ax=ax[1])
Is there a way to get a size-frequency histogram for a population under different scenarios for specific days in Python, showing means with error bars?
My data are in a format similar to this table:
SCENARIO RUN MEAN DAY
A 1 25 10
A 1 15 30
A 2 20 10
A 2 27 30
B 1 45 10
B 1 50 30
B 2 43 10
B 2 35 30
results_data.groupby(['Scenario', 'Run']).mean() does not break the data down by the days I want to visualize; it returns the mean over the days in each run.
Use seaborn.FacetGrid.
FacetGrid is a multi-plot grid for plotting conditional relationships.
Map seaborn.distplot onto the FacetGrid and use hue='DAY'. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.
Setup Data and DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random # just for test data
import numpy as np # just for test data
# data
random.seed(365)
np.random.seed(365)
data = {'MEAN': [np.random.randint(20, 51) for _ in range(500)],
'SCENARIO': [random.choice(['A', 'B']) for _ in range(500)],
'DAY': [random.choice([10, 30]) for _ in range(500)],
'RUN': [random.choice([1, 2]) for _ in range(500)]}
# create dataframe
df = pd.DataFrame(data)
Plot with kde=False
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5)
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=False, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plot with kde=True
g = sns.FacetGrid(df, col='RUN', row='SCENARIO', hue='DAY', height=5, palette='GnBu')
g = g.map(sns.distplot, 'MEAN', bins=range(20, 51, 5), kde=True, hist_kws=dict(edgecolor="k", linewidth=1)).add_legend()
plt.show()
Plots with error bars
Based on the answer to "How to add error bars to histogram diagram in Python".
This uses df from above.
Use matplotlib.pyplot.errorbar to plot the error bars on the histogram.
from itertools import product
# create unique combinations for filtering df
scenarios = df.SCENARIO.unique()
runs = df.RUN.unique()
days = df.DAY.unique()
combo_list = [scenarios, runs, days]
results = list(product(*combo_list))
# plot
for i, result in enumerate(results, 1):  # iterate through each set of combinations
    s, r, d = result
    data = df[(df.SCENARIO == s) & (df.RUN == r) & (df.DAY == d)]  # filter dataframe
    # add subplot rows, columns; needs to equal the number of combinations in results
    plt.subplot(2, 4, i)
    # plot hist and unpack values
    n, bins, _ = plt.hist(x='MEAN', bins=range(20, 51, 5), data=data, color='g')
    # calculate bin centers
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    # draw error bars using the sqrt error; you can use whatever you want there
    # Poissonian 1-sigma intervals would make more sense
    plt.errorbar(bin_centers, n, yerr=np.sqrt(n), fmt='k.')
    plt.title(f'Scenario: {s} | Run: {r} | Day: {d}')
plt.tight_layout()
plt.show()
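If the goal is the mean per scenario and day with an error bar across runs, rather than error bars on histogram counts, grouping by SCENARIO and DAY instead of SCENARIO and RUN may be what is needed. A sketch using the table from the question; taking the standard error of the mean as the error bar is my assumption, and any other spread measure could be substituted.

```python
import pandas as pd
import matplotlib.pyplot as plt

# the table from the question
results_data = pd.DataFrame({
    'SCENARIO': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'RUN':      [1, 1, 2, 2, 1, 1, 2, 2],
    'MEAN':     [25, 15, 20, 27, 45, 50, 43, 35],
    'DAY':      [10, 30, 10, 30, 10, 30, 10, 30]})

# mean and standard error across runs, for each scenario and day
summary = (results_data
           .groupby(['SCENARIO', 'DAY'])['MEAN']
           .agg(['mean', 'sem'])
           .reset_index())

# one line per scenario, with error bars at each day
fig, ax = plt.subplots()
for scenario, grp in summary.groupby('SCENARIO'):
    ax.errorbar(grp['DAY'], grp['mean'], yerr=grp['sem'],
                marker='o', capsize=4, label=f'Scenario {scenario}')
ax.set_xlabel('DAY')
ax.set_ylabel('MEAN')
ax.legend()
```

With only two runs per group the standard error is very rough, so treat the bars as indicative.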
I have a dataframe df containing ages for students and non students, which looks something like this:
Subject Student Age
001 yes 21
002 yes 45
003 no 61
004 no 37
...
I would like to plot the proportions of each group under the age of 40. I can do this in R with plot(factor(age < 40) ~ Student, data = df) which gives me:
Is there a way to replicate this in Python, ideally using either matplotlib or seaborn?
There is no built-in option to create such a plot. You can, of course, create it with matplotlib by calculating the respective numbers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = np.random.poisson(lam=40, size=6000)
b = ((a>50).astype(int)+np.random.rand(6000))>0.9
df = pd.DataFrame({"Subject" : np.arange(6000),
"Age" : a, "Student" : b})
df["Age>40"] = df["Age"] > 40
def propplot(x, y, data):
    xdata = data[[x,y]].groupby(x)
    xcount = xdata.count()
    fig, axes = plt.subplots(ncols=len(xcount),
                             gridspec_kw={"width_ratios":list(xcount[y].values)})
    for ax, (n,grp) in zip(axes, xdata):
        ycount = grp.groupby(y).count().T
        ycount /= float(ycount.values.sum())
        ycount.plot.bar(stacked=True, ax=ax, width=1, legend=False)
        ax.set_xlabel(n)
        ax.set_xlim(-.5,.5)
        ax.set_ylim(0,1)
        ax.set_xticks([])
    axes[0].set_ylabel(y)
    axes[0].legend(ncol=100, title=y, loc=(0,1.02))
    fig.text(0.5,0.02, x)
propplot("Student", "Age>40", df)
plt.show()
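If the variable bar widths of the R plot are not essential, the proportions themselves can be computed in one step with pd.crosstab. This is a simpler alternative, not a replica of the R output; the four rows are the sample from the question.

```python
import pandas as pd

# the sample rows shown in the question
df = pd.DataFrame({'Subject': ['001', '002', '003', '004'],
                   'Student': ['yes', 'yes', 'no', 'no'],
                   'Age': [21, 45, 61, 37]})

# True where the subject is under 40
under_40 = (df['Age'] < 40).rename('Age < 40')

# share of each Student group that is under / not under 40 (rows sum to 1)
proportions = pd.crosstab(df['Student'], under_40, normalize='index')
print(proportions)
```

proportions.plot.bar(stacked=True) then draws one stacked bar per group, all with equal width.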