i would like to:
Store in a director series all the directors present in the director column of df.
Display in a horizontal bar graph the 10 most present directors in the catalogue.
Do I need to make a value.count first ? To set the top 10 before creating the plt.bar ?
# divided the director name
df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
You can create a countplot and use the order= parameter to select the 10 highest counts:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# directors = df['director'].str.split(',', expand=True).stack().reset_index(drop=True)
np.random.seed(123456)
directors = pd.Series(np.random.choice(
['Allen', 'Almodóvar', 'Bergman', 'Buñuel', 'Chaplin', 'Eastwood', 'Fassbinder', 'Fellini', 'Hitchcock', 'Keaton',
'Kubrick', 'Polanski', 'Renoir', 'Scorsese', 'Spielberg', 'Welles', 'Wenders', 'Wilder'], 200), name='Director')
ax = sns.countplot(y=directors, order=directors.value_counts().iloc[:10].index, palette='rocket')
ax.tick_params(axis='y', length=0)
plt.tight_layout()
plt.show()
# c. Top 10 recovered countries (Bar plot)
top10_recovered = pd.DataFrame(data.groupby('Country')['Recovered'].sum().nlargest(10).sort_values(ascending = False))
fig3 = px.bar(top10_recovered, x = top10_recovered.index, y = 'Recovered', height = 600, color = 'Recovered',
title = 'Top 10 Recovered Cases Countries', color_continuous_scale = px.colors.sequential.Viridis)
fig3.show()
Related
I have a dataset containing various fields of users, like dates, like count etc. I am trying to plot a histogram which shows like count with respect to date, how should I do that?
The dataset:
Assuming you want to plot number of public likes by date, you could do something like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('analysis.csv')
# convert text column to date time and keep only the date part
df['created_at'] = pd.to_datetime(df['created_at'])
df['created_at'] = df['created_at'].dt.date
# group by date taking the sum of public_metrics.like_count
df1 = df.groupby(['created_at'])['public_metrics.like_count'].sum().reset_index()
df1 = df1.set_index('created_at')
# plot and show
df1.plot()
plt.show()
And this is the output you will get
Just to add something to the first answer: you could visualize only the likes count of a specific month by making a bar plot. In this way, maybe you have a plot that is "closer" to the idea of histogram that you want. For example, I did it for January month:
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.dates as mdates
# Read and clean data
df = pd.read_csv('tweets_data.txt')
df['created_at'] = df['created_at'].str.replace(".000Z", "")
df.created_at
# Create a new dataframe with only two columns: data and number of likes
histogram_data = pd.concat([df[['created_at']],df[['public_metrics.like_count']]],axis=1)
January_values = histogram_data[histogram_data['created_at'].astype(str).str.contains('2018-01')] #histogram_data['created_at'].astype(str)
January_values
January_values.shape
dictionary = {}
for date, n_likes in January_values.itertuples(index=False):
dictionary[date] = n_likes
print(dictionary)
# Create figure and plot space
fig, ax = plt.subplots(figsize=(12, 12))
# Add x-axis and y-axis
ax.bar(dictionary.keys(),
dictionary.values(),
color='purple')
# Set title and labels for axes
ax.set_xlabel('Date', fontsize = 20)
ax.set_ylabel('Counts', fontsize = 20)
ax.set_title('Tweets likes counts in January 2018', fontsize = 15, weight = "bold")
# Ensure a major tick for each week using (interval=1)
ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
ax.tick_params(axis='x', which='major', labelsize=15, width=2)
plt.setp( ax.xaxis.get_majorticklabels(), rotation=-45, ha="left", weight="bold")
plt.show()
The output is:
Of course, if you use all your data (that are more than 3000 dates), you will obtain a plot with bars really sharp...
I have an issue with axis labels when using groupby and trying to plot with seaborn. Here is my problem:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'user': ['Bob', 'Jane','Alice','Bob','Jane','Alice'],
'income': [40000, 50000, 42000,47000,53000,46000]})
groupedProduct = df.groupby(['Product']).sum().reset_index()
I then plot a horizontal bar plot using seaborn:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
#Prettify the plot
bar.set_yticklabels( bar.get_yticks(), size = 10)
bar.set_xticklabels( bar.get_xticks(), size = 10)
bar.set_ylabel("User", fontsize = 20)
bar.set_xlabel("Income ($)", fontsize = 20)
bar.set_title("Total income per user", fontsize = 20)
sns.set_theme(style="whitegrid")
sns.set_color_codes("muted")
Unfortunately, when I run the code in such a manner, the y-axis ticks are labelled as 0,1,2 instead of Bob, Jane, Alice as I'd like it to.
I can get around the issue if I use matplotlib in the following manner:
df_group_user = df.groupby(['user']).sum()
df_group_user['income'].plot(kind="barh")
plt.title("Total income per user")
plt.ylabel("User")
plt.xlabel("Income ($)")
Ideally, I'd like to use seaborn for plotting, but if I don't use reset_index() like above, when calling sns.barplot:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
ValueError: Could not interpret input 'user'
just try re-writing the positions of x and y axis.
I'm using a diff dataframe to exhibit similar situation.
gp = df.groupby("Gender")['Salary'].sum().reset_index()
gp
Output:
Gender Salary
0 Female 8870
1 Male 23667
Now while plotting a bar chart, mention x axis first and then supply y axis and check,
bar = sns.barplot(x = 'Salary', y = "Gender", data = gp);
I have the following function:
Say hue="animals have three categories dog,bird,horse and we have two dataframes df_m and df_f consisting of data of male animals and women animals only, respectively.
The function plots three distplot of y (e.g y="weight") one for each hue={dog,bird,horse}. In each subplot we plot df_m[y] and df_f[y] such that I can compare the weight of male dogs/female dogs, male birds/female birds, male horses/female horses.
If I set distkwargs={"hist":False} when calling the function the legends ["F","M"] disappears, for some reason. Having distkwargs={"hist":True}` shows the legends
def plot_multi_kde_cat(self,dfs,y,hue,subkwargs={},distkwargs={},legends=[]):
"""
Create a subplot multi_kde with categories in the same plot
dfs: List
- DataFrames for each category e.g one for male and one for females
hue: string
- column for which each category is plotted (in each subplot)
"""
hues = dfs[0][hue].cat.categories
if len(hues)==2: #Only two categories
fig,axes = plt.subplots(1,2,**subkwargs) #Get axes and flatten them
axes=axes.flatten()
for ax,hu in zip(axes,hues):
for df in dfs:
sns.distplot(df.loc[df[hue]==hu,y],ax=ax,**distkwargs)
ax.set_title(f"Segment: {hu}")
ax.legend(legends)
else: #More than two categories: create a square grid and remove unsused axes
n_rows = int(np.ceil(np.sqrt(len(hues)))) #number of rows
fig,axes = plt.subplots(n_rows,n_rows,**subkwargs)
axes = axes.flatten()
for ax,hu in zip(axes,hues):
for df in dfs:
sns.distplot(df.loc[df[hue]==hu,y],ax=ax,**distkwargs)
ax.set_title(f"Segment: {hu}")
ax.legend(legends)
n_remove = len(axes)-len(hues) #number of axes to remove
if n_remove>0:
for ax in axes[-n_remove:]:
ax.set_visible(False)
fig.tight_layout()
return fig,axes
You can work around the problem by explicitly providing the label to the distplot. This forces a legend entry for each distplot. ax.legend() then already gets the correct labels.
Here is some minimal sample code to illustrate how everything works together:
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
def plot_multi_kde_cat(dfs, y, hue, subkwargs={}, distkwargs={}, legends=[]):
hues = np.unique(dfs[0][hue])
fig, axes = plt.subplots(1, len(hues), **subkwargs)
axes = axes.flatten()
for ax, hu in zip(axes, hues):
for df, legend_label in zip(dfs, legends):
sns.distplot(df.loc[df[hue] == hu, y], ax=ax, label=legend_label, **distkwargs)
ax.set_title(f"Segment: {hu}")
ax.legend()
N = 20
df_m = pd.DataFrame({'animal': np.random.choice(['tiger', 'horse'], N), 'weight': np.random.uniform(100, 200, N)})
df_f = pd.DataFrame({'animal': np.random.choice(['tiger', 'horse'], N), 'weight': np.random.uniform(80, 160, N)})
plot_multi_kde_cat([df_m, df_f], 'weight', 'animal',
subkwargs={}, distkwargs={'hist': False}, legends=['male', 'female'])
plt.show()
I would like to produce a heatmap in Python, similar to the one shown, where the size of the circle indicates the size of the sample in that cell. I looked in seaborn's gallery and couldn't find anything, and I don't think I can do this with matplotlib.
It's the inverse. While matplotlib can do pretty much everything, seaborn only provides a small subset of options.
So using matplotlib, you can plot a PatchCollection of circles as shown below.
Note: You could equally use a scatter plot, but since scatter dot sizes are in absolute units it would be rather hard to scale them into the grid.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
N = 10
M = 11
ylabels = ["".join(np.random.choice(list("PQRSTUVXYZ"), size=7)) for _ in range(N)]
xlabels = ["".join(np.random.choice(list("ABCDE"), size=3)) for _ in range(M)]
x, y = np.meshgrid(np.arange(M), np.arange(N))
s = np.random.randint(0, 180, size=(N,M))
c = np.random.rand(N, M)-0.5
fig, ax = plt.subplots()
R = s/s.max()/2
circles = [plt.Circle((j,i), radius=r) for r, j, i in zip(R.flat, x.flat, y.flat)]
col = PatchCollection(circles, array=c.flatten(), cmap="RdYlGn")
ax.add_collection(col)
ax.set(xticks=np.arange(M), yticks=np.arange(N),
xticklabels=xlabels, yticklabels=ylabels)
ax.set_xticks(np.arange(M+1)-0.5, minor=True)
ax.set_yticks(np.arange(N+1)-0.5, minor=True)
ax.grid(which='minor')
fig.colorbar(col)
plt.show()
Here's a possible solution using Bokeh Plots:
import pandas as pd
from bokeh.palettes import RdBu
from bokeh.models import LinearColorMapper, ColumnDataSource, ColorBar
from bokeh.models.ranges import FactorRange
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import numpy as np
output_notebook()
d = dict(x = ['A','A','A', 'B','B','B','C','C','C','D','D','D'],
y = ['B','C','D', 'A','C','D','B','D','A','A','B','C'],
corr = np.random.uniform(low=-1, high=1, size=(12,)).tolist())
df = pd.DataFrame(d)
df['size'] = np.where(df['corr']<0, np.abs(df['corr']), df['corr'])*50
#added a new column to make the plot size
colors = list(reversed(RdBu[9]))
exp_cmap = LinearColorMapper(palette=colors,
low = -1,
high = 1)
p = figure(x_range = FactorRange(), y_range = FactorRange(), plot_width=700,
plot_height=450, title="Correlation",
toolbar_location=None, tools="hover")
p.scatter("x","y",source=df, fill_alpha=1, line_width=0, size="size",
fill_color={"field":"corr", "transform":exp_cmap})
p.x_range.factors = sorted(df['x'].unique().tolist())
p.y_range.factors = sorted(df['y'].unique().tolist(), reverse = True)
p.xaxis.axis_label = 'Values'
p.yaxis.axis_label = 'Values'
bar = ColorBar(color_mapper=exp_cmap, location=(0,0))
p.add_layout(bar, "right")
show(p)
One option is to use matplotlib's scatter plots with legends and grid. You can specify size of those circles with specifying the scales. You can also change the color of each circle. You should somehow specify X,Y values so that the circles sit straight on lines. This is an example I got from here:
volume = np.random.rayleigh(27, size=40)
amount = np.random.poisson(10, size=40)
ranking = np.random.normal(size=40)
price = np.random.uniform(1, 10, size=40)
fig, ax = plt.subplots()
# Because the price is much too small when being provided as size for ``s``,
# we normalize it to some useful point sizes, s=0.3*(price*3)**2
scatter = ax.scatter(volume, amount, c=ranking, s=0.3*(price*3)**2,
vmin=-3, vmax=3, cmap="Spectral")
# Produce a legend for the ranking (colors). Even though there are 40 different
# rankings, we only want to show 5 of them in the legend.
legend1 = ax.legend(*scatter.legend_elements(num=5),
loc="upper left", title="Ranking")
ax.add_artist(legend1)
# Produce a legend for the price (sizes). Because we want to show the prices
# in dollars, we use the *func* argument to supply the inverse of the function
# used to calculate the sizes from above. The *fmt* ensures to show the price
# in dollars. Note how we target at 5 elements here, but obtain only 4 in the
# created legend due to the automatic round prices that are chosen for us.
kw = dict(prop="sizes", num=5, color=scatter.cmap(0.7), fmt="$ {x:.2f}",
func=lambda s: np.sqrt(s/.3)/3)
legend2 = ax.legend(*scatter.legend_elements(**kw),
loc="lower right", title="Price")
plt.show()
Output:
I don't have enough reputation to comment on Delenges' excellent answer, so I'll leave my comment as an answer instead:
R.flat doesn't order the way we need it to, so the circles assignment should be:
circles = [plt.Circle((j,i), radius=R[j][i]) for j, i in zip(x.flat, y.flat)]
Here is an easy example to plot circle_heatmap.
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine as load_data
from psynlig import plot_correlation_heatmap
plt.style.use('seaborn-talk')
data_set = load_data()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
#data = df_corr_selected
kwargs = {
'heatmap': {
'vmin': -1,
'vmax': 1,
'cmap': 'viridis',
},
'figure': {
'figsize': (14, 10),
},
}
plot_correlation_heatmap(data, bubble=True, annotate=False, **kwargs)
plt.show()
I'm working on a school project and I'm stuck in making a grouped bar chart. I found this article online with an explanation: https://www.pythoncharts.com/2019/03/26/grouped-bar-charts-matplotlib/
Now I have a dataset with an Age column and a Sex column in the Age column there stand how many years the client is and in the sex is a 0 for female and 1 for male. I want to plot the age difference between male and female. Now I have tried the following code like in the example:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, data.loc[data['Sex'] == '0', 'count'], width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, data.loc[data['Sex'] == '1', 'count'], width=bar_width)
This raised the following error: KeyError: 'count'
Then I tried to change the code a bit and got another error:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, (data.loc[data['Sex'] == '0'].count()), width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, (data.loc[data['Sex'] == '1'].count()), width=bar_width)
This raised the error: ValueError: shape mismatch: objects cannot be broadcast to a single shape
Now how do I count the results that I do can make this grouped bar chart?
It seems like the article goes through too much trouble just to plot grouped chart bar:
np.random.seed(1)
data = pd.DataFrame({'Sex':np.random.randint(0,2,1000),
'Age':np.random.randint(20,50,1000)})
(data.groupby('Age')['Sex'].value_counts() # count the Sex values for each Age
.unstack('Sex') # turn Sex into columns
.plot.bar(figsize=(12,6)) # plot grouped bar
)
Or even simpler with seaborn:
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=data, x='Age', hue='Sex', ax=ax)
Output: