I would like to produce a scatter plot of a pandas DataFrame with categorical row and column labels using matplotlib. A sample DataFrame looks like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"a": [1,2], "b": [3,4]}, index=["c","d"])
#    a  b
# c  1  3
# d  2  4
The marker size is a function of the respective DataFrame values. So far, I came up with an awkward solution that essentially enumerates the rows and columns, plots the data, and then reconstructs the labels:
flat = df.reset_index(drop=True).T.reset_index(drop=True).T.stack().reset_index()
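#    level_0  level_1  0
# 0        0        0  1
# 1        0        1  3
# 2        1        0  2
# 3        1        1  4
# level_0 is the row position and level_1 the column position,
# so the columns go on x to match the tick labels set below
flat.plot(kind='scatter', x='level_1', y='level_0', s=100*flat[0])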
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
This kind of works.
Now the question: is there a more intuitive, more integrated way to produce this scatter plot, ideally without splitting the data and the metadata?
Maybe not the entire answer you're looking for, but an idea to help save time and readability with the flat= line of code.
The pandas unstack method produces a Series with a MultiIndex.
dfu = df.unstack()
print(dfu.index)
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
The MultiIndex contains the x and y positions needed to construct the plot (in labels, which pandas 0.24 and later call codes). Here, I assign the levels and codes to more informative variable names better suited for plotting.
xlabels, ylabels = dfu.index.levels
xs, ys = dfu.index.codes
Plotting is pretty straight-forward from here.
plt.scatter(xs, ys, s=dfu*100)
plt.xticks(range(len(xlabels)), xlabels)
plt.yticks(range(len(ylabels)), ylabels)
plt.show()
I tried this on a few different DataFrame shapes and it seemed to hold up.
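As a possible alternative to computing tick positions by hand, here is a minimal sketch that relies on matplotlib's support for string (categorical) coordinates, so the labels travel with the data. The names long_df, row, col, and value are illustrative and not from the posts above, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for non-interactive runs
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["c", "d"])

# Long format: one row per (row label, column label, value).
long_df = df.stack().reset_index()
long_df.columns = ["row", "col", "value"]  # illustrative names

fig, ax = plt.subplots()
# Passing strings lets matplotlib build the categorical axes itself,
# so no xticks/yticks bookkeeping is needed.
ax.scatter(long_df["col"], long_df["row"], s=100 * long_df["value"])
```

Categories appear on each axis in order of first appearance, so this sketch does not control the label order beyond what the DataFrame already implies.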
It's not exactly what you were asking for, but it helps to visualize values in a similar way:
import seaborn as sns
sns.heatmap(df[::-1], annot=True)
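Maybe you can use a NumPy array and pd.melt to create the scatter plot, as shown below:
import numpy as np
# (x, y) positions in column-by-column order, the same order pd.melt emits values
arr = np.array([[i,j] for i in range(df.shape[1]) for j in range(df.shape[0])])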
plt.scatter(arr[:,0],arr[:,1],s=100*pd.melt(df)['value'],marker='o')
plt.xlabel('level_0')
plt.ylabel('level_1')
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
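The comprehension above depends on pd.melt emitting the values column by column. The following sketch is one way to check that assumption while keeping the labels next to the values. It assumes pandas 1.1 or later for ignore_index=False, and the names long_df, xs, and ys are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["c", "d"])

# Keep the row labels while melting, then turn them into a regular column.
long_df = (
    df.melt(ignore_index=False, var_name="col")
      .rename_axis("row")
      .reset_index()
)

# Translate the labels back into integer positions for plotting.
xs = df.columns.get_indexer(long_df["col"])
ys = df.index.get_indexer(long_df["row"])

# Positions built the same way as in the answer above, for comparison.
arr = np.array([[i, j] for i in range(df.shape[1]) for j in range(df.shape[0])])
```

Here xs, ys, and long_df["value"] could be passed to plt.scatter in place of arr and pd.melt(df)['value'].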
Related
Say I have a dataframe structured like so:
Name x y
Joe 0,1,5 0,3,8
Sue 0,2,8 1,9,5
...
Harold 0,5,6 0,7,2
I'd like to plot the values in the x and y axis on a line plot based on row. In reality, there are many x and y values, but there is always one x value for every y value in these columns. The name of the plot would be the value in "name".
I've tried to do this by first converting x and y to lists in their own separate columns like so:
df['xval'] = df['x'].str.split(',')
df['yval'] = df['y'].str.split(',')
And then passing them to seaborn:
ax = sns.lineplot(x=df['xval'], y=df['yval'], data=df)
However, this does not work because 1) I receive an error, which I presume is due to attempting to pass a list from a dataframe, claiming:
TypeError: unhashable type: 'list'
And 2) I cannot specify the value for df['name'] for the specific line plot. What's the best way to go about solving this problem?
Data and imports:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
'name': ['joe', 'sue', 'mike'],
'x': ['0,1,5', '0,2,8', '0,4'],
'y': ['0,3,8', '1,9,5', '1,6']
})
We should convert df into a usable format for plotting, which makes all the plotting easier. We can take advantage of the fact that x and y have a 1-to-1 relationship. Notice I've added a third name with 2 x/y values instead of 3 to show that this method works for varying numbers of x and y values per name, as long as each row has equal numbers of x and y values.
Creating the plot_df:
# Grab Name Column to Start Plot DF with
plot_df = df.loc[:, ['name']]
# Split X column
plot_df['x'] = df['x'].str.split(',')
# Explode X into Rows
plot_df = plot_df.explode('x').reset_index(drop=True)
# Split and Series Explode y in one step
# This works IF AND ONLY IF a 1-to-1 relationship for x and y
plot_df['y'] = df['y'].str.split(',').explode().reset_index(drop=True)
# These need to be numeric to plot correctly
# (plain column assignment so the columns actually take the int dtype)
plot_df[['x', 'y']] = plot_df[['x', 'y']].astype(int)
plot_df:
name x y
0 joe 0 0
1 joe 1 3
2 joe 5 8
3 sue 0 1
4 sue 2 9
5 sue 8 5
6 mike 0 1
7 mike 4 6
References to the methods used in creating plot_df:
DataFrame.loc to subset the dataframe
Series.str.split to split the comma separated values into a list
DataFrame.explode to upscale the DataFrame based on the iterable in x
DataFrame.reset_index to make index unique again after exploding
Series.explode to upscale the lists in the Series y.
Series.reset_index to make index unique again after exploding
DataFrame.astype to convert the values to a numeric type. They are still strings after splitting and exploding, so they would not plot correctly otherwise.
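The same plot_df can probably be built in a single chain. This is a minimal sketch, not the method used above: it assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and it relies on each row having equally many x and y values.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['joe', 'sue', 'mike'],
    'x': ['0,1,5', '0,2,8', '0,4'],
    'y': ['0,3,8', '1,9,5', '1,6']
})

plot_df = (
    # Replace the comma-separated strings with lists.
    df.assign(x=df['x'].str.split(','), y=df['y'].str.split(','))
      # Explode both list columns together, one output row per (x, y) pair.
      .explode(['x', 'y'], ignore_index=True)
      # The exploded values are still strings, so convert them.
      .astype({'x': int, 'y': int})
)
```

Exploding both columns in one call keeps each x aligned with its y without depending on two separately reset indexes.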
Plotting (Option 1)
# Plot with hue set to name.
sns.lineplot(data=plot_df, x='x', y='y', hue='name')
plt.show()
References for plotting separate lines:
sns.lineplot to plot. Note the hue argument to create separate lines based on name.
pyplot.show to display.
Plotting (Option 2.a) Subplots:
sns.relplot(data=plot_df, x='x', y='y', col='name', kind='line')
plt.tight_layout()
plt.show()
Plotting (Option 2.b) Subplots:
# Use Grouper From plot_df
grouper = plot_df.groupby('name')
# Create Subplots based on the number of groups (ngroups)
fig, axes = plt.subplots(nrows=grouper.ngroups)
# Iterate over axes and groups
for ax, (grp_name, grp) in zip(axes, grouper):
    # Plot from each grp DataFrame on ax from axes
    sns.lineplot(data=grp, x='x', y='y', ax=ax, label=grp_name)
plt.show()
References for plotting subplots:
2.a
relplot: the row or col parameter can be used to create subplots, much as hue creates multiple lines. It returns a seaborn.FacetGrid, so post-processing differs from lineplot, which returns a matplotlib.axes.Axes.
2.b
groupby to create iterable that can be used to plot subplots.
pyplot.subplots to create subplots to plot on.
groupby.ngroups to count the number of groups.
zip to iterate over axes and groups simultaneously.
sns.lineplot to plot. Note label is needed to have legends. grp_name contains the current key that is common in the current grp DataFrame.
pyplot.show to display.
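One caveat with the subplot loop in 2.b: plt.subplots returns a single Axes rather than an array when there is only one group, so zip(axes, grouper) would fail in that case. A minimal sketch of one way around this uses squeeze=False. It plots with plain matplotlib instead of seaborn to keep the example short, rebuilds plot_df by hand, and selects the Agg backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for non-interactive runs
import matplotlib.pyplot as plt
import pandas as pd

# Same content as plot_df above, written out directly.
plot_df = pd.DataFrame({
    "name": ["joe"] * 3 + ["sue"] * 3 + ["mike"] * 2,
    "x": [0, 1, 5, 0, 2, 8, 0, 4],
    "y": [0, 3, 8, 1, 9, 5, 1, 6],
})

grouper = plot_df.groupby("name")
# squeeze=False always returns a 2-D array of Axes, even for a single group.
fig, axes = plt.subplots(nrows=grouper.ngroups, squeeze=False)
# ravel() flattens the 2-D array so it can be zipped with the groups.
for ax, (grp_name, grp) in zip(axes.ravel(), grouper):
    ax.plot(grp["x"], grp["y"], label=grp_name)
    ax.legend()
```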
Plotting option 3 (separate plots):
# Plot from each grp DataFrame in its own plot
for grp_name, grp in plot_df.groupby('name'):
    fig, ax = plt.subplots()
    sns.lineplot(data=grp, x='x', y='y', ax=ax)
    ax.set_title(grp_name)
    fig.show()
(Figures: one separate line plot each for joe, mike, and sue.)
References for plotting separate plots:
groupby to create iterable that can be used to plot each name separately.
pyplot.subplots to create separate plot to plot on.
sns.lineplot to plot on ax.
Axes.set_title to title each plot with grp_name, the key shared by the current grp DataFrame.
Figure.show to display.
From what I understood this is what you want.
df = pd.DataFrame()
df['name'] = ['joe', 'sue']
df['x'] = ['0,1,5', '0,2,8']
df['y'] = ['0,3,8', '1,9,5']
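# Split the strings and convert the pieces to integers so they plot as numbers
df['newx'] = df['x'].str.split(',').apply(lambda vals: [int(v) for v in vals])
df['newy'] = df['y'].str.split(',').apply(lambda vals: [int(v) for v in vals])
for i in range(len(df)):
    sns.lineplot(x=df.loc[i, 'newx'], y=df.loc[i, 'newy'], label=df.loc[i, 'name'])
plt.legend()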
I have a pandas dataframe that has an int64 column which has values either 0 or 1. And another object column that has different strings.
I want to plot a graph (preferably bar or pie) that will show how many values are equal to 1, and how many are equal to 0 in that int64 column.
Also, I would like to label them in the legend as 1 - Democrats, 0 - Republicans.
To reproduce your df let me do
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
'Party': np.random.choice([0, 1], 1000)
})
Now map the values [0, 1] to the wanted categories:
df['Party_Name'] = df.Party.map({0: 'Republicans', 1: 'Democrats'})
Then count the values and plot the pie:
df.Party_Name.value_counts().plot.pie(
y='Party_Name',
autopct='%1.0f%%'
);
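Since the question also mentions a bar chart, here is a comparable sketch using Series.plot.bar on the same counts. The variable name counts and the y-axis label are illustrative, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for non-interactive runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"Party": np.random.choice([0, 1], 1000)})

# Map the 0/1 codes to names, then count how often each name occurs.
counts = df["Party"].map({0: "Republicans", 1: "Democrats"}).value_counts()

fig, ax = plt.subplots()
# One bar per party; rot=0 keeps the tick labels horizontal.
counts.plot.bar(ax=ax, rot=0)
ax.set_ylabel("Number of rows")
```

The party names end up as tick labels under the bars, so a separate legend is not needed here.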
I have a dataframe like so:
df = pd.DataFrame({"idx":[1,2,3]*2,"a":[1]*3+[2]*3,'b':[3]*3+[4]*3,'grp':[4]*3+[5]*3})
df = df.set_index("idx")
df
a b grp
idx
1 1 3 4
2 1 3 4
3 1 3 4
1 2 4 5
2 2 4 5
3 2 4 5
and I would like to plot the values of a and b as a function of idx, making one subplot per column and one line per group.
I managed to do this by creating the axes separately and iterating over the groups, as proposed here. But I would like to use the subplots parameter of the plot function to avoid looping.
I tried solutions like
df.groupby("grp").plot(subplots=True)
But it plots the groups in different subplots, and removing the groupby does not produce the two separate lines as in the example.
Is it possible? Also is it better to iterate and use matplotlib plot or use pandas plot function?
IIUC, you can do something like this:
axs = df.set_index('grp', append=True)\
.stack()\
.unstack('grp')\
.rename_axis(['idx','title'])\
.reset_index('title').groupby('title').plot()
[v.set_title(f'{i}') for i, v in axs.items()]
It may be easier to simply loop and plot:
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax = iter(ax)
for n, g in df.set_index('grp', append=True)\
.stack()\
.unstack('grp')\
.rename_axis(['idx','title'])\
.reset_index('title').groupby('title'):
    g.plot(ax=next(ax), title=f'{n}')
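The reshaping chain above can possibly be shortened with DataFrame.pivot, which spreads grp across the columns in one step. This is only a sketch under that assumption: the name wide is illustrative, a short loop over the two value columns remains, and the Agg backend is chosen only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for non-interactive runs
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"idx": [1, 2, 3] * 2, "a": [1] * 3 + [2] * 3,
                   "b": [3] * 3 + [4] * 3, "grp": [4] * 3 + [5] * 3})
df = df.set_index("idx")

# Wide layout: the index is idx, the columns are (value column, grp) pairs.
wide = df.pivot(columns="grp")

fig, axes = plt.subplots(ncols=2, figsize=(10, 5))
for ax, col in zip(axes, ["a", "b"]):
    # wide[col] has one column per group, so pandas draws one line per group.
    wide[col].plot(ax=ax, title=col)
```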
If I understood your question correctly, you can access the columns and the index of a pandas DataFrame directly. For example:
import numpy as np
import matplotlib.pyplot as plt
# idx is the index of df (after set_index), not a column
x = np.array(df.index)
a = np.array(df['a'])
b = np.array(df['b'])
plt.subplot(1,2,1)  # (121) will also work
# fill in the title etc. for the first plot here
plt.plot(x,a)
plt.subplot(1,2,2)
# fill in the title etc. for the second plot here
plt.plot(x,b)
plt.show()
Edit: sorry, now changed to use subplot.
I have two dataframes, df1 and df2, in Python, built from NumPy arrays: df1 has 50 rows and 8 columns, and df2 has 10 rows and 8 columns. I would like to use pairplot to see these values. I have made something like this:
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
sns.pairplot(df1)
sns.pairplot(df2)
plt.show()
But I would like the points or histograms of df2 to appear superimposed, for example in red, on the df1 points, which are in blue. How can I do that?
To illustrate the problem, I use the iris dataset.
First, produce two dataframes:
import pandas as pd
import seaborn as sns
iris = sns.load_dataset("iris")
df1 = iris[iris.species =='setosa']
df2 = iris[iris.species =='versicolor']
Now we have your starting point. Then concatenate the dataframes and plot the result:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df12 = pd.concat([df1, df2])
g = sns.pairplot(df12, hue="species")
Use the hue parameter to separate the points by color.
With the hue parameter in seaborn, you can choose the column that distinguishes them.
sns.pairplot(joined_df,hue='special_column_to_differ_df')
But you will have to join them first.
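When the two dataframes come from plain arrays, as in the question, there is no column like species to pass as hue, so one has to be added before joining. A minimal sketch follows, assuming random stand-in data and hypothetical column names v0..v7 and source, none of which come from the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data1 = rng.normal(size=(50, 8))  # stand-in for the first array
data2 = rng.normal(size=(10, 8))  # stand-in for the second array

cols = [f"v{i}" for i in range(8)]  # hypothetical column names
df1 = pd.DataFrame(data1, columns=cols)
df2 = pd.DataFrame(data2, columns=cols)

# Tag each frame with its origin, then stack them into one frame.
combined = pd.concat(
    [df1.assign(source="df1"), df2.assign(source="df2")],
    ignore_index=True,
)
```

The combined frame can then be passed to sns.pairplot(combined, hue="source"), optionally with palette={"df1": "blue", "df2": "red"} to get the colors asked for in the question.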
Something strange is happening in matplotlib.
I have a pandas dataframe and I'm making a stacked histogram using two of its columns. One column holds floats that go into the histogram bins. The other column holds only 0's and 1's, which are used to separate the data into two stacks. My actual code is a bit more complicated, but it goes something like this:
print(df)
df =
col1 col2
1.7 1
2.4 0
3.1 0
4.0 1
etc etc
# First I separate the data by the 0's and 1's in col2
df_1 = df.loc[df['col2']==1]
df_0 = df.loc[df['col2']==0]
fig, axes =
Plotting with matplotlib's histogram function works fine, sort of. If I call this:
fig,axes= plt.subplots(nrows=1, ncols=1)
n,bins,patches= axes.hist( [ df_0['col1'], df_1['col1'] ] , histtype='step', stacked=True, fill=True)
...I get this very nice plot:
HOWEVER, something very strange happens if I flip the order of df_0 and df_1 when I call hist().
Like if I do this instead:
n,bins,patches= axes.hist( [ df_1['col1'], df_0['col1'] ] , histtype='step', stacked=True, fill=True)
... I get a plot with the stacks flipped (as expected), BUT now the plot has picked up a strange artifact; there's like an invisible line that is cutting off and filling in some places of the graph with color.
What the heck is going on here? My first thought was that maybe column1 or column2 had NaN values or something, but I checked those and the column values are fine. Any ideas on what might be causing this?
fill is not a useful argument to hist. It is a valid argument, because you may fill any patch in matplotlib. However, here you do not have a closed patch to fill.
Instead you may be looking for the different histtype options that are shown in the histogram_histtypes example.
histtype="stepfilled"
histtype='bar'
In this case they both give the same plot:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np; np.random.seed(42)
a = np.random.rayleigh(size=20)
b = np.random.randn(20)+3
df = pd.DataFrame({"col1" : np.concatenate((a,b)),
"col2" : [0]*20 + [1]*20})
df_1 = df.loc[df['col2']==1]
df_0 = df.loc[df['col2']==0]
fig,axes= plt.subplots(ncols=2)
n,bins,patches= axes[0].hist([df_0['col1'], df_1['col1']], histtype='stepfilled', stacked=True)
n,bins,patches= axes[1].hist([df_0['col1'], df_1['col1']], histtype='bar', stacked=True)
plt.show()
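To check that flipping the order of the two datasets only changes which one sits at the bottom of the stack, the following sketch compares the values returned by hist for both orders. With stacked=True the returned counts are cumulative, so the last row is the top outline of the stack. The names n_ab and n_ba are illustrative, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for non-interactive runs
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
a = np.random.rayleigh(size=20)
b = np.random.randn(20) + 3

fig, axes = plt.subplots(ncols=2)
# Same data and histtype, only the stacking order differs.
n_ab, bins_ab, _ = axes[0].hist([a, b], histtype="stepfilled", stacked=True)
n_ba, bins_ba, _ = axes[1].hist([b, a], histtype="stepfilled", stacked=True)

# n_ab[0] counts only a, while n_ab[-1] is the cumulative top of the stack.
same_bins = np.array_equal(bins_ab, bins_ba)
same_top = np.array_equal(n_ab[-1], n_ba[-1])
```

If same_bins and same_top are both True, the two plots share the same outline, and any remaining visual difference comes from how the patches are drawn rather than from the data.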