I have a dataframe like so:
df = pd.DataFrame({"idx":[1,2,3]*2,"a":[1]*3+[2]*3,'b':[3]*3+[4]*3,'grp':[4]*3+[5]*3})
df = df.set_index("idx")
df
a b grp
idx
1 1 3 4
2 1 3 4
3 1 3 4
1 2 4 5
2 2 4 5
3 2 4 5
and I would like to plot the values of a and b as a function of idx, making one subplot per column and one line per group.
I managed to do this by creating the axes separately and iterating over the groups, as proposed here. But I would like to use the subplots parameter of the plot function to avoid looping.
I tried solutions like
df.groupby("grp").plot(subplots=True)
But it plots the groups in different subplots, and removing the groupby does not produce the two separate lines as in the example.
Is it possible? Also, is it better to iterate and use matplotlib's plot or to use the pandas plot function?
IIUC, you can do something like this:
axs = df.set_index('grp', append=True)\
.stack()\
.unstack('grp')\
.rename_axis(['idx','title'])\
.reset_index('title').groupby('title').plot()
[v.set_title(f'{i}') for i, v in axs.items()]
Output:
Maybe it is easier to simply loop and plot:
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax = iter(ax)
for n, g in df.set_index('grp', append=True)\
              .stack()\
              .unstack('grp')\
              .rename_axis(['idx','title'])\
              .reset_index('title').groupby('title'):
    g.plot(ax=next(ax), title=f'{n}')
Output:
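A variant of the loop above, as a minimal sketch: it moves grp into the columns once and then draws one subplot per value column. The names wide and value_cols are illustrative and not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"idx": [1, 2, 3] * 2, "a": [1] * 3 + [2] * 3,
                   "b": [3] * 3 + [4] * 3, "grp": [4] * 3 + [5] * 3}).set_index("idx")

# Move grp into the columns: the result has a (value column, grp) column MultiIndex
wide = df.set_index("grp", append=True).unstack("grp")
value_cols = wide.columns.get_level_values(0).unique()

# squeeze=False keeps axes 2-D even if there is only one value column
fig, axes = plt.subplots(1, len(value_cols), figsize=(10, 5), squeeze=False)
for ax, col in zip(axes.ravel(), value_cols):
    # wide[col] has one column per group, so pandas draws one line per group
    wide[col].plot(ax=ax, title=col)
```

Calling plt.show() afterwards displays the figure.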
If I understood your question correctly, you can access the columns and rows of a pandas DataFrame directly. For example:
import numpy as np
import matplotlib.pyplot as plt
x = np.array(df.index)  # idx was set as the index above, so df['idx'] would raise a KeyError
a = np.array(df['a'])
b = np.array(df['b'])
plt.subplot(1,2,1)#(121) will also work
#fill in title etc for the first plot under here
plt.plot(x,a)
plt.subplot(1,2,2)
#fill in title etc for the second plot under here
plt.plot(x,b)
plt.show()
edit: Sorry now changed for subplot.
Say I have a dataframe structured like so:
Name x y
Joe 0,1,5 0,3,8
Sue 0,2,8 1,9,5
...
Harold 0,5,6 0,7,2
I'd like to plot the values in the x and y axis on a line plot based on row. In reality, there are many x and y values, but there is always one x value for every y value in these columns. The name of the plot would be the value in "name".
I've tried to do this by first converting x and y to lists in their own separate columns like so:
df['xval'] = df['x'].str.split(',')
df['yval'] = df['y'].str.split(',')
And then passing them to seaborn:
ax = sns.lineplot(x=df['xval'], y=df['yval'], data=df)
However, this does not work because 1) I receive an error, which I presume is due to attempting to pass a list from a dataframe, claiming:
TypeError: unhashable type: 'list'
And 2) I cannot specify the value for df['name'] for the specific line plot. What's the best way to go about solving this problem?
Data and imports:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
'name': ['joe', 'sue', 'mike'],
'x': ['0,1,5', '0,2,8', '0,4'],
'y': ['0,3,8', '1,9,5', '1,6']
})
We should convert df into a usable format for plotting, which makes all of the plotting easier. We can take advantage of the fact that x and y have a 1-to-1 relationship. Notice I've added a third name with 2 x/y values instead of 3 to show that this method works for varying numbers of x and y values per name, as long as each row has the same number of x and y values.
Creating the plot_df:
# Grab Name Column to Start Plot DF with
plot_df = df.loc[:, ['name']]
# Split X column
plot_df['x'] = df['x'].str.split(',')
# Explode X into Rows
plot_df = plot_df.explode('x').reset_index(drop=True)
# Split and Series Explode y in one step
# This works IF AND ONLY IF a 1-to-1 relationship for x and y
plot_df['y'] = df['y'].str.split(',').explode().reset_index(drop=True)
# These need to be numeric to plot correctly
# (assign by column name: recent pandas versions may keep the object dtype when assigning through .loc[:, cols])
plot_df[['x', 'y']] = plot_df[['x', 'y']].astype(int)
plot_df:
name x y
0 joe 0 0
1 joe 1 3
2 joe 5 8
3 sue 0 1
4 sue 2 9
5 sue 8 5
6 mike 0 1
7 mike 4 6
References to the methods used in creating plot_df:
DataFrame.loc to subset the dataframe
Series.str.split to split the comma separated values into a list
DataFrame.explode to upscale the DataFrame based on the iterable in x
DataFrame.reset_index to make index unique again after exploding
Series.explode to upscale the lists in the Series y.
Series.reset_index to make index unique again after exploding
DataFrame.astype since the values are initially strings, so just splitting and exploding is not enough; they need to be converted to a numeric type to plot correctly
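The same reshaping can also be sketched as a single chain. This assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns; it is an alternative to the step-by-step version above, not the original answer's method.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['joe', 'sue', 'mike'],
    'x': ['0,1,5', '0,2,8', '0,4'],
    'y': ['0,3,8', '1,9,5', '1,6']
})

plot_df = (
    df.assign(x=df['x'].str.split(','), y=df['y'].str.split(','))
      # Explode both list columns together; each row must hold lists of equal length
      .explode(['x', 'y'], ignore_index=True)
      # The exploded values are still strings, so convert them to integers
      .astype({'x': int, 'y': int})
)
print(plot_df)
```

Exploding both columns in one call keeps x and y aligned without relying on a matching index after two separate explodes.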
Plotting (Option 1)
# Plot with hue set to name.
sns.lineplot(data=plot_df, x='x', y='y', hue='name')
plt.show()
References for plotting separate lines:
sns.lineplot to plot. Note the hue argument to create separate lines based on name.
pyplot.show to display.
Plotting (Option 2.a) Subplots:
sns.relplot(data=plot_df, x='x', y='y', col='name', kind='line')
plt.tight_layout()
plt.show()
Plotting (Option 2.b) Subplots:
# Use Grouper From plot_df
grouper = plot_df.groupby('name')
# Create Subplots based on the number of groups (ngroups)
fig, axes = plt.subplots(nrows=grouper.ngroups)
# Iterate over axes and groups
for ax, (grp_name, grp) in zip(axes, grouper):
    # Plot from each grp DataFrame on ax from axes
    sns.lineplot(data=grp, x='x', y='y', ax=ax, label=grp_name)
plt.show()
References for plotting subplots:
2.a
relplot the row or col parameter can be used to create subplots in a similar way to how hue creates multiple lines. This will return a seaborn.FacetGrid so post processing will be different than lineplot which returns matplotlib.axes.Axes
2.b
groupby to create iterable that can be used to plot subplots.
pyplot.subplots to create subplots to plot on.
groupby.ngroups to count number of groups.
zip to iterate over axes and groups simultaneously.
sns.lineplot to plot. Note label is needed to have legends. grp_name contains the current key that is common in the current grp DataFrame.
pyplot.show to display.
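One caveat with option 2.b: with a single group, plt.subplots returns a single Axes rather than an array, so zip(axes, grouper) would fail. A minimal sketch that guards against this with squeeze=False follows. It uses matplotlib's ax.plot in place of sns.lineplot to keep the example small, and it rebuilds plot_df by hand so it is self-contained.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same long-format data as plot_df above, rebuilt here so the sketch is self-contained
plot_df = pd.DataFrame({
    'name': ['joe'] * 3 + ['sue'] * 3 + ['mike'] * 2,
    'x': [0, 1, 5, 0, 2, 8, 0, 4],
    'y': [0, 3, 8, 1, 9, 5, 1, 6],
})

grouper = plot_df.groupby('name')
# squeeze=False always returns a 2-D array of axes, even when there is only one group
fig, axes = plt.subplots(nrows=grouper.ngroups, squeeze=False)
for ax, (grp_name, grp) in zip(axes.ravel(), grouper):
    ax.plot(grp['x'], grp['y'], label=grp_name)
    ax.legend()
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure.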
Plotting option 3 (separate plots):
# Plot from each grp DataFrame in its own plot
for grp_name, grp in plot_df.groupby('name'):
    fig, ax = plt.subplots()
    sns.lineplot(data=grp, x='x', y='y', ax=ax)
    ax.set_title(grp_name)
    fig.show()
[Figures: one line plot each for joe, mike, and sue]
References for plotting separate plots:
groupby to create iterable that can be used to plot each name separately.
pyplot.subplots to create separate plot to plot on.
sns.lineplot to plot. No label is passed here; instead ax.set_title puts grp_name, the key shared by every row of the current grp DataFrame, above each plot.
Figure.show to display.
From what I understood this is what you want.
df = pd.DataFrame()
df['name'] = ['joe', 'sue']
df['x'] = ['0,1,5', '0,2,8']
df['y'] = ['0,3,8', '1,9,5']
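# Split the strings and convert to integers so the axes are numeric rather than categorical
df['newx'] = df['x'].str.split(',').apply(lambda vals: [int(v) for v in vals])
df['newy'] = df['y'].str.split(',').apply(lambda vals: [int(v) for v in vals])
for i in range(len(df)):
    # One line per row, labelled with that row's name
    sns.lineplot(x=df.loc[i, 'newx'], y=df.loc[i, 'newy'], label=df.loc[i, 'name'])
plt.legend()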
I have a dataframe which looks like
df = pd.DataFrame(data={'ID':[1,1,1,2,2,2], 'Value':[13, 12, 15, 4, 2, 3]})
Index ID Value
0 1 13
1 1 12
2 1 15
3 2 4
4 2 2
5 2 3
and I want to plot it by the IDs (categories) so that each category would have different bar plot,
so in this case I would have two figures,
one figure with bar plot of ID=1,
and second separate figure bar plot of ID=2.
Can I do it (preferably without loops) with something like df.plot(y='Value', kind='bar')?
Two options are possible: one using matplotlib and the other using seaborn, which you should absolutely know, as it works well with pandas.
Pandas with matplotlib
You have to create a figure with the number of subplot columns and rows you choose. plt.subplots returns the axes as a 1-D array if exactly one of nrows or ncols is 1, and as a 2-D array if both are greater than 1. Then, you give one of these axes to the pandas plot method.
If the number of categories is not known or high, you need to use a loop.
import pandas as pd
import matplotlib.pyplot as plt
fig, axes = plt.subplots( nrows=1, ncols=2, sharey=True )
df.loc[ df["ID"] == 1, 'Value' ].plot.bar( ax=axes[0] )
df.loc[ df["ID"] == 2, 'Value' ].plot.bar( ax=axes[1] )
plt.show()
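For the case where the number of categories is not known in advance, the loop mentioned above could look like the following sketch. The names groups and id_value are illustrative; the number of subplot columns is taken from the number of groups.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(data={'ID': [1, 1, 1, 2, 2, 2], 'Value': [13, 12, 15, 4, 2, 3]})

groups = df.groupby('ID')
# One subplot column per ID; squeeze=False keeps axes 2-D even with a single ID
fig, axes = plt.subplots(nrows=1, ncols=groups.ngroups, sharey=True, squeeze=False)
for ax, (id_value, group) in zip(axes.ravel(), groups):
    group['Value'].plot.bar(ax=ax, title=f'ID = {id_value}')
```

If separate figures are wanted instead of subplots, calling plt.subplots() inside the loop creates one figure per ID.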
Pandas with seaborn
Seaborn is the most amazing graphical tool that I know. The catplot function plots a series of graphs according to the values of a column when you set the col argument. You can select the type of plot with kind.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
df['index'] = [1,2,3] * 2
sns.catplot(kind='bar', data=df, x='index', y='Value', col='ID')
plt.show()
I added an index column in order to compare with df.plot.bar. If you don't want it, remove x='index' and each plot will display a single bar with an error bar.
Objective: To generate 100 barplots using a for loop, and display the output as a subplot image
Data format: Datafile with 101 columns. The last column is the X variable; the remaining 100 columns are the Y variables, against which x is plotted.
Desired output: Barplots in 5 x 20 subplot array, as in this example image:
Current approach: I've been using PairGrid in seaborn, which generates an n x 1 array,
where input == dataframe; input3 == list from which column headers are called:
for i in input3:
    plt.figure(i)
    g = sns.PairGrid(input,
                     x_vars=["key_variable"],
                     y_vars=i,
                     aspect=.75, size=3.5)
    g.map(sns.barplot, palette="pastel")
Does anyone have any ideas how to solve this?
To give an example of how to plot 100 dataframe columns over a grid of 5 x 20 subplots:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = np.random.rand(3,101)
data[:,0] = np.arange(2,7,2)
df = pd.DataFrame(data)
fig, axes = plt.subplots(nrows=5, ncols=20, figsize=(21,9), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
    ax.bar(df.iloc[:,0], df.iloc[:,i+1])
    ax.set_xticks(df.iloc[:,0])
plt.show()
You can try using matplotlib's subplots to create the plot grid and pass each axis to barplot. You could do the axis indexing with a nested loop...
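A minimal sketch of the nested-loop indexing, assuming the layout described in the question (the last column holds x, the other 100 columns hold y). The data is random and purely illustrative, and ax.bar is used in place of seaborn's barplot.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n_rows, n_cols = 5, 20
rng = np.random.default_rng(0)
data = rng.random((3, n_rows * n_cols + 1))  # made-up values for illustration
data[:, -1] = [2, 4, 6]                      # last column is the x variable
df = pd.DataFrame(data)
x = df.iloc[:, -1]

fig, axes = plt.subplots(n_rows, n_cols, figsize=(21, 9), sharex=True, sharey=True)
for r in range(n_rows):
    for c in range(n_cols):
        k = r * n_cols + c  # position of the y column drawn in grid cell (r, c)
        axes[r, c].bar(x, df.iloc[:, k])
```

The flat position k = r * n_cols + c walks through the y columns row by row, which is the same order that axes.flatten() produces.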
I would like to produce a scatter plot of pandas DataFrame with categorical row and column labels using matplotlib. A sample DataFrame looks like this:
import pandas as pd
df = pd.DataFrame({"a": [1,3], "b": [2,4]}, index=["c","d"])
# a b
#c 1 2
#d 3 4
The marker size is a function of the respective DataFrame values. So far, I came up with an awkward solution that essentially enumerates the rows and columns, plots the data, and then reconstructs the labels:
flat = df.reset_index(drop=True).T.reset_index(drop=True).T.stack().reset_index()
# level_0 level_1 0
#0 0 0 1
#1 0 1 2
#2 1 0 3
#3 1 1 4
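# level_1 holds the column positions and level_0 the row positions, matching the tick labels set below
flat.plot(kind='scatter', x='level_1', y='level_0', s=100*flat[0])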
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
Which kind of works.
Now, question: Is there a more intuitive, more integrated way to produce this scatter plot, ideally without splitting the data and the metadata?
Maybe not the entire answer you're looking for, but an idea to help save time and readability with the flat= line of code.
Pandas unstack method will produce a Series with a MultiIndex.
dfu = df.unstack()
print(dfu.index)
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
The MultiIndex contains the necessary x and y points to construct the plot (in labels, which pandas 0.24 and later call codes). Here, I assign levels and codes to more informative variable names better suited for plotting.
xlabels, ylabels = dfu.index.levels
xs, ys = dfu.index.codes  # called .labels before pandas 0.24
Plotting is pretty straight-forward from here.
plt.scatter(xs, ys, s=dfu*100)
plt.xticks(range(len(xlabels)), xlabels)
plt.yticks(range(len(ylabels)), ylabels)
plt.show()
I tried this on a few different DataFrame shapes and it seemed to hold up.
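Another possibility, sketched here under the assumption that matplotlib's built-in handling of string (categorical) coordinates is acceptable: reshape to long format and pass the labels straight to scatter, so no tick reconstruction is needed. The column names row, col and value are illustrative, and df uses the values shown in the question's table.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": [2, 4]}, index=["c", "d"])

# Long format: one record per (row label, column label, value)
long = (df.rename_axis("row")
          .reset_index()
          .melt(id_vars="row", var_name="col", value_name="value"))

fig, ax = plt.subplots()
# String coordinates are treated as categories, so the labels become the ticks
points = ax.scatter(long["col"].tolist(), long["row"].tolist(),
                    s=100 * long["value"].to_numpy())
```

Categories appear on each axis in order of first occurrence, which here matches the order of df.columns and df.index.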
It's not exactly what you were asking for, but it helps to visualize values in a similar way:
import seaborn as sns
sns.heatmap(df[::-1], annot=True)
Result:
Maybe you can use numpy array and pd.melt to create the scatter plot as shown below:
arr = np.array([[i,j] for i in range(df.shape[1]) for j in range(df.shape[0])])
plt.scatter(arr[:,0],arr[:,1],s=100*pd.melt(df)['value'],marker='o')
plt.xlabel('level_0')
plt.ylabel('level_1')
plt.xticks(range(df.shape[1]), df.columns)
plt.yticks(range(df.shape[0]), df.index)
plt.show()
I am having an issue plotting two dataframes. One has 20711 entries, the other has 20710 entries. I am using plot(x,y) to plot like this:
import pandas as pd
import csv
import matplotlib.pyplot as plt
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(111)
ax.plot(X, Y)
Both are dataframes that were pulled from a CSV and have this structure:
print(X)
0 -2.343060
1 -2.445431
2 -2.335754
3 -2.478535
4 -2.527026
print(Y)
0 0.026940
1 -0.075431
2 0.024246
3 -0.118535
4 -0.167026
5 -0.145475
I keep getting error:
ValueError: x and y must have same first dimension
How do I fix it so that it ignores the last entry?
Well if you can just ditch the last value of Y then the following should work, assuming you have the index in your dataframe too, that is, your csv looks like this:
0,-2.343060
1,-2.445431
2,-2.335754
3,-2.478535
4,-2.527026
and you loaded it like X=pandas.read_csv('x.csv'), then
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
ax.plot(X.to_numpy().T[1], Y.to_numpy().T[1][:-1])
should work.
As you mentioned in your comment the overlap varies:
ax.plot(X.to_numpy().T[1], Y.to_numpy().T[1][:len(X)])
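If it is not known in advance which of the two is longer, both can be trimmed to their common length. A minimal sketch, assuming X and Y are one-dimensional Series; the values are copied from the printed excerpt in the question and stand in for the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: the first rows printed in the question, as plain Series
X = pd.Series([-2.343060, -2.445431, -2.335754, -2.478535, -2.527026])
Y = pd.Series([0.026940, -0.075431, 0.024246, -0.118535, -0.167026, -0.145475])

# Trim both to the shorter length so the first dimensions match
n = min(len(X), len(Y))
x_vals = X.to_numpy()[:n]
y_vals = Y.to_numpy()[:n]

fig, ax = plt.subplots(figsize=(10, 10))
(line,) = ax.plot(x_vals, y_vals)
```

This drops trailing entries from whichever input is longer, which is only appropriate if the extra entries really are surplus rather than a sign of misaligned rows.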