Trying to make scatter plots in subplots using for-loops - python

I am trying to make subplots using a for loop that goes through the x variables in my dataframe. Every plot should be a scatter plot.
X-variable: 'Protein', 'Fat', 'Sodium', 'Fiber', 'Carbo', 'Sugars'
y-variable: 'Cal'
This is where I am stuck:
plt.subplot(2, 3, 2)
for i in range(3):
    plt.scatter(i,sub['Cal'])

With this code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data.csv')
columns = list(df.columns)
columns.remove('Cal')
fig, ax = plt.subplots(1, len(columns), figsize = (20, 5))
for idx, col in enumerate(columns, 0):
    ax[idx].plot(df['Cal'], df[col], 'o')
    ax[idx].set_xlabel('Cal')
    ax[idx].set_title(col)
plt.show()
I get this subplot of scatter plots:
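The question asks for 'Cal' on the y axis and a 2 x 3 layout, so here is a minimal sketch of that variant. The DataFrame is made-up random data standing in for data.csv (an assumption, since the real file is not shown), and the figure size is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for data.csv (assumption: one numeric column per nutrient).
rng = np.random.default_rng(0)
x_cols = ['Protein', 'Fat', 'Sodium', 'Fiber', 'Carbo', 'Sugars']
df = pd.DataFrame(rng.uniform(0, 10, size=(30, len(x_cols))), columns=x_cols)
df['Cal'] = rng.uniform(50, 150, size=30)

# 2 x 3 grid of axes; axes.flat walks the grid row by row.
fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)
for ax, col in zip(axes.flat, x_cols):
    ax.scatter(df[col], df['Cal'])  # x variable on x, 'Cal' on y
    ax.set_xlabel(col)
    ax.set_ylabel('Cal')
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

Pairing each axes with a column name through zip avoids index arithmetic on the 2D axes array.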
However, it may be better to use a single scatter plot and use the marker color to distinguish the data types. See this code:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style('darkgrid')
df = pd.read_csv('data.csv')
# df.drop(columns = ['Sodium'], inplace = True) # <--- removes 'Sodium' column
table = df.melt('Cal', var_name = 'Type')
fig, ax = plt.subplots(1, 1, figsize = (10, 10))
sns.scatterplot(data = table,
                x = 'Cal',
                y = 'value',
                hue = 'Type',
                s = 200,
                alpha = 0.5)
plt.show()
which gives this plot with all the data together:
The 'Sodium' values are far larger than the others, so if you remove that column with this line:
df.drop(columns = ['Sodium'], inplace = True)
you get a more readable plot:
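To see what the melt call in this answer does, here is a small sketch on a made-up two-row table (the values are invented for illustration). It reshapes the wide table into one row per (Cal, nutrient) pair, which is the long format that the hue argument needs.

```python
import pandas as pd

# Tiny made-up table (assumption: same column layout as data.csv).
df = pd.DataFrame({
    'Cal': [70, 120],
    'Protein': [4, 3],
    'Fat': [1, 5],
})

# 'Cal' is kept as an identifier column; the other column names
# move into 'Type' and their values into 'value'.
table = df.melt('Cal', var_name='Type')
print(table)
```

The result has the columns 'Cal', 'Type' and 'value', with one block of rows per original nutrient column.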

Related

How to use pandas df.plot.scatter to make a figure with subplots

Hello, how can I make a figure with scatter subplots using pandas? It works with plot, but not with scatter.
Here is an example:
import numpy as np
import pandas as pd
matrix = np.random.rand(200,5)
df = pd.DataFrame(matrix,columns=['index','A','B','C','D'])
# single plot: works
df.plot(
    kind='scatter',
    x='index',
    y='A',
    s=0.5
)
# with subplots: does not work
df.plot(
    subplots=True,
    kind='scatter',
    x='index',
    y=['A','B','C'],
    s=0.5
)
Error
raise ValueError(self._kind + " requires an x and y column")
ValueError: scatter requires an x and y column
Edit:
Solution for making a figure with subplots using df.plot
(thanks to @Fourier):
import numpy as np
import pandas as pd
matrix = np.random.rand(200,5)#random data
df = pd.DataFrame(matrix,columns=['index','A','B','C','D']) #make df
#get a list for subplots
labels = list(df.columns)
labels.remove('index')
df.plot(
    layout=(-1, 5),
    kind="line",
    x='index',
    y=labels,
    subplots=True,
    sharex=True,
    ls="none",
    marker="o")
Would this work for you:
import pandas as pd
import numpy as np
df = pd.DataFrame({"index":np.arange(5),"A":np.random.rand(5),"B":np.random.rand(5),"C":np.random.rand(5)})
df.plot(kind="line", x="index", y=["A","B","C"], subplots=True, sharex=True, ls="none", marker="o")
Output
Note: This uses a line plot with invisible lines. For a real scatter plot, I would loop over the columns:
for column in df.columns.drop("index"): # skip the column used for x
    df.plot(kind="scatter", x="index", y=column)
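Calling df.plot in a loop like this opens a new figure for every column. If the scatters should share one figure, a possible sketch is to create the axes first and pass each one through the ax argument. The layout (one row per column) and the marker size are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"index": np.arange(5),
                   "A": np.random.rand(5),
                   "B": np.random.rand(5),
                   "C": np.random.rand(5)})

y_cols = ["A", "B", "C"]
fig, axes = plt.subplots(len(y_cols), 1, sharex=True, figsize=(6, 8))
for ax, column in zip(axes, y_cols):
    # ax=ax sends each scatter to its own subplot instead of a new figure
    df.plot(kind="scatter", x="index", y=column, s=20, ax=ax)
# plt.show()  # uncomment to display the figure
```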
EDIT
In order to add custom ylabels you can do the following:
axes = df.plot(kind='line', x="index", y=["A","B","C"], subplots=True, sharex=True, ls="none", marker="o", legend=False)
ylabels = ["foo","bar","baz"]
for ax, label in zip(axes, ylabels):
    ax.set_ylabel(label)

In pandas, add a scatter plot to a line plot

I am trying to add a scatter plot to a line plot by using the pandas plot function (in a Jupyter notebook).
I have tried the following code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]})
ax = a.plot.line()
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]})
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
I also checked various SO answers but couldn't find the solution.
If anyone can help me, that would be very much appreciated.
EDIT:
The accepted answer works, but I realise that in my case the reason it was not working might be that I was using datetimes.
For example, with this code I can't see the red dots:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
%matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]}, index = pd.date_range(dt(2019,1,1), periods = 4))
plot = a.plot.line(ax = ax)
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]}, index = [x.timestamp() for x in pd.date_range(dt(2019,1,1), periods = 2)])
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
Any idea what's wrong here?
This should do it (just add fig, ax = plt.subplots() in the beginning):
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]})
a.plot.line(ax=ax)
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]})
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
Edit:
This will work for datetimes:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
# %matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]}, index = pd.date_range(dt(2019,1,1), periods = 4))
# plain plot() handles datetimes; plot_date is deprecated in recent Matplotlib releases
ax.plot(a.index, a['a'], '-')
# add the scatterplot, keeping the index as datetimes (no .timestamp() conversion)
b = pd.DataFrame({'b': [5, 2]}, index = pd.date_range(dt(2019,1,1), periods = 2))
ax.scatter(b.index, b['b'], c='r')
plt.show()
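A likely reason the red dots were invisible in the question is a unit mismatch: x.timestamp() returns seconds since 1970, while a date axis works with far smaller numbers (Matplotlib's own date numbers count days). The sketch below only compares the two numbers for the same date. It is an illustration of the scale difference, not a statement about exactly which units pandas used internally for that line plot.

```python
import pandas as pd
import matplotlib.dates as mdates

ts = pd.Timestamp(2019, 1, 1)

seconds = ts.timestamp()                      # POSIX time in seconds
days = mdates.date2num(ts.to_pydatetime())    # Matplotlib date number, in days

# The two values differ by several orders of magnitude, so points
# placed at `seconds` land far outside the visible date range.
print(seconds, days)
```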

Pandas groupby scatter plot in a single plot

This is a follow-up question on this solution. With kind=line each group is automatically assigned a different color, but that is not the case for scatter plots.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['label','x','y'])
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
df.groupby('label').plot(kind='scatter', x = "x", y = "y", ax=ax)
There is a connected issue here. Is there any simple workaround for this?
Update:
When I try the solution recommended by @ImportanceOfBeingErnest on a label column that contains strings, it does not work:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x='x', y='y', c='label', data=df)
It throws the following error:
ValueError: Invalid RGBA argument: 'yes'
During handling of the above exception, another exception occurred:
You can use seaborn:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['x','y'])
df['label'] = np.random.choice(['yes','no','yes','yes','no'], 100)
fig, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(x='x', y='y', hue='label', data=df)
plt.show()
Output:
Another option, as suggested in the comments, is to map each label to a number by converting the column to a categorical type:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(df.x, df.y, c = pd.Categorical(df.label).codes, cmap='tab20b')
plt.show()
Output:
You can loop over the groupby and create one scatter per group. That is efficient for fewer than ~10 categories.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
for n, grp in df.groupby('label'):
    ax.scatter(x = "x", y = "y", data=grp, label=n)
ax.legend(title="Label")
plt.show()
Alternatively, you can create a single scatter like this:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
u, df["label_num"] = np.unique(df["label"], return_inverse=True)
sc = ax.scatter(x = "x", y = "y", c = "label_num", data=df)
ax.legend(sc.legend_elements()[0], u, title="Label")
plt.show()
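Both numeric approaches above (pd.Categorical(...).codes and np.unique with return_inverse=True) do the same thing: they replace each string label with an integer that scatter can map to a color. A short sketch on the five labels from the question shows the mapping without any plotting.

```python
import numpy as np
import pandas as pd

labels = pd.Series(['yes', 'no', 'yes', 'yes', 'no'])

# Integer code per row; categories are sorted, so 'no' -> 0 and 'yes' -> 1
codes = pd.Categorical(labels).codes

# Same mapping from NumPy: sorted unique values plus the index of each row's value
uniques, inverse = np.unique(labels, return_inverse=True)

print(codes.tolist())    # [1, 0, 1, 1, 0]
print(inverse.tolist())  # [1, 0, 1, 1, 0]
print(uniques.tolist())  # ['no', 'yes']
```

The uniques array is what the answer passes to ax.legend so that the legend shows the original strings rather than the integers.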
If the data is already grouped, the following solution could be useful.
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
fig, ax = plt.subplots(figsize=(7,3))
def plot_grouped_df(grouped_df,
                    ax, x='x', y='y', cmap = plt.cm.autumn_r):
    colors = cmap(np.linspace(0.5, 1, len(grouped_df)))
    for i, (name,group) in enumerate(grouped_df):
        group.plot(ax=ax,
                   kind='scatter',
                   x=x, y=y,
                   color=colors[i],
                   label = name)
# now we can use this function to plot the groupby data with categorical values
plot_grouped_df(df.groupby('label'),ax)

Creating charts with Pandas

My code is inside a Jupyter Notebook.
I can create a chart using Method 1 below, and have it look exactly as I'd like it to look.
But when I try with Method 2, which uses subplot, I don't know how to make it look the same (setting the figsize, colors, legend off to the right).
How do I use subplot, and have it look the same as Method 1?
Thank you in advance for your help!
# Using Numpy and Pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
df = pd.DataFrame(np.random.randint(0,100,size=(4, 4)), columns=list('ABCD'))
style.use('fivethirtyeight')
# Colorblind-friendly colors
colors = [[0,0,0], [230/255,159/255,0], [86/255,180/255,233/255], [0,158/255,115/255]]
# Method 1
chart = df.plot(figsize = (10,5), color = colors)
chart.yaxis.label.set_visible(True)
chart.set_ylabel("Bitcoin Price")
chart.set_xlabel("Time")
chart.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.show()
# Method 2
fig, ax = plt.subplots()
ax.plot(df)
ax.set_ylabel("Bitcoin Price")
ax.set_xlabel("Time")
plt.show()
You just replace chart with ax, like this:
ax.yaxis.label.set_visible(True)
ax.set_ylabel("Bitcoin Price")
ax.set_xlabel("Time")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
I can think of two ways to get a result that might be useful for you. pd.DataFrame.plot returns an Axes object, and you can call the same methods on an Axes you create yourself, so both examples simply use ax in place of chart.
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
df = pd.DataFrame(np.random.randint(0,100,size=(4, 4)), columns=list('ABCD'))
style.use('fivethirtyeight')
# Colorblind-friendly colors
colors = [[0,0,0], [230/255,159/255,0], [86/255,180/255,233/255], [0,158/255,115/255]]
Iterating over df
colors_gen = (x for x in colors) # we will also be iterating over the colors
fig, ax = plt.subplots(figsize = (10,5))
for i in df: # iterate over columns...
    ax.plot(df[i], color=next(colors_gen), label=i) # and plot one at a time; the label feeds the legend
ax.set_ylabel("Bitcoin Price")
ax.set_xlabel("Time")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
ax.yaxis.label.set_visible(True)
plt.show()
Use pd.DataFrame.plot but pass ax as an argument
fig, ax = plt.subplots(figsize = (10,5))
df.plot(color=colors, ax=ax)
ax.set_ylabel("Bitcoin Price")
ax.set_xlabel("Time")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
ax.yaxis.label.set_visible(True)
plt.show()
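A third possibility, closer to the ax.plot(df) call in Method 2, is to set the color cycle on the axes and build the legend from the returned lines. This is only a sketch: it uses Matplotlib's set_prop_cycle, and the colors are written as tuples rather than lists (a choice made here, not something from the question).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)), columns=list('ABCD'))
# Colorblind-friendly colors, as RGB tuples
colors = [(0, 0, 0), (230/255, 159/255, 0), (86/255, 180/255, 233/255), (0, 158/255, 115/255)]

fig, ax = plt.subplots(figsize=(10, 5))
ax.set_prop_cycle(color=colors)            # colors are used in order, one per line
lines = ax.plot(df.index, df.to_numpy())   # returns one Line2D per column
ax.legend(lines, list(df.columns), bbox_to_anchor=(1.05, 1), loc=2)
ax.set_ylabel("Bitcoin Price")
ax.set_xlabel("Time")
# plt.show()  # uncomment to display the figure
```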

Plot boxplot and line from pandas

I am trying to reproduce this graph - a line plot with a boxplot at every point:
However, the line plot is always starting at the origin instead of at the first x tick:
I have collected my data in a pandas DataFrame, where each column header is a k_e value (the x axis) and the column holds all of the data points for that value.
I am plotting the mean of each column and the boxplot like so:
df = df.astype(float)
_, ax = plt.subplots()
df.mean().plot(ax = ax)
df.boxplot(showfliers=False, ax=ax)
plt.xlabel(r'$k_{e}$')
plt.ylabel('Test error rate')
plt.title(r'Accuracies with different $k_{e}$')
plt.show()
I have referred to the question linked below and am therefore passing 'ax' to both calls, but this does not help.
plot line over boxplot using pandas DateFrame
EDIT: Here is a minimal example:
test_errors_dict = dict()
np.random.seed(40)
test_errors_dict[2] = np.random.rand(20)
test_errors_dict[3] = np.random.rand(20)
test_errors_dict[5] = np.random.rand(20)
df = pd.DataFrame(data=test_errors_dict)
df = df.astype(float)
_, ax = plt.subplots()
df.mean().plot(ax=ax)
df.boxplot(showfliers=False, ax=ax)
plt.show()
Result:
As shown above, the line plot does not align with the boxplots.
The boxes are at positions 1,2,3, while the plot is at positions 2,3,5. You may reindex the mean Series to also use the positions 1,2,3.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
test_errors_dict = dict()
np.random.seed(40)
test_errors_dict[2] = np.random.rand(20)
test_errors_dict[3] = np.random.rand(20)
test_errors_dict[5] = np.random.rand(20)
df = pd.DataFrame(data=test_errors_dict)
df = df.astype(float)
mean = df.mean()
mean.index = np.arange(1,len(mean)+1)
_, ax = plt.subplots()
mean.plot(ax=ax)
df.boxplot(showfliers=False, ax=ax)
plt.show()
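The opposite direction should also work: instead of moving the mean line to positions 1, 2, 3, place the boxes at the k_e values themselves. The sketch below does this with Matplotlib's Axes.boxplot and its positions argument rather than df.boxplot (a substitution made here to keep the example simple). The axis labels are taken from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(40)
df = pd.DataFrame({k: np.random.rand(20) for k in (2, 3, 5)})

positions = list(df.columns)   # the k_e values: [2, 3, 5]
fig, ax = plt.subplots()
# One box per column, drawn at its k_e value instead of at 1, 2, 3
box = ax.boxplot(df.to_numpy(), positions=positions, showfliers=False)
# The mean line uses the same x values, so it passes through the boxes
ax.plot(positions, df.mean().to_numpy(), marker='o')
ax.set_xlabel(r'$k_{e}$')
ax.set_ylabel('Test error rate')
# plt.show()  # uncomment to display the figure
```

With this variant the x axis is numeric, so the gap between 3 and 5 is drawn twice as wide as the gap between 2 and 3.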
