In pandas, add a scatter plot to a line plot (Python)

I am trying to add a scatter plot to a line plot using the pandas plot function (in a Jupyter notebook).
I have tried the following code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]})
ax = a.plot.line()
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]})
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
I also checked various Stack Overflow answers but couldn't find a solution.
If anyone can help me, that would be much appreciated.
EDIT:
The accepted answer works, but I realise that in my case the problem may be that I was using datetimes.
In this code, for example, I can't see the red dots:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
%matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]}, index = pd.date_range(dt(2019,1,1), periods = 4))
plot = a.plot.line(ax = ax)
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]}, index = [x.timestamp() for x in pd.date_range(dt(2019,1,1), periods = 2)])
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
Any idea what's wrong here?

This should do it (just add fig, ax = plt.subplots() in the beginning):
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]})
a.plot.line(ax=ax)
# try to add the scatterplot
b = pd.DataFrame({'b': [5, 2]})
plot = b.reset_index().plot.scatter(x = 'index', y = 'b', c ='r', ax = ax)
plt.show()
Edit:
This will work for datetimes. Both indexes are built with pd.date_range (no conversion to timestamps), and both plots are drawn with matplotlib directly:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
# %matplotlib inline
fig, ax = plt.subplots()
# plot the line
a = pd.DataFrame({'a': [3,2,6,4]}, index = pd.date_range(dt(2019,1,1), periods = 4))
ax.plot(a.index, a['a'], '-')
# add the scatter plot on the same axes, using the same kind of datetime index
b = pd.DataFrame({'b': [5, 2]}, index = pd.date_range(dt(2019,1,1), periods = 2))
ax.scatter(b.index, b['b'], c='r')
plt.show()
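If you would rather keep the pandas line plot, one possible variant is sketched below. It assumes the red dots go missing because the pandas time-series line plot and the scatter data end up in different x units. `x_compat=True` is a pandas plotting keyword that makes the line use plain matplotlib dates. Whether this matches your exact setup is an assumption, so treat it as a sketch rather than a definitive fix.

```python
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()

a = pd.DataFrame({'a': [3, 2, 6, 4]}, index=pd.date_range('2019-01-01', periods=4))
b = pd.DataFrame({'b': [5, 2]}, index=pd.date_range('2019-01-01', periods=2))

# x_compat=True: pandas draws the line with ordinary matplotlib date units
a.plot.line(ax=ax, x_compat=True)

# the scatter uses the datetime index directly (no .timestamp() conversion)
sc = ax.scatter(b.index, b['b'], c='r')

# x positions of both artists after unit conversion, for comparison
line_x = ax.lines[0].get_xdata(orig=False)
scatter_x = sc.get_offsets()[:, 0]
```

The first two line positions and the two scatter positions should coincide, which is what puts the dots on the visible part of the axis. Call plt.show() to display the figure outside a notebook.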

Related

Matplot legends are printing twice

I am writing some simple code with matplotlib/seaborn to plot the data from a sample CSV file. However, when I call sns.histplot() in a for loop, the legend entry for each column is displayed twice. Any help would be greatly appreciated :)
Here's the code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import matplotlib
sns.set_style('darkgrid')
df = pd.read_csv('dm_office_sales.csv')
df['salary'] = df['salary'] * 3
df['sample salary'] = df['salary'] * 2
x = df['salary']
y = df['sales']
z = df['sample salary']
fig,ax = plt.subplots()
for i in [x,y,z]:
    sns.histplot(data = i, bins=50, ax=ax, palette = 'bright',alpha=0.3, label='{}'.format(i.name))
plt.legend(numpoints=1)
plt.suptitle('Sales/Salary Histogram')
plt.show()
Pass just the columns in question in one step, instead of looping.
sns.histplot(data=df[['salary', 'sales', 'sample salary']], ...)
Here's a demo with the tips dataset:
tips = sns.load_dataset('tips')
fig, ax = plt.subplots()
sns.histplot(tips[['total_bill', 'tip']], bins=50,
             ax=ax, alpha=0.3, palette='bright')
plt.show()
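If you do want to keep a loop, a plain matplotlib variant is sketched below. It swaps sns.histplot for ax.hist and builds the legend once, after the loop. The DataFrame is made-up stand-in data, not the contents of dm_office_sales.csv.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV file (values are illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'salary': rng.normal(60000, 8000, 200),
    'sales': rng.normal(30000, 5000, 200),
})
df['sample salary'] = df['salary'] * 2

fig, ax = plt.subplots()
for col in ['salary', 'sales', 'sample salary']:
    # one histogram per column, labelled with the column name
    ax.hist(df[col], bins=50, alpha=0.3, label=col)

# build the legend a single time, after all histograms are drawn
legend = ax.legend()
fig.suptitle('Sales/Salary Histogram')

legend_labels = [t.get_text() for t in legend.get_texts()]
```

Here legend_labels holds one entry per column. Call plt.show() to display the figure.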

Trying to make scatter plots in subplots using for-loops

I am trying to make subplots, using a for loop to go through the x variables in my dataframe. All plots should be scatter plots.
x variables: 'Protein', 'Fat', 'Sodium', 'Fiber', 'Carbo', 'Sugars'
y variable: 'Cal'
This is where I am stuck:
plt.subplot(2, 3, 2)
for i in range(3):
    plt.scatter(i,sub['Cal'])
With this code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data.csv')
columns = list(df.columns)
columns.remove('Cal')
fig, ax = plt.subplots(1, len(columns), figsize = (20, 5))
for idx, col in enumerate(columns, 0):
    ax[idx].plot(df['Cal'], df[col], 'o')
    ax[idx].set_xlabel('Cal')
    ax[idx].set_title(col)
plt.show()
This produces a row of scatter plots, one per column.
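If you want the layout described in the question (a 2 x 3 grid with 'Cal' on the y axis), a sketch along these lines may help. The data are randomly generated stand-ins for data.csv, and the figure size and sharey=True are just one reasonable choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for data.csv (values are illustrative only)
rng = np.random.default_rng(0)
x_cols = ['Protein', 'Fat', 'Sodium', 'Fiber', 'Carbo', 'Sugars']
df = pd.DataFrame(rng.integers(0, 15, size=(30, len(x_cols))), columns=x_cols)
df['Cal'] = rng.integers(50, 160, size=30)

# 2 x 3 grid; sharey keeps the 'Cal' axis comparable across panels
fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)

# axes.flat walks the grid row by row, so each column gets its own panel
for ax, col in zip(axes.flat, x_cols):
    ax.scatter(df[col], df['Cal'])
    ax.set_xlabel(col)
    ax.set_ylabel('Cal')

fig.tight_layout()
```

Iterating over axes.flat avoids tracking row and column indices by hand. Call plt.show() to display the figure.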
However, it may be better to use a single scatter plot and distinguish the variables by marker color:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style('darkgrid')
df = pd.read_csv('data.csv')
# df.drop(columns = ['Sodium'], inplace = True) # <--- removes 'Sodium' column
table = df.melt('Cal', var_name = 'Type')
fig, ax = plt.subplots(1, 1, figsize = (10, 10))
sns.scatterplot(data = table,
                x = 'Cal',
                y = 'value',
                hue = 'Type',
                s = 200,
                alpha = 0.5)
plt.show()
This gives a single plot with all the data together.
The 'Sodium' values differ greatly from the others, so if you remove this column with this line:
df.drop(columns = ['Sodium'], inplace = True)
you get a more readable plot.
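To see what the melt call above does to the table, here is a small sketch with made-up values. Only the column names 'Cal', 'Protein' and 'Fat' come from the question; the numbers are invented.

```python
import pandas as pd

# Made-up values, only to show the shape of the result
df = pd.DataFrame({
    'Cal': [70, 120, 50],
    'Protein': [4, 3, 4],
    'Fat': [1, 5, 0],
})

# 'Cal' stays as an identifier column; every other column is unpivoted
# into (Type, value) pairs, one row per original cell
table = df.melt('Cal', var_name='Type')
print(table)
```

The result has the columns 'Cal', 'Type' and 'value', which is the long format that the hue argument of sns.scatterplot expects.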

Pandas groupby scatter plot in a single plot

This is a followup question on this solution. There is automatic assignment of different colors when kind=line but for scatter plot that's not the case.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['label','x','y'])
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
df.groupby('label').plot(kind='scatter', x = "x", y = "y", ax=ax)
There is a connected issue here. Is there any simple workaround for this?
Update:
When I try the solution recommended by @ImportanceOfBeingErnest with a label column that contains strings, it does not work:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x='x', y='y', c='label', data=df)
It throws the following error:
ValueError: Invalid RGBA argument: 'yes'
During handling of the above exception, another exception occurred:
You can use seaborn (import seaborn as sns):
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['x','y'])
df['label'] = np.random.choice(['yes','no','yes','yes','no'], 100)
fig, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(x='x', y='y', hue='label', data=df)
plt.show()
Another option, as suggested in the comments, is to map each label to a number via a categorical type:
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(df.x, df.y, c = pd.Categorical(df.label).codes, cmap='tab20b')
plt.show()
You can loop over the groupby and create one scatter per group. This is efficient for fewer than ~10 categories.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
for n, grp in df.groupby('label'):
    ax.scatter(x = "x", y = "y", data=grp, label=n)
ax.legend(title="Label")
plt.show()
Alternatively you can create a single scatter like
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
u, df["label_num"] = np.unique(df["label"], return_inverse=True)
sc = ax.scatter(x = "x", y = "y", c = "label_num", data=df)
ax.legend(sc.legend_elements()[0], u, title="Label")
plt.show()
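Both pd.Categorical(...).codes and np.unique(..., return_inverse=True), used in the answers above, turn string labels into integers that matplotlib can accept for c. The sketch below shows the mapping on the five labels from the question. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

labels = pd.Series(['yes', 'no', 'yes', 'yes', 'no'])

# pandas route: categories are sorted, so 'no' -> 0 and 'yes' -> 1
codes = pd.Categorical(labels).codes

# NumPy route: u holds the sorted unique labels, inverse indexes into u
u, inverse = np.unique(labels.to_numpy(), return_inverse=True)

print(list(codes))             # [1, 0, 1, 1, 0]
print(list(u), list(inverse))  # ['no', 'yes'] [1, 0, 1, 1, 0]
```

Both routes give the same integer codes here. The NumPy route also returns the label names in u, which is why that answer can pass u to ax.legend.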
In case the data is already grouped, the following solution could be useful.
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['label'] = ['yes','no','yes','yes','no']
fig, ax = plt.subplots(figsize=(7,3))
def plot_grouped_df(grouped_df,
                    ax, x='x', y='y', cmap = plt.cm.autumn_r):
    colors = cmap(np.linspace(0.5, 1, len(grouped_df)))
    for i, (name,group) in enumerate(grouped_df):
        group.plot(ax=ax,
                   kind='scatter',
                   x=x, y=y,
                   color=colors[i],
                   label = name)
# now we can use this function to plot the groupby data with categorical values
plot_grouped_df(df.groupby('label'),ax)

How to add multiple trendlines pandas

I have plotted a graph with two y axes and would now like to add a separate trendline for each of the two y series.
This is my code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
amp_costs=pd.read_csv('/Users/Ampicillin_Costs.csv', index_col=None, usecols=[0,1,2])
amp_costs.columns=['PERIOD', 'ITEMS', 'COST PER ITEM']
ax=amp_costs.plot(x='PERIOD', y='COST PER ITEM', color='Blue', style='.', markersize=10)
amp_costs.plot(x='PERIOD', y='ITEMS', secondary_y=True,
               color='Red', style='.', markersize=10, ax=ax)
Any guidance as to how to plot these two trend lines to this graph would be much appreciated!
Here is a quick example of how to use sklearn.linear_model.LinearRegression to make the trend line.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
plt.style.use('ggplot')
%matplotlib inline
period = np.arange(10)
items = -2*period +1 + np.random.randint(-2,2,len(period))
cost = 35000*period +15000 + np.random.randint(-25000,25000,len(period))
data = np.vstack((period,items,cost)).T
df = pd.DataFrame(data, columns=['P','ITEMS', 'COST']).set_index('P')
lmcost = LinearRegression().fit(period.reshape(-1,1), cost.reshape(-1,1))
lmitems = LinearRegression().fit(period.reshape(-1,1), items.reshape(-1,1))
df['ITEMS_LM'] = lmitems.predict(period.reshape(-1,1)).ravel()  # predict returns shape (n, 1); flatten to 1-D
df['COST_LM'] = lmcost.predict(period.reshape(-1,1)).ravel()
fig,ax = plt.subplots()
df.ITEMS.plot(ax = ax, color = 'b')
df.ITEMS_LM.plot(ax = ax,color= 'b', linestyle= 'dashed')
df.COST.plot(ax = ax, secondary_y=True, color ='g')
df.COST_LM.plot(ax = ax, secondary_y=True, color = 'g', linestyle='dashed')
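If you would rather not depend on scikit-learn, a straight-line fit with np.polyfit gives the same kind of trend line. This is a sketch that swaps np.polyfit in for LinearRegression. The data are made up and noise-free, so the fitted coefficients are easy to check.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: two series on very different scales
period = np.arange(10)
items = -2.0 * period + 1
cost = 35000.0 * period + 15000

# Degree-1 least-squares fit; polyfit returns [slope, intercept]
items_slope, items_intercept = np.polyfit(period, items, 1)
cost_slope, cost_intercept = np.polyfit(period, cost, 1)

fig, ax = plt.subplots()
ax2 = ax.twinx()  # second y axis sharing the same x axis

# points plus dashed trend line on the left axis
ax.plot(period, items, 'b.', markersize=10)
ax.plot(period, items_slope * period + items_intercept, 'b--')

# points plus dashed trend line on the right axis
ax2.plot(period, cost, 'r.', markersize=10)
ax2.plot(period, cost_slope * period + cost_intercept, 'r--')
```

Each trend line is drawn on the same axes object as its points, so it follows the correct y scale. Call plt.show() to display the figure.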

Plot on top of seaborn clustermap

I generated a clustermap using seaborn.clustermap.
I'd like to draw a horizontal line on top of the heatmap, as in this figure.
I simply tried to use matplotlib as:
plt.plot([x1, x2], [y1, y2], 'k-', lw = 10)
but the line is not displayed.
The object returned by seaborn.clustermap doesn't have any properties like in this similar question.
How can I plot the line?
Here is the code that generates a "random" clustermap similar to the one I posted:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import random
data = np.random.random((50, 50))
df = pd.DataFrame(data)
row_colors = ["b" if random.random() > 0.2 else "r" for i in range (0,50)]
cmap = sns.diverging_palette(133, 10, n=7, as_cmap=True)
result = sns.clustermap(df, row_colors=row_colors, col_cluster = False, cmap=cmap, linewidths = 0)
plt.plot([5, 30], [5, 5], 'k-', lw = 10)
plt.show()
The axes object that you want is hiding in ClusterGrid.ax_heatmap. This code finds this axis and simply uses ax.plot() to draw the line. You could also use ax.axhline().
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import random
data = np.random.random((50, 50))
df = pd.DataFrame(data)
row_colors = ["b" if random.random() > 0.2 else "r" for i in range (0,50)]
cmap = sns.diverging_palette(133, 10, n=7, as_cmap=True)
result = sns.clustermap(df, row_colors=row_colors, col_cluster = False, cmap=cmap, linewidths = 0)
print(dir(result)) # here is where you see that the ClusterGrid has several axes objects hiding in it
ax = result.ax_heatmap # this is the important part
ax.plot([5, 30], [5, 5], 'k-', lw = 10)
plt.show()
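The ax.axhline() alternative mentioned above might look like the sketch below. The random data and the line position are made up. In axhline, y is in data coordinates, while xmin and xmax are fractions of the axes width, not column numbers.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data, same shape as in the question
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 50)))

grid = sns.clustermap(df, col_cluster=False)
ax = grid.ax_heatmap  # the axes that holds the heatmap itself

# horizontal line at y=5, spanning 10% to 60% of the heatmap width
line = ax.axhline(y=5, xmin=0.1, xmax=0.6, color='k', lw=10)
```

Use ax.plot, as in the answer above, when both end points should be given in data coordinates. Call plt.show() to display the figure.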
