Matplotlib: How to add legend to each scatter? - python

Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
data = datasets.load_iris(return_X_y=False)
X = data.data
y = data.target
names = data.feature_names
target_names = data.target_names
columns=names+['target']
df = pd.DataFrame(np.hstack([X, y.reshape(-1,1)]), columns=columns)
df.loc[df.target==0, 'target_names'] = 'setosa'
df.loc[df.target==1, 'target_names'] = 'versicolor'
df.loc[df.target==2, 'target_names'] = 'virginica'
indexes = df.index.tolist()
fig,axes = plt.subplots(2,2,figsize=(12,8))
axes[0,0].scatter(indexes,df['sepal length (cm)'],c=y)
axes[0,1].scatter(indexes,df['sepal width (cm)'],c=y)
axes[1,0].scatter(indexes,df['petal length (cm)'],c=y)
axes[1,1].scatter(indexes,df['petal width (cm)'],c=y)
plt.show()
How can I add a legend to each scatter plot, where each legend entry is a value of y?

As far as I understand, there is no direct way to make a scatter plot with a tag on each data point.
This answer suggests iterating over your data points and labels once you have created the scatter plots:
for i, txt in enumerate(y):
    axes[0,0].annotate(txt, (indexes[i], df['sepal length (cm)'][i]))
...
You can look at formatting options here.
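If the goal is a legend with one entry per class rather than a text tag on every point, matplotlib (3.1 or later) can build legend handles from the scatter itself through legend_elements(). The sketch below is one possible way to do that. It uses synthetic stand-in data instead of the iris DataFrame so that it runs on its own, and the names values and sc are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the iris data (assumption): 3 classes, 50 points each.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
values = rng.normal(loc=y, scale=0.3)  # stand-in for one feature column
indexes = np.arange(len(y))
target_names = ["setosa", "versicolor", "virginica"]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax in axes.flat:
    sc = ax.scatter(indexes, values, c=y)
    # One handle per distinct value passed to c; the default labels are discarded
    # and replaced by the class names.
    handles, _ = sc.legend_elements()
    ax.legend(handles, target_names, title="target")
```

With the real data, each axes[...].scatter(...) call from the question would be followed by the same two legend lines.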

Related

Avoiding overlapping plots in seaborn bar plot

I have the following code, where I am trying to draw a bar plot in seaborn. (This is sample data; both the x and y variables are continuous.)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
xvar = [1,2,2,3,4,5,6,8]
yvar = [3,6,-4,4,2,0.5,-1,0.5]
year = [2010,2011,2012,2010,2011,2012,2010,2011]
df = pd.DataFrame()
df['xvar'] = xvar
df['yvar']=yvar
df['year']=year
df
sns.set_style('whitegrid')
fig,ax=plt.subplots()
fig.set_size_inches(10,5)
sns.barplot(data=df,x='xvar',y='yvar',hue='year',lw=0,dodge=False)
It results in the following plot:
Two questions here:
I want to plot the two bars at x = 2 side by side rather than overlapped as they are now.
For the x-labels, I have a lot of them in the original data. Is there a way to set the xticks to a specific frequency? For instance, in the chart above I only want to see 1, 3 and 6 as x-labels.
Note: If I set dodge=True, the bars become very thin with the original data.
For the first question, get the patches of the bar chart and halve the width of the target patches, then shift the x position of one of them so that the two bars sit side by side.
For the second question, set the tick labels from a list, built either by slicing or by hand, with empty strings where no label should appear.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
xvar = [1,2,2,3,4,5,6,8]
yvar = [3,6,-4,4,2,0.5,-1,0.5]
year = [2010,2011,2012,2010,2011,2012,2010,2011]
df = pd.DataFrame({'xvar':xvar,'yvar':yvar,'year':year})
fig,ax = plt.subplots(figsize=(10,5))
sns.set_style('whitegrid')
g = sns.barplot(data=df, x='xvar', y='yvar', hue='year', lw=0, dodge=False)
for idx,patch in enumerate(ax.patches):
    current_width = patch.get_width()
    current_pos = patch.get_x()
    if idx == 8 or idx == 15:
        patch.set_width(current_width/2)
        if idx == 15:
            patch.set_x(current_pos+(current_width/2))
ax.set_xticklabels([1,'',3,'','',6,''])
plt.show()
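Rather than typing the label list by hand, it can be derived from the set of values you want to keep. The following is a minimal sketch using a plain matplotlib bar chart; the names x_values, heights and keep are assumptions made for illustration, and the heights are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5, 6, 8]        # distinct x categories, in plot order
heights = [3, 6, 4, 2, 0.5, -1, 0.5]    # arbitrary bar heights for illustration
keep = {1, 3, 6}                        # categories whose labels stay visible

fig, ax = plt.subplots(figsize=(10, 5))
positions = list(range(len(x_values)))
ax.bar(positions, heights)

# Fix the tick positions first, then blank out every label not in `keep`.
ax.set_xticks(positions)
tick_labels = [str(v) if v in keep else "" for v in x_values]
ax.set_xticklabels(tick_labels)
```

For a regular frequency, a slice-based rule such as keeping every n-th label can replace the keep set.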

How to draw a figure by seaborn pairplot in several rows?

I have a dataset with 76 features and 1 dependent variable (y). I use seaborn to draw a pairplot between the features and y in a Jupyter notebook. Since the number of features is high, the plot for each feature is very small, as can be seen below:
I am looking for a way to draw the pairplot in several rows. Also, I don't want to copy and paste the pairplot code into several notebook cells; I am looking for a way to generate this figure automatically.
The code I am using (I cannot share dataset, so I use a sample dataset):
from sklearn.datasets import load_boston
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
fig = plt.figure(figsize=(plot_size*num_plots_y, plot_size*num_plots_x), facecolor='white')
axes = [fig.add_subplot(num_plots_y,1,i+1) for i in range(num_plots_y)]
for i, ax in enumerate(axes):
    start_index = i * num_plots_x
    end_index = (i+1) * num_plots_x
    if end_index > len(features_names): end_index = len(features_names)
    sns.pairplot(x_vars=features_names[start_index:end_index], y_vars=y_name, data = data)
plt.savefig('figure.png')
The above code has two problems. First, it shows an empty box at the top of the figure, followed by the pairplots. The following is part of the figure that I get.
The second problem is that it saves only the last row to the PNG file, not the whole figure.
If you have any idea how to solve this, please let me know. Thank you.
When I run it directly (python script.py), it opens every row in a separate window. Each pairplot call creates its own figure, so only the last one is saved to the file.
The other problem is that pairplot does not use your fig and axes; it always creates its own figure, so it cannot draw into existing subplots. When I remove fig and axes, the first window with the empty boxes no longer appears.
I found that FacetGrid has a col_wrap parameter for wrapping columns onto several rows. Someone has suggested adding col_wrap to pairplot (Add parameter col_wrap to pairplot #2121), and that discussion includes an example that uses FacetGrid with scatterplot instead of pairplot, which can then use col_wrap.
Here is code that uses FacetGrid with col_wrap:
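# Note: load_boston was removed in scikit-learn 1.2; on newer versions,
# substitute another regression dataset (for example load_diabetes).
from sklearn.datasets import load_boston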
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
'''
for i in range(num_plots_y):
    start = i * num_plots_x
    end = start + num_plots_x
    sns.pairplot(x_vars=features_names[start:end], y_vars=y_name, data=data)
'''
g = sns.FacetGrid(pd.DataFrame(features_names), col=0, col_wrap=4, sharex=False)
for ax, x_var in zip(g.axes, features_names):
    sns.scatterplot(data=data, x=x_var, y=y_name, ax=ax)
g.tight_layout()
plt.savefig('figure.png')
plt.show()
Result ('figure.png'):
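As an alternative to FacetGrid, the same wrapped layout can be built with plain matplotlib subplots, which keeps everything in one figure so that a single savefig call captures all rows. This is only a sketch under stated assumptions: the data is synthetic (13 random features and a random y), and the names ncols and nrows are illustrative.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): 13 features plus a target column 'y'.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 14)]
data = pd.DataFrame(rng.normal(size=(100, 13)), columns=feature_names)
data["y"] = rng.normal(size=100)

ncols = 5                                        # plots per row
nrows = math.ceil(len(feature_names) / ncols)    # rows needed to fit all features
fig, axes = plt.subplots(nrows, ncols, figsize=(3 * ncols, 3 * nrows),
                         sharey=True, squeeze=False)
flat_axes = axes.ravel()

# One scatter of feature vs. y per axes, filled row by row.
for ax, name in zip(flat_axes, feature_names):
    ax.scatter(data[name], data["y"], s=8)
    ax.set_xlabel(name)

# Hide the axes left over in the last row.
for ax in flat_axes[len(feature_names):]:
    ax.set_visible(False)

fig.tight_layout()
fig.savefig("figure.png")
```

Changing ncols is enough to rewrap the grid; nrows follows from the number of features.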

Trouble creating scatter plot

I'm having trouble using scatter to create a scatter plot. Can someone help me? I've marked the line causing the error:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('vetl8.csv')
df = pd.DataFrame(data=data)
clusterNum = 3
X = df.iloc[:, 1:].values
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
k_means = KMeans(init="k-means++", n_clusters=clusterNum, n_init=12)
k_means.fit(X)
labels = k_means.labels_
df["Labels"] = labels
df.to_csv('dfkmeans.csv')
plt.scatter(df[2], df[1], c=labels)  # Here: this line raises the error
plt.xlabel('K', fontsize=18)
plt.ylabel('g', fontsize=16)
plt.show()
#data set correct
You are close; a minor adjustment to access the x and y columns by position should fix it:
plt.scatter(df[df.columns[2]], df[df.columns[1]], c=df["Labels"])
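The reason df[2] fails is that square brackets on a DataFrame look up a column label, and a CSV with a header row has string labels, not integers. The short sketch below illustrates this with a made-up three-column frame (the column names id, g and K are assumptions, not the real contents of vetl8.csv) and shows that df.iloc selects by position as well.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: string column labels, as read_csv would produce.
df = pd.DataFrame({"id": [1, 2, 3], "g": [0.1, 0.2, 0.3], "K": [10, 20, 30]})

# df[2] looks for a column *labelled* 2, which does not exist here.
try:
    df[2]
    raised = False
except KeyError:
    raised = True

# Two equivalent ways to select the third column by position.
by_name = df[df.columns[2]]
by_position = df.iloc[:, 2]
```

Either form can be passed straight to plt.scatter.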

Adding specific dots to a series plot in Python

I have a time series plot and I would like to add a red dot at a specific time index. Below is a sample code:
dt_index = pd.to_datetime(['2020-01-01','2020-02-01','2020-03-01','2020-04-01','2020-05-01'])
series = pd.Series([1.1,2.2,3.3,4.5,6.7], index = dt_index)
dots_to_add = pd.to_datetime(['2020-01-01','2020-04-01'])
series.plot()
Using dots_to_add as an index, how would I add a red dot to the line?
A dot in a plot is called a marker.
import pandas as pd
import matplotlib.pyplot as plt
dt_index = pd.to_datetime(['2020-01-01','2020-02-01','2020-03-01','2020-04-01','2020-05-01'])
series = pd.Series([1.1,2.2,3.3,4.5,6.7], index = dt_index)
dots_to_add = pd.to_datetime(['2020-01-01','2020-04-01'])
series.plot(marker='o')
plt.show()
The marker colour can be set independently of the line colour with the markerfacecolor and markeredgecolor keyword arguments, which are passed through to matplotlib, but that styles every point on the line.
To draw the dots separately, you can add a scatter plot on top of the line instead:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dt_index = pd.to_datetime(['2020-01-01','2020-02-01','2020-03-01','2020-04-01','2020-05-01'])
series = pd.Series([1.1,2.2,3.3,4.5,6.7], index = dt_index)
dots_to_add = pd.to_datetime(['2020-01-01','2020-04-01'])
series.plot()
plt.scatter(series.index, series, color='r')
plt.show()
If you only want dots at the positions in dots_to_add, you can loop over them and call plt.scatter() once per dot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dt_index = pd.to_datetime(['2020-01-01','2020-02-01','2020-03-01','2020-04-01','2020-05-01'])
series = pd.Series([1.1,2.2,3.3,4.5,6.7], index = dt_index)
dots_to_add = pd.to_datetime(['2020-01-01','2020-04-01'])
series.plot()
for dot in dots_to_add:
    plt.scatter(dot, series[dot], color='r')
plt.show()
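The loop can also be replaced by selecting all the requested timestamps at once with series.loc and drawing them in a single scatter call. The sketch below is one way to do it; it plots the line with matplotlib directly (ax.plot) rather than series.plot(), which is a substitution made so that the line and the dots share the same date axis handling. The names selected and dots are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

dt_index = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01',
                           '2020-04-01', '2020-05-01'])
series = pd.Series([1.1, 2.2, 3.3, 4.5, 6.7], index=dt_index)
dots_to_add = pd.to_datetime(['2020-01-01', '2020-04-01'])

# Pick the values at the requested timestamps in one step.
selected = series.loc[dots_to_add]

fig, ax = plt.subplots()
ax.plot(series.index, series.values)
# zorder=3 keeps the red dots drawn above the line.
dots = ax.scatter(selected.index, selected.values, color='r', zorder=3)
```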

Python scatter-plot: Conditions for marker styles?

I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...
What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
# random_integers is deprecated; randint's upper bound is exclusive, so this draws 0..3
category = np.random.randint(0, 4, 100)
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
    i = (category == k)
    plt.scatter(x[i], y[i], marker=m)
plt.show()
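A small variation on the loop above is to keep the label-to-marker mapping in a dictionary and pass label= to each scatter call, so that a legend can be generated as well. This is a sketch with random data; the name marker_for and the legend text are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)
category = rng.integers(0, 4, size=100)  # labels 0..3 (upper bound exclusive)

# Explicit mapping from label to marker style.
marker_for = {0: "x", 1: "o", 2: "s", 3: "+"}

fig, ax = plt.subplots()
for label, marker in marker_for.items():
    mask = category == label
    # One scatter call per label, each with its own marker and legend entry.
    ax.scatter(x[mask], y[mask], marker=marker, label=f"label {label}")
legend = ax.legend()
```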
Matplotlib does not accept a different marker for each point within a single scatter call.
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output
Additionally, you can use the pandas.cut function to plot bins (it's something I regularly need when producing graphs where a third continuous value acts as a parameter). It is used like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following:
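To check what pandas.cut assigns before plotting, the bins can be inspected directly, and the labels argument can give them readable names for the legend. The sketch below reuses the z values from the example; the label names low, mid and high and the column name level are assumptions made for illustration.

```python
import pandas as pd

z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({"z": z})

# Three equal-width intervals spanning min(z)..max(z).
df["bins"] = pd.cut(df["z"], bins=3)

# The same binning, with readable names instead of interval objects.
df["level"] = pd.cut(df["z"], bins=3, labels=["low", "mid", "high"])

print(df)
```

The level column can then be passed to hue and style in sns.scatterplot in place of bins.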
