How to draw a seaborn pairplot figure in several rows?

I have a dataset with 76 features and one dependent variable (y). I use seaborn to draw a pairplot of every feature against y in a Jupyter notebook. Because the number of features is high, the plot for each feature comes out very small.
I am looking for a way to draw the pairplot in several rows. I also don't want to copy and paste the pairplot code into several notebook cells; I want the figure to be generated automatically.
The code I am using (I cannot share the dataset, so I use a sample dataset):
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
fig = plt.figure(figsize=(plot_size*num_plots_y, plot_size*num_plots_x), facecolor='white')
axes = [fig.add_subplot(num_plots_y,1,i+1) for i in range(num_plots_y)]
for i, ax in enumerate(axes):
    start_index = i * num_plots_x
    end_index = (i+1) * num_plots_x
    if end_index > len(features_names): end_index = len(features_names)
    sns.pairplot(x_vars=features_names[start_index:end_index], y_vars=y_name, data = data)
plt.savefig('figure.png')
The above code has two problems. First, it shows an empty box at the top of the figure, followed by the pairplots.
Second, it saves only the last row to the PNG file, not the whole figure.
If you have any idea how to solve this, please let me know. Thank you.

When I run it directly (python script.py), every row opens in a separate window, so each row is treated as a separate figure and only the last one is saved to the file.
The other problem is that sns.pairplot doesn't use your fig and axes: it creates its own figure, so it can't draw into existing subplots to put everything in one image. When I remove fig and axes, the first window with the empty box no longer appears.
I found that FacetGrid has col_wrap to wrap plots into many rows. Someone has also suggested adding col_wrap to pairplot (Add parameter col_wrap to pairplot #2121), and that issue includes an example that uses FacetGrid with scatterplot instead of pairplot, which makes col_wrap available.
Here is code which uses FacetGrid with col_wrap:
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
'''
for i in range(num_plots_y):
    start = i * num_plots_x
    end = start + num_plots_x
    sns.pairplot(x_vars=features_names[start:end], y_vars=y_name, data=data)
'''
g = sns.FacetGrid(pd.DataFrame(features_names), col=0, col_wrap=4, sharex=False)
for ax, x_var in zip(g.axes, features_names):
    sns.scatterplot(data=data, x=x_var, y=y_name, ax=ax)
g.tight_layout()
plt.savefig('figure.png')
plt.show()
The result is saved as 'figure.png', with all plots in one figure.
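As a rough alternative sketch (not part of the answer above), the same layout can probably be built with a plain plt.subplots grid, computing the number of rows from the number of features and hiding the unused panels. The synthetic data, the column names and the variables n_cols and n_rows are assumptions made for illustration only.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real dataset: 13 features and one target.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(1, 14)]
data = pd.DataFrame(rng.normal(size=(100, 13)), columns=feature_names)
data["y"] = rng.normal(size=100)

n_cols = 5                                        # plots per row
n_rows = math.ceil(len(feature_names) / n_cols)   # rows needed for all features

# One figure holds every panel; squeeze=False keeps axes 2-D even for one row.
fig, axes = plt.subplots(n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows),
                         sharey=True, squeeze=False)

# Axes-level functions such as scatterplot accept ax=..., unlike pairplot.
for ax, name in zip(axes.flat, feature_names):
    sns.scatterplot(data=data, x=name, y="y", ax=ax)

# Hide the unused panels at the end of the last row.
for ax in axes.flat[len(feature_names):]:
    ax.set_visible(False)

fig.tight_layout()
```

Because everything lives in one Figure object, a single fig.savefig('figure.png') call afterwards should write all rows to the same file.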

Related

How to draw histogram + QQ plots together for each column?

I have a dataset with many numerical columns. I want to draw a histogram for each column and add a QQ plot to check more thoroughly whether the data follow a normal distribution. So, for each column, I would like a histogram with its QQ plot directly underneath.
I tried the following code, but the two plots overlap each other:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
num_cols = df.select_dtypes(include=np.number)
cols = num_cols.columns.tolist()
df_sample = df.sample(n=5000)
fig, axes = plt.subplots(4, 5, figsize=(15,12), layout = 'constrained')
for col, axs in zip(cols, axes.flat):
    sns.histplot(data = df_sample[col], kde = True, stat = 'density', ax = axs, alpha = .4)
    sm.qqplot(df_sample[col], line='45', ax = axs)
plt.show()
How can I draw the histogram and the QQ plot one under the other for each column?
Another issue is that my QQ plots look strange; I'm wondering whether I need to standardize all my columns before making the QQ plots.

Avoiding overlapping plots in seaborn bar plot

I have the following code, in which I am trying to draw a bar plot in seaborn. (This is sample data; both the x and y variables are continuous.)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
xvar = [1,2,2,3,4,5,6,8]
yvar = [3,6,-4,4,2,0.5,-1,0.5]
year = [2010,2011,2012,2010,2011,2012,2010,2011]
df = pd.DataFrame()
df['xvar'] = xvar
df['yvar']=yvar
df['year']=year
df
sns.set_style('whitegrid')
fig,ax=plt.subplots()
fig.set_size_inches(10,5)
sns.barplot(data=df,x='xvar',y='yvar',hue='year',lw=0,dodge=False)
It results in a plot where the two bars at x = 2 are drawn on top of each other.
Two questions here:
I want to plot the two bars at x = 2 side by side, not overlapping the way they are now.
For the x-labels, I have a lot of them in the original data. Is there a way to set the xticks at a specific frequency? For instance, in the chart above I only want to see 1, 3 and 6 as x-labels.
Note: If I set dodge=True, the bars become very thin with the original data.
For the first question, get the patches of the bar chart and halve the width of the target patches; one of them also has its x position shifted so that the two bars sit side by side.
For the second question, pass a list of tick labels, built either by slicing or by hand in the required order, with empty strings for the ticks you want to hide.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
xvar = [1,2,2,3,4,5,6,8]
yvar = [3,6,-4,4,2,0.5,-1,0.5]
year = [2010,2011,2012,2010,2011,2012,2010,2011]
df = pd.DataFrame({'xvar':xvar,'yvar':yvar,'year':year})
fig,ax = plt.subplots(figsize=(10,5))
sns.set_style('whitegrid')
g = sns.barplot(data=df, x='xvar', y='yvar', hue='year', lw=0, dodge=False)
for idx,patch in enumerate(ax.patches):
    current_width = patch.get_width()
    current_pos = patch.get_x()
    if idx == 8 or idx == 15:
        patch.set_width(current_width/2)
        if idx == 15:
            patch.set_x(current_pos+(current_width/2))
ax.set_xticklabels([1,'',3,'','',6,''])
plt.show()
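The hand-written label list above can also be generated. The following is a minimal sketch, assuming the bar categories are the sorted unique values of xvar (which is how seaborn orders numeric categories) and that the set wanted holds the values to keep; both names are introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "xvar": [1, 2, 2, 3, 4, 5, 6, 8],
    "yvar": [3, 6, -4, 4, 2, 0.5, -1, 0.5],
    "year": [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011],
})

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=df, x="xvar", y="yvar", hue="year", dodge=False, ax=ax)

# Categorical bars sit at positions 0..n-1, one per unique x value.
categories = sorted(df["xvar"].unique())
wanted = {1, 3, 6}  # x values whose labels should stay visible

# Keep the label for wanted values, blank out the rest.
labels = [str(v) if v in wanted else "" for v in categories]
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(labels)
```

To show every n-th label instead, the condition could be replaced by a test on the position, for example keeping a label only when its index is a multiple of n.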

Plotting multiple colored lines and vectors in 3D with matplotlib

I'm struggling to create a 3-D plot with multiple colored lines and vectors in matplotlib.
I already found a related question. Its code
from mpl_toolkits.mplot3d.axes3d import Axes3D
import matplotlib.pyplot as plt
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})
datasets = [{"x":[1,2,3], "y":[1,4,9], "z":[0,0,0], "colour": "red"} for _ in range(6)]
for dataset in datasets:
    ax.plot(dataset["x"], dataset["y"], dataset["z"], color=dataset["colour"])
plt.show()
produces a plot of the lines. That's a good starting point but unfortunately not quite what I'm looking for: I don't want a grid in the background, I want clearly distinguishable coordinate axes, and the xticks and yticks should not be visible.
Any help is highly appreciated.
I made multiple lines in Plotly.
import plotly.express as px
import pandas as pd
# Line 1
d = {"x":[1,2], "y":[1,4], "z":[0,0], "line":[0, 0]} # "line" is the index that separates multiple lines
df = pd.DataFrame(data=d)
# Line 2
d2 = {"x":[4,2], "y":[5,4], "z":[3,2], "line":[1, 1]}
df2 = pd.DataFrame(data=d2)
# One data frame; DataFrame.append was removed in pandas 2.0, so use pd.concat
df = pd.concat([df, df2], ignore_index=True)
fig = px.line_3d(df, x="x", y="y", z="z", color="line")
fig.show()
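For readers who want to stay in matplotlib, here is a hedged sketch of one way to meet the requirements in the question: switch off the default axes (panes, grid and ticks) with set_axis_off and draw the coordinate axes as arrows with quiver. The line data, colours, the arrow length and the axis limits are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Hypothetical lines, each with its own colour.
datasets = [
    {"x": [1, 2, 3], "y": [1, 4, 9], "z": [0, 0, 0], "colour": "red"},
    {"x": [1, 2, 3], "y": [1, 2, 3], "z": [0, 2, 4], "colour": "blue"},
    {"x": [0, 1, 2], "y": [3, 2, 1], "z": [4, 3, 1], "colour": "green"},
]
for d in datasets:
    ax.plot(d["x"], d["y"], d["z"], color=d["colour"])

# Remove the background panes, grid lines, ticks and tick labels in one call.
ax.set_axis_off()

# Draw the coordinate axes as black arrows starting at the origin.
length = 10
for direction, label in [((length, 0, 0), "x"), ((0, length, 0), "y"), ((0, 0, length), "z")]:
    ax.quiver(0, 0, 0, *direction, color="black", arrow_length_ratio=0.05)
    ax.text(*direction, label)

# Fix the limits explicitly so the arrows are fully visible.
ax.set_xlim(0, length)
ax.set_ylim(0, length)
ax.set_zlim(0, length)
```

Additional vectors can be drawn the same way with further ax.quiver calls, passing the start point followed by the direction components.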

Matplotlib: How to add legend to each scatter?

Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
data = datasets.load_iris(return_X_y=False)
X = data.data
y = data.target
names = data.feature_names
target_names = data.target_names
columns=names+['target']
df = pd.DataFrame(np.hstack([X, y.reshape(-1,1)]), columns=columns)
df.loc[df.target==0, 'target_names'] = 'setosa'
df.loc[df.target==1, 'target_names'] = 'versicolor'
df.loc[df.target==2, 'target_names'] = 'virginica'
indexes = df.index.tolist()
fig,axes = plt.subplots(2,2,figsize=(12,8))
axes[0,0].scatter(indexes,df['sepal length (cm)'],c=y)
axes[0,1].scatter(indexes,df['sepal width (cm)'],c=y)
axes[1,0].scatter(indexes,df['petal length (cm)'],c=y)
axes[1,1].scatter(indexes,df['petal width (cm)'],c=y)
plt.show()
How can I add a legend to each scatter plot, where each entry is a value of y?
As far as I understand, there is no direct way to create a scatter plot with a tag on each data point.
This answer suggests iterating over your data points and labels once you have created the scatter plots:
for i, txt in enumerate(y):
    axes[0,0].annotate(txt, (indexes[i], df['sepal length (cm)'][i]))
...
You can look at formatting options here.
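If the goal is a legend with one entry per class rather than a tag on every point, a possible sketch is to use the legend_elements method of the object returned by scatter, which builds one handle per unique value of c. The synthetic values below stand in for the iris measurements and are an assumption, as are the panel titles.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the iris data: three classes with 50 samples each.
rng = np.random.default_rng(0)
target_names = ["setosa", "versicolor", "virginica"]
y = np.repeat([0, 1, 2], 50)
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
features = {name: rng.normal(loc=y, scale=0.5) for name in feature_names}
indexes = np.arange(len(y))

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, (name, values) in zip(axes.flat, features.items()):
    sc = ax.scatter(indexes, values, c=y)
    # One legend handle per unique value in c; replace the numeric labels
    # with the class names.
    handles, _ = sc.legend_elements()
    ax.legend(handles, target_names, title="target")
    ax.set_title(name)
```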

Proper Matplotlib axes construction / reuse

I am currently building a set of scatter plot charts using pandas plot.scatter. They are all constructed on top of two base axes.
My current construction looks akin to this pseudocode:
ax1 = pandas.scatter.plot()
ax2 = pandas.scatter.plot(ax=ax1)
for dataframe in list:
    output_ax = pandas.scatter.plot(ax2)
    output_ax.get_figure().savefig("outputfile.png")
total_output_ax = total_list.scatter.plot(ax2)
total_output_ax.get_figure().savefig("total_output.png")
This seems inefficient. For 1...N permutations I want to reuse a base axes that has 50% of the data already plotted. What I am trying to do is:
Add base data to scatter plot
For item x in y: (save data to base scatter and save image)
Add all data to scatter plot and save image
Here's one way to do it with plt.scatter.
I plot column 0 on x-axis, and all other columns on y axis, one at a time.
Notice that there is only 1 ax object, and I don't replot all points, I just add points using the same axes with a for loop.
Each time I get a corresponding png image.
import numpy as np
import pandas as pd
np.random.seed(2)
testdf = pd.DataFrame(np.random.rand(20,4))
testdf.head(5) looks like this:
          0         1         2         3
0  0.435995  0.025926  0.549662  0.435322
1  0.420368  0.330335  0.204649  0.619271
2  0.299655  0.266827  0.621134  0.529142
3  0.134580  0.513578  0.184440  0.785335
4  0.853975  0.494237  0.846561  0.079645
# The first scatter is drawn outside the loop; it could go inside the loop as well
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(testdf[0], testdf[1], color='red', label='column 1')
# ax.legend() replaces the previous legend; fig.legend() would add a new one on every call
ax.legend()
fig.savefig('fig_1.png')
colors = ['pink', 'green', 'black', 'blue']
for i in range(2,4):
    ax.scatter(testdf[0], testdf[i], color=colors[i], label=f'column {i}')
    ax.legend()
    fig.savefig('full_' + str(i) + '.png')
Then you get these 3 images (fig_1.png, full_2.png, full_3.png)
Axes objects cannot be simply copied or transferred. However, it is possible to set artists to visible/invisible in a plot. Given your ambiguous question, it is not fully clear how your data are stored but it seems to be a list of dataframes. In any case, the concept can easily be adapted to different input data.
import matplotlib.pyplot as plt
#test data generation
import pandas as pd
import numpy as np
rng = np.random.default_rng(123456)
df_list = [pd.DataFrame(rng.integers(0, 100, (7, 2))) for _ in range(3)]
#plot all dataframes into an axis object to ensure
#that all plots have the same scaling
fig, ax = plt.subplots()
patch_collections = []
for i, df in enumerate(df_list):
    pc = ax.scatter(x=df[0], y=df[1], label=str(i))
    pc.set_visible(False)
    patch_collections.append(pc)
#store individual plots
for i, pc in enumerate(patch_collections):
    pc.set_visible(True)
    ax.set_title(f"Dataframe {i}")
    fig.savefig(f"outputfile{i}.png")
    pc.set_visible(False)
#store summary plot
for pc in patch_collections:
    pc.set_visible(True)
ax.set_title("All dataframes")
ax.legend()
fig.savefig(f"outputfile_0_{i}.png")
plt.show()
