I have a dataset with 76 features and one dependent variable (y). I use seaborn to draw a pairplot of each feature against y in a Jupyter notebook. Because there are so many features, each individual plot is very small, as can be seen below:
I am looking for a way to spread the pairplot over several rows. I also don't want to copy and paste the pairplot code into several notebook cells; I want to generate the figure automatically.
The code I am using (I cannot share the dataset, so I use a sample dataset):
from sklearn.datasets import load_boston
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
fig = plt.figure(figsize=(plot_size*num_plots_y, plot_size*num_plots_x), facecolor='white')
axes = [fig.add_subplot(num_plots_y,1,i+1) for i in range(num_plots_y)]
for i, ax in enumerate(axes):
    start_index = i * num_plots_x
    end_index = (i+1) * num_plots_x
    if end_index > len(features_names): end_index = len(features_names)
    sns.pairplot(x_vars=features_names[start_index:end_index], y_vars=y_name, data = data)
plt.savefig('figure.png')
The above code has two problems. First, it shows an empty box at the top of the figure, followed by the pairplots. The following is part of the figure that I get.
The second problem is that it saves only the last row to the PNG file, not the whole figure.
If you have any idea how to solve this, please let me know. Thank you.
When I run it directly (python script.py), it opens every row in a separate window. So it treats each row as a separate object, and only the last object is saved to the file.
The other problem is that sns.pairplot doesn't use your fig and axes; it creates its own figure, so it can't use subplots to put everything in one image. When I remove fig and axes, the first window with the empty box no longer appears.
I found that FacetGrid has col_wrap to spread plots over many rows. Someone has also suggested adding col_wrap to pairplot (Add parameter col_wrap to pairplot #2121), and that issue includes an example of using FacetGrid with scatterplot instead of pairplot, which can then use col_wrap.
Here is code that uses FacetGrid with col_wrap:
from sklearn.datasets import load_boston
import math
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
y_name = 'y'
features_names = [f'feature_{i}' for i in range(1, X.shape[1]+1)]
column_names = features_names + [y_name]
data.columns = column_names
plot_size=7
num_plots_x=5 # No. of plots in every row
num_plots_y = math.ceil(len(features_names)/num_plots_x) # No. of plots in y direction
'''
for i in range(num_plots_y):
    start = i * num_plots_x
    end = start + num_plots_x
    sns.pairplot(x_vars=features_names[start:end], y_vars=y_name, data=data)
'''
g = sns.FacetGrid(pd.DataFrame(features_names), col=0, col_wrap=4, sharex=False)
for ax, x_var in zip(g.axes, features_names):
    sns.scatterplot(data=data, x=x_var, y=y_name, ax=ax)
g.tight_layout()
plt.savefig('figure.png')
plt.show()
Result ('figure.png'):
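Another option, offered only as a sketch: melt the features into long form and let sns.relplot do the wrapping through its own col_wrap argument. This assumes a seaborn version that has relplot (0.9 or newer). The data below is synthetic (13 random features with made-up names), because the original dataset is not available.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real data (assumption): 13 features plus a target 'y'.
rng = np.random.default_rng(0)
features_names = [f'feature_{i}' for i in range(1, 14)]
data = pd.DataFrame(rng.normal(size=(50, 13)), columns=features_names)
data['y'] = rng.normal(size=50)

# Long form: one row per (observation, feature) pair.
long_data = data.melt(id_vars='y', var_name='feature', value_name='x')

# One scatter panel per feature, wrapped after 5 columns.
# sharex=False lets every panel keep its own x range.
g = sns.relplot(data=long_data, x='x', y='y', col='feature',
                col_wrap=5, height=3, facet_kws={'sharex': False})
```

Because all panels live in one figure, calling g.savefig('figure.png') afterwards should write the whole grid to a single file.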
I'm trying to do a line plot with one line per column. My dataset looks like this:
I'm using this code, but it's giving me the following error:
ValueError: Wrong number of items passed 3, placement implies 27
plot_x = 'bill__effective_due_date'
plot_y = ['RR_bucket1_perc', 'RR_bucket7_perc', 'RR_bucket14_perc']
ax = sns.pointplot(x=plot_x, y=plot_y, data=df_rollrates_plot, marker="o", palette=sns.color_palette("coolwarm"))
display(ax.figure)
Maybe it's a silly question, but I'm new to Python, so I'm not sure how to do this. This is my expected output:
Thanks!!
You can plot the dataframe as follows (edit: I updated the code below to make bill__effective_due_date the index of the dataframe):
import matplotlib.pyplot as plt  # needed for plt.grid() below
import seaborn as sns
import pandas as pd
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df_rollrates_plot = pd.DataFrame({'RR_bucket1_perc':rr1,
                                  'RR_bucket7_perc':rr7,
                                  'RR_bucket14_perc':rr14})
df_rollrates_plot.index = x
df_rollrates_plot.index.name = 'bill__effective_due_date'
sns.lineplot(data=df_rollrates_plot)
plt.grid()
Your data is in the wrong shape to take advantage of the hue parameter in seaborn's lineplot. You need to stack it so that the columns become categorical values.
import pandas as pd
import seaborn as sns
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df = pd.DataFrame({'bill_effective_due_date':x,
                   'RR_bucket1_perc':rr1,
                   'RR_bucket7_perc':rr7,
                   'RR_bucket14_perc':rr14})
# This is where you are reshaping your data to make it work like you want
df = df.set_index('bill_effective_due_date').stack().reset_index()
df.columns=['bill_effective_due_date','roll_rates_perc','roll_rates']
sns.lineplot(data=df, x='bill_effective_due_date',y='roll_rates', hue='roll_rates_perc', marker='o')
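The same reshaping can also be written with melt, which names the new columns in one step. This is only a sketch on a shortened copy of the sample data; the column names roll_rates_perc and roll_rates are carried over from the snippet above.

```python
import pandas as pd

# Shortened sample data (three dates instead of six).
df = pd.DataFrame({'bill_effective_due_date': ['Nov-1', 'Nov-2', 'Nov-3'],
                   'RR_bucket1_perc': [20, 10, 2],
                   'RR_bucket7_perc': [17, 8, 2],
                   'RR_bucket14_perc': [12, 5, 2]})

# Wide -> long: the three bucket columns become values of 'roll_rates_perc',
# and their numbers end up in 'roll_rates'.
long_df = df.melt(id_vars='bill_effective_due_date',
                  var_name='roll_rates_perc',
                  value_name='roll_rates')
```

The rows come out in a different order than with stack() (grouped by bucket rather than by date), but they hold the same values, so long_df can be passed to sns.lineplot with the same x, y and hue arguments.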
I am trying to plot lines that connect a starting (x, y) point to an ending (x, y) point.
That means a line will connect (x1start, y1start) to (x1end, y1end).
I have multiple rows in the data frame.
A sample data frame that replicates the actual one is shown below:
df = pd.DataFrame()
df ['Xstart'] = [1,2,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
According to that, if we look at the first row of df, a line will connect (1,0) to (6,6).
For that I am using a for loop to draw a line for each row, as follows:
fig,ax = plt.subplots()
fig.set_size_inches(7,5)
for i in range (len(df)):
    ax.plot((df.iloc[i]['Xstart'],df.iloc[i]['Xend']),(df.iloc[i]['Ystart'],df.iloc[i]['Yend']))
    ax.annotate("",xy = (df.iloc[i]['Xstart'],df.iloc[i]['Xend']),
                xycoords = 'data',
                xytext = (df.iloc[i]['Ystart'],df.iloc[i]['Yend']),
                textcoords = 'data',
                arrowprops = dict(arrowstyle = "->", connectionstyle = 'arc3', color = 'blue'))
plt.show()
I have the following error message when I run this.
I got the figure as shown below:
The arrows and lines are not as expected: the arrow should be at the end point of each line.
Can anyone advise what is going on here?
Thanks,
Zep
You're mixing up the positions of the arrows. Each coordinate pair in xy and xytext consists of an x and y value.
Also in order to see the arrows in the plot you need to set the limits of the plot manually, because annotations are - for good reason - not taken into account when scaling the data limits.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df ['Xstart'] = [1,2,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
fig,ax = plt.subplots()
fig.set_size_inches(7,5)
for i in range (len(df)):
    ax.annotate("",xy = (df.iloc[i]['Xend'],df.iloc[i]['Yend']),
                xycoords = 'data',
                xytext = (df.iloc[i]['Xstart'],df.iloc[i]['Ystart']),
                textcoords = 'data',
                arrowprops = dict(arrowstyle = "->",
                                  connectionstyle = 'arc3', color = 'blue'))
ax.set(xlim=(df[["Xstart","Xend"]].values.min(), df[["Xstart","Xend"]].values.max()),
       ylim=(df[["Ystart","Yend"]].values.min(), df[["Ystart","Yend"]].values.max()))
plt.show()
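If you would rather not set the limits by hand, one possible variation (a sketch, not the only way) is to draw an ordinary line under each arrow. Lines do update the data limits, so the axes autoscale as usual. The loop uses itertuples instead of repeated df.iloc lookups; that is a stylistic choice, not something the fix depends on.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Xstart': [1, 2, 3, 4, 5], 'Xend': [6, 8, 9, 10, 12],
                   'Ystart': [0, 1, 2, 3, 4], 'Yend': [6, 8, 9, 10, 12]})

fig, ax = plt.subplots(figsize=(7, 5))
for row in df.itertuples(index=False):
    # An ordinary line counts towards the data limits, so autoscaling works.
    ax.plot((row.Xstart, row.Xend), (row.Ystart, row.Yend), color='blue')
    # The arrow is drawn on top of the line: xy is the tip, xytext is the tail.
    ax.annotate("", xy=(row.Xend, row.Yend), xytext=(row.Xstart, row.Ystart),
                arrowprops=dict(arrowstyle="->", color='blue'))

# Limits chosen by autoscaling; they should enclose all start and end points.
xlim = ax.get_xlim()
ylim = ax.get_ylim()
```

Finish with plt.show() or fig.savefig(...) as in the code above.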
If you want to plot the line segments, the following code works. You may want arrows or some sort of annotate element (notice correct spelling), but your goal seems to be plotting the line segments, which this accomplishes:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df ['Xstart'] = [1,2,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
fig = plt.figure()
ax = fig.add_subplot(111)
for i in range (len(df)):
    ax.plot(
        (df.iloc[i]['Xstart'],df.iloc[i]['Xend']),
        (df.iloc[i]['Ystart'],df.iloc[i]['Yend'])
    )
plt.show()
Not 100% certain, but I think in line two you need to make the part after xy= a tuple. Otherwise the part before the comma is taken as the keyword argument, and the part after the comma is passed as a normal positional argument.
I have data like this:
[ ('2018-04-09', '10:18:11', ['s1',10], ['s2',15], ['s3',5]),
  ('2018-04-09', '10:20:11', ['s4',8], ['s2',20], ['s1',10]),
  ('2018-04-10', '10:30:11', ['s4',10], ['s5',6], ['s6',3]) ]
I want to plot this data, preferably as a stacked bar graph.
The x-axis will be time.
It should look like this:
I created this image in Paint just to illustrate.
The x-axis will show time the way a normal graph does (10:00, April 3, 2018).
I am stuck because the string values (like 's1' or 's2') change from one bar to the next.
Just to hard-code and verify, I tried this:
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import matplotlib
plotly.offline.init_notebook_mode()
def createPage():
    graph_data = []
    l1=[('com.p1',1),('com.p2',2)('com.p3',3)]
    l2=[('com.p1',1),('com.p4',2)('com.p5',3)]
    l3=[('com.p2',8),('com.p3',2)('com.p6',30)]
    trace_temp = go.Bar(
        x='2018-04-09 10:18:11',
        y=l1[0],
        name = 'top',
    )
    graph_data.append(trace_temp)
    plotly.offline.plot(graph_data, filename='basic-scatter3.html')
createPage()
The error I am getting is "Tuple Object is not callable".
Can someone please suggest some code for plotting such data?
If needed, I can store the data in some other form that is easier to plot.
Edit :
I used the approach suggested in the accepted answer and succeeded in plotting with plotly like this:
fig = df.iplot(kind='bar', barmode='stack', asFigure=True)
plotly.offline.plot(fig, filename="stack1.html")
However, I ran into one problem:
1. When the time intervals are very close, the data overlaps on the graph.
Is there a way to overcome this?
You could use a pandas stacked bar plot. The advantage is that pandas makes it easy to build the table of category/value pairs that you have to generate anyway.
from matplotlib import pyplot as plt
import pandas as pd
all_data = [('2018-04-09', '10:18:11', ['s1',10],['s2',15],['s3',5]),
('2018-04-09', '10:20:11', ['s4',8], ['s2',20],['s1',10]),
('2018-04-10', '10:30:11', ['s4',10],['s5',6], ['s6',3]) ]
#load data into dataframe
df = pd.DataFrame(all_data, columns = list("ABCDE"))
#combine the two descriptors
df["day/time"] = df["A"] + "\n" + df["B"]
#assign each list to a new row with the appropriate day/time label
df = df.melt(id_vars = ["day/time"], value_vars = ["C", "D", "E"])
#split each list into category and value
df[["category", "val"]] = pd.DataFrame(df.value.values.tolist(), index = df.index)
#create a table with category-value pairs from all lists, missing values are set to NaN
df = df.pivot(index = "day/time", columns = "category", values = "val")
#plot a stacked bar chart
df.plot(kind = "bar", stacked = True)
#give tick labels the right orientation
plt.xticks(rotation = 0)
plt.show()
Output:
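As a side note on the "Tuple Object is not callable" error in the question: the hard-coded lists are missing a comma between the second and third tuple, so Python reads ('com.p2',2)('com.p3',3) as an attempt to call a tuple. The short sketch below reproduces the error on purpose and shows the list with the comma restored. The variable names are only illustrative.

```python
# With the comma in place, this is a plain list of (name, value) pairs.
pairs = [('com.p1', 1), ('com.p2', 2), ('com.p3', 3)]

# Without the comma, the second tuple is "called" with the third one's contents.
second = ('com.p2', 2)
try:
    second('com.p3', 3)  # same effect as writing ('com.p2', 2)('com.p3', 3)
except TypeError as exc:
    error_message = str(exc)

# Splitting the pairs gives one sequence of names and one of numbers.
labels, values = zip(*pairs)
```

Here error_message holds the text of the TypeError, and labels and values are the names and numbers as separate tuples.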
I am analysing a two-factorial (M)ANOVA; the sampling design consists of two categorical variables with two and three levels respectively and a response of dimension 4. Having done all the data parsing in python, I would like to continue plotting the data within python, too. (Rather than switch to R for the plotting.) My code, though, is not only very verbose, but the whole thing looks and feels like a really bad hack, too. My question: What is the pandas-matplotlib-way of producing the following plot? Out of interest: I would also be happy to see a solution that is not using seaborn.
The solution in R (the plotting is 2 lines of code):
# Data management
library(reshape2)
# Plotting
library(ggplot2)
# Creating sample data
set.seed(12345)
dat = data.frame(matrix(rnorm(42*4, mean=c(10,3,5,1)), ncol=4, byrow=T))
names(dat) = c('Base', 'State23', 'State42', 'End')
gen = factor(sample(2, size=42, replace=T), labels=c('WT', 'HET'))
env = factor(sample(3, size=42, replace=T), labels=c('heavySmoker', 'casualSmoker', 'nonSmoker'))
dat$genotype = gen
dat$environment = env
# Plotting the data
dam = melt(dat, measure.vars=c('Base', 'State23', 'State42', 'End'))
p = ggplot(dam, aes(genotype, value, fill=environment)) + geom_boxplot() + facet_wrap(~variable, nrow=1)
ggsave('boxplot-r.png', plot=p)
This will produce the following plot:
My current solution in python:
# Numerics
import numpy as np
from numpy.random import randint
# Data management
import pandas as pd
from pandas import DataFrame
from pandas import Series
# Plotting
import matplotlib
matplotlib.use('Qt4Agg')
import matplotlib.pyplot as pt
import seaborn as sns
# Creating sample data
np.random.seed(12345)
index = pd.Index(np.arange(42))
frame = DataFrame(np.random.randn(42,4) + np.array([10,3,5,1]), columns=['Base', 'State23', 'State42', 'End'], index=index)
genotype = Series(['WT', 'HET'], name='genotype', dtype='category')
environment = Series(['heavySmoker', 'casualSmoker', 'nonSmoker'], name='environment', dtype='category')
gen = genotype[np.random.randint(2, size=42)]
env = environment[np.random.randint(3, size=42)]
gen.index = frame.index
env.index = frame.index
frame['genotype'] = gen
frame['environment'] = env
# Plotting the data
response = ['Base', 'State23', 'State42', 'End']
fig, ax = pt.subplots(1, len(response), sharex=True, sharey=True)
for i, r in enumerate(response):
    sns.boxplot(data=frame, x='genotype', y=r, hue='environment', ax=ax[i])
    ax[i].set_ylabel('')
    ax[i].set_title(r)
fig.subplots_adjust(wspace=0)
fig.savefig('boxplot-python.png')
This will produce the following plot:
As you will probably agree, the code is not only verbose, but it also does not really do what I want. For example, I have no idea how to remove the repeated legends, and the labelling on the x-axis is odd.
Edited to use factorplot instead of Facetgrid as suggested by mwaskom in the comments.
If you melt the dataframe, then you can take advantage of Seaborn's factorplot:
df = pd.melt(frame, id_vars=['genotype', 'environment'])
sns.factorplot(data=df, x='genotype', y='value',
               hue='environment', col='variable',
               kind='box', legend=True)
You can rename "value" and "variable" as you wish in the melt function.
Here is the resulting chart:
Previous answer with FacetGrid:
g = sns.FacetGrid(df, col="variable", size=4, aspect=.7)
g.map(sns.boxplot, "genotype", "value", "environment").add_legend(title="environment")
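A note for newer seaborn releases: factorplot was renamed to catplot, and the size argument was renamed to height (both as of seaborn 0.9, as far as I know). A rough equivalent of the factorplot call is sketched below. The sample frame is rebuilt with NumPy's default_rng, so the random values differ from those in the question.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Sample data in the spirit of the question (values will not match exactly).
rng = np.random.default_rng(12345)
responses = ['Base', 'State23', 'State42', 'End']
frame = pd.DataFrame(rng.normal(size=(42, 4)) + np.array([10, 3, 5, 1]),
                     columns=responses)
frame['genotype'] = rng.choice(['WT', 'HET'], size=42)
frame['environment'] = rng.choice(['heavySmoker', 'casualSmoker', 'nonSmoker'], size=42)

# Long form: one row per (sample, response variable).
df = pd.melt(frame, id_vars=['genotype', 'environment'])

# catplot replaces factorplot; height replaces size.
g = sns.catplot(data=df, x='genotype', y='value',
                hue='environment', col='variable',
                kind='box', height=4, aspect=.7)
```

This gives one box-plot panel per response variable with a single shared legend; g.savefig('boxplot-python.png') would save it.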