I’m working on a Jupyter notebook script using Python and Matplotlib which is supposed to fetch historical stock prices for specified stocks via the yfinance package and plot each stock’s volatility vs. potential return.
The expected and actual results can be found here.
As you can see in the second image, the annotations beside each point for the stock symbols are completely missing. I’m very new to Matplotlib, so I’m at a bit of a loss. The code being used is as follows:
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
from google.colab import files
sns.set()
directory = '/datasets/stocks/'
stocks = ['AAPL', 'MSFT', 'AMD', 'TWTR', 'TSLA']
#Download each stock's 6-month historical daily stock price and save to a .csv
df_list = list()
for ticker in stocks:
data = yf.download(ticker, group_by="Ticker", period='6mo')
df = pd.concat([data])
csv = df.to_csv()
with open(directory+ticker+'.csv', 'w') as f:
f.write(csv)
#Get the .csv filename as well as the full path to each file
ori_name = []
for stock in stocks:
ori_name.append(stock + '.csv')
stocks = [directory + s for s in ori_name]
dfs = [pd.read_csv(s)[['Date', 'Close']] for s in stocks]
data = reduce(lambda left,right: pd.merge(left,right,on='Date'), dfs).iloc[:, 1:]
returns = data.pct_change()
mean_daily_returns = returns.mean()
volatilities = returns.std()
combine = pd.DataFrame({'returns': mean_daily_returns * 252,
'volatility': volatilities * 252})
g = sns.jointplot("volatility", "returns", data=combine, kind="reg",height=7)
#Apply Annotations
for i in range(combine.shape[0]):
name = ori_name[i].replace(',csv', '')
x = combine.iloc[i, 1]
y = combine.iloc[i, 0]
print(name)
print(x, y)
print('\n')
plt.annotate(name, xy=(x,y))
plt.show()
Printing out the stock name and the respective x,y position I am trying to place the annotation at shows the following:
AAPL.csv
4.285630458382526 0.24836925418906455
MSFT.csv
3.3916453932738966 0.5159276490876817
AMD.csv
6.040090684498841 -0.002179408770566866
TWTR.csv
7.911518867192316 0.8556785016280568
TSLA.csv
9.154424353004579 -0.40596099327336554
Unless I am mistaken, these are the exact points that are being plotted on the graph. As such, I am confused as to why the text isn’t being correctly annotated. I would assume it has something to do with the xycoords argument for plt.annotate(), but I don’t know enough about the different coordinate systems to know which one to use or whether that’s even the root cause of the issue.
Any help would be greatly appreciated. Thank you!
As #JodyKlymak stated in his comment above, the issue with my code stems from jointplot containing several subplots, preventing annotate() from knowing which axes to base the text placement off of. This was easily fixed by simply replacing plt.annotate() with g.ax_joint.annotate().
I want to make forward forecasting for monthly times series of air pollution data such as what would be 3~6 months ahead of estimation on air pollution index. I tried scikit-learn models for forecasting and fitting data to the model works fine. But what I wanted to do is making a forward period estimate such as what would be 6 months ahead of the air pollution output index is going to be. In my current attempt, I could able to train the model by using scikit-learn. But I don't know how that forward forecasting can be done in python. To make a forward period estimate, what should I do? Can anyone suggest a possible workaround to do this? Any idea?
my attempt
import pandas as pd
from sklearn.preprocessing StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import BayesianRidge
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.index <= split_date]
df_test = df.loc[df.index > split_date]
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = linear_model.BayesianRidge()
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['BayesianRidge'] = accuracy_score(df_test['pollution_index'], yhat)
new update 2
this is my attempt using ARMA model
from statsmodels.tsa.arima_model import ARIMA
index = len(df_training)
yhat = list()
for t in tqdm(range(len(df_test['pollution_index']))):
temp_train = df[:len(df_training)+t]
model = ARMA(temp_train['pollution_index'], order=(1, 1))
model_fit = model.fit(disp=False)
predictions = model_fit.predict(start=len(temp_train), end=len(temp_train), dynamic=False)
yhat = yhat + [predictions]
yhat = pd.concat(yhat)
resultsDict['ARMA'] = evaluate(df_test['pollution_index'], yhat.values)
but this can't help me to make forward forecasting of estimating my time series data. what I want to do is, what would be 3~6 months ahead of estimated values of pollution_index. Can anyone suggest me a possible workaround to do this? How to overcome the limitation of my current attempt? What should I do? Can anyone suggest me a better way of doing this? Any thoughts?
update: goal
for the clarification, I am not expecting which model or approach works best, but what I am trying to figure it out is, how to make reliable forward forecasting for given time series (pollution index), how should I correct my current attempt if it is not efficient and not ready to do forward period estimation. Can anyone suggest any possible way to do this?
update-desired output
here is my sketch desired forecasting plot that I want to make:
In order to obtain your desired output, I think you need to use a model that can return the standard deviation in the predicted value. Therefore, I adopt Gaussian process regression. From the code you provided in your post, I don't see how this is a time series forecasting task, so in my solution below, I also treat this task as a usual regression task.
First, prepare the data
import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# drop the date column
df_training.drop(columns=['date'],axis=1,inplace=True)
df_test.drop(columns=['date'],axis=1,inplace=True)
y_train = df_training['pollution_index']
y_test = df_test['pollution_index']
df_training.drop(['pollution_index'],axis=1)
df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_training)
X_train = scaler.transform(df_training)
X_test = scaler.transform(df_test)
X_train_df = pd.DataFrame(X_train,columns=df_training.columns)
X_test_df = pd.DataFrame(X_test,columns=df_test.columns)
with the dataframes prepared above, you can train a GaussianProcessRegressor and make predictions by
gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train_df,y_train)
pred,std = gpr.predict(X_test_df,return_std=True)
in which std is an array of standard deviations in the predicted values. Then, you can plot the data by
import numpy as np
from matplotlib import pyplot as plt
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
ax.plot(y_train.index[plot_start:],y_train.values[plot_start:],'navy',marker='o',label='observed')
# plot the test data
ax.plot(y_test.index,y_test.values,'navy',marker='o')
ax.plot(y_test.index,pred,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(std)
ax.fill(np.concatenate([y_test.index,y_test.index[::-1]]),
np.concatenate([pred-1.960*sigma,(pred+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
ax.legend(loc='upper left',prop={'size':16})
the output plot looks like
UPDATE
I thought pollution_index is something that can be predicted by 'dew', 'temp', 'press', 'wnd_spd', 'rain'. If you want a one-step ahead forecasting, here is what you can do
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values
# train an ARIMA model
model = ARIMA(train_polltidx,order=(1,1,1))
model_fit = model.fit(disp=0)
# you can predict as many as you want, here I only predict len(test_dat.index) days
forecast,stderr,conf = model_fit.forecast(len(test_date))
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date,test_polltidx,'navy',marker='o')
plt.plot(test_date,forecast,'darkgreen',marker='o',label='pred')
# ax.errorbar(np.arange(len(pred)),pred,std,fmt='r')
plt.fill(np.concatenate([test_date,test_date[::-1]]),
np.concatenate((conf[:,0],conf[:,1][::-1])),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The output figure is
Clearly, the prediction is very bad, because I haven't done any preprocessing to the training data.
UPDATE 2
Since I'm not familiar with ARIMA, I implement one-step forecasting using GaussianProcessRegressor with the help of this wonderful post.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import StandardScaler
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values[:,None]
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values[:,None]
# preprocessing
scalar = StandardScaler()
scalar.fit(train_polltidx)
train_polltidx = scalar.transform(train_polltidx)
test_polltidx = scalar.transform(test_polltidx)
def series_to_supervised(data,n_in,n_out):
df = pd.DataFrame(data)
cols = list()
for i in range(n_in,0,-1): cols.append(df.shift(i))
for i in range(0, n_out): cols.append(df.shift(-i))
agg = pd.concat(cols,axis=1)
agg.dropna(inplace=True)
return agg.values
months_look_back = 1
# train
pollt_series = series_to_supervised(train_polltidx,months_look_back,1)
x_train,y_train = pollt_series[:,:months_look_back],pollt_series[:,-1]
# test
pollt_series = series_to_supervised(test_polltidx,months_look_back,1)
x_test,y_test = pollt_series[:,:months_look_back],pollt_series[:,-1]
print("The first %i months in the test set won't be predicted." % months_look_back)
def walk_forward_validation(x_train,y_train,x_test,y_test):
predictions = []
history_x = x_train.tolist()
history_y = y_train.tolist()
for rep,target in zip(x_test,y_test):
# train model
gpr = GaussianProcessRegressor(alpha=1e-4,normalize_y=False).fit(history_x,history_y)
pred,std = gpr.predict([rep],return_std=True)
predictions.append([pred,std])
history_x.append(rep)
history_y.append(target)
return predictions
predictions = walk_forward_validation(x_train,y_train,x_test,y_test)
pred_test,pred_std = zip(*predictions)
# put back
pred_test = scalar.inverse_transform(pred_test)
pred_std = scalar.inverse_transform(pred_std)
train_polltidx = scalar.inverse_transform(train_polltidx)
test_polltidx = scalar.inverse_transform(test_polltidx)
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date[months_look_back:],test_polltidx[months_look_back:],'navy',marker='o')
plt.plot(test_date[months_look_back:],pred_test,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([test_date[months_look_back:],test_date[months_look_back:][::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The idea of this script is to cast the time series forecasting task into a supervised regression task. The plot_start is a parameter that controls from which year we want to plot, clearly plot_start cannot be greater than the length of the training data. The output figure of the script is
as you can see, the first month in the test dataset is not predicted, because we need to look back one month to make a prediction.
In order to further make predictions about unseen data, based on this post on CV site, you can train a new model using the predicted value from the last step, therefore, here is how you can do it
unseen_dates = pd.date_range(test_date[-1],periods=180,freq='D').values
all_data = series_to_supervised(df['pollution_index'].values,months_look_back,months_to_predict)
def predict_unseen(unseen_dates,all_data,days_look_back):
predictions = []
history_x = all_data[:,:days_look_back].tolist()
history_y = all_data[:,-1].tolist()
inds = np.arange(unseen_dates.shape[0])
for ind in inds:
# train model
gpr = GaussianProcessRegressor(alpha=1e-2,normalize_y=False).fit(history_x,history_y)
rep = np.array(history_y[-days_look_back:]).reshape(days_look_back,1)
pred,std = gpr.predict(rep,return_std=True)
predictions.append([pred,std])
history_x.append(history_y[-days_look_back:])
history_y.append(pred)
return predictions
predictions = predict_unseen(unseen_dates,all_data,days_look_back=1)
pred_test,pred_std = zip(*predictions)
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the test data
plt.plot(unseen_dates,pred_test,'navy',marker='o')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([unseen_dates,unseen_dates[::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
One very important thing to note: The timestep of the real data is a month, using such data to make predictions about days may not be correct.
The model you have built links what you are trying to model, 'pollution_index', to some input variables, in your case ['dew', 'temp', 'press', 'wnd_spd', 'rain']. So to predict pollution_index into the future using your model, at the high level, you need to estimate what these variables would be over the next 3-6 months, and then run your model on that. Practically, you need to come up with something that looks like X_test but has your projections for these variables for the future, and then call:
yhat = reg.predict(X_test)
... to produce the model estimate of where the pollution_index will be. Hope this makes sense. This gives you a "mechanical" ability to use your model for prediction.
For example, following up on your main example where reg is BayesianRidge() that you fit, we would do the following:
import sys
from io import StringIO
import matplotlib.pyplot as plt
# Here we load your predictions for input variables
# I stubbed it with some random data
df_predict_data = StringIO(
"""
date,dew,temp,press,wnd_spd,rain
2021-01-01,59,28,16,0.78,98.7
2021-02-01,68,32,18,0.79,46.1
2021-03-01,75,34,20,0.81,91.5
2021-04-01,63,31,16,0.83,19.1
2021-05-01,74,38,19,0.83,21.8
2021-06-01,65,32,17,0.85,35.4
""")
df_predict = pd.read_csv(df_predict_data, index_col = 'date')
# scale it using the same scaler you used in training
X_predict = scaler.transform(df_predict)
# predict pollution_index
y_predict = reg.predict(X_predict)
# plot it
plt.plot(df_predict.index, y_predict, '.-')
So we get this:
Whether the linear regression you built is a good model for such prediction is a completely different question. As #Sergey Bushmanov mentioned there is vast literature on forecasting and what models are best for this or that, and this thread is probably not the right place to debate that aspect of your question.
Hej,
I have a dataset from different cohorts and I want to bicluster them with the sklearn function Spectral Biclustering.
As you can see in the link above this approach is using a kind of normalization to calculate the SVD.
Is it necessary to normalize the data before biclustering, eg with StandardScaling (zero mean and std of one)? Because the function above still uses a kind of normalization.
Is that enough or do I have to normalise them before, eg when the data is coming from different distributions?
I am getting different results with and without standardscaling and I can not find information in the original paper if it is necessary or not.
You can find the code and an example of my dataset. This is real data so I do not know the truth. I calculated at the end the consensus score to compare the 2 biclusters. Unfortunately the clusters are not the same.
I tried it also with artificial data (see example last link) and here the results are the same, but not with the real data.
So how do I know which approach is the right one?
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster.bicluster import SpectralBiclustering
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler
n_clusters = (4, 4)
data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0)
# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)
# plot original clusters
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Original dataset")
plt.show()
data_type = ['none_scaled', 'scaled']
data_all = [data_org, data_scaled]
models_all = []
for name, data in zip(data_type,data_all):
# spectral biclustering on the shuffled dataset
model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic'
, svd_method='randomized', n_jobs=-1
, random_state=0
)
model.fit(data)
newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
newOrder_row.sort(key=lambda k: (k[0], k[1]), reverse=False)
order_row = [i[1] for i in newOrder_row]
newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
newOrder_col.sort(key=lambda k: (k[0], k[1]), reverse=False)
order_col = [i[1] for i in newOrder_col]
# reorder the data matrix
X_plot = data_scaled.copy()
X_plot = X_plot.reindex(order_row) # rows
X_plot = X_plot[[str(x) for x in order_col]] # columns
# use clustermap without clustering
cm=sns.clustermap(X_plot, method=None, metric=None, cmap='viridis'
,row_cluster=False, row_colors=None
, col_cluster=False, col_colors=None
, yticklabels=1, xticklabels=1
, standard_scale=None, z_score=None, robust=False
, vmin=-3, vmax=5
)
ax = cm.ax_heatmap
# set labelsize smaller
cm_ax = plt.gcf().axes[-2]
cm_ax.tick_params(labelsize=5.5)
# plot lines for the different clusters
hor_lines = [sum(item) for item in model.biclusters_[0]]
hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))
ver_lines = [sum(item) for item in model.biclusters_[1]]
ver_lines = list(np.cumsum(ver_lines[:n_clusters[0]]))
for pp in range(len(hor_lines)-1):
cm.ax_heatmap.hlines(hor_lines[pp],0,X_plot.shape[1], colors='r')
for pp in range(len(ver_lines)-1):
cm.ax_heatmap.vlines(ver_lines[pp],0,X_plot.shape[0], colors='r')
# title
title = name+' - '+str(n_clusters[1])+'-'+str(n_clusters[0])
plt.title(title)
cm.savefig(title,dpi=300)
plt.show()
# save models
models_all.append(model)
# compare models
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between: {:.1f}".format(score))
The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
print(var)
distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
np.random.seed(rSeed)
date = pd.to_datetime("1st of Dec, 1999")
cols = OrderedDict()
for col in colNames:
cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
df = pd.DataFrame(cols, index = dates)
return(df)
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of lenght 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable since there seems to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part since you can use myPlot.get_lines()[0].get_data() AND myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I was just getting the same need of retrieving data from a seaborn distribution plot, what worked for me was to call the method .findobj() on each iteration's graph. Then, one can notice that the matplotlib.lines.Line2D object has a get_data() method, this is similar as what you've mentioned before for myPlot.get_lines()[1].get_data().
Following your example code
data = []
for idx, var in enumerate(list(df)):
myPlot = sns.distplot(df[var])
# Fine Line2D objects
lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
# Retrieving x, y data
x, y = lines2D[idx].get_data()[0], lines2D[idx].get_data()[1]
# Store as dataframe
data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot plot is stored on the first index of lines2D and the data for the second sns.distplot is stored on the second index. I'm not really sure about why this happens this way, but if you were to consider more than two plots, then you will access each sns.distplot data by calling Lines2D on it's respective index.
Finally, to verify one can plot each distplot
plt.plot(data[0].x, data[0].y)