Plot standard deviation from external datasource using seaborn - python

I am trying to visualize a lineplot via seaborn, in which I want to plot the average and standard deviation of a column. As I am using large files (with millions of rows) the plot takes a while to load.
To reduce computation time, I pre-computed the average of each column and the corresponding standard deviation. I then use this pre-computed data as input for the lineplot instead of supplying the complete pandas dataframe.
This is the code I currently use:
df = open_pickle("data/experiment")
sns.lineplot(x="rho", y="wait_time_mean", hue="c", style="service_type", data=df)
This will only show the average. I was wondering if it would be possible to manually supply values for the standard deviation to seaborn.

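sns.lineplot returns the Axes object of the plot, which can then be used to draw on it. Assuming your standard deviation is also in df, you can adapt your code in the following way, which uses the matplotlib function fill_between. Note that if df contains several hue/style groups, fill_between should be called once per group; a single call over all rows draws one band across every group.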
df = open_pickle("data/experiment")
ax = sns.lineplot(x="rho", y="wait_time_mean", hue="c", style="service_type", data=df)
ax.fill_between(df["rho"], y1=df["wait_time_mean"] - df["wait_time_std"], y2=df["wait_time_mean"] + df["wait_time_std"], alpha=.5)
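Because the call above passes every row of df to a single fill_between, the band is only meaningful when df holds one line. A minimal sketch of drawing one band per hue/style group follows. The helper name add_std_bands and the toy values are assumptions made for illustration, and the wait_time_std column is assumed to exist as in the answer. In practice ax would be the Axes returned by sns.lineplot; a plain matplotlib Axes is used here so the sketch runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def add_std_bands(ax, df, x, mean, std, group_cols, alpha=0.2, **kwargs):
    """Draw one mean +/- std band per group on an existing Axes (hypothetical helper)."""
    for _, group in df.groupby(group_cols):
        group = group.sort_values(x)  # fill_between expects x in plotting order
        ax.fill_between(
            group[x],
            group[mean] - group[std],
            group[mean] + group[std],
            alpha=alpha,
            **kwargs,
        )
    return ax


# Toy pre-aggregated data with two groups (values are made up).
df = pd.DataFrame({
    "rho": [0.1, 0.5, 0.9] * 2,
    "c": [1, 1, 1, 2, 2, 2],
    "service_type": ["a"] * 6,
    "wait_time_mean": [1.0, 2.0, 4.0, 0.5, 1.0, 2.0],
    "wait_time_std": [0.1, 0.3, 0.8, 0.1, 0.2, 0.4],
})

fig, ax = plt.subplots()  # with seaborn: ax = sns.lineplot(..., data=df)
add_std_bands(ax, df, "rho", "wait_time_mean", "wait_time_std", ["c", "service_type"])
```

The band colors follow matplotlib's default cycle and may not match the seaborn line colors; pass color=... through the keyword arguments if they have to match.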

Related

Why do I get very small values on the y-axis with seaborn kdeplot?

I am using the Python seaborn package to plot the KDE of both the original and the sampled data. The issue is that the values on the y-axis are very small, with multiple leading zeros. Is it possible to normalize the values or make the axis look more elegant?
My implementation:
ax=sns.kdeplot(old_d,shade=True,label='Original kde')
ax=sns.kdeplot(new_d,shade=True, label='Sampled kde')
plt.legend(prop={'size': 12})
ax.set_xlabel('CPU time (in microsecond)',size=16)
ax.set_ylabel('Probability',size=16)
plt.show()
Example output of this code: [figure]

How to show the value of the error bars generated by seaborn?

I am trying to visualize some data using seaborn. I am using a catplot set to draw a bar plot, with error bars showing the standard deviation. I want to know which values it uses for the mean and the standard deviation in the visualization, but I do not know how to retrieve that information from the plot. How would I go about getting it?
bar_graph = seaborn.catplot(x="x", y="y", hue="z", data=data, ci="sd", capsize=0.1, kind="bar")
Trying to get that data from the plot generated by seaborn would not be impossible, but would be very cumbersome, as seaborn does not return the artists that it creates and catplot() can generate a number of subplots, etc.
However, I expect you don't need to get the data from the plot, you can get them directly from the dataframe, can't you? This simple demonstration shows that the plot and the calculated values do match:
titanic = sns.load_dataset("titanic")
sns.catplot(x='sex',y='age',hue="class", data=titanic, ci="sd", capsize=0.1, kind="bar")
titanic.groupby(['sex','class'])['age'].describe()[['mean','std']]
                    mean        std
sex    class
female First   34.611765  13.612052
       Second  28.722973  12.872702
       Third   21.750000  12.729964
male   First   41.281386  15.139570
       Second  30.740707  14.793894
       Third   26.507589  12.159514

How to plot a boxplot using aggregates in plotly?

I need to plot a series of boxplots based on the results of a numerical air quality model. Since this is a significant amount of data, I trigger the calculation of aggregates (min, max, quartiles, etc.) every time new model results become ready and store them in PostgreSQL. For visualization purposes I load the aggregates into pandas and plot them using dash. I am able to plot line plots of timeseries, however I would like to get something like this example, but also interactive.
As I went through the plotly examples, it looks like it always requires the raw data for plotting boxplots ( https://plot.ly/python/box-plots/#basic-box-plot ). I really enjoy the concept of separating presentation and logic. Is it possible to get a plotly box plot based on aggregated data?
You can provide your aggregate values to a Plotly boxplot in Python by passing them in the following format:
plotly.graph_objs.Box(y=[val_min,
                         val_lower_box,
                         val_lower_box,
                         val_median,
                         val_upper_box,
                         val_upper_box,
                         val_max])
e.g.
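import plotly.graph_objs
import plotly.offline

plotly.offline.init_notebook_mode()  # render inside a Jupyter notebook

# pre-computed aggregates
val_min = 1
val_lower_box = 2
val_median = 3
val_upper_box = 4.5
val_max = 6

# each quartile is listed twice so that the quartiles plotly computes
# from these seven points land on the supplied values
box_plot = plotly.graph_objs.Box(y=[val_min,
                                    val_lower_box,
                                    val_lower_box,
                                    val_median,
                                    val_upper_box,
                                    val_upper_box,
                                    val_max])
plotly.offline.iplot([box_plot])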
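gives you a single box spanning val_lower_box to val_upper_box, with the median line at val_median and the whiskers reaching val_min and val_max.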

modifying scipy stats.probplot plotting function with matplotlib

I am not an expert with matplotlib, so I am having a hard time customizing the plot produced by scipy's stats.probplot.
My code takes a pandas df, iterates over its columns, and attempts to plot the values of each column using the stats.probplot function. This is my code:
plt.figure(figsize=(10,5))
for col in model_predictions.columns:
    res = stats.probplot(df[col], plot=plt)
    plt.legend = col
plt.show()
This generates the charts I want, but they are difficult to read (no legends, same colors). Besides plotting them on top of each other, I would like to plot each line in a different color and add a legend entry for each line equal to the str in col. Is there any way to do this?
I can always take the tuple output of the function, run it by another new def, and add the outputs to a new pandas df (to later plot with more control); but I was wondering if there is a quicker way.
Thanks
You can plot them manually by taking the output of stats.probplot, i.e.:
import matplotlib.pyplot as plt
from scipy.stats import probplot

for col in model_predictions.columns:
    # probplot(...)[0] is the (theoretical quantiles, ordered values) pair
    plt.plot(*probplot(df[col])[0], label=col)
plt.legend(loc='best')
plt.show()
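For reference, here is a self-contained sketch of the same idea that also draws the least-squares line probplot fits by default. The synthetic columns model_a and model_b stand in for the real model_predictions data and are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import probplot

# Synthetic stand-in for the real predictions (illustrative values only).
rng = np.random.default_rng(0)
model_predictions = pd.DataFrame({
    "model_a": rng.normal(0, 1, 200),
    "model_b": rng.normal(1, 2, 200),
})

fig, ax = plt.subplots(figsize=(10, 5))
for i, col in enumerate(model_predictions.columns):
    # With the default fit=True, probplot returns the quantile pairs
    # and the parameters of the least-squares line through them.
    (osm, osr), (slope, intercept, r) = probplot(model_predictions[col])
    color = f"C{i}"  # one color of the default cycle per column
    ax.plot(osm, osr, "o", color=color, markersize=3, label=col)
    ax.plot(osm, slope * osm + intercept, "-", color=color)

ax.set_xlabel("Theoretical quantiles")
ax.set_ylabel("Ordered values")
ax.legend(loc="best")
```

Only the marker series carries a label, so each column appears once in the legend while the fitted line shares its color.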

Plotting a column with millions of rows

I have a data-frame with millions of rows (almost 8 million). I need to see the distribution of the values in one of the columns. This column is called 'price_per_mile'. I also have a column called 'Borough'. The final goal is doing a t-test.
First I want to see the distribution of the data in 'price_per_mile', to check whether it is normal and whether I need to do some data cleaning. Then I want to group by the five categories in the 'Borough' column and run the t-test for each possible pair of boroughs.
I have tried to plot the distribution with sns.distplot() but it doesn't give me a clear plot as it seems there's a scaling of the values on the y-axis. Also, the range of values contained in 'price_per_mile' is big.
Then I tried to plot a section of values, again the plot doesn't look clear and informative enough. Scaling happens again.
result.drop(result[(result.price_per_mile <1) | (result.price_per_mile>200)].index, inplace=True)
What do I need to do to have a better-looking plot which gives me the true value of each bin and not just a normalized value?
I read the documentation for sns.distplot() but didn't find something helpful.
As per the documentation for distplot (emphasis mine):
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
This means that if you want the non-normalized histogram, you have to instruct seaborn not to plot the KDE at the same time:
sns.distplot(a, kde=True, norm_hist=False)   # KDE is drawn, so the histogram is still normalized to a density
sns.distplot(a, kde=False, norm_hist=False)  # no KDE, so the bar heights are raw counts
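The difference between counts and density can also be checked without plotting. The sketch below uses numpy only and a synthetic price_per_mile array (an assumption for illustration) to show that raw counts add up to the number of rows in range, while density values are scaled so that the bars integrate to 1, which is why they look small over a wide range. Note also that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.

```python
import numpy as np

# Synthetic stand-in for the 'price_per_mile' column (illustrative only).
rng = np.random.default_rng(0)
price_per_mile = rng.lognormal(mean=1.5, sigma=1.0, size=1_000_000)

# Restrict the bins to the range of interest instead of dropping rows.
counts, edges = np.histogram(price_per_mile, bins=50, range=(1, 200))
density, _ = np.histogram(price_per_mile, bins=50, range=(1, 200), density=True)

# Counts are the true number of rows per bin.
in_range = ((price_per_mile >= 1) & (price_per_mile <= 200)).sum()
total_counted = counts.sum()

# Density is count / (rows in range * bin width), so the bar areas sum to 1.
area = (density * np.diff(edges)).sum()
```

The counts and edges arrays can be passed to a bar plot directly, which keeps plotting fast even with millions of rows because only 50 values are drawn.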
