Min and max values in seaborn violinplot are invalid - python

I'm plotting the distribution of daily returns of a stock index for a particular year using seaborn's violinplot. However, some extreme values on the chart appear to be plotted incorrectly.
The chart below is an example for one year. As you can see the lowest value for 'Piątek' is something near -6.
sns.violinplot( x=wig20.iloc[1500:1751,3], y=wig20.iloc[1500:1751,2], width=1, order=['Poniedziałek','Wtorek','Środa','Czwartek','Piątek'])
Data looks like:
wig20.iloc[1500:1751,0:4].head()
Date wig20 [%] weekday
1500 2016-01-04 1804.42 -2.943818 Poniedziałek
1501 2016-01-05 1792.01 -0.687756 Wtorek
1502 2016-01-07 1745.46 -2.597642 Czwartek
1503 2016-01-08 1725.14 -1.164163 Piątek
1504 2016-01-11 1703.78 -1.238160 Poniedziałek
However, when I check the data I see:
wig20.iloc[1500:1751,2].min()
-4.533610974747937
So the chart is completely misleading. On the chart above, the low for 'Piątek' is definitely below -5. I checked different years, and it seems that every max/min value with a magnitude above 4 ends up near 6 on the chart. I have no clue why.

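The outline of a violin is a kernel density estimate, and by default seaborn extends it past the most extreme data points (the cut parameter, measured in units of the KDE bandwidth), so the tails reach values that never occur in the data. You can pass cut=0 to sns.violinplot to cut each violin at the minimum and maximum values of its group.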
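A minimal sketch of the effect, using synthetic returns rather than the wig20 data (the DataFrame, its column names, and the helper violin_y_range are made up for illustration). It draws the violins once with the default-style cut=2 and once with cut=0, then reads the vertical extent of the drawn outlines back from the axes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for one year of daily returns (hypothetical data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weekday": np.repeat(["Mon", "Tue", "Wed", "Thu", "Fri"], 50),
    "ret": rng.normal(0, 1.5, 250),
})

def violin_y_range(cut):
    """Draw the violins with the given cut and return the (low, high) y-extent of their outlines."""
    fig, ax = plt.subplots()
    # inner=None leaves only the violin bodies in ax.collections
    sns.violinplot(x=df["weekday"], y=df["ret"], cut=cut, inner=None, ax=ax)
    ys = np.concatenate(
        [path.vertices[:, 1] for coll in ax.collections for path in coll.get_paths()]
    )
    plt.close(fig)
    return ys.min(), ys.max()

extended_lo, extended_hi = violin_y_range(cut=2)  # tails overshoot the data
trimmed_lo, trimmed_hi = violin_y_range(cut=0)    # tails stop at the data range

print("data range :", df["ret"].min(), df["ret"].max())
print("cut=2 range:", extended_lo, extended_hi)
print("cut=0 range:", trimmed_lo, trimmed_hi)
```

With cut=0 the printed violin range should coincide with the data range, while the cut=2 range is wider on both sides, which is the kind of overshoot described in the question.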

Related

Plot boxplots for minute - and hourly data using pandas and seaborn

I've got the following dataframe:
dfB1
Date_and_time MP
2020-08-28 19:05:00.066663676 75.0
2020-08-28 19:05:00.133330342 70.0
2020-08-28 19:05:00.199997008 76.0
2020-08-28 19:05:00.266663674 85.0
2020-08-28 19:05:00.333330340 73.0
... ...
2020-08-29 01:59:50.666414770 1454.0
2020-08-29 01:59:50.733081436 1359.0
2020-08-29 01:59:50.799748102 1320.0
2020-08-29 01:59:50.866414768 1217.0
2020-08-29 01:59:50.933081434 1246.0
373364 rows × 1 columns
My goal is to create a plot which displays boxplots for every 1, 5, or 30 minutes, or even every hour. The DatetimeIndex is in the correct format to index by hour (data was collected at 15 Hz, which means consecutive datapoints are 66666666 nanoseconds apart).
dfB1.index
DatetimeIndex(['2020-08-28 19:05:00.066663676',
'2020-08-28 19:05:00.133330342',
'2020-08-28 19:05:00.199997008',
'2020-08-28 19:05:00.266663674',
'2020-08-28 19:05:00.333330340',
...
'2020-08-29 01:59:50.666414770',
'2020-08-29 01:59:50.733081436',
'2020-08-29 01:59:50.799748102',
'2020-08-29 01:59:50.866414768',
'2020-08-29 01:59:50.933081434'],
dtype='datetime64[ns]', name='Date_and_time', length=373364, freq='66666666N')
I've tried plotting using seaborn, and I get a result. But I can't interact with the plot and it is also plotted very poorly. I am familiar with plotly, but I can't seem to find a way to integrate plotly. Also, the minute plot is completely wrong. I only get 59 points on the x-axis. What should I do to interact with the plots and to get boxplots every minute (or every 5 minutes)? I've also read and tried functions described here: Box plot of hourly data in Time Series Python
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(15,5))
sns.boxplot(x=dfB1.index.hour, y=dfB1['MP'], ax=ax)
.hour gives only the hour of the day, i.e. both 2020-01-01 00:00 and 2020-01-10 00:00 will give 0 (likewise, .minute can never give more than 60 distinct values). I think you want .floor:
sns.boxplot(x=dfB1.index.floor('H'), y=dfB1['MP'], ax=ax)
and also:
sns.boxplot(x=dfB1.index.floor('5Min'), y=dfB1['MP'], ax=ax)
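To illustrate the difference between the component accessors and .floor, here is a small sketch on a made-up one-sample-per-minute index (not the 15 Hz data from the question). It only counts how many distinct groups each approach would hand to the boxplot:

```python
import pandas as pd

# Hypothetical index: one sample per minute, spanning three calendar days
idx = pd.date_range("2020-08-28 19:05", "2020-08-30 01:59", freq="1min")

hour_of_day = idx.hour             # integers 0-23, the date is discarded
minute_of_hour = idx.minute        # integers 0-59, date and hour are discarded
hour_bucket = idx.floor("60min")   # full timestamps rounded down to the hour
five_min_bucket = idx.floor("5min")

print(hour_of_day.nunique())       # 24
print(minute_of_hour.nunique())    # 60
print(hour_bucket.nunique())       # 31
print(five_min_bucket.nunique())   # 371
```

The floored values keep the date, so every hour (or 5-minute window) of the recording becomes its own box instead of being merged with the same clock time on other days.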

Iteratively plot data through datetime in pandas dataframe

I have a dataframe here that contains a value daily since 2000 (ignore the index).
Extent Date
6453 13.479 2001-01-01
6454 13.385 2001-01-02
6455 13.418 2001-01-03
6456 13.510 2001-01-04
6457 13.566 2001-01-05
I would like to make a plot where the x-axis is the day of the year, and the y-axis is the value. The plot would contain 20 different lines, with each line corresponding to the year of the data. Is there an intuitive way to do this using pandas, or is it easier to do with matplotlib?
Here is a quick paint sketch to illustrate.
One quick way is to plot x-axis as strings:
df['Date'] = pd.to_datetime(df['Date'])
(df.set_index([df.Date.dt.strftime('%m-%d'),
df.Date.dt.year])
.Extent.unstack()
.plot()
)
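The same reshaping can also be written with pivot, which may be easier to read. This is a sketch on synthetic data; the helper column names month_day and year are assumptions introduced here, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering three non-leap years
dates = pd.date_range("2001-01-01", "2003-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "Extent": np.linspace(13.0, 14.0, len(dates))})

# One row per calendar day ("MM-DD"), one column per year
wide = (
    df.assign(month_day=df["Date"].dt.strftime("%m-%d"), year=df["Date"].dt.year)
      .pivot(index="month_day", columns="year", values="Extent")
)

print(wide.head())
# wide.plot() would then draw one line per year (needs matplotlib installed)
```

Because the x-axis values are "MM-DD" strings, a leap year simply adds a "02-29" row that is NaN for the other years.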

How can I plot different length pandas series with matplotlib?

I've got two pandas series, one with a 7 day rolling mean for the entire year and another with monthly averages. I'm trying to plot them both on the same matplotlib figure, with the averages as a bar graph and the 7 day rolling mean as a line graph. Ideally, the line would be drawn on top of the bar graph.
The issue I'm having is that, with my current code, the bar graph is showing up without the line graph, but when I try plotting the line graph first I get a ValueError: ordinal must be >= 1.
Here's what the series look like:
These are the first 15 values of the 7 day rolling mean series; it has a date and a value for the entire year:
date
2016-01-01 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 0.088473
2016-01-09 0.099122
2016-01-10 0.086265
2016-01-11 0.084836
2016-01-12 0.076741
2016-01-13 0.070670
2016-01-14 0.079731
2016-01-15 0.079187
2016-01-16 0.076395
This is the entire monthly average series:
dt_month
2016-01-01 0.498323
2016-02-01 0.497795
2016-03-01 0.726562
2016-04-01 1.000000
2016-05-01 0.986411
2016-06-01 0.899849
2016-07-01 0.219171
2016-08-01 0.511247
2016-09-01 0.371673
2016-10-01 0.000000
2016-11-01 0.972478
2016-12-01 0.326921
Here's the code I'm using to try and plot them:
ax = series_one.plot(kind="bar", figsize=(20,2))
series_two.plot(ax=ax)
plt.show()
Here's the graph that generates:
Any help is hugely appreciated! Also, advice on formatting this question and creating code to make two series for a minimum working example would be awesome.
Thanks!!
The problem is that pandas bar plots are categorical (bars sit at consecutive integer positions 0, 1, 2, ...). Since in your case the two series have a different number of elements, plotting the line graph in categorical coordinates is not really an option. What remains is to plot the bar graph in numerical coordinates as well. This is not possible with pandas, but is the default behaviour with matplotlib.
Below I shift the monthly dates by 15 days to the middle of the month to have nicely centered bars.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
t1 = pd.date_range("2018-01-01", "2018-12-31", freq="D")
s1 = pd.Series(np.cumsum(np.random.randn(len(t1)))+14, index=t1)
s1[:6] = np.nan
t2 = pd.date_range("2018-01-01", "2018-12-31", freq="MS")
s2 = pd.Series(np.random.rand(len(t2))*15+5, index=t2)
# shift monthly data to middle of month
s2.index += pd.Timedelta('15 days')
fig, ax = plt.subplots()
ax.bar(s2.index, s2.values, width=14, alpha=0.3)
ax.plot(s1.index, s1.values)
plt.show()
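To check the categorical positioning described in this answer, the bar rectangles that pandas creates can be inspected directly. A small sketch with a made-up three-month series:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly series with a datetime index
monthly = pd.Series(
    [0.5, 0.7, 1.0],
    index=pd.to_datetime(["2016-01-01", "2016-02-01", "2016-03-01"]),
)

fig, ax = plt.subplots()
monthly.plot(kind="bar", ax=ax)

# x-coordinate of the centre of each bar, in axis data units
bar_centers = [round(p.get_x() + p.get_width() / 2, 6) for p in ax.patches]
print(bar_centers)
plt.close(fig)
```

The centres come out as plain integers rather than dates, so a line plotted against real dates on the same axes ends up far outside the visible range of the bars.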
The problem might be that the two series' indices are on very different scales. You can use ax.twiny to give the line its own x-axis:
ax = series_one.plot(kind="bar", figsize=(20,2))
ax_tw = ax.twiny()
series_two.plot(ax=ax_tw)
plt.show()

Datetime, Timedelta and separate lineplot, plot area

My df has a datetime64 index, and its columns are timedelta and float64. Below are 3 example rows of my df.
                          CTIL            downtime     ratio
quater
2015-04-01  4859 days 01:46:00  1699 days 17:20:00  0.349804
2015-07-01  4553 days 14:16:00  1862 days 03:27:00  0.408939
2015-10-01  5502 days 21:18:00  2442 days 20:15:00  0.443920
I would like to plot it all on one chart. CTIL and downtime should be area plots and ratio should be a line chart.
Currently I have 2 separate plots:
df_quater[['CTIL', 'downtime']].plot()
df_quater['ratio'].plot()
Question 1:
How can I make an area plot when the type of x is different from the type of y?
I tried this:
df_quater[['CTIL', 'downtime']].plot(kind='area')
It generates an error:
TypeError: Cannot cast ufunc greater_equal input (...) with casting rule 'same_kind'
Question 2:
Can my labels on the y-axis be in timedelta format? The current plot shows plain numbers.
Question 3:
Can I combine these 2 plots into one? The labels for CTIL and downtime should be on the left and the label for ratio should be on

bokeh x-axis datetime returning incorrect dates

I have a dataframe df2 with dates in string form and numbers
date | value
2018-02-02 130
2018-02-05 360
2018-02-06 98
2018-02-07 150
When I plot the dates in a plot along the x axis, the dates returned are incorrect. They seem to translate like so:
2018-02-06 = Jan 1, 1970
2018-02-05 = Jan 13.5, 1970
from bokeh.models import ColumnDataSource, DatetimeTickFormatter
from bokeh.plotting import figure
source=ColumnDataSource(data=df2)
p= figure(plot_width=700, plot_height=400, x_axis_type="datetime")
p.xaxis.formatter= DatetimeTickFormatter(days="%B/%d/%Y")
p.triangle("DATE","VALUE",color='black',source=source)
The glyphs don't fall exactly on the gridlines either. What is happening?
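p.triangle(df2["DATE"].dt.date,df2["VALUE"],color='Black') will yield the correct dates. Note that the .dt accessor only works on a datetime column, so the string dates have to be converted first (for example with pd.to_datetime).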
I have a nasty habit of finding the right answers shortly after posting it on stackoverflow. I think I may do that more often!
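A sketch of the conversion step, using only pandas and the four rows shown in the question. A Bokeh datetime axis interprets numbers as milliseconds since the Unix epoch, so the values handed to it need to be real datetimes; the epoch_ms column below is computed only to show what a correctly converted date corresponds to on that scale:

```python
import pandas as pd

# The sample rows from the question, with dates still stored as strings
df2 = pd.DataFrame({
    "DATE": ["2018-02-02", "2018-02-05", "2018-02-06", "2018-02-07"],
    "VALUE": [130, 360, 98, 150],
})

# Convert the strings to real datetimes before building the ColumnDataSource
df2["DATE"] = pd.to_datetime(df2["DATE"])

# Milliseconds since 1970-01-01, the scale a datetime axis works in
epoch_ms = (df2["DATE"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)

print(df2.dtypes)
print(epoch_ms.tolist())
```

Values that land near Jan 1, 1970 on the axis, as in the question, are presumably small numbers being read on this millisecond scale instead of actual dates.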
