I'm trying to smooth out a line plot, but because the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are already some useful Stack Overflow questions, such as:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every daily index entry, you can reindex it onto a denser index (hourly here), which fills every previously non-existent index entry with NaN. Then, after choosing one of the many available interpolation methods, interpolate and plot your data (note that the 'cubic' method used here requires SciPy):
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
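To see what the reindex step does before any curve fitting, here is a minimal sketch on three made-up daily values. It uses a 6-hour grid and method="time" (plain linear interpolation in time) instead of 'cubic' so that it runs without SciPy. The series values and variable names are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Three hypothetical daily observations
daily = pd.Series(
    [0.0, 4.0, 2.0],
    index=pd.date_range("2015-05-15", periods=3, freq="D"),
)

# Denser target index: one point every 6 hours between the first and last day
dense_index = pd.date_range(
    daily.index[0], daily.index[-1], freq=pd.Timedelta(hours=6)
)

# Reindexing keeps the original values and inserts NaN at the new timestamps
upsampled = daily.reindex(dense_index)

# Interpolation then fills the NaN gaps; "time" weights by the actual time distance
filled = upsampled.interpolate(method="time")

print(upsampled.isna().sum())  # number of gaps created by the reindex
print(filled.head())
```

Swapping method="time" for 'cubic' gives the smoother curve from the answer, at the cost of the SciPy dependency.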
One workaround is to create two plots: 1) the raw, non-interpolated data with date labels, and 2) the smoothed curve without date labels.
Plot 1) with linestyle=" " and with the x-axis dates converted to strings.
Plot 2) with linestyle="-", interpolating the x-axis with np.linspace and the y-axis with make_interp_spline.
The workaround applied to your code looks like this:
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: create a linspace over the integer positions of the x axis
# (the last valid position is len(df.index) - 1; going past it makes the spline extrapolate)
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    np.arange(len(df.index)),
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new curve with linestyle = "-"
axs.plot(
    x_new,
    y_new,
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
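The string-label trick can also be avoided by fitting the spline on numeric time offsets and converting the dense grid back to timestamps, so that matplotlib keeps a real date axis. The following is only a sketch under assumptions: the seed, the cubic order k=3 and the names x_days, x_new_days and spline are choices made here, not taken from the answer above.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline

index = pd.date_range("2015-05-15", "2015-07-05")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(0, 1, size=len(index))}, index=index)

# Express each date as a float number of days since the first date
x_days = np.asarray((df.index - df.index[0]) / pd.Timedelta(days=1), dtype=float)

# Dense grid that stays inside the observed range, so the spline never extrapolates
x_new_days = np.linspace(x_days[0], x_days[-1], 300)

# Interpolating spline through the daily values (k=3 is an assumption)
spline = make_interp_spline(x_days, df["value"].to_numpy(), k=3)
y_new = spline(x_new_days)

# Convert the dense grid back to timestamps for plotting
x_new = df.index[0] + pd.to_timedelta(x_new_days, unit="D")
```

With these arrays, axs.plot(x_new, y_new) and axs.plot(df.index, df["value"], "o") can share one date axis without converting anything to strings.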
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
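For reference, a self-contained sketch of the same filter applied to a dataframe like the one in the question. The seed and the "smooth" column name are assumptions made for illustration; the window size and polynomial order are the ones quoted above.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

index = pd.date_range("2015-05-15", "2015-12-05")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(0, 1, size=len(index))}, index=index)

window_size = 21  # number of samples (days here) in each local fit
poly_order = 1    # degree of the polynomial fitted inside each window

# The filter only needs the values; the date index is untouched
df["smooth"] = savgol_filter(df["value"].to_numpy(), window_size, poly_order)

print(df.head())
```

Because the dates never enter the calculation, the result can be plotted directly with df.plot() or axs.plot(df.index, df["smooth"]).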
This question is from this tutorial found here:
I want my plot to look like the one below, but with time series data, and with the zoomed data coming from a different source rather than being selected via x_lim and y_lim.
In the plot above, I would like the zoomed part to be intraday data from a different source, and the plot below to be daily data for some stock. Because the two come from different sources, I cannot use axis limits to zoom. I will be using the Yahoo datareader for the daily data and yfinance for the intraday data.
The code:
import pandas as pd
from pandas_datareader import data as web
from matplotlib.patches import ConnectionPatch
df = web.DataReader('goog', 'yahoo')
df.Close = pd.to_numeric(df['Close'], errors='coerce')
fig = plt.figure(figsize=(6, 5))
plt.subplots_adjust(bottom = 0., left = 0, top = 1., right = 1)
sub1 = fig.add_subplot(2,2,1)
sub1 = df.Close.plot()
sub2 = fig.add_subplot(2,1,2) # two rows, one column, second cell
df.Close.pct_change().plot(ax =sub2)
sub2.plot(theta, y, color = 'orange')
con1 = ConnectionPatch(xyA=(df[1:2].index, df[2:3].Close), coordsA=sub1.transData,
xyB=(df[4:5].index, df[5:6].Close), coordsB=sub2.transData, color = 'green')
fig.add_artist(con1)
I am having trouble with the xy coordinates. With the code above I am getting:
TypeError: Cannot cast array data from dtype('O') to dtype('float64')
according to the rule 'safe'
xyA=(df[1:2].index, df[2:3].Close)
What I have done here is use the date df[1:2].index as my x value and the price df[2:3].Close as my y value.
Is converting the df to an array and then plotting my only option here? If there is any other way to get the ConnectionPatch to work, please advise.
df.dtypes
High float64
Low float64
Open float64
Close float64
Volume int64
Adj Close float64
dtype: object
Matplotlib plots dates by converting them to floats that count days, starting with 0 on 1970-01-01, i.e. the POSIX timestamp zero (this is the default epoch since Matplotlib 3.3). The value differs from a POSIX timestamp only in resolution: "1" is one day instead of one second.
There are three ways to compute that number:
either use matplotlib.dates.date2num
or use .toordinal() which gives you the right resolution and remove the offset corresponding to 1970-1-1,
or get the POSIX timestamp and divide by the number of seconds in a day:
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df['Change'] = df['Close'].pct_change()
con1 = ConnectionPatch(xyA=(df.index[0].toordinal() - pd.Timestamp(0).toordinal(), df['Close'].iloc[0]), coordsA=sub1.transData,
xyB=(df.index[1].toordinal() - pd.Timestamp(0).toordinal(), df['Change'].iloc[1]), coordsB=sub2.transData, color='green')
fig.add_artist(con1)
con2 = ConnectionPatch(xyA=(df.index[-1].timestamp() / 86_400, df['Close'].iloc[-1]), coordsA=sub1.transData,
xyB=(df.index[-1].timestamp() / 86_400, df['Change'].iloc[-1]), coordsB=sub2.transData, color='green')
fig.add_artist(con2)
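As a quick sanity check, the three conversions can be compared on a single date. This sketch assumes Matplotlib's default date epoch (1970-01-01, Matplotlib 3.3 or later) and a timestamp at midnight, because toordinal() discards the time of day. The example date is arbitrary.

```python
import pandas as pd
from matplotlib.dates import date2num

ts = pd.Timestamp("2021-03-15")  # arbitrary example date at midnight

# 1) Matplotlib's own converter
via_date2num = date2num(ts)

# 2) Proleptic Gregorian ordinal, shifted so that 1970-01-01 maps to 0
via_ordinal = ts.toordinal() - pd.Timestamp(0).toordinal()

# 3) POSIX seconds divided by the number of seconds in a day
via_posix = ts.timestamp() / 86_400

print(via_date2num, via_ordinal, via_posix)
```

Only date2num and the POSIX route keep the fractional part of the day, which matters as soon as intraday timestamps are involved.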
You also need to make sure that the values you use are in range for the target axes: in your example you use Close values on sub2, which contains pct_change'd values.
Of course, if you want to connect from the bottom of the boxes as in your example, it's easier to express the coordinates using the axes transform instead of the data transform:
from matplotlib.dates import date2num
con1 = ConnectionPatch(xyA=(0, 0), coordsA=sub1.transAxes,
xyB=(date2num(df.index[1]), df['Change'].iloc[1]), coordsB=sub2.transData, color='green')
fig.add_artist(con1)
con2 = ConnectionPatch(xyA=(1, 0), coordsA=sub1.transAxes,
xyB=(date2num(df.index[-1]), df['Change'].iloc[-1]), coordsB=sub2.transData, color='green')
fig.add_artist(con2)
To plot your candlesticks, I’d recommend using the mplfinance (previously matplotlib.finance) package:
import mplfinance as mpf
sub3 = fig.add_subplot(2, 2, 2)
mpf.plot(df.iloc[30:70], type='candle', ax=sub3)
Putting all this together in a single script, it could look like this:
import pandas as pd, mplfinance as mpf, matplotlib.pyplot as plt
from pandas_datareader import data as web
from matplotlib.patches import ConnectionPatch
from matplotlib.dates import date2num, ConciseDateFormatter, AutoDateLocator
from matplotlib.ticker import PercentFormatter
# Get / compute data
df = web.DataReader('goog', 'yahoo')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df['Change'] = df['Close'].pct_change()
# Pick zoom range
zoom_start = df.index[30]
zoom_end = df.index[30 + 8 * 5] # 8 weeks ~ 2 months
# Create figures / axes
fig = plt.figure(figsize=(18, 12))
top_left = fig.add_subplot(2, 2, 1)
top_right = fig.add_subplot(2, 2, 2)
bottom = fig.add_subplot(2, 1, 2)
fig.subplots_adjust(hspace=.35)
# Plot all 3 data
df['Close'].plot(ax=bottom, linewidth=1, rot=0, title='Daily closing value', color='purple')
bottom.set_ylim(0)
df.loc[zoom_start:zoom_end, 'Change'].plot(ax=top_left, linewidth=1, rot=0, title='Daily Change, zoomed')
top_left.yaxis.set_major_formatter(PercentFormatter(xmax=1))  # pct_change() returns fractions, so 1.0 means 100%
# Here instead of df.loc[...] use your intra-day data
mpf.plot(df.loc[zoom_start:zoom_end], type='candle', ax=top_right, xrotation=0, show_nontrading=True)
top_right.set_title('Last day OHLC')
# Put ConciseDateFormatters on all x-axes for fancy date display
for ax in fig.axes:
    locator = AutoDateLocator()
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(ConciseDateFormatter(locator))
# Add the connection patches
fig.add_artist(ConnectionPatch(
xyA=(0, 0), coordsA=top_left.transAxes,
xyB=(date2num(zoom_start), df.loc[zoom_start, 'Close']), coordsB=bottom.transData,
color='green'
))
fig.add_artist(ConnectionPatch(
xyA=(1, 0), coordsA=top_left.transAxes,
xyB=(date2num(zoom_end), df.loc[zoom_end, 'Close']), coordsB=bottom.transData,
color='green'
))
plt.show()
I am trying to plot a density curve with seaborn using age of vehicles.
My density curve has dips between the whole numbers, even though my age values are all whole numbers.
Can't seem to find anything related to this issue so I thought I would try my luck here, any input is appreciated.
My fix currently is just using a histogram with a larger bin but would like to get this working with a density plot.
Thanks!
In seaborn.displot you are passing kind = 'kde' in order to get a continuous curve. However, this parameter triggers a kernel density estimation, which computes values for all numbers, including non-integer ones. Because your data only take whole-number values, the estimated density can drop between them.
Instead, you can tune seaborn.histplot to get a continuous step curve using the element and fill parameters (I create a fake dataframe just to draw a plot, since you didn't provide your data):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
N = 10000
df = pd.DataFrame({'age': np.random.poisson(lam = 4, size = N)})
df['age'] = df['age'] + 1
fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.histplot(ax = ax[0], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1))
sns.histplot(ax = ax[1], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1), element = 'step', fill = False)
ax[0].set_xticks(range(1, 14))
ax[1].set_xticks(range(1, 14))
plt.show()
For comparison, here is seaborn.displot on the same dataframe with kind = 'kde':
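The dips can be reproduced without seaborn by evaluating a Gaussian kernel density estimate by hand. This is a sketch under assumptions: the helper gaussian_kde_at and the two bandwidths (0.3 and 1.0) are hypothetical choices made here to show the effect, not the values seaborn would pick.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.poisson(lam=4, size=10_000) + 1  # whole-number ages, like the fake data above


def gaussian_kde_at(points, data, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` at `points`."""
    z = (np.asarray(points, dtype=float)[:, None] - data[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (len(data) * bandwidth * np.sqrt(2 * np.pi))


# Density at two neighbouring integers and at the midpoint between them
probe = [4.0, 4.5, 5.0]
narrow = gaussian_kde_at(probe, ages, bandwidth=0.3)  # kernel much narrower than the spacing
wide = gaussian_kde_at(probe, ages, bandwidth=1.0)    # kernel as wide as the spacing

print("narrow:", narrow)
print("wide:  ", wide)
```

With the narrow kernel the midpoint value is far below its neighbours, which is the dip seen in the plot; with the wide kernel the three values are nearly level. In seaborn the bandwidth can be scaled with the bw_adjust argument of kdeplot and displot.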
I have this dataframe and I want to plot it as a line plot. This is what I have plotted so far.
The graph is:
The code to generate it is:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
pd.DataFrame(df, columns=df.columns).plot.line( ax=ax,
xticks=pd.to_datetime(frame.Date))
I want a marker showing the InnovationScore value (wherever InnovationScore is not 0) on the Open and Close lines, to show that this is where things change when InnovationScore changes.
You have to address two problems - marking the corresponding spots on your curves and using the dates on the x-axis. The first problem can be solved by identifying the dates, where the score is not zero, then plotting markers on top of the curve at these dates. The second problem is more of a structural nature - pandas often interferes with matplotlib when it comes to date time objects. Using pandas standard plotting functions is good because it addresses common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to revert to matplotlib afaik) is usually a bad idea because they do not necessarily present the date time in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
"Open": np.random.randint(100, 150, n),
"Close": np.random.randint(100, 150, n),
"Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artificial score people take seriously")
fig.autofmt_xdate()
plt.show()
Sample output:
You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-") # r is red, o is a circle marker, - is a solid line
plt.plot([3, 2, 1], "b.-") # b is blue, . is a point marker, - is a solid line
plt.show()
Also check out the example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
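The markevery idea from the linked gallery example could be applied to the question's data roughly as follows. This is a sketch on a small made-up frame: the column values, the single "Open" line and the annotation offsets are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Small hypothetical frame with the same column names as in the question
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=6, freq="D"),
    "Open": [100, 104, 101, 107, 110, 108],
    "Innovation Score": [0, 3, 0, 0, 3, 0],
})

# Integer positions of the rows where the score is not zero
marked = np.flatnonzero(df["Innovation Score"].to_numpy() != 0).tolist()

fig, ax = plt.subplots()
# markevery restricts the "o" markers to the selected positions only
(line,) = ax.plot(df["Date"], df["Open"], "-o", markevery=marked, label="Open")

# Write the score value just above each marked point
for i in marked:
    ax.annotate(
        str(df["Innovation Score"].iloc[i]),
        xy=(df["Date"].iloc[i], df["Open"].iloc[i]),
        xytext=(0, 8),
        textcoords="offset points",
        ha="center",
    )

ax.legend()
fig.autofmt_xdate()
```

Calling plt.show() afterwards displays the figure; the same markevery list can be reused for the Close line.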
I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df =pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index( 'period')
fig, ax = plt.subplots(1,2)
sns.lineplot( data = df, ax =ax[0])
df_idx = df.index
df2 = df.set_index( np.arange(1, len(df_idx) + 1 ))
sns.lineplot(data = df2, ax = ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]
left_buffer,right_buffer = 3,2
labels = ['Q1','Q2','Q3','Q4']
extanded_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extanded_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extanded_labels)
xticks = ax[0].xaxis.get_major_ticks()
for ind in aux_range[0:left_buffer]: xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]: xticks[ind].tick1line.set_visible(False)
in which left_buffer and right_buffer are margins you want to add to the left and to the right, respectively. Running the code, you will get
I may have actually found a simpler solution: draw a transparent line (alpha = 0) with x = the index of the dataframe, i.e. all the labels, including those for which all values are NaN, and y = the average value of the dataframe, so that it is sure to be within the range:
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * df.mean().mean() , ax = ax[0], alpha =0 )
This assumes the scale of the y-axis has not been changed manually; a better way is to use the midpoint of the current y-limits, which works either way:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * y_centre , ax = ax[0], alpha =0 )
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
In pandas' documentation you can find a discussion on area plots, and in particular stacking them. Is there an easy and straightforward way to get a 100% area stack plot like this one
from this post?
The method is basically the same as in the other SO answer; divide each row by the sum of the row:
df = df.divide(df.sum(axis=1), axis=0)
Then you can call df.plot(kind='area', stacked=True, ...) as usual.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(2015)
y = np.random.randint(5, 50, (10,3))
x = np.arange(10)
df = pd.DataFrame(y, index=x)
df = df.divide(df.sum(axis=1), axis=0) * 100  # row shares, scaled to percent to match the y-axis label
ax = df.plot(kind='area', stacked=True, title='100 % stacked area chart')
ax.set_ylabel('Percent (%)')
ax.margins(0, 0) # Set margins to avoid "whitespace"
plt.show()
yields
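A quick check of the row normalisation on a tiny hypothetical frame (the numbers and the name share are made up for illustration) shows that every row adds up to 100 after the division:

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts for three categories over three periods
df = pd.DataFrame({"a": [1, 2, 5], "b": [1, 6, 5], "c": [2, 2, 10]})

# Divide each row by its own total, then scale to percent
share = df.divide(df.sum(axis=1), axis=0) * 100

print(share)
print(share.sum(axis=1))  # every row total should be 100
```

Passing share to plot(kind='area', stacked=True) then fills the full height of the chart at every x position.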