I have a scatter plot that has time on the x-axis:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
d = ({
'A' : ['08:00:00','08:10:00','08:12:00','08:26:00','08:29:00','08:31:00','10:10:00','10:25:00','10:29:00','10:31:00'],
'B' : ['1','1','1','2','2','2','7','7','7','7'],
'C' : ['X','Y','Z','X','Y','Z','A','X','Y','Z'],
})
df = pd.DataFrame(data=d)
fig,ax = plt.subplots()
x = df['A']
y = df['B']
x_numbers = (pd.to_timedelta(df['A']).dt.total_seconds())
plt.scatter(x_numbers, y)
plt.show()
Output 1:
I wanted to swap total seconds for actual timestamps so I included:
plt.xticks(x_numbers, x)
This results in the x-ticks overlapping each other.
If I use:
plt.locator_params(axis='x', nbins=10)
The result is the same as above. If I change nbins to something smaller, the ticks don't overlap, but they no longer align with their respective scatter points; that is, the scatter points don't line up with the correct timestamps.
If I use:
M = 10
xticks = ticker.MaxNLocator(M)
ax.xaxis.set_major_locator(xticks)
The ticks don't overlap, but they don't align with their respective scatter points.
Is it possible to pick the number of x-ticks while keeping each tick aligned with its data point?
E.g. for the figure below, can I just use n ticks instead of all of them?
Output 2:
Let's use some xticklabel manipulation:
d = ({
'A' : ['08:00:00','08:10:00','08:12:00','08:26:00','08:29:00','08:31:00','10:10:00','10:25:00','10:29:00','10:31:00'],
'B' : ['1','1','1','2','2','2','7','7','7','7'],
'C' : ['X','Y','Z','X','Y','Z','A','X','Y','Z'],
})
df = pd.DataFrame(data=d)
fig,ax = plt.subplots()
x = df['A']
y = df['B']
x_numbers = (pd.to_timedelta(df['A']).dt.total_seconds())
plt.scatter(x_numbers, y)
loc, labels = plt.xticks()
newlabels = [str(pd.Timedelta(str(i)+ ' seconds')).split()[2] for i in loc]
plt.xticks(loc, newlabels)
plt.show()
Output:
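As a possible variation on the label manipulation above, a formatter can compute the label from each tick position, so the labels stay correct when matplotlib moves the ticks (for example after zooming). This is only a sketch: seconds_to_hms is a hypothetical helper, the data is a shortened copy of the question's, and it assumes non-negative tick positions.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def seconds_to_hms(value, pos=None):
    """Format a tick position given in seconds as HH:MM:SS (hypothetical helper)."""
    total = int(round(value))
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Shortened copy of the question's data
times = pd.Series(['08:00:00', '08:26:00', '10:10:00', '10:31:00'])
values = [1, 2, 7, 7]
x_numbers = pd.to_timedelta(times).dt.total_seconds()

fig, ax = plt.subplots()
ax.scatter(x_numbers, values)
# Matplotlib still chooses the tick positions; only the label text changes
ax.xaxis.set_major_formatter(FuncFormatter(seconds_to_hms))
```

Call plt.show() afterwards to display the figure. The ticks fall wherever the default locator puts them, so they are readable but not tied to individual data points.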
Firstly, the time interval is not consistent.
Secondly, it's a high-frequency series.
In the general case, you don't need an xtick for every entry. In those scenarios, you can use something like plt.plot_date(x, y) together with tick locators and formatters such as DayLocator() and DateFormatter('%Y-%m-%d').
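A minimal sketch of the locator and formatter idea, adapted to the minute-level data in the question. The date 2020-01-01 is an arbitrary assumption added only so that the time strings become full timestamps, and MinuteLocator with DateFormatter('%H:%M') stands in for the day-level classes named above.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Shortened copy of the question's data
times = ['08:00:00', '08:26:00', '10:10:00', '10:31:00']
values = [1, 2, 7, 7]

# Assumed, arbitrary date prefix so the strings parse as full timestamps
stamps = pd.to_datetime(['2020-01-01 ' + t for t in times])

fig, ax = plt.subplots()
ax.scatter(stamps, values)
# One tick on every full and half hour, labelled as HH:MM
ax.xaxis.set_major_locator(mdates.MinuteLocator(byminute=[0, 30]))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
fig.autofmt_xdate()
```

Call plt.show() afterwards to display the figure. The ticks are evenly spaced in time rather than pinned to individual points.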
For this very specific case, though, where the data is at minute level and a few points are really close together, a hack is to play with the numeric Series used for the x-axis, x_numbers. To increase the gap between neighbouring points, I tried cumsum() (note that the x-axis is then no longer linear in time), and to reduce the overlapping to an extent, I rotated the xticks.
fig, ax = plt.subplots(figsize=(10,6))
x = df['A']
y = df['B']
x_numbers = (pd.to_timedelta(df['A']).dt.total_seconds()).cumsum()
plt.scatter(x_numbers, y)
plt.xticks(x_numbers, x, rotation=50)
plt.show()
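For the original request (only n ticks, each still aligned with a data point), one option is to place ticks on a subset of the rows instead of all of them. The sketch below keeps the true time axis, so no cumsum() is involved. n_ticks and picks are hypothetical names, and the rows are chosen evenly by position, not by time.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

times = pd.Series(['08:00:00', '08:10:00', '08:12:00', '08:26:00', '08:29:00',
                   '08:31:00', '10:10:00', '10:25:00', '10:29:00', '10:31:00'])
values = [1, 1, 1, 2, 2, 2, 7, 7, 7, 7]
x_numbers = pd.to_timedelta(times).dt.total_seconds()

n_ticks = 4  # hypothetical choice; the number of ticks wanted
# Evenly spaced row positions, always including the first and last row
picks = np.unique(np.linspace(0, len(times) - 1, n_ticks).round().astype(int))

fig, ax = plt.subplots()
ax.scatter(x_numbers, values)
# Each tick sits exactly on a data point and carries that point's timestamp
ax.set_xticks(x_numbers.iloc[picks].tolist())
ax.set_xticklabels(times.iloc[picks].tolist(), rotation=45)
```

Call plt.show() afterwards to display the figure. Because the rows are picked by position, two chosen points can still be close in time; in that case, picking the rows by hand is a reasonable fallback.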
Related
I have a number of charts, made with matplotlib and seaborn, that look like the example below.
I show how certain quantities evolve over time on a lineplot
The x-axis labels are not numbers but strings (e.g. 'Q1' or '2018 first half' etc)
I need to "extend" the x-axis to the right, with an empty period. The chart must show from Q1 to Q4, but there is no data for Q4 (the Q4 column is full of nans)
I need this because I need the charts to be side-by-side with others which do have data for Q4
matplotlib doesn't display the column full of nans
If the x-axis were numeric, it would be easy to extend the range of the plot; since it's not numeric, I don't know which x_range each tick corresponds to
I have found the solution below. It works, but it's not elegant: I use integers for the x-axis, add 1, then set the labels back to the strings. Is there a more elegant way?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.ticker import FuncFormatter
import seaborn as sns
df =pd.DataFrame()
df['period'] = ['Q1','Q2','Q3','Q4']
df['a'] = [3,4,5,np.nan]
df['b'] = [4,4,6,np.nan]
df = df.set_index( 'period')
fig, ax = plt.subplots(1,2)
sns.lineplot( data = df, ax =ax[0])
df_idx = df.index
df2 = df.set_index( np.arange(1, len(df_idx) + 1 ))
sns.lineplot(data = df2, ax = ax[1])
ax[1].set_xlim(1,4)
ax[1].set_xticklabels(df.index)
You can add these lines of code for ax[0]
left_buffer,right_buffer = 3,2
labels = ['Q1','Q2','Q3','Q4']
extended_labels = ['']*left_buffer + labels + ['']*right_buffer
left_range = list(range(-left_buffer,0))
right_range = list(range(len(labels),len(labels)+right_buffer))
ticks_range = left_range + list(range(len(labels))) + right_range
aux_range = list(range(len(extended_labels)))
ax[0].set_xticks(ticks_range)
ax[0].set_xticklabels(extended_labels)
xticks = ax[0].xaxis.get_major_ticks()
for ind in aux_range[0:left_buffer]: xticks[ind].tick1line.set_visible(False)
for ind in aux_range[len(labels)+left_buffer:len(labels)+left_buffer+right_buffer]: xticks[ind].tick1line.set_visible(False)
in which left_buffer and right_buffer are margins you want to add to the left and to the right, respectively. Running the code, you will get
I may have actually found a simpler solution: I can draw a transparent line (alpha = 0) by plotting x = the index of the dataframe, i.e. with all the labels, including those for which all values are NaN, and y = the average value of the dataframe, so as to be sure it's within the range:
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * df.mean().mean() , ax = ax[0], alpha =0 )
This assumes the scale of the y-axis has not been changed manually; a more robust way of doing it is to use the centre of the current y-limits:
y_centre = np.mean([ax[0].get_ylim()])
sns.lineplot(x = df.index, y = np.ones(df.shape[0]) * y_centre , ax = ax[0], alpha =0 )
Drawing a transparent line forces matplotlib to extend the axes so as to show all the x values, even those for which all the other values are nans.
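Another possible route, sketched here under the assumption that plain matplotlib plotting is acceptable in place of sns.lineplot: matplotlib places string categories at the integer positions 0, 1, 2, ..., so once all labels have been passed as x values the limits can be set numerically. The half-slot padding of 0.5 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [3, 4, 5, np.nan], 'b': [4, 4, 6, np.nan]},
                  index=['Q1', 'Q2', 'Q3', 'Q4'])

fig, ax = plt.subplots()
for col in df.columns:
    # All four labels are passed as x, so 'Q4' is registered as a category
    ax.plot(df.index, df[col], label=col)

# Categories sit at 0, 1, 2, 3; pad half a slot on each side
ax.set_xlim(-0.5, len(df.index) - 0.5)
ax.legend()
```

Call plt.show() afterwards to display the figure. The same set_xlim call can be applied to each chart that has to sit side by side with the others.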
Below I have my code to plot my graph.
#can change the 'iloc[x:y]' component to plot sections of chart
#ax = df['Data'].iloc[300:].plot(color = 'black', title = 'Past vs. Expected Future Path')
ax = df.plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
df.loc[df.index >= idx, 'up2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'down2SD'].plot(color = 'r', ax = ax)
df.loc[df.index >= idx, 'Data'].plot(color = 'b', ax = ax)
plt.show()
#resize the plot
plt.rcParams["figure.figsize"] = [10,6]
plt.show()
Lines 2 (commented out) and 3 both work to plot all of the lines together as seen; however, I wish to have the dates on the x-axis and also be able to plot sections of the graph (defined by the x-axis, i.e. date1 to date2).
Using line 3 I can plot with dates on the x-axis, however using ".iloc[300:]" like in line 2 does not appear to work as the 3 coloured lines disconnect from the main line as seen below:
ax = df.iloc[300:].plot('Date','Data',color = 'black', title = 'Past vs. Expected Future Path')
Using line 2, I can edit the x-axis' length, however it doesn't have dates on the x-axis.
Does anyone have any advice on how to both have dates and be able to edit the x-axis periods?
For this to work as desired, you need to set the 'date' column as index of the dataframe. Otherwise, df.plot has no way to know what needs to be used as x-axis. With the date set as index, pandas accepts expressions such as df.loc[df.index >= '20180101', 'data2'] to select a time range and a specific column.
Here is some example code to demonstrate the concept.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
dates = pd.date_range('20160101', '20191231', freq='D')
data1 = np.random.normal(-0.5, 0.2, len(dates))
data2 = np.random.normal(-0.7, 0.2, len(dates))
df = pd.DataFrame({'date': dates, 'data1':data1, 'data2':data2})
df.set_index('date', inplace=True)
df['data1'].iloc[300:].plot(color='crimson')
df.loc[df.index >= '20180101', 'data2'].plot(color='dodgerblue')
plt.tight_layout()
plt.show()
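Since the question asks for a section from date1 to date2, here is a small sketch of slicing a date-indexed frame between two dates. The two dates and the seeded random data are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2016-01-01', '2019-12-31', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame({'data1': rng.normal(-0.5, 0.2, len(dates))}, index=dates)

# Label-based slicing on a DatetimeIndex includes both endpoints
window = df.loc['2018-03-01':'2018-06-30', 'data1']
ax = window.plot(color='crimson')
```

Because the slice keeps its dates as the index, the x-axis still shows dates, and further slices plotted with ax=ax line up with it.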
I am trying to alter the x-ticks on the plot below. When I run the code below I'm getting an error:
ValueError: unit abbreviation w/o a number
I can't seem to find anything on this except it's related to pd.to_timedelta. However, I can't find any solutions on this.
I've upgraded all relevant packages, including matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
d = ({
'A' : ['08:00:00','08:10:00','08:12:00','08:26:00','08:29:00','08:31:00','10:10:00','10:25:00','10:29:00','10:31:00'],
'B' : ['1','1','1','2','2','2','7','7','7','7'],
'C' : ['X','Y','Z','X','Y','Z','A','X','Y','Z'],
})
df = pd.DataFrame(data=d)
fig,ax = plt.subplots()
x = df['A']
y = df['B']
x_numbers = (pd.to_timedelta(df['A']).dt.total_seconds())
plt.scatter(x_numbers, y)
xaxis = ax.get_xaxis()
ax.set_xticklabels([str(pd.Timedelta(i.get_text()+' seconds')).split()[2] for i in xaxis.get_majorticklabels()], rotation=45)
plt.show()
Any suggestions? Has anyone come across this?
Based on this SO question and answer, one solution is to trigger axis tick positioning with a call to fig.canvas.draw() after the scatter, but before getting the labels:
[...]
plt.scatter(x_numbers, y)
# draw canvas to trigger tick positioning
fig.canvas.draw()
xaxis = ax.get_xaxis()
[...]
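To illustrate why the draw call matters: if a tick label's text is still an empty string when it is read, the question's code ends up calling pd.Timedelta(' seconds'), which has a unit but no number. The sketch below reproduces that failure and shows one way to guard against it. label_to_clock is a hypothetical helper, and it assumes labels are plain non-negative numbers of seconds.

```python
import pandas as pd


def label_to_clock(text):
    """Turn a tick label such as '30000' (seconds) into 'HH:MM:SS' (hypothetical helper)."""
    if not text.strip():
        # A tick label that has not been rendered yet is an empty string;
        # return an empty label instead of passing ' seconds' to pd.Timedelta
        return ''
    return str(pd.Timedelta(float(text), unit='s')).split()[2]


# Reproduce the failure: a unit without a number cannot be parsed
try:
    pd.Timedelta('' + ' seconds')
    raised = False
except ValueError:
    raised = True
```

With such a guard the list comprehension no longer fails, but the labels stay empty until the ticks have been positioned, so the fig.canvas.draw() call is still needed.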
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = ({
'A' : ['08:00:00','08:10:00','08:12:00','08:26:00','08:29:00','08:31:00','10:10:00','10:25:00','10:29:00','10:31:00'],
'B' : ['1','1','1','2','2','2','7','7','7','7'],
'C' : ['X','Y','Z','X','Y','Z','A','X','Y','Z'],
})
df = pd.DataFrame(data=d)
x = df['A']
y = df['B']
x_numbers = (pd.to_timedelta(df['A']).dt.total_seconds())
fig, axes = plt.subplots(figsize=(10, 4))
axes.scatter(x_numbers, y)
axes.set_xticks(x_numbers)
# Series.get_values() was removed in pandas 1.0; iterate over the Series directly
axes.set_xticklabels([i+' seconds' for i in df['A']], rotation=90)
plt.tight_layout()
output:
I'm trying to smooth a graph line, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are already some useful Stack Overflow questions, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every index entry, you can reindex it onto a denser index (hourly instead of daily), which fills every previously non-existent entry with NaN. Then choose one of the many interpolation methods available (the 'cubic' method used here requires SciPy), interpolate, and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
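The two steps can also be checked in isolation. The sketch below uses a shorter, seeded date range than the question (an arbitrary choice) and passes the hourly frequency as a pd.Timedelta. It shows that reindexing only adds NaN rows and that interpolation fills them while leaving the original daily values unchanged.

```python
import numpy as np
import pandas as pd

daily_index = pd.date_range('2015-05-15', '2015-05-25', freq='D')
rng = np.random.default_rng(0)
daily = pd.DataFrame({'value': rng.normal(0, 1, len(daily_index))},
                     index=daily_index)

hourly_index = pd.date_range(daily_index[0], daily_index[-1],
                             freq=pd.Timedelta(hours=1))
upsampled = daily.reindex(hourly_index)    # new hourly rows hold NaN
smooth = upsampled.interpolate('cubic')    # fills the NaN rows; requires SciPy
```

The interpolated curve passes through every measured point, so this changes how the line looks between samples and does not filter noise.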
There is one workaround: we create two plots, 1) non-smoothed/non-interpolated with date labels, and 2) smoothed without date labels.
Plot 1) using the argument linestyle=" " and convert the dates to be plotted on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis using np.linspace and make_interp_spline respectively.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: build a dense numeric x-axis over the row positions 0 .. len - 1
# (stopping at len - 1 keeps the spline from being evaluated beyond the last data point)
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    np.arange(len(df.index)),
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new[:-5],  # trim the last 5 samples, where the spline can overshoot near the end
    y_new[:-5],
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
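The snippet above relies on names (values, datetimes, chart.color) defined elsewhere, so here is a self-contained sketch applied to a dataframe shaped like the one in the question. The seeded random data and the smoothed column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

index = pd.date_range('2015-05-15', '2015-12-05')
rng = np.random.default_rng(0)
df = pd.DataFrame({'value': rng.normal(0, 1, len(index))}, index=index)

window_size = 21  # number of samples per fit; must not exceed the series length
poly_order = 1    # must be smaller than window_size
df['smoothed'] = savgol_filter(df['value'].to_numpy(), window_size, poly_order)

fig, ax = plt.subplots(figsize=(18, 5))
# The filter keeps one output per input, so the date index is reused as-is
ax.plot(df.index, df['value'], alpha=0.5, label='raw')
ax.plot(df.index, df['smoothed'], label='Savitzky-Golay')
ax.legend()
```

Call plt.show() afterwards to display the figure. Since the output has the same length as the input, no special handling of the date axis is needed.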
Here is an example.
I have many data frames that I am plotting for a presentation. These all have different columns, but all contain the same additional column foobar. At the moment, I am plotting these different data frames using
df.plot(secondary_y='foobar')
Unfortunately, since these data frames all have different additional columns in different orders, the color of foobar is always different. This makes the presentation slides unnecessarily complicated. I would like foobar to be plotted bold and black in all of the plots.
Looking at the docs, the only thing coming close appears to be the parameter colormap - I would need to ensure that the xth color in the color map is always black, where x is the order of foobar in the data frame. Seems to be more complicated than it should be, also this wouldn't make it bold.
Is there a (better) approach?
I would suggest using matplotlib directly rather than the dataframe plotting methods. If df.plot returned the artists it added instead of an Axes object it wouldn't be too bad to change the color of the line after it was plotted.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
def pandas_plot(ax, df, callout_key):
    """
    Parameters
    ----------
    ax : mpl.Axes
        The axes to draw to
    df : DataFrame
        Data to plot
    callout_key : str
        key to highlight
    """
    artists = {}
    x = df.index.values
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
    for k, v in df.items():
        style_kwargs = {}
        if k == callout_key:
            style_kwargs['c'] = 'k'
            style_kwargs['lw'] = 2
        # label each line so that ax.legend() has entries to show
        ln, = ax.plot(x, v.values, label=k, **style_kwargs)
        artists[k] = ln
    ax.legend()
    ax.set_xlim(np.min(x), np.max(x))
    return artists
Usage:
fig, ax = plt.subplots()
ax2 = ax.twinx()
th = np.linspace(0, 2*np.pi, 1024)
df = pd.DataFrame({'cos': np.cos(th), 'sin': np.sin(th),
'foo': np.sin(th + 1), 'bar': np.cos(th +1)}, index=th)
df2 = pd.DataFrame({'cos': -np.cos(th), 'sin': -np.sin(th)}, index=th)
pandas_plot(ax, df, 'sin')
pandas_plot(ax2, df2, 'sin')
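If staying with df.plot is preferred, the line can also be restyled afterwards by looking it up on the returned Axes. This is a sketch that assumes pandas labels each line with its column name and that no secondary_y axis is used; the data is a shortened variant of the example above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

th = np.linspace(0, 2 * np.pi, 256)
df = pd.DataFrame({'cos': np.cos(th), 'foobar': np.sin(th),
                   'foo': np.sin(th + 1)}, index=th)

ax = df.plot()
for line in ax.get_lines():
    # Find the line drawn for the 'foobar' column by its label
    if line.get_label() == 'foobar':
        line.set_color('black')
        line.set_linewidth(2)
ax.legend()  # rebuild the legend so it picks up the new style
```

Call plt.show() afterwards to display the figure. The lookup by label is what keeps this independent of the column order.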
Perhaps you could define a function which handles the special column in a separate plot call:
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax, **emphargs)
Using code from tcaswell's example,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax, **emphargs)
fig, ax = plt.subplots()
th = np.linspace(0, 2*np.pi, 1024)
df = pd.DataFrame({'cos': np.cos(th), 'foobar': np.sin(th),
'foo': np.sin(th + 1), 'bar': np.cos(th +1)}, index=th)
df2 = pd.DataFrame({'cos': -np.cos(th), 'foobar': -np.sin(th)}, index=th)
emphasize_plot(ax, df, 'foobar', lw=2, c='k')
emphasize_plot(ax, df2, 'foobar', lw=2, c='k')
plt.show()
yields
I used @unutbu's answer and extended it to allow for a secondary y-axis and correct legends:
def emphasize_plot(ax, df, col, **emphargs):
    columns = [c for c in df.columns if c != col]
    ax2 = ax.twinx()
    df[columns].plot(ax=ax)
    df[col].plot(ax=ax2, **emphargs)
    lines, labels = ax.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(lines + lines2, labels + labels2, loc=0)