Plotting pandas Series line becomes curved - python

The problem is to plot a straight line with an uneven distribution of dates. Using the series values instead fixes the curviness, but loses the timeline (the dates). Is there a way to fix this?
Edit: Why aren't the dates mapped directly to ticks on the x axis:
0 -> 2017-02-17,
1 -> 2017-02-20,
... ?
Now there seem to be 12 ticks for the orange line but only 8 data points.
import pandas as pd
import matplotlib.pyplot as plt

def straight_line(index):
    y = [3 + 2*x for x in range(len(index))]
    zserie = pd.Series(y, index=index)
    return zserie

if __name__ == '__main__':
    start = '2017-02-10'
    end = '2017-02-17'
    index = pd.date_range(start, end)
    index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                               '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28'],
                              dtype='datetime64[ns]', name='pvm', freq=None)

    plt.figure(1, figsize=(8, 4))
    zs = straight_line(index)
    zs.plot()
    zs = straight_line(index1)
    zs.plot()

    plt.figure(2, figsize=(8, 4))
    zs = straight_line(index1)
    plt.plot(zs.values)

The graph is treating the dates correctly, as a continuous variable. The days of index1 should be plotted at x coordinates of 17, 20, 21, 22, 23, 24, 27, and 28, so the graph with the orange line is correct.
The problem is with the way you calculate the y-values in the straight_line() function. You are treating the dates as if they were just categorical values and ignoring the gaps between them. A linear regression calculation won't do this; it will treat the dates as continuous values.
To get a straight line in your example code you should convert the values in index1 from absolute dates to relative differences with td = (index - index[0]) (which returns a pandas TimedeltaIndex), and then use the days from td as the x-values of your calculation. I've shown how you can do this in the reg_line() function below:
import pandas as pd
import matplotlib.pyplot as plt

def reg_line(index):
    td = (index - index[0]).days  # array containing the number of days since the first day
    y = 3 + 2*td
    zserie = pd.Series(y, index=index)
    return zserie

if __name__ == '__main__':
    start = '2017-02-10'
    end = '2017-02-17'
    index = pd.date_range(start, end)
    index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                               '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28'],
                              dtype='datetime64[ns]', name='pvm', freq=None)

    plt.figure(1, figsize=(8, 4))
    zs = reg_line(index)
    zs.plot(style=['o-'])
    zs = reg_line(index1)
    zs.plot(style=['o-'])
Which produces the following figure:
NOTE: I've added points to the graph to make it clear which values are being drawn on the figure. As you can see, the orange line is straight even though there are no values for some of the days within the range.
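If the underlying goal is to fit a regression to data on an unevenly spaced date index, the same day-offset trick can be used for the fit itself. Here is a minimal sketch (not part of the answer above; np.polyfit is just one option):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                           '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28'])
y = pd.Series([3 + 2*d for d in (index1 - index1[0]).days], index=index1)

# Fit on the day offsets, not on the positional index 0, 1, 2, ...
x = (index1 - index1[0]).days
slope, intercept = np.polyfit(x, y.values, 1)

plt.plot(index1, y.values, 'o', label='data')
plt.plot(index1, intercept + slope * x, '-', label='fit')
plt.legend()
plt.show()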


Overlay Graphs at same point

I want to overlay some graphs out of CSV data (two datasets).
The graph I got from my dataset is shown below.
Is there any way to plot these datasets over specific points? I would like to overlay the plots using the anchor of the "big drop" to compare them in a better way.
The code used:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
data1 = pd.read_csv('data1.csv', delimiter=";", decimal=",")
data2 = pd.read_csv('data2.csv', delimiter=";", decimal=",")
data3 = pd.read_csv('data3.csv', delimiter=";", decimal=",")
data4 = pd.read_csv('data4.csv', delimiter=";", decimal=",")
# Plot the data
plt.plot(data1['Zeit'], data1['Kanal A'])
plt.plot(data2['Zeit'], data2['Kanal A'])
plt.plot(data3['Zeit'], data3['Kanal A'])
plt.plot(data4['Zeit'], data4['Kanal A'])
plt.show()
plt.close()
I would like to share some data here:
Link to data
Part 1: Anchor times
A simple way is to find the times of interest (lowest point) in each frame, then plot each series with x=t - t_peak instead of x=t. Two ways come to mind to find the desired anchor points:
Simply using the global minimum (in your plots, that would work fine), or
Using the most prominent local minimum, either from first principles, or using scipy's find_peaks().
But first of all, let us attempt to build a reproducible example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def make_sample(t_peak, tmax_approx=17.5, n=100):
    # uneven times
    t = np.random.uniform(0, 2*tmax_approx/n, n).cumsum()
    y = -1 / (0.1 + 2 * np.abs(t - t_peak))
    trend = 4 * np.random.uniform(-1, 1) / n
    level = np.random.uniform(10, 12)
    y += np.random.normal(trend, 1/n, n).cumsum() + level
    return pd.DataFrame({'t': t, 'y': y})

poi = [2, 2.48, 2.6, 2.1]
np.random.seed(0)
frames = [make_sample(t_peak) for t_peak in poi]

plt.rcParams['figure.figsize'] = (6, 2)
fig, ax = plt.subplots()
for df in frames:
    ax.plot(*df.values.T)
In this case, we made the problem maximally inconvenient by giving each time series its own, independent, unevenly distributed time sampling.
Now, retrieving the "maximum drop" by global minimum:
peaks = [df.loc[df['y'].idxmin(), 't'] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
But imagine a case where the global minimum is not suitable. For example, add a large sine wave to each series:
frames = [df.assign(y=df['y'] + 5 * np.sin(df['t'])) for df in frames]
# just plotting the first series
df = frames[0]
plt.plot(*df.values.T)
Clearly, there are several local minima, and the one we want ("sharpest drop") is not the global one.
A simple way to find the desired sharpest drop time is by looking at the difference from each point to its two neighbors:
def arg_steepest_min(v):
    # simply find the minimum that is most separated from the surrounding points
    diff = np.diff(v)
    i = np.argmin(diff[:-1] - diff[1:]) + 1
    return i

peaks = [df['t'].iloc[arg_steepest_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
# just plotting the first curve and the peak found
df = frames[0]
plt.plot(*df.values.T)
plt.plot(*df.iloc[arg_steepest_min(df['y'])].T, 'x')
There are more complex cases where you want to bring the full power of find_peaks(). Here is an example that uses the most prominent minimum, using a certain number of samples for neighborhood:
from scipy.signal import find_peaks, peak_prominences
def arg_most_prominent_min(v, prominence=1, wlen=10):
    peaks, details = find_peaks(-v, prominence=prominence, wlen=wlen)
    i = peaks[np.argmax(details['prominences'])]
    return i

peaks = [df['t'].iloc[arg_most_prominent_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
In this case, the peaks found by both methods are the same. Aligning the curves gives:
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
Part 2: aligning the time series for numeric operations
Having found the anchor times and plotted the time series by shifting the x-axis accordingly, suppose now that we want to align all the time series, for example to somehow compare them to one another (e.g. differences, correlation, etc.). In the example we made up, the time samples are not equidistant and every series has its own sampling.
We can use resample() to achieve our goal. Let us convert the frames into actual time series, turning the column t (assumed to be in seconds) into a DatetimeIndex, after shifting the time by the previously found t_peak and using an arbitrary "0" date:
frames = [
    pd.Series(
        df['y'].values,
        index=pd.Timestamp(0) + (df['t'] - t_peak) * pd.Timedelta(1, 's')
    ) for t_peak, df in zip(peaks, frames)]
>>> frames[0]
t
1969-12-31 23:59:58.171107267 11.244308
1969-12-31 23:59:58.421423545 12.387291
1969-12-31 23:59:58.632390727 13.268186
1969-12-31 23:59:58.823099841 13.942224
1969-12-31 23:59:58.971379021 14.359900
...
1970-01-01 00:00:14.022717327 10.422229
1970-01-01 00:00:14.227996854 9.504693
1970-01-01 00:00:14.235034496 9.489011
1970-01-01 00:00:14.525163506 8.388377
1970-01-01 00:00:14.526806922 8.383366
Length: 100, dtype: float64
At this point, the sampling is still uneven, so we use resample to get a fixed frequency. One strategy is to oversample and interpolate:
frames = [df.resample('100ms').mean().interpolate() for df in frames]

for df in frames:
    df.plot()
At this point, we can compare the Series. Here are the pairwise differences and correlations:
fig, axes = plt.subplots(nrows=len(frames), ncols=len(frames), figsize=(10, 5))
for axrow, a in zip(axes, frames):
    for ax, b in zip(axrow, frames):
        (b - a).plot(ax=ax)
        ax.set_title(fr'$\rho = {b.corr(a):.3f}$')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
plt.tight_layout()
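Once the series share a fixed-frequency index, they can also be combined into a single DataFrame for further numeric work. A small sketch (not part of the original answer), assuming frames is the list of resampled, interpolated Series from above:
import pandas as pd

aligned = pd.concat(frames, axis=1, keys=[f'series_{i}' for i in range(len(frames))])

# Example: mean absolute difference of every series from the first one
# (NaN values outside a series' own time range are simply skipped).
mad = aligned.sub(aligned['series_0'], axis=0).abs().mean()
print(mad)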

Why does plt.hlines() show a longer x axis than the number of dates in the data? - matplotlib [duplicate]

This question already has answers here:
Plot a horizontal line on a given plot
I'm trying to replicate a plot example but ran into an issue with the x axis and the date range. When plt.hlines() is included, the range goes back to 1970. When it is removed, the date range is correct. What could be causing the issue?
import yfinance as yf
import matplotlib.pyplot as plt
AAPL = yf.download('AAPL', start = '2020-4-5', end = '2021-6-5',)
data = AAPL['Close']
mean = AAPL['Close'].mean()
std = AAPL['Close'].std()
min_value = min(data)
max_value = max(data)
plt.title("AAPL")
plt.ylim(min_value -20, max_value + 20)
plt.scatter(x=AAPL.index, y=AAPL['Close'])
plt.hlines(y=mean, xmin=0, xmax=len(data)) # If this line is Removed, the X axis works with Date Range.
plt.show()
The issue is with:
plt.hlines(y=mean, xmin=0, xmax=len(data)) # If this line is Removed, the X axis works with Date Range.
Your data has dates between start = '2020-4-5' and end = '2021-6-5'.
But when you use hlines (horizontal lines), the xmin and xmax arguments are interpreted in data coordinates, not in the way you were assuming.
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hlines.html
xmin, xmax refer to the respective beginning and end of each line. If scalars are provided, all lines will have the same length.
When you set xmax=len(data), you are asking for the line to span from 0 to len(data) in x-axis data units. Matplotlib stores dates as floating-point day counts (days since an epoch, 1970-01-01 by default in recent versions), so x=0 corresponds to 1970, which is why the axis stretches back that far.
When you remove that plt.hlines call, matplotlib determines the x-axis range automatically from the scatter data, which is why it works.
Perhaps what you are looking for is to specify the date-range, e.g.
plt.xlim([datetime.date(2020, 4, 5), datetime.date(2021, 6, 5)])
Full example:
import datetime
import yfinance as yf
import matplotlib.pyplot as plt
AAPL = yf.download('AAPL', start = '2020-4-5', end = '2021-6-5',)
data = AAPL['Close']
mean = AAPL['Close'].mean()
std = AAPL['Close'].std()
min_value = min(data)
max_value = max(data)
plt.title("AAPL")
plt.ylim(min_value -20, max_value + 20)
plt.xlim([datetime.date(2020, 4, 5), datetime.date(2021, 6, 5)])
plt.scatter(x=AAPL.index, y=AAPL['Close'])
plt.show()
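Alternatively, if all you need is a horizontal mean line that spans whatever x range the data produces, axhline works in axes-relative coordinates and sidesteps the unit problem entirely. A minimal sketch reusing AAPL and mean from above:
import matplotlib.pyplot as plt

plt.title("AAPL")
plt.scatter(x=AAPL.index, y=AAPL['Close'])
plt.axhline(y=mean, color='red', linestyle='--')  # spans the full axis width regardless of date units
plt.show()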

Python Matplotlib - Smooth plot line for x-axis with date values

I'm trying to smooth a graph line out, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now to smooth this line there are some useful Stack Overflow questions already, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get some code working to do this for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every daily index, you can reindex it onto a denser index (e.g. hourly), which fills the previously non-existent timestamps with NaN values. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
There is one workaround: we will create two plots - 1) non-smoothed/non-interpolated with date labels, 2) smoothed without date labels.
Plot 1) using the argument linestyle=" " and convert the dates to be plotted on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis using np.linspace and make_interp_spline respectively.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround by creating linespace for length of your x axis
x_new = np.linspace(0, len(df.index), 300)
a_BSpline = make_interp_spline(
[i for i in range(0, len(df.index))],
df.value,
k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
x_new[:-5], # removing last 5 entries to remove noise, because interpolation outputs large values at the end.
y_new[:-5],
"-",
label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
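Another option, not shown in the answer above, is to build the spline on matplotlib's date numbers instead of positional indices; the smoothed curve can then be plotted directly against real dates and the x-axis keeps its date labels. A sketch, assuming df from the question (daily DatetimeIndex, column 'value'):
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline

x = mdates.date2num(df.index.to_pydatetime())  # dates as floats, strictly increasing
spline = make_interp_spline(x, df['value'], k=3)

x_new = np.linspace(x.min(), x.max(), 300)
fig, ax = plt.subplots(figsize=(18, 5))
ax.plot(mdates.num2date(x_new), spline(x_new), label='smoothed')
ax.plot(df.index, df['value'], alpha=0.5, label='original')
ax.legend()
fig.show()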
Depending on what exactly you mean by "smoothing", the easiest way can be to use savgol_filter or something similar. Unlike an interpolated spline, the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
# `values` and `datetimes` are the y- and x-data from the surrounding plotting code;
# `axes` is an existing matplotlib Axes, and `chart.color` is just a colour choice.
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
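For completeness, here is how that filter might be applied to the DataFrame from the question; the window length and polynomial order are arbitrary choices you would tune to your data:
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Assumes `df` from the question: daily DatetimeIndex, column 'value'.
smoothed = savgol_filter(df['value'], window_length=21, polyorder=3)

fig, ax = plt.subplots(figsize=(18, 5))
ax.plot(df.index, df['value'], alpha=0.4, label='raw')
ax.plot(df.index, smoothed, label='savgol smoothed')
ax.legend()
fig.show()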

Matplotlib and Numpy - Create a calendar heatmap

Is it possible to create a calendar heatmap without using pandas?
If so, can someone post a simple example?
I have dates like Aug-16 and a count value like 16 and I thought this would be a quick and easy way to show intensity of counts between days for a long period of time.
Thank you
It's certainly possible, but you'll need to jump through a few hoops.
First off, I'm going to assume you mean a calendar display that looks like a calendar, as opposed to a more linear format (a linear formatted "heatmap" is much easier than this).
The key is reshaping your arbitrary-length 1D series into an Nx7 2D array where each row is a week and columns are days. That's easy enough, but you also need to properly label months and days, which can get a touch verbose.
Here's an example. It doesn't even remotely try to handle crossing year boundaries (e.g. Dec 2014 to Jan 2015, etc.). However, hopefully it gets you started:
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np

def main():
    dates, data = generate_data()
    fig, ax = plt.subplots(figsize=(6, 10))
    calendar_heatmap(ax, dates, data)
    plt.show()

def generate_data():
    num = 100
    data = np.random.randint(0, 20, num)
    start = dt.datetime(2015, 3, 13)
    dates = [start + dt.timedelta(days=i) for i in range(num)]
    return dates, data

def calendar_array(dates, data):
    i, j = zip(*[d.isocalendar()[1:] for d in dates])
    i = np.array(i) - min(i)
    j = np.array(j) - 1
    ni = max(i) + 1
    calendar = np.nan * np.zeros((ni, 7))
    calendar[i, j] = data
    return i, j, calendar

def calendar_heatmap(ax, dates, data):
    i, j, calendar = calendar_array(dates, data)
    im = ax.imshow(calendar, interpolation='none', cmap='summer')
    label_days(ax, dates, i, j, calendar)
    label_months(ax, dates, i, j, calendar)
    ax.figure.colorbar(im)

def label_days(ax, dates, i, j, calendar):
    ni, nj = calendar.shape
    day_of_month = np.nan * np.zeros((ni, 7))
    day_of_month[i, j] = [d.day for d in dates]
    for (i, j), day in np.ndenumerate(day_of_month):
        if np.isfinite(day):
            ax.text(j, i, int(day), ha='center', va='center')
    ax.set(xticks=np.arange(7),
           xticklabels=['M', 'T', 'W', 'R', 'F', 'S', 'S'])
    ax.xaxis.tick_top()

def label_months(ax, dates, i, j, calendar):
    month_labels = np.array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
                             'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    months = np.array([d.month for d in dates])
    uniq_months = sorted(set(months))
    yticks = [i[months == m].mean() for m in uniq_months]
    labels = [month_labels[m - 1] for m in uniq_months]
    ax.set(yticks=yticks)
    ax.set_yticklabels(labels, rotation=90)

main()
Edit: I now see the question asks for a plot without pandas. Even so, this question is a first page Google result for "python calendar heatmap", so I will leave this here. I recommend using pandas anyway. You probably already have it as a dependency of another package, and pandas has by far the best APIs for working with datetime data (pandas.Timestamp and pandas.DatetimeIndex).
The only Python package that I can find for these plots is calmap which is unmaintained and incompatible with recent matplotlib. So I decided to write my own. It produces plots like the following:
Here is the code. The input is a series with a datetime index giving the values for the heatmap:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

DAYS = ['Sun.', 'Mon.', 'Tues.', 'Wed.', 'Thurs.', 'Fri.', 'Sat.']
MONTHS = ['Jan.', 'Feb.', 'Mar.', 'Apr.', 'May', 'June', 'July', 'Aug.', 'Sept.', 'Oct.', 'Nov.', 'Dec.']

def date_heatmap(series, start=None, end=None, mean=False, ax=None, **kwargs):
    '''Plot a calendar heatmap given a datetime series.

    Arguments:
        series (pd.Series):
            A series of numeric values with a datetime index. Values occurring
            on the same day are combined by sum.
        start (Any):
            The first day to be considered in the plot. The value can be
            anything accepted by :func:`pandas.to_datetime`. The default is the
            earliest date in the data.
        end (Any):
            The last day to be considered in the plot. The value can be
            anything accepted by :func:`pandas.to_datetime`. The default is the
            latest date in the data.
        mean (bool):
            Combine values occurring on the same day by mean instead of sum.
        ax (matplotlib.Axes or None):
            The axes on which to draw the heatmap. The default is the current
            axes in the :module:`~matplotlib.pyplot` API.
        **kwargs:
            Forwarded to :meth:`~matplotlib.Axes.pcolormesh` for drawing the
            heatmap.

    Returns:
        matplotlib.axes.Axes:
            The axes on which the heatmap was drawn. This is set as the current
            axes in the `~matplotlib.pyplot` API.
    '''
    # Combine values occurring on the same day.
    dates = series.index.floor('D')
    group = series.groupby(dates)
    series = group.mean() if mean else group.sum()

    # Parse start/end, defaulting to the min/max of the index.
    start = pd.to_datetime(start or series.index.min())
    end = pd.to_datetime(end or series.index.max())

    # We use [start, end) as a half-open interval below.
    end += np.timedelta64(1, 'D')

    # Get the previous/following Sunday to start/end.
    # Pandas and numpy day-of-week conventions are Monday=0 and Sunday=6.
    start_sun = start - np.timedelta64((start.dayofweek + 1) % 7, 'D')
    end_sun = end + np.timedelta64(7 - end.dayofweek - 1, 'D')

    # Create the heatmap and track ticks.
    num_weeks = (end_sun - start_sun).days // 7
    heatmap = np.zeros((7, num_weeks))
    ticks = {}  # week number -> month name
    for week in range(num_weeks):
        for day in range(7):
            date = start_sun + np.timedelta64(7 * week + day, 'D')
            if date.day == 1:
                ticks[week] = MONTHS[date.month - 1]
            if date.dayofyear == 1:
                ticks[week] += f'\n{date.year}'
            if start <= date < end:
                heatmap[day, week] = series.get(date, 0)

    # Get the coordinates, offset by 0.5 to align the ticks.
    y = np.arange(8) - 0.5
    x = np.arange(num_weeks + 1) - 0.5

    # Plot the heatmap. Prefer pcolormesh over imshow so that the figure can be
    # vectorized when saved to a compatible format. We must invert the axis for
    # pcolormesh, but not for imshow, so that it reads top-bottom, left-right.
    ax = ax or plt.gca()
    mesh = ax.pcolormesh(x, y, heatmap, **kwargs)
    ax.invert_yaxis()

    # Set the ticks.
    ax.set_xticks(list(ticks.keys()))
    ax.set_xticklabels(list(ticks.values()))
    ax.set_yticks(np.arange(7))
    ax.set_yticklabels(DAYS)

    # Set the current image and axes in the pyplot API.
    plt.sca(ax)
    plt.sci(mesh)
    return ax

def date_heatmap_demo():
    '''An example for `date_heatmap`.

    Most of the sizes here are chosen arbitrarily to look nice with 1yr of
    data. You may need to fiddle with the numbers to look right on other data.
    '''
    # Get some data, a series of values with datetime index.
    data = np.random.randint(5, size=365)
    data = pd.Series(data)
    data.index = pd.date_range(start='2017-01-01', end='2017-12-31', freq='1D')

    # Create the figure. For the aspect ratio, one year is 7 days by 53 weeks.
    # We widen it further to account for the tick labels and color bar.
    figsize = plt.figaspect(7 / 56)
    fig = plt.figure(figsize=figsize)

    # Plot the heatmap with a color bar.
    ax = date_heatmap(data, edgecolor='black')
    plt.colorbar(ticks=range(5), pad=0.02)

    # Use a discrete color map with 5 colors (the data ranges from 0 to 4).
    # Extending the color limits by 0.5 aligns the ticks in the color bar.
    cmap = mpl.cm.get_cmap('Blues', 5)
    plt.set_cmap(cmap)
    plt.clim(-0.5, 4.5)

    # Force the cells to be square. If this is set, the size of the color bar
    # may look weird compared to the size of the heatmap. That can be corrected
    # by the aspect ratio of the figure or scale of the color bar.
    ax.set_aspect('equal')

    # Save to a file. For embedding in a LaTeX doc, consider the PDF backend.
    # http://sbillaudelle.de/2015/02/23/seamlessly-embedding-matplotlib-output-into-latex.html
    fig.savefig('heatmap.pdf', bbox_inches='tight')

    # The figure must be explicitly closed if it was not shown.
    plt.close(fig)
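A minimal usage sketch for date_heatmap, continuing from the imports above (the demo writes a PDF; this just shows the plot interactively, with made-up daily data):
# Hypothetical daily data for one year.
index = pd.date_range('2017-01-01', '2017-12-31', freq='D')
series = pd.Series(np.random.randint(0, 10, len(index)), index=index)

ax = date_heatmap(series, edgecolor='black')
ax.set_aspect('equal')
plt.colorbar(pad=0.02)
plt.show()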
Disclaimer: This is a plug for my own package. Though I am a couple of years late to help the OP, I hope that someone else will find it useful.
I did some digging around on a related issue. I ended up writing a new package exactly for this purpose when I couldn't find any other package that met all my requirements.
The package is still unpolished and its documentation is still sparse, but I published it on PyPI anyway to make it available for others. Any feedback is appreciated, either here or on my GitHub.
july
The package is called july and can be installed with pip:
$ pip install july
Here are some use cases straight from the README:
Import packages and generate data
import numpy as np
import july
from july.utils import date_range
dates = date_range("2020-01-01", "2020-12-31")
data = np.random.randint(0, 14, len(dates))
GitHub Activity like plot:
july.heatmap(dates, data, title='Github Activity', cmap="github")
Daily heatmap for continuous data (with colourbar):
july.heatmap(
osl_df.date, # Here, osl_df is a pandas data frame.
osl_df.temp,
cmap="golden",
colorbar=True,
title="Average temperatures: Oslo , Norway"
)
Outline each month with month_grid=True
july.heatmap(dates=dates,
data=data,
cmap="Pastel1",
month_grid=True,
horizontal=True,
value_label=False,
date_label=False,
weekday_label=True,
month_label=True,
year_label=True,
colorbar=False,
fontfamily="monospace",
fontsize=12,
title=None,
titlesize="large",
dpi=100)
Finally, you can also create month or calendar plots:
# july.month_plot(dates, data, month=5) # This will plot only May.
july.calendar_plot(dates, data)
Similar packages:
calplot by Tom Kwok.
GitHub: Link
Install: pip install calplot
Actively maintained and has better documentation than july.
Pandas centric, takes in a pandas series with dates and values.
Very good option if you are only looking for the heatmap functionality and don't need month_plot or calendar_plot (see the sketch after this list).
calmap by Martijn Vermaat.
GitHub: Link
Install: pip install calmap
The package that calplot sprung out from.
Seems to no longer be actively maintained.
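For reference, a minimal calplot sketch based on its README (I have not verified it against the latest release), shown here for comparison with july:
import calplot
import numpy as np
import pandas as pd

# Hypothetical daily series with a DatetimeIndex.
days = pd.date_range('2020-01-01', '2020-12-31', freq='D')
events = pd.Series(np.random.randn(len(days)), index=days)

calplot.calplot(events, cmap='YlGn', colorbar=True)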
I was looking to create a calendar heatmap where each month is displayed separately. I also needed to annotate each day with the day number (day_of_month) and its value label.
I've been inspired by the answers posted here and also the following sites:
Here, although in R
Heatmap using pcolormesh
However I didn't find anything exactly like what I was looking for, so I've decided to post my solution here to perhaps save others wanting the same kind of plot some time.
My example uses a bit of Pandas simply to generate some dummy data, so you can easily plug your own data source instead. Other than that it's just matplotlib.
Output from the code is given below. For my needs I also wanted to highlight days where the data was 0 (see 1st January).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

# Settings
years = [2018]  # [2018, 2019, 2020]
weeks = [1, 2, 3, 4, 5, 6]
days = ['M', 'T', 'W', 'T', 'F', 'S', 'S']
month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
               'September', 'October', 'November', 'December']

def generate_data():
    idx = pd.date_range('2018-01-01', periods=365, freq='D')
    return pd.Series(range(len(idx)), index=idx)

def split_months(df, year):
    """
    Take a df, slice by year, and produce a list of months,
    where each month is a 2D array in the shape of the calendar
    :param df: dataframe or series
    :return: matrix for daily values and numerals
    """
    df = df[df.index.year == year]

    # Empty matrices
    a = np.empty((6, 7))
    a[:] = np.nan
    day_nums = {m: np.copy(a) for m in range(1, 13)}  # matrix for day numbers
    day_vals = {m: np.copy(a) for m in range(1, 13)}  # matrix for day values

    # Logic to shape datetimes to matrices in calendar layout
    for d in df.items():  # items() for a Series; use iterrows() if you have a DataFrame
        day = d[0].day
        month = d[0].month
        col = d[0].dayofweek

        if d[0].is_month_start:
            row = 0

        day_nums[month][row, col] = day   # day number (0-31)
        day_vals[month][row, col] = d[1]  # day value (the heatmap data)

        if col == 6:
            row += 1

    return day_nums, day_vals
def create_year_calendar(day_nums, day_vals):
    fig, ax = plt.subplots(3, 4, figsize=(14.85, 10.5))

    for i, axs in enumerate(ax.flat):

        axs.imshow(day_vals[i+1], cmap='viridis', vmin=1, vmax=365)  # heatmap
        axs.set_title(month_names[i])

        # Labels
        axs.set_xticks(np.arange(len(days)))
        axs.set_xticklabels(days, fontsize=10, fontweight='bold', color='#555555')
        axs.set_yticklabels([])

        # Tick marks
        axs.tick_params(axis=u'both', which=u'both', length=0)  # remove tick marks
        axs.xaxis.tick_top()

        # Modify tick locations for proper grid placement
        axs.set_xticks(np.arange(-.5, 6, 1), minor=True)
        axs.set_yticks(np.arange(-.5, 5, 1), minor=True)
        axs.grid(which='minor', color='w', linestyle='-', linewidth=2.1)

        # Despine
        for edge in ['left', 'right', 'bottom', 'top']:
            axs.spines[edge].set_color('#FFFFFF')

        # Annotate
        for w in range(len(weeks)):
            for d in range(len(days)):
                day_val = day_vals[i+1][w, d]
                day_num = day_nums[i+1][w, d]

                # Value label
                axs.text(d, w+0.3, f"{day_val:0.0f}",
                         ha="center", va="center",
                         fontsize=7, color="w", alpha=0.8)

                # If value is 0, draw a grey patch
                if day_val == 0:
                    patch_coords = ((d - 0.5, w - 0.5),
                                    (d - 0.5, w + 0.5),
                                    (d + 0.5, w + 0.5),
                                    (d + 0.5, w - 0.5))
                    square = Polygon(patch_coords, fc='#DDDDDD')
                    axs.add_artist(square)

                # If day number is a valid calendar day, add an annotation
                if not np.isnan(day_num):
                    axs.text(d+0.45, w-0.31, f"{day_num:0.0f}",
                             ha="right", va="center",
                             fontsize=6, color="#003333", alpha=0.8)  # day

                    # Aesthetic background for calendar day number
                    patch_coords = ((d-0.1, w-0.5),
                                    (d+0.5, w-0.5),
                                    (d+0.5, w+0.1))
                    triangle = Polygon(patch_coords, fc='w', alpha=0.7)
                    axs.add_artist(triangle)

    # Final adjustments
    fig.suptitle('Calendar', fontsize=16)
    plt.subplots_adjust(left=0.04, right=0.96, top=0.88, bottom=0.04)

    # Save to file
    plt.savefig('calendar_example.pdf')

for year in years:
    df = generate_data()
    day_nums, day_vals = split_months(df, year)
    create_year_calendar(day_nums, day_vals)
There is probably a lot of room for optimisation, but this gets what I need done.
Below is some code that can be used to generate a calendar map for daily profiles of a value.
"""
Created on Tue Sep 4 11:17:25 2018
#author: woldekidank
"""
import numpy as np
from datetime import date
import datetime
import matplotlib.pyplot as plt
import random
D = date(2016,1,1)
Dord = date.toordinal(D)
Dweekday = date.weekday(D)
Dsnday = Dord - Dweekday + 1 #find sunday
square = np.array([[0, 0],[ 0, 1], [1, 1], [1, 0], [0, 0]])#x and y to draw a square
row = 1
count = 0
while row != 0:
    for column in range(1, 7+1):  # one week per row
        prof = np.ones([24, 1])
        hourly = np.zeros([24, 1])
        for i in range(1, 24+1):
            prof[i-1, 0] = prof[i-1, 0] * random.uniform(0, 1)
            hourly[i-1, 0] = i / 24
        plt.title('Temperature Profile')
        plt.plot(square[:, 0] + column - 1, square[:, 1] - row + 1, color='r')  # go right each column, go down each row
        if date.fromordinal(Dsnday).month == D.month:
            if count == 0:
                plt.plot(hourly, prof)
            else:
                plt.plot(hourly + min(square[:, 0] + column - 1), prof + min(square[:, 1] - row + 1))
            plt.text(column - 0.5, 1.8 - row, datetime.datetime.strptime(str(date.fromordinal(Dsnday)), '%Y-%m-%d').strftime('%a'))
            plt.text(column - 0.5, 1.5 - row, date.fromordinal(Dsnday).day)
        Dsnday = Dsnday + 1
        count = count + 1
    if date.fromordinal(Dsnday).month == D.month:
        row = row + 1  # new row
    else:
        row = 0  # stop the while loop
Below is the output from this code

Can Pandas plot a histogram of dates?

I've taken my Series and coerced it to a datetime column of dtype=datetime64[ns] (though I only need day resolution... not sure how to change that).
import pandas as pd
df = pd.read_csv('somefile.csv')
column = df['date']
column = pd.to_datetime(column, errors='coerce')
but plotting doesn't work:
ipdb> column.plot(kind='hist')
*** TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('float64')
I'd like to plot a histogram that just shows the count of dates by week, month, or year.
Surely there is a way to do this in pandas?
Given this df:
date
0 2001-08-10
1 2002-08-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2003-08-14
8 2003-07-29
and, if it's not already the case:
df["date"] = df["date"].astype("datetime64")
To show the count of dates by month:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
.dt allows you to access the datetime properties.
Which will give you:
You can replace month by year, day, etc.
If you want to distinguish year and month for instance, just do:
df.groupby([df["date"].dt.year, df["date"].dt.month]).count().plot(kind="bar")
Which gives:
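The question also asks about counts by week; here is a sketch of the same pattern using the ISO week number (it needs a reasonably recent pandas for .dt.isocalendar(); this is not shown in the original answer):
# Count of dates per ISO week number, reusing df from above.
df.groupby(df["date"].dt.isocalendar().week).count().plot(kind="bar")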
I think resample might be what you are looking for. In your case, do:
df.set_index('date', inplace=True)
# '1M' for 1 month; '1W' for 1 week; check the documentation for offset aliases
df.resample('1M').count()
It only does the counting, not the plot, so you then have to make your own plots.
See this post for more details on the documentation of resample
pandas resample documentation
I have run into similar problems as you did. Hope this helps.
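A minimal sketch of the plotting step, continuing from the resample above (this assumes df still has at least one data column after setting the 'date' index; if 'date' was the only column, use .size() as shown in a later answer):
import matplotlib.pyplot as plt

monthly = df.resample('1M').count()
monthly.plot(kind='bar')
plt.show()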
Rendered example
Example Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Create random datetime object."""

# core modules
from datetime import datetime
import random

# 3rd party modules
import pandas as pd
import matplotlib.pyplot as plt

def visualize(df, column_name='start_date', color='#494949', title=''):
    """
    Visualize a dataframe with a date column.

    Parameters
    ----------
    df : Pandas dataframe
    column_name : str
        Column to visualize
    color : str
    title : str
    """
    plt.figure(figsize=(20, 10))
    ax = (df[column_name].groupby(df[column_name].dt.hour)
                         .count()).plot(kind="bar", color=color)
    ax.set_facecolor('#eeeeee')
    ax.set_xlabel("hour of the day")
    ax.set_ylabel("count")
    ax.set_title(title)
    plt.show()

def create_random_datetime(from_date, to_date, rand_type='uniform'):
    """
    Create random date within timeframe.

    Parameters
    ----------
    from_date : datetime object
    to_date : datetime object
    rand_type : {'uniform'}

    Examples
    --------
    >>> random.seed(28041990)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(1998, 12, 13, 23, 38, 0, 121628)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(2000, 3, 19, 19, 24, 31, 193940)
    """
    delta = to_date - from_date
    if rand_type == 'uniform':
        rand = random.random()
    else:
        raise NotImplementedError('Unknown random mode \'{}\''
                                  .format(rand_type))
    return from_date + rand * delta

def create_df(n=1000):
    """Create a Pandas dataframe with datetime objects."""
    from_date = datetime(1990, 4, 28)
    to_date = datetime(2000, 12, 31)
    sales = [create_random_datetime(from_date, to_date) for _ in range(n)]
    df = pd.DataFrame({'start_date': sales})
    return df

if __name__ == '__main__':
    import doctest
    doctest.testmod()
    df = create_df()
    visualize(df)
Here is a solution for when you just want to have a histogram like you would expect. This doesn't use groupby; it converts the datetime values to integers and changes the labels on the plot. Some improvement could be made to move the tick labels to even locations. Also, with this approach a kernel density estimation plot (and any other plot) is possible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"datetime": pd.to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 100, dtype=np.int64))})
fig, ax = plt.subplots()
df["datetime"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.show()
All of these answers seem overly complex; at least with 'modern' pandas it's two lines.
df.set_index('date', inplace=True)
df.resample('M').size().plot.bar()
If you have a series with a DatetimeIndex then just run the second line
series.resample('M').size().plot.bar() # Just counts the rows/month
or
series.resample('M').sum().plot.bar()  # Sums up the values in the series
I was able to work around this by (1) plotting with matplotlib instead of using the dataframe directly and (2) using the values attribute. See example:
import matplotlib.pyplot as plt
ax = plt.gca()
ax.hist(column.values)
This doesn't work if I don't use values, but I don't know why it works when I do.
I think for solving that problem you can use this code; it converts the date type to int and back:
df['date'] = df['date'].astype(int)
df['date'] = pd.to_datetime(df['date'], unit='s')
To get the date only, you can add this code:
pd.DatetimeIndex(df.date).normalize()
df['date'] = pd.DatetimeIndex(df.date).normalize()
I was just having trouble with this as well. I imagine that since you're working with dates you want to preserve chronological ordering (like I did.)
The workaround then is
import matplotlib.pyplot as plt
counts = df['date'].value_counts(sort=False)
plt.bar(counts.index,counts)
plt.show()
If anyone knows of a better way, please speak up.
EDIT:
For Jean above, here's a sample of the data [I randomly sampled from the full dataset, hence the trivial histogram data].
print dates
type(dates),type(dates[0])
dates.hist()
plt.show()
Output:
0 2001-07-10
1 2002-05-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2002-01-17
Name: Date, dtype: object
<class 'pandas.core.series.Series'> <type 'datetime.date'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-f39e334eece0> in <module>()
2 print dates
3 print type(dates),type(dates[0])
----> 4 dates.hist()
5 plt.show()
/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, bins, **kwds)
2570 values = self.dropna().values
2571
-> 2572 ax.hist(values, bins=bins, **kwds)
2573 ax.grid(grid)
2574 axes = np.array([ax])
/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
5620 for xi in x:
5621 if len(xi) > 0:
-> 5622 xmin = min(xmin, xi.min())
5623 xmax = max(xmax, xi.max())
5624 bin_range = (xmin, xmax)
TypeError: can't compare datetime.date to float
I was stuck a long time trying to plot time series with "bar". It gets really weird when trying to plot two time series with different indexes, like daily and monthly data for instance. Then I re-read the docs, and the matplotlib documentation indeed states explicitly that bar is meant for categorical data.
The plotting function to use is step.
With more recent matplotlib versions, this limitation appears to be lifted.
You can now use Axes.bar to plot time-series.
With default options, bars are centered on the dates given as abscissas, with a width of 0.8 day. The bar position can be shifted with the "align" parameter, and the width can be given as a scalar or as a list with the same length as the list of abscissas.
Just add the following line to have nice date labels whatever the zoom factor:
plt.rcParams['date.converter'] = 'concise'
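A small sketch of that approach with made-up monthly counts, plotted with Axes.bar directly against a DatetimeIndex (on a date axis, the bar width is given in days):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.rcParams['date.converter'] = 'concise'  # nicer date tick labels at any zoom level

# Hypothetical monthly counts with a DatetimeIndex.
idx = pd.date_range('2020-01-01', periods=12, freq='MS')
counts = pd.Series(np.random.randint(10, 50, len(idx)), index=idx)

fig, ax = plt.subplots()
ax.bar(idx, counts.values, width=20)  # width in days
plt.show()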
