Can Pandas plot a histogram of dates? - python

I've taken my Series and coerced it to a datetime column of dtype=datetime64[ns] (though only need day resolution...not sure how to change).
import pandas as pd
df = pd.read_csv('somefile.csv')
column = df['date']
column = pd.to_datetime(column, errors='coerce')
but plotting doesn't work:
ipdb> column.plot(kind='hist')
*** TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('float64')
I'd like to plot a histogram that just shows the count of dates by week, month, or year.
Surely there is a way to do this in pandas?

Given this df:
date
0 2001-08-10
1 2002-08-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2003-08-14
8 2003-07-29
and, if it's not already the case:
df["date"] = df["date"].astype("datetime64[ns]")
To show the count of dates by month:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
.dt gives you access to the datetime properties of the column.
This produces a bar chart with one bar per month.
You can replace month with year, day, etc.
If you want to distinguish year and month for instance, just do:
df.groupby([df["date"].dt.year, df["date"].dt.month]).count().plot(kind="bar")
Which gives a bar chart with one bar per (year, month) pair.
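A possible variation, sketched here as an assumption rather than as part of the original answer: grouping by `dt.to_period("M")` keeps year and month together in a single key, so the result has a flat index. The sample dates are the ones listed above.

```python
import pandas as pd

# Sample dates copied from the answer above.
df = pd.DataFrame({"date": pd.to_datetime([
    "2001-08-10", "2002-08-31", "2003-08-29", "2006-06-21", "2002-03-27",
    "2003-07-14", "2004-06-15", "2003-08-14", "2003-07-29",
])})

# One group per calendar month (year and month combined in one Period).
month_period = df["date"].dt.to_period("M")
counts = df.groupby(month_period).size()
print(counts)
```

Calling `counts.plot(kind="bar")` on the result should draw the same kind of chart as the groupby on two keys.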

I think resample might be what you are looking for. In your case, do:
df.set_index('date', inplace=True)
# use '1M' for 1 month or '1W' for 1 week; see the documentation on offset aliases
df.resample('1M').count()
It is only doing the counting and not the plot, so you then have to make your own plots.
See the pandas resample documentation for more details.
I have run into similar problems as you did. Hope this helps.
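A minimal sketch of the counting step, assuming a small made-up frame with a hypothetical `event` column next to `date`. It uses the weekly alias `"W"`. Unlike a groupby on the dates that occur, resample also reports weeks with no rows as zero counts.

```python
import pandas as pd

# Hypothetical data: four events spread over seven weeks.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2003-07-14", "2003-07-29", "2003-08-14", "2003-08-29"]),
    "event": ["a", "b", "c", "d"],
})

# Weekly bins (each ending on a Sunday); size() counts the rows per bin.
weekly = df.set_index("date").resample("W").size()
print(weekly)
```

The resulting Series can then be passed to any plotting call, for example `weekly.plot(kind="bar")`.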

Rendered example
Example Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Create random datetime object."""
# core modules
from datetime import datetime
import random
# 3rd party modules
import pandas as pd
import matplotlib.pyplot as plt
def visualize(df, column_name='start_date', color='#494949', title=''):
    """
    Visualize a dataframe with a date column.
    Parameters
    ----------
    df : Pandas dataframe
    column_name : str
        Column to visualize
    color : str
    title : str
    """
    plt.figure(figsize=(20, 10))
    ax = (df[column_name].groupby(df[column_name].dt.hour)
                         .count()).plot(kind="bar", color=color)
    ax.set_facecolor('#eeeeee')
    ax.set_xlabel("hour of the day")
    ax.set_ylabel("count")
    ax.set_title(title)
    plt.show()
def create_random_datetime(from_date, to_date, rand_type='uniform'):
    """
    Create random date within timeframe.
    Parameters
    ----------
    from_date : datetime object
    to_date : datetime object
    rand_type : {'uniform'}
    Examples
    --------
    >>> random.seed(28041990)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(1998, 12, 13, 23, 38, 0, 121628)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(2000, 3, 19, 19, 24, 31, 193940)
    """
    delta = to_date - from_date
    if rand_type == 'uniform':
        rand = random.random()
    else:
        raise NotImplementedError('Unknown random mode \'{}\''
                                  .format(rand_type))
    return from_date + rand * delta
def create_df(n=1000):
    """Create a Pandas dataframe with datetime objects."""
    from_date = datetime(1990, 4, 28)
    to_date = datetime(2000, 12, 31)
    sales = [create_random_datetime(from_date, to_date) for _ in range(n)]
    df = pd.DataFrame({'start_date': sales})
    return df
if __name__ == '__main__':
    import doctest
    doctest.testmod()
    df = create_df()
    visualize(df)

Here is a solution for when you just want a histogram like you would expect. It doesn't use groupby, but converts the datetime values to integers and changes the labels on the plot. Some improvement could be made to move the tick labels to even locations. Also, with this approach a kernel density estimation plot (or any other plot) is possible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"datetime": pd.to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 100, dtype=np.int64))})
fig, ax = plt.subplots()
df["datetime"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.show()
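The same integer trick can be sketched without any plotting, which may make it easier to see what happens. This is an illustration under assumptions, not the answer's code: the dates are made up, the column is forced to nanosecond resolution, and `np.histogram` stands in for `plot.hist`. The bin edges come back as floats and are converted to timestamps for labelling.

```python
import numpy as np
import pandas as pd

# Made-up dates, forced to nanosecond resolution so the integers are ns.
dates = pd.Series(pd.to_datetime([
    "2020-02-27", "2020-02-28", "2020-02-28",
    "2020-03-01", "2020-03-03", "2020-03-06",
])).astype("datetime64[ns]")

# Nanoseconds since the epoch as plain integers.
as_int = dates.astype("int64").to_numpy()

# Equal-width bins over the integer range.
counts, edges = np.histogram(as_int, bins=4)

# Turn the float bin edges back into timestamps for readable labels.
edge_labels = pd.to_datetime(edges, unit="ns")
for left, right, n in zip(edge_labels[:-1], edge_labels[1:], counts):
    print(f"{left:%Y-%m-%d %H:%M} .. {right:%Y-%m-%d %H:%M}: {n}")
```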

All of these answers seem overly complex; at least with 'modern' pandas it's two lines.
df.set_index('date', inplace=True)
df.resample('M').size().plot.bar()
If you have a series with a DatetimeIndex then just run the second line
series.resample('M').size().plot.bar() # Just counts the rows/month
or
series.resample('M').sum().plot.bar() # Sums up the values in the series
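To make the difference between the aggregations concrete, here is a small sketch on made-up data. It uses weekly bins (`"W"`) instead of monthly ones purely to keep the example short, and adds `count()` as a third option for comparison. That last one is an addition, not something the answer mentions.

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value.
series = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-12"]),
)

rows_per_week = series.resample("W").size()    # counts rows, NaN included
values_per_week = series.resample("W").count() # counts non-NaN values only
sum_per_week = series.resample("W").sum()      # adds up the values

print(pd.DataFrame({"size": rows_per_week,
                    "count": values_per_week,
                    "sum": sum_per_week}))
```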

I was able to work around this by (1) plotting with matplotlib instead of using the dataframe directly and (2) using the values attribute. See example:
import matplotlib.pyplot as plt
ax = plt.gca()
ax.hist(column.values)
This doesn't work if I don't use values, but I don't know why it does work.

I think for solving that problem, you can use this code, it converts date type to int types:
df['date'] = df['date'].astype(int)
df['date'] = pd.to_datetime(df['date'], unit='s')
for getting date only, you can add this code:
pd.DatetimeIndex(df.date).normalize()
df['date'] = pd.DatetimeIndex(df.date).normalize()

I was just having trouble with this as well. I imagine that since you're working with dates you want to preserve chronological ordering (like I did.)
The workaround then is
import matplotlib.pyplot as plt
counts = df['date'].value_counts(sort=False)
plt.bar(counts.index,counts)
plt.show()
Please, if anyone knows of a better way please speak up.
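One possible refinement, offered as a sketch rather than as the poster's method: `value_counts(sort=False)` does not by itself promise chronological order, whereas sorting the result by its index does. The dates below are made up.

```python
import pandas as pd

# Made-up, deliberately unordered dates with one duplicate.
df = pd.DataFrame({"date": pd.to_datetime(
    ["2003-08-29", "2001-07-10", "2003-08-29", "2002-05-31"])})

# Count occurrences, then order the result by date rather than by count.
counts = df["date"].value_counts().sort_index()
print(counts)
```

The sorted `counts.index` and `counts` can be passed to `plt.bar` exactly as in the workaround above.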
EDIT:
for jean above, here's a sample of the data [I randomly sampled from the full dataset, hence the trivial histogram data.]
print dates
type(dates),type(dates[0])
dates.hist()
plt.show()
Output:
0 2001-07-10
1 2002-05-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2002-01-17
Name: Date, dtype: object
<class 'pandas.core.series.Series'> <type 'datetime.date'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-f39e334eece0> in <module>()
2 print dates
3 print type(dates),type(dates[0])
----> 4 dates.hist()
5 plt.show()
/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, bins, **kwds)
2570 values = self.dropna().values
2571
-> 2572 ax.hist(values, bins=bins, **kwds)
2573 ax.grid(grid)
2574 axes = np.array([ax])
/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
5620 for xi in x:
5621 if len(xi) > 0:
-> 5622 xmin = min(xmin, xi.min())
5623 xmax = max(xmax, xi.max())
5624 bin_range = (xmin, xmax)
TypeError: can't compare datetime.date to float
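The output above shows a Series of dtype object holding `datetime.date` values, which is what the final TypeError complains about. A hedged sketch of one way around it, using made-up values: convert the column with `pd.to_datetime` so that it gets a real datetime dtype, after which the `.dt` accessor can be used for counting.

```python
import datetime
import pandas as pd

# Made-up Series of plain datetime.date objects.
dates = pd.Series(
    [datetime.date(2001, 7, 10), datetime.date(2002, 5, 31),
     datetime.date(2003, 8, 29)],
    name="Date",
)
print(dates.dtype)  # object

# Convert to a proper datetime dtype.
converted = pd.to_datetime(dates)
print(converted.dtype)

# Counts per year, in chronological order.
per_year = converted.dt.year.value_counts().sort_index()
print(per_year)
```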

I was stuck for a long time trying to plot time series with "bar". It gets really weird when trying to plot two time series with different indexes, such as daily and monthly data. Then I re-read the docs, and the matplotlib documentation does explicitly state that bar is meant for categorical data.
The plotting function to use is step.

With more recent matplotlib versions, this limitation appears to be lifted.
You can now use Axes.bar to plot time series.
With default options, bars are centered on the dates given as abscissas, with a width of 0.8 days. Bar position can be shifted with the "align" parameter, and width can be given as a scalar or as a list with the same length as the list of abscissas.
Just add the following line to get nice date labels whatever the zoom factor:
plt.rcParams['date.converter'] = 'concise'
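A minimal sketch of such a bar plot, assuming a reasonably recent matplotlib and made-up monthly counts. The width of 20 is an arbitrary choice and is expressed in days because the x values are dates. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly counts indexed by date.
counts = pd.Series(
    [3, 1, 4],
    index=pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
)

fig, ax = plt.subplots()
# Bars are centered on each date; width is given in days.
bars = ax.bar(counts.index.to_pydatetime(), counts.to_numpy(),
              width=20, align="center")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```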

Related

Time series data visualization issue

I have a time series data like below where the data consists of year and week. So, the data is from 2014 1st week to 2015 52 weeks.
Now, below is the line plot of the above mentioned data
As you can see the x axis labelling is not quite what I was trying to achieve since the point after 201453 should be 201501 and there should not be any straight line and it should not be up to 201499. How can I rescale the xaxis exactly according to Due_date column? Below is the code
rand_products = np.random.choice(Op_2['Sp_number'].unique(), 3)
selected_products = Op_2[Op_2['Sp_number'].isin(rand_products)][['Due_date', 'Sp_number', 'Billing']]
plt.figure(figsize=(20,10))
plt.grid(True)
g = sns.lineplot(data=selected_products, x='Due_date', y='Billing', hue='Sp_number', ci=False, legend='full', palette='Set1');
The issue is that 201401, etc. are read as numbers, which is why the line chart has that gap. To fix it, you need to convert the numbers to dates and then plot.
As the full data is not available, below is a two-column dataframe which has Due_date in the form of an integer YYYYWW. The Billing column is a bunch of random numbers. Use the method shown here to convert the integers to dates and plot. The gap will be removed.
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
Due_date = list(np.arange(201401,201454)) #Year 2014
Due_date.extend(np.arange(201501,201553)) #Year 2015
Billing = random.sample(range(500, 1000), 105) #billing numbers
df = pd.DataFrame({'Due_date': Due_date, 'Billing': Billing})
df.Due_date = df.Due_date.astype(str)
df.Due_date = pd.to_datetime(df['Due_date']+ '-1',format="%Y%W-%w") #Convert to date
plt.figure(figsize=(20,10))
plt.grid(True)
ax = sns.lineplot(data=df, x='Due_date', y='Billing', ci=False, legend='full', palette='Set1')
Output graph
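To see what the conversion does on its own, here is a small sketch with three hand-picked YYYYWW codes. With `%W`, weeks start on Monday and week 1 begins at the first Monday of the year, so appending `-1` (Monday, via `%w`) yields the Monday of each week.

```python
import pandas as pd

# Hand-picked YYYYWW codes for illustration.
codes = pd.Series([201401, 201402, 201452])

# Append "-1" (Monday) and parse year, week number and weekday together.
mondays = pd.to_datetime(codes.astype(str) + "-1", format="%Y%W-%w")
print(mondays)
```

Because the x values are now real dates, the jump from 201453 to 201501 no longer shows up as a numeric gap of 48 on the axis.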

Python plot amount of x values in data

I have a huge csv file of data, it looks like this:
STAID, SOUID, DATE, TX, Q_TX
162,100522,19010101, -31, 0
162,100522,19010102, -13, 0
TX is temperature, data goes on for a few thousand more lines to give you an idea.
For every year, I want to plot the amount of days with a temperature above 25 degrees.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv")
zomers = data.index[data["TX"] > 250].tolist()
x_values = []
y_values = []
plt.xlabel("Years")
plt.ylabel("Amount of days with TX > 250")
plt.title("Zomerse Dagen Per Jaar")
plt.plot(x_values, y_values)
# save plot
plt.savefig("zomerse_dagen.png")
X-axis should be the years say 1900-2010 or something, and the y-axis should be the amount of days with a temperature higher than 250 in that year.
How do I go about this? >_< I can't quite get a grasp on how to extract the amount of days from the data.... and use it in a plot.
You can create the data points separately to make it a little easier to comprehend. Then use pandas.pivot_table to aggregate. Here is a working example that should get you going.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("klimaat.csv", parse_dates=["DATE"])
data.sort_values("DATE", inplace=True)
data["above_250"] = data.TX > 250
data["year"] = data.apply(lambda x: x["DATE"].year, axis=1).astype("category")
plot_df = pd.pivot_table(data, index="year", values="above_250", aggfunc="sum")
years = plot_df.index
y_pos = np.arange(len(years))
values = plot_df.above_250
plt.bar(y_pos, values, align='center', alpha=0.5)
plt.xticks(y_pos, years)
plt.ylabel("Amount of days with TX > 250")
plt.xlabel("Year")
plt.title("Zomerse Dagen Per Jaar")
plt.show()
You can use the datetime module from the python standard library to parse the dates, in particular, have a look at the strptime function. You can then use the datetime.year attribute to aggregate your data.
You can also use an OrderedDict to keep track of your aggregation before you assign OrderedDict.keys() and OrderedDict.values() to x_values and y_values respectively.
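A minimal sketch of that standard-library approach, assuming the rows have already been read into (DATE, TX) pairs. The sample values below are invented for illustration and only the first one appears in the question.

```python
from collections import OrderedDict
from datetime import datetime

# Invented (DATE, TX) pairs; TX is in tenths of a degree as in the question.
rows = [
    ("19010101", -31),
    ("19010702", 262),
    ("19010703", 255),
    ("19020815", 251),
    ("19020816", 249),
]

warm_days = OrderedDict()
for raw_date, tx in rows:
    year = datetime.strptime(raw_date, "%Y%m%d").year
    warm_days.setdefault(year, 0)  # make sure every year gets an entry
    if tx > 250:
        warm_days[year] += 1

x_values = list(warm_days.keys())
y_values = list(warm_days.values())
print(x_values, y_values)  # [1901, 1902] [2, 1]
```

`x_values` and `y_values` can then be handed to `plt.plot` as in the question's script.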

How to set correct time values to seaborn (tsplot) x-axis [duplicate]

Below I have the following script which creates a simple time series plot:
%matplotlib inline
import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
df = []
start_date = datetime.datetime(2015, 7, 1)
for i in range(10):
    for j in [1,2]:
        unit = 'Ones' if j == 1 else 'Twos'
        date = start_date + datetime.timedelta(days=i)
        df.append({
            'Date': date.strftime('%Y%m%d'),
            'Value': i * j,
            'Unit': unit
        })
df = pd.DataFrame(df)
sns.tsplot(df, time='Date', value='Value', unit='Unit', ax=ax)
fig.autofmt_xdate()
And the result of this is the following:
As you can see the x-axis has strange numbers for the datetimes, and not the usual "nice" representations that come with matplotlib and other plotting utilities. I've tried many things, re-formatting the data but it never comes out clean. Anyone know a way around?
Matplotlib represents dates as floating point numbers (in days), thus unless you (or pandas or seaborn), tell it that your values are representing dates, it will not format the ticks as dates. I'm not a seaborn expert, but it looks like it (or pandas) does convert the datetime objects to matplotlib dates, but then does not assign proper locators and formatters to the axes. This is why you get these strange numbers, which are in fact just the days since 0001.01.01. So you'll have to take care of the ticks manually (which, in most cases, is better anyways as it gives you more control).
So you'll have to assign a date locator, which decides where to put ticks, and a date formatter, which will then format the strings for the tick labels.
import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# build up the data
df = []
start_date = datetime.datetime(2015, 7, 1)
for i in range(10):
    for j in [1,2]:
        unit = 'Ones' if j == 1 else 'Twos'
        date = start_date + datetime.timedelta(days=i)
        # I believe it makes more sense to directly convert the datetime to a
        # "matplotlib"-date (float), instead of creating strings and then let
        # pandas parse the string again
        df.append({
            'Date': mdates.date2num(date),
            'Value': i * j,
            'Unit': unit
        })
df = pd.DataFrame(df)
# build the figure
fig, ax = plt.subplots()
sns.tsplot(df, time='Date', value='Value', unit='Unit', ax=ax)
# assign locator and formatter for the xaxis ticks.
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y.%m.%d'))
# put the labels at 45deg since they tend to be too long
fig.autofmt_xdate()
plt.show()
Result:
For me, @hitzg's answer results in "OverflowError: signed integer is greater than maximum" in the depths of DateFormatter.
Looking at my dataframe, my indices are datetime64, not datetime. Pandas converts these nicely though. The following works great for me:
import matplotlib as mpl
def myFormatter(x, pos):
    return pd.to_datetime(x)
[ . . . ]
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(myFormatter))
Here is a potentially inelegant solution, but it's the only one I have ... Hope it helps!
g = sns.pointplot(x, y, data=df, ci=False);
unique_dates = sorted(list(df['Date'].drop_duplicates()))
date_ticks = range(0, len(unique_dates), 5)
g.set_xticks(date_ticks);
g.set_xticklabels([unique_dates[i].strftime('%d %b') for i in date_ticks], rotation='vertical');
g.set_xlabel('Date');
Let me know if you see any issues!
def myFormatter(x, pos):
    return pd.to_datetime(x).strftime('%Y%m%d')
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(myFormatter))

Plotting pandas Series line becomes curved

The problem is to plot a straight line with uneven distribution of dates. Using the series values data fixes the curviness problem, but loses the timeline (the dates). Is there a way to fix this?
Edit: Why aren't the dates mapped straight to ticks on x axis:
0 -> 2017-02-17,
1 -> 2017-02-20,
... ?
Now there seems to be 12 ticks for the orange line but only 8 datapoints.
import pandas as pd
import matplotlib.pyplot as plt
def straight_line(index):
    y = [3 + 2*x for x in range(len(index))]
    zserie = pd.Series(y, index=index)
    return zserie
if __name__ == '__main__':
    start = '2017-02-10'
    end = '2017-02-17'
    index = pd.date_range(start,end)
    index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                               '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28',],
                              dtype='datetime64[ns]', name='pvm', freq=None)
    plt.figure(1, figsize=(8, 4))
    zs = straight_line(index)
    zs.plot()
    zs = straight_line(index1)
    zs.plot()
    plt.figure(2, figsize=(8, 4))
    zs = straight_line(index1)
    plt.plot(zs.values)
The graph is treating the dates correctly as a continuous variable. The days of index1 should be plotted at x coordinates of 17, 20, 21, 22, 23, 24, 27, and 28. So, the graph with the orange line is correct.
The problem is with the way you calculate the y-values in the straight_line() function. You are treating the dates as if they were just categorical values and ignoring the gaps between the dates. A linear regression calculation won't do this; it will treat the dates as continuous values.
To get a straight line in your example code you should convert the values in index1 from absolute dates to relative differences by using td = (index - index[0]) (which returns a pandas TimedeltaIndex) and then use the days from td as the x-values of your calculation. I've shown how you can do this in the reg_line() function below:
import pandas as pd
import matplotlib.pyplot as plt
def reg_line(index):
    td = (index - index[0]).days #array containing the number of days since the first day
    y = 3 + 2*td
    zserie = pd.Series(y, index=index)
    return zserie
if __name__ == '__main__':
    start = '2017-02-10'
    end = '2017-02-17'
    index = pd.date_range(start,end)
    index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                               '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28',],
                              dtype='datetime64[ns]', name='pvm', freq=None)
    plt.figure(1, figsize=(8, 4))
    zs = reg_line(index)
    zs.plot(style=['o-'])
    zs = reg_line(index1)
    zs.plot(style=['o-'])
Which produces the following figure:
NOTE: I've added points to the graph to make it clear which values are being drawn on the figure. As you can see, the orange line is straight even though there are no values for some of the days within the range.
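For reference, the day offsets and y-values that reg_line() computes for index1 can be checked without plotting. This sketch simply repeats the arithmetic from the function on the same dates:

```python
import pandas as pd

index1 = pd.DatetimeIndex(['2017-02-17', '2017-02-20', '2017-02-21', '2017-02-22',
                           '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28'])

# Whole days elapsed since the first date; weekend gaps show up as jumps of 3.
td = (index1 - index1[0]).days
y = 3 + 2 * td

print(td.tolist())  # [0, 3, 4, 5, 6, 7, 10, 11]
print(y.tolist())   # [3, 9, 11, 13, 15, 17, 23, 25]
```

Each y-value grows by 2 per elapsed day, not per data point, which is why the plotted line is straight.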

Matplotlib's fill_between doesn't work with plot_date, any alternatives?

I want to create a plot just like this:
The code:
P.fill_between(DF.start.index, DF.lwr, DF.upr, facecolor='blue', alpha=.2)
P.plot(DF.start.index, DF.Rt, '.')
but with dates in the x axis, like this (without bands):
the code:
P.plot_date(DF.start, DF.Rt, '.')
the problem is that fill_between fails when x values are date_time objects.
Does anyone know of a workaround? DF is a pandas DataFrame.
It would help if you show how df is defined. What does df.info() report? This will show us the dtypes of the columns.
There are many ways that dates can be represented: as strings, ints, floats, datetime.datetime, NumPy datetime64s, Pandas Timestamps, or Pandas DatetimeIndex. The correct way to plot it depends on what you have.
Here is an example showing your code works if df.index is a DatetimeIndex:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
plt.fill_between(df.index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(df.index, df.Rt, '.')
plt.show()
If the index has string representations of dates, then (with Matplotlib version 1.4.2) you would get a TypeError:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'])
index = [item.strftime('%Y-%m-%d') for item in index]
plt.fill_between(index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(index, df.Rt, '.')
plt.show()
yields
File "/home/unutbu/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/ma/core.py", line 2237, in masked_invalid
condition = ~(np.isfinite(a))
TypeError: Not implemented for this type
In this case, the fix is to convert the strings to Timestamps:
index = pd.to_datetime(index)
Regarding the error reported by chilliq:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs
could not be safely coerced to any supported types according to the casting
rule ''safe''
This can be produced if the DataFrame columns have "object" dtype when using fill_between. Changing the example column types and then trying to plot, as follows, results in the error above:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
dfo = df.astype(object)
plt.fill_between(dfo.index, dfo.lwr, dfo.upr, facecolor='blue', alpha=.2)
plt.show()
From dfo.info() we see that the column types are "object":
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 180 entries, 2000-01-31 to 2014-12-31
Freq: M
Data columns (total 3 columns):
lwr 180 non-null object
Rt 180 non-null object
upr 180 non-null object
dtypes: object(3)
memory usage: 5.6+ KB
Ensuring that the DataFrame has numerical columns will solve the problem. To do this we can use pandas.to_numeric to convert, as follows:
dfn = dfo.apply(pd.to_numeric, errors='ignore')
plt.fill_between(dfn.index, dfn.lwr, dfn.upr, facecolor='blue', alpha=.2)
plt.show()
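The effect of the conversion can be sketched without matplotlib by calling `np.isfinite` directly, since that is the ufunc named in the error message. The tiny frame below is made up, and the `try`/`except` is there only to show which call fails.

```python
import numpy as np
import pandas as pd

# Made-up band data with a DatetimeIndex.
index = pd.date_range("2000-01-31", periods=4, freq="D")
df = pd.DataFrame({"lwr": [1.0, 2.0, 3.0, 4.0],
                   "upr": [2.0, 3.0, 4.0, 5.0]}, index=index)

# Same values, but stored as generic Python objects.
dfo = df.astype(object)
try:
    np.isfinite(dfo["lwr"].to_numpy())
    failed = False
except TypeError:
    failed = True  # isfinite is not defined for object arrays
print("object dtype fails:", failed)

# Converting every column back to a numeric dtype makes the check work again.
dfn = dfo.apply(pd.to_numeric)
print(dfn.dtypes)
print(np.isfinite(dfn["lwr"].to_numpy()).all())
```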
I got similar error while using fill_between:
ufunc 'bitwise_and' not supported
However, in my case the cause of the error was rather stupid. I was passing the color parameter without an explicit argument name, so it was taken as the fourth positional parameter, which is called where. Simply making sure that keyword parameters are passed by name solved the issue:
ax.fill_between(xdata, highs, lows, color=color, alpha=0.2)
I think none of the answers addresses the original question; they all change it a little bit.
If you want to plot timedeltas you can use this workaround:
ax = df.Rt.plot()
x = ax.get_lines()[0].get_xdata().astype(float)
ax.fill_between(x, df.lwr, df.upr, color="b", alpha=0.2)
plt.show()
This works in your case. In general, the only caveat is that you always need to plot the index using pandas first and then get the coordinates from the artist. I am sure that by looking at the pandas code, one could find out how it plots the timedeltas. One could then apply that directly, and the first plot would no longer be needed.
