Matplotlib's fill_between doesn't work with plot_date, any alternatives?

I want to create a plot just like this:
The code:
P.fill_between(DF.start.index, DF.lwr, DF.upr, facecolor='blue', alpha=.2)
P.plot(DF.start.index, DF.Rt, '.')
but with dates in the x axis, like this (without bands):
the code:
P.plot_date(DF.start, DF.Rt, '.')
the problem is that fill_between fails when x values are date_time objects.
Does anyone know of a workaround? DF is a pandas DataFrame.

It would help if you show how df is defined. What does df.info() report? This will show us the dtypes of the columns.
There are many ways that dates can be represented: as strings, ints, floats, datetime.datetime, NumPy datetime64s, Pandas Timestamps, or Pandas DatetimeIndex. The correct way to plot it depends on what you have.
Here is an example showing your code works if df.index is a DatetimeIndex:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
plt.fill_between(df.index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(df.index, df.Rt, '.')
plt.show()
If the index has string representations of dates, then (with Matplotlib version 1.4.2) you would get a TypeError:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'])
index = [item.strftime('%Y-%m-%d') for item in index]
plt.fill_between(index, df.lwr, df.upr, facecolor='blue', alpha=.2)
plt.plot(index, df.Rt, '.')
plt.show()
yields
File "/home/unutbu/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/ma/core.py", line 2237, in masked_invalid
condition = ~(np.isfinite(a))
TypeError: Not implemented for this type
In this case, the fix is to convert the strings to Timestamps:
index = pd.to_datetime(index)
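A minimal end-to-end sketch of that fix, using made-up values and a hypothetical list named date_strings (neither is from the question): convert the strings once with pd.to_datetime, then pass the resulting DatetimeIndex to fill_between.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: dates stored as strings, plus a lower/upper band.
date_strings = ["2000-01-31", "2000-02-29", "2000-03-31", "2000-04-30"]
df = pd.DataFrame({
    "lwr": [9.1, 9.4, 9.0, 9.6],
    "Rt": [10.0, 10.2, 9.8, 10.1],
    "upr": [10.9, 11.0, 10.5, 10.8],
})

# Convert the strings to a DatetimeIndex before plotting.
df.index = pd.to_datetime(date_strings)

fig, ax = plt.subplots()
band = ax.fill_between(df.index, df["lwr"], df["upr"], facecolor="blue", alpha=0.2)
ax.plot(df.index, df["Rt"], ".")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
# plt.show()  # uncomment in an interactive session
```

Doing the conversion on the index (rather than at each plot call) keeps the band and the points on the same date axis.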

Regarding the error reported by chilliq:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs
could not be safely coerced to any supported types according to the casting
rule ''safe''
This can be produced if the DataFrame columns have "object" dtype when using fill_between. Changing the example column types and then trying to plot, as follows, results in the error above:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
index = pd.date_range(start='2000-1-1', end='2015-1-1', freq='M')
N = len(index)
poisson = (stats.poisson.rvs(1000, size=(N,3))/100.0)
poisson.sort(axis=1)
df = pd.DataFrame(poisson, columns=['lwr', 'Rt', 'upr'], index=index)
dfo = df.astype(object)
plt.fill_between(dfo.index, dfo.lwr, dfo.upr, facecolor='blue', alpha=.2)
plt.show()
From dfo.info() we see that the column types are "object":
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 180 entries, 2000-01-31 to 2014-12-31
Freq: M
Data columns (total 3 columns):
lwr 180 non-null object
Rt 180 non-null object
upr 180 non-null object
dtypes: object(3)
memory usage: 5.6+ KB
Ensuring that the DataFrame has numerical columns will solve the problem. To do this we can use pandas.to_numeric to convert, as follows:
dfn = dfo.apply(pd.to_numeric, errors='ignore')
plt.fill_between(dfn.index, dfn.lwr, dfn.upr, facecolor='blue', alpha=.2)
plt.show()
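The cause can be checked without plotting at all. The sketch below uses a tiny made-up frame (not the one above) and shows that np.isfinite rejects an object-dtype array, and that it accepts the same data once the columns are converted with pd.to_numeric.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; casting to object mimics columns loaded with the wrong dtype.
df = pd.DataFrame({"lwr": [1.0, 2.0], "upr": [3.0, 4.0]})
dfo = df.astype(object)

# np.isfinite has no implementation for object arrays, so this raises TypeError.
try:
    np.isfinite(dfo["lwr"].to_numpy())
    raised = False
except TypeError:
    raised = True

# After conversion the columns are float64 and the check succeeds.
dfn = dfo.apply(pd.to_numeric)
finite = np.isfinite(dfn["lwr"].to_numpy())
print(raised, dfn.dtypes.tolist(), finite)
```

Checking df.dtypes before plotting is usually the quickest way to spot this situation.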

I got a similar error while using fill_between:
ufunc 'bitwise_and' not supported
However, in my case the cause of the error was rather stupid. I was passing the color as a positional argument, without its keyword, so it was taken as the fourth parameter, which is called where. Simply making sure the keyword arguments are passed by name solved the issue:
ax.fill_between(xdata, highs, lows, color=color, alpha=0.2)
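One way to confirm which parameter a fourth positional argument lands on is to inspect the signature. This is only a sketch and assumes the installed Matplotlib exposes the signature of Axes.fill_between through the standard inspect module.

```python
import inspect
from matplotlib.axes import Axes

# List the parameter names of fill_between in declaration order.
params = list(inspect.signature(Axes.fill_between).parameters)

# The first entry is "self"; the next four are the positional slots
# that a call like ax.fill_between(x, y1, y2, something) fills.
print(params[1:5])
```

If the fourth name printed is where, a color passed positionally is interpreted as a mask, which matches the error described above.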

I think none of the answers addresses the original question; they all change it a little bit.
If you want to plot timedeltas, you can use this workaround:
ax = df.Rt.plot()
x = ax.get_lines()[0].get_xdata().astype(float)
ax.fill_between(x, df.lwr, df.upr, color="b", alpha=0.2)
plt.show()
This works in your case. In general, the only caveat is that you always need to plot the index using pandas first and then get the coordinates from the artist. I am sure that by looking at the pandas code one can find out how it plots the timedeltas; applying that directly would make the first plot unnecessary.

Related

How to apply best fit line to time series in python

I am trying to apply a best fit line to a time series showing NDVI over time, but I keep running into errors. My x values, in this case, are dates stored as strings that are not evenly spaced, and y is the NDVI value for each date.
When I use the poly1d function in numpy I get the following error:
TypeError: ufunc 'add' did not contain a loop with signature matching types
dtype('<U32') dtype('<U32') dtype('<U32')
I have attached a sample of the data set I am working with
# plot data and models
plt.subplots(figsize=(20, 10))
plt.xticks(rotation=90)
plt.plot(x,y,'-', color= 'blue')
plt.title('WSC-10-50')
plt.ylabel('NDVI')
plt.xlabel('Date')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(y)))
plt.legend(loc='upper right')
Any help fixing my code or a better way I can get the best fit line for my data?
When I apply a best fit line to time series data, I create an evenly spaced line that represents the dates to simplify the regression. So I use np.linspace() to create a set of intervals equal to the number of dates.
Code:
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = StringIO("""
date value
24-Jan-16 0.786
25-Feb-16 0.781
29-Apr-16 0.786
15-May-16 0.761
16-Jun-16 0.762
04-Sep-16 0.783
22-Oct-16 0.797
""")
df = pd.read_table(data, sep=r"\s+")
# To read from csv use:
# df = pd.read_csv("/path/to/file.csv")
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")
y_values = df.loc[:, "value"]
x_values = np.linspace(0,1,len(df.loc[:, "value"]))
poly_degree = 3
coeffs = np.polyfit(x_values, y_values, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
plt.figure(figsize=(12,8))
plt.plot(df.loc[:, "date"], df.loc[:,"value"], "ro")
plt.plot(df.loc[:, "date"],y_hat)
plt.title('WSC-10-50')
plt.ylabel('NDVI')
plt.xlabel('Date')
plt.savefig("NDVI_plot.png")
Output:
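Because the dates in the question are not evenly spaced, a variant worth considering is to fit against the real elapsed time instead of np.linspace. The sketch below is an alternative to the answer's approach, not a restatement of it: it reuses the sample data, measures x as days since the first observation (an assumption about what "time" should mean here), and fits a straight line.

```python
import numpy as np
import pandas as pd

# Sample data from the answer above.
df = pd.DataFrame({
    "date": ["24-Jan-16", "25-Feb-16", "29-Apr-16", "15-May-16",
             "16-Jun-16", "04-Sep-16", "22-Oct-16"],
    "value": [0.786, 0.781, 0.786, 0.761, 0.762, 0.783, 0.797],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")

# Use elapsed days as the numeric x axis so uneven gaps are respected.
days = (df["date"] - df["date"].iloc[0]).dt.days.to_numpy()

# Degree-1 least-squares fit: value ~ slope * days + intercept.
slope, intercept = np.polyfit(days, df["value"].to_numpy(), 1)
trend = slope * days + intercept
print(days, slope, intercept)
```

The trend values can then be plotted against df["date"] in the same way as y_hat above.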

Python: error in plotting data created with `period_range` (pandas)

I have a problem in plotting time series data, created using pandas date_range and period_range. The former works, but the latter does not. To illustrate the problem, consider the following
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# random numbers
N = 100
z = np.random.randn(N)
ts = pd.DataFrame({'Y': z, 'X': np.cumsum(z)})
# 'date_range' is used
month_date = pd.date_range('1978-02', periods=N, freq='M')
df_date = ts.set_index(month_date)
# 'period_range' is used
month_period = pd.period_range('1978-02', periods=N, freq='M')
df_period = ts.set_index(month_period)
# plot
plt.plot(df_date);plt.show()
plt.plot(df_period);plt.show()
plt.plot(df_date) gives a nice figure, whereas plt.plot(df_period) generates the following error, which I do not understand:
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
<Figure size 432x288 with 1 Axes>
Is this an expected result? What am I missing here?
BTW, df_date.plot() and df_period.plot() both work fine, causing no problem.
Thanks in advance.
The indexes of the two dataframes are of different types:
print(type(df_date.index)) # pandas.core.indexes.datetimes.DatetimeIndex
print(type(df_period.index)) # pandas.core.indexes.period.PeriodIndex
Matplotlib does not know how to plot a PeriodIndex.
You may convert the PeriodIndex to a DatetimeIndex and plot that one instead,
plt.plot(df_period.to_timestamp())
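A small sketch of what that conversion does to the index, using a short made-up frame rather than the one from the question:

```python
import numpy as np
import pandas as pd

# Short made-up frame indexed by monthly periods.
periods = pd.period_range("1978-02", periods=4, freq="M")
df_period = pd.DataFrame({"Y": np.arange(4.0)}, index=periods)

# to_timestamp() returns a copy whose index is a DatetimeIndex;
# by default each period maps to its start.
df_ts = df_period.to_timestamp()

print(type(df_period.index).__name__)
print(type(df_ts.index).__name__)
print(df_ts.index[0])
```

The converted frame can be passed to plt.plot like any other frame with a DatetimeIndex.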

Matplotlib Boxplot and pandas dataframe data type

So I set up this empty dataframe DF and load data into it according to some conditions. As a result, some of its elements are empty (nan). I noticed that if I don't specify the datatype as float when I create the empty dataframe, DF.boxplot() gives me an 'Index out of range' error.
As I understand it, pandas' DF.boxplot() uses matplotlib's plt.boxplot() function, so naturally I tried using plt.boxplot(DF.iloc[:,0]) to plot the boxplot of the first column. I noticed the reverse behavior: when the dtype of DF is float, it does not work and just shows me an empty plot. See the code below, where DF.boxplot() won't work but plt.boxplot(DF.iloc[:,0]) plots a boxplot (when I add dtype='float' when first creating the dataframe, plt.boxplot(DF.iloc[:,0]) gives me an empty plot):
import numpy as np
import pandas as pd
DF=pd.DataFrame(index=range(10),columns=range(4))
for i in range(10):
    for j in range(4):
        if i==j:
            continue
        DF.iloc[i,j]=i
I am wondering whether this has to do with how plt.boxplot() handles nan for different data types. If so, why doesn't setting the dataframe's data type to 'object' work for DF.boxplot(), if pandas is just using matplotlib's boxplot function?
I think we can agree that neither df.boxplot() nor plt.boxplot can handle dataframes of type "object". Instead they need to be of a numeric datatype.
If the data is numeric, df.boxplot() will work as expected, even with nan values, because they are removed before plotting.
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i!=j:
            df.iloc[i,j]=i
df.boxplot()
plt.show()
Using plt.boxplot you would need to remove the nans manually, e.g. using df.dropna().
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i!=j:
            df.iloc[i,j]=i
data = [df[i].dropna() for i in range(4)]
plt.boxplot(data)
plt.show()
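The dtype difference behind this can be seen without drawing anything. The sketch below uses a smaller made-up frame (4 rows, 3 columns) built the same way as in the question, then converts it to float and drops the nans per column, which is the form plt.boxplot expects.

```python
import pandas as pd

# An empty frame created without dtype holds object columns.
df = pd.DataFrame(index=range(4), columns=range(3))
for i in range(4):
    for j in range(3):
        if i != j:
            df.iloc[i, j] = i
print(df.dtypes.tolist())

# Cast to a numeric dtype, then build one nan-free array per column.
numeric = df.astype(float)
data = [numeric[c].dropna().to_numpy() for c in numeric.columns]
print([len(d) for d in data])
```

The list named data can be handed to plt.boxplot, while numeric.boxplot() should work on the float frame directly.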
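To summarize: both df.boxplot() and plt.boxplot need numeric data; df.boxplot() drops nan values for you, while plt.boxplot needs them removed first.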

TypeError when plotting heatmap with seaborn

I have the following dataset in a pandas DataFrame, which I've tidied and saved into "filename1.csv":
import pandas as pd
df = pd.read_csv("filename1.csv")
print(df)
samples a b c percent_a percent_c ratio_a:b ratio_c:b
0 sample1 185852 6509042 253303 0.028553 0.038916 35.022717 25.696664
1 sample2 218178 6456571 273448 0.033792 0.042352 29.593135 23.611696
2 sample3 251492 6353453 343252 0.039584 0.054026 25.263042 18.509588
3 sample4 232299 6431376 284522 0.036120 0.044240 27.685767 22.604143
..............................
I would like to plot this DataFrame as a heatmap using seaborn. At first, it would be interesting to see the samples (one sample per row) against two columns, percent_a and percent_c:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# drop unnecessary columns
df = df.drop(["a", "b", "c", "ratio_a:b", "ratio_c:b"], axis = 1)
sns.heatmap(df)
plt.show()
However, this throws an error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs
could not be safely coerced to any supported types according to the casting rule ''safe''
I originally thought this meant that there were NaN values in this DataFrame. However, it appears wrong, as
df.isnull().values.any()
outputs False. So, I suspect this is because samples is a column of non-numerical values.
How do I plot a seaborn heatmap such that these categorical values are shown?
If you just drop the "samples" column as well, isn't that what you are looking for? You can then put the sample names back in later using matplotlib's ax.set_yticklabels function. Note that you need to reverse the list of sample names, since matplotlib starts the labeling from the bottom.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("SO_pandassnsheatmap.txt", sep=r"\s+")
df2 = df.drop(["samples", "a", "b", "c", "ratio_a:b", "ratio_c:b"], axis = 1)
ax = sns.heatmap(df2)
ax.set_yticklabels(df.samples.values[::-1])
plt.show()
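An alternative sketch, not taken from the answer above: move the sample names into the index so that only numeric columns remain. As far as I know, sns.heatmap takes its row labels from the DataFrame index, so a frame prepared this way could be passed to it without setting tick labels by hand. The values are the first four rows shown in the question.

```python
import pandas as pd

# First four rows from the question (only the columns needed here).
df = pd.DataFrame({
    "samples": ["sample1", "sample2", "sample3", "sample4"],
    "a": [185852, 218178, 251492, 232299],
    "percent_a": [0.028553, 0.033792, 0.039584, 0.036120],
    "percent_c": [0.038916, 0.042352, 0.054026, 0.044240],
})

# Keep the labels as the index and select only the numeric columns to plot.
heat = df.set_index("samples")[["percent_a", "percent_c"]]
print(heat.dtypes.tolist())
print(list(heat.index))
```

Selecting the wanted columns explicitly also avoids the long drop list.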

Can Pandas plot a histogram of dates?

I've taken my Series and coerced it to a datetime column of dtype=datetime64[ns] (though only need day resolution...not sure how to change).
import pandas as pd
df = pd.read_csv('somefile.csv')
column = df['date']
column = pd.to_datetime(column, errors='coerce')
but plotting doesn't work:
ipdb> column.plot(kind='hist')
*** TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('float64')
I'd like to plot a histogram that just shows the count of dates by week, month, or year.
Surely there is a way to do this in pandas?
Given this df:
date
0 2001-08-10
1 2002-08-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2003-08-14
8 2003-07-29
and, if it's not already the case:
df["date"] = df["date"].astype("datetime64[ns]")
To show the count of dates by month:
df.groupby(df["date"].dt.month).count().plot(kind="bar")
.dt allows you to access the datetime properties.
Which will give you:
You can replace month by year, day, etc.
If you want to distinguish year and month for instance, just do:
df.groupby([df["date"].dt.year, df["date"].dt.month]).count().plot(kind="bar")
Which gives:
I think resample might be what you are looking for. In your case, do:
df.set_index('date', inplace=True)
# '1M' for 1 month; '1W' for 1 week; see the documentation on offset aliases
df.resample('1M').count()
It is only doing the counting and not the plot, so you then have to make your own plots.
For more details, see the pandas resample documentation.
I have run into similar problems as you did. Hope this helps.
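A small sketch of the counting step followed by a bar plot, with made-up dates. It uses the "MS" (month start) alias, which is an assumption chosen because it bins by calendar month; swap in another offset alias for weeks or years.

```python
import pandas as pd

# Made-up dates; the Series values are irrelevant, only the index is counted.
dates = ["2003-07-14", "2003-07-29", "2003-08-14", "2003-08-29", "2003-10-02"]
s = pd.Series(1, index=pd.to_datetime(dates))

# One bin per calendar month, including months with no rows.
counts = s.resample("MS").size()

# Shorter labels for the bar plot.
counts.index = counts.index.strftime("%Y-%m")
ax = counts.plot.bar()
print(counts)
```

Months without any dates show up with a count of zero, so the bars keep their chronological spacing.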
Rendered example
Example Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Create random datetime object."""
# core modules
from datetime import datetime
import random
# 3rd party modules
import pandas as pd
import matplotlib.pyplot as plt
def visualize(df, column_name='start_date', color='#494949', title=''):
    """
    Visualize a dataframe with a date column.
    Parameters
    ----------
    df : Pandas dataframe
    column_name : str
        Column to visualize
    color : str
    title : str
    """
    plt.figure(figsize=(20, 10))
    ax = (df[column_name].groupby(df[column_name].dt.hour)
                         .count()).plot(kind="bar", color=color)
    ax.set_facecolor('#eeeeee')
    ax.set_xlabel("hour of the day")
    ax.set_ylabel("count")
    ax.set_title(title)
    plt.show()
def create_random_datetime(from_date, to_date, rand_type='uniform'):
    """
    Create random date within timeframe.
    Parameters
    ----------
    from_date : datetime object
    to_date : datetime object
    rand_type : {'uniform'}
    Examples
    --------
    >>> random.seed(28041990)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(1998, 12, 13, 23, 38, 0, 121628)
    >>> create_random_datetime(datetime(1990, 4, 28), datetime(2000, 12, 31))
    datetime.datetime(2000, 3, 19, 19, 24, 31, 193940)
    """
    delta = to_date - from_date
    if rand_type == 'uniform':
        rand = random.random()
    else:
        raise NotImplementedError('Unknown random mode \'{}\''
                                  .format(rand_type))
    return from_date + rand * delta
def create_df(n=1000):
    """Create a Pandas dataframe with datetime objects."""
    from_date = datetime(1990, 4, 28)
    to_date = datetime(2000, 12, 31)
    sales = [create_random_datetime(from_date, to_date) for _ in range(n)]
    df = pd.DataFrame({'start_date': sales})
    return df
if __name__ == '__main__':
    import doctest
    doctest.testmod()
    df = create_df()
    visualize(df)
Here is a solution for when you just want a histogram like you would expect. This doesn't use groupby, but converts the datetime values to integers and changes the labels on the plot. Some improvement could be made to move the tick labels to even locations. Also, with this approach a kernel density estimation plot (and any other plot) is possible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"datetime": pd.to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 100, dtype=np.int64))})
fig, ax = plt.subplots()
df["datetime"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)
plt.show()
All of these answers seem overly complex; at least with 'modern' pandas it's two lines.
df.set_index('date', inplace=True)
df.resample('M').size().plot.bar()
If you have a series with a DatetimeIndex then just run the second line
series.resample('M').size().plot.bar() # Just counts the rows/month
or
series.resample('M').sum().plot.bar() # Sums up the values in the series
I was able to work around this by (1) plotting with matplotlib instead of using the dataframe directly and (2) using the values attribute. See example:
import matplotlib.pyplot as plt
ax = plt.gca()
ax.hist(column.values)
This doesn't work if I don't use values, but I don't know why it does work.
I think you can solve that problem with this code, which converts the date column to int and then back to datetime:
df['date'] = df['date'].astype(int)
df['date'] = pd.to_datetime(df['date'], unit='s')
for getting date only, you can add this code:
pd.DatetimeIndex(df.date).normalize()
df['date'] = pd.DatetimeIndex(df.date).normalize()
I was just having trouble with this as well. I imagine that since you're working with dates you want to preserve chronological ordering (like I did.)
The workaround then is
import matplotlib.pyplot as plt
counts = df['date'].value_counts(sort=False)
plt.bar(counts.index,counts)
plt.show()
Please, if anyone knows of a better way please speak up.
EDIT:
for jean above, here's a sample of the data [I randomly sampled from the full dataset, hence the trivial histogram data.]
print dates
type(dates),type(dates[0])
dates.hist()
plt.show()
Output:
0 2001-07-10
1 2002-05-31
2 2003-08-29
3 2006-06-21
4 2002-03-27
5 2003-07-14
6 2004-06-15
7 2002-01-17
Name: Date, dtype: object
<class 'pandas.core.series.Series'> <type 'datetime.date'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-f39e334eece0> in <module>()
2 print dates
3 print type(dates),type(dates[0])
----> 4 dates.hist()
5 plt.show()
/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, bins, **kwds)
2570 values = self.dropna().values
2571
-> 2572 ax.hist(values, bins=bins, **kwds)
2573 ax.grid(grid)
2574 axes = np.array([ax])
/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
5620 for xi in x:
5621 if len(xi) > 0:
-> 5622 xmin = min(xmin, xi.min())
5623 xmax = max(xmax, xi.max())
5624 bin_range = (xmin, xmax)
TypeError: can't compare datetime.date to float
I was stuck for a long time trying to plot time series with "bar". It gets really weird when trying to plot two time series with different indexes, like daily and monthly data for instance. Then I re-read the docs, and the matplotlib documentation indeed states explicitly that bar is meant for categorical data.
The plotting function to use is step.
With more recent matplotlib versions, this limitation appears to be lifted.
You can now use Axes.bar to plot time series.
With default options, bars are centered on the dates given as abscissas, with a width of 0.8 day. Bar position can be shifted with the "align" parameter, and width can be given as a scalar or as a list of the same length as the list of abscissas.
Just add the following line to have nice date labels whatever the zoom factor:
plt.rcParams['date.converter'] = 'concise'
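A short sketch of a date-based bar plot with made-up monthly values. Instead of the rcParams setting it attaches a ConciseDateFormatter to the axis explicitly, which is a swapped-in way of getting similar labels; the width of 20 is interpreted in days.

```python
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Made-up monthly series: first day of each month and a value.
dates = [dt.datetime(2020, month, 1) for month in range(1, 7)]
values = [3, 5, 2, 8, 6, 4]

fig, ax = plt.subplots()
# Width is given in days; align="edge" starts each bar at its date.
bars = ax.bar(dates, values, width=20, align="edge")

# Compact date labels that adapt to the visible range.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
# plt.show()  # uncomment in an interactive session
```

With the default width of 0.8 day the monthly bars would be very thin, which is why a larger width is set here.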
