I have time series data in a pandas Series object. The values are floats representing the size of an event. Together, event times and sizes tell me how busy my system is. I'm using a scatterplot.
I want to look at traffic patterns over various time periods, to answer questions like "Is there a daily spike at noon?" or "Which days of the week have the most traffic?" Therefore I need to overlay data from many different periods. I do this by converting timestamps to timedeltas (first by subtracting the start of the first period, then by doing a mod with the period length).
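In miniature, the conversion looks like this (toy timestamps, pandas only):

```python
import pandas as pd

one_day = pd.Timedelta(days=1)
idx = pd.DatetimeIndex(['2015-04-27 01:00', '2015-05-04 02:00'])
# subtract the start of the first period, then mod by the period length
offsets = (idx - pd.Timestamp(idx[0].date())) % one_day
print(offsets)  # both timestamps fold into the same "abstract" day
```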
Now my index uses time intervals relative to an "abstract" time period, like a day or a week. I would like to produce plots where the x-axis shows something other than just nanoseconds. Ideally it would show month, day of the week, hour, etc. depending on timescale as you zoom in and out (as Bokeh graphs generally do for time series).
The code below shows an example of how I currently plot. The resulting graph has an x-axis in units of nanoseconds, which is not what I want. How do I get a smart x-axis that behaves more like what I would see for timestamps?
import numpy as np
import pandas as pd
from bokeh.io import output_file, show
from bokeh.plotting import figure
oneDay = np.timedelta64(24*60*60,'s')
fourHours = 14400000000000 # four hours in nanoseconds (ugly)
time = [pd.Timestamp('2015-04-27 01:00:00'),  # a Monday
        pd.Timestamp('2015-05-04 02:00:00'),  # a Monday
        pd.Timestamp('2015-05-11 03:00:00'),  # a Monday
        pd.Timestamp('2015-05-12 04:00:00')]  # a Tuesday
resp = [2.0, 1.3, 2.6, 1.3]
ts = pd.Series(resp, index=time)
days = dict(list(ts.groupby(lambda x: x.weekday())))
monday = days[0] # this Series consists of all data for all Mondays
# Now convert timestamps to timedeltas
# First subtract the timestamp of the starting date
# Then take the remainder after dividing by one day
# Result: each index value is in the 24 hour range [00:00:00, 23:59:59]
tdi = monday.index - pd.Timestamp(monday.index.date[0])
x = pd.TimedeltaIndex([td % oneDay for td in tdi])
y = monday.values
output_file('bogus.html')
xmax = fourHours # why doesn't np.timedelta64 work for xmax?
fig = figure(x_range=[0,xmax], y_range=[0,4])
fig.circle(x, y)
show(fig)
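On the aside in the code ("why doesn't np.timedelta64 work for xmax?"): Bokeh ranges want plain numbers, but a timedelta64 can be converted to nanoseconds explicitly (a sketch):

```python
import numpy as np

four_hours = np.timedelta64(4 * 60 * 60, 's')
# dividing by a 1 ns timedelta yields a plain float in nanoseconds
xmax = four_hours / np.timedelta64(1, 'ns')
print(xmax)  # 14400000000000.0
```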
I'm not sure if my question makes sense, so apologies on that.
Basically, I am plotting some data that is ~107 hours long. On the x-axis, I want the range to go from -50 to 50: -50 to -1 representing the 50 hours before the start of the event, 0 in the middle representing the start of the event, and 1 to 50 representing the 50 hours after it. I want to divide the hours roughly evenly between each side of 0.
I initially tried using the plt.xlim() function, but that just shifts all the data to one side of the plot.
I've tried using plt.xticks and labeling the ticks "-50", "-25", "0", "25", and "50". That somewhat works, but it still does not look great. I'll add an example figure of that approach, as well as the original plot, to clarify what I'm trying to do:
Original plot:
Goal:
Edit:
Here's my code for plotting it:
fig_1 = plt.figure(figsize=(30,20))
file.plot(x='start', y='value', ax=fig_1.gca())  # plot onto the figure's axes
plt.xlabel('hour')
plt.ylabel('value')
plt.xticks([0, 25, 50, 75, 100], ["-50", "-25", "0", "25", "50"])
You could obtain zero-centered ticks by subtracting the mean, using df.sub(df.mean()) or np.mean().
Alternative 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate data
left = np.linspace(10,60, 54)
right = np.linspace(60,10, 53)
noise_left = np.random.normal(0, 1, 54)
noise_right = np.random.normal(0, 1, 53)
signal = np.append(left + noise_left, right + noise_right)  # avoid shadowing builtin all()
file = pd.DataFrame({'start': np.linspace(1, 107, 107), 'value': signal})
# subtract mean
file['start'] = file['start'].sub(file['start'].mean())
fig_1 = plt.figure(figsize=(30,20))
file.plot(x='start', y='value', ax=fig_1.gca())
plt.xlabel('hour')
plt.ylabel('value')
Output:
Alternative 2:
# subtract the mean from start to obtain zero mean ticks
ticks = file['start'] - np.mean(file['start'])
# set distance between each tick to 10
plt.xticks(file['start'][::10], ticks[::10], rotation=45)
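Either way, the centering arithmetic can be checked numerically (a minimal sketch of the same idea):

```python
import numpy as np

start = np.linspace(1, 107, 107)   # hours 1..107
centered = start - start.mean()    # midpoint (hour 54) maps to 0
print(centered[0], centered[53], centered[-1])  # -53.0 0.0 53.0
```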
I have seen many answers on how to calculate the monthly mean from daily data across multiple years.
But what I want to do is to calculate the monthly mean from daily data for each year in my xarray separately. So, I want to end up with a mean for Jan 2020, Feb 2020 ... Dec 2024 for each lon/lat gridpoint.
My xarray has the dimensions Frozen({'time': 1827, 'lon': 180, 'lat': 90})
I tried using
var_resampled = var_diff.resample(time='1M').mean()
but this calculates the mean across all years (i.e. a single mean for Jan 2020-2024).
I also tried
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
This seems to do what I want but I end up with different dimensions (ie "month" and "year" instead of the original "time" dimension).
Is there a different way to calculate the monthly mean from daily data for each year separately, or is there a way for the groupby code above to retain the original time dimension, just expressed as year and month?
P.S. I also tried "cdo monmean" but as far as I understand this also just gives the monthly mean across all years.
Thanks!
Solution
I found a way using
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
and then using
var_diff_mon.stack(time=("year", "month"))
to get my original time dimension back
Is var_diff.resample(time='M') (or time='MS') doing what you expect?
Let's create a toy dataset like yours:
import numpy as np
import pandas as pd
import xarray as xr
dims = ('time', 'lat', 'lon')
time = pd.date_range("2021-01-01T00", "2023-12-31T23", freq="H")
lat = [0, 1]
lon = [0, 1]
coords = (time, lat, lon)
ds = xr.DataArray(data=np.random.randn(len(time), len(lat), len(lon)), coords=coords, dims=dims).rename("my_var")
ds = ds.to_dataset()
ds
Let's resample it:
ds.resample(time="MS").mean()
The dataset has now 36 time steps, associated with the 36 months which are in the original dataset.
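The same behaviour can be checked with plain pandas, no xarray needed (a minimal sketch): resampling to month start keeps each year's months separate rather than averaging all Januaries together.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', '2021-12-31', freq='D')
s = pd.Series(np.ones(len(idx)), index=idx)
monthly = s.resample('MS').mean()
print(len(monthly))  # 24 -- one entry per month per year
```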
I have a table of sunrise and sunset times. I would like to plot the years on the x-axis and the time of day on the y-axis. However, when just plugging in the dataframe, I get floats, unrelated to time of day, on the y-axis.
I have tried various solutions, which I can't seem to make work. Including using subplots with the following:
ax.yaxis.set_major_locator(HourLocator())
ax.yaxis.set_major_formatter(DateFormatter('%H:%M'))
Here is the code:
import sunriset
lat = 34.0522
long = -118.2437
local_tz = -8
number_of_years = 10
start_date = datetime.date(2010,1,1)
df = sunriset.to_pandas(start_date, lat, long, local_tz, number_of_years)
sunrise_set = df[['Sunrise', 'Sunset', 'Sunlight Durration (minutes)']]
sunrise_set.index = pd.to_datetime(sunrise_set.index)
plt.plot(sunrise_set['Sunrise'])
plt.show()
I would like for this to have the time in the y axis not the floats.
Well, I figured it out. The pandas columns needed to be converted to datetime, even though they show up as datetime when doing df.describe():
sunrise_set['Sunrise'] = pd.to_datetime(sunrise_set['Sunrise'])
sunrise_set['Sunrise'] = [time.time() for time in sunrise_set['Sunrise']]
sunrise_set['Sunset'] = pd.to_datetime(sunrise_set['Sunset'])
sunrise_set['Sunset'] = [time.time() for time in sunrise_set['Sunset']]
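If matplotlib still balks at datetime.time values on an axis, a fallback (a sketch; time_to_hours is a made-up helper, not part of any library) is to plot a float hour-of-day and label the axis yourself:

```python
from datetime import time

def time_to_hours(t):
    """Convert a datetime.time to a float count of hours since midnight."""
    return t.hour + t.minute / 60 + t.second / 3600

print(time_to_hours(time(6, 30)))  # 6.5
```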
I have a list of timestamps in the format HH:MM:SS and want to plot them against some values using datetime.time. It seems Python doesn't like the way I do it. Can someone please help?
import datetime
import matplotlib.pyplot as plt
# random data
x = [datetime.time(12,10,10), datetime.time(12, 11, 10)]
y = [1,5]
# plot
plt.plot(x,y)
plt.show()
*TypeError: float() argument must be a string or a number*
Well, it's a two-step story to get these to plot really nicely.
Step 1: prepare the data in a proper format
Convert each datetime into the float convention matplotlib uses for dates/times.
As usual, the devil is hidden in the details.
matplotlib dates are almost, but not quite, the same thing:
# mPlotDATEs.date2num.__doc__
#
# *d* is either a class `datetime` instance or a sequence of datetimes.
#
# Return value is a floating point number (or sequence of floats)
# which gives the number of days (fraction part represents hours,
# minutes, seconds) since 0001-01-01 00:00:00 UTC, *plus* *one*.
# The addition of one here is a historical artifact. Also, note
# that the Gregorian calendar is assumed; this is not universal
# practice. For details, see the module docstring.
So it is highly recommended to re-use matplotlib's own tools:
from matplotlib import dates as mPlotDATEs  # helper functions num2date()
                                            # and date2num() to convert to/from
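Numerically, the convention quoted in the docstring means that noon on 0001-01-01 maps to 1.5 (a sketch of the legacy epoch only; matplotlib >= 3.3 switched the default epoch to 1970-01-01, so a modern date2num returns different numbers):

```python
from datetime import datetime, timedelta

epoch = datetime(1, 1, 1)
dt = datetime(1, 1, 1, 12, 0, 0)
# days since 0001-01-01 00:00 (fraction = time of day), plus one
# (the "historical artifact" mentioned in the docstring)
legacy_num = (dt - epoch) / timedelta(days=1) + 1
print(legacy_num)  # 1.5
```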
Step 2: manage axis labels, formatting and scale (min/max) as the next issue
matplotlib brings you tools for this part too.
Check the code in this answer for all the details.
This is still a valid issue in Python 3.5.3 and Matplotlib 2.1.0.
A workaround is to use datetime.datetime objects instead of datetime.time ones:
import datetime
import matplotlib.pyplot as plt
# random data
x = [datetime.time(12,10,10), datetime.time(12, 11, 10)]
x_dt = [datetime.datetime.combine(datetime.date.today(), t) for t in x]
y = [1,5]
# plot
plt.plot(x_dt, y)
plt.show()
By default the date part should not be visible. Otherwise you can always use a DateFormatter:
import matplotlib.dates as mdates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
I came to this page because I have a similar issue. I have a Pandas DataFrame df with a datetime column df.dtm and a data column df.x, spanning several days, but I want to plot them using matplotlib.pyplot as a function of time of day, not date and time (datetime, datetimeindex). I.e., I want all data points to be folded into the same 24h range in the plot. I can plot df.x vs. df.dtm without issue, but I've just spent two hours trying to figure out how to convert df.dtm to df.time (containing the time of day without a date) and then plotting it. The (to me) straightforward solution does not work:
df.dtm = pd.to_datetime(df.dtm)
ax.plot(df.dtm, df.x)
# Works (with times on different dates; a range >24h)
df['time'] = df.dtm.dt.time
ax.plot(df.time, df.x)
# DOES NOT WORK: ConversionError('Failed to convert value(s) to axis '
matplotlib.units.ConversionError: Failed to convert value(s) to axis units:
array([datetime.time(0, 0), datetime.time(0, 5), etc.])
This does work:
pd.plotting.register_matplotlib_converters() # Needed to plot Pandas df with Matplotlib
df.dtm = pd.to_datetime(df.dtm, utc=True) # NOTE: MUST add a timezone, even if undesired
ax.plot(df.dtm, df.x)
# Works as before
df['time'] = df.dtm.dt.time
ax.plot(df.time, df.x)
# WORKS!!! (with time of day, all data in the same 24h range)
Note that the differences are in the first two lines. The first line allows better collaboration between Pandas and Matplotlib, the second seems redundant (or even wrong), but that doesn't matter in my case, since I use a single timezone and it is not plotted.
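An alternative fold that sidesteps the timezone question entirely (a sketch with toy data; pandas only) is to subtract each timestamp's own midnight and plot seconds since midnight:

```python
import pandas as pd

dtm = pd.to_datetime(pd.Series(['2021-01-01 06:30:00', '2021-01-02 18:15:00']))
# fold all points into one 24 h range: seconds since each day's midnight
sec_of_day = (dtm - dtm.dt.normalize()).dt.total_seconds()
print(list(sec_of_day))  # [23400.0, 65700.0]
```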
Given some arbitrary numpy array of data, how can I plot it to make dates appear on the x axis? In this case, sample 0 will be some time, say 7:00, and every sample afterwards will be spaced one minute apart, so that for 60 samples, the time displayed should be 7:00, 7:01, ..., 7:59.
I looked at some of the other questions on here but they all required actually setting a date and some other stuff that felt very over the top compared to what I'd like to do.
Thanks!
Christoph
If you use an array of datetime objects for your x-axis, the plot() function will behave like you want (assuming you don't want all 60 labels from 7:00 to 7:59 displayed). Here is some sample code:
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, date, time, timedelta

N = 60
t0 = datetime.combine(date.today(), time(7, 0, 0))
delta_t = timedelta(minutes=1)
x_axis = [t0 + i * delta_t for i in range(N)]  # one point per minute
plt.plot(x_axis, np.random.random(N))
plt.show()
Concerning the use of the combine() function, see the question "python time + timedelta equivalent".