Wrong labels when plotting a time series pandas dataframe with matplotlib - python

I am working with a dataframe containing one week of data.
ds                            y
2017-08-31 10:15:00    1.000000
2017-08-31 10:20:00    1.049107
2017-08-31 10:25:00    1.098214
...
2017-09-07 10:05:00   99.901786
2017-09-07 10:10:00   99.950893
2017-09-07 10:15:00  100.000000
I create a new index by combining the weekday and time i.e.
dayIndex              y
4 - 10:15      1.000000
4 - 10:20      1.049107
4 - 10:25      1.098214
...
4 - 10:05     99.901786
4 - 10:10     99.950893
4 - 10:15    100.000000
The plot of this data is the following:
The plot is correct as the labels reflect the data in the dataframe. However, when zooming in, the labels do not seem correct as they no longer correspond to their original values:
What is causing this behavior?
Here is the code to reproduce this:
import datetime
import numpy as np
import pandas as pd
dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow, dtnow + datetime.timedelta(7), freq='5T')
data = np.linspace(1,100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
df = df.resample('5T').mean()
df['dayIndex'] = df.index.strftime('%w - %H:%M')
df = df.set_index('dayIndex')
df.plot()

"What is causing this behavior?"
The formatter of the x-axis of a pandas dates plot is a matplotlib.ticker.FixedFormatter (see e.g. print(plt.gca().xaxis.get_major_formatter())). "Fixed" means that it formats the i-th tick (if shown) with some constant string.
When zooming or panning, you shift the tick locations, but not the format strings.
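To see this for yourself, here is a minimal sketch (assuming the df with the string dayIndex from the reproduction code above is in scope):
import matplotlib.pyplot as plt
ax = df.plot()
fmt = ax.xaxis.get_major_formatter()
print(fmt)          # expected: a matplotlib.ticker.FixedFormatter instance
print(fmt.seq[:5])  # the fixed label strings attached to the ticks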
In short: A pandas date plot may not be the best choice for interactive plots.
Solution
A solution is usually to use matplotlib formatters directly. This requires the dates to be datetime objects (which can be ensured using df.index.to_pydatetime()).
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates
dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow, dtnow + datetime.timedelta(7), freq='110T')
data = np.linspace(1, 100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
# Ensure the index holds datetime objects so matplotlib's date machinery is used
df.index = df.index.to_pydatetime()
df.plot(marker="o")
plt.gca().xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%w - %H:%M'))
plt.show()

Related

seaborn : plotting time on x-axis

I'm working with a dataset that only contains datetime objects and I have retrieved the day of the week and reformatted the time in a separate column like this (conversion functions included below):
datetime day_of_week time_of_day
0 2021-06-13 12:56:16 Sunday 20:00:00
5 2021-06-13 12:56:54 Sunday 20:00:00
6 2021-06-13 12:57:27 Sunday 20:00:00
7 2021-07-16 18:55:42 Friday 20:00:00
8 2021-07-16 18:56:03 Friday 20:00:00
9 2021-06-04 18:42:06 Friday 20:00:00
10 2021-06-04 18:49:05 Friday 20:00:00
11 2021-06-04 18:58:22 Friday 20:00:00
What I would like to do is create a kde plot with x-axis = time_of_day (spanning 00:00:00 to 23:59:59), y-axis to be the count of each day_of_week at each hour of the day, and hue = day_of_week. In essence, I'd have seven different distributions representing occurrences during each day of the week.
Here's a sample of the data and my code. Any help would be appreciated:
import calendar
import pandas as pd

df = pd.DataFrame([
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:54',
'2021-06-13 12:56:54',
'2021-06-13 12:57:27',
'2021-07-16 18:55:42',
'2021-07-16 18:56:03',
'2021-06-04 18:42:06',
'2021-06-04 18:49:05',
'2021-06-04 18:58:22',
'2021-06-08 21:31:44',
'2021-06-09 02:14:30',
'2021-06-09 02:20:19',
'2021-06-12 18:05:47',
'2021-06-15 23:46:41',
'2021-06-15 23:47:18',
'2021-06-16 14:19:08',
'2021-06-17 19:08:17',
'2021-06-17 22:37:27',
'2021-06-21 23:31:32',
'2021-06-23 20:32:09',
'2021-06-24 16:04:21',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-08-31 21:38:07',
'2020-08-31 21:38:22',
'2020-08-31 21:38:42',
'2020-08-31 21:39:03',
], columns=['datetime'])
def convert_date(date):
    return calendar.day_name[date.weekday()]

def convert_hour(time):
    return time[:2] + ':00:00'
df['day_of_week'] = pd.to_datetime(df['datetime']).apply(convert_date)
df['time_of_day'] = df['datetime'].astype(str).apply(convert_hour)
Let's try:
- converting the datetime column with to_datetime
- creating a Categorical column from the day_of_week codes (so categorical ordering functions correctly)
- normalizing time_of_day to a single day (so comparisons function correctly); this makes it seem as if all events occurred within the same day, which keeps the plotting logic much simpler
- plotting the kdeplot
- setting the x-axis formatter to display only HH:MM:SS
import calendar
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt, dates as mdates
# df = pd.DataFrame({...})
# Convert to datetime
df['datetime'] = pd.to_datetime(df['datetime'])
# Create Categorical Column
cat_type = pd.CategoricalDtype(list(calendar.day_name), ordered=True)
df['day_of_week'] = pd.Categorical.from_codes(
    df['datetime'].dt.day_of_week, dtype=cat_type
)
# Create Normalized Date Column
df['time_of_day'] = pd.to_datetime('2000-01-01 ' +
                                   df['datetime'].dt.time.astype(str))
# Plot
ax = sns.kdeplot(data=df, x='time_of_day', hue='day_of_week')
# X axis format
ax.set_xlim([pd.to_datetime('2000-01-01 00:00:00'),
             pd.to_datetime('2000-01-01 23:59:59')])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
plt.tight_layout()
plt.show()
Note sample size is small here:
If you're looking for counts on the y-axis, then histplot may be a better fit:
ax = sns.histplot(data=df, x='time_of_day', hue='day_of_week')
I would use pandas' Timestamp straight away. By the way, your convert_hour function seems to be wrong: it gives a time_of_day of 20:00:00 for all rows.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_context("paper", font_scale=2)
sns.set_style('whitegrid')
df['day_of_week'] = df['datetime'].apply(lambda x: pd.Timestamp(x).day_name())
df['time_of_day'] = df['datetime'].apply(lambda x: pd.Timestamp(x).hour)
days = df['day_of_week'].unique()  # the days actually present in the data
plt.figure(figsize=(8, 4))
for day in days:
    sns.kdeplot(df[df.day_of_week == day]['time_of_day'], label=day)
plt.legend()
The KDE for Wednesday looks a bit strange because the time varies between 2 and 20, hence the long tail from -20 to 40 in the plot.
Here is a simple approach using df.plot.kde. I added more data so that each day_of_week has multiple values for the KDE to plot, and simplified the code to remove the helper functions.
df1 = pd.DataFrame([
'2020-09-01 16:39:03',
'2020-09-02 16:39:03',
'2020-09-03 16:39:03',
'2020-09-04 16:39:03',
'2020-09-05 16:39:03',
'2020-09-06 16:39:03',
'2020-09-07 16:39:03',
'2020-09-08 16:39:03',
], columns=['datetime'])
df = pd.concat([df,df1]).reset_index(drop=True)
df['day_of_week'] = pd.to_datetime(df['datetime']).dt.day_name()
df['time_of_day'] = df['datetime'].str.split(expand=True)[1].str.split(':',expand=True)[0].astype(int)
df.pivot(columns='day_of_week').time_of_day.plot.kde()
Plots:

pd.Grouper to shift the bin intervals

In pandas 1.2, I would like, if possible, to adjust origin and offset so that the bins in the example below fall on exact yyyy-05-15 and yyyy-11-15 six-month (6M) intervals (see the pd.Grouper documentation).
My attempt with the following code:
import random
import pandas as pd
n_rows = 600
df = pd.DataFrame({'date': pd.date_range(periods=n_rows, end='2020-04-15'),
                   'current_start_date': '2020-11-15',
                   'current_end_date': '2021-05-15',
                   'a': range(n_rows)})
df[['current_start_date', 'current_end_date']] = df[['current_start_date',
                                                     'current_end_date']].apply(pd.to_datetime)
end = df['current_end_date'].iloc[0]
df.groupby(pd.Grouper(freq='6M', key='date', origin=end, label='right', closed='right')).sum()
produces
Out[3]:
a
date
2018-08-31 21
2019-02-28 17557
2019-08-31 51428
2020-02-29 84175
2020-08-31 26519
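This question has no answer in the thread. As far as I know, origin only takes effect for fixed (Tick) frequencies such as days or hours, not for month-based frequencies like '6M', so one possible workaround (a sketch, not a tested answer) is to build the yyyy-05-15 / yyyy-11-15 bin edges explicitly and group with pd.cut:
import pandas as pd
# Sketch of a workaround (assumes the df from the question above is in scope).
# Build explicit 6-month bin edges ending at current_end_date (2021-05-15),
# stepping back in calendar months so every edge lands on the 15th.
end = df['current_end_date'].iloc[0]
edges = pd.date_range(end=end, periods=8, freq=pd.DateOffset(months=6))
# Bin the dates at those edges and sum 'a' per bin, labelling each bin by its right edge.
binned = pd.cut(df['date'], bins=edges, right=True, labels=edges[1:])
print(df.groupby(binned, observed=False)['a'].sum())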

How to plot data with time on the x-axis, not datetime

id timestamp energy
0 a 2012-03-18 10:00:00 0.034
1 b 2012-03-20 10:30:00 0.052
2 c 2013-05-29 11:00:00 0.055
3 d 2014-06-20 01:00:00 0.028
4 a 2015-02-10 12:00:00 0.069
I want to plot these data like below, with just the time on the x-axis, not the date or the full datetime, because I want to see the values for each hour of the day.
https://i.stack.imgur.com/u73eJ.png
But this code plots it like this:
plt.plot(df['timestamp'], df['energy'])
https://i.stack.imgur.com/yd6NL.png
I tried some approaches, but they just format the x labels to hide the date part and still plot like the second graph.
Note: df['timestamp'] is of datetime type.
What should I do? Thanks.
You can convert your datetime into a time. If your df["timestamp"] is already in datetime format, then:
df["time"] = df["timestamp"].map(lambda x: x.time())
plt.plot(df['time'], df['energy'])
if df["timestamp"] is of type string then you can add one more line in front as df["timestamp"] = pd.to_datetime(df["timestamp"])
Update: look like matplotlib does not accept time types, just convert to string
df["time"] = df["timestamp"].map(lambda x: x.strftime("%H:%M"))
plt.scatter(df['time'], df['energy'])
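Another option (a sketch, not part of the original answer): anchor every timestamp to the same dummy date so matplotlib spaces the points correctly by time of day, then format the axis to show only the time.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Sketch: assumes df['timestamp'] is already datetime (use pd.to_datetime otherwise).
# Attach every time of day to the same dummy date so only the clock time matters.
t = pd.to_datetime('2000-01-01 ' + df['timestamp'].dt.strftime('%H:%M:%S'))
plt.scatter(t, df['energy'])
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.show()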
First, check whether df["timestamp"] is in datetime format.
If not:
import pandas as pd
dates = pd.to_datetime(df["timestamp"])
print(type(dates))
Then:
import matplotlib.pyplot as plt
values = df['energy']
plt.plot_date(dates, values)
plt.xticks(rotation=45)
plt.show()

Resampling at irregular intervals

I have a regularly spaced time series stored in a pandas data frame:
1998-01-01 00:00:00 5.71
1998-01-01 12:00:00 5.73
1998-01-02 00:00:00 5.68
1998-01-02 12:00:00 5.69
...
I also have a list of dates that are irregularly spaced:
1998-01-01
1998-07-05
1998-09-21
....
I would like to calculate the average of the time series between each time interval of the list of dates. Is this somehow possible using pandas.DataFrame.resample? If not, what is the easiest way to do it?
Edited:
For example, calculate the mean of 'series' in between the dates in 'dates', created by the following code:
import pandas as pd
import numpy as np
import datetime
rng = pd.date_range('1998-01-01', periods=365, freq='D')
series = pd.DataFrame(np.random.randn(len(rng)), index=rng)
dates = [pd.Timestamp('1998-01-01'), pd.Timestamp('1998-07-05'), pd.Timestamp('1998-09-21')]
You can loop through the dates and select only the rows falling in between those dates, like this:
import pandas as pd
import numpy as np
import datetime
rng = pd.date_range('1998-01-01', periods=365, freq='D')
series = pd.DataFrame(np.random.randn(len(rng)), index=rng)
dates = [pd.Timestamp('1998-01-01'), pd.Timestamp('1998-07-05'), pd.Timestamp('1998-09-21')]
for i in range(len(dates)-1):
    start = dates[i]
    end = dates[i+1]
    sample = series.loc[(series.index > start) & (series.index <= end)]
    print(f'Mean value between {start} and {end} : {sample.mean()[0]}')
# Output
Mean value between 1998-01-01 00:00:00 and 1998-07-05 00:00:00 : -0.024342221543215112
Mean value between 1998-07-05 00:00:00 and 1998-09-21 00:00:00 : 0.13945008064765074
Instead of a loop, you can also use a list comprehension like this:
print([series.loc[(series.index > dates[i]) & (series.index <= dates[i+1])].mean()[0] for i in range(len(dates) - 1) ]) # [-0.024342221543215112, 0.13945008064765074]
You could iterate over the dates like this:
for ti in range(1, len(dates)):
    start_date, end_date = dates[ti-1], dates[ti]
    mask = (series.index > start_date) & (series.index <= end_date)
    print(series[mask].mean())
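For a loop-free alternative (a sketch, not from either answer above), you can bin the index with pd.cut at the irregular dates and take a groupby mean:
import pandas as pd
# Sketch: assumes 'series' and 'dates' from the example code above.
# pd.cut labels each row with the (start, end] interval it falls into;
# rows outside the listed dates get NaN and are ignored by groupby.
bins = pd.cut(series.index, bins=dates, right=True)
print(series.groupby(bins, observed=False).mean())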

Plotting CSV data using matplotlib and pandas in Python

This is what I currently have. I need to plot time on the x-axis and turbidity on the y-axis. Before plotting, the volts from the CSV file need to go through the equation Turbidity = (0.07642 * volts) + (-15.122) and then be graphed. I am also getting a date error; the columns are posted below. How can I get it to overlook the Logger Time and the LoggerID? I just need the date/time on the x-axis and the raw sensor value converted to turbidity on the y-axis.
Date/Time (UTC) Logger Time (unix timestamp) Raw Sensor (mV) LoggerID
6/27/2018 18:45 1530125111 4.61 Mill Creek B
7/3/2018 18:30 1530642609 92.14 Mill Creek B
7/3/2018 18:45 1530643509 92.03 Mill Creek B
7/3/2018 20:00 1530648013 91.24 Mill Creek B
...
import pandas as pd
from datetime import datetime
import csv
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
headers = ['Raw Sensor','Date','Time']
df = pd.read_csv('turbiditydata.csv',names=headers)
print (df)
df['Date'] = df['Date'].map(lambda x: datetime.strptime(str(x), '%d/%m/%y %H:%M'))
x = df['Date']
y = df['Turbidity']
plt.plot(x,y)
plt.gcf().autofmt_xdate()
plt.title('Turbidity Over Time')
plt.show()
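This question has no answer in the thread, but here is a minimal sketch of one way to do it (assumptions: the CSV file actually has the header row shown in the table above, the file is named turbiditydata.csv, and the calibration equation is the one quoted in the question):
import pandas as pd
import matplotlib.pyplot as plt
# Sketch: read only the two columns that are needed, ignoring Logger Time and LoggerID.
df = pd.read_csv('turbiditydata.csv',
                 usecols=['Date/Time (UTC)', 'Raw Sensor (mV)'],
                 parse_dates=['Date/Time (UTC)'])
# Apply the calibration equation from the question to convert mV to turbidity.
df['Turbidity'] = 0.07642 * df['Raw Sensor (mV)'] + (-15.122)
plt.plot(df['Date/Time (UTC)'], df['Turbidity'])
plt.gcf().autofmt_xdate()
plt.title('Turbidity Over Time')
plt.show()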
