id timestamp energy
0 a 2012-03-18 10:00:00 0.034
1 b 2012-03-20 10:30:00 0.052
2 c 2013-05-29 11:00:00 0.055
3 d 2014-06-20 01:00:00 0.028
4 a 2015-02-10 12:00:00 0.069
I want to plot this data as shown below, with only the time of day on the x-axis (not the date or the full datetime), because I want to see the values for each hour.
https://i.stack.imgur.com/u73eJ.png
But this code plots it like this:
plt.plot(df['timestamp'], df['energy'])
https://i.stack.imgur.com/yd6NL.png
I tried some code, but it only formats the x data to hide the date part, and the plot still looks like the second graph.
Note: df['timestamp'] is of datetime type.
What should I do? Thanks.
You can convert your datetime into a time. If df["timestamp"] is already in datetime format, then:
df["time"] = df["timestamp"].map(lambda x: x.time())
plt.plot(df['time'], df['energy'])
If df["timestamp"] is a string column, add one more line first: df["timestamp"] = pd.to_datetime(df["timestamp"])
Update: it looks like matplotlib does not accept time objects, so convert the times to strings instead:
df["time"] = df["timestamp"].map(lambda x: x.strftime("%H:%M"))
plt.scatter(df['time'], df['energy'])
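Note that string labels like these are placed as categories in row order, not on a continuous time scale. One possible alternative, sketched below, is to plot against a numeric hour of day. The column name hour_of_day is an assumption made for illustration, and the sample values are copied from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "a"],
    "timestamp": pd.to_datetime([
        "2012-03-18 10:00:00",
        "2012-03-20 10:30:00",
        "2013-05-29 11:00:00",
        "2014-06-20 01:00:00",
        "2015-02-10 12:00:00",
    ]),
    "energy": [0.034, 0.052, 0.055, 0.028, 0.069],
})

# Hours since midnight as a float (hypothetical helper column)
ts = df["timestamp"]
df["hour_of_day"] = ts.dt.hour + ts.dt.minute / 60 + ts.dt.second / 3600

fig, ax = plt.subplots()
ax.scatter(df["hour_of_day"], df["energy"])
ax.set_xlim(0, 24)               # one full day on the x-axis
ax.set_xticks(range(0, 25, 2))   # a tick every two hours
ax.set_xlabel("hour of day")
ax.set_ylabel("energy")
```

In an interactive session, finish with plt.show().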
First, check whether df["timestamp"] is already in datetime format.
If it is not, convert it (the call is harmless if it already is):
import pandas as pd
dates = pd.to_datetime(df["timestamp"])
print(dates.dtype)
Then,
import matplotlib.pyplot as plt
values = df['energy']
plt.plot_date(dates , values )
plt.xticks(rotation=45)
plt.show()
I'm working with a dataset that only contains datetime objects and I have retrieved the day of the week and reformatted the time in a separate column like this (conversion functions included below):
datetime day_of_week time_of_day
0 2021-06-13 12:56:16 Sunday 20:00:00
5 2021-06-13 12:56:54 Sunday 20:00:00
6 2021-06-13 12:57:27 Sunday 20:00:00
7 2021-07-16 18:55:42 Friday 20:00:00
8 2021-07-16 18:56:03 Friday 20:00:00
9 2021-06-04 18:42:06 Friday 20:00:00
10 2021-06-04 18:49:05 Friday 20:00:00
11 2021-06-04 18:58:22 Friday 20:00:00
What I would like to do is create a kde plot with x-axis = time_of_day (spanning 00:00:00 to 23:59:59), y-axis to be the count of each day_of_week at each hour of the day, and hue = day_of_week. In essence, I'd have seven different distributions representing occurrences during each day of the week.
Here's a sample of the data and my code. Any help would be appreciated:
df = pd.DataFrame([
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:16',
'2021-06-13 12:56:54',
'2021-06-13 12:56:54',
'2021-06-13 12:57:27',
'2021-07-16 18:55:42',
'2021-07-16 18:56:03',
'2021-06-04 18:42:06',
'2021-06-04 18:49:05',
'2021-06-04 18:58:22',
'2021-06-08 21:31:44',
'2021-06-09 02:14:30',
'2021-06-09 02:20:19',
'2021-06-12 18:05:47',
'2021-06-15 23:46:41',
'2021-06-15 23:47:18',
'2021-06-16 14:19:08',
'2021-06-17 19:08:17',
'2021-06-17 22:37:27',
'2021-06-21 23:31:32',
'2021-06-23 20:32:09',
'2021-06-24 16:04:21',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-05-22 18:29:02',
'2020-08-31 21:38:07',
'2020-08-31 21:38:22',
'2020-08-31 21:38:42',
'2020-08-31 21:39:03',
], columns=['datetime'])
def convert_date(date):
    return calendar.day_name[date.weekday()]
def convert_hour(time):
    return time[:2]+':00:00'
df['day_of_week'] = pd.to_datetime(df['datetime']).apply(convert_date)
df['time_of_day'] = df['datetime'].astype(str).apply(convert_hour)
Let's try:
1. Convert the datetime column with to_datetime.
2. Create a Categorical column from the day_of_week codes (so categorical ordering functions correctly).
3. Normalize time_of_day to a single day (so comparisons function correctly). This makes it seem like all events occurred within the same day, which makes the plotting logic much simpler.
4. Plot the kdeplot.
5. Set the x-axis formatter to display only HH:MM:SS.
import calendar
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt, dates as mdates
# df = pd.DataFrame({...})
# Convert to datetime
df['datetime'] = pd.to_datetime(df['datetime'])
# Create Categorical Column
cat_type = pd.CategoricalDtype(list(calendar.day_name), ordered=True)
df['day_of_week'] = pd.Categorical.from_codes(
df['datetime'].dt.day_of_week, dtype=cat_type
)
# Create Normalized Date Column
df['time_of_day'] = pd.to_datetime('2000-01-01 ' +
df['datetime'].dt.time.astype(str))
# Plot
ax = sns.kdeplot(data=df, x='time_of_day', hue='day_of_week')
# X axis format
ax.set_xlim([pd.to_datetime('2000-01-01 00:00:00'),
pd.to_datetime('2000-01-01 23:59:59')])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
plt.tight_layout()
plt.show()
Note that the sample size is small here.
If you are looking for counts on the y-axis, then histplot may be a better fit:
ax = sns.histplot(data=df, x='time_of_day', hue='day_of_week')
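If the goal is the raw count per weekday and hour rather than a smoothed density, the numbers behind such a histogram can also be tabulated directly. A minimal sketch, assuming pd.crosstab is an acceptable way to inspect the counts; the hour column and the counts variable are names invented here, and only a few timestamps from the question are used.

```python
import pandas as pd

# A few timestamps taken from the question's sample
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2021-06-13 12:56:16",
    "2021-06-13 12:56:54",
    "2021-06-13 12:57:27",
    "2021-07-16 18:55:42",
    "2021-07-16 18:56:03",
    "2021-06-04 18:42:06",
])})

df["day_of_week"] = df["datetime"].dt.day_name()
df["hour"] = df["datetime"].dt.hour  # hypothetical helper column

# Rows are hours of the day, columns are weekdays, cells are event counts
counts = pd.crosstab(df["hour"], df["day_of_week"])
print(counts)
```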
I would use pandas Timestamp straight away. By the way, your convert_hour function seems to be wrong: it gives time_of_day as 20:00:00 for all rows.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_context("paper", font_scale=2)
sns.set_style('whitegrid')
df['day_of_week'] = df['datetime'].apply(lambda x: pd.Timestamp(x).day_name())
df['time_of_day'] = df['datetime'].apply(lambda x: pd.Timestamp(x).hour)
plt.figure(figsize=(8, 4))
# Draw one KDE curve per weekday present in the data
days = df['day_of_week'].unique()
for day in days:
    sns.kdeplot(df[df.day_of_week == day]['time_of_day'], label=day)
The KDE for Wednesday looks a bit strange because the time varies between 2 and 20, hence the long tail from -20 to 40 in the plot.
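To see why convert_hour returned 20:00:00 for every row in the question: the first two characters of the datetime string are the century digits of the year, not the hour. A small sketch of the problem and one possible fix; convert_hour_fixed is a hypothetical name, and the fix assumes the strings always follow the YYYY-MM-DD HH:MM:SS layout.

```python
def convert_hour(time):
    # Version from the question: slices the first two characters of the string
    return time[:2] + ':00:00'

def convert_hour_fixed(time):
    # Hypothetical fix: characters 11-12 hold the hour in 'YYYY-MM-DD HH:MM:SS'
    return time[11:13] + ':00:00'

sample = '2021-06-13 12:56:16'
broken = convert_hour(sample)
fixed = convert_hour_fixed(sample)
print(broken, fixed)
```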
Here is simple code using df.plot.kde.
I added more data so that multiple values are present for each day_of_week for the KDE to plot, and simplified the code to remove the functions.
df1 = pd.DataFrame([
'2020-09-01 16:39:03',
'2020-09-02 16:39:03',
'2020-09-03 16:39:03',
'2020-09-04 16:39:03',
'2020-09-05 16:39:03',
'2020-09-06 16:39:03',
'2020-09-07 16:39:03',
'2020-09-08 16:39:03',
], columns=['datetime'])
df = pd.concat([df,df1]).reset_index(drop=True)
df['day_of_week'] = pd.to_datetime(df['datetime']).dt.day_name()
df['time_of_day'] = df['datetime'].str.split(expand=True)[1].str.split(':',expand=True)[0].astype(int)
df.pivot(columns='day_of_week').time_of_day.plot.kde()
Plots:
I have a dataframe covering a long time range, with a datetime64[ns] column and an int value.
Data looks like this:
MIN_DEP DELAY
0 2018-01-01 05:09:00 0
1 2018-01-01 05:13:00 0
2 2018-01-01 05:39:00 0
3 2018-01-01 05:43:00 0
4 2018-01-01 06:12:00 34
... ... ...
77005 2020-09-30 23:42:00 0
77006 2020-09-30 23:43:00 0
77007 2020-09-30 23:43:00 43
77008 2020-10-01 00:18:00 0
77009 2020-10-01 00:59:00 0
[77010 rows x 2 columns]
MIN_DEP datetime64[ns]
DELAY int64
dtype: object
The target is to plot all the data on a single 00:00 - 24:00 range on the x-axis, with no dates anymore.
When I try to plot it, the axis shows 00:00 at every tick. How can I fix this?
import matplotlib.dates as mdates
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
I tried converting the timestamps to dt.time first and then plotting:
pd_to_stat['time'] = pd.to_datetime(pd_to_stat['MIN_DEP'], format='%H:%M').dt.time
fig, ax = plt.subplots()
ax.plot(pd_to_stat['time'],pd_to_stat['DELAY'])
plt.show()
The plot does not allow that:
TypeError: float() argument must be a string or a number, not 'datetime.time'
According to your requirement, you need neither the date nor the seconds field of your timestamps, so a little preprocessing is needed first.
Formatting the column through the .dt accessor drops both the date and the seconds in one step (a Series has no strftime method of its own):
dataset['MIN_DEP'] = dataset['MIN_DEP'].dt.strftime("%H:%M")
If you need datetime.time objects rather than strings, apply this to the original datetime column instead:
dataset['MIN_DEP'] = pd.Series([val.time() for val in dataset['MIN_DEP']], index=dataset.index)
The string column can then be plotted in the usual manner (time objects raise the TypeError shown in the question).
This seems to work now. I had not noticed that the plot was still being split up by date. As a workaround, I had to replace all the dates with the same date and plot the data, hiding the date with DateFormatter:
import matplotlib.dates as mdates
pd_to_stat['MIN_DEP'] = pd_to_stat['MIN_DEP'].map(lambda t: t.replace(year=2020, month=1, day=1))
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
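The same idea can be written without a Python-level lambda. This is a sketch under assumptions: the TIME_ONLY column and the anchor date are arbitrary choices made here, and the rows are a handful taken from the question's sample. Sorting by the new column keeps a line plot from zigzagging back and forth across the day.

```python
import pandas as pd

# A handful of rows from the question's sample
pd_to_stat = pd.DataFrame({
    "MIN_DEP": pd.to_datetime([
        "2018-01-01 05:09:00",
        "2018-01-01 06:12:00",
        "2020-09-30 23:43:00",
        "2020-10-01 00:18:00",
    ]),
    "DELAY": [0, 34, 43, 0],
})

# Time elapsed since midnight of each row's own day
since_midnight = pd_to_stat["MIN_DEP"] - pd_to_stat["MIN_DEP"].dt.normalize()

# Move every row onto one arbitrary anchor day (assumed here: 2020-01-01)
anchor = pd.Timestamp("2020-01-01")
pd_to_stat["TIME_ONLY"] = anchor + since_midnight

# Order rows by time of day before plotting
pd_to_stat = pd_to_stat.sort_values("TIME_ONLY")
print(pd_to_stat)
```

The TIME_ONLY column can then be plotted with the DateFormatter('%H:%M') setup shown above.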
I have a dataset with measurements acquired roughly every two hours over a week. I would like to calculate the mean of measurements taken at the same time of day on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
#generating test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today +
timedelta(72), freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100,
size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment, 'measurment': data})
df = df.set_index('measurementTimestamp')
#Calculating the mean for measurments taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which allows you to easily create 2-hour time bins.
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
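To make the binning concrete, here is a small sketch that swaps floor for plain integer arithmetic on the hour, which produces the integer labels (0, 2, 4, ...) asked for in the question. The timestamps and values are made up for illustration, and two_hour_bin is a hypothetical name.

```python
import pandas as pd

# Made-up measurements on different days (illustrative values only)
index = pd.to_datetime([
    "2021-01-01 00:10",
    "2021-01-01 01:50",
    "2021-01-02 00:30",
    "2021-01-02 13:00",
    "2021-01-03 12:15",
])
df = pd.DataFrame({"measurment": [10, 20, 30, 40, 60]}, index=index)

# Start hour of the 2-hour bin each row falls into: 0, 2, 4, ..., 22
two_hour_bin = (df.index.hour // 2) * 2

# Mean of all measurements sharing a bin, regardless of the date
result = df.groupby(two_hour_bin).mean()
print(result)
```

Here the three readings between 00:00 and 01:59 land in bin 0 and the two readings between 12:00 and 13:59 land in bin 12.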
I am using Pandas dataframes with DatetimeIndex to manipulate timeseries data. The data is stored at UTC time and I usually keep it that way (with naive DatetimeIndex), and only use timezones for output. I like it that way because nothing in the world confuses me more than trying to manipulate timezones.
e.g.
In: ts = pd.date_range('2017-01-01 00:00','2017-12-31 23:30',freq='30Min')
data = np.random.rand(17520,1)
df= pd.DataFrame(data,index=ts,columns = ['data'])
df.head()
Out[15]:
data
2017-01-01 00:00:00 0.697478
2017-01-01 00:30:00 0.506914
2017-01-01 01:00:00 0.792484
2017-01-01 01:30:00 0.043271
2017-01-01 02:00:00 0.558461
I want to plot a chart of data versus time for each day of the year so I reshape the dataframe to have time along the index and dates for columns
df.index = [df.index.time,df.index.date]
df_new = df['data'].unstack()
In: df_new.head()
Out :
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 \
00:00:00 0.697478 0.143626 0.189567 0.061872 0.748223
00:30:00 0.506914 0.470634 0.430101 0.551144 0.081071
01:00:00 0.792484 0.045259 0.748604 0.305681 0.333207
01:30:00 0.043271 0.276888 0.034643 0.413243 0.921668
02:00:00 0.558461 0.723032 0.293308 0.597601 0.120549
If I'm not worried about timezones, I can plot like this:
fig, ax = plt.subplots()
ax.plot(df_new.index,df_new)
But I want to plot the data in the local timezone (tz = pytz.timezone('Australia/Sydney')), making allowance for daylight saving time. The times and dates are no longer Timestamp objects, so I can't use Pandas timezone handling. Or can I?
Assuming I can't, I'm trying to do the shift manually (given DST starts 1/10 at 2am and finishes 1/4 at 2am), and I've got this far:
df_new[[c for c in df_new.columns if c >= dt.datetime(2017,4,1) and c <dt.datetime(2017,10,1)]].shift_by(+10)
df_new[[c for c in df_new.columns if c < dt.datetime(2017,4,1) or c >= dt.datetime(2017,10,1)]].shift_by(+11)
but am not sure how to write the function shift_by.
(This doesn't handle midnight to 2am on the changeover days correctly, which is not ideal, but I could live with that.)
Use tz_localize + tz_convert (directly on a DatetimeIndex, or through the .dt accessor on a Series) to convert the dataframe dates to a particular timezone.
df.index = df.index.tz_localize('UTC').tz_convert('Australia/Sydney')
df.index = [df.index.time, df.index.date]
Be a little careful when creating the MultiIndex: as you observed, it creates two rows of duplicate timestamps (the local hour is repeated when daylight saving ends). If that's the case, get rid of them with duplicated:
df = df[~df.index.duplicated()]
df = df['data'].unstack()
You can also create subplots with df.plot:
df.plot(subplots=True)
plt.show()
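A small sketch of what the conversion does around the daylight saving changeover, assuming a system timezone database is available. The four UTC timestamps are picked for illustration, and wall_clock is a name invented here.

```python
import pandas as pd

# Naive timestamps that are known to be UTC (illustrative values)
utc_index = pd.DatetimeIndex([
    "2017-01-01 00:00",  # Sydney summer time
    "2017-07-01 00:00",  # Sydney standard time
    "2017-04-01 15:30",  # shortly before daylight saving ends
    "2017-04-01 16:30",  # shortly after daylight saving ends
])

# Attach UTC, then convert to local Sydney time
local = utc_index.tz_localize("UTC").tz_convert("Australia/Sydney")

# (time, date) pairs as used for the MultiIndex above
wall_clock = list(zip(local.time, local.date))
for utc, pair in zip(utc_index, wall_clock):
    print(utc, "->", pair)
```

The last two UTC timestamps map to the same local time and date, which is why the duplicated check is needed before unstacking.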
I am working with a dataframe containing data of 1 week.
y
ds
2017-08-31 10:15:00 1.000000
2017-08-31 10:20:00 1.049107
2017-08-31 10:25:00 1.098214
...
2017-09-07 10:05:00 99.901786
2017-09-07 10:10:00 99.950893
2017-09-07 10:15:00 100.000000
I create a new index by combining the weekday and time i.e.
y
dayIndex
4 - 10:15 1.000000
4 - 10:20 1.049107
4 - 10:25 1.098214
...
4 - 10:05 99.901786
4 - 10:10 99.950893
4 - 10:15 100.000000
The plot of this data is the following:
The plot is correct as the labels reflect the data in the dataframe. However, when zooming in, the labels do not seem correct as they no longer correspond to their original values:
What is causing this behavior?
Here is the code to reproduce this:
import datetime
import numpy as np
import pandas as pd
dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow , dtnow + datetime.timedelta(7), freq='5T')
data = np.linspace(1,100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
df = df.resample('5T').mean()
df['dayIndex'] = df.index.strftime('%w - %H:%M')
df= df.set_index('dayIndex')
df.plot()
"What is causing this behavior?"
The formatter of an axes of a pandas dates plot is a matplotlib.ticker.FixedFormatter (see e.g. print(plt.gca().xaxis.get_major_formatter())). "Fixed" means that it formats the ith tick (if shown) with some constant string.
When zooming or panning, you shift the tick locations, but not the format strings.
In short: A pandas date plot may not be the best choice for interactive plots.
Solution
A solution is usually to use matplotlib formatters directly. This requires the dates to be datetime objects (which can be ensured using df.index.to_pydatetime()).
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates
dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow , dtnow + datetime.timedelta(7), freq='110T')
data = np.linspace(1,100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
df.index.to_pydatetime()
df.plot(marker="o")
plt.gca().xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%w - %H:%M'))
plt.show()
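An alternative sketch that bypasses the pandas plotting wrapper and hands the datetime objects to matplotlib directly, so that the DateFormatter recomputes each label from the tick position when zooming. The fixed start date is an assumption chosen so that the example is reproducible; it is not part of the original code.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fixed start instead of datetime.now(), for reproducibility (assumption)
start = datetime.datetime(2017, 8, 31, 10, 15)
dindex = pd.date_range(start, start + datetime.timedelta(7),
                       freq=datetime.timedelta(minutes=110))
data = np.linspace(1, 100, num=len(dindex))

fig, ax = plt.subplots()
# Plain datetime objects go through matplotlib's own date converter
ax.plot(dindex.to_pydatetime(), data, marker="o")

# Labels are derived from the tick position, not from a fixed list of strings
formatter = mdates.DateFormatter('%w - %H:%M')
ax.xaxis.set_major_formatter(formatter)
fig.autofmt_xdate()
```

In an interactive session, finish with plt.show() and zoom in to check that the labels follow the data.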