Plotting a histogram with a timedelta series - Python

I have a pandas series whose values are of timedelta dtype. I want to plot these timedeltas as a bar chart in which the y-axis is marked in hours only, not in some other format. When I tried a line plot in matplotlib earlier, the axis showed numbers I could not interpret. Here is a sample of the series:
date
2020-04-11 0 days 02:00:00
2020-04-12 0 days 03:00:00
2020-04-13 0 days 02:00:00
2020-04-14 0 days 03:00:00
2020-04-15 0 days 01:00:00
2020-04-16 0 days 03:00:00
Freq: D, dtype: timedelta64[ns]
When I plot it in matplotlib, the y-axis values look odd to me.
Please help me produce a plot whose y-axis tick labels are in a format like 01:00, 02:00.
I look forward to any help.

One possible way is to convert the deltas to seconds and define a FuncFormatter.
This is my test series and my final plot:
2020-04-11 02:00:00
2020-04-12 03:00:00
2020-04-13 05:00:00
dtype: timedelta64[ns]
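import datetime
import matplotlib.pyplot as plt
import matplotlib.ticker
import numpy as np

def delta(x, pos):
    # x is a tick value in seconds; render it as H:MM:SS
    out = str(datetime.timedelta(seconds=x))
    return out

fig = plt.figure()
ax = fig.gca()
form = matplotlib.ticker.FuncFormatter(delta)
ax.yaxis.set_major_formatter(form)
# s is the timedelta series shown above, converted to seconds
ax.plot(s.index, s/np.timedelta64(1,'s'))
ax.set_yticks(s/np.timedelta64(1,'s'))
ax.set_xticks(s.index)
plt.show()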
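Since the question asks for a bar chart with tick labels such as 01:00 and 02:00, here is a minimal sketch of a variant of the approach above. It assumes sample data modelled on the question, and hhmm is a hypothetical helper (not part of matplotlib) used as the formatter. The Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter, MultipleLocator

# Sample data modelled on the series in the question.
s = pd.Series(
    pd.to_timedelta([2, 3, 2, 3, 1, 3], unit="h"),
    index=pd.date_range("2020-04-11", periods=6, freq="D"),
)

def hhmm(x, pos=None):
    """Hypothetical helper: format a number of seconds as HH:MM."""
    hours, remainder = divmod(int(round(x)), 3600)
    return f"{hours:02d}:{remainder // 60:02d}"

# Bar heights in seconds, so that the formatter receives seconds.
seconds = s.dt.total_seconds()

fig, ax = plt.subplots()
ax.bar(s.index.strftime("%Y-%m-%d"), seconds)      # one bar per date
ax.yaxis.set_major_locator(MultipleLocator(3600))  # one tick per hour
ax.yaxis.set_major_formatter(FuncFormatter(hhmm))
fig.autofmt_xdate()
```

Call plt.show() or fig.savefig(...) afterwards to view the result.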

Related

How to plot part of date time in Python

I have a large CSV dataset with one row for every hour of a whole year. Plotting all of the data (or specific columns) over the whole year has not been difficult.
However, I would like to take a closer look at a single month (for example, plot just January or February), and for the life of me I haven't found out how to do that.
Date Company1 Company2
2020-01-01 00:00:00 100 200
2020-01-01 01:00:00 110 180
2020-01-01 02:00:00 90 210
2020-01-01 03:00:00 100 200
.... ... ...
2020-12-31 21:00:00 100 200
2020-12-31 22:00:00 80 230
2020-12-31 23:00:00 120 220
All of the columns are correctly formatted, the datetime is correctly formatted. How can I slice or define exactly the period I want to plot?
You can extract the month portion of a pandas datetime using .dt.month on a datetime series. Then check if that is equal to the month in question:
df_january = df[df['Date'].dt.month == 1]
You can then plot using your df_january dataframe. N.B. this will pick up data from other years as well if your dataset expanded to cover other years.
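To avoid picking up the same month from other years, the year can be checked as well, or the rows can be selected by a date string once the dates are the index. A minimal sketch, assuming the column names from the question and made-up values:

```python
import numpy as np
import pandas as pd

# Made-up hourly data with the same layout as in the question.
dates = pd.date_range("2020-01-01", "2020-12-31 23:00:00",
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "Date": dates,
    "Company1": np.arange(len(dates)),
    "Company2": np.arange(len(dates))[::-1],
})

# Option 1: boolean mask on both year and month.
mask = (df["Date"].dt.year == 2020) & (df["Date"].dt.month == 2)
df_february = df[mask]

# Option 2: partial string indexing on a DatetimeIndex.
df_indexed = df.set_index("Date")
df_january = df_indexed.loc["2020-01"]

# Any explicit period works the same way; both endpoints are included.
first_week = df_indexed.loc["2020-01-01":"2020-01-07"]
```

Calling df_january.plot() then draws only that month.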
@WakemeUpNow had the solution I hadn't noticed. Defining xlim while plotting did the trick.
df.plot(x='Date', y='Company1', xlim=('2020-01-01 00:00:00', '2020-01-31 23:00:00'))
plt.show()

Plot non-continuous time series with Matplotlib as datetime not string [duplicate]

I am trying to plot stock price data that was collected hourly from 8:00 am to 4:00 pm on working days. Because the market is closed from 4:00 pm to 8:00 am, the plot of this series has long straight lines in between. Is there a Matplotlib hack to remove these lines and plot the data continuously?
The following snippet shows the point where the break between days occurs.
date price
2021-01-01 08:00:03 338.50
2021-01-01 09:00:02 338.50
2021-01-01 10:00:03 338.50
2021-01-01 11:00:03 338.50
2021-01-01 12:00:02 338.50
2021-01-01 13:00:02 338.50
2021-01-01 14:00:02 338.50
2021-01-01 15:00:03 338.50
2021-01-01 16:00:02 338.50 <------ Break
2021-01-04 08:00:04 338.50
2021-01-04 09:00:06 335.61
2021-01-04 10:02:09 332.08
2021-01-04 11:00:05 331.11
2021-01-04 12:00:40 330.78
2021-01-04 13:00:03 331.93
2021-01-04 14:00:03 333.00
2021-01-04 15:00:04 334.59
2021-01-04 16:00:03 334.59
Following image shows the gaps that I want to remove!
I tried to plot the days iteratively as follows. The step size 9 in the script is the number of hourly samples per day, from 8:00 am to 4:00 pm.
for i in range(0, 72, 9):
    plt.plot(uber_df['date'][i:i+9], uber_df['price'][i:i+9])
plt.show()
I got the following plot:
You can get rid of the gaps if your x-axis holds ordered categorical (ordinal) values. One way of achieving this is to convert the datetime objects in your 'date' column to strings:
df['date'] = df['date'].astype(str)
df = df.set_index('date')
df.plot()
plt.gcf().autofmt_xdate()
plt.show()
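A variant of the same idea keeps the 'date' column as datetimes and plots against the row position instead, using the dates only for tick labels. This is a sketch under assumptions: the sample frame is made up, and the tick spacing (every third row) is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up prices for two trading days separated by a weekend.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 14:00:02", "2021-01-01 15:00:03", "2021-01-01 16:00:02",
        "2021-01-04 08:00:04", "2021-01-04 09:00:06", "2021-01-04 10:02:09",
    ]),
    "price": [338.50, 338.50, 338.50, 338.50, 335.61, 332.08],
})

# Plot against 0, 1, 2, ... so closed-market hours take up no space.
positions = np.arange(len(df))
fig, ax = plt.subplots()
ax.plot(positions, df["price"])

# Label a subset of positions with the timestamps they stand for.
tick_pos = positions[::3]
tick_labels = df["date"].dt.strftime("%m-%d %H:%M").iloc[tick_pos].tolist()
ax.set_xticks(tick_pos)
ax.set_xticklabels(tick_labels, rotation=30, ha="right")
```

Unlike the string conversion, this leaves the original column available for further datetime operations.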

How to plot time only of pandas datetime64[ns] attribute

I have a dataframe covering a long time range, with a datetime64[ns] column and an int value.
Data looks like this:
MIN_DEP DELAY
0 2018-01-01 05:09:00 0
1 2018-01-01 05:13:00 0
2 2018-01-01 05:39:00 0
3 2018-01-01 05:43:00 0
4 2018-01-01 06:12:00 34
... ... ...
77005 2020-09-30 23:42:00 0
77006 2020-09-30 23:43:00 0
77007 2020-09-30 23:43:00 43
77008 2020-10-01 00:18:00 0
77009 2020-10-01 00:59:00 0
[77010 rows x 2 columns]
MIN_DEP datetime64[ns]
DELAY int64
dtype: object
The goal is to plot all the data over a single 00:00 - 24:00 range on the x-axis, with no dates.
When I try to plot it, every tick on the time axis reads 00:00. How can I fix this?
import matplotlib.dates as mdates
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
I tried converting the timestamps to dt.time first and plotting those:
pd_to_stat['time'] = pd.to_datetime(pd_to_stat['MIN_DEP'], format='%H:%M').dt.time
fig, ax = plt.subplots()
ax.plot(pd_to_stat['time'],pd_to_stat['DELAY'])
plt.show()
Matplotlib does not allow that:
TypeError: float() argument must be a string or a number, not 'datetime.time'
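Judging by your requirement, you need neither the date nor the seconds field of the timestamp, so a little preprocessing is needed first.
To drop the seconds (and the date), format the column as HH:MM strings through the .dt accessor:
dataset['MIN_DEP'] = dataset['MIN_DEP'].dt.strftime("%H:%M")
Alternatively, while the column is still datetime64[ns] (that is, instead of the step above), you can remove the date and keep datetime.time objects in the following manner:
dataset['MIN_DEP'] = pd.Series([val.time() for val in dataset['MIN_DEP']])
Then you can plot your data in the usual manner. Note that matplotlib cannot plot datetime.time values directly (see the TypeError above), so the string form is the one to pass to plot.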
This works now. I had not realised that the plot was still split up by date. As a workaround, I had to replace all the dates with the same date and plot the result, hiding the date with DateFormatter.
import matplotlib.dates as mdates
pd_to_stat['MIN_DEP'] = pd_to_stat['MIN_DEP'].map(lambda t: t.replace(year=2020, month=1, day=1))
fig, ax = plt.subplots()
ax.plot(pd_to_stat['MIN_DEP'],pd_to_stat['DELAY'])
xfmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
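Another option, sketched here under assumptions, is to convert each timestamp to the number of hours since midnight and plot that number on the x-axis. The frame below is made up, and the column name HOUR_OF_DAY is hypothetical.

```python
import pandas as pd

# Made-up sample with the same columns as in the question.
pd_to_stat = pd.DataFrame({
    "MIN_DEP": pd.to_datetime([
        "2018-01-01 05:09:00", "2018-01-01 06:12:00",
        "2020-09-30 23:43:00", "2020-10-01 00:18:00",
    ]),
    "DELAY": [0, 34, 43, 0],
})

# Time elapsed since midnight of the same day, as a timedelta.
since_midnight = pd_to_stat["MIN_DEP"] - pd_to_stat["MIN_DEP"].dt.normalize()

# Express it in hours so that the axis runs from 0 to 24.
pd_to_stat["HOUR_OF_DAY"] = since_midnight.dt.total_seconds() / 3600

# Sorting by time of day keeps a line plot from zigzagging across the axis.
by_time = pd_to_stat.sort_values("HOUR_OF_DAY")
```

Plotting by_time['HOUR_OF_DAY'] against by_time['DELAY'] and calling ax.set_xlim(0, 24) then gives an axis that covers one day, with no dates involved.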

How to plot only the dates inside my df and not all the dates

I have the following df:
date values
2020-08-06 08:00:00 5
2020-08-06 09:00:00 10
2020-08-06 10:00:00 0
2020-08-17 08:00:00 8
2020-08-17 09:00:00 15
I want to plot this df, so I do df.set_index('date')['values'].plot(kind='line'), but it shows all the dates between the 6th and the 17th.
How can I plot the graph with only the dates inside my df?
I assume that date column is of datetime type.
To draw for selected dates only, the index must be built on
the principle "number of day from a unique list + hour".
But to suppress the default x label ticks, you have to define
your own, e.g. each 8 h in each date to be drawn.
Start from converting your DataFrame as follows:
idx = df['date'].dt.normalize().unique()
dateMap = pd.Series(np.arange(idx.size) * 24, index=idx)
df.set_index(df.date.dt.normalize().map(dateMap) + df.date.dt.hour, inplace=True)
df.index.rename('HourNo', inplace=True); df
Now, for your data sample, it has the following content:
date values
HourNo
8 2020-08-06 08:00:00 5
9 2020-08-06 09:00:00 10
10 2020-08-06 10:00:00 0
32 2020-08-17 08:00:00 8
33 2020-08-17 09:00:00 15
Then generate your plot and x ticks positions and labels:
fig, ax = plt.subplots(tight_layout=True)
df.loc[:, 'values'].plot(style='o-', rot=30, ax=ax)
xLoc = np.arange(0, dateMap.index.size * 24, 8)
xLbl = pd.concat([ pd.Series(d + pd.timedelta_range(start=0, freq='8H',
periods=3)) for d in dateMap.index ]).dt.strftime('%Y-%m-%d\n%H:%M')
plt.xticks(ticks=xLoc, labels=xLbl, ha='right')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
ax.set_title('Set the proper heading')
ax.grid()
plt.show()
I also added the grid.
The result is:
And a final remark: avoid column names that are the same as existing
pandas methods or attributes (e.g. values).
Sometimes this is the cause of "stupid" errors (you intend to refer to
a column, but you actually refer to a method or attribute).
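A short illustration of that pitfall, using made-up data: attribute access on a column called values returns the DataFrame's underlying array, not the column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-06 08:00", "2020-08-06 09:00"]),
    "values": [5, 10],
})

# Attribute access resolves to the DataFrame.values property:
# a 2-D NumPy array holding every column.
as_attribute = df.values

# Bracket access always returns the column itself.
as_column = df["values"]

print(type(as_attribute).__name__, as_attribute.shape)
print(type(as_column).__name__, as_column.tolist())
```

Bracket access such as df['values'] is therefore the safe form whenever a column name might clash.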

DataFrame in pandas with time series data

I just started learning pandas. I came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I understand what the above data means, and I tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import DataFrame, Series, date_range
import time
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way of creating a data frame?
The next step given is: return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each pair of consecutive random numbers and keep only those where the absolute difference is < 0.5? Can someone explain how to do that in pandas?
I also tried to plot the series as a histogram with:
df_new.diff().hist()
The graph displays x as the random number, with the y-axis from 0 to 18 (which I don't understand). Can someone explain this to me as well?
To give you some pointers in addition to @Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by @Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only values that differ by less than 0.5 from the preceding value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can chain .dropna() after .diff() to get rid of the missing values.
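The same filter can be applied to a Series directly. In this sketch the seed and the variable names are assumptions, and "difference from the preceding value" is used as in the answer above; s.diff(-1) would compare each value with the following one instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
index = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(rng.standard_normal(len(index)), index=index)

# Difference between each value and the one before it; the first entry is NaN.
step = s.diff()

# Keep the original values whose step is smaller than 0.5 in absolute value.
# Comparisons with NaN are False, so the first entry drops out automatically.
close_to_previous = s[step.abs() < 0.5]

# The differences themselves, as in the answer above, without missing values.
small_steps = step[step.abs() < 0.5]
```

Because the boolean mask already excludes NaN, no separate .dropna() call is needed in this form.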
The pandas.Series.hist() docs state that the default number of bins is 10, so that is the number of bars you should expect. In this case the histogram turns out roughly symmetric around zero, ranging over roughly [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
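To see where the numbers on the y-axis come from, the bin counts can be computed without plotting. A minimal sketch with made-up data and an assumed seed; np.histogram is asked for 10 equal-width bins, matching the bins=10 default quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
s = pd.Series(rng.standard_normal(72))
diff = s.diff().dropna()  # 71 differences remain

# counts[i] is the number of differences falling into bin i;
# these counts are the bar heights, i.e. the y-axis of the histogram.
counts, edges = np.histogram(diff, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

The y-axis of the histogram is thus a count of observations per bin, which is why its upper end depends on how many differences land in the fullest bin.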
