Plot non-continuous time series with Matplotlib as datetime not string [duplicate]

I am trying to plot stock price data that is collected hourly from 8:00 am to 4:00 pm on working days. Because the market is closed from 4:00 pm to 8:00 am, the plot shows long straight lines between trading sessions. Is there a Matplotlib hack to remove these lines and plot the data continuously?
The following snippet shows where the break between days occurs.
date price
2021-01-01 08:00:03 338.50
2021-01-01 09:00:02 338.50
2021-01-01 10:00:03 338.50
2021-01-01 11:00:03 338.50
2021-01-01 12:00:02 338.50
2021-01-01 13:00:02 338.50
2021-01-01 14:00:02 338.50
2021-01-01 15:00:03 338.50
2021-01-01 16:00:02 338.50 <------ Break
2021-01-04 08:00:04 338.50
2021-01-04 09:00:06 335.61
2021-01-04 10:02:09 332.08
2021-01-04 11:00:05 331.11
2021-01-04 12:00:40 330.78
2021-01-04 13:00:03 331.93
2021-01-04 14:00:03 333.00
2021-01-04 15:00:04 334.59
2021-01-04 16:00:03 334.59
The following image shows the gaps that I want to remove.
I tried plotting each day separately as follows. The step size of 9 in the script is the number of hourly samples per day, from 8:00 am to 4:00 pm.
import matplotlib.pyplot as plt
# plot each 9-sample trading day as its own line segment
for i in range(0, 72, 9):
    plt.plot(uber_df['date'][i:i+9], uber_df['price'][i:i+9])
plt.show()
I got the following plot:

You can get rid of the gaps if your x-axis uses ordered categorical (ordinal) values instead of a continuous time scale. One way to achieve this is to convert the datetime objects in your 'date' column to strings:
df['date'] = df['date'].astype(str)
df = df.set_index('date')
df.plot()
plt.gcf().autofmt_xdate()
plt.show()
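If the tick labels should still come from real datetime values rather than a column converted to strings, one alternative sketch is to plot against integer row positions and translate each tick position back to a timestamp with a formatter. The small DataFrame and the helper name format_date are illustrative assumptions, not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter, MaxNLocator

# Illustrative sample spanning the overnight/weekend break
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 15:00:03", "2021-01-01 16:00:02",
        "2021-01-04 08:00:04", "2021-01-04 09:00:06",
    ]),
    "price": [338.50, 338.50, 338.50, 335.61],
})

def format_date(x, pos=None):
    """Map an integer tick position back to the timestamp of that row."""
    i = int(round(x))
    if 0 <= i < len(df):
        return df["date"].iloc[i].strftime("%Y-%m-%d %H:%M")
    return ""  # tick falls outside the data

fig, ax = plt.subplots()
# Evenly spaced integer positions on the x-axis, so closed hours take no room
ax.plot(range(len(df)), df["price"])
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.xaxis.set_major_formatter(FuncFormatter(format_date))
fig.autofmt_xdate()
# Call plt.show() or fig.savefig(...) to display the figure.
```

The 'date' column keeps its datetime dtype, so it remains usable for filtering or resampling after plotting.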

Related

Create Multiple DataFrames using Rolling Window from DataFrame Timestamps

I have one year's worth of data at four minute time series intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data in the ranges of 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
#Proxy DataFrame
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4min')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00

This approach avoids an explicit for loop, but the apply method is essentially a for loop under the hood, so it is not especially efficient. Until pandas offers more functionality for rolling datetime windows, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)
def myfunc(e):
    temp = s[s.between(e, e+pd.Timedelta("24h"))]
    return temp.mean()
s.apply(myfunc)
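If the function only needs to run once every eight hours rather than once per timestamp, an explicit loop over window start times may be simpler. The following is a minimal sketch under assumptions: a three-day sample instead of the full year, a made-up 'value' column, and the mean as a stand-in for the real function. Incomplete windows at the end of the data are skipped.

```python
import pandas as pd

spacing = pd.Timedelta(minutes=4)
window = pd.Timedelta(hours=24)
step = pd.Timedelta(hours=8)

# Hypothetical three-day sample at four-minute resolution
index = pd.date_range("2021-01-01 00:00:00", "2021-01-03 23:56:00", freq=spacing)
year_df = pd.DataFrame({"value": range(len(index))}, index=index)

rows_per_window = int(window / spacing)  # 360 rows in 24 hours

results = {}
for start in pd.date_range(index[0], index[-1], freq=step):
    end = start + window
    # Half-open interval [start, end), so 23:56 is the last row of a window
    chunk = year_df.loc[(year_df.index >= start) & (year_df.index < end)]
    if len(chunk) < rows_per_window:
        continue  # not a full 24 hours of data left
    results[start] = chunk["value"].mean()  # placeholder for the real function

print(pd.Series(results))
```

Each key in results is a window start, so the output can be turned into a Series or DataFrame indexed by start time.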

Slicing across a timeseries range in a multiindex DataFrame

I have a DataFrame that tracks the 'Adj Closing' price for several global markets, so the same dates repeat for each market. To clean this up I use .set_index(['Index Ticker', 'Date']).
DataFrame sample
My issue is that the Closing Prices run as far back as 1997-07-02 but I only need 2020-01-01 and forward. I tried using idx = pd.IndexSlice followed by df.loc[idx[ :, '2020-01-01':], :] as well as df.loc[(slice(None), '2020-01-01':), :], but both methods return a syntax error on the : that I'm using to slice across a range of dates. Any tips on getting the data I need past a specific date? Thank you in advance!

Try:
# create dataframe to approximate your data
df = pd.DataFrame({'ticker' : ['A']*5 + ['M']*5,
                   'Date' : pd.date_range(start='2021-01-01', periods=5).tolist() + pd.date_range(start='2021-01-01', periods=5).tolist(),
                   'high' : range(10)}
                  ).groupby(['ticker', 'Date']).sum()
                   high
ticker Date
A      2021-01-01     0
       2021-01-02     1
       2021-01-03     2
       2021-01-04     3
       2021-01-05     4
M      2021-01-01     5
       2021-01-02     6
       2021-01-03     7
       2021-01-04     8
       2021-01-05     9
# evaluate conditions against level 1 (Date) of your multiIndex; level 0 is ticker
df[df.index.get_level_values(1) > '2021-01-03']
                   high
ticker Date
A      2021-01-04     3
       2021-01-05     4
M      2021-01-04     8
       2021-01-05     9
Alternatively, if possible, remove the unwanted dates prior to setting your multiIndex.
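For reference, the start: range notation is only valid Python inside square brackets, so it cannot appear inside a tuple written with parentheses; there the range has to be spelled slice(start, None). The sketch below shows both spellings on a frame like the sample above. It assumes the index is sorted, which label-range slicing on a MultiIndex generally requires.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=5)
df = (
    pd.DataFrame({
        "ticker": ["A"] * 5 + ["M"] * 5,
        "Date": dates.tolist() * 2,
        "high": range(10),
    })
    .set_index(["ticker", "Date"])
    .sort_index()  # range slicing needs a sorted MultiIndex
)

idx = pd.IndexSlice
# All tickers, dates from 2021-01-04 onward (inclusive)
recent = df.loc[idx[:, "2021-01-04":], :]

# Same selection without IndexSlice: explicit slice objects inside the tuple
same = df.loc[(slice(None), slice("2021-01-04", None)), :]

print(recent)
```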

Not getting same results as pct_change when doing manually

My code is as follows:
import pandas as pd
from pandas_datareader import data as web
import datetime
start = datetime.datetime(2021, 1, 1)
end = datetime.datetime.today()
df = web.DataReader('goog', 'yahoo', start, end)
df['pct']= df['Close'].pct_change()
Simple enough, produces:
High Low Open Close Volume Adj Close pct
Date
2020-12-31 1758.930054 1735.420044 1735.420044 1751.880005 1011900 1751.880005 NaN
2021-01-04 1760.650024 1707.849976 1757.540039 1728.239990 1901900 1728.239990 -0.013494
2021-01-05 1747.670044 1718.015015 1725.000000 1740.920044 1145300 1740.920044 0.007337
2021-01-06 1748.000000 1699.000000 1702.630005 1735.290039 2602100 1735.290039 -0.003234
2021-01-07 1788.400024 1737.050049 1740.060059 1787.250000 2265000 1787.250000 0.029943
... ... ... ... ... ... ... ...
2021-08-13 2773.479980 2760.100098 2767.149902 2768.120117 628600 2768.120117 0.000119
2021-08-16 2779.810059 2723.314941 2760.000000 2778.320068 902000 2778.320068 0.003685
2021-08-17 2774.370117 2735.750000 2763.820068 2746.010010 1063600 2746.010010 -0.011629
2021-08-18 2765.879883 2728.419922 2742.310059 2731.399902 746700 2731.399902 -0.005320
2021-08-19 2748.925049 2707.120117 2709.350098 2738.270020 856623 2738.270020 0.002515
160 rows × 7 columns
So the last row says the pct is 0.002515.
My objective was to reproduce the same result without pct_change. To do this I have the following code:
(1- (df['Close'] / df['Close'].shift(-1))).shift(1)
which produces this:
Date
2020-12-31 NaN
2021-01-04 -0.013679
2021-01-05 0.007284
2021-01-06 -0.003244
2021-01-07 0.029073
...
2021-08-13 0.000119
2021-08-16 0.003671
2021-08-17 -0.011766
2021-08-18 -0.005349
2021-08-19 0.002509
Name: Close, Length: 160, dtype: float64
The last value I get is 0.002509, not 0.002515. Could you please explain why the last digits are off in each calculation?

Percent change is normally the change relative to the initial value:
(final - initial) / initial = final / initial - 1
You have the ratio relative to the final value. Try
df['Close'] / df['Close'].shift(1) - 1
By the way, you only need to shift once in your original expression as well.
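As a quick check of the definition final / initial - 1, the sketch below compares pct_change with a manual calculation and with the ratio taken relative to the final value. It uses the first three closing prices from the question; the variable names are illustrative.

```python
import pandas as pd

close = pd.Series(
    [1751.880005, 1728.239990, 1740.920044],
    index=pd.to_datetime(["2020-12-31", "2021-01-04", "2021-01-05"]),
    name="Close",
)

builtin = close.pct_change()
# Change relative to the initial (previous) value: matches pct_change
manual = close / close.shift(1) - 1
# Change relative to the final (current) value: close, but not the same
relative_to_final = 1 - close.shift(1) / close

print(pd.DataFrame({
    "pct_change": builtin,
    "manual": manual,
    "relative_to_final": relative_to_final,
}).round(6))
```

The two denominators differ by the size of the move itself, which is why the mismatch only shows up in the last few digits for small daily changes.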

How Can I get the first date on or after a given date?

I am using the following function. My index is a series of dates, and for every month I want the first date of the month or, if that date is not in the index, the first date after it. The code below finds the date nearest to the first of the month, which causes a problem when, as in this case, 31 Dec is closer to 1 Jan than 4 Jan, the date I actually want.
df['month'] = df.index.to_numpy().astype('datetime64[M]')
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))
for n in range(len(df)):
    d = nearest(df.index, df['month'][n])
    print(d)
output:
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2020-12-31 00:00:00
2021-02-01 00:00:00
2021-02-01 00:00:00
Is there an easy way to amend my code so that I get 2021-01-04 rather than 2020-12-31? My data:
Date x y z
28/12/2020 3723.030029 3735.360107 133.990005
29/12/2020 3750.01001 3727.040039 138.050003
30/12/2020 3736.189941 3732.040039 135.580002
31/12/2020 3733.27002 3756.070068 134.080002
04/01/2021 3764.610107 3700.649902 133.520004
05/01/2021 3698.02002 3726.860107 128.889999
06/01/2021 3712.199951 3748.139893 127.720001
07/01/2021 3764.709961 3803.790039 128.360001
08/01/2021 3815.050049 3824.679932 132.429993
11/01/2021 3803.139893 3799.610107 129.190002
12/01/2021 3801.620117 3801.189941 128.5
13/01/2021 3802.22998 3809.840088 128.759995
14/01/2021 3814.97998 3795.540039 130.800003
15/01/2021 3788.72998 3768.25 128.779999
19/01/2021 3781.879883 3798.909912 127.779999
20/01/2021 3816.219971 3851.850098 128.660004
21/01/2021 3857.459961 3853.070068 133.800003
22/01/2021 3844.23999 3841.469971 136.279999
25/01/2021 3851.679932 3855.360107 143.070007
26/01/2021 3862.959961 3849.620117 143.600006
27/01/2021 3836.830078 3750.77002 143.429993
28/01/2021 3755.75 3787.379883 139.520004
29/01/2021 3778.050049 3714.23999 135.830002
01/02/2021 3731.169922 3773.860107 133.75
02/02/2021 3791.840088 3826.310059 135.729996
03/02/2021 3840.27002 3830.169922 135.759995
04/02/2021 3836.659912 3871.73999 136.300003
05/02/2021 3878.300049 3886.830078 137.350006
08/02/2021 3892.590088 3915.590088 136.029999
09/02/2021 3910.48999 3911.22998 136.619995
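One possible way to get "on or after" rather than "nearest" behaviour is a binary search over the sorted dates. The sketch below uses only the standard library; first_on_or_after is a hypothetical helper name, and the short date list is a subset of the table above. With a sorted pandas DatetimeIndex, df.index.searchsorted(pivot) returns the same insertion position.

```python
from bisect import bisect_left
from datetime import date

def first_on_or_after(sorted_dates, pivot):
    """Return the first date >= pivot, or None if every date is earlier."""
    i = bisect_left(sorted_dates, pivot)
    return sorted_dates[i] if i < len(sorted_dates) else None

# Subset of the trading dates from the table, already in ascending order
dates = [
    date(2020, 12, 30), date(2020, 12, 31),
    date(2021, 1, 4), date(2021, 1, 5), date(2021, 1, 29),
    date(2021, 2, 1), date(2021, 2, 2),
]

# The first calendar day of each month present in the data
month_starts = sorted({d.replace(day=1) for d in dates})

for pivot in month_starts:
    print(pivot, "->", first_on_or_after(dates, pivot))
```

Because the search never looks backwards, 2021-01-01 resolves to 2021-01-04 instead of 2020-12-31.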

Plotting histogram with timedelta series

I have a series of data with a timedelta data type. I want to plot these timedeltas as a bar chart where the y-axis is marked in hours rather than some other format. When I previously tried a line plot in Matplotlib, the axis showed numbers I could not interpret. The following is a sample of my pandas timedelta series:
date
2020-04-11 0 days 02:00:00
2020-04-12 0 days 03:00:00
2020-04-13 0 days 02:00:00
2020-04-14 0 days 03:00:00
2020-04-15 0 days 01:00:00
2020-04-16 0 days 03:00:00
Freq: D, dtype: timedelta64[ns]
When I plot it in Matplotlib, the y-axis values in the resulting plot look wrong to me.
How can I get y-axis tick labels in a format like 01:00, 02:00?

One possible way is to convert the deltas to seconds and define a FuncFormatter.
This is my test series, followed by the code for my final plot:
2020-04-11 02:00:00
2020-04-12 03:00:00
2020-04-13 05:00:00
dtype: timedelta64[ns]
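import datetime
import matplotlib.pyplot as plt
import matplotlib.ticker
import numpy as np
def delta(x, pos):
    # x is a tick value in seconds; render it as H:MM:SS
    out = str(datetime.timedelta(seconds=x))
    return out
fig = plt.figure()
ax = fig.gca()
form = matplotlib.ticker.FuncFormatter(delta)
ax.yaxis.set_major_formatter(form)
# s is the timedelta series shown above, converted to seconds for plotting
ax.plot(s.index, s/np.timedelta64(1,'s'))
ax.set_yticks(s/np.timedelta64(1,'s'))
ax.set_xticks(s.index)
plt.show()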
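Note that str(datetime.timedelta(...)) renders two hours as 2:00:00. If the labels should read 02:00 as requested in the question, the callback handed to FuncFormatter can build the string itself. The helper name hhmm is an assumption for illustration, and the sketch assumes non-negative durations.

```python
def hhmm(seconds, pos=None):
    """Format a non-negative number of seconds as zero-padded HH:MM."""
    total_minutes = int(round(seconds / 60))
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"

# Tick values arrive as floats, in seconds
for value in (3600.0, 7200.0, 5400.0):
    print(value, "->", hhmm(value))
```

The signature (value, pos) matches what FuncFormatter passes to its callback, so hhmm can replace delta in the formatter without other changes.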
