I am very new to Python. I usually use scikits.timeseries to process time-series data. Now I would like to use pandas (e.g. read_csv) to do the same as the code shown below. I used the read_csv manual to read the file, but I don't know how to convert the daily time series to a monthly time series.
The input is one column of daily data running from 2002-01-01 to 2011-12-31, so the length is 3652. The output should be one column of monthly data running from 2002-01 to 2011-12, so the length is 120.
import numpy as np
import pandas as pd
import scikits.timeseries as ts
stgSim = ts.time_series(np.loadtxt('examp.txt', delimiter=',', skiprows=1,
                                   usecols=[37]),
                        start_date='2002-01-01',
                        freq='d')
v4 = ts.time_series(np.random.rand(3652),start_date='2002-01-01',freq='d')
startD = stgSim.date_to_index(v4.start_date)
stgSim = stgSim[startD:]
stgSimAnMonth = stgSim.convert(freq='m',func=np.ma.mean)
Are you asking for resample which converts daily data to monthly data?
Say
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # set a random seed so that the result is repeatable
ts = pd.Series(data=rng.rand(100),
               index=pd.date_range('2018/01/01', periods=100, freq='D'))
mts = ts.resample('M').mean()  # resample (convert) the daily data to monthly data
ts is like
2018-01-01 0.374540
2018-01-02 0.950714
2018-01-03 0.731994
...
2018-04-08 0.427541
2018-04-09 0.025419
2018-04-10 0.107891
Now you should have mts like
2018-01-31 0.444047
2018-02-28 0.498545
2018-03-31 0.477100
2018-04-30 0.450325
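Since the question specifically mentions read_csv, here is a minimal sketch of the same conversion starting from the CSV file. The file layout (a header row, the value in column index 37, daily values from 2002-01-01) is assumed from the scikits.timeseries snippet above, not confirmed.
import pandas as pd

# read the single value column; skiprows=1 and usecols=[37] follow the original np.loadtxt call
vals = pd.read_csv('examp.txt', skiprows=1, header=None, usecols=[37]).iloc[:, 0]

# attach the daily dates assumed in the original snippet, then resample to monthly means
daily = pd.Series(vals.values,
                  index=pd.date_range('2002-01-01', periods=len(vals), freq='D'))
monthly = daily.resample('M').mean()

# optional: label the result by month (2002-01, ..., 2011-12) instead of month-end dates
monthly.index = monthly.index.to_period('M')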
Related
I have a dataframe with daily market data (OHLCV) and am resampling it to weekly.
My specific requirement is that each index label in the weekly dataframe must be the index label of the first day of that week whose data is present in the daily dataframe.
For example, in July 2022, the trading week beginning 4th July (for US stocks) should be labelled 5th July, since 4th July was a holiday and not found in the daily dataframe, and the first date in that week found in the daily dataframe is 5th July.
The usual weekly resampling offset aliases and anchored offsets do not seem to have such an option.
I can achieve my requirement specifically for US stocks by importing USFederalHolidayCalendar from pandas.tseries.holiday and then using
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
dfw.index = dfw.index.map(lambda idx: bday_us.rollforward(idx))
where dfw is the already resampled weekly dataframe with W-MON as the offset alias.
However, this would mean that I'd have to use different trading calendars for each different exchange/market, which I'd very much like to avoid.
Any pointers on how to do this simply so that the index label in the weekly dataframe is the index label of the first day of that week available in the daily dataframe would be much appreciated.
You want to group all days by calendar week (Mon-Sun), then aggregate the data, and use the first observed date as the index, correct?
If so, W-MON is not applicable because it would group dates from Tuesday through Monday. Using W-SUN instead, you group by calendar week, with the Sunday as the index. You can then use the first method on the date column to obtain the first observed date in each week and replace the index with that result.
This is possible with either groupby or resample:
import numpy as np
import pandas as pd
# simulate daily data, drop a monday
date_range = pd.bdate_range(start='2022-06-06',end='2022-07-31')
date_range = date_range[~(date_range=='2022-07-04')]
# simulate data
df = pd.DataFrame(data={
    'date': date_range,
    'return': np.random.random(size=len(date_range))
})
# resample with groupby
g = df.groupby([pd.Grouper(key='date', freq='W-SUN')])
result_groupby = g[['return']].mean() # example aggregation method
result_groupby['date_first_observed'] = g['date'].first()
result_groupby['date_last_observed'] = g['date'].last()
result_groupby.set_index('date_first_observed', inplace=True)
# resample with resample
df.index = df['date']
g = df.resample('W-SUN')
result_resample = g[['return']].mean() # example aggregation method
result_resample['date_first_observed'] = g['date'].first()
result_resample['date_last_observed'] = g['date'].last()
result_resample.set_index('date_first_observed', inplace=True)
This gives
>>> result_groupby
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
>>> result_resample
return date_last_observed
date_first_observed
2022-06-06 0.704949 2022-06-10
2022-06-13 0.460946 2022-06-17
2022-06-20 0.578682 2022-06-24
2022-06-27 0.361004 2022-07-01
2022-07-05 0.692309 2022-07-08
2022-07-11 0.569810 2022-07-15
2022-07-18 0.435222 2022-07-22
2022-07-25 0.454765 2022-07-29
One row shows 2022-07-05 (Tuesday) instead of 2022-07-04 (Monday).
I have a dataframe in long format with data on a 15-minute interval for several variables. If I apply the resample method to get the average daily value, I get a single average over all variables for each time interval (and not one average per variable, i.e. one for speed and one for distance).
Does anyone know how to resample the dataframe and keep the 2 variables?
Note: the code below contains an EXAMPLE dataframe in long format; my real example loads data from CSV and has different time intervals and frequencies for the variables, so I cannot simply resample the dataframe in wide format.
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = dti)
# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.25 hours, i.e. one 15-minute interval)
df['distance'] = df['speed'] * 0.25
df.reset_index(inplace=True)
df2 = df.melt(id_vars='index')
df3 = df2.resample('d', on='index').mean()
IIUC:
>>> df.groupby(df.index.date).mean()
speed distance
2015-01-01 29.562500 7.390625
2015-01-02 31.885417 7.971354
2015-01-03 30.895833 7.723958
2015-01-04 30.489583 7.622396
2015-01-05 28.500000 7.125000
... ... ...
2015-12-27 28.552083 7.138021
2015-12-28 29.437500 7.359375
2015-12-29 29.479167 7.369792
2015-12-30 28.864583 7.216146
2015-12-31 48.000000 12.000000
[365 rows x 2 columns]
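If the data really has to stay in long format, as the question's note says, one option is to group by both the variable and the calendar day. A minimal sketch using the df2 produced by the melt call above (the column names index/variable/value come from that call):
# group the long-format frame by variable and day, then average the values
daily_long = (df2.groupby(['variable', pd.Grouper(key='index', freq='D')])['value']
                 .mean()
                 .reset_index())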
I have two series, one being monthly CPI (consumer price inflation) data and the other being the daily close price of the EUR/USD exchange rate. The issue I am having is converting the monthly CPI data into daily data so that I can combine the two series into a dataframe. As I am using this for a regression machine-learning (ML) task, namely XGBoost, I am unsure of the correct way to do this. I've read about a couple of methods: interpolation (e.g. Chow-Lin or cubic spline), using a Kalman filter, manipulating the datetime index, filling it with NaN values and then using ffill() to fill in the NaNs, training separate models (which I don't want to do), etc.
I really don't know the correct procedure to do this, especially when time disaggregation is a major concern in relation to model accuracy. Here is the code:
from datetime import datetime
import pandas as pd
import pandas_datareader.data as pdr
import yfinance as yf
eurusd = yf.download("EURUSD=X", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21))["Close"] # daily
cpi = pdr.FredReader("CPALTT01USM657N", start=datetime(2000, 1, 1), end=datetime(2021, 7, 21)).read() # monthly
The output of the data is as follows:
(N.B: I will eventually slice the dataframe and have data from around 2008 onwards for the sake of feature engineering and for more recent data for the ML model)
Date
2003-12-01 1.196501
2003-12-02 1.208897
2003-12-03 1.212298
2003-12-04 1.208094
2003-12-05 1.218695
...
2021-07-15 1.183334
2021-07-16 1.181181
2021-07-19 1.181401
2021-07-20 1.179384
2021-07-21 1.178411
Name: Close, Length: 4554, dtype: float64
CPALTT01USM657N
DATE
2000-01-01 0.297089
2000-02-01 0.592417
2000-03-01 0.824499
2000-04-01 0.058411
2000-05-01 0.116754
... ...
2021-01-01 0.425378
2021-02-01 0.547438
2021-03-01 0.708327
2021-04-01 0.821891
2021-05-01 0.801711
[257 rows x 1 columns]
Really appreciate all the help that I can get! Many thanks.
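No single procedure is settled here, but as a minimal sketch of the simplest option named in the question (reindexing the monthly series to a daily index and forward-filling), assuming the eurusd and cpi objects defined above; this is only the naive option, not a statement on which disaggregation method is statistically appropriate for the ML task:
# upsample monthly CPI to daily by forward-filling the last published value
daily_index = pd.date_range(cpi.index.min(), eurusd.index.max(), freq='D')
cpi_daily = cpi.reindex(daily_index).ffill()

# align with the daily EUR/USD closes
combined = pd.concat([eurusd, cpi_daily], axis=1).dropna()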
I have tested a few different ways to calculate technical indicators for a large dataframe and am unsure how to determine the most efficient and pythonic way to go about it. The data is stock data (date, price, volume). The goal is to iterate through the dataframe, per ticker, calculating multiple technical indicators, and then sending the result back into the source (SQL db).
The data contains about 4,200 stock symbols with daily price data from 2000 to date (roughly 13m rows x 8 columns).
For testing, I've limited the data to just 2021 date range.
Here is a sample of the data:
Date Open High Low Close Adj_close Volume Tick
529326 2021-01-04 3270.00 3272.00 3144.02 3186.63 3186.63 4411400 AMZN
521846 2021-01-05 3166.01 3223.38 3165.06 3218.51 3218.51 2655500 AMZN
521691 2021-01-06 3146.48 3197.51 3131.16 3138.38 3138.38 4394800 AMZN
514195 2021-01-07 3157.00 3208.54 3155.00 3162.16 3162.16 3514500 AMZN
514038 2021-01-08 3180.00 3190.64 3142.20 3182.70 3182.70 3537700 AMZN
506535 2021-01-11 3148.01 3156.38 3110.00 3114.21 3114.21 3683400 AMZN
506376 2021-01-12 3120.00 3142.14 3086.00 3120.83 3120.83 3514600 AMZN
498871 2021-01-13 3128.44 3189.95 3122.08 3165.89 3165.89 3321200 AMZN
498706 2021-01-14 3167.52 3178.00 3120.59 3127.47 3127.47 3070900 AMZN
491194 2021-01-15 3123.02 3142.55 3095.17 3104.25 3104.25 4244000 AMZN
491037 2021-01-19 3107.00 3145.00 3096.00 3120.76 3120.76 3305100 AMZN
483504 2021-01-20 3181.99 3279.80 3175.00 3263.38 3263.38 5309800 AMZN
483351 2021-01-21 3293.00 3348.55 3289.57 3306.99 3306.99 4936100 AMZN
475802 2021-01-22 3304.31 3321.91 3283.16 3292.23 3292.23 2821900 AMZN
475649 2021-01-25 3328.50 3363.89 3243.15 3294.00 3294.00 3749800 AMZN
468087 2021-01-26 3296.36 3338.00 3282.87 3326.13 3326.13 2955200 AMZN
467939 2021-01-27 3341.49 3346.52 3207.08 3232.58 3232.58 4660200 AMZN
460368 2021-01-28 3235.04 3301.68 3228.69 3237.62 3237.62 3149200 AMZN
460219 2021-01-29 3230.00 3236.99 3184.55 3206.20 3206.20 4293600 AMZN
452618 2021-02-01 3242.36 3350.26 3235.03 3342.88 3342.88 4160200 AMZN
I'm not sure how to fully code dummy data, but here are two methods (they should only need NumPy) to create random price and ticker data; I am just unsure how to merge them all into a dataframe (see the sketch after this snippet). To simulate the same dataframe, there would be 4,200 symbols and 134 days of data.
letters = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',)
x=np.random.randint(500, size=(134)) # <<< generates random price
y=''.join(np.random.choice(letters) for i in range(4)) # <<< generate random 4 character string
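The intended schema is not stated, but as a rough, self-contained sketch, one way to assemble those two pieces into a single dummy dataframe (the date range, ticker strings, and the Adj_close column name are illustrative only):
import numpy as np
import pandas as pd

letters = tuple('abcdefghijklmnopqrstuvwxyz')
dates = pd.bdate_range('2021-01-04', periods=134)    # 134 trading days

frames = []
for _ in range(4200):                                # 4,200 dummy symbols
    tick = ''.join(np.random.choice(letters) for _ in range(4))
    frames.append(pd.DataFrame({'Date': dates,
                                'Tick': tick,
                                'Adj_close': np.random.randint(500, size=len(dates))}))
sc = pd.concat(frames, ignore_index=True)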
Here are all the imports being used:
#imports
from datetime import datetime, timedelta, date
import time
import sqlalchemy as sa
import pandas as pd
import numpy as np
import yfinance as yf
import pyodbc
import pandas_ta as ta
import talib
Dataframe 'sc' is referenced, with data in the following format:
Date Open High Low Close Adj_close Volume Tick
529377 2021-01-04 38.68 38.69 37.18 37.88 37.88 647700 ACIW
526834 2021-01-04 29.72 29.94 28.68 29.10 29.10 1527600 GOOS
526833 2021-01-04 15.35 15.40 14.92 15.01 14.39 421400 ETV
526832 2021-01-04 42.22 42.36 41.13 41.46 40.84 204000 HMN
526831 2021-01-04 13.94 15.72 13.75 15.38 15.38 880500 GATO
Then I want to iterate through the dataframe (which is sorted by Date) for each ticker and calculate a number of technical indicators. For now I am starting with just two moving average calculations. I've tried 3 different methods and compared the times below.
Talib package: 10 minutes
start_time = time.time()
ticks = pd.unique(sc['Tick'].tolist())  # <<< 4,200 unique tickers
ndf = []  # <<< list that collects one temporary df per ticker
for tick in ticks:
    # store ID (symbol), Date, Close (adj_close), and two indicators (SMA, EMA) in variables,
    # concat them into a temporary df, and append that df to the list. Not sure if this is
    # the most efficient/pythonic way to do this.
    ID = sc[sc["Tick"]==tick]["Tick"]
    DATE = sc[sc["Tick"]==tick]["Date"]
    CLOSE = sc[sc["Tick"]==tick]["Adj_close"]
    SMA = round(talib.SMA(sc[sc["Tick"]==tick]['Adj_close']), 2)
    EMA = round(talib.EMA(sc[sc["Tick"]==tick]['Adj_close']), 2)
    # concat into one temporary df
    tempdf = pd.concat([ID, DATE, CLOSE, SMA, EMA], axis=1)
    # append to the list; everything is concatenated after the loop
    ndf.append(tempdf)
    print("Completed indicators for " + tick)
# concat everything in ndf into one flattened df
df = pd.concat(ndf)
df['t_id'] = df['Tick'] + '-' + df['Date']
df.rename(columns={'Adj_close': 'Close', 0: "SMA", 1: "EMA"}, inplace=True)
df = df.sort_values(by='Date')
print("--- %s seconds ---" % (time.time() - start_time))
df.tail(20)
Pandas TA: 13.6 minutes
#Swapped the "talib" lines for Pandas-TA package
SMA5 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=5)
SMA15 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=15)
Rolling method / Pandas: 14 minutes
# Swapped the "talib" lines for Rolling():
SMA5 = sc[sc["Tick"]==tick]['Adj_close'].rolling(5,min_periods=1).mean()
SMA15 = sc[sc["Tick"]==tick]['Adj_close'].rolling(15,min_periods=1).mean()
I am unsure how to gauge what an efficient time would be (is 10 minutes generally good or bad, or is it just dependent on personal requirements?), and whether the approach of looping through each ticker, storing each indicator separately, then concatenating and finally appending back into a master dataframe is an appropriately pythonic approach. The final code will insert the complete dataframe back into a SQL table.
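Not a definitive answer on the timings, but one sketch of an alternative that avoids the repeated boolean filtering per ticker is to let groupby do the per-ticker split once and compute the rolling means with transform (column names follow the sample data shown above):
# group once by ticker and compute the rolling means per group
sc = sc.sort_values(['Tick', 'Date'])
sc['SMA5'] = (sc.groupby('Tick')['Adj_close']
                .transform(lambda s: s.rolling(5, min_periods=1).mean()))
sc['SMA15'] = (sc.groupby('Tick')['Adj_close']
                 .transform(lambda s: s.rolling(15, min_periods=1).mean()))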
I want to interpolate (upsample) a nonequispaced time series to obtain an equispaced time series.
Currently I am doing it in the following way:
take the original time series
create a new time series with NaN values at 30-second intervals (using resample('30S').asfreq())
concat the original time series and the new time series
sort the combined time series to restore the order of times (this I do not like: sorting has O(n log n) complexity)
interpolate
remove the original points from the time series
Is there a simpler way with pandas 0.18.0rc1? In MATLAB, for example, you have the original time series and pass the new times as a parameter to the interpolate() function to receive the values at the desired times.
I remark that the times of the original time series might not be a subset of the times of the desired time series.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
values = [271238, 329285, 50, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:04',
                             '2015-01-04 08:37:05',
                             '2015-01-04 08:41:07',
                             '2015-01-04 08:43:05',
                             '2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)
ts
ts[ts==-1] = np.nan
newFreq=ts.resample('60S').asfreq()
new=pd.concat([ts,newFreq]).sort_index()
new=new.interpolate(method='time')
ts.plot(marker='o')
new.plot(marker='+',markersize=15)
new[newFreq.index].plot(marker='.')
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['original values (nonequispaced)', 'original + interpolated at new frequency (nonequispaced)', 'interpolated values without original values (equispaced!)']
plt.legend(lines, labels, loc='best')
plt.show()
There have been several requests for a simpler way to interpolate at desired values (I'll edit in links later, but search the issue tracker for interpolate issues). So in the future there will be an easier way.
For now you can write the option a bit more cleanly as
In [9]: (ts.reindex(ts.index | newFreq.index)
.interpolate(method='time')
.loc[newFreq.index])
Out[9]:
2015-01-04 08:29:00 NaN
2015-01-04 08:30:00 277996.070686
2015-01-04 08:31:00 285236.860707
2015-01-04 08:32:00 292477.650728
2015-01-04 08:33:00 299718.440748
...
2015-01-04 08:45:00 261362.402778
2015-01-04 08:46:00 261937.569444
2015-01-04 08:47:00 262512.736111
2015-01-04 08:48:00 263087.902778
2015-01-04 08:49:00 263663.069444
Freq: 60S, dtype: float64
This still involves all the steps you listed above, but unioning the indexes is cleaner than concatenating and dropping.
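As a side note, on recent pandas versions the set-style | union of two indexes is deprecated in favour of Index.union, so the same idea can be written as (a sketch, same objects as above):
# same approach with an explicit Index.union instead of the | operator
result = (ts.reindex(ts.index.union(newFreq.index))
            .interpolate(method='time')
            .loc[newFreq.index])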