Slicing pandas dataframe via index datetime - python

I am trying to slice a pandas DataFrame that was read from a CSV file, with the index set from the first column of dates.
IN:
df = pd.read_csv(r'E:\...\^d.csv')
df["Date"] = pd.to_datetime(df["Date"])
OUT:
Date Open High Low Close Volume
0 1920-01-02 9.52 9.52 9.52 9.52 NaN
1 1920-01-03 9.62 9.62 9.62 9.62 NaN
2 1920-01-05 9.57 9.57 9.57 9.57 NaN
3 1920-01-06 9.46 9.46 9.46 9.46 NaN
4 1920-01-07 9.47 9.47 9.47 9.47 NaN
Date Open High Low Close Volume
26798 2020-10-26 3441.42 3441.42 3364.86 3400.97 2.435787e+09
26799 2020-10-27 3403.15 3409.51 3388.71 3390.68 2.395102e+09
26800 2020-10-28 3342.48 3342.48 3268.89 3271.03 3.147944e+09
26801 2020-10-29 3277.17 3341.05 3259.82 3310.11 2.752626e+09
26802 2020-10-30 3293.59 3304.93 3233.94 3269.96 3.002804e+09
IN:
df = df.set_index(['Date'])
print("my index type is ")
print(df.index.dtype)
print(type(df.index)) #type of index
OUT:
Open High Low Close Volume
Date
2007-01-03 1418.03 1429.42 1407.86 1416.60 1.905089e+09
2007-01-04 1416.95 1421.84 1408.22 1418.34 1.669144e+09
2007-01-05 1418.34 1418.34 1405.75 1409.71 1.621889e+09
2007-01-08 1409.22 1414.98 1403.97 1412.84 1.535189e+09
2007-01-09 1412.85 1415.61 1405.42 1412.11 1.687989e+09
... ... ... ... ...
2009-12-24 1120.59 1126.48 1120.59 1126.48 7.042833e+08
2009-12-28 1126.48 1130.38 1123.51 1127.78 1.509111e+09
2009-12-29 1127.78 1130.38 1126.08 1126.19 1.383900e+09
2009-12-30 1126.19 1126.42 1121.94 1126.42 1.265167e+09
2009-12-31 1126.42 1127.64 1114.81 1115.10 1.153883e+09
my index type is
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
I try to slice for Mondays using
monday_dow = df["Date"].dt.dayofweek==0
OUT (Spyder returns):
KeyError: 'Date'
I've read a lot of similar answers on Stack Overflow but couldn't fix this. I understand I'm doing something wrong with the index; should it be referenced another way?

You need to filter on the DatetimeIndex with DatetimeIndex.dayofweek (drop .dt, which is used only for columns):
monday_dow = df.index.dayofweek==0
Then, to select all matching rows:
df1 = df[monday_dow]
It is also possible to simplify the code by setting the DatetimeIndex directly in read_csv:
df = pd.read_csv(r'E:\...\^d.csv', index_col=['Date'], parse_dates=['Date'])
monday_dow = df.index.dayofweek==0
df1 = df[monday_dow]
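A minimal sketch of the difference between the two accessors, using a small made-up frame instead of the CSV from the question (the Close values here are placeholders, not real prices):

```python
import pandas as pd

# Hypothetical data: seven consecutive days starting Monday 2020-10-26
idx = pd.date_range("2020-10-26", periods=7, freq="D", name="Date")
df = pd.DataFrame({"Close": range(7)}, index=idx)

# Dates in the index: use the DatetimeIndex attribute directly (no .dt)
mondays_from_index = df[df.index.dayofweek == 0]

# Dates in a regular column: the .dt accessor is required
df_col = df.reset_index()
mondays_from_column = df_col[df_col["Date"].dt.dayofweek == 0]

print(mondays_from_index)
print(mondays_from_column)
```

Both selections pick the same single Monday row; only the way the dates are reached differs.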

Related

Turning daily stock prices into weekly/monthly/quarterly/semester/yearly?

I'm trying to convert daily prices into weekly, monthly, quarterly, semiannual and yearly prices, but the code only works when I run it for one stock. When I add another stock to the list, the code crashes with two errors: 'ValueError: Length of names must match number of levels in MultiIndex.' and 'TypeError: other must be a MultiIndex or a list of tuples.' I'm not experienced with MultiIndexing and have searched everywhere with no success.
This is the code:
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
symbols = ['AMZN', 'AAPL']
yf.pdr_override()
df = pdr.get_data_yahoo(symbols, start = '2014-12-01', end = '2021-01-01')
df = df.reset_index()
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace = True)
res = {'Open': 'first', 'Adj Close': 'last'}
dfw = df.resample('W').agg(res)
dfw_ret = (dfw['Adj Close'] / dfw['Open'] - 1)
dfm = df.resample('BM').agg(res)
dfm_ret = (dfm['Adj Close'] / dfm['Open'] - 1)
dfq = df.resample('Q').agg(res)
dfq_ret = (dfq['Adj Close'] / dfq['Open'] - 1)
dfs = df.resample('6M').agg(res)
dfs_ret = (dfs['Adj Close'] / dfs['Open'] - 1)
dfy = df.resample('Y').agg(res)
dfy_ret = (dfy['Adj Close'] / dfy['Open'] - 1)
print(dfw_ret)
print(dfm_ret)
print(dfq_ret)
print(dfs_ret)
print(dfy_ret)
This is what the original df prints:
```Adj Close Open
AAPL AMZN AAPL AMZN
Date
2014-12-01 26.122288 326.000000 29.702499 338.119995
2014-12-02 26.022408 326.309998 28.375000 327.500000
2014-12-03 26.317518 316.500000 28.937500 325.730011
2014-12-04 26.217640 316.929993 28.942499 315.529999
2014-12-05 26.106400 312.630005 28.997499 316.799988
... ... ... ... ...
2020-12-24 131.549637 3172.689941 131.320007 3193.899902
2020-12-28 136.254608 3283.959961 133.990005 3194.000000
2020-12-29 134.440399 3322.000000 138.050003 3309.939941
2020-12-30 133.294067 3285.850098 135.580002 3341.000000
2020-12-31 132.267349 3256.929932 134.080002 3275.000000
And this is what the different df_ret print when I go from daily
to weekly/monthly/etc but it can only do it for one stock and
the idea is to be able to do it for multiple stocks:
Date
2014-12-07 -0.075387
2014-12-14 -0.013641
2014-12-21 -0.029041
2014-12-28 0.023680
2015-01-04 0.002176
...
2020-12-06 -0.014306
2020-12-13 -0.012691
2020-12-20 0.018660
2020-12-27 -0.008537
2021-01-03 0.019703
Freq: W-SUN, Length: 318, dtype: float64
Date
2014-12-31 -0.082131
2015-01-30 0.134206
2015-02-27 0.086016
2015-03-31 -0.022975
2015-04-30 0.133512
...
2020-08-31 0.085034
2020-09-30 -0.097677
2020-10-30 -0.053569
2020-11-30 0.034719
2020-12-31 0.021461
Freq: BM, Length: 73, dtype: float64
Date
2014-12-31 -0.082131
2015-03-31 0.190415
2015-06-30 0.166595
2015-09-30 0.165108
2015-12-31 0.322681
2016-03-31 -0.095461
2016-06-30 0.211909
2016-09-30 0.167275
2016-12-31 -0.103026
2017-03-31 0.169701
2017-06-30 0.090090
2017-09-30 -0.011760
2017-12-31 0.213143
2018-03-31 0.234932
2018-06-30 0.199052
2018-09-30 0.190349
2018-12-31 -0.257182
2019-03-31 0.215363
2019-06-30 0.051952
2019-09-30 -0.097281
2019-12-31 0.058328
2020-03-31 0.039851
2020-06-30 0.427244
2020-09-30 0.141676
2020-12-31 0.015252
Freq: Q-DEC, dtype: float64
Date
2014-12-31 -0.082131
2015-06-30 0.388733
2015-12-31 0.538386
2016-06-30 0.090402
2016-12-31 0.045377
2017-06-30 0.277180
2017-12-31 0.202181
2018-06-30 0.450341
2018-12-31 -0.107405
2019-06-30 0.292404
2019-12-31 -0.039075
2020-06-30 0.471371
2020-12-31 0.180907
Freq: 6M, dtype: float64
Date
2014-12-31 -0.082131
2015-12-31 1.162295
2016-12-31 0.142589
2017-12-31 0.542999
2018-12-31 0.281544
2019-12-31 0.261152
2020-12-31 0.737029
Freq: A-DEC, dtype: float64```
Without knowing exactly what your df DataFrame looks like, I assume the issue is how resampling is handled on a MultiIndex, similar to the one discussed in this question.
The solution listed there is to use pd.Grouper with the freq and level parameters filled out correctly.
# This is taken from the linked solution, so I am not sure this is the correct level to choose
df.groupby(pd.Grouper(freq='W', level=-1))
If this doesn't work, I think you would need to provide some more detail or a dummy data set to reproduce the issue.
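As an alternative to pd.Grouper, here is a hedged sketch that avoids the dict-based agg altogether. It assumes the columns form a two-level MultiIndex of (field, ticker), as in the printed df from the question, and uses made-up numbers rather than downloaded prices:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded frame: (field, ticker) columns
idx = pd.date_range("2020-01-01", periods=10, freq="D", name="Date")
cols = pd.MultiIndex.from_product([["Adj Close", "Open"], ["AAPL", "AMZN"]])
values = np.arange(40, dtype=float).reshape(10, 4) + 1
df = pd.DataFrame(values, index=idx, columns=cols)

# Selecting the top level yields a plain frame with one column per ticker,
# so each field can be resampled on its own
weekly_open = df["Open"].resample("W").first()
weekly_close = df["Adj Close"].resample("W").last()
weekly_ret = weekly_close / weekly_open - 1

print(weekly_ret)
```

The same pattern should carry over to the other frequencies in the question by changing the resample rule.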

How to resample AAII weekly data to daily?

I would like to import the following file which contains data in a weekly format (Thursdays only) and convert it to a daily file with the values from Thursday filled out through the next Wednesday skipping Saturday and Sunday.
https://www.aaii.com/files/surveys/sentiment.xls
I can import it:
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Here is the result:
But that is as far as I can get. Even the simplest resampling fails with
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I tried df['Date'] = pd.to_datetime(df['Date']) and other methods with no incremental success.
Thoughts as to how to get this done?
You can try something like this:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Your Date column contains NaN values, so converting it to datetime fails:
>>> df['Date']
0 NaN
1 1987-06-26 00:00:00
2 1987-07-17 00:00:00
3 1987-07-24 00:00:00
4 1987-07-31 00:00:00
To convert the column to datetime, you need to pass errors='coerce':
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Now your dates are parsed:
>>> df['Date']
0 NaT
1 1987-06-26
2 1987-07-17
3 1987-07-24
4 1987-07-31
5 1987-08-07
6 1987-08-14
7 1987-08-21
Now set the index to the Date column before resampling, as mentioned in the comments:
>>> df.set_index('Date', inplace=True)
>>> df.head()
Bullish Neutral Bearish Total Mov Avg Spread Average +St. Dev. - St. Dev. High Low Close
Date
NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1987-06-26 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 NaN NaN NaN
1987-07-17 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 314.59 307.63 314.59
1987-07-24 0.36 0.50 0.14 1.0 NaN 0.22 0.382642 0.484295 0.280989 311.39 307.81 309.27
1987-07-31 0.26 0.48 0.26 1.0 NaN 0.00 0.382642 0.484295 0.280989 318.66 310.65 318.66
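A minimal sketch of the coerce-then-index steps above on made-up data. Dropping the NaT rows before setting the index is an extra step assumed here, not part of the answer; it keeps the resulting DatetimeIndex free of missing dates:

```python
import pandas as pd

# Hypothetical frame with a missing and an unparseable date
df = pd.DataFrame({
    "Date": [None, "1987-06-26", "not a date", "1987-07-17"],
    "Bullish": [None, 0.30, 0.40, 0.50],
})

# Unparseable or missing entries become NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Assumption: rows without a valid date are not needed downstream
df = df.dropna(subset=["Date"]).set_index("Date")

print(df)
```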
I think this is the correct answer: it converts to daily and strips non-trading days and Saturdays/Sundays.
import pandas as pd
from pandas.tseries.offsets import BDay
# read the Excel file, use SENTIMENT sheet, drop the first three rows, parse dates to datetime, index on date
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df = df[3:].asfreq('D', method='ffill') # skip 3 lines then expand to daily and fill forward
df = df[df.index.map(BDay().is_on_offset)] # keep business days only (onOffset was renamed is_on_offset; BDay covers Mon-Fri, not market holidays)
df = df[df.index.dayofweek < 5] # strip Saturdays and Sundays
print(df.head(250))
There may be a more elegant method, but that gets the job done.
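One possibly more compact variant, sketched on made-up weekly values rather than the AAII file: asking asfreq for business-day frequency ('B') directly expands and forward-fills in one step. Like BDay, 'B' only skips weekends and knows nothing about market holidays.

```python
import pandas as pd

# Hypothetical weekly readings on three consecutive Thursdays
weekly = pd.DataFrame(
    {"Bullish": [0.36, 0.26, 0.40]},
    index=pd.to_datetime(["2021-01-07", "2021-01-14", "2021-01-21"]),
)
weekly.index.name = "Date"

# Expand to Mon-Fri rows, carrying each Thursday value forward
daily = weekly.asfreq("B", method="ffill")

print(daily)
```

Note that the expansion stops at the last date present, so the final Thursday is not carried through to the following Wednesday unless the index is extended first.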

python converting url CSV to pandas for numerical and date value always gives wrong output

Here is Python code that reads a CSV file from an Alpha Vantage URL and converts it to a pandas DataFrame. There are multiple issues with it.
Before describing the issues, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None,names=cols)
dfdaily.rename(columns = {'timestamp':'date'}, inplace = True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols) returns values that do not seem to be numeric (they look like strings), so when I use this DataFrame I get the error "high is not dobule".
The data returned from this URL contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
The first row of labels above (0, 1, 2, 3, 4) is not wanted, hence I added
dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to convert this CSV into a pandas DataFrame?
Since the values read are strings, I tried making the DataFrame numeric with this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
This converts the date values to 0.0, which defeats the purpose: the dates should be retained as they are. With this many lines of conversion code it also takes a lot of time, so a better way of getting the desired output is needed.
The output I am getting is :
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
A MultiIndex is not required, and the date should remain a date, not "0".
The other columns (open, high, low, close) should be numeric.
Please shed some light on how to optimize this: I am looking for clean code that gives a numeric pandas DataFrame indexed by "date", so that it can be used for further arithmetic and logical operations.
I think you need to omit the names parameter, because the CSV already has a header. To get a DatetimeIndex, add index_col to set the first column as the index and parse_dates to convert it to datetimes. Finally, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
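The same read_csv call can be tried offline. The sketch below substitutes a small in-memory CSV (made-up rows that mimic the header of the real feed) for the Alpha Vantage URL, so the behaviour of index_col and parse_dates can be checked without an API key:

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of the downloaded CSV
csv_text = """timestamp,open,high,low,close,volume
2018-02-09,20.25,21.0,20.25,20.25,21700
2018-02-08,20.50,20.5,20.25,20.50,1688900
"""

# The header row supplies the column names, so names= is not passed
dfdaily = pd.read_csv(io.StringIO(csv_text), index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis("date")

print(dfdaily.dtypes)
print(dfdaily.index)
```

Because the header is consumed as column names, the price columns are parsed as numbers and no to_numeric pass is needed.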

Pulling specific value from Pandas DataReader dataframe?

Here is the code I am running:
def competitor_stock_data_report():
    import datetime
    import pandas_datareader.data as web
    date_time = datetime.datetime.now()
    date = date_time.date()
    stocklist = ['LAZ','AMG','BEN','LM','EVR','GHL','HLI','MC','PJT','MS','GS','JPM','AB']
    start = datetime.datetime(date.year-1, date.month, date.day-1)
    end = datetime.datetime(date.year, date.month, date.day-1)
    for x in stocklist:
        df = web.DataReader(x, 'google', start, end)
        print(df)
        print(df.loc[df['Date'] == start]['Close'].values)
The problem is in the last line. How do I pull the 'Close' value for the specified date?
Open High Low Close Volume
Date
2016-08-02 35.22 35.25 33.66 33.75 861111
2016-08-03 33.57 34.72 33.42 34.25 921401
2016-08-04 33.89 34.22 33.77 34.07 587016
2016-08-05 34.55 34.94 34.31 34.35 463317
2016-08-08 34.54 34.75 34.31 34.74 958230
2016-08-09 34.68 35.12 34.64 34.87 732959
I would like to get 33.75 for example, but the date is dynamically changing..
Any suggestions?
IMO the easiest way to get a column's value in the first row:
In [40]: df
Out[40]:
Open High Low Close Volume
Date
2016-08-03 767.18 773.21 766.82 773.18 1287421
2016-08-04 772.22 774.07 768.80 771.61 1140254
2016-08-05 773.78 783.04 772.34 782.22 1801205
2016-08-08 782.00 782.63 778.09 781.76 1107857
2016-08-09 781.10 788.94 780.57 784.26 1318894
... ... ... ... ... ...
2017-07-27 951.78 951.78 920.00 934.09 3212996
2017-07-28 929.40 943.83 927.50 941.53 1846351
2017-07-31 941.89 943.59 926.04 930.50 1970095
2017-08-01 932.38 937.45 929.26 930.83 1277734
2017-08-02 928.61 932.60 916.68 930.39 1824448
[252 rows x 5 columns]
In [41]: df.iat[0, df.columns.get_loc('Close')]
Out[41]: 773.17999999999995
Last row:
In [42]: df.iat[-1, df.columns.get_loc('Close')]
Out[42]: 930.38999999999999
Recommended
df.at[df.index[-1], 'Close']
df.iat[-1, df.columns.get_loc('Close')]
df.loc[df.index[-1], 'Close']
df.iloc[-1, df.columns.get_loc('Close')]
Not intended as public API (get_value was deprecated in pandas 0.21 and removed in 1.0, so this works only on older versions)
df.get_value(df.index[-1], 'Close')
df.get_value(-1, df.columns.get_loc('Close'), takeable=True)
Not recommended, chained indexing
There could be more, but do I really need to add them
df.iloc[-1].at['Close']
df.loc[:, 'Close'].iat[-1]
All yield
34.869999999999997
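Since the question asks for the value at a specific, dynamically computed date rather than at a position, here is a hedged sketch of a label-based lookup. It rebuilds the six rows printed in the question by hand; the Series.asof fallback for dates that are not trading days is an addition assumed here, not part of the answers above:

```python
import pandas as pd

# Close values copied from the sample printed in the question
dates = pd.to_datetime([
    "2016-08-02", "2016-08-03", "2016-08-04",
    "2016-08-05", "2016-08-08", "2016-08-09",
])
df = pd.DataFrame(
    {"Close": [33.75, 34.25, 34.07, 34.35, 34.74, 34.87]},
    index=pd.Index(dates, name="Date"),
)

# Date lives in the index, so look it up by label rather than df['Date']
start = pd.Timestamp("2016-08-02")
close_on_start = df.loc[start, "Close"]

# A weekend date is not in the index; asof returns the last value on or before it
close_before_weekend_date = df["Close"].asof(pd.Timestamp("2016-08-07"))

print(close_on_start, close_before_weekend_date)
```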

How do you convert time series data with pandas and show graph via highstock

I have some time series data (financial stock trading data):
TIMESTAMP PRICE VOLUME
1294311545 24990 1500000000
1294317813 25499 5000000000
1294318449 25499 100000000
I need to convert them to OHLC values (a JSON list) based on the price column, i.e. (open, high, low, close), and show the result as an OHLC graph with the Highstock JS framework.
The output should be as follows:
[{'time':'2013-09-01','open':24999,'high':25499,'low':24999,'close':25000,'volume':15000000},
{'time':'2013-09-02','open':24900,'high':25600,'low':24800,'close':25010,'volume':16000000},
{...}]
For example, my sample has 10 data points for the day 2013-09-01, so the output will have one object for that day: high is the highest price of all 10 points, low is the lowest price, open is the first price of the day, close is the last price of the day, and volume is the TOTAL volume of all 10 points.
I know the Python library pandas could probably do that, but I have not managed to work it out.
Updated: As suggested, I use resample() as:
df['VOLUME'].resample('H', how='sum')
df['PRICE'].resample('H', how='ohlc')
But how to merge the result?
At the moment you can only perform ohlc on a column/Series (will be fixed in 0.13).
First, coerce the TIMESTAMP column to pandas Timestamps:
In [11]: df.TIMESTAMP = pd.to_datetime(df.TIMESTAMP, unit='s')
In [12]: df.set_index('TIMESTAMP', inplace=True)
In [13]: df
Out[13]:
PRICE VOLUME
TIMESTAMP
2011-01-06 10:59:05 24990 1500000000
2011-01-06 12:43:33 25499 5000000000
2011-01-06 12:54:09 25499 100000000
Then resample via ohlc (here I've resampled by hour):
In [14]: df['VOLUME'].resample('H', how='ohlc')
Out[14]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 1500000000 1500000000 1500000000 1500000000
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 5000000000 5000000000 100000000 100000000
In [15]: df['PRICE'].resample('H', how='ohlc')
Out[15]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 24990 24990 24990 24990
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 25499 25499 25499 25499
You can apply to_json to any DataFrame:
In [16]: df['PRICE'].resample('H', how='ohlc').to_json()
Out[16]: '{"open":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"high":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"low":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"close":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0}}'
*This would probably be a straightforward enhancement for a DataFrame; at the moment it's NotImplemented.
Updated: your desired output (or at least something very close to it) can be achieved as follows:
In [21]: price = df['PRICE'].resample('D', how='ohlc').reset_index()
In [22]: price
Out[22]:
TIMESTAMP open high low close
0 2011-01-06 00:00:00 24990 25499 24990 25499
Use the records orientation and the iso date_format:
In [23]: price.to_json(date_format='iso', orient='records')
Out[23]: '[{"TIMESTAMP":"2011-01-06T00:00:00.000Z","open":24990,"high":25499,"low":24990,"close":25499}]'
In [24]: price.to_json('foo.json', date_format='iso', orient='records') # save as json file
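The session above uses the how= argument, which later pandas versions no longer accept. A hedged sketch of the same idea with the method-style API, which also addresses the "how to merge" question by attaching the summed volume as an extra column, using the three sample rows from the question:

```python
import pandas as pd

# The three sample ticks from the question
df = pd.DataFrame({
    "TIMESTAMP": [1294311545, 1294317813, 1294318449],
    "PRICE": [24990, 25499, 25499],
    "VOLUME": [1500000000, 5000000000, 100000000],
})
df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"], unit="s")
df = df.set_index("TIMESTAMP")

# Daily OHLC of the price, plus total daily volume on the same index
daily = df["PRICE"].resample("D").ohlc()
daily["volume"] = df["VOLUME"].resample("D").sum()

# One JSON object per day, with ISO-formatted dates
records = daily.reset_index().to_json(orient="records", date_format="iso")
print(records)
```

Both resampled pieces share the same daily index, so plain column assignment is enough to combine them.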
